BlinkML: Efficient maximum likelihood estimation with probabilistic guarantees

Yongjoo Park, Jingyi Qing, Xiaoyang Shen, Barzan Mozafari

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution

Abstract

The rising volume of datasets has made training machine learning (ML) models a major computational cost in the enterprise. Given the iterative nature of model and parameter tuning, many analysts use a small sample of their entire data during their initial stage of analysis to make quick decisions (e.g., what features or hyperparameters to use) and use the entire dataset only in later stages (i.e., when they have converged to a specific model). This sampling, however, is performed in an ad-hoc fashion. Most practitioners cannot precisely capture the effect of sampling on the quality of their model, and eventually on their decision-making process during the tuning phase. Moreover, without systematic support for sampling operators, many optimizations and reuse opportunities are lost. In this paper, we introduce BlinkML, a system for fast, quality-guaranteed ML training. BlinkML allows users to make error-computation tradeoffs: instead of training a model on their full data (i.e., full model), BlinkML can quickly train an approximate model with quality guarantees using a sample. The quality guarantees ensure that, with high probability, the approximate model makes the same predictions as the full model. BlinkML currently supports any ML model that relies on maximum likelihood estimation (MLE), which includes Generalized Linear Models (e.g., linear regression, logistic regression, max entropy classifier, Poisson regression) as well as PPCA (Probabilistic Principal Component Analysis). Our experiments show that BlinkML can speed up the training of large-scale ML tasks by 6.26×-629× while guaranteeing the same predictions, with 95% probability, as the full model.
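To make the guarantee in the abstract concrete, the following is a minimal sketch of the quantity it concerns: the fraction of inputs on which a model trained on a sample makes the same prediction as a model trained on all the data. It is not BlinkML's algorithm. The sketch measures agreement by actually training the full model, which is exactly the cost the system is meant to avoid. The synthetic data, the plain gradient-ascent fit, and every function name (`fit_logistic`, `prediction_agreement`, `make_data`) are assumptions made for illustration only.

```python
import numpy as np


def fit_logistic(X, y, steps=500, lr=0.5):
    """Fit logistic regression by maximum likelihood using plain gradient ascent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))   # predicted probabilities
        w += lr * (X.T @ (y - p)) / len(y)   # averaged log-likelihood gradient
    return w


def predict(X, w):
    """Predict class 1 when the linear score is positive."""
    return (X @ w > 0).astype(int)


def prediction_agreement(X, y, sample_size, seed=0):
    """Fraction of rows on which the sample-trained and full models agree."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(y), size=sample_size, replace=False)  # uniform sample
    w_full = fit_logistic(X, y)               # "full model" (expensive step)
    w_approx = fit_logistic(X[idx], y[idx])   # "approximate model"
    return float(np.mean(predict(X, w_full) == predict(X, w_approx)))


def make_data(n=20000, d=5, seed=1):
    """Synthetic binary labels drawn from a known logistic model (hypothetical)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    w_true = np.arange(1, d + 1, dtype=float)
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ w_true)))).astype(float)
    return X, y


if __name__ == "__main__":
    X, y = make_data()
    for m in (200, 2000, 20000):
        print(m, prediction_agreement(X, y, m))
```

Running the script prints the agreement for a few sample sizes, which shows the error-computation tradeoff the abstract describes: a larger sample costs more to train on and typically agrees with the full model more often.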

Original language: English (US)
Title of host publication: SIGMOD 2019 - Proceedings of the 2019 International Conference on Management of Data
Publisher: Association for Computing Machinery
Pages: 1135-1152
Number of pages: 18
ISBN (Electronic): 9781450356435
State: Published - Jun 25 2019
Externally published: Yes
Event: 2019 International Conference on Management of Data, SIGMOD 2019 - Amsterdam, Netherlands
Duration: Jun 30 2019 to Jul 5 2019

Publication series

Name: Proceedings of the ACM SIGMOD International Conference on Management of Data
ISSN (Print): 0730-8078

Conference

Conference: 2019 International Conference on Management of Data, SIGMOD 2019
Country/Territory: Netherlands
City: Amsterdam
Period: 6/30/19 to 7/5/19

ASJC Scopus subject areas

  • Software
  • Information Systems
