Generalized linear models for aggregated data

Avradeep Bhowmik, Joydeep Ghosh, Oluwasanmi Koyejo

Research output: Contribution to journal (Conference article)

Abstract

Databases in domains such as healthcare are routinely released to the public in aggregated form. Unfortunately, naive modeling with aggregated data may significantly diminish the accuracy of inferences at the individual level. This paper addresses the scenario where features are provided at the individual level, but the target variables are only available as histogram aggregates or order statistics. We consider a limiting case of generalized linear modeling when the target variables are only known up to permutation, and explore how this relates to permutation testing, a standard technique for assessing statistical dependency. Based on this relationship, we propose a simple algorithm to estimate the model parameters and individual-level inferences via alternating imputation and standard generalized linear model fitting. Our results suggest the effectiveness of the proposed approach when, in the original data, permutation testing accurately ascertains the veracity of the linear relationship. The framework is extended to general histogram data with larger bins, with order statistics such as the median as a limiting case. Our experimental results on simulated data and aggregated healthcare data suggest a diminishing returns property with respect to the granularity of the histogram: when a linear relationship holds in the original data, the targets can be predicted accurately given relatively coarse histograms.
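The alternating scheme described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the simplest special case (Gaussian response with identity link, so that the fitting step is ordinary least squares), and the function name `fit_permuted_targets`, the initialization, and the stopping rule are hypothetical choices made for illustration. The imputation step pairs sorted targets with sorted predictions, which minimizes squared error over all assignments for a fixed parameter vector.

```python
import numpy as np

def fit_permuted_targets(X, y_bag, n_iter=50):
    """Alternate between least-squares fitting and re-imputing the target order.

    X     : (n, d) individual-level features
    y_bag : (n,) target values whose assignment to rows of X is unknown
    Returns (w, y_imputed, losses). Illustrative sketch only.
    """
    y_bag = np.asarray(y_bag, dtype=float)
    y_sorted = np.sort(y_bag)
    y_imp = y_bag.copy()  # assumption: start from the arbitrary given order
    losses = []
    for _ in range(n_iter):
        # Fitting step: ordinary least squares on the currently imputed targets.
        w, *_ = np.linalg.lstsq(X, y_imp, rcond=None)
        pred = X @ w
        losses.append(float(np.sum((pred - y_imp) ** 2)))
        # Imputation step: give the k-th smallest target to the row with the
        # k-th smallest prediction (optimal assignment for squared error).
        order = np.argsort(pred, kind="stable")
        y_new = np.empty_like(y_sorted)
        y_new[order] = y_sorted
        if np.array_equal(y_new, y_imp):
            break  # the assignment is a fixed point; further steps change nothing
        y_imp = y_new
    return w, y_imp, losses

# Usage on simulated data (all values below are made up for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=200)
y_bag = rng.permutation(y)  # targets known only up to permutation
w_hat, y_hat, losses = fit_permuted_targets(X, y_bag)
```

Each step can only lower the squared error, so the recorded losses are non-increasing, but the procedure may stop at a local optimum that depends on the starting assignment. Restarting from several random assignments is one plausible mitigation; whether the paper does this is not stated in the abstract.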

Original language: English (US)
Pages (from-to): 93-101
Number of pages: 9
Journal: Journal of Machine Learning Research
Volume: 38
State: Published - Jan 1 2015
Externally published: Yes
Event: 18th International Conference on Artificial Intelligence and Statistics, AISTATS 2015 - San Diego, United States
Duration: May 9 2015 to May 12 2015

Fingerprint

Generalized Linear Model
Statistics
Histogram
Testing
Bins
Permutation
Order Statistics
Healthcare
Target
Limiting
Model Fitting
Diminishing
Imputation
Granularity
Modeling
Scenarios
Experimental Results
Estimate
Relationships

ASJC Scopus subject areas

  • Software
  • Control and Systems Engineering
  • Statistics and Probability
  • Artificial Intelligence

Cite this

Generalized linear models for aggregated data. / Bhowmik, Avradeep; Ghosh, Joydeep; Koyejo, Oluwasanmi.

In: Journal of Machine Learning Research, Vol. 38, 01.01.2015, p. 93-101.

@article{b9ff1e61345a44c5951b2eef86507de7,
title = "Generalized linear models for aggregated data",
author = "Avradeep Bhowmik and Joydeep Ghosh and Oluwasanmi Koyejo",
year = "2015",
month = "1",
day = "1",
language = "English (US)",
volume = "38",
pages = "93--101",
journal = "Journal of Machine Learning Research",
issn = "1532-4435",
publisher = "Microtome Publishing",

}

TY - JOUR
T1 - Generalized linear models for aggregated data
AU - Bhowmik, Avradeep
AU - Ghosh, Joydeep
AU - Koyejo, Oluwasanmi
PY - 2015/1/1
Y1 - 2015/1/1
UR - http://www.scopus.com/inward/record.url?scp=84954313053&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84954313053&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:84954313053
VL - 38
SP - 93
EP - 101
JO - Journal of Machine Learning Research
JF - Journal of Machine Learning Research
SN - 1532-4435
ER -