TY - GEN

T1 - Nonparametric Bayesian classification with massive datasets

T2 - 5th Statistical Problems in Particle Physics, Astrophysics and Cosmology Conference, PHYSTAT 2005

AU - Gray, Alexander

AU - Richards, Gordon

AU - Nichol, Robert

AU - Brunner, Robert

AU - Moore, Andrew

PY - 2006

Y1 - 2006

AB - The kernel discriminant (a nonparametric Bayesian classifier) is appropriate for many scientific tasks because it is highly accurate (it approaches Bayes optimality as you get more data), distribution-free (it works for arbitrary data distributions), and easy to work with: prior domain knowledge can be injected into it and its behavior can be interpreted. Unfortunately, like other highly accurate classifiers, it is computationally infeasible for massive datasets. We present a fast algorithm for performing classification with the kernel discriminant exactly (i.e., without introducing any approximation error). We demonstrate its use for quasar discovery, a problem central to cosmology and astrophysics, tractably using 500K training data and 800K testing data from the Sloan Digital Sky Survey. The resulting catalog of 100K quasars significantly exceeds existing quasar catalogs in both size and quality, opening a number of new scientific possibilities, including the recent empirical confirmation of cosmic magnification, which has received wide attention.

UR - http://www.scopus.com/inward/record.url?scp=84894153365&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84894153365&partnerID=8YFLogxK

U2 - 10.1142/9781860948985_0031

DO - 10.1142/9781860948985_0031

M3 - Conference contribution

AN - SCOPUS:84894153365

SN - 1860946496

SN - 9781860946493

T3 - Statistical Problems in Particle Physics, Astrophysics and Cosmology - Proceedings of PHYSTAT 2005

SP - 147

EP - 150

BT - Statistical Problems in Particle Physics, Astrophysics and Cosmology - Proceedings of PHYSTAT 2005

PB - Imperial College Press

Y2 - 12 September 2005 through 15 September 2005

ER -