TY - JOUR
T1 - Designing and mining multi-terabyte astronomy archives
T2 - The Sloan Digital Sky Survey
AU - Szalay, Alexander S.
AU - Kunszt, Peter Z.
AU - Thakar, Ani
AU - Gray, Jim
AU - Slutz, Don
AU - Brunner, Robert J.
PY - 2000/6
Y1 - 2000/6
N2 - The next-generation astronomy digital archives will cover most of the sky at fine resolution in many wavelengths, from X-rays, through ultraviolet, optical, and infrared. The archives will be stored at diverse geographical locations. One of the first of these projects, the Sloan Digital Sky Survey (SDSS), is creating a 5-wavelength catalog over 10,000 square degrees of the sky (see http://www.sdss.org/). The 200 million objects in the multi-terabyte database will have mostly numerical attributes in a 100+ dimensional space. Points in this space have highly correlated distributions. The archive will enable astronomers to explore the data interactively. Data access will be aided by multidimensional spatial and attribute indices. The data will be partitioned in many ways. Small tag objects consisting of the most popular attributes will accelerate frequent searches. Splitting the data among multiple servers will allow parallel, scalable I/O and parallel data analysis. Hashing techniques will allow efficient clustering and pair-wise comparison algorithms that should parallelize nicely. Randomly sampled subsets will allow debugging otherwise large queries at the desktop. Central servers will operate a data pump to support sweep searches touching most of the data. The anticipated queries will require special operators related to angular distances and complex similarity tests of object properties, like shapes, colors, velocity vectors, or temporal behaviors. These issues pose interesting data management challenges.
AB - The next-generation astronomy digital archives will cover most of the sky at fine resolution in many wavelengths, from X-rays, through ultraviolet, optical, and infrared. The archives will be stored at diverse geographical locations. One of the first of these projects, the Sloan Digital Sky Survey (SDSS), is creating a 5-wavelength catalog over 10,000 square degrees of the sky (see http://www.sdss.org/). The 200 million objects in the multi-terabyte database will have mostly numerical attributes in a 100+ dimensional space. Points in this space have highly correlated distributions. The archive will enable astronomers to explore the data interactively. Data access will be aided by multidimensional spatial and attribute indices. The data will be partitioned in many ways. Small tag objects consisting of the most popular attributes will accelerate frequent searches. Splitting the data among multiple servers will allow parallel, scalable I/O and parallel data analysis. Hashing techniques will allow efficient clustering and pair-wise comparison algorithms that should parallelize nicely. Randomly sampled subsets will allow debugging otherwise large queries at the desktop. Central servers will operate a data pump to support sweep searches touching most of the data. The anticipated queries will require special operators related to angular distances and complex similarity tests of object properties, like shapes, colors, velocity vectors, or temporal behaviors. These issues pose interesting data management challenges.
KW - Archive
KW - Astronomy
KW - Data analysis
KW - Data mining
KW - Database
KW - Internet
KW - Scaleable
UR - http://www.scopus.com/inward/record.url?scp=0003873676&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0003873676&partnerID=8YFLogxK
U2 - 10.1145/335191.335439
DO - 10.1145/335191.335439
M3 - Review article
AN - SCOPUS:0003873676
SN - 0163-5808
VL - 29
SP - 451
EP - 462
JO - SIGMOD Record (ACM Special Interest Group on Management of Data)
JF - SIGMOD Record (ACM Special Interest Group on Management of Data)
IS - 2
ER -