TY - JOUR
T1 - Structured databases on the web
T2 - Observations and implications
AU - Chang, Kevin Chen Chuan
AU - He, Bin
AU - Li, Chengkai
AU - Patel, Mitesh
AU - Zhang, Zhen
PY - 2004/9
Y1 - 2004/9
N2 - The Web has been rapidly "deepened" by the prevalence of databases online. With the potentially unlimited information hidden behind their query interfaces, this "deep Web" of searchable databases is clearly an important frontier for data access. This paper surveys this relatively unexplored frontier, measuring characteristics pertinent to both exploring and integrating structured Web sources. On one hand, our "macro" study surveys the deep Web at large, in April 2004, adopting the random IP-sampling approach, with one million samples. (How large is the deep Web? How is it covered by current directory services?) On the other hand, our "micro" study surveys source-specific characteristics over 441 sources in eight representative domains, in December 2002. (How "hidden" are deep-Web sources? How do search engines cover their data? How complex and expressive are query forms?) We report our observations and publish the resulting datasets to the research community. We conclude with several implications (of our own) which, while necessarily subjective, might help shape research directions and solutions.
AB - The Web has been rapidly "deepened" by the prevalence of databases online. With the potentially unlimited information hidden behind their query interfaces, this "deep Web" of searchable databases is clearly an important frontier for data access. This paper surveys this relatively unexplored frontier, measuring characteristics pertinent to both exploring and integrating structured Web sources. On one hand, our "macro" study surveys the deep Web at large, in April 2004, adopting the random IP-sampling approach, with one million samples. (How large is the deep Web? How is it covered by current directory services?) On the other hand, our "micro" study surveys source-specific characteristics over 441 sources in eight representative domains, in December 2002. (How "hidden" are deep-Web sources? How do search engines cover their data? How complex and expressive are query forms?) We report our observations and publish the resulting datasets to the research community. We conclude with several implications (of our own) which, while necessarily subjective, might help shape research directions and solutions.
UR - http://www.scopus.com/inward/record.url?scp=5444262639&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=5444262639&partnerID=8YFLogxK
U2 - 10.1145/1031570.1031584
DO - 10.1145/1031570.1031584
M3 - Review article
AN - SCOPUS:5444262639
SN - 0163-5808
VL - 33
SP - 61
EP - 70
JO - SIGMOD Record
JF - SIGMOD Record
IS - 3
ER -