TY - GEN
T1 - Learning to query
T2 - 32nd IEEE International Conference on Data Engineering, ICDE 2016
AU - Fang, Yuan
AU - Zheng, Vincent W.
AU - Chang, Kevin Chen Chuan
N1 - Publisher Copyright:
© 2016 IEEE.
PY - 2016/6/22
Y1 - 2016/6/22
N2 - As the Web hosts rich information about real-world entities, our information quests become increasingly entity centric. In this paper, we study the problem of focused harvesting of Web pages for entity aspects, to support downstream applications such as business analytics and building a vertical portal. Given that search engines are the de facto gateways to assess information on the Web, we recognize the essence of our problem as Learning to Query (L2Q)-to intelligently select queries so that we can harvest pages, via a search engine, focused on an entity aspect of interest. Thus, it is crucial to quantify the utilities of the candidate queries w.r.t. some entity aspect. In order to better estimate the utilities, we identify two opportunities and address their challenges. First, a target entity in a given domain has many peers. We leverage these peer entities to become domain aware. Second, a candidate query may overlap with the past queries that have already been fired. We account for these past queries to become context aware. Empirical results show that our approach significantly outperforms both algorithmic and manual baselines by 16% and 10% in F-scores, respectively.
AB - As the Web hosts rich information about real-world entities, our information quests become increasingly entity centric. In this paper, we study the problem of focused harvesting of Web pages for entity aspects, to support downstream applications such as business analytics and building a vertical portal. Given that search engines are the de facto gateways to assess information on the Web, we recognize the essence of our problem as Learning to Query (L2Q)-to intelligently select queries so that we can harvest pages, via a search engine, focused on an entity aspect of interest. Thus, it is crucial to quantify the utilities of the candidate queries w.r.t. some entity aspect. In order to better estimate the utilities, we identify two opportunities and address their challenges. First, a target entity in a given domain has many peers. We leverage these peer entities to become domain aware. Second, a candidate query may overlap with the past queries that have already been fired. We account for these past queries to become context aware. Empirical results show that our approach significantly outperforms both algorithmic and manual baselines by 16% and 10% in F-scores, respectively.
UR - http://www.scopus.com/inward/record.url?scp=84980329499&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84980329499&partnerID=8YFLogxK
U2 - 10.1109/ICDE.2016.7498308
DO - 10.1109/ICDE.2016.7498308
M3 - Conference contribution
AN - SCOPUS:84980329499
T3 - 2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016
SP - 1002
EP - 1013
BT - 2016 IEEE 32nd International Conference on Data Engineering, ICDE 2016
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 16 May 2016 through 20 May 2016
ER -