TY - GEN
T1 - Searching patterns for relation extraction over the Web
T2 - 4th ACM International Conference on Web Search and Data Mining, WSDM 2011
AU - Fang, Yuan
AU - Chang, Kevin Chen Chuan
PY - 2011
Y1 - 2011
N2 - While tuple extraction for a given relation has been an active research area, its dual problem of pattern search (to find and rank patterns in a principled way) has not been studied explicitly. In this paper, we propose and address the problem of pattern search, in addition to tuple extraction. As our objectives, we stress reusability for pattern search and scalability of tuple extraction, such that our approach can be applied to very large corpora like the Web. As the key foundation, we propose a conceptual model PRDualRank to capture the notion of precision and recall for both tuples and patterns in a principled way, leading to the "rediscovery" of the Pattern-Relation Duality: the formal quantification of the reinforcement between patterns and tuples with the metrics of precision and recall. We also develop a concrete framework for PRDualRank, guided by the principles of a perfect sampling process over a complete corpus. Finally, we evaluated our framework over the real Web. Experiments show that on all three target relations our principled approach greatly outperforms the previous state-of-the-art system in both effectiveness and efficiency. In particular, we improved optimal F-score by up to 64%.
AB - While tuple extraction for a given relation has been an active research area, its dual problem of pattern search (to find and rank patterns in a principled way) has not been studied explicitly. In this paper, we propose and address the problem of pattern search, in addition to tuple extraction. As our objectives, we stress reusability for pattern search and scalability of tuple extraction, such that our approach can be applied to very large corpora like the Web. As the key foundation, we propose a conceptual model PRDualRank to capture the notion of precision and recall for both tuples and patterns in a principled way, leading to the "rediscovery" of the Pattern-Relation Duality: the formal quantification of the reinforcement between patterns and tuples with the metrics of precision and recall. We also develop a concrete framework for PRDualRank, guided by the principles of a perfect sampling process over a complete corpus. Finally, we evaluated our framework over the real Web. Experiments show that on all three target relations our principled approach greatly outperforms the previous state-of-the-art system in both effectiveness and efficiency. In particular, we improved optimal F-score by up to 64%.
KW - Algorithms
KW - Design
KW - Experimentation
KW - Performance
UR - http://www.scopus.com/inward/record.url?scp=79952390248&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79952390248&partnerID=8YFLogxK
U2 - 10.1145/1935826.1935933
DO - 10.1145/1935826.1935933
M3 - Conference contribution
AN - SCOPUS:79952390248
SN - 9781450304931
T3 - Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011
SP - 825
EP - 834
BT - Proceedings of the 4th ACM International Conference on Web Search and Data Mining, WSDM 2011
Y2 - 9 February 2011 through 12 February 2011
ER -