An iterative link-based method for parallel web page mining

Le Liu, Yu Hong, Jun Lu, Jun Lang, Heng Ji, Jianmin Yao

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Identifying parallel web pages within bilingual web sites is a crucial step in constructing bilingual resources for cross-lingual information processing. In this paper, we propose a link-based approach to identifying parallel web pages in bilingual web sites. Whereas existing methods employ only internal translation similarity (such as content-based similarity and page structural similarity), we hypothesize that external translation similarity is also an effective feature for identifying parallel web pages. Within a bilingual web site, web pages are interconnected by hyperlinks. The basic idea of our method is that the translation similarity of two pages can be inferred from that of their neighbor pages, which serves as an important source of external similarity. The translation similarities of page pairs therefore influence one another. We develop an iterative algorithm that estimates the external translation similarity and the final translation similarity, combining both internal and external similarity measures. Experiments on six bilingual websites demonstrate that our method is effective and obtains a significant improvement (6.2% F-score) over a baseline that uses only internal translation similarity.
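The abstract does not give the update rule, so the following is only a minimal sketch of the general idea (a pair's score is repeatedly re-estimated from the scores of its linked neighbor pairs and blended with its internal similarity). It is not the authors' algorithm: the function name `iterate_similarity`, the mixing weight `alpha`, the fixed iteration count, the best-match-then-average definition of external similarity, and the toy page names and scores are all assumptions made for illustration.

```python
def iterate_similarity(internal, links_src, links_tgt, alpha=0.5, iters=10):
    """Hypothetical iterative blend of internal and external similarity.

    internal:  dict mapping (source_page, target_page) -> score in [0, 1]
    links_src: dict mapping a source-language page -> set of linked pages
    links_tgt: dict mapping a target-language page -> set of linked pages
    alpha:     assumed weight given to the external (neighbor-based) score
    """
    sim = dict(internal)  # start from the internal similarity alone
    for _ in range(iters):
        updated = {}
        for (s, t), base in internal.items():
            ns = links_src.get(s, set())
            nt = links_tgt.get(t, set())
            if ns and nt:
                # Assumed external score: for each neighbor of s, take its
                # best-matching neighbor of t, then average over s's neighbors.
                external = sum(
                    max(sim.get((a, b), 0.0) for b in nt) for a in ns
                ) / len(ns)
            else:
                external = 0.0  # no link evidence for this pair
            # Convex combination keeps every score inside [0, 1].
            updated[(s, t)] = (1 - alpha) * base + alpha * external
        sim = updated
    return sim


# Toy site (invented data): en_a looks equally similar to zh_a and zh_b
# internally, but only zh_a is linked to the translated home page.
internal = {
    ("en_home", "zh_home"): 0.9,
    ("en_a", "zh_a"): 0.5,
    ("en_a", "zh_b"): 0.5,
}
links_src = {"en_home": {"en_a"}, "en_a": {"en_home"}}
links_tgt = {"zh_home": {"zh_a"}, "zh_a": {"zh_home"}, "zh_b": set()}

scores = iterate_similarity(internal, links_src, links_tgt)
```

In this toy setup the link evidence breaks the tie between the two candidates for `en_a`, which is the kind of effect the abstract attributes to external similarity. A real system would also need a convergence test and a decision threshold, neither of which is described in the abstract.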

Original language: English (US)
Title of host publication: EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
Publisher: Association for Computational Linguistics (ACL)
Pages: 1216-1224
Number of pages: 9
ISBN (Electronic): 9781937284961
State: Published - Jan 1 2014
Externally published: Yes
Event: 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014 - Doha, Qatar
Duration: Oct 25 2014 to Oct 29 2014

Publication series

Name: EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference

Conference

Conference: 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014
Country: Qatar
City: Doha
Period: 10/25/14 to 10/29/14

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Vision and Pattern Recognition
  • Information Systems

Cite this

Liu, L., Hong, Y., Lu, J., Lang, J., Ji, H., & Yao, J. (2014). An iterative link-based method for parallel web page mining. In EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference (pp. 1216-1224). (EMNLP 2014 - 2014 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference). Association for Computational Linguistics (ACL).