Semantic frame-based document representation for comparable corpora

Hyungsul Kim, Xiang Ren, Yizhou Sun, Chi Wang, Jiawei Han

Research output: Contribution to journal, Conference article

Abstract

Document representation is a fundamental problem in text mining. Many efforts have been made to generate concise yet semantic representations, such as bag-of-words, phrase-, sentence- and topic-level descriptions. Nevertheless, most existing techniques encounter difficulties in handling a monolingual comparable corpus, which is a collection of monolingual documents conveying the same topic. In this paper, we propose the use of the frame, a high-level semantic unit, and construct frame-based representations that semantically describe documents as bags of frames, using an information network approach. One major challenge in this representation is that semantically similar frames may take different forms. For example, "radiation leaked" in one news article can appear as "the level of radiation increased" in another article. To tackle this problem, a text-based information network is constructed among frames and words, and a link-based similarity measure called SynRank is proposed to calculate the similarity between frames. As a result, different variations of semantically similar frames are merged into a single descriptive frame using clustering, and a document can then be represented as a bag of representative frames. It turns out that frame-based document representation is not only more interpretable, but can also effectively facilitate other text analysis tasks such as event tracking. We conduct both qualitative and quantitative experiments on three comparable news corpora to study the effectiveness of frame-based document representation and the similarity measure SynRank, respectively, and demonstrate the superior performance of frame-based document representation on different real-world applications.
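
The pipeline described in the abstract (a frame-word network, a link-based frame similarity, merging of similar frames, and a bag-of-frames document representation) can be illustrated with a minimal Python sketch. The abstract does not define SynRank, so the similarity below is a generic SimRank-style recursion on a bipartite frame-word graph used purely as a stand-in. The toy frames, the word sets, the `decay` and `threshold` parameters, the threshold-based merging, and the shortest-string choice of representative frame are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each frame is linked to the (lemmatized) words it contains.
FRAME_WORDS = {
    "radiation leaked": {"radiation", "leak"},
    "level of radiation increased": {"radiation", "level", "increase"},
    "government evacuated residents": {"government", "evacuate", "resident"},
    "residents were evacuated": {"resident", "evacuate"},
}


def _update(nodes, neighbors, other_sim, decay):
    """One SimRank-style update: two nodes are similar if their neighbors are similar."""
    new = {}
    for a in nodes:
        for b in nodes:
            if a == b:
                new[(a, b)] = 1.0
            else:
                na, nb = neighbors[a], neighbors[b]
                total = sum(other_sim[(x, y)] for x in na for y in nb)
                new[(a, b)] = decay * total / (len(na) * len(nb))
    return new


def bipartite_simrank(frame_words, decay=0.8, iterations=5):
    """Frame-frame similarity on a frame-word graph (a stand-in, NOT SynRank)."""
    frames = sorted(frame_words)
    words = sorted(set().union(*frame_words.values()))
    word_frames = {w: {f for f in frames if w in frame_words[f]} for w in words}
    f_sim = {(a, b): float(a == b) for a in frames for b in frames}
    w_sim = {(a, b): float(a == b) for a in words for b in words}
    for _ in range(iterations):
        # Both updates read the previous iteration's scores.
        new_f = _update(frames, frame_words, w_sim, decay)
        new_w = _update(words, word_frames, f_sim, decay)
        f_sim, w_sim = new_f, new_w
    return f_sim


def merge_frames(frame_words, sim, threshold=0.1):
    """Merge frames whose similarity reaches the threshold (union-find);
    map each frame to a representative (here: the shortest member, an assumption)."""
    parent = {f: f for f in frame_words}

    def find(f):
        while parent[f] != f:
            parent[f] = parent[parent[f]]
            f = parent[f]
        return f

    for a, b in combinations(sorted(frame_words), 2):
        if sim[(a, b)] >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for f in frame_words:
        clusters.setdefault(find(f), []).append(f)

    representative = {}
    for members in clusters.values():
        label = min(members, key=lambda m: (len(m), m))
        for m in members:
            representative[m] = label
    return representative


def bag_of_frames(doc_frames, representative):
    """Represent a document as a bag (multiset) of representative frames."""
    return Counter(representative[f] for f in doc_frames)
```

With this toy data, the two radiation frames share the word "radiation" and the two evacuation frames share "evacuate" and "resident", so each pair is merged, while frames from different groups have no connecting path in the graph and are kept apart. A document containing both radiation variants would then count the same representative frame twice.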

Original language: English (US)
Article number: 6729519
Pages (from-to): 350-359
Number of pages: 10
Journal: Proceedings - IEEE International Conference on Data Mining, ICDM
State: Published - Dec 1 2013
Event: 13th IEEE International Conference on Data Mining, ICDM 2013 - Dallas, TX, United States
Duration: Dec 7 2013 - Dec 10 2013

Keywords

  • Clustering
  • Document Representation
  • Graph Similarity

ASJC Scopus subject areas

  • Engineering(all)
