Word Embedding-Based Text Complexity Analysis

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Text complexity metrics play a crucial role in quantifying the readability of important documents, which in turn supports public safety, educational outcomes, and more. Pointwise mutual information (PMI) has been widely used to measure text complexity: it captures statistical co-occurrence patterns between word pairs and treats them as a proxy for semantic significance. We observe, however, that word embeddings resemble PMI in that both are derived from co-occurrence in large corpora, while embeddings are faster to compute and provide more generalizable measures of semantic proximity. We therefore propose a novel text complexity metric that uses word embeddings to measure the semantic distance between words in a document. We validate the approach empirically on the OneStopEnglish dataset, which contains news articles annotated with expert-labeled readability scores. Our experiments show that the proposed word embedding-based metric correlates more strongly with ground-truth readability levels than conventional PMI-based metrics. This study lays the groundwork for future research that incorporates context-dependent embeddings and extends the approach to other text types.
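
The abstract does not give the formula for either metric, so the following is only a rough illustration of the two quantities it contrasts, not the authors' method. It assumes sentence-level co-occurrence for PMI and an unweighted mean of pairwise cosine distances for the embedding score; the function names (`pmi_table`, `mean_embedding_distance`) and the toy vectors are hypothetical, and a real experiment would load pretrained embeddings instead.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(sentences):
    """PMI (log base 2) for word pairs that co-occur in the same sentence.

    Assumption: probabilities are estimated as the fraction of sentences
    containing a word (or both words of a pair).
    """
    word_counts = Counter()
    pair_counts = Counter()
    for sent in sentences:
        vocab = sorted(set(sent))
        word_counts.update(vocab)
        pair_counts.update(combinations(vocab, 2))
    n = len(sentences)
    return {
        (a, b): math.log2((c / n) / ((word_counts[a] / n) * (word_counts[b] / n)))
        for (a, b), c in pair_counts.items()
    }


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm


def mean_embedding_distance(tokens, embeddings):
    """Mean cosine distance over all pairs of distinct in-vocabulary words.

    Assumption: this simple average stands in for "semantic distance
    between words in a document"; the paper may aggregate differently.
    """
    words = sorted({t for t in tokens if t in embeddings})
    pairs = list(combinations(words, 2))
    if not pairs:
        return 0.0
    total = sum(cosine_distance(embeddings[a], embeddings[b]) for a, b in pairs)
    return total / len(pairs)


# Toy data, invented for illustration only.
toy_sentences = [["cat", "dog"], ["cat", "dog"], ["cat", "fish"], ["bird", "fish"]]
toy_embeddings = {"cat": [1.0, 0.0], "dog": [1.0, 0.0], "quark": [0.0, 1.0]}

pmi = pmi_table(toy_sentences)
score = mean_embedding_distance(["cat", "dog", "quark"], toy_embeddings)
```

Under these assumptions, a document whose words lie far apart in embedding space receives a higher score than one built from closely related words. The PMI table needs pair counts from a corpus, whereas the embedding score only needs vector lookups.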

Original language: English (US)
Title of host publication: Wisdom, Well-Being, Win-Win - 19th International Conference, iConference 2024, Proceedings
Editors: Isaac Sserwanga, Hideo Joho, Jie Ma, Preben Hansen, Dan Wu, Masanori Koizumi, Anne J. Gilliland
Publisher: Springer
Pages: 283-292
Number of pages: 10
ISBN (Print): 9783031578663
DOIs
State: Published - 2024
Externally published: Yes
Event: 19th International Conference on Wisdom, Well-Being, Win-Win, iConference 2024 - Changchun, China
Duration: Apr 15 2024 → Apr 26 2024

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 14598 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 19th International Conference on Wisdom, Well-Being, Win-Win, iConference 2024
Country/Territory: China
City: Changchun
Period: 4/15/24 → 4/26/24

Keywords

  • Pointwise Mutual Information
  • Readability
  • Text complexity
  • Word embedding

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science
