TY - GEN
T1 - Word Embedding-Based Text Complexity Analysis
AU - Choi, Kahyun
N1 - This work was supported by RE-252382-OLS-22 from the Institute of Museum and Library Services.
PY - 2024
Y1 - 2024
AB - Text complexity metrics play a crucial role in quantifying the readability of important documents, helping to ensure public safety, enhance educational outcomes, and more. Pointwise mutual information (PMI) has been widely used to measure text complexity by capturing statistical co-occurrence patterns between word pairs, on the assumption that these patterns carry semantic significance. We observe that word embeddings are similar to PMI in that both are derived from co-occurrence statistics over large corpora, yet embeddings offer faster computation and more generalizable measures of semantic proximity. Given this, we propose a novel text complexity metric that leverages word embeddings to measure the semantic distance between words in a document. We empirically validate our approach on the OneStopEnglish dataset, which contains news articles annotated with expert-labeled readability scores. Our experiments reveal that the proposed word embedding-based metric correlates more strongly with ground-truth readability levels than conventional PMI-based metrics. This study serves as a cornerstone for future research aiming to incorporate context-dependent embeddings and to extend applicability to various text types.
KW - Pointwise Mutual Information
KW - Readability
KW - Text complexity
KW - Word embedding
UR - http://www.scopus.com/inward/record.url?scp=85192236902&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85192236902&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-57867-0_21
DO - 10.1007/978-3-031-57867-0_21
M3 - Conference contribution
AN - SCOPUS:85192236902
SN - 9783031578663
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 283
EP - 292
BT - Wisdom, Well-Being, Win-Win - 19th International Conference, iConference 2024, Proceedings
A2 - Sserwanga, Isaac
A2 - Joho, Hideo
A2 - Ma, Jie
A2 - Hansen, Preben
A2 - Wu, Dan
A2 - Koizumi, Masanori
A2 - Gilliland, Anne J.
PB - Springer
T2 - 19th International Conference on Wisdom, Well-Being, Win-Win, iConference 2024
Y2 - 15 April 2024 through 26 April 2024
ER -
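
For illustration only, separate from the record above: a minimal sketch of what an embedding-based complexity score like the one described in the abstract might look like, assuming a document is scored by the mean pairwise cosine distance between its word embeddings. The tiny EMBEDDINGS table and the complexity function are hypothetical stand-ins, not the authors' implementation; a real setting would load pretrained vectors such as word2vec or GloVe.

    # Hypothetical sketch (not the paper's code): score a document by the
    # average pairwise semantic distance between its word embeddings.
    from itertools import combinations
    import numpy as np

    # Placeholder vectors (assumption); real ones would come from a
    # pretrained embedding model keyed by word.
    EMBEDDINGS = {
        "economy":  np.array([0.9, 0.1, 0.3]),
        "market":   np.array([0.8, 0.2, 0.4]),
        "quantum":  np.array([0.1, 0.9, 0.7]),
        "entangle": np.array([0.2, 0.8, 0.9]),
    }

    def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
        """1 minus cosine similarity; larger means less semantically related."""
        return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    def complexity(words: list[str]) -> float:
        """Mean pairwise cosine distance over in-vocabulary words.

        One plausible reading of the abstract: documents whose words are
        semantically far apart score as more complex than documents whose
        words cluster around one coherent topic.
        """
        vecs = [EMBEDDINGS[w] for w in words if w in EMBEDDINGS]
        pairs = list(combinations(vecs, 2))
        if not pairs:
            return 0.0
        return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)

    print(complexity(["economy", "market"]))               # coherent topic: lower score
    print(complexity(["economy", "quantum", "entangle"]))  # mixed topics: higher score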