TY - GEN

T1 - An Evaluation of NLP Methods to Extract Mathematical Token Descriptors

AU - Hamel, Emma

AU - Zheng, Hongbo

AU - Kani, Nickvash

N1 - Funding Information:
Supported by University of Illinois at Urbana-Champaign - College of Engineering.
Publisher Copyright:
© 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.

PY - 2022

Y1 - 2022

AB - Mathematical formulae are a foundational component of information in all scientific and mathematical papers. Parsing meaning from these expressions by extracting textual descriptors of their variable tokens is a unique challenge that requires semantic and grammatical knowledge. In this work, we present a new manually labeled dataset (called the MTDE dataset) of mathematical objects, the contexts in which they are defined, and their textual definitions. With this dataset, we evaluate the accuracy of several modern neural network models on two definition extraction tasks. While this is not a solved task, modern language models such as BERT perform well (∼90%). Both the dataset and neural network models (implemented in PyTorch Jupyter notebooks) are available online to aid future researchers in this space.

KW - Dataset

KW - Mathematical language processing

KW - Named entity recognition

KW - Text summarization

UR - http://www.scopus.com/inward/record.url?scp=85138781592&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85138781592&partnerID=8YFLogxK

U2 - 10.1007/978-3-031-16681-5_23

DO - 10.1007/978-3-031-16681-5_23

M3 - Conference contribution

AN - SCOPUS:85138781592

SN - 9783031166808

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 329

EP - 343

BT - Intelligent Computer Mathematics - 15th International Conference, CICM 2022, Proceedings

A2 - Buzzard, Kevin

A2 - Kutsia, Temur

PB - Springer

T2 - 15th Conference on Intelligent Computer Mathematics, CICM 2022

Y2 - 19 September 2022 through 23 September 2022

ER -