An Evaluation of NLP Methods to Extract Mathematical Token Descriptors

Emma Hamel, Hongbo Zheng, Nickvash Kani

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


Mathematical formulae are a foundational component of information in all scientific and mathematical papers. Parsing meaning from these expressions by extracting textual descriptors of their variable tokens is a unique challenge that requires semantic and grammatical knowledge. In this work, we present a new manually labeled dataset (called the MTDE dataset) of mathematical objects, the contexts in which they are defined, and their textual definitions. With this dataset, we evaluate the accuracy of several modern neural network models on two definition extraction tasks. While this is not a solved task, modern language models such as BERT perform well (∼90%). Both the dataset and neural network models (implemented in PyTorch Jupyter notebooks) are available online to aid future researchers in this space.

Original language: English (US)
Title of host publication: Intelligent Computer Mathematics - 15th International Conference, CICM 2022, Proceedings
Editors: Kevin Buzzard, Temur Kutsia
Number of pages: 15
ISBN (Print): 9783031166808
State: Published - 2022
Event: 15th Conference on Intelligent Computer Mathematics, CICM 2022 - Tbilisi, Georgia
Duration: Sep 19 2022 → Sep 23 2022

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13467 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Conference: 15th Conference on Intelligent Computer Mathematics, CICM 2022


Keywords

  • Dataset
  • Mathematical language processing
  • Named entity recognition
  • Text summarization

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science

