OpCitance: Citation contexts identified from the PubMed Central open access articles

Tzu Kun Hsiao, Vetle I. Torvik

Research output: Contribution to journal › Article › peer-review

Abstract

OpCitance contains all the sentences from 2 million PubMed Central open-access (PMCOA) articles, with 137 million inline citations annotated (i.e., the “citation contexts”). Parsing out the references and citation contexts from the PMCOA XML files was non-trivial due to the diversity of referencing styles. Only 0.5% of citation contexts remain unidentified due to technical or human issues, e.g., references unmentioned by the authors in the text or improper XML nesting, which is more common among older articles (pre-2000). For 70.96% of the articles, the PubMed IDs (PMIDs) linked to inline citations in the XML files differed from the citations harvested using the NCBI E-Utilities. An in-house citation matcher, Patci, was used to supplement and correct 6.84% of the referenced PMIDs. OpCitance includes fewer articles in total than the Semantic Scholar Open Research Corpus, but it has 160 thousand unique articles, a higher inline citation identification rate, and a more accurate reference mapping to PMIDs. We hope that OpCitance will facilitate citation context studies in particular and benefit text-mining research more broadly.
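
As a rough illustration of what locating inline citations in PMC-style XML involves, the sketch below reads a small JATS-like fragment and pairs each inline `xref` with the PMID of the reference it points to. It is not the authors' pipeline and does not involve Patci. The function name `inline_citations`, the sample fragment, and the PMID value are hypothetical, and sentence splitting is omitted, so the "context" here is the whole paragraph rather than a single sentence.

```python
import xml.etree.ElementTree as ET

# Hypothetical JATS-style fragment; real PMCOA files are far less regular.
SAMPLE = """<article>
  <body>
    <p>Prior work reported this effect <xref ref-type="bibr" rid="B1">[1]</xref>. It was later replicated <xref ref-type="bibr" rid="B2">[2]</xref>.</p>
  </body>
  <back>
    <ref-list>
      <ref id="B1"><pub-id pub-id-type="pmid">11111111</pub-id></ref>
      <ref id="B2"></ref>
    </ref-list>
  </back>
</article>"""


def inline_citations(xml_text):
    """Return (reference id, PMID or None, paragraph text) per inline citation."""
    root = ET.fromstring(xml_text)

    # Map each reference id to its PMID, or None when the XML carries none.
    pmids = {}
    for ref in root.iter("ref"):
        pub_id = ref.find(".//pub-id[@pub-id-type='pmid']")
        pmids[ref.get("id")] = pub_id.text if pub_id is not None else None

    results = []
    for para in root.iter("p"):
        # Flatten the paragraph, including citation markers, and normalize spaces.
        context = " ".join("".join(para.itertext()).split())
        for xref in para.iter("xref"):
            # Only bibliographic cross-references count as inline citations.
            if xref.get("ref-type") == "bibr":
                rid = xref.get("rid")
                results.append((rid, pmids.get(rid), context))
    return results


if __name__ == "__main__":
    for rid, pmid, context in inline_citations(SAMPLE):
        print(rid, pmid, context)
```

A reference with no PMID in the XML (like `B2` above) comes back as `None`, which is the kind of gap the abstract describes filling with a separate citation matcher.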

Original language: English (US)
Article number: 243
Journal: Scientific Data
Volume: 10
Issue number: 1
State: Published - Dec 2023

ASJC Scopus subject areas

  • Information Systems
  • Education
  • Library and Information Sciences
  • Statistics and Probability
  • Computer Science Applications
  • Statistics, Probability and Uncertainty
