Author-implicit journal, MeSH, title-word, and affiliation-word pairs based on Author-ity 2009



Contains a series of datasets that score pairs of tokens (words, journal names, and controlled vocabulary terms) based on how often they co-occur within versus across authors' collections of papers. The tokens derive from four different fields of PubMed papers: journal, affiliation, title, MeSH (medical subject headings). Thus, there are 10 different datasets, one for each pair of token type: affiliation-word vs affiliation-word, affiliation-word vs journal, affiliation-word vs mesh, affiliation-word vs title-word, mesh vs mesh, mesh vs journal, etc.

Using authors to link papers and in turn pairs of tokens is an alternative to the usual within-document co-occurrences, and using e.g., citations to link papers. This is particularly striking for journal pairs because a paper almost always appears in a single journal and so within-document co-occurrences are 0, i.e., useless.

The tokens are taken from the Author-ity 2009 dataset which has a cluster of papers for each inferred author, and a summary of each field. For MeSH, title-words, affiliation-words that summary includes only the top-20 most frequent tokens after field-specific stoplisting (e.g., university is stoplisted from affiliation and Humans is stoplisted from MeSH). The score for a pair of tokens A and B is defined as follows. Suppose Ai and Bi are the number of occurrences of token A (and B, respectively) across the i-th author's papers, then
nA = sum(Ai); nB = sum(Ai)
nAB = sum(Ai*Bi) if A not equal B; nAA = sum(Ai*(Ai-1)/2) otherwise
nAnB = nA*nB if A not equal B; nAnA = nA*(nA-1)/2 otherwise
score = 1000000*nAB/nAnB if A is not equal B; 1000000*nAA/nAnA otherwise

Token pairs are excluded when: score < 5, or nA < cut-off, or nB < cut-off, or nAB < cut-offAB.
The cut-offs differ for token types and can be inferred from the datasets. For example, cut-off = 200 and cut-offAB = 20 for journal pairs.

Each dataset has the following 7 tab-delimited all-ASCII columns

1: score: roughly the number tokens' co-occurrence divided by the total number of pairs, in parts per million (ppm), ranging from 5 to 1,000,000
2: nAB: total number of co-occurrences
3: nAnB: total number of pairs
4: nA: number of occurrences of token A
5: nB: number of occurrences of token B
6: A: token A
7: B: token B

We made some of these datasets as early as 2011 as we were working to link PubMed authors with USPTO inventors, where the vocabulary usage is strikingly different, but also more recently to create links from PubMed authors to their dissertations and NIH/NSF investigators, and to help disambiguate PubMed authors. Going beyond explicit (exact within-field match) is particularly useful when data is sparse (think old papers lacking controlled vocabulary and affiliations, or papers with metadata written in different languages) and when making links across databases with different kinds of fields and vocabulary (think PubMed vs USPTO records). We never published a paper on this but our work inspired the more refined measures described in:

<a href="">D′Souza JL, Smalheiser NR (2014) Three Journal Similarity Metrics and Their Application to Biomedical Journals. PLOS ONE 9(12): e115681.</a>

<a href="">Smalheiser, N., & Bonifield, G. (2016). Two Similarity Metrics for Medical Subject Headings (MeSH): An Aid to Biomedical Text Mining and Author Name Disambiguation. DISCO: Journal of Biomedical Discovery and Collaboration, 7. doi:</a>
Date made availableApr 23 2018
PublisherUniversity of Illinois Urbana-Champaign


  • PubMed
  • name disambiguation
  • token
  • MeSH

Cite this