A probabilistic similarity metric for Medline records: A model for author name disambiguation

Vetle I. Torvik, Marc Weeber, Don R. Swanson, Neil R. Smalheiser

Research output: Contribution to journal (Article)

Abstract

We present a model for estimating the probability that a pair of author names (sharing last name and first initial), appearing on two different Medline articles, refer to the same individual. The model uses a simple yet powerful similarity profile between a pair of articles, based on title, journal name, coauthor names, medical subject headings (MeSH), language, affiliation, and name attributes (prevalence in the literature, middle initial, and suffix). The similarity profile distribution is computed from reference sets consisting of pairs of articles containing almost exclusively author matches versus nonmatches, generated in an unbiased manner. Although the match set is generated automatically and might contain a small proportion of nonmatches, the model is quite robust against contamination with nonmatches. We have created a free, public service ("Author-ity": http://arrowsmith.psych.uic.edu) that takes as input an author's name given on a specific article, and gives as output a list of all articles with that (last name, first initial) ranked by decreasing similarity, with match probability indicated.
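The estimation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the four-feature profile, the add-one smoothing, the 0.5 prior, and the names `profile` and `match_probability` are all assumptions made for illustration. The sketch counts how often each similarity profile occurs in a match reference set and in a nonmatch reference set, then applies Bayes' rule to turn the two relative frequencies into a match probability.

```python
from collections import Counter

def profile(a, b):
    """Toy similarity profile (hypothetical features) comparing two article records."""
    shared_coauthors = len(set(a["coauthors"]) & set(b["coauthors"]))
    shared_mesh = len(set(a["mesh"]) & set(b["mesh"]))
    return (
        int(a["journal"] == b["journal"]),    # 0 or 1
        int(a["language"] == b["language"]),  # 0 or 1
        min(shared_coauthors, 2),             # capped at 2
        min(shared_mesh, 3),                  # capped at 3
    )

# Number of distinct profiles the toy features can produce: 2 * 2 * 3 * 4.
N_PROFILES = 48

def match_probability(x, match_counts, nonmatch_counts, prior=0.5):
    """P(match | profile x) via Bayes' rule with add-one smoothed frequencies."""
    p_x_match = (match_counts[x] + 1) / (sum(match_counts.values()) + N_PROFILES)
    p_x_nonmatch = (nonmatch_counts[x] + 1) / (sum(nonmatch_counts.values()) + N_PROFILES)
    odds = (prior / (1 - prior)) * (p_x_match / p_x_nonmatch)
    return odds / (1 + odds)

# Made-up reference-set counts, for illustration only.
match_counts = Counter({(1, 1, 2, 3): 8, (0, 1, 0, 0): 2})
nonmatch_counts = Counter({(0, 1, 0, 0): 9, (1, 1, 2, 3): 1})

art1 = {"journal": "J Biol Chem", "language": "eng",
        "coauthors": ["Lee K", "Wong T"], "mesh": ["Humans", "Mice", "Kinetics"]}
art2 = {"journal": "J Biol Chem", "language": "eng",
        "coauthors": ["Lee K", "Wong T", "Diaz R"], "mesh": ["Humans", "Mice", "Kinetics"]}

x = profile(art1, art2)
print(x, round(match_probability(x, match_counts, nonmatch_counts), 3))
```

Ranking all candidate articles by this probability, highest first, mirrors the kind of output the abstract attributes to the Author-ity service, although the real model draws on more attributes (title, affiliation, name prevalence, middle initial, suffix) than this toy profile does.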

Original language: English (US)
Pages (from-to): 140-158
Number of pages: 19
Journal: Journal of the American Society for Information Science and Technology
Volume: 56
Issue number: 2
DOI: 10.1002/asi.20105
State: Published - Jan 15 2005
Externally published: Yes

Fingerprint

Contamination
environmental pollution
public service
language
Proportion
literature

ASJC Scopus subject areas

  • Software
  • Information Systems
  • Human-Computer Interaction
  • Computer Networks and Communications
  • Artificial Intelligence

Cite this

A probabilistic similarity metric for Medline records: A model for author name disambiguation. / Torvik, Vetle I.; Weeber, Marc; Swanson, Don R.; Smalheiser, Neil R.

In: Journal of the American Society for Information Science and Technology, Vol. 56, No. 2, 15.01.2005, p. 140-158.

@article{b009e35f8f264ea4a47aaa66f621487d,
title = "A probabilistic similarity metric for {Medline} records: A model for author name disambiguation",
abstract = "We present a model for estimating the probability that a pair of author names (sharing last name and first initial), appearing on two different Medline articles, refer to the same individual. The model uses a simple yet powerful similarity profile between a pair of articles, based on title, journal name, coauthor names, medical subject headings (MeSH), language, affiliation, and name attributes (prevalence in the literature, middle initial, and suffix). The similarity profile distribution is computed from reference sets consisting of pairs of articles containing almost exclusively author matches versus nonmatches, generated in an unbiased manner. Although the match set is generated automatically and might contain a small proportion of nonmatches, the model is quite robust against contamination with nonmatches. We have created a free, public service ({"}Author-ity{"}: http://arrowsmith.psych.uic.edu) that takes as input an author's name given on a specific article, and gives as output a list of all articles with that (last name, first initial) ranked by decreasing similarity, with match probability indicated.",
author = "Torvik, {Vetle I.} and Weeber, Marc and Swanson, {Don R.} and Smalheiser, {Neil R.}",
year = "2005",
month = "1",
day = "15",
doi = "10.1002/asi.20105",
language = "English (US)",
volume = "56",
pages = "140--158",
journal = "Journal of the Association for Information Science and Technology",
issn = "2330-1635",
publisher = "John Wiley and Sons Ltd",
number = "2",
}

TY - JOUR

T1 - A probabilistic similarity metric for Medline records

T2 - A model for author name disambiguation

AU - Torvik, Vetle I.

AU - Weeber, Marc

AU - Swanson, Don R.

AU - Smalheiser, Neil R.

PY - 2005/1/15

Y1 - 2005/1/15

N2 - We present a model for estimating the probability that a pair of author names (sharing last name and first initial), appearing on two different Medline articles, refer to the same individual. The model uses a simple yet powerful similarity profile between a pair of articles, based on title, journal name, coauthor names, medical subject headings (MeSH), language, affiliation, and name attributes (prevalence in the literature, middle initial, and suffix). The similarity profile distribution is computed from reference sets consisting of pairs of articles containing almost exclusively author matches versus nonmatches, generated in an unbiased manner. Although the match set is generated automatically and might contain a small proportion of nonmatches, the model is quite robust against contamination with nonmatches. We have created a free, public service ("Author-ity": http://arrowsmith.psych.uic.edu) that takes as input an author's name given on a specific article, and gives as output a list of all articles with that (last name, first initial) ranked by decreasing similarity, with match probability indicated.

UR - http://www.scopus.com/inward/record.url?scp=12344288685&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=12344288685&partnerID=8YFLogxK

U2 - 10.1002/asi.20105

DO - 10.1002/asi.20105

M3 - Article

AN - SCOPUS:12344288685

VL - 56

SP - 140

EP - 158

JO - Journal of the Association for Information Science and Technology

JF - Journal of the Association for Information Science and Technology

SN - 2330-1635

IS - 2

ER -