MapAffil: A bibliographic tool for mapping author affiliation strings to cities and their geocodes worldwide

Research output: Contribution to journal › Article

Abstract

Bibliographic records often contain author affiliations as free-form text strings. Ideally one would be able to automatically identify all affiliations referring to any particular country or city such as Saint Petersburg, Russia. That introduces several major linguistic challenges. For example, Saint Petersburg is ambiguous (it refers to multiple cities worldwide and can be part of a street address) and it has spelling variants (e.g., St. Petersburg, Sankt-Peterburg, and Leningrad, USSR). We have designed an algorithm that attempts to solve these types of problems. Key components of the algorithm include a set of 24,000 extracted city, state, and country names (and their variants plus geocodes) for candidate look-up, and a set of 1.1 million extracted word n-grams, each pointing to a unique country (or a US state) for disambiguation. When applied to a collection of 12.7 million affiliation strings listed in PubMed, ambiguity remained unresolved for only 0.1%. For the 4.2 million mappings to the USA, 97.7% were complete (included a city), 1.8% included a state but not a city, and 0.4% did not include a state. A random sample of 300 manually inspected cases yielded six incompletes, none incorrect, and one unresolved ambiguity. The remaining 293 (97.7%) cases were unambiguously mapped to the correct cities, better than all of the existing tools tested: GoPubMed got 279 (93.0%) and GeoMaker got 274 (91.3%) while MediaMeter CLIFF and Google Maps did worse. In summary, we find that incorrect assignments and unresolved ambiguities are rare (< 1%). The incompleteness rate is about 2%, mostly due to a lack of information, e.g., the affiliation simply says "University of Illinois" which can refer to one of five different campuses. A search interface called MapAffil has been developed at the University of Illinois in which the longitude and latitude of the geographical city-center are displayed when a city is identified. This not only helps improve geographic information retrieval but also enables global bibliometric studies of proximity, mobility, and other geo-linked data.
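
The two components named in the abstract (a name dictionary for candidate look-up and word n-grams for disambiguation) can be illustrated with a minimal Python sketch. This is not the MapAffil implementation: the tiny `GAZETTEER` and `NGRAM_COUNTRY` tables, the function names, the approximate coordinates, and the simple vote-counting rule are all assumptions made here for illustration only.

```python
import re
from collections import Counter

# Hypothetical toy data; the real tool uses about 24,000 names and 1.1 million n-grams.
# Coordinates are approximate city centers.
RU = ("Saint Petersburg", "Russia", 59.94, 30.31)
US = ("St. Petersburg", "USA", 27.77, -82.64)

# Name variant (lowercased, punctuation removed) -> candidate cities.
GAZETTEER = {
    "saint petersburg": [RU, US],
    "st petersburg": [RU, US],
    "sankt peterburg": [RU],
    "leningrad": [RU],
}

# Word n-gram -> the single country it is assumed to indicate.
NGRAM_COUNTRY = {
    "russia": "Russia",
    "ussr": "Russia",
    "russian academy": "Russia",
    "florida": "USA",
    "fl": "USA",
    "university of south florida": "USA",
}


def ngrams(tokens, max_n=4):
    """Yield every word n-gram of length 1..max_n as a space-joined string."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def map_affiliation(affiliation):
    """Return (city, country, lat, lon), or None if nothing is found or ambiguity remains."""
    tokens = re.findall(r"[a-z]+", affiliation.lower())
    grams = list(ngrams(tokens))

    # Step 1: candidate look-up against the name dictionary.
    candidates = []
    for gram in grams:
        for city in GAZETTEER.get(gram, []):
            if city not in candidates:
                candidates.append(city)
    if len(candidates) <= 1:
        return candidates[0] if candidates else None

    # Step 2: disambiguation by counting country votes from the n-grams.
    votes = Counter(NGRAM_COUNTRY[g] for g in grams if g in NGRAM_COUNTRY)
    ranked = votes.most_common(2)
    if not ranked or (len(ranked) == 2 and ranked[0][1] == ranked[1][1]):
        return None  # no evidence, or a tie: leave the ambiguity unresolved
    best_country = ranked[0][0]
    matches = [c for c in candidates if c[1] == best_country]
    return matches[0] if len(matches) == 1 else None


if __name__ == "__main__":
    for text in [
        "Ioffe Institute, Russian Academy of Sciences, St. Petersburg, Russia",
        "University of South Florida, St. Petersburg, FL",
        "12 Saint Petersburg Street",
    ]:
        print(text, "->", map_affiliation(text))
```

In this sketch a string such as "12 Saint Petersburg Street" matches two candidate cities but carries no country evidence, so it is returned as unresolved rather than guessed, which mirrors the abstract's distinction between incorrect, incomplete, and unresolved cases.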

Original language: English (US)
Pages (from-to): 1
Number of pages: 1
Journal: D-Lib Magazine
Volume: 21
Issue number: 11-12
DOI: 10.1045/november2015-torvik
State: Published - Jan 1 2015

Keywords

  • Author affiliations
  • Bibliographic databases
  • Digital libraries
  • Geocoding
  • Geographic indexing
  • Geoparsing
  • MEDLINE
  • Place name ambiguity
  • PubMed
  • Toponym extraction
  • Toponym resolution

ASJC Scopus subject areas

  • Library and Information Sciences

Cite this

MapAffil: A bibliographic tool for mapping author affiliation strings to cities and their geocodes worldwide. / Torvik, Vetle I.

In: D-Lib Magazine, Vol. 21, No. 11-12, 01.01.2015, p. 1.

@article{180ad9d5b91a46aabac05ac61fb9a14b,
title = "MapAffil: A bibliographic tool for mapping author affiliation strings to cities and their geocodes worldwide",
abstract = "Bibliographic records often contain author affiliations as free-form text strings. Ideally one would be able to automatically identify all affiliations referring to any particular country or city such as Saint Petersburg, Russia. That introduces several major linguistic challenges. For example, Saint Petersburg is ambiguous (it refers to multiple cities worldwide and can be part of a street address) and it has spelling variants (e.g., St. Petersburg, Sankt-Peterburg, and Leningrad, USSR). We have designed an algorithm that attempts to solve these types of problems. Key components of the algorithm include a set of 24,000 extracted city, state, and country names (and their variants plus geocodes) for candidate look-up, and a set of 1.1 million extracted word n-grams, each pointing to a unique country (or a US state) for disambiguation. When applied to a collection of 12.7 million affiliation strings listed in PubMed, ambiguity remained unresolved for only 0.1{\%}. For the 4.2 million mappings to the USA, 97.7{\%} were complete (included a city), 1.8{\%} included a state but not a city, and 0.4{\%} did not include a state. A random sample of 300 manually inspected cases yielded six incompletes, none incorrect, and one unresolved ambiguity. The remaining 293 (97.7{\%}) cases were unambiguously mapped to the correct cities, better than all of the existing tools tested: GoPubMed got 279 (93.0{\%}) and GeoMaker got 274 (91.3{\%}) while MediaMeter CLIFF and Google Maps did worse. In summary, we find that incorrect assignments and unresolved ambiguities are rare (< 1{\%}). The incompleteness rate is about 2{\%}, mostly due to a lack of information, e.g. the affiliation simply says {"}University of Illinois{"} which can refer to one of five different campuses. A search interface called MapAffil has been developed at the University of Illinois in which the longitude and latitude of the geographical city-center is displayed when a city is identified. 
This not only helps improve geographic information retrieval but also enables global bibliometric studies of proximity, mobility, and other geo-linked data.",
keywords = "Author affiliations, Bibliographic databases, Digital libraries, Geocoding, Geographic indexing, Geoparsing, MEDLINE, Place name ambiguity, PubMed, Toponym extraction, Toponym resolution",
author = "Torvik, {Vetle I.}",
year = "2015",
month = "1",
day = "1",
doi = "10.1045/november2015-torvik",
language = "English (US)",
volume = "21",
pages = "1",
journal = "D-Lib Magazine",
issn = "1082-9873",
publisher = "Corporation for National Research Initiatives",
number = "11-12",

}

TY - JOUR

T1 - MapAffil

T2 - A bibliographic tool for mapping author affiliation strings to cities and their geocodes worldwide

AU - Torvik, Vetle I.

PY - 2015/1/1

Y1 - 2015/1/1

AB - Bibliographic records often contain author affiliations as free-form text strings. Ideally one would be able to automatically identify all affiliations referring to any particular country or city such as Saint Petersburg, Russia. That introduces several major linguistic challenges. For example, Saint Petersburg is ambiguous (it refers to multiple cities worldwide and can be part of a street address) and it has spelling variants (e.g., St. Petersburg, Sankt-Peterburg, and Leningrad, USSR). We have designed an algorithm that attempts to solve these types of problems. Key components of the algorithm include a set of 24,000 extracted city, state, and country names (and their variants plus geocodes) for candidate look-up, and a set of 1.1 million extracted word n-grams, each pointing to a unique country (or a US state) for disambiguation. When applied to a collection of 12.7 million affiliation strings listed in PubMed, ambiguity remained unresolved for only 0.1%. For the 4.2 million mappings to the USA, 97.7% were complete (included a city), 1.8% included a state but not a city, and 0.4% did not include a state. A random sample of 300 manually inspected cases yielded six incompletes, none incorrect, and one unresolved ambiguity. The remaining 293 (97.7%) cases were unambiguously mapped to the correct cities, better than all of the existing tools tested: GoPubMed got 279 (93.0%) and GeoMaker got 274 (91.3%) while MediaMeter CLIFF and Google Maps did worse. In summary, we find that incorrect assignments and unresolved ambiguities are rare (< 1%). The incompleteness rate is about 2%, mostly due to a lack of information, e.g. the affiliation simply says "University of Illinois" which can refer to one of five different campuses. A search interface called MapAffil has been developed at the University of Illinois in which the longitude and latitude of the geographical city-center is displayed when a city is identified. This not only helps improve geographic information retrieval but also enables global bibliometric studies of proximity, mobility, and other geo-linked data.

KW - Author affiliations

KW - Bibliographic databases

KW - Digital libraries

KW - Geocoding

KW - Geographic indexing

KW - Geoparsing

KW - MEDLINE

KW - Place name ambiguity

KW - PubMed

KW - Toponym extraction

KW - Toponym resolution

UR - http://www.scopus.com/inward/record.url?scp=84957096646&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84957096646&partnerID=8YFLogxK

U2 - 10.1045/november2015-torvik

DO - 10.1045/november2015-torvik

M3 - Article

AN - SCOPUS:84957096646

VL - 21

SP - 1

JO - D-Lib Magazine

JF - D-Lib Magazine

SN - 1082-9873

IS - 11-12

ER -