Bibliographic records often contain author affiliations as free-form text strings. Ideally one would be able to automatically identify all affiliations referring to any particular country or city such as Saint Petersburg, Russia. This introduces several major linguistic challenges. For example, Saint Petersburg is ambiguous (it refers to multiple cities worldwide and can be part of a street address) and it has spelling variants (e.g., St. Petersburg, Sankt-Peterburg, and Leningrad, USSR). We have designed an algorithm that attempts to solve these types of problems. Key components of the algorithm include a set of 24,000 extracted city, state, and country names (and their variants plus geocodes) for candidate look-up, and a set of 1.1 million extracted word n-grams, each pointing to a unique country (or a US state) for disambiguation. When applied to a collection of 12.7 million affiliation strings listed in PubMed, ambiguity remained unresolved for only 0.1%. For the 4.2 million mappings to the USA, 97.7% were complete (included a city), 1.8% included a state but not a city, and 0.4% did not include a state. A random sample of 300 manually inspected cases yielded six incompletes, none incorrect, and one unresolved ambiguity. The remaining 293 (97.7%) cases were unambiguously mapped to the correct cities, better than all of the existing tools tested: GoPubMed got 279 (93.0%) and GeoMaker got 274 (91.3%), while MediaMeter CLIFF and Google Maps did worse. In summary, we find that incorrect assignments and unresolved ambiguities are rare (<1%). The incompleteness rate is about 2%, mostly due to a lack of information, e.g., the affiliation simply says "University of Illinois", which can refer to one of five different campuses. A search interface called MapAffil has been developed at the University of Illinois in which the longitude and latitude of the geographical city-center is displayed when a city is identified. This not only helps improve geographic information retrieval but also enables global bibliometric studies of proximity, mobility, and other geo-linked data.
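The two components named in the abstract, a place-name table for candidate look-up and a table of word n-grams that each point to one country for disambiguation, can be illustrated with a minimal sketch. This is not the MapAffil implementation: the tiny tables, the function names, the tokenization, and the rule of returning nothing when ambiguity remains are all assumptions made for illustration.

```python
import re

# Hypothetical toy gazetteer: normalized name variant -> candidate (city, country) pairs.
GAZETTEER = {
    "saint petersburg": [("Saint Petersburg", "Russia"), ("St. Petersburg", "USA")],
    "st petersburg": [("Saint Petersburg", "Russia"), ("St. Petersburg", "USA")],
    "sankt peterburg": [("Saint Petersburg", "Russia")],
    "leningrad": [("Saint Petersburg", "Russia")],
}

# Hypothetical n-gram table: each word n-gram points to exactly one country.
NGRAM_COUNTRY = {
    "russian academy": "Russia",
    "florida": "USA",
}


def word_ngrams(text, max_n=3):
    """Return the set of word n-grams (n = 1..max_n) of a lowercased string."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def map_affiliation(text):
    """Return a (city, country) pair, or None if no candidate or still ambiguous."""
    grams = word_ngrams(text)
    # Step 1: candidate look-up against the gazetteer.
    candidates = sorted({c for g in grams for c in GAZETTEER.get(g, [])})
    if len(candidates) == 1:
        return candidates[0]
    # Step 2: disambiguate using countries indicated by other n-grams in the string.
    countries = {NGRAM_COUNTRY[g] for g in grams if g in NGRAM_COUNTRY}
    matching = [c for c in candidates if c[1] in countries]
    return matching[0] if len(matching) == 1 else None
```

In this sketch, "Ioffe Institute, Russian Academy of Sciences, St. Petersburg" resolves to the Russian city because the n-gram "russian academy" points to Russia, whereas a bare "St. Petersburg" stays unresolved because nothing in the string favors either candidate.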
Keywords
- Author affiliations
- Bibliographic databases
- Digital libraries
- Geographic indexing
- Place name ambiguity
- Toponym extraction
- Toponym resolution
ASJC Scopus subject areas
- Library and Information Sciences
Fingerprint
Dive into the research topics of 'MapAffil: A bibliographic tool for mapping author affiliation strings to cities and their geocodes worldwide'. Together they form a unique fingerprint.
MapAffil 2018 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide with extracted disciplines, inferred GRIDs, and assigned ORCIDs
Torvik, V. I. (Creator), University of Illinois at Urbana-Champaign, May 7 2021
Torvik, V. I. (Creator), University of Illinois at Urbana-Champaign, Apr 19 2018