MapAffil 2016 dataset -- PubMed author affiliations mapped to cities and their geocodes worldwide. Prepared by Vetle Torvik 2018-04-05
The dataset comes as a single tab-delimited Latin-1 encoded file (only the City column uses non-ASCII characters), and should be about 3.5GB uncompressed.
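Because the file is large, it is usually easier to stream it than to load it whole. The following is a minimal sketch, assuming a local copy of the file; the function name `iter_rows` and the path passed to it are hypothetical, and whether the file starts with a header line is not stated here, so check the first row before relying on it.

```python
import csv
from typing import Iterator, List


def iter_rows(path: str) -> Iterator[List[str]]:
    """Lazily yield each line of the tab-delimited file as a list of strings.

    The file is decoded as Latin-1, as stated in the description above.
    """
    with open(path, encoding="latin-1", newline="") as handle:
        # QUOTE_NONE: treat quote characters in names or cities as literal text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for fields in reader:
            yield fields
```

Rows are produced one at a time, so memory use stays small regardless of file size.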
• How was the dataset created?
The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in the first week of October, 2016.
Information on obtaining PubMed/MEDLINE, along with NLM's data <a href="https://www.nlm.nih.gov/databases/download/pubmed_medline.html">Terms and Conditions</a>, is available from NLM.
• Affiliations are linked to a particular author on a particular article. Prior to 2014, NLM recorded the affiliation of the first author only.
However, for some PubMed records lacking affiliations, MapAffil 2016 includes affiliations harvested elsewhere: from PMC (e.g., PMID 22427989), NIH grants (e.g., 1838378), and Microsoft Academic Graph and ADS (e.g., 5833220).
• Affiliations are pre-processed (e.g., transliterated into ASCII from UTF-8 and HTML), so they may differ (sometimes a lot; see PMID 27487542) from the strings in PubMed records.
• All affiliation strings were processed using the MapAffil procedure, to identify and disambiguate the most specific place-name, as described in:
<i>Torvik VI. MapAffil: A bibliographic tool for mapping author affiliation strings to cities and their geocodes worldwide. D-Lib Magazine 2015; 21 (11/12). 10p</i>
• Look for <a href="https://doi.org/10.1186/s41182-017-0073-6">Fig. 4</a> in the following article for coverage statistics over time:
<i>Palmblad M, Torvik VI. Spatiotemporal analysis of tropical disease research combining Europe PMC and affiliation mapping web services. Tropical medicine and health. 2017 Dec;45(1):33.</i>
Expect to see big upticks in coverage of PMIDs around 1988 and for non-first authors in 2014.
• The code and back-end data are periodically updated and made available for query by PMID at <a href="http://abel.ischool.illinois.edu/">Torvik Research Group</a>
• What is the format of the dataset?
The dataset contains 37,406,692 rows. Each row (line) in the file has a unique PMID and author position (e.g., 10786286_3 is the third author name on PMID 10786286), and the following thirteen columns, tab-delimited. All columns are ASCII, except city, which contains Latin-1 characters.
1. PMID: positive non-zero integer; int(10) unsigned
2. au_order: positive non-zero integer; smallint(4)
3. lastname: varchar(80)
4. firstname: varchar(80); NLM started including these in 2002 but many have been harvested from outside PubMed
5. year of publication:
6. type: EDU, HOS, EDU-HOS, ORG, COM, GOV, MIL, UNK
7. city: varchar(200); typically 'city, state, country' but could include further subdivisions; unresolved ambiguities are concatenated by '|'
8. state: for Australia, Canada and USA only (USA includes territories like PR, GU, AS, and post-codes like AE and AA)
11. lat: at most 3 decimals (only available when city is not a country or state)
12. lon: at most 3 decimals (only available when city is not a country or state)
13. fips: varchar(5); for USA only; retrieved by lat-lon query to https://geo.fcc.gov/api/census/block/find
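The column list above can be turned into a typed record as follows. This is a sketch under stated assumptions, not the author's own loader: the names `COLUMNS`, `parse_record` and `city_candidates` are hypothetical, columns 9 and 10 are not described in the list so they get the placeholder names `col9` and `col10`, and the markers treated as "missing" for lat and lon (empty string, `NULL`, `\N`) are guesses that should be checked against the actual file.

```python
from typing import Dict, List, Optional

# Column names follow the list above; "col9" and "col10" are placeholders
# because those two columns are not described there.
COLUMNS = [
    "pmid", "au_order", "lastname", "firstname", "year", "type",
    "city", "state", "col9", "col10", "lat", "lon", "fips",
]

# Assumed markers for a missing value; verify against the real file.
MISSING = {"", "NULL", "\\N"}


def _to_float(text: str) -> Optional[float]:
    """Convert a coordinate string to float, or None when it is missing."""
    text = text.strip()
    return None if text in MISSING else float(text)


def parse_record(fields: List[str]) -> Dict[str, object]:
    """Map one row (13 tab-separated strings) to a dictionary of typed values."""
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    record: Dict[str, object] = dict(zip(COLUMNS, fields))
    record["pmid"] = int(fields[0])
    record["au_order"] = int(fields[1])
    # Unresolved ambiguities are concatenated by '|', so split them apart.
    record["city_candidates"] = [c.strip() for c in fields[6].split("|") if c.strip()]
    # lat/lon are only available when city is not a country or state.
    record["lat"] = _to_float(fields[10])
    record["lon"] = _to_float(fields[11])
    return record
```

A row whose city is unambiguous yields a single-element `city_candidates` list, and a row resolved only to a country or state yields `None` for both `lat` and `lon`.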
- Toponym Extraction
- PubMed, MEDLINE, Digital Libraries, Bibliographic Databases
- Geographic Indexing
- Author Affiliations
- Place Name Ambiguity
- Toponym Resolution