Author-ity 2009 baseline dataset. Prepared by Vetle Torvik 2009-12-03
The dataset comes in the form of 18 compressed (.gz) linux text files named authority2009.part00.gz - authority2009.part17.gz. The total size should be ~17.4GB uncompressed.
• How was the dataset created?
The dataset is based on a snapshot of PubMed (which includes Medline and PubMed-not-Medline records) taken in July 2009. A total of 19,011,985 Article records and 61,658,514 author name instances. Each instance of an author name is uniquely represented by the PMID and the position on the paper (e.g., 10786286_3 is the third author name on PMID 10786286). Thus, each cluster is represented by a collection of author name instances. The instances were first grouped into "blocks" by last name and first name initial (including some close variants), and then each block was separately subjected to clustering. Details are described in
<i>Torvik, V., & Smalheiser, N. (2009). Author name disambiguation in MEDLINE. ACM Transactions On Knowledge Discovery From Data, 3(3), doi:10.1145/1552303.1552304</i>
<i>Torvik, V. I., Weeber, M., Swanson, D. R., & Smalheiser, N. R. (2005). A Probabilistic Similarity Metric for Medline Records: A Model for Author Name Disambiguation. Journal Of The American Society For Information Science & Technology, 56(2), 140-158. doi:10.1002/asi.20105</i>
Note that for Author-ity 2009, some new predictive features (e.g., grants, citations matches, temporal, affiliation phrases) and a post-processing merging procedure were applied (to capture name variants not capture during blocking e.g. matches for subsets of compound last name matches, and nicknames with different first initial like Bill and William), and a temporal feature was used -- this has not yet been written up for publication.
• How accurate is the 2009 dataset (compared to 2006 and 2009)?
The recall reported for 2006 of 98.8% has been much improved in 2009 (because common last name variants are now captured). Compared to 2006, both years 2008 and 2009 overall seem to exhibit a higher rate of splitting errors but lower rate of lumping errors. This reflects an overall decrease in prior probabilites -- possibly because e.g. a) new prior estimation procedure that avoid wild estimates (by dampening the magnitude of iterative changes); b) 2008 and 2009 included items in Pubmed-not-Medline (including in-process items); and c) and the dramatic (exponential) increase in frequencies of some names (J. Lee went from ~16,000 occurrences in 2006 to 26,000 in 2009.) Although, splitting is reduced in 2009 for some special cases like NIH funded investigators who list their grant number of their papers. Compared to 2008, splitting errors were reduced overall in 2009 while maintaining the same level of lumping errors.
• What is the format of the dataset?
The cluster summaries for 2009 are much more extenstive than the 2008 dataset. Each line corresponds to a predicted author-individual represented by cluster of author name instances and a summary of all the corresponding papers and author name variants (and if there are > 10 papers in the cluster, an identical summary of the 10 most recent papers). Each cluster has a unique Author ID (which is uniquely identified by the PMID of the earliest paper in the cluster and the author name position. The summary has the following tab-delimited fields:
1. blocks separated by '||'; each block may consist of multiple lastname-first initial variants separated by '|'
2. prior probabilities of the respective blocks separated by '|'
3. Cluster number relative to the block ordered by cluster size (some are listed as 'CLUSTER X' when they were derived from multiple blocks)
4. Author ID (or cluster ID) e.g., bass_c_9731334_2 represents a cluster where 9731334_2 is the earliest author name instance. Although not needed for uniqueness, the id also has the most frequent lastname_firstinitial (lowercased).
5. cluster size (number of author name instances on papers)
6. name variants separated by '|' with counts in parenthesis. Each variant of the format lastname_firstname middleinitial, suffix
7. last name variants separated by '|'
8. first name variants separated by '|'
9. middle initial variants separated by '|' ('-' if none)
10. suffix variants separated by '|' ('-' if none)
11. email addresses separated by '|' ('-' if none)
12. range of years (e.g., 1997-2009)
13. Top 20 most frequent affiliation words (after stoplisting and tokenizing; some phrases are also made) with counts in parenthesis; separated by '|'; ('-' if none)
14. Top 20 most frequent MeSH (after stoplisting; "-") with counts in parenthesis; separated by '|'; ('-' if none)
15. Journals with counts in parenthesis (separated by "|"),
16. Top 20 most frequent title words (after stoplisting and tokenizing) with counts in parenthesis; separated by '|'; ('-' if none)
17. Co-author names (lowercased lastname and first/middle initials) with counts in parenthesis; separated by '|'; ('-' if none)
18. Co-author IDs with counts in parenthesis; separated by '|'; ('-' if none)
19. Author name instances (PMID_auno separated '|')
20. Grant IDs (after normalization; "-" if none given; separated by "|"),
21. Total number of times cited. (Citations are based on references extracted from PMC).
23. Citation counts (e.g., for h-index): PMIDs by the author that have been cited (with total citation counts in parenthesis); separated by "|"
24. Cited: PMIDs that the author cited (with counts in parenthesis) separated by "|"
25. Cited-by: PMIDs that cited the author (with counts in parenthesis) separated by "|"
26-47. same summary as for 4-25 except that the 10 most recent papers were used (based on year; so if paper 10, 11, 12... have the same year, one is selected arbitrarily)
- Name disambiguation
- Bibliographic databases
- Library information networks