Description
Author-ity 2018 dataset
Prepared by Vetle Torvik Apr. 22, 2021
The dataset is based on a snapshot of PubMed taken in December 2018 (NLM's 2018 baseline plus updates throughout 2018). It contains a total of 29.1 million Article records and 114.2 million author name instances. Each instance of an author name is uniquely represented by the PMID and the position on the paper (e.g., 10786286_3 is the third author name on PMID 10786286). Thus, each cluster is represented by a collection of author name instances. The instances were first grouped into "blocks" by last name and first name initial (including some close variants), and then each block was separately subjected to clustering. The resulting clusters are provided in two different formats, the first in a file with only IDs and PMIDs, and the second in a file with cluster summaries:
####################
File 1: au2id2018.tsv
####################
Each line corresponds to an author name instance (PMID and Author name position) with an Author ID. It has the following tab-delimited fields:
1. Author ID
2. PMID
3. Author name position
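As a minimal sketch of how File 1 might be read, the Python snippet below parses tab-delimited lines into a mapping from author name instance (PMID_position) to Author ID. The function name read_au2id and the sample rows are hypothetical and made up for illustration; the sketch also assumes the file has no header row, which the description does not state.

```python
import csv
import io

def read_au2id(lines):
    """Map each author name instance 'PMID_position' to its Author ID."""
    instance_to_author = {}
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        author_id, pmid, position = row
        instance_to_author[f"{pmid}_{position}"] = author_id
    return instance_to_author

# Made-up sample rows (not taken from the dataset): Author ID, PMID, position
sample = io.StringIO("100_1\t100\t1\n100_1\t250\t3\n250_1\t250\t1\n")
mapping = read_au2id(sample)
print(mapping["250_3"])  # 100_1
```

For the real file, an open file handle (for example, open("au2id2018.tsv", newline="")) could be passed in place of the in-memory sample.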
#########################
File 2: authority2018.tsv
#########################
Each line corresponds to a predicted author-individual represented by a cluster of author name instances and a summary of all the corresponding papers and author name variants. Each cluster has a unique Author ID (the PMID of the earliest paper in the cluster and the author name position). The summary has the following tab-delimited fields:
1. Author ID (or cluster ID) e.g., 3797874_1 represents a cluster where 3797874_1 is the earliest author name instance.
2. cluster size (number of author name instances on papers)
3. name variants separated by '|' with counts in parentheses. Each variant has the format lastname_firstname middleinitial, suffix
4. last name variants separated by '|'
5. first name variants separated by '|'
6. middle initial variants separated by '|' ('-' if none)
7. suffix variants separated by '|' ('-' if none)
8. email addresses separated by '|' ('-' if none)
9. ORCIDs separated by '|' ('-' if none). Sourced from the 2019 ORCID Public Data File (https://orcid.org/) and from PubMed XML
10. range of years (e.g., 1997-2009)
11. Top 20 most frequent affiliation words (after stoplisting and tokenizing; some phrases are also formed) with counts in parentheses; separated by '|'; ('-' if none)
12. Top 20 most frequent MeSH terms (after stoplisting) with counts in parentheses; separated by '|'; ('-' if none)
13. Journal names with counts in parentheses; separated by '|'
14. Top 20 most frequent title words (after stoplisting and tokenizing) with counts in parentheses; separated by '|'; ('-' if none)
15. Co-author names (lowercased lastname and first/middle initials) with counts in parentheses; separated by '|'; ('-' if none)
16. Author name instances (PMID_auno separated by '|')
17. Grant IDs (after normalization); separated by '|'; ('-' if none given)
18. Total number of times cited. (Citations are based on references harvested from open sources such as PMC).
19. h-index
20. Citation counts (e.g., for h-index): PMIDs by the author that have been cited (with total citation counts in parentheses); separated by '|'
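To illustrate how the counted fields relate to the h-index in field 19, here is a rough Python sketch that splits a '|'-separated field into (value, count) pairs and computes an h-index from the counts in field 20. It assumes each item is laid out as value(count) with the count in trailing parentheses; the description does not spell out the exact layout, so this should be checked against the actual file. The helper names parse_counted and h_index and the sample string are hypothetical, and the h-index uses the standard definition (the largest h such that h papers have at least h citations each), which may or may not match how field 19 was produced.

```python
import re

# Assumed item layout: "value(count)", with the count in trailing parentheses.
_ITEM = re.compile(r"^(.*)\((\d+)\)$")

def parse_counted(field):
    """Split a '|'-separated field into (value, count) pairs; '-' means none."""
    if field == "-":
        return []
    pairs = []
    for item in field.split("|"):
        match = _ITEM.match(item.strip())
        if match:  # items that do not fit the assumed layout are skipped
            pairs.append((match.group(1).strip(), int(match.group(2))))
    return pairs

def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    h = 0
    for rank, count in enumerate(sorted(citation_counts, reverse=True), start=1):
        if count < rank:
            break
        h = rank
    return h

# Made-up field 20 value (not taken from the dataset)
cited = parse_counted("111(10)|222(3)|333(1)")
print(cited)                                   # [('111', 10), ('222', 3), ('333', 1)]
print(h_index([count for _, count in cited]))  # 2
```

If the other counted fields (3 and 11 to 15) share this layout, the same parse_counted helper would apply to them unchanged.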
Date made available | Apr 22 2021 |
---|---|
Publisher | University of Illinois Urbana-Champaign |
Keywords
- author name disambiguation
- PubMed