TY - JOUR
T1 - Distortive effects of initial-based name disambiguation on measurements of large-scale coauthorship networks
AU - Kim, Jinseok
AU - Diesner, Jana
N1 - This work is supported by KISTI (Korea Institute of Science and Technology Information), grant P14033, and the FORD Foundation, grant 01450558. We thank Vetle I. Torvik and Brent D. Fegley for helpful comments on the article. We also thank the anonymous reviewers for ideas that helped improve this paper.
PY - 2016/6/1
Y1 - 2016/6/1
N2 - Scholars have often relied on name initials to resolve name ambiguities in large-scale coauthorship network research. This approach bears the risk of incorrectly merging or splitting author identities. The use of initial-based disambiguation has been justified by the assumption that such errors would not affect research findings too much. This paper tests that assumption by analyzing coauthorship networks from five academic fields (biology, computer science, nanoscience, neuroscience, and physics) and an interdisciplinary journal, PNAS. Name instances in the data sets of this study were disambiguated based on heuristics gained from previous algorithmic disambiguation solutions. We use the disambiguated data as a proxy for ground truth to test the performance of three types of initial-based disambiguation. Our results show that initial-based disambiguation can misrepresent statistical properties of coauthorship networks: It deflates the number of unique authors, number of components, average shortest paths, clustering coefficient, and assortativity, while it inflates average productivity, density, average coauthor number per author, and largest component size. Also, on average, more than half of the top 10 most productive or collaborative authors drop off the lists. Asian names were found to account for the majority of misidentification by initial-based disambiguation because of their common surnames and given-name initials.
KW - ambiguity
KW - bibliometrics
KW - network analysis
UR - https://www.scopus.com/pages/publications/84965096787
U2 - 10.1002/asi.23489
DO - 10.1002/asi.23489
M3 - Article
AN - SCOPUS:84965096787
SN - 2330-1635
VL - 67
SP - 1446
EP - 1461
JO - Journal of the Association for Information Science and Technology
JF - Journal of the Association for Information Science and Technology
IS - 6
ER -