TY - JOUR
T1 - The effect of data pre-processing on understanding the evolution of collaboration networks
AU - Kim, Jinseok
AU - Diesner, Jana
N1 - This work was supported by KISTI (Korea Institute of Science and Technology Information), grant P14033. We would like to thank Jose Maria Cavero, Belen Vela, and Paloma Caceres for helping us to match IS journal names in DBLP and ISI Web of Science, and Natalie Lambert (University of Illinois at Urbana-Champaign) for editing the manuscript.
PY - 2015/1/1
Y1 - 2015/1/1
N2 - This paper shows empirically how the choice of certain data pre-processing methods for disambiguating author names affects our understanding of the structure and evolution of co-publication networks. Thirty years of publication records from 125 Information Systems journals were obtained from DBLP. Author names in the data were pre-processed via algorithmic disambiguation. We applied the commonly used all-initials and first-initial based disambiguation methods to the data, generated over-time networks with a yearly resolution, and calculated standard network metrics on these graphs. Our results show that initial-based methods underestimate the number of unique authors, average distance, and clustering coefficient, while overestimating the number of edges, average degree, and ratios of the largest components. These self-reinforcing growth and shrinkage mechanisms amplify over time. This can lead to false findings about fundamental network characteristics such as topology and reasoning about underlying social processes. It can also cause erroneous predictions of trends in future network evolution and suggest unjustified policies, interventions and funding decisions. The findings from this study suggest that scholars need to be more attentive to data pre-processing when analyzing or reusing bibliometric data.
AB - This paper shows empirically how the choice of certain data pre-processing methods for disambiguating author names affects our understanding of the structure and evolution of co-publication networks. Thirty years of publication records from 125 Information Systems journals were obtained from DBLP. Author names in the data were pre-processed via algorithmic disambiguation. We applied the commonly used all-initials and first-initial based disambiguation methods to the data, generated over-time networks with a yearly resolution, and calculated standard network metrics on these graphs. Our results show that initial-based methods underestimate the number of unique authors, average distance, and clustering coefficient, while overestimating the number of edges, average degree, and ratios of the largest components. These self-reinforcing growth and shrinkage mechanisms amplify over time. This can lead to false findings about fundamental network characteristics such as topology and reasoning about underlying social processes. It can also cause erroneous predictions of trends in future network evolution and suggest unjustified policies, interventions and funding decisions. The findings from this study suggest that scholars need to be more attentive to data pre-processing when analyzing or reusing bibliometric data.
KW - Collaboration network
KW - Disambiguation
KW - Name ambiguity
KW - Network evolution
UR - http://www.scopus.com/inward/record.url?scp=84921297361&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84921297361&partnerID=8YFLogxK
U2 - 10.1016/j.joi.2015.01.002
DO - 10.1016/j.joi.2015.01.002
M3 - Article
AN - SCOPUS:84921297361
SN - 1751-1577
VL - 9
SP - 226
EP - 236
JO - Journal of Informetrics
JF - Journal of Informetrics
IS - 1
ER -