The effect of data pre-processing on understanding the evolution of collaboration networks

Jinseok Kim, Jana Diesner

Research output: Contribution to journal › Article › peer-review


This paper shows empirically how the choice of certain data pre-processing methods for disambiguating author names affects our understanding of the structure and evolution of co-publication networks. Thirty years of publication records from 125 Information Systems journals were obtained from DBLP. Author names in the data were pre-processed via algorithmic disambiguation. We applied the commonly used all-initials and first-initial based disambiguation methods to the data, generated over-time networks with a yearly resolution, and calculated standard network metrics on these graphs. Our results show that initial-based methods underestimate the number of unique authors, average distance, and clustering coefficient, while overestimating the number of edges, average degree, and ratios of the largest components. These self-reinforcing growth and shrinkage mechanisms amplify over time. This can lead to false findings about fundamental network characteristics such as topology and reasoning about underlying social processes. It can also cause erroneous predictions of trends in future network evolution and suggest unjustified policies, interventions and funding decisions. The findings from this study suggest that scholars need to be more attentive to data pre-processing when analyzing or reusing bibliometric data.
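The two initial-based schemes named in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' code: the function names, the "Given [Middle] Surname" name format, the made-up author names in `TOY_PAPERS`, and the use of raw name strings as a stand-in baseline (the paper's baseline is algorithmic disambiguation, which is not reproduced here).

```python
from itertools import combinations

def split_name(name):
    """Split 'Given [Middle] Surname' into (given parts, surname). Assumed format."""
    parts = name.replace(".", " ").split()
    return parts[:-1], parts[-1]

def first_initial_key(name):
    """Surname plus the initial of the first given name only."""
    given, surname = split_name(name)
    initial = given[0][0].lower() if given else ""
    return f"{surname.lower()}_{initial}"

def all_initials_key(name):
    """Surname plus the initials of every given name."""
    given, surname = split_name(name)
    initials = "".join(g[0].lower() for g in given)
    return f"{surname.lower()}_{initials}"

def build_network(papers, key=lambda n: n):
    """Return (nodes, edges) of the co-authorship graph under a name-key function."""
    nodes, edges = set(), set()
    for authors in papers:
        ids = sorted({key(a) for a in authors})
        nodes.update(ids)
        # One undirected edge per pair of distinct co-authors on a paper.
        edges.update(combinations(ids, 2))
    return nodes, edges

def average_degree(nodes, edges):
    return 2 * len(edges) / len(nodes) if nodes else 0.0

# Hypothetical toy data: each inner list is the author list of one paper.
TOY_PAPERS = [
    ["Wei Zhang", "Bob Stone"],
    ["Wen Zhang", "Anna K. Lee"],
    ["Anna M. Lee", "Bob Stone"],
]

for label, key in [
    ("raw strings", lambda n: n),
    ("all-initials", all_initials_key),
    ("first-initial", first_initial_key),
]:
    nodes, edges = build_network(TOY_PAPERS, key)
    print(f"{label}: authors={len(nodes)} edges={len(edges)} "
          f"avg_degree={average_degree(nodes, edges):.2f}")
# raw strings: authors=5 edges=3 avg_degree=1.20
# all-initials: authors=4 edges=3 avg_degree=1.50
# first-initial: authors=3 edges=3 avg_degree=2.00
```

On this toy data the initial-based keys merge distinct name strings, so they report fewer authors and a higher average degree, the same direction the abstract reports for those two metrics. The data are made up and far too small to say anything about the other metrics or about change over time.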

Original language: English (US)
Pages (from-to): 226-236
Number of pages: 11
Journal: Journal of Informetrics
Issue number: 1
State: Published - Jan 1 2015


Keywords

  • Collaboration network
  • Disambiguation
  • Name ambiguity
  • Network evolution

ASJC Scopus subject areas

  • Computer Science Applications
  • Library and Information Sciences

