An incremental graph-partitioning algorithm for entity resolution

Gregory Tauer, Ketan Date, Rakesh Nagi, Moises Sudit

Research output: Contribution to journalArticle

Abstract

Entity resolution is an important data association task when fusing information from multiple sources. Oftentimes the information arrives continuously and the entity resolution algorithm needs to efficiently update its solution upon receiving new information. In this work, we introduce an incremental entity resolution algorithm based on a graph partitioning formulation. The developed algorithm is able to handle both incrementally arriving entity references, as well as incrementally arriving information which changes the pairwise similarity scores between the references. New information is handled in a way that allows the algorithm to reconsider past decisions when contradicting information arrives. Because the graph partitioning formulation used is NP-Hard, a heuristic algorithm is developed to produce good solutions, which is also compatible with a blocking technique to limit the number of required comparisons. The algorithm is tested on a variety of datasets (randomly generated and real) and it is shown that allowing the algorithm to consider revised scores and revisit prior decisions offers a substantial improvement to accuracy (approximately 30–40% better F-Score on a natural language dataset), compared to other greedy heuristics on the same set of coefficients. It is also shown that, on a test set with 100 references, the incremental algorithm is up to an order of magnitude faster than a batch algorithm approach that re-solves the entire problem.

LanguageEnglish (US)
Pages171-183
Number of pages13
JournalInformation Fusion
Volume46
DOIs
StatePublished - Mar 1 2019

Fingerprint

Heuristic algorithms

Keywords

  • Data association
  • Entity resolution
  • Graph partitioning
  • Incremental algorithm

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Information Systems
  • Hardware and Architecture

Cite this

An incremental graph-partitioning algorithm for entity resolution. / Tauer, Gregory; Date, Ketan; Nagi, Rakesh; Sudit, Moises.

In: Information Fusion, Vol. 46, 01.03.2019, p. 171-183.

Research output: Contribution to journalArticle

Tauer, Gregory ; Date, Ketan ; Nagi, Rakesh ; Sudit, Moises. / An incremental graph-partitioning algorithm for entity resolution. In: Information Fusion. 2019 ; Vol. 46. pp. 171-183.
@article{d6853906284142b2a8b13645d96e4a16,
title = "An incremental graph-partitioning algorithm for entity resolution",
abstract = "Entity resolution is an important data association task when fusing information from multiple sources. Oftentimes the information arrives continuously and the entity resolution algorithm needs to efficiently update its solution upon receiving new information. In this work, we introduce an incremental entity resolution algorithm based on a graph partitioning formulation. The developed algorithm is able to handle both incrementally arriving entity references, as well as incrementally arriving information which changes the pairwise similarity scores between the references. New information is handled in a way that allows the algorithm to reconsider past decisions when contradicting information arrives. Because the graph partitioning formulation used is NP-Hard, a heuristic algorithm is developed to produce good solutions, which is also compatible with a blocking technique to limit the number of required comparisons. The algorithm is tested on a variety of datasets (randomly generated and real) and it is shown that allowing the algorithm to consider revised scores and revisit prior decisions offers a substantial improvement to accuracy (approximately 30–40{\%} better F-Score on a natural language dataset), compared to other greedy heuristics on the same set of coefficients. It is also shown that, on a test set with 100 references, the incremental algorithm is up to an order of magnitude faster than a batch algorithm approach that re-solves the entire problem.",
keywords = "Data association, Entity resolution, Graph partitioning, Incremental algorithm",
author = "Gregory Tauer and Ketan Date and Rakesh Nagi and Moises Sudit",
year = "2019",
month = "3",
day = "1",
doi = "10.1016/j.inffus.2018.06.001",
language = "English (US)",
volume = "46",
pages = "171--183",
journal = "Information Fusion",
issn = "1566-2535",
publisher = "Elsevier",

}

TY - JOUR

T1 - An incremental graph-partitioning algorithm for entity resolution

AU - Tauer, Gregory

AU - Date, Ketan

AU - Nagi, Rakesh

AU - Sudit, Moises

PY - 2019/3/1

Y1 - 2019/3/1

N2 - Entity resolution is an important data association task when fusing information from multiple sources. Oftentimes the information arrives continuously and the entity resolution algorithm needs to efficiently update its solution upon receiving new information. In this work, we introduce an incremental entity resolution algorithm based on a graph partitioning formulation. The developed algorithm is able to handle both incrementally arriving entity references, as well as incrementally arriving information which changes the pairwise similarity scores between the references. New information is handled in a way that allows the algorithm to reconsider past decisions when contradicting information arrives. Because the graph partitioning formulation used is NP-Hard, a heuristic algorithm is developed to produce good solutions, which is also compatible with a blocking technique to limit the number of required comparisons. The algorithm is tested on a variety of datasets (randomly generated and real) and it is shown that allowing the algorithm to consider revised scores and revisit prior decisions offers a substantial improvement to accuracy (approximately 30–40% better F-Score on a natural language dataset), compared to other greedy heuristics on the same set of coefficients. It is also shown that, on a test set with 100 references, the incremental algorithm is up to an order of magnitude faster than a batch algorithm approach that re-solves the entire problem.

AB - Entity resolution is an important data association task when fusing information from multiple sources. Oftentimes the information arrives continuously and the entity resolution algorithm needs to efficiently update its solution upon receiving new information. In this work, we introduce an incremental entity resolution algorithm based on a graph partitioning formulation. The developed algorithm is able to handle both incrementally arriving entity references, as well as incrementally arriving information which changes the pairwise similarity scores between the references. New information is handled in a way that allows the algorithm to reconsider past decisions when contradicting information arrives. Because the graph partitioning formulation used is NP-Hard, a heuristic algorithm is developed to produce good solutions, which is also compatible with a blocking technique to limit the number of required comparisons. The algorithm is tested on a variety of datasets (randomly generated and real) and it is shown that allowing the algorithm to consider revised scores and revisit prior decisions offers a substantial improvement to accuracy (approximately 30–40% better F-Score on a natural language dataset), compared to other greedy heuristics on the same set of coefficients. It is also shown that, on a test set with 100 references, the incremental algorithm is up to an order of magnitude faster than a batch algorithm approach that re-solves the entire problem.

KW - Data association

KW - Entity resolution

KW - Graph partitioning

KW - Incremental algorithm

UR - http://www.scopus.com/inward/record.url?scp=85049037223&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85049037223&partnerID=8YFLogxK

U2 - 10.1016/j.inffus.2018.06.001

DO - 10.1016/j.inffus.2018.06.001

M3 - Article

VL - 46

SP - 171

EP - 183

JO - Information Fusion

T2 - Information Fusion

JF - Information Fusion

SN - 1566-2535

ER -