Segmentation of publication records of authors from the Web

Wei Zhang, Clement Yu, Neil Smalheiser, Vetle Torvik

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Publication records are often found on authors' personal home pages. If such a record is partitioned into a list of semantic fields (authors, title, date, etc.), the unstructured text can be converted into structured data for use in other applications. In this paper, we present PEPURS, a publication record segmentation system. It adopts a novel "Split and Merge" strategy: a publication record is split into segments; multiple statistical classifiers compute each segment's likelihood of belonging to different fields; finally, adjacent segments are merged if they belong to the same field. PEPURS introduces punctuation marks and their neighboring text as a new feature to distinguish the different roles the marks play. PEPURS yields high accuracy scores in experiments.
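The "Split and Merge" flow described in the abstract can be illustrated with a minimal sketch. This is not the PEPURS implementation: the splitting rule, the hand-written scores in `field_scores` (a stand-in for the paper's statistical classifiers), the `venue` label, and all function names are assumptions made only for illustration.

```python
import re
from itertools import groupby


def split_record(record):
    """Split step: cut the record at commas and periods (an assumed, naive rule)."""
    parts = re.split(r"[,.]\s*", record)
    return [p.strip() for p in parts if p.strip()]


def field_scores(segment):
    """Toy scores per field. A hypothetical stand-in for trained statistical classifiers."""
    words = segment.split()
    looks_like_name = len(words) <= 3 and all(w[0].isupper() for w in words)
    return {
        "date": 1.0 if re.fullmatch(r"(19|20)\d{2}", segment) else 0.0,
        "venue": 0.9 if segment.isupper() else 0.0,
        "author": 0.8 if looks_like_name else 0.0,
        "title": 0.5,  # fallback field when nothing else scores higher
    }


def classify(segment):
    """Pick the field with the highest score for this segment."""
    scores = field_scores(segment)
    return max(scores, key=scores.get)


def segment_record(record):
    """Split, label each segment, then merge adjacent segments sharing a label."""
    labeled = [(classify(seg), seg) for seg in split_record(record)]
    merged = []
    for label, group in groupby(labeled, key=lambda pair: pair[0]):
        merged.append((label, ", ".join(seg for _, seg in group)))
    return merged


record = (
    "Wei Zhang, Clement Yu. Segmentation of publication records "
    "of authors from the Web. ICDE, 2006."
)
fields = segment_record(record)
for label, text in fields:
    print(f"{label}: {text}")
```

In this sketch the comma between the two author names and the period after the title are both treated as plain split points. The abstract suggests PEPURS instead uses punctuation marks together with their neighboring text as a feature, so that marks playing different roles can be told apart; a naive rule like the one above would, for example, wrongly split at the period after an initial.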

Original language: English (US)
Title of host publication: Proceedings of the 22nd International Conference on Data Engineering, ICDE '06
Pages: 120
Number of pages: 1
State: Published - 2006
Externally published: Yes
Event: 22nd International Conference on Data Engineering, ICDE '06 - Atlanta, GA, United States
Duration: Apr 3 2006 → Apr 7 2006

Publication series

Name: Proceedings - International Conference on Data Engineering
Volume: 2006
ISSN (Print): 1084-4627

Other

Other: 22nd International Conference on Data Engineering, ICDE '06
Country/Territory: United States
City: Atlanta, GA
Period: 4/3/06 → 4/7/06

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Information Systems
