Segmentation of publication records of authors from the Web

Wei Zhang, Clement Yu, Neil Smalheiser, Vetle Ingvald Torvik

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Publication records are often found in the authors' personal home pages. If such a record is partitioned into a list of semantic fields of authors, title, date, etc., the unstructured texts can be converted into structured data, which can be used in other applications. In this paper, we present PEPURS, a publication record segmentation system. It adopts a novel "Split and Merge" strategy. A publication record is split into segments; multiple statistical classifiers compute their likelihoods of belonging to different fields; finally, adjacent segments are merged if they belong to the same field. PEPURS introduces the punctuation marks and their neighboring texts as a new feature to distinguish different roles of the marks. PEPURS yields high accuracy scores in experiments.
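
The "Split and Merge" strategy described in the abstract can be illustrated with a minimal Python sketch. It is not the PEPURS implementation: the abstract does not give the classifiers or features, so the hand-written scoring rules, the field names, the `VENUE_WORDS` set, and every function name below are assumptions made for illustration only. The rules stand in for the statistical classifiers, and the punctuation-context feature is not modeled.

```python
import re
from itertools import groupby

# Hypothetical keyword list; a real system would learn venue cues from data.
VENUE_WORDS = {"proceedings", "conference", "journal", "icde"}

def split_record(record):
    """Split step: cut after each comma or period that is followed by whitespace."""
    pieces = re.split(r"(?<=[.,])\s+", record.strip())
    return [p.rstrip(".,") for p in pieces if p.rstrip(".,")]

def score_fields(segment):
    """Toy stand-in for the statistical classifiers: one score per field."""
    words = segment.split()
    return {
        "authors": 1.0 if re.fullmatch(r"[A-Z][a-z]+ [A-Z]{1,3}", segment) else 0.0,
        "date": 1.0 if re.fullmatch(r"(19|20)\d{2}", segment) else 0.0,
        "venue": 0.9 if any(w.lower() in VENUE_WORDS for w in words) else 0.0,
        "title": 0.5 if len(words) >= 3 else 0.1,
    }

def classify(segment):
    """Pick the field with the highest score for this segment."""
    scores = score_fields(segment)
    return max(scores, key=scores.get)

def merge_segments(labelled):
    """Merge step: join runs of adjacent segments that share a field label."""
    merged = []
    for field, group in groupby(labelled, key=lambda pair: pair[0]):
        merged.append((field, ", ".join(text for _, text in group)))
    return merged

def segment_record(record):
    labelled = [(classify(s), s) for s in split_record(record)]
    return merge_segments(labelled)

record = ("Zhang W, Yu C, Smalheiser N, Torvik VI. Segmentation of publication "
          "records of authors from the Web. ICDE, 2006.")
for field, text in segment_record(record):
    print(f"{field}: {text}")
# authors: Zhang W, Yu C, Smalheiser N, Torvik VI
# title: Segmentation of publication records of authors from the Web
# venue: ICDE
# date: 2006
```

In this example the commas between author names and the comma before the year play different roles, yet the split step treats them all alike. The merge step repairs the over-splitting by rejoining the four author segments into a single field. The sketch rejoins merged segments with a comma, so the original separators inside a field are not preserved.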

Original language: English (US)
Title of host publication: Proceedings of the 22nd International Conference on Data Engineering, ICDE '06
Number of pages: 1
DOI: https://doi.org/10.1109/ICDE.2006.137
State: Published - Oct 17 2006
Event: 22nd International Conference on Data Engineering, ICDE '06 - Atlanta, GA, United States
Duration: Apr 3 2006 → Apr 7 2006

Publication series

Name: Proceedings - International Conference on Data Engineering
Volume: 2006
ISSN (Print): 1084-4627

Other

Other: 22nd International Conference on Data Engineering, ICDE '06
Country: United States
City: Atlanta, GA
Period: 4/3/06 → 4/7/06

Fingerprint

Websites
Classifiers
Semantics
Experiments

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Information Systems

Cite this

Zhang, W., Yu, C., Smalheiser, N., & Torvik, V. I. (2006). Segmentation of publication records of authors from the Web. In Proceedings of the 22nd International Conference on Data Engineering, ICDE '06 [1617488] (Proceedings - International Conference on Data Engineering; Vol. 2006). https://doi.org/10.1109/ICDE.2006.137

@inproceedings{293f61a2342544d39e4ad4cc453cda44,
title = "Segmentation of publication records of authors from the Web",
abstract = "Publication records are often found in the authors' personal home pages. If such a record is partitioned into a list of semantic fields of authors, title, date, etc., the unstructured texts can be converted into structured data, which can be used in other applications. In this paper, we present PEPURS, a publication record segmentation system. It adopts a novel {"}Split and Merge{"} strategy. A publication record is split into segments; multiple statistical classifiers compute their likelihoods of belonging to different fields; finally, adjacent segments are merged if they belong to the same field. PEPURS introduces the punctuation marks and their neighboring texts as a new feature to distinguish different roles of the marks. PEPURS yields high accuracy scores in experiments.",
author = "Zhang, Wei and Yu, Clement and Smalheiser, Neil and Torvik, {Vetle Ingvald}",
year = "2006",
month = "10",
day = "17",
doi = "10.1109/ICDE.2006.137",
language = "English (US)",
isbn = "0769525709",
series = "Proceedings - International Conference on Data Engineering",
booktitle = "Proceedings of the 22nd International Conference on Data Engineering, ICDE '06",
}

TY - GEN
T1 - Segmentation of publication records of authors from the Web
AU - Zhang, Wei
AU - Yu, Clement
AU - Smalheiser, Neil
AU - Torvik, Vetle Ingvald
PY - 2006/10/17
Y1 - 2006/10/17
AB - Publication records are often found in the authors' personal home pages. If such a record is partitioned into a list of semantic fields of authors, title, date, etc., the unstructured texts can be converted into structured data, which can be used in other applications. In this paper, we present PEPURS, a publication record segmentation system. It adopts a novel "Split and Merge" strategy. A publication record is split into segments; multiple statistical classifiers compute their likelihoods of belonging to different fields; finally, adjacent segments are merged if they belong to the same field. PEPURS introduces the punctuation marks and their neighboring texts as a new feature to distinguish different roles of the marks. PEPURS yields high accuracy scores in experiments.
UR - http://www.scopus.com/inward/record.url?scp=33749625896&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33749625896&partnerID=8YFLogxK
U2 - 10.1109/ICDE.2006.137
DO - 10.1109/ICDE.2006.137
M3 - Conference contribution
AN - SCOPUS:33749625896
SN - 0769525709
SN - 9780769525709
T3 - Proceedings - International Conference on Data Engineering
BT - Proceedings of the 22nd International Conference on Data Engineering, ICDE '06
ER -