Identifying semantically deviating outlier documents

Honglei Zhuang, Chi Wang, Fangbo Tao, Lance Kaplan, Jiawei Han

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

A document outlier is a document that deviates substantially in semantics from the majority of documents in a corpus. Automatic identification of document outliers can be valuable in many applications, such as screening health records for medical mistakes. In this paper, we study the problem of mining semantically deviating document outliers in a given corpus. We develop a generative model that identifies frequent and characteristic semantic regions in the word embedding space to represent the given corpus, and a robust outlierness measure that is resistant to noisy content in documents. Experiments conducted on two real-world textual data sets show that our method achieves up to a 135% improvement over baselines in terms of recall at the top 1% of the outlier ranking.
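The evaluation metric named in the abstract, recall at the top 1% of the outlier ranking, can be illustrated with a minimal sketch. The function names, the rounding rule for the cutoff (ceiling, with at least one document), and the toy inputs below are assumptions made for illustration; they are not taken from the paper, and the sketch does not implement the paper's generative model or outlierness measure.

```python
import math


def recall_at_top_percent(scores, is_outlier, percent=1.0):
    """Share of true outliers that appear in the top `percent`% of the ranking.

    scores     -- outlierness score per document (higher = more outlying)
    is_outlier -- ground-truth flag per document
    percent    -- size of the inspected prefix, as a percentage of the corpus
    """
    if len(scores) != len(is_outlier):
        raise ValueError("scores and is_outlier must have the same length")
    total_outliers = sum(1 for flag in is_outlier if flag)
    if total_outliers == 0:
        raise ValueError("recall is undefined without any true outliers")

    # Assumed cutoff rule: round up, and always inspect at least one document.
    k = max(1, math.ceil(len(scores) * percent / 100.0))

    # Rank document indices by descending outlierness score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(1 for i in ranked[:k] if is_outlier[i])
    return hits / total_outliers


def relative_improvement(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Toy corpus of 200 documents with 4 true outliers (made-up values).
    toy_scores = [float(i) for i in range(200)]
    toy_labels = [i in {5, 6, 198, 199} for i in range(200)]
    print(recall_at_top_percent(toy_scores, toy_labels, percent=1.0))
```

Under this reading, an "up to 135% improvement" would mean that the best reported recall is up to 2.35 times the baseline's recall at the same cutoff.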

Original language: English (US)
Title of host publication: EMNLP 2017 - Conference on Empirical Methods in Natural Language Processing, Proceedings
Publisher: Association for Computational Linguistics (ACL)
Pages: 2748-2757
Number of pages: 10
ISBN (Electronic): 9781945626838
State: Published - Jan 1 2017
Event: 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017 - Copenhagen, Denmark
Duration: Sep 9 2017 to Sep 11 2017

Publication series

Name: EMNLP 2017 - Conference on Empirical Methods in Natural Language Processing, Proceedings

Conference

Conference: 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017
Country: Denmark
City: Copenhagen
Period: 9/9/17 to 9/11/17

Fingerprint

Semantics
Screening
Health
Experiments

ASJC Scopus subject areas

  • Computer Science Applications
  • Information Systems
  • Computational Theory and Mathematics

Cite this

Zhuang, H., Wang, C., Tao, F., Kaplan, L., & Han, J. (2017). Identifying semantically deviating outlier documents. In EMNLP 2017 - Conference on Empirical Methods in Natural Language Processing, Proceedings (pp. 2748-2757). (EMNLP 2017 - Conference on Empirical Methods in Natural Language Processing, Proceedings). Association for Computational Linguistics (ACL).
