Identifying products in online cybercrime marketplaces: A Dataset for fine-grained domain adaptation

Greg Durrett, Jonathan K. Kummerfeld, Taylor Berg-Kirkpatrick, Rebecca S. Portnoff, Sadia Afroz, Damon McCoy, Kirill Levchenko, Vern Paxson

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

One weakness of machine-learned NLP models is that they typically perform poorly on out-of-domain data. In this work, we study the task of identifying products being bought and sold in online cybercrime forums, which exhibits particularly challenging cross-domain effects. We formulate a task that represents a hybrid of slot-filling information extraction and named entity recognition and annotate data from four different forums. Each of these forums constitutes its own “fine-grained domain” in that the forums cover different market sectors with different properties, even though all forums are in the broad domain of cybercrime. We characterize these domain differences in the context of a learning-based system: supervised models see decreased accuracy when applied to new forums, and standard techniques for semi-supervised learning and domain adaptation have limited effectiveness on this data, which suggests the need to improve these techniques. We release a dataset of 1,938 annotated posts from across the four forums.

Original language: English (US)
Title of host publication: EMNLP 2017 - Conference on Empirical Methods in Natural Language Processing, Proceedings
Publisher: Association for Computational Linguistics (ACL)
Pages: 2598-2607
Number of pages: 10
ISBN (Electronic): 9781945626838
DOIs: https://doi.org/10.18653/v1/d17-1275
State: Published - 2017
Externally published: Yes
Event: 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017 - Copenhagen, Denmark
Duration: Sep 9 2017 to Sep 11 2017

Publication series

Name: EMNLP 2017 - Conference on Empirical Methods in Natural Language Processing, Proceedings

Conference

Conference: 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017
Country: Denmark
City: Copenhagen
Period: 9/9/17 to 9/11/17

ASJC Scopus subject areas

  • Computer Science Applications
  • Information Systems
  • Computational Theory and Mathematics


Cite this

Durrett, G., Kummerfeld, J. K., Berg-Kirkpatrick, T., Portnoff, R. S., Afroz, S., McCoy, D., Levchenko, K., & Paxson, V. (2017). Identifying products in online cybercrime marketplaces: A Dataset for fine-grained domain adaptation. In EMNLP 2017 - Conference on Empirical Methods in Natural Language Processing, Proceedings (pp. 2598-2607). (EMNLP 2017 - Conference on Empirical Methods in Natural Language Processing, Proceedings). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/d17-1275