Information Extraction from Social Media: A Hands-on Tutorial on Tasks, Data, and Open Source Tools

Shubhanshu Mishra, Rezvaneh Rezapour, Jana Diesner

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Information extraction (IE) is a common sub-area of natural language processing that focuses on identifying structured data from unstructured data. One application domain of IE is Information Retrieval (IR), which relies on accurate and high-performance IE to retrieve high quality results from massive datasets. Another example of IE is to identify named entities in a text. For example, in the sentence "Katy Perry lives in the USA", Katy Perry and USA are named entities of types PERSON and LOCATION, respectively. Identifying the sentiment expressed in a text is another instance of IE: in the sentence, "This movie was awesome", the expressed sentiment is positive. Finally, IE is concerned with identifying various linguistic aspects of text data, e.g., part of speech of words, noun phrases, dependency parses, etc., which can serve as features for additional IE tasks. This tutorial introduces participants to a) the usage of Python-based, open-source tools that support IE from social media data (mainly Twitter), and b) best practices for ensuring the responsible use of IE and research data. Participants will learn and practice various lexical, semantic, and syntactic IE techniques that are commonly used for analyzing tweets. Participants will also be familiarized with the landscape of publicly available social media data (including popular NLP and IE benchmarks) and methods for collecting and preparing them for analysis. Furthermore, participants will be trained to use a suite of open source tools (SAIL for active learning, TwitterNER for named entity recognition, TweetNLP for transformer-based NLP, and SocialMediaIE for multi-task learning), which utilize advanced machine learning techniques (e.g., deep learning, active learning with human-in-the-loop, multi-lingual, and multi-task learning) to perform IE on their own or existing datasets.
Participants will also learn how social contexts of text production and usage of results can be integrated into IE systems to improve these systems and to consider the role of time in improving social media IE quality. Finally, participants will learn about the governance of social media data for research purposes. The tools introduced in the tutorial will focus on the three main stages of IE, namely, collection of data (including annotation), data processing and analytics, and visualization of the extracted information. More details can be found at: https://socialmediaie.github.io/tutorials/
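
The two examples in the abstract (named entities and sentiment) can be made concrete with a minimal sketch in plain Python of turning unstructured text into structured records. This is a toy illustration only: the gazetteer, the sentiment lexicon, and the function names are hypothetical and are not taken from the tutorial or from any of the tools it names (SAIL, TwitterNER, TweetNLP, SocialMediaIE), which rely on learned models rather than lookups.

```python
import re

# Hypothetical toy resources, invented for this illustration only.
GAZETTEER = {"Katy Perry": "PERSON", "USA": "LOCATION"}
SENTIMENT_LEXICON = {"awesome": "positive", "terrible": "negative"}


def extract_entities(text):
    """Return (surface form, type, start offset) for each gazetteer match."""
    entities = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            entities.append((match.group(0), label, match.start()))
    # Sort by position so the output follows reading order.
    return sorted(entities, key=lambda e: e[2])


def classify_sentiment(text):
    """Return the label of the first lexicon word found, else 'neutral'."""
    for token in re.findall(r"[a-z']+", text.lower()):
        if token in SENTIMENT_LEXICON:
            return SENTIMENT_LEXICON[token]
    return "neutral"


print(extract_entities("Katy Perry lives in the USA"))
# [('Katy Perry', 'PERSON', 0), ('USA', 'LOCATION', 24)]
print(classify_sentiment("This movie was awesome"))
# positive
```

A lookup like this fails on misspellings, hashtags, and unseen names, which are common in tweets; that gap is presumably one reason the tutorial turns to machine learning based tools.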

Original language: English (US)
Title of host publication: CIKM 2022 - Proceedings of the 31st ACM International Conference on Information and Knowledge Management
Publisher: Association for Computing Machinery
Pages: 5148-5151
Number of pages: 4
ISBN (Electronic): 9781450392365
DOIs
State: Published - Oct 17 2022
Event: 31st ACM International Conference on Information and Knowledge Management, CIKM 2022 - Atlanta, United States
Duration: Oct 17 2022 to Oct 21 2022

Publication series

Name: International Conference on Information and Knowledge Management, Proceedings

Conference

Conference: 31st ACM International Conference on Information and Knowledge Management, CIKM 2022
Country/Territory: United States
City: Atlanta
Period: 10/17/22 to 10/21/22

Keywords

  • chunking
  • data governance
  • deep learning
  • information extraction
  • machine learning
  • machine learning bias
  • multitask learning
  • named entity recognition
  • natural language processing
  • open data
  • open source tool
  • part of speech tagging
  • social media
  • supersense tagging
  • text classification
  • twitter

ASJC Scopus subject areas

  • General Business, Management and Accounting
  • General Decision Sciences
