TY - GEN
T1 - Constructing structured information networks from massive text corpora
AU - Ren, Xiang
AU - Jiang, Meng
AU - Shang, Jingbo
AU - Han, Jiawei
N1 - Research was sponsored in part by the U.S. Army Research Lab. under Cooperative Agreement No. W911NF-09-2-0053 (NSCTA), National Science Foundation IIS-1320617 and IIS 16-18481, and grant 1U54GM114838 awarded by NIGMS through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative (www.bd2k.nih.gov). The views and conclusions contained in this document are those of the author(s) and should not be interpreted as representing the official policies of the U.S. Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.
Conference tutorial: X. Ren, A. El-Kishky, C. Wang, and J. Han, “Automatic Entity Recognition and Typing from Massive Text Corpora: A Phrase and Network Mining Approach” (SIGKDD’15). http://research.microsoft.com/en-us/people/chiw/kdd15tutorial.aspx.
• Meng Jiang, Postdoctoral Research Associate, Department of Computer Science, Univ. of Illinois at Urbana-Champaign. His research focuses on behavioral modeling and social media analysis. He received his Ph.D. in Computer Science from Tsinghua University, Beijing, in 2015. His Ph.D. thesis won the Dissertation Award at Tsinghua. His recent research was a SIGKDD 2014 Best Paper Finalist. His ICDM 2015 tutorial received an honorarium.
• Jingbo Shang, Ph.D. candidate, Department of Computer Science, Univ. of Illinois at Urbana-Champaign. His research focuses on mining and constructing structured knowledge from massive text corpora. He is a recipient of the Computer Science Excellence Scholarship and the Grand Prize of the Yelp Dataset Challenge in 2015.
• Jiawei Han, Abel Bliss Professor, Department of Computer Science, Univ. of Illinois at Urbana-Champaign. His research areas encompass data mining, data warehousing, and information network analysis, with over 600 conference and journal publications. He is a Fellow of ACM, a Fellow of IEEE, the Director of IPAN, supported by the Network Science Collaborative Technology Alliance program of the U.S. Army Research Lab, and the Director of KnowEnG: a Knowledge Engine for Genomics, one of the NIH-supported Big Data to Knowledge (BD2K) Centers.
PY - 2017
Y1 - 2017
N2 - In today's computerized and information-based society, text data is rich but messy. People are inundated with vast amounts of natural-language text data, ranging from news articles, social media posts, and advertisements to a wide range of textual information from various domains (medical records, corporate reports). To turn such massive unstructured text data into actionable knowledge, one of the grand challenges is to gain an understanding of the factual information (e.g., entities, attributes, relations, events) in the text. In this tutorial, we introduce data-driven methods to construct structured information networks (where nodes are different types of entities with attached attributes, and edges are different relations between entities) for text corpora of different kinds (especially for massive, domain-specific text corpora) to represent their factual information. We focus on methods that are minimally supervised, domain-independent, and language-independent for fast network construction across various application domains (news, web, biomedical, reviews). We demonstrate on real datasets, including news articles, scientific publications, tweets, and reviews, how these constructed networks aid in text analytics and knowledge discovery at a large scale.
AB - In today's computerized and information-based society, text data is rich but messy. People are inundated with vast amounts of natural-language text data, ranging from news articles, social media posts, and advertisements to a wide range of textual information from various domains (medical records, corporate reports). To turn such massive unstructured text data into actionable knowledge, one of the grand challenges is to gain an understanding of the factual information (e.g., entities, attributes, relations, events) in the text. In this tutorial, we introduce data-driven methods to construct structured information networks (where nodes are different types of entities with attached attributes, and edges are different relations between entities) for text corpora of different kinds (especially for massive, domain-specific text corpora) to represent their factual information. We focus on methods that are minimally supervised, domain-independent, and language-independent for fast network construction across various application domains (news, web, biomedical, reviews). We demonstrate on real datasets, including news articles, scientific publications, tweets, and reviews, how these constructed networks aid in text analytics and knowledge discovery at a large scale.
KW - Attribute Discovery
KW - Entity Recognition and Typing
KW - Massive Text Corpora
KW - Quality Phrase Mining
KW - Relation Extraction
UR - http://www.scopus.com/inward/record.url?scp=85051486166&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85051486166&partnerID=8YFLogxK
U2 - 10.1145/3041021.3051107
DO - 10.1145/3041021.3051107
M3 - Conference contribution
AN - SCOPUS:85051486166
T3 - 26th International World Wide Web Conference 2017, WWW 2017 Companion
SP - 951
EP - 954
BT - 26th International World Wide Web Conference 2017, WWW 2017 Companion
PB - International World Wide Web Conferences Steering Committee
T2 - 26th International World Wide Web Conference, WWW 2017 Companion
Y2 - 3 April 2017 through 7 April 2017
ER -