TY - GEN
T1 - Doc2Cube
T2 - 18th IEEE International Conference on Data Mining, ICDM 2018
AU - Tao, Fangbo
AU - Zhang, Chao
AU - Chen, Xiusi
AU - Jiang, Meng
AU - Hanratty, Tim
AU - Kaplan, Lance
AU - Han, Jiawei
N1 - Publisher Copyright:
© 2018 IEEE.
PY - 2018/12/27
Y1 - 2018/12/27
N2 - Data cube is a cornerstone architecture in multidimensional analysis of structured datasets. It is highly desirable to conduct multidimensional analysis on text corpora with cube structures for various text-intensive applications in healthcare, business intelligence, and social media analysis. However, one bottleneck to constructing text cube is to automatically put millions of documents into the right cube cells so that quality multidimensional analysis can be conducted afterwards-it is too expensive to allocate documents manually or rely on massively labeled data. We propose Doc2Cube, a method that constructs a text cube from a given text corpus in an unsupervised way. Initially, only the label names (e.g., USA, China) of each dimension (e.g., location) are provided instead of any labeled data. Doc2Cube leverages label names as weak supervision signals and iteratively performs joint embedding of labels, terms, and documents to uncover their semantic similarities. To generate joint embeddings that are discriminative for cube construction, Doc2Cube learns dimension-tailored document representations by selectively focusing on terms that are highly label-indicative in each dimension. Furthermore, Doc2Cube alleviates label sparsity by propagating the information from label names to other terms and enriching the labeled term set. Our experiments on real data demonstrate the superiority of Doc2Cube over existing methods.
AB - Data cube is a cornerstone architecture in multidimensional analysis of structured datasets. It is highly desirable to conduct multidimensional analysis on text corpora with cube structures for various text-intensive applications in healthcare, business intelligence, and social media analysis. However, one bottleneck to constructing text cube is to automatically put millions of documents into the right cube cells so that quality multidimensional analysis can be conducted afterwards-it is too expensive to allocate documents manually or rely on massively labeled data. We propose Doc2Cube, a method that constructs a text cube from a given text corpus in an unsupervised way. Initially, only the label names (e.g., USA, China) of each dimension (e.g., location) are provided instead of any labeled data. Doc2Cube leverages label names as weak supervision signals and iteratively performs joint embedding of labels, terms, and documents to uncover their semantic similarities. To generate joint embeddings that are discriminative for cube construction, Doc2Cube learns dimension-tailored document representations by selectively focusing on terms that are highly label-indicative in each dimension. Furthermore, Doc2Cube alleviates label sparsity by propagating the information from label names to other terms and enriching the labeled term set. Our experiments on real data demonstrate the superiority of Doc2Cube over existing methods.
KW - Data cube
KW - Multidimensional analysis
KW - Text classification
UR - http://www.scopus.com/inward/record.url?scp=85061372693&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85061372693&partnerID=8YFLogxK
U2 - 10.1109/ICDM.2018.00169
DO - 10.1109/ICDM.2018.00169
M3 - Conference contribution
AN - SCOPUS:85061372693
T3 - Proceedings - IEEE International Conference on Data Mining, ICDM
SP - 1260
EP - 1265
BT - 2018 IEEE International Conference on Data Mining, ICDM 2018
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 17 November 2018 through 20 November 2018
ER -