TY - JOUR
T1 - Cloud-Based Phrase Mining and Analysis of User-Defined Phrase-Category Association in Biomedical Publications
AU - Sigdel, Dibakar
AU - Kyi, Vincent
AU - Zhang, Aiden
AU - Setty, Shaun P.
AU - Liem, David A.
AU - Shi, Yu
AU - Wang, Xuan
AU - Shen, Jiaming
AU - Wang, Wei
AU - Han, Jiawei
AU - Ping, Peipei
N1 - Publisher Copyright: © 2019 Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.
PY - 2019/2
Y1 - 2019/2
AB - The rapid accumulation of biomedical textual data has far exceeded the human capacity of manual curation and analysis, necessitating novel text-mining tools to extract biological insights from large volumes of scientific reports. The Context-aware Semantic Online Analytical Processing (CaseOLAP) pipeline, developed in 2016, successfully quantifies user-defined phrase-category relationships through the analysis of textual data. CaseOLAP has many biomedical applications. We have developed a protocol for a cloud-based environment supporting the end-to-end phrase-mining and analysis platform. Our protocol includes data preprocessing (e.g., downloading, extraction, and parsing text documents), indexing and searching with Elasticsearch, creating a functional document structure called Text-Cube, and quantifying phrase-category relationships using the core CaseOLAP algorithm. Our data preprocessing generates key-value mappings for all documents involved. The preprocessed data is indexed to carry out a search of documents including entities, which further facilitates the Text-Cube creation and CaseOLAP score calculation. The obtained raw CaseOLAP scores are interpreted using a series of integrative analyses, including dimensionality reduction, clustering, temporal, and geographical analyses. Additionally, the CaseOLAP scores are used to create a graphical database, which enables semantic mapping of the documents. CaseOLAP defines phrase-category relationships in an accurate (identifies relationships), consistent (highly reproducible), and efficient (processes 100,000 words/sec) manner. Following this protocol, users can access a cloud-computing environment to support their own configurations and applications of CaseOLAP. This platform offers enhanced accessibility and empowers the biomedical community with phrase-mining tools for widespread biomedical research applications.
KW - Issue 144
KW - Medicine
KW - cloud computing
KW - data science
KW - medical informatics
KW - phrase mining
KW - text mining
UR - http://www.scopus.com/inward/record.url?scp=85062699925&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85062699925&partnerID=8YFLogxK
U2 - 10.3791/59108
DO - 10.3791/59108
M3 - Article
C2 - 30855564
AN - SCOPUS:85062699925
SN - 1940-087X
VL - 2019
JO - Journal of Visualized Experiments
JF - Journal of Visualized Experiments
IS - 144
M1 - e59108
ER -