TY - GEN
T1 - Rescuing lost history
T2 - Conference on Diversity, Big Data, and Science at Scale, XSEDE 2016
AU - Mendenhall, Ruby
AU - Van Moer, Mark
AU - McKee, Malaika
AU - Brown, Nicole
AU - Lourentzou, Ismini
AU - Zerai, Assata
AU - Black, Michael L.
AU - Flynn, Karen
PY - 2016/7/17
Y1 - 2016/7/17
N2 - This study employs Latent Dirichlet allocation (LDA) algorithms and comparative text mining to search 800,000 periodicals in JSTOR (Journal Storage) and HathiTrust from 1746 to 2014 to identify the types of conversations that emerge about Black women's shared experience over time and the resulting knowledge that developed, called standpoint. We used MALLET to interrogate various genres of text (poetry, science, psychology, sociology, African American Studies, policy, etc.). We also used comparative text mining (CTM) to explore latent themes across collections written in different time periods by analyzing the common and expert models. We used data visualization techniques, such as tree maps, to identify spikes in certain topics during various historical contexts such as slavery, Reconstruction, Jim Crow, etc. We identified a subset of our corpus (20,000) comprised of articles about or by Black women and compared patterns of words in the subset against the larger 800,000 corpus. Preliminary findings indicate that when we pulled 300,000 volumes, about 80,000 (~27%) do not have subject metadata. This appears to suggest that if a researcher searched for volumes about Black women, they may not have access to a significant amount of data on the topic. When volumes are not tagged properly, researchers would have to know that these texts exist when they do their searches. The recovery nature of this project involves identifying these untagged volumes and making the corpus publicly available to librarians and others with copyright considerations.
AB - This study employs Latent Dirichlet allocation (LDA) algorithms and comparative text mining to search 800,000 periodicals in JSTOR (Journal Storage) and HathiTrust from 1746 to 2014 to identify the types of conversations that emerge about Black women's shared experience over time and the resulting knowledge that developed, called standpoint. We used MALLET to interrogate various genres of text (poetry, science, psychology, sociology, African American Studies, policy, etc.). We also used comparative text mining (CTM) to explore latent themes across collections written in different time periods by analyzing the common and expert models. We used data visualization techniques, such as tree maps, to identify spikes in certain topics during various historical contexts such as slavery, Reconstruction, Jim Crow, etc. We identified a subset of our corpus (20,000) comprised of articles about or by Black women and compared patterns of words in the subset against the larger 800,000 corpus. Preliminary findings indicate that when we pulled 300,000 volumes, about 80,000 (~27%) do not have subject metadata. This appears to suggest that if a researcher searched for volumes about Black women, they may not have access to a significant amount of data on the topic. When volumes are not tagged properly, researchers would have to know that these texts exist when they do their searches. The recovery nature of this project involves identifying these untagged volumes and making the corpus publicly available to librarians and others with copyright considerations.
KW - Black women
KW - Comparative text mining
KW - Intermediate reading
KW - Standpoint theory
KW - Topic modeling
UR - http://www.scopus.com/inward/record.url?scp=84989204302&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84989204302&partnerID=8YFLogxK
U2 - 10.1145/2949550.2949642
DO - 10.1145/2949550.2949642
M3 - Conference contribution
AN - SCOPUS:84989204302
T3 - ACM International Conference Proceeding Series
BT - Proceedings of XSEDE 2016
PB - Association for Computing Machinery
Y2 - 17 July 2016 through 21 July 2016
ER -