TY - JOUR
T1 - Using python for text analysis in accounting research
AU - Anand, Vic
AU - Bochkay, Khrystyna
AU - Chychyla, Roman
AU - Leone, Andrew
N1 - Publisher Copyright:
© 2020 Massachusetts Medical Society. All rights reserved.
PY - 2020
Y1 - 2020
AB - The prominence of textual data in accounting research has increased dramatically. To assist researchers in understanding and using textual data, this monograph defines and describes common measures of textual data and then demonstrates the collection and processing of textual data using the Python programming language. The monograph is replete with sample code that replicates textual analysis tasks from recent research papers. In the first part of the monograph, we provide guidance on getting started in Python. We first describe Anaconda, a distribution of Python that provides the requisite libraries for textual analysis, and its installation. We then introduce the Jupyter notebook, a programming environment that improves research workflows and promotes replicable research. Next, we teach the basics of Python programming and demonstrate the basics of working with tabular data in the Pandas package. The second part of the monograph focuses on specific textual analysis methods and techniques commonly used in accounting research. We first introduce regular expressions, a sophisticated language for finding patterns in text. We then show how to use regular expressions to extract specific parts from text. Next, we introduce the idea of transforming text data (unstructured data) into numerical measures representing variables of interest (structured data). Specifically, we introduce dictionary-based methods of (1) measuring document sentiment, (2) computing text complexity, (3) identifying forward-looking sentences and risk disclosures, (4) collecting informative numbers in text, and (5) computing the similarity of different pieces of text. For each of these tasks, we cite relevant papers and provide code snippets to implement the relevant metrics from these papers. Finally, the third part of the monograph focuses on automating the collection of textual data. We introduce web scraping and provide code for downloading filings from EDGAR.
UR - http://www.scopus.com/inward/record.url?scp=85097420952&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85097420952&partnerID=8YFLogxK
U2 - 10.1561/1400000062
DO - 10.1561/1400000062
M3 - Review article
AN - SCOPUS:85097420952
SN - 1554-0642
VL - 14
SP - 128
EP - 359
JO - Foundations and Trends in Accounting
JF - Foundations and Trends in Accounting
IS - 3-4
ER -