TY - GEN
T1 - Semantic-based text classification of environmental regulatory documents for supporting automated environmental compliance checking in construction
AU - Zhou, Peng
AU - El-Gohary, Nora
PY - 2014
Y1 - 2014
AB - Automated environmental compliance checking requires automated extraction of rules from environmental regulatory textual documents, such as energy conservation codes and U.S. Environmental Protection Agency (EPA) regulations. Automated rule extraction requires complex text processing and analysis for information extraction and subsequent formalization of the extracted information into computer-processable rules. In our automated compliance checking (ACC) approach, we first classify the text into predefined categories to filter out irrelevant text, thereby improving further semantic information extraction and compliance reasoning efficiency. The categories used are predefined in a semantic text classification (TC) topic hierarchy. In this paper, we present our machine-learning-based TC algorithm for classifying clauses in environmental regulatory documents based on the TC topic hierarchy. In developing our TC algorithm, different text preprocessing techniques, machine learning algorithms, and performance improvement strategies were tested and evaluated. Our final TC algorithm was tested on 10 regulatory documents, such as the 2012 International Energy Conservation Code, and evaluated in terms of precision and recall. The algorithm achieved around 96% and 85% recall and precision, respectively, on the testing data.
UR - http://www.scopus.com/inward/record.url?scp=84904707526&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84904707526&partnerID=8YFLogxK
U2 - 10.1061/9780784413517.0092
DO - 10.1061/9780784413517.0092
M3 - Conference contribution
AN - SCOPUS:84904707526
SN - 9780784413517
T3 - Construction Research Congress 2014: Construction in a Global Network - Proceedings of the 2014 Construction Research Congress
SP - 897
EP - 906
BT - Construction Research Congress 2014
PB - American Society of Civil Engineers (ASCE)
T2 - 2014 Construction Research Congress: Construction in a Global Network, CRC 2014
Y2 - 19 May 2014 through 21 May 2014
ER -