TY - GEN
T1 - Two-Stage Graph-Augmented Summarization of Scientific Documents
AU - Rezapour, Rezvaneh
AU - Ge, Yubin
AU - Han, Kanyao
AU - Jeong, Ray
AU - Diesner, Jana
N1 - We gratefully acknowledge the support from the John D. and Catherine T. MacArthur Foundation.
PY - 2024
Y1 - 2024
N2 - Automatic text summarization helps to digest the vast and ever-growing amount of scientific publications. While transformer-based solutions like BERT and SciBERT have advanced scientific summarization, lengthy documents pose a challenge due to the token limits of these models. To address this issue, we introduce and evaluate a two-stage model that follows an extract-then-compress framework. Our model incorporates a “graph-augmented extraction module” to select order-based salient sentences and an “abstractive compression module” to generate concise summaries. Additionally, we introduce the BioConSumm dataset, which focuses on biodiversity conservation, to support underrepresented domains and explore domain-specific summarization strategies. Among the tested models, ours achieves the highest ROUGE-2 and ROUGE-L scores on our newly created dataset (BioConSumm) and on the SUMPUBMED dataset, which serves as a benchmark in the field of biomedicine.
AB - Automatic text summarization helps to digest the vast and ever-growing amount of scientific publications. While transformer-based solutions like BERT and SciBERT have advanced scientific summarization, lengthy documents pose a challenge due to the token limits of these models. To address this issue, we introduce and evaluate a two-stage model that follows an extract-then-compress framework. Our model incorporates a “graph-augmented extraction module” to select order-based salient sentences and an “abstractive compression module” to generate concise summaries. Additionally, we introduce the BioConSumm dataset, which focuses on biodiversity conservation, to support underrepresented domains and explore domain-specific summarization strategies. Among the tested models, ours achieves the highest ROUGE-2 and ROUGE-L scores on our newly created dataset (BioConSumm) and on the SUMPUBMED dataset, which serves as a benchmark in the field of biomedicine.
UR - http://www.scopus.com/inward/record.url?scp=85216932764&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85216932764&partnerID=8YFLogxK
U2 - 10.18653/v1/2024.nlp4science-1.5
DO - 10.18653/v1/2024.nlp4science-1.5
M3 - Conference contribution
AN - SCOPUS:85216932764
T3 - NLP4Science 2024 - 1st Workshop on NLP for Science, Proceedings of the Workshop
SP - 36
EP - 46
BT - NLP4Science 2024 - 1st Workshop on NLP for Science, Proceedings of the Workshop
A2 - Peled-Cohen, Lotem
A2 - Calderon, Nitay
A2 - Lissak, Shir
A2 - Reichart, Roi
PB - Association for Computational Linguistics (ACL)
T2 - 1st Workshop on NLP for Science, NLP4Science 2024
Y2 - 16 November 2024
ER -