TY - GEN
T1 - WordScape
T2 - 37th Conference on Neural Information Processing Systems, NeurIPS 2023
AU - Weber, Maurice
AU - Siebenschuh, Carlo
AU - Butler, Rory M.
AU - Alexandrov, Anton
AU - Thanner, Valdemar R.
AU - Li, Bo
AU - Tsolakis, Georgios
AU - Jabbar, Haris
AU - Stevens, Rick
AU - Foster, Ian
AU - Zhang, Ce
N1 - This work is partially supported by the National Science Foundation under grant No. 1910100, No. 2046726, No. 2229876, and Alfred P. Sloan Fellowship. CZ and the DS3Lab gratefully acknowledge the support from the Swiss State Secretariat for Education, Research and Innovation (SERI) under contract number MB22.00036 (for European Research Council (ERC) Starting Grant TRIDENT 101042665), the Swiss National Science Foundation (Project Number 200021 184628, and 197485), Innosuisse/SNF BRIDGE Discovery (Project Number 40B2-0 187132), European Union Horizon 2020 Research and Innovation Programme (DAPHNE, 957407), Botnar Research Centre for Child Health, Swiss Data Science Center, Alibaba, Cisco, eBay, Google Focused Research Awards, Kuaishou Inc., Oracle Labs, Zurich Insurance, and the Department of Computer Science at ETH Zurich. IF and RS acknowledge support from the U.S. Department of Energy under Contract DE-AC02-06CH11357.
PY - 2023
Y1 - 2023
N2 - We introduce WordScape, a novel pipeline for the creation of cross-disciplinary, multilingual corpora comprising millions of pages with annotations for document layout detection. Relating visual and textual items on document pages has gained further significance with the advent of multimodal models. Various approaches proved effective for visual question answering or layout segmentation. However, the interplay of text, tables, and visuals remains challenging for a variety of document understanding tasks. In particular, many models fail to generalize well to diverse domains and new languages due to insufficient availability of training data. WordScape addresses these limitations. Our automatic annotation pipeline parses the Open XML structure of Word documents obtained from the web, jointly providing layout-annotated document images and their textual representations. In turn, WordScape offers unique properties as it (1) leverages the ubiquity of the Word file format on the internet, (2) is readily accessible through the Common Crawl web corpus, (3) is adaptive to domain-specific documents, and (4) offers culturally and linguistically diverse document pages with natural semantic structure and high-quality text. Together with the pipeline, we will additionally release 9.5M URLs to Word documents which can be processed using WordScape to create a dataset of over 40M pages. Finally, we investigate the quality of text and layout annotations extracted by WordScape, assess the impact on document understanding benchmarks, and demonstrate that manual labeling costs can be substantially reduced.
UR - https://www.scopus.com/pages/publications/85191145360
UR - https://www.scopus.com/pages/publications/85191145360#tab=citedBy
M3 - Conference contribution
AN - SCOPUS:85191145360
T3 - Advances in Neural Information Processing Systems
BT - Advances in Neural Information Processing Systems 36 - 37th Conference on Neural Information Processing Systems, NeurIPS 2023
A2 - Oh, A.
A2 - Neumann, T.
A2 - Globerson, A.
A2 - Saenko, K.
A2 - Hardt, M.
A2 - Levine, S.
PB - Neural Information Processing Systems Foundation
Y2 - 10 December 2023 through 16 December 2023
ER -