WordScape: a Pipeline to extract multilingual, visually rich Documents with Layout Annotations from Web Crawl Data

  • Maurice Weber
  • Carlo Siebenschuh
  • Rory M. Butler
  • Anton Alexandrov
  • Valdemar R. Thanner
  • Bo Li
  • Georgios Tsolakis
  • Haris Jabbar
  • Rick Stevens
  • Ian Foster
  • Ce Zhang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

We introduce WordScape, a novel pipeline for the creation of cross-disciplinary, multilingual corpora comprising millions of pages with annotations for document layout detection. Relating visual and textual items on document pages has gained further significance with the advent of multimodal models. Various approaches have proved effective for visual question answering or layout segmentation. However, the interplay of text, tables, and visuals remains challenging for a variety of document understanding tasks. In particular, many models fail to generalize well to diverse domains and new languages due to insufficient availability of training data. WordScape addresses these limitations. Our automatic annotation pipeline parses the Open XML structure of Word documents obtained from the web, jointly providing layout-annotated document images and their textual representations. In turn, WordScape offers unique properties as it (1) leverages the ubiquity of the Word file format on the internet, (2) is readily accessible through the Common Crawl web corpus, (3) is adaptive to domain-specific documents, and (4) offers culturally and linguistically diverse document pages with natural semantic structure and high-quality text. Together with the pipeline, we will additionally release 9.5M URLs to Word documents, which can be processed using WordScape to create a dataset of over 40M pages. Finally, we investigate the quality of text and layout annotations extracted by WordScape, assess the impact on document understanding benchmarks, and demonstrate that manual labeling costs can be substantially reduced.
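
To make the idea of parsing the Open XML structure of a Word document concrete, the following is a minimal sketch using only the Python standard library. It is not WordScape's implementation: the function names `extract_paragraphs` and `make_toy_docx` are hypothetical, and the toy archive holds only `word/document.xml`, so it is not a complete .docx file. The element names (`w:p`, `w:pPr`, `w:pStyle`, `w:t`) and the namespace are those of WordprocessingML.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, as used in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def extract_paragraphs(docx_bytes):
    """Return (style_id, text) for each non-empty paragraph (hypothetical helper)."""
    # A .docx file is a ZIP archive; the main body lives in word/document.xml.
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # The paragraph style (e.g. a heading style) hints at the layout category.
        style = p.find("w:pPr/w:pStyle", NS)
        style_id = style.get(f"{{{W_NS}}}val") if style is not None else None
        # Concatenate the text of all runs in the paragraph.
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        if text:
            paragraphs.append((style_id, text))
    return paragraphs


def make_toy_docx():
    """Build an in-memory archive with only word/document.xml (illustration only)."""
    xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr>'
        '<w:r><w:t>Title</w:t></w:r></w:p>'
        '<w:p><w:r><w:t>Body </w:t></w:r><w:r><w:t>text.</w:t></w:r></w:p>'
        '</w:body></w:document>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return buf.getvalue()


print(extract_paragraphs(make_toy_docx()))
```

In this sketch the style identifier stands in for a layout label and the run text for the textual representation; rendering pages to images and locating bounding boxes, which the abstract also mentions, are outside its scope.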

Original language: English (US)
Title of host publication: Advances in Neural Information Processing Systems 36 - 37th Conference on Neural Information Processing Systems, NeurIPS 2023
Editors: A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, S. Levine
Publisher: Neural information processing systems foundation
ISBN (Electronic): 9781713899921
State: Published - 2023
Event: 37th Conference on Neural Information Processing Systems, NeurIPS 2023 - New Orleans, United States
Duration: Dec 10 2023 to Dec 16 2023

Publication series

Name: Advances in Neural Information Processing Systems
Volume: 36
ISSN (Print): 1049-5258

Conference

Conference: 37th Conference on Neural Information Processing Systems, NeurIPS 2023
Country/Territory: United States
City: New Orleans
Period: 12/10/23 to 12/16/23

ASJC Scopus subject areas

  • Signal Processing
  • Information Systems
  • Computer Networks and Communications
