Understanding Genre in a Collection of a Million Volumes, Interim Report

Research output: Book/Report/Conference proceeding › Other report


One of the main problems confronting distant reading is the scarcity of metadata about genre in large digital collections. Volume-level information is often missing, and volume labels aren't in any case sufficient to guide machine reading, since poems and plays (for instance) are often mixed in a single volume, preceded by a prose introduction and followed by an index.

Our goal in this project was to show how literary scholars can use machine learning to select genre-specific collections from digital libraries. We've started by separating five broad categories that interest literary scholars: prose fiction, poetry (narrative and lyric), drama (including verse drama), prose nonfiction, and various forms of paratext.

This report discusses assumptions about the nature of genre that underpin our approach, describes methods, and explains how to use the page-level map of genre we have generated.
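As a rough illustration of what a page-level map makes possible, the sketch below collapses a list of per-page genre labels into contiguous runs and picks out the pages of one genre. The label strings, the function names, and the sample volume are all hypothetical. They are not the format of the published data, which the report itself documents.

```python
from itertools import groupby

# Hypothetical labels for the five broad categories named above;
# the codes used in the published data may differ.
GENRES = {"fiction", "poetry", "drama", "nonfiction", "paratext"}

def genre_runs(page_genres):
    """Collapse per-page labels into (genre, first_page, last_page) runs."""
    runs = []
    page = 0
    for genre, group in groupby(page_genres):
        length = len(list(group))
        runs.append((genre, page, page + length - 1))
        page += length
    return runs

def pages_of(page_genres, wanted):
    """Return the zero-based indices of pages labeled with the wanted genre."""
    return [i for i, genre in enumerate(page_genres) if genre == wanted]

# An invented mixed volume: front matter, a prose introduction,
# poems, a play, and an index at the end.
volume = [
    "paratext", "nonfiction", "nonfiction",
    "poetry", "poetry",
    "drama", "drama", "drama",
    "paratext",
]

for genre, first, last in genre_runs(volume):
    print(f"{genre}: pages {first}-{last}")
```

Working at the page level rather than the volume level is what lets a mixed volume like this one contribute its poems to a poetry collection and its play to a drama collection.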

That map itself is also available through figshare, at http://dx.doi.org/10.6084/m9.figshare.1279201. This research was supported by the National Endowment for the Humanities and the American Council of Learned Societies. Any views, findings, conclusions, or recommendations expressed in this release do not necessarily represent those of the funding agencies.
Original language: English (US)
Number of pages: 48
State: Published - Dec 29 2014


  • drama
  • fiction
  • poetry
  • machine learning
  • genre
  • folksonomy
  • predictive modeling
  • literary history
  • volume structure

