Understanding Genre in a Collection of a Million Volumes, Interim Report

Research output: Book/Report › Other report

Abstract

One of the main problems confronting distant reading is the scarcity of metadata about genre in large digital collections. Volume-level information is often missing, and volume labels aren't in any case sufficient to guide machine reading, since poems and plays (for instance) are often mixed in a single volume, preceded by a prose introduction and followed by an index.

Our goal in this project was to show how literary scholars can use machine learning to select genre-specific collections from digital libraries. We've started by separating five broad categories that interest literary scholars: prose fiction, poetry (narrative and lyric), drama (including verse drama), prose nonfiction, and various forms of paratext.

This report discusses assumptions about the nature of genre that underpin our approach, describes methods, and explains how to use the page-level map of genre we have generated.

That map itself is also available through figshare, at http://dx.doi.org/10.6084/m9.figshare.1279201 (link below). This research was supported by the National Endowment for the Humanities and the American Council of Learned Societies. Any views, findings, conclusions, or recommendations expressed in this release do not necessarily represent those of the funding agencies.
Original language: English (US)
Publisher: figshare
Number of pages: 48
DOIs: 10.6084/m9.figshare.1281251
State: Published - Dec 29 2014

Fingerprint

Literary Scholars
Drama
Prose
Poem
Non-fiction
Paratext
Metadata
Prose Fiction
Digital Libraries
Funding
Distant Reading
Verse
Scarcity
Machine Learning
Lyrics
Poetry

Keywords

  • drama
  • fiction
  • poetry
  • machine learning
  • genre
  • folksonomy
  • predictive modeling
  • literary history
  • volume structure

Cite this

Understanding Genre in a Collection of a Million Volumes, Interim Report. / Underwood, Ted.

figshare, 2014. 48 p.

@book{38cabe4f1b854ad3a7e4faebbcace0a2,
title = "Understanding Genre in a Collection of a Million Volumes, Interim Report",
abstract = "One of the main problems confronting distant reading is the scarcity of metadata about genre in large digital collections. Volume-level information is often missing, and volume labels aren't in any case sufficient to guide machine reading, since poems and plays (for instance) are often mixed in a single volume, preceded by a prose introduction and followed by an index. Our goal in this project was to show how literary scholars can use machine learning to select genre-specific collections from digital libraries. We've started by separating five broad categories that interest literary scholars: prose fiction, poetry (narrative and lyric), drama (including verse drama), prose nonfiction, and various forms of paratext. This report discusses assumptions about the nature of genre that underpin our approach, describes methods, and explains how to use the page-level map of genre we have generated. That map itself is also available through figshare, at http://dx.doi.org/10.6084/m9.figshare.1279201 (link below). This research was supported by the National Endowment for the Humanities and the American Council of Learned Societies. Any views, findings, conclusions, or recommendations expressed in this release do not necessarily represent those of the funding agencies.",
keywords = "drama, fiction, poetry, machine learning, genre, folksonomy, predictive modeling, literary history, volume structure",
author = "Ted Underwood",
year = "2014",
month = "12",
day = "29",
doi = "10.6084/m9.figshare.1281251",
language = "English (US)",
publisher = "figshare",

}

TY - BOOK

T1 - Understanding Genre in a Collection of a Million Volumes, Interim Report

AU - Underwood, Ted

PY - 2014/12/29

Y1 - 2014/12/29

N2 - One of the main problems confronting distant reading is the scarcity of metadata about genre in large digital collections. Volume-level information is often missing, and volume labels aren't in any case sufficient to guide machine reading, since poems and plays (for instance) are often mixed in a single volume, preceded by a prose introduction and followed by an index. Our goal in this project was to show how literary scholars can use machine learning to select genre-specific collections from digital libraries. We've started by separating five broad categories that interest literary scholars: prose fiction, poetry (narrative and lyric), drama (including verse drama), prose nonfiction, and various forms of paratext. This report discusses assumptions about the nature of genre that underpin our approach, describes methods, and explains how to use the page-level map of genre we have generated. That map itself is also available through figshare, at http://dx.doi.org/10.6084/m9.figshare.1279201 (link below). This research was supported by the National Endowment for the Humanities and the American Council of Learned Societies. Any views, findings, conclusions, or recommendations expressed in this release do not necessarily represent those of the funding agencies.

AB - One of the main problems confronting distant reading is the scarcity of metadata about genre in large digital collections. Volume-level information is often missing, and volume labels aren't in any case sufficient to guide machine reading, since poems and plays (for instance) are often mixed in a single volume, preceded by a prose introduction and followed by an index. Our goal in this project was to show how literary scholars can use machine learning to select genre-specific collections from digital libraries. We've started by separating five broad categories that interest literary scholars: prose fiction, poetry (narrative and lyric), drama (including verse drama), prose nonfiction, and various forms of paratext. This report discusses assumptions about the nature of genre that underpin our approach, describes methods, and explains how to use the page-level map of genre we have generated. That map itself is also available through figshare, at http://dx.doi.org/10.6084/m9.figshare.1279201 (link below). This research was supported by the National Endowment for the Humanities and the American Council of Learned Societies. Any views, findings, conclusions, or recommendations expressed in this release do not necessarily represent those of the funding agencies.

KW - drama

KW - fiction

KW - poetry

KW - machine learning

KW - genre

KW - folksonomy

KW - predictive modeling

KW - literary history

KW - volume structure

U2 - 10.6084/m9.figshare.1281251

DO - 10.6084/m9.figshare.1281251

M3 - Other report

BT - Understanding Genre in a Collection of a Million Volumes, Interim Report

PB - figshare

ER -