Min(d)ing the Gap: Citation Preservation as a Tool to Open Paywalled Sources to Computational Analysis

Thursday, January 5, 2017: 1:30 PM
Plaza Ballroom D (Sheraton Denver Downtown)
Kalani Craig, Indiana University
Digital history projects bring with them visions of freely accessible data sets, clean and ready to be adapted from one project for use in another. The reality of the data landscape for digital humanists is much more complicated. Many digitized, transcribed sources available for text mining are institutionally owned and, as a result, often paywalled. Archives also frequently place limitations on the use of transcriptions made from their documents. These copyrighted digitized texts are still valuable targets for text mining, hGIS and network-theoretical approaches, but it can be difficult to provide results that meet open-data best practices within the constraints of the original copyrights.

The historical analysis of memory building in medieval saints' lives at the core of this paper is built on a foundation of data curation oriented toward the preservation of citation data, which is often lost or obscured as we work with topic modeling and natural language processing. The paper will use this memory-building argument as a case study to demonstrate a process for scraping, cleaning and importing paywalled sources, and then adding a layer of natural language processing that preserves the citation data scholars need to participate in a historiographic debate.
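
As a rough illustration of what citation preservation can mean in practice, the Python sketch below keeps a citation attached to every passage, so that each token found during text processing can be traced back to a citable location. It is not the paper's actual pipeline: the `Citation` fields, the `build_index` helper, the titles "Vita A" and "Vita B", and the sample sentences are all hypothetical placeholders, and the simple regular-expression tokenizer stands in for a fuller natural language processing layer.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Hypothetical citation record; real sources may need more fields."""
    work: str     # short title of the source (placeholder values below)
    section: str  # chapter or section label within the source


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(passages):
    """Map each token to the set of citations in which it occurs."""
    index = defaultdict(set)
    for citation, text in passages:
        for token in tokenize(text):
            index[token].add(citation)
    return index


# Invented sample passages, used only to exercise the sketch.
passages = [
    (Citation("Vita A", "ch. 1"), "The saint remembered the miracle."),
    (Citation("Vita A", "ch. 2"), "The miracle was recorded by the monks."),
    (Citation("Vita B", "ch. 5"), "Monks preserved the memory of the saint."),
]

index = build_index(passages)

# Every hit for a term can be reported with its citation, not just a count.
miracle_refs = sorted((c.work, c.section) for c in index["miracle"])
```

Because the index stores citations rather than the passages themselves, results of this kind could in principle be shared without redistributing the copyrighted text, while still letting other scholars locate each passage in the original source.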

A set of online documentation and resources for replicating the process will accompany the paper.

