The Promise of WebARChive Files: Exploring the Internet Archive as a Historical Resource

Monday, January 5, 2015: 11:20 AM
Murray Hill Suite A (New York Hilton)
Ian Milligan, University of Waterloo
The World Wide Web is a significant primary source for historians. Every day, users record their thoughts, feelings, locations, ratings, votes, reviews, and jokes, an invaluable assemblage of traces of the past that historians can mold into historical narratives. These records form an ever-growing resource and epitomize the problem of “abundance” (as opposed to our traditional problem of source scarcity) articulated by Roy Rosenzweig in 2003. As access and analytical capability improve, historians need to reflect on the shape these primary sources will take and on how to work with them.

These sources present both opportunity and challenge. This is especially true of WebARChive, or WARC, files, which are the primary means of archiving website information from the Web and which thus underpin the Internet Archive’s preservation efforts. Yet while WARC files are of considerable utility from a preservationist standpoint, they require historians to develop digital skills. For many historical applications we need to move beyond the traditional portal into web archives, the Wayback Machine, and begin to explore and manipulate the plain text within.

This paper concerns itself with WARC files and the analysis that can be carried out on them. I present a methodological approach for navigating large bodies of Internet Archive information, drawing on a case study of nearly 5% of the top-level .ca domains preserved in a comprehensive scrape of the entire World Wide Web, the March 2011 Wide Web Scrape. Distant reading, understanding large collections by finding underlying patterns and trends rather than by reading individual documents, is a critical way to make sense of such a large quantity of data; yet I also sketch a way to move beyond this more abstract level of analysis down to the individual item. My paper presents a new way to crawl and use historical World Wide Web data.
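
The abstract does not include code, but a minimal sketch of the kind of plain-text extraction and distant reading it describes might look like the following. This is an illustration only, assuming the third-party Python library warcio and a locally downloaded WARC file; the file name and the word-frequency analysis are placeholders, not the author's actual method.

```python
# Sketch: pull archived page text out of a WARC file and tally word
# frequencies, a simple form of "distant reading" over a web archive.
# Assumes: pip install warcio; "example.warc.gz" is a hypothetical file.
import re
from collections import Counter

from warcio.archiveiterator import ArchiveIterator

TAG_RE = re.compile(rb"<[^>]+>")   # crude HTML tag stripper
WORD_RE = re.compile(r"[a-z']+")

word_counts = Counter()

with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        # WARC files interleave request, response, and metadata records;
        # only HTTP "response" records carry the archived page content.
        if record.rec_type != "response":
            continue
        if record.http_headers is None:
            continue
        content_type = record.http_headers.get_header("Content-Type", "")
        if "html" not in content_type:
            continue
        payload = record.content_stream().read()
        text = TAG_RE.sub(b" ", payload).decode("utf-8", errors="ignore")
        word_counts.update(WORD_RE.findall(text.lower()))

# A distant-reading view of the collection: its most frequent terms.
print(word_counts.most_common(25))
```

From a starting point like this, one could aggregate by domain or by crawl date, or drill back down from an unusual term to the individual archived page, moving between the distant and close readings the paper describes.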