Too Much Information: Transparency, Metadata, and Search in the Age of Web Archives

Sunday, January 7, 2018: 12:00 PM
Diplomat Ballroom (Omni Shoreham)
Ian Milligan, University of Waterloo
Big Data, in the form of born-digital historical sources, is reshaping the humanities and social sciences. Much of this information is held in web archives containing billions of web pages, ranging from individual homepages and social media sites and feeds to institutional pages and corporate sites. This material holds potential for historians working in diverse fields. Yet this tremendous opportunity is tempered by the challenge of dealing with the sheer size of the data. Text search is an obvious solution to this problem of scale, but keyword searching across billions of documents presents several issues. First, ranking algorithms become critically important: what is first on a list will be found and cited by a scholar, whereas what is a thousand places down will never be found. We need to understand and make transparent the role of the algorithm. Second, the focus on text threatens to occlude other methods of discovering information, notably metadata mining.

My presentation focuses on the work our team of historians, librarians, and computer scientists has done in developing a pan-institutional (Alberta, Dalhousie, Victoria, Toronto, Winnipeg, and Simon Fraser University) web archiving portal in Canada. Having ingested 16 TB of web archival data, we have attempted to develop transparent search algorithms, as well as other forms of supporting data that make decisions and discovery more transparent. The presentation speaks to the decisions and challenges we have faced in addressing the two problems above, both as historians and in conceptualizing and executing a large-scale project. It also discusses the skills students will need to work with this material, both to explore it on its own merits and to work in an interdisciplinary context.
