Sunday, January 5, 2020: 8:50 AM
Clinton Room (New York Hilton)
As we move towards historical arguments based on computational analysis of machine-readable texts, there are consequences that most researchers may not have anticipated. Just as researchers consult very specific materials in physical archives, they need customized bodies of digitized text for computational analysis, tailored to the questions they wish to answer. This means more than finding an available “digital archive.” Researchers must consider carefully how to create a representative, balanced corpus of machine-readable sources. The process of corpus creation raises a host of ethical questions. Researchers may encounter any (or all) of the following problems. It may be difficult to find legal access to copyrighted materials and to digitized versions of those materials. The process of rendering digitized sources machine-readable may involve breaking proprietary containers, which may violate terms of service. If a researcher contracts to use proprietary sources, they may face restrictions on their ability to share data. Once a custom corpus is created, the researcher faces more issues during the process of digital analysis. Variable-quality optical character recognition (OCR) is usually considered a problem to be sorted out, but it is also an ethical issue. Very frequent words will still be apparent amid poor OCR, but rarer terms or individuals may appear absent where they are in fact present.