Saturday, January 7, 2012: 2:50 PM
Chicago Ballroom A (Chicago Marriott Downtown)
I will explain the process of data selection used in the GCP. Determining the information to be collected is a critical aspect of any database, because mistakes made at this stage may be impossible to rectify once data entry has begun. The GCP had a particular advantage in changing operational policy because the project underwent three separate funding stages, which allowed staff to review all procedures thoroughly and make changes as necessary. In this session, I will focus on two aspects of data selection. First, I will address our extensive experience in dealing with two practical (and universal) issues in historical database construction: data quality and missing data. The latter is particularly important, as it bears on the historian’s habitual frustration with information that is not there. Next, I will emphasize our experience in extracting potentially useful but “invisible” information hidden within the document, whether in its language, its structure, or the visible data itself: for example, the unstated relationships between individuals within a household, the gendered character of family organization, or the tracking of social or kinship networks beyond the household. The GCP identified 22 different ways of expressing marital status, 41 different relationships of individuals in a household to the head of the household, and 23 different household structures. In short, our objective was to uncover all the potentially relevant data and to present that data in an analytically useful format. I will explore the conceptual basis for data selection and explain the procedures we followed and the inevitable problems that arose in the process of coding nearly 150,000 individuals, their families, and households. These were problems I was all too familiar with in my capacity as Assistant Director of the GCP from 2002 to 2006 and supervisor of its day-to-day operation.