Linked Census Files from the Minnesota Population Center: Following Individuals and Their Households through the Nineteenth Century

Saturday, January 7, 2012
Sheraton Ballroom II (Sheraton Chicago Hotel & Towers)
Katie Genadek, University of Minnesota
The Minnesota Population Center (MPC) has created a set of linked representative samples of individuals and family groups using U.S. Census microdata for the period 1850 to 1930. The linked census records contain a rich collection of individual information, such as marital status, education, and occupation, for each census year in which the individual can be matched. This significantly increases the ability to study the process of change over time in the lives of individuals and their households. The linked samples also lend themselves to research methodologies that combine quantitative and qualitative approaches.

The record-linking process relies on information fields that are common to the two datasets being linked. To minimize selection bias in this process, MPC linking protocols use only variables that should not change over time (such as names, birthplace, and race) or that should change in predictable ways (such as age). We do not use place of residence, because it would produce a disproportionate percentage of links for non-migrants. Nor do we use information from other co-resident household members, since this would favor the selection of those with co-resident kin. Given names are standardized to eliminate abbreviations and diminutives, and similarity scores are constructed for names and age. We use Support Vector Machines to classify potential links as true or false. The poster describes several additional steps in the linking process. The end result is a file consisting of potential links and the classifier-produced confidence score for each. Confidence scores are interpreted dichotomously: a positive score indicates a “true” link and a negative score a “false” link. In cases where more than one true link exists for a specific sample record, we reject all links for that record as ambiguous. We also assess whether population subgroups are over- or underrepresented in the linked datasets and supply weights to address this.
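
The final decision rule described above can be illustrated with a short Python sketch. It is not the MPC's code: the function names, the tuple layout of the candidate file, and the treatment of a score of exactly zero are assumptions made for illustration. The weighting function is likewise only one plausible approach (population share divided by linked share); the abstract does not state how the supplied weights are computed.

```python
from collections import defaultdict


def resolve_links(candidates):
    """Apply the dichotomous rule and reject ambiguous sample records.

    `candidates` is an iterable of (sample_id, target_id, score) tuples,
    a hypothetical layout for the file of potential links.
    """
    true_links = defaultdict(list)
    for sample_id, target_id, score in candidates:
        # Positive score = "true" link. A score of exactly 0 is treated
        # as not true here; the source does not define that case.
        if score > 0:
            true_links[sample_id].append(target_id)
    # Keep a sample record only if it has exactly one true link;
    # records with several true links are dropped as ambiguous.
    return {s: t[0] for s, t in true_links.items() if len(t) == 1}


def subgroup_weights(population_counts, linked_counts):
    """Illustrative weights: population share / linked share per subgroup.

    This formula is an assumption, not the documented MPC procedure.
    """
    pop_total = sum(population_counts.values())
    linked_total = sum(linked_counts.values())
    weights = {}
    for group, n_linked in linked_counts.items():
        if n_linked == 0:
            continue  # no linked cases to weight in this subgroup
        pop_share = population_counts[group] / pop_total
        linked_share = n_linked / linked_total
        weights[group] = pop_share / linked_share
    return weights


if __name__ == "__main__":
    candidates = [
        ("a", "x", 1.2),   # record "a" has two positive links: ambiguous
        ("a", "y", 0.3),
        ("b", "z", 0.8),   # record "b" has one positive link: kept
        ("c", "w", -0.5),  # negative score: "false" link
    ]
    print(resolve_links(candidates))  # {'b': 'z'}
```

In this sketch a subgroup that is underrepresented among the links receives a weight above 1 and an overrepresented subgroup a weight below 1.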
