value of given name data

Given names are an important but under-appreciated type of data. Given names represent significant symbolic choices.  Large populations of persons have been making this well-defined symbolic choice for millennia.  Given names are thus useful data for studying symbolic choice, effects of communication technologies, and information economics.

Given name frequency data are now also important to valuable new population estimation techniques.  Survey costs are typically directly related to sample size.  Most persons, however, know many other persons and can provide information about persons that they know.  So, for example, if you want to estimate how many persons in the U.S. openly blog regularly, you could ask a sample of persons whether they blog regularly, and also ask them how many persons they know who blog regularly.  The sample size is then effectively scaled up by the size of respondents' personal networks within the U.S., counting only connections close enough that the respondent would know whether that person blogs.  That scale-up might be a factor of about 500.
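To make the arithmetic concrete, here is a minimal sketch of a basic scale-up ratio estimate, written in Python.  It assumes the simplest form of the estimator: the share of bloggers among all persons that respondents know is applied to the whole population.  The function name and the survey numbers are made up for illustration; this is not necessarily the estimator used in any particular study.

```python
def scale_up_estimate(known_in_group, network_sizes, total_population):
    """Basic scale-up ratio estimate of a subpopulation's size.

    known_in_group[i]: how many group members (e.g. regular bloggers)
                       respondent i reports knowing.
    network_sizes[i]:  estimated personal network size of respondent i.
    total_population:  size of the whole population.
    """
    if len(known_in_group) != len(network_sizes):
        raise ValueError("one network size is needed per respondent")
    total_network = sum(network_sizes)
    if total_network <= 0:
        raise ValueError("network sizes must sum to a positive number")
    # Share of group members among all known persons, scaled to the population.
    return total_population * sum(known_in_group) / total_network


# Hypothetical survey of three respondents (illustrative numbers only).
estimate = scale_up_estimate(
    known_in_group=[2, 0, 1],
    network_sizes=[500, 400, 600],
    total_population=300_000_000,
)
print(estimate)
```

Three respondents here report on 1,500 personal connections in total, which is the sense in which the sample is scaled up.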

Research on scale-up estimation has used given name frequency data.  Given names provide a good means for estimating personal network size, i.e. the number of persons that someone knows.  Most persons probably could not answer well the question, “How many persons do you know?”  But they can answer quite well, “How many persons named Bao do you know?”  Answers to that question, combined with data on the frequency of the name Bao in the population, can be used to compute a good estimate of personal network size.  Defining “know” to mean “talk to each other about personal interests at least once a month” might provide an estimate of personal network size relevant to a scale-up estimate of the total population of bloggers.
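A rough sketch of that calculation, in Python, under a strong simplifying assumption: a person is equally likely to know anyone in the population, so the share of Baos among a person's connections matches the share of Baos in the population.  The name counts below are invented for illustration, not real frequency data, and the function name is hypothetical.  The research cited below uses more refined methods that account for bias in these answers.

```python
def estimate_network_size(known_by_name, population_by_name, total_population):
    """Simple estimate of one respondent's personal network size.

    known_by_name:      {name: number of persons with that name the respondent knows}
    population_by_name: {name: number of persons with that name in the population}
    total_population:   size of the whole population
    """
    total_known = sum(known_by_name.values())
    # Only count population totals for the names that were actually asked about.
    total_named = sum(population_by_name[name] for name in known_by_name)
    if total_named <= 0:
        raise ValueError("name frequencies must sum to a positive number")
    # If names are spread evenly, known / network_size == named / population.
    return total_population * total_known / total_named


# Hypothetical answers and hypothetical name frequencies.
answers = {"Bao": 1, "Maria": 2}
name_counts = {"Bao": 100_000, "Maria": 1_100_000}
print(estimate_network_size(answers, name_counts, 300_000_000))
```

Asking about several names rather than one reduces the noise from any single answer, which is one reason good frequency data for many names matters.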

Governments could relatively easily make good data on given names freely available.  Good data on given names would consist of a large, random sample of given names, along with the person’s sex, age or age range, geographic region, and race/ethnicity, if reported.  In the administration of various government programs, governments collect large datasets that include such information.  Ensuring that every cell in the finest intersection of these categories had at least a few data points would provide sufficient privacy for personal information that is not highly sensitive and that is widely known in any case.
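The check on category intersections could be sketched in Python as follows.  The field names, the sample records, and the threshold of five are all assumptions chosen for illustration; the post itself says only "at least a few data points," and an actual release would follow the publishing agency's own disclosure rules.

```python
from collections import Counter


def sparse_cells(records, keys=("sex", "age_range", "region"), min_count=5):
    """Return category intersections that hold fewer than min_count records.

    records:   iterable of dicts, one per sampled person (hypothetical layout)
    keys:      the fields whose intersection defines a cell
    min_count: smallest acceptable number of records per cell (assumed value)
    """
    counts = Counter(tuple(record[key] for key in keys) for record in records)
    return {cell: n for cell, n in counts.items() if n < min_count}


# Made-up sample: one well-populated cell and one sparse cell.
sample = (
    [{"name": "Maria", "sex": "F", "age_range": "20-29", "region": "West"}] * 5
    + [{"name": "Bao", "sex": "M", "age_range": "60-69", "region": "East"}] * 2
)
print(sparse_cells(sample))
```

Cells flagged this way could be merged into coarser categories, for example wider age ranges, before the data are published.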

Making such data freely available would contribute to valuable public knowledge.   The conclusion to an important paper on scale-up estimates noted:

Though the methods presented here account for bias in individual degree estimation in ways that are not present in other methods, they are only as good as the available data on the demographics of first names. Using “How many X’s do you know?” data to estimate person network size requires knowing the number of people in the population with the different first names.  In many countries such information may not be available.[*]

Such information undoubtedly exists.  Not making it available is an intellectual and economic waste.  Communication economists, statisticians, and others potentially can create considerable public value from analysis of given names.

*  *  *  *

[*] McCormick TH, Salganik MJ, Zheng T (2008) How many people do you know?: Efficiently estimating personal network size. Journal of the American Statistical Association, forthcoming.  Tian Zheng provides a good overview of the technique in this presentation.
