describing name frequency distributions

The most popular given names have become much less popular over the past two centuries. In the U.K. around 1800, about 85% of males and 82% of females had names that were among the ten most popular given names for males and females, respectively. Apart from temporary effects of the Norman Conquest in 1066 and the Black Death in the mid-fourteenth century, given names seem to have had a similar distribution from 1000 to 1800. But after 1800, the name distribution flattened. By 1994, the share of males and females with given names among the ten most popular given names had fallen to 28% and 24%, respectively.[1] That seems to me a quite astonishing change in an important class of symbols.

Describing the name distribution as flattening is a simple way to describe the change from 1800 to the present. More specifically, plot name popularity (the frequency of a name divided by the total number of named persons in the sample) against popularity rank. Approximate this plot across some range of ranks by a line. The slope of that line has flattened over the past two centuries. That’s equivalent to the complementary distribution function of name frequency increasing in slope at the high end of the name frequency distribution.[2]
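To make the flattening description concrete, here is a minimal Python sketch of the calculation. It assumes the plot is made in log-log space (as note [2] suggests) and that the line is fitted by ordinary least squares; both choices, along with the function names and the toy counts, are illustrative assumptions rather than the method behind the figures above. The fitted slope is only a descriptive summary, not an estimate of a power-law exponent.

```python
import math

def popularity_by_rank(counts):
    """Return name popularities (count / total persons), most popular first."""
    total = sum(counts)
    return [c / total for c in sorted(counts, reverse=True)]

def top_share(counts, k=10):
    """Share of persons whose name is among the k most popular names."""
    return sum(popularity_by_rank(counts)[:k])

def loglog_slope(popularities, first_rank=1, last_rank=None):
    """Least-squares slope of log(popularity) on log(rank) over a rank range."""
    if last_rank is None:
        last_rank = len(popularities)
    ranks = range(first_rank, last_rank + 1)
    xs = [math.log(r) for r in ranks]
    ys = [math.log(popularities[r - 1]) for r in ranks]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Hypothetical toy samples: counts of persons per name, not real data.
steep = [1200, 600, 400, 300, 240, 200]   # counts proportional to 1/rank
flat = [300, 290, 280, 270, 260, 250]     # counts nearly equal across ranks

steep_slope = loglog_slope(popularity_by_rank(steep))
flat_slope = loglog_slope(popularity_by_rank(flat))
```

A slope closer to zero for the second sample is what "flattening" means here: popularity falls off more slowly as rank increases, so the top names hold a smaller share of the population.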

My work on given name frequency distributions does not support claiming that given names follow a power-law distribution. To the extent that I have in the past made such a claim, please recognize that I provided no statistical support for that claim. You can find evidence that I don’t take such a claim seriously. I’m interested in understanding major changes in symbolic choices, such as the change in name popularity over the past two centuries. Given this interest, expending effort on estimating a stationary statistical model for the name distribution doesn’t seem worthwhile to me. I hereby explicitly renounce any claim that given names follow a power-law distribution.

Moreover, I heartily recommend recent work on estimating a power-law model, evaluating its goodness of fit, and comparing it to alternative statistical models.  With their article, “Power-Law Distributions in Empirical Data,” Aaron Clauset, Cosma Rohilla Shalizi, and M. E. J. Newman provide a clear exposition of power laws, useful estimation strategies (see especially equation 3.7), and analysis of twenty-four real-world data sets.[3]  Even better, they have made available on the web code, in several languages, that implements their estimation methods.   They have also made their test data sets available through the web to the extent that they could.  In short, their work  is an outstanding example of actual, significant advancement of public knowledge.
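As a rough illustration of the kind of estimator that article describes, the following Python sketch computes what I understand to be the approximate maximum-likelihood exponent for discrete data, alpha ≈ 1 + n / Σ ln(x_i / (x_min − 1/2)), taken over the n observations at or above x_min. The function name and the toy frequencies are hypothetical, and x_min is simply supplied by hand here. The sketch does not choose x_min, test goodness of fit, or compare alternative models, all of which the authors' own code handles; check the article for the exact form and the conditions under which the approximation is reasonable.

```python
import math

def approx_discrete_alpha(values, x_min):
    """Approximate ML power-law exponent for integer data with x >= x_min.

    Returns (alpha, n), where n is the number of observations in the tail.
    Assumes x_min is given; selecting it is a separate estimation step.
    """
    if x_min <= 0.5:
        raise ValueError("x_min must exceed 0.5 for the logarithm to be defined")
    tail = [x for x in values if x >= x_min]
    if not tail:
        raise ValueError("no observations at or above x_min")
    log_sum = sum(math.log(x / (x_min - 0.5)) for x in tail)
    return 1.0 + len(tail) / log_sum, len(tail)

# Hypothetical toy name frequencies, not real data.
frequencies = [1, 1, 2, 2, 3, 5, 8, 13, 40, 120]
alpha, n_tail = approx_discrete_alpha(frequencies, x_min=2)
```

Observations below x_min are dropped before estimation, so the result describes only the upper tail of the frequency distribution.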

Notes:

[1]  These figures are from Galbi, Douglas (2002), “Long-Term Trends in Personal Given Name Frequencies in the UK,” Table 1.

[2]  Name frequencies divided by sample sizes give name popularities. Moving leftward from the high-end name frequencies and assuming distinct frequency ranks, the empirical complementary distribution function increases by constant probability increments equal to the inverse of the total number of distinct names in the sample. The value of the complementary distribution function at a given name frequency is thus that name's rank divided by the number of distinct names. Hence, up to scaling parameters, the popularity/rank plot is the complementary distribution function with its axes swapped, that is, reflected about the line y = x. In log-log space the scaling parameters become additive shifts, so they do not affect slopes.

[3]  One of their data sets is surname frequency from the U.S. Census of 1990. For these data, above a minimum frequency threshold, their analysis favors a power law with exponential cut-off. A power-law distribution and a log-normal distribution are not clearly rejected. However, these statistics shouldn’t be taken too seriously. Compare the source Census surname data set to Clauset, Shalizi, and Newman’s constructed surname frequency data set. They constructed surname frequencies from surname frequency percent shares for surnames with shares above 0.005, reported with only one significant digit. Even at their estimated x-min (upper frequencies), the reported surname share has only two significant digits (0.0045). Hence, while the computed frequencies are reported with seven significant digits, their accuracy in most cases is much less. That the analysis of surname frequencies doesn’t employ good data probably isn’t important. A statistical model for surnames seems to me less interesting than a statistical model for given names. Moreover, as discussed above, I think the most interesting issue for given names is distribution dynamics. I hope that smart statisticians will work on understanding given-name distribution dynamics.
