lack of power laws and other popularity problems

Discussion of tails and popularity seems to be maturing into more comprehensive considerations. Folks aiming for uber-geek status might chat about the difference between power laws and log-normal distributions. This difference has similar consequences to the difference between non-stationary and stationary macroeconomic time series. If those terms are obscure to you, you might just chat about how much bigger infinity is than any other number. It’s a huge issue!

Government bureaucrats and other practical, get-the-job-done types might just try to produce some simple, intuitive, and relevant graphs. Consider, for example, log-log graphs of website page traffic by page rank. They show a “drooping tail” relative to an approximating line for the left part of the popularity distribution.

Whether the droop is a typical characteristic of web page popularity distributions is not clear. Surely a relatively large subset of relatively bad (unlinked, search-word-poor, spam-associated) pages could contribute to the droop. On the other hand, the droop could be a typical effect of the usual distribution of page content and general patterns of linking and searching. These two possibilities could be tested by comparing the magnitude of the droop in different websites’ page traffic distributions.

Fitting a line to a popularity distribution is more useful as a descriptive technique than as a literal claim that the popularity distribution follows a power law. The term “power law” is not meaningful to most persons. Moreover, knowledge about power laws does not provide a lot of insight into the factors that determine website traffic or trends in website traffic over time.

The possibilities for statistical distributions are not limited to power laws (or power laws and log-normal distributions). The personal behavior and information and communication systems that affect page popularity are complex. They may not be uniform across different circumstances. For example, the factors that govern the popularity of the least popular pages may be rather different from the factors that govern the popularity of the most popular pages.

Power laws and log-normal distributions are two-parameter distributions. The distributional form that best characterizes traffic to all pages may have many more than two parameters. Compared to a log-normal distribution or a distribution with more than two parameters, an approximating line has greater value for providing a simple, intuitive description of an important part of the popularity distribution.

Two website traffic distributions suggest that website traffic distributions may have flattened over the past decade. Traffic to Sun’s website pages in July, 1996, according to my calculation, had a descriptive log-log line with slope about -1.1. Traffic to useit.com pages in the summer of 2006 had a descriptive log-log line with slope about -0.8.* This difference may reflect differences between the two websites. It might also indicate flattening over time in typical website traffic distributions. Such flattening was also a general pattern of change in name popularity distributions in England (and the U.S.) from 1800 to the present.

Much more evidence of this sort might lead one to expect website traffic distributions, and perhaps other distributions for digital goods, to continue to flatten in the future. “More of the same” is a very crude sort of prediction. But absent any other relevant information, it might be the best that one can make.

Knowledge concerning the factors that have produced changes in popularity distributions could help to predict and shape future changes. Persons may have chosen less popular names in response to changes in patterns of work and residence that put persons closer together and more extensively and uniformly regulated their interactions. In short, personalization may have been a personal counter-reaction to factories and urbanization (the Industrial Revolution).

Suppose new information and communication technologies favor a more dispersed workforce, and family structure continues to shift toward more single persons and smaller households. My theory then predicts a decrease in personalization and an increased desire to associate with popular symbols.

Ummm, about all current trends indicate that either my theory is wrong, or that there are a lot of other, much more important factors. Does anyone have some better ideas?

* This is the slope measured in log-log coordinates, not in the coordinates of the axes’ labels. The intercept on a log(page traffic) axis typically varies greatly with total website traffic. Thus I prefer graphs that have a y-axis labeled with log(page traffic share in total website traffic). An approximating line has the same slope with either labeling for the y-axis.
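
As a rough illustration of the footnote, here is a minimal sketch that fits a descriptive line in log-log coordinates, once to raw page traffic and once to traffic shares. The page-view counts are made up for illustration, and the function name loglog_slope_intercept is hypothetical. It uses an ordinary least-squares fit over all the ranks supplied, which is just one simple way to get an approximating line.

```python
import math

def loglog_slope_intercept(values):
    """Least-squares fit of log10(value) against log10(rank).

    `values` must be sorted from most popular (rank 1) to least popular.
    Returns (slope, intercept) of the fitted line.
    """
    xs = [math.log10(rank) for rank in range(1, len(values) + 1)]
    ys = [math.log10(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical page-view counts, most popular page first (made-up numbers).
page_views = [12000, 5200, 3100, 2300, 1700, 1400, 1150, 980, 860, 760]
total = sum(page_views)
shares = [v / total for v in page_views]

slope_views, intercept_views = loglog_slope_intercept(page_views)
slope_shares, intercept_shares = loglog_slope_intercept(shares)

print(f"slope, raw traffic:    {slope_views:.3f}")
print(f"slope, traffic shares: {slope_shares:.3f}")
print(f"intercept shift:       {intercept_views - intercept_shares:.3f}")
```

Dividing every page's traffic by the site total subtracts the same constant, log10(total), from every point. So the two slopes agree and only the intercept moves.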

more discussion about tail size

Once again men are heatedly discussing tail size. Just ponder this question: How large is the long tail? Personally, I’m going to keep looking before I decide for myself.

While it’s novel to bring mathematical precision to such matters, unfortunately it seems to me that this mathematical model focuses attention on misleading features. The model says that the share of the k most popular items is log(k)/log(n), where n is the total number of items on offer. Thus, in this model, the total number of items on offer determines the share of the most popular items.
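
To see what the formula implies, here is a minimal sketch that simply evaluates log(k)/log(n) for a few values of k and n. The function name model_top_share is hypothetical. The two values of n are the two million and six billion that come up in this discussion, and the values of k are arbitrary.

```python
import math

def model_top_share(k, n):
    """Share of the k most popular items under the log(k)/log(n) model."""
    return math.log(k) / math.log(n)

for n in (2_000_000, 6_000_000_000):
    for k in (10, 1_000, 100_000):
        share = model_top_share(k, n)
        print(f"n = {n:>13,}  k = {k:>7,}  model share = {share:.1%}")
```

Under this formula the share of any fixed top k falls as n grows, and nothing else about the collection matters.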

This isn’t a sensible model. Mathematically, a power law describes an infinite number of items on offer. The slope of the power law, or more precisely, the slope of an approximating power law at the high popularity end of the distribution, usually describes well the high-end shares. The question is what determines the slope of the power law. The number of items on offer isn’t a good answer to that question, particularly for n varying from two million to six billion.

For a concrete example, consider the popularity of the ten-most-popular given names. The set of possible given names (given names on offer) is huge, and probably hasn’t changed much in the past two hundred years. However, the popularity of the ten-most-popular given names for males in England has fallen from about 85% in 1800 to about 28% in 1994. If you want to understand changes in the popularity of the most popular items in a collection of symbols instantiated and used in a similar way, try to understand this change.

* * *

For additional amusement, here’s a post I stuck in the galbithink.org newsfeed a little more than a year ago, back in the time of Web-Pleistocene:

Tail aficionados might enjoy pondering the distinguishing features of the long tail. I think that size, which tail authorities have categorized as long or short, matters less than shape. It should be no surprise to anyone that shape can change over time. For some graphical evidence, see the detailed images here.

So don’t just sit around complaining that “diversity plus freedom of choice creates inequality”. Power laws don’t imply any particular amount of inequality. The power of the power law determines the difference between tails. Look at some examples and see for yourself!
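
For one such example, here is a minimal sketch that compares the share of the ten most popular items under two different powers. It assumes popularity is exactly proportional to rank raised to minus the power, over a finite collection of 10,000 items. Both the exact form and the collection size are assumptions for illustration, and the function name top_share is hypothetical. The two powers, 0.8 and 1.1, are borrowed from the website traffic slopes mentioned earlier.

```python
def top_share(exponent, k, n):
    """Share of the k most popular of n items when popularity ~ rank**(-exponent)."""
    weights = [rank ** (-exponent) for rank in range(1, n + 1)]
    return sum(weights[:k]) / sum(weights)

n_items = 10_000  # hypothetical collection size
for exponent in (0.8, 1.1):
    share = top_share(exponent, k=10, n=n_items)
    print(f"power {exponent}: top-10 share = {share:.1%}")
```

With the number of items held fixed, the larger power gives the top ten a noticeably larger share. The power law form alone does not pin down the amount of inequality.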

growth of Internet traffic in Japan

Good, large-sample data on Internet traffic is hard to find. However, an excellent study of Internet traffic in Japan [Cho et al. (2006)] describes 42% of Japanese (public) Internet traffic for six one-month observations spanning Sept. 2004 to May 2006.

This study reports many important facts. The ratio of bytes sent to residential customers to bytes received from residential customers was 1.3 — remarkably symmetric. Peer-to-peer applications using dynamic port assignment account for most of the traffic. The distribution of bandwidth use by users is a high-powered power law. Most importantly, specific users’ positions in that distribution vary over time and are not well identified with demographic and other customer characteristics (other than fiber connection, i.e. bandwidth availability).

The residential traffic growth figures for Japan vary considerably across month intervals. From Sept. 2004 to Nov. 2004, total traffic (aggregated inbound and outbound) grew 180% on an annualized basis. In subsequent 6-month intervals, annualized traffic growth rates were 58%, 19%, and 37%. Data on growth in fiber-to-the-home (FTTH) connections and total connections also shows volatility, but this does not seem to explain the reported traffic growth volatility. Any ideas about what explains it?
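
For readers who want to check annualized figures like these, here is a minimal sketch of the usual compound-growth convention. It assumes growth compounds at a constant rate within each interval. I have not confirmed that this is exactly how the figures above were computed, and the function name annualized_growth and the sample traffic levels are hypothetical.

```python
def annualized_growth(start, end, months):
    """Annualized growth rate implied by going from `start` to `end` in `months` months."""
    return (end / start) ** (12 / months) - 1

# Hypothetical traffic levels (arbitrary units) six months apart.
print(f"{annualized_growth(100.0, 126.0, 6):.0%} per year")

# Working backward: growth over two months that would annualize to 180%.
two_month_ratio = (1 + 1.80) ** (2 / 12)
print(f"two-month growth of about {two_month_ratio - 1:.0%} annualizes to 180%")
```

Short intervals magnify noise under this convention. A modest two-month jump turns into a very large annualized rate, which may account for some of the apparent volatility.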

According to my calculations, (public) Internet traffic in Japan from Sept. 2004 to May 2006 grew about 60% per year. This figure aggregates inbound and outbound traffic and includes residential broadband customers and non-residential broadband customers (leased lines, data centers, dialup). I estimated non-residential traffic for the 7-ISP sample using the reported residential/non-residential traffic figures from the 4-ISP sample in Nov. 2005, and the non-residential broadband growth rate for the 4-ISP sample from Sept. 2004 to May 2006.

This estimated annualized growth of Internet traffic in Japan is much higher than estimated annualized growth of (non-voice channel) bandwidth in use in the U.S. across the 1990s. From 1989 to 1999, total DDS, DS1, and DS3 channel termination bandwidth in the U.S. grew an estimated 27% per year (see Table P6 in U.S. Bandwidth Price Trends in the 1990s). The difference between a 60% growth rate per year and a 27% growth rate per year becomes huge after only a small number of years.
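
A minimal sketch of that compounding, assuming both quantities start at the same level and each grows at a constant annual rate. The equal starting level and the function name relative_size are assumptions for illustration only.

```python
def relative_size(fast_rate, slow_rate, years):
    """Ratio of the faster-growing quantity to the slower one after `years` years,
    assuming both start at the same level and grow at constant annual rates."""
    return ((1 + fast_rate) / (1 + slow_rate)) ** years

for years in (1, 3, 5, 10):
    ratio = relative_size(0.60, 0.27, years)
    print(f"after {years:>2} years: {ratio:.1f} times as large")
```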

The Japanese study does not encompass bandwidth deployed in private networks. The ratio of Internet bandwidth to total inter-office bandwidth may have been about 15% in the U.S. in 1998 (see Growth in the “New Economy”, p. 6). The Japanese data show a much faster growth rate of non-residential bandwidth than residential bandwidth, but the former is only about two-thirds the size of the latter. This is consistent with a large share of non-residential bandwidth not being incorporated into the public Internet.

An important point: the re-organization of network transmission protocols on existing networks tends to occur very slowly. Astonishing fact: in the U.S., about 90% of mobile-phone communications towers use traditional, copper-based TDM backhaul. Legacy networks hang around for a long time.

Reference:

Cho, Kenjiro, Kensuke Fukuda, Hiroshi Esaki, and Akira Kato, “The Impact and Implications of the Growth of Residential User-to-User Traffic,” SIGCOMM 2006.

Carnival of the Bureaucrats #1

Welcome to the inaugural edition of the Carnival of the Bureaucrats!

The best entry in this edition has been unanimously judged to be Foreign Policy’s photo essay, The State at Work. An appropriately selective quotation from its first page:

Civil servants are asked to do the people’s work with very little, sometimes with nothing at all. They see to it that the job gets done ….. Meet the bureaucrats.

My personal favorite: Josephine George-Francis, governor of Montserrado County, Liberia. According to the photo essay, she “sewed the Liberian flag that hangs in her office.” That’s exactly the kind of resourcefulness and dedication that characterizes many bureaucrats in much lower positions in countries all across the spectrum of per capita income.

Honorable Mention for appreciation of bureaucrats by a non-bureaucrat goes to Stowe Boyd at /message, for declaring, “the only answer to the Cold War is government intervention.” A few decades ago this insight could have saved googols of rubles and dollars.

The heart and soul of bureaucracy is editing. Editing is what makes bureaucracy great. The New York Times recently showed considerable bureaucratic mettle in editing a letter to the editor. The letter was eventually withdrawn, with the decisive issue being the use of “rubbish”. Outstanding! Even better, this matter produced 14 emails over five days of pondering the issues. Read the emails (pdf file) for yourself and cheer!

Jon Swift presents Homeland Security Thinks Outside the Box posted at Jon Swift. Swift observes, “Homeland Security Department stays one step ahead of the terrorists by not only anticipating likely terrorist targets but even anticipating the terrorists’ anticipation of our anticipating them.” This work appears to draw insights from the scholarly field of game theory. It undoubtedly can generate many additional papers and results. Swift apparently is a federal bureaucrat stationed in Alaska. Perhaps the exceptional quality of his blogging will win him enough friends to get a position in DC.

Nedra Weinreich presents The Insider’s Guide to Writing a Winning Proposal posted at Spare Change. This submission was filed after the deadline for the Carnival. Moreover, it did not include a request for a waiver of the Carnival rules. Hence this entry has been rejected. We do not reach the issue of whether, if a properly prepared waiver request had been filed, that waiver request would have been granted.

That concludes this edition of the Carnival of the Bureaucrats. Submit your blog article to the next edition using our carnival submission form. Submissions should conform to the Carnival regulations. Past posts and future hosts can be found on our blog carnival index page.

false copyright and false authors’ rights claims

The Internet enables sharing the intelligence and creativity of persons around the globe. Along with this exciting new set of possibilities is a little recognized shadow: the increased opportunity cost of false copyright and false authors’ rights claims.

In his insightful article, Jason Mazzone examines the problem of false copyright claims with respect to U.S. copyright law. He observes:

Copyright law itself creates strong incentives for copyfraud [false copyright claims]. The limited penalties for copyfraud under the [U.S.] Copyright Act, coupled with weak enforcement of these provisions, give publishers an incentive to claim ownership, however spurious, in everything. Although falsely claiming copyright is technically a criminal offense under the Act, prosecutions are extremely rare. Moreover, the Copyright Act provides no civil penalties for claiming copyrights in public domain materials. [Mazzone, pp. 1029-30]

Mazzone cites numerous examples of what he considers to be copyfraud. One gross, but not unusual, example that he cites is a popular pocket version of the U.S. Constitution. It includes a copyright notice and the admonition “[n]o part of this publication may be reproduced or transmitted in any form or by any means…without permission in writing from the publisher.”

The aggregate cost of copyfraud has probably increased greatly over the past decade. Using small quotes from a copyrighted text to document, illustrate, or advance discussion of a related issue is widely recognized to be fair use. Bloggers do this extensively. Falsely asserting that such use is not permitted probably doesn’t have much effect. Other forms of textual copyfraud may have significant cost. But fair use of copyrighted images, audio, and video is much less clear legally, and much less well understood publicly, than fair use of copyrighted text. Moreover, over the past decade there has been an astonishing expansion of possibilities for creating and sharing non-textual works. The cost of copyfraud with respect to images, audio, and video has increased with this expansion. The cost of this type of copyfraud has probably become larger than the cost of copyfraud with respect to text.

Getting copyright and authors’ rights to serve better the common good requires more attention to the economic and legal implications of false claims.