lack of power laws and other popularity problems

Discussion of tails and popularity seems to be maturing into more comprehensive considerations. Folks aiming for uber-geek status might chat about the difference between power laws and log-normal distributions. This difference has similar consequences to the difference between non-stationary and stationary macroeconomic time series. If those terms are obscure to you, you might just chat about how much bigger infinity is than any other number. It’s a huge issue!

Government bureaucrats and other practical, get-the-job-done types might just try to produce some simple, intuitive, and relevant graphs. Consider, for example, log-log graphs of website page traffic by page rank. They show a “drooping tail” relative to a line that approximates the left (most popular) part of the popularity distribution.

Whether the droop is a typical characteristic of web page popularity distributions is not clear. Surely a relatively large subset of relatively bad (unlinked, search-word-poor, spam-associated) pages could contribute to the droop. On the other hand, the droop could be a typical effect of the usual distribution of page content and general patterns of linking and searching. These two possibilities could be tested by comparing the magnitude of the droop in different websites’ page traffic distributions.

Fitting a line to a popularity distribution is more useful as a descriptive technique than as a literal claim that the popularity distribution follows a power law. The term “power law” is not meaningful to most persons. Moreover, knowledge about power laws does not provide a lot of insight into the factors that determine website traffic or trends in website traffic over time.

The possibilities for statistical distributions are not limited to power laws (or power laws and log-normal distributions). The personal behavior and the information and communication systems that affect page popularity are complex. They may not be uniform across different circumstances. For example, the factors that govern the popularity of the least popular pages may be rather different from the factors that govern the popularity of the most popular pages.

Power laws and log-normal distributions are two-parameter distributions. The distributional form that best characterizes traffic to all pages may have many more than two parameters. Compared to a log-normal distribution or a distribution with more than two parameters, an approximating line has greater value for providing a simple, intuitive description of an important part of the popularity distribution.

Two website traffic distributions suggest that website traffic distributions may have flattened over the past decade. Traffic to Sun’s website pages in July 1996, according to my calculation, had a descriptive log-log line with slope about -1.1. Traffic to useit.com pages in the summer of 2006 had a descriptive log-log line with slope about -0.8.* This difference may reflect differences between the two websites. It might also indicate flattening over time in typical website traffic distributions. Such flattening was also a general pattern of change in name popularity distributions in England (and the U.S.) from 1800 to the present.
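
For readers who want to tinker, here is a minimal sketch of how such a descriptive log-log line can be fit. The page-view counts and the choice of ten pages are made up for illustration; nothing below comes from the Sun or useit.com data.

    import numpy as np

    # Hypothetical page-view counts for a site's ten most-visited pages, ordered by rank.
    page_views = np.array([50000.0, 21000.0, 12000.0, 7500.0, 5200.0,
                           3900.0, 3000.0, 2400.0, 2000.0, 1700.0])

    # Normalize to traffic shares so the fitted intercept does not depend on
    # total site traffic (see the footnote below on labeling the y-axis).
    shares = page_views / page_views.sum()
    ranks = np.arange(1, len(shares) + 1)

    # Ordinary least squares on log(share) vs. log(rank); the slope is the
    # descriptive log-log slope of the kind quoted above.
    slope, intercept = np.polyfit(np.log10(ranks), np.log10(shares), 1)
    print(f"descriptive log-log slope: {slope:.2f}")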

Much more evidence of this sort might lead one to expect website traffic distributions, and perhaps other distributions for digital goods, to continue to flatten in the future. “More of the same” is a very crude sort of prediction. But absent any other relevant information, it might be the best that one can make.

Knowledge concerning the factors that have produced changes in popularity distributions could help to predict and shape future changes. Persons may have chosen less popular names in response to changes in patterns of work and residence that put persons closer together and more extensively and uniformly regulated their interactions. In short, personalization may have been a personal counter-reaction to factories and urbanization (the Industrial Revolution).

Suppose new information and communication technologies favor a more dispersed workforce, and family structure continues to shift toward more single persons and smaller households. My theory then predicts a decrease in personalization and an increased desire to associate with popular symbols.

Umm, just about all current trends indicate either that my theory is wrong or that there are a lot of other, much more important factors. Does anyone have some better ideas?

* This is the slope measured in log-log coordinates, not in the coordinates of the axes’ labels. The intercept on a log(page traffic) axis typically varies greatly with total website traffic. Thus I prefer graphs that have a y-axis labeled with log(page traffic share in total website traffic). An approximating line has the same slope with either labeling for the y-axis.

more discussion about tail size

Once again men are heatedly discussing tail size. Just ponder this question: How large is the long tail? Personally, I’m going to keep looking before I decide for myself.

While it’s novel to bring mathematical precision to such matters, it seems to me that this model unfortunately focuses attention on misleading features. The model says that the share of the k most popular items is log(k)/log(n), where n is the total number of items on offer. Thus, in this model, the total number of items on offer determines the share of the most popular items.

This isn’t a sensible model. Mathematically, a power law describes an infinite number of items on offer. The slope of the power law, or more precisely, the slope of an approximating power law at the high popularity end of the distribution, usually describes well the high-end shares. The question is what determines the slope of the power law. The number of items on offer isn’t a good answer to that question, particularly for n varying from two million to six billion.
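
To see how strongly this model ties the top shares to n, here is a quick check using the two endpoints mentioned above; the choice of k = 10 is mine, purely for illustration.

    import math

    # The criticized model: share of the k most popular items = log(k)/log(n).
    k = 10
    for n in (2_000_000, 6_000_000_000):
        share = math.log(k) / math.log(n)
        print(f"n = {n:>13,}: top-{k} share = {share:.1%}")

    # Prints roughly 15.9% for n = 2 million and 10.2% for n = 6 billion.
    # Only n moves the share in this model, which is exactly the objection:
    # the number of items on offer is not what drives high-end shares in practice.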

For a concrete example, consider the popularity of the ten most popular given names. The set of possible given names (given names on offer) is huge, and probably hasn’t changed much in the past two hundred years. However, the popularity of the ten most popular given names for males in England has fallen from about 85% in 1800 to about 28% in 1994. If you want to understand changes in the popularity of the most popular items in a collection of symbols instantiated and used in a similar way, try to understand this change.

* * *

For additional amusement, here’s a post I stuck in the galbithink.org newsfeed a little more than a year ago, back in the time of Web-Pleistocene:

Tail aficionados might enjoy pondering the distinguishing features of the long tail. I think that size, which tail authorities have categorized as long or short, matters less than shape. It should be no surprise to anyone that shape can change over time. For some graphical evidence, see the detailed images here.

So don’t just sit around complaining that “diversity plus freedom of choice creates inequality”. Power laws don’t imply any particular amount of inequality. The power of the power law determines the difference between tails. Look at some examples and see for yourself!

growth of Internet traffic in Japan

Good, large-sample data on Internet traffic is hard to find. However, an excellent study of Internet traffic in Japan [Cho et al. (2006)] describes 42% of Japanese (public) Internet traffic for six one-month observations spanning Sept. 2004 to May 2006.

This study reports many important facts. The ratio of bytes sent to residential customers to bytes received from residential customers was 1.3 — remarkably symmetric. Peer-to-peer applications using dynamic port assignment account for most of the traffic. The distribution of bandwidth use by users is a high-powered power law. Most importantly, specific users’ positions in that distribution vary over time and are not well identified with demographic and other customer characteristics (other than fiber connection, i.e. bandwidth availability).

The residential traffic growth figures for Japan vary considerably across month intervals. From Sept. 2004 to Nov. 2004, total traffic (aggregated inbound and outbound) grew 180% on an annualized basis. In subsequent 6-month intervals, annualized traffic growth rates were 58%, 19%, and 37%. Data on growth in fiber-to-the-home (FTTH) connections and total connections also shows volatility, but this does not seem to explain the reported traffic growth volatility. Any ideas about what explains it?
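
For anyone trying to replicate figures like these, here is a sketch of the usual compounding formula for annualizing growth observed over a sub-year interval. The traffic volumes are invented; only the formula is the point.

    def annualized_growth(start_volume: float, end_volume: float, months: float) -> float:
        """Compound the growth observed over `months` months up to a 12-month rate."""
        return (end_volume / start_volume) ** (12.0 / months) - 1.0

    # Example: hypothetical traffic growing from 100 to about 119 units over a
    # two-month interval annualizes to roughly 180% per year, comparable to the
    # figure reported for Sept. 2004 to Nov. 2004.
    print(f"{annualized_growth(100.0, 118.7, 2.0):.0%}")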

According to my calculations, (public) Internet traffic in Japan from Sept. 2004 to May 2006 grew about 60% per year. This figure aggregates inbound and outbound traffic and includes residential broadband customers and non-residential broadband customers (leased lines, data centers, dialup). I estimated non-residential traffic for the 7-ISP sample using the reported residential/non-residential traffic figures from the 4-ISP sample in Nov. 2005, and the non-residential broadband growth rate for the 4-ISP sample from Sept. 2004 to May 2006.

This estimated annualized growth of Internet traffic in Japan is much higher than estimated annualized growth of (non-voice channel) bandwidth in use in the U.S. across the 1990s. From 1989 to 1999, total DDS, DS1, and DS3 channel termination bandwidth in the U.S. grew an estimated 27% per year (see Table P6 in U.S. Bandwidth Price Trends in the 1990s). The difference between a 60% growth rate and a 27% growth rate becomes huge after only a small number of years.
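
A quick compounding calculation shows just how huge; the starting volumes are assumed equal and arbitrary, so only the ratio matters.

    for years in (1, 5, 10):
        ratio = (1.60 ** years) / (1.27 ** years)
        print(f"after {years:>2} years, traffic growing 60% per year is {ratio:.1f}x "
              "larger than traffic growing 27% per year")

    # Roughly 1.3x after one year, 3.2x after five years, and 10x after ten years.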

The Japanese study does not encompass bandwidth deployed in private networks. The ratio of Internet bandwidth to total inter-office bandwidth may have been about 15% in the U.S. in 1998 (see Growth in the “New Economy”, p. 6). The Japanese data show a much faster growth rate for non-residential bandwidth than for residential bandwidth, but the former is only about two-thirds the size of the latter. This is consistent with a large share of non-residential bandwidth not being incorporated into the public Internet.

An important point: the re-organization of network transmission protocols on existing networks tends to occur very slowly. Astonishing fact: in the U.S., about 90% of mobile-phone communications towers use traditional, copper-based TDM backhaul. Legacy networks hang around for a long time.

Reference:

Cho, Kenjiro, Kensuke Fukuda, Hiroshi Esaki, and Akira Kato, “The Impact and Implications of the Growth of Residential User-to-User Traffic,” SIGCOMM 2006.

Carnival of the Bureaucrats #1

Welcome to the inaugural edition of the Carnival of the Bureaucrats!

The best entry in this edition has been unanimously judged to be Foreign Policy’s photo essay, The State at Work. An appropriately selective quotation from its first page:

Civil servants are asked to do the people’s work with very little, sometimes with nothing at all. They see to it that the job gets done…. Meet the bureaucrats.

My personal favorite: Josephine George-Francis, governor of Montserrado County, Liberia. According to the photo essay, she “sewed the Liberian flag that hangs in her office.” That’s exactly the kind of resourcefulness and dedication that characterizes many bureaucrats in much lower positions in countries all across the spectrum of per capita income.

Honorable Mention for appreciation of bureaucrats by a non-bureaucrat goes to Stowe Boyd at /message, for declaring, “the only answer to the Cold War is government intervention.” A few decades ago this insight could have saved googols of rubles and dollars.

The heart and soul of bureaucracy is editing. Editing is what makes bureaucracy great. The New York Times recently showed considerable bureaucratic mettle in editing a letter to the editor. The letter was eventually withdrawn, with the decisive issue being the use of “rubbish”. Outstanding! Even better, this matter produced 14 emails over five days of pondering the issues. Read the emails (pdf file) for yourself and cheer!

Jon Swift presents Homeland Security Thinks Outside the Box posted at Jon Swift. Swift observes, “Homeland Security Department stays one step ahead of the terrorists by not only anticipating likely terrorist targets but even anticipating the terrorists’ anticipation of our anticipating them.” This work appears to draw insights from the scholarly field of game theory. It undoubtedly can generate many additional papers and results. Swift apparently is a federal bureaucrat stationed in Alaska. Perhaps the exceptional quality of his blogging will win him enough friends to get a position in DC.

Nedra Weinreich presents The Insider’s Guide to Writing a Winning Proposal posted at Spare Change. This submission was filed after the deadline for the Carnival. Moreover, it did not include a request for a waiver of the Carnival rules. Hence this entry has been rejected. We do not reach the issue of whether, if a properly prepared waiver request had been filed, that waiver request would have been granted.

That concludes this edition of the Carnival of the Bureaucrats. Submit your blog article to the next edition using our carnival submission form. Submissions should conform to the Carnival regulations. Past posts and future hosts can be found on our blog carnival index page.

false copyright and false authors’ rights claims

The Internet enables sharing the intelligence and creativity of persons around the globe. Along with this exciting new set of possibilities comes a little-recognized shadow: the increased opportunity cost of false copyright and false authors’ rights claims.

In his insightful article, Jason Mazzone examines the problem of false copyright claims with respect to U.S. copyright law. He observes:

Copyright law itself creates strong incentives for copyfraud [false copyright claims]. The limited penalties for copyfraud under the [U.S.] Copyright Act, coupled with weak enforcement of these provisions, give publishers an incentive to claim ownership, however spurious, in everything. Although falsely claiming copyright is technically a criminal offense under the Act, prosecutions are extremely rare. Moreover, the Copyright Act provides no civil penalties for claiming copyrights in public domain materials. [Mazzone, pp. 1029-30]

Mazzone cites numerous examples of what he considers to be copyfraud. One gross, but not unusual, example that he cites is a popular pocket version of the U.S. Constitution. It includes a copyright notice and the admonition “[n]o part of this publication may be reproduced or transmitted in any form or by any means…without permission in writing from the publisher.”

The aggregate cost of copyfraud has probably increased greatly over the past decade. Using small quotes from a copyrighted text to document, illustrate, or advance discussion of a related issue is widely recognized to be fair use. Bloggers do this extensively. Falsely asserting that such use is not permitted probably doesn’t have much effect. Other forms of textual copyfraud may have significant cost. But fair use of copyrighted images, audio, and video is much less legally clear and publicly well-understood than fair use of copyrighted text. Moreover, over the past decade there has been an astonishing expansion of possibilities for creating and sharing non-textual works. The cost of copyfraud with respect to images, audio, and video has increased with this expansion. The cost of this type of copyfraud has probably become larger than the cost of copyfraud with respect to text.

Getting copyright and authors’ rights to better serve the common good requires more attention to the economic and legal implications of false claims.

Tour de Frolorado Rocked By Scandal

By Mike Schiavo,
Cycling News Special Report

JAVA SHACK (Arlington, VA) – The final results of the Tour de Frolorado were thrown into chaos today when the unorthodox training methods of GC winner Taxman were unearthed by a team of Cycling News investigative reporters.

After a thorough examination of the winner’s training records, our reporters unearthed an unusually high running/cycling ratio in the Taxman’s training logs. If verified by the French Agency for Ridiculous Testing (FART), Taxman could be stripped of his title, fired from his team, and miss out on tens of dollars in endorsement money that he would have earned as the TdF champion.

“We were as surprised as anybody,” lead reporter Bob Roll stated. “While we certainly knew about his 8-week off-the-bike taper program, intensive TV-watching, and strict dietary regimen of beer, pizza, cheeseburgers and chocolate chip cookies, the ratio of running miles to cycling miles came as a complete shock.”

“Running is natural for me,” Taxman said in a phone interview from his home in Arlington. “I’m not doing anything illegal, and I’m certainly not trying to hide anything. I intend to defend myself fully from these ridiculous charges.”

“He’s done if it’s true,” team captain Weasel said. “Cycling doesn’t need this, Lanterne Rouge doesn’t need this, and the Tour de Frolorado definitely doesn’t need this.”

FART, the international authority on the running/cycling test, has a well-established acceptable ratio of 1%, or one mile of running for every 100 miles of cycling. Hard-core cyclists are routinely tested and usually have no trouble falling within this guideline.

Dick “Dick” Pound, head of FART, explained: “We understand and accept that cyclists have to do a certain amount of running in their everyday lives. Sometimes you just can’t help running across the room to answer the phone, running after your kids, running errands, or even running up a tab. It’s unavoidable. But to run for exercise? On purpose? Cycling will just not tolerate that.”

The current controversy centers on the period May 29 – July 31, 2006, a time during which Taxman was supposedly recovering from a fractured shoulder. The stunning revelation is that during this time, his training log includes only a handful of entries for cycling, but as many as three entries per week for jogging and/or running.

In fact, for three of the weeks in question, our reporters were unable to calculate the ratio because Taxman logged zero cycling miles. “The numbers don’t lie,” Roll said. “We checked and re-checked the data, but you just can’t divide by zero.”

At the time of this posting, the effect on the final results of the TdF is unclear. What is certain, however, is that Taxman is planning a vigorous defense, starting with this sternly-worded statement by his agent Justin Gatlin: “This is sabotage, pure and simple. Somebody gained access to his training logs without his knowledge and added fictitious running entries. As for his alleged recent purchase at Metro Run and Walk, somebody obviously stole his identity and treated themselves to new running shoes. Come on, they don’t even make his favorite running shoe anymore…um, I mean…it’s sabotage, OK!!!” When asked about the equipment bag confiscated from Taxman’s car, a bag that contained running shoes, shorts and a running watch, Gatlin had no comment.

The uneasy relationship between running and cycling is not new. Seven-time Tour de France champion Lance Armstrong cut his competitive teeth in triathlon, an event that involves running, cycling, and swimming. (“At least nobody’s accusing him of swimming,” Taxman’s mother chimed in from Florida.) Now that Armstrong is retired from the pro peloton he is training for the New York marathon. “Old habits die hard,” Armstrong said. “As I have said time and time again, however, during my career as a professional cyclist, I never tested positive for an unusual running/cycling ratio.”

(Editor’s note: the “Tour de France” is a three-week bicycling race that many riders use as a warmup for the Tour de Frolorado.)

The Lanterne Rouge Wins the Tour de Frolorado!

The past few weeks have produced stunning news in the world of competitive cycling. But no news is more fantastic than the unprecedented victory of Taxman and the Lanterne Rouge in the Tour de Frolorado. Follow all the action on YouTube:

Or, if you prefer, you can find the Tour de Frolorado on Google Video. Also, here’s a transcript covering all the stages.

This historic news video is freely available for remixes, mash-ups, abbreviations, extensions, and DVD burning. You can download it from the Tour de Frolorado page in the Internet Archive. Some fun projects to contemplate and perhaps even complete:

  1. Add captions to make the video more accessible to deaf and hard-of-hearing persons.
  2. Include a memorial for Ben Inglis. He was tragically killed in an accident in the last stage of the tour.
  3. Include other cyclists, other teams …anyone who wants to be part of the Tour.
  4. Think you should have won a stage, or the whole Tour? Go for it!

Please include the existing credits in your new video. If you get a chance, please send me a link to your work. Then I might feel a small measure of paternal delight and share your new creation with all my friends.

For your viewing pleasure, the prologue and stage 1 of this epic bicycle race are right here:

Note: This post has been updated as I’ve sorted out various problems.

presence for you

While I consider these presence management services more useful, I like the way availabot relates information to atoms:

Rather than showing up on your screen, it shows availability as a physical object in the world. That means that you can move the puppet out of view when you don’t want to be distracted, watch out for it when you’re working on other tasks, and have a background awareness of your friends from the corner of your eye. [availabot site]

The puppet is personalized with custom fabrication technology. A more cost-effective way to do this might be to have a custom image printed on the outside of an inflatable shape. Inflation is associated with human character, spirit, and mood (buoyant, inflated ego, fat-headed, depressed, drooping, etc.). With some cheap internal pneumatics, the object could provide a richer sense of presence.

Availabot is a practical application of the sort of general program that MIT’s fabulous Fab Lab has been promoting. Innovation in information technology over the past decades has been dazzling. But I don’t think information technology can even approach the value of the information embodied in the order of matter in the real world. Information technology can create value most effectively by leveraging the value of the real world.

A lot of work on presence seems to be oriented toward services for alpha information geeks in their professional lives. Microsoft’s Business Division is taking the lead for its Unified Communications Strategy. Discussion of that strategy tends to focus on how best to manage information in communication:

It’s the intersection of the fundamentals of presence and business processes that will provide the value that customers are looking for. [Alex Sanders . Log]

That may well be true in business situations. But communication is not just about information transfer, and non-business communications is a huge field of value.

Business executives may have a natural bias to underestimate the value of non-business communication.

About 1877, within a year after Alexander Graham Bell had publicly demonstrated telephony, the president of Western Union Telegraph Company turned down the opportunity to buy all the rights to Bell’s telephone. He is reported to have remarked, “What use could this company make of an electrical toy?” [Sense in Communication]

In the late nineteenth century, the Bell System primarily marketed its telephone to business users. Perhaps 90% of its subscribers were business subscribers. When the Bell System’s patent on the telephone expired in 1894, independent telephone companies entered the industry and flourished by providing residential telephone services that the Bell System had largely neglected. These independent telephone companies are now at the center of very important and contentious policy questions about universal service funds.

human rights to communicate using radio devices

Article 19 of the Universal Declaration of Human Rights recognizes a right to freedom of expression. Article 10 of the European Convention on Human Rights does likewise. Regulation of the use of radio devices can restrict freedom of expression. What sort of radio regulation is justified under human rights law?

Under the prompting of Open Spectrum, human rights organizations are beginning to consider this question. With respect to licensing requirements (one type of restriction on radio use), a human rights organization called Article 19 stated in a brief note (MS Word doc):

A licence requirement for wireless communications devices clearly constitutes a restriction and therefore it must be 1) provided by law; 2) serve a legitimate aim; and 3) be necessary for the attainment of that aim.

To be necessary for attaining a legitimate aim under law, a restriction must be no more restrictive than feasible alternatives, narrowly tailored, and proportionate to the aim.

Article 19’s brief but pioneering analysis seems to have at least one weakness. The analysis suggested that “preventing chaos in the frequency spectrum is a legitimate goal” under human rights law. Under Article 19(3) of the International Covenant on Civil and Political Rights, one legitimate aim for restricting freedom of expression is to protect public order. One might consider protecting public order to encompass preventing chaos in the frequency spectrum. However, focus on order among frequencies, like focus on relations between bodies of water, can lead to law with little connection to the facts of communication among persons. Particularly with respect to human rights, public order is probably better understood in terms of order among persons (the public). Compared to the extent of chaos in the frequency spectrum, actual personal freedom to communicate is a much more meaningful public issue.

Freedom of expression is directly related to the real circumstances of contemporary life. In an insightful response to Ofcom’s consultation entitled “Spectrum Framework Review,” Open Spectrum UK noted:

The justifications given in the current consultation for utilising market forces refer to maximising economic benefits and spectrum efficiency. However, one must not forget that the regulation of radio was instituted internationally not to control interference but to reign in the business practices of the Marconi Wireless Telegraph Company.

Knowledge of technology and examination of leading practices world-wide provides insight into what sort of communications capabilities persons could have at a given time. Radio regulation that deprives persons of these capabilities deserves to be assailed as a violation of human rights.