Easy to Zipf around
How many times does the word “the” appear in this sentence? Twice, of course. How many times does it appear in this paragraph? Twice still, though you’re probably already wondering why I’m asking silly questions. But what if I asked these slightly different questions instead: Which word appears most frequently in this article? Which word appears second most frequently? Almost certainly, the answers will be “the” and “of”, respectively. This holds for any reasonably long piece of English text, and “and” is third-placed.
Bear with me for a few seconds more. Let’s say you picked up a novel that’s about 100,000 English words long. “The” will indeed appear more often in it than any other word. If you actually counted, you’d find about 7% of its words are “the”, or 7,000 occurrences. That’s just a feature of the language. But here’s something for you to chew on. “Of” will appear about 3,500 times, or 3.5%. That’s also a feature of the language, but does that number wrinkle your brow at all? What would you have guessed: That second-placed “of” would appear just slightly less frequently than “the”? Or about half as frequently?
More to chew on: Bronze-medallist “and” will show up about 2,300 times, or 2.3%: about one-third as often as “the”. So here’s the totally counter-intuitive thing in all this: “of” appears about half as often as “the”; “and” about a third as often; and so on down the list of most-used words in the language. That is, how many times a given word appears in a large chunk of text is inversely proportional to its rank. So if you asked about the 355th ranked word, “true” (see: http://bit.ly/1ziIIMT), it would probably appear about 7,000/355, or 20 times in that novel.
This curiosity is Zipf’s Law.
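To make the rule concrete, here is a minimal sketch in Python of the arithmetic above. The function name zipf_predicted_count and the constant TOP_COUNT are made up for this illustration, and the 7,000 figure is simply the rough count for “the” in a 100,000-word novel quoted earlier, not a measured value.

```python
def zipf_predicted_count(top_count, rank):
    """Predicted occurrences of the word at `rank`, given the count of the
    top-ranked word: the top count divided by the rank."""
    if rank < 1:
        raise ValueError("rank starts at 1")
    return top_count / rank


# Assumption: "the" appears about 7,000 times in a 100,000-word novel.
TOP_COUNT = 7000

# Ranks as quoted in the column; "true" is the 355th-ranked word.
for word, rank in [("the", 1), ("of", 2), ("and", 3), ("true", 355)]:
    predicted = zipf_predicted_count(TOP_COUNT, rank)
    print(f"{word!r} (rank {rank}): about {predicted:.0f} occurrences")
```

The predictions are only as good as the starting count, so treat them as rough expectations rather than exact figures.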
A small caveat here: Strictly, even 100,000 words is not really an adequate sample for all these Zipf numbers to show up accurately. The idiosyncrasies of a particular writer, for one thing, will change them. In my own 100,000-word book Roadrunner, for example, 355th-ranked “true” does appear 20 times, but 280th-ranked “took”, which Zipf predicts should appear about 25 times, appears 57 times. It’s true, I cannot explain why I was partial to “took” in that book, though I took a look.
Still, if you plot a graph of word rank vs frequency for Roadrunner, Zipf’s broad pattern begins to take shape even for its 100,000 words. Make such a plot for a much larger grab-bag of text—novels, magazines, cookbooks, newspapers, even TV shows and films—and take my 45th-ranked “word” for it, you’ll see a much closer cleaving to Zipf’s Law.
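If you want to try this on a text of your own, the counting behind such a plot might look like the sketch below. The function rank_frequency and the sample sentence are invented for illustration. The word-splitting rule (lowercase runs of letters and straight apostrophes) is a simplifying assumption, and a real book would need more careful handling of punctuation and curly quotes.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return rows of (rank, word, count, zipf_prediction), most frequent
    word first. The prediction is the top word's count divided by the rank."""
    # Simplifying assumption: a "word" is a run of letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words).most_common()
    if not counts:
        return []
    top_count = counts[0][1]
    return [
        (rank, word, count, top_count / rank)
        for rank, (word, count) in enumerate(counts, start=1)
    ]


# A toy sentence, far too short to show Zipf's pattern reliably.
sample = "the cat and the dog and the bird saw the end of the day"
for rank, word, count, predicted in rank_frequency(sample)[:3]:
    print(rank, word, count, round(predicted, 1))
```

Plotting rank against count on logarithmic axes is the usual way to look for the pattern: if the text follows Zipf's Law, the points fall roughly along a straight line.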
The law is named after the Harvard University linguist George Kingsley Zipf, who found it held for the word frequencies he was studying in the 1930s. But other scientists had run across the relationship before. Turns out that it holds not just for English, but for pretty much any language you might choose. Even more interestingly, Zipf’s Law applies to other phenomena too. Zipf himself suggested that his law held for people’s incomes in a given country. And Felix Auerbach, a German physicist, found in 1913 that the relationship seemed to hold for the population of cities in a given country. That is, the city with the most people is about twice as populous as the second most populous city, about three times as populous as the third, and so on.
Naturally, reading that prompted me to dig up population figures for Indian cities. The list goes like this: New Delhi, 22 million; Mumbai, 18.4 million; Bengaluru, 8.4 million; Chennai, 7 million; Hyderabad, 6.8 million; Ahmedabad, 5.6 million and so on. What about China? Chongqing, 49.2 million; Shanghai, 24.2 million; Beijing, 21.5 million; Tianjin, 15.5 million; Chengdu, 14.4 million; Guangzhou, 14 million, etc.
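To set those figures against Zipf’s Law, divide the largest city’s population by each city’s rank and compare. Here is a minimal Python sketch of that comparison. The function zipf_comparison is a name made up for this illustration, and the populations are just the figures listed above, in millions.

```python
def zipf_comparison(populations):
    """Pair each city's actual population with its Zipf prediction
    (largest population divided by rank). Expects the list sorted
    from largest to smallest."""
    largest = populations[0][1]
    return [
        (name, actual, largest / rank)
        for rank, (name, actual) in enumerate(populations, start=1)
    ]


# Populations in millions, as quoted in the column.
india = [("New Delhi", 22.0), ("Mumbai", 18.4), ("Bengaluru", 8.4),
         ("Chennai", 7.0), ("Hyderabad", 6.8), ("Ahmedabad", 5.6)]
china = [("Chongqing", 49.2), ("Shanghai", 24.2), ("Beijing", 21.5),
         ("Tianjin", 15.5), ("Chengdu", 14.4), ("Guangzhou", 14.0)]

for country, cities in [("India", india), ("China", china)]:
    print(country)
    for name, actual, predicted in zipf_comparison(cities):
        print(f"  {name}: actual {actual} million, "
              f"Zipf prediction {predicted:.1f} million")
```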
If you do the arithmetic, you’ll find those numbers are generally higher than Zipf’s Law predicts. In fact, city populations tend to more closely follow a cousin of Zipf’s Law called Gibrat’s Law, though even that depends on how you define what a city is. Instead of using census data to do it, Bin Jiang and Tao Jia looked at the way streets are laid out and intersect, reasoning that “no streets (means) no human activity”. Defining “natural cities” this way, Jiang and Jia “found that Zipf’s Law holds remarkably well…across the United States” (“Zipf’s Law for All The Natural Cities In The United States: A Geospatial Perspective”, International Journal Of Geographical Information Science, 2011).
Definitions aside, the general point remains: Urban population figures, and word frequencies, and numbers like these from many other phenomena, don’t seem to be totally random even if that’s what intuition might suggest they should be. Instead, they tend to obey certain mathematical rules. Why?
We don’t quite understand that yet, though some mathematicians have worked out plausible statistical explanations that I won’t try to get into here. With language, Zipf himself thought that something called the principle of least effort was at work. Here’s how that makes sense to me: In order to make themselves understood, writers generally choose the simpler and more-often-used words rather than putting in the effort to find rarer ones. This is because simpler words are easier for readers to understand, but also because they are familiar and come to a writer’s mind more readily than the alternatives.
After all, consider the locutions I would of necessity have to utilize were I so inclined as to pontificate on the virtues or otherwise of these said alternatives.
Took me 15 minutes to construct that sentence. This is true.
Once a computer scientist, Dilip D’Souza now lives in Mumbai and writes for his dinners. A Matter Of Numbers explores the joy of mathematics, with occasional forays into other sciences.
Comments are welcome at firstname.lastname@example.org. Read Dilip’s Mint columns at www.livemint.com/dilipdsouza