Zipf and Zipfian distributions
The Zipf distribution, also known as the zeta distribution or discrete Pareto distribution, models the frequency of elements in ranked data where the frequency of an element is inversely proportional to a power of its rank. The Zipfian distribution is a variant with finite support. Both distributions arise naturally in linguistics, population distribution, web traffic analysis, and many other domains exhibiting power-law behavior.
Zipf's law, named after the linguist George Zipf, was first observed in the 1930s and 1940s as a pattern in the frequency of words in natural language texts. Zipf noted that the most frequent word in a text occurs approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on. This observation led to the formulation of Zipf's law, which has since been found to apply to a wide range of phenomena in fields such as linguistics, economics, and information science.
The historical significance of Zipf's law lies in its ability to reveal underlying regularities in complex systems. It has inspired extensive research into the mathematical and statistical properties of power-law distributions and their applications in modeling real-world data.
Definition
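The Zipf distribution has one parameter: the exponent $s > 1$. Its probability mass function (PMF) is
$$P(X = k) = \frac{k^{-s}}{\zeta(s)},$$
where $\zeta(s) = \sum_{n=1}^{\infty} n^{-s}$ is the Riemann zeta function.
The cumulative distribution function (CDF) can be expressed as:
$$F(k) = \frac{H_{k,s}}{\zeta(s)},$$
where $H_{k,s} = \sum_{n=1}^{k} n^{-s}$ is the generalized harmonic number.
The domain of the Zipf distribution is $k \in \{1, 2, 3, \ldots\}$.
The Zipfian distribution (with finite support) has two parameters: the exponent $s > 0$ and the population size $N$, a positive integer. Its PMF is
$$P(X = k) = \frac{k^{-s}}{H_{N,s}},$$
with $k \in \{1, 2, \ldots, N\}$.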
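As a rough illustration of the finite-support definition, the following self-contained sketch evaluates the Zipfian PMF by normalizing $k^{-s}$ with the generalized harmonic number. It assumes the usual form of the PMF, $k^{-s} / H_{N,s}$. The `ZipfianSketch` class and its methods are hypothetical helpers written for this example; they are not part of the library and do not show how the library computes these values.

```csharp
using System;

// Hypothetical helper for illustration only; not part of any library.
static class ZipfianSketch
{
    // Generalized harmonic number H_{n,s} = sum_{k=1}^{n} k^(-s).
    public static double HarmonicNumber(int n, double s)
    {
        double sum = 0.0;
        // Add the smallest terms first to limit rounding error.
        for (int k = n; k >= 1; k--)
            sum += Math.Pow(k, -s);
        return sum;
    }

    // Returns P(X = 1), ..., P(X = n) at indices 0, ..., n - 1.
    public static double[] Pmf(double s, int n)
    {
        if (s <= 0.0) throw new ArgumentOutOfRangeException(nameof(s));
        if (n < 1) throw new ArgumentOutOfRangeException(nameof(n));

        double h = HarmonicNumber(n, s);
        var p = new double[n];
        for (int k = 1; k <= n; k++)
            p[k - 1] = Math.Pow(k, -s) / h;
        return p;
    }
}

static class Program
{
    static void Main()
    {
        // Exponent 2.5 and population size 10.
        double[] pmf = ZipfianSketch.Pmf(2.5, 10);
        for (int k = 1; k <= pmf.Length; k++)
            Console.WriteLine($"P(X = {k}) = {pmf[k - 1]:F4}");
    }
}
```

Because the normalizing constant cancels, the ratio of the first two probabilities is $2^{s}$ for any population size, which is about 5.66 for an exponent of 2.5.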
Applications
The Zipf and Zipfian distributions have numerous applications across diverse fields due to their ability to model naturally occurring power-law relationships:
In linguistics, the frequency of words in natural language texts follows Zipf's law: the frequency of a word is roughly inversely proportional to its rank in the frequency table.
In urbanization studies, the sizes of cities within a country often follow a Zipfian distribution, where the largest city is about twice the size of the second largest, three times the size of the third largest, and so on.
In information retrieval, the frequency of web page visits and search term usage demonstrates Zipfian behavior, making this distribution crucial for search engine optimization and web analytics.
In bibliometrics, the distribution of citations among papers typically follows a Zipf-like pattern, with a small number of papers receiving a disproportionately large number of citations.
Properties
The Zipf and Zipfian distributions have several important statistical properties:
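Property | Zipf Distribution ($s > 1$) | Zipfian Distribution ($s > 0$, $N \geq 1$) |
---|---|---|
Mean | $\frac{\zeta(s-1)}{\zeta(s)}$ for $s > 2$ | $\frac{H_{N,s-1}}{H_{N,s}}$ |
Variance | $\frac{\zeta(s-2)}{\zeta(s)} - \left(\frac{\zeta(s-1)}{\zeta(s)}\right)^2$ for $s > 3$ | $\frac{H_{N,s-2}}{H_{N,s}} - \left(\frac{H_{N,s-1}}{H_{N,s}}\right)^2$ |
Mode | 1 | 1 |
Support | $\{1, 2, 3, \ldots\}$ | $\{1, 2, \ldots, N\}$ |
Entropy | $\ln \zeta(s) - s \frac{\zeta'(s)}{\zeta(s)}$ | $\ln H_{N,s} + \frac{s}{H_{N,s}} \sum_{k=1}^{N} \frac{\ln k}{k^{s}}$ |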
The infinite moments of the Zipf distribution highlight its heavy-tailed nature.
For instance, the mean exists only when $s > 2$, and the variance only when $s > 3$.
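The conditions for the existence of the moments follow from the PMF. A short derivation, assuming the PMF has the form $P(X = k) = k^{-s} / \zeta(s)$ with exponent $s$, gives the $m$-th raw moment as a ratio of zeta functions:

$$\mathbb{E}[X^m] = \sum_{k=1}^{\infty} k^{m} \, \frac{k^{-s}}{\zeta(s)} = \frac{1}{\zeta(s)} \sum_{k=1}^{\infty} \frac{1}{k^{\,s-m}} = \frac{\zeta(s-m)}{\zeta(s)}.$$

The series in the middle converges only when $s - m > 1$. The mean ($m = 1$) is therefore finite only for $s > 2$, and the second moment ($m = 2$), and with it the variance, only for $s > 3$.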
Relationships to Other Distributions
The Zipf and Zipfian distributions are connected to several other distributions:
The Zipf distribution is a discrete analog of the continuous Pareto distribution. If $X$ follows a Pareto distribution with shape parameter $\alpha$, then $\lfloor X \rfloor$ approximately follows a Zipf distribution with parameter $s = \alpha + 1$.
The Zipf-Mandelbrot distribution generalizes the Zipf distribution by introducing an additional parameter $q$ that shifts the ranks: $P(X = k) \propto (k + q)^{-s}$.
When $s \to \infty$, the Zipf distribution approaches a degenerate distribution concentrated at $k = 1$.
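The limiting case of a large exponent can be checked directly, again assuming the PMF $P(X = k) = k^{-s} / \zeta(s)$. Every term of the zeta series except the first vanishes as the exponent grows, so the probability of the first rank tends to one:

$$P(X = 1) = \frac{1}{\zeta(s)} = \frac{1}{1 + 2^{-s} + 3^{-s} + \cdots} \longrightarrow 1 \quad \text{as } s \to \infty.$$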
The ZipfDistribution class
The Zipf distribution is implemented by the ZipfDistribution class. It has one constructor that takes as its only argument the exponent, which must be greater than 1. The following constructs a Zipf distribution with exponent 2.5:
var zipf = new ZipfDistribution(2.5);
The ZipfDistribution class has one specific property: Exponent, which returns the exponent parameter.
The ZipfianDistribution class
The Zipfian distribution is implemented by the ZipfianDistribution class. It has one constructor that takes two arguments: the exponent and the size of the population. The exponent must be greater than 0, and the size of the population must be a positive integer. The following constructs a Zipfian distribution with exponent 2.5 and a population size of 10:
var zipfian = new ZipfianDistribution(2.5, 10);
The ZipfianDistribution class has two specific properties: Exponent, which returns the exponent parameter, and a second property that returns the size of the population.
Both ZipfDistribution and ZipfianDistribution also provide efficient methods for generating random samples from the distributions:
var random = new Pcg32();
int sample = zipf.Sample(random);
The above example uses the Pcg32 class to generate uniform random numbers, which are then transformed into samples that follow the Zipf distribution.
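To show how uniform random numbers can be turned into Zipfian samples, the following self-contained sketch uses inverse transform sampling on a precomputed CDF table. This is one common technique and is not necessarily the method the library uses. The `ZipfianSampler` class is a hypothetical helper, and `System.Random` stands in for the library's random number generators.

```csharp
using System;

// Hypothetical helper for illustration only; not the library's sampler.
static class ZipfianSampler
{
    // Builds the CDF table F(1), ..., F(n) for exponent s.
    public static double[] BuildCdf(double s, int n)
    {
        var cdf = new double[n];
        double total = 0.0;
        for (int k = 1; k <= n; k++)
        {
            total += Math.Pow(k, -s);
            cdf[k - 1] = total;      // unnormalized running sum
        }
        for (int k = 0; k < n; k++)
            cdf[k] /= total;         // normalize so the last entry is 1
        cdf[n - 1] = 1.0;            // guard against rounding error
        return cdf;
    }

    // Returns the smallest rank k with F(k) > u, where u is uniform on [0, 1).
    public static int Sample(double[] cdf, Random rng)
    {
        double u = rng.NextDouble();
        int low = 0, high = cdf.Length - 1;
        while (low < high)
        {
            int mid = low + (high - low) / 2;
            if (cdf[mid] > u) high = mid;
            else low = mid + 1;
        }
        return low + 1;              // ranks start at 1
    }
}

static class Program
{
    static void Main()
    {
        var rng = new Random(12345); // fixed seed for reproducibility
        double[] cdf = ZipfianSampler.BuildCdf(2.5, 10);

        // Tally 100,000 samples per rank.
        var counts = new int[10];
        for (int i = 0; i < 100_000; i++)
            counts[ZipfianSampler.Sample(cdf, rng) - 1]++;

        Console.WriteLine(string.Join(", ", counts));
    }
}
```

The table costs memory proportional to the population size and each sample takes a binary search, so this approach suits small to moderate populations. It does not apply to the Zipf distribution itself, whose support is infinite.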
For details of the properties and methods common to all discrete probability distribution classes, see the topic on Discrete Probability Distributions.