Scaling laws and fluctuations in the statistics of word frequencies

Times Cited: 38
|
Authors
Gerlach, Martin [1 ]
Altmann, Eduardo G. [1 ]
Affiliations
[1] Max Planck Inst Phys Complex Syst, D-01187 Dresden, Germany
Source
NEW JOURNAL OF PHYSICS | 2014 / Vol. 16
Keywords
scaling laws; stochastic processes; statistical fluctuations; natural language; GROWTH; DISTRIBUTIONS; INNOVATION; DYNAMICS; ORIGIN;
DOI
10.1088/1367-2630/16/11/113010
CLC Number
O4 [Physics];
Subject Classification Code
0702;
Abstract
In this paper, we combine statistical analysis of written texts and simple stochastic models to explain the appearance of scaling laws in the statistics of word frequencies. The average vocabulary of an ensemble of fixed-length texts is known to scale sublinearly with the total number of words (Heaps' law). Analyzing the fluctuations around this average in three large databases (Google-ngram, English Wikipedia, and a collection of scientific articles), we find that the standard deviation scales linearly with the average (Taylor's law), in contrast to the decaying fluctuations predicted by simple sampling arguments. We explain both scaling laws (Heaps' and Taylor's) by modeling the usage of words as a Poisson process with a fat-tailed distribution of word frequencies (Zipf's law) and topic-dependent frequencies of individual words (as in topic models). Considering topical variations leads to quenched averages, turns the vocabulary size into a non-self-averaging quantity, and explains the empirical observations. For the numerous practical applications relying on estimates of vocabulary size, our results show that uncertainties remain large even for long texts. We show how to account for these uncertainties when measuring the lexical richness of texts of different lengths.
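
The mechanism summarized in the abstract can be illustrated with a small simulation. This is a minimal sketch, not the authors' code or data: the number of word types W, the Zipf exponent gamma, the number of topics K, and the Dirichlet concentration are arbitrary values chosen for illustration, and the permuted-Zipf "topics" are a crude stand-in for a proper topic model. Texts are generated by multinomial sampling from Zipfian word frequencies; with a fixed frequency distribution the relative fluctuations of the vocabulary size decay with text length, whereas topic-dependent frequencies act as quenched disorder and keep them large.

import numpy as np

rng = np.random.default_rng(0)

W = 50_000                                      # word types (assumed value)
gamma = 1.8                                     # Zipf exponent (assumed value)
base = np.arange(1, W + 1, dtype=float) ** (-gamma)
base /= base.sum()                              # Zipfian word frequencies

K = 5                                           # number of topics (assumed value)
# Each "topic" uses the same Zipfian frequencies assigned to a different
# random permutation of the word types (stand-in for a topic model).
topics = np.stack([base[rng.permutation(W)] for _ in range(K)])

def vocab_sizes(n_tokens, topical=False, n_texts=100):
    """Number of distinct word types in n_texts texts of n_tokens tokens each."""
    sizes = np.empty(n_texts, dtype=int)
    for i in range(n_texts):
        if topical:
            mixture = rng.dirichlet(np.full(K, 0.1))   # topic weights of this text
            weights = mixture @ topics                 # text-specific word frequencies
        else:
            weights = base                             # same frequencies for every text
        counts = rng.multinomial(n_tokens, weights)    # Poisson-like sampling of tokens
        sizes[i] = np.count_nonzero(counts)            # vocabulary size V
    return sizes

for N in (10**3, 10**4, 10**5):
    plain = vocab_sizes(N)
    topic = vocab_sizes(N, topical=True)
    print(f"N={N:>6}  sampling: {plain.mean():8.1f} ± {plain.std():6.1f}"
          f"   topical: {topic.mean():8.1f} ± {topic.std():6.1f}")

Running this should show the mean vocabulary growing sublinearly with N (Heaps' law) in both cases, while the spread across texts shrinks relative to the mean only for pure sampling; with topic-dependent frequencies the relative fluctuations do not decay, in the spirit of the non-self-averaging behaviour reported in the paper, though this toy example is not tuned to reproduce the exact Taylor-law exponent.
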
Pages: 18