Suggested Citation: "The Ewens Sampling Formula." National Research Council. 1995. Calculating the Secrets of Life: Contributions of the Mathematical Sciences to Molecular Biology. Washington, DC: The National Academies Press. doi: 10.17226/2121.


The Ewens Sampling Formula

Motivated by the realization that mutations in DNA sequences could lead to an essentially infinite number of alleles at a given locus, Kimura and Crow (1964) advocated modeling the effects of mutation with an infinitely-many-alleles model. In this process, a gene inherits the type of its ancestor if no mutation occurs and, if a mutation does occur, acquires a type that does not currently exist and has never existed in the population. The genes in the sample are thought of as unlabeled: the experimenter knows whether two genes are different but records nothing further about the identity of the alleles. The natural statistic to record about the sample is then its configuration $C_n \equiv (C_1, C_2, \ldots, C_n)$, where

$C_j$ = number of alleles represented $j$ times.

Of course, $C_1 + 2C_2 + \cdots + nC_n = n$, and the number of alleles in the sample is

$$K_n \equiv C_1 + C_2 + \cdots + C_n. \tag{5.3}$$
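As a minimal sketch of how a configuration is obtained from data, the Python snippet below tallies a small made-up sample. The function name `configuration` and the sample labels are illustrative assumptions, not part of the original text; the labels serve only to indicate which genes are alike.

```python
from collections import Counter


def configuration(sample):
    """Return [C_1, ..., C_n] for a sample of n allele labels."""
    n = len(sample)
    # allele label -> number of copies of that allele in the sample
    copies_per_allele = Counter(sample)
    # j -> number of alleles that appear exactly j times
    alleles_per_multiplicity = Counter(copies_per_allele.values())
    return [alleles_per_multiplicity.get(j, 0) for j in range(1, n + 1)]


# Hypothetical sample of n = 7 genes (labels are arbitrary).
sample = ["A", "A", "B", "C", "C", "C", "D"]
n = len(sample)
c = configuration(sample)

# K_n: the number of distinct alleles, as in (5.3).
k = sum(c)

# The constraint C_1 + 2 C_2 + ... + n C_n = n.
weighted = sum(j * cj for j, cj in enumerate(c, start=1))

print(c, k, weighted)
```

Here two alleles appear once, one appears twice, and one appears three times, so the configuration has $C_1 = 2$, $C_2 = 1$, $C_3 = 1$ and all other entries zero.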

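The sampling distribution of $C_n$ was found by Ewens (1972):

$$P(C_n = a) = \frac{n!}{\theta_{(n)}} \prod_{j=1}^{n} \left(\frac{\theta}{j}\right)^{a_j} \frac{1}{a_j!} \tag{5.4}$$

for $a = (a_1, a_2, \ldots, a_n)$ satisfying $a_j \geq 0$ for $j = 1, 2, \ldots, n$ and $\sum_{j=1}^{n} j a_j = n$, and where

$$\theta_{(n)} \equiv \theta(\theta + 1) \cdots (\theta + n - 1).$$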
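The sampling formula can be checked numerically. The sketch below, in Python, evaluates the Ewens probability of a configuration and confirms that the probabilities sum to one over all admissible configurations for a small sample. The helper names (`rising_factorial`, `esf_probability`, `partitions`, `to_configuration`) and the values $n = 6$, $\theta = 2.5$ are assumptions chosen for illustration only.

```python
from math import factorial


def rising_factorial(theta, n):
    """theta_(n) = theta (theta + 1) ... (theta + n - 1)."""
    result = 1.0
    for i in range(n):
        result *= theta + i
    return result


def esf_probability(a, theta):
    """Ewens probability of configuration a = [a_1, ..., a_n]."""
    n = sum(j * aj for j, aj in enumerate(a, start=1))
    prob = factorial(n) / rising_factorial(theta, n)
    for j, aj in enumerate(a, start=1):
        prob *= (theta / j) ** aj / factorial(aj)
    return prob


def partitions(n, max_part=None):
    """Yield the integer partitions of n as non-increasing lists."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield []
        return
    for part in range(min(n, max_part), 0, -1):
        for rest in partitions(n - part, part):
            yield [part] + rest


def to_configuration(parts, n):
    """Convert a partition (list of allele counts) to [a_1, ..., a_n]."""
    a = [0] * n
    for p in parts:
        a[p - 1] += 1
    return a


# Illustrative values, not taken from the text.
n, theta = 6, 2.5
total = sum(
    esf_probability(to_configuration(p, n), theta) for p in partitions(n)
)
print(total)  # should be 1 up to floating-point rounding
```

For $n = 2$ the formula reduces to $P(\text{two different alleles}) = \theta/(\theta + 1)$ and $P(\text{one allele seen twice}) = 1/(\theta + 1)$, which gives a quick hand check of the implementation.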

From (5.4) it follows that

$$P(K_n = k) = |S_n^k| \, \frac{\theta^k}{\theta_{(n)}}, \quad k = 1, 2, \ldots, n, \tag{5.5}$$

and

$$E(K_n) = \sum_{j=0}^{n-1} \frac{\theta}{\theta + j}, \tag{5.6}$$

$|S_n^k|$ being the absolute value of the Stirling number of the first kind, that is, the coefficient of $\theta^k$ in $\theta_{(n)}$. From (5.5) and (5.4) it follows that $K_n$ is sufficient for $\theta$, so that the information in the sample relevant for estimating $\theta$ is contained just in $K_n$. This allows us (Ewens, 1972, 1979) to calculate the maximum likelihood (and moment) estimator of $\theta$ as the solution $\hat\theta$ of the equation

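$$k = \sum_{j=0}^{n-1} \frac{\hat\theta}{\hat\theta + j}, \tag{5.7}$$

where $k$ is the number of alleles observed in the sample. In large samples, the estimator $\hat\theta$ has variance given approximately by

$$\operatorname{var}(\hat\theta) \approx \theta \left( \sum_{j=1}^{n-1} \frac{j}{(\theta + j)^2} \right)^{-1}. \tag{5.8}$$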

For the pyrimidine sequence data described above in the "Overview" section, there are $k = 24$ alleles. Solving equation (5.7) for $\hat\theta$ gives $\hat\theta = 10.62$, with a variance of 9.89. An approximate 95 percent confidence interval for $\theta$, taken as two standard errors either side of the estimate ($2\sqrt{9.89} \approx 6.29$), is therefore $10.62 \pm 6.29$. This example underlines the variability inherent in estimating $\theta$ from this model. The pyrimidine region comprises 201 sites, so the per-site substitution rate is estimated to be $0.053 \pm 0.031$.
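The estimating equation has no closed-form solution, but the expected number of alleles increases with the mutation parameter, so a simple root finder suffices. The Python sketch below uses bisection, which is one reasonable choice and not necessarily the method used by the authors. The function names are assumptions, and because the sample size is not stated in this section, the demonstration values `n_demo = 50` and `k_demo = 10` are hypothetical and do not reproduce the pyrimidine analysis. The last lines only re-derive the interval arithmetic from the figures quoted in the text.

```python
from math import sqrt


def expected_alleles(theta, n):
    """E(K_n) = sum_{j=0}^{n-1} theta / (theta + j)."""
    return sum(theta / (theta + j) for j in range(n))


def mle_theta(k, n, tol=1e-10):
    """Solve k = E(K_n) for theta by bisection (requires 1 < k < n)."""
    if not 1 < k < n:
        raise ValueError("a finite positive solution requires 1 < k < n")
    lo, hi = 1e-9, 1.0
    # E(K_n) increases from 1 to n in theta, so expand hi until it brackets k.
    while expected_alleles(hi, n) < k:
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if expected_alleles(mid, n) < k:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def approx_variance(theta, n):
    """Large-sample variance: theta / sum_{j=1}^{n-1} j / (theta + j)^2."""
    return theta / sum(j / (theta + j) ** 2 for j in range(1, n))


# Hypothetical data, chosen only to exercise the functions.
n_demo, k_demo = 50, 10
theta_hat = mle_theta(k_demo, n_demo)
var_hat = approx_variance(theta_hat, n_demo)
print(theta_hat, var_hat)

# Interval arithmetic using the figures quoted in the text.
half_width = 2 * sqrt(9.89)          # two standard errors
per_site_rate = 10.62 / 201          # estimate divided by number of sites
per_site_half_width = 6.29 / 201
print(round(half_width, 2), round(per_site_rate, 3), round(per_site_half_width, 3))
```

Plugging the estimate into the variance formula in place of the unknown true value is itself an approximation, which is one reason the resulting interval should be read as rough.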

The goodness of fit of the model to the data may be assessed by using the sufficiency of $K_n$ for $\theta$: given $K_n$, the conditional distribution of the allele frequencies is independent of $\theta$. Ewens (1972, 1979) gives further details on this point. To describe alternative goodness-of-fit methods, we return briefly to the probabilistic structure of mutation in the coalescent.
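The sufficiency property can be illustrated numerically. The Python sketch below divides the Ewens probability of a configuration by the probability of its number of alleles, using unsigned Stirling numbers of the first kind computed from their standard recurrence, and evaluates the ratio at two very different parameter values. All function names and the example configuration are assumptions made for illustration.

```python
from math import factorial


def unsigned_stirling_first(n, k):
    """c(n, k) from c(m, i) = (m - 1) c(m - 1, i) + c(m - 1, i - 1)."""
    row = [1] + [0] * n  # row for m = 0: c(0, 0) = 1
    for m in range(1, n + 1):
        new = [0] * (n + 1)
        for i in range(1, m + 1):
            new[i] = (m - 1) * row[i] + row[i - 1]
        row = new
    return row[k]


def rising_factorial(theta, n):
    """theta (theta + 1) ... (theta + n - 1)."""
    result = 1.0
    for i in range(n):
        result *= theta + i
    return result


def esf_probability(a, theta):
    """Ewens probability of configuration a = [a_1, ..., a_n]."""
    n = sum(j * aj for j, aj in enumerate(a, start=1))
    prob = factorial(n) / rising_factorial(theta, n)
    for j, aj in enumerate(a, start=1):
        prob *= (theta / j) ** aj / factorial(aj)
    return prob


def prob_num_alleles(k, n, theta):
    """P(K_n = k) = c(n, k) theta^k / theta_(n)."""
    return unsigned_stirling_first(n, k) * theta ** k / rising_factorial(theta, n)


def conditional_given_k(a, theta):
    """P(C_n = a | K_n = sum(a)); should not depend on theta."""
    n = sum(j * aj for j, aj in enumerate(a, start=1))
    return esf_probability(a, theta) / prob_num_alleles(sum(a), n, theta)


# Illustrative configuration: n = 7 genes, k = 4 alleles.
a = [2, 1, 1, 0, 0, 0, 0]
cond_small = conditional_given_k(a, 0.5)
cond_large = conditional_given_k(a, 20.0)
print(cond_small, cond_large)  # the two values agree up to rounding
```

The powers of the parameter and the rising factorial cancel in the ratio, leaving $n! / \bigl(|S_n^k| \prod_j j^{a_j} a_j!\bigr)$, which is what a goodness-of-fit test conditional on the observed number of alleles would use as its reference distribution.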
