# Publications

A simple genealogical structure is found for a general finite island model of population subdivision. The model allows for variation in the sizes of demes, in contributions to the migrant pool, and in the fraction of each deme that is replaced by migrants every generation. The ancestry of a sample of non-recombining DNA sequences has a simple structure when the sample size is much smaller than the total number of demes in the population. This allows an expression for the probability distribution of the number of segregating sites in the sample to be derived under the infinite-sites mutation model. It also yields easily computed estimators of the migration parameter for each deme in a multi-deme sample. The genealogical process is such that the lineages ancestral to the sample tend to accumulate in demes with low migration rates and/or which contribute disproportionately to the migrant pool. In addition, common ancestor or coalescent events tend to occur in demes of small size. This provides a framework for understanding the determinants of the effective size of the population, and leads to an expression for the probability that the root of a genealogy occurs in a particular geographic region, or among a particular set of demes.

The origins and divergence of Drosophila simulans and close relatives D. mauritiana and D. sechellia were examined using the patterns of DNA sequence variation found within and between species at 14 different genes. D. sechellia consistently revealed low levels of polymorphism, and genes from D. sechellia have accumulated mutations at a rate that is approximately 50% higher than the same genes from D. simulans. At synonymous sites, D. sechellia has experienced a significant excess of unpreferred codon substitutions. Together these observations suggest that D. sechellia has had a reduced effective population size for some time, and that it is accumulating slightly deleterious mutations as a result. D. simulans and D. mauritiana are both highly polymorphic and the two species share many polymorphisms, probably since the time of common ancestry. A simple isolation speciation model, with zero gene flow following incipient species separation, was fitted to both the simulans/mauritiana divergence and the simulans/sechellia divergence. In both cases the model fit the data quite well, and the analyses revealed little evidence of gene flow between the species. The exception is one gene copy at one locus in D. sechellia, which closely resembled other D. simulans sequences. The overall picture is of two allopatric speciation events that occurred quite near one another in time.

A nonequilibrium migration model is proposed and applied to genetic data from humans. The model assumes symmetric migration among all possible pairs of demes and that the number of demes is large. With these assumptions it is straightforward to allow for changes in demography, and here a single abrupt change is considered. Under the model this change is identical to a change in the ancestral effective population size and might be caused by changes in deme size, in the number of demes, or in the migration rate. Expressions for the expected numbers of sites segregating at particular frequencies in a multideme sample are derived. A maximum-likelihood analysis of independent polymorphic restriction sites in humans reveals a decrease in effective size. This is consistent with a change in the rates of migration among human subpopulations from ancient low levels to present high ones.
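
The observed counterpart of "the numbers of sites segregating at particular frequencies" can be tabulated directly from a sample. The following is a minimal sketch, not the paper's method: it assumes aligned, equal-length sequences with at most two states per site (for example, presence or absence of a restriction site coded as `1`/`0`), pools all demes into one sample, and folds the spectrum because the ancestral state is treated as unknown. The function name `folded_sfs` is hypothetical.

```python
from collections import Counter


def folded_sfs(sequences):
    """Return a list where entry k is the number of sites whose
    minor (rarer) state appears in exactly k sampled sequences.

    Assumption: sequences are aligned and of equal length.
    Sites with one state (monomorphic) or more than two states are skipped.
    """
    n = len(sequences)
    spectrum = [0] * (n // 2 + 1)  # minor count ranges from 0 to n // 2
    for column in zip(*sequences):  # iterate over sites
        counts = Counter(column)
        if len(counts) != 2:
            continue
        minor_count = min(counts.values())
        spectrum[minor_count] += 1
    return spectrum
```

A model-based analysis such as the one described above would compare counts like these with the expected numbers derived under the model.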

Expressions for the expectation and variance of the number of segregating sites in samples from an island model of population subdivision are derived. For small samples, an arbitrary number of demes can be accommodated. Results for larger samples are derived under the assumption of an infinite number of demes. However, simulations indicate that the latter results will hold quite well for the finite-island model in many cases. A new estimator of the population migration rate is proposed and is shown to outperform the widely used pairwise method.

Population genetic models often use a population recombination parameter 4Nc, where N is the effective population size and c is the recombination rate per generation. In many ways 4Nc is comparable to 4Nu, the population mutation rate. Both combine genome-level and population-level processes, and together they describe the rate of production of genetic variation in a population. However, 4Nc is more difficult to estimate. For a population sample of DNA sequences, historical recombination can only be detected if polymorphisms exist, and even then most recombination events are not detectable. This paper describes an estimator of 4Nc, hereafter designated γ (gamma), that was developed using a coalescent model for a sample of four DNA sequences with recombination. The reliability of γ was assessed using multiple coalescent simulations. In general γ has low to moderate bias, and its reliability is comparable to, though lower than, that of a widely used estimator of 4Nu. If there exists an independent estimate of the recombination rate (per generation, per base pair), γ can be used to estimate the effective population size or the neutral mutation rate.
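
The last step of that abstract is simple arithmetic on the definitions 4Nc and 4Nu, and can be sketched as follows. This is only an illustration under stated assumptions: the estimate of 4Nc is taken to refer to the whole locus, the independent recombination rate is per base pair, and the locus length `n_sites` converts between the two. All function and parameter names are hypothetical.

```python
def effective_size_from_gamma(gamma_hat, c_per_bp, n_sites):
    """Solve gamma = 4 * N * c for N.

    Assumption: gamma_hat is an estimate of 4Nc for the whole locus,
    so the per-locus recombination rate is c_per_bp * n_sites.
    """
    c_locus = c_per_bp * n_sites
    return gamma_hat / (4.0 * c_locus)


def mutation_rate_from_theta(theta_hat, n_effective):
    """Solve theta = 4 * N * u for u (per locus, per generation).

    Assumption: theta_hat is an estimate of 4Nu for the same locus.
    """
    return theta_hat / (4.0 * n_effective)
```

Because both results are ratios of estimates, their uncertainty combines the errors of the inputs, so they are best read as rough figures.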

The expected numbers of different categories of polymorphic sites are derived for two related models of population history: the isolation model, in which an ancestral population splits into two descendants, and the size-change model, in which a single population undergoes an instantaneous change in size. For the isolation model, the observed numbers of shared, fixed, and exclusive polymorphic sites are used to estimate the relative sizes of the three populations, ancestral plus two descendant, as well as the time of the split. For the size-change model, the numbers of sites segregating at particular frequencies in the sample are used to estimate the relative sizes of the ancestral and descendant populations plus the time the change took place. Parameters are estimated by choosing values that most closely equate expectations with observations. Computer simulations show that current and historical population parameters can be estimated accurately. The methods are applied to DNA data from two species of Drosophila and to some human mitochondrial DNA sequences.
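
The site categories used for the isolation model can be counted from two aligned samples. The sketch below is an illustration rather than the paper's procedure: it assumes equal-length sequences without gaps and at most two bases per site (as under an infinite-sites model), and the name `classify_sites` and the dictionary keys are hypothetical.

```python
def classify_sites(sample1, sample2):
    """Count exclusive, shared, and fixed sites between two samples.

    exclusive_1 / exclusive_2: polymorphic in only one sample
    shared: polymorphic in both samples
    fixed: monomorphic within each sample but different between them
    """
    counts = {"exclusive_1": 0, "exclusive_2": 0, "shared": 0, "fixed": 0}
    n_sites = len(sample1[0])
    for i in range(n_sites):
        bases1 = {seq[i] for seq in sample1}
        bases2 = {seq[i] for seq in sample2}
        poly1 = len(bases1) > 1
        poly2 = len(bases2) > 1
        if poly1 and poly2:
            counts["shared"] += 1
        elif poly1:
            counts["exclusive_1"] += 1
        elif poly2:
            counts["exclusive_2"] += 1
        elif bases1 != bases2:
            counts["fixed"] += 1
    return counts
```

Sites that are identical and monomorphic in both samples fall into none of the categories and are not counted.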

A new estimator is proposed for the parameter *C* = 4*Nc*, where *N* is the population size and *c* is the recombination rate in a finite population model without selection. The estimator is an improved version of Hudson's (1987) estimator, which takes advantage of some recent theoretical developments. The improvement is slight, but the smaller bias and standard error of the new estimator support its use. The variance of the average number of pairwise differences is also derived, and is important in the formulation of the new estimator.

The divergence of Drosophila pseudoobscura and close relatives D. persimilis and D. pseudoobscura bogotana has been studied using comparative DNA sequence data from multiple nuclear loci. New data from the Hsp82 and Adh regions, in conjunction with existing data from Adh and the Period locus, are examined in the light of various models of speciation. The principal finding is that the three loci present very different histories, with Adh indicating large amounts of recent gene flow among the taxa, while little or no gene flow is apparent in the data from the other loci. The data were compared with predictions from several isolation models of divergence. These models include no gene flow, and they were found to be incompatible with the data. Instead the DNA data, taken together with other evidence, seem consistent with divergence models in which natural selection acts against gene flow at some loci more than at others. This family of models includes some sympatric and parapatric speciation models, as well as models of secondary contact and subsequent reinforcement of sexual isolation.

Estimates of transition bias provide insight into the process of nucleotide substitution, and are required in some commonly used phylogenetic methods. Transitions are favored over transversions among spontaneous mutations, and the direction and strength of selection on proteins and RNA appear to depend on mutation type. As the complexity of the nucleotide-substitution process has become apparent, problems with classical methods of estimating transition bias have been recognized. These problems arise because there is a fundamental difference between ratios of numbers of differences among sequences and ratios of rates, and because classical methods are not easily generalized. Several new methods are now available.

Two demographic scenarios are considered: two populations with migration and two populations that have been completely isolated from each other for some period of time. The variance of the number of differences between pairs of sequences in a single sample is studied and forms the basis of a test of the isolation model. The migration model is one possible alternative to isolation. The isolation model is rejected when the proposed test statistic, which involves the variances of pairwise differences within and between populations, is larger than a critical value. The power and realized significance of the test are investigated using simulations, and an example using mitochondrial DNA illustrates its application.

The variances of three measures of pairwise difference are derived for the case of two populations that exchange migrants. The resulting expressions can be used to place standard errors on estimates of population genetic parameters. The three measures considered are the average number of intrapopulation nucleotide differences, the average number of interpopulation nucleotide differences, and the net number of nucleotide differences between the two populations. The expectations of these statistics are previously known and suggest that they might be used to quantify the divergence between populations. However, the standard errors of all three statistics are shown to be quite large relative to their expectations. Thus, our ability to quantify divergence between populations with them is limited, at least using available data. An analysis of mitochondrial DNA sequences from grey-crowned babblers illustrates the application of the theory. The variances derived here for migration are compared to previously published results for two populations that have been completely isolated from one another for some length of time. All three variances are greater under migration than under isolation, suggesting that a test to distinguish these two demographic situations could be developed.
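
The three measures themselves (not their variances, which are the subject of the paper) can be computed from two aligned samples. This is a minimal sketch that assumes equal-length sequences without gaps and at least two sequences per population. The net measure follows the standard definition, the between-population average minus the mean of the two within-population averages. Function names are hypothetical.

```python
from itertools import combinations, product


def n_diff(a, b):
    """Number of positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def mean_within(sample):
    """Average pairwise differences among sequences of one population."""
    pairs = list(combinations(sample, 2))
    return sum(n_diff(a, b) for a, b in pairs) / len(pairs)


def mean_between(sample1, sample2):
    """Average differences over all pairs with one sequence from each population."""
    pairs = list(product(sample1, sample2))
    return sum(n_diff(a, b) for a, b in pairs) / len(pairs)


def net_differences(sample1, sample2):
    """Between-population average minus the mean within-population average."""
    within = (mean_within(sample1) + mean_within(sample2)) / 2
    return mean_between(sample1, sample2) - within
```

As the abstract cautions, point values from these functions carry large standard errors relative to their expectations, so they should not be over-interpreted.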

We inferred phylogenetic trees from individual genes and random samples of nucleotides from the mitochondrial genomes of 10 vertebrates and compared the results to those obtained by analyzing the whole genomes. Individual genes are poor samples in that they infrequently lead to the whole-genome tree. A large number of nucleotide sites is needed to exactly determine the whole-genome tree. A relatively small number of sites, however, often results in a tree close to the whole-genome tree. We found that blocks of contiguous sites were less likely to lead to the whole-genome tree than samples composed of sites drawn individually from throughout the genome. Samples of contiguous sites are not representative of the entire genome, a condition that violates a basic assumption of the bootstrap method as it is applied in phylogenetic studies.

Substitution-rate variation among sites and differences in the probabilities of change among the four nucleotides are conflated in DNA sequence comparisons. When variation in rate exists among sites but is ignored, biases in the rates of change among nucleotides are underestimated. This paper provides a quantification of this effect when the observed proportions of transitions, P, and transversions, Q, between two sequences are used to estimate transition bias. The utility of P/Q as an estimator is examined both with and without rate variation among sites. A gamma-distributed-rates model is used to illustrate the effect that variation among sites has on estimates of transition bias, but it is argued that the basic results should hold for any pattern of rate variation. Naive estimates of the extent of transition bias, those that ignore rate variation when it is present, can seriously underestimate its true value. The extent of this underestimation increases with the amount of rate variation among sites. An example using human mitochondrial DNA shows that a simple comparison of the proportions of transitions and transversions in recently diverged sequences underestimates the level of transition bias by approximately 15%. This does not depend on the use of P/Q to estimate transition bias; maximum-likelihood methods give similar results.
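
The observed proportions P and Q are simple to compute for a pair of aligned sequences. The sketch below assumes uppercase A, C, G, T only, with no gaps or ambiguity codes. The function name is hypothetical. It yields the naive quantities that the abstract warns can underestimate transition bias when rates vary among sites.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def transition_transversion_proportions(seq1, seq2):
    """Return (P, Q): proportions of sites differing by a transition
    (purine <-> purine or pyrimidine <-> pyrimidine) and by a
    transversion (purine <-> pyrimidine)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = 0
    transversions = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    n_sites = len(seq1)
    return transitions / n_sites, transversions / n_sites
```

The ratio P/Q from these values is undefined when no transversions are observed, which is common for very closely related sequences.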

More than an order of magnitude difference in substitution rate exists among sites within hypervariable region 1 of the control region of human mitochondrial DNA. A two-rate Poisson mixture and a negative binomial distribution are used to describe the distribution of the inferred number of changes per nucleotide site in this region. When three data sets are pooled, however, the two-rate model cannot explain the data. The negative binomial distribution always fits, suggesting that substitution rates are approximately gamma distributed among sites. Simulations presented here provide support for the use of a biased, yet commonly employed, method of examining rate variation. The use of parsimony in the method to infer the number of changes at each site introduces systematic errors into the analysis. These errors preclude an unbiased quantification of variation in substitution rate but make the method conservative overall. The method can be used to distinguish sites with highly elevated rates, and 29 such sites are identified in hypervariable region 1. Variation does not appear to be clustered within this region. Simulations show that biases in rates of substitution among nucleotides and non-uniform base composition can mimic the effects of variation in rate among sites. However, these factors contribute little to the levels of rate variation observed in hypervariable region 1.