# Publications

Contrary to what is often assumed in population genetics, independently segregating loci do not have completely independent ancestries, since all loci are inherited through a single, shared population pedigree. Previous work has shown that the non-independence between gene genealogies of independently segregating loci created by the population pedigree is weak in panmictic populations, and predictions made from standard coalescent theory are accurate for populations that are at least moderately sized. Here, we investigate patterns of coalescence in pedigrees of structured populations. We find that the pedigree creates deviations away from the predictions of the structured coalescent that persist on a longer timescale than in the case of panmictic populations. Nevertheless, we find that the structured coalescent provides a reasonable approximation for the coalescent process in structured population pedigrees so long as migration events are moderately frequent and there are no migration events in the recent pedigree of the sample. When there are migration events in the recent sample pedigree, we find that distributions of coalescence in the sample can be modeled as a mixture of distributions from different initial sample configurations. We use this observation to motivate a maximum-likelihood approach for inferring migration rates and mutation rates jointly with features of the pedigree such as recent migrant ancestry and recent relatedness. Using simulation, we show that our inference framework accurately recovers long-term migration rates in the presence of recent migration events in the sample pedigree.

We demonstrate the advantages of using information at many unlinked loci to better calibrate estimates of the time to the most recent common ancestor (TMRCA) at a given locus. To this end, we apply a simple empirical Bayes method to estimate the TMRCA. This method is both asymptotically optimal, in the sense that the estimator converges to the true value when the number of unlinked loci for which we have information is large, and has the advantage of not making any assumptions about demographic history. The algorithm works as follows: we first split the sample at each locus into inferred left and right clades to obtain many estimates of the TMRCA, which we can average to obtain an initial estimate of the TMRCA. We then use nucleotide sequence data from other unlinked loci to form an empirical distribution that we can use to improve this initial estimate.

Genetic variation among loci in the genomes of diploid biparental organisms is the result of mutation and genetic transmission through the genealogy, or population pedigree, of the species. We explore the consequences of this for patterns of variation at unlinked loci for two kinds of demographic events: the occurrence of a very large family or a strong selective sweep that occurred in the recent past. The results indicate that only rather extreme versions of such events can be expected to structure population pedigrees in such a way that unlinked loci will show deviations from the standard predictions of population genetics, which average over population pedigrees. The results also suggest that large samples of individuals and loci increase the chance of picking up signatures of these events, and that very large families may have a unique signature in terms of sample distributions of mutant alleles.

The rate at which human genomes mutate is a central biological parameter that has many implications for our ability to understand demographic and evolutionary phenomena. We present a method for inferring mutation and gene-conversion rates by using the number of sequence differences observed in identical-by-descent (IBD) segments together with a reconstructed model of recent population-size history. This approach is robust to, and can quantify, the presence of substantial genotyping error, as validated in coalescent simulations. We applied the method to 498 trio-phased sequenced Dutch individuals and inferred a point mutation rate of 1.66 x 10(-8) per base per generation and a rate of 1.26 x 10(-9) for <20 bp indels. By quantifying how estimates varied as a function of allele frequency, we inferred the probability that a site is involved in non-crossover gene conversion as 5.99 x 10(-6). We found that recombination does not have observable mutagenic effects after gene conversion is accounted for and that local gene-conversion rates reflect recombination rates. We detected a strong enrichment of recent deleterious variation among mismatching variants found within IBD regions and observed summary statistics of local sharing of IBD segments to closely match previously proposed metrics of background selection; however, we found no significant effects of selection on our mutation-rate estimates. We detected no evidence of strong variation of mutation rates in a number of genomic annotations obtained from several recent studies. Our analysis suggests that a mutation-rate estimate higher than that reported by recent pedigree-based studies should be adopted in the context of DNA-based demographic reconstruction.

Sophisticated inferential tools coupled with the coalescent model have recently emerged for estimating past population sizes from genomic data. Recent methods that model recombination require small sample sizes, make constraining assumptions about population size changes, and do not report measures of uncertainty for estimates. Here, we develop a Gaussian process-based Bayesian nonparametric method coupled with a sequentially Markov coalescent model that allows accurate inference of population sizes over time from a set of genealogies. In contrast to current methods, our approach considers a broad class of recombination events, including those that do not change local genealogies. We show that our method outperforms recent likelihood-based methods that rely on discretization of the parameter space. We illustrate the application of our method to multiple demographic histories, including population bottlenecks and exponential growth. In simulation, our Bayesian approach produces point estimates four times more accurate than maximum-likelihood estimation (based on the sum of absolute differences between the truth and the estimated values). Further, our method's credible intervals for population size as a function of time cover 90% of true values across multiple demographic scenarios, enabling formal hypothesis testing about population size differences over time. Using genealogies estimated with *ARG*weaver, we apply our method to European and Yoruban samples from the 1000 Genomes Project and confirm key known aspects of population size history over the past 150,000 years.

A long genomic segment inherited by a pair of individuals from a single, recent common ancestor is said to be *identical-by-descent* (IBD). Shared IBD segments have numerous applications in genetics, from demographic inference to phasing, imputation, pedigree reconstruction, and disease mapping. Here, we provide a theoretical analysis of IBD sharing under Markovian approximations of the coalescent with recombination. We describe a general framework for the IBD process along the chromosome under the Markovian models (SMC/SMC’), as well as introduce and justify a new model, which we term the *renewal approximation*, under which lengths of successive segments are independent. Then, considering the infinite-chromosome limit of the IBD process, we recover previous results (for SMC) and derive new results (for SMC’) for the mean number of shared segments longer than a cutoff and the fraction of the chromosome found in such segments. We then use renewal theory to derive an expression (in Laplace space) for the distribution of the number of shared segments and demonstrate implications for demographic inference. We also compute (again, in Laplace space) the distribution of the fraction of the chromosome in shared segments, from which we obtain explicit expressions for the first two moments. Finally, we generalize all results to populations with a variable effective size.

The evolution of drug resistance in HIV occurs by the fixation of specific, well-known, drug-resistance mutations, but the underlying population genetic processes are not well understood. By analyzing within-patient longitudinal sequence data, we make four observations that shed a light on the underlying processes and allow us to infer the short-term effective population size of the viral population in a patient. Our first observation is that the evolution of drug resistance usually occurs by the fixation of one drug-resistance mutation at a time, as opposed to several changes simultaneously. Second, we find that these fixation events are accompanied by a reduction in genetic diversity in the region surrounding the fixed drug resistance mutation, due to the hitchhiking effect. Third, we observe that the fixation of drug-resistance mutations involves both hard and soft selective sweeps. In a hard sweep, a resistance mutation arises in a single viral particle and drives all linked mutations with it when it spreads in the viral population, which dramatically reduces genetic diversity. On the other hand, in a soft sweep, a resistance mutation occurs multiple times on different genetic backgrounds, and the reduction of diversity is weak. Using the frequency of occurrence of hard and soft sweeps we estimate the effective population size of HIV to be 1:5|105 (95% confidence interval ½0:8|105,4:8|105). This number is much lower than the actual number of infected cells, but much larger than previous population size estimates based on synonymous diversity. We propose several explanations for the observed discrepancies. Finally, our fourth observation is that genetic diversity at non-synonymous sites recovers to its pre-fixation value within 18 months, whereas diversity at synonymous sites remains depressed after this time period. These results improve our understanding of HIV evolution and have potential implications for treatment

strategies.

Citation: Pennings PS, Kryazhimskiy S, Wakeley J (2014) Loss and Recovery of Genetic Diversity in Adapting Populations of HIV. PLoS Genet 10(1): e1004000.

doi:10.1371/journal.pgen.1004000

Editor: Christophe Fraser, Imperial College London, United Kingdom

Received April 19, 2013; Accepted October 19, 2013; Published January 23, 2014

Copyright: 2014 Pennings et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits

unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: SK was supported by a Career Award at Scientific Interface from the Burroughs Wellcome Fund (http://www.bwfund.org/). PSP was supported by a

long-term postdoctoral fellowship of the Human Frontier Science Program (http://www.hfsp.org/). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: pleuni@stanford.edu

We develop coalescent models for autotetraploid species with tetrasomic inheritance. We show that the ancestral genetic process in a large population without recombination may be approximated using Kingman’s standard coalescent, with a coalescent effective population size 4*N*. Numerical results suggest that this approximation is accurate for population sizes on the order of hundreds of individuals. Therefore, existing coalescent simulation programs can be adapted to study population history in autotetraploids simply by interpreting the timescale in units of 4*N* generations. We also consider the possibility of double reduction, a phenomenon unique to polysomic inheritance, and show that its effects on gene genealogies are similar to partial self-fertilization.

We address a conceptual flaw in the backward-time approach to population genetics called coalescent theory as it is applied to diploid biparental organisms. Specifically, the way random models of reproduction are used in coalescent theory is not justified. Instead, the population pedigree for diploid organisms--that is, the set of all family relationships among members of the population--although unknown, should be treated as a fixed parameter, not as a random quantity. Gene genealogical models should describe the outcome of the percolation of genetic lineages through the population pedigree according to Mendelian inheritance. Using simulated pedigrees, some of which are based on family data from 19th century Sweden, we show that in many cases the (conceptually wrong) standard coalescent model is difficult to reject statistically and in this sense may provide a surprisingly accurate description of gene genealogies on a fixed pedigree. We study the differences between the fixed-pedigree coalescent and the standard coalescent by analysis and simulations. Differences are apparent in recent past, within ≈ <log(2)(N) generations, but then disappear as genetic lineages are traced into the more distant past.

A large number of statistical tests have been proposed to detect natural selection based on a sample of variation at a single genetic locus. These tests measure the deviation of the allelic frequency distribution observed within populations from the distribution expected under a set of assumptions that includes both neutral evolution and equilibrium population demography. The present study considers a new way to assess the statistical properties of these tests of selection, by their behavior in response to direct perturbations of the steady-state allelic frequency distribution, unconstrained by any particular nonequilibrium demographic scenario. Results from Monte Carlo computer simulations indicate that most tests of selection are more sensitive to perturbations of the allele frequency distribution that increase the variance in allele frequencies than to perturbations that decrease the variance. Simulations also demonstrate that it requires, on average, 4N generations (N is the diploid effective population size) for tests of selection to relax to their theoretical, steady-state distributions following different perturbations of the allele frequency distribution to its extremes. This relatively long relaxation time highlights the fact that these tests are not robust to violations of the other assumptions of the null model besides neutrality. Lastly, genetic variation arising under an example of a regularly cycling demographic scenario is simulated. Tests of selection performed on this last set of simulated data confirm the confounding nature of these tests for the inference of natural selection, under a demographic scenario that likely holds for many species. The utility of using empirical, genomic distributions of test statistics, instead of the theoretical steady-state distribution, is discussed as an alternative for improving the statistical inference of natural selection.

Pacific salmon include several species that are both commercially important and endangered. Understanding the causes of loss in genetic variation is essential for designing better conservation strategies. Here we use a coalescent approach to analyze a model of the complex life history of salmon, and derive the coalescent effective population (CES). With the aid of Kronecker products and a convergence theorem for Markov chains with two time scales, we derive a simple formula for the CES and thereby establish its existence. Our results may be used to address important questions regarding salmon biology, in particular about the loss of genetic variation. To illustrate the utility of our approach, we consider the effects of fluctuations in population size over time. Our analysis enables the application of several tools of coalescent theory to the case of salmon.

Many short-lived organisms pass through several generations during favorable growing seasons, separated by inhospitable periods during which only small hibernating or estivating refugia remain. This induces pronounced seasonal fluctuations in population size and metapopulation structure. The first generations in the growing season will be characterized by small, relatively isolated demes whereas the later generations will experience larger deme sizes with more extensive gene flow. Fluctuations of this sort can induce changes in the amount of genetic variation in early season samples compared to late season samples, a classical example being the observations of seasonal variation in allelism in New England Drosophila populations by PT. Ives. In this article, we study the properties of a structured coalescent process under seasonal fluctuations using numerical analysis of exact state equations, analytical approximations that rely on a separation of timescales between intrademic versus interdemic processes, and individual-based simulations. We show that although an increase in genetic variation during each favorable growing season is observed, it is not as pronounced as in the empirical observations This suggests that some of the temporal patterns of variation seen by Ives may be due to selection against deleterious lethals rather than neutral processes.

*F*under a skewed offspring distribution among individuals in a population. Genetics. 2009;181 (2) :615-629.Abstract

_{ST}heterogeneity among subpopulations of certain marine organisms despite substantial gene flow