Population Genomics With R presents a multidisciplinary approach to the analysis of population genomics. The methods treated cover a large number of topics, from traditional population genetics to large-scale genomics with high-throughput sequencing data. Several dozen R packages are examined and integrated to provide a coherent software environment with a wide range of computational, statistical, and graphical tools. Small examples are used to illustrate the basics, and published data are used as case studies. Readers are expected to have a basic knowledge of biology, genetics, and statistical inference methods. Graduate students and postdoctoral researchers will find resources to analyze their population genetic and genomic data, as well as help in designing new studies.
The first four chapters review the basics of population genomics, data acquisition, and the use of R to store and manipulate genomic data. Chapter 5 treats the exploration of genomic data, an important issue when analyzing large data sets. The other five chapters cover linkage disequilibrium, population genomic structure, geographical structure, past demographic events, and natural selection. These chapters include supervised and unsupervised methods, admixture analysis, an in-depth treatment of multivariate methods, and advice on how to handle GIS data. The analysis of natural selection, a traditional issue in evolutionary biology, has seen a revival with modern population genomic data. All chapters include exercises. Supplemental materials are available online (http://ape-package.ird.fr/PGR.html).
Emmanuel Paradis is a senior researcher at the French Institute of Research for Development (IRD). His research focuses on evolutionary models and their applications. The development and publication of software associated with his research has been an important aspect of his activities for more than twenty years. He adopted R as his main software for data analysis in 2000 and has since published and maintained several packages, including ape since 2002 and pegas since 2009. He regularly gives workshops and training courses in several countries.
1. Introduction
Heredity, Genetics, and Genomics
Principles of Population Genomics
Units
Genome Structures
Mutations
Drift and Selection
R Packages and Conventions
Required Knowledge and Other Readings
2. Data Acquisition
Samples and Sampling Designs
How Much DNA in a Sample?
Degraded Samples
Sampling Designs
Low-Throughput Technologies
Genotypes From Phenotypes
DNA Cleavage Methods
Repeat Length Polymorphism
Sanger and Shotgun Sequencing
DNA Methylation and Bisulfite Sequencing
High-Throughput Technologies
DNA Microarrays
High-Throughput Sequencing
Restriction Site Associated DNA
RNA Sequencing
Exome Sequencing
Sequencing of Pooled Individuals
Designing a Study With HTS
The Future of DNA Sequencing
File Formats
Data Files
Archiving and Compression
Bioinformatics and Genomics
Processing Sanger Sequencing Data With sangerseqR
Read Mapping With Rsubread
Managing Read Alignments With Rsamtools
Simulation of High-Throughput Sequencing Data
Exercises
3. Genomic Data in R
What is an R Data Object?
Data Classes for Genomic Data
The Class "loci" (pegas)
The Class "genind" (adegenet)
The Classes "SNPbin" and "genlight" (adegenet)
The Class "SnpMatrix" (snpStats)
The Class "DNAbin" (ape)
The Classes "XString" and "XStringSet" (Biostrings)
The Package SNPRelate
Data Input and Output
Reading Text Files
Reading Spreadsheet Files
Reading VCF Files
Reading PED and BED Files
Reading Sequence Files
Reading Annotation Files
Writing Files
Internet Databases
Managing Files and Projects
Exercises
4. Data Manipulation
Basic Data Manipulation in R
Subsetting, Replacement, and Deletion
Commonly Used Functions
Recycling and Coercion
Logical Vectors
Memory Management
Conversions
Case Studies
Mitochondrial Genomes of the Asiatic Golden Cat
Complete Genomes of the Fruit Fly
Human Genomes
Influenza H1N1 Virus Sequences
Jaguar Microsatellites
Bacterial Whole Genome Sequences
Metabarcoding of Fish Communities
Exercises
5. Data Exploration and Summaries
Genotype and Allele Frequencies
Allelic Richness
Missing Data
Haplotype and Nucleotide Diversity
The Class "haplotype"
Haplotype and Nucleotide Diversity From DNA Sequences
Genetic and Genomic Distances
Theoretical Background
Hamming Distance
Distances From DNA Sequences
Distances From Allele Sharing
Distances From Microsatellites
Summary by Groups
Sliding Windows
DNA Sequences
Summaries With Genomic Positions
Package SNPRelate
Multivariate Methods
Matrix Decomposition
Eigendecomposition
Singular Value Decomposition
Power Method and Random Matrices
Principal Component Analysis
adegenet
SNPRelate
flashpcaR
Multidimensional Scaling
Case Studies
Mitochondrial Genomes of the Asiatic Golden Cat
Complete Genomes of the Fruit Fly
Human Genomes
Influenza H1N1 Virus Sequences
Jaguar Microsatellites
Bacterial Whole Genome Sequences
Metabarcoding of Fish Communities
Exercises
6. Linkage Disequilibrium and Haplotype Structure
Why Linkage Disequilibrium is Important?
Linkage Disequilibrium: Two Loci
Phased Genotypes
Theoretical Background
Implementation in pegas
Unphased Genotypes
More Than Two Loci
Haplotypes From Unphased Genotypes
The Expectation-Maximization Algorithm
Implementation in haplo.stats
Locus-Specific Imputation
Maps of Linkage Disequilibrium
Phased Genotypes With pegas
SNPRelate
snpStats
Case Studies
Complete Genomes of the Fruit Fly
Human Genomes
Jaguar Microsatellites
Exercises
7. Population Genetic Structure
Hardy-Weinberg Equilibrium
F-Statistics
Theoretical Background
Implementations in pegas and in mmod
Implementations in snpStats and in SNPRelate
Trees and Networks
Minimum Spanning Trees and Networks
Statistical Parsimony
Median Networks
Phylogenetic Trees
Multivariate Methods
Principles of Discriminant Analysis
Discriminant Analysis of Principal Components
Clustering
Maximum Likelihood Methods
Bayesian Clustering
Admixture
Likelihood Method
Principal Component Analysis of Coancestry
A Second Look at F-Statistics
Case Studies
Mitochondrial Genomes of the Asiatic Golden Cat
Complete Genomes of the Fruit Fly
Influenza H1N1 Virus Sequences
Jaguar Microsatellites
Exercises
8. Geographical Structure
Geographical Data in R
Packages and Classes
Calculating Geographical Distances
A Third Look at F-Statistics
Hierarchical Components of Genetic Diversity
Analysis of Molecular Variance
Moran I and Spatial Autocorrelation
Spatial Principal Component Analysis
Finding Boundaries Between Populations
Spatial Ancestry (tess3r)
Bayesian Methods (Geneland)
Case Studies
Complete Genomes of the Fruit Fly
Human Genomes
Exercises
9. Past Demographic Events
The Coalescent
The Standard Coalescent
The Sequential Markovian Coalescent
Simulation of Coalescent Data
Estimation of θ
Heterozygosity
Number of Alleles
Segregating Sites
Microsatellites
Trees
Coalescent-Based Inference
Maximum Likelihood Methods
Analysis of Markov Chain Monte Carlo Outputs
Skyline Plots
Bayesian Methods
Heterochronous Samples
Site Frequency Spectrum Methods
The Stairway Method
CubSFS
Popsicle
Whole-Genome Methods (psmcr)
Case Studies
Mitochondrial Genomes of the Asiatic Golden Cat
Complete Genomes of the Fruit Fly
Influenza H1N1 Virus Sequences
Bacterial Whole Genome Sequences
Exercises
10. Natural Selection
Testing Neutrality
Simple Tests
Selection in Protein-Coding Sequences
Selection Scans
A Fourth Look at F-Statistics
Association Studies (LEA)
Principal Component Analysis (pcadapt)
Scans for Selection With Extended Haplotypes
FST Outliers
Time-Series of Allele Frequencies
Case Studies
Mitochondrial Genomes of the Asiatic Golden Cat
Complete Genomes of the Fruit Fly
Influenza H1N1 Virus Sequences
Exercises
A Installing R Packages
B Compressing Large Sequence Files
C Sampling of Alleles in a Population