Smart genetic analysis made fast and easy

29 07 2021

If you use genetics to differentiate populations, the new package smartsnp might become your friend. Written in R and available from GitHub and CRAN, this package runs principal component analysis with control for genetic drift, projects ancient samples onto modern genetic space, and tests for population differences in genotypes. The package has been built to load big datasets and run complex stats in the blink of an eye, and is fully described in a paper published in Methods in Ecology and Evolution (1).


In the bioinformatics era, sequencing a genome has never been so straightforward. It is no surprise that more than 20 petabytes of genomic data are expected to be generated every year by the end of this decade (2). If 1 byte of information were 1 mm long, 20 petabytes would stretch across about 29,000 round trips to the moon. Data size in genetics keeps outpacing the computer power available to handle it at any given time (3). Many of us will be familiar with a computer freezing when it cannot load a huge dataset or run an analysis on it, and with the coffees or teas drunk (or the computer screens broken) during the wait. The bottom line is that software advances that speed up data processing and genetic analysis are always good news.
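The moon comparison can be checked with a few lines of arithmetic. This is a minimal sketch, assuming a binary petabyte (2^50 bytes) and a mean Earth-Moon distance of 384,400 km; neither assumption is stated in the post, and the function name is purely illustrative.

```python
# Assumptions (not from the post): binary petabytes and the mean Earth-Moon distance.
BYTES_PER_PETABYTE = 2**50
MEAN_EARTH_MOON_KM = 384_400
MM_PER_KM = 1_000_000


def moon_round_trips(petabytes, bytes_per_petabyte=BYTES_PER_PETABYTE):
    """Round trips to the moon covered if every byte were 1 mm long."""
    total_km = petabytes * bytes_per_petabyte / MM_PER_KM  # 1 byte = 1 mm
    return total_km / (2 * MEAN_EARTH_MOON_KM)


trips_binary = moon_round_trips(20)
trips_decimal = moon_round_trips(20, bytes_per_petabyte=10**15)
print(round(trips_binary), round(trips_decimal))
```

With binary petabytes the result lands a little above 29,000 round trips, in line with the figure quoted; with decimal petabytes (10^15 bytes) it drops to roughly 26,000.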

With that idea in mind, I have just published a paper presenting the new R package smartsnp (1) to run multivariate analysis of big genotype data, with applications to studies of ancestry, evolution, forensics, lineages, and overall population genetics. I am proud to say that the development of the package has been one of the most gratifying short-term collaborations in my entire career, with my colleagues Christian Huber and Ray Tobler: a true team effort!

The package is available on GitHub and the Comprehensive R Archive Network (CRAN). See download options here, and vignettes here with step-by-step instructions to run the different functionalities of our package (summarised below).

In this post, I use “genotype” to mean the combination of gene variants (alleles) across a predefined set of positions (loci) in the genome of a given individual animal, human, microbe, or plant. One type of variant is the single nucleotide polymorphism (SNP), a DNA locus at which two or more alternative nucleotides occur, sometimes affecting protein translation or gene expression. SNPs are relatively stable over time and are routinely used to identify individuals and ancestors in humans and wildlife.
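To make the genotype definition concrete, here is a minimal sketch of how SNP genotypes are commonly coded (0, 1, or 2 copies of an allele per locus) and how a PCA with a drift-style scaling and a projection of extra samples can be set up. It uses NumPy rather than smartsnp itself, the function names and simulated data are hypothetical, and the scaling follows the widely used normalisation of each SNP by sqrt(p(1 - p)); it does not reproduce the package's own code or its handling of missing data.

```python
import numpy as np


def drift_scale(genotypes, freq):
    """Centre each SNP on 2p and divide by sqrt(p(1 - p)), p = allele frequency."""
    return (genotypes - 2.0 * freq) / np.sqrt(freq * (1.0 - freq))


def fit_pca(modern, n_components=2):
    """PCA on a samples x SNPs matrix of 0/1/2 allele counts."""
    modern = np.asarray(modern, dtype=float)
    freq = modern.mean(axis=0) / 2.0        # allele frequency per SNP
    keep = (freq > 0.0) & (freq < 1.0)      # drop SNPs with no variation
    z = drift_scale(modern[:, keep], freq[keep])
    u, s, vt = np.linalg.svd(z, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    return scores, vt[:n_components], freq, keep


def project(samples, axes, freq, keep):
    """Place new samples on axes fitted to the modern samples only."""
    samples = np.asarray(samples, dtype=float)
    return drift_scale(samples[:, keep], freq[keep]) @ axes.T


# Simulated data (hypothetical): two modern populations and three "ancient" samples.
rng = np.random.default_rng(1)
freq_a = rng.uniform(0.1, 0.9, size=200)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.15, size=200), 0.05, 0.95)
modern = np.vstack([
    rng.binomial(2, freq_a, size=(10, 200)),
    rng.binomial(2, freq_b, size=(10, 200)),
])
ancient = rng.binomial(2, freq_a, size=(3, 200))

scores, axes, freq, keep = fit_pca(modern)
ancient_scores = project(ancient, axes, freq, keep)
print(scores.shape, ancient_scores.shape)
```

The ancient samples do not influence the axes: they are scaled with the modern allele frequencies and then placed on the axes computed from the modern samples alone.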

What the package does

The package smartsnp is partly based on the field-standard software EIGENSOFT (4, 5), which is only available for Unix command-line environments. In fact, our driving motivation was (i) to broaden the use of EIGENSOFT tools by making them available to the rocketing community of professionals, not only academics, who employ R for their work (6), and (ii) to optimise our package to handle big datasets and complex stats efficiently. Our package mimics EIGENSOFT’s principal component analysis (SMARTPCA) (5), and also runs multivariate tests for population differences in genotypes as follows:

Read the rest of this entry »




Less snow from climate change pushes evolution of browner birds

7 09 2017

© Bill Doherty


Climate changes exert selective pressures on the reproduction and survival of species. A study of tawny owls from Finland finds that the proportion of two colour morphs varies in response to the gradual decline of snowfall occurring in the boreal region.

Someone born in the tropics who travels to the Antarctic or the Himalaya can, of course, stand the cold (with a little engineering help from clothing). The physiology of our body is flexible enough to tolerate temperatures alien to those of our home. We can acclimate and, if we are healthy, we can reside virtually anywhere in the world.

However, modern climate change is steadily altering the thermal conditions of the native habitats of many species. Like us, some can tolerate as much heat or cold as their genetic heritage permits, because each species can express a range of morphological, physiological, and behavioural variation (plasticity). Others can modify their genetic make-up, giving rise to novel species-specific features or genotypes (evolution).

When genetic changes are speedy, that is, within a few generations, we are witnessing ‘microevolution’, in contrast to ‘macroevolution’ across geological time scales as originally reported by Darwin and Wallace (1). To date, the detection of microevolution in response to modern climate change remains elusive, and many studies claiming to have detected it seem to lack the appropriate data to differentiate microevolution from phenotypic plasticity (i.e., the capacity of a single genotype to exhibit variable phenotypes in different environments) (2, 3). Read the rest of this entry »