Smart genetic analysis made fast and easy

29 07 2021

If you use genetics to differentiate populations, the new package smartsnp might be your new friend. Written in the R language and available from GitHub and CRAN, this package does principal component analysis with control for genetic drift, projects ancient samples onto modern genetic space, and tests for population differences in genotypes. The package has been built to load big datasets and run complex stats in the blink of an eye, and is fully described in a paper published in Methods in Ecology and Evolution (1).


In the bioinformatics era, sequencing a genome has never been so straightforward. No surprise that more than 20 petabytes of genomic data are expected to be generated every year by the end of this decade (2): if 1 byte of information were 1 mm long, we could make 29,000 round trips to the moon with 20 petabytes. Data size in genetics keeps outpacing the computer power available to handle it at any given time (3). Many will be familiar with a computer freezing when it cannot load or analyse a huge dataset, and with the coffees or teas drunk (or the computer screens broken) during the wait. The bottom line is that software advances that speed up data processing and genetic analysis are always good news.
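For the curious, the moon comparison above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the mean Earth-Moon distance of 384,400 km and the two possible definitions of a petabyte (10^15 bytes or 2^50 bytes) are my assumptions, not figures from the cited sources, and the function name is made up for illustration.

```python
# Back-of-the-envelope check of the "round trips to the moon" comparison.
# Assumptions (not from the post): mean Earth-Moon distance of 384,400 km,
# and a petabyte taken either as 10**15 bytes (decimal) or 2**50 bytes (binary).

MOON_DISTANCE_KM = 384_400
MM_PER_KM = 1_000_000


def round_trips(petabytes, bytes_per_petabyte):
    """Number of Earth-Moon round trips if every byte were 1 mm long."""
    total_mm = petabytes * bytes_per_petabyte  # 1 byte = 1 mm
    total_km = total_mm / MM_PER_KM
    return total_km / (2 * MOON_DISTANCE_KM)


decimal_trips = round_trips(20, 10**15)
binary_trips = round_trips(20, 2**50)

print(f"decimal petabytes: {decimal_trips:,.0f} round trips")
print(f"binary petabytes:  {binary_trips:,.0f} round trips")
```

Under these assumptions the result is roughly 26,000 round trips with decimal petabytes and roughly 29,000 with binary ones, so the figure quoted above appears to correspond to the binary definition.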

With that idea in mind, I have just published a paper presenting the new R package smartsnp (1) to run multivariate analysis of big genotype data, with applications to studies of ancestry, evolution, forensics, lineages, and overall population genetics. I am proud to say that the development of the package has been one of the most gratifying short-term collaborations in my entire career, with my colleagues Christian Huber and Ray Tobler: a true team effort!

The package is available on GitHub and the Comprehensive R Archive Network (CRAN). See download options here, and vignettes here with step-by-step instructions to run the different functionalities of our package (summarised below).

In this blog post, I use “genotype” to mean the combination of gene variants (alleles) across a predefined set of positions (loci) in the genome of a given individual animal, human, microbe, or plant. One type of variant is the single nucleotide polymorphism (SNP), a DNA locus at which two or more alternative nucleotides occur, sometimes conditioning protein translation or gene expression. SNPs are relatively stable over time and are routinely used to identify individuals and ancestors in humans and wildlife.

What the package does

The package smartsnp is partly based on the field-standard software EIGENSOFT (4, 5), which is only available for Unix command-line environments. In fact, our driving motivation was (i) to broaden the use of EIGENSOFT tools by making them available to the rocketing community of professionals, not only academics, who employ R for their work (6), and (ii) to optimise our package to handle big datasets and complex stats efficiently. Our package mimics EIGENSOFT’s principal component analysis (SMARTPCA) (5), and also runs multivariate tests for population differences in genotypes as follows:

Read the rest of this entry »




Being empathetic for better interdisciplinarity

4 06 2019

Source: taazatadka.com

(Originally published on the GE.blog)

Scientists appear to have mixed feelings when it comes to interdisciplinarity in science — the reaction spans from genuine enthusiasm right through to pure disdain.

I myself have crossed many research fields since my Master’s project, but despite the support of my supervisors, I have already had to face some tough gatekeeping from science specialists at conferences and in front of other panels. Several times I was taken aback by some reactions, so I became interested in the topic from a more analytical perspective. How are these fields’ boundaries defined in science?

Although each field’s specific methodology, jargon, and way of interpreting results can act as communication barriers between fields, these barriers can be easily overcome by spending time learning the language of other groups, in the company of specialist collaborators, or by attending workshops.

But what about ideology, meaning a philosophy of science inherent to a specific group of individuals? This is one of the things that make us human. It definitely affects our society, and even if it is never acknowledged, it also affects the generation of scientific knowledge from its production to its transmission. Scientists have that connection to their field, its history, its identity, and its compromises.

For example, historians or philosophers use different ways of thinking than do physicists or biologists. The first group aims to clarify and analyse the reconstruction of past events, while the second group strives for conceptual understanding. While useful within a field, these specific ways of seeing science can generate roadblocks when two fields need to start a conversation.

I will tell you a story based on my own experience. Read the rest of this entry »