Smart genetic analysis made fast and easy

29 07 2021

If you use genetics to differentiate populations, the new package smartsnp might be your new friend. Written in R and available from GitHub and CRAN, the package runs principal component analysis with control for genetic drift, projects ancient samples onto modern genetic space, and tests for population differences in genotypes. It has been built to load big datasets and run complex statistics in the blink of an eye, and is fully described in a paper published in Methods in Ecology and Evolution (1).


In the bioinformatics era, sequencing a genome has never been so straightforward. No surprise that > 20 petabytes of genomic data are expected to be generated every year by the end of this decade (2) — if 1 byte of information were 1 mm long, 20 petabytes would let us make some 29,000 round trips to the moon. Data size in genetics keeps outpacing the computer power available to handle it at any given time (3). Many will be familiar with a computer freezing when it cannot load or analyse a huge dataset, and with the coffees or teas drunk (and the occasional screen broken) during the wait. The bottom line is that software advances that speed up data processing and genetic analysis are always good news.
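The moon figure checks out as a back-of-envelope calculation, assuming binary petabytes (2^50 bytes each) and a mean Earth-to-Moon distance of about 384,400 km:

```r
# Back-of-envelope check of the moon claim: 20 petabytes at 1 mm per byte
bytes <- 20 * 2^50                    # 20 binary petabytes, in bytes
length.km <- bytes / 1e6              # 1 byte = 1 mm, so divide by 1e6 to get km
round.trip.km <- 2 * 384400           # mean Earth-to-Moon distance ~ 384,400 km
length.km / round.trip.km             # roughly 29,000 round trips
```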

With that idea in mind, I have just published a paper presenting the new R package smartsnp (1) to run multivariate analysis of big genotype data, with applications to studies of ancestry, evolution, forensics, lineages, and overall population genetics. I am proud to say that the development of the package has been one of the most gratifying short-term collaborations in my entire career, with my colleagues Christian Huber and Ray Tobler: a true team effort!

The package is available on GitHub and the Comprehensive R Archive Network (CRAN). See downloading options here, and vignettes here with step-by-step instructions for running the different functionalities of our package (summarised below).

In this blog, I use “genotype” to mean the combination of gene variants (alleles) across a predefined set of positions (loci) in the genome of a given animal, human, microbe, or plant. One type of variant is the single nucleotide polymorphism (SNP), a DNA locus at which two or more alternative nucleotides occur, sometimes conditioning protein translation or gene expression. SNPs are relatively stable over time and are routinely used to identify individuals and ancestors in humans and wildlife.

What the package does

The package smartsnp is partly based on the field-standard software EIGENSOFT (4, 5), which is only available for Unix command-line environments. In fact, our driving motivations were (i) to broaden the use of EIGENSOFT tools by making them available to the rocketing community of professionals, not only academics, who employ R for their work (6), and (ii) to optimise our package to handle big datasets and complex statistics efficiently. Our package mimics EIGENSOFT’s principal component analysis (SMARTPCA) (5), and also runs multivariate tests for population differences in genotypes as follows:

Read the rest of this entry »
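As a conceptual aside, the projection idea (defining a principal-component space from one set of samples and mapping new samples into it) can be sketched in a few lines of base R. This is an illustration only, with simulated genotypes, and not smartsnp's implementation (which, among other things, controls for genetic drift and handles much larger matrices):

```r
# Conceptual sketch: project "ancient" samples onto a principal-component
# space defined by "modern" samples. All genotypes here are simulated.
set.seed(1)
modern  <- matrix(rbinom(50 * 200, 2, 0.3), nrow = 50) # 50 modern samples x 200 loci (0/1/2 genotypes)
ancient <- matrix(rbinom(5 * 200, 2, 0.3), nrow = 5)   # 5 ancient samples, same loci

pca  <- prcomp(modern, center = TRUE, scale. = FALSE)  # PC space built from modern samples only
proj <- predict(pca, newdata = ancient)                # ancient samples projected onto that space
proj[, 1:2]                                            # coordinates on PC1 and PC2
```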




Journal ranks 2020

23 07 2021

This is the 13th year in a row that I’ve generated journal ranks based on the journal-ranking method we published several years ago.

There are a few differences in how I calculated this year’s ranks, as well as some relevant updates:

  1. As always, I’ve added a few new journals (either those that have only recently been scored with the component metrics, or ones I’ve simply missed before);
  2. I’ve included the new ‘Journal Citation Indicator’ (JCI) in addition to the Journal Impact Factor and Immediacy Index from Clarivate ISI. The JCI, “a field-normalised metric, represents the average category-normalised citation impact for papers published in the prior three-year period”. In other words, it’s supposed to correct for field-specific citation trends;
  3. While this isn’t my change, the Clarivate metrics are now calculated based on when an article is first published online, rather than just in an issue. You would have thought that this should have been the case for many years, but they’ve only just done it;
  4. I’ve also added the ‘CiteScore’ (CS) in addition to the Source-Normalised Impact Per Paper (SNIP) and SCImago Journal Rank (SJR) from Scopus. CS is “the number of citations, received in that year and previous 3 years, for documents published in the journal during that period (four years), divided by the total number of published documents … in the journal during the same four-year period”;
  5. Finally, you can access the raw data for 2020 (I’ve done the hard work for you) and use my RShiny app to derive your own samples of journal ranks (also see the relevant blog post). You can also add new journals to the list if my sample isn’t comprehensive enough for you.
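To illustrate the general idea of combining several metrics into one composite rank, here is a minimal sketch that ranks journals within each metric and averages the ranks. The journal names and metric values are invented, and this is a simplification of the published ranking method (which also expresses uncertainty by resampling):

```r
# Hypothetical illustration: composite journal rank as the mean of
# within-metric ranks. All journal names and values are invented.
metrics <- data.frame(
  journal = c("Journal A", "Journal B", "Journal C", "Journal D"),
  IF   = c(9.1, 4.3, 6.2, 2.0),  # Impact Factor
  JCI  = c(2.1, 1.0, 1.6, 0.7),  # Journal Citation Indicator
  CS   = c(12.0, 6.5, 8.8, 3.1), # CiteScore
  SNIP = c(2.4, 1.2, 1.8, 0.9)   # Source-Normalised Impact per Paper
)
rank.mat <- apply(metrics[, -1], 2, function(x) rank(-x)) # rank 1 = best within each metric
metrics$mean.rank <- rowMeans(rank.mat)                   # average rank across metrics
metrics[order(metrics$mean.rank), c("journal", "mean.rank")]
```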

Since the Google Scholar metrics were just released today, I present the new 2020 ranks for: (i) 101 ecology, conservation and multidisciplinary journals, and a subset of (ii) 61 ‘ecology’ journals, (iii) 29 ‘conservation’ journals, (iv) 41 ‘sustainability’ journals (with general and energy-focussed journals included), and (v) 20 ‘marine & freshwater’ journals.

One final observation. I’ve noted that several journals are boasting about how their Impact Factors have increased this year, while failing to mention that this is the norm across most journals. As you’ll see below, relative ranks don’t actually change that much for most journals. In fact, this is a redacted email I received from a journal that I will not identify here:

We’re pleased to let you know that the new Impact Factor for [JOURNAL NAME] marks a remarkable increase, as it now stands at X.XXX, compared to last year’s X.XXX. And what is even more important: [JOURNAL NAME] increased its rank in the relevant disciplines: [DISCIPLINE NAME].

Although the Impact Factor may not be the perfect indicator of success, it remains the most widely recognised one at journal level. Therefore, we’re excited to share this achievement with you, as it wouldn’t have been possible, had it not been for all of your contributions and support as authors, reviewers, editors and readers. A huge ‘THANK YOU’ goes to all of you!

What bullshit.

Anyway, on to the results:

Read the rest of this entry »





Interval between extremely wet years increasing?

16 07 2021

The other day I was playing around with some Bureau of Meteorology data for my little patch of the Adelaide Hills (free data — how can I resist?), when I discovered an interesting trend.

Living on a little farm with a small vineyard, I’m rather keen on understanding our local weather trends. Being a scientist, I’m also rather inclined to analyse data.

My first question was this: given the strong warming trend here and everywhere else, plus ample evidence of changing rainfall patterns in Australia (e.g., see here, here, here, here, here), is my area drying out, getting wetter, or changing its seasonal pattern of rainfall?

I first looked to see if there was any long-term trend in total annual rainfall over time. Luckily, the station records nearest my farm go all the way back to 1890:

While the red line might suggest a slight decrease since the late 19th Century, it’s no better supported than an intercept-only model (evidence ratio = 0.84) — no trend.

Here’s the R code to do that analysis (you can download the data here, or provide your own data in the same format):

## IMPORT MONTHLY PRECIPITATION DATA
dat <- read.table("monthlyprecipdata.csv", header=TRUE, sep=",")

## CALCULATE ANNUAL VECTORS
# sum monthly precipitation within each year
precip.yr.sum <- xtabs(dat$Monthly.Precipitation.Total..millimetres. ~ dat$Year)
precip.yr.sum <- precip.yr.sum[-length(precip.yr.sum)] # drop the incomplete final year
year.vec <- as.numeric(names(precip.yr.sum))

## PLOT
plot(year.vec, as.numeric(precip.yr.sum), type="l", pch=19, xlab="year", ylab="annual precipitation (mm)")
fit.yr <- lm(precip.yr.sum ~ year.vec) # linear trend
abline(fit.yr, lty=2, lwd=2, col="red")
abline(h=mean(as.numeric(precip.yr.sum)), lty=2, lwd=3) # long-term mean

## TEST FOR TREND
# functions
AICc <- function(...) { # small-sample-corrected AIC for one or more fitted models
  models <- list(...)
  num.mod <- length(models)
  AICc.vec <- numeric(num.mod)
  for (i in 1:num.mod) {
    # n = sample size; k = number of estimated parameters (+1 for the residual variance)
    if (length(models[[i]]$df.residual) == 0) n <- models[[i]]$dims$N else n <- length(models[[i]]$residuals)
    if (length(models[[i]]$df.residual) == 0) k <- sum(models[[i]]$dims$ncol) else k <- length(models[[i]]$coeff) + 1
    AICc.vec[i] <- (-2*logLik(models[[i]])) + ((2*k*n)/(n-k-1))
  }
  return(AICc.vec)
}

delta.AIC <- function(x) x - min(x) ## where x is a vector of AIC
weight.AIC <- function(x) (exp(-0.5*x))/sum(exp(-0.5*x)) ## Where x is a vector of dAIC
ch.dev <- function(x) ((( as.numeric(x$null.deviance) - as.numeric(x$deviance) )/ as.numeric(x$null.deviance))*100) ## % change in deviance, where x is glm object

linreg.ER <- function(x,y) { # where x and y are vectors of the same length; calls AICc, delta.AIC, weight.AIC functions
  fit.full <- lm(y ~ x); fit.null <- lm(y ~ 1)
  AIC.vec <- c(AICc(fit.full), AICc(fit.null))
  dAIC.vec <- delta.AIC(AIC.vec); wAIC.vec <- weight.AIC(dAIC.vec)
  ER <- wAIC.vec[1]/wAIC.vec[2] # evidence ratio: support for the trend model relative to the intercept-only model
  r.sq.adj <- summary(fit.full)$adj.r.squared
  return(c(ER, r.sq.adj))
}

linreg.ER(year.vec, as.numeric(precip.yr.sum))

Read the rest of this entry »




‘Living’ figures

8 07 2021

Have you ever constructed a database and then published the findings, only to realise that, by the time they appeared, your database was already obsolete?

This is the reality of scientific information today. There are so many of us doing so many things that information accumulates substantially in months, if not weeks. If you’re a geneticist, this probably happens for many datasets on the order of days.

While our general databasing capacity worldwide has improved enormously over the last decade with the transition to fully online and web-capable interactivity, the world of scientific publication still generally lags behind the tech. But there is a better way to communicate dynamic, evolving database results to the public.

Enter the ‘living figure’, which is a simple-enough concept where a published figure remains dynamic as its underlying database is updated.
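The mechanics can be as simple as a plotting function that re-draws from whatever the current version of the database contains, so the figure updates whenever the data do. Here is a minimal sketch with invented data standing in for two successive database versions; in practice, the script would re-read the database from its online source each time the figure is rendered:

```r
# Minimal sketch of a 'living figure': the same plotting code renders
# whatever the current database version contains. Data here are invented.
plot.costs <- function(dat) {
  agg <- aggregate(cost ~ year, data = dat, FUN = sum) # total cost per year
  plot(agg$year, agg$cost, type = "b", pch = 19,
       xlab = "year", ylab = "total cost")
  invisible(agg)
}

v1 <- data.frame(year = c(2000, 2005, 2010), cost = c(10, 25, 40)) # version 1
v2 <- rbind(v1, data.frame(year = 2015, cost = 60))                # version 2: database updated

plot.costs(v1) # the figure as first published
plot.costs(v2) # the same code renders the updated figure
```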

We have, in fact, just published such a living figure based on our paper from earlier this year, in which we reported the global costs of invasive species.

That paper was published based on version 1 of the InvaCost database, but a mere three months after publication, InvaCost is already at version 4.

Read the rest of this entry »