Need human census data for any of your analyses? Follow these simple steps

25 02 2022

As someone who regularly delves into human demography — often from a conservation perspective — I’m always on the lookout for quick and easy ways to get the latest and greatest datasets. Whether it’s for projecting human populations, or just getting country-specific population densities, I’ve found a really nice way to interface with high-quality human demographic data directly from R.

In this particular example, I’m using an API (application programming interface) key to access live data on the US Census Bureau server (don’t worry — they have global data, not just data specific to the US). What’s an ‘API key’? It’s just a code that gives you permission to access the server directly from an application via an internet link.

Step 1: Apply for an API key

This is a straightforward process and just needs to be done via this URL. The approval process doesn’t take long.

Step 2: Install the idbr package in R

This stands for the ‘(US Census Bureau) International Data Base (R)’, and the package grants access to and queries demographic data — contemporary and historical estimates, plus projections to 2100 — for countries with ≥ 5000 people.
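A minimal sketch of the installation step (assumes an internet connection; idbr is on CRAN):

```r
# Install the package from CRAN, then load it for the session
install.packages("idbr")
library(idbr)
```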


Step 3: Set your API key

You need to set your user API key using the following commands:
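Something like the following, using idbr’s key-registration function (replace the placeholder with the key e-mailed to you after Step 1):

```r
library(idbr)

# Register your personal key with idbr for the current session
idb_api_key("YOUR_API_KEY_HERE")
```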


Step 4: Get data

Using the get_idb() command, you can specify all sorts of queries to get various levels of data complexity. All the variable combinations for the international database are described well here.

Example 1. Life expectancy

Let’s say you wanted to plot a map of the world with the shading of a country related to its average life expectancy at birth. First we get the necessary data:

lex.dat <- idbr::get_idb(
  country = "all",
  year = 2022,
  variables = c("name", "e0"),
  geometry = TRUE)

The ensuing lex.dat object looks like this:

Simple feature collection with 6 features and 4 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -73.41544 ymin: -55.25 xmax: 75.15803 ymax: 42.68825
code year name e0 geometry
1 AF 2022 Afghanistan 53.65 MULTIPOLYGON (((61.21082 35…
2 AO 2022 Angola 62.11 MULTIPOLYGON (((16.32653 -5…
3 AL 2022 Albania 79.47 MULTIPOLYGON (((20.59025 41…
4 AE 2022 United Arab Emirates 79.56 MULTIPOLYGON (((51.57952 24…
5 AR 2022 Argentina 78.31 MULTIPOLYGON (((-65.5 -55.2…
6 AM 2022 Armenia 76.13 MULTIPOLYGON (((43.58275 41…
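From here, one way to draw the shaded world map (a sketch assuming the sf and ggplot2 packages, with column names as in the output above):

```r
library(sf)
library(ggplot2)

# Shade each country's polygon by life expectancy at birth (e0)
ggplot(lex.dat) +
  geom_sf(aes(fill = e0)) +
  scale_fill_viridis_c(name = "Life expectancy\nat birth (years)") +
  theme_void()
```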


Smart genetic analysis made fast and easy

29 07 2021

If you use genetics to differentiate populations, the new package smartsnp might be your new friend. Written in R and available from GitHub and CRAN, this package does principal component analysis with control for genetic drift, projects ancient samples onto modern genetic space, and tests for population differences in genotypes. The package has been built to load big datasets and run complex stats in the blink of an eye, and is fully described in a paper published in Methods in Ecology and Evolution (1).

In the bioinformatics era, sequencing a genome has never been so straightforward. No surprise that > 20 petabytes of genomic data are expected to be generated every year by the end of this decade (2) — if 1 byte of information was 1 mm long, we could make 29,000 round trips to the moon with 20 petabytes. Data size in genetics keeps outpacing the computer power available to handle it at any given time (3). Many will be familiar with a computer freezing if unable to load or run an analysis on a huge dataset, and how many coffees or teas we might have drunk, or computer screens might have been broken, during the wait. The bottom line is that software advances that speed up data processing and genetic analysis are always good news.

With that idea in mind, I have just published a paper presenting the new R package smartsnp (1) to run multivariate analysis of big genotype data, with applications to studies of ancestry, evolution, forensics, lineages, and overall population genetics. I am proud to say that the development of the package has been one of the most gratifying short-term collaborations in my entire career, with my colleagues Christian Huber and Ray Tobler: a true team effort!

The package is available on GitHub and the Comprehensive R Archive Network (CRAN). See downloading options here, and vignettes here with step-by-step instructions to run the different functionalities of our package (summarised below).

In this blog, I use “genotype” to mean the combination of gene variants (alleles) across a predefined set of positions (loci) in the genome of a given animal, human, microbe, or plant. One type of variant is the single-nucleotide polymorphism (SNP), a DNA locus at which two or more alternative nucleotides occur, sometimes conditioning protein translation or gene expression. SNPs are relatively stable over time and are routinely used to identify individuals and ancestors in humans and wildlife.

What the package does

The package smartsnp is partly based on the field-standard software EIGENSOFT (4, 5) which is only available for Unix command-line environments. In fact, our driving motivation was (i) to broaden the use of EIGENSOFT tools by making them available to the rocketing community of professionals, not only academics who employ R for their work (6), and (ii) to optimise our package to handle big datasets and complex stats efficiently. Our package mimics EIGENSOFT’s principal component analysis (SMARTPCA) (5), and also runs multivariate tests for population differences in genotypes as follows:
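To give a flavour, a minimal sketch of a SMARTPCA-style run with the package (function and argument names as in the CRAN release; the input here is a toy genotype matrix rather than real data, so the output is for illustration only):

```r
library(smartsnp)

# Toy genotype matrix: SNPs in rows, samples in columns, coded as 0/1/2 allele counts
set.seed(1)
geno <- matrix(sample(0:2, 50 * 10, replace = TRUE), nrow = 50)

# Principal component analysis with samples assigned to two groups
pca <- smart_pca(snp_data = geno,
                 sample_group = rep(c("pop1", "pop2"), each = 5))

# Inspect the PC coordinates of each sample
str(pca$pca.sample_coordinates)
```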


Dangers of forcing regressions through the origin

17 10 2017

I had an interesting ‘discussion’ on Twitter yesterday that convinced me the topic would make a useful post. The specific example has nothing whatsoever to do with conservation, but it serves as a valuable statistical lesson for all concerned about demonstrating adequate evidence before jumping to conclusions.

The data in question were used in a correlation between national gun ownership (guns per capita) and gun-related deaths and injuries (total deaths and injuries from guns per 100,000 people) (the third figure in the article). As you might intuitively expect, the author concluded that there was a positive correlation between gun-related deaths and injuries, and gun ownership:



Now, if you’re an empirical skeptic like me, there was something fishy about that fitted trend line. So, I replotted the data (available here) using Plot Digitizer (if you haven’t yet discovered this wonderful tool for lifting data out of figures, you would be wise to get it now), and ran a little analysis of my own in R:


Just fitting a 2-parameter linear model (y ~ α + βx) in R to these log-log data (which implies a power relationship) shows that there’s no relationship at all — the intercept is 1.3565 (± 0.3814) in log space (i.e., 10^1.3565 ≈ 22.72), and there’s no evidence for a non-zero slope (in fact, the estimated slope is negative at -0.1411, but it has no support). See R code here.

Now, the author pointed out what appears to be a rather intuitive requirement for this analysis — you should not have a positive number of gun-related deaths/injuries if there are no guns in the population; in other words, the relationship should be forced to go through the origin (x, y = 0, 0). You can easily do this in R by using the lm function and setting the formula to y ~ 0 + x (see code here).
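A minimal sketch of the two fits side by side (using simulated data, since the original dataset was digitised from the article’s figure):

```r
set.seed(42)

# Hypothetical log-log data with no true relationship, for illustration only
log.guns   <- runif(50, -1, 2)                  # log10 guns per capita
log.deaths <- rnorm(50, mean = 1.36, sd = 0.4)  # log10 deaths per 100,000

m1 <- lm(log.deaths ~ log.guns)      # 2-parameter model: intercept + slope
m2 <- lm(log.deaths ~ 0 + log.guns)  # forced through the origin: slope only

summary(m1)  # slope indistinguishable from zero
summary(m2)  # intercept-free fit absorbs the mean of y into the slope
```

Dropping the intercept forces the line through (0, 0), so any positive mean in y gets absorbed into the slope estimate — which is exactly how a spurious ‘positive relationship’ can materialise.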
