In this post I’ll look at replicating Hadley Wickham’s `gather()` tool from his `tidyr` package using the pandas `melt()` function. Why would anyone want to do this? Well, Dr. Wickham’s work is beautiful, and the `pandas.melt()` function is not as elegant as the `tidyr::gather()` function. You may read Dr. Wickham’s pre-print paper here.

# tidyr

The `tidyr` package contains functions to *tidy* data, which is to say, transform data sets so that each column represents a feature or variable, and each row represents a unique record, or observation. The RStudio Blog has a great introduction to the `tidyr` functions `gather()`, `separate()`, and `spread()`. The `gather()` function is used to convert *wide* data to *long* data. I’ll use an example from the RStudio Blog post to describe this operation. Suppose we have the following table describing heart rates for two different treatments:

Name | Treatment A | Treatment B |
---|---|---|
Wilbur | 67 | 56 |
Petunia | 80 | 90 |
Gregory | 64 | 50 |

We’d like to have treatments in one column, and heart rates in another column, like this:

Name | Treatment | Heart Rate |
---|---|---|
Wilbur | A | 67 |
Petunia | A | 80 |
Gregory | A | 64 |
Wilbur | B | 56 |
Petunia | B | 90 |
Gregory | B | 50 |
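To make the wide-to-long idea concrete before bringing in any library, here is a minimal plain-Python sketch of the same reshaping. The helper name `gather_rows` and the list-of-dicts layout are assumptions made for illustration only; neither comes from `tidyr` or pandas.

```python
# Wide records: one row per subject, one column per treatment.
wide = [
    {"name": "Wilbur", "a": 67, "b": 56},
    {"name": "Petunia", "a": 80, "b": 90},
    {"name": "Gregory", "a": 64, "b": 50},
]


def gather_rows(rows, key, value, cols):
    """Hypothetical helper: emit one long-format row per (subject, column) pair."""
    long_rows = []
    # Looping over columns first gives all of "a", then all of "b",
    # which is the row order shown in the long table above.
    for col in cols:
        for row in rows:
            # Keep the identifier fields, i.e. everything that is not gathered.
            new_row = {k: v for k, v in row.items() if k not in cols}
            new_row[key] = col         # the old column name becomes the key
            new_row[value] = row[col]  # the old cell becomes the value
            long_rows.append(new_row)
    return long_rows


long_rows = gather_rows(wide, "treatment", "heartrate", ["a", "b"])
for row in long_rows:
    print(row)
```

Each of the three wide rows contributes one long row per gathered column, so three subjects and two treatments yield six rows.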

The `tidyr::gather()` function achieves this deftly. We pass the name of the *key* column, `treatment`, and the name of the *value* column, `heartrate`, followed by an expression describing the columns to be gathered, which may take several forms. The three `gather()` calls in the code below are all equivalent. The colon in `a:b` means *“all columns from a to b”*, and the minus in `-name` means *“not the name column”*. If we cannot use a colon or a minus sign, then we may list the columns of interest as individual trailing arguments, as in `a, b`.

The `%>%` is a pipe that, in this example, passes `messy` through the `gather()` function. This is available to us through (wait for it) the `magrittr` package. (Get it? Magritte? Pipe?) The pipe operator allows you to chain a bunch of functions together instead of nesting them.

```r
library(magrittr)
library(tidyr)

messy <- data.frame(
  name = c("Wilbur", "Petunia", "Gregory"),
  a = c(67, 80, 64),
  b = c(56, 90, 50)
)
tidy <- messy %>% gather("treatment", "heartrate", a:b)
tidy <- messy %>% gather("treatment", "heartrate", a, b)
tidy <- messy %>% gather("treatment", "heartrate", -name)
```

# pandas

We can emulate this in pandas using the `melt()` function, which is similar to the `melt()` function in the other Wickham package, `reshape2`. In Python, we can recreate our toy data set:

```python
import pandas

names = ['Wilbur', 'Petunia', 'Gregory']
a = [67, 80, 64]
b = [56, 90, 50]
df = pandas.DataFrame({'names': names, 'a': a, 'b': b})
```

Then we can define a function that uses the `pandas.melt()` function to simplistically emulate the `tidyr::gather()` function.

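```python
def gather(df, key, value, cols):
    # Columns that are not gathered act as identifiers and repeat on every row.
    id_vars = [col for col in df.columns if col not in cols]
    # Keyword arguments make the mapping onto melt()'s parameters explicit.
    return pandas.melt(df, id_vars=id_vars, value_vars=cols,
                       var_name=key, value_name=value)
```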

Then we can call our function as:

```python
gather(df, 'treatment', 'heartrate', ['a', 'b'])
```

This gives us the following output:

| | names | treatment | heartrate |
|---|---|---|---|
| 0 | Wilbur | a | 67 |
| 1 | Petunia | a | 80 |
| 2 | Gregory | a | 64 |
| 3 | Wilbur | b | 56 |
| 4 | Petunia | b | 90 |
| 5 | Gregory | b | 50 |
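The list of columns is the part where pandas differs most from `tidyr`, which also accepts the `a:b` and `-name` forms. As a rough sketch, and not a feature of `melt()` itself, each form can be approximated by building the column list with ordinary pandas indexing first. The variable names `by_range`, `by_name`, and `by_exclusion` are made up for this example.

```python
import pandas

df = pandas.DataFrame({
    "names": ["Wilbur", "Petunia", "Gregory"],
    "a": [67, 80, 64],
    "b": [56, 90, 50],
})

# Roughly `a:b`: label slices with .loc include both endpoints.
by_range = list(df.loc[:, "a":"b"].columns)

# Roughly `a, b`: list the columns explicitly.
by_name = ["a", "b"]

# Roughly `-name`: every column except the identifier column.
by_exclusion = list(df.columns.drop("names"))

# All three selections name the same columns, so any of them can be melted.
tidy = pandas.melt(
    df,
    id_vars=["names"],
    value_vars=by_exclusion,
    var_name="treatment",
    value_name="heartrate",
)
print(by_range, by_name, by_exclusion)
print(tidy)
```

The range form depends on column order in the frame, just as `a:b` does in R, so the exclusion form is usually the safer choice when columns may be rearranged.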

So, our hacked together `gather()` function is still kind of clunky, and only a little bit less clunky than the `pandas.melt()` function, but it was fun learning about this stuff.
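One way to get a little closer to the `%>%` style is the `DataFrame.pipe()` method, which passes the frame as the first argument of a function and so lets calls be chained instead of nested. The sketch below repeats the helper so that it runs on its own; the sorting step is only there to show a second link in the chain and is not part of the original example.

```python
import pandas


def gather(df, key, value, cols):
    # Same idea as the helper above: columns not gathered are identifiers.
    id_vars = [col for col in df.columns if col not in cols]
    return pandas.melt(df, id_vars=id_vars, value_vars=cols,
                       var_name=key, value_name=value)


df = pandas.DataFrame({
    "names": ["Wilbur", "Petunia", "Gregory"],
    "a": [67, 80, 64],
    "b": [56, 90, 50],
})

# Each step receives the result of the previous one, much like `%>%` in R.
tidy = (
    df
    .pipe(gather, "treatment", "heartrate", ["a", "b"])
    .sort_values(["names", "treatment"])
    .reset_index(drop=True)
)
print(tidy)
```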

Thanks for the explanation!

Well, really you need to revise your introduction. Hadley Wickham authored the R packages reshape and reshape2, which is where melt originally came from. So much of pandas comes from Dr. Wickham’s packages. So in R we have the choice of reshape2::melt() or tidyr::gather(): melt is older and does more, while gather does less, but that is almost always the trend in Hadley Wickham’s packages. The same is happening with his ggplot2, which is stable, and ggvis, which does less but is more elegant to code, with the added benefit of interactive graphs.

Thanks for posting this. Between ggplot for python and pandas and insightful snippets like this, it’s much easier to make the transition to python.

Best & quickest description of how to work with the `gather()` function of the tidyr package that I could find – thank you!