Intermediate R for reproducible scientific analysis

reshape2

Learning objectives

  • To know how split a column by a group to create multiple columns
  • To know how to combine multiple columns into one with different groupings.

So far we’ve been working with the gapminder dataset:

library("data.table")
gap <- fread("data/gapminder-FiveYearData.csv")
gap
##           country year      pop continent lifeExp gdpPercap
##    1: Afghanistan 1952  8425333      Asia  28.801  779.4453
##    2: Afghanistan 1957  9240934      Asia  30.332  820.8530
##    3: Afghanistan 1962 10267083      Asia  31.997  853.1007
##    4: Afghanistan 1967 11537966      Asia  34.020  836.1971
##    5: Afghanistan 1972 13079460      Asia  36.088  739.9811
##   ---                                                      
## 1700:    Zimbabwe 1987  9216418    Africa  62.351  706.1573
## 1701:    Zimbabwe 1992 10704340    Africa  60.377  693.4208
## 1702:    Zimbabwe 1997 11404948    Africa  46.809  792.4500
## 1703:    Zimbabwe 2002 11926563    Africa  39.989  672.0386
## 1704:    Zimbabwe 2007 12311143    Africa  43.487  469.7093

The format that this dataset has been provided in is called a “long” format: each variable has its own column, and there are several columns with identifying information: where those observations came from/groupings in the data.

This is a format that is convenient for rapid analysis using R: it is easy to manipulate, filter and query using data.table, and is the format ggplot2 expects for plotting.

However, data often does not come in this format. We might sometimes find we are given data in what is known as a “wide” format: in which variables may be split over multiple columns, one for each group. Let’s read in an alternative version of the gapminder dataset:

# This is an example where `fread` doesn't work: it loses the column names!
gapWide <- as.data.table(read.csv("data/gapminder-wide-format.csv", header=TRUE))
gapWide
##      continent        country pop.1952 pop.1957 pop.1962 pop.1967 pop.1972
##   1:    Africa        Algeria  9279525 10270856 11000948 12760499 14760787
##   2:    Africa         Angola  4232095  4561361  4826015  5247469  5894858
##   3:    Africa          Benin  1738315  1925173  2151895  2427334  2761407
##   4:    Africa       Botswana   442308   474639   512764   553541   619351
##   5:    Africa   Burkina Faso  4469979  4713416  4919632  5127935  5433886
##  ---                                                                      
## 138:    Europe    Switzerland  4815000  5126000  5666000  6063000  6401400
## 139:    Europe         Turkey 22235677 25670939 29788695 33411317 37492953
## 140:    Europe United Kingdom 50430000 51430000 53292000 54959000 56079000
## 141:   Oceania      Australia  8691212  9712569 10794968 11872264 13177000
## 142:   Oceania    New Zealand  1994794  2229407  2488550  2728150  2929100
##      pop.1977 pop.1982 pop.1987 pop.1992 pop.1997 pop.2002 pop.2007
##   1: 17152804 20033753 23254956 26298373 29072015 31287142 33333216
##   2:  6162675  7016384  7874230  8735988  9875024 10866106 12420476
##   3:  3168267  3641603  4243788  4981671  6066080  7026113  8078314
##   4:   781472   970347  1151184  1342614  1536536  1630347  1639131
##   5:  5889574  6634596  7586551  8878303 10352843 12251209 14326203
##  ---                                                               
## 138:  6316424  6468126  6649942  6995447  7193761  7361757  7554661
## 139: 42404033 47328791 52881328 58179144 63047647 67308928 71158647
## 140: 56179000 56339704 56981620 57866349 58808266 59912431 60776238
## 141: 14074100 15184200 16257249 17481977 18565243 19546792 20434176
## 142:  3164900  3210650  3317166  3437674  3676187  3908037  4115771
##      lifeExp.1952 lifeExp.1957 lifeExp.1962 lifeExp.1967 lifeExp.1972
##   1:       43.077       45.685       48.303       51.407       54.518
##   2:       30.015       31.999       34.000       35.985       37.928
##   3:       38.223       40.358       42.618       44.885       47.014
##   4:       47.622       49.618       51.520       53.298       56.024
##   5:       31.975       34.906       37.814       40.697       43.591
##  ---                                                                 
## 138:       69.620       70.560       71.320       72.770       73.780
## 139:       43.585       48.079       52.098       54.336       57.005
## 140:       69.180       70.420       70.760       71.360       72.010
## 141:       69.120       70.330       70.930       71.100       71.930
## 142:       69.390       70.260       71.240       71.520       71.890
##      lifeExp.1977 lifeExp.1982 lifeExp.1987 lifeExp.1992 lifeExp.1997
##   1:       58.014       61.368       65.799       67.744       69.152
##   2:       39.483       39.942       39.906       40.647       40.963
##   3:       49.190       50.904       52.337       53.919       54.777
##   4:       59.319       61.484       63.622       62.745       52.556
##   5:       46.137       48.122       49.557       50.260       50.324
##  ---                                                                 
## 138:       75.390       76.210       77.410       78.030       79.370
## 139:       59.507       61.036       63.108       66.146       68.835
## 140:       72.760       74.040       75.007       76.420       77.218
## 141:       73.490       74.740       76.320       77.560       78.830
## 142:       72.220       73.840       74.320       76.330       77.550
##      lifeExp.2002 lifeExp.2007 gdpPercap.1952 gdpPercap.1957
##   1:       70.994       72.301      2449.0082      3013.9760
##   2:       41.003       42.731      3520.6103      3827.9405
##   3:       54.406       56.728      1062.7522       959.6011
##   4:       46.634       50.728       851.2411       918.2325
##   5:       50.650       52.295       543.2552       617.1835
##  ---                                                        
## 138:       80.620       81.701     14734.2327     17909.4897
## 139:       70.845       71.777      1969.1010      2218.7543
## 140:       78.471       79.425      9979.5085     11283.1779
## 141:       80.370       81.235     10039.5956     10949.6496
## 142:       79.110       80.204     10556.5757     12247.3953
##      gdpPercap.1962 gdpPercap.1967 gdpPercap.1972 gdpPercap.1977
##   1:      2550.8169      3246.9918       4182.664       4910.417
##   2:      4269.2767      5522.7764       5473.288       3008.647
##   3:       949.4991      1035.8314       1085.797       1029.161
##   4:       983.6540      1214.7093       2263.611       3214.858
##   5:       722.5120       794.8266        854.736        743.387
##  ---                                                            
## 138:     20431.0927     22966.1443      27195.113      26982.291
## 139:      2322.8699      2826.3564       3450.696       4269.122
## 140:     12477.1771     14142.8509      15895.116      17428.748
## 141:     12217.2269     14526.1246      16788.629      18334.198
## 142:     13175.6780     14463.9189      16046.037      16233.718
##      gdpPercap.1982 gdpPercap.1987 gdpPercap.1992 gdpPercap.1997
##   1:      5745.1602      5681.3585      5023.2166       4797.295
##   2:      2756.9537      2430.2083      2627.8457       2277.141
##   3:      1277.8976      1225.8560      1191.2077       1232.975
##   4:      4551.1421      6205.8839      7954.1116       8647.142
##   5:       807.1986       912.0631       931.7528        946.295
##  ---                                                            
## 138:     28397.7151     30281.7046     31871.5303      32135.323
## 139:      4241.3563      5089.0437      5678.3483       6601.430
## 140:     18232.4245     21664.7877     22705.0925      26074.531
## 141:     19477.0093     21888.8890     23424.7668      26997.937
## 142:     17632.4104     19007.1913     18363.3249      21050.414
##      gdpPercap.2002 gdpPercap.2007
##   1:       5288.040       6223.367
##   2:       2773.287       4797.231
##   3:       1372.878       1441.285
##   4:      11003.605      12569.852
##   5:       1037.645       1217.033
##  ---                              
## 138:      34480.958      37506.419
## 139:       6508.086       8458.276
## 140:      29478.999      33203.261
## 141:      30687.755      34435.367
## 142:      23189.801      25185.009

This contains exactly the same data, but the three variables, “gdpPercap”, “pop”, and “lifeExp” have one column for each year of collection. This format is useful if you wish to calculate group-wise properties, e.g. the correlation structure between groups. It is also a more natural way for data collectors to structure data in another application, such as Excel.

Being able to transform between “wide” and “long” formats is an incredible useful skill that will save you a lot of time in the future. We can do this using the reshape2 package.

Combining columns

To collapse multiple columns into one we use the melt function:

library(reshape2)
## 
## Attaching package: 'reshape2'
## 
## The following objects are masked from 'package:data.table':
## 
##     dcast, melt
gapLong <- melt(
  data=gapWide,
  id.vars=c("continent", "country") # All other columns will be collapsed into one
)
## Warning in melt.data.table(data = gapWide, id.vars = c("continent",
## "country")): 'measure.vars' [pop.1952, pop.1957, pop.1962, pop.1967, pop.
## 1972, pop.1977, pop.1982, pop.1987, pop.1992, pop.1997, pop.2002, pop.
## 2007, lifeExp.1952, lifeExp.1957, lifeExp.1962, lifeExp.1967, lifeExp.
## 1972, lifeExp.1977, lifeExp.1982, lifeExp.1987, lifeExp.1992, lifeExp.1997,
## lifeExp.2002, lifeExp.2007, gdpPercap.1952, gdpPercap.1957, gdpPercap.1962,
## gdpPercap.1967, gdpPercap.1972, gdpPercap.1977, gdpPercap.1982, gdpPercap.
## 1987, gdpPercap.1992, gdpPercap.1997, gdpPercap.2002, gdpPercap.2007] are
## not all of the same type. By order of hierarchy, the molten data value
## column will be of type 'double'. All measure variables not of type 'double'
## will be coerced to. Check DETAILS in ?melt.data.table for more on coercion.

We get a warning because the ‘pop’ columns are of type ‘integer’ (i.e. whole numbers) while the ‘gdpPercap’ and ‘lifeExp’ columns are type ‘double’ (i.e. decimal values).

This melt has collapsed the table too far. We need to separate out the different variable types into their own columns. First, lets split the variable column into variable type and year:

gapLong[, c("var", "year") := colsplit(variable, "\\.", c("var", "year"))]

Let’s break this down.

First, colsplit(variable, "\\.", c("var", "year")) creates two columns called “var” and “year” from the “variable” column, splitting each value on the “.”. We need to specific the pattern as \\. because . by itself is a wild card character: it would match every character in the string.

Next, we create two new columns in the gapLong data table using the := operator. To create multiple columns, we specify their names in a vector to the left of the := operator.

Now that we’re happy this has worked, we can delete the old variable column:

gapLong[,variable := NULL]

Finally, to split out the value column into the groups stored in var, we use the dcast function. Since we’d like to keep the result as a data table, we’ll explicitly call the method from the data.table package:

gapLong <- dcast.data.table(
  data=gapLong,  
  # unique identifier columns go to the left of the '~', separated by '+' signs.
  # The grouping column goes to the right of the '~'.
  formula=continent+country+year~var,
  # which column stores the values to be spread over the new columns
  value.var="value" 
)

And now we’re back where we’ve started with the long data!

We’ve shown you the simplest cases for reshaping data. More complex cases may require you to melt or dcast in several steps, and rbind the results.

Other packages you may find useful for this task are tidyr and splitstackshape