9  Merging and Reshaping Data

Here are all the libraries you should install for this chapter.

library(dplyr)
library(readr)
library(tidyr)

9.1 Merge: Built-in Functions

As discussed in chapter 4, rbind() can be used to append additional observations. If using this approach, it is better to transform the new row(s) into a data frame. This will help avoid silently changing a variable type.

df <- read.csv("Data/Gapminder/gapminder_large.csv")
str(df)
'data.frame':   195 obs. of  21 variables:
 $ country     : chr  "Afghanistan" "Albania" "Algeria" "Andorra" ...
 $ gdp_2015    : int  574 4520 4780 42100 3750 13300 10600 3920 55100 47800 ...
 $ gini_2015   : num  36.8 29 27.6 40 42.6 40 41.8 31.9 32.3 30.6 ...
 $ region      : chr  "Asia & Pacific" "Europe" "Arab States" "Europe" ...
 $ co2_2015    : num  0.262 1.6 3.8 5.97 1.22 5.84 4.64 1.65 16.8 7.7 ...
 $ co2_2016    : num  0.245 1.57 3.64 6.07 1.18 5.9 4.6 1.76 17 7.7 ...
 $ co2_2017    : num  0.247 1.61 3.56 6.27 1.14 5.89 4.55 1.7 17 7.94 ...
 $ co2_2018    : num  0.254 1.59 3.69 6.12 1.12 5.88 4.41 1.89 16.9 7.75 ...
 $ cpi_2012    : int  8 33 34 NA 22 NA 35 34 85 69 ...
 $ cpi_2013    : int  8 31 36 NA 23 NA 34 36 81 69 ...
 $ cpi_2014    : int  12 33 36 NA 19 NA 34 37 80 72 ...
 $ cpi_2015    : int  11 36 36 NA 15 NA 32 35 79 76 ...
 $ cpi_2016    : int  15 39 34 NA 18 NA 36 33 79 75 ...
 $ cpi_2017    : int  15 38 33 NA 19 NA 39 35 77 75 ...
 $ lifeexp_2012: num  60.8 77.8 76.8 82.4 61.3 76.7 76 74.7 82.5 81 ...
 $ lifeexp_2013: num  61.3 77.9 76.9 82.5 61.9 76.8 76.1 75.2 82.6 81.2 ...
 $ lifeexp_2014: num  61.2 77.9 77 82.5 62.8 76.8 76.4 75.3 82.5 81.4 ...
 $ lifeexp_2015: num  61.2 78 77.1 82.6 63.3 76.9 76.5 75.3 82.5 81.5 ...
 $ lifeexp_2016: num  61.2 78.1 77.4 82.7 63.8 77 76.5 75.4 82.5 81.7 ...
 $ lifeexp_2017: num  63.4 78.2 77.7 82.7 64.2 77 76.7 75.6 82.4 81.8 ...
 $ lifeexp_2018: num  63.7 78.3 77.9 NA 64.6 77.2 76.8 75.8 82.5 81.9 ...
 [1] "COUNTRY"      "gdp_2015"     "gini_2015"    "region"       "co2_2015"    
 [6] "co2_2016"     "co2_2017"     "co2_2018"     "cpi_2012"     "cpi_2013"    
[11] "cpi_2014"     "cpi_2015"     "cpi_2016"     "cpi_2017"     "lifeexp_2012"
[16] "lifeexp_2013" "lifeexp_2014" "lifeexp_2015" "lifeexp_2016" "lifeexp_2017"
[21] "lifeexp_2018"
 [1] "COUNTRY"      "GDP"          "GINI"         "region"       "co2_2015"    
 [6] "co2_2016"     "co2_2017"     "co2_2018"     "cpi_2012"     "cpi_2013"    
[11] "cpi_2014"     "cpi_2015"     "cpi_2016"     "cpi_2017"     "lifeexp_2012"
[16] "lifeexp_2013" "lifeexp_2014" "lifeexp_2015" "lifeexp_2016" "lifeexp_2017"
[21] "lifeexp_2018"
df1 <- df[1:98, ]
df2 <- df[99:195, ]
rbind(df1, df2)

An even more robust approach is to use the merge() function. This allows for the two data frames to have different variables and similar observations. As long as there is at least one variable common to both data frames, they can be merged. Here is a very simple example.

df1 <- df[1:5, c("COUNTRY", "region")]
df2 <- df[1:7, c("COUNTRY", "GDP", "GINI")]
merge(df1, df2, by = "COUNTRY")
      COUNTRY         region   GDP GINI
1 Afghanistan Asia & Pacific   574 36.8
2     Albania         Europe  4520 29.0
3     Algeria    Arab States  4780 27.6
4     Andorra         Europe 42100 40.0
5      Angola         Africa  3750 42.6

Note that df2 has 7 observations while df1 only has 5. Yet, the output of the merge has 5 observations. This is because the arguments all.x and all.y are set to FALSE by default. This means that only rows that appear in both are present in the output. If we set all.y = TRUE, all the rows of df2 are added with missing values for region.

merge(df1, df2, by = "COUNTRY", all.y = TRUE)
              COUNTRY         region   GDP GINI
1         Afghanistan Asia & Pacific   574 36.8
2             Albania         Europe  4520 29.0
3             Algeria    Arab States  4780 27.6
4             Andorra         Europe 42100 40.0
5              Angola         Africa  3750 42.6
6 Antigua and Barbuda           <NA> 13300 40.0
7           Argentina           <NA> 10600 41.8

If you want to keep all the rows in both data frames, the argument all = TRUE sets both all.x = TRUE and all.y = TRUE.

merge(df1, df2, by = "COUNTRY", all = TRUE)
              COUNTRY         region   GDP GINI
1         Afghanistan Asia & Pacific   574 36.8
2             Albania         Europe  4520 29.0
3             Algeria    Arab States  4780 27.6
4             Andorra         Europe 42100 40.0
5              Angola         Africa  3750 42.6
6 Antigua and Barbuda           <NA> 13300 40.0
7           Argentina           <NA> 10600 41.8

Suppose the variable you are merging on has different names in the two data frames. The arguments by.x and by.y allow for you to specify both variables.

names(df1)[1] <- "country"
merge(df1, df2, by.x = "country", by.y = "COUNTRY")
      country         region   GDP GINI
1 Afghanistan Asia & Pacific   574 36.8
2     Albania         Europe  4520 29.0
3     Algeria    Arab States  4780 27.6
4     Andorra         Europe 42100 40.0
5      Angola         Africa  3750 42.6

If the two data frames have different variables with the same name, the merge will not combine these columns. This even applies if the columns are different types.

df1 <- df[c("COUNTRY", "region", "GDP")]
df1$GDP <- as.character(df1$GDP) # GDP is now character in df1
merge(df1, df2, by = "COUNTRY")
              COUNTRY              region GDP.x GDP.y GINI
1         Afghanistan      Asia & Pacific   574   574 36.8
2             Albania              Europe  4520  4520 29.0
3             Algeria         Arab States  4780  4780 27.6
4             Andorra              Europe 42100 42100 40.0
5              Angola              Africa  3750  3750 42.6
6 Antigua and Barbuda South/Latin America 13300 13300 40.0
7           Argentina South/Latin America 10600 10600 41.8

9.2 Merge: tidyverse functions

Merging data frames is useful when there are several data frames with similar observations but different variables. To demonstrate the join functions in dplyr, we have two datasets. One is the population of all countries and the other is the population of all countries that begin with “A.” Neither of these datasets have duplicates.

pop <- read_csv("Data/Gapminder/population.csv")
Rows: 195 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): country
dbl (1): population

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
popA <- read_csv("Data/Gapminder/population_A.csv")
Rows: 11 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): country
dbl (1): population

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
tib1 <- read_csv("Data/Gapminder/gapminder.csv")
Rows: 197 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): country, region
dbl (2): gdp, gini

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str((tib1))
spc_tbl_ [197 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ country: chr [1:197] "Afghanistan" "Albania" "Algeria" "Andorra" ...
 $ gdp    : num [1:197] 574 4520 4780 42100 3750 13300 10600 3920 55100 47800 ...
 $ gini   : num [1:197] 36.8 29 27.6 40 42.6 40 41.8 31.9 32.3 30.6 ...
 $ region : chr [1:197] "Asia & Pacific" "Europe" "Arab States" "Europe" ...
 - attr(*, "spec")=
  .. cols(
  ..   country = col_character(),
  ..   gdp = col_double(),
  ..   gini = col_double(),
  ..   region = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 

The different join functions relate to which observations are kept. In full_join(), all observations in the two data frames are kept, even if there are unmatched observations. The argument by indicates which variable on which to match.

full_join(tib1, pop, by = "country")
# A tibble: 197 × 5
   country               gdp  gini region              population
   <chr>               <dbl> <dbl> <chr>                    <dbl>
 1 Afghanistan           574  36.8 Asia & Pacific        34400000
 2 Albania              4520  29   Europe                 2890000
 3 Algeria              4780  27.6 Arab States           39700000
 4 Andorra             42100  40   Europe                   78000
 5 Angola               3750  42.6 Africa                27900000
 6 Antigua and Barbuda 13300  40   South/Latin America      93600
 7 Argentina           10600  41.8 South/Latin America   43100000
 8 Armenia              3920  31.9 Europe                 2930000
 9 Australia           55100  32.3 Asia & Pacific        23900000
10 Austria             47800  30.6 Europe                 8680000
# ℹ 187 more rows

The function inner_join() only keeps observations that are present in both data frames. In this case, that is only countries that begin with “A.”

inner_join(tib1, popA, by = "country")
# A tibble: 11 × 5
   country               gdp  gini region              population
   <chr>               <dbl> <dbl> <chr>                    <dbl>
 1 Afghanistan           574  36.8 Asia & Pacific        34400000
 2 Albania              4520  29   Europe                 2890000
 3 Algeria              4780  27.6 Arab States           39700000
 4 Andorra             42100  40   Europe                   78000
 5 Angola               3750  42.6 Africa                27900000
 6 Antigua and Barbuda 13300  40   South/Latin America      93600
 7 Argentina           10600  41.8 South/Latin America   43100000
 8 Armenia              3920  31.9 Europe                 2930000
 9 Australia           55100  32.3 Asia & Pacific        23900000
10 Austria             47800  30.6 Europe                 8680000
11 Azerbaijan           6060  32.4 Asia & Pacific         9620000

The function left_join() only keeps from the data frame in the left argument (tib1 in this case).

left_join(tib1, popA, by = "country")
# A tibble: 197 × 5
   country               gdp  gini region              population
   <chr>               <dbl> <dbl> <chr>                    <dbl>
 1 Afghanistan           574  36.8 Asia & Pacific        34400000
 2 Albania              4520  29   Europe                 2890000
 3 Algeria              4780  27.6 Arab States           39700000
 4 Andorra             42100  40   Europe                   78000
 5 Angola               3750  42.6 Africa                27900000
 6 Antigua and Barbuda 13300  40   South/Latin America      93600
 7 Argentina           10600  41.8 South/Latin America   43100000
 8 Armenia              3920  31.9 Europe                 2930000
 9 Australia           55100  32.3 Asia & Pacific        23900000
10 Austria             47800  30.6 Europe                 8680000
# ℹ 187 more rows

The function right_join() is the same except it only keeps the observations from the data frame in the right argument.

right_join(tib1, popA, by = "country")
# A tibble: 11 × 5
   country               gdp  gini region              population
   <chr>               <dbl> <dbl> <chr>                    <dbl>
 1 Afghanistan           574  36.8 Asia & Pacific        34400000
 2 Albania              4520  29   Europe                 2890000
 3 Algeria              4780  27.6 Arab States           39700000
 4 Andorra             42100  40   Europe                   78000
 5 Angola               3750  42.6 Africa                27900000
 6 Antigua and Barbuda 13300  40   South/Latin America      93600
 7 Argentina           10600  41.8 South/Latin America   43100000
 8 Armenia              3920  31.9 Europe                 2930000
 9 Australia           55100  32.3 Asia & Pacific        23900000
10 Austria             47800  30.6 Europe                 8680000
11 Azerbaijan           6060  32.4 Asia & Pacific         9620000

The function semi_join() keeps all rows in tib1 that have a match in popA.

semi_join(tib1, popA, by = "country")
# A tibble: 11 × 4
   country               gdp  gini region             
   <chr>               <dbl> <dbl> <chr>              
 1 Afghanistan           574  36.8 Asia & Pacific     
 2 Albania              4520  29   Europe             
 3 Algeria              4780  27.6 Arab States        
 4 Andorra             42100  40   Europe             
 5 Angola               3750  42.6 Africa             
 6 Antigua and Barbuda 13300  40   South/Latin America
 7 Argentina           10600  41.8 South/Latin America
 8 Armenia              3920  31.9 Europe             
 9 Australia           55100  32.3 Asia & Pacific     
10 Austria             47800  30.6 Europe             
11 Azerbaijan           6060  32.4 Asia & Pacific     

The function anti_join() keeps all rows in tib1 that do not have a match in popA.

anti_join(tib1, popA, by = "country")
# A tibble: 186 × 4
   country      gdp  gini region             
   <chr>      <dbl> <dbl> <chr>              
 1 Bahamas    27500  43.7 South/Latin America
 2 Bahrain    22400  40   Arab States        
 3 Bangladesh  1000  32.3 Asia & Pacific     
 4 Barbados   15800  43.8 South/Latin America
 5 Belarus     6380  26.9 Europe             
 6 Belgium    45500  27.8 Europe             
 7 Belize      4300  53.3 South/Latin America
 8 Benin       1130  46.9 Africa             
 9 Bhutan      2780  38   Asia & Pacific     
10 Bolivia     2360  46.3 South/Latin America
# ℹ 176 more rows

9.3 Reshape: tidyr

The tidyr package provides an efficient way to reshape and reformat data.

tib1 <- read_csv("Data/Gapminder/gapminder_large.csv")
Rows: 195 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): country, region
dbl (19): gdp_2015, gini_2015, co2_2015, co2_2016, co2_2017, co2_2018, cpi_2...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(tib1)
# A tibble: 6 × 21
  country gdp_2015 gini_2015 region co2_2015 co2_2016 co2_2017 co2_2018 cpi_2012
  <chr>      <dbl>     <dbl> <chr>     <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
1 Afghan…      574      36.8 Asia …    0.262    0.245    0.247    0.254        8
2 Albania     4520      29   Europe    1.6      1.57     1.61     1.59        33
3 Algeria     4780      27.6 Arab …    3.8      3.64     3.56     3.69        34
4 Andorra    42100      40   Europe    5.97     6.07     6.27     6.12        NA
5 Angola      3750      42.6 Africa    1.22     1.18     1.14     1.12        22
6 Antigu…    13300      40   South…    5.84     5.9      5.89     5.88        NA
# ℹ 12 more variables: cpi_2013 <dbl>, cpi_2014 <dbl>, cpi_2015 <dbl>,
#   cpi_2016 <dbl>, cpi_2017 <dbl>, lifeexp_2012 <dbl>, lifeexp_2013 <dbl>,
#   lifeexp_2014 <dbl>, lifeexp_2015 <dbl>, lifeexp_2016 <dbl>,
#   lifeexp_2017 <dbl>, lifeexp_2018 <dbl>

Wide data have one row per unit while long data have more than one row per unit. To convert wide data to long, use pivot_longer(). There are several different ways to get the same output.

# Select columns using tidy-select
# Dictate pattern with names_sep
long_tib1 <- pivot_longer(tib1,
                       contains("_"),
                       names_to = c("var", "year"),
                       names_sep = "_")

# Select columns using column indices
pivot_longer(tib1,
             5:21,
             names_to = c("var", "year"),
             names_sep = "_")
# A tibble: 3,315 × 7
   country     gdp_2015 gini_2015 region         var   year   value
   <chr>          <dbl>     <dbl> <chr>          <chr> <chr>  <dbl>
 1 Afghanistan      574      36.8 Asia & Pacific co2   2015   0.262
 2 Afghanistan      574      36.8 Asia & Pacific co2   2016   0.245
 3 Afghanistan      574      36.8 Asia & Pacific co2   2017   0.247
 4 Afghanistan      574      36.8 Asia & Pacific co2   2018   0.254
 5 Afghanistan      574      36.8 Asia & Pacific cpi   2012   8    
 6 Afghanistan      574      36.8 Asia & Pacific cpi   2013   8    
 7 Afghanistan      574      36.8 Asia & Pacific cpi   2014  12    
 8 Afghanistan      574      36.8 Asia & Pacific cpi   2015  11    
 9 Afghanistan      574      36.8 Asia & Pacific cpi   2016  15    
10 Afghanistan      574      36.8 Asia & Pacific cpi   2017  15    
# ℹ 3,305 more rows
# Dictate pattern with names_pattern
pivot_longer(tib1,
             5:21,
             names_to = c("var", "year"),
             names_pattern = "(.*)_(.*)")
# A tibble: 3,315 × 7
   country     gdp_2015 gini_2015 region         var   year   value
   <chr>          <dbl>     <dbl> <chr>          <chr> <chr>  <dbl>
 1 Afghanistan      574      36.8 Asia & Pacific co2   2015   0.262
 2 Afghanistan      574      36.8 Asia & Pacific co2   2016   0.245
 3 Afghanistan      574      36.8 Asia & Pacific co2   2017   0.247
 4 Afghanistan      574      36.8 Asia & Pacific co2   2018   0.254
 5 Afghanistan      574      36.8 Asia & Pacific cpi   2012   8    
 6 Afghanistan      574      36.8 Asia & Pacific cpi   2013   8    
 7 Afghanistan      574      36.8 Asia & Pacific cpi   2014  12    
 8 Afghanistan      574      36.8 Asia & Pacific cpi   2015  11    
 9 Afghanistan      574      36.8 Asia & Pacific cpi   2016  15    
10 Afghanistan      574      36.8 Asia & Pacific cpi   2017  15    
# ℹ 3,305 more rows

To go from long data to wide, use pivot_wider. There are several options to dictate the names of the newly created variables.

wide_tib1 <- pivot_wider(long_tib1,
                      names_from = c("var", "year"),
                      values_from = "value")
head(wide_tib1)
# A tibble: 6 × 21
  country region gdp_2015 gini_2015 co2_2015 co2_2016 co2_2017 co2_2018 cpi_2012
  <chr>   <chr>     <dbl>     <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
1 Afghan… Asia …      574      36.8    0.262    0.245    0.247    0.254        8
2 Albania Europe     4520      29      1.6      1.57     1.61     1.59        33
3 Algeria Arab …     4780      27.6    3.8      3.64     3.56     3.69        34
4 Andorra Europe    42100      40      5.97     6.07     6.27     6.12        NA
5 Angola  Africa     3750      42.6    1.22     1.18     1.14     1.12        22
6 Antigu… South…    13300      40      5.84     5.9      5.89     5.88        NA
# ℹ 12 more variables: cpi_2013 <dbl>, cpi_2014 <dbl>, cpi_2015 <dbl>,
#   cpi_2016 <dbl>, cpi_2017 <dbl>, lifeexp_2012 <dbl>, lifeexp_2013 <dbl>,
#   lifeexp_2014 <dbl>, lifeexp_2015 <dbl>, lifeexp_2016 <dbl>,
#   lifeexp_2017 <dbl>, lifeexp_2018 <dbl>
pivot_wider(long_tib1,
            names_from = c("var", "year"),
            values_from = "value",
            names_sep = ".")
# A tibble: 195 × 21
   country         region gdp.2015 gini.2015 co2.2015 co2.2016 co2.2017 co2.2018
   <chr>           <chr>     <dbl>     <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
 1 Afghanistan     Asia …      574      36.8    0.262    0.245    0.247    0.254
 2 Albania         Europe     4520      29      1.6      1.57     1.61     1.59 
 3 Algeria         Arab …     4780      27.6    3.8      3.64     3.56     3.69 
 4 Andorra         Europe    42100      40      5.97     6.07     6.27     6.12 
 5 Angola          Africa     3750      42.6    1.22     1.18     1.14     1.12 
 6 Antigua and Ba… South…    13300      40      5.84     5.9      5.89     5.88 
 7 Argentina       South…    10600      41.8    4.64     4.6      4.55     4.41 
 8 Armenia         Europe     3920      31.9    1.65     1.76     1.7      1.89 
 9 Australia       Asia …    55100      32.3   16.8     17       17       16.9  
10 Austria         Europe    47800      30.6    7.7      7.7      7.94     7.75 
# ℹ 185 more rows
# ℹ 13 more variables: cpi.2012 <dbl>, cpi.2013 <dbl>, cpi.2014 <dbl>,
#   cpi.2015 <dbl>, cpi.2016 <dbl>, cpi.2017 <dbl>, lifeexp.2012 <dbl>,
#   lifeexp.2013 <dbl>, lifeexp.2014 <dbl>, lifeexp.2015 <dbl>,
#   lifeexp.2016 <dbl>, lifeexp.2017 <dbl>, lifeexp.2018 <dbl>

It might be useful to reference this chapter on strings and regular expressions. There are many ways to represent different patterns in character strings, and a standardized approach exists to minimize the need to type out everything explicitly.

9.4 Further Reading

The above information comes from chapters 5.1-5.3, 6, and 21 of Boehmke (2016), chapters 2.2.5 and 3 of Zamora Saiz et al. (2020).

9.4.1 References

Boehmke, Bradley C. 2016. Data Wrangling with R. Use R! Springer. https://link.springer.com/book/10.1007/978-3-319-45599-0.
Zamora Saiz, Alfonso, Carlos Quesada González, Lluís Hurtado Gil, and Diego Mondéjar Ruiz. 2020. An Introduction to Data Analysis in R: Hands-on Coding, Data Mining, Visualization and Statistics from Scratch. https://link.springer.com/book/10.1007/978-3-030-48997-7.