library(dplyr)
library(readr)
library(tidyr)
9 Merging and Reshaping Data
Here are all the libraries you should install for this chapter.
9.1 Merge: Built-in Functions
As discussed in chapter 4, rbind()
can be used to append additional observations. If using this approach, it is better to transform the new row(s) into a data frame. This will help avoid silently changing a variable type.
<- read.csv("Data/Gapminder/gapminder_large.csv")
df str(df)
'data.frame': 195 obs. of 21 variables:
$ country : chr "Afghanistan" "Albania" "Algeria" "Andorra" ...
$ gdp_2015 : int 574 4520 4780 42100 3750 13300 10600 3920 55100 47800 ...
$ gini_2015 : num 36.8 29 27.6 40 42.6 40 41.8 31.9 32.3 30.6 ...
$ region : chr "Asia & Pacific" "Europe" "Arab States" "Europe" ...
$ co2_2015 : num 0.262 1.6 3.8 5.97 1.22 5.84 4.64 1.65 16.8 7.7 ...
$ co2_2016 : num 0.245 1.57 3.64 6.07 1.18 5.9 4.6 1.76 17 7.7 ...
$ co2_2017 : num 0.247 1.61 3.56 6.27 1.14 5.89 4.55 1.7 17 7.94 ...
$ co2_2018 : num 0.254 1.59 3.69 6.12 1.12 5.88 4.41 1.89 16.9 7.75 ...
$ cpi_2012 : int 8 33 34 NA 22 NA 35 34 85 69 ...
$ cpi_2013 : int 8 31 36 NA 23 NA 34 36 81 69 ...
$ cpi_2014 : int 12 33 36 NA 19 NA 34 37 80 72 ...
$ cpi_2015 : int 11 36 36 NA 15 NA 32 35 79 76 ...
$ cpi_2016 : int 15 39 34 NA 18 NA 36 33 79 75 ...
$ cpi_2017 : int 15 38 33 NA 19 NA 39 35 77 75 ...
$ lifeexp_2012: num 60.8 77.8 76.8 82.4 61.3 76.7 76 74.7 82.5 81 ...
$ lifeexp_2013: num 61.3 77.9 76.9 82.5 61.9 76.8 76.1 75.2 82.6 81.2 ...
$ lifeexp_2014: num 61.2 77.9 77 82.5 62.8 76.8 76.4 75.3 82.5 81.4 ...
$ lifeexp_2015: num 61.2 78 77.1 82.6 63.3 76.9 76.5 75.3 82.5 81.5 ...
$ lifeexp_2016: num 61.2 78.1 77.4 82.7 63.8 77 76.5 75.4 82.5 81.7 ...
$ lifeexp_2017: num 63.4 78.2 77.7 82.7 64.2 77 76.7 75.6 82.4 81.8 ...
$ lifeexp_2018: num 63.7 78.3 77.9 NA 64.6 77.2 76.8 75.8 82.5 81.9 ...
[1] "COUNTRY" "gdp_2015" "gini_2015" "region" "co2_2015"
[6] "co2_2016" "co2_2017" "co2_2018" "cpi_2012" "cpi_2013"
[11] "cpi_2014" "cpi_2015" "cpi_2016" "cpi_2017" "lifeexp_2012"
[16] "lifeexp_2013" "lifeexp_2014" "lifeexp_2015" "lifeexp_2016" "lifeexp_2017"
[21] "lifeexp_2018"
[1] "COUNTRY" "GDP" "GINI" "region" "co2_2015"
[6] "co2_2016" "co2_2017" "co2_2018" "cpi_2012" "cpi_2013"
[11] "cpi_2014" "cpi_2015" "cpi_2016" "cpi_2017" "lifeexp_2012"
[16] "lifeexp_2013" "lifeexp_2014" "lifeexp_2015" "lifeexp_2016" "lifeexp_2017"
[21] "lifeexp_2018"
<- df[1:98, ]
df1 <- df[99:195, ]
df2 rbind(df1, df2)
An even more robust approach is to use the merge()
function. This allows for the two data frames to have different variables and similar observations. As long as there is at least one variable common to both data frames, they can be merged. Here is a very simple example.
<- df[1:5, c("COUNTRY", "region")]
df1 <- df[1:7, c("COUNTRY", "GDP", "GINI")]
df2 merge(df1, df2, by = "COUNTRY")
COUNTRY region GDP GINI
1 Afghanistan Asia & Pacific 574 36.8
2 Albania Europe 4520 29.0
3 Algeria Arab States 4780 27.6
4 Andorra Europe 42100 40.0
5 Angola Africa 3750 42.6
Note that df2
has 7 observations while df1
only has 5. Yet, the output of the merge has 5 observations. This is because the arguments all.x
and all.y
are set to FALSE
by default. This means that only rows that appear in both are present in the output. If we set all.y = TRUE
, all the rows of df2
are added with missing values for region
.
merge(df1, df2, by = "COUNTRY", all.y = TRUE)
COUNTRY region GDP GINI
1 Afghanistan Asia & Pacific 574 36.8
2 Albania Europe 4520 29.0
3 Algeria Arab States 4780 27.6
4 Andorra Europe 42100 40.0
5 Angola Africa 3750 42.6
6 Antigua and Barbuda <NA> 13300 40.0
7 Argentina <NA> 10600 41.8
If you want to keep all the rows in both data frames, the argument all = TRUE
sets both all.x = TRUE
and all.y = TRUE
.
merge(df1, df2, by = "COUNTRY", all = TRUE)
COUNTRY region GDP GINI
1 Afghanistan Asia & Pacific 574 36.8
2 Albania Europe 4520 29.0
3 Algeria Arab States 4780 27.6
4 Andorra Europe 42100 40.0
5 Angola Africa 3750 42.6
6 Antigua and Barbuda <NA> 13300 40.0
7 Argentina <NA> 10600 41.8
Suppose the variable you are merging on has different names in the two data frames. The arguments by.x
and by.y
allow for you to specify both variables.
names(df1)[1] <- "country"
merge(df1, df2, by.x = "country", by.y = "COUNTRY")
country region GDP GINI
1 Afghanistan Asia & Pacific 574 36.8
2 Albania Europe 4520 29.0
3 Algeria Arab States 4780 27.6
4 Andorra Europe 42100 40.0
5 Angola Africa 3750 42.6
If the two data frames have different variables with the same name, the merge will not combine these columns. This even applies if the columns are different types.
<- df[c("COUNTRY", "region", "GDP")]
df1 $GDP <- as.character(df1$GDP) # GDP is now character in df1
df1merge(df1, df2, by = "COUNTRY")
COUNTRY region GDP.x GDP.y GINI
1 Afghanistan Asia & Pacific 574 574 36.8
2 Albania Europe 4520 4520 29.0
3 Algeria Arab States 4780 4780 27.6
4 Andorra Europe 42100 42100 40.0
5 Angola Africa 3750 3750 42.6
6 Antigua and Barbuda South/Latin America 13300 13300 40.0
7 Argentina South/Latin America 10600 10600 41.8
9.2 Merge: tidyverse functions
Merging data frames is useful when there are several data frames with similar observations but different variables. To demonstrate the join functions in dplyr
, we have two datasets. One is the population of all countries and the other is the population of all countries that begin with “A.” Neither of these datasets have duplicates.
<- read_csv("Data/Gapminder/population.csv") pop
Rows: 195 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): country
dbl (1): population
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
<- read_csv("Data/Gapminder/population_A.csv") popA
Rows: 11 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): country
dbl (1): population
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
<- read_csv("Data/Gapminder/gapminder.csv") tib1
Rows: 197 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): country, region
dbl (2): gdp, gini
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str((tib1))
spc_tbl_ [197 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ country: chr [1:197] "Afghanistan" "Albania" "Algeria" "Andorra" ...
$ gdp : num [1:197] 574 4520 4780 42100 3750 13300 10600 3920 55100 47800 ...
$ gini : num [1:197] 36.8 29 27.6 40 42.6 40 41.8 31.9 32.3 30.6 ...
$ region : chr [1:197] "Asia & Pacific" "Europe" "Arab States" "Europe" ...
- attr(*, "spec")=
.. cols(
.. country = col_character(),
.. gdp = col_double(),
.. gini = col_double(),
.. region = col_character()
.. )
- attr(*, "problems")=<externalptr>
The different join functions relate to which observations are kept. In full_join()
, all observations in the two data frames are kept, even if there are unmatched observations. The argument by
indicates which variable on which to match.
full_join(tib1, pop, by = "country")
# A tibble: 197 × 5
country gdp gini region population
<chr> <dbl> <dbl> <chr> <dbl>
1 Afghanistan 574 36.8 Asia & Pacific 34400000
2 Albania 4520 29 Europe 2890000
3 Algeria 4780 27.6 Arab States 39700000
4 Andorra 42100 40 Europe 78000
5 Angola 3750 42.6 Africa 27900000
6 Antigua and Barbuda 13300 40 South/Latin America 93600
7 Argentina 10600 41.8 South/Latin America 43100000
8 Armenia 3920 31.9 Europe 2930000
9 Australia 55100 32.3 Asia & Pacific 23900000
10 Austria 47800 30.6 Europe 8680000
# ℹ 187 more rows
The function inner_join()
only keeps observations that are present in both data frames. In this case, that is only countries that begin with “A.”
inner_join(tib1, popA, by = "country")
# A tibble: 11 × 5
country gdp gini region population
<chr> <dbl> <dbl> <chr> <dbl>
1 Afghanistan 574 36.8 Asia & Pacific 34400000
2 Albania 4520 29 Europe 2890000
3 Algeria 4780 27.6 Arab States 39700000
4 Andorra 42100 40 Europe 78000
5 Angola 3750 42.6 Africa 27900000
6 Antigua and Barbuda 13300 40 South/Latin America 93600
7 Argentina 10600 41.8 South/Latin America 43100000
8 Armenia 3920 31.9 Europe 2930000
9 Australia 55100 32.3 Asia & Pacific 23900000
10 Austria 47800 30.6 Europe 8680000
11 Azerbaijan 6060 32.4 Asia & Pacific 9620000
The function left_join()
only keeps from the data frame in the left argument (tib1
in this case).
left_join(tib1, popA, by = "country")
# A tibble: 197 × 5
country gdp gini region population
<chr> <dbl> <dbl> <chr> <dbl>
1 Afghanistan 574 36.8 Asia & Pacific 34400000
2 Albania 4520 29 Europe 2890000
3 Algeria 4780 27.6 Arab States 39700000
4 Andorra 42100 40 Europe 78000
5 Angola 3750 42.6 Africa 27900000
6 Antigua and Barbuda 13300 40 South/Latin America 93600
7 Argentina 10600 41.8 South/Latin America 43100000
8 Armenia 3920 31.9 Europe 2930000
9 Australia 55100 32.3 Asia & Pacific 23900000
10 Austria 47800 30.6 Europe 8680000
# ℹ 187 more rows
The function right_join()
is the same except it only keeps the observations from the data frame in the right argument.
right_join(tib1, popA, by = "country")
# A tibble: 11 × 5
country gdp gini region population
<chr> <dbl> <dbl> <chr> <dbl>
1 Afghanistan 574 36.8 Asia & Pacific 34400000
2 Albania 4520 29 Europe 2890000
3 Algeria 4780 27.6 Arab States 39700000
4 Andorra 42100 40 Europe 78000
5 Angola 3750 42.6 Africa 27900000
6 Antigua and Barbuda 13300 40 South/Latin America 93600
7 Argentina 10600 41.8 South/Latin America 43100000
8 Armenia 3920 31.9 Europe 2930000
9 Australia 55100 32.3 Asia & Pacific 23900000
10 Austria 47800 30.6 Europe 8680000
11 Azerbaijan 6060 32.4 Asia & Pacific 9620000
The function semi_join()
keeps all rows in tib1
that have a match in popA
.
semi_join(tib1, popA, by = "country")
# A tibble: 11 × 4
country gdp gini region
<chr> <dbl> <dbl> <chr>
1 Afghanistan 574 36.8 Asia & Pacific
2 Albania 4520 29 Europe
3 Algeria 4780 27.6 Arab States
4 Andorra 42100 40 Europe
5 Angola 3750 42.6 Africa
6 Antigua and Barbuda 13300 40 South/Latin America
7 Argentina 10600 41.8 South/Latin America
8 Armenia 3920 31.9 Europe
9 Australia 55100 32.3 Asia & Pacific
10 Austria 47800 30.6 Europe
11 Azerbaijan 6060 32.4 Asia & Pacific
The function anti_join()
keeps all rows in tib1
that do not have a match in popA
.
anti_join(tib1, popA, by = "country")
# A tibble: 186 × 4
country gdp gini region
<chr> <dbl> <dbl> <chr>
1 Bahamas 27500 43.7 South/Latin America
2 Bahrain 22400 40 Arab States
3 Bangladesh 1000 32.3 Asia & Pacific
4 Barbados 15800 43.8 South/Latin America
5 Belarus 6380 26.9 Europe
6 Belgium 45500 27.8 Europe
7 Belize 4300 53.3 South/Latin America
8 Benin 1130 46.9 Africa
9 Bhutan 2780 38 Asia & Pacific
10 Bolivia 2360 46.3 South/Latin America
# ℹ 176 more rows
9.3 Reshape: tidyr
The tidyr
package provides an efficient way to reshape and reformat data.
<- read_csv("Data/Gapminder/gapminder_large.csv") tib1
Rows: 195 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): country, region
dbl (19): gdp_2015, gini_2015, co2_2015, co2_2016, co2_2017, co2_2018, cpi_2...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(tib1)
# A tibble: 6 × 21
country gdp_2015 gini_2015 region co2_2015 co2_2016 co2_2017 co2_2018 cpi_2012
<chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghan… 574 36.8 Asia … 0.262 0.245 0.247 0.254 8
2 Albania 4520 29 Europe 1.6 1.57 1.61 1.59 33
3 Algeria 4780 27.6 Arab … 3.8 3.64 3.56 3.69 34
4 Andorra 42100 40 Europe 5.97 6.07 6.27 6.12 NA
5 Angola 3750 42.6 Africa 1.22 1.18 1.14 1.12 22
6 Antigu… 13300 40 South… 5.84 5.9 5.89 5.88 NA
# ℹ 12 more variables: cpi_2013 <dbl>, cpi_2014 <dbl>, cpi_2015 <dbl>,
# cpi_2016 <dbl>, cpi_2017 <dbl>, lifeexp_2012 <dbl>, lifeexp_2013 <dbl>,
# lifeexp_2014 <dbl>, lifeexp_2015 <dbl>, lifeexp_2016 <dbl>,
# lifeexp_2017 <dbl>, lifeexp_2018 <dbl>
Wide data have one row per unit while long data have more than one row per unit. To convert wide data to long, use pivot_longer()
. There are several different ways to get the same output.
# Select columns using tidy-select
# Dictate pattern with names_sep
<- pivot_longer(tib1,
long_tib1 contains("_"),
names_to = c("var", "year"),
names_sep = "_")
# Select columns using column indices
pivot_longer(tib1,
5:21,
names_to = c("var", "year"),
names_sep = "_")
# A tibble: 3,315 × 7
country gdp_2015 gini_2015 region var year value
<chr> <dbl> <dbl> <chr> <chr> <chr> <dbl>
1 Afghanistan 574 36.8 Asia & Pacific co2 2015 0.262
2 Afghanistan 574 36.8 Asia & Pacific co2 2016 0.245
3 Afghanistan 574 36.8 Asia & Pacific co2 2017 0.247
4 Afghanistan 574 36.8 Asia & Pacific co2 2018 0.254
5 Afghanistan 574 36.8 Asia & Pacific cpi 2012 8
6 Afghanistan 574 36.8 Asia & Pacific cpi 2013 8
7 Afghanistan 574 36.8 Asia & Pacific cpi 2014 12
8 Afghanistan 574 36.8 Asia & Pacific cpi 2015 11
9 Afghanistan 574 36.8 Asia & Pacific cpi 2016 15
10 Afghanistan 574 36.8 Asia & Pacific cpi 2017 15
# ℹ 3,305 more rows
# Dictate pattern with names_pattern
pivot_longer(tib1,
5:21,
names_to = c("var", "year"),
names_pattern = "(.*)_(.*)")
# A tibble: 3,315 × 7
country gdp_2015 gini_2015 region var year value
<chr> <dbl> <dbl> <chr> <chr> <chr> <dbl>
1 Afghanistan 574 36.8 Asia & Pacific co2 2015 0.262
2 Afghanistan 574 36.8 Asia & Pacific co2 2016 0.245
3 Afghanistan 574 36.8 Asia & Pacific co2 2017 0.247
4 Afghanistan 574 36.8 Asia & Pacific co2 2018 0.254
5 Afghanistan 574 36.8 Asia & Pacific cpi 2012 8
6 Afghanistan 574 36.8 Asia & Pacific cpi 2013 8
7 Afghanistan 574 36.8 Asia & Pacific cpi 2014 12
8 Afghanistan 574 36.8 Asia & Pacific cpi 2015 11
9 Afghanistan 574 36.8 Asia & Pacific cpi 2016 15
10 Afghanistan 574 36.8 Asia & Pacific cpi 2017 15
# ℹ 3,305 more rows
To go from long data to wide, use pivot_wider
. There are several options to dictate the names of the newly created variables.
<- pivot_wider(long_tib1,
wide_tib1 names_from = c("var", "year"),
values_from = "value")
head(wide_tib1)
# A tibble: 6 × 21
country region gdp_2015 gini_2015 co2_2015 co2_2016 co2_2017 co2_2018 cpi_2012
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghan… Asia … 574 36.8 0.262 0.245 0.247 0.254 8
2 Albania Europe 4520 29 1.6 1.57 1.61 1.59 33
3 Algeria Arab … 4780 27.6 3.8 3.64 3.56 3.69 34
4 Andorra Europe 42100 40 5.97 6.07 6.27 6.12 NA
5 Angola Africa 3750 42.6 1.22 1.18 1.14 1.12 22
6 Antigu… South… 13300 40 5.84 5.9 5.89 5.88 NA
# ℹ 12 more variables: cpi_2013 <dbl>, cpi_2014 <dbl>, cpi_2015 <dbl>,
# cpi_2016 <dbl>, cpi_2017 <dbl>, lifeexp_2012 <dbl>, lifeexp_2013 <dbl>,
# lifeexp_2014 <dbl>, lifeexp_2015 <dbl>, lifeexp_2016 <dbl>,
# lifeexp_2017 <dbl>, lifeexp_2018 <dbl>
pivot_wider(long_tib1,
names_from = c("var", "year"),
values_from = "value",
names_sep = ".")
# A tibble: 195 × 21
country region gdp.2015 gini.2015 co2.2015 co2.2016 co2.2017 co2.2018
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan Asia … 574 36.8 0.262 0.245 0.247 0.254
2 Albania Europe 4520 29 1.6 1.57 1.61 1.59
3 Algeria Arab … 4780 27.6 3.8 3.64 3.56 3.69
4 Andorra Europe 42100 40 5.97 6.07 6.27 6.12
5 Angola Africa 3750 42.6 1.22 1.18 1.14 1.12
6 Antigua and Ba… South… 13300 40 5.84 5.9 5.89 5.88
7 Argentina South… 10600 41.8 4.64 4.6 4.55 4.41
8 Armenia Europe 3920 31.9 1.65 1.76 1.7 1.89
9 Australia Asia … 55100 32.3 16.8 17 17 16.9
10 Austria Europe 47800 30.6 7.7 7.7 7.94 7.75
# ℹ 185 more rows
# ℹ 13 more variables: cpi.2012 <dbl>, cpi.2013 <dbl>, cpi.2014 <dbl>,
# cpi.2015 <dbl>, cpi.2016 <dbl>, cpi.2017 <dbl>, lifeexp.2012 <dbl>,
# lifeexp.2013 <dbl>, lifeexp.2014 <dbl>, lifeexp.2015 <dbl>,
# lifeexp.2016 <dbl>, lifeexp.2017 <dbl>, lifeexp.2018 <dbl>
It might be useful to reference this chapter on strings and regular expressions. There are many ways to represent different patterns in character strings, and a standardized approach exists to minimize the need to type out everything explicitly.
9.4 Further Reading
The above information comes from chapters 5.1-5.3, 6, and 21 of Boehmke (2016), chapters 2.2.5 and 3 of Zamora Saiz et al. (2020).