5 Joining datasets
We will discuss here how to join two or more datasets.
Libraries we will need:
library(tidyverse)
library(eurostat)
library(scales)
library(lubridate)
library(ggpubr)
library(ggthemes)
library(hrbrthemes)
library(gfonts)
5.1 Inner join
Let us perform a new task: download two datasets and join them into one. We will use here a simple case of GDP growth and unemployment rate (annual data).
Here is annual GDP growth:
And here is annual unemployment rate:
We will subset the dataset to a single country, let’s use Germany this time. So here is Germany’s annual GDP growth:
gdp <- tec00115 %>%
filter(unit == 'CLV_PCH_PRE') %>%
filter(na_item == 'B1GQ') %>%
filter(geo == 'DE') %>%
select(time, values)
So here is Germany’s annual unemployment rate:
une <- une_rt_a %>%
filter(age == 'Y15-74') %>%
filter(unit == 'PC_ACT') %>%
filter(sex == 'T') %>%
filter(geo == 'DE') %>%
select(time, values)
In order to join the two datasets we can simply use dplyr’s inner_join function:
## Joining, by = c("time", "values")
## # A tibble: 0 × 2
## # … with 2 variables: time <dbl>, values <dbl>
Yep, zero rows in result because of wrong join columns. The two datasets share two common columns:
## tibble [12 × 2] (S3: tbl_df/tbl/data.frame)
## $ time : num [1:12] 2009 2010 2011 2012 2013 ...
## $ values: num [1:12] -5.7 4.2 3.9 0.4 0.4 2.2 1.5 2.2 2.7 1.1 ...
## tibble [12 × 2] (S3: tbl_df/tbl/data.frame)
## $ time : num [1:12] 2020 2019 2018 2017 2016 ...
## $ values: num [1:12] 3.8 3.1 3.4 3.8 4.1 4.6 5 5.2 5.4 5.8 ...
but values are different, for example last three rows:
## # A tibble: 3 × 2
## time values
## <dbl> <dbl>
## 1 2018 1.1
## 2 2019 1.1
## 3 2020 -4.6
## # A tibble: 3 × 2
## time values
## <dbl> <dbl>
## 1 2011 5.8
## 2 2010 7
## 3 2009 7.8
A typical rule must be underlined here: it is a very good idea always to define the joining columns and not let the software to decide automatically about it. So:
## # A tibble: 12 × 3
## time values.x values.y
## <dbl> <dbl> <dbl>
## 1 2009 -5.7 7.8
## 2 2010 4.2 7
## 3 2011 3.9 5.8
## 4 2012 0.4 5.4
## 5 2013 0.4 5.2
## 6 2014 2.2 5
## 7 2015 1.5 4.6
## 8 2016 2.2 4.1
## 9 2017 2.7 3.8
## 10 2018 1.1 3.4
## 11 2019 1.1 3.1
## 12 2020 -4.6 3.8
Now it looks much better. However we see that the common column (values) that was not included in the joining columns list appears now renamed values.x and values.y. In order to fix this, we can simply rename the values column in the two datasets:
## # A tibble: 12 × 3
## time gdp une
## <dbl> <dbl> <dbl>
## 1 2009 -5.7 7.8
## 2 2010 4.2 7
## 3 2011 3.9 5.8
## 4 2012 0.4 5.4
## 5 2013 0.4 5.2
## 6 2014 2.2 5
## 7 2015 1.5 4.6
## 8 2016 2.2 4.1
## 9 2017 2.7 3.8
## 10 2018 1.1 3.4
## 11 2019 1.1 3.1
## 12 2020 -4.6 3.8
However, it is much better and preferable to properly name the columns before the join is applied:
gdp <- tec00115 %>%
filter(unit == 'CLV_PCH_PRE') %>%
filter(na_item == 'B1GQ') %>%
filter(geo == 'DE') %>%
select(time, gdp = values)
une <- une_rt_a %>%
filter(age == 'Y15-74') %>%
filter(unit == 'PC_ACT') %>%
filter(sex == 'T') %>%
filter(geo == 'DE') %>%
select(time, une = values)
inner_join(gdp, une, by = "time")
## # A tibble: 12 × 3
## time gdp une
## <dbl> <dbl> <dbl>
## 1 2009 -5.7 7.8
## 2 2010 4.2 7
## 3 2011 3.9 5.8
## 4 2012 0.4 5.4
## 5 2013 0.4 5.2
## 6 2014 2.2 5
## 7 2015 1.5 4.6
## 8 2016 2.2 4.1
## 9 2017 2.7 3.8
## 10 2018 1.1 3.4
## 11 2019 1.1 3.1
## 12 2020 -4.6 3.8