The Swiss Federal Statistical Office, or BFS from “Bundesamt für Statistik” in German, provides a rich public database. As Swiss citizen and R enthousiast, I wanted to easily access its datasets directly from R. So I created the BFS package.
In this article, I will show how to use my BFS package package to easily search and download datasets from the Swiss Federal Statistical Office. We will then quickly explore a dataset and plot a map of Swiss municipalities, the lowest level of administrative division in Switzerland.
As always the code is fully reproducible, so you can get it from my Github account or on my online RStudio Cloud session.
To use my BFS
package, we should begin by downloading information related to all available datasets of the Swiss Federal Statistical Office Catalogue. We can get the BFS metadata in German (“de”), French (“fr”), Italian (“it”) and English (“en”)1.
# install/load needed R packages
if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, scales, colorspace, plotly, RSwissMaps, BFS)
# setting light theme
theme_set(theme_light())
# Get BFS metadata in German
meta_de <- bfs_get_metadata(language = "de")
meta_de
## # A tibble: 676 x 5
## title observation_period published url url_px
## <chr> <chr> <chr> <chr> <chr>
## 1 Ausländische … Dargestellter Zeitr… 07.11.20… https://www.… https://www…
## 2 Ausländische … Dargestellter Zeitr… 07.11.20… https://www.… https://www…
## 3 Ausländische … Dargestellter Zeitr… 07.11.20… https://www.… https://www…
## 4 Ausländische … Dargestellter Zeitr… 07.11.20… https://www.… https://www…
## 5 Ausländische … Dargestellter Zeitr… 07.11.20… https://www.… https://www…
## 6 Ausländische … Dargestellter Zeitr… 07.11.20… https://www.… https://www…
## 7 Ausländische … Dargestellter Zeitr… 07.11.20… https://www.… https://www…
## 8 Ausländische … Dargestellter Zeitr… 07.11.20… https://www.… https://www…
## 9 Hotellerie: A… Dargestellter Zeitr… 04.11.20… https://www.… https://www…
## 10 Hotellerie: A… Dargestellter Zeitr… 04.11.20… https://www.… https://www…
## # … with 666 more rows
We currently have access to 676 BFS datasets.
The challenge today is to plot a detailed map of Switzerland containing all 2’212 Swiss municipalites, or “gemeinde” in German. How many current BFS datasets contain the word “gemeinde” in their title? Let’s find out using the bfs_search()
function.
## # A tibble: 34 x 5
## title observation_period published url url_px
## <chr> <chr> <chr> <chr> <chr>
## 1 Ausländische … Dargestellter Zeitr… 07.11.20… https://www.… https://www…
## 2 Hotellerie: A… Dargestellter Zeitr… 04.11.20… https://www.… https://www…
## 3 Hotellerie: A… Dargestellter Zeitr… 04.11.20… https://www.… https://www…
## 4 Nationalratsw… Dargestellter Zeitr… 19.09.20… https://www.… https://www…
## 5 Arbeitsstätte… Dargestellter Zeitr… 22.08.20… https://www.… https://www…
## 6 Kinoinfrastru… Dargestellter Zeitr… 14.03.20… https://www.… https://www…
## 7 Volksabstimmu… Dargestellter Zeitr… 04.03.20… https://www.… https://www…
## 8 Öffentliche B… Dargestellter Zeitr… 20.12.20… https://www.… https://www…
## 9 Gründung, Sch… Dargestellter Zeitr… 10.12.20… https://www.… https://www…
## 10 Neu gegründet… Dargestellter Zeitr… 05.10.20… https://www.… https://www…
## # … with 24 more rows
We got 34 different datasets related to Swiss municipalities. I am interested by the one related to the cross-border worker (“Ausländische Grenzgänger/innen” in German) at the first row. Let’s download it using the bfs_get_dataset()
function.
## [1] "Ausländische Grenzgänger/innen nach Geschlecht und Arbeitsgemeinde"
df_title <- "Ausländische Grenzgänger/innen nach Geschlecht und Arbeitsgemeinde"
meta_de_gemeinde1 <- bfs_search(data = meta_de_gemeinde, pattern = df_title)
data_bfs <- bfs_get_dataset(meta_de_gemeinde1$url_px[1])
data_bfs
## # A tibble: 638,970 x 4
## quartal arbeitsgemeinde geschlecht value
## <fct> <fct> <fct> <dbl>
## 1 1996Q1 Schweiz Geschlecht - Total 143526.
## 2 1996Q2 Schweiz Geschlecht - Total 142744.
## 3 1996Q3 Schweiz Geschlecht - Total 141538.
## 4 1996Q4 Schweiz Geschlecht - Total 139210.
## 5 1997Q1 Schweiz Geschlecht - Total 137687.
## 6 1997Q2 Schweiz Geschlecht - Total 135740.
## 7 1997Q3 Schweiz Geschlecht - Total 136010.
## 8 1997Q4 Schweiz Geschlecht - Total 134580.
## 9 1998Q1 Schweiz Geschlecht - Total 134379.
## 10 1998Q2 Schweiz Geschlecht - Total 134552.
## # … with 638,960 more rows
Note that the developing version of BFS
leverage the new pins
package to save all the downloaded Swiss datasets in the same cache folder, accessible using the bfs_open_dir()
function.
That’s all for my new BFS
package.
Using the Tidyverse workflow, we can now performe a quick exploratory data analysis.
Let’s begin with a glimpse at the data.
## Observations: 638,970
## Variables: 4
## $ quartal <fct> 1996Q1, 1996Q2, 1996Q3, 1996Q4, 1997Q1, 1997Q2, …
## $ arbeitsgemeinde <fct> Schweiz, Schweiz, Schweiz, Schweiz, Schweiz, Sch…
## $ geschlecht <fct> Geschlecht - Total, Geschlecht - Total, Geschlec…
## $ value <dbl> 143526.0, 142744.5, 141537.9, 139210.4, 137686.8…
The dataset contains information about the number of cross-border workers by quarter (quartal
), Swiss municipality (arbeitsgemeinde
) and gender (geschlecht
).
Notice that value
is a pondered value: each worker get a weighted point between 0-1 according to the number of hours of works he/she is doing (see more here). It is therefore more appropriate to speak about “cross-border work” as the value of two men working half time in a Swiss municipality is equal to a full time cross-border working woman.
I am curious to learn more about the gender ratio of cross-border work by municipality and its evolution over the years. Let’s build a new gender_ratio
variable.
data_bfs_ratio <- data_bfs %>%
tidyr::pivot_wider(names_from = "geschlecht", values_from = "value") %>%
rename(quarter = quartal,
municipality = arbeitsgemeinde,
man = Mann, woman = Frau,
gender_total = `Geschlecht - Total`) %>%
mutate(municipality = str_remove_all(municipality, "\\.|^\\- ")) %>% # cleaning
mutate(gender_ratio = man / gender_total * 100) %>%
arrange(desc(quarter))
data_bfs_ratio
## # A tibble: 212,990 x 6
## quarter municipality gender_total man woman gender_ratio
## <fct> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2019Q3 Schweiz 325291. 209406. 1.16e+5 64.4
## 2 2019Q3 Zürich 10342. 7557. 2.78e+3 73.1
## 3 2019Q3 Aeugst am Albis 0 0 0. NaN
## 4 2019Q3 Affoltern am Alb… 19.3 14.5 4.75e+0 75.4
## 5 2019Q3 Bonstetten 1.74 0.874 8.62e-1 50.4
## 6 2019Q3 Hausen am Albis 0 0 0. NaN
## 7 2019Q3 Hedingen 42.1 41.5 5.99e-1 98.6
## 8 2019Q3 Kappel am Albis 0 0 0. NaN
## 9 2019Q3 Knonau 0 0 0. NaN
## 10 2019Q3 Maschwanden 0 0 0. NaN
## # … with 212,980 more rows
We see that the gender ratio of cross-border workers for the 2nd quarter of 2019 in Switzerland is 64.2% (but 73% in Zürich).
Does it mean we have strong cantonal gender disparities in terms of cross-border work?
# Create table to join later to bfs_data
# ref: https://en.wikipedia.org/wiki/Data_codes_for_Switzerland#Cantons
cantons <- tibble::tribble(
~canton, ~code, ~id_can,
"Aargau", "AG", 19,
"Appenzell Innerrhoden", "AI", 15,
"Appenzell Ausserrhoden", "AR", 16,
"Bern", "BE", 2,
"Basel-Landschaft", "BL", 13,
"Basel-Stadt", "BS", 12,
"Fribourg", "FR", 10,
"Genève", "GE", 25,
"Glarus", "GL", 8,
"Graubünden", "GR", 18,
"Jura", "JU", 26,
"Luzern", "LU", 3,
"Neuchâtel", "NE", 24,
"Nidwalden", "NW", 7,
"Obwalden", "OW", 6,
"St Gallen", "SG", 17,
"Schaffhausen", "SH", 14,
"Solothurn", "SO", 11,
"Schwyz", "SZ", 5,
"Thurgau", "TG", 20,
"Ticino", "TI", 21,
"Uri", "UR", 4,
"Vaud", "VD", 22,
"Valais", "VS", 23,
"Zug", "ZG", 9,
"Zürich", "ZH", 1,
)
data_bfs_ratio_annualized <- data_bfs_ratio %>%
mutate(year = str_extract(quarter, "^.{4}"),
year = as.numeric(year)) %>%
group_by(municipality, year) %>%
summarise(gender_ratio_annualized = mean(gender_ratio, na.rm = TRUE)) %>%
ungroup() %>%
mutate(gender_per = gender_ratio_annualized/100)
data_bfs_ratio_annualized %>%
inner_join(cantons, by = c("municipality" = "canton")) %>% # join table
ggplot(aes(x = year, y = gender_per, color = code)) +
geom_line() +
geom_line(data = filter(data_bfs_ratio_annualized, municipality == "Schweiz"),
color = "red2", linetype = "dashed", size = 1) +
scale_y_continuous(label = percent) +
annotate("text", x = 2016, y = 0.54, size = 3.5, label = "National gender ratio") +
geom_segment(aes(x = 2016, y = 0.551, xend = 2016, yend = 0.635),
color = "black", size = 0.2,
arrow = arrow(length = unit(0.2, "cm"))) +
labs(title = "Proportion of Men in Cross-Border Workforce",
subtitle = "Switzerland, 1995-2019",
color = "Canton",
x = "", y = "",
caption = "quarter annualized - Data source: BFS")
It looks like we have different gender ratio levels according to the Swiss canton. However, it is hard to see clearly and to categorise the cantons by group.
Let’s make a time serie clustering to get the categories. I will reuse some code of the excellent blogpost of Bruno Rodrigues to perform a time-series k-means clustering.
set.seed(1111)
# Only since 2007 as missing values before for some cantons
data_bfs_wide <- data_bfs_ratio_annualized %>%
inner_join(cantons, by = c("municipality" = "canton")) %>% # join table
filter(year > 2006) %>%
select(municipality, year, gender_ratio_annualized) %>%
pivot_wider(names_from = year, values_from = gender_ratio_annualized)
wss <- map_dbl(1:6, ~{kmeans(select(data_bfs_wide, -municipality), .)$tot.withinss})
elbow_df <- as.data.frame(cbind("n_clust" = 1:6, "wss" = wss))
ggplot(elbow_df) +
geom_line(aes(y = wss, x = n_clust))
The optimal number of categories seems to be four. Let’s cluser our times series in four different groups.
clusters <- kmeans(select(data_bfs_wide, -municipality), centers = 4)
gg_plot <- data_bfs_wide %>%
mutate(cluster = clusters$cluster) %>%
pivot_longer(cols = c(-municipality, -cluster),
names_to = "year",
values_to = "gender ratio") %>%
mutate(cluster = as.factor(cluster)) %>%
rename(canton = municipality) %>%
ggplot() +
geom_line(aes(y = `gender ratio`, x = year,
group = canton, colour = cluster),
show.legend = FALSE) +
facet_wrap(~cluster, nrow = 1) +
scale_color_brewer(palette = "Set2") +
scale_x_discrete(breaks = seq(2007, 2019, by = 3)) +
guides(color = FALSE) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Proportion of Men in Cross-Border Workforce in Swiss Cantons",
x = "")
plotly::ggplotly(gg_plot) %>%
hide_legend()
Put your mouse over the interactive plot above. You can discover the canton’s name and related gender ratio for each time serie of the four clusters. Note that we have only kept data from 2007 as we had missing values for the previous years in some cantons.
That’s all for the exploration of the cantonal level. What about the municipal level?
Let’s plot the gender ratio of cross-bording work by Swiss municipality for the last year available, i.e. 2019.
With the great RSwissMaps
package of David Zumbach, it is possible to create a map of Swizterland with only a few lines of code.
# the BFS id of all municipalities are inside the RSwissMaps
# the data inside RSwissMaps is taken from year 2016
bfs_id_mun <- RSwissMaps::mun.template(year = 2016)
data_bfs_2018 <- data_bfs_ratio_annualized %>%
left_join(bfs_id_mun, by = c("municipality" = "name")) %>%
filter(year == 2019)
mun.plot(data_bfs_2018$bfs_nr,
data_bfs_2018$gender_per,
year = 2016) +
scale_fill_viridis_c(labels = percent, direction = -1) +
theme(legend.position = "right") +
labs(title = "Proportion of Men in Cross-Border Workforce in Switzerland",
subtitle = "More women in green-yellow, 2019",
fill = "",
caption = "Quarterly annualized - Data Source: BFS")
Data shows that central Swiss municipalites also have some cross-bording workers. If cross-bording work is mainly done by men, a few Swiss municipalities have more women working as cross-bording worker.
Let me know what you think about my new BFS
package and feel free to contribute or make a pull request here.
English and Italian have less datasets available.↩︎