How to create an interactive world map showing the most frequent word of each country's English Wikipedia page.
A friend recently gave me a fun book: Brilliant Maps: An Atlas for Curious Minds. Among its many beautiful maps, one shows the most frequent word of each country's English Wikipedia page. Let's see if we can reproduce it using R.
A different version of the infographic is available on the website of the book; it still gives a good idea of what the map looks like in the book:
The authors specify in a footnote that they excluded “country”, linking words, demonyms and “government”. Surprisingly, the most recurring words of the English Wikipedia pages are often either revealing or funny. For example, according to the infographic, the most recurring word of the English Wikipedia page of the USA is “war”. For China, it would be “dynasty”, for Australia “new” and for Russia “soviet”.
In this article, we will:
- scrape the English Wikipedia pages of all the UN member states;
- clean the text and extract the most recurring word of each page;
- build an interactive world map of the results.
As always, my analysis is fully reproducible: you can get my R code from my GitHub repository or my online RStudio Cloud session.
Firstly, we need a list of all the Wikipedia pages of the countries we want to scrape. The Wikipedia page listing the 193 member states of the United Nations contains a table with wikilinks to all the countries.
library(tidyverse)
library(rvest)
library(janitor)
un_members_permalink <- "https://en.wikipedia.org/w/index.php?title=Member_states_of_the_United_Nations&oldid=926737595"
# Extract wikitable
un_members_html <- un_members_permalink %>%
read_html() %>%
html_nodes("table") %>%
.[[2]]
# extract wikilinks of all UN members
un_members_wikilinks <- un_members_html %>%
html_nodes("a") %>%
html_attr("href") %>%
unique() %>%
str_subset("^/wiki/") %>%
str_subset("United_Nations", negate = TRUE) %>%
str_remove("/wiki/Viet_Nam|/wiki/The_Gambia") %>% # duplicated pages
str_remove("/wiki/Flag") %>%
str_replace("S%C3%A3o_Tom%C3%A9_and_Pr%C3%ADncipe", "Sao_Tome_and_Principe") %>%
str_subset("") %>%
paste0("https://en.wikipedia.org", .)
# create UN members table
un_members <- un_members_html %>%
html_table() %>%
mutate(
wikilinks = un_members_wikilinks,
page_name = str_remove(un_members_wikilinks, ".*/")
) %>%
janitor::clean_names() %>%
mutate_all(~ str_remove_all(., "\\[note [0-9]+]|\\[[0-9]+]")) %>%
select(member_state = member_state_7_13_14, wikilinks, page_name)
un_members
# A tibble: 193 x 3
member_state wikilinks page_name
<chr> <chr> <chr>
1 Afghanistan https://en.wikipedia.org/wiki/Af~ Afghanistan
2 Albania https://en.wikipedia.org/wiki/Al~ Albania
3 Algeria https://en.wikipedia.org/wiki/Al~ Algeria
4 Andorra https://en.wikipedia.org/wiki/An~ Andorra
5 Angola https://en.wikipedia.org/wiki/An~ Angola
6 Antigua and Barb~ https://en.wikipedia.org/wiki/An~ Antigua_and_Ba~
7 Argentina https://en.wikipedia.org/wiki/Ar~ Argentina
8 Armenia https://en.wikipedia.org/wiki/Ar~ Armenia
9 Australia https://en.wikipedia.org/wiki/Au~ Australia
10 Austria https://en.wikipedia.org/wiki/Au~ Austria
# ... with 183 more rows
Now that we have the URLs of all the Wikipedia pages, we will use the {WikipediR} R package to download the related pages into a single dataset.
library(WikipediR)
countries_html <- un_members %>%
mutate(html = purrr::map(page_name, ~ WikipediR::page_content(
language = "en",
project = "wikipedia",
page_name = .x)$parse$text$`*`[[1]]))
countries <- countries_html %>%
mutate(html = purrr::map(html, unlist)) %>%
mutate(html = purrr::map(html, read_html)) %>%
mutate(html = purrr::map(html, html_text)) %>%
unnest_longer(html) %>%
mutate(html = str_squish(html))
glimpse(countries)
Rows: 193
Columns: 4
$ member_state <chr> "Afghanistan", "Albania", "Algeria", "Andorra",~
$ wikilinks <chr> "https://en.wikipedia.org/wiki/Afghanistan", "h~
$ page_name <chr> "Afghanistan", "Albania", "Algeria", "Andorra",~
$ html <chr> "Country in Central and South Asia .mw-parser-o~
Nice! We have all the data we need.
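As a side note, scraping 193 pages takes a few minutes, so it can be worth caching the result with saveRDS(). A minimal sketch (my own addition, with a hypothetical file name):
# save the scraped pages once, so re-running the analysis doesn't hit Wikipedia again
saveRDS(countries_html, "countries_html.rds") # hypothetical file name
# on later runs, read them back instead of re-scraping:
# countries_html <- readRDS("countries_html.rds")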
Now we want to clean the data and extract the most recurring word of each Wikipedia page. We will follow the authors' methodology by removing “country”, linking words, demonyms and “government”.
Let’s get the demonyms of the countries.
wiki_demonyms_permalink <- "https://en.wikipedia.org/w/index.php?title=List_of_adjectival_and_demonymic_forms_for_countries_and_nations&oldid=1042895982"
demonyms_raw <- wiki_demonyms_permalink %>%
read_html() %>%
html_nodes("table") %>%
.[[1]] %>%
html_table() %>%
as_tibble() %>%
mutate_all(~ str_remove_all(., "\\[.]|[(.)]")) %>%
mutate_all(str_to_lower) %>%
janitor::clean_names()
demonyms <- demonyms_raw %>%
mutate(
adjectivals = str_trim(adjectivals),
demonyms = str_trim(demonyms),
adjectivals_unnested = str_split(adjectivals, ",| or | "),
demonyms_unnested = str_split(demonyms, ", | or |/| ")
) %>%
unnest_longer(adjectivals_unnested) %>%
unnest_longer(demonyms_unnested) %>%
filter(adjectivals_unnested != "") %>%
distinct(country_entity_name, adjectivals_unnested, demonyms_unnested)
demonyms
# A tibble: 713 x 3
country_entity_name adjectivals_unnested demonyms_unnested
<chr> <chr> <chr>
1 abkhazia abkhaz abkhazians
2 abkhazia abkhazian abkhazians
3 afghanistan afghan afghans
4 åland islands åland åland
5 åland islands åland islanders
6 åland islands island åland
7 åland islands island islanders
8 albania albanian albanians
9 algeria algerian algerians
10 american samoa american american
# ... with 703 more rows
Now we can clean the dataset following the authors' methodology: removing linking words (stopwords), country names, demonyms, “country” and “government”.
library(tidytext)
countries_tidy <- countries %>%
mutate(html = str_squish(html)) %>%
tidytext::unnest_tokens(word, html) %>%
anti_join(tidytext::get_stopwords("en"), by = "word") %>%
anti_join(countries %>%
select(member_state) %>%
distinct(member_state) %>%
mutate(word = str_to_lower(member_state)), by = "word") %>%
anti_join(countries %>%
select(member_state) %>%
distinct(member_state) %>%
tidytext::unnest_tokens(word, member_state), by = "word") %>%
filter(!word %in% demonyms$country_entity_name) %>%
filter(!word %in% paste0(demonyms$country_entity_name, "'s")) %>%
filter(!word %in% demonyms$adjectivals_unnested) %>%
filter(!word %in% demonyms$demonyms_unnested) %>%
filter(!word %in% c("country", "government"))
countries_tidy
# A tibble: 2,632,364 x 4
member_state wikilinks page_name word
<chr> <chr> <chr> <chr>
1 Afghanistan https://en.wikipedia.org/wiki/A~ Afghanistan asia
2 Afghanistan https://en.wikipedia.org/wiki/A~ Afghanistan mw
3 Afghanistan https://en.wikipedia.org/wiki/A~ Afghanistan parser
4 Afghanistan https://en.wikipedia.org/wiki/A~ Afghanistan output
5 Afghanistan https://en.wikipedia.org/wiki/A~ Afghanistan hatnote
6 Afghanistan https://en.wikipedia.org/wiki/A~ Afghanistan font
7 Afghanistan https://en.wikipedia.org/wiki/A~ Afghanistan style:it~
8 Afghanistan https://en.wikipedia.org/wiki/A~ Afghanistan mw
9 Afghanistan https://en.wikipedia.org/wiki/A~ Afghanistan parser
10 Afghanistan https://en.wikipedia.org/wiki/A~ Afghanistan output
# ... with 2,632,354 more rows
The results show that additional stopwords are needed: HTML/CSS residues (such as “mw”, “parser” and “output”), numbers, month names and a few other words should also be removed.
countries_cleaned <- countries_tidy %>%
filter(!word %in% str_to_lower(month.name)) %>%
filter(!word %in% c(0:2030)) %>%
filter(!word %in% letters) %>%
filter(!word %in% c(
"u.s", "uk", "uae", "drc", "de", "en", "also", "ib", "africa",
"retrieved", "archived", "original", "pdf", "edit", "isbn", "p", "pp",
"output", "redirect", "page", "parser", "mw", "wayback", "main",
"st", "al", "la", "per", "percent", "cent", "05", "cs1", "one", "two",
"perú", "d'andorra", "china's", "syria", "brasil", "citation"
))
Let’s have a look at our most recurring words by country.
countries_cleaned %>%
count(member_state, word, sort = TRUE) %>%
group_by(member_state) %>%
top_n(1) %>%
ungroup() %>%
distinct(member_state, .keep_all = TRUE) %>%
arrange(desc(member_state, n)) %>%
rmarkdown::paged_table()
Yes, the most recurring word of the United States of America's Wikipedia page is indeed “war”.
But to what extent is the word “war” more recurrent than other words?
countries_cleaned %>%
filter(member_state == "United States of America") %>%
count(word, sort = TRUE)
# A tibble: 5,912 x 2
word n
<chr> <int>
1 war 127
2 world 111
3 history 90
4 press 88
5 federal 74
6 university 67
7 first 60
8 york 59
9 national 58
10 population 55
# ... with 5,902 more rows
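A quick bar chart (my own addition, using ggplot2, which is already loaded with the tidyverse) makes the gap between “war” and the runner-up words easier to see:
countries_cleaned %>%
  filter(member_state == "United States of America") %>%
  count(word, sort = TRUE) %>%
  top_n(10, n) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Occurrences",
       title = "Top 10 words of the USA Wikipedia page")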
Let’s have a closer look at all the top 10 recurring words by country.
countries_cleaned %>%
count(member_state, word, sort = T) %>%
group_by(member_state) %>%
top_n(10) %>%
arrange(member_state) %>%
rmarkdown::paged_table()
What are the most recurring words in all our tops?
countries_cleaned %>%
count(member_state, word) %>%
group_by(member_state) %>%
top_n(10) %>%
ungroup() %>%
count(word, sort = TRUE)
# A tibble: 602 x 2
word n
<chr> <int>
1 world 171
2 national 128
3 population 127
4 font 80
5 margin 62
6 international 61
7 first 56
8 president 54
9 war 52
10 history 40
# ... with 592 more rows
The words “world”, “population”, “national” and “president” seem like good candidates for additional stopwords. Let's remove them.
countries_top_word <- countries_cleaned %>%
filter(!word %in% c("world", "world's", "population", "national", "president")) %>%
count(member_state, word, sort = TRUE) %>%
group_by(member_state) %>%
top_n(1) %>%
ungroup() %>%
distinct(member_state, .keep_all = TRUE) %>%
arrange(desc(member_state))
countries_top_word %>%
count(word, sort = TRUE)
# A tibble: 136 x 2
word n
<chr> <int>
1 caribbean 9
2 font 8
3 soviet 8
4 war 8
5 pacific 7
6 europe 5
7 east 3
8 gulf 3
9 margin 3
10 baltic 2
# ... with 126 more rows
Which countries have “war” as their most recurrent word?
countries_top_word %>%
filter(word == "war") %>%
arrange(desc(n))
# A tibble: 8 x 3
member_state word n
<chr> <chr> <int>
1 United States of America war 127
2 Syrian Arab Republic war 77
3 Spain war 57
4 Libya war 57
5 Democratic Republic of the Congo war 56
6 Iraq war 55
7 Liberia war 42
8 Paraguay war 39
Quite interesting!
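You can run the same filter for any other word of interest; for example, to see which countries have “soviet” on top (my own addition):
countries_top_word %>%
  filter(word == "soviet") %>%
  arrange(desc(n))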
Finally, let's make our own map based on this new dataset. Since this is a blog article, let's make the world map interactive.
library(countrycode)
library(leaflet)
library(rnaturalearth)
library(rnaturalearthdata)
library(rgeos)
library(htmltools)
world <- ne_countries(scale = "small", returnclass = "sf") %>%
filter(continent != "Antarctica") %>%
mutate(
# Greenland is not a UN member: display Denmark's word on its polygon
name = recode(name, "Greenland" = "Denmark"),
iso_a3 = recode(iso_a3, "GRL" = "DNK")
) %>%
select(name, iso_a3, geometry)
countries_top_word$iso3c <- countrycode(countries_top_word$member_state, "country.name", "iso3c")
world <- world %>%
inner_join(countries_top_word, by = c("iso_a3" = "iso3c"))
labels <- sprintf(
"<strong>%s</strong><br/>%s",
world$name, world$word
) %>% lapply(htmltools::HTML)
# reference: https://stackoverflow.com/a/52226825
tag.map.title <- tags$style(HTML(".leaflet-control.map-title {
transform: translate(-50%,20%);
position: fixed !important;
left: 50%;
text-align: center;
padding-left: 10px;
padding-right: 10px;
background: rgba(255,255,255,0.75);
font-weight: bold;
font-size: 22px;
}
"))
title <- tags$div(
tag.map.title, HTML("Most Recurring Word on each Country's Wikipedia Page")
)
leaflet(world) %>%
addPolygons(
weight = 1,
fillOpacity = 0.3,
highlight = highlightOptions(
weight = 2,
fillOpacity = 1
),
label = labels,
labelOptions = labelOptions(
textsize = "15px"
)
) %>%
addControl(title, position = "topleft", className = "map-title")
Hover your mouse over the world map and discover the most recurring word of each country's English Wikipedia page. Yes, they are quite different from the older version made by Brilliant Maps.
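If you want to reuse the map outside this post, you can save it as a standalone HTML file with the {htmlwidgets} package; a minimal sketch (my own addition, with a hypothetical file name):
library(htmlwidgets)
# assuming the leaflet pipeline above was assigned to an object called `map`
saveWidget(map, file = "wikipedia_words_map.html", selfcontained = TRUE)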