Most Recurring Word on each Country’s Wikipedia Page


How to create an interactive world map showing the most frequent word on each country’s Wikipedia page.

2019-12-16

A friend recently gave me a fun book: Brilliant Maps: An Atlas for Curious Minds. Among its many beautiful maps, one shows the most frequent word on each country’s English Wikipedia page. Let’s see if we can reproduce it using R.

A different version of the infographic is available on the book’s website. It isn’t identical, but it gives a good idea of how the map looks in the book:

The authors specify in a footnote that they excluded “country”, linking words, demonyms and “government”. Surprisingly, the most recurring words of the English Wikipedia pages are often either revealing or funny. For example, according to the infographic, the most recurring word on the USA’s page is “war”. For China it is “dynasty”, for Australia “new” and for Russia “soviet”.

In this article, we will:

- scrape the English Wikipedia pages of the 193 United Nations member states;
- extract the most recurring word of each page, following the book’s methodology;
- build an interactive world map of the results.

As always, my analysis is fully reproducible: you can get the R code from my GitHub repository or run it in my online RStudio Cloud session.

Scraping 193 country Wikipedia pages

First, we need the list of the Wikipedia pages of the countries we want to scrape. The Wikipedia page on the member states of the United Nations contains a table with the wikilinks of all 193 countries.

library(tidyverse)
library(rvest)
library(janitor)

un_members_permalink <- "https://en.wikipedia.org/w/index.php?title=Member_states_of_the_United_Nations&oldid=926737595"

# Extract wikitable
un_members_html <- un_members_permalink %>%
  read_html() %>%
  html_nodes("table") %>%
  .[[2]]

# extract wikilinks of all UN members
un_members_wikilinks <- un_members_html %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  unique() %>%
  str_subset("^/wiki/") %>%
  str_subset("United_Nations", negate = TRUE) %>%
  str_remove("/wiki/Viet_Nam|/wiki/The_Gambia") %>% # duplicated pages
  str_remove("/wiki/Flag") %>%
  str_replace("S%C3%A3o_Tom%C3%A9_and_Pr%C3%ADncipe", "Sao_Tome_and_Principe") %>%
  str_subset("") %>%
  paste0("https://en.wikipedia.org", .)

# create UN members table
un_members <- un_members_html %>%
  html_table() %>%
  mutate(
    wikilinks = un_members_wikilinks,
    page_name = str_remove(un_members_wikilinks, ".*/")
  ) %>%
  janitor::clean_names() %>%
  mutate_all(~ str_remove_all(.x, "\\[note [0-9]+]|\\[[0-9]+]")) %>%
  select(member_state = member_state_7_13_14, wikilinks, page_name)

un_members
# A tibble: 193 x 3
   member_state      wikilinks                         page_name      
   <chr>             <chr>                             <chr>          
 1 Afghanistan       https://en.wikipedia.org/wiki/Af~ Afghanistan    
 2 Albania           https://en.wikipedia.org/wiki/Al~ Albania        
 3 Algeria           https://en.wikipedia.org/wiki/Al~ Algeria        
 4 Andorra           https://en.wikipedia.org/wiki/An~ Andorra        
 5 Angola            https://en.wikipedia.org/wiki/An~ Angola         
 6 Antigua and Barb~ https://en.wikipedia.org/wiki/An~ Antigua_and_Ba~
 7 Argentina         https://en.wikipedia.org/wiki/Ar~ Argentina      
 8 Armenia           https://en.wikipedia.org/wiki/Ar~ Armenia        
 9 Australia         https://en.wikipedia.org/wiki/Au~ Australia      
10 Austria           https://en.wikipedia.org/wiki/Au~ Austria        
# ... with 183 more rows

Now that we have the URLs of all the Wikipedia pages, we can use the {WikipediR} package to download the related pages into a single dataset.

library(WikipediR)

countries_html <- un_members %>%
  mutate(html = purrr::map(page_name, ~ WikipediR::page_content(
    language = "en", 
    project = "wikipedia", 
    page_name = .x)$parse$text$`*`[[1]]))
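Downloading 193 full pages takes a few minutes. If you plan to rerun the analysis, it can be worth caching the raw download, for example with saveRDS() (my addition; the file name is arbitrary):

# save the raw pages once, then reload them instead of scraping again
saveRDS(countries_html, "countries_html.rds")
# countries_html <- readRDS("countries_html.rds")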

countries <- countries_html %>%
  mutate(html = purrr::map(html, unlist)) %>%
  mutate(html = purrr::map(html, read_html)) %>%
  mutate(html = purrr::map(html, html_text)) %>%
  unnest_longer(html) %>%
  mutate(html = str_squish(html))
glimpse(countries)
Rows: 193
Columns: 4
$ member_state <chr> "Afghanistan", "Albania", "Algeria", "Andorra",~
$ wikilinks    <chr> "https://en.wikipedia.org/wiki/Afghanistan", "h~
$ page_name    <chr> "Afghanistan", "Albania", "Algeria", "Andorra",~
$ html         <chr> "Country in Central and South Asia .mw-parser-o~

Nice! We have all the data we need.

Getting the most recurring words

We now want to clean the data and extract the most recurring word of each Wikipedia page.

Let’s get the demonyms of the countries.

wiki_demonyms_permalink <- "https://en.wikipedia.org/w/index.php?title=List_of_adjectival_and_demonymic_forms_for_countries_and_nations&oldid=1042895982"

demonyms_raw <- wiki_demonyms_permalink %>%
  read_html() %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table() %>%
  as_tibble() %>%
  mutate_all(~ str_remove_all(.x, "\\[.]|[(.)]")) %>%
  mutate_all(str_to_lower) %>%
  janitor::clean_names()

demonyms <- demonyms_raw %>%
  mutate(
    adjectivals = str_trim(adjectivals),
    demonyms = str_trim(demonyms),
    adjectivals_unnested = str_split(adjectivals, ",| or | "),
    demonyms_unnested = str_split(demonyms, ", | or |/| ")
  ) %>%
  unnest_longer(adjectivals_unnested) %>%
  unnest_longer(demonyms_unnested) %>%
  filter(adjectivals_unnested != "") %>%
  distinct(country_entity_name, adjectivals_unnested, demonyms_unnested)
demonyms
# A tibble: 713 x 3
   country_entity_name adjectivals_unnested demonyms_unnested
   <chr>               <chr>                <chr>            
 1 abkhazia            abkhaz               abkhazians       
 2 abkhazia            abkhazian            abkhazians       
 3 afghanistan         afghan               afghans          
 4 åland islands       åland                åland            
 5 åland islands       åland                islanders        
 6 åland islands       island               åland            
 7 åland islands       island               islanders        
 8 albania             albanian             albanians        
 9 algeria             algerian             algerians        
10 american samoa      american             american         
# ... with 703 more rows

Now we can clean the dataset following the authors’ methodology, i.e. removing “country”, linking words, demonyms and “government”.

library(tidytext)

countries_tidy <- countries %>%
  mutate(html = str_squish(html)) %>%
  tidytext::unnest_tokens(word, html) %>%
  anti_join(tidytext::get_stopwords("en"), by = "word") %>%
  anti_join(countries %>%
              select(member_state) %>%
              distinct(member_state) %>%
              mutate(word = str_to_lower(member_state)), by = "word") %>%
  anti_join(countries %>%
              select(member_state) %>%
              distinct(member_state) %>%
              tidytext::unnest_tokens(word, member_state), by = "word") %>%
  filter(!word %in% demonyms$country_entity_name) %>%
  filter(!word %in% paste0(demonyms$country_entity_name, "'s")) %>%
  filter(!word %in% demonyms$adjectivals_unnested) %>%
  filter(!word %in% demonyms$demonyms_unnested) %>%
  filter(!word %in% c("country", "government"))

countries_tidy
# A tibble: 2,632,364 x 4
   member_state wikilinks                        page_name   word     
   <chr>        <chr>                            <chr>       <chr>    
 1 Afghanistan  https://en.wikipedia.org/wiki/A~ Afghanistan asia     
 2 Afghanistan  https://en.wikipedia.org/wiki/A~ Afghanistan mw       
 3 Afghanistan  https://en.wikipedia.org/wiki/A~ Afghanistan parser   
 4 Afghanistan  https://en.wikipedia.org/wiki/A~ Afghanistan output   
 5 Afghanistan  https://en.wikipedia.org/wiki/A~ Afghanistan hatnote  
 6 Afghanistan  https://en.wikipedia.org/wiki/A~ Afghanistan font     
 7 Afghanistan  https://en.wikipedia.org/wiki/A~ Afghanistan style:it~
 8 Afghanistan  https://en.wikipedia.org/wiki/A~ Afghanistan mw       
 9 Afghanistan  https://en.wikipedia.org/wiki/A~ Afghanistan parser   
10 Afghanistan  https://en.wikipedia.org/wiki/A~ Afghanistan output   
# ... with 2,632,354 more rows

The results show that we still need additional stopwords: HTML and CSS residue (such as “mw”, “parser” and “output”), numbers, month names, and a few other words.

countries_cleaned <- countries_tidy %>%
  filter(!word %in% str_to_lower(month.name)) %>%
  filter(!word %in% c(0:2030)) %>%
  filter(!word %in% letters) %>%
  filter(!word %in% c(
    "u.s", "uk", "uae", "drc", "de", "en", "also", "ib", "africa",
    "retrieved", "archived", "original", "pdf", "edit", "isbn", "p", "pp",
    "output", "redirect", "page", "parser", "mw", "wayback", "main",
    "st", "al", "la", "per", "percent", "cent", "05", "cs1", "one", "two",
    "perú", "d'andorra", "china's", "syria", "brasil", "citation"
  ))

Let’s have a look at our most recurring words by country.

countries_cleaned %>%
  count(member_state, word, sort = TRUE) %>%
  group_by(member_state) %>%
  top_n(1) %>%
  ungroup() %>%
  distinct(member_state, .keep_all = TRUE) %>%
  arrange(desc(n)) %>%
  rmarkdown::paged_table()

Yes, the most recurring word of the United States of America’s Wikipedia page is indeed “war”.

But to what extent is “war” more recurring than the other words?

countries_cleaned %>%
  filter(member_state == "United States of America") %>%
  count(word, sort = TRUE)
# A tibble: 5,912 x 2
   word           n
   <chr>      <int>
 1 war          127
 2 world        111
 3 history       90
 4 press         88
 5 federal       74
 6 university    67
 7 first         60
 8 york          59
 9 national      58
10 population    55
# ... with 5,902 more rows
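To see how far ahead “war” really is, we can plot the top 10 words of the USA page. This is a quick {ggplot2} sketch I added; the package is attached with the tidyverse.

countries_cleaned %>%
  filter(member_state == "United States of America") %>%
  count(word, sort = TRUE) %>%
  top_n(10, n) %>%
  ggplot(aes(x = n, y = reorder(word, n))) +
  geom_col() +
  labs(
    title = "Top 10 recurring words on the USA Wikipedia page",
    x = "occurrences",
    y = NULL
  )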

Let’s have a closer look at all the top 10 recurring words by country.

countries_cleaned %>%
  count(member_state, word, sort = TRUE) %>%
  group_by(member_state) %>%
  top_n(10) %>%
  arrange(member_state) %>%
  rmarkdown::paged_table()

Which words appear most often across all these country top 10s?

countries_cleaned %>%
  count(member_state, word) %>%
  group_by(member_state) %>%
  top_n(10) %>%
  ungroup() %>%
  count(word, sort = TRUE)
# A tibble: 602 x 2
   word              n
   <chr>         <int>
 1 world           171
 2 national        128
 3 population      127
 4 font             80
 5 margin           62
 6 international    61
 7 first            56
 8 president        54
 9 war              52
10 history          40
# ... with 592 more rows

The words “world”, “population”, “national” and “president” look like good candidates for removal as stopwords. Let’s remove them.

countries_top_word <- countries_cleaned %>%
  filter(!word %in% c("world", "world's", "population", "national", "president")) %>%
  count(member_state, word, sort = TRUE) %>%
  group_by(member_state) %>%
  top_n(1) %>%
  ungroup() %>%
  distinct(member_state, .keep_all = TRUE) %>%
  arrange(desc(member_state))

countries_top_word %>%
  count(word, sort = TRUE)
# A tibble: 136 x 2
   word          n
   <chr>     <int>
 1 caribbean     9
 2 font          8
 3 soviet        8
 4 war           8
 5 pacific       7
 6 europe        5
 7 east          3
 8 gulf          3
 9 margin        3
10 baltic        2
# ... with 126 more rows

Which countries have “war” as their most recurring word?

countries_top_word %>%
  filter(word == "war") %>%
  arrange(desc(n))
# A tibble: 8 x 3
  member_state                     word      n
  <chr>                            <chr> <int>
1 United States of America         war     127
2 Syrian Arab Republic             war      77
3 Spain                            war      57
4 Libya                            war      57
5 Democratic Republic of the Congo war      56
6 Iraq                             war      55
7 Liberia                          war      42
8 Paraguay                         war      39

Quite interesting!

An interactive world map

Finally, let’s make our own map based on this new dataset. Since this is a blog post, let’s make the world map interactive.

library(countrycode)
library(leaflet)
library(rnaturalearth)
library(rnaturalearthdata)
library(rgeos)
library(htmltools)

world <- ne_countries(scale = "small", returnclass = "sf") %>%
  filter(continent != "Antarctica") %>%
  mutate(
    name = recode(name, "Greenland" = "Denmark"),
    iso_a3 = recode(iso_a3, "GRL" = "DNK")
  ) %>%
  select(name, iso_a3, geometry)

countries_top_word$iso3c <- countrycode(countries_top_word$member_state, "country.name", "iso3c")

world <- world %>%
  inner_join(countries_top_word, by = c("iso_a3" = "iso3c"))
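Note that the inner join silently drops UN members that have no matching polygon in the small-scale Natural Earth data (mostly microstates). A quick check, added here for illustration:

# UN members without a matching polygon; the small-scale Natural Earth
# data omits many microstates (sanity check, not in the original post)
countries_top_word %>%
  filter(!iso3c %in% world$iso_a3) %>%
  select(member_state, word)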

labels <- sprintf(
  "<strong>%s</strong><br/>%s",
  world$name, world$word
) %>% lapply(htmltools::HTML)

# reference: https://stackoverflow.com/a/52226825
tag.map.title <- tags$style(HTML(".leaflet-control.map-title { 
    transform: translate(-50%,20%);
    position: fixed !important;
    left: 50%;
    text-align: center;
    padding-left: 10px; 
    padding-right: 10px; 
    background: rgba(255,255,255,0.75);
    font-weight: bold;
    font-size: 22px;
  }
"))

title <- tags$div(
  tag.map.title, HTML("Most Recurring Word on each Country's Wikipedia Page")
)

leaflet(world) %>%
  addPolygons(
    weight = 1,
    fillOpacity = 0.3,
    highlight = highlightOptions(
      weight = 2,
      fillOpacity = 1
    ),
    label = labels,
    labelOptions = labelOptions(
      textsize = "15px"
    )
  ) %>%
  addControl(title, position = "topleft", className = "map-title")

Hover over the world map to discover the most recurring word of each country’s English Wikipedia page. Yes, the results are quite different from the older version by Brilliant Maps.
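If you want to reuse the map outside this blog post, the {htmlwidgets} package can export it as a standalone HTML file. A minimal sketch, not part of the original post:

library(htmlwidgets)

# store the map in a variable, then export it as a self-contained page
wiki_map <- leaflet(world) %>%
  addPolygons(weight = 1, fillOpacity = 0.3, label = labels) %>%
  addControl(title, position = "topleft", className = "map-title")

saveWidget(wiki_map, "wikipedia_words_map.html", selfcontained = TRUE)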

Thanks for reading!
