Gender Analysis of the Pokémon Universe

Last week I came accross the funny dataset of the last Data Is Beautiful monthly challenge: Information on All 802 Pokemon. Having a quick look at the data, I discovered, to my biggest surprized, a percentage_male variable. I wasn’t aware that Pokémon have genders. So I decided to dig further into this gender dimension of the Pokémon universe.

I learnt that since the second generation, Pokémon could either be male or female. For example, when a little Pikachu get out of its egg, he has 50% percent of being male (and 50% of being female). Some Pokémons have more chances to be male, some to be female and some have no genders (as I always thought). Take Squirtle for example. According to the dataset, he has 88.1% of chances to be male. Squirtle is therefore a male Pokémon.

This surprise led me to ask a simple question: is there gender equality in the Pokémon universe?

Catch ’Em All

Firstly, let’s get the data and classify the Pokémon according to their more probable gender.

library(tidyverse)

pokemon <- read_csv("pokemon.csv") #data from https://www.kaggle.com/rounakbanik/pokemon

pokemon_gender <- pokemon %>%
  select(percentage_male, generation, name) %>%
  mutate(gender = case_when(percentage_male == 0.0 ~ "Pokémon more likely to be FEMALE",
                            percentage_male == 11.2 ~ "Pokémon more likely to be FEMALE",
                            percentage_male == 24.6 ~ "Pokémon more likely to be FEMALE",
                            percentage_male == 50.0 ~ "Pokémon with equal likelihood of being FEMALE OR MALE",
                            percentage_male == 75.4 ~ "Pokémon more likely to be MALE",
                            percentage_male == 88.1 ~ "Pokémon more likely to be MALE",
                            percentage_male == 100.0 ~ "Pokémon more likely to be MALE"),
         gender = replace_na(gender, "Pokémon with NO GENDER"), #NA is for genderless
         generation = case_when(generation == 1 ~ "from Generation I",
                                generation == 2 ~ "from Generation II",
                                generation == 3 ~ "from Generation III",
                                generation == 4 ~ "from Generation IV",
                                generation == 5 ~ "from Generation V",
                                generation == 6 ~ "from Generation VI",
                                generation == 7 ~ "from Generation VII")) %>%
  count(gender, generation, name) #mutate(n = 1) would also work

pokemon_gender
## # A tibble: 801 x 4
##    gender                           generation        name           n
##    <chr>                            <chr>             <chr>      <int>
##  1 Pokémon more likely to be FEMALE from Generation I Chansey        1
##  2 Pokémon more likely to be FEMALE from Generation I Clefable       1
##  3 Pokémon more likely to be FEMALE from Generation I Clefairy       1
##  4 Pokémon more likely to be FEMALE from Generation I Jigglypuff     1
##  5 Pokémon more likely to be FEMALE from Generation I Jynx           1
##  6 Pokémon more likely to be FEMALE from Generation I Kangaskhan     1
##  7 Pokémon more likely to be FEMALE from Generation I Nidoqueen      1
##  8 Pokémon more likely to be FEMALE from Generation I Nidoran♀       1
##  9 Pokémon more likely to be FEMALE from Generation I Nidorina       1
## 10 Pokémon more likely to be FEMALE from Generation I Ninetales      1
## # ... with 791 more rows

Visualize ’Em all

A simple treemap allows us to visualize our hierarchical dataset.

library(treemap)

pokemon_tm <- treemap(pokemon_gender,
                      index = c("gender", "generation", "name"),
                      vSize = "n",
                      palette = "Pastel1",
                      title = "Pokémon Genders over the Generations")

Looks like gender imbalance to me. The Pokémon more likely to be FEMALE cell is less than half the size of the Pokémon more likely to be MALE cell.

The sunburstR package, a htmlwidget to create d3.js sequence sunbursts, allows us to better explore the Pokémon gender repartition over the generations.

library(sunburstR)
library(d3r)
library(htmlwidgets)

pokemon_tm_nest <- d3_nest(
  pokemon_tm$tm[,c("gender", "generation", "name", "vSize", "color")],
  value_cols = c("vSize", "color")
  )

sb <- sunburst(
  data = pokemon_tm_nest,
  valueField = "vSize",
  legend = list(w = 400),
  legendOrder = c("Pokémon more likely to be FEMALE", 
                  "Pokémon more likely to be MALE", 
                  "Pokémon with equal likelihood of being FEMALE OR MALE",
                  "Pokémon with NO GENDER"),
  count = TRUE,
  sumNodes = FALSE,
  colors = htmlwidgets::JS("function(d){return d3.select(this).datum().data.color;}"),
  withD3 = TRUE)

sb <- htmlwidgets::onRender(sb,
  #ref: https://github.com/timelyportfolio/sunburstR/issues/15
  "function(el,x){
  // have legend as default
    d3.select(el).select('.sunburst-togglelegend').property('checked', true);
    d3.select(el).select('.sunburst-legend').style('visibility', '');
  }"
  )

Some modification have to be made manually inside the HTML output.

## code to copy past into the equivalent html output

// Fade all but the current sequence, and show it in the breadcrumb trail.
  function mouseover(d) {

    var percentage = (100 * d.value / totalSize).toPrecision(2); // precision 2 - lgnbhl mod
    var percentageString = percentage + "%";
    if (percentage < 0.13) { // conditionality added
      percentageString = "";
    }

    var countString = [
        '<span style = "font-size:.7em">',
        d3Format.format("1.2s")(d.value) + ' Pokémon on 801', // on 801 Pokémon
        '</span>'
      ].join('');
    if (percentage < 0.13) { // conditionality added
      countString = d.data.name;
    }

We get a simple but effective interactive visualization, which I published online using flexdashboard. Click on the image to open the interactive page.

Click me

Thanks for reading. For updates of recent blog posts, follow me on Twitter.

Leave a Comment