Cooking is sometimes used as a metaphor for data preparation in machine learning. To practice my skills in machine learning, I decided to look for a dataset related to cooking. I found a Kaggle dataset where the goal of the competition is to predict the category of a dish’s cuisine given a list of its ingredients.
Let’s begin by reading and joining the datasets downloaded from Kaggle.
Show code
library(tidyverse)
library(jsonlite)

train <- fromJSON("input/train.json", flatten = TRUE) %>%
  mutate(cuisine = as.factor(cuisine))

test <- fromJSON("input/test.json", flatten = TRUE) %>%
  mutate(cuisine = NA, # add the cuisine variable
         cuisine = factor(cuisine, levels = levels(train$cuisine))) # with the levels of train

train_unnested <- train %>%
  tidyr::unnest(ingredients) %>%
  mutate(ingredients = str_replace_all(ingredients, "-", "_"), # turn hyphens into underscores
         ingredients = str_remove_all(ingredients, "[^A-Za-z0-9_ ]"), # keep only letters, digits, underscores and spaces
         type = "train")

test_unnested <- test %>%
  tidyr::unnest(ingredients) %>%
  mutate(ingredients = str_replace_all(ingredients, "-", "_"), # turn hyphens into underscores
         ingredients = str_remove_all(ingredients, "[^A-Za-z0-9_ ]"), # keep only letters, digits, underscores and spaces
         type = "test")

dataset <- full_join(train_unnested, test_unnested) %>%
  as_tibble()

dataset %>% arrange(id)
## # A tibble: 535,670 x 4
## id cuisine ingredients type
## <int> <fct> <chr> <chr>
## 1 0 spanish mussels train
## 2 0 spanish ground black pepper train
## 3 0 spanish garlic cloves train
## 4 0 spanish saffron threads train
## 5 0 spanish olive oil train
## 6 0 spanish stewed tomatoes train
## 7 0 spanish arborio rice train
## 8 0 spanish minced onion train
## 9 0 spanish medium shrimp train
## 10 0 spanish fat free less sodium chicken broth train
## # ... with 535,660 more rows
Now let’s create some basic visualizations to better understand our dataset.
How are the recipes of our train dataset distributed across the cuisines?
Show code
dataset %>%
  filter(type == "train") %>%
  group_by(cuisine) %>%
  summarize(n = n()) %>%
  ggplot(aes(x = cuisine, y = n)) +
  geom_col() +
  coord_flip() +
  ggthemes::theme_economist_white(horizontal = FALSE) +
  theme(plot.background = element_rect(fill = "#f8f2e4")) +
  labs(x = "", y = "Number of Recipes",
       title = "Twenty cuisines",
       subtitle = "What's cooking?",
       caption = "Félix Luginbühl (@lgnbhl)\nData source: Kaggle, Yummly")
Italian and Mexican recipes are prevalent. We also notice a class imbalance, which should be taken into account when training the model.
Let’s explore the distribution of the number of ingredients per recipe for each cuisine in the train dataset.
Show code
dataset %>%
  filter(type == "train") %>%
  group_by(id, cuisine) %>%
  summarize(n = n()) %>%
  group_by(cuisine) %>%
  ggplot(aes(x = cuisine, y = n)) +
  geom_boxplot() +
  coord_flip() +
  ggthemes::theme_economist_white(horizontal = FALSE) +
  theme(plot.background = element_rect(fill = "#f8f2e4")) +
  labs(x = "", y = "Number of ingredients per recipe",
       title = "Twenty cuisines",
       subtitle = "What's cooking?",
       caption = "Félix Luginbühl (@lgnbhl)\nData source: Kaggle, Yummly")
We can see substantial outliers in every cuisine.
Another way of comparing the variation of the number of ingredients per recipe by cuisine is to calculate the coefficient of variation (CV)¹ of each cuisine.
Show code
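The chunk above is folded; a sketch along these lines should reproduce the table below (the exact code used in the post may differ):

dataset %>%
  filter(type == "train") %>%
  count(id, cuisine) %>%                    # number of ingredients per recipe
  group_by(cuisine) %>%
  summarize(mean = mean(n),
            cv = sd(n) / mean(n) * 100) %>% # coefficient of variation, in %
  arrange(desc(cv))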
## # A tibble: 20 x 3
## cuisine mean cv
## <fct> <dbl> <dbl>
## 1 brazilian 9.52 58.4
## 2 japanese 9.74 43.6
## 3 british 9.71 42.9
## 4 mexican 10.9 42.8
## 5 french 9.82 42.2
## 6 vietnamese 12.7 41.5
## 7 southern_us 9.63 40.2
## 8 spanish 10.4 39.9
## 9 irish 9.30 39.8
## 10 russian 10.2 39.6
## 11 indian 12.7 39.5
## 12 jamaican 12.2 39.0
## 13 filipino 10 38.6
## 14 italian 9.91 38.4
## 15 moroccan 12.9 37.2
## 16 greek 10.2 36.6
## 17 cajun_creole 12.6 36.6
## 18 thai 12.5 35.2
## 19 korean 11.3 34.4
## 20 chinese 12.0 33.7
We see that Brazilian cuisine has much more variation relative to its mean (CV of 58.4%) than Chinese cuisine (CV of 33.7%).
How many unique ingredients do we have in the dataset?
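The counting code isn’t shown; something along these lines would produce the output below (a sketch, not the original chunk):

dataset %>%
  summarize(n = n_distinct(ingredients)) # number of distinct ingredient strings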
## # A tibble: 1 x 1
## n
## <int>
## 1 7135
That’s a lot of ingredients. We will have to do some feature reduction.
What are the top 100 ingredients of the full dataset?
Show code
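The chunk is folded; a simple count along these lines would give the underlying table (the original presumably also visualizes it):

dataset %>%
  count(ingredients, sort = TRUE) %>% # most frequent ingredients first
  head(100)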
What about the top 10 ingredients by cuisine?
Show code
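Again the chunk is folded; one possible way to build such a table:

dataset %>%
  filter(type == "train") %>%
  count(cuisine, ingredients, sort = TRUE) %>%
  group_by(cuisine) %>%
  top_n(10, n) %>%        # ten most frequent ingredients per cuisine
  arrange(cuisine, desc(n))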
This table will guide us, in the feature engineering step, when counting the number of ingredients “typical” of different cultural regions.
First, we want to tidy our dataset. We will use the unnest_tokens function from {tidytext} before extracting the word stems with {SnowballC}.
Show code
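The folded chunk presumably looks something like this (dataset_tidy is the name reused later in the post; the exact call may differ):

library(tidytext)
library(SnowballC)

dataset_tidy <- dataset %>%
  unnest_tokens(word, ingredients, drop = FALSE) %>% # one row per ingredient word
  mutate(wordStem = wordStem(word))                  # Porter stemming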
## # A tibble: 994,837 x 6
## id cuisine ingredients type word wordStem
## <int> <fct> <chr> <chr> <chr> <chr>
## 1 0 spanish mussels train mussels mussel
## 2 0 spanish ground black pepper train ground ground
## 3 0 spanish ground black pepper train black black
## 4 0 spanish ground black pepper train pepper pepper
## 5 0 spanish garlic cloves train garlic garlic
## 6 0 spanish garlic cloves train cloves clove
## 7 0 spanish saffron threads train saffron saffron
## 8 0 spanish saffron threads train threads thread
## 9 0 spanish olive oil train olive oliv
## 10 0 spanish olive oil train oil oil
## # ... with 994,827 more rows
As you can see, we now have the ingredient words in the word column and the word stems in wordStem. Note that “soy” became “soi” with the stemming.
We will now build a document-term matrix with the cast_dtm function from {tidytext}. We will then perform feature reduction by removing the least frequent words.
Show code
library(tm)

dataset_dtm <- dataset_tidy %>%
  count(id, wordStem, sort = TRUE) %>%
  cast_dtm(id, wordStem, n, weighting = tm::weightTf) # tm::weightTfIdf also possible

dataset_dtm <- dataset_dtm %>%
  tm::removeSparseTerms(sparse = 0.999) %>%
  as.matrix() %>%
  as.data.frame() %>%
  rownames_to_column("id") %>%
  mutate(id = as.numeric(id)) %>%
  inner_join(dataset_tidy %>% select(id, type, cuisine), by = "id") %>%
  mutate(cuisine = as.factor(cuisine)) %>%
  distinct() %>%
  select(cuisine, id, type, everything()) %>%
  as_tibble()

dataset_dtm %>% arrange(id)
## # A tibble: 49,718 x 803
## cuisine id type ground pepper fresh seed chicken dri sauc
## <fct> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 spanish 0. train 1. 1. 0. 0. 1. 0. 0.
## 2 mexican 1. train 3. 4. 1. 0. 0. 0. 0.
## 3 french 2. train 0. 1. 0. 0. 1. 0. 0.
## 4 chinese 3. train 0. 1. 1. 0. 0. 0. 1.
## 5 italian 4. train 0. 0. 0. 0. 0. 0. 0.
## 6 <NA> 5. test 1. 0. 0. 0. 0. 1. 1.
## 7 chinese 6. train 0. 0. 0. 0. 0. 0. 1.
## 8 <NA> 7. test 4. 1. 1. 0. 1. 0. 0.
## 9 french 8. train 0. 0. 0. 0. 0. 0. 0.
## 10 southern_us 9. train 0. 1. 0. 0. 0. 2. 0.
## # ... with 49,708 more rows, and 793 more variables
Nice! We have a dataset of 803 variables.
Now some quick feature engineering. We will add an nIng variable, which gives the number of ingredients per recipe. We will also create three variables that count the number of ingredients “typical” of three large cultural regions: Asian, Southern and Northern. The idea comes from this other Kaggle kernel.
Show code
sumIng <- dataset_tidy %>%
  group_by(id) %>%
  count(id) %>%
  rename(nIng = n)

dataset_dtm <- dataset_dtm %>%
  full_join(sumIng, by = "id")

dataset_dtm %>%
  select(cuisine, id, nIng, type, everything()) %>%
  arrange(id)

# ref: https://www.kaggle.com/alonalevy/cultural-diffusion-by-recipes
north_cuisine <- c("british", "irish", "southern_us", "russian", "french")
south_cuisine <- c("brazilian", "cajun_creole", "greek", "indian", "italian", "jamaican", "mexican", "moroccan", "spanish")
asian_cuisine <- c("filipino")

ingredients_north <- dataset_tidy %>%
  select(id, cuisine, wordStem) %>%
  filter(cuisine %in% north_cuisine)

ingredients_south <- dataset_tidy %>%
  select(id, cuisine, wordStem) %>%
  filter(cuisine %in% south_cuisine)

ingredients_asian <- dataset_tidy %>%
  select(id, cuisine, wordStem) %>%
  filter(cuisine %in% asian_cuisine)

# word stems that appear in only one of the three regions
common_north <- ingredients_north %>%
  anti_join(ingredients_south, by = "wordStem") %>%
  anti_join(ingredients_asian, by = "wordStem")

common_south <- ingredients_south %>%
  anti_join(ingredients_north, by = "wordStem") %>%
  anti_join(ingredients_asian, by = "wordStem")

common_asian <- ingredients_asian %>%
  anti_join(ingredients_north, by = "wordStem") %>%
  anti_join(ingredients_south, by = "wordStem")

# Now let's flag the occurrences of these regional ingredients in each recipe.
dataset_tidy2 <- dataset_tidy %>%
  mutate(north_ing = ifelse(wordStem %in% common_north$wordStem, 1, 0),
         south_ing = ifelse(wordStem %in% common_south$wordStem, 1, 0),
         asian_ing = ifelse(wordStem %in% common_asian$wordStem, 1, 0)) %>%
  select(cuisine, north_ing, south_ing, asian_ing, everything())

# count the regional word stems by id
nAsian_ingredients <- dataset_tidy2 %>%
  select(id, asian_ing) %>%
  filter(asian_ing == 1) %>%
  count(id) %>%
  rename(nAsian_ing = n)

nSouth_ingredients <- dataset_tidy2 %>%
  select(id, south_ing) %>%
  filter(south_ing == 1) %>%
  count(id) %>%
  rename(nSouth_ing = n)

nNorth_ingredients <- dataset_tidy2 %>%
  select(id, north_ing) %>%
  filter(north_ing == 1) %>%
  count(id) %>%
  rename(nNorth_ing = n)

# We can now join these three variables to our main dataset.
dataset_final <- dataset_dtm %>%
  full_join(nAsian_ingredients, by = "id") %>%
  full_join(nSouth_ingredients, by = "id") %>%
  full_join(nNorth_ingredients, by = "id") %>%
  # replace NA with 0 in the new variables
  mutate_at(vars(c(nAsian_ing, nNorth_ing, nSouth_ing)), funs(replace(., is.na(.), 0)))

write_csv(dataset_final, "output/dataset_final.csv") # save the dataset

dataset_final %>%
  select(cuisine, id, nIng, nAsian_ing, nNorth_ing, nSouth_ing, everything()) %>%
  arrange(id)
## # A tibble: 49,718 x 807
## cuisine id nIng nAsian_ing nNorth_ing nSouth_ing type ground
## <fct> <dbl> <int> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 spanish 0. 25 0. 0. 0. train 1.
## 2 mexican 1. 36 0. 0. 0. train 3.
## 3 french 2. 26 0. 1. 0. train 0.
## 4 chinese 3. 19 0. 0. 0. train 0.
## 5 italian 4. 8 0. 0. 0. train 0.
## 6 <NA> 5. 17 0. 0. 0. test 1.
## 7 chinese 6. 9 0. 0. 0. train 0.
## 8 <NA> 7. 31 0. 0. 1. test 4.
## 9 french 8. 16 0. 0. 0. train 0.
## 10 southern_us 9. 19 0. 0. 0. train 0.
## # ... with 49,708 more rows, and 799 more variables
Let’s begin by building a simple classification tree. Since the goal is to work in a tidy way, I tried to find an equivalent of {rpart} in the tidyverse ecosystem. Sadly, {broom} doesn’t work with classification trees, and the {broomstick} package, which aims to support them, is still in development. So we will do this the traditional way.
Show code
# splitting the `train` dataset using {rsample}
library(rsample)

set.seed(1234)
train_test_split <- dataset_final %>%
  filter(type == "train") %>%
  initial_split(prop = 0.8)

# retrieve the train and test sets
train_tbl <- training(train_test_split) %>% select(-type)
test_tbl <- testing(train_test_split) %>% select(-type)

library(rpart)
library(rpart.plot)

set.seed(1234)
cartModel <- rpart(cuisine ~ ., data = train_tbl, method = "class")
prp(cartModel)
The plot reveals the most important ingredients of the model.
The accuracy of our classification model is “only” 42%.
Show code
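The folded chunk likely predicts on the held-out set and builds a confusion matrix with {caret}; roughly as follows (cartPredict is just an illustrative name):

library(caret)

cartPredict <- predict(cartModel, newdata = test_tbl, type = "class")
confusionMatrix(cartPredict, test_tbl$cuisine)$overall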
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.4181544 0.3280825 0.4072832 0.4290856 0.2000251
## AccuracyPValue McnemarPValue
## 0.0000000 NaN
We will try again, but this time with one of the algorithms that dominate Kaggle competitions: XGBoost. XGBoost implements gradient boosted decision trees. It is very fast and handles overfitting well.
Show code
library(xgboost)

set.seed(1234)
train_matrix <- xgb.DMatrix(as.matrix(select(train_tbl, -cuisine)),
                            label = as.numeric(train_tbl$cuisine) - 1) # class labels must start at 0

# train a multiclass classification model using softmax,
# with default parameters except a maximum depth of 25
xgbModel <- xgboost(train_matrix,
                    max.depth = 25,
                    nround = 10,
                    objective = "multi:softmax",
                    num_class = 20)
## [1] train-merror:0.221873
## [2] train-merror:0.178064
## [3] train-merror:0.150189
## [4] train-merror:0.127278
## [5] train-merror:0.108580
## [6] train-merror:0.093212
## [7] train-merror:0.078661
## [8] train-merror:0.067410
## [9] train-merror:0.057668
## [10] train-merror:0.049151
xgbPredict <- predict(xgbModel, newdata = as.matrix(select(test_tbl, -cuisine)))

# change cuisine back to string
xgbPredictText <- levels(test_tbl$cuisine)[xgbPredict + 1]

# for a tidy confusion matrix, the `conf_mat` function from {yardstick} could also be used
confMat <- confusionMatrix(as.factor(xgbPredictText), test_tbl$cuisine)
confMat$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.7607493 0.7318283 0.7512162 0.7700910 0.2000251
## AccuracyPValue McnemarPValue
## 0.0000000 NaN
We can visualize the confusion matrix of the model using {ggplot2}.
Show code
# ref: https://stackoverflow.com/questions/37897252
confMat$table %>%
  as.data.frame() %>%
  ggplot(aes(x = Prediction, y = Reference)) +
  geom_tile(aes(fill = Freq)) +
  geom_text(aes(label = sprintf("%1.0f", Freq)), vjust = 1) +
  scale_fill_gradient(low = "blue", high = "red", trans = "log") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 16),
        plot.caption = element_text(colour = "dimgrey"),
        axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
        plot.background = element_rect(fill = "#f8f2e4")) +
  labs(title = "Confusion matrix",
       subtitle = "What's cooking?",
       fill = "Freq (log)",
       caption = "Félix Luginbühl (@lgnbhl)\nData source: Kaggle, Yummly")
The confusion matrix reveals the class imbalance. What about the sensitivity (true positive rate) results by class?
Show code
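The folded chunk probably extracts the by-class statistics from the caret confusion matrix, along these lines:

confMat$byClass %>%
  as.data.frame() %>%
  rownames_to_column("cuisine") %>%
  select(cuisine, Sensitivity, Specificity) %>%
  arrange(Sensitivity) # least well identified cuisines first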
## cuisine Sensitivity Specificity
## 1 Class: british 0.3221477 0.9932095
## 2 Class: russian 0.3673469 0.9978360
## 3 Class: spanish 0.3846154 0.9938136
## 4 Class: irish 0.4472050 0.9947389
## 5 Class: filipino 0.4936709 0.9955105
## 6 Class: french 0.5615079 0.9707383
## 7 Class: brazilian 0.5773196 0.9973272
## 8 Class: vietnamese 0.5785714 0.9930893
## 9 Class: japanese 0.6134185 0.9938490
## 10 Class: greek 0.6923077 0.9940515
## 11 Class: jamaican 0.7019231 0.9978344
## 12 Class: cajun_creole 0.7046980 0.9912487
## 13 Class: moroccan 0.7218543 0.9956427
## 14 Class: korean 0.7245509 0.9964043
## 15 Class: thai 0.7370130 0.9905833
## 16 Class: southern_us 0.7784091 0.9549053
## 17 Class: chinese 0.8241563 0.9806521
## 18 Class: italian 0.8818353 0.9410655
## 19 Class: indian 0.9037801 0.9875203
## 20 Class: mexican 0.9097331 0.9736527
The sensitivity metric shows that British cuisine is correctly identified as such only about one third of the time (32%), followed by Russian and Spanish cuisines.
Finally, let’s look at the most important features of the model.
Show code
names <- colnames(select(train_tbl, -cuisine))
importance_matrix <- xgb.importance(names, model = xgbModel)

xgb.ggplot.importance(importance_matrix[1:30, ]) +
  theme_bw() +
  theme(legend.background = element_blank(),
        legend.key = element_blank(),
        plot.title = element_text(size = 14, face = "bold"),
        plot.background = element_rect(fill = "#f8f2e4")) +
  labs(title = "Feature importance",
       subtitle = "What's cooking?",
       caption = "Félix Luginbühl (@lgnbhl)\nData source: Kaggle, Yummly")
The most important feature for predicting the cuisine is the word tortilla, followed by soi (i.e. the stem of soy), parmesan and the number of ingredients in the recipe.
It would be frustrating to stop here. Let’s submit our work on Kaggle to know how well our model “really” performed.
To submit our predictions on Kaggle, let’s use the same code as before but with the full training dataset. We will run xgboost, this time with 120 rounds. It will take some time! Let’s calculate how much.
Show code
training <- dataset_final %>%
  filter(type == "train") %>%
  select(-type)

testing <- dataset_final %>%
  filter(type == "test") %>%
  select(-type)

testing <- testing %>%
  mutate(cuisine = "NA") %>%
  mutate(cuisine = factor(cuisine, levels = levels(training$cuisine))) # add the levels of training

library(xgboost)

set.seed(1234)
train_matrix <- xgb.DMatrix(as.matrix(select(training, -cuisine)),
                            label = as.numeric(training$cuisine) - 1)

start_time <- Sys.time()
xgbModel2 <- xgboost(train_matrix,
                     max.depth = 25,
                     nround = 120,
                     objective = "multi:softmax",
                     num_class = 20)
end_time <- Sys.time()

time_taken <- end_time - start_time
time_sec <- difftime(end_time, start_time, units = "secs")
print(time_taken)
## Time difference of 4.359697 hours
To build the xgboost model, my computer worked for more than 4 hours!
We can create and submit the CSV file on Kaggle.
Show code
# predict on the Kaggle test set and change cuisine back to string
xgb.submit <- predict(xgbModel2, newdata = as.matrix(select(testing, -cuisine)))
xgb.submit.text <- levels(testing$cuisine)[xgb.submit + 1]

# submission
sample <- read.csv("input/sample_submission.csv")
predictions <- data_frame(id = as.integer(testing$id), cuisine = as.factor(xgb.submit.text))

sample %>%
  select(id) %>%
  inner_join(predictions) %>%
  write_csv("xgboost_submission.csv")
We get an accuracy score of 0.7871. Nice!
Could we get a better score on Kaggle using h2o’s new automatic machine learning function?
Let’s split the training dataset again and run h2o.automl for the same amount of time as xgboost.
Show code
library(h2o)
h2o.init() # max_mem_size = "8g"

training_h2o <- as.h2o(training)
split_h2o <- h2o.splitFrame(training_h2o, 0.8, seed = 1234)
train_h2o <- h2o.assign(split_h2o[[1]], "train") # 80%
valid_h2o <- h2o.assign(split_h2o[[2]], "valid") # 20%

target <- "cuisine"
predictors <- setdiff(names(train_h2o), target)

start_time_h2o <- Sys.time()
aml <- h2o.automl(x = predictors,
                  y = target,
                  training_frame = train_h2o,
                  leaderboard_frame = valid_h2o,
                  balance_classes = TRUE,
                  max_runtime_secs = as.numeric(time_sec),
                  seed = 1234)
end_time_h2o <- Sys.time()

time_taken_h2o <- end_time_h2o - start_time_h2o
print(time_taken_h2o)
## Time difference of 8.775196 hours
Oops!
Our function ran for twice as long as xgboost (almost 9 hours). I expected max_runtime_secs to cap the total running time of the function, but it doesn’t: the computation time actually doubled. Good to know for next time!
Finally, let’s extract the leader model, predict the cuisine on the testing dataset and create the CSV file for the Kaggle submission. How does our Stacked Ensemble model score on Kaggle?
Show code
# Extract leader model
automl_leader <- aml@leader

# Predict on test set
testing_h2o <- as.h2o(testing)
pred_conversion <- h2o.predict(object = automl_leader, newdata = testing_h2o)
pred_conversion <- as.data.frame(pred_conversion)

predictions <- testing %>%
  mutate(pred = as.character(pred_conversion[[1]])) %>% # the first column holds the predicted class
  mutate(pred = as.factor(pred)) %>%
  select(id, pred)

sample %>%
  select(id) %>%
  inner_join(predictions) %>%
  rename(cuisine = pred) %>%
  mutate(cuisine = as.factor(cuisine)) %>%
  mutate(id = as.integer(id)) %>%
  write_csv("h2o_submission.csv")
The accuracy score of our model is 0.79314, which is better than our previous xgboost model (0.7871). However, it is hard to compare the two algorithms’ performance, as h2o’s AutoML ran for twice as long as xgboost on my computer.
¹ The sample standard deviation alone should not be used for this comparison, as the mean number of ingredients per recipe varies according to the cuisine.