So with the new season of Game of Thrones dropping as I write this, I wanted to take another peek at any data I could find. This time, I’ll combine trakt.tv, IMDb and Wikipedia.

First up, we’ll load the packages we’ll need:

library(tRakt) # devtools::install_github("jemus42/tRakt")
library(rvest)
library(dplyr)
library(magrittr)
library(stringr)
library(tidyr)
library(ggplot2)
library(tadaatoolbox)

Getting data

Now we’re good to go. And since my tRakt package already handles the trakt.tv data, that step should be the easiest.

got <- trakt.get_all_episodes("game-of-thrones") %>%
  rename(rating_trakt = rating,
         votes_trakt = votes)

Next up the IMDb data. Since there’s no proper API, at least not for episode data, we’ll have to resort to simple webscraping, but thanks to rvest that part is also relatively straight forward.

"http://www.imdb.com/title/tt0944947/epdate" %>%
  read_html() %>%
  html_table() %>%
  extract2(1) %>%
  select(Episode, UserRating, UserVotes) %>%
  transmute(episode_abs = seq_along(Episode),
            rating_imdb = as.numeric(UserRating),
            votes_imdb = as.numeric(str_replace_all(UserVotes, ",", ""))) %>%
  full_join(got, by = c("episode_abs" = "episode_abs")) -> got

No we continue the scraping for the Wikipedia table with viewer numbers, which is probably the most finnecky and likely to break first as this code ages.

"https://en.wikipedia.org/wiki/List_of_Game_of_Thrones_episodes" %>%
  read_html() %>%
  html_table(fill = TRUE) %>%
  magrittr::extract(c(2:7)) %>%
  bind_rows() %>%
  set_colnames(c("episode_abs", "episode", "title", "director",
                 "writer", "firstaired", "viewers")) %>%
  select(-firstaired) %>%
  mutate(viewers = str_replace_all(viewers, "\\[\\d+\\]", ""),
         viewers = as.numeric(viewers)) %>%
  select(-episode, -title) %>%
  full_join(got, by = c("episode_abs" = "episode_abs")) -> got

And now we can start the cleanup and reordering process.

got %<>%
  select(episode_abs, episode, season, epid, runtime, title,
         year, overview, starts_with("rating"), starts_with("votes"), viewers,
         director, writer,
         first_aired, first_aired.string,
         trakt, imdb, tvdb, tmdb) %>%
  arrange(episode_abs)

gameofthrones <- as_tibble(got) %>% 
  filter(!is.na(episode), as.numeric(season) < 7)
rm(got)

Looking at Data

And now we’re finally good to go as far as analysis goes.
We’ll start with a neat little overview:

theme_set(theme_tadaa() + theme(legend.position = "top"))

gameofthrones %>%
  gather(source, rating, starts_with("rating")) %>% 
  ggplot(data = ., aes(x = episode_abs, y = rating, fill = source)) +
  geom_point(shape = 21, color = "black", size = 3) +
  facet_wrap(~season, nrow = 1, scales = "free_x", labeller = label_both) +
  scale_x_continuous(breaks = seq(0, 70, 10), minor_breaks = seq(0, 70, 5)) +
  scale_y_continuous(breaks = seq(0, 10, .5), minor_breaks = seq(0, 10, .1)) +
  scale_fill_manual(values = c("yellow", "red"), labels = c("IMDb", "trakt.tv")) +
  labs(title = "Game of Thrones User Ratings",
       subtitle = "Ratings on IMDb and trakt.tv",
       caption = paste0("Data as of ", lubridate::today()),
       x = "Absolute Episode #",
       y = "Rating (1-10)",
       fill = "Source")
ratings
ratings

Or with a focus on individual seasons, calculating averages and 95% confidence intervals:

gameofthrones %>%
  gather(source, rating, starts_with("rating")) %>% 
  ggplot(data = ., aes(x = season, y = rating, color = source, fill = source)) +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", 
               width = .265, size = 2.5, color = "black") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", 
               width = .25, size = 2) +
  stat_summary(fun.y = mean, geom = "point", size = 3.5, 
               shape = 21, color = "black") +
  scale_x_discrete(breaks = 1:7) +
  scale_y_continuous(breaks = seq(0, 10, .5), minor_breaks = seq(0, 10, .1)) +
  scale_color_manual(values = c("yellow", "red"), labels = c("IMDb", "trakt.tv")) +
  scale_fill_manual(values = c("yellow", "red"), labels = c("IMDb", "trakt.tv"), guide = F) +
  labs(title = "Game of Thrones User Ratings",
       subtitle = "Season Average Ratings on IMDb and trakt.tv",
       caption = paste0("Data as of ", lubridate::today()),
       x = "Season",
       y = "Mean Rating (1-10) + 95% CI",
       color = "Source")
ratings_seasons
ratings_seasons

So what we see here is that the IMDb ratings are consistently higher than the trakt.tv ratings. Make of that what you will, but the difference is there.

Next up we’ll take a look at which episode per season is the most popular, since we probably all remember that usually episode 9 will have something big happening.
Since the last two seasons won’t have the full 10 episodes, they’ll break the mold of the first 60 episodes all being organized in neat 10 episode chunks, but oh well.

gameofthrones %>%
  gather(source, rating, starts_with("rating")) %>% 
  ggplot(data = ., aes(x = episode, y = rating, color = source, fill = source)) +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", 
               width = .265, size = 2.5, color = "black") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", 
               width = .25, size = 2) +
  stat_summary(fun.y = mean, geom = "point", size = 3.5, 
               shape = 21, color = "black") +
  scale_x_continuous(breaks = 1:10) +
  scale_y_continuous(breaks = seq(0, 10, .5), minor_breaks = seq(0, 10, .1)) +
  scale_color_manual(values = c("yellow", "red"), labels = c("IMDb", "trakt.tv")) +
  scale_fill_manual(values = c("yellow", "red"), labels = c("IMDb", "trakt.tv"), guide = F) +
  labs(title = "Game of Thrones User Ratings",
       subtitle = "Episode Average Ratings on IMDb and trakt.tv",
       caption = paste0("Data as of ", lubridate::today()),
       x = "Episode Within Season",
       y = "Mean Rating (1-10) + 95% CI",
       color = "Source")
ratings_ep_perseason
ratings_ep_perseason

So that’s to be expected.
Next up let’s look at the writers. Here we have the trakt.tv episode ratings for each writing credit as listed by Wikipedia (with minor cleanup to make the labels narrow enough), and we use boxplots to visualize variability as well as an overlay of points to denote the individual ratings, jittered to reduce overlay.

gameofthrones %>%
  mutate(writer = ifelse(str_detect(writer, "Teleplay"), 
                         "David Benioff & D. B. Weiss", writer)) %>%
  group_by(writer) %>%
  mutate(n = n()) %>%
  ggplot(data = ., aes(x = reorder(writer, n), y = rating_trakt)) +
  geom_boxplot(alpha = .7, fill = "red") +
  geom_point(position = "jitter", alpha = .6) +
  coord_flip() +
  scale_y_continuous(breaks = seq(0, 10, .5), minor_breaks = seq(0, 10, .1)) +
  labs(title = "Game of Thrones User Ratings",
       subtitle = "Episode Average Ratings per Writer on trakt.tv",
       caption = paste0("Data as of ", lubridate::today()),
       x = "Writer",
       y = "Episode Rating (1-10) + Boxplot",
       color = "Source")
writers
writers

And now we can do the exact same thing for directors:

gameofthrones %>%
  group_by(director) %>%
  mutate(n = n()) %>%
  ggplot(data = ., aes(x = reorder(director, n), y = rating_trakt)) +
  geom_boxplot(alpha = .7, fill = "red") +
  geom_point(position = "jitter", alpha = .6) +
  coord_flip() +
  scale_y_continuous(breaks = seq(0, 10, .5), minor_breaks = seq(0, 10, .1)) +
  labs(title = "Game of Thrones User Ratings",
       subtitle = "Episode Average Ratings per Director on trakt.tv",
       caption = paste0("Data as of ", lubridate::today()),
       x = "Director",
       y = "Episode Rating (1-10) + Boxplot",
       color = "Source")
directors
directors

And now I ran out of ideas, so yeah, have fun with that.