EDIT 2017-08-14: I’ve fixed all the code so it works and is up to date. Side effects include the text not necessarily matching the data. Whoopsie.

Overanalyzing tv shows has kind of become my jam. So why not totally overdo it.
Note that everything I describe in this blogpost is purely for the lulz, and I don’t pretend there’s any scientific merit to it. I just like throwing maths at data.

After I more or less successfully plotted all the things, I wanted to go full-blown statistics-y on the subject. While my knowledge of statistics isn’t nearly as extensive as I’d like it to be, I at least know a little about comparing groups. So, that’s pretty much the plan.

Methodology(ish)

The basic idea is simple: I pull the episode data from trakt.tv and receive a nice data.frame with episode numbers, season numbers, ratings (0–10) and much more. Assuming that seasons are a decent way of dividing a tv show into groups, I’ll be using them exactly as such. The variable we’re interested in comparing is the rating, provided by the trakt.tv userbase.
Using that information we can perform an analysis of variance (ANOVA) with the aov() function. If it yields any significant results, we then perform the TukeyHSD test for post-hoc analysis to find the season comparisons with the significant results (i.e. lowest p-values). And yes, I know the p-value itself is at least a little controversial, and the whole plan might sound like complete rubbish to you, but please keep in mind that I’m only doing this for fun.
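In plain base R, that pipeline looks roughly like this — sketched on made-up data (the `toy` data.frame and its numbers are invented for illustration, not real trakt.tv ratings):

```r
# Sketch of the test pipeline on invented data, not real trakt.tv ratings
set.seed(42)
toy <- data.frame(
  season = factor(rep(1:3, each = 10)),
  rating = c(rnorm(10, 8.0, 0.2), rnorm(10, 8.5, 0.2), rnorm(10, 7.8, 0.2))
)

fit <- aov(rating ~ season, data = toy)  # one-way ANOVA
summary(fit)                             # F statistic and p-value

# Only worth doing if the ANOVA came back significant:
TukeyHSD(fit)                            # pairwise season comparisons
```

The tadaa_aov() calls further down wrap roughly this, plus effect sizes and prettier printing.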

Acquiring the Data

This step is pretty simple thanks to the work I did on my tRakt package, which allows us to pull the data we want in a few simple steps. I’ll also start by loading some packages I intend to use throughout this blogpost. The idea is that you can aggregate all the code in this post and thereby reliably reproduce my results. Because that’s how I like to science.

library(tRakt)   # To get the data. Get it via devtools::install_github("jemus42/tRakt")
library(dplyr)   # Because convenience
library(ggplot2) # In case of plot
library(grid)    # For some ggplot2 theme specs
library(scales)  # Primarily for pretty_breaks() in plots
library(broom)   # To pretty up output of tests & models
library(knitr)   # For simple tables via knitr::kable
library(tadaatoolbox) # For prettier test output

if (is.null(getOption('trakt.client.id'))){
  get_trakt_credentials(client.id = "12fc1de7671c7f2fb4a8ac08ba7c9f45b447f4d5bad5e11e3490823d629afdf2")
}

theme_set(theme_tadaa())

Now we’re all set. At this point, we can pull the episode data via trakt.get_all_episodes(slug, season_nums), so if we were to look at Game of Thrones (and we will), we’d call trakt.get_all_episodes("game-of-thrones", 1:4).

So, why not get started on that.

Analyzing all the things

Game of Thrones.

got <- trakt.get_all_episodes("game-of-thrones", 1:6)

# Glance at the data
got %>% select(season, episode, rating) %>% head %>% kable
| season | episode | rating  |
|-------:|--------:|--------:|
| 1      | 1       | 8.08501 |
| 1      | 2       | 8.09403 |
| 1      | 3       | 8.03229 |
| 1      | 4       | 8.06916 |
| 1      | 5       | 8.09534 |
| 1      | 6       | 8.28193 |

Now that we have the data, we can pump it through aov() and look at the broom::tidy’d output.
Note that I pipe the output through knitr::kable, which converts it to a markdown table suitable for embedding in this blogpost.

tadaa_aov(rating ~ season, data = got, print = "markdown") 

Table 1: One-Way ANOVA: Using Type III Sum of Squares

| Term      | df | SS   | MS   | F    | p     | \(\eta^2\) | Cohen’s f | Power |
|-----------|---:|-----:|-----:|-----:|------:|-----------:|----------:|------:|
| season    | 5  | 0.68 | 0.14 | 2.32 | 0.056 | 0.18       | 0.46      | 0.75  |
| Residuals | 54 | 3.18 | 0.06 |      |       |            |           |       |
| Total     | 59 | 3.86 | 0.2  |      |       |            |           |       |



Well, that’s anticlimactic. Anyway, I assume it’s safe to say that Game of Thrones is pretty consistent with its ratings across all seasons. I hope it can keep this up for the following one to five seasons.

A neat way to look at the distributions across multiple groups is via density plots. If you’re unfamiliar with them, just imagine a standard histogram (values on the x axis, frequencies on the y axis), but with interpolation/smoothing to produce curves. I know that’s probably not the rigorous way to think of it, but it should give you an idea of what’s up.

ggplot(data = got, aes(x = rating, fill = season)) +
  geom_density(alpha = .6) +
  scale_x_continuous(breaks = scales::pretty_breaks()) +
  labs(title = "Game of Thrones Episode Ratings per Season",
       y = "Density", x = "Rating", fill = "Season")

What I like about this plot is how season 2 seems so neatly grouped around ~8.6 with a bump near the top of the scale, which I assume to be s02e09/s02e10. Season 4, on the other hand, is pretty spread out, meaning there’s a broader range of ratings, or at least a more even spread of them.

The Simpsons

simpsons <- trakt.get_all_episodes("the-simpsons")

tadaa_aov(rating ~ season, data = simpsons, print = "markdown") 

Table 2: One-Way ANOVA: Using Type III Sum of Squares

| Term      | df  | SS    | MS   | F     | p       | \(\eta^2\) | Cohen’s f | Power |
|-----------|----:|------:|-----:|------:|--------:|-----------:|----------:|------:|
| season    | 28  | 18.52 | 0.66 | 25.84 | < 0.001 | 0.54       | 1.09      | 1     |
| Residuals | 610 | 15.61 | 0.03 |       |         |            |           |       |
| Total     | 638 | 34.13 | 0.69 |       |         |            |           |       |



Oh, look, a significant value. Nice. Let’s throw TukeyHSD at it to see what sticks.

aov(rating ~ season, data = simpsons) %>%
  TukeyHSD() %>%
  tidy() %>%
  filter(adj.p.value < 0.05) %>%
  nrow()
## [1] 194

Welp, that’s quite a bunch of significant values. Let’s look at everything with p < 0.00000001, because I tried normal significance levels and there were still too many hits.

aov(rating ~ season, data = simpsons) %>%
  TukeyHSD() %>%
  tidy() %>%
  filter(adj.p.value < 0.00000001) %>%
  tadaa_plot_tukey()

I think the result is quite convincing. Season ~~19~~ 28 is, statistically speaking, the worst of The Simpsons.

I don’t think a density plot would make much sense here, since there are so many seasons, so I’ll just give you a scatterplot.

ggplot(data = simpsons, aes(x = epnum, y = rating, colour = season)) +
  geom_point(size = 4, colour = "black") +
  geom_point(size = 3) +
  scale_x_continuous(breaks = scales::pretty_breaks()) +
  labs(title = "The Simpsons Episode Ratings per Season",
       y = "Rating", x = "Episode (absolute)", colour = "Season") +
  theme(legend.position   = "top") +
  theme(legend.key.size   = unit(.3, "cm")) +
  theme(legend.text  = element_text(size = 8))

If you want to take a closer look, try looking it up on tRakt-shiny.

Star Trek: Enterprise

As per request, let’s see how that one’s doing.

ste  <- trakt.get_all_episodes("star-trek-enterprise")

tadaa_aov(rating ~ season, data = ste, print = "markdown") 

Table 3: One-Way ANOVA: Using Type III Sum of Squares

| Term      | df | SS   | MS   | F    | p      | \(\eta^2\) | Cohen’s f | Power |
|-----------|---:|-----:|-----:|-----:|-------:|-----------:|----------:|------:|
| season    | 3  | 0.53 | 0.18 | 4.74 | < 0.01 | 0.13       | 0.39      | 0.9   |
| Residuals | 94 | 3.48 | 0.04 |      |        |            |           |       |
| Total     | 97 | 4.01 | 0.21 |      |        |            |           |       |



Once again a significant result. From what I’ve been told, s04 is said to be better than the other seasons, so I guess that’s what we’ll be looking for here.

aov(rating ~ season, data = ste) %>%
  TukeyHSD() %>% 
  tidy() %>%
  tadaa_plot_tukey()

Now let’s look at the ratings in density-plot form.

ggplot(data = ste, aes(x = rating, fill = season)) +
  geom_density(alpha = .6) +
  scale_x_continuous(breaks = scales::pretty_breaks()) +
  labs(title = "Star Trek: Enterprise Episode Ratings per Season",
       y = "Density", x = "Rating", fill = "Season")

Welp, it really looks like s04 was (significantly) better received than the other seasons.

Futurama

A long-standing favorite of mine and also a request. I wonder if the many, many cancellations had any effect on season reception.

futurama  <- trakt.get_all_episodes("futurama")

tadaa_aov(rating ~ season, data = futurama, print = "markdown") 

Table 4: One-Way ANOVA: Using Type III Sum of Squares

| Term      | df  | SS   | MS   | F    | p    | \(\eta^2\) | Cohen’s f | Power |
|-----------|----:|-----:|-----:|-----:|-----:|-----------:|----------:|------:|
| season    | 6   | 0.27 | 0.05 | 1.09 | 0.37 | 0.05       | 0.24      | 0.44  |
| Residuals | 117 | 4.85 | 0.04 |      |      |            |           |       |
| Total     | 123 | 5.12 | 0.09 |      |      |            |           |       |



Assuming we’re working with \(\alpha = 0.05\) (which I’ll just say we do), that’s not a significant result.

aov(rating ~ season, data = futurama) %>%
  TukeyHSD() %>%
  tidy() %>%
  tadaa_plot_tukey()

Welp, at least the first line yields a significant result with the .05 cutoff, so I guess you could make an argument that s05 was better than s07, but besides that it seems like Futurama was pretty consistent.
Let’s look at a plot to get a feel for that.

ggplot(data = futurama, aes(x = rating, fill = season)) +
  geom_density(alpha = .6) +
  scale_x_continuous(breaks = scales::pretty_breaks()) +
  labs(title = "Futurama Episode Ratings per Season",
       y = "Density", x = "Rating", fill = "Season")

Yip. Seems pretty consistent to me.

Lost

Oh screw me. Well, let’s do this.

lost  <- trakt.get_all_episodes("lost-2004")

tadaa_aov(rating ~ season, data = lost) 
##        term  df      sumsq      meansq statistic   p.value eta.sq.part
## 1 Residuals 114 3.26835512 0.028669782        NA        NA          NA
## 2    season   5 0.03805877 0.007611755 0.2654975 0.9310704  0.01151059
## 3     Total 119 3.30641390 0.036281537        NA        NA          NA
##    cohens.f     power
## 1        NA        NA
## 2 0.1079103 0.1170426
## 3        NA        NA

Huh. So I guess we’ll have to take a closer look.

aov(rating ~ season, data = lost) %>%
  TukeyHSD() %>%
  tidy() %>%
  tadaa_plot_tukey()

I haven’t watched Lost and I don’t really care for it, so make of this what you will.

ggplot(data = lost, aes(x = rating, fill = season)) +
  geom_density(alpha = .6) +
  scale_x_continuous(breaks = scales::pretty_breaks()) +
  labs(title = "Lost Episode Ratings per Season",
       y = "Density", x = "Rating", fill = "Season")

Daria

Everybody loves Daria, so don’t pretend you don’t know what’s up.

daria <- trakt.get_all_episodes("daria")

tadaa_aov(rating ~ season, data = daria, print = "markdown") 

Table 5: One-Way ANOVA: Using Type III Sum of Squares

| Term      | df | SS   | MS   | F    | p    | \(\eta^2\) | Cohen’s f | Power |
|-----------|---:|-----:|-----:|-----:|-----:|-----------:|----------:|------:|
| season    | 4  | 0.34 | 0.09 | 0.72 | 0.58 | 0.05       | 0.22      | 0.24  |
| Residuals | 60 | 7.1  | 0.12 |      |      |            |           |       |
| Total     | 64 | 7.44 | 0.2  |      |      |            |           |       |



Oh, well.

aov(rating ~ season, data = daria) %>%
  TukeyHSD() %>%
  tidy() %>%
  tadaa_plot_tukey()

It’s been a while since I watched Daria, but season 4 seems… oh wait, never mind. There are several missing values in the dataset, and I don’t feel comfortable making assumptions based on insufficient data. I mean, look at this plot of the rating distribution:

ggplot(data = daria, aes(x = rating, fill = season)) +
  geom_density(alpha = .6) +
  scale_x_continuous(breaks = scales::pretty_breaks()) +
  labs(title = "Daria Episode Ratings per Season",
       y = "Density", x = "Rating", fill = "Season")

And then look at the vote distribution:

ggplot(data = daria, aes(x = votes)) +
  geom_histogram(alpha = .6, binwidth = 1) +
  scale_x_continuous(breaks = scales::pretty_breaks()) +
  labs(title = "Daria Episode Votes",
       y = "Count", x = "Votes")

The peak at \(x = 0\) means exactly what you think it means: A large number of episodes with no votes at all.
So, that’s a bummer.

Person of Interest

poi  <- trakt.get_all_episodes("person-of-interest")

tadaa_aov(rating ~ season, data = poi, print = "markdown") 

Table 6: One-Way ANOVA: Using Type III Sum of Squares

| Term      | df  | SS   | MS   | F    | p      | \(\eta^2\) | Cohen’s f | Power |
|-----------|----:|-----:|-----:|-----:|-------:|-----------:|----------:|------:|
| season    | 4   | 0.87 | 0.22 | 3.83 | < 0.01 | 0.14       | 0.4       | 0.9   |
| Residuals | 98  | 5.6  | 0.06 |      |        |            |           |       |
| Total     | 102 | 6.47 | 0.28 |      |        |            |           |       |



I’ll skip the post-hoc this time. Courtesy plot:

ggplot(data = poi, aes(x = rating, fill = season)) +
  geom_density(alpha = .6) +
  scale_x_continuous(breaks = scales::pretty_breaks()) +
  labs(title = "Person of Interest Episode Ratings per Season",
       y = "Density", x = "Rating", fill = "Season")

Castle

castle <- trakt.get_all_episodes("castle")

tadaa_aov(rating ~ season, data = castle, print = "markdown") 

Table 7: One-Way ANOVA: Using Type III Sum of Squares

| Term      | df  | SS    | MS   | F     | p       | \(\eta^2\) | Cohen’s f | Power |
|-----------|----:|------:|-----:|------:|--------:|-----------:|----------:|------:|
| season    | 7   | 7.67  | 1.1  | 28.73 | < 0.001 | 0.55       | 1.1       | 1     |
| Residuals | 165 | 6.29  | 0.04 |       |         |            |           |       |
| Total     | 172 | 13.97 | 1.13 |       |         |            |           |       |



Huh.

aov(rating ~ season, data = castle) %>%
  TukeyHSD() %>%
  tidy() %>%
  tadaa_plot_tukey()

There seems to be quite some disturbance in the force with this one.

ggplot(data = castle, aes(x = rating, fill = season)) +
  geom_density(alpha = .6) +
  scale_x_continuous(breaks = scales::pretty_breaks()) +
  labs(title = "Castle Episode Ratings per Season",
       y = "Density", x = "Rating", fill = "Season")

Parks & Recreation

prec <- trakt.get_all_episodes("parks-and-recreation")

tadaa_aov(rating ~ season, data = prec, print = "markdown") 

Table 8: One-Way ANOVA: Using Type III Sum of Squares

| Term      | df  | SS   | MS   | F     | p       | \(\eta^2\) | Cohen’s f | Power |
|-----------|----:|-----:|-----:|------:|--------:|-----------:|----------:|------:|
| season    | 6   | 2.87 | 0.48 | 10.51 | < 0.001 | 0.35       | 0.73      | 1     |
| Residuals | 118 | 5.38 | 0.05 |       |         |            |           |       |
| Total     | 124 | 8.25 | 0.52 |       |         |            |           |       |



Again? Whoa, we’re on a roll.

aov(rating ~ season, data = prec) %>%
  TukeyHSD() %>%
  tidy() %>%
  tadaa_plot_tukey()

What. It looks like almost every season is rated significantly differently from every other.

ggplot(data = prec, aes(x = rating, fill = season)) +
  geom_density(alpha = .6) +
  scale_x_continuous(breaks = scales::pretty_breaks()) +
  labs(title = "Parks & Recreation Episode Ratings per Season",
       y = "Density", x = "Rating", fill = "Season")

Oh yeah. That show is a mess, consistency-wise.
Interesting how bad s01 seems to be doing.

Sleepy Hollow

Haven’t seen it, not sure if I will.

shollow <- trakt.get_all_episodes("sleepy-hollow")

tadaa_aov(rating ~ season, data = shollow) 
##        term df    sumsq     meansq statistic      p.value eta.sq.part
## 1 Residuals 58 1.199994 0.02068955        NA           NA          NA
## 2    season  3 1.454644 0.48488120  23.43604 4.618261e-10   0.5479631
## 3     Total 61 2.654638 0.50557075        NA           NA          NA
##   cohens.f power
## 1       NA    NA
## 2 1.101004     1
## 3       NA    NA

Well.

aov(rating ~ season, data = shollow) %>%
  TukeyHSD() %>%
  tidy() %>%
  tadaa_plot_tukey()

Okay, with only two seasons there’s not much fun in doing an ANOVA in the first place; a simple t-test would probably have done the trick.
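For the two-group case, that alternative would look something like this — on made-up numbers, with `s1`/`s2` standing in for two hypothetical seasons:

```r
# With only two groups, a t-test does the job; s1/s2 are made-up ratings
set.seed(7)
s1 <- rnorm(12, mean = 7.6, sd = 0.3)  # hypothetical season 1
s2 <- rnorm(12, mean = 7.3, sd = 0.3)  # hypothetical season 2
t.test(s1, s2)  # defaults to the unequal-variances (Welch) version
```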

ggplot(data = shollow, aes(x = rating, fill = season)) +
  geom_density(alpha = .6) +
  scale_x_continuous(breaks = scales::pretty_breaks()) +
  labs(title = "Sleepy Hollow Episode Ratings per Season",
       y = "Density", x = "Rating", fill = "Season")

The point being: It doesn’t seem to get better.

Family Guy

To end this, here comes a favorite of mine.

fguy <- trakt.get_all_episodes("family-guy")
tadaa_aov(rating ~ season, data = fguy, print = "markdown") 

Table 9: One-Way ANOVA: Using Type III Sum of Squares

| Term      | df  | SS    | MS   | F     | p       | \(\eta^2\) | Cohen’s f | Power |
|-----------|----:|------:|-----:|------:|--------:|-----------:|----------:|------:|
| season    | 15  | 6.61  | 0.44 | 21.19 | < 0.001 | 0.52       | 1.04      | 1     |
| Residuals | 293 | 6.09  | 0.02 |       |         |            |           |       |
| Total     | 308 | 12.71 | 0.46 |       |         |            |           |       |



Oh well, I’ll take it.

aov(rating ~ season, data = fguy) %>%
  TukeyHSD() %>%
  tidy() %>%
  tadaa_plot_tukey()

Hm.

# Unrated episodes have a rating of 0, so treat those as missing
fguy$rating[fguy$rating == 0] <- NA
ggplot(data = fguy, aes(x = rating, fill = season)) +
  geom_density(alpha = .6) +
  scale_x_continuous(breaks = scales::pretty_breaks()) +
  labs(title = "Family Guy Episode Ratings per Season",
       y = "Density", x = "Rating", fill = "Season")

Scrubs

As an afterthought, let’s look at the one that got this whole thing started.

scrubs <- trakt.get_all_episodes("scrubs")
tadaa_aov(rating ~ season, data = scrubs, print = "markdown") 

Table 10: One-Way ANOVA: Using Type III Sum of Squares

| Term      | df  | SS    | MS   | F     | p       | \(\eta^2\) | Cohen’s f | Power |
|-----------|----:|------:|-----:|------:|--------:|-----------:|----------:|------:|
| season    | 8   | 10.83 | 1.35 | 65.33 | < 0.001 | 0.75       | 1.74      | 1     |
| Residuals | 172 | 3.57  | 0.02 |       |         |            |           |       |
| Total     | 180 | 14.4  | 1.37 |       |         |            |           |       |



That’s… convincing.

aov(rating ~ season, data = scrubs) %>%
  TukeyHSD() %>%
  tidy() %>%
  tadaa_plot_tukey()

Once again, pretty convincing. Nobody likes s09.

ggplot(data = scrubs, aes(x = rating, fill = season)) +
  geom_density(alpha = .6) +
  scale_x_continuous(breaks = scales::pretty_breaks()) +
  labs(title = "Scrubs Episode Ratings per Season",
       y = "Density", x = "Rating", fill = "Season")

…yip.

Limitations

Probably the most important thing to note about all of this is that trakt.tv doesn’t have nearly the userbase of other, well-established sites like IMDb. So, well, I guess the sample size might be too small, or maybe there are some biases or confounders at work which I can’t really adjust for.
A way around this would probably be to use the votes variable in the episode dataset to normalize the ratings across a show, but I don’t really know a decent way to do that, since the per-episode ratings already represent the average of all the user votes for that episode.
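For what it’s worth, here’s one crude sketch of what vote-weighting could look like when aggregating per season. The `eps` data.frame is made up, and I’m not claiming this is a proper normalization — it just makes episodes with more votes count for more:

```r
library(dplyr)

# Made-up episode data; the real data has the same season/rating/votes columns
eps <- data.frame(
  season = factor(c(1, 1, 2, 2)),
  rating = c(8.0, 9.0, 7.0, 8.0),
  votes  = c(100, 10, 50, 50)
)

# Vote-weighted season means: episodes with more votes count for more
eps %>%
  group_by(season) %>%
  summarise(
    plain_mean    = mean(rating),
    weighted_mean = weighted.mean(rating, votes)
  )
```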

The other problem is that I never once checked for normality within the ratings, which should hold if you want to perform an ANOVA. If the variable you’re looking at is not normally distributed, you should be doing a Kruskal–Wallis test instead, but I’m a lazy man.
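A quick sketch of what that check could look like in base R, again on made-up data:

```r
# Made-up ratings; check normality per season, then run the rank-based test
set.seed(1)
toy <- data.frame(
  season = factor(rep(1:2, each = 20)),
  rating = c(rnorm(20, 8.0, 0.3), rnorm(20, 8.4, 0.3))
)

# Shapiro-Wilk test of normality within each season
by(toy$rating, toy$season, shapiro.test)

# Rank-based alternative to the one-way ANOVA
kruskal.test(rating ~ season, data = toy)
```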

Well, there’s that and my whole point about how this is all fun and games, and no serious points shall be made.