# Remember the X-Files?

With the return of the X-Files in form of a miniseries, I was tempted to catch up on the original run of the show, since I had only seen the occasional episode in the late 90’s or early 00’s (my mom was a big fan). Being me, I already looked up the X-Files episodes ratings on trakt.tv to see if there’s something interesting about them, but I didn’t think there was. However, when I listened to the Incomparable talking about the show, I learned that apparently X-Files can be divided into the “myth arc” and regular, more stand-alone episodes. That’s when I realized I need to get my tv show analysis boots on and try to see what I could do. To my delight, I noticed that the appropriate Wikipedia article neatly marks the myth arc episodes, ready for plucking.

And then I started plucking.

### Setup and data retrieval

Here are the R packages I used for the analysis. Some of them are from GitHub, and two of them are my own packages. It seems I need this stuff a lot.

```
library(rvest)
library(dplyr)
library(magrittr)
library(stringr)
library(ggplot2)
library(broom)
library(knitr)
library(tRakt) # devtools::install_github("jemus42/tRakt")
library(tadaatoolbox) # devtools::install_github("tadaadata/tadaatoolbox")
library(waffle) # devtools::install_github("hrbrmstr/waffle")
theme_set(theme_tadaa())
options(digits = 3)
knitr::opts_chunk$set(cache = TRUE)
```

And here is the heart of that whole project. Probably my biggest and most complex `%>%`

-chain to date. I realize that going for big and complex chains is not necessarily a good thing because of readibility and understandability, but at some point I just wanted to know if I could actually do all the data retrieval *and cleanup* in one chain. I could.

```
xfiles <- read_html("https://en.wikipedia.org/wiki/List_of_The_X-Files_episodes") %>%
str_replace_all('</a>\"<img', "ARC</a><img", string = .) %>%
read_html() %>%
html_table(fill = TRUE) %>%
extract(c(2:6, 8:11)) %>%
lapply(., function(x){
names(x) <- c("epnum", "episode", "title", "director", "writer",
"firstaired", "prod_id", "viewers")
x <- as.data.frame(lapply(x, as.character), stringsAsFactors = F)
return(x)
}) %>%
bind_rows() %>%
mutate(title = str_replace_all(title, '"', ""),
plotarc = ifelse(grepl("ARC", title), "Mytharc", "Regular"),
title = str_replace_all(title, "ARC", ""),
epnum = as.numeric(epnum)) %>%
bind_rows(., .[nrow(.), ]) %>%
mutate(epnum = seq_along(epnum)) %>%
select(-episode, -title, -firstaired) %>%
mutate(viewers = str_replace_all(viewers, "\\[\\d+]", ""),
viewers = as.numeric(viewers),
plotarc = as.factor(plotarc)) %>%
full_join(trakt.get_all_episodes("the-x-files") %>% filter(season != "10"),
by = c("epnum" = "epnum"))
```

Maybe I can explain the process a little, because the first half took me ages to get right. The thing about the episodes in that Wikipedia article being marked is tricky, because they are indeed marked with a double dagger (‡), but no, not a unicode text dagger like this: ‡ — but with an image of the symbol. Therefore, my web scraping only returned the text of the html table, not the image in it. Therefore, I could not easily figure out which episode was being marked, since the marking was now gone. The workaround for that was for me to scrape the raw html with `read_html`

, pump it through `str_replace_all`

where I looked for occurrences of images being placed directly after text, and then inserted the dummy text `ARC`

at the end of the text and before the image. To my surprise, that actually worked quite well (at this point I had tried a dozen different things, including regex-voodoo). At this point I could simply re-pump the modified html into the same `read_html`

function, extract the tables and filter out the elements which corresponded to the episode tables from the Wiki article. That’s it for the first 5 lines. After that there’s only a bunch of modifications to make the output usable, like replacing column names with names that didn’t cause weird issues (probably due to weird characters R didn’t like), and combining all the episode tables for each season into one coherent `data.frame`

(the first `bind_rows`

). The rest is basically your average data munging, until the last line, where I used my trakt.tv package to collect all the X-Files episode with ratings and some other data from trakt.tv, merge that with the original dataset to produce the show dataset with the most metadata I ever had. That’s nice.

So, on to the actualy analysis.

### So I heard you like plots

Before I start the plottage, here are some numbers: The original run of the serious spanned 9 seasons, 202 episodes (depending on how you count certain two parters, I counted them as two episodes) and just over 3 *billion* views.

That’s a lot of views. Let’s plot them over the seasons.

```
ggplot(data = xfiles, aes(x = episode, y = viewers)) +
geom_point(size = 4) +
facet_grid(. ~ season) +
labs(title = "The X-Files US Viewers by Season",
x = "Episode", y = "Viewers (Millions)")
```

That’s beautiful. I bet I could draw a quadratic curve right through those data points and probably even get a good regression fit out of it.

```
ggplot(data = xfiles, aes(x = epnum, y = viewers)) +
geom_point(size = 4) +
geom_smooth(method = lm, formula = y ~ poly(x, 2), se = F, color = "red") +
labs(title = "The X-Files US Viewers",
x = "Episode", y = "Viewers (Millions)")
```

```
model <- lm(data = xfiles, viewers ~ poly(epnum, 2))
tidy(model) %>% kable(digits = 3)
```

term | estimate | std.error | statistic | p.value |
---|---|---|---|---|

(Intercept) | 14.7 | 0.15 | 97.6 | 0 |

poly(epnum, 2)1 | -20.9 | 2.35 | -8.9 | 0 |

poly(epnum, 2)2 | -50.5 | 2.46 | -20.5 | 0 |

`glance(model) %>% kable(digits = 3)`

r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | |
---|---|---|---|---|---|---|---|---|---|---|---|

value | 0.691 | 0.688 | 2.1 | 220 | 0 | 3 | -430 | 868 | 882 | 865 | 197 |

…And I actually did. Nice. For those who slept through statistics class, that mumbo jumbo of numbers basically means that that curve I drew is a pretty decent approximation of the viewership numbers, meaning that in the beginning, there were little, then a peak through season 5, and then a steady decline throughout the end.

Let’s look at the averages for each season and throw some confidence intervals in for good measure.

```
xfiles %>% group_by(season) %>%
summarize(mean = mean(viewers, na.rm = T),
lower = mean - confint_t(viewers, na.rm = T),
upper = mean + confint_t(viewers, na.rm = T)) %>%
ggplot(aes(x = season, y = mean)) +
geom_errorbar(aes(ymin = lower, ymax = upper), size = 2) +
geom_point(size = 3, color = "red") +
scale_color_brewer(palette = "Set1", guide = F) +
labs(title = "Average Viewers per Season with 95% CI",
x = "Season", y = "Average Viewers (Millions)")
```

It’s probably quite telling that season 9 got less viewers than the first season.

### So, plot arcs

Let’s start looking at these plot arcs I keep hearing about. First of all, I classified all the episodes according to the Wikipedia episode list in “Mytharc” and “Regular”, and now let’s see how many of each there are.

```
waffle(table(xfiles$plotarc), rows = 10, size = .5,
title = "Number of Episodes by Plot Arc", xlab = "1 Square == 1 Episode",
legend_pos = "top", colors = RColorBrewer::brewer.pal(2, "Set1"))
```

Yep. That’s 62 *mytharc* episodes and 140 remaining episodes. I’m assuming the plot arcs don’t have any influence on the original viewer numbers, but let’s check that.

```
ggplot(data = xfiles, aes(x = episode, y = viewers, colour = plotarc)) +
geom_point(size = 4) +
geom_smooth(method = lm, se = F) +
facet_grid(. ~ season) +
scale_color_brewer(palette = "Set1") +
labs(title = "The X-Files US Viewers\nBy Season and Plot Arc",
x = "Episode", y = "Viewers (Millions)", color = "Plot Arc") +
theme(legend.position = "top")
```

```
xfiles %>% group_by(plotarc) %>%
summarize(mean = mean(viewers, na.rm = T),
lower = mean - confint_t(viewers, na.rm = T),
upper = mean + confint_t(viewers, na.rm = T)) %>%
ggplot(aes(x = plotarc, y = mean, colour = plotarc)) +
geom_errorbar(aes(ymin = lower, ymax = upper), size = 2) +
geom_point(size = 3, color = "black") +
scale_color_brewer(palette = "Set1", guide = F) +
labs(title = "Average Viewers with 95% CI", x = "Plot Arc", y = "Average Viewers (Millions)")
```

Huh, it actually looks like there is a mild increase in viewers for the *mytharc*, but nothing significant (the CIs overlap a lot).

So, what’s next? How about we look at the trakt.tv episode ratings to compare the plot arcs:

```
ggplot(data = xfiles, aes(x = episode, y = rating, colour = plotarc)) +
geom_point(size = 4) +
geom_smooth(method = lm, se = F) +
facet_grid(. ~ season) +
scale_color_brewer(palette = "Set1") +
labs(title = "The X-Files Ratings on trakt.tv\nBy Season and Plot Arc",
x = "Episode", y = "Rating (0-10)", color = "Plot Arc") +
theme(legend.position = "top")
```

That seems… oddly conclusive. The mytharc appear to be consistently more well-received than non-mytharc episodes, but that might be due to people on rewatches (remember trakt.tv wasn’t around during the 90s) only watch and/or rate the mytharc episodes more often?

```
ggplot(data = xfiles, aes(x = episode, y = votes, colour = plotarc)) +
geom_point(size = 4) +
geom_smooth(method = lm, se = F) +
facet_grid(. ~ season) +
scale_color_brewer(palette = "Set1") +
labs(title = "The X-Files Votes on trakt.tv\nBy Season and Plot Arc",
x = "Episode", y = "Votes", color = "Plot Arc") +
theme(legend.position = "top")
```

Well, nope. That seems pretty uniform and resembles the same kind of vote distribution I commonly see on shows on trakt.tv.

So let’s take a closer look at the episode ratings by plot arc: Here’s a histogram with overlaid density distribution and the means with confidence intervals.

```
ggplot(data = xfiles, aes(x = rating, fill = plotarc)) +
geom_density(alpha = .5) +
geom_histogram(aes(y = ..density..), position = "dodge", alpha = .6) +
labs(title = "Rating Distribution by Plot Arc",
x = "Rating", y = "Density", fill = "Plot Arc") +
theme(legend.position = "top")
```

```
xfiles %>% group_by(plotarc) %>%
summarize(mean = mean(rating),
lower = mean - confint_t(rating),
upper = mean + confint_t(rating)) %>%
ggplot(aes(x = plotarc, y = mean, colour = plotarc)) +
geom_errorbar(aes(ymin = lower, ymax = upper), size = 2) +
geom_point(size = 3, color = "black") +
scale_color_brewer(palette = "Set1", guide = F) +
labs(title = "Average Rating with 95% CI", x = "Plot Arc", y = "Rating (0-100)")
```

As you can see, there’s quite a nice distinction. Especially the second plot shows an obvious difference which is so big a statistical test for significance would be entirely pointless. But who am I to judge, here’s a t-test.

`tadaa_t.test(xfiles, rating, plotarc, print = "markdown")`

Table 1: **Two Sample t-test** with alternative hypothesis: \(\mu_1 \neq \mu_2\)

Diff | \(\mu_1\) Mytharc | \(\mu_2\) Regular | t | SE | df | \(CI_{95\%}\) | p | Cohen's d | Power |
---|---|---|---|---|---|---|---|---|---|

0.41 | 7.95 | 7.54 | 9.92 | 0.04 | 200 | (0.33 - 0.49) | < 0.001 | 1.49 | 1 |

If you’re not familiar with t-tests and power analysis: That’s ridiculous. The p-value indicates significance by it’s own right, and the effect size of \(d = 1.421\) tells use the effect is *huge*, with effect sizes greater than 0.8 commonly referred to as large. Also, the power is 1 (rounded value, so *very close to 1*), which basically means that with that data it’s theoretically next to impossible no not spot that difference if our data was a sample of a larger “population” of episodes.

Anyway, let’s continue. The next thing I’m curious about is if that difference in ratings is seen throughout the series, or possibly limited to some seasons.

```
xfiles %>% group_by(plotarc, season) %>%
summarize(mean = mean(rating),
lower = mean - confint_t(rating),
upper = mean + confint_t(rating)) %>%
ggplot(aes(x = season, y = mean, colour = plotarc)) +
geom_errorbar(aes(ymin = lower, ymax = upper), size = 2) +
geom_point(size = 3, color = "black") +
scale_color_brewer(palette = "Set1") +
labs(title = "Average Rating with 95% CI", x = "Season", y = "Rating (0-100)", color = "Plot Arc") +
theme(legend.position = "top")
```

Here we have the mean rating with a 95% CI for each season, and who would have guessed, the overall trend is the same. Besides season 6 and maybe season 5, there’s a significant difference in ratings within each season per plotarc.

### Conclusion

So, take this to mean whatever you want, but let’s just say that if you were to rewatch *X-Files* to get in the mood for the new miniseries, there’s a chance you’ll be fine if you select your sample from the mytharc episodes. Or maybe not, as Siracusa argued in that Incomparable episodes. Anyways, I just wanted to make some plots.

So to end this, let’s take a quick look at the writers who brought us this show (filtered for at leats 2 episode credits):

```
xfiles %>% group_by(plotarc, writer) %>%
tally %>%
filter(n > 1) %>%
ggplot(aes(x = reorder(writer, n), weight = n)) +
geom_bar() +
coord_flip() +
facet_wrap(~plotarc) +
labs(title = "Writers with at Least Two Episode Credits",
x = "Writer", y = "Number of Episodes")
```