Star Trek. It’s a thing.
Not only is it a thing, it’s also a big franchise. You might have heard of it.
If you happen to fall into my sociodemographic bracket, you might not have particularly strong feelings about it, but you’re probably aware that many people do.
I for one do not really care. I didn’t grow up with Star Trek like many people did, I grew up watching Stargate SG-1.
You might think of that however you wish, but fact of the matter is that the new Star Trek: Discovery is the first Star Trek TV Show I watch as it airs.
Yes, I’ve seen the rebooted JJ Abrams movies, and yes they kind of helped to make the franchise more accessible to people like me by basically starting fresh, but in this blogpost I’ll only focus on the shows.
Shows I have not seen.
Basically, I’m the perfect person to talk about these shows due to my lack of emotional investment, knowledge and oh wait I have no business talking about them.
So, with Discovery having made it’s first impression, I thought it would be nice to put it into context a little, with all the other Star Trek stuff we had in the past.
As someone who’s not that familiar with the franchise, I was a little surprised by the length of the wilderness years between the original series and the second live-action installment, as well as the relatively larg gap between the latest series and the new Dicscovery. I also didn’t realize how big the overlap between Deep Space Nine, Voyager and The Next Generation was, I always assumed they were more sequential.
Anyway, I did my usual “pull data from <trakt.tv> and look at it”-thing, and here it goes.

## Preparation

library(tRakt) # devtools::install_github("jemus42/tRakt")
library(dplyr)
library(ggplot2)
library(scales)
library(purrr)
library(ggrepel)

plot_caption <- paste0("@jemus42 – blog.jemu.name @ ", Sys.Date())

Next up, we’ll gather the data. Since these shows have ridiculously long titles in a world of single-word-shows, we’ll keep track of both titles and commonly used abbreviations.
Please note that the data collection step is a little clunky, code-wise, but oh well, it get’s the job done.

stshows <- c("star-trek",
"star-trek-the-animated-series",
"star-trek-the-next-generation",
"star-trek-deep-space-nine",
"star-trek-voyager",
"star-trek-enterprise",
"star-trek-discovery")

abbrv <- c("TOS", "TAS", "TNG", "DS9", "VOY", "ENT", "DSC")

# Assemble tibble of names
stshows_titles <- tibble(
slug  = stshows,
abbrv = abbrv,
title = map_chr(slug, function(x) {trakt.show.summary(x)$title}) ) # Gather data using the slug, appending title and abbreviated title startrek <- map_df(stshows, function(stshow) { episodes <- trakt.get_all_episodes(stshow) episodes$show     <- trakt.show.summary(stshow)$title episodes$show_abr <- stshows_titles$abbrv[stshows_titles$slug == stshow]

as_tibble(episodes)
}) %>%
arrange(first_aired) %>%
mutate(show =     factor(show, levels = stshows_titles$title, ordered = TRUE), show_abr = factor(show_abr, levels = stshows_titles$abbrv, ordered = TRUE)) %>%
arrange(first_aired) %>%
mutate(ep_abs = seq_along(episode))

## The Data

Here’s a randomly chosen sample of two episodes of each show to give you a rough idea of the data I’m working with.
The dataset contains a few more variables, but I won’t be using them or they’re just variations on the variables below.

startrek %>%
select(show, epid, first_aired, rating, votes) %>%
group_by(show) %>%
sample_n(2) %>%
ungroup() %>%
arrange(first_aired) %>%
knitr::kable()
Star Trek s01e15 1966-12-30 01:30:00 7.38603 272
Star Trek s02e23 1968-03-02 01:30:00 6.70370 189
Star Trek: The Animated Series s01e15 1974-01-05 05:00:00 7.45098 51
Star Trek: The Animated Series s02e06 1974-10-12 04:00:00 7.45098 51
Star Trek: The Next Generation s02e21 1989-07-11 01:00:00 7.95012 401
Star Trek: The Next Generation s03e02 1989-10-03 01:00:00 7.53077 390
Star Trek: Deep Space Nine s01e18 1993-05-30 04:00:00 7.60166 241
Star Trek: Deep Space Nine s03e21 1995-05-01 04:00:00 8.35111 225
Star Trek: Voyager s02e14 1996-01-22 05:00:00 7.72857 280
Star Trek: Voyager s07e21 2001-04-25 04:00:00 7.78397 287
Star Trek: Enterprise s03e12 2004-01-15 01:00:00 7.48675 302
Star Trek: Enterprise s04e06 2004-11-13 01:00:00 7.76510 298
Star Trek: Discovery s01e06 2017-10-23 00:30:00 7.82714 5328
Star Trek: Discovery s01e10 2018-01-08 01:30:00 7.88776 3555

You might notice a few things.
First up, the vote count. While trakt.tv is pretty neat, it doesn’t have nearly the userbase of bigger sites like IMDb, nor has it been around for as long. Additionally, how many people do you know who not only rewatch older tv shows, but also take the time to rate each episode individually on sites like trakt?
Exactly. That’s the biggest flaw I see in the data, as with all my trakt-data-shenanigans. The data is easy to retrieve, but unfortunately there’s not that much of it.

But oh well, who cares, I’m just here to put Discovery into a little perspective, so on we go.
Throughout this post I’ll be shortening the show names to a more plot-friendly size, using these commonly used abbreviations for reference:

stshows_titles %>%
select(title, abbrv) %>%
magrittr::set_names(c("Title", "Abbreviation")) %>%
knitr::kable()
Title Abbreviation
Star Trek TOS
Star Trek: The Animated Series TAS
Star Trek: The Next Generation TNG
Star Trek: Deep Space Nine DS9
Star Trek: Voyager VOY
Star Trek: Enterprise ENT
Star Trek: Discovery DSC

## The Timeline

startrek %>%
group_by(show_abr) %>%
summarize(e_first = min(first_aired),
e_last  = max(first_aired)) %>%
mutate(show_abr = factor(show_abr, levels = rev(stshows_titles$abbrv))) %>% ggplot(aes(x = show_abr, color = show_abr)) + geom_errorbar(aes(ymin = e_first, ymax = e_last), width = 0, size = 4.5, color = "black") + geom_errorbar(aes(ymin = e_first, ymax = e_last), width = 0, size = 4) + scale_y_datetime(date_breaks = "5 years", date_minor_breaks = "1 year", date_labels = "%Y", limits = c(as.POSIXct("1963-01-01"), NA)) + scale_color_brewer(palette = "Set2", direction = -1, guide = F) + coord_flip() + labs(title = "Star Trek TV Shows", subtitle = "Show Timeline", x = "", y = "Show Timespan", caption = plot_caption) ggplot(data = startrek, aes(x = first_aired, y = rating, fill = show)) + geom_point(shape = 21, color = "black", size = 3) + scale_x_datetime(date_breaks = "5 years", date_minor_breaks = "1 year", date_labels = "%Y", limits = c(as.POSIXct("1963-01-01"), NA)) + scale_y_continuous(breaks = seq(0, 10, .5), minor_breaks = seq(0, 10, .25)) + scale_fill_brewer(palette = "Set2") + labs(title = "Star Trek TV Shows", subtitle = "Ratings on trakt.tv", x = "Original Airdate", y = "Rating (1-10)", fill = "Show", caption = plot_caption) + theme(legend.position = "bottom") ggplot(data = startrek, aes(x = ep_abs, y = rating, fill = show_abr)) + geom_point(shape = 21, color = "black", size = 3) + scale_x_continuous(breaks = seq(0, 1e3, 50), minor_breaks = seq(0, 1e3, 25)) + scale_y_continuous(breaks = seq(0, 10, .5), minor_breaks = seq(0, 10, .25)) + scale_fill_brewer(palette = "Set2") + labs(title = "Star Trek TV Shows", subtitle = "Ratings on trakt.tv", x = "Sequential Episode Number", y = "Rating (1-10)", fill = "Show", caption = plot_caption) + theme(legend.position = "bottom") But random internet guy, people will now exclaim, isn’t it bad to use a truncated y-axis? they will ask, smugly. And yes, in many situations it’s a bad idea to truncate axis of quantities with a known range like a 1-10 point scale, so here’s the same plot as above with “proper” limits: ggplot(data = startrek, aes(x = ep_abs, y = rating, fill = show_abr)) + geom_point(shape = 21, color = "black", size = 3) + scale_x_continuous(breaks = seq(0, 1e3, 50), minor_breaks = seq(0, 1e3, 25)) + scale_y_continuous(breaks = seq(0, 10, 1), minor_breaks = seq(0, 10, .5), limits = c(1, 10)) + scale_fill_brewer(palette = "Set2") + labs(title = "Star Trek TV Shows", subtitle = "Ratings on trakt.tv", x = "Sequential Episode Number", y = "Rating (1-10)", fill = "Show", caption = plot_caption) + theme(legend.position = "bottom") The thing is, it’s harder to see differences within individual shows. It’s a more or less uniform strip of points, which doesn’t really help that much, so I’ll probably stick to the truncated axes. ## Within-Show Variation Next up we’ll be looking at the variation within each show’s rating, with respect to each show’s total number of episodes. ggplot(data = startrek, aes(x = show_abr, y = rating, color = show_abr, fill = show_abr)) + geom_point(alpha = .5, position = position_jitter(width = .5, height = 0)) + geom_boxplot(alpha = .5, color = "black", outlier.alpha = 0) + scale_fill_brewer(palette = "Set2") + scale_color_brewer(palette = "Set2") + labs(title = "Star Trek TV Shows", subtitle = "Ratings on trakt.tv: Within-Show Ratings as Boxplots", x = "Show", y = "Rating (1-10)", caption = plot_caption) + theme(legend.position = "none") We can see TNG with a lot of variation compared to the relatively consistent Enterprise. Also, there are quite a few outliers in TNG, so apaprently there are some really good episodes, and a few really bad episodes (relatively speaking). We’ll be looking at the outliers a little later. Next up I’d like to look at the ratings of individual seasons of each shows, where we’ll be using boxplots again, but this time per season. ggplot(data = startrek, aes(x = season, y = rating, fill = show_abr)) + geom_boxplot(alpha = .5, color = "black") + facet_wrap(~show_abr, scales = "free_x", nrow = 1) + scale_fill_brewer(palette = "Set2") + labs(title = "Star Trek TV Shows", subtitle = "Ratings on trakt.tv: Within-Show Ratings as Boxplots", x = "Show", y = "Rating (1-10)", caption = plot_caption) + theme(legend.position = "none") You might notice a few trends, like TOS getting technically worse over time, while DS9 and ENT seem to be getting better over time. I vaguely remember analyzing the seasonal trends of ENT in an earlier blogpost, where I also did some simple statistics to “prove” that the show gets better over time. We can make these trends a little more explicit by using simple linear regression to approximate the ratings over time, where time is just the sequential episode number. A rising trendline would indicate the show getting better towards the end and vice versa. It’s a little too early to evaluate Discovery in this manner, but at least we can see that apparently people really enjoyed the Fall Finale™. In other news, I think it’s safe to say that apparently nobody liked TAS, so… yeah. There’s that. ggplot(data = startrek, aes(x = epnum, y = rating, fill = show_abr)) + geom_point(shape = 21, color = "black", size = 3, alpha = .25) + geom_smooth(aes(color = show_abr), method = lm, se = F) + facet_wrap(~show_abr, nrow = 1, scales = "free_x") + scale_x_continuous(breaks = pretty_breaks()) + scale_y_continuous(breaks = seq(0, 10, .5), minor_breaks = seq(0, 10, .25)) + scale_fill_brewer(palette = "Set2") + scale_color_brewer(palette = "Set2") + labs(title = "Star Trek TV Shows", subtitle = "Ratings on trakt.tv", x = "Sequential Episode Number", y = "Rating (1-10)", fill = "Show", caption = plot_caption) + theme(legend.position = "none") ## How Do You Feel About Histograms? We’ve seen the individual episode ratings, but how about the general distribution of all the ratings regardless of the show? Just a big distribution which tells us the rough range the ratings seem to be in seems appropriate. It should also be noted that in my experience, ratings on trakt.tv don’t seem to vary that greatly, but rather seem to fall in the range between 6 and 9. My working hypothesis is that people tend to, well, watch and rate things they enjoy at least a bit, and if they really dislike it and would rate it 5 or lower, they don’t go through to watch and rate the whole shebang. Anyway, my point being: The ratings might be a little biased, but I think we’re already aware that the trakt.tv user ratings are not a perfect cross-section of society as a whole, so… yeah, I’m fine with that. mx <- round(mean(startrek$rating), 2)
sx <- round(sd(startrek$rating), 2) ggplot(data = startrek, aes(x = rating)) + geom_histogram(binwidth = .1) + geom_vline(xintercept = mean(startrek$rating),
linetype = "dashed", color = "red", size = 2) +
scale_x_continuous(breaks = seq(0, 10, .5),
minor_breaks = seq(0, 10, .1)) +
annotate(geom = "label",
x = 6.5, y = 90,
label = paste("italic(bar(x)) == ", mx,
"~~italic(s[x]) == ", sx),
parse = TRUE) +
annotate("rect",
xmin = mx - sx,
xmax = mx + sx,
ymin = 0,
ymax = 125,
fill = "grey70",
alpha = 0.2) +
labs(title = "Star Trek TV Shows",
subtitle = "Episode Rating Distribution on trakt.tv",
x = "Rating (1-10)",
y = "Absolute Frequency",
caption = plot_caption) 

In this histogram we see that most ratings fall in a relatively thin range around ~7.6, which is a solid “yeah, cool” I guess. There’s only one episode below the 6.0 line, which falls into meh-territory, so that’s interesting.
Besides that we observe a perfectly “pretty normal” distribution, which shouldn’t be surprising considering we have a total $N = 739$.

ggplot(data = startrek, aes(x = rating, fill = show_abr)) +
geom_histogram(aes(y = ..density..),
binwidth = .2, position = "dodge") +
geom_density(alpha = .5) +
scale_fill_brewer(palette = "Set2") +
facet_wrap(~show_abr, ncol = 2) +
labs(title = "Star Trek TV Shows",
subtitle = "Episode Rating Distribution on trakt.tv",
x = "Rating (1-10)",
y = "Absolute Frequency",
fill = "Show",
caption = plot_caption) +
theme(legend.position = "none")

In a by-show plot of distributions, we can now use the width of the distributions to estimate the variance within each show, but we kind of did that already earlier. It could be nice to look at the skew of each distribution, but I don’t think there’s that much to gain here.

## The First Seasons

If I remember correctly, Jason Snell mentioned that the first season of any Star Trek tended to be not that great, which is why you should probably not jump to conclusions regarding Discovery’s quality just by the first few episodes. So I plotted all the first seasons in one handy graph, where I rescaled the x-axis to a relative number of percentage of episodes of first seaosn, which allows a better comparison. Note that at the time of this writing the first season of Discovery is not yet concluded, but we already saw the silly named Fall Finale™ and know the season will consist of 15 episodes, so I adjusted appropriately.

startrek %>%
filter(season == 1) %>%
group_by(show) %>%
mutate(season_progress = episode / max(episode),
season_progress = ifelse(grepl(x = show, pattern = "Discovery"),
episode / 15, season_progress)) %>%
ggplot(aes(x = season_progress, y = rating,
color = show_abr, fill = show_abr)) +
geom_point(shape = 21, color = "black", size = 3, alpha = .25) +
geom_smooth(method = loess, se = F, span = 1.5) +
scale_x_continuous(breaks = pretty_breaks(), labels = percent) +
scale_y_continuous(breaks = seq(0, 10, .5),
minor_breaks = seq(0, 10, .25)) +
scale_fill_brewer(palette = "Set2", guide = F) +
scale_color_brewer(palette = "Set2") +
labs(title = "Star Trek TV Shows",
subtitle = "First Season Ratings on trakt.tv",
x = "Relative Episode Position in Show's First Season",
y = "Rating (1-10)",
color = "Show",
caption = plot_caption) +
theme(legend.position = "bottom")

Or, if you prefer the boxplot way of life:

startrek %>%
filter(season == 1) %>%
ggplot(aes(x = show_abr, y = rating, color = show_abr, fill = show_abr)) +
geom_point(alpha = .5, position = position_jitter(width = .2, height = 0)) +
geom_boxplot(alpha = .5, color = "black", outlier.alpha = 0) +
scale_fill_brewer(palette = "Set2") +
scale_color_brewer(palette = "Set2") +
labs(title = "Star Trek TV Shows",
subtitle = "First Season Ratings on trakt.tv",
x = "",
y = "Rating (1-10)",
caption = plot_caption) +
theme(legend.position = "none")

I guess it’s fair to say that Discovery is doing pretty well so far, but it should also be noted that most of the ratings of previous shows were presumably made during rewatches, since trakt hasn’t been around that long. So either DSC is doing pretty good or it’s impossible to actually make a statement about the first-season-hypothesis based on the data, so I opt for the interpration that makes it interesting.

## The Best and the Worst Episodes

Ah yes, the thing with the outliers.
In the following plot, I’ve labelled each episode with regards to whether or not is in an outlier, which I have defined in this case to be any value that deviates more than two IQR from the median. What an IQR range is a thing that you either know or are googleing now, and well let’s face it, it’s not important. Anyway, I labelled the outliers with their episode ID (e.g. s02e03) and the episode title.

startrek %>%
group_by(show) %>%
mutate(median = median(rating),
q1 = quantile(rating, probs = 0.25),
q3 = quantile(rating, probs = 0.75),
outlier_max = median + 2 * (q3 - q1),
outlier_min = median - 2 * (q3 - q1),
is_outlier = if_else(rating < outlier_min | rating > outlier_max,
"Outlier", "No Outlier")) %>%
{
ggplot(data = ., aes(x = ep_abs, y = rating,
fill = show_abr, alpha = is_outlier)) +
geom_point(shape = 21, color = "black") +
geom_label_repel(data = filter(., is_outlier == "Outlier"),
aes(label = paste0(epid,": ", title)),
alpha = .75, size = 3) +
scale_fill_brewer(palette = "Set2") +
scale_alpha_discrete(range = c(0.1, 1), guide = FALSE) +
labs(title = "Star Trek TV Shows",
subtitle = "Outliers of Episode Ratings on trakt.tv (Within Show)",
x = "Sequential Episode Number",
y = "Rating (1-10)",
fill = "Show",
caption = plot_caption) +
theme(legend.position = "bottom")
}

It’s nice how DS9 seems to have a lot of positive outliers with only two negative outliers, which indicates that DS9 is not that great on average, but a couple of episodes are pretty good. At least that’s the way I interpret it, not sure if that’s a realistic assessment. Additionally we can see that TNG takes the cake for both the best and the worst liked episodes all over, so… yay TNG I guess? Idunno.

## Inter-Show Comparisons

Let’s compare all the shows in the statsy way with a simple ANOVA by show.
If this results in a significant result, which it probably will, it indicates that at least one show has a significantly different variation than the other shows.

tadaa_aov(rating ~ show_abr, data = startrek, print = "markdown")

Table 1: One-Way ANOVA: Using Type III Sum of Squares

Term df SS MS F p $\eta^2$ Cohen’s f Power
show_abr 6 8.96 1.49 14.72 < 0.001 0.11 0.35 1
Residuals 732 74.24 0.1
Total 738 83.2 1.59

Welp, I don’t have to look at the pariwise comparisons to tell you that TAS is the odd one out because it’s obviously rated consistently lower than the others.
To make it more interesting, we’ll look at the remaining shows if we kick out TAS:

startrek %>%
filter(show_abr != "TAS") %>%
tadaa_aov(rating ~ show_abr, data = ., print = "markdown")

Table 2: One-Way ANOVA: Using Type III Sum of Squares

Term df SS MS F p $\eta^2$ Cohen’s f Power
show_abr 5 4.58 0.92 9.01 < 0.001 0.06 0.25 1
Residuals 711 72.26 0.1
Total 716 76.84 1.02

Well I’ll be damned. Who’da thunk.
The effect ($\eta^2$ and $f$) is much smaller than before, but we have a lot of statistical power, presumably due to the large sample size.
Let’s look at the pairwise comparisons:

startrek %>%
filter(show_abr != "TAS") %>%
aov(rating ~ show_abr, data = .) %>%
TukeyHSD() %>%
broom::tidy() %>%
theme(legend.position = "none")

These errorbars indicate which pairwise comparison (e.g. “mean rating of ENT minus mean rating of DS9” in the first row) results in a significant difference from 0, assuming that if the shows have the same mean rating, the difference would be 0. The direction of the difference is determined by the labels to the left, as mentioned, so we can say that TOS is “significantly worse” than Voyager, at least according to mean episode ratings and variation.

The lasr thing I’d like to take a look at is the possible difference between the total show rating on trakt and the mean episode rating of each show. Since you have the option to give each show an over-all rating without rating each episode, I’m assuming that the nostalgia factor is strong in that regard and many people might give TNG a nostalgia-inflated rating compared to people who rewatched each episode and rated them as they judge them today(ish).
Long story short, here’s the over-all show ratings, sorted by rating:

stshows_titles$rating <- map_dbl(stshows_titles$slug, function(x) {
trakt.show.ratings(x)\$rating
})

ggplot(data = stshows_titles, aes(x = reorder(abbrv, rating),
y = rating, fill = abbrv)) +
geom_col(color = "black", alpha = .75) +
coord_flip() +
scale_fill_brewer(palette = "Set2") +
scale_y_continuous(breaks = seq(0, 10, 1),
minor_breaks = seq(0, 10, .5)) +
labs(title = "Star Trek TV Shows",
subtitle = "Show Ratings on trakt.tv",
x = "",
y = "Rating (1-10)",
fill = "Show",
caption = plot_caption) +
theme(legend.position = "bottom")

Now we can use that plot and build upon it. I’ll draw the mean episode rating for each show including errorbars on top of the previous plot, so hold my beer:

stshows_titles %>%
transmute(show_rating = rating,
show_abr = abbrv) %>%
full_join(startrek) %>%
ggplot() +
geom_col(data = stshows_titles,
aes(x = reorder(abbrv, rating),
y = rating, fill = abbrv),
color = "black", alpha = .75) +
stat_summary(aes(x = reorder(show_abr, show_rating),
y = rating),
fun.data = mean_ci_t,
geom = "errorbar", width = .5) +
stat_summary(aes(x = reorder(show_abr, show_rating), y = rating,
color = show_abr),
fun.y = mean,
geom = "point", shape = 21, color = "black") +
coord_flip() +
scale_fill_brewer(palette = "Set2") +
scale_color_brewer(palette = "Set2") +
scale_y_continuous(breaks = seq(0, 10, 1),
minor_breaks = seq(0, 10, .5)) +
labs(title = "Star Trek TV Shows",
subtitle = "Show Ratings on trakt.tv",
x = "",
y = "Rating (1-10)",
fill = "Show",
caption = plot_caption) +
theme(legend.position = "bottom")

Neat. These little tie-fighters (wrong franchise, I know) represent the mean episode rating with its confidence interval. What we can learn from that plot is how TNG has a high show rating, but individual episodes tend to be rated lower on average when compared to the over-all show rating.
Interestingly, it’s the other way around fpr TAS, where the average episode rating is higher than the show rating, so apparently people who remember it liked it better than the people who watched it? Not quite sure, but I’m certain you can figure out a way to rationalize this effect, I’m out ¯\_(ツ)_/¯

On the other hand, Discovery is very consistent with its mean episiode rating CI enclosing the current show rating over all.
That’s neat.
If you disagree, don’t email me.