# I analyzed some world penis data - because why not

**Edit: 2016-12-18 02:13:19**

Please note that this analysis is out of date and the code to acquire the data no longer works, since the source website has restructured and I have not found a way to reproduce the old behavior. Also, the current analysis is located at https://worldpenis.tadaa-data.de, so please go there for up to date code and analysis. It’s prettier. And better.

If there’s one thing I just can’t resist, it’s publicly available tabular data containing adequate amounts of numeric values. Naturally, I couldn’t resist the World Penis Data I stumbled upon somewhere over at Reddit.

So, let’s suck that data out of the web and put it into our favorite data structure.

```
library(rvest)
library(dplyr)
library(tidyr)
library(broom)
library(ggplot2)
library(pixiedust)
penis <- html("http://www.everyoneweb.com/worldpenissize") %>% html_table(fill = T)
penis <- penis[117] # Fish out the correct table
penis <- penis[[1]] # Subset because nested list
names(penis) <- penis[2, ] # Get the names out of the second row
penis <- penis[c(-1, -2), ] # The first two rows are trash
names(penis) <- c("Country", "Region", "length_flaccid", "length_erect", "length_erect_in",
"circumf_flaccid", "circumf_erect", "circumf_erect_in", "volume", "Method",
"N", "Source")
# Convert comma decimal seperator to dot so it can be as.numeric'd
penis[c(3:9, 11)] <- as.data.frame(lapply(penis[c(3:9, 11)], function(x){
as.numeric(gsub(",", ".", x = x))
}))
penis$Method <- factor(penis$Method)
penis$Region <- factor(penis$Region)
# More informative volumes
penis <- penis %>%
mutate(volume_erect = length_erect * (circumf_erect/pi/2)^2 * pi,
volume_flaccid = length_flaccid * (circumf_flaccid/pi/2)^2 * pi) %>%
select(-Method, -Source, -N, everything(), N, Method, Source) %>%
select(-volume)
knitr::kable(head(penis))
```

Country | Region | length_flaccid | length_erect | length_erect_in | circumf_flaccid | circumf_erect | circumf_erect_in | volume_erect | volume_flaccid | Method | N | Source |
---|---|---|---|---|---|---|---|---|---|---|---|---|

Afghanistan | Central Asia | 9.5 | 13.69 | 5.4 | 9.1 | 11.42 | 4.50 | 142.08 | 62.60 | Measured | 100 | Journal of Urology (mentioned in 2011) |

Albania | Europe | 9.8 | 14.19 | 5.6 | 9.7 | 12.16 | 4.79 | 166.97 | 73.38 | Self reported | 95 | Journal of Sexology 2006 |

Algeria | Africa | 9.9 | 14.49 | 5.7 | 8.9 | 10.97 | 4.32 | 138.76 | 62.40 | Self reported | 738 | https://www.surveymonkey.com - 2015 |

Angola | Africa | 10.0 | 15.73 | 6.2 | 9.6 | 11.82 | 4.65 | 174.89 | 73.34 | Measured | 978 | University Agostinho Neto 2001 |

Argentina | South America | 9.4 | 14.88 | 5.9 | 8.9 | 11.45 | 4.51 | 155.24 | 59.25 | Self reported | 1669 | Journal of Urology 2013 |

Armenia | Europe | 10.5 | 13.12 | 5.2 | 8.6 | 10.78 | 4.24 | 121.33 | 61.80 | Measured | 469 | Ուրոլոգիայի Առողջության Պահպանման Ծառայություն Armenia - 2015 |

So, with that data, what can we look at? How about we look at the relation between flaccid and erect penis length across the different regions, that seems like a reasonable thing to do.

```
ggplot(data = penis, aes(x = length_flaccid, y = length_erect, colour = Region)) +
geom_point(size = 5, colour = "black") +
geom_point(size = 4) +
geom_smooth(method = lm, se = F) +
labs(title = "World Penis Data", y = "Erect Length (cm)", x = "Flaccid Length (cm)")
```

But wait, maybe we should take a look at the general distributions before we dive any deeper, shall we?

```
penis %>% gather(State, Length, length_erect, length_flaccid) %>%
mutate(State = factor(State, labels = c("Erect", "Flaccid"))) %>%
ggplot(aes(x = Length, fill = State)) +
geom_histogram(binwidth = .5, alpha = .6) +
geom_density(aes(y = ..count..), alpha = .7) +
labs(title = "Penis Length", x = "Length (cm)", y = "Count")
```

```
penis %>% gather(State, Circumference, circumf_erect, circumf_flaccid) %>%
mutate(State = factor(State, labels = c("Erect", "Flaccid"))) %>%
ggplot(aes(x = Circumference, fill = State)) +
geom_histogram(binwidth = .5, alpha = .6) +
geom_density(aes(y = ..count..), alpha = .7) +
labs(title = "Penis Circumference", x = "Circumference (cm)", y = "Count")
```

```
penis %>% gather(State, Volume, volume_erect, volume_flaccid) %>%
mutate(State = factor(State, labels = c("Erect", "Flaccid"))) %>%
ggplot(aes(x = Volume, fill = State)) +
geom_histogram(binwidth = 1, alpha = .6) +
geom_density(aes(y = ..count..), alpha = .7) +
labs(title = "Penis Volume", x = "Volume (cm^3)", y = "Count")
```

```
penis %>% mutate(Growth = length_erect / length_flaccid) %>%
ggplot(aes(x = Growth)) +
geom_histogram(binwidth = .01, alpha = .9) +
labs(title = "Penis Growth Factor (Length)", x = "Growth: Erect by Flaccid Length (cm)", y = "Count")
```

```
penis %>% mutate(Growth = volume_erect / volume_flaccid) %>%
ggplot(aes(x = Growth)) +
geom_histogram(binwidth = .01, alpha = .9) +
labs(title = "Penis Growth Factor (Volume)", x = "Growth: Erect by Flaccid Volume (cm^3)", y = "Count")
```

Wait a minute, that last one seems odd. Apparently, there’s a country with a massive volume growth rate. Let’s find out which one that is.

```
penis %>% mutate(Growth = volume_erect / volume_flaccid) %>%
select(Country, contains("volume"), Growth, N, Source) %>%
arrange(desc(Growth)) %>% head(1) %>% knitr::kable
```

Country | volume_erect | volume_flaccid | Growth | N | Source |
---|---|---|---|---|---|

Buzzfeed Motion Pictures | 198 | 51.65 | 3.833495 | 11 | Keith Habersberger, Los Angeles, BuzzFeed Motion Pictures - 2015 |

Uhm… Buzzfeed Motion Pictures, eh? Well, whatever they are doing in that dataset, at least they have something to show for it.

Let’s forget about that and get to the next thing: You may have noticed that column labelled “Method”, containing information as to how the data has been obtained, i.e. via a presumably objective measurement, or self reported, presumably by the penis owners. Naturally, we assume there’s a discrepancy in length between the two categories, so let’s get on with the analysis.

```
penis %>% gather(State, Length, length_erect, length_flaccid) %>%
mutate(State = factor(State, labels = c("Erect", "Flaccid"))) %>%
ggplot(data = ., aes(x = Length, fill = Method)) +
geom_histogram(binwidth = .5, position = "dodge") +
geom_density(colour = "black", alpha = 0.5, aes(y = ..count..)) +
facet_grid(. ~ State, scales = "free_x") +
labs(title = "Penis Length by Measure of Reporting", y = "Count", x = "Length (cm)")
```

Well, that sure comes as a surprise. But we don’t want to jump the gun here, let’s pretend we know what we’re doing and throw some statistics at it. We’ll t-test both erect and flaccid length in both measurent groups and see if we get a significant result.

```
t.test(length_erect ~ Method, data = penis, var.equal = TRUE) %>%
tidy %>% select(-contains("conf")) %>% dust %>%
sprinkle_colnames(estimate1 = "Mean (Measured)",
estimate2 = "Mean (Self Reported)",
statistic = "t",
parameter = "df") %>%
sprinkle(col = c(1:3), round = 2) %>%
sprinkle(col = 4, fn = quote(pvalString(value))) %>%
sprinkle_print_method("markdown")
```

Mean (Measured) | Mean (Self Reported) | t | p.value | df |
---|---|---|---|---|

13.46 | 14.74 | -4.39 | < 0.001 | 144 |

```
t.test(length_flaccid ~ Method, data = penis, var.equal = TRUE) %>%
tidy %>% select(-contains("conf")) %>% dust %>%
sprinkle_colnames(estimate1 = "Mean (Measured)",
estimate2 = "Mean (Self Reported)",
statistic = "t",
parameter = "df") %>%
sprinkle(col = c(1:3), round = 2) %>%
sprinkle(col = 4, fn = quote(pvalString(value))) %>%
sprinkle_print_method("markdown")
```

Mean (Measured) | Mean (Self Reported) | t | p.value | df |
---|---|---|---|---|

9.26 | 9.99 | -4.1 | < 0.001 | 144 |

It turns out the difference is actually statistically significant. Yes, this might not be methodologically sound, but come on, I’m throwing code at penis data.

One last thing I’d like to look at is the length growth factor in relation to the flaccid length across different regions:

```
penis %>% mutate(growth = length_erect / length_flaccid) %>%
ggplot(data = ., aes(x = length_flaccid, y = growth, color = Region)) +
geom_point(size = 5, colour = "black") +
geom_point(size = 4) +
geom_smooth(method = lm, se = F) +
labs(title = "World Penis Data",
y = "Growth: Erect by Flaccid Length (cm)",
x = "Flaccid Length (cm)")
```

Well, I certainly didn’t expect that amount of vertical grouping and ambivalent trend lines. If anyone as an idea how that can be explained, let me know.

**Edit 2015-08-06 00:47**