Sandbox Statistics: Methodological Insurgency Edition

Epistemic status: almost certainly did something egregiously wrong, and I would really like to know what it is.


I collect papers. Whenever I run across an interesting paper, on the internet or referenced from another paper, I add it to a list for later perusal. Since I'm not discriminating in what I add, I add more things than I can ever hope to read. However, "What Went Right and What Went Wrong": An Analysis of 155 Postmortems from Game Development (PDF) caught my eye: an empirical software engineering process paper, doing a postmortem on the process of doing postmortems? That seemed relevant to me, a software engineer who does wrong things once in a while, so I pulled it out of paper-reading purgatory and went through it.

The paper authors wanted to study whether different sorts of game developers had systematic strengths and weaknesses: for example, they wanted to know whether "a larger team produces a game with better gameplay." To this end, they gathered 155 postmortems off Gamasutra[1] and coded each one[2] for positive or negative outcomes in different parts of the game development process, like art direction or budget. They then correlated these outcomes with developer attributes, looking for systematic differences in outcomes between different sorts of developers.

I'll be upfront: there are some problems with the paper, prime of which is that the authors are a little too credulous given the public nature of the postmortems. As noted on Hacker News, the companies posting these postmortems are strongly disincentivised from being completely honest: publicly badmouthing a business partner is bad for business, airing the company's dirty laundry is bad for business, and even saying "we suck" is bad for business. Unless the company is going under, there's little pressure to put out the whole truth and nothing but the truth, and a whole lot of pressure to omit the hard parts of the truth, or even shade it[3]. It's difficult to say that conclusions built on such an unstable foundation are true. A second problem is the absence of any discussion of statistical significance; without knowing whether statistical rigor was present, we don't know whether the conclusions drawn are distinguishable from noise.

We can't do much about the probably shaded truth in the source material, but we might be able to do something about the lack of statistical rigor. The authors graciously made their data public[4], so we can run our own analyses on the same data they used. Of course, any conclusions we draw are still suspect, but it means that even if I screw up the analysis, the worst that could happen is some embarrassment to myself: if I end up prescribing practicing power poses in the mirror or telling Congress that cigarettes are great, no one should be listening to me, since they already know my source data is questionable.

Now we have a sandbox problem and sandbox data: how do we go about finding statistically significant conclusions?

p-values

Before we dive in, a quick primer about p-values[5]. If you want more than this briefest of primers, check out the Wikipedia article on p-values for more details.

Roughly speaking, the p-value is the probability of seeing data at least as extreme as ours if the null hypothesis, the boring, no-interesting-effect explanation, were true. The lower the p-value, the harder it is for the boring explanation to account for the data we saw.

For example, if we're testing for a loaded coin, our boring null hypothesis is "the coin is fair". If we flip a coin 3 times, and it comes up heads twice, how do we decide how likely it is that a fair coin would generate this data? Assuming that the coin is fair, it's easy to see that the probability of a specific sequence of heads and tails, like HTH, is (\frac{1}{2})^3 = \frac{1}{8}. We need to use some combinatorial math in order to find the probability of 2 heads and 1 tail in any order. We can use the "choose" operation to calculate that 3 \text{ choose } 2 = {{3}\choose{2}} = 3 different outcomes match 2 heads and 1 tail. With 3 coin flips, there are 8 equally probable outcomes possible, so our final probability of 2 heads and 1 tail in any order is 3/8.

However, neither of these is the probability of the coin being fair. Intuitively, the weirder the data, the less weight we should give to the null hypothesis: if we end up with 20 heads and 2 tails, we should be suspicious that the coin is not fair. We don't want to simply use the probability of the outcome itself, though: ending up with one of 100 equally probable outcomes is unremarkable (one of them had to win, and they were all equally likely to win), while ending up with an unlikely option instead of a likely option is remarkable. By analogy, receiving 1 inch of rain instead of 1.1 inches in Seattle is unremarkable, even if getting exactly 1 inch of rain is unlikely. Receiving any rain at all in the Sahara Desert is remarkable, even if it's the same probability as getting exactly 1 inch of rain in Seattle. The weirdness of our data depends not just on the probability of the event itself, but also on the probability of the other events in our universe of possibilities.

The p-value is a way to solidify this reasoning: instead of using the probability of the outcome itself, it is the sum of the probability of all outcomes equally or less probable than the event we saw[6]. In the coin case, we would add the probability of 2 heads and 1 tail (3/8) with the probability of the more extreme results, all heads (1/8), for p=0.5.

But wait! Do we also consider a result of all tails to be more extreme than our result? If we only consider head-heavy results in our analysis, that is known as a one-tailed analysis. If we stay with a one-tailed analysis, then we will in essence be stating that we knew all along that the coin would always have more heads in a sequence, and we only wanted to know by how much it did so. This obviously does not hold in our case: we started by assuming the coin was fair, not loaded, so tails-heavy outcomes are just as extreme as heads-heavy outcomes and should be included. When we do so, we end up with p=1.0: the data fits the null hypothesis closely[7]. One-tailed analysis is only useful in specific cases, and I'll be damned if I fully understand those cases, so we'll stick to two-tailed analyses throughout the rest of this post.
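As a sanity check, base R's exact binomial test reproduces both numbers; a quick sketch, not anything taken from the paper:

# Two-tailed: every outcome is as likely or less likely than 2 heads and 1 tail.
binom.test(x = 2, n = 3, p = 0.5, alternative = "two.sided")   # p-value = 1
# One-tailed: only heads-heavy outcomes count as extreme.
binom.test(x = 2, n = 3, p = 0.5, alternative = "greater")     # p-value = 0.5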

If there were only ever two hypotheses, like the coin being fair, or not, then rejecting one implies the other. However, note that rejecting the null hypothesis says nothing about choosing between multiple other hypotheses, like whether the coin is biased towards the head or tail, or by how much a coin is biased. Those questions are certainly answerable, but not with the original p-value.

How low a p-value is low enough? Traditionally, scientists have treated p<0.05 as the threshold of statistical significance: if the null hypothesis were true, it would generate data this extreme less than 1/20th of the time purely by chance, which is pretty unlikely, so we should feel safe rejecting the boring null hypothesis[8].

There are problems with holding the p<0.05 threshold as sacrosanct: it turns out that making p<0.05 a requirement for publication invites all sorts of fudging with the p-value (p-hacking)[9], which is playing a part in psychology's replication crisis, and which is where the second part of this post's title comes from[10].

For these reasons, the p-value is a somewhat fragile tool. However, it's the one we'll be using today.

Adjusting expectations

The first step is simple: before looking at any of the data, can we know whether any conclusions are even possible?

That first step is a power analysis: finding out whether 155 postmortems is enough data to produce significant results. First, we need to choose an expected effect size we think our data will display: usual values range from 0.1 (a weak effect) to 0.5 (a strong effect). Yes, what you choose is subjective. We already know how many data points we have, 155 (normally we would be solving for this value, to see how big our sample size would have to be). I'm not going to calculate this by hand, and will instead use R, a commonly used statistical analysis tool (for details on running this, see the appendix below). Choosing a "medium" effect size of 0.3 with n=155 data points tells us that we have a projected 25% false negative rate, a ~1/4 chance of missing an existing effect purely by chance. It's not a great outlook, but we can't go back and gather more data, so we'll just have to temper our expectations and live with it.

What about looking at other parts of the general experiment? One potential problem that pops out is the sheer number of variables that the experiment considers. There are 3 independent variables (company attributes), and 22 dependent variables (process outcomes) that we think the independent variables affect, for a total of 3\cdot 22=66 different correlations that we are looking at separately. This is a breeding ground for the multiple comparisons problem: comparing multiple results against the same significance threshold increases the chances that at least one conclusion is falsely accepted (see this XKCD for a pictorial example). If you want to hold steady the overall chance of falsely accepting at least one conclusion, then you need to make the evidential threshold for each individual correlation stricter.

But how much stricter? Well, we can pick between the Bonferroni, the Sidak, and the Holm-Bonferroni methods.

The Bonferroni method simply takes your overall threshold of evidence, and divides by the number of tests you are doing to get the threshold of evidence for any one comparison. If you have m=5 tests, then you have to be 5 times as strict, so 0.05/5 = 0.01. This is a stricter restriction than necessary: however, it's easy to calculate, and it turns out to be a pretty good estimate.

The Sidak method calculates the exact per-comparison threshold needed to hit a given overall threshold of evidence. The previous method, the Bonferroni, is fast to calculate, but it calls some outcomes insignificant even when there is in fact enough evidence to label them significant. The Sidak method correctly marks those outcomes as significant, in exchange for a slightly more involved calculation. The equation is:

p_{comparison} = 1 - (1 - p_{overall})^{1/m}

There's some intuition for why this works in a footnote [11].

If p_{overall}=0.05 (as is tradition) and m=5, then p_{comparison}=0.0102. This is not that much less strict than the Bonferroni bound, which is simply p_{Bonferroni}=0.01, but sometimes you just need that extra leeway.

The Holm-Bonferroni method takes a different tack: instead of asking every comparison to pass the same stringent test, it asks only the strongest results to pass the strictest thresholds, and then allows successive results to meet progressively less strict standards.

We want to end up with an experiment-wide significance threshold of 0.05, so we sort the p-values from low to high and ask whether each one is beneath the overall threshold divided by the number of p-values remaining (m for the smallest, m-1 for the next, and so on), stopping once we reach a p-value that misses its threshold. For example: let's say that we have 5 p-values, ordered from low to high: 0.0001, 0.009, 0.017, 0.02, 0.047. Going in order, 0.0001 < 0.05/5 = 0.01, and 0.009 < 0.05/4 = 0.0125, but 0.017 > 0.05/3 = 0.0167, so we stop there, consider the first two results significant, and reject the rest.
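If you'd rather not walk the thresholds by hand, base R's p.adjust does the Holm bookkeeping for you; here's a sketch on the example p-values above (comparing the adjusted values against 0.05 is equivalent to the stepwise procedure):

pvals <- c(0.0001, 0.009, 0.017, 0.02, 0.047)
p.adjust(pvals, method = "holm") < 0.05
# TRUE TRUE FALSE FALSE FALSE: only the first two survive, matching the walkthrough above.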

There is a marvelous proof detailing why this works which is too large for this post, so I will instead direct you to Wikipedia for the gory details.

With these methods, if we wanted to maintain a traditional p=0.05 threshold with m=66 different comparisons, we need to measure each individual comparison[12] against a p-value of:

p_{Bonferroni}=0.000758
p_{Sidak}=0.000777
p_{Holm}=(\text{between } 0.000758 \text{ and } 0.05)
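These thresholds are one-liners to reproduce in R (just the arithmetic from the formulas above):

m <- 66
0.05 / m                 # Bonferroni: ~0.000758
1 - (1 - 0.05)^(1 / m)   # Sidak: ~0.000777
# Holm has no single number: the smallest p-value is tested against 0.05/66,
# the next against 0.05/65, and so on, up to 0.05 for the largest.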

We haven't even looked at the data, but we're already seeing that we need to meet strict standards of evidence, far beyond the traditional 0.05 threshold. And with n=155 data points at best (not all the postmortems touch on every outcome), it seems unlikely that we can meet these standards.

Perhaps I spoke too soon, though: can the data hit our ambitious p-value goals?

Testing the data

So how do we get p-values out of the data we have been given?

Keep in mind that we're interested in comparing the proportions of "this went well" and "this went poorly" responses for different types of companies, and asking ourselves whether there's any difference between the types of companies. We don't care about whether one population is better or worse, just whether their outcomes are different enough. In other words, we're interested in whether the different populations of companies share the same underlying proportion of good outcomes.

We'll use what's known as a contingency table to organize the data for each test. For instance, let's say that we're looking at whether small or large companies are better at doing art, which will produce the following table:

            Small Company    Large Company
Good Art         28               16
Bad Art          12                6

We want to compare the columns, and decide whether they look like they're being drawn from the same source (our null hypothesis). This formulation is nice, because it makes it obvious that the more data we have, the more similar we expect the columns to look under the null hypothesis, thanks to the law of large numbers. But how do we compare the columns in a rigorous way? I mean, they look like they have pretty similar proportions; how different can the proportions in each column get before they are too different? It turns out that we have several choices available for deciding how different is too different.
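In R, that contingency table is just a 2x2 matrix; a sketch using the made-up art numbers above:

art <- matrix(c(28, 12, 16, 6), nrow = 2,
              dimnames = list(c("Good Art", "Bad Art"),
                              c("Small Company", "Large Company")))
art  # each of the tests below works from counts like these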

z-test, t-test

The straightforward option is built into R, and called prop.test. Short for "proportions test", it returns a p-value for the null hypothesis that two populations have the same proportion of outcomes, which is exactly what we want.

However, a little digging shows that there are some problematic assumptions hidden behind the promising name. Namely, prop.test is based on the z-test[13], which is built on the chi-squared test, which is built on the assumption that large sample sizes are available. Looking at our data, it's clear our samples are not large: a majority of the comparisons are built on fewer than 40 data points. prop.test handily has an option to compensate, the Yates continuity correction, which adjusts p-values for small sample sizes. However, people on CrossValidated don't trust Yates, and given that I don't understand what the correction is doing, we probably shouldn't either.
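For what it's worth, the call itself is short; a sketch using the made-up art numbers (28 of 40 small-company postmortems and 16 of 22 large-company postmortems report good art):

prop.test(x = c(28, 16), n = c(40, 22))                   # Yates correction applied by default
prop.test(x = c(28, 16), n = c(40, 22), correct = FALSE)  # plain chi-squared/z-test, no correction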

Instead, we should switch from the z-test to the t-test: Student's t-test was designed with small samples in mind, and does not need any questionable corrections. It's a little harder to use than the z-test, especially since we can't assume the two groups have equal variances, but it's worth the effort.
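One way to run the t-test on this kind of yes/no data is to encode each postmortem's outcome as 0 or 1 and compare group means. Here's a sketch of that approach (my reconstruction, not necessarily the exact procedure behind the results below):

# 1 = good art, 0 = bad art, using the made-up art numbers again.
small <- c(rep(1, 28), rep(0, 12))
large <- c(rep(1, 16), rep(0, 6))
t.test(small, large)  # Welch's t-test: R does not assume equal variances by default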

Fisher

However, the t-test still assumes that the populations being compared are drawn from a normal distribution. Is our data normal? I don't know; how do you even check whether binary data (good/bad) is normal? It would be great if we could just sidestep this, and use a test that didn't assume our data was normal.

It turns out that one of the first usages of p-values matches our desires exactly. Fisher's exact test was devised for the "lady tasting tea" experiment, which tested whether someone could tell whether the milk had been poured into the cup before or after the tea[14]. This test is pretty close to what we want, and has the nice property that it is exact: unlike the t-test, it is not an approximation that relies on an assumption of normal data.
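Base R ships this as fisher.test, which takes the contingency table directly; a sketch with the made-up art table:

art <- matrix(c(28, 12, 16, 6), nrow = 2)  # same hypothetical table as before
fisher.test(art)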

Note that the test is close, but not exactly what we want. The tea experiment starts by making a fixed number of cups with the milk added first, and a fixed number of cups with the tea added first. This assumption bleeds through into the calculation of the p-value: as usual, Fisher's test calculates the p-value by looking at all possible contingency tables that are "more extreme" (less probable) than our data, and then adding up the probability of all those tables. (The probability of a table is calculated with some hypergeometric math: see the Wikipedia article for details.) However, while looking for more extreme tables, it only considers tables that add up to the same column and row totals as our data. With our earlier example, we have:

  28     16    =44
  12      6    =18
 =40    =22

All the marginal values (the =44, =18, =40, and =22 totals) would be held constant. See the extended example on Wikipedia, especially if you're confused about how we can shuffle the numbers around while keeping the sums the same.

This assumption does not exactly hold in our data: we didn't start by getting 10 large companies and 10 small companies and then having them make a game. If we did, it would be unquestionably correct to hold the column counts constant. As it stands, it's better to treat the column and row totals as variables, instead of constants.

Barnard

Helpfully, there's another test that drops that assumption: Barnard's test. It's also exact, and also produces a p-value from a contingency table. It's very similar to Fisher's test, but does not hold the column and row sums constant when looking for more extreme tables (note that it does hold the total number of data points constant). There are several variants of Barnard's test based on how exactly one decides whether a table is more extreme or not, but the Boschloo-Barnard variant is held to be uniformly more powerful than Fisher's test.

The major problem with Barnard is that it is computationally expensive: all the other tests run in under a second, but running even approximate forms of Barnard takes considerably longer. Solving for non-approximate forms of Barnard with both columns and rows unfixed takes tens of minutes. With 66 comparisons to calculate, this means it's something to leave running overnight with a beefy computer (thank the gods for Moore's law).

You can see the R package documentation (PDF) for more details on the different flavors of Barnard and the options available. In our case, we'll use Boschloo-Barnard, and allow both rows and columns to vary.
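The package in question is Exact; the call I believe corresponds to the settings described above looks roughly like this (the method and model values come from that documentation, so treat this as a sketch and double-check them against your installed version):

install.packages("Exact")  # only needed once
library(Exact)
art <- matrix(c(28, 12, 16, 6), nrow = 2)  # same made-up art table as before
# method = "boschloo" picks the Boschloo-Barnard variant;
# model = "Multinomial" lets both row and column totals vary. This is the slow one.
exact.test(art, method = "boschloo", model = "Multinomial")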

Outcomes

So now we have our data, a test that will tell us whether the populations differ in a significant way, and a few ways to adjust our p-values to account for multiple comparisons. All that remains is putting it all together.

When we get a p-value for each comparison, we get (drum roll): results in a Google Sheet, or a plain CSV.

It turns out that precisely 1 result passes the traditional p=0.05 threshold with Barnard's test. This is especially bad: if there were no effect whatsoever, we would naively expect 66 \cdot 0.05 \sim 3 of the comparisons to give back a "significant" result purely by chance. So we didn't even reach the level of "spurious results producing noise", let alone the stricter multiple-comparison-adjusted thresholds we calculated earlier.

This is partly due to a lack of data so severe that some of the tests simply can't run: for example, no large companies touched on their experience with community support, either good or bad. With one empty column, none of the tests can give back a good answer. Only a few comparisons had this exact shortcoming; the rest likely suffer from a milder version of the same problem, where there were only tens of data points on each side, which doesn't inspire much confidence, and hence produces higher p-values.

In conclusion, there's nothing we can conclude, folks, it's time to pack it up and go home.

p-value Pyrotechnics

Or, we could end this Mythbusters style: the original experiment didn't work, but how could we make it work, even if we were in danger of losing some limbs?

In other words, the data can't pass a p=0.05 threshold, but that's just a convention decided on by the scientific community. If we loosened this threshold, how far would we have to loosen it in order to have a statistically significant effect in the face of multiple comparisons and the poor performance of our data?

It turns out that reversing Bonferroni correction is impossible: trying to multiply p=0.023 (the lowest Barnard-Boschloo p-value) by 66 hands back 0.023 \cdot 66 \sim 1.5, which is over 1.0 (100%), which is ridiculous and physically impossible. The same holds for Holm-Bonferroni, since it's built on Bonferroni.

So let's ditch Barnard-Boschloo: the t-test hands back a tiny p-value in one case, at 5.14 \cdot 10^{-6}. This we can work with! 5.14 \cdot 10^{-6} \cdot 66 = 0.000339, far below 0.05. This is pretty good; this outcome even passes our stricter multiple-comparison-adjusted tests. But what if we want more statistically significant results? If we're willing to push it to the limit, setting p_{overall}=0.9872 gives us just enough room to snag 3 statistically significant conclusions, either with Bonferroni or Holm-Bonferroni applied to the t-test results. Of course, the trade-off is that we are then virtually certain to be accepting a false positive conclusion, even before taking into account that we are using p-values generated by a test that doesn't exactly match our situation.

Reversing Sidak correction gets us something saner: with 66 tests and our lowest Barnard-Boschloo p-value, p=0.023, we have an overall 1-(1-0.023)^{66}=p_{overall}=0.785. Trying to nab a 2nd statistically significant conclusion pushes p_{overall}=0.991. Ouch.
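For the record, the arithmetic behind those numbers (nothing fancier than the formulas from earlier):

0.023 * 66            # reversed Bonferroni on the best Barnard-Boschloo p-value: ~1.5, nonsense
5.14e-6 * 66          # reversed Bonferroni on the best t-test p-value: ~0.00034
1 - (1 - 0.023)^66    # reversed Sidak on the best Barnard-Boschloo p-value: ~0.785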

This means that we can technically extract conclusions from this data, but the conclusions are ugly. A p=0.785 means that even if there is no effect in any of our data, we expect to see at least one spurious positive result almost 80% of the time. That's worse than a coin flip. We're not going to publish in Nature any time soon, but we already knew that. Isn't nihilism fun?

Conclusions

So, what did we learn today?

  • How to correct for multiple comparisons: if there are many comparisons, you have to make each individual test stricter to keep the overall false positive rate under control.
  • How to compare proportions of binary outcomes in two different populations.

At some point I'll do a Bayesian analysis for the folks in the back baying for Bayes: just give me a while to get through a textbook or two.

Thanks for following along.

Appendix: Running the Analysis

If you're interested in the nitty gritty details of actually running the analyses, read on.

For doing the power analysis, you want to install the pwr package in R. In order to run a power analysis for the proportion comparison we'll end up doing, use the pwr.2p.test function (documentation (PDF)) with n=155 data points and a "medium" effect size (0.3). The function will hand back a power value, which is the complement of the false negative rate (1-\text{false negative rate}). If you want to do a power analysis for other sorts of tests not centered around comparing population proportions, you will need to read the pwr package documentation for the other functions it provides.
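Concretely, the call looks something like this (a sketch matching the numbers used in the post; note that pwr.2p.test interprets n as observations per group):

install.packages("pwr")  # only needed once
library(pwr)
# h = 0.3 is the assumed "medium" effect size, n = 155 the number of data points.
pwr.2p.test(h = 0.3, n = 155, sig.level = 0.05)
# The reported power is roughly 0.75, i.e. about a 25% false negative rate.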

Now on to the main analysis…

First, download the xlsx file provided by the paper author (gamasutra_codes.xslx, in a zip hosted by Google Drive).

The "Codes" sheet contains all the raw data we are interested in. Extract that sheet as a CSV file if you want to feed it to my scripts. The "Results" sheet is also interesting in that it contains what was likely the original author's analysis step, and makes me confident that they eyeballed their results and that statistical power was not considered.

Second, we need to digest and clean up the data a bit. To paraphrase Edison, data analysis is 99% data cleaning, and 1% analysis. A bit of time was spent extracting just the data I needed. Lots of time was spent defending against edge cases, like rows for the same case disagreeing on variable values that should be identical, and then transforming the data into a format I better understood. There are asserts littering my script to make sure that the format of the data stays constant as it flows through the script: this is definitely not a general purpose data cleaning script.

You can check out the data cleaning script as a Github gist (written in Python).

This data cleaning script is meant to be run on the CSV file we extracted from the xlsx file earlier (I named it raw_codes.csv), like so:

python input_script.py raw_codes.csv clean_rows.csv 

The actual data analysis itself was done in R, but it turns out I'm just not happy "coding" in R (why is R so terrible?[15][16]). So, I did as much work as possible in Python, and then shipped it over to R at the last possible second to run the actual statistical tests.

Get the Python wrapper script, also as a Github gist.

Get the R data analysis script used by the wrapper script, also as a Github gist.

The R script isn't meant to be invoked directly, since the Python wrapper script will do it, but it should be in the same directory. Just take the CSV produced by the data cleaning step, and pass it to the wrapper script like so:

python analysis.py clean_rows.csv \
    --t_test --fischer_test \
    --barnard_csm_test \
    --barnard_boschloo_test 

This produces a CSV analysis_rows.csv, which should look an awful lot like the CSV I linked to earlier.


Math rendering provided by KaTeX.


[1] The video game community has a culture that encourages doing a public retrospective after the release of a game, some of which end up on Gamasutra, a web site devoted to video gaming.

[2] The authors tried to stay in sync while encoding the postmortems, to make sure that each rater's codings were reasonably correlated with the others', but they didn't use a more rigorous measure of inter-rater reliability, like Cronbach's alpha.

[3] Even if the company is going under, there are likely repercussions that a no-holds-barred retrospective would have for the individuals involved.

[4] It turns out Microsoft wiped the dataset supposedly offered (probably due to a site re-organization: remember, it's a shame if you lose any links on your site!), but thankfully one of the authors had a copy on their site. Kudos to the authors, and that author in particular!

[5] This is also your notice that this post will be focusing on traditional frequentist tools and methods. Perhaps in the future I will do another post on using Bayesian methods.

[6] One of the curious things that seems to fall out of this formulation of the p-value is that you can obtain wildly different p-values depending on whether your outcome is a little less or a little more likely. Consider that there are 100 events, 98 of which happen with probability 1/100, and one that happens with probability 0.00999 (event A), for 0.01001 remaining probability on the last event (event B). If event A happens, p=0.00999, but if event B happens, p=1.0. These events happen with mildly different probabilities, but lead to vastly different p-values. I don't know how to account for this sort of effect.

[7] This is kind of a strange case, but it makes sense after thinking about it. Getting an equal number of heads and tails would be the most likely outcome for a fair coin (even if that exact outcome happens with low probability, everything else is even less probable). Since we're flipping an odd number of times, there is no equal number of heads and tails, so we have to take the next best thing, an almost equal number of heads and tails. Since there are only 3 flips, the most equal it can get is 2 of one kind and 1 of the other. Therefore, every outcome is as likely or less likely than 2 heads and a tail.

[8] However, note that separate fields will use their own p-value thresholds: physics requires stringent p-values for particle discovery, with p=0.0000003 as a threshold.

[9] This wouldn't be such a big deal if people didn't require lots of publications for tenure, or accepted negative results for publication. However, we're here to use science, not fix it.

[10] Reminder: I'm almost certainly doing something wrong in this post. If you know what it is, I would love to hear it. TELL ME MY METHODOLOGY SINS SO I CAN CONFESS THEM. It's super easy!

[11] So why does the Sidak equation have that form?

Let's say that you are trying to see Hamilton, the musical, and enter a lottery every day for tickets. Let's simplify and say that you are always 1 out of 1000 people competing for one ticket, so you have a 0.001 chance of winning a ticket each day.

Now, what are the chances that you win at least once within the next year (365 days)? You can't add the probability of winning 365 times: if you extend that process, you'll eventually have more than a 100% chance of winning, which simply doesn't make sense. Thinking about it, you can never play enough times to guarantee that you will win the lottery, only enough times that you will probably win. You also can't multiply the probability of winning together 365 times, since that would be the probability that you win 365 times in a row, an appropriately tiny number.

Instead, what you want is the probability that you lose 365 times in a row; then inverting that gets you the probability that you win at least once. The probability of losing on any given day is 0.999, so the probability of losing 365 days in a row is 0.999^{365} = 0.694. But we don't want the probability of losing 365 times in a row: we want the chance that that doesn't happen. So we invert by subtracting that probability from 1, 1-0.694, for a total probability of winning equal to 0.306.

Generalizing from a year to any number of days N, this equation calculates the total probability of winning.


p_{total} = 1 - (1 - p_{winning})^N

Which looks an awful lot like the Sidak equation. The exponent contains an N instead of a \frac{1}{m}, because p_{total} corresponds to p_{overall} in the Sidak equation: solving for p_{winning} will net you the same equation.
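A quick numerical check of the lottery example, in R:

(1 - 0.001)^365        # probability of losing every single day for a year: ~0.694
1 - (1 - 0.001)^365    # probability of winning at least once: ~0.306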

[12] An unstated assumption throughout the post is that each measure of each variable is independent of every other measure. I don't know how to handle situations involving less-than-complete independence yet, so that's a topic for another day. This will probably come after I actually read Judea Pearl's Causality, which is a 484-page textbook, so don't hold your breath.

[13] The manual page for prop.test was not forthcoming with this detail, so I had to find this out via CrossValidated.

[14] It's adorable how Victorian the experiment sounds.

[15] Allow me to briefly rant about R's package ecosystem. R! Why the fuck would you let your users be so slipshod when they make their own packages? Every other test function out there takes arguments in a certain format, or a range of formats, and then a user defined package simply uses a completely different format for no good reason. Do your users not care about each other? Do your dark magicks grow stronger with my agony? Why, R!? Why!?

[16] I suppose I really should be using pandas instead, since I'm already using Python.