Nathan Hwang

Sandbox Statistics: Methodological Insurgency Edition

Epistemic status: almost certainly did something egregiously wrong, and I would really like to know what it is.

I collect papers. Whenever I run across an interesting paper, on the internet or referenced from another paper, I add it to a list for later perusal. Since I'm not discriminating in what I add, I add more things than I can ever hope to read. However, "What Went Right and What Went Wrong": An Analysis of 155 Postmortems from Game Development (PDF) caught my eye: an empirical software engineering process paper, doing a postmortem on the process of doing postmortems? That seemed relevant to me, a software engineer that does wrong things once in a while, so I pulled it out of paper-reading purgatory and went through it.

The paper authors wanted to study whether different sorts of game developers had systematic strengths and weaknesses: for example, they wanted to know whether "a larger team produces a game with better gameplay." To this end, they gathered 155 postmortems off Gamasutra[1] and encoded each one[2] as claiming a positive or negative outcome for a part of the game development process, like art direction or budget. They then correlated these outcomes to developer attributes, looking for systematic differences in outcomes between different sorts of developers.

I'll be upfront: there are some problems with the paper, prime of which is that the authors are a little too credulous given the public nature of the postmortems. As noted on Hacker News, the companies posting these postmortems are strongly disincentivised from actually being honest: publicly badmouthing a business partner is bad for business, airing the company's dirty laundry is bad for business, and even saying "we suck" is bad for business. Unless the company is going under, there's little pressure to put out the whole truth and nothing but the truth, and instead a whole lot of pressure to omit the hard parts of the truth, maybe even shade the truth[3]. It's difficult to say that conclusions built on this unstable foundation are ultimately true. A second problem is the absence of any discussion of statistical significance; without knowing if statistical rigor was present, we don't know if any conclusions drawn are indistinguishable from noise.

We can't do much about the probably shaded truth in the source material, but we might be able to do something about the lack of statistical rigor. The authors graciously publicized their data[4], so we can run our own analyses using the same data they used. Of course, any conclusions we draw are still suspect, but it means even if I screw up the analysis, the worst that could happen is some embarrassment to myself: if I end up prescribing practicing power poses in the mirror or telling Congress that cigarettes are great, no one should be listening to me, since they already know my source data is questionable.

Now we have a sandbox problem and sandbox data: how do we go about finding statistically significant conclusions?


Before we dive in, a quick primer about p-values[5]. If you want more than this briefest of primers, check out the Wikipedia article on p-values for more details.

Roughly speaking, a p-value is the chance of seeing data at least as extreme as ours if the null hypothesis (the boring, no-interesting-effect result) were true. The lower the p-value is, the worse the boring outcome explains the data, and the more seriously we can take a non-boring one.

For example, if we're testing for a loaded coin, our boring null hypothesis is "the coin is fair". If we flip a coin 3 times, and it comes up heads twice, how do we decide how likely it is that a fair coin would generate this data? Assuming that the coin is fair, it's easy to see that the probability of a specific sequence of heads and tails, like HTH, is (\frac{1}{2})^3 = \frac{1}{8}. We need to use some combinatorial math in order to find the probability of 2 heads and 1 tail in any order. We can use the "choose" operation to calculate that 3 \text{ choose } 2 = {{3}\choose{2}} = 3 different outcomes match 2 heads and 1 tail. With 3 coin flips, there are 8 equally probable outcomes possible, so our final probability of 2 heads and 1 tail in any order is 3/8.

However, neither of these is the probability of the coin being fair. Intuitively, the weirder the data, the less weight we should give to the null hypothesis: if we end up with 20 heads and 2 tails, we should be suspicious that the coin is not fair. We don't want to simply use the probability of the outcome itself, though: ending up with one of 100 equally probable outcomes is unremarkable (one of them had to win, and they were all equally likely to win), while ending up with an unlikely option instead of a likely option is remarkable. By analogy, receiving 1 inch of rain instead of 1.1 inches in Seattle is unremarkable, even if getting exactly 1 inch of rain is unlikely. Receiving any rain at all in the Sahara Desert is remarkable, even if it's the same probability as getting exactly 1 inch of rain in Seattle. The weirdness of our data depends not just on the probability of the event itself, but on the probability of other events in our universe of possibility.

The p-value is a way to solidify this reasoning: instead of using the probability of the outcome itself, it is the sum of the probability of all outcomes equally or less probable than the event we saw[6]. In the coin case, we would add the probability of 2 heads and 1 tail (3/8) with the probability of the more extreme results, all heads (1/8), for p=0.5.

But wait! Do we also consider a result of all tails to be more extreme than our result? If we only consider head-heavy results in our analysis, that is known as a one-tailed analysis. If we stay with a one-tailed analysis, then we will in essence be stating that we knew all along that the coin would always have more heads in a sequence, and we only wanted to know by how much it did so. This obviously does not hold in our case: we started by assuming the coin was fair, not loaded, so tails-heavy outcomes are just as extreme as heads-heavy outcomes and should be included. When we do so, we end up with p=1.0: the data fits the null hypothesis closely[7]. One-tailed analysis is only useful in specific cases, and I'll be damned if I fully understand those cases, so we'll stick to two-tailed analyses throughout the rest of this post.
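To make the one-tailed versus two-tailed arithmetic concrete, here is a minimal Python sketch of the coin example. The function names are my own, and the "two-tailed" version follows the definition used in this post (add up every outcome that is equally or less probable than the one we saw), which is only one of several conventions.

```python
from math import comb

def binomial_pmf(heads: int, flips: int, p_heads: float = 0.5) -> float:
    """Probability of exactly `heads` heads in `flips` flips of the coin."""
    return comb(flips, heads) * p_heads**heads * (1 - p_heads) ** (flips - heads)

def one_tailed_p(heads: int, flips: int) -> float:
    """Only count outcomes with at least as many heads as we observed."""
    return sum(binomial_pmf(k, flips) for k in range(heads, flips + 1))

def two_tailed_p(heads: int, flips: int) -> float:
    """Count every outcome that is equally or less probable than the observed one."""
    observed = binomial_pmf(heads, flips)
    all_probs = [binomial_pmf(k, flips) for k in range(flips + 1)]
    # Small tolerance so equally probable outcomes are not lost to float rounding.
    return sum(prob for prob in all_probs if prob <= observed * (1 + 1e-9))

print(one_tailed_p(2, 3))  # 0.5
print(two_tailed_p(2, 3))  # 1.0
```

Running the same functions on 20 heads out of 22 flips gives a two-tailed p-value far below 0.05, which matches the intuition that such a coin deserves suspicion.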

If there were only ever two hypotheses, like the coin being fair, or not, then rejecting one implies the other. However, note that rejecting the null hypothesis says nothing about choosing between multiple other hypotheses, like whether the coin is biased towards the head or tail, or by how much a coin is biased. Those questions are certainly answerable, but not with the original p-value.

How low a p-value is low enough? Traditionally, scientists have treated p<0.05 as the threshold of statistical significance: if the null hypothesis were true, it would generate data this extreme less than 1/20th of the time purely by chance, which is pretty unlikely, so we should feel safe rejecting the boring null hypothesis[8].

There are problems with holding the p<0.05 threshold as sacrosanct: it turns out making p=0.05 a threshold for publication means all sorts of fudging with the p-value (p-hacking) happens[9], which is playing a part in psychology's replication crisis, which is where the 2nd part of this post's title comes from[10].

For these reasons, the p-value is a somewhat fragile tool. However, it's the one we'll be using today.

Adjusting expectations

The first step is simple: before looking at any of the data, can we know whether any conclusions are even possible?

That means starting with a power analysis, to find out whether 155 postmortems is enough data to produce significant results. First, we need to choose an expected effect size we think our data will display: usual values range from 0.1 (a weak effect) to 0.5 (a strong effect). Yes, it's subjective what you choose. We already know how many data points we have, 155 (normally we would be solving for this value, to see how big our sample size would have to be). Now, I'm not going to calculate this by hand, and will instead use R, a commonly used statistical analysis tool (for details on running this, see the appendix below). Choosing a "medium" effect size of 0.3 with n=155 data points tells us that we have a projected 25% false negative rate, a ~1/4 chance to miss an existing effect purely by chance. It's not really a great outlook, but we can't go back and gather more data, so we'll just have to temper our expectations and live with it.
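For the curious, the power number can be roughly reproduced without R. This is a sketch under assumptions: it uses the usual normal approximation for a two-sided comparison of two proportions, treats 0.3 as Cohen's effect size h, and treats n=155 as the number of observations in each group (which is my understanding of how pwr.2p.test reads its n argument). The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def power_two_proportions(effect_size: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided test comparing two proportions."""
    std_normal = NormalDist()
    # Critical value for a two-sided test at significance level alpha.
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # How far the effect shifts the test statistic, in standard errors.
    shift = effect_size * sqrt(n_per_group / 2)
    # Probability of landing beyond either critical value.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

power = power_two_proportions(effect_size=0.3, n_per_group=155)
print(f"power = {power:.3f}, false negative rate = {1 - power:.3f}")
```

Under these assumptions the power comes out to roughly 0.75, which lines up with the ~25% false negative rate quoted above.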

What about looking at other parts of the general experiment? One potential problem that pops out is the sheer number of variables that the experiment considers. There are 3 independent variables (company attributes), and 22 dependent variables (process outcomes) that we think the independent variables affect, for a total of 3\cdot 22=66 different correlations that we are looking at separately. This is a breeding ground for the multiple comparisons problem: comparing multiple results against the same significance threshold increases the chances that at least one conclusion is falsely accepted (see this XKCD for a pictorial example). If you want to hold steady the chances that every conclusion you accept is statistically significant, then you need to make the evidential threshold for each individual correlation stricter.

But how much stricter? Well, we can pick between the Bonferroni, the Sidak, and the Holm-Bonferroni methods.

The Bonferroni method simply takes your overall threshold of evidence, and divides by the number of tests you are doing to get the threshold of evidence for any one comparison. If you have m=5 tests, then you have to be 5 times as strict, so 0.05/5 = 0.01. This is a stricter restriction than necessary: however, it's easy to calculate, and it turns out to be a pretty good estimate.

The Sidak method calculates the exact overall threshold of evidence given the per-comparison threshold. The previous method, the Bonferroni, is fast to calculate, but it calls some outcomes insignificant when it in fact has enough evidence to label those outcomes as significant. The Sidak method correctly marks those outcomes as significant, in exchange for a slightly more difficult calculation. The equation is:

p_{comparison} = 1 - (1 - p_{overall})^{1/m}

There's some intuition for why this works in a footnote [11].

If p_{overall}=0.05 (as is tradition) and m=5, then p_{comparison}=0.0102. This is not that much less strict than the Bonferroni bound, which is simply p_{Bonferroni}=0.01, but sometimes you just need that extra leeway.
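The two per-comparison thresholds are one-liners, so here is a small Python sketch comparing them side by side. The function names are mine; the formulas are the Bonferroni division and the Sidak equation given above.

```python
def bonferroni_threshold(p_overall: float, m: int) -> float:
    """Per-comparison threshold: split the overall threshold evenly across m tests."""
    return p_overall / m

def sidak_threshold(p_overall: float, m: int) -> float:
    """Per-comparison threshold from the Sidak equation, assuming independent tests."""
    return 1 - (1 - p_overall) ** (1 / m)

for m in (5, 66):
    print(
        f"m={m}: Bonferroni={bonferroni_threshold(0.05, m):.6f}, "
        f"Sidak={sidak_threshold(0.05, m):.6f}"
    )
```

For any m greater than 1 the Sidak threshold is slightly looser than the Bonferroni one, but the gap stays small.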

The Holm-Bonferroni method takes a different tack: instead of asking each comparison to pass a stringent test, it asks only some tests to pass the strict tests, and then allows successive tests to meet less strict standards.

We want to end up with an experiment-wide significance threshold of 0.05, so we ask whether each p-value from low to high is beneath the threshold divided by its number in line, and stop considering results significant once we reach a p-value that doesn't reach its threshold. For example: let's say that we have 5 p-values, ordered from low to high: 0.0001, 0.009, 0.017, 0.02, 0.047. Going in order, 0.0001 < 0.05/5 = 0.01, and 0.009 < 0.05/4 = 0.0125, but 0.017 > 0.05/3 = 0.0167, so we stop and consider the first two results significant, and reject the rest.
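The step-down procedure is easier to follow as code. Below is a minimal Python sketch of it, run on the five example p-values; the function name is my own, and a p-value exactly equal to its threshold is counted as passing, which is the usual convention.

```python
def holm_bonferroni(p_values, p_overall=0.05):
    """Return the p-values judged significant by the Holm-Bonferroni procedure."""
    ordered = sorted(p_values)
    m = len(ordered)
    significant = []
    for rank, p in enumerate(ordered):
        # The lowest p-value faces p_overall / m, the next p_overall / (m - 1), and so on.
        threshold = p_overall / (m - rank)
        if p > threshold:
            # First failure: this result and everything after it is rejected.
            break
        significant.append(p)
    return significant

print(holm_bonferroni([0.0001, 0.009, 0.017, 0.02, 0.047]))  # [0.0001, 0.009]
```

Note that 0.02 would have passed its own threshold of 0.05/2 = 0.025, but it never gets the chance: the procedure stops at the first failure.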

There is a marvelous proof detailing why this works which is too large for this post, so I will instead direct you to Wikipedia for the gory details.

With these methods, if we wanted to maintain a traditional p=0.05 threshold with m=66 different comparisons, we need to measure each individual comparison[12] against a p-value of:

p_{Bonferroni} = 0.05 / 66 = 0.000758

p_{Sidak} = 1 - (1 - 0.05)^{1/66} = 0.000777

p_{Holm}=(\text{between } 0.000758 \text{ and } 0.05)

We haven't even looked at the data, but we're already seeing that we need to meet strict standards of evidence, far beyond the traditional 0.05 threshold. And with n=155 data points at best (not all the postmortems touch on every outcome), it seems unlikely that we can meet these standards.

Perhaps I spoke too soon, though: can the data hit our ambitious p-value goals?

Testing the data

So how do we get p-values out of the data we have been given?

Keep in mind that we're interested in comparing different proportions of "this went well" and "this went poorly" responses for different types of companies, and asking ourselves whether there's any difference between the types of companies. We don't care about whether one population is better or worse, just that they have different enough outcomes. In other words, we're interested in whether the different populations of companies have the same proportional mean.

We'll use what's known as a contingency table to organize the data for each test. For instance, let's say that we're looking at whether small or large companies are better at doing art, which will produce the following table:

            Small Company   Large Company
Good Art    28              16
Bad Art     12              6

We want to compare the columns, and decide whether they look like they're being drawn from the same source (our null hypothesis). This formulation is nice, because it makes obvious that the more data we have, the more similar we expect the columns to look due to the law of large numbers. But how do we compare the columns in a rigorous way? I mean, they look like they have pretty similar proportions; how different can the proportions in each column get before they are too different? It turns out that we have different choices available to determine how far is too far.

z-test, t-test

The straightforward option is built in to R, called prop.test. Standing for "proportionality test", it returns a p-value for the null hypothesis that two populations have the same proportions of outcomes, which is exactly what we want.

However, a little digging shows that there are some problematic assumptions hidden behind the promising name. Namely, prop.test is based on the z-test[13], which is built on the chi-squared test, which is built on the assumption that large sample sizes are available. Looking at our data, it's clear our samples are not large: a majority of the comparisons are built on fewer than 40 data points. prop.test handily has an option to overcome this, known as Yates continuity correction, which corrects p-values for small sample sizes. However, people on CrossValidated don't trust Yates, and given that I don't understand what the correction is doing, we probably shouldn't either.
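To show what the z-test is doing under the hood, here is a rough Python sketch of a pooled two-proportion z-test, run on the art example table. It is not a port of prop.test: it skips the Yates correction entirely, and the function name is my own. As far as I understand it, the result should agree with prop.test when the correction is turned off, but treat that as an assumption.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(good_a: int, total_a: int, good_b: int, total_b: int):
    """Two-sided pooled z-test for equal proportions, with no continuity correction."""
    # Under the null hypothesis both groups share one underlying proportion.
    pooled = (good_a + good_b) / (total_a + total_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (good_a / total_a - good_b / total_b) / std_err
    # Two-sided: count both tails of the normal distribution.
    p_value = 2 * NormalDist().cdf(-abs(z))
    return z, p_value

# Small companies: 28 good out of 40. Large companies: 16 good out of 22.
z, p = two_proportion_z_test(28, 40, 16, 22)
print(f"z = {z:.3f}, p = {p:.3f}")
```

The normal approximation in the last step is exactly where the large-sample assumption sneaks in, which is why it looks shaky for tables this small.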

Instead, we should switch from using the z-test to using the t-test: Student's t-test makes no assumptions about how large our sample sizes are, and does not need any questionable corrections. It's a little harder to use than the z-test, especially since we can't make assumptions about variance, but worth the effort.


However, the t-test still makes an assumption that the populations being compared are drawn from a normal distribution. Is our data normal? I don't know, how do you even see if binary data (good/bad) is normal? It would be great if we could just sidestep this, and use a test that didn't assume our data was normal.

It turns out that one of the first usages of p-values matches our desires exactly. Fisher's exact test was devised for the "lady tasting tea" experiment, which tested whether someone could tell whether the milk had been added to the tea, instead of vice versa[14]. This test is pretty close to what we want, and has the nice property that it is exact: unlike the t-test, it is not an approximation based on an assumption of normal data.

Note that the test is close, but not exactly what we want. The tea experiment starts by making a fixed number of cups with milk added, and a fixed number of cups with tea added. This assumption bleeds through into the calculation of the p-value: as usual, Fisher's test calculates the p-value by looking at all possible contingency tables that are "more extreme" (less probable) than our data, and then adding up the probability of all those tables to obtain a p-value. (The probability of a table is calculated with some multinomial math: see the Wikipedia article for details). However, while looking for more extreme tables, it only looks at tables that add up to the same column and row totals as our data. With our earlier example, we have:

                Small Company   Large Company   Row total
Good Art        28              16              =44
Bad Art         12              6               =18
Column total    =40             =22

All the marginal totals (the values marked with =) would be held constant. See the extended example on Wikipedia, especially if you're confused how we can shuffle the numbers around while keeping the sums the same.
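Here is a minimal Python sketch of that enumeration: walk through every table that shares our row and column totals, and add up the probability of the ones that are equally or less probable than the table we observed. The function names are my own, and this is a toy version for illustration rather than a replacement for a real statistics library.

```python
from math import comb

def table_probability(top_left: int, row1: int, col1: int, total: int) -> float:
    """Probability of a 2x2 table with this top-left cell, given fixed margins."""
    return comb(col1, top_left) * comb(total - col1, row1 - top_left) / comb(total, row1)

def fisher_exact_two_sided(table) -> float:
    """Two-sided p-value: sum over tables no more probable than the observed one."""
    (a, b), (c, d) = table
    row1, col1, total = a + b, a + c, a + b + c + d
    observed = table_probability(a, row1, col1, total)
    # With the margins fixed, the top-left cell determines the whole table.
    lowest = max(0, row1 + col1 - total)
    highest = min(row1, col1)
    p_value = 0.0
    for cell in range(lowest, highest + 1):
        prob = table_probability(cell, row1, col1, total)
        if prob <= observed * (1 + 1e-9):  # tolerance for float rounding
            p_value += prob
    return min(p_value, 1.0)

art_table = [[28, 16], [12, 6]]
print(fisher_exact_two_sided(art_table))
```

For the art table, the observed split happens to be the single most probable table for these margins, so every alternative counts as "at least as extreme" and the p-value comes out at essentially 1: no evidence of any difference between small and large companies.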

This assumption does not exactly hold in our data: we didn't start by getting 10 large companies and 10 small companies and then having them make a game. If we did, it would be unquestionably correct to hold the column counts constant. As it stands, it's better to treat the column and row totals as variables, instead of constants.


Helpfully, there's another test that drops that assumption: Barnard's test. It's also exact, and also produces a p-value from a contingency table. It's very similar to Fisher's test, but does not hold the column and row sums constant when looking for more extreme tables (note that it does hold the total number of data points constant). There are several variants of Barnard's test based on how exactly one calculates whether a table is more extreme or not, but the Boschloo-Barnard variant is held to be always more powerful than Fisher's test.

The major problem with Barnard is that it is computationally expensive: all the other tests run in under a second, but running even approximate forms of Barnard takes considerably longer. Solving for non-approximate forms of Barnard with both columns and rows unfixed takes tens of minutes. With 66 comparisons to calculate, this means that it's something to leave running overnight with a beefy computer (thank the gods for Moore's law).

You can see the R package documentation (PDF) for more details on the different flavors of Barnard available, and all the different options available. In our case, we'll use Boschloo-Barnard, and allow both rows and columns to vary.


So now we have our data, a test that will tell us whether the populations differ in a significant way, and a few ways to adjust our p-values to account for multiple comparisons. All that remains is putting it all together.

When we get a p-value for each comparison, we get (drum roll): results in a Google Sheet, or a plain CSV.

It turns out that precisely 1 result passes the traditional p=0.05 threshold with Barnard's test. This is especially bad: if there was no effect whatsoever, we would naively expect 66 \cdot 0.05 \sim 3 of the comparisons to give back a "significant" result. So, we didn't even reach the level of "spurious results producing noise", far away from the multiple comparison adjusted thresholds we calculated earlier.

This is partly due to such a lack of data that some of the tests simply can't run: for example, no large companies touched on their experience with community support, either good or bad. With one empty column, none of the tests can give back a good answer. However, only a few comparisons had this exact shortcoming; the rest likely suffer from a milder version of the same problem, where there were only tens of data points on each side, which is too little data to be confident in any difference, and hence yields higher p-values.

In conclusion, there's nothing we can conclude, folks, it's time to pack it up and go home.

p-value Pyrotechnics

Or, we could end this Mythbusters style: the original experiment didn't work, but how could we make it work, even if we were in danger of losing some limbs?

In other words, the data can't pass a p=0.05 threshold, but that's just a convention decided on by the scientific community. If we loosened this threshold, how far would we have to loosen it in order to have a statistically significant effect in the face of multiple comparisons and the poor performance of our data?

It turns out that reversing Bonferroni correction is impossible: trying to multiply p=0.023 (the lowest Barnard-Boschloo p-value) by 66 hands back 0.023 \cdot 66 \sim 1.5, which is over 1.0 (100%), which is ridiculous and physically impossible. The same holds for Holm-Bonferroni, since it's built on Bonferroni.

So let's ditch Barnard-Boschloo: the t-test hands back a small p-value in one case, at 5.14 \cdot 10^{-6}. This we can work with! 5.14 \cdot 10^{-6} \cdot 66 = 0.000339, far below 0.05. This is pretty good, this outcome even passes our stricter multiple-comparisons adjusted tests. But what if we wanted more statistically valid results? If we're willing to push it to the limit, setting p_{overall}=0.9872 gives us just enough room to snag 3 statistically significant conclusions, either with Bonferroni or Holm-Bonferroni applied to the t-test results. Of course, the trade-off is that we are virtually certain that we are accepting a false positive conclusion, even before taking into account that we are using p-values generated by a test that doesn't exactly match our situation.

Reversing Sidak correction gets us something saner: with 66 tests and our lowest Barnard-Boschloo p-value, p=0.023, we have an overall 1-(1-0.023)^{66}=p_{overall}=0.785. Trying to nab a 2nd statistically significant conclusion pushes p_{overall}=0.991. Ouch.
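The "reversed" corrections are just the earlier formulas solved for the overall threshold. A small Python sketch, with function names of my own choosing, reproduces the numbers used in this section:

```python
def reverse_bonferroni(p_comparison: float, m: int) -> float:
    """Overall threshold needed for p_comparison to pass Bonferroni with m tests."""
    return p_comparison * m

def reverse_sidak(p_comparison: float, m: int) -> float:
    """Overall threshold needed for p_comparison to pass Sidak with m tests."""
    return 1 - (1 - p_comparison) ** m

lowest_barnard = 0.023   # lowest Barnard-Boschloo p-value quoted above
lowest_t_test = 5.14e-6  # lowest t-test p-value quoted above

print(reverse_bonferroni(lowest_barnard, 66))       # over 1.0, so not a probability
print(reverse_bonferroni(lowest_t_test, 66))        # comfortably under 0.05
print(round(reverse_sidak(lowest_barnard, 66), 3))  # 0.785
```

Unlike the Bonferroni version, the Sidak version can never exceed 1, which is why it is the only one that hands back something interpretable for the Barnard-Boschloo result.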

This means that we can technically extract conclusions from this data, but the conclusions are ugly. A p=0.785 means that if there is no effect in any of our data, we expect to see at least one spurious positive result around 78% of the time. It's worse than a coin flip. We're not going to publish in Nature any time soon, but we already knew that. Isn't nihilism fun?


So, what did we learn today?

  • How to correct for multiple comparisons: if there are many comparisons, you have to make each individual test stricter to keep the overall chance of a false positive under control.
  • How to compare proportions of binary outcomes in two different populations.

At some point I'll do a Bayesian analysis for the folks in the back baying for Bayes: just give me a while to get through a textbook or two.

Thanks for following along.

Appendix: Running the Analysis

If you're interested in the nitty gritty details of actually running the analyses, read on.

For doing the power analysis, you want to install the pwr package in R. In order to run a power analysis for the proportion comparison we'll end up doing, use the pwr.2p.test function (documentation (PDF)), and use n=155 data points and a "medium" effect size (0.3). The function will hand back a power value, which is the complement of the false negative rate (1-\text{"false negative rate"}). If you want to do a power analysis for other sorts of tests not centered around comparing population proportions, you will need to read the pwr package documentation for the other functions it provides.

Now on to the main analysis…

First, download the xlsx file provided by the paper author (gamasutra_codes.xslx, in a zip hosted by Google Drive).

The "Codes" sheet contains all the raw data we are interested in. Extract that sheet as a CSV file if you want to feed it to my scripts. The "Results" sheet is also interesting in that it contains what was likely the original author's analysis step, and makes me confident that they eyeballed their results and that statistical power was not considered.

Second, we need to digest and clean up the data a bit. To paraphrase Edison, data analysis is 99% data cleaning, and 1% analysis. A bit of time was spent extracting just the data I needed. Lots of time was spent defending against edge cases, like case rows not all having the same variable values that should be the same, and then transforming the data into a format I better understood. There are asserts littering my script to make sure that the format of the data stays constant as it flows through the script: this is definitely not a general purpose data cleaning script.

You can check out the data cleaning script as a Github gist (written in Python).

This data cleaning script is meant to be run on the CSV file we extracted from the xlsx file earlier (I named it raw_codes.csv), like so:

python raw_codes.csv clean_rows.csv 

The actual data analysis itself was done in R, but it turns out I'm just not happy "coding" in R (why is R so terrible?[15][16]). So, I did as much work as possible in Python, and then shipped it over to R at the last possible second to run the actual statistical tests.

Get the Python wrapper script, also as a Github gist.

Get the R data analysis script used by the wrapper script, also as a Github gist.

The R script isn't meant to be invoked directly, since the Python wrapper script will do it, but it should be in the same directory. Just take the CSV produced by the data cleaning step, and pass to the wrapper script like so:

python clean_rows.csv \
    --t_test --fischer_test \
    --barnard_csm_test \

This produces a CSV analysis_rows.csv, which should look an awful lot like the CSV I linked to earlier.

Math rendering provided by KaTeX.

[1] The video game community has a culture that encourages doing a public retrospective after the release of a game, some of which end up on Gamasutra, a web site devoted to video gaming.

[2] The authors tried to stay in sync while encoding the postmortems to make sure that each rater's codings were reasonably correlated with the others', but they didn't use a more rigorous measure of inter-rater reliability, like Cronbach's alpha.

[3] Even if the company is going under, there are likely repercussions a no-holds barred retrospective would have for the individuals involved.

[4] It turns out Microsoft wiped the dataset supposedly offered (probably due to a site re-organization: remember, it's a shame if you lose any links on your site!), but thankfully one of the authors had a copy on their site. Kudos to the authors, and that author in particular!

[5] This is also your notice that this post will be focusing on traditional frequentist tools and methods. Perhaps in the future I will do another post on using Bayesian methods.

[6] One of the curious things that seems to fall out of this formulation of the p-value is that you can obtain wildly different p-values depending on whether your outcome is a little less or a little more likely. Consider that there are 100 events, 98 of which happen with probability 1/100, and one that happens with probability 0.00999 (event A), for 0.01001 remaining probability on the last event (event B). If event A happens, p=0.00999, but if event B happens, p=1.0. These events happen with mildly different probabilities, but lead to vastly different p-values. I don't know how to account for this sort of effect.
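The jump is easy to reproduce. Here is a small Python sketch of the 100-event example, using exact fractions so that floating point rounding doesn't muddy the comparison; the event labels and the function name are made up for illustration.

```python
from fractions import Fraction

# 98 ordinary events at 1/100 each, plus the slightly rarer A and slightly likelier B.
probabilities = {f"event {i}": Fraction(1, 100) for i in range(98)}
probabilities["A"] = Fraction(999, 100_000)
probabilities["B"] = Fraction(1001, 100_000)
assert sum(probabilities.values()) == 1

def p_value(event: str) -> Fraction:
    """Sum the probability of every event equally or less probable than this one."""
    observed = probabilities[event]
    return sum(p for p in probabilities.values() if p <= observed)

print(float(p_value("A")))  # 0.00999
print(float(p_value("B")))  # 1.0
```

Event A is the least probable event, so nothing else gets added to it, while event B is the most probable event, so everything does.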

[7] This is kind of a strange case, but it makes sense after thinking about it. Getting an equal number of heads and tails would be the most likely outcome for a fair coin (even if the exact outcome happens with low probability, everything else is more improbable). Since we're flipping an odd number of times, there is no equal number of heads and tails, so we have to take the next best thing, an almost equal number of heads and tails. Since there are only 3 flips, the most equal it can get is 2 of one kind and 1 of another. Therefore, every outcome is as likely or less so than 2 heads and a tail.

[8] However, note that separate fields will use their own p-value thresholds: physics requires stringent p-values for particle discovery, with p=0.0000003 as a threshold.

[9] This wouldn't be such a big deal if people didn't require lots of publications for tenure, or accepted negative results for publication. However, we're here to use science, not fix it.

[10] Reminder: I'm almost certainly doing something wrong in this post. If you know what it is, I would love to hear it. TELL ME MY METHODOLOGY SINS SO I CAN CONFESS THEM. It's super easy, I even have an anonymous feedback form!

[11] So why does the Sidak equation have that form?

Let's say that you are trying to see Hamilton, the musical, and enter a lottery every day for tickets. Let's simplify and state that you are always 1 out of 1000 people competing for one ticket, so you have a 0.001 chance of winning a ticket each day.

Now, what are the chances that you win at least once within the next year (365 days)? You can't add the probability of winning 365 times: if you extend that process, you'll eventually have more than 100% chance of winning, which simply doesn't make sense. Thinking about it, you can never play enough times to guarantee you will win the lottery, just play enough times that you will probably win. You can't multiply the probability of winning together 365 times, since that would be the probability that you win 365 times in a row, an appropriately tiny number.

Instead, what you want is the probability that you lose 365 times in a row; then inverting that gets you the probability that you win at least once. The probability of losing on any one day is 0.999, so the probability of losing 365 times in a row is 0.999^{365} = 0.694. But we don't want the probability of losing 365 times in a row: we want the chance that doesn't happen. So we invert by subtracting that probability from 1, 1-0.694, for a total probability of winning equal to 0.306.

Generalizing from a year to any number of days N, this equation calculates the total probability of winning.

p_{total} = 1 - (1 - p_{winning})^N

Which looks an awful lot like the Sidak equation. The exponent contains an N instead of a \frac{1}{m}, since p_{total} corresponds with p_{overall} in the Sidak equation: solving for p_{winning} will net you the same equation.
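If you want to check the lottery arithmetic, here is a short Python sketch that puts the two tempting-but-wrong approaches next to the one that works. The function names are my own.

```python
def naive_sum(p_winning: float, days: int) -> float:
    """Wrong: adding probabilities eventually climbs past 100%."""
    return p_winning * days

def win_every_day(p_winning: float, days: int) -> float:
    """Wrong question: this is the chance of winning on every single day."""
    return p_winning ** days

def win_at_least_once(p_winning: float, days: int) -> float:
    """One minus the chance of losing every single day."""
    return 1 - (1 - p_winning) ** days

print(naive_sum(0.001, 1001))                     # more than 1, which is impossible
print(win_every_day(0.001, 365))                  # vanishingly small
print(round(win_at_least_once(0.001, 365), 3))    # 0.306
```

No matter how many days you play, the last function creeps toward 1 without ever reaching it, matching the intuition that you can never guarantee a win.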

[12] An unstated assumption throughout the post is that each measure of each variable is independent of each other measure. I don't know how to handle situations involving less-than-complete independence yet, so that's a topic for another day. This will probably come after I actually read Judea Pearl's Causality, which is a 484 page textbook, so don't hold your breath.

[13] The manual page for prop.test was not forthcoming with this detail, so I had to find this out via CrossValidated.

[14] It's adorable how Victorian the experiment sounds.

[15] Allow me to briefly rant about R's package ecosystem. R! Why the fuck would you let your users be so slipshod when they make their own packages? Every other test function out there takes arguments in a certain format, or a range of formats, and then a user defined package simply uses a completely different format for no good reason. Do your users not care about each other? Do your dark magicks grow stronger with my agony? Why, R!? Why!?

[16] I suppose I really should be using pandas instead, since I'm already using python.


Tools I Use

I’ve been thinking about whether the tools I use to get things done are good enough. Where are the gaps in my toolset? Do I need to make new tools for myself? Do I need to make tools that can make more tools[1]?

Before diving too deep, though, I thought it would be helpful to list out the tools I use today, why I use them, and how I think they could be better. It’s a bit of a dry list, but perhaps you’ll find one of these tools is useful for you, too.

Getting Things Done


Say what you will about gamification, but when it works, it works.

I wasn’t a habitual child, adolescent, or young adult. I had the standard brush/floss teeth habit when going to sleep, and nothing much beyond that. Sure, I tried to cultivate the habit of practicing the violin consistently, but that culminated with only moderate success in my early college years.

Then I picked up HabitRPG (now Habitica) in 2014, and suddenly I had to keep a central list of habits up to date on a daily basis, or I would face the threat of digital death. Previous attempts at holding myself to habits would track my progress on a weekly basis, or fail to track anything at all, but the daily do-or-die mentality built into Habitica got me to keep my stated goals at the forefront of my mind. Could I afford to let this habit go unpracticed? Am I falling into this consistent pattern of inaction which will get me killed in the long run? It was far from a cure-all, but it was a good first step to getting me to overcome my akrasia and do the things that needed to be done[2].

Currently, I only use the daily check-in features (“Dailies”): at first I also used the todo list, but it turned out that I wanted much, much more flexibility in my todo system than Habitica could provide, so I eventually ditched it for another tool (detailed below). I simply never got into using the merit/demerit system after setting up merits and demerits for myself.


org-mode

I have tried making todo lists since I was a young teenager. The usual pattern would start with making a todo list, crossing a couple items off it over a week, and then I would forget about it for months. Upon picking it back up I would realize each item on the list was done, or had passed a deadline, or I didn’t have the motivation for the task while looking at the list. At that point I would throw the list out; if I felt really ambitious in the moment, I would start a new list, and this time I wouldn’t let it fade into obsolescence…

Habitica fixed this problem by getting me into the habit of checking up on my todo list on a regular basis, which meant my todo lists stopped getting stale, but the todo list built into the app was just too simple: it worked when I had simple one-step tasks like “buy trebuchet from Amazon” on the list, but complicated things like “build a trebuchet” would just sit on the list. It never felt like I was making forward progress on those large items, even when I worked for hours on them, and breaking up the task into parts felt like cheating (since you get rewarded for completing any one task[3]). More importantly, breaking tasks up made my todo list long, cluttered, and impossible to sort. Additionally, I wanted to put things onto the list that I wanted to do, but weren’t urgent, which would just compound how cluttered the list would be. For scale, I made a todo spreadsheet in college that accumulated 129 items, most of which weren’t done by the end of college and would have taken weeks of work.

So I needed two things: a way to track all of the projects I wanted to do, even the stupid ones I wouldn’t end up doing for years, and a way to track projects while letting me break them down into manageable tasks.

After a brief stint of looking at existing todo apps, and even foraying into commercial project management tools, I decided I was a special unique flower and had to build my own task tracker, and started coding.

After weeks of this, one of my friends started raving about org-mode, the flexible list-making/organization system built inside of Emacs (the text editor; I talk about it some more below). He told me that I should stop re-implementing the wheel: since I was already using Emacs, why not just hack the fancy extra stuff I wanted from a todo system on top of org-mode, instead of tediously re-implementing all the simple stuff I was bogged down in? So I tried it, and it’s worked out in exactly that way. The basics are sane and easy to use, and since it’s just an Emacs package, I can configure and extend it however I want.

Like I implied earlier, I use my org-mode file as a place to toss all the things that I want to do, or have wanted to do; it’s my data pack-rat haven. For example, I have an item that tracks “make an animated feature length film”[4], which I’m pretty sure will never happen, but I keep it around anyways because the peace of mind I can purchase with a few bytes of hard drive space is an absolute bargain. It doesn’t matter that most of my tasks are marked “maybe start 10 years from now”, just that they’re on disk and out of my head.

And like I implied earlier, org-mode really got me to start breaking down tasks into smaller items. “Build a trebuchet” is a long task with an intimidating number of things to do hidden by a short goal statement; breaking it down into “acquire timber” and “acquire chainsaw” and “acquire boulders” is easier to think about, and makes it clearer how I’m making progress (or failing to do so).

The last big feature of org-mode that I use is time tracking, which lets me log the time I spend on each task. I do a weekly review, and org-mode lets me look at which tasks I worked on, and for how long. For example, I used to think that I wrote blog posts by doing continual short edit/revision cycles, but it turned out that I usually had the revision-level changes nailed down quickly, but then I had long editing cycles where I worried about all the minutiae of my writing. Now I’m more realistic about how much time I spend writing, and how quickly I can actually write, instead of kidding myself that I’ll be happy with just an hour of editing[5].

Org-mode isn’t for everyone. It only really works on desktop OSes (some mobile apps consume/edit the org-mode file format, but only somewhat), so it’s hard to use if you aren’t tied to a desktop/laptop. And the ability to extend it is tied up in knowing an arcane dialect of lisp and a willingness to wrestle with an old editor’s internals. And you might spend more time customizing the thing than actually getting things done. But, if you’re bound to a desktop anyways, and know lisp, and have the self-discipline to not yak shave forever, then org-mode might work for you.


Inbox

Nothing out of the ordinary here, it’s just Google email. Aside from handling my email, I primarily use the reminders feature: if there are small recurring tasks (like “take vitamins”), then I just leave them in Inbox instead of working them into org-mode. At some point they’ll probably move into org-mode, but not yet.

Keep / Evernote

I started using Evernote in 2011 or so, and switched to Keep last year when Evernote tried to force everyone to pay for it. Originally, I bought into the marketing hype of Evernote circa 2011: “Remember Everything”. Use it as your external brain. Memorizing is for chumps, write it down instead.

And I took the “Everything” seriously. How much did I exercise today? What did I do this week? What was that interesting link about the ZFS scrub of death? Why did I decide to use an inverted transistor instead of an inverted zener diode in this circuit? It’s all a search away.

I recognize that this level of tracking is a bit weird, but recalling things with uncanny precision is helpful. For example, while I was doing NaNoWriMo in November, I had years of story ideas and quips as notes; if I sort of half-remembered that I had an idea where Groundhog Day was a desperate action movie instead of a comedy, I could just look up what sorts of plot directions I had been thinking about, or if I had more ideas about the plot over time, and bring to bear all that pent up creative energy.

Less importantly, I use my note taking stream as a mobile intake hopper for org-mode, since there aren’t any mobile org-mode apps I trust with my todo list.

Habit Group

And for something that isn’t electronic: I am part of a habit setting and tracking group. It’s a group of like-minded individuals that all want to be held accountable to their goals, so we get together and tell each other how we are doing while striving towards those goals. It’s using social pressure to get yourself to be the person you want to be, but without the rigid formality of tools like Stickk.

Mobile Apps


Anki

A spaced repetition app, free on Android. See Gwern for an introductory deep dive on spaced repetition.

I use it to remember pretty random things. There’s some language stuff, mainly useful for impressing my parents and niece with how easily I can pronounce Korean words. There’s some numbers of friends and family, in case I somehow lose my phone and find a functioning payphone. There’s a subset of the IPA alphabet, in case I need to argue about pronunciation.

I have some more plans to add to this, but mostly covering long-tail language scenarios. If you’ve read Gwern’s introduction above, you’ll remember that the research implies that mathematical and performance knowledge are not as effective to memorize through spaced repetition as language and motor skills, so I’m not really in a rush to throw everything into an Anki deck.

Google Authenticator

This is your reminder that if you’re not using two-factor authentication, you really should be. Two-factor means needing two different types of things to log in: something you know (a password) and something you have (a phone, or other token). This way, if someone steals your password over the internet, you’re still safe as long as they don’t also mug you (applicable to most cybercriminals).

Password Manager

On a related note, if you aren’t using a password manager then you should be using one of those, too. The idea is to unlock your password manager with a single strong password, and the manager remembers your actual passwords for all your different accounts. Since you don’t have to remember your passwords, you can use a different strong random password for each different service, which is much more secure than using the same password for everything[6]. As a starting point, see The Wirecutter’s best password manager recommendations[7].
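To make “a different strong random password for each service” concrete, here is a minimal sketch of what a manager does when it generates passwords for you. The helper name, the alphabet, and the default length are my own assumptions for illustration, not how any particular manager works.

```python
import secrets
import string

# Assumed alphabet and length; real managers let you configure both.
ALPHABET = string.ascii_letters + string.digits + string.punctuation


def generate_password(length=24):
    """Return a random password drawn from ALPHABET using the OS's secure RNG."""
    if length < 1:
        raise ValueError("length must be positive")
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


# One independent password per service, so a leak at one site
# tells an attacker nothing about the others.
services = ["bank", "email", "InternetPetsForum"]
passwords = {name: generate_password() for name in services}
```

The important part is `secrets` rather than `random`: the latter is not meant for anything security sensitive.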


Feedly

For reading RSS feeds. I follow some bloggers (SSC, Overcoming Bias), some science fiction authors (Stross, Watts), and the short story feed.

However, Feedly isn’t especially good. The primary problem is the flaky offline support. Go into a tunnel? There’s no content cache, so you can’t read anything if you didn’t have the app open at the exact moment you went underground. (I imagine this is mostly a problem in NYC).

Plus, the screens are broken up into pages instead of being in one scrolling list, which is weird. It’s okay enough to get me to not leave, but I’m on the lookout for a better RSS reader.


Swarm

Location check-in app, throwing it back to 2012. Sure, it’s yet another way to leak out information about myself, like whether I’m on vacation, but governments and ginormous companies already can track me, so it’s more a question of whether I want to track myself. Swarm lets me do that, and do it in a way that is semantically meaningful instead of just raw long/lat coordinates.

Kobo eReader

My trusty e-reader, which I’ve written about before. It currently runs stock firmware, but I recently learned about an exciting custom firmware I had missed, koreader, which looks like it solves some of the PDF problems I had bemoaned before. We’ll see if I can scrounge up some time to check it out.

Desktop Software


Emacs

Text editor Operating system. What org-mode is layered on top of. If you’re clicking around with a mouse to move to the beginning of a paragraph so you can edit there, instead of hitting a couple of keys, you’re doing it wrong.

Also make sure to map your caps lock key to be another control, which is easily one of the higher impact things on this list that you can do today, even if you will never use Emacs. Now, you don’t have to contort your hand to reach the control keys when you copy-paste, or when you issue a stream of Emacs commands.


Ubuntu

Running 16.04 LTS, with a ton of customization layered on top. For example, I replaced my window manager with…


xmonad

Tiling window manager for Linux. All programs on the desktop are fully visible, all the time. This would be a problem with the number of programs I usually have open, but xmonad also lets you have tons of virtual desktops you can switch between with 2 key-presses. I suspect that this sort of setup covers at least part of the productivity gains from using additional monitors.

Caveat for the unwary: like org-mode, xmonad is power user software, which you can spend endless time customizing to an inane degree (to be fair, it’s usually a smaller amount of endless time than org-mode).


Redshift

Late night blue light is less than ideal. Redshift is a way to shift your screen color away from being so glaringly blue on Linux.

There are similar programs for other platforms:

However, the default behavior for most of these apps is to follow the sun: when the sun sets, the screen turns red. During the winter the sun sets at some unreasonable hour when I still want to be wide awake, so there’s some hacking involved to get the programs to follow a time-based schedule instead of a natural light schedule.
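One way to get a clock-based schedule is to skip the sun tracking entirely and set the temperature yourself on a timer. The following is only a sketch: the schedule times and color temperatures are made-up values, and it assumes your redshift build supports the one-shot `-O` flag.

```python
from datetime import time

# Assumed schedule and color temperatures (in kelvin); tune to taste.
DAY_START = time(7, 0)
NIGHT_START = time(21, 0)
DAY_TEMP = 6500
NIGHT_TEMP = 3500


def temperature_for(now):
    """Pick a color temperature from the wall clock, ignoring the sun."""
    if DAY_START <= now < NIGHT_START:
        return DAY_TEMP
    return NIGHT_TEMP


def redshift_command(now):
    """Build a one-shot redshift invocation for the given time of day."""
    # -O sets the color temperature once and exits.
    return ["redshift", "-O", str(temperature_for(now))]
```

Running something like `subprocess.run(redshift_command(datetime.now().time()))` from cron every few minutes would then keep the screen on the clock’s schedule instead of the sun’s.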

Crackbook/News Feed Eradicator (Chrome extensions)

I’m sure you’re aware of how addictive the internet can be (relevant XKCD). These extensions help me make sure I don’t mindlessly wander into time sinks.

I use Crackbook by blocking the link aggregators I frequent, hiding the screen for 10 seconds: if there’s actual content I need to see, or if I’m deliberately relaxing, then 10 seconds isn’t too much time to spend staring at a blank screen. But if I just tabbed over without thinking, then those 10 seconds are enough for second thoughts, which is usually enough to make me realize that I’ve wandered over by habit instead of intention, and by that point I just close the tab.

The News Feed Eradicator is pretty straightforward: it just removes Facebook’s infinite feed, without forcing a more drastic action, like deleting your Facebook. For example, it’s easy for me to see if anyone had invited me to an event[8], but I don’t get sucked into scrolling down the feed forever and ever.

This will not work for everyone: some people will go to extreme lengths to get their fix, and extensions are easy to disable. However, it might work for you[9].

Things I Made To Help Myself

Newsletter Aggregator Tool

I made a personal tool to create the monthly/quinannual/annual newsletters I send to the world. It’s my hacked up replacement for social networking.

Throughout the month/year/life, I keep the tool up to date with what’s happening, and then at the end of the month it packages everything up and sends it in one email. It’s not strictly necessary, since I could just write out the email at the end of the month/year, but it feels like less of a time sink, since I’m spreading the writing out over time instead of spending a day writing up a newsletter, and that means I’m willing to spend more time on each entry.

Writing Checker Tool

There are a number of writing checkers out there: some of them aren’t even human.

There’s the set of scripts a professor wrote to replace himself as a PhD advisor. There are some folks that are working on a prose linter (proselint, appropriately), which aims to raise the alarms only when things are obviously wrong with your prose (“god, even a robot could tell you ‘synergy’ is bullshit corporate-speak!”). There have been other attempts, like Word’s early grammar checker, and the obvious spellchecker, but they all stem from trying to automate the first line of writing feedback.

My own script isn’t anything exciting, since it uses other scripts to do the heavy lifting, like the aforementioned proselint and PhD scripts. So far the biggest thing I added to the linter is a way to check markdown links for doubled parentheses, like [this link]( unless the inner parentheses are escaped with \, the link won’t include the last ), probably preventing the link from working, and a dangling ) will appear after the link.
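For the curious, the parenthesis check can be sketched in a few lines. This is not my actual script, just an illustration of the idea: the function name and the example.com URL are made up, and it assumes a renderer that ends the link at the first unescaped closing parenthesis.

```python
import re

# Start of an inline link target: "[text](".
LINK_START = re.compile(r"\[[^\]]*\]\(")


def find_unescaped_parens(markdown):
    """Return inline links whose target contains an unescaped '('."""
    suspects = []
    for match in LINK_START.finditer(markdown):
        i = match.end()
        saw_open = False
        # Walk the target up to the first unescaped ')' or the end of the line.
        while i < len(markdown) and markdown[i] not in ")\n":
            if markdown[i] == "\\":
                i += 2  # skip the backslash and the character it escapes
                continue
            if markdown[i] == "(":
                saw_open = True
            i += 1
        if saw_open:
            # Report the link as a naive renderer would see it.
            suspects.append(markdown[match.start():i + 1])
    return suspects


broken = "See [wiki](http://example.com/Foo_(bar)) here."
fixed = "See [wiki](http://example.com/Foo_\\(bar\\)) here."
print(find_unescaped_parens(broken))
print(find_unescaped_parens(fixed))
```

The first call reports the truncated link, and the second reports nothing, since both inner parentheses are escaped.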

There are more things I plan on adding (proper hyphenation in particular is a problem I need to work on), but I’ve already used the basic script for almost every blog post I’ve written in 2016. Notably, it’s helping me break my reliance on the word “very” as a very boring intensifier, and helped me think closely about whether all the adverbs I strew around by default are really necessary.

Real Life

The 7 Minute Workout

Exercising is good for you, but it wasn’t clear to me how I should exercise. Do I go to the gym? That’s placing a pretty big barrier in front of me actually exercising, given that gyms are outside and gym culture is kind of foreign to me. Do I go running? It’s a bit hard to do so in the middle of the city, and I’ve heard it’s not good for the knees[10]. Places to swim are even harder to reach than gyms, so that’s right out.

What about calisthenics? Push ups, sit ups, squats and the like. It requires barely any equipment, which means I can do it in my room, whenever I want. While thinking about this approach, I came across the 7 minute workout as detailed by the NY Times. Is it optimal? Certainly not; it won’t build muscle mass quickly or burn the most calories[11]. Is it good enough, in the sense of “perfect is the enemy of good”? Probably! So I started doing the routine and have been doing it for 3.5 years.

I’ve made my own tweaks to the routine: I use reps instead of time, use dumbbells for some exercises, and swapped out some parts that weren’t working. For example, I didn’t own any chairs tall enough to do good tricep dips on, so I substituted overhead triceps extensions.

And, well, I haven’t died yet, so it’s working so far.

Cleaning Checklist

After reading The Checklist Manifesto, I only made one checklist (separate from my daily Habitica list, which I was already using), but I have been using that checklist on a weekly basis for more than a year.

It’s a cleaning checklist. I use it to keep track of when I should clean my apartment, and how: not every week is “vacuum the shelves” week, but every week is “take out the trash” week. It has been helpful for making sure I don’t allow my surroundings to descend into chaos, which was especially helpful when I lived alone.

Meditation and Gratitude Journaling

Meditation I touch on in an earlier blog post; it builds up your ability to stay calm and think, even when your instinct rages to respond. Gratitude journaling is the practice of writing down the things and people you are grateful for, which emphasizes to yourself that even when things are bad, there’s some good in your life.

I’m wary about whether either of these actually work, or are otherwise worth it, but lots of people claim they do, and to a certain extent, they feel like they do. In a perfect world I would have already run through a meta-analysis to convince myself, but I don’t know how to do that yet, so I just do both meditation and gratitude journaling; they’re low cost, so even if they turn out to not do anything it’s not too big a loss.

Book/Paper Lists

I keep spreadsheets with the books I am reading, have read, and want to read. I do the same with academic papers.

It’s not just “I read this, on this date”: I also keep track of whether I generally recommend them, and a short summary of what I thought of the book, which is helpful when people ask whether I recommend any books I read recently. On the flipside, I also use the list as a wishlist to make sure I always have something interesting to read.

That’s it for now! We’ll see how this list might change over the next while…

[1] And when I do make tools that make tools, should it be a broom or bucket?

[2] Obviously, this won’t work for everyone. If you’re not motivated by points and levels going upwards, but the general concept appeals to you, Beeminder might be more motivating, since it actually takes your money instead of imaginary internet points.

[3] Conceivably, you could make this work by creating tasks to take a certain amount of time (like 30 minutes) so each item is time based instead of result based, and treat that as Just The Way You Use The Habitica Todo List.

[4] Don’t worry, it’s more fleshed out than this: I’m not keen on doing something for the sake of doing something, like “write my magnum opus, doesn’t matter what it’s about”. Come on, it has to matter somehow!

[5] It’s certainly possible that I should try to edit faster, or move towards that short and repeated revise-edit cycle, but this is more about having a clear view of what I’m actually doing now, after which I can decide how I should change things.

[6] If you use the same password everywhere, then your password is only as secure as the least secure site you use. Suppose you use the same password at your bank and InternetPetsForum, and InternetPetsForum hasn’t updated their forum software in 12 years. If InternetPetsForum is hacked, and your password was stored without any obfuscation, the hackers are only a hop and skip away from logging into your bank account, too.

[7] I’m declining to state exactly which password manager to use; while security through obscurity isn’t really viable for larger targets, I’ve picked up enough residual paranoia that disclosing exactly which service/tool I use seems to needlessly throw away secrecy I don’t need to throw away.

[8] lol

[9] And if you want something that’s less easy to disable, then SelfControl or Freedom might be more your speed. I can’t personally vouch for either.

[10] Honestly not really a true objection, but saying “running is hard” makes me feel like a lazy bum. I already did 20 pushups, what more do you want?!

[11] If you are interested in optimality in exercise, I’ve heard good things about Starting Strength.


Transcript 7X-2: A Zoothropological Perspective

Thank you all for coming to today’s seminar on the Zootopia artifact, recovered during our excursion on planet 7X. I’ll be diving a bit deeper into the implications of the recording, especially those clues that might reveal the reason for their civilization’s demise.

You should have gotten a copy of the translated recording last night, but for those that skipped viewing it, the story matches our own “cop buddy” movies, with unlikely partners pairing up to right wrongs and become friends. Additionally, the recording conveys a message of tolerance, even to those highly unlike yourself.

However, there are hints throughout the recording that there exist multiple conflicts and instabilities brewing beneath the surface of society, any of which might have been the cause of the end of their civilization.

(Of course, everything must be taken with a grain of salt. I will interpret the recording in earnest, but the recording may be presenting a biased/utopic/dystopic view of Zootopian society. However, given the extreme degradation of the other artifacts recovered, I will simply have to assume that the recording reflects Zootopian reality.)

A Malthusian World

The most obvious problem is the looming Malthusian trap. We catch a glimpse of Bunny Burrow’s population near the beginning of the film, and can extrapolate an approximate growth rate. Pegging Bunny Burrow at the visible 8,143,580 individuals, and growing at 1-2 people per second, this rural farming town is almost as large as New York City, and growing around 2 to 4 times as quickly in terms of births assuming no deaths or immigration. Once we include deaths into the population counter, then the birthrate must be even larger.
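For those who want to check the counter arithmetic at home, here is a rough sketch. It treats the 1-2 per second tick rate as a net increase held constant for a full year, which is my simplifying assumption, not something the recording states.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000, ignoring leap years

POPULATION = 8_143_580  # counter value visible in the recording


def annual_increase(per_second):
    """Individuals added in a year at a constant per-second rate."""
    return per_second * SECONDS_PER_YEAR


def relative_growth(per_second, population=POPULATION):
    """Yearly increase as a multiple of the current population."""
    return annual_increase(per_second) / population


for rate in (1, 2):
    added = annual_increase(rate)
    multiple = relative_growth(rate)
    print(f"{rate}/s adds {added:,} per year ({multiple:.1f}x the current population)")
```

Even at the low end of the tick rate, the yearly increase is several times the town’s entire current population.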

Humanity dodged predictions of a Malthusian trap in the latter half of the 1900s with a green revolution and a novel tendency for rising standards of living to lead to lower birthrates. However, it’s not clear that either of these did or could happen on 7X. Bunny Burrow is 211 miles from Zootopia, a large and apparently wealthy city (for comparison, Boston is around 200 miles from NYC). Even though the town appears rural, the area is connected to the city with high speed rail, which practically puts Bunny Burrow right next to Zootopia. If Bunny Burrow is selling food to Zootopia, then Bunny Burrow almost certainly has a relatively high standard of living, and yet growth rates are still much greater than replacement. We don’t know what the bunny death rate is like, but unless there’s some system of bunny birth restriction just off screen, each couple giving birth to 275 children will not yield a low enough birthrate to avoid explosive population growth.

On the other hand, it is not apparent there has been a green revolution yet. There are 3 million farmers within the present-day USA, but 8 million farmers on the doorstep of Zootopia, which implies that Zootopian agriculture is closer to America in the 1870s, when half the country was involved with agriculture. It’s also implied that botany or science education is still in its infancy: no one seems to know about “Night Howlers”, an unrestricted plant that elicits an aggressive response in a wide range of species. If botany was underdeveloped relative to the rest of their apparent scientific advancement, then it is possible they could pull off their own green revolution and raise food yields and agricultural productivity. However, the apparent tendency of at least one species (sub-species?) to maintain birthrates in the face of prosperity simply means a massive yet finite increase in agricultural output would only forestall the inevitable.

To fully sketch a bleak world, once 7X nears carrying capacity, any change in agricultural productivity (say, a volcano dusting up the stratosphere) would cause famine. Human responses to famine are varied, so we can’t rule out responses such as violent revolution, widespread debt enslavement while people try to raise the funds to buy increasingly expensive food, or even simple mass death. It’s possible that any of these contributed to the ultimate desolation of Zootopia.

Divisions in Society

Contrary to the main message of the movie, another source of strife would be the highly heterogeneous nature of Zootopian society.

It is unclear how old the accord between herbivores and carnivores is; the introductory skit doesn’t elaborate beyond “thousands of years ago”, which is ambiguous. “There was war thousands of years ago” does not preclude “there was war tens of years ago”. There are hints, though, that the accord is a recent event.

By our eyes, Zootopia looks like a new city: high technology abounds, and there is not much creaking infrastructure of the sort you might find in an NYC subway. On the other hand, there are hints that the city is not brand new: the jungle superstructure presumably had to grow while the city provided climate control, and the city has been around for long enough that it has older low-cost housing (which Judy lives in). However, 50 years is more than long enough for a city to develop those sorts of signs of aging, and the overall veneer of the city reflects a shiny new Singapore instead of an older NYC or Paris. Since the accord was signed in Zootopia, the relative youth of the city implies that the accord is also young.

More circumstantial evidence suggests a young accord: predators easily enter a hyper-aggressive state with barely any chemicals applied (the skin is a good barrier against random chemicals entering the bloodstream). If the accord had happened thousands of years ago, one would expect predator aggression to be more easily kept in check, due to thousands of years of study on an important public relations matter.

(To be fair to the inhabitants of 7X, it is likely that Zootopia exaggerates or imagines problems in society. In particular, the “Night Howlers” drug is curiously similar in nature to our own tales of zombies, serving as a fictional boogeyman. Along with the other problems I will detail about “Night Howlers” later, it seems unlikely that it is a real substance, or as dramatic as portrayed. However, as stated before, until a future expedition uncovers contrary evidence we can only take the recording’s word at face value.)

It seems that the accord is young. This means that the peace is more uncertain: institutions that have proven themselves over thousands of years in hundreds of civilizations, like the concept of courts, have shown themselves to be stable across many different circumstances. The accord seems more an uneasy peace that hasn’t had enough time to solidify into an alliance, more like the latest Israel-Palestine ceasefire than today’s peace between pre-Bismarck Germanic states. A societal shock might cause enough strife to break the accord, and the intervening peace would mean both predator and prey are prepared with better weapons.

(On the other hand, coordinating any peace at all between such different groups should be commended. Perhaps the inhabitants of Zootopia have a different enough neural architecture that negotiating and keeping a peace comes easy to them. However, the societal strife caused by Judy’s mid-recording revelations implies that isn’t the case.)

Subsistence Inequality

Another source of instability stems from the inequality coded into the genes of the different Zootopians, with vastly larger inherent differences between Zootopians than between any two humans.

Assuming that the city is relatively young, and that Zootopian society has only recently attained their technological level (much like our own world), it is only recently that smaller animals have gained access to machines with which they could do the work of much larger animals. Since they’re smaller, they don’t have large fixed costs: an elephant has to eat 300 pounds of food a day on an open savanna, while a gerbil has to eat 10 grams of food a day in a square foot cage. If the elephant wants to work at a high frequency trading firm downtown, he has to work remotely with the communication costs that entails, or pay out the trunk for a large city apartment that is still never large enough, but could serve as an outsized mansion for 20 gerbils.

If their society is still moving towards an information technology base, as it seems to be (mobile phones included), then the smaller animals gain more and more of an advantage. And small animals are demonstrably not dumb: a shrew is a successful mafia don, and the employees at the financial institution Lemming Brothers are, well, lemmings. The situation is analogous to Robin Hanson’s virtual person emulation scenario, where the ease with which minds can be replicated and the low cost of virtual living drive wages through the floor, far below human subsistence costs (defined as maintaining the minimum caloric intake needed for living). Back on 7X, the low cost of gerbil living drives wages through the floor, far below elephant subsistence costs[1]. With this discrepancy in living costs, the tendency of smaller animals to have more children becomes more pronounced: it’s easy to support several deadbeat siblings as a gerbil, but a burden to support a deadbeat elephant. Even if the heritability of IQ doesn’t hold for the inhabitants of 7X, this means that it pays to pursue an r-selection strategy as a small animal. The more children you have, the more breadwinners you might have among them, who can support all their siblings and then some. Over time, gerbils will vastly outnumber elephants.

In other words, tiny animals can eat the lunch of much larger animals. However, there is an existing peaceful integration of animals that literally eat each other: perhaps it is possible to also integrate animals with vastly different subsistence rates. One approach would be to impose a species-specific tax structure, similar to a skewed basic income, or provide a subsidy, like a housing subsidy for larger animals, or normalize different wages for different species[2] (it seems like these schemes aren’t already in place, since Judy doesn’t balk at paying for an elephant-marketed popsicle). Or coming at the problem from a different angle, perhaps their society would implement growth restrictions on faster growing populations, although it’s clear that such restrictions are not in place at the time of the recording.

Additionally, we do not know how long each animal species lives. If gerbils and elephants live as long as their terrestrial counterparts, then the shortness of gerbil lives leaves room for elephants to take on a long term Elder role, acting as a valuable repository of institutional knowledge for teams of short-lived gerbils. However, without more knowledge of Zootopian physiology, we can’t know for certain how their institutions would be structured to take advantage of different species, and if those would naturally counteract the problem of subsistence inequality or exacerbate it.

Balancing on a Knife Edge

In addition to the other concerns raised, it seems clear that a lot of destructive potential energy is stored in Zootopian society, but it is unclear how much of it is actively contained by their governments.

The first hint is the availability of, and easy-going attitude toward, dangerous drugs like “Night Howlers”. Previously, I pointed out that this meant that botany probably wasn’t advanced, but the advanced technology of the rest of 7X society means that oversights such as this are increasingly dangerous. Drawing a rough analogy, it’s as if the fact that ammonia fertilizers can be used to create explosives were freely available but specialized knowledge, and when a random farmer orders 10 tons of fertilizer over the internet and blows up an orphanage, the government blames the orphanage for being an old creaky building, and keeps saying so for months. “Night Howlers” have been an uncontrolled substance for so long, and city police are so unconcerned with copycat terrorist attacks after the events in the Zootopia recording (mass aerosol or water supply attacks leap to mind), that 7X society seems woefully unprepared for what our colleagues in that Three Letter Agency[3] call “independent actors” leveraging all the power a technological society grants them, without any of the checks.

The second hint is the absolutely mind-boggling availability of energy. Creating city-sized micro-climates? It’s an HVAC nightmare, an energy black hole to shovel electrons into. How bad might it be? Let’s do a Fermi estimate: since the climate outside the city is reasonably temperate looking, we might estimate that it is similar in latitude to the farming zone in Western Europe, which means it gets around 50% less direct sunlight than the equator. If the desert climate requires HVAC to make up the rest of the energy usually injected into a more equatorial desert by the sun, then a Manhattan-sized area would require 120TWh of energy over a year [4]. Keep in mind that all 5 boroughs of New York City used around 60TWh in 2009: it requires a city-sized energy budget just to keep one of these climates stable. With 2 more climates to control, the energy expenditure must be staggering. There are some energy savings to be had by the fact that the cooling systems for Tundratown can just dump waste heat directly into Sahara Square, but we’re neglecting to account for the fact that none of these climates are enclosed. It’s well known that you should close your windows when your AC is running under pain of using and paying for more energy than necessary, and the same principle applies here: we never see an enclosing dome dividing the different climates, including the temperate climate in the surrounding area. It’s tough to say exactly how much heat leakage happens between each borough, but it’s likely that the already high energy expenditures become astronomical.

This loose attitude towards energy usage probably means that energy is dirt cheap. However, where is this energy coming from? There’s so much of it, there’s a distinct possibility it’s coming from somewhere unsafe. Certainly, the Zootopians may have had access to liquid thorium reactors, fusion reactors, or more exotic forms of energy generation, but we don’t know that they did, and many of the high-output energy technologies we have access to have dubious trade-offs.

Fossil fuel sources have the downside of undoing their careful climate control (but we do know their world ended, so maybe that played a role). Nuclear energy needs strict controls to ensure it doesn’t aid nuclear proliferation, and with their lax approach to “Night Howlers”, it isn’t out of the question that they would have problems down the line.

Even if it’s safer green energy, the amount of energy in play can still be dangerous. If there are multiple booming populations like Bunny Burrow, and agricultural efficiency isn’t advanced, the rest of the world likely favors farms, not solar panels. However, we know Zootopia had orbital launch capability, since children want to become astronauts when they grow up, which opens up solar energy farms in space. Getting the energy from vast regions of space, though, has some problems. If there’s an orbital laser beaming down energy from an orbital solar array, that’s another opportunity for something to be hacked and aligned with great destructive power. Same with using gravitational potential energy, such as using falling asteroids as an energy source. It’s not worth belaboring the destructive potentials of even higher density energy mediums, like antimatter.


From the moment we arrived on planet 7X, we knew that we had arrived on a dead world. With the end of their journey fixed, we can only look to the past and ask who lived on planet 7X, how they lived, and what brought their civilization to a smoking ruin. We can only hope that by learning more about this one of many civilizations caught by the Great Filter, we can avoid their fate.

That’s it. Thank you for coming. Now, are there any questions?

(And in case there is any doubt: this is not an allegory for the current human condition, or any portion of such. This is a crazy no-holds-barred extrapolation of a children’s movie.)

[1] In case you were wondering, humans aren’t subject to the same problem: on a logarithmic scale, there’s barely any difference in size between small and large humans, and size does not correlate to appetite.

[2] This last suggestion seems straightforward, but probably introduces more knock-on effects. For example, a law enforcing different wages per species likely makes their version of Mechanical Turk vulnerable to illegal competitors: if there are dark web enabling technologies like Tor and Bitcoin, then carrying out “human” intensive tasks will be much cheaper in the black market, since smaller animals could mask their identity and charge rates undercutting large animal rates, but higher than small animal rates.

[3] Hint: all three letters are different. Sorry if you thought I was referencing the FAA.

[4] A back of the blog calculation: sunlight provides 1120 W/m². Manhattan is 59.1 km² large. Assume 10 hours of sunlight a day and 365 days a year. Halve the result due to latitude. Arrive at around 120 TWh/year.
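For anyone who wants to check the arithmetic in [4], here is a minimal sketch of the same Fermi estimate. The numbers are the ones from the footnote; the function and constant names are made up for illustration, and the whole thing inherits every crude assumption above (flat 10 hours of full-strength sunlight, a simple halving for latitude, no leakage between boroughs).

```python
# Inputs taken from footnote [4]; the names are illustrative only.
SUNLIGHT_W_PER_M2 = 1120.0   # sunlight reaching the ground
MANHATTAN_AREA_KM2 = 59.1    # stand-in for the size of one climate zone
SUN_HOURS_PER_DAY = 10.0
DAYS_PER_YEAR = 365.0
LATITUDE_FACTOR = 0.5        # the half of the energy HVAC has to make up


def annual_energy_twh(flux_w_per_m2, area_km2, hours_per_day, days, factor):
    """Return the yearly energy in terawatt-hours for the given inputs."""
    area_m2 = area_km2 * 1_000_000.0  # 1 km^2 = 10^6 m^2
    watt_hours = flux_w_per_m2 * area_m2 * hours_per_day * days * factor
    return watt_hours / 1e12          # 1 TWh = 10^12 Wh


desert_twh = annual_energy_twh(
    SUNLIGHT_W_PER_M2,
    MANHATTAN_AREA_KM2,
    SUN_HOURS_PER_DAY,
    DAYS_PER_YEAR,
    LATITUDE_FACTOR,
)

NYC_2009_TWH = 60.0  # the rough all-boroughs figure quoted in the post
print(f"One desert borough: about {desert_twh:.0f} TWh/year")
print(f"That is roughly {desert_twh / NYC_2009_TWH:.1f}x NYC's 2009 usage")
```

Changing any one input (say, 8 hours of sunlight instead of 10) moves the answer proportionally, which is about as much precision as a Fermi estimate deserves.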


How to Succeed in Business by Playing Video Games: An XCOMedy of Learning

It’s no secret that I have a love-hate relationship with video games. On the one hand, games whisk you off to enchanted worlds optimized for fun. On the other hand, any sense of accomplishment is illusory at best: congratulations, you’ve learned how to press buttons better than before!

However, I’ve found that one particular game, XCOM: Enemy Unknown, ended up teaching me some valuable lessons. The lessons are post-facto obvious in the way many lessons seem to be, but my system one needed something experiential, and it turns out that games are all about experience. First, I’ll explain the bare minimum of how XCOM works and a bit about the community surrounding it, then lay out the lessons I learned, and then talk about why this doesn’t change my ambivalence towards gaming.

I. The Review

If you like videos, then you can watch this walkthrough of XCOM’s tutorial mission, and then watch Beaglerush play a mission of the Long War mod. Or, keep reading…

Imagine: an alien force is invading Earth, abducting humans and waging a shadow war against Earth’s militaries. You are the leader of the international anti-extraterrestrial task force, XCOM, and tasked with responding to alien threats around the globe. Outmanned and outgunned, you must uncover the aliens’ secrets, take their technology for your own, and destroy them before the governments of Earth surrender to the aliens and shut down the XCOM project. Oorah.

So that’s the story. How does the game play?

You command a small squad of soldiers, giving them orders to move and shoot, and then allowing the aliens to move and shoot in turn. Most soldiers or aliens need to hide behind cover, or else the enemy can shoot at them with high chances to hit or even score a critical shot. Cover is directional, so moving units to the exposed flanks of enemies means shooting those exposed enemies becomes much easier. Overwatch is an ability that allows units to defer shooting at enemy units until they move during the enemy’s turn, which is useful for discouraging the enemy from moving, especially if the enemy can flank (and then kill) one of your soldiers. Outside of an individual battle, soldiers gain experience by killing aliens and participating in battles, and gain more perks as they gain more experience. Perks, you say? Yes, abilities like “Lightning Reflexes”, which means a soldier can’t be hit by alien overwatch shots, or “Double Tap”, which allows a soldier to shoot twice in a turn, or “Smoke Grenade”, which lays down a defensive smoke screen. Each soldier adopts a class, like a long range sniper or an explosives focused heavy weapons expert, which determines which perks are available.

So that’s vanilla XCOM, but there’s an incredible XCOM mod called Long War (LW). It’s partly incredible because XCOM was never meant to be modded, so the mod itself is technically impressive. The interesting part, though, lies in LW’s design choices. Vanilla XCOM is geared towards a more casual crowd; players only have to make a few choices at any one time, and the flow of the game is straightforward. The LW modders stood back and asked themselves, “yes, XCOM is a pretty good game, but how can we take every element of the game and make it tactically deeper?” For example, vanilla has 5 main types of weapons across 3 technology tiers; LW has 10 weapon types across 5 tiers, with an attendant expansion of possible trade-offs. Vanilla has 4 soldier classes, each with 32 possible combinations of perks; LW has 8 classes with 729 combinations each. Then, there are additional strategic concerns like soldier fatigue, where soldiers have to rest after a mission. This prevents the vanilla strategy of sending your best squad on every mission, putting the focus on leveling up all your soldiers. Then there’s the fact that the aliens are stronger, more devious, and scale up over time (sometimes literally — I’m looking at you, 2-story-tall chryssalid). And the modifications keep going. This all adds up to a tougher game, and for a certain person, a more engaging and fun game.

There’s one final ingredient that completes the XCOM picture for me. I’m not big on watching people play through games: if I wanted to watch something, then better a movie than watching someone else interact with some interactive media. However, I’ve made an exception for the Australian gaming streamer Beaglerush. He would play through XCOM campaigns, both in vanilla and LW, and commentate while playing with humor and wit, breaking down his tactical analysis, all while playing on the toughest difficulty. This format neatly sidesteps the “fitting narratives to RNG outcomes” problem suffered by other sports, both physical and electronic: XCOM is not nearly as fast-paced as other games, so the players themselves can talk about their decisions instead of having commentators guess at their intentions. Pretty much all turn-based games meet this criterion, but XCOM also breaks up gameplay so chunks fit into a person’s attention span, unlike some games that take at least 8 hours to complete (looking at you, Civ). Even when I don’t credit him directly below, Beaglerush had a hand in how I thought about each concept.

Fair warning about Beaglerush, though. If you want to follow along with the furthest-along LW campaign, it is 100+ hours long. The mod is not kidding when it says it’s a Long War. If you do watch it, remember there’s a 2x speed option on YouTube.

II. The Lessons

So that’s enough of me fawning over the game, what are the lessons I learned?

First, I feel like I better understand why strategies, in business or otherwise, are allergic to risk. In the words of Beaglerush, you want a boring game[1]; you want to play to win, you want to stack the deck as far in your favor as the game will allow, you want to have won before fighting. This is counterintuitive in a gaming context, where boredom is the true enemy. However, a well designed game like LW has a way of upending the best laid plans, throwing unexpected curve balls on a regular basis, and that’s where things get uncomfortably exciting. Bringing a “best case-only plan” or no plan at all will get your squad killed, so it’s up to you to make your own luck instead of letting the game give you some ready-made luck[2].

In a business context, I wondered why my team leads would obsess about pinning down possible sources of variance. It only became clear after I had underestimated the difficulty of my first projects (even while taking Murphy into account): translating back to a gaming model, the team leads were managing an XCOM firefight, and wanted to guarantee each shot would connect, to have worst-case contingency plans laid down before committing. Now, no one is going to die if a deadline slips a month. There’s some room for risk and subsequent outsize reward, which I presume is the reasoning behind strategies like Google trying to make sure they meet only 70% of their goals. A different attitude to risk is apparent when people really can die, like the NASA software shops that have layers of review for each line of code. But coming from a loose attitude towards risk, XCOM was instructive in showing me how quickly things could go wrong to the little digital soldiers I had gotten emotionally invested in. And just as important, I would just as quickly have the chance to try again.

Second, XCOM taught me about the value of having a crack team of max-level soldiers for any mission. The A-team makes the easy missions easy, and the hard missions possible. However, LW then taught me about scarcity and the need to ration and stretch soldiers: you can’t take your A-team on every mission because of fatigue, so you need to weigh the downsides of taking less useful lower level soldiers on this mission against the upsides of having a greater number of experienced soldiers in later missions, as well as having more experienced troops ready if the game throws a string of really hard missions at you right after the current mission. Once I started thinking about my troop deployments this way, I then subconsciously started applying it to work: “ah, my manager wants me to take these lower level troops, er, developers on this mission because she needs them to level up, but all the more experienced developers are fatigued, er, working on more complicated projects. Welp, guess I better not screw this up.”[3] It’s one thing to know that businesses are profit optimization engines: it’s another to virtually lead a dead-alien optimization engine, and then come to work and have some empathy for your boss.

Third, having more skills is awesome. Sometimes it’s obvious: in LW, the scout soldier class gets the Concealment perk in the middle of their experience progression, and it changes the class from a mediocre jack-of-all-trades soldier to the only soldier you need to scout, ever. Or, the medic class can choose to specialize into a combat medic with Rapid Reaction, which turns the class from a “healing and shoot once in a while” class to “shoot everything all the time, and healing once in a while I guess”. It’s not clear which real-world skills map to these sorts of game-changing skills, but I can guess that learning to study effectively, becoming better at public speaking, writing concisely and clearly, or learning how to lead a team would be the sorts of skills that would lead to a bump in effectiveness and power, even if they are boring.

Fourth, what about combining those skills for an effect greater than the sum of their parts? You know… synergy? Yeah, that bullshit corporate-speak word. However, in-game the concept makes total sense, especially in LW: it made so much sense, I sat down and planned out builds for each soldier class, and then printed them out and put them on the wall next to my gaming computer, like a giant nerd. But it’s worth looking like a giant nerd if you can stack sniper perks until you can roll shots dealing over 40 damage (the starting assault rifle with no perks averages 4 damage), or if you design a combat medic that can shoot 4 times a turn, or if you design gunners that essentially shoot infinite mini shredder rockets.

However, it isn’t clear how to map synergy back to the real world. It seems that the technical/business startup duo works pretty well (Jobs/Woz, Gates/Allen), and teaming an expert writer with an expert in anything else can produce fine books (for example, Peak was written in this way), but it’s unclear to me what else “synergy” can be generalized to without immediately stepping into pools of bullshit. I don’t think the concept is worthless, though. “Synergy” traded well enough in the idea marketplace that there was even a buzzword bubble to pop, and I’ve had run-ins with the concept (like this science fiction story) that can’t help but pique my interest. My bullshit-meter is still going off, but XCOM has convinced me that “synergy” might be something worth paying attention to.

This last idea is not directly related to management-like concepts like everything else, but I found it instructive. We know that people don’t have a good gut understanding of chance, partly because they seem to follow prospect theory and because numbers are hard. Given this, it’s quite the experience to play LW, because the modders took out all instances of cheating in the random number generator on behalf of the player. It doesn’t hit home that you also are subject to the gambler’s fallacy until you take a 75% chance to hit shot, miss, and say to yourself “surely this next 75% chance shot will connect!”, and miss again. At once I was enlightened: optimism is not a viable strategy. You could probably get the same experience with probabilities by working on calibrating yourself or betting in a prediction market, but it was helpful for me to get emotionally involved in the outcomes and receive lots of feedback in a tight loop.
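To put numbers on that double miss, here is a minimal sketch, assuming every shot is an independent roll with the displayed hit chance (which is how I understand LW to behave, per the above). The function names are mine, not anything from the game.

```python
import random


def miss_streak_probability(hit_chance, shots):
    """Chance of missing `shots` independent shots in a row."""
    return (1.0 - hit_chance) ** shots


def hit_rate_after_miss(hit_chance, trials=100_000, seed=0):
    """Simulate independent shots and measure how often the shot
    immediately following a miss connects."""
    rng = random.Random(seed)
    previous_missed = False
    hits_after_miss = 0
    shots_after_miss = 0
    for _ in range(trials):
        hit = rng.random() < hit_chance
        if previous_missed:
            shots_after_miss += 1
            hits_after_miss += hit
        previous_missed = not hit
    return hits_after_miss / shots_after_miss


# Two misses in a row at 75% to hit: 0.25 * 0.25 = 0.0625, about 1 in 16.
print(miss_streak_probability(0.75, 2))

# The shot after a miss still lands roughly 75% of the time, not more:
# the dice do not remember the previous roll.
print(round(hit_rate_after_miss(0.75), 2))
```

A 1-in-16 event is not rare when a campaign involves hundreds of shots, which is why it keeps happening at the worst possible moment.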

III. Conclusion

This analysis might tempt you to settle the question of whether video games are a waste of time by coming down hard on the side of “video games are not only fun, but educational”, and then to just continually extract lessons from games. Unfortunately, I don’t think that works: as noted before, some games are all about twitching your way to victory, and others, like in the 4X genre, are so slow it becomes difficult to link mistakes and consequences together. Additionally, the ideas I got a better handle on within LW aren’t ideas I need to be reintroduced to. There might be another game that can clarify other ideas for me, but it seems any given game is unlikely to do so.

TLDR: XCOM is pretty good. You should try it if you’re going to play video games anyways; maybe you’ll also learn something.

[1] Unfortunately, I can’t find where Beaglerush says this: I have a sinking suspicion that it’s in one of his LW beta 14 videos on Twitch, which are saved for a short time but ultimately ephemeral. That, or it’s hidden in the middle of hundreds of hours of video and I just missed it. So unfortunately you’re just going to have to take my word that he said it.

[2] A particularly egregious example of ready-made luck served to the player on a silver platter: vanilla XCOM would invisibly adjust shot success probabilities upwards if you missed a couple times in a row, which allowed sitting in good cover and taking a bunch of low-probability shots at the enemy to be a workable strategy. Of course, LW removed this mechanic.

[3] Is this sort of approach to human resources dehumanizing? Probably!


Surely You’re Hamming It Up, Mr. Feynman!

I was talking to friends about Deep Work, a book about doing, well, deep work, when I realized that I had two conflicting models of how to choose what to work on, and how to work on it.

The more straightforward approach is sketched by Richard Hamming in You and Your Research, which simply asks (paraphrased) “What are the important problems of your field, and if you’re not working on them, why not?” It’s an extraordinarily dense mantra, packing lots of decision power into a simple sentence: if you’re not focused on your field, then focus on your field, and if you’re not focused on the most promising area of your field, then focus on that area, and if you’re not focused on the most important problem of that area, focus on that problem. Everything else? Strip it away as much as possible, because the rocket equation is hell[1] and we are going to Mars!

Then there’s the more playful way to find problems, which the incorrigible Richard[2] Feynman described in Surely You’re Joking, Mr. Feynman!. Frustrated by his research problems, he decided that he would stop slaving away and just play with whatever problems caught his fancy: “Now that I am burned out and I’ll never accomplish anything… I’m going to play with physics, whenever I want to, without worrying about importance whatsoever.” He goes on to derive equations related to the physics of a spinning plate, because why not? Later, he realizes “The diagrams and the whole business that I got the Nobel Prize for came from that piddling around with the wobbling plate.” Feynman’s little anecdote is a direct repudiation of Hamming’s strategy, the triumph of play over a conscious effort to work on Important Things[3].

At each work’s core is a different philosophy. Hamming says “if it’s not important, you are wasting your time: by definition, how else can you do important work?”, and Feynman says “if it’s not joyful, you are wasting your time: how can you do your best work when it’s no longer important to you?”.

I flirt with both ways of thinking, but Hamming’s philosophy in particular rings in my ears. Do important work! Revel in the flow, conducting a grand symphony of gathered skills and knowledge into a masterpiece unlike any the world has seen! Though the tears, sweat, and blood blur your vision, behold your work, and see that it is good! Well, your work probably won’t actually end up being world-class, but what does it matter for a shot at glory? And never mind Hamming saying “I did sort of neglect [my wife] sometimes”, just choose a hill, the taller the better, and get ready to die on it.


You know, my current actions most closely fit a Feynman-style strategy, but I’m not even playing and learning in an effective way; instead of going to war against intractable problems, maybe I could consciously pursue a Feynman strategy and deliberately chase those “that’s funny…” moments. The problem is that it’s easier to slip into comfortable zones of thought, easier to craft trivial solutions to trivial problems, easier to wake up with the precursors of dementia and years of work even you don’t care about. And yet, gambling away the years of my life on an Important Problem[4] is a bitter proposition, and it is gambling: Hamming getting at least six different concepts named after him is a highly unusual outcome, not participation points for years of work on the right problems.

Not that we have to choose just one: the Way of the Fox tells us we should keep a stable of models and use each one when appropriate. If we recognize Hamming’s strategy as a primarily exploitative one, and Feynman’s strategy as an exploratory one, then we can just re-use the multi-armed bandit’s mechanism; we start by exploring, and gradually exploit more and more as we get to know the exploration space. Of course, like all models this doesn’t map neatly to real life, but it does indicate that mixing strategies by varying the amount of time one spends on different approaches to problems might be a workable solution. Then the question is how one should balance exploration and exploitation efforts, especially over time, which I will leave as an exercise for the reader.
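As a toy illustration of “explore first, exploit more over time”, here is a minimal sketch of one common bandit heuristic, epsilon-greedy with a decaying exploration rate. Everything specific here is an assumption for the sake of the example: the 1/sqrt(t) schedule, the three made-up payoff probabilities, and the names. It is not a claim about the best way to split a career between Feynman and Hamming.

```python
import random


def exploration_rate(step):
    """Probability of exploring at a given step; shrinks as steps go by."""
    return 1.0 / step ** 0.5


def decaying_epsilon_greedy(true_means, steps=5000, seed=0):
    """Play a Bernoulli bandit: explore at random early on, then
    increasingly exploit the arm with the best observed average."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    pulls = [0] * n_arms
    estimates = [0.0] * n_arms
    for step in range(1, steps + 1):
        if rng.random() < exploration_rate(step):
            arm = rng.randrange(n_arms)  # Feynman mode: try anything
        else:
            # Hamming mode: work on what currently looks most valuable
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        pulls[arm] += 1
        # Incremental update of the running average reward for this arm
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]
    return pulls, estimates


# Three hypothetical "problems" with hidden payoff probabilities.
pulls, estimates = decaying_epsilon_greedy([0.2, 0.5, 0.8])
print(pulls)
```

With this schedule, most of the random exploration happens in the first few hundred steps, after which nearly all effort goes to whichever arm looks best. Real life differs in at least one important way: you do not get five thousand pulls.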

Even refusing to use a mixed strategy might not turn out badly. Weinersmith points out that it’s possible to build yourself into an expert many times within a life, so you can work your way up to working on Important Problems multiple times. But keep in mind that Important Problems are the things one cracks over a career, not right after attaining mastery, so re-training every decade is exploring too often to actually make any deep progress. However, I like to read this instead as reassuring people that they don’t just have one shot at becoming an expert: if you just went to grad school, and it turns out you utterly detest your field’s Important Problems, it’s still possible to refocus. It’s a high cost, but it’s not an infinite one. And that might be the difference between paralyzing yourself with how important the choice of field is, and making a quick partially informed decision before plunging in headfirst and learning more by actually doing things.

I still don’t have an answer at this point. These are just meditations on resolving dissonance between two different respected sources. At this end, these questions remain: who will I be, and what will I do?

[1] Rocketry is hard because you need to carry your fuel: for every pound of stuff you want to put into orbit (or farther), you need the fuel to boost that pound, and then the fuel to boost that fuel, and the fuel to boost that fuel, ad nauseam. This means that if you are carrying anything gratuitous and unnecessary, then you are doing rocketry wrong and you will not go to space today. Hat tip to Sam’s Ra and Kerbal Space Program for helping solidify this concept for me.

[2] Since I’m reading Unsong, this correspondence leapt out at me. Both the scientists are Richards. Both lived from approximately 1910-1990. Both worked in Los Alamos during the war. Both are physical scientists. This Is Not A Coincidence Because Nothing Is A Coincidence.

[3] For a possible follow up, Robin Hanson recently pointed out that play must be important.

[4] Quote: “Trying to do the impossible is definitely not for everyone. Exceptional talent is only the ante to sit down at the table. The chips are the years of your life. If wagering those chips and losing seems like an unbearable possibility to you, then go do something else. Seriously. Because you can lose.”
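A postscript to [1]: the “fuel to boost the fuel” regress is what the Tsiolkovsky rocket equation captures, and a minimal sketch shows how fast it bites. The delta-v and specific impulse below are rough illustrative values I picked for a low Earth orbit launch with a chemical engine, not figures from any of the sources above, and the model is an idealized single stage.

```python
import math

STANDARD_GRAVITY = 9.80665  # m/s^2, used to convert specific impulse


def mass_ratio(delta_v_m_s, specific_impulse_s):
    """Tsiolkovsky rocket equation, solved for initial mass / final mass."""
    exhaust_velocity = specific_impulse_s * STANDARD_GRAVITY
    return math.exp(delta_v_m_s / exhaust_velocity)


def propellant_per_unit_dry_mass(delta_v_m_s, specific_impulse_s):
    """Propellant needed for each unit of mass that arrives at the end."""
    return mass_ratio(delta_v_m_s, specific_impulse_s) - 1.0


# Assumed, illustrative values (not from the post):
DELTA_V_TO_ORBIT = 9400.0  # m/s, rough budget including losses
SPECIFIC_IMPULSE = 350.0   # s, ballpark for a chemical engine

ratio = mass_ratio(DELTA_V_TO_ORBIT, SPECIFIC_IMPULSE)
print(f"Liftoff mass is about {ratio:.0f}x the mass that reaches orbit")
```

The dependence is exponential: doubling the required delta-v squares the mass ratio, which is why every gratuitous pound hurts so much.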

All content is licensed under CC-by-nc-sa