This post is split into 3 parts:
- A fast, high-level overview, to provide background for the next two parts.
- My answer to the question “should you read this book?”
- A mildly more detailed breakdown of the book into chunks, paired chapters that share a theme. The intent is to provide a map of where certain material is, to help readers hit just the parts they want.
What does the textbook cover?
Chapters 1-4: Deriving/applying probability theory
Jaynes starts by deriving probability theory from basic building blocks, rigorously(ish) deriving results from basic properties until he has covered many of the results presented in an introductory probability textbook. Then Jaynes extends to results beyond introductory textbooks, to show off the flexibility of the methods developed.
Chapters 6-7, 11-12: Priors
As befits a Bayesian textbook, the choice of priors is a running topic. Jaynes starts with the principle of indifference, extends to maximum entropy distributions with an entire chapter devoted to the Gaussian, and includes pointers to marginalization and coding theory as prior generation methods in later chapters.
Chapters 8-10, 15, 16-17: Against orthodox statistics
Probability Theory is half Bayesian textbook, and half polemic against the old guard Fisher/Neyman Orthodox school of statistics. The polemic nature of the textbook is not confined to these chapters, but these chapters are where Jaynes focuses on picking apart orthodox practices, tools, and sociology.
Chapters 13-14, 18-22: Other
The other chapters make sense for inclusion, but don’t have a strong common theme, covering:
- decision theory.
- a use of probabilities of probabilities.
- outlier/bad data handling.
- communication theory.
Originally I thought I would be badass and work through all the proofs: after spending around half a year of my free time on the first 6 chapters, I gave up on that plan. For calibration, the most advanced math course I took in college was partial differential equations. If you’re more adept at math, you may be able to follow along with the flood of proofs more easily.
After skipping through the mathematical parts, there’s still interesting insight to be derived, but it likely isn’t as clear as a work that was written from the ground up to serve as loose insight fodder. However, I don’t know if that work yet exists, and so the textbook might still be worth reading if you want a look at this philosophical mindset.
On first reading, the polemic parts enlivened the work, but looking back I see them as offputting: even if they are true and necessary, the tone leaves a bad taste in my mouth.
Should you read this?
Reasons to read it:
- You have a few months of your schedule cleared, mathematical aptitude, and a desire to learn the foundations of Bayesian probability theory inside and out.
- You’re interested in how to motivate probability theory. Some folks recommend reading just the first 2 chapters, but I wonder if explanations of Cox’s theorem available on the internet are sufficient.
- You’re a frequentist, want to learn about how frequentism and Bayesian approaches relate to each other, and (importantly) don’t mind feeling attacked.
- You want a polemic; moreover, you want a highly technical polemic. I only hesitate, since it’s a long polemic, and there are likely papers written by Jaynes that encapsulate this feeling more efficiently.
Reasons not to read it:
- You need something practical, to use in your day to day work as quickly as possible. As befits a book with “Theory” in the name, this is not that book.
- You want to learn about modern Bayesian approaches that take full advantage of the current glut of computational power. Jaynes died in 1998, just as the dotcom boom started taking off, and WinBUGS (an early Bayesian analysis package) had only just been released. I can vouch for Statistical Rethinking as a beginner-friendly text, and Bayesian Data Analysis is regarded as a good non-introductory work.
- You are bothered by incomplete works. Jaynes didn’t finish writing Probability Theory before he died, and the later chapters have big chunks missing and lines of inquiry dropped on the floor.
In my understanding, it appears that most of the conceptual chunks consist of paired chapters.
Chapters 1-2: Grounding Probability Theory
The textbook starts by laying out the reasoning behind probabilistic thinking, first motivating thinking about probabilities by contrasting it with Aristotelian logic. Then, Jaynes sets out to chart the properties of probabilistic thinking, framing it as an extended logic.
Unsurprisingly, Jaynes delivers the usual definition of probability, with the usual product/sum/Bayes rule. Surprisingly, he derives these fixtures by starting with a simple set of desiderata (desired properties):
- Plausibility is represented by continuous real numbers.
- Qualitative correspondence with common sense. This means using syllogisms similar to but weaker than Aristotle’s.
- Consistent reasoning, including:
- Path independence. If an answer can be calculated multiple ways, then each calculation should give the same answer.
- Non-ideological. The reasoner does not leave out information.
- Equivalent states of knowledge are represented by the same number.
and simply working forwards via mathematical derivation, laying out the entirety of Cox’s Theorem. As befits a Bayesian probability textbook, a term for background/prior information is included from the beginning without much fanfare.
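For reference, the rules that come out of this derivation are usually written as below. The symbol X for the background/prior information is a notational choice made here, and the equations are the standard statements of the product rule, sum rule, and Bayes’ rule rather than a quotation from the book.

```latex
\begin{align}
P(AB \mid X) &= P(A \mid BX)\, P(B \mid X) \\
P(A \mid X) + P(\bar{A} \mid X) &= 1 \\
P(H \mid DX) &= P(H \mid X)\, \frac{P(D \mid HX)}{P(D \mid X)}
\end{align}
```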
Chapters 3-4: Doing Inference
With the product and sum rules in place, Jaynes works out exact solutions to “draw a ball without replacement from an urn” problems, including a surprising backwards inference to the first draw given information about a later ball draw.
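The backwards inference can be checked by brute force. The sketch below is my own illustration, not Jaynes’s derivation: it assumes a hypothetical urn with 3 red and 2 white balls, enumerates every equally likely ordering, and compares the probability that the first ball is red given a red second ball against the more familiar forwards probability. The function name is made up for this example.

```python
from fractions import Fraction
from itertools import permutations

def draw_probabilities(red, white):
    """Return (P(first red | second red), P(second red | first red))
    for draws without replacement, by enumerating all orderings."""
    balls = ["R"] * red + ["W"] * white
    first_red = second_red = both_red = 0
    # Permute ball indices so that every ordering is equally likely.
    for order in permutations(range(len(balls))):
        a, b = balls[order[0]], balls[order[1]]
        first_red += a == "R"
        second_red += b == "R"
        both_red += a == "R" and b == "R"
    backward = Fraction(both_red, second_red)
    forward = Fraction(both_red, first_red)
    return backward, forward

backward, forward = draw_probabilities(3, 2)
print(backward, forward)
```

Under these assumptions the two conditional probabilities coincide: knowing about the later draw is exactly as informative about the earlier one as the reverse.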
Expanding to drawing with replacement, Jaynes takes the chance to draw a map vs territory distinction: randomization is throwing away data. While he goes on to derive the usual randomized draw results, he also extends the result to draws with (simplified) non-perfect randomization.
After working forwards from given generating information (given this urn, what is the probability of drawing a red ball?) Jaynes also works backwards to do hypothesis testing (given these draws, what does the urn look like?). There’s a bit of concept/terminology thrashing when Jaynes adopts and throws away terms (decibels of evidence, the log form of the likelihood) as he generalizes to multiple hypothesis testing.
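To make the “decibels of evidence” idea concrete, here is a minimal sketch assuming a single hypothesis tested against its negation. Evidence is 10·log10 of the odds, and each observation adds 10·log10 of its likelihood ratio. The function names and the 0.8/0.2 likelihoods are hypothetical, chosen only for illustration.

```python
import math

def evidence_db(prob):
    """Evidence for a hypothesis in decibels: 10 * log10 of the odds."""
    return 10 * math.log10(prob / (1 - prob))

def update_db(prior_db, p_data_given_h, p_data_given_not_h):
    """Add the likelihood ratio of one observation, expressed in decibels."""
    return prior_db + 10 * math.log10(p_data_given_h / p_data_given_not_h)

def db_to_prob(db):
    """Convert decibels of evidence back into a probability."""
    odds = 10 ** (db / 10)
    return odds / (1 + odds)

# Hypothetical numbers: even prior odds, then two observations that are
# each four times as likely under the hypothesis as under its negation.
e = evidence_db(0.5)
e = update_db(e, 0.8, 0.2)
e = update_db(e, 0.8, 0.2)
posterior = db_to_prob(e)
```

The appeal of the log form is that independent pieces of evidence simply add, which is why it works nicely for two hypotheses and gets awkward once there are more.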
By this point, it’s pretty clear that Jaynes has an axe to grind, with a constant exhortation to OBEY THE RULES OF PROBABILITY THEORY and a vendetta against taking underdefined limits to infinity.
Chapter 5: Queer uses for probability theory
A grab bag of topics, a break chapter of sorts. Jaynes talks about ESP, which leads to the counter-intuitive idea that different priors can lead to different people updating their probabilities in opposite directions (as an extreme example, priors may include “Bob is a paragon of truth” and “Bob is a compulsive liar”). He also talks about the importance of comparing alternatives, instead of evaluating hypotheses in a vacuum, offering a solution/dissolution of Hempel’s paradox (is seeing a white shoe supporting evidence for “all crows are black”?).
Chapters 6-7: What priors should we use?
Jaynes starts by generalizing hypothesis testing to a continuous domain, but I think this chunk is more properly thought of as starting to tackle the hard question of prior selection. He works out the impact on hypothesis testing of choosing uniform, truncated uniform (assumes at least 1 ball of each color in an urn), concave (more uninformative than uniform), and “binomial monkey” priors.
Chapter 7 is just about the normal/Gaussian distribution. Jaynes includes 3 different derivations of the distribution (by Herschel-Maxwell, Gauss, and Landon), which seems like overkill. However, his motivation is to explain the unreasonable effectiveness of the normal distribution (other distributions naturally become Gaussian, Gaussians stay Gaussian under common operations, and the Gaussian is the maximum entropy distribution for a given mean/variance), and to dispel the unease people feel when they use a distribution that doesn’t match their (unknown) error distribution. I also think of this as his first attempt at making good on his promise in the preface to teach us maximum entropy methods.
Jaynes also weighs in on:
- early stopping, basically stating that possible data sets should not impact analysis, especially over the actual data set that was collected.
- improving precision by aggregating data. Contravening folk wisdom, averaging a bunch of data with 3 significant digits means we can confidently state the average with 4 significant digits.
Chapters 8-10: Against frequentist tools
So far most of the topics have been about explaining the Bayesian approach to probability, but now Jaynes lays into frequentist tooling and philosophy.
I didn’t get much out of this chunk, since I wasn’t grounded in frequentist practice before. However, here is a list of topics Jaynes presents as redundant or supplanted by the Bayesian approach:
- sufficient statistics.
- the likelihood principle.
- ancillary statistics.
- ad-hoc evidence combination. Includes a cute parable about estimating the height of the emperor of China (if 1 billion people each know the emperor’s height to ±1m, averaging all the estimates should get us an estimate with stdev 1m/√N ≈ 0.03mm; the key to the paradox is that the individual estimates are not independent) and a warning against something similar to Simpson’s paradox.
- χ², or significance tests in general that purport to evaluate a hypothesis without any alternatives to compare against.
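The emperor of China arithmetic from the list above can be sketched as follows. The independent case reproduces the 1/√N figure; the correlated case uses the standard formula for the variance of a mean of equally correlated estimates, with a correlation of 0.5 picked purely as a hypothetical stand-in for “everyone shares the same misinformation”. Neither the function name nor that correlation value comes from the book.

```python
import math

def stdev_of_mean(sigma, n, rho=0.0):
    """Standard deviation of the average of n estimates, each with standard
    deviation sigma and pairwise correlation rho between any two of them."""
    variance = sigma ** 2 * (rho + (1 - rho) / n)
    return math.sqrt(variance)

n = 1_000_000_000
independent = stdev_of_mean(1.0, n)          # metres, assumes independence
correlated = stdev_of_mean(1.0, n, rho=0.5)  # metres, hypothetical shared error

print(f"independent: {independent * 1000:.3f} mm")
print(f"correlated:  {correlated:.3f} m")
```

With any nonzero correlation the shared component of the error does not average away, so the uncertainty stays on the order of the individual uncertainty no matter how many people are polled.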
Jaynes also spends around half the chunk explaining connections between Bayesian probability theory as thus far explained and frequentism, showing that in a pretty simple case the frequentist solution is equal to the Bayesian one with an ignorant prior.
(Of interest to people from LW: Jaynes is certain that the probabilistic nature of quantum mechanics is false, that the quantum physicists have given up the cause too easily. With that in mind, it feels like Yudkowsky’s quantum mechanics sequence is in part a response to this charge.)
Chapters 11-12: Discrete/continuous priors
Now in part 2, Jaynes loops back around to expand on Bayesian concepts.
So far Jaynes has touched on maximum entropy priors, especially around the normal distribution, but now he lays out a more rigorous definition of information entropy, working from desiderata:
- there’s a connection between uncertainty and entropy.
- continuity: entropy is a continuous function of p.
- if there are additional choices, uncertainty increases.
- consistency: if there are multiple ways to get an answer, they should agree.
from which Jaynes derives information entropy, following the Wallis derivation. Using this definition, he expands maximum entropy distributions past just considering an average from prior data.
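As a rough illustration of a maximum entropy distribution constrained by an average, the sketch below finds the highest-entropy distribution over the faces of a die with a prescribed mean. It relies on the standard result that the solution has the form p_i ∝ exp(λ·i) and finds λ by bisection; the function names, the bisection bounds, and the choice of a die are my own assumptions for this example.

```python
import math

def entropy(p):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(q * math.log(q) for q in p if q > 0)

def maxent_die(target_mean, faces=6, tol=1e-12):
    """Maximum entropy distribution on faces 1..faces with a fixed mean.
    Assumes the exponential form p_i ~ exp(lam * i) and solves for lam."""
    def dist(lam):
        weights = [math.exp(lam * i) for i in range(1, faces + 1)]
        total = sum(weights)
        return [w / total for w in weights]

    def mean(lam):
        return sum(i * q for i, q in enumerate(dist(lam), start=1))

    lo, hi = -50.0, 50.0  # the mean increases monotonically with lam
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return dist((lo + hi) / 2)

fair = maxent_die(3.5)    # no information beyond the symmetric mean
loaded = maxent_die(4.5)  # mean shifted towards the high faces
print([round(q, 4) for q in loaded])
```

With a mean of 3.5 the constraint adds nothing and the result is uniform; with a mean of 4.5 the probabilities tilt smoothly towards the high faces while staying as spread out as the constraint allows.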
He also extends to continuous maximum entropy distributions, tackling the problem of getting a distribution that is invariant under parameter changes. (For example, a uniform prior can lead to different results depending on whether it’s uniform over x or x⁵.)
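A small sketch of why the parameterization matters, assuming x lies in (0, 1): the same question, “how likely is x to be below one half?”, gets very different answers depending on which quantity the prior is uniform over. The function name is hypothetical.

```python
def prob_x_below(threshold, power=1):
    """P(x < threshold) for x in (0, 1) when the prior is uniform
    over x**power rather than over x itself."""
    # x < threshold  <=>  x**power < threshold**power,
    # and x**power is assumed uniform on (0, 1).
    return threshold ** power

uniform_over_x = prob_x_below(0.5, power=1)
uniform_over_x5 = prob_x_below(0.5, power=5)
print(uniform_over_x, uniform_over_x5)  # 0.5 0.03125
```

Both priors could be described as “uniform”, yet they disagree by a factor of 16 here, which is the motivation for looking for priors that are invariant under such changes of variable.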
Chapters 13-14: Decision Theory
This is a bit of a strange chunk: does this really belong in a textbook entitled Probability Theory?
First, Jaynes lays out some groundwork, including a demonstration of non-linear utility (the Tversky/Kahneman research program shows up a few other times in Probability Theory) and the usual squared/linear/Dirac delta loss functions resulting in the usual mean/median/mode estimates.
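The link between loss functions and estimates can be checked numerically. This sketch minimises the average loss over a small hypothetical data set by grid search; the data, the grid, and the function name are all made up for illustration, and a 0/1 “hit or miss” loss stands in for the Dirac delta loss, which only makes sense in the continuous case.

```python
import statistics

def best_estimate(samples, loss, candidates):
    """Return the candidate that minimises the total loss over the samples."""
    return min(candidates, key=lambda c: sum(loss(c - s) for s in samples))

samples = [1, 2, 2, 3, 12]              # hypothetical skewed data
grid = [i / 10 for i in range(0, 121)]  # candidate estimates 0.0 .. 12.0

squared = best_estimate(samples, lambda e: e * e, grid)
absolute = best_estimate(samples, abs, grid)
hit_or_miss = best_estimate(samples, lambda e: 0 if abs(e) < 1e-9 else 1, grid)

print(squared, statistics.mean(samples))      # squared loss  -> mean
print(absolute, statistics.median(samples))   # linear loss   -> median
print(hit_or_miss, statistics.mode(samples))  # 0/1 loss      -> mode
```

The skewed sample makes the difference visible: the outlier at 12 drags the squared-loss estimate upwards while leaving the other two untouched.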
Getting to the heart of the matter, Jaynes walks through Wald’s decision theory, and ties it to Bayesian inference by way of identifying a prior distribution in the likelihood calculations. Then, he walks through deriving different decision rule criteria (minimax) from a Bayesian criterion.
Given that Jaynes identifies these developments of decision theory as starting points for the Bayesian revolution, it makes sense why this topic shows up, even if only for historical context.
Chapter 15: Stop, Paradox time
In another break chapter, Jaynes picks apart paradoxes, most of which have to do with improper limits to infinity, including:
- the non-conglomerability paradox.
- the Borel-Kolmogorov paradox.
- the marginalization paradox. In this case, the main problem seems to be due to non-rigorous handling of an implicit prior, and apparently led to another way to generate priors. (But see this comment, which claims Jaynes got it wrong)
I think Jaynes included discussion of these paradoxes to shore up his position that probability theory is not just practically workable, but correct, applicable everywhere. In this light, tackling historical confusions makes sense.
Chapters 16-17: Against Orthodoxy
Jaynes returns to ragging on frequentists, this time with a more philosophical bent.
Jaynes presents his take on the sociology of orthodox statistics as encouraging learned helplessness with a cookbook approach (Statistical Methods for Research Workers) and doctor-client-like statistician relationship, instead of teaching researchers base principles themselves. (I do wonder if he was typical minding and overestimating people’s ability to generate probability theory from scratch.) Unfortunately, he also stoops to comparing personal details on Fisher and Jeffreys, which made me feel like I was reading a high-brow academic tabloid instead of a textbook.
That said, there are more technical arguments as well, around the choice of unbiased estimators, the practice of prefiltering data (if you smooth your data, you get some future data into your current data), and the (mis)use of the sampling distribution width as a measure of estimator goodness.
An interesting insight that makes intuitive sense to me is that Fisher and Jeffreys were naturally responding to their fields: Fisher in biology had lots of data, but not as much theory, and Jeffreys in geophysics had well developed theory, but not much data. In that light, it is no wonder only Jeffreys considered priors important.
Chapters 18, 20: Future work
(The strongly thematic paired chapters start to fall apart here.)
Now we’re getting into some more speculative work, which Jaynes thought should be developed with more rigor in the future.
Take, for example, the A_p distribution, in which Jaynes models a distribution of probabilities, instead of a distribution of values, leading to the final probability of a particular hypothesis as an expectation of the distribution. It seems necessary in order to avoid recomputing probabilities with all the gathered data so far, but he didn’t have a principled motivation for the construct, nor a way to avoid infinitely regressing with a probability of a probability of a…
He also sneaks in a discussion of Laplace’s rule of succession, attempting to rescue it from rampant misunderstandings (it only applies with little to no prior information), and uses it as a vehicle to tie together probability/frequency again.
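For readers who haven’t met it, the rule of succession in its usual form says that after s successes in n trials, the probability of success on the next trial is (s + 1)/(n + 2), assuming a uniform prior over the unknown success frequency. A minimal sketch, with a function name of my own choosing:

```python
from fractions import Fraction

def rule_of_succession(successes, trials):
    """Laplace's rule: P(next trial succeeds) = (s + 1) / (n + 2),
    which follows from a uniform prior on the success frequency."""
    return Fraction(successes + 1, trials + 2)

print(rule_of_succession(0, 0))       # no data at all
print(rule_of_succession(10, 10))     # ten successes in a row
print(rule_of_succession(700, 1000))  # lots of data
```

With no data the answer is 1/2, and with lots of data it approaches the observed frequency, which is the sense in which the rule ties probability and frequency together. The uniform prior is also why it misfires when you actually do have prior information.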
Jaynes also touches on model comparison, but only as an expanded sort of hypothesis testing. Most of the chapter is not super practical: for example, while he vaguely gestures towards the problem of overfitting, he doesn’t give concrete solutions to solve it. (The recent generation of Bayesian textbooks noted above give more detail on possible solutions.)
Chapters 19, 21: Wrapping up loose ends
Jaynes shows that Bayesian methods don’t need as much hand holding if the model can take into account bad/inaccurate data, being able to down-weight data the model concludes is bad without needing manual intervention (e.g. outlier removal).
Chapter 22: Example Application, Communication Theory
The chapter is kind of a worked application of max entropy, but otherwise it’s not clear why Jaynes decided to include this chapter. My best guess is that he meant to make clearer the use of coding theory for prior generation, but wasn’t able to do so before he died.
Plus, it’s not a great introduction to communication/coding theory, when he won’t call a Huffman code a Huffman code.
 ↑ Rigor as defined relative to me: others have described it as “fast and loose” in the tradition of the Griffiths physics textbooks.
 ↑ Looking over the table of contents for A First Course in Probability by Sheldon Ross, many but not all of the chapters would be covered.
 ↑ I was well into my software engineering job, which did leave me less intellectual energy. If you are not as heavily intellectually taxed, this timeline could be compressed.
 ↑ Keep in mind that not all courses are the same; doing well in the equivalent course at MIT would be more impressive than what I did at my small liberal arts college, and hence bode better for following all the details.
 ↑ For example, a good chunk of the time I was trying to puzzle out whether some step was in fact a legal operation, and how it was legal.
 ↑ Like, how often are you going to need to rederive the sum and product rule from basic principles, instead of simply leaning on the knowledge that probability theory has such a basis?
 ↑ It’s unclear to me exactly how non-introductory you have to get. Like, do you just need to know about distributions? Have taken a graduate course?
 ↑ It’s a bit weird to include 3 different sub-items under the same “consistent reasoning” heading, but apparently this is the traditional formulation of the desiderata.
 ↑ To be fair, it does seem like it’s a problem that people tend to fall into easily.
 ↑ There’s a fun worked example that underlines the ad-hocness of the χ² test: let’s say we have a (thick) coin, and a person that knows there’s a coin flip (49.9% heads, 49.9% tails, 0.2% on edge) and a person only informed there are 3 outcomes (33.3% to everything). For the data 14 heads/14 tails/1 edge, then χ²_coin = 15.33 and χ²_equal = 11.66.
Note there are practical ways to overcome the problem of a small category with the 𝛘2 test, some more satisfying than others.
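The two statistics quoted in this footnote can be reproduced with the usual Pearson formula, the sum over categories of (observed − expected)² / expected. This is a minimal sketch with a hypothetical function name, not code from the book.

```python
def chi_squared(observed, probabilities):
    """Pearson chi-squared statistic for observed counts against
    the category probabilities predicted by a hypothesis."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p)
               for o, p in zip(observed, probabilities))

observed = [14, 14, 1]  # heads, tails, edge
chi2_coin = chi_squared(observed, [0.499, 0.499, 0.002])
chi2_equal = chi_squared(observed, [1 / 3, 1 / 3, 1 / 3])

print(round(chi2_coin, 2), round(chi2_equal, 2))  # 15.33 11.66
```

Almost all of the coin hypothesis’s statistic comes from the single edge landing, whose expected count is only 0.058, which is how the better-informed hypothesis ends up with the worse score.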
 ↑ Keep in mind this is coming from the man that coined “whenever there is a randomized way of doing something, then there is a nonrandomized way that delivers better performance but requires more thought”.
 ↑ It seems like the Wald maximin model is the major contribution Wald made to decision theory, but it’s not obvious this is the same concept as in the textbook. It also doesn’t seem like this decision theory matches directly to CDT or EDT.
 ↑ The outlier removal example used Euler’s attempt to estimate the orbital parameters of celestial bodies as a frame device. I’m half taken by the idea of using this frame as a demonstration of Bayesian methods: just go outside, make crappy measurements with crappy equipment, and derive an okay solution for the orbital parameters for the inner solar system.