[Review] Visualizing Data

If you are not already acquainted with the works of Edward Tufte, you ought to go fix that, especially if you happen to be a younger me who could use a badassery-boost later in life. Basically, Tufte encourages thought about data display beyond the usual “plot and run away” practiced by most everyone, chides those who try to deceive with displays, and fires the imagination with wondrous examples.

However, Tufte writes for data mongers who are crafting displays for a popular audience. What if your audience is scientific, sophisticated, and yet still looking for that “interocular impact”? Then, as an Amazon reviewer once pointed out to me, maybe Visualizing Data by William Cleveland could help you.

Certainly, the target audience of Visualizing Data is the scientific community: the book eschews any amount of glitz for effective visualization techniques (as does Tufte) with a solid grounding in math (never mind). Instead of searching for visualization techniques that are only suited for a few scenarios (such as Minard’s chart of Napoleon’s Russian campaign), Cleveland takes a clinical approach of developing a few nearly orthogonal techniques and extending them into ever-increasing dimensions (the loess is a particularly intriguing example).

With the trek further into the book and into higher dimensions come different ways of looking at the data, such as stereographic techniques for viewing 3D data, which are interesting in themselves. Notably, Cleveland also develops methods for finding subtle dependencies in the data in these higher dimensions, bypassing the intuitive but flawed solution of merely dishing out a pair of two-dimensional plots.

This all boils down to Cleveland’s main mission of battling “rote data analysis”, the application of statistical techniques without critical thought. The bulk of the book is taken up with applying the various tools developed to real-world data sets, augmenting the case studies with comments on how to go about the analysis effectively. He also compares and contrasts his results with previous studies done on the very same data sets by people likely to have used rote data analysis, as a none-too-subtle indication of the sort of fate awaiting those using least-squares without thinking. Overall, the style is instructive and helpful, and quite the pedagogical win.

However, I should note that I said earlier that Tufte writes for data mongers creating displays for the general public, which implies that Cleveland is not so suited to the same. Indeed, tools such as the q-q, r-f, or m-d plots are not part of the public’s visual literacy; I myself had to read the explanation of q-q plots a second time before grasping what was going on. To generalize, you would use Cleveland’s methods to convince your colleagues, and Tufte-like displays to convince everyone else.
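For the curious, here is a rough sketch of what a quantile-quantile (q-q) plot computes, since it took me two reads to get it. The idea is to sort both samples and pair up their matching quantiles; if the two distributions differ only by a shift, the pairs fall on a line parallel to y = x. The function names (`quantile`, `qq_pairs`) and the interpolation rule are my own assumptions for illustration, not necessarily the conventions the book uses.

```python
import random


def quantile(sorted_values, f):
    """Linearly interpolated quantile of an already-sorted list, 0 <= f <= 1.

    Assumption: position = f * (n - 1); other conventions exist.
    """
    position = f * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def qq_pairs(sample_a, sample_b):
    """Return (quantile of a, quantile of b) pairs, one per plotted point."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    n = min(len(a), len(b))  # use the smaller sample's resolution
    fractions = [(i + 0.5) / n for i in range(n)]
    return [(quantile(a, f), quantile(b, f)) for f in fractions]


if __name__ == "__main__":
    random.seed(0)
    baseline = [random.gauss(0, 1) for _ in range(200)]
    shifted = [random.gauss(2, 1) for _ in range(150)]
    # Points sitting roughly 2 above the y = x line suggest a pure shift.
    for x, y in qq_pairs(baseline, shifted)[::30]:
        print(f"{x:7.3f} {y:7.3f}")
```

Feeding the pairs to any scatter plot routine gives the actual picture; the point of the sketch is only that the plot compares whole distributions rather than individual observations.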

While I borrowed this book as part of my “read all the books in the library” campaign, I may be compelled to buy it anyway. If you’re in the business of playing with data (not necessarily machine learning domains, because trying to display hundreds of dimensions just will not work), then I highly recommend checking this book out.