I put the call out on Twitter for ideas for my first post, and Gabe asked this:
I suppose I did ask for statistics questions. This one is a bit tough to answer because, as I hinted on Twitter, a wide variety of things get called statistics by the people doing them, statisticians do an even wider variety of things, and, to muddy the waters even more, lots of things that typically get categorized as statistics are also categorized as other things, like machine learning or computer science. I suppose I should blame the computer people.
To get a sense for the word, we can start at Wikipedia:
Statistics is the study of the collection, organization, analysis, interpretation and presentation of data.
So statistics is about data, that much is certain. The first question of the first test in the Stat 104 course I taught asked:
Statistics is about __________________.
If you had written in “data” I would have marked it wrong. So much for data, eh? The password answer I wanted in that blank, just like the answer my predecessors wanted in the long, storied history of Stat 104, was variation. If you guessed that I wanted that answer because I got my lecture notes, assignments and example tests from a previous instructor, you’d be right. But I realized while teaching that course that there was something to that definition. Pretty much everything statisticians do with data has something to do with analyzing, organizing, interpreting and presenting the variation in data. Or lack thereof.
The central focus of classical statistics is variation. Suppose you have some variable you care about. Classical statistics basically breaks down the variation in that variable into two components: 1) variation that can be explained by a given model and 2) variation that can’t, i.e. the error. Then, using these components, you try to answer all sorts of interesting questions, as long as you can phrase them as questions about the model.
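To make the explained-versus-unexplained split concrete, here is a minimal sketch. It assumes the "given model" is a straight line fit by least squares, which is my choice for illustration and only one of many possible models. The function name `decompose_variation` and the toy data are made up for this example.

```python
from statistics import mean


def decompose_variation(x, y):
    """Fit y = a + b*x by least squares and split the variation in y.

    Returns (total, explained, unexplained) sums of squares.
    The straight-line model is an assumption made for illustration.
    """
    x_bar, y_bar = mean(x), mean(y)
    # Least-squares slope and intercept.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    a = y_bar - b * x_bar
    fitted = [a + b * xi for xi in x]
    # Total variation of y around its mean.
    total = sum((yi - y_bar) ** 2 for yi in y)
    # Variation the fitted line accounts for.
    explained = sum((fi - y_bar) ** 2 for fi in fitted)
    # Variation left over: the error.
    unexplained = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return total, explained, unexplained


# Toy data, invented for this example.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
total, explained, unexplained = decompose_variation(x, y)
print(total, explained, unexplained)
```

For a least-squares line with an intercept, the two pieces add back up to the total (up to floating-point rounding), which is what lets you ask questions like "how much of the variation does the model explain?"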
But wait, what is a model? This question gets at the heart of what statistics is all about. There are lots of peripheral things that people do that can be classified as statistics without much fuss, but they’re, well, peripheral. The key component tying most of the field together is probability theory. We can’t even get statisticians to agree on what probability is, but by and large everyone agrees on the math. And this math is used to build models. Models of what? Well, that depends on who you ask. A frequentist might tell you we’re building models of the data generating process. A Bayesian might tell you we’re building models of our uncertainty about the data generating process. A pragmatist might refuse to answer the question and scuttle away. But we build models and those models use probability.
Lots of disparate things under the statistics umbrella are tied to the center by probability. Statisticians use probability to think about data collection and experimental design. Data visualization and model construction are mutually reinforcing – we use plots to help select useful models and models to help come up with useful plots. Statistical computing is driven by the need to fit models faster and to this end often uses concepts from probability. Mathematical statistics is basically the derivation of high-falutin’ probability theory relevant to statistical problems.
So statistics is about data and using probability to understand variation in the data. Except when it isn’t. Probability-free statistics looks a lot like frequentist statistics, except it ignores probability theory entirely. So while probability theory isn’t essential, it’s pretty close. Probability-free statistics is also a good test case for figuring out the other essential features of statistics. In probability-free statistics, you are given a sequence of observations and you attempt to predict the next observation in the sequence. Given a set of different prediction algorithms, you can show, using what amounts to worst-case thinking in decision theory, that the best prediction is some sort of weighted average of the predictions of the original set.
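One way to picture "some sort of weighted average" is the sketch below. It assumes an exponentially weighted forecaster with squared loss, a common construction in this area. The specific weighting rule, the learning rate `eta`, and the two toy "experts" are my assumptions for illustration, not something the argument above depends on.

```python
import math


def exp_weighted_forecast(expert_predictions, outcomes, eta=0.5):
    """Combine expert predictions by a loss-based weighted average.

    expert_predictions[t][i] is expert i's prediction at round t.
    outcomes[t] is the observation revealed after round t's forecast.
    eta is a hypothetical learning-rate parameter chosen for illustration.
    """
    n_experts = len(expert_predictions[0])
    cumulative_loss = [0.0] * n_experts
    forecasts = []
    for preds, outcome in zip(expert_predictions, outcomes):
        # Experts with more accumulated loss get exponentially less weight.
        weights = [math.exp(-eta * loss) for loss in cumulative_loss]
        weight_sum = sum(weights)
        forecasts.append(sum(w * p for w, p in zip(weights, preds)) / weight_sum)
        # Only now is the outcome revealed; charge each expert squared loss.
        for i, p in enumerate(preds):
            cumulative_loss[i] += (p - outcome) ** 2
    return forecasts


# Two toy experts: one always predicts 0, the other always predicts 1.
rounds = [[0.0, 1.0]] * 6
outcomes = [1, 1, 0, 1, 1, 1]
forecasts = exp_weighted_forecast(rounds, outcomes)
print(forecasts)
```

Notice that no probability model of the data appears anywhere. The forecaster starts out splitting the difference between the experts and drifts toward whichever one has been less wrong so far.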
Outside of using data, the two keys are prediction and decision theory. Prediction is a common topic in statistics, but that’s not what makes probability-free statistics statistics. It’s actually decision theory. Well, statistical decision theory, but it’s really just normal decision theory with a different emphasis. Most statistical methods can be justified using decision theory. Bayesians see decision theory as an essential component of statistics; some don’t think probability can even be defined apart from decision theory! But the statement is true for frequentists as well.
Prediction is a good example of this. You see a sequence of coin flips and have to predict the next flip. You choose between heads and tails. Which one should you choose? The one that maximizes utili... ahem. The one that minimizes loss. Or worst-case loss. Or expected loss. (Psst. Loss is the negative of utility.)
What about p-values? What does decision theory have to do with p-values? You have to choose whether or not to reject the null hypothesis! P-values are a component of the solution to that decision problem. Parameter estimation? Decision theory. Confidence intervals? Decision theory. Model selection? Decision theory. Picking the best way to display your data? Maybe that doesn’t seem like decision theory, but if you give me 10 minutes and a whiteboard… Ok, fine, probably not decision theory.
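The coin-flip example can be sketched as a tiny decision problem. This version assumes a zero-one loss (1 for a wrong guess, 0 for a right one) and estimates the chance of heads by the observed fraction of heads. Both choices, and the names `expected_loss` and `best_prediction`, are mine and are only there for illustration.

```python
def expected_loss(action, p_heads, loss):
    """Average loss of predicting `action` when heads has probability p_heads."""
    return p_heads * loss[(action, "H")] + (1 - p_heads) * loss[(action, "T")]


def best_prediction(flips, loss):
    """Pick the prediction ("H" or "T") with the smallest expected loss.

    The probability of heads is estimated by the observed fraction of
    heads, a plug-in assumption made for illustration.
    """
    p_heads = flips.count("H") / len(flips)
    return min(("H", "T"), key=lambda action: expected_loss(action, p_heads, loss))


# Zero-one loss: keys are (prediction, actual flip).
zero_one = {("H", "H"): 0, ("H", "T"): 1, ("T", "H"): 1, ("T", "T"): 0}

choice = best_prediction("HHTHHTHH", zero_one)
print(choice)
```

Swap in a different loss table, or replace expected loss with worst-case loss, and the same skeleton gives you a different answer to "which one should you choose?"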
Alright, decision theory is a big part of statistics. But is it an essential part? I’m not sure. This seems like the key idea we were missing when I brought up probability-free statistics, but it probably isn’t essential. I’m inclined to say that while neither probability theory nor decision theory is essential, at least one is required for something to be “statistics.” But I won’t say that, because there might be an example of something we call statistics, or that looks a lot like statistics, that doesn’t use probability theory or decision theory. I’m not going to throw my hands up and say that statistics is just what statisticians do, though. There may be edge cases, but there’s still a big cluster over there in idea space that needs a name. So, bottom line, what is statistics? I think Wikipedia was close but a little too inclusive: probability theory and decision theory are both important, if not essential. We can fix that right up though:
Statistics is the study of the collection, organization, analysis, interpretation and presentation of data, especially through the frameworks of probability theory and statistical decision theory.