Statistical Inference & “Running the Numbers”
I identify a lot with Frederik deBoer, who has a liberal arts background and is getting a crash course in quantitative methods as a grown-ass adult. He has a nice piece on “Big Data” and its broken promises. One of the most important problems he raises is the simple fact that basically anything is “statistically significant” in a large enough dataset, for which he offers a pretty good layman’s explanation:
Think of statistical significance like this. Suppose I came to you and claimed that I had found an unbalanced or trick quarter, one that was more likely to come up heads than tails. As proof, I tell you that I had flipped the quarter 15 times, and 10 times of those times, it had come up heads. Would you take that as acceptable proof? You would not; with that small number of trials, there is a relatively high probability that I would get that result simply through random chance. In fact, we could calculate that probability easily. But if I instead said to you that I had flipped the coin 15,000 times, and it had come up heads 10,000 times, you would accept my claim that the coin was weighted. Again, we could calculate the odds that this happened by random chance, which would be quite low– close to zero. This example shows that we have a somewhat intuitive understanding of what we mean by statistically significant. We call something significant if it has a low p-value, or the chance that a given quantitative result is the product of random error. (Remember, in statistics, error = inevitable, bias = really bad.) The p-value of the 15 trial example would be relatively high, too high to trust the result. The p-value of the 15,000 trial example, low enough to be treated as zero.
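The quote says we could calculate these probabilities easily, so here is a rough sketch of that calculation in Python. It is my own illustration, not deBoer's: it assumes the null hypothesis is a fair coin and uses a one-sided test, so the p-value is, strictly speaking, the probability of seeing at least this many heads if the coin were fair. The function names are made up for this example. Because the 15,000-flip answer is too small to fit in a floating-point number, the sketch reports the base-10 logarithm instead.

```python
import math

def upper_tail_count(n, k):
    """Exact number of n-flip sequences that contain at least k heads."""
    term = math.comb(n, k)  # sequences with exactly k heads
    total = 0
    for i in range(k, n + 1):
        total += term
        # C(n, i + 1) = C(n, i) * (n - i) / (i + 1), always an exact division
        term = term * (n - i) // (i + 1)
    return total

def log10_p_value(n, k):
    """log10 of P(at least k heads in n flips) for a fair coin."""
    # There are 2**n equally likely sequences; working in logs avoids underflow.
    return math.log10(upper_tail_count(n, k)) - n * math.log10(2)

small = log10_p_value(15, 10)
large = log10_p_value(15_000, 10_000)

print(f"15 flips, 10 heads: p = {10 ** small:.3f}")
print(f"15,000 flips, 10,000 heads: p is about 10^{large:.0f}")
```

The first result comes out at roughly 0.15, nowhere near the conventional 0.05 cutoff. The second is hundreds of orders of magnitude below it, which is what "low enough to be treated as zero" looks like in practice.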
With a big dataset, almost everything has a very low p-value. When you run the data through most predictive models and test every possible variable, you will find that every independent variable appears to affect the outcome, with a p-value low enough that you can rule out random chance. This is especially troublesome when your sample, or n, is extremely large yet represents only a small portion of the overall population. You will not face the problem that your results are driven by random variation in the sample, but you are very likely to face the problem that your sample isn’t representative. If there is some selection bias in which data points get sampled, you could be producing very poor models of reality. But judging whether your data is representative is hard, because it’s hard to make statements about data you don’t have. For really big datasets, making sure that your data is representative is a devilishly difficult and extremely important problem.
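To see how sample size alone can manufacture significance, here is a small sketch of my own. It assumes a coin that lands heads 50.1% of the time, a difference nobody would care about, and runs a two-sided z-test (normal approximation) against a fair coin at two sample sizes. The 50.1% figure and the function name are assumptions chosen for illustration.

```python
import math

def two_sided_p(observed_share, n, null_share=0.5):
    """Two-sided p-value for an observed proportion, normal approximation."""
    standard_error = math.sqrt(null_share * (1 - null_share) / n)
    z = (observed_share - null_share) / standard_error
    # For a standard normal Z, P(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

# Same tiny effect (50.1% heads), very different sample sizes.
p_small = two_sided_p(0.501, 1_000)
p_big = two_sided_p(0.501, 10_000_000)

print(f"n = 1,000:      p = {p_small:.2f}")
print(f"n = 10,000,000: p = {p_big:.1e}")
```

The effect is identical in both cases. Only n changed, and that is enough to move the result from "nothing to see here" to "overwhelmingly significant". A low p-value tells you an effect is probably not zero; it does not tell you the effect is big enough to matter.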
The promise of “Big Data” is that “n=All”, a common statement amongst its proponents. Yet this is rarely the case, often because of the data capture process and often because of computational practicality. The fanciest machine learning models in the world won’t help you if your data is of poor quality, and pretty simple methods like linear regression can be incredibly powerful when applied to the right data. Fancy machine learning techniques do have their uses, though. For example, regression offers little help with “feature selection”, or choosing which variables to use in constructing a predictive model. This matters a lot when you are working on a question without strong theoretical avenues of investigation.
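To make the point about data capture concrete, here is a toy simulation, entirely my own invention. It assumes a population of 200,000 values and a capture process that records above-average values five times as often as below-average ones, then compares that big biased sample with a small random sample of 1,000. Every number here is an assumption picked for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 200,000 values centered near 50.
population = [random.gauss(50, 10) for _ in range(200_000)]
true_mean = statistics.fmean(population)

def captured(value):
    """Hypothetical capture process that favors high values."""
    capture_probability = 0.5 if value > 50 else 0.1
    return random.random() < capture_probability

big_biased = [v for v in population if captured(v)]
small_random = random.sample(population, 1_000)

biased_error = abs(statistics.fmean(big_biased) - true_mean)
random_error = abs(statistics.fmean(small_random) - true_mean)

print(f"biased sample: n = {len(big_biased)}, error in mean = {biased_error:.2f}")
print(f"random sample: n = {len(small_random)}, error in mean = {random_error:.2f}")
```

The biased sample is dozens of times larger and still lands much further from the truth. More rows do nothing to fix a skewed capture process.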
The most important lesson I have learned from my intensive study of statistics is that certainty is elusive. People with no serious quantitative training (including myself nine months ago) imagine that statistical inference works like high-school math problems. You “run the numbers”, and an answer pops out. But this is almost never the case, and successful inference involves many layers of judgment in data gathering, data processing, and model-building. There are definitely wrong answers. For example, when faced with a problem that calls for a prediction of probability, you can’t assume linearity. If you do, some predicted values might be negative, and a negative probability is nonsense. But answers can’t be “right”, they can only be defensible. And beyond that, reasonable, intelligent people can disagree violently about which answers are defensible.
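The linearity trap is easy to show on a toy example. The sketch below uses eight made-up data points with a yes/no outcome, fits a straight line by ordinary least squares, and then fits a logistic curve to the same data. The logistic fit uses plain gradient ascent as a stand-in to keep the code dependency-free; it is not how R's glm or any other real library does it.

```python
import math

# Made-up data: a predictor x and a binary outcome y.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Straight-line fit (ordinary least squares, closed form).
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar
linear_preds = [intercept + slope * x for x in xs]

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1 / (1 + math.exp(-z))

# Logistic fit by gradient ascent on the log-likelihood (x is centered).
b0, b1 = 0.0, 0.0
rate = 0.1
for _ in range(5000):
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        error = y - sigmoid(b0 + b1 * (x - x_bar))
        g0 += error
        g1 += error * (x - x_bar)
    b0 += rate * g0 / n
    b1 += rate * g1 / n
logistic_preds = [sigmoid(b0 + b1 * (x - x_bar)) for x in xs]

print("linear:  ", [round(p, 2) for p in linear_preds])
print("logistic:", [round(p, 2) for p in logistic_preds])
```

The straight line predicts a "probability" below zero at x = 1 and above one at x = 8. The logistic curve cannot leave the range between 0 and 1 no matter what coefficients it ends up with, which is why it is a defensible choice here and the straight line is not.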
Learning statistics has been invigorating as I realize just how much is possible with a dataset and an old laptop, and humbling as I realize that statistical investigation is a lot more difficult than learning the commands in R.