I referred in my last post to a lost writing of mine on the subject of abuse of statistics in economics. I’ve sort of found it – I sent it as an email in response to this blog post by Noel Campbell at Division of Labour. (Read it – it’s short).
He quoted from my response, but I can’t find the actual email I sent him. I do have a draft of it, so it would have been very much like this:
That’s a superb question, and I think the answer will surprise (and disturb) many.
Your paper will include a calculation of significance. This is essentially an estimate of the probability that, if your theory were false, a correlation as strong as the one you found would still show up purely as a result of randomness in the data.
This calculation assumes the “proper” sequence of events. You have a theory, and you test the data for a correlation. Since you in fact poked around for correlations, then came up with a theory, the significance calculation is not valid. The true significance depends on the probability that, having found a randomly-caused correlation somewhere, you can then invent a theory to explain it. That probability is very difficult to estimate, but is probably much greater – meaning that the significance of the correlation is much smaller.
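The inflation is easy to demonstrate with a quick simulation (a sketch in Python; the number of observations, candidate variables, and trials are arbitrary choices of mine, not anything from the original argument). Scan a hundred unrelated variables for correlation with a random outcome, and you will almost always find one that clears the conventional 5% bar for a single test:

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_vars, n_trials = 30, 100, 500

# Empirical 5% critical value of |r| for a SINGLE test under pure noise
null_r = np.abs([np.corrcoef(rng.standard_normal(n_obs),
                             rng.standard_normal(n_obs))[0, 1]
                 for _ in range(10_000)])
crit = np.quantile(null_r, 0.95)

# Now "poke around": correlate a random outcome against 100 random
# variables and keep the strongest hit, as a data-dredger would.
hits = 0
for _ in range(n_trials):
    y = rng.standard_normal(n_obs)
    X = rng.standard_normal((n_vars, n_obs))
    best = np.abs([np.corrcoef(x, y)[0, 1] for x in X]).max()
    hits += best > crit

print(f"single-test false positive rate: 0.05 (by construction)")
print(f"scan-then-theorize false positive rate: {hits / n_trials:.2f}")
```

The nominal significance calculation would report each winning correlation as significant at the 5% level, yet nearly every run of the scan produces such a "finding" from pure noise – which is exactly the point about the order of operations.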
It is very counterintuitive that the order of your actions affects the validity of your findings, and indeed it is a close relative of the famous Monty Hall problem – the poster child for counterintuitive probability. When you reveal the correlation that you already knew of, you are revealing no information about the chance of your theory being correct, much as when the quizmaster opens the door that he already knows doesn’t have the car, he reveals no information about the chance that the door you first picked has the car. Conversely, if you open a door without knowing what is behind it and find that it doesn’t have the car, that does change the probability that your first door had it; likewise, if you had no prior knowledge of the data, the correlation does change the probability of your theory being true.
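For readers who want to convince themselves of the Monty Hall result itself, a minimal simulation settles it (a sketch in Python; the trial count is an arbitrary assumption): always switching wins roughly twice as often as always staying.

```python
import random

random.seed(1)

def play(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        car = random.randrange(3)   # door hiding the car
        pick = random.randrange(3)  # contestant's first choice
        # Host opens a door that is neither the pick nor the car
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:
            # Move to the one remaining unopened door
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += pick == car
    return wins / trials

stay, switch = play(False), play(True)
print(f"stay: {stay:.3f}, switch: {switch:.3f}")
```

The staying strategy wins about a third of the time and switching about two thirds – the host's reveal carried information precisely because he knew where the car was.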
Back to science. As you say, theories aren’t formed in a vacuum, and so there is not such a clear division between the “right” way of doing it and the “wrong” way of doing it. Nobody is completely ignorant of the data when they start to theorize. That is a real problem with nearly all statistics-based results that are published today. They are all presented with significance calculations based on the assumption that the forming of the theory was independent of the data – an assumption that is very unlikely to be completely true. Therefore nearly every published significance figure is an overestimate.
This was much less of a problem when collecting data and analysing it was difficult and laborious. Now that large data sets fly around the internet, and every researcher has the capability of running analyses at the click of a mouse, it is a problem that has already got out of hand.
I didn’t want to be rude at the time, but I found Campbell’s response shocking. He seemed to fully accept my argument, but wasn’t bothered by the implication that pretty much all published research relying on analysing pre-existing statistics was wrong. Rather, his conclusion was that since everybody else was doing what he was doing, nobody should complain and demand “purity” (his scare quotes). That came to mind particularly reading Bruce Charlton’s discussion of the state of honesty in science.