Statistics is not boring. Really! It can be (and often is) *made*
boring by bad teaching, but the underlying subject is fascinating --
drawing valid conclusions from seemingly noisy or random observations,
by a process that can look like magic to the uninitiated. And like any
applied mathematical technique, there's much more to it than just the
mathematics itself. The correct application, and explanation of that
application, are equally important to get right. And so here
Abelson describes his MAGIC approach to principled statistical
arguments: the researcher needs to consider **M**agnitude, **A**rticulation,
**G**enerality, **I**nterestingness, and **C**redibility.

**Magnitude**

Statistical significance is not enough. It merely tells you whether your observations could have arisen by chance, or, as Abelson delightfully puts it:

p39.
The particular level (*p* value)
at which the null hypothesis is rejected ... is often used as a gauge
of the degree of contempt in which the null hypothesis deserved to be
held.

But as he points out, this isn't very helpful, since *p*
depends on the sample size: for any given non-zero effect, the bigger the
sample, the more "statistically significant" the result. (This is
particularly a problem in areas like computer science, where doing a few
thousand, or a few million, runs might be quite cheap, and ridiculously
small *p* values can be obtained.) So it is good practice also to quote
the effect size: not
only show your observation is not due to chance, but show that the
difference is big enough to get excited about.
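
To see how sample size alone drives *p*, here is a minimal sketch (not from Abelson's book) that holds a tiny standardized effect fixed and varies only the number of runs. It assumes a two-group *z* test with known unit variance, and that the observed difference equals the assumed effect exactly; the function name `two_sided_p` and the effect size of 0.05 are illustrative choices.

```python
from math import erfc, sqrt


def two_sided_p(effect_size: float, n_per_group: int) -> float:
    """Two-sided p value for a two-group z test (assumed known unit variance)
    when the observed standardized mean difference is exactly `effect_size`."""
    # Standard error of a difference in means with sigma = 1 is sqrt(2 / n).
    z = effect_size * sqrt(n_per_group / 2)
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) and stays accurate for large z.
    return erfc(abs(z) / sqrt(2))


# Same (tiny) effect size each time; only the sample size changes.
for n in (100, 10_000, 1_000_000):
    print(f"n per group = {n:>9,}  p = {two_sided_p(0.05, n):.3g}")
```

Under these assumptions the same negligible effect is nowhere near significant at 100 runs per group and overwhelmingly significant at a million, which is why the effect size needs quoting alongside *p*.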

However, even having a significant *p* value might not mean
what you think it means. Abelson describes *The Replication Fallacy*,
which is:

p75.
... an overconfidence in the
repeatability of statistically significant results. The following
thought experiment may help to correct the fallacy. Imagine an
experimenter who has run a two-group study, and has found by *t*
test the result *p* = .05. What is the chance that if she
exactly repeated the study with a new sample of subjects (and the same
*n* per group) that she would again get a significant result at
the .05 level? ...

Half the time, the observed effect size from the second study ought to be bigger than that of the first study, and half the time, smaller. Because the first observed effect size was just big enough to obtain a *p* value of .05, anything
smaller would yield a nonsignificant *p* > .05. This analysis
thus yields an expected repeatability of 50-50, much lower than the
usual intuition.
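
The 50-50 figure can be checked with a small simulation. This is a sketch under simplifying assumptions that are not in the book: the true effect is taken to be exactly the effect that was just big enough to give *p* = .05 in the first study, the variance is known (so a *z* test stands in for the *t* test), and only a significant result in the same direction counts as a replication. The function name `replication_rate` and its default parameters are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def replication_rate(n_per_group: int = 50, trials: int = 4000,
                     alpha: float = 0.05, seed: int = 0) -> float:
    """Fraction of simulated repeat studies that are again significant."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = .05
    se = sqrt(2 / n_per_group)                    # std. error of the mean difference, sigma = 1
    # Assumption: the true difference equals the first study's borderline observed difference.
    true_diff = z_crit * se
    hits = 0
    for _ in range(trials):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(true_diff, 1.0) for _ in range(n_per_group)]
        z = (fmean(treated) - fmean(control)) / se
        if z > z_crit:                            # significant, same direction as before
            hits += 1
    return hits / trials


print(replication_rate())
```

The printed fraction should come out close to one half, in line with the argument in the quotation.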

**Articulation**

Articulation is about reporting the results clearly, without getting lost in irrelevant and uninteresting details (but without fudging the important but inconvenient details). This can be difficult if the results are inconclusive or borderline. Abelson makes the very important, but often forgotten point that:

p105.
If a null hypothesis survives a
significance test ... this outcome does not mean that a value of zero
can be assigned to the true comparative difference; it merely
signifies that we are not confident of the direction of the (somewhat
small) true difference.

Even if the null hypothesis is rejected, it can still be hard to articulate the results. Abelson gives advice on "ticks and buts" that helps expose the interesting and relevant results.

**Generality**

All results are in some sense specific to the particular experiments run, but they are only really useful if they can be interpreted more generally. If every experiment can say only what those people did, on that day, under those circumstances, it's not very useful: we want to know what people in general will do, or even why they do it.

This might require doing more experiments, particularly to test a theory. If your theory explains why something happens, you should be able to construct a situation in which it won't:

p143.
"You never understand a phenomenon
unless you can make it go away" ... "or unless you can
reverse its direction"

This fits in with the overall approach: statistical analysis isn't about a single isolated experiment; it's about a series of experiments advancing and changing the understanding, contributing to the lore.

**Interestingness**

Moreover, results should be interesting: they should change the way people think about the subject, they should be "surprising". What makes something interesting?

p160.
when an unambiguous prediction of a folk
theory or a scientific theory is generally believed, it is more
interesting to cast doubt on it than to provide evidence strengthening
it

One needs to be careful here, however. There are folk theories that are "generally
believed" but for which there is precious little evidence. The
move towards evidence-based medicine is based on this observation: there
are things everybody knows that just ain't so. In these cases it is
important to gather the evidence. I agree that evidence that disproves
such folk theories is more *interesting*, but in some cases, it
might not be more *important*. Nevertheless, in general this is
good advice, and Abelson suggests a "surprisingness coefficient":
how different the observed result is from what you expect it to be. If
you expect the null hypothesis to hold (which, let's face it, you
rarely do), then the surprisingness is the same as the effect size.
But if you expect a big effect, and find it, that's not very
surprising. If you expect an effect in one direction and find it in
the other, that's the best of all!

**Credibility**

Finally, we come to credibility. If your results beggar belief, then your methodology is going to be attacked. You can counter this by guarding against the well-known problems (which grow in number as the lore progresses), but eventually, if your result is just too unbelievable, you are probably going to have to come up with a corresponding theory that is better than the prevailing one. Anomalous results are not enough to overturn the prevailing view: you have to replace it with something better.

As well as the MAGIC chapters, there is a lovely chapter entitled "On Suspecting Fishiness" (who could resist buying such a book?), which highlights some things to look out for that could indicate error, or worse, fraud. It includes an interesting little tale about Mendel and his pea plants. We've all been brought up on the story that Mendel cheated, and massaged his results to get the answer he wanted, caught out by clever statisticians who showed his answers were just too good to be true. Well, maybe the story isn't as clear. Here Abelson recounts a different analysis that shows Mendel might have got his results by using a dodgy statistical procedure: not such a heinous crime after all, particularly as statistics wasn't the well-developed subject in the mid-1800s that it is today.

This is not an introductory text: it assumes you know about *t*
tests and *p* values, etc. But it is also not particularly
mathematical -- it is simply full of good, clear-headed advice, wisdom
even, on doing statistics *properly*. Although it is written
from the perspective of a psychologist analysing experiments done in
noisy environments on a small number of subjects, its advice is a
must-read for *anyone* using statistics seriously.