# Confounded paradox

Probabilities are notoriously slippery things to deal with, so it shouldn’t be surprising that *proportions*, which are really probabilities in disguise, can catch us out too.

**Simpson’s paradox** is my favourite example of something simple, something we know we understand, indeed have always understood, suddenly turning on us.

Exploration geophysicists often use information extracted from seismic data, called *attributes*, to help predict rock properties in the subsurface. Suppose you are a geophysicist comparing two new seismic attributes, *truth* and *beauty*, each purported to predict fluid type. You compare their hydrocarbon-predicting success rates on 35 discoveries and it’s close, but *beauty* has an 83% hit rate, while *truth* manages only 77%. There's not much in it, but since you only need one attribute, all else being equal, *beauty* it is.

But then someone asks you about predicting oil in particular. You dig out your data and drill down:

Apparently, *truth* did a little better when you just look at oil. And what about gas, they ask? Well, the data showed that *truth* was also better than *beauty* at predicting gas. So *truth* does a better job at both oil and gas, but somehow *beauty* edges out overall.

Impossible? Clearly not: these numbers are real and plausible, and I haven't done anything sneaky. In this case, hydrocarbon type is a **confounding variable**, and it’s important to look for such groupings in your data. Improbable? No, it’s quite common in all kinds of data, and this trap is well known among statisticians.
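To make the arithmetic concrete, here is a minimal Python sketch. The per-fluid counts in it are hypothetical, chosen only so that each attribute's total matches the 77% and 83% hit rates on 35 discoveries; they are not the original data, and the names `counts`, `hit_rate` and `overall` are made up for the illustration.

```python
# Hypothetical (hits, trials) counts per fluid type. The oil/gas split is
# invented for illustration; only the 35-discovery totals come from the text.
counts = {
    "truth":  {"oil": (9, 10),  "gas": (18, 25)},
    "beauty": {"oil": (26, 30), "gas": (3, 5)},
}


def hit_rate(hits, trials):
    """Proportion of successful predictions in one group."""
    return hits / trials


def overall(groups):
    """Pooled hit rate: total hits over total trials across all groups."""
    hits = sum(h for h, _ in groups.values())
    trials = sum(n for _, n in groups.values())
    return hits / trials


for attribute, groups in counts.items():
    per_group = ", ".join(
        f"{fluid} {hit_rate(*c):.0%}" for fluid, c in groups.items()
    )
    print(f"{attribute}: {per_group}, overall {overall(groups):.0%}")
```

With these made-up counts, *truth* scores 90% on oil and 72% on gas, against 87% and 60% for *beauty*, yet the pooled rates come out at 77% and 83% respectively.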

How can you avoid it? Be especially wary when the sample size in one or more of the groups you are interested in is much smaller than the others. Be even more alert if group sizes are inconsistent across the variables, as in my example: oil is under-sampled for *truth*, gas for *beauty*.
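One way to see why inconsistent group sizes matter: a pooled hit rate is just a weighted average of the per-group rates, with each group weighted by its share of the trials. The sketch below assumes hypothetical per-fluid counts (not the original data, though the totals match the 35 discoveries), and the equal weighting at the end is only one illustrative choice, not a recommendation.

```python
# Hypothetical (hits, trials) counts per fluid type; the split is invented.
counts = {
    "truth":  {"oil": (9, 10),  "gas": (18, 25)},
    "beauty": {"oil": (26, 30), "gas": (3, 5)},
}


def group_weights(groups):
    """Share of an attribute's trials that falls in each group."""
    total = sum(n for _, n in groups.values())
    return {g: n / total for g, (_, n) in groups.items()}


def weighted_rate(groups, weights):
    """Hit rate as a weighted average of the per-group hit rates."""
    return sum(weights[g] * h / n for g, (h, n) in groups.items())


# Each attribute's own weights reproduce its pooled hit rate.
for name, groups in counts.items():
    w = group_weights(groups)
    shares = {g: round(x, 2) for g, x in w.items()}
    print(name, "weights", shares, "pooled", round(weighted_rate(groups, w), 2))

# Applying the same weights to both attributes removes the size imbalance.
equal = {"oil": 0.5, "gas": 0.5}
for name, groups in counts.items():
    print(name, "equal-weighted", round(weighted_rate(groups, equal), 2))
```

In this made-up case, *truth* puts about 71% of its weight on gas, its weaker group, while *beauty* puts about 86% on oil, its stronger one. Give both attributes the same weights and *truth* comes out ahead, which agrees with the per-group comparison.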

Ultimately, there's no guarantee this effect won’t crop up; that’s just how proportions are. All you can do is make sure you ask your data the questions you care about.

*This post is a version of part of my article The rational geoscientist, The Leading Edge, May 2010*