Wednesday, Sep 19, 2012

Cross plots: a non-answer

On Monday I asked whether we should make crossplots according to statistical rules or natural rules. There was some fun discussion, and some awesome computation from Henry Herrera, and a couple of gems:

Physics likes math, but math doesn't care about physics — @jeffersonite

But... when I consider the intercept point I cannot possibly imagine a rock that has high porosity and zero impedance — Matteo Niccoli, aka @My_Carta

I tried asking on Stack Overflow once, but didn't really get to the bottom of it, or perhaps I just wasn't convinced. The consensus seems to be that the statistical answer is to put porosity on the y-axis, because that way you minimize the prediction error on porosity. But I feel (and this is just my flaky intuition talking) that this fails to represent nature, whatever that means, and so maybe that error reduction is spurious somehow.

Reversing the plot to what I think of as the natural, causation-respecting plot may not be that unreasonable. It's effectively the same as reducing the error on what was x (that is, impedance), instead of y. Since impedance is our measured data, we could say this regression respects the measured data more than the statistical, non-causation-respecting plot.

So must we choose between minimizing the error on the prediction and minimizing the error on the predictor? Let's see. In the plot on the right, I used the two methods to predict porosity at the red points from the blue. That is, I did the regression on the blue points; the red points are my blind data (new wells, perhaps). Surprisingly, the statistical method gives an RMS error of 0.034, and the natural method 0.023. So my intuition is vindicated!

Unfortunately if I reverse the datasets and instead model the red points, then predict the blue, the effect is also reversed: the statistical method does better with 0.029 instead of 0.034. So my intuition is wounded once more, and limps off for an early bath.
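The blind test described above can be sketched in a few lines of Python. The data here are synthetic stand-ins, not the points in the plot, so the RMS values will not match the ones quoted. The function names, the random seed, and the linear trend with Gaussian noise are all assumptions made for illustration.

```python
import numpy as np

def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept

def predict_statistical(imp_train, por_train, imp_new):
    """Regress porosity on impedance, then predict porosity directly."""
    slope, intercept = fit_line(imp_train, por_train)
    return slope * imp_new + intercept

def predict_natural(imp_train, por_train, imp_new):
    """Regress impedance on porosity, then invert the line for porosity."""
    slope, intercept = fit_line(por_train, imp_train)
    return (imp_new - intercept) / slope

def rms_error(predicted, actual):
    """Root-mean-square difference between predicted and actual values."""
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

# Synthetic data (an assumption): impedance falls linearly with porosity,
# with Gaussian scatter added to the impedance.
rng = np.random.default_rng(42)
n = 60
porosity = rng.uniform(0.05, 0.30, n)
impedance = 12000.0 - 20000.0 * porosity + rng.normal(0.0, 800.0, n)

# Hypothetical split into "blue" (model) and "red" (blind) points.
blue = slice(0, n // 2)
red = slice(n // 2, n)

def blind_test(train, test):
    """Fit both ways on `train`; return (statistical, natural) RMS on `test`."""
    stat = predict_statistical(impedance[train], porosity[train], impedance[test])
    nat = predict_natural(impedance[train], porosity[train], impedance[test])
    return rms_error(stat, porosity[test]), rms_error(nat, porosity[test])

for label, train, test in [("blue -> red", blue, red), ("red -> blue", red, blue)]:
    stat_rms, nat_rms = blind_test(train, test)
    print(f"{label}: statistical {stat_rms:.4f}, natural {nat_rms:.4f}")
```

On the points used for the fit, the statistical line always has the smaller porosity error, by construction. On blind points either method can come out ahead, depending on how the split happens to fall.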

Irreducible error?

Here's what I think: there's an irreducible error of prediction. We can beg, borrow or steal error from one variable, but then it goes on the other. It's reminiscent of Heisenberg's uncertainty principle, but in this case, we can't have arbitrarily precise forecasts from imperfectly correlated data. So what can we do? Pick a method, justify it to yourself, test your assumptions, and then be consistent. And report your errors at every step. 
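For the straight-line case, the trade-off can be written down exactly. This is a sketch under stated assumptions: the symbols are notation introduced here, with $\phi$ for porosity, $Z$ for impedance, $r$ for their sample correlation, and $s_\phi$, $s_Z$ for their standard deviations. The RMS values are porosity errors on the same points used for the fit, computed with the same normalization as the standard deviations.

```latex
% Statistical: regress porosity on impedance
\hat{\phi}_{\text{stat}} = \bar{\phi} + r\,\frac{s_\phi}{s_Z}\,(Z - \bar{Z}),
\qquad \mathrm{RMS}_{\text{stat}} = s_\phi \sqrt{1 - r^2}

% Natural: regress impedance on porosity, then invert for porosity
\hat{\phi}_{\text{nat}} = \bar{\phi} + \frac{1}{r}\,\frac{s_\phi}{s_Z}\,(Z - \bar{Z}),
\qquad \mathrm{RMS}_{\text{nat}} = \frac{s_\phi \sqrt{1 - r^2}}{|r|}
```

The two lines coincide only when $|r| = 1$. For anything less than perfect correlation, neither error can reach zero, and the slopes differ by a factor of $r^2$. These are results for the fitted points only; on blind data either line can win, as the numbers above show.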

I'm reminded of the adage 'Correlation does not equal causation.' Indeed. And, to borrow @jeffersonite's phrase, it seems correlation also does not care about causation.


Reader Comments (6)

I'm not sure if you saw my comment on LinkedIn, so I'll repeat it here. I think that the problem arises because you don't specify the errors in the data. In general, both the porosity and the impedance may have errors associated with them. For a treatment of the problem of fitting the "best" straight line when both variables have errors, see E. T. Jaynes (1999), "Straight line fitting - a Bayesian solution" (PDF).

September 20, 2012 | Unregistered Commenter Colin Sayers

@Colin: Thanks for the link, Colin. Once again, frequentist statistics fail us it seems. I must admit, I find statistical notation a bit of a bugger to read, but I think I get the gist. Not entirely sure what constitutes the prior... And it would be nice to see an example with real data, showing the superiority of the method. Maybe I'll try to build one.

The idea of rotating the reference frame crops up again here. It all makes me wonder why linear regression is still the pre-eminent prediction tool in petroleum geoscience, and why few of these alternatives seem to have been implemented in real software that people actually use.

September 20, 2012 | Registered Commenter Matt Hall

Hi Matt, interesting couple of posts! Sure gets you thinking!
As for your comment above about how all of this makes you wonder why linear regression is still the pre-eminent prediction tool, I totally agree. Surely we can do better. Maybe we could try fitting some of the rock physics models (e.g. the constant cement model, friable sand models, etc.) to the data. We could perturb some of the parameters within the models and still fit them in a least-squares sense. That way we are honoring physical constraints, such as that at zero porosity the impedance must equal that of the mineral grain, and at porosities above critical porosity it must equal that of a suspension (i.e. the Reuss/Woods bound). This should provide better predictions outside the measured data ranges and, if we believe the model, will also give us extra information such as the amount of contact cement.

September 24, 2012 | Unregistered Commenter MattS

@MattS: A great point. And interestingly, I note as I flick through Avseth et al (2006, Cambridge), almost all rock physics model plots are made in the 'natural' way, e.g. velocity vs porosity.

September 24, 2012 | Registered Commenter Matt Hall

Very interesting post. It got me thinking: when we do multi-attribute analysis, we make use of such cross plots (internally in the software somehow), and it might not be the right way to do it.

I also found the idea of "false correlation" interesting (i.e. the cross plot of two independent, or unrelated, variables). Take this for example: the cross plot of average global temperature (global warming) vs. number of pirates! The number of pirates has decreased over time and the average temperature has increased over time, so the cross plot showed that average global temperature falls as the number of pirates rises. The poster concluded that in order to combat global warming, we should add more pirates to the seas!

Here is a link to the graph:
http://blog.lib.umn.edu/farre212/f11psy1001ds1415/Correlation%20vs.%20Causation.jpg

November 28, 2012 | Unregistered Commenter Fereidoon

Matt, this is my way, using Excel.

Por vs Imp:
Sort Por (column a) and Imp (column b) in ascending order based on Por. Copy Imp (column b) to column c. Sort column c only, in descending order, to get sorted Imp; name it Imp_reg (as in impedance regression). Plot columns a, b and c. It looks like Por vs Imp_reg is linear!! On the graph, insert a linear regression line on the data from column c. Get the equation to calculate porosity.

Imp vs Por:
Similar to above, but x is Imp and y is Por, so this time Imp is in column a and Por in column b.
Sort Imp (column a) and Por (column b) in ascending order based on Imp. Copy Por (column b) to column c. Sort column c only, in descending order, to get sorted Por; name it Por_reg (as in porosity regression). Plot columns a, b and c. It looks like Imp vs Por_reg is linear!! On the graph, insert a linear regression line on the data from column c. Get the equation to calculate porosity.

Those 2 equations gave similar porosity results ...

Email me at izmanhamid@gmail.com if you have any questions.

January 29, 2013 | Unregistered Commenter Izman
