Monday, Sep 17, 2012

## Cross plot or plot cross?

I am stumped. About once a year, for the last nine years or so, I have failed to figure this out.

What could be simpler than predicting porosity from acoustic impedance? Well, lots of things, but let’s pretend for a minute that it’s easy. Here’s what you do:

1.   Measure impedance at a bunch of wells
2.   Measure the porosity — at seismic scale of course — at those wells
3.   Make a crossplot with porosity on the y-axis and impedance on the x-axis
4.   Plot the data points and plot the regression line (let’s keep it linear)
5.   Find the equation of the line, which is of the form y = ax + b, or porosity = gradient × impedance + constant
6.   Apply the equation to a map (or volume, if you like) of impedance, and Bob's your uncle.
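
The steps above can be sketched in a few lines of Python. This is only a minimal illustration: the well values are made up (they are not the Mavko et al. data), and `fit_line` and `impedance_map` are hypothetical names introduced here.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = a*x + b. Returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

# Steps 1-2: made-up measurements at five wells (illustrative only)
impedance = [6.0, 7.0, 8.0, 9.0, 10.0]      # arbitrary units
porosity = [0.28, 0.27, 0.21, 0.20, 0.14]   # fraction

# Steps 3-5: porosity on the y-axis, impedance on the x-axis
gradient, constant = fit_line(impedance, porosity)

# Step 6: apply the equation to a (hypothetical) map of impedance values
impedance_map = [6.5, 8.5, 9.5]
porosity_map = [gradient * z + constant for z in impedance_map]
print(gradient, constant, porosity_map)
```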

Easy!

But, wait a minute. Is Bob your uncle after all? The parameter on the y-axis is also called the dependent variable, and that on the x-axis the independent. In other words, the crossplot represents a relationship of dependency, or causation. Well, porosity certainly does not depend on impedance — it’s the other way around. To put it another way, impedance is not the cause of porosity. So the natural relationship should put impedance, not porosity, on the y-axis. Right?

Therefore we should change some steps:

3.   Make a crossplot with impedance on the y-axis and porosity on the x-axis
4.   Plot the data points and plot the regression line
5a. Find the equation of the line, which is of the form y = ax + b, or impedance = gradient × porosity + constant
5b. Rearrange the equation for what we really want: porosity = (impedance - constant) / gradient

Not quite as easy! But still easy.

More importantly, this gives a different answer. Bob is not your uncle after all. Bob is your aunt. To be clear: you will compute different porosities with these two approaches. So then we have to ask: which is correct? Or rather, since neither is going to give us the ‘correct’ porosity, which is better? Which is more physical? Do we care about physicality?
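
To see the two approaches disagree, here is a small sketch on made-up data (again, not the digitized figure D-8 values). The helper `fit_line` and the test impedance `z` are assumptions for illustration.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = a*x + b. Returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    a = sxy / sxx
    return a, mean_y - a * mean_x

# Made-up well data, for illustration only
impedance = [6.0, 7.0, 8.0, 9.0, 10.0]
porosity = [0.28, 0.27, 0.21, 0.20, 0.14]

# Approach 1: regress porosity on impedance and use the line directly
a1, b1 = fit_line(impedance, porosity)

# Approach 2: regress impedance on porosity, then rearrange for porosity
a2, b2 = fit_line(porosity, impedance)

z = 7.5  # an impedance value we want a porosity for
por_direct = a1 * z + b1
por_inverted = (z - b2) / a2
print(por_direct, por_inverted)
```

On this toy data the two predictions differ in the third decimal place; the gap grows as the scatter in the crossplot grows.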

I genuinely do not know the answer to this question. Do you?

If you're interested in playing with this problem, the data I used are from Imaging reservoir quality seismic signatures of geologic effects, report number DE-FC26-04NT15506 for the US Department of Energy by Gary Mavko et al. at Stanford University. I digitized their figure D-8; you can download the data as a CSV here. I have only plotted half of the data points, so I can use the rest as a blind test.

"Cross plot" is what you want I think. The estimated coefficients from simple linear regression provide an "unbiased" estimate for the dependent variable for any value of the independent one. So we want the dependent variable to be the one we are trying to predict. The estimated coefficients produce the minimum value for the sum of the squared residuals. That is to say the error is a minimum. The error is measured in units of the dependent variable. It is not a function of the independent variable. This is why trending A vs B gives a different answer to B vs A. This is also why rearranging the B vs A equation to predict A from B is not the right thing to do. These resulting coefficients are not unbiased.
At least that is my understanding...
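
As a rough check of the point about residuals, the sketch below compares the sum of squared porosity errors for the direct line and the rearranged line. The data are made up for illustration, and `fit_line` and `sse` are hypothetical helpers, not anything from the post.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = a*x + b. Returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    a = sxy / sxx
    return a, mean_y - a * mean_x

def sse(observed, predicted):
    """Sum of squared errors between observed and predicted values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))

# Made-up well data, for illustration only
impedance = [6.0, 7.0, 8.0, 9.0, 10.0]
porosity = [0.28, 0.27, 0.21, 0.20, 0.14]

a1, b1 = fit_line(impedance, porosity)   # porosity regressed on impedance
a2, b2 = fit_line(porosity, impedance)   # impedance regressed on porosity

por_direct = [a1 * z + b1 for z in impedance]
por_inverted = [(z - b2) / a2 for z in impedance]  # rearranged line

sse_direct = sse(porosity, por_direct)
sse_inverted = sse(porosity, por_inverted)
print(sse_direct, sse_inverted)
```

On the data used for the fit, the direct line cannot have a larger porosity error than the rearranged one, because least squares minimizes exactly that quantity. This says nothing about held-back data.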

September 17, 2012 | TT

Hello Matt,
I think that I am missing something in your question. I can't see any difference in your analyses, just the residuals.

http://www.ualberta.ca/~rhherrer/crossplot/
Cheers,
Henry

September 18, 2012 | Henry

G'day Henry
I had a quick look at your implementation. I think your "Porosity Calculated" variable might need to be this:
por_calc = (impedance - p2(2))/p2(1);
which would give a different answer (with higher residuals) I expect...

Cheers

September 18, 2012 | TT

@kwinkunks IMHO, in Y=MX+B, X is known input and Y is the predicted output... Physics likes about math, but math doesn't care about physics.— @jeffersonite September 17, 2012

September 18, 2012 | Matt Hall

@Henry: Wow, I love what you've done. Very cool. It's more or less the process I went through, except that I held some data back. If I follow everything, then I think Thomas is onto something — since we want to plug in impedance values, and get back an estimate of porosity, I think your equation for por_calc needs impedance, not imp_est. Does that sound right? I will put my results up later.

@Thomas: As you say, the regression process minimizes the error on y. So I think the error for the data we use in the regression has a smaller residual on y than on x (though I'm not certain it necessarily does; perhaps this is at the core of the problem). But I suppose what I'm really interested in is, "what about data we don't have yet?" In other words, how is our prediction of the future? My gut feeling is that, since this is supposedly a random sample, there's an assumption of representativeness, so minimizing the error on y in our data is the same as minimizing the global error on y... but the poetic side of me wants some physical justice in all of this. But I don't think I'm going to get it...

September 18, 2012 | Matt Hall

It's been a long time since I did "sadist-ics", and I don't really have to do it at work these decades. (People look "funny" at me when I include estimates of errors in answering questions at work. Engineers don't admit to uncertainty of measurement, at least not to us geologists.)
I think your problem implies that you've got considerably different degrees of uncertainty (or variance, SD, whatever measure you want) on one variable compared to the other. (Relative uncertainty, not numeric values due to using different units. If in doubt, normalize everything to "z" measures to hide the different units. But that shouldn't be necessary outside the classroom.) In theory, your two different regression equations should be inverses of each other. Take one equation as y=a.x+c and flip it around to give x=(y-c)/a or x=y/a -c/a. So your two coefficients should be reciprocals of each other. They're not. (1/-0.0266 = -37.59 ; but your second equation has a coefficient of -32).
What to do about it ... well, firstly get your story straight on what the workflow is for you. That will determine which is the independent variable in your analysis, and which is the dependent variable. Secondly, quote your errors. Engineers (reservoir engineers particularly) don't like working with uncertainties, but they reluctantly admit to them being real things.
Umm, and for me, that would be it. I'm not a sadist-ics-ist, but I can play one well enough for the real world and far more realistically than on TV. By the time *my* audience starts asking questions about the presence of error bars, they've already forgotten about the more important parts of my analysis. Goldfish syndrome.
Your audience may be more sophisticated. Your mileage may vary a lot!

September 18, 2012 | Aidan Karley

Hi Matt

You guys are always full of good surprises and this post is stirring an interesting discussion.

I like your question on the physical meaning. I have occasionally seen a paper with a crossplot like the one in your first figure and wondered why the author(s) did it, but I did not reflect on the implications as you did, so thank you for the opportunity to do it.

I agree with Aidan's remark that in theory the two regressions should be the inverse of each other. But even if they were, to me one has a physical meaning and the other does not. I could try for a second to think of impedance as the independent variable. But even then, when I consider the intercept point, I cannot possibly imagine a rock that has high porosity and zero impedance. Conversely, when I think about the other intercept point, I can more easily imagine a rock that has high impedance and (virtually) zero porosity. Does that make sense?

September 18, 2012 | Matteo

@Aidan: Thanks for the remarks about uncertainty. I do think that there's a story here about measurement magnitudes and uncertainty. I mean, the model is just a model; the rocks actually have these other values, the ones with 'error'. So the errors are part of nature, and perhaps we can explain or understand them, rather than trying to eliminate them. I guess this is what stochastic simulation tries to address. Engineers seem to get along okay with those methods, at least, I think?

@Matteo: I love your observation about the intercepts. Is that another reality check, perhaps? One can say, "it doesn't matter, these are mathematical contrivances," but that feels hollow. The impedance-vs-porosity graph tells a story about the earth.

Another point that occurs to me, as we think about the physical meaning of statistics, is something about confounding variables. The authors of the paper the data come from point out that there are multiple lithologies here. So we could consider different populations — low impedance and high impedance. But once you start splitting, where do you stop?

September 18, 2012 | Matt Hall

Hi Matt,
Interesting discussion here. TT is right with the eq por_calc = (impedance - p2(2))/p2(1). Now I get different residuals.

@Aidan the only way to have this inversion working (y=a.x+c and flip it around to give x=(y-c)/a or x=y/a -c/a) is under perfect correlation and in this case Rsquare is 0.59.
The standardization idea is the alternative at the cost of the physical meaning. See the last two images:
http://www.ualberta.ca/~rhherrer/crossplot/
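
The point about perfect correlation follows from the standard least-squares slope formulas. In the sketch below, the symbols are assumptions introduced here: $s_x$ and $s_y$ are the sample standard deviations of the two variables, $r$ is their correlation coefficient, $a_{y|x}$ is the slope from regressing $y$ on $x$, and $a_{x|y}$ is the slope from regressing $x$ on $y$.

```latex
\begin{aligned}
a_{y|x} &= r\,\frac{s_y}{s_x}, \qquad a_{x|y} = r\,\frac{s_x}{s_y} \\[4pt]
a_{y|x}\,a_{x|y} &= r^2 \\[4pt]
a_{x|y} = \frac{1}{a_{y|x}} &\iff r^2 = 1
\end{aligned}
```

So the two slopes are reciprocals only when $r^2 = 1$. After standardizing both variables to z-scores, $s_x = s_y = 1$ and both slopes equal $r$, which is why the standardized plots look symmetric but lose their physical units.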

September 18, 2012 | Henry

There are different ways to do regression. The most popular is least squares (because it is easy to implement in Excel). The main problem with this is that it minimizes the error on the y-axis and assumes the x-axis values don't have any error. I think the best way is to use reduced major axis (RMA) regression, which minimizes errors on both the x-axis and the y-axis. For more information check the following websites:

http://www.spec2000.net/06-statistics.htm

http://193.146.160.29/gtb/sod/usu/\$UBUG/repositorio/10301409_Smith.pdf
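
For anyone who wants to try RMA, here is a minimal sketch using the common definition in which the slope is the ratio of the standard deviations, signed by the correlation. The function name `rma_fit` and the data are made up for illustration.

```python
import math

def rma_fit(x, y):
    """Reduced major axis fit of y = a*x + b. Returns (a, b).

    The slope is sign(r) * s_y / s_x, so swapping x and y gives
    exactly the reciprocal slope, unlike ordinary least squares.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = math.copysign(math.sqrt(syy / sxx), sxy)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up well data, for illustration only
impedance = [6.0, 7.0, 8.0, 9.0, 10.0]
porosity = [0.28, 0.27, 0.21, 0.20, 0.14]

slope, intercept = rma_fit(impedance, porosity)
inv_slope, inv_intercept = rma_fit(porosity, impedance)
print(slope, intercept, slope * inv_slope)
```

Because the fit is symmetric in the two variables, it does not matter which one goes on the y-axis: rearranging one line gives the other.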

September 19, 2012 | Utpalendu Kuila