Friday, January 6, 2012

What do you mean by average?

I may need some help here. The truth is, while I can tell you what averages are, I can't rigorously explain when to use a particular one. I'll give it a shot, but if you disagree I am happy to be edificated. 

When we compute an average we are measuring the central tendency: a single quantity to represent the dataset. The trouble is, our data can have different distributions, different dimensionality, or different type (to use a computer science term): we may be dealing with lognormal distributions, or rates, or classes. To cope with this, we have different averages. 

Arithmetic mean

Everyone's friend, the plain old mean. The trouble is that it is, statistically speaking, not robust. This means that it's an estimator that is unduly affected by outliers, especially large ones. What are outliers? Data points that depart from some assumption of predictability in your data, from whatever model you have of what your data 'should' look like. Notwithstanding that your model might be wrong! Lots of distributions have important outliers. In exploration, the largest realizations in a gas prospect are critical to know about, even though they're unlikely.
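To see what 'not robust' means in practice, here is a minimal sketch using Python's standard statistics module. The volumes are made-up numbers for illustration only, not real prospect data; the last value plays the role of a large outlier. The median is included purely as a contrast.

```python
from statistics import mean, median

# Hypothetical volumes in arbitrary units; the last value is a large outlier.
volumes = [12, 15, 11, 14, 13, 250]

mean_without = mean(volumes[:-1])      # mean of the five 'ordinary' values
mean_with = mean(volumes)              # mean once the outlier is included
median_without = median(volumes[:-1])
median_with = median(volumes)

print(f"mean:   {mean_without:.1f} -> {mean_with:.1f}")
print(f"median: {median_without:.1f} -> {median_with:.1f}")
```

One extra data point drags the mean from 13 to 52.5, while the median barely moves. Whether that sensitivity is a problem depends on whether the outlier is noise or the very thing you care about.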

Geometric mean

Like the arithmetic mean, this is one of the classical Pythagorean means. It is always equal to or smaller than the arithmetic mean. It has a simple geometric visualization: the geometric mean of a and b is the side of a square having the same area as the rectangle with sides a and b. Clearly, it is only meaningfully defined for positive numbers. When might you use it? For quantities that span orders of magnitude and are roughly lognormally distributed: permeability, say. And this is the only mean to use for data that have been normalized to some reference value.
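A quick sketch of both ideas, assuming Python 3.8 or later for statistics.geometric_mean. The sides a and b and the permeability values are invented for illustration.

```python
from math import sqrt
from statistics import fmean, geometric_mean

# The square-and-rectangle picture: a rectangle with sides a and b.
a, b = 2.0, 8.0
side = sqrt(a * b)             # side of the square with the same area
gm = geometric_mean([a, b])    # should agree with `side`
am = fmean([a, b])             # arithmetic mean, never smaller than gm

# Hypothetical permeabilities in millidarcies, spanning orders of magnitude.
perms = [0.1, 1.0, 10.0, 100.0, 1000.0]
gm_perm = geometric_mean(perms)
am_perm = fmean(perms)

print(f"square side {side:.2f}, geometric {gm:.2f}, arithmetic {am:.2f}")
print(f"permeability: geometric {gm_perm:.1f} mD, arithmetic {am_perm:.1f} mD")
```

For the made-up permeabilities, the geometric mean is 10 mD, the middle of the range on a log scale, while the arithmetic mean is about 222 mD because the single 1000 mD value dominates it.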

Harmonic mean

The third and final Pythagorean mean, always equal to or smaller than the geometric mean. It's sometimes (by 'sometimes' I mean 'never') called the subcontrary mean. It tends towards the smaller values in a dataset; if those small numbers are outliers, this is a bug not a feature. Use it for rates: if you drive 10 km at 60 km/hr (10 minutes), then 10 km at 120 km/hr (5 minutes), then your average speed over the 20 km is 80 km/hr, not the 90 km/hr the arithmetic mean might have led you to believe. 
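The driving example can be checked directly. This is a minimal sketch: it computes the average speed the long way (total distance over total time) and compares it with the harmonic and arithmetic means of the two speeds.

```python
from statistics import fmean, harmonic_mean

speeds = [60.0, 120.0]     # km/hr on each leg
distances = [10.0, 10.0]   # km on each leg

# Average speed the long way: total distance divided by total time.
total_time = sum(d / v for d, v in zip(distances, speeds))  # hours
true_average = sum(distances) / total_time

hm = harmonic_mean(speeds)
am = fmean(speeds)

print(f"distance/time: {true_average:.1f} km/hr")
print(f"harmonic mean: {hm:.1f} km/hr")
print(f"arithmetic:    {am:.1f} km/hr")
```

Note that the plain harmonic mean matches only because the two legs are the same distance. If the legs differed in length, you would need to weight each speed by its distance.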

Median average

The median is the central value in the sorted data. In some ways, it's the archetypal average: the middle, with 50% of values being greater and 50% being smaller. If there is an even number of data points, then it's the arithmetic mean of the middle two. In a probability distribution, the median is often called the P50. In a positively skewed distribution (the most common one in petroleum geoscience), it is larger than the mode and smaller than the mean.
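To make that ordering concrete, here is a small sketch. It takes the lognormal distribution as a stand-in for 'positively skewed' (my assumption, with arbitrary parameters mu and sigma) and uses its standard closed-form expressions for mode, median and mean. It also checks the even-count rule on a tiny list.

```python
from math import exp
from statistics import median

# Even number of points: the median is the mean of the middle two (2 and 3).
median_even = median([3, 1, 4, 2])

# Lognormal distribution with arbitrary, illustrative parameters.
mu, sigma = 0.0, 1.0
mode_ln = exp(mu - sigma**2)        # most likely value
median_ln = exp(mu)                 # the P50
mean_ln = exp(mu + sigma**2 / 2)    # pulled upward by the long right tail

print(f"median of [3, 1, 4, 2]: {median_even}")
print(f"mode {mode_ln:.3f} < median {median_ln:.3f} < mean {mean_ln:.3f}")
```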

Mode average

The mode, or most likely, is the most frequent result in the data. We often use it for what are called nominal data: classes or names, rather than the cardinal numbers we've been discussing up to now. For example, the name Smith is not the 'average' name in the US, as such, since most people are called something else. But you might say it's the central tendency of names. One of the commonest applications of the mode is in a simple voting system: the person with the most votes wins. If you are averaging data like facies or waveform classes, say, then the mode is the only average that makes sense. 
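For nominal data, finding the mode is just a matter of counting. A minimal sketch, with an invented list of facies labels:

```python
from collections import Counter
from statistics import mode, multimode

# Hypothetical facies labels, e.g. one per sample in a window.
facies = ["sand", "shale", "sand", "silt", "shale", "sand"]

most_common = mode(facies)   # the single most frequent label
counts = Counter(facies)     # the full tally, like counting votes

# When there is a tie, multimode returns every winner.
tied = multimode(["sand", "shale", "sand", "shale"])

print(most_common, dict(counts), tied)
```

The tie case is worth handling explicitly: from Python 3.8 onward, mode quietly returns the first of the tied values it encounters, which may or may not be what you want.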

Honourable mentions

Most geophysicists know about the root mean square, or quadratic mean, because it's a measure of magnitude independent of sign, so works on sinusoids varying around zero, for example. 

The root mean square equation: $x_\mathrm{rms} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} x_i^2}$
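Here is a short sketch of the sinusoid case. The helper rms is my own illustrative function, not from any library, and the signal is one full period of a unit-amplitude sine wave sampled at an arbitrary 1000 points.

```python
from math import pi, sin, sqrt
from statistics import fmean

def rms(values):
    """Root mean square: square each value, take the mean, then the root."""
    return sqrt(fmean([v * v for v in values]))

# One full period of a unit-amplitude sinusoid.
n = 1000
signal = [sin(2 * pi * k / n) for k in range(n)]

signal_mean = fmean(signal)   # essentially zero: positives cancel negatives
signal_rms = rms(signal)      # close to 1/sqrt(2)

print(f"mean {signal_mean:.6f}, rms {signal_rms:.4f}")
```

The arithmetic mean tells you nothing about the size of the oscillation, whereas the RMS comes out at about 0.707 times the amplitude.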

Finally, the weighted mean is worth a mention. Sometimes this one seems intuitive: if you want to average two datasets, but they have different populations, for example. If you have a mean porosity of 21% from a set of 90 samples, and another mean of 14% from a set of 10 similar samples, then it's clear you can't simply take their arithmetic average; you have to weight them first: (0.9 × 0.21) + (0.1 × 0.14) = 0.20. But other times, it's not so obvious you need the weighted sum, like when you care about the perception of the data points.
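The porosity calculation is easy to sketch in code, using the 0.21 and 0.14 means and the 90 and 10 sample counts from the calculation above. The variable names are just my own.

```python
means = [0.21, 0.14]   # mean porosity of each set, as fractions
counts = [90, 10]      # number of samples in each set

# Weight each mean by its sample count, then normalize by the total count.
weighted = sum(m * n for m, n in zip(means, counts)) / sum(counts)

# The naive version treats both sets as equally important.
naive = sum(means) / len(means)

print(f"weighted: {weighted:.3f}, naive: {naive:.3f}")
```

The weighted result is 0.203, or about 20%, against 17.5% for the naive average of the two means.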

Are there other averages you use? Do you see misuse and abuse of averages? Have you ever been caught out? I'm almost certain I have, but it's too late now...

There is an even longer version of this article in the wiki. I just couldn't bring myself to post it all here. 


Reader Comments (7)

Paul Edwards wrote a great comment over at the Google Plus thread on this post.

January 9, 2012 | Registered Commenter Matt Hall

Matt, I find geometric mean is much easier to understand when accompanied by a semi-circle diagram. I must admit I didn't understand it until I saw the diagram.

January 11, 2012 | Unregistered Commenter Mark Dahl

@Mark: I assume you are referring to this geometric relationship between the arithmetic and geometric means? Very cool, I'd never thought about it like that before.

Cheers!

January 11, 2012 | Registered Commenter Matt Hall

Hi Matt, nice post.
A clear application of different averages is in predicting bounds on the effective elastic moduli of a mixture of minerals or fluids, e.g. the Reuss bound (harmonic average) and the Voigt bound (arithmetic average).
The effective moduli could then be input into Gassmann's equations, and which average you choose depends on what you want to model: one can estimate the effect of patchy saturation by using the arithmetic average to determine the fluid bulk modulus, for example.
I like to think of the 'softer' (lower bulk modulus) fluid dominating in the case of a Harmonic average, and the 'stiffer' fluid dominating in the case of an Arithmetic average.
Cheers, Matt

January 15, 2012 | Unregistered Commenter MattS

@Matt: Thanks for the great comment... I didn't think about rock physics at all. I like this notion of using different averages to let different parts of the data 'speak'. I think I tend to think of the choice of average as right or wrong, but really they are all indications of the central tendency—they all come from the data.

January 16, 2012 | Registered Commenter Matt Hall

You have the midrange, minimizing an L_\infty norm (for each central tendency, a minimizing cost)
http://en.wikipedia.org/wiki/Mid-range
Can be useful, sometimes

November 30, 2012 | Unregistered Commenter Laurent Duval

@Laurent: Good one — it has the virtue of being the only one you can easily do in your head in a dataset of any size. But then... you get what you pay for, I guess.

November 30, 2012 | Registered Commenter Matt Hall
