The curse of hunting rare things

What are the chances of intersecting features with a grid of cross-sections? I often wonder about this when interpreting 2D seismic data, but I think it also applies to outcrops, or any other transects. I want to know:

  1. If there are only a few of these features, how many should I see?
  2. What's the probability of the lines missing them all? 
  3. Conversely, if I interpret x of them, then how many are there really?
  4. How is the detectability affected by the reliability of the data or my skills?

I used to have a spreadsheet for computing all this stuff, but spreadsheets are dead to me so here's an IPython Notebook :)

An example

I'm interpreting seep locations on 2D data at the moment. So I'm looking for subvertical pipes and chimneys, mud volcanos, seafloor pockmarks and pingos, that sort of thing (see Løseth et al., 2009 for a great overview). Here are some similar features from the Norwegian continental shelf from Hustoft et al., 2010:

Figure 3 from hustoft et al. (2010) showing the 3D expression of some hydrocarbon leakage features in Norway. © The Authors.

As Hustoft et al. show, these can be rather small features — most pockmarks are in the 100–800 m diameter range, so let's call it 500 m. The dataset I have is an orthogonal grid of decent quality 2D lines with a 3 km spacing. The area is about 120,000 km². For the sake of argument (and a forward model), let's imagine there are 120 features I'm interested in — one per 1000 km². Here's a zoomed-in view showing a subset of the problem:

Zoomed-in view of part of my example. A grid of 2D seismic lines, 3 km apart, and randomly distributed features, each 500 m in diameter. If a feature's centre falls inside a grey square, then the feature is not intersected by the data. The grey squares are 2.5 km across.

Zoomed-in view of part of my example. A grid of 2D seismic lines, 3 km apart, and randomly distributed features, each 500 m in diameter. If a feature's centre falls inside a grey square, then the feature is not intersected by the data. The grey squares are 2.5 km across.

According to my calculations...

  1. Of the 120 features in the area, we expect 37 to be intersected by the data. Of course, some of those intersections might be very subtle, if they are right at the edge of the feature.
  2. The probability of intersecting a given feature is 0.31. There are 120 features, so the probability of the whole dataset intersecting at least one is essentially 1 (certain). That's good! Conversely, the probability of missing them all is effectively 0. (If there were only 5 features, then there'd be about a 16% chance of missing them all.)
  3. Clearly, if I interpret 37 features, there are about 120 in total (that was my a priori). It's a linear relationship, so if I interpret 10 features, I can expect there to be about 33 altogether, and if I see 100 then I can expect that there are almost 330 in total. (I think the probability distribution would be log-normal, but would appreciate others' insights here.)
  4. Reliability? That sounds like a job for Bayes' theorem...

It's far from certain that I will interpret everything the data intersects, for all sorts of reasons:

  • I am human and therefore inconsistent, biased, and fallible.
  • The feature may be cryptic in the section , because of how it was intersected.
  • The data may be poor quality at that point, or everywhere.

Let's assume that if a feature has been intersected by the data, then I have a 75% chance of actually interpreting it. Bayes' theorem tells us how to update the prior probability of 0.31 (for a given feature; point 2 above) to get a posterior probability. Here's the table:

Interpreted Not interpreted
Intersected by a 2D line 28 9
Not intersected by any lines 21 63

What do the numbers mean?

  • Of the 37 intersected features, I interpret 28.
  • I fail to interpret 9 features that are intersected by the data. These are Type II errors, false negatives.
  • I interpret another 21 features which are not real! These are Type I errors: false positives. 
  • Therefore I interpret 48 features, of which only 57% are real. This seems like a lot, but it's a function of my imperfect reliability (75%) and the poor sampling, resulting in a large number of 'missed' features.

Interestingly, my 75% reliability translates into a 57% chance of being right about the existence of a feature. We've seen this effect before — it's the curse of hunting rare things: with imperfect knowledge, we are often wrong


References

Hustoft, S, S Bünz, and J Mienart (2010). Three-dimensional seismic analysis of the morphology and spatial distribution of chimneys beneath the Nyegga pockmark field, offshore mid-Norway. Basin Research 22, 465–480. DOI 10.1111/j.1365-2117.2010.00486.x 

Løseth, H, M Gading, and L Wensaas (2009). Hydrocarbon leakage interpreted on seismic data. Marine & Petroleum Geology 26, 1304–1319. DOI 10.1016/j.marpetgeo.2008.09.008