Week 2: Identifying, describing and comparing populations.

Reading:

Chapter 2, p. 15-43, in Swan and Sandilands. This will introduce you to basic population statistics and to histogram construction and analysis.




The image to the right shows crustal thickness for different types of crust. The questions we are exploring this week are: how do you construct, analyze and interpret these types of plots? Image source: http://earthquake.usgs.gov/research/structure/crust/histoProvince.html


What is a population?

As is often the case, something that seems simple, and that we define all the time without thinking, can be quite complex once you get into the nitty-gritty detail. Fundamentally, a population is usually considered to be some set that something either belongs to or does not. In fact, set theory is part of statistics. We are usually concerned with a series of measurements of some variable from a collection of samples, so for our purposes a population is typically a series of numbers. The sample set is a subset of all the possible samples that could be taken, a set known as the universal set. The universal set consists of all possible members of the set. If we were considering faults on Bjørnøya, it would be all the faults, including those exposed and those not, those measured and mapped and those not. Hopefully, the sample set is representative of the universal set.

In geology we can also make a distinction between a hypothetical population and an available population. The hypothetical population usually includes all of the members that existed back through time - perhaps think of it as the ultimate universal set. The available population includes what is still preserved today. The distinction is very clear in paleontology. All of the members of a species that once lived make up the hypothetical population, those represented by fossils are the population available for study, and those that are actually in a collection under study are the actual sample population. Can you think of ways in which the available population can be a biased selection of the hypothetical population?

What we often want to know about is the character of the universal population given the sample set. Using a set perspective, a sample population is always a subset of the universal population it was drawn from. It is worth repeating - how well the sample set represents the universal set is a fundamental question. Sampling plans and biases (link to pdf on one common example of a bias and how to potentially correct for it) are critical considerations. Biases are both obvious and not so obvious. A major and very common bias in geology is that sampling depends on access to an outcrop surface. Since weathering is often concentrated near such a surface, a sample taken there may not be fully representative of the composition of the entire rock body. Units that are more resistant to erosion will outcrop better. Shallow-dipping structures will be under-represented in terms of their frequency relative to steeply dipping structures. Larger fossils may be found more easily than smaller ones. A known bias can be corrected for in some situations using a concept known as weighting.
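The weighting idea can be sketched as a weighted mean. This is only an illustration: the dip values and the weights below are invented, and how weights should be chosen for a real bias depends on the sampling geometry, which these notes do not specify.

```python
def weighted_mean(values, weights):
    """Return the weighted mean; weights must sum to a positive number."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total_weight

# Hypothetical example: dips (degrees) of five measured fractures.
dips = [80.0, 75.0, 85.0, 20.0, 15.0]
# Hypothetical weights: each under-sampled shallow fracture counts three times.
weights = [1.0, 1.0, 1.0, 3.0, 3.0]

plain_mean = sum(dips) / len(dips)
corrected_mean = weighted_mean(dips, weights)
print("unweighted mean dip:", plain_mean)
print("weighted mean dip:  ", round(corrected_mean, 2))
```

Giving the shallow fractures more weight pulls the mean toward shallower dips, which is the direction you would expect if shallow structures were under-sampled.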

Consider the amounts of slip on all faults on the Arctic island Bjørnøya as a universal data set. Slip amount is usually described by a simple length. But it turns out that the amount of slip on a given fault is not uniform. Detailed studies have shown that it can vary in a systematic way depending on position on the fault plane, with a maximum towards the center, diminishing to zero at the fault edge. A given sand grain is also not perfectly spherical and so has, in a sense, a population of diameters. In other words, one number, one data point, in a population can come from another population. This is important to remember. Hopefully the variation in diameters of one grain is much less than the variation in diameters between grains, or the variation of slip on one fault plane is less than that between different faults in the universal population. Otherwise the sample population may reflect not only variation between different grains or faults, but also variation in your fundamental measure for one grain or fault.
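One rough way to check this is to measure a few objects several times and compare the spread within each object to the spread between objects. The sketch below uses invented grain diameters and Python's standard statistics module; it is a simple comparison of variances, not a formal test.

```python
import statistics

# Hypothetical data: three diameter measurements (mm) on each of four grains.
grains = {
    "grain_1": [0.50, 0.52, 0.51],
    "grain_2": [0.80, 0.78, 0.82],
    "grain_3": [0.30, 0.31, 0.29],
    "grain_4": [1.10, 1.12, 1.08],
}

# Average of the variances measured on each individual grain.
within_variances = [statistics.variance(d) for d in grains.values()]
within = statistics.mean(within_variances)

# Variance of the per-grain mean diameters.
grain_means = [statistics.mean(d) for d in grains.values()]
between = statistics.variance(grain_means)

print("mean within-grain variance:", within)
print("between-grain variance:    ", between)
```

If the between-grain variance is much larger than the within-grain variance, as it is for these made-up numbers, a single diameter per grain is a reasonable data point.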

An appropriate conceptual model and logic, along with the question(s) the geoscientist is interested in, help us to identify what a population is in a given geologic context. Concentrations of TCE (trichloroethylene) in an aquifer, SiO2 content of flows composing a volcanic construct, grain size in a channel deposit, and cranial capacity of a group of hominid fossils from a geologic unit are examples of populations that have been looked at. For each of these you should be able to think of some reason that population trait is important, something it can tell you about. A population is a conceptual construct. For our purposes today we are interested in understanding how a population, as represented by a list of numbers, can be described and analyzed. This is known as univariate analysis.

Conceptual diagram of nested populations. From the sample set one would often like to make statements about the universal set. The accuracy and confidence with which one can do that depends on many factors including the sampling plan, and the statistics of the populations.

What is a distribution?

A variable varies. One might be tempted to see univariate analysis as an attempt to find the variable's true value (singular). One of Charles Darwin's great insights was to realize that for species the interesting thing was the variation within any population, the spread, and not the population mean or a representative abstraction (the Platonic ideal or holotype). Variation - the outliers, the deviations, mutations, the peripheral members - can play a critical role in the history of a natural system. Again, think of evolution and speciation. The nature of the variation can be very informative. Think back to your in-class exercise on grain size.

A distribution is a mathematical map of how the frequency of occurrence changes as a function of variable magnitude. The x axis is the variable value (e.g. temperature), and the y axis is a measure of how frequently a value or range of values occurs. The best way to see and first investigate the distribution is to construct a histogram. This is the standard starting point for understanding and describing a population. The histogram plots a sample population that comes from a larger population. The hope is that your sample histogram reflects the actual distribution of your universal set. See your reading for other ways to look at a distribution. As the number of samples increases, smaller and smaller bins can be used, and the histogram becomes less stepped and smoother, and begins to approach a function.
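The binning behind a histogram can be sketched in a few lines of Python. This is a minimal illustration, not the course's Excel workflow; the ten temperature readings, the bin width and the function name are all made up for the example.

```python
def histogram_counts(values, low, bin_width, n_bins):
    """Count how many values fall in each bin [low + i*w, low + (i+1)*w)."""
    counts = [0] * n_bins
    upper_edge = low + n_bins * bin_width
    for v in values:
        index = int((v - low) // bin_width)
        if v == upper_edge:
            index = n_bins - 1  # a value on the upper edge goes in the last bin
        if 0 <= index < n_bins:
            counts[index] += 1  # values outside the range are ignored
    return counts

# Hypothetical sample: ten temperature readings (degrees C).
temps = [12.1, 13.4, 13.9, 14.2, 14.8, 15.0, 15.3, 16.7, 17.2, 19.5]
counts = histogram_counts(temps, low=12.0, bin_width=2.0, n_bins=4)

# Print a text histogram: one '#' per value, like stacking blocks sideways.
for i, c in enumerate(counts):
    left = 12.0 + 2.0 * i
    print(f"{left:4.1f}-{left + 2.0:4.1f} | " + "#" * c)
```

Changing bin_width and n_bins shows how strongly the appearance of a histogram depends on the choice of bins, especially for small samples.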


Construction of histograms.

Mechanics of production:

Diagram of histogram for the 10 listed sample population values given above. It is a bit like stacking blocks.

This figure shows rose diagrams (circular histograms) of both joints and cave passage strikes for 4 different caves in Missouri. The idea is to clearly show the similarities and to make the argument that joint directions are determining cave passage orientations. Note that the diagrams are all symmetric. That is because orientation data for fractures are bidirectional: it doesn't matter whether the north or south 'end' of the line is used. This is different from a rose diagram for wind directions, where there is a difference between a south and a north end (unidirectional). Sometimes only half of the rose diagram is presented for bidirectional data. Image from USGS site: http://water.usgs.gov/ogw/karst/kigconference/rco_geologicozarks.htm.
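The symmetry of a bidirectional rose diagram comes from folding every azimuth into the range 0-180 degrees before counting. The sketch below shows that folding step with invented strike values; it assumes the sector width divides evenly into 180, and it only produces the counts, not the plot.

```python
def rose_counts(strikes_deg, sector_width=30):
    """Count bidirectional strikes per sector after folding into 0-180."""
    n_sectors = 180 // sector_width  # assumes sector_width divides 180
    counts = [0] * n_sectors
    for s in strikes_deg:
        folded = s % 180  # 200 degrees is the same line as 20 degrees
        counts[int(folded // sector_width)] += 1
    return counts

# Hypothetical strike measurements (degrees from north).
strikes = [10, 190, 25, 95, 100, 275, 170, 350]

half_rose = rose_counts(strikes)
full_rose = half_rose + half_rose  # mirror the half to get the symmetric rose

print("half rose (0-180):", half_rose)
print("full rose (0-360):", full_rose)
```

For unidirectional data such as wind directions you would skip the folding step and count over the full 0-360 range instead.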

Construction of histogram using Excel:

Population distributions vs. histograms: with increasing n (the sample size) and smaller and more numerous bins, the histogram should approach a curve, i.e. a continuous frequency distribution.


Distribution descriptors.

Distribution peaks: unimodal vs. bimodal vs. polymodal.

This image is from a USGS report which can be found at toxics.usgs.gov/pubs/wri99-4018/Volume1/.../1507_Cravotta.pdf. It nicely shows a bimodal distribution. Can you guess why there are two very different peaks?

Different unimodal distributions (founded in probability theory):

Unimodal distribution descriptors: position, dispersion, shape (symmetry).
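As a rough illustration of these three kinds of descriptors, the sketch below computes measures of position (mean, median, mode), dispersion (range, standard deviation) and shape (skewness) for an invented list of bed thicknesses. It uses Python's standard statistics module; the skewness function is written out by hand, and other definitions of skewness exist.

```python
import statistics

def skewness(values):
    """Population skewness: mean cubed deviation divided by sigma cubed."""
    m = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * sigma ** 3)

# Hypothetical sample: bed thicknesses (cm) with a tail of thick beds.
thickness = [2, 3, 3, 4, 4, 4, 5, 5, 6, 14]

# Position
print("mean    ", statistics.mean(thickness))
print("median  ", statistics.median(thickness))
print("mode    ", statistics.mode(thickness))
# Dispersion
print("range   ", max(thickness) - min(thickness))
print("std dev ", statistics.stdev(thickness))  # sample (n - 1) form
# Shape: a positive value means a longer tail toward large values
print("skewness", skewness(thickness))
```

Note how the single thick bed pulls the mean above the median; that gap is itself a quick hint of asymmetry.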

Important distinction - sample vs. population descriptors: It is easy to forget this distinction as you get buried in all the numbers and calculations. Obviously you want the sampling plan to be unbiased so that your samples reflect the population they were taken from. As your sample size gets larger your sample statistics should get closer and closer to the true population statistics, but there will generally still be a difference between the universal population mean and the sample mean. There are statistical methods for measuring how well your samples may reflect the universal population, but we won't have time to cover them in this course. A statistics course will teach you that (hopefully).
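The effect of sample size can be explored with a small simulation. The sketch below builds an artificial universal population (the normal distribution and its parameters are arbitrary assumptions, not real data), draws random samples of increasing size, and reports how far each sample mean is from the population mean.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Hypothetical universal population: 100,000 values drawn from a normal
# distribution with mean 50 and standard deviation 5.
population = [random.gauss(50.0, 5.0) for _ in range(100_000)]
population_mean = statistics.fmean(population)

errors = {}
for n in (10, 100, 1000, 10_000):
    sample = random.sample(population, n)  # unbiased random sample
    sample_mean = statistics.fmean(sample)
    errors[n] = abs(sample_mean - population_mean)
    print(f"n = {n:6d}  sample mean = {sample_mean:7.3f}  "
          f"difference = {errors[n]:.3f}")
```

The difference tends to shrink as n grows, but not necessarily at every step; any single small sample can land close to the population mean by chance.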

Excel statistical functions: Below is a list of functions you may use this week in describing distribution characteristics. You can find a list of Excel functions under Insert -> Function. You should familiarize yourself with the array of functions available to you. If you want to enter a formula into a cell start with the equal sign. Then you can build the equation afterwards using standard mathematical operators and by inserting functions.


Simple tests comparing populations.

Basic questions that often arise include: 1) what is the form of a distribution - is it random, uniform, normal or other, and 2) how do two populations compare with each other? Is the difference significant, or does it likely result from the natural variation within the populations? More specifically, we can ask questions such as - how does the sample population compare to a model distribution, or how does a sample mean compare to a model mean?

The framing of the question is important. There is a distinct tradition of stating the question as a null hypothesis and an alternate hypothesis.

Chi square test:
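A minimal sketch of a chi square test against a uniform distribution is given below. The counts are invented, not the Badlands data. The null hypothesis is that strikes are uniformly distributed among the sectors; the alternate hypothesis is that they are not. The critical value is the standard tabulated one for 5 degrees of freedom (6 classes minus 1) at the 0.05 significance level.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts of strikes in six 30-degree sectors.
observed = [18, 7, 5, 9, 6, 15]
n = sum(observed)

# Under the null hypothesis (uniform), every sector expects the same count.
expected = [n / len(observed)] * len(observed)

chi2 = chi_square_statistic(observed, expected)
critical = 11.070  # tabulated value, 5 degrees of freedom, alpha = 0.05

print("chi square statistic:", chi2)
if chi2 > critical:
    print("Reject the null hypothesis: the strikes do not look uniform.")
else:
    print("Cannot reject the null hypothesis of a uniform distribution.")
```

A common rule of thumb is that the expected count in each class should be at least about 5, which can limit how many sectors you use for a small data set.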


Examples and week 2 exercise.

Example of looking at SiO2 contents of two types of oceanic crustal material, and exploring the effect of increasing n (sample size) on the resulting sample average as compared to the true average.

Heat flow data from the Indian continent. This can be used as an example of how to use Excel to compute basic population descriptors and plot a frequency distribution curve.

Excel data from clastic dike trace strikes in Tertiary strata of the Big Badlands to test against a uniform distribution.

Exercise 2.


Other sources of information:

Davis, J. C., 1986, Statistics and Data Analysis in Geology, Wiley, 646 p. This is a detailed exposition that begins with a classic approach through probability theory.

Marsal, D., 1979, Statistics for Geoscientists, Pergamon Press, 176 p. This has an excellent description of basic statistics for the uninitiated, with geologic examples. Some of the later chapters are a bit brief and incomplete.


Revised 7/18/06

Copyright by Harmon D. Maher Jr. This material may be used for non-profit educational purposes if proper attribution is given. Otherwise please contact Harmon D. Maher Jr.