Week 2: Identifying, describing and comparing populations.

Chapter 2, p. 15-43, in Swan and Sandilands. This will introduce you to basic population statistics and to histogram construction and analysis.


The image to the right shows crustal thickness for different types of crust. The questions we are exploring this week are how do you construct, analyze and interpret these types of plots? Image source: http://earthquake.usgs.gov/research/structure/crust/histoProvince.html

What is a population?

As is often the case, something that seems simple, and that we define without thinking all the time, can be quite complex when you get into the nitty-gritty details. Fundamentally, a population is usually considered as some set, to which something either belongs or not. In fact, set theory is part of statistics. We are usually concerned with a series of measurements of some variable from a collection of samples, so for our purposes a population is typically a series of numbers. The sample set is a subset of all the possible samples that could be taken, a set known as the universal set. The universal set consists of all possible members of the set. If we were considering faults on Bjørnøya, it would be all the faults, including those exposed and those not, those measured and mapped and those not. Hopefully, the sample set is representative of the universal set. In geology we can also make a distinction between a hypothetical population and an available population. The hypothetical population usually includes all of the members that existed back through time - perhaps think of it as the ultimate universal set. The available population includes what is still preserved today. The distinction is very clear for paleontology: all of the members of a species that once lived constitute the hypothetical population, those represented by fossils are the population available for actual study, and those actually in a collection under study represent the sample population. Can you think of ways the available population can be a biased selection of the hypothetical population?

What we often want to know about is the character of the universal population given the sample set. Using a set perspective, a sample population is always a subset of the universal population it was drawn from. It is worth repeating - how well the sample set represents the universal set is a fundamental question. Sampling plans and biases (link to pdf on one common example of a bias and how to potentially correct for it) are critical considerations. Biases can be both obvious and not so obvious. A major and very common bias in geology is that of the outcrop surface access that permits sampling. Since weathering is often concentrated near such a surface, it may not be representative of the entire rock body when it comes to composition. Erosionally more resistant units will outcrop better. Shallowly dipping structures will be underrepresented in frequency relative to steeply dipping structures. Larger fossils may be found more easily than smaller ones. A known bias can be corrected for in some situations using a concept known as weighting.

Consider the amounts of slip on all faults on the Arctic island Bjørnøya as a universal data set. Slip amount is usually described by a simple length, but it turns out that on a given fault the amount of slip is not uniform. Detailed studies have shown that it can vary in a systematic way depending on position on the fault plane, with a maximum towards the center, diminishing to zero at the fault edge. A given sand grain is also not perfectly spherical, and so has, in a sense, a population of diameters. In other words, one number, one data point, in a population can come from another population. This is important to remember. Hopefully the variation in diameters of one grain is much less than the variation in diameters between grains, or the variation of slip on one fault plane is less than that between different faults in the universal population; otherwise the sample population reflects not only variation between different grains or faults, but also variation in your fundamental measure for one grain or fault.

An appropriate conceptual model and logic help us identify what constitutes a population in a given geologic context, along with the question(s) the geoscientist is interested in. Concentrations of TCE (trichloroethylene) in an aquifer, SiO2 content of flows composing a volcanic construct, grain size in a channel deposit, and cranial capacity of a group of hominid fossils from a geologic unit are examples of populations that have been studied. For each of these you should be able to think of some reason that population trait is important, something it can tell you about. A population is a conceptual construct. For our purposes today we are interested in understanding how a population, as represented by a list of numbers, can be described and analyzed. This is known as univariate analysis.

Conceptual diagram of nested populations. From the sample set one would often like to make statements about the universal set. The accuracy and confidence with which one can do that depends on many factors including the sampling plan, and the statistics of the populations.

What is a distribution?

A variable varies. One might be tempted to see univariate analysis as an attempt to find the variable's true value (singular). One of Charles Darwin's great insights was to realize that for species the interesting thing was the variation within any population, the spread, and not the population mean or a representative abstraction (the Platonic ideal or holotype). Variation - the outliers, the deviations, mutations, the peripheral members - can play a critical role in the history of a natural system. Again, think of evolution and speciation. The nature of the variation can be very informative. Think back to your in-class exercise on grain size. A distribution is a mathematical map of how the frequency of occurrence changes as a function of variable magnitude. The x axis is the variable value (e.g., temperature), and the y axis is a measure of the frequency with which a value or range of values occurs. The best way to see and first investigate the distribution is by constructing a histogram. This is the standard starting point for understanding and describing a population. The histogram plots a sample population that comes from a larger population. The hope is that your sample histogram reflects the actual distribution of your universal set. See your reading for other ways to look at a distribution. As the number of samples increases, smaller and smaller bins can be used, and the appearance of the histogram becomes less stepped and more smooth, and begins to approach a function.

Construction of histograms.

Mechanics of production:

• First decide on the best bin width (and hence the number of bins needed to cover the data range) based on the maximum and minimum population values and the sample size. If most bins don't have at least several values that fall within them, then the bin size should be increased. Try for a sample size divided by the number of bins that equals or exceeds 5-10. Another rule of thumb is that your number of bins should be the square root of n, your sample size. Your bins should be the same width.
• Decide on what value to start your bin boundaries with. As you will see it can make a difference, especially for sample sets with a smaller n.
• Count the number of population values that fall within a certain bin. If the value is equal to the bin boundary, typically it counts in the bin to the right (a greater than or equal to rule).
• Plot the bin boundaries along the x axis. Plot columns in the y direction whose height is equal to the number of values from the population within a given bin.
• You can also plot y as the percentage of the data within that bin, which has some distinct statistical advantages.
• The area under the histogram 'curve' can be normalized to one, or 100%.
• rose diagrams or circular histograms are special types of histograms, where temporal or orientation data are being analyzed.
• 'fuzzy' histograms: What to do when the error is greater than or equal to half the histogram interval (bin size)? Basically the idea is to treat each value as its own normalized distribution, and you simply add those distributions to find a y value. The mechanics can be a bit more time intensive, depending on the nature of the distributions associated with each value. You basically need to compute what portion of a value's distribution falls within a bin, and assign that fractional value to that bin. This can be done in Excel, but takes quite a few steps or a macro. More on that later. It is not necessary where the measurement error is much smaller than the bin width.
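The binning mechanics above can be sketched in a few lines of Python. This is an illustrative sketch, not part of the course materials; the data values and bin choices are invented for the example, and the ">= goes right" boundary rule from the steps above is implemented directly:

```python
import math

def histogram(values, bin_width, start):
    """Count values into equal-width bins; a value equal to a bin
    boundary goes into the bin to its right (>= rule)."""
    counts = {}
    for v in values:
        # index of the bin whose left edge is <= v
        i = math.floor((v - start) / bin_width)
        left = start + i * bin_width
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))

data = [2.1, 3.4, 3.9, 4.0, 4.2, 5.5, 5.7, 6.0, 6.3, 8.8]
# rule of thumb: about sqrt(n) bins to span the data range
n_bins = round(math.sqrt(len(data)))
counts = histogram(data, bin_width=2.0, start=2.0)
# note 4.0 falls on a boundary and is counted in the 4.0-6.0 bin
```

Changing `start` shifts every bin boundary, which is exactly why the choice of starting value can change the histogram's appearance for small n.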

Diagram of histogram for the 10 listed sample population values given above. It is a bit like stacking blocks.

This figure shows rose diagrams (circular histograms) of both joint and cave passage strikes for 4 different caves in Missouri. The idea is to clearly show the similarities and to make the argument that joint directions are determining cave passage orientations. Note that the diagrams are all symmetric. That is because orientation data for fractures are bidirectional: it doesn't matter whether the north or south 'end' of the line is used. This is different than a rose diagram for wind directions, where there is a difference between a south and north end (unidirectional). Sometimes only half of the rose diagram is presented for bidirectional data. Image from USGS site: http://water.usgs.gov/ogw/karst/kigconference/rco_geologicozarks.htm.

Construction of histogram using Excel:

• using the FREQUENCY function:
• I suggest you avoid this until you are more familiar with Excel. How it operates differs from one Excel version to another. In some situations it can provide greater flexibility.
• insert sample data in a column. If you paste data in from the web you should be careful and do a paste special, selecting the text option. Otherwise it can bring in extraneous hidden garbage, so that it isn't truly formatted as a number.
• create an array of bin values in descending order. If you have a lot of data you may want to sort it first to find the high and low numbers.
• use the FREQUENCY function, selecting the data and bin arrays when asked.
• the frequency function should return an array of numbers that are the values in each bin. You may have to drag/copy the first cell downwards to create the array.
• idiosyncrasies of the Excel FREQUENCY function: if you have your bin values in ascending order it computes a cumulative frequency; if you have your bin values in descending order it computes a standard frequency. Bottom line - always check your numbers to make sure they make sense. If you check the cell position references in the formulas for the bins lower in the sequence, they appear wrong, but the results are OK. For older versions of Excel you can try using control-shift-enter for the FREQUENCY function; one Excel source indicates that this should yield an array instead of a single number. In some versions of Excel on some platforms you may want to fix your data cell references with a $.
• use the column plot to look at the results. You can plot multiple histograms at once. If you look in the Series window there is a place to label the bin intervals by choosing your bin array. Check your x axis labeling to make sure you are plotting bin intervals. Unfortunately Excel is not very good about placement of x labels, but this can be corrected in another software package if need be.
• using the histogram routine (the suggested route):
• the Histogram routine is found under Tools, under Data Analysis. This is an add-in and if you can't find it you may need to install it. Look under Tools at the Add-ins option. If you are lucky you won't need the CD. It varies from platform to platform and by Excel version.
• choose Histogram, and insert the data and bin arrays in the appropriate spots in the input box. Choose what type of output you want. Make sure you select the chart option. Notice that this will put the results into a separate sheet. Select Finish and, that's it, you're done.

Population distributions vs. histograms: with increasing n (the sample size) and more and smaller bins, the histogram should approach a curve, i.e. a continuous frequency distribution.

Distribution descriptors.

Distribution peaks: unimodal vs. bimodal vs. polymodal.

• bi- or poly- modal may indicate mixed populations (remembering that a population is a convenient mental construct, and that they can be subdivided). Many of the statistical descriptors below are meaningless for non-unimodal distributions.
• classic example of SiO2 contents of volcanics from continental rifts being asymmetrically bimodal (lots of basalt with some rhyolite).

This image is from a USGS report which can be found at toxics.usgs.gov/pubs/wri99-4018/Volume1/.../1507_Cravotta.pdf. It nicely shows a bimodal distribution. Can you guess why there are two very different peaks?

Different unimodal distributions (founded in probability theory):

• normal distribution or Gaussian distribution: the variation or dispersion from the mean is the result of many random and independent contributions. These are symmetric "bell curves". A great wealth of statistical techniques exist for a normal distribution, and so it is advantageous if your population has this distribution.
• uniform distribution: each outcome has the same probability (1/n) of occurring. This is different than a random distribution.
• Poisson distribution: a random distribution for point phenomena in time or space.
• Binomial distributions: number of successes in n trials when the result is either success or failure, and there is a constant probability of success vs. failure. You could use this to look at the chances of finding a diamond or diamonds in a sediment sample.
• mathematical manipulation and transformation of distributions is possible.
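To get a feel for these distributions you can simulate samples from each. The Python sketch below is illustrative only (all parameters are invented); it uses just the standard library, building Poisson draws as event counts with exponential waiting times and binomial draws as repeated success/failure trials:

```python
import random

random.seed(1)
n = 10_000

# normal (Gaussian): many small, independent, random contributions
normal = [random.gauss(50, 10) for _ in range(n)]

# uniform: every value in the interval is equally likely
uniform = [random.uniform(0, 100) for _ in range(n)]

def poisson_sample(lam):
    """Number of random, independent events in one unit of time or
    space, generated via exponential waiting times at rate lam."""
    t, count = 0.0, 0
    while True:
        t += random.expovariate(lam)
        if t >= 1.0:
            return count
        count += 1

def binomial_sample(n_trials, p):
    """Number of successes in n_trials, each with probability p
    (e.g. chance of a diamond in each sediment sample)."""
    return sum(random.random() < p for _ in range(n_trials))

poisson = [poisson_sample(3.0) for _ in range(n)]        # mean near 3
binomial = [binomial_sample(20, 0.1) for _ in range(n)]  # mean near 2
```

Histogramming each of these lists (as in the section above) shows the characteristic shapes: a symmetric bell, a flat plateau, and the discrete, right-skewed count distributions.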

Unimodal distribution descriptors: position, dispersion, shape (symmetry).

• measures of central tendency (distribution position):
• median: the value such that half the values are lower and half are higher.
• mode: most frequently occurring value (highest peak of the frequency distribution)
• arithmetic mean: a.k.a. the average. The sum of all the values divided by n, the number of values.
• geometric mean: the nth root of the product of all the values (multiplying instead of adding).
• measures of distribution dispersion:
• range: the difference between the maximum and minimum value.
• variance: the sum of the squared differences between individual values and the mean (i.e. the squared deviations), divided by n for the universal population, or by n - 1 for the sample variance. Obviously, the larger the dispersion, the larger the variance.
• standard deviation = square root of the variance.
• measures of distribution symmetry of a population: Pearson measure of skewness = (mean - mode)/standard deviation. The closer to 0, the more symmetric the distribution.
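These descriptors are straightforward to compute directly. The Python sketch below (with an invented ten-value sample) parallels the Excel functions listed further down:

```python
import math
from statistics import mean, median, mode, stdev, variance

data = [4, 5, 5, 6, 6, 6, 7, 8, 9, 14]

# measures of central tendency
m = mean(data)                            # arithmetic mean
md = median(data)                         # half below, half above
mo = mode(data)                           # most frequent value
geo = math.prod(data) ** (1 / len(data))  # geometric mean: nth root of product

# measures of dispersion
rng = max(data) - min(data)               # range
s2 = variance(data)                       # sample variance (divides by n - 1)
s = stdev(data)                           # standard deviation = sqrt(variance)

# Pearson measure of skewness: (mean - mode) / standard deviation;
# positive here because the lone high value (14) drags the mean right
pearson_skew = (m - mo) / s
```

For this sample the mean (7.0) sits above the mode (6), so the Pearson skewness comes out positive, flagging the right-hand tail.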

Important distinction - sample vs. population descriptors: it is easy, as you get buried in all the numbers and calculations, to forget this distinction. Obviously you want the sampling plan to be unbiased so that your samples reflect well the population they were taken from. As your sample size gets larger, your sample statistics should get closer and closer to the true population statistics, but there will be a difference between the universal population mean and the sample mean. There are statistical methods for measuring how well your samples may reflect the universal population, but we won't have time to cover them in this course. A statistics course will teach you that (hopefully).

Excel statistical functions: Below is a list of functions you may use this week in describing distribution characteristics. You can find a list of Excel functions under Insert -> Function. You should familiarize yourself with the array of functions available to you. If you want to enter a formula into a cell start with the equal sign. Then you can build the equation afterwards using standard mathematical operators and by inserting functions.

• AVERAGE, FREQUENCY, MEDIAN, MODE, RANK, SKEW, SORT, STDEV, VAR
• The function arguments (the numbers it acts on) are placed within parentheses after the function. For statistical functions an array of numbers is input into the function, usually using cell references for the beginning and end of the array. For example "=AVERAGE(A1:A36)" in a cell will calculate the average of the numbers in the array of cells from A1 through A36.
• note that Excel describes what the function does when you insert it into the formula bar. Help will also give you a more complete description. The frequency function can be a bit tricky to use.
• For more sophisticated statistical analysis many people use a software program known as SPSS.

Simple tests comparing populations.

Basic questions that often arise include: 1) what is the form of a distribution - is it random, uniform, normal or other, and 2) how do these two populations compare with each other? Is the difference significant, or does it likely result from natural variation within the populations? More specifically, we can ask questions such as: how does the sample population compare to a model distribution, or how does a sample mean compare to a model mean?

The framing of the question is important. There is a long-standing tradition of stating the null and alternate hypotheses.

• The null hypothesis is that there is no difference between the two populations. It is often termed H0.
• The alternate hypothesis is that there is a difference. It is often referred to as H1.

Chi square test:

• This is a simple and commonly used test comparing two comparable (same bin width and positions) histograms from two populations. Most commonly you test an observed distribution against a theoretical model, or against a specified mathematical distribution. The null hypothesis is that the observed and the expected were taken from the same population and the differences could be due to random factors. The alternate hypothesis is that the two populations are different.
• The Chi-square statistic is derived by taking, for each bin, the difference between the frequencies of the two populations, squaring that, and dividing it by the expected frequency. You might wonder why the difference is often squared in statistics; this takes care of the problem of positive and negative differences canceling each other out. These numbers are then summed to yield the Chi-squared value. Obviously, the higher the Chi-squared value, the more different the two distributions are and the less likely your null hypothesis is true. This is also, of course, a function of how many bins there are - the more bins, the higher the value. On the other, ludicrous end, with one bin the populations look exactly the same. One does have to take into account the number of bins in evaluating the meaning of your Chi-squared statistic. The number of bins controls your degrees of freedom (which can also be affected by other things). The observed Chi-square value is compared against a table where the likelihood you would randomly see such a difference given the null hypothesis is computed for the degrees of freedom (see below for degrees of freedom).
• The table of expected Chi-squared values for a given probability of the null hypothesis given the degrees of freedom (DF) can usually be found in the back of a statistics book. DF = the number of classes (bins) - 1 - the number of estimated parameters (usually involved in computing the frequency of the expected classes). If it is a simple measurement this last component is usually 0. The conventional probability cutoff used is .05, or a 5% chance of being wrong. If your Chi-squared statistic is lower than the appropriate table value, you accept the null hypothesis: the two populations could be the same. If it is higher, you reject it and can state that the probability of the null hypothesis being true is less than 5%, i.e. you are fairly certain they are different. Remember that they could actually come from the same population, but a sampling bias is causing the difference! Or it could reflect a real-world difference. Or it could reflect bad luck, in that roughly 1 out of 20 times you will reject the null hypothesis even though the populations are not actually different. The test doesn't tell you why they are different, or even that they are different, only that it is very probable they are different.
• This is a robust and not a sensitive test. In other words it is fairly easy to conclude they are the same when they are not. The emphasis is on establishing that they are different.
• Some caveats:
• there should be at least 5 observations in each bin or class.
• observations are independent.
• the results are sensitive to the number of classes or bins. The larger the number, the more likely you will reject the null hypothesis. With one class, of course, they are the same.
• the sample size needs to be the same for the two populations. Otherwise the difference in n will simply reflect the different sample sizes. If the sample size is not the same you can normalize one against the other, but this should be done carefully and clearly reflected in the statement of your null and alternate hypotheses.
• The Chi-Square test function in Excel does not return the Chi-square statistic, but instead the probability level of null hypothesis rejection. They (those who designed/wrote the software) want to save you the trouble of looking up values in the Chi-square table. This can be confusing if you are going by the protocol described for Chi-Square tests in many statistics textbooks. Instead, it indicates, as a value between 0 and 1, the probability that you will be wrong if you accept the alternate hypothesis (that they are different). For example, if you get an Excel value of .71, you have a 71% chance of being wrong if you reject the null hypothesis and assume the two are different. It also assumes the degrees of freedom are one less than the number of bins.
• If you need to answer some of these types of questions, and you haven't acquired more expertise than this course provides, this is an excellent time to find someone with the right expertise.
• My experience suggests that people will generally see differences that are not statistically significant, but will rarely see similarity where there is a real difference. In other words, we are biased toward seeing differences!
• Below is an example of a simple Chi-square test done in Excel. In addition to illustrating the technique, it shows the importance of sample size. Imagine you have a granite and you are wondering if it is a minimum eutectic melt (the first material that comes off when partially melting sediment). If it were eutectic, quartz, K-feldspar and plagioclase should be in relative proportions of 40%, 30%, and 30%. We could test whether a given sample is significantly different from the composition predicted by this model. Imagine two cases: one where only 20 measurements were taken, and another where 200 were, but where the relative proportions in the measurements were the same (an unlikely scenario, by the way). The results are shown below.
              observed, n = 20   observed, n = 200   expected, n = 20   expected, n = 200
   quartz            10                100                  8                  80
   K-spar             6                 60                  6                  60
   plag               4                 40                  6                  60

   Chi-square p value, n = 20:  0.558035146
   Chi-square p value, n = 200: 0.0029283
• Note that for the first case, we would not reject the null hypothesis. The returned p value indicates the chances are almost half-and-half. Therefore we would conclude that there is not a difference. One way to think of this is that there is not enough information to detect a difference. Note also that having fewer than 5 observations in one of our bins is problematic; while Excel provides a number, by convention it should not be reported. In case 2 we would be very safe in concluding they are different (less than a 1% chance of being wrong). Even if one were to obtain the true proportions with a few measurements, and the two distributions really were different, a low number of observations won't allow you to establish that they are different.
• The two types of error that can occur:
• Type 1 (false positive) = you conclude they are different and they are not (5% chance with .05 cut off in Excel).
• Type 2 (false negative) = you conclude they are the same when they are different. Associated risk very much controlled by n, your sample size, and the amount of dispersion (standard deviation). One can make an estimate.
• Can there be such a thing as too much data? Vermeesch, P., 2009, Lies, Damned Lies, and Statistics (in Geology); Eos Trans. Am. Geophys. Union, 90(47), p. 443, is a short and interesting piece that argues that using the Chi-Square test on a very large earthquake data set leads to the conclusion that earthquakes are not evenly distributed throughout the week (perhaps concentrated on Sunday). However, at least one response has been that Vermeesch violated some of the basic assumptions behind the Chi-Square test (Sornette - arxiv.org/pdf/1001.4158). More of the ensuing discussion can be found on the web (e.g. Taylor, S. R., and D. N. Anderson (2011), Careful construction of hypothesis tests: Comment on "Lies, damned lies, and statistics (in geology)," Eos Trans. AGU, 92(8), 65, doi:10.1029/2011EO080009).
• Some thought leads one to realize that the bins you define will also influence your Chi-Square value, and therefore could influence whether you accept or reject the null hypothesis. Ballantyne & Cornish (Use of the chi-square test for the analysis of orientation data; Journal of Sedimentary Research, September 1979, v. 49, no. 3, p. 773-776) took the trouble of calculating all the different Chi-Square values for a sample set of 98 orientations, and found significant variation in the resulting values.
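The eutectic-granite example above can be reproduced in a few lines of Python. This is a sketch for checking the numbers, not the course's prescribed method; the counts come straight from the table, and the p-value shortcut exp(-chi2/2) is exact only for the 2 degrees of freedom of this 3-bin case (for other cases use a Chi-square table, or a library routine such as scipy.stats.chi2.sf):

```python
import math

def chi_square_stat(observed, expected):
    """Sum over bins of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# eutectic model: quartz 40%, K-spar 30%, plagioclase 30%
obs_20,  exp_20  = [10, 6, 4],    [8, 6, 6]      # n = 20 point counts
obs_200, exp_200 = [100, 60, 40], [80, 60, 60]   # same proportions, n = 200

chi2_20  = chi_square_stat(obs_20, exp_20)
chi2_200 = chi_square_stat(obs_200, exp_200)

# with 3 bins there are 2 degrees of freedom, and the chi-square
# survival function reduces to p = exp(-chi2 / 2)
p_20  = math.exp(-chi2_20 / 2)    # ~0.56: cannot reject H0
p_200 = math.exp(-chi2_200 / 2)   # ~0.003: reject H0 at the .05 level
```

The same proportions give a ten-fold larger Chi-square statistic at n = 200, which is exactly the sample-size effect the Excel example illustrates.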

Examples and week 2 exercise.

Heat flow data from the Indian continent. This can be used as an example of how to use Excel to compute basic population descriptors and plot a frequency distribution curve.

Excel data from clastic dike trace strikes in Tertiary strata of the Big Badlands to test against a uniform distribution.

Exercise 2.

Other sources of information:

Davis, J. C., 1986, Statistics and Data Analysis in Geology; Wiley, 646 p. This is a detailed exposition with an initial classic approach grounded in probability theory.

Marsal, D., 1979, Statistics for Geoscientists; Pergamon Press, 176 p. This has an excellent description of basic statistics for the uninitiated, with geologic examples. Some of the later chapters are a bit brief and incomplete.

Revised 7/18/06

Copyright by Harmon D. Maher Jr. This material may be used for non-profit educational purposes if proper attribution is given. Otherwise please contact Harmon D. Maher Jr.