Week 2: Identifying, describing
and comparing populations.
Reading:
Chapt. 2 p. 15-43 in Swan and Sandilands.
This will introduce you to basic population
statistics and to frequency plots and histograms.
What is a population?
As is often the case, something that seems
so simple, and that we define without thinking all the time, can
be quite complex when you get into the nitty gritty detail. Fundamentally,
a population is usually considered as some set that something
either belongs to or not. In fact, set theory is part of statistics. We
are usually concerned with a series of measurements of some variable
from a collection of samples. So for our purposes a population
is typically a series of numbers. The sample set is a subset
of all the possible samples that could be taken, a set known as
the universal set. The universal set consists of all possible
members of the set. If we were considering faults on Bjørnøya,
it would be all the faults, including those exposed and those
not, those measured and mapped and those not. Hopefully, the
sample set is representative of the universal set. In geology
we can also make a distinction between a hypothetical population
and an available population. The hypothetical population
usually includes all of the members that existed back through
time - perhaps think of it as the ultimate universal set. The
available population includes what is still preserved today. The
distinction is very clear for paleontology. All of the members
of a species that once lived make up the hypothetical population,
those represented by fossils are the population available for
actual study, and those that are actually in a collection under
study represent the actual sample population. Can you think
of ways in which the available population can be a biased selection
of the hypothetical population?
What we often want to know about is the character
of the universal population given the sample set. Using a set
perspective, a sample population is always a subset of the universal
population it was drawn from. It is worth repeating - how well
the sample set represents the universal set is a fundamental question.
Sampling plans and biases are critical considerations.
Biases are both obvious and not so obvious. A major and very common
bias in geology is that of outcrop surface access that permits
sampling. Since weathering is often concentrated near such a surface,
the samples may not be totally representative of the entire rock body when
it comes to composition. Erosionally more resistant units will
outcrop better. Shallow dipping structures will be underrepresented
in terms of their frequency relative to steeply dipping structures.
Larger fossils may be found more easily than smaller ones. A known
bias can be corrected for in some situations using a concept known
as weighting.
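As a minimal sketch of what weighting can look like (in Python rather
than Excel), suppose small fossils are known to be found only a quarter
as often as large ones. Each small fossil found then stands in for four,
so it gets four times the weight. The lengths and detection chances below
are invented for illustration, and weighted_mean is a hypothetical helper,
not a function from any package.

```python
def weighted_mean(values, weights):
    """Mean in which each value counts in proportion to its weight."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight

# Hypothetical fossil lengths (mm) and ASSUMED relative chances of being found.
lengths = [2.0, 3.0, 10.0, 12.0]
detection_chance = [0.25, 0.25, 1.0, 1.0]

# Weight = 1 / chance of being found, so rarely found sizes count for more.
weights = [1.0 / p for p in detection_chance]

plain_mean = sum(lengths) / len(lengths)
corrected_mean = weighted_mean(lengths, weights)
print("unweighted mean:", plain_mean)
print("weighted mean:  ", corrected_mean)
```

The correction is only as good as the assumed detection chances. If
those are a guess, the weighted mean is a guess too.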
Consider the amounts of slip on all faults
on the Arctic island Bjørnøya as a universal data set. Slip amount
is usually described by a simple length. But it turns out that
on a given fault the amount of slip is not uniform. Detailed
studies have shown that it can vary in a systematic way depending
on position on the fault plane, with a maximum more towards the
center, diminishing to zero at the fault edge. A given sand grain
is also not perfectly spherical and so has, in a sense, a population
of diameters. In other words, one number, one data point, in
a population can come from another population. This is important
to remember. Hopefully the variation in diameters of one grain
is much less than variation in diameters between grains, or the
variation of slip on one fault plane is less than that on different
faults in the universal population, otherwise the sample population
may not be reflecting only variation between different grains
or faults, but also variation in your fundamental measure for
one grain or fault.
An appropriate conceptual model and logic help
us to identify what is a population in a given geologic context,
along with the question(s) the geoscientist is interested in.
Concentrations of TCE in an aquifer, SiO2
content of flows composing a volcanic construct, grain size in
a channel deposit, and cranial capacity of a group of hominid fossils
from a geologic unit are examples of populations that have been
looked at. For each of these you should be able to think of some
reason that population trait is important, something it can tell
you about. A population is a conceptual construct. For
our purposes today we are interested in understanding how a population
as represented by a list of numbers can be described and analyzed.
This is known as univariate analysis.
Conceptual diagram of nested populations.
From the sample set one would often like to make statements about
the universal set. The accuracy and confidence with which one
can do that depends on many factors including the sampling plan,
and the statistics of the populations.
What is a distribution?
A variable varies. One might be tempted to
see univariate analysis as an attempt to find the variable's true
value (singular). One of Charles Darwin's great insights was to
realize that for species the interesting thing was the variation
within any population, the spread, and not the population
mean or a representative abstraction (the Platonic ideal or holotype).
Variation - the outliers, the deviations, mutations, the peripheral
members - can play a critical role in the history of a natural
system. Again, think of evolution and speciation. The nature of
the variation can be very informative. Think back to your in-class
exercise on grain size. A distribution is a mathematical map of
how the frequency of occurrence changes as a function of variable
magnitude. The x axis is the variable value (e.g. temperature),
and the y axis is a measure of the frequency with which a value or
range of values occurs. The best way to see and first investigate
the distribution is by the construction of a histogram.
This is the standard starting point for describing a population.
See your reading for other ways to look at a distribution.
Construction of histograms.
Mechanics of production:
- First decide on the best bin width
(and hence the number of bins needed to cover the data range)
based on maximum and minimum population values and the sample
size. If most bins don't have at least several values that fall
within them, then the bin width should be increased. As a guide,
the sample size divided by the number of bins should be at least
5-10. Another rule of thumb is that your number of bins should
be about the square root of n, your sample size. Your bins should
all be the same width.
- decide on what value to start your bin boundaries
with. As you will see it can make a difference, especially for
sample sets with a smaller n.
- count the number of population values that
fall within a certain bin. If the value is equal to the bin boundary,
typically it counts in the bin to the right (a greater than or
equal to rule).
- Plot the bin boundaries along the x axis.
Plot columns in the y direction whose height is equal to the
number of values from the population within a given bin.
- You can also plot y as the percentage of
the data within that bin, which has some distinct statistical
advantages.
- area under the histogram 'curve' can be normalized
to one, or 100%.
- rose diagrams
are special types of histograms, where temporal or orientation
data is being analyzed.
- 'fuzzy' histograms:
What to do when the measurement error is greater than or equal to
half the histogram interval (bin width)? The idea is to treat each
value as its own normalized distribution, and to add
those distributions to find a y value. The mechanics can be a
bit difficult, depending on the nature of the distributions associated
with each value. You need to compute what portion of
a value's distribution falls within a bin, and assign that fractional
value to that bin. This can be done in Excel, but takes quite
a few steps or a macro. More on that later. It is not necessary
where any variable measurement error << the bin width.
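The fuzzy histogram idea can be sketched in a few lines of Python
(not Excel). This is only an illustration under a loud assumption: every
value is taken to have a normal (Gaussian) error with the same known
standard deviation. The data, the bin edges and the names normal_cdf and
fuzzy_histogram are made up for the example.

```python
import math

def normal_cdf(x, mean, sd):
    """Fraction of a normal distribution (mean, sd) lying below x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def fuzzy_histogram(values, sd, edges):
    """Add to each bin the fraction of every value's distribution inside it."""
    counts = [0.0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            inside = normal_cdf(edges[i + 1], v, sd) - normal_cdf(edges[i], v, sd)
            counts[i] += inside
    return counts

values = [1.2, 1.9, 2.1, 3.4]            # hypothetical measurements
edges = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]   # bin boundaries, bin width = 1
fuzzy = fuzzy_histogram(values, sd=0.5, edges=edges)  # ASSUMED error sd = 0.5

for i, height in enumerate(fuzzy):
    print(f"{edges[i]:.0f} to {edges[i + 1]:.0f}: {height:.2f}")
```

Note that the bin heights are no longer whole numbers, and that their sum
is slightly less than the number of values because a little of each
distribution falls outside the plotted range.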
Diagram of histogram for the 10
listed sample population values given above. It is a bit like
stacking blocks.
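The counting mechanics listed above can also be sketched in Python
(not Excel). The 10 values here are invented for illustration, since
they are not the listed sample values, and histogram_counts is a
hypothetical helper. It uses equal-width bins, the square root of n
rule of thumb for the number of bins, and the greater than or equal to
rule for values that sit on a bin boundary.

```python
import math

def histogram_counts(values, start, width, n_bins):
    """Count values per bin; a value on a boundary goes to the bin on its right."""
    counts = [0] * n_bins
    for v in values:
        i = math.floor((v - start) / width)
        if 0 <= i < n_bins:          # values outside the bins are ignored
            counts[i] += 1
    return counts

values = [2, 3, 5, 5, 6, 7, 8, 8, 9, 11]   # hypothetical sample, n = 10
n_bins = round(math.sqrt(len(values)))      # square root of n rule of thumb
width = 4                                   # chosen so the bins cover the data

counts_a = histogram_counts(values, start=0, width=width, n_bins=n_bins)
counts_b = histogram_counts(values, start=2, width=width, n_bins=n_bins)
percent_a = [100.0 * c / len(values) for c in counts_a]

# Crude text plot of the first histogram: one '#' per value, like stacked blocks.
for i, c in enumerate(counts_a):
    low = i * width
    print(f"{low:>2} to <{low + width:<2} | " + "#" * c)
print("bins starting at 0:", counts_a, "as %:", percent_a)
print("bins starting at 2:", counts_b)
```

The two sets of counts come from the same data and the same bin width.
Only the starting boundary differs, which is why the choice of starting
value matters for small n.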
Construction of histogram using Excel:
- using the FREQUENCY function:
- I suggest you avoid this until you are
more familiar with Excel. How it operates differs from
one Excel version to another. In some situations it can provide
some greater flexibility.
- insert sample data in a column. If you paste
data in from the web you might be careful and do a paste special
and select the text option. Otherwise it can bring in some
extraneous hidden garbage, so that it isn't truly formatted as
a number.
- create an array of bin values in descending
order. If you have a lot of data you may want to sort it first
to find the high and low numbers.
- use the FREQUENCY function, selecting the
data and bin arrays when asked.
- the FREQUENCY function should return an array
of numbers that are the counts of values in each bin. You may have to drag/copy
the first cell downwards to create the array.
- idiosyncrasies of the Excel FREQUENCY function.
If you have your bin values in ascending order it computes a
cumulative frequency. If you have your bin values in descending
order it computes a standard frequency. Bottom line: always
check your numbers to make sure they make sense. If
you check the cell position references in the formulas for
the bins lower in the sequence, they appear wrong, but the results
are OK. For older versions of Excel you can try using control-shift-enter
for the FREQUENCY function. One Excel source indicates that this
should yield an array instead of a single number. In some versions
of Excel on some platforms you may want to fix your data cell
references with a $.
- use the column plot to look at the results.
You can plot multiple histograms at once. If you look in the
Series window there is a place to label the bin intervals by
choosing your bin array. Check your x axis labeling to make sure
you are plotting bin intervals. Unfortunately Excel is not very
good about placement of x labels, but this can be corrected in
another software package if need be.
- using the histogram routine (the suggested
route):
- the Histogram routine is found under Tools,
under Data Analysis. This is an add-in and if you can't
find it you may need to install it. Look under Tools at the Add-ins
option. If you are lucky you won't need the CD. It varies from
platform to platform and by Excel version.
- choose Histogram, then insert the data and
bin arrays in the appropriate spots in the input box. Choose what
type of output you want. Make sure you select the chart option.
Notice that this will put the results into a separate sheet.
Select finish and that's it, you're done.
Population distributions vs. histograms:
with increasing n (the sample size) and smaller
and more numerous bins the histogram should approach a curve, i.e. a continuous
frequency distribution.
Distribution descriptors.
Distribution peaks:
unimodal vs. bimodal vs. polymodal.
- bi- or poly- modal may indicate mixed populations
(remembering that a population is a convenient mental construct,
and that they can be subdivided). Many of the statistical descriptors
below are meaningless for non-unimodal distributions.
- classic example of Si contents of volcanics
from continental rifts as being asymmetrically bimodal (lotsa
basalt with some rhyolite).
Different unimodal distributions (founded in probability theory):
- normal distribution
or Gaussian distribution: the variation or dispersion from the
mean is the result of many random and independent contributions.
These are symmetric. A great wealth of statistical techniques
exist for a normal distribution, and so it is advantageous if
your population has this distribution.
- uniform distribution:
each outcome has the same probability (1/n) of occurring. This
can be different than a random distribution.
- Poisson distribution:
a random distribution for point phenomena in time or space.
- binomial distribution: the number of successes in n trials when the result
of each trial is either success or failure, and there is a constant probability
of success. You could use this to look at the chances
of finding a diamond or diamonds in a sediment sample.
- mathematical manipulation and transformation
of distributions is possible.
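To make the diamond example concrete, here is a small Python sketch of
the binomial distribution. The numbers are pure assumptions for
illustration: 50 grains examined, each with a 2% chance of being a
diamond, and each grain independent of the others. binomial_pmf is a
hypothetical helper name.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

n_grains = 50      # ASSUMED number of grains examined
p_diamond = 0.02   # ASSUMED chance that any one grain is a diamond

p_none = binomial_pmf(0, n_grains, p_diamond)
p_at_least_one = 1.0 - p_none
p_exactly_two = binomial_pmf(2, n_grains, p_diamond)

print(f"no diamonds:          {p_none:.3f}")
print(f"at least one diamond: {p_at_least_one:.3f}")
print(f"exactly two diamonds: {p_exactly_two:.3f}")
```

The chance of at least one success is easiest to get as one minus the
chance of none.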
Unimodal distribution descriptors: position, dispersion, shape (symmetry).
- measures of central tendency (distribution
position):
- median: the number such
that half the values are lower and half are higher.
- mode: most
frequently occurring value (highest peak of the frequency distribution)
- arithmetic mean:
a.k.a. the average. The sum of all the values divided by n, the
number of values.
- geometric mean:
the nth root of the product of all n values (multiplying the
values instead of adding them).
- measures of distribution dispersion:
- range: the
difference between the maximum and minimum value.
- variance: the sum
of the squared differences between individual values and
the mean (i.e. the squared deviations), divided by n
for the variance of the universal population or by n - 1
for the sample variance. Obviously the larger the dispersion the
larger the variance.
- standard deviation
= square root of the variance.
- measures of distribution symmetry of a population:
Pearson measure of skewness = (mean - mode)/standard deviation.
The closer to 0, the more symmetric the distribution.
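The descriptors listed above can be computed for a short list of numbers
with Python's standard statistics module, as a cross-check on what you
get in Excel. The eight values are invented for illustration. The sketch
assumes the sample standard deviation (n - 1) is the one wanted in the
Pearson skewness measure, which the definition above leaves open.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical sample, n = 8

# Central tendency (distribution position)
mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)
geometric_mean = statistics.geometric_mean(values)

# Dispersion
value_range = max(values) - min(values)
population_variance = statistics.pvariance(values)  # divides by n
sample_variance = statistics.variance(values)       # divides by n - 1
sample_sd = statistics.stdev(values)                # square root of sample variance

# Symmetry: Pearson measure = (mean - mode) / standard deviation
pearson_skewness = (mean - mode) / sample_sd

print("mean", mean, "median", median, "mode", mode)
print("geometric mean", round(geometric_mean, 3))
print("range", value_range)
print("variance (n):", population_variance, " variance (n - 1):", round(sample_variance, 3))
print("Pearson skewness", round(pearson_skewness, 3))
```

Here the mean lies above the mode, so the skewness measure is positive
and the tail of the distribution is on the high side. Excel's SKEW
function uses a different, moment-based formula, so do not expect it to
return the same number.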
Important distinction - sample vs. population
descriptors: It is easy as you
get buried in all the numbers and calculations to forget this
distinction. Obviously you want the sampling plan to be unbiased
so that your samples reflect well on the population they were
taken from. As your sample size gets larger your sample statistics
should get closer and closer to the true population statistics,
but there will generally still be a difference between the universal
population mean and the sample mean. There are statistical methods
for measuring how well your samples may reflect the universal population, but
we won't have time to cover them in this course. A statistics
course will teach you that (hopefully).
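One way to get a feel for this is to simulate it. The Python sketch
below builds a made-up "universal population" of 10,000 numbers (drawn
from a normal distribution with an assumed mean of 50 and standard
deviation of 10), then compares the means of random samples of
increasing size with the true mean. All of the numbers and names are
assumptions for illustration only.

```python
import random
import statistics

rng = random.Random(42)   # fixed seed so the run is repeatable

# Hypothetical universal population (ASSUMED mean 50, standard deviation 10).
universal = [rng.gauss(50.0, 10.0) for _ in range(10000)]
true_mean = statistics.fmean(universal)

def sample_mean_error(n):
    """Absolute difference between the mean of a random sample of n and the true mean."""
    sample = rng.sample(universal, n)   # sampling without replacement
    return abs(statistics.fmean(sample) - true_mean)

errors = {n: sample_mean_error(n) for n in (10, 100, 1000, 10000)}
for n, err in errors.items():
    print(f"n = {n:>5}: sample mean differs from true mean by {err:.3f}")
```

The difference usually shrinks as n grows, but not on every draw, and it
only disappears when the sample is the whole population.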
Excel statistical functions: Below is a list of functions you may use this week
in describing distribution characteristics. You can find
a list of Excel functions under Insert > Function.
You should familiarize yourself with the array of functions available
to you. If you want to enter a formula into a cell start with
the equal sign. Then you can build the equation afterwards using
standard mathematical operators and by inserting functions.
- AVERAGE, FREQUENCY,
MEDIAN, MODE, RANK, SKEW,
SORT, STDEV, VAR
- The function arguments (the numbers it acts
on) are placed within parentheses after the function name. For statistical
functions an array of numbers is input into the function, usually
using cell references for the beginning and end of the array.
For example "=AVERAGE(A1:A36)" in a cell will
calculate the average of the numbers in the array of cells from
A1 through A36.
- note that Excel describes what the function
does when you insert it into the formula bar. Help will also
give you a more complete description. The frequency function
can be a bit tricky to use.
- For more sophisticated statistical analysis
many people use a software program known as SPSS.
Simple tests comparing populations.
A basic question is often how two populations
compare. Is the difference significant, or does it likely result
from the natural variation within the populations? More specifically,
we can ask questions such as: how does the sample population
compare to a model distribution, or how does a sample mean compare to a
model mean? What is the probability that they are the same or
different?
The framing of the question is important.
There is a distinct tradition of stating
the null and alternate hypotheses.
- The null hypothesis is that there is no difference
between the two populations. It is often termed H0.
- The alternate hypothesis is that there is
a difference. It is often referred to as H1.
Chi-squared test:
- This is a simple and commonly used test comparing
two comparable (same bin width and positions) histograms from
two populations. You can test an observed distribution against
a theoretical model, against a specified mathematical distribution,
or against another observed population (in some cases). The null
hypothesis is that the observed and the expected were taken from
the same population and the differences could be due to random
factors. The alternate hypothesis is that the two populations
are different.
- The Chi-squared statistic is derived by taking
the difference between the frequencies of the two populations for a bin, squaring that,
and dividing that by the frequency of the second (the expected frequency). You might wonder
why the difference is often squared in statistics. This takes
care of the problem of positive versus negative differences canceling
each other out. These numbers are then summed over all the bins to yield the Chi-squared
value. Obviously, the higher the Chi-squared value the more
different the two distributions are and the less likely your
null hypothesis is true. This is also of course a function
of how many bins - the more bins the higher the value. On the
other ludicrous end, for one bin the populations look exactly
the same. One does have to take into account the number of bins
in evaluating the meaning of your Chi-squared statistic. The
number of bins controls your degrees of freedom (which can also
be affected by other things). The observed Chi-squared value is
compared against a table that gives, for the degrees of freedom,
the likelihood you would randomly see such a difference given
the null hypothesis (see below for degrees of freedom).
- The table of critical Chi-squared values
for a given probability under the null hypothesis and given
degrees of freedom (DF) can usually be found in the back of a
statistics book. DF = the number of classes (bins) - 1 - the
number of estimated parameters (usually involved in computing
the frequency of the expected classes). If it is a simple measurement
this last component is usually 0. The conventional probability
cut off used is .05, or a 5% chance. If your Chi-squared statistic
is lower than the appropriate table value, you cannot reject
the null hypothesis. The two populations could be the same. If
it is higher you reject it and can state that, if the null
hypothesis were true, a difference this large would arise by
chance less than 5% of the time, i.e. you are
fairly certain they are different. Remember that they
could actually come from the same population, but a sampling
bias is causing the difference! Or it could reflect a real-world
difference. Or it could reflect bad luck, in that 1 out of 20
times the populations are not different even though you accept the
alternate hypothesis. The test doesn't tell you why they are
different, or even that they are different, only that it is very
probable they are different.
- This is a robust and not a sensitive test.
In other words it is fairly easy to conclude they are the
same when they are not. The emphasis is on establishing
that they are different.
- Some caveats:
- there should be at least 5 observations
in each bin or class.
- observations are independent.
- the results are susceptible to the number
of classes or bins. The larger the number the more likely you
will reject the null hypothesis. One class and of course they
are the same.
- the sample sizes need to be the same. Otherwise
the difference in frequencies will simply reflect the different sample
sizes. If the sample size is not the same you can normalize one
against the other, but this should be done carefully and clearly
reflected in the statement of your null and alternate hypotheses.
- The Chi-Square test function in Excel
does not return the Chi-square statistic, but instead the probability
(p value) associated with it.
They (those who designed/wrote the software) want to save you
the trouble of looking up values in the Chi-squared table. This
can be confusing if you are going by the protocol described for
Chi-Square tests in many statistics text books. Instead, it indicates
what the probability is that you will be wrong if you accept
the alternate hypothesis (that they are different), by returning
a value between 0 and 1. For example, if you get an Excel value
of .71, you have a 71% chance of being wrong if you reject the
null hypothesis and assume the two are different. It also assumes
the degrees of freedom are one less than the number of bins.
- If you need to answer some of these types
of questions, and you haven't acquired more expertise than in
this course this is an excellent time to find someone with the
right expertise.
- My experience suggests that generally people
will see differences that are not statistically significant,
but won't see similarity where there is a real difference. In
other words we are biased to seeing differences!
- Below is an example of a simple Chi-square
test done in Excel. In addition to illustrating the technique
it shows the importance of sample size. Imagine you have a granite
and you are wondering if it is a minimum eutectic melt (first
stuff that comes off from partially melting sediment). If it
were eutectic, quartz, K-feldspar and plagioclase should be 40%,
30%, and 30% in relative proportion. We could test whether a
given sample is significantly different from the composition
predicted by this model. Imagine two cases: one where only 20
measurements were taken, and another where 200 were, but where
the relative proportions were the same and reflected in the measurements
(an unlikely scenario by the way). The results are shown below.
| | observed, n = 20 | observed, n = 200 | expected for n = 20 | expected for n = 200 |
| quartz | 10 | 100 | 8 | 80 |
| K-spar | 6 | 60 | 6 | 60 |
| plag | 4 | 40 | 6 | 60 |
| | | | Chi-square p value for B (n = 20) | Chi-square p value for C (n = 200) |
| | | | 0.558035146 | 0.0029283 |
- Note that for the first case, we would not
reject the null hypothesis. The returned p value indicates chances
are almost half-half. Therefore we cannot conclude that there
is a difference. One way to think of this is that there is
not enough information to detect a difference. Note also that
having fewer than 5 observations in one of our bins
is problematic, and actually, while Excel provides a number, it
should not be reported. In case 2 we would be very safe in concluding
they are different (less than a 1% chance of being wrong). Even
if one were to obtain the true proportions with a few measurements
and the two distributions were different, a low number of observations
won't allow you to identify that they are different.
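The granite example can be checked outside Excel. The Python sketch
below sums (observed - expected)^2 / expected over the bins to get the
Chi-squared statistic, then converts it to a p value using DF = number
of bins - 1. The function names are illustrative and are not Excel or
library functions. The p value routine uses a standard closed-form
recurrence for the Chi-squared distribution and assumes a whole-number
DF of 1 or more.

```python
import math

def chi_squared_statistic(observed, expected):
    """Sum over bins of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_squared_p_value(statistic, df):
    """Chance of a statistic at least this large if the null hypothesis holds.

    Builds the upper tail with Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / gamma(a + 1),
    starting from a = 1 for even df or a = 1/2 for odd df, until a = df / 2.
    """
    y = statistic / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

# Observed and expected counts (quartz, K-spar, plag) from the table above.
cases = {
    20: ([10, 6, 4], [8, 6, 6]),
    200: ([100, 60, 40], [80, 60, 60]),
}

results = {}
for n, (observed, expected) in cases.items():
    statistic = chi_squared_statistic(observed, expected)
    df = len(observed) - 1            # no parameters estimated from the data
    p_value = chi_squared_p_value(statistic, df)
    results[n] = (statistic, p_value)
    print(f"n = {n}: Chi-squared = {statistic:.4f}, DF = {df}, p = {p_value:.7f}")
```

The p values agree with the Excel values in the table. The Chi-squared
statistic itself is ten times larger for n = 200 even though the
proportions are identical, which is the sample size effect described
above.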
Examples and week 2 exercise.
Example of looking
at SiO2 contents of two types of oceanic crustal material, and
exploring the effect of increasing n (sample size) on the resulting
sample average as compared to the true average.
Heat flow data
from the Indian continent. We will
run through an example of how to use Excel to compute basic population
descriptors and plot a frequency distribution curve, and then
compare heat flow from two different geologic provinces.
Exercise 2.
Other sources of information:
Davis, J. C., 1986, Statistics and Data Analysis
in Geology, Wiley, 646 p. This is a detailed exposition with
an initial classic approach in probability theory.
Marsal, D., 1979, Statistics for Geoscientists;
Pergamon Press, 176 p. This has an excellent description of basic
statistics for the uninitiated, with geologic examples. Some of
the later chapters are a bit brief and incomplete.
Revised 7/18/06
Copyright by Harmon D. Maher Jr. This material
may be used for non-profit educational purposes if proper attribution
is given. Otherwise please contact Harmon D. Maher Jr.