**Week 3 index:**

- Scatter plots.
- Linear regression and correlation.
- Independent vs. dependent variables and best-fit lines.
- In-class exercise: initial exploration of a bivariate relationship.
- The correlation coefficient.
- Regression in Excel.
- Examples of linear regression.
- Exercise 3.

*Scatter plot of lightning strike density in Pennsylvania versus elevation with a regression line, with an emphasis on the scatter. What can you tell from the plot? What does the R² value indicate? A plot with the same axes might look quite different for the Rocky Mountains, a consideration when hiking high in the afternoon. Image source: Digital Mapping Techniques '03, Workshop Proceedings, U.S. Geological Survey Open-File Report 03-471, A Map of Lightning Strike Density for Southeastern Pennsylvania, and Correlation with Terrain Elevation, by Alex J. DeCaria and Michael J. Babij - http://pubs.usgs.gov/of/2003/of03-471/decaria/index.html*

**Reading for this week:** Chapter 3 in Swan and Sandilands, Statistics with Two Variables,
p. 148 and on. This is available as a pdf document on Blackboard.

**Other sources of information:**

- Chapters 6 and 7 in Pagano, P. R., 1994, Understanding Statistics in the Behavioral Sciences (4th ed.), West Publishers, Minneapolis, 496 p. I have found this book very useful. The author isn't afraid to, or too hurried to, use words to provide a conceptual framework for the math, which is a nice switch from other textbooks. Read through this focusing on the concepts and diagrams.
- Chapter 7, Curve Fitting, in Lingme, B. V., 1997, A guide to Microsoft Excel for Scientists and Engineers, John Wiley and Sons, p. 74-83. This section goes through the mechanics of regressions in Excel. You can read this if the description below is not enough.
- Hall-Wallace, M. K., 2000, Using Linear Regression to Determine Plate Motions; Journal of Geoscience Education, v. 48, p. 455

Last week we looked at how to describe and
analyze one variable. How about two variables that might be related in some way (known as **bivariate analysis**)? __The first step
to investigating what the relationship might be between two paired
variables is to create a scatter plot__, a simple plot of x
versus y for the paired variables. What makes an x and y value
paired? They could be data collected at the same time and place,
or they could be taken from the same specimen, or they could be
the previous versus subsequent values in one population. Something
unites them, and then questions arise. Is one dependent on the
other? Is there some type of mathematical relationship that can
be established between the two? What is the strength of that relationship? Initially, one can assess the scatter plot
qualitatively, and that helps with subsequent more quantitative
analysis. Is there a well defined relationship where the plotted
points seem to fall along a line or a curve, or is there a poorly
defined or absent relationship that looks like a shotgun blast?
Are there clusters? If there is a good linear pattern, is there
a negative or positive slope? Is there a fractal pattern (more
on that later)? There are good examples of different types of
scatter plot patterns in your reading.

*Above is an example of a simple scatter plot for measurements taken in San Francisco Bay on two cruises, August 20 (red) versus February 23 (blue), in 2010 and from 1-3 m water depth. What can you conclude from this simple scatter plot? A good starting place is to consider what determines the dissolved oxygen content of surface waters. This graph was constructed at the USGS site: http://sfbay.wr.usgs.gov/cgi-bin/sfbay/dispsys/plotdata4.pl (an interesting array of scatter plots can be constructed giving insight into water quality in the Bay).*

*Scatter plot of dissolved oxygen versus temperature for the Beaufort River in South Carolina. Why is the relationship between the two so much tighter here than in the plot above? Image source USGS site: http://sofia.usgs.gov/projects/workplans06/hydro_mon.html*

There is a next step in sophistication of analysis
where one looks at the relationships between many variables at
once (**multivariate analysis**). Such analysis is beyond the scope of this particular course, but is often learned in a statistics course.

Why would you want to regress? Regression, as often practiced in earth sciences, is the attempt to establish a mathematical relationship between two variables. Such a relationship can be used to extrapolate beyond the range of data/observation, or interpolate between data points, basically to predict one variable given the other. For example, a relationship exists between the frequency of occurrence of a given size flood or earthquake, and the size of the event. Given flood data, and assuming constancy of system operation, one can predict how big a flood of a certain frequency will be, i.e. how big the 100 year flood will be. A **linear relationship** between two variables is captured by the formula **y = b + mx**, where **b is the y intercept** and **m is the slope**. __It is significant which variable is y and which is x, as is explained below__. The default convention is that x is the independent variable and y is the dependent variable.

**Correlation** measures the dependability of the
relationship (the goodness of fit of the data to the mathematical relationship). It is
a measure of how well one variable can predict the other (given
the context of the data), and determines the precision you can
assign to a relationship.

Regression or correlation can be **bivariate**
(between 2 variables, x and y) or **multivariate**, between
greater than two variables. __Regression is interested in the
form of the relationship, whereas correlation is more focused
simply on the strength of a relationship__. In this class we will only deal
with classical **bivariate linear regression**, because it is commonly used and the simplest situation. Just to stress the point through repetition, in all cases the independent
variable(s) vs. the dependent variable(s) needs to be clearly defined.

*This example shows how linear regression is used to calibrate an instrument. In this particular case you have the travel time obtained with a Ground Penetrating Radar (GPR) traverse across river channels as a function of the water depth as measured using weights. Given travel time you would like to be able to predict water depth, and this determines which one is x and which one is y. Thus this best-fit line is the calibration line, which is used to turn a travel time into a water depth. What is the relationship between the amount of scatter and the error you would assign a depth determined by GPR? r² quantifies this, and is described more below. Note that the intercept of the line is not zero. Should it be? Image from USGS site: http://sfbay.wr.usgs.gov/cgi-bin/sfbay/dispsys/plotdata4.pl .*

A key sin in statistics is to confuse the independent and dependent
variables, and much shaking of heads and wagging of tongues is directed
toward those committing this particular sin. We will see that
in many situations it is clear which is which, but in some cases
it is not. The important thing to realize is that which of your
two variables you cast into either role does make a difference
in the results. The convention (that is built into statistical
software packages) is that __x is your independent variable__
and __y the dependent variable__. But what is the difference?

In a controlled laboratory situation, the __independent
variable is the one the experimenter controls__,
and the __dependent is the variable of interest
that is measured for different values of the independent variable__.
If you are looking at the solubility as a function of temperature,
temperature would be the independent variable and mass per unit
volume of the material dissolved (the solubility) would be the
dependent variable. Often there is a stated or unstated assumption
that the dependent variable is controlled by the independent one;
i.e. that there is a causal (not casual) relation between the
two. So a change in temperature causes solubility to change. A change
in solubility is not thought to cause a change in temperature.
Note that you could theoretically use solubility to
measure temperature, and so the role of independent vs. dependent
could be switched, but what an inefficient way to measure temperature!
__What is the dependent vs. what is the independent variable
is sensitive to the context__.

Those used to working in the luxury of a controlled
laboratory situation may be forgiven for thinking that there is
always a clear cut distinction between dependent vs. independent
variables. The real world is more complex. For example, one can
investigate whether there is a statistical relationship between
Si and K content in a suite of volcanic rocks. Which one is the
dependent and which one is the independent variable? Typically
Si is taken as the x coordinate, but why? Or perhaps one is looking
at the relationship between two different contaminants in an aquifer.
This is not a controlled situation. Again, context is the guide.
Very often it is helpful to __assign the variable you want to
predict, or know less about, as the dependent variable__. For
example, if one contaminant was a derivative of the other and
an inverse relationship could be expected in any water sample,
then the derivative compound would be the dependent variable.
If one contaminant was much easier to measure and, on the basis
of a relationship between the two, was to be used as a proxy
indicator of the other, then the proxy would be the independent
variable. __The most important thing is to have thought this
out before proceeding with the analysis or presenting your results__.
**Establishing a mathematical relationship does not mean you
have established a causal relationship.**

The results of a linear regression are often
termed the **best-fit line**. What does this mean? If you imagine
a regression line (the plot of a linear equation) and the scatter
plot of points that produced it, then imagine the vertical lines
(y distance) between each point and the regression line, you have
one image of goodness of fit. The smaller those distances the better
the fit. Combine those into an aggregate length (in ordinary least
squares, the standard method, the distances are squared before being
summed). This aggregate is a measure of how vertically close the y
values are to the regression line. In a perfect fit there would be
no difference, with the points plotting right on the line, and the
aggregate would be 0. Different regression lines for the same data
produce different aggregates. The statistical routine in Excel and other
statistical packages computes the line of minimum deviation, of
minimum aggregate length, the one that the points are, in aggregate,
closest to.
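
The idea of a best-fit line as the one with the smallest aggregate misfit can be sketched in a few lines of Python. This is only an illustration, not the routine Excel runs internally: the function names `fit_line` and `sum_squared_residuals` and the five data pairs are invented for the example, and the misfit is assumed to be the sum of squared vertical distances (ordinary least squares).

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = b + m*x; returns (b, m)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return b, m


def sum_squared_residuals(xs, ys, b, m):
    """Aggregate misfit: sum of squared vertical distances to the line."""
    return sum((y - (b + m * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical paired measurements, made up for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b, m = fit_line(xs, ys)
best = sum_squared_residuals(xs, ys, b, m)

# Any other line (here a slightly steeper one and a shifted one) fits worse.
steeper = sum_squared_residuals(xs, ys, b, m + 0.1)
shifted = sum_squared_residuals(xs, ys, b + 0.2, m)

print(f"best-fit line: y = {b:.3f} + {m:.3f} x")
print(f"misfit: best={best:.4f}, steeper={steeper:.4f}, shifted={shifted:.4f}")
```

Nudging the slope or intercept away from the fitted values always increases the aggregate misfit, which is what "best fit" means here.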

Which variable is assigned to x and which to y does make a difference. As an experiment you can put in some real data and run the linear regression both ways (interchange x and y as independent and dependent variables). Why the difference? Simply because in one case you are minimizing the variation for y (the conventional case), and in the other you are minimizing the variation for x.
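
The experiment described above can be tried without Excel. The following is a minimal sketch with invented data; `fit_line` is a hypothetical helper that performs an ordinary least-squares fit, and the second fit simply swaps the roles of the two columns.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = b + m*xs; returns (b, m)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    return mean_y - m * mean_x, m


# Hypothetical paired data, made up for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 1.5, 3.5, 3.0, 5.5, 4.5]

# Conventional case: y regressed on x (vertical misfit minimized).
b_yx, m_yx = fit_line(xs, ys)

# Roles swapped: x regressed on y (horizontal misfit minimized),
# giving x = b_xy + m_xy * y.
b_xy, m_xy = fit_line(ys, xs)

# Re-express the swapped line in the same y-versus-x axes for comparison.
slope_swapped = 1.0 / m_xy
intercept_swapped = -b_xy / m_xy

print(f"y on x: y = {b_yx:.3f} + {m_yx:.3f} x")
print(f"x on y: y = {intercept_swapped:.3f} + {slope_swapped:.3f} x")
print(f"product of the two fitted slopes: {m_yx * m_xy:.3f}")
```

The two lines coincide only when the points fall exactly on a line. The product of the two fitted slopes equals r squared, so the more scatter there is, the more the two lines diverge.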

Natural system feedback loops provide an interesting
difficulty because there is not a one-way causal relationship.
Instead, both **variables are interdependent**. Snow cover and local
air temperature could be one example. Of course, the air temperature
determines how much snow melts or doesn't, but also the albedo
of the snow affects local air temperature. Try to think of others.
What to do in this situation? There is a type of linear regression
that instead of seeking to minimize error of line fit only in
the y variable, minimizes the error in both x and y. This is described
in your Swan and Sandilands reference, and is referred to as **structural
regression**. As you might guess it is more difficult, and we
won't treat it here. However, you should look into it when the
occasion arises. In any case, in your work you should clearly
state which is your independent and which is your dependent variable (has this been stressed enough?).
The reader can agree or disagree with your call and take it from
there.

This will be a group project. Get into groups of 3.

- Think of some potential relationship between two earth science variables that could be interesting to explore. It could be, for example, between average annual rainfall and sedimentation rate in a lake.
- For that relationship address the following questions:
  - Why would you expect a relationship, i.e. what is your causal model?
  - Which one should be the independent and which one should be the dependent variable?
  - Where or how might you obtain the needed data?
  - What type of relationship might you expect? In other words, what might be the shape of the scatter plot of the two?

The more specific you get the better. Report to the class.

The regression equation can be thought of as
a mathematical model for a relationship between the two variables.
The natural question is how good is the model, how good is the
fit. That is where **r** comes in, the **correlation coefficient** (technically Pearson's correlation coefficient for linear regression).
This basically quantifies how well the positions of paired x and y values within their own distributions match each other. If there is a perfect fit, and x explains all the variation in y, then each y value sits at the same relative position in the y distribution as its x value does in the x distribution. In other words, if an x value is 0.65 standard deviation units from the mean, then its y pair should also be 0.65 standard deviation units from the mean if there is a perfect fit. The basic measure is how much of the variation in y can be explained by the variation in x.

__Trying to understand__
**r**: One could use the aggregate distance of deviation from
the best fit line as described above as a crude measure of goodness
of fit. However, the outcome would be a function of the specific
scale and units in each individual case, and it would be hard
to know how to interpret the resulting number. What is done instead is
that each x and y value is scaled against their own distribution,
i.e. they are standardized against the distribution of each variable.
This is done by computing **z score values** for each value
for both x and y, where z is obtained by subtracting the mean
from each value and dividing by the distribution's standard deviation.
**Basically z values measure the distance from the mean in units
of standard deviations**. __In a perfect fit a y value's position
in its distribution will be the same as its corresponding
x value's position in its distribution - a one to one correspondence__.
__If one plots the z scores for x and y pairs against each other
this one to one fit yields a slope of 1.__ In a good fit the
pairs will be close, but not perfect. For low z values of x, where
the value is close to or at the mean, the corresponding normalized
z value of y likely will be greater because of
the deviation in y that is not explained by x. In turn, at the
highest z value of x, where x is at its extreme range in the distribution,
the z value of the corresponding y is likely to be closer to the y mean, and hence will tend to plot below a
one to one position. The result is that the line then has a slope
less than one. If there is no relationship then as the z value
for x increases, the z value for corresponding y can be higher
or lower, resulting in slope of 0. If the position of one variable
in its own distribution bears no relationship to that of its companion
variable, then a slope of 0 results. **r is the slope of the
z score values of the two variables plotted against each other**.
A value of 1 means a perfect fit (-1 for a perfect inverse relationship), and a value of 0 means no linear relationship
exists between the two variables.
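
The statement that r is the slope of the z scores plotted against each other can be checked numerically. The sketch below uses invented data and hypothetical helper names (`z_scores`, `least_squares_slope`, `pearson_r`); it assumes the sample standard deviation (n - 1 in the denominator) when standardizing.

```python
import math


def z_scores(values):
    """Standardize: subtract the mean, divide by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line of ys on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def pearson_r(xs, ys):
    """Pearson's correlation coefficient from sums of squares and products."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired data, made up for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 1.5, 3.5, 3.0, 5.5, 4.5]

slope_of_z = least_squares_slope(z_scores(xs), z_scores(ys))
r = pearson_r(xs, ys)

# Points lying exactly on a line give r = 1 (or -1 for a negative slope).
r_perfect = pearson_r(xs, [3.0 + 2.0 * x for x in xs])
r_inverse = pearson_r(xs, [3.0 - 2.0 * x for x in xs])

print(f"slope of z scores = {slope_of_z:.4f}, r = {r:.4f}")
print(f"perfect fit r = {r_perfect:.4f}, perfect inverse r = {r_inverse:.4f}")
```

The slope fitted to the standardized values and the directly computed r agree, whatever units the original x and y were measured in.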

**What it boils down to**:

**SEE = Standard error of estimate**: If you use x to predict y given the relationship,
you might want to know what the chances are that the real y value is a certain amount away. One commonly
sees error reported in association with U-Pb radiometric dates.
In part this is where these errors come from, the fitting of a
line to isotopic ratio data (known as a chord). There are some implicit assumptions you will need to investigate
if the error becomes important to you (i.e. you need to learn
a bit more than is presented here). The plus or minus a certain
number of million years error is usually reported at
two SEEs, which means that about 95% of the values obtained if this
sample were dated over and over again would lie within the noted
error range. This is the conventional error reported. For example, an age of 346 ± 2 Ma can be read as 95% confidence that the age is between 344 and 348 Ma.

**Correlation vs. cause**.
We have mentioned this before, but it bears repeating. Imagine
a situation where one variable controls or influences two other
variables that are independent of each other. For example, temperature
controlling algal reproduction and the amount of calcium dissolved
in the water of a pond (although the amount of calcium could influence
algal reproduction also, and thus they wouldn't be truly independent).
A regression of algal concentration versus dissolved calcium could
develop a statistically significant relationship, and even a useful
one. However, it would be wrong to conclude that dissolved calcium
content was necessarily controlling the biota. This is when you
get into multivariate analysis, taking into account the multiple
factors that are likely to be related.
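
The pond example can be mimicked with made-up numbers. In this sketch every value is invented: temperature drives both algae and calcium, and each also carries its own small scatter. The helpers `pearson_r` and `residuals` are hypothetical names. Removing the part of each variable explained by temperature is one simple way to glimpse what a multivariate treatment would reveal; it is not a substitute for one.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def residuals(xs, ys):
    """What is left of ys after removing a least-squares line fitted on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return [y - (b + m * x) for x, y in zip(xs, ys)]


# Invented values: temperature controls both variables; the two scatter
# patterns are unrelated to each other.
temperature = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0]
scatter_a = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
scatter_c = [1.0, 1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0]
algae = [5.0 * t + 3.0 * e for t, e in zip(temperature, scatter_a)]
calcium = [2.0 * t + 2.0 * e for t, e in zip(temperature, scatter_c)]

# Strong correlation, even though neither variable influences the other.
r_raw = pearson_r(algae, calcium)

# Correlation of what remains once the temperature effect is removed.
r_after = pearson_r(residuals(temperature, algae),
                    residuals(temperature, calcium))

print(f"algae vs calcium: r = {r_raw:.3f}")
print(f"after removing temperature: r = {r_after:.3f}")
```

The raw correlation is high, yet it nearly vanishes once the shared control is accounted for.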

__Problems with closed correlations occur when working with percent data.__ This is especially a problem with geochemical or point count data,
which are very common in geology. Since the sum of the components
must equal 100%, when plotting one versus the other there must
be an inverse relationship, because as one increases in percentage,
the other must decrease. This is most easily seen in thinking
of a two component system, where there must be a perfect fit with
an r = -1. Three or more components will still provide correlations
that have nothing to do with true association, or causation. The
problem is less severe with trace elements, but can still exist.
There are alternate possible relationships in looking for real
correlations, but we don't have time to go into those here. Swan
and Sandilands do discuss some of the alternate possibilities.
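
The closure effect is easy to reproduce with invented numbers. In this sketch the component names, the raw amounts, and the helpers `pearson_r` and `close_to_percent` are all hypothetical. The raw amounts of the first two components are constructed to be uncorrelated, and the only thing that changes is the conversion to percentages.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def close_to_percent(samples):
    """Rescale each sample so that its components sum to 100."""
    return [tuple(100.0 * v / sum(s) for v in s) for s in samples]


# Two-component system: the second component is whatever the first is not.
a_pct = [10.0, 25.0, 40.0, 55.0, 70.0]
b_pct = [100.0 - a for a in a_pct]
r_two = pearson_r(a_pct, b_pct)

# Three-component system with invented raw amounts (quartz, feldspar, other).
raw = [(10.0, 10.0, 20.0), (20.0, 10.0, 20.0), (30.0, 10.0, 20.0), (40.0, 10.0, 20.0),
       (10.0, 30.0, 20.0), (20.0, 30.0, 20.0), (30.0, 30.0, 20.0), (40.0, 30.0, 20.0)]
r_raw = pearson_r([s[0] for s in raw], [s[1] for s in raw])

closed = close_to_percent(raw)
r_closed = pearson_r([s[0] for s in closed], [s[1] for s in closed])

print(f"two components: r = {r_two:.3f}")
print(f"three components, raw amounts: r = {r_raw:.3f}")
print(f"three components, as percentages: r = {r_closed:.3f}")
```

The negative correlation in the percentage data is produced entirely by the constant-sum constraint, not by any association between the components.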

15-minute YouTube video on regression and the correlation coefficient.

As usual there are several ways to do a regression, depending in part on how much information you want.

- Simple regression in Excel, quick and dirty:
  - Insert your data, with labels, into your spreadsheet with x and y in two separate columns.
  - Create a scatter plot, making sure you identify the x and y variables appropriately (see above).
  - For older versions of Excel, with the chart selected, select **TRENDLINE** under the **CHART** heading. In the **Options** page make sure you choose to have Excel return the equation in addition to the r-squared value.
  - For Excel 2007, use the upper **Insert** tab and choose the insert scatter plot icon from the array of chart possibilities. Once the chart is created, you can migrate to the upper **Layout** tab and then from the Analysis section of choices you can select the **Trendline** option and follow the instructions. You will need to have the chart selected in order to access these options. Note that if you want the r-squared value to be returned you need to check that option when creating your trendline. The two tabs that allow you to work with your plots/charts are the **Layout** and **Design** tabs, and are worth exploring further.
  - YouTube video on constructing scatter plots and trendline.

- For piecemeal information the following functions can be used:
  - **LINEST** (returns the slope and intercept values).
  - **SLOPE**.
  - **INTERCEPT**.
  - **STEYX** (returns the Standard Error for y on x).
  - **CORREL** (returns bivariate correlation coefficient).

- For complete statistics and detailed analysis use the **REGRESSION** option under **Tools** and **Data Analysis** (the same place you found the histogram making option).

- Example of coral clast age vs. elevation for Lanai, Hawaii.
- Example of flood frequency analysis for Elkhorn data using Excel.
- Example of sediment yield rate vs. record length rate computed from.
- Example of porosity vs. unconfined compressive strength of building sandstones.

A similar approach can be taken to estimate recurrence intervals of earthquakes. You will explore this in your exercise.

**A look ahead.**

Is there any reason you can't move this type of modeling into looking at three variables and into three dimensions? Instead of a line or a curve, you could envision a surface. The simplest surface would be a planar one. The answer is that you can, and a later exercise in this course takes a look at surface modeling.

*An example of a 3-D scatter plot and the resulting modeled surface for Everglades water data. Image from USGS site: http://sofia.usgs.gov/projects/workplans06/hydro_mon.html*

Copyright by Harmon D. Maher Jr. This material may be used for non-profit educational purposes if proper attribution is given. Otherwise please contact Harmon D. Maher Jr.