Week 3: Linear regression - bivariate analysis


Scatter plot of lightning strike density in Pennsylvania versus elevation with a regression line, with an emphasis on the scatter. What can you tell from the plot? What does the R2 value indicate? A plot with the same axes might look quite different for the Rocky Mountains, a consideration come afternoon time when hiking high. Image source: Digital Mapping Techniques '03 — Workshop Proceedings U.S. Geological Survey Open-File Report 03–471 A Map of Lightning Strike Density for Southeastern Pennsylvania, and Correlation with Terrain Elevation By Alex J. DeCaria and Michael J. Babij - http://pubs.usgs.gov/of/2003/of03-471/decaria/index.html

Reading for this week: Chapter 3 in Swan and Sandilands, Statistics with Two Variables, p. 148 and on. This is available as a PDF document on Blackboard.

Other sources of information:

Scatter plots

Last week we looked at how to describe and analyze one variable. How about two variables that might be related in some way (known as bivariate analysis)? The first step to investigating what the relationship might be between two paired variables is to create a scatter plot, a simple plot of x versus y for the paired variables. What makes an x and y value paired? They could be data collected at the same time and place, or they could be taken from the same specimen, or they could be the previous versus subsequent values in one population. Something unites them, and then questions arise. Is one dependent on the other? Is there some type of mathematical relationship that can be established between the two? What is the strength of that relationship? Initially, one can assess the scatter plot qualitatively, and that helps with subsequent more quantitative analysis. Is there a well defined relationship where the plotted points seem to fall along a line or a curve, or is there a poorly defined or absent relationship that looks like a shotgun blast? Are there clusters? If there is a good linear pattern, is there a negative or positive slope? Is there a fractal pattern (more on that later)? There are good examples of different types of scatter plot patterns in your reading.

Above is an example of a simple scatter plot for measurements taken in San Francisco Bay at 1-3 m water depth during two cruises in 2010, one on February 23 (blue) and one on August 20 (red). What can you conclude from this simple scatter plot? A good starting place is to consider what determines the dissolved oxygen content of surface waters. This graph was constructed at the USGS site: http://sfbay.wr.usgs.gov/cgi-bin/sfbay/dispsys/plotdata4.pl (an interesting array of scatter plots can be constructed there, giving insight into water quality in the Bay).

Scatter plot of dissolved oxygen versus temperature for the Beaufort River in South Carolina. Why is the relationship between the two so much tighter here than in the plot above? Image source USGS site: http://sofia.usgs.gov/projects/workplans06/hydro_mon.html


There is a next step in sophistication of analysis where one looks at the relationships between many variables at once (multivariate analysis). Such analysis is beyond the scope of this particular course, but is often learned in a statistics course.

Linear regression and correlation

Why would you want to regress? Regression, as often practiced in earth sciences, is the attempt to establish a mathematical relationship between two variables. Such a relationship can be used to extrapolate beyond the range of data/observation, or to interpolate between data points; basically, to predict one variable given the other. For example, a relationship exists between the frequency of occurrence of a given size flood or earthquake and the size of the event. Given flood data, and assuming the system continues to operate as it has, one can predict how big a flood of a certain frequency will be, i.e. how big the 100 year flood will be. A linear relationship between two variables is captured by the formula y = b + m x, where b is the y intercept and m is the slope. It is significant which variable is y and which is x, as is explained below. The default convention is that x represents the independent variable, y represents the dependent variable, and predictions of y are made for a given x value. We won't explicitly deal with curvilinear regression, although the general approach is similar. A non-linear relationship can often be turned into a linear one by a mathematical transformation (very commonly a log transformation).
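As a rough illustration of fitting y = b + m x and then using the line to interpolate and extrapolate, here is a minimal Python sketch. The data values are made up for illustration, and the use of NumPy's polyfit (rather than Excel, which the course uses) is an assumption of this sketch.

```python
import numpy as np

# Hypothetical paired data, invented for illustration only.
# x is the independent variable, y the dependent variable.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# A degree-1 polynomial fit returns the least-squares slope m and intercept b.
m, b = np.polyfit(x, y, 1)


def predict(x_new):
    """Predict y for a given x using the fitted line y = b + m x."""
    return b + m * x_new


print(f"slope m = {m:.3f}, intercept b = {b:.3f}")
print(f"interpolated y at x = 2.5: {predict(2.5):.3f}")
# Extrapolation is only as trustworthy as the assumption that the
# relationship keeps holding outside the observed range of x.
print(f"extrapolated y at x = 8.0: {predict(8.0):.3f}")
```

Interpolation stays inside the observed range of x (1 to 5 here); extrapolation to x = 8 leans entirely on the assumption that the system keeps behaving the same way.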

Correlation measures the dependability of the relationship (the goodness of fit of the data to the mathematical relationship). It is a measure of how well one variable can predict the other (given the context of the data), and determines the precision you can assign to a relationship.

Regression or correlation can be bivariate (between 2 variables, x and y) or multivariate (among more than two variables). Regression is interested in the form of the relationship, whereas correlation is more focused simply on the strength of a relationship. In this class we will only deal with classical bivariate linear regression, because it is commonly used and is the simplest situation. Just to stress the point through repetition, in all cases the independent variable(s) vs. the dependent variable(s) need to be clearly defined.

This example shows how linear regression is used to calibrate an instrument. In this particular case you have the travel time obtained with a Ground Penetrating Radar (GPR) traverse across river channels as a function of the water depth as measured using weights. Given travel time, you would like to be able to predict water depth, and this determines which one is x and which one is y. Thus this best-fit line is the calibration line, which is what is used to turn a travel time into a water depth. What is the relationship between the amount of scatter and the error you would assign a depth determined by GPR? r-squared describes this, and is discussed more below. Note that the intercept of the line is not zero. Should it be? Image from USGS site: http://sfbay.wr.usgs.gov/cgi-bin/sfbay/dispsys/plotdata4.pl .

Independent vs. dependent variables and best-fit lines.

A key sin in statistics is to confuse these two, and much shaking of heads and wagging of tongues is directed toward those committing this particular sin. We will see that in many situations it is clear which is which, but in some cases it is not. The important thing to realize is that which of your two variables you cast into either role does make a difference in the results. The convention (that is built into statistical software packages) is that x is your independent variable and y the dependent variable. But what is the difference?

In a controlled laboratory situation, the independent variable is the one the experimenter controls, and the dependent variable is the variable of interest that is measured for different values of the independent variable. If you are looking at solubility as a function of temperature, temperature would be the independent variable and the mass per unit volume of the material dissolved (the solubility) would be the dependent variable. Often there is a stated or unstated assumption that the dependent variable is controlled by the independent one, i.e. that there is a causal (not casual) relation between the two. So a change in temperature causes solubility to change. A change in solubility is not thought to cause a change in temperature. Note that you could theoretically use solubility to measure temperature, and so the roles of independent vs. dependent could be switched, but what an inefficient way to measure temperature! Which variable is dependent and which is independent is sensitive to the context.

Those used to working in the luxury of a controlled laboratory situation may be forgiven for thinking that there is always a clear-cut distinction between dependent and independent variables. The real world is more complex. For example, one can investigate whether there is a statistical relationship between Si and K content in a suite of volcanic rocks. Which one is the dependent and which one is the independent variable? Typically Si is taken as the x coordinate, but why? Or perhaps one is looking at the relationship between two different contaminants in an aquifer. This is not a controlled situation. Again, context is the guide. Very often it is helpful to assign the variable you want to predict, or know less about, as the dependent variable. For example, if one contaminant was a derivative of the other and an inverse relationship could be expected in any water sample, then the derivative compound would be the dependent variable. If one contaminant was much easier to measure and, on the basis of a relationship between the two, was to be used as a proxy indicator of the other, then the proxy would be the independent variable. The most important thing is to have thought this out before proceeding with the analysis or presenting your results. Establishing a mathematical relationship does not mean you have established a causal relationship.

The results of a linear regression are often termed the best-fit line. What does this mean? If you imagine a regression line (the plot of a linear equation) and the scatter plot of points that produced it, and then imagine the vertical lines (y distance) between each point and the regression line, you have one image of goodness of fit. The smaller those distances, the better the fit. Combine those into an aggregate length (in the standard least-squares routine, it is the squares of the distances that are summed). This aggregate is a measure of how vertically close the y values are to the regression line. In a perfect fit there would be no difference, with the points plotting right on the line, and the aggregate would be 0. Different regression lines for the same data produce different aggregates. The statistical routine in Excel and other statistical packages computes the line of minimum deviation, of minimum aggregate, the one that the points are, in aggregate, closest to.
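The idea of an aggregate misfit can be sketched in a few lines of Python. This sketch assumes the aggregate is the sum of squared vertical distances, which is the usual least-squares choice; the data and the two alternative lines are made up for illustration.

```python
import numpy as np

# Hypothetical paired data, invented for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])


def sum_squared_residuals(b, m):
    """Aggregate misfit: sum of squared vertical distances to y = b + m x."""
    residuals = y - (b + m * x)
    return float(np.sum(residuals ** 2))


# The least-squares line for these data.
m_fit, b_fit = np.polyfit(x, y, 1)
best = sum_squared_residuals(b_fit, m_fit)

# Two arbitrary alternative lines, for comparison only.
steeper = sum_squared_residuals(b_fit, m_fit + 0.3)
shifted = sum_squared_residuals(b_fit + 0.5, m_fit)

print(f"best-fit line:      {best:.4f}")
print(f"steeper line:       {steeper:.4f}")
print(f"shifted-up line:    {shifted:.4f}")
```

Any line other than the best-fit one gives a larger aggregate, which is what "best fit" means in this sense.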

Which variable is assigned to x and which to y does make a difference. As an experiment you can put in some real data and run the linear regression both ways (interchange x and y as independent and dependent variables). Why the difference? Simply because in one case you are minimizing the variation for y (the conventional case), and in the other you are minimizing the variation for x.

In a regression of y as the dependent variable, given x, the aggregate values for all the data points of distance A is minimized in the best fit routine. If you reverse the roles of x and y then distance B is minimized instead, and hence you can get a different answer. It is possible to minimize C also.
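The experiment of running the regression both ways can be tried with a short Python sketch. The data here are synthetic (randomly generated with an arbitrary seed, slope, and noise level chosen for illustration), not real measurements.

```python
import numpy as np

# Synthetic data: a linear trend plus random scatter (illustrative values).
rng = np.random.default_rng(0)
x = rng.normal(10.0, 2.0, 200)
y = 3.0 + 0.8 * x + rng.normal(0.0, 1.5, 200)

# Conventional case: y is dependent, vertical (y) misfit is minimized.
m_yx, b_yx = np.polyfit(x, y, 1)

# Roles reversed: x is dependent, horizontal (x) misfit is minimized.
m_xy, b_xy = np.polyfit(y, x, 1)

# Rewrite the reversed line x = b_xy + m_xy * y as a slope on the
# same y-versus-x plot so the two lines can be compared directly.
slope_from_xy = 1.0 / m_xy

r = np.corrcoef(x, y)[0, 1]

print(f"slope, y regressed on x:              {m_yx:.3f}")
print(f"slope, x regressed on y (replotted):  {slope_from_xy:.3f}")
print(f"r = {r:.3f}")
```

The two lines only coincide when the fit is perfect. In general the product of the two regression slopes (m_yx times m_xy) equals r squared, so the more scatter there is, the more the two answers diverge.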

Natural system feedback loops provide an interesting difficulty because there is not a one way causal relationship. Instead, both variables are interdependent. Snowcover and local air temperature could be one example. Of course, the air temperature determines how much snow melts or doesn't, but also the albedo of the snow affects local air temperature. Try to think of others. What to do in this situation? There is a type of linear regression that instead of seeking to minimize error of line fit only in the y variable, minimizes the error in both x and y. This is described in your Swan and Sandilands reference, and is referred to as structural regression. As you might guess it is more difficult, and we won't treat it here. However, you should look into it when the occasion arises. In any case, in your work you should clearly state which is your independent and which is your dependent variable (has this been stressed enough?). The reader can agree or disagree with your call and take it from there.

In class exercise: initial exploration of a bivariate relationship.

This will be a group project. Get into groups of 3.

The more specific you get the better. Report to the class.

The correlation coefficient.

The regression equation can be thought of as a mathematical model for a relationship between the two variables. The natural question is how good is the model, how good is the fit. That is where r comes in, the correlation coefficient (technically Pearson's correlation coefficient for linear regression). This basically quantifies how well pairs of x and y positions within their own distributions match each other. If there is a perfect fit, and x explains all the variation in y, then the one distribution as described by the mean and standard deviation of the population of x numbers should suffice for y. In other words if an x value is .65 standard deviation units from the mean, then its y pair should also be .65 standard deviation units from the mean if there is a perfect fit. The basic measure is how much of the variation in y can be explained by the variation in x.

Trying to understand r: One could use the aggregate distance of deviation from the best-fit line as described above as a crude measure of goodness of fit. However, the outcome would be a function of the specific scale and units in each individual case, and it would be hard to know how to interpret the resulting number. What is done instead is that each x and y value is scaled against its own distribution, i.e. the values are standardized against the distribution of each variable. This is done by computing z scores for each value of both x and y, where z is obtained by subtracting the mean from each value and dividing by the distribution's standard deviation. Basically, z values measure the distance from the mean in units of standard deviations.

In a perfect fit a y value's position in its distribution will be the same as its corresponding x value's position in its distribution, a one to one correspondence. If one plots the z scores for x and y pairs against each other, this one to one fit yields a slope of 1. In a good fit the pairs will be close, but not perfect. For low z values of x, where the value is close to or at the mean, the corresponding z value of y will likely be greater because of the deviation in y that is not explained by x. In turn, at the highest z value of x, where x is at the extreme of its distribution, the z value of the corresponding y is likely to be closer to the y mean, and hence will tend to plot below a one to one position. The result is that the line then has a slope less than one. If the position of one variable in its own distribution bears no relationship to that of its companion variable, then as the z value for x increases the z value for the corresponding y can be higher or lower, and a slope of 0 results.

r is the slope of the z score values of the two variables plotted against each other. A value of 1 means a perfect fit (or -1 for a perfect inverse relationship), and a value of 0 means no linear relationship exists between the two variables.
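The statement that r is the slope of the z scores plotted against each other can be checked numerically. The following Python sketch uses synthetic data (arbitrary seed and noise level, chosen for illustration) and compares that slope with the correlation coefficient computed directly by NumPy.

```python
import numpy as np

# Synthetic paired data with a positive trend plus scatter (illustrative).
rng = np.random.default_rng(1)
x = rng.normal(50.0, 10.0, 100)
y = 2.0 * x + rng.normal(0.0, 15.0, 100)


def z_scores(values):
    """Distance of each value from the mean, in standard deviation units."""
    return (values - values.mean()) / values.std(ddof=1)


zx = z_scores(x)
zy = z_scores(y)

# Least-squares line through the standardized pairs.
slope, intercept = np.polyfit(zx, zy, 1)

# Pearson's correlation coefficient computed directly.
r = np.corrcoef(x, y)[0, 1]

print(f"slope of z(y) against z(x): {slope:.4f}")
print(f"correlation coefficient r:  {r:.4f}")
```

The two printed numbers agree, and the intercept of the standardized line is zero (to rounding), because both sets of z scores have a mean of zero.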

What it boils down to: r is a measure of goodness of fit. Values close to 1 indicate a very good fit. If you square r, it represents the proportion of variability of y accounted for by x. In other words, if you had an r-squared value of .95 you can account for 95% of the variability in y with knowledge of x. In geology r-squared values greater than .9 are preferred, but note that even if you have an r-squared of .33, this means that x is still describing a significant proportion of the y behavior. Those below .5 are considered pretty useless for bivariate analysis, because the associated error is so big. Multivariate analysis is different. Again, if using the mathematical relationship to predict y given x, then the convention is to report an error of 2 times the SEE (the standard error of estimate, described below), but this convention is not always followed.

SEE = Standard error of estimate: If you use x to predict y given the relationship, you might want to know what the chances are that the real y value is a certain amount away. One commonly sees error reported in association with U-Pb radiometric dates. In part this is where these errors come from, the fitting of a line to isotopic ratio data (known as a chord). There are some implicit assumptions you will need to investigate if the error becomes important to you (i.e. you need to learn a bit more than is presented here). The plus or minus a certain number of million years error is usually reported at two SEEs, which means that 95% of the values obtained if this sample were dated over and over again would lie within the noted error range. This is the conventional error reported. For example, an age of 346 +/- 2 Ma can be read as meaning that one can be 95% confident that the age is between 344 and 348 Ma.
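A minimal Python sketch of how r-squared and the standard error of estimate might be computed for a fitted line follows. The data are made up, and the n - 2 denominator is the common textbook convention for a two-parameter line, assumed here rather than taken from the course reading. The +/- 2 SEE band is only a rough 95% range: it assumes normally distributed scatter and ignores the uncertainty in the fitted slope and intercept.

```python
import numpy as np

# Hypothetical paired data, invented for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.8])

m, b = np.polyfit(x, y, 1)
residuals = y - (b + m * x)
n = len(x)

# Standard error of estimate: typical vertical distance of a point from
# the line. Dividing by n - 2 is an assumed convention (two fitted values).
see = np.sqrt(np.sum(residuals ** 2) / (n - 2))

# Proportion of the variability in y accounted for by x.
r_squared = 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

# Predict y at a new x and attach the conventional +/- 2 SEE error.
x_new = 3.5
y_pred = b + m * x_new
print(f"r-squared = {r_squared:.3f}, SEE = {see:.3f}")
print(f"predicted y at x = {x_new}: {y_pred:.2f} +/- {2 * see:.2f}")
```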

Correlation vs. cause. We have mentioned this before, but it bears repeating. Imagine a situation where one variable controls or influences two other variables that are independent of each other. For example, temperature controlling both algal reproduction and the amount of calcium dissolved in the water of a pond (although the amount of calcium could influence algal reproduction also, and thus they wouldn't be truly independent). A regression of algal concentration versus dissolved calcium could develop a statistically significant relationship, and even a useful one. However, it would be wrong to conclude that dissolved calcium content was necessarily controlling the biota. This is when you get into multivariate analysis, taking into account the multiple factors that are likely to be related.

Problems with closed correlations occur when working with percent data. This is especially a problem with geochemical or point count data, which are very common in geology. Since the sum of the components must equal 100%, when plotting one versus the other there must be an inverse relationship, because as one increases in percentage, the other must decrease. This is most easily seen by thinking of a two component system, where there must be a perfect fit with r = -1. Three or more components will still produce correlations that have nothing to do with true association or causation. The problem is less severe with trace elements, but can still exist. There are alternative approaches for finding real correlations, but we don't have time to go into those here. Swan and Sandilands do discuss some of the alternatives.
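The closure effect is easy to reproduce with random numbers. In this Python sketch (synthetic data, arbitrary seed and ranges chosen for illustration), three quantities are generated independently of each other and then converted to percentages of their sum, which is enough to induce a negative correlation that has nothing to do with any real association.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two-component system: if one component is a percent, the other must be
# 100 minus it, so the correlation is forced to be exactly -1.
a = rng.uniform(10.0, 90.0, 50)
b = 100.0 - a
r_two = np.corrcoef(a, b)[0, 1]

# Three quantities generated independently of each other (synthetic).
raw = rng.uniform(1.0, 10.0, size=(500, 3))
r_raw = np.corrcoef(raw[:, 0], raw[:, 1])[0, 1]

# Close the data: express each row as percentages summing to 100.
pct = 100.0 * raw / raw.sum(axis=1, keepdims=True)
r_pct = np.corrcoef(pct[:, 0], pct[:, 1])[0, 1]

print(f"two components, percent data:    r = {r_two:.3f}")
print(f"three components, raw amounts:   r = {r_raw:.3f}")
print(f"three components, percent data:  r = {r_pct:.3f}")
```

The raw amounts show essentially no correlation, while the same data expressed as percentages show a clearly negative one.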

15-minute YouTube video on regression and the correlation coefficient.

Regression in Excel.

As usual there are several ways to do a regression, depending in part on how much information you want.

Examples of linear regression.

A similar approach can be taken to estimate recurrence intervals of earthquakes. You will explore this in your exercise.

A look ahead.

Is there any reason you can't move this type of modeling into looking at three variables and into three dimensions? Instead of a line or a curve, you could envision a surface. The simplest surface would be a planar one. The answer is that you can, and a later exercise in this course looks at surface modeling.
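As a small preview, the same least-squares idea extends to fitting a plane z = a + b x + c y through points in three dimensions. This Python sketch uses synthetic data (the "true" coefficients, noise level, and seed are arbitrary choices for illustration) and NumPy's general least-squares solver; it is one possible approach, not necessarily the one used in the later exercise.

```python
import numpy as np

# Synthetic 3-D scatter: a known plane plus a little random noise.
rng = np.random.default_rng(3)
x = rng.uniform(0.0, 10.0, 100)
y = rng.uniform(0.0, 10.0, 100)
z = 1.0 + 0.5 * x - 0.25 * y + rng.normal(0.0, 0.1, 100)

# Each row of the design matrix is [1, x, y], so the unknowns are the
# intercept a and the two slopes b and c of the plane z = a + b x + c y.
design = np.column_stack([np.ones_like(x), x, y])
coeffs, *_ = np.linalg.lstsq(design, z, rcond=None)
a, b, c = coeffs

print(f"fitted plane: z = {a:.3f} + {b:.3f} x + {c:.3f} y")
```

Here the vertical (z) misfit is what is minimized, so z plays the role of the dependent variable and x and y are both independent variables.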

An example of a 3-D scatter plot and the resulting modeled surface for Everglades water data. Image from USGS site: http://sofia.usgs.gov/projects/workplans06/hydro_mon.html

Exercise 3.

Copyright by Harmon D. Maher Jr.. This material may be used for non-profit educational purposes if proper attribution is given. Otherwise please contact Harmon D. Maher Jr.