Geologic background of the first data set: Below are the first 20 of 1338 rows of data downloaded from the Lamont-Doherty basalt geochemistry and petrography database. You can download the data set yourself if you want a full copy, or come see me. Specifically, it is the prepared data set for backarc basins. Because these basins are smaller and often filled with significant sediment, they do not show the well-developed magnetic anomalies and ridge geometry seen at seafloor spreading ridges, and yet they must be some type of oceanic crust. They may have somewhat different mechanisms of crustal formation than standard seafloor spreading ridges, and this should be reflected in their geochemistry.
Geologic background of the second data set: This comes from the same data source as above, but all the samples are from the EPR (East Pacific Rise), one of the faster seafloor spreading ridges active on Earth today. With some 3817 rows, it is a substantial data set.
Below are histograms of the data with the backarc basin data on top and the EPR data on bottom:
How do they compare? What conclusions can you draw from this? Does the difference in sample size between the two data sets strengthen or weaken your conclusions? Even though there is a lot of data here, do you really know how representative this data set is of back-arc basins? Can you think of biases that might exist?
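If you would rather build the comparison outside of Excel, the side-by-side histograms above can be sketched in a few lines of code. The values below are hypothetical stand-ins for a single geochemical variable from the two data sets (you would substitute the real columns); the key point is that both data sets are binned with the same edges, and counts are converted to fractions so the very different sample sizes (1338 vs. 3817 rows) do not distort the comparison:

```python
import random

def histogram(values, edges):
    """Count values falling into each [edges[i], edges[i+1]) bin."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts

# Hypothetical stand-in values; replace with the actual data columns.
random.seed(0)
backarc = [random.gauss(7.0, 1.0) for _ in range(1338)]
epr = [random.gauss(7.5, 0.8) for _ in range(3817)]

# Identical bin edges make the two histograms directly comparable;
# dividing by n puts both on a fractional (relative-frequency) scale.
edges = [4.0 + 0.5 * i for i in range(13)]  # bins from 4.0 to 10.0
for name, data in [("back-arc", backarc), ("EPR", epr)]:
    fractions = [c / len(data) for c in histogram(data, edges)]
    print(name, [round(f, 3) for f in fractions])
```

Normalizing by sample size is the step that matters here: raw counts from a 3817-row data set will always tower over counts from a 1338-row one, regardless of whether the underlying distributions actually differ.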
An Excel experiment: The back-arc basin data rows were assigned random numbers, and the rows were then sorted by those numbers. Averages were then computed for the first n rows, with n increasing by one each time. This lets us watch how the sample average changes as the sample size incrementally increases. The difference between the average of the entire data set and the sample average is plotted against sample size below. As you would expect, with increasing n you converge on the 'true' average of the full data set. However, note some interesting behavior before that point: the sample average can temporarily deviate further from the true average when you happen to hit a random 'cluster' of higher or lower values. Two random number assignments were made, producing the two charts. This, by the way, would look different if at each iteration the entire population were sampled anew with n increased by 1; instead, you are seeing the effect of adding one more sample while keeping the samples you already have. If you have time and interest, you can explore what the difference would be.
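The same experiment can be sketched in code. This is a minimal version assuming a stand-in list of 1338 synthetic values in place of the actual back-arc basin column: shuffle the rows (the equivalent of the sort-by-random-number step), then track a running average that keeps all earlier samples as n grows by one:

```python
import random

# Hypothetical stand-in for the back-arc basin data column
# (e.g. one major-element concentration); replace with real values.
random.seed(42)
population = [random.gauss(50.0, 2.0) for _ in range(1338)]
true_mean = sum(population) / len(population)

# Randomize the row order, as the Excel sort-by-random-number step does.
shuffled = population[:]
random.shuffle(shuffled)

# Running (cumulative) average for n = 1 .. N, keeping earlier samples.
running_sum = 0.0
deviations = []
for n, value in enumerate(shuffled, start=1):
    running_sum += value
    deviations.append(running_sum / n - true_mean)

# Early deviations can be large; at n = N the deviation is essentially zero,
# since the "sample" is then the whole data set.
print(abs(deviations[0]), abs(deviations[-1]))
```

Plotting `deviations` against n reproduces the charts above; rerunning with a different seed gives a different early wander, just as the two random-number assignments did in Excel.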
Copyright by Harmon D. Maher Jr. This material may be used for non-profit educational purposes if proper attribution is given. Otherwise please contact Harmon D. Maher Jr.