Introduction into geoscience database design.

Lecture index: Motivation for this lab. / What is a database? / Database structures. / Database queries. / Metadata. / This week's exercise. /


Reading for the week:

Wikipedia article on Database models. This looks at many more possibilities of how a database is structured than are described below. Focus on hierarchal, network, and relational database. Some of the language is 'technospeak', but you can still wade through much of it.

Chapt. 5 What is where? in Clarke, K. C., 2001, Getting Started with Geographic Information Systems, Prentice Hall, p. 146-165. While this is described in a GIS context it still covers some basics about databases.


Motivation for this lab

A wealth of information on databases exists on the web, but it tends to be highly technical. Usually this is considered the domain of IS&T (Information Science and Technology), of these trained in computers and information science. Yet, extracting data from, and contributing data to databases, are very common tasks for scientists in general, and very much for geologists and geoscientists in particular. Inherently, large data sets are required to understand and model complex natural phenomena such as earthquake behavior, or oil reservoirs, or climate change! Geoscience has become much more database oriented than it used to be. It is also highly likely that you may at some time need to create a database of some type. Finally, databases are important primary components of GIS, which we will delve into later. So, some introduction to the topic of databases is useful for all scientists.


What is a database?

Your address book is a database of sorts. So is the telephone book. In practice an Excel sheet with rows and columns of data entry is considered a simple database of sorts. Indeed, you will see that sheets of information in row and column format is a crucial element of most databases. Computer software and hardware is evolving quickly, and so is the associated language. For that reason there is not total agreement on the meaning of related terms. For some practiononers you have to have the ability to query a database before it qualifies as the real thing. A query here is to select some sub portion of the database based on some set of criteria. For our purposes we will consider a database as some software environment you put data in with the hopes that you or others can retrieve it later in order to analyze it. An important aspect of many databases is that they have multiple users that want to extract select data from the larger database. Note that some databases have little analytic capability, and must partner with an analytic package. Excel and Access are two programs that serve as a low level database software that can partner with other software packages. We have seen an example of this already with Surfer. Later on we will look at GIS, which is an integrated database, analytic, modeling and cartographic package. With GIS another big difference is the the spatial display aspect, i.e. the map perspective. You can consider this week to serve as an introduction to one part of the GIS environment.


Database structures

As your reading elucidates, there are many ways to organize information. Some of them focus on the phenomena being studied, some focus on software architecture. Database modeling is another key term here - a model for how the information should be organized.

Examples of database 'maps':

Above: USGS image of the structure of their database for rock type information. What architectural components exist here?

USGS diagram of sediment database. What can this be used for?

Link to USGS Lake Ponchatrain sediment database.


Database queries

You already have used queries within a database. When you selected your earthquake data you queried the data base for all earthquakes in the record for a circular area. Web based databases are more and more widely used. A simpler example, is when you constructed your histograms. A bin value is simply the number of data points you have in a larger data set that fall between the two bin values. The FREQ function in excel is basically a simple query, and the histogram function a series of queries. An even simpler example is when you use the find function within a software package or looking at computer content. Queries can also be thought of in terms of set theory, where you are often looking for the intersection of two or more sets. A query is how you get the data you need out of a larger database.

SQL ­ Standard Query Language. This is a software language that works with many databases including Access and Oracle. We will just mention it here. For more sophisticated queries it can be quite useful, but it is beyond the scope of this class to delve into.

Access is a commonly used software package. One can learn the fundamentals in a week or less of concentrated effort. It has the advantage of interfacing with many other packages. It is more common in business environments and is especially good at handling alphanumeric data (addresses, geographic localities, socio-economic data).


Metadata

If you were given a table of oxide values for a suite of samples, what more might you want to know in order to analyze and draw conclusions from the data? You might want to know about the precision. Remember that this can be implied by the number of significant digits that are present. You could also have an additional column that contains error information. However, how about the accuracy? In order to judge this you might want to know how the samples were prepared, what standards were used to calibrate the analytic device, what type of analytic device was used, or even what lab did the analysis. This information is harder to put into a relational database structure (although some of it can be). The description of how the data was obtained, and thoughts on its accuracy are included in the metadata. The metadata also includes a simple description of what data is included in the database.

The USGS provides a lot of GIS data online, and has had to think about metadata content and structure. The questions they suggest that the metadata document should address are as follows:

USGS site on metadata.


This week's exercise

Purpose of this exercise: This is a conceptualization exercise. Arguably conceptualization is the more important part of database design. It is also where having the relevant expertise is crucial. In a couple of days to months, or years, one can often acquire the needed software skills to construct or manipulate a database, depending on the nature of the database used. It takes many years or more to gain the needed expertise in the relevant disciplines. A related thought is to let the science drive the process and tool development, not the tool drive the science that is done. Otherwise you are pounding in screws with a hammer.

As a geoscientist you have critical expertise when it comes to designing geoscience databases - you understand what is trying to be captured in the database, how they are going to be used. Database design should reflect the user's needs ­ what type of information goes in, how is it arranged, and how can data be retrieved are all shaped by the user's needs? In your career, you may very well be working with someone who has computer expertise, who can help with the details, or you may bear primary responsibility.

Database project design exercise: This will be a group effort, with initial groups of 2-4 people brainstorming, but then the entire class working together. thus for the next several weeks your participation is crucial! You should come out of it with a document that can contribute to your undergraduate portfolio.

If you want to explore a bit a database being created for structures on the Great Plains take a look here -> an example.


Possible answers to what other hierarchal structures are traditionally used:


Copyright by Harmon D. Maher Jr.. This material may be used for non-profit educational purposes if proper attribution is given. Otherwise please contact Harmon D. Maher Jr.