Introduction into geoscience database design

Lecture index: Motivation for this lab. / What is a database? / Database structures. / Database queries. / Metadata. / This week's exercise. /

Example of a database structure for the USGS South Florida Ecosystem Program. The challenge is to get the information needed to the user in a form they can best use. Database design is a crucial consideration in doing so.


Reading for the week:

Wikipedia article on Database models. This looks at many more possibilities of how a database is structured than are described below. Focus on hierarchal, network, and relational database. Some of the language is 'technospeak', but you can still wade through much of it and get a lot out of it.


Motivation for this lab

A wealth of information on databases exists on the web, but it tends to be highly technical. Usually this is considered the domain of IS&T (Information Science and Technology), of those trained in computers and information science. Yet, extracting data from, and contributing data to databases, are very common tasks for scientists in general, and very much for geologists and geoscientists in particular. Inherently, large data sets are required to understand and model complex natural phenomena such as earthquake behavior, or oil reservoirs, or climate change! Geoscience has become much more database oriented than it used to be. It is also highly likely that you may at some time need to create a database of some type. Finally, databases are important primary components of GIS, which we will delve into later. So, some introduction to the topic of databases is useful.


What is a database?

Your address book is a database of sorts. So is the telephone book. In practice an Excel sheet with rows and columns of data entry is considered a simple database of sorts. Indeed, you will see that sheets of information in row and column format is a crucial element of most databases. Computer software and hardware is evolving quickly, and so is the associated language. For that reason there is not total agreement on the meaning of related terms. For some practitioners you have to have the ability to query a database before it qualifies as the real thing. A query here is to select some sub portion of the database based on some set of criteria. For our purposes we will consider a database as some software environment you put data in with the hopes that you or others can retrieve it later in order to analyze it. An important aspect of many databases is that they have multiple users that want to extract select data from the larger database. Note that some databases have little analytic capability, and must partner with an analytic package. Excel and Access are two programs that serve as low level database software that can partner with other software packages. We have seen an example of this already with Surfer. Later on we will look at GIS, which is an integrated database, analytic, modeling and cartographic package. With GIS another big difference is the spatial display aspect, i.e. the map perspective. You can consider this week to serve as an introduction to one part of the GIS environment.


Database structures

As your reading elucidates, there are many ways to organize information. Some of them focus on the phenomena being studied, some focus on software architecture. Database modeling is another key term here - a model for how the information should be organized.

Examples of database 'maps':

Above: USGS image of the structure of their database for rock type information. What architectural components exist here?

Above: USGS diagram of sediment database. What can this be used for?

Above is an example of a hierarchal database structure for an Alaska Geologic Map database. This one is interesting for the types of geologic data and how they are arranged. Notice that only the structure of part of the database is being depicted. Image source: http://pubs.usgs.gov/of/2004/1451/freeman/ .

You can learn a lot by example on how geoscience databases can be structured by exploring some of these links.


Database queries

You already have used queries within a database. When you selected your earthquake data from the USGS site you queried the data base for all earthquakes in the record for a circular area, a certain time span and a certain range of sizes. Web based databases are more and more widely used. A simpler example, is when you constructed your histograms. A bin value is simply the number of data points you have in a larger data set that fall between the two bin values. The FREQ function in excel is basically a simple query of the array of values, and the histogram function is a systematic series of queries. An even simpler example is when you use a find function within a software package or looking at computer content. A query is how you get the data you need out of a larger database.

SQL ­ Standard Query Language. This is a software language that works with many databases including Access and Oracle. We will just mention it here. For more sophisticated queries it can be quite useful, but it is beyond the scope of this class to delve into.

Access is a commonly used software package for housing databases. One can learn the fundamentals in a week or less of concentrated effort. It has the advantage of interfacing with many other packages. It is more common in business environments and is especially good at handling alphanumeric data (addresses, geographic localities, socioeconomic data).

Internet queries. The web is one large network database in some ways. It wasn't purposefully organized in a certain structure and instead hyperlinks are the common thread and a form of self-organization is in work. How do you query the internet? Obviously you use one of a variety of search engines (Google Scholar can be particularly helpful). Key words are the key! Then you have to sort through the returns to weed out the good stuff from the garbage. If looking for technical or more specific information, then using the right technical term can really aid your search and weed out much of the garbage for you. For example, if you are looking for information on geohazards in a local area, connecting the name of the locality with a specific type of geohazard will likely focus the search more. For example, the key words 'San Francisco liquefaction earthquakes' will result in more specific and focused possibilities than just "San Francisco earthquakes'.


Metadata

If you were given a table of oxide values for a suite of samples, what more might you want to know in order to analyze and draw conclusions from the data? You might want to know about the precision. Remember that this can be implied by the number of significant digits that are present. However, you may want more information such as associated analytic error bars. You could also have an additional column that contains error information. However, how about the accuracy? In order to judge this you might want to know how the samples were prepared, what standards were used to calibrate the analytic device, what type of analytic device was used, or even what lab did the analysis. This information is harder to put into a relational database structure (although some of it can be). The description of how the data was obtained, and thoughts on its accuracy are included in the metadata. The metadata also includes a simple description of what data is included in the database.

The USGS provides a lot of GIS data online, and has had to think about metadata content and structure. The questions they suggest that the metadata document should address are as follows:

USGS site on metadata.


This week's exercise

Purpose of this exercise: This is a conceptualization exercise and you will work in teams of two (preferably) to three. Arguably, conceptualization is the more important part of database design, and thus, having the relevant domain expertise is crucial. In a couple of days to months, or years, one can often acquire the needed software skills to construct or manipulate a database, depending on the nature of the database. It takes many years or more to gain the needed expertise in the relevant disciplines. A related thought is to let the science drive the process and tool development, not the tool drive the science that is done. Otherwise, as the saying goes, you are pounding in screws with a hammer.

As a geoscientist you have critical expertise when it comes to designing geoscience databases - you understand what is trying to be captured in the database, how they are going to be used. Database design should reflect the user's needs ­ what type of information goes in, how is it arranged, and how can data be retrieved are all shaped by the user's needs! Anticipating the types of queries, and the forms of data desired is crucial. In your career, you may very well be working with someone who has computer expertise, who can help with the details, or you may bear primary responsibility as you engage in database development.

Develop a proposal for a database of your choosing and design that includes the following:

You will present (as a group) your Powerpoint presentation and also hand in the PowerPoint file. Aim for 10 minutes presentation length. Your classmates and I will ask questions and provide feedback. You will then be able to modify your presentation based on feedback. Your final grade will be based on both the presentation, and revised version handed in.

For efficient collaboration it is often useful to divy up specific tasks among team members.

Actual data does not have to be put into the database (although sample data is welcome), and the data does not already have to exist (but should be a feasible type of data to gather in the future if it does not exist (i.e. make things as realistic as possible in the time given). I suggest working collaboratively on this in class, but then for the report assigning different components of the summary report to different members of the group. You will be asked to assess the contribution of the others in the group.


Possible answers to what other hierarchal structures are traditionally used in geoscience:

Set theory is involved here, a topic we will return to later on in the course.


Copyright by Harmon D. Maher Jr.. This material may be used for non-profit educational purposes if proper attribution is given. Otherwise please contact Harmon D. Maher Jr.