CHAPTER 1
If we look up a plant, we can immediately see what its uses are. However, if we want to find all the plants suitable for hedging, for example, we have a problem. We need to search through each of the use columns individually. Producing a report of all hedging plants would require some logic along the lines of: IF use1 = hedging OR use2 = hedging OR use3 = hedging. Also, the database table as it stands restricts a plant to having three uses. That may be adequate for now, but if that three-use limit changes, the table would have to be redesigned to include one or more new columns. Any logic will need to be altered to include OR use4 = hedging, and at the back of our minds we just know that whatever number of uses we choose, eventually we will come across a plant that needs one more. The carefully collected data has unfortunately been saved in a manner that is difficult to use and maintain.
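The search logic described above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the table name plant, the column names, and the three sample rows are invented here and are not taken from Figure 1-1.

```python
import sqlite3

# Hypothetical single-table design: one column per use (names are assumptions).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE plant (
        plant_id    INTEGER PRIMARY KEY,
        common_name TEXT,
        use1        TEXT,
        use2        TEXT,
        use3        TEXT
    )
""")
conn.executemany(
    "INSERT INTO plant (common_name, use1, use2, use3) VALUES (?, ?, ?, ?)",
    [
        ("Plant A", "shelter", "hedging", "soil stability"),  # invented sample rows
        ("Plant B", "hedging", None, None),
        ("Plant C", "shelter", None, None),
    ],
)

def plants_with_use(conn, use):
    """Return names of plants having the given use.

    Every use column has to be tested separately; adding a use4 column
    would mean editing this query as well as the table.
    """
    rows = conn.execute(
        "SELECT common_name FROM plant "
        "WHERE use1 = ? OR use2 = ? OR use3 = ? "
        "ORDER BY common_name",
        (use, use, use),
    )
    return [row[0] for row in rows]

hedging_plants = plants_with_use(conn, "hedging")
print(hedging_plants)  # ['Plant A', 'Plant B']
```

Note that the query has to name every use column, so the three-use limit is built into both the table and every report that reads it.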
In Example 1-1, the real shame is that all the data has been carefully collected and entered, but the design of the table makes it extremely difficult to answer a question such as, "What plants are good for shelter?" The developer has done better than many in separating the uses into individual columns. Often data like this can be found stored in a single column separated by commas or other punctuation. (E.g., an entry in a single column for uses might read: shelter, hedging, soil stability.) This is even more difficult to manage than the design in Figure 1-1. The problem is that the database was designed principally to satisfy the user's immediate problem, which is: "I need to store all the info I have about each plant." The developer thought of the data in terms of a single type or class, Plant, and he saw each use as an attribute of a plant in much the same way as its genus or common name. This is fine if all you want to know are answers to questions like, "What uses does this plant have?" The approach is not so useful when going in the other direction, searching for plants having a given use. In Example 1-1, we really have two sets or classes of data, Plants and Uses, and we are interested in the connections between them. The data modeling techniques described in the rest of the book are a practical way of clarifying exactly what it is you expect from your data and helping you decide on the best database design to support that. Jumping ahead a bit to see a solution for the plant database problem, you can quite quickly set up a useful relational database by creating the two tables shown in Figure 1-2. (Some extra tables would be even better, but more about that in Chapter 2.)
An end user with modest database skills would be able to set up the appropriate keys, relationships, and joins and produce some useful reports. A simple query on (or even a filtering or sorting of) the Uses table will enable the user to find, for example, all shelter plants. There is no restriction now on how many uses a plant can have. The initial setup is slightly more costly, in time and expertise, than for the single table described in Example 1-1, but these separate tables will be able to provide a great deal of additional information. Example 1-1 shows us one way we can satisfactorily deal with categories. Unfortunately, there are other problems in store. In Example 1-1, the categories were quite clear-cut, but this is not always the case. Example 1-2 shows the problems that occur when categories and keywords are not so easily determined.
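Returning briefly to the plant example, the two-table design might look something like the following sketch. The table names plant and plant_use, the key columns, and the sample rows are assumptions made for illustration; Figure 1-2 may name things differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per plant (hypothetical column names).
    CREATE TABLE plant (
        plant_id    INTEGER PRIMARY KEY,
        common_name TEXT NOT NULL
    );
    -- One row per (plant, use) pair, so a plant can have any number of uses.
    CREATE TABLE plant_use (
        plant_id INTEGER NOT NULL REFERENCES plant(plant_id),
        use_name TEXT    NOT NULL,
        PRIMARY KEY (plant_id, use_name)
    );
""")
conn.executemany(
    "INSERT INTO plant VALUES (?, ?)",
    [(1, "Plant A"), (2, "Plant B"), (3, "Plant C")],  # invented sample rows
)
conn.executemany(
    "INSERT INTO plant_use VALUES (?, ?)",
    [
        (1, "shelter"), (1, "hedging"), (1, "soil stability"), (1, "firewood"),
        (2, "hedging"),
        (3, "shelter"),
    ],
)

def plants_with_use(conn, use):
    """Find plants by use with a single condition on the plant_use table."""
    rows = conn.execute(
        """
        SELECT p.common_name
        FROM plant AS p
        JOIN plant_use AS u ON u.plant_id = p.plant_id
        WHERE u.use_name = ?
        ORDER BY p.common_name
        """,
        (use,),
    )
    return [row[0] for row in rows]

shelter_plants = plants_with_use(conn, "shelter")
uses_of_plant_a = conn.execute(
    "SELECT COUNT(*) FROM plant_use WHERE plant_id = 1"
).fetchone()[0]
print(shelter_plants, uses_of_plant_a)
```

In this sketch Plant A has four uses, which the three-column design could not have stored, and the search query does not change however many uses are recorded.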
We are able to see at a glance the research interests of a particular person, but as was the case in Example 1-1, it is awkward to do the reverse and find who is interested in a particular topic. However, we have an additional problem here. Many of the research interests look similar but they are described differently. How easy will it be to find a researcher who is able to visualize data?
As in Example 1-1, the table has been designed taking just one class of data into consideration: in this case, People. Really, though, we have two classes, People and Interests, and we are concerned with the connections or relationships between them. A solution analogous to that in Example 1-1 would be much more useful in this case, too. Creating a table of people is reasonably straightforward, but the table of interests poses some problems. In Example 1-1, the different possible uses were fairly clear (hedging, shelter, etc.). What are the different possible research interests in Example 1-2? The answer is not so obvious. A quick glance at the data displayed shows eight interests, but it is reasonable to assume that "visualisation" and "visualization" are merely different spellings of the same topic. But what about "scientific visualisation" and "visualisation of data": are these the same in the context of the problem? What about "computer visualisation"? Any staff member with one of these interests would probably be useful for an outside inquiry about how to visualize some data. Having decided on two classes of data, People and Interests, we now need to clearly define what we mean by them. People isn't too difficult: you might have to think about which staff members are to be involved and whether postgraduate students should also be included. However, Interests is more difficult. In the current example, an interest is anything that a staff member might think of. Such a fuzzy definition is going to cause us a number of problems, especially when it comes to doing any reporting or analysis about specific interests. One solution is to predetermine a set of broad topics and ask people to nominate those applicable to them. But that task is far from simple. People will be aggrieved that their pet topic is not included verbatim, and hours (probably months) could be wasted attempting to find agreement on a complete list.
And this list may well comprise a whole hierarchy of categories and subcategories. Libraries and journals expend considerable energy and expertise devising and maintaining such lists. Maybe such a list will be useful for the problem in Example 1-2, but then again maybe not. Having foreseen the difficulties, you may decide that the effort is still worthwhile, or you may reconsider and choose a different solution. In the latter case, it may well be easier for the liaison team to make a stab at the most likely individual and let a real human being sort out what is required. In just the three-month period prior to drafting this chapter, I have seen three different attempts at setting up spreadsheets or databases to record research interests. Each time, a number of hours were spent collecting and storing data before the perpetrator started to run into the problems I've just described. None of the databases is being maintained or used as envisioned.
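To make the fuzzy-category problem of Example 1-2 concrete, here is a small Python sketch. The five interest strings are the ones quoted in the discussion above; the person labels are invented, and the loose matching rule is only a crude assumption about what "the same topic" might mean, not a recommended solution.

```python
# Interest strings come from the discussion of Example 1-2;
# the person labels are invented for illustration.
interests = [
    ("Person 1", "visualisation"),
    ("Person 2", "visualization"),
    ("Person 3", "scientific visualisation"),
    ("Person 4", "visualisation of data"),
    ("Person 5", "computer visualisation"),
]

def exact_match(records, topic):
    """People whose recorded interest is spelled exactly like the topic."""
    return [person for person, interest in records if interest == topic]

def loose_match(records, stem):
    """Crude alternative: fold z to s, then look for a common stem.

    Whether these interests really are the same topic is a judgment
    about the problem, not something string matching can decide.
    """
    return [
        person
        for person, interest in records
        if stem in interest.lower().replace("z", "s")
    ]

exact = exact_match(interests, "visualisation")
loose = loose_match(interests, "visualis")
print(len(exact), len(loose))  # 1 5
```

An exact search finds only one of the five people who could plausibly help with a visualization inquiry, which is why the definition of an interest needs to be settled before the table is designed.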
Repeated Information
Another common problem is unnecessarily storing the same piece of information several times. Such redundancy is often a result of the database design reflecting some sort of input form. For example, in a small business, each order form may record the associated information of a customer's name, address, and phone number. If we design a table that reflects such a form, the customer's name, address, and phone number are recorded every time an order is placed. This inevitably leads to inconsistencies and problems, especially when the customer moves from one address to another. We might want to send out an advertising catalog, and there will be uncertainty as to which address should be used. Sometimes the repeated information is not quite so obvious. Example 1-3 illustrates one such case.
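The order-form problem described above can be avoided by recording each customer once and having each order refer to that record. The following is a minimal sketch using Python's sqlite3 module; the table names, columns, and sample values are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Customer details are stored once (hypothetical columns).
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        address     TEXT NOT NULL,
        phone       TEXT
    );
    -- Each order refers to a customer instead of repeating the details.
    CREATE TABLE customer_order (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        order_date  TEXT NOT NULL
    );
""")
conn.execute(
    "INSERT INTO customer VALUES (1, 'Customer A', '1 Old Road', '555-0100')"
)
conn.executemany(
    "INSERT INTO customer_order VALUES (?, ?, ?)",
    [(1, 1, "2024-01-05"), (2, 1, "2024-03-17")],  # invented sample orders
)

# The customer moves: the address is changed in exactly one place.
conn.execute(
    "UPDATE customer SET address = ? WHERE customer_id = ?",
    ("2 New Street", 1),
)

# Every order now sees the same, current address.
addresses = conn.execute("""
    SELECT DISTINCT c.address
    FROM customer_order AS o
    JOIN customer AS c ON c.customer_id = o.customer_id
""").fetchall()
print(addresses)  # [('2 New Street',)]
```

Because the address lives in one row, there is no uncertainty about where to send the catalog.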
Clare Churcher and Peter McNaughton, "There are bugs in our spreadsheet: Designing a database for scientific data" (research report, Centre for Computing and Biometrics: Lincoln University, February 1998).
The information about each farm was recorded (quite correctly) elsewhere, thus avoiding that data being repeated. However, there are still problems. The fact that field ADhc is on farm 1 is recorded every visit, and it does not take long to find the first data entry error in row 269. (The coding used for the fields raises other issues that we will not address just now.)
On the face of it, the error of listing field ADhc under farm 2 instead of farm 1 in Figure 1-4 doesn't seem like such a big deal, but it is avoidable. The fact that the farm was recorded in this spreadsheet means that the data is likely to be analyzed by farm, and now any results for farms 1 and 2 are potentially inaccurate. And how many other data entry errors will there be over the lifetime of the project? Given that the results in Example 1-3 came from a carefully designed, long-term experiment and were to be statistically analyzed, it seems a shame that such errors are able to slip in when they can be easily prevented. It is important to distinguish between data input errors (anyone can make typos now and then) and design errors. The problem in Example 1-3 is not that field ADhc was wrongly associated with farm 2 (a simple error that could be easily fixed), but that the association between farm and field was recorded so many times that an eventual error became almost certain. And errors such as these can be very difficult to detect. Another piece of information is also repeated in the spreadsheet in Example 1-3: the date of a visit. The information that field ADhc was visited on Aug-11 is repeated in rows 268 to 278, creating another source of avoidable errors (e.g., we could accidentally put Aug-10 in row 273). Such an error would affect any analyses based on date. The repeated visit date information in Example 1-3 also gives rise to an additional and more serious problem: what do you do with miscellaneous information about a particular visit (e.g., it was raining at the time, which is quite important if you are counting insects)? Is it just included on one row (making it difficult to find all the affected samples), or does it go on every row for that visit (awkward and compounding the repeated information problem)?
In fact, the weather information in this case was recorded quite separately in a text document, thereby making it impossible to use the power of the software to help in any analyses of weather. Techniques described more fully in later chapters would have prevented the problems encountered in Example 1-3. Rather than thinking of the data in terms of the counts in each sample, the designer would have thought about Farms, Fields, Visits, and Insects as separate classes of data in which researchers are interested both individually and together. For example, the researchers may want to find information about fields with particular soil types or visits undertaken in fine weather conditions. Figure 1-5 shows how separating information
about fields and visits into separate tables not only reduces problems with repeated information, but allows more data (soil types for fields, weather conditions for visits) to be easily added. The Counts table still suffers the same problems as the tables in Examples 1-1 and 1-2, but that can be addressed. We will return to this example in Chapter 4.
Figure 1-5. An improved database design for the insect problem (tables: Fields, Visits, Counts)
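The separation into fields, visits, and counts can be sketched as follows with Python's sqlite3 module. The column names, the soil type, and the count values are invented for illustration. The sample_count table here stores one row per insect per sample, which is an assumption of this sketch and may well differ from the Counts table in the figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The farm for each field is recorded once, not on every sample row.
    CREATE TABLE field (
        field_code TEXT PRIMARY KEY,
        farm       INTEGER NOT NULL,  -- farm details are assumed to live elsewhere
        soil_type  TEXT
    );
    -- The date and weather of a visit are recorded once per visit.
    CREATE TABLE visit (
        visit_id   INTEGER PRIMARY KEY,
        field_code TEXT NOT NULL REFERENCES field(field_code),
        visit_date TEXT NOT NULL,
        weather    TEXT
    );
    -- Hypothetical layout: one row per insect per sample.
    CREATE TABLE sample_count (
        visit_id     INTEGER NOT NULL REFERENCES visit(visit_id),
        sample_no    INTEGER NOT NULL,
        insect       TEXT    NOT NULL,
        insect_count INTEGER NOT NULL,
        PRIMARY KEY (visit_id, sample_no, insect)
    );
""")
conn.execute("INSERT INTO field VALUES ('ADhc', 1, 'silt loam')")       # soil type invented
conn.execute("INSERT INTO visit VALUES (1, 'ADhc', 'Aug-11', 'raining')")
conn.executemany(
    "INSERT INTO sample_count VALUES (?, ?, ?, ?)",
    [(1, 1, "beetle", 4), (1, 2, "beetle", 7), (1, 3, "beetle", 0)],  # invented counts
)

# Totals by farm and visit, with the weather available for analysis.
totals = conn.execute("""
    SELECT f.farm, v.visit_date, v.weather, SUM(c.insect_count)
    FROM sample_count AS c
    JOIN visit AS v ON v.visit_id = c.visit_id
    JOIN field AS f ON f.field_code = v.field_code
    GROUP BY f.farm, v.visit_date, v.weather
""").fetchall()
print(totals)  # [(1, 'Aug-11', 'raining', 11)]
```

With this layout, a wrong farm or date can be entered in only one place, and a correction there applies to every sample from that field or visit.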
A database table was designed to exactly match the report in Figure 1-6, with a field for each column. The first year the database worked a treat. The next year the problems started. Can you anticipate them? Some students were permitted to replace one of the papers with one of their own choosing. The table was amended to include columns for option name and option mark. Then some subjects were replaced, but the old ones had to be retained for those students who had taken them in the past. The table became messier, but it could still cope with the data. What the design couldn't handle was students who failed and then reenrolled in a subject. The complete academic record for a student needed to be recorded, and the design of the table made it impossible to record more than one mark if a student completed a subject several times. That problem wasn't noticed until the second year in operation (when the first students started failing). By then, a fair amount of effort had gone into development and data entry. The somewhat curious solution was to create a new table for each year, and then to apply some tortuous logic to extract a student's marks from the appropriate tables. When the original developer left for a new job, several years' worth of data were left in a state that no one else could comprehend. And that's how I got my first database job (and the database coped with changing requirements over several years).
Example 1-4 is particularly good for showing how much trouble you can get into with a poor design. The developer could see the problem only from the point of view of the required report. He thought in terms of one class: Student. In reality, at the very minimum, we have two classes, Student and Subject, and we are interested in the relationship between them. In particular, we would like to know what mark a particular student earned in a particular subject. Chapter 4 will show how an investigation of a Many-Many relationship such as the one between Subject and Student would have led to the introduction of another class, Enrollment. This allows different marks to be recorded for different attempts at a subject. With this approach, the oversight concerning how to deal with a student's failure would have been discovered, and this whole sorry mess would have been avoided.
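As a rough preview of the idea, the three classes Student, Subject, and Enrollment might be laid out as in the sketch below. The column names, the subject code, the years, and the marks are all hypothetical; Chapter 4 develops the actual design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (
        student_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL
    );
    CREATE TABLE subject (
        subject_code TEXT PRIMARY KEY,
        title        TEXT NOT NULL
    );
    -- One row per attempt: the year is part of the key, so a student
    -- can have several marks for the same subject.
    CREATE TABLE enrollment (
        student_id   INTEGER NOT NULL REFERENCES student(student_id),
        subject_code TEXT    NOT NULL REFERENCES subject(subject_code),
        enrol_year   INTEGER NOT NULL,
        mark         INTEGER,
        PRIMARY KEY (student_id, subject_code, enrol_year)
    );
""")
conn.execute("INSERT INTO student VALUES (1, 'Student A')")
conn.execute("INSERT INTO subject VALUES ('SUBJ101', 'Hypothetical Subject')")
conn.executemany(
    "INSERT INTO enrollment VALUES (?, ?, ?, ?)",
    [
        (1, "SUBJ101", 2001, 42),  # first attempt (invented mark)
        (1, "SUBJ101", 2002, 68),  # second attempt after reenrolling
    ],
)

# The complete record for one student in one subject, in order of attempt.
record = conn.execute(
    """
    SELECT enrol_year, mark
    FROM enrollment
    WHERE student_id = ? AND subject_code = ?
    ORDER BY enrol_year
    """,
    (1, "SUBJ101"),
).fetchall()
print(record)  # [(2001, 42), (2002, 68)]
```

Both attempts sit in the same table, so there is no need for a new table each year, and replaced or optional subjects are simply further rows in subject.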
Summary
The first thoughts about how to design a database may be influenced by a particular report or by a particular method of input. Sometimes the driver for a database is simply that some valuable information has come to hand and needs to be put somewhere. The hurried creation of a database or spreadsheet can lead to a design that cannot cope with even simple changes to the information you would like to retrieve. It is important to think carefully about the underlying data, and design the database to reflect the information being stored rather than what you might want to do with the data in the short term.
What problems can you foresee in making good use of this information? Suggest some better ways that this information could be stored.
Exercise 1-2
A small library keeps a roster of who will be at the desk each day. They have a database table as shown in Figure 1-8.
What problems can you foresee in making good use of this information? Suggest some better ways that this information could be stored.