
File Organization

The database is stored as a collection of files. Each file is a sequence of records, and a record is a sequence of fields. One simple approach assumes that the record size is fixed, that each file has records of one particular type only, and that different files are used for different relations. This case is the easiest to implement; variable-length records are considered later.

Fixed Length Records



Simple approach: store record i starting from byte n * (i - 1), where n is the size of each record. Record access is simple, but records may cross block boundaries; a common modification is to not allow records to cross block boundaries. Deletion of record i can be handled in one of three ways:
o move records i + 1, ..., n to positions i, ..., n - 1
o move record n into position i
o do not move records, but link all free records on a free list
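As an illustrative sketch (not part of the original notes), the byte offset of a fixed-length record follows directly from its index; the record size below is an assumed example value.

#include <stdio.h>

/* Sketch: locate record i (1-based) in a file of fixed-length records. */
#define RECORD_SIZE 64   /* n: size of each record in bytes (assumption) */

long record_offset(long i)
{
    /* record i starts at byte n * (i - 1) */
    return RECORD_SIZE * (i - 1);
}

int main(void)
{
    printf("record 3 starts at byte %ld\n", record_offset(3));  /* prints 128 */
    return 0;
}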

Free Lists

Store the address of the first deleted record in the file header. Use this first record to store the address of the second deleted record, and so on. These stored addresses can be thought of as pointers, since they point to the location of a record. A more space-efficient representation reuses the space for the normal attributes of free records to store these pointers (no pointers are stored in in-use records).
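A minimal in-memory sketch of the idea, assuming fixed-length record slots; the field names and sizes are illustrative assumptions, not from the notes.

#define NUM_SLOTS 100

struct record {
    int in_use;                 /* 1 if the slot holds a live record */
    union {
        char data[56];          /* normal attributes of an in-use record          */
        long next_free;         /* for a free record: index of the next free slot */
    } body;                     /* free records reuse attribute space for the pointer */
};

struct file_header {
    long first_free;            /* address (index) of the first deleted record */
};

/* delete record i: push its slot onto the free list */
void delete_record(struct file_header *h, struct record *slots, long i)
{
    slots[i].in_use         = 0;
    slots[i].body.next_free = h->first_free;  /* old head goes into the freed slot */
    h->first_free           = i;              /* header now points at slot i       */
}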

Variable Length Records


Variable length records arise in database systems in several ways: Storage of multiple record types in a file. Record types that allow variable lengths for one or more fields. Record types that allow repeating fields (used in some older data models).

Slotted Page Structure

The slotted page header contains:
o the number of record entries
o the end of free space in the block
o the location and size of each record
Records can be moved around within a page to keep them contiguous with no empty space between them; the entry in the header must then be updated. Pointers should not point directly to a record; instead, they should point to the entry for the record in the header.
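A hedged C sketch of such a header; the field widths are illustrative assumptions.

#include <stdint.h>

struct slot {
    uint16_t offset;    /* location of the record within the page */
    uint16_t length;    /* size of the record in bytes            */
};

struct page_header {
    uint16_t    num_entries;     /* number of record entries             */
    uint16_t    free_space_end;  /* end of free space in the block       */
    struct slot slots[];         /* one entry per record; external        */
                                 /* pointers refer to a slot number,      */
                                 /* not to the record's byte position     */
};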

Organization of Records in Files


o Heap: a record can be placed anywhere in the file where there is space.
o Sequential: records are stored in sequential order, based on the value of the search key of each record. This organization is suitable for applications that require sequential processing of the entire file; the records in the file are ordered by a search key.
o Hashing: a hash function is computed on some attribute of each record; the result specifies in which block of the file the record should be placed.
The records of each relation may be stored in a separate file. In a multi-table clustering file organization, records of several different relations can be stored in the same file.

Sequential File Organization


Deletion: use pointer chains. Insertion: locate the position where the record is to be inserted; if there is free space, insert it there, and if there is no free space, insert the record in an overflow block. In either case, the pointer chain must be updated (a sketch of the chain update appears below). The file needs to be reorganized from time to time to restore sequential order.
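A minimal in-memory sketch of insertion into a pointer-chained sequential file; the structure and function names are assumptions, and the choice between in-block free space and an overflow block is not shown.

struct seq_record {
    int                key;    /* search key                        */
    struct seq_record *next;   /* pointer chain in search-key order */
    /* ... other fields ... */
};

/* Insert rec so that the chain starting at *head stays sorted by key. */
void seq_insert(struct seq_record **head, struct seq_record *rec)
{
    struct seq_record **p = head;
    while (*p != NULL && (*p)->key < rec->key)
        p = &(*p)->next;          /* locate the insertion position    */
    rec->next = *p;               /* splice the record into the chain */
    *p = rec;
}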

Multitable Clustering File Organization


Several relations are stored in one file using a multi-table clustering file organization, e.g. a multi-table clustering organization of the customer and depositor relations:

This is good for queries involving the join depositor ⋈ customer, and for queries involving one single customer and his accounts, but bad for queries involving only customer.

Hashing
A hash function is any well-defined procedure or mathematical function that converts a large, possibly variable-sized amount of data into a small datum, usually a single integer that may serve as an index to an array (cf. associative array). The values returned by a hash function are called hash values, hash codes, hash sums, checksums or simply hashes. Hash functions are mostly used to speed up table lookup or data comparison tasks such as finding items in a database, detecting duplicated or similar records in a large file, finding similar stretches in DNA sequences, and so on.

A hash function may map two or more keys to the same hash value. In many applications, it is desirable to minimize the occurrence of such collisions, which means that the hash function must map the keys to the hash values as evenly as possible.

Fig: A hash function that maps names to integers from 0 to 15; there is a collision between the keys "John Smith" and "Sandra Dee". Hash functions are related to checksums, check digits, fingerprints, randomization functions, error-correcting codes, and cryptographic hash functions.

Static Hashing
A bucket is a unit of storage containing one or more records (typically a disk block). In a hash file organization we obtain the bucket of a record directly from its search-key value using a hash function. The hash function h is a function from the set of all search-key values K to the set of all bucket addresses B. The hash function is used to locate records for access, insertion and deletion. Records with different search-key values may be mapped to the same bucket; thus the entire bucket has to be searched sequentially to locate a record.

Example of Hash File Organization


There are 10 buckets, and the binary representation of the ith character is assumed to be the integer i. The hash function returns the sum of the binary representations of the characters modulo 10, e.g. h(Perryridge) = 5, h(Round Hill) = 3, h(Brighton) = 3. The example uses the account table with schema account(Acc_No, Branch_name, Balance).

Hash file organization of the account file, using branch_name as the key.
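A speculative C sketch of this example hash function, assuming "the integer value of the ith character" means a = 1, b = 2, ..., z = 26 (case-insensitive, non-letters ignored); the mapping and function name are assumptions, though with this mapping the values h(Perryridge) = 5, h(Round Hill) = 3 and h(Brighton) = 3 quoted above do come out.

#include <ctype.h>
#include <stdio.h>

#define NUM_BUCKETS 10

/* Hash a branch name into one of 10 buckets by summing per-character
   integer values modulo 10 (letter-to-integer mapping is an assumption). */
int branch_hash(const char *name)
{
    int sum = 0;
    for (; *name; name++) {
        if (isalpha((unsigned char)*name))
            sum += tolower((unsigned char)*name) - 'a' + 1;
    }
    return sum % NUM_BUCKETS;
}

int main(void)
{
    printf("h(Brighton) = %d\n", branch_hash("Brighton"));   /* prints 3 */
    return 0;
}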

Hash Functions
A hash function maps keys to small integers (buckets). An ideal hash function maps the keys to the integers in a random-like manner, so that bucket values are evenly distributed even if there are regularities in the input data. This process can be divided into two steps:

Map the key to an integer.

Map the integer to a bucket.

We will assume that our keys are either integers, things that can be treated as integers (e.g. characters, pointers), or 1D sequences of such things (lists of integers, strings of characters).

Simple hash functions


The following functions map a single integer key k to a small integer bucket value h(k); m is the size of the hash table (number of buckets).
Division method (Cormen): choose m to be a prime that isn't close to a power of 2, and let h(k) = k mod m. This works badly for many types of patterns in the input data.
Knuth variant on division: h(k) = k(k+3) mod m. This supposedly works much better than the raw division method.
Multiplication method (Cormen): choose m to be a power of 2 and let A be some random-looking real number; Knuth suggests A = 0.5*(sqrt(5) - 1). Then do the following:
s = k*A
x = fractional part of s
h(k) = floor(m*x)

This seems to be the method that the theoreticians like. To do this quickly with integer arithmetic, let w be the number of bits in a word (e.g. 32) and suppose m is 2^p. Then compute:
s = floor(A * 2^w)
x = k*s
h(k) = x >> (w-p)    // i.e. right shift x by (w-p) bits,
                     // i.e. extract the p most significant bits from x
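A hedged C sketch of this integer multiplication method; the word size w = 32, the table size m = 2^10, and the constant derived from Knuth's A = 0.5*(sqrt(5) - 1) are assumptions chosen for illustration.

#include <stdint.h>
#include <stdio.h>

#define W 32                        /* bits per word (assumption)           */
#define P 10                        /* m = 2^P = 1024 buckets (assumption)  */

/* s = floor(A * 2^w) with A = 0.5*(sqrt(5)-1) ~ 0.6180339887 gives 2654435769 */
static const uint32_t S = 2654435769u;

uint32_t mult_hash(uint32_t k)
{
    uint32_t x = k * S;             /* x = k*s, keeping the low w bits        */
    return x >> (W - P);            /* extract the p most significant bits    */
}

int main(void)
{
    printf("%u\n", mult_hash(123456789u));
    return 0;
}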

Hashing sequences of characters


The hash functions in this section take a sequence of integers k = k1, ..., kn and produce a small integer bucket value h(k). m is the size of the hash table (number of buckets), which should be a prime number. The sequence of integers might be a list of integers or it might be an array of characters (a string). The specific tuning of the following algorithms assumes that the integers are all, in fact, character codes. In C++, a character is a char variable, which is an 8-bit integer. ASCII uses only 7 of these 8 bits. Of those 7, the common characters (alphabetic and numeric) use only the low-order 6 bits, and the first of those 6 bits primarily indicates the case of characters, which is relatively insignificant. So the following algorithms concentrate on preserving as much information as possible from the last 5 bits of each number, and make less use of the first 3 bits.

When using the following algorithms, the inputs ki must be unsigned integers; feeding them signed integers may result in odd behavior. For each of these algorithms, let h be the output value. Set h to 0, then walk down the sequence of integers, adding the integers one by one to h. The algorithms differ in exactly how an integer ki is combined with h. The final return value is h mod m. CRC variant: do a 5-bit left circular shift of h, then XOR in ki. Specifically:
highorder = h & 0xf8000000    // extract high-order 5 bits from h
                              // 0xf8000000 is the hexadecimal representation
                              // for the 32-bit number with the first five
                              // bits = 1 and the other bits = 0
h = h << 5                    // shift h left by 5 bits
h = h ^ (highorder >> 27)     // move the high-order 5 bits to the low-order
                              // end and XOR into h
h = h ^ ki                    // XOR h and ki
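Putting the pieces together, a self-contained C sketch of the CRC-variant string hash might look as follows; the table size m and the function name are assumptions for illustration.

#include <stdint.h>
#include <stdio.h>

#define M 101   /* table size: should be prime (assumption for the example) */

/* CRC variant: 5-bit left circular shift of h, then XOR in each character. */
uint32_t crc_variant_hash(const unsigned char *key)
{
    uint32_t h = 0;
    for (; *key; key++) {
        uint32_t highorder = h & 0xf8000000u;  /* high-order 5 bits of h    */
        h = h << 5;                            /* shift h left by 5 bits    */
        h = h ^ (highorder >> 27);             /* rotate those 5 bits back  */
        h = h ^ *key;                          /* XOR in the character      */
    }
    return h % M;                              /* final return value: h mod m */
}

int main(void)
{
    printf("%u\n", crc_variant_hash((const unsigned char *)"Perryridge"));
    return 0;
}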

PJW hash (Aho, Sethi, and Ullman pp. 434-438): Left shift h by 4 bits. Add in ki. Move the top 4 bits of h to the bottom. Specifically:
// The top 4 bits of h are all zero
h = (h << 4) + ki       // shift h 4 bits left, add in ki
g = h & 0xf0000000      // get the top 4 bits of h
if (g != 0)             // if the top 4 bits aren't zero,
    h = h ^ (g >> 24)   // move them to the low end of h
    h = h ^ g           // The top 4 bits of h are again all zero

PJW and the CRC variant both work well and there's not much difference between them. We believe that the CRC variant is probably slightly better because:
o It uses all 32 bits; PJW uses only 24 bits. This is probably not a major issue since the final value m will be much smaller than either.
o 5 bits is probably a better shift value than 4, though shifts of 3, 4, and 5 bits are all supposed to work OK.
o Combining values with XOR is probably slightly better than adding them. However, again, the difference is slight.

BUZ hash: Set up a function R that takes 8-bit character values and returns random numbers. This function can be pre-computed and stored in an array. Then, to add each character ki to h, do a 1-bit left circular shift of h and then XOR in the random value for ki. That is:
highorder = h & 0x80000000    // extract high-order bit from h
h = h << 1                    // shift h left by 1 bit
h = h ^ (highorder >> 31)     // move it to the low-order end and XOR into h
h = h ^ R[ki]                 // XOR h and the random value for ki
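As a hedged sketch, the table R of random values can be precomputed once and the per-character step applied in a loop; the table size, the function names and the use of rand() for seeding are illustrative assumptions.

#include <stdint.h>
#include <stdlib.h>

#define M 101                       /* table size (assumption)                */
static uint32_t R[256];             /* one random value per 8-bit character   */

void buz_init(unsigned seed)        /* precompute the random table once       */
{
    srand(seed);
    for (int i = 0; i < 256; i++)
        R[i] = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
}

uint32_t buz_hash(const unsigned char *key)
{
    uint32_t h = 0;
    for (; *key; key++) {
        uint32_t highorder = h & 0x80000000u;  /* high-order bit of h          */
        h = h << 1;                            /* shift h left by 1 bit        */
        h = h ^ (highorder >> 31);             /* rotate that bit to the end   */
        h = h ^ R[*key];                       /* XOR in the random value      */
    }
    return h % M;
}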

Handling of Bucket Overflows


Bucket overflow can occur because of insufficient buckets or because of skew in the distribution of records. Skew can occur for two reasons: multiple records have the same search-key value, or the chosen hash function produces a non-uniform distribution of key values. Although the probability of bucket overflow can be reduced, it cannot be eliminated; it is handled by using overflow buckets. In overflow chaining, the overflow buckets of a given bucket are chained together in a linked list. The above scheme is called closed hashing. An alternative, called open hashing, which does not use overflow buckets, is not suitable for database applications.
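A minimal C sketch of overflow chaining, assuming fixed-capacity buckets; the capacity and field names are illustrative assumptions.

#define BUCKET_CAPACITY 4           /* records per bucket (assumption) */

struct record { int key; /* ... other fields ... */ };

struct bucket {
    int            count;                    /* records currently stored   */
    struct record  recs[BUCKET_CAPACITY];
    struct bucket *overflow;                 /* chain of overflow buckets  */
};

/* Search every bucket in the chain for the given key (closed hashing). */
struct record *bucket_lookup(struct bucket *b, int key)
{
    for (; b != NULL; b = b->overflow)
        for (int i = 0; i < b->count; i++)
            if (b->recs[i].key == key)
                return &b->recs[i];
    return NULL;
}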

Indexing
Indexing mechanisms are used to speed up access to desired data, e.g. the author catalog in a library. A search key is an attribute or set of attributes used to look up records in a file. An index file consists of records (called index entries) of the form (search-key, pointer). Index files are typically much smaller than the original file. There are two basic kinds of indices:
o Ordered indices: search keys are stored in sorted order.

o Hash indices: search keys are distributed uniformly across buckets using a hash function.

Indexes can also be characterized as dense or sparse.

A dense index has an index entry for every search key value (and hence every record) in the data file. A sparse (or nondense) index, on the other hand, has index entries for only some of the search values.

On the basis of the number of levels, indexes can be classified into two types: single-level ordered indexes and multilevel indexes.

Single level Ordered Indexes


Single-level ordered indexes can be classified into three types: primary indexes, clustering indexes and secondary indexes.

Primary Index
A primary index is an ordered file whose records are of fixed length with two fields. The first field is of the same data type as the ordering key field, called the primary key of the data file, and the second field is a pointer to a disk block (a block address). There is one index entry (or index record) in the index file for each block in the data file. Each index entry has the value of the primary key field for the first record in a block and a pointer to that block as its two field values. The following figure illustrates a primary index. The total number of entries in the index is the same as the number of disk blocks in the ordered data file. The first record in each block of the data file is called the anchor record of the block, or simply the block anchor.
Fig: A primary index on Roll_no. The index file contains one entry per data block: (1, pointer to block 1), (5, pointer to block 2), (9, pointer to block 3). Block 1 of the data file holds the records with Roll_no 1-4 (names Xxx, Yyy, Zzz, www), block 2 holds Roll_no 5-8 (Aaa, bbb, Ccc, ddd), and block 3 holds Roll_no 9-12 (Eee, Fff, Ggg, hhh); the primary key value in each index entry is the Roll_no of the block anchor.
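A hedged C sketch of how such a primary index would be used: binary-search the index on the block-anchor key, then read the single data block it points to. The structures are assumptions for illustration, and the function assumes at least one entry and a key no smaller than the first anchor.

struct index_entry { int anchor_key; int block_no; };

/* Return the block number whose anchor key is the largest one <= key.
   n is the number of index entries (one per data block). */
int primary_index_lookup(const struct index_entry *idx, int n, int key)
{
    int lo = 0, hi = n - 1, ans = idx[0].block_no;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (idx[mid].anchor_key <= key) {
            ans = idx[mid].block_no;   /* candidate block; keep searching right */
            lo = mid + 1;
        } else {
            hi = mid - 1;
        }
    }
    return ans;   /* the record, if present, lies in this block */
}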

Clustering Index
If the records of a file are physically ordered on a non-key field which does not have a distinct value for each record, that field is called the clustering field. We can create a different type of index, called a clustering index, to speed up retrieval of records that have the same value for the clustering field. This differs from a primary index, which requires that the ordering field of the data file have a distinct value for each record. A clustering index is also an ordered file with two fields; the first field is of the same type as the clustering field of the data file, and the second field is a block pointer. There is one entry in the clustering index for each distinct value of the clustering field, containing the value and a pointer to the first block in the data file that has a record with that value for its clustering field. The following figure illustrates a clustering index.
Fig: A clustering index on the non-key field City. There is one index entry per distinct city value (ASN, DGP, KOL), each with a pointer to the first data block containing records with that city: the ASN block holds the records named Aaa, Bbb, Ccc and ddd, the DGP block holds Xxx, Yyy, Zzz and www, and the KOL block holds eee, Fff, Ggg and hhh.

Secondary Index
A secondary index is also an ordered file with two fields. The first field is of the same data type as some non-ordering field of the data file that is an indexing field, and the second field is either a block pointer or a record pointer. The secondary field may be a key or a non-key. In the case of a key secondary field, the field is also called a secondary key. Here there is one index entry for each record in the data file, which contains the value of the secondary key for the record and a pointer either to the block in which the record is stored or to the record itself. Hence, such an index is dense. The following figure illustrates a dense secondary index.
Fig: A dense secondary index on the key field Reg_no. The data file is ordered on Roll_no (Roll_no 1-6 carrying Reg_no values 5, 1, 6, 3, 4, 2), while the index lists the Reg_no values 1-6 in sorted order, each with a record pointer to the record carrying that Reg_no.

In the case of a non-key secondary field, numerous records in the data file can have the same value for the indexing field. Here, we create an extra level of indirection to handle the multiple pointers. In this non-dense scheme, the pointer in the index file points to a block of record pointers; each record pointer in that block points to one of the data file records with that value for the indexing field. The following figure illustrates this scheme.
Fig: A secondary index on the non-key field Year, with one level of indirection. The index has one entry per Year value (1, 2, 3), and each entry's block pointer leads to a block of record pointers; every record pointer in that block points to a data record with that Year value (the data records with Roll_no 4, 5, 6 have Year 3, 1, 2 and the records with Roll_no 7, 8, 10 have Year 2, 1, 3).
Multilevel Index

If the primary index does not fit in memory, access becomes expensive. The solution is to treat the primary index kept on disk as a sequential file and construct a sparse index on it: the outer index is a sparse index of the primary index, and the inner index is the primary index file itself. If even the outer index is too large to fit in main memory, yet another level of index can be created, and so on. Indices at all levels must be updated on insertion into or deletion from the file.

Embedded SQL
Embedded SQL is a method of combining the computing power of a programming language and the database manipulation capabilities of SQL. Embedded SQL statements are SQL statements written inline with the program source code of the host language. The embedded SQL statements are parsed by an embedded SQL preprocessor and replaced by host-language calls to a code library. The output from the preprocessor is then compiled by the host compiler. This allows programmers to embed SQL statements in programs written in any number of languages, such as C/C++, COBOL and Fortran. Thus embedded SQL provides the 3GL with a way to manipulate a database (a sketch in C follows the list below), supporting:

o highly customized applications
o background applications running without user intervention
o database manipulation which exceeds the abilities of simple SQL
o applications linking to Oracle packages, e.g. forms and reports
o applications which need customized window interfaces
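A hedged sketch of embedded SQL in a C host program; it must be run through an embedded SQL preprocessor (e.g. Oracle Pro*C or PostgreSQL's ecpg) before the C compiler, and the emp table, its columns and the host variable names are assumptions for illustration.

#include <stdio.h>

EXEC SQL INCLUDE SQLCA;                 /* SQL communications area            */

int main(void)
{
    EXEC SQL BEGIN DECLARE SECTION;     /* host variables shared with SQL     */
    char h_name[31];
    int  h_sal;
    EXEC SQL END DECLARE SECTION;

    /* Connection details omitted; they depend on the precompiler in use. */

    /* Embedded SQL statement: results are placed into the host variables. */
    EXEC SQL SELECT name, sal
             INTO :h_name, :h_sal
             FROM emp
             WHERE dno = 10;            /* emp and dno are assumed names      */

    printf("%s earns %d\n", h_name, h_sal);
    return 0;
}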

Query Optimization
Given a query, there are many plans that a database management system (DBMS) can follow to process it and produce its answer. All plans are equivalent in terms of their final output but vary in their cost, i.e., the amount of time that they need to run. Query optimization is the procedure of finding the plan that needs the least amount of time. Such query optimization is absolutely necessary in a DBMS, because the cost difference between two alternatives can be enormous. For example, consider the following database schema:
emp(name, age, sal, dno)
dept(dno, dname, floor, budget, mgr, ano)
acnt(ano, type, balance, bno)
bank(bno, bname, address)
Further, consider the following very simple SQL query:
select name, floor from emp, dept where emp.dno = dept.dno and sal > 100K
Assume the characteristics below for the database contents, structure, and run-time environment:
Number of emp pages: 20000
Number of emp tuples: 100000
Number of emp tuples with sal > 100K: 10
Number of dept pages: 10
Number of dept tuples: 100
Indices of emp: clustered B+-tree on emp.sal (3 levels deep)
Indices of dept: clustered hashing on dept.dno (average bucket length of 1.2 pages)
Number of buffer pages: 3
Cost of one disk page access: 20 ms
Consider the following three different plans:
P1 -- Through the B+-tree, find all tuples of emp that satisfy the selection on emp.sal. For each one, use the hashing index to find the corresponding dept tuples. (Nested loops, using the index on both relations.)
P2 -- For each dept page, scan the entire emp relation. If an emp tuple agrees on the dno attribute with a tuple on the dept page and satisfies the selection on emp.sal, then the emp-dept tuple pair appears in the result. (Page-level nested loops, using no index.)
P3 -- For each dept tuple, scan the entire emp relation and store all emp-dept tuple pairs. Then, scan this set of pairs and, for each one, check if it has the same values in the two dno attributes and satisfies the selection on emp.sal. (Tuple-level formation of the cross product, with a subsequent scan to test the join and the selection.)
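The cost figures quoted below can be roughly reconstructed from the parameters above; the following accounting is a hedged sketch, not taken from the original text. For P1, the B+-tree descent costs about 3 page accesses, the 10 qualifying emp tuples lie on roughly one clustered page, and the 10 hash probes into dept cost about 10 * 1.2 = 12 pages, for a total of roughly 3 + 1 + 12 = 16 accesses, i.e. 16 * 20 ms = 0.32 s. For P2, each of the 10 dept pages is paired with a full scan of the 20000 emp pages, giving about 10 + 10 * 20000 = 200010 accesses, i.e. roughly 4000 s, a bit more than an hour. For P3, each of the 100 dept tuples triggers a full scan of the 20000 emp pages before the cross product is even written out and re-read, which already exceeds 2,000,000 accesses (over 11 hours of I/O) and pushes the total past a day.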

Calculating the expected I/O costs of these three plans shows the tremendous difference in efficiency that equivalent plans may have: P1 needs 0.32 seconds, P2 needs a bit more than an hour, and P3 needs more than a whole day. Without query optimization, a system may choose plan P2 or P3 to execute this query, with devastating results. Query optimizers, however, examine all alternatives, so they should have no trouble choosing P1 to process the query. The path that a query traverses through a DBMS until its answer is generated is shown in Figure 1. The system modules through which it moves have the following functionality:
o The Query Parser checks the validity of the query and then translates it into an internal form, usually a relational calculus expression or something equivalent.
o The Query Optimizer examines all algebraic expressions that are equivalent to the given query and chooses the one that is estimated to be the cheapest.
o The Code Generator or the Interpreter transforms the access plan generated by the optimizer into calls to the query processor.
o The Query Processor actually executes the query.

Database security
Database security concerns the use of a broad range of information security controls to protect databases (potentially including the data, the database applications or stored functions, the database systems, the database servers and the associated network links) against compromises of their confidentiality, integrity and availability. It involves various types or categories of controls, such as technical, procedural/administrative and physical. Database security is a specialist topic within the broader realms of computer security, information security and risk management. Security risks to database systems include, for example:

o Unauthorized or unintended activity or misuse by authorized database users, database administrators, or network/systems managers, or by unauthorized users or hackers (e.g. inappropriate access to sensitive data, metadata or functions within databases, or inappropriate changes to the database programs, structures or security configurations);
o Malware infections causing incidents such as unauthorized access, leakage or disclosure of personal or proprietary data, deletion of or damage to the data or programs, interruption or denial of authorized access to the database, attacks on other systems and the unanticipated failure of database services;
o Overloads, performance constraints and capacity issues resulting in the inability of authorized users to use databases as intended;
o Physical damage to database servers caused by computer room fires or floods, overheating, lightning, accidental liquid spills, static discharge, electronic breakdowns/equipment failures and obsolescence;
o Design flaws and programming bugs in databases and the associated programs and systems, creating various security vulnerabilities (e.g. unauthorized privilege escalation), data loss/corruption, performance degradation etc.;
o Data corruption and/or loss caused by the entry of invalid data or commands, mistakes in database or system administration processes, sabotage/criminal damage etc.

Many layers and types of information security control are appropriate to databases, including:

o Access control
o Auditing
o Authentication
o Encryption
o Integrity controls
o Backups
o Application security

Traditionally, databases have been largely secured against hackers through network security measures such as firewalls and network-based intrusion detection systems. While network security controls remain valuable in this regard, securing the database systems themselves, and the programs/functions and data within them, has arguably become more critical as networks are increasingly opened to wider access, in particular access from the Internet. Furthermore, system, program, function and data access controls, along with the associated user identification, authentication and rights management functions, have always been important to limit and in some cases log the activities of authorized users and administrators. In other words, these are complementary approaches to database security, working from both the outside-in and the inside-out, as it were. Many organizations develop their own "baseline" security standards and designs detailing basic security control measures for their database systems. These may reflect general information security requirements or obligations imposed by corporate information security policies and applicable laws and regulations (e.g. concerning privacy, financial management and reporting systems), along with generally accepted good database security practices (such as appropriate hardening of the underlying systems) and perhaps security recommendations from the relevant database system and software vendors. The security designs for specific database systems typically specify further security administration and management functions (such as administration and reporting of user access rights, log management and analysis, database replication/synchronization and backups), along with various business-driven information security controls within the database programs and functions (e.g. data entry validation and audit trails). Furthermore, various security-related activities (manual controls) are normally incorporated into the procedures, guidelines etc. relating to the design, development, configuration, use, management and maintenance of databases.
