
File Organization, Indexing and Hashing

A database consists of a huge amount of data. In an RDBMS the data is grouped into tables, and
each table holds related records. A user sees the data in the form of tables, but physically
this huge amount of data is stored on disk in the form of files.
File – A file is a named collection of related information that is recorded on secondary storage
such as magnetic disks, magnetic tapes and optical disks.

File Organization
File organization refers to the logical relationships among the various records that constitute the file,
particularly with respect to the means of identification and access to any specific record. In
simple terms, storing the records of a file in a certain order is called file organization. File structure refers
to the format of the label and data blocks and of any logical control record.

Types of File Organizations –

Various methods have been introduced to organize files. Each method has
advantages and disadvantages depending on how records are accessed or selected. It is therefore up to the
programmer to choose the file organization method best suited to the requirements.
Some types of file organization are:
 Sequential File Organization
 Heap File Organization
 Hash File Organization
 B+ Tree File Organization
 Clustered File Organization

Sequential File Organization –

The simplest method of file organization is the sequential method. In this method the records are
stored one after another in a sequential manner. There are two ways to implement this method:
1. Pile File Method – This method is quite simple: records are stored in a
sequence, i.e. one after another, in the order in which they are inserted into the table.

Insertion of new record –


Let R1, R3 and so on up to R5 and R4 be the records in the sequence. Here, a record is
nothing but a row in a table. Suppose a new record R2 has to be inserted in the sequence;
it is simply placed at the end of the file.

2. Sorted File Method – As the name suggests, whenever a new record
has to be inserted, it is placed so that the file stays in sorted (ascending or descending) order.
Records may be sorted on the primary key or on any other key.

Insertion of new record –


Let us assume that there is a pre-existing sorted sequence of records R1, R3 and so on
up to R7 and R8. Suppose a new record R2 has to be inserted in the sequence; it is first
placed at the end of the file, and the sequence is then sorted again.
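The two insertion strategies can be sketched in a few lines. This is a minimal in-memory illustration, not how a DBMS lays out records on disk; the helper names `pile_insert` and `sorted_insert` are assumptions made for the example, and `bisect.insort` is used because it gives the same final order as appending and re-sorting.

```python
import bisect

def pile_insert(records, new_record):
    """Pile file: append the new record at the end, in arrival order."""
    records.append(new_record)

def sorted_insert(records, new_record):
    """Sorted file: place the new record so the file stays in key order."""
    bisect.insort(records, new_record)

# Pile file: R2 simply lands at the end.
pile = ["R1", "R3", "R5", "R4"]
pile_insert(pile, "R2")

# Sorted file: R2 ends up between R1 and R3.
sorted_file = ["R1", "R3", "R7", "R8"]
sorted_insert(sorted_file, "R2")

print(pile)
print(sorted_file)
```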

Pros and Cons of Sequential File Organization –


Pros –
 Fast and efficient when large amounts of data are processed in sequence.
 Simple design.
 Files can easily be stored on magnetic tapes, i.e. a cheaper storage mechanism.
Cons –
 Time is wasted because we cannot jump directly to the record that is required; we have to
move through the file sequentially until we reach it.
 The sorted file method is inefficient, as it takes time and space to sort the records.

Heap File Organization –

Heap file organization works with data blocks. In this method records are inserted at the end of
the file, into the data blocks. No sorting or ordering is required. If a data block is
full, the new record is stored in some other block. This other data block need not be the very
next data block; it can be any block in the memory. It is the responsibility of the DBMS to store
and manage the new records.

Insertion of new record –


Suppose we have five records in the heap, R1, R5, R6, R4 and R3, and a new record R2
has to be inserted. Since the last data block, i.e. data block 3, is full, R2 will be
inserted in any of the data blocks selected by the DBMS, let's say data block 1.

If we want to search, delete or update data in a heap file organization, we have to traverse the data
from the beginning of the file until we reach the requested record. Thus if the database is very large,
searching, deleting or updating a record takes a lot of time.
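The behaviour described above can be sketched as follows. The block capacity of 2 and the "first block with free space" placement policy are assumptions made for illustration; a real DBMS chooses the block by its own rules.

```python
BLOCK_CAPACITY = 2  # hypothetical number of records per data block

def heap_insert(blocks, record):
    """Store the record in any block with free space; add a block if none has room."""
    for block in blocks:
        if len(block) < BLOCK_CAPACITY:
            block.append(record)
            return
    blocks.append([record])

def heap_search(blocks, record):
    """Scan from the start of the file; return (block_no, slot) or None."""
    for block_no, block in enumerate(blocks):
        for slot, stored in enumerate(block):
            if stored == record:
                return block_no, slot
    return None

# Data blocks 1..3 are stored at list positions 0..2; the last block is full.
blocks = [["R1"], ["R5", "R6"], ["R4", "R3"]]
heap_insert(blocks, "R2")            # lands in data block 1, which has room
print(blocks)
print(heap_search(blocks, "R3"))     # found only after scanning every earlier record
```

Because there is no ordering, `heap_search` has to look at every record in the worst case, which is why the method scales poorly.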
Pros and Cons of Heap File Organization –
Pros –
 Fetching and retrieving records is faster than in a sequential file, but only for small
databases.
 When a large amount of data needs to be loaded into the database at one time,
this method of file organization is best suited.
Cons –
 Problem of unused memory blocks.
 Inefficient for larger databases.

B+ Tree File Organization-


 B+ tree file organization is an advanced form of the indexed sequential access method.
It uses a tree-like structure to store records in a file.
 It uses the same concept of key-index, where the primary key is used to sort the records.
For each primary key, an index value is generated and mapped to the record.
 The B+ tree is similar to a binary search tree (BST), but it can have more than two
children. In this method, all the records are stored only at the leaf node. Intermediate
nodes act as a pointer to the leaf nodes. They do not contain any records.

The above B+ tree shows that:

 There is one root node of the tree, i.e., 25.


 There is an intermediary layer with nodes. They do not store the actual record. They have
only pointers to the leaf node.
 The node to the left of the root holds values smaller than the root and the node to the
right holds values larger than the root, i.e., 15 and 30 respectively.
 There is only one leaf level, whose nodes hold only values, i.e., 10, 12, 17, 20, 24, 27 and 29.
 Searching for any record is easier as all the leaf nodes are at the same depth.
 In this method, any record can be reached by traversing a single path from the root.

Pros of B+ tree file organization

 In this method, searching becomes very easy as all the records are stored only in the leaf
nodes, which are sorted and linked as a sequential list.
 Traversing through the tree structure is easier and faster.
 The size of the B+ tree has no restrictions, so the number of records can increase or
decrease and the B+ tree structure can grow or shrink accordingly.
 It is a balanced tree structure, so inserts, updates and deletes do not degrade the
performance of the tree.

Cons of B+ tree file organization

 This method is inefficient for static tables, i.e. tables whose data rarely changes.


Cluster file organization-
 When records from two or more tables are stored in the same file, it is known as a cluster. These
files will have two or more tables in the same data block, and the key attributes which are
used to map these tables together are stored only once.
 This method reduces the cost of searching for related records in different files.
 The cluster file organization is used when there is a frequent need for joining the tables
with the same condition. These joins will give only a few records from both tables. In the
given example, we are retrieving the records for only particular departments. This method
can't be used to retrieve the records for all departments.

In this method, we can directly insert, update or delete any record. Data is sorted based on the
key with which searching is done. The cluster key is the key on which the tables are
joined.

Types of Cluster file organization:

Cluster file organization is of two types:

1. Indexed Clusters:
In an indexed cluster, records are grouped based on the cluster key and stored together. The above
EMPLOYEE and DEPARTMENT relationship is an example of an indexed cluster. Here, all the
records are grouped based on the cluster key DEP_ID.

2. Hash Clusters:

It is similar to the indexed cluster. In a hash cluster, instead of storing the records based on the
cluster key itself, we generate a hash key value from the cluster key and store together the records
that have the same hash key value.
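The difference between the two cluster types can be sketched with small in-memory structures. The sample rows, the function names and the `DEP_ID mod 4` hash are all hypothetical choices made for this illustration.

```python
from collections import defaultdict

# Hypothetical rows: DEPARTMENT(DEP_ID, DEP_NAME) and EMPLOYEE(EMP_ID, EMP_NAME, DEP_ID).
departments = [(10, "Sales"), (20, "HR")]
employees = [(1, "Asha", 10), (2, "Ben", 20), (3, "Chen", 10)]

def build_indexed_cluster(departments, employees):
    """Group rows of both tables under each DEP_ID; the key is stored only once."""
    clusters = {dep_id: {"DEP_NAME": name, "EMPLOYEES": []}
                for dep_id, name in sorted(departments)}
    for emp_id, emp_name, dep_id in employees:
        clusters[dep_id]["EMPLOYEES"].append((emp_id, emp_name))
    return clusters

def build_hash_cluster(departments, employees, num_buckets=4):
    """Group rows by a hash of DEP_ID instead of by DEP_ID itself."""
    buckets = defaultdict(list)
    for dep_id, name in departments:
        buckets[dep_id % num_buckets].append(("DEPARTMENT", dep_id, name))
    for emp_id, emp_name, dep_id in employees:
        buckets[dep_id % num_buckets].append(("EMPLOYEE", emp_id, emp_name, dep_id))
    return dict(buckets)

indexed = build_indexed_cluster(departments, employees)
hashed = build_hash_cluster(departments, employees)
print(indexed[10])        # the Sales department together with its employees
print(hashed[10 % 4])     # every row whose DEP_ID hashes to the same bucket
```

In both cases a join on DEP_ID reads one group instead of searching two separate files.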

Pros of Cluster file organization

 The cluster file organization is used when there is a frequent request for joining the tables
with same joining condition.
 It gives efficient results when there is a 1:M mapping between the tables.

Cons of Cluster file organization

 This method performs poorly for very large databases.
 If there is any change in the joining condition, this method cannot be used. If we change the
joining condition, traversing the file takes a lot of time.
 This method is not suitable for tables with a 1:1 mapping.

Indexing in Databases

Indexing is a way to optimize the performance of a database by minimizing the number of disk
accesses required when a query is processed.
An index or database index is a data structure which is used to quickly locate and access the data
in a database table.
Indexes are created using some database columns.
 The first column is the Search key that contains a copy of the primary key or candidate
key of the table. These values are stored in sorted order so that the corresponding data can
be accessed quickly (Note that the data may or may not be stored in sorted order).
 The second column is the Data Reference which contains a set of pointers holding the
address of the disk block where that particular key value can be found.

There are two kinds of indices:


1. Ordered indices: Indices are based on a sorted ordering of the values.
2. Hash indices: Indices are based on the values being distributed uniformly across a range
of buckets. The bucket to which a value is assigned is determined by a function called a
hash function.

On what factors must ordered indexing and hashing be evaluated?


 Access Types: e.g. value based search, range access, etc.
 Access Time: Time to find particular data element or set of elements.
 Insertion Time: Time taken to find the appropriate space and insert a new data item.
 Deletion Time: Time taken to find an item and delete it as well as update the index
structure.
 Space Overhead: Additional space required by the index.

Indexing Methods

Ordered Indices
The indices are usually sorted so that searching is faster. Indices which are sorted are
known as ordered indices.
 If the search key of an index specifies the same order as the sequential order of the file, it is
known as a primary index or clustering index. If the search key of an index specifies an
order different from the sequential order of the file, it is called a secondary index or
non-clustering index.

Clustered Indexing
 A clustering index is defined on an ordered data file. The data file is ordered on a non-key
field. In some cases, the index is created on non-primary key columns which may not be
unique for each record.
 In such cases, in order to identify the records faster, we group two or more columns
together to get unique values and create an index out of them. This method is known as a
clustering index. Basically, records with similar characteristics are grouped together and
indexes are created for these groups.
 For example, students studying in each semester are grouped together, i.e. 1st semester
students, 2nd semester students, 3rd semester students, etc. are grouped.

Clustered index sorted according to first name (Search key)


Primary Index
In this case, the data is sorted according to the search key. It induces sequential file organization.
In this case, the primary key of the database table is used to create the index. As primary keys are
unique and are stored in sorted manner, the performance of searching operation is quite efficient.
The primary index is classified into two types: dense index and sparse index.

(I) Dense Index :


 For every search key value in the data file, there is an index record.
 This record contains the search key and also a reference to the first data record with that
search key value.

(II) Sparse Index :


 The index record appears only for a few items in the data file. Each item points to a block
as shown.
 To locate a record, we find the index record with the largest search key value less than or
equal to the search key value we are looking for.
 We start at the record pointed to by that index record, and proceed along the pointers in
the file (that is, sequentially) until we find the desired record.
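The sparse-index lookup steps can be sketched as below. The sample data file and the choice of one index entry per block (holding the first key of that block) are assumptions made for the example.

```python
import bisect

# Data file: blocks of (key, record) pairs, sorted by key.
blocks = [
    [(10, "a"), (20, "b"), (30, "c")],
    [(40, "d"), (50, "e"), (60, "f")],
    [(70, "g"), (80, "h"), (90, "i")],
]

# Sparse index: one entry per block, holding the first key stored in that block.
sparse_keys = [block[0][0] for block in blocks]

def sparse_lookup(key):
    """Find the record for key, or None if it is not in the file."""
    # Index entry with the largest search key <= the key we are looking for.
    pos = bisect.bisect_right(sparse_keys, key) - 1
    if pos < 0:
        return None  # key is smaller than every key in the file
    # Scan sequentially inside the block that the index entry points to.
    for k, record in blocks[pos]:
        if k == key:
            return record
    return None

print(sparse_lookup(50))
print(sparse_lookup(55))
```

A dense index would instead hold one entry for every key, trading the sequential scan for a larger index.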

 
Non-Clustered Indexing
 A non-clustered index just tells us where the data lies, i.e. it gives us a list of virtual
pointers or references to the location where the data is actually stored. Data is not
physically stored in the order of the index.
 Instead, the leaf nodes of the index hold references to the data. An example is the contents
page of a book: each entry gives us the page number or location of the information stored.
 The actual data here (the information on each page of the book) is not organised, but we have an
ordered reference (the contents page) to where the data actually lies.

 It requires more time than a clustered index because some extra work
is done to fetch the data by following the pointer. In the case of a clustered
index, the data is stored directly alongside the index.
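A non-clustered index can be pictured as an ordered map from a column value to the positions of the matching rows, while the rows themselves stay in their original order. The sample rows and function names below are hypothetical.

```python
# Data file in arrival order: the rows are NOT sorted by city.
rows = [
    {"id": 3, "name": "Asha", "city": "Pune"},
    {"id": 1, "name": "Ben", "city": "Delhi"},
    {"id": 2, "name": "Chen", "city": "Pune"},
]

def build_secondary_index(rows, column):
    """Map each column value to the list of row positions that hold it."""
    index = {}
    for position, row in enumerate(rows):
        index.setdefault(row[column], []).append(position)
    # Keep the index entries themselves in sorted order.
    return dict(sorted(index.items()))

def lookup(rows, index, value):
    """Follow the stored positions to fetch the actual rows (the extra step)."""
    return [rows[position] for position in index.get(value, [])]

city_index = build_secondary_index(rows, "city")
print(city_index)
print([row["name"] for row in lookup(rows, city_index, "Pune")])
```

The second step, following each position back into `rows`, is the extra work that makes this slower than a clustered index.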
Secondary Index
 It is used to optimize query processing and access records in a database using some
information other than the usual search key (primary key). Here, two levels of indexing
are used in order to reduce the mapping size of the first level.
 Initially, for the first level, a large range of numbers is selected so that the mapping size
is small. Each range is then divided into smaller sub-ranges.
 For quick memory access, the first level is stored in primary memory. The actual
physical location of the data is determined by the second mapping level.
Difference between B Tree and B+ Tree Index Files

Compare the examples of B+ tree index files and B tree index files
above. They look almost the same, but the small differences between them
have a large effect on database performance.

B Tree Index Files –
 Structure: A balanced multiway search tree similar to the B+ tree, except that every node,
internal or leaf, holds records along with its keys. Hence there is no need to traverse down
to a leaf node to get the data when the key is found higher up.
 Shape: Because records take up space in every node, each node holds fewer keys and has fewer
children than a B+ tree node of the same size, so the tree has more height compared to width.
 Records are in sorted order.
Advantages –
 It might have fewer nodes than a B+ tree, as each node holds data.
 Since each node holds records, a search may not need to go all the way to a leaf node.
Disadvantages –
 If the tree is very big, we have to traverse through most of its levels to get the records.
Only a few records can be fetched at the intermediary nodes or near the root. Hence this method
might be slower.
 Since each node holds data and therefore has fewer child nodes, the tree does not
spread out much. Its depth/height increases as the number of records
increases. As the height of the tree increases, the I/O also increases and
hence the performance decreases.
 Insertion and deletion rearrange nodes as in a B+ tree, but the rearrangement is more
complicated because records held in internal nodes have to be moved as well.
 Implementation of a B tree is a little more difficult than that of a B+ tree.
 These disadvantages cannot be ignored, as they strongly affect the
performance of the file.

B+ Tree Index Files –
 Structure: A balanced tree with intermediary nodes and leaf nodes. Intermediary nodes
contain only keys and pointers that lead to the leaf nodes. All leaf nodes hold records
and all are at the same distance from the root.
 Shape: More width compared to height. Each intermediary node can have n/2 to n
children; only the root node may have as few as 2 children. A leaf node stores (n-1)/2 to n-1 values.
 As the number of intermediary nodes, and hence leaf nodes, increases, i.e. as the
B+ tree grows, the traversal cost increases only logarithmically, on the order of log_(n/2)(K).
 Records are in sorted order.
Advantages –
 It automatically adjusts the nodes to fit a new record. Similarly it re-organizes the
nodes in the case of a delete, if required, so the tree always keeps the properties of a
B+ tree.
 Reorganization of the nodes does not affect the performance of the file. This is
because, even after the rearrangement, all the records are still found in leaf nodes
and are all at the same distance from the root. Neither the distance of the records from
the root nor the time to traverse to a leaf node changes.
 No file degradation problem.
 Good space utilization, as intermediary nodes contain only pointers to the records
and only leaf nodes contain records. The space needed for pointers is very small
compared to that needed for records.
 It is suitable for partial and range searches too.
 Since all the leaf nodes are at an equal distance, the time for an I/O fetch is much
less. Hence the performance of the tree also increases.
Disadvantages –
 Any rearrangement of nodes during insertion or deletion is an overhead. It takes a
little effort, time and space. But this disadvantage can be ignored compared to the
speed of traversal.

B+ Tree Indexing
A B+ tree is a balanced multiway search tree that follows a multi-level index format. The leaf nodes
of a B+ tree hold the actual data pointers. A B+ tree ensures that all leaf nodes remain at the same
height, and is thus balanced. Additionally, the leaf nodes are linked using a linked list; therefore, a
B+ tree can support random access as well as sequential access.

Structure of B+ Tree
Every leaf node is at an equal distance from the root node. A B+ tree is of order n, where n is
fixed for every B+ tree.
Internal nodes −

 Internal (non-leaf) nodes contain at least ⌈n/2⌉ pointers, except the root node.
 At most, an internal node can contain n pointers.
Leaf nodes −

 Leaf nodes contain at least ⌈n/2⌉ record pointers and ⌈n/2⌉ key values.
 At most, a leaf node can contain n record pointers and n key values.
 Every leaf node contains one block pointer P to point to next leaf node and forms a
linked list.
B+ Tree Insertion
 B+ trees are filled from bottom and each entry is done at the leaf node.
 If a leaf node overflows −
o Split node into two parts.
o Partition at i = ⌊(m+1)/2⌋.
o First i entries are stored in one node.
o Rest of the entries (i+1 onwards) are moved to a new node.
o ith key is duplicated at the parent of the leaf.
 If a non-leaf node overflows −
o Split node into two parts.
o Partition the node at i = ⌈(m+1)/2⌉.
o Entries up to i are kept in one node.
o Rest of the entries are moved to a new node.

B+ Tree Deletion
 B+ tree entries are deleted at the leaf nodes.
 The target entry is searched for and deleted.
o If it also appears in an internal node, delete it there and replace it with the entry from the left position.
 After deletion, the node is tested for underflow.
o If underflow occurs, redistribute entries from the node to its left.
 If redistribution from the left is not possible, then
o Redistribute from the node to its right.
 If redistribution is not possible from either the left or the right, then
o Merge the node with the node to its left or right.
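The underflow handling at the leaf level can be sketched as follows. This is a simplified illustration under stated assumptions: leaves are plain sorted lists that share one parent, `min_keys` is the hypothetical minimum occupancy, and updating the separator keys in the parent is left out.

```python
def delete_from_leaf(leaves, idx, key, min_keys):
    """Delete key from leaves[idx], then repair any underflow using siblings.

    Returns a short description of the action taken. Parent separator keys
    are not maintained in this sketch.
    """
    leaf = leaves[idx]
    leaf.remove(key)  # raises ValueError if the key is absent
    if len(leaf) >= min_keys:
        return "no underflow"
    left = leaves[idx - 1] if idx > 0 else None
    right = leaves[idx + 1] if idx + 1 < len(leaves) else None
    if left is not None and len(left) > min_keys:
        leaf.insert(0, left.pop())        # borrow the largest key of the left sibling
        return "borrowed from left"
    if right is not None and len(right) > min_keys:
        leaf.append(right.pop(0))         # borrow the smallest key of the right sibling
        return "borrowed from right"
    if left is not None:
        left.extend(leaf)                 # merge into the left sibling
        del leaves[idx]
        return "merged with left"
    if right is not None:
        leaf.extend(right)                # absorb the right sibling
        del leaves[idx + 1]
        return "merged with right"
    return "underflow in the only leaf"

leaves = [[10, 15, 20], [25, 30], [50, 55]]
first = delete_from_leaf(leaves, 1, 30, min_keys=2)
second = delete_from_leaf(leaves, 2, 55, min_keys=2)
print(first, second, leaves)
```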

Searching a record in B+ Tree


 Suppose we have to search for 55 in the B+ tree structure below. First, we look in the
intermediary node for the branch that leads to the leaf node that can contain a record for 55.
 In the intermediary node, we find the branch between the 50 and 75 entries. This
takes us to the third leaf node. Here the DBMS performs a sequential
search to find 55.
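The search can be sketched on a small hand-built tree. The figure is not reproduced here, so the keys below are hypothetical values chosen to agree with the description (separators 25, 50 and 75, with 55 in the third leaf); the class and function names are likewise assumptions.

```python
import bisect

class Leaf:
    def __init__(self, keys):
        self.keys = keys
        self.next = None          # link to the next leaf (sequential access)

class Internal:
    def __init__(self, keys, children):
        self.keys = keys          # separator keys
        self.children = children  # len(children) == len(keys) + 1

def search(node, key):
    """Follow a single root-to-leaf path; return (leaf, found)."""
    while isinstance(node, Internal):
        # Child i holds the keys k with keys[i-1] <= k < keys[i].
        node = node.children[bisect.bisect_right(node.keys, key)]
    return node, key in node.keys  # sequential search inside the leaf

def range_scan(root, low, high):
    """Collect all keys in [low, high] by walking the linked leaves."""
    leaf, _ = search(root, low)
    result = []
    while leaf is not None:
        for k in leaf.keys:
            if k > high:
                return result
            if k >= low:
                result.append(k)
        leaf = leaf.next
    return result

leaves = [Leaf([10, 15, 20]), Leaf([25, 30, 35]), Leaf([50, 55, 65, 70]), Leaf([75, 80, 85])]
for left_leaf, right_leaf in zip(leaves, leaves[1:]):
    left_leaf.next = right_leaf
root = Internal([25, 50, 75], leaves)

leaf, found = search(root, 55)
print(leaf.keys, found)
print(range_scan(root, 30, 55))
```

`range_scan` shows why the linked leaves matter: after one descent, a range query only walks sideways.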

B+ Tree Insertion

 Suppose we want to insert a record 60 in the structure below. It belongs in the 3rd leaf
node, after 55. The tree is balanced and this leaf node is already full, so we
cannot simply insert 60 there.
 In this case, we have to split the leaf node, so that 60 can be inserted into the tree without
affecting the fill factor, balance and order.

 With 60 added, the 3rd leaf node would hold the values (50, 55, 60, 65, 70), and the key that
currently leads to it from the intermediate node is 50. We split the leaf node in the middle so
that the balance of the tree is not altered. So we can group (50, 55) and (60, 65, 70) into 2 leaf nodes.
 If these two are to be leaf nodes, the intermediate node cannot branch from 50 alone. It should
have 60 added to it, and then we can have a pointer to the new leaf node.
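The split in this example can be reproduced with a few lines. The sketch assumes a leaf capacity of 4 entries and reads m in the split rule i = ⌊(m+1)/2⌋ as that capacity, which is the reading that matches the (50, 55) / (60, 65, 70) grouping; the function name is hypothetical.

```python
def split_leaf(keys, capacity):
    """Split an overflowing leaf; return (left, right, key_for_parent)."""
    if len(keys) != capacity + 1:
        raise ValueError("split only a leaf that has just overflowed")
    i = (capacity + 1) // 2            # i = floor((m + 1) / 2), with m = capacity
    left, right = keys[:i], keys[i:]
    # The first key of the new right-hand leaf is added to the parent.
    return left, right, right[0]

left, right, key_for_parent = split_leaf([50, 55, 60, 65, 70], capacity=4)
print(left, right, key_for_parent)
```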

 This is how we can insert an entry when there is overflow. In a normal scenario, it is very
easy to find the node where it fits and then place it in that leaf node.
B+ Tree Deletion

 Suppose we want to delete 60 from the above example. In this case, we have to remove
60 from the intermediate node as well as from the 4th leaf node too. If we remove it from
the intermediate node, then the tree will not satisfy the rule of the B+ tree. So we need to
modify it to have a balanced tree.
 After deleting node 60 from above B+ tree and re-arranging the nodes, it will show as
follows:

Hashing
 In a database management system, when we want to retrieve a particular piece of data, it becomes
very inefficient to search all the index values to reach the desired data. In this situation,
the hashing technique comes into the picture.
 Hashing is an efficient technique to directly compute the location of the desired data on the
disk without using an index structure. Data is stored in the data blocks whose address is
generated by using a hash function. The memory location where these records are stored is
called a data block or data bucket.

Hash File Organization:

 Data bucket – Data buckets are the memory locations where the records are stored.
These buckets are also considered the unit of storage.
 Hash Function – A hash function is a mapping function that maps the set of search
keys to actual record addresses. Generally, the hash function uses the primary key to generate the
hash index, i.e. the address of the data block. A hash function can be anything from a simple
mathematical function to a complex one.
 Hash Index – The prefix of an entire hash value is taken as a hash index. Every hash index
has a depth value to signify how many bits are used for computing a hash function. With a depth
of n, these bits can address 2^n buckets. When all these bits are consumed, the depth value is
increased by one and twice as many buckets are allocated.
The diagram given below depicts how a hash function works:
Hashing is further divided into two subcategories:

Static Hashing –

In static hashing, when a search-key value is provided, the hash function always computes the
same address. For example, if we want to generate the address for STUDENT_ID = 76 using the mod
(5) hash function, it always results in the same bucket address 1, because 76 mod 5 = 1. The
bucket address never changes. Hence the number of data buckets in memory remains constant
throughout in static hashing.
Operations –
 Insertion – When a new record is inserted into the table, the hash function h generates a
bucket address for the new record based on its hash key K.
Bucket address = h(K)
 Searching – When a record needs to be searched for, the same hash function is used to
retrieve the bucket address for the record. For example, if we want to retrieve the whole record
for ID 76, and the hash function is mod (5) on that ID, the bucket address generated
would be 1. Then we go directly to address 1 and retrieve the whole record for ID
76. Here ID acts as the hash key.
 Deletion – If we want to delete a record, we first use the hash function to fetch the
record which is to be deleted. Then we remove the record at that address
in memory.
 Updation – The data record that needs to be updated is first searched for using the hash
function, and then the data record is updated.
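The four operations can be sketched with a fixed array of five buckets and the mod 5 hash function. Bucket capacity is left unlimited here, so overflow does not arise in this sketch; the function names and record contents are assumptions made for the example.

```python
NUM_BUCKETS = 5  # fixed for the lifetime of the file (static hashing)

def h(key):
    """Hash function: bucket address = key mod 5."""
    return key % NUM_BUCKETS

buckets = [[] for _ in range(NUM_BUCKETS)]

def insert(key, record):
    buckets[h(key)].append((key, record))

def search(key):
    for k, record in buckets[h(key)]:
        if k == key:
            return record
    return None

def update(key, record):
    bucket = buckets[h(key)]
    for i, (k, _) in enumerate(bucket):
        if k == key:
            bucket[i] = (key, record)
            return True
    return False

def delete(key):
    bucket = buckets[h(key)]
    bucket[:] = [(k, r) for k, r in bucket if k != key]

insert(76, "student 76")
print(h(76), search(76))
```

Every operation recomputes `h(key)` and touches only that one bucket, which is what makes lookups fast.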
Now suppose we want to insert a new record into the file, but the data bucket address generated
by the hash function is not empty, i.e. data already exists at that address. This is a
critical situation to handle. In static hashing this situation is called bucket overflow.

1. Open Hashing –
In the open hashing method, the next available data block is used to store the new record, instead
of overwriting the older one. This method is also called linear probing.
For example, D3 is a new record which needs to be inserted, and the hash function generates
the address 105. But that bucket is already full. So the system searches for the next available data
bucket, 123, and assigns D3 to it.

2. Closed hashing –
In the closed hashing method, a new data bucket is allocated with the same address and is linked
after the full data bucket. This method is also known as overflow chaining.
For example, we have to insert a new record D3 into the table. The static hash function
generates the data bucket address 105. But this bucket is too full to store the new data. In
this case a new data bucket is added at the end of the 105 data bucket and is linked to it.
The new record D3 is then inserted into the new bucket.

 Quadratic probing: Quadratic probing is very similar to open hashing or
linear probing. In linear probing, the difference between the old and the new bucket is linear;
here, a quadratic function is used to determine the new bucket address.
 Double hashing: Double hashing is another method similar to linear probing.
Here the difference is fixed, as in linear probing, but this fixed difference is calculated
by using another hash function. That's why the name is double hashing.
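The three probing schemes differ only in how the next bucket address is computed. The sketch below uses a table of 7 single-record buckets and three keys that all hash to bucket 3; the table size, the keys and the second hash function `1 + key mod (size - 1)` are choices made for this illustration.

```python
def linear_probe(key, attempt, size):
    """Step to the next bucket on every attempt."""
    return (key % size + attempt) % size

def quadratic_probe(key, attempt, size):
    """The offset from the home bucket grows as attempt squared."""
    return (key % size + attempt * attempt) % size

def double_hash_probe(key, attempt, size):
    """Fixed step per key, computed by a second hash function (never zero)."""
    step = 1 + key % (size - 1)
    return (key % size + attempt * step) % size

def place_all(keys, size, probe):
    """Insert the keys one by one, probing until a free bucket is found."""
    table = [None] * size
    for key in keys:
        for attempt in range(size):
            slot = probe(key, attempt, size)
            if table[slot] is None:
                table[slot] = key
                break
        else:
            raise RuntimeError("no free bucket found for key %r" % key)
    return table

keys = [10, 17, 24]  # 10, 17 and 24 all give 3 when taken mod 7
linear = place_all(keys, 7, linear_probe)
quadratic = place_all(keys, 7, quadratic_probe)
double = place_all(keys, 7, double_hash_probe)
print(linear)
print(quadratic)
print(double)
```

With linear probing the colliding keys pile up next to each other, while the other two schemes spread them out.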

Dynamic Hashing –

The drawback of static hashing is that it does not expand or shrink dynamically as the size of
the database grows or shrinks. In dynamic hashing, data buckets are added or
removed dynamically as the number of records increases or decreases. Dynamic hashing is also known
as extendible hashing (also written extended hashing).
In dynamic hashing, the hash function is made to produce a large number of values. For
example, there are three data records D1, D2 and D3. The hash function generates the three
addresses 1001, 0101 and 1010 respectively. This method of storing considers only part of the
address, initially only the first bit, to store the data. So it tries to load the three of them at
addresses 0 and 1.

But the problem is that no bucket address remains for D3. The bucket addressing has to grow
dynamically to accommodate D3. So it changes the address to have 2 bits rather than 1 bit, and then
it updates the existing data to have 2-bit addresses. Then it tries to accommodate D3.
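The idea of growing the address one bit at a time can be sketched as a small extendible hash table. This is a simplified illustration, not a definitive implementation: keys are assumed to already be 4-bit hash values, each bucket holds 2 records, and the directory is indexed by the leading bits of the hash. All class and method names are hypothetical.

```python
class Bucket:
    def __init__(self, depth):
        self.depth = depth   # number of leading bits this bucket is responsible for
        self.items = {}

class ExtendibleHash:
    def __init__(self, bucket_size=2, hash_bits=4):
        self.bucket_size = bucket_size
        self.hash_bits = hash_bits
        self.global_depth = 0            # address bits currently in use
        self.directory = [Bucket(0)]     # 2 ** global_depth entries

    def _index(self, key):
        # Use only the first global_depth bits of the hash value.
        return key >> (self.hash_bits - self.global_depth)

    def get(self, key):
        return self.directory[self._index(key)].items.get(key)

    def insert(self, key, value):
        while True:
            bucket = self.directory[self._index(key)]
            if key in bucket.items or len(bucket.items) < self.bucket_size:
                bucket.items[key] = value
                return
            self._split(bucket)          # bucket is full: grow, then retry

    def _split(self, bucket):
        if bucket.depth == self.global_depth:
            if self.global_depth == self.hash_bits:
                raise OverflowError("no hash bits left to split on")
            # Use one more address bit: every directory entry is duplicated.
            self.directory = [b for b in self.directory for _ in (0, 1)]
            self.global_depth += 1
        bucket.depth += 1
        new_bucket = Bucket(bucket.depth)
        shift = self.global_depth - bucket.depth
        for i, b in enumerate(self.directory):
            if b is bucket and (i >> shift) & 1:
                self.directory[i] = new_bucket
        # Redistribute the old records using the longer address.
        old_items, bucket.items = bucket.items, {}
        for k, v in old_items.items():
            self.directory[self._index(k)].items[k] = v

table = ExtendibleHash()
for name, address in [("D1", 0b0001), ("D2", 0b0101), ("D3", 0b1001),
                      ("D4", 0b1100), ("D5", 0b0011)]:
    table.insert(address, name)
print(table.global_depth, len(table.directory))
```

Only the bucket that overflowed is split; buckets that still have room keep their shorter address and are simply shared by several directory entries.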

Bitmap Indexing

 Bitmap indexing is a special type of database indexing that uses bitmaps. This technique
is used for huge databases, when a column is of low cardinality and is
frequently used in queries.

Need of Bitmap Indexing – The need for bitmap indexing will be clear from the example given
below:
For example, let us say that a company holds an employee table with entries like EmpNo,
EmpName, Job, New_Emp and salary. Let us assume that employees are hired once a
year, so the table is updated very rarely and remains static most of the time. But the
columns will be frequently used in queries to retrieve data such as the number of female employees in the
company. In this case we need a file organization method which is fast enough to
give quick results. None of the traditional file organization methods is that fast, therefore
we switch to a better method of storing and retrieving data known as bitmap indexing.
How Bitmap Indexing is done –
o In the above example of the employee table, we can see that the column New_Emp has only
two values, Yes and No, depending on whether the employee is new to the company or
not.
o Similarly, let us assume that the Job of the employees is divided into 4 categories only, i.e.
Manager, Analyst, Clerk and Salesman. Such columns are called columns with low
cardinality. Even though these columns have few unique values, they can be queried very
often.
o Bit: A bit is the basic unit of information used in computing and can have only one of two
values, either 0 or 1. The two values of a binary digit can also be interpreted as the logical
values true/false or yes/no.
In bitmap indexing these bits are used to represent the unique values in those low cardinality
columns. This technique of storing the low cardinality columns in the form of bits is called bitmap
indexing.
Continuing the Employee example, Given below is the Employee table :

If New_Emp is the data to be indexed, the content of the bitmap index is shown as four columns (as we
have four rows in the above table) under the heading Bitmap Indices. Here the bitmap
index "Yes" has the value 1001 because row 1 and row 4 have the value "Yes" in column New_Emp.

In this case there are two such bitmaps, one for New_Emp "Yes" and one for New_Emp "No".
Each bit in a bitmap shows whether a particular row refers to a
person who is new to the company or not.
The above scenario is the simplest form of bitmap indexing. Most columns will have more
distinct values. For example, the column Job here has 4 unique values (as mentioned
earlier). Variations on the bitmap index can effectively index this data as well. For the Job column
the bitmap indexing is shown below:
Now suppose we want to find the details of the employee who is not new to the
company and is a salesman. We run the query:
SELECT *
FROM Employee
WHERE New_Emp = 'No' AND Job = 'Salesman';
For this query the DBMS will search the bitmap index of both the columns and perform a logical
AND operation on those bits to find the actual result:

Here the result 0100 indicates that the second row has to be retrieved as the result.
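The bitmap construction and the AND step can be sketched as below. The employee table is not reproduced in the text, so the four rows here are hypothetical values chosen to agree with the bitmaps quoted above (1001 for New_Emp = Yes, and 0100 as the query result).

```python
# Hypothetical employee rows, consistent with the bitmaps quoted in the text.
rows = [
    {"EmpNo": 1, "Job": "Analyst", "New_Emp": "Yes"},
    {"EmpNo": 2, "Job": "Salesman", "New_Emp": "No"},
    {"EmpNo": 3, "Job": "Clerk", "New_Emp": "No"},
    {"EmpNo": 4, "Job": "Manager", "New_Emp": "Yes"},
]

def build_bitmaps(rows, column):
    """One bitmap per distinct value: bit i is 1 if row i holds that value."""
    bitmaps = {}
    for position, row in enumerate(rows):
        bits = bitmaps.setdefault(row[column], [0] * len(rows))
        bits[position] = 1
    return bitmaps

def bitmap_and(a, b):
    """Bitwise AND of two bitmaps of equal length."""
    return [x & y for x, y in zip(a, b)]

new_emp = build_bitmaps(rows, "New_Emp")
job = build_bitmaps(rows, "Job")

result = bitmap_and(new_emp["No"], job["Salesman"])
matches = [row for row, bit in zip(rows, result) if bit]
print(new_emp["Yes"], result, matches)
```

The query never touches the table until the final step; the filtering itself is done entirely on the bitmaps.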

Bitmap Indexing in SQL – The syntax for creating a bitmap index in SQL (as supported by Oracle, for example) is given below:
CREATE BITMAP INDEX Index_Name
ON Table_Name (Column_Name);

For the above example of employee table, the bitmap index on column New_Emp will be created
as follows:
CREATE BITMAP INDEX index_New_Emp
ON Employee (New_Emp);
Advantages –
 Efficiency in terms of insertion, deletion and update
 Faster retrieval of records
Disadvantages –
 Only suitable for large tables
 Bitmap Indexing is time consuming