You are on page 1of 34

1 DBMS Notes

DataBase :
A database is a collection of stored operational data used by various applications and/or users by
some particular enterprise or by a set of outside authorized applications and authorized users.

DataBase Management System :


A DataBase Management System (DBMS) is a software system that manages execution of users
applications to access and modify database data so that the data security, data integrity, and data
reliability is guaranteed for each application and each application is written with an assumption that it is
the only application active in the database.

What Is Data ?
Different viewpoints:
A sequence of characters stored in computer memory or storage
Interpreted sequence of characters stored in computer memory or storage
Interpreted set of objects
Database supports a concurrent access to the data

File Systems :
File is uninterrupted, unstructured collection of information
File operations: delete, catalog, create, rename, open, close, read, write, find,
Access methods: Algorithms to implement operations along with internal file organization
Examples: File of Customers, File of Students; Access method: implementation of a set of operations
on a file of students or customers.

File Management System Problems:


Data redundancy
Data Access: New request-new program
Data is not isolated from the access implementation
Concurrent program execution on the same file
Difficulties with security enforcement
Integrity issues.

Database Applications:
Airline Reservation Systems Data items are: single passenger reservations; Information about flights
and airports; Information about ticket prices and tickets restrictions.
Banking Systems Data items are accounts, customers, loans, mortgages, balances, etc. Failures
are not tolerable. Concurrent access must be provided
Corporate Records Data items are: sales, accounts, bill of materials records, employee and their
dependents

ADVANTAGES OF A DBMS:

Data independence: Application programs should be as independent as possible from details of data
representation and storage. The DBMS can provide an abstract view of the data to insulate application
code from such details.
Client data access: A DBMS utilizes a variety of sophisticated techniques to store and retrieve data
2 DBMS Notes

efficiently. This feature is especially important if the data is stored on external storage devices.
Data integrity and security: If data is always accessed through the DBMS, the DBMS can enforce
integrity constraints on the data. For example, before inserting salary information for an employee, the
DBMS can check that the department budget is not exceeded. Also, the DBMS can enforce access
controls that govern what data is visible to deferent classes of users.
Data administration: When several users share the data, centralizing the administration
of data can or signi cant improvements. Experienced professionals who understand the nature of the
data being managed, and how deferent groups of users use it, can be responsible for organizing the
data representation to minimize redundancy and for ne-tuning the storage of the data to make retrieval
efficient.

concurrency recovery: A DBMS schedules concurrent accesses to the data in such a manner that
users can think of the data as being accessed by only one user at a time. Further, the DBMS protects
users from the eects of system failures.

Reduced application development time: Clearly, the DBMS supports many important functions that
are common to many applications accessing data stored in the DBMS. This, in conjunction with the
high-level interface to the data, facilitates quick development of applications. Such applications are
also likely to be more robust than applications developed from scratch because many important tasks
are handled by the DBMS instead of being implemented by the application.
Data Levels and their Roles :

Physical corresponds to the first view of data: How data is stored, how is it accessed, how data
is modified, is data ordered, how data is allocated to computer memory and/or peripheral devices, how
data items are actually represented (ASCI, EBCDIC,) .The physical schema species additional
storage details. Essentially, the physical schema summarizes how the relations described in the
conceptual schema are actually stored on secondary storage devices such as disks and tapes. We
3 DBMS Notes

must decide what le organizations to use to store the relations, and create auxiliary data structures
called indexes to speed up data retrieval operations.

Conceptual corresponds to the second view of data: What we want the data to express and
what relationships between data we must express, what story data tells, are all data necessary for
the story are discussed. The conceptual schema (sometimes called the logical schema) describes the
stored data in terms of the data model of the DBMS. In a relational DBMS, the conceptual schema
describes all relations that are stored in the database. In our sample university database, these
relations contain information about entities, such as students and faculty, and about relationships, such
as students' enrollment in courses. All student entities can be described using records in a Students
relation, as we saw earlier. In fact, each collection of entities and each collection of relationships can
be described as a relation, leading to the following conceptual schema:
Students(sid: string, name: string, login: string, age: integer, gpa: real)
Faculty( d: string, fname: string, sal: real)
Courses(cid: string, cname: string, credits: integer)
Rooms(rno: integer, address: string, capacity: integer)
Enrolled(sid: string, cid: string, grade: string)
Teaches( d: string, cid: string)
Meets In(cid: string, rno: integer, time: string)
The choice of relations, and the choice of elds for each relation, is not always obvious,
and the process of arriving at a good conceptual schema is called conceptual
database design.

View corresponds to the third view of data: What part of the data is seen by a specific
application. External schemas, which usually are also in terms of the data model of the DBMS, allow
data access to be customized (and authorized) at the level of individual users or groups of users. The
external schema design is guided by end user requirements. For example, we might ant to allow
students to and out the names of faculty members teaching courses, as well as course enrollments.
This can be done by de ning the following view:
Course info(cid: string, fname: string, enrollment: integer)

STRUCTURE OF A DBMS:

When a user issues a query, the parsed query is presented to a query optimizer, which uses
information about how the data is stored to produce an efficient execution plan for evaluating the
query. An execution plan is a blueprint for evaluating a query, and is usually represented as a tree of
relational operators.

The code that implements relational operators sits on top of the le and access methods layer. This
layer includes a variety of software for supporting the concept of a le, which, in a DBMS, is a collection
of pages or a collection of records. This layer typically supports a heap le, or le of unordered pages, as
well as indexes. In addition to keeping track of the pages in a le, this layer organizes the information
within a page. The les and access methods layer code sits on top of the buer manager, which brings
pages in from disk to main memory as needed in response to read requests.
4 DBMS Notes

The lowest layer of the DBMS software deals with management of space on disk, where the data is
stored. Higher layers allocate, deallocate, read, and write pages through (routines provided by) this
layer, called the disk space manager.
The DBMS supports concurrency and crash recovery by carefully scheduling user requests and
maintaining a log of all changes to the database. DBMS components associated with concurrency
control and recovery include the transaction manager, which ensures that transactions request and
release locks according to a suitable locking protocol and schedules the execution transactions; the
lock manager, which keeps track of requests for locks and grants locks on database objects when they
become available; and the recovery manager, which is responsible for maintaining a log, and restoring
the system to a consistent state after a crash. The disk space manager, buer manager, and le and
access method layers must interact with these components.

Data Models:
A collection of tools for describing ......

Data.
Data relationships.
Data semantics.
Data constraints.

Relational model..........

Entity-Relationship data model (mainly for database design) .

Object-based data models (Object-oriented and Object-relational).

Semi structured data model (XML).

Other older models:.........

Network model .
Hierarchical model.

Database Access from Application Programs:

To access the database, DML statements need to be executed from the host language. There are two
ways to do this:

By providing an application program interface (set of procedures) that can be used to send DML and
DDL statements to the database, and retrieve the results. The Open Database Connectivity (ODBC)
standard defined by Microsoft for use with the C language is a commonly used application program
interface standard. The Java Database Connectivity (JDBC) standard provides corresponding features
to the Java language.
By extending the host language syntax to embed DML calls within the host language program.
Usually, a special character prefaces DML calls, and a preprocessor, called the DML precompiled,
converts the DML statements to normal procedure calls in the host language.
5 DBMS Notes

Database Users and Administrators:

Naive users are unsophisticated users who interact with the system by invoking one of the
application programs that have been written previously.
Application programmers are computer professionals who write application programs.
Sophisticated users interact with the system without writing programs. Instead, they form their
requests in a database query language. They submit each such query to a query processor,
whose function is to break down DML statements into instructions that the storage manager
understands. Analysts who submit queries to explore data in the database fall in this category.
Specialized users are sophisticated users who write specialized database applications that do
not fit into the traditional data-processing framework.

Database Administrator: A person who has such central control over the system is called a
database administrator (DBA)
Schema definition. The DBA creates the original database schema by executing a set of data
definition statements in the DDL.
Storage structure and access-method definition.
Schema and physical-organization modification. The DBA carries out changes to the schema
and physical organization to reflect the changing needs of the organization, or to alter the
physical organization to improve performance.
Granting of authorization for data access.
Routine maintenance.

Data Model:
A data model is a collection of conceptual tools for describing data, data relationships, data semantics,
and consistency constraints.

Entity: An entity is a thing or object in the real world that is distinguishable from all other objects.
For example, each person in an enterprise is an entity.

Entity set: An entity set is a set of entities of the same type that share the same properties, or
attributes. The set of all persons who are customers at a given bank, for example, can be defined as
the entity set customer. Similarly, the entity set loan might represent the set of all loans awarded by a
particular bank.
An entity is represented by a set of attributes. Attributes are descriptive properties possessed by each
member of an entity set. The designation of an attribute for an entity set expresses that the database
stores similar information concerning each entity in the entity set; however, each entity may have its
own value for each attribute.
Simple and composite attributes: the attributes have been simple; that is, they are not divided into
subparts is called as "simple attributes". on the other hand, can be divided into subparts is called as
"composite attributes". For example, an attribute name could be structured as a composite attribute
consisting of first-name, middle-initial, and last-name.
Single-valued and multivalve attributes: For instance, the loan-number attribute for a specific loan
entity refers to only one loan number. Such attributes are said to be single valued. There may be
instances where an attribute has a set of values for a specific entity. Consider an employee entity set
6 DBMS Notes

with the attribute phone-number. An employee may have zero, one, or several phone numbers, and
different employees may have different numbers of phones.
This type of attribute is said to be multivalve.

Derived attribute: The value for this type of attribute can be derived from the values of other related
attributes or entities. For instance, let us say that the customer entity set has an attribute loans-held,
which represents how many loans a customer has from the bank. We can derive the value for this
attribute by counting the number of loan entities associated with that customer.

Relationship Sets: A relationship is an association among several entities. A relationship set is a set
of relationships of the same type.

Mapping Cardinalities: Mapping cardinalities, or cardinality ratios, express the number of entities to
which another entity can be associated via a relationship set. Mapping cardinalities are most useful in
describing binary relationship sets, although they can contribute to the description of relationship sets
that involve more than two entity sets.

One to one. An entity in A is associated with at most one entity in B, and an entity in B is
associated with at most one entity in A.
One to many. An entity in A is associated with any number (zero or more) of entities in B. An
entity in B, however, can be associated with at most one entity in A.
Many to one. An entity in A is associated with at most one entity in B. An entity in B, however,
can be associated with any number (zero or more) of entities in A.
Many to many. An entity in A is associated with any number (zero or more) of entities in B, and
an entity in B is associated with any number (zero or more) of entities in A.

Keys: A key allows us to identify a set of attributes that suffice to distinguish entities from each other.
Keys also help uniquely identify relationships, and thus distinguish relationships from each other.

Super key: A super key is a set of one or more attributes that, taken collectively, allow us to identify
uniquely an entity in the entity set. For example, the customer-id attribute of the entity set customer is
sufficient to distinguish one customer entity from another. Thus, customer-id is a super key. Similarly,
the combination of customer-name and customer-id is a super key for the entity set customer. The
customer-name attribute of customer is not a super key, because several people might have the same
name.

candidate key: minimal super keys are called candidate keys. If K is a super key, then so is any
superset of K. We are often interested in super keys for which no proper subset is a super key.It is
possible that several distinct sets of attributes could serve as a candidate key. Suppose that a
combination of customer-name and customer-street is sufficient to distinguish among members of the
customer entity set. Then, both {customer-id} and {customer-name, customer-street} are candidate
keys. Although the attributes customer id and customer-name together can distinguish customer
entities, their combination does not form a candidate key, since the attribute customer-id alone is a
candidate key.

primary key:which denotes the unique identity is called as primary key. primary key to denote a
candidate key that is chosen by the database designer as the principal means of identifying entities
7 DBMS Notes

within an entity set. A key (primary, candidate, and super) is a property of the entity set, rather than of
the individual entities. Any two individual entities in the set are prohibited from having the same value
on the key attributes at the same time. The designation of a key represents a constraint in the real-
world enterprise being modeled.

Weak Entity Sets:An entity set may not have sufficient attributes to form a primary key. Such an entity
set is termed a weak entity set. An entity set that has a primary key is termed a strong entity set.

For a weak entity set to be meaningful, it must be associated with another entity set, called the
identifying or owner entity set. Every weak entity must be associated with an identifying entity; that is,
the weak entity set is said to be existence dependent on the identifying entity set. The identifying entity
set is said to own the weak entity set that it identifies. The relationship associating the weak entity set
with the identifying entity set is called the identifying relationship. The identifying relationship is many to
one from the weak entity set to the identifying entity set, and the participation of the weak entity set in
the relationship is total.
In our example, the identifying entity set for payment is loan, and a relationship loan-payment that
associates payment entities with their corresponding loan entities is the identifying relationship.
Although a weak entity set does not have a primary key, we nevertheless need a means of
distinguishing among all those entities in the weak entity set that depend on one particular strong
entity. The discriminator of a weak entity set is a set of attributes that allows this distinction to be made.

In E-R diagrams, a doubly outlined box indicates a weak entity set, and a doubly outlined diamond
indicates the corresponding identifying relationship. in fig the weak entity set payment depends on the
strong entity set loan via the relationship set loan-payment.

The figure also illustrates the use of double lines to indicate total participationthe of the (weak) entity
set payment in the relationship loan-payment is total, meaning that every payment must be related via
loan-payment to some loan. Finally, the arrow from loan-payment to loan indicates that each payment
8 DBMS Notes

is for a single loan. The discriminator of a weak entity set also is underlined, but with a dashed, rather
than a solid, line.

Specialization: An entity set may include sub groupings of entities that are distinct in some way from
other entities in the set. For instance, a subset of entities within an entity set may have attributes that
are not shared by all the entities in the entity set. The E-R model provides a means for representing
these distinctive entity groupings. Consider an entity set person, with attributes name, street, and city.
A person may be further classified as one of the following:
customer
employee

Each of these person types is described by a set of attributes that includes all the attributes of entity
set person plus possibly additional attributes. For example, customer entities may be described further
by the attribute customer-id, whereas employee entities may be described further by the attributes
employee-id and salary. The process of designating sub groupings within an entity set is called
specialization. The specialization of person allows us to distinguish among persons according to
whether they are employees or customers.

Generalization: mThe design process may also proceed in a bottom-up manner, in which multiple
entity sets are synthesized into a higher-level entity set on the basis of common features. The
database designer may have first identified a customer entity set with the attributes name, street, city,
and customer-id, and an employee entity set with the attributes name, street, city, employee-id, and
salary. There are similarities between the customer entity set and the employee entity set in the sense
that they have several attributes in common. This commonality can be expressed by generalization,
which is a containment relationship that exists between a higher-level entity set and one or more lower-
level entity sets. In our example, person is the higher-level entity set and customer and employee are
lower-level entity sets.
Higher- and lower-level entity sets also may be designated by the terms super class and subclass,
respectively. The person entity set is the superclass of the customer and employee subclasses. For all
practical purposes, generalization is a simple inversion of specialization. We will apply both processes,
in combination, in the course of designing the E-R schema for an enterprise. In terms of the E-R
diagram itself, we do not distinguish between specialization and generalization. New levels of entity
representation will be distinguished (specialization) or synthesized (generalization) as the design
schema comes to express fully the database application and the user requirements of the database.
Differences in the two approaches may be characterized by their starting point and overall goal.
Generalization proceeds from the recognition that a number of entity sets share some common
features (namely, they are described by the same attributes and participate in the same relationship
9 DBMS Notes

sets).

Aggregation:
Aggregation is an abstraction in which relationship sets (along with their associated entity sets) are
treated as higher-level entity sets, and can participate in relationships.
10 DBMS Notes

Symbols used in the E-R notation:


11 DBMS Notes

ER Model For a college DB:

Assumptions :

A college contains many departments


Each department can offer any number of courses
Many instructors can work in a department
An instructor can work only in one department
For each department there is a Head
An instructor can be head of only one department
Each instructor can take any number of courses
A course can be taken by only one instructor
A student can enroll for any number of courses
Each course can have any number of students

Steps in ER Modeling:

Identify the Entities


Find relationships
Identify the key attributes for every Entity
Identify other relevant attributes
Draw complete E-R diagram with all attributes including Primary Key

Step 1: Identify the Entities:

DEPARTMENT
STUDENT
COURSE
INSTRUCTOR

Step 2: Find the relationships:

One course is enrolled by multiple students and one student enrolls for multiple courses,
hence the cardinality between course and student is Many to Many.
The department offers many courses and each course belongs to only one department,
hence the cardinality between department and course is One to Many.
One department has multiple instructors and one instructor belongs to one and only one
department , hence the cardinality between department and instructor is one to Many.
Each department there is a Head of department and one instructor is Head of
department ,hence the cardinality is one to one .
One course is taught by only one instructor, but the instructor teaches many courses,
hence the cardinality between course and instructor is many to one.

Step 3: Identify the key attributes

Dept name is the key attribute for the Entity Department, as it identifies the Department
uniquely.
12 DBMS Notes

Course# (CourseId) is the key attribute for Course Entity.


Student# (Student Number) is the key attribute for Student Entity.
Instructor Name is the key attribute for Instructor Entity.

Step 4: Identify other relevant attributes

For the department entity, the relevant attribute is location

For course entity, course name, duration, prerequisite


For instructor entity, room#, telephone#
For student entity, student name, date of birth

ER model for Banking Business :

Assumptions :
13 DBMS Notes

There are multiple banks and each bank has many branches. Each branch has multiple
customers
Customers have various types of accounts
Some Customers also had taken different types of loans from these bank branches
One customer can have multiple accounts and Loans

Step 1: Identify the Entities


BANK
BRANCH
LOAN
ACCOUNT
CUSTOMER
Step 2: Find the relationships
One Bank has many branches and each branch belongs to only one bank, hence the
cardinality between Bank and Branch is One to Many.
One Branch offers many loans and each loan is associated with one branch, hence the
cardinality between Branch and Loan is One to Many.
One Branch maintains multiple accounts and each account is associated to one and
only one Branch, hence the cardinality between Branch and Account is One to Many
One Loan can be availed by multiple customers, and each Customer can avail multiple
loans, hence the cardinality between Loan and Customer is Many to Many.
One Customer can hold multiple accounts, and each Account can be held by multiple
Customers, hence the cardinality between Customer and Account is Many to Many

Step 3: Identify the key attributes


Bank Code (Bank Code) is the key attribute for the Entity Bank, as it identifies the bank
uniquely.
Branch# (Branch Number) is the key attribute for Branch Entity.
Customer# (Customer Number) is the key attribute for Customer Entity.
Loan# (Loan Number) is the key attribute for Loan Entity.
Account No (Account Number) is the key attribute for Account Entity.
Step 4: Identify other relevant attributes
For the Bank Entity, the relevant attributes other than Bank Code would be Name
and Address.
For the Branch Entity, the relevant attributes other than Branch# would be Name
and Address.
For the Loan Entity, the relevant attribute other than Loan# would be Loan Type.
For the Account Entity, the relevant attribute other than Account No would be
Account Type.
For the Customer Entity, the relevant attributes other than Customer# would be
Name, Telephone# and Address.

E-R diagram with all attributes including Primary Key:


14 DBMS Notes
15 DBMS Notes

Normalization

While designing a database out of an entityrelationship model, the main problem existing in that raw
database is redundancy. Redundancy is storing the same data item in more one place. A redundancy
creates several problems like the following:
1. Extra storage space: storing the same data in many places takes large amount of disk space.
2. Entering same data more than once during data insertion.
3. Deleting data from more than one place during deletion.
4. Modifying data in more than one place.
5. Anomalies may occur in the database if insertion, deletion, modification etc are no done properly. It
creates inconsistency and unreliability in the database.
To solve this problem, the raw database needs to be normalized. This is a step by step process of
removing different kinds of redundancy and anomaly at each step. At each step a specific rule is
followed to remove specific kind of impurity in order to give the database a slim and clean look.

Un-Normalized Form (UNF)


If a table contains non-atomic values at each row, it is said to be in UNF. An atomic value is
something that can not be further decomposed. A non-atomic value, as the name suggests, can be
further decomposed and simplified. Consider the following table:
Emp-Id Emp-Name Month Sales Bank-Id Bank-Name
E01 AA Jan 1000 B01 SBI
Feb 1200
Mar 850
E02 BB Jan 2200 B02 UTI
Feb 2500
E03 CC Jan 1700 B01 SBI
Feb 1800
Mar 1850
Apr 1725
In the sample table above, there are multiple occurrences of rows under each key Emp-Id. Although
considered to be the primary key, Emp-Id cannot give us the unique identification facility for any single
row. Further, each primary key points to a variable length record (3 for E01, 2 for E02 and 4 for E03).

First Normal Form (1NF)


A relation is said to be in 1NF if it contains no non-atomic values and each row can provide a unique
combination of values. The above table in UNF can be processed to create the following table in 1NF.
Emp-Name Month Sales Bank-Id Bank-Name
Emp-Id
E01 AA Jan 1000 B01 SBI
E01 AA Feb 1200 B01 SBI
E01 AA Mar 850 B01 SBI
E02 BB Jan 2200 B02 UTI
E02 BB Feb 2500 B02 UTI
16 DBMS Notes

E03 CC Jan 1700 B01 SBI


E03 CC Feb 1800 B01 SBI
E03 CC Mar 1850 B01 SBI
E03 CC Apr 1725 B01 SBI
As you can see now, each row contains unique combination of values. Unlike in UNF, this relation
contains only atomic values, i.e. the rows can not be further decomposed, so the relation is now in
1NF.

Second Normal Form (2NF)


A relation is said to be in 2NF f if it is already in 1NF and each and every attribute fully depends on the
primary key of the relation. Speaking inversely, if a table has some attributes which is not dependant
on the primary key of that table, then it is not in 2NF.
Let us explain. Emp-Id is the primary key of the above relation. Emp-Name, Month, Sales and Bank-
Name all depend upon Emp-Id. But the attribute Bank-Name depends on Bank-Id, which is not the
primary key of the table. So the table is in 1NF, but not in 2NF. If this position can be removed into
another related relation, it would come to 2NF.
Emp-Id Emp-Name Month Sales Bank-Id
E01 AA JAN 1000 B01
E01 AA FEB 1200 B01
E01 AA MAR 850 B01
E02 BB JAN 2200 B02
E02 BB FEB 2500 B02
E03 CC JAN 1700 B01
E03 CC FEB 1800 B01
E03 CC MAR 1850 B01
E03 CC APR 1726 B01

Bank-Id Bank-Name
B01 SBI
B02 UTI
After removing the portion into another relation we store lesser amount of data in two relations without
any loss information. There is also a significant reduction in redundancy.

Third Normal Form (3NF)


A relation is said to be in 3NF, if it is already in 2NF and there exists no transitive dependency in that
relation. Speaking inversely, if a table contains transitive dependency, then it is not in 3NF, and the
table must be split to bring it into 3NF.
What is a transitive dependency? Within a relation if we see
A B [B depends on A]
And
B C [C depends on B]
Then we may derive
A C[C depends on A]
17 DBMS Notes

Such derived dependencies hold well in most of the situations. For example if we have
Roll Marks
And
Marks Grade
Then we may safely derive
Roll Grade.
This third dependency was not originally specified but we have derived it.
The derived dependency is called a transitive dependency when such dependency becomes
improbable. For example we have been given
Roll City
And
City STD Code
If we try to derive Roll STD Code it becomes a transitive dependency, because obviously the
STDCode of a city cannot depend on the roll number issued by a school or college. In such a case the
relation should be broken into two, each containing one of these two dependencies:
Roll City
And
City STD code

Boyce-Code Normal Form (BCNF)


A relationship is said to be in BCNF if it is already in 3NF and the left hand side of every dependency is
a candidate key. A relation which is in 3NF is almost always in BCNF. These could be same situation
when a 3NF relation may not be in BCNF the following conditions are found true.
1. The candidate keys are composite.
2. There are more than one candidate keys in the relation.
3. There are some common attributes in the relation.

Professor Code Department Head of Dept. Percent Time


P1 Physics Ghosh 50
P1 Mathematics Krishnan 50
P2 Chemistry Rao 25
P2 Physics Ghosh 75
P3 Mathematics Krishnan 100
Consider, as an example, the above relation. It is assumed that:
1. A professor can work in more than one department
2. The percentage of the time he spends in each department is given.
3. Each department has only one Head of Department.
18 DBMS Notes

The relation diagram for the above relation is given as the following:

The given relation is in 3NF. Observe, however, that the names of Dept. and Head of Dept. are
duplicated. Further, if Professor P2 resigns, rows 3 and 4 are deleted. We lose the information that
Rao is the Head of Department of Chemistry.
The normalization of the relation is done by creating a new relation for Dept. and Head of Dept. and
deleting Head of Dept. form the given relation. The normalized relations are shown in the following.
Professor Code Department Percent Time
P1 Physics 50
P1 Mathematics 50
P2 Chemistry 25
P2 Physics 75
P3 Mathematics 100

Head of Dept.
Department
Physics Ghosh
Mathematics Krishnan
Chemistry Rao
See the dependency diagrams for these new relations.

Fourth Normal Form (4NF)


19 DBMS Notes

When attributes in a relation have multi-valued dependency, further Normalization to 4NF and 5NF are
required. Let us first find out what multi-valued dependency is.
A multi-valued dependency is a typical kind of dependency in which each and every attribute within a
relation depends upon the other, yet none of them is a unique primary key.
We will illustrate this with an example. Consider a vendor supplying many items to many projects in an
organization. The following are the assumptions:
1. A vendor is capable of supplying many items.
2. A project uses many items.
3. A vendor supplies to many projects.
4. An item may be supplied by many vendors.
A multi valued dependency exists here because all the attributes depend upon the other and yet none
of them is a primary key having unique value.
Vendor Code Item Code Project No.
V1 I1 P1
V1 I2 P1
V1 I1 P3
V1 I2 P3
V2 I2 P1
V2 I3 P1
V3 I1 P2
V3 I1 P3
The given relation has a number of problems. For example:
1. If vendor V1 has to supply to project P2, but the item is not yet decided, then a row with a blank for
item code has to be introduced.
2. The information about item I1 is stored twice for vendor V3.
Observe that the relation given is in 3NF and also in BCNF. It still has the problem mentioned above.
The problem is reduced by expressing this relation as two relations in the Fourth Normal Form (4NF).
A relation is in 4NF if it has no more than one independent multi valued dependency or one
independent multi valued dependency with a functional dependency.
The table can be expressed as the two 4NF relations given as following. The fact that vendors are
capable of supplying certain items and that they are assigned to supply for some projects in
independently specified in the 4NF relation.
Vendor-Supply
Item Code
Vendor Code
V1 I1
V1 I2
V2 I2
V2 I3
V3 I1
Vendor-Project
Project No.
Vendor Code
V1 P1
V1 P3
20 DBMS Notes

V2 P1
V3 P2

Fifth Normal Form (5NF)


These relations still have a problem. While defining the 4NF we mentioned that all the attributes
depend upon each other. While creating the two tables in the 4NF, although we have preserved the
dependencies between Vendor Code and Item code in the first table and Vendor Code and Item code
in the second table, we have lost the relationship between Item Code and Project No. If there were a
primary key then this loss of dependency would not have occurred. In order to revive this relationship
we must add a new table like the following. Please note that during the entire process of normalization,
this is the only step where a new table is created by joining two attributes, rather than splitting them
into separate tables.
Project No. Item Code
P1 11
P1 12
P2 11
P3 11
P3 13

Let us finally summarize the normalization steps we have discussed so far.


Input Transformation Output
Relation Relation
All Relations Eliminate variable length record. Remove multi-attribute lines in table. 1NF
1NF Remove dependency of non-key attributes on part of a multi-attribute key. 2NF
Relation
2NF Remove dependency of non-key attributes on other non-key attributes. 3NF
3NF Remove dependency of an attribute of a multi attribute key on an attribute BCNF
of another (overlapping) multi-attribute key.
BCNF Remove more than one independent multi-valued dependency from relation 4NF
by splitting relation.
4NF Add one relation relating attributes with multi-valued dependency. 5NF

Primitive data type


primitive data type

a basic type is a data type provided by a programming language as a basic building block. Most
languages allow more complicated composite types to be recursively constructed starting from
basic types.
a built-in type is a data type for which the programming language provides built-in support.
21 DBMS Notes

In most programming languages, all basic data types are built-in. In addition, many languages also
provide a set of composite data types. Opinions vary as to whether a built-in type that is not basic
should be considered "primitive

Depending on the language and its implementation, primitive data types may or may not have a one-
to-one correspondence with objects in the computer's memory. However, one usually expects
operations on basic primitive data types to be the fastest language constructs there are Integer
addition, for example, can be performed as a single machine instruction, and some processors offer
specific instructions to process sequences of characters with a single instruction. In particular,
the C standard mentions that "a 'plain' int object has the natural size suggested by the architecture of
the execution environment". This means that int is likely to be 32 bits long on a 32-bit architecture.
Basic primitive types are almost always value types.

Most languages do not allow the behavior or capabilities of primitive (either built-in or basic) data types
to be modified by programs. Exceptions include Smalltalk, which permits all data types to be extended
within a program, adding to the operations that can be performed on them or even redefining the built-
in operations.

Composite types
are derived from more than one primitive type. This can be done in a number of ways. The ways they
are combined are called data structures. Composing a primitive type into a compound type generally
results in a new type, e.g. array-of-integer is a different type to integer.

An array stores a number of elements of the same type in a specific order. They are accessed
using an integer to specify which element is required (although the elements may be of almost any
type). Arrays may be fixed-length or expandable.
Record (also called tuple or struct) Records are among the simplest data structures. A record is a
value that contains other values, typically in fixed number and sequence and typically indexed by
names. The elements of records are usually called fields or members.
Union. A union type definition will specify which of a number of permitted primitive types may be
stored in its instances, e.g. "float or long integer". Contrast with a record, which could be defined to
contain a float and an integer; whereas, in a union, there is only one value at a time.
A tagged union (also called a variant, variant record, discriminated union, or disjoint union) contains
an additional field indicating its current type, for enhanced type safety.
A set is an abstract data structure that can store certain values, without any particular order, and no
repeated values. Values themselves are not retrieved from sets, rather one tests a value for
membership to obtain a boolean "in" or "not in".
An object contains a number of data fields, like a record, and also a number of program code
fragments for accessing or modifying them. Data structures not containing code, like those above,
are called plain old data structure.
22 DBMS Notes

Many others are possible, but they tend to be further variations and compounds of the above.
Logical and Physical Database Requirements
The requirements for a logical and physical database vary by size and design parameters. A logical
database must be able to access and identify all files within the storage system to operate correctly,
whereas a physical database manages a much smaller field of information. Sometimes, a physical
database stores only a single file with one value or word in it.

Logical Database Definition

A logical database is the collected information stored on multiple physical disk files and hard drives
within a computer. This database provides a structure to house all the accumulated information within
the device and determines the relationships between different types of files and programs. A logical
database determines these relationships through a series of highly structured tables designed to
categorize information into groups for easier accessibility. Without this categorization, accessing
different files within a computer would take additional time as the system searched each file for the
appropriate match.

Logical Database Requirements


A logical database can stretch over multiple physical hard disks and information files. The data storage
unit is still a single database for information retrieval purposes. To have a logical database, all given
hard disks and information files must be accessible from a single source. An example would be a
personal computer able to access its information files stored on multiple hard drives from a single user
interface. According to Microsoft, when a logical database is successful, the user sees a coherent list
of information from a central location that draws from the many file sources tied into the storage
system.

Physical Database Definition


A physical database is both the actual device housing the information files and the search paths used
to access information between each source. According to Microsoft, the term "database" refers only to
the logical database controlling information files for the entire system. A physical database is
technically a smaller unit of storage referred to as either a company, field, record or table, depending
on how much information the physical storage device contains. A field is the smallest unit of storage
housing only a single file. A company is the largest -- next to a database -- housing separate, large
groups of data.

Physical Storage Requirements


The requirements for a physical database vary by the parameters of the storage device in question.
For example, a flash drive designed to hold up to 2 gigabytes of information needs a personal
computer or another USB-connected device to allow access to the information stored on the
equipment. A physical database also needs a power source to access information. A computer hard
drive cannot function without electricity. A flash drive cannot operate without a device with an adequate
power source.
23 DBMS Notes

Two types of data modeling are as follows:


Logical modeling
Physical modeling
If you are going to be working with databases, then it is important to understand the difference between
logical and physical modeling, and how they relate to one another. Logical and physical modeling are
described in more detail in the following subsections.

Logical Modeling
Logical modeling deals with gathering business requirements and converting those requirements into a
model. The logical model revolves around the needs of the business, not the database, although the
needs of the business are used to establish the needs of the database. Logical modeling involves
gathering information about business processes, business entities (categories of data), and
organizational units. After this information is gathered, diagrams and reports are produced including
entity relationship diagrams, business process diagrams, and eventually process flow diagrams. The
diagrams produced should show the processes and data that exists, as well as the relationships
between business processes and data. Logical modeling should accurately render a visual
representation of the activities and data relevant to a particular business.
The diagrams and documentation generated during logical modeling is used to determine whether the
requirements of the business have been completely gathered. Management, developers, and end
users alike review these diagrams and documentation to determine if more work is required before
physical modeling commences.
Typical deliverables of logical modeling include
Entity relationship diagrams
An Entity Relationship Diagram is also referred to as an analysis ERD. The point of the initial ERD is to
provide the development team with a picture of the different categories of data for the business, as well
as how these categories of data are related to one another.
Business process diagrams
The process model illustrates all the parent and child processes that are performed by individuals
within a company. The process model gives the development team an idea of how data moves within
the organization. Because process models illustrate the activities of individuals in the company, the
process model can be used to determine how a database application interface is design.
User feedback documentation
Physical Modeling
Physical modeling involves the actual design of a database according to the requirements that were
established during logical modeling. Logical modeling mainly involves gathering the requirements of
the business, with the latter part of logical modeling directed toward the goals and requirements of the
24 DBMS Notes

database. Physical modeling deals with the conversion of the logical, or business model, into a
relational database model. When physical modeling occurs, objects are being defined at the schema
level. A schema is a group of related objects in a database. A database design effort is normally
associated with one schema.
During physical modeling, objects such as tables and columns are created based on entities and
attributes that were defined during logical modeling. Constraints are also defined, including primary
keys, foreign keys, other unique keys, and check constraints. Views can be created from database
tables to summarize data or to simply provide the user with another perspective of certain data. Other
objects such as indexes and snapshots can also be defined during physical modeling. Physical
modeling is when all the pieces come together to complete the process of defining a database for a
business.
Physical modeling is database software specific, meaning that the objects defined during physical
modeling can vary depending on the relational database software being used. For example, most
relational database systems have variations with the way data types are represented and the way data
is stored, although basic data types are conceptually the same among different implementations.
Additionally, some database systems have objects that are not available in other database systems.
Typical deliverables of physical modeling include the following:
Server model diagrams
The server model diagram shows tables, columns, and relationships within a database.
User feedback documentation Database design documentation
Conclusion
Understanding the difference between logical and physical modeling will help you build better
organized and more effective database systems.

Data independence
Data independence is the type of data transparency that matters for a centralized DBMS. It refers to
the immunity of user applications to make changes in the definition and organization of data.

Physical data independence deals with hiding the details of the storage structure from user
applications. The application should not be involved with these issues, since there is no difference in
the operation carried out against the data.

The data independence and operation independence together gives the feature of data abstraction.
There are two levels of data independence.

First level
25 DBMS Notes

The logical structure of the data is known as the schema definition. In general, if a user application
operates on a subset of the attributes of a relation, it should not be affected later when new attributes
are added to the same relation. Logical data independence indicates that the conceptual schema can
be changed without affecting the existing schemas.

Second level

The physical structure of the data is referred to as "physical data description". Physical data
independence deals with hiding the details of the storage structure from user applications. The
application should not be involved with these issues since, conceptually, there is no difference in the
operations carried out against the data. There are two types of data independence:

1. Logical data independence: The ability to change the logical (conceptual) schema without
changing the External schema (User View) is called logical data independence. For example,
the addition or removal of new entities, attributes, or relationships to the conceptual schema
should be possible without having to change existing external schemas or having to rewrite
existing application programs.
2. Physical data independence: The ability to change the physical schema without changing the
logical schema is called physical data independence. For example, a change to the internal
schema, such as using different file organization or storage structures, storage devices, or
indexing strategy, should be possible without having to change the conceptual or external
schemas.
3. View level data independence: always independent no effect, because there doesn't exist any
other level above view level.
Data Independence Types

Data independence has two types. They are:

1. Physical Independence
2. Logical Independence.
Data independence can be explained as follows: Each higher level of the data architecture is immune
to changes of the next lower level of the architecture.

Physical Independence: The logical scheme stays unchanged even though the storage space or type
of some data is changed for reasons of optimization or reorganization. In this external schema does
not change. In this internal schema changes may be required due to some physical schema were
reorganized here. Physical data independence is present in most databases and file environment in
26 DBMS Notes

which hardware storage of encoding, exact location of data on disk, merging of records, so on this are
hidden from user.

Logical Independence: The external scheme may stay unchanged for most changes of the logical
scheme. This is especially desirable as the application software does not need to be modified or newly
translated.

Data abstraction

In computer science, abstraction is the process by which data and programs are defined with
a representation similar in form to its meaning (semantics), while hiding away
the implementation details. Abstraction tries to reduce and factor out details so that the
programmer can focus on a few concepts at a time. A system can have several abstraction
layers whereby different meanings and amounts of detail are exposed to the programmer. For
example, low-level abstraction layers expose details of the computer hardware where the program
is run, while high-level layers deal with the business logic of the program.
The following English definition of abstraction helps to understand how this term applies to computer
science, IT and objects:
abstraction - a concept or idea not associated with any specific instance
Abstraction captures only those details about an object that are relevant to the current perspective.
The concept originated by analogy with abstraction in mathematics. The mathematical technique of
abstraction begins with mathematical definitions, making it a more technical approach than the
general concept of abstraction in philosophy. For example, in both computing and in mathematics,
numbers are concepts in the programming languages, as founded in mathematics. Implementation
details depend on the hardware and software, but this is not a restriction because the computing
concept of number is still based on the mathematical concept.
In computer programming, abstraction can apply to control or to data: Control abstraction is the
abstraction of actions while data abstraction is that of data structures.

Control abstraction involves the use of subprograms and related concepts control flows
Data abstraction allows handling data bits in meaningful ways. For example, it is the basic
motivation behind datatype.
One can regard the notion of an object (from object-oriented programming) as an attempt to
combine abstractions of data and code.
The same abstract definition can be used as a common interface for a family of objects with
different implementations and behaviors but which share the same meaning.
The inheritance mechanism in object-oriented programming can be used to define an abstract
class as the common interface.
27 DBMS Notes

The recommendation that programmers use abstractions whenever suitable in order to avoid
duplication (usually of code) is known as the abstraction principle. The requirement that a
programming language provide suitable abstractions is also called the abstraction principle.

Data abstraction enforces a clear separation between the abstract properties of a data type and
the concrete details of its implementation. The abstract properties are those that are visible to client
code that makes use of the data typethe interface to the data typewhile the concrete
implementation is kept entirely private, and indeed can change, for example to incorporate efficiency
improvements over time. The idea is that such changes are not supposed to have any impact on client
code, since they involve no difference in the abstract behaviour.
For example, one could define an abstract data type called lookup table which uniquely
associates keys with values, and in which values may be retrieved by specifying their corresponding
keys. Such a lookup table may be implemented in various ways: as a hash table, a binary search tree,
or even a simple linear list of (key:value) pairs. As far as client code is concerned, the abstract
properties of the type are the same in each case.
Of course, this all relies on getting the details of the interface right in the first place, since any changes
there can have major impacts on client code. As one way to look at this: the interface forms
a contract on agreed behaviour between the data type and client code; anything not spelled out in the
contract is subject to change without notice.
Languages that implement data abstraction include Ada and Modula-2. Object-oriented languages are
commonly claimed[to offer data abstraction; however, their inheritance concept tends to put information
in the interface that more properly belongs in the implementation; thus, changes to such information
ends up impacting client code, leading directly to the Fragile binary interface problem.
SQL

SQL is a standard language for accessing databases.

Our SQL tutorial will teach you how to use SQL to access and manipulate data in: MySQL, SQL
Server, Access, Oracle, Sybase, DB2, and other database systems.

SQL is a standard language for accessing and manipulating databases.

What is SQL?

SQL stands for Structured Query Language


SQL lets you access and manipulate databases
SQL is an ANSI (American National Standards Institute) standard

What Can SQL do?


28 DBMS Notes

SQL can execute queries against a database


SQL can retrieve data from a database
SQL can insert records in a database
SQL can update records in a database
SQL can delete records from a database
SQL can create new databases
SQL can create new tables in a database
SQL can create stored procedures in a database
SQL can create views in a database
SQL can set permissions on tables, procedures, and views

SQL is a Standard - BUT....

Although SQL is an ANSI (American National Standards Institute) standard, there are different versions
of the SQL language.

However, to be compliant with the ANSI standard, they all support at least the major commands (such
as SELECT, UPDATE, DELETE, INSERT, WHERE) in a similar manner.

Note: Most of the SQL database programs also have their own proprietary extensions in addition to the
SQL standard!

Using SQL in Your Web Site

To build a web site that shows data from a database, you will need:

An RDBMS database program (i.e. MS Access, SQL Server, MySQL)


To use a server-side scripting language, like PHP or ASP
To use SQL to get the data you want
To use HTML / CSS

RDBMS

RDBMS stands for Relational Database Management System.

RDBMS is the basis for SQL, and for all modern database systems such as MS SQL Server, IBM DB2,
Oracle, MySQL, and Microsoft Access.

The data in RDBMS is stored in database objects called tables.

A table is a collection of related data entries and it consists of columns and rows.
29 DBMS Notes

Database Tables

A database most often contains one or more tables. Each table is identified by a name (e.g.
"Customers" or "Orders"). Tables contain records (rows) with data.

In this tutorial we will use the well-known North wind sample database (included in MS Access and MS
SQL Server).

Below is a selection from the "Customers" table:

CustomerID CustomerName ContactName Address City PostalCode Country


1 Alfreds Futterkiste Maria Anders Obere Str. 57 Berlin 12209 Germany

2 Ana Trujillo Ana Trujillo Avda. de la Mxico 05021 Mexico


Emparedados y Constitucin D.F.
helados 2222

3 Antonio Moreno Antonio Mataderos 2312 Mxico 05023 Mexico


Taquera Moreno D.F.

4 Around the Horn Thomas 120 Hanover London WA1 1DP UK


Hardy Sq.

5 Berglunds Christina Berguvsvgen 8 Lule S-958 22 Sweden


snabbkp Berglund

The table above contains five records (one for each customer) and seven columns (CustomerID,
CustomerName, ContactName, Address, City, PostalCode, and Country).

SQL Statements

Most of the actions you need to perform on a database are done with SQL statements.

The following SQL statement selects all the records in the "Customers" table:

Example
SELECT * FROM Customers;

In this tutorial we will teach you all about the different SQL statements.

Keep in Mind That...


30 DBMS Notes

SQL is NOT case sensitive: SELECT is the same as select

Semicolon after SQL Statements?

Some database systems require a semicolon at the end of each SQL statement.

Semicolon is the standard way to separate each SQL statement in database systems that allow more
than one SQL statement to be executed in the same call to the server.

In this tutorial, we will use semicolon at the end of each SQL statement.

Some of The Most Important SQL Commands

SELECT - extracts data from a database


UPDATE - updates data in a database
DELETE - deletes data from a database
INSERT INTO - inserts new data into a database
CREATE DATABASE - creates a new database
ALTER DATABASE - modifies a database
CREATE TABLE - creates a new table
ALTER TABLE - modifies a table
DROP TABLE - deletes a table
CREATE INDEX - creates an index (search key)
DROP INDEX - deletes an index

File organization

File organization is the methodology which is applied to structured computer files. Files contain
computer records which can be documents or information which is stored in a certain way for later
retrieval. File organization refers primarily to the logical arrangement of data (which can itself be
organized in a system of records with correlation between the fields/columns) in a file system. It should
not be confused with the physical storage of the file in some types of storage media. There are certain
basic types of computer file, which can include files stored as blocks of data and streams of data,
where the information streams out of the file while it is being read until the end of the file is
encountered.

We will look at two components of file organization here:

1. The way the internal file structure is arranged and


2. The external file as it is presented to the O/S or program that calls it. Here we will also examine the
concept of file extensions.

We will examine various ways that files can be stored and organized. Files are presented to the
application as a stream of bytes and then an EOF (end of file) condition.
31 DBMS Notes

A program that uses a file needs to know the structure of the file and needs to interpret its contents.

Internal File Structure - Methods and Design Paradigm

It is a high-level design decision to specify a system of file organization for a computer software
program or a computer system designed for a particular purpose. Performance is high on the list of
priorities for this design process, depending on how the file is being used. The design of the file
organization usually depends mainly on the system environment. For instance, factors such as whether
the file is going to be used for transaction-oriented processes like OLTP or Data Warehousing, or
whether the file is shared among various processes like those found in a typical distributed system or
standalone. It must also be asked whether the file is on a network and used by a number of users and
whether it may be accessed internally or remotely and how often it is accessed.

However, all things considered the most important considerations might be:

1. Rapid access to a record or a number of records which are related to each other.
2. The Adding, modification, or deletion of records.
3. Efficiency of storage and retrieval of records.
4. Redundancy, being the method of ensuring data integrity.

A file should be organized in such a way that the records are always available for processing with no
delay. This should be done in line with the activity and volatility of the information.

Types of File Organization

Organizing a file depends on what kind of file it happens to be: a file in the simplest form can be a text
file, (in other words a file which is composed of ascii (American Standard Code for Information
Interchange) text.) Files can also be created as binary or executable types (containing elements other
than plain text.) Also, files are keyed with attributes which help determine their use by the host
operating system.

Techniques of File Organization

The three techniques of file organization are:

1. Heap (unordered)
2. Sorted
1. Sequential (SAM)
2. Line Sequential (LSAM)
3. Indexed Sequential (ISAM)
3. Hashed or Direct

In addition to the three techniques, there are four methods of organizing files. They are sequential,
line-sequential, indexed-sequential, inverted list and direct or hashed access organization.
32 DBMS Notes

Sequential Organization

A sequential file contains records organized in the order they were entered. The order of the records is
fixed. The records are stored and sorted in physical, contiguous blocks within each block the records
are in sequence.

Records in these files can only be read or written sequentially.

Once stored in the file, the record cannot be made shorter, or longer, or deleted. However, the record
can be updated if the length does not change. (This is done by replacing the records by creating a new
file.) New records will always appear at the end of the file.

If the order of the records in a file is not important, sequential organization will suffice, no matter
how many records you may have. Sequential output is also useful for report printing or sequential
reads which some programs prefer to do.

Line-Sequential Organization

Line-sequential files are like sequential files, except that the records can contain only characters as
data. Line-sequential files are maintained by the native byte stream files of the operating system.

In the COBOL environment, line-sequential files that are created with WRITE statements with the
ADVANCING phrase can be directed to a printer as well as to a disk.

Indexed-Sequential Organization

Key searches are improved by this system too. The single-level indexing structure is the simplest one
where a file, whose records are pairs, contains a key pointer. This pointer is the position in the data file
of the record with the given key. A subset of the records, which are evenly spaced along the data file,
is indexed, in order to mark intervals of data records.

This is how a key search is performed: the search key is compared with the index keys to find the
highest index key coming in front of the search key, while a linear search is performed from the record
that the index key points to, until the search key is matched or until the record pointed to by the next
index entry is reached. Regardless of double file access (index + data) required by this sort of search,
the access time reduction is significant compared with sequential file searches.

Let's examine, for sake of example, a simple linear search on a 1,000 record sequentially
organized file. An average of 500 key comparisons are needed (and this assumes the search keys are
uniformly distributed among the data keys). However, using an index evenly spaced with 100 entries,
the total number of comparisons is reduced to 50 in the index file plus 50 in the data file: a five to one
reduction in the operations count!

Hierarchical extension of this scheme is possible since an index is a sequential file in itself, capable of
indexing in turn by another second-level index, and so forth and so on. And the exploit of the
hierarchical decomposition of the searches more and more, to decrease the access time will pay
increasing dividends in the reduction of processing time. There is however a point when this advantage
33 DBMS Notes

starts to be reduced by the increased cost of storage and this in turn will increase the index access
time.

Hardware for Index-Sequential Organization is usually Disk-based, rather than tape. Records are
physically ordered by primary key. And the index gives the physical location of each record. Records
can be accessed sequentially or directly, via the index. The index is stored in a file and read into
memory at the point when the file is opened. Also, indexes must be maintained.

Life sequential organization the data is stored in physical contiguous box. How ever the difference is in
the use of indexes. There are three areas in the disc storage:

Primary Area:-Contains file records stored by key or ID numbers.


Overflow Area:-Contains records area that cannot be placed in primary area.
Index Area:-It contains keys of records and there locations on the disc.
Inverted List

In file organization, this is a file that is indexed on many of the attributes of the data itself. The inverted
list method has a single index for each key type. The records are not necessarily stored in a sequence.
They are placed in the are data storage area, but indexes are updated for the record keys and location.

Here's an example, in a company file, an index could be maintained for all products, another one might
be maintained for product types. Thus, it is faster to search the indexes than every record. These types
of file are also known as "inverted indexes." Nevertheless, inverted list files use more media space
and the storage devices get full quickly with this type of organization. The benefits are apparent
immediately because searching is fast. However, updating is much slower.

Content-based queries in text retrieval systems use inverted indexes as their preferred mechanism.
Data items in these systems are usually stored compressed which would normally slow the retrieval
process, but the compression algorithm will be chosen to support this technique.

When querying a file there are certain circumstances when the query is designed to be modal which
means that rules are set which require that different information be held in the index. Here's an
example of this modality: when phrase querying is undertaken, the particular algorithm requires that
offsets to word classifications are held in addition to document numbers.

Direct or Hashed Access

With direct or hashed access a portion of disk space is reserved and a hashing algorithm computes
the record address. So there is additional space required for this kind of file in the store. Records are
placed randomly through out the file. Records are accessed by addresses that specify their disc
location. Also, this type of file organization requires a disk storage rather than tape. It has an excellent
search retrieval performance, but care must be taken to maintain the indexes. If the indexes become
corrupt, what is left may as well go to the bit-bucket, so it is as well to have regular backups of this kind
of file just as it is for all stored valuable data!

External File Structure and File Extensions


34 DBMS Notes

Microsoft Windows and MS-DOS File Systems


The external structure of a file depends on whether it is being created on a FAT or NTFS partition. The
maximum filename length on a NTFS partition is 256 characters, and 11 characters on FAT (8
character name+"."+3 character extension.) NTFS filenames keep their
case, whereas FAT filenames have no concept of case (but case is ignored when performing a search
under NTFS Operating System). Also, there is the new VFAT which permits 256 character filenames.

UNIX and Apple Macintosh File Systems


The concept of directories and files is fundamental to the UNIX operating system. On Microsoft
Windows-based operating systems, directories are depicted as folders and moving about is
accomplished by clicking on the different icons. In UNIX, the directories are arranged as a hierarchy
with the root directory being at the top of the tree. The root directory is always depicted as /. Within
the / directory, there are subdirectories (e.g.: etc and sys). Files can be written to any directory
depending on the permissions. Files can be readable, writable and/or executable.

Organizing files using Libraries

With the advent of Microsoft Windows 7 the concept of file organization and management has
improved drastically by way of use of powerful tool called Libraries. A Library is file organization system
to bring together related files and folders stored in different locations of the local as well as network
computer such that these can be accessed centrally through a single access point. For instance,
various images stored in different folders in the local computer or/and across a computer network can
be accumulated in an Image Library. Aggregation of similar files can be manipulated, sorted or
accessed conveniently as and when required through a single access point on a computer desktop by
use of a Library. This feature is particularly very useful for accessing similar content of related content,
and also, for managing projects using related and common data.