
Database design and modelling

Business Rules and Relationship


By A. SandanaKaruppan, AP/IT

UNIT – I: INTRODUCTION TO DATABASE DESIGN AND MODELING


Session 1 to 2
1
COURSE OBJECTIVES

 The student should be made to:


To expose students to the basics of managing information.
To explore the various aspects of database design and modelling.
To examine the basic issues in information governance and information integration.
To provide an overview of information architecture.

2
COURSE OUTCOMES

 Upon successful completion of this course, students will be able to:


Cover core relational database topics including logical and physical design and modeling
Design and implement a complex information system that meets regulatory requirements;
define and manage an organization's key master data entities
Design, create, and maintain data warehouses.
Learn recent advances in NoSQL, Big Data, and related tools.

3
SESSION OBJECTIVES

 Introduce the Design Process, Modeling and Constraints


 Introduce the E-R Diagram.
 Design Issues, Weak Entity Sets and Extended E-R Features
 Case study: Design of the Bank Database, Reduction to Relation Schemas
 Explain the Database Design.
 Introduce the UML concepts.

4
OUTCOMES

Understand the Database design and E-R diagram.


Become familiar with design issues.
Become familiar with the UML concepts.

5
AGENDA

 Design Process
 Modeling
 Constraints
 E-R Diagram
 Design Issues
 Weak Entity Sets
 Extended E-R Features
 Design of the Bank Database
 Reduction to Relation Schemas
 Database Design
 UML
6
DESIGN PHASES

 The initial phase of database design is to characterize fully the data needs of the prospective
database users.

 Next, the designer chooses a data model and, by applying the concepts of the chosen data model,
translates these requirements into a conceptual schema of the database.

 A fully developed conceptual schema also indicates the functional requirements of the enterprise. In
a “specification of functional requirements”, users describe the kinds of operations (or transactions)
that will be performed on the data.

7
DESIGN PHASES (CONT.)

The process of moving from an abstract data model to the implementation of


the database proceeds in two final design phases.

 Logical Design – Deciding on the database schema. Database design requires that we
find a “good” collection of relation schemas.
 Business decision – What attributes should we record in the database?
 Computer Science decision – What relation schemas should we have and how should
the attributes be distributed among the various relation schemas?

 Physical Design – Deciding on the physical layout of the database

8
DESIGN APPROACHES

 Entity Relationship Model


 Models an enterprise as a collection of entities and relationships
 Entity: a “thing” or “object” in the enterprise that is distinguishable from other objects
 Described by a set of attributes
 Relationship: an association among several entities
 Represented diagrammatically by an entity-relationship diagram:

 Normalization Theory
 Formalize what designs are bad, and test for them

9
INTRODUCTION - DATA MODELING

 Process of creating a data model for an information system by applying formal data modeling
techniques.
 Process used to define and analyze data requirements needed to support the business
processes.
 Therefore, the process of data modeling involves professional data modelers working closely
with business stakeholders, as well as potential users of the information system.

10
WHAT IS DATA MODEL?

 A data model is a collection of conceptual tools for describing data, data relationships, data
semantics and consistency constraints.
 A data model is a conceptual representation of the data structures required for a database and is very
powerful in expressing and communicating the business requirements.
 A data model visually represents the nature of data, business rules governing the data, and how it
will be organized in the database.
 A data model provides a way to describe the design of a database at the physical, logical and
view levels.
 There are three different types of data models produced while progressing from requirements to
the actual database to be used for the information system.

11
DIFFERENT DATA MODELS

 Conceptual: describes WHAT the system contains.


 Logical: describes HOW the system will be implemented, regardless of the DBMS.
 Physical: describes HOW the system will be implemented using a specific DBMS.

12
A DATA MODEL CONSISTS OF ENTITIES RELATED TO EACH OTHER ON A
DIAGRAM:

Data Model Element | Definition

Entity | A real world thing or an interaction between 2 or more real world things.

Attribute | The atomic pieces of information that we need to know about entities.

Relationship | How entities depend on each other, in terms of why the entities depend on each other
(the relationship) and what that relationship is (the cardinality of the relationship).

13
EXAMPLE:

Given that …
 “Customer” is an entity.
 “Product” is an entity.
 For a “Customer” we need to know their “customer number” attribute and “name” attribute.
 For a “Product” we need to know the “product name” attribute and “price” attribute.
 “Sale” is an entity that is used to record the interaction of “Customer” and “Product”.

14
HERE IS THE DIAGRAM THAT ENCAPSULATES THESE RULES:

15
NOTES

 By convention, entities are named in the singular.


 The attributes of “Customer” are “Customer No” (which is the unique
identifier or primary key of the “Customer” entity and is shown by the #
symbol) and “Customer Name”.
 “Sale” has a composite primary key made up of the primary key of
“Customer”, the primary key of “Product” and the date of the sale.
 Think of entities as tables, think of attributes as columns on the table and
think of instances as rows on that table:

16
• If we want to know the price of a Sale, we can ‘find’ it by using the “Product
Code” on the instance of “Sale” we are interested in and look up the
corresponding “Price” on the “Product” entity with the matching “Product
Code”.
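
This lookup is easy to mirror in code. A minimal Java sketch, assuming hypothetical record types for the entities above (one record per entity, one component per attribute, one object per instance):

import java.util.List;
import java.util.Optional;

// Hypothetical classes mirroring the entities: one record per entity,
// one component per attribute, one object per instance (row).
record Product(String productCode, String productName, double price) {}
record Sale(String customerNo, String productCode, String saleDate) {}

class PriceLookup {
    // "Find" the price of a sale by matching its Product Code
    // against the Product entity, as described above.
    static Optional<Double> priceOfSale(Sale sale, List<Product> products) {
        return products.stream()
                .filter(p -> p.productCode().equals(sale.productCode()))
                .map(Product::price)
                .findFirst();
    }
}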

17
TYPES OF DATA MODELS

 Entity-Relationship (E-R) Models

 UML (unified modeling language)

18
ENTITY-RELATIONSHIP MODEL

 Entity Relationship Diagrams (ERDs) are used here, as this is the most widely used notation


 ERDs have an advantage in that they are capable of being normalized

[Diagram: an entity drawn as a rectangle, labelled "UniversityStudent", with the primary key (PK) StudentID at the top and the attributes StudentName, StudentDOB, and StudentAge listed beneath it]

 Represent entities as rectangles


 List attributes within the rectangle
19
WHY AND WHEN

 The purpose of a data model is to describe the concepts relevant to a domain, the relationships
between those concepts, and information associated with them.

20
 Used to model data in a standard, consistent, predictable manner in order to manage it as a
resource.
To have a clear picture of the base data that your business needs.
To identify missing and redundant base data.
To establish a baseline for communication across functional boundaries within your
organization.
 Provides a basis for defining business rules.
 Makes it cheaper, easier, and faster to upgrade your IT solutions.

21
ENTITY RELATIONSHIP DIAGRAM (ERD)

22
OBJECTIVES

 Define terms related to entity relationship modeling, including entity, entity instance, attribute,
relationship and cardinality, and primary key.
 Describe the entity modeling process.
 Discuss how to draw an entity relationship diagram.
 Describe how to recognize entities, attributes, relationships, and cardinalities.

23
DATABASE MODEL

A database can be modeled as:


a collection of entities,
relationship among entities.

Database systems are often modeled using an Entity Relationship (ER) diagram as the
"blueprint" from which the actual data is stored — the output of the design phase.

24
ENTITY RELATIONSHIP DIAGRAM (ERD)

 ER model allows us to sketch database designs


 ERD is a graphical tool for modeling data.
 ERD is widely used in database design
 ERD is a graphical representation of the logical structure of a database
 ERD is a model that identifies the concepts or entities that exist in a system and the relationships
between those entities

25
PURPOSES OF ERD

An ERD serves several purposes

The database analyst/designer gains a better understanding of the information to be


contained in the database through the process of constructing the ERD.
The ERD serves as a documentation tool.
Finally, the ERD is used to communicate the logical structure of the database to users.

26
COMPONENTS OF AN ERD

An ERD typically consists of four different graphical components:


1. Entity
2. Relationship
3. Cardinality
4. Attribute

27
CLASSIFICATION OF RELATIONSHIP

 Optional Relationship
An Employee may or may not be assigned to a Department
A Patient may or may not be assigned to a Bed
 Mandatory Relationship
Every Course must be taught by at least one Teacher
Every mother must have at least one Child

28
CARDINALITY CONSTRAINTS

 Express the number of entities to which another entity can be associated via a relationship
set.
 Cardinality Constraints - the number of instances of one entity that can or must be associated
with each instance of another entity.
 Minimum Cardinality
If zero, then optional
If one or more, then mandatory
 Maximum Cardinality
The maximum number

29
CARDINALITY CONSTRAINTS (CONTD.)

 For a binary relationship set the mapping cardinality must be one of the following types:
One to one
A Manager heads one Department, and vice versa
One to many (or many to one)
An Employee works in one Department; one Department has many Employees
Many to many
A Teacher teaches many Students, and a Student is taught by many Teachers

30
GENERAL STEPS TO CREATE AN ERD

 Identify the entity


 Identify the entity's attributes
 Identify the Primary Keys
 Identify the relation between entities
 Identify the Cardinality constraint
 Draw the ERD
 Check the ERD

31
STEPS IN BUILDING AN ERD

32
DEVELOPING AN ERD

The process has ten steps:


1. Identify Entities
2. Find Relationships
3. Draw Rough ERD
4. Fill in Cardinality
5. Define Primary Keys
6. Draw Key-Based ERD
7. Identify Attributes
8. Map Attributes
9. Draw fully attributed ERD
10. Check Results

33
A SIMPLE EXAMPLE

A company has several departments. Each department has a supervisor and at


least one employee. Employees must be assigned to at least one, but possibly
more departments. At least one employee is assigned to a project, but an employee
may be on vacation and not assigned to any projects. The important data fields are
the names of the departments, projects, supervisors and employees, as well as the
supervisor and employee number and a unique project number.

34
IDENTIFY ENTITIES

 One approach to this is to work through the information and highlight those words which you
think correspond to entities.
 A company has several departments. Each department has a supervisor and at least one
employee. Employees must be assigned to at least one, but possibly more departments. At
least one employee is assigned to a project, but an employee may be on vacation and not
assigned to any projects. The important data fields are the names of the departments, projects,
supervisors and employees, as well as the supervisor and employee number and a unique
project number.

 A true entity should have more than one instance

35
FIND RELATIONSHIPS

 Aim is to identify the associations, the connections between pairs of


entities.

 A simple approach to do this is using a relationship matrix (table) that has


rows and columns for each of the identified entities.

36
FIND RELATIONSHIPS (CONTD.)

 Go through each cell and decide whether or not there is an association. For
example, the first cell on the second row is used to indicate if there is a
relationship between the entity "Employee" and the entity "Department".

37
IDENTIFIED RELATIONSHIPS

Names placed in the cells are meant to capture/describe the relationships, so you can read them
like this:
A Department is assigned an employee
A Department is run by a supervisor
An employee belongs to a department
An employee works on a project
A supervisor runs a department
A project uses an employee

38
DRAW ROUGH ERD

Draw a diagram and:


 Place all the entities in rectangles
 Use diamonds and lines to represent the relationships between entities.
 General Examples

39
DRAWING ROUGH ERD (CONTD.)

40
DRAWING ROUGH ERD (CONTD.)

41
DRAWING ROUGH ERD (CONTD.)

42
FILL IN CARDINALITY

 Supervisor
Each department has one supervisor.
 Department
Each supervisor has one department.
Each employee can belong to one or more departments
 Employee
Each department must have one or more employees
Each project must have one or more employees
 Project
Each employee can have 0 or more projects.

43
FILL IN CARDINALITY (CONTD.)

The cardinality of a relationship can only have the following values


One and only one
One or more
Zero or more
Zero or one

44
CARDINALITY NOTATION

45
CARDINALITY EXAMPLES

[Diagram: four A-B relationship lines drawn in crow's-foot notation, one per statement below]

Each instance of A is related to a minimum of zero and a maximum of one instance of B.

Each instance of B is related to a minimum of one and a maximum of one instance of A.

Each instance of A is related to a minimum of one and a maximum of many instances of B.

Each instance of B is related to a minimum of zero and a maximum of many instances of A.
46
ERD WITH CARDINALITY

48
EXAMPLES

49
ERD FOR COURSE ENROLLMENT

50
ERD FOR COURSE REGISTRATION

51
ROUGH ERD PLUS PRIMARY KEYS

52
IDENTIFY ATTRIBUTES

 In this step we try to identify and name all the attributes essential to the system we are studying
without trying to match them to particular entities.
 The best way to do this is to study the forms, files and reports currently kept by the users of the
system and circle each data item on the paper copy.
 Cross out those which will not be transferred to the new system, extraneous items such as
signatures, and constant information which is the same for all instances of the form (e.g. your
company name and address). The remaining circled items should represent the attributes you need.
You should always verify these with your system users. (Sometimes forms or reports are out of date.)
 The only attributes indicated are the names of the departments, projects, supervisors and
employees, as well as the supervisor and employee numbers and a unique project number.

53
MAP ATTRIBUTES

 For each attribute we need to match it with exactly one entity. Often it seems like
an attribute should go with more than one entity (e.g. Name). In this case you
need to add a modifier to the attribute name to make it unique (e.g. Customer
Name, Employee Name, etc.) or determine which entity an attribute "best"
describes.
 If you have attributes left over without corresponding entities, you may have
missed an entity and its corresponding relationships. Identify these missed entities
and add them to the relationship matrix now.

54
MAP ATTRIBUTES (CONTD.)

55
DRAW FULLY ATTRIBUTED ERD

56
CHECK ERD RESULTS

 Look at your diagram from the point of view of a system owner or user. Is everything clear?
 Check through the Cardinality pairs.
 Also, look over the list of attributes associated with each entity to see if anything has been omitted.

57
SUMMARY

 Conceptual design follows requirements analysis,


Yields a high-level description of data to be stored
 ER model popular for conceptual design
Constructs are expressive, close to the way people think about their applications.
 Basic constructs: entities, relationships, and attributes (of entities and relationships).
 Some additional constructs: weak entities, ISA hierarchies, and aggregation.
 Note: There are many variations on ER model.

58
SUMMARY

 Several kinds of integrity constraints can be expressed in the ER model: key constraints,
participation constraints, and overlap/covering constraints for ISA hierarchies. Some foreign key
constraints are also implicit in the definition of a relationship set.
 Some of these constraints can be expressed in SQL only if we use general CHECK constraints or
assertions.
 Some constraints (notably, functional dependencies) cannot be expressed in the ER model.
 Constraints play an important role in determining the best database design for an enterprise.
 ER design is subjective. There are often many ways to model a given scenario! Analyzing
alternatives can be tricky, especially for a large enterprise. Common choices include:
Entity vs. attribute, entity vs. relationship, binary or n-ary relationship, whether or not to use
ISA hierarchies, and whether or not to use aggregation.
 Ensuring good database design: resulting relational schema should be analyzed and refined
further. FD information and normalization techniques are especially useful.

59
JAVA DATABASE CONNECTIVITY (JDBC)

60
Java Database Connectivity (JDBC)
By A. SandanaKaruppan, AP/IT

UNIT – I: Java Database Connectivity (JDBC)


Session 3
61
COURSE OBJECTIVES

 The student should be made to:


To expose students to the basics of managing information.
To explore the various aspects of database design and modelling.
To examine the basic issues in information governance and information integration.
To provide an overview of information architecture.

62
COURSE OUTCOMES

 Upon successful completion of this course, students will be able to:


Cover core relational database topics including logical and physical design and modeling
Design and implement a complex information system that meets regulatory requirements;
define and manage an organization's key master data entities
Design, create, and maintain data warehouses.
Learn recent advances in NoSQL, Big Data, and related tools.

63
SESSION OBJECTIVES

 Introduce the JDBC and ODBC concepts.


 Explain the types of drivers.
 Introduce the Common JDBC Components.
 Explain the Stored Procedure Language.

64
OUTCOMES

 Understand the JDBC and ODBC concepts.


 Explain the types of drivers.
 Understand the Common JDBC Components.
 Express the Stored Procedure Language.

65
AGENDA

 Introduction to JDBC and ODBC.


 Types of Drivers.
 Common JDBC Components.
 Stored Procedure Language

66
INTRODUCTION

 Database
Collection of data
 DBMS
Database management system
Storing and organizing data
 SQL
Relational database
Structured Query Language
 JDBC
Java Database Connectivity
JDBC driver

67
JDBC

 Programs developed with Java/JDBC are platform and vendor independent.


 “write once, compile once, run anywhere”
 Write apps in Java to access any DB, using standard SQL statements – while still following Java
conventions.
 JDBC driver manager and JDBC drivers provide the bridge between the database and java
worlds.

68
[Diagram: a Java application calls the JDBC API; the JDBC Driver Manager selects a driver implementing the JDBC Driver API, either a vendor-supplied JDBC driver that talks to its database directly, or the JDBC/ODBC Bridge, which delegates to an ODBC driver and its database]

69
ODBC

 JDBC heavily influenced by ODBC


 ODBC provides a C interface for database access on the Windows environment.
 ODBC has a few commands with lots of complex options. Java prefers simple methods but lots of
them.

70
TYPES OF DRIVERS

[Diagram: the four JDBC driver types: Type 1 bridges to a 3rd-party API (ODBC), Type 2 calls the database's native C/C++ API (local API), Type 3 speaks a generic network API to middleware, and Type 4 talks the database's own network protocol directly]

• Type 1: Uses a bridging technology to access a database. JDBC-ODBC bridge is an example. It


provides a gateway to the ODBC.
• Type 2: Native API drivers. Driver contains Java code that calls native C/C++ methods provided by the
database vendors.
• Type 3: Generic network API that is then translated into database-specific access at the server level.
The JDBC driver on the client uses sockets to call a middleware application on the server that
translates the client requests into an API specific to the desired driver. Extremely flexible.
• Type 4: Uses network protocols built into the database engine to talk directly to the database using Java
sockets. Almost always comes only from database vendors.
71
JDBC DRIVERS TYPES

JDBC driver implementations vary because of the wide variety of


operating systems and hardware platforms in which Java operates.
Sun has divided the implementation types into four categories,
Types 1, 2, 3, and 4.

72
JDBC DRIVERS

73
COMMON JDBC COMPONENTS

The JDBC API provides the following interfaces and classes −

DriverManager:

• This class manages a list of database drivers.


• Matches connection requests from the Java application with the proper database driver using
communication sub protocol.
• The first driver that recognizes a certain subprotocol under JDBC will be used to establish a
database Connection.

Driver:

This interface handles the communications with the database server.


You will interact directly with Driver objects very rarely. Instead, you use the DriverManager,
which manages objects of this type.
It also abstracts the details associated with working with Driver objects.

74
COMMON JDBC COMPONENTS

Connection:
 This interface provides all methods for contacting a database.
 The connection object represents communication context, i.e., all communication with database is
through connection object only.

Statement:
You use objects created from this interface to submit the SQL statements to the database.

ResultSet:
These objects hold data retrieved from a database after you execute an SQL query using
Statement objects.
 It acts as an iterator to allow you to move through its data.
SQLException:
 This class handles any errors that occur in a database application.
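
Taken together, these components follow a standard pattern. A minimal sketch of a query, assuming a hypothetical MySQL URL, credentials, and an employee table with id and name columns:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class QueryDemo {
    public static void main(String[] args) {
        String url = "jdbc:mysql://localhost:3306/company";   // hypothetical URL
        try (Connection conn = DriverManager.getConnection(url, "user", "pass");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, name FROM employee")) {
            while (rs.next()) {                    // ResultSet acts as an iterator
                System.out.println(rs.getInt("id") + " " + rs.getString("name"));
            }
        } catch (SQLException e) {                 // SQLException reports database errors
            e.printStackTrace();
        }
    }
}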

75
TYPE 1: JDBC-ODBC BRIDGE DRIVER

• In a Type 1 driver, a JDBC bridge is used to access ODBC drivers installed


on each client machine.
• Using ODBC requires configuring on your system a Data Source Name
(DSN) that represents the target database.
• The JDBC-ODBC Bridge that comes with JDK 1.2 is a good example of this
kind of driver.

76
TYPE 1 DRIVER

77
TYPE 2 DRIVER

In a Type 2 driver, JDBC API calls are converted into native C/C++ API calls, which are unique to
the database.
These drivers are typically provided by the database vendors and used in the same manner as the
JDBC-ODBC Bridge.
The vendor-specific driver must be installed on each client machine.
The Oracle Call Interface (OCI) driver is an example of a Type 2 driver.

78
TYPE 2 DRIVER

79
TYPE 3: JDBC-NET PURE JAVA

• In a Type 3 driver, a three-tier approach is used to access databases.


• The JDBC clients use standard network sockets to communicate with a middleware
application server.
• The socket information is then translated by the middleware application server into the call
format required by the DBMS, and forwarded to the database server.

80
TYPE 4: 100% PURE JAVA

81
TYPE 4: 100% PURE JAVA

• In a Type 4 driver, a pure Java-based driver communicates directly with the vendor's database
through socket connection.
• This is the highest performance driver available for the database and is usually provided by the
vendor itself.
• This kind of driver is extremely flexible: you don't need to install special software on the client or
server. Further, these drivers can be downloaded dynamically.

82
TYPE 4 DRIVER

83
The following steps are required to create a new Database using JDBC application

Import the packages:


 Requires that you include the packages containing the JDBC classes needed for database
programming.
Most often, using import java.sql.* will suffice.
Register the JDBC driver:
 Requires that you initialize a driver so you can open a communications channel with the database.
Open a connection:
 Using the DriverManager.getConnection() method to create a Connection object, which represents
a physical connection with the database server.
To create a new database, you need not give any database name while preparing the database URL,
as shown in the sketch below.
Execute a query:
 Using an object of type Statement for building and submitting an SQL statement to the database.
Clean up the environment:
 Explicitly closing all database resources versus relying on the JVM's garbage collection.
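
A minimal sketch of these five steps, assuming a MySQL server with its JDBC driver on the classpath; the URL, credentials, and database name are hypothetical:

// Step 1: import the packages
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class CreateDatabaseDemo {
    public static void main(String[] args) {
        // Step 3: open a connection; note there is no database name in the URL,
        // since the database does not exist yet
        String url = "jdbc:mysql://localhost:3306/";
        // Step 2: modern drivers self-register via the service loader;
        // older setups call Class.forName("com.mysql.cj.jdbc.Driver") here
        try (Connection conn = DriverManager.getConnection(url, "user", "pass");
             Statement stmt = conn.createStatement()) {
            // Step 4: execute the query
            stmt.executeUpdate("CREATE DATABASE studentdb");
        } catch (SQLException e) {
            e.printStackTrace();
        }
        // Step 5: try-with-resources closes the Statement and Connection explicitly
    }
}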

84
STORED PROCEDURES

85
STORED PROCEDURE LANGUAGE

Stored Procedure Overview


A stored procedure is a function in a shared library accessible to the database server
You can also write stored procedures using languages such as C or Java
Advantage of stored procedures: reduced network traffic
The more SQL statements that are grouped together for execution, the larger the savings in
network traffic

86
Normal Database

87
Applications using stored
procedures

88
Writing Stored Procedures

 Tasks performed by the client application


 Tasks performed by the stored procedure, when invoked
 The CALL statement
 Explicit parameters to be defined:
 IN: Passes a value to the stored procedure from the client application
 OUT: Stores a value that is passed to the client application when the stored procedure
terminates.
 INOUT: Passes a value to the stored procedure from the client application, and returns a value
to the client application when the stored procedure terminates
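
From Java, a stored procedure with these parameter modes is invoked through JDBC's CallableStatement. A sketch, assuming a hypothetical procedure getEmpName with one IN and one OUT parameter:

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Types;

public class CallDemo {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:mysql://localhost:3306/company";   // hypothetical URL
        try (Connection conn = DriverManager.getConnection(url, "user", "pass");
             CallableStatement cs = conn.prepareCall("{call getEmpName(?, ?)}")) {
            cs.setInt(1, 42);                          // IN parameter: employee id
            cs.registerOutParameter(2, Types.VARCHAR); // OUT parameter: employee name
            cs.execute();                              // runs the CALL statement
            System.out.println(cs.getString(2));       // read the OUT value
        }
    }
}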

89
Some Valid SQL Procedure Body Statements

CASE statement
FOR statement
GOTO statement
IF statement
ITERATE statement
RETURN statement
WHILE statement

90
 Invoking Procedures
A stored procedure stored at the location of the database can be invoked by using the SQL CALL
statement

 Nested SQL Procedures:


To call a target SQL procedure from within a caller SQL procedure, simply include a CALL
statement with the appropriate number and types of parameters in your caller.

91
CONDITIONAL STATEMENTS:

IF <condition> THEN
    <statement(s)>
ELSE
    <statement(s)>
END IF;

Loops
LOOP
    ...
    EXIT WHEN <condition>;
    ...
END LOOP;

92
SUMMARY

 JDBC is an API specification developed by Sun Microsystems that defines a uniform interface for
accessing different relational databases.
 The primary function of the JDBC API is to allow the developer to issue SQL statements and
process the results in a consistent, database-independent manner.
 The JDBC API uses a driver manager and database-specific drivers to provide transparent
connectivity to heterogeneous databases.
 The JDBC driver manager ensures that the correct driver is used to access each data source.
The driver manager is capable of supporting multiple concurrent drivers connected to multiple
heterogeneous databases.
 A JDBC driver translates standard JDBC calls into a network protocol or client API call that
facilitates communication with the database. This translation provides JDBC applications with
database independence.

93
SUMMARY

 A prepared statement is an SQL statement that is precompiled by the database. Through


precompilation, prepared statements improve the performance of SQL commands that are
executed multiple times (given that the database supports prepared statements).
 A transaction is a set of SQL statements that are grouped such that all statements are
guaranteed to be executed or the entire operation will fail. If all statements execute successfully,
the results are committed to the database; otherwise, all changes are rolled back.
 A stored procedure is an SQL operation that is stored on the database server. Stored procedures
are usually written in an SQL dialect that has been expanded to include conditional statements,
looping constructs, and other procedural programming features.
 Metadata is defined as information (or data) about data. JDBC provides specific information
about a database or a result set via metadata objects.
 Database connection pooling is the process of establishing a set, or pool, of database
connections before they are actually needed.

94
BIG DATA

95
Trends in Big Data systems - NoSQL
By A. SandanaKaruppan, AP/IT

UNIT – I: Trends in Big Data systems


Session 4 & 5
96
COURSE OBJECTIVES

 The student should be made to:


To expose students to the basics of managing information.
To explore the various aspects of database design and modelling.
To examine the basic issues in information governance and information integration.
To provide an overview of information architecture.

97
COURSE OUTCOMES

 Upon successful completion of this course, students will be able to:


Cover core relational database topics including logical and physical design and modeling
Design and implement a complex information system that meets regulatory requirements;
define and manage an organization's key master data entities
Design, create, and maintain data warehouses.
Learn recent advances in NoSQL, Big Data, and related tools.

98
SESSION OBJECTIVES

 Introduce Big Data and the types of data.


 Describe applications for Big Data analytics.
 Introduce NoSQL and the types of NoSQL databases.
 Explain the benefits of NoSQL over RDBMS.

99
OUTCOMES

 Understand Big Data and the types of data.


 Describe applications for Big Data analytics.
 Understand NoSQL and the types of NoSQL databases.
 Express the benefits of NoSQL over RDBMS.

100
AGENDA

 Introduction to Big data.


 Types of Data.
 Applications for Big Data Analytics
 Introduction to NoSQL
 Types of NoSQL Databases
 Benefits of NoSQL over RDBMS

101
WHAT IS BIG DATA?

 Big data is a massive volume of both structured and unstructured data that is so large it is difficult
to process using traditional database and software techniques.

 In most enterprise scenarios the volume of data is too big or it moves too fast or it exceeds current
processing capacity.

 Despite these problems, big data has the potential to help companies improve operations and
make faster, more intelligent decisions.

102
WHY BIG DATA

 Key enablers of the appearance and growth of Big Data are:

Increase of storage capacities


Increase of processing power
Availability of data
Every day we create 2.5 quintillion bytes of data; 90% of the data in the world today has been
created in the last two years alone

103
BIG DATA EVERYWHERE!

 Lots of data is being collected


and warehoused
Web data, e-commerce
purchases at department/grocery stores
Bank/Credit Card transactions
Social Network

104
HOW MUCH DATA?

 Google processes 20 PB a day (2008)


 Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
 eBay has 6.5 PB of user data + 50 TB/day (5/2009)

105
UNITS

 BIT
 NIBBLE
 BYTE/OCTET (B)
 KILOBYTE (KB)
 MEGABYTE (MB)
 GIGABYTE (GB)
 TERABYTE (TB)
 PETABYTE (PB)
 EXABYTE (EB)
 ZETTABYTE (ZB)
 YOTTABYTE (YB)

106
TYPES OF DATA

 Big data involves three types of data:

Structured Data,

Semi-structured Data, and

Unstructured Data.

107
STRUCTURED DATA

 It concerns all data that can be stored in a SQL database, in tables with rows and columns.

 Such data has a relational key and can be easily mapped into pre-designed fields.

 Today, this is the most processed data in development and the simplest way to manage
information.

 But structured data represents only 5 to 10% of all data.

108
SEMI STRUCTURED DATA

 Semi-structured data is information that doesn’t reside in a relational database but that does
have some organizational properties that make it easier to analyze.

 Examples of semi-structured : XML and JSON (JavaScript Object Notation) documents are semi
structured documents.

 Like structured data, semi-structured data represents only a small share of all data (5 to 10%).

109
UNSTRUCTURED DATA

 Unstructured data represent around 80% of data.

 It often includes text and multimedia content.

Examples include e-mail messages, word processing documents, videos, photos, audio files,
presentations, web pages and many other kinds of business documents.

 Note that while these sorts of files may have an internal structure, they are still considered
"unstructured" because the data they contain doesn't fit neatly in a database.

 Unstructured data is everywhere. In fact, most individuals and organizations conduct their lives
around unstructured data.

110
HERE ARE SOME EXAMPLES OF MACHINE-GENERATED
UNSTRUCTURED DATA:

Satellite images
Scientific data
Photographs and video
Social media data
Mobile data
Website content

111
WHAT TO DO WITH THESE DATA?

 Aggregation and Statistics


Data warehouse and OLAP
 Indexing, Searching, and Querying
Keyword based search
Pattern matching (XML/RDF)
 Knowledge discovery
Data Mining
Statistical Modeling

112
EXAMPLES OF BIG DATA
IT log analytics

 IT solutions and IT departments generate an enormous quantity of logs and trace data.

 In the absence of a Big Data solution, much of this data must go unexamined: organizations
simply don't have the manpower or resources to churn through all that information by hand, let
alone in real time.

 With a Big Data solution in place, however, those logs and trace data can be put to good use.

 Within this list of Big Data application examples, IT log analytics is the most broadly applicable.

113
APPLICATIONS FOR BIG DATA ANALYTICS

Smarter healthcare
Multi-channel sales
Finance
Log analysis
Homeland security
Traffic control
Telecom
Search quality
Manufacturing
Trading analytics
Fraud and risk
Retail: churn, NBO

114
NOSQL?

NoSQL does not mean "Not SQL"

115
NOSQL?

NoSQL means "Not Only SQL",
or: not a relational database

116
WHY NOSQL

 Large Volume of Data

 Dynamic Schemas

 Auto-sharding

 Replication

 Horizontally Scalable

* Some of these operations can be achieved with enterprise-class RDBMS software, but at very high cost

117
DEFINE NOSQL

 NoSQL refers to non-relational database management systems, which differ from traditional relational


database management systems in some significant ways.

 NoSQL database provides a mechanism for storage and retrieval of data that is modeled in
means other than the tabular relations used in relational databases (RDBMS).

 It is designed for distributed data stores with very large-scale data storage needs (for
example, Google or Facebook, which collect terabits of data every day for their users).

118
TYPES OF NOSQL DATABASES

NoSQL Databases:

Document Stores, Columnar Databases, Graph Databases, and Key-Value Stores

119
DOCUMENT ORIENTED DATABASES

Document oriented databases treat a document as a whole and avoid splitting a


document into its constituent name/value pairs.

At a collection level, this allows for putting together a diverse set of documents into a
single collection.

Document databases allow indexing of documents on the basis of not only its primary
identifier but also its properties.

120
CONT…

 Different open-source document databases are available today but the most prominent among
the available options are MongoDB and CouchDB.

 In fact, MongoDB has become one of the most popular NoSQL databases.
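
A sketch of document-store access using the MongoDB Java driver (com.mongodb.client); the server address, database, collection, and field names are assumptions:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class MongoDemo {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> customers =
                    client.getDatabase("shop").getCollection("customers");
            // The document is stored as a whole, not split into name/value rows
            customers.insertOne(new Document("name", "Alice")
                    .append("city", "Chennai")
                    .append("orders", 3));
            // Query on a property, not just the primary identifier (_id)
            Document found = customers.find(new Document("city", "Chennai")).first();
            if (found != null) System.out.println(found.toJson());
        }
    }
}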

121
DOCUMENT ORIENTED DATABASES

122
GRAPH BASED DATABASES

 A graph database uses graph structures with nodes, edges, and properties to represent and
store data.

 By definition, a graph database is any storage system that provides index-free adjacency. This
means that every element contains a direct pointer to its adjacent element and no index lookups
are necessary.
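
Index-free adjacency can be sketched in plain Java: each node holds direct references to its neighbors, so traversal is pointer-chasing rather than an index lookup (class and field names are illustrative):

import java.util.ArrayList;
import java.util.List;

// Each node keeps direct pointers to its adjacent nodes, so following
// an edge is a field access, not a lookup in a separate index.
class Node {
    final String label;
    final List<Node> neighbors = new ArrayList<>();
    Node(String label) { this.label = label; }
    void connect(Node other) { neighbors.add(other); }
}

class GraphDemo {
    public static void main(String[] args) {
        Node alice = new Node("Alice");
        Node bob = new Node("Bob");
        alice.connect(bob);                  // edge: Alice -> Bob
        for (Node n : alice.neighbors) {     // traversal via direct pointers
            System.out.println(alice.label + " knows " + n.label);
        }
    }
}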

123
CONT…

 General graph databases that can store any graph are distinct from specialized graph databases
such as triple-stores and network databases. Indexes are used for traversing the graph.

124
GRAPH DATABASE

125
COLUMN BASED DATABASES

 The column-oriented storage allows data to be stored effectively.

 It avoids consuming space when storing nulls by simply not storing a column when a value doesn’t
exist for that column.

126
CONT…

 Each unit of data can be thought of as a set of key/value pairs, where the unit itself is identified
with the help of a primary identifier, often referred to as the primary key.

127
KEY VALUE DATABASES

 The key of a key/value pair is a unique value in the set and can be easily looked up to access
the data.

 Key/value pairs are of varied types: some keep the data in memory and some provide the
capability to persist the data to disk.
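
At its core, a key/value store behaves like a map from unique keys to opaque values. A minimal in-memory sketch (a production store such as Redis or Riak adds persistence, replication, and sharding):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal in-memory key/value store: the key is unique in the set
// and gives direct access to the stored value.
class KeyValueStore {
    private final Map<String, String> data = new ConcurrentHashMap<>();

    void put(String key, String value) { data.put(key, value); }
    String get(String key) { return data.get(key); }
}

class KvDemo {
    public static void main(String[] args) {
        KeyValueStore store = new KeyValueStore();
        store.put("user:42:name", "Alice");             // opaque value under a unique key
        System.out.println(store.get("user:42:name"));  // constant-time lookup by key
    }
}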

128
KEY VALUE DATABASES

129
KEY VALUE DATABASES

130
BENEFITS OF NOSQL OVER RDBMS

Schema Less

 NoSQL databases being schema-less do not define any strict data structure.

Dynamic and Agile

 NoSQL databases tend to grow dynamically with changing requirements. They
can handle structured, semi-structured and unstructured data.

131
BENEFITS (CONT…)

Scales Horizontally:

 NoSQL scales horizontally by adding more servers and using concepts of sharding and
replication.

 This behavior of NoSQL fits with the cloud computing services such as Amazon Web
Services (AWS) which allows you to handle virtual servers which can be expanded
horizontally on demand.

132
BENEFITS (CONT…)

 Better Performance:
All the NoSQL databases claim to deliver better and faster performance as compared to
traditional RDBMS implementations.

133
SUMMARY

 Big data is a massive volume of both structured and unstructured data that is so large it is difficult
to process using traditional database and software techniques.
 Lots of data is being collected
and warehoused.
 Big data involves three types of data: structured, semi-structured, and unstructured.
 NoSQL refers to non-relational database management systems, which differ from traditional relational
database management systems in some significant ways.
 All the NoSQL databases claim to deliver better and faster performance as compared to
traditional RDBMS implementations.

134
CAP THEOREM

135
CAP

It is impossible for a web service to provide the following three

guarantees at the same time:

Consistency

Availability

Partition-tolerance

A distributed system can satisfy any two of these guarantees at the same time but not all

three

136
CAP THEOREM

Consistency
All the servers in the system will have the same data so anyone using the system will get the
same copy regardless of which server answers their request.

Availability
The system will always respond to a request (even if it's not the latest data or consistent across
the system or just a message saying the system isn't working)

Partition Tolerance
The system continues to operate as a whole even if individual servers fail or can't be reached.

137
CAP THEOREM

[Diagram: the CAP triangle, with Consistency (C), Availability (A), and Partition tolerance (P) at its corners; a system can satisfy at most two at once]

138
CAP THEOREM

 A simple example:

Hotel Booking: are we double-booking the same


room?

[Diagram: two customers, Bob and Dong, try to book the same room at the same time through different servers]

139
Credit: http://architects.dzone.com/articles/better-explaining-cap-theorem
142
CHOOSING AP

Credit: https://foundationdb.com/key-value-store/white-papers/the-cap-theorem
143
CHOOSING CP

Replication allows us to add


Availability

Credit: https://foundationdb.com/key-value-store/white-papers/the-cap-theorem
144
 Hadoop

145
Hadoop HDFS, MapReduce, Hive, and enhancements
By A. SandanaKaruppan, AP/IT

UNIT – I: Hadoop HDFS, MapReduce, Hive, and enhancements


Session 6-9
146
COURSE OBJECTIVES

 The student should be made to:


To expose students to the basics of managing information.
To explore the various aspects of database design and modelling.
To examine the basic issues in information governance and information integration.
To provide an overview of information architecture.

147
COURSE OUTCOMES

 Upon successful completion of this course, students will be able to:


Cover core relational database topics including logical and physical design and modeling
Design and implement a complex information system that meets regulatory requirements;
define and manage an organization's key master data entities
Design, create, and maintain data warehouses.
Learn recent advances in NoSQL, Big Data, and related tools.

148
SESSION OBJECTIVES

 Introduce Hadoop basics.


 Explain the HDFS.
 Introduce the MapReduce concepts.
 Explain the Hive and enhancements.

149
OUTCOMES

 Understand the Hadoop fundamentals.


 Explain the HDFS.
 Understand the MapReduce concepts.
 Explain the Hive and enhancements.

150
AGENDA

 Hadoop-Basics
 HDFS
Goals
Architecture
Other functions
 MapReduce
Basics
Word Count Example
Handy tools
Finding shortest path example
 Introduction to HIVE.
151
HADOOP-BASICS

 An open source software framework

 Supports data-intensive distributed applications.

 Derived from Google’s Map-Reduce and Google File System papers.

 Written in the Java Programming Language.

152
HADOOP (WHY)

 Need to process huge datasets on large numbers of computers.


 It is expensive to build reliability into each application.
 Nodes fail every day
 Failure is expected, rather than exceptional.
 Need common infrastructure
Efficient, reliable, easy to use.
Open source, Apache License

153
WHAT IS HADOOP USED FOR ?

 Searching (Yahoo)

 Log Processing

 Recommendation Systems (Facebook, LinkedIn, eBay, Amazon)

 Analytics(Facebook, LinkedIn)

 Video and Image Analysis (NASA)

 Data Retention

154
GOALS OF HDFS

1. Very Large Distributed File System


- 10K nodes, 100 million files, 10 PB
2. Assumes Commodity Hardware
- Files are replicated to handle hardware failure
- Detects failures and recovers from them
3. Optimized for Batch Processing
- Data locations exposed so that computation can move to where data resides.

155
DISTRIBUTED FILE SYSTEM

 Single Namespace for entire cluster


 Data Coherency
Write-once-read-many access model
Client can only append to existing files
 Files are broken up into blocks
Typically 64MB block size
Each block replicated on multiple DataNodes
 Intelligent Client
Client can find location of blocks
Client accesses data directly from DataNode

156
HDFS ARCHITECTURE

157
FUNCTIONS OF A NAMENODE

 Manages File System Namespace


Maps a file name to a set of blocks
Maps a block to the DataNodes where it resides
 Cluster Configuration Management
 Replication Engine for Blocks

158
NAMENODE METADATA

 Metadata in Memory
The entire metadata is in main memory
No demand paging of metadata
 Types of metadata
List of files
List of Blocks for each file
List of DataNodes for each block
File attributes, e.g. creation time, replication factor
 A Transaction Log
Records file creations, file deletions, etc.

159
DATANODE

 A Block Server
Stores data in the local file system (e.g. ext3)
Stores metadata of a block (e.g. CRC)
Serves data and metadata to Clients
 Block Report
Periodically sends a report of all existing blocks to the NameNode
 Facilitates Pipelining of Data
Forwards data to other specified DataNodes

160
BLOCK PLACEMENT

 Current Strategy
One replica on local node
Second replica on a remote rack
Third replica on same remote rack
Additional replicas are randomly placed
 Clients read from nearest replicas
 Would like to make this policy pluggable

161
HEARTBEATS

 DataNodes send heartbeats to the NameNode


Once every 3 seconds
 NameNode uses heartbeats to detect DataNode failure

162
REPLICATION ENGINE

 NameNode detects DataNode failures


Chooses new DataNodes for new replicas
Balances disk usage
Balances communication traffic to DataNodes

163
DATA CORRECTNESS

 Use Checksums to validate data


Use CRC32
 File Creation
Client computes checksum per 512 bytes
DataNode stores the checksum
 File access
Client retrieves the data and checksum from DataNode
If validation fails, the Client tries other replicas
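
The checksum itself can be computed with java.util.zip.CRC32. A sketch of per-512-byte checksumming in the spirit of the description above (not the actual HDFS client code):

import java.util.zip.CRC32;

class ChecksumDemo {
    // Compute one CRC32 checksum per 512-byte chunk, as the client does on file creation;
    // on read, each value is compared with the checksum stored by the DataNode.
    static long[] checksums(byte[] data) {
        int chunks = (data.length + 511) / 512;
        long[] sums = new long[chunks];
        CRC32 crc = new CRC32();
        for (int i = 0; i < chunks; i++) {
            crc.reset();
            int off = i * 512;
            crc.update(data, off, Math.min(512, data.length - off));
            sums[i] = crc.getValue();
        }
        return sums;
    }
}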

164
NAMENODE FAILURE

 A single point of failure


 Transaction Log stored in multiple directories
A directory on the local file system
A directory on a remote file system (NFS/CIFS)
 Need to develop a real HA solution

165
DATA PIPELINING

 Client retrieves a list of DataNodes on which to place replicas of a block


 Client writes block to the first DataNode
 The first DataNode forwards the data to the next node in the Pipeline
 When all replicas are written, the Client moves on to write the next block in file

166
REBALANCER

 Goal: % disk full on DataNodes should be similar


Usually run when new DataNodes are added
Cluster is online when Rebalancer is active
Rebalancer is throttled to avoid network congestion
Command line tool

167
SECONDARY NAMENODE

 Copies FsImage and Transaction Log from Namenode to a temporary directory


 Merges FSImage and Transaction Log into a new FSImage in temporary directory
 Uploads new FSImage to the NameNode
Transaction Log on NameNode is purged

168
USER INTERFACE

 Commands for the HDFS user:


hadoop dfs -mkdir /foodir
hadoop dfs -cat /foodir/myfile.txt
hadoop dfs -rm /foodir/myfile.txt
 Commands for HDFS Administrator
hadoop dfsadmin -report
hadoop dfsadmin -decommission datanodename
 Web Interface
http://host:port/dfshealth.jsp

169
MAP REDUCE

170
WHAT IS MAP REDUCE?

 MapReduce is a programming model for efficient distributed computing based on Java.


 The MapReduce algorithm contains two important tasks, namely Map and Reduce.
 Map takes a set of data and converts it into another set of data, where individual elements are
broken down into tuples.
 It works like a Unix pipeline
cat input | grep | sort | uniq -c | cat > output
Input | Map | Shuffle & Sort | Reduce | Output

 Secondly, the reduce task takes the output from a map as an input and combines those
data tuples into a smaller set of tuples.

 MapReduce is a programming model Google has used successfully in processing its “big-
data” sets (~20 PB per day)

171
WHAT IS MAP REDUCE?

 Efficiency from
Streaming through data, reducing seeks
Pipelining
 A good fit for a lot of applications
Log processing
Web index building
 Users specify the computation in terms of a map and a reduce function,
 Underlying runtime system automatically parallelizes the computation across large-scale
clusters of machines, and
 Underlying system also handles machine failures, efficient communications, and performance
issues.

172
MAPREDUCE-DATAFLOW

173
MAP STAGE

The map or mapper’s job is to process the input data.

 Generally the input data is in the form of file or directory and is stored in the Hadoop file
system (HDFS).

 The input file is passed to the mapper function line by line.

 The mapper processes the data and creates several small chunks of data.

174
REDUCE STAGE

 This stage is the combination of the Shuffle stage and the Reduce stage.

 The Reducer’s job is to process the data that comes from the mapper.

 After processing, it produces a new set of output, which will be stored in the HDFS.

175
MAPREDUCE-FEATURES

 Fine grained Map and Reduce tasks


Improved load balancing
Faster recovery from failed tasks
 Automatic re-execution on failure
In a large cluster, some nodes are always slow or flaky
Framework re-executes failed tasks
 Locality optimizations
With large data, bandwidth to data is a problem
Map-Reduce + HDFS is a very effective solution
Map-Reduce queries HDFS for locations of input data
Map tasks are scheduled close to the inputs when possible

176
WORD COUNT EXAMPLE

 Mapper
Input: value: lines of text of input
Output: key: word, value: 1
 Reducer
Input: key: word, value: set of counts
Output: key: word, value: sum
 Launching program
Defines this job
Submits job to cluster
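
The canonical Hadoop WordCount in Java (per the Hadoop MapReduce tutorial) implements exactly this mapper/reducer/launcher split:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: input is a line of text; output is (word, 1) for each word
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: input is (word, set of counts); output is (word, sum)
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            context.write(key, new IntWritable(sum));
        }
    }

    // Launching program: defines this job and submits it to the cluster
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}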

177
MAPREDUCE - WORD COUNT DATAFLOW

178
MAPREDUCE

179
SOME HANDY TOOLS

 Partitioners
 Combiners
 Compression
 Counters
 Speculation
 Zero Reduces
 Distributed File Cache
 Tool

180
HADOOP-RELATED SUBPROJECTS

 Pig
High-level language for data analysis
 HBase
Table storage for semi-structured data
 Zookeeper
Coordinating distributed applications
 Hive
SQL-like Query language and Metastore
 Mahout
Machine learning

181
WHAT IS HIVE

 Hive is a data warehouse infrastructure tool to process structured data in Hadoop.

 Initially Hive was developed by Facebook, later the Apache Software Foundation took it up
and developed it further as an open source under the name Apache Hive.

182
HIVE IS NOT

 A relational database
 A design for OnLine Transaction Processing (OLTP)
 A language for real-time queries and row-level updates

183
FEATURES OF HIVE

•It stores schemas in a database and processed data in HDFS.


•It is designed for OLAP.
•It provides an SQL-type query language called HiveQL or HQL.
•It is familiar, fast, scalable, and extensible.
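
Because HiveQL is SQL-like, Hive can also be queried from Java through its JDBC driver (HiveServer2). A sketch, assuming the hive-jdbc dependency on the classpath and a hypothetical docs table; host, port, and credentials are assumptions:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class HiveDemo {
    public static void main(String[] args) throws SQLException {
        // HiveServer2 JDBC URL; older setups may need
        // Class.forName("org.apache.hive.jdbc.HiveDriver") first
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "user", "");
             Statement stmt = conn.createStatement();
             // HiveQL looks like SQL; Hive compiles it into MapReduce jobs
             ResultSet rs = stmt.executeQuery(
                     "SELECT word, count(*) FROM docs GROUP BY word")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}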

184
ARCHITECTURE OF HIVE

 The following component diagram depicts the architecture of Hive:

185
ARCHITECTURE OF HIVE

Units and its operations

User Interface

Hive is data warehouse infrastructure software that creates interaction between the user
and HDFS.

The user interfaces that Hive supports are Hive Web UI, Hive command line, and Hive
HD Insight (In Windows server).

Meta Store

Hive chooses respective database servers to store the schema or Metadata of tables,
databases, columns in a table, their data types, and HDFS mapping.

186
ARCHITECTURE OF HIVE

HiveQL Process Engine

 HiveQL is similar to SQL for querying on schema info on the Metastore.

 It is one of the replacements for the traditional approach of writing MapReduce programs.

Execution Engine

 The conjunction of the HiveQL process engine and MapReduce is the Hive execution engine.

 The execution engine processes the query and generates results the same as MapReduce
results.
HDFS or HBASE

 The Hadoop distributed file system (HDFS) or HBase are the data storage techniques used to
store data in the file system.

187
Q&A

1. What is the difference between NameNode and DataNode in Hadoop?


The NameNode stores metadata (number of blocks, which DataNode on which rack stores them, etc.),
whereas the DataNode stores the actual data.
2. Which is the default input format defined in Hadoop?
a. SequenceFileInputFormat
b. ByteInputFormat
c. KeyValueInputFormat
d. TextInputFormat
3. Which of the following is not an input format in Hadoop?
a. TextInputFormat
b. ByteInputFormat
c. SequenceFileInputFormat
d. KeyValueInputFormat
188
Q&A

4. Which of the following is a valid flow in Hadoop?


a. Input -> Reducer -> Mapper-> Combiner -> Output
b. Input -> Mapper-> Reducer -> Combiner -> Output
c. Input -> Mapper-> Combiner -> Reducer -> Output
d. Input -> Reducer -> Combiner -> Mapper-> Output
5. How many instances of JobTracker can run on a Hadoop cluster?
a. 1
b. 2
c. 3
d. 4

189
UML

 UML: Unified Modeling Language


 UML has many components to graphically model different aspects of an entire software
system
 UML class diagrams correspond to E-R diagrams, but with several differences.

190
ER VS. UML CLASS DIAGRAMS

*Note reversal of position in cardinality constraint depiction


191
ER VS. UML CLASS DIAGRAMS

ER Diagram Notation Equivalent in UML

*Generalization can use merged or separate arrows independent


of disjoint/overlapping
192
UML CLASS DIAGRAMS (CONT.)

 Binary relationship sets are represented in UML by just drawing a line connecting the entity sets.
The relationship set name is written adjacent to the line.
 The role played by an entity set in a relationship set may also be specified by writing the role
name on the line, adjacent to the entity set.
 The relationship set name may alternatively be written in a box, along with attributes of the
relationship set, and the box is connected, using a dotted line, to the line depicting the
relationship set.

193
