2
COURSE OUTCOMES
3
SESSION OBJECTIVES
4
OUTCOMES
5
AGENDA
Design Process
Modeling
Constraints
E-R Diagram
Design Issues
Weak Entity Sets
Extended E-R Features
Design of the Bank Database
Reduction to Relation Schemas
Database Design
UML
6
DESIGN PHASES
The initial phase of database design is to characterize fully the data needs of the prospective
database users.
Next, the designer chooses a data model and, by applying the concepts of the chosen data model,
translates these requirements into a conceptual schema of the database.
A fully developed conceptual schema also indicates the functional requirements of the enterprise. In
a “specification of functional requirements”, users describe the kinds of operations (or transactions)
that will be performed on the data.
7
DESIGN PHASES (CONT.)
Logical Design – Deciding on the database schema. Database design requires that we
find a “good” collection of relation schemas.
Business decision – What attributes should we record in the database?
Computer Science decision – What relation schemas should we have and how should
the attributes be distributed among the various relation schemas?
8
DESIGN APPROACHES
Normalization Theory
Formalize what designs are bad, and test for them
9
INTRODUCTION - DATA MODELING
Process of creating a data model for an information system by applying formal data modeling
techniques.
Process used to define and analyze data requirements needed to support the business
processes.
Therefore, the process of data modeling involves professional data modelers working closely
with business stakeholders, as well as potential users of the information system.
10
WHAT IS A DATA MODEL?
A data model is a collection of conceptual tools for describing data, data relationships, data
semantics, and consistency constraints.
A data model is a conceptual representation of the data structures required for a database and is very
powerful in expressing and communicating the business requirements.
A data model visually represents the nature of data, business rules governing the data, and how it
will be organized in the database.
A data model provides a way to describe the design of a database at the physical, logical and
view levels.
There are three different types of data models produced while progressing from requirements to
the actual database to be used for the information system.
11
DIFFERENT DATA MODELS
12
A DATA MODEL CONSISTS OF ENTITIES RELATED TO EACH OTHER ON A
DIAGRAM:
Entity: A real-world thing or an interaction between two or more real-world things.
Attribute: The atomic pieces of information that we need to know about entities.
Relationship: How entities depend on each other, in terms of why they depend on each other
(the relationship) and how many instances take part (the cardinality of the relationship).
13
EXAMPLE:
Given that …
“Customer” is an entity.
“Product” is an entity.
For a “Customer” we need to know their “customer number” attribute and “name” attribute.
For a “Product” we need to know the “product name” attribute and “price” attribute.
“Sale” is an entity that is used to record the interaction of “Customer” and “Product”.
14
HERE IS THE DIAGRAM THAT ENCAPSULATES THESE RULES:
15
NOTES
16
• If we want to know the price of a Sale, we can ‘find’ it by using the “Product
Code” on the instance of “Sale” we are interested in and look up the
corresponding “Price” on the “Product” entity with the matching “Product
Code”.
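The lookup described in this note can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions: the table names, column names, and sample rows below are hypothetical and are not taken from the diagram.

```python
import sqlite3

# Hypothetical schema loosely following the Customer / Product / Sale example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (customer_number INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE product  (product_code TEXT PRIMARY KEY, product_name TEXT NOT NULL, price REAL NOT NULL);
CREATE TABLE sale (
    sale_id         INTEGER PRIMARY KEY,
    customer_number INTEGER NOT NULL REFERENCES customer(customer_number),
    product_code    TEXT    NOT NULL REFERENCES product(product_code)
);
INSERT INTO customer VALUES (1, 'Asha');
INSERT INTO product  VALUES ('P100', 'Notebook', 2.50);
INSERT INTO sale     VALUES (10, 1, 'P100');
""")

def price_of_sale(sale_id):
    """Follow the sale's product code to the matching product row and return its price."""
    row = conn.execute(
        "SELECT p.price FROM sale s "
        "JOIN product p ON p.product_code = s.product_code "
        "WHERE s.sale_id = ?",
        (sale_id,),
    ).fetchone()
    return None if row is None else row[0]
```

The price is stored once, on the product, and a sale reaches it through the product code, so the price is not repeated on every sale.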
17
TYPES OF DATA MODELS
18
ENTITY-RELATIONSHIP MODEL
Entity: UniversityStudent
Attributes: StudentName, StudentDOB, StudentAge
The purpose of a data model is to describe the concepts relevant to a domain, the relationships
between those concepts, and information associated with them.
20
Used to model data in a standard, consistent, predictable manner in order to manage it as a
resource.
To have a clear picture of the base data that your business needs.
To identify missing and redundant base data.
To establish a baseline for communication across functional boundaries within your
organization.
Provides a basis for defining business rules.
Makes it cheaper, easier, and faster to upgrade your IT solutions.
21
ENTITY RELATIONSHIP DIAGRAM (ERD)
22
OBJECTIVES
Define terms related to entity relationship modeling, including entity, entity instance, attribute,
relationship, cardinality, and primary key.
Describe the entity modeling process.
Discuss how to draw an entity relationship diagram.
Describe how to recognize entities, attributes, relationships, and cardinalities.
23
DATABASE MODEL
Database systems are often modeled using an Entity Relationship (ER) diagram as the
"blueprint" from which the actual database is built. The diagram is the output of the design phase.
24
ENTITY RELATIONSHIP DIAGRAM (ERD)
25
PURPOSES OF ERD
26
COMPONENTS OF AN ERD
27
CLASSIFICATION OF RELATIONSHIP
Optional Relationship
An Employee may or may not be assigned to a Department
A Patient may or may not be assigned to a Bed
Mandatory Relationship
Every Course must be taught by at least one Teacher
Every Mother must have at least one Child
28
CARDINALITY CONSTRAINTS
Express the number of entities to which another entity can be associated via a relationship
set.
Cardinality Constraints - the number of instances of one entity that can or must be associated
with each instance of another entity.
Minimum Cardinality
If zero, then optional
If one or more, then mandatory
Maximum Cardinality
The maximum number
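As a rough illustration of minimum and maximum cardinality, the sketch below counts how many partners each entity has in a relationship set and reports the entities that fall outside the allowed range. The function name and the sample data are assumptions made for this example.

```python
from collections import Counter

def check_cardinality(pairs, left_entities, minimum, maximum=None):
    """Return the left-side entities whose number of partners violates [minimum, maximum]."""
    counts = Counter(left for left, _ in pairs)
    violations = []
    for entity in left_entities:
        n = counts[entity]  # 0 for entities that never appear in the relationship set
        if n < minimum or (maximum is not None and n > maximum):
            violations.append(entity)
    return violations

# Hypothetical data: each Course must be taught by at least one Teacher (minimum 1).
teaches = [("Databases", "Rao"), ("Databases", "Iyer")]
courses = ["Databases", "Networks"]
```

With a minimum of 1 the relationship is mandatory, so "Networks" (no teacher yet) is reported. With a minimum of 0 it would be optional and accepted.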
29
CARDINALITY CONSTRAINTS (CONTD.)
For a binary relationship set the mapping cardinality must be one of the following types:
One to one
A Manager heads one Department, and a Department is headed by one Manager
One to many (or many to one)
An Employee works in one Department, and one Department has many Employees
Many to many
A Teacher teaches many Students, and a Student is taught by many Teachers
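The three mapping cardinalities can be told apart by inspecting the pairs in a relationship set. The following is a minimal sketch, assuming the relationship set is given as a list of (left, right) pairs. It classifies the data actually observed, which is not the same as the constraint a designer intends.

```python
def mapping_cardinality(pairs):
    """Classify a binary relationship set given as (left, right) pairs."""
    left_partners, right_partners = {}, {}
    for left, right in set(pairs):
        left_partners.setdefault(left, set()).add(right)
        right_partners.setdefault(right, set()).add(left)
    # Does some left entity have several partners? Does some right entity?
    left_many = any(len(p) > 1 for p in left_partners.values())
    right_many = any(len(p) > 1 for p in right_partners.values())
    if left_many and right_many:
        return "many to many"
    if left_many:
        return "one to many"
    if right_many:
        return "many to one"
    return "one to one"

# Hypothetical sample data for the three cases named on the slide.
heads = [("Manager1", "Dept1"), ("Manager2", "Dept2")]
works_in = [("Emp1", "Dept1"), ("Emp2", "Dept1")]
teaches = [("T1", "S1"), ("T1", "S2"), ("T2", "S1")]
```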
30
GENERAL STEPS TO CREATE AN ERD
31
STEPS IN BUILDING AN ERD
32
DEVELOPING AN ERD
33
A SIMPLE EXAMPLE
34
IDENTIFY ENTITIES
One approach to this is to work through the information and highlight those words which you
think correspond to entities.
A company has several departments. Each department has a supervisor and at least one
employee. Employees must be assigned to at least one, but possibly more departments. At
least one employee is assigned to a project, but an employee may be on vacation and not
assigned to any projects. The important data fields are the names of the departments, projects,
supervisors and employees, as well as the supervisor and employee number and a unique
project number.
35
FIND RELATIONSHIPS
36
FIND RELATIONSHIPS (CONTD.)
Go through each cell and decide whether or not there is an association. For
example, the first cell on the second row is used to indicate if there is a
relationship between the entity "Employee" and the entity "Department".
37
IDENTIFIED RELATIONSHIPS
Names placed in the cells are meant to capture or describe the relationships, so you can use them
like this:
A Department is assigned an employee
A Department is run by a supervisor
An employee belongs to a department
An employee works on a project
A supervisor runs a department
A project uses an employee
38
DRAW ROUGH ERD
39
DRAWING ROUGH ERD (CONTD.)
40
DRAWING ROUGH ERD (CONTD.)
41
DRAWING ROUGH ERD (CONTD.)
42
FILL IN CARDINALITY
Supervisor and Department
Each department has one supervisor.
Each supervisor has one department.
Department and Employee
Each employee can belong to one or more departments.
Each department must have one or more employees.
Employee and Project
Each project must have one or more employees.
Each employee can have 0 or more projects.
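One possible way to carry these cardinalities into tables is sketched below with Python's sqlite3 module. The table and column names are assumptions. The one-to-one link is modeled with a UNIQUE foreign key, and each many-to-many link becomes a junction table. The "must have one or more" minimums are not enforced by this schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE supervisor (supervisor_number INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE department (
    name TEXT PRIMARY KEY,
    -- NOT NULL: every department has a supervisor; UNIQUE: a supervisor runs only one department
    supervisor_number INTEGER NOT NULL UNIQUE REFERENCES supervisor(supervisor_number)
);
CREATE TABLE employee (employee_number INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE project  (project_number  INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- many-to-many relationships become junction tables
CREATE TABLE department_employee (
    department_name TEXT    NOT NULL REFERENCES department(name),
    employee_number INTEGER NOT NULL REFERENCES employee(employee_number),
    PRIMARY KEY (department_name, employee_number)
);
CREATE TABLE project_employee (
    project_number  INTEGER NOT NULL REFERENCES project(project_number),
    employee_number INTEGER NOT NULL REFERENCES employee(employee_number),
    PRIMARY KEY (project_number, employee_number)
);
""")

def second_department_for_supervisor_rejected():
    """Try to give one supervisor two departments; the UNIQUE constraint should refuse."""
    conn.execute("INSERT INTO supervisor VALUES (1, 'Meena')")
    conn.execute("INSERT INTO department VALUES ('Sales', 1)")
    try:
        conn.execute("INSERT INTO department VALUES ('Support', 1)")
    except sqlite3.IntegrityError:
        return True
    return False
```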
43
FILL IN CARDINALITY (CONTD.)
44
CARDINALITY NOTATION
45
CARDINALITY EXAMPLES
46
CARDINALITY EXAMPLES
47
ERD WITH CARDINALITY
48
EXAMPLES
49
ERD FOR COURSE ENROLLMENT
50
ERD FOR COURSE REGISTRATION
51
ROUGH ERD PLUS PRIMARY KEYS
52
IDENTIFY ATTRIBUTES
In this step we try to identify and name all the attributes essential to the system we are studying
without trying to match them to particular entities.
The best way to do this is to study the forms, files and reports currently kept by the users of the
system and circle each data item on the paper copy.
Cross out those which will not be transferred to the new system, extraneous items such as
signatures, and constant information which is the same for all instances of the form (e.g. your
company name and address). The remaining circled items should represent the attributes you need.
You should always verify these with your system users. (Sometimes forms or reports are out of date.)
The only attributes indicated are the names of the departments, projects, supervisors and
employees, as well as the supervisor and employee number and a unique project number.
53
MAP ATTRIBUTES
For each attribute we need to match it with exactly one entity. Often it seems like
an attribute should go with more than one entity (e.g. Name). In this case you
need to add a modifier to the attribute name to make it unique (e.g. Customer
Name, Employee Name, etc.) or determine which entity an attribute "best"
describes.
If you have attributes left over without corresponding entities, you may have
missed an entity and its corresponding relationships. Identify these missed entities
and add them to the relationship matrix now.
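The mapping step can be pictured as a table from attribute to exactly one entity, plus a check for leftovers. The sketch below uses the attributes of the department and project example. The dictionary layout and the helper name are assumptions.

```python
# Each attribute is matched with exactly one entity; "Name" carries a modifier to stay unique.
attribute_map = {
    "Department Name": "Department",
    "Supervisor Number": "Supervisor",
    "Supervisor Name": "Supervisor",
    "Employee Number": "Employee",
    "Employee Name": "Employee",
    "Project Number": "Project",
    "Project Name": "Project",
}
entities = {"Department", "Supervisor", "Employee", "Project"}

def unmapped(attributes, mapping, known_entities):
    """Return attributes that are not matched to a known entity (a hint that an entity was missed)."""
    return [a for a in attributes if mapping.get(a) not in known_entities]
```

An attribute such as a hypothetical "Budget" would come back from `unmapped`, signalling that an entity or relationship may be missing from the matrix.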
54
MAP ATTRIBUTES (CONTD.)
55
DRAW FULLY ATTRIBUTED ERD
56
CHECK ERD RESULTS
Look at your diagram from the point of view of a system owner or user. Is everything clear?
Check through the cardinality pairs.
Also, look over the list of attributes associated with each entity to see if anything has been omitted.
57
SUMMARY
58
SUMMARY
Several kinds of integrity constraints can be expressed in the ER model: key constraints,
participation constraints, and overlap/covering constraints for ISA hierarchies. Some foreign key
constraints are also implicit in the definition of a relationship set.
Some of these constraints can be expressed in SQL only if we use general CHECK constraints or
assertions.
Some constraints (notably, functional dependencies) cannot be expressed in the ER model.
Constraints play an important role in determining the best database design for an enterprise.
ER design is subjective. There are often many ways to model a given scenario! Analyzing
alternatives can be tricky, especially for a large enterprise. Common choices include:
Entity vs. attribute, entity vs. relationship, binary or n-ary relationship, whether or not to use
ISA hierarchies, and whether or not to use aggregation.
Ensuring good database design: resulting relational schema should be analyzed and refined
further. FD information and normalization techniques are especially useful.
59
JAVA DATABASE CONNECTIVITY (JDBC)
60
Java Database Connectivity (JDBC)
By A. SandanaKaruppan, AP/IT
62
COURSE OUTCOMES
63
SESSION OBJECTIVES
64
OUTCOMES
65
AGENDA
66
INTRODUCTION
Database: collection of data
DBMS: database management system; storing and organizing data
SQL: Structured Query Language; relational database
JDBC: Java Database Connectivity; JDBC driver
67
JDBC
68
[Figure: Java application, JDBC API, ODBC driver, Database]
69
ODBC
70
TYPES OF DRIVERS
[Figure: driver Types 2, 3, and 4 and the database; labels: Native C/C++ API, Local API, Network API]
72
JDBC DRIVERS
73
COMMON JDBC COMPONENTS
DriverManager:
Driver:
74
COMMON JDBC COMPONENTS
Connection:
This interface provides all the methods for contacting a database.
The connection object represents the communication context, i.e., all communication with the database is
through the connection object only.
Statement:
You use objects created from this interface to submit the SQL statements to the database.
ResultSet:
These objects hold data retrieved from a database after you execute an SQL query using
Statement objects.
It acts as an iterator to allow you to move through its data.
SQLException:
This class handles any errors that occur in a database application.
75
TYPE 1: JDBC-ODBC BRIDGE DRIVER
76
TYPE 1 DRIVER
77
TYPE 2 DRIVER
In a Type 2 driver, JDBC API calls are converted into native C/C++ API calls, which are unique to
the database.
These drivers are typically provided by the database vendors and used in the same manner as the
JDBC-ODBC Bridge.
The vendor-specific driver must be installed on each client machine.
The Oracle Call Interface (OCI) driver is an example of a Type 2 driver.
78
TYPE 2 DRIVER
79
TYPE 3: JDBC-NET PURE JAVA
80
TYPE 4: 100% PURE JAVA
81
TYPE 4: 100% PURE JAVA
• In a Type 4 driver, a pure Java-based driver communicates directly with the vendor's database
through a socket connection.
• This is typically the highest-performance driver available for the database and is usually provided by the
vendor itself.
• This kind of driver is extremely flexible: you don't need to install special software on the client or
server. Further, these drivers can be downloaded dynamically.
82
TYPE 4 DRIVER
83
The following steps are required to create a new database using a JDBC application:
84
STORED PROCEDURES
85
STORED PROCEDURE LANGUAGE
86
Normal Database
87
Applications using stored procedures
88
Writing Stored Procedures
89
Some Valid SQL Procedure Body Statements
CASE statement
FOR statement
GOTO statement
IF statement
ITERATE statement
RETURN statement
WHILE statement
90
Invoking Procedures
A stored procedure stored at the location of the database can be invoked by using the SQL CALL
statement.
91
CONDITIONAL STATEMENTS:
IF <condition> THEN
    <statement(s)>
ELSE
    <statement(s)>
END IF;
LOOPS:
LOOP
    ……
    EXIT WHEN <condition>;
    ……
END LOOP;
92
SUMMARY
JDBC is an API specification developed by Sun Microsystems that defines a uniform interface for
accessing different relational databases.
The primary function of the JDBC API is to allow the developer to issue SQL statements and
process the results in a consistent, database-independent manner.
The JDBC API uses a driver manager and database-specific drivers to provide transparent
connectivity to heterogeneous databases.
The JDBC driver manager ensures that the correct driver is used to access each data source.
The driver manager is capable of supporting multiple concurrent drivers connected to multiple
heterogeneous databases.
A JDBC driver translates standard JDBC calls into a network protocol or client API call that
facilitates communication with the database. This translation provides JDBC applications with
database independence.
93
SUMMARY
94
BIG DATA
95
Trends in Big Data systems - NoSQL
By A. SandanaKaruppan, AP/IT
97
COURSE OUTCOMES
98
SESSION OBJECTIVES
99
OUTCOMES
100
AGENDA
101
WHAT IS BIG DATA?
Big data is a massive volume of both structured and unstructured data that is so large it is difficult
to process using traditional database and software techniques.
In most enterprise scenarios the volume of data is too big or it moves too fast or it exceeds current
processing capacity.
Despite these problems, big data has the potential to help companies improve operations and
make faster, more intelligent decisions.
102
WHY BIG DATA
103
BIG DATA EVERYWHERE!
104
HOW MUCH DATA?
105
UNITS
BIT
NIBBLE
BYTE/OCTET (B)
KILOBYTE (KB)
MEGABYTE (MB)
GIGABYTE (GB)
TERABYTE (TB)
PETABYTE (PB)
EXABYTE (EB)
ZETTABYTE (ZB)
YOTTABYTE (YB)
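The ladder of units can be made concrete with a small conversion helper. This sketch assumes decimal steps of 1000 between units. Binary steps of 1024 are also in common use, so the `base` parameter is left adjustable. The function name is an assumption.

```python
BITS_PER_NIBBLE = 4
BITS_PER_BYTE = 8
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def humanize(num_bytes, base=1000):
    """Express a byte count in the largest unit that keeps the value at or above 1."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value fits, or at the last unit available.
        if value < base or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= base
```

For example, `humanize(1500)` gives a value in kilobytes, and a count of two quadrillion bytes is reported in petabytes.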
106
TYPES OF DATA
Structured Data
Unstructured Data.
107
STRUCTURED DATA
Structured data is all data that can be stored in an SQL database, in tables with rows and columns.
It has relational keys and can be easily mapped into pre-designed fields.
Today, it is the most commonly processed kind of data in development and the simplest way to manage
information.
108
SEMI STRUCTURED DATA
Semi-structured data is information that doesn’t reside in a relational database but that does
have some organizational properties that make it easier to analyze.
Examples of semi-structured data: XML and JSON (JavaScript Object Notation) documents.
Like structured data, semi-structured data represents only a small part of all data (5 to 10%).
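A short JSON example shows what "some organizational properties" means in practice: every record is labeled with field names, yet the records need not share the same fields. The sample records below are hypothetical.

```python
import json

# Both records are valid JSON, but they do not share a fixed set of fields.
raw = """[
  {"id": 1, "name": "Asha", "email": "asha@example.com"},
  {"id": 2, "name": "Ravi", "phones": ["555-0100", "555-0101"]}
]"""

records = json.loads(raw)

def field_names(docs):
    """Collect every field that appears in at least one record."""
    return sorted({key for doc in docs for key in doc})
```

A relational table would need a column for each of these fields in every row, while here each record simply carries the fields it has.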
109
UNSTRUCTURED DATA
Examples include e-mail messages, word processing documents, videos, photos, audio files,
presentations, web pages and many other kinds of business documents.
Note that while these sorts of files may have an internal structure, they are still considered
"unstructured" because the data they contain doesn't fit neatly in a database.
Unstructured data is everywhere. In fact, most individuals and organizations conduct their lives
around unstructured data.
110
HERE ARE SOME EXAMPLES OF MACHINE-GENERATED AND HUMAN-GENERATED
UNSTRUCTURED DATA:
Satellite images
Scientific data
Photographs and video
Social media data
Mobile data
Website content
111
WHAT TO DO WITH THESE DATA?
112
EXAMPLES OF BIG DATA
IT log analytics
IT solutions and IT departments generate an enormous quantity of logs and trace data.
In the absence of a Big Data solution, much of this data must go unexamined: organizations
simply don't have the manpower or resources to churn through all that information by hand, let
alone in real time.
With a Big Data solution in place, however, those logs and trace data can be put to good use.
Within this list of Big Data application examples, IT log analytics is the most broadly applicable.
113
APPLICATIONS FOR BIG DATA ANALYTICS
Multi-channel sales
Smarter Healthcare
Finance
Log Analysis
114
NOSQL?
115
NOSQL?
116
WHY NOSQL
Dynamic Schemas
Auto-sharding
Replication
Horizontally Scalable
* Some of these operations can be achieved with enterprise-class RDBMS software, but at very high cost.
117
DEFINE NOSQL
A NoSQL database provides a mechanism for storage and retrieval of data that is modeled by
means other than the tabular relations used in relational databases (RDBMS).
It is designed for distributed data stores with very large-scale data storage needs (for
example, Google or Facebook, which collect terabits of data every day for their users).
118
TYPES OF NOSQL DATABASES
NoSQL Databases
Document Stores
Graph Databases
Key-Value Stores
Columnar Databases
119
DOCUMENT ORIENTED DATABASES
At a collection level, this allows for putting together a diverse set of documents into a
single collection.
Document databases allow indexing of documents on the basis of not only their primary
identifier but also their properties.
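The idea of indexing documents by a property as well as by primary identifier can be sketched with two dictionaries. This toy class is an assumption made for illustration and does not reflect how MongoDB or CouchDB are implemented.

```python
from collections import defaultdict

class Collection:
    """Toy document collection indexed by primary identifier and by one chosen property."""

    def __init__(self, indexed_property):
        self.by_id = {}                       # primary identifier -> document
        self.indexed_property = indexed_property
        self.by_property = defaultdict(set)   # property value -> set of identifiers

    def insert(self, doc_id, document):
        self.by_id[doc_id] = document
        # Documents in one collection may differ; only those with the property are indexed.
        if self.indexed_property in document:
            self.by_property[document[self.indexed_property]].add(doc_id)

    def find_by_property(self, value):
        return [self.by_id[i] for i in sorted(self.by_property.get(value, ()))]

books = Collection("author")
books.insert(1, {"title": "A", "author": "X"})
books.insert(2, {"title": "B", "author": "X", "year": 2012})
books.insert(3, {"name": "a document with different fields"})
```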
120
CONT…
Different open-source document databases are available today but the most prominent among
the available options are MongoDB and CouchDB.
In fact, MongoDB has become one of the most popular NoSQL databases.
121
DOCUMENT ORIENTED DATABASES
122
GRAPH BASED DATABASES
A graph database uses graph structures with nodes, edges, and properties to represent and
store data.
By definition, a graph database is any storage system that provides index-free adjacency. This
means that every element contains a direct pointer to its adjacent element and no index lookups
are necessary.
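Index-free adjacency can be pictured as nodes that hold direct references to their neighbours, so a traversal follows references instead of consulting a lookup table. The class and function names in this sketch are assumptions.

```python
class Node:
    """A graph node holding properties and direct references to adjacent nodes."""

    def __init__(self, name, **properties):
        self.name = name
        self.properties = properties
        self.edges = []  # list of (label, target Node) pairs

    def connect(self, label, target):
        self.edges.append((label, target))

def reachable(start):
    """Names of all nodes reachable from start by following edge references."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node.name not in seen:
            seen.add(node.name)
            stack.extend(target for _, target in node.edges)
    return seen

# Hypothetical data: three people linked by "knows" edges.
a, b, c = Node("a", age=30), Node("b"), Node("c")
a.connect("knows", b)
b.connect("knows", c)
```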
123
CONT…
General graph databases that can store any graph are distinct from specialized graph databases
such as triple-stores and network databases. Indexes are used for traversing the graph.
124
GRAPH DATABASE
125
COLUMN BASED DATABASES
It avoids consuming space when storing nulls by simply not storing a column when a value doesn’t
exist for that column.
126
CONT…
Each unit of data can be thought of as a set of key/value pairs, where the unit itself is identified
with the help of a primary identifier, often referred to as the primary key.
127
KEY VALUE DATABASES
The key of a key/value pair is a unique value in the set and can be easily looked up to access
the data.
Key/value stores are of varied types: some keep the data in memory and some provide the
capability to persist the data to disk.
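Both variants, in-memory only and persisted to disk, can be sketched in one small class. This is a toy under assumptions: the class name and the JSON file format are chosen for the example, and the whole store is rewritten on every put, which a real store would avoid.

```python
import json
import os
import tempfile

class KeyValueStore:
    """Toy key/value store: held in memory, optionally persisted to a JSON file."""

    def __init__(self, path=None):
        self.path = path
        self.data = {}
        if path and os.path.exists(path):
            with open(path) as f:
                self.data = json.load(f)  # reload what an earlier instance persisted

    def put(self, key, value):
        self.data[key] = value  # keys are unique: a second put overwrites the first
        if self.path:
            with open(self.path, "w") as f:
                json.dump(self.data, f)

    def get(self, key, default=None):
        return self.data.get(key, default)
```

Calling `KeyValueStore()` with no path keeps everything in memory, while passing a file path makes the data survive a restart.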
128
KEY VALUE DATABASES
129
KEY VALUE DATABASES
130
BENEFITS OF NOSQL OVER RDBMS
Schema Less
NoSQL databases, being schema-less, do not define any strict data structure.
NoSQL databases tend to grow dynamically with changing requirements. They
can handle structured, semi-structured, and unstructured data.
131
BENEFITS (CONT…)
Scales Horizontally:
NoSQL scales horizontally by adding more servers and using concepts of sharding and
replication.
This behavior of NoSQL fits with the cloud computing services such as Amazon Web
Services (AWS) which allows you to handle virtual servers which can be expanded
horizontally on demand.
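Sharding can be illustrated by hashing each key to pick a server. The sketch below uses simple modulo placement as an assumption for the example. With this scheme most keys move when the number of shards changes, which is why real systems often use consistent hashing or range-based placement instead.

```python
import hashlib

def shard_for(key, shard_count):
    """Pick a shard by hashing the key; the same key always maps to the same shard."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count

def distribute(keys, shard_count):
    """Group keys by the shard they are assigned to."""
    shards = {i: [] for i in range(shard_count)}
    for key in keys:
        shards[shard_for(key, shard_count)].append(key)
    return shards
```

Adding servers raises `shard_count`, so the same data is spread over more machines.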
132
BENEFITS (CONT…)
Better Performance:
All the NoSQL databases claim to deliver better and faster performance as compared to
traditional RDBMS implementations.
133
SUMMARY
Big data is a massive volume of both structured and unstructured data that is so large it is difficult
to process using traditional database and software techniques.
Lots of data is being collected and warehoused.
Three concepts come with big data.
NoSQL is a class of non-relational database management systems, different from traditional relational
database management systems in some significant ways.
All the NoSQL databases claim to deliver better and faster performance as compared to
traditional RDBMS implementations.
134
CAP THEOREM
135
CAP
Consistency
Availability
Partition-tolerance
A distributed system can satisfy any two of these guarantees at the same time, but not all
three.
136
CAP THEOREM
Consistency
All the servers in the system will have the same data so anyone using the system will get the
same copy regardless of which server answers their request.
Availability
The system will always respond to a request (even if it's not the latest data or consistent across
the system or just a message saying the system isn't working)
Partition Tolerance
The system continues to operate as a whole even if individual servers fail or can't be reached.
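The trade-off can be made tangible with a toy model of two replicas that lose contact with each other. The class below and its "CP"/"AP" switch are assumptions invented for this sketch. During a partition, the CP setting refuses the write (giving up availability), while the AP setting accepts it locally (giving up consistency).

```python
class Replica:
    def __init__(self):
        self.value = None

class Cluster:
    """Two replicas; `partitioned` means they cannot reach each other."""

    def __init__(self, mode):
        self.mode = mode  # "CP" or "AP" (hypothetical switch for this sketch)
        self.a, self.b = Replica(), Replica()
        self.partitioned = False

    def write(self, replica, value):
        if self.partitioned:
            if self.mode == "CP":
                # Stay consistent by refusing the request: the system is not available.
                raise RuntimeError("unavailable: cannot replicate during partition")
            # Stay available by accepting locally: the replicas may now disagree.
            replica.value = value
            return
        self.a.value = self.b.value = value
```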
137
CAP THEOREM
C A
138
CAP THEOREM
A simple example:
Bob Dong
139
CAP THEOREM
A simple example:
Bob Dong
140
CAP THEOREM
A simple example:
Bob Dong
141
Credit: http://architects.dzone.com/articles/better-explaining-cap-theorem
142
CHOOSING AP
Credit: https://foundationdb.com/key-value-store/white-papers/the-cap-theorem
143
CHOOSING CP
Credit: https://foundationdb.com/key-value-store/white-papers/the-cap-theorem
144
Hadoop
145
Hadoop HDFS, MapReduce, Hive, and enhancements
By A. SandanaKaruppan, AP/IT
147
COURSE OUTCOMES
148
SESSION OBJECTIVES
149
OUTCOMES
150
AGENDA
Hadoop-Basics
HDFS
Goals
Architecture
Other functions
MapReduce
Basics
Word Count Example
Handy tools
Finding shortest path example
Introduction to HIVE.
151
HADOOP-BASICS
152
HADOOP (WHY)
153
WHAT IS HADOOP USED FOR ?
Searching (Yahoo)
Log Processing
Analytics(Facebook, LinkedIn)
Data Retention
154
GOALS OF HDFS
155
DISTRIBUTED FILE SYSTEM
156
HDFS ARCHITECTURE
157
FUNCTIONS OF A NAMENODE
158
NAMENODE METADATA
Metadata in Memory
The entire metadata is in main memory
No demand paging of metadata
Types of metadata
List of files
List of Blocks for each file
List of DataNodes for each block
File attributes, e.g. creation time, replication factor
A Transaction Log
Records file creations, file deletions etc
159
DATANODE
A Block Server
Stores data in the local file system (e.g. ext3)
Stores metadata of a block (e.g. CRC)
Serves data and metadata to Clients
Block Report
Periodically sends a report of all existing blocks to the NameNode
Facilitates Pipelining of Data
Forwards data to other specified DataNodes
160
BLOCK PLACEMENT
Current Strategy
One replica on local node
Second replica on a remote rack
Third replica on same remote rack
Additional replicas are randomly placed
Clients read from nearest replicas
Would like to make this policy pluggable
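The placement strategy listed above can be sketched as a small function. This is an illustration under assumptions, not HDFS code: the function name and the `racks` layout are invented here, and the sketch assumes at least one remote rack with two or more nodes.

```python
import random

def place_replicas(local_node, racks, replication=3, rng=random):
    """Choose nodes for one block. `racks` maps a rack name to its list of node names."""
    rack_of = {node: rack for rack, nodes in racks.items() for node in nodes}
    chosen = [local_node]                                    # first replica: local node
    remote_racks = [r for r in racks
                    if r != rack_of[local_node] and len(racks[r]) >= 2]
    remote = rng.choice(remote_racks)                        # pick one remote rack
    chosen += rng.sample(racks[remote], 2)                   # second and third: that same rack
    others = [n for n in rack_of if n not in chosen]
    chosen += rng.sample(others, max(0, replication - 3))    # additional replicas: random
    return chosen[:replication]

# Hypothetical cluster layout: three racks with two nodes each.
racks = {"r1": ["n1", "n2"], "r2": ["n3", "n4"], "r3": ["n5", "n6"]}
```

Keeping two replicas on one remote rack limits cross-rack traffic during the write while still surviving the loss of the local rack.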
161
HEARTBEATS
162
REPLICATION ENGINE
163
DATA CORRECTNESS
164
NAMENODE FAILURE
165
DATA PIPELINING
166
REBALANCER
167
SECONDARY NAMENODE
168
USER INTERFACE
169
MAP REDUCE
170
WHAT IS MAP REDUCE?
Second, the reduce task takes the output from a map as its input and combines those
data tuples into a smaller set of tuples.
MapReduce is a programming model that Google has used successfully in processing its big
data sets (about 20 petabytes, that is, roughly 20,000 terabytes, per day).
171
WHAT IS MAP REDUCE?
Efficiency from
Streaming through data, reducing seeks
Pipelining
A good fit for a lot of applications
Log processing
Web index building
Users specify the computation in terms of a map and a reduce function,
Underlying runtime system automatically parallelizes the computation across large-scale
clusters of machines, and
Underlying system also handles machine failures, efficient communications, and performance
issues.
172
MAPREDUCE-DATAFLOW
173
MAP STAGE
Generally the input data is in the form of a file or directory and is stored in the Hadoop file
system (HDFS).
The mapper processes the data and creates several small chunks of data.
174
REDUCE STAGE
This stage is the combination of the Shuffle stage and the Reduce stage.
The Reducer’s job is to process the data that comes from the mapper.
After processing, it produces a new set of output, which will be stored in the HDFS.
175
MAPREDUCE-FEATURES
176
WORD COUNT EXAMPLE
Mapper
Input: value: lines of text of input
Output: key: word, value: 1
Reducer
Input: key: word, value: set of counts
Output: key: word, value: sum
Launching program
Defines this job
Submits job to cluster
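The word count dataflow can be imitated in a single Python process. This sketch is not Hadoop code: the function names are assumptions, and the `shuffle` step stands in for the grouping that the framework performs between the map and reduce stages.

```python
from collections import defaultdict

def mapper(line):
    """Map: emit (word, 1) for every word in a line of text."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(word, counts):
    """Reduce: sum the set of counts for one word."""
    return word, sum(counts)

def word_count(lines):
    """Launching program: run map over every line, shuffle, then reduce each group."""
    pairs = [pair for line in lines for pair in mapper(line)]
    return dict(reducer(word, counts) for word, counts in shuffle(pairs).items())
```

On a cluster, the mapper calls run in parallel on different input splits and the reducer calls run in parallel on different keys. Here they simply run one after another.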
177
MAPREDUCE - WORD COUNT DATAFLOW
178
MAPREDUCE
179
SOME HANDY TOOLS
Partitioners
Combiners
Compression
Counters
Speculation
Zero Reduces
Distributed File Cache
Tool
180
HADOOP-RELATED SUBPROJECTS
Pig
High-level language for data analysis
HBase
Table storage for semi-structured data
Zookeeper
Coordinating distributed applications
Hive
SQL-like Query language and Metastore
Mahout
Machine learning
181
WHAT IS HIVE
Hive was initially developed by Facebook. Later, the Apache Software Foundation took it up
and developed it further as open source under the name Apache Hive.
182
HIVE IS NOT
A relational database
A design for OnLine Transaction Processing (OLTP)
A language for real-time queries and row-level updates
183
FEATURES OF HIVE
184
ARCHITECTURE OF HIVE
185
ARCHITECTURE OF HIVE
User Interface
Hive is data warehouse infrastructure software that enables interaction between the user
and HDFS.
The user interfaces that Hive supports are the Hive Web UI, the Hive command line, and Hive
HD Insight (on Windows Server).
Meta Store
Hive chooses respective database servers to store the schema or metadata of tables,
databases, columns in a table, their data types, and HDFS mapping.
186
ARCHITECTURE OF HIVE
Execution Engine
The Hive Execution Engine is the part that connects the HiveQL process engine and MapReduce.
The execution engine processes the query and generates the same results as MapReduce would.
HDFS or HBASE
The Hadoop Distributed File System and HBase are the data storage techniques used to store data in the
file system.
187
Q&A
189
UML
190
ER VS. UML CLASS DIAGRAMS
Binary relationship sets are represented in UML by just drawing a line connecting the entity sets.
The relationship set name is written adjacent to the line.
The role played by an entity set in a relationship set may also be specified by writing the role
name on the line, adjacent to the entity set.
The relationship set name may alternatively be written in a box, along with attributes of the
relationship set, and the box is connected, using a dotted line, to the line depicting the
relationship set.
193