
PowerPoint Presentation on Data Processing in Data Warehousing
Aim: To develop a software framework that allows us to analyse a bank's data in a data warehouse.

We use Hadoop as the data warehouse tool to fulfil this aim by performing the following tasks:

1. Extract the bank's data, which is provided in different formats, and store it in a single-node cluster.

2. Transform the semi-structured data into structured data.

3. Load the data into a high-level platform, merge it into a data lake, and apply queries to analyse it.

4. Automate the ETL process using Oozie.

5. Visualize the output in graphical form.


Requirements

• Hardware requirements: a desktop or laptop with a basic configuration, i.e. 4 GB RAM and 500 GB of hard disk space.

• Software requirements:

1. Platform software: Cloudera Hadoop, Oracle, Tableau.

2. Hadoop ecosystem tools: Apache Pig, Apache Hive, Apache Flume, Sqoop, Oozie.
Data Warehouse

In computing, a data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system used for reporting and data analysis, and is considered a core component of business intelligence. DWs are central repositories of integrated data from one or more disparate sources. They store current and historical data in a single place and are used to create analytical reports for workers throughout the enterprise.
Architecture Of Data Warehouse

[Diagram: data flows from operational systems (the data sources) through a data staging area into the warehouse, then out to data marts, and finally to users for analysis, reporting, and mining.]
• As we have seen earlier, we are using Hadoop as the data warehouse tool.

• Hadoop is used for performing the ETL operations.

• We also use various tools from the Hadoop ecosystem, described on the following slides.

Apache Pig

• Pig is an open-source, high-level dataflow system.

• It provides a simple language for queries and data manipulation that is compiled into MapReduce jobs that run on Hadoop.
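
As an illustration, a minimal Pig Latin sketch of the transform step; the file path, delimiter, and schema below are assumptions, not the project's actual layout:

  -- Load pipe-delimited bank records (path and schema are hypothetical)
  txns = LOAD '/user/cloudera/bank/raw/transactions.txt'
         USING PigStorage('|')
         AS (cust_id:int, product:chararray, qty:int, txn_date:chararray);

  -- Drop malformed rows so the downstream Hive tables stay clean
  clean = FILTER txns BY cust_id IS NOT NULL AND qty > 0;

  -- Store the structured output as comma-delimited files
  STORE clean INTO '/user/cloudera/bank/structured/transactions'
        USING PigStorage(',');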
Apache Hive

• Apache Hive is a data warehouse software project built on top of Hadoop for providing data query and analysis.

• The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL.

• Hive has three main functions: data summarization, query, and analysis. It supports queries expressed in a language called HiveQL.
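
A sketch of how the structured data might be exposed to HiveQL; the table name, columns, and HDFS location are assumptions carried over from the Pig example above:

  -- External table over the structured output produced by Pig
  -- (table name, columns, and location are illustrative)
  CREATE EXTERNAL TABLE transactions (
    cust_id   INT,
    product   STRING,
    qty       INT,
    txn_date  STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/user/cloudera/bank/structured/transactions';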
Apache Flume

• Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

• It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms.

• It uses a simple, extensible data model that allows for online analytic applications.
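
A minimal sketch of a Flume agent configuration that could ingest the bank's input files from a local spool directory into HDFS; the agent name, directories, and sink settings are assumptions:

  # Flume agent: watch a spool directory and land files in HDFS
  # (component names and paths are hypothetical)
  agent.sources = src1
  agent.channels = ch1
  agent.sinks = sink1

  agent.sources.src1.type = spooldir
  agent.sources.src1.spoolDir = /data/bank/incoming
  agent.sources.src1.channels = ch1

  agent.channels.ch1.type = memory

  agent.sinks.sink1.type = hdfs
  agent.sinks.sink1.hdfs.path = /user/cloudera/bank/raw
  agent.sinks.sink1.hdfs.fileType = DataStream
  agent.sinks.sink1.channel = ch1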
Apache Sqoop

• Sqoop is a command-line application for transferring data between relational databases and Hadoop.

• Sqoop automates most of this process, relying on the database to describe the schema of the data to be imported.

• Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.
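
For example, a typical import invocation might look like this; the connection string, credentials, table, and target directory are assumptions, not the project's actual values:

  # Import one relational table into HDFS with four parallel map tasks
  sqoop import \
    --connect jdbc:mysql://localhost/bankdb \
    --username retail_dba -P \
    --table customers \
    --target-dir /user/cloudera/bank/customers \
    --num-mappers 4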
Data Flow

[Diagram: input files (JSON, XML, pipe-delimited) are ingested via Flume, and MySQL and Oracle tables are imported via Sqoop; Pig transforms the data, Hive merges it into the data lake, and Tableau provides the graphical data visualization.]
Apache Oozie

Apache Oozie is a workflow scheduler for Hadoop. It is a system that runs workflows of dependent jobs: users create Directed Acyclic Graphs (DAGs) of workflows, which can be run in parallel or sequentially in Hadoop.
It consists of two parts:
Workflow engine: stores and runs workflows composed of Hadoop jobs, e.g., MapReduce, Pig, Hive.
Coordinator engine: runs workflow jobs based on predefined schedules and the availability of data.
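
As an illustration, a minimal Oozie workflow sketch that runs the Pig transform step; the application name, properties, and script name are assumptions:

  <!-- Minimal workflow: run the (hypothetical) transform.pig script -->
  <workflow-app name="bank-etl" xmlns="uri:oozie:workflow:0.5">
    <start to="pig-transform"/>
    <action name="pig-transform">
      <pig>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <script>transform.pig</script>
      </pig>
      <ok to="end"/>
      <error to="fail"/>
    </action>
    <kill name="fail">
      <message>Pig step failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
  </workflow-app>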
Life Cycle Of Oozie

[Diagram: the Oozie job life cycle.]
Example of Time Scheduler in Oozie
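
The original slide showed a screenshot; a minimal coordinator sketch that triggers the workflow on a daily schedule would look roughly like this (the dates, frequency, and application path are illustrative):

  <!-- Run the bank-etl workflow once per day (values are hypothetical) -->
  <coordinator-app name="bank-etl-daily" frequency="${coord:days(1)}"
                   start="2019-01-01T00:00Z" end="2019-12-31T00:00Z"
                   timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
    <action>
      <workflow>
        <app-path>${nameNode}/user/cloudera/apps/bank-etl</app-path>
      </workflow>
    </action>
  </coordinator-app>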
Resultant Graphical Representation on Tableau After Applying Hive Queries on Merged Data
1. Find out how many customers purchased 10 or more products (see the HiveQL sketch below).
2. Find which product is in the most demand overall.
3. Get the list of products sold in the greatest quantity in Q1.
4. Find the customers who have not been active in the last 3 months.
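
As a sketch, query 1 could be expressed in HiveQL like this; the table and column names are assumptions carried over from the earlier examples:

  -- Customers who purchased 10 or more products (schema is illustrative)
  SELECT cust_id, SUM(qty) AS products_purchased
  FROM transactions
  GROUP BY cust_id
  HAVING SUM(qty) >= 10;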
Summary

This project was undertaken as a live project from Technogeeks. Its main objective is to collect data from different sources, transform semi-structured data into structured data, merge and store the data, and analyse it to retrieve information in accordance with the bank's requirements.
THANK YOU
