
Data Warehouse vs. Data Lake (one point awarded per criterion)

Data Type & Volume
Data Warehouse: A defined volume of structured data stored in a relational database; a data warehouse hosts only a subset of data from different sources.
Data Lake: Vast amounts of any data (raw, structured and unstructured), kept in its native format until it is needed; a data lake can store anything and everything.
Point: Data Lake

Storage Cost
Data Warehouse: Expensive for large data volumes.
Data Lake: Designed for low-cost storage.
Point: Data Lake

Agility
Data Warehouse: Less agile; fixed configuration.
Data Lake: Highly agile; configure as required.
Point: Data Lake

Data Storage
Data Warehouse: Data in a warehouse is already extracted, cleansed, pre-processed, transformed and loaded into predefined schemas and tables, ready to be consumed by business intelligence applications. ETL (Extract, Transform and Load).
Data Lake: Data from a data lake requires more pre-processing, cleansing or enriching. ELT (Extract, Load and Transform).
Point: Data Warehouse

Who Benefits?
Data Warehouse: Useful for business intelligence, reporting and visualizations with dashboards. End users are business decision makers.
Data Lake: Data lakes are often used in big data operations like data mining or machine learning for finding patterns, building predictive models or other complex, high-value outputs. End users are typically data scientists or advanced data analysts.
Point: Data Warehouse

If migrated, will we be affected?
Data Warehouse: The schema is designed and developed with business rules in mind and tested for functionality before data is loaded into it. This is called "schema on write", so a basic user, typically from our bank, can write a query and get a reasonable report.
Data Lake: A data lake hosts data in its raw format without any schema attached to it. A schema is applied only when data is read from the lake, provided the unstructured data has common fields. This is called "schema on read", i.e. a schema is applied to the source files when the data is actually read (see the sketch after this table). We will need a highly qualified resource to get a report out, as they will have to sort out a host of issues without any semblance of order to fall back on.
Point: Data Warehouse

Qualified resources for report development and support
Data Warehouse: Since the schema is pre-defined and one ideally knows where to look for what, even a resource with basic knowledge can help. Plenty of qualified resources are available in the market for us to move ahead.
Data Lake: A qualified resource who can get the required output by querying unstructured files about which no one has any idea will not be easily available, and even if available we may not be able to get them, as we will be priced out.
Point: Data Warehouse

Stable Tools with Proper Support
Point: Data Warehouse

Total: Data Warehouse 5, Data Lake 3
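
To make the schema-on-write vs. schema-on-read distinction above concrete, here is a minimal sketch using PySpark; the lake path, column names and types are hypothetical and only stand in for our actual data.

```python
# Minimal schema-on-read sketch (PySpark). The path, column names and types
# below are hypothetical. In the warehouse the schema is enforced when data
# is loaded (schema on write); in the lake it is applied only at read time.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Raw files sit in the lake exactly as they arrived, with no schema attached.
txn_schema = StructType([
    StructField("account_id", StringType()),
    StructField("txn_type", StringType()),
    StructField("amount", DoubleType()),
])

# The schema is applied only now, when the files are read; records that do
# not match simply come back with nulls instead of failing an upfront load.
transactions = spark.read.schema(txn_schema).json("/datalake/landing/transactions/")
transactions.groupBy("txn_type").count().show()
```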

Our data warehouse should be the first-level data asset for reporting. All our operational reports and other basic reports should go out of this. Through the ingestion process, data from our structured DWH should be loaded into the data lake. (We get to use our existing hardware, and we have a system to fall back on if something goes wrong.)

This ingested data, along with any external data we obtain from social media integration and similar sources, then becomes a second-level data asset stored in the data lake. Further processing and enriching could be done on this bigger dataset in the data lake, and it could then be used for analytical reports and other advanced outputs.
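
As a rough illustration of that ingestion step, a sketch of a job that copies one table from the structured DWH into the lake's landing layer is shown below; the JDBC URL, credentials, table name and lake path are placeholders, not our actual systems, and the real pipeline could equally be built with an off-the-shelf ingestion tool.

```python
# Hypothetical DWH-to-lake ingestion sketch (PySpark). All connection details,
# table names and paths are placeholders for illustration only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dwh-to-lake-ingestion").getOrCreate()

# Read a table from the existing structured data warehouse over JDBC.
accounts = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://dwh-host:1433;databaseName=dwh")  # placeholder
    .option("dbtable", "dbo.accounts")                                 # placeholder
    .option("user", "etl_user")                                        # placeholder
    .option("password", "<secret>")                                    # placeholder
    .load()
)

# Land the data unchanged in the lake's landing layer; refining, enriching and
# joining with external (social media, third-party) data happen in later layers.
accounts.write.mode("overwrite").parquet("/datalake/landing/dwh/accounts/")
```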
GENERAL DATA LAKE FLOW DIAGRAM

[Flow diagram] Source data (Interaction Data from social media and mobile devices; Third Party Data bought from resellers; Syndicated Data from government and associated agencies; Existing System Data from established sources such as CBS) passes through the Ingestion, Landing, Refining, Processing and Access layers of the lake, which then feed Exploratory Analytics, Basic Reporting, ADF Reporting and data for other services such as Campaigns.
SUGGESTED METHOD OF IMPLEMENTATION IF WE GO FOR A DATA LAKE

[Flow diagram] The same sources (Interaction Data from social media and mobile devices; Third Party Data bought from resellers; Syndicated Data from government and associated agencies; Existing System Data from established sources such as CBS) feed the Ingestion, Landing, Refining, Processing and Access layers, which serve Exploratory Analytics, ADF Reporting and data for other services such as Campaigns, while Basic Reporting continues to be served from the existing system data.

If we opt for the above method, we should go with a small player on a contract or subscription basis, and the duration of the subscription or contract should be kept short. Based on their performance we can renegotiate subsequently.
Adopting Hadoop brings huge human costs (you need to hire expensive, in-demand people who know how to use Hadoop), a pricey implementation (migrating your data into NoSQL or HDFS without it going wonky) and the possibility of unanticipated problems (you may not fully understand what you are using).

Even Google, the progenitor of all of this technology via the vaunted BigTable and GFS academic papers, has itself moved away from the techniques pioneered by the NoSQL and Hadoop communities with its recent "Spanner" database.

Companies have bemoaned the diversity of the "big data" ecosystem and wished for consolidation to make life easier for end users.

Hadoop's relatively low storage expense makes it a good option for storing structured data instead of a relational database system. However, Hadoop is not ideal for transactional data, which is highly complex and needs to be processed quickly. Hadoop is also not recommended for structured data sets that require minimal latency.

Because of its batch processing capabilities, Hadoop should be deployed for pattern recognition, creating recommendation engines, index building, and sentiment analysis —
functions that generate data at a high volume. These can be easily stored in Hadoop and later queried using MapReduce functions.
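
As an illustration of "stored in Hadoop and later queried using MapReduce", here is a minimal Hadoop Streaming style mapper and reducer in Python; the input format (tab-separated post id and sentiment label) and the job invocation shown in the comments are assumptions made for the sake of the example.

```python
# Minimal Hadoop Streaming sketch: count records per sentiment label.
# Illustrative invocation (paths and jar location are placeholders):
#   hadoop jar hadoop-streaming.jar \
#     -input /data/posts -output /out/sentiment_counts \
#     -mapper "python3 mr_sentiment.py map" -reducer "python3 mr_sentiment.py reduce"
import sys

def mapper():
    # Assumed input line format: "<post_id>\t<sentiment_label>"
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            print(f"{fields[1]}\t1")

def reducer():
    # Hadoop delivers mapper output sorted by key, so equal keys arrive together.
    current_key, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current_key and current_key is not None:
            print(f"{current_key}\t{count}")
            count = 0
        current_key = key
        count += int(value)
    if current_key is not None:
        print(f"{current_key}\t{count}")

if __name__ == "__main__":
    (mapper if sys.argv[1] == "map" else reducer)()
```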

However, Hadoop shouldn’t be used as a replacement for your existing data center. In fact, it must be integrated with the existing IT infrastructure to augment the organization’s
data management and storage capabilities.

Hadoop is highly fault-tolerant because it was designed to replicate data across many nodes. Each file is split into blocks and replicated numerous times across many machines, ensuring that if a single machine goes down, the file can be rebuilt from blocks elsewhere. (Distributed storage needs more replication and stores more metadata; it cannot be one single large storage box but rather lots of small hosts, which means more data-centre space and racks, i.e. higher real-estate cost.) The data centre also has to be up to date, with support for things like a 10 Gb network and Kerberos authentication, and the ability to expand with hundreds of new hosts.
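
A back-of-the-envelope sketch of the storage overhead hinted at above, assuming the common HDFS replication factor of 3; the raw data volume, headroom and per-node capacity figures are made-up illustrations, not sizing for our environment.

```python
# Rough capacity arithmetic for a replicated HDFS cluster.
# Every figure below is an illustrative assumption, not a sizing exercise.
raw_data_tb = 100          # data we actually want to keep in the lake (assumed)
replication_factor = 3     # common HDFS default: each block stored three times
headroom = 0.25            # spare capacity for temporary/shuffle data (assumed)

required_tb = raw_data_tb * replication_factor * (1 + headroom)

node_capacity_tb = 48      # usable disk per data node (assumed)
nodes_needed = -(-required_tb // node_capacity_tb)   # ceiling division

print(f"~{required_tb:.0f} TB of raw disk across ~{nodes_needed:.0f} data nodes "
      f"for {raw_data_tb} TB of usable data")
```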

“It’s really hard to dig into and actually get real answer from…You really have to understand how this thing works to get what you want.”

Hadoop is great if you’re a data scientist who knows how to code in MapReduce or Pig, Johnson says, but as you go higher up the stack, the abstraction layers have mostly
failed to deliver on the promise of enabling business analysts to get at the data.

The Hadoop community has so far failed to account for the poor performance and high complexity of Hadoop, Johnson says. "The Hadoop ecosystem is still basically in the hands of a small number of experts," he says. "If you have that power and you've learned how to use these tools and you're a programmer, then this thing is super powerful. But there aren't a lot of those people. I've read all these things about how we need another million data scientists in the world, which I think means our tools aren't very good."
The public cloud has emerged as a completely viable alternative that virtually every customer is investing in. Unless you have a large amount of unstructured data like photos, videos, or sound files that you want to analyze, a relational data warehouse will always outperform a Hadoop-based warehouse. And for storing unstructured data, Muglia sees Hadoop being replaced by S3 or other binary large object (BLOB) stores.
