Data lake

A data lake is a storage repository that holds a vast amount of raw data in its native format until it is
needed.

While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to
store data.

Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata.
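
As a rough illustration, a catalog entry for one such element might look like the following Python sketch; the class and field names here are hypothetical, not part of any standard.

```python
from dataclasses import dataclass, field
from uuid import uuid4

@dataclass
class LakeObject:
    """Illustrative record for one data element in a lake.

    The field names are hypothetical; real systems vary widely.
    """
    path: str                                                     # where the raw bytes live
    object_id: str = field(default_factory=lambda: str(uuid4()))  # unique identifier
    tags: dict = field(default_factory=dict)                      # extended metadata tags

# Example: register a raw clickstream file with searchable metadata.
element = LakeObject(
    path="/lake/raw/clickstream/2016-06-01.json",
    tags={"source": "web", "format": "json", "pii": "false"},
)
```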

When a business question arises, the data lake can be queried for relevant data, and that smaller set of
data can then be analyzed to help answer the question.

The term data lake is often associated with Hadoop-oriented object storage. In such a scenario, an organization's data is first loaded into the Hadoop platform, and then business analytics and data mining tools are applied to the data where it resides on Hadoop's cluster nodes of commodity computers.
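
A minimal PySpark sketch of that pattern, assuming a cluster where Spark can reach HDFS; the path and column names are illustrative assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-query").getOrCreate()

# The raw data already resides on the cluster's HDFS data nodes;
# Spark ships the computation to the data instead of exporting it.
clicks = spark.read.json("hdfs:///lake/raw/clickstream/")

# Answer a business question against just the relevant slice of the lake.
checkout_visits_per_day = (
    clicks.where(clicks.page == "/checkout")
          .groupBy("visit_date")
          .count()
)
checkout_visits_per_day.show()
```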

Like big data, the term data lake is sometimes disparaged as being simply a marketing label for a product that supports Hadoop. Increasingly, however, the term is being accepted as a way to describe any large data pool in which the schema and data requirements are not defined until the data is queried.

Hadoop data lake

A Hadoop data lake is a data management platform comprising one or more Hadoop clusters. It is used principally to process and store nonrelational data, such as log files, internet clickstream records, sensor data, JSON objects, images and social media posts.

Such systems can also hold transactional data pulled from relational databases, but they're designed to support analytics applications, not to handle transaction processing. As public cloud platforms have become common sites for data storage, many organizations now build Hadoop data lakes in the cloud.

Hadoop data lake architecture

While the data lake concept can be applied more broadly to include other types of
systems, it most frequently involves storing data in the Hadoop Distributed File
System (HDFS) across a set of clustered compute nodes built on commodity server
hardware. Over time, that reliance on HDFS has been supplemented with data
stores that use object storage technology, but Hadoop ecosystem components other
than HDFS typically remain part of the enterprise data lake implementation.

Because Hadoop runs on commodity hardware and is open source technology,
proponents claim that Hadoop data lakes provide a less expensive repository for
analytics data than traditional data warehouses. In addition, their ability
to hold a diverse mix of structured, unstructured and semistructured data can make
them a more suitable platform for big data management and analytics applications
than data warehouses based on relational software.

However, a Hadoop enterprise data lake can be used to complement an enterprise data
warehouse (EDW) rather than to supplant it entirely. A Hadoop cluster can offload
some data processing work from an EDW and, in effect, stand in as an analytical data
lake. In such cases, the data lake can host new analytics applications, and the
altered data sets or summarized results they produce can be sent to the established
data warehouse for further analysis.

An emerging style of Hadoop data lake architecture supports storage centered on open source data processing frameworks.

Hadoop data lake best practices

The contents of a Hadoop data lake need not be immediately incorporated into a
formal database schema or consistent data structure, which allows users to store raw
data as is; information can then either be analyzed in its raw form or prepared for
specific analytics uses as needed.

As a result, data lake systems tend to employ extract, load and transform (ELT)
methods for collecting and integrating data, instead of the extract, transform and load
(ETL) approaches typically used in data warehouses: data is loaded first and
transformed later, using MapReduce, Spark and other data processing frameworks.
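
A hedged PySpark sketch of that ELT pattern; all paths, column names and formats are assumptions for illustration, not a fixed convention.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("elt-sketch").getOrCreate()

# Extract and Load: land the source extract in the lake exactly as received.
raw = spark.read.json("hdfs:///landing/orders/2016-06-01.json")
raw.write.mode("append").json("hdfs:///lake/raw/orders/")

# Transform: later, and only when an analytics use calls for it,
# derive a curated view from the raw zone.
orders = spark.read.json("hdfs:///lake/raw/orders/")
curated = (
    orders.where(F.col("status") == "complete")
          .withColumn("order_date", F.to_date("created_at"))
          .select("order_id", "customer_id", "order_date", "total")
)
curated.write.mode("overwrite").parquet("hdfs:///lake/curated/orders/")
```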

Despite the common emphasis on retaining data in a raw state, data lake architectures
often strive to employ schema-on-read techniques, refining and sorting some data for
enterprise uses at the time it is queried. As a result, Hadoop data lakes have come to
hold both raw and curated data.
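
The schema-on-read idea can be made concrete with a small PySpark sketch: the structure below is imposed only at query time, and all names are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# No schema was fixed when these files landed in the lake; this one is
# imposed only now, at read time. Another job could impose a different one.
event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event", StringType()),
    StructField("ts", LongType()),
])

events = spark.read.schema(event_schema).json("hdfs:///lake/raw/events/")
events.createOrReplaceTempView("events")
spark.sql("SELECT event, COUNT(*) AS n FROM events GROUP BY event").show()
```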

As big data applications become more prevalent in companies, the data lake is often
organized to support a variety of applications. While early Hadoop data lakes were
often the province of data scientists, these lakes are increasingly adding tools that
enable self-service analytics for many types of users.

Hadoop data lake uses, challenges

Potential uses for Hadoop data lakes vary. For example, they can pool varied legacy
data sources, collect network data from multiple remote locations and serve as a way
station for data that is overloading another system.

Experimental analysis and archiving are among other Hadoop data lake uses. They
have also become an integral part of lambda architectures -- which couple batch with
real-time data processing -- including implementations built on Amazon Web Services (AWS).
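
As a sketch of the lambda pattern (the pattern itself is not AWS-specific): a batch layer periodically recomputes an accurate view over the lake's full history, while a speed layer keeps a low-latency view of new arrivals. Every path and name below is hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lambda-sketch").getOrCreate()

# Batch layer: periodically recompute an accurate view over the full
# history stored in the lake.
history = spark.read.parquet("hdfs:///lake/raw/events/")
history.groupBy("event").count() \
       .write.mode("overwrite").parquet("hdfs:///lake/views/event_counts/")

# Speed layer: maintain an incremental view of newly arriving events
# with Spark Structured Streaming; a serving layer would merge the two views.
stream = spark.readStream.schema(history.schema).parquet("hdfs:///landing/events/")
(stream.groupBy("event").count()
       .writeStream.outputMode("complete")
       .format("memory").queryName("recent_event_counts")
       .start())
```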

The Hadoop data lake isn't without its critics or challenges for users. Spark, as well as
the Hadoop framework itself, can support file architectures other than HDFS.
Meanwhile, data warehouse advocates contend that similar architectures -- for
example, the data mart -- have a long lineage and that Hadoop and related open source
technologies still need to mature significantly in order to match the functionality and
reliability of data warehousing environments.

Experienced Hadoop data lake users say that a successful implementation requires a
strong architecture and disciplined data governance policies; without those things,
they warn, data lake systems can become out-of-control dumping grounds. Effective
metadata management typically helps to drive successful enterprise data lake
implementations.
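
One concrete form of that metadata discipline is to register every curated data set in the cluster's metastore rather than leaving it as anonymous files. A PySpark sketch, with hypothetical database and table names:

```python
from pyspark.sql import SparkSession

# Hive support lets Spark persist table definitions in the metastore,
# so each curated data set gets a name, a schema and a discoverable home.
spark = (
    SparkSession.builder.appName("catalog-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

curated = spark.read.parquet("hdfs:///lake/curated/orders/")

spark.sql("CREATE DATABASE IF NOT EXISTS curated")
curated.write.mode("overwrite").saveAsTable("curated.orders")

# Analysts can now find the data by name rather than by folder path.
spark.sql("SHOW TABLES IN curated").show()
```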

Hadoop vs. Azure Data Lakes

Other data lake platforms offer functionality similar to the Hadoop data lake and also
tie into HDFS.

Microsoft launched its Azure Data Lake for big data analytical workloads in the
cloud in 2016. It is compatible with Azure HDInsight, Microsoft's data
processing service based on Hadoop, Spark, R and other open source
frameworks. The main components of Azure Data Lake are Azure Data Lake
Analytics, which is built on Apache YARN; Azure Data Lake Store; and the
U-SQL query language. It uses Azure Active Directory for authentication and
access control lists and includes enterprise-level features for manageability,
scalability, reliability and availability.

Around the same time that Microsoft launched its data lake, AWS launched
Data Lake Solutions -- an automated reference data lake implementation that
guides users through the creation of a data lake architecture on the AWS cloud,
using AWS services such as Amazon Simple Storage Service (S3) for
storage and AWS Glue, a managed data catalog and ETL service.