
BIG DATA BEST PRACTICES

Big Data tools like Hadoop MapReduce, Hive and Pig can do wonders if used correctly and
wisely. We all know how to use these tools, but there are some practices that, if followed,
bring out the full efficiency of these tools.

So, let's have a look at the points that can be termed Big Data best practices:

1. USE THE NUMBER OF MAP AND REDUCE TASKS APPROPRIATELY


Choosing the number of map and reduce tasks for a job is important. Here are some
of the factors to be kept in mind :

a. If each task takes less than 30-40 seconds, reduce the number of tasks. The
task setup and scheduling overhead is a few seconds, so if tasks finish very
quickly, you are spending time on overhead rather than on useful work. In simple
words, your tasks are under-loaded; it is better to increase the load per task and
utilize each one to the fullest. Another option is JVM reuse: the JVM spawned for
one mapper can be reused by the next, so there is no overhead of spawning an
extra JVM.
b. If you are dealing with a huge input size, for example 1 TB, consider
increasing the block size of the input dataset to 256 MB or 512 MB, so that
fewer mappers are spawned. Increasing the number of mappers by decreasing
the block size is not a good practice: Hadoop is designed to work on large
chunks of data per task, which reduces disk seek time and increases
computation speed. So always define the HDFS block size large enough to allow
Hadoop to compute effectively.
c. If you have 50 map slots in your cluster, avoid jobs using 51 or 52 mappers,
because the first 50 mappers finish at roughly the same time and then the 51st
and 52nd must still run before the reduce tasks can start. Likewise, simply
increasing the number of mappers to 500, 1000 or even 2000 does not speed up
your job. Mappers run in parallel only up to the number of map slots available
in your cluster: if only 50 map slots are available, only 50 mappers run in
parallel and the rest wait in a queue for slots to free up.

d. The number of reduce tasks should always be equal to or less than the number
of reduce slots available in your cluster.
e. Sometimes we don't really need reducers at all, for example when we are only
filtering data or removing noise from it. In these cases, make sure you set the
number of reducers to zero, since sorting and shuffling are expensive
operations; a short sketch of the relevant settings follows this list.
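A minimal sketch of the knobs discussed above, written as Hive session settings (the same
classic mapred.*/dfs.* properties can also go into mapred-site.xml or onto the hadoop command
line); exact property names vary between Hadoop releases, so treat these as illustrative:

    SET mapred.job.reuse.jvm.num.tasks=10;  -- reuse each JVM for up to 10 short-lived tasks (point a)
    SET dfs.block.size=268435456;           -- write new files with 256 MB blocks (point b)
    SET mapred.reduce.tasks=40;             -- keep this at or below the reduce slots in the cluster (points c, d)
    -- For a pure filter job with no aggregation, use zero reducers so no shuffle/sort runs (point e):
    -- SET mapred.reduce.tasks=0;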

2. EXECUTE JOBS ON A SMALL DATASET FOR TESTING (SAMPLING)


Whenever a complex Hive query, Pig script or raw MapReduce job is written, it is
a fair technique to first run it on a small dataset rather than testing it on the real
dataset, which will be huge. It is better to find the bugs and bottlenecks in the
job by running it on a small test dataset than to waste cluster resources by
running it on the full data.
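One way to do this in Hive, sketched here with made-up table names, is to materialize a small
random sample of the real table and point the query at that first:

    -- Take roughly 1% of the rows into a scratch table for testing.
    CREATE TABLE sales_sample AS
    SELECT * FROM sales TABLESAMPLE (BUCKET 1 OUT OF 100 ON rand()) s;

    -- Develop and debug the heavy query against sales_sample, then switch back to sales.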

3. PARTITIONING HIVE TABLES


Hive partitioning is an effective method to improve query performance on larger
tables. Partitioning allows us to store data in separate sub-directories under the
table location, and it greatly helps queries that filter on the partition keys.
Although the selection of the partition key is always sensitive, it should always be a
low-cardinality attribute; for example, if your data is associated with a time dimension,
then date could be a good partition key.
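As a rough sketch (table and column names are invented for the example), a date-partitioned
table and a query that prunes down to a single partition could look like this:

    CREATE TABLE page_views (
      user_id BIGINT,
      url     STRING
    )
    PARTITIONED BY (dt STRING);

    -- Only the sub-directory for dt='2016-01-01' is scanned here.
    SELECT count(*) FROM page_views WHERE dt = '2016-01-01';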

4. COMPRESS MAP/REDUCE OUTPUT


Compression techniques significantly reduce the intermediate data volume, which
in turn reduces the amount of data transferred between mappers and reducers; all
of this generally happens over the network. Compression can be applied to the
mapper and reducer output individually. Keep in mind that gzip-compressed files are
not splittable, so gzip should be applied with caution: a compressed file should
not be larger than a few hundred megabytes, otherwise it can potentially lead to an
imbalanced job. Other compression codec options are Snappy, LZO, bzip2, etc.

For map output compression, set mapred.compress.map.output to true.

For job (reduce) output compression, set mapred.output.compress to
true.
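A minimal sketch of these settings as Hive session commands, using Snappy as the codec
(the classic mapred.* names are shown; newer Hadoop releases expose mapreduce.* equivalents):

    SET mapred.compress.map.output=true;
    SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
    SET mapred.output.compress=true;
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;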



5. MAP JOIN
Map joins are really efficient if the table on one side of the join is small enough to
fit in memory. Hive supports a parameter, hive.auto.convert.join, which when
set to true tells Hive to try to convert eligible joins into map joins automatically.
When relying on this behavior, make sure auto-conversion is actually enabled in
your Hive environment.
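A small sketch of enabling this in a Hive session; hive.mapjoin.smalltable.filesize is the
size threshold (in bytes, version-dependent default of about 25 MB) below which a table is
considered small enough for a map join:

    SET hive.auto.convert.join=true;
    -- Tables smaller than this threshold are candidates for map-side joins.
    SET hive.mapjoin.smalltable.filesize=25000000;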

6. INPUT FORMAT SELECTION


Input format plays a vital role in Hive performance. For example, JSON and other
text-based input formats are not a good choice for a large production system where data
volume is really high: these human-readable formats take a lot of space and carry the
overhead of parsing. To address these problems, Hive comes with columnar
input formats like RCFile, ORC, etc. Columnar formats allow you to reduce the read
operations in analytics queries by allowing each column to be accessed individually.
There are also binary formats like Avro, SequenceFile, Thrift and Protocol Buffers which
can be useful.
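As an illustrative sketch (names invented for the example), converting a text-backed table
into an ORC-backed one could look like this:

    CREATE TABLE page_views_orc (
      user_id BIGINT,
      url     STRING
    )
    STORED AS ORC;

    INSERT OVERWRITE TABLE page_views_orc
    SELECT user_id, url FROM page_views;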

7. PARALLEL EXECUTION
Hadoop can execute MapReduce jobs in parallel, and several queries executed on
Hive automatically take advantage of this parallelism. However, a single complex Hive
query is commonly translated into a number of MapReduce jobs that are executed
sequentially by default. Often, though, some of the query's MapReduce stages are not
interdependent and could be executed in parallel. They can then take advantage of
spare capacity on the cluster and improve cluster utilization while at the same time
reducing the overall query execution time. Set hive.exec.parallel=true to enable this
behavior.
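A short sketch of the session settings involved; hive.exec.parallel.thread.number caps how
many stages may run concurrently (its default is 8):

    SET hive.exec.parallel=true;
    SET hive.exec.parallel.thread.number=8;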

8. VECTORIZATION
Vectorization allows Hive to process a batch of rows together instead of one row at
a time. Each batch consists of column vectors, which are usually arrays of
primitive types. Operations are performed on the entire column vector, which
improves the instruction pipelines and the cache usage. To enable vectorization, set
the configuration parameter hive.vectorized.execution.enabled=true.
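A minimal sketch for a Hive session; the second, reduce-side flag exists in newer Hive
releases and goes beyond what the text above mentions:

    SET hive.vectorized.execution.enabled=true;
    -- Optionally vectorize the reduce side as well, where supported.
    SET hive.vectorized.execution.reduce.enabled=true;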



KEY EXPLANATION

The general rule for choosing the number of mappers and reducers is:

Total number of map or reduce tasks = number of nodes * maximum number of tasks per node

Maximum number of tasks per node = number of processors per node - 1 (since the
DataNode and TaskTracker daemons will take roughly one processor)
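For example, on a hypothetical cluster of 10 nodes with 8 processors each, this rule of
thumb gives 10 * (8 - 1) = 70 map or reduce tasks running concurrently.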

Advantages of having fewer maps:

a. It reduces the scheduling overhead; having fewer maps means task scheduling is
easier and the availability of free slots in the cluster is higher.
b. It reduces the number of seeks required to shuffle the map outputs from the maps
to the reducers, because each map produces output for each reduce; thus the
number of seeks is m * r, where m is the number of maps and r is the number of
reduces.
c. Each shuffled segment is larger, which reduces the overhead of connection
establishment relative to the real work done, that is, moving bytes
across the network.
d. The reduce-side merge of the sorted map outputs is more efficient, since the
branch factor for the merge is smaller; that is, fewer merges are needed because
there are fewer sorted segments of map outputs to merge.

Bijoy Kumar Khandelwal
