Big Data tools such as Hadoop, MapReduce, Hive, and Pig can do wonders if used correctly and
wisely. We all know how to use these tools, but there are some points which, if followed, bring
out the core efficiency of these tools.
So, let's have a look at those points, which are termed the Big Data best practices:
a. If each task takes less than 30-40 seconds, reduce the number of tasks. The
task setup and scheduling overhead is a few seconds, so if tasks finish very
quickly, you are wasting time while not doing useful work. In simple words, your
tasks are under-loaded; better to increase the per-task load and utilize each
task to the fullest. Another option is JVM reuse: the JVM spawned for one
mapper can be reused by the next one, so there is no overhead of spawning
an extra JVM.
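The overhead argument above can be made concrete with a back-of-envelope calculation. This is an illustrative sketch, not a Hadoop API; the ~3-second per-task overhead figure is an assumption taken from the text's "a few seconds".

```python
# Estimate what fraction of a task's wall-clock time is lost to setup and
# scheduling overhead. Assumed overhead: ~3 s per task (illustrative).

def overhead_fraction(task_seconds, overhead_seconds=3.0):
    """Fraction of each task's total time spent on setup/scheduling."""
    return overhead_seconds / (task_seconds + overhead_seconds)

# A 10-second task loses almost a quarter of its time to overhead,
# while a 120-second task loses only a few percent.
short = overhead_fraction(10)
long_ = overhead_fraction(120)
print(f"10 s task:  {short:.0%} overhead")
print(f"120 s task: {long_:.0%} overhead")
```

This is why fewer, heavier tasks beat many tiny ones: the fixed overhead is amortized over more useful work.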
b. If you are dealing with a huge input, for example 1 TB, consider increasing
the block size of the input dataset to 256 MB or 512 MB, so that fewer
mappers are spawned. Increasing the number of mappers by decreasing the
block size is not good practice. Hadoop is designed to work on large chunks
of data to reduce disk seek time and increase computation speed, so always
define the HDFS block size large enough to allow Hadoop to compute
effectively.
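To see how block size drives mapper count, here is a rough estimate for a 1 TB input. It assumes one map task per HDFS block, the common default for splittable input; actual counts depend on the input format and split settings.

```python
# Rough mapper-count estimate for a 1 TB input at different HDFS block sizes.
# Assumes one map task per block (the usual default for splittable input).

def num_mappers(input_bytes, block_bytes):
    # Ceiling division: a partial final block still gets its own mapper.
    return -(-input_bytes // block_bytes)

TB = 1024 ** 4
MB = 1024 ** 2

for block_mb in (64, 128, 256, 512):
    print(f"{block_mb:>3} MB blocks -> {num_mappers(TB, block_mb * MB):>5} mappers")
```

Doubling the block size halves the number of mappers, and with it the per-task setup overhead the previous point warned about.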
c. If you have 50 map slots in your cluster, avoid jobs using 51 or 52 mappers,
because the first 50 mappers finish at roughly the same time, and then the
51st and 52nd must run before the reduce tasks can start. So simply
increasing the number of mappers to 500, 1000, or even 2000 does not speed
up your job. The mappers run in parallel only up to the number of map slots
available in your cluster; if only 50 map slots are available, only 50 run in
parallel and the others wait in the queue for slots to become free.
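The "51 mappers on 50 slots" effect is easiest to see as a count of map waves. A quick sketch (plain arithmetic, not a Hadoop API):

```python
import math

# With a fixed number of map slots, mappers run in "waves": 52 mappers on
# 50 slots need two full waves, nearly doubling the map-phase wall time
# compared with exactly 50 mappers.

def map_waves(num_mappers, map_slots):
    return math.ceil(num_mappers / map_slots)

print(map_waves(50, 50))    # one wave
print(map_waves(52, 50))    # two waves for just two extra mappers
print(map_waves(2000, 50))  # 40 waves: extra mappers add no parallelism
```

The rule of thumb that falls out: size the mapper count to a multiple of the available slots, so the last wave is as full as the others.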
7. PARALLEL EXECUTION
Hadoop can execute MapReduce jobs in parallel, and many queries executed on
Hive make use of this parallelism. However, a single, complex Hive query is
commonly translated into a number of MapReduce jobs that are executed
sequentially by default. Often, though, some of a query's MapReduce stages are
not interdependent and could be executed in parallel. They can then take
advantage of spare capacity on the cluster and improve cluster utilization while
at the same time reducing the overall query execution time. Set
hive.exec.parallel=true to use this behavior.
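The idea of independent stages running concurrently can be sketched in miniature. This is a toy analogy in Python, not Hive's execution engine: two independent "stages" (think of two subquery scans) run at the same time, and a dependent final stage waits for both.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy analogy for hive.exec.parallel: two independent stages of a query run
# concurrently; a final stage that combines their results runs only after
# both complete. The stage bodies are placeholder computations.

def stage_a():
    return sum(range(100))      # stand-in for one independent stage

def stage_b():
    return max(range(100))      # another stage with no dependency on stage_a

with ThreadPoolExecutor() as pool:
    fa = pool.submit(stage_a)   # both independent stages submitted at once
    fb = pool.submit(stage_b)
    joined = fa.result() + fb.result()  # dependent stage waits for both

print(joined)
```

With sequential execution the total time is the sum of all stage times; with parallel execution it approaches the time of the slowest independent stage plus the dependent tail.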
8. VECTORIZATION
Vectorization allows Hive to process a batch of rows together instead of one row
at a time. Each batch consists of column vectors, each usually an array of a
primitive type. Operations are performed on an entire column vector at once,
which improves instruction pipelining and cache usage. To enable vectorization,
set the configuration parameter SET hive.vectorized.execution.enabled=true.
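The row-at-a-time versus batched columnar distinction can be illustrated with a small sketch. This is illustrative only, not Hive's actual vectorized execution engine; the table and column names are made up.

```python
from array import array

# Sketch of row-at-a-time vs. batched columnar evaluation. The columnar
# version touches one flat primitive array per column, which is friendlier
# to CPU caches and instruction pipelines than per-row tuple handling.

rows = [(i, i * 2) for i in range(10)]        # row-oriented table (a, b)

# Row-at-a-time: one tuple unpack per row.
row_result = [a + b for a, b in rows]

# Columnar batch: each column is an array of primitives; the operator
# iterates over whole column vectors instead of individual rows.
col_a = array("l", (a for a, _ in rows))
col_b = array("l", (b for _, b in rows))
batch_result = [x + y for x, y in zip(col_a, col_b)]

print(row_result == batch_result)  # same answer, different access pattern
```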
Maximum number of tasks per node = (number of processors per node) − 1
(since the DataNode and TaskTracker daemons take up one processor)