
To analyze the transaction data in the new platform, we need to ingest it into the
Hadoop Distributed File System (HDFS). We need a tool that easily transfers
structured data from an RDBMS to HDFS, while preserving structure, so that we can
query the data without interfering with or breaking the regular workloads that run
against it.

Apache Sqoop, which is part of CDH, is that tool. The nice thing about Sqoop is that
we can automatically load our relational data from MySQL into HDFS, while
preserving the structure.

With a few additional configuration parameters, we can take this one step further
and load this relational data directly into a form ready to be queried by Impala (the
open source analytic query engine included with CDH). Because we may also want to
leverage this data for other workloads on the cluster, we take a few extra steps to
store it in a Hadoop-optimized file format (Apache Parquet, in the command below;
Sqoop can also write Apache Avro), so the data is readily available for Impala as well
as other workloads.

You should first open a terminal, which you can do by clicking the black "Terminal"
icon at the top of your screen. Once it is open, you can launch the Sqoop job:

[cloudera@quickstart ~]$ sqoop import-all-tables \
    -m 1 \
    --connect jdbc:mysql://quickstart:3306/retail_db \
    --username=retail_dba \
    --password=cloudera \
    --compression-codec=snappy \
    --as-parquetfile \
    --warehouse-dir=/user/hive/warehouse \
    --hive-import
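
As an aside (not needed for this tutorial), Sqoop can also write Avro data files instead
of Parquet via the --as-avrodatafile flag. A hedged sketch of such an import is below;
note that --hive-import is omitted because the Sqoop releases of this era typically
reject it in combination with Avro output, so you would define the Hive/Impala tables
over the resulting .avro files yourself afterwards:

[cloudera@quickstart ~]$ sqoop import-all-tables \
    -m 1 \
    --connect jdbc:mysql://quickstart:3306/retail_db \
    --username=retail_dba \
    --password=cloudera \
    --compression-codec=snappy \
    --as-avrodatafile \
    --warehouse-dir=/user/hive/warehouse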

For more Sqoop examples and details, see the Sqoop User Guide: https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html


This command may take a while to complete, but it is doing a lot. It is launching
MapReduce jobs to pull the data from our MySQL database and write the data to
HDFS, distributed across the cluster in Apache Parquet format. It is also creating
tables to represent the HDFS files in Impala / Apache Hive with matching schema.

Parquet is a format designed for analytical applications on Hadoop. Instead of
grouping your data into rows like typical data formats, it groups your data into
columns. This is ideal for many analytical queries where, instead of retrieving data
from specific records, you're analyzing relationships between specific variables
across many records. Parquet is designed to optimize data storage and retrieval in
these scenarios.
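
As a small illustration (using two columns of the order_items table imported above),
a query like the one below only needs to read the two referenced columns; with a
columnar format such as Parquet, the other columns of the table never have to be
scanned at all:

-- Illustrative only: total quantity sold per product, touching just two columns
select order_item_product_id, sum(order_item_quantity) as total_quantity
from order_items
group by order_item_product_id;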

Once the command is complete we can confirm that our data was imported into
HDFS:

hadoop fs -ls /user/hive/warehouse/


hadoop fs -ls /user/hive/warehouse/categories/

Note: The number of .parquet files shown will be equal to the number of mappers
used by Sqoop. On a single-node cluster you will just see one, but larger clusters will
have a greater number of files.

Hive and Impala also allow you to create tables by defining a schema over existing
files with 'CREATE EXTERNAL TABLE' statements, similar to traditional relational
databases. But Sqoop already created these tables for us, so we can go ahead and
query them.
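
For illustration only (you do not need to run this, since Sqoop already created the
tables), here is a hedged sketch of what such a statement could look like for the
categories directory. The column list follows the retail_db sample schema;
category_department_id in particular is an assumption, since it is not referenced
elsewhere in this tutorial:

CREATE EXTERNAL TABLE categories_ext (
  category_id INT,
  category_department_id INT,
  category_name STRING
)
STORED AS PARQUET
LOCATION '/user/hive/warehouse/categories';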

We're going to use Hue's Impala app to query our tables. Hue provides a web-based
interface for many of the tools in CDH and can be found on port 8888 of your
Manager Node. In the QuickStart VM, the administrator username for Hue is
'cloudera' and the password is 'cloudera'.
Once you are inside of Hue, click on Query Editors, and open the Impala Query
Editor.
To save time during queries, Impala does not poll constantly for metadata changes.
So the first thing we must do is tell Impala that its metadata is out of date. Then we
should see our tables show up, ready to be queried:
invalidate metadata;
show tables;

You can also click on the "Refresh Table List" icon on the left to see your new tables
in the side menu.

Now that your transaction data is readily available for structured queries in CDH, it's
time to address DataCo's business question. Copy and paste or type in the following
standard SQL example queries: one showing the most popular product categories,
and one calculating revenue per product and showing the top 10 revenue-generating
products:

-- Most popular product categories


select c.category_name, count(order_item_quantity) as count
from order_items oi
inner join products p on oi.order_item_product_id = p.product_id
inner join categories c on c.category_id = p.product_category_id
group by c.category_name
order by count desc
limit 10;

-- top 10 revenue generating products
select p.product_id, p.product_name, r.revenue
from products p inner join
(select oi.order_item_product_id, sum(cast(oi.order_item_subtotal as float)) as revenue
 from order_items oi inner join orders o
 on oi.order_item_order_id = o.order_id
 where o.order_status <> 'CANCELED'
 and o.order_status <> 'SUSPECTED_FRAUD'
 group by order_item_product_id) r
on p.product_id = r.order_item_product_id
order by r.revenue desc
limit 10;

You may notice that we told Sqoop to import the data into Hive but used Impala to
query the data. This is because Hive and Impala can share both data files and the
table metadata. Hive works by compiling SQL queries into MapReduce jobs, which
makes it very flexible, whereas Impala executes queries itself and is built from the
ground up to be as fast as possible, which makes it better for interactive analysis.
We'll use Hive later for an ETL (extract-transform-load) workload.
The current release of Impala does not support the following SQL features that you might be familiar with
from HiveQL:

Non-scalar data types such as maps, arrays, structs.

Extensibility mechanisms such as TRANSFORM, custom file formats, or custom SerDes.

The DATE data type.

XML and JSON functions.

Certain aggregate functions from HiveQL: covar_pop, covar_samp, corr, percentile, percentile_approx,
histogram_numeric, collect_set; Impala supports the set of aggregate functions listed in Impala Aggregate
Functions and analytic functions listed in Impala Analytic Functions.

Sampling.

Lateral views.

Multiple DISTINCT clauses per query, although Impala includes some workarounds for this
limitation.
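
As a hedged sketch of one such workaround (the table and columns come from the
retail_db data imported above): instead of combining several exact COUNT(DISTINCT ...)
expressions in a single query, Impala's NDV() function returns an approximate distinct
count per column, and several NDV() calls can coexist in one query:

select ndv(order_item_order_id) as approx_distinct_orders,
       ndv(order_item_product_id) as approx_distinct_products
from order_items;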

Cloudera Impala is an open source SQL query engine that runs on Hadoop. It is
modeled after Google Dremel. Impala can query Hive tables directly and actually
uses Hive's metastore. Different from Hive, Impala executes queries natively without
translating them into MapReduce jobs. The core Impala component is a daemon
process that runs on each node of the cluster as the query planner, coordinator,
and execution engine. Each node can accept queries. The planner turns a request
into collections of parallel plan fragments. The coordinator initiates execution on
remote nodes in the cluster. The execution engine reads and writes to data files,
and transmits intermediate query results back to the coordinator node.

https://www.linkedin.com/pulse/2014091014291122744472-why-is-impala-faster-than-hive
The two core technologies of Dremel/Impala are columnar storage for nested data and the tree
architecture for query execution:

Columnar Storage: Data is stored in a columnar fashion to achieve a very high
compression ratio and scan throughput.

Tree Architecture: The architecture forms a massively parallel, distributed,
multi-level serving tree for pushing a query down the tree and then aggregating
the results from the leaves.
As a native query engine, Impala avoids the startup overhead of MapReduce/Tez
jobs. It is well known that MapReduce programs take some time before all nodes
are running at full capacity. In Hive, every query suffers this cold-start problem.
In contrast, Impala daemon processes are started at boot time, and are thus always
ready to execute a query.

Hadoop reuses JVM instances to partially reduce that startup overhead. However,
this also introduces another problem when large heaps are in use. The nodes in
the Cloudera benchmark have 384 GB of memory. Such a big heap is a real challenge
for the garbage collector of the reused JVM instances: stop-the-world GC pauses
may add high latency to queries. Impala, on the other hand, can take full
advantage of such large memory.

Impala processes are multithreaded. Importantly, the scanning portion of plan
fragments is multithreaded and makes use of SSE4.2 instructions. The I/O and
network subsystems are also highly multithreaded. As a result, each individual
Impala node runs more efficiently thanks to a high level of local parallelism.

Impala's query execution is pipelined as much as possible. In the case of
aggregation, the coordinator starts the final aggregation as soon as the
pre-aggregation fragments have started to return results. In contrast, sort and
reduce can only start once all the mappers are done in MapReduce. Tez does not
support pipelined execution yet.

MapReduce materializes all intermediate results. This enables better scalability
and fault tolerance, but it also significantly slows down data processing. In
contrast, Impala streams intermediate results between executors (trading away
some of that scalability). Tez allows different types of input/output including
file, TCP, etc., but it seems that Hive doesn't use this feature yet to avoid
unnecessary disk writes.
Next, we use Apache Spark to find which products are most frequently purchased
together, working directly on the Parquet files that Sqoop wrote. Launch the Spark
shell on YARN:

spark-shell --master yarn-client

// First we're going to import the classes we need
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import parquet.hadoop.ParquetInputFormat
import parquet.avro.AvroReadSupport
import org.apache.spark.rdd.RDD

// Then we create RDDs for 2 of the files we imported from MySQL with Sqoop.
// RDDs are Spark's data structures for working with distributed datasets.
def rddFromParquetHdfsFile(path: String): RDD[GenericRecord] = {
  val job = new Job()
  FileInputFormat.setInputPaths(job, path)
  ParquetInputFormat.setReadSupportClass(job,
    classOf[AvroReadSupport[GenericRecord]])
  return sc.newAPIHadoopRDD(job.getConfiguration,
    classOf[ParquetInputFormat[GenericRecord]],
    classOf[Void],
    classOf[GenericRecord]).map(x => x._2)
}
val warehouse = "hdfs://quickstart/user/hive/warehouse/"
val order_items = rddFromParquetHdfsFile(warehouse + "order_items");
val products = rddFromParquetHdfsFile(warehouse + "products");

// Next, we extract the fields from order_items and products that we care about
// and get a list of every product, its name and quantity, grouped by order
val orders = order_items.map { x => (
x.get("order_item_product_id"),
(x.get("order_item_order_id"), x.get("order_item_quantity")))
}.join(
products.map { x => (
x.get("product_id"),
(x.get("product_name")))
}
).map(x => (
scala.Int.unbox(x._2._1._1), // order_id
(
scala.Int.unbox(x._2._1._2), // quantity
x._2._2.toString // product_name
)
)).groupByKey()

// Finally, we tally how many times each combination of products appears
// together in an order, then we sort them and take the 10 most common
val cooccurrences = orders.map(order =>
(
order._1,
order._2.toList.combinations(2).map(order_pair =>
(
if (order_pair(0)._2 < order_pair(1)._2)
(order_pair(0)._2, order_pair(1)._2)
else
(order_pair(1)._2, order_pair(0)._2),
order_pair(0)._1 * order_pair(1)._1
)
)
)
)
val combos = cooccurrences.flatMap(x => x._2).reduceByKey((a, b) => a + b)
val mostCommon = combos.map(x => (x._2, x._1)).sortByKey(false).take(10)

// We print our results, 1 per line, and exit the Spark shell
println(mostCommon.deep.mkString("\n"))
exit

---------------------------------------------------------------------------------------------------------------------

The output of a first attempt looked like the log below. The YARN executor was lost
and the SparkContext was shut down, so mostCommon was never computed; relaunching
spark-shell and re-running the same commands (shown further down in the same
transcript) completed successfully:

at
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecut
or.java:111)
... 1 more
Caused by: java.nio.channels.ClosedChannelException
16/08/11 03:20:05 WARN netty.NettyRpcEndpointRef: Error sending message
[message = KillExecutors(List(1))] in 3 attempts
org.apache.spark.SparkException: Error sending message [message =
KillExecutors(List(1))]
at
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:118)
at
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
at
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$
$anonfun$receiveAndReply$1$
$anonfun$applyOrElse$4.apply$mcV$sp(YarnSchedulerBackend.scala:222)
at
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$
$anonfun$receiveAndReply$1$
$anonfun$applyOrElse$4.apply(YarnSchedulerBackend.scala:222)
at
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$
$anonfun$receiveAndReply$1$
$anonfun$applyOrElse$4.apply(YarnSchedulerBackend.scala:222)
at
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scal
a:24)
at
scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to send RPC 6835179282061168721 to
192.168.233.135/192.168.233.135:58467:
java.nio.channels.ClosedChannelException
at
org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClie
nt.java:239)
at
org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClie
nt.java:226)
at
io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
at
io.netty.util.concurrent.DefaultPromise$LateListeners.run(DefaultPromise.java:845)
at
io.netty.util.concurrent.DefaultPromise$LateListenerNotifier.run(DefaultPromise.java:
873)
at
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventEx
ecutor.java:357)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
at
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecut
or.java:111)
... 1 more
Caused by: java.nio.channels.ClosedChannelException
16/08/11 03:20:05 ERROR util.Utils: Uncaught exception in thread kill-executorthread
org.apache.spark.SparkException: Error sending message [message =
KillExecutors(List(1))]
at
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:118)
at
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)

at
org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doKillExecutors(YarnSche
dulerBackend.scala:70)
at
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.killExecutors(C
oarseGrainedSchedulerBackend.scala:519)
at
org.apache.spark.SparkContext.killAndReplaceExecutor(SparkContext.scala:1510)
at org.apache.spark.HeartbeatReceiver$
$anonfun$org$apache$spark$HeartbeatReceiver$$expireDeadHosts$3$$anon$3$
$anonfun$run$3.apply$mcV$sp(HeartbeatReceiver.scala:206)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1230)
at org.apache.spark.HeartbeatReceiver$
$anonfun$org$apache$spark$HeartbeatReceiver$$expireDeadHosts$3$
$anon$3.run(HeartbeatReceiver.scala:203)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Error sending message [message =
KillExecutors(List(1))]
at
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:118)
at
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
at
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$
$anonfun$receiveAndReply$1$
$anonfun$applyOrElse$4.apply$mcV$sp(YarnSchedulerBackend.scala:222)
at
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$

$anonfun$receiveAndReply$1$
$anonfun$applyOrElse$4.apply(YarnSchedulerBackend.scala:222)
at
org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint$
$anonfun$receiveAndReply$1$
$anonfun$applyOrElse$4.apply(YarnSchedulerBackend.scala:222)
at
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scal
a:24)
at
scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
... 3 more
Caused by: java.io.IOException: Failed to send RPC 6835179282061168721 to
192.168.233.135/192.168.233.135:58467:
java.nio.channels.ClosedChannelException
at
org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClie
nt.java:239)
at
org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClie
nt.java:226)
at
io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
at
io.netty.util.concurrent.DefaultPromise$LateListeners.run(DefaultPromise.java:845)
at
io.netty.util.concurrent.DefaultPromise$LateListenerNotifier.run(DefaultPromise.java:
873)
at
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventEx
ecutor.java:357)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
at
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecut
or.java:111)

... 1 more
Caused by: java.nio.channels.ClosedChannelException
16/08/16 00:13:59 WARN server.TransportChannelHandler: Exception in connection
from 192.168.233.135/192.168.233.135:58494
java.io.IOException: Connection timed out
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
at
io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:
313)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at
io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:2
42)
at
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteC
hannel.java:119)
at
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.jav
a:468)
at
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecut
or.java:111)

at java.lang.Thread.run(Thread.java:745)
16/08/16 00:13:59 ERROR client.TransportClient: Failed to send RPC
6923981271737176555 to 192.168.233.135/192.168.233.135:58467:
java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
16/08/16 00:13:59 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint:
Attempted to get executor loss reason for executor id 1 at RPC address
192.168.233.135:58494, but got no response. Marking as slave lost.
java.io.IOException: Failed to send RPC 6923981271737176555 to
192.168.233.135/192.168.233.135:58467:
java.nio.channels.ClosedChannelException
at
org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClie
nt.java:239)
at
org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClie
nt.java:226)
at
io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
at
io.netty.util.concurrent.DefaultPromise$LateListeners.run(DefaultPromise.java:845)
at
io.netty.util.concurrent.DefaultPromise$LateListenerNotifier.run(DefaultPromise.java:
873)
at
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventEx
ecutor.java:357)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
at
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecut
or.java:111)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.channels.ClosedChannelException

16/08/16 00:13:59 ERROR cluster.YarnScheduler: Lost executor 1 on


192.168.233.135: Slave lost
16/08/16 00:13:59 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID
0, 192.168.233.135): ExecutorLostFailure (executor 1 exited caused by one of the
running tasks) Reason: Slave lost
16/08/16 00:13:59 ERROR scheduler.LiveListenerBus: Listener SQLListener threw an
exception
java.lang.NullPointerException
at
org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:167)
at
org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.s
cala:42)
at
org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at
org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
at
org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.
scala:37)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$
$anonfun$run$1$
$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$
$anonfun$run$1$
$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$
$anonfun$run$1$
$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.util.AsynchronousListenerBus$$anon$1$
$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1181)

at org.apache.spark.util.AsynchronousListenerBus$
$anon$1.run(AsynchronousListenerBus.scala:63)
16/08/16 00:14:33 ERROR cluster.YarnClientSchedulerBackend: Yarn application has
already exited with state FAILED!
16/08/16 00:14:34 ERROR scheduler.LiveListenerBus: SparkListenerBus has already
stopped! Dropping event
SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo@4b75459)
16/08/16 00:14:34 ERROR scheduler.LiveListenerBus: SparkListenerBus has already
stopped! Dropping event
SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo@cb76101)
16/08/16 00:14:34 ERROR scheduler.LiveListenerBus: SparkListenerBus has already
stopped! Dropping event
SparkListenerJobEnd(0,1471331674141,JobFailed(org.apache.spark.SparkException:
Job 0 cancelled because SparkContext was shut down))
org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut
down
at org.apache.spark.scheduler.DAGScheduler$
$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:806)
at org.apache.spark.scheduler.DAGScheduler$
$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:804)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
at
org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGSchedule
r.scala:804)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.
scala:1658)
at org.apache.spark.util.EventLoop.stop(EventLoop.scala:84)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1581)
at org.apache.spark.SparkContext$
$anonfun$stop$9.apply$mcV$sp(SparkContext.scala:1751)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1230)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1750)

at
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$MonitorThread.run(
YarnClientSchedulerBackend.scala:147)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1843)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1856)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1869)
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1328)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:15
0)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:11
1)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.take(RDD.scala:1302)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$
$iwC.<init>(<console>:49)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:54)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:56)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:58)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:60)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:62)
at $iwC$$iwC$$iwC.<init>(<console>:64)
at $iwC$$iwC.<init>(<console>:66)
at $iwC.<init>(<console>:68)
at <init>(<console>:70)
at .<init>(<console>:74)
at .<clinit>(<console>)

at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.ja
va:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1045)
at
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1326)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:821)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:852)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:800)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$
$loop(SparkILoop.scala:670)
at org.apache.spark.repl.SparkILoop$
$anonfun$org$apache$spark$repl$SparkILoop$
$process$1.apply$mcZ$sp(SparkILoop.scala:997)
at org.apache.spark.repl.SparkILoop$
$anonfun$org$apache$spark$repl$SparkILoop$
$process$1.apply(SparkILoop.scala:945)

at org.apache.spark.repl.SparkILoop$
$anonfun$org$apache$spark$repl$SparkILoop$
$process$1.apply(SparkILoop.scala:945)
at
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:
135)
at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$
$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1064)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.ja
va:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$
$runMain(SparkSubmit.scala:731)
at
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

scala>

scala>

scala>

scala> println(mostCommon.deep.mkString("\n"))
<console>:34: error: not found: value mostCommon
println(mostCommon.deep.mkString("\n"))
^

scala>

scala>

scala>

scala> exit
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
Aug 11, 2016 3:17:29 AM INFO: parquet.hadoop.ParquetInputFormat: Total input
paths to process : 1
Aug 11, 2016 3:17:29 AM INFO: parquet.hadoop.ParquetInputFormat: Total input
paths to process : 1
[cloudera@quickstart ~]$
[cloudera@quickstart ~]$ spark-shell --master yarn-client
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j121.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/jars/slf4j-log4j121.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
Welcome to
____

__

/ __/__ ___ _____/ /__


_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.6.0
/_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
Type in expressions to have them evaluated.
Type :help for more information.
16/08/16 00:20:15 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
16/08/16 00:20:15 WARN util.Utils: Your hostname, quickstart.cloudera resolves to a
loopback address: 127.0.0.1, but we couldn't find any external IP address!
16/08/16 00:20:15 WARN util.Utils: Set SPARK_LOCAL_IP if you need to bind to
another address
16/08/16 00:20:22 WARN : Your hostname, quickstart.cloudera resolves to a
loopback/non-reachable address: 127.0.0.1, but we couldn't find any external IP
address!
16/08/16 00:20:23 WARN shortcircuit.DomainSocketFactory: The short-circuit local
reads feature cannot be used because libhadoop cannot be loaded.
Spark context available as sc (master = yarn-client, app id =
application_1470901512546_0003).
16/08/16 00:21:06 WARN DataNucleus.General: Plugin (Bundle)
"org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple
JAR versions of the same plugin in the classpath. The URL
"file:/usr/lib/hive/lib/datanucleus-rdbms-3.2.9.jar" is already registered, and you are

trying to register an identical plugin located at URL "file:/usr/jars/datanucleusrdbms-3.2.9.jar."


16/08/16 00:21:07 WARN DataNucleus.General: Plugin (Bundle)
"org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR
versions of the same plugin in the classpath. The URL "file:/usr/jars/datanucleus-apijdo-3.2.6.jar" is already registered, and you are trying to register an identical plugin
located at URL "file:/usr/lib/hive/lib/datanucleus-api-jdo-3.2.6.jar."
16/08/16 00:21:07 WARN DataNucleus.General: Plugin (Bundle) "org.datanucleus" is
already registered. Ensure you dont have multiple JAR versions of the same plugin in
the classpath. The URL "file:/usr/lib/hive/lib/datanucleus-core-3.2.10.jar" is already
registered, and you are trying to register an identical plugin located at URL
"file:/usr/jars/datanucleus-core-3.2.10.jar."
16/08/16 00:22:10 WARN metastore.ObjectStore: Version information not found in
metastore. hive.metastore.schema.verification is not enabled so recording the
schema version 1.1.0
16/08/16 00:22:12 WARN metastore.ObjectStore: Failed to get database default,
returning NoSuchObjectException
SQL context available as sqlContext.

scala>

scala> import org.apache.hadoop.mapreduce.Job


import org.apache.hadoop.mapreduce.Job

scala> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat


import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

scala> import org.apache.avro.generic.GenericRecord


import org.apache.avro.generic.GenericRecord

scala> import parquet.hadoop.ParquetInputFormat

import parquet.hadoop.ParquetInputFormat

scala> import parquet.avro.AvroReadSupport


import parquet.avro.AvroReadSupport

scala> import org.apache.spark.rdd.RDD


import org.apache.spark.rdd.RDD

scala> def rddFromParquetHdfsFile(path: String): RDD[GenericRecord] = {


|

val job = new Job()

FileInputFormat.setInputPaths(job, path)

ParquetInputFormat.setReadSupportClass(job,

classOf[AvroReadSupport[GenericRecord]])

return sc.newAPIHadoopRDD(job.getConfiguration,

classOf[ParquetInputFormat[GenericRecord]],

classOf[Void],

classOf[GenericRecord]).map(x => x._2)

|}
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
rddFromParquetHdfsFile: (path:
String)org.apache.spark.rdd.RDD[org.apache.avro.generic.GenericRecord]

scala> val warehouse = "hdfs://quickstart/user/hive/warehouse/"


warehouse: String = hdfs://quickstart/user/hive/warehouse/

scala> val order_items = rddFromParquetHdfsFile(warehouse + "order_items");

order_items: org.apache.spark.rdd.RDD[org.apache.avro.generic.GenericRecord] =
MapPartitionsRDD[1] at map at <console>:41

scala> val products = rddFromParquetHdfsFile(warehouse + "products");


products: org.apache.spark.rdd.RDD[org.apache.avro.generic.GenericRecord] =
MapPartitionsRDD[3] at map at <console>:41

scala>
| val orders = order_items.map { x => (
|

x.get("order_item_product_id"),

(x.get("order_item_order_id"), x.get("order_item_quantity")))

| }.join(
| products.map { x => (
|

x.get("product_id"),

(x.get("product_name")))

| }
| ).map(x => (
|

scala.Int.unbox(x._2._1._1), // order_id

scala.Int.unbox(x._2._1._2), // quantity

x._2._2.toString // product_name

| )).groupByKey()
orders: org.apache.spark.rdd.RDD[(Int, Iterable[(Int, String)])] = ShuffledRDD[10] at
groupByKey at <console>:56

scala>

| val cooccurrences = orders.map(order =>


| (
|

order._1,

order._2.toList.combinations(2).map(order_pair =>

if (order_pair(0)._2 < order_pair(1)._2)

(order_pair(0)._2, order_pair(1)._2)

else

(order_pair(1)._2, order_pair(0)._2),

order_pair(0)._1 * order_pair(1)._1

|
|

)
)

| )
|)
cooccurrences: org.apache.spark.rdd.RDD[(Int, Iterator[((String, String), Int)])] =
MapPartitionsRDD[11] at map at <console>:44

scala> val combos = cooccurrences.flatMap(x => x._2).reduceByKey((a, b) => a +


b)
combos: org.apache.spark.rdd.RDD[((String, String), Int)] = ShuffledRDD[13] at
reduceByKey at <console>:45

scala> val mostCommon = combos.map(x => (x._2,


x._1)).sortByKey(false).take(10)
mostCommon: Array[(Int, (String, String))] = Array((67876,(Nike Men's Dri-FIT
Victory Golf Polo,Perfect Fitness Perfect Rip Deck)), (62924,(O'Brien Men's Neoprene
Life Vest,Perfect Fitness Perfect Rip Deck)), (54399,(Nike Men's Dri-FIT Victory Golf
Polo,O'Brien Men's Neoprene Life Vest)), (39656,(Nike Men's Free 5.0+ Running
Shoe,Perfect Fitness Perfect Rip Deck)), (39314,(Perfect Fitness Perfect Rip
Deck,Perfect Fitness Perfect Rip Deck)), (35092,(Perfect Fitness Perfect Rip

Deck,Under Armour Girls' Toddler Spine Surge Runni)), (33750,(Nike Men's Dri-FIT
Victory Golf Polo,Nike Men's Free 5.0+ Running Shoe)), (33406,(Nike Men's Free
5.0+ Running Shoe,O'Brien Men's Neoprene Life Vest)), (29835,(Nike Men's Dri-FIT
Victory Golf Polo,Nike Men's Dri-FIT Victory Golf Polo)), (29342,(Nike Men'...
scala>
| println(mostCommon.deep.mkString("\n"))
(67876,(Nike Men's Dri-FIT Victory Golf Polo,Perfect Fitness Perfect Rip Deck))
(62924,(O'Brien Men's Neoprene Life Vest,Perfect Fitness Perfect Rip Deck))
(54399,(Nike Men's Dri-FIT Victory Golf Polo,O'Brien Men's Neoprene Life Vest))
(39656,(Nike Men's Free 5.0+ Running Shoe,Perfect Fitness Perfect Rip Deck))
(39314,(Perfect Fitness Perfect Rip Deck,Perfect Fitness Perfect Rip Deck))
(35092,(Perfect Fitness Perfect Rip Deck,Under Armour Girls' Toddler Spine Surge
Runni))
(33750,(Nike Men's Dri-FIT Victory Golf Polo,Nike Men's Free 5.0+ Running Shoe))
(33406,(Nike Men's Free 5.0+ Running Shoe,O'Brien Men's Neoprene Life Vest))
(29835,(Nike Men's Dri-FIT Victory Golf Polo,Nike Men's Dri-FIT Victory Golf Polo))
(29342,(Nike Men's Dri-FIT Victory Golf Polo,Under Armour Girls' Toddler Spine Surge
Runni))

scala>
| exit
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
Aug 16, 2016 12:38:41 AM INFO: parquet.hadoop.ParquetInputFormat: Total input
paths to process : 1
Aug 16, 2016 12:38:41 AM INFO: parquet.hadoop.ParquetInputFormat: Total input
paths to process : 1
[cloudera@quickstart ~]$

CONCLUSION

If it weren't for Spark, doing co-occurrence analysis like this would be an extremely
arduous and time-consuming task. However, using Spark and a few lines of Scala,
you were able to produce a list of the items most frequently purchased together in
very little time.
----------------------------------------------------------------------------------------------------------------------------------------

Tutorial Exercise 4
Explore Log Events Interactively
To enable guided drill-down and exploration of your data, you can make it
searchable. By indexing your data using any of the indexing options provided by
Cloudera Search, it becomes searchable to a variety of audiences. You can choose
to batch index data using the MapReduce Indexing tool, or, as in our example
below, extend the Apache Flume configuration that is already ingesting the web log
data to also post events to Apache Solr for indexing in real time.

The web log data is a standard web server log, which may look something like this:
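
As a purely illustrative sketch (not the actual sample data), a line in the common
Apache combined log format has this shape:

<client-ip> - - [<timestamp>] "<method> <path> HTTP/1.1" <status> <bytes> "<referrer>" "<user-agent>"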

Solr organizes data similarly to the way a SQL database does. Each record is called
a 'document' and consists of fields defined by the schema: just like a row in a
database table. Instead of a table, Solr calls it a 'collection' of documents. The
difference is that data in Solr tends to be more loosely structured. Fields may be
optional, and instead of always matching exact values, you can also enter text
queries that partially match a field, just like you're searching for web pages. You'll
also see Hue refer to 'shards' - and that's just the way Solr breaks collections up to
spread them around the cluster so you can search all your data in parallel.

Here is how you can start real-time-indexing via Cloudera Search and Flume over
the sample web server log data and use the Search UI in Hue to explore it:

Create your search index

Ordinarily when you are deploying a new search schema, there are four steps:

Creating an empty configuration

Editing the schema

Uploading the configuration

Creating the collection
For the sake of this tutorial, you won't need to actually execute steps 1 or 2, as
we have included the configuration and the schema file in your cluster already. They
can be reviewed by exploring /opt/examples/flume/solr_configs.

If you were doing this on your own, you would generate the configs by executing
the following command:

[cloudera@quickstart ~]$ solrctl --zk quickstart:2181/solr instancedir --generate solr_configs

You don't need to do this for this tutorial. We have already generated the
configuration for you. This instruction is here in case you want to create your own
index.

The result of this command would be a skeleton configuration that you could then
customize to your liking. The primary thing that you would ordinarily be customizing
is the conf/schema.xml, which we cover in the next step.
Edit your schema

As mentioned previously, we have already generated the configuration files for you.
You can view the modified sample schema in conf/schema.xml under
/opt/examples/flume/solr_configs.

The most common area that you would be interested in is the <fields></fields>
section. From this area you can define the fields that are present and searchable in
your index.
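
As a hedged sketch of what entries in that section look like (the field names and
types below are hypothetical and are not the actual fields in the tutorial's
schema.xml):

<fields>
  <!-- each field the index should store and/or make searchable -->
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <field name="request" type="text_general" indexed="true" stored="true"/>
  <field name="response_code" type="int" indexed="true" stored="true"/>
</fields>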
Uploading your configuration

[cloudera@quickstart ~]$ cd /opt/examples/flume

[cloudera@quickstart ~]$ solrctl --zk quickstart:2181/solr instancedir --create live_logs ./solr_configs

Creating your collection

[cloudera@quickstart ~]$ solrctl --zk quickstart:2181/solr collection --create live_logs -s 1

You can verify that you successfully created your collection in Solr by going to Hue
and clicking Search in the top menu.

Then click on Indexes from the top right to see all of the indexes/collections.

Now you can see the collection that we just created, live_logs; click on it.

You are now viewing the fields that we defined in our schema.xml file.

Now that you have verified that your search collection/index was created
successfully, we can start putting data into it using Flume and Morphlines. Flume is
a tool for ingesting streams of data into your cluster from sources such as log files,
network streams, and more. Morphlines is a Java library for doing ETL on-the-fly,
and it's an excellent companion to Flume. It allows you to define a chain of tasks
like reading records, parsing and formatting individual fields, and deciding where to
send them, etc. We've defined a morphline that reads records from Flume, breaks
them into the fields we want to search on, and loads them into Solr (you can read
more about Morphlines in the Kite SDK documentation). This example morphline is
defined at /opt/examples/flume/conf/morphline.conf, and we're going to use it to
index our records in real-time as they're created and ingested by Flume.
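
The morphline.conf shipped with the cluster is the authoritative version; as a rough,
hedged sketch of what a morphline that parses web log lines and loads them into Solr
can look like (the command names come from the Kite Morphlines library, while the
grok pattern and the SOLR_LOCATOR values are assumptions):

SOLR_LOCATOR : {
  collection : live_logs
  zkHost : "quickstart:2181/solr"
}

morphlines : [
  {
    id : morphline1
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      # read each Flume event body as one log line
      { readLine { charset : UTF-8 } }
      # split the line into named fields (pattern is illustrative)
      {
        grok {
          dictionaryFiles : [grok-dictionaries]
          expressions : { message : """%{COMBINEDAPACHELOG}""" }
        }
      }
      # write the resulting record into the live_logs collection
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]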

Starting the Log Generator

Your Cloudera Live cluster has a log generator for use with sample data. Start the
log generator by running the following command:


[cloudera@quickstart ~]$ start_logs

You can verify that the log generator has started by running


[cloudera@quickstart ~]$ tail_logs

When you're done watching the logs, you can hit <Ctrl + C> to return to your
terminal.

Later, if you want to stop the log generator you can:


[cloudera@quickstart ~]$ stop_logs

Flume and the morphline

Now that we have an empty Solr index, and live log events coming in to our fake
access.log, we can use Flume and morphlines to load the index with the real-time
log data.

The key player in this tutorial is Flume. Flume is a system for collecting,
aggregating, and moving large amounts of log data from many different sources to
a centralized data store.

With a few simple configuration files, we can use Flume and a morphline (a simple
way to accomplish on-the-fly ETL) to load our data into our Solr index.

You can use Flume to load many other types of data stores; Solr is just the example
we are using for this tutorial.

You can review the flume.conf file and the morphline.conf that it uses.
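
As a hedged sketch (the real flume.conf under /opt/examples/flume/conf is the
authoritative version; the source type, file path, and capacity values below are
assumptions), a Flume agent that tails the generated access log and sends each
event through the morphline into Solr could be configured roughly like this:

# name the components of agent1
agent1.sources  = weblog-source
agent1.channels = mem-channel
agent1.sinks    = solr-sink

# tail the generated access log (path is an assumption)
agent1.sources.weblog-source.type = exec
agent1.sources.weblog-source.command = tail -F /opt/gen_logs/logs/access.log
agent1.sources.weblog-source.channels = mem-channel

# buffer events in memory between source and sink
agent1.channels.mem-channel.type = memory
agent1.channels.mem-channel.capacity = 10000

# index each event into Solr via the morphline
agent1.sinks.solr-sink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent1.sinks.solr-sink.morphlineFile = /opt/examples/flume/conf/morphline.conf
agent1.sinks.solr-sink.channel = mem-channel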

Start the Flume agent by executing the following command:


[cloudera@quickstart ~]$ flume-ng agent \
    --conf /opt/examples/flume/conf \
    --conf-file /opt/examples/flume/conf/flume.conf \
    --name agent1 \
    -Dflume.root.logger=DEBUG,INFO,console

This will start running the Flume agent in the foreground. Once it has started and is
processing records, you should see the agent's log output scrolling in your terminal.

Now you can go back to the Hue UI (refer back to your cluster's guidance page for
the link) and click 'Search' from the collection's page.

You will be able to search, drill down into, and browse the events that have been
indexed.
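
If you prefer the command line, you can also query the collection directly over
Solr's HTTP API. A hedged sketch, assuming Solr is listening on its default port
8983 on the quickstart host:

[cloudera@quickstart ~]$ curl 'http://quickstart:8983/solr/live_logs/select?q=*:*&wt=json&rows=5'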
