
Lab Guide

Big Data Advanced - Spark


Version 6.0
Copyright 2015 Talend Inc. All rights reserved.
Information in this document is subject to change without notice. The software described in this document is furnished under a license agreement or nondisclosure agreement. The software may be used or copied only in accordance with the terms of those agreements. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic or mechanical, including photocopying and recording, for any purpose other than the purchaser's personal use, without the written permission of Talend Inc.
Talend Inc.
800 Bridge Parkway, Suite 200
Redwood City, CA 94065
United States
+1 (650) 539 3200
Welcome to Talend Training
Congratulations on choosing a Talend training module. Take a minute to review the following points to help you get the most from
your experience.

Technical Difficulty

Instructor-Led
If you are following an instructor-led training (ILT) module, there will be periods for questions at regular intervals. However, if you need an answer in order to proceed with a particular lab, or if you encounter a situation with the software that prevents you from proceeding, don't hesitate to ask the instructor for assistance so it can be resolved quickly.

Self-Paced
If you are following a self-paced, on-demand training (ODT) module, and you need an answer in order to proceed with a particular
lab, or you encounter a situation with the software that prevents you from proceeding with the training module, a Talend professional
consultant can provide assistance. Double-click the Live Expert icon on your desktop to go to the Talend Live Support login page
(you will find your login and password in your ODT confirmation email). The consultant will be able to see your screen and chat with
you to determine your issue and help you on your way. Please be considerate of other students and only use this assistance if you are
having difficulty with the training experience, not for general questions.

Exploring
Remember that you are interacting with an actual copy of the Talend software, not a simulation. Because of this, you may be tempted
to perform tasks beyond the scope of the training module. Be aware that doing so can quickly derail your learning experience, leaving
your project in a state that is not readily usable within the tutorial, or consuming your limited lab time before you have a chance to finish. For the best experience, stick to the tutorial steps! If you want to explore, feel free to do so with any time remaining after you've finished the tutorial (but note that you cannot receive consultant assistance during such exploration).

Additional Resources
After completing this module, you may want to refer to the following additional resources to further clarify your understanding and
refine and build upon the skills you have acquired:
Talend product documentation (help.talend.com)
Talend Forum (talendforge.org/)
Documentation for the underlying technologies that Talend uses (such as Apache) and third-party applications that complement Talend products (such as MySQL Workbench)



CONTENTS
LESSON 1 Introduction to Kafka
Introduction to Kafka 10
Overview 10
Kafka Overview 11
Overview 11
Publishing Messages to a Kafka Topic 12
Overview 12
Create Kafka topic 12
Generate Data 12
Serialize data 13
Convert Data 15
Publish message 16
Consuming messages 18
Overview 18
Consume messages 18
Extract and display data 18
Wrap-Up 21
Recap 21

LESSON 2 Introduction to Spark


Introduction to Spark 24
Overview 24
Spark Overview 25
Overview 25
Customer Data Analysis 27
Overview 27
Copy data to HDFS 27
Connect to HDFS 28
Read customers data 30
Extract data of interest 31
Aggregate and sort data 32
Save Results to HDFS 33
Run Job and check results in Hue 33
Producing and Consuming messages in Real-Time 36
Overview 36
Publish messages to a Kafka topic 36
Configure execution on Spark 37
Consume messages 38
Configure execution on Spark 39
Run Jobs 40
Wrap-Up 43
Recap 43

LESSON 3 Logs Processing Use Case - Generating Enriched Logs


Generate Enriched Logs 46
Overview 46
Logs Processing Use Case 47
Overview 47
Generate Enriched Logs 47
Monitoring 47
Reporting 47
Batch Analysis 48
Generate Raw Logs 49
Overview 49
Create context variables 49
Generate Logs 50
Publish to a Kafka topic 52
Configure Job execution 54
Generate Enriched Logs 56
Overview 56
Create customers database 56
Consume raw logs 57
Combine raw logs with customers data 58
Run Jobs and check results 61
Publish enriched logs 63
Wrap-Up 64
Recap 64

LESSON 4 Logs Processing Use Case - Monitoring


Monitoring Logs 66
Overview 66
Monitoring Enriched Logs 67
Overview 67
Consume enriched logs 67
Processing logs 67
Save logs in Elasticsearch 68
Start services 69
Run Job and check results 71
Troubleshooting 76
Wrap-Up 77



Recap 77
Further Reading 77

LESSON 5 Logs Processing Use Case - Reporting


Reporting users information 80
Overview 80
Reporting users information 81
Overview 81
Consuming enriched Logs 81
Process URL 82
Filter users 82
Generate report 83
Run Job and check results 84
Wrap-Up 87
Recap 87

LESSON 6 Logs Processing Use Case - Batch Analysis


Logs analysis 90
Overview 90
Stream Ingestion 91
Overview 91
Create HBase table 91
Create ingestion Job 92
Configure Spark 93
Run your Job and check results in Hue 93
Logs Batch Analysis 95
Overview 95
Analyze Logs for a specific country 95
Compute top 5 downloaded products 97
Run Job and check results 99
Wrap-Up 101
Recap 101



LESSON 1
Introduction to Kafka
This chapter discusses the following.

Introduction to Kafka 10
Kafka Overview 11
Publishing Messages to a Kafka Topic 12
Consuming messages 18
Wrap-Up 21
Introduction to Kafka

Overview
During this training, you will be assigned a preconfigured Hadoop cluster built with the Cloudera CDH 5.4 distribution. The purpose of this cluster is to explore the different functionalities, not to serve as a production cluster, so it runs in pseudo-distributed mode, meaning that there is only one node. This is enough to understand the different concepts in this training.
On top of the cluster services that you have used so far, Kafka has been installed. You will use it to publish and consume messages in a simple Data Integration Job to understand the basic concepts of Kafka.
Later in the course, you will use Kafka for real-time processing in conjunction with the Spark Streaming framework.

Objectives
After completing this lesson, you will be able to:
Create a new topic in Kafka
Publish a message to a specific topic
Consume messages in a specific topic

Before you begin


Be sure that you are working in an environment that contains the following:
A properly installed copy of Talend Studio
A properly configured Hadoop cluster
The supporting files for this lesson
Everything has already been set up in your training environment.
After a short introduction to Kafka, the first step is to build a Job to create a new Kafka topic and to publish messages to this topic.



Kafka Overview

Overview
Kafka was created by LinkedIn and is now an Apache project.
Kafka is a distributed publish-subscribe messaging system that is designed to be fast, scalable and durable.
First, you will review some basic messaging terminology:
Kafka maintains feeds of messages in categories called topics.
Producers write messages to topics.
Consumers read messages from topics.
Topics are partitioned and replicated across multiple nodes.
At a high-level, the following diagram represents the architecture of a Kafka messaging system:

Kafka is run as a cluster of one or more servers. Each server is called a broker.
Messages are byte arrays and can be used to store any object in any format. String, JSON, and Avro are the most commonly used formats.
A topic is a category to which messages are published. For each topic, the Kafka cluster maintains a partitioned log. Each partition is an ordered, immutable sequence of messages, and each message in a partition is assigned a sequential id number called the offset, which uniquely identifies it within the partition.
The Kafka cluster retains all published messages, consumed or not, for a certain amount of time.
When consuming from a topic, it is possible to configure a consumer group with multiple consumers. Each consumer reads messages from the specific topics it subscribed to.
Kafka does not attempt to track which messages were read by each consumer and retain only the unread ones. Instead, Kafka retains all messages for a configurable amount of time. When this amount of time has elapsed, the messages are deleted to free up space.
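For readers who want to see what the Studio components do under the hood, here is a minimal sketch of a producer written directly against the Kafka Java client API. This is not a lab step: the broker address matches the training cluster used later in this guide, and the topic and message values are only examples.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Broker address: assumption matching the training cluster used in this guide
        props.put("bootstrap.servers", "ClusterCDH54:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        // The message is appended to one partition of the topic and receives a sequential offset
        producer.send(new ProducerRecord<>("mytopic", "42", "hello from a producer"));
        producer.close();
    }
}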
Kafka is typically used for the following use cases:
Messaging
Website Activity Tracking
Log Aggregation
Stream Processing
You will now create a Job to publish messages to a Kafka topic.

Publishing Messages to a Kafka Topic

Overview
In this lab, you will build a simple Data Integration Job to publish messages to a Kafka topic.
You will first create a Kafka topic, then generate random data and process it into the right format to publish it to your Kafka topic.

Create Kafka topic


1. Start your Studio and open the BDAdvanced_SPARK Project.
2. Create a new standard Job and name it PublishMessages.
3. Add a tKafkaCreateTopic component and open the Component view.
4. In the Zookeeper quorum list box, enter "ClusterCDH54:2181".
5. In the Action on topic list, select Create topic if not exists.
This will allow you to run the Job multiple times without error messages.
6. In the Topic name box, enter "mytopic".
7. In the Replication factor and Number of partitions boxes, keep the default value "1".
8. Your configuration should be as follows:

Generate Data
You will now generate random data using the tRowGenerator component.
1. Below tKafkaCreateTopic, add a tRowGenerator component and connect it with an OnSubjobOk trigger.
2. Double-click tRowGenerator to open the RowGenerator editor.
3. Click the green plus sign to add 3 new columns to the schema.
4. Name the columns "Id", "FirstName" and "LastName".
5. Set the Id column type to Integer and in the Functions list, select Numeric.sequence(String,int,int).
6. In the FirstName column Functions list, select TalendDataGenerator.getFirstName().
7. In the LastName column Functions list, select TalendDataGenerator.getLastName().



8. Your configuration should be as follows:

9. Click OK to save the configuration.


Now that you have generated data, you can process them.

Serialize data
To be published, your message must be serialized and converted to a byte format.
You will serialize your random data using a tWriteXMLField component.
1. At the right side of tRowGenerator, add a tWriteXMLField component and connect it with the Main row.
2. Double-click tWriteXMLField to open the XML Tree editor.
3. In the Link Target table, in the XML Tree column, click rootTag.
4. Click the green plus sign:

5. Select Create as sub-element and click OK:

6. In the Input the new element's valid label box, enter "customer" and click OK:

7. In the Linker source table, in the Schema List column, select Id, FirstName and LastName and drag them onto customer:

8. Select Create as sub-element of target node and click OK:



9. Right-click customer and click Set As Loop Element:

10. Your configuration should be as follows:

11. Click OK to save your configuration.


12. In the Component view, click (...) to edit the schema.
13. Below the Output table, click the green plus sign to add an Output column.
14. Name the new column "SerializedData".
15. Click OK to save the schema.

Convert Data
Now that your data have been serialized, you can convert them to bytes.
1. At the right side of tWriteXMLField, add a tJavaRow component and connect it with the Main row.
2. Double-click tJavaRow to open the Component view.
3. Click (...) to open the schema.

4. Click SerializedData in the Input table and click the yellow right arrow to transfer it to the Output table:

5. In the Output table, set the column type to byte[].


6. Click OK to save the schema.
7. In the Code box, enter:
// Convert the serialized XML String to a byte array, as required by tKafkaOutput
output_row.SerializedData = input_row.SerializedData.getBytes();

This will convert your serialized data to byte format.


Now your message is ready for publishing.

Publish message
You will now publish your message to the Kafka topic you created earlier. To publish, you will use a tKafkaOutput component.
1. At the right side of tJavaRow, add a tKafkaOutput component and connect it with the Main row.
2. Open the Component view.
3. In the Broker list box, enter "ClusterCDH54:9092".
4. In the Topic name box, enter "mytopic", which is the Kafka topic you created previously using the tKafkaCreateTopic component.
5. Your configuration should be as follows:



6. Run your Job and check the output in the Console. You should have an exit code equal to 0:

A successful execution means that you have generated 100 rows of random data, converted them to serialized byte-format messages, and published them to your Kafka topic.
You will now create another Job to consume the messages published on mytopic.

Consuming messages

Overview
You created a first Job to publish messages to a Kafka topic. You will now create a Job that will consume the messages from the same Kafka topic and display them in the Console.

Consume messages
1. Create a new Standard Job and name it ConsumeMessages.
2. Add a tKafkaInput component and open the Component view.
3. In the Zookeeper quorum list box, enter "ClusterCDH54:2181".
4. In the Starting offset list, select From beginning.
5. In the Topic name box, enter "mytopic".
6. Leave the default value in the Consumer group id box.
7. Your configuration should be as follows:

Extract and display data


The tKafkaInput component will consume the messages in your Kafka topic. But before being able to display them in the Console,
you need to extract the XML fields.
1. At the right side of tKafkaInput, add a tExtractXMLField component and connect it with the Main row.
2. Double-click tExtractXMLField to open the Component view.
3. Click Sync Columns.
4. In the Loop XPath query box, enter "/rootTag/customer".
5. In the XPath query box, enter "/rootTag/customer".
6. Click the checkbox in the Get Nodes column.



7. Your configuration should be as follows:

8. Add a tLogRow component and connect it with the Main row.


9. Run your Job and inspect the result in the Console:

Note: There is no exit code because the Job is still running.


10. Go back to the PublishMessages Job and run it again several times to generate messages.
11. In the ConsumeMessages Job Console, you might not be able to see the messages, but if you look at the Job in the
Designer view, you should see that the number of processed rows has increased:

The ConsumeMessages Job will wait for incoming messages in your Kafka topic.
12. Stop the ConsumeMessages Job execution using the Kill button in the Run view.
You have finished this introduction to Kafka. So now it's time to recap what you've learned.



Wrap-Up

Recap
In this lesson you covered the key base knowledge required to publish and consume messages in a Kafka topic.
You first created a Kafka topic using the tKafkaCreateTopic component. Then you properly formatted your data and published it to a topic with the tKafkaOutput component. Finally, you consumed the messages with the tKafkaInput component.
Later in the course, you will publish and consume messages over various topics in real time in Big Data Streaming Jobs, using the
Spark framework.
The next lab will give you a brief introduction to Spark.

LESSON 2
Introduction to Spark
This chapter discusses the following.

Introduction to Spark 24
Spark Overview 25
Customer Data Analysis 27
Producing and Consuming messages in Real-Time 36
Wrap-Up 43
Introduction to Spark

Overview
This lab is an introduction to the Spark framework. Spark can be used as an alternative to the MapReduce framework for Big Data Batch Jobs. It can be used for Big Data Streaming Jobs as well.
In this lab, you will cover how to create Big Data Batch and Big Data Streaming Jobs. The Jobs will be configured to run on your
cluster using the Spark framework.
You will start with a description of the Spark framework. Then you will cover an example of a Big Data Batch Job running on Spark.
And you will finish with Big Data Streaming Jobs that will publish and consume messages in a Kafka topic.

Objectives
After completing this lesson, you will be able to:
Create a Big Data Batch Job
Create a Big Data Streaming Job
Configure your Jobs to use the Spark framework

Before you begin


Be sure that you are working in an environment that contains the following:
A properly installed copy of Talend Studio
A properly configured Hadoop cluster
The supporting files for this lesson
Everything has already been set up in your training environment.
Before creating Jobs to use Spark, you will be briefly introduced to Spark.



Spark Overview

Overview
Apache Spark is a fast, general-purpose engine for large-scale data processing, similar to Hadoop MapReduce, but it has some useful differences that make it superior for certain use cases such as machine learning.
Spark enables in-memory distributed datasets that optimize iterative workloads in addition to interactive queries. Caching datasets in memory reduces the latency of access: Spark can be up to 100 times faster than MapReduce in memory, and 10 times faster on disk.
Spark introduces an abstraction called resilient distributed datasets (RDDs). An RDD is a read-only collection of objects distributed
across a set of nodes.
RDDs are resilient because they can be rebuilt if a portion of the dataset is lost, thanks to a fault-tolerance mechanism.
Applications in Spark are called drivers, and these drivers implement the operations performed either on a single node or across a set of nodes. Spark offers APIs in Java, Scala, and Python. It also provides high-level tools such as Spark SQL for structured data processing, MLlib for machine learning, Spark Streaming for stream processing, and GraphX for graph analysis.
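To make the RDD model more concrete, here is a minimal sketch using the Spark Java API. It is illustrative only, not part of the lab; the local master and file path are assumptions for the example.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddSketch {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("RddSketch").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // textFile and filter are lazy transformations; nothing is read until the count action runs
        JavaRDD<String> lines = sc.textFile("/tmp/CustomersData.csv");   // path is an example only
        long matches = lines.filter(line -> line.contains("California")).count();
        System.out.println("Lines mentioning California: " + matches);
        sc.stop();
    }
}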

The Spark framework can be deployed on Apache Hadoop via YARN or on its own cluster (standalone mode). Spark does not require HDFS: it can run on top of HDFS, HBase, Cassandra, Amazon S3, or any Hadoop data source.

You will now create a Big Data Batch Job using the Spark framework.



Customer Data Analysis

Overview
In this lab, you will reuse data introduced in the Big Data Basics course, where you analyzed customer data using Pig components in Standard Jobs and then performed the same analysis with a Big Data Batch Job using the MapReduce framework.
You will now perform the same task in a Big Data Batch Job using the Spark framework:

Copy data to HDFS


The first step is to use an existing Job to copy the CustomersData.csv file to HDFS. Next, you will create a new Big Data Batch Job that will use the Spark framework.
1. In the C:\StudentFiles folder, you will find a JobDesigns_spark.zip archive file.
2. Import the CustomersData generic metadata and the Job named PutCustomersData:

This Job is composed of a single tHDFSPut component. This component copies the CustomersData.csv file from your local file system to HDFS under /user/student/BDBasics.
3. Modify the output folder so that the file is written under /user/student/BDAdvanced/CustomersData.csv.
4. Run PutCustomersData.
5. Connect to Hue, navigating to "ClusterCDH54:8888" and using student/training as username/password.

6. Using the File Browser, open CustomersData.csv under /user/student/BDAdvanced:

The data are composed of different information about customers: Id, first name, last name, city, state, product category,
gender and purchase date.
You will now create a Job to process the data.

Connect to HDFS
When you use Spark, you can use different storage options. You do not rely on HDFS as you do with the MapReduce framework, so you need a specific component to provide your HDFS configuration information.
1. In the Repository, right-click Job Designs, then click Create Big Data Batch Job:

2. In the Name box, enter "CustomersDataAnalysis".


3. In the Framework list, select Spark.



4. Add a Purpose and a Description.
Your configuration should be as follows:

5. Click Finish to create the new Job.


The Job will appear in the Repository and will open in the Designer view.
6. Add a tHDFSConfiguration component and open the Component view.
7. In the Property Type list, select Repository and navigate to use your HDFSConnection metadata:

8. Your configuration should be as follows:

Now, your Job will be able to connect to HDFS.

Read customers data


The CustomersData.csv file is stored in HDFS and is a delimited file. As for Data Integration Jobs, you will use a tFileInputDelimited
component to read the file content.
1. Below tHDFSConfiguration, add a tFileInputDelimited component.
2. Double-click tFileInputDelimited to open the Component view.
3. Your tHDFSConfiguration component has been selected as the storage configuration component. This means that the
information in this component will be used to connect to HDFS.



4. In the Schema list, select Repository and navigate to use the CustomersData generic schema metadata:

5. Using the (...) button, select the CustomersData.csv file under /user/student/BDAdvanced and click OK.
If you can't get connected to HDFS, check your tHDFSConfiguration component.
6. Your configuration should be as follows:

You will now process your data. First, you will filter the data to extract the records of interest, and then you will perform aggregation and sorting to get useful information.

Extract data of interest


The goal here is to filter the customers living in California, then get the corresponding product category and gender.
1. At the right side of tFileInputDelimited, add a tMap component and connect it with the Main row.
2. Double-click tMap to open the mapping editor.
3. Create a new output and name it "out".
4. In the row1 table, select ProductCategory and Gender and drag them to the out table.
5. In the upper right corner of the out table, click the filter icon.

6. In the Filter box, enter:
row1.State.equals("California")
7. Your configuration should be as follows:

8. Click OK to save your configuration.

Aggregate and sort data


The next step is to aggregate the results and sort them to get the count of product category per gender.
1. At the right side of tMap, add a tAggregateRow component and connect it with the out row.
2. Open the Component view.
3. Click (...) to edit the schema.
4. Click the double yellow arrow to transfer all the columns in the Input table to the Output table.
5. Below the Output table, click the green plus sign to add a new column.
6. Name the new column Count and set its type to Integer.
7. Click OK to save the schema.
8. Add ProductCategory and Gender columns to the Group by table.
9. Add the Count column to the Operations table.
10. In the Function list, select count, and in the Input column position, select ProductCategory.
Your configuration should be as follows:

11. At the right side of tAggregateRow, add a tSortRow component and connect it with the Main row.
12. Double-click tSortRow to open the Component view.



13. In the Criteria table, add the ProductCategory column and configure it for alphabetical sorting in ascending order:
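For reference, the pipeline you just assembled (filter in tMap, count in tAggregateRow, sort in tSortRow) corresponds to a handful of RDD transformations. The sketch below is not a lab step; the ";" delimiter, the field positions (based on the schema description: Id, first name, last name, city, state, product category, gender, purchase date) and the local paths are assumptions for illustration.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class CustomersAnalysisSketch {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("CustomersAnalysisSketch").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Assumed field positions: 0=Id, 1=FirstName, 2=LastName, 3=City, 4=State, 5=ProductCategory, 6=Gender, 7=PurchaseDate
        JavaPairRDD<String, Integer> counts = sc
                .textFile("/tmp/CustomersData.csv")                  // path and ";" delimiter are assumptions
                .map(line -> line.split(";"))
                .filter(fields -> "California".equals(fields[4]))    // tMap filter
                .mapToPair(fields -> new Tuple2<>(fields[5] + ";" + fields[6], 1))
                .reduceByKey((a, b) -> a + b)                        // tAggregateRow: count per ProductCategory/Gender
                .sortByKey();                                        // tSortRow: ascending order

        counts.saveAsTextFile("/tmp/CustomersDataOutSketch");        // tFileOutputDelimited equivalent
        sc.stop();
    }
}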

Save Results to HDFS


The last component will save the analysis results to HDFS.
1. At the right side of tSortRow, add a tFileOutputDelimited component and connect it with the Main row.
2. Double-click tFileOutputDelimited to open the Component view.
3. In the Folder box, enter
"/user/student/BDAdvanced/CustomersDataOut".
4. In the Action list, select Overwrite.
5. In the Merge configuration area, click the Merge result to single file option.
6. In the Merge File Path box, enter
"/user/student/BDAdvanced/CustomersDataOut/AnalysisResults.csv".
7. Your configuration should be as follows:

Run Job and check results in Hue


Now, you will configure your Job to execute on Spark, then you will run it and check the results in Hue.

1. Click the Run view and then the Spark Configuration tab:

2. In the Spark Mode list, you can choose the execution mode of your Job. There are 3 options: Local, Standalone and
YARN client.
Your cluster has been installed and configured for Spark to run in Standalone mode.
In the Spark Mode list, select Standalone.
3. Check that the Distribution and Version correspond to your Cloudera CDH 5.4 cluster.
4. In the Spark Host box, enter "spark://ClusterCDH54:7077" (quotes included).
5. In the Spark Home box, enter "/user/spark/share/lib".
6. Go back to the Basic Run tab and click Run.
7. At the end of the execution, you should have an exit code equal to 0 in the Console, and in the Designer you should see 100% labels on top of your rows:



8. In Hue, using the File Browser, navigate to "/user/student/BDAdvanced/CustomersDataOut" and click AnalysisResults.csv:

You have covered how to create a Big Data Batch Job using the Spark framework. It's time to move to the next topic, which will introduce you to Big Data Streaming Jobs.

Producing and Consuming messages in Real-Time

Overview
In this lab, you will build two Big Data Streaming Jobs. The first Job will publish messages to a Kafka topic and the second Job will consume those messages.

Publish messages to a Kafka topic


As you did in the "Introduction to Kafka" lab, you will publish messages to a Kafka topic. In a Data Integration Job, when the message you created is published, the Job ends with an exit code equal to 0. In a Big Data Streaming Job, the Job will run until you kill it.
In this Job, you will continuously create messages and publish them to your Kafka topic.

1. In the Repository, right-click Big Data Streaming, and click Create Big Data Streaming Job:

2. In the Name box, enter "PublishMessagesStreaming".


3. In the Framework list, select Spark Streaming:

4. Add a Purpose and a Description, then click Finish to create your Job.
5. In the Designer view, add a tRowGenerator component and open the RowGenerator editor.



6. Configure it as in the "Introduction to Kafka" lab:

7. Save your configuration.


8. At the right side of tRowGenerator, add a tWriteDelimitedFields component and connect it with the Main row.
9. Double-click to open the Component view.
10. Check that the Output type is set to byte[]:

The tWriteDelimitedFields component will serialize your data and convert it to a byte array, as required to publish to a Kafka topic.
11. At the right side of tWriteDelimitedFields, add a tKafkaOutput component and connect it with the Main row.
12. Double-click tKafkaOutput to open the Component view.
13. In the Broker list box, enter "ClusterCDH54:9092".
14. In the Topic name box, enter "mystreamingtopic":

Configure execution on Spark


Now, you will configure your Job to execute on Spark in Standalone mode.
As explained before, Spark Streaming execution is, in fact, a series of micro-batches. You will also have to set the Batch size to suit your needs.

The goal of this lab is to publish and consume messages in real-time. To achieve this, the Job to publish the messages and the Job to
consume them will run simultaneously.
If you run your Job as it is configured, the default Spark configuration on the cluster will assign all the available cores to the current Job. That means that if you start another Job on Spark, there will be no cores available for that new Job to run. To avoid this, you must limit the number of cores requested by your Job.
To summarize, compared to a Spark Big Data Batch Job, you will have to set the Batch size and limit the number of cores requested
by your Job.
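Behind the settings described above, a Spark Streaming driver is being configured. A minimal hand-written equivalent might look like the sketch below (not a lab step; the master URL mirrors the training cluster, and the socket source is just a placeholder where the lab uses Kafka).

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class MicroBatchSketch {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf()
                .setAppName("MicroBatchSketch")
                .setMaster("spark://ClusterCDH54:7077")   // standalone master, as used in this lab
                .set("spark.cores.max", "4");             // cap the cores so another Job can run at the same time

        // The Batch size set in the Studio corresponds to this micro-batch interval
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.milliseconds(100));

        // Placeholder source and output; in the lab the source is a Kafka topic instead
        jssc.socketTextStream("localhost", 9999).print();

        jssc.start();
        jssc.awaitTermination();
    }
}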
1. In the Run view, go to the Spark Configuration tab.
2. In the Spark Mode list, select Standalone.
3. In the Distribution list, select Cloudera.
4. In the Spark Host box, enter "spark://ClusterCDH54:7077" (quotes included).
5. In the Spark Home box, enter "/user/spark/share/lib".
6. In the Batch size box, enter "100".
This will set the batch size to 100 milliseconds.
7. Click the green plus sign below the Advanced properties table.
8. In the Property column, enter "spark.cores.max" and in the Value column, enter "4".
This will limit the number of cores requested to 4.
9. Your Spark configuration should be as follows:

10. Save your Job.


Your Job is ready to run, but you will run it later, when the Job to consume the messages is ready.

Consume messages
Now you will create the Job to consume the messages in the mystreamingtopic Kafka topic.



1. Create a new Big Data Streaming Job that will use the Spark Streaming framework and name it "ConsumeMessagesStreaming".
2. Add a tKafkaInput component and open the Component view.
3. In the Broker list box, enter "ClusterCDH54:9092".
4. In the Starting offset list, select From beginning.
This allows you to consume the messages in the topic from the beginning, and not only the newly published ones.
5. In the Topic name box, enter "mystreamingtopic".
6. Your configuration should be as follows:

7. At the right side of tKafkaInput, add a tExtractDelimitedFields component and connect it with the Main row.
8. Open the Component view and click Sync columns.
9. The messages consumed will be displayed in the Console.
At the right side of tExtractDelimitedFields, add a tLogRow component and connect it with the Main row.
You can now continue with the Spark configuration.

Configure execution on Spark


The configuration is very similar to the configuration done for the publishing Job. The only difference will be the Batch size. The messages will be published every 100 ms and they will be consumed every 3 seconds.
1. In the Run view, go to the Spark Configuration tab.
2. In the Spark Mode list, select Standalone.
3. In the Distribution list, select Cloudera.
4. In the Spark Host box, enter "spark://ClusterCDH54:7077" (quotes included).
5. In the Spark Home box, enter "/user/spark/share/lib" (quotes included).
6. In the Batch size box, enter "3000".
This will set the batch size to 3 seconds.
7. Click the green plus sign below the Advanced properties table.
8. In the Property column, enter "spark.cores.max" and in the Value column, enter "4".
This will limit the number of cores requested to 4.

9. Your Spark configuration should be as follows:

10. Save your Job.


The publishing and the consuming Job are now ready to run.

Run Jobs
You will first run the publishing Job, then, you will run the consuming Job.
1. Open the PublishMessagesStreaming Job:

In the upper left corner, you will see the batch size you set in the Spark configuration tab. The Job will be executed every 100
ms.
2. Run your Job.



3. Once the Job has started to execute, you will see the statistics changing in the lower right corner:

That means that messages are being published to mystreamingtopic. The Job will run until you press the Kill button.
4. Open the ConsumeMessagesStreaming Job:

The Job will execute every 3 seconds.


5. Run the Job and observe the statistics in the lower right corner of your Job. They will start to increase as the Job is executed
on the cluster:

6. Observe the result in the Console:

Each time the Job is executed, you will see new messages appear.
To test the real-time aspect of your processing, you can stop your PublishMessagesStreaming Job. Once all the messages are consumed and displayed in the Console of the ConsumeMessagesStreaming Job, you won't see new messages appear.
Start the PublishMessagesStreaming Job again to publish new messages and observe the Console in the ConsumeMessagesStreaming Job.
Once you have finished, kill the execution of both Jobs to free the resources on your cluster.
You have now covered the introduction to Spark lab. It's time to recap what you have learned.



Wrap-Up

Recap
In this lesson you covered the key base knowledge required to create Big Data Batch and Big Data Streaming Jobs, using the Spark
Framework.
Spark configuration requires setting the Spark host and Spark home values properly. On top of that, for streaming Jobs, the Batch size is necessary. If you need to run multiple streaming Jobs simultaneously, you also need to limit the number of cores requested by each Job.
Another important point is that Spark is not tied to HDFS or HBase. It can use different storage systems, so you have to include a tHDFSConfiguration component, or an equivalent component, to specify where Spark should read and write files.
Now that you have covered introductions to Kafka and Spark, you can start the Logs Processing use case.

LESSON 3
Logs Processing Use Case -
Generating Enriched Logs
This chapter discusses the following.

Generate Enriched Logs 46


Logs Processing Use Case 47
Generate Raw Logs 49
Generate Enriched Logs 56
Wrap-Up 64
Generate Enriched Logs

Overview
In this chapter, you will start with an overview of the Logs Processing use case.
Next, you will create 2 different Jobs. The first Job will generate raw logs of users connected to the Talend website.

And the second Job will enrich the logs with information stored in a MySQL database.

The raw logs and enriched logs will be published to two different Kafka topics.

Objectives
After completing this lesson, you will be able to:
Connect to a MySQL Database in a Big Data Streaming Job.
Join a stream of logs with data saved in a MySQL database in Real-Time.
Publish logs to Kafka topics.

Before you begin


Be sure that you are working in an environment that contains the following:
A properly installed copy of Talend Studio
A properly configured Hadoop cluster
The supporting files for this lesson
Everything has already been set up in your training environment.

You can now start with the use case overview.



Logs Processing Use Case

Overview
In this use case, you will simulate users connecting to the Talend website. The users' activity will be tracked with logs.
By processing the logs, you will be able to retrieve useful information about the users, such as the most downloaded product or how many users visited the services web page during the last 15 minutes.
The use case has been split into 4 chapters:
Generate Enriched Logs
Monitoring
Reporting
Batch Analysis

Generate Enriched Logs


This lab is required to be able to perform the tasks in the following chapters.
In this lab, you will first simulate raw logs coming from three different servers. The raw logs are composed of a user ID and the URL of
the visited web page.
The raw logs will be enriched with data coming from the user database. At the end of the lab, you will have information about the
users such as the first name, last name, email, address and phone number of each user connecting to the website.
You need to complete this chapter to be able to execute the next steps of the use case.

Monitoring
Using Elasticsearch and Kibana, you will create a dashboard to monitor the users activity on the website.
The logs will be processed before saving them to Elasticsearch with different indices so that you can monitor which users connected
from France, from the USA or from other countries.

Reporting
Using a time windowing component, you will accumulate logs for a certain amount of time. Then, the accumulated logs will be processed to generate reports.
You will generate reports summarizing which products were downloaded and which users visited the services web pages.

Batch Analysis
In the previous labs, the logs are processed to generate dashboards or reports; the logs are not saved as they arrive.
In this lab, the enriched logs will be saved in real time in HBase. Next, a Batch Job will be run to extract statistical information from the logs.
You will compute the top 5 downloaded products for a particular country. The country name will be specified as a context variable prompted for when the Job starts.

You will now start the use case with the Job to generate the raw logs.



Generate Raw Logs

Overview
In this lab, you will use the knowledge acquired in the previous chapters. You will create a Job which will generate fake logs.
You will simulate logs coming from different web servers and representing connections to the Talend website. In each log, you will get a user ID and the corresponding URL.

Create context variables


In the use case, you will need some values quite often, such as the Broker list for Kafka components, the Spark home and the Spark
host for the Spark configuration. Instead of writing them again and again, you will create context variables.
1. In the Repository, right-click Contexts, and click Create context group:

2. In the Name box, enter "Spark" and click Next.


3. Click the green plus sign twice to add 2 context variables to the table.
4. The first context variable is named "Spark_Home" and its value is "/user/spark/share/lib".
5. The second context variable is named "Spark_Host" and its value is "spark://ClusterCDH54:7077".

6. Your configuration should be as follows:

7. Click Finish.
8. Create another context group and name it Kafka.
9. Add a context variable named "Broker_list" and set its value to "ClusterCDH54:9092".
10. Click Finish to save your context variable.
11. The 2 context groups should appear in the Repository under Contexts:

The different context variables that you just created will be used quite often in the different Jobs you will create.

Generate Logs
You will use a tRowGenerator component to simulate the logs.
1. Create a new Big Data Streaming Job, running on Spark, and name it GenerateRawLogs.
2. Add a tRowGenerator component and open the RowGenerator editor.
3. Add 2 new columns in the schema and name them user_id and url.
4. Set the user_id type to Integer.
5. The user ID will be a random integer between 10000 and 100000.
In the Functions list, select Numeric.random(int,int) and set the minimum value to 10000 and the maximum value to
100000.
6. The url type remains String.



7. In the Functions list, select ...
8. In the Function parameters tab, click (...) in the Value column.
The Expression Builder will open.
9. In the Expression box, enter:
"/why-talend", "/download/products/big-data", "/download/products/integration-cloud",
"/download/products/data-integration",
"/download/products/application-integration", "/download/products/mdm",
"/download/products/talend-open-studio",
"/services/technical-support",
"/services/training", "/services/consulting", "/customers", "/resources", "/about-us",
"/blog", "/ecosystem/partners",
"/partners/find-a-partner", "/contact"
Note: You can copy and paste the expression from the LabCodeToCopy file located in the C:\StudentFiles folder.
10. Click OK to save the expression.
When the Job is executed, the URL will be randomly chosen from the above list.
11. In the Number of Rows for RowGenerator box, enter "10" (without quotes).
12. Your configuration should be as follows:

13. Click the Preview tab and click Preview. You should have a result similar to this:

14. Click OK to save the configuration.

15. Copy the tRowGenerator component and paste it twice as follows:

Publish to a Kafka topic


You will now collect the logs coming from the 3 tRowGenerator components and format them to be published on the Kafka topic.
1. At the right side of tRowGenerator_2, add a tUnite component and connect the 3 tRowGenerator components to tUnite
with the Main row:

2. Double-click tUnite to open the Component view and click Sync columns.
3. At the right side of tUnite, add a tWriteDelimitedFields component and connect it with the Main row.



4. At the right side of tWriteDelimitedFields, add a tKafkaOutput component and connect it with the Main row:

5. Open the Contexts view, click the Select Context Variables button, and select the Kafka and Spark context groups:

6. Click OK.
The context variables will be displayed in the Context table:

7. Double-click tKafkaOutput to open the Component view.


8. Using the Ctrl+Space shortcut, enter "context.Broker_list" in the Broker list box.

9. In the Topic name box, enter "rawlogs":

Configure Job execution


The last step is to configure Spark.
1. In the Run view, click the Spark Configuration tab.
2. In the Spark Mode list, select Standalone.
3. In the Distribution list, select Cloudera and in the Version list, select Cloudera CDH5.4.X.
4. In the Spark Host box, enter "context.Spark_Host" (without quotes).
Note: Use the Ctrl+Space shortcut.
5. In the Spark Home box, enter "context.Spark_Home" (without quotes).
6. In the Batch size box, enter 100.
7. You will allocate a little more memory to your Job.
In the Tuning table, click Set tuning properties.
8. In the Driver memory box, enter "1g".
9. In the Driver cores box, enter "1".
10. In the Executor memory box, enter "1g".
Instead of the default value of 512 MB, you will allocate 1 GB of RAM to the driver and to the executor to run your Job.
11. As done previously, you will limit the requested resources to 4 cores.
In the Advanced properties table, add a property named "spark.cores.max" and set its value to "4".



Your configuration should be as follows:

12. Save your Job to run it later.
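For reference, the tuning fields you just filled in map to standard Spark configuration properties. A sketch of the same settings expressed directly on a SparkConf might look like this (the property names are standard Spark keys; the values are the ones chosen above).

import org.apache.spark.SparkConf;

public class TuningSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("GenerateRawLogs")
                .set("spark.driver.memory", "1g")     // Driver memory
                .set("spark.driver.cores", "1")       // Driver cores
                .set("spark.executor.memory", "1g")   // Executor memory
                .set("spark.cores.max", "4");         // Advanced property limiting the requested cores
        System.out.println(conf.toDebugString());
    }
}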


This Job will be executed later to generate the logs. The next step is to enrich the logs with information coming from a MySQL database. Using the customer ID, you will retrieve information such as the user's name, email, phone number, and support level.

Generate Enriched Logs

Overview
In this part, you will create a new Spark Streaming Job which will enrich the raw logs provided by the previous Job.
First you will run a Job that has already been created for you to push customers data to a MySQL database. Then the raw logs will be
consumed and enriched with customers information stored in the MySQL database.
The enriched logs will be published to a Kafka topic and will be consumed later.

Create customers database


You will import a Job with its corresponding metadata to create a MySQL database that will store all the customer information: First
Name, Last Name, Country, Support level, Inscription date, email and phone number.
1. Import the Job named feedCustomerDatabase and the database connection metadata named RemoteConnection
from the JobDesigns.zip archive file in the C:\StudentFiles folder.
2. Open feedCustomerDatabase:

This is a fairly simple Job that generates customer information and saves it in a text file (for reference) and in the MySQL table named users_reference.
3. Run the Job:

Your Job should execute successfully.



4. Right-click the tMySQLOutput component and click Data viewer:

5. The Data Preview window should open and show you an extract of what has been saved in the users_reference table:

6. Click Close.

Consume raw logs


You will create a new Job to consume the raw logs generated by the GenerateRawLogs Job created in the previous lab.
1. Create a new Big Data Streaming Job, using the Spark framework and name it GenerateEnrichedLogs.
2. In the Contexts view of GenerateEnrichedLogs, add the Kafka and the Spark contexts created in the previous lab.
3. Add a tKafkaInput component in the Designer and open the Component view.
4. In the Broker list box, enter "context.Broker_list" (without quotes).
Note that you can access context variables using the Ctrl+Space shortcut.
5. In the Starting offset list, select From beginning.

6. In the Topic name box, enter "rawlogs".
Your configuration should be as follows:

7. In order to combine them with the MySQL data, it is necessary to extract the user ID and URL from the Kafka message.
Add a tExtractDelimitedFields component and connect it with the Main row.
8. Open the Component view and click (...) to edit the schema.
9. Click the green plus sign below the Output table to add 2 new columns.
10. Configure the first column with user_id as its name and set its type to Integer.
11. Configure the second column with url as its name and set its type to String.
12. Your configuration should be as follows:

Combine raw logs with customers data


Based on the user ID, you will retrieve the corresponding information in the MySQL database to enrich your logs.
1. At the right side of tExtractDelimitedFields, add a tMap component and connect it with the Main row.
2. On top of tMap, add a tMysqlLookupInput component and connect it to tMap with the Main row:

3. Double-click tMysqlLookupInput to open the Component view.


4. In the Property Type list, select Repository and click (...) to select the RemoteConnection Database metadata.
5. Click (...) next to the Table Name box, and select the users_reference table in the training database.
6. The schema is required but not specified yet.
In the feedCustomerDatabase Job, open the tMysqlOutput Component view and open the schema.



7. Select all the rows in the Output table and press the Copy selected items button:

8. Paste the schema in the tMysqlLookupInput schema:

9. Click OK to save the schema.


10. Next to the Query Type list, click Guess Query:

11. As a result, a SQL query will appear in the Query box:

12. This query must be updated to allow the real-time processing of logs. Modify the Query as follows:

Note: The complete query can be found in the LabCodeToCopy file, in the C:\StudentFiles folder.
13. Double-click tMap to open the Map editor.
14. Click user_id in the row2 table and drag it to the User_id column in the row3 table:



15. Click the tMap setting button and configure as follows:

16. Create a new output and name it out.


17. Configure the output as follows:

18. Click OK to save the mapping.


The last step is to publish the enriched logs to a new Kafka topic. But before going further, you will check that both jobs are working
and that you can successfully generate logs and enrich them.

Run Jobs and check results


Instead of publishing your enriched logs to a Kafka topic, you will first display them in a tLogRow component to validate the execution
of the GenerateRawLogs and GenerateEnrichedLogs Jobs.
1. In the GenerateEnrichedLogs Job, at the right side of tMap, add a tLogRow component and connect it with the out row.
2. Double-click tLogRow to open the Component view and select the Table Mode.
3. Configure Spark as for the GenerateRawLogs Job, except for the maximum number of cores which should be set to 2,
and the Batch size set to 1000 ms:

4. Save your Job.
5. Open the GenerateRawLogs Job and run it:

You should see the number of completed batches increase in the lower right corner.
Let the Job run.
6. Open the GenerateEnrichedLogs Job and run it.



7. Examine the results in the Console. You should have enriched logs similar to this:

This proves that you are able to generate logs and to enrich them with data saved in the users database.
You will now modify the Job to publish the enriched logs to another Kafka topic.

Publish enriched logs


1. Unless you kill them, your Streaming Jobs will run forever.
Kill the GenerateEnrichedLogs and GenerateRawLogs executions. You will run them again later.
2. Delete the tLogRow component.
3. At the right side of tMap, add a tWriteDelimitedFields component and connect it with the out row.
4. At the right side of tWriteDelimitedFields, add a tKafkaOutput component and connect it with the Main row.
5. Double-click tKafkaOutput to open the Component view.
6. In the Broker list box, enter "context.Broker_list" (without quotes), or use the Ctrl+Space shortcut.
7. In the Topic name box, enter "enrichedlogs". Your configuration should be as follows:

8. Save your Job.


Your Job is now complete, so it's time to Wrap-Up.

Wrap-Up

Recap
In this lesson you covered the first part of the use case. First, you created a Job to generate logs composed of a user ID and the corresponding URL. Next, based on the user ID, you retrieved various information on the user to enrich the logs before publishing them
to a new Kafka topic.



LESSON 4
Logs Processing Use Case -
Monitoring
This chapter discusses the following.

Monitoring Logs 66
Monitoring Enriched Logs 67
Wrap-Up 77
Monitoring Logs

Overview
In the previous chapter, you generated logs and enriched them with user information saved in a MySQL database.
Now, you will monitor the logs. Using the Log Server, which is built on top of Logstash, Elasticsearch, and Kibana, you will be able to monitor your logs through a dashboard.
Logstash is a flexible, open source data collection engine. It is designed to efficiently process a growing list of logs and events for distribution into a variety of outputs, including Elasticsearch.
Elasticsearch is a distributed, open source search and analytics engine designed for scalability, reliability, and easy management.
Kibana is an open source visualization platform that allows you to interact with your data, stored in Elasticsearch by Logstash, through graphics.
You will create a Job that will process the logs before sending them to Elasticsearch.

Objectives
After completing this lesson, you will be able to:
Process data in a Big Data Streaming Job.
Save logs to Elasticsearch
Use and modify a Kibana dashboard.

Before you begin


Be sure that you are working in an environment that contains the following:
A properly installed copy of Talend Studio
A properly configured Hadoop cluster
A Talend Log Server and a Talend Administration Center installed
The supporting files for this lesson
Everything has already been set up in your training environment.

You will now build a Job to consume the enriched logs generated in the previous lesson.



Monitoring Enriched Logs

Overview
In this lab, you will create a Job that will consume the enriched logs created by the GenerateEnrichedLogs Job. Then, the logs will be
filtered and sent to Elasticsearch.
You will also start the required services to be able to monitor your logs using the Kibana web UI.

Consume enriched logs


You will now create a new Big Data Streaming Job to consume the enriched logs published to the enrichedlogs topic.
1. Create a new Big Data Streaming Job using the Spark framework and name it MonitoringLogs.
2. In the Contexts view, add the Kafka and Spark contexts.
3. Add a tKafkaInput component and double-click to open the Component view.
4. In the Broker list box, enter "context.Broker_list" (without quotes).
5. In the Starting offset list, select From beginning.
6. In the Topic name box, enter "enrichedlogs".
Your configuration should be as follows:

7. At the right side of tKafkaInput, add a tExtractDelimitedFields component and connect it with the Main row.
Next, you will process the logs to filter them based on the country before sending the filtered logs to Elasticsearch.

Processing logs
The logs will be filtered to extract the users in France, in the USA, and in other countries.
1. At the right side of tExtractDelimitedFields, add a tReplicate component and connect it with the Main row.
2. At the right side of tReplicate, add 3 tFilterRow components and connect them to tReplicate with the Main row.
3. Double-click the first tFilterRow component to open the Component view.
4. Click the green plus sign below the Conditions table.
5. In the Input column list, select country.

6. In the Operator list, select ==.
7. In the Value box, enter "France".
Your configuration should be as follows:

8. Using the same steps, configure the second tFilterRow component to extract the users in the USA:

9. Using the same steps, configure the third tFilterRow component to extract the users living in the other countries:

The last step is to save the filtered logs in Elasticsearch.

Save logs in Elasticsearch


You will use the tElasticSearchOutput component to save the logs in Elasticsearch.
1. At the right side of the first tFilterRow component, add a tElasticSearchOutput component and connect it with the Filter
row.
2. Double-click the tElasticSearchOutput component to open the Component view.
3. In the Nodes box, enter "10.0.0.2:9200".
4. In the Index box, enter "usersinfo".
5. In the Type box, enter "frusers".
6. In the Output document list, select JAVABEAN.
Your configuration should be as follows:

7. Copy the tElasticSearchOutput component and paste it at the right side of the second tFilterRow component.
8. Double-click to open the Component view.



9. In the Type box, enter "ususers".
Your configuration should be as follows:

10. Copy the tElasticSearchOutput component and paste it at the right side of the third tFilterRow component.
11. Double-click to open the Component view.
12. In the Type box, enter "others".
Your configuration should be as follows:

When the Job runs, the logs will be filtered and saved under an index named "usersinfo", but with different Types: "frusers", "ususers" and "others".
These Type values will be useful for investigating your logs in the Kibana dashboard. To be able to visualize your Kibana dashboard, you first need to start the Talend Administration Center and the Log Server.

Start services
The Log Server is built on top of Logstash, Elasticsearch and Kibana, and to access the Kibana web UI, the Talend Administration Center must be started.
For more information about the Log Server and the Talend Administration Center, you can follow the Talend Data Integration Administration training.
1. In a web browser, navigate to "http://localhost:8080/kibana".
You will reach the default Kibana dashboard:

If you can't reach this page, please refer to the Troubleshooting instructions at the end of the chapter.
You will now configure the dashboard to retrieve the logs generated by the MonitoringLogs Job.
2. In the Query section of the dashboard, close ERROR, WARN, INFO and DEBUG.
Delete TRACE to leave the last Query empty:

3. In the Filtering section, click the cross to close the timestamp filter.
4. In the upper right corner, click the gear to open the dashboard properties:

5. Click Index and then, in the Default Index box, enter "usersinfo".
6. Click Save.
7. In the upper right corner of the TIMELINE box, click the cross to close it.
8. Close FILTER BY SEVERITY and GROUP BY SEVERITY.
9. In TABLE, click the gear icon to open the properties.
10. Click Panel.
11. Under the Columns box, click the cross to remove @timestamp, type, priority, message and logger_name.
12. Click Save.



13. Click the gray arrow below Table:

This will display the Fields list.


14. Select all the fields except _id.
You are now ready to run your Job.

Run Job and check results


1. Configure Spark as follows:

Your Job will execute every second with 2 cores and the default amount of memory for driver and executor.
2. Run the GenerateRawLogs Job.
3. Run the GenerateEnrichedLogs Job.
4. Run the MonitoringLogs Job.
5. Check the results in your Kibana Dashboard.
You should have results similar to the following:

If you look at the columns named _type and _index, you will see the values assigned in the MonitoringLogs Job, in the
tElasticSearchOutput components:



6. To examine the users from the US, click ususers in the Group By Source diagram:

This will automatically filter the information displayed in the dashboard and a filtering condition will appear in the upper part of
the dashboard, in the Filtering section:

7. In the Table listing the ususers logs, you can sort the columns in ascending or descending order.
Click the support column:

8. You can add new diagrams to your dashboard.
In the upper left corner of the Group by Source diagram, click the green plus to add a new diagram:

9. In the Select Panel Type list, select terms:

10. In the Title box, enter "Support Level".



11. In the Field box, enter "support".
Your configuration should be as follows:

12. Click Save:

Your new diagram will appear, giving you more details about the support level. As before, clicking one of the diagram bars
will filter the logs to keep only the logs of interest.
13. To go back to the original logs, you can close the filters that appeared in the Filtering section of your Dashboard.
14. Stop your running Jobs: GenerateRawLogs, GenerateEnrichedLogs and MonitoringLogs.

Troubleshooting
If you can't reach the Kibana web page, check that your services are running by clicking the Windows Services icon:

The Talend Administration Center, the Talend Logserver, and the Logserver Collector should be started. If they are not, start them
manually.
Click the service name, then click Start in the left pane:

It's now time to Wrap-Up.



Wrap-Up

Recap
In this lesson, you covered the key knowledge required to monitor your enriched logs using a Kibana dashboard.
You first created a Job to filter users from different countries. Then, you started the services needed to have Elasticsearch and
Kibana running. Finally, you started your Streaming Jobs and analyzed the results in the Kibana dashboard.

Further Reading
If you want to learn more about the Talend Administration Center and the Log Server, see the Talend Data Integration
Administration training and the Talend Administration Center User Guide.

LESSON 5
Logs Processing Use Case -
Reporting
This chapter discusses the following.

Reporting users information 80


Reporting users information 81
Wrap-Up 87
Reporting users information

Overview
In this lab, you will build a Job that analyzes the incoming enriched logs to detect users downloading a product or visiting the
services web page. Then, on a regular basis, a report containing the corresponding user information will be created and saved
to HDFS.

Objectives
After completing this lesson, you will be able to:
Consume messages from a Kafka topic
Use the tWindow component to schedule processing

Before you begin


Be sure that you are working in an environment that contains the following:
A properly installed copy of Talend Studio
A properly configured Hadoop cluster
The supporting files for this lesson
Everything has already been set up in your training environment.

You will now build a Job to consume and process the enriched logs to create reports.



Reporting users information

Overview
You will consume the enriched logs generated by the GenerateEnrichedLogs Job and then process the url column to identify
which page each user navigated to. The goal is to identify users who downloaded products and those who were interested in services
(consulting, training).
Once they are identified, all information about these users will be saved to a file in HDFS every 10 seconds. To schedule the creation of the file,
you will use a tWindow component.
This component allows you to define a window duration and a triggering (slide) duration.

Consuming enriched Logs


As done previously, you will create a Big Data Streaming Job to consume the enriched logs generated by GenerateEnrichedLogs.
1. Create a new Big Data Streaming Job using the Spark framework and name it Reporting.
2. In the Contexts view, add the Kafka and Spark contexts.
3. Add a tKafkaInput component and open the Component view.
4. Configure the component to consume the messages in the enrichedlogs topic from the beginning:

5. Add a tExtractDelimitedFields component.


6. Connect it with the Main row and open the Component view.
7. Click (...) to open the schema.
8. Below the Output table of the schema, click the icon to import the schema from an XML file. The file is named
EnrichedLogs.xml and can be found under the C:\StudentFiles folder:

9. Click OK to save the schema.


Next, you will process the url column to extract the web page of interest.

Process URL
The URL is a string that looks like "/download/products/big-data" or "/services/training". Using the "/" separator, it is possible to
extract the different parts of the string and identify users interested in services or in downloading a product.
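To see why the four output columns you are about to define (root, page, specialization, and product) line up this way, here is a small plain-Java illustration of the split logic. This is only a sketch of the "/" separator behavior, not the code Talend generates; it assumes the leading "/" produces an empty first field, which would land in the root column.

    public class UrlSplitDemo {
        public static void main(String[] args) {
            String[] urls = { "/download/products/big-data", "/services/training" };
            for (String url : urls) {
                // Splitting on "/" keeps the empty field that precedes the leading slash
                String[] parts = url.split("/");
                String root = parts.length > 0 ? parts[0] : "";           // "" for both URLs
                String page = parts.length > 1 ? parts[1] : "";           // "download" or "services"
                String specialization = parts.length > 2 ? parts[2] : ""; // "products" or "training"
                String product = parts.length > 3 ? parts[3] : "";        // "big-data" or ""
                System.out.println(root + " | " + page + " | " + specialization + " | " + product);
            }
        }
    }

The page field ("download" versus "services") is therefore enough to tell the two kinds of visits apart, which is what the filters below rely on.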
1. At the right side of tExtractDelimitedFields, add another tExtractDelimitedFields and connect it with the Main row.
This second tExtractDelimitedFields component will help you process the URL string.
2. Double-click to open the Component view.
3. Click (...) to open the schema.
4. Copy all the columns in the Input table to the Output table, except for the url column.
5. Click the green plus sign below the Output table to add 4 new columns.
6. Name the columns root, page, specialization and product respectively. Your configuration should be as follows:

7. Click OK to save the schema.


8. In the Prev. Comp. Column list, select url.
9. In the Field separator box, enter "/".
Your configuration should be as follows:

When the Job executes, the url column will be split into 4 columns based on the "/" separator. By filtering the page column,
you will be able to identify the users who navigated to the download page or to the services page.

Filter users
1. At the right side of tExtractDelimitedFields_2, add a tReplicate component and connect it with the Main row.
2. At the right side of tReplicate, add a tFilterRow component and connect it with the Main row.
3. Below tFilterRow, add a second tFilterRow component and connect it to tReplicate with the Main row.
4. Double-click the first tFilterRow to open the Component view.
5. The first tFilterRow will filter the users from France who downloaded a product.
Below the Conditions table, click the green plus sign twice to add 2 conditions.



6. Configure the first condition to filter the country column to get the users in France.
7. Configure the second condition to filter the page column to get the users that navigated to the download page.
8. Your configuration should be as follows:

9. Double-click the second tFilterRow to open the Component view.


10. Configure a condition to filter the page column to get the users who navigated to the services page:

You can now finalize your Job to generate the reports.

Generate report
The reports will be saved to HDFS as text files every 10 seconds. To achieve this, you will use tWindow components.
The tWindow component needs a window duration, in milliseconds, and, optionally, a slide duration, also in milliseconds.
For example, suppose you wanted to save a file every 10 minutes about the users who connected to the web site during the last hour. This
would require a window duration of 1 hour and a slide duration of 10 minutes.
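For reference, this is the same notion as the window operation in Spark Streaming. The sketch below is an illustration of that concept only (it is not the code generated by tWindow) and uses a hypothetical socket source; the window and slide durations reproduce the 1-hour / 10-minute example above.

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class WindowSketch {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setAppName("WindowSketch").setMaster("local[2]");

            // Micro-batch interval of 10 seconds
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

            // Hypothetical source stream of log lines (a socket is used here only for illustration)
            JavaDStream<String> logs = jssc.socketTextStream("localhost", 9999);

            // Window duration = 1 hour, slide duration = 10 minutes:
            // every 10 minutes, process the logs accumulated during the last hour
            JavaDStream<String> lastHourEvery10Minutes =
                    logs.window(Durations.minutes(60), Durations.minutes(10));

            lastHourEvery10Minutes.print();

            jssc.start();
            jssc.awaitTermination();
        }
    }

In the lab Job, both durations are set to 10000 ms, so each report covers exactly the logs of the previous 10 seconds.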
1. In the Designer, add a tHDFSConfiguration component and open the Component view.
2. In the Property Type list, select Repository and use the HDFSConnection cluster metadata.
3. At the right side of each tFilterRow component add a tWindow component and connect them with the Filter row.
4. Open the Component view.
5. The tWindow component allows you to define a time window duration.
In the Window duration box, enter "10000".
6. Click the Define the slide duration checkbox and enter "10000".
This means that every 10 seconds, a report covering the logs accumulated during the last 10
seconds will be saved to HDFS.
7. Repeat the same configuration in the second tWindow component:

8. At the right side of each tWindow component add a tFileOutputDelimited component and connect them with the Main
row.
9. Double-click the first tFileOutputDelimited component to open the Component view.

10. In the Folder box, enter
"/user/student/BDAdvanced/DownloadReports/"
11. Double-click the second tFileOutputDelimited component to open the Component view.
12. In the Folder box, enter
"/user/student/BDAdvanced/ServicesReports/"

Run Job and check results


You can now configure Spark properly and then run your Jobs to generate raw logs, enrich them, and generate the reports.
The reports will be saved in HDFS under the DownloadReports and ServicesReports folders.
1. Configure Spark as follows:

The Batch size is 10 seconds, and the Job is allowed 1 GB of RAM and 2 cores.
2. Run the GenerateRawLogs Job.
3. Run the GenerateEnrichedLogs Job.
4. Run the Reporting Job.
5. Navigate to Hue "http://ClusterCDH54:8888".
6. Use the File Browser to find the "/user/student/BDAdvanced/DownloadReports" and
"/user/student/BDAdvanced/ServicesReports" folders.



7. In each folder, you will find subfolders containing the report text file:

8. Click the different folders and open the part-r00000 file to find the information about your users of interest:

Because the files are created in real time, the number of users per file will vary depending on the incoming logs. But, as expected
for the services report, in the different folders under /user/student/BDAdvanced/ServicesReports, you will only find data
about users who navigated to the services web page.
9. Open the /user/student/BDAdvanced/DownloadReports folder and open the part-r00000 text files in the different
subfolders to validate your Job:

You should only find users from France who navigated to the download web page.
You have now completed the lab, and it's time to Wrap-Up.



Wrap-Up

Recap
In this lesson, you covered the key knowledge required to use the tWindow component. This component allows you to define a
time window duration and a triggering (slide) duration.
Data is accumulated during the time window before being processed.
You can now move to the next lab.
In the next lab, instead of filtering the logs and keeping only those of interest, you will collect all the logs and store them in HBase in
real time.
Then, you will analyze the stored logs with a Big Data Batch Job to get the top 5 most visited pages.

LESSON 6
Logs Processing Use Case - Batch
Analysis
This chapter discusses the following.

Logs analysis 90
Stream Ingestion 91
Logs Batch Analysis 95
Wrap-Up 101
Logs analysis

Overview
In the previous labs, the logs were filtered. In this lab, the logs will be saved in your cluster, in real time, as they come through your
Kafka topic.
Then, in a batch Job, the logs will be processed to extract statistical information, such as the top 5 web pages or the top 5 products
downloaded by the users.

Objectives
After completing this lesson, you will be able to:
Specify the country name through a context variable prompted when the Job starts.
Compute top values from accumulated data

Before you begin


Be sure that you are working in an environment that contains the following:
A properly installed copy of Talend Studio
A properly configured Hadoop cluster
The supporting files for this lesson
Everything has already been set up in your training environment.

You can now start with the Stream Ingestion section.



Stream Ingestion

Overview
Instead of filtering the logs, you will save them in HBase as they arrive.
First, you will create the HBase table that will be used in your Job. Next, you will create a Big Data Streaming Job to consume
the enriched logs from the enrichedlogs Kafka topic and save them in HBase.

Create HBase table


Before you can save data in HBase from your Spark Big Data Streaming Job, the HBase table must already exist. You will
use Hue to create your HBase table.
1. In your web browser, connect to Hue (http://ClusterCDH54:8888).
2. Click Data Browsers then click HBase:

3. In the upper right corner, click New Table.


4. In the Table Name box, enter "Logs".
5. In the Column Families box, enter "ID".

6. Click Add an additional column family and name the new family INFO.
Your configuration should be as follows:

7. Click Submit. Your table will appear in the HBase table list.
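As an alternative to the Hue UI, the same table could also be created programmatically with the HBase Java client. The sketch below is only an illustration; it assumes the HBase 1.0 client API shipped with CDH 5.4 and that ZooKeeper runs on the ClusterCDH54 host used elsewhere in this course.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class CreateLogsTable {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // Assumption: ZooKeeper runs on the cluster host used in this course
            conf.set("hbase.zookeeper.quorum", "ClusterCDH54");

            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Admin admin = connection.getAdmin()) {
                HTableDescriptor table = new HTableDescriptor(TableName.valueOf("Logs"));
                table.addFamily(new HColumnDescriptor("ID"));   // column family for the row key column
                table.addFamily(new HColumnDescriptor("INFO")); // column family for all other log fields
                if (!admin.tableExists(table.getTableName())) {
                    admin.createTable(table);
                }
            }
        }
    }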

Create ingestion Job


You will create a new Big Data Streaming Job that consumes the messages in the enrichedlogs Kafka topic and saves them in the Logs
HBase table.
1. Create a new Big Data Streaming Job using the Spark framework and name it StreamIngestion.
2. Add the Kafka and Spark contexts to your Job.
3. Add a tHBaseConfiguration component.
4. Double-click to open the Component view.
5. In the Property Type list, select Repository and use the HBaseConnection database connection metadata.
6. Add a tKafkaInput component and configure it to consume the messages in the enrichedlogs Kafka topic from the
beginning.
7. At the right side of tKafkaInput, add a tExtractDelimitedFields component and connect it with the Main row.
8. At the right side of tExtractDelimitedFields, add a tHBaseOutput component and connect it with the Main row.
9. Double-click to open the Component view.
10. In the Property Type list, select Repository and use the HBaseConnection database connection metadata.
11. Click (...) to open the schema.
12. To set the schema in the Output table, use the EnrichedLogs.xml file, which can be found under the C:\StudentFiles
folder.
13. Click OK to save the schema.
14. In the Table name box, enter "Logs".
15. Click the Store row key column to hbase column checkbox.
16. In the Families table, set the family name to "ID" for the user_id column and then set the family name to "INFO" for all the
other columns.



Your configuration should be as follows:

You can now configure Spark and then you will be able to run your Job.
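Before configuring Spark, it may help to see what this family mapping means on the HBase side. The sketch below writes one log row with the plain HBase client; it is an illustration only, not the code generated by tHBaseOutput, and it assumes that user_id is used as the row key and that the sample values (user42, France, the URL) are placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PutLogRow {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            conf.set("hbase.zookeeper.quorum", "ClusterCDH54"); // assumption: cluster host

            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("Logs"))) {
                // One enriched log row, keyed by user_id
                Put put = new Put(Bytes.toBytes("user42"));
                put.addColumn(Bytes.toBytes("ID"), Bytes.toBytes("user_id"), Bytes.toBytes("user42"));
                put.addColumn(Bytes.toBytes("INFO"), Bytes.toBytes("country"), Bytes.toBytes("France"));
                put.addColumn(Bytes.toBytes("INFO"), Bytes.toBytes("url"),
                        Bytes.toBytes("/download/products/big-data"));
                table.put(put);
            }
        }
    }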

Configure Spark
This configuration will be a bit different from the ones you have done so far because you will use HBase. Some
additional properties are necessary for your Job to execute successfully.
1. In the Run view, click Spark Configuration.
2. Select an execution on a Spark Standalone running on a Cloudera CDH 5.4 cluster.
3. Use the appropriate context variables for Spark Host and Spark Home values.
4. Set the Batch size to 10 seconds.
5. Allow 1 GB of RAM for the driver memory and the executor memory.
6. Click the green plus sign in the Advanced Properties table to add 3 properties.
7. The first property limits the Spark Job to 2 cores.
In the Property column, enter "spark.cores.max" and, in the Value column, enter "2".
8. The other 2 properties specify class paths.
In the Property column, enter "spark.executor.extraClassPath". Then, in the corresponding Value box, enter
"/opt/cloudera/parcels/CDH/lib/hbase/lib/*".
9. In the Property column, enter "spark.driver.extraClassPath".
Then, in the corresponding Value box, enter
"/etc/hbase/conf:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar".
Note: Use the LabCodeToCopy file to avoid typos.
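These advanced properties are ordinary Spark configuration entries. For reference, the sketch below shows the same settings applied directly to a SparkConf; it is only an illustration of what the values mean, since in the lab you enter them in the Spark Configuration view.

    import org.apache.spark.SparkConf;

    public class HBaseSparkConfSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("StreamIngestion")
                    // Limit the Job to 2 cores on the standalone cluster
                    .set("spark.cores.max", "2")
                    // Make the HBase client jars visible to the executors
                    .set("spark.executor.extraClassPath",
                         "/opt/cloudera/parcels/CDH/lib/hbase/lib/*")
                    // Make the HBase configuration and the HTrace jar visible to the driver
                    .set("spark.driver.extraClassPath",
                         "/etc/hbase/conf:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar");

            System.out.println(conf.toDebugString());
        }
    }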

Run your Job and check results in Hue


1. Run your GenerateRawLogs Job.
2. Run your GenerateEnrichedLogs Job.
3. Run your StreamIngestion Job.

4. When your StreamIngestion Job is running and the number of completed batches is increasing, click Logs in the HBase
table list in Hue to check that the logs are being saved:

You now have 3 Big Data Streaming Jobs running to generate logs, enrich them, and save them in HBase.
The next step is to analyze the ingested logs.



Logs Batch Analysis

Overview
Once saved in HBase, the logs can be processed in batch mode to extract useful information.
You will create a Big Data Batch Job to analyze the logs and get the top 5 downloaded products for a specific country.

Analyze Logs for a specific country


The country to analyze will be defined in a context variable through a prompt appearing when the Job starts.
1. Create a new Big Data Batch Job using the Spark framework and name it DownloadAnalysis.
2. In the Contexts view, add the Spark and Kafka contexts.
3. Click the green plus sign below the context table.
4. In the Name box, enter "CountryToAnalyze".
5. In the Value box, enter "France" and click the Default checkbox.
6. In the Prompt box, enter "Country To Analyze?".
This Prompt box will appear when the Job starts.
Your configuration should be as follows:

7. Add a tHBaseConfiguration component and configure it to use the HBaseConnection metadata:

8. Add a tHBaseInput component and open the Component view.


9. Click (...) to open the schema.
10. Use the EnrichedLogs.xml file to populate the schema:

11. In the Table name box, enter "Logs".


12. In the Mapping table, set the Column family to "ID" for the user_id column. Then set the Column family value to "INFO"
for all the other columns.
13. At the right side of tHBaseInput, add a tFilterRow component and connect it with the Main row.
14. Click the green plus sign to add a condition.
15. In the Input column list, select "country".
16. Select "==" in the Operator list.
17. In the Value box, enter "context.CountryToAnalyze" (without quotes).
Note: You can use the Ctrl+Space shortcut to get the context variable name.



18. Your configuration should be as follows:

When the Job executes, the logs will be filtered according to the country name you specify in the prompt.

Compute top 5 downloaded products


As done previously, you will process the URL string to get the name of the downloaded product. Then you will aggregate the logs and
display the top 5 in the Console.
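Conceptually, the tAggregateRow and tTop steps below amount to a count per product followed by a descending sort. The following standalone Spark (Java API) sketch illustrates that logic on a few hard-coded sample URLs; it is an illustration of the idea only, not the code generated by the components.

    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import scala.Tuple2;

    public class TopProductsSketch {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("TopProductsSketch").setMaster("local[2]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Hypothetical download URLs, already filtered on the chosen country
            List<String> urls = Arrays.asList(
                    "/download/products/big-data",
                    "/download/products/big-data",
                    "/download/products/data-integration",
                    "/download/products/esb");

            // Count downloads per product (the product name is the 4th field of the URL)
            JavaPairRDD<String, Integer> downloadsPerProduct = sc.parallelize(urls)
                    .mapToPair(url -> new Tuple2<String, Integer>(url.split("/")[3], 1))
                    .reduceByKey(Integer::sum);

            // Keep the 5 products with the highest counts, in descending order
            List<Tuple2<Integer, String>> top5 = downloadsPerProduct
                    .mapToPair(t -> new Tuple2<Integer, String>(t._2(), t._1()))
                    .sortByKey(false)
                    .take(5);

            top5.forEach(t -> System.out.println(t._2() + " -> " + t._1()));

            sc.stop();
        }
    }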
1. At the right side of tFilterRow, add a tExtractDelimitedFields component and connect it with the Filter row.
2. Double-click to open the Component view.
3. In the Prev. Comp. column list, select "url".
4. In the Field Separator box, enter "/".
5. Click (...) to open the schema.
6. In the Output table, add the country column.
7. Click the green plus sign to add 4 columns and name them root, page, specialization and product.
Your schema should be as follows:

8. Click OK to save the schema.


9. At the right side of tExtractDelimitedFields, add a tAggregateRow component and connect it with the Main row.
10. Double-click to open the Component view.
11. Click (...) to open the schema.
12. In the Output table, add the product column.
13. Click the green plus sign to add a new column and name it NbDownloads.

14. In the Type list, select Integer.
Your schema should look like the following:

15. Click OK to save the schema.


16. Configure the tAggregateRow component as follows:

17. At the right side of tAggregateRow, add a tTop component and connect it with the Main row.
18. Double-click to open the Component view.
19. In the Number of line selected box, enter "5".
20. Click the green plus sign below the Criteria table.
21. In the Schema column list, select NbDownloads.
22. In the sort num or alpha? list, select num and in the Order asc or desc? list, select desc.
Your configuration should be as follows:

23. At the right side of tTop, add a tLogRow component and connect it with the Main row.



24. Next to the tHBaseConfiguration component, add a tHDFSConfiguration component and configure it to use the
HDFSConnection metadata.
25. In the tLogRow Component view, select the Table Mode.
Your Job is now ready to run.

Run Job and check results


Before running your Job you must configure Spark.
1. Configure Spark to be in Standalone mode, using a Cloudera CDH 5.4 distribution.
2. Use the Spark_Host and Spark_Home context variables in the Spark Host and Spark Home boxes.
3. As in the previous Job, you will need to add the 2 advanced properties related to HBase, as well as the property limiting the
number of cores requested by your Job.
Configure the Advanced Properties table as follows:

Note: Use the LabCodeToCopy file to avoid typos.


4. Run your Job.
5. When the Job starts, the following window opens:

This is where you enter the name of the country you are interested in.
6. Enter the country name and click OK.
7. The Job runs and then the result will be displayed in the Console:

It's now time to recap what you've done.

Wrap-Up

Recap
In this lesson, you covered the key knowledge required to save messages in HBase in real time.
Once saved, the data can be processed later to extract useful information such as statistics, data models, or user
classifications. Machine learning is the natural next step after data stream ingestion.

