Technical Difficulty
Instructor-Led
If you are following an instructor-led training (ILT) module, there will be periods for questions at regular intervals. However, if you need an answer in order to proceed with a particular lab, or if you encounter a situation with the software that prevents you from proceeding, don't hesitate to ask the instructor for assistance so the issue can be resolved quickly.
Self-Paced
If you are following a self-paced, on-demand training (ODT) module, and you need an answer in order to proceed with a particular
lab, or you encounter a situation with the software that prevents you from proceeding with the training module, a Talend professional
consultant can provide assistance. Double-click the Live Expert icon on your desktop to go to the Talend Live Support login page
(you will find your login and password in your ODT confirmation email). The consultant will be able to see your screen and chat with
you to determine your issue and help you on your way. Please be considerate of other students and only use this assistance if you are
having difficulty with the training experience, not for general questions.
Exploring
Remember that you are interacting with an actual copy of the Talend software, not a simulation. Because of this, you may be tempted to perform tasks beyond the scope of the training module. Be aware that doing so can quickly derail your learning experience, leaving your project in a state that is not readily usable within the tutorial, or consuming your limited lab time before you have a chance to finish. For the best experience, stick to the tutorial steps! If you want to explore, feel free to do so with any time remaining after you've finished the tutorial (but note that you cannot receive consultant assistance during such exploration).
Additional Resources
After completing this module, you may want to refer to the following additional resources to further clarify your understanding and
refine and build upon the skills you have acquired:
Talend product documentation (help.talend.com)
Talend Forum (talendforge.org/)
Documentation for the underlying technologies that Talend uses (such as Apache) and third-party applications that complement Talend products (such as MySQL Workbench)
This page intentionally left blank to ensure new chapters
start on right (odd number) pages.
Introduction to Kafka 10
Kafka Overview 11
Publishing Messages to a Kafka Topic 12
Consuming messages 18
Wrap-Up 21
Introduction to Kafka
Overview
During this training, you will be assigned a preconfigured Hadoop cluster, built with the Cloudera CDH 5.4 distribution. The purpose is to try the different functionalities, not to have a production cluster, so this training cluster runs in pseudo-distributed mode: there is only one node. This is enough to understand the different concepts in this training.
On top of the cluster services that you have used so far, Kafka has been installed. You will use it to publish and consume messages in a simple Data Integration Job to understand the basic concepts of Kafka.
Later in the course, you will use Kafka for real-time processing in conjunction with the Spark Streaming framework.
Objectives
After completing this lesson, you will be able to:
Create a new topic in Kafka
Publish a message to a specific topic
Consume messages in a specific topic
Overview
Kafka was created at LinkedIn and is now an Apache project.
Kafka is a distributed publish-subscribe messaging system that is designed to be fast, scalable and durable.
First, you will review some basic messaging terminology:
Kafka maintains feeds of messages in categories called topics.
Producers write messages to topics.
Consumers read messages from topics.
Topics are partitioned and replicated across multiple nodes.
At a high-level, the following diagram represents the architecture of a Kafka messaging system:
Kafka is run as a cluster of one or more servers. Each server is called a broker.
Messages are byte arrays and can be used to store any object in any format; String, JSON and Avro are the most commonly used formats.
A topic is a category to which messages are published, and for each topic, the Kafka cluster maintains a partitioned log. Each partition is an ordered, immutable sequence of messages. Each message in a partition is assigned a unique sequential id number called the offset, which uniquely identifies the message within the partition.
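The topic/partition/offset model described above can be sketched in plain Python. This is a purely illustrative in-memory model, not part of the Talend labs; the key-hash routing is an assumption standing in for Kafka's partitioner:

```python
# Illustrative in-memory model of a Kafka topic: a topic is a set of
# partitions, and each partition is an append-only list of messages
# whose index in the list is the message's offset.
class Topic:
    def __init__(self, name, num_partitions=2):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # Kafka routes messages by key hash; Python's hash() stands in here.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(value)
        offset = len(self.partitions[p]) - 1  # offsets are sequential per partition
        return p, offset

topic = Topic("mytopic")
p1, o1 = topic.append("user-1", b"first message")
p2, o2 = topic.append("user-1", b"second message")
# Same key -> same partition, and offsets grow sequentially within it.
assert p1 == p2 and o2 == o1 + 1
```

Note that offsets are only unique within one partition; ordering is guaranteed per partition, not across the whole topic.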
The Kafka cluster retains all published messages, consumed or not, for a certain amount of time.
When consuming from a topic, it is possible to configure a consumer group with multiple consumers. Each consumer will read messages from the specific topics it subscribed to.
Kafka does not attempt to track which messages were read by each consumer and retain only the unread messages. Instead, Kafka retains all messages for a configurable amount of time. Once this amount of time has elapsed, the messages are deleted to free up space.
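This division of responsibility — the broker deletes only by age, while each consumer tracks its own read position — can be sketched as follows (again a hypothetical in-memory model, not Kafka's actual implementation):

```python
import time

# Illustrative model: the broker keeps every message with its publish time
# and deletes only by age; each consumer remembers its own read position
# (offset) instead of the broker tracking what was read.
RETENTION_SECONDS = 3600
log = []  # list of (publish_time, message)

def publish(message, now):
    log.append((now, message))

def purge(now):
    # The broker discards messages older than the retention period,
    # whether or not any consumer has read them.
    log[:] = [(t, m) for (t, m) in log if now - t < RETENTION_SECONDS]

class Consumer:
    def __init__(self):
        self.offset = 0  # this consumer's own position in the log

    def poll(self):
        messages = [m for (_, m) in log[self.offset:]]
        self.offset = len(log)
        return messages

now = time.time()
publish(b"m1", now)
publish(b"m2", now)
consumer = Consumer()
assert consumer.poll() == [b"m1", b"m2"]
purge(now)                          # nothing is old enough to delete yet
publish(b"m3", now)
assert consumer.poll() == [b"m3"]   # only messages past this consumer's offset
```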
Kafka is typically used for the following use cases:
Messaging
Website Activity Tracking
Log Aggregation
Stream Processing
You will now create a Job to publish messages to a Kafka topic.
LESSON 1 | 11
Publishing Messages to a Kafka Topic
Overview
In this lab, you will build a simple Data Integration Job to publish messages to a Kafka topic.
You will first create a Kafka topic, then generate random data and process it into the right format to publish it to your Kafka topic.
Generate Data
You will now generate random data using the tRowGenerator component.
1. Below tKafkaCreateTopic, add a tRowGenerator component and connect it with an OnSubjobOk trigger.
2. Double-click tRowGenerator to open the RowGenerator editor.
3. Click the green plus sign to add 3 new columns to the schema.
4. Name the columns "Id", "FirstName" and "LastName".
5. Set the Id column type to Integer and in the Functions list, select Numeric.sequence(String,int,int).
6. In the FirstName column Functions list, select TalendDataGenerator.getFirstName().
7. In the LastName column Functions list, select TalendDataGenerator.getLastName().
Serialize data
To be published, your message must be serialized into a byte-array format.
You will serialize your random data using a tWriteXMLField component.
1. At the right side of tRowGenerator, add a tWriteXMLField component and connect it with the Main row.
2. Double-click tWriteXMLField to open the XML Tree editor.
3. In the Link Target table, in the XML Tree column, click rootTag.
4. Click the green plus sign:
5. Select Create as sub-element and click OK:
6. In the Input the new element's valid label box, enter "customer" and click OK:
7. In the Linker source table, in the Schema List column, select Id, FirstName and LastName and drag them on
customer:
Convert Data
Now that your data has been serialized, you can convert it to bytes.
1. At the right side of tWriteXMLField, add a tJavaRow component and connect it with the Main row.
2. Double-click tJavaRow to open the Component view.
3. Click (...) to open the schema.
4. Click SerializedData in the Input table and click the yellow right arrow to transfer it to the Output table:
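What tWriteXMLField and tJavaRow accomplish in these two steps can be sketched in plain Python. The function names are hypothetical; only the `rootTag`/`customer` element names and the schema columns come from the lab:

```python
import xml.etree.ElementTree as ET

def serialize_customer(cust_id, first_name, last_name):
    # Equivalent of tWriteXMLField: wrap the row's columns in a <customer>
    # element under the root tag, producing one XML string per row.
    root = ET.Element("rootTag")
    customer = ET.SubElement(root, "customer")
    ET.SubElement(customer, "Id").text = str(cust_id)
    ET.SubElement(customer, "FirstName").text = first_name
    ET.SubElement(customer, "LastName").text = last_name
    return ET.tostring(root, encoding="unicode")

def to_bytes(xml_string):
    # Equivalent of the tJavaRow step: a Kafka message payload is a byte array.
    return xml_string.encode("utf-8")

payload = to_bytes(serialize_customer(1, "Ada", "Lovelace"))
assert isinstance(payload, bytes)
assert b"<customer>" in payload and b"<FirstName>Ada</FirstName>" in payload
```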
Publish message
You will now publish your message to the Kafka topic you created earlier. To publish, you will use a tKafkaOutput component.
1. At the right side of tJavaRow, add a tKafkaOutput component and connect it with the Main row.
2. Open the Component view.
3. In the Broker list box, enter "ClusterCDH54:9092".
4. In the Topic name box, enter "mytopic", which is the Kafka topic you created previously using the tKafkaCreateTopic component.
5. Your configuration should be as follows:
A successful execution means that you have generated 100 rows of random data, converted them to a serialized byte-format message, and published them on your Kafka topic.
You will now create another Job to consume the messages published on mytopic.
Consuming messages
Overview
You created a first Job to publish messages on a Kafka topic. You will now create a Job that will consume the messages on the same Kafka topic and display them in the Console.
Consume messages
1. Create a new Standard Job and name it ConsumeMessages.
2. Add a tKafkaInput component and open the Component view.
3. In the Zookeeper quorum list box, enter "ClusterCDH54:2181".
4. In the Starting offset list, select From beginning.
5. In the Topic name box, enter "mytopic".
6. Leave the default value in the Consumer group id box.
7. Your configuration should be as follows:
The ConsumeMessages Job will wait for incoming messages in your Kafka topic.
12. Stop the ConsumeMessages Job execution using the Kill button in the Run view.
You have finished this introduction to Kafka. So now it's time to recap what you've learned.
Recap
In this lesson you covered the key base knowledge required to publish and consume messages in a Kafka topic.
You first created a Kafka topic using the tKafkaCreateTopic component. Then you formatted your data properly to publish it on the topic with the tKafkaOutput component. Finally, you consumed the messages with the tKafkaInput component.
Later in the course, you will publish and consume messages over various topics in real time in Big Data Streaming Jobs, using the
Spark framework.
The next lab will give you a brief introduction to Spark.
LESSON 2
Introduction to Spark
This chapter discusses the following.
Introduction to Spark 24
Spark Overview 25
Customer Data Analysis 27
Producing and Consuming messages in Real-Time 36
Wrap-Up 43
Introduction to Spark
Overview
This lab is an introduction to the Spark framework. Spark can be used as an alternative to the Map Reduce framework for Big Data Batch Jobs. It can be used for Big Data Streaming Jobs as well.
In this lab, you will cover how to create Big Data Batch and Big Data Streaming Jobs. The Jobs will be configured to run on your
cluster using the Spark framework.
You will start with a description of the Spark framework. Then you will cover an example of a Big Data Batch Job running on Spark.
And you will finish with Big Data Streaming Jobs that will publish and consume messages in a Kafka topic.
Objectives
After completing this lesson, you will be able to:
Create a Big Data Batch Job
Create a Big Data Streaming Job
Configure your Jobs to use the Spark framework
Overview
Apache Spark is a fast and general engine for large-scale data processing, similar to Hadoop, but it has some useful differences that make it superior for certain use cases, such as machine learning.
Spark enables in-memory distributed datasets that optimize iterative workloads in addition to interactive queries. Caching datasets in memory reduces the latency of access. Spark can be up to 100 times faster than Map Reduce in memory, and 10 times faster on disk.
Spark introduces an abstraction called resilient distributed datasets (RDDs). An RDD is a read-only collection of objects distributed
across a set of nodes.
RDDs are resilient because they can be rebuilt if a portion of the dataset is lost, thanks to a fault-tolerance mechanism.
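The resilience idea — rebuild a lost partition from the transformations that produced it, rather than from a replicated copy — can be sketched with a toy model. This is a conceptual illustration, not the Spark API:

```python
# Toy model of an RDD: it records its parent dataset and the transformation
# used to derive it (its lineage), so a lost result can be recomputed.
class ToyRDD:
    def __init__(self, data=None, parent=None, fn=None):
        self._cache = data      # None until computed (or if "lost")
        self.parent = parent
        self.fn = fn

    def map(self, fn):
        # Transformations are lazy: just record the lineage, compute nothing.
        return ToyRDD(parent=self, fn=fn)

    def collect(self):
        if self._cache is None:
            # Recompute from the parent via the recorded transformation.
            self._cache = [self.fn(x) for x in self.parent.collect()]
        return self._cache

base = ToyRDD(data=[1, 2, 3])
doubled = base.map(lambda x: x * 2)
assert doubled.collect() == [2, 4, 6]
doubled._cache = None                   # simulate losing the computed data
assert doubled.collect() == [2, 4, 6]   # rebuilt from lineage, not a replica
```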
Applications in Spark are called drivers, and these drivers implement the operations performed either on a single node or across a set of nodes. Spark offers APIs in Java, Scala and Python. Spark also provides high-level tools such as Spark SQL for structured data processing, MLlib for machine learning, Spark Streaming for stream processing and GraphX for graph analysis.
The Spark framework can be deployed on Apache Hadoop via Yarn or on its own cluster (standalone). But Spark doesn't need to sit
on top of HDFS. Spark can run on top of HDFS, HBase, Cassandra, Amazon S3 or any Hadoop data source.
You will now create a Big Data Batch Job using the Spark framework.
Overview
In this lab, you will reuse data introduced in the Big Data Basics course. In that course, you analyzed customer data using Pig components in Standard Jobs and then performed the same analysis with a Big Data Batch Job using the Map Reduce framework.
You will now perform the same task in a Big Data Batch Job using the Spark framework:
This Job is composed of a single tHDFSPut component. This component will copy the CustomersData.csv file from your local file system to HDFS under /user/student/BDBasics.
3. Modify the output folder so that the file is written under /user/student/BDAdvanced/CustomersData.csv.
4. Run PutCustomersData.
5. Connect to Hue, navigating to "ClusterCDH54:8888" and using student/training as username/password.
6. Using the File Browser, open CustomersData.csv under /user/student/BDAdvanced:
The data consist of different information about customers: Id, first name, last name, city, state, product category, gender and purchase date.
You will now create a Job to process the data.
Connect to HDFS
When you use Spark, you can use different storage options. You do not rely on HDFS as you do with the Map Reduce framework, so you will need a specific component to set your HDFS configuration information.
1. In the Repository, right-click Job Designs, then click Create Big Data Batch Job:
8. Your configuration should be as follows:
5. Using the (...) button, select the CustomersData.csv file under /user/student/BDAdvanced and click OK.
If you can't get connected to HDFS, check your tHDFSConfiguration component.
6. Your configuration should be as follows:
You will now process your data. First, you will filter your data to extract the data of interest; then, you will perform aggregation and sorting to get useful information.
6. In the Filter box, enter:
row1.State.equals("California")
7. Your configuration should be as follows:
11. At the right side of tAggregateRow, add a tSortRow component and connect it with the Main row.
12. Double-click tSortRow to open the Component view.
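The tFilterRow → tAggregateRow → tSortRow flow configured above can be sketched in plain Python. The aggregation and sort columns are assumptions (the exact component settings appear only in the course screenshots); the filter condition matches the one entered earlier:

```python
from collections import Counter

# Hypothetical sample rows using columns from this lab's schema.
rows = [
    {"State": "California", "ProductCategory": "Shoes"},
    {"State": "Nevada",     "ProductCategory": "Shoes"},
    {"State": "California", "ProductCategory": "Hats"},
    {"State": "California", "ProductCategory": "Shoes"},
]

# tFilterRow: keep only rows where State equals "California".
california = [r for r in rows if r["State"] == "California"]

# tAggregateRow (assumed config): count rows per product category.
counts = Counter(r["ProductCategory"] for r in california)

# tSortRow (assumed config): sort categories by descending count.
ranking = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
assert ranking == [("Shoes", 2), ("Hats", 1)]
```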
1. Click the Run view and then the Spark Configuration tab:
2. In the Spark Mode list, you can choose the execution mode of your Job. There are 3 options: Local, Standalone and
YARN client.
Your cluster has been installed and configured for Spark to run in Standalone mode.
In the Spark Mode list, select Standalone.
3. Check that the Distribution and Version correspond to your Cloudera CDH 5.4 cluster.
4. In the Spark Host box, enter "spark://ClusterCDH54:7077" (quotes included).
5. In the Spark Home box, enter "/user/spark/share/lib".
6. Go back to the Basic Run tab and click Run.
7. At the end of the execution, you should have an exit code equal to 0 in the Console, and in the Designer, you should see 100% labels on top of your rows:
You have covered how to create a Big Data Batch Job using the Spark framework. It's time to move to the next topic, which will introduce you to Big Data Streaming Jobs.
Producing and Consuming messages in Real-Time
Overview
In this lab, you will build two Big Data Streaming Jobs. The first Job will publish messages to a Kafka topic and the second Job will consume those messages.
1. In the Repository, right-click Big Data Streaming, and click Create Big Data Streaming Job:
4. Add a Purpose and a Description, then click Finish to create your Job.
5. In the Designer view, add a tRowGenerator component and open the RowGenerator editor.
The tWriteDelimitedFields component will serialize your data and convert it to a byte array, as required to publish on a Kafka topic.
11. At the right side of tWriteDelimitedFields, add a tKafkaOutput component and connect it with the Main row.
12. Double-click tKafkaOutput to open the Component view.
13. In the Broker list box, enter "ClusterCDH54:9092".
14. In the Topic name box, enter "mystreamingtopic":
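The serialization performed by tWriteDelimitedFields can be sketched as follows. The semicolon delimiter is an assumption; the component's actual delimiter is configurable:

```python
def serialize_delimited(fields, delimiter=";"):
    # Join the row's fields into one delimited string, then encode it,
    # since a Kafka message payload must be a byte array.
    return delimiter.join(str(f) for f in fields).encode("utf-8")

message = serialize_delimited([42, "Ada", "Lovelace"])
assert message == b"42;Ada;Lovelace"
```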
The goal of this lab is to publish and consume messages in real-time. To achieve this, the Job to publish the messages and the Job to
consume them will run simultaneously.
If you run your Job as it is configured, the default configuration of Spark on the cluster will assign all the available cores to the current Job. That means that if you start another Job on Spark, there will be no cores available for this new Job to run. To avoid this, you must limit the number of cores requested by your Job.
To summarize, compared to a Spark Big Data Batch Job, you will have to set the Batch size and limit the number of cores requested
by your Job.
1. In the Run view, go to the Spark Configuration tab.
2. In the Spark Mode list, select Standalone.
3. In the Distribution list, select Cloudera.
4. In the Spark Host box, enter "spark://ClusterCDH54:7077" (quotes included).
5. In the Spark Home box, enter "/user/spark/share/lib".
6. In the Batch size box, enter "100".
This will set the batch size to 100 milliseconds.
7. Click the green plus sign below the Advanced properties table.
8. In the Property column, enter "spark.cores.max" and in the Value column, enter "4".
This will limit the number of cores requested to 4.
9. Your Spark configuration should be as follows:
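The effect of the 100 ms batch size can be illustrated with a small micro-batching sketch (timestamps in milliseconds, purely illustrative — Spark Streaming itself handles this internally):

```python
def assign_batches(events, batch_ms=100):
    # Spark Streaming divides the incoming stream into fixed-size time
    # intervals; each interval is processed as one small batch job.
    batches = {}
    for timestamp_ms, event in events:
        batch_id = timestamp_ms // batch_ms
        batches.setdefault(batch_id, []).append(event)
    return batches

events = [(10, "a"), (90, "b"), (150, "c"), (420, "d")]
batches = assign_batches(events)
# Events within the same 100 ms interval land in the same batch.
assert batches == {0: ["a", "b"], 1: ["c"], 4: ["d"]}
```

A smaller batch size lowers latency but increases scheduling overhead, which is why streaming Jobs need this value tuned while batch Jobs do not.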
Consume messages
Now you will create the Job to consume the messages in the mystreamingtopic Kafka topic.
7. At the right side of tKafkaInput, add a tExtractDelimitedFields component and connect it with the Main row.
8. Open the Component view and click Sync columns.
9. The consumed messages will be displayed in the Console. At the right side of tExtractDelimitedFields, add a tLogRow component and connect it with the Main row.
You can now continue with the Spark configuration.
9. Your Spark configuration should be as follows:
Run Jobs
You will first run the publishing Job, then, you will run the consuming Job.
1. Open the PublishMessagesStreaming Job:
In the upper left corner, you will see the batch size you set in the Spark configuration tab. The Job will be executed every 100
ms.
2. Run your Job.
That means that messages are being published on mystreamingtopic, and the Job will run until you press the Kill button.
4. Open the ConsumeMessagesStreaming Job:
6. Observe the result in the Console:
Each time the Job is executed, you will see new messages appear.
To test the real-time aspect of your processing, you can stop your PublishMessagesStreaming Job. Once all the messages are consumed and displayed in the Console of the ConsumeMessagesStreaming Job, you won't see new messages appear. Start the PublishMessagesStreaming Job again to publish new messages and observe the Console in the ConsumeMessagesStreaming Job.
Once you have finished, kill the execution of both Jobs to free the resources on your cluster.
You have now covered the introduction to Spark lab. It's time to recap what you have learned.
Recap
In this lesson you covered the key base knowledge required to create Big Data Batch and Big Data Streaming Jobs using the Spark framework.
The Spark configuration requires setting the Spark host and Spark home values properly. On top of that, for streaming Jobs, the Batch size is necessary. If you need to run multiple streaming Jobs simultaneously, you need to limit the number of cores requested by each Job.
Another important point is that Spark is not based on HDFS or HBase. It can use different storage options, so you have to include a tHDFSConfiguration component or an equivalent component to specify where Spark should read and write files.
Now that you have covered introductions to Kafka and Spark, you can start the Logs Processing use case.
LESSON 3
Logs Processing Use Case -
Generating Enriched Logs
This chapter discusses the following.
Overview
In this chapter, you will start with an overview of the Logs Processing use case.
Next, you will create 2 different Jobs. The first Job will generate raw logs of users connected to the Talend website.
And the second Job will enrich the logs with information stored in a MySQL database.
The raw logs and enriched logs will be published to two different Kafka topics.
Objectives
After completing this lesson, you will be able to:
Connect to a MySQL Database in a Big Data Streaming Job.
Join a stream of logs with data saved in a MySQL database in Real-Time.
Publish logs to Kafka topics.
Overview
In this use case, you will simulate users connecting to the Talend website. The users' activity will be tracked with logs.
By processing the logs, you will be able to retrieve useful information about the users, such as the most downloaded product, or how many users visited the services webpage during the last 15 minutes.
The use case has been split into 4 chapters:
Generating Enriched Logs
Monitoring
Reporting
Batch Analysis
Monitoring
Using Elasticsearch and Kibana, you will create a dashboard to monitor the users' activity on the website.
The logs will be processed before saving them to Elasticsearch with different indices so that you can monitor which users connected
from France, from the USA or from other countries.
Reporting
Using a time windowing component, you will accumulate logs for a certain amount of time. Then, the accumulated logs will be processed to generate reports.
You will generate reports to sum up which products were downloaded and to know which users visited the services web pages.
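The windowing idea — accumulate logs for a fixed period, then summarize each accumulated batch into a report — can be sketched as follows (the log fields and window size are hypothetical):

```python
def window_report(logs, window_ms):
    # Group logs into fixed time windows, then count downloads per
    # product within each window — one report per window.
    reports = {}
    for timestamp_ms, product in logs:
        window = timestamp_ms // window_ms
        reports.setdefault(window, {})
        reports[window][product] = reports[window].get(product, 0) + 1
    return reports

logs = [(100, "Studio"), (200, "Studio"), (900, "ESB"), (1100, "Studio")]
# With a 1-second window, the first three logs share window 0.
assert window_report(logs, 1000) == {0: {"Studio": 2, "ESB": 1}, 1: {"Studio": 1}}
```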
Batch Analysis
In the previous labs, the logs were processed to generate dashboards or reports; the logs were not saved as they arrived. In this lab, the enriched logs will be saved in real time in HBase. Next, a Batch Job will be run to extract statistical information from the logs.
You will compute the top 5 downloaded products for a particular country. The country name will be specified as a context variable prompted when the Job starts.
You will now start the use case with the Job to generate the raw logs.
Overview
In this lab, you will use the knowledge acquired in the previous chapters. You will create a Job which will generate fake logs.
You will simulate logs coming from different web servers, representing connections to the Talend website. Each log will contain a user Id and the corresponding URL.
6. Your configuration should be as follows:
7. Click Finish.
8. Create another context group and name it Kafka.
9. Add a context variable named "Broker_list" and set its value to "ClusterCDH54:9092".
10. Click Finish to save your context variable.
11. The 2 context groups should appear in the Repository under Contexts:
The different context variables that you just created will be used quite often in the different Jobs you will create.
Generate Logs
You will use a tRowGenerator component to simulate the logs.
1. Create a new Big Data Streaming Job, running on Spark, and name it GenerateRawLogs.
2. Add a tRowGenerator component and open the RowGenerator editor.
3. Add 2 new columns in the schema and name them user_id and url.
4. Set the user_id type to Integer.
5. The user ID will be a random integer between 10000 and 100000.
In the Functions list, select Numeric.random(int,int) and set the minimum value to 10000 and the maximum value to
100000.
6. The url type remains String.
13. Click the Preview tab and click Preview. You should have a result similar to this:
15. Copy the tRowGenerator component and paste it twice as follows:
2. Double-click tUnite to open the Component view and click Sync columns.
3. At the right side of tUnite, add a tWriteDelimitedFields component and connect it with the Main row.
5. Open the Contexts view and click the Select Context Variables button ( ) and select Kafka and Spark context
groups:
6. Click OK.
The context variables will be displayed in the Context table:
9. In the Topic name box, enter "rawlogs":
Generate Enriched Logs
Overview
In this part, you will create a new Spark Streaming Job which will enrich the raw logs provided by the previous Job.
First you will run a Job that has already been created for you to push customer data to a MySQL database. Then the raw logs will be consumed and enriched with customer information stored in the MySQL database.
The enriched logs will be published to a Kafka topic and will be consumed later.
This is a quite simple Job that will generate customer information and save it in a text file (for reference) and in the MySQL database named users_reference.
3. Run the Job:
5. The Data Preview window should open and show you an extract of what has been saved in the users_reference database:
6. Click Close.
6. In the Topic name box, enter "rawlogs".
Your configuration should be as follows:
7. In order to be combined with the MySQL database, it is necessary to extract the user ID and URL from the Kafka message.
Add a tExtractDelimitedFields and connect with the Main row.
8. Open the Component view and click (...) to edit the schema.
9. Click the green plus sign below the Output table to add 2 new columns.
10. Configure the first column with user_id as its name and set its type to Integer.
11. Configure the second column with url as its name and set its type to String.
12. Your configuration should be as follows:
11. As a result, a SQL query will appear in the Query box:
12. This query must be updated to allow the real-time processing of logs. Modify the Query as follows:
Note: The complete query can be found in the LabCodeToCopy file, in the C:\StudentFiles folder.
13. Double-click tMap to open the Map editor.
14. Click user_id in the row2 table and drag it to the User_id column in the row3 table:
4. Save your Job.
5. Open the GenerateRawLogs Job and run it:
You should see the number of completed batches increase in the lower right corner.
Let the Job run.
6. Open the GenerateEnrichedLogs Job and run it.
This proves that you are able to generate logs and to enrich them with data saved in the users database.
You will now modify the Job to publish the enriched logs to another Kafka topic.
Wrap-Up
Recap
In this lesson you covered the first part of the use case. First, you created a Job to generate logs composed of a user ID and the corresponding URL. Next, based on the user ID, you retrieved various information on the user to enrich the logs before publishing them to a new Kafka topic.
Monitoring Logs 66
Monitoring Enriched Logs 67
Wrap-Up 77
Monitoring Logs
Overview
In the previous chapter, you generated logs and enriched them with user information saved in a MySQL database.
Now, you will monitor the logs. Using the Log Server which is built on top of Logstash, Elasticsearch and Kibana, you will be able to
monitor your logs through a dashboard.
Logstash is a flexible, open source data collection engine. It is designed to efficiently process a growing list of logs and events for distribution into a variety of outputs, including Elasticsearch.
Elasticsearch is a distributed, open source search and analytics engine designed for scalability, reliability and easy management.
Kibana is an open source visualization platform that allows you to interact with your data, stored in Elasticsearch by Logstash, through graphics.
You will create a Job that will process the logs before sending them to Elasticsearch.
Objectives
After completing this lesson, you will be able to:
Process data in a Big Data Streaming Job.
Save logs to Elasticsearch
Use and modify a Kibana dashboard.
You will now build a Job to consume the enriched logs generated in the previous lesson.
Overview
In this lab, you will create a Job that will consume the enriched logs created by the GenerateEnrichedLogs Job. Then, the logs will be
filtered and sent to Elasticsearch.
You will also start the required services to be able to monitor your logs using the Kibana web UI.
7. At the right side of tKafkaInput, add a tExtractDelimitedFields component and connect it with the Main row.
Next, you will process the logs to filter them based on the country before sending the filtered logs to Elasticsearch.
Processing logs
The logs will be filtered to extract the users in France, in the USA, and in the other countries.
1. At the right side of tExtractDelimitedFields, add a tReplicate component and connect it with the Main row.
2. At the right side of tReplicate, add 3 tFilterRow components and connect them to tReplicate with the Main row.
3. Double-click the first tFilterRow component to open the Component view.
4. Click the green plus sign below the Conditions table.
5. In the Input column list, select country.
6. In the Operator list, select ==.
7. In the Value box, enter "France".
Your configuration should be as follows:
8. Using the same steps, configure the second tFilterRow component to extract the users in the USA:
9. Using the same steps, configure the third tFilterRow component to extract the users living in the other countries:
7. Copy the tElasticSearchOutput component and paste it at the right side of the second tFilterRow component.
8. Double-click to open the Component view.
10. Copy the tElasticSearchOutput component and paste it at the right side of the third tFilterRow component.
11. Double-click to open the Component view.
12. In the Type box, enter "others".
Your configuration should be as follows:
When the Job runs, the logs will be filtered and saved with an index named "usersinfo", but with different Types: "frusers", "ususers" and "others".
These Type values will be useful to investigate your logs in the Kibana dashboard. To be able to visualize your Kibana dashboard,
you first need to start the Talend Administration Center and the Log server.
Start services
The Log Server is built on top of Logstash, Elasticsearch and Kibana, and to access the Kibana web UI, the Talend Administration Center must be started.
For more information about the Log Server and Talend Administration Center, you can follow the Talend Data Integration Administration training.
1. In a web browser, navigate to "http://localhost:8080/kibana".
You will reach the default Kibana dashboard:
If you can't reach this page, please refer to the Troubleshooting instructions at the end of the chapter.
You will now configure the dashboard to retrieve the logs generated by the MonitoringLogs Job.
2. In the Query section of the dashboard, close ERROR, WARN, INFO and DEBUG. Delete TRACE to leave the last Query empty:
3. In the Filtering section, click the cross to close the timestamp filter.
4. In the upper right corner, click the gear to open the dashboard properties:
5. Click Index and then, in the Default Index box, enter "usersinfo".
6. Click Save.
7. In the upper right corner of the TIMELINE box, click the cross to close it.
8. Close FILTER BY SEVERITY and GROUP BY SEVERITY.
9. In TABLE, click the gear icon to open the properties.
10. Click Panel.
11. Under the Columns box, click the cross to remove @timestamp, type, priority, message and logger_name.
12. Click Save.
Your Job will execute every second with 2 cores and the default amount of memory for driver and executor.
2. Run the GenerateRawLogs Job.
3. Run the GenerateEnrichedLogs Job.
4. Run the MonitoringLogs Job.
5. Check the results in your Kibana Dashboard.
You should have results similar to the following:
If you look at the columns named _type and _index, you will see the values attributed in the MonitoringLogs Job, in the
tElasticSearchOutput components:
This will automatically filter the information displayed in the dashboard and a filtering condition will appear in the upper part of
the dashboard, in the Filtering section:
7. In the Table listing the ususers logs, you can sort the columns by ascending or descending order.
Click the support column:
8. You can add new diagrams to your dashboard.
In the upper left corner of the Group by Source diagram, click the green plus to add a new diagram:
Your new diagram will appear, giving you more details about the support level. As done previously, clicking on one of the diagram bars will filter the logs to keep only the logs of interest.
13. To go back to the original logs, you can close the filters that appeared in the Filtering section of your Dashboard.
14. Stop your running Jobs: GenerateRawLogs, GenerateEnrichedLogs and MonitoringLogs.
Troubleshooting
If you can't reach the Kibana web page, check that your services are running by clicking the Windows Services icon:
The Talend Administration Center, Talend Logserver, and Logserver Collector services should be started. If they are not, start them manually.
Click the service name, then click Start in the left pane:
Recap
In this lesson you covered the key base knowledge required to monitor your enriched logs using a Kibana dashboard.
You first created a Job to filter users from different countries. Then, you started the services needed to run Elasticsearch and Kibana. Finally, you started your Streaming Jobs and analyzed the results in the Kibana dashboard.
Further Reading
If you want to learn more about the Talend Administration Center and the Log Server, see the Talend Data Integration Administration Training and the Talend Administration Center User Guide.
LESSON 5
Logs Processing Use Case -
Reporting
This chapter discusses the following.
Overview
In this lab, you will build a Job that analyzes the incoming enriched logs to detect users downloading a product or visiting the services web page. Then, on a regular basis, a report containing the corresponding user information will be created and saved to HDFS.
Objectives
After completing this lesson, you will be able to:
Consume messages from a Kafka topic
Use the tWindow component to schedule processing
You will now build a Job to consume and process the enriched logs to create reports.
Overview
You will consume the enriched logs generated by the GenerateEnrichedLogs Job, then process the url column to identify
which page the user navigated to. The goal is to identify users who downloaded products and those who were interested in services
(consulting, training).
Once identified, all information about the users will be saved in a file in HDFS every 10 seconds. To schedule the creation of the file,
you will use a tWindow component.
This component allows you to define a window duration and a triggering duration.
Process URL
The URL is a string that looks like "/download/products/big-data" or "/services/training". Using the "/" separator, you can extract the different parts of the string to identify users interested in services or in downloading a product.
1. At the right side of tExtractDelimitedFields, add another tExtractDelimitedFields and connect it with the Main row.
This second tExtractDelimitedFields component will help you process the URL string.
2. Double-click to open the Component view.
3. Click (...) to open the schema.
4. Copy all the columns in the Input table to the Output table, except for the url column.
5. Click the green plus sign below the Output table to add 4 new columns.
6. Name the columns root, page, specialization and product respectively. Your configuration should be as follows:
When the Job executes, the url column will be split into 4 columns based on the "/" separator. By filtering the page column,
you will be able to identify the users that navigated to the download page or the services page.
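As a plain-Python sketch (not Talend code), the splitting logic can be approximated as follows; the column names come from the steps above, and the padding behavior for shorter URLs is an assumption:

```python
# Hypothetical sketch of the URL extraction: split on "/" and map the
# pieces to the root, page, specialization and product columns.

def split_url(url):
    parts = url.split("/")            # "/services/training" -> ["", "services", "training"]
    parts += [""] * (4 - len(parts))  # assumed: pad shorter URLs with empty strings
    root, page, specialization, product = parts[:4]
    return {"root": root, "page": page,
            "specialization": specialization, "product": product}

print(split_url("/download/products/big-data"))
# -> {'root': '', 'page': 'download', 'specialization': 'products', 'product': 'big-data'}
```

Note that the leading "/" produces an empty root, so the page column carries "download" or "services", which is exactly what the filter step relies on.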
Filter users
1. At the right side of tExtractDelimitedFields_2, add a tReplicate component and connect it with the Main row.
2. At the right side of tReplicate, add a tFilterRow component and connect it with the Main row.
3. Below tFilterRow, add a second tFilterRow component and connect it to tReplicate with the Main row.
4. Double-click the first tFilterRow to open the Component view.
5. The first tFilterRow will filter users from France that downloaded a product.
Below the Conditions table, click the green plus sign twice to add 2 conditions.
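In plain Python, the two conditions amount to a logical AND on the country and page columns; the column names and values here are assumptions based on the earlier steps, not Talend-generated code:

```python
# Hypothetical sketch of the first tFilterRow: keep rows whose country is
# "France" AND whose page is "download" (the two conditions added above).

logs = [
    {"country": "France",  "page": "download"},
    {"country": "France",  "page": "services"},
    {"country": "Germany", "page": "download"},
]

french_downloads = [row for row in logs
                    if row["country"] == "France" and row["page"] == "download"]
print(french_downloads)  # -> [{'country': 'France', 'page': 'download'}]
```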
Generate report
The report will be saved to HDFS as a text file every 10 seconds. To achieve this, you will use tWindow components.
The tWindow component needs a window duration, in milliseconds, and optionally, a slide duration, in milliseconds.
Suppose you want to save a file every 10 minutes about the users who connected to the web site during the last hour. This implies a window duration of 1 hour and a slide duration of 10 minutes.
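The window/slide mechanics can be sketched in plain Python (timestamps in minutes, purely illustrative and not how Spark implements it internally):

```python
# Illustrative sketch of window/slide semantics: every `slide` time units,
# emit the events that fall inside the last `window` time units.

def sliding_windows(events, window, slide, end):
    reports = {}
    for t in range(slide, end + 1, slide):
        reports[t] = [e for e in events if t - window < e <= t]
    return reports

# 1-hour window, 10-minute slide, as in the example above
reports = sliding_windows(events=[5, 15, 25, 65], window=60, slide=10, end=70)
print(reports[60])  # -> [5, 15, 25]: everything from the last hour
print(reports[70])  # -> [15, 25, 65]: the event at minute 5 has aged out
```

When the window duration equals the slide duration, as in this lab, each event lands in exactly one report.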
1. In the Designer, add a tHDFSConfiguration component and open the Component view.
2. In the Property Type list, select Repository and use the HDFSConnection cluster metadata.
3. At the right side of each tFilterRow component add a tWindow component and connect them with the Filter row.
4. Open the Component view.
5. The tWindow component allows you to define a time window duration.
In the Window duration box, enter "10000".
6. Click the Define the slide duration checkbox and enter "10000".
This means that every 10 seconds, the report saved to HDFS will cover the logs accumulated during the last 10 seconds.
7. Repeat the same configuration in the second tWindow component:
8. At the right side of each tWindow component add a tFileOutputDelimited component and connect them with the Main
row.
9. Double-click the first tFileOutputDelimited component to open the Component view.
10. In the Folder box, enter
"/user/student/BDAdvanced/DownloadReports/"
11. Double-click the second tFileOutputDelimited component to open the Component view.
12. In the Folder box, enter
"/user/student/BDAdvanced/ServicesReports/"
The Batch size is 10 seconds, and the Job will run with 1 GB of RAM and 2 cores.
2. Run the GenerateRawLogs Job.
3. Run the GenerateEnrichedLogs Job.
4. Run the Reporting Job.
5. Navigate to Hue "http://ClusterCDH54:8888".
6. Use the File Browser to find the "/user/student/BDAdvanced/DownloadReports" and "/user/student/BDAdvanced/ServicesReports" folders.
8. Click the different folders and open the part-r00000 file to find the information about your users of interest:
Because the files are created in real time, the number of users per file will vary with the incoming logs. But, as expected
for the services report, the folders under /user/student/BDAdvanced/ServicesReports will only contain data
about users who navigated to the services web page.
9. Open the /user/student/BDAdvanced/DownloadReports folder and open the part-r00000 text files in the different subfolders to validate your Job:
You should only find users from France that navigated to the download web page.
You have now completed the lab, and it's time to wrap up.
Recap
In this lesson you covered the key base knowledge required to use the tWindow component. This component allows you to define a
time window duration and a triggering (slide) duration.
Data accumulates during the time window before being processed.
You can now move to the next lab.
In the next lab, instead of filtering the logs and keeping only those of interest, you will collect all the logs and store them in HBase in
real time.
Then, you will analyze the stored logs with a Big Data Batch Job to find the top 5 most-visited pages.
LESSON 6
Logs Processing Use Case - Batch
Analysis
This chapter discusses the following.
Logs analysis 90
Stream Ingestion 91
Logs Batch Analysis 95
Wrap-Up 101
Logs analysis
Overview
In the previous labs, the logs were filtered. In this lab, the logs will be saved in your cluster, in real time, as they come through your
Kafka topic.
Then, in a batch Job, the logs will be processed to extract statistical information, such as the top 5 web pages visited or the top 5 products downloaded by users.
Objectives
After completing this lesson, you will be able to:
Specify the country name through a context variable prompted when the Job starts
Compute top values from accumulated data
Overview
Instead of filtering the logs, you will save them as they arrive in HBase.
First, you will create the HBase table that will be used in your Job. Next, you will create a Big Data Streaming Job to consume the enriched logs from the enrichedlogs Kafka topic and save them in HBase.
6. Click Add an additional column family and name the new family INFO.
Your configuration should be as follows:
7. Click Submit. Your table will appear in the HBase table list.
You can now configure Spark and then you will be able to run your Job.
Configure Spark
This configuration will be a bit different from the others you have done so far because you will use HBase: some
additional properties are necessary for a successful execution of your Job.
1. In the Run view, click Spark Configuration.
2. Select an execution on a Spark Standalone running on a Cloudera CDH 5.4 cluster.
3. Use the appropriate context variables for Spark Host and Spark Home values.
4. Set the Batch size to 10 seconds.
5. Allow 1 Gb of RAM for the driver memory and the executor memory.
6. Click the green plus sign in the Advanced Properties table to add 3 properties.
7. The first property allows 2 cores for the Spark Job to run.
In the Property column, enter "spark.cores.max" and in the Value, enter "2".
8. The 2 other properties specify class paths.
In the Property column, enter "spark.executor.extraClassPath". Then, in the corresponding Value box, enter
"/opt/cloudera/parcels/CDH/lib/hbase/lib/*".
9. In the Property column, enter "spark.driver.extraClassPath".
Then, in the corresponding Value box, enter
"/etc/hbase/conf:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar".
Note: Use the LabCodeToCopy file to avoid typos.
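For reference, the same three properties would look like this on a spark-submit command line; the jar name is a placeholder, and the paths are the CDH parcel defaults quoted above:

```shell
spark-submit \
  --conf spark.cores.max=2 \
  --conf spark.executor.extraClassPath="/opt/cloudera/parcels/CDH/lib/hbase/lib/*" \
  --conf spark.driver.extraClassPath="/etc/hbase/conf:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core-3.1.0-incubating.jar" \
  your_streaming_job.jar   # placeholder: Talend builds and submits the jar for you
```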
4. When your StreamIngestion Job is running and the number of completed batches is increasing, click Logs in the HBase
table list in Hue to check that the logs are being saved:
You now have 3 Big Data Streaming Jobs running to generate logs, enrich them, and save them in HBase.
The next step is to analyze the ingested logs.
Overview
Once saved in HBase, the logs can be processed in batch mode to extract useful information.
You will create a Big Data Batch Job to analyze the logs and get the top 5 downloaded products for a specific country.
7. Add a tHBaseConfiguration component and configure it to use the HBaseConnection metadata:
When the Job executes, the logs will be filtered according to the country name you will specify in the prompt.
14. In the Type list, select Integer.
Your schema should look like the following:
17. At the right side of tAggregateRow, add a tTop component and connect it with the Main row.
18. Double-click to open the Component view.
19. In the Number of line selected box, enter "5".
20. Click the green plus sign below the Criteria table.
21. In the Schema column list, select NbDownloads.
22. In the sort num or alpha? list, select num and in the Order asc or desc? list, select desc.
Your configuration should be as follows:
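The aggregate-then-top logic can be sketched in plain Python; the product names below are made up for illustration:

```python
# Hypothetical sketch of tAggregateRow + tTop: count downloads per product,
# then keep the n products with the highest counts (numeric sort, descending,
# matching the "num"/"desc" settings above).

from collections import Counter

def top_downloads(products, n=5):
    counts = Counter(products)        # tAggregateRow: NbDownloads per product
    return counts.most_common(n)      # tTop: top n rows by NbDownloads, descending

sample = ["big-data", "big-data", "esb", "mdm", "big-data", "esb", "dq"]
print(top_downloads(sample, n=3))
```

In the Job, of course, the counting runs on Spark over the logs stored in HBase rather than on an in-memory list.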
23. At the right side of tTop, add a tLogRow component and connect it with the Main row.
This is where you will give the name of the country you are interested in.
6. Enter the country name and click OK.
7. The Job runs and then the result will be displayed in the Console:
100 | Big Data Advanced - Spark - Lab Guide
Wrap-Up
Recap
In this lesson you covered the key base knowledge required to save messages in HBase in real time.
Once saved, the data can be processed later to extract useful information such as statistics, data models, or classifications of users. Machine learning is the natural next step after data stream ingestion.