

Apache Kafka

WRITTEN BY TIM SPANN, BIG DATA SOLUTION ENGINEER

CONTENTS

• Why Apache Kafka
• About Apache Kafka
• Quickstart for Apache Kafka
• Pub/Sub in Apache Kafka
• Kafka Connect
• Quickstart for Kafka Connect
• Transformations in Connect
• Connect REST API
• Kafka Streams
• KStream vs. KTable
• KStreams DSL
• Querying the States in KStreams
• Exactly-Once Processing in KStreams
• And More...

Why Apache Kafka
Two trends have emerged in the information technology space. First, the diversity and velocity of the data that an enterprise wants to collect for decision-making continue to grow. Such data include not only transactional records, but also business metrics, IoT data, operational metrics, application logs, and so on.

Second, there is a growing need for an enterprise to make decisions in real time based on that collected data. Financial institutions want to not only detect fraud immediately, but also offer a better banking experience through features like real-time alerting, real-time recommendations, more effective customer service, and so on. Similarly, it's critical for retailers to make changes in catalog, inventory, and pricing available as quickly as possible. It is truly a real-time world.

Before Apache Kafka, there wasn't a system that perfectly met both of the above business needs. Traditional messaging systems are real-time, but weren't designed to handle data at scale. Newer systems such as Hadoop are much more scalable, but weren't designed for real-time streaming use cases.

Apache Kafka is a streaming engine for collecting, caching, and processing high volumes of data in real time. As illustrated in Figure 1, Kafka typically serves as part of a central data hub in which all data within an enterprise are collected. The data can then be used for continuous processing or fed into other systems and applications in real time. Kafka is in use by more than 40% of Fortune 500 companies across all industries.

Figure 1: Kafka as a central real-time hub


About Apache Kafka
Kafka was originally developed at LinkedIn in 2010, and it became a top-level Apache project in 2012. It has three main components: Pub/Sub, Kafka Connect, and Kafka Streams. The role of each component is summarized below.

PUB/SUB: Storing and delivering data efficiently and reliably at scale.
KAFKA CONNECT: Integrating Kafka with external data sources and data sinks.
KAFKA STREAMS: Processing data in Kafka in real time.

The main benefits of Kafka are:

1. High throughput: Each server is capable of handling hundreds of MB/sec of data.
2. High availability: Data can be stored redundantly on multiple servers and can survive individual server failures.
3. High scalability: New servers can be added over time to scale out the system.
4. Easy integration with external data sources or data sinks.
5. Built-in real-time processing layer.

Quickstart for Apache Kafka
It's easy to get started with Kafka. The following are the steps to run the quickstart script.

1. Download the Kafka binary distribution from the Apache Kafka website and extract it:

tar -xzf kafka_2.11-2.2.0.tgz
cd kafka_2.11-2.2.0

2. Start the ZooKeeper server:

bin/zookeeper-server-start.sh config/zookeeper.properties

3. Start the Kafka broker:

bin/kafka-server-start.sh config/server.properties

4. Create a topic:

bin/kafka-topics.sh --create --bootstrap-server localhost:9092
    --replication-factor 1 --partitions 1 --topic TopicName

5. Produce data:

bin/kafka-console-producer.sh --broker-list localhost:9092 --topic TopicName
hello
World

6. Consume data:

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092
    --topic TopicName --from-beginning
hello
World

After step 5, one can type in some text in the console. Then, in step 6, the same text will be displayed after running the consumer.

Pub/Sub in Apache Kafka
The first component in Kafka deals with the production and the consumption of the data. The following table describes a few key concepts in Kafka:

TOPIC: Defines a logical name for producing and consuming records.
PARTITION: Defines a non-overlapping subset of records within a topic.
OFFSET: A unique sequential number assigned to each record within a topic partition.
RECORD: A record contains a key, a value, a timestamp, and a list of headers.
BROKER: Servers where records are stored. Multiple brokers can be used to form a cluster.

Figure 2 depicts a topic with two partitions. Partition 0 has 5 records, with offsets from 0 to 4, and partition 1 has 4 records, with offsets from 0 to 3.

Figure 2: Partitions in a topic

The following code snippet shows how to produce records to a topic "test" using the Java API:

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer",
    "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer",
    "org.apache.kafka.common.serialization.StringSerializer");

Producer<String, String> producer = new KafkaProducer<>(props);
producer.send(
    new ProducerRecord<String, String>("test", "key", "value"));
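
The send() call above is asynchronous: it returns immediately and the record is delivered in the background. As a small, hedged sketch in the same style (not part of the original snippet), a callback can be passed to send() to confirm delivery, and the producer should be flushed and closed when the application is done:

producer.send(
    new ProducerRecord<String, String>("test", "key", "value"),
    (metadata, exception) -> {
        // Invoked once the broker has acknowledged the record, or the send has failed
        if (exception != null) {
            exception.printStackTrace();
        } else {
            System.out.printf("wrote to %s-%d at offset %d%n",
                metadata.topic(), metadata.partition(), metadata.offset());
        }
    });
producer.flush();   // block until all buffered records have been sent
producer.close();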


In the above example, both the key and value are strings, so we are using a StringSerializer. It's possible to customize the serializer when types become more complex. For example, the KafkaAvroSerializer from https://docs.confluent.io/current/schema-registry/docs/serializer-formatter.html allows the user to produce Avro records. A second option for serialization is the open source com.hortonworks.registries.schemaregistry.serdes.avro.kafka.KafkaAvroSerializer available from https://registry-project.readthedocs.io/en/latest/examples.html.

The following code snippet shows how to consume records with a string key and value in Java:

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "test-group");   // a consumer group id is required when using subscribe()
props.put("key.deserializer",
    "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer",
    "org.apache.kafka.common.serialization.StringDeserializer");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("test"));
while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records)
        System.out.printf("offset=%d, key=%s, value=%s%n",
            record.offset(), record.key(), record.value());
    consumer.commitSync();
}

Records within a partition are always delivered to the consumer in offset order. By saving the offset of the last consumed record from each partition, the consumer can resume from where it left off after a restart. In the example above, we use the commitSync() API to save the offsets explicitly after consuming a batch of records. One can also save the offsets automatically by setting the property enable.auto.commit to true.
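
As a small illustrative sketch (not part of the original snippet), automatic commits are enabled purely through consumer configuration; both property names below are standard consumer settings:

// Commit offsets automatically in the background instead of calling commitSync()
props.put("enable.auto.commit", "true");
props.put("auto.commit.interval.ms", "1000");   // commit roughly once per second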

Unlike other messaging systems, a record in Kafka is not removed from the broker immediately after it is consumed. Instead, it is retained according to a configured retention policy. The following table summarizes the two common policies:

log.retention.hours: The number of hours to keep a record on the broker.
log.retention.bytes: The maximum size of records retained in a partition.

Kafka Connect
The second component in Kafka is Kafka Connect, which is a framework for making it easy to stream data between Kafka and other systems. As shown in Figure 3, one can deploy a Connect cluster and run various connectors to import data from sources like MySQL, MQ, or Splunk into Kafka and export data from Kafka to sinks such as HDFS, S3, and Elasticsearch. A connector can be either a source or a sink:

Figure 3: Usage of Kafka Connect

The benefits of using Kafka Connect are:

• Parallelism and fault tolerance
• Avoiding ad-hoc code by reusing existing connectors
• Built-in offset and configuration management

Quickstart for Kafka Connect
The following steps show how to run the existing file connector in standalone mode to copy the content from a source file to a destination file via Kafka:

1. Prepare some data in a source file:

echo -e "hello\nworld" > test.txt

2. Start a file source and a file sink connector:

bin/connect-standalone.sh config/connect-standalone.properties
    config/connect-file-source.properties config/connect-file-sink.properties

3. Verify the data in the destination file:

more test.sink.txt
hello
world

4. Verify the data in Kafka:

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092
    --topic connect-test --from-beginning
{"schema":{"type":"string","optional":false},"payload":"hello"}
{"schema":{"type":"string","optional":false},"payload":"world"}
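
For reference, config/connect-file-source.properties is one of the sample files shipped with the Kafka distribution; it looks roughly like the following (exact values may differ by version):

# Sample file source connector configuration
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
# Source file to read and Kafka topic to write to
file=test.txt
topic=connect-test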


In the above example, the data in the source file test.txt is first streamed into a Kafka topic connect-test through a file source connector. The records in connect-test are then streamed into the destination file test.sink.txt. If a new line is added to test.txt, it will show up immediately in test.sink.txt. Note that we achieve the above by running two connectors without writing any custom code.

The following is a partial list of existing connectors; a more complete list can be found at https://www.confluent.io/product/connectors/. Please note that many of these are not open source and may require paid licenses: Elasticsearch (sink), HDFS (sink), Amazon S3 (sink), Cassandra (sink), Oracle (source), MongoDB (source), MQTT (source), JMS (sink), Couchbase (sink/source), JDBC (sink/source), IBM MQ (sink/source), and DynamoDB (sink/source).

Transformations in Connect
Connect is primarily designed to stream data between systems as-is, whereas Streams is designed to perform complex transformations once the data is in Kafka. That said, Kafka Connect does provide a mechanism to perform some simple transformations per record. The following example shows how to enable a couple of transformations in the file source connector.

1. Add the following lines to connect-file-source.properties:

transforms=MakeMap, InsertSource
transforms.MakeMap.type=org.apache.kafka.connect.transforms.HoistField$Value
transforms.MakeMap.field=line
transforms.InsertSource.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.InsertSource.static.field=data_source
transforms.InsertSource.static.value=test-file-source

2. Start a file source connector:

bin/connect-standalone.sh config/connect-standalone.properties
    config/connect-file-source.properties

3. Verify the data in Kafka:

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092
    --topic connect-test
{"line":"hello","data_source":"test-file-source"}
{"line":"world","data_source":"test-file-source"}

In step 1 above, we add two transformations, MakeMap and InsertSource, which are implemented by the classes HoistField$Value and InsertField$Value, respectively. The first one wraps each input string in a field named "line". The second one adds an additional field, "data_source", that indicates the name of the source file. After applying the transformation logic, the data in the input file is transformed into the output shown in step 3. Transformations like these are limited to simple per-record changes; when a transformation step is more complex, it is better implemented with the Streams API (covered in more detail below), as in the following word-count snippet:

final Serde<String> stringSerde = Serdes.String();
final Serde<Long> longSerde = Serdes.Long();

StreamsBuilder builder = new StreamsBuilder();
// build a stream from an input topic
KStream<String, String> source = builder.stream(
    "streams-plaintext-input",
    Consumed.with(stringSerde, stringSerde));
KTable<String, Long> counts = source
    .flatMapValues(value -> Arrays.asList(value.toLowerCase().split(" ")))
    .groupBy((key, value) -> value)
    .count();
// convert the output to another topic
counts.toStream().to("streams-wordcount-output",
    Produced.with(stringSerde, longSerde));

Connect REST API
In production, Kafka Connect typically runs in distributed mode and can be managed through REST APIs. The following table shows the common APIs. You can check the Kafka documentation for more information.

GET /connectors: Return a list of active connectors.
POST /connectors: Create a new connector.
GET /connectors/{name}: Get the information of a specific connector.
GET /connectors/{name}/config: Get the configuration parameters for a specific connector.
PUT /connectors/{name}/config: Update the configuration parameters for a specific connector.
GET /connectors/{name}/status: Get the current status of the connector.
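
As a hedged illustration (the host and connector name below are assumptions), these endpoints map directly onto HTTP requests against a Connect worker, which listens on port 8083 by default:

# List active connectors on a local Connect worker
curl -s http://localhost:8083/connectors

# Check the status of a connector named "local-file-source"
curl -s http://localhost:8083/connectors/local-file-source/status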


Kafka Streams
Kafka Streams is a client library for building real-time applications and microservices, where the input and/or output data is stored in Kafka. The benefits of using Kafka Streams are:

• Less code in the application
• Built-in state management
• Lightweight
• Parallelism and fault tolerance

The most common way of using Kafka Streams is through the Streams DSL, which includes operations such as filtering, joining, grouping, and aggregation. The following code snippet shows the main logic of a Streams example called WordCountDemo:

final Serde<String> stringSerde = Serdes.String();
final Serde<Long> longSerde = Serdes.Long();

StreamsBuilder builder = new StreamsBuilder();
// build a stream from an input topic
KStream<String, String> source = builder.stream(
    "streams-plaintext-input",
    Consumed.with(stringSerde, stringSerde));
KTable<String, Long> counts = source
    .flatMapValues(value -> Arrays.asList(value.toLowerCase().split(" ")))
    .groupBy((key, value) -> value)
    .count();
// convert the output to another topic
counts.toStream().to("streams-wordcount-output",
    Produced.with(stringSerde, longSerde));

The above code first creates a stream from an input topic, streams-plaintext-input. It then applies a transformation to split each input line into words. After that, it groups by the words and counts the number of occurrences of each word. Finally, the results are written to an output topic, streams-wordcount-output.
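
The snippet above shows only the topology. As a minimal sketch in the same style (the application id and broker address below are assumptions, not part of WordCountDemo), the topology is wrapped in a KafkaStreams instance and started like this:

Properties config = new Properties();
config.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-example");
config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

// Build the topology defined above and start processing
KafkaStreams streams = new KafkaStreams(builder.build(), config);
streams.start();

// Close the application cleanly on shutdown
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));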

The following are the steps to run the example code.

1. Create the input topic:

bin/kafka-topics.sh --create --zookeeper localhost:2181
    --replication-factor 1 --partitions 1 --topic streams-plaintext-input

2. Run the stream application:

bin/kafka-run-class.sh org.apache.kafka.streams.examples.wordcount.WordCountDemo

3. Produce some data in the input topic:

bin/kafka-console-producer.sh --broker-list localhost:9092
    --topic streams-plaintext-input
hello world

4. Verify the data in the output topic:

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092
    --topic streams-wordcount-output --from-beginning
    --formatter kafka.tools.DefaultMessageFormatter
    --property print.key=true
    --property print.value=true
    --property key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
    --property value.deserializer=org.apache.kafka.common.serialization.LongDeserializer
hello 1
world 1

KStream vs. KTable
There are two key concepts in Kafka Streams: KStream and KTable. A topic can be viewed as either of the two. Their differences are summarized below.

KSTREAM: Each record is treated as an append to the stream. Used to model append-only data such as click streams.
KTABLE: Each record is treated as an update to an existing key. Used to model updatable reference data such as user profiles.

The following example illustrates the difference between the two:

(KEY, VALUE) RECORDS: ("k1", 2) ("k1", 5)
SUM OF VALUES AS KSTREAM: 7
SUM OF VALUES AS KTABLE: 5

When the topic is viewed as a KStream, there are two independent records and thus the sum of the values is 7. On the other hand, if the topic is viewed as a KTable, the second record is treated as an update to the first record since they have the same key "k1". Therefore, only the second record is retained and the sum is 5 instead.
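
As a brief, hedged illustration (the topic name and value type below are assumptions), the choice between the two views is made when the topic is read:

// The same topic read either as a change stream or as an ever-updating table
KStream<String, Integer> asStream = builder.stream("events");   // append-only view
KTable<String, Integer> asTable = builder.table("events");      // latest value per key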
KStreams DSL
The following tables show a list of common operations available in Kafka Streams:


COMMONLY USED OPERATIONS IN KSTREAM

filter(Predicate)
Create a new KStream that consists of all records of this stream that satisfy the given predicate.
Example:
  ks_out = ks_in.filter( (key, value) -> value > 5 );
  ks_in:  ("k1", 2) ("k2", 7)
  ks_out: ("k2", 7)

map(KeyValueMapper)
Transform each record of the input stream into a new record in the output stream (both key and value type can be altered arbitrarily).
Example:
  ks_out = ks_in.map( (key, value) -> new KeyValue<>(key, key) );
  ks_in:  ("k1", 2) ("k2", 7)
  ks_out: ("k1", "k1") ("k2", "k2")

groupByKey()
Group the records by their current key into a KGroupedStream while preserving the original values.
Example:
  ks_out = ks_in.groupByKey();
  ks_in:  ("k1", 1) ("k2", 2) ("k1", 3)
  ks_out: ("k1", (("k1", 1), ("k1", 3))) ("k2", (("k2", 2)))

join(KTable, ValueJoiner)
Join records of the input stream with records from the KTable if the keys from the records match. Return a stream of the key and the combined value using ValueJoiner.
Example:
  ks_out = ks_in.join( kt, (value1, value2) -> value1 + value2 );
  ks_in:  ("k1", 1) ("k2", 2) ("k3", 3)
  kt:     ("k1", 11) ("k2", 12) ("k4", 13)
  ks_out: ("k1", 12) ("k2", 14)

join(KStream, ValueJoiner, JoinWindows)
Join records of the two streams if the keys match and the timestamps of the records satisfy the time constraint specified by JoinWindows. Return a stream of the key and the combined value using ValueJoiner.
Example (t indicates a timestamp):
  ks_out = ks1.join( ks2, (value1, value2) -> value1 + value2, JoinWindows.of(100) );
  ks1:    ("k1", 1, 100t) ("k2", 2, 200t) ("k3", 3, 300t)
  ks2:    ("k1", 11, 150t) ("k2", 12, 350t) ("k4", 13, 380t)
  ks_out: ("k1", 12)

COMMONLY USED OPERATIONS IN KGROUPEDSTREAM

count()
Count the number of records in this stream by the grouped key and return it as a KTable.
Example:
  kt = kgs.count();
  kgs: ("k1", (("k1", 1), ("k1", 3))) ("k2", (("k2", 2)))
  kt:  ("k1", 2) ("k2", 1)

reduce(Reducer)
Combine the values of records in this stream by the grouped key and return it as a KTable.
Example:
  kt = kgs.reduce( (aggValue, newValue) -> aggValue + newValue );
  kgs: ("k1", (("k1", 1), ("k1", 3))) ("k2", (("k2", 2)))
  kt:  ("k1", 4) ("k2", 2)

windowedBy(Windows)
Further group the records by their timestamp and return it as a TimeWindowedKStream.
Example (t indicates a timestamp):
  twks = kgs.windowedBy( TimeWindows.of(100) );
  kgs:  ("k1", (("k1", 1, 100t), ("k1", 3, 150t))) ("k2", (("k2", 2, 100t), ("k2", 4, 250t)))
  twks: ("k1", 100t -- 200t, (("k1", 1, 100t), ("k1", 3, 150t)))
        ("k2", 100t -- 200t, (("k2", 2, 100t)))
        ("k2", 200t -- 300t, (("k2", 4, 250t)))

A similar set of operations is available on KTable and KGroupedTable. You can check the Kafka documentation for more information.

Querying the States in KStreams
While processing the data in real time, a KStreams application locally maintains state such as the word counts in the previous example. That state can be queried interactively through an API described in the Interactive Queries section of the Kafka documentation. This avoids the need for an external data store for exporting and serving those states.
Exactly-Once Processing in KStreams
Failures in the brokers or the clients may introduce duplicates during the processing of records. KStreams provides the capability of processing records exactly once, even under failures. This can be achieved by simply setting the property processing.guarantee to exactly_once in KStreams. More details on exactly-once processing can be found in the Kafka Confluence space.
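
Concretely, a small sketch of the configuration change (only the processing.guarantee setting is shown; the other Streams properties stay as before):

// Opt the Streams application into exactly-once processing via configuration
Properties config = new Properties();
config.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
// ...set application.id, bootstrap.servers, etc. as usual, then build the KafkaStreams instance with this config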
KSQL
KSQL is an open-source streaming SQL engine that implements continuous, interactive queries against Apache Kafka. It's built using the Kafka Streams API and further simplifies the job of a developer. Currently, KSQL is not part of the Apache Kafka project, but it is available under the Apache 2.0 license.

To see how KSQL works, let's first download it and prepare some data sets.

Clone the KSQL repository and compile the code:

> git clone git@github.com:confluentinc/ksql.git


> cd ksql
> mvn clean compile install -DskipTests

Produce some data in two topics:

> java -jar ksql-examples/target/ksql-examples-0.1-SNAPSHOT-standalone.jar
      quickstart=pageviews format=delimited topic=pageviews maxInterval=10000
> java -jar ksql-examples/target/ksql-examples-0.1-SNAPSHOT-standalone.jar
      quickstart=users format=json topic=users maxInterval=10000

Next, let's define the schema of the input topics. Similar to Kafka Streams, one can define a schema as either a stream or a table.

Start the KSQL CLI:

./bin/ksql-cli local

Create a KStream from a topic:

ksql> CREATE STREAM pageviews_stream
        (viewtime bigint,
         userid varchar,
         pageid varchar)
      WITH (kafka_topic='pageviews',
            value_format='DELIMITED');
ksql> DESCRIBE pageviews_stream;

 Field    | Type
 ----------------------------
 ROWTIME  | BIGINT
 ROWKEY   | VARCHAR(STRING)
 VIEWTIME | BIGINT
 USERID   | VARCHAR(STRING)
 PAGEID   | VARCHAR(STRING)

Create a KTable from a topic:

ksql> CREATE TABLE users_table
        (registertime bigint,
         gender varchar,
         regionid varchar,
         userid varchar)
      WITH (kafka_topic='users',
            value_format='JSON');
ksql> DESCRIBE users_table;

 Field        | Type
 -----------------------------
 ROWTIME      | BIGINT
 ROWKEY       | VARCHAR(STRING)
 REGISTERTIME | BIGINT
 GENDER       | VARCHAR(STRING)
 REGIONID     | VARCHAR(STRING)
 USERID       | VARCHAR(STRING)

Note that in the above, each schema always contains two built-in fields, ROWTIME and ROWKEY. They correspond to the timestamp and the key of the record, respectively. Finally, let's run some KSQL queries using the data and the schema that we prepared.

1. Select a field from a stream:

ksql> SELECT pageid
      FROM pageviews_stream
      LIMIT 3;
Page_24
Page_73
Page_78

2. Join a stream and a table:

ksql> CREATE STREAM pageviews_female
      AS SELECT
        users_table.userid AS userid,
        pageid, regionid, gender
      FROM pageviews_stream
      LEFT JOIN users_table ON
        pageviews_stream.userid = users_table.userid
      WHERE gender = 'FEMALE';
ksql> SELECT userid, pageid, regionid, gender
      FROM pageviews_female;
User_2 | Page_55 | Region_9 | FEMALE
User_5 | Page_14 | Region_2 | FEMALE
User_3 | Page_60 | Region_3 | FEMALE

Note that in step 2 above, the query pageviews_female runs continuously in the background once it's issued. The results from this query are written to a Kafka topic named pageviews_female.


Additional Resources
• Documentation of Apache Kafka: kafka.apache.org/documentation/
• Apache NiFi: nifi.apache.org/
• Apache NiFi and Apache Kafka: dzone.com/articles/real-time-stock-processing-with-apache-nifi-and-ap
• Apache Kafka Summit: kafka-summit.org/
• Apache Kafka Mirroring and Replication: cwiki.apache.org/confluence/display/KAFKA/KIP-382%3A+MirrorMaker+2.0
• Apache Kafka Distribution: cloudera.com/documentation/enterprise/6/6.0/topics/kafka.html
• KSQL Deep Dive: dzone.com/articles/ksql-deep-dive-the-open-source-streaming-sql-engin

Written by Tim Spann, Big Data Solution Engineer


Tim Spann is a Big Data Solution Engineer. He helps educate and disseminate performant open source
solutions for Big Data initiatives to customers and the community. With over 15 years of experience in various
technical leadership, architecture, sales engineering, and development roles, he is well-experienced in all
facets of Big Data, cloud, IoT, and microservices. As part of his community efforts, he also runs the Future of
Data Meetup in Princeton.

