Вы находитесь на странице: 1из 162

Hadoop Application

Architectures: Fraud
Detection
Strata + Hadoop World, London 2016
tiny.cloudera.com/app-arch-london
tiny.cloudera.com/app-arch-questions
Gwen Shapira | @gwenshap
Jonathan Seidman | @jseidman
Ted Malaska | @ted_malaska
Mark Grover | @mark_grover
Logistics
§ Break at 10:30 – 11:00 AM
§ Questions at the end of each section
§ Slides at tiny.cloudera.com/app-arch-london
§ Code at https://github.com/hadooparchitecturebook/fraud-
detection-tutorial

Questions? tiny.cloudera.com/app-arch-questions
About the book
§ @hadooparchbook
§ hadooparchitecturebook.com
§ github.com/hadooparchitecturebook
§ slideshare.com/hadooparchbook

Questions? tiny.cloudera.com/app-arch-questions
About the presenters
Ted Malaska Jonathan Seidman
§ Principal Solutions Architect § Senior Solutions Architect/
at Cloudera Partner Enablement at
§ Previously, lead architect at Cloudera
FINRA § Contributor to Apache
§ Contributor to Apache Sqoop.
Hadoop, HBase, Flume, § Previously, Technical Lead
Avro, Pig and Spark on the big data team at
Orbitz, co-founder of the
Chicago Hadoop User
Group and Chicago Big Data

Questions? tiny.cloudera.com/app-arch-questions
About the presenters
Gwen Shapira Mark Grover
§ System Architect at § Software Engineer on
Confluent Spark at Cloudera
§ Committer on Apache § Committer on Apache
Kafka, Sqoop Bigtop, PMC member on
§ Contributor to Apache Apache Sentry(incubating)
Flume § Contributor to Apache
Spark, Hadoop, Hive,
Sqoop, Pig, Flume
Questions? tiny.cloudera.com/app-arch-questions
Case Study Overview
Fraud Detection
Credit Card Transaction Fraud

Questions? tiny.cloudera.com/app-arch-questions
Video Game Strategy

Questions? tiny.cloudera.com/app-arch-questions
Health Insurance Fraud

Questions? tiny.cloudera.com/app-arch-questions
Network anomaly detection

Questions? tiny.cloudera.com/app-arch-questions
Ok, enough with the
use-cases
Can we move on?
How Do We React
§  Human Brain at Tennis
-  Muscle Memory
-  Fast Thought
-  Slow Thought

Questions? tiny.cloudera.com/app-arch-questions
Back to network anomaly detection
§  Muscle memory – for quick action (e.g. reject events)
-  Real time alerts
-  Real time dashboard
§  Platform for automated learning and improvement
-  Real time (fast thought)
-  Batch (slow thought)
§  Slow thought
-  Audit trail and analytics for human feedback

Questions? tiny.cloudera.com/app-arch-questions
Use case
Network Anomaly Detection
Use Case – Network Anomaly Detection
§  “Beaconing” – Malware “phoning home”.

badhost.com

myserver.com

Questions? tiny.cloudera.com/app-arch-questions
Use Case – Network Anomaly Detection
§  Distributed Denial of Service attacks.

badhost.com
badhost.com
badhost.com
badhost.com
badhost.com
badhost.com
badhost.com
badhost.com
badhost.com myserver.com
badhost.com
badhost.com
badhost.com
badhost.com
badhost.com

Questions? tiny.cloudera.com/app-arch-questions
Use Case – Network Anomaly Detection
§  Network intrusion attempts – attempted or successful break-ins to a network.

Questions? tiny.cloudera.com/app-arch-questions
Use Case – Network Anomaly Detection

All of these require the ability to detect deviations from known patterns.

Questions? tiny.cloudera.com/app-arch-questions
Other Use Cases
Could be used for any system requiring detection of anomalous events:

Etc…
Credit card transactions Market transactions Internet of Things

Questions? tiny.cloudera.com/app-arch-questions
Input Data – NetFlow
§  Standard protocol defining a format for capturing network traffic flow through
routers.
§  Captures network data as flow records. These flow records provide information
on source and destination IP addresses, traffic volumes, ports, etc.
§  We can use this data to detect normal and anomalous patterns in network
traffic.

Questions? tiny.cloudera.com/app-arch-questions
Why is Hadoop a great
fit?
Hadoop
§ In traditional sense
- HDFS (append-only file system)
- MapReduce (really slow but robust execution engine)
§ Is not a great fit

Questions? tiny.cloudera.com/app-arch-questions
Why is Hadoop a Great Fit?
§ But the Hadoop ecosystem is!
§ More than just MapReduce + YARN + HDFS

Questions? tiny.cloudera.com/app-arch-questions
Volume
§ Have to maintain millions of device profiles
§ Retain flow history
§ Make and keep track of automated rule changes

Questions? tiny.cloudera.com/app-arch-questions
Velocity
§ Events arriving concurrently and at high velocity
§ Make decisions in real-time

Questions? tiny.cloudera.com/app-arch-questions
Variety
§ Maintain simple counters in profile (e.g. throughput
thresholds, purchase thresholds)
§ Iris or finger prints that need to be matched

Questions? tiny.cloudera.com/app-arch-questions
Challenges of Hadoop Implementation

Questions? tiny.cloudera.com/app-arch-questions
Challenges of Hadoop Implementation

Questions? tiny.cloudera.com/app-arch-questions
Challenges - Architectural Considerations
§  Storing state (for real-time decisions):
-  Local Caching? Distributed caching (e.g. Memcached)? Distributed storage (e.g. HBase)?
§  Profile Storage
-  HDFS? HBase?
§  Ingestion frameworks (for events):
-  Flume? Kafka? Sqoop?
§  Processing engines (for background processing):
-  Storm? Spark? Trident? Flume interceptors? Kafka Streams?
§  Data Modeling
§  Orchestration
-  Do we still need it for real-time systems?

Questions? tiny.cloudera.com/app-arch-questions
Case Study
Requirements
Overview
But First…

Real-Time? Near Real-Time? Stream Processing?

Questions? tiny.cloudera.com/app-arch-questions
Some Definitions
§  Real-time – well, not really. Most of what we’re talking about here is…
§  Near real-time:
-  Stream processing – continual processing of incoming data.
-  Alerting – immediate (well, as fast as possible) responses to incoming events.
Getting closer to real-time.

Typical Hadoop Processing – Minutes to Hours Fraud Detection


seconds milliseconds
Data Transform Stream Alerts
Sources Extract Load Processing !

Questions? tiny.cloudera.com/app-arch-questions
Requirements – Storage/Modeling
§  Need long term storage for large volumes of profile, event, transaction, etc. data.

Questions? tiny.cloudera.com/app-arch-questions
Requirements – Storage/Modeling
§  Need to be able to quickly retrieve specific profiles and respond quickly to incoming
events.

Questions? tiny.cloudera.com/app-arch-questions
Requirements – Storage/Modeling
§  Need to store sufficient data to analyze whether or not fraud is present.

Questions? tiny.cloudera.com/app-arch-questions
Requirements – Alerts
§  Need to be able to respond to incoming events quickly (milliseconds).
§  Need high throughput and low latency.
§  Processing requirements are minimal.
§  Two types of alerts: “atomic”, and “atomic with context”

Questions? tiny.cloudera.com/app-arch-questions
Requirements – Ingestion

Questions? tiny.cloudera.com/app-arch-questions
Requirements – Ingestion
§  Reliable – we don’t want to lose events.

event,event,event,event,event,event…

Questions? tiny.cloudera.com/app-arch-questions
Requirements – Ingestion
§  Support for multiple targets.

Data Sources

Storage

Questions? tiny.cloudera.com/app-arch-questions
Requirements – Ingestion
§  Needs to support high throughput of large volumes of events.

Questions? tiny.cloudera.com/app-arch-questions
Requirements – Stream Processing
§  A few seconds to minutes response times.
§  Need to be able to detect trends, thresholds, etc. across profiles.
§  Quick processing more important than 100% accuracy.

Questions? tiny.cloudera.com/app-arch-questions
Requirements – Batch Processing
§  Non real-time, “off-line”, exploratory processing and reporting.
§  Data needs to be available for analysts and users to analyze with tools of choice –
SQL, MapReduce, etc.
§  Results need to be able to feed back into NRT processing to adjust rules, etc.

"Parmigiano reggiano factory". Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/File:Parmigiano_reggiano_factory.jpg#/media/File:Parmigiano_reggiano_factory.jpg

Questions? tiny.cloudera.com/app-arch-questions
High level architecture
High level architecture
Real time
dashboard

Events Anomaly Alerts/Events


Buffer Buffer Long term
detector storage

Storage
Batch processing (ML/ Reporting info
NRT manual adjustments) Analytics interface
NRT processing Profiles/Rules engine
External
systems
Batch
processing

Questions? tiny.cloudera.com/app-arch-questions
High level architecture
Real time
dashboard

Events Anomaly Alerts/Events


Buffer Buffer Long term
detector storage

Storage
Batch processing (ML/ Reporting info
NRT manual adjustments) Analytics interface
NRT processing Profiles/Rules engine
External
systems
Batch
processing

Questions? tiny.cloudera.com/app-arch-questions
Full Architecture
Events
Kafka

Hadoop Cluster I
Hadoop Cluster I

Flume
Flume Batch Time
(Source)
Flume
(Source) Adjustments
(Source) HBase and/or
Interceptor(Rules) HBase Store
and/or
Fetching & Updating Profiles/Rules Memory
Interceptor (Rules) Memory Store

Alerts/Events
Hadoop Cluster II

Storage NRT/Stream Processing


Alerts
Kafka Spark Streaming
Adjusting
Flume Channel NRT stats

Batch Processing

Automated & Manual


Impala
Events Analytical Adjustments and
Flume
HDFS Pattern detection
(Sink)
HBase Map/Reduce

Reporting
Spark

Questions? tiny.cloudera.com/app-arch-questions
Storage Layer
Considerations
Storage Layer Considerations
§ Two likely choices for long-term storage of data:

Questions? tiny.cloudera.com/app-arch-questions
Data Storage Layer Choices

§ Stores data directly as files § Stores data as Hfiles on HDFS


§ Fast scans § Slow scans
§ Poor random reads/writes § Fast random reads/writes

Questions? tiny.cloudera.com/app-arch-questions
Storage Considerations
§  Batch processing and reporting requires access to all raw and processed data.
§  Stream processing and alerting requires quickly fetching and saving profiles.

Questions? tiny.cloudera.com/app-arch-questions
Data Storage Layer Choices

§ For ingesting raw data. § For reading/writing profiles.


§ Batch processing, also § Fast random reads and writes
some stream processing. facilitates quick access to
large number of profiles.

Questions? tiny.cloudera.com/app-arch-questions
But…

Is HBase fast enough for fetching profiles?

Hadoop Cluster I

Flume
Flume
(Source)
Flume
(Source)
(Source) HBase and/or
Interceptor(Rules) HBase Store
and/or
Fetching & Updating Profiles/Rules Memory
Interceptor (Rules) Memory Store

Questions? tiny.cloudera.com/app-arch-questions
What About Caching?

Process

microseconds milliseconds

Memory

Questions? tiny.cloudera.com/app-arch-questions
Caching Options
§  Local in-memory cache (on-heap, off-heap).
§  Remote cache
-  Distributed cache (Memcached, Oracle Coherence, etc.)
-  HBase BlockCache

Questions? tiny.cloudera.com/app-arch-questions
Caching – HBase BlockCache
§  Allows us to keep recently used data blocks in memory.
§  Note that this will require recent versions of HBase and specific configurations
on our cluster.

Questions? tiny.cloudera.com/app-arch-questions
Our Storage Choices
Events
Kafka

Hadoop Cluster I
Hadoop Cluster I

Flume
Flume Batch Time
(Source)
Flume
(Source) Adjustments
(Source) HBase and/or
Interceptor(Rules) HBase Store
and/or
Fetching & Updating Profiles/Rules Memory
Local Cache (Profiles) Memory Store

Alerts/Events
Hadoop Cluster II

Storage NRT/Stream Processing


Alerts
Kafka Spark Streaming
Adjusting
Flume Channel NRT stats

Batch Processing

Automated & Manual


Impala
Events Analytical Adjustments and
Flume
HDFS Pattern detection
(Sink)
HBase Map/Reduce

Reporting
Spark

Questions? tiny.cloudera.com/app-arch-questions
HBase Data Modeling
Considerations
HBase Data Modeling Considerations

Tables
Row Key Column Values

Questions? tiny.cloudera.com/app-arch-questions
HBase Data Modeling Considerations
§  Tables
-  Minimize # of Tables
-  Reduce/Remove Joins
-  # Region
-  Split policy

Questions? tiny.cloudera.com/app-arch-questions
HBase Data Modeling Considerations
§  RowKeys
-  Location Location Location
-  Salt is good for you

Questions? tiny.cloudera.com/app-arch-questions
HBase Data Modeling Considerations

Questions? tiny.cloudera.com/app-arch-questions
What is a Salt

Date Time

Math.abs(DateTime.hash() % numOfSplits)

# # # # Date Time

Questions? tiny.cloudera.com/app-arch-questions
Scan a Salted Table

Mapper Mapper Mapper Mapper

Region Server Region Server

Region Region Region Region

Questions? tiny.cloudera.com/app-arch-questions
Scan a Salted Table
Mapper
Mapper

Filtering on Client

Filter

Region Server
0001

0001

0002

0002

0002

… Region

Region

Questions? tiny.cloudera.com/app-arch-questions
Filtered
Filtered
Filtered

0001
Filtered
Filtered
Needed

Filtered
Filtered
Filtered
0002 Filtered
Filtered
Needed
Region

Filtered
Filtered
Mapper (One Big Scan)

Filtered

Filtered
Questions? tiny.cloudera.com/app-arch-questions

Filtered
Needed
Scan a Salted Table

Filtered
Filtered
Filtered
0060

Filtered
Filtered
Needed
Filtered
Filtered
Filtered

0001
Filtered
Filtered
Needed

Filtered
Filtered
Filtered
0002

Filtered
Filtered
Needed
Region

Filtered
Filtered
Mapper (Scan Per Salt)

Filtered

Filtered
Questions? tiny.cloudera.com/app-arch-questions

Filtered
Needed

Filtered
Scan a Salted Table

Filtered
Filtered
0060

Filtered
Filtered
Needed
HBase Data Modeling Considerations
§  Columns
-  Mega Columns
-  Single Value Columns
-  Millions of Columns
-  Increment Values

Questions? tiny.cloudera.com/app-arch-questions
Ingest and Near Real-
Time Alerting
Considerations
Events
Kafka

Hadoop Cluster I
Hadoop Cluster I

Flume
Flume Batch Time
(Source)
Flume
(Source) Adjustments
(Source) HBase and/or
Interceptor(Rules) HBase Store
and/or
Fetching & Updating Profiles/Rules Memory
Interceptor (Rules) Memory Store

Alerts/Events
Hadoop Cluster II

Storage NRT/Stream Processing


Alerts
Kafka Spark Streaming
Adjusting
Flume Channel NRT stats

Batch Processing

Automated & Manual


Impala
Events Analytical Adjustments and
Flume
HDFS Pattern detection
(Sink)
HBase Map/Reduce

Reporting
Spark

Questions? tiny.cloudera.com/app-arch-questions
The basic workflow
Fetch & Update
Here is an event. Is it suspicious?
Profiles

Network Network Event


Queue Queue Data Store
Device Events Handler
3
1 2

Questions? tiny.cloudera.com/app-arch-questions
1
The Queue
§  What makes Apache Kafka a good choice?
-  Low latency
-  High throughput
-  Partitioned and replicated
-  Easy to plug to applications

Questions? tiny.cloudera.com/app-arch-questions
The Basics
§ Messages are organized into topics
§ Producers push messages
§ Consumers pull messages
§ Kafka runs in a cluster. Nodes are called brokers

Questions? tiny.cloudera.com/app-arch-questions
Topics, Partitions and Logs

Questions? tiny.cloudera.com/app-arch-questions
Each partition is a log

Questions? tiny.cloudera.com/app-arch-questions
Each Broker has many partitions

Partition 0 Partition 0 Partition 0

Partition 1 Partition 1 Partition 1

Partition 2 Partition 2 Partion 2

Questions? tiny.cloudera.com/app-arch-questions
Producers load balance between partitions

Client

Partition 0 Partition 0 Partition 0

Partition 1 Partition 1 Partition 1

Partition 2 Partition 2 Partion 2

Questions? tiny.cloudera.com/app-arch-questions
Producers load balance between partitions

Client

Partition 0 Partition 0 Partition 0

Partition 1 Partition 1 Partition 1

Partition 2 Partition 2 Partion 2

Questions? tiny.cloudera.com/app-arch-questions
Consumers
Order retained within
Kafka Cluster Consumer Group X partition

Consumer
Topic

Partition A (File) Consumer

Off Set Y
Off Set X
Consumer

Partition B (File)
Consumer Group Y
Off Set X

Off Set Y
Order retained with in
partition but not over
partitions

Partition C (File) Consumer


Off Set X

Off Set Y

Offsets are kept per


consumer group
Questions? tiny.cloudera.com/app-arch-questions
In our use-case
§  Topic for profile updates
§  Topic for current events
Partitioned by device ID
§  Topic for alerts
§  Topic for rule updates

Questions? tiny.cloudera.com/app-arch-questions
2
The event handler
§  Very basic pattern:
-  Consume a record
-  “do something to it” Kafka Processor Pattern
-  Produce zero, one, or more results

Questions? tiny.cloudera.com/app-arch-questions
Easier than you think
§  Kafka provides load balancing and availability
-  Add processes when needed – they will get their share of partitions
-  You control partition assignment strategy
-  If a process crashes, partitions will get automatically reassigned
-  You control when an event is “done”
-  Send events safely and asynchronously
§  So you focus on the use-case, not the framework

Questions? tiny.cloudera.com/app-arch-questions
Inside The Processor
Flume Source
Kafka
Flume Source
Rules Topic Flume Source

Flume Interceptor

Kafka Consumer

Kafka Producer
Kafka Event Processing Logic Kafka

Events Topic Alerts Topic


Local HBase Client
Memory
Kafka

Profile/Rules Updates Topic

HBase

Questions? tiny.cloudera.com/app-arch-questions
Scaling the Processor
Partitioner

Producer Kafka Flume Source

Events Topic Flume Source


Initial Events Topic
Flume Source
Partition A
Flume Interceptor

Kafka Consumer

Kafka Producer
Partitioner

Event Processing Logic Kafka


Producer Partition B
Alerts Topic
Local HBase
Memory Client
Partition C
Kafka
Partitioner

Profile/Rules Topic
Producer
Profile/Rules Topic
Partition A

Better use of local HBase


memory

Questions? tiny.cloudera.com/app-arch-questions
3
Ingest – Gateway to deep analytics
§  Data needs to end up in:
-  SparkStreaming
-  Can read directly from Kafka
-  HBase
-  HDFS

Questions? tiny.cloudera.com/app-arch-questions
Considerations for Ingest
§  Async – nothing can slow down the immediate reaction
-  Always ingest the data from a queue in separate thread or process
to avoid write-latency
§  Push vs Pull
§  Useful end-points – integrates with variety of data sources and sinks
§  Supports standard formats and possibly some transformations
-  Data format inside the queue can be different than in the data store
For example Avro -> Parquet

Questions? tiny.cloudera.com/app-arch-questions
More considerations - Safety
§  Reliable – data loss is not acceptable
-  Either “at-least-once” or “exactly-once”
-  When is “exactly-once” possible?
§  Error handling – reasonable reaction to unreasonable data
§  Security – Authentication, authorization, audit, lineage

Questions? tiny.cloudera.com/app-arch-questions
Good Patterns
§  Ingest all things. Events, evidence, alerts
-  It has potential value
-  Especially when troubleshooting
-  And for data scientists
§  Make sure you take your schema with you
-  Yes, your data probably has a schema
-  Everyone accessing the data (web apps, real time, stream, batch) will need to
know about it
-  So make it accessible in a centralized schema registry

Questions? tiny.cloudera.com/app-arch-questions
Choosing Ingest
Our choice for
this example

Flume Kafka Connect Spark Streaming


• Part of Hadoop • Part of Apache Kafka • Part of Hadoop
• Key-value based • Schema preserving • Very strong transformation
story
• Supports HDFS and • Many connectors
HBase • Not as simple as Flume or
• Well integrated support for Connect
• Easy to configure Parquet, Avro and JSON • Supports HDFS and HBase
• Serializers provide • Exactly-once delivery • Smart Kafka Integration
conversions guarantee • Exactly-once guarantee
• At-least once delivery • Easy to manage and scale from Kafka
guarantee
• Proven in huge
production environments

Questions? tiny.cloudera.com/app-arch-questions
Ingestion - Flume

Flume HDFS Sink


Sink
Kafka Cluster
Sink HDFS
Topic
Sink

Partition A
Flume SolR Sink
Sink
Partition B Sink
SolR
Sink

Partition C Flume HBase Sink


Sink
Sink
HBase
Sink

Questions? tiny.cloudera.com/app-arch-questions
Ingestion – Kafka Connect
Kafka Cluster
Topic Copycat Worker
Partition A Convertalizer Connector

Partition B

Copycat Worker
Partition C
Convertalizer Connector HDFS

Offsets

Configuration Copycat Worker

Convertalizer Connector
Schemata

Questions? tiny.cloudera.com/app-arch-questions
Alternative Architectures
§  What if…

Questions? tiny.cloudera.com/app-arch-questions
Do we really need HBase?
§  What if… Kafka would store all information used for real-time processing?

Here is an event. Is is authorized?

Network
Events
Queue Event handler

Network
Device

Fetch & Update


Profiles

Questions? tiny.cloudera.com/app-arch-questions
Log Compaction in Kafka
§  Kafka stores a “change log” of profiles
§  But we want the latest state
§  log.cleanup.policy = {delete / compact}

Questions? tiny.cloudera.com/app-arch-questions
Do we really need Spark Streaming?
§  What if… we could do low-latency single event processing and complex stream
processing analytics in the same system?
Adjust rules and statistics
Spark Streaming

Do we really need two of these?

Network
Queue Event handler Queue Data Store
Events

Network
Device

Questions? tiny.cloudera.com/app-arch-questions
Is complex processing of single events possible?
§  The challenge:
-  Low latency vs high throughput
-  No data loss
-  Windows, aggregations, joins
-  Late data

Questions? tiny.cloudera.com/app-arch-questions
Solutions?
§  Apache Flink
-  Process each event, checkpoint in batches
-  Local and distributed state, also checkpointed
§  Apache Samza
-  State is persisted to Kafka (Change log pattern)
-  Local state cached in RocksDB
§  Kafka Streams
-  Similar to Samza
-  But in simple library that is part of Apache Kafka.
-  Late data as updates to state

Questions? tiny.cloudera.com/app-arch-questions
You Probably Don’t Need
Orchestration

Questions? tiny.cloudera.com/app-arch-questions
Processing
Considerations
Reminder: Full Architecture
Events
Kafka

Hadoop Cluster I
Hadoop Cluster I

Flume
Flume Batch Time
(Source)
Flume
(Source) Adjustments
(Source) HBase and/or
Interceptor(Rules) HBase Store
and/or
Fetching & Updating Profiles/Rules Memory
Interceptor (Rules) Memory Store

Alerts/Events
Hadoop Cluster II

Storage NRT/Stream Processing


Alerts
Kafka Spark Streaming
Adjusting
Flume Channel NRT stats

Batch Processing

Automated & Manual


Impala
Events Analytical Adjustments and
Flume
HDFS Pattern detection
(Sink)
HBase Map/Reduce

Reporting
Spark

Questions? tiny.cloudera.com/app-arch-questions
Processing

Hadoop Cluster II

Storage NRT/Stream Processing

Spark Streaming

Batch Processing

Impala

HDFS Map/Reduce

Spark

Questions? tiny.cloudera.com/app-arch-questions
Two Kinds of Processing
Streaming Batch

Batch Processing

NRT/Stream Processing
Impala

Spark Streaming
Map/Reduce

Spark

Questions? tiny.cloudera.com/app-arch-questions
Stream Processing
Considerations
Stream Processing Options
1.  Storm
2.  Spark Streaming
3.  KafkaStreams
4.  Flink

Questions? tiny.cloudera.com/app-arch-questions
Why Spark Streaming?
§  There’s one engine to know.
§  Micro-batching helps you scale reliably.
§  Exactly once counting
§  Hadoop ecosystem integration is baked in.
§  No risk of data loss.
§  It’s easy to debug and run.
§  State management
§  Huge eco-system backing
§  You have access to ML libraries.
§  You can use SQL where needed.
§  Wide-spread use and hardened

Questions? tiny.cloudera.com/app-arch-questions
Spark Streaming Example
1.  val conf = new SparkConf().setMaster("local[2]”)
2.  val ssc = new StreamingContext(conf, Seconds(1))
3.  val lines = ssc.socketTextStream("localhost", 9999)
4.  val words = lines.flatMap(_.split(" "))
5.  val pairs = words.map(word => (word, 1))
6.  val wordCounts = pairs.reduceByKey(_ + _)
7.  wordCounts.print()
8.  SSC.start()

Questions? tiny.cloudera.com/app-arch-questions
Spark Streaming Example
1.  val conf = new SparkConf().setMaster("local[2]”)
2.  val sc = new SparkContext(conf)
3.  val lines = sc.textFile(path, 2)
4.  val words = lines.flatMap(_.split(" "))
5.  val pairs = words.map(word => (word, 1))
6.  val wordCounts = pairs.reduceByKey(_ + _)
7.  wordCounts.print()

Questions? tiny.cloudera.com/app-arch-questions
Spark SQL and Spark Streaming
§  Catalyst
-  Spark SQL’s optimizer
§  Bringing Catalyst and Spark SQL integration to streaming

Questions? tiny.cloudera.com/app-arch-questions
Spark Streaming
DStream

Source Receiver RDD

DStream

Single Pass
Source Receiver RDD
Filter Count Print
First
Batch

RDD

DStream

Source Receiver RDD

Second
Batch Single Pass
RDD
Filter Count Print

RDD

Questions? tiny.cloudera.com/app-arch-questions
Spark Streaming DStream

Pre-first Source Receiver RDD


Batch

DStream

Single Pass
Source Receiver RDD
Filter Count
First
Batch

RDD Stateful RDD 1

Print

DStream

RDD
Source Receiver Stateful RDD 1
partitions

RDD Single Pass


Parition Filter Count
Second
Batch

RDD Stateful RDD 2

Questions? tiny.cloudera.com/app-arch-questions
Print
Spark Streaming and HBase
Executor
Static Space

HConnection

Driver
Tasks Tasks
Configs
Tasks Tasks

Executor
Static Space

HConnection

Tasks Tasks

Tasks Tasks

Questions? tiny.cloudera.com/app-arch-questions
HBase-Spark Module
•  BulkPut
•  BulkDelete
•  BulkGet
•  BulkLoad (Wide and Tail)
•  ForeachPar99on(It, Conn)
•  MapPar99on(it, Conn)
•  Spark SQL

Questions? tiny.cloudera.com/app-arch-questions
HBase-Spark Module
val hbaseContext = new HBaseContext(sc, config);
hbaseContext.bulkDelete[Array[Byte]](rdd,
tableName,
putRecord => new Delete(putRecord),
4);

val hbaseContext = new HBaseContext(sc, config)


rdd.hbaseBulkDelete(hbaseContext,
tableName,
putRecord => new Delete(putRecord),
4)

Questions? tiny.cloudera.com/app-arch-questions
HBase-Spark Module
val hbaseContext = new HBaseContext(sc, config)
rdd.hbaseBulkDelete(tableName)

Questions? tiny.cloudera.com/app-arch-questions
HBase-Spark Module
val hbaseContext = new HBaseContext(sc, config)
rdd.hbaseForeachPartition(hbaseContext, (it, conn) => {
val bufferedMutator = conn.getBufferedMutator(TableName.valueOf("t1"))
...
bufferedMutator.flush()
bufferedMutator.close()
})
val getRdd = rdd.hbaseMapPartitions(hbaseContext, (it, conn) => {
val table = conn.getTable(TableName.valueOf("t1"))
var res = mutable.MutableList[String]()
...
})

Questions? tiny.cloudera.com/app-arch-questions
HBase-Spark Module
rdd.hbaseBulkLoad (tableName,
t => {
Seq((new KeyFamilyQualifier(t.rowKey, t.family,
t.qualifier), t.value)).
iterator
},
stagingFolder)

Questions? tiny.cloudera.com/app-arch-questions
HBase-Spark Module
val df = sqlContext.load("org.apache.hadoop.hbase.spark",
Map("hbase.columns.mapping" -> "KEY_FIELD STRING :key, A_FIELD STRING c:a,
B_FIELD STRING c:b,",
"hbase.table" -> "t1"))
df.registerTempTable("hbaseTmp")

sqlContext.sql("SELECT KEY_FIELD FROM hbaseTmp " +


"WHERE " +
"(KEY_FIELD = 'get1' and B_FIELD < '3') or " +
"(KEY_FIELD <= 'get3' and B_FIELD = '8')").foreach(r => println(" - " + r))

Questions? tiny.cloudera.com/app-arch-questions
Demo Architecture

Impala

Spark
Generator Kafka
Streaming ??? SparkSQL

Spark
SparkSQL
MlLib

Questions? tiny.cloudera.com/app-arch-questions
KafkaStreams
§  Event-at-a-time processing

§  State-full processing

§  Windowing with out-of-order data

§  Auto scaling / fault tolerant / reprocessing / etc

Questions? tiny.cloudera.com/app-arch-questions
KafkaStreams Example
public static void main(String[] args) {
// specify the processing topology by first reading in a stream from a topic
Kstream<String, String> words = builder.stream(“topic1”);

// count the words in this stream as an aggregated table


Ktable<String, Long> counts = words.countByKey(“Counts”);

// write the result table to a new topic


counts.to(“topic2”)

// create stream processing instance and run in


KafkaStreams streams = new KafkaStreams(builder, config);
streams.start();
}

Questions? tiny.cloudera.com/app-arch-questions
Key KafkaStreams Ideas
§  Operators (join, aggregate, source, sink) are nodes in DAG
§  Stateful operators are persisted to Kafka – process states are a STREAM
§  Data parallelism via topic partitions
§  STREAM can be a changelog of a TABLE
§  TABLE is a snapshot of a STREAM
§  Flow is based on event time (when the event was created)

Questions? tiny.cloudera.com/app-arch-questions
New Hadoop Storage Option
Use case break up

Unstructured Structured Data Fixed Columns Any Type of Column


Data Schemas Schemas
SQL + Scan Use Cases
Deep Storage SQL + Scan Use Gets / Puts / Micro
Scan Use Cases Cases Scans

Questions? tiny.cloudera.com/app-arch-questions
New Hadoop Storage Option
Use case break up

Unstructured Structured Data Fixed Columns Any Type of Column


Data Schemas Schemas
SQL + Scan Use Cases
Deep Storage SQL + Scan Use Gets / Puts / Micro
Scan Use Cases Cases Scans

Questions? tiny.cloudera.com/app-arch-questions
Real-Time Analytics with Kudu

Impala

Spark
Generator Kafka
Streaming Kudu SparkSQL

Spark
SparkSQL
Mllib

Questions? tiny.cloudera.com/app-arch-questions
Kudu and Spark
h-ps://github.com/tmalaska/SparkOnKudu

Same Development From HBase-Spark is now on Kudu and Spark
•  Full Spark RDD integraDon
•  Full Spark Streaming integraDon
•  IniDal SparkSQL integraDon

Questions? tiny.cloudera.com/app-arch-questions
Processing
Batch
Why batch in a NRT use case?
§ For background processing
- To be later used for quick decision making
§ Exploratory analysis
- Machine Learning
§ Automated rule changes
- Based on new patterns
§ Analytics
- Who’s attacking us?
Questions? tiny.cloudera.com/app-arch-questions
Processing Engines (Past)
§  Hadoop = HDFS (distributed FS) + MapReduce (Processing engine)

Questions? tiny.cloudera.com/app-arch-questions
Processing Engines (present)

Hive&(in&beta)&

Dato&(Graphlab)&
Spark&Streaming&

Storm/Trident&
Cascading&
Cascading&

Cascading&
Spark&SQL&
Mahout&

GraphX&
Crunch&
Crunch&
Giraph&

MLlib&

Impala&

Presto&

Oryx&
Hive&

Drill&

H20&
Pig&&

Hive&
Pig&

Pig&
MapReduce& Spark& Tez&

Hadoop&Storage&managers&(HDFS/HBase/Solr)&

General&purpose& AbstracHon& Graph&processing&


execuHon&engines& engines& engines&
Storage&
managers&
SQL&engines& RealOHme&& Machine&Learning&
frameworks& engines&

Questions? tiny.cloudera.com/app-arch-questions
Processing Engines
§  MapReduce
§  Abstractions
§  Spark
§  Impala

Questions? tiny.cloudera.com/app-arch-questions
MapReduce
§  Oldie but goody
§  Restrictive Framework / Innovative Workarounds
§  Extreme Batch

Questions? tiny.cloudera.com/app-arch-questions
MapReduce Basic High Level

Mapper Reducer

HDFS
Block of (Replicated)
Output File
Data

Native File System


Temp Spill Partitioned Reducer
Data Sorted Data Local Copy

Questions? tiny.cloudera.com/app-arch-questions
Abstractions
§  SQL
-  Hive
§  Script/Code
-  Pig: Pig Latin
-  Crunch: Java/Scala
-  Cascading: Java/Scala

Questions? tiny.cloudera.com/app-arch-questions
Spark
§  The New Kid that isn’t that New Anymore
§  Easily 10x less code
§  Extremely Easy and Powerful API
§  Very good for iterative processing (machine learning, graph processing, etc.)
§  Scala, Java, and Python support
§  RDDs
§  Has Graph, ML and SQL engines built on it
§  R integration (SparkR)

Questions? tiny.cloudera.com/app-arch-questions
Spark - DAG

Questions? tiny.cloudera.com/app-arch-questions
Spark - DAG
TextFile Filter KeyBy

Join Filter Take

TextFile KeyBy

Questions? tiny.cloudera.com/app-arch-questions
Spark - DAG
TextFile Filter KeyBy

Good Good Good-Replay

Good Good Good-Replay

Good Good Good-Replay


Join Filter Take

Lost Block Future Future

Good Future Future

TextFile KeyBy

Good Good-Replay

Good-Replay Lost Block


Replay

Good Good-Replay

Questions? tiny.cloudera.com/app-arch-questions
Is Spark replacing Hadoop?
§  Spark is an execution engine
-  Much faster than MR
-  Easier to program
§  Spark is replacing MR

Questions? tiny.cloudera.com/app-arch-questions
Impala
§  MPP Style SQL Engine on top of Hadoop
§  Very Fast
§  High Concurrency
§  Analytical windowing functions

Questions? tiny.cloudera.com/app-arch-questions
Impala – Broadcast Join
Impala Daemon Impala Daemon Impala Daemon

100% Cached 100% Cached 100% Cached


Smaller Table Smaller Table Smaller Table

Smaller Table Smaller Table


Data Block Data Block

Output Output Output

Impala Daemon Impala Daemon Impala Daemon

Hash Join Function Hash Join Function Hash Join Function


100% Cached 100% Cached 100% Cached
Smaller Table Smaller Table Smaller Table

Bigger Table Bigger Table Bigger Table


Data Block Data Block Data Block

Questions? tiny.cloudera.com/app-arch-questions
Impala – Partitioned Hash Join
Impala Daemon Impala Daemon Impala Daemon

Hash Partitioner ~33% Cached Hash Partitioner ~33% Cached ~33% Cached
Smaller Table Smaller Table Smaller Table

Smaller Table Smaller Table


Data Block Data Block

Output Output Output

Impala Daemon Impala Daemon Impala Daemon


Hash Join Function Hash Join Function Hash Join Function
Hash Partitioner Hash Partitioner Hash Partitioner
33% Cached 33% Cached 33% Cached
Smaller Table Smaller Table Smaller Table

BiggerTable Data BiggerTable Data BiggerTable Data


Block Block Block

Questions? tiny.cloudera.com/app-arch-questions
Presto
§  Facebook’s SQL engine replacement for Hive
§  Written in Java
§  Doesn’t build on top of MapReduce
§  Very few commercial vendors supporting it

Questions? tiny.cloudera.com/app-arch-questions
Batch processing in
our use-case
Batch processing in fraud-detection
§  Reporting
-  How many events come daily?
-  How does it compare to last year?
§  Machine Learning
-  Automated rules changes
§  Other processing
-  Tracking device history and updating profiles

Questions? tiny.cloudera.com/app-arch-questions
Reporting
§  Need a fast JDBC-compliant SQL framework
§  Impala or Hive or Presto?
§  We choose Impala
-  Fast
-  Highly concurrent
-  JDBC/ODBC support

Questions? tiny.cloudera.com/app-arch-questions
Machine Learning
§  Options
-  MLLib (with Spark)
-  Oryx
-  H2O
-  Mahout
§  We recommend MLLib because of larger use of Spark in our architecture.

Questions? tiny.cloudera.com/app-arch-questions
Other Processing
§  Options:
-  MapReduce
-  Spark
§  We choose Spark
-  Much faster than MR
-  Easier to write applications due to higher level API
-  Re-use of code between streaming and batch

Questions? tiny.cloudera.com/app-arch-questions
Overall Architecture
Overview
High level architecture
Real time
dashboard

Events Anomaly Alerts/Events


Buffer Buffer Long term
detector storage

Storage
Batch processing (ML/ Reporting info
NRT manual adjustments) Analytics interface
NRT processing Profiles/Rules engine
External
systems
Batch
processing

Questions? tiny.cloudera.com/app-arch-questions
Events
Kafka

Hadoop Cluster I
Hadoop Cluster I

Flume
Flume Batch Time
(Source)
Flume
(Source) Adjustments
(Source) HBase and/or
Interceptor(Rules) HBase Store
and/or
Fetching & Updating Profiles/Rules Memory
Interceptor (Rules) Memory Store

Alerts/Events
Hadoop Cluster II

Storage NRT/Stream Processing


Alerts
Kafka Spark Streaming
Adjusting
Flume Channel NRT stats

Batch Processing

Automated & Manual


Impala
Events Analytical Adjustments and
Flume
HDFS Pattern detection
(Sink)
HBase Map/Reduce

Reporting
Spark

Questions? tiny.cloudera.com/app-arch-questions
Storage

Events
Kafka

Hadoop Cluster I
Hadoop Cluster I

Flume
Flume Batch Time
(Source)
Flume
(Source) Adjustments
(Source) HBase and/or
Interceptor(Rules) HBase Store
and/or
Fetching & Updating Profiles/Rules Memory
Interceptor (Rules) Memory Store

Alerts/Events
Hadoop Cluster II

Storage NRT/Stream Processing


Alerts
Kafka Spark Streaming
Adjusting
Flume Channel NRT stats

Batch Processing

Automated & Manual


Impala
Events Analytical Adjustments and
Flume
HDFS Pattern detection
(Sink)
HBase Map/Reduce

Reporting
Spark

Questions? tiny.cloudera.com/app-arch-questions
HDFS Local Cache HBase

Raw and Processed Profiles


Profiles
Events (Disk/Cache)

Questions? tiny.cloudera.com/app-arch-questions
NRT Processing/Alerting

Events
Kafka

Hadoop Cluster I
Hadoop Cluster I

Flume
Flume Batch Time
(Source)
Flume
(Source) Adjustments
(Source) HBase and/or
Interceptor(Rules) HBase Store
and/or
Fetching & Updating Profiles/Rules Memory
Interceptor (Rules) Memory Store

Alerts/Events
Hadoop Cluster II

Storage NRT/Stream Processing


Alerts
Kafka Spark Streaming
Adjusting
Flume Channel NRT stats

Batch Processing

Automated & Manual


Impala
Events Analytical Adjustments and
Flume
HDFS Pattern detection
(Sink)
HBase Map/Reduce

Reporting
Spark

Questions? tiny.cloudera.com/app-arch-questions
Flume Source
Kafka
Flume Source
Rules Topic Flume Source

Flume Interceptor

Kafka Consumer

Kafka Producer
Kafka Event Processing Logic Kafka

Transactions Topic Replies Topic


Local HBase Client
Memory
Kafka

Updates Topic

HBase

Questions? tiny.cloudera.com/app-arch-questions
Ingestion

Events
Kafka

Hadoop Cluster I
Hadoop Cluster I

Flume
Flume Batch Time
(Source)
Flume
(Source) Adjustments
(Source) HBase and/or
Interceptor(Rules) HBase Store
and/or
Fetching & Updating Profiles/Rules Memory
Interceptor (Rules) Memory Store

Alerts/Events
Hadoop Cluster II

Storage NRT/Stream Processing


Alerts
Kafka Spark Streaming
Adjusting
Flume Channel NRT stats

Batch Processing

Automated & Manual


Impala
Events Analytical Adjustments and
Flume
HDFS Pattern detection
(Sink)
HBase Map/Reduce

Reporting
Spark

Questions? tiny.cloudera.com/app-arch-questions
Flume HDFS Sink
Sink
Kafka Cluster
Sink HDFS
Topic
Sink

Partition A
Flume SolR Sink
Sink
Partition B Sink
SolR
Sink

Partition C Flume Hbase Sink


Sink
Sink
HBase
Sink

Questions? tiny.cloudera.com/app-arch-questions
Processing

Events
Kafka

Hadoop Cluster I
Hadoop Cluster I

Flume
Flume Batch Time
(Source)
Flume
(Source) Adjustments
(Source) HBase and/or
Interceptor(Rules) HBase Store
and/or
Fetching & Updating Profiles/Rules Memory
Interceptor (Rules) Memory Store

Alerts/Events
Hadoop Cluster II

Storage NRT/Stream Processing


Alerts
Kafka Spark Streaming
Adjusting
Flume Channel NRT stats

Batch Processing

Automated & Manual


Impala
Events Analytical Adjustments and
Flume
HDFS Pattern detection
(Sink)
HBase Map/Reduce

Reporting
Spark

Questions? tiny.cloudera.com/app-arch-questions
Stream
Hadoop Cluster II
Processing
Storage NRT/Stream Processing

Spark Streaming
Reporting
Batch Processing

Impala

Batch, ML,
HDFS Map/Reduce
etc.
Spark

Questions? tiny.cloudera.com/app-arch-questions
Demo!
Where else to find us?
Book Signing!
§  This evening at 5:35 PM at the Cloudera booth.

Questions? tiny.cloudera.com/app-arch-questions
Other Sessions
§  Ask Us Anything session (all) – Thursday, 2:05 pm
§  Reliability guarantees in Kafka (Gwen) – Thursday, 12:05 pm
§  Putting Kafka into overdrive (Gwen) – Thursday, 5:25 pm
§  Introduction to Apache Spark for Java and Scala developers (Ted) – Friday, 2:05
pm

Questions? tiny.cloudera.com/app-arch-questions
Thank you!
@hadooparchbook
tiny.cloudera.com/app-arch-london
Gwen Shapira | @gwenshap
Jonathan Seidman | @jseidman
Ted Malaska | @ted_malaska
Mark Grover | @mark_grover

Вам также может понравиться