Вы находитесь на странице: 1из 25

Online Learning With Streamdrill

Mikio L. Braun, @mikiobraun http://blog.mikiobraun.de TWIMPACT http://twimpact.com

Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

Event Data
Finance Gaming Monitoring

Advertisment

Sensor Networks

Social Media

Attribution: flickr users kenteegardin, fguillen, torkildr, Docklandsboy, brewbooks, ellbrown, JasonAHowie Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

Event Data is Huge: Volume

The problem: You easily get A LOT OF DATA!


100 events per second 360k events per hour 8.6M events per day 260M events per month 3.2B events per year

Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

Event Data is Huge: Diversity

Potentially large spaces:


distinct words: >100k IP addresses: >100M users in a social network: >10M

http://wordle.net

http://www.flickr.com/photos/arenamontanus/269158554/

Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

Recommenders

Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

Stream Mining to the rescue

Stream mining algorithms:

answer stream queries with finite resources how often does an item appear in a stream? how many distinct elements are in the stream? what are the top-k most frequent items?
Continuous Stream of Data

Typical examples:

Bounded Resource Analyzer

Stream Queries

Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

The Trade-Off
Big Data
Stream Mining Map Reduce and friends

Fast

Exact

First seen here: http://www.slideshare.net/acunu/realtime-analytics-with-apache-cassandra Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

Heavy Hitters (a.k.a. Top-k)

Count activities over large item sets (millions, even more, e.g. IP addresses, Twitter users) Interested in most active elements only.
Case 1: element already in data base paul paul 12 frank paul jan leo conrad stephan 15 12 8 5 3 2 hugo 3 Case 2: new element hugo stephan 2 13

Fixed tables of counts

Metwally, Agrawal, Abbadi, Efficient computation of Frequent and Top-k Elements in Data Streams, Internation Conference on Database Theory, 2005

Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

Heavy Hitters over Time-Window

Time

Keep quite a big log (a month?) Constant write/erase in database Alternative: Exponential decay

DB

Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

Exponential Decay

Instead of a fixed window, use exponential timestamp decay


score halftime

The beauty: updates are recursive

time shift term


Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

Count-Min Sketches

Summarize histograms over large feature sets Like bloom filters, but better
m bins 0 1 0 2 0 1 5 4 3 0 3 5 0 2 2 0 0 0 1 0 2 3 3 2 0 5 7 0 0 2 3 8 n different hash functions

Query result: 1

Updates for new entry

Query: Take minimum over all hash functions

G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. LATIN 2004, J. Algorithm 55(1): 58-75 (2005) .

Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

What you can do with Stream Mining

Count things:

user interactions items purchased pages viewed songs played most active pages most active users

Trends:

Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

But there is more: Counting is Statistics

Empirical mean:

Correlations:

Principal Component Analysis:

Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

Profiling User Behavior

Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

Statistics and Outlier Detection

When does behavior change?

Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

Maximum-Likelihood

Estimate probabilistic models

based on

which is slightly biased, but simpler

But wait, how do I 1/n with randomly spaced events?

Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

Outlier detection

Once you have a model, you can compute p-values (based on recent time frames!)

Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

Online TF-IDF

Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

Online TF-IDF

estimate word document frequencies

for each word: update(word, t, 1.0) for each document: update(#docs, t, 1.0) query: score(word) / score(#docs)

Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

Streamdrill

Heavy Hitters counting + exponential decay Instant counts & top-k results over time windows. Indices! Snapshots for historical analysis Beta demo available at http://streamdrill.com, launch imminent

Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

Architecture Overview

Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

Example: Twitter Stock Analysis

http://play.streamdrill.com/vis/
Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

Example: Twitter Stock Analysis

Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

Example: Twitter Stock Analysis

Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

Summary

Doesn't always have to be scaling! Stream mining: Approximate results with finite resources. streamdrill: stream analysis engine

Recommender Stammtisch, plista, Berlin, June 5, 2013 2013 Mikio L. Braun

Вам также может понравиться