https://id.linkedin.com/pub/ghulam-imaduddin/32/a21/507
ghulam@ideweb.co.id
[BIG] DATA ANALYTICS

WHAT’S IN THIS SLIDE
• Intro & Data Trends
• Type of Analytics
• Tech Approach
• Methodology
Source: http://www.cision.com/us/2012/10/big-data-and-big-analytics/
DATA VS BIG DATA
Big data is just data with:
• More volume
• Faster data generation (velocity)
• Multiple data formats (variety)
[1] http://e27.co/worlds-data-volume-to-grow-40-per-year-50-times-by-2020-aureus-20150115-2/
CHALLENGES
More data = more storage space
More storage = more money to spend (RDBMS servers need very costly storage)
APPROACH
[Diagram: user data flowing into Hadoop and RDBMS; sources include IMEI & TAC, social media data, location, URL access, CDR, complaints, device info, and surveys]
ANALYTICS LIFECYCLE

Business Understanding
- Gathering problem information
- Defining the goal to solve the problem
- Defining expected output
- Defining hypothesis
- Defining analysis methodology
- Measuring the business value

Data Understanding
- Define variables to support hypothesis
- Cleaning & transforming the data
- Create longitudinal data/trend data
- Ingesting additional data if needed
- Build analytical data mart

Data Mining & Modeling
- Defining target variable
- Splitting data for training and validating the model
- Defining analysis time frame for training and validation
- Correlation analysis and variable selection
- Selecting right data mining algorithm
- Do validation by measuring accuracy, sensitivity, and model lift
- Data mining and modeling is an iterative process
Type of Analysis
• Do we need only descriptive analysis, or do we need to go with predictive analysis?
Supervised or Unsupervised?
• Do we need to build unsupervised clustering/segmentation for this analysis?
Putra, B. P. (2015). Analisis Sentimen Layanan Telekomunikasi pada Pengguna Media Sosial Twitter [Sentiment Analysis of Telecommunication Services among Twitter Users]. Jakarta: Universitas Indonesia.
WORKFLOW

Data Collection
- Create a Twitter crawler with Python and the Twitter API
- Run the crawler with selected keywords, parse, and store to RDBMS
- Collection covers tweets generated in April 2015

Data Labeling
- Label some samples for the training dataset
- This part is done with crowdsourcing

Data Preparation
- Deduplication
- Convert to lower case
- Tokenization
- Filter stop words

Data Modeling
- Generate word vectors and train using machine learning algorithms based on the training dataset
- Using SVM and C4.5
- The result is 2 different models
- Select the best model by comparing the accuracy
WORKFLOW

Data Scoring
- Using the best model, score the rest of the dataset
- The scoring result is a label (positive/negative/neutral) for each tweet

NPS Calculation
- Aggregate the scoring results by telco provider to get the count of positive and negative tweets
- Calculate the NPS for each telco provider
- Visualize the result as a bar chart
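The NPS calculation above can be sketched in plain Python (the `(provider, label)` input layout is an illustrative assumption, not taken from the original code):

```python
def nps_per_provider(scored_tweets):
    """Compute NPS per telco provider from scored tweets.

    scored_tweets: iterable of (provider, label) pairs, where label is
    "positive", "negative", or "neutral".
    NPS = (% positive tweets - % negative tweets), on a -100..100 scale.
    """
    counts = {}
    for provider, label in scored_tweets:
        total, pos, neg = counts.get(provider, (0, 0, 0))
        counts[provider] = (total + 1,
                            pos + (label == "positive"),
                            neg + (label == "negative"))
    return {p: (pos - neg) / total * 100.0
            for p, (total, pos, neg) in counts.items()}
```

For example, a provider with 2 positive, 1 negative, and 1 neutral tweet gets an NPS of (2 - 1) / 4 × 100 = 25.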
DATA COLLECTION
We run the crawler 3 times, once for each operator. We only search for tweets containing certain keywords:
• General terms: Telepon (phone), SMS, Internet, Jaringan (network)
• Operator names: Telkomsel, Indosat, XL
Parse the JSON result using the JSON parser library embedded in Python 2.7 and form it as CSV (comma-separated values).
Load the CSV into a database (we use MySQL in this experiment).
DATA LABELING
The objective is to build the ground truth.
We use a crowdsourcing approach: we build an online questionnaire and ask people to label each tweet as negative, positive, or neutral.
We label 100 tweets ourselves as validated tweets for questionnaire validation.
We put 20 tweets in each questionnaire: 5 tweets for Indosat, 5 for XL, 5 for Telkomsel, and the remaining 5 are random validated tweets.
If 4 out of 5 validated tweets are answered correctly, we flag the questionnaire as valid.
This approach is used to eliminate answers submitted by people who fill in the questionnaire randomly.
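The 4-out-of-5 validation rule above can be expressed as a short check (the dict-based representation of answers is an illustrative assumption):

```python
def questionnaire_is_valid(answers, validated_labels, threshold=4):
    """Check a questionnaire against the embedded pre-labeled (validated) tweets.

    answers: dict mapping tweet id -> label the respondent chose
    validated_labels: dict mapping validated tweet id -> our own label
    Returns True when at least `threshold` of the validated tweets were
    answered correctly (4 out of 5 in the slides above).
    """
    correct = sum(1 for tid, label in validated_labels.items()
                  if answers.get(tid) == label)
    return correct >= threshold
```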
DATA PREPARATION
The deduplication process removes duplicated tweets.
Tokenization is the process of splitting a sentence into words. This must be done because the model generates word vectors instead of sentence vectors.
DATA PREPARATION
Filtering stop words: we eliminate non-useful words (words that don't carry a positive or negative meaning).
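The preparation steps (deduplication, case conversion, tokenization, stop-word filtering) were done with RapidMiner operators in the original work; a minimal plain-Python sketch of the same pipeline:

```python
import re

def prepare(tweets, stop_words):
    """Deduplicate, lowercase, tokenize on non-alphanumeric runs, drop stop words."""
    seen = set()
    result = []
    for tweet in tweets:
        if tweet in seen:          # deduplication
            continue
        seen.add(tweet)
        tokens = re.split(r"[^a-z0-9]+", tweet.lower())   # case conversion + tokenization
        result.append([t for t in tokens
                       if t and t not in stop_words])     # stop-word filter
    return result
```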
TOOLS USED
Data preparation and modeling are done with the RapidMiner software.
RapidMiner has text analysis functions and procedures; we can find procedures to tokenize, convert case, deduplicate, and filter stop words.
RapidMiner also has the SVM and C4.5 algorithms for modeling.
MODEL ACCURACY
Model accuracy is measured with a confusion matrix:

Accuracy = (TP + TN) / (TP + FP + TN + FN)
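The formula in code, together with sensitivity (recall), which the lifecycle slide also lists as a validation metric:

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of correctly classified samples: (TP + TN) / all samples."""
    return (tp + tn) / (tp + fp + tn + fn)

def sensitivity(tp, fn):
    """True positive rate (recall): TP / (TP + FN)."""
    return tp / (tp + fn)
```

For example, with TP = 40, FP = 10, TN = 35, FN = 15, accuracy is 75 / 100 = 0.75.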
DATA EXPLORATION
Finding top users

tweets.select("user.screen_name").rdd.
  map(x => (x(0).toString, 1)).
  reduceByKey(_+_).
  map(_.swap).
  sortByKey(false).
  map(_.swap).
  take(10).
  foreach(println)
DATA EXPLORATION
Finding top words
tweets.select("text").rdd.
flatMap(x => x(0).toString.toLowerCase.
split("[^A-Za-z0-9]+")).
map(x => (x,1)).
filter(x => x._1.length >= 3).
reduceByKey(_+_).
map(_.swap).
sortByKey(false).
map(_.swap).
take(20).foreach(println)
DATA EXPLORATION
Finding top words with stop word exclusion
val stop_words = sc.textFile("/user/ghulam/stopwords.txt")
val bc_stop = sc.broadcast(stop_words.collect)
tweets.select("text").rdd.
flatMap(x => x(0).toString.toLowerCase.split("[^A-Za-z0-9]+")).
map(x => (x,1)).
filter(x => x._1.length > 3 && !bc_stop.value.contains(x._1)).
reduceByKey(_+_).
map(_.swap).sortByKey(false).map(_.swap).
take(20).foreach(println)
DATA EXPLORATION
Words Chain (Market Basket Analysis)
import org.apache.spark.mllib.fpm.FPGrowth
val stop_words = sc.broadcast(sc.textFile("/user/hadoop-user/ghulam/stopwords.txt").collect)
val tweets = sqlContext.jsonFile("/user/flume/tweets/2015/09/01/*/*")
val trx = tweets.select("text").rdd.
filter(!_(0).toString.toLowerCase.contains("ini 20 finalis aplikasi")).
filter(!_(0).toString.toLowerCase.contains("telkomsel jaring 20 devel")).
filter(!_(0).toString.toLowerCase.contains("[jual")).
filter(!_(0).toString.toLowerCase.contains("lelang acc")).
filter(!_(0).toString.toLowerCase.matches(".*theme.*line.*")).
filter(!_(0).toString.toLowerCase.matches(".*fol.*back.*")).
filter(!_(0).toString.toLowerCase.matches(".*favorite.*digital.*")).
filter(!_(0).toString.toLowerCase.startsWith("rt @")).
map(x => x(0).toString.toLowerCase.split("[^A-Za-z0-9]+").
  filter(x => x.length > 3 && !stop_words.value.contains(x)).distinct)
val fpg = new FPGrowth().setMinSupport(0.01).setNumPartitions(10)
val model = fpg.run(trx)
model.freqItemsets.filter(x => x.items.length >= 3).take(20).foreach {
itemset =>
println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq)
}
WORLD TRENDS
2015 HYPE CYCLE
Big-data-related topics at the top of the hype curve:
• Advanced analytics
• IoT
• Machine Learning
DATA SCIENTIST
Data scientist/analyst is one of the sexiest and most rapidly emerging jobs in the market
WHERE TO START
LET’S GET OUR HANDS DIRTY
SKILLS NEEDED
DOMAIN KNOWLEDGE
SKILLS NEEDED
Business Acumen
In terms of data science, being able to discern which problems are
important to solve for the business is critical, in addition to identifying
new ways the business should be leveraging its data.
Python, Scala, and SQL
SQL skills is a must! Python and Scala also become a common language to
do data processing, along with Java, Perl, or C/C++
Hadoop Platform
It is heavily preferred in many cases. Having experience with Hive or Pig is
also a strong selling point. Familiarity with cloud tools such as Amazon S3
can also be beneficial.
SAS, R, or other predictive analytics tools
In-depth knowledge of at least one of these analytical tools is expected; for data science, R is generally preferred. Along with this, statistical knowledge is also important.
SKILLS NEEDED
Intellectual curiosity
Curiosity to dig deeper into the data and to solve a problem by finding its root cause.
Communication & Presentation
Companies searching for a strong data scientist are looking for
someone who can clearly and fluently translate their technical findings
to a non-technical team. A data scientist must enable the business to
make decisions by arming them with quantified insights