
Big Data
Volume, Variety, Velocity (streaming)
Structured data -
weblog
Unstructured data -
Semi-structured data - XML, JS

Tool: Spark -
Yarn
RDD: Resilient Distributed Dataset

Data sampling
SMOTE: Synthetic Minority Over-sampling Technique
minority data
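
The SMOTE idea noted above (creating synthetic minority-class points) can be sketched in plain Python. This is a simplified illustration and not a reference implementation: the function name `smote`, the parameters `n_new`, `k` and `seed`, and the sample points are all assumptions made for this example. Each synthetic point is placed at a random position on the line segment between a minority sample and one of its `k` nearest minority neighbours.

```python
import math
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (simplified)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Sort the other minority points by Euclidean distance to the base point
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(p, base))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class observations (two features each)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=3)
```

Because every new point is an interpolation between two existing minority points, the synthetic data stays inside the region the minority class already occupies.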

Hadoop - software
HDFS - 128 MB
3 datanode
MapReduce
key + value
Yarn - node manager
HBase - DB
Hive - SQL

Spark
transformation
caching
Java, Python, R, Scala; DataFrame
shared variable

Type of data
Continuous:
Categorical
Nominal: a limited set of values, no ordering
Ordinal: a limited set of values, with ordering
Binary: 2 values (0, 1)

Principal Component Analysis (PCA) -
eigenvector, eigenvalue analysis
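
For the PCA note above (eigenvector and eigenvalue analysis), a minimal sketch: build the covariance matrix of the data, then find its largest eigenvalue and the matching eigenvector, which is the first principal component. Power iteration is used here only to keep the example dependency-free; it is a swapped-in technique and assumes a unique largest eigenvalue and a start vector that is not orthogonal to its eigenvector. The function names and the data are hypothetical.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of a list of equal-length numeric rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(cov, iterations=200):
    """Power iteration: (largest eigenvalue, unit eigenvector) of cov."""
    d = len(cov)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T C v gives the eigenvalue for the unit vector v
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return eigenvalue, v

# Hypothetical data lying exactly on the line y = 2x
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
eigenvalue, eigenvector = first_component(covariance_matrix(data))
```

In this example all the variance lies along one direction, so the first eigenvector points along (1, 2) and its eigenvalue equals the total variance.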
Yarn

Missing values treatments
Replace -
Delete -
Keep -
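
The three missing-value treatments listed above (replace, delete, keep) can be illustrated on a small list. This is a sketch under assumptions: the `ages` values are made up, `None` stands for a missing value, and median imputation is only one possible choice for "replace".

```python
from statistics import median

# Hypothetical observations; None marks a missing value
ages = [23, None, 31, 45, None, 38]

# Replace: impute missing entries with the median of the observed values
observed = [a for a in ages if a is not None]
fill = median(observed)
replaced = [fill if a is None else a for a in ages]

# Delete: drop the observations that have a missing value
deleted = observed

# Keep: retain the missing value and add an indicator flag,
# since the fact that a value is missing can itself be informative
kept = [(a, a is None) for a in ages]
```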

Data collecting
incompleteness, duplication
merging problems, inconsistencies






Imbalanced data set

1 partition, Hive bucket
name node

Categorization
Nominal, Binary
categorical variables: categories 0, 1 (No, Yes)

Text: cosine distance
K-means clustering -

Outlier detection and treatment
Valid observation: range, z-score/IQR
Truncation, Capping
Invalid observation
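
For the truncation / capping note under outlier treatment, a minimal sketch using z-scores: values further than a chosen number of standard deviations from the mean are pulled back to that limit. The function name `cap_by_zscore`, the default limit of 3 and the data are assumptions for illustration only.

```python
from statistics import mean, stdev

def cap_by_zscore(values, limit=3.0):
    """Cap values at mean +/- limit * standard deviation."""
    m, s = mean(values), stdev(values)
    lower, upper = m - limit * s, m + limit * s
    return [min(max(v, lower), upper) for v in values]

# Hypothetical data: 1..19 plus one extreme value
values = list(range(1, 20)) + [250]
capped = cap_by_zscore(values)
```

Only the extreme value is changed; every observation inside the limits is returned as it was.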

Multivariate outliers
Mahalanobis distance
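
The Mahalanobis distance measures how far a point lies from the centre of the data while taking the variances and the covariance of the variables into account, which is why it suits multivariate outliers. Below is a sketch restricted to two variables so that the covariance matrix can be inverted by hand; the function name `mahalanobis_2d` and the data are hypothetical, and the code assumes the covariance matrix is not singular.

```python
import math

def mahalanobis_2d(point, rows):
    """Mahalanobis distance of a 2-D point from the mean of rows."""
    n = len(rows)
    mx = sum(r[0] for r in rows) / n
    my = sum(r[1] for r in rows) / n
    sxx = sum((r[0] - mx) ** 2 for r in rows) / (n - 1)
    syy = sum((r[1] - my) ** 2 for r in rows) / (n - 1)
    sxy = sum((r[0] - mx) * (r[1] - my) for r in rows) / (n - 1)
    det = sxx * syy - sxy ** 2  # must be non-zero
    dx, dy = point[0] - mx, point[1] - my
    # d^2 = [dx dy] S^-1 [dx dy]^T, with S^-1 = (1/det) [[syy, -sxy], [-sxy, sxx]]
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)

# Hypothetical reference data
rows = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
distance = mahalanobis_2d((3.0, 1.0), rows)
```

A point can then be flagged as a multivariate outlier when its distance exceeds a chosen threshold.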

Streaming K-means -





Histograms
Box plots: 3 quartiles, IQR = Q3 - Q1
IQR * 1.5
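
The box plot rule above can be written out directly: compute Q1 and Q3, take IQR = Q3 - Q1, and flag values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. This is a sketch; the function name and the data are made up, and the "inclusive" quantile method is one of several conventions for computing quartiles.

```python
from statistics import quantiles

def iqr_outliers(values):
    """Return the values outside the 1.5 * IQR fences."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical data with one extreme value
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
outliers = iqr_outliers(data)
```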
Variable selection
noisy feature
Filter
Chi-squared

Dissimilarity or Similarity

Standardization
scale variables
range: min/max
Z-score: standard deviation
Decimal scaling
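
The three scaling methods named in these notes (min/max, z-score, decimal scaling) can be compared side by side. This is a sketch with hypothetical function names and data: min/max maps values to the range 0 to 1, z-score subtracts the mean and divides by the standard deviation, and decimal scaling divides by the smallest power of 10 that brings every absolute value below 1.

```python
from statistics import mean, stdev

def min_max(values):
    """Rescale to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Subtract the mean, divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def decimal_scaling(values):
    """Divide by the smallest power of 10 that makes every |value| < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

# Hypothetical variable to scale
incomes = [1000, 1500, 2000, 2500, 3000]
scaled_range = min_max(incomes)
scaled_z = z_score(incomes)
scaled_decimal = decimal_scaling(incomes)
```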



Invalid observation: missing value treatment (replace, delete, keep)
Valid observation

Descriptive Analytics
Data modeling: clustering analysis
similarity
Distance
churn, fraud, credit risk

Categorization
Continuous, Ordinal
Equal interval binning: 2 bins by range (1-5, 5-10)
Equal frequency binning: 2 bins (1,2,3 and 4,6,8)
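
The two binning schemes in these notes, equal interval and equal frequency, can be sketched as follows using the values 1, 2, 3, 4, 6, 8 from the notes. The function names are hypothetical, and the overall range of 0 to 10 passed to the equal-interval function is an assumption chosen so that the two bins are equally wide.

```python
def equal_interval_bins(values, n_bins, lo=None, hi=None):
    """Assign each value to one of n_bins bins of equal width over [lo, hi]."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    width = (hi - lo) / n_bins
    # A value equal to hi goes into the last bin instead of opening a new one
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Assign bins so that each holds (about) the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * n_bins // len(values)
    return labels

values = [1, 2, 3, 4, 6, 8]
by_interval = equal_interval_bins(values, 2, lo=0, hi=10)
by_frequency = equal_frequency_bins(values, 2)
```

Equal-interval binning puts four values in the first bin and two in the second, while equal-frequency binning splits them three and three.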