Data collection
incompleteness, duplication
merging problems, inconsistencies
Imbalanced data set
Hive partition / bucket
name node
Outlier detection and treatment
Valid observation (within range, e.g. 50): detect with z-score/IQR, treat by truncation (capping)
Invalid observation (e.g. 300)
Multivariate outliers: Mahalanobis distance
Categorization (categorical variables)
Nominal categories
Binary: 0/1, No/Yes
Text: cosine distance
K-means clustering, streaming K-means
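The Mahalanobis distance named above for multivariate outliers can be sketched for the two-variable case. This is a minimal illustration, not a method taken from the source: the function name `mahalanobis_2d`, the sample data, and the use of the sample covariance (divisor n - 1) are all assumptions.

```python
import math
from statistics import mean


def mahalanobis_2d(points):
    """Return the Mahalanobis distance of each (x, y) point from the sample mean."""
    n = len(points)
    if n < 3:
        raise ValueError("need at least three points")
    mx = mean(p[0] for p in points)
    my = mean(p[1] for p in points)
    # Entries of the 2x2 sample covariance matrix (divisor n - 1, an assumption).
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] * inverse(S) * [dx dy]^T with the 2x2 inverse written out.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(max(d2, 0.0)))
    return distances


# Hypothetical observations; the last one is unusual in its second variable.
sample = [(50, 1.0), (48, 1.1), (52, 0.9), (51, 1.2), (49, 0.8), (50, 3.0)]
for point, dist in zip(sample, mahalanobis_2d(sample)):
    print(point, round(dist, 3))
```

Unlike a per-variable z-score, this distance accounts for the correlation between the variables, so a point can be flagged even when each coordinate looks ordinary on its own.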
Histograms
Box plots: 3 quartiles, IQR = Q3 - Q1
Outlier fences: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR
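The box-plot rule (1.5 times the IQR beyond the quartiles) and capping can be sketched as follows. The helper names `iqr_fences` and `cap` are hypothetical, and the quartile convention is an assumption: `statistics.quantiles` uses its default "exclusive" method here, and other tools may compute slightly different quartiles.

```python
from statistics import quantiles


def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k * IQR and Q3 + k * IQR."""
    q1, _median, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def cap(values, lower, upper):
    """Truncate (cap) every value to the closed interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


# Hypothetical data with one extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 300]
low, high = iqr_fences(data)
print("fences:", low, high)
print("capped:", cap(data, low, high))
```

Capping keeps the observation in the data set but limits its influence, which matches the "truncation" treatment for observations that are valid but extreme.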
Standardization
Scale variables to a common range: min/max, z-score, decimal scaling
Chi-squared
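The three scaling methods listed (min/max, z-score, decimal scaling) can be sketched in plain Python. The function names are hypothetical, and the z-score here uses the sample standard deviation, which is an assumption; some texts use the population standard deviation instead.

```python
from statistics import mean, stdev


def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly map values onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical")
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score_scale(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        raise ValueError("standard deviation is zero")
    return [(v - mu) / sigma for v in values]


def decimal_scale(values):
    """Divide by 10**j, with j the smallest integer such that max(|v|) / 10**j < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


# Hypothetical variable on an arbitrary scale.
income = [1200, 3400, 560, 9800]
print(min_max_scale(income))
print(z_score_scale(income))
print(decimal_scale(income))
```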
Variable selection
Remove noisy features
Filter
Z-score: flag values more than 3 standard deviations from the mean (|z| > 3)
Decimal scaling
Dissimilarity or similarity
Categorization
Continuous, ordinal
Equal interval binning (2 bins by range): 1-5, 5-10
Equal frequency binning (2 bins): 1,2,3 and 4,6,8
Invalid observation, missing value treatment: replace, delete, keep
Descriptive analytics
Data modeling: clustering analysis
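Equal interval and equal frequency binning can be sketched on the values 1, 2, 3, 4, 6, 8 that appear above. The function names are hypothetical, and the overall range of 0 to 10 passed to the equal interval example is an assumption chosen so that the cut point falls at 5.

```python
def equal_interval_bins(values, n_bins, lo=None, hi=None):
    """Split the range [lo, hi] into n_bins intervals of equal width."""
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    if hi <= lo:
        raise ValueError("range must have positive width")
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in values:
        # Clamp the index so values at or beyond the edges land in the outer bins.
        idx = min(max(int((v - lo) / width), 0), n_bins - 1)
        bins[idx].append(v)
    return bins


def equal_frequency_bins(values, n_bins):
    """Sort the values and cut them into n_bins groups of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]


values = [1, 2, 3, 4, 6, 8]
print(equal_interval_bins(values, 2, lo=0, hi=10))
print(equal_frequency_bins(values, 2))
```

Equal interval bins have the same width but may hold very different numbers of observations; equal frequency bins hold the same number of observations but may have very different widths.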
Valid observation
Similarity
churn, fraud,credit risk
Distance
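For text, the cosine distance mentioned earlier can be sketched with simple word counts. This is an illustrative assumption about the representation: the function name `cosine_distance` is hypothetical, and the whitespace tokenization with lowercasing stands in for whatever text preprocessing is actually used.

```python
import math
from collections import Counter


def cosine_distance(text_a, text_b):
    """Return 1 - cosine similarity of the word-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("both texts must contain at least one word")
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical short texts.
print(cosine_distance("credit risk model", "credit fraud model"))
print(cosine_distance("credit risk model", "streaming k means"))
```

A distance of 0 means the word-count vectors point in the same direction, and a distance of 1 means the texts share no words.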