Вы находитесь на странице: 1из 15

WHAT ARE OUTLIERS?

A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers. Outliers can be caused by measurement or execution error. the outliers may be of particular interest

Applications:
Fraud detection
Medicine Public health

Sports statistics Detecting measurement errors

OUTLIER DETECTION METHODS


Statistical Distribution-Based Outlier Detection Distance-Based Outlier Detection Density-Based Local Outlier Detection Deviation-Based Outlier Detection

Statistical Distribution-Based Outlier Detection


assumes a distribution for the given data set identifies outliers with respect to the model using a discordancy test requires knowledge of the data set parameters knowledge of distribution parameters expected number of outliers.

How does the discordancy testing work?


This test examines two hypotheses: working hypothesis alternative hypothesis

A working hypothesis, H, is a statement that the entire data set of n objects comes from an initial distribution model, F, that is, H : oi E F, where i = 1, 2, , n. Verifies whether oi is <> in relation to F Assume T is some statistic used as discordancy test Assume value of the statistic for object oi is vi Then distribution T is constructed SP(vi)=Prob(T > vi), is evaluated If SP(vi) is small H is rejected

An alternative hypothesis, H, which states that oi comes from another distribution model, G, is adopted. The result is very much dependent on which model F is chosen because oi may be an outlier under one model and a perfectly valid value under another.

kinds of alternative distributions. Inherent alternative distribution H : oi E G, where i = 1, 2, : : : , n Mixture alternative distribution G. H : oi E (1-mu)F +muG, where i = 1, 2, : : : , n. Slippage alternative distribution

Distance-Based Outlier Detection


An object, o, in a data set, D, is a distancebased (DB) outlier with parameters pct and dmin,11 that is, a DB(pct;dmin)-outlier, if at least a fraction, pct, of the objects in D lie at a distance greater than dmin from o.

algorithms for mining distance-based outliers


Index-based algorithm Nested-loop algorithm Cell-based algorithm

Density-Based Local Outlier Detection


Distance-based outlier detection is based on global distance distribution It encounters difficulties to identify outliers if data is not uniformly distributed.

Deviation-Based Outlier Detection


it identifies outliers by examining the main characteristics of objects in a group

two techniques for deviation-based outlier detection Sequential Exception Technique OLAP Data Cube Technique

Sequential Exception Technique

OLAP Data Cube Technique

Вам также может понравиться