What Are Outliers176

WHAT ARE OUTLIERS?
A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers. Outliers can be caused by measurement or execution error. the outliers may be of particular interest
Applications:
Fraud detection
Medicine Public health
Sports statistics Detecting measurement errors
OUTLIER DETECTION METHODS

Statistical Distribution-Based Outlier Detection Distance-Based Outlier Detection Density-Based Local Outlier Detection Deviation-Based Outlier Detection
Statistical Distribution-Based Outlier Detection

assumes a distribution for the given data set identifies outliers with respect to the model using a discordancy test requires knowledge of the data set parameters knowledge of distribution parameters expected number of outliers.
How does the discordancy testing work?

This test examines two hypotheses: working hypothesis alternative hypothesis
A working hypothesis, H, is a statement that the entire data set of n objects comes from an initial distribution model, F, that is, H : oi E F, where i = 1, 2, , n. Verifies whether oi is <> in relation to F Assume T is some statistic used as discordancy test Assume value of the statistic for object oi is vi Then distribution T is constructed SP(vi)=Prob(T > vi), is evaluated If SP(vi) is small H is rejected
An alternative hypothesis, H, which states that oi comes from another distribution model, G, is adopted. The result is very much dependent on which model F is chosen because oi may be an outlier under one model and a perfectly valid value under another.
kinds of alternative distributions. Inherent alternative distribution H : oi E G, where i = 1, 2, : : : , n Mixture alternative distribution G. H : oi E (1-mu)F +muG, where i = 1, 2, : : : , n. Slippage alternative distribution
Distance-Based Outlier Detection

An object, o, in a data set, D, is a distancebased (DB) outlier with parameters pct and dmin,11 that is, a DB(pct;dmin)-outlier, if at least a fraction, pct, of the objects in D lie at a distance greater than dmin from o.
algorithms for mining distance-based outliers

Index-based algorithm Nested-loop algorithm Cell-based algorithm
Density-Based Local Outlier Detection

Distance-based outlier detection is based on global distance distribution It encounters difficulties to identify outliers if data is not uniformly distributed.
Deviation-Based Outlier Detection

it identifies outliers by examining the main characteristics of objects in a group
two techniques for deviation-based outlier detection Sequential Exception Technique OLAP Data Cube Technique
Sequential Exception Technique
OLAP Data Cube Technique

What Are Outliers176

Загружено:

Сведения о документе

Исходное описание:

Авторское право

Доступные форматы

Поделиться этим документом

Поделиться или встроить документ

Параметры публикации

Этот документ был вам полезен?

Это неприемлемый материал?

Авторское право:

Доступные форматы

What Are Outliers176

Загружено:

Авторское право:

Доступные форматы

WHAT ARE OUTLIERS?

Sports statistics Detecting measurement errors

OUTLIER DETECTION METHODS

Statistical Distribution-Based Outlier Detection

How does the discordancy testing work?

Distance-Based Outlier Detection

algorithms for mining distance-based outliers

Density-Based Local Outlier Detection

Deviation-Based Outlier Detection

Sequential Exception Technique

OLAP Data Cube Technique

Вам также может понравиться