Statistik Data Analysis PDF

Modern Methods of Data Analysis
Lecture II (27.04.10)
Contents:
● Characterize data samples
● Characterize distributions
● Correlations, covariance
Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Reminder: Average of a Sample
● arithmetic mean of data set: $\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$
● weighted mean of data set: $\bar{x}_w = \frac{\sum_i w_i x_i}{\sum_i w_i}$
● mode – most probable value (peak of the distribution, not necessarily unique)
● median – smallest value that is ≥ at least 50% of the events
  The median is often a better choice than the mean, since it is more robust against outliers!
● quantiles are defined similarly: median = 50% quantile
● truncated mean: useful if the underlying distribution is expected to be asymmetric
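
The averages listed above can be tried out on a small sample. The following is a minimal sketch using only the Python standard library; the sample values and the helper `weighted_mean` are made up for illustration and are not part of the lecture.

```python
import statistics

def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Hypothetical sample: six values near 5 plus one outlier at 50.
data = [4.8, 5.1, 5.0, 4.9, 5.2, 5.0, 50.0]

mean = statistics.mean(data)      # pulled far away from 5 by the outlier
median = statistics.median(data)  # stays at the bulk of the data
mode = statistics.mode(data)      # most frequent value in the sample
wmean = weighted_mean([1.0, 2.0, 3.0], [1.0, 1.0, 2.0])

print(f"mean = {mean:.3f}, median = {median}, mode = {mode}, weighted mean = {wmean}")
```

A single outlier moves the mean to about 11.4, while the median and mode remain at 5.0.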
Measure the Spread of a Sample
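● How to characterize width/spread?
● First thought: mean deviation from the mean, $\frac{1}{N} \sum_i (x_i - \bar{x})$, but this is zero by construction.
● Could consider the average absolute deviation: $\frac{1}{N} \sum_i |x_i - \bar{x}|$
  However, it is hard to handle mathematically.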

Sample Variance
● Much better quantity:
  mean square deviation, called sample variance s² or V: $s^2 = \frac{1}{N} \sum_i (x_i - \bar{x})^2$
● For any random variable :

Sample Variance
● For data analysis, preferably loop only once over the data:
  $s^2 = \overline{x^2} - \bar{x}^2$ (mean square – square of the mean)
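
The single-loop formula can be sketched as follows. This is an illustrative implementation, not code from the lecture; the function name `single_pass_variance` and the sample values are assumptions.

```python
def single_pass_variance(data):
    """Return (mean, variance) from one loop over the data.

    Uses variance = mean of squares - square of the mean (1/N normalization).
    """
    n = 0
    sum_x = 0.0
    sum_x2 = 0.0
    for x in data:
        n += 1
        sum_x += x
        sum_x2 += x * x
    mean = sum_x / n
    return mean, sum_x2 / n - mean * mean

sample = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical data
mean, var = single_pass_variance(sample)
print(mean, var)
```

Only three running quantities are needed, so the data never has to be stored or read a second time.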

Sample Variance
For large numbers, it is safer to shift the distribution by an estimated mean $x_e$:
$s^2 = \overline{(x - x_e)^2} - (\bar{x} - x_e)^2$
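
A minimal sketch of the shifted calculation, assuming the first data value is used as the rough estimate of the mean (that choice, the function name `shifted_variance` and the sample values are illustrative assumptions). Subtracting the shift first keeps the accumulated sums small, so the final subtraction does not cancel two huge, nearly equal numbers.

```python
def shifted_variance(data, shift):
    """Single-pass mean and variance of the data, accumulated relative to `shift`."""
    n = 0
    sum_d = 0.0
    sum_d2 = 0.0
    for x in data:
        d = x - shift  # small number if shift is close to the mean
        n += 1
        sum_d += d
        sum_d2 += d * d
    mean_d = sum_d / n
    return shift + mean_d, sum_d2 / n - mean_d * mean_d

# Hypothetical sample: small spread on top of a very large offset.
offset = 1.0e9
sample = [offset + v for v in (4.0, 7.0, 13.0, 16.0)]

mean, var = shifted_variance(sample, shift=sample[0])
print(mean, var)
```

The variance does not depend on the shift, so any rough estimate of the mean works; a better estimate only improves the numerical behavior.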

Standard Deviation (RMS), FWHM
● standard deviation σ or RMS (root mean square): $\sigma = \sqrt{V}$
  [“standard” is a joke, there are several standards in the literature ...]
● FWHM: full width at half maximum
  more robust against outliers, but harder to determine at low statistics because of fluctuations; for Gaussian distributed events: FWHM = 2.35σ
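
The factor 2.35 quoted for the Gaussian case is 2√(2 ln 2). The short check below is a sketch with arbitrarily chosen values of µ and σ; it verifies that the Gaussian has dropped to half its maximum at a distance of FWHM/2 from the peak.

```python
import math

# FWHM of a Gaussian in units of sigma: 2 * sqrt(2 * ln 2)
FWHM_PER_SIGMA = 2.0 * math.sqrt(2.0 * math.log(2.0))

def gaussian(x, mu, sigma):
    """Gaussian PDF."""
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return norm * math.exp(-0.5 * ((x - mu) / sigma) ** 2)

mu, sigma = 5.0, 1.5  # arbitrary illustration values
half_width = 0.5 * FWHM_PER_SIGMA * sigma
ratio = gaussian(mu + half_width, mu, sigma) / gaussian(mu, mu, sigma)

print(f"FWHM / sigma = {FWHM_PER_SIGMA:.4f}, height ratio at half width = {ratio:.3f}")
```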

Example:
● Give sample variance, RMS and FWHM:

Expectation Values
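● So far we have characterized a given realization of an experiment (sum over N) by the sample mean, sample spread ...
● Now we talk about the mean and spread of a distribution: $E[x] = \int x f(x) dx = \mu$
Note: in general $\bar{x} \neq \mu$.
However, for N → ∞: $\bar{x} \to \mu$ (law of large numbers)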
Variance of a Distribution:
● $V[x] = E[(x - \mu)^2] = \int (x - \mu)^2 f(x) dx$, where f(x) is the PDF
● V[x] = E[x²] – µ²
V[x] is a measure of the spread of the distribution, not of how well the mean is measured!

Example:
[Figure: distributions with µ = 5, σ = 1 for N = 100, N = 1000 and N = 10000 events]

How to determine uncertainty on the mean?
● $E[\bar{x}]$ = ???
● $V[\bar{x}]$ = ???

Expectation Value of sample mean
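
The formula for this slide did not survive extraction. A standard derivation, given here as a sketch under the assumption that all $x_i$ are drawn from the same distribution with mean $\mu$, uses only the linearity of the expectation value:

```latex
E[\bar{x}]
  = E\left[\frac{1}{N} \sum_{i=1}^{N} x_i\right]
  = \frac{1}{N} \sum_{i=1}^{N} E[x_i]
  = \frac{1}{N} \cdot N \mu
  = \mu
```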

Variance of the Sample Mean
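
The formula for this slide did not survive extraction. A standard derivation, sketched here under the assumption that the $x_i$ are independent and each has variance $\sigma^2$, gives the familiar $1/\sqrt{N}$ scaling of the uncertainty on the mean:

```latex
V[\bar{x}]
  = V\left[\frac{1}{N} \sum_{i=1}^{N} x_i\right]
  = \frac{1}{N^2} \sum_{i=1}^{N} V[x_i]
  = \frac{\sigma^2}{N}
\qquad \Rightarrow \qquad
\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{N}}
```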

m(B0) = 5279.63 ± 0.53 (stat) ± 0.33 (sys) MeV
● CDF has a mass resolution of 16 MeV:
  the reconstructed mass of a single B meson is spread around the true B mass with σ = 16 MeV
● The B mass itself can be measured with much better precision, because the uncertainty on the mean of N events scales as σ/√N
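
The difference between the single-event resolution and the precision of the mean can be illustrated with pseudo-experiments. This is a toy sketch: the number of events per experiment (100), the number of pseudo-experiments and the Gaussian resolution model are assumptions for illustration, not values from the CDF measurement.

```python
import random
import statistics

def spread_of_sample_means(n_events, resolution, true_mass, n_experiments, seed=1):
    """Standard deviation of the sample mean over many pseudo-experiments."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_experiments):
        masses = [rng.gauss(true_mass, resolution) for _ in range(n_events)]
        means.append(sum(masses) / n_events)
    return statistics.pstdev(means)

resolution = 16.0  # MeV, single-event mass resolution quoted above
n_events = 100     # hypothetical number of reconstructed B mesons

observed = spread_of_sample_means(n_events, resolution, 5279.63, n_experiments=1000)
expected = resolution / n_events ** 0.5  # sigma / sqrt(N)

print(f"spread of the mean: {observed:.2f} MeV, sigma/sqrt(N): {expected:.2f} MeV")
```

Each single mass still scatters by 16 MeV, but the mean of 100 events scatters by only about 1.6 MeV.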

Unbiased Estimators:
Unbiased estimator (German: “erwartungstreuer Schätzer”)
The unbiased estimator for the true mean µ is the sample mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
For n data points, we estimate the true variance V(x) by the “sample variance s²”:
- if the true mean µ is known: $s^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$
- if the true mean is unknown, an unbiased estimator for the variance σ² is: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
  Beware of the n−1!
“One single value is not enough to determine mean and spread.”
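
The effect of the n−1 can be checked numerically. The sketch below assumes a unit Gaussian (true variance 1), small samples of n = 5 and 20000 pseudo-experiments; all of these are illustrative choices.

```python
import random

def variance_estimates(sample):
    """Return (1/n estimate, 1/(n-1) estimate) of the variance around the sample mean."""
    n = len(sample)
    mean = sum(sample) / n
    sum_sq = sum((x - mean) ** 2 for x in sample)
    return sum_sq / n, sum_sq / (n - 1)

rng = random.Random(42)
n, n_experiments = 5, 20000
sum_biased = 0.0
sum_unbiased = 0.0
for _ in range(n_experiments):
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]  # true variance = 1
    biased, unbiased = variance_estimates(sample)
    sum_biased += biased
    sum_unbiased += unbiased

mean_biased = sum_biased / n_experiments
mean_unbiased = sum_unbiased / n_experiments
print(f"<1/n estimate> = {mean_biased:.3f}, <1/(n-1) estimate> = {mean_unbiased:.3f}")
```

On average the 1/n version comes out near (n−1)/n = 0.8 of the true variance, while the 1/(n−1) version averages to about 1.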

Solution: Unbiased Estimator for V(x)

Solution: Unbiased Estimators for V(x)
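
The content of the solution slides was lost in extraction. One common way to show where the $n-1$ comes from, given here as a sketch that assumes independent $x_i$ with mean $\mu$ and variance $\sigma^2$ and uses $V[\bar{x}] = \sigma^2 / n$, is:

```latex
\begin{align}
\sum_{i=1}^{n} (x_i - \bar{x})^2
  &= \sum_{i=1}^{n} (x_i - \mu)^2 - n (\bar{x} - \mu)^2 \\
E\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right]
  &= n \sigma^2 - n \cdot \frac{\sigma^2}{n} = (n - 1) \sigma^2 \\
E[s^2]
  &= E\left[\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2\right] = \sigma^2
\end{align}
```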

Efficiency of Estimators
● Optimal estimator: “optimal” ↔ smallest variance
  (Likelihood maximization gives the optimal estimator; this will be proven in a later lecture)
● Efficiency of an estimator:
  “variance of optimal estimator / variance of estimator”
● For a Gaussian distribution the arithmetic mean $\bar{x}$ is the optimal estimator
● non-optimal estimators are called not robust
● E.g. the median of a Gaussian distribution has 64% efficiency
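
The 64% efficiency of the median (the large-sample value is 2/π ≈ 0.637) can be estimated with pseudo-experiments. This is a toy sketch; the sample size of 51, the number of experiments and the seed are arbitrary assumptions, and for finite samples the result only approximates the asymptotic value.

```python
import random
import statistics

rng = random.Random(7)
n, n_experiments = 51, 3000

means = []
medians = []
for _ in range(n_experiments):
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    means.append(sum(sample) / n)
    medians.append(statistics.median(sample))

# Efficiency = variance of the optimal estimator / variance of the estimator.
efficiency = statistics.pvariance(means) / statistics.pvariance(medians)
print(f"efficiency of the median for Gaussian data: {efficiency:.2f}")
```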

Symmetric truncated Mean
● truncated mean (German: “getrimmter Mittelwert”):
  – e.g. r = 40% truncated mean:
    ● the 10% lowest and 10% highest values are ignored; the mean of the central 80% of the values is calculated
  – r = 50% truncated mean → arithmetic mean
  – r → 0% → median
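
A minimal sketch of the symmetric truncated mean in the convention used above, where r is the fraction of values kept on each side of the median. The function name, the rounding of the number of dropped values and the sample are assumptions for illustration.

```python
def truncated_mean(data, r):
    """Symmetric truncated mean keeping a fraction r of the values on each side.

    r = 0.5 keeps everything (arithmetic mean); r -> 0 approaches the median.
    """
    if not 0.0 < r <= 0.5:
        raise ValueError("r must be in (0, 0.5]")
    values = sorted(data)
    n = len(values)
    n_cut = int(round((0.5 - r) * n))  # values dropped at each end
    n_cut = min(n_cut, (n - 1) // 2)   # always keep the central value(s)
    kept = values[n_cut:n - n_cut]
    return sum(kept) / len(kept)

# Hypothetical sample with one large outlier.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0]

print(truncated_mean(data, 0.5))   # arithmetic mean, dominated by the outlier
print(truncated_mean(data, 0.4))   # drops lowest and highest 10%
print(truncated_mean(data, 0.01))  # only the central values remain
```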

[Figure: efficiency of the truncated mean as a function of r, for Laplace (double exponential) and Cauchy distributions]
r = 0.23 truncated mean: best estimator for an unknown symmetric distribution
Moments
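● r-th algebraic moment: $\mu'_r = E[x^r]$
● r-th central moment: $\mu_r = E[(x - \mu)^r]$
Expectation value: 1st algebraic moment
Variance: 2nd central moment
Skewness (German: “Schiefe”): $\gamma_1 = \mu_3 / \sigma^3$
- positive for distributions with a longer right tail
Kurtosis (German: “Wölbung”): $\gamma_2 = \mu_4 / \sigma^4 - 3$
- measures the weight of the tails relative to the core
- positive kurtosis: longer tails than a Gaussian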
Skewness & Kurtosis
[Figure: example distributions with kurtosis < 0 and kurtosis > 0]
Gaussian distributions have kurtosis = 0
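
Sample versions of skewness and kurtosis can be computed from central moments. The sketch below assumes the convention in which a Gaussian has kurtosis 0 (the fourth standardized moment minus 3); the function names and the two small samples are illustrative.

```python
def central_moment(data, r):
    """r-th central moment of a sample (1/N normalization)."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** r for x in data) / len(data)

def skewness(data):
    """Third central moment divided by sigma cubed."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def kurtosis(data):
    """Fourth central moment divided by sigma^4, minus 3 (Gaussian gives 0)."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3.0

symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]   # flat and symmetric
right_tail = [1.0, 1.0, 1.0, 2.0, 10.0]   # long tail to the right

print(skewness(symmetric), kurtosis(symmetric))
print(skewness(right_tail))
```

The flat symmetric sample has zero skewness and negative kurtosis, while the sample with a long right tail has positive skewness.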

Which fraction of events is within 1, 2, 3 σ?
1σ: 68.27%
2σ: 95.45%
3σ: 99.73%
4σ: 99.994%
This is only true for Gaussian distributions!

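Bienaymé-Tchebycheff Inequality
For every distribution (with finite variance) the following inequality is valid: $P(|x - \mu| \geq k\sigma) \leq \frac{1}{k^2}$
Fraction of events outside ±kσ:
k   Gauss      Tchebycheff (upper bound)
1   0.317      1.0
2   0.0455     0.25
3   0.0027     0.1111
4   0.000063   0.0625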
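
The two columns of the table can be reproduced with a few lines. This sketch uses the complementary error function for the Gaussian fraction outside ±kσ and 1/k² for the bound; the function names are illustrative.

```python
import math

def gaussian_tail(k):
    """Fraction of a Gaussian outside mu +/- k*sigma."""
    return math.erfc(k / math.sqrt(2.0))

def tchebycheff_bound(k):
    """Upper bound on the fraction outside mu +/- k*sigma for any distribution."""
    return 1.0 / k ** 2

for k in (1, 2, 3, 4):
    print(f"k = {k}: Gauss {gaussian_tail(k):.6f}, bound {tchebycheff_bound(k):.4f}")
```

The Gaussian fractions always lie below the bound, which is much weaker because it has to hold for every distribution.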

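Solution: Bienaymé-Tchebycheff Inequality
Given a PDF f(x) and a positive function w(x) ≥ 0: $P(w(x) \geq C) \leq \frac{E[w(x)]}{C}$
with $w(x) = (x - \mu)^2$ and $C = k^2 \sigma^2$: $P(|x - \mu| \geq k\sigma) \leq \frac{\sigma^2}{k^2 \sigma^2} = \frac{1}{k^2}$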

Two Dimensional Distributions
Multiple ways to visualize 2-dim distributions
● box plot
● lego plot
● surface plot
● numbers
● scatter plot
● color map
● contour plot
● ...

Two dimensional Distributions
● straightforward generalization of 1-dim PDFs
A 2-dim PDF is a function f(x,y) ≥ 0 with $\int\int f(x,y) dx dy = 1$

Marginal Distributions
● Marginal distributions (German: “Randverteilungen”): projections onto the axes
  $f_x(x) = \int f(x,y) dy$, $f_y(y) = \int f(x,y) dx$

Conditional Probability
● conditional PDF of y for given x: $f(y|x) = f(x,y) / \int f(x,y') dy'$
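
Marginal and conditional distributions are easiest to see in a discrete analogue, where the integrals become sums over the rows and columns of a table. The joint probabilities below are made-up illustration values, and the binned table is an assumption standing in for a continuous f(x,y).

```python
# Hypothetical joint probabilities P(x_i, y_j): rows = x bins, columns = y bins.
joint = [
    [0.10, 0.20, 0.10],
    [0.30, 0.20, 0.10],
]

# Marginal distributions: project onto each axis by summing over the other one.
marginal_x = [sum(row) for row in joint]
marginal_y = [sum(row[j] for row in joint) for j in range(len(joint[0]))]

def conditional_y_given_x(i):
    """P(y_j | x_i) = P(x_i, y_j) / P(x_i)."""
    return [p / marginal_x[i] for p in joint[i]]

cond = conditional_y_given_x(0)
print("marginal x:", marginal_x)
print("marginal y:", marginal_y)
print("P(y | x = first bin):", cond)
```

Each conditional distribution is a slice of the joint table, renormalized so that it sums to 1.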

Exercise:
● Compute
● Compute
