
# Modern Methods of Data Analysis

Lecture II (27.04.10)

Contents:
● Characterize data samples
● Characterize distributions
● Correlations, covariance

## Modern Methods of Data Analysis - SS 2010 Stephanie Hansmann-Menzemer

Reminder: Average of a Sample
● arithmetic mean of the data set: $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$
● median: the smallest value that is ≥ 50% of the events. It is often better to use the median than the mean, since it is more robust against outliers.
● truncated mean: useful if the underlying distribution is expected to be asymmetric
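The robustness of the median against outliers can be illustrated with a small sketch. The numbers are invented for illustration; `statistics.median_low` is used because, for an even number of events, it returns the smaller of the two central values, which matches the definition above.

```python
import statistics

def summarize(values):
    """Return (arithmetic mean, median) of a sample."""
    return statistics.mean(values), statistics.median_low(values)

# Hypothetical sample; the second list replaces the largest value by an outlier.
clean = [48, 49, 50, 51, 52]
with_outlier = [48, 49, 50, 51, 500]

mean_clean, median_clean = summarize(clean)
mean_out, median_out = summarize(with_outlier)

print("clean:       ", mean_clean, median_clean)
print("with outlier:", mean_out, median_out)
```

The single outlier pulls the mean far away from the bulk of the data, while the median does not move.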
Measure the Spread of a Sample

However, hard to handle mathematically.

Sample Variance
● A much better quantity: $V = \frac{1}{N}\sum_{i=1}^{N}(x_i-\bar{x})^2$


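Sample Variance
● For data analysis, preferably loop only once over the data: $V = \overline{x^2} - \bar{x}^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \left(\frac{1}{N}\sum_{i=1}^{N} x_i\right)^2$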
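A minimal sketch of the single-loop computation, assuming the variance is defined with a factor 1/N. The function name is illustrative only.

```python
def single_pass_variance(values):
    """Return (mean, variance) using one loop over the data.

    Uses V = mean(x^2) - mean(x)^2 with a 1/N normalisation.
    """
    n = 0
    sum_x = 0.0
    sum_x2 = 0.0
    for x in values:
        n += 1
        sum_x += x
        sum_x2 += x * x
    if n == 0:
        raise ValueError("need at least one value")
    mean = sum_x / n
    variance = sum_x2 / n - mean * mean
    return mean, variance

mean, variance = single_pass_variance([1.0, 2.0, 3.0, 4.0])
print(mean, variance)
```

Note that the difference of two large, nearly equal numbers loses floating-point precision when the mean is much larger than the spread. Subtracting a rough estimate of the mean from every value before accumulating reduces this effect.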

● With an estimated mean $x_e$ (reduces rounding errors): $V = \frac{1}{N}\sum_{i=1}^{N}(x_i-x_e)^2 - (\bar{x}-x_e)^2$


Standard Deviation (RMS), FWHM
● standard deviation σ or RMS (root mean square): $\sigma = \sqrt{V}$
● FWHM: full width at half maximum
more robust against outliers, but harder to determine at low statistics because of fluctuations; for Gaussian distributed events: FWHM = $2\sqrt{2\ln 2}\,\sigma \approx 2.35\sigma$
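The factor 2.35 follows directly from the Gaussian PDF. As a short worked derivation, with $h$ denoting the half width at half maximum (a symbol introduced here only for this calculation):

```latex
\begin{aligned}
f(x) &= \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \\
f(\mu \pm h) &= \tfrac{1}{2} f(\mu)
  \;\Rightarrow\; \exp\!\left(-\frac{h^2}{2\sigma^2}\right) = \frac{1}{2}
  \;\Rightarrow\; h = \sigma\sqrt{2\ln 2}, \\
\mathrm{FWHM} &= 2h = 2\sqrt{2\ln 2}\,\sigma \approx 2.355\,\sigma .
\end{aligned}
```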



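Expectation Values
● So far we have characterized a given realization of an experiment (sum over N) by sample quantities such as the sample mean.
● The expectation value characterizes the underlying distribution with PDF f(x): $E[x] = \int x\, f(x)\, dx = \mu$

Note: for finite N, the sample mean $\bar{x}$ is in general not equal to µ.
However, for N → ∞, $\bar{x} \to \mu$ (law of large numbers).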
Variance of a Distribution:
● V[x] = E[(x−µ)²] = σ²
● V[x] = $\int (x-\mu)^2 f(x)\, dx$, f(x): PDF
● V[x] = E[x²] − µ²

V[x] is a measure of the spread of the distribution, not of how well the mean is measured!

Example (figure): samples with µ = 5, σ = 1 for N = 100, N = 1000 and N = 10000.
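The point of the example can be reproduced with a small simulation. This is only a sketch: the random seed, the helper name and the use of Python's standard generator are assumptions, not part of the lecture.

```python
import random
import statistics

def sample_mean_and_spread(n, mu=5.0, sigma=1.0, seed=1):
    """Draw n Gaussian values and return (sample mean, sample standard deviation)."""
    rng = random.Random(seed)
    data = [rng.gauss(mu, sigma) for _ in range(n)]
    return statistics.fmean(data), statistics.pstdev(data)

results = {n: sample_mean_and_spread(n) for n in (100, 1000, 10000)}
for n, (mean, spread) in results.items():
    print(f"N = {n:5d}: mean = {mean:.3f}, spread = {spread:.3f}")
```

The spread stays close to σ = 1 for every N; only the sample mean settles closer to µ as N grows.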


How to determine the uncertainty on the mean?

● $E[\bar{x}]$ = ???
● $V[\bar{x}]$ = ???


Expectation Value of sample mean
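A sketch of the calculation, assuming N measurements $x_i$ drawn from the same distribution with expectation value µ. It uses only the linearity of the expectation value:

```latex
E[\bar{x}]
  = E\!\left[\frac{1}{N}\sum_{i=1}^{N} x_i\right]
  = \frac{1}{N}\sum_{i=1}^{N} E[x_i]
  = \frac{1}{N}\, N\mu
  = \mu
```

The sample mean is therefore an unbiased estimator of µ.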


Variance of the Sample Mean
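A sketch of the calculation, assuming N independent measurements $x_i$ from the same distribution with mean µ and variance σ². The cross terms with $i \neq j$ vanish because of the independence:

```latex
\begin{aligned}
V[\bar{x}] &= E\!\left[(\bar{x}-\mu)^2\right]
  = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} E\!\left[(x_i-\mu)(x_j-\mu)\right] \\
 &= \frac{1}{N^2}\sum_{i=1}^{N} E\!\left[(x_i-\mu)^2\right]
  = \frac{\sigma^2}{N},
 \qquad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{N}} .
\end{aligned}
```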


m(B0) = 5279.63 ± 0.53 (stat) ± 0.33 (sys) MeV
● CDF has a mass resolution of 16 MeV: the reconstructed mass of a single B meson is spread around the true B mass with σ = 16 MeV
● The B mass can be measured with much better precision than the single-event resolution, because the uncertainty on the mean scales as σ/√N
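As a rough illustration of this scaling, the sketch below asks how many events a resolution of 16 MeV would need to reach a statistical uncertainty of 0.53 MeV. It assumes a pure σ/√N behaviour (no background, Gaussian resolution), so the result is an order-of-magnitude estimate, not the event count of the actual analysis. The function names are illustrative.

```python
import math

def uncertainty_on_mean(resolution, n_events):
    """Statistical uncertainty of the mean for a given single-event resolution."""
    return resolution / math.sqrt(n_events)

def events_needed(resolution, target_uncertainty):
    """Smallest N with resolution / sqrt(N) <= target_uncertainty."""
    return math.ceil((resolution / target_uncertainty) ** 2)

n = events_needed(16.0, 0.53)  # both values in MeV
print(n, uncertainty_on_mean(16.0, n))
```

Under these assumptions, of the order of a thousand events are enough to turn a 16 MeV resolution into a sub-MeV uncertainty on the mean.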

Unbiased Estimators:
Unbiased estimator (“erwartungstreuer Schätzer”): the expectation value of the estimator equals the true value.

For N data points, we estimate the true variance V(x) by the “sample variance s²”:
- if the true mean µ is known: $s^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2$
- if the true mean is unknown, an unbiased estimator for the variance σ² is $s^2 = \frac{1}{N-1}\sum_{i=1}^{N}(x_i-\bar{x})^2$ (beware of the N−1!)

“One single value is not enough to determine mean and spread.”
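The effect of the N−1 can be checked with a toy simulation. This is a sketch under assumed settings (Gaussian data with σ = 1, five points per experiment, a fixed seed); the helper names are hypothetical.

```python
import random

def variance_estimates(data):
    """Return (1/N estimate, 1/(N-1) estimate) using the sample mean."""
    n = len(data)
    mean = sum(data) / n
    sum_sq = sum((x - mean) ** 2 for x in data)
    return sum_sq / n, sum_sq / (n - 1)

def average_estimates(n_points=5, n_experiments=20000, sigma=1.0, seed=7):
    """Average both estimates over many repeated toy experiments."""
    rng = random.Random(seed)
    total_biased = 0.0
    total_unbiased = 0.0
    for _ in range(n_experiments):
        data = [rng.gauss(0.0, sigma) for _ in range(n_points)]
        biased, unbiased = variance_estimates(data)
        total_biased += biased
        total_unbiased += unbiased
    return total_biased / n_experiments, total_unbiased / n_experiments

avg_biased, avg_unbiased = average_estimates()
print(avg_biased, avg_unbiased)
```

With five points per experiment, the 1/N version averages to about (N−1)/N · σ² = 0.8, while the 1/(N−1) version averages to about σ² = 1.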

Solution: Unbiased Estimator for V(x)


Solution: Unbiased Estimators for V(x)
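One way to see where the N−1 comes from, assuming independent $x_i$ with mean µ and variance σ², and using $E[(\bar{x}-\mu)^2] = \sigma^2/N$ for the sample mean:

```latex
\begin{aligned}
\sum_{i=1}^{N}(x_i-\bar{x})^2
  &= \sum_{i=1}^{N}(x_i-\mu)^2 - N(\bar{x}-\mu)^2, \\
E\!\left[\sum_{i=1}^{N}(x_i-\bar{x})^2\right]
  &= N\sigma^2 - N\,\frac{\sigma^2}{N} = (N-1)\,\sigma^2, \\
E[s^2]
  &= E\!\left[\frac{1}{N-1}\sum_{i=1}^{N}(x_i-\bar{x})^2\right] = \sigma^2 .
\end{aligned}
```

Dividing by N instead would give an expectation value of $(N-1)\sigma^2/N$, which underestimates the true variance.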


Efficiency of Estimators

● Optimal estimator: “optimal” ↔ smallest variance (likelihood maximization gives the optimal estimator; this will be proven in a later lecture)

● Efficiency of an estimator: variance of the optimal estimator / variance of the estimator
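As an illustration, the efficiency of the median as an estimator of µ can be estimated with toy experiments. The sketch assumes Gaussian data, for which the sample mean is the minimum-variance estimator of µ; the sample size, the number of experiments and the seed are arbitrary choices.

```python
import random
import statistics

def estimator_variances(n_points=101, n_experiments=4000, seed=3):
    """Return (variance of the sample mean, variance of the sample median)."""
    rng = random.Random(seed)
    means = []
    medians = []
    for _ in range(n_experiments):
        data = [rng.gauss(0.0, 1.0) for _ in range(n_points)]
        means.append(statistics.fmean(data))
        medians.append(statistics.median(data))
    return statistics.pvariance(means), statistics.pvariance(medians)

var_mean, var_median = estimator_variances()
efficiency_of_median = var_mean / var_median
print(efficiency_of_median)
```

For large Gaussian samples the efficiency of the median approaches 2/π ≈ 0.64, so the simulated value should come out in that region.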


Symmetric Truncated Mean
● truncated mean (“getrimmter Mittelwert”):
– e.g. r = 40% truncated mean: the 10% lowest and 10% highest values are ignored, and the mean of the central 80% of the values is calculated
– r = 50% truncated mean -> arithmetic mean
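A minimal sketch of this convention, in which the central fraction 2r of the sorted values is kept and a fraction 0.5 − r is dropped on each side. How a non-integer number of values is rounded is an assumption made here (rounded down), and the function name is illustrative.

```python
def truncated_mean(values, r):
    """Symmetric truncated mean keeping the central fraction 2*r of the values.

    r = 0.5 gives the arithmetic mean; small r approaches the median.
    """
    if not 0.0 < r <= 0.5:
        raise ValueError("r must be in (0, 0.5]")
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("need at least one value")
    # Number of values dropped on each side, rounded down; the small offset
    # guards against floating-point results such as 0.9999999999999998.
    n_cut = int((0.5 - r) * n + 1e-9)
    # Always keep at least the one or two central values.
    n_cut = min(n_cut, (n - 1) // 2)
    kept = data[n_cut:n - n_cut]
    return sum(kept) / len(kept)

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]  # hypothetical data with one outlier
print(truncated_mean(sample, 0.4), truncated_mean(sample, 0.5))
```

With r = 0.4 the outlier is dropped together with the smallest value, while r = 0.5 reproduces the arithmetic mean, outlier included.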


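Figure: efficiency of the truncated mean as a function of r, shown for the Laplace (double exponential) and Cauchy distributions.
The r = 0.23 truncated mean is the best estimator for an unknown symmetric distribution.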
Moments

● r-th algebraic moment: $E[x^r]$
● r-th central moment: $E[(x-\mu)^r]$

Expectation value: 1st algebraic moment
Variance: 2nd central moment

“Schiefe”/skewness: $E[(x-\mu)^3]/\sigma^3$
- positive for right-winged distributions (longer tail towards large values)

“Wölbung”/kurtosis: $E[(x-\mu)^4]/\sigma^4 - 3$
- measure of the ratio of core to tails
- positive kurtosis: longer tails than a Gaussian
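The sample analogues of these quantities can be computed directly from the central moments. A minimal sketch, assuming 1/N-normalised sample moments and the kurtosis convention in which a Gaussian gives 0; the function names and data are illustrative.

```python
def central_moment(values, r):
    """r-th central moment of a sample (1/N normalisation)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** r for x in values) / n

def skewness(values):
    """Third central moment divided by sigma^3."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Fourth central moment divided by sigma^4, minus 3 (Gaussian gives 0)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
right_tailed = [0.0, 0.0, 0.0, 0.0, 10.0]
print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_tailed))
```

The symmetric sample has zero skewness and a negative kurtosis (shorter tails than a Gaussian), while the sample with a single large value has positive skewness.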
Skewness & Kurtosis

Gaussian distributions have kurtosis = 0

Which fraction of events is within 1, 2, 3 σ?
● |x − µ| < 1σ: 68.27%
● |x − µ| < 2σ: 95.45%
● |x − µ| < 3σ: 99.73%

This is only true for Gaussian distributions!
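For a Gaussian, these fractions follow from the error function: the probability to lie within kσ of the mean is erf(k/√2). A short sketch using the standard library (the function name is illustrative):

```python
import math

def gaussian_fraction_within(k):
    """Fraction of a Gaussian distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))

fractions = {k: gaussian_fraction_within(k) for k in (1, 2, 3)}
for k, fraction in fractions.items():
    print(f"{k} sigma: {100.0 * fraction:.2f}%")
```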

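Bienaymé-Tchebycheff Inequality
For every distribution (with finite variance) the following inequality is valid: $P(|x-\mu| \ge k\sigma) \le \frac{1}{k^2}$

| k | Gauss: P(\|x−µ\| ≥ kσ) | Tchebycheff bound: 1/k² |
|---|---|---|
| 1 | 0.317 | 1.0 |
| 2 | 0.0455 | 0.25 |
| 3 | 0.0027 | 0.1111 |
| 4 | 0.000063 | 0.0625 |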
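The bound can also be compared with a non-Gaussian case. The sketch below uses an exponential distribution with µ = σ = 1, chosen here purely as an illustration, for which the tail probability is known in closed form.

```python
import math

def gaussian_tail(k):
    """P(|x - mu| >= k*sigma) for a Gaussian distribution."""
    return math.erfc(k / math.sqrt(2.0))

def exponential_tail(k):
    """P(|x - mu| >= k*sigma) for an exponential distribution with mu = sigma = 1."""
    upper = math.exp(-(1.0 + k))                             # P(x >= 1 + k)
    lower = 1.0 - math.exp(-(1.0 - k)) if k < 1.0 else 0.0   # P(x <= 1 - k)
    return upper + lower

def tchebycheff_bound(k):
    """Upper bound 1/k^2, capped at 1 because it bounds a probability."""
    return min(1.0, 1.0 / k ** 2)

rows = [(k, gaussian_tail(k), exponential_tail(k), tchebycheff_bound(k))
        for k in (1, 2, 3, 4)]
for k, gauss, expo, bound in rows:
    print(f"k = {k}: Gauss {gauss:.6f}  exponential {expo:.6f}  bound {bound:.4f}")
```

For k ≥ 2 the exponential tails are larger than the Gaussian ones, but both stay below the bound, which holds for any distribution and is therefore rather loose.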


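Solution: Bienaymé-Tchebycheff Inequality
Given a PDF f(x) and a positive function w(x) ≥ 0, for any C > 0:

$E[w] = \int w(x) f(x)\, dx \ge \int_{w(x) \ge C} w(x) f(x)\, dx \ge C \cdot P(w(x) \ge C)$, hence $P(w(x) \ge C) \le \frac{E[w]}{C}$

with $w(x) = (x-\mu)^2$ and $C = k^2\sigma^2$: $P(|x-\mu| \ge k\sigma) \le \frac{\sigma^2}{k^2\sigma^2} = \frac{1}{k^2}$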


Two Dimensional Distributions
Multiple ways to visualize 2-dim distributions
● box plot
● lego plot
● surface plot
● numbers
● scatter plot
● color map
● contour plot
● ...


Two dimensional Distributions
● straightforward generalization of 1-dim PDFs
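Written out, the generalization could look as follows. This is a sketch using a joint PDF $f(x,y)$; the integration limits $a, b, c, d$ and the symbols $\mu_x$, $\mu_y$ are introduced here only for illustration.

```latex
\begin{aligned}
f(x,y) &\ge 0, \qquad \iint f(x,y)\,dx\,dy = 1, \\
P(a \le x \le b,\; c \le y \le d) &= \int_a^b\!\!\int_c^d f(x,y)\,dy\,dx, \\
E[x] &= \iint x\, f(x,y)\,dx\,dy = \mu_x, \qquad
E[y] = \iint y\, f(x,y)\,dx\,dy = \mu_y .
\end{aligned}
```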


Marginal Distributions
● Marginal distributions (“Randverteilungen”): projections onto the axes, $f_x(x) = \int f(x,y)\, dy$ and $f_y(y) = \int f(x,y)\, dx$
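For binned data, the projection amounts to summing the two-dimensional histogram along one axis. A minimal sketch with a hypothetical table of counts, where `counts[i][j]` is the number of events in x-bin i and y-bin j:

```python
def marginals(counts):
    """Project a 2-dim histogram onto its axes.

    Returns (marginal in x, marginal in y) as lists of bin contents.
    """
    marginal_x = [sum(row) for row in counts]            # sum over y for each x-bin
    marginal_y = [sum(column) for column in zip(*counts)]  # sum over x for each y-bin
    return marginal_x, marginal_y

counts = [[1, 2, 3],
          [4, 5, 6]]  # hypothetical histogram: 2 x-bins, 3 y-bins
marginal_x, marginal_y = marginals(counts)
print(marginal_x, marginal_y)
```

Both projections contain the same total number of events as the original histogram.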


Conditional Probability
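As an illustration for binned data, and assuming the usual definition $f(y|x) = f(x,y)/f_x(x)$, the conditional distribution of y for a fixed x-bin is the corresponding row of the 2-dim histogram, normalised to unit sum. The names and numbers below are hypothetical.

```python
def conditional_y_given_x(counts, x_bin):
    """Distribution of y for events falling into one x-bin.

    counts[i][j] is the number of events in x-bin i and y-bin j.
    """
    row = counts[x_bin]
    total = sum(row)  # marginal content of this x-bin
    if total == 0:
        raise ValueError("no events in this x-bin")
    return [c / total for c in row]

counts = [[1, 3],
          [2, 2]]  # hypothetical histogram: 2 x-bins, 2 y-bins
p_y_given_x0 = conditional_y_given_x(counts, 0)
p_y_given_x1 = conditional_y_given_x(counts, 1)
print(p_y_given_x0, p_y_given_x1)
```

In this toy histogram the distribution of y depends on the x-bin, so x and y are not independent.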

Exercise:

● Compute

● Compute