
Why do we need to organize data?

Item shape   No. of items
[shape 1]    4
[shape 2]    6
[shape 3]    5
[shape 4]    3
[shape 5]    3
Total        21

Organization criteria:
-shape
-color
-dimension
Basic terms
The term "statistica" comes from the Italian word "stato" (state). A new term, "statista", was
later derived from it, meaning a person who does business with the state. "Statistica" therefore
originally meant a collection of facts useful to a statista. The word was used with this meaning
in Italy in the 16th century and later in France, Holland and Germany.
Today the term covers far more than facts about the state and is used in almost every
domain.

Statistical population - a set of items that forms the object of study, well delimited in space
and time, and characterized by its size and structure.
Statistical unit - the fundamental item of a statistical population, described by a set of
specific characteristics. These characteristics are the subject of the research.
Sample - the subset of statistical units extracted from a statistical population in order to
be studied.
Grouping variable - a characteristic that lets us organize the statistical units of a
population into homogeneous classes, or study how other variables change in time or space.
Statistical data - the values of the grouping variables, determined by measuring them on a
scale. They are used to study the statistical units of a population.
1. Grouping variables

Types: grouping variables can be classified using the following criteria:
a) by content:
a1. attributive variables - attributes (characteristics) of the statistical units of a
population, used to organize them into homogeneous classes.
Example: gender, age, profession, productivity, seniority, salary ...

a2. time variables - these let us study how a phenomenon changes in time.
Example: day, month, quarter, semester, year etc.

In economics, time variables are conventionally divided into two subcategories:
time-interval variables and time-moment variables.

Time-moment variables refer to time intervals shorter than or equal to one day.
Time-interval variables refer to time intervals longer than one day.

a3. space variables - these are used to study how a phenomenon changes in space.
Example: department, enterprise, location, county, region, country etc.
b) by form of expression:
b1. quantitative variables - take numerical values.
Example: grade, weight, height, salary, age etc.

For this type of variable, arithmetical operations must make sense!

Quantitative variables fall into two subcategories, depending on the values they can take:
- discrete - these take only specific numeric values, usually whole numbers.
Example: population of a town, grade at an exam, number of workers,
productivity in pieces etc.

- continuous - these can take any numeric value within a specific interval.
Example: average grade, weight, salary, productivity in lei etc.

b2. qualitative variables - take values expressed in words. They are used to
distinguish between categories.
Example: gender, profession, eye color, education level, nationality etc.
Scales used to measure the values of grouping variables
a) Nominal scale - values measured on this scale only let us categorize the elements
of a population.

These values cannot be used to build a hierarchy of the elements of a population!

Example: Using the variable eye color, we can group the students of one year of
study into the following categories:

Eye color            Blue   Green   Black   Brown   Total
Number of students   30     21      15      294     360

From these values we can say only that the students fall into the four categories
above; we cannot say that the students in one category rank first in some hierarchy.
The colors are items of a set for which sorting makes no sense.
b) Ordinal scale - values measured on this scale let us build hierarchies.
Example: We can record consumers' preference for a specific product using values
such as: the best, good, normal, less than normal, the worst. We can replace
these values with numbers:
1 = the best,
2 = good,
3 = normal,
4 = less than normal,
5 = the worst.

Using these values we cannot say that a product rated the best (1) is three times
better than a product rated normal (3), or five times better than a product rated
the worst (5), even when the ratings come from the same consumer.

The ordinal scale does not let us determine the distance between two values!
c) Interval scale - values measured on this scale can be used to calculate proportions
only between intervals measured from the 0 value (the origin of the scale) to their position.

The values themselves cannot be used directly to calculate proportions, because the 0 value
is set by convention and does not signify the absence of the studied phenomenon.

An easy example of this kind of scale is the way we measure time during a day.
00:00 does not mean the absence of time. We cannot say that 08:00 is two times bigger
than 04:00, but we can say that the interval between 00:00 and 08:00 is twice as long
as the interval 00:00-04:00.
Another example is the scale used to measure temperature in degrees Celsius or
Fahrenheit. 0 °C does not signify the absence of heat. Also, we cannot say that
60 °C is twice as hot as 30 °C, but we can say that raising the temperature of an
object from 0 °C to 60 °C takes two times more heat than raising it from 0 °C to 30 °C.

d) Proportional (ratio) scale - the most complete type of scale. Values measured on
this scale can be used in all types of arithmetical operations. On this scale, the
0 value is an absolute zero and means the absence of the studied phenomenon.
Example: 0 lei means the absence of money; 100 lei is twice as much money as 50 lei.
2. Statistical series
Definition: a statistical series is a parallel between two or more datasets, at
least one of which contains the values of a grouping variable.
Types of statistical series:
a) By the number of grouping variables included:
- simple series - a parallel between two datasets that includes only one grouping variable;
- complex series - a parallel between more than two datasets that includes at least
one grouping variable.

Complex series are generally built from several simple series.

b) By the type of grouping variable:
- distribution series;
- time series;
- space series.
2. Distribution series
Conditions:
1. Distribution series can be constructed only from attributive grouping variables.
Types of distribution series:
1. By the number of grouping variables included:
- with one grouping variable - simple series (one-dimensional)
- with two or more grouping variables - complex series (two-dimensional,
three-dimensional etc.)
2. By the way the values of the grouping variable are grouped:
- values grouped by intervals - distribution series by intervals
- values grouped by variants - distribution series by variants

When constructing a distribution series by intervals, we have to answer the
following questions:
1. How many intervals do we need?
2. What size will these intervals have?
One way to determine the number of intervals is to use the empirical formula of H.A. Sturges:
$$n = 1 + 3.322 \log N$$
Once the number of intervals is known, we can calculate their size:
$$k = \frac{x_{max} - x_{min}}{n}$$
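The two formulas above can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, the logarithm is taken in base 10, and the inputs (N = 18 units, values from 100 to 150) are borrowed from the hourly-productivity example that appears later in this material.

```python
import math

def sturges_intervals(n_units: int) -> float:
    """Sturges' empirical formula: n = 1 + 3.322 * log10(N)."""
    return 1 + 3.322 * math.log10(n_units)

def interval_size(x_min: float, x_max: float, n_intervals: int) -> float:
    """Size of each interval: k = (x_max - x_min) / n."""
    return (x_max - x_min) / n_intervals

n_raw = sturges_intervals(18)      # about 5.17
n = round(n_raw)                   # rounded to a whole number of intervals
k = interval_size(100, 150, n)     # interval size in the units of the variable
print(n_raw, n, k)
```

Rounding to the nearest whole number is a choice made here for the sketch; in practice the result is often adjusted so that the interval size is a convenient value.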
How to construct a simple one-dimensional distribution series by intervals

Grouping         Absolute frequency /    Relative         Increasing cumulative       Decreasing cumulative
variable (X)     number of cases (fi)    frequency (pi)   absolute frequency (icfi)   absolute frequency (dcfi)
IL1(=xmin)-SL1   f1                      p1=f1/N          icf1=f1                     dcf1=N
SL1-SL2          f2                      p2=f2/N          icf2=icf1+f2                dcf2=dcf1-f1
SL2-SL3          f3                      p3=f3/N          icf3=icf2+f3                dcf3=dcf2-f2
SL3-SL4(=xmax)   f4                      p4=f4/N          icf4=icf3+f4=N              dcf4=dcf3-f3=f4
Total            N=f1+f2+f3+f4           1                *                           *

Note: the inferior (or superior) limit is included in the interval.

Example: Simple distribution series by variants
The distribution of students from year I, Accounting, by grades from the Basic
Statistics exam in 2007

Grade   No. of students
2       4
3       6
4       10
5       14
6       18
7       19
8       17
9       14
10      8
Total   110

Example: Simple distribution series by intervals
The distribution of workers from the company SC CRS CONSTRUCT SRL by monthly gross salary

Salary (lei)   No. of workers
600-800        1
800-1000       7
1000-1200      12
1200-1400      11
1400-1600      5
1600-1800      4
Total          40
Note: the inferior limit is included in the interval.
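As a rough illustration of how the frequency columns of a simple distribution series are filled in (relative frequency pi, increasing cumulative frequency icfi, decreasing cumulative frequency dcfi), the sketch below uses the worker counts per salary interval from the example (1, 7, 12, 11, 5, 4). The function name and the dictionary layout are assumptions made for this sketch.

```python
def frequency_table(freqs):
    """Return one row per class with f, p, icf and dcf."""
    total = sum(freqs)
    rows = []
    icf = 0          # increasing cumulative frequency, starts at f1
    dcf = total      # decreasing cumulative frequency, starts at N
    for f in freqs:
        icf += f
        rows.append({"f": f, "p": f / total, "icf": icf, "dcf": dcf})
        dcf -= f
    return rows

salary_freqs = [1, 7, 12, 11, 5, 4]   # workers per salary interval
table = frequency_table(salary_freqs)
for row in table:
    print(row)
```

The last icf value equals N and the last dcf value equals the last absolute frequency, which gives a quick check on the calculation.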


2. Two-dimensional distribution series

$$\sum_{i=1}^{n} F_{x_i} = \sum_{j=1}^{n} F_{y_j} = \sum_{i=1}^{n}\sum_{j=1}^{n} f_{ij} = N$$
How to construct a two-dimensional distribution series

[Raw data table with the columns: Crt. No. | Hourly productivity | Salary]
First we identify the type of each grouping variable:
- hourly productivity (thousand lei) - attributive, quantitative and continuous;
- salary (million lei) - attributive, quantitative and continuous.

If there is a dependency relationship between the two grouping variables, the
independent variable is placed in the first column of the table and the dependent
variable in the first row of the table.

We calculate the number of intervals and their size:
- for the first grouping variable, hourly productivity (X):
$$n = 1 + 3.322 \log 18 = 5.170015 \approx 5, \qquad k = \frac{150 - 100}{5} = 10 \text{ thousand lei}$$
- for the second grouping variable, salary (Y):
$$n = 1 + 3.322 \log 18 = 5.170015 \approx 5, \qquad k = \frac{3.2 - 1.7}{5} = 0.30 \text{ million lei}$$


We construct and fill in the table for the two-dimensional distribution:

[Two-dimensional distribution table built from the raw data (Crt. No. | Hourly productivity | Salary)]

Note: the inferior limit is included in the interval.
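Filling in the two-dimensional table amounts to counting, for each pair of intervals, how many units fall into both. The sketch below shows one way to do this with the inferior limit included in the interval. The raw pairs are hypothetical (the original raw data table is not available here), salary is expressed in units of 0.1 million lei so that all arithmetic stays in whole numbers, and the function names are illustrative.

```python
def interval_index(value, x_min, size, n_intervals):
    """Index of the interval containing value; inferior limit included."""
    idx = int((value - x_min) // size)
    return min(idx, n_intervals - 1)   # x_max goes into the last interval

def cross_table(pairs, x_min, x_size, y_min, y_size, n):
    """n x n table of joint frequencies f_ij (rows: X, columns: Y)."""
    table = [[0] * n for _ in range(n)]
    for x, y in pairs:
        i = interval_index(x, x_min, x_size, n)
        j = interval_index(y, y_min, y_size, n)
        table[i][j] += 1
    return table

# Hypothetical pairs: (hourly productivity in thousand lei,
#                      salary in units of 0.1 million lei)
pairs = [(100, 17), (105, 18), (110, 20), (125, 23), (138, 26), (150, 32)]
table = cross_table(pairs, x_min=100, x_size=10, y_min=17, y_size=3, n=5)

row_totals = [sum(row) for row in table]            # F_xi
col_totals = [sum(col) for col in zip(*table)]      # F_yj
print(table, row_totals, col_totals)
```

The row totals and the column totals both add up to N, which is the check expressed by the summation formula for two-dimensional series.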


2. Time series
Conditions:
1. A time series can be constructed only from a time grouping variable.
2. The values of the grouping variable must be ordered chronologically.
3. It must contain enough values to capture the tendency of the studied phenomenon.
4. The values of the studied variables must refer to the same space.

Time-interval statistical series
Production of primary energy (thousand tonnes of oil equivalent)

Year   Coal   Oil    Gas    Hydro- and nuclear-electrical energy
2005   5793   5326   9536   3101
2006   6477   4897   9395   2961
2007   6858   4651   9075   3264
2008   7011   4619   8982   4233
2009   6477   4390   8964   4242
2010   5903   4186   8705   4618
Source: The Romania's Yearbook 2011

Time-moment statistical series
a) Exchange rates (lei)

Date         Euro     US dollar
17.02.2012   4.3533   3.3100
20.02.2012   4.3535   3.2903
21.02.2012   4.3550   3.2903
22.02.2012   4.3602   3.2954
23.02.2012   4.3557   3.2714
24.02.2012   4.3535   3.2524
27.02.2012   4.3525   3.2468
Source: RNB

b) Stock of diesel, Gas station 3 (tonnes)

Date         Stock
17.02.2012   1500
20.02.2012   *
21.02.2012   1250
22.02.2012   *
23.02.2012   1300
24.02.2012   1100
27.02.2012   1450


2. Space series
Conditions:
1. A space series can be constructed only from space grouping variables.
2. It must contain enough values to capture the changes of the studied variables in space.
3. The values of the studied variables must refer to the same period of time.

Simple space series
Population at 1 July 2010

County      Population
Dolj        702124
Olt         462734
Vâlcea      406555
Gorj        376179
Mehedinţi   291051

Complex space series
Average life expectancy by gender in 2010 (years)

County      Total M   Total F   Urban M   Urban F   Rural M   Rural F
Vâlcea      72.23     78.9      72.75     78.76     71.41     78.55
Gorj        70.49     76.73     71.29     76.76     69.61     76.48
Mehedinţi   69.26     75.36     70.18     75.87     67.91     74.73
Dolj        69.1      76.78     71.48     78.43     66.28     75.21
Olt         68.88     76.51     69.69     76.8      67.87     75.95

Source: The Romania's Yearbook 2011


Grouping variables and the series they produce
- attributive
  - quantitative
    - continuous: distribution series by intervals
    - discrete: distribution series by variants or by intervals
  - qualitative: distribution series by variants
- time
  - time-moments: time-moments series
  - time-intervals: time-intervals series
- space: space series

2. Distribution series
Active social and economic operators from the national economy by size class, 2011

Number of      Number of   Relative         Increasing cumulative   Decreasing cumulative
employees      companies   frequency (pi)   abs. frequency (icfi)   abs. frequency (dcfi)
up to 9        758286      0.91451          758286                  829170
10-49          56117       0.06768          814403                  70884
50-249         12617       0.01522          827020                  14767
250 and over   2150        0.00259          829170                  2150
Total          829170      1.00000          *                       *

Source: The Romania's Yearbook, 2012


Active social and economic operators from the national economy by size class, 2010

Number of      Number of   Relative         Increasing cumulative   Decreasing cumulative
employees      companies   frequency (pi)   abs. frequency (icfi)   abs. frequency (dcfi)
up to 9        931096      0.928423         931096                  1002879
10-49          54304       0.054148         985400                  71783
50-249         14896       0.014853         1000296                 17479
250 and over   2583        0.002576         1002879                 2583
Total          1002879     1.000000         *                       *

Source: The Romania's Yearbook, 2011


1. The constructive elements of a statistical chart
2. Charts for distribution series
3. Charts for time series
4. Charts for space series
5. Statistical charts for comparisons
6. Statistical charts for structures
7. Other statistical charts
1. Constructive elements of a statistical chart
a. Chart title - summarizes the chart's content in a short, clear text.

b. Chart scale - an essential element of a statistical chart. The scale ensures that
the indicators represented in the chart are drawn in proportion.

Types of scales

1. By shape:
- linear scale
- nonlinear scale

2. By the size of the intervals between tick marks:
- uniform scale
- logarithmic scale

c. The gridlines
d. The chart figure
e. The legend
f. The explanatory note
[Figure: "The evolution of SC. Axix SA 2001-2010 turnover"; x-axis: years 2001-2010, y-axis: thousands lei]
[Figure: "The evolution of turnover", SC Fenix SRL and SC Binar SRL; x-axis: years 2001-2010. Note: SC Binar SRL was founded in 2002]


2. Charts for distribution series
A. The histogram
B. The frequency polygon
C. The curve of cumulative frequency (ogive)

The histogram by rectangles

Weight (kg)   Number of packages   Increasing cumulative frequencies
40 - 45       7                    7
45 - 50       26                   33
50 - 55       27                   60
55 - 60       37                   97
60 - 65       43                   140
65 - 70       34                   174
70 - 75       27                   201
75 - 80       11                   212
Total         212                  *

[Figure: histogram by rectangles, "The evolution of number of packages transported by SC Pegasus SRL in January 2009"; x-axis: Weight (kg), y-axis: Number of packages]

The histogram by sticks

[Figure: histogram by sticks for the package-weight data of SC Pegasus SRL, January 2009 (weight intervals 40 - 45 to 75 - 80 kg; 7, 26, 27, 37, 43, 34, 27, 11 packages); x-axis: Weight (kg), y-axis: Number of packages]

The frequency polygon

[Figure: frequency polygon for the package-weight data of SC Pegasus SRL, January 2009 (weight intervals 40 - 45 to 75 - 80 kg; 7, 26, 27, 37, 43, 34, 27, 11 packages); x-axis: Weight (kg), y-axis: Number of packages]

The curve of cumulative frequency (ogive)

[Figure: ogive for the package-weight data of SC Pegasus SRL, January 2009 (increasing cumulative frequencies 7, 33, 60, 97, 140, 174, 201, 212); x-axis: Weight (kg), y-axis: Cumulative number of packages]

2. Charts for two-dimensional distribution series

         Age groups
Gender   under 15   15-30   30-45   45-60   60 and over
Male     120        200     255     180     100
Female   115        205     280     250     200

[Figures: A. Polygonal surface; B. Two-dimensional histogram]


3. Charts for time series
A. Chronogram (line chart)

Year   Turnover (thousands lei)
2004   600
2005   850
2006   748
2007   805
2008   983
2009   950
2010   901
2011   705

[Figure: chronogram, "The evolution of the turnover at SC AlumPAN SRL"; x-axis: years 2004-2011, y-axis: 0 to 1200]


B. Chronogram with gap

[Figure: chronogram with gap, "The evolution of the turnover at SC AlumPAN SRL" (turnover 2004-2011, thousands lei: 600, 850, 748, 805, 983, 950, 901, 705); x-axis: years 2004-2011, y-axis starting at 550 and ending at 1000]


C. Column chart

[Figure: column chart, "The evolution of the turnover at SC AlumPAN SRL" (turnover 2004-2011, thousands lei: 600, 850, 748, 805, 983, 950, 901, 705); x-axis: years 2004-2011, y-axis: 0 to 1000]


D. Radial polar diagram

$$r = \frac{x_{max} + x_{min}}{2}$$

Month   Ice cream sales (mil. lei)   Month   Ice cream sales (mil. lei)
Jan     5                            Jul     35
Feb     6                            Aug     18
Mar     9                            Sep     10
Apr     10                           Oct     8
May     16                           Nov     7
Jun     30                           Dec     6

[Figure: radial polar diagram with one axis per month, Jan to Dec]


E. Sectorial polar diagram

$$r = \frac{x_{max} + x_{min}}{2}$$

[Figure: sectorial polar diagram of monthly ice cream sales (mil. lei; Jan 5, Feb 6, Mar 9, Apr 10, May 16, Jun 30, Jul 35, Aug 18, Sep 10, Oct 8, Nov 7, Dec 6), one sector per month]


4. Charts for space series
A. Column diagram

County      Population
Dolj        762142
Gorj        401021
Mehedinţi   332673
Olt         523291
Vâlcea      438388

[Figure: column diagram of population by county; x-axis: Dolj, Gorj, Mehedinţi, Olt, Vâlcea, y-axis: 0 to 800000]


B. Bar (pipe) diagram

[Figure: horizontal bar diagram of population by county (Dolj 762142, Gorj 401021, Mehedinţi 332673, Olt 523291, Vâlcea 438388); x-axis: 0 to 800000]


C. Cartogram

Social exclusion through reorganization and dismissal 1990-2003, by regions (%)

Region              %
Crişana Maramureş   7
Moldova             7
Transilvania        8
Banat               11
Muntenia            10
Dobrogea            15
Oltenia             13
Bucureşti           9
Average = 9%

Legend: below average / above average
Source: C. Radu, N. Vasilescu, C. Ionaşcu, Studiul statistic al pieţei muncii din regiunea Oltenia, Editura Sitech, Craiova, 2005, p. 121.

D. Cartodiagram

The proportion of employed population (% of active population) by development regions - 2002

Region       %
Nord-West    92.3
Nord-East    92.4
Center       91.6
West         92.8
South-East   89.6
South-West   93.1
South        90.1
Bucureşti    91.2
National average = 91.6%

Legend: below average / above average
Source: C. Radu, N. Vasilescu, C. Ionaşcu, Studiul statistic al pieţei muncii din regiunea Oltenia, Editura Sitech, Craiova, 2005, p. 121.

5. Comparison diagrams by two-dimensional shapes
A. The rectangle. Case 1

Area = Width x Length

Company   Production (P), mil. lei
A         350
B         250

[Figure: rectangle A with P = 350 mil. lei and rectangle B with P = 250 mil. lei]


A. The rectangle. Case 2

A = L x w
P = N x w

Company   Production (P), mil. lei   No. of workers (N)   Productivity (w), thousands lei
A         350                        200                  1750
B         250                        156                  1603

[Figure: rectangle A with sides N = 200 and w = 1750 (P = 350 mil. lei); rectangle B with sides N = 156 and w = 1603 (P = 250 mil. lei)]

B. The circle

$$radius = \sqrt{Area}$$
$$r_A = \sqrt{P_A} = \sqrt{350} = 18.7 \qquad r_B = \sqrt{P_B} = \sqrt{250} = 15.8$$

Company   Production (P), mil. lei   No. of workers (N)   Productivity (w), thousands lei
A         350                        200                  1750
B         250                        156                  1603

[Figure: circle A with P = 350 mil. lei and circle B with P = 250 mil. lei]
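The circle rule above can be checked with a short Python sketch. It follows the slide's convention r = sqrt(P), which leaves out the factor pi; since pi is the same for both circles, the ratio of the two areas still equals the ratio of the two production values. The function name is hypothetical.

```python
import math

def circle_radius(value: float) -> float:
    """Radius used to draw a circle whose area is proportional to value."""
    return math.sqrt(value)

r_a = circle_radius(350)   # company A, P = 350 mil. lei
r_b = circle_radius(250)   # company B, P = 250 mil. lei

# Ratio of the true circle areas (pi * r^2); pi cancels out
area_ratio = (math.pi * r_a ** 2) / (math.pi * r_b ** 2)
print(round(r_a, 1), round(r_b, 1), area_ratio)
```

The same square-root rule gives the side of a square when squares are used for the comparison.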


C. The square

$$Area = side^2 \Rightarrow side = \sqrt{Area}$$

[Figure: square A with P = 350 mil. lei and square B with P = 250 mil. lei, for the companies A (P = 350 mil. lei, N = 200, w = 1750 thousands lei) and B (P = 250 mil. lei, N = 156, w = 1603 thousands lei)]


5. Comparison diagrams by three-dimensional shapes
D. The parallelepiped

Volume = Length x width x height

Company   Production (P),   No. of workers (N)   Productivity (w),   Unit cost (c),
          thousand lei                           thousands lei       thousands lei/unit
A         8750              200                  1750                25
B         7200              150                  1600                30

[Figure: parallelepiped A (N = 200 workers, w = 1750 units) and parallelepiped B (N = 150 workers, w = 1600 units)]

E. The cylinder

$$Volume = Base\ area \times height = \pi \cdot r^2 \cdot h$$

[Figure: cylinders A and B for the companies A (P = 350 mil. lei, N = 200, w = 1750 thousands lei) and B (P = 250 mil. lei, N = 156, w = 1603 thousands lei)]


F. The sphere

[Figure: sphere A with P = 350 mil. lei and sphere B with P = 250 mil. lei, for the companies A (N = 200, w = 1750 thousands lei) and B (N = 156, w = 1603 thousands lei)]

6. Diagrams for structures

We can use the same geometrical shapes as for comparisons.

A. The rectangle
B. The parallelepiped

Product category     Sales value (mil. lei)   %
Total, of which:     1100                     100.00
- food               300                      27.27
- appliances         600                      54.54
- clothing           150                      13.64
- other products     50                       4.55

[Figure: structure diagram by a) rectangle and b) parallelepiped, with segments food 27.27%, appliances 54.54%, clothing 13.64%, other products 4.55%]

C. The square

[Figure: structure diagram by square for the sales structure (total 1100 mil. lei: food 300, 27.27%; appliances 600, 54.54%; clothing 150, 13.64%; other products 50, 4.55%)]


D. The circle

Product category     Sales (%)   Graphic (degrees)
Total, of which:     100.00      360
- food               27.27       98.18
- appliances         54.54       196.34
- clothing           13.64       49.10
- other products     4.55        16.38

[Figure: pie chart with sectors food 27.27%, appliances 54.54%, clothing 13.64%, other products 4.55%]
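The conversion from shares to sector angles can be sketched as below: each category receives a part of the 360 degrees proportional to its share of the total. The sketch starts from the sales values in mil. lei (300, 600, 150, 50) given for the structure diagrams; the variable names are illustrative. The table above appears to derive the degrees from already rounded percentages, so its figures can differ from these in the second decimal.

```python
sales = {"food": 300, "appliances": 600, "clothing": 150, "other products": 50}
total = sum(sales.values())

# Share of the circle (in degrees) for each category
degrees = {name: value / total * 360 for name, value in sales.items()}
# Percentage share for each category
percent = {name: value / total * 100 for name, value in sales.items()}

for name in sales:
    print(f"{name}: {percent[name]:.2f}% -> {degrees[name]:.2f} degrees")
```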


7. Other statistical charts
Balance diagram
Example: for a goods warehouse we know the following data:
N1 = 3000 pcs.; I = 500 pcs.; E = 1500 pcs.; N2 = 2000 pcs.

[Figure: balance diagram with columns N1, I, E, N2; y-axis: 500 to 3500 pcs.]


Age pyramid

[Figure: age pyramid. Source: The Romania's Yearbook 2009]


Statistical data and the charts that fit them
- Can construct a series
  - distributions: histogram, frequency polygon, cumulative frequency curve
  - time series
    - without seasonality: chronogram, chronogram with gap, column diagram
    - with seasonality: radial polar diagram, sectorial polar diagram
  - space series: column diagram, bar diagram, cartogram, cartodiagram
- Can't construct a series
  - comparisons and structures: 2D geometrical shapes (circle, square, rectangle etc.),
    3D geometrical shapes (sphere, parallelepiped, cylinder etc.)
  - special cases: balance diagram, age pyramid

Elements of geometry

Triangle
Perimeter = the sum of all sides, that is:
P = AB + BC + CA
Area of the triangle = (height x base)/2, that is:
A_triangle = (b x h)/2.
In our case, b = BC and h = AD. Therefore,
A_ABC = (BC x AD)/2

Rectangle, with length L = AB and width l = BC
Perimeter = the sum of all sides, that is:
P = AB + BC + CD + DA or P = 2(L + l)
Area of the rectangle = length x width
A_rectangle = L x l. In our case, A_ABCD = AB x BC.

Square
Perimeter = the sum of all sides, that is:
P = AB + BC + CD + DA or P = 4L, where L is the side of the square
(AB = BC = CD = DA = L).
Area of the square = side x side = side^2, that is, A_square = L^2.
In our case, A_ABCD = AB^2.

2. Rules for positioning the elements in a statistical chart


2. Rules for positioning and coloring the elements of a chart




1. The mode
2. The quantiles
3. The mean
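The mode
Definition: the value of the studied variable that has the maximum absolute frequency.
- Because it requires the absolute frequencies, it can be calculated only for distribution series.

Calculation
- mathematical calculation:
$$Mo = L_i + k\,\frac{\Delta_1}{\Delta_1 + \Delta_2}$$
where L_i is the lower limit of the modal interval, k is the interval size, and
Δ1 (Δ2) is the difference between the frequency of the modal interval and the
frequency of the previous (next) interval.
- graphic calculation:
[Figure: histogram with fi on the y-axis (5 to 25) and xi on the x-axis (100 to 170); Mo is read on the x-axis]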
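A minimal Python sketch of the mathematical calculation of the mode for a distribution series by intervals is given below. It uses the productivity data from the worked example in this material (intervals of size 10 from 100 to 170, frequencies 5, 10, 15, 25, 20, 15, 10). The function name is hypothetical, and treating a missing neighbouring interval as frequency 0 is an assumption of the sketch, not something stated in the slides.

```python
def grouped_mode(lower_limits, freqs, k):
    """Mo = Li + k * d1 / (d1 + d2) for the interval with the highest frequency."""
    i = max(range(len(freqs)), key=lambda idx: freqs[idx])   # modal interval
    prev_f = freqs[i - 1] if i > 0 else 0                    # assumption: 0 if absent
    next_f = freqs[i + 1] if i < len(freqs) - 1 else 0       # assumption: 0 if absent
    d1 = freqs[i] - prev_f
    d2 = freqs[i] - next_f
    return lower_limits[i] + k * d1 / (d1 + d2)

lower_limits = [100, 110, 120, 130, 140, 150, 160]
freqs = [5, 10, 15, 25, 20, 15, 10]
mo = grouped_mode(lower_limits, freqs, k=10)
print(round(mo, 2))
```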

The quantiles
Definition: position indicators that let us split a dataset into a given number of
equal-sized parts.
Types
- Quartiles - split a dataset into 4 equal parts. There are 3 quartiles.
- Deciles - split a dataset into 10 equal parts. There are 9 deciles.
- Percentiles - split a dataset into 100 equal parts. There are 99 percentiles.

Calculation
$$x_{c_i} = L_{c_i} + \left(\frac{c_i}{n_p}\sum f_j - S_f\right)\frac{k}{f_{c_i}}$$
where n_p is the number of parts and c_i is the rank of the quantile (for example
c_i = 2 and n_p = 4 for the second quartile).
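The quantile formula for a distribution series by intervals can be sketched in Python as follows. The sketch assumes the usual reading of the symbols: L is the lower limit of the interval that contains the quantile, Sf is the cumulative frequency of the intervals before it, f is the frequency of that interval and k is the interval size. The function name is hypothetical and the data are the productivity frequencies from the worked example in this material.

```python
def grouped_quantile(lower_limits, freqs, k, c, n_parts):
    """Quantile of rank c out of n_parts for a series grouped by intervals."""
    target = c / n_parts * sum(freqs)     # position of the quantile
    cumulated = 0                         # Sf: frequencies before the interval
    for lower, f in zip(lower_limits, freqs):
        if cumulated + f >= target:
            return lower + (target - cumulated) * k / f
        cumulated += f
    raise ValueError("rank c must satisfy 0 < c <= n_parts")

lower_limits = [100, 110, 120, 130, 140, 150, 160]
freqs = [5, 10, 15, 25, 20, 15, 10]

q1 = grouped_quantile(lower_limits, freqs, 10, c=1, n_parts=4)       # first quartile
median = grouped_quantile(lower_limits, freqs, 10, c=2, n_parts=4)   # second quartile
d5 = grouped_quantile(lower_limits, freqs, 10, c=5, n_parts=10)      # 5th decile
print(q1, median, d5)
```

The second quartile and the 5th decile return the same value, which is the median.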


The median
Definition: the value that splits a dataset, ordered ascending or descending, into two
equal parts.
Calculation
- for simple series (previously ordered ascending or descending):
  - with an odd number of values:
  $$\text{Place of } Me = \frac{n + 1}{2}$$
  - with an even number of values:
  $$Me = \frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2}$$
- for simple distribution series:
  - mathematical calculation:
  $$Me = L_i + \left(\frac{\sum f_i}{2} - S_f\right)\frac{k}{f_{Me}}$$
  - graphic calculation:
  [Figure: curve of increasing cumulative frequencies (y-axis: 10 to 100) over xi (x-axis: 100 to 170); Me is read on the x-axis]

The value of the median is equal to the value of:
- the second quartile
- the 5th decile
- the 50th percentile

The mean - definition, significance
Notation used
X - a variable X(x1, x2, ..., xi, ..., xn)
xi - values of variable X
fi - absolute frequency of the value xi
n - total number of values of variable X
$\bar{x}$ - the mean of variable X
Definition: The mean of a variable is a synthetic value derived from all of the variable's
values, expressed as one representative level that captures what is essential, typical and
objective in its development.

Because phenomena are diverse and complex, in practice we use several types of mean:
- arithmetic mean
- harmonic mean
- geometric mean
- quadratic mean
- chronologic mean


The arithmetic mean - subtypes, calculation
Subtypes:
- unweighted - used when each value of the studied variable occurs only once.
- weighted - used when the values of the studied variable occur more than once, that is,
when we know the absolute frequency of each value of the studied variable.

Calculation:
- unweighted:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
- weighted:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i f_i}{\sum_{i=1}^{n} f_i}$$
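Both formulas can be illustrated with a few lines of Python. The unweighted example uses the values 8, 9, 10, 11, 12 from the properties example at the end of this material, and the weighted example uses the distribution of students by grade (grades 2 to 10 with 4, 6, 10, 14, 18, 19, 17, 14, 8 students). The function names are illustrative.

```python
def arithmetic_mean(values):
    """Unweighted arithmetic mean: sum of the values divided by their number."""
    return sum(values) / len(values)

def weighted_arithmetic_mean(values, freqs):
    """Weighted arithmetic mean: sum(xi * fi) / sum(fi)."""
    return sum(x * f for x, f in zip(values, freqs)) / sum(freqs)

unweighted = arithmetic_mean([8, 9, 10, 11, 12])

grades = [2, 3, 4, 5, 6, 7, 8, 9, 10]
students = [4, 6, 10, 14, 18, 19, 17, 14, 8]
weighted = weighted_arithmetic_mean(grades, students)
print(unweighted, round(weighted, 2))
```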

Applicability
The arithmetic mean is used when the studied phenomenon records almost constant changes
(in arithmetic progression) and therefore shows a linear trend. It is the type of mean
most often used in practice.

The arithmetic mean - properties
I. Properties for verifying the accuracy of the calculation

1. The arithmetic mean always lies between the minimum and the maximum value of the
studied variable:
$$x_{min} \le \bar{x} \le x_{max}$$
2. The sum of the deviations of the variable's values from its mean is always zero:
- for the unweighted mean:
$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$$
- for the weighted mean:
$$\sum_{i=1}^{n} (x_i - \bar{x}) f_i = 0$$


The arithmetic mean - properties
II. Properties for simplifying the calculation
1. If we start from a variable X (x1, x2, ..., xn) with mean $\bar{x}$ and construct another
variable X', whose values are obtained by subtracting the same constant a from the values
of X, that is X' (x1-a, x2-a, ..., xn-a), the mean of X' is equal to $\bar{x} - a$:
$$\bar{x}' = \frac{\sum x_i'}{n} = \frac{\sum (x_i - a)}{n} = \frac{\sum x_i - na}{n} = \frac{\sum x_i}{n} - \frac{na}{n} = \bar{x} - a$$
2. If we start from a variable X (x1, x2, ..., xn) with mean $\bar{x}$ and construct a new
variable X'', whose values are obtained by dividing the values of X by a constant k, that is
X'' (x1/k, x2/k, ..., xn/k), the mean of X'' is equal to $\bar{x}/k$:
$$\bar{x}'' = \frac{\sum x_i''}{n} = \frac{\sum \frac{x_i}{k}}{n} = \frac{\frac{1}{k}\sum x_i}{n} = \frac{1}{k}\cdot\frac{\sum x_i}{n} = \frac{1}{k}\,\bar{x} = \frac{\bar{x}}{k}$$
3. Combining the two properties above, which also hold for the weighted arithmetic mean,
we obtain a formula for simplified calculation:
$$\bar{x} = \frac{\sum_{i=1}^{n}\left(\frac{x_i - a}{k}\right) f_i}{\sum_{i=1}^{n} f_i}\cdot k + a$$
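The properties of the arithmetic mean can be checked numerically with a short sketch. The values 8 to 12 with a = 1.1 and k = 2 come from the example at the end of this material. The simplified calculation is illustrated on the productivity distribution, using interval midpoints (105, 115, ..., 165) as the xi values with a = 135 and k = 10; taking midpoints and these particular constants is an assumption made only for this sketch.

```python
def mean(values):
    return sum(values) / len(values)

def weighted_mean(values, freqs):
    return sum(x * f for x, f in zip(values, freqs)) / sum(freqs)

xs = [8, 9, 10, 11, 12]
m = mean(xs)

# Property I.2: the deviations from the mean add up to zero
deviation_sum = sum(x - m for x in xs)

# Properties II.1 and II.2: subtracting a constant, dividing by a constant
a, k = 1.1, 2
shifted_mean = mean([x - a for x in xs])   # expected: m - a
scaled_mean = mean([x / k for x in xs])    # expected: m / k

# Property II.3: simplified calculation for a weighted mean
midpoints = [105, 115, 125, 135, 145, 155, 165]   # assumed interval midpoints
freqs = [5, 10, 15, 25, 20, 15, 10]
a2, k2 = 135, 10
transformed = [(x - a2) / k2 for x in midpoints]  # -3, -2, -1, 0, 1, 2, 3
simplified = weighted_mean(transformed, freqs) * k2 + a2
direct = weighted_mean(midpoints, freqs)
print(deviation_sum, shifted_mean, scaled_mean, simplified, direct)
```

Choosing a close to the centre of the distribution and k equal to the interval size turns the xi values into small whole numbers, which is what makes the calculation simpler by hand.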
The mean of a binomial (dichotomous) variable
A binomial (dichotomous) variable is a variable that has only two alternative values.
Example: a) quality of a product: good or scrap; b) status of a student after an exam:
passed or not passed; c) gender: male or female.

Values    Absolute frequencies (fi)   Relative frequencies (pi)
x1 = 1    f1                          p1 = p
x2 = 0    f2                          p2 = q
Total     N = f1 + f2                 p + q = 1

$$\bar{x} = \frac{\sum x_i f_i}{\sum f_i} = \frac{x_1 f_1 + x_2 f_2}{f_1 + f_2} = x_1\frac{f_1}{f_1 + f_2} + x_2\frac{f_2}{f_1 + f_2} = 1\cdot p + 0\cdot q = p$$

Harmonic mean
Applicability

The harmonic mean is used in practice only in some cases.

1. It is used as a mathematical model in the calculation of some statistical indicators,
such as the group index of sale prices for goods and services on the open market.
2. It is recommended for calculating the average of two interdependent variables that are
in inverse ratio. For one of them (the one for which the direct sum of the values makes
sense) the average is calculated using the arithmetic mean, and for the other it is
calculated using the harmonic mean.

Formula
- unweighted:
$$\bar{x}_h = \frac{n}{\sum \frac{1}{x_i}}$$
- weighted:
$$\bar{x}_h = \frac{\sum f_i}{\sum \frac{1}{x_i} f_i}$$


Harmonic mean - properties
1. If the arithmetic mean and the harmonic mean are calculated for the same dataset, they
always verify this relation:
$$\bar{x}_h \le \bar{x}$$
2. If two variables X and Y are related by $X = \frac{1}{Y}$, then the same relation holds
between their means (one arithmetic, the other harmonic):
$$\bar{X} = \frac{1}{\bar{Y}}$$


The quadratic mean
Applicability
1. The quadratic mean is used when the studied phenomenon records changes approximately
in exponential progression (for example, when growth is slower at the beginning of the
series and becomes more pronounced towards the end). It is used in the analysis of
exponential trends.
2. It is used as a mathematical model for one of the synthetic indicators of variance: the
standard deviation σ.

Formula
- unweighted:
$$\bar{x}_q = \sqrt{\frac{\sum_{i=1}^{n} x_i^2}{n}}$$
- weighted:
$$\bar{x}_q = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 f_i}{\sum_{i=1}^{n} f_i}}$$

Properties
1. If the arithmetic mean and the quadratic mean are calculated for the same dataset, they
always verify this relation:
$$\bar{x} \le \bar{x}_q$$
The geometric mean
Applicability
1. The geometric mean is used when the studied phenomenon records changes approximately
in geometric progression.
2. It is frequently used when the differences between the values of the studied variable
are larger at the beginning of the series and become smaller towards its end.
3. It is used as a mathematical model for calculating one of the synthetic indicators of
chronological series (the average index of dynamics).

Formula:
- unweighted:
$$\bar{x}_g = \sqrt[n]{\prod_{i=1}^{n} x_i}$$
- weighted:
$$\bar{x}_g = \sqrt[\sum_{i=1}^{n} f_i]{\prod_{i=1}^{n} x_i^{f_i}}$$

Properties
1. If the arithmetic mean and the geometric mean are calculated for the same dataset, they
always verify this relation:
$$\bar{x}_g \le \bar{x}$$

If the arithmetic, the geometric, the quadratic and the harmonic mean are calculated for
the same dataset, they always verify this relation:
$$\bar{x}_h \le \bar{x}_g \le \bar{x} \le \bar{x}_q$$
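The ordering of the four means can be checked on a small dataset. The sketch below implements the unweighted formulas and applies them to the values 8, 9, 10, 11, 12 used in the properties example of this material; the function names are illustrative, and math.prod requires Python 3.8 or later.

```python
import math

def harmonic_mean(xs):
    return len(xs) / sum(1 / x for x in xs)

def geometric_mean(xs):
    return math.prod(xs) ** (1 / len(xs))

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def quadratic_mean(xs):
    return math.sqrt(sum(x * x for x in xs) / len(xs))

xs = [8, 9, 10, 11, 12]
h = harmonic_mean(xs)
g = geometric_mean(xs)
a = arithmetic_mean(xs)
q = quadratic_mean(xs)
print(round(h, 3), round(g, 3), round(a, 3), round(q, 3))
```

The four means are equal only when all values in the dataset are identical; otherwise the inequalities are strict.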

The chronologic mean
Applicability:
The chronologic mean is used, exclusively, for time-moments series.
1. unweighted chronologic mean – for time-moments series with moments regularly placed
in time (the periods of time are equal between any two consecutively time-moments).
2. weighted chronologic mean – for time-moments series with moments irregularly placed in
time (at least one of the periods of time between two consecutively time-moments is
different than the rest of them).

Formula
- unweighted: $\bar{x}_c = \dfrac{\sum_{i=1}^{k} \bar{x}_i}{k} = \dfrac{\sum_{i=1}^{n-1} \bar{x}_i}{n-1}$
- weighted: $\bar{x}_c = \dfrac{\sum_{i=1}^{k} \bar{x}_i t_i}{\sum_{i=1}^{k} t_i} = \dfrac{\sum_{i=1}^{n-1} \bar{x}_i t_i}{\sum_{i=1}^{n-1} t_i}$
where:
$\bar{x}_i = \dfrac{x_i + x_{i+1}}{2}$, $i = 1..k$ - the moving average of two consecutive terms
n - the number of terms of the statistic series
k = n - 1 - the number of time periods
$t_i$ - the period of time, usually expressed in days, between time moments i and i+1.

Types of mean by type of series

A series can be constructed from the dataset:
- distribution series: weighted arithmetic mean
- time series
  - time-intervals series: unweighted arithmetic mean, unweighted quadratic mean, unweighted geometric mean
  - time-moments series with equal periods between moments (=): unweighted chronological mean
  - time-moments series with unequal periods between moments (≠): weighted chronological mean
- space series

A series cannot be constructed from the dataset (special cases). Example:
• we know the xi values
• we don't know the fi absolute frequencies
• we know the products xi·fi, which are:
  − equal → unweighted harmonic mean
  − different → harmonic mean as a transformed form of the weighted arithmetic mean

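Calculation of the mode and median for distribution series

Productivity (pieces)   Number of employees
100 - 110               5
110 - 120               10
120 - 130               15
130 - 140               25
140 - 150               20
150 - 160               15
160 - 170               10
Total                   100

The mode:
$Mo = L_i + k \cdot \dfrac{\Delta_1}{\Delta_1 + \Delta_2} = 130 + 10 \cdot \dfrac{25 - 15}{(25 - 15) + (25 - 20)} = 136.67$

The median:
$Me = L_i + k \cdot \dfrac{\frac{\sum f_i}{2} - S_f}{f_{Me}} = 130 + 10 \cdot \dfrac{\frac{100}{2} - 30}{25} = 138$

Here $L_i$ is the lower limit of the modal (respectively median) interval, k is the interval width, $\Delta_1$ and $\Delta_2$ are the differences between the frequency of the modal interval and the frequencies of the previous and next intervals, $S_f$ is the cumulated frequency before the median interval and $f_{Me}$ is the frequency of the median interval.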
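
The two calculations can be sketched in code as follows. This is a minimal illustration that assumes equal-width intervals described by their lower limits; the function names are chosen for this example.

```python
def grouped_mode(lower_limits, width, freqs):
    # index of the modal interval (highest frequency)
    i = max(range(len(freqs)), key=freqs.__getitem__)
    prev_f = freqs[i - 1] if i > 0 else 0
    next_f = freqs[i + 1] if i < len(freqs) - 1 else 0
    d1 = freqs[i] - prev_f  # difference to the previous interval
    d2 = freqs[i] - next_f  # difference to the next interval
    return lower_limits[i] + width * d1 / (d1 + d2)

def grouped_median(lower_limits, width, freqs):
    half = sum(freqs) / 2
    cumulated = 0  # cumulated frequency before the current interval
    for lower, f in zip(lower_limits, freqs):
        if cumulated + f >= half:
            return lower + width * (half - cumulated) / f
        cumulated += f

lower_limits = [100, 110, 120, 130, 140, 150, 160]
employees = [5, 10, 15, 25, 20, 15, 10]

print(round(grouped_mode(lower_limits, 10, employees), 2))
print(grouped_median(lower_limits, 10, employees))
```

With the productivity data above, the functions reproduce the mode of about 136.67 and the median of 138.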
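Types of mean - formulas

• arithmetic
  - unweighted: $\bar{x} = \dfrac{\sum x_i}{n}$
  - weighted: $\bar{x} = \dfrac{\sum x_i f_i}{\sum f_i}$
• harmonic
  - unweighted: $\bar{x}_h = \dfrac{n}{\sum \frac{1}{x_i}}$
  - weighted: $\bar{x}_h = \dfrac{\sum f_i}{\sum \frac{1}{x_i} f_i}$
• geometric
  - unweighted: $\bar{x}_g = \sqrt[n]{\prod x_i}$
  - weighted: $\bar{x}_g = \sqrt[\sum f_i]{\prod x_i^{f_i}}$
• quadratic
  - unweighted: $\bar{x}_q = \sqrt{\dfrac{\sum x_i^2}{n}}$
  - weighted: $\bar{x}_q = \sqrt{\dfrac{\sum x_i^2 f_i}{\sum f_i}}$
• chronologic
  - unweighted: $\bar{x}_c = \dfrac{\sum \bar{x}_k}{k}$
  - weighted: $\bar{x}_c = \dfrac{\sum \bar{x}_k t_k}{\sum t_k}$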
The arithmetic mean – properties
Example:

International transport, a = 1,1
Variable                                              Values                      Mean
Initial variable X: gross weight (tones)              8    9    10   11   12     10
Transformed variable X' = X - a: net weight (tones)   6,9  7,9  8,9  9,9  10,9   8,9

Consumption, k = 2
Variable                                                                      Values                   Mean
Initial variable X: unit consumption before upgrade (kg/piece)                8   9    10  11   12     10
Transformed variable X'' = X/k: unit consumption after upgrade (kg/piece)     4   4,5  5   5,5  6      5

Example of harmonic mean use:
Case: xi·fi are equal

Product   Sale price per unit (xi)   Collected value (V = xi·fi)
A         10                         800
B         20                         800

The average sale price is:
$\bar{x}_h = \dfrac{n}{\sum \frac{1}{x_i}} = \dfrac{2}{\frac{1}{10} + \frac{1}{20}} = 13.33$

Case: xi·fi are different

Product   Sale price per unit (xi)   Collected value (V = xi·fi)
A         10                         400
B         20                         800

The average sale price is:
$\bar{x}_h = \dfrac{\sum x_i f_i}{\sum \frac{1}{x_i} x_i f_i} = \dfrac{400 + 800}{\frac{1}{10} \cdot 400 + \frac{1}{20} \cdot 800} = 15$
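
Both cases follow from the same computation: total collected value divided by total quantity sold, where each quantity is recovered as value / price. A minimal sketch (the function name is an assumption for this example):

```python
def average_price(prices, collected_values):
    # quantity sold of each product = collected value / unit price
    quantities = [v / p for p, v in zip(prices, collected_values)]
    return sum(collected_values) / sum(quantities)

equal_case = average_price([10, 20], [800, 800])
different_case = average_price([10, 20], [400, 800])
print(round(equal_case, 2), different_case)
```

When the collected values are equal they cancel out, which is why the unweighted harmonic mean is enough in the first case.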


Example of calculation of the arithmetic mean

Grades               4   5    6    7   8   9   10
Number of students   9   10   10   2   2   1   1

Unweighted: $\bar{x} = \dfrac{4 + 5 + 6 + 7 + 8 + 9 + 10}{7} = 7$
Weighted: $\bar{x} = \dfrac{4 \cdot 9 + 5 \cdot 10 + 6 \cdot 10 + 7 \cdot 2 + 8 \cdot 2 + 9 \cdot 1 + 10 \cdot 1}{35} = 5,57$


Example of arithmetic mean calculation using the simplified formula (a = 2,75; k = 0,5)

Monthly salary earned (mil. lei)   Number of employees (fi)   xi     xi·fi   (xi - a)/k   (xi - a)/k · fi
0 – 2                              50                         1,75   87,5    -2           -100
2 – 2,5                            150                        2,25   337,5   -1           -150
2,5 – 3                            350                        2,75   962,5   0            0
3 – 3,5                            300                        3,25   975     1            300
3,5 – 4                            100                        3,75   375     2            200
4 – 4,5                            50                         4,25   212,5   3            150
Total                              1000                       -      2950    -            400

Direct formula:
$\bar{x} = \dfrac{\sum x_i f_i}{\sum f_i} = \dfrac{2950}{1000} = 2,95$

Simplified formula:
$\bar{x} = \dfrac{\sum \frac{x_i - a}{k} f_i}{\sum f_i} \cdot k + a = \dfrac{400}{1000} \cdot 0,5 + 2,75 = 2,95$
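
As a check, the direct and the simplified calculation can be compared in code. The sketch below uses the interval midpoints and frequencies from the salary table; the function names are assumptions made for this example.

```python
def weighted_mean(xs, fs):
    return sum(x * f for x, f in zip(xs, fs)) / sum(fs)

def simplified_mean(xs, fs, a, k):
    # transform the values, average them, then undo the transformation
    coded = [(x - a) / k for x in xs]
    return weighted_mean(coded, fs) * k + a

midpoints = [1.75, 2.25, 2.75, 3.25, 3.75, 4.25]
employees = [50, 150, 350, 300, 100, 50]

direct = weighted_mean(midpoints, employees)
simplified = simplified_mean(midpoints, employees, a=2.75, k=0.5)
print(direct, simplified)
```

Any choice of a and k (with k different from zero) gives the same mean; choosing a near the center of the series and k equal to the interval width just keeps the intermediate numbers small.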
Example of time-moments series

Sc EnGross SRL, Storage 1: stock of product type A (pce)
Date      Stock
1.01.01   80
1.02.01   120
1.03.01   100
1.04.01   115
1.05.01   125
1.06.01   150
1.07.01   260
n = 7, k = n - 1 = 6. The time periods t1..t6 between any two consecutive moments are equal.

OMV Gas station 3: stock of diesel (tones)
Date       Stock
1.01.01    100
15.02.01   150
31.03.01   248
20.04.01   305
11.05.01   250
21.06.01   305
14.07.01   300
n = 7, k = n - 1 = 6. At least one of the time periods t1..t6 differs from the rest.
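
For the two stock series of this example, the chronologic mean can be sketched as below. Two assumptions are made here and are not stated in the source: the dates are read as the year 2001, and the monthly series is treated as having equal periods, as the slide does.

```python
from datetime import date

def chronological_mean(values, moments=None):
    # moving averages of consecutive terms: (x_i + x_{i+1}) / 2
    pair_means = [(a + b) / 2 for a, b in zip(values, values[1:])]
    if moments is None:
        # equal periods: unweighted chronologic mean
        return sum(pair_means) / len(pair_means)
    # unequal periods: weight each moving average by the days it covers
    days = [(later - earlier).days for earlier, later in zip(moments, moments[1:])]
    return sum(m * t for m, t in zip(pair_means, days)) / sum(days)

product_stock = [80, 120, 100, 115, 125, 150, 260]  # pce, monthly moments
diesel_stock = [100, 150, 248, 305, 250, 305, 300]  # tones, irregular moments
diesel_dates = [date(2001, 1, 1), date(2001, 2, 15), date(2001, 3, 31),
                date(2001, 4, 20), date(2001, 5, 11), date(2001, 6, 21),
                date(2001, 7, 14)]

print(chronological_mean(product_stock))
print(round(chronological_mean(diesel_stock, diesel_dates), 2))
```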


1. Dynamics indicators
2. Time series adjustment
3. The trend
4. Seasonality
5. Forecasting
Dynamics indicators
Absolute indicators
1. Absolute values of the studied variable ($y_t$)
2. Absolute change (increase or decrease)
- with fixed basis: $\Delta_{t/1} = y_t - y_1$
- with chain basis: $\Delta_{t/t-1} = y_t - y_{t-1}$
Relative indicators
1. Dynamics indices
- with fixed basis: $I^y_{t/1} = \dfrac{y_t}{y_1}$
- with chain basis: $I^y_{t/t-1} = \dfrac{y_t}{y_{t-1}}$
2. Dynamics rhythm
- with fixed basis: $R^y_{t/1} = (I^y_{t/1} - 1) \cdot 100$
- with chain basis: $R^y_{t/t-1} = (I^y_{t/t-1} - 1) \cdot 100$


Average indicators
- Absolute
• Average level of the studied variable ($\bar{y}$), determined by taking into account the type of time series.
If the time series is by intervals, the arithmetic mean must be used.
If the time series is by moments, the weighted or unweighted chronological mean must be used.
• Average absolute change (increase or decrease): $\bar{\Delta} = \dfrac{\sum \Delta_{t/t-1}}{n - 1}$
- Relative
• Average index of dynamics: $\bar{I}^y = \sqrt[n-1]{\prod I^y_{t/t-1}}$
• Average rhythm of dynamics: $\bar{R}^y = (\bar{I}^y - 1) \cdot 100$
• Average absolute value of a percent: $\bar{va}(1\%) = \dfrac{\bar{\Delta}}{\bar{R}}$


Dynamics Indicators (absolute and relative) – example

            Production value Y   Absolute change       Index              Rhythm
Month (t)   (mil. lei)           Δt/1      Δt/t-1      It/1    It/t-1     Rt/1      Rt/t-1
1           430                  *         *           *       *          *         *
2           380                  -50,00    -50,00      0,88    0,88       -11,63    -11,63
3           400                  -30,00    20,00       0,93    1,05       -6,98     5,26
4           410                  -20,00    10,00       0,95    1,03       -4,65     2,50
5           360                  -70,00    -50,00      0,84    0,88       -16,28    -12,20
6           340                  -90,00    -20,00      0,79    0,94       -20,93    -5,56
7           380                  -50,00    40,00       0,88    1,12       -11,63    11,76
Sum/Product 2700                 *         -50,00      *       0,88       *         *
Dynamics Indicators (average) – example

Indicators                            ȳ        Δ̄       Ī        R̄
Average absolute                      385,71   -8,33   *        *
Average relative                      *        *       0,9796   -2,04
Average absolute value of a percent   *        4,083   *        *
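
The average indicators in this example can be recomputed from the production series. The sketch below is only an illustration of the formulas; the variable names are assumptions, and small differences from the table come from rounding (the table divides the rounded -8,33 by the rounded -2,04).

```python
import math

production = [430, 380, 400, 410, 360, 340, 380]  # mil. lei, months 1..7
n = len(production)

# chain-basis absolute changes and indices
chain_changes = [b - a for a, b in zip(production, production[1:])]
chain_indices = [b / a for a, b in zip(production, production[1:])]

mean_level = sum(production) / n                           # interval series
mean_change = sum(chain_changes) / (n - 1)
mean_index = math.prod(chain_indices) ** (1 / (n - 1))     # (n-1)-th root of the product
mean_rhythm = (mean_index - 1) * 100
value_of_one_percent = mean_change / mean_rhythm

print(round(mean_level, 2), round(mean_change, 2))
print(round(mean_index, 4), round(mean_rhythm, 2))
print(round(value_of_one_percent, 3))
```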
Time series adjustment
Time series adjustment = determination of the model that best approximates the tendency of the time series.

Utility:
Once we determine the model of the time series tendency:
- we can use it to make forecasts
- we can use it to determine missing values of the time series (interpolation)

Types of methods used in time series adjustment:

1. Graphic methods
2. Mechanical methods
3. Analytic methods


1. Graphic method – it is based on the visual identification of the adequate tendency model by testing several types of known models, using as support the chronogram or the historiogram of the analyzed time series.

[Figure: chart of the initial data with linear, exponential and geometric trend lines]

2. Mechanical methods – are based on mathematical relations established between the terms of the time series, which allow the total or partial reduction of the random fluctuations present in the empirical data of the analyzed time series and the identification of the tendency model.

The following methods are used most often:

- staggered average method
- moving average method
- average absolute change method
- average index method


Staggered average method – is based on calculating averages of 2, 3 or more successive terms of the time series, without repeating any of them, and then using these new terms instead of the initial data to determine the tendency.

Month (t)   Y     Staggered average
1           430   405 (months 1-2)
2           380
3           400   405 (months 3-4)
4           410
5           360   350 (months 5-6)
6           340
7           380   -

[Figure: chart of Y and of the staggered averages for months 1 to 7]

This method does not completely remove the random fluctuations.



Moving average method – is based on calculating averages of 2, 3 or more successive terms of the time series, where one or more terms are reused in several successive averages, and then using these new terms instead of the initial data to determine the tendency.

Month (t)   Y     Moving average (with the next term)
1           430   405,00
2           380   390,00
3           400   405,00
4           410   385,00
5           360   350,00
6           340   360,00
7           380   -

[Figure: chart of Y and of the moving averages for months 1 to 7]

This method does not completely remove the random fluctuations.
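
The difference between staggered and moving averages is only the step between windows, as the following sketch shows. It assumes windows of 2 terms, as in the two tables; the function names are chosen for this example.

```python
def staggered_averages(y, size=2):
    # non-overlapping windows: each term is used at most once
    return [sum(y[i:i + size]) / size
            for i in range(0, len(y) - size + 1, size)]

def moving_averages(y, size=2):
    # overlapping windows: the window advances one term at a time
    return [sum(y[i:i + size]) / size
            for i in range(len(y) - size + 1)]

production = [430, 380, 400, 410, 360, 340, 380]
print(staggered_averages(production))
print(moving_averages(production))
```

With 7 terms and a window of 2, the staggered method leaves the last term unused, while the moving method produces 6 averages.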

Average absolute change method – is based on a recurrence relation established between any term of the time series, the average absolute change and the first term of the series. The relation is used to calculate a new value for each term of the time series; these new values are then used instead of the initial data to determine the tendency.

$y_t = y_1 + \bar{\Delta} \cdot (t - 1)$

Month (t)   Y          Adjusted value
1           430 = y1   430
2           380        421,67
3           400        413,33
4           410        405,00
5           360        396,67
6           340        388,33
7           380        380,00

This method completely removes the random fluctuations. It can be used with good results in the case of a linear tendency.

Average index method – is based on a recurrence relation established between any term of the time series, the average index and the first term of the series. The relation is used to calculate a new value for each term of the time series; these new values are then used instead of the initial data to determine the tendency.

$y_t = y_1 \cdot \bar{I}^{\,t-1}$

Month (t)   Y          Adjusted value
1           430 = y1   430,00
2           380        421,23
3           400        412,64
4           410        404,23
5           360        395,98
6           340        387,91
7           380        380,00

This method completely removes the random fluctuations. It can be used with good results in the case of a geometric tendency.
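
Both recurrence relations can be sketched in a few lines. The sketch relies on the fact that the chain changes sum to the difference between the last and first term, and the chain indices multiply to their ratio, so only the first and last terms determine the adjusted series; the function names are assumptions for this example.

```python
def adjust_by_average_change(y):
    n = len(y)
    mean_change = (y[-1] - y[0]) / (n - 1)      # average absolute change
    return [y[0] + mean_change * t for t in range(n)]

def adjust_by_average_index(y):
    n = len(y)
    mean_index = (y[-1] / y[0]) ** (1 / (n - 1))  # average index of dynamics
    return [y[0] * mean_index ** t for t in range(n)]

production = [430, 380, 400, 410, 360, 340, 380]
linear = adjust_by_average_change(production)
geometric = adjust_by_average_index(production)
print([round(v, 2) for v in linear])
print([round(v, 2) for v in geometric])
```

In both cases the adjusted series passes exactly through the first and the last observed terms.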

Time series adjustment – Analytic methods
Types of models
- Adjustment models
- Self-projective models
- Explanatory models

A. Adjustment models
- Additive models: $Y_t = T_t + S_t + C_t + u_t$
- Multiplicative models
where:
$T_t$ - the trend (the general tendency of the studied phenomenon)
$C_t$ - the cycle (the wide variation of the studied phenomenon over several years)
$S_t$ - the seasonality (the seasonal variation, within the year, of the studied phenomenon)
$u_t$ - the disturbance

B. Self-projective models
$Y_t = f(Y_{t-1}, Y_{t-2}, ..., u_t)$
C. Explanatory models
Time series adjustment – Trend adjustment methods
The analytic methods of time series adjustment are based on the idea of finding the mathematical function that best approximates the tendency of the real data.

The functions usually used as models are:

- the linear function
- the second-order parabolic function
- the hyperbolic function
- the exponential function
- others.

A criterion is needed to select, from the set of tested functions, the function that best approximates the evolution of the real data.
The most widely used criterion is the minimization of the differences between the real data and the data calculated through the tested functions (the least squares criterion).


Testing how well the mathematical functions approximate the evolution of the real data involves several stages:
1. For each function, the following steps are taken:
a. The general model of the function to be tested is established;
b. The function is particularized by determining the values of its parameters so that the maximum precision of approximation is reached. Where the function allows it, the calculation can be done by the simplified method;
c. For each real term, a corresponding value is calculated through the function;
d. The sum of the errors generated by using the function to approximate the evolution of the real data is determined, as the sum of the squared differences between the real values and the values calculated through the function;
2. The sums of errors generated by each function in approximating the tendency of the real data are compared, and the function with the smallest errors is chosen.


Notation used:
y - the real values of the variable studied through the time series
Yt - the values associated with the studied variable, calculated through the tested function
t - the ranks of the time periods included in the series
a, b, c, ... - the parameters of the tested function

The linear function

a. $Y_t = a + b \cdot t$
b. The parameters of the function (a and b) can be determined by applying the least squares criterion to the function:
$\sum (y - Y_t)^2 = \min \iff \sum (y - a - b \cdot t)^2 = \min$
The minimum of the sum above can be reached if the first-order partial derivatives with respect to the parameters a and b are zero.


Differentiating the least squares expression with respect to the parameter a we obtain:
$\sum y - n \cdot a - b \sum t = 0 \Rightarrow \sum y = n \cdot a + b \sum t$
then with respect to the parameter b:
$\sum y \cdot t - a \sum t - b \sum t^2 = 0 \Rightarrow \sum y \cdot t = a \sum t + b \sum t^2$

The two equations above can be combined into a system, which can be solved by any known method:
$\begin{cases} \sum y = n \cdot a + b \sum t \\ \sum y \cdot t = a \sum t + b \sum t^2 \end{cases}$


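A quick way of solving the system is to use determinants:
$\begin{cases} \sum y = n \cdot a + b \sum t \\ \sum y \cdot t = a \sum t + b \sum t^2 \end{cases}$

The following determinants are built:
$\Delta = \begin{vmatrix} n & \sum t \\ \sum t & \sum t^2 \end{vmatrix}$, $\Delta_a = \begin{vmatrix} \sum y & \sum t \\ \sum y \cdot t & \sum t^2 \end{vmatrix}$, $\Delta_b = \begin{vmatrix} n & \sum y \\ \sum t & \sum y \cdot t \end{vmatrix}$

$a = \dfrac{\Delta_a}{\Delta}$, $b = \dfrac{\Delta_b}{\Delta}$
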

Another quick way of solving the system is the simplified calculation. It requires giving t (the time variable) particular values such that Σt = 0.
Thus, if:
- the series has an odd number of terms, t is given the following values:
  - 0 for the term in the center of the series
  - -1, -2, -3, ... for the terms above the central term of the series. The values are assigned starting from the central term towards the first term of the series.
  - 1, 2, 3, ... for the terms below the central term of the series. The values are assigned starting from the central term towards the last term of the series.
- the series has an even number of terms, t is given the following values:
  - -1, 1 for the two terms in the center of the series
  - -3, -5, -7, ... for the terms above the central terms
  - 3, 5, 7, ... for the terms below the central terms

In this case the system for determining the parameters a and b becomes:

$\begin{cases} \sum y = n \cdot a + b \sum t \\ \sum y \cdot t = a \sum t + b \sum t^2 \end{cases} \Rightarrow \begin{cases} \sum y = n \cdot a \\ \sum y \cdot t = b \sum t^2 \end{cases} \Rightarrow a = \dfrac{\sum y}{n}, \quad b = \dfrac{\sum y \cdot t}{\sum t^2}$
Calculation example

Year    Value of industrial production y (bn. lei)   t    t²    y·t     Yt       (y-Yt)²
1993    410                                          1    1     410     408,64   1,84
1994    430                                          2    4     860     428,57   2,04
1995    450                                          3    9     1350    448,50   2,25
1996    460                                          4    16    1840    468,43   71,04
1997    490                                          5    25    2450    488,36   2,70
1998    509                                          6    36    3054    508,29   0,51
1999    530                                          7    49    3710    528,21   3,19
Total   3279                                         28   140   13674   3279     83,571

$\Delta = \begin{vmatrix} 7 & 28 \\ 28 & 140 \end{vmatrix} = 196$, $\Delta_a = \begin{vmatrix} 3279 & 28 \\ 13674 & 140 \end{vmatrix} = 76188$, $\Delta_b = \begin{vmatrix} 7 & 3279 \\ 28 & 13674 \end{vmatrix} = 3906$

$a = \dfrac{76188}{196} = 388.7143$, $b = \dfrac{3906}{196} = 19.92857$

$Y_t = 388.7143 + 19.92857 \cdot t$
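
The determinant solution of this example can be reproduced with a short script. This is a minimal sketch: the function name is an assumption, and t takes the ranks 1..n as in the table.

```python
def fit_linear_trend(y):
    n = len(y)
    t = list(range(1, n + 1))
    sum_t = sum(t)
    sum_t2 = sum(ti * ti for ti in t)
    sum_y = sum(y)
    sum_yt = sum(yi * ti for yi, ti in zip(y, t))
    det = n * sum_t2 - sum_t * sum_t           # main determinant
    det_a = sum_y * sum_t2 - sum_t * sum_yt    # determinant for a
    det_b = n * sum_yt - sum_t * sum_y         # determinant for b
    return det_a / det, det_b / det

production = [410, 430, 450, 460, 490, 509, 530]  # 1993..1999
a, b = fit_linear_trend(production)
fitted = [a + b * t for t in range(1, len(production) + 1)]
squared_errors = sum((y - f) ** 2 for y, f in zip(production, fitted))
print(round(a, 4), round(b, 5), round(squared_errors, 3))
```

The sum of squared errors is the quantity that would be compared across candidate functions when choosing the best model.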
[Figure: value of industrial production, real values y and adjusted values Yt, 1993-1999]
The second-order parabolic function: $Y_t = a + b \cdot t + c \cdot t^2$
The function has three parameters: a, b and c. Their values can be determined in the same way as for the linear function, by applying the least squares criterion. The following system is obtained:

$\begin{cases} \sum y = n \cdot a + b \sum t + c \sum t^2 \\ \sum y \cdot t = a \sum t + b \sum t^2 + c \sum t^3 \\ \sum y \cdot t^2 = a \sum t^2 + b \sum t^3 + c \sum t^4 \end{cases}$

The system can be solved with determinants or by the simplified calculation method.
The hyperbolic function: $Y_t = a + \dfrac{b}{t}$
In this case the system is:

$\begin{cases} \sum y = n \cdot a + b \sum \dfrac{1}{t} \\ \sum \dfrac{y}{t} = a \sum \dfrac{1}{t} + b \sum \dfrac{1}{t^2} \end{cases}$

This system can be solved by the simplified calculation method in only one situation, but other methods can be used instead: determinants, substitution, reduction, etc.
The exponential function: $Y_t = a \cdot b^t$

To determine the parameters a and b, the expression above is linearized by taking logarithms (usually base 10 logarithms):

$\log Y_t = \log(a \cdot b^t) = \log a + \log b^t = \log a + t \cdot \log b$

then the following substitutions are made:
$\log Y_t = u$
$\log a = v$
$\log b = z$
in which case the expression of the function becomes:
$u = v + z \cdot t$
An expression similar to that of the linear function is obtained. It is solved in the same way as the linear function, determining the parameters v and z.

The initial parameters of the exponential function are then determined by taking the antilogarithm of v and z:

$a = 10^v$, $b = 10^z$
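
The linearization can be sketched as follows: fit a line to the base 10 logarithms of the data, then take antilogarithms of the two parameters. The function name and the test series (which follows 2·3^t exactly) are assumptions made only to illustrate the procedure.

```python
import math

def fit_exponential_trend(y):
    n = len(y)
    t = list(range(1, n + 1))
    u = [math.log10(v) for v in y]          # u = log Yt (requires y > 0)
    sum_t = sum(t)
    sum_t2 = sum(ti * ti for ti in t)
    sum_u = sum(u)
    sum_ut = sum(ui * ti for ui, ti in zip(u, t))
    det = n * sum_t2 - sum_t * sum_t
    v = (sum_u * sum_t2 - sum_t * sum_ut) / det   # v = log a
    z = (n * sum_ut - sum_t * sum_u) / det        # z = log b
    return 10 ** v, 10 ** z

series = [2 * 3 ** t for t in range(1, 6)]  # illustrative data: 6, 18, 54, 162, 486
a, b = fit_exponential_trend(series)
print(round(a, 6), round(b, 6))
```

Note that the least squares criterion is applied here to the logarithms, so for real data the fit minimizes the errors of log Yt rather than of Yt itself.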
Time series adjustment – Determining the seasonal component
Determining the cycle component $C_t$ requires very long time series.
Usually such series are not available. In this situation the determination of the cycle component can be dropped.
Determining the seasonal component is, however, possible. The procedure is as follows:

A. The additive model
$Y_t = T_t + S_t + u_t$
1. After the trend function $T_t$ has been determined with the methods presented so far, the seasonal component can be isolated:
$y_t - T_t = S_t + u_t$
- The seasonal component $S_t$ of a time series can be presented as a function of the form:

$S_t = c_1 \cdot S_1 + c_2 \cdot S_2 + c_3 \cdot S_3 + ... + c_j \cdot S_j + ... + c_m \cdot S_m$
where: $c_j$ - coefficients that measure the changes at the level of each season j, j = 1..m
$S_j$ - the indicator variable of the season
Within any year, $S_j$ satisfies the condition: $\sum S_j = 1$

Within any year of the series, $c_j$ satisfies the condition: $\sum c_j = 0$

It is assumed that the seasonal component manifests itself identically in each year.
The influence at the level of each season can then be extracted.
2. The coefficients $c_j$ are calculated by applying the mean at the level of each season:

Season \ Year   1   2   ...   i   ...   n   Seasonal means
1                                           c1
2                                           c2
...                                         ...
j                                           cj
...                                         ...
m                                           cm
Total                                       0
Time series adjustment – Determining the random component
3. The errors of the model, $u_t$, are calculated by subtracting from the initial values of the series the values calculated by applying the trend and the seasonality together:

$y_t - T_t - S_t = u_t$

$\sum u_t = 0$
Time series adjustment – Forecasting
4. The average error of the estimate is calculated:

$E = \sqrt{\dfrac{\sum_{t=1}^{n} u_t^2}{n - 2}}$
5. The desired precision of the estimate is fixed by determining the coefficient $t_{\alpha, n-2}$ from the table of the Student distribution.

6. The value for the targeted future period is calculated using the function:

$Y_t = T_t + S_t + u_t$
giving t values that continue those of the series.

7. The forecast horizon is calculated by determining the interval:

$\hat{Y}_t - E \cdot t_{\alpha, n-2} \le Y_t \le \hat{Y}_t + E \cdot t_{\alpha, n-2}$


1. Introduction
2. Sampling methods
3. Statistic sampling indicators
4. Statistic sampling types
- Simple random sampling
- Stratified sampling
- Cluster sampling
Introduction

Statistical research types


When studying any phenomenon or process we can use one of the following types of statistical research:
1. Total statistic research – all the elements (statistic units) of the population are studied;
2. Partial statistic research – only a part of the elements of the population are studied.

Statistic sampling belongs to the category of partial statistic research.

When do we use partial statistic research?

Usually when we study a phenomenon or process in a population for which:
- we don't have data for all its elements
- we can't study all the elements because studying them damages them totally or partially
- we want to obtain as much information as possible, quickly and at the lowest cost


Introduction
In which fields is statistic sampling used?
Statistic sampling is used very often in the following fields of activity:
- marketing: market research on consumer behavior, the demand and supply of goods, etc.
- industry: production quality, quality of raw materials, statistic setting of equipment
- sociology: the study of the behavior of individuals
- medicine: efficacy of treatments, determining the optimal dosage of drugs
- agriculture: estimating the quantity and quality of production before harvest
- other: characterizing the standard of living, the quality of TV or radio programs, opinion surveys, etc.

The advantages of statistic sampling

The main advantages of statistic sampling are:

• It allows smaller errors in the process of collecting data, because the amount of data is small and because specialists can be used (this is not always possible in total research);
• It gives results more quickly than total statistic research;
• It costs less than total statistic research;
• It can be used in some cases where total statistic research cannot be used.


Introduction

The main concepts


- population – a set of items representing the object of study, well delimited in space and time, with a specific volume and structure;
- statistic unit – the fundamental item of the population, which can be characterized through a series of specific features that we want to study;
- sample – the set of statistic units extracted from the population, which will be studied.
Notation used:
X - the quantitative variable that we want to study in the sample (it can take the values x1, x2, ..., xi, ..., xn in the sample);
N - the volume of the population;
n - the volume of the sample.

Indicators
- Average of the X variable
  for the sample: $\bar{x}_s = \dfrac{\sum x_i}{n}$
  for the population: $\bar{x}_0 = \dfrac{\sum x_i}{N}$
- Variance of the X variable
  for the sample: $\sigma_s^2 = \dfrac{\sum (x_i - \bar{x}_s)^2}{n - 1}$
  for the population: $\sigma_0^2 = \dfrac{\sum (x_i - \bar{x}_0)^2}{N}$

Sampling methods

To obtain greater precision of the statistic sampling results, the sample must satisfy the condition of representativeness, meaning:

The sample must reproduce as closely as possible the structure of the population from which it was extracted.

To extract a representative sample we must respect the following conditions:

a) the population must be as homogenous as possible;
b) the extraction of the items from the population must be absolutely random. In this way each item has an equal chance of being extracted from the population.

To extract the sample we can use one of the following methods:
I. Random sampling
a) Pure random sampling
b) Systematic sampling
II. Nonrandom sampling


Sampling methods
a) Pure random sampling:
There are two variants of this method:
a1. with repetition. The sample is formed by extracting the items from the population one by one. After each extraction the item is recorded in the sample and then reintroduced into the population. In this case, the volume of the population is constant during the extraction of the sample, and the probability of extracting any item from the population is also constant.
a2. without repetition. It is the same method as above with only one difference: after each extraction the item is not reintroduced into the population.

a1. Random sampling with repetition

This method has the following characteristics:

- The volume of the population is not modified during the extraction of the sample, so the probability of extracting any item is constant: p = 1/N;
- One item can be extracted several times. This causes bigger errors compared with other sampling methods;
- At the end of the extraction the population contains N-1 items.
This method can be used with good results when the population is homogenous.


Sampling methods
a2. Random sampling without repetition
This method has the following characteristics:
- the volume of the population becomes smaller and smaller during the extraction of the sample;
- the probability of extracting an item from the population rises during the extraction:

$p_1 = \dfrac{1}{N};\ p_2 = \dfrac{1}{N-1};\ p_3 = \dfrac{1}{N-2};\ ...;\ p_i = \dfrac{1}{N-i+1};\ ...$
- after the last extraction, N-n items remain in the population.

At the last extraction the value of the probability is: $p_n = \dfrac{1}{N-n+1}$

Because an item cannot be extracted several times into the sample, this method produces smaller errors than the previous method.
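
The two variants can be sketched with Python's standard `random` module. The population of 10 numbered items, the sample size and the seed are illustrative assumptions.

```python
import random

def draw_with_repetition(population, n, rng):
    # each item is returned to the population, so it may be drawn again
    return [rng.choice(population) for _ in range(n)]

def draw_without_repetition(population, n, rng):
    # each item can enter the sample at most once
    return rng.sample(population, n)

def extraction_probabilities(N, n):
    # without repetition: p_i = 1 / (N - i + 1) for the i-th extraction
    return [1 / (N - i + 1) for i in range(1, n + 1)]

rng = random.Random(0)              # fixed seed so the run is repeatable
population = list(range(1, 11))     # items numbered 1..10
with_rep = draw_with_repetition(population, 3, rng)
without_rep = draw_without_repetition(population, 3, rng)
print(with_rep, without_rep)
print(extraction_probabilities(10, 3))
```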


Sampling methods
b) Systematic sampling:
It is used when the population is already organized by some criterion (example: the students of a faculty ordered by their identification number, the fruit trees of an orchard, etc.).
To use this method, we first need to calculate a numbering step (k):

$k = \dfrac{N}{n}$, where n is the sample volume and N is the population volume.

Second, we place into an urn tickets (cards or chips) numbered from 1 to k and then extract only one.
The number on the ticket shows the first item that will be extracted into the sample. The remaining items are determined by adding the numbering step to the number of the last item extracted.

Example:
Supposing that the ticket with number 4 was extracted from the urn, the sample will be formed from the following items: 4, k+4, 2k+4, 3k+4, ..., (n-1)k+4


Sampling methods
II. Nonrandom sampling:
This method can be used when the studied population has a small number of items. In this case, using the random methods to extract the sample would produce a bigger error than having a sampling specialist extract the sample subjectively.
Under these conditions the specialist can, based on experience, extract a representative sample of the population.


Statistic sampling indicators

In statistic sampling we can encounter errors related to the process of collecting and processing the data, as well as errors specific to each type of sampling method used.

These errors can be grouped in two categories:

- systematic errors - caused by violating the rules that must be respected when using sampling;

- random errors - which appear no matter how rigorously we organize the sampling and process the collected data. These errors arise because we can never extract a perfectly representative sample of the studied population.

It is important to note that the random errors can be estimated.


If, in order to study a variable X, we extracted all possible random samples using random sampling with repetition, calculated the average $\bar{x}_s$ of each sample, and then calculated the variance of these averages around the average calculated at the level of the population ($\bar{x}_0$), we would obtain the variance of the sample averages around the population average:

$\sigma^2_{\bar{x},\,\text{with rep.}} = \dfrac{\sum (\bar{x}_s - \bar{x}_0)^2 f_i}{\sum f_i}$

Its square root, $\sigma_{\bar{x},\,\text{with rep.}}$, is the average error of representativeness.

Between the population variance ($\sigma_0^2$) and the variance of the sample averages around the population average ($\sigma^2_{\bar{x},\,\text{with rep.}}$) the following relation exists:

$\sigma_0^2 = n \cdot \sigma^2_{\bar{x},\,\text{with rep.}}$

From this relation we can extract the average error of representativeness with repetition:

$\sigma_{\bar{x},\,\text{with rep.}} = \sqrt{\dfrac{\sigma_0^2}{n}}$

When we don't know the value of the population variance ($\sigma_0^2$) and n > 100, we can use the sample variance ($\sigma_s^2$) instead, with good results:

$\sigma_{\bar{x},\,\text{with rep.}} = \sqrt{\dfrac{\sigma_0^2}{n}} \approx \sqrt{\dfrac{\sigma_s^2}{n}} \approx \sqrt{\dfrac{\sigma_s^2}{n - 1}}$


Between the variance of the sample averages around the population average obtained with repetition ($\sigma^2_{\bar{x},\,\text{with rep.}}$) and the one obtained without repetition ($\sigma^2_{\bar{x},\,\text{without rep.}}$) the following relation exists:

$\dfrac{\sigma^2_{\bar{x},\,\text{with rep.}}}{\sigma^2_{\bar{x},\,\text{without rep.}}} = \dfrac{N - 1}{N - n}$

N-1 - the number of items in the population at the end of the extraction of the sample using random sampling with repetition;
N-n - the number of items in the population at the end of the extraction of the sample using random sampling without repetition.
Thus:

$\sigma_{\bar{x},\,\text{without rep.}} = \sqrt{\sigma^2_{\bar{x},\,\text{with rep.}} \cdot \dfrac{N - n}{N - 1}} = \sqrt{\dfrac{\sigma_0^2}{n} \cdot \dfrac{N - n}{N - 1}}$

If the volume of the population is big, we can approximate (N-1) with N, and from the previous relation we obtain:

$\sigma_{\bar{x},\,\text{without rep.}} = \sqrt{\dfrac{\sigma_0^2}{n} \left(1 - \dfrac{n}{N}\right)}$
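
The two average errors of representativeness can be compared numerically. In the sketch below the variance, sample volume and population volume are hypothetical values chosen only for illustration, and the function names are assumptions.

```python
import math

def average_error_with_repetition(variance, n):
    return math.sqrt(variance / n)

def average_error_without_repetition(variance, n, N, approximate=True):
    # approximate=True uses (1 - n/N); otherwise the exact factor (N - n)/(N - 1)
    factor = 1 - n / N if approximate else (N - n) / (N - 1)
    return math.sqrt(variance / n * factor)

variance, n, N = 25.0, 100, 1000   # hypothetical values
with_rep = average_error_with_repetition(variance, n)
without_rep = average_error_without_repetition(variance, n, N)
exact = average_error_without_repetition(variance, n, N, approximate=False)
print(with_rep, round(without_rep, 4), round(exact, 4))
```

For a large population the approximate and the exact correction factors are nearly identical, and both give a smaller error than sampling with repetition.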

The exact value of the average error of representativeness can be determined only if we extract all possible samples and calculate the errors generated by using each sample average instead of the population average. In practice we never extract all the possible samples; that is why we use an estimation indicator:
Maximum admissible error:
$\Delta_{\bar{x}} = z \cdot \sigma_{\bar{x}}$, where:
z - the argument of the cumulative normal distribution function: $\Phi(z) = \dfrac{1}{\sqrt{2\pi}} \int_{-z}^{z} e^{-\frac{t^2}{2}}\,dt$
$\Phi(z)$ - the probability used to guarantee the results of sampling

Using the maximum admissible error we can determine a confidence interval for the population average ($\bar{x}_0$):
$\bar{x}_0 \in [\bar{x}_s - \Delta_{\bar{x}};\ \bar{x}_s + \Delta_{\bar{x}}]$
One of the problems frequently encountered in statistic sampling is to determine the sample volume that ensures a previously established maximum admissible error is respected. In this case the sample volume can be determined starting from the relation of the maximum admissible error, which has a specific form for each sampling method.
Example: In the case of random sampling with repetition it has this form:

$\Delta_{\bar{x},\,\text{with rep.}} = z \cdot \sigma_{\bar{x},\,\text{with rep.}} = z \sqrt{\dfrac{\sigma_0^2}{n}}$, from where $n = \dfrac{z^2 \sigma_0^2}{\Delta^2_{\bar{x},\,\text{with rep.}}}$
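
A short sketch of these calculations, using `statistics.NormalDist` from the Python standard library to obtain z from the guarantee probability. The function names and the numeric inputs are assumptions chosen for illustration; the sample volume is rounded up to the next whole item.

```python
import math
from statistics import NormalDist

def z_for_probability(p):
    # p is the probability that the standard normal variable lies in [-z, z]
    return NormalDist().inv_cdf((1 + p) / 2)

def confidence_interval(sample_mean, average_error, z):
    delta = z * average_error          # maximum admissible error
    return sample_mean - delta, sample_mean + delta

def sample_volume_with_repetition(variance, max_error, z):
    # n = z^2 * sigma0^2 / delta^2
    return math.ceil(z ** 2 * variance / max_error ** 2)

z = z_for_probability(0.9545)
print(round(z, 2))
print(confidence_interval(50.0, 0.5, 2.0))
print(sample_volume_with_repetition(25.0, 1.0, 2.0))
```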

Types of statistic sampling

The type of statistic sampling is determined by the following factors:

a) How the population is organized at the time of sample extraction:
- unorganized population;
- population organized by groups.

b) The type of sampling method. The most usual is random sampling:
- with repetition;
- without repetition.

c) The number of items extracted simultaneously from the population:
- item by item;
- cluster by cluster.

Combining these factors gives the following most important types of statistic sampling:
1) simple random sampling: - with repetition
                           - without repetition
2) stratified sampling: - with repetition
                        - without repetition
3) cluster sampling, usually organized without repetition, because it usually operates with a small number of clusters.


Simple random sampling

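Sampling statistic indicators

1. Average error of representativeness
   - with repetition: $\sigma_{\bar{x}} = \sqrt{\dfrac{\sigma_0^2}{n}} \approx \sqrt{\dfrac{\sigma_s^2}{n - 1}}$
   - without repetition: $\sigma_{\bar{x}} = \sqrt{\dfrac{\sigma_0^2}{n} \left(1 - \dfrac{n}{N}\right)}$
2. Maximum admissible error
   - with repetition: $\Delta_{\bar{x}} = z \cdot \sigma_{\bar{x}}$
   - without repetition: $\Delta_{\bar{x}} = z \cdot \sigma_{\bar{x}}$
3. Sample volume
   - with repetition: $n = \dfrac{z^2 \sigma_0^2}{\Delta_{\bar{x}}^2}$
   - without repetition: $n = \dfrac{z^2 \sigma_0^2}{\Delta_{\bar{x}}^2 + \dfrac{z^2 \sigma_0^2}{N}}$

If the studied variable is binomial, then the following relations must be used for the variance ($\sigma^2$) and the standard deviation ($\sigma$):

$\sigma^2 = p \cdot q = p(1 - p)$, $\sigma = \sqrt{p \cdot q} = \sqrt{p(1 - p)}$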

Stratified sampling

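It is used when the population is not homogenous. In these situations the population is organized into homogenous groups.
To respect the representativeness condition, the sample must be formed by extracting from each group a number of items proportional to the volume of that group.

Sampling statistic indicators

1. Average error of representativeness
   - with repetition: $\sigma_{\bar{x}} = \sqrt{\dfrac{\sigma_0^2}{n}} \approx \sqrt{\dfrac{\sigma_s^2}{n - 1}}$
   - without repetition: $\sigma_{\bar{x}} = \sqrt{\dfrac{\sigma_0^2}{n} \left(1 - \dfrac{n}{N}\right)}$
2. Maximum admissible error
   - with repetition: $\Delta_{\bar{x}} = z \cdot \sigma_{\bar{x}}$
   - without repetition: $\Delta_{\bar{x}} = z \cdot \sigma_{\bar{x}}$
3. Sample volume
   - with repetition: $n = \dfrac{z^2 \sigma_0^2}{\Delta_{\bar{x}}^2}$
   - without repetition: $n = \dfrac{z^2 \sigma_0^2}{\Delta_{\bar{x}}^2 + \dfrac{z^2 \sigma_0^2}{N}}$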

Cluster sampling

It is used when the population is formed of complex items (clusters) and not of individual items (example: juice bottles packed in boxes). In this case the sample is formed by extracting cluster (set of items) by cluster and not item by item.

Sampling statistic indicators (without repetition)

1. Average error of representativeness: $\sigma_{\bar{x}} = \sqrt{\dfrac{\sigma^2}{r} \cdot \dfrac{R - r}{R - 1}}$
2. Maximum admissible error: $\Delta_{\bar{x}} = z \cdot \sigma_{\bar{x}}$
3. Sample volume: $r = \dfrac{R z^2 \sigma^2}{(R - 1) \Delta_{\bar{x}}^2 + z^2 \sigma^2}$

Here r is the number of clusters in the sample and R is the number of clusters in the population.

[Figure: random sampling with repetition, step-by-step extraction of elements 1, 2, 3 from a population of 10 numbered items into the sample]

[Figure: random sampling without repetition, step-by-step extraction of elements 1, 2, 3 from a population of 10 numbered items into the sample]


Systematic sampling
From a population of 1500 students enrolled in a faculty matriculation register we extract a sample of 10%: N = 1500, n = 150.

$k = \dfrac{1500}{150} = 10$

We put into an urn tickets numbered from 1 to 10 and extract only one. Supposing that the ticket with the number 8 was extracted, we determine the rest of the students in the sample using the following relation:

(i - 1)k + 8
(1 - 1)·10 + 8 = 8, for i = 1
(2 - 1)·10 + 8 = 18, for i = 2
(3 - 1)·10 + 8 = 28, for i = 3
.............................................
(n - 1)k + 8 = 1498, for i = n
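
The selection rule of this example can be sketched in a few lines. The function name is an assumption, and the sketch assumes that N is a multiple of n so that the numbering step k is a whole number.

```python
def systematic_sample(N, n, start):
    # numbering step k = N / n; the item numbers are (i - 1) * k + start
    k = N // n
    if not 1 <= start <= k:
        raise ValueError("start must be between 1 and k")
    return [(i - 1) * k + start for i in range(1, n + 1)]

students = systematic_sample(N=1500, n=150, start=8)
print(students[:3], students[-1], len(students))
```

In practice `start` would be drawn at random between 1 and k, which plays the role of the ticket extracted from the urn.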

1. Elementary methods
2. Analytic methods
3. Linear correlation
4. Non linear correlation
5. Nonparametric correlation
Introduction

The synthetic expression of the intensity of the causal link between phenomena is called correlation.

The phenomena between which a causal determination exists can be found in one of the following situations:
- cause - when it determines the appearance or modification of other phenomena;
- effect - when it is the result of the effects generated by other phenomena.

The variables that describe these two categories of phenomena can be:
- Cause variable (independent, factorial) – when it characterizes a cause phenomenon
- Effect variable (dependent, resultative) – when it characterizes an effect phenomenon.


Types of correlations

By the number of variables included in the correlation:

- Simple correlation - when the correlation couple includes one cause variable and one effect variable;
- Multiple correlation - when the correlation couple includes more than one cause variable and one effect variable.

By the sense of determination between the variables of the correlative couple:

- Direct correlation - when the values of the effect variable and of the cause variables increase or decrease at the same time;
- Inverse correlation - when the values of the effect variable increase while those of the cause variables decrease, or vice versa.

By the shape of the link between the variables of the correlative couple:

- Linear correlation - when the effect variable follows a linear shape under the influence of the cause variables;
- Nonlinear correlation - when the effect variable follows a nonlinear shape under the influence of the cause variables.


Elementary methods
1. Correlation table method

X: Age (years) \ Y: Seniority (years)   1-5   5-10   10-15   15-20   20-25   25-30   Total
18-25 1 1
25-32 2 2
32-39 3 2 5
39-46 3 1 4
46-53 2 4 6
53-60 2 2 4
Total 1 2 3 6 5 5 22

a. The existence of a correlation between the cause and effect variables is shown by the frequencies being grouped into a strip with a specific shape.
b. The sense of the correlation is given by the diagonal along which the strip is placed.
c. The strength of the correlation is given by the width of the strip.
d. The shape of the correlation is given by the shape of the strip.

2. Graphic methods

[Figure: correlogram; vertical axis Y.]

a. The existence of a correlation between the cause and effect variables is indicated by a value of the angle α different from 0° or 90°.
b. The sense of the correlation is given by the tendency line.
c. The intensity of the correlation is given by the size of the angle α. The correlation intensity is maximal when α is 45° for a direct correlation or 135° for an inverse correlation.
d. The shape of the correlation is given by the shape of the correlogram.



Analytic methods
To best approximate the causal link between the cause variables and the effect variable, known mathematical models (functions) are used.
Two essential aspects are studied: 1. Regression 2. The intensity of correlation
1. Regression
Testing how well the mathematical models approximate the causal link between the cause and effect variables involves several stages:
1. For each mathematical model we follow these steps:
a. The general mathematical model to be tested is chosen;
b. The general model is customized by determining the values of its parameters (a, b, c, ...) such that the maximum approximation precision is attained. In some situations a simplified calculus can be used;
c. For each real value of the effect variable (y), a new value (Yx) is calculated by using the customized model;
d. The sum of the errors generated by using the values calculated with the customized model instead of the real values of the effect variable is determined with this formula:

$$ \sum (y - Y_x)^2 $$

2. For all the tested models we compare the sums of errors calculated as above and we choose the model with the smallest error:

$$ \sum (y - Y_x)^2 = \min $$

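Linear correlation
Simple linear regression

$$ Y_x = a + bx $$

[Figure: two plots of the regression line in the X-Y plane with intercept a, one for b > 0 and one for b < 0.]

The parameters a and b are obtained from the system of normal equations:

$$ \begin{cases} \sum y_i = na + b \sum x_i \\ \sum x_i y_i = a \sum x_i + b \sum x_i^2 \end{cases} $$

$$ \Delta = \begin{vmatrix} n & \sum x \\ \sum x & \sum x^2 \end{vmatrix}, \quad \Delta_a = \begin{vmatrix} \sum y & \sum x \\ \sum xy & \sum x^2 \end{vmatrix}, \quad \Delta_b = \begin{vmatrix} n & \sum y \\ \sum x & \sum xy \end{vmatrix} $$

$$ a = \frac{\Delta_a}{\Delta}, \quad b = \frac{\Delta_b}{\Delta} $$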

Example (lei, monthly, per household)

Years   Total income (x)   Total expenses (y)   x²            x·y           Yx        (y-Yx)²
2001     521.79             516.52               272264.80     269514.97     533.55    289.9618
2002     658.51             651.66               433635.42     429124.63     654.23      6.626353
2003     795.09             781.45               632168.11     621323.08     774.80     44.26912
2004    1085.79            1049.94              1178939.92    1140014.35    1031.40    343.5792
2005    1212.18            1149.33              1469380.35    1393194.84    1142.97     40.43068
2006    1386.32            1304.66              1921883.14    1808676.25    1296.69     63.53853
2007    1686.74            1541.96              2845091.83    2600885.61    1561.88    396.6701
Total   7346.42            6995.52              8753363.58    8262733.73    6995.52   1185.076

$$ \Delta = \begin{vmatrix} 7 & 7346.42 \\ 7346.42 & 8753363.58 \end{vmatrix} = 7303658.24 $$

$$ \Delta_a = \begin{vmatrix} 6995.52 & 7346.42 \\ 8262733.73 & 8753363.58 \end{vmatrix} = 532817643.00, \quad a = \frac{532817643.00}{7303658.24} = 72.952 $$

$$ \Delta_b = \begin{vmatrix} 7 & 6995.52 \\ 7346.42 & 8262733.73 \end{vmatrix} = 6447108.08, \quad b = \frac{6447108.08}{7303658.24} = 0.883 $$

$$ Y_x = 72.952 + 0.883x $$
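The determinant calculus of the example can be reproduced with a short script. This is a minimal sketch using only the x and y columns of the table; the function name linear_fit is hypothetical.

```python
# Total income (x) and total expenses (y), 2001-2007, from the example table.
x = [521.79, 658.51, 795.09, 1085.79, 1212.18, 1386.32, 1686.74]
y = [516.52, 651.66, 781.45, 1049.94, 1149.33, 1304.66, 1541.96]

def linear_fit(x, y):
    """Solve the normal equations of Yx = a + b*x by Cramer's rule."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(u * v for u, v in zip(x, y))
    delta = n * sxx - sx * sx        # determinant of the system
    delta_a = sy * sxx - sx * sxy    # first column replaced by the sums with y
    delta_b = n * sxy - sx * sy      # second column replaced by the sums with y
    return delta_a / delta, delta_b / delta

a, b = linear_fit(x, y)
# Sum of squared errors between the real and the calculated values.
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
print(round(a, 3), round(b, 3), round(sse, 3))
```

The script should give a close to 72.952 and b close to 0.883, with a sum of squared errors close to the total of the last column of the table.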

2. The intensity of the linear correlation is determined by using the Pearson linear correlation coefficient (ry,x):

$$ r_{y,x} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n \sigma_x \sigma_y} = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left[ n \sum x_i^2 - \left( \sum x_i \right)^2 \right] \left[ n \sum y_i^2 - \left( \sum y_i \right)^2 \right]}} $$

- The Pearson linear correlation coefficient (ry,x) takes values in the interval [-1;1].
- The intensity of the linear correlation increases as the coefficient approaches the extremes of this interval.
- Negative values mean an inverse correlation and positive values mean a direct correlation.

Observation: A value of 0 for the linear correlation coefficient means that there is no linear correlation between the cause and effect variables, but it does not exclude a nonlinear correlation!

Example. Using the data from the previous table we obtain: ry,x = 0.99927
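The value of the example can be recomputed from the sums-based form of the coefficient. The sketch below uses the income and expenses columns of the linear regression example; the function name pearson is hypothetical.

```python
import math

# Total income (x) and total expenses (y), 2001-2007, from the example table.
x = [521.79, 658.51, 795.09, 1085.79, 1212.18, 1386.32, 1686.74]
y = [516.52, 651.66, 781.45, 1049.94, 1149.33, 1304.66, 1541.96]

def pearson(x, y):
    """Pearson linear correlation coefficient, computed from sums."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    syy = sum(v * v for v in y)
    sxy = sum(u * v for u, v in zip(x, y))
    numerator = n * sxy - sx * sy
    denominator = math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return numerator / denominator

r = pearson(x, y)
print(round(r, 5))
```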


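Nonlinear correlation
1. Regression
Quadratic function:

$$ Y_x = a + bx + cx^2 $$

[Figure: two plots of the quadratic function in the X-Y plane.]

$$ \begin{cases} \sum y_i = na + b \sum x_i + c \sum x_i^2 \\ \sum x_i y_i = a \sum x_i + b \sum x_i^2 + c \sum x_i^3 \\ \sum x_i^2 y_i = a \sum x_i^2 + b \sum x_i^3 + c \sum x_i^4 \end{cases} $$

$$ \Delta = \begin{vmatrix} n & \sum x & \sum x^2 \\ \sum x & \sum x^2 & \sum x^3 \\ \sum x^2 & \sum x^3 & \sum x^4 \end{vmatrix}, \quad \Delta_a = \begin{vmatrix} \sum y & \sum x & \sum x^2 \\ \sum xy & \sum x^2 & \sum x^3 \\ \sum x^2 y & \sum x^3 & \sum x^4 \end{vmatrix} $$

$$ \Delta_b = \begin{vmatrix} n & \sum y & \sum x^2 \\ \sum x & \sum xy & \sum x^3 \\ \sum x^2 & \sum x^2 y & \sum x^4 \end{vmatrix}, \quad \Delta_c = \begin{vmatrix} n & \sum x & \sum y \\ \sum x & \sum x^2 & \sum xy \\ \sum x^2 & \sum x^3 & \sum x^2 y \end{vmatrix} $$

$$ a = \frac{\Delta_a}{\Delta}; \quad b = \frac{\Delta_b}{\Delta}; \quad c = \frac{\Delta_c}{\Delta} $$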
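The quadratic system can be solved in the same way as the linear one, with 3x3 determinants. The sketch below is only an illustration: the data are hypothetical points generated from y = 1 + 2x + 3x², so the expected parameters are known in advance, and the names det3 and quadratic_fit are not from the course.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def quadratic_fit(x, y):
    """Solve the normal equations of Yx = a + b*x + c*x^2 by Cramer's rule."""
    n = len(x)
    s1 = sum(x)
    s2 = sum(v ** 2 for v in x)
    s3 = sum(v ** 3 for v in x)
    s4 = sum(v ** 4 for v in x)
    t0 = sum(y)
    t1 = sum(u * v for u, v in zip(x, y))
    t2 = sum(u * u * v for u, v in zip(x, y))
    delta = det3([[n, s1, s2], [s1, s2, s3], [s2, s3, s4]])
    delta_a = det3([[t0, s1, s2], [t1, s2, s3], [t2, s3, s4]])
    delta_b = det3([[n, t0, s2], [s1, t1, s3], [s2, t2, s4]])
    delta_c = det3([[n, s1, t0], [s1, s2, t1], [s2, s3, t2]])
    return delta_a / delta, delta_b / delta, delta_c / delta

# Hypothetical data lying exactly on y = 1 + 2x + 3x^2.
x = [0, 1, 2, 3, 4]
y = [1 + 2 * v + 3 * v * v for v in x]
a, b, c = quadratic_fit(x, y)
print(a, b, c)
```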
Hyperbolic function:

$$ Y_x = a + \frac{b}{x} $$

$$ \begin{cases} \sum y_i = na + b \sum \frac{1}{x_i} \\ \sum \frac{y_i}{x_i} = a \sum \frac{1}{x_i} + b \sum \frac{1}{x_i^2} \end{cases} $$
Exponential function:

$$ Y_x = a b^x $$

$$ Y = a b^x \Rightarrow \log Y = \log (a b^x) = \log a + \log b^x = \log a + x \log b $$

With z = log Y, u = log a, w = log b:

$$ z = u + wx $$

$$ \begin{cases} \sum z = nu + w \sum x \\ \sum xz = u \sum x + w \sum x^2 \end{cases} $$

a = 10^u
b = 10^w
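Both the hyperbolic and the exponential model reduce to the linear normal equations after a change of variable: 1/x replaces x in the hyperbolic case, and log Y replaces Y in the exponential case. The sketch below illustrates this with hypothetical data generated from Y = 2 + 3/x and Y = 2 · 1.5^x; the function names are assumptions, not part of the course.

```python
import math

def linear_fit(x, y):
    """Solve the normal equations of Y = a + b*x by Cramer's rule."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(u * v for u, v in zip(x, y))
    delta = n * sxx - sx * sx
    return (sy * sxx - sx * sxy) / delta, (n * sxy - sx * sy) / delta

def hyperbolic_fit(x, y):
    """Y = a + b/x is linear in the new variable 1/x."""
    return linear_fit([1.0 / v for v in x], y)

def exponential_fit(x, y):
    """Y = a*b^x becomes z = u + w*x with z = log Y, u = log a, w = log b."""
    z = [math.log10(v) for v in y]
    u, w = linear_fit(x, z)
    return 10 ** u, 10 ** w

# Hypothetical data, generated from known parameters.
x = [1, 2, 3, 4, 5]
y_hyp = [2 + 3 / v for v in x]
y_exp = [2 * 1.5 ** v for v in x]

a_h, b_h = hyperbolic_fit(x, y_hyp)
a_e, b_e = exponential_fit(x, y_exp)
print(a_h, b_h, a_e, b_e)
```

Note that the exponential fit minimizes the squared errors of log Y, not of Y itself, so on noisy data it can differ slightly from a direct least-squares fit of the exponential curve.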
Nonlinear correlation
2. The intensity of correlation
To determine the strength of the correlation between the variables x (cause) and y (effect) when any of the linear or nonlinear functions is used, the correlation ratio is used. It is based on the decomposition of the overall variance into factorial variances.

General variance. Summarizes the total variation of the effect variable y, the overall result of the simultaneous action of all influencing factors.

$$ \sigma_y^2 = \frac{\sum (y_i - \bar{y})^2}{n} $$

Explained variance. Summarizes the variation of the y (effect) variable explained by the influence of the variable x (cause) included in the correlation couple.

$$ \sigma_{Y_x}^2 = \frac{\sum (Y_{x_i} - \bar{y})^2}{n} $$

Non-explained variance. Summarizes the variation of the y (effect) variable that cannot be explained by the influence of the variable x (cause) included in the correlation couple.

$$ \sigma_{y,Y_x}^2 = \frac{\sum (y_i - Y_{x_i})^2}{n} $$

Rule of variances summation:

$$ \sigma_y^2 = \sigma_{Y_x}^2 + \sigma_{y,Y_x}^2 $$

The coefficient of determination

$$ R^2 = \frac{\sigma_{Y_x}^2}{\sigma_y^2} $$

The nonlinear correlation coefficient

$$ R = \sqrt{R^2}, \quad R \in [0;1] $$
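The decomposition can be checked on the income and expenses data of the linear regression example. The sketch below assumes a least-squares line as the model and takes the coefficient of determination as the ratio of explained to general variance; for a linear model the resulting R should coincide with the absolute value of the Pearson coefficient. The function name variances is hypothetical.

```python
import math

# Total income (x) and total expenses (y), 2001-2007, from the linear example.
x = [521.79, 658.51, 795.09, 1085.79, 1212.18, 1386.32, 1686.74]
y = [516.52, 651.66, 781.45, 1049.94, 1149.33, 1304.66, 1541.96]

# Least-squares line Yx = a + b*x (normal equations solved by Cramer's rule).
n = len(x)
sx, sy = sum(x), sum(y)
sxx = sum(v * v for v in x)
sxy = sum(u * v for u, v in zip(x, y))
delta = n * sxx - sx * sx
a = (sy * sxx - sx * sxy) / delta
b = (n * sxy - sx * sy) / delta
fitted = [a + b * v for v in x]

def variances(y, fitted):
    """General, explained and non-explained variance of y for a given model."""
    n = len(y)
    mean = sum(y) / n
    general = sum((yi - mean) ** 2 for yi in y) / n
    explained = sum((fi - mean) ** 2 for fi in fitted) / n
    unexplained = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / n
    return general, explained, unexplained

general, explained, unexplained = variances(y, fitted)
R2 = explained / general      # coefficient of determination
R = math.sqrt(R2)             # correlation ratio
print(round(R2, 5), round(R, 5))
```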
Nonparametric correlation

Rank - the position of each value of the variables X and Y of the correlation couple in the set it belongs to, ordered ascending or descending.
ui - ranks of the values xi, from the ordered set x1, x2, …, xn;
wi - ranks of the values yi, from the ordered set y1, y2, …, yn;

Spearman coefficient

$$ \rho = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} $$

where:
d - the difference between the ranks ui and wi;
n - the number of terms of the statistic series.

The relation gives accurate results as long as the premises used to obtain it hold:

$$ \sum u_i = \sum w_i $$

and the ranks ui and wi are unique, not repeated in the set they belong to. If this latter condition is not satisfied, proceed as follows: a value of xi or yi that repeats is copied only once in its set, and its corresponding value is the average of the values of the other variable that correspond to the repeated value.

The Spearman coefficient takes values in the interval [-1;1].
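The procedure can be sketched on the ten (xi, yi) pairs of the worked example in this section. The function names are hypothetical, and collapse_repeats is one possible reading of the rule for repeated values (repeated x values first, then repeated y values).

```python
from collections import defaultdict

def collapse_repeats(pairs):
    """Keep each repeated x (then each repeated y) once, paired with the
    average of the other variable's values that correspond to it."""
    def collapse(pairs, key_index):
        groups = defaultdict(list)
        for p in pairs:
            groups[p[key_index]].append(p[1 - key_index])
        out = []
        for key, others in groups.items():
            mean = sum(others) / len(others)
            out.append((key, mean) if key_index == 0 else (mean, key))
        return out
    return collapse(collapse(pairs, 0), 1)

def ranks(values):
    """Ascending ranks 1..n; assumes the values are all distinct."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman(pairs):
    u = ranks([x for x, _ in pairs])
    w = ranks([y for _, y in pairs])
    n = len(pairs)
    d2 = sum((ui - wi) ** 2 for ui, wi in zip(u, w))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

initial = [(100, 28), (98, 24), (51, 9), (100, 26), (86, 14),
           (70, 12), (69, 12), (54, 8), (108, 31), (102, 35)]
clean = collapse_repeats(initial)   # (100, 27) and (69.5, 12) replace the repeats
rho = spearman(clean)
print(len(clean), round(rho, 3))
```

On these data eight pairs remain after the repetitions are eliminated, the sum of d² is 4 and the coefficient is about 0.952.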


Kendall coefficient

$$ \tau = \frac{2S}{n(n - 1)} = \frac{2(P - Q)}{n(n - 1)} $$

where:
P - the sum, over all ranks, of the number of wi ranks bigger than the current rank;
Q - the sum, over all ranks, of the number of wi ranks smaller than the current rank.

To determine P and Q proceed in this way:

1. Order the pairs of (xi, yi) values;
2. Eliminate all repetitions in the values of the X and Y variables and, if necessary, reorder the pairs of (xi, yi) values;
3. Determine the ranks ui for the values of the X variable;
4. Determine the ranks wi for the values of the Y variable;
5. Going from the first value to the end of the set of wi ranks, determine successively for each rank:
- how many ranks wi are bigger than the current rank;
- how many ranks wi are smaller than the current rank;
Note: The ranks smaller or bigger than the current rank are counted starting from the position after the current rank towards the end of the set.
6. Calculate P (Q) by summing the numbers of ranks bigger (smaller) than the current one;
The Kendall coefficient takes values in the interval [-1;1] and must be interpreted in the same way as the Spearman coefficient. If we calculate the Spearman and Kendall coefficients for the same set of data, this relation always holds between them:
 
Example

Initial values   After eliminating   After ordering the   Ordered values   Ranks     Indicators
                 the repetition      values by xi
xi     yi        xi     yi           xi     yi            xi     yi        ui   wi   d    d²   P    Q
100    28        100    27           51     9             51     9         1    2    -1   1    6    1
98     24        98     24           54     8             54     8         2    1    1    1    6    0
51     9         51     9            69.5   12            69.5   12        3    3    0    0    5    0
100    26        86     14           86     14            86     14        4    4    0    0    4    0
86     14        69.5   12           98     24            98     24        5    5    0    0    3    0
70     12        54     8            100    27            100    27        6    6    0    0    2    0
69     12        108    31           102    35            102    35        7    8    -1   1    0    1
54     8         102    35           108    31            108    31        8    7    1    1    0    0
108    31
102    35
n = 10           n = 8               n = 8                Total            36   36   0    4    26   2

Spearman coefficient:

$$ \rho = 1 - \frac{6 \cdot 4}{8(8^2 - 1)} = 1 - \frac{24}{504} = 0.952 $$

Kendall coefficient:

$$ \tau = \frac{2(26 - 2)}{8(8 - 1)} = \frac{48}{56} = 0.857 $$
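The counting of P and Q can be sketched directly from the wi column of the example table, taken in the order of the ui ranks. The function name kendall_from_ranks is hypothetical.

```python
def kendall_from_ranks(w):
    """P, Q and the Kendall coefficient from the wi ranks,
    listed in ascending order of the ui ranks."""
    n = len(w)
    P = Q = 0
    for i, current in enumerate(w):
        later = w[i + 1:]                             # ranks after the current one
        P += sum(1 for r in later if r > current)     # bigger ranks
        Q += sum(1 for r in later if r < current)     # smaller ranks
    tau = 2 * (P - Q) / (n * (n - 1))
    return P, Q, tau

w = [2, 1, 3, 4, 5, 6, 8, 7]   # wi ranks from the example table
P, Q, tau = kendall_from_ranks(w)
print(P, Q, round(tau, 3))
```

For the example this gives P = 26 and Q = 2, and a coefficient of about 0.857.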
