
Theory dossier

Data analysis
Course 2011-2012
1) Introduction. Basic concepts. Sampling.
2) Graphical description of a variable
Numerical description of a variable
   1. Data transformation
      1. Measures of position
      2. Measures of spread and form
      3. Other transformations
   2. Grouped data
      1. Computation of the median and the quartiles
      2. Computation of the mean and the standard deviation
Normal distribution
Datasets with two variables (I)
   Two numerical variables
   Excessive dispersion: the median and mean trace
   Non linear regression
   One numerical and one categorical variable
      Example: a non-ranked categorical variable and a numerical variable. Income and county.
      Example: analysis of a ranked categorical variable. Income and schooling.
Datasets with two variables (II)
   Two categorical variables
Time series
   Introduction
   Composition
   Analysis of the trend and the cycle: the long term
      Fitting mathematical functions
      Moving averages
   Short term fluctuations
      Seasonal variations
   Prediction with time series
      The trend
      The seasonal component
Measures of inequality and concentration
   Inequality measures
   Concentration indices
Index numbers
   Simple indices
   Complex indices
      Laspeyres index
      Paasche index
   Measuring inflation
   Nominal and real growth

1 de 51

1) Introduction. Basic concepts. Sampling.

How to identify individuals, variables and observations, how to organize the data adequately to enter them into the computer, and how to construct a frequency table. Three points:
1) What is statistics and what do we do in it? Moore, pages in the prologue.
2) Data analysis: variables and data organization. Moore, Chapter 1, pag. 3-5
3) Data collection: samples. Moore, Chapter 3, pag. 205-225

2) Graphical description of a variable

Bar diagrams, pie charts, histograms, stemplots, time-series graphs. Moore, Chap. 1, pag. 6-22

Numerical description of a variable


Center and spread measures. Boxplot. Standard deviation. Data transformation. Moore, Chap. 1, pag. 32-51

1. Data transformation.

We often have to change the units of measure of our data. In these cases it is useful to know how our descriptive measures will change. The most common transformations are what we call a change of origin and a change of scale. Sometimes we have to apply both at the same time. A change of origin is produced when we add a positive or negative constant to our data. If we call our original variable X, and a is any positive or negative constant, a change of origin is produced when we add a to each case in our data, which gives a transformed variable that we call Y. We can express this transformation by the following equation:

$$Y = X + a$$

We call it a change of origin because, from a graphical point of view, the transformation shifts the data to the right or to the left along the horizontal axis (depending on a being positive or negative, respectively). A case where we can find such a transformation is the following: consider a group of people who have between 2 and 8 euros in their pockets. Each one of them receives a present of 7 euros. The change is shown in the following chart:


Notice that the histogram has moved to the right. We can also say that it is now farther away from the origin, which is why we call it a change of origin. The second common type of transformation is multiplying or dividing the data by a positive constant. This type of transformation is called a change of scale, because what we do is change the unit of measure, showing the values of the same data in larger or smaller units, or in other words using another scale. We can express this transformation by the following formula:

$$Y = \frac{X}{b}$$

where X are the original data, b is a positive constant, and Y are the transformed data. If b is larger than 1 the transformed values are smaller, because we are measuring in larger units, while if b is smaller than 1 the transformed values are larger, because we are measuring in smaller units. For instance, in 2002 most European Union countries adopted the euro. Spain abandoned the peseta and all prices had to be converted into euros. If X is a monetary quantity expressed in pesetas, the transformation to euros implies a b equal to 166.386. The same monetary quantity, once we apply the transformation, is expressed in euros (a larger unit, so the numerical values shrink, since b is larger than 1). To show a change of scale graphically, let us suppose that we double the values of a dataset, that is, we express them in units half the size of the original (b = 0.5). The histogram expands to double its range along the horizontal axis (but the frequencies obviously do not change):


Sometimes these two transformations have to be applied at the same time. We can find an example in the conversion from Fahrenheit degrees to Celsius degrees. If C are Celsius degrees and F are Fahrenheit degrees, the formula is:
$$C = \frac{F - 32}{1.8}$$

In general a transformation of a variable X to a variable Y, that includes an origin change and a scale change, can be represented by the formula:
$$Y = \frac{X - a}{b}$$

where a is a positive or negative constant and b is a positive constant, larger or smaller than 1. We also call these transformations linear transformations, because the function that takes us from X to Y is a linear function. Linear transformations are the most common ones, but they are not the only ones that we can apply to the data. At the end of this section we will also discuss non-linear transformations. What happens to our summary measures when we apply a linear transformation? Do we have to recompute all of them? The answer is no. We will see next how the summary measures are affected by linear transformations of the type Y = (X-a)/b.

1. Measures of position

If R_X is a measure of position of a dataset with a numerical variable X, to which we apply a linear transformation obtaining a new variable Y = (X-a)/b, we can find the same measure of position for the new variable Y by the following formula:

$$R_Y = \frac{R_X - a}{b}$$

That is, we apply to the summary measure the same transformation that we applied to the data. Proving this result is straightforward for the mean, which is a linear function of the data, so the linear transformation can be applied to it directly. Order-based measures such as the median and the quartiles behave in the same way because, with a positive b, the transformation preserves the order of the cases.
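As a quick numerical check of this rule, the following minimal Python sketch converts a small set of daily temperatures from Fahrenheit to Celsius and compares the measures of position computed on the converted data with those obtained by converting the original measures. The temperature values and the helper name `to_celsius` are illustrative assumptions, not data from the course.

```python
from statistics import mean, median, quantiles

def to_celsius(f):
    """Linear transformation Y = (X - a) / b with a = 32 and b = 1.8."""
    return (f - 32) / 1.8

# Hypothetical daily temperatures in Fahrenheit (illustrative values only).
fahrenheit = [64, 66, 68, 70, 71, 73, 78]
celsius = [to_celsius(f) for f in fahrenheit]

# Route 1: transform every case, then summarize.
mean_direct = mean(celsius)
median_direct = median(celsius)
quartiles_direct = quantiles(celsius, n=4)

# Route 2: summarize the original data, then transform the summary.
mean_via_rule = to_celsius(mean(fahrenheit))
median_via_rule = to_celsius(median(fahrenheit))
quartiles_via_rule = [to_celsius(q) for q in quantiles(fahrenheit, n=4)]

print(round(mean_direct, 2), round(mean_via_rule, 2))
print(round(median_direct, 2), round(median_via_rule, 2))
```

Both routes should agree up to floating-point rounding, so in practice only the summary values need to be converted.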


Suppose for instance that we are told that in New York the average temperature in September is 70 degrees Fahrenheit. Can we know the average temperature in Celsius degrees without having to ask for the daily temperatures to recompute the mean? We do not need the daily temperatures; the average temperature in Celsius degrees is:

$$\frac{70 - 32}{1.8} = 21.1 \text{ Celsius degrees}$$

This result is valid for all measures of position (median, quartiles, etc.) [1].

2. Measures of spread and form

As we stated before, linear transformations combine a change of origin and a change of scale. The change of origin simply shifts the histogram without affecting its form. Consequently, the measures of spread, symmetry, kurtosis and so on are not affected by changes of origin. Changes of scale, in contrast, affect the measures of spread in a predictable way, but they do not affect the measures of form. If R_X is a measure of spread of a dataset with a numerical variable X and we apply a linear transformation to obtain a new variable Y = (X-a)/b, the same measure of spread in the dataset Y will be:

$$R_Y = \frac{R_X}{b}$$

That is, only the change of scale is applied [2].
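The behaviour of spread and form under a linear transformation can be checked numerically. The sketch below is only an illustration: the dataset, the constants `a` and `b`, and the helper `skewness` (the moment coefficient, one of several common definitions) are assumptions chosen for the example.

```python
from statistics import mean, stdev

def skewness(data):
    """Moment coefficient of skewness, m3 / m2**1.5 (one common definition)."""
    m = mean(data)
    n = len(data)
    m2 = sum((v - m) ** 2 for v in data) / n
    m3 = sum((v - m) ** 3 for v in data) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed data and transformation constants.
x = [2, 3, 3, 4, 5, 8, 15]
a, b = 10, 4
y = [(v - a) / b for v in x]      # Y = (X - a) / b

# Measures of spread: only the change of scale matters.
sd_x, sd_y = stdev(x), stdev(y)
range_x, range_y = max(x) - min(x), max(y) - min(y)

# Measure of form: unchanged when b is positive.
skew_x, skew_y = skewness(x), skewness(y)

print(sd_y, sd_x / b)
print(range_y, range_x / b)
print(skew_y, skew_x)
```

Each printed pair should coincide up to rounding: the spread is divided by b, the constant a plays no role, and the skewness is the same before and after.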

At the car repair shop Mario Bros. they employ six workers with the following wages:

Wages (in pesetas)
140000
150000
170000
130000
160000
180000

The mean and the standard deviation of the wages are:

$$\bar{X} = 155000 \qquad s_X = 18708.29$$
We are told that in December the workers will get a pay raise of 20000 pesetas (120.20 euros), and that at the beginning of 2002 Spain is abandoning the peseta and adopting the euro. What will be the mean and the standard deviation of the wages, now expressed in euros, taking into account the raise of 20000 pesetas? Do we have to compute the December wage of each worker, adding 20000 pesetas, and divide by 166.386 to obtain the January wages in euros? It is not needed. The transformation is Y = (X + 20000)/166.386, that is, a = -20000 and b = 166.386 in the general formula, so using the previous results the mean in January is:

$$\bar{Y} = \frac{\bar{X} + 20000}{166.386} = \frac{155000 + 20000}{166.386} = 1051.771$$

and the standard deviation is:

$$s_Y = \frac{s_X}{166.386} = 112.4391$$

We could of course have computed the new wages:

Wages (in euros)
961.62
1021.72
1141.92
901.52
1081.82
1202.02

and then computed their mean and standard deviation, which would obviously lead us to the same results.

[1] If b were negative, the formula could not be applied to the resistant measures, and if the distribution were skewed the skewness would be reversed, but there are no practical cases where we need a negative b.
[2] If b were negative, the result would not change except that the new measure would be divided by the absolute value of b. But remember what we said about negative values of b: they are not found in actual changes of measure, so we assume that b is always positive.

3. Other transformations

In practical applications the most common transformations are linear, since they are associated with changes in the unit of measure of the data. Non-linear transformations are less common and are used to change the form of a distribution. Sometimes skewed distributions can be converted into symmetric distributions using these transformations, and once they are symmetric we can use summary measures such as the mean or the standard deviation, which are not adequate if the distribution is skewed. Non-linear transformations are based on non-linear functions, such as the logarithmic, exponential or polynomial functions. Let us consider for instance the following dataset, corresponding to the returns obtained at the stock market by various investors, in thousands of euros:
Returns
10      10      12.58   12.58   12.58
15.84   15.84   15.84   25.11   25.11
25.11   25.11   25.11   31.62   31.62
31.62   39.81   39.81   39.81   50.11
50.11   63.09

Let us draw the histogram:


We can appreciate that it is quite skewed to the right, indicating that most investors obtain modest returns, with a few lucky ones (or maybe they know more about stock market investments) obtaining higher returns. For this distribution the mean and the standard deviation are not adequate summaries, since it is skewed. For this reason we apply a non-linear transformation, in this case a logarithmic transformation, to reduce the skewness of the data. The transformation that we apply to the original X is

$$Y = \log(X)$$

where log denotes the logarithm in base 10. The data that we obtain now are:

Returns (logarithm in base 10)
1     1     1.1   1.1   1.1
1.2   1.2   1.2   1.4   1.4
1.4   1.4   1.4   1.5   1.5
1.5   1.6   1.6   1.6   1.7
1.7   1.8

If we draw the histogram now we obtain:

This histogram is clearly more symmetric and now we can use the mean or the standard deviation if we want.
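The gain in symmetry can also be checked numerically instead of visually. The following sketch applies the base-10 logarithm to the returns listed above and compares a skewness coefficient before and after. The choice of the moment coefficient of skewness (and the helper name `skewness`) is an assumption made for illustration; the text itself only compares histograms.

```python
from math import log10
from statistics import mean

# Returns in thousands of euros, as listed in the text.
returns = [10, 10, 12.58, 12.58, 12.58,
           15.84, 15.84, 15.84, 25.11, 25.11,
           25.11, 25.11, 25.11, 31.62, 31.62,
           31.62, 39.81, 39.81, 39.81, 50.11,
           50.11, 63.09]

def skewness(data):
    """Moment coefficient of skewness: 0 for a symmetric distribution,
    positive when the right tail is longer."""
    m = mean(data)
    n = len(data)
    m2 = sum((v - m) ** 2 for v in data) / n
    m3 = sum((v - m) ** 3 for v in data) / n
    return m3 / m2 ** 1.5

log_returns = [log10(v) for v in returns]

skew_raw = skewness(returns)       # clearly positive: skewed to the right
skew_log = skewness(log_returns)   # much closer to zero

print(round(skew_raw, 2), round(skew_log, 2))
```

A coefficient close to zero after the transformation is consistent with the more symmetric histogram.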


Do we have properties for the summary measures like the ones that we have for linear transformations, so that we can predict their values without the need to recompute all the data and compute the numerical measures from there? For the case of non-linear transformations we do not have similar properties. This means that the mean of the data that we transformed with the logarithmic function, for instance, is not equal to the logarithm of the mean of the original data.
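A short check with the returns data makes this concrete. The sketch below is only illustrative: it computes the mean of the log-transformed data and the logarithm of the mean of the original data, so the two can be compared.

```python
from math import log10
from statistics import mean

# Returns in thousands of euros, as listed in the text.
returns = [10, 10, 12.58, 12.58, 12.58,
           15.84, 15.84, 15.84, 25.11, 25.11,
           25.11, 25.11, 25.11, 31.62, 31.62,
           31.62, 39.81, 39.81, 39.81, 50.11,
           50.11, 63.09]

# Mean computed AFTER transforming each case.
mean_of_logs = mean([log10(v) for v in returns])

# Transformation applied to the mean of the original data.
log_of_mean = log10(mean(returns))

print(round(mean_of_logs, 4), round(log_of_mean, 4))
```

The two numbers differ, so with a non-linear transformation the summary measures have to be recomputed from the transformed data.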

2. Grouped data

We call grouped data a dataset of one numerical variable presented in a frequency table. Very often statistical information is presented in this format in publications from statistical agencies, governments or the economic press. In this case we do not know the original information, that is, the data case by case, and we have to work with the data grouped in intervals or ranges. We will see in this section that we can still compute practically all the numerical summaries and give a fairly accurate description of the dataset.

1. Computation of the median and the quartiles

Let us suppose that we are given data on income of 280 families in the following frequency table:

Income                        Absolute
Lower limit    Upper limit    frequency
0              10000          15
10000          15000          45
15000          20000          100
20000          30000          83
30000          50000          30
50000          100000         7

We do not know the income of each family individually, but the frequency table contains a lot of information. To compute the resistant measures (median and quartiles) we first have to identify the interval where they are located. To locate the first quartile we can use the formula (N+1)/4, which gives us its approximate position within the list of cases sorted from smallest to largest. In this case we have 280 cases, therefore the formula gives (280+1)/4 = 70.25, which means that the value of the first quartile lies between observations 70 and 71. To locate this value it is convenient to present the cumulative absolute frequencies:

Income                        Absolute     Cumulative absolute
Lower limit    Upper limit    frequency    frequency
0              10000          15           15
10000          15000          45           60
15000          20000          100          160
20000          30000          83           243
30000          50000          30           273
50000          100000         7            280


In which interval can we find observations 70 and 71? Not in the first interval, since it accumulates only up to case 15, and not in the second one either, because it accumulates up to case 60. Cases 70 and 71 are in the third interval, since it accumulates from case 61 to case 160. Therefore the first quartile is in the third interval, which contains cases with values between 15000 and 20000. But what is its value? We cannot know it exactly, but we can approximate it by the midpoint of the interval: 17500. We now do the same for the median and for the third quartile. For the median, the position is given by the formula (N+1)/2 = 281/2 = 140.5, which means that its value would be the mean of observations 140 and 141. Where can these observations be found? They are also in the third interval, since this interval accumulates the cases between positions 61 and 160. We approximate the median by the midpoint of this interval and we obtain 17500. Therefore in this case we give the same value for the first quartile and the median. Of course, if we had the original data and could compute the first quartile and the median exactly, their values would not be the same, but given the distribution of values that we have they would not be too different. Finally, to compute the third quartile we find its location with the formula 3(N+1)/4 = 210.75, which means that it lies between cases 210 and 211. With the help of the cumulative frequencies we can see that these observations are in the fourth interval, since it accumulates from observation 161 to observation 243. Its value can be approximated by the midpoint of this interval, that is 25000.

2. Computation of the mean and the standard deviation

We can also compute the mean and the standard deviation. To compute the mean let us suppose that the values of each interval are equal to the midpoint or class mark of the interval. For instance for the first interval we will assume that all cases that fall in the first interval have a value equal to 5000, and we know that there are 15 cases in this interval (its absolute frequency). Doing the same for all intervals we can compute the sum that we find in the numerator of the formula for the mean, and dividing by the total number of cases we have an approximate mean. The computations can be seen in the following table:
Midpoint of     Absolute
the interval    frequency    Interval sum
5000            15           15*5000 = 75000
12500           45           45*12500 = 562500
17500           100          100*17500 = 1750000
25000           83           83*25000 = 2075000
40000           30           30*40000 = 1200000
75000           7            7*75000 = 525000
                             Total sum = 6187500

To obtain the mean we have to divide the (approximate) total sum of the values of all cases by the total number of cases, that is 6187500/280 = 22098.21, which gives us an approximate mean for these cases. To compute the standard deviation we use its formula, again taking the midpoints of the intervals as the values of the data, weighted by their frequencies, and using the approximate mean that we just computed.

Midpoint of     Deviation with         Squared           Absolute     Squared deviation
the interval    respect to the mean    deviation         frequency    by interval frequency
5000            -17098.21              292348931.76      15           4385233976.4
12500           -9598.21               92125717.47       45           4145657286.35
17500           -4598.21               21143574.62       100          2114357461.73
25000           2901.79                8420360.33        83           698889907.53
40000           17901.79               320473931.76      30           9614217952.81
75000           52901.79               2798598931.76     7            19590192522.32
                                       Total sum of squared deviations: 40548549107.14

To finish the computation of the standard deviation we have to divide the total sum of the squared deviations by N-1, that is 280-1=279, and compute the square root of the result, so that we obtain 12055.51, which is an approximate standard deviation for this data set.
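The grouped-data computations of this section can be reproduced with a few lines of Python. This is a minimal sketch using the income table above; the function name `interval_midpoint_at` is a hypothetical helper, and it simplifies matters by rounding a fractional position up to the next case, which is harmless here because neighbouring cases fall in the same interval.

```python
from math import ceil, sqrt

# Income intervals and absolute frequencies from the table in the text.
intervals = [(0, 10000), (10000, 15000), (15000, 20000),
             (20000, 30000), (30000, 50000), (50000, 100000)]
frequencies = [15, 45, 100, 83, 30, 7]

n = sum(frequencies)                               # 280 families
midpoints = [(lo + hi) / 2 for lo, hi in intervals]

# Cumulative absolute frequencies.
cumulative = []
running = 0
for f in frequencies:
    running += f
    cumulative.append(running)

def interval_midpoint_at(position):
    """Midpoint of the interval that contains the case at the given
    (possibly fractional) position in the sorted list of cases."""
    for midpoint, cum in zip(midpoints, cumulative):
        if ceil(position) <= cum:
            return midpoint
    raise ValueError("position beyond the last case")

q1 = interval_midpoint_at((n + 1) / 4)
med = interval_midpoint_at((n + 1) / 2)
q3 = interval_midpoint_at(3 * (n + 1) / 4)

# Approximate mean and standard deviation using the midpoints as values.
grouped_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n
sum_sq_dev = sum(f * (m - grouped_mean) ** 2
                 for f, m in zip(frequencies, midpoints))
grouped_sd = sqrt(sum_sq_dev / (n - 1))

print(q1, med, q3)
print(round(grouped_mean, 2), round(grouped_sd, 2))
```

The results should match the approximations obtained by hand in the tables above.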

Normal distribution.
Moore, pag. 51-75

Datasets with two variables (I)


Two numerical variables
Moore, chapter 2, pag. 97-173

Excessive dispersion: the median and mean trace


In many cases we face a scatterplot of two numerical variables where we cannot make out any relationship, because there is excessive dispersion that may be caused by some factor that we are not directly interested in analyzing. Consider for example the relationship between gas consumption and car speed. The speed of a car clearly has an effect on gas consumption, but many other factors also have an effect, such as headwind, road quality, and so on. The following scatterplot shows gas consumption in liters per 100 km against average car speed in km/h for a sample of cars of the same make:


As we can see in the scatterplot, there does not seem to be a clear relation. Maybe there is just a very weak negative relation between the two variables, as shown by the correlation coefficient and the value of the slope. But as we said before, it is possible that other factors are influencing the spread in gas consumption that we observe, and this may be hiding the relationship between the two variables. To try to clarify this relationship we can apply a tool known as the median or mean trace. It consists in dividing the range of variation of the explanatory variable into a number of equally sized sectors, and computing the median (or the mean) of the dependent variable within each sector. The values of these medians (or means) are plotted in the scatterplot against the midpoint of each sector, and this may help in clarifying if there is any relationship between the variables. For instance, in our case we divide the range of variation of speed into 5 sectors, and for each sector we compute the median of gas consumption:

The median trace is the red line joining the medians that we have computed for each sector, represented by red dots. As can be seen in the diagram, the minimum consumption of gas seems to be observed when cars are running between 90 and 95 km/h. It is convenient to try different numbers of sectors to see if the median or mean trace gives us some information on the relationship between the two variables. We have to be careful with this technique, though, since eliminating dispersion (by computing medians or means) always makes the relationship look stronger than it is in the original data. Common sense also has to be applied so that a false or artificial relationship is avoided.
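The procedure can be sketched in a few lines of Python. The function name `median_trace`, its parameters and the speed and consumption values below are hypothetical, chosen only to illustrate the steps (they are not the data behind the scatterplot in the text). Sectors are taken as closed on the left, with the maximum value assigned to the last sector.

```python
from statistics import median

def median_trace(x, y, n_sectors=5, stat=median):
    """Split the range of x into n_sectors equal-width sectors and return
    (sector midpoint, stat of y within the sector) for each non-empty sector.
    Pass stat=statistics.mean to obtain a mean trace instead."""
    lo, hi = min(x), max(x)
    if hi == lo:
        raise ValueError("x must take at least two different values")
    width = (hi - lo) / n_sectors
    groups = [[] for _ in range(n_sectors)]
    for xi, yi in zip(x, y):
        k = min(int((xi - lo) / width), n_sectors - 1)  # max x goes to last sector
        groups[k].append(yi)
    return [(lo + (k + 0.5) * width, stat(g))
            for k, g in enumerate(groups) if g]

# Hypothetical speeds (km/h) and gas consumption (liters per 100 km).
speed = [60, 65, 72, 78, 85, 90, 93, 98, 105, 110, 118, 120]
consumption = [7.9, 7.4, 7.0, 6.8, 6.1, 5.8, 5.9, 6.3, 6.9, 7.2, 8.0, 8.3]

trace = median_trace(speed, consumption, n_sectors=4)
for midpoint, value in trace:
    print(midpoint, round(value, 2))
```

Calling the function with several values of `n_sectors` is an easy way to follow the advice of trying different numbers of sectors.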

Non linear regression


The regression analysis techniques between two numerical variables that we have seen so far presuppose a linear relationship between the variables that we want to analyze. When the relationship is non-linear, the fit can be very poor and we can make large prediction errors. Consider for instance a dataset analyzing the relationship between advertising expenditure and sales for a sample of firms. Intuitively, we can argue that as the expenditure in advertising goes up, sales also go up because of the stimulus received by consumers, but this stimulus has a decreasing effect. In other words, after some point the effect of additional expenditure on sales starts to decrease.

The following scatterplot shows a sample of firms for which we have information on the level of advertising expenditure and their sales, both variables in thousands of euros:

As we can see in the diagram, there is a clear positive association between advertising and sales, but the relationship is not linear: the scatter of points suggests some sort of function, but it is not a line. We can confirm this with a residual diagram:

The residual diagram clearly shows that the fit makes systematic errors, with regions where the residuals are either systematically positive or systematically negative. In this section we will learn a series of simple techniques that allow us to keep applying linear regression to some of the non-linear relationships that we may encounter. The idea is based on a mathematical technique known as change of variables. Before presenting it we make a digression to explain some properties of logarithms that will be useful later. The logarithm of a value in a given base is the exponent to which the base must be raised to obtain that value. For instance, the logarithm of 100 in base 10 is 2, since $100 = 10^2$. Some useful properties of logarithms are the following:

$$\log(a \cdot b) = \log(a) + \log(b)$$
$$\log(a^b) = b \log(a)$$

We can now present the idea of the change of variables. Let us assume that we have an equation of the following type:

$$Y = 10^7 X^3$$

The relationship between Y and X is clearly non-linear, since linear relations only allow X to be multiplied by a constant and an independent term to be added (in other words, equations of the form Y = a + bX, where a and b are two constants). But we can do the following. Take logarithms on the left hand side and the right hand side of the equation, so that the equality is kept:

$$\log(Y) = \log(10^7 X^3)$$

We now apply the properties of logarithms that we have mentioned, and we obtain:

$$\log(Y) = 3 \log(X) + 7 \log(10)$$

And given that $\log(10) = 1$ we have:

$$\log(Y) = 3 \log(X) + 7$$

Now we do the following change of variables:

$$Y' = \log(Y) \qquad X' = \log(X)$$

And we can now write our equation as:

$$Y' = 3 X' + 7$$

Notice that Y' is now linear with respect to X', so our equation is linear. This is the idea that will allow us to continue applying our linear regression techniques even though the relationship between our numerical variables is non-linear, provided that the non-linear relationship is of a type that can be handled with simple transformations. We cannot apply this technique to all non-linear relationships, but we can try a couple of simple transformations and check if the relationship becomes linear with the transformed variables. We will apply this idea to our simple example with a sample of firms. This model is known as log-log (because we take logarithms of both the dependent and the explanatory variables). Instead of using logarithms in base 10 as in our example, we will use the so-called natural logarithms, which are used because they have some convenient properties. The base of these logarithms is a constant called e = 2.71828 (we only show the first 5 decimals). The inverse of the natural logarithm function ln is the exponential function, that is $e^x$, which we usually denote by exp(x). To apply this model to our data, let us present the first 10 cases:
Advertising    Sales
2.96           17.31
3.43           18.17
1.7            16.33
2.49           17.2
1.91           16.63
2.63           17.26
1.78           16.47
1.82           16.5
2.45           17.28
1.34           16

We compute the natural logarithm for each value of Advertising and each value of Sales, and we obtain:
ln(Advertising)    ln(Sales)
1.09               2.85
1.23               2.9
0.53               2.79
0.91               2.84
0.65               2.81
0.97               2.85
0.58               2.8
0.6                2.8
0.89               2.85
0.29               2.77

Making this transformation for all the cases of the dataset and representing the transformed variables in a scatterplot, we get:

If we compare this scatterplot of the transformed data with the original scatterplot, we can see that now the relationship seems clearly linear, and therefore we can compute the regression line and make accurate predictions with this transformed model. If we enter the transformed data in the computer, we obtain the constant and the slope of this regression:

ln(Sales) = 2.73 + 0.13 ln(Advertising)

What prediction would we make for the sales of a firm with an advertising expenditure equal to 2000 euros? Since advertising is measured in thousands of euros, the prediction with our regression is:

2.73 + 0.13 ln(2) = 2.82

But notice that this is not the prediction of Sales directly, but of ln(Sales). To obtain the prediction of Sales, we have to use the inverse of the logarithmic function, that is, the exponential function. So we finally obtain:

exp(2.82) = 16.78 thousand euros

This is our prediction of Sales. If we try the log-log transformation but the scatterplot shows that the relationship is still not linear, we can try other types of transformations.

Semi-log: In this model we transform only the dependent variable with logarithms, but not the explanatory variable. The relation is now of the following type: ln(Y) = a + bX.

Reciprocal: We transform only the explanatory variable, taking its reciprocal, and the model is now Y = a + b(1/X).

How do we know when to apply the log-log model, the semi-log model or the reciprocal model? The original scatterplot can give us some idea, if the form that we observe suggests a logarithmic relationship or one of the type implied by the reciprocal function. But in practical terms we can draw the scatterplot for the three cases and determine visually which transformation provides the best fit.
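The log-log fit and the back-transformation of the prediction can be sketched as follows. This is an illustration under stated assumptions: the helper `least_squares` is a hypothetical hand-written ordinary least squares routine, and it is fitted only to the 10 cases listed in the text, so its coefficients need not coincide exactly with the ones reported for the full dataset (2.73 and 0.13), which are the ones used for the prediction.

```python
from math import exp, log

# First 10 cases from the text (both variables in thousands of euros).
advertising = [2.96, 3.43, 1.7, 2.49, 1.91, 2.63, 1.78, 1.82, 2.45, 1.34]
sales = [17.31, 18.17, 16.33, 17.2, 16.63, 17.26, 16.47, 16.5, 17.28, 16]

def least_squares(x, y):
    """Ordinary least squares line y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope

# Change of variables: natural logarithm of both variables (log-log model).
ln_adv = [log(v) for v in advertising]
ln_sales = [log(v) for v in sales]
intercept_10, slope_10 = least_squares(ln_adv, ln_sales)

# Coefficients reported in the text for the full dataset.
INTERCEPT, SLOPE = 2.73, 0.13

def predict_sales(adv, intercept=INTERCEPT, slope=SLOPE):
    """Predict ln(Sales) with the line, then undo the logarithm with exp."""
    return exp(intercept + slope * log(adv))

prediction = predict_sales(2)   # advertising of 2 thousand euros

print(round(intercept_10, 2), round(slope_10, 2))
print(round(prediction, 2))
```

For the semi-log or the reciprocal model the same routine could be reused by changing only which variable is transformed before the fit.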

One numerical and one categorical variable


When we analyze a dataset with one numerical variable and one categorical variable we are trying to find relationships between the two variables. Usually the analysis consists in studying the values of the numerical variable for each category defined by the categorical variable. It is important to remember that the values, groups or categories of the categorical variable can be ordered or not, and this defines two types of categorical variables:

Non-ranked categorical variables: the categories of the categorical variable do not have a natural ordering; we just rank or sort them artificially (by alphabetical order, by number, or by other arbitrary criteria). An example is the variable County of residence. This variable does not have a natural order; we can sort the counties alphabetically or by any other arbitrary criterion.

Ranked categorical variables: ranked categorical variables follow a natural order. For instance, consider Schooling with No schooling / Primary school / High school / College as categories. This variable is ordered, because before attending High School a subject has attended Primary School, and so on. Another example could be Income Level with the following categories: Low Income / Middle Income / High Income. Here the rank is based on a numerical variable, income, which is behind the construction of the categorical variable.

If the categorical variable does not have a natural order, that is, we have a non-ranked categorical variable, we have to analyze the numerical variable for each group or category and see if its distribution changes from one group to another. To see this we use all the numerical and graphical summaries that we know for a numerical variable (numerical: mean, standard deviation, median, quartiles, and so on; graphical: histograms, boxplots, and so on). If the categorical variable has a natural order, that is, it is a ranked categorical variable, we perform the same analysis as before, studying the numerical variable within each group defined by the categorical variable, but now we can talk of association between the variables, as there is a numerical value behind the categorical variable, and we can say for instance that income is positively associated with schooling, despite schooling being categorical.

Example: a non-ranked categorical variable and a numerical variable. Income and county.

We present next the income and county of residence for 20 individuals:

Individual    Income (Euros)    County
1             12000             Barcelonès
2             15000             Barcelonès
3             16000             Baix Llobregat
4             14000             Maresme
5             20000             Vallès Occidental
6             21000             Baix Llobregat
7             30000             Vallès Oriental
8             22000             Vallès Occidental
9             14000             Barcelonès
10            17000             Barcelonès
11            10000             Maresme
12            11000             Baix Llobregat
13            19000             Baix Llobregat
14            13000             Maresme
15            21000             Vallès Occidental
16            25000             Vallès Oriental
17            22000             Vallès Oriental
18            23000             Vallès Oriental
19            16000             Vallès Occidental
20            17000             Maresme

We first present the main numerical summaries by county of residence:


County             Sample size  Mean   StanDev  Coeff of Var  Skewness  Kurtosis  Min    Q1     Median  Q3     Max
All                20           17900  5118,59  0,29          0,51      0,03      10000  14000  17000   21250  30000
Baix Llobregat     4            16750  4349,33  0,26          -0,83     -0,04     11000  14750  17500   19500  21000
Barcelonès         4            14500  2081,67  0,14          0         0,39      12000  13500  14500   15500  17000
Maresme            4            13500  2886,75  0,21          0         0,91      10000  12250  13500   14750  17000
Vallès Occidental  4            19750  2629,96  0,13          -1,44     2,23      16000  19000  20500   21250  22000
Vallès Oriental    4            25000  3559,03  0,14          1,33      1,5       22000  22750  24000   26250  30000
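As an illustration, the group summaries in the table above can be reproduced with a few lines of Python. This is only a sketch: the variable names are ours, and it assumes the sample standard deviation (dividing by n - 1), which is what matches the StanDev column.

```python
from collections import defaultdict
from statistics import mean, median, stdev

# Income (euros) and county of the 20 individuals, in the order of the data table.
incomes = [12000, 15000, 16000, 14000, 20000, 21000, 30000, 22000, 14000, 17000,
           10000, 11000, 19000, 13000, 21000, 25000, 22000, 23000, 16000, 17000]
counties = ["Barcelonès", "Barcelonès", "Baix Llobregat", "Maresme",
            "Vallès Occidental", "Baix Llobregat", "Vallès Oriental",
            "Vallès Occidental", "Barcelonès", "Barcelonès", "Maresme",
            "Baix Llobregat", "Baix Llobregat", "Maresme", "Vallès Occidental",
            "Vallès Oriental", "Vallès Oriental", "Vallès Oriental",
            "Vallès Occidental", "Maresme"]

# Group the numerical variable by the categories of the categorical variable.
groups = defaultdict(list)
for income, county in zip(incomes, counties):
    groups[county].append(income)

# Numerical summaries of income within each county.
summary = {
    county: {
        "n": len(values),
        "mean": mean(values),
        "sd": stdev(values),      # sample standard deviation (n - 1)
        "median": median(values),
    }
    for county, values in groups.items()
}

for county in sorted(summary):
    s = summary[county]
    print(f"{county:18s} n={s['n']} mean={s['mean']:.0f} "
          f"sd={s['sd']:.2f} median={s['median']:.0f}")
```

If the distribution of income were the same in every county, these per-group summaries would all be close to each other; here they clearly are not.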

We can see that the distribution of income varies from one county to another. For instance, in the Barcelonès both the average income and the spread are smaller than in the Vallès Occidental. We can therefore say that income is related to county of residence. We can also present graphical summaries. The two most common graphical summaries are histograms for each category or group of the categorical variable, and boxplots. These are the histograms:

[Figure: histograms of income by county, one panel per county (Baix Llobregat, Barcelonès, Maresme, Vallès Occidental, Vallès Oriental). Horizontal axis: Income (Euros), from 11000 to 29000; vertical axis: Frequency.]

The histograms show that the distributions for the Vallès Occidental and the Vallès Oriental lie further to the right than those of the other counties, which means that income levels are in general higher in these two counties.


Another graphical representation which is very useful is the boxplot, which is based on resistant measures. Here we have the boxplots of income for each county:

[Figure: side-by-side boxplots of income for Baix Llobregat, Barcelonès, Maresme, Vallès Occidental and Vallès Oriental. Vertical axis: Income (Euros), from 5000 to 35000.]

The boxplots show that there are important differences in income between counties, and therefore the variables are related. If there were no relation, the distribution of the numerical variable would be the same in every group of the categorical variable.

Example: analysis of a ranked categorical variable. Income and schooling.

We present next data on income and schooling for 20 individuals:

Individual  Income (Euros)  Schooling
1           12000           2. Primary
2           15000           1. No degree
3           16000           5. Master
4           14000           3. High School
5           20000           4. Bachelor
6           21000           5. Master
7           30000           5. Master
8           22000           4. Bachelor
9           14000           2. Primary
10          17000           2. Primary
11          10000           1. No degree
12          11000           1. No degree
13          19000           3. High School
14          13000           2. Primary
15          21000           4. Bachelor
16          25000           4. Bachelor
17          22000           3. High School
18          23000           3. High School
19          12000           1. No degree
20          32000           5. Master

The main numerical summaries by groups can be seen in the following table:

Schooling       Sample size  Mean   StanDev  Coeff of Var  Skewness  Kurtosis  Min    Q1     Median  Q3     Max
All             20           18450  6159,93  0,33          0,65      -0,14     10000  13750  18000   22000  32000
1. No degree    4            12000  2160,25  0,18          1,19      1,5       10000  10750  11500   12750  15000
2. Primary      4            14000  2160,25  0,15          1,19      1,5       12000  12750  13500   14750  17000
3. High School  4            19500  4041,45  0,21          -1,09     0,3       14000  17750  20500   22250  23000
4. Bachelor     4            22000  2160,25  0,1           1,19      1,5       20000  20750  21500   22750  25000
5. Master       4            24750  7544,31  0,3           -0,31     -3,64     16000  19750  25500   30500  32000
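Because the categories are ordered, we can check directly whether a summary moves in one direction as schooling rises. The following Python sketch (our own illustration; the names `income_by_level`, `means_increase` and `sds_increase` are hypothetical) computes the group means and sample standard deviations in category order and tests whether each sequence is increasing.

```python
from statistics import mean, stdev

# Schooling categories in their natural order, lowest to highest.
levels = ["1. No degree", "2. Primary", "3. High School", "4. Bachelor", "5. Master"]

# Incomes (euros) of the 20 individuals, grouped by schooling level.
income_by_level = {
    "1. No degree": [15000, 10000, 11000, 12000],
    "2. Primary": [12000, 14000, 17000, 13000],
    "3. High School": [14000, 19000, 22000, 23000],
    "4. Bachelor": [20000, 22000, 21000, 25000],
    "5. Master": [16000, 21000, 30000, 32000],
}

means = [mean(income_by_level[level]) for level in levels]
sds = [stdev(income_by_level[level]) for level in levels]  # sample SD (n - 1)

# A positive association shows up as a summary that rises with the category rank.
means_increase = all(a < b for a, b in zip(means, means[1:]))
sds_increase = all(a <= b for a, b in zip(sds, sds[1:]))

for level, m, s in zip(levels, means, sds):
    print(f"{level:15s} mean={m:.0f} sd={s:.2f}")
print("means strictly increasing:", means_increase)
print("standard deviations non-decreasing:", sds_increase)
```

With these data the means rise at every step, while the standard deviations do not rise monotonically (the Bachelor group breaks the pattern), which is why the spread is better described as tending to increase.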

It now makes sense to order Schooling from the lowest category to the highest. We can see that average income increases with schooling, so here we can describe a positive association between schooling and income: the higher the schooling, the higher the income. This would not have made sense when we were talking about income and county of residence. Another feature we observe is that the spread of income also tends to increase with schooling. This means that, even though our data say that income is higher the higher the schooling, the variability of income also increases with schooling. Those with a Master degree, for instance, earn more on average than those with less schooling, but within this group there are people who earn a lot and people who earn little, despite having more schooling. As always, we can use graphical tools to perform the same description. We present next histograms of income for the different schooling levels:
[Figure: histograms of income by schooling level, one panel per level (1. No degree, 2. Primary, 3. High School, 4. Bachelor, 5. Master). Horizontal axis: Income (Euros), from 11100 to 30900; vertical axis: Frequency.]

We observe the same pattern that we noticed in the table of numerical summaries: there is a positive association between schooling level and income, and the spread of income increases with schooling. Histograms are not the only graphical tool we can use for this description. We can also plot the mean, the standard deviation or other summaries for each group, so that we can compare them:
[Figure: mean income (Euros) by schooling level, from 1. No degree to 5. Master. Vertical axis from 0 to 30000.]

We can see clearly that average income increases with schooling. We can also study plots of other numerical summaries, such as the standard deviation:

[Figure: standard deviation of income (Euros) by schooling level, from 1. No degree to 5. Master. Vertical axis from 0 to 8000.]

This diagram shows that the spread of income also tends to increase with schooling, except at the Bachelor level, and that the largest spread is found at the Master level. This means that although income increases with schooling, there are also people who obtain a Master degree and do not manage to improve their income, so in this group there are people earning a lot and people earning very little. Finally, side-by-side boxplots are also useful in this case:

[Figure: side-by-side boxplots of income for the schooling levels 1. No degree, 2. Primary, 3. High School, 4. Bachelor and 5. Master. Vertical axis: Income (Euros), from 5000 to 35000.]

The median income clearly increases with schooling. Furthermore, the boxes get longer for higher levels of schooling, except for the Bachelor degree, which indicates that income variability increases as well.

Datasets with two variables (II)


Two categorical variables
Moore, 173-203


Time series
Introduction
A time series is a dataset that records a variable in chronological order. The series can be annual, quarterly, monthly, daily, or even recorded by the hour or minute, as with stock market transactions, depending on the frequency of data collection. It is difficult to think of any scientific discipline in which no time series are recorded.

A time series gives us information on how a variable changes over time (for instance the unemployment rate, the poverty rate, the growth rate, and so on). Economic or managerial decisions very often have to be based on the behavior of different variables in previous periods, and require predicting what is going to happen in coming periods.

A time series is represented graphically by putting time as the independent variable (x-axis) and the values of the series as the dependent variable (y-axis).

There are many different techniques for working with time series. Predictions about the future can be based on the simple perceptions of experts, but also on very sophisticated analyses based on large amounts of data and on the interrelations between variables. In any case, everything has to be based on the past behavior of the variable. If we observe that a variable behaved in a more or less systematic way in the past, it is logical to think that this type of behavior will continue in the future. This idea is the basis of statistical prediction.

Predictions are made all the time with economic time series. The business and economics section of any newspaper gives good examples. For instance, the government has to take many decisions based on predictions of next year's GNP (Gross National Product).

Composition
The variation in the past values of a time series is due to a variety of factors. Some of them are economic (such as an economic crisis in some region of the world and its impact on stock markets worldwide), others are natural (such as natural disasters or bad weather affecting agriculture) and others are institutional (such as the adoption of the Euro currency). Some of these factors affect the series in the long term and others in the medium or short term. Let us see some examples:

Short term: less than one year.

1. Periodic factors (things that repeat every year, with a period shorter than a year). For instance, electricity consumption has a short-term seasonal component, since every summer consumption is lower than in winter. The gap used to be much larger, but it has narrowed since many homes installed air conditioning. Economic activity, and therefore GNP, also behaves differently in August, since it is a holiday period for many firms, at least in Spain. If we observe that in the past production has fallen 10% in August, we can predict that this will happen again in the future.
2. Unique factors: a natural disaster, the financial crisis of a country, the bankruptcy of a large firm, and so on. These factors affect the short term, but we cannot predict them, since they are unique and there is no expectation that they will repeat periodically.

Medium term (approximately a period longer than one year and shorter than five years). These types of factors are usually associated with the business cycle. For instance, a recession usually lasts between 2 and 5 years. They are also irregular with respect to when they happen and how long they last, but sometimes they can be predicted.

Long term (approximately more than five years). These refer to structural changes for instance in population growth or in the economy that have a long-term impact. For instance in the long-term the European societies are experiencing an increasing aging of their population.

More analytically, we consider that a series is formed by four basic components (although not all of them need to be present in every series), which we describe next:

Trend (T): the behavior of the series in the long term.

Cycle (C): the time series can show cyclical behavior extending over more than one year. For instance, car sales are larger when the economy is booming than in the middle of a recession. These cycles have an undetermined duration, but they usually last more than one year. There is no regularity in these fluctuations: sometimes we have 2 years of recession and 1 year of boom, other times 3 years of recession and 2 years of boom, and so on. This distinguishes cycles from seasonal fluctuations, described next, which always have the same duration and periodicity (there is always an increase in sales at Christmas, for instance).

Seasonality (E): a periodic behavior that repeats one or more times within the same year. It appears when the series is influenced by seasonal factors. One example is the sale of toys, which always increases at Christmas and, a little less, in summer. Another example is the unemployment rate in Spain, which always decreases in summer because of seasonal (summer) jobs, whether the economy is in a good or a bad moment.

Irregular (I): this behavior is completely random and impossible to predict. An example is an unexpected shock such as the events of September 11, 2001 in New York.


If the series can have these four components, it would be helpful to be able to separate them. This way, when a figure is announced, such as that the unemployment rate has fallen by 0.7 points, we could distinguish which part of this decrease is due to a purely seasonal factor (maybe it is spring and fixed-term jobs are being created in the service sector) and which part is due to the business cycle or to a long-term trend. If we assume that the four components of a series are related in a particular way, we can separate them with the help of a set of techniques that we present next. It is generally assumed that the components combine in one of two main ways, multiplicative and additive. Mixed forms are also possible, but they are not used very often. The additive model assumes that the value of a time series, Y, is the sum of its four components. That is:

$Y = T + C + E + I$

The multiplicative model assumes that the value of a time series, Y, is the product of its four components, that is:

$Y = T \cdot C \cdot E \cdot I$
Basically, the additive model assumes that the four components (the four causes of variation of the series) are independent. This means that whether the trend of the series implies growth or decline does not affect any of the other components. (For instance, employment has increased steadily since 1900, but the short-term seasonal fluctuations are not affected by this long-term trend.)

In the multiplicative model, the four components can be related. In this type of series, seasonal fluctuations can for instance be larger during booms than during recessions. A mixed model:

$Y = T \cdot C + E \cdot I$

The following table shows a simple experiment that illustrates what an additive and a multiplicative model look like. In the first series, which we denote Y1, the four components are combined additively. In the second, which we denote Y2, the four components are combined multiplicatively.

Trend  Seasonal  Cyclic  Irregular  Y1     Y2
1      1         1       -0.30      2.70   -0.30
2      0         2       0.42       4.42   0.00
3      -1        3       -0.21      4.79   1.93
4      1         2       -0.35      6.65   -2.79
5      0         1       -0.17      5.83   0.00
6      -1        0       0.29       5.29   0.00
7      1         -1      0.49       7.49   -3.46
8      0         0       0.26       8.26   0.00
9      -1        1       0.45       9.45   -4.04
10     1         2       0.33       13.33  6.54
11     0         3       -0.22      13.78  0.00
12     -1        4       0.07       15.07  -3.32
13     1         3       -0.37      16.63  -14.62
14     0         2       -0.17      15.83  0.00
15     -1        1       -0.37      14.63  5.50
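The construction of the two series can be sketched in Python as follows. This is only an illustration with our own variable names. Note that the irregular values are copied from the table, where they are rounded to two decimals, so the multiplicative series computed here agrees with the printed Y2 column only approximately.

```python
# Components of the experiment, one value per period (15 periods).
trend = list(range(1, 16))
seasonal = [1, 0, -1] * 5
cyclic = [1, 2, 3, 2, 1, 0, -1, 0, 1, 2, 3, 4, 3, 2, 1]
irregular = [-0.30, 0.42, -0.21, -0.35, -0.17, 0.29, 0.49, 0.26,
             0.45, 0.33, -0.22, 0.07, -0.37, -0.17, -0.37]

components = list(zip(trend, cyclic, seasonal, irregular))

# Additive model: Y = T + C + E + I
y1 = [t + c + e + i for t, c, e, i in components]

# Multiplicative model: Y = T * C * E * I
y2 = [t * c * e * i for t, c, e, i in components]

for period, (a, m) in enumerate(zip(y1, y2), start=1):
    print(f"{period:2d}  Y1={a:6.2f}  Y2={m:7.2f}")
```

In the multiplicative series, any period in which one component is zero gives Y2 = 0, whatever the other components are. This is one reason multiplicative components are usually expressed as factors around 1 rather than as deviations around 0.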

[Figure: the four components (Trend, Seasonal, Cyclic, Irregular) plotted against periods 1 to 15.]

These four components, which are the same for both series, give rise to very different behavior depending on whether they are combined additively or multiplicatively. This can be seen in the next chart:

[Figure: series Y1 and Y2 plotted against periods 1 to 15. Vertical axis from -20 to 20.]

To construct this example we started from the 4 components and formed the two series. But what if we are only given the values of the time series: can we identify and separate the 4 components? The answer is yes. We will show next some simple ways of doing it (there are more complex and more accurate methods, but they are beyond the scope of an introductory course).

Analysis of the trend and the cycle: the long term


In order to isolate the trend component of a series, we can try to fit a line or another function to the time series. Several mathematical functions are useful for describing the trend of a series.

Fitting mathematical functions

We can assume that the trend component, T, of a series follows a mathematical model such as:

Line

$T_t = a + b t$, where $t$ indexes time.

We will only work with the linear model, that is the line, but the model can be any other function of time, for instance:

Polynomial

$T_t = a + b t + c t^2$ (second order polynomial)

$T_t = a + b t + c t^2 + d t^3$ (third order polynomial)

Exponential

$T_t = a b^t$

Reciprocal

$T_t = \frac{1}{a + b t}$

Power

$T_t = a t^b$

Logarithmic

$\log T_t = a + b t$
The values of the parameters depend on the scale we use for t. For instance, with annual data we can use 1989, 1990, 1991, and so on, or 1, 2, 3, ...; we just have to make sure it is a consecutive list of evenly spaced numbers.

Moving averages

A second possible method consists of removing the short-term movements of the series. This is done by taking means of consecutive values, which smooths the series. This method is known as moving averages. With this method the trend or the cycle of a series is computed as the mean of a set of consecutive cases. For instance, if we think that approximately every 5 years the economy starts a new cycle, we could take averages of 5 consecutive years. We take a group of 5 years and then move it forward one year at a time, dropping the oldest year and adding a new one, so that the mean always contains 5 years. We can also do this in groups of 3 if we think that is the relevant period over which the cycles repeat themselves. The number of cases included in each mean is called the order of the moving average; in this example the order is 5. Here we show moving averages of order 3 and order 5 for a series of twenty years. Notice that with order 3 the moving average cannot be computed for the first and the last year, and with order 5 it cannot be computed for the first two and the last two years:

Year  Original series  Moving average order 3  Moving average order 5
1     1                -                       -
2     2                2,0=(1+2+3)/3           -
3     3                2,3=(2+3+2)/3           1,8=(1+2+3+2+1)/5
4     2                2,0=(3+2+1)/3           2,0=(2+3+2+1+2)/5
5     1                1,7                     2,2
6     2                2,0                     2,4
7     3                3,0                     2,6
8     4                3,3                     2,8
9     3                3,0                     3,0
10    2                2,7                     3,2
11    3                3,0                     3,4
12    4                4,0                     3,6
13    5                4,3                     3,8
14    4                4,0                     4,0
15    3                3,7                     4,2
16    4                4,0                     4,4
17    5                5,0                     4,6
18    6                5,3=(5+6+5)/3           4,8=(4+5+6+5+4)/5
19    5                5,0=(6+5+4)/3           -
20    4                -                       -
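The moving averages in the table can be reproduced with a short Python function. This is a minimal sketch for odd orders only; the function name `moving_average` and the use of `None` for periods where the window does not fit are our own choices.

```python
def moving_average(series, order):
    """Moving average of odd order, aligned with the central period.

    Returns a list as long as the series, with None where the window
    of `order` consecutive values does not fit.
    """
    if order % 2 == 0:
        raise ValueError("this sketch only handles odd orders")
    half = order // 2
    result = [None] * len(series)
    for i in range(half, len(series) - half):
        window = series[i - half:i + half + 1]
        result[i] = sum(window) / order
    return result


series = [1, 2, 3, 2, 1, 2, 3, 4, 3, 2, 3, 4, 5, 4, 3, 4, 5, 6, 5, 4]
ma3 = moving_average(series, 3)
ma5 = moving_average(series, 5)

for year, (y, a3, a5) in enumerate(zip(series, ma3, ma5), start=1):
    col3 = "-" if a3 is None else f"{a3:.1f}"
    col5 = "-" if a5 is None else f"{a5:.1f}"
    print(f"{year:2d}  {y}  {col3:>4s}  {col5:>4s}")
```

The higher the order, the smoother the result, but also the more periods are lost at each end of the series.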

Graphical representation:
[Figure: original series and moving averages of order 3 and order 5, plotted against Year.]

As can be seen in the graph, the moving average of order 5 seems to be a good way of representing the trend (here it coincides with a line, but this need not be so). If the order of the moving average is odd (3, 5, 7, and so on), each moving average corresponds exactly to a period of the series: if we average periods 1, 2 and 3, the moving average corresponds exactly to the value of the trend in period 2. If the order is even (2, 4, 6, and so on) we find a problem, because the mean does not correspond to any particular period but to a moment between two periods. For instance, if the order of the moving average is 4 and we average periods 1, 2, 3 and 4, the mean corresponds to a moment halfway between periods 2 and 3, so we do not have the value of the trend at period 2 or at period 3. This problem is known as the problem of centering the moving average. We find it, for instance, with monthly data (where moving averages of order 12 are sometimes computed) or quarterly data (4 quarters within a year, moving average of order 4). The way to solve this problem is to compute centered moving averages. Suppose we want to compute the trend of employment with quarterly data, and therefore we would like to use an order equal to 4. Suppose that the first case corresponds to the first quarter of 2005. We compute the first moving average with a group that goes from the first quarter of 2005 to the fourth quarter of 2005. To which quarter is this value associated? The average is associated with a moment between the second and the third quarter. To solve this we provisionally assign this moving average to the second quarter and continue computing the rest of the moving averages. In a second step we compute moving averages of order 2 of the order-4 moving averages computed in the first step, as shown in the following example:
Period    Series  Moving averages order 4  Centered moving averages order 4
2005, Q1  18492   -                        -
2005, Q2  18894   18972,75                 -
2005, Q3  19191   19199,75                 19086,25
2005, Q4  19314   19399,5                  19299,63
2006, Q1  19400   19575,5                  19487,5
2006, Q2  19693   19747,25                 19661,38
2006, Q3  19895   19914,5                  19830,88
2006, Q4  20001   20083                    19998,75
2007, Q1  20069   20236,75                 20159,88
2007, Q2  20367   20355,5                  20296,13
2007, Q3  20510   20438,75                 20397,13
2007, Q4  20476   20453,25                 20446
2008, Q1  20402   20412,25                 20432,75
2008, Q2  20425   20257,25                 20334,75
2008, Q3  20346   19929,25                 20093,25
2008, Q4  19856   19559,25                 19744,25
2009, Q1  19090   -                        -
2009, Q2  18945   -                        -

As seen in the example we first take the average of the first four quarters and continue computing the moving averages of order 4, and then we take another round of moving averages of order 2. This second step is always done with moving averages of order 2, no matter what the order of the first step (but always for orders which are even, for odd orders it is not needed since the moving average is already centered). The trend of this series is seen in the following way:

[Figure: the quarterly time series and its moving averages, 2005 Q1 to 2009 Q2. Horizontal axis: Time; vertical axis: Series, from 17995 to 20995.]
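The two-step centering procedure described above can be sketched in Python. The function name `centered_moving_average` is ours, and the code assumes an even order; each result is aligned with the period it is centered on, with `None` where no value can be computed.

```python
def centered_moving_average(series, order):
    """Centered moving average for an even order (two-step method)."""
    if order % 2 != 0:
        raise ValueError("centering is only needed for even orders")
    n = len(series)
    # Step 1: plain moving averages; first[j] averages series[j : j + order]
    # and falls halfway between two periods.
    first = [sum(series[j:j + order]) / order for j in range(n - order + 1)]
    # Step 2: moving average of order 2 of consecutive step-1 values,
    # which falls exactly on period j + order // 2.
    result = [None] * n
    for j in range(len(first) - 1):
        result[j + order // 2] = (first[j] + first[j + 1]) / 2
    return result


# Quarterly series from 2005 Q1 to 2009 Q2.
series = [18492, 18894, 19191, 19314, 19400, 19693, 19895, 20001, 20069,
          20367, 20510, 20476, 20402, 20425, 20346, 19856, 19090, 18945]
cma = centered_moving_average(series, 4)

for y, c in zip(series, cma):
    print(y, "-" if c is None else f"{c:.2f}")
```

With order 4 the first centered value falls on the third period (2005 Q3) and the last on 2008 Q4, so two periods are lost at each end.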

Short term fluctuations


Seasonal variations

Seasonal variations follow a recurrent pattern over time. Weather and social habits are important sources of seasonal variation. We can find seasonal components within days, weeks, months or quarters, depending on the frequency of the data. They are called seasonal because they repeat in a predictable manner, like the seasons of the year. Seasonal variations have to be predicted in order to manage a firm, or the economy in general, correctly, since in some periods a firm may run out of capacity and in others it may have considerable overcapacity. A simple way of computing a seasonal index is to use a technique based on moving averages to compute the trend. The technique can be applied to additive or multiplicative models. Here we explain it with the additive model. For the multiplicative model we simply replace subtractions by divisions, sums by multiplications, and the arithmetic mean (sum all the terms and divide by N) by the geometric mean (multiply all the terms and take the N-th root). We start with a series Y formed by the four usual components according to the model

$Y = T + C + E + I$

We first try to isolate the trend and cycle using a moving average of the appropriate order. For instance, if the series is quarterly, we compute a centered moving average of order 4. The resulting series of centered moving averages of order 4 is theoretically T+C, since we do not isolate the cyclical component and it therefore ends up mixed with the trend. We could also compute the trend by fitting a line, and the results would be similar. Next we subtract the trend we have computed from the original series and obtain:

$(T + C + E + I) - (T + C) = E + I$

We obtain a new series which contains the seasonal component mixed with the irregular component. We now need to eliminate the irregular component if we want to isolate the seasonal component and construct a seasonal index. The nature of the irregular component is random: it does not follow any regular behavior and cannot be predicted. If we could predict it, it would be part of one of the regular components. It is therefore reasonable to assume that the mean of the irregular components for a given period (for instance, for all first quarters) is 0 in the additive model (1 in the multiplicative model). So, if we assume that the seasonal component of each quarter is identical year after year, we can take the mean of the values of the E+I series corresponding to a given quarter and obtain component E. For instance, if our series starts in 2005 and finishes in the second quarter of 2009, to compute the seasonal component of the first quarter we would compute:

$E_{Q1} = \frac{E_{2005Q1} + I_{2005Q1} + E_{2006Q1} + I_{2006Q1} + E_{2007Q1} + I_{2007Q1} + E_{2008Q1} + I_{2008Q1} + E_{2009Q1} + I_{2009Q1}}{5}$

Since we have assumed that all seasonal components of the same quarter are the same, that is:

$E_{2005Q1} = E_{2006Q1} = E_{2007Q1} = E_{2008Q1} = E_{2009Q1} = E_{Q1}$

and that the average of the irregular component is equal to 0, we obtain

$\frac{5 E_{Q1} + (I_{2005Q1} + I_{2006Q1} + I_{2007Q1} + I_{2008Q1} + I_{2009Q1})}{5} = E_{Q1} + 0 = E_{Q1}$

We do the same for each quarter and obtain the 4 numbers representing the seasonal component. (In the multiplicative case it is common to express the seasonal component as an index, so we multiply this number by 100.) Notice that if the value of the seasonal component is 0 for a quarter, that quarter in fact has no seasonal effect. The values of the seasonal component vary around 0: when the component is larger than 0 the series lies above the trend, and when it is smaller than 0 the series lies below the trend (in the multiplicative case, the seasonal indices vary around 100).

Period    Series
2005, Q1  18492
2005, Q2  18894
2005, Q3  19191
2005, Q4  19314
2006, Q1  19400
2006, Q2  19693
2006, Q3  19895
2006, Q4  20001
2007, Q1  20069
2007, Q2  20367
2007, Q3  20510
2007, Q4  20476
2008, Q1  20402
2008, Q2  20425
2008, Q3  20346
2008, Q4  19856
2009, Q1  19090
2009, Q2  18945

ODStatistics computes the seasonal components for us:
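As a cross-check on the procedure described above, the additive seasonal components of this quarterly series can also be computed by hand. The following Python sketch is an illustration, not the ODStatistics implementation. The variable names are ours, and quarters for which no centered moving average exists (the first two and the last two) are simply left out of the averages, which is an assumption on our part.

```python
# Quarterly series from 2005 Q1 to 2009 Q2.
series = [18492, 18894, 19191, 19314, 19400, 19693, 19895, 20001, 20069,
          20367, 20510, 20476, 20402, 20425, 20346, 19856, 19090, 18945]
order = 4
n = len(series)

# Trend plus cycle (T + C): centered moving average of order 4.
first = [sum(series[j:j + order]) / order for j in range(n - order + 1)]
trend = [None] * n
for j in range(len(first) - 1):
    trend[j + order // 2] = (first[j] + first[j + 1]) / 2

# Seasonal plus irregular (E + I): series minus trend, where the trend exists.
detrended = [None if t is None else y - t for y, t in zip(series, trend)]

# Seasonal component of each quarter: mean of E + I over all years.
# Index 0 is a first quarter, so index % 4 identifies the quarter.
seasonal = {}
for quarter in range(4):
    values = [d for k, d in enumerate(detrended)
              if d is not None and k % 4 == quarter]
    seasonal[f"Q{quarter + 1}"] = sum(values) / len(values)

for name, value in seasonal.items():
    print(name, round(value, 2))
```

A positive value means that the series tends to lie above its trend in that quarter, a negative value that it tends to lie below.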

Prediction with time series


We want to predict GNP for 2009 in Catalunya, and we have a series of Catalan GNP by four-month periods from 2004 to 2008 (for instance, 2004-3 means the third four-month period of 2004). To make a prediction for 2009 we can use the regression line method. To obtain a more reliable prediction, we can afterwards correct it with the seasonal component.

Period  GNP
2004-1  1018
2004-2  1037
2004-3  1050
2005-1  1093
2005-2  1102
2005-3  1113
2006-1  1146
2006-2  1160
2006-3  1172
2007-1  1208
2007-2  1219
2007-3  1227
2008-1  1266
2008-2  1278
2008-3  1280

The trend
We can use the regression line. The explanatory variable is time, which we index by 1, 2, 3, ..., 15. The dependent variable is our series, GNP (Y). We want to predict the values for the three four-month periods of 2009 (2009-1, 2009-2 and 2009-3), which correspond to t = 16, 17 and 18. The regression line is

Y = 19.5 t + 1002 (or, if you prefer, GNP = 19.5 Time + 1002)

The prediction of the trend is

19.5 x 16 + 1002 = 1314
19.5 x 17 + 1002 = 1333.5, rounded to 1334
19.5 x 18 + 1002 = 1353

Period  Time  Prediction
2009-1  16    1314
2009-2  17    1334
2009-3  18    1353
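The regression line above can be checked with the usual least-squares formulas. The Python sketch below uses only the standard library; the variable names are ours, and it assumes the time index runs from 1 to 15 as in the text.

```python
# GNP by four-month period, 2004-1 to 2008-3.
gnp = [1018, 1037, 1050, 1093, 1102, 1113, 1146, 1160, 1172,
       1208, 1219, 1227, 1266, 1278, 1280]
time = list(range(1, len(gnp) + 1))  # t = 1, 2, ..., 15

t_mean = sum(time) / len(time)
y_mean = sum(gnp) / len(gnp)

# Least-squares slope and intercept of the line Y = intercept + slope * t.
slope = (sum((t - t_mean) * (y - y_mean) for t, y in zip(time, gnp))
         / sum((t - t_mean) ** 2 for t in time))
intercept = y_mean - slope * t_mean

# Trend prediction for the three four-month periods of 2009.
forecast = {period: intercept + slope * t
            for period, t in [("2009-1", 16), ("2009-2", 17), ("2009-3", 18)]}

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
for period, value in forecast.items():
    print(period, round(value))
```

The unrounded coefficients differ slightly from 19.5 and 1002, but the predictions round to the same values as in the table.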

But we need to also predict the seasonal factor. For this we use our method to compute the seasonal components.

The seasonal component


We use an additive model. We now compute the trend by moving averages of order 3 in order to subtract it from the series (we could also do this with the linear trend computed in the previous section). Here are the moving averages of order 3:

Period  Series  Moving averages order 3  Seasonal and irregular components
2004-1  1018    -                        -
2004-2  1037    1035                     2
2004-3  1050    1060                     -10
2005-1  1093    1081,67                  11,33
2005-2  1102    1102,67                  -0,67
2005-3  1113    1120,33                  -7,33
2006-1  1146    1139,67                  6,33
2006-2  1160    1159,33                  0,67
2006-3  1172    1180                     -8
2007-1  1208    1199,67                  8,33
2007-2  1219    1218                     1
2007-3  1227    1237,33                  -10,33
2008-1  1266    1257                     9
2008-2  1278    1274,67                  3,33
2008-3  1280    -                        -

The last column is the series minus its moving average, that is, the seasonal and irregular components together.

To eliminate the irregular factor we take the mean of the values corresponding to each 4-month period separately. For instance, for the second 4-month period:

$E_2 = \frac{\text{sum of values of second 4-month period}}{\text{number of 4-month periods}} = \frac{2 - 0.67 + 0.67 + 1 + 3.33}{5} = 1.27$

With the same procedure we obtain the seasonal components for the first (8.75) and the third (-8.92) 4-month periods. Our prediction is finally adjusted with these seasonal components:

Period  Prediction
2009-1  1314+8.75=1322.75
2009-2  1334+1.27=1335.27
2009-3  1353-8.92=1344.08
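The whole seasonal adjustment can be sketched in Python as follows. This is an illustration with our own variable names; the trend predictions 1314, 1334 and 1353 are taken from the regression line in the text rather than recomputed.

```python
# GNP by four-month period, 2004-1 to 2008-3 (three periods per year).
gnp = [1018, 1037, 1050, 1093, 1102, 1113, 1146, 1160, 1172,
       1208, 1219, 1227, 1266, 1278, 1280]
n = len(gnp)

# Moving averages of order 3, aligned with the central period.
ma3 = [None] + [sum(gnp[i - 1:i + 2]) / 3 for i in range(1, n - 1)] + [None]

# Seasonal component of each four-month period: mean of (series - moving
# average) over the years. Index 0 is a first period, so i % 3 gives the period.
seasonal = []
for k in range(3):
    values = [gnp[i] - ma3[i] for i in range(n)
              if ma3[i] is not None and i % 3 == k]
    seasonal.append(sum(values) / len(values))

# Trend predictions for 2009 from the regression line, then adjusted.
trend_forecast = [1314, 1334, 1353]
adjusted = [t + s for t, s in zip(trend_forecast, seasonal)]

for label, s, a in zip(["2009-1", "2009-2", "2009-3"], seasonal, adjusted):
    print(f"{label}: seasonal = {s:.2f}, adjusted prediction = {a:.2f}")
```

Note that the first and third periods are averaged over four values each and the second over five, because the moving average is missing at both ends of the series.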


Measures of inequality and concentration


Very often we are interested in how the values of a variable are distributed among the individuals or objects that make up the cases. We may want, for instance, to analyze the income distribution within the Spanish population, the wage distribution within a firm, or the distribution of dividends among the shareholders of a corporation. In all these problems we have to compare the values across individuals, for instance: is there large wage inequality within the firm? In this section we present measures of inequality and concentration, which summarize the degree of inequality within a distribution. Notice that inequality and concentration are related concepts: the more inequality between individuals, the more the values of the distribution are concentrated in a small number of individuals.

Inequality measures
Imagine an inheritance that is distributed among 3 families in the following way:

Family  Family inheritance (million euros)  Members in the family
A       4                                   2
B       7                                   7
C       99                                  1

We do not need sophisticated computations to see that this distribution shows a lot of inequality. We will not enter into ethical issues here; we just want to measure inequality. To analyze inequality it is useful to set up a couple of ideas first:

1. The most egalitarian situation would be that each person receives exactly the same inheritance. Since the total quantity to distribute is 110 million euros and the total number of people is 10, this would require each person to receive 11 million euros.
2. The largest inequality would arise if we gave all the money to just one person, for instance the single member of family C receiving 110 and the other 9 individuals receiving nothing.

To construct a measure of inequality applicable to these two extreme situations and also to the actual distribution of the inheritance, we introduce some notation. We call X the amount of money received as inheritance by each person; this is our variable. It takes the values $x_i$, $i = 1, 2, \ldots, k$, and is distributed among the individuals so that $x_1$ is observed for $n_1$ individuals, $x_2$ for $n_2$, and so on. In general, the values of the variable, sorted in increasing order, can be represented by pairs $(x_i, n_i)$.

x_i  n_i  x_i n_i
1    7    7
2    2    4
99   1    99

We define the total mass of variable X as

$A_k = \sum_{i=1}^{k} x_i n_i$

In our example:

x_i    n_i  x_i n_i
1      7    7
2      2    4
99     1    99
A_k =       110

The total mass is just the total inheritance, the sum of all the values of our variable. In our case X is the quantity received by each inheritor (for instance, family B receives 7 million and, since it has 7 members, the personal inheritance is in theory 1 million, therefore $x_1 = 1$) and $A_k$ is the total inheritance to distribute (since we have 3 families, we write $A_3$). Remembering that we have sorted the k possible values of X in increasing order, we can define

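$A_i = \sum_{j=1}^{i} x_j n_j$

for any $i < k$ as the cumulative amount received by the first $N_i$ individuals, where $N_i = \sum_{j=1}^{i} n_j$ is the cumulative number of individuals.

x_i    n_i  x_i n_i  N_i  A_i
1      7    7        7    7
2      2    4        9    11
99     1    99       10   110
A_k =       110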

The proportion of these $N_i$ individuals over the total population N (in our case the total number of inheritors, 10) would be

$$p_i = \frac{N_i}{N}$$

and the proportion of the cumulative inheritance over the total would be

$$q_i = \frac{A_i}{A_k}$$

x_i   n_i   x_i n_i   N_i   A_i   p_i   q_i
                                  0     0
1     7     7         7     7     0.7   0.06
2     2     4         9     11    0.9   0.1
99    1     99        10    110   1     1
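The cumulative columns of the table can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `lorenz_points` is an illustrative choice, not part of the course material, and the input is assumed to be the per-person values $x_i$ with their frequencies $n_i$.

```python
from itertools import accumulate


def lorenz_points(values, counts):
    """Return the cumulative proportions (p_i, q_i), with values sorted increasingly."""
    pairs = sorted(zip(values, counts))                  # sort by x_i
    n_cum = list(accumulate(n for _, n in pairs))        # N_i
    a_cum = list(accumulate(x * n for x, n in pairs))    # A_i
    total_n, total_a = n_cum[-1], a_cum[-1]              # N and A_k
    p = [n_i / total_n for n_i in n_cum]
    q = [a_i / total_a for a_i in a_cum]
    return p, q


# Inheritance example: x_i = 1, 2, 99 with n_i = 7, 2, 1
p, q = lorenz_points([1, 2, 99], [7, 2, 1])
print([round(v, 2) for v in p])  # [0.7, 0.9, 1.0]
print([round(v, 2) for v in q])  # [0.06, 0.1, 1.0]
```

The initial row (0, 0) of the table is not returned by the sketch; it is the starting point that is prepended when the points are plotted.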

Whenever there is inequality in the distribution of the inheritance, we will have that $q_i < p_i$, that is, as we accumulate individuals (starting from the ones that inherited less; remember that we sorted them in increasing order of X) we will accumulate proportionally less money from the inheritance. The difference between the two proportions will give us a measure of inequality between the families with respect to the received inheritance. Only in the case of a perfectly egalitarian distribution would we have $p_i = q_i$. We can represent this situation with a graph. On the vertical axis we put the inheritance proportion $q_i$ and on the horizontal axis we put the inheritors proportion $p_i$. The situation of equidistribution, or perfectly egalitarian distribution, is represented by the diagonal:

But we only rarely find perfectly egalitarian situations. We will usually find that a proportion $p_i$ of the individuals gets a smaller proportion $q_i$ of the wealth, and consequently if we join the dots we will obtain a curve that lies under the diagonal (since $q_i < p_i$):

This is the so-called Lorenz curve. On the horizontal axis we have the values $p_i$, that is, the cumulative proportions of the population (cumulative relative frequencies in the graph), and on the vertical axis we have the values $q_i$, the cumulative proportions of the values of the variable. If all the inheritance were concentrated in just one person, the Lorenz curve would be the green line: we would have $q_i = 0$ for all the proportions of the population except $p_i = 1$, where $q_i = 1$.

To construct a numerical summary of the inequality we compute the sum of the differences $p_i - q_i$. It is obvious that $p_k - q_k = 0$ always holds (100% of the population accumulates 100% of the inheritance) and therefore we only need to compute the sum of the differences up to $k-1$, that is

$$\sum_{i=1}^{k-1} (p_i - q_i).$$

But we need a relative measure, so that it can inform us of the degree of inequality. To accomplish this we divide the value of the sum (1.44 in our example) by the maximum value that the sum can take (maximum inequality). Recall that the maximum inequality happens when one individual receives all the inheritance. Therefore in the case of maximum inequality $q_i = 0$ for $i = 1, 2, \dots, k-1$ and we have that

$$\sum_{i=1}^{k-1} (p_i - q_i) = \sum_{i=1}^{k-1} (p_i - 0) = \sum_{i=1}^{k-1} p_i$$

Now we can define the Lorenz-Gini inequality index as

$$I_L = \frac{\sum_{i=1}^{k-1} (p_i - q_i)}{\sum_{i=1}^{k-1} p_i}$$

This index will always be between 0 and 1. It is usually used for large samples and therefore the data is given in a frequency table. In that case $x_i$ stands for the class mark of interval or class i, $n_i$ stands for the absolute frequency of the interval and k is the number of intervals. The Lorenz-Gini inequality index for the example that we have considered is:

$$I_L = \frac{(0.7 - 0.0636) + (0.9 - 0.1)}{0.7 + 0.9} = \frac{1.4364}{1.6} \approx 0.898$$

The points of the Lorenz curve give a description of the inheritance distribution (or of the variable that we are analyzing). In this case family B, representing 70% of the population (that is, of all the inheritors), has received only 6% of the total inheritance. If we add family A to this group, we have 90% of the population but still only 10% of the total inheritance. This allows us to say that in this distribution there is a large inequality. The Lorenz-Gini inequality index has a value of 0.898. Let us now examine another example:

Here the distribution is clearly more egalitarian. This can be confirmed by plotting the Lorenz curve:


The closer the index is to 1, the larger the inequality (or the larger the concentration of the values of the variable in one or a few individuals). It is also true that the index $I_L$ is the ratio between the area enclosed by the diagonal and the curve and the area of the triangle below the diagonal. We can also define an inequality measure based on comparing the values of the variable between each pair of individuals of the population. This is called the difference index. What would be the maximum difference possible? Imagine again that an individual has received all the inheritance, which is $\sum_{i=1}^{k} x_i n_i$. We compute the difference of the wealth of this individual with that of each of the other $N-1$ individuals. Let us suppose we have 3 families as in the previous example and that the last family has received the full inheritance:

Family (i)   Members (n_i)   Wealth (x_i)
B            7               0
A            2               0
C            1               110

We compute the differences between each pair of families:

                       B (x_i = 0, n_i = 7)   A (x_i = 0, n_i = 2)   C (x_i = 110, n_i = 1)
B (x_i = 0, n_i = 7)   0
A (x_i = 0, n_i = 2)   0                      0
C (x_i = 110, n_i = 1) (110-0)*7              (110-0)*2

Therefore the maximum value of the sum of differences is:

$$\text{Maximum value} = (N-1) \sum_{i=1}^{k} x_i n_i$$

(in our case (10-1)*110 = 990). We will use this maximum value again to normalize our measure. We can now observe the difference between each pair of individuals.

Family (i)   Members (n_i)   Wealth (x_i)
B            7               1
A            2               2
C            1               99

To compare all individuals we take into account the number of members $n_r$ of the first family and $n_s$ of the second family, and we compare all the individuals of one family with all the individuals of the other family, that is, we compute $(x_r - x_s) n_r n_s$ whenever the difference is positive, that is, whenever $r > s$:

                        B (x_1 = 1, n_1 = 7)   A (x_2 = 2, n_2 = 2)   C (x_3 = 99, n_3 = 1)
B (x_1 = 1, n_1 = 7)    0
A (x_2 = 2, n_2 = 2)    (2-1)*7*2=14           0
C (x_3 = 99, n_3 = 1)   (99-1)*7*1=686         (99-2)*2*1=194

The sum of actual differences is 14 + 686 + 194 = 894. Dividing this sum by the maximum value gives the difference index, which we can compute with the formula:

$$I_G = \frac{\sum_{r>s} (x_r - x_s) n_r n_s}{(N-1) \sum_{i=1}^{k} x_i n_i}$$

In our example the value of the index is 894/990 = 0.9030. The interpretation of this index is the same as for the Lorenz-Gini inequality index, and it also varies between 0 and 1, with 1 as the maximum inequality or maximum concentration. If we compare the Lorenz-Gini index and the difference index we see that they give similar results. These indices are relative measures, therefore it is possible to compare distributions of different variables. In other words, these indices are dimensionless. So if there is a change of scale, like a proportional increase of 8% in the values of all the individuals, the indices will not be affected.
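Both indices can be checked numerically. The following is a minimal sketch in Python under the definitions given above; the function names `lorenz_gini` and `difference_index` are illustrative choices, and the input is assumed to contain at least two distinct groups $(x_i, n_i)$ with a positive total.

```python
def lorenz_gini(values, counts):
    """Lorenz-Gini index: sum of (p_i - q_i) over sum of p_i, for i = 1..k-1."""
    pairs = sorted(zip(values, counts))            # increasing order of x_i
    total_n = sum(n for _, n in pairs)             # N
    total_a = sum(x * n for x, n in pairs)         # A_k
    n_cum = a_cum = 0
    numerator = denominator = 0.0
    for x, n in pairs[:-1]:                        # the last term p_k - q_k is 0
        n_cum += n
        a_cum += x * n
        p, q = n_cum / total_n, a_cum / total_a
        numerator += p - q
        denominator += p
    return numerator / denominator


def difference_index(values, counts):
    """Difference index: pairwise differences over (N - 1) times the total mass."""
    pairs = sorted(zip(values, counts))
    total_n = sum(n for _, n in pairs)
    total_a = sum(x * n for x, n in pairs)
    differences = sum((xr - xs) * nr * ns
                      for r, (xr, nr) in enumerate(pairs)
                      for xs, ns in pairs[:r])     # only pairs with r > s
    return differences / ((total_n - 1) * total_a)


print(round(lorenz_gini([1, 2, 99], [7, 2, 1]), 3))       # 0.898
print(round(difference_index([1, 2, 99], [7, 2, 1]), 3))  # 0.903
```

Multiplying all the values by the same factor (for instance 1.08) leaves both results unchanged, which illustrates that the indices are dimensionless.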

Concentration indices
Very often we need to know the market shares of firms, for instance when we want to perform a marketing study for a new product. What is the sales concentration of firms? What is the market share of each firm? The concentration of market shares is not the same in the market for potatoes, where a very large number of producers each sell very small quantities, as in the Barcelona-Madrid air shuttle, where a few airlines concentrate all the sales. We will be interested in measuring the concentration of firms. We start by presenting a very simple index that looks at the largest firms. If they do not concentrate the sales, then the index is going to have a low value, showing low concentration. We sort the firms in decreasing order of market share and we denote by $s_i$ the share of firm i. The index $C_k$ is defined as:

$$C_k = \sum_{i=1}^{k} s_i$$

Here k is the number of largest firms that we want to include in the index; the most common choices are the largest two firms or the largest four firms. For instance, $C_4$ represents the sum of the market shares of the four largest firms. The value of $C_k$ varies between a minimum concentration of $k/n$, where n is the total number of firms, and a maximum concentration of 1. Minimum concentration is produced when all the firms have the same market share.

Another widely used concentration index is the Herfindahl index, and its definition is:

$$H = \sum_{i=1}^{n} s_i^2$$

The value of H varies between the minimum concentration 1/n (all firms are equal) and the maximum concentration 1. These indices have advantages and disadvantages. Index $C_k$ is very easy to compute, but H has clear advantages as it takes into account all the firms in the market. The properties of H that make it a very convenient tool for measuring market concentration are the following:

1. Non-ambiguous character. If we have two markets, index H can unambiguously say which of the two markets has the larger concentration.
2. Scale invariance. The index depends only on the market shares, so the unit in which sales are measured does not affect the computation of H.
3. Transference. Index H increases its value as the share of a small firm decreases in favour of the share of a large firm, in other words when concentration increases.
4. Monotonicity. If all n firms of the market had equal shares, measure H would be decreasing with respect to n.
5. Cardinality. If we divide each firm into k equal firms, measure H would decrease in the same proportion.
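The two concentration indices are simple enough to sketch in a few lines of Python. The market shares below are hypothetical values chosen only for illustration, and the function names are not part of the course material.

```python
def concentration_ratio(shares, k):
    """C_k: sum of the market shares of the k largest firms."""
    return sum(sorted(shares, reverse=True)[:k])


def herfindahl(shares):
    """H: sum of the squared market shares of all the firms."""
    return sum(s ** 2 for s in shares)


# Hypothetical market with four firms (the shares add up to 1)
shares = [0.4, 0.3, 0.2, 0.1]
print(round(concentration_ratio(shares, 2), 2))  # 0.7
print(round(herfindahl(shares), 2))              # 0.3

# With four equal firms both indices reach their minimum: C_2 = k/n and H = 1/n
equal = [0.25] * 4
print(concentration_ratio(equal, 2))  # 0.5
print(herfindahl(equal))              # 0.25
```

Splitting each of the four equal firms into two equal firms gives eight shares of 0.125 and H = 0.125, half of the previous value, as stated by the cardinality property.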


Index numbers
An index number is a statistical measure expressing the changes registered in a variable over time, combining information about the levels and the rates of growth (or decrease) of the variable. It helps in analyzing the evolution of the variable, since its values directly give a measure of growth and they give accurate information on the original series. We can use this technique to analyze the growth of a variable in percentage terms:
Year   Number of Foreigners   Index
1998   637085                 100
1999   748954                 117.56
2000   923879                 145.02
2001   1370657                215.15
2002   1977946                310.47
2003   2664168                418.18
2004   3034326                476.28
2005   3730610                585.57
2006   4144166                650.49
2007   4519554                709.41
2008   5268762                827.01

[Figure: Number of Foreigners by year]

[Figure: Index by year]

The original series and the index contain the same information. The only difference is that the index is computed on a scale whose base is 100, and this helps a lot in quickly checking how much the series has increased (or decreased). For instance, we can see that over the 11 years covered the number of foreigners has increased 8-fold. We can also see that between 1998 and 1999 the number of foreigners grew by 17%. We could also perform these computations with the original series, but the index allows us to do them very fast. Indices can be used not only to check the evolution of variables over time, but also to compare the values of a variable for different individuals. For instance, in the following table we find a comparison of GDP (Gross Domestic Product) per capita using the European Union of 27 countries as the basis for comparison (EU 27 = 100).

Simple indices
The two previous examples correspond to indices which we call simple indices, because they only describe the evolution of one single variable. Formally, a simple index is the ratio between each number of a series and the value that the series takes in the base period. We usually express an index with a base equal to 100, so we multiply the ratio by 100. That is:

$$i_{t/0} = \frac{X_t}{X_0} \cdot 100$$

where $X_t$ is the value in period t of the variable for which we are constructing the index and $X_0$ represents the value of the variable in the base period. Here $i_{t/0}$ is read as the value of the (simple) index in period t, base 0.

Example:
Period    Price of Milk   Index
2008-12   0.70            100
2009-1    0.69            98.57
2009-2    0.69            98.57
2009-3    0.68            97.14
2009-4    0.68            97.14
2009-5    0.66            94.29
2009-6    0.66            94.29
2009-7    0.65            92.86
2009-8    0.65            92.86

In this example the base period is the first one, December 2008. Notice that the original variable has a unit of measure, in this case euros, but the index is dimensionless. We also observe that the value of the index in the base period is always 100 (if t coincides with the base period we trivially obtain that $i_{0/0} = (X_0 / X_0) \cdot 100 = 100$). The index allows us to quickly compute, for instance, that the price of milk in August 2009 is (100 - 93 =) 7% lower than in December 2008. Choosing December 2008 as the base period is probably the most natural choice, but we can choose any other period as the base period. For instance:

Period    Price of Milk   Index
2008-12   0.70            107.69
2009-1    0.69            106.15
2009-2    0.69            106.15
2009-3    0.68            104.62
2009-4    0.68            104.62
2009-5    0.66            101.54
2009-6    0.66            101.54
2009-7    0.65            100
2009-8    0.65            100
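The two milk price tables differ only in the base period, which can be illustrated with a short Python sketch. The function name `simple_index` and the dictionary layout are illustrative assumptions; the prices are the ones given in the example.

```python
def simple_index(series, base):
    """Simple index of each period relative to the base period, on a base of 100."""
    base_value = series[base]
    return {period: round(value / base_value * 100, 2)
            for period, value in series.items()}


milk = {
    "2008-12": 0.70, "2009-1": 0.69, "2009-2": 0.69,
    "2009-3": 0.68, "2009-4": 0.68, "2009-5": 0.66,
    "2009-6": 0.66, "2009-7": 0.65, "2009-8": 0.65,
}

base_december = simple_index(milk, "2008-12")
print(base_december["2009-1"], base_december["2009-8"])  # 98.57 92.86

# Rebasing to July 2009 (the same price as August 2009) gives the second table
base_july = simple_index(milk, "2009-7")
print(base_july["2008-12"], base_july["2009-8"])         # 107.69 100.0
```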

Complex indices
Simple indices have limited usefulness in economics and management. Most often we have to analyze the evolution of many different variables at the same time. For instance, the Consumer Price Index (CPI) is a complex index that describes the evolution of the prices of many different consumption products and services at the same time. Another example is the IBEX index of the Spanish stock market, which describes the evolution of the stocks of the largest Spanish companies. There are different techniques to compute a complex index, which we will illustrate with a simple example. Let us consider a family that consumes three products, A, B and C. We observe their consumption and the prices of the three products over a number of periods, and we compute the expenditure of the family during each period.

         Product A              Product B              Product C
Period   Quantity  Unit price   Quantity  Unit price   Quantity  Unit price   Expenditure
1        11        1.00         20        10.00        100       5.00         711.00
2        10        1.50         17        12.00        82        6.00         711.00
3        10        1.30         13        14.00        86        6.00         711.00

If we only observed the total expenditure of the family in each period we would not have any clue about the evolution of the prices of the three products. Total expenditure is a result of the variation of prices but also of the variation of the quantities consumed, and this family has been changing its consumption as prices have been changing. We observe that, globally, the prices of the three products have increased during the three periods. How could we describe the evolution of prices only?

Laspeyres index
A possible solution is to elaborate a complex index, and the first one we are going to present is called the Laspeyres index. It is computed in the following way:

1) We first fix a base period (or reference period); it could be for instance period 1.
2) We compute the weight of each product in the total expenditure of the base period. We notice for example that in period 1 the family spent 11 on product A, 200 on product B and 500 on product C. These expenditures represent 2, 28 and 70%, respectively, of the total expenditure of 711. We express these weights in terms of proportions instead of percentages.
3) We now compute simple indices of the price of each product separately.
4) Finally, we compute a complex index of the joint evolution of prices as a weighted average of the simple indices of prices, where the weights are the ones computed in step 2.
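         Product A        Product B        Product C
Period   Weight  Index    Weight  Index    Weight  Index    Laspeyres Index
1        0.02    100      0.28    100      0.7     100      100
2        0.02    150      0.28    120      0.7     120      120.46
3        0.02    130      0.28    140      0.7     120      125.78

That is:

$$I^L_{1/1} = 0.02 \cdot 100 + 0.28 \cdot 100 + 0.7 \cdot 100 = 100$$
$$I^L_{2/1} = 0.02 \cdot 150 + 0.28 \cdot 120 + 0.7 \cdot 120 = 120.46$$
$$I^L_{3/1} = 0.02 \cdot 130 + 0.28 \cdot 140 + 0.7 \cdot 120 = 125.78$$

The results are computed with the unrounded weights (11/711, 200/711 and 500/711); with the rounded weights shown here the sums would be 120.6 and 125.8.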

We write

$$I^L_{t/0} = \sum_{j=1}^{J} w_0^j \, i^j_{t/0}$$

where J represents the total number of products or categories, $w_0^j$ is the weight of product j in the total expenditure of the base period and $i^j_{t/0}$ is the value of the simple index of the price of product j in period t on the basis of period 0. Another way of writing the formula is:

$$I^L_{t/0} = \sum_{j=1}^{J} w_0^j \, i^j_{t/0} = \sum_{j=1}^{J} \frac{q_0^j p_0^j}{\sum_{i=1}^{J} q_0^i p_0^i} \, i^j_{t/0} = \sum_{j=1}^{J} \frac{q_0^j p_0^j}{\sum_{i=1}^{J} q_0^i p_0^i} \, \frac{p_t^j}{p_0^j} \cdot 100 = \frac{\sum_{j=1}^{J} q_0^j p_t^j}{\sum_{j=1}^{J} q_0^j p_0^j} \cdot 100$$

That is, the Laspeyres index can be interpreted as the ratio between the basket of products of the base period evaluated at the current prices and the same basket evaluated at the prices of the base period.

Paasche index
An alternative way to elaborate a complex index is by means of a weighted mean of simple indices where, contrary to the previous index, in which the weights were fixed at the base period, the weights are computed for the current period. We can write the index as:

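$$I^P_{t/0} = \sum_{j=1}^{J} w_t^j \, i^j_{t/0} \quad \text{with} \quad w_t^j = \frac{p_0^j q_t^j}{\sum_{i=1}^{J} p_0^i q_t^i}.$$

We observe that another way of writing this formula is:

$$I^P_{t/0} = \sum_{j=1}^{J} \frac{p_0^j q_t^j}{\sum_{i=1}^{J} p_0^i q_t^i} \, i^j_{t/0} = \sum_{j=1}^{J} \frac{p_0^j q_t^j}{\sum_{i=1}^{J} p_0^i q_t^i} \, \frac{p_t^j}{p_0^j} \cdot 100 = \frac{\sum_{j=1}^{J} p_t^j q_t^j}{\sum_{j=1}^{J} p_0^j q_t^j} \cdot 100$$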

With the data of the previous example we obtain:

         Product A        Product B        Product C
Period   Weight  Index    Weight  Index    Weight  Index    Paasche Index
1        0.02    100      0.28    100      0.7     100      100
2        0.02    150      0.29    120      0.69    120      120.51
3        0.02    130      0.23    140      0.75    120      124.74

It can be noticed that in this case the difference between the two indices is very small. This is not always the case: if the consumption patterns change over the periods, these two indices may give different results.
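Both complex indices can be verified with the basket interpretation (current prices over base prices, evaluated on a fixed basket). The sketch below is written in Python under that interpretation; the data layout and the function names `laspeyres` and `paasche` are illustrative assumptions, while the prices and quantities are those of the three-product example.

```python
prices = {
    1: {"A": 1.00, "B": 10.00, "C": 5.00},
    2: {"A": 1.50, "B": 12.00, "C": 6.00},
    3: {"A": 1.30, "B": 14.00, "C": 6.00},
}
quantities = {
    1: {"A": 11, "B": 20, "C": 100},
    2: {"A": 10, "B": 17, "C": 82},
    3: {"A": 10, "B": 13, "C": 86},
}


def laspeyres(prices, quantities, t, base):
    """Base-period basket valued at period-t prices over the same basket at base prices."""
    products = prices[base]
    current = sum(quantities[base][j] * prices[t][j] for j in products)
    reference = sum(quantities[base][j] * prices[base][j] for j in products)
    return 100 * current / reference


def paasche(prices, quantities, t, base):
    """Period-t basket valued at period-t prices over the same basket at base prices."""
    products = prices[base]
    current = sum(quantities[t][j] * prices[t][j] for j in products)
    reference = sum(quantities[t][j] * prices[base][j] for j in products)
    return 100 * current / reference


for t in (1, 2, 3):
    print(t, round(laspeyres(prices, quantities, t, 1), 2),
          round(paasche(prices, quantities, t, 1), 2))
# 1 100.0 100.0
# 2 120.46 120.51
# 3 125.78 124.74
```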

Measuring inflation
Inflation refers to changes in prices. Usually it refers to changes in the Consumer Price Index (CPI), a Laspeyres index that is computed in most countries. In Spain it is called Índice de Precios al Consumo (IPC) and is computed by the Instituto Nacional de Estadística (INE, National Institute of Statistics). INE publishes various inflation indices:

The monthly variation of inflation or monthly inflation rate is the growth rate of the CPI between two consecutive months. That is:

$$\text{Monthly inflation rate of month } t = \frac{\text{CPI}_{\text{month } t} - \text{CPI}_{\text{month } t-1}}{\text{CPI}_{\text{month } t-1}} \cdot 100$$

The annual variation or interannual inflation rate is the growth rate of the CPI between any month and the same month one year earlier:

$$\text{Interannual inflation rate month } t = \frac{\text{CPI}_{\text{month } t} - \text{CPI}_{\text{month } t \text{ one year ago}}}{\text{CPI}_{\text{month } t \text{ one year ago}}} \cdot 100$$

The cumulative annual variation or cumulative inflation rate is the growth rate of the CPI between any month and the beginning of the current year:

$$\text{Cumulative inflation rate month } t = \frac{\text{CPI}_{\text{month } t} - \text{CPI}_{\text{beginning of current year}}}{\text{CPI}_{\text{beginning of current year}}} \cdot 100$$

Applying these formulas to actual data:

$$\text{Monthly inflation rate August 2009} = \frac{106.698 - 106.327}{106.327} \cdot 100 = 0.3\%$$
$$\text{Interannual inflation rate August 2009} = \frac{106.698 - 107.571}{107.571} \cdot 100 = -0.8\%$$
$$\text{Cumulative inflation rate August 2009} = \frac{106.698 - 106.909}{106.909} \cdot 100 = -0.2\%$$

That is, from July 2009 to August 2009 prices increased by 0.3%, but they have decreased by 0.2% since the beginning of the year, and by 0.8% if we compare them with the prices one year earlier.
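The three inflation rates are the same growth-rate computation applied to different reference values, as the following Python sketch suggests. The function and variable names are illustrative; the CPI values are the ones used in the computation for August 2009.

```python
def growth_rate(current, reference):
    """Percentage growth of the index between a reference period and the current one."""
    return (current - reference) / reference * 100


cpi_august_2009 = 106.698
cpi_july_2009 = 106.327        # previous month
cpi_august_2008 = 107.571      # same month one year earlier
cpi_start_of_2009 = 106.909    # beginning of the current year

print(round(growth_rate(cpi_august_2009, cpi_july_2009), 1))      # 0.3
print(round(growth_rate(cpi_august_2009, cpi_august_2008), 1))    # -0.8
print(round(growth_rate(cpi_august_2009, cpi_start_of_2009), 1))  # -0.2
```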

Nominal and real growth


Consider that your wage was 2000 euros in 2002 and in 2009 it is 2400 euros. Has your wage increased? Certainly, your wage is 20% larger in 2009. But are you better off than 7 years ago? We cannot answer this question without knowing how the prices of goods and services have evolved since 2002; maybe everything is more expensive and you cannot buy as many things with 2400 in 2009 as you could buy with 2000 in 2002. If this were a real case, we could try to obtain information on the evolution of prices. For Spain we can find this information at the Instituto Nacional de Estadística web site:

We observe that prices have increased by 20.68% between 2002 and 2009 (we compute a simple index between 2002 and 2009). Therefore, if your wage only increased by 20% you are not better off; you are slightly worse off. We say that your wage has increased in nominal terms, that is, its denomination in euros has increased, but not in real terms, that is, you cannot buy more things with your wage now than in the past. How much has your purchasing power really changed? One way of assessing the real change in your wage is to deflate the nominal value, that is, to remove whatever is only due to the general price increase. To deflate means to eliminate the effect of inflation on the evolution of a monetary magnitude. In our case it is very simple: we divide the nominal wages of 2002 and 2009 by the value of the IPC index and multiply by 100. The result tells us what your wage would be if prices were the prices of 2006 (this is the base of the index). This way 2000 of 2002 are equivalent to (2000/88)*100 = 2272.73 of 2006, and 2400 of 2009 are equivalent to (2400/106.2)*100 = 2259.89 of 2006. In real terms the wage has not increased but slightly decreased (it has fallen by about 0.6%).
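The deflation step can be sketched in Python as follows. The function name `deflate` is an illustrative choice, and the index values 88 (2002) and 106.2 (2009), base 2006 = 100, are the ones quoted in the computation above.

```python
def deflate(nominal, price_index):
    """Express a nominal amount at the prices of the base period of the index."""
    return nominal / price_index * 100


real_2002 = deflate(2000, 88)     # wage of 2002 at 2006 prices
real_2009 = deflate(2400, 106.2)  # wage of 2009 at 2006 prices
print(round(real_2002, 2))  # 2272.73
print(round(real_2009, 2))  # 2259.89

nominal_growth = (2400 / 2000 - 1) * 100
real_growth = (real_2009 / real_2002 - 1) * 100
print(round(nominal_growth, 1))  # 20.0
print(round(real_growth, 1))     # -0.6
```

The sign of `real_growth` is what answers the question in the text: the nominal wage grows, but the deflated wage falls slightly.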
