
Data Analysis IA – Descriptive Statistics

Introduction
The purpose of this presentation is not to turn you into statisticians but to enable you to
evaluate your data statistically as part of your research project. It will acquaint you with
some of the basic techniques that you can employ. At the same time, the limitations of each
statistical technique will be emphasized. In the next three presentations we will outline the
data analysis methods which take the evidence contained in a data record and then quantify or
qualify certain features of the data to be presented in the results section of your project.
We will discuss the situations in which you would use each technique, the assumptions made,
and how to interpret the results. This includes a broad range of techniques for exploring and
summarizing data, as well as for investigating and testing underlying relationships.
Essentially, in the data analysis sections you will become familiar with your data in order to
answer your research questions and to meet your research objectives. This material is
application-oriented and the approach adopted is practical; we will therefore not derive proofs.
Data analysis is based on statistical methods which enable you to turn data into information and
information into knowledge. Statistics at large is the science dealing with the collection, analysis,
interpretation, and presentation of data (Webster’s Third New International Dictionary).
The term statistics is also often used in a generic way to refer to a group of data
representing measured facts and figures.
The starting point will be to outline the basic concepts in business statistics which are essential
to your analysis stage. Then we will outline three important ways of describing a set of data.
First, data can be summarized in tabular form for better understanding. We will see how to
summarize a group of data into categories and frequency tables. Second, data can be described
through graphical representation, since most readers process and comprehend graphs more
easily than columns of figures. Finally, we will describe a set of data using some basic
measurements which typify and characterize the data and provide evidence of their spread.
Such measurements include the average and the standard deviation of a set of data.

Learning Objectives
The aim of this presentation is to:
• Define statistics and differentiate between descriptive, inferential and relational
statistical methods.
• Classify data according to their source and according to whether they are qualitative or
quantitative, with their distinct levels of measurement.
• Explore the limitations of statistics.
• Recognize the difference between single and grouped data and construct frequency
distributions.

Elias Pavlou Page 1


• Construct a variety of graphs according to the type of data examined, such as bar charts,
pie charts, line graphs, histograms and scatter plots.
• Distinguish between measures of central tendency, measures of variability and measures
of the shape of a distribution.
• Understand the meanings and derivations of mean, median, mode, quartiles, range,
variance and standard deviation. Derive a relative measure of variation.
• Differentiate between sample statistics and population parameters.
• Understand the application of standard deviation to the shape of a distribution and
standardized Z-scores for comparison purposes.
• Understand the meanings of skewness and kurtosis of the shape of a frequency
distribution.
• Finally, the objective of this presentation is to outline the Excel commands that produce
most descriptive measures.

Basic Statistical Concepts

Data and Variables


• Data constitute the raw material of statistics, without which there cannot be any form of
statistical analysis. Data are systematically recorded information together with context
and can be numbers, dates, words, codes or other labels. Data may relate to an activity,
a phenomenon, or a problem situation under study. They are the result of
the process of measuring, quantifying, counting or classifying.
• The data series generated by such a phenomenon or activity is defined as a variable.
Thus a variable is a successive set of measurements relating to specific elements, objects
or observations exhibiting a common quality or characteristic. For instance, we may
record freight rates for a given period, scrap volumes, or the annual
number of sales and purchases for a number of shipping companies. Each of these
examples records information depicting the same characteristic. A variable is denoted
by a capital letter such as X, Y or Z.

Types of data
Data are classified as quantitative or qualitative. This distinction reflects the type of
characteristic being measured.
• Quantitative data are those that can be quantified in definite units of measurement.
These refer to characteristics whose successive measurements yield quantifiable
observations. In general, a quantitative variable is measured on a scale with a fixed unit
of measurement between its possible values. For example, if we measure freight rates
for a given route to the nearest dollar, then one dollar is the fixed unit of measurement
between different freight rates.
Depending on the nature of the variable being measured, quantitative data
can be further categorized as continuous or discrete.
Continuous data represent the numerical values of a continuous variable. They can take
any value within a given interval. Continuous values can lie arbitrarily close to one
another, so decimals are meaningful. Characteristics such as weight,
length, height, thickness, velocity, temperature and salaries represent continuous
variables. In shipping, the length of a vessel may be measured to any required
accuracy.
Discrete data are values recorded in integer form. A discrete variable therefore
consists of data reported as whole numbers, essentially representing counts. For
example, the number of customers visiting a bank every day, the number of incoming
ships at a port, and the number of defective items in a consignment are discrete data.
In some cases we can treat a variable as either continuous or discrete. For example, the
length of a ship measured to the nearest 100 meters would be treated as discrete,
whereas the length of a ship measured to the nearest centimeter can be treated as
continuous.
• Qualitative data refer to qualitative characteristics of a subject or an object. A variable
is qualitative in nature when its elements or objects are described by a certain
attribute, so that observations fall into a discrete number of categories. Arithmetic
operations on such data are not meaningful.
These data are further classified as nominal and ordinal (rank) data.
Nominal data are derived from classification into two or more categories. Take
for instance the classification of students according to gender
(as males and females), of workers according to skill (as skilled, semi-skilled, and
unskilled), and of employees according to level of education (undergraduates and
post-graduates). Furthermore, consider data collected for vessels according to port
of registry. Nominal data are thus non-numerical in nature and come in the form of
words or labels. Each data element selected is classified and assigned to a particular
class.
Ordinal data are the result of assigning ranks to specify order in terms of the integers
1, 2, 3, ..., n. They constitute a scale that establishes a ranking over a set of data points.
The data can therefore be arranged in order, and you can say that one data entry is greater
than another, e.g. TV ratings or satisfaction levels of cruise passengers. A large amount of
ordinal data is generated by questionnaires.

When working with statistics, it’s important to recognize the different types of data. Data are
the actual pieces of information that you collect through your study. Therefore it is necessary to
carefully distinguish the actual nature of the variable being measured. Please note that
statistical methods are generally specific for the kind of data being handled.



Data Types
  Qualitative (Categorical)
    Nominal: education level, gender, port of registry
    Ordinal: opinions, satisfaction levels of cruise passengers
  Quantitative (Numerical)
    Discrete: no of employees, no of vessels under management
    Continuous: salaries, freight rates, scrap prices

Qualitative data cannot be measured numerically; they can only be classified into
categories based on the qualities they describe or placed in rank order (Berman Brown and
Saunders 2008). For this reason qualitative data are usually called categorical data.
Quantitative data, on the other hand, can be assigned positions on a numerical scale; hence
they are usually referred to as numerical data. Numerical data can be analyzed using a wider
range of statistics.

Numerical data can be classified into interval and ratio data depending on the kind of arithmetic
operations which can be performed upon them.
• Interval data state the difference or interval between any two data values for a
particular variable, but you cannot express their relative difference. Interval
data can therefore be added or subtracted, but they cannot be meaningfully multiplied or
divided. The classic example of interval data is the Celsius temperature scale. Although the
difference between, say, 20°C and 30°C is 10°C, it does not mean that 30°C is one and a
half times as warm. This is because 0°C does not represent a true zero. Responses on a
Likert scale are often treated as interval data in practice. Strictly speaking, however, an
opinion question on a Likert scale from 1 to 7 yields ordinal data: the distances between
points need not be equal, in that the step from 1 to 2 is not necessarily the same as the
step from 2 to 3, and only the order (of preference) is meaningful.
• In contrast, ratio data can be expressed as relative differences or ratios between any
two data values for a variable, and there is an inherently defined zero value. Variables
such as salary, height, weight, time, and distance are ratio variables. For example, if a
multinational company makes a profit of $300 000 000 in one year and $600 000 000
the following year, we can say that profits have doubled. Furthermore, a distance of
zero nautical miles is “no distance at all,” and a port that is 100 nm away is “twice as
far” as a port that is 50 nm away.
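
The distinction between interval and ratio scales can be checked numerically. The short Python sketch below is an illustration added here, not part of the original presentation. It converts the two Celsius readings to kelvin, a temperature scale with a true zero, and compares the ratios. The helper name `celsius_to_kelvin` is chosen for this example only.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius reading to kelvin, a scale with a true zero."""
    return celsius + 273.15


low_c, high_c = 20.0, 30.0

# Differences are meaningful on an interval scale: 10 degrees in both units.
difference_c = high_c - low_c
difference_k = celsius_to_kelvin(high_c) - celsius_to_kelvin(low_c)

# Ratios are only meaningful on a scale with a true zero.
ratio_c = high_c / low_c  # 1.5, but this does not mean "1.5 times as warm"
ratio_k = celsius_to_kelvin(high_c) / celsius_to_kelvin(low_c)  # about 1.034

print(f"difference: {difference_c:.1f} C = {difference_k:.1f} K")
print(f"Celsius ratio: {ratio_c:.3f}, kelvin ratio: {ratio_k:.3f}")
```

The difference is the same in both units, whereas the ratio changes from 1.5 to roughly 1.03 once a true zero is used, which is why ratios of Celsius readings should not be interpreted.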



Data Sources
As we have seen before, data can be classified as secondary or primary according to their source.
Secondary data already exist in some form of published or unpublished material, though not
necessarily in the form actually required. Primary data, on the other hand, do not already
exist in any form and must therefore be collected for the first time from primary
sources, covering either the whole population or a sample extracted from it.

Statistical Methods

The study of statistics can be divided into two main areas, Descriptive Statistics and Inferential
Statistics. Two further areas, the study of relationships and business forecasting, are outlined
alongside them below.
• Descriptive Statistics deals with collecting, summarizing, and simplifying data so that
meaningful conclusions can readily be drawn from them. Descriptive statistics therefore
aims at highlighting the characteristics present in a set of data. It
provides an understanding of the data for further analysis and interpretation.
The first step in any research inquiry is to collect data relevant to the problem. The
research design of your project determines the kind of data it will
require and/or generate. Once the data have been collected, they are organized and
presented in a meaningful way via appropriate tables. Graphs and diagrams
are also used for better presentation of the data. Useful tabular and graphical
presentation of raw data requires the data to be classified properly in accordance with
the research objectives and the ensuing analysis to be undertaken.
The type of data determines the appropriate summary measures. These include
measures of central tendency, dispersion and skewness, which constitute the essential
scope of descriptive statistics. These form a large part of the subject matter of any basic
textbook on the subject, and they are therefore discussed in that order here as well.
• Inferential statistics, also known as inductive statistics, goes beyond describing a given
set of data. It consists of methods that are used for drawing inferences, or making
broad generalizations, about a population of observations on the basis of knowledge
about a sample drawn from this population.
A population can also be described in its entirety by observing all its elements. This
process is called a census. Examining the whole population is not always feasible, since
it is time consuming and costly. In such cases, you should study
a part of the population, that is, a sample. Any particular measurement of the sample
can then be used to draw an inference about the entire corresponding population. This
process underlies the subject area of inferential statistics.
Consider the case in which you are required to investigate the average annual income of
a certain population of people. You record the annual income of a sample from the
given population. The sample average can then be used to infer the actual average
annual income for the whole population.
Inferential statistics produces estimates for populations of interest. Because estimates
are based on the knowledge of a limited sample, they can lead to incorrect decisions
about a population. Fortunately, statistics provides the methods needed to quantify in
probabilistic terms the chances of a decision being incorrect. These probabilities constitute
the degree of reliability of inferences.
• Relationships: this area of statistics investigates and identifies relations
between two or more data sets. For example, we might wish to investigate the relation
between the Gross Domestic Product and the unemployment rate for a country over a
number of years. Moreover, we may wish to establish the relationship between the
Baltic Dry Index and scrap prices. In statistics we talk about correlations between
variables, and we are able to quantify the degree of such correlations. In other words,
we quantify how changes in one variable are associated with changes in the other(s) and make
predictions based on this correlation. This analysis requires the use of appropriate
statistical methods in the area of regression and correlation.
• Finally, business forecasting constitutes the study of methods and techniques for
estimating business and economic variables. Forecasting is concerned with the future
behavior of an economic activity based on past performance. For example, monthly
product sales are an important measure for evaluating the production of a commodity.
Moreover, shipping experts wish to forecast shipping cycles in terms of future
freight rates and transportation volumes. This requires the compilation of data over time,
and the appropriate analysis is called time series analysis.
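
The income example given under inferential statistics can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population is synthetic (10,000 hypothetical incomes drawn from a normal distribution), and the sample size of 400 is arbitrary. None of these figures come from the presentation.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Synthetic population of 10,000 annual incomes (hypothetical values).
population = [random.gauss(30_000, 8_000) for _ in range(10_000)]

# Draw a simple random sample of 400 people without replacement.
sample = random.sample(population, 400)

population_mean = statistics.mean(population)  # normally unknown in practice
sample_mean = statistics.mean(sample)          # the estimate we can compute

print(f"population mean: {population_mean:,.0f}")
print(f"sample mean:     {sample_mean:,.0f}")
print(f"estimation error: {sample_mean - population_mean:,.0f}")
```

The sample mean will usually be close to, but not exactly equal to, the population mean. Inferential statistics is concerned with quantifying that gap.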

Statistical Limitations
Statistics has its limitations, since it deals with uncertainty. It is therefore not considered an
exact science in the way the rest of mathematics is. It tries to extract the maximum information
about a population from a sample. Although different samples will yield different results, the
sample drawn must be representative and must not be chosen on the basis of convenience.
Statistical methods are appropriate for aggregates of facts, so single observations cannot be
dealt with statistically. Statistical methods are best applied to quantitative data, and certain
phenomena or concepts are not suitable for measurement. Furthermore, statistics cannot be
applied to heterogeneous data.
During the collection, analysis and interpretation of data, statistical results
might be misleading, or intentionally distorted in order to defend one’s position or to prove a
particular point. An association or relationship between two or more variables does not indicate
a cause and effect relationship; it simply shows the similarity or dissimilarity in the movement
of the variables. Handling statistical data properly requires a sound knowledge of statistics.
Some errors are possible in statistical decisions, particularly in inferential statistics, and in
any given case we do not know whether an error has been committed or not.



Tables and Graphs

In this section we will outline some guidelines for incorporating numerical information
into a research project. We will demonstrate the role of graphs and tables as formats for
presenting data. Emphasis will be given to the ways in which they can be easily read and
interpreted. Which of these methods is the most appropriate depends upon the
amount of data you are dealing with and their complexity. It is important to remember that
when using a table or graph the associated text should describe what the data reveal about the
topic.

Presenting data in tables


Tables are used to present numerical or categorical data in a wide variety of publications, from
newspapers to textbooks. Most data are initially stored and analyzed in tabular form. You can
tabulate both primary and secondary data. The latter may already be presented as a table in the
original work, and you may only require an extract from the table to support your argument, or
you may wish to improve the design of the table, or to merge information from two different
publications. There is no problem in doing any of these as long as you ensure that you reference
the original source of the data in your table.
Tables are an effective way of presenting data:
• when you wish to show how a single category of information varies when measured at
different points (in time or space). For example, port registry numbers, or freight rates per
route or for a given time period;
• when the precise value is crucial to your argument and a graph would not convey the
same level of precision. For example, when it is important that the reader knows that the
result was 2.48 and not 2.45;
• when you don’t wish the presence of one or two very high or low numbers to detract
from the message contained in the rest of the dataset. For example, if you are presenting
information about the annual profits of an organization and don’t want the underlying
variability from one year to the next to be swamped by a large loss in a particular year.
In order to ensure that your table is clear and easy to interpret there are a number of design
issues that need to be considered.
• Since tables consist of rows and columns of information it is important to consider how
the data are arranged between the two. Most people find it easier to identify patterns in
numerical data by reading down a column rather than across a row.
• If there are several columns or categories of information a table can appear complex
and become hard to read. If the columns are equally important it is often better to include
two or more simple tables rather than a single more complex one.
• Numbers in tables should be presented in their simplest format. This may mean
rounding values to avoid the use of decimal places, stating the units (e.g. £4.6 million
rather than £4,600,000) or using scientific notation (e.g. 6.315 x 10^-2 rather than 0.06315).
• All tables should be presented with a title that contains enough detail for a reader to
understand the content. There should also be information about the source of the data
being used.

Frequency Tables for numerical data


The most popular form of tabular representation of raw data is the frequency
distribution table. For each separate value or group of values of the variable, the number of
times it occurs is tabulated.
Take for instance the following two examples.
Example 1: In this example the number of rainy days per month between January and December
was recorded.
Rainy days per month January to December in Sometown
18 13 13 13 14 15 13 15 16 17 17 19
Example 2: In the second example, the times at which ships passed a certain point were
recorded.
Times at which ships passed a certain point
11.20 11.35 11.37 11.50 12.03 12.04 12.06 12.15
12.20 12.30 12.35 12.50 12.53 13.08 13.16 13.25
13.31 13.58 14.00 14.12 14.35 14.45 14.49 14.50
14.53 15.04 15.25 15.55 16.05 16.14 16.15 16.19
16.30 16.38 17.04 17.05 17.10 17.10 17.10 17.15
17.25 17.45 17.55

The above two examples represent raw data, which are of limited use. The first task is to
summarize the given data, reducing the overwhelming amount of numbers so that significant
features stand out.
In summarizing any set of data it is advisable to arrange them in ascending or descending order.
In the case of qualitative data an arbitrary ordering, such as alphabetical order, may be
necessary. Such an arrangement is called an array of data.

For both examples the recorded observations would be tabulated as follows



The first table represents a frequency table for single data. The frequencies are tabulated for
each distinct value observed in the data series. We can see there are 4 months with 13 rainy
days, 1 month with 14 rainy days and so forth.
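
As a minimal illustration, not part of the original presentation, the single-data frequency table can be reproduced from the Example 1 figures with Python's standard library. The variable names are chosen for this sketch only.

```python
from collections import Counter

# Rainy days per month, January to December, as listed in Example 1.
rainy_days = [18, 13, 13, 13, 14, 15, 13, 15, 16, 17, 17, 19]

# Count how many months share each value.
frequency = Counter(rainy_days)

# Print the table with the values in ascending order.
print("Rainy days  Months")
for value in sorted(frequency):
    print(f"{value:>10}  {frequency[value]:>6}")
```

The first two rows show 4 months with 13 rainy days and 1 month with 14 rainy days, in agreement with the description above.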
The second table represents a grouped data frequency table, in which groups of values of the
variable are stated. The groupings are called classes. So 9 ships passed this particular point
between midday and 1 pm. Similarly, 6 ships passed the same point between 4 and 5
o’clock in the afternoon. Since the variable, time, is continuous, each end point must belong
to one class only. The size of a class is known as the class interval; it represents the
length of the class measured on a continuous scale, and in this example it is 60 minutes. We
observe that all 7 classes are of equal length. This is not always the case, since class intervals
may differ for various reasons. Nonetheless you should try to make class intervals of equal size
whenever possible. The number of classes should usually be between 5 and 20.
Let’s see how we determine the class interval:
• First we sort the raw data in ascending order.
• Then we calculate the data range. The range is defined as the difference between the
maximum (latest) time and the minimum (earliest) time. In this example, rounding the
earliest time down and the latest time up to whole hours, we have Range = 18 - 11 = 7 hrs.
• We have already decided that the number of classes is to be 7.
• We compute the class interval (width) as the range divided by the number of classes,
hence 7/7 = 1 hr. If the result is not a convenient number it must be rounded up.
• We determine the class boundaries (limits). In the second example the time limits are 11,
12, 13, 14, 15, 16, 17, 18.
• Finally we count the observations and assign them to classes.
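
The steps above can be sketched in Python for the ship times of Example 2. This is an illustrative sketch rather than the author's procedure. It assumes the times are written as hh.mm on a 24-hour clock and that each class includes its lower limit but not its upper limit.

```python
from collections import Counter

# Times from Example 2, written as hh.mm on a 24-hour clock.
times = """
11.20 11.35 11.37 11.50 12.03 12.04 12.06 12.15
12.20 12.30 12.35 12.50 12.53 13.08 13.16 13.25
13.31 13.58 14.00 14.12 14.35 14.45 14.49 14.50
14.53 15.04 15.25 15.55 16.05 16.14 16.15 16.19
16.30 16.38 17.04 17.05 17.10 17.10 17.10 17.15
17.25 17.45 17.55
""".split()

# Step 1: sort the data, working in minutes after midnight.
minutes = sorted(int(t[:2]) * 60 + int(t[3:]) for t in times)

# Steps 2 to 4: range rounded out to whole hours, 7 classes, width rounded up.
lower, upper = 11 * 60, 18 * 60
n_classes = 7
width = -(-(upper - lower) // n_classes)  # ceiling division, gives 60 minutes

# Steps 5 and 6: each time falls in exactly one class [start, start + width).
counts = Counter((m - lower) // width for m in minutes)

for k in range(n_classes):
    start = lower + k * width
    end = start + width
    print(f"{start // 60:02d}.{start % 60:02d} to under "
          f"{end // 60:02d}.{end % 60:02d}: {counts[k]}")
```

With these assumptions the 43 observations fall into the seven hourly classes with frequencies 4, 9, 5, 7, 3, 6 and 9.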

If there are a few extreme values at either end of the data distribution it is advisable to lump
them together in an open-ended class rather than having classes with very few observations.
For example, let’s assume we wish to tabulate the annual incomes of a shipping company’s
employees. The vast majority of them earn between 10000 and 50000 euros. Therefore it is
decided to construct a grouped data frequency table with 7 classes with interval width of 10000.



We see that annual incomes of less than 10000 or more than 50000 are stated as open-ended
intervals. For further calculations these classes are assumed to have the same width as the
classes immediately next to them.

In frequency tables we are most often concerned with the relative frequency with which
categories of data occur. The relative frequency % gives the percentage of the data in each
category or class. It is calculated as the ratio of the class frequency to the total frequency
recorded, times 100.
\[ \text{Relative frequency \%} = \frac{\text{class frequency}}{\text{total frequency}} \times 100 \]

In constructing a frequency table we also include the cumulative frequencies as well as the
cumulative relative frequencies %. The cumulative frequency corresponding to a particular value
is the sum of all the frequencies up to and including that value. The cumulative relative
frequency is the ratio of the cumulative frequency of a particular value to the total number
of data, expressed as a percentage.
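
These definitions can be illustrated with the rainy-days data of Example 1. The Python sketch below is an addition for illustration, and the column labels are chosen here rather than taken from the original table.

```python
from collections import Counter
from itertools import accumulate

# Rainy days per month, January to December, as listed in Example 1.
rainy_days = [18, 13, 13, 13, 14, 15, 13, 15, 16, 17, 17, 19]

frequency = Counter(rainy_days)
values = sorted(frequency)
counts = [frequency[v] for v in values]
total = sum(counts)

relative = [100 * c / total for c in counts]         # relative frequency %
cumulative = list(accumulate(counts))                # cumulative frequency
cumulative_relative = [100 * c / total for c in cumulative]

print("Days  Freq  Rel %  Cum  Cum %")
for v, c, r, cf, cr in zip(values, counts, relative,
                           cumulative, cumulative_relative):
    print(f"{v:>4}  {c:>4}  {r:>5.1f}  {cf:>3}  {cr:>5.1f}")
```

The row for 13 days shows a relative frequency of about 33%, the row for 16 days a cumulative relative frequency of about 67%, and the row for 17 days a cumulative frequency of 10.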
For the first example the complete frequency table for single data is as follows.

We observe that in 1 in 3 months (33%) it rained on 13 days. There are 10 months in which it
rained on up to 17 days, and in two out of three months (67%) it rained on up to 16 days.

Similarly, for the second example the complete frequency table for grouped data is as follows



We observe that 1 in 5 ships (21%) passed the particular point between midday and 1 o’clock.
25 ships passed the point before 3 o’clock in the afternoon. Finally, 2 out of 3 ships (65%) passed
the point before 4 o’clock in the afternoon.

Frequency tables for categorical data


Categorical variables represent types of data which may be divided into groups. Thus, apart from
tabulating numerical data, we can also construct frequency tables for categorical data. Take for
instance the following example, where an investor apportions his money among stocks, bonds,
credit defaults and savings. The amount invested in each category is recorded with the
corresponding relative frequency in percentages. Cumulative frequencies are not included,
because the categories have no natural order and a running total would be meaningless.

Therefore, the total amount invested was split into 42% in stocks, 29% in bonds, 14% in credit
defaults and 15% in savings.

Bivariate frequency distributions may also be tabulated, whereby two variables are recorded for
each member of the sample. With more than one variable a regular (one-way) frequency table is
no longer sufficient. Instead we use a two-way table, also called a contingency table, which is
a useful tool for examining relationships between categorical variables. The entries in the
cells of a two-way table can be values, frequency counts or relative frequencies, just as in a
one-way table.
In the following table we have tabulated investments in thousands of dollars per category i.e.
stocks, bonds, credit defaults and savings for three investors A, B and C.



Although the amounts invested in each category for all three investors are tabulated, it is
difficult to compare them. The same information can therefore be presented as relative
frequencies per investor for each investment category. Please observe that each entry
is a percentage and each column adds up to one hundred percent.
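
The conversion of a two-way table to column percentages can be sketched as follows. The amounts used here are hypothetical values invented for this illustration; they are not the figures from the original table.

```python
# Hypothetical amounts in thousands of dollars, NOT the original table's figures.
investments = {
    "A": {"stocks": 100, "bonds": 60, "credit defaults": 20, "savings": 20},
    "B": {"stocks": 45, "bonds": 30, "credit defaults": 15, "savings": 10},
    "C": {"stocks": 10, "bonds": 10, "credit defaults": 5, "savings": 25},
}

# Express each investor's amounts as percentages of that investor's total,
# so that every column of the table adds up to 100.
percentages = {}
for investor, amounts in investments.items():
    total = sum(amounts.values())
    percentages[investor] = {
        category: round(100 * amount / total, 1)
        for category, amount in amounts.items()
    }

# Print categories as rows and investors as columns.
print(f"{'Category':<16}" + "".join(f"{inv:>8}" for inv in investments))
for category in investments["A"]:
    row = "".join(f"{percentages[inv][category]:>8.1f}" for inv in investments)
    print(f"{category:<16}{row}")
```

Because each column is divided by its own total, investors with very different total amounts become directly comparable.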

The above percentages facilitate comparisons. We can see that the first two investors follow
approximately the same apportionment across the categories: they predominantly invest in
stocks. Investor C, on the other hand, mainly invests in savings, exhibiting a risk-averse
attitude.

So far we have seen that a frequency table provides a convenient means of summarizing
data: the required figures can be located more readily and comparisons are
made easily. Furthermore, patterns may be revealed. It should be noted that a frequency table
must be accompanied by some narrative to identify the most important features.
Any frequency table should have an explanatory heading and should state the source of the data.
Finally, the units of measurement should be stated explicitly.

Working with percentages


The term percentage, or the symbol %, is widely used. For example, sales discounts of 10%,
invoices with 20% value added tax, unemployment rates and many other values are expressed in
percentage terms.
Percentages are used to convey size or scale or value. Percentage means parts out of 100 and is
the same as a fraction with a denominator of 100. Therefore:
15% means 15 parts out of 100 and is the same as the fraction 15/100
87% means 87 parts out of 100 and is the same as the fraction 87/100
A further way of expressing parts out of 100 is using a decimal and so percentages can also be
expressed as decimals: 15% is the same as 0.15 or 15/100, 87% is the same as 0.87 or 87/100.

Whilst doing your research you may come across many sources of data in tables which you
would like to incorporate into your work. However, this can be difficult if they do not share a
common base line. Percentages are useful for comparing information where the sample sizes or
totals are different. By converting different data to percentages you can readily compare them.
For example, the following table shows the amounts invested in the stated categories in 2000 and
2001.

Because the total amounts invested in 2000 and 2001 were different, it is difficult to compare
the data for the two years and to determine whether or not there was any notable change in
the investment patterns. However, if the amount invested in each category is expressed as a
percentage of the total amount, then it is easier to compare the data for the two years.
For example, the conversion from the actual amount invested in stocks in 2000 to a percentage
can be done in the following way:
First we determine the fraction of the total amount invested in stocks in 2000:
that is, 36 million out of a total of 124 million = 36/124 = 0.29.
Then we convert the decimal to a percentage by multiplying by 100: 0.29 x 100 = 29.
The result indicates that the amount invested in stocks accounted for 29% of all investment in
2000.
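
The same two-step conversion can be written as a short Python sketch. It is an illustration using the two figures quoted in the text, and the variable names are chosen here.

```python
amount_in_stocks = 36   # millions invested in stocks in 2000 (from the text)
total_invested = 124    # total millions invested in 2000 (from the text)

# Step 1: the fraction of the total that went into stocks.
fraction = amount_in_stocks / total_invested

# Step 2: convert the fraction to a percentage.
percentage = fraction * 100

print(f"fraction = {fraction:.2f}, percentage = {percentage:.0f}%")
```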
You can convert all remaining entries to percentages in the same way, resulting in the
following table.

Now we are in a position to compare the amounts invested according to each category. For
example, the above table shows that the amount invested in savings has dropped threefold
between 2000 and 2001.
Percentages are also very useful if you wish to quantify change. They are usually more readily
understandable and comparable than when the information is presented as raw values.
Using the information presented above the percentage increase in stock investment between
2000 and 2001 is calculated as follows:
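The % change is calculated as the difference between the amount invested in the current
year and the amount invested in the previous year, divided by the amount in the previous year,
times 100.
\[ \text{Percentage change} = \frac{\text{amount in current year} - \text{amount in previous year}}{\text{amount in previous year}} \times 100 \]

Therefore the calculated % change = (44 - 36) / 36 * 100 = (8/36) * 100 ≈ 22%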


This means that there was an increase of 22% in the amount invested in stocks in 2001 as
compared to 2000.
When calculating the percentage change between two values it is important to determine the
correct base, that is, the appropriate starting value. This is because the percentage change
from a low number to a higher number is not the same as the percentage change from the same
higher number to the same lower number. For example:
The percentage increase from 50 to 75 = (25/50)x(100/1) = 50%. However, the percentage
decrease from 75 to 50 = (25/75)x(100/1) = 33.3%.
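
A small Python helper makes the role of the base explicit. This is an illustrative sketch, and the function name `percentage_change` is chosen here. It is applied to the stock figures (36 and 44) and to the 50 and 75 example from the text.

```python
def percentage_change(previous, current):
    """Change from previous to current, as a percentage of previous (the base)."""
    return (current - previous) / previous * 100


# Stocks, 2000 to 2001: base is the 2000 amount.
print(f"{percentage_change(36, 44):.1f}%")

# Same pair of numbers, different base, different percentage.
print(f"{percentage_change(50, 75):.1f}%")   # increase from 50 to 75
print(f"{percentage_change(75, 50):.1f}%")   # decrease from 75 to 50 (negative)
```

A decrease is returned as a negative number, so the change from 75 to 50 is reported as about -33.3 rather than 33.3.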

Graphical Representations of data


Graphs are a good means of describing, exploring or summarizing numerical data because a
visual image can simplify complex information and help to highlight patterns and trends
in the data. There are a number of graphs which can depict numerical or categorical data. The
main graphs we will examine are bar charts, pie charts, histograms and scatter plots.

Bar charts
Bar charts are one of the most commonly used types of graph and they are used to display and
compare the number, frequency or other measure (e.g. mean) for different discrete categories
or groups. Bar charts are simple to create and very easy to interpret. They are also a flexible
chart type and there are several variations including horizontal bar charts, grouped, and stacked
bar charts.
The vertical bar chart below depicts flag of convenience fleets in 1976 per country of
registration. The graph is constructed such that the heights or lengths of the different bars are
proportional to the size of the category they represent. Since the x-axis (the horizontal axis)
represents the different categories it has no scale. The y-axis (the vertical axis) does have a scale
and this indicates the units of measurement. The bars can be drawn either vertically or
horizontally depending upon the number of categories and length or complexity of the category
labels. If there is more than one set of values for each category then grouped bar charts can be
used to display the data.


In Excel a chart in which the bars are presented vertically is referred to as a  column chart, whilst
a chart with horizontal bars is called a bar chart.

Grouped bar charts are a way of showing information about different sub-groups of the main
categories. In the example below the average composition of the USA workforce in millions
during 1986 is depicted.
A separate bar represents each of the sub-groups (e.g. professional) and these are usually
coloured or shaded differently to distinguish between them. In such cases, a legend or key is
usually provided to indicate what sub-group each of the shadings/colours represents.
Grouped bar charts can be used to show several sub-groups of each category but care needs to
be taken to ensure that the chart does not contain too much information, making it complicated
to read and interpret. Grouped bar charts can be drawn either horizontally or vertically
depending upon the nature of the data to be presented.

Stacked bar charts are similar to grouped bar charts in that they are used to display information
about the sub-groups that make up the different categories. In stacked bar charts the bars
representing the sub-groups are placed on top of each other to make a single column. The
overall height or length of the bar shows the total size of the category whilst different colours or
shadings are used to indicate the relative contribution of the different sub-groups.

Pie charts
Pie charts are a visual way of displaying how the total data are distributed between different
categories. A pie chart is a circular graph that shows the relative contribution of each
category to an overall total. Such graphs resemble a pie that has been cut into
different sized slices.
• Pie charts should only be used for displaying categorical data. They are generally best
for showing information grouped into a small number of categories (around 6) and are a
graphical way of displaying data that might otherwise be presented as a simple table.
When there are more categories it is difficult for the eye to distinguish between the
relative sizes of the different sectors and so the chart becomes difficult to interpret.
• Pie charts are generally used to show percentage or proportional data and usually the
percentage represented by each category is provided next to the corresponding slice of
pie.
The example below shows the proportional distribution of visitors between different types of
tourist attractions.


We observe that roughly 1 in 3 visitors (35%) preferred theme parks, although more
tourists preferred visiting museums and galleries (42%). Zoos and historical houses
were preferred by a smaller number of tourists, 8% and 15% respectively.

Histograms
Histograms are a special form of bar chart where the data represent continuous rather than
discrete categories. The example below presents the age distribution of some employees.
Because age is a continuous variable that can take a large number of possible values, the
employees are grouped into age intervals to reduce the number of data points.

Because the data represent continuous rather than discrete categories, there are no gaps
between the columns of a histogram. The above histogram depicts an approximately symmetrical
age distribution with the highest frequency of employees aged between 40 and 45 years old.
In a bar chart the length of the bar indicates the size of the category, but in a histogram it is
the area of the bar that is proportional to the size of the category. This difference is due to the
fact that in a histogram both the x-axis and y-axis have a scale, whereas in a bar chart only the y-
axis has a scale.
It is, however, possible to draw basic histograms using Excel by selecting either the column or bar
chart types. By default these chart types include a gap between the columns representing each
category but this can be removed, so that adjacent columns touch one another,
resulting in the chart appearing as a histogram.

Line graphs
Line graphs are usually used to show time series data – that is how one or more variables vary
over a continuous period of time. Typical examples of the types of data that can be presented
using line graphs are all Baltic Indices and most economic data captured as time series.

Line graphs are particularly useful for identifying patterns and trends in the data such as
seasonal effects, large changes and turning points. In a line graph the x-axis represents the
continuous variable (for example year, month or quarter) whilst the y-axis has a scale and
indicates the measurement. Several data series can be plotted on the same line chart and this is
particularly useful for analyzing and comparing the trends in different datasets. In the following
graph the Baltic Cape Index has been depicted on a monthly basis for the years 2006 and 2007.

Scatter plots
Scatter plots are used to show the relationship between pairs of quantitative measurements
made for the same object or individual. The data are displayed as a collection of points, each
having the value of one variable determining the position on the horizontal axis and the value of
the other variable determining the position on the vertical axis. For example, let’s assume we
are interested in the relationship between the Baltic Cape Index and the Dry Cargo Earnings. By
analyzing the pattern of dots that make up a scatter plot it is possible to identify whether there
is any systematic relationship between the two measurements. Such a pattern indicates an
association; on its own it does not establish that one measurement causes the other. Regression
lines can also be added to the graph to summarize the relationship and to help judge whether it
is systematic or could be due to chance.


Good graph design
There are a number of elements that are common in all types of graphs.
• The common feature of graphs is the Chart area which defines the boundary of all the
elements related to the graph including the plot itself and any headings and explanatory
text.
• All graphs should include a title that summarizes what the graph shows. The title should
identify what is being described and the units of measurement (e.g. percentages, total
number, frequency).
• If the graph you are presenting is based on data from another publication then you
should acknowledge the source of the original data somewhere within the chart area or
title.
• In bar charts, histograms, and pie charts, shading and colour are often used to
distinguish the areas representing different categories, ensuring your chart is easy to
interpret.
Graphs can also be depicted in three dimensions. In general, however, the use of 3D makes it
much more difficult to interpret the data presented in a chart or graph because the false depth
and perspective that are added make reading and comparing values extremely difficult.
In the following graph it is difficult to distinguish the number of speeding offences from one
year to another.


Source: The Home Office Statistical Bulletin, Motoring Offences, England and Wales 1997

Descriptive Statistics - Summary Measurements

The description of statistical data may be quite elaborate or quite brief depending on two
factors: the nature of data and the purpose for which the same data have been collected. So far
we have considered tabular and graphical representation of raw data. These types of data
presentation take in several pieces of information. Any set of observations can be further
described by a series of measurements in order to communicate the largest amount of
information as simply as possible. These measurements or calculations define the data in terms
of their spread and shape of their distribution as well as presenting values which typify the data.
Therefore there are three main types of measurements:
• The central tendency is the extent to which all the data values group around a typical or
central value.
• The variation is the amount of dispersion, or scattering, of values.
• And finally the shape is the pattern of the distribution of values from the lowest value to
the highest value.

Descriptive Statistics - Measures of Central Tendency or location


Central tendency is defined as “the statistical measure that identifies a single value as
representative of an entire distribution.” It aims to provide an accurate description of the entire
data. It is the single value that is most typical / representative of the collected data.
The farthest one can reduce a set of data, and still retain any information at all, is to summarize
the data with a single value. Measures of location do just that: They try to capture with a single
number what is typical of the data. What single number is most representative of an entire list
of numbers? We cannot say without defining "representative" more precisely. We will study
three common measures of location: the mean, the median, and the mode. The mean, median
and mode are all "most representative," but for different, related notions of representativeness.


Arithmetic Mean for single data
Probably the best known of the measurements of central tendency is the arithmetic mean or
the average. Adding all the observations and dividing the sum by the number of observations
gives the arithmetic mean. Suppose we have the following observations:
10 15 30 7 42 79 83
These are seven observations. Symbolically, the arithmetic mean, also called simply the mean, is
\[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{10 + 15 + 30 + 7 + 42 + 79 + 83}{7} = \frac{266}{7} = 38 \]
The formula given above is the basic formula that forms the definition of the arithmetic mean and is
used in the case of ungrouped data where weights are not involved. The sign Σ, pronounced sigma,
is the capital letter in the Greek alphabet for the sound s, but mathematically it means sum or
add up all available values. The Σ notation enables us to write a compact formula for the
calculation of the mean, which can otherwise be a lengthy process to describe.
The arithmetic mean is sensitive to extreme values or outliers especially when the sample size is
small. Therefore, it is not an appropriate measure of central tendency for non-symmetrical
distributions.
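
The sensitivity to outliers can be seen in a short Python sketch. The seven observations are those given above; the extra value of 1000 is a hypothetical outlier added purely for illustration.

```python
from statistics import mean

observations = [10, 15, 30, 7, 42, 79, 83]
print(mean(observations))      # 38

# Hypothetical outlier appended to the same small sample
with_outlier = observations + [1000]
print(mean(with_outlier))      # 158.25
```

A single extreme value moves the mean of this small sample from 38 to 158.25, a figure that is larger than seven of the eight observations.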

Weighted Mean for single data


Weighted mean is calculated when certain values in a data set are more important than the
others. A weight wi is attached to each of the values xi to reflect this importance. Our approach
for calculating arithmetic mean will be different from the one used earlier.
Example: An investor is fond of investing in equity shares. During a period of falling prices in the
stock exchange, a stock is sold at 12 euros per share on one day, 10 euros on the next and 9
euros on the third day. The investor has purchased 50 shares on the first day, 80 shares on the
second day and 100 shares on the third day. What average price per share did the investor pay?
Solution: Calculation of Weighted Average Price

First we tabulate the information provided. The weighted average is calculated as the ratio of
the sum of the products of the price per share and the number of shares purchased over the
total number of shares purchased.
\[ \text{Weighted Average} = \frac{\sum wx}{\sum w} = \frac{w_1 x_1 + w_2 x_2 + w_3 x_3}{w_1 + w_2 + w_3} = \frac{600 + 800 + 900}{50 + 80 + 100} = \frac{2300}{230} = 10 \]
Therefore, the investor paid an average price of 10 euros per share.


It will be seen that if merely prices of the shares for the three days (regardless of the number of
shares purchased) were taken into consideration, then the average price would be
\[ \frac{12 + 10 + 9}{3} = 10.33 \]
This is an unweighted or simple average as it ignores the shares purchased. A simple average is
also a weighted average where weight in each case is the same, that is, only 1. When we use the
term average alone, we always mean that it is an unweighted or simple average.
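
The two averages can be compared in a short Python sketch. The prices and share counts come from the example; the helper name `weighted_mean` is an assumption made for illustration.

```python
def weighted_mean(values, weights):
    """Sum of weight*value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

prices = [12, 10, 9]      # euros per share on days 1, 2 and 3
shares = [50, 80, 100]    # shares purchased on each day

print(weighted_mean(prices, shares))                   # 10.0
print(round(weighted_mean(prices, [1, 1, 1]), 2))      # 10.33 (simple average)
```

Passing equal weights reproduces the simple average, which matches the remark that a simple average is a weighted average with every weight equal to 1.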

Arithmetic Mean for grouped data


In most cases the count of our data/observations is rather big. One way of summarizing our
observations is with frequency tables as we have seen before. There are two types of frequency
tables, namely:
• Frequency tables of discrete data, and
• Frequency tables of continuous data.

Example: Let’s consider the case where we have tabulated the number of vessels under
management by a group of small independent shipping companies as follows

The variable which we are measuring is the number of vessels under management and hence we call
it x. The next column gives the number of times each value of x occurs and we call it f for
frequency. In order to derive the average number of vessels under management we need to
consider that 1 vessel is reported by 23 companies, 2 vessels are reported by 12 companies and so
forth. Therefore we have included a third column alongside the frequency distribution which
shows the individual values of each product f*x.
The formula for the mean is given by
\[ \bar{x} = \frac{\sum (f \cdot x)}{\sum f} = \frac{101}{50} = 2.02 \]
In other words, to get the mean we multiply each value of x by the number of times f it occurs
to get each fx product, add all these products together and divide the final total by the
number of values in the distribution, obtained by adding up all the frequencies.


Example: Now we will look at an example of grouped data. The weight in tonnes of 46 containers
was recorded and assigned to each predetermined weight class as follows. We are interested in
finding the mean weight of these containers.

Since the weights are assigned to classes we need to construct a single value (x) that
represents each interval. As we have no information on the exact weight of each container, we
take the midpoint of each class interval as a good approximation of the mean of the weights
falling within that class. This is based on the assumption that the values are distributed
fairly evenly throughout the interval. When the frequencies are large, this assumption
is usually accepted. Therefore the mean weight is calculated as before using the formula
\[ \bar{x} = \frac{\sum (f \cdot x)}{\sum f} = \frac{516}{46} = 11.2 \]
Thus the average weight of the 46 containers is 11.2 tonnes.
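
The midpoint calculation can be sketched in Python as follows. The frequency table itself is not reproduced in this text, so the class list below is partly an assumption: the first four frequencies (5, 8, 14, 10) match figures quoted elsewhere in this material, while the split of the remaining 9 containers into two upper classes is a guess chosen so that the totals agree with Σf = 46 and Σ(f·x) = 516.

```python
# (lower bound, upper bound, frequency) for each weight class, in tonnes.
# The last two classes (5 and 4 containers) are assumed, not taken from the text.
classes = [(0, 4, 5), (4, 8, 8), (8, 12, 14), (12, 16, 10), (16, 20, 5), (20, 24, 4)]

def grouped_mean(table):
    """Mean of grouped data, using each class midpoint as the representative value x."""
    total_f = sum(f for _, _, f in table)
    total_fx = sum(f * (lower + upper) / 2 for lower, upper, f in table)
    return total_fx / total_f

print(round(grouped_mean(classes), 1))   # 11.2
```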

Overall, because the arithmetic mean is based on all the items in a series, a change in the value
of any item will lead to a change in the value of the arithmetic mean. Also, in the case of a highly
skewed distribution, the arithmetic mean may be distorted by a few items with extreme
values. In such a case, it may cease to be representative of the distribution.

Median for single data


Median is defined as the value of the middle item (or the mean of the values of the two middle
items) when the data are arranged in an ascending or descending order of magnitude. The
median as a measurement has the property of dividing the data into two equal halves. Half of
the observations are below the median value and the other half above it. For a series with an odd
number of observations the median is the middle value, whereas for a series with an even number
of observations the median is the average of the two middle values.
Suppose we have the following series:
15 19 21 7 10 33 25 18 5
We have to first arrange it in either ascending or descending order. These figures are arranged
in an ascending order as follows:

5 7 10 15 18 19 21 25 33
The median position is derived from the formula (n+1)/2. In this example the median position
is (9+1)/2 = 5, that is, the 5th observation. Therefore the median value is 18.

Suppose the series contains one more item, 23. We therefore have to include 23 in the
above series at the appropriate place, that is, between 21 and 25. Thus, the series is now
5 7 10 15 18 19 21 23 25 33
Applying the above formula, the median is located at position (10+1)/2 = 5.5. Here, we have to
take the average of the values of the 5th and 6th items. This means an average of 18 and 19,
which gives the median as 18.5.
It may be noted that the formula (n+1)/2 merely indicates the position of the median, namely,
the number of items we have to count until we arrive at the item whose value is the median.
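
The positional rule can be written out in a few lines of Python. This is a sketch only; the function name `median_by_position` is an assumption, and the two series are the ones used in the examples above.

```python
def median_by_position(data):
    """Median found by counting to position (n + 1) / 2 in the ordered data."""
    if not data:
        raise ValueError("no observations")
    ordered = sorted(data)
    position = (len(ordered) + 1) / 2        # 1-based position of the median
    lower = ordered[int(position) - 1]
    if position.is_integer():                # odd n: a single middle item
        return lower
    upper = ordered[int(position)]           # even n: average the two middle items
    return (lower + upper) / 2

series = [15, 19, 21, 7, 10, 33, 25, 18, 5]
print(median_by_position(series))            # 18
print(median_by_position(series + [23]))     # 18.5
```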

Median for grouped data


To calculate the median of a frequency distribution we make use of the associated cumulative
frequency distribution. For large amounts of data of n items, where n is 30 or more, we usually
say the median is the (n/2)th item. Let’s consider the previous example of the frequency table of
discrete values.

Hence in this case the median is located at the 50/2 = 25th observation. So we are looking for the
data value (x) which contains the 25th observation. From the cumulative frequency column we
observe that the median data value is 2. Therefore, half of the companies manage up to 2
vessels.

In the case of a grouped series, let’s revisit the previous example regarding container weights


In this case we first locate the class that contains the median value and then we calculate the
exact value by linear interpolation.
The median is located at the 46/2 = 23rd observation. So we are looking for the class interval
which contains the 23rd observation. From the cumulative frequency column we observe that
the median is located in the class [8, 12). The next step is to calculate the exact median
weight. For this we utilize the following formula
\[ M = L_m + \frac{\frac{n}{2} - CF_{m-1}}{f_m} \times w_m \]
where M is the median,
L_m is the lower bound of the median class,
n is the number of observations,
CF_{m-1} is the cumulative frequency before the median class,
f_m is the frequency of the median class and
w_m is the width of the median class.

For our example we have L_m = 8, n/2 = 23, CF_{m-1} = 13, f_m = 14 and w_m = 4.

Therefore
\[ M = 8 + \frac{23 - 13}{14} \times 4 = 10.9 \text{ tonnes} \]
Thus half of the containers weigh up to 10.9 tonnes.
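
The interpolation step can be sketched in Python. As the frequency table is not reproduced here, the class list is partly assumed: the frequencies of the first four classes follow from the figures quoted above (cumulative frequency 13 before the median class, 14 in it), whereas the two upper classes are an assumption that only brings the total to 46.

```python
# (lower bound, upper bound, frequency); the last two classes are assumed.
classes = [(0, 4, 5), (4, 8, 8), (8, 12, 14), (12, 16, 10), (16, 20, 5), (20, 24, 4)]

def grouped_median(table):
    """Median of grouped data by linear interpolation inside the median class."""
    n = sum(f for _, _, f in table)
    if n == 0:
        raise ValueError("empty frequency table")
    target = n / 2                     # position of the median observation
    cumulative = 0                     # cumulative frequency before the current class
    for lower, upper, freq in table:
        if freq > 0 and cumulative + freq >= target:
            return lower + (target - cumulative) / freq * (upper - lower)
        cumulative += freq

print(round(grouped_median(classes), 1))   # 10.9
```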

The median as a measurement is not influenced by extreme values and it is preferred in the case of
a distribution having outliers. In the case of qualitative data which are ranked rather than
counted, it is considered the most appropriate measure of central tendency.

Mode
The mode is another measure of location. It is the value which occurs most frequently. As an
example, consider the following series:
8, 9, 11, 15, 16, 12, 15, 3, 7, 15

There are ten observations in the series wherein the figure 15 occurs three times. The mode is
therefore 15 and the data series is called unimodal.
In the case where you have two or more observations repeating themselves exactly the same
number of times then the data series will have more than one mode. There are data series
consisting of distinct observations. In these cases there will be no mode.
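
The three cases (one mode, several modes, no mode) can be illustrated with a small Python sketch. The function name `modes` is an assumption; the first series is the one from the text and the other two are made-up examples.

```python
from collections import Counter

def modes(data):
    """Return the list of modes; an empty list means the series has no mode."""
    counts = Counter(data)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1 and len(counts) > 1:   # every observation is distinct
        return []
    return sorted(value for value, count in counts.items() if count == top)

print(modes([8, 9, 11, 15, 16, 12, 15, 3, 7, 15]))   # [15]    unimodal
print(modes([1, 1, 2, 2, 3]))                        # [1, 2]  two modes (hypothetical data)
print(modes([4, 5, 6]))                              # []      no mode (hypothetical data)
```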
In the case of qualitative data and especially for nominal data, the mode is the only
representative value to be quoted. Take for example, the case of cargo marking codes. Then the
code number which occurs most frequently is the mode.
In the case of grouped data, mode is determined by the highest frequency of the class intervals.
In the preceding example of container weights the highest frequency of 14 containers falls in the
class [8, 12). Thus the modal class in this frequency distribution is [8, 12). The next step is to
determine the exact value of the mode. We will employ the following formula
\[ T = L_T + \frac{\Delta_1}{\Delta_1 + \Delta_2} \times w_T \]
where T is the mode,
L_T is the lower bound of the modal class,
Δ1 is (modal class frequency – frequency of the class before the modal class),
Δ2 is (modal class frequency – frequency of the class after the modal class) and
w_T is the width of the modal class.

For our example we have L_T = 8, Δ1 = (14 – 8) = 6, Δ2 = (14 – 10) = 4 and w_T = 4.

Therefore
\[ T = 8 + \frac{6}{6 + 4} \times 4 = 10.4 \text{ tonnes} \]
Thus the most frequent container weight is approximately 10.4 tonnes.

While applying the above formula, we should ensure that the class-intervals are uniform
throughout. If the class-intervals are not uniform, then they should be made uniform on the
assumption that the frequencies are evenly distributed throughout the class. In the case of
unequal class-intervals, the application of the above formula will give misleading results.
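
A minimal Python sketch of the modal-class formula follows. It assumes equal class widths and a single highest-frequency class, as required above. The class list is partly assumed: the frequencies 8, 14 and 10 around the modal class come from the text, while the remaining classes are a guess that only completes the table.

```python
# (lower bound, upper bound, frequency); the first and the last two classes are assumed.
classes = [(0, 4, 5), (4, 8, 8), (8, 12, 14), (12, 16, 10), (16, 20, 5), (20, 24, 4)]

def grouped_mode(table):
    """Mode of grouped data with equal class widths and one modal class."""
    freqs = [f for _, _, f in table]
    m = freqs.index(max(freqs))                          # index of the modal class
    lower, upper, f_m = table[m]
    f_before = freqs[m - 1] if m > 0 else 0              # frequency before the modal class
    f_after = freqs[m + 1] if m < len(freqs) - 1 else 0  # frequency after the modal class
    d1, d2 = f_m - f_before, f_m - f_after
    return lower + d1 / (d1 + d2) * (upper - lower)

print(round(grouped_mode(classes), 1))   # 10.4
```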

Relationships of the Mean, Median and Mode


Having discussed mean, median and mode, we now turn to the relationship amongst these
three measures of central tendency. We shall discuss the relationship assuming that there is a
unimodal frequency distribution.
• When a distribution is symmetrical, the mean, median and mode are the same, as is
shown in the following figure.

• When a distribution is skewed to the right, then mode < median < mean.

For example, income distribution is skewed to the right where a large number of
families have relatively low income and a small number of families have extremely high
income. In such a case, the mean is pulled up by the extreme high incomes and the
relation among these three measures is as shown in the figure above.
• When a distribution is skewed to the left, then mean < median < mode. This is because
here the mean is pulled down below the median by extremely low values. This is shown
in the figure.
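
The ordering of the three measures can be checked on small made-up samples. Both data sets below are hypothetical and were chosen only to have one long tail each; they are not taken from the text.

```python
from statistics import mean, median, mode

# Hypothetical right-skewed sample: many low values and one very high value
right_skewed = [20, 20, 20, 25, 30, 35, 40, 200]
# Hypothetical left-skewed sample: many high values and one very low value
left_skewed = [1, 160, 165, 170, 175, 180, 180, 180]

for label, data in (("right-skewed", right_skewed), ("left-skewed", left_skewed)):
    print(label, "mode:", mode(data), "median:", median(data), "mean:", mean(data))
```

For the first sample the mode (20) is below the median (27.5), which is below the mean (48.75); for the second the order is reversed.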

The best measure of Central Tendency


Choosing the best of these three measures of central tendency is not easy since these measures
are based upon different concepts. The arithmetic mean is the sum of the values divided by the
total number of observations in the series. The median is the value of the middle observation
that divides the series into two equal parts. Mode is the most frequent occurring observation.
As such, the use of a particular measure will largely depend on the purpose of the study and the
nature of the data.

Central Tendency

Arithmetic mean: \[ \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \]
Median: middle value in the ordered array
Mode: most frequently observed value

Consider the following example where the monthly earnings in a certain small shipping company
were recorded and the three measures of location were calculated in euros as follows:
Mean=3,500, Median=2,000 and Mode=1,500.

In order to assess the earning structure the employees choose the mode as their average salary
while the management chooses the mean. An outside negotiator wishing to compare with other
companies chooses the median as their average salary since half the employees earn below that
amount and the other half above it.

Geometric Mean
Apart from the three measures of central tendency as discussed above, there are two other
means that are used sometimes in business and economics. These are the geometric mean and
the harmonic mean. The geometric mean is more important than the harmonic mean.
Geometric mean is based on each and every observation in the data set. It is defined as the nth
root of the product of n observations of a distribution:
\[ \bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n} \]
It is used to average ratios and percentages and also to calculate growth rates. In particular
it can be used to measure the rate of return of an investment over time:
\[ \bar{R}_G = [(1 + R_1) \times (1 + R_2) \times \cdots \times (1 + R_n)]^{1/n} - 1 \]
where R_i is the rate of return in time period i.

Example: An investment of $100,000 declined to $50,000 at the end of year one and rebounded
to $100,000 at end of year two:

The overall two-year return is zero, since it started and ended at the same level.
First we will use the 1-year returns to compute the arithmetic mean:
\[ \bar{X} = \frac{(-0.5) + (1)}{2} = 0.25 = 25\% \]
Obviously this is a misleading result.
Now we will apply the above formula to compute the geometric mean for the two-year
investment return:
\[ \bar{R}_G = [(1 + R_1) \times (1 + R_2)]^{1/2} - 1 = [(1 + (-0.5)) \times (1 + 1)]^{1/2} - 1 = [0.50 \times 2]^{1/2} - 1 = 1^{1/2} - 1 = 0\% \]
This is a more representative result.
As compared to the arithmetic mean, the geometric mean gives more weight to small values and
less weight to large values. As a result of this characteristic, it is generally less than the
arithmetic mean; at times it may be equal to the arithmetic mean. The geometric mean is rather
difficult to understand, and a major disadvantage is that it cannot be meaningfully calculated
when the data series contains a negative or zero observation.
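
The investment example can be reproduced with a short Python sketch. The function name `geometric_mean_return` is an assumption; the returns of -50% and +100% are those of the example.

```python
import math

def geometric_mean_return(returns):
    """Geometric mean rate of return; assumes every (1 + R) factor is positive."""
    growth = math.prod(1 + r for r in returns)   # overall growth factor
    return growth ** (1 / len(returns)) - 1

returns = [-0.5, 1.0]                            # year one: -50%, year two: +100%
arithmetic = sum(returns) / len(returns)
geometric = geometric_mean_return(returns)

print(arithmetic)   # 0.25
print(geometric)    # 0.0
```

The geometric figure agrees with the fact that the investment ends where it started, while the arithmetic mean of the yearly returns suggests a 25% gain that never occurred.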

Harmonic Mean
The harmonic mean is defined as the reciprocal of the arithmetic mean of the reciprocals of
individual observations. Symbolically, it is denoted as
\[ \bar{x}_H = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}} \]
where all observations are positive real numbers.

For example let’s consider the case where a ship steams at 15 knots for an outward 10 nautical
miles and at 11 knots for the return 10 nautical miles. The average speed for the whole journey
is the harmonic mean and not the arithmetic mean.
The derivation of the harmonic mean is beyond the scope of this presentation.
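
Although the derivation is omitted, the ship example can be checked numerically. The sketch below uses `harmonic_mean` from Python's standard `statistics` module and compares it with total distance divided by total time; the variable names are assumptions made for illustration.

```python
from statistics import harmonic_mean

speeds = [15, 11]        # knots on the outward and return legs
leg_distance = 10        # nautical miles per leg

average_speed = harmonic_mean(speeds)

# Cross-check: total distance divided by total steaming time
total_time = sum(leg_distance / v for v in speeds)
direct = 2 * leg_distance / total_time

print(round(average_speed, 2))   # 12.69
print(round(direct, 2))          # 12.69
```

The arithmetic mean of the two speeds would be 13 knots, which overstates the average because the ship spends more time steaming at the slower speed.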

Quartiles – Single Data

At this stage, let us introduce two other concepts, namely quartiles and deciles. To understand
these, we should first know that the median belongs to a general class of statistical descriptions
called fractiles. A fractile is a value below which lies a given fraction of a set of data.
Quartiles split the ranked data into 4 segments with an equal number of values per segment as
demonstrated here

|  25%  |  25%  |  25%  |  25%  |
        Q1      Q2      Q3

• The first quartile, Q1, is the value for which 25% of the observations are smaller and
75% are larger.
• Q2 is the same as the median (50% of the observations are smaller and 50% are larger).
• The third quartile, Q3, is the value for which 75% of the observations are smaller and
25% are greater than the third quartile.
We calculate quartiles by determining the value in the appropriate position in the ranked data,
where
• First quartile position: Q1 = (n+1)/4 ranked value
• Second quartile position: Q2 = (n+1)/2 ranked value
• Third quartile position: Q3 = 3(n+1)/4 ranked value
where n is the number of observed values.
When calculating the ranked position use the following rules:
• If the result is a whole number then it is the ranked position to use.
• If the result is a fractional half (e.g. 2.5, 7.5, 8.5, etc.) then average the two
corresponding data values.
• If the result is not a whole number or a fractional half then round the result to the
nearest integer to find the ranked position.

Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22 (n = 9)
• Q1 is in the (9+1)/4 = 2.5 position of the ranked data, so Q1 = (12+13)/2 = 12.5
• Q2 is in the (9+1)/2 = 5th position of the ranked data, so Q2 = median = 16
• Q3 is in the 3(9+1)/4 = 7.5 position of the ranked data, so Q3 = (18+21)/2 = 19.5
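
The position rules above can be sketched in Python. The function names are assumptions, and the sketch follows the rounding rules of this material rather than the interpolation used by most statistical software, so results may differ slightly from spreadsheet or library output.

```python
def value_at_position(ordered, position):
    """Value at a 1-based ranked position, using the three rounding rules."""
    if position == int(position):                # whole number: use it directly
        return ordered[int(position) - 1]
    if position * 2 == int(position * 2):        # fractional half: average neighbours
        k = int(position)
        return (ordered[k - 1] + ordered[k]) / 2
    return ordered[round(position) - 1]          # otherwise round to nearest position

def quartiles(data):
    """Return (Q1, Q2, Q3) from positions (n+1)/4, (n+1)/2 and 3(n+1)/4."""
    ordered = sorted(data)
    n = len(ordered)
    if n < 3:
        raise ValueError("this sketch assumes at least three observations")
    return tuple(value_at_position(ordered, k * (n + 1) / 4) for k in (1, 2, 3))

sample = [11, 12, 13, 16, 16, 17, 18, 21, 22]
print(quartiles(sample))   # (12.5, 16, 19.5)
```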

Quartiles – Grouped Data


In the case of a grouped series, let’s revisit the previous example regarding container weights


In this case we first locate the classes that contain the first and third quartile values and then
we calculate the exact values by linear interpolation.
• The first quartile Q1 is located at the n/4 = 46/4 = 11.5th observation. So we are looking for
the class interval which contains the 11.5th observation. From the cumulative frequency
column we observe that Q1 is located in the class [4, 8).
• The third quartile Q3 is located at the 3 × (n/4) = 3 × 46/4 = 34.5th observation. So we are
looking for the class interval which contains the 34.5th observation. From the cumulative
frequency column we observe that Q3 is located in the class [12, 16).
The next step is to calculate the exact quartile value. For this we utilize the following formula
\[ Q_i = L_{Q_i} + \frac{\frac{i \times n}{4} - CF_{Q_i - 1}}{f_{Q_i}} \times w_{Q_i} \]
where Q_i is the quartile,
L_{Q_i} is the lower bound of the quartile class,
i is the relevant quartile, i.e. 1, 2 or 3,
n is the number of observations,
CF_{Q_i-1} is the cumulative frequency before the quartile class,
f_{Q_i} is the frequency of the quartile class and
w_{Q_i} is the width of the quartile class.

For the Q1 value we have L_{Q_1} = 4, i = 1, n/4 = 11.5, CF_{Q_1-1} = 5, f_{Q_1} = 8 and w_{Q_1} = 4.
Therefore
\[ Q_1 = 4 + \frac{1 \times 11.5 - 5}{8} \times 4 = 7.25 \text{ tonnes} \]
Thus one in four containers weighs up to 7.25 tonnes.
For the Q3 value we have L_{Q_3} = 12, i = 3, n/4 = 11.5, CF_{Q_3-1} = 27, f_{Q_3} = 10 and w_{Q_3} = 4.
Therefore
\[ Q_3 = 12 + \frac{3 \times 11.5 - 27}{10} \times 4 = 15 \text{ tonnes} \]
Thus 3 out of 4 containers weigh up to 15 tonnes.
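
The same interpolation can be sketched in Python for any of the three quartiles. The class list is partly assumed: the frequencies of the first four classes are implied by the figures quoted above, whereas the two upper classes are an assumption that only brings the total to 46 containers.

```python
# (lower bound, upper bound, frequency); the last two classes are assumed.
classes = [(0, 4, 5), (4, 8, 8), (8, 12, 14), (12, 16, 10), (16, 20, 5), (20, 24, 4)]

def grouped_quartile(table, i):
    """i-th quartile (i = 1, 2 or 3) of grouped data by linear interpolation."""
    n = sum(f for _, _, f in table)
    if n == 0:
        raise ValueError("empty frequency table")
    target = i * n / 4                 # position of the quartile observation
    cumulative = 0                     # cumulative frequency before the current class
    for lower, upper, freq in table:
        if freq > 0 and cumulative + freq >= target:
            return lower + (target - cumulative) / freq * (upper - lower)
        cumulative += freq

print(grouped_quartile(classes, 1))             # 7.25
print(round(grouped_quartile(classes, 2), 1))   # 10.9 (the median)
print(grouped_quartile(classes, 3))             # 15.0
```

Replacing the divisor 4 by 10 or 100 in the target position would give deciles or percentiles in the same way.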


In the same manner, we can calculate deciles where the series is divided into 10 parts and
percentiles where the series is divided into 100 parts.

Descriptive Statistics – Measures of dispersion and skewness


Previously, we outlined the measures of central tendency. It should be noted that these
calculations do not indicate the extent of variability in the distribution. The dispersion or
variability allows us to better understand the pattern of the data we are to examine. A low
degree of dispersion indicates a high level of uniformity, which is often a desirable quality. For
example, if there is a high degree of variability in the raw material of a business, it may not
find mass production economical.
If an investor is looking for suitable shares to invest in, he should first examine the movements
of the share prices. A risk-averse investor would refrain from highly fluctuating share
prices, since extreme fluctuations mean that there is a high risk in the investment.
The various measures of central value give us one single figure that represents the entire data.
But these measures alone cannot adequately describe a set of observations, unless all the
observations are the same. In two or more distributions the central value may be the same but
still there can be wide disparities in the formation of distribution. Measures of dispersion help
us in studying this important characteristic of a distribution.
A measure of variation or dispersion is one that measures the extent to which there are
differences between individual observations and some central or average value. By observing the
dispersion of a set of observations we can get an indication of the homogeneity or
heterogeneity of the distribution.
There are five measures of dispersion: range, variance, standard deviation, inter-quartile
deviation, and relative variation.

Range
The simplest measure of dispersion is the range, which is essentially the difference between the
highest value and the lowest value of the data. Therefore Range = X largest – X smallest.
Example: Find the range for the following three sets of data:

• In each of these three data sets, the highest number is 15 and the lowest number is 5.
• Since the range is the difference between the maximum value and the minimum value
of the data, it is 10 in each case.
• But the range fails to give any idea about the dispersal or spread of the series between
the highest and the lowest value.


In a frequency distribution, range is calculated by taking the difference between the upper limit
of the highest class and the lower limit of the lowest class.
Example: Find the range for the following frequency distribution:

 Here, the upper limit of the highest class is 119 and the lower limit of the lowest class is
20.
 Hence, the range is Highest limit – lower limit =119 - 20 = 99.
 Please observe that the range is not influenced by the frequencies.
We will now define a relative measure called the coefficient of range, calculated by the formula
$$\text{Coefficient of range} = \frac{\text{Highest value} - \text{lowest value}}{\text{Highest value} + \text{lowest value}}$$
Therefore for the above frequency distribution the coefficient of range is (119 - 20) / (119 + 20) = 71.2%.
The coefficient of range for the earlier example with three data sets is (15 - 5) / (15 + 5) = 50%.
 The range is mainly used in situations where one may wish to get a quick idea of the
variability of a data set.
 In the case where we have small sample sizes, the range is considered a quite adequate
measure of the variability. Therefore, it is widely used in quality control where variability
checks of raw material are needed. The range is also a suitable measure in weather
forecasts where the maximum and minimum temperatures are provided.
Obviously the range has a number of limitations.
 First you could observe that it is based only on two items and does not cover all
observations in a distribution. Therefore it does not provide any idea about the pattern
of the data distribution.
 Furthermore it is sensitive to outliers as shown below. Consider the following two data
series which are identical except the last value which is 5 in the first series and 120 in
the second one.
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
The range in the first case is just 4, whereas in the second data series it is 119!
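As a rough illustration of the range, the coefficient of range and the effect of a single outlier, the figures in this section can be recomputed with a short Python sketch. The function names are hypothetical helpers introduced only for this example.

```python
# Range and coefficient of range, as defined in the text.
# The function names are illustrative; they do not come from any library.

def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)


def coefficient_of_range(values):
    """(highest - lowest) / (highest + lowest)."""
    highest, lowest = max(values), min(values)
    return (highest - lowest) / (highest + lowest)


# The two series from the text: identical except for the last value.
series_1 = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 5]
series_2 = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 120]

print(value_range(series_1))           # 4
print(value_range(series_2))           # 119
print(coefficient_of_range([5, 15]))   # 0.5, the 50% quoted for the three data sets
```

Only the minimum and the maximum enter either calculation, which is why changing one extreme value changes the result so strongly.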

Variance and Standard deviation



 The variance of a sample of data, s², is a measure of how spread out the data set is. It
indicates the average squared deviation of values from the mean, as shown in the formula
$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$$
Calculating variance involves squaring deviations, so it does not have the same unit of
measurement as the original observations. For example, lengths measured in metres
(m) have a variance measured in metres squared (m²).
 Taking the square root of the variance gives us the units used in the original scale and
this is the standard deviation, denoted as s. It shows variation about the mean. The
sample standard deviation is defined as
$$s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}}$$
where $\bar{X}$ = arithmetic mean, n = sample size, $X_i$ = ith value of the variable X.

Standard deviation is the measure of spread most commonly used in statistical practice
when the mean is used as the measure of central tendency. Thus, it measures spread around
the mean. Because of its close links with the mean, the standard deviation can be greatly
affected when the mean gives a poor measure of central tendency, for example in the
presence of outliers.

Steps for Computing Standard Deviation


1. Compute the difference between each value and the mean.
2. Square each difference.
3. Add the squared differences.
4. Divide this total by n-1 to get the sample variance.
5. Take the square root of the sample variance to get the sample standard deviation.

Example: Sample data (Xi): 10 12 14 15 17 18 18 24


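where n = 8 and mean $\bar{X}$ = 16.
Therefore,
$$\begin{aligned}
s &= \sqrt{\frac{(10-\bar{X})^2 + (12-\bar{X})^2 + (14-\bar{X})^2 + \cdots + (24-\bar{X})^2}{n-1}} \\
  &= \sqrt{\frac{(10-16)^2 + (12-16)^2 + (14-16)^2 + \cdots + (24-16)^2}{8-1}} \\
  &= \sqrt{\frac{130}{7}} = 4.3095
\end{aligned}$$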
Practically speaking the standard deviation is a measure of the “average” scatter around the
mean.
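The five steps listed earlier can be traced in code for the same sample data. This is a minimal sketch in Python; the variable names are chosen for this example only, and the result is cross-checked against the standard library function `statistics.stdev`, which also divides by n - 1.

```python
from math import sqrt
import statistics

data = [10, 12, 14, 15, 17, 18, 18, 24]   # sample data from the example
n = len(data)
mean = sum(data) / n                       # 16.0

# Step 1: difference between each value and the mean
differences = [x - mean for x in data]
# Step 2: square each difference
squared = [d ** 2 for d in differences]
# Step 3: add the squared differences
total = sum(squared)                       # 130.0
# Step 4: divide by n - 1 to get the sample variance
sample_variance = total / (n - 1)
# Step 5: take the square root to get the sample standard deviation
sample_sd = sqrt(sample_variance)

print(round(sample_sd, 4))                 # 4.3095
print(statistics.stdev(data))              # same value from the standard library
```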

In the case of continuous frequency distributions the standard deviation is calculated as follows.



Consider the previous example of container weights.

The column f·x² is calculated by multiplying each x value by its corresponding f·x value.

The formula for the standard deviation for frequency distributions is given by
$$s = \sqrt{\frac{\sum f x^2}{\sum f} - \left(\frac{\sum f x}{\sum f}\right)^2}$$
Therefore the spread of the container weights as expressed by the standard deviation is
calculated as
$$s = \sqrt{\frac{7224}{46} - \left(\frac{\sum f x}{\sum f}\right)^2} = \sqrt{157.04 - 125.44} = \sqrt{31.6} = 5.62 \text{ tonnes.}$$
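The same calculation can be sketched in Python for a grouped frequency table. The midpoints and frequencies below are hypothetical and are not the container-weight data; they only show how the f·x and f·x² columns feed the formula, which divides by the total frequency.

```python
from math import sqrt

# Hypothetical grouped data (NOT the container weights from the text):
# class midpoints x and their frequencies f.
midpoints = [5, 10, 15, 20]
frequencies = [2, 5, 2, 1]

# Column f*x
fx = [f * x for f, x in zip(frequencies, midpoints)]
# Column f*x^2: each x multiplied by its corresponding f*x value
fx2 = [x * v for x, v in zip(midpoints, fx)]

total_f = sum(frequencies)                 # 10
grouped_mean = sum(fx) / total_f           # 11.0
grouped_sd = sqrt(sum(fx2) / total_f - grouped_mean ** 2)

print(round(grouped_sd, 4))                # 4.3589
```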

Generally,
 The more widely spread the values are, the larger the range, variance and standard
deviation are. The more the data are concentrated, the smaller the range, variance, and
standard deviation.
 If the values are all the same with no variation, then all these measures will be zero.
 None of these measures is ever negative.
 The standard deviation is computed from every value in the data set, hence it is
influenced by outliers.

Comparing Standard Deviations


Standard deviation might be difficult to interpret in terms of its magnitude in order to establish
the degree of spread of the data. Whether a given spread is large or small depends on the
magnitude of the mean of the data series examined. A difference of $10,000 in annual revenues
between two large companies is considered pretty close, while a weight difference of 30
kilograms between two individuals is considered far apart. Therefore, it is useful to assess the
size of the standard deviation relative to the mean of the data set.
Comparing the two data distributions below you can observe that they have the same mean but
obviously different standard deviations. If you were to select an observation randomly, the
chance of it being close to the mean is higher for the distribution with the smaller standard
deviation.

[Figure: two distributions with the same mean, one with a smaller and one with a larger standard deviation]

Let’s take the case of three data series as shown here with the same mean but different
standard deviations. Data set B below has the same mean of 15.5 as the other two data
sets but a narrower spread of measurements around the mean, and therefore has
comparatively fewer high or low values.

The standard deviation is an absolute measure of dispersion, as it measures variation in the same
units as the original data. Therefore it is not a suitable measure when comparing two or more
distributions with different means or units. For this we should use a relative measure of
dispersion. One such measure is the coefficient of variation, denoted as CV, which is defined as
the ratio of the standard deviation to the mean, expressed as a percentage:
$$CV = \left(\frac{s}{\bar{X}}\right) \cdot 100\%$$



Thus, the specific unit in which the standard deviation is measured is done away with and the
new unit becomes percent.
To demonstrate this relative variation measure let’s consider the following example where for
two stocks A,B we calculated last year’s average price as well as the standard deviation.
Stock A has: Average price last year = $50 and Standard deviation = $5
$$CV_A = \left(\frac{s}{\bar{X}}\right) \cdot 100\% = \frac{\$5}{\$50} \cdot 100\% = 10\%$$
Stock B has: Average price last year = $100 and Standard deviation = $5
$$CV_B = \left(\frac{s}{\bar{X}}\right) \cdot 100\% = \frac{\$5}{\$100} \cdot 100\% = 5\%$$

Both stocks have the same standard deviation, but stock B is less variable relative to its price.
In general, a standard deviation of 10 may be considered high when the mean is 50 but small
when the average is 500.
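The stock comparison can be checked with a few lines of Python. This is only a sketch; `coefficient_of_variation` is a hypothetical helper name, not a library function.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return 100.0 * std_dev / mean


cv_a = coefficient_of_variation(5, 50)     # stock A: mean $50, sd $5
cv_b = coefficient_of_variation(5, 100)    # stock B: mean $100, sd $5

print(cv_a)   # 10.0
print(cv_b)   # 5.0
```

Because the result is a pure percentage, the same helper can be applied to variables measured in different units.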

The coefficient of variation can be used to compare the variability of two or more sets of data
measured in different units. Let’s consider the following example.
Example: 250 employees are paid on average 30 euros daily with a standard deviation of 10
euros. During the month of March, the same employees worked on average for 16 days with a
standard deviation of 4.8 days. Which of the two distributions exhibits higher spread?
Daily pay distribution: Average daily pay = €30 and Standard deviation = €10
$$CV = \left(\frac{s}{\bar{X}}\right) \cdot 100\% = \frac{10}{30} \cdot 100\% = 33.3\%$$
Days worked during March: Average days worked = 16 days and Standard deviation = 4.8 days
$$CV = \left(\frac{s}{\bar{X}}\right) \cdot 100\% = \frac{4.8}{16} \cdot 100\% = 30\%$$
Thus we can see that the two distributions have similar relative variation, with daily pay being
slightly more variable (33.3% against 30%), even though the two variables are measured in
different units.

A note of caution in relation to the coefficient of variation: it loses its reliability when the
data series examined contains negative numbers.

Interquartile range or Quartile deviation


The interquartile range or the quartile deviation, denoted as IQR, is a better measure of variation
in a distribution than the range. The IQR is defined as the difference between the third quartile
and the first quartile, i.e. IQR = Q3 – Q1, and measures the spread in the middle 50% of the data.
For this reason the IQR is also called the midspread of the data.
Let’s consider the following depiction where students’ grades are described. The lowest grade is
12 whereas the highest grade is 70. The first quartile Q1 reflects that 25% of the students were
awarded up to 30. The second quartile Q2 reflects that 50% of the students got up to 45. Finally,
75% of the students got up to 57, as reflected by the third quartile Q3.

The IQR for this example is Q3 - Q1 = 57 - 30 = 27. The IQR is a measure of variability that is not
influenced by outliers or extreme values; such measures are called resistant measures.
Therefore it is particularly suitable for highly skewed distributions.

 When the interquartile range is small, it means that there is a small deviation in the
central 50 percent items.
 In contrast, if the IQR is high, it shows that the central 50 percent items have a large
variation.
 It may be noted that in a symmetrical distribution, the two quartiles, that is, Q3 and Q1,
are equidistant from the median. Unfortunately, symmetrical distributions are seldom
found in business and economics.

The computation of a quartile deviation is very simple, involving the computation of upper and
lower quartiles which we demonstrated previously.
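As an illustration, the quartiles and the IQR can be obtained in Python with `statistics.quantiles`. The grades below are hypothetical, chosen only so that the quartiles match the figures quoted above (30, 45 and 57). The function's default "exclusive" method is one of several quartile conventions, so other tools may return slightly different values for the same data.

```python
import statistics

# Hypothetical grades chosen to reproduce the quartiles quoted in the text.
grades = [12, 22, 30, 35, 40, 45, 50, 54, 57, 63, 70]

# n=4 splits the data into quarters and returns the three cut points.
q1, q2, q3 = statistics.quantiles(grades, n=4)
iqr = q3 - q1

print(q1, q2, q3)   # 30.0 45.0 57.0
print(iqr)          # 27.0
```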

Descriptive Statistics for a population


Descriptive statistics discussed previously described a sample, not the population. Summary
measures describing a population, called parameters, are denoted with Greek letters.
Important population parameters are the population mean, variance, and standard deviation.

 The population mean, denoted as μ, is the sum of the values in the population divided by
the population size, N:
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N}$$
where μ = population mean, N = population size, $X_i$ = ith value of the variable X.

 The population variance, denoted as σ², is the average of squared deviations of values
from the mean, derived by
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$
where μ = population mean, N = population size, $X_i$ = ith value of the variable X.

 Finally, the population standard deviation, denoted as σ, is the most commonly used
measure of variation, showing variation about the mean, and it is defined as the square
root of the population variance. It has the same units as the original data and is derived by
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$$
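The only computational difference from the sample formulas is the divisor: N for a population, n - 1 for a sample. The sketch below makes this visible in Python by treating the eight values used earlier as if they were a complete population, which is an assumption made purely for illustration.

```python
from math import sqrt
import statistics

# The eight values from the earlier example, treated here as a full
# population (an assumption for illustration only).
values = [10, 12, 14, 15, 17, 18, 18, 24]
N = len(values)

mu = sum(values) / N                                  # population mean
sum_sq = sum((x - mu) ** 2 for x in values)           # 130.0

population_variance = sum_sq / N                      # divide by N
population_sd = sqrt(population_variance)
sample_variance = sum_sq / (N - 1)                    # divide by N - 1

print(population_variance)                            # 16.25
print(round(population_sd, 4))                        # 4.0311
print(statistics.pstdev(values))                      # library equivalent of population_sd
```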

Let’s summarize the symbols that denote the population parameters and the sample statistics as
follows:

Application of standard deviation for symmetrical distributions



As we have seen before, the standard deviation is a frequently used measure of dispersion. It
enables us to determine how far individual items in a distribution deviate from its mean.
In a symmetrical, bell-shaped curve:

 About 68 percent of the values in the population fall within ±1 standard deviation from
the mean, that is µ ± 1σ
 About 95 percent of the values fall within ±2 standard deviations from the mean,
that is µ ± 2σ
 About 99.7 percent of the values fall within ±3 standard deviations from the mean,
that is µ ± 3σ
Suppose that the variable Math SAT scores is bell-shaped with a mean of 500 and a standard
deviation of 90. Then,
68% of all test takers scored between 410 and 590 (500 ± 90).
95% of all test takers scored between 320 and 680 (500 ± 180).
99.7% of all test takers scored between 230 and 770 (500 ± 270).
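The SAT intervals can be reproduced, and the approximate percentages checked, with Python's `statistics.NormalDist`. This sketch assumes the scores follow an exact normal distribution, which the text describes only as "bell-shaped".

```python
from statistics import NormalDist

mean, sd = 500, 90
scores = NormalDist(mu=mean, sigma=sd)   # assumes an exactly normal distribution

intervals = {}
for k in (1, 2, 3):
    low, high = mean - k * sd, mean + k * sd
    # Share of the distribution lying between low and high
    share = scores.cdf(high) - scores.cdf(low)
    intervals[k] = (low, high, share)
    print(f"within {k} standard deviation(s): {low} to {high}, share = {share:.4f}")
```

The computed shares are close to the rounded 68, 95 and 99.7 percent figures used in the rule.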

STANDARDISED VARIABLE, STANDARD SCORES


The variable Z = (x - x̄)/s or (x - μ)/σ denotes the difference of each value from the mean
divided by the standard deviation. It measures the deviation from the mean in units of the
standard deviation and it is called a standardized variable. Since both the numerator and the
denominator are in the same units, a standardized variable is independent of the units used.
If deviations from the mean are given in units of the standard deviation, they are said to be
expressed in standard units or standard scores. Therefore the standard score or Z-score is the
number of standard deviations a data value is from the mean.
Through this concept of Z-score variable, comparisons can be made between individual
observations belonging to two different distributions whose compositions differ.
A data value is considered an extreme outlier if its Z-score is less than -3.0 or greater than +3.0.
The larger the absolute value of the Z-score, the farther the data value is from the mean.
Let’s consider the following example



A student has scored 68 marks in Statistics for which the average marks were 60 and the
standard deviation was 10. In Maritime Economics, he scored 74 marks for which the average
marks were 68 and the standard deviation was 15. In which module, Statistics or Maritime
Economics, was his relative standing higher?
Solution: For Statistics his Z-score is Z = (68 - 60) ÷ 10 = 0.8
For Maritime Economics his Z-score is Z = (74 - 68) ÷ 15 = 0.4
Since the standard score is 0.8 in Statistics as compared to 0.4 in Maritime Economics, his
relative standing was higher in Statistics. Also, neither of these two scores is considered an
outlier.
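The two standard scores, and the outlier check based on the ±3.0 threshold mentioned above, can be sketched in Python as follows. The helper names are hypothetical and used only in this example.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations a value lies from the mean."""
    return (value - mean) / std_dev


def is_extreme_outlier(z, threshold=3.0):
    """True when the Z-score is below -threshold or above +threshold."""
    return abs(z) > threshold


z_statistics = z_score(68, 60, 10)   # Statistics module
z_maritime = z_score(74, 68, 15)     # Maritime Economics module

print(z_statistics)                  # 0.8
print(z_maritime)                    # 0.4
print(is_extreme_outlier(z_statistics), is_extreme_outlier(z_maritime))   # False False
```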

Skewness and Kurtosis


The measures of central tendency and dispersion are the most important in describing a data
series. Occasionally, we are also concerned with the shape of a frequency distribution. There
are two characteristics, called skewness and kurtosis, that help us to understand the shape of a
distribution. Two distributions may have the same mean and standard deviation but differ
widely in their overall appearance. In both distributions shown below the mean and the
standard deviation are the same (X̄ = 15, σ = 5), but this does not imply that the distributions
are alike in nature.

The distribution on the left-hand side is a symmetrical one whereas the distribution on the
right-hand side is asymmetrical, or skewed.
Measures of skewness help us to distinguish between different types of distributions. Skewness
refers to the asymmetry or lack of symmetry in the shape of a frequency distribution.

Let’s see the three different categories of the shape of a distribution


Symmetric - Mean = Median



A symmetric distribution is one where the left and right hand sides of the distribution are
roughly equally balanced around the mean. For symmetric distributions, the mean is
approximately equal to the median.

Right Skewed - Mean > Median

A distribution that is skewed right (also known as positively skewed) is shown below.
For a right skewed distribution, the mean is typically greater than the median. Also notice that
the tail of the distribution on the right hand (positive) side is longer than on the left hand side.

Left Skewed - Mean < Median

A distribution that is skewed left (also known as negatively skewed) has exactly the opposite
characteristics of one that is skewed right: the mean is typically less than the median and the
tail of the distribution is longer on the left hand side than on the right hand side.

The above definitions show that the term 'skewness' refers to a lack of symmetry: a
distribution that is not symmetrical (asymmetrical) is called a skewed distribution, and it can
be either positively skewed (right skewed) or negatively skewed (left skewed).

To determine the magnitude of the skewness of any frequency distribution we employ the
Pearson coefficient of skewness, defined as the ratio of 3 times the difference between the mean
and the median over the standard deviation:
$$\text{Coefficient of skewness} = \frac{3 \cdot (\text{mean} - \text{median})}{\text{standard deviation}}$$
For symmetrical distributions the mean and the median are equal, hence the value of this
coefficient is 0. The more skewed the distribution, the larger the magnitude of the coefficient;
its sign is positive for right-skewed and negative for left-skewed distributions.
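A short Python sketch of this coefficient follows. The first data set is the sample used in the standard deviation example, whose mean equals its median; the other two are hypothetical values invented to show a long right tail and a long left tail. The sample standard deviation (divisor n - 1) is used here as an assumption, since the text does not say which version applies.

```python
import statistics


def pearson_skewness(values):
    """3 * (mean - median) / standard deviation."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    std_dev = statistics.stdev(values)   # sample standard deviation (assumption)
    return 3 * (mean - median) / std_dev


balanced = [10, 12, 14, 15, 17, 18, 18, 24]   # mean 16, median 16
right_tail = [20, 22, 25, 27, 30, 35, 90]     # hypothetical, one very high value
left_tail = [10, 65, 70, 73, 75, 78, 80]      # hypothetical, one very low value

print(pearson_skewness(balanced))     # 0.0
print(pearson_skewness(right_tail))   # positive: mean > median
print(pearson_skewness(left_tail))    # negative: mean < median
```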

Kurtosis is another measure of the shape of a frequency curve. While skewness signifies the
extent of asymmetry, kurtosis measures the degree of peakedness of a frequency distribution.
The shape of a distribution is classified into three types on the basis of the shape of their peaks.
These are mesokurtic, leptokurtic and platykurtic. These three types of curves are shown in the
figure below.

The Mesokurtic curve is neither too flat nor too peaked. In fact, this is the
frequency curve of a normal distribution. The Leptokurtic curve is more peaked than the
normal curve. In contrast, the Platykurtic curve is relatively flat.

SUMMARY
 In this presentation we defined statistics and the main areas of statistical methods
namely descriptive statistics, inferential statistics and exploring relations as well as
forecasting techniques.
 We defined data types and measurements and we explored the limitations of statistics.
 Moreover we looked at ways of reducing a set of raw data into a form whereby it can be
easily understood by non-experts. Different methods of presentation of data, both
tabular and graphical, have been considered.
 We have seen how a set of data may be reduced to one single representative value. The
most important ways of summing up a distribution are the mean, the median and the mode.



 We have considered methods of summarizing data in terms of the spread of the
observations such as the range, variance, standard deviation as well as comparing
distribution variations.
 Finally we have looked briefly at the shape of a distribution in terms of skewness and
kurtosis.

Using Excel for creating graphs


Excel offers the capability of producing the graphs mentioned in this presentation with Chart
Wizard. There is a plethora of examples of how to use Excel in creating graphs. What follows is
based on the following two internet sites, namely
https://support.office.com/en-au/article/Create-a-chart-from-start-to-finish-a745775f-98d9-4c63-bfa8-9c00cd03ff0c
and http://www.excel-2010.com/excel-pivottables/

Scatter Plot with Excel


A scatter chart plots the values for two variables as a set of points on a graph. One variable
controls the position on the x-axis of a point, whilst the other variable controls the position on
the y-axis. If you’re familiar with graphs, you might already understand that these points are
referred to as (x,y) where x is the position along the x-axis and y is the position along the y-axis
of each point.
Suppose you wish to establish the relationship between X (a person's salary) and Y (his/her car
price). It’s hard to see what’s going on when we look at raw numbers, so we will create a scatter
chart by executing the following steps.
1. Select the range A1:B10.
2. On the Insert tab, in the Charts group, choose Scatter, and select Scatter with only Markers.



The chart produced is the following

The scatter diagram shows us that there is a possible relationship: as the salary
increases, so does the car price.
We added a trendline to see the relationship between these two variables more clearly.
Trendlines mark out the trend in the data. To display a trendline in our scatter chart,
 click Chart Tools > Layout > Analysis > Trendline.



 Click on the More Trendline Options…

The Format Trendline window that opens is pretty big, but there’s only one option we need
here: Display Equation on Chart. Ensure that there is a check in that checkbox and click close.



Line graphs with Excel
Line charts can display continuous data over time, set against a common scale, and are
therefore ideal for showing trends in data at equal intervals. In a line chart, category data is
distributed evenly along the horizontal axis, and all value data is distributed evenly along the
vertical axis. Line charts are especially useful for displaying multiple series.
Let’s look at an example. Here we have a spreadsheet containing sales figures by month for
three different regions: the North, Midlands and South.

Notice that the “timeline” has been entered into the left hand column while each data series
(for each region) has been entered into subsequent columns. To create a line chart,
 select all the data and the column headings.
 Click Insert > Charts > Line, and select a chart type.



 I selected Line with Markers to produce this line chart:

Because we selected the column headings, they appear in the chart’s legend to the right.
The x-axis labels displaying the months look a little cramped, so let’s display them at an angle.
With the chart selected,
 click Chart Tools > Design > Chart Layouts > Layout 1 (the first option).

Also, we need to give the chart a title and label the y-axis. To change the title,
 click into the title text box and select all the text. Type in something meaningful for this
line chart, such as “2010 Sales By Region”.
 We update the y-axis label in a similar way: click into that text box, select the text by
dragging over it and then type something like “sales ($)”.
This is the end result.



Column and Bar Charts with Excel
Column or Bar charts are used to compare values across categories by using vertical or
horizontal bars. On a Column chart, the values are on the vertical (y) axis, while on a Bar chart,
the values are on the horizontal (x) axis. They each process data the same way. To create a
column chart, execute the following steps.
1. Highlight the data range to be graphed.
2. On the Insert tab, in the Charts group, choose Column, and select Clustered Column.



Changing the chart type to a Bar chart
A Bar chart presents data in the same way as a Column chart, but it does so horizontally instead of
vertically. Here's what you do to change the chart type:
1. Click once on the chart to select it, if it is not already selected.
The menu bar now displays the Chart menu item where the Data menu item usually is found.
2. Select the Bar chart type, and the first subtype (Clustered Bar).

4. Click OK.



Data Analysis using Excel
Excel can generate frequency distributions and histograms using the Data Analysis feature. To
access it, select Data and then select Data Analysis. If it does not appear on the menu, it must be
added in as follows
 Click on the green File tab. The File tab in Excel 2010 replaces the Office Button (or File
Menu) in previous versions of Excel.
 Click on Options.

 Under Add-ins, select Analysis ToolPak and click on the Go button.



 Check Analysis ToolPak and click on OK.



 On the Data tab, you can now click on Data Analysis.

The following dialog box appears.

Excel Frequency Table and Histogram for Grouped Data


Excel refers to frequency distributions as histograms. The classes of a frequency distribution are
referred to as bins.

Discrete Case
The data below refer to the number of cars per household:

1 1 1 2 0 1 1 1 0 1 1 0 3 1 0 0 0 1 2 0 1

3 1 0 1 1 1 1 2 2 2 0 2 1 0 1 0 1 0 0 0 1

2 1 1 2 0 0 1 1 1 1 0 2 0 0 1 2 1 0 1 1 2

2 1 1 0 2 1 1 0 0 0 1 0 1 0 1 1 0 1 1 1 1

A frequency table can in fact be produced using an Excel spreadsheet. To do this, you need to
enter the data values on an Excel worksheet as below.



We also need to produce a column of all values and label this column “N° of cars owned by
households” for example.

Data Data Analysis  Histogram


Now if you click Data on the Menu Bar, you will see a new title labelled “Data Analysis”

Click on Data Analysis

Choose “Histogram”
from the list to land a
dialogue box as on the
right-hand-side.
Fill in the boxes such that
the raw data go in “Input
Range” box and the
possible values of N° of
cars…. into “Bin Range”
box.
The “output Range” box
should be filled with the
address of the cell next to
the title (Cell B9) – see
the example on the right-
hand-side.
Click also the Chart
Output box at the end

Once you have filled in these boxes, click OK to obtain a table that looks like.
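For comparison, the same frequency table for the discrete data can be sketched outside Excel. The Python snippet below counts how often each number of cars occurs in the 84 observations listed above, using `collections.Counter`; the variable names are chosen for this example only.

```python
from collections import Counter

# The four rows of observations from the text (cars per household).
rows = [
    "1 1 1 2 0 1 1 1 0 1 1 0 3 1 0 0 0 1 2 0 1",
    "3 1 0 1 1 1 1 2 2 2 0 2 1 0 1 0 1 0 0 0 1",
    "2 1 1 2 0 0 1 1 1 1 0 2 0 0 1 2 1 0 1 1 2",
    "2 1 1 0 2 1 1 0 0 0 1 0 1 0 1 1 0 1 1 1 1",
]
cars = [int(v) for row in rows for v in row.split()]

frequency = Counter(cars)   # maps each value to the number of households

print("Cars  Frequency  Relative frequency")
for value in sorted(frequency):
    count = frequency[value]
    print(f"{value:>4}  {count:>9}  {count / len(cars):.1%}")
```

Here the distinct values 0, 1, 2 and 3 play the role of the bins.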

Continuous case
Let’s assume we wish to construct a frequency table and a histogram for the following array of
data. This series of values reflects the number of students studying various shipping modules.
Suppose you wish to group these numbers as follows:



0-20 21 – 25 26 – 30 31 – 35 36 – 40

1. First, enter the bin numbers (upper limits) in the range C3:C7.

2. On the Data tab, click Data Analysis.

3. Select Histogram and click OK.



4. Select the range A2:A19.
5. Click in the Bin Range box and select the range C3:C7.
6. Click the Output Range option button, click in the Output Range box and select cell F3.
7. Check Chart Output.

8. Click OK.



9. Click the legend on the right side and press Delete.
10. Properly label your bins.
11. To remove the space between the bars, right click a bar, select Format Data Series and
change the Gap Width to 0%. Select Border Color to add a border.
Result: [Figure: histogram of the grouped data]
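The binning rule behind these steps is that each value is counted in the first bin whose upper limit is greater than or equal to it, with anything above the last limit falling into a "More" category. A minimal Python sketch of that rule follows; the 18 student numbers are hypothetical, since the original data array is not reproduced here.

```python
from bisect import bisect_left

# Upper limits of the classes 0-20, 21-25, 26-30, 31-35, 36-40
upper_limits = [20, 25, 30, 35, 40]
labels = ["0-20", "21-25", "26-30", "31-35", "36-40", "More"]

# Hypothetical numbers of students per module (18 values, like the range A2:A19).
students = [18, 22, 25, 27, 30, 31, 33, 35, 36, 40, 12, 28, 29, 24, 38, 20, 26, 34]

counts = [0] * len(labels)
for value in students:
    # Index of the first upper limit that is >= value; equals 5 ("More") if none is.
    counts[bisect_left(upper_limits, value)] += 1

for label, count in zip(labels, counts):
    print(f"{label:>6}  {count}")
```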

Statistical Measures using Excel commands


Suppose we asked a cohort of 15 PhD students about their age. The responses were as follows:
Ages 32 44 37 41 28 33 38 36 33 29 35 36 33 31 33
We want to have an overall understanding of the age of students. Because this data is
quantitative, one can calculate all the possible statistics:
a) Mean, b) Median, c) Mode, d) Range, e) Inter-quartile range, f) Standard deviation and g)
Coefficient of variation
Excel allows us to derive the above measures with the following commands.



Note: These formulae can only be used if we have single data values.
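The same seven measures can also be cross-checked outside Excel. The following Python sketch uses the standard `statistics` module on the 15 ages given above. The quartiles rely on the module's default "exclusive" method, which is an assumption here: a different quartile convention would give a slightly different interquartile range.

```python
import statistics

ages = [32, 44, 37, 41, 28, 33, 38, 36, 33, 29, 35, 36, 33, 31, 33]

mean = statistics.mean(ages)                      # a) mean
median = statistics.median(ages)                  # b) median
mode = statistics.mode(ages)                      # c) mode
age_range = max(ages) - min(ages)                 # d) range
q1, _, q3 = statistics.quantiles(ages, n=4)       # quartile cut points
iqr = q3 - q1                                     # e) interquartile range
std_dev = statistics.stdev(ages)                  # f) sample standard deviation
cv = 100 * std_dev / mean                         # g) coefficient of variation, %

print(f"mean={mean:.1f} median={median} mode={mode} range={age_range}")
print(f"IQR={iqr} sd={std_dev:.2f} CV={cv:.1f}%")
```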

Data Analysis Descriptive Statistics with Excel


You can use the Analysis Toolpak add-in to generate descriptive statistics. For example, you may
have the scores of 14 participants for a test.

To generate descriptive statistics for these scores, execute the following steps.

1. On the Data tab, click Data Analysis.



2. Select Descriptive Statistics and click OK.

3. Select the range A2:A15 as the Input Range.


4. Select cell C1 as the Output Range.
5. Make sure Summary statistics is checked.

6. Click OK.

Result:



References
 Aczel, A.D. and Sounderpandian, J., (2006). Complete business statistics. 6th
Edition. Boston: McGraw-Hill.
 An Introduction to Business Statistics, www.ddegjust.ac.in/studymaterial/mcom/mc-
106.pdf, [accessed October 2014]
 Barrow, M., (2009). Statistics for economics, accounting and business studies,
5th Edition. Essex: Prentice Hall.
 Black, K. (2008) Business Statistics for Contemporary Decision Making, 5th
Edition, Wiley.
 Bryman, A. and Bell, A. (2011) Business Research Methods, 3rd Edition, Oxford:
Oxford University Press.
 Data Analysis with Excel, http://www.excel-easy.com, [accessed November 2014]
 Data Analysis with Excel, http://www.iacquire.com/blog/quantitative-data-
analysis-techniques-for-data-driven-marketing-2, [accessed October 2014]
 Evans J, Quantitative Methods in Maritime Economics, Fairplay, 1990, Surrey UK
 Goodwin E, Maritime Statistics: Theory and Practice, Stanford Maritime Ltd,
1979, London



 Saunders, M., Lewis, P. and Thornhill, A. (2009) Research Methods for Business
Students, 6th ed., Harlow: Prentice Hall.
 Taylor, S., (2007). Business statistics for non-mathematicians. 2nd Edition. New
York: Palgrave Macmillan.
