
Introduction to

Data Analytics
Prof. Rudra Pradhan
IIT Kharagpur
Preamble

What is data analytics?

Why is it needed?

How is it different from data analysis?

What are its requirements?

Course coverage
What is Data Analytics
Analytics is the discovery and communication of meaningful patterns in
data.
Especially valuable in areas rich with recorded information, analytics
relies on the simultaneous application of statistics, econometrics, computer
programming and operations research to quantify performance.
Analytics often favors data visualization to communicate insight.
What is Data Analysis

Analysis of data is a process of inspecting, cleaning, transforming, and
modeling data with the goal of discovering useful information, suggesting
conclusions, and supporting decision making.

Data analysis has multiple facets and approaches, encompassing diverse
techniques under a variety of names, in different business, science, and
social science domains.
Related Issues
Data mining is a particular data analysis technique that focuses on
modeling and knowledge discovery for predictive rather than purely
descriptive purposes.
Business intelligence covers data analysis that relies heavily on
aggregation, focusing on business information.
Structure of Data Analysis
Descriptive statistics
Exploratory data analysis (EDA)
Confirmatory data analysis (CDA)
EDA focuses on discovering new features in the data, while CDA focuses on
confirming or falsifying existing hypotheses.
Related Issues

Predictive analytics and text analytics:

PA focuses on the application of statistical or structural models for
predictive forecasting, while TA applies statistical and structural
techniques to extract and classify information from textual sources,
a species of unstructured data.

Data integration is a precursor to data analysis, and data analysis is
closely linked to data visualization and data dissemination.

The term data analysis is sometimes used as a synonym for data modeling.
Analytics Vs. Analysis
Analytics is a multi-dimensional discipline. There is extensive use of
mathematics and statistics, and of descriptive techniques and predictive
models, to gain valuable knowledge from data, i.e. data analysis. The insights
from data are used to recommend action or to guide decision making
rooted in business context.
Analytics is not so much concerned with individual analyses or analysis
steps, but with the entire methodology. There is a pronounced tendency to
use the term analytics in business settings, e.g. text analytics vs. the more
generic text mining, to emphasize this broader perspective.
Advanced analytics is a term typically used to describe the technical aspects
of analytics, especially predictive modeling and machine learning techniques
such as artificial neural networks.
Why Data Analytics
Marketing optimization
Portfolio management
Risk management
Stock market prediction
Financial market forecasting
Digital analytics
Few Questions

How to set a perfect path?

Do you need support?

Do you need criteria?

Do you need tricks?

Is it reliable?
Principles of Modelling
Object/ System
Why? What are we looking for?
Find? What do we want to know?
Model
Variable, Parameters
Model Prediction
Valid, Accepted predictions
Test
Basic Understandings
Data
Variables
Scaling
Models: S/M
Tools: statistics, mathematics, econometrics, operations research
Statistical Modeling
Mathematical Modeling
Soft Computing
Modeling Structure
Theory
Assumptions
Objectives
Constraints
Modelling: it shows the relationships, direct and indirect, and the
interrelationships of actions and reactions in terms of cause and effect.
Two types: descriptive and predictive
Both dynamic and static
Examples of the kind of problems that
may be solved by an Econometrician
1. Testing whether financial markets are weak-form
informationally efficient.
2. Testing whether the CAPM or APT represent superior
models for the determination of returns on risky assets.
3. Measuring and forecasting the volatility of bond returns.
4. Explaining the determinants of bond credit ratings used
by the ratings agencies.
5. Modelling long-term relationships between prices and
exchange rates.
6. Determining the optimal hedge ratio for a spot position in
oil.
7. Testing technical trading rules to determine which makes
the most money.
8. Testing the hypothesis that earnings or dividend
announcements have no effect on stock prices.
9. Testing whether spot or futures markets react more rapidly
to news.
10. Forecasting the correlation between the returns to the
stock indices of two countries.
What are the Special Characteristics
of Financial Data?
Frequency & quantity of data
Stock market prices are measured every time there is a trade or
somebody posts a new quote.
Quality
Recorded asset prices are usually those at which the transaction took
place. There is no possibility of measurement error, but financial data are "noisy".
Types of Data and Notation
There are 3 types of data which econometricians might use for analysis:
1. Time series data
2. Cross-sectional data
3. Panel data, a combination of 1. & 2.
The data may be quantitative (e.g. exchange rates, stock prices, number of
shares outstanding), or qualitative (e.g. day of the week).
Examples of time series data
Series                               Frequency
GNP or unemployment                  monthly, or quarterly
government budget deficit            annually
money supply                         weekly
value of a stock market index        as transactions occur
Examples of Problems that Could be Tackled Using a Time Series Regression
- How the value of a country's stock index has varied with that country's
macroeconomic fundamentals.
- How the value of a company's stock price has varied when it announced the
value of its dividend payment.
- The effect on a country's currency of an increase in its interest rate
Cross-sectional data are data on one or more variables collected at a single
point in time, e.g.
- A poll of usage of internet stock broking services
- Cross-section of stock returns on the New York Stock Exchange
- A sample of bond credit ratings for UK banks
Examples of Problems that Could be Tackled Using a Cross-Sectional Regression
- The relationship between company size and the return to investing in its shares
- The relationship between a country's GDP level and the probability that the
government will default on its sovereign debt.
Panel Data has the dimensions of both time series and cross-sections, e.g. the
daily prices of a number of blue chip stocks over two years.
It is common to denote each observation by the letter t and the total number of
observations by T for time series data, and to denote each observation by the
letter i and the total number of observations by N for cross-sectional data.
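As a rough illustration of the t/T and i/N notation, the three data types can be laid out as plain Python containers. All names and values below (the index levels, the bank labels and ratings, the panel formula) are hypothetical and only show the shape of each data set.

```python
# Time series: one entity observed at t = 1..T (hypothetical index levels).
time_series = [100.0, 101.5, 99.8, 102.3]
T = len(time_series)

# Cross-section: N entities observed at a single point in time
# (hypothetical bank names and credit ratings).
cross_section = {"bank_A": "AA", "bank_B": "A", "bank_C": "BBB"}
N = len(cross_section)

# Panel: every entity i observed at every time t, keyed by (i, t).
# The price formula is arbitrary and only fills the structure.
panel = {
    (i, t): 50.0 + 10 * i + t
    for i in range(1, 4)   # i = 1..3 entities
    for t in range(1, 5)   # t = 1..4 periods
}

print(T, N, len(panel))
```

A panel with 3 entities and 4 periods therefore holds 3 x 4 observations, one per (i, t) pair.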
Returns in Financial Modelling
It is preferable not to work directly with asset prices, so we usually convert the
raw prices into a series of returns. There are two ways to do this:
Simple returns or log returns

$$R_t = \frac{p_t - p_{t-1}}{p_{t-1}} \times 100\% \qquad \text{or} \qquad R_t = \ln\left(\frac{p_t}{p_{t-1}}\right) \times 100\%$$

where, $R_t$ denotes the return at time t
$p_t$ denotes the asset price at time t
ln denotes the natural logarithm
We also ignore any dividend payments, or alternatively assume that the price
series have been already adjusted to account for them.
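A minimal sketch of the two return definitions, assuming prices are given as a plain list ordered in time. The function names and the price values are illustrative only.

```python
import math

def simple_returns(prices):
    """Percentage simple returns: 100 * (p_t - p_{t-1}) / p_{t-1}."""
    return [100.0 * (p - p_prev) / p_prev
            for p_prev, p in zip(prices, prices[1:])]

def log_returns(prices):
    """Percentage log returns: 100 * ln(p_t / p_{t-1})."""
    return [100.0 * math.log(p / p_prev)
            for p_prev, p in zip(prices, prices[1:])]

# Hypothetical daily closing prices, invented for illustration only.
prices = [100.0, 102.0, 101.0, 105.0]

print(simple_returns(prices))
print(log_returns(prices))
```

For small price changes the two series are close to each other; they drift apart as the changes get larger.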
Log Returns
The log returns are also known as log price relatives, which will be used throughout this
book. There are a number of reasons for this:
1. They have the nice property that they can be interpreted as continuously
compounded returns.
2. Can add them up, e.g. if we want a weekly return and we have calculated
daily log returns:

$$\begin{aligned}
r_1 &= \ln(p_1/p_0) = \ln p_1 - \ln p_0 \\
r_2 &= \ln(p_2/p_1) = \ln p_2 - \ln p_1 \\
r_3 &= \ln(p_3/p_2) = \ln p_3 - \ln p_2 \\
r_4 &= \ln(p_4/p_3) = \ln p_4 - \ln p_3 \\
r_5 &= \ln(p_5/p_4) = \ln p_5 - \ln p_4 \\
r_1 + r_2 + r_3 + r_4 + r_5 &= \ln p_5 - \ln p_0 = \ln(p_5/p_0)
\end{aligned}$$
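The additivity of daily log returns can be checked numerically. This is a small sketch with made-up prices p_0..p_5; nothing here comes from real market data.

```python
import math

# Hypothetical prices p_0..p_5 for one trading week (illustrative values).
p = [50.0, 51.0, 50.5, 52.0, 51.5, 53.0]

# Daily log returns r_1..r_5, each r_t = ln(p_t / p_{t-1}).
daily = [math.log(p[t] / p[t - 1]) for t in range(1, len(p))]

# Weekly log return computed directly from the end points: ln(p_5 / p_0).
weekly = math.log(p[-1] / p[0])

# The intermediate prices cancel, so the sum matches up to rounding error.
print(sum(daily), weekly)
```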

A Disadvantage of using Log Returns
There is a disadvantage of using the log-returns. The simple return on a
portfolio of assets is a weighted average of the simple returns on the
individual assets:

$$R_{pt} = \sum_{i=1}^{N} w_{ip} R_{it}$$

But this does not work for the continuously compounded returns.
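A two-asset sketch of this point, with hypothetical weights and returns expressed as decimals. It compares the log return of the portfolio with the weighted average of the individual log returns.

```python
import math

weights = [0.6, 0.4]      # hypothetical portfolio weights w_ip, summing to 1
simple = [0.10, -0.05]    # hypothetical simple returns R_it, as decimals

# Simple returns aggregate as a weighted average across assets.
portfolio_simple = sum(w * r for w, r in zip(weights, simple))

# Log return of the portfolio, derived from its simple return.
portfolio_log = math.log(1.0 + portfolio_simple)

# Weighted average of the individual log returns: a different number.
weighted_log = sum(w * math.log(1.0 + r) for w, r in zip(weights, simple))

print(portfolio_simple, portfolio_log, weighted_log)
```

Because the logarithm is concave, the weighted average of the log returns falls below the portfolio's log return whenever the asset returns differ.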
Steps involved in the formulation of
econometric models
1. Economic or Financial Theory (Previous Studies)
2. Formulation of an Estimable Theoretical Model
3. Collection of Data
4. Model Estimation
5. Is the Model Statistically Adequate?
   No: Reformulate Model
   Yes: Interpret Model, then Use for Analysis
Some Points to Consider when reading papers
in the academic finance literature
1. Does the paper involve the development of a theoretical model or is it
merely a technique looking for an application, or an exercise in data
mining?
2. Is the data of "good quality"? Is it from a reliable source? Is the size of
the sample sufficiently large for asymptotic theory to be invoked?
3. Have the techniques been validly applied? Have diagnostic tests been
conducted for violations of any assumptions made in the estimation
of the model?
4. Have the results been interpreted sensibly? Is the strength of the results
exaggerated? Do the results actually address the questions posed by the
authors?
5. Are the conclusions drawn appropriate given the results, or has the
importance of the results of the paper been overstated?
Objectives of Data Analytics
Data reduction
Structural simplification
Analysis of dependence
Analysis of interdependence
Prediction/ Forecasting
Hypotheses construction and testing
Strategy and policy implications
Course Modules
Module 1: Basic Applied Econometrics
Basics, probability distribution, regression analysis, issues and problems of
regression analysis
Module 2: Advanced Econometrics
CRRM, PDM, SEM
Module 3: Time series Econometrics
Integration and co-integration, VAR modelling, volatility modelling,
bootstrapping
Module 4: Optimization Tools
Simple LPP, Integer programming, Goal programming, Simulation, AHP, WLP
Module 5: Soft computing
ANN, FL, GA, SVM
Modelling Structure
Univariate structure
Central tendency, dispersion, skewness, kurtosis
Bivariate structure
Covariance, correlation, regression
Multivariate structure
Correlation, regression, factor analysis, conjoint analysis, cluster analysis, path
analysis, MDS, AHP, SEM
Statistical Modelling: A Basic Framework
Object/ System
Research Design/ Choice/ Creativity
Univariate Modelling
Bivariate Modelling
Multivariate Modelling
Data Analysis
Interpretation and Conclusion
Research Process
Step 1: Define Research Problem
Step 2: Review of Literature
(Review concepts and theories;
Review previous research findings)
Step 3: Formulate Hypotheses
Step 4: Research Design
Step 5: Data Collection
Step 6: Data Analysis
Step 7: Interpretation
Soft computing: Basics
Soft computing is a term applied to a field within computer science which
is characterized by the use of inexact solutions to computationally hard
tasks such as the solution of non-deterministic polynomial (NP)-complete
problems, for which there is no known algorithm that can compute an
exact solution in polynomial time.
Soft computing differs from conventional (hard) computing in that, unlike
hard computing, it is tolerant of imprecision, uncertainty, partial truth, and
approximation. In effect, the role model for soft computing is the human
mind.
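One way to picture "partial truth" is a fuzzy membership function, which returns a degree of membership between 0 and 1 instead of a hard yes/no. The sketch below uses a triangular shape. The function name, the "moderate volatility" set and its break points are assumptions made up for illustration.

```python
def triangular(x, left, peak, right):
    """Degree of membership in [0, 1] for a triangular fuzzy set."""
    if x <= left or x >= right:
        return 0.0                            # clearly outside the set
    if x <= peak:
        return (x - left) / (peak - left)     # rising edge
    return (right - x) / (right - peak)       # falling edge

# Hypothetical fuzzy set "moderate volatility", in percent per year.
for vol in (5, 15, 20, 25, 35):
    print(vol, triangular(vol, left=10, peak=20, right=30))
```

A hard rule would label 15% as either moderate or not. Here it is moderate to degree 0.5.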
Tools of Soft Computing
Artificial neural networks (ANN)
Support Vector Machines (SVM)
Fuzzy logic (FL)
Evolutionary computation (EC), including:
  Evolutionary algorithms
  Genetic algorithms
  Differential evolution
Metaheuristic and Swarm Intelligence
  Ant colony optimization
  Particle swarm optimization
Ideas about probability including:
  Bayesian network
Chaos theory
Wavelet analysis
Uni-variate Statistics

Central Tendency

Dispersion

Skewness

Kurtosis
Bi-variate Statistics

Covariance

Correlation
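The measures listed above can be computed from first principles. The sketch below assumes the population (divide by n) convention and moment-based skewness and kurtosis (not excess kurtosis). Other conventions exist, and the sample values are invented.

```python
import math

def mean(xs):
    """Central tendency: arithmetic mean."""
    return sum(xs) / len(xs)

def central_moment(xs, k):
    """k-th central moment; k = 2 gives the population variance (dispersion)."""
    m = mean(xs)
    return sum((x - m) ** k for x in xs) / len(xs)

def skewness(xs):
    """Third standardized moment: 0 for a symmetric sample."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def kurtosis(xs):
    """Fourth standardized moment (not excess kurtosis)."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2

def covariance(xs, ys):
    """Population covariance of two equally long samples."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Covariance scaled by both standard deviations, in [-1, 1]."""
    return covariance(xs, ys) / math.sqrt(
        central_moment(xs, 2) * central_moment(ys, 2)
    )

# Hypothetical samples: y is an exact linear function of x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 6.0, 8.0, 10.0]

print(mean(x), central_moment(x, 2), skewness(x), kurtosis(x))
print(covariance(x, y), correlation(x, y))
```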
Why Multivariate Modelling
Applicability: Client fields use these techniques.
Quantification: Create the habit of looking at the strength of a
relationship, not just the significance.
Creativity: Make introductory statistics give techniques that let
students express their own interests.

Empowerment: Move from paradox to understanding.


How to Teach Multivariate
Modelling to Intro. Students
Replace algebra with computation, simulation
and geometry.

Simulation:
Confidence intervals via bootstrapping;
hypothesis testing via randomization of
explanatory variables.

Geometry:
Regression as projection; ANOVA as
Pythagorean vector decomposition;
p-values from subtended angles.
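The two simulation ideas can be sketched in a few lines. This is one possible reading, assuming a percentile bootstrap for the mean and a covariance-based randomization test. The function names, defaults, seeds and data are all hypothetical.

```python
import random

def bootstrap_ci(sample, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(sample, k=n)) / n for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high

def permutation_p_value(x, y, n_shuffles=2000, seed=0):
    """Two-sided p-value for 'x and y are unrelated', by shuffling x."""
    rng = random.Random(seed)

    def cov(a, b):
        ma, mb = sum(a) / len(a), sum(b) / len(b)
        return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / len(a)

    observed = abs(cov(x, y))
    shuffled = list(x)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)   # break any link between x and y
        if abs(cov(shuffled, y)) >= observed:
            hits += 1
    return hits / n_shuffles

# Hypothetical data, invented for illustration only.
sample = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 10.0, 12.0]
low, high = bootstrap_ci(sample)

x = list(range(20))
y = [2.0 * v + 1.0 for v in x]   # y depends strongly on x
p_value = permutation_p_value(x, y)

print(low, high, p_value)
```

Neither function uses a distributional formula. The interval and the p-value both come from repeated resampling, which is the point of replacing algebra with computation.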
Data Modelling and Packaged
Software
SPSS
EVIEWS
MICROFIT
GAUSS
LIMDEP
MATLAB
AMOS
MINITAB
STATISTICA
RATS
SYSTAT
STATA
LISREL
SAS
TSP
SHAZAM
DEA
