
Data Driven Modelling

Data Driven Modelling using MATLAB


Shan He
School for Computational Science
University of Birmingham

Module 06-23836: Computational Modelling with MATLAB


Outline of Topics

What is data driven modelling?


Regression Analysis in MATLAB
Artificial Neural Networks
Conclusion

What is data driven modelling?


For equation-based and agent-based models, we assume the model is known.

However, sometimes we have a large amount of data but very little prior knowledge.

Finding the model in the first place is the most difficult and
important question.

A new research field: data driven modelling (DDM).

Based on the data, a model is built from the connections between the system state variables (e.g., input, internal and output variables), with only limited assumptions about the system.


Goals/purposes of data driven modelling

Extract and recognize patterns in data

Interpret or explain observations

Test validity of hypotheses

Search the space of hypotheses


Tasks of data driven modelling

Classification: the task consists of assigning a class to an input data point.

Association: associations between the variables characterising the system are identified and then used in subsequent prediction.

Regression: the task consists of predicting a real value associated with an input data point.

Clustering: groups of data points with within-group similarity are to be determined.


It is new and old!

Previously it was called observational modelling.

Based on methods in statistics, e.g., regression.

These methods usually cannot handle nonlinear systems.

In recent years, machine learning techniques have been applied.

We will learn how to use regression and Artificial Neural Networks to build data-driven models in MATLAB.


Data driven modelling process


Data preparation: obtain, check and clean the data.

Feature selection: if you have high-dimensional data.

Specify assumptions based on domain knowledge.

Develop a model based on the assumptions.

Specify a loss function, e.g., the mean squared error between the model output and the real data.

Use algorithms to minimise the loss on the training data.

Test the model using the testing data.
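
The process above can be sketched end to end in a few lines. This is only an illustration under stated assumptions: the data are synthetic (a noisy quadratic invented for this example), the 70/30 split is arbitrary, and a degree-2 polynomial stands in for "the model".

```matlab
% Synthetic data (an assumption for illustration): a noisy quadratic.
rng(0);                                   % reproducible random numbers
x = linspace(-1, 1, 100)';
y = 2*x.^2 - x + 0.5 + 0.1*randn(size(x));

% Data preparation: drop any non-finite samples.
ok = isfinite(x) & isfinite(y);
x = x(ok);
y = y(ok);

% Random 70/30 split into training and testing data.
idx      = randperm(numel(x));
nTrain   = round(0.7 * numel(x));
trainIdx = idx(1:nTrain);
testIdx  = idx(nTrain+1:end);

% Model assumption: a degree-2 polynomial.
% polyfit minimises the squared error on the training data.
p = polyfit(x(trainIdx), y(trainIdx), 2);

% Loss function: mean squared error between model output and data.
mse      = @(yTrue, yPred) mean((yTrue - yPred).^2);
trainErr = mse(y(trainIdx), polyval(p, x(trainIdx)));
testErr  = mse(y(testIdx),  polyval(p, x(testIdx)));
fprintf('train MSE = %.4f, test MSE = %.4f\n', trainErr, testErr);
```

A test error much larger than the training error would suggest that the model assumptions are too flexible for the amount of data.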


What tools can we use?


Statistics:
    Linear regression
    Nonlinear regression
    Logistic regression
    Probit regression

Machine learning techniques:
    Decision tree
    Artificial Neural Network
    Nearest Neighbours
    Support Vector Machine
    Association rule learning



Regression Analysis in MATLAB

Linear regression analysis in MATLAB


For linear regression, we can use polynomial curve fitting.

MATLAB function: p = polyfit(x,y,n)

It finds the coefficients of a polynomial p(x) of degree n that fits the data, p(x(i)) to y(i), in a least squares sense.

The output p is a row vector of length n+1 containing the polynomial coefficients in descending powers:
$p(x) = p_1 x^n + p_2 x^{n-1} + \cdots + p_n x + p_{n+1}$

To evaluate the polynomial at the data points: y = polyval(p,x)
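
A minimal sketch of the two calls, using made-up data that lie exactly on a straight line so the result is easy to check by hand:

```matlab
% Made-up data for illustration: points on the line y = 2x + 1.
x = 0:5;
y = 2*x + 1;

% Fit a degree-1 polynomial; p holds the coefficients in descending powers,
% so p(1) is the slope and p(2) is the intercept.
p = polyfit(x, y, 1);

% Evaluate the fitted polynomial at the data points.
yFit = polyval(p, x);

% The residuals are zero up to rounding because the data are exactly linear.
maxResidual = max(abs(y - yFit));
```

Here p comes out as [2 1] up to rounding error, matching the descending-powers convention described above.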


A very simple example: fitting error function

Regression: we aim to fit data points generated from the error function.

erf(x) is twice the integral of the Gaussian distribution with zero mean and variance 1/2:
$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\,dt$
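
One way this example might look in code is sketched below. The sampling interval [0, 2.5] and the polynomial degree 6 are assumptions chosen for illustration; the slide does not state them.

```matlab
% Sample the error function on an assumed interval [0, 2.5].
x = (0:0.1:2.5)';
y = erf(x);

% Fit a degree-6 polynomial (the degree is an assumption) and evaluate it.
p    = polyfit(x, y, 6);
yFit = polyval(p, x);

% Largest absolute error inside the fitted interval.
maxErrInside = max(abs(y - yFit));

% Evaluate the same polynomial on a wider interval to see how it extrapolates.
xWide        = (0:0.1:5)';
maxErrWide   = max(abs(erf(xWide) - polyval(p, xWide)));

fprintf('max error on [0, 2.5]: %.2e\n', maxErrInside);
fprintf('max error on [0, 5]:   %.2e\n', maxErrWide);
```

The fit is close inside the sampled interval but degrades outside it, because erf is bounded while any non-constant polynomial is not.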


A more complex example: fitting traffic data

Hourly traffic counts at three intersections for a single day.

Regression: we aim to fit the data with polyfit and evaluate the fitted polynomial with polyval.
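
A possible sketch follows. It assumes the data set is the count.dat file shipped with MATLAB (a 24-by-3 matrix of hourly counts); the slide does not name the file, and the polynomial degree 4 is likewise an arbitrary choice for illustration.

```matlab
% Assumption: the traffic data are MATLAB's bundled count.dat.
load count.dat                 % creates the 24-by-3 matrix 'count'
hours = (1:24)';
y     = count(:, 1);           % first intersection only

% Fit a degree-4 polynomial (assumed degree). The three-output form
% centres and scales the hours, which improves numerical conditioning.
[p, S, mu] = polyfit(hours, y, 4);

% Evaluate the fit; mu must be passed back so polyval applies the same scaling.
yFit = polyval(p, hours, [], mu);
rmse = sqrt(mean((y - yFit).^2));

plot(hours, y, 'o', hours, yFit, '-');
xlabel('Hour of day');
ylabel('Vehicle count');
legend('data', 'degree-4 fit');
```

Repeating the fit for columns 2 and 3 of count gives the other two intersections.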


Logistic regression

Sometimes called the logistic model or logit model.

Can be used for predicting the outcome of a binary dependent variable, i.e., classification.

MATLAB function: b = glmfit(X,y,distr)

Output: a (p+1)-by-1 vector b of coefficient estimates (a constant term plus one coefficient per predictor) for a generalized linear regression of the responses in y on the p predictors in X, using the distribution distr.
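
As a rough illustration, logistic regression with glmfit might look like the sketch below. The data are synthetic and the "true" coefficients are invented for the example; glmfit and glmval require the Statistics and Machine Learning Toolbox.

```matlab
% Synthetic binary classification data (invented for illustration).
rng(1);
n    = 200;
X    = randn(n, 2);                        % two predictors
eta  = 0.5 + 2*X(:,1) - 1*X(:,2);          % assumed true linear predictor
prob = 1 ./ (1 + exp(-eta));               % logistic function
y    = double(rand(n, 1) < prob);          % 0/1 responses

% Logistic regression: binomial distribution with the logit link.
% b(1) is the constant term, b(2:3) are the predictor coefficients.
b = glmfit(X, y, 'binomial', 'link', 'logit');

% Predicted probabilities, thresholded at 0.5 to get class labels.
pHat  = glmval(b, X, 'logit');
yPred = double(pHat >= 0.5);
acc   = mean(yPred == y);
fprintf('training accuracy = %.2f\n', acc);
```

The accuracy here is measured on the training data only; a held-out test set would be needed for an honest estimate.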


Australian Credit Card Assessment

Task: to assess applications to an Australian bank for a credit card based on a number of attributes.

2 classes: granted (44.5% of the instances) or denied (55.5% of the instances).

14 attributes: names and values have been changed to meaningless symbols to protect confidentiality of the data.

Mixed-type inputs: there are 5 continuous, 4 binary and 5 nominal attributes.

There are many missing values.


Military Trauma survival prediction



Artificial Neural Networks

What are Artificial Neural Networks (ANNs)?

[Figure: network diagram with Input, Hidden Layer and Output]

ANN: a mathematical or computational model inspired by biological neural networks.

Consists of an interconnected group of artificial neurons.


What are Artificial Neural Networks (ANNs)?

Non-linear statistical data modelling tools:
    Model complex relationships between inputs and outputs;
    Discover patterns in data.

Can be used for classification, association, regression and clustering.

MATLAB Neural Network Toolbox.


Example: Prediction of the number of sunspots

The sunspot series is a record of the activity of the surface of the sun.

Important: telecommunication will be disrupted by a sufficiently large solar flare.

Time series data for sunspot activity over the last 300 years.

Sunspot activity is cyclical, reaching a maximum about every 11 years.

Challenging: the sunspot series is nonlinear, non-stationary and non-Gaussian.


Prediction of sunspot number by ANNs

Task: we use recorded sunspot data to train our ANN to predict the sunspot number based on the sunspot numbers of the previous 3 years.

Training data: sunspot numbers from 1705 to 1884

Test data: sunspot numbers from 1884 to 1987
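
The sketch below shows one way the task could be set up; it is not the lecture's own code. It assumes the data are MATLAB's bundled sunspot.dat (column 1 = year, column 2 = sunspot number), that the Deep Learning Toolbox (formerly Neural Network Toolbox) is installed, and it uses a single hidden layer of 5 neurons chosen arbitrarily. Since the two ranges above share 1884, the sketch puts that year in the training set only.

```matlab
% Assumption: the series is MATLAB's bundled sunspot.dat.
load sunspot.dat                 % creates 'sunspot': [year, sunspot number]
year = sunspot(:, 1);
num  = sunspot(:, 2);
N    = numel(num);

% Each column of X holds the sunspot numbers of the three preceding years;
% the matching entry of T is the number to predict.
X     = [num(1:N-3)'; num(2:N-2)'; num(3:N-1)'];
T     = num(4:N)';
tYear = year(4:N)';              % year of each target value

% Split by target year.
trainMask = tYear >= 1705 & tYear <= 1884;
testMask  = tYear > 1884;

% Feed-forward network with one hidden layer of 5 neurons (assumed size).
% By default, train() holds out part of the training set for validation.
net = feedforwardnet(5);
net.trainParam.showWindow = false;       % suppress the training GUI
net = train(net, X(:, trainMask), T(trainMask));

% Predict the test years and report the mean squared error.
pred    = net(X(:, testMask));
testMSE = mean((T(testMask) - pred).^2);
fprintf('test MSE = %.1f\n', testMSE);
```

Results vary from run to run because the network weights are initialised randomly; calling rng with a fixed seed before feedforwardnet makes a run repeatable.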


New direction for ANNs: Deep Learning


ANNs fell out of favour in the 1990s because they were slow and inefficient.

In 2006, Prof. Geoff Hinton made a breakthrough: deep learning.

Excels at unsupervised learning, e.g., recognising handwritten words.

Key idea: learn categories incrementally, e.g., lower-level categories (letters) first, then higher-level categories (words).

Google, Microsoft and other big names have jumped on the bandwagon.

Microsoft Project: Speech Recognition


Conclusion

If you know the underlying mechanisms of the system (even partially), DO NOT use data-driven modelling methods.

How to choose your tools: start from simple tools.

Regression -> Decision Tree -> ANNs (SVM, Random Forest) -> Hybrid methods, e.g., Evolutionary ANNs

Also need to consider interpretability: simpler tools do better.


Assignment

Based on the sunspot number prediction example, use linear regression (polyfit) and ANNs to model the Hudson Bay Company fur record data.

Investigate how to use a decision tree for the Australian Credit Card Assessment problem. Compare the results with ANNs and logistic regression.
