
M.Sc. I.T.

Part I Semester I

Data Analysis Tools MANUAL FOR PRACTICAL

2013-2014

M.Sc in Information Technology Part I Course III : Data Analysis Tools

Practical based on the Book Modelling with Data

Practical Problems Prepared and Implemented by Mr. Mahesh Naik, Valia College, Andheri

& Mr. Jayesh Shinde, UDIT, Santacruz

Compiled By R. Srivaramangai, UDIT, Santacruz

INDEX

S.No.  Description
1      List of Practical
2      Installation procedure for cygwin
3      Installation procedure for ubuntu
4      Practical 1
5      Practical 2
6      Practical 3
7      Practical 4
8      Practical 5
9      Practical 6
10     Practical 7
11     Practical 8
12     Practical 9
13     Practical 10
14     References

List of Practical

1. SQL queries based on Unit I
   a. DDL commands of SQL
   b. Select clause
      i. Simple select
      ii. Select queries with where clause
      iii. Select queries with arithmetic, relational and logical operators
      iv. Select queries with order by, group by, having, limit and offset
      v. Select queries with aggregation functions and distinct
      vi. Select queries with sub queries and Joins
2. Implementing gsl matrices and vectors
   a. Illustration of gsl Matrix multiplication
   b. Illustration of gsl vector with database query embedded
3. Graph Plotting
   a. Gnu plot for plotting vectors 1
   b. Gnu plot for plotting vectors 2
   c. Gnu plot for plotting vectors 3
4. Implementing Statistical Distributions
   Discrete distributions
   a) Bernoulli distribution
   b) Binomial distribution
   c) Poisson distribution
   d) Multinomial distribution
   e) Hypergeometric distribution
   Continuous distributions
   a) Normal distribution
   b) Lognormal distribution
   c) Gamma distribution
   d) Exponential distribution
   e) Beta distribution
5. Implementing Regression and goodness of fit
   a. Implementing OLS regression
   b. Implementing goodness of fit chi square
6. Illustrating the maximum likelihood
7. Generating random numbers with Monte Carlo method using
   a. Exponential distribution
   b. Uniform distribution
   c. Binomial distribution
8. Implementing Parametric testing
   a. Using t-test
   b. Using f-test
9. Illustrating the method of Inference
10. Implementing non-parametric testing - ANOVA

Installation of cygwin

1) Download the Cygwin software from http://www.cygwin.com/. At the time of writing, the most recent version of the Cygwin DLL is 1.7.20-1.
2) Download the apophenia library of functions from http://apophenia.info/.
3) Install cygwin by running its setup.exe.
4) Cygwin offers numerous packages, so select only those required for the practicals, namely the gcc compiler, make, gsl, gnu and sqlite.
5) The apophenia library now has to be added to cygwin. Installing cygwin creates the cygwin folder on the C: drive. Within the cygwin folder, go to the home directory and then to your user sub directory, for example C:\cygwin\home\yourname (C:\cygwin\home\Jayesh).
6) Copy the apophenia library to that directory (here, Jayesh).
7) Double click on the Cygwin terminal icon. The terminal window opens and displays the present working directory.

8) Unpack the apophenia library and change into its directory by typing:

    tar xvzf apophenia-0.99-09_Jul_13.tgz
    cd apophenia-0.99

9) Configure the library:

    ./configure

To test:
1. Once the cygwin installation is complete, check it by running a test program, for example abc.c.
2. Compile and run the test program with the following commands in bash:

    gcc -std=gnu99 abc.c -o abc.out -lapophenia -lgsl -lsqlite3
    ./abc.out

Installation on Ubuntu

Install Ubuntu itself as per the free download.

How to install Sqlite on Ubuntu 13.04

1) Download the archive package of the sqlite database named sqlite-autoconf-3071700.tar.gz from http://www.sqlite.org.
2) After the download, copy the package to the Home folder of Ubuntu 13.04.
3) Open the Terminal. It opens in the current (Home) directory. Extract the package with the command:

    tar xvfz sqlite-autoconf-3071700.tar.gz

4) Extraction creates a folder named sqlite-autoconf-3071700 in the current directory.
5) Move to the new folder:

    jayesh@jayesh-G31M-S2L:~$ cd sqlite-autoconf-3071700
    jayesh@jayesh-G31M-S2L:~/sqlite-autoconf-3071700$

6) Configure the files present in the sqlite-autoconf-3071700 folder:

    jayesh@jayesh-G31M-S2L:~/sqlite-autoconf-3071700$ ./configure

7) After the configuration is done, build the package. The command asks for the password; type the password and press the Enter key:

    jayesh@jayesh-G31M-S2L:~/sqlite-autoconf-3071700$ sudo make

8) Install the built package:

    jayesh@jayesh-G31M-S2L:~/sqlite-autoconf-3071700$ sudo make install

9) Update the shared library cache:

    jayesh@jayesh-G31M-S2L:~/sqlite-autoconf-3071700$ sudo ldconfig

How to install gsl on Ubuntu 13.04

1) Download the archive package of gsl named gsl-1.16.tar.gz from http://www.gnu.org/s/gsl/.
2) After the download, copy the package to the Home folder of Ubuntu 13.04.
3) Open the Terminal. It opens in the current (Home) directory. Extract the package with the command:

    tar xvfz gsl-1.16.tar.gz

4) Extraction creates a folder named gsl-1.16 in the current directory.
5) Move to the new folder:

    jayesh@jayesh-G31M-S2L:~$ cd gsl-1.16
    jayesh@jayesh-G31M-S2L:~/gsl-1.16$

6) Configure the files present in the gsl-1.16 folder:

    jayesh@jayesh-G31M-S2L:~/gsl-1.16$ ./configure

7) After the configuration is done, build the package. The command asks for the password; type the password and press the Enter key:

    jayesh@jayesh-G31M-S2L:~/gsl-1.16$ sudo make

8) After the build is done, install gsl:

    jayesh@jayesh-G31M-S2L:~/gsl-1.16$ sudo make install

9) Update the shared library cache:

    jayesh@jayesh-G31M-S2L:~/gsl-1.16$ sudo ldconfig

How to install apophenia on Ubuntu 13.04

1) Download the archive package of apophenia named apophenia-0.99.tar.gz from http://apophenia.info/.
2) After the download, copy the package to the Home folder of Ubuntu 13.04.
3) Open the Terminal. It opens in the current (Home) directory. Extract the package with the command:

    tar xvfz apophenia-0.99.tar.gz

4) Extraction creates a folder named apophenia-0.99 in the current directory.
5) Move to the new folder:

    jayesh@jayesh-G31M-S2L:~$ cd apophenia-0.99
    jayesh@jayesh-G31M-S2L:~/apophenia-0.99$

6) Configure the files present in the apophenia-0.99 folder:

    jayesh@jayesh-G31M-S2L:~/apophenia-0.99$ ./configure

7) After the configuration is done, build and install the library. The command asks for the password; type the password and press the Enter key:

    jayesh@jayesh-G31M-S2L:~/apophenia-0.99$ sudo make install

8) Update the shared library cache:

    jayesh@jayesh-G31M-S2L:~/apophenia-0.99$ sudo ldconfig

Installation of GNUPLOT on Ubuntu 13.04

    sudo apt-get install gnuplot-x11


Practical No. 1 - SQL queries based on Unit I

For all database related practicals, create a database in Sqlite3:

    jayesh@jayesh-G31M-S2L:~$ sqlite3 testDB.db
    SQLite version 3.7.17 2013-05-20 00:56:22
    Enter ".help" for instructions
    Enter SQL statements terminated with a ";"

To check whether the database was created:

    sqlite> .databases
    seq  name             file
    ---  ---------------  ----------------------------------------------------------
    0    main             /home/jayesh/testDB.db
    sqlite>

Problem statement: To execute SQL queries in order to store and retrieve the data under study in a database. Sqlite is used for executing the queries.

i) Queries for performing DDL commands. DDL commands are used to create, modify and delete database objects. An RDBMS stores data in the form of tables. The following DDL queries are to be performed in Sqlite:

    sqlite> CREATE TABLE COMPANY(
        ID INT PRIMARY KEY NOT NULL,
        NAME TEXT NOT NULL,
        AGE INT NOT NULL,
        ADDRESS CHAR(50),
        SALARY REAL
    );

    sqlite> CREATE TABLE DEPARTMENT(
        ID INT PRIMARY KEY NOT NULL,
        DEPT CHAR(50) NOT NULL,
        EMP_ID INT NOT NULL
    );

You can verify that the tables have been created successfully using the SQLite .tables command:

    sqlite> .tables
    COMPANY DEPARTMENT

ii) Inserting values into the COMPANY and DEPARTMENT tables:

    INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) VALUES (1, 'Paul', 32, 'California', 20000.00 );
    INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) VALUES (2, 'Allen', 25, 'Texas', 15000.00 );
    INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) VALUES (3, 'Teddy', 23, 'Norway', 20000.00 );
    INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) VALUES (4, 'Mark', 25, 'Rich-Mond ', 65000.00 );
    INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) VALUES (5, 'David', 27, 'Texas', 85000.00 );
    INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) VALUES (6, 'Kim', 22, 'South-Hall', 45000.00 );
    INSERT INTO COMPANY VALUES (7, 'James', 24, 'Houston', 10000.00 );

    INSERT INTO DEPARTMENT (ID, DEPT, EMP_ID) VALUES (1, 'IT Billing', 1 );
    INSERT INTO DEPARTMENT (ID, DEPT, EMP_ID) VALUES (2, 'Engineering', 2 );
    INSERT INTO DEPARTMENT (ID, DEPT, EMP_ID) VALUES (3, 'Finance', 7 );

iii) The select clause is a data manipulation command used for retrieving data in the desired format from the database objects. The syntax of the various select clauses and their purpose is given below:

    sqlite> SELECT * FROM COMPANY;

a) List all the records where AGE is greater than or equal to 25 AND SALARY is greater than or equal to 65000.00:

    sqlite> SELECT * FROM COMPANY WHERE AGE >= 25 AND SALARY >= 65000;

List all the records where AGE is greater than or equal to 25 OR SALARY is greater than or equal to 65000.00:

    sqlite> SELECT * FROM COMPANY WHERE AGE >= 25 OR SALARY >= 65000;

List all the records where AGE is not NULL. This returns every record, because no record has AGE equal to NULL:

    sqlite> SELECT * FROM COMPANY WHERE AGE IS NOT NULL;

List all the records where NAME starts with 'Ki', whatever comes after 'Ki':

    sqlite> SELECT * FROM COMPANY WHERE NAME LIKE 'Ki%';

List all the records where the AGE value is either 25 or 27:

    sqlite> SELECT * FROM COMPANY WHERE AGE IN ( 25, 27 );

List all the records where the AGE value is neither 25 nor 27:

    sqlite> SELECT * FROM COMPANY WHERE AGE NOT IN ( 25, 27 );

List all the records where the AGE value is between 25 and 27, both inclusive:

    sqlite> SELECT * FROM COMPANY WHERE AGE BETWEEN 25 AND 27;

List the AGE of every record, provided that at least one record has SALARY > 65000. The EXISTS sub query is not correlated with the outer query, so it acts as an all-or-nothing condition:

    sqlite> SELECT AGE FROM COMPANY WHERE EXISTS (SELECT AGE FROM COMPANY WHERE SALARY > 65000);

Find the total salary for each name:

    sqlite> SELECT NAME, SUM(SALARY) FROM COMPANY GROUP BY NAME;

Add records with repeated names to the COMPANY table, so that grouping combines several rows:

    INSERT INTO COMPANY VALUES (8, 'Paul', 24, 'Houston', 20000.00 );
    INSERT INTO COMPANY VALUES (9, 'James', 44, 'Norway', 5000.00 );
    INSERT INTO COMPANY VALUES (10, 'James', 45, 'Texas', 5000.00 );

b) Order by clause:

    sqlite> SELECT NAME, SUM(SALARY) FROM COMPANY GROUP BY NAME ORDER BY NAME;

Consider the COMPANY table with the records inserted above.

c) The following example displays the records for which the name count is less than 2:

    sqlite> SELECT * FROM COMPANY GROUP BY name HAVING count(name) < 2;

The following example displays the records for which the name count is greater than 2:

    sqlite> SELECT * FROM COMPANY GROUP BY name HAVING count(name) > 2;
d) Sort the result in ascending order by SALARY:

    sqlite> SELECT * FROM COMPANY ORDER BY SALARY ASC;

e) Sort the result in descending order by NAME:

    sqlite> SELECT * FROM COMPANY ORDER BY NAME DESC;

f) Limit the result to the number of rows you want to fetch from the table:

    sqlite> SELECT * FROM COMPANY LIMIT 6;

Fetch 3 rows after skipping the first 2 rows:

    sqlite> SELECT * FROM COMPANY LIMIT 3 OFFSET 2;

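g) Joins

Cross join, which pairs every COMPANY row with every DEPARTMENT row:

    sqlite> SELECT EMP_ID, NAME, DEPT FROM COMPANY CROSS JOIN DEPARTMENT;

Inner join, which keeps only the rows that match on the join condition:

    sqlite> SELECT EMP_ID, NAME, DEPT FROM COMPANY INNER JOIN DEPARTMENT ON COMPANY.ID = DEPARTMENT.EMP_ID;

Left outer join, which keeps every COMPANY row and fills the DEPARTMENT columns with NULL where there is no match:

    sqlite> SELECT EMP_ID, NAME, DEPT FROM COMPANY LEFT OUTER JOIN DEPARTMENT ON COMPANY.ID = DEPARTMENT.EMP_ID;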

Practical 2

i) Multiplication Table (multiplicationtable.c)

    #include <apop.h>

    int main(){
        gsl_matrix *m = gsl_matrix_alloc(20,15);
        gsl_matrix_set_all(m, 1);
        //scale row i by i+1
        for (int i=0; i< m->size1; i++){
            Apop_matrix_row(m, i, one_row);
            gsl_vector_scale(one_row, i+1);
        }
        //scale column i by i+1
        for (int i=0; i< m->size2; i++){
            Apop_matrix_col(m, i, one_col);
            gsl_vector_scale(one_col, i+1);
        }
        apop_matrix_show(m);
        gsl_matrix_free(m);
    }

    jayesh@jayesh-G31M-S2L:~$ gcc -std=gnu99 multiplicationtable.c -o multiplicationtable.out -lapophenia -lgsl -lsqlite3
    jayesh@jayesh-G31M-S2L:~$ ./multiplicationtable.out
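To see why scaling the rows and then the columns of an all-ones matrix produces a multiplication table, the same two passes can be sketched in plain C without gsl or Apophenia. This is only an illustration: the names ROWS, COLS and fill_multiplication_table are assumptions made for this sketch and are not part of either library.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 20   /* same shape as gsl_matrix_alloc(20,15) in the practical */
#define COLS 15

/* Fill table so that table[i][j] == (i+1)*(j+1), using the same two
   passes as the practical: scale every row, then scale every column. */
void fill_multiplication_table(double table[ROWS][COLS])
{
    assert(table != NULL);
    for (int i = 0; i < ROWS; i++)          /* like gsl_matrix_set_all(m, 1) */
        for (int j = 0; j < COLS; j++)
            table[i][j] = 1.0;
    for (int i = 0; i < ROWS; i++)          /* row i is multiplied by i+1 */
        for (int j = 0; j < COLS; j++)
            table[i][j] *= i + 1;
    for (int j = 0; j < COLS; j++)          /* column j is multiplied by j+1 */
        for (int i = 0; i < ROWS; i++)
            table[i][j] *= j + 1;
}
```

After the call, the entry in row i and column j (counting from zero) holds (i+1)*(j+1), which is the value apop_matrix_show prints at that position in the practical.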


ii) The function calc_taxes takes a double indicating taxable income and returns the US income taxes owed, assuming a head of household with two dependents taking the standard deduction. The main function applies it to the median household income of each state. (taxes.c)

    #include <apop.h>

    double calc_taxes(double income){
        double cutoffs[] = {0, 11200, 42650, 110100, 178350, 349700, INFINITY};
        double rates[] = {0, 0.10, .15, .25, .28, .33, .35};
        double tax = 0;
        int bracket = 1;
        income -= 7850; //Head of household standard deduction
        income -= 3400*3; //exemption: self plus two dependents.
        while (income > 0){
            double width = cutoffs[bracket]-cutoffs[bracket-1];
            tax += rates[bracket] * GSL_MIN(income, width);
            income -= width; //remove only the slice that was just taxed
            bracket ++;
        }
        return tax;
    }

    int main(){
        apop_db_open("data-census.db");
        strncpy(apop_opts.db_name_column, "geo_name", 100);
        apop_data *d = apop_query_to_data("select geo_name, Household_median_in as income\
                    from income where sumlevel = '040'\
                    order by household_median_in desc");
        Apop_col_t(d, "income", income_vector);
        d->vector = apop_vector_map(income_vector, calc_taxes);
        apop_name_add(d->names, "tax owed", 'v');
        apop_data_show(d);
    }

    jayesh@jayesh-G31M-S2L:~$ gcc -std=gnu99 taxes.c -o taxes.out -lapophenia -lgsl -lsqlite3
    jayesh@jayesh-G31M-S2L:~$ ./taxes.out
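The bracket arithmetic can be checked by hand with a small library-free sketch. It assumes the input is already taxable income (deductions and exemptions subtracted) and reuses the cutoffs and rates listed in the practical; the function names progressive_tax and manual_bracket_tax are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Tax owed on `taxable` when slice b, between upper[b-1] and upper[b],
   is taxed at rate[b]. The upper limits are cumulative and increasing. */
double progressive_tax(double taxable, const double *upper,
                       const double *rate, size_t n)
{
    assert(n > 0);
    double tax = 0.0, lower = 0.0;
    for (size_t b = 0; b < n && taxable > lower; b++) {
        double top = taxable < upper[b] ? taxable : upper[b];
        tax += rate[b] * (top - lower);   /* only the part inside this slice */
        lower = upper[b];
    }
    return tax;
}

/* The bracket table used in the practical; DBL_MAX stands in for INFINITY. */
double manual_bracket_tax(double taxable)
{
    static const double upper[] = {11200, 42650, 110100, 178350, 349700, DBL_MAX};
    static const double rate[]  = {0.10, 0.15, 0.25, 0.28, 0.33, 0.35};
    return progressive_tax(taxable, upper, rate, 6);
}
```

For a taxable income of 50000 the slices are 11200 at 10%, 31450 at 15% and 7350 at 25%, so manual_bracket_tax(50000) comes to 1120 + 4717.5 + 1837.5 = 7675.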


Practical III

Plotting a vector (pipeplot.c)

    #include <apop.h>

    void plot_matrix_now(gsl_matrix *data){
        static FILE *gp = NULL;
        if (!gp)
            gp = popen("gnuplot -persist", "w"); //open the pipe to Gnuplot only once
        if (!gp){
            printf("Couldn't open Gnuplot.\n");
            return;
        }
        fprintf(gp,"reset; plot '-' \n");
        apop_matrix_print(data, .output_pipe=gp);
        fflush(gp);
    }

    int main(){
        apop_db_open("data-climate.db");
        plot_matrix_now(apop_query_to_matrix("select (year*12+month)/12., temp from temp"));
    }

    jayesh@jayesh-G31M-S2L:~$ gcc -std=gnu99 pipeplot.c -o pipeplot.out -lapophenia -lgsl -lsqlite3
    jayesh@jayesh-G31M-S2L:~$ ./pipeplot.out


Eigen vector (eigenbox.c)

    #include "eigenbox.h"

    apop_data *query_data(){
        apop_db_open("data-census.db");
        return apop_query_to_data(" select postcode as row_names, "
            " m_per_100_f, population/1e6 as population, median_age "
            " from geography, income,demos,postcodes "
            " where income.sumlevel= '040' "
            " and geography.geo_id = demos.geo_id "
            " and income.geo_name = postcodes.state "
            " and geography.geo_id = income.geo_id ");
    }

    void show_projection(gsl_matrix *pc_space, apop_data *data){
        fprintf(stderr,"The eigenvectors:\n");
        apop_matrix_print(pc_space, .output_pipe=stderr);
        apop_data *projected = apop_dot(data, apop_matrix_to_data(pc_space));
        printf("plot '-' using 2:3:1 with labels\n");
        apop_data_show(projected);
    }

    int main(){
        apop_plot_lattice(query_data(), "out");
    }

    jayesh@jayesh-G31M-S2L:~$ gcc -std=gnu99 eigenbox.c -o eigenbox.out -lapophenia -lgsl -lsqlite3
    jayesh@jayesh-G31M-S2L:~$ ./eigenbox.out
    jayesh@jayesh-G31M-S2L:~$ gnuplot -persist < out

Query out the month, average, and variance, and plot the data using errorbars. The program prints to stdout, so pipe the output through Gnuplot. (errorbars.c)

    #include <apop.h>

    int main(){
        apop_db_open("data-climate.db");
        apop_data *d = apop_query_to_data("select \
                    (yearmonth/100. - round(yearmonth/100.))*100 as month, \
                    avg(tmp), stddev(tmp) \
                    from precip group by month");
        printf("set xrange[0:13]; plot '-' with errorbars\n");
        apop_matrix_show(d->matrix);
    }

    jayesh@jayesh-G31M-S2L:~$ gcc -std=gnu99 errorbars.c -o errorbars.out -lapophenia -lgsl -lsqlite3
    jayesh@jayesh-G31M-S2L:~$ ./errorbars.out | gnuplot -persist


Practical 4

Implement the statistical distributions

Discrete distributions
1. Bernoulli distribution
2. Binomial distribution
3. Poisson distribution
4. Multinomial distribution
5. Hypergeometric distribution

Continuous distributions
1. Normal distribution
2. Lognormal distribution
3. Gamma distribution
4. Exponential distribution
5. Beta distribution

Bernoulli distribution (bernoulli.c)

    #include <stdio.h>
    #include <gsl/gsl_randist.h>

    int main (void)
    {
        int i;
        double p = 0.6;   /* probability of success */
        float sum=0;
        /* prints probability distribution table */
        printf("random variable|||probability |||cumulative prob.\n");
        printf("-------------------------------------------------------\n");
        for (i = 0; i <= 1; i++)
        {
            float k = gsl_ran_bernoulli_pdf (i,p);
            sum=sum+k;
            printf("%d\t\t%f\t\t%f\n",i,k,sum);
        }
        printf("\n");
        return 0;
    }

Binomial distribution (binomial.c)

    #include <stdio.h>
    #include <gsl/gsl_randist.h>

    int main (void)
    {
        int i,n=5;        /* n trials */
        double p = 0.6;   /* probability of success in each trial */
        float sum=0;
        /* prints probability distribution table */
        printf("random variable|||probability |||cumulative prob.\n");
        printf("-------------------------------------------------------\n");
        for (i = 0; i <= n; i++)
        {
            float k = gsl_ran_binomial_pdf (i,p,n);
            sum=sum+k;
            printf("%d\t\t%f\t\t%f\n",i,k,sum);
        }
        printf("\n");
        return 0;
    }
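The table printed by the binomial program can be cross-checked without gsl. The sketch below computes the same probabilities from the recurrence P(0) = (1-p)^n, P(k+1) = P(k) * (n-k)/(k+1) * p/(1-p). The function name binomial_pmf is an assumption for this illustration and is not a gsl call.

```c
#include <assert.h>

/* P(X = k) for X ~ Binomial(n, p), built up from P(X = 0) = (1-p)^n. */
double binomial_pmf(unsigned k, unsigned n, double p)
{
    assert(p > 0.0 && p < 1.0);
    if (k > n)
        return 0.0;
    double prob = 1.0;
    for (unsigned i = 0; i < n; i++)        /* (1-p)^n */
        prob *= 1.0 - p;
    for (unsigned i = 0; i < k; i++)        /* step from P(i) to P(i+1) */
        prob *= (double)(n - i) / (i + 1) * p / (1.0 - p);
    return prob;
}
```

With n = 5 and p = 0.6, as in the practical, binomial_pmf(0, 5, 0.6) is 0.4^5 = 0.01024 and binomial_pmf(3, 5, 0.6) is 10 * 0.6^3 * 0.4^2 = 0.3456; the six probabilities for k = 0..5 add up to 1, which matches the final entry of the cumulative column.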


Poisson distribution (poi.c)

    #include <stdio.h>
    #include <gsl/gsl_randist.h>

    int main (void)
    {
        int i, n = 10;     /* tabulate k = 0..n */
        double mu = 3.0;   /* mean of the distribution */
        float sum=0;
        /* prints probability distribution table */
        printf("random variable|||probability |||cumulative prob.\n");
        printf("-------------------------------------------------------\n");
        for (i = 0; i <= n; i++)
        {
            float k = gsl_ran_poisson_pdf (i,mu);
            sum=sum+k;
            printf("%d\t\t%f\t\t%f\n",i,k,sum);
        }
        printf("\n");
        return 0;
    }


Uniform distribution (uniform.c)

    #include <stdio.h>
    #include <gsl/gsl_randist.h>

    int main (void)
    {
        double x;
        int a,b;
        printf("enter value for x, a, b\n");
        scanf("%lf",&x);   /* x is a double, so it needs %lf */
        scanf("%d",&a);
        scanf("%d",&b);
        /* prints the density at x for the uniform distribution on [a, b) */
        printf("random variable|||probability \n");
        printf("-------------------------------------------------------\n");
        float k = (float)gsl_ran_flat_pdf (x,a,b);
        printf("%f\t\t%f\n",x,k);
        return 0;
    }

Multinomial distribution (multinomial.c)

    #include <stdio.h>
    #include <gsl/gsl_randist.h>

    int main (void)
    {
        int k=3;                               /* number of possible outcomes */
        const double p[]={0.2,0.4,0.4};        /* probability of each outcome */
        const unsigned int n[]={2,3,4};        /* observed count of each outcome */

        /* prints probability */
        printf("random variable|||probability \n");
        printf("-------------------------------------------------------\n");
        double pmf =gsl_ran_multinomial_pdf(k,p,n);
        printf("%3.9f\n",pmf);
        return 0;
    }


The following formula gives the probability of obtaining a specific set of outcomes when there are three possible outcomes for each event:

$$p = \frac{n!}{n_1!\, n_2!\, n_3!}\; p_1^{n_1}\, p_2^{n_2}\, p_3^{n_3}$$

where
p is the probability,
n is the total number of events,
n1 is the number of times Outcome 1 occurs,
n2 is the number of times Outcome 2 occurs,
n3 is the number of times Outcome 3 occurs,
p1 is the probability of Outcome 1,
p2 is the probability of Outcome 2, and
p3 is the probability of Outcome 3.

For the chess example,
n = 12 (12 games are played),
n1 = 7 (number won by Player A),
n2 = 2 (number won by Player B),
n3 = 3 (the number drawn),
p1 = 0.40 (probability Player A wins),
p2 = 0.35 (probability Player B wins),
p3 = 0.25 (probability of a draw).

Substituting these values gives $p = 7920 \times 0.40^{7} \times 0.35^{2} \times 0.25^{3} \approx 0.0248$.
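The multinomial probability can also be evaluated without gsl. The following is a minimal sketch, with a hypothetical function name multinomial_pmf, that builds the multinomial coefficient and the powers of the probabilities in a single pass so that no large factorial is ever formed.

```c
#include <assert.h>
#include <stddef.h>

/* n!/(n_1! ... n_k!) * p_1^n_1 ... p_k^n_k, where counts[j] = n_j.
   The coefficient is accumulated as a product of binomial coefficients
   C(n_1 + ... + n_j, n_j), one factor (placed / i) at a time. */
double multinomial_pmf(size_t k, const double *p, const unsigned *counts)
{
    assert(k > 0);
    double prob = 1.0;
    unsigned placed = 0;                    /* events assigned so far */
    for (size_t j = 0; j < k; j++) {
        for (unsigned i = 1; i <= counts[j]; i++) {
            placed++;
            prob *= (double)placed / i * p[j];
        }
    }
    return prob;
}
```

Called with p = {0.40, 0.35, 0.25} and counts = {7, 2, 3}, the coefficient works out to 12!/(7! 2! 3!) = 7920 and the result is about 0.0248. With the inputs of multinomial.c, p = {0.2, 0.4, 0.4} and counts = {2, 3, 4}, it is 1260 * 0.2^2 * 0.4^3 * 0.4^4, about 0.0826.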

The formula for k outcomes is

$$p = \frac{n!}{n_1!\, n_2! \cdots n_k!}\; p_1^{n_1}\, p_2^{n_2} \cdots p_k^{n_k}$$

Hypergeometric distribution (hyper.c)

    #include <stdio.h>
    #include <gsl/gsl_randist.h>

    int main (void)
    {
        int x,s,f,n;
        n=6;   //number of items drawn
        x=2;   //random variable
        s=13;  //success
        f=39;  //failure

        /* prints probability */
        printf("random variable|||probability \n");
        printf("-----------------------------------\n");
        double pmf =gsl_ran_hypergeometric_pdf(x,s,f,n);
        printf("%d %3.6f\n",x,pmf);
        return 0;
    }
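The hypergeometric probability is a ratio of binomial coefficients, C(s, x) * C(f, n-x) / C(s+f, n), so the value printed by hyper.c can be checked with a short library-free sketch. The helper names choose and hypergeometric_pmf are assumptions made for this illustration.

```c
#include <assert.h>

/* Binomial coefficient C(n, r), held in a double. */
double choose(unsigned n, unsigned r)
{
    if (r > n)
        return 0.0;
    double c = 1.0;
    for (unsigned i = 1; i <= r; i++)
        c = c * (n - r + i) / i;            /* c == C(n-r+i, i) after step i */
    return c;
}

/* Probability of x successes when n items are drawn without replacement
   from a population of s successes and f failures. */
double hypergeometric_pmf(unsigned x, unsigned s, unsigned f, unsigned n)
{
    assert(n <= s + f);
    if (x > n)
        return 0.0;
    return choose(s, x) * choose(f, n - x) / choose(s + f, n);
}
```

For the values in the practical (x = 2, s = 13, f = 39, n = 6) the three coefficients are C(13,2) = 78, C(39,4) = 82251 and C(52,6) = 20358520, so the probability is 6415578 / 20358520, roughly 0.3151.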


Continuous distributions (contdist.c)

For each distribution the program prints the density at x, the lower tail P, the upper tail Q, and the inverses of both tails.

    #include <stdio.h>
    #include <math.h>
    #include <gsl/gsl_rng.h>
    #include <gsl/gsl_randist.h>
    #include <gsl/gsl_cdf.h>

    void normal();
    void beta();
    void gamma1();
    void exponential();
    void lognormal();

    int main()
    {
        int choice;
        printf("continuous distributions\n");
        printf("-----------------------\n");
        printf("1:Normal distribution\n");
        printf("2:Gamma distribution\n");
        printf("3:Exponential distribution\n");
        printf("4:Beta distribution\n");
        printf("5:Lognormal distribution\n");
        printf("enter your choice\n");
        scanf("%d",&choice);
        switch(choice)
        {
            case 1: normal(); break;
            case 2: gamma1(); break;
            case 3: exponential(); break;
            case 4: beta(); break;
            case 5: lognormal(); break;
            default: printf("wrong choice\n");
        }
        return 0;
    }

    void normal()
    {
        double P, Q;
        double x = 10;
        double sigma=5;
        double pdf;
        printf("Normal distribution :x=%f sigma=%f\n",x,sigma);
        pdf = gsl_ran_gaussian_pdf (x,sigma);
        printf ("prob(x = %f) = %f\n", x, pdf);
        P = gsl_cdf_gaussian_P (x,sigma);
        printf ("prob(x < %f) = %f\n", x, P);
        Q = gsl_cdf_gaussian_Q (x,sigma);
        printf ("prob(x > %f) = %f\n", x, Q);
        x = gsl_cdf_gaussian_Pinv (P,sigma);
        printf ("Pinv(%f) = %f\n", P, x);
        x = gsl_cdf_gaussian_Qinv (Q,sigma);
        printf ("Qinv(%f) = %f\n", Q, x);
    }

    void gamma1()
    {
        double P, Q;
        double x = 1.5;
        double a=1;
        double b=2;
        double pdf;
        printf("Gamma distribution :x=%f a=%f b=%f\n",x,a,b);
        pdf = gsl_ran_gamma_pdf (x,a,b);
        printf ("prob(x = %f) = %f\n", x, pdf);
        P = gsl_cdf_gamma_P (x,a,b);
        printf ("prob(x < %f) = %f\n", x, P);
        Q = gsl_cdf_gamma_Q (x,a,b);
        printf ("prob(x > %f) = %f\n", x, Q);
        x = gsl_cdf_gamma_Pinv (P,a,b);
        printf ("Pinv(%f) = %f\n", P, x);
        x = gsl_cdf_gamma_Qinv (Q,a,b);
        printf ("Qinv(%f) = %f\n", Q, x);
    }

    void exponential()
    {
        double P, Q;
        double x = 0.05;
        double lambda=2;
        double pdf;
        printf("Exponential distribution :x=%f lambda=%f\n",x,lambda);
        pdf = gsl_ran_exponential_pdf (x,lambda);
        printf ("prob(x = %f) = %f\n", x, pdf);
        P = gsl_cdf_exponential_P (x,lambda);
        printf ("prob(x < %f) = %f\n", x, P);
        Q = gsl_cdf_exponential_Q (x,lambda);
        printf ("prob(x > %f) = %f\n", x, Q);
        x = gsl_cdf_exponential_Pinv (P,lambda);
        printf ("Pinv(%f) = %f\n", P, x);
        x = gsl_cdf_exponential_Qinv (Q,lambda);
        printf ("Qinv(%f) = %f\n", Q, x);
    }

    void beta()
    {
        double P, Q;
        double x = 0.8;
        double a=0.5;
        double b=0.5;
        double pdf;
        printf("Beta distribution :x=%f a=%f b=%f\n",x,a,b);
        pdf = gsl_ran_beta_pdf (x,a,b);
        printf ("prob(x = %f) = %f\n", x, pdf);
        P = gsl_cdf_beta_P (x,a,b);
        printf ("prob(x < %f) = %f\n", x, P);
        Q = gsl_cdf_beta_Q (x,a,b);
        printf ("prob(x > %f) = %f\n", x, Q);
        x = gsl_cdf_beta_Pinv (P,a,b);
        printf ("Pinv(%f) = %f\n", P, x);
        x = gsl_cdf_beta_Qinv (Q,a,b);
        printf ("Qinv(%f) = %f\n", Q, x);
    }

    void lognormal()
    {
        double P, Q;
        double x = 4;
        double zeta=2;
        double sigma=1.5;
        double pdf;
        printf("Lognormal distribution :x=%f zeta=%f sigma=%f\n",x,zeta,sigma);
        pdf = gsl_ran_lognormal_pdf (x,zeta,sigma);
        printf ("prob(x = %f) = %f\n", x, pdf);
        P = gsl_cdf_lognormal_P (x,zeta,sigma);
        printf ("prob(x < %f) = %f\n", x, P);
        Q = gsl_cdf_lognormal_Q (x,zeta,sigma);
        printf ("prob(x > %f) = %f\n", x, Q);
        x = gsl_cdf_lognormal_Pinv (P,zeta,sigma);
        printf ("Pinv(%f) = %f\n", P, x);
        x = gsl_cdf_lognormal_Qinv (Q,zeta,sigma);
        printf ("Qinv(%f) = %f\n", Q, x);
    }


Practical No. 5

Implement regression and goodness of fit

A. Implementing regression

Functions used:

int gsl_fit_wlinear (const double * x, const size_t xstride, const double * w, const size_t wstride, const double * y, const size_t ystride, size_t n, double * c0, double * c1, double * cov00, double * cov01, double * cov11, double * chisq)

This function computes the best-fit linear regression coefficients (c0, c1) of the model Y = c_0 + c_1 X for the weighted dataset (x, y), two vectors of length n with strides xstride and ystride. The vector w, of length n and stride wstride, specifies the weight of each datapoint. The weight is the reciprocal of the variance for each datapoint in y. The covariance matrix for the parameters (c0, c1) is computed using the weights and returned via the parameters (cov00, cov01, cov11). The weighted sum of squares of the residuals from the best-fit line, \chi^2, is returned in chisq.

int gsl_fit_linear_est (double x, double c0, double c1, double cov00, double cov01, double cov11, double * y, double * y_err)

This function uses the best-fit linear regression coefficients c0, c1 and their covariance cov00, cov01, cov11 to compute the fitted function y and its standard deviation y_err for the model Y = c_0 + c_1 X at the point x.

The following program computes a least squares straight-line fit to a simple dataset, and outputs the best-fit line and its associated one standard-deviation error bars.

    #include <stdio.h>
    #include <math.h>      /* for sqrt */
    #include <gsl/gsl_fit.h>

    int main (void)
    {
        int i, n = 4;
        double x[4] = { 1970, 1980, 1990, 2000 };
        double y[4] = { 12, 11, 14, 13 };
        double w[4] = { 0.1, 0.2, 0.3, 0.4 };
        double c0, c1, cov00, cov01, cov11, chisq;

        gsl_fit_wlinear (x, 1, w, 1, y, 1, n, &c0, &c1, &cov00, &cov01, &cov11, &chisq);
        printf ("# best fit: Y = %g + %g X\n", c0, c1);
        printf ("# covariance matrix:\n");
        printf ("# [ %g, %g\n# %g, %g]\n", cov00, cov01, cov01, cov11);
        printf ("# chisq = %g\n", chisq);

        /* the data points with their standard deviations 1/sqrt(w) */
        for (i = 0; i < n; i++)
            printf ("data: %g %g %g\n", x[i], y[i], 1/sqrt(w[i]));
        printf ("\n");

        /* the fitted line with one standard-deviation bounds */
        for (i = -30; i < 130; i++)
        {
            double xf = x[0] + (i/100.0) * (x[n-1] - x[0]);
            double yf, yf_err;
            gsl_fit_linear_est (xf, c0, c1, cov00, cov01, cov11, &yf, &yf_err);
            printf ("fit: %g %g\n", xf, yf);
            printf ("hi : %g %g\n", xf, yf + yf_err);
            printf ("lo : %g %g\n", xf, yf - yf_err);
        }
        return 0;
    }
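The coefficients of a weighted straight-line fit follow from the weighted means and weighted sums of squares: c1 = Sxy / Sxx and c0 = ybar - c1 * xbar. The sketch below works under that assumption; it computes only the two coefficients, not the covariance matrix or chi-square, and the function name weighted_line_fit is hypothetical rather than part of gsl.

```c
#include <assert.h>
#include <stddef.h>

/* Weighted least-squares fit of Y = c0 + c1 X. w[i] is the weight of
   point i (the reciprocal of the variance of y[i]). */
void weighted_line_fit(const double *x, const double *w, const double *y,
                       size_t n, double *c0, double *c1)
{
    assert(n >= 2);
    double wsum = 0.0, xbar = 0.0, ybar = 0.0;
    for (size_t i = 0; i < n; i++) {        /* weighted means */
        wsum += w[i];
        xbar += w[i] * x[i];
        ybar += w[i] * y[i];
    }
    xbar /= wsum;
    ybar /= wsum;

    double sxx = 0.0, sxy = 0.0;
    for (size_t i = 0; i < n; i++) {        /* weighted sums of squares */
        double dx = x[i] - xbar;
        sxx += w[i] * dx * dx;
        sxy += w[i] * dx * (y[i] - ybar);
    }
    assert(sxx > 0.0);                      /* x values must not all coincide */
    *c1 = sxy / sxx;
    *c0 = ybar - *c1 * xbar;
}
```

A quick sanity check is to feed it points that lie exactly on a line, for example x = {0, 1, 2, 3} and y = {2, 5, 8, 11} with any positive weights: the fit should return c0 = 2 and c1 = 3 up to rounding.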


B. Implementing goodness of fit - Chi Square

Functions used:

int apop_db_open ( char const * filename )

If you want to use a database on the hard drive instead of memory, then call this once and only once before using any other database utilities. When you are done doing your database manipulations, be sure to call apop_db_close if writing to disk.

Parameters:
    filename   The name of a file on the hard drive on which to store the database.
Returns:
    0: everything OK
    1: database did not open.

apop_model* apop_estimate ( apop_data * d, apop_model m )

Estimate the parameters of a model given data. This function copies the input model, preps it, and calls m.estimate(d,&m). If your model has no estimate method, then I assume apop_maximum_likelihood(d, m), with the default MLE params.

Parameters:
    d   The data
    m   The model
Returns:
    A pointer to an output model, which typically matches the input model but has its parameters element filled in.

apop_model* apop_model_to_pmf ( apop_model * model, apop_data * binspec, long int draws, int bin_count, gsl_rng * rng )

Make random draws from an apop_model, and bin them using a binspec in the style of apop_data_to_bins. If you have a data set that used the same binspec, you now have synced histograms, which you can plot or sensibly test hypotheses about. The output is normalized to integrate to one.

Parameters:
    model       The model to be drawn from. Because this function works via random draws, the model needs to have a draw method. (No default)
    binspec     A description of the bins in which to place the draws; see apop_data_to_bins. (default: as in apop_data_to_bins.)
    draws       The number of random draws to make. (arbitrary default = 10,000)
    bin_count   If no bin spec, the number of bins to use (default: as per apop_data_to_bins)
    rng         The gsl_rng used to make random draws. (default: see note on Auto-allocated RNGs)
Returns:
    An apop_pmf model.

This function uses the Designated initializers syntax for inputs.

    #include <apop.h>

    int main(){
        apop_db_open("data-climate.db");
        apop_data *precip = apop_query_to_data("select PCP from precip");
        //fit a Normal model, then bin both the data and draws from the model
        apop_model *est = apop_estimate(precip, apop_normal);
        apop_data *precip_binned = apop_data_to_bins(precip/*, .bin_count=180*/);
        apop_model *datahist = apop_estimate(precip_binned, apop_pmf);
        apop_model *modelhist = apop_model_to_pmf(.model=est,
                    .binspec=apop_data_get_page(precip_binned, "<binspec>"), .draws=1e5);
        //rescale the model histogram to the same total weight as the data histogram
        double scaling = apop_sum(datahist->data->weights)/apop_sum(modelhist->data->weights);
        gsl_vector_scale(modelhist->data->weights, scaling);
        apop_data_show(apop_histograms_test_goodness_of_fit(datahist, modelhist));
    }


Prac 6. Implement testing with likelihood

1. Building an optimized model and then solving it for a maximum (a function can be provided in this case). The search methods used are:

    APOP_SIMPLEX_NM   Nelder-Mead simplex (gradient handling rule is irrelevant)
    APOP_CG_FR        Conjugate gradient (Fletcher-Reeves) (default)
    APOP_SIMAN        Simulated annealing
    APOP_RF_NEWTON    Find a root of the derivative via Newton's method

The model to be maximized (sinsq.c):

    #include <apop.h>

    double sin_square(apop_data *data, apop_model *m){
        double x = apop_data_get(m->parameters, 0, -1);
        return -sin(x)*gsl_pow_2(x);
    }

    apop_model sin_sq_model ={"-sin(x) times x^2",1, .p = sin_square};

The search program, which includes sinsq.c and runs each method in turn:

    #include "sinsq.c"

    void do_search(int number, char *name, char *trace){
        apop_model *out;
        double p[] = {0};
        double result;
        char *outf;
        asprintf(&outf, "localmax_out/%s.gplot", trace);
        Apop_model_add_group(&sin_sq_model, apop_mle, .starting_pt= p,
                    .method= number, .tolerance= 1e-4, .mu_t= 1.25, .trace_path= outf);
        out = apop_estimate(NULL, sin_sq_model);
        result = gsl_vector_get(out->parameters->vector, 0);
        printf("The %s algorithm found %g.\n", name, result);
        Apop_settings_rm_group(&sin_sq_model, apop_mle);
    }

    int main(){
        system ("mkdir -p localmax_out; rm -f localmax_out/*.gplot");
        apop_opts.verbose ++;
        do_search(APOP_SIMPLEX_NM, "N-M Simplex", "simplex");
        do_search(APOP_CG_FR, "F-R Conjugate gradient", "fr");
        do_search(APOP_SIMAN, "Simulated annealing", "siman");
        do_search(APOP_RF_NEWTON, "Root-finding", "root");
        fflush(NULL);
        //prepend a plot command to each trace file
        system("sed -i \"1iplot '-'\" localmax_out/*.gplot");
    }

2. Comparing two models using the likelihood ratio

The two models (dummies.c):

    #include <apop.h>

    apop_model * dummies(int slope_dummies){
        apop_data *d = apop_query_to_mixed_data("mmt", "select riders, year-1977, line \
                    from riders, lines \
                    where riders.station=lines.station");
        apop_data *dummified = apop_data_to_dummies(d, 0, 't', .append='y', .remove='y');
        if (slope_dummies){
            Apop_col(d, 1, yeardata);
            for(int i=0; i < dummified->matrix->size2; i ++){
                Apop_col(dummified, i, c);
                gsl_vector_mul(c, yeardata);
            }
        }
        apop_model *out = apop_estimate(dummified, apop_ols);
        apop_model_show(out);
        return out;
    }

    #ifndef TESTING
    int main(){
        apop_db_open("data-metro.db");
        printf("With constant dummies:\n"); dummies(0);
        printf("With slope dummies:\n"); dummies(1);
    }
    #endif

The test program, which includes dummies.c with TESTING defined so that only its own main is compiled:

    #define TESTING
    #include "dummies.c"

    void show_normal_test(apop_model *unconstrained, apop_model *constrained, int n){
        double statistic = (apop_data_get(unconstrained->info, .rowname="log likelihood")
                    - apop_data_get(constrained->info, .rowname="log likelihood"))/sqrt(n);
        double confidence = gsl_cdf_gaussian_P(fabs(statistic), 1); //one-tailed.
        printf("The Normal statistic is: %g, so reject the null of no difference between models "
               "with %g%% confidence.\n", statistic, confidence*100);
    }

    int main(){
        apop_db_open("data-metro.db");
        apop_model *m0 = dummies(0);
        apop_model *m1 = dummies(1);
        show_normal_test(m0, m1, m0->data->matrix->size1);
    }


Prac 7. Generate random numbers using the Monte Carlo method using
1. Exponential distribution
2. Uniform distribution
3. Binomial distribution

Some functions used for random number generation

The functions used for random number generation are declared in the header file `gsl_rng.h'.

const gsl_rng_type * T : holds static information about each type of generator.

gsl_rng_env_setup() : This function reads the environment variables GSL_RNG_TYPE and GSL_RNG_SEED and uses their values to set the corresponding library variables gsl_rng_default and gsl_rng_default_seed.

Program to create a global generator using the environment variables GSL_RNG_TYPE and GSL_RNG_SEED:

    #include <stdio.h>
    #include <gsl/gsl_rng.h>

    gsl_rng * r; /* global generator */

    int main (void)
    {
        const gsl_rng_type * T;
        gsl_rng_env_setup();
        T = gsl_rng_default;
        r = gsl_rng_alloc (T);
        printf ("generator type: %s\n", gsl_rng_name (r));
        printf ("seed = %lu\n", gsl_rng_default_seed);
        printf ("first value = %lu\n", gsl_rng_get (r));
        gsl_rng_free (r);
        return 0;
    }

Running the program without any environment variables uses the initial defaults, an mt19937 generator with a seed of 0 as follows:

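By setting the two variables on the command line we can change the default generator and the seed as follows:

$ GSL_RNG_TYPE="taus" GSL_RNG_SEED=123 ./a.out
GSL_RNG_TYPE=taus
GSL_RNG_SEED=123
generator type: taus
seed = 123
first value = 2720986350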

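Using exponential distribution

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>

int main(int argc, char *argv[])
{
    int i, n;
    float x, alpha;

    if (argc < 3) { /* expects: number of values, then alpha */
        fprintf(stderr, "usage: %s n alpha\n", argv[0]);
        return 1;
    }
    gsl_rng *r = gsl_rng_alloc(gsl_rng_mt19937); /* initialises GSL RNG */
    n = atoi(argv[1]);
    alpha = atof(argv[2]);
    x = 0;
    for (i = 0; i < n; i++) {
        /* previous value scaled by alpha, plus an exponential draw with mean 1 */
        x = alpha*x + gsl_ran_exponential(r, 1);
        printf(" %2.4f \n", x);
    }
    gsl_rng_free(r);
    return(0);
}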
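The loop in the exponential program does not print independent draws: each printed value is the previous one scaled by alpha plus a fresh exponential draw with mean 1. A minimal sketch of that recursion in plain C follows, with the random draws passed in as an array so the arithmetic can be checked by hand. The function name `ar1_path` and the idea of supplying the shocks explicitly are assumptions made for illustration; they are not part of GSL or of the manual's program.

```c
#include <stddef.h>

/* Hypothetical helper: fills out[0..n-1] with x_i = alpha * x_{i-1} + shock[i],
 * starting from x = 0, mirroring the loop in the exponential program.
 * In that program each shock would come from gsl_ran_exponential(r, 1). */
void ar1_path(const double *shock, size_t n, double alpha, double *out)
{
    double x = 0.0;
    for (size_t i = 0; i < n; i++) {
        x = alpha * x + shock[i];
        out[i] = x;
    }
}
```

With alpha = 0.5 and three shocks all equal to 1, the path is 1, 1.5, 1.75. With alpha = 0 the output reduces to the raw draws.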

Generating uniform random numbers in the range [0.0, 1.0) using uniform distribution

#include <stdio.h>
#include <gsl/gsl_rng.h>

int main (void)
{
    const gsl_rng_type * T;
    gsl_rng * r;
    int i, n = 10;

    gsl_rng_env_setup();
    T = gsl_rng_default;
    r = gsl_rng_alloc (T);

    for (i = 0; i < n; i++) {
        double u = gsl_rng_uniform (r);
        printf ("%.5f\n", u);
    }

    gsl_rng_free (r);
    return 0;
}
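Uniform draws like these are the raw material of a Monte Carlo estimate: average a function over many draws and the average approaches the function's integral. The sketch below estimates the integral of x^2 over [0, 1], whose exact value is 1/3. To stay free of external libraries it uses a small linear congruential generator as a stand-in for gsl_rng_uniform; that generator, its constants, and the names `lcg_uniform` and `mc_integral_x2` are assumptions of this sketch, not part of the manual.

```c
#include <stdint.h>
#include <stddef.h>

/* Stand-in uniform generator on [0, 1): a 64-bit linear congruential
 * generator (an assumption for this sketch; the manual uses gsl_rng_uniform). */
static double lcg_uniform(uint64_t *state)
{
    *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
    return (double)(*state >> 11) / 9007199254740992.0; /* top 53 bits / 2^53 */
}

/* Monte Carlo estimate of the integral of x^2 over [0, 1]:
 * the average of u*u over n uniform draws. */
double mc_integral_x2(size_t n, uint64_t seed)
{
    uint64_t state = seed;
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        double u = lcg_uniform(&state);
        sum += u * u;
    }
    return sum / (double)n;
}
```

The error of such an estimate shrinks roughly like one over the square root of n, so a hundred thousand draws should land close to 1/3 but not exactly on it.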

Using binomial distribution

#include <stdio.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>

int main (void)
{
    const gsl_rng_type * T;
    gsl_rng * r;
    int i, n = 10;
    double p = 0.3; /* success probability of each trial */

    /* create a generator chosen by the environment variable GSL_RNG_TYPE */
    gsl_rng_env_setup();
    T = gsl_rng_default;
    r = gsl_rng_alloc (T);

    /* print n random variates chosen from the binomial distribution
       with n trials and success probability p */
    for (i = 0; i < n; i++) {
        unsigned int k = gsl_ran_binomial (r, p, n);
        printf (" %u", k);
    }
    printf ("\n");
    gsl_rng_free (r);
    return 0;
}
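Each value returned by gsl_ran_binomial(r, p, n) is a count of successes in n independent trials that each succeed with probability p. The sketch below spells out that definition in plain C, taking the uniform draws as an input array so the result is reproducible. The function name `binomial_from_uniforms` is hypothetical, and GSL itself uses a more efficient algorithm; this is only meant to show what the variate represents.

```c
#include <stddef.h>

/* Hypothetical helper: counts successes in n Bernoulli(p) trials, where
 * trial i succeeds when the uniform draw u[i] in [0, 1) is below p.
 * The count is one binomial(n, p) variate. */
unsigned int binomial_from_uniforms(const double *u, size_t n, double p)
{
    unsigned int k = 0;
    for (size_t i = 0; i < n; i++)
        if (u[i] < p)
            k++;
    return k;
}
```

For the draws 0.1, 0.5, 0.29, 0.3, 0.9 and p = 0.3, only 0.1 and 0.29 fall below p, so the variate is 2.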

The following functions can be used to generate random numbers from different distributions once the required parameters are known.


Practical No. 8 Implementing Parametric testing

1. t test

#include <apop.h>

int main(){
    apop_db_open("data-census.db");
    gsl_vector *n = apop_query_to_vector("select in_per_capita from income "
        "where state= (select state from geography where name ='North Dakota')");
    gsl_vector *s = apop_query_to_vector("select in_per_capita from income "
        "where state= (select state from geography where name ='South Dakota')");
    apop_data *t = apop_t_test(n,s);
    apop_data_show(t); //show the whole output set...
    printf ("\n confidence: %g\n", apop_data_get(t, .rowname="conf.*2 tail")); //...or just one value.
}
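For reference, the textbook two-sample t statistic for comparing the North Dakota and South Dakota income vectors is sketched below. The subscripts N and S and the symbols for the sample means, sample variances and sample sizes are notation introduced here. This is the standard unequal-variance form; the exact denominators and degrees of freedom that apop_t_test uses are not stated in this manual and should be checked against the Apophenia documentation.

```latex
t = \frac{\bar{x}_N - \bar{x}_S}
         {\sqrt{\dfrac{s_N^2}{n_N} + \dfrac{s_S^2}{n_S}}}
```

A large absolute value of t relative to the t distribution corresponds to a high two-tailed confidence that the two means differ, which is the quantity the program reads from the "conf.*2 tail" row.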

2. F test

apop_data* apop_f_test (apop_model * est, apop_data * contrast)

Runs an F-test specified by q and c. Your best bet is to see the chapter on hypothesis testing in Modeling With Data, p 309. It will tell you that:

$$\frac{N-K}{q}\,\frac{(\mathbf{Q}'\hat\beta - \mathbf{c})'\,[\mathbf{Q}'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{Q}]^{-1}\,(\mathbf{Q}'\hat\beta - \mathbf{c})}{\mathbf{u}'\mathbf{u}} \sim F_{q,\,N-K},$$

and that's what this function is based on. Here N is the number of observations, K the number of parameters, q the number of constraints (rows of Q), and u the vector of regression residuals.

Parameters:
est       An apop_model that you have already calculated. (No default)
contrast  The matrix Q and the vector c, where each row represents a hypothesis. (Defaults: if the matrix is NULL, it is set to the identity matrix with the top row missing. If the vector is NULL, it is set to a zero vector of length equal to the height of the contrast matrix. Thus, if the entire apop_data set is NULL or omitted, we are testing the hypothesis that all but $\beta_1$ are zero.)

Returns: An apop_data set with a few variants on the confidence with which we can reject the joint hypothesis.

Todo: There should be a way to get OLS and GLS to store $(X'X)^{-1}$. In fact, if you did GLS, this is invalid, because you need $(X'\Sigma X)^{-1}$, and I didn't ask for $\Sigma$.

There are two approaches to an F-test: the ANOVA approach, which is typically built around the claim that all effects but the mean are zero; and the more general regression form, which allows for any set of linear claims about the data. If you send a NULL contrast set, I will generate the set of linear contrasts that are equivalent to the ANOVA-type approach. Readers of Modeling with Data, note that there's a bug in the book that claims that the traditional ANOVA approach also checks that the coefficient for the constant term is zero; this is not the custom and doesn't produce the equivalence presented in that and other textbooks.

Exceptions:
out->error='a' Allocation error.
out->error='d' Dimension-matching error.
out->error='i' Matrix inversion error.
out->error='m' GSL math error.

#include "eigenbox.h"

int main(){
    double line[] = {0, 0, 0, 1};
    apop_data *constr = apop_line_to_data(line, 1, 1, 3);
    apop_data *d = query_data();
    apop_model *est = apop_estimate(d, apop_ols);
    apop_model_show(est);
    apop_data_show(apop_f_test(est, constr));
}


Practical No. 9 Drawing an Inference

Obtaining the mean, standard error and p value for the given data.

#include <apop.h>

/* Fill boot_sample by drawing from base_data with replacement. */
void one_boot(gsl_vector *base_data, gsl_rng *r, gsl_vector *boot_sample){
    for (int i = 0; i < boot_sample->size; i++)
        gsl_vector_set(boot_sample, i,
            gsl_vector_get(base_data, gsl_rng_uniform_int(r, base_data->size)));
}

int main(){
    int rep_ct = 10000;
    gsl_rng *r = apop_rng_alloc(0);
    apop_db_open("data-census.db");
    gsl_vector *base_data = apop_query_to_vector(
        "select in_per_capita from income where sumlevel+0.0 =40");
    double RI = apop_query_to_float(
        "select in_per_capita from income where sumlevel+0.0 =40 and geo_id2+0.0=44");
    gsl_vector *boot_sample = gsl_vector_alloc(base_data->size);
    gsl_vector *replications = gsl_vector_alloc(rep_ct);
    for (int i = 0; i < rep_ct; i++){
        one_boot(base_data, r, boot_sample);
        gsl_vector_set(replications, i, apop_mean(boot_sample));
    }
    double stderror = sqrt(apop_var(replications));
    double mean = apop_mean(replications);
    printf("mean: %g; standard error: %g; (RI-mean)/stderr: %g; p value: %g\n",
           mean, stderror, (RI-mean)/stderror,
           2*gsl_cdf_gaussian_Q(fabs(RI-mean), stderror));
}
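The quantities printed by the bootstrap program can be written out as follows. The notation is introduced here for illustration: R is rep_ct, each \bar{x}^{*}_{r} is the mean of one bootstrap sample, and \Phi is the standard Normal CDF. The R - 1 denominator assumes that apop_var returns the sample variance, which is an assumption of this sketch rather than something the manual states.

```latex
\hat{m} = \frac{1}{R}\sum_{r=1}^{R} \bar{x}^{*}_{r}, \qquad
\widehat{\mathrm{se}} = \sqrt{\frac{1}{R-1}\sum_{r=1}^{R}
    \left(\bar{x}^{*}_{r} - \hat{m}\right)^{2}}, \qquad
z = \frac{\mathrm{RI} - \hat{m}}{\widehat{\mathrm{se}}}, \qquad
p = 2\,\bigl(1 - \Phi(|z|)\bigr)
```

The call gsl_cdf_gaussian_Q(fabs(RI-mean), stderror) evaluates the upper tail of a Normal distribution with standard deviation stderror at |RI - mean|, which equals 1 - \Phi(|z|); doubling it gives the two-tailed p value.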


Practical No. 10 Implementing Non-parametric Testing


1. ANOVA

apop_data* apop_anova (char * table, char * data, char * grouping1, char * grouping2)

This function produces a traditional one- or two-way ANOVA table. It works from data in an SQL table, using queries of the form select data from table group by grouping1, grouping2.

Parameters:
table      The table to be queried. Anything that can go in an SQL from clause is OK, so this can be a plain table name or a temp table specification like (select ... ), with parens.
data       The name of the column holding the count or other such data.
grouping1  The name of the first column by which to group data.
grouping2  If this is NULL, then the function will return a one-way ANOVA. Otherwise, the name of the second column by which to group data in a two-way ANOVA.

#include <apop.h>

int main(){
    apop_db_open("data-metro.db");
    char joinedtab[] = "(select year, riders, line \
                        from riders, lines \
                        where riders.station = lines.station)";
    apop_data_show(apop_anova(joinedtab, "riders", "line", "year"));
}
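The central number in a one-way ANOVA table is the F ratio of between-group to within-group mean squares. The sketch below computes only that ratio in plain C for data already held in memory; apop_anova works from an SQL table and reports a fuller table. The function name `anova_f_oneway` and the back-to-back array layout are assumptions chosen for illustration.

```c
#include <stddef.h>

/* One-way ANOVA F statistic for k groups stored back to back in `values`,
 * with group g holding sizes[g] observations. Returns
 * (between-group mean square) / (within-group mean square).
 * Assumes k >= 2, more observations than groups, and nonzero
 * within-group variation. */
double anova_f_oneway(const double *values, const size_t *sizes, size_t k)
{
    size_t n = 0;
    double grand_sum = 0.0;
    for (size_t g = 0, pos = 0; g < k; g++)
        for (size_t i = 0; i < sizes[g]; i++, pos++, n++)
            grand_sum += values[pos];
    double grand_mean = grand_sum / (double)n;

    double ss_between = 0.0, ss_within = 0.0;
    for (size_t g = 0, pos = 0; g < k; g++) {
        double group_sum = 0.0;
        for (size_t i = 0; i < sizes[g]; i++)
            group_sum += values[pos + i];
        double group_mean = group_sum / (double)sizes[g];
        double shift = group_mean - grand_mean;
        ss_between += (double)sizes[g] * shift * shift;
        for (size_t i = 0; i < sizes[g]; i++, pos++) {
            double resid = values[pos] - group_mean;
            ss_within += resid * resid;
        }
    }
    /* degrees of freedom: k - 1 between groups, n - k within groups */
    return (ss_between / (double)(k - 1)) / (ss_within / (double)(n - k));
}
```

For the three groups {1, 2, 3}, {2, 3, 4}, {3, 4, 5} the between-group sum of squares is 6 on 2 degrees of freedom and the within-group sum of squares is 6 on 6 degrees of freedom, so the ratio is 3.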


References
1. Modeling with Data, Ben Klemens, Princeton University Press
2. Computational Statistics, James E. Gentle, Springer
3. Computational Statistics, Second Edition, Geof H. Givens and Jennifer A. Hoeting, Wiley Publications
4. www.cygwin.com
5. http://apophenia.info/
