
Welcome to PowerPoint slides for

Chapter 7

Planning the Data Analysis

Marketing Research Text and Cases by Rajendra Nargundkar

Slide 1

Processing of Data with Computer Packages

This chapter deals with:
1. A brief description of data processing and analysis packages for computerised analysis.
2. Common rules for adapting data for computerised analysis, including coding.
3. Some analytical approaches for univariate, bivariate and multivariate analysis.
4. The three factors which determine the analytical technique to be selected for a problem.
5. The concept of hypothesis testing, and
6. How to perform a 't' test using the computer.

Slide 2

Statistical and Data Processing Packages

1. Today, in most cases, the computer is used for data processing and analysis.
2. Most students of management are familiar with simple data processing packages like Excel and FoxPro, which are essentially spreadsheet and database management packages.

3. But for the types and quantum of data generated by a field survey, there is another set of packages available, and the student can choose from several which are commercially available. Most of these have been developed in the U.S., and are now available either directly from the respective company's marketing office in India, or through their dealers.
4. Some of these packages are SPSS, SAS, STATISTICA and SYSTAT. There are several others also available, but these four are among the more popular and widely available. The names of these packages are registered trademarks of their companies. Usually, the package name is an abbreviation of its function. For example, SPSS stands for Statistical Package for the Social Sciences.
5. The new versions of these packages are usually Windows-based. They are user-friendly, and can be learnt fairly easily. In this book, SPSS and STATISTICA have been used for data analysis.

Slide 3

Types of Analysis

Packages like SPSS, STATISTICA, etc. can be used for two major types of applications in Marketing Research:
1. Data Processing
2. General and Specialised Statistical Analysis (Univariate, Bivariate and Multivariate)

Data Processing
This application includes coding and entering data for all respondents, for all questions on a questionnaire. For example, there may be a question which asks for the education level of a participant. The choices may be 12th or below, Graduate, Post-Graduate and Any Other. The first step in data processing is to assign a code for each of the options: for instance, 1 for "12th or below", 2 for "Graduate", 3 for "Post-Graduate" and 4 for "Any Other". Next, depending on the option ticked by each respondent, the respective code is entered against his row (usually, the data for one respondent is entered in a row assigned to him in the data set), in the column assigned to the question, in the data matrix.

Slide 3.contd... The end result of data processing for this question would be to be able to tell the researcher how many of the sample of respondents were of education level 12th or below (Code 1), how many were Graduates (Code 2), how many Post-Graduates (Code 3) and how many were in any other category (Code 4). For example, it could be that out of a sample of 500 respondents, 100 were in Code 1 category, 200 in Code 2, 150 in Code 3, and 50 in Code 4 (Any other). Similarly, all other questions on the questionnaire are processed, and totals for each category of answers can be computed. The menu commands used for such data processing are called FREQUENCIES, SUMMARY STATISTICS, DESCRIPTIVE STATISTICS, or TABLES depending on the software package used.
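As an illustration outside the SPSS/STATISTICA environment described in the text, the sketch below shows how the same kind of frequency counts could be obtained from coded data using Python with pandas. The column name "educ", the sample values and the label mapping are assumptions made for the example, not data from the questionnaire in the text.

```python
import pandas as pd

# Coded education levels for a (hypothetical) sample of respondents:
# 1 = 12th or below, 2 = Graduate, 3 = Post-Graduate, 4 = Any other
data = pd.DataFrame({"educ": [1, 2, 2, 3, 1, 4, 2, 3, 3, 2]})

labels = {1: "12th or below", 2: "Graduate", 3: "Post-Graduate", 4: "Any other"}

# Frequency table: counts and percentages per category
counts = data["educ"].map(labels).value_counts()
percent = (counts / len(data) * 100).round(1)
print(pd.DataFrame({"Count": counts, "Percent": percent}))
```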

Slide 4

Data Input Format

Most of the above-mentioned packages have a format similar to spreadsheet packages for data entry. Readers familiar with any spreadsheet package like Excel can easily handle the data entry (input) part of these statistical packages.
The input follows a matrix format, where the variable name/number appears on the column heading and data for one person (respondent or record, also called a case in statistical terminology) is entered in one row. For example, the data for respondent no. 1 is entered in row 1. The answer given by respondent no. 1 to Question 1 is entered in Row 1 and Column 1. The answer given by respondent no. 1 to Question 2 is entered in Row 1 and Column 2. The input matrix looks like the following:

                 Var 1   Var 2   Var 3   ...   Var k
Respondent 1       x       x       x     ...     x
Respondent 2       x       x       x     ...     x
Respondent 3       x       x       x     ...     x
...
Respondent n       x       x       x     ...     x

Here, n would be the sample size of the marketing research study, consisting of k variables. Sometimes, each question on a questionnaire can generate more than one variable.
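A minimal sketch of the same row-per-respondent, column-per-variable layout, assuming a small invented data set entered in Python/pandas rather than in one of the packages named above:

```python
import pandas as pd

# Each row is one respondent (case); each column is one variable.
# Variable names and values are invented for illustration only.
data = pd.DataFrame(
    {
        "educ":   [1, 2, 3, 2],   # coded education level
        "income": [2, 4, 3, 1],   # coded income group
        "rating": [5, 6, 4, 7],   # 7-point scale rating
    },
    index=[1, 2, 3, 4],           # respondent / case numbers
)
print(data)   # n rows (sample size) x k columns (variables)
```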

Slide 5 Coding One limitation of doing analysis on the computer with these statistical packages is that all data must be converted into numerical form. Otherwise, it cannot be counted or manipulated for analysis. So, all data must be coded and converted to numbers, if it is non-numerical. We saw one example of coding in the previous section, where we gave numerical codes of 1, 2, 3 and 4 to the education level of the respondent.

Similarly, any non-numerical data can be converted into numbers. Usually, all nominal scale variables (categorical variables) need to be coded and entered into the packages.
An important aspect of coding is to remember which code stands for what. Most software packages have a facility called definition of Value Labels for each variable, which should be used to define the codes for every value of a variable. This is illustrated in a section labelled "value labels" a little later.

Slide 6

Variable
Usually, a question on the questionnaire represents a Variable in the package. This is not always the case, because sometimes we may create more than one variable out of the answers to a question. For example, it could be a ranking question which requires respondents to rank 5 brands on a scale of 1 to 5. We may define "Ranking given to Brand X" as variable 10, and the ranks given to it could be any number from 1 to 5. Similarly, "Ranking of Brand Y" could be defined as variable 11, and again, the responses could be from 1 to 5. Therefore, we may end up with 5 variables from that single ranking question on the questionnaire. It all depends on what we want the output to look like, and how we want to analyse it. One very useful provision that all the packages have is the variable name. For instance, if the particular question (variable) represents the respondent's Income, then the Variable Name can be INCOME on the column representing this variable.
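The sketch below illustrates the point about one question producing several variables: a single ranking question over five brands becomes five columns in the data matrix. The brand column names and the ranks are hypothetical, made up for the example.

```python
import pandas as pd

# One ranking question ("Rank these 5 brands from 1 to 5") becomes
# five variables, one column per brand, holding the rank each
# respondent gave that brand.
ranks = pd.DataFrame(
    {
        "RANK_X": [1, 2, 1],
        "RANK_Y": [2, 1, 3],
        "RANK_Z": [3, 4, 2],
        "RANK_P": [4, 3, 5],
        "RANK_Q": [5, 5, 4],
    },
    index=[1, 2, 3],  # respondent numbers
)
print(ranks)
```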

Slide 7

Variable Label and Format

There is a provision to give a longer name to each variable if required (usually called a Variable Label) in each one of the packages.
There is also a provision by which the user can define in these packages the type of variable (numeric or non-numeric), and the number of digits it will have. A non-numeric variable can be defined, but no mathematical calculations can be performed with it. For a numerical variable, you can also define the number of decimal points (if applicable).

SPSS Commands for Defining Variable Labels
In SPSS, you can double click on the column heading of the Variable and fill out the Variable Name, format, etc. in the dialog box/table which opens up. In SPSS version 10.1, a table opens up where the Variable Name is filled in the first column, the Label in another column, and so on. In older versions of SPSS, a dialogue box opens when you double click on a variable (column heading) in the data file, and you have to fill in the relevant Variable Label, format, etc. in the dialogue box.

Slide 8

Value Labels/Codes

Sometimes, the different values taken by the variable are continuous numbers. But sometimes, they are categories. For example, income categories could be:
Below Rs. 5,000 per month
Rs. 5,001 to 10,000 per month
Rs. 10,001 to 20,000 per month
More than Rs. 20,000 per month
Each of these could be given numerical codes such as 1, 2, 3 or 4. To save these codes along with their meanings (labels) in the computer, we have to use a feature called Value Labels. We can use the feature and label 1 as "Below Rs. 5,000 p.m.", 2 as "Rs. 5,001 to 10,000 p.m.", 3 as "Rs. 10,001 to 20,000 p.m.", and 4 as "More than Rs. 20,000 p.m.". The words used in quotes are called Value Labels, and can be defined for each variable separately. For each categorical variable that we have allotted codes to, we also need to record the codes along with the Variable Name and Question Number in a separate coding sheet, for our records. Definition of Value Labels simplifies interpretation of the output. The value labels are generally printed along with the codes when a table involving the given variable (for example, income) is printed.
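For readers working outside SPSS, a rough Python equivalent of value labels is a categorical type whose categories carry the label text. The income codes below are invented for illustration; only the label wording follows the slide.

```python
import pandas as pd

# Numerical codes as entered for the income question (1 to 4),
# plus the value labels that give each code its meaning.
codes = pd.Series([1, 3, 2, 4, 2, 1, 3])
value_labels = [
    "Below Rs. 5,000 p.m.",
    "Rs. 5,001 to 10,000 p.m.",
    "Rs. 10,001 to 20,000 p.m.",
    "More than Rs. 20,000 p.m.",
]

# Attach the labels (pandas categories play the role of SPSS value labels);
# from_codes() expects 0-based codes, hence the "- 1".
income = pd.Categorical.from_codes(codes - 1, categories=value_labels, ordered=True)
print(pd.Series(income).value_counts(sort=False))  # labels appear in the output table
```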

Slide 8 contd...

SPSS commands for Defining Value Labels


In SPSS, the same procedure described earlier for defining a Variable Label also gives the opportunity to define Value Labels. That is, double click on the column heading of a variable. In the table or dialog box which opens up, go to the relevant space for Value Labels, and define a label for each value of the variable, one after another. In SPSS 10.1, a table opens up when you double click. You then have to go to a column labelled VALUES, select the cell in the relevant row, and click to open a Value Labels dialogue box. In the Value Labels dialogue box, type the value labels: for example, 1 as the value and "Below Rs. 5,000" as the label, then click ADD; then 2 as the value, followed by "Rs. 5,001 to 10,000" as its label, and so on. Do this for all value labels for the variable. Repeat the process for other variables where value labels have to be defined.

Slide 9

Record Number / Case Number

Every row is called a case or record, and represents data for one respondent. In rare cases, the respondent may occupy two rows, if the number of variables is too large to be accommodated in one row. We may not encounter such cases in our examples, but these are sometimes encountered in commercial applications of Marketing Research. The manual for the package being used (SPSS, SAS, SYSTAT, etc.) can be referred to for an explanation of how to use two or more rows for representing a single case (respondent). If a respondent is represented by one row, the row number and the serial number of the respondent usually become identical. In other words, the number of rows will add up to the sample size. If a survey had 100 respondents, 100 rows of data would be entered into the data input matrix.

Slide 10

Missing Data

Frequently, respondents do not answer all the questions asked. This leaves some blanks on the questionnaire. There are two approaches for handling this problem.

Pairwise Deletion: The computer can be asked to use pairwise deletion, which means that if one respondent's data is missing for one question, then the package simply treats the sample size as one less than the given number of respondents for that question alone, and computes the information asked for. All other questions are treated as usual.

Listwise Deletion: This instruction to the computer results in the entire row of data being deleted, even if there is only one missing (blank) piece of data in the questionnaire. This may result in a large reduction in sample size, if there is a lot of missing data on different questions.
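A minimal pandas sketch of the two approaches, assuming a small invented data set with some blanks. Pairwise treatment simply computes each statistic on whatever values are present in that column; listwise deletion drops every row containing any blank.

```python
import numpy as np
import pandas as pd

# Invented data with missing answers (NaN = question left blank)
df = pd.DataFrame({
    "q1": [4, 5, np.nan, 3, 4],
    "q2": [2, np.nan, 3, 4, 5],
    "q3": [1, 2, 2, np.nan, 3],
})

# Pairwise: each column is summarised on its own non-missing values
print(df.count())        # effective sample size per question
print(df.mean())         # means computed question by question

# Listwise: drop any respondent with at least one blank, then summarise
complete = df.dropna()
print(len(complete))     # sample size after listwise deletion
print(complete.mean())
```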

Slide 11

Statistical Analysis

We have so far discussed general data processing applications of statistical packages. But these packages are capable of performing a large number of statistical tests, like the chi-squared test, the t test and the F test. They can also be used to perform analyses such as Correlation and Regression Analysis, ANOVA or Analysis of Variance, Factor Analysis, Cluster Analysis, Discriminant Analysis, Multidimensional Scaling, Conjoint Analysis and many other advanced statistical analyses. The packages we have mentioned (SPSS, SAS, SYSTAT) generally perform most of these analyses. In addition, the statistical packages also have varying capabilities for drawing graphs.
Some of the packages require a large amount of computer memory to operate some of the advanced multivariate statistical techniques, particularly if the data size is large.

Slide 11 contd... Most of the important statistical analysis techniques typically used by a marketing researcher are described in detail in later chapters. The exact commands used will vary depending on which statistical package is used by the reader. But in most of the current packages, a pull-down menu is used and an online Help feature is available, so a user can easily perform most of these analyses if he is even slightly familiar with the Windows operating system and with general data entry into packages like Excel. For details, the manual for whichever package is being used should be consulted. The chapters which follow guide even inexperienced users with a detailed example of how to use each major statistical technique. A description of a problem is accompanied by the input data, and the exact output of the computer for the analysis being described. It is desirable for the user to have access to one of the statistical packages which can perform these analyses, but it is possible to understand the essence of these methods even without access to a computer package.

Slide 12

Hypothesis Testing and Probability Values (p values)

In manual forms of hypothesis testing, we generally compute the value of a statistic (the z, the t, or the F statistic, for example), and compare it with a table value of the same statistic for the given constraints (sample size, degrees of freedom, etc.). But in the computer output for any analysis involving a statistical test, a more convenient way is to interpret the p-value printed for the particular test. For example, if we are conducting a hypothesis test, we only need to decide on the statistical confidence level for the test before the computerised analysis. Suppose we decide that we want a confidence level of 95 percent for the test (assume it is a t test). Suppose now that the computer output shows the p-value as 0.067 for the t test we requested. This value being more than 0.05 (the significance level corresponding to a confidence level of 95 percent), the null hypothesis cannot be rejected. If the p-value had been less than 0.05, we would have rejected the null hypothesis.
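The decision rule the slide describes reduces to a one-line comparison. The sketch below, with the p-value of 0.067 used in the slide, is only an illustration of that rule, not output from any package.

```python
alpha = 0.05     # significance level for a 95 percent confidence level
p_value = 0.067  # p-value reported by the package for the t test (assumed)

if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Cannot reject the null hypothesis")  # the case here, since 0.067 > 0.05
```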

Slide 12 contd... But what is a null hypothesis? In general, a null hypothesis is the opposite of the statistical relationship between variables that we expect to establish. In other words, if we want to check whether variables x and y are related to each other, the null hypothesis would be that there is no significant relationship between x and y. This method of testing a hypothesis is very simple to understand and use when the computer does the testing. This is the approach we will use throughout this book.

Slide 13

Approaches to Analysis

Analysis of data is the process by which data is converted into useful information. Raw data as collected from questionnaires cannot be used unless it is processed in some way to make it amenable to drawing conclusions. Various techniques of data analysis are available, and it is sometimes difficult to choose the one that will be most appropriate for the research problem on hand. The types of analysis to be done and the format of output desired should be planned at the time of designing the questionnaire. This is true particularly when special kinds of analysis are needed, requiring specific forms or scales of data.

Three Types of Analysis

Broadly, we can classify analysis into three types:
1. Univariate, involving a single variable at a time,
2. Bivariate, involving two variables at a time, and
3. Multivariate, involving three or more variables simultaneously.

Slide 14 The choice of which of the above types of data analysis to use depends on at least three factors - 1) the scale of measurement of the data, 2) the research design, and 3) assumptions about the test statistic being used, if one is used. We will briefly discuss these factors and their implications with some illustrations. Scale of Data: If the variables being measured are nominally scaled or ordinally scaled, there are severe limitations on the usage of parametric multivariate statistics. Mostly, univariate or bivariate analysis can be used on nominal/ordinal data. For example, a ranking of 5 brands of audio systems by a sample of consumers may produce ordinal scale data consisting of these ranks. We cannot compute an average rank for each brand, because averages are not meaningful for ordinal level data. But univariate analysis can be done to make statements such as 70 percent of the sample ranked Brand A (say, Aiwa) as no.1, or 20 percent of the sample ranked Brand B (say, Philips), as no.1. Similarly, numbers and percentages can be calculated for ranks 2, 3, 4 and 5.

Slide 15

We can also do some types of bivariate analysis such as a chi-squared test of association between, say, the brand ranked as no. 1 and, say, the income group to which the respondent belongs (a nominal variable). This would tell us if a significant association exists between these variables. The chi-squared test is explained in the next chapter. The crosstab in this case may look as follows:

Brand Ranked No. 1   Income Grp. 1   Income Grp. 2   Income Grp. 3   Income Grp. 4
Brand A                    x               x               x               x
Brand B                    x               x               x               x
Brand C                    x               x               x               x
Brand D                    x               x               x               x
Brand E                    x               x               x               x

The x values in the above table represent the number of respondents in each cell.
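As a rough illustration of the kind of chi-squared test of association described above, the sketch below runs scipy's chi2_contingency on an invented brand-by-income crosstab; the counts are made up for the example and do not come from the text.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Invented crosstab: rows = brand ranked no. 1 (A..E), columns = income groups 1..4
observed = np.array([
    [20, 15, 10,  5],   # Brand A
    [10, 25, 15, 10],   # Brand B
    [ 5, 10, 20, 15],   # Brand C
    [15, 10, 10,  5],   # Brand D
    [10,  5,  5, 15],   # Brand E
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, dof, p_value)
# If p_value < 0.05, we conclude a significant association between
# the brand ranked no. 1 and the respondent's income group.
```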

Nominal and ordinal scale data are also called non-metric data, and generally various non-parametric tests are used on non-metric data. Interval scaled or ratio scaled data are also called metric data, and many more statistical techniques, including univariate, bivariate and multivariate, can be used for their analysis.

Slide 16

Research Design

The second determinant of the analysis technique is the Research Design. For example, whether one sample is taken or two, and whether one set of measurements is independent of the other or dependent on it, determine the analysis technique. Let us consider an example of Attitude towards a Brand, measured from Buyers and Non-buyers of the brand. These two are independent samples, and a t test for independent samples can be used to test whether the mean attitude differs between buyers and non-buyers, if the attitude is measured with an interval scale. As an example of dependent samples, assume that a group of respondents is given a new product to try. Before and after the trial, their opinion about the product is measured, using an interval scale. This is a set of dependent samples, and a different type of t test, called the paired difference t test, is used in this case to find out if there is a significant difference in their opinion before and after the trial.

Slide 17

Assumptions About the Test Statistic or Technique


The third factor affecting the choice of analytical technique is the set of assumptions made while using a particular test statistic.

For example, the independent samples 't' test assumes that the two populations from which the samples are drawn are independent.
In addition, it assumes that the populations are normally distributed and that they have equal variances. When these assumptions are violated, the test's efficacy is reduced or, sometimes, totally lost. Another type of assumption is related to the scale of the variable. For example, the chi-squared test assumes the data are nominally scaled simple counts, whereas techniques such as factor analysis and cluster analysis assume the data to be interval scaled.
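The assumptions behind the independent samples 't' test can be screened informally before running it. The sketch below uses scipy's Shapiro-Wilk and Levene tests on two invented rating samples; it is a rough check under assumed data, not a procedure prescribed by the text.

```python
from scipy.stats import shapiro, levene

# Invented interval-scale ratings from two independent samples
group_a = [4, 5, 3, 4, 4, 5, 3, 4, 5, 4]
group_b = [5, 6, 4, 5, 6, 5, 4, 6, 5, 5]

# Normality within each group (null: the sample comes from a normal population)
print(shapiro(group_a).pvalue, shapiro(group_b).pvalue)

# Equality of variances across groups (null: the population variances are equal)
print(levene(group_a, group_b).pvalue)
# Small p-values signal that the corresponding t-test assumption may be violated.
```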

Slide 18 Fig. 1 lists out the various options available to the analyst who wants to do univariate or bivariate analysis.

UNIVARIATE TECHNIQUES

Non-parametric Statistics
  One Sample: chi-square, Kolmogorov-Smirnov, Runs
  Two or more samples:
    Independent: chi-square, Rank Sums, Kolmogorov-Smirnov
    Dependent: Sign, Wilcoxon, McNemar, Cochran Q

Parametric Statistics
  One Sample: 't' test, Z test
  Two or more samples:
    Independent: 't' test, Z test, ANOVA
    Dependent: Paired sample 't' test

Slide 19

Fig. 2 lists out a roadmap for selecting appropriate multivariate analysis techniques.

Fig. 2: MULTIVARIATE TECHNIQUES

Dependence Techniques
  One Dependent Variable: ANOVA, Multiple Regression, Discriminant Analysis, Conjoint Analysis
  Multiple Dependent Variables: MANOVA, Canonical Correlation

Interdependence Techniques
  Focus on Variables: Factor Analysis
  Focus on Objects: Cluster Analysis, Multidimensional Scaling

Slide 20

The next chapter describes how simple tabulation and crosstabulation of data can be done. These two are the most widely used analysis techniques in survey research. A detailed coverage of the non-parametric techniques mentioned on the left side of Fig. 1 is beyond the scope of this book. Out of these non-parametric tests, we will discuss only the chi-squared test for crosstabulations in the next chapter, because it is the most widely used in practice. For the univariate and bivariate analysis of metric data (interval scale or ratio scale), we use 't' tests of different types, or the Z test. We will illustrate the use of two types of 't' tests, which are shown in the right half of Fig. 1. These are:
1. The independent sample 't' test, and
2. The paired sample 't' test.
These two are the tests a marketing researcher is most likely to encounter. The major focus of this book will be on simple tabulations and crosstabulations for univariate and bivariate analysis (used mainly for non-metric data), and a variety of multivariate analysis techniques for special applications (using primarily metric data, with a few exceptions).

Slide 21

Hypothesis for the t-Test


Before we illustrate the use of the independent sample 't' test and the paired sample 't' test, we will again discuss the concept of hypothesis testing, in the context of the 't' test.

Suppose, as marketers of a brand of jeans, we wanted to find out whether a set of customers in Delhi and a set of customers in Mumbai thought of our brand in the same way or not. Suppose we conducted a small survey in both cities and got Ratings on an interval scale (assume it was a seven point scale with ratings 1 to 7) from our customers.
We now want to do a statistical test to find out if the two sets of Ratings are "significantly different" from each other or not. We have to set a level of "statistical significance" and select a suitable test. We also need to specify a null hypothesis. The 'null hypothesis' is the statement which the statistical test is used to prove or to disprove (reject). In the above example, the null hypothesis for the 't' test would be: "There is no significant difference in the ratings given by customers in Mumbai and Delhi". In other words, the null hypothesis states that the mean (average) rating from these two places is the same.

Slide 22

Now, we have to set a significance level for the test. This represents the chance that we may be making a mistake of a certain type. It can also be computed as (100 minus the confidence level desired in the test), divided by 100. For example, if we desire that the confidence level for the test should be 95 percent, then (100 - 95)/100, or .05, becomes the significance level. We can think of it as a .05 probability that we are making a certain type of error (called Type I error) in our decision-making process. Type I error is the error of (wrongly) rejecting the null hypothesis when it is in fact true.
Commonly used significance levels in marketing research are .05 (corresponding to a confidence level of 95 percent) and .10 (corresponding to a confidence level of 90 percent). But there is no hard and fast rule, and the significance level can be set at a different level if necessary. Let us assume that we take the conventional value of .05 for our hypothesis test. Now, a suitable test for the problem discussed above has to be found. In this case, from Fig. 1, we know that the independent sample 't' test is required. What do we expect to achieve from this test? We will either reject the null hypothesis (that is, conclude that the Delhi and Mumbai ratings are significantly different), or fail to reject it (conclude that there is no evidence of a difference between the Delhi and Mumbai ratings).

Slide 23

The independent sample 't' test

Let us proceed with the same example and set up an independent sample 't' test as discussed above, at a significance level of .05. Table 1 presents the input data (assumed) for the test. This assumes that 15 customers of our brand each in Mumbai and Delhi were asked to rate our brand on a 7-point scale. The responses of all the 30 customers are in the column labelled 'Ratings' in the table. The column labelled 'City' indicates the city from which the ratings came, with a code of 1 for Mumbai and 2 for Delhi.

Table 1: Input Data for Independent Sample 't' test (Serial Nos. 1 to 10)

SERIAL No.   RATINGS   CITY
     1           2       1
     2           3       1
     3           3       1
     4           4       1
     5           5       1
     6           4       1
     7           4       1
     8           5       1
     9           3       1
    10           4       1

Slide 23 contd...

Table 1 (contd.): Serial Nos. 11 to 30

SERIAL No.   RATINGS   CITY
    11           5       1
    12           4       1
    13           3       1
    14           3       1
    15           4       1
    16           3       2
    17           4       2
    18           5       2
    19           6       2
    20           5       2
    21           5       2
    22           5       2
    23           4       2
    24           3       2
    25           3       2
    26           5       2
    27           6       2
    28           6       2
    29           6       2
    30           5       2

Slide 24

Table 2 presents the output from the independent sample 't' test performed on the above data. The decision rule for the test (for any computerised output which gives a 'p' value for the test) at the .05 significance level is this: if the 'p' value is less than the significance level set up by us for the test, we reject the null hypothesis. Otherwise, we accept the null hypothesis. In this case, we find that the 'p' value for the 't' test is .011, assuming unequal variances in the two populations. This value of .011 being less than our significance level of .05, we reject the null hypothesis and conclude that the Ratings of Mumbai and Delhi are different. If the 'p' value had been larger than .05, we would have accepted the null hypothesis that there was no difference between the two ratings.

Table 2: t test for independent samples of CITY (t test for Equality of Means)

Variances    t-value     df      p-value
Unequal       -2.75     26.76     0.011
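For readers without SPSS, a rough Python equivalent of this unequal-variances (Welch) 't' test on the Table 1 data is sketched below using scipy. It should reproduce a t-value of about -2.75 and a p-value of about 0.011, though the exact figures in the text come from the package used there.

```python
from scipy.stats import ttest_ind

# Ratings from Table 1, split by city (1 = Mumbai, 2 = Delhi)
mumbai = [2, 3, 3, 4, 5, 4, 4, 5, 3, 4, 5, 4, 3, 3, 4]
delhi  = [3, 4, 5, 6, 5, 5, 5, 4, 3, 3, 5, 6, 6, 6, 5]

# equal_var=False gives the unequal-variances (Welch) form of the test
result = ttest_ind(mumbai, delhi, equal_var=False)
print(result.statistic, result.pvalue)  # roughly -2.75 and 0.011

if result.pvalue < 0.05:
    print("Reject the null hypothesis: the Mumbai and Delhi ratings differ")
```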

Slide 25

Manual Versus Computer-based Hypothesis Testing

Please note that conventional hypothesis testing would have required us to do a manual computation of the t value from the data, compare it with a value from the 't' tables, and arrive at the same kind of conclusion that we did. The advantage of using the computer is that the test is performed by the package automatically, and we get the 'p' value for the test in the computer output. All that we need to do is to compare the p-value from the computer output with our significance level (usually .05), and reject the null hypothesis when the computer gives us a value less than the one set by us (less than .05 if we have set it at .05). We are going to use this approach (computerised testing) throughout this book for all the tests and analytical procedures. This removes the need for tedious manual calculations, and leaves the student free to do managerial jobs like interpreting computer outputs rather than wasting time on manual computation. This is the modern approach, because managers can increasingly delegate mundane tasks to the computer, and add more value to their own jobs by concentrating on the design and interpretation of Marketing Research studies.

Slide 26

Paired Sample 't' test

In some cases, we may not have independent samples; instead, the same sample could be used for a research study involving two measurements. For instance, we may measure somebody's attitude towards a brand before it is advertised, and after it is advertised, to try and find out if the attitude has changed due to the ad campaign. In such cases, a paired sample 't' test is the appropriate statistical test. We will illustrate using the example mentioned above. Assume that we used a sample of 18 respondents whom we asked to rate, on a 10-point interval scale, their attitude towards, say, the Tamarind brand of garments, before and after an ad campaign was released for this brand. A rating of 1 represents "Brand is Highly Disliked" and a rating of 10 represents "Brand is Highly Liked", with other ratings having appropriate meanings. The assumed data are in Table 3. The first column contains ratings given by respondents before they saw the ad campaign, and the second column contains their ratings after they saw the ad campaign.

Slide 27

Table 3: Input Data for Paired Sample 't' test

SERIAL No.   BEFORE   AFTER
     1          3       5
     2          4       6
     3          2       6
     4          5       7
     5          3       8
     6          4       4
     7          5       6
     8          3       7
     9          4       5
    10          2       4
    11          2       6
    12          4       7
    13          1       4
    14          3       6
    15          6       8
    16          3       4
    17          2       5
    18          3       6

Slide 28

Table 4 contains the resultant computer output for a paired sample 't' test. Assume that we had set the significance level at .05, and that the null hypothesis is that "there is no difference in the ratings given by respondents before and after they saw the ad campaign".

Table 4: t test for paired samples

Variable                               Mean     Std. Deviation
AFTER  (Ratings after Ad Campaign)    5.7778        1.309
BEFORE (Ratings before Ad Campaign)   3.2778        1.274

Paired Differences
Mean Difference   Std. Deviation   t-value   df   2-tailed significance
     2.5000            1.295         8.19    17          0.000

The output table shows that the 2- tailed significance of the test is .000, from the last column. This is the 'p' value, and it is less than the level of .05 we had set. Therefore, as per our decision rule specified in the earlier example, we have to reject the null hypothesis at a significance level of .05, and conclude that there is a significant difference in the ratings given by respondents Before and After their exposure to the ad campaign. The mean rating after the ad campaign is 5.7778 and before the campaign, it is 3.2778, and the difference of 2.5 is statistically significant.
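A rough scipy equivalent of this paired test on the Table 3 data is sketched below. It should give a t-value of about 8.19 with a very small p-value, matching the output described above, though the layout of the SPSS table in the text is different.

```python
from scipy.stats import ttest_rel

# Ratings from Table 3: the same 18 respondents before and after the ad campaign
before = [3, 4, 2, 5, 3, 4, 5, 3, 4, 2, 2, 4, 1, 3, 6, 3, 2, 3]
after  = [5, 6, 6, 7, 8, 4, 6, 7, 5, 4, 6, 7, 4, 6, 8, 4, 5, 6]

result = ttest_rel(after, before)       # paired (dependent) samples t test
print(result.statistic, result.pvalue)  # roughly 8.19 and a p-value near 0

if result.pvalue < 0.05:
    print("Reject the null hypothesis: the ratings changed after the campaign")
```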

Slide 29

Large Sample Sizes

If we have a sample size larger than 30 for the independent sample 't' test, we can use the 'Z' test instead of the 't' test. The statement of the null hypothesis, etc. will remain the same in the case of a Z test also.

Proportions

Even though we have tested for differences in mean values of variables in this section, we could also test in the same way for differences in proportions. The procedure is the same, and a Z test or a 't' test is used, depending on whether the sample size is more than or less than 30.
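As an illustration of testing a difference in proportions, the sketch below uses the proportions_ztest function from the statsmodels package on invented counts (say, the proportion of respondents preferring our brand in two cities). The figures are hypothetical and not from the text.

```python
from statsmodels.stats.proportion import proportions_ztest

# Invented example: 60 of 150 respondents in City 1 and 45 of 160 in City 2
# said they prefer our brand. Test whether the two proportions differ.
successes = [60, 45]
sample_sizes = [150, 160]

z_stat, p_value = proportions_ztest(successes, sample_sizes)
print(z_stat, p_value)

if p_value < 0.05:
    print("Reject the null hypothesis: the two proportions differ significantly")
```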
