
5 Association between Categorical Variables

CONTINGENCY TABLES
MARGINAL AND CONDITIONAL DISTRIBUTIONS
SEGMENTED BAR CHARTS AND MOSAIC PLOTS
LURKING VARIABLES AND SIMPSON'S PARADOX
CHI-SQUARE: A MEASURE OF ASSOCIATION
CRAMER'S V: INTERPRETING THE STRENGTH OF ASSOCIATION
CHECKLIST: CHI-SQUARE AND CRAMER'S V
SUMMARY

2/7/2008

5 Categorical Association

Amazon spends millions of dollars on advertising. That sounds like a lot until you think about the size of the Internet. Busy portals like Google and Yahoo charge plenty for the privilege of advertising on their pages. Which locations deliver buyers? The answer comes from understanding the variation in the categorical variable that indicates whether a visitor makes a purchase. Some buy, others only browse. Suppose that everyone from Yahoo is a buyer, but no one from Deal Time. Knowing the link that attracted the shopper explains variation in behavior and reveals the better location for ads.

Let's focus on the choices faced by an advertising manager at Amazon. She has a budget to allocate among three busy hosts: msn.com, recipesource.com, and yahoo.com. Together, these three delivered 17,619 visits to Amazon during the fall of 2002.
Host                Visits
msn.com              7,258
recipesource.com     4,283
yahoo.com            6,078
Total               17,619

Table 5-1. Frequency table of the categorical variable that identifies shoppers from three hosts.

MSN generates the most visits, but more visits do not automatically mean more sales. More visits do translate into higher costs, however. Amazon pays a fee for every visit, whether the shopper buys anything or not. Hosts that generate many visits but few sales are costly to Amazon. Should Amazon pay some hosts more than others for each shopper sent to amazon.com?


Contingency Tables
To discover whether some hosts are better than others, we have to consider a second categorical variable, one that identifies the visits that result in a sale. Here are the bar charts.

[Bar charts: Host; Purchase]

Figure 5-1. Bar charts of the hosts and purchase actions.

The categorical variable Host identifies the originating site, the categorical variable summarized in Table 5-1. Purchase indicates whether the session produced a sale. There's precious little variation in Purchase. Only 516 visits, less than 3%, result in a purchase. If every sale came from one host, Amazon would know where to place its ads.

The bar charts in Figure 5-1 summarize each categorical variable separately, but we need to consider them simultaneously. For instance, we need to separate visitors from MSN into those who made a purchase and those who did not. The most common arrangement of such counts organizes them in a table. The rows of the table identify the levels of one variable, and the columns of the table identify the levels of the other. Such a table is called a contingency table. This contingency table shows the variable Purchase (along the rows) and the variable Host (along the columns).
contingency table: A table that shows counts of the cases of one categorical variable contingent on the value of another.
                          Host
Purchase    msn.com   recipesource.com   yahoo.com   Total
No             6973               4282        5848   17103
Yes             285                  1         230     516
Total          7258               4283        6078   17619

Table 5-2. Contingency table of web shopping.

mutually exclusive: The conditions that define the cells in a contingency table allow a case to appear in only one cell. There's no double-counting of cases.

The cells of this contingency table count the visits for every combination of Host and Purchase. The cells of the contingency table are mutually exclusive; each case appears in exactly one cell. For example, the column labeled msn.com shows that 285 of the 7,258 visits from MSN generated a purchase. Of those from recipesource.com, only 1 of the 4,283 visits led to a purchase.


Marginal and Conditional Distributions


The margins of Table 5-2 (shown in gray) give the total counts in each row and column. Because the cells of the table are mutually exclusive, the sum of the counts in the cells of the first column equals the total number of visits from msn.com. The sum for each column appears in the bottom margin of the contingency table; these sums match the frequency distribution of Host shown in Table 5-1. The right margin shows the frequency table of Purchase. Because these counts are typically placed along the margins of a contingency table, the frequency distributions of the variables in the table are also called marginal distributions. The bar charts in Figure 5-1 show these distributions.

Percentages help us interpret a contingency table, but we've got to choose which percentage to show. For example, 285 shoppers from MSN made a purchase. To show this count as a percentage, we have three choices. The count 285 is

    1.62% of all 17,619 visits,
    3.93% of the 7,258 visits from MSN,
    55.23% of the 516 visits that made a purchase.

All are potentially interesting. Some statistics packages embellish a contingency table with every percentage, like this:

marginal distribution: The frequency distribution of a variable in a contingency table given by counts of the total number of cases in rows (or columns).

                                 Host
Purchase             msn.com   recipesource.com   yahoo.com   Total
No      Count           6973               4282        5848   17103
        Total %        39.58              24.30       33.19   97.07
        Col %          96.07              99.98       96.22
        Row %          40.77              25.04       34.19
Yes     Count            285                  1         230     516
        Total %         1.62               0.01        1.31    2.93
        Col %           3.93               0.02        3.78
        Row %          55.23               0.19       44.57
Total   Count           7258               4283        6078   17619
        Total %        41.19              24.31       34.50

Table 5-3. Too many percentages clutter this contingency table.

Tables like this one give percentages a bad reputation. The table shows too many percentages. Each cell lists the count along with percentages of the total, the column, and the row. While it's fine to consider all of these, it's better to choose the percentage that answers the relevant question.
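The three percentages attached to a single cell differ only in the denominator. The following minimal sketch recomputes them for the 285 purchases from msn.com using the counts in Table 5-2; the variable and function names are illustrative choices, not part of the original text.

```python
# Counts from Table 5-2: rows are Purchase (No, Yes), columns are Host.
hosts = ["msn.com", "recipesource.com", "yahoo.com"]
counts = {
    "No": [6973, 4282, 5848],
    "Yes": [285, 1, 230],
}

col_totals = [sum(col) for col in zip(*counts.values())]        # visits per host
row_totals = {row: sum(vals) for row, vals in counts.items()}   # visits per outcome
n = sum(col_totals)                                             # all visits


def cell_percentages(row, col_index):
    """Return (total %, column %, row %) for one cell of the table."""
    count = counts[row][col_index]
    return (
        100 * count / n,
        100 * count / col_totals[col_index],
        100 * count / row_totals[row],
    )


total_pct, col_pct, row_pct = cell_percentages("Yes", hosts.index("msn.com"))
print(f"{total_pct:.2f}% of all visits")
print(f"{col_pct:.2f}% of visits from msn.com")
print(f"{row_pct:.2f}% of visits that made a purchase")
```

The same count gives three different answers because each percentage conditions on a different set of visits.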
           msn.com
No         6973   96.07%
Yes         285    3.93%
Total      7258

Because the account manager at Amazon is interested in which host produces the highest proportion of purchasers, a better table shows only the counts and column percentages. Let's start with msn.com. For the moment, we're interested in only the 7,258 visits from msn.com in the first column of Table 5-2. The distribution of a variable that is restricted to cases satisfying a condition is called a conditional distribution. In a table, a conditional distribution refers to counts within a row or column.

conditional distribution: The distribution of a variable restricted to cases that satisfy a condition, such as those in a row or column of a contingency table.


By limiting our attention to visits from msn.com, we see the conditional distribution of Purchase conditional on the host being MSN. The following contingency table shows the counts and the column percentages. The percentages within each column show the conditional distribution of Purchase for each host.
                                 Host
Purchase             msn.com   recipesource.com   yahoo.com    Total
No      Count           6973               4282        5848    17103
        Col %         96.07%             99.98%      96.22%   97.07%
Yes     Count            285                  1         230      516
        Col %          3.93%              0.02%       3.78%    2.93%
Total                   7258               4283        6078    17619

Table 5-4. Contingency table with relevant percentages.

Compare this table to Table 5-3. Without the distraction of extraneous percentages, we can quickly see that visitors from MSN and Yahoo yield similar shares of purchases (3.93% and 3.78%, respectively). In comparison, only one visitor from RecipeSource bought anything (0.02%).

We've just discovered that Host and Purchase are associated. Categorical variables are associated if the column percentages vary from column to column (or if row percentages vary from row to row). In this case, the proportion of visits that produce a purchase differs among hosts. The association between Host and Purchase means that knowing the host changes your impression of the chance of a purchase.

Variables can be associated to different degrees. The least association occurs when the column percentages are identical. Overall, 516/17619 = 2.93% of the visits made a purchase. If 2.93% of visits from every host made a purchase, then the chance of a purchase would not depend on the host. Each conditional distribution of Purchase given Host would match the marginal distribution of Purchase. That's not what happens: Host and Purchase are associated. Visitors from some hosts are more likely to make a purchase.

Because Host and Purchase are associated, the account manager at Amazon might be willing to pay more for visits from MSN or Yahoo and less for those from RecipeSource. The value of the visit depends on the host.
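The comparison between conditional and marginal distributions can be sketched in a few lines. This example uses the counts from Table 5-2 and compares each host's purchase rate with the overall rate; the dictionary names are illustrative assumptions.

```python
# Purchases and total visits per host, taken from Table 5-2.
purchases = {"msn.com": 285, "recipesource.com": 1, "yahoo.com": 230}
visits = {"msn.com": 7258, "recipesource.com": 4283, "yahoo.com": 6078}

# Marginal purchase rate: ignores which host sent the visitor.
marginal_rate = sum(purchases.values()) / sum(visits.values())

# Conditional purchase rate: restricted to the visits from one host.
conditional_rates = {host: purchases[host] / visits[host] for host in visits}

for host, rate in conditional_rates.items():
    print(f"{host:18s} {100 * rate:5.2f}%  (overall {100 * marginal_rate:.2f}%)")
```

If Host and Purchase were not associated, every line would show the overall rate. The gap between RecipeSource and the other two hosts is the association described above.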

association: Two categorical variables are associated if the conditional distribution of one variable depends on the value of the other.

Segmented Bar Charts and Mosaic Plots


Bar charts of marginal distributions like those in Figure 5-1 don't reveal association, but other charts do. For example, Amazon locates warehouses near large concentrations of shoppers to reduce shipping costs. Being close makes it cheaper to offer free shipping. This contingency table shows the counts of Purchase by Location over a wider range of hosts.
                                   Location
Purchase           North Central   North East   South    West   Total
No      Count               5640         4450    8321    4645   23056
        Row %              24.46        19.30   36.09   20.15
Yes     Count                161          146     177     169     653
        Row %              24.66        22.36   27.11   25.88
Total                       5801         4596    8498    4814   23709

Table 5-5. Contingency table of purchases organized by region.

segmented bar chart: A bar chart that divides the bars into shares based on a second categorical variable.

Because we're interested in discovering where those who make a purchase live, this table shows row percentages. With four percentages in each conditional distribution, it becomes helpful to have a plot. A segmented bar chart divides bars in a bar chart proportionally into segments corresponding to the percentage in each group. If the bars look identical, then the variables are not associated.

Although the South sends the largest number of visitors, visitors from the South are more likely to browse rather than buy. About 36% of the browsers come from the South, but only 27% of the buyers. If Purchase and Region were not associated, then these percentages should be about the same. Because they differ, Region is associated with making a purchase. You can see the differences in this segmented bar chart. The yellow segment identifying visits from the South makes up a larger share among those who don't purchase (on the top) than among those who do make a purchase.

Figure 5-2. Segmented bar chart.

Be careful interpreting a segmented bar chart. This chart compares relative frequencies of two conditional distributions. Because these are relative frequencies rather than counts, the bars do not represent the same number of cases. The bar on the top summarizes 23,056 cases whereas the bar on the bottom summarizes 653 purchases. The chart obeys the area principle, but the area is proportional to the percentages within each row of the table. Segmented bar charts frequently appear in news items such as this one from the New York Times. [1]
[1] "Economists debate the quickest cure," The New York Times, January 19, 2008.


At the time, the government was debating the use of tax cuts to stimulate consumer spending and avoid a recession. The chart shows that the conditional distribution within the bars is changing, meaning there's association. In this case, we can see that households with smaller incomes are more likely to use tax rebates to pay down debt than households with higher incomes.

A mosaic plot is an alternative to the segmented bar chart. A mosaic plot shows tiles, colored rectangular regions, that represent the counts in each cell of a contingency table. The layout of the tiles matches the layout of the cells in a contingency table, and the sizes of the tiles are proportional to the counts in each cell. The tiles within a column have the same width, but possibly different heights. The widths of the columns are proportional to the marginal distribution of the variable positioned on the bottom of the table. For example, this figure shows the mosaic plot of the data in Table 5-5. The red tiles in the second row show the counts of purchases; their tiny heights remind you how rare it is to find a purchase among the visits.

mosaic plot: A tiled plot in which the size of each tile is proportional to the count in a cell of a contingency table.

Figure 5-3. Mosaic plot of the purchases by region.

Overall, the South contributes the most. These are the widest tiles in the plot. Because purchases are so rare, however, it's hard to see in the mosaic plot that the share of purchases is smaller for visitors from the South.

Mosaic plots are much more useful for seeing dependence in data for which the relative frequencies do not get so small. As an example, the following table shows counts of sales of shirts at a men's clothing retailer. Do Size and Style appear associated? If the two are not associated, managers should order the same proportion of sizes in every style. If the two are associated, the distribution of sizes varies from style to style.

                        Style
Size      Button Down   Polo   Small Print   Total
Small              18     27            36      81
Medium             65     82            28     175
Large             103     65            22     190
Total             186    174            86     446

Table 5-6. Sales of shirts at a men's clothing retailer.

It's hard to see the association quickly in this table of counts, but a mosaic plot makes the association very clear.

Figure 5-4. Mosaic plot of the shirt sales shows association between Size and Style.

The tiles would line up in the absence of association. In this example, the proportions of sizes vary across the styles, causing the tiles to vary in height. The irregular heights indicate that these variables are associated. Small sizes are much more prevalent among beach prints than the button-down shirts. Because the mosaic plot respects the area principle, we can also see that the button-down style is the biggest seller overall (these tiles are wider than the others) and the beach-print style is the smallest seller.
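The geometry of a mosaic plot follows directly from the counts. As a rough sketch under the description above (column widths from the marginal distribution of Style, tile heights from the conditional distribution of Size within each style), the tile dimensions for Table 5-6 can be computed as follows. The key "Small Print" is the column label in Table 5-6, which the text appears to call the beach-print style; the other names are illustrative.

```python
# Shirt sales from Table 5-6: one list per style, ordered Small, Medium, Large.
sizes = ["Small", "Medium", "Large"]
sales = {
    "Button Down": [18, 65, 103],
    "Polo": [27, 82, 65],
    "Small Print": [36, 28, 22],
}
n = sum(sum(column) for column in sales.values())  # all shirts sold

tiles = {}
for style, column in sales.items():
    column_total = sum(column)
    width = column_total / n                               # marginal share of the style
    heights = [count / column_total for count in column]   # conditional shares of Size
    tiles[style] = (width, heights)

for style, (width, heights) in tiles.items():
    shares = ", ".join(f"{s} {100 * h:.0f}%" for s, h in zip(sizes, heights))
    print(f"{style:12s} width {100 * width:.0f}%  heights: {shares}")
```

Because width times height equals the cell count divided by n, the area of each tile is proportional to its count. Heights that differ from style to style are what the plot shows as tiles that fail to line up.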

4M: Where's my car?

2002 Dodge Intrepid

2002 Toyota Camry

Auto theft costs owners and insurance companies billions of dollars. The FBI estimates that 1.2 million cars worth $8.4 billion were stolen in 2002. Should insurance companies charge the same premium for theft insurance or should they vary the premium? Obviously, a policy that insures a $90,000 Porsche costs more than one for a $15,000 Hyundai. But should the premium be a fixed percentage of the car's value, or should the percentage vary from model to model? To answer this question, we need to know whether some cars are more likely to be stolen than others. It comes down to whether there is an association between car theft and car model.

For this example, it's up to you to decide whether an insurance company should charge a fixed percentage of the price to insure against theft. The data come from the National Highway Traffic Safety Administration (NHTSA). We picked seven popular models.
Motivation (What questions would you like to answer?)
My company is deciding whether to base premiums for theft insurance on the chance that the car is stolen. We can either charge a fixed percentage of the replacement cost, or charge a variable percentage for cars that are stolen more often. Are there large differences in the rates of theft?

Method (Identify the variables and report the W's. Be certain that the data are counts and that the categories do not overlap so that no individual is counted twice. Also indicate what you intend to do with these data.)
My data from NHTSA give the number of cars stolen for seven 2002 models. If thefts are associated with the model, then we should vary the rate. I'll judge the association by seeing whether the percentage of cars stolen varies by model.

Mechanics (Make an appropriate display or table to see whether there is a difference in the relative proportions.)
This table shows the data along with the percentage of each model that is stolen.

Model                 Stolen     Made   Pct Stolen
Chevrolet Cavalier      1017   259230        0.392
Dodge Intrepid          1657   111491        1.486
Dodge Neon               959   119253        0.804
Ford Explorer           1419   610268        0.233
Ford Taurus              842   321556        0.262
Honda Accord             702   419398        0.167
Toyota Camry            1027   472030        0.218

Notice that we did not add the missing column for not stolen. If you do that, you'll see that the number made is the marginal (row) total. For example,

Model      Stolen   Not stolen    Total
Intrepid     1657       109834   111491

Among these models, the Dodge Intrepid has the highest percentage stolen (1.486%), followed by the Dodge Neon (0.804%). The Honda Accord has the least (0.167%). A segmented bar chart is less useful because the percentages stolen are so small.

Message (Discuss the patterns in the table and displays. If you can, discuss possible real-world consequences.)
Some models (e.g., Dodge Intrepid) are more likely to be stolen than others (e.g., Honda Accord). About 1.5% of 2002 Intrepids were stolen, compared to less than 0.17% of 2002 Accords. A lot of Accords get stolen, but that's explained by the sheer number of Accords sold each year. We should charge higher premiums for theft insurance for models that are most likely to be stolen. A Dodge Intrepid is 7 times as likely to be stolen as a Toyota Camry. Customers who buy an Intrepid (which costs about the same as a Camry) should pay a higher premium for theft insurance.
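The percentages in this 4M, and the "7 times" comparison, can be reproduced from the two count columns. A minimal sketch, using the stolen and made counts from the table above (the names are illustrative):

```python
# (stolen, made) counts for the seven 2002 models in the table above.
theft_data = {
    "Chevrolet Cavalier": (1017, 259230),
    "Dodge Intrepid": (1657, 111491),
    "Dodge Neon": (959, 119253),
    "Ford Explorer": (1419, 610268),
    "Ford Taurus": (842, 321556),
    "Honda Accord": (702, 419398),
    "Toyota Camry": (1027, 472030),
}

# Conditional percentage stolen, given the model.
pct_stolen = {
    model: 100 * stolen / made for model, (stolen, made) in theft_data.items()
}

# List models from most to least likely to be stolen.
for model, pct in sorted(pct_stolen.items(), key=lambda item: item[1], reverse=True):
    print(f"{model:20s} {pct:.3f}%")

# How many times as likely is an Intrepid to be stolen as a Camry?
ratio = pct_stolen["Dodge Intrepid"] / pct_stolen["Toyota Camry"]
print(f"Intrepid vs. Camry: {ratio:.1f} times")
```

The ratio comes out a little under 7, which the Message rounds to "7 times."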

Are You There?

An on-line questionnaire asked visitors to a retail web site if they would like to join a mailing list. This contingency table summarizes the counts of those who join as well as those who made a purchase.

               Mailing List
Purchase     Join   Decline
Yes            52        12
No            343      3720

The columns indicate whether the visitor signed up (Mailing List = Join or Decline), and the rows indicate whether the visitor made a purchase (Purchase = Yes or No). For example, 52 visitors joined the mailing list and made a purchase.

(a) Find the marginal distribution of Purchase. [2]
(b) Find the conditional distribution of Purchase given whether the customer signed up or not. Do the conditional distributions differ? [3]
(c) Does a segmented bar chart provide a helpful plot for these data? [4]
(d) Is the variable Purchase associated with the variable Mailing List? [5]
[2] The row totals determine the marginal distribution: 64 Yes (made a purchase) and 4,063 No.
[3] Among the 395 who join the list, 52 make a purchase (13%). Among those who decline, 12 out of 3,732 make a purchase (0.32%). Customers who join are more likely to make a purchase. These are rather different.
[4] Not really, because one percentage is so small. You could show a figure like that in the prior 4M.
[5] The two are associated (dependent) because the conditional distributions differ. The chance that a customer makes a purchase depends on whether they sign up.


Lurking Variables and Simpson's Paradox


Association gets confused with causation. This mistake can lead to serious errors in judgment. Consider the following contingency table.
                          Service
Status             Orange Arrow   Brown Box   Total
Damaged   Count              45          66     111
          Col %             15%         33%   22.2%
OK        Count             255         134     389
          Col %             85%         67%   77.8%
Total                       300         200     500

Table 5-7. Counts of damaged packages.

This contingency table shows the number of cartons that were damaged when shipped by two delivery services. The percentages in each cell are column percentages. Overall, 22.2% of the 500 cartons arrived with visible damage. Conditionally, 15% of cartons shipped via Orange Arrow arrived damaged compared to 33% for Brown Box. There's definitely association; neither conditional distribution matches the marginal distribution of Status.

Table 5-7 suggests that Orange Arrow is the better shipper, and we might be tempted to believe that cartons are more likely to arrive undamaged because they are shipped on Orange Arrow. If we believe that, we might decide to ship everything on Orange Arrow. Before we do that, however, we had better make sure that this table offers a fair comparison. Maybe there's another explanation for why packages shipped on Brown Box are damaged more often.

To think of an alternative explanation, we have to know more about these packages. In this instance, the cartons hold car parts. Some cartons hold engine parts whereas others hold plastic molding. Guess which cartons are heavier and more prone to damage? The next two tables separate the counts in Table 5-7 into those for heavy cartons (first table) and those for light cartons (second table).
Heavy
                          Service
Status             Orange Arrow   Brown Box   Total
Damaged   Count              20          60      80
          Col %             67%         40%   44.4%
OK        Count              10          90     100
          Col %             33%         60%   55.6%
Total                        30         150     180

Light
                          Service
Status             Orange Arrow   Brown Box   Total
Damaged   Count              25           6      31
          Col %              9%         12%    9.7%
OK        Count             245          44     289
          Col %             91%         88%   90.3%
Total                       270          50     320

Table 5-8. Separate tables for heavy and light packages.


lurking variable: A concealed variable that affects the apparent relationship between two other variables.

Orange Arrow is no longer clearly the better shipper. Among heavy packages, 67% of those shipped on Orange Arrow arrived damaged compared to 40% for Brown Box. For light cartons, 9% of those shipped by Orange Arrow arrive damaged compared to 12% for Brown Box.

The initial comparison favors Orange Arrow because it handles a higher share of light packages. Brown Box seems more likely to damage packages because it handles a greater proportion of heavy cartons. Heavy cartons more often arrive with some damage (44.4% versus 9.7%). Table 5-7 presents a misleading comparison; it compares how well Orange Arrow ships light cartons to how well Brown Box ships heavy cartons. The weight of the cartons is a hidden, lurking variable. Table 5-8 adjusts for the lurking variable by separating the data into heavy cartons and light cartons.
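The effect of the lurking variable can be checked by computing the damage rates both ways: pooled over weight, and separately within each weight class. The sketch below uses the counts from Table 5-8; the data layout and names are illustrative assumptions.

```python
# (damaged, shipped) counts from Table 5-8, keyed by weight class and service.
shipments = {
    "Heavy": {"Orange Arrow": (20, 30), "Brown Box": (60, 150)},
    "Light": {"Orange Arrow": (25, 270), "Brown Box": (6, 50)},
}
services = ["Orange Arrow", "Brown Box"]

# Damage rate within each weight class (the comparison that holds weight fixed).
within = {
    weight: {s: damaged / shipped for s, (damaged, shipped) in by_service.items()}
    for weight, by_service in shipments.items()
}

# Pooled damage rate and share of heavy cartons for each service.
pooled = {}
heavy_share = {}
for s in services:
    damaged = sum(shipments[w][s][0] for w in shipments)
    shipped = sum(shipments[w][s][1] for w in shipments)
    pooled[s] = damaged / shipped
    heavy_share[s] = shipments["Heavy"][s][1] / shipped

for s in services:
    print(f"{s:12s} pooled {pooled[s]:.1%}  heavy {within['Heavy'][s]:.1%}  "
          f"light {within['Light'][s]:.1%}  share heavy {heavy_share[s]:.0%}")
```

The pooled rates reproduce Table 5-7, while the within-class rates show Brown Box doing better on heavy cartons. The share of heavy cartons (10% for Orange Arrow, 75% for Brown Box) is what drives the pooled comparison.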
Simpson's paradox occurs when the direction of an association between two variables is reversed when a third variable is controlled. The effect was named after E. H. Simpson, who compiled entertaining examples in 1951, but it was carefully described much earlier by G. U. Yule (1871-1951) in 1903.

Such reversals often go by the name Simpson's paradox. It can seem surprising, downright paradoxical, that one service looks better overall, but the other looks better when we restrict the comparison. The explanation lies in recognizing the presence of a lurking variable. Before you act on association (like sending all the business to Orange Arrow), be sure to identify the effects of lurking variables.

One of the best-known examples of Simpson's paradox occurred when U. C. Berkeley was sued for bias against women applying to graduate school. When data from all of the graduate programs at Berkeley were pooled, the admission rate for men was much higher than that for women. However, it was discovered that the rate was very similar in every department. In fact, most departments had a small bias in favor of women. The explanation for the apparent overall bias was that women tended to apply to departments that had many applicants and therefore low rates of admission. Men, on the other hand, tended to apply to departments such as mathematics that had fewer applicants and higher admission rates. (P. J. Bickel, E. A. Hammel, and J. W. O'Connell (1975). Sex Bias in Graduate Admissions: Data from Berkeley. Science 187:4175, pp. 398-404.)

4M: Picking an Airline

Which airline is more likely to get you to a meeting on time, US Airways or Delta? The following table summarizes 10,906 arrivals at four airports served by both carriers: Boston, Orlando, Philadelphia, and San Diego.
                         Airline
Arrival             Delta   US Airways   Total
On Time   Count      2596         5966    8562
          Col %       80%          78%
Delayed   Count       659         1685    2344
          Col %       20%          22%
Total                3255         7651   10906

Table 5-9. Airline arrivals.

This table suggests that the two airlines perform comparably, with a slight edge to Delta with 80% on time compared to 78% for US Airways. Before you book a flight on Delta, however, you should think about whether there's a lurking variable.
Motivation (List the questions that you would like to answer, and state the implications.)
My business regularly takes me to these four destinations. Does it matter which airline I choose, Delta or US Airways? A late arrival might cause me to miss a meeting with a client.

Method (Identify the variables and your data. Describe your plan for the analysis.)
Both variables are categorical: the airline and the arrival status. Data are from the Bureau of Transportation Statistics. Contingency tables answer my questions. I'd better think about lurking variables. A possible lurking variable behind Table 5-9 is the destination of the flight. This table combines the status for all four destinations.

Mechanics
I'll form a contingency table that isolates flights into one destination: Orlando. Here's the table.

Arrival             Delta   US Airways   Total
Delayed   Count       228          150     378
          Col %     19.5%        15.5%
On Time   Count       940          820    1760
          Col %     80.5%        84.5%
Total                1168          970    2138

For flights to Orlando, US Airways is the better choice. In fact, no matter which destination, US Airways has a higher percentage of on-time arrivals.

On Time %      Delta   US Airways
Boston         80.1%        81.7%
Orlando        80.5%        84.5%
Philadelphia   70.5%        74.3%
San Diego      84.2%        85.4%

Message (Discuss the patterns in the table and displays.)
I'll book a flight on US Airways. No matter which destination, US Airways is more likely to arrive on time.

It's worthwhile to review why Delta appears better overall, even though US Airways arrives on time more often for each destination. The initial table, Table 5-9, masks a lurking variable: destination. The destination matters: delays are more common at Philadelphia.
                               Destination
Arrival            Boston   Orlando   Philadelphia   San Diego   Total
Delayed   Count       615       378           1230         121    2344
          Col %       19%       18%            26%         15%
On Time   Count      2620      1760           3505         677    8562
          Col %       81%       82%            74%         85%
Total                3235      2138           4735         798   10906

Table 5-10. Delayed arrivals by destinations.


In addition, most of these flights on US Airways go to Philadelphia, whereas most on Delta go to Boston.
                                  Destination
Airline               Boston   Orlando   Philadelphia   San Diego   Total
Delta        Count      1409      1168            312         366    3255
             Row %       43%       36%            10%         11%
US Airways   Count      1826       970           4423         432    7651
             Row %       24%       13%            58%          6%
Total                   3235      2138           4735         798   10906

Table 5-11. Airlines by destinations.

The initial table (Table 5-9) answers a strange question: Am I more likely to arrive on time flying to Boston on Delta or arrive on time in Philadelphia on US Airways? The answer: take Delta to Boston. There's nothing wrong with that answer; it's just an odd question. By focusing the analysis on flights into a specific destination, we control for this lurking variable and answer the right question.

Once you identify a lurking factor, you can remove its effects as we did in this example. But here's the hard part: How do you know whether there is a lurking factor? It's easy to imagine other lurking factors, too. Maybe it's the type of airplane, the day of the week, or the time of day. Make no mistake about it. You need to understand the context of your data to find a lurking factor.
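One way to see how destination produces the reversal is to treat each airline's overall on-time rate as a weighted average of its destination rates, with weights given by where it flies. The sketch below combines the on-time percentages by destination from the 4M with the flight counts in Table 5-11. Because the percentages are rounded, the results only approximate the 80% and 78% in Table 5-9; the names are illustrative.

```python
# On-time percentages by destination (from the 4M) and flight counts (Table 5-11).
on_time_pct = {
    "Delta": {"Boston": 80.1, "Orlando": 80.5, "Philadelphia": 70.5, "San Diego": 84.2},
    "US Airways": {"Boston": 81.7, "Orlando": 84.5, "Philadelphia": 74.3, "San Diego": 85.4},
}
flights = {
    "Delta": {"Boston": 1409, "Orlando": 1168, "Philadelphia": 312, "San Diego": 366},
    "US Airways": {"Boston": 1826, "Orlando": 970, "Philadelphia": 4423, "San Diego": 432},
}

# Overall rate = average of destination rates, weighted by number of flights.
overall = {}
for airline, by_destination in flights.items():
    total_flights = sum(by_destination.values())
    weighted_sum = sum(
        on_time_pct[airline][dest] * count for dest, count in by_destination.items()
    )
    overall[airline] = weighted_sum / total_flights

for airline, pct in overall.items():
    print(f"{airline:11s} overall on-time rate: {pct:.1f}%")
```

US Airways has the higher rate at every destination, yet its overall average is pulled down because most of its flights go to Philadelphia, where delays are most common for both carriers.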

Chi-Square: A Measure of Association


In the first example, we concluded that Purchase and Host are associated because the proportion of visitors who make purchases differs from host to host. How different are they? Rather than leave it to subjective judgment, it's useful to have a statistic that quantifies the amount of association. Instead of saying "There's some association" or "There's a lot of association," the statistic called chi-square (pronounce "chi" as "ki") measures the degree of association. The larger chi-square becomes, the larger the amount of association.

This statistic also offers a preview of an approach frequently taken in statistics. To quantify the degree of association, we compare the data we observe to artificial data that have none. Chi-square measures association in a contingency table by comparing the observed contingency table to an artificial table that has no association. If the tables are similar, then there's not much association. The larger the difference between the tables, the larger the association.

We'll illustrate the use and calculation of chi-square with an example. A recent poll asked 200 people at a university about their attitudes toward sharing copyrighted music. Half of the respondents were students and the other half were staff at the university (administrators or faculty). This table summarizes the counts.

chi-square: A statistic that measures association in a contingency table; larger values of chi-square indicate more association.

            Attitude to Sharing
Group        OK   Not OK   Total
Staff        30       70     100
Student      50       50     100
Total        80      120     200

Table 5-12. Attitudes toward sharing copyright materials.

Overall, 40% (80 of 200) of those questioned thought it was OK to share copyrighted music. That's the marginal percentage. Each row determines a conditional distribution of the attitude, one for staff and one for students. Only 30% of the staff thought it was OK to share, compared to 50% of students. Because the row percentages differ, Group and Attitude are associated.

To quantify the amount of association, we need a benchmark for comparison, a point of reference. For that, consider what Table 5-12 would look like if there were no association. To figure this out, pretend that we know the marginal totals, but not the counts within the table:
            Attitude to Sharing
Group        OK   Not OK   Totals
Staff         ?        ?      100
Student       ?        ?      100
Totals       80      120      200

Table 5-13. What goes in these cells if the variables are not associated?

Overall, half of the respondents are staff and half are students. Were Group and Attitude not associated, then half of the cases in each column would be staff and half would be students. We would expect the table to look like this:
                Attitude to Sharing
Group       OK to Share   Not OK   Totals
Staff                40       60      100
Students             40       60      100
Totals               80      120      200

Table 5-14. Artificial table with cells that we would expect were Group and Attitude not associated.
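The artificial table can be built from the margins alone. The sketch below uses the rule expected count = (row total × column total) / n, which is the standard way to generalize the "half of each column" reasoning above to margins that are not evenly split; treat the rule and the names as assumptions rather than the text's own notation.

```python
# Marginal totals from Table 5-13.
row_totals = {"Staff": 100, "Student": 100}
col_totals = {"OK": 80, "Not OK": 120}
n = sum(row_totals.values())  # total number of respondents

# With no association, each row gets the same share of every column.
expected = {
    group: {
        attitude: row_totals[group] * col_totals[attitude] / n
        for attitude in col_totals
    }
    for group in row_totals
}

for group, row in expected.items():
    print(group, row)
```

For these margins the rule gives 40 and 60 in each row, matching Table 5-14.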

Chi-square measures the distance between the cells in the real table and those in the artificial table. We first subtract the values in the cells; use only the cells, not the margins. The differences in the counts are:
Real Data     Artificial     Difference
 30    70       40    60      -10    10
 50    50       40    60       10   -10

Table 5-15. Deviations from the original counts.

Next, we combine the differences. If we add them, we get zero because the negative and positive values cancel. We had this problem with cancellation when we defined the variance $s^2$ in Chapter 4. We'll solve the problem as we did then: square the differences before we add them.


When the squared deviations are added, chi-square assigns some of them larger weight. Look at the differences in the first row. Both have size 10, but the difference in the first column is larger relative to what we expected than the difference in the second column (10 out of 40 compared to 10 out of 60). Rather than treat these the same, chi-square assigns more weight to the first. After all, saying 40 and finding 30 is a larger proportional error than saying 60 and finding 70. To give more weight to larger proportional deviations, we divide the squared deviations by the expected values in the artificial table. The chi-square statistic is the sum of these weighted, squared differences. For this table, chi-square, denoted in formulas as $\chi^2$, is

$$\chi^2 = \frac{(30-40)^2}{40} + \frac{(70-60)^2}{60} + \frac{(50-40)^2}{40} + \frac{(50-60)^2}{60}$$

$$= \frac{(-10)^2}{40} + \frac{(10)^2}{60} + \frac{(10)^2}{40} + \frac{(-10)^2}{60} \approx 2.5 + 1.67 + 2.5 + 1.67 \approx 8.33$$

Chi-square has another similarity to $s^2$: it's hard to interpret. The value of chi-square depends on n, the total number of cases, and the size of the table. Larger tables tend to produce larger values of chi-square.
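The arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation just described; the function name `chi_square_contributions` is an assumption made for illustration.

```python
def chi_square_contributions(observed, expected):
    """Weighted squared deviation (observed - expected)^2 / expected per cell."""
    return [[(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
            for obs_row, exp_row in zip(observed, expected)]

observed = [[30, 70], [50, 50]]   # real data
expected = [[40, 60], [40, 60]]   # artificial table (Table 5-14)

contributions = chi_square_contributions(observed, expected)
chi_square = sum(sum(row) for row in contributions)

for row in contributions:
    print([round(x, 2) for x in row])
print("chi-square:", round(chi_square, 2))
```

Keeping the per-cell contributions, rather than only the total, makes it easy to see which cells depart most from the artificial table.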

Are You There?

Here's the contingency table from the prior AYT, including the marginal totals.

                Mailing List
Purchase     Join    Decline    Total
Yes            52        12        64
No            343     3,720     4,063
Total         395     3,732     4,127

(a) Chi-square requires the artificial table of counts. What count would be expected in the cell for those who join the mailing list and make a purchase if Purchase and Mailing List are not associated? [6]
(b) What is the contribution to chi-square from the cell for those who join the mailing list and make a purchase? [7]
(c) The value of chi-square for this table is $\chi^2 \approx 385.9$. Does your answer to (b) reveal which cell produces the largest contribution to chi-square? [8]

[6] If the two variables are not associated, then the percentage who make a purchase among those who join ought to be the same as the percentage in the margin of the table, which is 64/4127 or about 1.55%. The expected count in the first cell is then 395 × 64/4127 ≈ 6.126.
[7] Subtract the expected count from (a) from the observed count to get the deviation. Then square the deviation and divide by the expected count. The contribution is (52 − 6.126)²/6.126 ≈ 343.5.
[8] Yes. No summand in $\chi^2$ is negative, so with about 343.5 of the total 385.9 coming from the first cell, no other cell can contribute more. The big deviation from the artificial table is the large count in the first cell.
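The answers to this exercise can be verified cell by cell. The sketch below is an illustration rather than part of the exercise; the dictionary layout and variable names are assumptions. It computes the expected count and the contribution of every cell, then finds the cell that contributes the most.

```python
observed = [[52, 12],      # Purchase Yes: Join, Decline
            [343, 3720]]   # Purchase No:  Join, Decline
row_labels = ["Yes", "No"]
col_labels = ["Join", "Decline"]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Map (purchase, mailing list) -> (expected count, contribution to chi-square).
cells = {}
for i, r in enumerate(row_labels):
    for j, c in enumerate(col_labels):
        e = row_totals[i] * col_totals[j] / n
        cells[(r, c)] = (e, (observed[i][j] - e) ** 2 / e)

chi_square = sum(contribution for _, contribution in cells.values())
largest = max(cells, key=lambda key: cells[key][1])

for key, (e, contribution) in cells.items():
    print(key, round(e, 3), round(contribution, 2))
print("chi-square:", round(chi_square, 1), "largest cell:", largest)
```

The contribution computed here for the first cell differs slightly in the last digit from the footnote because the footnote rounds the expected count to 6.126 before squaring.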


Cramer's V: Interpreting the Strength of Association

Cramer's V: A statistic derived from chi-square that measures the association in a contingency table on a scale from 0 to 1.

The value of chi-square for the example of music sharing is 8.33, whereas chi-square for the AYT exercise is 385.9. Is there much more association in the second example, or is chi-square larger because n = 4,127 in the second table compared to n = 200 in the first? A more interpretable statistic allows comparisons of the amount of association across tables. To remove the effects of n and the size of the table, Cramer's V adjusts chi-square so that the resulting measure of association lies between 0 and 1. If V = 0, the two categorical variables are not associated. If V = 1, they are perfectly associated: you can guess one once you know the value of the other. If V < 0.25, we will say that the association is weak. If V > 0.75, we will say that it's strong. In between, we will say there is moderate association.

To find Cramer's V, divide $\chi^2$ by the product of the number of cases times the smaller of the number of rows minus 1 or the number of columns minus 1, and take the square root. The formula is simpler than the words. As usual, n stands for the total number of cases; let r be the number of rows and c the number of columns. The formula for Cramer's V is

$$V = \sqrt{\frac{\chi^2}{n \, \min(r-1,\, c-1)}}$$

For the survey of file sharing, $\chi^2 = 8.33$, both r and c are 2, and n = 200. Hence,

$$V = \sqrt{\frac{\chi^2}{200 \, \min(2-1,\, 2-1)}} = \sqrt{\frac{8.33}{200}} \approx 0.20$$

Cramer's V is named after the influential Swedish mathematician and statistician Carl Harald Cramér (1893–1985). Cramér is best known for his work in probability and risk. He found many real-life applications for his work, especially in the insurance industry.
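The formula for Cramer's V translates directly into code. Here is a minimal sketch, assuming chi-square has already been computed; the function name `cramers_v` and its argument order are illustrative choices, not something defined in the text.

```python
import math

def cramers_v(chi_square, n, n_rows, n_cols):
    """Cramer's V from chi-square, the number of cases, and the table size."""
    return math.sqrt(chi_square / (n * min(n_rows - 1, n_cols - 1)))

# File-sharing survey: chi-square = 8.33, n = 200, a table with 2 rows and 2 columns.
v_sharing = cramers_v(8.33, 200, 2, 2)
print(round(v_sharing, 2))
```

For a table with 2 rows and 2 columns, min(r - 1, c - 1) is 1, so V reduces to the square root of chi-square divided by n.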

There's association, but it's weak. Staff and students have different attitudes toward file sharing, but the differences are not very large. For the AYT example, $\chi^2 = 385.9$, n = 4,127, and r = c = 2. In this case,

$$V = \sqrt{\frac{\chi^2}{n \, \min(r-1,\, c-1)}} = \sqrt{\frac{385.9}{4127}} \approx 0.31$$

There is indeed more association in this example than in the example of file sharing, but not that much more. The huge difference between the values of chi-square is a consequence of the difference in sample sizes, not the degree of dependence. What does a table look like when there is strong association? Strong association implies very large differences among row or column percentages of a table. Suppose the survey results had turned out as shown in this table:

             OK to Share    Not OK    Totals
Staff              0          100       100
Students          80           20       100
Totals            80          120       200

Table 5-16. A table with strong association.

No staff thought it was OK, compared to 80% of the students. You'd expect arguments between staff and students about sharing materials on this campus. Let's find $\chi^2$ and Cramer's V for this table. The margins of Table 5-16 are the same as those in the original contingency table, so the artificial table is unchanged and the calculation of $\chi^2$ is similar. We just need to replace the original counts by those in Table 5-16.

$$\chi^2 = \frac{(0-40)^2}{40} + \frac{(100-60)^2}{60} + \frac{(80-40)^2}{40} + \frac{(20-60)^2}{60} \approx 40 + 26.67 + 40 + 26.67 \approx 133.33$$

Cramer's V indicates strong association between the variables:

$$V = \sqrt{\frac{133.33}{200}} \approx 0.816$$

The size of Cramer's V indicates that you can almost predict exactly what a respondent will say if you know whether the respondent is on the staff or is a student. If you know a person is a staff member, then you know their attitude toward sharing files. Every member of the staff says that file sharing is not OK. Among students, 80% say that it's OK to share.
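To compare the three examples side by side, the whole calculation can be run from the raw counts and the result labeled with the verbal scale used in this chapter (below 0.25 weak, above 0.75 strong, otherwise moderate). This is a sketch under those stated cutoffs; the function names `cramers_v_from_table` and `describe` are illustrative assumptions.

```python
import math

def cramers_v_from_table(table):
    """Return (chi-square, Cramer's V) computed from a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi_square += (observed - expected) ** 2 / expected
    smaller = min(len(row_totals) - 1, len(col_totals) - 1)
    return chi_square, math.sqrt(chi_square / (n * smaller))

def describe(v):
    """Verbal label using the chapter's cutoffs of 0.25 and 0.75."""
    if v < 0.25:
        return "weak"
    if v > 0.75:
        return "strong"
    return "moderate"

tables = {
    "file sharing (original)": [[30, 70], [50, 50]],
    "mailing list (AYT)": [[52, 12], [343, 3720]],
    "file sharing (Table 5-16)": [[0, 100], [80, 20]],
}

results = {name: cramers_v_from_table(t) for name, t in tables.items()}
for name, (chi_square, v) in results.items():
    print(f"{name}: chi-square = {chi_square:.2f}, V = {v:.3f} ({describe(v)})")
```

On this scale the AYT table lands in the moderate range, not far above the weak cutoff, which agrees with the observation that it shows more association than the file-sharing survey, but not that much more.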

Checklist: Chi-square and Cramer's V

Chi-square and Cramer's V measure association between two categorical variables that define a contingency table. Before you use these, verify that your data meet these prerequisites.

Categorical variables. If a variable is numerical, there are better ways to measure association.

No obvious lurking variables. A lurking variable means that the association you've found is the result of some other variable that's not shown.

4M Real Estate
A developer needs to pick heating systems and appliances for newly built single-family homes. If the house has electric heat, it's cheaper to install electric appliances in the kitchen. If the home has gas heat, gas appliances make sense in the kitchen. If he's limited to gas or electric heating, how many of each should he offer? Does everyone who heats with gas prefer to cook with gas as well?

The builder checked the preferences of 447 homes in the area. For each, his data give the type of fuel used for cooking and the type used for heating. It's your job to use these data to answer the two questions.
Motivation (State the questions that you would like to answer and mention the implications.)

The builder wants to configure homes that match the demand for gas or electric heat. He also has to decide the types of appliances customers want in kitchens. If there's little association, then the developer needs a wider mix of configurations.

Method (Identify the variables and data. Describe your plan for the analysis.)

The data are two categorical variables with 447 rows. The rows are homes that heat with gas or electric. The variables are the type of fuel used for heating and the type used for cooking. I'll generate a contingency table and compare the conditional distributions to the marginal distribution. There's association if these are different.

Mechanics

About 2/3 heat with natural gas (298/447) and 1/3 with electricity. This contingency table shows column percentages. These give the conditional distributions of cooking fuel given the type of fuel used for heating.

Cooking Fuel By Fuel Heat Home

Count (Col %)    Electric Heat    Gas Heat        Total
Electricity      136 (91.28)      136 (45.64)      272
Natural Gas       10 (6.71)       162 (54.36)      172
Other              3 (2.01)         0 (0.00)         3
Total            149              298              447

There's association. Among homes with electric heat, 91% cook with electricity. Only 46% of those who heat with gas use electricity to cook. To quantify the strength of the association, $\chi^2 = 98.62$ and $V = \sqrt{98.62/(447 \times 1)} \approx 0.47$. That's moderate association.

Message (Summarize your key results and displays. If you can answer the questions, state your answer directly.)

Homeowners prefer natural gas to electric heat by 2 to 1. Of those with electric heat, 91% cook with electricity. Of those with gas heat, 46% cook with electricity. These findings suggest building 2/3 of the homes with gas heat and the rest with electric heat. Of those with electric heat, keep it simple and install an electric kitchen. For those with gas, put an electric kitchen in half and gas in the rest.

(Be honest. If you have some reservations, mention them here.)

There's a big caveat, however. I've assumed that new buyers are looking for the same things in a home that these residents have, and that is a big if.
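The Mechanics step of this example can be reproduced from the six counts in the table. The sketch below is illustrative only, with variable names chosen for this example. It computes the column percentages, chi-square, and Cramer's V for the table of cooking fuel by heating fuel, which has 3 rows and 2 columns.

```python
import math

# Rows: Electricity, Natural Gas, Other (cooking fuel).
# Columns: Electric Heat, Gas Heat.
observed = [[136, 136],
            [10, 162],
            [3, 0]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Column percentages: conditional distribution of cooking fuel given heating fuel.
col_pct = [[100 * cell / col_totals[j] for j, cell in enumerate(row)]
           for row in observed]

chi_square = 0.0
for i, row in enumerate(observed):
    for j, cell in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi_square += (cell - expected) ** 2 / expected

# With 3 rows and 2 columns, min(r - 1, c - 1) = 1.
v = math.sqrt(chi_square / (n * min(len(observed) - 1, len(observed[0]) - 1)))

for row in col_pct:
    print([round(p, 2) for p in row])
print("chi-square:", round(chi_square, 2), "V:", round(v, 2))
```

Each column of percentages sums to 100, which is a quick check that a table of column percentages has been read correctly.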


Summary
A contingency table displays counts and may include selected percentages. The totals for rows and columns of the table give the marginal distributions of the two variables. Individual rows and columns of the table show the conditional distribution of one variable given a label of the other. If the conditional distribution of a variable differs from its marginal distribution, the two variables are associated. Segmented bar charts and mosaic plots are useful for seeing association in a contingency table. A lurking variable offers another explanation for the association found in a table. A lurking variable can produce Simpson's paradox; the association in the table might be the result of a lurking variable rather than the two that define the rows and columns. Chi-square and Cramer's V are statistics that quantify the degree of association.

Key Terms
association
chi-square
contingency table (cell, margin)
Cramer's V
distribution (conditional, marginal)
lurking variable
mosaic plot
mutually exclusive
segmented bar chart
Simpson's paradox

Formulas
Chi-square

The key step in computing chi-square is to obtain the table of artificial counts that are expected were there no association. The marginal counts of the artificial table match those of the data. The conditional distributions within the rows and columns of the artificial table must be consistent with these, but not show any association. A formula shows how to compute the cells of the artificial table. Let $\text{row}_i$ denote the marginal frequency of the ith row (the number of observations in this row), and let $\text{col}_j$ denote the marginal frequency of the jth column (the number in this column). If there is no association between the two variables, then we expect to find

$$\text{expected}_{i,j} = \frac{\text{row}_i \times \text{col}_j}{n}$$

cases in the jth cell of the ith row. A spreadsheet is helpful to organize the calculations for larger tables. To find $\chi^2$, sum the weighted, squared deviations between $\text{expected}_{i,j}$ and the observed counts $\text{observed}_{i,j}$. Using the summation notation introduced in Chapter 4, the formula for chi-square is compactly written like this:

$$\chi^2 = \sum_{i,j} \frac{(\text{observed}_{i,j} - \text{expected}_{i,j})^2}{\text{expected}_{i,j}}$$

where the sum extends over all of the cells of the table.

Cramer's V

$$V = \sqrt{\frac{\chi^2}{n \, \min(r-1,\, c-1)}}$$

for a table with r rows and c columns that summarizes n cases.
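A quick way to build confidence in these formulas is to try them at the two extremes of the scale. The sketch below is an illustration, not part of the chapter. It applies the formulas to the artificial table from Table 5-14, which shows no association, and to two hypothetical tables, invented here, in which the row determines the column exactly.

```python
import math

def chi_square(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum((cell - row_totals[i] * col_totals[j] / n) ** 2
               / (row_totals[i] * col_totals[j] / n)
               for i, row in enumerate(table)
               for j, cell in enumerate(row))

def cramers_v(table):
    """Cramer's V for a table with r rows and c columns."""
    n = sum(sum(row) for row in table)
    r, c = len(table), len(table[0])
    return math.sqrt(chi_square(table) / (n * min(r - 1, c - 1)))

no_association = [[40, 60], [40, 60]]        # Table 5-14
perfect_2x2 = [[100, 0], [0, 100]]           # hypothetical counts
perfect_2x3 = [[50, 0, 0], [0, 30, 20]]      # hypothetical counts

for name, table in [("no association", no_association),
                    ("perfect, 2 by 2", perfect_2x2),
                    ("perfect, 2 by 3", perfect_2x3)]:
    print(name, round(chi_square(table), 2), round(cramers_v(table), 3))
```

In the hypothetical table with 2 rows and 3 columns, knowing the column identifies the row exactly, and dividing by min(r - 1, c - 1) = 1 is what keeps V on the 0-to-1 scale even though the table is not square.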

Best Practices

Use contingency tables to find association between categorical variables. You cannot see the association in the separate bar charts. It only becomes evident when you look at the table and compare the conditional distributions to the marginal distributions.

Consider the possibility of lurking variables. Before you interpret the association you find between two variables, think about whether there is some other variable that offers a different explanation for your table. Are the data in the columns or rows of your table really comparable, or might some other factor that's not evident explain the association that you see?

Recognize association. Association means that some of the variation in a variable can be described, or explained, by knowing another. By identifying this type of association, you can make choices that fit together.

Use plots to show association. Segmented bar charts and mosaic plots are useful for comparing relative frequencies in larger tables. Adjacent pie charts are another choice, but these can make it hard to compare percentages unless the differences are large.

Exploit the absence of association. If the two categorical variables are not associated, the variation is self-contained in each of the variables, and you do not need the complexity of a table.

Pitfalls
Don't interpret association as causation. You might have found association, but that hardly means that you know why values fall in one category rather than another. Think about the possibility of lurking variables.

Don't put too many numbers in a table. Computers make it easy to decorate a table with too many percentages. Choose just the ones that you need, those that help you answer the question at hand.

About the Data


The Amazon data in this chapter (and Chapter 3) come from ComScore, a firm that monitors the web-browsing habits of a sample of consumers around the country. The data on airline arrivals in the 4M example of Simpson's paradox are from the web site of the Bureau of Transportation Statistics. (From the main page, follow the links to data that summarize information about various types of travel in the US.) We used arrival data for January 2006. The data for kitchen preferences are a subset of RECS, the Residential Energy Consumption Survey performed by the Department of Energy. The example of attitudes toward file sharing is from a story in The Daily Pennsylvanian, the student newspaper at the University of Pennsylvania.

Software Tips

Excel

Excel has a powerful feature for producing contingency tables, but you need to master its concept of pivot tables to get them. If you want to stay with Excel for all of your computing, then it's probably worth the effort. Start by reading the help files produced by searching for "pivot tables" from the help menu. Once you have the contingency table, it's not too hard to compute the value of chi-square and Cramer's V using typical formula manipulations. We find it easiest to build a table of expected counts (under independence), then subtract this table from the observed table and square each cell. Adding up the squared deviations divided by the expected counts gives chi-square.

Minitab

To obtain the contingency table, follow the menu items Stat > Tables > Cross-Tabulation and chi-square and fill in the dialog with the names of two categorical variables. Pick one variable to identify the rows of the table and the other for the columns. (Layers allow you to produce tables such as Table 5-8 that show a separate table for each value of a third variable.) Options also produce intermediate steps in the calculation of chi-square, such as the contribution from each cell to the total. It's an easy calculation to convert chi-square to Cramer's V.

JMP

Follow the menu commands Analyze > Fit Y by X and pick one categorical variable for the Y variable and one for X. The variable chosen for Y identifies the columns of the contingency table and the variable identified as X identifies the rows of the contingency table. By default, the output from JMP shows the mosaic plot. The pop-up menu produced by clicking on the red triangle beside the header Contingency Table in the output window allows you to modify the table by removing, for instance, some of the shown percentages. The value of the chi-square statistic appears below the table in the section of the output labeled Tests. The value of chi-square is labeled in the output Pearson. (There are variations on how to compute the chi-square statistic.)

Source        DF      -LogLike    RSquare (U)
Model          5       8.28511      0.0144
Error       1221     567.79052
C. Total    1226     576.07562
N           1231

Test                ChiSquare    Prob>ChiSq
Likelihood Ratio      16.570       0.0054
Pearson               16.056       0.0067

Once you have chi-square, use the formula given in the text to obtain Cramer's V.
