
# A Memo Answering a Question About Generalizing Test Results from a Small Number of Test Participants Used in a Usability Test

## IBM Lexington, 1988

Date: 15 January 1988
From: Phil Cullum

Regis,

I keep getting beat up at meetings when I show HF studies that used 12 subjects. People are concerned that this is not enough to allow for all the human variation in hand size, handedness, experience, and other factors involved in keyboard design. Please email me a paragraph or two explaining why we use that population size for our studies, and what, if anything, we would gain by using a larger one.

Regards,
Phil Cullum

---
Date: 18 January 1988
From: Regis L. Magyar, Ph.D., IBM Human Factors & Information Development
Subject: Generalizing test results from limited samples to populations

Phil,

Your question is not an easy one to answer in a few paragraphs, but I'll try to address your concerns as best I can.

In its simplest form, an experiment usually compares the behavior of one group of subjects with the behavior of another group, where each group is assigned a different condition or experimental treatment (e.g., Keyboard A and Keyboard B). Our goal, of course, is to say with some certainty that any observed differences in performance are attributable to the differences in the treatments. In studies involving two groups, however, it is unlikely that the subjects in both groups are in fact identical, so differences in performance may be attributable to uncontrolled variation in the subjects alone (i.e., sampling error) as well as to differences in the experimental treatments.

Inferential statistics help us decide whether differences between the experimental and control groups are large enough to attribute to the experimental treatment. We can never really be sure that a given difference did not arise strictly from sampling error; what inferential statistics let us state is the probability that a particular difference arose from sampling error. If that probability is sufficiently low (e.g., .05 or .01), we reject the straw-man hypothesis that there is no difference between the two conditions. This hypothesis of no difference is termed the null hypothesis, and rejecting it allows us to conclude that the two conditions produced reliably different behaviors in the two groups. The level of probability we decide is necessary to reject the null hypothesis is called the confidence level, and the value of the statistic associated with a confidence level is called the critical value of the statistic. When our test results reveal a significant difference between the control condition and the experimental condition at the .05 level, it means that a difference this large would arise by random chance alone in only five out of 100 replications of the experiment if there were in fact no real difference between the treatments.

Using more subjects in an experiment allows us to be more confident that our findings are reliable and representative of a given population of users. However, the relationship between the number of subjects used in a test and the degree of confidence we have in our conclusions is not linear. In fact, for most tests there is little to be gained statistically by using more than 30 subjects, since the critical value changes very little between 30 and 1,000 subjects.

Almost all of the tests I conducted at Lexington used a within-subject design, in which the same subject was exposed to both experimental treatments.
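The diminishing return from adding subjects can be sketched with the two-tailed critical values of Student's t at the .05 level. The values below come from standard t tables; the Python snippet is only an illustration of the point about 30 versus 1,000 subjects, not part of the original memo:

```python
# Two-tailed critical values of Student's t at the .05 confidence level,
# taken from standard t tables (df = number of subjects minus one).
CRITICAL_T_05 = {
    12: 2.201,    # df = 11
    30: 2.045,    # df = 29
    100: 1.984,   # df = 99
    1000: 1.962,  # df = 999
}

# Going from 12 to 30 subjects lowers the bar for significance
# more than going from 30 all the way to 1,000 subjects does.
gain_12_to_30 = CRITICAL_T_05[12] - CRITICAL_T_05[30]
gain_30_to_1000 = CRITICAL_T_05[30] - CRITICAL_T_05[1000]

print(f"12 -> 30 subjects:   critical t drops by {gain_12_to_30:.3f}")
print(f"30 -> 1000 subjects: critical t drops by {gain_30_to_1000:.3f}")
```

The critical value a test statistic must exceed falls steeply up to roughly 30 subjects and is nearly flat afterward, which is the non-linear relationship described above.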
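The mechanics of a within-subject comparison can be sketched as a paired analysis: each subject's score under one treatment is compared directly with that same subject's score under the other, so between-subject variation drops out of the comparison. The data below are entirely hypothetical, invented only to show the computation:

```python
import math
import statistics

# Hypothetical words-per-minute scores for 12 typists on two keyboards
# (illustrative numbers only; not data from the original studies).
keyboard_a = [62, 58, 71, 55, 66, 60, 73, 57, 64, 59, 68, 61]
keyboard_b = [66, 61, 74, 59, 69, 65, 78, 60, 67, 63, 73, 64]

# Within-subject design: each typist serves as his or her own control,
# so we analyze the per-subject differences rather than the raw groups.
diffs = [b - a for a, b in zip(keyboard_a, keyboard_b)]

n = len(diffs)
mean_d = statistics.mean(diffs)
sd_d = statistics.stdev(diffs)      # sample standard deviation of differences
se_d = sd_d / math.sqrt(n)          # standard error of the mean difference
t = mean_d / se_d                   # paired t statistic, df = n - 1

print(f"mean difference = {mean_d:.2f} wpm, t({n - 1}) = {t:.2f}")
# Compare t against the two-tailed critical value 2.201 (df = 11, p = .05).
```

Because the per-subject differences vary far less than the raw scores do, the paired statistic detects the treatment effect with far fewer subjects than a two-group design would need.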
The big advantage of this approach (using correlated data) over a two-group design (with independent observations) is that sampling error is generally smaller or minimized, so differences observed in a study are more likely attributable to the treatments rather than to variation among the subjects. Statistically, we are also more confident that an effect is not the result of some other uncontrolled variable, since every test subject serves as his or her own control.

The big concern I read in your note, however, does not seem to be how certain we are of our findings; instead, it seems to be the generality of our findings. To this I can only say that I am confident that a performance difference reported with 12 typists is statistically valid across a user population matching the characteristics, skills, and experience of those used in the test. From this perspective, we must ask whether there is any reason to believe that a test group of typists randomly selected from the whole population of typists would not be representative of that general group of users.

I don't think there will ever be an experiment that can claim its findings extend to EVERY person in the world. Indeed, that is why we often go in the opposite direction and get very specific in our testing, especially in the specification of our subject groups. For example, if I'm comparing two keyboards in a test, I make very sure that I use subjects whose skills match as closely as possible the characteristics and profiles of those most likely to use the keyboards. That's why I don't use farmers, mechanics, or engineers as test subjects; they don't represent the group of people with the skills, needs, or background necessary to use my product in a particular context. On the other hand, no study can control for ALL the variables that may be determinants of typing performance, and I am sure there are many that have yet to be discovered; that is the point of research. We try to control all variables other than those we are investigating, so that the effects we observe in an experiment are most likely the results of the treatment variables being studied.

So what I'm really saying is this: if someone complains that our results may not extend to everyone in the world, the comeback should be, "Why should they?" They're only supposed to extend to those exhibiting the characteristics of users that we think are most important. If these critics can provide data showing that some other variable or characteristic is a major determinant of the performance I'm investigating, and that the subjects I tested do not exhibit that trait, then I would worry about the generality of my test results to all members of the specific group. Otherwise, I don't feel obligated to generalize my results past those used as subjects in my research, nor should I attempt to.

A final point to remember about the interpretation of test results is that the magnitude of the treatment effects being observed can often overshadow how many subjects are used.
In other words, if our test results reveal a very large effect attributable to a specific treatment variable (e.g., typing performance increased 200% for EVERY one of the test subjects in the experimental condition), you may not need a large sample of subjects to be relatively certain that the effect is reliable and real. Remember, you only need to show me ONE chimpanzee who can converse with me in fluent English to convince me that your particular training method is worth investing in, and that your methods produce real effects.

I hope this provides the kind of information you requested, Phil. Please contact me if you have any further questions.

Best regards,
Regis Magyar, Ph.D.