RATIONALE AND RULES-OF-THUMB FOR QUESTIONNAIRE-BASED SYSTEM EVALUATION STUDIES

Bruce L. Rosenberg and Thomas B. Zurinskas
Federal Aviation Administration Technical Center
Atlantic City Airport, New Jersey

PURPOSE

In this paper, we address the methodology involved in operational test and evaluation (OT&E) by Air Traffic Control (ATC) Specialist subjects (expert users). As old questionnaire designers, we asked ourselves, "What can we share with test planners, test conductors, and test subjects which will facilitate the OT&E process?" The answer we came up with is, "We can show you what we did on a recent evaluation study, document some of the difficult trade-offs involved, and give you some of our whys and wherefores and tricks of the trade." So here is a mini-tutorial on putting together subjective OT&E, gleaned from our experience conducting studies for the Federal Aviation Administration (FAA) over the past 17 years.

BACKGROUND

The Need for OT&E

Major FAA development and acquisition programs are now in progress, and extensive OT&E of several systems is planned for the near future. Much of this work will be performed using questionnaire-based evaluations by ATC Specialist expert user-subjects, either in the field or in the simulation facilities at the FAA Technical Center.

Why Subjective Measures

OT&E studies require both objective and subjective measures. Objective measures are quantitative. Some examples of system-oriented objective measures are: system response time, number of tracks handled, and system load. There are also user-oriented objective measures such as task activity recording, eye movement recording, etc. Subjective measures can be either quantified qualitative responses (such as ratings on a scale) or qualitative, narrative responses. Some examples of subjective measures are: controllers' answers to questionnaire items regarding system effectiveness, acceptability, readiness for operational use, etc. R. T. Stevens,
in his book on OT&E, acknowledges the usefulness of subjective measures. They are necessary since a system may check out on all parameters, yet users may have unforeseen operational problems. Additionally, even "objective" criteria in a requirements document, say a 0.5-second response time, ultimately derive from subjective judgments as to what is acceptable for that system. Subjective opinions from expert users regarding a system's suitability for operational deployment should be the bottom line in any OT&E study.

OT&E questionnaires can accomplish more than determining whether a system is acceptable to future users or whether the new system is better than the system now in use. They can help familiarize test subjects with the system's functions, features, and benefits. They can provide user-generated design suggestions for future system development.

The ETABS Study

The Electronic Tabular Display Subsystem (ETABS) study, conducted at the FAA Technical Center in 1982, is used as an example for a number of questionnaire design points which follow. That interactive ATC simulation study evaluated replacing flight strips with video displays. It involved administering seven questionnaire forms over a period of 3 weeks to 10 field controller subjects. The first week was spent in training subjects on the system, and the remaining 2 weeks were spent in interactive ATC simulation test runs. Although parts of ETABS were rated acceptable, the bottom line was that it was not suitable for operational use. However, this paper on the design of questionnaire-based system evaluation studies derives, in part, from experiences with ETABS. Specific examples are presented in five exhibits at the end of this paper.

From the complexity of the system, it soon became apparent that ETABS required more than a single questionnaire form for its evaluation. The final count was eight forms, A through H. As
administered, each questionnaire form was preceded by a title page containing explanatory material as to the form's purpose and content. Exhibit 1 presents this descriptive material. Exhibit 2 shows a calendar of dates, the schedule of training and testing, and the times when specific forms were administered. Exhibit 3 shows a multi-item, tabular rating scale question (number 3.0, Form F). Exhibit 4 shows a single-item rating scale question (number 15.1, Form F). Exhibit 5 shows a narrative-soliciting question (number 9.1, Form F).

THE EVALUATION STUDY IN GENERAL

Keep It Practicable

As in any engineering effort, designing a do-able, questionnaire-based evaluation study involves evaluating trade-offs and making compromises. A number of practical constraints, such as the scarcity of operational ATC Specialists and the use of simulated air traffic, place limitations on the extent to which sophisticated experimental designs and statistical analyses can be carried out. Difficult compromises must often be made to complete a successful evaluation study. Many strategies (and pitfalls) to consider in the design of evaluation studies are well documented by Isaac and Michael.

Some critically important considerations for successful implementation and execution of reliable and valid subjective system evaluations are: selection of ATC Specialist subjects, training of subjects, content and format of questionnaires, scheduling of test runs, assignment of subjects to positions, scheduling administration of questionnaires, statistical analysis of results, and integrated planning for the entire study at the outset. Perhaps the most important consideration is to develop a well-functioning team and to keep it stable over the time period of the study. Those planning the study should also be involved in performing it. Those performing the study should also be involved in planning it.
Think Ahead to the Final Report

When writing questions for the questionnaire, it is important to think ahead to the issues that must be addressed in the final report. Addressing these issues with very pointed questions allows the tester to refer directly to the answers to back up his conclusions. Fuzzy questions cannot lead to convincing conclusions, while direct, concise questions allow direct conclusions.

A given OT&E study might involve 15 subjects, 5 questionnaire forms, 40 questions within each form, and 3 administrations of the forms. This would result in 15 x 5 x 40 x 3, or 9,000, data points. If differences within forms, differences among forms, and/or differences among administrations are computed, this figure can be multiplied many times. The speed with which the magnitude of a data reduction effort can expand is a constant source of wonderment! This is a good reason for integrated planning at the outset and for ensuring that those responsible for designing the questionnaire are also responsible for the data reduction and analysis.

Think Ahead to the Statistical Analyses

When doing the experimental design and writing the questionnaires, think ahead to the statistical treatment of the results. Statistics should not be an afterthought; they should influence the design of the study from the outset. Keep in mind the ultimate decision-making information required from the inferential statistical analyses.

Keep It Open

Isaac and Michael refer to the "guinea pig" effect, in which subjects feel "used." Avoid this by creating an open atmosphere in which controller subjects feel part of the evaluation team, by building trust, and by not hiding anything from them. Make the purpose of each form and questionnaire item explicit. Stress that you want their individual opinions, not their opinions of what the group (or its most opinionated and vociferous member) thinks.
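The data-volume arithmetic described under "Think Ahead to the Final Report" can be sketched in a few lines. This is a minimal illustration using the hypothetical counts from the text, not part of the original study:

```python
# Back-of-the-envelope data-point count for the hypothetical OT&E study
# described above: 15 subjects, 5 forms, 40 questions per form,
# 3 administrations of each form.
subjects = 15
forms = 5
questions_per_form = 40
administrations = 3

# Raw ratings collected, before any derived within-form or
# between-administration comparisons multiply the figure further.
data_points = subjects * forms * questions_per_form * administrations
print(data_points)  # 9000
```

Even one extra comparison per question (say, first versus last administration) adds thousands of derived values, which is why the data reduction plan belongs in the initial design.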
You may not want to put everything in the open, however. It is usually advisable to keep the subjects' identities confidential. This allows them to be candid and to feel free to give negative ratings or comments where appropriate. Assign ID numbers for placing on the forms and keep the key list under the control of one person.

Select Typical Users As Subjects

In most OT&E studies, the size of the group (sample) of user-evaluators (test subjects) is much smaller than the entire user population. To obtain viewpoints which are representative of the entire user population, testers must ensure that all segments of the user population are included in the sample. Some factors which may affect test outcome are: geographical location of the user's facility, equipment in use, amount of experience of the user, etc. These and other possibly relevant factors should be considered in identifying population segments and in selecting test subjects. The larger the sample size, the easier it is to obtain a representative group of subjects. Kish is an excellent reference on the methods and statistics of various sampling approaches.

THE QUESTIONNAIRE PACKAGE

Questionnaires are the most common instruments used in subjective evaluations. Eight ETABS questionnaire forms (Exhibit 1) were administered at different times during the study. Each form consisted of a number of questionnaire items. Reference is made to this material in the following four hierarchically arranged sections: (1) QUESTIONNAIRE FORMS WITHIN THE STUDY, (2) QUESTIONNAIRE ITEMS WITHIN THE FORM, (3) STRUCTURE OF THE ITEMS THEMSELVES, and (4) CONTENT OF THE ITEMS THEMSELVES.

QUESTIONNAIRE FORMS WITHIN THE STUDY

Administration of the Questionnaire Forms

Be aware that answering a questionnaire is both a teaching (the Socratic Method) and a learning process. Develop a mental set in the subjects by allowing them to preview the forms or by giving the same form more than once. This primes them for what to look for in the system.
For example, as can be seen in Exhibits 1 and 2, all questions about specific system components in the Wrap-Up questionnaire (Form F) had appeared in earlier forms. Information was also presented in questionnaires given during training (Form B) which was intended to help subjects learn to use ETABS.

Query subjects as soon as possible after they have been exposed to the parts of the system for which information is needed. Exhibits 1 and 2 show that for ETABS eight different questionnaire forms were used and administered at appropriate times during the evaluation.

Repeatedly administer the same questions to be able to measure changes of opinion on key aspects of the system as time goes on. Early in an evaluation there may be an initial negative reaction due simply to the frustration of learning needed skills or, on the other hand, an initial positive reaction due to interacting with a "new, improved" system (the Halo or Hawthorne effect; see Isaac and Michael). As experience with the system increases, the opinions of the subjects may change for better or worse. These time-dependent changes should be documented by repeated administration of the same questions (see Exhibits 1 and 2).

QUESTIONNAIRE ITEMS WITHIN THE FORM

Keep the Questionnaire Short

Design the questionnaire to obtain as much information as possible with a minimum number of questions. Overburdening subjects with an excessive number of questions can cause negative reactions which might invalidate the results. For unmonitored subjects, shortness and simplicity are crucial, or the trash can becomes too tempting.

But Cover All Relevant Issues You Can

The questionnaire designer should cover all parts of the system about which information is wanted, and realize that no matter how thorough a questionnaire is, it is unlikely to cover all relevant issues. For this reason, ETABS multi-item tabular questions (Exhibit 3) contained a row entitled "Other."
Each multi-item tabular question was also followed by space for comments. Open-ended narrative items also solicited responses on issues which might have been missed.

Look for Inadvertent Imbalance of Items

A questionnaire can be biased by the items included or not included in it (Poulton). Since the design of a questionnaire to evaluate a system requires knowledge on the part of the designer about that system, this knowledge often includes information about the system's weak or strong points. A questionnaire designer who feels strongly about a system, either pro or con, may inadvertently include more items supporting his view and fewer items negating his view. Thus, inclusion of a truly representative set of questions in a questionnaire is essential to a balanced evaluation of a new system.

The senior author had an experience which nicely illustrates the above point. He was consulting on the design of a questionnaire with the team of engineers who built the system which was to be evaluated. These engineers were painfully aware of the problems with and deficiencies of the system. Being natural problem solvers, they had a mental set which was focused primarily on problems. After being briefed on the system and looking over the draft of questions, he blurted out, "If you give this as a questionnaire, the results can't help but show the system as being poor!" He then directed their thinking more along the lines of the benefits of the system, and they came up with a more balanced set of questions.

Use Redundancy Sparingly

While the number of questions asked of a subject should be held to a minimum, there are reasons for occasionally asking extra summary or bottom-line type questions similar to others already asked. Although a pretest of the questionnaire should eliminate most unclear questions, it is helpful to ask the same sort of question in different ways and in different contexts as a hedge against not getting the needed information.

Phrase Questions As Clearly As Possible
Occasionally, however, subjects have varying interpretations which show that either some questions are not clearly phrased, or that they are clearly phrased yet subjects disagree on the answer. Clear questions are ones which are interpreted similarly by most subjects. A question may be clear, but there might not be general agreement as to the answer; there might be real differences of opinion on the question. If time and conditions permit, it is useful, following completion of the questionnaire, to sit down with the group of subjects and review each question to determine how they interpreted it.

Both group review and statistical analysis were used on the ETABS study to separate the wheat from the chaff: to sort the questions on which there was clear consensus toward "good" or "poor" from those on which there was not. Our statistical treatment of responses filtered out both unclear questions and clear questions on which there was no consistency in response. We used multiple t-tests (alpha <= 0.05) to separate questions whose mean ratings did not deviate significantly from mid-scale from those whose mean ratings did.

Ask Specific Questions Before General Ones

The order of presentation or sequence of questions within a form is important. In the case of ETABS, the questions presented first were the more specific or detailed ones. More general, "summing-up" questions were placed at the end of the questionnaire. Doing this allows the subject to review all individual items, which might affect his opinion, prior to making overall judgments of suitability, system effectiveness, etc.

STRUCTURE OF THE ITEMS THEMSELVES

Items With Rating Scales

There is a large body of research on the theory and practice of rating methods (Dunn-Rankin). Most rating scales in the ETABS study used seven categories. Miller provides a basis in human information handling capacity for using seven rating categories.
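The mid-scale filter described above can be sketched as a one-sample t-test of each question's mean rating against the midpoint of a seven-point scale. This is a minimal illustration, not the ETABS analysis itself: the ratings are invented, and the critical value 2.262 is the two-tailed t for alpha = 0.05 with 9 degrees of freedom (10 subjects):

```python
from statistics import mean, stdev

MIDSCALE = 4.0   # midpoint of a 1-to-7 rating scale
T_CRIT = 2.262   # two-tailed critical t, alpha = 0.05, df = 9

def deviates_from_midscale(ratings):
    """True if the mean rating differs significantly from mid-scale."""
    n = len(ratings)
    t = (mean(ratings) - MIDSCALE) / (stdev(ratings) / n ** 0.5)
    return abs(t) > T_CRIT

# Hypothetical ratings from 10 subjects on two questions:
clear_consensus = [6, 7, 6, 5, 6, 7, 6, 6, 5, 6]  # consistently "good"
no_consensus    = [1, 7, 2, 6, 3, 5, 4, 6, 2, 5]  # split opinions
print(deviates_from_midscale(clear_consensus))  # True
print(deviates_from_midscale(no_consensus))     # False
```

Questions failing the test are then candidates for the group review: either the question was unclear, or opinions genuinely diverged.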
Five categories are also commonly used, for example, the Likert scale: e.g., Very Good, Good, Fair, Poor, and Very Poor. However, there is a natural tendency for subjects to avoid the use of scale extremes. Thus, if extremes are avoided, a five-category scale becomes a three-category scale. The use of seven categories provides more options for the subjects and more information for the analyst. Also, it attempts to nudge noncommittal
