
Flight Examiners' Methods

Flight Examiners' Methods of Ascertaining Pilot Proficiency


Wolff-Michael Roth¹,²

¹University of Victoria; ²Griffith University

Abstract There are no studies about how flight examiners (check captains) think while
assessing line pilots during flight training and examination. In this study, 23 flight examiners
from 5 regional airlines were observed or interviewed in three contexts: (a) surrounding the
assessment of line pilots in simulator sessions; (b) stimulated recall concerning the assessment of
pilots; and (c) modified think-aloud protocols during assessment of flight episodes. The data
reveal that flight examiners use the documentary method, a mundane form of reasoning, where
observations are treated as evidence of underlying phenomena while presupposing these
phenomena to categorize the observations. Although this method of making sense of pilot
performance is marked by uncertainty and vagueness, different mathematical approaches are
proposed that have the potential to model this form of reasoning or its results. Possible
implications for theory and the practice and training of flight examiners are provided.

Keywords

Debriefing; assessment; cognition; cognitive anthropology; thought process

A recently accredited flight examiner during stimulated recall: "Probably your biggest fear
is having to fail someone. That's why people say, 'Your worst one always is your first
one.' . . . Becoming an examiner is like going for your first solo or your first commercial
nav. 'Well done, you've got the job.'" (D2)
A first officer during a post-debriefing interview: "You're always trying to remember
[the flight examiner's] way of doing it. You've just got to remember to tick their boxes
and you're okay. You're also trying to remember the way they're thinking. And if you get
back in to that then you don't get so many comments."
A flight examiner is an authorized check pilot who has taken on some of the duties of a
flight inspector on behalf of the regulatory authority (e.g., CAA-NZ, 2013; Transport Canada,


2013). Flight examiners are airline captains, serving their airline while also being responsible to the
regulatory authority, who assess the competencies of pilots against a national standard for
continued accreditation purposes and during type-rating. In the first introductory quotation, a
recently accredited flight examiner talks about becoming a flight examiner with rather little
experience or training in how to do the job, or in how to assess in such a way that the fears associated with
having to fail a pilot are lessened by the ability to ground the assessment in evidence. Such
comments concern the conceptual shifts individuals have had to make when becoming flight
examiners. Perhaps unsurprisingly, conversations with airline training and standards managers
reveal their eagerness to find out more about how flight examiners think for the purpose of using
such information in their training of new flight examiners, an increasing number of whom are
needed because of the rapid expansion some airlines are experiencing.
In the second introductory quotation, an experienced first officer (52 years of age, 30 years as
commercial pilot, 9,000 flight hours) talks about what really matters when he undergoes
examination following an operational competency assessment. First, he suggests that a pilot has
to remember how the particular examiner flies, which is the implicit referent for his assessment;
and a pilot also has to remember the way in which flight examiners are thinking, that is, the
processes of their thinking that lead them to make assessments and influence recommendations
for improvement or training. But how do flight examiners think? What evidence do they seek and
use to arrive at an assessment, and what are their thinking processes?
Currently there are few studies; and those that can be found have been conducted in constrained
settings using pre-recorded video (e.g., Roth, Mavin, & Munro, 2014a) rather than observing
flight examiners at work.
Early work on pilot assessment investigated assessment in terms of measurement models and
focused on outcomes (e.g., Flin et al., 2003; Holt, Hansberger, & Boehm-Davis, 2002;
O'Connor, Hörmann, Flin, Lodge, & Goeters, 2002), where the measurements sometimes are
supported and enhanced by automated tools (e.g., Deaton et al., 2007; Johnston, Rushby, &
Maclean, 2000). More recent studies focus on the nature of the evidence that flight examiners
use in support of their ratings (e.g., Roth & Mavin, 2014). None of these studies investigate how
flight examiners think, the method or methods that they use to arrive at statements about the


proficiency, knowledge, skills, or states of pilots. This study, grounded in the cognitive
anthropology of work, was designed to investigate flight examiners' methods. The ultimate goal
of this work is to construct a basis for the training and professional development of flight
examiners.
There exists considerable research on how experts from a variety of fields think, including the
clinical reasoning of medical experts (Boshuizen & Schmidt, 1992), instructional designers
(Perez, Johnson, & Emery, 1995), historians (Wineburg, 1998), and scientists (Alberdi, Sleeman,
& Korpi, 2000). Such studies show the tremendous importance of subject-matter specific
knowledge in the forms of reasoning observed, knowledge that sometimes is not apparent
because it is encapsulated in the practical knowledge that experts have developed over the years
(Boshuizen & Schmidt, 1992). A recent study from aviation (investigating the ways in which
pilots, including flight examiners, assessed other pilots) showed that assessors vary in the facts
on which they base their assessment (Roth et al., 2014a). Moreover, first officers and captains
go about assessment in ways that differ from those of flight examiners: the former move item by
item through an assessment form and identify performances that allow them to give a score on
that item whereas the latter first construct narrative descriptions and then map the result onto the
assessment form (Roth & Mavin, 2014). Both of these studies investigate assessment of short
scenarios. Such studies tell us little about how experts in assessment (flight examiners) actually
think in the context of real situations where they observe pilots over four-hour periods. In the
present study, therefore, we combine information gleaned in the field (by means of interviews
with flight examiners at three points during their work of examining pilots and videotapes of the
debriefings) with information gleaned from experimental settings (think-aloud protocols typical
for research on expertise).
Research Methods
This study was designed to investigate the methods flight examiners use to assess airline
pilots. We employ methods typical for cognitive anthropology. In this field, approaches from
different research traditions are combined, including field observations in real settings typical of
anthropology and think-aloud protocols and stimulated recall in constrained settings typical of
empirical cognitive science.


Design
In this study, flight examiners were recorded in three contexts: (a) in and related to actual
debriefing at work; (b) stimulated recall sessions; and (c) modified think aloud protocols
requiring pairs of flight examiners to assess crewmembers featured in videotaped scenarios. The
design included flight examiners participating (a) in the debriefings only, (b) in debriefings and
stimulated recall, (c) debriefings, stimulated recall, and think-aloud protocols, and (d) think-aloud protocols only (Table 1). In the debriefing context, individual flight examiners were
recorded from 1 to 5 times (Table 1). The nature of this design provides both breadth and
depth to the investigation of how flight examiners think while controlling for any particulars of
thinking as a function of the task.
Insert Table 1 about here
Participants
A total of 23 flight examiners (mean age = 47.9 years, SD = 8.4) from 5 airlines participated in this
study; all were male. Eight flight examiners took part in debriefings only, with different numbers
of sessions (e.g., three were recorded during one session, four during two sessions, etc.; Table 1).
Two examiners were recorded surrounding the debriefing sessions and in stimulated recall
concerning one of the sessions. Four flight examiners participated in all three tasks (debriefing,
stimulated recall, and think-aloud protocols); and 8 individuals participated in the think-aloud
protocols only (Table 1). All flight examiners were experienced pilots, with a mean of 25.2 years
(SD = 8.7) as commercial pilots and mean accumulated flying time of 13,200 hours (SD =
4,140). They had served as flight examiners from 1 month to 23 years, with a mean of 8.5 years
(SD = 6.6).
At the time of the study, the flight examiners worked for regional airlines that had been
selected based on a 2 x 2 factorial design: (a) use (airlines B, D, E) or not (airlines A, C) of an
explicit, human factors based model of assessment of pilot performance (MAPP) (Mavin, Roth,
& Dekker, 2013) and (b) use (airlines C, D) or not (airlines A, B, E) of a debriefing tool, which
allows flight examiners to replay part of a simulator session featuring a video, some of the
instruments (e.g., electronic attitude director indicator [EADI], electronic flight instrument
system [EFIS]), and some actuators (e.g., control column, flap levers, power levers).


All participants in the debriefing and think-aloud parts of the study were randomly selected
among those who were willing to participate and whom the company roster had available during
the field work periods. The flight examiners in the stimulated recall sessions had time because of
scheduling or were freed by their airline to be able to participate.
Ethics
This study was designed in collaboration with the participating airlines. In addition to being
approved by the university ethics board, approval was received, where applicable, from the
respective labor unions. All potential participants were guaranteed that their non/involvement in
the study would not affect their employment status; and participants were free to leave the study
at any time or to withdraw their data from use. No participant withdrew.
Tasks and Task Settings
Debriefing at work. Flight examiners were recorded during debriefing sessions that
followed actual examinations or training in the flight simulator. The flight examiners were
interviewed following the first half of the 4-hour simulator session and at the end of the second
half, immediately prior to debriefing. The flight examiners were interviewed again immediately
following debriefing. Participants were asked about what had been salient to them during the
simulator session, what they intended to bring up during debriefing, and how they felt that the
examinees/trainees were doing. Following debriefing, participants talked about their thinking
from the beginning of the simulator session to the end of debriefing, how they arrived at their
assessments, and how they selected what they intended to debrief. The interviews were semi-structured, containing both specific questions directed to every participant (e.g., "What stood out
for you in the session?", "What and how could debriefing be improved?", or "At what point did
you decide on your assessment and how?") and providing opportunities for participants to
articulate any issues that they deemed relevant.
Stimulated recall. In stimulated recall sessions, participants are asked to talk, after the fact,
about their reasoning during the assessment, generally on the basis of videotapes of the performances
of interest; the approach is commonly used in research on rater cognition (Suto, 2012). Participants
talked about their thought processes during the period from the beginning of the simulator session
to the end of the debriefing while snippets were replayed to them.


Modified think-aloud protocols. Think-aloud protocols constitute a standard method in
cognitive science for investigating the nature of expertise (Ericsson & Simon, 1993); the method
was employed in a modified form that has been shown to be successful in recent aviation-related work
(Roth & Mavin, 2014). Three pairs of flight examiners each from airlines A (no-MAPP, no
debriefing tool) and D (MAPP, debriefing tool) were asked to assess both crewmembers
(captain, first officer) who appear in three videotaped scenarios.
Data Collection
Debriefing-related. The debriefings lasted between 11.1 and 57.2 minutes (M = 36.4, SD =
14.3) for a total of 17.6 hours; approximately 8 hours of interviews associated with the
debriefings were recorded. Debriefings were recorded in the respective companies' regular
facilities using two cameras that captured all parts of the rooms; a third, laptop-based camera was
used as a backup.
Stimulated recall. A total of 5.1 hours of stimulated recall were recorded. Participants were
shown excerpts from the debriefings they had conducted. One camera was used, recording the
debriefing video and the notes that the flight examiner had taken during the session.
Modified think-aloud protocol. The think-aloud sessions were recorded using three
cameras, one showing the workspace (notes), one featuring the pilots head-on, and a third
recording what the participants were currently watching on their TV monitor. A total of 13.8
hours of think-aloud protocols were collected.
Across-task materials. In addition to the recordings, we draw on informal interviews and
observations from our ongoing ethnographic studies. The database includes all aircraft-specific
systems manuals, manufacturer- and airline-specific standard operating procedures and
procedures for abnormal situations for the different aircraft involved. It also includes the
authors' field notes and all analysis-related and information-seeking exchanges with training
managers.
Analyses
Settings and processes. All recordings were transcribed verbatim in their entirety. The job
was contracted to a commercial provider with access to an individual who has aviation
experience. All transcriptions were verified in their entirety by the authors. For the purpose of


analysis, the different videos of the same situations were combined into a single display. To
analyze the data, we drew on interaction analysis (Jordan & Henderson, 1995), an
interdisciplinary method for the "empirical investigation of the interaction of human beings with
each other and with objects in their environment" (p. 39). It involves groups of researchers, both
those who own the project and colleagues interested in data sessions, who jointly analyze the
data with a commitment to ground their assertions and theories in the empirical evidence
available. Every assertion, every claim, has to be supported by evidence from the tapes
(transcriptions). Analyses began during jointly conducted fieldwork, often by replaying video in the
evenings following the recordings. Weeklong analysis sessions, sometimes involving
colleagues from other disciplines, were held for developing the contents of the findings.
Analytic expertise. The analyses of cognitive tasks require related competencies. The
second author had 22 years of experience as a commercial pilot before becoming a university
professor; he continues working as a flight examiner for a major aircraft manufacturer, and
provides workshops for flight examiners at different airlines. The third author has a total of 28
years of military and civil flight experience (8,500 flight hours), has been a flight examiner for 9
years, and currently serves as a training manager. The first author is an applied cognitive scientist
with extensive experience in the study of cognition at work. For the past three years, he has
engaged in a cognitive anthropological study of assessment and debriefing in aviation. As part of
this work, he flew small aircraft, had simulator sessions flying larger aircraft, observed
simulator-based examinations, and accompanied pilots in the cockpit during regular line
operations.
Findings
So we're using specific examples to cover a generic fix. (B3)

This study was designed to investigate how flight examiners think while assessing pilots
during regulator-specified mandatory examinations. The introductory quotation from one of the
most experienced participants in this study (18 years as flight examiner plus 10 years as standards
and training manager) captures the essence of the flight examiners' method: During the
examination in the simulator, the flight examiner develops a "generic fix," a sense of the pilots'
current abilities, and then uses specific examples from the flight session as evidence, that is, as


concrete manifestations of the presupposed underlying phenomenon denoted by the "generic fix."
This is the essence of what has been called the "documentary method of interpretation"
(Mannheim, 2004). In this method, observations are taken as evidence for, or documents of, an
underlying reality while using this reality as a resource for explaining or interpreting the
observation (Suchman, 2007). It corresponds to the mundane idealizing of reality (Pollner,
1987). All flight examiners, without exception, use the documentary method of interpretation
(Table 2). In fact, it has been suggested that this constitutes an everyday, mundane method of
making sense of the world (Garfinkel, 1967). The documentary method was employed even
when the flight examiners worked with an explicit model of assessment of pilot performance
with an associated assessment metric that mapped performance descriptions to a score (e.g.,
"Unable to recall facts or made fundamental errors in their recall" = 1 [unsatisfactory] for
knowledge/facts, or "Adequate organization of crew tasks" = 3 [satisfactory] for
management/workload). That is, rather than engaging in the measurement of assessment, flight
examiners employ ways of categorizing and explaining observations that underlie mundane and
formal scientific reasoning methods (e.g., Bohnsack, Pfaff, & Weller, 2010).
Insert Table 2 about here
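A metric of the kind quoted above can be sketched as a simple lookup from performance descriptors to scores. Only the two descriptors and their scores come from the examples quoted in the text; the category labels, the pass threshold, and the function names are illustrative assumptions, not the airlines' actual instrument.

```python
# Illustrative sketch of a MAPP-style assessment metric that maps observed
# performance descriptors to scores. Only the two descriptors and scores are
# taken from the examples quoted in the text; the threshold is assumed.
MAPP_METRIC = {
    "knowledge/facts": {
        "Unable to recall facts or made fundamental errors in their recall": 1,
    },
    "management/workload": {
        "Adequate organization of crew tasks": 3,
    },
}

PASS_THRESHOLD = 2  # assumed: scores at or above 2 count as satisfactory


def score(area, descriptor):
    """Return the score associated with an observed performance descriptor."""
    return MAPP_METRIC[area][descriptor]


def is_satisfactory(area, descriptor):
    """A category is satisfactory when its descriptor maps at or above threshold."""
    return score(area, descriptor) >= PASS_THRESHOLD
```

The point of the sketch is that such a metric is a pure mapping: the hard part, which the findings below address, is deciding which descriptor an observed performance instantiates in the first place.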
Flight Examiners Use a Documentary Method
The documentary method of interpretation as described in the literature is based on three
levels of sense: objective sense, expressive sense, and documentary sense. However, the
expressive sense pertains to a social actor's intentions, which cannot be objectively obtained. In the
present study, inferences that the flight examiners make about pilots' intentions, therefore, are
treated as special cases of the documentary sense.
Objective sense. The objective sense of a situation refers to what different observers can
actually see and agree upon: their facts or their evidence. Flight examiners identify
indisputable facts, for example, that a pilot has (not) pushed the go-around button, what the
precise speed is (e.g., 145 knots, white bug + 10), or what the torque gauge reads. In the
following example from a debriefing, the flight examiner lists a set of observations that
constitute his objective sense, the facts used in the assessment of the pilots.
You were at 34 or something like that. Not a long way out. Told ATC. Made a PA. And then


came back . . . to around the 240 to 250 indicated mark until we got below 10. And then we were
sitting at 235 knots. (B3)

The list includes where the pilots had made the decision to turn around (35 DME), that there
was an exchange with air traffic control followed by a public announcement. They were flying
at a speed between 240 and 250 knots until they were less than 10 nautical miles on the
distance measuring equipment (DME), at which point they were flying at 235 knots.
Such lists do not in themselves constitute an evaluation but are (a) used as manifestations of
underlying intentions and (b) treated as the manifestations (documentary evidence) of one or
more underlying, not directly observable forms of knowledge (aircraft or standard operating procedures),
skill (manipulative, communicative, management, or decision-making skill), or state (e.g.,
situational awareness). Although more imprecise and fuzzy, non-technical areas are described in
terms of concrete, observable evidence. For example, a flight examiner described the slow
responses of a first officer observable in the cockpit, during debriefings, and during regular
conversations, which he suggested could be verified by the interviewer (see Table 3).
[The first officer's] execution of procedures is, it's slow. First officer's response to everything,
and you'll find this when you talk with him, is a very slow response . . . there's considerable
delay and then you get a response. And you generally get the right response. (B1)

Insert Table 3 about here


Documentary sense. Over the course of a session, the flight examiners built up a
whole/holistic sense of pilots (a "generic fix"), the evaluation of their skills, as per the
documentary evidence that is indicative of and explains the actual performance. For example, in
an icing condition during a single-engine approach, the pilots had not entered the correct speeds
in their landing speed card. All speeds (VREF, VAPP, VAC, and VFS) should have been the same.
This observation contributed to the flight examiner's sense that the pilots had poor time
management and, as a result, forgot to enter the speeds as per operating procedures ("It comes
back to managing your time and what you actually want to achieve" [E1]).
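The speed-card observation is an example of a fact that can be checked mechanically. The rule (that all four card speeds should be identical under this particular icing, single-engine condition) is taken from the text; the function name and the knot values in the usage note are illustrative assumptions.

```python
def speed_card_consistent(vref, vapp, vac, vfs):
    """In the scenario described (icing, single-engine approach), all four
    landing-card speeds (VREF, VAPP, VAC, VFS) should be identical.
    Returns True when the card satisfies that rule."""
    # A set collapses duplicates, so a consistent card yields a single value.
    return len({vref, vapp, vac, vfs}) == 1
```

For instance, a card entered as (120, 120, 120, 120) passes the check, while any deviating entry fails it: the kind of objective, indisputable fact that can then serve as documentary evidence for an idealization such as "poor time management."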
Three categories of idealizations can be identified in the data: non/proficiency (pass/fail),
(non/technical) skills and knowledge (e.g., handling, decision-making, management,
communication), and states/processes (e.g., situational awareness, thinking).1 All of these
idealizations are mundane and uncontroversial cultural objects denoted by the language shared
within the aviation community or specific airline (e.g., Mavin & Roth, 2014). These
idealizations, which are taken to underlie the observed performance, are not directly
accessible. Instead, the word or language used denotes a sense that arises with observations.
They might say, for example, "I had this gut feeling that something's not right. Whether it was
body language or something I'd seen, I'm not sure. But something didn't sit with me" (D4). This
gut-level sense, which often begins with the flight examiner's observations of pilot behavior
during the briefing preceding the simulator session, subsequently is worked out in terms of the
evidence, as B3 said in the above quotation, "We are using specific examples to cover a generic
fix."

1 For cross-referencing purposes with Table 1, the participant ID is given at the end of a
transcription excerpt (e.g., "B3," where "B" refers to airline B). Square brackets (i.e., [. . .]) enclose
descriptive information; chevrons (i.e., <. . .>) enclose replacements for proper names of persons,
cities, and airports.
Mutual determination of objective and documentary sense. Flight examiners are tasked
with an assessment of pilots' competency (proficiency) levels, holistically or, as in airlines B,
D, and E, in terms of ratings of a set of human factors. Any idealization arrived at is
based on what flight examiners actually observe, the objective sense of the situation (facts, actual
performance), which is taken to be a manifestation (document) of what by nature is
unobservable. There is therefore a reflexive relationship between concrete observations and the
idealization of the underlying reality (phenomenon): the former lead to the emergence of the
latter, but the latter explains the presence of the former. In the following example, recorded
during a debriefing session, the flight examiner justifies a passing grade to a worried first officer
who has had some performance problems in the past.
You maintained situational awareness and were able to make the airplane follow the correct flight
path at all times. The decisions that you had to make today were easy ones . . . considered
all the points, and I saw a little bit of evidence of that early on in your decision to divert to
<airport 1>. You said, "Okay, we need engineering and we need runway length," which then kind
of, that's <airport 2> out the way. And obviously <airport 1> was the closest place. So I saw clear
evidence that you were actually diagnosing the situation and making sure that you considered all
the facts that you need to consider to generate the options, which enabled you to make your


decision. (B3)

In this explanation, the maintenance of a correct flight path is evidence for the pilot's
situational awareness, the overt consideration of requirements for a diversion is evidence for
decision-making/diagnosing, and the active selection of an appropriate alternate airport is
evidence for decision-making/option generation. The derived state of situational awareness
becomes a master concept, which is both evidenced in observational behaviors and performances
and explains these. The factual evidence determines the examiner's sense that the pilot has
satisfactory decision-making skills (here the options and diagnosis dimensions), and the satisfactory
decision-making skills explain the observed performance. This holistic sense in turn mediates
what flight examiners are looking for, and, therefore, what they collect as data and the intentions
that they take to be expressed in the objective facts. The relationship between an evolving
documentary sense and the objective sense of the situation can be seen at work in the thoughts of
a flight examiner from an airline using the explicit, human-factors-based Model of
Assessment of Pilot Performance (MAPP):
If I see something go wrong, then I sit there myself going, "Rightyo." I then, as you say, visualize
the MAPP and go, "Okay, well where's this fit in it? Did they lose situational awareness?" ((Points
to item on visual MAPP model)) "No. Okay, well what else could it have been? Well they flew the
aircraft within tolerances." ((Points to item on MAPP)) "Decisions?" ((Points to item on MAPP))
"Yep, they decided to go to <airport>. Right call." And I start trying to cross bits off and then
narrow it down myself. So reckon it's management of the crew. (E1)

The conceptual tool (MAPP) provides the flight examiners with a way of mapping some
observable expression, a manifestation, to a presumed underlying performance or skill
(idealization).
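The abstract anticipates mathematical approaches with the potential to model this form of reasoning. One candidate, offered here only as a hedged illustration and not as the authors' proposal, is iterative Bayesian updating: each observed fact re-weights belief in an unobservable idealization (e.g., satisfactory decision-making), while the current belief shapes how further observations are read. The hypotheses, observations, and all probability values below are invented for the sketch.

```python
# Sketch: documentary-method reasoning as iterative Bayesian updating.
# The examiner entertains hypotheses about an unobservable skill state and
# treats each observation as a "document" (evidence) of that state.
# All numbers are assumed for illustration only.

HYPOTHESES = ["satisfactory decision-making", "unsatisfactory decision-making"]

# P(observation | hypothesis): how likely each observed fact is under each
# presupposed underlying state (assumed values).
LIKELIHOODS = {
    "considered diversion requirements": {
        "satisfactory decision-making": 0.8,
        "unsatisfactory decision-making": 0.2,
    },
    "selected closest suitable airport": {
        "satisfactory decision-making": 0.7,
        "unsatisfactory decision-making": 0.3,
    },
}


def update(prior, observation):
    """One Bayesian update: the observation re-weights belief in each hypothesis."""
    unnormalized = {h: prior[h] * LIKELIHOODS[observation][h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}


belief = {h: 0.5 for h in HYPOTHESES}  # uninformative prior
for fact in ["considered diversion requirements", "selected closest suitable airport"]:
    belief = update(belief, fact)
```

The reflexivity described above would show up in a fuller model as the belief also influencing which observations get made at all (the examiner's "looking for"); the sketch captures only the evidence-to-idealization direction.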
How Flight Examiners Evolve the Objective Sense
To arrive at conclusions about pilots' proficiency levels, knowledge, (non/technical) skills, or
states (situational awareness, thinking), flight examiners require documentary evidence on which
to base their assessment. In the simulator, they observe and generally take some notes in real
time without time out or recourse to revisiting an event. These notes are used both for
establishing the record of the flight performance observed and for the debriefing, where they


describe to the pilots what they have done for the purpose of critique or praise. In this subsection,
we report findings concerning the process of establishing the documentary evidence for flight
examiners conclusions.
Flight examiners differ in which facts and how many facts they identify. The assessments
are based on documentary evidence. One might ask whether flight examiners identify the same
kind and number of facts. This is difficult to establish in the context of regular examinations but
can easily be done when, as in the present case, the same flight segment is evaluated with the
possibility of repeatedly replaying the segment. Whereas there is little debate about facts once they
are articulated (e.g., "the calls were non-standard at the bottom of the approach" [A5]), the
modified think-aloud protocols, which control for the assessment situation, show that there is variation
between the flight examiner pairs in whether a fact is actually noted and therefore taken into
account in the assessment (Table 4). There tends to be no debate about what the standard
operating procedures say and whether a pilot action is consistent or inconsistent with these. For
example, only two of 6 flight examiner pairs noticed that the pilot in the scenario did not push
the go-around button, the first step specified for a go-around in the standard operating procedure.
Pushing this button disengages the autopilot, which otherwise, by means of the flight director,
continues to direct the pilot downward in the approach rather than upward. Because this step is missing in the
kinetic sequence of the cockpit as a whole (e.g., Roth, Mavin, & Munro, 2014b), the procedure
that follows is "messy" (A3, A4, A5, B3, C1, D2, D3, D4), "untidy" (B3, D4, D6), or otherwise
deemed inappropriate. But the origin of the messy procedure is not apparent to four of the
examiner pairs. Three pairs noted that the captain in the scenario was flying "against the bars,"
that is, had a positive rate of climb whereas the command bars directed him to head down. Two
of these pairs identified the missing engagement of the go-around procedure (pushing the go-around button) as the source of this divergence. Finally, only three pairs noted the crucial fact
that passengers were evacuated on the side of the running engine after landing with a fire on the
other engine (Table 4). That is, facts about instruments, actuators, and observable performance
constitute a baseline that is relatively undisputed.
Insert Table 4 about here


There is considerable variation in terms of the total number of facts articulated and taken into
consideration when flight examiners articulate the evidence on which they base their assessment
decisions. In the context of assessing in simulator sessions, flight examiners take notes of what
they observe. But the extent of these notes varies widely. We therefore investigated the number
of facts while holding constant the event to be assessed. Thus, in the scenario with the inappropriate
evacuation, the flight examiner pairs made salient and took into account different facts and
different numbers thereof (from 1 to 9) (Table 5). However, all those pairs who noted the evacuation
on the side of the running engine failed both captain and first officer, whereas those who did not
notice it passed both.
Insert Table 5 about here
Flight examiners tend to be aware of the limitations of their evidence. Flight examiners
tend to be aware of the limitations of the documentary evidence that they obtain. They often find
out in the discussion with the pilots that they have missed something (e.g., "I must admit I didn't
actually notice at the time too. It was only a bit later when I went 'oh, what's going on here?'"
[E1]); or they report themselves having missed something (e.g., "I failed to note the point when the
autopilot was turned off" [B1]). In part, situations in which the flight examiners do not take
notice of important facts arise while they take notes ("I don't see with heads down" [A3]). While
having their heads down to write down observations, they are missing other potentially
relevant flight-related facts. As a result, examiners find themselves in situations where their own
observations and those the pilots report differ. This is frequently made explicit in the training of new
flight examiners: "they teach us to try not to get yourself in that situation, because it's quite, a bit
sort of, you know, he said, she said. I said, they said."
Flight examiners noted the inherent contradiction in their task: To get the documentary
evidence that they need to support a pass/fail decision or their assessment of underlying skill
levels, they need to record their observations. But in the production of such notes,
they miss out on observing flight-relevant actions. The flight examiners from airline B explicitly
focus on observation while taking the scantest of notes (1-2 pages, some 15 observations). They
subsequently review their notes and what they remember in addition, pulling together all of the
information to arrive at an overall assessment as well as at assessments of categories of
performance, some of which may require special attention.

I think keeping the notes is actually the thing that's distracting. I find myself starting to note
something down, I'll see something else that's happening and so I'll stop what I'm doing, take
note of what's happening and then I forget what I was writing down in the first place. And that's
lost. Sometimes. That's a bit of a pain. But you still get the overall picture. (B3)

The flight examiners in the other airlines, too, tend to take brief rather than extended notes
(up to 5 pages for a 4-hour session). These notes in themselves are insufficient as a repertoire of
facts (I keep my notes pretty short, so if you read them they probably wouldn't make a lot of
sense to you. But it's just a few words to jog my memory [D3]). Instead, these notes trigger
(episodic) memory and allow examiners to bring back what happened and those facts that they
are using in the assessment. What matters to the flight examiners is the overall picture,
which is more important than a complete tally of all facts.
The conflict is mitigated to some extent for those flight examiners who have access to a
debriefing tool. This tool records the entire simulator session and includes a videotape of pilots,
shows what pilots view, and features representations of instruments and actuators. The debriefing
tool allows flight examiners to mark simulator events for subsequent replay in the debriefing.
The process of going from observation to assessment is mirrored in the use of the debriefing tool.
Thus, a flight examiner was observed marking 21 events for replay during a 4-hour simulator
session. However, he did not actually play all of these, focusing instead on four. The total
number of marked events gives him a selection to work from. In the end, as the overall picture
emerges, the flight examiner selects those that he deems most valuable in terms of
triggering learning. In airline D, the marking process has been adapted to their performance
model (MAPP) such that the flight examiners can now mark events according to the agreed-upon
performance categories (e.g., knowledge, communication, decision-making). Even with the
debriefing tool, reviewing one or more sequences for the purpose of getting all the facts may (but
does not have to) be prohibitive in terms of time available and returns for the investment.
Flight examiners engage in targeted evidence collection. In some airlines, records on the
preceding examination are kept. Individual flight examiners might keep their notes or remember
having assessed individual pilots repeatedly. In both types of cases, flight examiners use the
records or their memory to look for documentary evidence to support statements about whether
or not a pilot has improved: If he's still having problems with his engine failure after take-off
then we might have to dig a little bit deeper into it. And it just helps us tell whether something
that you see is random or systematic (B3).
Important here is that flight examiners and training managers want to see whether a particular
(poor) performance is recurrent rather than a one-off. Sometimes flight examiners and training
managers choose events such that the evidence required in support of their documentary sense is
produced. This evidence then is used to teach the pilot a particular lesson:
We know what areas they need to improve in and so sometimes, I have to confess, I would
introduce a malfunction at a difficult time for them to handle so that you can use it as a lesson
(B3). Across a flight simulator session, flight examiners look for multiple pieces of evidence to
support their assessment of an underlying factor. Thus, in most real examination cases and in
contrast to evaluating brief video scenarios (e.g., Roth et al., 2014a), it is not the performance in
one individual situation that determines the assessment. Instead, the flight examiners build their
case based on the overall performance during the simulator session. In the following quotation,
the flight examiner supports his rating of 3 (satisfactory) rather than a 4 (good) on the technical
skill of flying the aircraft within limits because of one instance: during a non-directional beacon
(NDB) approach, the aircraft was at the lower limit, which was taken as an indication that the
flight path management was problematic. Yet for the remainder of the examination, the pilot had
kept the aircraft well within the required limits.
Because we're looking at a whole, you know, 2-hour, 3-hour session. And for example, the first
officer got a three for flight path within limits on that exercise. Had it not been for the NDB
approach and the circling . . . he was only just fast enough. And so his flight path management for
the rest of the session was actually quite good ((i.e., rating = 4)). But that dragged it down. So it
was kind of holistic. (B2)

On rare occasions, an examination session is organized to have another flight examiner
provide an independent assessment. In such cases, the examiners use specific events to collect
evidence on the particular issues that the preceding examination(s) had identified. They then
obtain the observation that makes the overall decision go one or the other way (And as soon as I
sort of delved into that area, it was like, right, that's black and white [D4]).

Flight examiners' selection of events limits the types of facts that they anticipate to
observe. In the situation of the think-aloud protocols, flight examiners were confronted with
brief segments of flights without any knowledge of the context. The situation on the job is different
because flight examiners do not identify arbitrary facts. Instead, having programmed the events,
they have readied themselves to observe specific facts that are associated with each type of event.
Moreover, in a particular examination cycle, all pilots fly the same line-oriented flight segments
and do the same spot checks. From delays in required actions, examiners anticipate workload to
increase and pilots to come under time pressure, which results in a loss of awareness of the
situation as a whole. That is, flight examiners' perception is configured by the choice and timing
of events. It also allows them to anticipate facts related to particular human factors areas that are
more salient than others. Each event has a set of challenges, or boxes, and the flight examiner
observes whether or not they ticked every box (D3) and how well they do (The session has
actually got a little bit of stop start in it . . . to tick the boxes . . . so you see some slips and errors
that you wouldn't normally see [C2]).
Flight examiners use repeat performance to increase the amount of documentary
evidence. The account provided so far may sound as if flight examiners do their work in an
unprincipled manner. But this is not so. To make their cases for the presence of particular levels
of proficiency, knowledge, skill, or state, flight examiners require documentary evidence. They
do not take a single instance as a case for proficiency, knowledge, skill, or state. This is especially
important to them in those cases where the observed performance requires them to make a
pass/fail decision.
At that stage I hadn't failed him; but I hadn't passed him either. I was sitting there thinking,
We've got a 4-hour session here; we'll see how the rest goes. Depending on how the rest goes,
we'll need to come back and look at that. (D4)

Repeat observations. Flight examiners build their cases as they go along, taking their
observations as evidence that stand for some performance and the level thereof, which stand for
the underlying skill. For some, this mapping occurs immediately (e.g., supported by the
conceptual model), whereas others may wait. As the session progresses things may change
because something else might become more important: So I do have all those thoughts while I
am going through, but when I come out of the sim, I ask myself, What is the main, the big issue
here? (B3). Flight examiners then make one or a series of observations that determines their
decision:
So the guy next to him started managing it. And to me that was the point then when I had had two
individual exercises that weren't managed well. So I was like, rightyo, we've got issues here.
And by this stage I had already decided, you know, he's not going to pass today. (D4)

In another case, a flight examiner takes the fact that the first officer moves correctly through
the list of actions stated in the standard operating procedures as evidence for the presence of the
underlying knowledge. However, one step was missed in the same manner across situations
(He's actually leveled out twice now and hasn't pulled the power levers up): there were two
observations where the first officer had leveled out without pushing the power levers forward. In
each situation the observation is evidence pointing to a performance problem. This performance
problem occurs across situations and, in this, is consistent with an underlying skill issue. That
concern for manipulative ability is more serious than the concern with assertiveness, for which
the flight examiner has had evidence that it can be fixed. One of the observations he has made is
good performance when the captain was asked to fake incapacitation (by means of a
shoulder tap during the simulator exercise). In this situation, the first officer performed well
(stepped up . . . because he didn't have to deal with the person next to him [B1]).
Observations during repeats. When performances are problematic, potentially pointing to
underlying problems, flight examiners ask for a situation, segment, or exercise to be repeated to
collect further documentary evidence that allows them to get a better fix on a factor of interest.
Repeats provide further information about the proficiency, knowledge, skill, or state
underlying the pilots' performance. If the pilot(s) perform sufficiently well during the repeated
exercise, then this provides the flight examiner with evidence that there was an issue with the
particular performance, not with the underlying dimension.
We did the exact same exercise again and he made the same mistake. And then just went through
the whole session making individual management mistakes. So someone like that, it's actually
quite black and white. (D4)

Repeats may also provide the flight examiner with documentary evidence that an underlying ability is
present but occluded in the performance. Thus, talking about a first officer, a flight examiner
suggested: And that was the case with the first officer on a couple of occasions where, for
example, with the briefing the wrong flap setting for landing and briefing the wrong speed. He
knew it, he just hadn't realized it (B3).
Flight examiners do not seek to ascertain the nature of evidence even when technology
affords it. With the debriefing tool, flight examiners do have the possibility to replay some event
and to ascertain the nature and number of facts (documentary evidence). But nowhere in the
present dataset does a flight examiner use or talk about using the debriefing tool to check an
observation. When an observation was checked, it was always because of discrepancies between
the flight examiner's and a pilot's descriptions of what was the case.
How Flight Examiners Develop and Articulate Their Documentary Sense
The documentary method of interpretation is a common, everyday method for determining
some assumed underlying pattern that also explains the observation, e.g., for finding out what
someone thinks, for a coroner to determine the course of events that led to a death, or for a
historian to describe the worldview of an era (Garfinkel, 1967; Mannheim, 2004). But precisely
because it is an everyday method, it is so powerful: the method not only helps in making sense
but also intuitively makes sense. Flight examiners employ the documentary method of
interpretation to determine whether pilots are non/proficient (pass/fail), what their non/technical
knowledge and skills are, or what the pilots' situational awareness is. All of these phenomena are not
given in themselves: they are cultural constructs that are held to manifest themselves in observables.
These constructs therefore exist only in and as documentary sense.
Viewing the same scenarios, flight examiners evolve different patterns taken to underlie
performance given in documentary sense. Previous studies suggest considerable variation in
the ratings of pilots and flight examiners asked to assess the same video (Flin et al., 2003; Mavin
et al., 2013). Such variation is also observed here in the form of different appreciations of the
proficiency or non-proficiency of a pilot. Thus, the 6 flight examiner pairs did not come to
complete agreement on the level of the performance of any of the six pilots they assessed in the
think-aloud part of this study: no two pairs had the same ratings across the six pilots (Table 6).
That is, even in a condition where flight examiners work in pairs such that individual subjectivity
is minimized, different conclusions are observed. What previous studies have not explained are
the reasons for such variations.
Insert Table 6 about here
The flight examiners do not have access to the knowledge and skills underlying performance
or to a pilot's grasp of the situation (i.e., situational awareness). Here, as in the case of whether
to pass or fail the pilot, they use the documentary method of interpretation. Because the cultural
objects are not given directly but indirectly, through the manner in which they manifest
themselves, and because flight examiners differ in the contents of their observations, as shown
above, the differences in flight examiners' overall documentary sense become intelligible. The
flight examiners are most concerned with overall proficiency, which they tend to ascertain by
means of the question whether they would want themselves or their family members and friends
to be a passenger on the aircraft flown by that pilot. If the response is yes, then the pilot passes; if
no, the pilot fails. This is so independent of the root causes (i.e., human factors) attributed to
non/proficiency:
And at the end of the day, I don't actually think it matters that much what you call it, as long as
you call it something. And you can say, Look, what I did notice during this exercise, I know you
know this stuff, but you just couldn't recall it. (B3)

The documentary sense begins with an indeterminate feel that articulates itself over time
into a more grounded sense. Flight examiners observe pilots over the course of a four-hour
period and then make their assessment. But their sense of how a pilot is doing emerges early,
often during the initial encounter in the briefing preceding the simulator session but certainly as
soon as the session begins:
I guess one thing I'm thinking in a 4-hour session though is I've got no hurry to make up my
decision. You know, you do, and of all the times in the past where I've had to not pass someone,
there's always some stage during the session where I've gone, No, they haven't passed, they
haven't failed. And then at the end you might say, Well you need to come up with a result.
(D4)

The beginning tends to be some very general and generic description without much concrete
(objective) evidence (e.g., I've found the FO is very introverted and he's either intimidated by
the simulator or he's intimidated by the process [A1], He's a plodder [B1], They are going
reasonably well [A3], or They're getting through okay [A4]). Sometimes flight examiners
note that their sense begins with observations of pilots' body language (A4, C1, D3, D5). The
overall sense of whether a pilot is proficient or not, while evolving over the entire simulator
session, may start as soon as the session begins (e.g., So the big picture actually developed over
the whole session. It might start as soon as you walk in [C3]). This is so because there are
training sessions preceding the actual examination. Their observations during these sessions
configure the flight examiners sense at the beginning of the examination session:
And we spent three days leading up to it . . . so it wasn't just a one-off sort of day. However, over
those three days I'd sort of continued to work at all these things and we were making progress.
And my thought was, well he's going to get through. It's not going to be a great pass, but as long
as he keeps improving he'll be fine. (D3)

As the examination session evolves, there is an increasing fixation of the general sense,
deriving from the increasing amount of concrete evidence available that can be used in the
documentation of the case ultimately made. Oftentimes flight examiners say with hindsight that
the problem had shown up from the beginning, e.g., in the body language of the pilot.
The evolving documentary sense shapes subsequent observations. When some event occurs,
it cannot be known whether the problematic performance will be recurrent. It is only with
hindsight, after having repeated events, that flight examiners will and do attribute the problem to
some inherent shortcoming in the pilot, which leads to a fail rating. There therefore is path
dependence in the evolving overall sense concerning non/proficiency.
If they appear flustered, straight away, quite often at that point in time, they'll say the wrong
thing, they'll say, Unscheduled feather, when it's really a prop overspeed. That sort of thing's
quite common. And so usually if they're going to start making mistakes, it's going to be a poor
performance, it starts happening quite early on. (B3)

Flight examiners seek further evidence (implicitly or explicitly) to confirm or disconfirm the
current documentary sense. As the modified think-aloud protocols reveal, in the attempt to locate
specific facts, flight examiners tend to find more negative evidence. In none of the 20 fail or pass
with marker cases was there an evolution from a more negative to a more positive sense. Instead,
flight examiners either began with or moved to a more negative sense concerning a pilot's
performance. Thus, the two examiner pairs who noted the emergency evacuation into the running
engine while viewing the scenario had the definitive sense that it was a fail. One
flight examiner pair, during one of the repeated viewings, noted the running engine, which led to
the reversal of their earlier sense of a good performance (pass) to a definitive fail.
By means of the documentary sense flight examiners evolve an entire explanatory
framework. Together with the general sense of overall proficiency that flight examiners evolve
based on their observations also arises an explanatory framework. Their observations contribute
to the emergence of a sense (e.g., level of situational awareness), which then explains the fact
observed. The associated idealization (e.g., situational awareness) might then be explained by
something else that is based on evidence (e.g., [workload] management). Thus, the fact that a
pilot delayed some task might be taken as evidence that there are problems with management,
which may have high workload as its consequence, which in turn lowers situational awareness.
That is, taken together, all the explanatory terms that flight examiners use in their assessment
discourse, e.g., knowledge (facts, procedures), management, communication, decision-making,
and situational awareness (Mavin & Roth, 2014), constitute the shared explanatory framework
based on the documentary method of interpretation.
Flight examiners distinguish between underlying patterns and momentary lapses. On the job,
flight examiners distinguish between momentary lapses and underlying problems. That is, flight
examiners distinguish between actually observed performance (evidence) and the presumed
underlying pattern. Flight examiners have to find out whether a poor performance is a
manifestation of an underlying problem or the result of something else. In the following
example, the captains was moving forward in the direction of the power transfer unit switch,
which sits right below the radio magnetic indicator, itself below the airspeed indicator. The
movement itself, which can be objectively seen, is a manifestation of the intention to go for the
power transfer unit (PTU) switch. The movement stops as a radio call comes in, and the pilot
attends to the call. The flight examiners suggests, [the captain] had a brain fart after takeoff and
forgot to turn the PTU off because a radio call came through at that exact same time his hand

Flight Examiners Methods

22

was going to it (A3). The result is a brain fart, the pilot, upon returning from the call, does not
complete the action ascribed to the earlier movement. The sequence becomes a manifestation,
documentary evidence, for poor management in this situation. In contrast to the think-aloud task,
where flight examiners only rated single episodes, on the job where they observe pilots over two
4-hour sessions, they tend to reason in this way:
Good people can have a bad day in the sim and still come out smelling like roses because they've
got good management and good communication. It might have been a lapse or something that got
them into that situation, however, their tools in their tool bag, their good management and
communication increases their own situational awareness. (D4)

This may explain the following observation. In the think-aloud tasks, five flight examiner
pairs failed the captain and one pair rated him repeat with markers for being confused about the turn
(left or right) following a missed approach call (Table 6 below). On the other hand, both captains
who made wrong turns during the examinations observed in this study actually passed their
examinations. Although the wrong turns were recognized as serious errors, the captains passed
because they had exhibited good performance for the remainder of the two 4-hr sessions.
Situational awareness is a master concept that flight examiners evolve by means of the
documentary method. In the scholarly literature, there is a debate whether situational awareness
really exists or whether it is part of a folk model of human factors (e.g., Dekker & Hollnagel,
2004). In the present study, situational awareness is one of three types of cultural objects derived
by means of the documentary method. It has the status of a master concept in the explanations of
pilot performance even though, and perhaps because, flight examiners find it hard to assess.
Flight examiners treat situational awareness as a state rather than as a (non/technical)
proficiency, knowledge, or skill that needs to be maintained. However, there is awareness that
this dimension cannot be measured but is something (a cultural object) that manifests itself in
concrete actions, which stand in a mutually constitutive relation with the underlying state (level of
situational awareness). If pilots did not have situational awareness, they would not be able to fly
correctly; and flying correctly inherently means having situational awareness.
If they didn't have good situational awareness (SA), they wouldn't be able to do it. Because like
we said before, you can't measure SA as such. You're, you're judging how good their situational
awareness is based on the results of their situational awareness. (B3)

This flight examiner articulates, in his words, the core of the documentary method, the
reflexive relation between evidence (results) and idealization (situational awareness). Situational
awareness cannot be assessed directly; in fact, it is different from the other performance aspects in
that the flight examiners treat it as a state that is affected by a range of human factors and
circumstances.
They explain this state in terms of other human factors, inaccessible directly but manifesting
themselves in concrete actions and performances: He could work on bettering knowledge,
because that would enable his management and pick up his situational awareness (B2).
Flight examiners employ the documentary method of interpretation even when an explicit
assessment model and metric exists. This framework might be the same as the assessment model
that the airline uses, or it might be framed in terms of other human factors-related concepts that
characterize the culture-specific discourse of flight examiners (e.g., automation management,
manipulation, or compliance), or in terms of other concepts that provide an explanation for a
range of concrete observations (brain fart, airmanship, currencies). Pass/fail decisions tend to be
explained in terms of human factors-related concepts, themselves given in the form of
documentary sense:
With the pilot, I said, I'm not comfortable now. But when I came back to the MAPP,
Management, ineffective organization of crew tasks ((1 [minimum standard])). So it was like,
Yep, controlled self or crew member actions [though with difficulty]. . . . And when I came
back and then transferred it on to here ((assessment metric)) suddenly I found myself down here
((1 and 2 ratings)). And . . . this is why I have to not pass you. That gut feeling is confirmed. (D4)

Here, the flight examiner used the word pictures of the assessment metric (Ineffective
organization of crew tasks = 1 [unsatisfactory]; Controlled self or crew member actions,
though with difficulties = 2 [minimum standard]) to translate between the performance he had
seen and the ratings. When he tallied his ratings on the different human factors of the assessment
metric, he ended up with a fail according to the company policy (one 1 rating or three 2 ratings). The
reverse has also been described: when the tally of scores comes to a different result concerning
passing or failing a pilot than what their general sense is telling them, flight examiners describe
changing individual scores to align the content of the sense with the outcome according to the
assessment categories and then use these to explain the level of performance.
Flight examiners include informal cultural objects to constitute the overall sense of
proficiency. Other, even more mundane explanations can also be observed. Thus, for example,
the flight examiner in the following excerpt relates all his observations concerning handling
techniques in different situations to the fact that the captain also does the duties of a check
captain and, therefore, is not or cannot be as current on technique as a regular line pilot, who
does much more actual flying:
There was a few little handling techniques, training captain. I think it also comes down to training
and check captain, it comes down to currency as well, a big thing for the check captains. Just
currencies, because they obviously don't fly as much as the line guys do. (C2)

Flight examiners do know that there is uncertainty in establishing just what the observation is
evidence of. This is evident in the statement one flight examiner made: Sometimes it's a bit
difficult to tell whether it's a lack of knowledge or a lack of SA [situational awareness], because
sometimes they actually know it, but they just didn't notice it (B3). Here, the flight examiner
suggests that the underlying skill may actually be present or of a particular level but is not
expressed in that situation.
The documentary method amounts to good storytelling. Some flight examiners describe
their method in terms of storytelling. A good story binds together a large number of
observations into a simple story line. In this, a good story is similar to a parsimonious scientific
theory: it is convenient and convincing. Here, the documentary method amounts to creating a big
picture by putting together the right and required pieces of evidence that create, hold
together, and make plausible the overall narrative; and it is the overall narrative that drives the
selection of the individual pieces of documentary evidence. An experienced flight examiner in
the process of training a captain to become a flight examiner described the latter's natural ability
for doing this job and substantiated his assessment with an example from an informal setting in the
hotel's bar.
He can sit down and talk and discuss. It's storytelling, he's a storyteller. It's obvious, last night
we were having a few beers, he's a natural storyteller. People learn from that. And not everyone's
got that ability. (D4)

Flight examiners' level of experience changes the relation between facts and sense.
There are differences between beginning and experienced flight examiners with respect to the
dominance of one sense over another.
Experienced flight examiners let the documentary sense dominate the objective sense. The
overall assessment may override any other assessment, for example, one derived by rigorously
implementing an assessment metric:
It would be fair to say that sometimes you'll skew the results, I suppose. You know. If you think
somebody deserves to pass, but purely by reading the book, reading the metrics they're going to
end up with more than two twos, then I probably make one of them a three. If I really thought
they had to pass. And conversely, um, if the metric said that the person's passed, but I really
wasn't happy, then I'd probably skew the results in that direction as well. (B3)

Here, too, a minor variation in marking determines whether a particular dimension of the
assessment model is a minimum standard or satisfactory, which entails a fail or a pass. There is
an awareness of the subjectivity (indeterminacy, uncertainty) involved and that what is a 2 rating
for one examiner might be a 3 rating for another. This is why the overriding sense governs
whether a pilot at the borderline (cusp) will pass or fail, and the documentary evidence will be
adjusted accordingly.
Less senior flight examiners focus on individual facts. The debriefings show that less
experienced flight examiners tend to focus on what more experienced examiners refer to as
minutiae of small errors rather than on the big picture (sense). When they have an explicit
assessment model, less senior flight examiners may rely entirely on the assessment metric,
consistent with the observations in another study that describes the assessments made by captains
and first officers as driven by a tool whereas flight examiners are driven by their overall sense
(Roth & Mavin, 2014). Thus, beginning flight examiners, who tend to be more concerned with
individual facts, find themselves assisted in looking for the documentary evidence required for
making a particular assessment. They use the word pictures to map apparent observables (e.g.,
Manipulated accurately, with no deviations from target parameters) onto the associated score
(e.g., 5).
It's a lot easier because, you know, on our check forms that says ILS approach, it's quite simple
looking at the word pictures: Did they fly it with no mistakes? Did they fly it with minimal
mistakes or did they project forward what was going to happen? Did they project backwards? Did
they think ahead? (D2)

The boundary between pass and fail becomes clearer with increasing examination
experience. There are suggestions that sensitivity between adjacent scores is low in pilot
examinations (Holt et al., 2002). The decision whether to pass or fail a pilot based on the
performance during a simulator session, though never taken lightly, becomes clearer with
increasing experience. This is analogous to natural scientists' increasing competence in
classifying initially poorly distinguished natural objects, which comes with the increasing number
of cases that they classify in a study or over their career (Roth, 2005): I'm more able now to
discern, even if I can see something that's really poor, I'm still able to discern whether I'm going
to be happy at the end of the day to pass them or not (D2). Looking back at their development,
flight examiners realize that they passed or failed a pilot in the past that they would now rate
differently.
We've all had one session in the past where you think, "Now I wouldn't have passed that, but I
did then." And I've got that, I've got one and I know it, I can even tell you the time of night that
it happened and where we were and I can tell you the two crew. I know what they did wrong and I
know what I did wrong. (D4)

Although the overall sense of what constitutes the difference between pass and fail
ratings becomes sharper, discriminating between two adjacent scores remains difficult: "It's not
a one is there and that box and a two is in that box. There's like a 2.4 and a 2.5 and a 2.6" (B3).
Because of the difficulty of discriminating at that level, flight examiners tend to know that some of
them will score a pilot a 2 and others a 3. This description is consistent with the videotapes of the
think-aloud protocols, where the flight examiners using a rating scale for individual human
factors and components often vacillate between two adjacent scores.
Discussion
In this study, we provide evidence for and theorize the mundane way in which flight
examiners get their work done: the documentary method. Flight examiners inquiries concerning
pilot proficiency are based on the tight, reflexive relation between observational facts (evidence)


and mundane idealizations. In this section, we discuss the findings and then offer three
mathematical approaches that might be used to model (some aspect of) the documentary method.
Documentary Method in Pilot Assessment
This study was designed to investigate how flight examiners think at work, including line-oriented flight examination (LOFE), operational competency assessment (OCA), or air transport
pilot license (ATPL). The results show that flight examiners use a documentary method of
interpretation to arrive at their sense of the different cultural phenomena of interest:
non/proficiency, (non/technical) knowledge and skill, or state (situational awareness, pilot
thinking). The phenomenon initially is given in vague terms, a general sense that becomes the
seed to an evolving idea of whether or not a pilot assessed is proficient, or of what the pilot's level
of knowledge, skill, or state is. With an increasing number of observations, the sense tends to
become more specific as it is increasingly concretized in documentary evidence. There is a
movement from abstractness to a concretely grounded sense, while there is a parallel movement
from the concrete to the abstract, as the vague notion becomes increasingly structured and fine-grained. Ultimately the documentary method arrives at an explanatory framework in every
practical case. That is, the result of the documentary method covers everything, which is both its
strength and its downfall, as one commentator on social psychology notes: "In any actual case it
is undiscriminating and . . . absurdly wrong" (Garfinkel, 1996, p. 18).
Some readers may be led to think that the documentary method is the same as classical
concept learning (Bruner, Goodnow, & Austin, 1956), where research participants derive
concepts from instances and non-instances (Figure 1). There are some significant differences,
however. In the classical case, the observations (facts) are clear, as there are only limited
numbers of attributes; and the concept can be given in an unambiguous manner, such as two
circles or two boundaries in the case of Figure 1. In the assessment of pilots, however, the
phenomena of interest themselves are fuzzy, as are many of the perceptual attributes (Roth &
Mavin, 2014). Flight examiners talk about ideal performances, which correspond to the
prototype of a concept without such a prototype having to exist in any hard way (Rosch, 1998).
Ideal performances exist only as approximations, for even when the pilots examined are flight
examiners themselves, the examiner highlighted aspects that could be improved. Unlike in


the classical concept-learning paradigm, the flight examiners tend to actively look for evidence of a
certain kind or introduce specific failures that condition the kinds of problems that the pilots will
face and consequences of which the flight examiner will observe. In the documentary method
approach, a concept (e.g., situational awareness) exists in and as the totality of evidence and,
therefore, never is abstract.
Insert Figure 1 about here
Past research on pilot assessment noted the considerable variations in the scores used as part
of a measurement paradigm (e.g., Flin et al., 2003; Mavin et al., 2013; OConnor et al., 2002).
Flight examiners easily admit that assessing pilots is not a hard science. This experience is
captured in the notion of documentary sense. It goes together with an objective sense associated
with flight examiners' concrete observations that they take to be manifestations of some cultural
object (e.g., decision-making skill or situational awareness). The inter-rater reliability approach
to human factors is based on the assumption that phenomena such as pilots' knowledge,
decision-making, management, or communication are objective phenomena that can be
measured. When raters differ, problems are ascribed to the lack of rater training, the
measurement instrument, or some other variable. In this study, we show that the cultural objects
are not themselves given. Instead, they are treated as black boxes, the contents of which only
manifest themselves in some way rather than being directly given; but not all objective
manifestations reflect what is taken to be the real underlying pattern. There is mounting evidence
that in the flight examiners workplace, assessment is a categorization rather than a measurement
issue (Roth & Mavin, 2014; Roth et al., 2014a). We can find here the very source of the
variations observed in previous research on pilot assessment. Flight examiners do have (and can
give) good reasons for their sense that a pilot is or is not proficient, as can be seen when they
collaborate in an assessment in the think-aloud protocol part of the present study.
This study shows that flight examiners do not and cannot perceive all relevant facts
(attributes) of an event; what they do perceive mediates how they rate the performance. In a
more extreme example, those flight examiners who did not notice that the pilots assessed
evacuated the aircraft on the side of a running engine all passed the crewmembers, but those
flight examiner pairs noticing this fact all failed the crew. Whether the failure to observe such an


important aspect would also occur during a regular simulator session cannot be ascertained by
the data available. Given the control the flight examiners have over setting up the situation and
their awareness of the flight as a whole, such cases may actually be rare, though in this study there
were instances where flight examiners had missed important aspects of the flight, such as
the disabling of the automatic pilot.
Material phenomena are directly available such that they can be pointed to, taken in hand, or
readily agreed upon. Cultural objects, however, including "Galilean pulsars" (Garfinkel,
Lynch, & Livingston, 1981) or "help" (Mannheim, 2004), are assumed phenomena available only
indirectly: through their manifestations. Whether some material fact is a manifestation of an
underlying cultural phenomenon, a coincidence, or merely a contingency requires some
methodical approach. Flight examiners use a variety of methods to increase the evidence for or
against the existence of a phenomenon. Thus, for example, they select from a database of more
than 200 forms of incident that might affect the flight in progress. They then conduct
observations on the pilots performances in response to the disturbance at hand. In the end, they
produce an assessment of the pilot or, in training situations, identify a collection of different
issues that the pilots should focus on for the purpose of professional development. What the final
narrative will be is unknown at the beginning. Yet every observation possibly has a place in the
final story line, which is in part constituted by the observation. Which observations will be
included depends on the overall narrative, but the overall narrative depends on the observations
made and salient for the purpose. The narrative is an emergent one and can change from one
instant to the next in the case of a serious performance issue.
This study also reveals the reflexive relation between evidence and the phenomenon that it
supposedly manifests. That is, for example, a flight examiner's sense that a pilot has lost or
diminished situational awareness derives from a particular observation; but this observation is
made and explained by an assumed level of situational awareness (lost, diminished). The
objective sense and the documentary sense go together and cannot be uncoupled. This is similar
to the findings of a study of classification where sociology graduate students were asked to code
hospital records for the purpose of identifying the organized ways of an outpatient clinic that led
to particular patient trajectories (Garfinkel, 1967). The study showed that the graduate students


not only assumed the knowledge that their coding procedures were to reveal, but such
knowledge was also necessary to make decisions about what really happened in the outpatient clinic.
Studies show that experienced experimental biologists used this same method while attempting
to interpret, understand, and explain their data and the associated graphical representations
(Roth, 2014). Thus, the documentary method of interpretation differs from testing (given)
hypotheses, because the cultural object (proficiency, knowledge, skill, state) is itself a function
of the observations.
There is a temporal order when flight examiners work in the simulator, where they only have
one shot at making observations in any one instance. If they miss a real fact, it will not and
cannot enter the overall story line. When there is an opportunity for replay, such as with the
debriefing tool or in the case of the modified think-aloud protocol, facts may be discovered after
a first, second, or later viewing. (This was the case with the failure to push the go-around button,
which one pair of flight examiners noticed only after repeated viewing.) There is therefore an
emergent sense of what the narrative might be. After an initial assessment has been made, the
results can yet be revised, such that an initial pass (perhaps with markers) might turn into a fail.
In a small number of instances, flight examiners waited to find out what the pilots
had to say about a critical incident before assessing a particular event. Although rarely observed,
flight examiners do take up and take into account what they learn about some event into their
assessment of it. This is especially so when the debriefing tool is used to replay events. What has
happened as seen in the videotape is taken as the way it was: objective evidence that overrides what
pilots or flight examiners remember.
The Documentary Method of Pilot Assessment: Three Mathematical Models
The assessment of pilots tends to be treated as a measurement issue with the associated
question of inter-rater reliability (Flin et al., 2003). The present study shows that flight examiners
draw on the documentary method for making sense of the simulator sessions and for arriving at
an assessment and at an explanatory framework. This appears to be consistent with suggestions
that meaningful criteria for consistently assessing performance are elusive (Rigner & Dekker,
2000), and, therefore, to be consistent with the idea that assessment cannot be modeled


mathematically. However, this is not the case. We briefly present three possible approaches to
mathematically model the fuzziness of assessment.
Fuzzy logic. Assessment may be modeled using the fuzzy logic approach (Roth & Mavin,
2013). The assessment category is found by the minimum distance D, given fuzzy sets
specifying the lower (BL) and upper (BU) boundaries of performance, a fuzzy relation W that
specifies the weight a rater observation is given in the assessment, and a set of fuzzy observations A:
D = \left\{ \sum_{j=1}^{n} \left[ W_j \left( B_{L,j} - A_j \right) \right]^2 + \sum_{j=1}^{n} \left[ W_j \left( B_{U,j} - A_j \right) \right]^2 \right\}^{1/2}

In essence, the fuzzy logic approach maps a fuzzy set of given observations onto assessment/rating
categories (e.g., pass and fail; or unsatisfactory, minimum standard, satisfactory, good,
very good). When the overall assessment is the result of rating different human factors, the same
observation may be used in one or more categories, such as situational awareness, management,
or decision-making. The same observation may therefore contribute in different ways to an
overall assessment, which may be based on an automatic failure because of low situational
awareness or because of problems in decision-making.
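As a minimal sketch of this mapping, the distance D can be computed directly from the formula given above and the category at minimum distance selected. The observation values, boundaries, weights, and the two categories below are illustrative assumptions, not values from the study:

```python
import math

def fuzzy_distance(A, B_L, B_U, W):
    """Distance D of fuzzy observations A from one rating category,
    given the category's lower (B_L) and upper (B_U) performance
    boundaries and the weight W given to each observation."""
    lower = sum((w * (bl - a)) ** 2 for w, bl, a in zip(W, B_L, A))
    upper = sum((w * (bu - a)) ** 2 for w, bu, a in zip(W, B_U, A))
    return math.sqrt(lower + upper)

# Illustrative only: three fuzzy observations of one performance, scored
# against hypothetical boundaries for a "pass" and a "fail" category.
A = [0.8, 0.6, 0.9]                      # observation memberships in [0, 1]
W = [1.0, 0.5, 1.0]                      # weight given to each observation
categories = {
    "pass": ([0.7, 0.7, 0.7], [1.0, 1.0, 1.0]),
    "fail": ([0.0, 0.0, 0.0], [0.4, 0.4, 0.4]),
}

D = {name: fuzzy_distance(A, bl, bu, W) for name, (bl, bu) in categories.items()}
verdict = min(D, key=D.get)  # the category at minimum distance wins
print(verdict)  # -> pass
```

Note that the sketch takes the observation set A as given; as discussed below, it does not model how those observations come to be made.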
The present study shows that the set of fuzzy observations does not just exist but establishes
itself over time while flight examiners observe. Later observations may or may not cancel the
effect of earlier observations (e.g., in repeats). More importantly, there is a mutually
constitutive relation between the overall (documentary) sense and the observations sought
(objective sense). Neither aspect is modeled in the fuzzy logic approach. Thus, whereas it
appears useful in mapping a given set of (fuzzy) observations onto an outcome category system,
it does not model the reasoning process and the emergence of the observations. It is a static
model that takes the outcome of a process as its input.
Catastrophe theory. Transitions between binomial situations, such as changes in attitudes
(van der Maas, Kolstein, & van der Pligt, 2003), conceptual change of scientists (Roth, 2014), or
category formation among scientists (Roth, 2005) may be described mathematically drawing on
catastrophe theory (Figure 2). In the region of the cusp, small variations in the information
parameter can lead to sudden transitions from one to another state. The model depends only on
two control variables, a normal factor α and a splitting factor β. Both in attitude transition and in


categorization, the α factor corresponded to information available; the β factor was involvement and
amount of experience, respectively. The category formation case is suitable in the present
instance, especially useful in modeling the assessments flight examiners make at the boundary
between a pass and fail rating. Van der Maas et al. (2003) provide eight flags indicative of the
suitability of the catastrophe theoretic model and different techniques to fit the catastrophe model
to the data.
Insert Figure 2 about here
This model is consistent with the results of this study that show a sharpening of the contrast
between pass and fail (Figure 2, along the β axis); it is also consistent with the observation that
assessment at the boundary between two scores or pass and fail remains difficult. In the model,
minute variations in observation or circumstances can be the trigger for a transition to occur
(Figure 2, the jump from the lower to the upper surface of the cusp); or flight examiners actively
look for the tiny piece of evidence that allows them to give the pass or fail that reflects their overall sense.
Future research is required to test whether such models, already successfully explaining binomial
situations in other areas, are applicable to modeling assessments where flight examiners struggle
to place a performance in one of two adjacent categories.
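The equilibrium structure of the cusp model can be sketched from the standard cusp potential V(x) = x⁴/4 − βx²/2 − αx, whose equilibria are the real roots of x³ − βx − α = 0; the parameter values below are illustrative, not fitted to assessment data:

```python
def num_equilibria(a, b):
    """Count the real equilibria of the cusp model, i.e., the real roots
    of dV/dx = x**3 - b*x - a = 0, where a is the normal factor (net
    evidence for pass vs. fail) and b the splitting factor (amount of
    examining experience).  Three roots (two stable, one unstable) mark
    the bistable cusp region, where a minute change in evidence can flip
    the verdict; one root means the judgment shifts smoothly."""
    disc = 4 * b**3 - 27 * a**2  # discriminant of x^3 - b*x - a
    return 3 if disc > 0 else 1

# Clear evidence, modest experience: a single equilibrium state.
print(num_equilibria(a=1.0, b=0.5))   # -> 1
# Balanced evidence, much experience: two competing stable verdicts.
print(num_equilibria(a=0.0, b=3.0))   # -> 3
```

The condition 27α² = 4β³ traces the boundary of the cusp region, which is where the borderline (pass/fail) cases discussed above would sit.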
Constraint satisfaction. Interpretation formation and classification may be modeled using
constraint satisfaction models, as one study showed in the case of navigation of navy vessels
(Hutchins, 1995). Here, different underlying attributes are modeled in terms of nodes that represent
the current hypothesis concerning the attribute (Figure 3a). When information becomes
available, it feeds into the respective hypothesis, which has an activation level between 0 and 1.
There are sets of attributes supporting/reinforcing each other (+), whereas other pairs of
attributes counteract (−). In the model, one set of attributes supports a pass decision (the
activation levels of the 6 nodes are {1, 1, 1, 0, 0, 0}), whereas another set supports a fail decision
(the activation of the 6 nodes are {0, 0, 0, 1, 1, 1}). At any one point, the distance of the network
from the two extreme cases can be calculated (e.g., using a Euclidean metric in the vector space).
The assessment or interpretation formation trajectories can then be graphed (Figure 3b), as
shown in models of how artifact design processes evolve (Roth, 2001). In Figure 3b, all four
trajectories shown begin with the same initial state: two lead to a pass decision, one to a fail
rating, and one remains undecided between the two because no new information was provided at that
point. The latter may be taken as representing the pass with markers ratings, which flight
examiners used when a performance was not a clear pass but not deficient enough to
warrant a fail. The network model corresponds to the observation that the overall state of the
documentary sense (current interpretation, inclination) is a function of the parts (information
about different attributes) but each attribute is a function of the documentary sense. This type of
model therefore most clearly represents the evolving nature of the process of an assessment.
Future research would be required to test the fit of constraint satisfaction model with concrete
data from pilot assessment.
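A network of the kind described can be sketched as follows; the weight pattern, evidence values, and relaxation rule are illustrative assumptions rather than a model fitted to assessment data:

```python
import math

PASS = [1, 1, 1, 0, 0, 0]  # attractor: pass-supporting attributes fully active
FAIL = [0, 0, 0, 1, 1, 1]  # attractor: fail-supporting attributes fully active

# Symmetric constraints: attributes within the same set reinforce each
# other (+1); attributes across the two sets counteract each other (-1).
W = [[0 if i == j else (1 if (i < 3) == (j < 3) else -1)
      for j in range(6)] for i in range(6)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relax(act, evidence, steps=100, rate=0.2):
    """Settle the network: each node drifts toward the squashed sum of
    its weighted neighbours plus the external evidence it receives."""
    act = list(act)
    for _ in range(steps):
        net = [sum(W[i][j] * act[j] for j in range(6)) + evidence[i]
               for i in range(6)]
        act = [(1 - rate) * a + rate * sigmoid(n) for a, n in zip(act, net)]
    return act

def dist(act, target):
    """Euclidean distance of the current state from an attractor."""
    return math.sqrt(sum((a - t) ** 2 for a, t in zip(act, target)))

# Illustrative run: two observations feed pass-attributes, one feeds a
# fail-attribute; the network settles closer to the pass attractor.
state = relax([0.5] * 6, evidence=[2, 2, 0, 2, 0, 0])
print(dist(state, PASS) < dist(state, FAIL))  # -> True
```

Logging the two distances after each relaxation step yields trajectories of the kind graphed in Figure 3b, including states that remain between the attractors when no further evidence arrives.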
Insert Figure 3 about here
Implications
Research has already shown that flight examiner assessment is based on and can be modeled by
means of fuzzy concepts and fuzzy observations (Roth & Mavin, 2014). The present
investigation extends past research by showing that flight examiners use a documentary method
of interpretation that evolves a relation between an overall sense and concrete, sometimes exact
(e.g., current speed, torque, or presence/absence of a procedural call) and sometimes more fuzzy
observations (e.g., whether a captain is leaning on the first officer or whether they have an open
discussion). The emerging sense both is a function of and drives factual observation. This study
therefore allows us to anticipate variations between the assessment results of flight examiners,
who nevertheless have and can provide good reasons, as seen in the think-aloud protocol and
stimulated recall parts of this study. There is variation even though these flight examiners go
systematically about their evaluations. These results therefore provide an explanation to the
disappointing results of a study where, after three years of training, the authors conclude that there
were "typically several I/Es who gave noticeably different distributions of ratings," that
"[s]ystematic differences among raters were typically found," that "it may be difficult to achieve
[consistency in the .70s]," that "agreement for specific items is often inadequate," and that
sensitivity levels "were quite low" across the 3 years (Holt et al., 2002, pp. 324–325). There is
therefore mounting evidence for the hypothesis that the variation is true variance, which
undermines any training effort that attempts to increase rater calibration.


The true purpose of pilot assessment is the overall improvement of safety in the industry
rather than a school-like grade for each pilot. Thus, rather than falling into despair over
irremediable rater variance, we might ask how the observed variations or their underlying causes
might be used positively to improve safety in aviation. In the context of the participant airlines,
this research team has begun to work with flight examiners to change the practice of debriefing,
giving more space to the reflections of pilots on their own practices, such that the focus of the
biannual two-day session is on learning, with a decreased emphasis on grading, while
maintaining the identification of problem areas.
In the documentary method, the presumed underlying patterns (sense) are based on the
observations (evidence), which are in turn explained by the patterns. Practitioners might be
interested in focusing on increasing the number of observations, thereby increasing the number of
pieces of evidence substantiating the sense flight examiners have concerning proficiency,
knowledge and skills, or state. This is different from the behavioral marker approach to
assessment, where markers are rated on a numerical scale (e.g., Flin & Martin, 2001). Flight
examiner assessment would be based on observable evidence rather than on ratings of
overarching but inaccessible factors. Diagnostic tools such as the Enhancing Performance With
Improved Coordination (EPIC) tool (Deaton et al., 2007), which alert instructors to specific facts
easily synthesized from simulators, may turn out to assist flight examiners in collecting more
evidence than they have done in the past. As a result, the approach would reflect the increasing
tendency of the industry to ground decisions in solid evidence. But the increase in the amount of
evidence should be balanced by efforts to get the big picture, which amounts to conceptualizing
performance and telling a parsimonious and coherent story concerning the proficiencies of pilots.
References
Alberdi, E., Sleeman, D. H., & Korpi, M. (2000). Accommodating surprise in taxonomic tasks:
The role of expertise. Cognitive Science, 24, 53–91.
Bohnsack, R., Pfaff, N., & Weller, W. (Eds.). (2010). Qualitative analysis and documentary
method in international educational research. Leverkusen, Germany: Barbara Budrich
Publishers.


Boshuizen, H. P. A., & Schmidt, H. G. (1992). On the role of biomedical knowledge in clinical
reasoning by experts, intermediates and novices. Cognitive Science, 16, 153–184.
Bruner, J. S., Goodnow, J., & Austin, G. A. (1956). A study of thinking. New York, NY: Wiley.
Civil Aviation Authority of New Zealand (CAA-NZ). (2013, February). Flight test standards
guide: Airline flight examiner rating. Accessed August 20, 2014 at
http://www.caa.govt.nz/pilots/Instructors/FTSG_Airline_Flt_Examiner.pdf
Deaton, J. E., Bell, N., Fowlkes, J., Bowers, C., Jentsch, F., & Bell, M. A. (2007). Enhancing
team training and performance with automated performance assessment tools. International
Journal of Aviation Psychology, 17, 317–331.
Dekker, S., & Hollnagel, E. (2004). Human factors and folk models. Cognition, Technology and
Work, 6, 79–86.
Ericsson, K. A., & Simon, H. A. (1993). Protocol analysis: Verbal reports as data (Rev. ed.).
Cambridge, MA: MIT Press.
Flin, R., & Martin, L. (2001). Behavioral markers for crew resource management: A review of
current practice. International Journal of Aviation Psychology, 11, 95–118.
Flin, R., Martin, L., Goeters, K., Hörmann, H., Amalberti, R., Valot, C., & Nijhuis, H. (2003).
Development of the NOTECHS (non-technical skills) system for assessing pilots' skills.
Human Factors and Aerospace Safety, 3, 97–119.
Garfinkel, H. (1967). Studies in ethnomethodology. Englewood Cliffs, NJ: Prentice-Hall.
Garfinkel, H. (1996). Ethnomethodology's program. Social Psychology Quarterly, 59, 5–21.
Garfinkel, H., Lynch, M., & Livingston, E. (1981). The work of a discovering science construed
with materials from the optically discovered pulsar. Philosophy of the Social Sciences, 11,
131–158.
Holt, R. W., Hansberger, J. T., & Boehm-Davis, D. A. (2002). Improving rater calibration in
aviation: A case study. International Journal of Aviation Psychology, 12, 305–330.
Hutchins, E. (1995). Cognition in the wild. Cambridge, MA: MIT Press.
Johnston, A. N., Rushby, N., & Maclain, I. (2000). An assistant for crew performance
assessment. International Journal of Aviation Psychology, 10, 99–108.


Jordan, B., & Henderson, A. (1995). Interaction analysis: Foundations and practice. Journal of
the Learning Sciences, 4, 39–103.
Mannheim, K. (2004). Beiträge zur Theorie der Weltanschauungs-Interpretation [Contributions
to the theory of worldview interpretation]. In J. Strübing & B. Schnettler (Eds.),
Methodologie interpretativer Sozialforschung: Klassische Grundlagentexte (pp. 103–153).
Konstanz, Germany: UVK.
Mavin, T. J., & Roth, W.-M. (2014). A holistic view of cockpit performance: An analysis of the
assessment discourse of flight examiners. International Journal of Aviation Psychology, 24,
210–227.
Mavin, T. J., Roth, W.-M., & Dekker, S. W. A. (2013). Understanding variance in pilot
performance ratings: Two studies of flight examiners, captains and first officers assessing the
performance of peers. Aviation Psychology and Applied Human Factors, 3, 53–62.
O'Connor, P., Hörmann, H. J., Flin, R., Lodge, M., & Goeters, K.-M. (2002). Developing a
method for evaluating crew resource management skills: A European perspective.
International Journal of Aviation Psychology, 12, 263–285.
Perez, R. S., Johnson, J. F., & Emery, C. D. (1995). Instructional design expertise: A cognitive
model of design. Instructional Science, 23, 321–349.
Pollner, M. (1987). Mundane reason: Reality in everyday and sociological discourse.
Cambridge, UK: Cambridge University Press.
Rigner, J., & Dekker, S. W. A. (2000). Sharing the burden of flight deck automation training.
International Journal of Aviation Psychology, 10, 317–326.
Rosch, E. (1998). Principles of categorization. In G. Mather, F. Verstraten, & S. Anstis (Eds.),
The motion aftereffect (pp. 251–270). Cambridge, MA: MIT Press.
Roth, W.-M. (2001). Designing as distributed process. Learning and Instruction, 11, 211–239.
Roth, W.-M. (2005). Making classifications (at) work: Ordering practices in science. Social
Studies of Science, 35, 581–621.
Roth, W.-M. (2014). Graphing and uncertainty in the discovery sciences: With implications for
STEM education. Dordrecht, The Netherlands: Springer.


Roth, W.-M., & Mavin, T. J. (2013). Assessment of non-technical skills: From measurement to
categorization modeled by fuzzy logic. Aviation Psychology and Applied Human Factors, 3,
73–82.
Roth, W.-M., & Mavin, T. J. (2014). Peer assessment of aviation performance: Inconsistent for
good reasons. Cognitive Science. DOI: 10.1111/cogs.12152
Roth, W.-M., Mavin, T. J., & Munro, I. (2014a). Good reasons for high variance (low interrater
reliability) in performance assessment: A case study from aviation. International Journal of
Industrial Ergonomics, 44, 685–696.
Roth, W.-M., Mavin, T., & Munro, I. (2014b). How a cockpit forgets speeds (and speed-related
events): Toward a kinetic description of joint cognitive systems. Cognition, Technology and
Work. DOI: 10.1007/s10111-014-0292-0
Suchman, L. (2007). Human-machine reconfigurations: Plans and situated actions. Cambridge,
UK: Cambridge University Press.
Suto, I. (2012). A critical review of some qualitative research methods used to explore rater
cognition. Educational Measurement: Issues and Practice, 31, 21–30.
Transport Canada. (2013, January). Pilot examiner manual (4th ed.). Ottawa, Canada: Minister
of Transport. Accessed August 22, 2014 at
http://www.tc.gc.ca/publications/en/tp14277/pdf/hr/tp14277e.pdf
van der Maas, H. L. J., Kolstein, R., & van der Pligt, J. (2003). Sudden transitions in attitudes.
Sociological Methods & Research, 32, 125–152.
Wineburg, S. (1998). Reading Abraham Lincoln: An expert/expert study in the interpretation of
historical texts. Cognitive Science, 22, 319–346.
