
Evaluating Rating Scales for Sensory Testing with Children


Sensory testing with children is becoming increasingly important
to the food industry, but little research on appropriate
methodology has been conducted.

Beverley J. Kroll

AS THE NUMBER of food products
aimed at the children's market increases
and the role of children in purchase
decisions expands, sensory testing with
children becomes increasingly important to the food processing industry.
However, sensory research has not kept
pace with this need.
Testing with children is in an embryonic stage. Over the years, a few
sensory researchers have considered the
problems involved in applying their
science to this special population, but
for the most part the field has been
static. The need for serious investigation is pointed up by how little research
has been done in this area.
As a way of focusing on the specific
needs for this kind of research, a
thumbnail sketch of certain key
questions the literature considers is
presented in the box on p. 80.
One thing is very noticeable, not only
in the literature but also in word-of-mouth, unpublished material about
children's testing: the methods used
have been intuitive, even granted that
the investigator may have had a
rationale. Once a method has been
selected, there has been no serious
investigation of possible alternatives. It
is as if the researchers said, "We
planned this, we tried it, it seemed to
work, and there was no time to bother
with what might have worked better."
We therefore undertook a basic
research project designed to help
establish a solid foundation for future
investigations. This article describes the
procedures, analysis, and conclusions of
research intended to evaluate the
relative merit of rating scales that might
be used when testing with children. In
this study, we used two methods of
questioning, one-on-one interviewing
(Fig. 1) and self-administered questionnaire (Fig. 2), and three types of rating
scale (Fig. 3).

FOOD TECHNOLOGY

Variables Selected
A great many variables could be
considered. Hence, it was necessary to
be selective and try to choose the more
important ones.
• Test Products. The test product
was not really a source of variation, but
remained constant throughout the main
series of experiments. We settled on a
sweetness difference in an orange
drink. One can reliably predict that
children will like a sweeter drink, at
least within the normal range. This
proved to be the case.
Preliminary testing of drinks with
various sweetness differences indicated
the adjustments needed. For example, a
drink sweetened with the recommended
amount of sugar compared to one
made with only 50% of that amount
produced highly significant differences
no matter what rating scale was used.
Needed was a difference that was
definite but not overwhelming, so that
the possible effects of the variations of
interest could emerge. The final choice
was an orange-flavored drink sweetened with the recommended amount of
sugar, compared to a drink with 80% of
that amount.
• Scale Type. Differences in scale
type were the main issue addressed in
these experiments. After preliminary
work with older children, we concentrated on three scale types (Fig. 3): the
standard hedonic scale with the usual
verbal categories, a pictorial or face
scale, and a child-oriented verbal scale
we developed.
Over the years, researchers have
investigated test language suitable for
children. After reviewing child-oriented
word scales designed by
others, we decided to develop our own
scale, with more nearly equal intervals
(although exact equality probably
cannot be achieved with scales of this
type). The result was dubbed the
Peryam & Kroll or P&K scale.

The author is President, Peryam & Kroll,
Marketing and Sensory Research, 6323 N.
Avondale Ave., Suite 121, Chicago, IL 60631.

It was imperative that the study
include a picture scale. Testing with
children is overrun with picture scales,
the rationale being that younger people
may not understand words and phrases
but can more accurately deal with facial
expressions. Besides, pictures are
entertaining and should inspire closer
attention to the task.
There are many such caricature scales
around, but all have the same general
characteristics, representing degrees of
pleasantness ranging from high to low.
The question is how well successive
pictures communicate the basic idea.
Some preliminary work was done with
a scale from an earlier published study,
which used the Snoopy cartoon
character, but the results were disappointing. Scales using children's faces
with variations in degree of detail were
also tried. Eventually a series of
simplified people faces was selected as
probably best and certainly representative.
• Scale Length. There is a school of
thought, bolstered by intuition, that
longer scales tend to create confusion
because there are lots of words to
understand and choices to make. The
implication is that this problem should
be more serious with younger children.
On the other hand, there is evidence
that longer scales can be more discriminating and produce more reliable
results.
Certainly, this factor was of enough
importance to be included in the study.
Starting with the frequently used 9
points, how far down

Fig. 1. Children ages 5-7 and 8-10 were tested using one-on-one interviews

Fig. 2. Children ages 8-10 were also
tested using self-administered questionnaires in standard sensory testing booths

Traditional hedonic scale      P&K scale
Like extremely                 Super good
Like very much                 Really good
Like moderately                Good
Like slightly                  Just a little good
Neither like nor dislike       Maybe good or maybe bad
Dislike slightly               Just a little bad
Dislike moderately             Bad
Dislike very much              Really bad
Dislike extremely              Super bad

Face scale (pictures not reproduced)

Fig. 3. Three types of rating scale were used: the traditional hedonic scale, the
P&K scale developed for this study, and the typical face scale. After testing, scale
values of 1 to 9 were assigned (starting with 1 at the top) for the purposes of



Questions Addressed in Earlier Studies


should one go? To 7 points? 5? 3? Or
even to just 2 points, which would be
paired comparison?
The study addressed this variable in
subdued fashion by trying 7 points, using
the same three scale types as before but
eliminating one good category and one
bad category from each scale.
• Age. For what ages might special
techniques be required? Our initial work
was with children over 10 years of age,
most of whom seemed to handle self-administered questionnaires fairly well,
with no problems that are not encountered to some degree with adults.
To address the real issue, therefore,
we defined two age groups based on
suppositions about ability to handle
verbal input: the preliterate, ages 5-7,
where most can be expected to read very
little if at all and not understand big
words; and the semiliterate, ages 8-10,
where most can read at some level but
still may not understand words such as
"extremely" or "moderately." No attempt was made to extend the investigation to preschoolers.
• Mode of Presentation. Most of the
experiments employed a straightforward
approach, where the successive categories were read one after another, always
starting at the good end.
Another approach sometimes used by
investigators is what may be called
"bifurcated": the interviewer first asks
the subject to place the stimulus into
either the good/like or the bad/dislike
category, then tries to get the child to
scale degree of like or dislike by
presenting the successive categories. The
categories were presented starting in the
middle and proceeding to the ends. This
seemed logical, but that could be open to
debate. If the subject failed to make a
choice in response to the initial question,
the result was recorded as "maybe
good/maybe bad" or "neither like nor
dislike" (but was not read to the subject).
This phase of testing included only the
hedonic and P&K scales because the
face scale is inappropriate to this
approach.
The question of which was the better
procedure, the bifurcated or the
straightforward, was addressed in a side
experiment.
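
The two-stage, bifurcated questioning described above can be sketched in code. This is only an illustrative sketch, not the interview protocol actually used: the function and variable names are hypothetical, the P&K wording is taken from Fig. 3, and the scoring assumes the Fig. 3 convention that 1 is the top (good) end of the 9-point scale.

```python
from typing import List, Optional

# Scale values follow Fig. 3: 1 is the top ("good") end of the 9-point P&K scale.
PK_SCALE = {
    1: "Super good",
    2: "Really good",
    3: "Good",
    4: "Just a little good",
    5: "Maybe good or maybe bad",
    6: "Just a little bad",
    7: "Bad",
    8: "Really bad",
    9: "Super bad",
}
MIDPOINT = 5  # recorded when the child makes no first-stage choice; never read aloud


def second_stage_order(side: str) -> List[int]:
    """Scale values in the order read to the child: from the middle out to the end."""
    if side == "good":
        return [4, 3, 2, 1]
    if side == "bad":
        return [6, 7, 8, 9]
    raise ValueError(f"unknown side: {side!r}")


def bifurcated_rating(side: Optional[str], chosen_label: Optional[str]) -> int:
    """Return the recorded scale value for one sample.

    side         -- first-stage answer: "good", "bad", or None if no choice was made
    chosen_label -- second-stage answer: the category wording the child picked
    """
    if side is None:
        return MIDPOINT
    for value in second_stage_order(side):
        if PK_SCALE[value] == chosen_label:
            return value
    raise ValueError(f"{chosen_label!r} is not a {side!r} category")


# Hypothetical responses from two children.
print(bifurcated_rating("good", "Really good"))
print(bifurcated_rating(None, None))
```

Keeping the midpoint out of the second-stage list mirrors the description above, where the neutral category was recorded but not read to the subject.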


Can children discriminate? How far down the age scale does the
capacity for discrimination exist?
There has never been much argument here. Children can definitely
discriminate. At least they have preferences. Observations of the
behavior of even infants indicate the capability of choice in terms of
rejection and acceptance.
• About 1955, investigators at Eli Lilly, Inc., developed a procedure
for working with children 2-3 years old to evaluate formulas for
vitamin preparations (Peryam, 1989). They used one-on-one interviewing and the paired-comparison method and claimed to have
obtained results useful in product development.
• Investigators at the University of Florida did extensive testing of
various citrus products with preschool children ages 3-5 (Morse,
1953). They found lots of discrimination, as well as puzzling aberrations.
They used, and endorsed, paired comparisons, which produced the only
meaningful results. However, they also tried a method which was
essentially the triangle test, although not labeled as such. Their conclusion that the method was too complicated for kids should not surprise
anyone.
• Work with preschoolers ages 3-5 used fruit as stimuli and an
interesting variation of the rank-order method (Birch, 1979). The child
was presented with a number of different kinds of fruit and asked to
select the one liked best. This was then removed, and the one liked best
among the rest was chosen, and so on. Whatever the utility of the
findings, there was discrimination, which replicate testing showed was
reliable.
• Colwill (1987) reviewed scaling methods for obtaining information
about consumers' likes and dislikes. He recommended using picture
scales, preferably with five or seven points, for testing preliterate
children.
Can one use a measuring device more sophisticated than simple paired
comparisons? Can children differentiate degrees of liking and/or
disliking?
Usually investigators have found that children do have such ability, but
the extent of that ability, as well as how it might be affected by any one
of many variables, is seldom considered.
• Some years ago, Bert Krieger, a researcher with a candy manufacturer, was faced with the problem of evaluating formulation changes in
chocolate bars (Moskowitz, 1985). He dealt with children 5-7 years old
as well as older children, using a picture scale that showed the Snoopy
cartoon character in a series of nine poses ranging from up-eared elation
to droopy disgust. His subjects were able to discriminate.
• Another researcher (Wells, 1965) used a scaling method to evaluate
children's feelings about cereals. He was not concerned with the foods as
eaten, but evaluated children's ideas about familiar cereals and their
feelings about TV commercials. Some of the subjects were in the 5-7
age range. The study used 7-point face scales showing a youngster (a
boy for boys, a girl for girls) in poses ranging from grinning happiness to
hold-the-nose distaste. The children could discriminate, and the results
were meaningful.
Are the results of testing children useful in solving typical product
development problems?
The sponsors must be getting something useful, or why would so much
be attempted? Some of the published studies actually address the
question, e.g., the previously cited work by Krieger, who achieved
comparative evaluation of formulas for chocolate bars.
Summary
Briefly summarizing the literature, we note that:
• There is consensus that children can discriminate, particularly in
regard to degree of liking.
• Children are able to show degree of preference if the proper
measuring device is used.
• Children can provide useful information about products if the right
methods are employed.
• Children require special handling, i.e., handling that is different
from the procedures routinely employed with adults. One must pay
attention to such things as gaining confidence, providing motivation, and
expressing tasks in language children understand. This recognition
appears throughout the literature.

Another side issue that seemed worth
testing was one-on-one interviewing vs
a self-administered questionnaire. This
experiment used the 9-point hedonic
and P&K scales and involved only
children 8-10 years old, i.e., the
semiliterate group. Again, the face scale
was excluded because the concern was
mainly with ability to read with
understanding.

Testing Procedure
The test subjects were prerecruited
from families on our extensive roster of
consumer panelists. Our computerized
records usually show which families have
children and their ages. All had to like
orange drinks, which was no problem.
Otherwise the only concern was age,
sex, and availability to fit into the
schedule. An important proviso was that
no child should be invited to participate
in more than one test, which would
raise questions about training effect.
In all cases, a subject tried the pair of
samples, high sweet vs low sweet,
twice, using a different scale for each
pair, then made a paired-comparison
choice after each pair. Except for those
on the mode of presentation, the
experiments included all three scale
types: hedonic, P&K, and face. The
design required that the scales be used
equally often and appear equally often
as the first or second pair. Furthermore,
for each scale type the high-sweet and
low-sweet samples were served first or
second equally often.
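
One way to meet these balancing requirements is to enumerate every combination of scale order and serving order and cycle subjects through the resulting cells. The sketch below is an assumption about how such a plan could be built, not a record of how subjects were actually assigned; all names are hypothetical.

```python
from collections import Counter
from itertools import permutations, product

SCALES = ("hedonic", "P&K", "face")
# Serving order within a pair: which sample is tasted first.
SERVING_ORDERS = (("high", "low"), ("low", "high"))


def design_cells():
    """All sessions: an ordered pair of different scales, each with a serving order."""
    cells = []
    for (first_scale, second_scale), first_order, second_order in product(
        permutations(SCALES, 2), SERVING_ORDERS, SERVING_ORDERS
    ):
        cells.append(((first_scale, first_order), (second_scale, second_order)))
    return cells


def assign(n_subjects):
    """Cycle subjects through the cells; balance is exact when n_subjects is a multiple of 24."""
    cells = design_cells()
    return [cells[i % len(cells)] for i in range(n_subjects)]


plan = assign(48)  # hypothetical panel size
first_pair = Counter(session[0][0] for session in plan)
second_pair = Counter(session[1][0] for session in plan)
high_first = Counter(
    scale for session in plan for scale, order in session if order[0] == "high"
)
print(dict(first_pair), dict(second_pair), dict(high_first))
```

With three scales taken two at a time and two serving orders per pair there are 6 x 2 x 2 = 24 distinct sessions, so panel sizes that are multiples of 24 balance exactly and other sizes balance only approximately.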
Sex differences did not seem important in the context of this investigation,
but our recruiters attempted to have
equal numbers of girls and boys in each
of the age groups. This was not
achieved exactly, but it was close. They
also tried to get an even distribution of
ages within each age group. Again, this
was not exact but was very close.
The drinks were prepared in quantity
ahead of time, chilled to refrigerator
temperature, and held at that temperature throughout testing. They were
poured just before serving. A sample as
served was about 1½ oz of drink in a
small plastic glass. The samples were
identified by code number, but only for
the convenience of the operators and to
avoid errors. If a subject even saw the
codes, it was accidental.
All interviewing was conducted one-on-one, except for the sessions
using the regular written questionnaires.
The interviewers were carefully briefed
on the protocol to be followed for each
variation.
The interviewer met the subject and
parent in a reception area. Leaving the
parent there, the interviewer took the
child to the testing area while chatting
in a friendly manner to establish rapport
and relieve possible tension. The test
itself was not discussed except in a very
general way.
In the test room, the child was seated
at a table across from the interviewer
(Fig. 1) and told that he or she would
get some samples of orange drink and
would be asked questions about them.
The first sample was brought and the
child invited to try it. When the child
was finished, the interviewer began the
questioning procedure according to the
set protocol. After a rating was made,
the child was told to drink some water
while the interviewer got the next
sample. The waiting period was about 2
minutes. The second sample of the pair
was then tried and rated. This was
followed by the question, "Which did
you like better, the first sample you tried
or the second one?"
Then the child was told there were
more drinks to be tried and had a drink
of water while waiting another 2
minutes. The second pair was handled
like the first, and the child was escorted
back to his or her parent. The whole
sequence took about 10 minutes.

Analyses
There is a qualification to note here.
Some findings, in the sense of the
objectives of the research, rely on what
may be called soft data; however, they
were derived from hard data.
• Hard Data. For the paired
comparison, the significance of the
proportions of choice was determined
by the z-test. For the scalar measures,
the significance of the difference
between the average rating for the high-sweet and low-sweet drinks was
determined using the t-by-difference
test (a paired t-test), which was natural, since each
subject had tried both samples. Using
the variances of the distributions was
also considered, but the figures were
volatile and hard to interpret. With
scales of this kind, the variance is
highly dependent on the average rating,
being quite low when the upper end of
the scale is approached, but increasing
as the average drops toward the
midpoint.
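
The two hard-data tests can be illustrated with a short calculation. This is a minimal sketch under stated assumptions: the z-test is taken to be a two-sided test of the preference proportion against 0.5 without continuity correction (the article does not give the exact form), the function names are hypothetical, and the numbers are invented for illustration and are not results from this study.

```python
import math
from statistics import NormalDist, mean, stdev


def paired_comparison_z(n_prefer_high, n_total):
    """Two-sided z-test of the observed preference proportion against 0.5."""
    p_hat = n_prefer_high / n_total
    standard_error = math.sqrt(0.5 * 0.5 / n_total)
    z = (p_hat - 0.5) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def t_by_difference(high_ratings, low_ratings):
    """Paired t statistic on per-subject differences (high-sweet minus low-sweet)."""
    diffs = [h - l for h, l in zip(high_ratings, low_ratings)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1  # t statistic and degrees of freedom


# Invented example: 60 of 100 children choose the high-sweet drink.
z, p = paired_comparison_z(60, 100)
print(f"z = {z:.2f}, two-sided p = {p:.4f}")

# Invented ratings for six children; with 1 at the top of the scale,
# a lower value means the drink was liked more.
high = [2, 3, 1, 4, 2, 3]
low = [3, 5, 2, 4, 4, 3]
t, df = t_by_difference(high, low)
print(f"t = {t:.3f} with {df} degrees of freedom")
```

The t statistic is then referred to a t table (or a statistics package) at the stated degrees of freedom to obtain the significance level.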
• Soft Data. The tables of results
show significance levels ranging from
1% to 15%. These figures were
compared among scales, between age
groups, between test orders, between
orders of serving, and so on.
How legitimate, or how useful, is this
approach? There is no routine, accepted
statistical procedure for determining
whether one level of significance is or
is not significantly different from
another. Perhaps a method for this
purpose could be devised, but its
possible utilization has not been
explored. An example of the questions
to be resolved would be, how much
more important is the 1% level than the
2% level? Probably not very important,
since both are near certainty. But one is
easily convinced that the 1% level
shows more discrimination than the
10% level. These are the kinds of
decisions that served as the basis for
most of the conclusions in this study.

Results
What, if anything, was discovered in
this study? Are any conclusions
definitive, settling certain points once
and for all? Not likely! But there are
results that can direct future research on
the subject.
• Paired-Comparison. The paired
comparisons were always made after
the pair of drinks had been presented
and rated. The results, summarized
across all tests, are shown in Table 1.
Overall, there was a highly significant
difference, well below the 0.1% level,
which was due in part to the large
number of subjects (N). As expected,
the high-sweet sample was preferred,
which validated the product variable.
Other conclusions come from comparing different subgroups.
Test order, whether the first or second
pair of the session, made no difference.
There was no difference in discrimination between boys and girls.
Children 8-10 years old were
definitely more discriminating than the
younger kids, who failed to establish a
significant difference. Their failure
might have been due to interference by
the scaling task. The difference between
ages might have been expected.
Scale type may also have made a
difference, although evidence is
borderline. When the comparison was
made after the hedonic and P&K scales,
discrimination was about the same as
overall; but when it was made after the
face scale, it dropped to the level of
nonsignificance. This might be a chance
effect, or there may be something about
the face scale which later interfered with
the paired comparison.
• Scale Length. Scale-length results
(Table 2) tend to lay to rest the belief
that children need simplicity and
shouldn't be presented with too much
because they will get confused. Within
the context of these experiments, that
did not prove to be the case. Quite the
contrary: the 9-point scales were as
good as, if not better than, the 7-point
versions. Definitely, the 7-point scales
were not better. Whether the 9-point
scales were actually better for discrimination rests on comparison of the 5% vs
1% levels of significance, but the 7-point scales offer no advantage.
With the 9-point scales, all subgroups
showed significant discrimination,
granted that at one point it dropped to a
questionable 15% level; whereas with
the 7-point scales, three subgroups
showed nonsignificance.
The boys did slightly better than the
girls, although this was not consistent. It
is probably trivial, and not indicative of
any meaningful trend.
This result is definite and hardly
unexpected. The children 8-10 years
old showed good discrimination with
both scale lengths, whereas the children
5-7 years old showed significant
discrimination only with the 9-point
scales, completely failing the task with
the shorter version. On the basis of the
supposition that the simpler scales
should be easier for younger children,
one might have expected this to be the
other way around.
It is often noted in sequential monadic
testing that there is better discrimination
when only the second-served samples
are considered. In this study, there was
significant discrimination with the
second-served samples for both scale
lengths, but almost none with the first-served samples. Is this due to some kind
of contrast?
Is it a training effect, where the ratings
of the second sample have the benefit of
experience with the first? This research
could not address such questions in all
of their complexity. Besides, such effects
pertain to all testing, not just when
children are concerned.
• Scale Type. The crux of the
research is the comparative evaluation
of the three scale types. Overall, with
N = 208 for each scale, all scales
significantly discriminated at better than
the 10% level. However, the P&K scale
(1% significance level) was better than
the hedonic scale (8% significance
level) and the face scale (7% significance level). We think this is an
important finding, but remember the
qualification about soft data: it is based
on comparison of the 1% vs the 7% or
8% level of significance. In addition, the
face scale, which typified the kind
alleged to be better for children, failed
to emerge as better than the other
scales.
In a way, Table 3 is repetitive,
exhibiting effects shown in the other
tables, but now separately for each
scale type. However, it may add further
emphasis to the following conclusions:
The P&K scale gave better overall
discrimination; older children showed
better discrimination with all scales;
and no scale discriminated when just
the first-served samples were considered, but the P&K and face scales did
with the second-served samples.

The second pair of drinks tested was
consistently better for discrimination
than the first pair, no matter the scale
type. Does this mean that there is a
learning effect, even from the brief first
exposure to the task? If so, it is both bad
news and good news. The bad news is
that one does not have a pure measure.
But who believes that is possible
anyway? The good news is that kids
quickly learn to do a good job, and that
the testing of multiple pairs is acceptable.
• Mode of Presentation. Table 4
shows the results of the side study
designed to help answer the question, Is
there any advantage in using the two-stage, bifurcated approach? The study
was limited to the 9-point scale.
Overall, the bifurcated approach
seems to offer no advantage over the
straightforward. Even for the children
5-7 years old, the age group for whom
the method was designed, the
bifurcated scale was little better than
the straightforward approach.
The self-administration phase of the
study was an embellishment done as an
afterthought. It was limited in scope,
utilizing only the hedonic and P&K
scales, and excluding children 5-7
years old for the obvious reason that
they are preliterate.
The results (Table 5) showed that
children 8-10 years old can handle
written questionnaires effectively.
Overall, the results were significant at
the 1% level.
Although not shown in the table, the
effect of self-administration was more
pronounced with the hedonic scale,
whereas discrimination with the P&K
scale was about the same with both
approaches (one-on-one interviewing
and self-administration). This finding
should cheer sensory specialists. It
makes things easier. If children of this
age are sufficiently knowledgeable that
big words do not defeat the purpose,
why bother with expensive one-on-one
interviewing?

Further Studies Needed
The results of this study can be
summarized as follows: The P&K scale
performs better than the hedonic or face
scale. Reducing scale length from 9
points to 7 offers no advantage.
Children 5-7 years old do not perform
any better with the face scale than with
the other two scales. The bifurcated
approach does not discriminate as well
as the straightforward method. And
older children perform as well using
written questionnaires as when
interviewed one-on-one.
The study, as noted earlier, was not
intended to be the be-all and end-all.
Rather, it was intended as a foundation
for further studies. A review of
variables will show that many need
further attention. While there are
problems involved, there is a great deal
to be obtained.

References
Birch, L.L. 1979. Dimensions of preschool
children's food preferences. J. Nutr. Educ. 2(2):
77.
Colwill, J.S. 1987. Sensory analysis by
consumers. Food Mfr., Feb., p. 53.
Morse, R.L.D. 1953. Exploratory studies of
preschool children's taste discrimination and
preference for selected juices. Proc. of Florida
State Horticultural Soc., Daytona Beach.
Moskowitz, H.R. 1985. Product testing with
children. In "New Direction for Product Testing
and Sensory Analysis of Foods," p. 147. Food
and Nutrition Press, Inc., Westport, Conn.
Peryam, D.R. 1989. Personal communication.
Peryam & Kroll Marketing and Sensory
Research, Chicago.
Wells, W.D. 1965. Communicating with children.
J. Adv. Res., p. 2.
Based on a paper presented at the Spring
Meeting of ASTM, San Francisco, Calif., May 24,
1990.
Edited by Neil H. Mermelstein, Senior
Associate Editor

Reprinted from Food Technology 44(11) 78-80, 82, 84, & 86

1990 Institute of Food Technologists
