
Journal of Wine Economics, Volume 5, Number 1, 2010, Pages 132-142

Evaluation of Wine Judge Performance through Three Characteristics: Bias, Discrimination, and Variation*
Jing Cao and Lynne Stokes

Abstract
Judge performance is a critical component of a wine competition's success. A number of studies have shown that wine judges may differ considerably in their opinions. In this paper, we conduct an in-depth examination of wine judge performance at a U.S. wine competition. Three characteristics of judge performance are examined: bias, discrimination ability, and variation. Based on the analysis, we can identify the judges who had discrepant scoring patterns and can gain insight into which of the three characteristics causes particular judges to disagree. The evaluation of wine judge performance through these three characteristics may provide useful information for training judges to perform consistently and for assisting competition coordinators in judge selection. (JEL Classification: C1, D8, Q13)

I. Introduction
Wine quality is an abstract measure that is difficult to define in absolute terms. Thus, wine judge performance is a critical component of a wine competition's success. Ideally, all judges participating in a wine competition should have the same judging criteria and similar scoring patterns. However, studies have shown that wine judges may differ considerably in their opinions (Cliff and King, 1997; Ashenfelter, 2006; Hodgson, 2008). A number of approaches have been developed to evaluate wine judge performance. For example, Cicchetti (2004, 2006) used the intraclass correlation to measure the reliability of judges based on their concordance. Cliff and King (1996) described the use of descriptive statistics for judge evaluation. The same authors (1997) then proposed to use eggshell plots to identify similarities and
* The authors would like to acknowledge the support of the administration of the California State Fair Wine Competition for making the data available. Special thanks go to Robert T. Hodgson and G.M. "Pooch" Pucilowski. The authors would also like to thank the anonymous referee for the valuable comments and the editor for the editorial improvements.
a Department of Statistical Science, Southern Methodist University, Dallas, Texas, 75275.
b Department of Statistical Science, Southern Methodist University, Dallas, Texas, 75275, email: slstokes@smu.edu

The American Association of Wine Economists, 2010



differences among the judges. Seaman et al. (2001) conducted principal component similarity analysis to group judges according to their judging style and to identify outliers. Hodgson (2008) analyzed data from a major U.S. wine competition from 2005 to 2008. The data, which contain panels with triplicate samples poured from the same wine bottle, carry rich information on judge performance. Hodgson used the pooled standard deviation and an ANOVA model to measure a judge's ability to consistently evaluate replicate samples of an identical wine.

The above methods have one common goal, which is to investigate whether or not a judge's performance is consistent with the performance of the other judges. In this paper, we have the same goal of identifying discrepant judges. However, our analysis extends the comparison to try to identify which one or more of three judge characteristics, bias, discrimination ability, and random variation, are responsible when judges disagree. Judge bias measures the systematic difference between a judge's score and the average score from all the judges; based on the bias, we can identify whether a judge is relatively generous, neutral, or stringent in his or her assignment of scores. Judge discrimination measures a judge's ability to distinguish wines based on their quality; judges with greater discriminating power can distinguish wines more easily. Judge variation measures the size of the random component of variability in a judge's assessment of wine quality. Each judge receives a score on each characteristic of his judging performance, which provides insight into what makes judges differ. This might prove useful for training wine judges to perform consistently and for assisting competition coordinators in judge selection.

The data analyzed here are from the California State Fair Wine Competition in 2009. The judges were instructed to provide letter scores (e.g., No award, Bronze, Bronze+).
Letter scores were later transformed to numerical scores ranging from 80 to 100 points. In the dataset, 8 distinct scores were recorded (80, 84, 86, 88, 90, 92, 94, 96). Note that these numerical scores are not continuous, and a difference in scores does not necessarily represent the true difference in wine quality. In a wine competition, it is reasonable to assume an ordering of the form Bronze (84) < Bronze+ (86) < Silver- (88), but it usually does not make sense to assume (Bronze+) - (Bronze) = (Silver-) - (Bronze+), which is implied by treating the scores as continuous measures of wine quality. Since the scores are essentially ordinal data, representing ordered response categories with no underlying interval scale (Johnson and Albert, 1999), we fit a Bayesian ordinal model (Cao, Stokes, and Zhang, 2010). The approach was developed to evaluate rater performance in the ranking of proposals in a grant review process, which is similar in important ways to a wine competition.

II. Method
There were approximately 3000 wines in the competition. Among these wines, 68 had triplicate samples served to 17 panels of judges. Each of the panels evaluated three samples of 4 different wines, where each triplicate was poured from the same bottle. The analysis presented is based on the data of the 68 wines with triplicate samples. The data from one of the 17 panels is presented in Table 1. Following the notation used by Hodgson (2008), we



use J1 to J4 to denote the four judges in a panel. The three R1 values represent the scores given to the replicates of the first wine tested, with the scores of the remaining wines denoted by R2, R3, and R4.
Table 1

Raw Data from Panel A

     R1  R1  R1   R2  R2  R2   R3  R3  R3   R4  R4  R4
J1   80  84  84   86  84  80   84  80  86   90  80  84
J2   86  86  80   86  88  96   90  84  88   86  84  84
J3   80  80  86   86  92  86   86  80  96   90  96  86
J4   80  86  86   84  84  86   92  86  94   94  90  88

Jl to J4 denote the four judges in a panel. The three Rl values represent the scores given to the replicates of the first wine tested, with the scores of the remaining wines denoted by R2, R3 and R4.

In the following we introduce the model in a general format. Suppose there are M judges who are instructed to assign ordinal scores to N wines, with the levels and meaning of the ordinal scores being the same for all judges. We assume that each wine to be evaluated has an underlying continuously-valued latent (non-observed) quality, which determines the true ordering of all wines. Judges are assumed to assign scores by first estimating the wine quality and then comparing it with category cutoffs. In this case, the categories are No award, Bronze, etc. We further assume that the category cutoffs are the same for all judges. Let θi (i = 1, …, N) be the underlying quality of wine i, xijk be the estimate of θi by judge j on the kth replicate, and yijk be the observed ordinal score assigned by judge j to the kth replicate of wine i. Standard ordinal data take values of consecutive integers, so for convenience of explication we use scores (1, 2, 3, 4, 5, 6, 7, 8) in place of the initial scores (80, 84, 86, 88, 90, 92, 94, 96). Then the ordinal model is
xijk = αj + βj θi + εijk,   εijk ~ N(0, σj²),   and   (1)
yijk = s   if   cs−1 < xijk ≤ cs,   s = 1, …, 8.      (2)

That is, the measurement error εijk made by judge j on his kth assessment of wine i is assumed to have an independent normal distribution with mean zero and judge-specific



variance σj². The quantities cs (s = 0, …, 8) denote the common cutoffs for all judges, and we assume c0 = −∞ and c8 = ∞. We describe αj as the bias parameter for judge j and βj as the discrimination parameter. A generous judge tends to give a higher average score to all wines, producing a positive αj, while a stringent judge has a negative αj. A competent judge's discrimination parameter βj will be positive, indicating that the criteria used by judge j are consistent with the wine's latent quality θi. A small βj suggests that the judge assigns ratings in a narrow range, and a large βj suggests that the judge assigns ratings that are more separated for distinguishable wines. A judge with a negative βj makes judgments contrary to the general opinion represented by the majority of judges. Under the assumption that the general opinion is consistent with wine quality, a negative βj indicates that the judge's ratings are inaccurate. The parameter σj² describes the amount of inconsistency in a judge's scoring pattern. The larger the value of σj², the more randomness in the judge's evaluation. Specifically, a judge with a large σj² may assign substantially varying scores to wines of identical quality.

The parameters describing wine quality θi, judge bias αj, discrimination βj, and variation σj² are estimated by Bayesian inference. In Bayesian inference, parameters are assigned a prior distribution, which represents a reasonable opinion concerning the plausible values of the parameters. The prior is then updated by incorporating the information contained in the observed data, which in this case are the ratings of the judges. The final inference is based on this updated distribution, known as the posterior distribution. The estimates of the parameters are then the means of the posterior distributions, and measures of uncertainty can be obtained from the spread of the posterior.
A credible interval is the Bayesian counterpart of the frequentist confidence interval. A 95% credible interval, for example, can be interpreted as one with a 95% probability of containing the parameter, given the observed data. The estimates thus combine the information in the prior and the observed data (Carlin and Louis, 2008). To avoid imposing subjective opinions on the parameters, we use so-called noninformative priors. That is, the chosen priors assume a priori that the wines have similar quality and that the judges have comparable scoring patterns. For the technical details, readers may refer to Cao, Stokes, and Zhang (2010). The program to run the model, which is available upon request from the first author, was written in R, free software for statistical computing and graphics.
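To make the data-generating process in the model concrete, the sketch below (in Python rather than the authors' R program, with invented parameter values that are not estimates from the competition data) simulates latent assessments and maps them through common cutoffs to ordinal scores; one replicate per wine is used for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative values only -- not estimates from the competition data.
N, M = 12, 4                                     # wines, judges
theta = rng.normal(0.0, 1.0, N)                  # latent wine quality
alpha = np.array([-0.5, 0.0, 0.2, 0.3])          # judge bias (judge 1 is stringent)
beta = np.array([1.0, 0.3, 1.2, -0.8])           # discrimination (judge 4 is contrary)
sigma = np.array([0.3, 0.3, 0.3, 1.2])           # error SD (judge 4 is also noisy)
cuts = np.array([-np.inf, -1.5, -0.9, -0.3, 0.3,
                 0.9, 1.5, 2.1, np.inf])         # common cutoffs c_0, ..., c_8

# Latent assessment: x_ij = alpha_j + beta_j * theta_i + eps_ij
x = alpha + beta * theta[:, None] + rng.normal(0.0, sigma, (N, M))

# Observed ordinal score: y = s when c_{s-1} < x <= c_s, so scores lie in 1..8
y = np.searchsorted(cuts, x, side="left")
print(y.shape)  # (12, 4)
```

In this sketch, judge 4's negative discrimination tends to reverse the ordering implied by θ, and the large error SD adds extra scatter, mimicking the two "serious" judge problems discussed later in the paper.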

III. Results
In this paper, we focus on the evaluation of wine judge performance, not the estimation of wine quality. All but one of the 17 panels had four judges; the remaining panel had three. Thus, scores from 67 judges were available and were evaluated in the analysis. We chose the 17 panels with triplicate samples because the data provide a better demonstration of typical scoring patterns associated with the three characteristics. However, the model does not require replicate data on individual wines.



A. Judge Bias
Figure 1 shows the 95% posterior credible intervals (CI) of the judge bias parameters αj (j = 1, …, 67), where the dashed line spans the CI and the circle in the middle designates the posterior estimate of αj. A judge with little bias will have an αj value near zero. If the 95% CI of αj lies below (above) 0, the judge has a significant negative (positive) bias. Based on the results, a number of judges had negative bias, suggesting that they were more stringent than the other judges. For example, the posterior estimate of αj marked by a solid circle in Figure 1 is for judge J1 in the panel shown in Table 1. Note that the average scores over the triplicates for the four wines evaluated by judge J1 are (82.67, 83.33, 83.33, 84.67), and the corresponding average scores over the other three judges are (83.33, 87.56, 88.44, 88.67). The order of the four wines assigned by judge J1 is consistent with the order assigned by judges J2, J3, and J4 in the panel; however, judge J1's scores are uniformly lower than the average scores assigned by the other judges. A large bias, in either the positive or negative direction, does not necessarily indicate that the judge is inaccurate or inconsistent with the other judges in his ranking of the wines. However, since each judge serves on one and only one panel in the wine competition and wines are compared across panels, wines evaluated by stringent judges would be at a disadvantage compared with those evaluated by panels of relatively generous judges.
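The averages quoted for judge J1 can be checked directly from the raw scores in Table 1; this short Python sketch simply re-enters Panel A's data and reproduces them.

```python
import numpy as np

# Scores from Table 1 (Panel A): rows J1..J4, columns are the three
# replicates of wines R1..R4.
panel_a = np.array([
    [80, 84, 84,  86, 84, 80,  84, 80, 86,  90, 80, 84],  # J1
    [86, 86, 80,  86, 88, 96,  90, 84, 88,  86, 84, 84],  # J2
    [80, 80, 86,  86, 92, 86,  86, 80, 96,  90, 96, 86],  # J3
    [80, 86, 86,  84, 84, 86,  92, 86, 94,  94, 90, 88],  # J4
]).reshape(4, 4, 3)  # judge x wine x replicate

wine_means = panel_a.mean(axis=2)       # per-judge triplicate average, each wine
j1 = wine_means[0]                      # judge J1's four wine averages
others = panel_a[1:].mean(axis=(0, 2))  # average over judges J2, J3, J4

print(np.round(j1, 2))      # [82.67 83.33 83.33 84.67]
print(np.round(others, 2))  # [83.33 87.56 88.44 88.67]
```

J1's average is below the other judges' average for every wine, which is exactly the uniform stringency that the bias parameter αj captures.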
Figure 1

The 95% Posterior Credible Intervals (CI) of Judge Bias αj (j = 1, …, 67)

[Figure omitted: dot-and-interval plot of the 67 bias estimates.]

Note: The dashed line spans the CI and the circle in the middle designates the posterior estimate of αj. The horizontal line at 0 is the reference line.

Figure 2

The 95% Posterior Credible Intervals (CI) of Judge Discrimination βj (j = 1, …, 67)

[Figure omitted: dot-and-interval plot of the 67 discrimination estimates.]

Note: The dashed line spans the CI and the circle in the middle designates the posterior estimate of βj. The horizontal line at 0 is the reference line.

B. Judge Discrimination
Figure 2 shows the 95% posterior CI of the judge discrimination parameters βj (j = 1, …, 67). A negative value of βj indicates that the judge's scores are contrary to the opinion of the other judges in the panel. Based on Figure 2, there are 3 judges with significantly negative discrimination parameters, i.e., whose 95% CIs for βj are entirely below zero. In Table 2, we display the data from a panel having one judge (J1) with a negative estimate of βj (marked with a solid circle in Figure 2). For each judge in the panel, we also report the average score of the three samples for each wine. Note that the average scores over the triplicates for the four wines evaluated by judge J1 are (80, 90.67, 85.33, 80), with the order being (1, 4, 3, 1). The corresponding average scores over the other three judges are (89.33, 89.11, 90, 93.77), with the order being (2, 1, 3, 4). Thus, judge J1 considered the best (worst) wine selected by the other judges to be the worst (best), completely contrary to the majority opinion. Furthermore, judges J3 and J4 awarded the fourth wine (R4) the highest score possible (96) on all the triplicate samples, while judge J1 gave the same wine the lowest score possible (80) on those samples. A discrimination parameter near zero is also an undesirable feature for a judge, because such a βj indicates that the judge exhibits a low degree of discriminating power. In the panel shown in Table 2, the estimate of βj for judge J2 is close to zero (marked with a solid square in Figure 2). Based on the average scores of triplicate samples listed in parentheses in Table 2, judge J2 has a range of around 5 points (91.33 - 86) with which to differentiate the four wines. By comparison, judges J1, J3, and J4 all have ranges of more than 10 points (90.67 - 80, 96 - 85.33, 96 - 85.33) in their evaluations.
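Starting from the Panel B triplicate averages quoted above, a short Python sketch reproduces the scoring ranges and the majority averages (the last majority average appears as 93.78 here; the paper reports 93.77, a rounding difference).

```python
# Triplicate averages for Panel B (Table 2), as quoted in the text:
# one value per wine R1..R4 for each judge.
avgs = {
    "J1": [80.00, 90.67, 85.33, 80.00],
    "J2": [86.00, 88.00, 91.33, 89.33],
    "J3": [89.33, 85.33, 93.33, 96.00],
    "J4": [92.67, 94.00, 85.33, 96.00],
}

# Scoring range: a small range suggests low discriminating power (small beta).
ranges = {j: round(max(a) - min(a), 2) for j, a in avgs.items()}
print(ranges)  # {'J1': 10.67, 'J2': 5.33, 'J3': 10.67, 'J4': 10.67}

# Majority average (J2, J3, J4): J1's best wine (R2) is the majority's worst,
# and J1's worst-rated R4 is the majority's best -- a negative-beta pattern.
others = [round(sum(avgs[j][i] for j in ("J2", "J3", "J4")) / 3, 2)
          for i in range(4)]
print(others)  # [89.33, 89.11, 90.0, 93.78]
```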


Table 2

Raw Data from Panel B

     R1  R1  R1          R2  R2  R2          R3  R3  R3          R4  R4  R4
J1   80  80  80 (80)     92  84  96 (90.67)  90  80  86 (85.33)  80  80  80 (80)
J2   88  90  80 (86)     92  86  86 (88)     90  92  92 (91.33)  92  86  90 (89.33)
J3   90  90  88 (89.33)  86  84  86 (85.33)  92  92  96 (93.33)  96  96  96 (96)
J4   90  96  92 (92.67)  96  96  90 (94)     96  80  80 (85.33)  96  96  96 (96)

Note: The average scores of triplicate samples for each judge on each wine are listed in ().

C. Judge Variation
Figure 3 shows the 95% posterior CI of the judge variance parameters σj² (j = 1, …, 67). A relatively large value of σj² indicates that the judge has more variation (or lower internal consistency) in assigning scores. Based on Figure 3, there are 3 judges who appear to have distinctly larger variation than the other judges. In Table 3, we present the data from a panel which involves one of those three judges (J2), whose estimate of σj² is marked with a solid circle in Figure 3. For each judge in the panel, we also report the standard deviation of the scores over the triplicate samples for each wine. Judge J2 is consistent only in evaluating the second wine (R2), which was voted unanimously to be the worst. As Hodgson (2008) suggested, judges tend to be more consistent in what they don't like. For the other three wines, judge J2 showed substantially more variation than the other three judges. Note that for two wines (R3 and R4), judge J2 assigned both the lowest score (80) and the highest score (96) to the same wine. For the other wine (R1), the judge assigned the lowest score (80) and the second-highest score (94) to the replicates.
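Assuming the reconstruction of judge J2's triplicate scores in Table 3, the reported standard deviations can be reproduced with the Python `statistics` module (sample standard deviation over each triplicate).

```python
import statistics

# Judge J2's triplicate scores from Table 3 (Panel C), one triple per wine.
j2 = {
    "R1": (92, 94, 80),
    "R2": (80, 80, 80),
    "R3": (80, 96, 94),
    "R4": (80, 84, 96),
}

# Sample standard deviation within each wine's triplicate: large values mean
# the judge scored identical wine very differently (large sigma_j^2).
sds = {w: round(statistics.stdev(s), 2) for w, s in j2.items()}
print(sds)  # {'R1': 7.57, 'R2': 0.0, 'R3': 8.72, 'R4': 8.33}
```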

IV. Discussion

A. Discussion of the Current Analysis

Hodgson's two papers published in the Journal of Wine Economics (2008, 2009) have focused much attention on judge performance in wine competitions (see the Wall Street Journal article by Mlodinow, 2009). It is not surprising that judges have different scoring patterns and show some variability in their evaluations, because wine ratings can be influenced by uncontrolled factors such as a judge's personal perception of taste, the order in which the wines are

Figure 3

The 95% Posterior Credible Intervals (CI) of Judge Variance σj² (j = 1, …, 67)

[Figure omitted: dot-and-interval plot of the 67 variance estimates.]

Note: The dashed line spans the CI and the circle in the middle designates the posterior estimate of σj².

Table 3

Raw Data from Panel C

     R1  R1  R1          R2  R2  R2       R3  R3  R3          R4  R4  R4
J2   92  94  80 (7.57)   80  80  80 (0)   80  96  94 (8.72)   80  84  96 (8.33)

[Rows for judges J1, J3, and J4 omitted.]

Note: The standard deviations of the scores on triplicate samples are listed in ().

tasted, and the other wines in the lineup. In order to improve consumer confidence and increase the credibility of wine competitions, it is of great importance to the wine business to recognize the problem and take measures to improve judge performance. Because of this, it could be useful to develop methods to answer these questions: Can we identify judges who are less qualified? For the less qualified judges, can we identify the main cause of their discrepancy from the other judges?



We provided an analysis of wine judge performance by describing three characteristics: bias, discrimination, and variation. Based on the analysis, we are able to determine which of the characteristics has caused the inconsistency in judges' scores: a significant bias, a low (or negative) level of discriminating power, or a high degree of variation in assigning scores. Among these problems, bias and a low level of discrimination may be easier to overcome than high variance: judges with these two issues still rank the wines in a similar order to the other judges, and they only need to adjust the center and scale of their scores to be more consistent with the rest of the panel. The more serious problems are a negative level of discrimination and a large variance, where the former indicates that a judge is using criteria contrary to the majority opinion and the latter indicates that there is much randomness in the judge's scores. In our analysis, only a few judges had these two problems: out of 67 judges, 3 showed a negative level of discrimination, and 3 had a large variance. It is a fundamental assumption in voting that the majority opinion is likely to be correct (Young, 1988), and this analysis is conducted under the assumption that the majority opinion is in conformity with wine quality. We understand that the majority opinion may not always be the truth. Thus, it is possible that a judge with a negative discrimination may be more qualified than the judges in the panel who have a positive discrimination (the majority opinion); in this case, further investigation may be needed. The purpose of this analysis was to provide more insight into wine judge performance. With this information, wine judges may be trained more efficiently, competition coordinators will have clearer guidelines for future judge selection, and, eventually, the accuracy of wine competitions may be improved.

B. Discussion of Further Applications of the Method

There are three main advantages of the Bayesian ordinal model which are not demonstrated by the current data analysis and which improve on current methods of judge ranking. The first is that the model enables us to extract information about judge performance without replicate data. The second is that the model can provide estimates of wine quality, which can be used to rank the wines. The third is that, because of the modeling approach, we can answer a variety of hypothetical questions in a structured way. First, the model can evaluate judges without replicate data because whenever there is sufficient overlap in the wines being evaluated (that is, the same wines are evaluated by several judges, which is the common scenario in wine competitions), the data contain information on judge performance. In Table 4, we present (artificial) data representing 4 judges' evaluations of 12 different wines, with no replication. The pairwise correlations of the scores assigned by the judges are cor(J1, J2) = 0.97, cor(J1, J3) = 0.98, cor(J2, J3) = 0.95, and cor(J4, J1) = 0.19, cor(J4, J2) = 0.16, cor(J4, J3) = 0.27. These correlations indicate that judges J1, J2, and J3 rank the wines similarly, while judge J4's scores are not highly correlated with those of the other 3 judges. Our model can estimate bias, discrimination, and variance



parameters for the 4 judges, just as it can when there is replicate data, and the parameters have the same interpretation. The analysis shows that judge bias and discrimination are similar for the 4 judges. However, judge J4 has a substantially larger variance than the other 3 judges; the variance estimates are (0.12, 0.13, 0.13, 1.38), respectively. It is judge J4's variance, a measure of the deviation of his scores from the "true" wine quality as estimated by the model, that leads to the inconsistency in his evaluation.
Table 4

Artificial Data for One Panel

       J1  J2  J3  J4
W1     80  80  80  80
W2     80  80  80  84
W3     84  84  86  90
W4     84  86  84  92
W5     86  86  84  80
W6     86  84  86  94
W7     88  88  86  84
W8     88  90  88  84
W9     90  88  90  86
W10    92  92  92  80
W11    94  94  94  96
W12    96  96  96  86
Second, the estimate of wine quality produced by the model adjusts for judge bias and discrimination and places more weight on the evaluations of the more qualified judges, who have a smaller variance. Cao, Stokes, and Zhang (2010) have shown that ranking items based on the model is more accurate than ranking based on the commonly used observed score average, which weights the scores from all the judges in the panel equally. The third advantage is really an advantage of the modeling approach itself as a description of the data-generating process. Once a model is estimated and fits the data reasonably, it can be used to make predictions about what would happen under different circumstances. For example, the model can be used to determine 1) how many judges are needed to achieve a certain quality of evaluation (i.e., giving a "good" wine a sufficiently high chance of being evaluated correctly); 2) whether there are hidden biases for or against certain grape varieties (e.g., Chardonnay, Merlot, and White Zinfandel) by individual judges; and 3) which of several competition arrangements will give the most reliable results. Thus the competition manager could, through the model, virtually alter the competition design, say by adding one judge per panel, or by reducing the bias of all judges by training them to use the scale in a particular way, or some other method of his choosing, and predict the effect of the changes. It is in our research plan to obtain different data to demonstrate the full potential of the ordinal model in wine competitions.



References

Ashenfelter, O. (2006). Tales from the crypt: Bruce Kaiser tells us about the trials and tribulations of a wine judge. Journal of Wine Economics, 1, 173-175.
Cao, J., Stokes, L., and Zhang, S. (2010). A Bayesian approach to ranking and rater evaluation: an application to grant reviews. Journal of Educational and Behavioral Statistics, 35, 194-214.
Carlin, B.P. and Louis, T.A. (2008). Bayesian Methods for Data Analysis. 3rd ed., Boca Raton: CRC Press.
Cicchetti, D.V. (2004). Who won the 1976 wine tasting of French Bordeaux and U.S. cabernets? Parametrics to the rescue. Journal of Wine Research, 15, 211-220.
Cicchetti, D.V. (2006). The Paris 1976 tastings revisited once more: comparing ratings of consistent and inconsistent tasters. Journal of Wine Economics, 1, 125-140.
Cliff, M.A. and King, M.C. (1996). A proposed approach for evaluating expert judge performance using descriptive statistics. Journal of Wine Research, 7, 83-90.
Cliff, M.A. and King, M.C. (1997). Application of eggshell plots for evaluation of judges at wine competition. Journal of Wine Research, 8, 75-80.
Hodgson, R.T. (2008). An examination of judge reliability at a major U.S. wine competition. Journal of Wine Economics, 3, 105-113.
Hodgson, R.T. (2009). An analysis of the concordance among 13 U.S. wine competitions. Journal of Wine Economics, 4, 1-9.
Johnson, V.E. and Albert, J.A. (1999). Ordinal Data Modeling. New York: Springer.
Mlodinow, L. (2009). A Hint of Hype, A Taste of Illusion. Wall Street Journal, Life & Style, November 20, 2009. Online at: 282653628.html
Seaman, C.H., Dou, J., Cliff, M.A., Yuksel, D., and King, M.C. (2001). Evaluation of wine competition judge performance using principal component similarity analysis. Journal of Sensory Studies, 16, 287-300.
Young, H.P. (1988). Condorcet's theory of voting. The American Political Science Review, 82, 1231-1244.