
DEVELOPMENT OF ROBUST DESIGN

UNDER CONTAMINATED AND NON-NORMAL DATA


Chanseok Park
Department of Mathematical Sciences, Clemson University
Clemson, SC 29634
Byung Rae Cho*
Department of Industrial Engineering, Clemson University
Clemson, SC 29634

Revised manuscript submitted to


Quality Engineering

April 2002

* Corresponding author
(864) 656 1874 (Voice)
(864) 656 0795 (Fax)
bcho@ces.clemson.edu

DEVELOPMENT OF ROBUST DESIGN


UNDER CONTAMINATED AND NON-NORMAL DATA

Chanseok Park
Department of Mathematical Sciences, Clemson University
Clemson, SC 29634
Byung Rae Cho
Department of Industrial Engineering, Clemson University
Clemson, SC 29634

The usual assumptions behind robust design are that the distribution
of experimental data is approximately normal and that there is no major contamination due to outliers in the data. Under these assumptions,
sample mean and variance are often used to estimate process mean and
variance. In this article, we first show simulation results indicating that
sample mean and variance may not be the best choice when one or both assumptions are not met. The results further show that sample median and
median absolute deviation (MAD) or sample median and inter-quartile
range (IQR) are indeed more resistant to departures from normality and
to contaminated data. We then show how to incorporate this observation
into robust design modeling and optimization. A case study is presented.
KEY WORDS: Robust design; Outliers; Non-normality; Outlier-resistant
estimators; Optimum operating conditions.

INTRODUCTION

Robust design is often identified as one of the most important design methodologies studied in the research community today for quality improvement purposes. Major US industries have promoted and implemented robust design techniques in order to make significant improvements in product quality and in production
methods. Examples of the application of robust design to various engineering problems in the automotive industry, plastic technology, process industry, and information
technology can be found in Bendell et al. [1] and Dehnad [2]. The main objective
of robust design is to obtain the optimum operating conditions of control factors by
minimizing variability associated with the quality characteristic of interest, and at
the same time by keeping the process mean at the customer-identified target value.
While the basic concept underlying robust design is clearly important, Taguchi's
tools for achieving the goal, such as orthogonal arrays and signal-to-noise ratios, have
drawn much criticism. Initiated by Box [3], several authors, including Vining and Myers [4], Pignatiello and Ramberg [5], Myers et al. [6], and Myers and Montgomery [7],
have pointed out numerous shortcomings embodied in Taguchi's approach to robust
design. Consequently, there has been a great deal of research effort in order to rectify
these drawbacks. One of the alternatives, the response surface approach, has received
much attention. This approach facilitates understanding the system by separately
modeling the response functions for process mean and variance. For detailed information, readers are referred to Vining and Myers [4], Del Castillo and Montgomery [8],
Lin and Tu [9], Cho et al. [10], and Kim and Cho [11].

RESEARCH MOTIVATION

The response surface approach uses the method of least squares to obtain
the adequate response functions for process mean and variance by assuming that
experimental data are normally distributed and that there is no major contamination
in the data. Often, however, these assumptions may not hold in modeling many
real-world industrial problems. In particular, when the sample size is small, as is
generally the case in many engineering problems, the fitted response surface functions
for process mean and variance are very sensitive to these assumptions. In other
words, if one or both of these assumptions are violated in a serious manner, the
optimum operating conditions of control factors may be located far from the true
optimum conditions we are actually looking for. Thus, a different approach needs to
be developed.
In fact, there are rigorous ways to overcome this situation. When the normality assumption is not met, either a nonparametric approach or a data transformation technique, such as a logarithmic or Box-Cox transformation, may be employed.
However, there is no guarantee that these approaches will bring the data to a satisfactory level of agreement with the normality assumption. When there are contaminated data, our first suspicion may
be that the observations have resulted from a mistake or other extraneous effects and
hence should be discarded. A major reason for discarding such data points is that, under
the least squares method, a fitted function is pulled disproportionately toward an outlying observation because the sum of the squared deviations is minimized; this could produce a misleading fit if the contaminated data have indeed resulted from a
mistake or other extraneous cause. On the other hand, contaminated data may convey significant information; for example, they may occur because of an interaction
with another independent variable. More detailed discussions of contaminated data
can be found in Neter et al. [12].
The purpose of this article is two-fold. First, for the case where these assumptions are not met, we present Monte Carlo simulation results indicating that
the median and the median absolute deviation (MAD) or the interquartile range (IQR) are more
resistant to departures from normality and to contaminated data. Therefore the median
and MAD or IQR may be good alternatives to the mean and variance, respectively. We
then incorporate these observations into robust design modeling and optimization,
and show how the proposed robust design model works through a case study.

THE PROPOSED METHOD

Outlier-Resistant Estimators

Consider a system involving a response $Y$ which depends on the levels of $k$
control factors $(x_1, x_2, \ldots, x_k)$. The following assumptions are made:

- A functional structure, $Y = g(x_1, x_2, \ldots, x_k)$, is either unknown or complicated.
- The levels of $x_i$ for $i = 1, 2, \ldots, k$ are quantitative and continuous.
- The levels of $x_i$ for $i = 1, 2, \ldots, k$ can be controlled by the experimenter.
Suppose that $m$ replicates are taken at each of the design points. Let $Y_{ij}$
represent the $j$th response at the $i$th design point, where $i = 1, 2, \ldots, n$ and
$j = 1, 2, \ldots, m$. The most popular estimators of the location and scale parameters are the
mean and variance, respectively. At the $i$th design point, we have the sample mean
and sample variance as follows:

$$\bar{Y}_i = \frac{1}{m}\sum_{j=1}^{m} Y_{ij} \quad\text{and}\quad S_i^2 = \frac{1}{m-1}\sum_{j=1}^{m} \bigl(Y_{ij} - \bar{Y}_i\bigr)^2.$$
These estimators are very sensitive to outliers, which are observations that take unexpected
values relative to the majority of the sample. For example, the sample mean
$\bar{Y} = (1/m)\sum_{j=1}^{m} Y_j$ can be upset completely by a single outlier; if any one of the $Y_j$ goes
to $\pm\infty$, then $\bar{Y}$ goes to $\pm\infty$. Readers are referred to Tukey [13] for more examples.
Hence there are practical situations in which the sample mean and variance need to
be replaced with other estimators that are less sensitive to outliers. We propose
the median as an alternative to the mean, and the median absolute deviation (MAD)
and the inter-quartile range (IQR) as alternatives to the standard deviation, which
are given as

$$\mathrm{MAD}(Y_1, \ldots, Y_m) = \operatorname*{median}_{1 \le j \le m} \Bigl|\, Y_j - \operatorname*{median}_{1 \le k \le m}(Y_k) \,\Bigr|,$$

$$\mathrm{IQR}(Y_1, \ldots, Y_m) = Y_{[3m/4]} - Y_{[m/4]},$$

where $Y_{[p]}$ denotes the $p$th order statistic.
Let $Y$ be a random variable following $N(\mu, \sigma^2)$ and let $Z$ be a random variable
following $N(0, 1)$ with $\Phi(\cdot)$ being its cumulative distribution function. It is easily seen that
$Y - \mu$ and $\sigma Z$ have the same $N(0, \sigma^2)$ distribution. Let $\phi(z)$ denote the probability
density function (pdf) of $Z$. The pdf of $W = |Z|$ then becomes $2\phi(w)$, whose
support is $(0, \infty)$. The median of $W$ is obtained by solving the following equation for $m$:

$$\int_m^{\infty} 2\phi(w)\,dw = \frac{1}{2}.$$

It follows that $2\{1 - \Phi(m)\} = 1/2$. Hence we have $m = \operatorname{median}(|Z|) = \Phi^{-1}(3/4)$.

Using $\operatorname*{median}_{1 \le j \le n}(Y_j) \to \mu$ and $Z_{[pn]} \to \Phi^{-1}(p)$ as $n \to \infty$, we have the following
results:

$$\mathrm{MAD}(Y_1, \ldots, Y_n) \to \operatorname{median}\bigl(|Y - \mu|\bigr) = \sigma \operatorname{median}\bigl(|Z|\bigr) = \sigma\,\Phi^{-1}(3/4),$$

$$\mathrm{IQR}(Y_1, \ldots, Y_n) \to \sigma\bigl\{\Phi^{-1}(3/4) - \Phi^{-1}(1/4)\bigr\}.$$

Hence the estimators $\mathrm{MAD}/\Phi^{-1}(3/4)$ and $\mathrm{IQR}/\{\Phi^{-1}(3/4) - \Phi^{-1}(1/4)\}$ are consistent
for the scale parameter $\sigma$, and we denote them as

$$D(Y_1, \ldots, Y_n) = \mathrm{MAD}(Y_1, \ldots, Y_n)/\Phi^{-1}(3/4),$$

$$Q(Y_1, \ldots, Y_n) = \mathrm{IQR}(Y_1, \ldots, Y_n)/\bigl\{\Phi^{-1}(3/4) - \Phi^{-1}(1/4)\bigr\}.$$
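As a concrete illustration, the following minimal R sketch (the function name robust_estimates and the example values are ours, purely for illustration) computes the sample median together with the consistent scale estimates $D$ and $Q$ defined above. R's built-in mad() and IQR() compute the same quantities up to the choice of scaling constant and quantile interpolation, so only small adjustments are needed.

```r
## Outlier-resistant location and scale estimates for one design point,
## following the definitions of D and Q above (illustrative sketch).
robust_estimates <- function(y) {
  k_mad <- qnorm(3/4)               # Phi^{-1}(3/4), approximately 0.6745
  k_iqr <- qnorm(3/4) - qnorm(1/4)  # approximately 1.3490
  list(
    median = median(y),
    D = mad(y, constant = 1 / k_mad),  # MAD / Phi^{-1}(3/4)
    Q = IQR(y) / k_iqr                 # IQR / {Phi^{-1}(3/4) - Phi^{-1}(1/4)}
  )
}

## A single gross outlier inflates the standard deviation but barely
## moves the median, D, or Q.
y <- c(54, 59, 40, 66, 240)
robust_estimates(y)
sd(y)
```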

Incorporating the Outlier-Resistant Estimators into Robust Design


Let $\hat{\mu}(\mathbf{x})$ and $\hat{\sigma}^2(\mathbf{x})$ represent the fitted response functions for the mean and
the variance of the response $Y$. Assuming a second-order polynomial model for the
response functions, we get

$$\hat{\mu}(\mathbf{x}) = \hat{\beta}_0 + \sum_{i=1}^{k} \hat{\beta}_i x_i + \sum_{i=1}^{k}\sum_{j=i}^{k} \hat{\beta}_{ij} x_i x_j
\quad\text{and}\quad
\hat{\sigma}^2(\mathbf{x}) = \hat{\gamma}_0 + \sum_{i=1}^{k} \hat{\gamma}_i x_i + \sum_{i=1}^{k}\sum_{j=i}^{k} \hat{\gamma}_{ij} x_i x_j.$$

The usual method is to estimate the regression coefficients in $\hat{\mu}(\mathbf{x})$ by using the sample
mean of $Y$ and those in $\hat{\sigma}^2(\mathbf{x})$ by using the sample variance of $Y$.
The main objective of robust design is to obtain the optimum operating conditions of control factors, and this goal can be easily achieved by employing the following
squared-loss optimization model:

$$\text{minimize}\quad \bigl(\hat{\mu}(\mathbf{x}) - t_0\bigr)^2 + \hat{\sigma}^2(\mathbf{x}),$$

where $t_0$ is the customer-identified target value for the quality characteristic of interest. Two items are worth mentioning. First, the following dual-response optimization
model proposed by Vining and Myers [4] can also be used for optimization purposes:

$$\text{minimize}\quad \hat{\sigma}^2(\mathbf{x}) \quad\text{subject to}\quad \hat{\mu}(\mathbf{x}) = t_0.$$
However, the dual-response model strictly imposes a zero-bias condition, while
the squared-loss model allows some bias which may result in less variability. For
detailed information regarding the squared-loss model, readers may refer to Lin and
Tu [9] and Cho et al. [10]. Second, although the quadratic fitted functions are shown
above, the estimated functions can also be linear.
Using the outlier-resistant estimators and the squared-loss optimization scheme,
we may develop three different models, as follows (a short R sketch illustrating the fitting and optimization steps is given after this list):

Model A: $\hat{\mu}(\mathbf{x})$ fitted using the sample mean and $\hat{\sigma}^2(\mathbf{x})$ fitted using the sample variance.
Model B: $\hat{\mu}(\mathbf{x})$ fitted using the sample median and $\hat{\sigma}^2(\mathbf{x})$ fitted using the squared sample MAD (i.e., $D^2$ as defined above).
Model C: $\hat{\mu}(\mathbf{x})$ fitted using the sample median and $\hat{\sigma}^2(\mathbf{x})$ fitted using the squared sample IQR (i.e., $Q^2$ as defined above).
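To make the fitting and optimization steps concrete, here is a minimal R sketch of Model B under the squared-loss scheme. It assumes a data frame d (our own name) containing the coded factor levels x1, x2, x3 and, at each design point, the sample median y_med and the squared MAD-based estimate d2; the second-order response surfaces are fitted by least squares, and the squared-loss objective is minimized over the coded design region.

```r
## Model B sketch: fit second-order response surfaces to the per-design-point
## median and squared MAD, then minimize the squared-loss objective.
fit_mu  <- lm(y_med ~ polym(x1, x2, x3, degree = 2, raw = TRUE), data = d)
fit_var <- lm(d2    ~ polym(x1, x2, x3, degree = 2, raw = TRUE), data = d)

t0 <- 50  # customer-identified target value

squared_loss <- function(x) {
  newd <- data.frame(x1 = x[1], x2 = x[2], x3 = x[3])
  as.numeric((predict(fit_mu, newd) - t0)^2 + predict(fit_var, newd))
}

## Box-constrained search over the coded design region [-1, 1]^3
opt <- optim(c(0, 0, 0), squared_loss, method = "L-BFGS-B",
             lower = rep(-1, 3), upper = rep(1, 3))
opt$par  # estimated optimum operating conditions x*
```

Models A and C follow the same pattern, with the response columns replaced by the sample mean and variance, or by the sample median and the squared IQR-based estimate, respectively.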

SIMULATION RESULTS AND VERIFICATION

In this section, we analyze the simulation results for verification purposes
by replacing the mean with the median, and the standard deviation with MAD and IQR. The
numerical simulations are performed using the R language, which is a non-commercial,
open source software environment for statistical computing and graphics originally developed by
Ihaka and Gentleman [14]. It can be obtained at no cost from
http://www.r-project.org/.
The responses ($Y$) were randomly generated using the R language from a normal distribution with and without contaminated data to check how the presence of
contaminated data affects the estimators. To see how the lack of the normality assumption
affects the estimators, the responses were also generated from other distributions, such
as the double exponential, logistic, and Cauchy distributions, whose probability density
functions are given by
$$\text{double exponential:}\quad f(x) = \frac{1}{2\lambda}\, e^{-|x-\theta|/\lambda},$$

$$\text{logistic:}\quad f(x) = \frac{1}{2\lambda\bigl[1 + \cosh\{(x-\theta)/\lambda\}\bigr]},$$

$$\text{Cauchy:}\quad f(x) = \frac{\lambda}{\pi\bigl\{\lambda^2 + (x-\theta)^2\bigr\}}.$$
These three distributions are well-known for their heavier tails as compared to a
normal distribution.
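For reference, a minimal R sketch of how draws from these heavy-tailed alternatives can be generated (the function names and the location-scale parameterization $(\theta, \lambda)$ follow the pdfs above and are our own labels):

```r
## Draws with location theta and scale lambda from the three heavy-tailed
## alternatives; the double exponential is built from an exponential variate
## with a random sign.
r_dexp   <- function(n, theta, lambda)
  theta + lambda * sample(c(-1, 1), n, replace = TRUE) * rexp(n)
r_logis  <- function(n, theta, lambda) rlogis(n,  location = theta, scale = lambda)
r_cauchy <- function(n, theta, lambda) rcauchy(n, location = theta, scale = lambda)
```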
As illustrated in the next section, five responses $(Y_{i1}, \ldots, Y_{i5})$ are generated
from the distributions with $\mu(\mathbf{x}_i)$ and $\sigma(\mathbf{x}_i)$ at each control factor setting $\mathbf{x}_i =
(x_{i1}, x_{i2}, x_{i3})$, $i = 1, \ldots, 27$. The total number of iterations is 500, each having 27
design points and 135 responses, and $\mu(\mathbf{x})$ and $\sigma^2(\mathbf{x})$ are given as follows:

$$\mu(\mathbf{x}) = 50 + 5\bigl(x_1^2 + x_2^2 + x_3^2\bigr),$$
$$\sigma^2(\mathbf{x}) = 100 + 5\bigl\{(x_1 - 0.5)^2 + x_2^2 + x_3^2\bigr\}.$$
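A minimal R sketch of one iteration of this setup (the object and function names are ours; the contamination distribution $N(250, 10^2)$ is the one used for Figure 1):

```r
## Generate m = 5 responses at each of the 27 design points of the 3^3 design
## from N(mu(x), sigma^2(x)), optionally replacing a few values with
## contaminated observations from N(250, 10^2).
mu_fun    <- function(x) 50 + 5 * (x[1]^2 + x[2]^2 + x[3]^2)
sigma_fun <- function(x) sqrt(100 + 5 * ((x[1] - 0.5)^2 + x[2]^2 + x[3]^2))

design <- expand.grid(x1 = -1:1, x2 = -1:1, x3 = -1:1)  # 27 design points
m <- 5

simulate_responses <- function(n_contam = 0) {
  Y <- t(apply(design, 1, function(x) rnorm(m, mu_fun(x), sigma_fun(x))))
  if (n_contam > 0) {
    idx <- sample(length(Y), n_contam)   # e.g., 3 of the 135 responses
    Y[idx] <- rnorm(n_contam, 250, 10)
  }
  Y  # 27 x 5 matrix of responses
}

set.seed(1)
Y <- simulate_responses(n_contam = 3)
```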

For each distribution specified above, two statistical measures, bias
and mean squared error (MSE), were considered as decision criteria to judge the
performance of the estimators. Assuming that the customer-identified product target is
$t_0 = 50.0$, Table 1 shows the estimated bias and MSE of the optimal mean response
$\hat{\mu}(\mathbf{x}^*)$, where Model A uses the conventional estimators of sample mean and variance,
Model B uses the sample median and squared MAD, and Model C uses the sample median
and squared IQR. As expected, Model A, using the sample mean and variance, turned
out to be the best under a normal distribution without contaminated data.
Under a normal distribution with contaminated data, however, Model B, using the sample
median and squared MAD, outperformed Model A.
For the other non-normal distributions, Model C, using the sample median and squared IQR, seems
to be the best choice.
Table 1 around here

Figure 1 shows the kernel density estimates of the optimal mean response
$\hat{\mu}(\mathbf{x}^*)$ for a normal distribution with and without contaminated data using Models A,
B, and C. The kernel density estimate is one of the most useful non-parametric
density estimators for a continuous case. A good reference is Silverman [15]. Figure 2
shows the kernel density estimates of the optimal mean responses $\hat{\mu}(\mathbf{x}^*)$ for the double
exponential and logistic distributions. Using the notion of kernel density estimates,
we arrive at the same conclusions. That is, sample mean and variance are useful
estimators under a normal distribution without contamination. However, when a
distribution is contaminated, sample median and squared MAD are more useful.

Finally, sample median and IQR can be a good choice when a normality assumption
is not met.

Figures 1 and 2 around here

A CASE STUDY

The following $3^3$ factorial design presents data obtained in the development
of a tire tread compound on the PICO abrasion index ($Y_{ij}$). Five replicates were
taken at each design point ($i = 1, \ldots, 27$, $j = 1, \ldots, 5$) shown in Table 2, where
$x_1$, $x_2$, and $x_3$ are the hydrated silica level, silane coupling agent level, and sulfur level,
respectively. Three contaminated data points ($Y_{10,2}$, $Y_{17,4}$, and $Y_{25,3}$) were observed, and the
sample mean ($\bar{Y}_i$), sample median ($\tilde{Y}_i$), sample variance ($S_i^2$), squared MAD-based estimate ($D_i^2$), and squared IQR-based estimate
($Q_i^2$) were calculated. As shown in the table, the conventional estimation approach
using the sample mean and variance turns out to be very sensitive to the contaminated
data points, marked with an asterisk in Table 2, when compared to the proposed estimation
approach using the median, MAD, and IQR.
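The per-row summaries in Table 2 can be reproduced (up to rounding) with a short R sketch, assuming Y is the 27 x 5 matrix of PICO abrasion indices; the function name is ours.

```r
## Per-design-point summaries: mean, median, variance, squared D, squared Q.
row_summaries <- function(Y) {
  k_mad <- qnorm(3/4)
  k_iqr <- qnorm(3/4) - qnorm(1/4)
  data.frame(
    mean   = rowMeans(Y),
    median = apply(Y, 1, median),
    S2     = apply(Y, 1, var),
    D2     = apply(Y, 1, mad, constant = 1 / k_mad)^2,
    Q2     = apply(Y, 1, function(y) IQR(y) / k_iqr)^2
  )
}
```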
Table 2 around here

For illustrative purposes, the three models are compared and the summary
is shown in Table 3. It can be observed that the estimated bias (i.e., the absolute
value of the difference between the estimated mean and the target) and the variance using the
conventional approach (i.e., Model A) turned out to be very large (14.43 and 611.07),
as compared to those using Models B and C. This is because Model A does not
take into consideration the contaminated data points. This particular example shows
that the optimal operating conditions using Models B and C are more robust when
contaminated data are present.

Table 3 around here

CONCLUSIONS

In this article, we have used simulations to show that the proposed models
using the sample median, MAD, and IQR are outlier-resistant in the case where the
distribution of experimental data is not approximately normal or where there is major
contamination in the data. We then showed how to incorporate these observations
into robust design modeling and optimization. The numerical example clearly shows
that the proposed models using the outlier-resistant estimators provide a significant
reduction in bias and variance. It is hoped that the proposed models, which are
relatively simple to implement, can be used as part of a quality improvement effort.

REFERENCES

[1] A. Bendell, J. Disney, and W. A. Pridmore. Taguchi Methods: Applications in World Industry. IFS Publications, London, UK, 1987.

[2] K. Dehnad. Quality Control, Robust Design and the Taguchi Method. Wadsworth and Brooks/Cole, Pacific Grove, CA, 1989.

[3] G. E. P. Box. Discussion of off-line quality control, parameter design and the Taguchi methods. Journal of Quality Technology, 17:198–206, 1985.

[4] G. G. Vining and R. H. Myers. Combining Taguchi and response surface philosophies: a dual response approach. Journal of Quality Technology, 22:38–45, 1990.

[5] J. Pignatiello and J. S. Ramberg. Top ten triumphs and tragedies of Genichi Taguchi. Quality Engineering, 4:221–225, 1991.

[6] R. H. Myers, A. I. Khuri, and G. G. Vining. Response surface alternatives to the Taguchi robust design problem. American Statistician, 46:131–139, 1992.

[7] R. H. Myers and D. C. Montgomery. Response Surface Methodology. John Wiley & Sons, New York, 1995.

[8] E. Del Castillo and D. C. Montgomery. A nonlinear programming solution to the dual response problem. Journal of Quality Technology, 25:199–204, 1993.

[9] D. K. J. Lin and W. Tu. Dual response surface optimization. Journal of Quality Technology, 27:34–39, 1995.

[10] B. R. Cho, Y. J. Kim, D. L. Kimbler, and M. D. Phillips. An integrated joint optimization procedure for robust and tolerance design. International Journal of Production Research, 38:2309–2325, 2000.

[11] Y. J. Kim and B. R. Cho. Development of priority-based robust design. Quality Engineering, 14(3):355–363, 2002.

[12] J. Neter, M. H. Kutner, C. J. Nachtsheim, and W. Wasserman. Applied Linear Statistical Models. Irwin, Chicago, 1996.

[13] J. W. Tukey. A survey of sampling from contaminated distributions. In I. Olkin, S. Ghurye, W. Hoeffding, W. Madow, and H. Mann, editors, Contributions to Probability and Statistics, pages 448–485. Stanford University Press, Stanford, 1960.

[14] R. Ihaka and R. Gentleman. R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5:299–314, 1996.

[15] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman & Hall, London, 1986.

About the Authors:


Chanseok Park is an assistant professor of Mathematical Sciences at Clemson University. He received his Ph.D. in Statistics from Pennsylvania State University, and his
research areas of interest include statistical inference using quadratic inference functions,
robust inference, and statistical computing and simulation.
Byung Rae Cho is an associate professor of Industrial Engineering at Clemson University. He received his Ph.D. in Industrial Engineering from the University of Oklahoma
and his research areas of interest include robust design, tolerance synthesis and optimization, reliability engineering, and military operations research.

Figure 1: Kernel density estimates of the optimal mean response $\hat{\mu}(\mathbf{x}^*)$ under Models A, B, and C. In the first data set, 135 random samples were drawn from $N(\mu(\mathbf{x}), \sigma^2(\mathbf{x}))$. In the second data set, 132 samples were drawn from $N(\mu(\mathbf{x}), \sigma^2(\mathbf{x}))$ and 3 samples from $N(250, 10^2)$ (2.2% contamination). The results were obtained after 500 iterations.

Figure 2: Kernel density estimates of the optimal mean response $\hat{\mu}(\mathbf{x}^*)$ under Models A, B, and C. (a) Samples drawn from the double exponential distribution with location $\mu(\mathbf{x})$ and scale $\lambda = \sigma(\mathbf{x})$. (b) Samples drawn from the logistic distribution with location $\mu(\mathbf{x})$ and scale $\lambda = \sigma(\mathbf{x})$. The results were obtained after 500 iterations.

Table 1: Estimated bias and MSE of the optimal mean response $\hat{\mu}(\mathbf{x}^*)$.

                             Model A             Model B             Model C
Distribution              Bias      MSE       Bias      MSE       Bias      MSE
Normal                    3.58     17.10      4.42     26.09      3.73     18.79
Normal (contaminated)     6.99     99.62      4.70     29.75      4.20     24.85
Double Exponential        6.13     48.59      5.71     42.19      5.05     34.23
Logistic                  6.38     57.95      6.67     66.53      5.74     49.70
Cauchy                   12.13   75018.21     8.14    101.22      7.42     85.89

Table 2: Data for the case study example. The data set comes from $N(\mu(\mathbf{x}), \sigma^2(\mathbf{x}))$; contaminated observations are marked with an asterisk.

 i   x_i1  x_i2  x_i3   Y_i1  Y_i2  Y_i3  Y_i4  Y_i5   mean  median    S_i^2    D_i^2    Q_i^2
 1    -1    -1    -1      54    59    40    66    76   59.0     59     181.0    107.7     79.1
 2     0    -1    -1      55    60    61    56    66   59.6     60      19.3     35.2     13.7
 3     1    -1    -1      44    55    49    74    87   61.8     55     327.7    266.0    343.4
 4    -1     0    -1      66    46    53    92    71   65.6     66     317.3    371.5    178.0
 5     0     0    -1      52    70    55    54    53   56.8     54      55.7      2.2      2.2
 6     1     0    -1      67    78    47    55    61   61.6     61     138.8     79.1     79.1
 7    -1     1    -1      64    94    59    65    65   69.4     65     195.3      2.2      0.5
 8     0     1    -1      49    33    55    61    67   53.0     55     170.0     79.1     79.1
 9     1     1    -1      94    64    62    57    54   66.2     62     257.2     55.0     26.9
10    -1    -1     0      65   239*   65    52    41   92.4     65    6816.8    371.5     92.9
11     0    -1     0      48    56    64    43    57   53.6     56      67.3    140.7     44.5
12     1    -1     0      56    79    58    55    66   62.8     58     100.7     19.8     55.0
13    -1     0     0      48    37    63    45    56   49.8     48     100.7    140.7     66.5
14     0     0     0      40    49    43    54    52   47.6     49      35.3     55.0     44.5
15     1     0     0      77    57    35    74    42   57.0     57     349.5    635.3    562.7
16    -1     1     0      48    55    57    65    69   58.8     57      69.2    140.7     55.0
17     0     1     0      49    70    27   238*   47   86.2     49    7432.7    969.4    290.7
18     1     1     0      64    48    77    66    82   67.4     66     173.8    266.0     92.9
19    -1    -1     1      61    66    74    63    43   61.4     63     130.3     19.8     13.7
20     0    -1     1      75    68    42    63    56   60.8     63     158.7    107.7     79.1
21     1    -1     1      71    78    58    51    39   59.4     58     242.3    371.5    219.8
22    -1     0     1      51    52    58    69    82   62.4     58     171.3    107.7    158.8
23     0     0     1      61    49    57    73    60   60.0     60      75.0     19.8      8.8
24     1     0     1      67    69    56    72    64   65.6     67      37.3     19.8     13.7
25    -1     1     1      64    62   244*   36    66   94.4     64    7142.8      8.8      8.8
26     0     1     1      36    45    77    65    72   59.0     65     313.5    316.5    400.6
27     1     1     1      44    69    69    57    78   63.4     69     173.3    178.0     79.1

Table 3: Estimates of the regression coefficients of $\hat{\mu}(\mathbf{x})$ and the optimal settings $\mathbf{x}^*$ under Models A, B, and C ($p$-values in parentheses).

                          Model A               Model B               Model C
beta_0               55.03 (0.000)         51.48 (0.000)         51.48 (0.000)
beta_1               -2.67 (0.340)          0.44 (0.683)          0.44 (0.683)
beta_2                2.61 (0.350)          0.83 (0.446)          0.83 (0.446)
beta_3                1.86 (0.504)          1.67 (0.137)          1.67 (0.137)
beta_11               5.84 (0.231)          4.22 (0.036)          4.22 (0.036)
beta_22               7.54 (0.127)          2.72 (0.159)          2.72 (0.159)
beta_33              -0.66 (0.891)          5.22 (0.012)          5.22 (0.012)
beta_12               0.27 (0.937)          2.25 (0.104)          2.25 (0.104)
beta_13              -2.12 (0.533)          1.75 (0.199)          1.75 (0.199)
beta_23               2.25 (0.508)          0.50 (0.707)          0.50 (0.707)
x*               (1.00, 0.98, 0.97)    (0.57, 0.28, 1.00)    (0.27, 0.72, 1.00)
|mu_hat(x*) - t0|        14.43                 5.43                  6.92
sigma_hat^2(x*)         611.07                87.12                 49.11
