
Expert Systems with Applications 39 (2012) 2590–2596

Contents lists available at SciVerse ScienceDirect

Expert Systems with Applications

journal homepage: www.elsevier.com/locate/eswa

A data mining approach considering missing values for the optimization of semiconductor-manufacturing processes

Doh-Soon Kwak a, Kwang-Jae Kim b,*

a Samsung Electronics, San 61, Banwol-Dong, Hwasung, Gyeonggi 445-701, Republic of Korea
b Division of Mechanical and Industrial Engineering, Pohang University of Science and Technology, San 31, Hyoja-Dong, Nam-Gu, Pohang, Kyungbuk 790-784, Republic of Korea

a r t i c l e  i n f o

Keywords:
Data mining approach
Missing values
Patient Rule Induction Method
Process optimization

a b s t r a c t

Due to the rapid development of information technologies, abundant data have become readily available. Data mining techniques have been used for process optimization in many manufacturing processes in automotive, LCD, semiconductor, and steel production, among others. However, a large amount of missing values occurs in the data set due to several causes (e.g., data discarded by gross measurement errors, measurement machine breakdown, routine maintenance, sampling inspection, and sensor failure), which frequently complicates the application of data mining to the data set. This study proposes a new procedure for optimizing processes called the missing values-Patient Rule Induction Method (m-PRIM), which handles the missing-values problem systematically and yields considerable process improvement, even if a significant portion of the data set has missing values. A case study in a semiconductor manufacturing process is conducted to illustrate the proposed procedure.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

The use of data mining techniques in manufacturing industries began in the 1990s, gradually receiving attention in many manufacturing processes in automotive, LCD, semiconductor, and steel manufacturing for predictive maintenance, fault detection, diagnosis, and scheduling (Harding, Shahbaz, Srinivas, & Kusiak, 2006). Data mining techniques have also been used for process optimization, in order to find optimum conditions for input variables that maximize (or minimize) output variables (Braha & Shmilovici, 2002; Kim & Ding, 2005).

Among many data mining techniques, the Patient Rule Induction Method (PRIM), originally proposed by Friedman and Fisher (1999), has been successfully applied to process optimization despite its recent emergence (Chong, Albin, & Jun, 2007; Chong & Jun, 2008; Kwak, Kim, & Lee, 2010; Lee & Kim, 2008). This method directly seeks a set of sub-regions of the input variables in which higher quality values are observed in the historical data.

An embedded assumption in existing PRIM works for process optimization is that missing values do not exist in the data sets, or that the amount of missing values is negligible. Although abundant data are readily available due to the rapid development of information technologies, missing values are a common occurrence in various industrial process data sets due to several causes (e.g., data discarded by gross measurement errors, measurement machine breakdown, routine maintenance, sampling inspection, and sensor failure) (Arteaga & Ferrer, 2002; Muteki, Macgregor, & Ueda, 2005; Nelson, Taylor, & Macgregor, 1996). A large amount of missing values frequently complicates the application of data mining algorithms (including PRIM) to the data set, because most data mining algorithms have not been designed for them. Moreover, if missing values are not handled in principled ways, they can produce biased, distorted, and unreliable conclusions (Dasu & Johnson, 2003; Feelders, 1999). Thus, for the successful application of the existing PRIM works to process optimization, it is necessary to enhance them by treating the missing-values problem systematically.

The purpose of this paper is to develop a new PRIM-based method for optimizing processes where a significant portion of the data set has missing values. This method will be referred to as the missing values-PRIM (m-PRIM). The remainder of the paper is organized as follows: PRIM is briefly reviewed in the next section; the proposed method is introduced, and the results of a case study are presented; finally, the conclusion and discussion are given.

* Corresponding author. Tel.: +82 54 279 2208; fax: +82 54 279 2870.
E-mail addresses: dskwak@samsung.com (D.-S. Kwak), kjk@postech.ac.kr (K.-J. Kim).

0957-4174/$ - see front matter © 2011 Elsevier Ltd. All rights reserved.
doi:10.1016/j.eswa.2011.08.114

2. Patient Rule Induction Method (PRIM)

The goal of PRIM is to discover a small box-shaped region, called a box, with a higher proportion of good observations compared to the entire region from a large data set {(y(r), x(r)), r = 1, 2, . . . , N}. In this set, y(r) and x(r) = (x1(r), x2(r), . . . , xp(r)) are the output and p

input variables for the rth observation, respectively, and N is the number of observations in the entire region. Below is a brief description of PRIM as it pertains to the current research. Further details about PRIM can be found in Friedman and Fisher (1999) and Hastie, Tibshirani, and Friedman (2001).

2.1. Box and related statistics

A p-dimensional box B is defined as the intersection of sub-ranges of the input variables such that:

B = s_x1 × s_x2 × ⋯ × s_xp,   (1)

where s_xj = [x_lj, x_uj] is a sub-range of the input variable xj (j = 1, 2, . . . , p), and x_lj and x_uj denote the lower and upper bound of xj in the box B, respectively.

Given a box B and the data set {(y(r), x(r)), r = 1, 2, . . . , N}, there are two statistics indicating the properties of the box. The first one is the support (β_B), which denotes the proportion of the observations contained in B, given by:

β_B = n_B / N,   (2)

where n_B is the number of observations inside B, calculated by:

n_B = Σ_{r=1}^{N} 1(x(r) ∈ B).   (3)

Here, the function 1(·) takes one when the argument is true, and zero otherwise. The support clearly ranges from zero to one. The second statistic is the box objective (Obj_B), which is the mean of the output variable in B, given by:

Obj_B = (1/n_B) Σ_{x(r)∈B} y(r).   (4)

If the output variable has a most desirable value (i.e., a target), the box objective in (4) is expressed as:

Obj_B = −(1/n_B) Σ_{x(r)∈B} (y(r) − target)².   (5)

2.2. Algorithm

A prepared data set of interest is randomly split into a learning set and a test set. Then, PRIM starts with box B0, which includes all observations in the learning set. From B0, PRIM creates 2p candidate boxes, {C1−, C1+, C2−, C2+, . . . , Cp−, Cp+}. The Cj− and Cj+ (j = 1, 2, . . . , p) are obtained by peeling 100α% of the observations inside the box from the left and right side of the jth input variable xj, respectively. Here, α is the peeling parameter, which determines the number of observations peeled off at each iteration and is typically set to a small value (between 0.05 and 0.1). Then, PRIM chooses the candidate box with the largest box objective and lets this box be B1. Boxes B1, B2, . . . , Bk are iteratively generated until the support becomes less than the predetermined stopping parameter β (e.g., 0.05). By peeling off a small number of observations in each iteration, a long sequence of boxes is created.

To avoid over-fitting, the box objectives of the generated boxes B1, B2, . . . , Bk are recalculated using the test set. The box with the largest box objective on the test set is selected as the optimal box (i.e., the optimum condition on the input variables) and is given by:

x* = (x_1*, x_2*, . . . , x_p*) = ([x_L1, x_U1], [x_L2, x_U2], . . . , [x_Lp, x_Up]),   (6)

where x_Lj and x_Uj denote the lower and upper bound of xj (j = 1, 2, . . . , p) in the optimal box, respectively. The advantage of the PRIM algorithm compared with other rule discovery algorithms, such as CART (Breiman, Friedman, Olshen, & Stone, 1984) and C4.5 (Quinlan, 1994, 1995), is its patient strategy: a small value of the peeling parameter α allows for many peeling steps. Thus, each peeling becomes less important in determining the final box, and unfortunate peelings that remove good observations can be mitigated in subsequent steps.

3. The proposed method: m-PRIM

In this section, m-PRIM, which considers the missing-values problem in optimizing processes, is presented. The left side of Fig. 1 shows a brief procedure for optimizing processes based on PRIM, where the amount of missing values is negligible. The overall procedure of m-PRIM, which has three additional steps after Step 0 (prepare the data set), is presented in the dotted box in Fig. 1.

The basic idea of m-PRIM is to convert an incomplete data set, which has missing values, into k imputed complete data sets, generate k optimal boxes, one from each of the k imputed complete data sets, and aggregate the k optimal boxes.

In this study, an incomplete data set is converted into k imputed complete data sets by multiple imputation (MI). Originally proposed by Rubin (1987), MI has been used as an alternative to traditional methods such as case deletion, mean imputation, hot-deck, the regression approach, and single imputation using Expectation Maximization (EM), among others, in a wide variety of missing-values problems (Schafer & Graham, 2002).

MI is a simulation-based approach, where each missing value is replaced with k > 1 plausible values drawn from its predictive distribution. Further, MI is implemented based on the assumption that the values are missing at random (MAR). In this study, NORM is used for the multiple imputations. As developed by Schafer (1999a, 1999b), NORM performs multiple imputations under a multivariate normal model. This model makes no distinction between input and output variables, treating them all as one multivariate variable. In NORM, proper multiple imputations are created through data augmentation (Tanner & Wong, 1987), where EM estimates are used as starting values for the parameters. The rule of thumb suggested by Schafer is used to guarantee the convergence of the data augmentation. Additionally, in terms of the number of imputations (k), using more than five to ten imputations (Schafer, 1999a, 1999b) tends to have little or no practical benefit. The details of MI can be found in Rubin (1996), Schafer (1997), and Schafer and Olsen (1998).

In the following, D^inc and D^c denote the incomplete data set and the complete data set, respectively; D^(j)c (j = 1, 2, . . . , k) is an imputed complete data set, where k is the number of imputations. Additionally, x^a* is the representative optimal box aggregated from the k optimal boxes x^j* (j = 1, 2, . . . , k) that have been generated from the imputed complete data sets D^(j)c (j = 1, 2, . . . , k). Each step of the proposed method is described below.

Fig. 1. Overall m-PRIM procedure (Step 0: prepare the data set; if missing values are not negligible, Step 1: convert the incomplete data set into k complete data sets; Step 2: generate k optimal boxes; Step 3: make one representative optimal box; then determine the optimal box).
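The peeling loop of Section 2.2 can be sketched in a few lines. This is a minimal illustration under our own variable names, not the authors' implementation: it peels on one data set only and omits the test-set selection of the final box.

```python
import numpy as np

def prim_peel(X, y, alpha=0.1, beta=0.05):
    """Minimal PRIM peeling sketch (Section 2.2): generate boxes B0, B1, ...

    X: (N, p) inputs; y: (N,) objective values where larger is better
    (for an NTB output, pass the negated squared deviation of Eq. (5)).
    Returns a list of boxes, each a (p, 2) array of [lower, upper] bounds.
    """
    N, p = X.shape
    box = np.column_stack([X.min(axis=0), X.max(axis=0)])  # B0 covers all data
    inside = np.ones(N, dtype=bool)
    boxes = [box.copy()]
    while inside.mean() > beta:  # stop once the support drops to beta
        best_obj, best = -np.inf, None
        for j in range(p):
            xj = X[inside, j]
            # candidate Cj-: peel 100*alpha% from the left; Cj+: from the right
            for new_lo, new_hi in ((np.quantile(xj, alpha), box[j, 1]),
                                   (box[j, 0], np.quantile(xj, 1 - alpha))):
                keep = inside & (X[:, j] >= new_lo) & (X[:, j] <= new_hi)
                if keep.sum() == 0 or keep.sum() >= inside.sum():
                    continue  # a peel must remove something, but not everything
                obj = y[keep].mean()  # box objective, Eq. (4)
                if obj > best_obj:
                    best_obj, best = obj, (j, new_lo, new_hi, keep)
        if best is None:  # no candidate shrinks the box any further
            break
        j, new_lo, new_hi, inside = best
        box[j] = (new_lo, new_hi)
        boxes.append(box.copy())
    return boxes
```

In a full implementation, the objectives of the returned boxes would then be recalculated on the test set and the best one kept, per Eq. (6).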
Table 1
Generated k optimal boxes.

Data set   Optimal box
D^(1)c     x^1* = ([x_1^1L, x_1^1U], [x_2^1L, x_2^1U], . . . , [x_p^1L, x_p^1U])
D^(2)c     x^2* = ([x_1^2L, x_1^2U], [x_2^2L, x_2^2U], . . . , [x_p^2L, x_p^2U])
. . .      . . .
D^(k)c     x^k* = ([x_1^kL, x_1^kU], [x_2^kL, x_2^kU], . . . , [x_p^kL, x_p^kU])

Step 0: Prepare the data set.
An incomplete data set (D^inc) with missing values from a process is prepared.

Step 1: Convert the incomplete data set into k complete data sets.
The incomplete data set (D^inc) is converted into k imputed complete data sets D^(j)c (j = 1, 2, . . . , k) by MI.

Step 2: Generate k optimal boxes.
The PRIM algorithm, explained in Section 2.2, is applied to each of the k imputed complete data sets. Thus, k optimal boxes x^j* (j = 1, 2, . . . , k) are generated as shown in Table 1, where the figures in parentheses refer to the indices of the k imputed data sets.

Step 3: Make one representative optimal box.
The resulting k optimal boxes are combined by an aggregation method to obtain one representative optimal box, given by:

x^a* = (x_1^a*, x_2^a*, . . . , x_p^a*) = ([x_1^aL, x_1^aU], [x_2^aL, x_2^aU], . . . , [x_p^aL, x_p^aU]),   (7)

where x_i^aL and x_i^aU denote the combined lower and upper bounds of x_i (i = 1, 2, . . . , p) in the representative optimal box, respectively. In this study, the median is used as the aggregation method. Thus, x_i^aL is calculated as the median of x_i^1L, x_i^2L, . . . , x_i^kL (i = 1, 2, . . . , p), and x_i^aU is calculated as the median of x_i^1U, x_i^2U, . . . , x_i^kU (i = 1, 2, . . . , p). In addition, the value of x_i^aU is always larger than that of x_i^aL; this is because x_i^aU and x_i^aL are calculated from the ordered pairs [x_i^jL, x_i^jU] (j = 1, 2, . . . , k). The mean and trimmed mean can also be used for aggregation. The performance of the three aggregation methods (i.e., mean, median, and trimmed mean) is compared in Section 4.6.

4. Case study in a semiconductor manufacturing process

The etching process in semiconductor fabrication was employed in this study to illustrate the proposed method. Semiconductor fabrication is a process in which electronic circuits are gradually created on a wafer through several stages (Fig. 2). The etching process is a critically important stage used to remove layers from the surface of a wafer during manufacturing. The entire fabrication process, performed in highly specialized facilities referred to as fabs, takes six to eight weeks to complete.

4.1. Occurrence of missing values in the semiconductor manufacturing process

Sampling inspection is one of the major causes of missing values in process data sets. In semiconductor manufacturing processes, input variables (i.e., process variables) are 100% inspected or measured through automated sensors embedded in the main processing machines. Output variables, in contrast, are not always 100% inspected; they are usually inspected on specialized inspection machines separated from the main processing machines. This is because 100% inspection increases total cycle time and production cost, and inspection machines have limited capacity.

In the case of the etching process, input variables (e.g., amount of gas flow, pressure, and temperature) are 100% measured by the sensors attached inside the etching machine. However, the After Cleaning Inspection Critical Dimension (ACI CD, μm), which is the output variable representing the quality of the circuit pattern (e.g., the length of a transistor gate), is not 100% measured by an inspection machine such as a scanning electron microscope. Fig. 3 displays a typical data set of the etching process with missing values occurring in the output variable. Here, O means that the variable is observed in the corresponding case, and X means the datum is missing.

Although the sampling inspection rate depends on the criticality of the output variable y, it is usually very low (e.g., less than 50% on average) in semiconductor processes. Thus, the portion of missing values in the data set is very high, which complicates the application of data mining algorithms to the data sets of semiconductor processes.

4.2. Data set preparation

A complete data set (D^c) was prepared from the etching process of the S semiconductor company in Korea. A total of 300 (N) historical data points were collected over nine months, from January to September 2008. There were no missing values in the four input variables and one output variable (ACI CD). The four input variables were selected by process engineers as core variables. ACI CD (y) is a nominal-the-better (NTB) type variable. This complete data set was used as the standard data set for investigating the performance and properties of the proposed method. The values of the input and output variables are standardized. The standardized target of ACI CD (y) is 0.4816.

Meanwhile, an incomplete data set (D^inc) was intentionally generated from the complete data set (D^c) mentioned above. The incomplete data set has a large amount of missing values and a missing pattern similar to that shown in Fig. 3 (i.e., missing values occurred only in the output variable, and there was no missing value in the first observation). For example, in the case of a missing value rate of 90% (i.e., a sampling inspection rate of 10%), the 1st observation was not omitted from the complete data set (D^c), but the next nine consecutive observations were intentionally omitted. This sequence was repeated 29 times beginning with the 11th observation. Two incomplete data sets, with missing value rates of 50% and 90%, were prepared in this case study.

4.3. Performance evaluation and measures

Performance of the proposed method is evaluated based on the values contained in the representative optimal box (i.e., x^a*) as applied on the complete data set (D^c). This is because many values contained in the imputed complete data sets D^(j)c (j = 1, 2, . . . , k) are not real observed values but values imputed by simulation.

Fig. 2. An etching process in the semiconductor fabrication (pure wafer → deposition → photo → etching → implantation → fabricated wafer).
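The intentional missing-value pattern of Section 4.2 (keep the first observation of each block, omit the rest) can be sketched as follows. The function name and the use of np.nan to encode a missing value are ours:

```python
import numpy as np

def make_incomplete(y, rate):
    """Mimic the sampling-inspection pattern of Section 4.2: at a missing
    value rate of 0.9, the first observation of every block of ten is kept
    and the next nine are set to missing (np.nan)."""
    y = np.asarray(y, dtype=float).copy()
    block = int(round(1.0 / (1.0 - rate)))  # rate 0.9 -> blocks of 10
    for start in range(0, len(y), block):
        y[start + 1:start + block] = np.nan  # keep only the block's first value
    return y
```

With rate = 0.9 and 300 observations this reproduces the paper's scheme: the 1st, 11th, 21st, ... observations remain observed and the rest are missing.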


The concept of the performance calculation for the proposed method is shown in Fig. 4. Additionally, x^c* is the optimal box generated from the complete data set (D^c).

In this case study, the negative mean squared error given in Eq. (5) is used as the box objective when generating the k optimal boxes at Step 2 in Fig. 1, since the output variable (ACI CD, y) is an NTB type. Thus, performance measures were developed based on the values of the mean squared error. The two performance measures are described below.

The first one is the Obtained Improvement Ratio (OIR), which is defined as:

OIR = (MSEcurrent − MSEa) / (MSEcurrent − MSEc),  0 ≤ OIR ≤ 1.   (8)

In Eq. (8), MSEcurrent and MSEa are the mean squared errors calculated using all observations in the complete data set (D^c) and using the observations falling in the representative optimal box (x^a*) as applied on the complete data set (D^c), respectively. Here, MSEc is the mean squared error calculated using the observations in the optimal box x^c*, as shown in Fig. 4. The numerator (i.e., MSEcurrent − MSEa) of Eq. (8) is the obtained improvement when process optimization is conducted based on the incomplete data set (D^inc), whereas the denominator (i.e., MSEcurrent − MSEc) is the obtainable improvement when process optimization is conducted based on the complete data set (D^c). Additionally, the missing values effects ratio (MER) is defined as:

MER = 1 − OIR,  0 ≤ MER ≤ 1.   (9)

Thus, OIR and MER refer to the proportion of obtained process improvement over the obtainable improvement, and the remaining proportion that could not be obtained due to the effects of missing values, respectively. The second one is the MSE Improvement Ratio (MIR), which is defined as:

MIR = (MSEcurrent − MSEa) / MSEcurrent.   (10)

Here, MIR refers to the proportion of process improvement over the current level with respect to the mean squared error. MSEcurrent and MSEc were calculated as 1.172 and 0.171, respectively, from the complete data set of the etching process in this case study.

Fig. 3. An incomplete data set with missing values.

4.4. Case study results

The results of the case study, which summarize Steps 1, 2, and 3 of the proposed method, are as follows.

First, the incomplete data set (D^inc) is converted into k imputed complete data sets D^(j)c (j = 1, 2, . . . , k) by multiple imputation. In this case study, 15 imputations (i.e., k = 15) were made for each of the two incomplete data sets having missing value rates of 50% and 90%. Three replications were made.

Each of the 15 imputed data sets was randomly split into a learning set (67%) and a test set (33%) when generating the optimal box. The peeling parameter α and the stopping parameter β were set to 0.1 and 0.05, respectively.

Table 2 shows the 15 optimal boxes of the first replication, generated from the 15 imputed complete data sets at the missing value rate of 50%. The fifteen optimal boxes were combined by calculating the medians of the 15 lower and upper bounds of each xj (j = 1, 2, 3, and 4). One representative optimal box was finally obtained (shown at the bottom of Table 2).

The representative optimal box (x^a*), OIR, MER, and MIR for each replication are shown in Table 3. As shown in Table 3, the OIR averages at missing value rates of 50% and 90% are 65% and 52%, respectively. This means that the etching process achieved 65% and 52% of the obtainable improvement. The MER averages at missing value rates of 50% and 90% are 35% and 48%, respectively; that is, 35% and 48% of the obtainable improvement was lost to the effects of missing values.

The MIR averages at missing value rates of 50% and 90% are 56% and 45%, respectively, which means that the etching process achieved an improvement of 56% and 45% over the current level with respect to the mean squared error.

Fig. 4. Concept of performance evaluation.
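The three measures of Eqs. (8)-(10) can be computed directly from the complete data set once the boxes are known; a sketch (function and argument names are ours):

```python
import numpy as np

def performance_ratios(y, target, box_a, box_c, X):
    """Compute OIR, MER, and MIR (Eqs. (8)-(10)) on the complete data set.

    y: (N,) output values; X: (N, p) inputs; target: the NTB target value.
    box_a, box_c: (p, 2) arrays of [lower, upper] bounds for the
    representative optimal box and the complete-data optimal box.
    """
    def mse_in(box):
        inside = np.all((X >= box[:, 0]) & (X <= box[:, 1]), axis=1)
        return np.mean((y[inside] - target) ** 2)

    mse_current = np.mean((y - target) ** 2)           # all observations
    mse_a, mse_c = mse_in(box_a), mse_in(box_c)
    oir = (mse_current - mse_a) / (mse_current - mse_c)  # Eq. (8)
    mer = 1.0 - oir                                      # Eq. (9)
    mir = (mse_current - mse_a) / mse_current            # Eq. (10)
    return oir, mer, mir
```

Note that when the complete-data box drives the MSE to zero (MSEc = 0), OIR and MIR coincide, matching the definitions above.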



Table 2
Examples of the 15 optimal boxes (at a missing value rate of 50%, first replication).

Data set  Box     x1                 x2                 x3                 x4
1         x^1*    [−0.396, 0.431]    [−1.123, 0.232]    [−1.320, 0.733]    [−0.677, 0.951]
2         x^2*    [−0.409, 1.161]    [−0.947, 0.004]    [−0.350, 0.741]    [−0.875, 1.301]
3         x^3*    [−1.202, 0.286]    [−1.535, 0.427]    [−0.930, 0.168]    [−0.190, 0.873]
4         x^4*    [−0.451, 0.561]    [−3.023, 0.832]    [−1.003, 0.604]    [−0.112, 0.777]
5         x^5*    [0.032, 1.999]     [−3.023, 0.940]    [−0.958, 0.211]    [−2.772, 0.708]
6         x^6*    [−0.416, 1.978]    [−3.023, 0.951]    [−1.254, 0.891]    [−2.772, 2.659]
7         x^7*    [0.263, 1.482]     [−0.238, 2.797]    [−0.198, 0.697]    [−1.060, 0.159]
8         x^8*    [−2.433, 1.094]    [−1.240, 0.124]    [−1.167, −0.329]   [−2.772, 1.229]
9         x^9*    [−0.415, 1.892]    [−0.938, 1.344]    [−0.957, 1.363]    [−2.772, 1.409]
10        x^10*   [−0.381, 0.688]    [−3.023, 0.255]    [−0.936, −0.150]   [−1.510, 0.746]
11        x^11*   [0.145, 1.857]     [−0.256, 0.619]    [−0.414, 0.524]    [−1.014, 0.797]
12        x^12*   [−0.445, 0.320]    [−3.023, 0.084]    [−0.566, 0.967]    [−0.399, 0.927]
13        x^13*   [−1.209, 1.077]    [−3.023, 0.493]    [−1.409, 0.116]    [−0.290, 0.753]
14        x^14*   [−0.498, 1.251]    [−3.023, 0.842]    [−0.957, 0.091]    [−2.772, 2.659]
15        x^15*   [−0.373, 0.294]    [−3.023, 0.491]    [−0.360, 0.670]    [−0.881, 0.794]

          x^a*    [−0.415, 1.094]    [−3.023, 0.493]    [−0.957, 0.604]    [−1.014, 0.873]
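The bottom row of Table 2 is obtained by taking per-bound medians over the 15 boxes, as in Eq. (7). A sketch of this Step 3 aggregation (the array representation of a box is ours):

```python
import numpy as np

def aggregate_boxes(boxes):
    """Combine k optimal boxes into one representative box (Step 3, Eq. (7)).

    boxes: array-like of shape (k, p, 2) holding [lower, upper] bounds of
    each of the k optimal boxes. Returns a (p, 2) array whose bounds are
    the medians of the k lower and the k upper bounds, respectively.
    """
    boxes = np.asarray(boxes, dtype=float)
    lower = np.median(boxes[:, :, 0], axis=0)  # x_i^aL = median of x_i^1L..x_i^kL
    upper = np.median(boxes[:, :, 1], axis=0)  # x_i^aU = median of x_i^1U..x_i^kU
    return np.column_stack([lower, upper])
```

Replacing np.median with np.mean (or a trimmed mean via scipy) gives the alternative aggregation methods compared in Section 4.6.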

Table 3
Optimization results.

Missing value rate  Replication  x1                 x2                 x3                 x4                 OIR    MER    MIR
50%                 1            [−0.415, 1.094]    [−3.023, 0.493]    [−0.957, 0.604]    [−1.014, 0.873]    0.66   0.34   0.57
                    2            [−0.352, 0.965]    [−1.085, 0.232]    [−0.855, 0.454]    [−2.772, 0.924]    0.60   0.40   0.52
                    3            [−0.341, 1.000]    [−0.727, 0.444]    [−0.941, 0.715]    [−1.100, 0.769]    0.70   0.30   0.60
                    Avg.                                                                                     65%    35%    56%
90%                 1            [−0.235, 1.125]    [−0.884, 0.793]    [−0.624, 0.719]    [−1.208, 0.754]    0.48   0.52   0.41
                    2            [−0.537, 0.474]    [−1.221, 0.232]    [−0.889, 0.317]    [−2.772, 2.659]    0.40   0.60   0.34
                    3            [−0.299, 1.021]    [−0.903, 0.449]    [−0.916, 0.546]    [−1.046, 0.557]    0.69   0.31   0.59
                    Avg.                                                                                     52%    48%    45%
Current                          [−2.433, 2.513]    [−3.023, 2.797]    [−2.646, 2.974]    [−2.772, 2.659]

Additionally, the OIR and MIR averages at the missing value rate of 50% are, as expected, larger than those at the missing value rate of 90%.

In summary, the proposed method yielded considerable improvements in the etching process compared with the current level, although the level of process improvement did not reach that attainable when process optimization is conducted based on the complete data set (D^c).

4.5. Comparison between case deletion and multiple imputation

In this section, MI performance is compared with case deletion using the etching process case at missing value rates of 50% and 90%. Case deletion, which deletes the cases with missing values, is the simplest method; its main virtue is its simplicity. Nevertheless, its disadvantage stems from the potential loss of information when discarding incomplete cases. For the comparison, data sets were prepared from the incomplete data set (D^inc) using the case deletion method, where the cases with missing values were deleted according to the missing value rate. In the case of a missing value rate of 50%, 150 complete cases were left after discarding incomplete cases from the incomplete data set. The optimal box, denoted as x^d*, was obtained based on the data set with 150 complete cases. Then, the performance was measured based on the values contained in the optimal box (x^d*) as applied on the complete data set (D^c).

The performance of MI was compared with that of case deletion with respect to MIR. The results of the comparison are displayed in Fig. 5. Small gray circles and large dark circles indicate the values of MIR based on case deletion and MI, respectively. Fig. 5 shows that the MIR values decreased as the missing value rates increased in both methods.

The case deletion method seemed to be effective when the missing value rate was low. The MIR values from multiple imputation were larger than those from case deletion at the missing value rates of 50% and 90%. The gaps also became larger as the missing value rates increased.

4.6. Comparison of aggregation methods

Three aggregation methods (i.e., mean, median, and trimmed mean) can be used to obtain one representative optimal box in Section 3. In this section, a simulation experiment was conducted to compare the performance of these three aggregation methods.

4.6.1. Design of simulation experiment
The purpose of the simulation experiment is to identify the conditions under which one method outperforms the others. This was done by comparing the performance measures.

The factors are given in Table 4. The first factor, missing value rate (R), has three levels: 10% (small), 50% (medium), and 90% (large).

Table 4
Summary of experimental factors.

Factor                         Levels
Missing value rate (R)         10% (small), 50% (medium), 90% (large)
Number of data (N)             100 (small), 250 (medium), 400 (large)
Magnitude of random error (M)  0 (small), 3 (large)
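The case deletion baseline of Section 4.5 simply discards rows whose output is missing; a one-function sketch (names are ours, with np.nan encoding a missing value):

```python
import numpy as np

def case_deletion(X, y):
    """Case deletion baseline (Section 4.5): keep only the complete cases,
    i.e., rows whose output y is observed (inputs are fully observed here)."""
    observed = ~np.isnan(y)
    return X[observed], y[observed]
```

With a 50% missing value rate on 300 observations this leaves the 150 complete cases used in the comparison; with 90% missing, only 30 cases remain, which is why the information loss grows with the missing value rate.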
Fig. 5. MIR according to missing value rate (small gray circles: case deletion; large dark circles: MI).

The second factor, number of data (N), has three levels: 100 (small), 250 (medium), and 400 (large). The last factor, magnitude of random error (M), has two levels: 0 (small) and 3 (large).

The levels of each factor were chosen considering their relevance to a real semiconductor production process. For example, a missing value rate of 90% is the maximum level (i.e., 10% is usually the minimum sampling inspection rate), and 50% is the medium level; 400 (N) is a very large number of observations, which can be collected over a one-year period. Finally, 0 (M) was chosen as the minimum level of random error because most semiconductor processes are tightly controlled to prevent out-of-control situations; 3 (M) was chosen as the level at which the process is out of control.

There were 18 (= 3 × 3 × 2) different treatments and three replications for each treatment. Thus, 54 (= 3 × 3 × 2 × 3) runs were made. Next, MIR(m1), MIR(m2), and MIR(m3) were used as the responses; these represent the MIR values based on the representative optimal box aggregated by the median, mean, and trimmed mean, respectively.

Additionally, in this experiment, a complete data set D^c: {(y(r), x(r)), r = 1, 2, . . . , N} was generated from the multivariate normal distribution with mean vector

(1.33 × 10⁻¹¹, 2.00 × 10⁻¹¹, 3.00 × 10⁻¹¹, −4.33 × 10⁻¹¹, −3.33 × 10⁻¹¹)

and covariance matrix

        |  1                                  |
        |  0.474   1                          |
    C = |  0.163   0.684   1                  |
        | −0.510  −0.313   0.001   1          |
        | −0.259   0.288   0.430   0.097   1  |

which were estimated from the complete data set of the etching process in Section 4.2. An incomplete data set (D^inc) was intentionally generated from the complete data set (D^c) as explained in that same section.

4.6.2. Results
Analysis of variance and a paired comparison test were employed to analyze the results at the 5% significance level. Table 5 indicates the factors with significant effects on the responses MIR(m1), MIR(m2), and MIR(m3), and shows which method performed better at various factor levels.

All three responses, MIR(m1), MIR(m2), and MIR(m3), decreased as the missing value rate increased; they also decreased as the random error increased. Fig. 6 shows an example of a main effect plot for MIR(m1).

It can be seen that the median method performed significantly better than the mean and trimmed mean methods when all the data from the experiments were combined. Table 5 also shows which method performed better at each main factor level. The performance of the median method was significantly better than those of the mean and trimmed mean methods when the R, N, and M values were small.

Table 5
Comparison of performance among three aggregation methods.

Factor                 MIR(m1)       MIR(m2)            MIR(m3)
Significant factors^a
  Main                 R, N, M       R, N, M            R, N, M
  Interaction          R×M, R×N×M    R×M, R×N, R×N×M    R×M, R×N×M
Comparisons            (m1, m2)^b    (m1, m3)           (m2, m3)
All factors combined   m1^c          m1                 (m2)^d
R = small              m1            m1                 (m3)
  Medium               (m1)          (m1)               (m2)
  Large                (m2)          (m3)               (m2)
N = small              m1            m1                 (m3)
  Medium               (m2)          (m1)               (m2)
  Large                (m1)          (m1)               (m2)
M = small              m1            m1                 (m3)
  Large                (m1)          (m1)               (m2)

a Factors in boldface were significant at α = 0.01; all other factors were significant at α = 0.05.
b (m1, m2) means that a comparison of performance between the median and mean aggregation methods was made.
c m1 means that the median method performed better at α = 0.05.
d Parentheses mean that there was no significant difference in the performances of the two methods at α = 0.05.

Fig. 6. Main effect plot for MIR(m1).
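Under the design above, a simulated complete data set can be drawn from the stated multivariate normal model. Which of the five components is the output is not stated; we assume the last one, and the function name is ours:

```python
import numpy as np

# Mean vector and covariance matrix stated in Section 4.6.1 (estimated from
# the etching-process data); the last component is assumed to be the output y.
MEAN = np.array([1.33e-11, 2.00e-11, 3.00e-11, -4.33e-11, -3.33e-11])
COV = np.array([
    [ 1.000,  0.474,  0.163, -0.510, -0.259],
    [ 0.474,  1.000,  0.684, -0.313,  0.288],
    [ 0.163,  0.684,  1.000,  0.001,  0.430],
    [-0.510, -0.313,  0.001,  1.000,  0.097],
    [-0.259,  0.288,  0.430,  0.097,  1.000],
])

def simulate_complete(N, M=0.0, seed=0):
    """Draw a complete data set D^c of N observations; M is the magnitude of
    extra N(0, M^2) random error added to the output component."""
    rng = np.random.default_rng(seed)
    data = rng.multivariate_normal(MEAN, COV, size=N)
    X, y = data[:, :4], data[:, 4]
    if M > 0:
        y = y + rng.normal(0.0, M, size=N)
    return X, y
```

Combining this generator with the missing-pattern and aggregation sketches shown earlier reproduces the structure of one simulation run (18 treatments × 3 replications).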
2596 D.-S. Kwak, K.-J. Kim / Expert Systems with Applications 39 (2012) 25902596

Table 6 Additionally, although joint normality is rarely realistic, MI based


Comparison of performance at the levels of missing rate (R) and random error (M). on the assumption has been known to be useful for a wide variety
of problems (Schafer & Graham, 2002). Finally, the assumption of
MAR in the case study could be justified because the probability
of missing values depends on the sampling inspection scheme
and not on the missing values themselves.

Acknowledgement
a
Boldface means significant at a = 0.01.
b This work was supported by the Korea Research Foundation
Light face means significant at a = 0.05.
c
Parentheses means that there was no significant difference in the performances of Grant funded by the Korean Government (MOEHRD, Basic Research
the two methods at a = 0.05. Promotion Fund) (313-2008-2-D01192).
d
A gray rectangle means that MIR did not improve (close to 0%, or less than 0%).
References

Arteaga, F., & Ferrer, A. (2002). Dealing with missing data in MSPC: Several methods,
N, and M values were small. There was no significant difference be- different interpretations, some examples. Journal of Chemometrics, 16, 408418.
tween mean and trimmed mean methods. Braha, D., & Shmilovici, A. (2002). Data mining for improving a cleaning process in
Table 6 shows which aggregation method performed better at the semiconductor industry. IEEE Transactions on Semiconductor Manufacturing,
15(1), 91101.
various combinations of R and M levels. Checking the interaction Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and
reveals that the interacting effects did not show severe contradic- regression trees. Pacific Grove, CA: Wadsworth.
tions of the main factor level effects. Chong, I. G., Albin, S. L., & Jun, C. H. (2007). A data mining approach to process
optimization without an explicit quality function. IIE Transactions, 39, 795804.
As seen in Table 6, if at least one condition was satisfied (i.e., M Chong, I. G., & Jun, C. H. (2008). Flexible patient rule induction method for
was small, or R was small or medium), MIR could be improved optimizing process variables in discrete type. Expert System with Applications,
regardless of the aggregation method employed. Furthermore, 34(4), 30143020.
Dasu, T., & Johnson, T. (2003). Exploratory data mining ad data cleaning. John Wiley &
the performance of the median method was significantly better Sons.
compared with those of the mean and trimmed mean methods Feelders, A., (1999). Handling missing data in trees: Surrogate splits or statistical
when R was small or medium, when M was small, and when R imputation. In Proceedings of the third European conference on principles of data
mining and knowledge discovery.
was small and M was large. Friedman, J. H., & Fisher, N. I. (1999). Bump hunting in high-dimensional data.
In this section, the performances of the three aggregation methods were compared via a simulation experiment. The lower and upper bounds of the input variables in the optimal boxes can take extreme values due to imputation variability. In such a situation, the median can aggregate the optimal boxes in a more stable manner than the mean, since the median has the advantage of not being strongly influenced by extreme values (the trimmed mean has a similar property). Therefore, although the median method performed significantly better than the mean and trimmed mean methods only at specific factor levels (or combinations of factor levels), the use of the median method as the major aggregation method is recommended.
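The robustness argument above can be illustrated with a small sketch (not the authors' code; function and variable names are illustrative). Each of the M imputed data sets yields one optimal box, i.e., a lower and an upper bound per input variable, and the boxes are aggregated bound by bound:

```python
import numpy as np

def aggregate_boxes(lower_bounds, upper_bounds, method="median", trim=0.1):
    """Aggregate PRIM box bounds found on M imputed data sets.

    lower_bounds, upper_bounds: array-likes of shape (M, p) holding the
    lower/upper bound of each of the p input variables per imputation.
    """
    lo = np.asarray(lower_bounds, dtype=float)
    up = np.asarray(upper_bounds, dtype=float)
    if method == "mean":
        return lo.mean(axis=0), up.mean(axis=0)
    if method == "trimmed_mean":
        # Drop the trim fraction of smallest and largest values, then average.
        def tmean(a):
            a = np.sort(a, axis=0)
            k = int(trim * a.shape[0])
            return a[k:a.shape[0] - k].mean(axis=0)
        return tmean(lo), tmean(up)
    # Median: robust to extreme bounds caused by imputation variability.
    return np.median(lo, axis=0), np.median(up, axis=0)

# Five imputations, one input variable; two runs produced extreme bounds.
lo = [[0.9], [1.1], [1.0], [-3.0], [1.0]]
up = [[2.0], [2.1], [1.9], [2.0], [6.0]]
print(aggregate_boxes(lo, up, "median"))  # stays at the typical bounds (1.0, 2.0)
print(aggregate_boxes(lo, up, "mean"))    # pulled toward the outliers (0.2, 2.8)
```

With a single imputation producing an extreme bound, the median aggregate stays at the typical values while the mean is dragged toward the outlier, which mirrors the simulation result above.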
6. Conclusion and discussion

To optimize a process using data mining techniques, it is important to consider the occurrence of missing values in the process data set. This work proposed a procedure, called m-PRIM, for optimizing a process based on the existing PRIM when the amount of missing values is not negligible.

Using a real data set from a semiconductor manufacturing process, the study demonstrated that m-PRIM yielded considerable improvements on the process compared with the current level. As expected, however, the degree of process improvement did not reach that of process optimization conducted without missing values in the data set.

In the case study, it was tentatively assumed that the joint distribution of all variables in the etching process followed a multivariate normal distribution and that the missing values were missing at random (MAR). It is difficult to test the multivariate normality assumption of an incomplete data set in practice; thus, the domain knowledge of engineers and the aid of statistical tools (e.g., Mahalanobis distance plots, Mardia's test) are required. Additionally, although joint normality is rarely realistic, MI based on this assumption has been known to be useful for a wide variety of problems (Schafer & Graham, 2002). Finally, the assumption of MAR in the case study could be justified because the probability of missing values depends on the sampling inspection scheme and not on the missing values themselves.
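As a rough illustration of such a screening tool (a sketch under the stated assumptions, not part of the original study): under multivariate normality, the squared Mahalanobis distances of the complete cases should approximately follow a chi-square distribution with p degrees of freedom, so their ordered values can be compared against the corresponding chi-square quantiles:

```python
import numpy as np
from scipy import stats

def mahalanobis_check(X):
    """Rough multivariate-normality screen on the complete cases of X.

    Returns the ordered squared Mahalanobis distances and the matching
    chi-square(p) quantiles; under MVN the two sequences should track
    each other (a Q-Q plot of one against the other is near the diagonal).
    """
    X = np.asarray(X, dtype=float)
    complete = X[~np.isnan(X).any(axis=1)]  # use complete rows only
    n, p = complete.shape
    mu = complete.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(complete, rowvar=False))
    d = complete - mu
    # Quadratic form d_i' S_inv d_i for every complete row i.
    d2 = np.sort(np.einsum("ij,jk,ik->i", d, S_inv, d))
    q = stats.chi2.ppf((np.arange(1, n + 1) - 0.5) / n, df=p)
    return d2, q

# Synthetic example: MVN data with roughly 10% of the cells set to missing.
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 0.0, 0.0], np.eye(3), size=500)
X[rng.random(X.shape) < 0.1] = np.nan
d2, q = mahalanobis_check(X)
# Plotting d2 against q should give points close to the 45-degree line.
```

In practice, such a plot only screens the complete cases; as noted above, it should be combined with engineering domain knowledge rather than used as a formal test of the incomplete data set.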
Acknowledgement

This work was supported by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD, Basic Research Promotion Fund) (313-2008-2-D01192).

References

Arteaga, F., & Ferrer, A. (2002). Dealing with missing data in MSPC: Several methods, different interpretations, some examples. Journal of Chemometrics, 16, 408–418.
Braha, D., & Shmilovici, A. (2002). Data mining for improving a cleaning process in the semiconductor industry. IEEE Transactions on Semiconductor Manufacturing, 15(1), 91–101.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Pacific Grove, CA: Wadsworth.
Chong, I. G., Albin, S. L., & Jun, C. H. (2007). A data mining approach to process optimization without an explicit quality function. IIE Transactions, 39, 795–804.
Chong, I. G., & Jun, C. H. (2008). Flexible patient rule induction method for optimizing process variables in discrete type. Expert Systems with Applications, 34(4), 3014–3020.
Dasu, T., & Johnson, T. (2003). Exploratory data mining and data cleaning. John Wiley & Sons.
Feelders, A. (1999). Handling missing data in trees: Surrogate splits or statistical imputation. In Proceedings of the third European conference on principles of data mining and knowledge discovery.
Friedman, J. H., & Fisher, N. I. (1999). Bump hunting in high-dimensional data. Statistics and Computing, 9, 123–143.
Harding, J. A., Shahbaz, M., Srinivas, & Kusiak, A. (2006). Data mining in manufacturing: A review. Journal of Manufacturing Science and Engineering, 128, 969–976.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning. New York: Springer, pp. 279–282.
Kim, P., & Ding, Y. (2005). Optimal engineering design guided by data-mining methods. Technometrics, 47(3), 336–348.
Kwak, D., Kim, K., & Lee, M. (2010). Multistage PRIM: Patient rule induction method for optimization of a multistage manufacturing process. International Journal of Production Research, 48(12), 3461–3473.
Lee, M., & Kim, K. (2008). MR-PRIM: Patient rule induction method for multiresponse optimization. Quality Engineering, 20(2), 232–242.
Muteki, K., Macgregor, J. F., & Ueda, T. (2005). Estimation of missing data using latent variable methods with auxiliary information. Chemometrics and Intelligent Laboratory Systems, 78, 41–50.
Nelson, P. R. C., Taylor, P. A., & Macgregor, J. F. (1996). Missing data methods in PCA and PLS: Score calculation with incomplete observations. Chemometrics and Intelligent Laboratory Systems, 35, 45–65.
Quinlan, J. R. (1994). C4.5: Programs for machine learning. San Mateo, CA: Morgan-Kaufmann.
Quinlan, J. R. (1995). MDL and categorical theories (continued). In Proceedings of the 12th international conference on machine learning (pp. 464–470). San Mateo, CA: Morgan-Kaufmann.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley.
Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91(434), 473–489.
Schafer, J. L. (1997). Analysis of incomplete multivariate data. London: Chapman & Hall.
Schafer, J. L. (1999a). NORM: Multiple imputation of incomplete multivariate data under a normal model, version 2. Software for Windows 95/98/NT, available from <http://www.stat.psu.edu/~jls/misoftwa.html>.
Schafer, J. L. (1999b). Multiple imputation: A primer. Statistical Methods in Medical Research, 8, 3–15.
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147–177.
Schafer, J. L., & Olsen, M. K. (1998). Multiple imputation for multivariate missing-data problems: A data analyst's perspective. Multivariate Behavioral Research, 33(4), 545–571.
Tanner, M. A., & Wong, W. H. (1987). The calculation of posterior distributions by data augmentation (with discussion). Journal of the American Statistical Association, 82, 528–550.