ARTICLE INFO

Keywords: Data mining approach; Missing values; Patient Rule Induction Method; Process optimization

ABSTRACT

Due to the rapid development of information technologies, abundant data have become readily available. Data mining techniques have been used for process optimization in many manufacturing processes in automotive, LCD, semiconductor, and steel production, among others. However, a large amount of missing values occurs in the data set due to several causes (e.g., data discarded by gross measurement errors, measurement machine breakdown, routine maintenance, sampling inspection, and sensor failure), which frequently complicate the application of data mining to the data set. This study proposes a new procedure for optimizing processes called missing values-Patient Rule Induction Method (m-PRIM), which handles the missing-values problem systematically and yields considerable process improvement, even if a significant portion of the data set has missing values. A case study in a semiconductor manufacturing process is conducted to illustrate the proposed procedure.
© 2011 Elsevier Ltd. All rights reserved.
0957-4174/$ - see front matter. doi:10.1016/j.eswa.2011.08.114
D.-S. Kwak, K.-J. Kim / Expert Systems with Applications 39 (2012) 2590-2596
input variables for the rth observation, respectively; and N is the number of observations in the entire region. Below is a brief description of PRIM as it pertains to the current research. Further details about PRIM can be found in Friedman and Fisher (1999) and Hastie, Tibshirani, and Friedman (2001).

where $x_j^L$ and $x_j^U$ denote the lower and upper bounds of $x_j$ (j = 1, 2, ..., p) in the optimal box, respectively. The advantage of the PRIM algorithm compared with other rule discovery algorithms such as CART (Breiman, Friedman, Olshen, & Stone, 1984) and C4.5 (Quinlan, 1994, 1995) is its patient strategy, which has a small value of the peeling parameter α, allowing for the possibility of creating many peeling steps. Thus, each peeling becomes less important in determining the final box, and unfortunate peelings that remove good observations can be mitigated in subsequent steps.

Fig. 1. Overall m-PRIM procedure (recoverable flowchart labels: "Determine the optimal box"; "Make one representative optimal box (Step 3)").
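The patient peeling strategy described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `prim_peel`, the use of the in-box mean as the box objective, and the synthetic data are assumptions made for illustration, and the pasting step and the test-set selection of the final box are omitted. For a nominal-the-best response, the negative squared deviation from a (hypothetical) target is passed as the objective variable.

```python
import numpy as np


def prim_peel(X, y, alpha=0.1, beta=0.05):
    """Greedy top-down peeling; returns (lower, upper, inside_mask).

    Assumption: the box objective is the mean of y inside the box.
    Pasting and test-set selection of the final box are omitted.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    lower, upper = X.min(axis=0), X.max(axis=0)
    inside = np.ones(n, dtype=bool)
    min_support = beta * n  # smallest number of observations allowed in a box

    while True:
        best = None  # (objective, variable index, side, cut value, new mask)
        for j in range(p):
            xj = X[inside, j]
            for side, q in (("lower", alpha), ("upper", 1.0 - alpha)):
                cut = np.quantile(xj, q)
                if side == "lower":
                    keep = inside & (X[:, j] >= cut)
                else:
                    keep = inside & (X[:, j] <= cut)
                n_keep = keep.sum()
                # Skip peels that remove nothing or violate the support limit.
                if n_keep == 0 or n_keep == inside.sum() or n_keep < min_support:
                    continue
                objective = y[keep].mean()
                if best is None or objective > best[0]:
                    best = (objective, j, side, cut, keep)
        if best is None:  # no admissible peel is left
            break
        _, j, side, cut, inside = best
        if side == "lower":
            lower[j] = cut
        else:
            upper[j] = cut
    return lower, upper, inside


# Synthetic example (hypothetical data): the response is near the target 0
# when the first input is close to 0.5.
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))
y_raw = 4.0 * (X[:, 0] - 0.5) + rng.normal(scale=0.05, size=500)
lower, upper, inside = prim_peel(X, -(y_raw - 0.0) ** 2, alpha=0.1, beta=0.05)
```

Because each peel removes only about a fraction α of the observations still in the box, many small steps are taken before the support limit β is reached, which is the "patient" behaviour the text refers to.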
$$\mathrm{MIR} = \frac{\mathrm{MSE}_{current} - \mathrm{MSE}_{a}}{\mathrm{MSE}_{current}}. \tag{10}$$
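As a hedged illustration of Eq. (10), the sketch below computes MIR for a given box on a small made-up data set. The helper names (`mse_about_target`, `in_box`, `mir`), the target value, and the data are hypothetical; the mean squared error is assumed here to be the mean squared deviation of the output from its target, which is the usual choice for a nominal-the-best response but is not confirmed by this excerpt.

```python
import numpy as np


def mse_about_target(y, target):
    """Mean squared deviation of y from the target (assumed MSE definition)."""
    y = np.asarray(y, dtype=float)
    return float(np.mean((y - target) ** 2))


def in_box(X, lower, upper):
    """Boolean mask of the observations whose inputs all lie inside the box."""
    X = np.asarray(X, dtype=float)
    return np.all((X >= lower) & (X <= upper), axis=1)


def mir(X, y, lower, upper, target):
    """MIR = (MSE_current - MSE_a) / MSE_current, following Eq. (10)."""
    y = np.asarray(y, dtype=float)
    mse_current = mse_about_target(y, target)          # all observations
    mse_a = mse_about_target(y[in_box(X, lower, upper)], target)  # in the box
    return (mse_current - mse_a) / mse_current


# Hypothetical one-input example with target 0 and box [1, 2].
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([2.0, 1.0, 1.0, 2.0])
mir_demo = mir(X_demo, y_demo, np.array([1.0]), np.array([2.0]), target=0.0)
```

In this toy case the mean squared error over all four observations is 2.5 and over the two observations in the box it is 1.0, so the ratio is (2.5 - 1.0) / 2.5. An empty box is not handled in this sketch.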
proposed method is shown in Fig. 4. Additionally, $x_c^*$ is the optimal box generated from the complete data set (Dc).

In this case study, the negative mean squared error given in Eq. (5) is used as the box objective when generating k optimal boxes at Step 2 in Fig. 1, since the output variable (ACI CD, y) is of the nominal-the-best (NTB) type. Thus, performance measures were developed based on the values of the mean squared error. The development of the two performance measures is described below.

The first one is the Obtained Improvement Ratio (OIR), which is defined as:

$$\mathrm{OIR} = \frac{\mathrm{MSE}_{current} - \mathrm{MSE}_{a}}{\mathrm{MSE}_{current} - \mathrm{MSE}_{c}}, \quad 0 \le \mathrm{OIR} \le 1. \tag{8}$$

In Eq. (8), $\mathrm{MSE}_{current}$ is the mean squared error calculated using all observations in the complete data set (Dc), and $\mathrm{MSE}_{a}$ is the mean squared error calculated using the observations that fall in the representative optimal box ($x_a^*$) when it is applied to the complete data set (Dc). Here, $\mathrm{MSE}_{c}$ is the mean squared error calculated using the observations in the optimal box $x_c^*$, as shown in Fig. 4. The numerator of Eq. (8) (i.e., $\mathrm{MSE}_{current} - \mathrm{MSE}_{a}$) is the obtained improvement when process optimization is conducted based on the incomplete data set (Dinc), whereas the denominator (i.e., $\mathrm{MSE}_{current} - \mathrm{MSE}_{c}$) is the obtainable improvement when process optimization is conducted based on the complete data set (Dc).

Additionally, the missing values effects ratio (MER) is defined as:

$$\mathrm{MER} = 1 - \mathrm{OIR}, \quad 0 \le \mathrm{MER} \le 1. \tag{9}$$

Thus, OIR and MER refer to the proportion of obtained process improvement over the obtainable improvement, and the remaining

The results of the case study, which are summarized illustrations of Steps 2, 3, and 4 of the proposed method, are as follows.

First, the incomplete data set (Dinc) is converted into k imputed complete data sets $D_c^{(j)}$ (j = 1, 2, ..., k) by multiple imputation. In this case study, 15 imputations (i.e., k = 15) were made for each of the two incomplete sets having missing value rates of 50% and 90%. Three replications were made.

Each of the 15 imputed data sets was randomly split into a learning set (67%) and a test set (33%) when generating the optimal box. The peeling parameter α and the stopping parameter β were set to 0.1 and 0.05, respectively.

Table 2 shows the 15 optimal boxes from the first replication that were generated from the 15 imputed complete data sets at the missing value rate of 50%. The fifteen optimal boxes were combined by calculating the medians of the 15 lower and upper bounds of each $x_j$ (j = 1, 2, 3, and 4) in the 15 optimal boxes. One representative optimal box was finally obtained (shown at the bottom of Table 2).

The representative optimal box ($x_a^*$), OIR, MER, and MIR for each replication are shown in Table 3. As shown in Table 3, the OIR averages at missing value rates of 50% and 90% are 65% and 52%, respectively. This means that the etching process achieved 65% and 52% of the obtainable improvement. The MER averages at missing value rates of 50% and 90% are 35% and 48%, respectively. This means that 35% and 48% of the obtainable improvement were lost to the effects of missing values.

The MIR averages at missing value rates of 50% and 90% are 56% and 45%, respectively, which means that the etching process achieved an improvement of 56% and 45% compared with the current level with respect to the mean squared error. Additionally, the OIR and MIR averages at the missing value rate of 50% are, as expected, larger than those at the missing value rate of 90%.

In summary, the proposed method yielded considerable improvements on the etching process compared with the current level, although the improvement did not reach the level attainable when process optimization is conducted on the complete data set (Dc).

Table 2. Examples of 15 optimal boxes (at a missing value rate of 50% and first replication).

Table 3. Optimization results.

MIR values decreased as missing value rates increased in both methods. The case deletion method seemed to be effective when the missing value rate was low. The MIR values from multiple imputation were larger than those from case deletion at the missing value rates of 50% and 90%. The gaps also became larger as missing value rates increased.
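The median aggregation of the boxes obtained from the imputed data sets, and the ratios in Eqs. (8) and (9), can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, and the box bounds and mean squared error values are made-up numbers, not the values reported in Tables 2 and 3.

```python
import numpy as np


def aggregate_boxes(lowers, uppers):
    """Combine k boxes into one representative box by taking, for each input
    variable, the median of the k lower bounds and of the k upper bounds."""
    lowers = np.asarray(lowers, dtype=float)  # shape (k, p)
    uppers = np.asarray(uppers, dtype=float)  # shape (k, p)
    return np.median(lowers, axis=0), np.median(uppers, axis=0)


def oir(mse_current, mse_a, mse_c):
    """Obtained Improvement Ratio, following Eq. (8)."""
    return (mse_current - mse_a) / (mse_current - mse_c)


def mer(mse_current, mse_a, mse_c):
    """Missing values effects ratio, following Eq. (9): MER = 1 - OIR."""
    return 1.0 - oir(mse_current, mse_a, mse_c)


# Hypothetical example: k = 3 boxes on p = 2 input variables. The third box
# has an extreme upper bound on the second variable.
lowers = [[0.1, 1.0], [0.2, 1.2], [0.9, 1.1]]
uppers = [[0.5, 2.0], [0.6, 2.2], [0.7, 5.0]]
rep_lower, rep_upper = aggregate_boxes(lowers, uppers)

# Hypothetical mean squared errors (current level, representative box,
# box from the complete data set).
oir_demo = oir(mse_current=10.0, mse_a=4.0, mse_c=2.0)
mer_demo = mer(mse_current=10.0, mse_a=4.0, mse_c=2.0)
```

In the toy input, the extreme upper bound of 5.0 does not pull the representative bound away from the other two boxes, which is the robustness property that motivates aggregating by the median.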
Fig. 5. MIR according to missing value rate (case deletion and MI). Vertical axis: MIR (0% to 90%); horizontal axis: missing rate (0% to 100%).
(large). The second factor, number of data (N), has three levels: 100 (small), 250 (medium), and 400 (large). The last factor, magnitude of random error (σ), has two levels: 0 (small) and 3 (large).

Levels for each factor were chosen considering their relevance in a real semiconductor production process. For example, a missing value rate of 90% is the maximum level (i.e., 10% is usually the minimum sampling inspection rate), and 50% is the medium level; N = 400 is a very large number of observations, which can be collected over a one-year duration. Finally, σ = 0 was chosen as the minimum level of random error because most semiconductor processes were tightly controlled to prevent out-of-control situations; σ = 3 was chosen as the level where the process is out-of-control.

There were 18 (= 3 × 3 × 2) different treatments and three replications for each treatment. Thus, 54 (= 3 × 3 × 2 × 3) runs were made. Next, MIR(m1), MIR(m2), and MIR(m3) were used as responses; these represent MIR values based on the representative optimal box aggregated by median, mean, and trimmed mean, respectively.

Additionally, in this experiment, a complete data set Dc: {(y(r), x(r)), r = 1, 2, ..., N} was generated from the multivariate normal distribution with a mean vector

$$(1.33 \times 10^{-11},\; 2.00 \times 10^{-11},\; 3.00 \times 10^{-11},\; -4.33 \times 10^{-11},\; -3.33 \times 10^{-11})'$$

and a covariance matrix,

$$C = \begin{bmatrix} 1 & & & \\ 0.474 & 1 & & \\ 0.163 & 0.684 & 1 & \\ -0.510 & -0.313 & 0.001 & 1 \end{bmatrix}$$

which were estimated from the complete data set of the etching process in Section 4.2.

An incomplete data set (Dinc) was intentionally generated from the complete data set (Dc) as explained in that same section.

5. Results

Table 5
Comparison of performance among three aggregation methods.

Factor                     MIR(m1)         MIR(m2)       MIR(m3)
Significant factors (a)
  Main                     R, N, M         R, N, M       R, N, M
  Interaction              R*M,            R*M,          R*N, R*M,
                           R*N*M           R*N*M         R*N*M

Comparisons                (m1, m2) (b)    (m1, m3)      (m2, m3)
  All factors combined     m1 (c)          m1            (m2) (d)
  R = small                m1              m1            (m3)
      medium               (m1)            (m1)          (m2)
      large                (m2)            (m3)          (m2)
  N = small                m1              m1            (m3)
      medium               (m2)            (m1)          (m2)
      large                (m1)            (m1)          (m2)
  M = small                m1              m1            (m3)
      large                (m1)            (m1)          (m2)

(a) Factors in boldface were significant at α = 0.01; all other factors were significant at α = 0.05.
(b) (m1, m2) means that a comparison of performance between the median and mean aggregation methods was made.
(c) m1 means that the median method performed better at α = 0.05.
(d) Parentheses mean that there was no significant difference in the performances of the two methods at α = 0.05.

from the experiments were combined. Table 5 also shows which method performed better at each main factor level. The performance of the median method was significantly better than those of the mean and trimmed mean methods when the R,
N, and M values were small. There was no significant difference between the mean and trimmed mean methods.

Table 6 shows which aggregation method performed better at various combinations of R and M levels. Checking the interaction reveals that the interaction effects did not severely contradict the main factor level effects.

As seen in Table 6, if at least one condition was satisfied (i.e., M was small, or R was small or medium), MIR could be improved regardless of the aggregation method employed. Furthermore, the performance of the median method was significantly better than those of the mean and trimmed mean methods when R was small or medium, when M was small, and when R was small and M was large.

Table 6 footnotes:
(a) Boldface means significant at α = 0.01.
(b) Light face means significant at α = 0.05.
(c) Parentheses mean that there was no significant difference in the performances of the two methods at α = 0.05.
(d) A gray rectangle means that MIR did not improve (close to 0%, or less than 0%).

In this section, the performances of three aggregation methods were compared via a simulation experiment. Lower and upper bounds of input variables in the optimal boxes can take extreme values due to imputation variability. In such a situation, the median can aggregate the optimal boxes in a more stable manner than the mean, since the median is not strongly influenced by extreme values (the trimmed mean has similar properties). Therefore, although the median method performed significantly better than the mean and trimmed mean methods only at specific factor levels (or combinations of factor levels), its use as the main aggregation method is recommended.

6. Conclusion and discussion

To optimize a process using data mining techniques, it is important to consider the occurrence of missing values in the process data set. This work proposed a procedure, called m-PRIM, for optimizing a process based on the existing PRIM when the amount of missing values is not negligible.

Using a real data set from a semiconductor manufacturing process, the study demonstrates that m-PRIM yielded considerable improvements on the process compared with the current level. As expected, however, the degree of process improvement did not reach that obtained when process optimization was conducted without missing values in the data set.

In the case study, it was tentatively assumed that the joint distribution of all variables in the etching process followed a multivariate normal distribution, and that missing values would be missing at random (MAR). It is difficult to test the multivariate normality assumption on an incomplete data set in practice; thus, the use of the domain knowledge of engineers and the aid of statistical tools (e.g., Mahalanobis distance plots, Mardia's test, etc.) is required.

Acknowledgement

This work was supported by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD, Basic Research Promotion Fund) (313-2008-2-D01192).

References

Arteaga, F., & Ferrer, A. (2002). Dealing with missing data in MSPC: Several methods, different interpretations, some examples. Journal of Chemometrics, 16, 408-418.
Braha, D., & Shmilovici, A. (2002). Data mining for improving a cleaning process in the semiconductor industry. IEEE Transactions on Semiconductor Manufacturing, 15(1), 91-101.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Pacific Grove, CA: Wadsworth.
Chong, I. G., Albin, S. L., & Jun, C. H. (2007). A data mining approach to process optimization without an explicit quality function. IIE Transactions, 39, 795-804.
Chong, I. G., & Jun, C. H. (2008). Flexible patient rule induction method for optimizing process variables in discrete type. Expert Systems with Applications, 34(4), 3014-3020.
Dasu, T., & Johnson, T. (2003). Exploratory data mining and data cleaning. John Wiley & Sons.
Feelders, A. (1999). Handling missing data in trees: Surrogate splits or statistical imputation. In Proceedings of the third European conference on principles of data mining and knowledge discovery.
Friedman, J. H., & Fisher, N. I. (1999). Bump hunting in high-dimensional data. Statistics and Computing, 9, 123-143.
Harding, J. A., Shahbaz, M., Srinivas, & Kusiak, A. (2006). Data mining in manufacturing: A review. Journal of Manufacturing Science and Engineering, 128, 969-976.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning. New York: Springer, pp. 279-282.
Kim, P., & Ding, Y. (2005). Optimal engineering design guided by data-mining methods. Technometrics, 47(3), 336-348.
Kwak, D., Kim, K., & Lee, M. (2010). Multistage PRIM: Patient rule induction method for optimization of a multistage manufacturing process. International Journal of Production Research, 48(12), 3461-3473.
Lee, M., & Kim, K. (2008). MR-PRIM: Patient rule induction method for multiresponse optimization. Quality Engineering, 20(2), 232-242.
Muteki, K., Macgregor, J. F., & Ueda, T. (2005). Estimation of missing data using latent variable methods with auxiliary information. Chemometrics and Intelligent Laboratory Systems, 78, 41-50.
Nelson, P. R. C., Taylor, P. A., & Macgregor, J. F. (1996). Missing data methods in PCA and PLS: Score calculation with incomplete observations. Chemometrics and Intelligent Laboratory Systems, 35, 45-65.
Quinlan, J. R. (1994). C4.5: Programs for machine learning. San Mateo, CA: Morgan-Kaufmann.
Quinlan, J. R. (1995). MDL and categorical theories (continued). In Proceedings of the 12th international conference on machine learning (pp. 464-470). San Mateo, CA: Morgan-Kaufmann.
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley.
Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91(434), 473-489.
Schafer, J. L. (1997). Analysis of incomplete multivariate data. London: Chapman & Hall.
Schafer, J. L. (1999a). NORM: Multiple imputation of incomplete multivariate data under a normal model, version 2. Software for Windows 95/98/NT, available from <http://www.stat.psu.edu/~jls/misoftwa.html>.
Schafer, J. L. (1999b). Multiple imputation: A primer. Statistical Methods in Medical Research, 8, 3-15.
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7(2), 147-177.
Schafer, J. L., & Olsen, M. K. (1998). Multiple imputation for multivariate missing-data problems: A data analyst's perspective. Multivariate Behavioral Research, 33(4), 545-571.
Tanner, M. A., & Wong, W. H. (1987). The calculation of posterior distributions by data augmentation (with discussion). Journal of the American Statistical Association, 82, 528-550.