
Proceedings of the 5th Asian Mathematical Conference, Malaysia 2009

POSSIBILISTIC LOGISTIC REGRESSION MODEL WITH MINIMUM FUZZINESS


Ramadan Hamed1, Aly El-Hefnawy2, Maha M. El-Ashram3 and Hesham A. Abdalla4

1 Professor, Department of Statistics, Faculty of Economics & Political Science, Cairo University, Egypt.
2 Associate Professor, Department of Statistics, Faculty of Economics & Political Science, Cairo University, Egypt.
3 Assistant Professor, Department of Statistics, Faculty of Economics & Political Science, Cairo University, Egypt.
4 Assistant Professor, Department of Statistics & Insurance, Faculty of Commerce, Assiut University, Egypt.

Abstract

Parameter estimation for logistic regression is usually based on maximizing the likelihood function. For large, well-balanced datasets, ML estimation is a satisfactory approach. Unfortunately, ML can fail completely, or at least produce poor results in terms of estimated probabilities and confidence intervals of parameters, especially for small datasets. This study extends the logistic regression model to a fuzzy logistic regression model by suggesting a new approach, based on fuzzy concepts, to estimate the model parameters. The proposed approach is expected to be an effective alternative to ML for small samples. A mathematical programming model that minimizes the total spread of the estimated probabilities of the logistic model is suggested, and its performance is evaluated against the ML approach in a Monte Carlo simulation study. Results show that the proposed model outperforms ML for small sample sizes.

Keywords: Fuzzy Numbers; Extension Principle; Logistic Regression; Maximum Likelihood; Small Data Sets; Mathematical Programming; Monte Carlo Approach; Similarity Measure.

1 Introduction

Logistic regression is one of the most frequently used statistical methods, with wide application in biomedical, behavioral and physical research. Parameter estimation for logistic regression is usually based on maximizing the unconditional likelihood function. The maximum likelihood estimates (MLEs) of the parameters in logistic regression are asymptotically unbiased with minimum variance when the model is correctly specified. For large, well-balanced datasets, ML estimation is a satisfactory approach. Unfortunately, ML can fail completely, or at least produce poor results in terms of estimated probabilities and confidence intervals of parameters, especially for small datasets or unbalanced ones in which the percentage of observations with Y coded 1 is close to zero or one [3],[7],[10],[11]. To overcome the problems that logistic regression analysis of small and sparse data sets faces, special methods have been developed, including estimation and exact inference based on conditional likelihood [7] and penalized maximum likelihood estimation and inference [3],[5]. These methods are modified versions of classical ML that provide less biased estimates and more accurate inference.

In this study, a new logistic regression approach based on fuzzy set theory is developed. It follows the idea of Fuzzy Linear Regression proposed by Tanaka et al. [14]: we assume that the lack of fit in logistic regression can be viewed as fuzziness of the model components. That is, the residuals between predictions and actual observations are attributed not to measurement errors but to uncertainty in the model parameters, so a possibility distribution can be used instead of a probability distribution. The approach offers an alternative to the Maximum Likelihood method for estimating the model parameters, especially in situations where ML fails to converge, as with small data sets.

This paper is organized as follows. The logistic regression model and the limitations of the Maximum Likelihood approach are reviewed in section 2. In section 3, the fuzzy logistic regression model is formulated. A Monte Carlo simulation study, designed to evaluate the proposed model as an alternative to the ML approach, is presented in section 4. Section 5 presents the measures used to evaluate the proposed model against ML. Finally, results and conclusions are discussed in sections 6 and 7.


2 Maximum Likelihood Estimates for Logistic Regression Model.


Consider the logistic regression model [8]:

$$\pi(x_i) = \frac{1}{1 + \exp\left[-\sum_{j=0}^{k-1} \beta_j x_{ij}\right]}, \qquad i = 1, 2, \ldots, n \qquad (1)$$

which represents the relationship between $\pi(x_i) = P(y_i = 1 \mid x_i, \beta)$ (the probability of success) and the covariates $x$, where the $\beta_j$ are unknown model parameters. The maximum likelihood method is the most commonly used method of parameter estimation in logistic regression. The ML estimates $\hat{\beta}_j$ of the regression parameters $\beta_j$, $j = 0, \ldots, k-1$, which can be interpreted as log odds ratio estimates, are obtained as solutions to the score equations

$$\frac{\partial l(\beta)}{\partial \beta} = X^t (y - \pi) = 0 \qquad (2)$$

where $l(\beta) = \log(L(\beta_0, \beta_1, \beta_2, \ldots, \beta_{k-1}))$ is the log-likelihood function, $X$ is an $(n \times k)$ matrix of covariate values, and $y$ is an $(n \times 1)$ vector of response variable values. ML is an iterative approach that requires initial estimates of the parameters $\beta_j$. It generally performs well for large sample sizes, but for small data sets, or data sets in which the average value of Y is close to zero or one, the method may produce poor results or even fail to converge [3],[7],[10],[11],[13]. Some methods have been developed to overcome the problems that logistic regression analysis of small data sets faces, including estimation and exact inference based on conditional likelihood [7] and penalized maximum likelihood estimation and inference [3],[5]. They are generally modified versions of classical ML. Although exact logistic regression can provide finite and accurate estimates in some situations, it cannot generally be used as a tool for dealing with small data sets [6]. Moreover, penalized maximum likelihood estimation is rarely available in statistical software packages.
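The iterative nature of ML, and the way it can break down on a small sample, can be illustrated with a minimal Newton-Raphson sketch for a model with an intercept and one covariate. The function names, the starting values (0, 0), the tolerance and the singularity threshold are illustrative assumptions, not details taken from the paper.

```python
import math

def logistic(eta):
    """Logistic function 1 / (1 + exp(-eta)), written to avoid overflow."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

def fit_simple_logistic(x, y, max_iter=25, tol=1e-8):
    """Newton-Raphson for pi(x) = logistic(b0 + b1 * x).

    Returns (b0, b1, converged). Starting values and stopping rule are
    illustrative choices (assumptions), not the paper's settings.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        # Score vector X^t (y - pi) and information matrix X^t W X.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = logistic(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det < 1e-12:
            # Information matrix is numerically singular: no finite MLE found.
            return b0, b1, False
        # Newton step: solve the 2x2 system H * step = score.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if max(abs(d0), abs(d1)) < tol:
            return b0, b1, True
    return b0, b1, False

# Overlapping responses: the score equations have a finite solution.
print(fit_simple_logistic([-2, -1, 0, 1, 2, 3], [0, 0, 1, 0, 1, 1]))
# Completely separated responses: the slope keeps growing and the fit fails.
print(fit_simple_logistic([-2, -1, 1, 2], [0, 0, 1, 1]))
```

In the second call the responses are perfectly separated by the covariate, a situation that is more likely in small samples; the slope estimate grows at every iteration until the information matrix becomes numerically singular.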

3 Possibilistic Logistic Regression.

We extend the logistic regression model (1) to a fuzzy logistic regression model by assuming that the parameters of the model are fuzzy numbers, while both $\pi_i$ and $x$ are crisp numbers. Consequently, the fuzzy model is expressed as:

$$\tilde{\pi}(x_i) = \frac{1}{1 + \exp\left[-\sum_{j=0}^{k-1} \tilde{\beta}_j x_{ij}\right]}, \qquad i = 1, 2, \ldots, n \qquad (3)$$

where $\tilde{\beta}_j$ is assumed to be a fuzzy number as defined by Zadeh [18]-[20] and later refined by Dubois & Prade [4]. To estimate these fuzzy parameters, the proposed approach relies on the extension principle stated by Zadeh [17] and on Bárdossy's third proposition [1],[2]. This proposition expresses the lower (upper) limit of the fuzzy number $\tilde{y}_i$ at level $h$ (where $\tilde{y}_i = f(x_i, \tilde{\beta})$, $i = 1, \ldots, n$) as a function of $x$ and of the lower (upper) limit of $\tilde{\beta}$ at the same level $h$. According to this proposition, assuming that we have crisp data from the logistic model, i.e., the $x_i$ and $\pi(x_i)$ (denoted $\pi_i$) are crisp, and that $\tilde{\beta}$ is composed of symmetric triangular fuzzy numbers with membership function $L(z) = \max(0, 1 - |z|)$ ($c$ and $s$ denote the vectors of centers and spreads of the fuzzy parameters), the conditions of Fuzzy Logistic Regression can be expressed as:

$$\frac{1}{1 + \exp\left[-\left(c^t x_i - (1-h)\, s^t |x_i|\right)\right]} \le \pi_i, \qquad i = 1, \ldots, n$$

$$\frac{1}{1 + \exp\left[-\left(c^t x_i + (1-h)\, s^t |x_i|\right)\right]} \ge \pi_i, \qquad i = 1, \ldots, n \qquad (4)$$
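The step from the fuzzy model (3) to conditions (4) can be spelled out as follows. This is a sketch under the stated assumption of symmetric triangular parameters with centers $c_j$ and spreads $s_j$; the bracket notation $[\cdot]_h$ for an $h$-level interval and the symbol $g$ for the logistic function are introduced here only for illustration.

```latex
% h-level interval of one symmetric triangular parameter:
% L((b - c_j)/s_j) >= h  <=>  |b - c_j| <= (1 - h) s_j
[\tilde{\beta}_j]_h = \bigl[\, c_j - (1-h)\,s_j,\; c_j + (1-h)\,s_j \,\bigr]

% crisp covariates: multiplying by x_{ij} scales the half-width by |x_{ij}|
[\tilde{\beta}^t x_i]_h = \bigl[\, c^t x_i - (1-h)\,s^t |x_i|,\;
                                 c^t x_i + (1-h)\,s^t |x_i| \,\bigr]

% g(\eta) = 1 / (1 + e^{-\eta}) is increasing, so it maps endpoints to endpoints
[\tilde{\pi}(x_i)]_h = \bigl[\, g\bigl(c^t x_i - (1-h)\,s^t |x_i|\bigr),\;
                              g\bigl(c^t x_i + (1-h)\,s^t |x_i|\bigr) \,\bigr]
```

Requiring the observed $\pi_i$ to lie inside this last interval for every $i$ gives the pair of inequalities in (4).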


The proposed model is based on Tanaka's approach of minimizing the total spread of the fuzzy predicted values. It minimizes the distance between the upper and lower bounds of the fuzzy predictions and, simultaneously, the distance between the observed values and the center of the fuzzy predicted values. Consequently, the model can be formulated as:
$$\min \{ D_p + V_p \}$$

$$\text{s.t.} \quad E_{iL}(h) \le O_i, \qquad i = 1, 2, \ldots, n$$

$$\qquad\; O_i \le E_{iU}(h), \qquad i = 1, 2, \ldots, n \qquad (5)$$

where

$O_i$ is the $i$th actual observation.

$E_{iL}$ and $E_{iU}$ are the lower and upper limits of the fuzzy predictions, represented as intervals at the chosen $h$ level.

$D_p = \left\{ \sum_i |O_i - E_{iC}|^p \right\}^{1/p}$ is the $L_p$ norm of the distance between the actual observations and the center $E_{iC}$ of the fuzzy predictions.

$V_p = \left\{ \sum_i |E_{iU} - E_{iL}|^p \right\}^{1/p}$ is the $L_p$ norm of the distance between the upper and lower bounds of the fuzzy predictions.

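Using the squared Euclidean norm ($p = 2$) and conditions (4), we can reformulate model (5) as follows:

$$\min \; \sum_{i=1}^{n} (\pi_{iU} - \pi_{iL})^2 + \sum_{i=1}^{n} \left( p_i - \frac{1}{1 + \exp(-c^t x_i)} \right)^2$$

$$\text{s.t.} \quad \frac{w_i}{1 + \exp\left[-\left(c^t x_i - (1-h)\, s^t |x_i|\right)\right]} \le \pi_i, \qquad i = 1, \ldots, n$$

$$\qquad\; \frac{w_i}{1 + \exp\left[-\left(c^t x_i + (1-h)\, s^t |x_i|\right)\right]} \ge \pi_i, \qquad i = 1, \ldots, n \qquad (6)$$

$$\qquad\; s \ge 0$$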

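where

$c_j$ is the center of the parameter $\tilde{\beta}_j$ of the logistic model.

$s_j$ is the spread of the parameter $\tilde{\beta}_j$ of the logistic model.

$\pi_i$ is the $i$th observed value of the dependent variable, representing $P(y_i = 1)$ (written $p_i$ in the objective function).

$w_i$ is the weight corresponding to the data after being grouped (discussed later).

$\pi_{iL}$, $\pi_{iU}$ are the lower and upper bounds of the estimated probabilities $P(y_i = 1)$, where

$$\pi_{iL} = \frac{1}{1 + \exp\left[-\left(c^t x_i - (1-h)\, s^t |x_i|\right)\right]}, \qquad i = 1, \ldots, n$$

$$\pi_{iU} = \frac{1}{1 + \exp\left[-\left(c^t x_i + (1-h)\, s^t |x_i|\right)\right]}, \qquad i = 1, \ldots, n$$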
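To make the structure of model (6) concrete, the sketch below evaluates the objective and checks the constraints for one candidate pair of center and spread vectors. It does not solve the optimization problem. The function name `evaluate_model6` is hypothetical, and multiplying the bounds by $w_i$ in the constraints is one reading of (6) as printed, which should be checked against the original paper.

```python
import math

def logistic(eta):
    """Logistic function 1 / (1 + exp(-eta)), written to avoid overflow."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

def evaluate_model6(c, s, X, p_obs, h, w=None):
    """Objective value and feasibility of a candidate (c, s) for model (6).

    c, s  : centers and spreads of the fuzzy parameters
    X     : rows of covariate values (first entry 1.0 for the intercept)
    p_obs : observed probabilities pi_i
    h     : chosen h level in [0, 1]
    w     : weights w_i (defaults to 1 for every row)
    """
    if w is None:
        w = [1.0] * len(X)
    feasible = all(sj >= 0 for sj in s)          # constraint s >= 0
    objective = 0.0
    for xi, pi, wi in zip(X, p_obs, w):
        centre = sum(cj * xj for cj, xj in zip(c, xi))
        half = (1.0 - h) * sum(sj * abs(xj) for sj, xj in zip(s, xi))
        lo = logistic(centre - half)             # pi_iL
        mid = logistic(centre)                   # center of the prediction
        hi = logistic(centre + half)             # pi_iU
        # Squared spread plus squared distance from the observed value.
        objective += (hi - lo) ** 2 + (pi - mid) ** 2
        # Weighted bounds must bracket the observed probability (assumed reading).
        if not (wi * lo <= pi <= wi * hi):
            feasible = False
    return objective, feasible

X = [[1.0, -1.0], [1.0, 0.0], [1.0, 1.0]]
p_obs = [0.3, 0.5, 0.7]
print(evaluate_model6([0.0, 1.0], [0.5, 0.5], X, p_obs, h=0.0))
print(evaluate_model6([0.0, 1.0], [0.0, 0.0], X, p_obs, h=0.0))
```

With zero spreads the objective is small but the candidate is infeasible, because the degenerate intervals no longer cover the observed probabilities. This is the trade-off between fit and fuzziness that the model balances.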

To implement this model, we consider two main issues. First, how can the probabilities $p_i$ be obtained if the data used to estimate the logistic model parameters consist of the covariates $x$ and a response variable coded 0/1 rather than probabilities? Second, which value of the $h$ level should be chosen?

Real data sets of logistic models often consist of the values of the covariates x and the values of the response variable represented in binary form rather than probabilities. Therefore, to implement the proposed model, the values 0,1 of the response variable should be expressed as probabilities. To obtain these probabilities, we suggest the following steps:

1. Break down each covariate into a limited number of levels (intervals).
2. Compute the mean value of each interval.
3. Identify all possible combinations of covariate levels or intervals.
4. Identify the number of combinations $n$ and the number of observations in each combination, $m_i$, $i = 1, \ldots, n$, where the total number of observations is $M = \sum_{i=1}^{n} m_i$.
5. Compute the probability $\pi_i$ for each combination, which is the average of the $y$ values corresponding to combination $i$.

In this way, $n$ new observations (combinations) are obtained, each with the modified values of $x_i$ and the corresponding $\pi_i$, and the proposed models can be fitted to these new observations. Reducing the covariates to a limited number of levels reduces the number of observations from $M$ to $n$ ($M > n$). Weighting can then be used to compensate for sampling schemes that stratify on the covariates, giving results that more accurately reflect the population. The modified covariate values are weighted by their actual weights in the original data set. Therefore, the proposed models are weighted by the weights $w_i = \exp(m_i / M)$, $i = 1, \ldots, n$.
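The grouping steps and the weights can be sketched for a single covariate as follows. The function name `group_binary_data` and the cut points are hypothetical; taking the "mean value of each interval" to be the mean of the observed covariate values that fall in it, and reading the weight as $\exp(m_i / M)$, are assumptions about the text rather than confirmed details.

```python
import math
from collections import defaultdict

def group_binary_data(xs, ys, edges):
    """Collapse (x, y) pairs with binary y into grouped observations.

    xs    : covariate values (one covariate, for brevity)
    ys    : 0/1 responses
    edges : increasing interval boundaries; how many cut points to use
            is a modelling choice that the text leaves open
    Returns one dict per non-empty interval with the mean of x, the
    observed probability pi_i, the count m_i and the weight w_i.
    """
    cells = defaultdict(list)
    for x, y in zip(xs, ys):
        # Index of the interval containing x, counted over interior cut points.
        k = sum(1 for e in edges[1:-1] if x >= e)
        cells[k].append((x, y))
    total = len(xs)                      # M, the original number of observations
    grouped = []
    for k in sorted(cells):
        pts = cells[k]
        m = len(pts)                     # m_i
        grouped.append({
            "x_mean": sum(p[0] for p in pts) / m,
            "pi": sum(p[1] for p in pts) / m,   # average of the y values
            "m": m,
            "w": math.exp(m / total),           # assumed reading of the weight
        })
    return grouped

rows = group_binary_data([0.1, 0.2, 0.6, 0.7, 0.9], [0, 1, 1, 1, 0], [0.0, 0.5, 1.0])
for row in rows:
    print(row)
```

With several covariates the same idea applies to every combination of interval indices, so the dictionary key would become a tuple of indices, one per covariate.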

Selection of the $h$ level is the second issue in implementing these models. Condition (4) requires that the membership degree of each observation be at least an imposed possibility threshold $h$, where $h \in [0,1]$. This criterion expresses the fact that the fuzzy output of the model should cover all data points to a certain $h$ level. The chosen $h$ value influences the spreads of the parameters and, consequently, the estimation errors. Therefore, we adopt the algorithm proposed by Nasrabadi and Nasrabadi [12] to select the $h$ level that minimizes the estimation error.

4 A Monte Carlo Simulation Study.


A simulation study is designed to empirically assess the performance of the proposed Fuzzy Logistic Regression model in a number of situations and to compare it with that of the Maximum Likelihood approach. Three factors are considered in the simulation study:

1. Sample size: a sample size of 30 is chosen to study the effect of small samples on the performance of the proposed models (Bull et al. [3] used 50, 75 and 100 observations as small sample sizes), while sample sizes of 200 and 600 are chosen to reflect moderately large and reasonably large samples in common practice [15].
2. Model type (based on the number of parameters): a simple model (two parameters and one covariate) and a multiple model (four parameters and three covariates).
3. Type and distribution of covariates: continuous with Normal distribution, continuous with Uniform distribution, binary (Bernoulli), and mixed (Normal, Uniform, Binary).

Table 1 shows the details of the 21 groups arising from the different combinations of the categories of the three factors. The number of replications is held fixed for every combination of design factors. Since the objective of the simulation study is to assess the performance of the proposed model compared to ML, rather than to study the asymptotic properties of the estimated parameters, we use only 50 replications for each combination. We consider this number sufficient for comparing ML and the proposed models, with no need for a larger number of replications (a similar simulation study used only 20 replications [9]).
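One replication of a simple-model group could be generated along the following lines. The true parameter values `(0.5, 1.0)`, the seed, the reading of N(0,2) as variance 2, and the Bernoulli success probability of 0.5 are all illustrative assumptions, since the paper does not state them here; `simulate_group` is a hypothetical helper name.

```python
import math
import random

def simulate_group(n, kind, beta, rng):
    """Draw n observations (x, y, p) from a logistic model with one covariate.

    kind : "normal", "uniform" or "binary"
    beta : (b0, b1), hypothetical true parameters
    rng  : a random.Random instance, so that runs are reproducible
    """
    data = []
    for _ in range(n):
        if kind == "normal":
            x = rng.gauss(0.0, math.sqrt(2.0))       # N(0, 2) read as variance 2 (assumption)
        elif kind == "uniform":
            x = rng.uniform(-6.0, 6.0)               # U(-6, 6)
        elif kind == "binary":
            x = 1.0 if rng.random() < 0.5 else 0.0   # Bernoulli(0.5) (assumption)
        else:
            raise ValueError("unknown covariate type: " + kind)
        p = 1.0 / (1.0 + math.exp(-(beta[0] + beta[1] * x)))   # true probability
        y = 1 if rng.random() < p else 0                       # binary response
        data.append((x, y, p))
    return data

rng = random.Random(2009)
# 50 replications of a small-sample group with one uniform covariate.
replications = [simulate_group(30, "uniform", (0.5, 1.0), rng) for _ in range(50)]
print(len(replications), len(replications[0]))
```

Keeping the true probability `p` alongside each draw is what later allows the estimated probabilities to be compared with the truth.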


5 Performance Measures.
Since our aim is to compare the performance of the proposed fuzzy model against the ML approach, the choice of goodness-of-fit measure should take two main aspects into account:

1. The proposed model is built on fuzzy concepts, so the selected measures should be applicable to both the possibilistic and the probabilistic case.
2. The selected measures should not be built on the concept of the ML approach, so that they are not biased toward that approach.
Table 1. Combination of Simulated Data According to Simulation Factors

Combination   Sample Size   Number of Covariates (k-1)   Distribution of Covariates
Group 1       30            1                            N(0,2)
Group 2       30            3                            N(0,2)
Group 3       200           1                            N(0,2)
Group 4       200           3                            N(0,2)
Group 5       600           1                            N(0,2)
Group 6       600           3                            N(0,2)
Group 7       30            1                            U(-6,6)
Group 8       30            3                            U(-6,6)
Group 9       200           1                            U(-6,6)
Group 10      200           3                            U(-6,6)
Group 11      600           1                            U(-6,6)
Group 12      600           3                            U(-6,6)
Group 13      30            1                            Binary
Group 14      30            3                            Binary
Group 15      200           1                            Binary
Group 16      200           3                            Binary
Group 17      600           1                            Binary
Group 18      600           3                            Binary
Group 19      30            3                            Mixed
Group 20      200           3                            Mixed
Group 21      600           3                            Mixed

Accordingly, a fuzzy similarity measure is used to evaluate the performance of the proposed model, alongside the classical predictive measure of correct classification. We use the similarity measure $S(A, B)$ of Yang et al. [16], which is based on the distance between two L-R type fuzzy numbers $A_i = (a_i, a_i^l, a_i^r)_{LR}$ and $B_i = (b_i, b_i^l, b_i^r)_{LR}$:
1 S (Ai , Bi ) = 2 exp(d LR (Ai , Bi ) / i ) where if Ai = Bi if Ai Bi

2 d LR (Ai , Bi ) = (ai bi )2 + ((ai i ai l ) (bi i bi l ))2 + ((ai i ai r ) (bi i bi r ))2

(7)

i = { (ai i ai l ) (bi i bi l ) + (ai i ai r ) (bi i bi r ) }/ 2


+{ (ai ai l ) (bi bi l ) + (ai ai r ) (bi bi r ) }/ 23 and
i

= L1 (v i )dv
0

i = R 1 (v i )dv
0

When $v = 1$, both $A_i$ and $B_i$ are the common normal L-R fuzzy numbers. This measure is defined for the $i$th observation; the index for the whole data set can then be computed as

$$SIM = \sum_{i} S(A_i, B_i) \qquad (8)$$

This index is used to measure the similarity between the original probability $\pi_i$ (the true probability of the simulated data) and the probability $\hat{\pi}_i$ estimated by the proposed model. The original probabilities $\pi_i$ are crisp numbers, so their left and right spreads are zero. To make an equitable comparison between the performance of the proposed model and ML, we use the statistical confidence interval of $\hat{\pi}_i$ for ML (as Kim et al. suggested [9]) and the fuzzy interval of $\hat{\pi}_i$ for the proposed model to compute the similarity index for each of them.

Another intuitively appealing way to summarize the results of a fitted logistic regression model is a classification table [8]. This table is the result of cross-classifying the outcome variable $Y$ with $\hat{Y}$, a variable whose values are derived from the estimated logistic probabilities.
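The classification table and the overall rate of correct classification can be sketched as follows. The cutoff of 0.5 used to turn an estimated probability into a predicted class is an assumption made for illustration, and `classification_table` is a hypothetical helper name.

```python
def classification_table(y, p_hat, cutoff=0.5):
    """Cross-classify observed y (0/1) with predictions derived from p_hat.

    Returns (table, rate), where table[(observed, predicted)] is a count
    and rate is the overall proportion of correct classifications.
    """
    table = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for yi, pi in zip(y, p_hat):
        y_pred = 1 if pi >= cutoff else 0    # predicted group membership
        table[(yi, y_pred)] += 1
    correct = table[(0, 0)] + table[(1, 1)]
    return table, correct / len(y)

table, rate = classification_table([1, 0, 1, 1, 0], [0.9, 0.2, 0.4, 0.7, 0.6])
print(table)
print(rate)   # 3 of the 5 observations are classified correctly
```

For the fuzzy model, `p_hat` would hold the central values of the estimated probabilities.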

6 Simulation Results
The similarity measure (7) is computed between the estimated probability $\hat{\pi}_i$ and the true probability $\pi_i$, and the overall sample measure (8) is obtained for the proposed model as well as for ML. The similarity results are compared for each combination (group) of the simulation study. The Wilcoxon Signed Ranks Test is applied to compare the proposed model with ML. In addition, we report how often the proposed model is superior to ML (for each group, the percentage of replications in which the proposed model's result is better than the ML result).
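The quantities reported for the Wilcoxon Signed Ranks Test (mean negative rank, mean positive rank and Z) can be reproduced with a short sketch. It uses the normal approximation without a tie correction and drops zero differences; whether the software used in the paper applied exactly these conventions is an assumption. Differences are taken as proposed model minus ML, so positive ranks correspond to replications where the proposed model scores higher.

```python
import math

def wilcoxon_signed_rank(a, b):
    """Mean negative rank, mean positive rank and normal-approximation Z
    for the paired differences a - b (zero differences are dropped)."""
    diffs = [ai - bi for ai, bi in zip(a, b) if ai != bi]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied absolute differences and give it the average rank.
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    neg = [r for r, d in zip(ranks, diffs) if d < 0]
    pos = [r for r, d in zip(ranks, diffs) if d > 0]
    w = min(sum(neg), sum(pos))                           # test statistic
    mean_w = n * (n + 1) / 4.0
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)    # no tie correction
    z = (w - mean_w) / sd_w
    mean_neg = sum(neg) / len(neg) if neg else 0.0
    mean_pos = sum(pos) / len(pos) if pos else 0.0
    return mean_neg, mean_pos, z

# 50 replications in which ML is better every time (made-up scores).
proposed = [0.01 * i for i in range(1, 51)]
ml = [2.0 * v for v in proposed]
print(wilcoxon_signed_rank(proposed, ml))
```

When one method wins all 50 replications, the mean rank on the winning side is 25.5 and Z is about -6.15, which is the pattern the results table shows for groups in which the proposed model is never better.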

Table 2. Model with Minimum Fuzziness versus ML Performances: Similarity Measure.

            Mean of similarity measure   Mean ranks (Wilcoxon Test)                       Proposed model
Group       Model      ML                Negative    Positive     Z         Sig.          does better
Group 01    .5839      .3662             20.000      26.250       -4.996    0.000         88%
Group 02    .4938      .3235             14.000      30.429       -4.127    0.000         70%
Group 03    .6920      .8703             25.750      19.500       -5.778    0.000         4%
Group 04    .5835      .8605             25.500      0.000        -6.155    0.000         0%
Group 05    .7312      .4662             21.000      25.592       -5.952    0.000         98%
Group 06    .5865      .9277             25.500      0.000        -6.155    0.000         0%
Group 07    .5515      .3535             14.538      29.351       -4.330    0.000         74%
Group 08    .5068      .2407             11.700      27.868       -4.831    0.000         80%
Group 09    .5720      .8973             26.957      2.667        -6.077    0.000         6%
Group 10    .5086      .8960             25.500      0.000        -6.155    0.000         0%
Group 11    .5972      .9432             26.458      2.500        -6.106    0.000         4%
Group 12    .5180      .9393             25.500      0.000        -6.155    0.000         0%
Group 13    .6498      .5085             16.438      29.765       -3.615    0.000         68%
Group 14    .6039      .3896             14.267      30.314       -4.089    0.000         70%
Group 15    .6715      .8446             27.651      12.286       -5.325    0.000         14%
Group 16    .6926      .7785             29.100      20.100       -2.274    0.023         40%
Group 17    .6858      .9080             27.556      7.000        -5.817    0.000         10%
Group 18    .7535      .8874             27.578      6.800        -5.827    0.000         10%
Group 19    .5269      .3235             14.273      28.667       -4.639    0.000         78%
Group 20    .5904      .8847             25.500      0.000        -6.155    0.000         0%
Group 21    .5762      .9249             25.500      0.000        -6.155    0.000         0%

For small samples, Table 2 shows that the proposed model significantly outperforms ML in all small-sample groups. In these groups, the mean similarity measure ranges between .49 and .60 for the proposed model, against a range of .24 to .5 for ML. This result indicates that the sample size is the only factor that affects the performance of the proposed models. The percentage of replications in which the proposed model is superior to ML ranges from 68% to 88% for the small-sample groups. For large samples, Table 2 shows that ML significantly outperforms the proposed model in all groups except group 5, which represents the simple model with a normally distributed covariate and the large sample size of 600.

The overall rate of correct classification measures the degree of classification accuracy rather than the model fit. We therefore use this index as a supplementary tool to investigate the classification capability of the proposed models. The index is based on using the central values of the estimated probabilities to predict group membership. The Wilcoxon Signed Ranks Test is applied to compare the results of each proposed model with those of ML. The comparison results are as follows. For small samples, Table 3 shows that the proposed model significantly outperforms ML for groups 1 and 7, which represent simple models with a constant term and one covariate. Its overall rates of correct classification are equivalent to the ML rates for groups 13 and 14, which represent the simple model and the multiple model with binary covariates, respectively.
Table 3. Model with Minimum Fuzziness versus ML Performances: Overall Rate of Correct Classification

            Mean of correct classification rate   Mean ranks (Wilcoxon Test)                  Proposed model does
Group       Model      ML                         Negative    Positive     Z         Sig.     equal or better
Group 01    .7387      .5013                      23.833      25.163       -4.675    0.000    88%
Group 02    .6673      .9493                      25.959      3.000        -6.131    0.000    2%
Group 03    .7979      .8448                      23.132      6.000        -5.363    0.000    24%
Group 04    .7761      .9051                      25.885      4.250        -5.558    0.000    22%
Group 05    .8262      .5183                      1.000       26.000       -6.146    0.000    98%
Group 06    .8137      .9146                      26.809      5.000        -6.011    0.000    6%
Group 07    .7053      .4947                      21.450      24.069       -3.565    0.000    80%
Group 08    .7133      .9787                      25.500      0.000        -6.160    0.000    0%
Group 09    .7623      .9150                      21.689      5.833        -5.278    0.000    26%
Group 10    .7154      .9534                      25.500      0.000        -6.156    0.000    0%
Group 11    .7698      .9086                      26.350      4.500        -5.612    0.000    20%
Group 12    .7450      .9494                      26.000      1.000        -6.145    0.000    2%
Group 13    .7953      .7727                      6.500       14.000       -1.025    0.306    76%
Group 14    .7887      .7727                      12.563      13.778       -1.038    0.299    68%
Group 15    .7433      .7929                      10.500      0.000        -3.935    0.000    60%
Group 16    .7898      .8420                      11.222      4.000        -3.643    0.000    64%
Group 17    .7815      .8185                      7.000       0.000        -3.183    0.001    74%
Group 18    .7911      .8180                      9.967       1.750        -3.459    0.001    70%
Group 19    .6740      .9480                      24.968      2.500        -6.013    0.000    6%
Group 20    .7759      .9356                      24.500      0.000        -6.035    0.000    4%
Group 21    .7809      .9196                      26.435      3.000        -6.009    0.000    8%

For large samples, ML significantly outperforms the proposed model for all groups except group 5, which represents the simple model with a normally distributed covariate and a large sample. The overall rate of correct classification of the proposed model ranges between 67% and 82%, which indicates that the model produces reasonable values of the overall rate of correct classification although it does not outperform ML.

7 Conclusions
A Possibilistic Logistic Regression Model with Minimum Fuzziness is formulated, extending the logistic regression model to a fuzzy logistic regression model in order to overcome the failure of the Maximum Likelihood approach for small sample sizes. The proposed model is evaluated through a simulation study, and the following conclusions have been reached.

Based on the similarity measure:
- The proposed model performs successfully for small datasets. Moreover, it does not fail if the average value of Y is exactly 0 or 1.
- ML outperforms the proposed model for moderately large and reasonably large sample sizes.
- There is no evidence that the number of parameters or the type and distribution of covariates affect the proposed model.

Based on the overall rate of correct classification:
- In general, ML outperforms the proposed model.
- The overall rate of correct classification of the proposed model outperforms ML for most small-sample data sets and is affected by covariate type and model type.
- In general, the proposed model produces reasonable values of the overall rate of correct classification although it does not outperform ML.

References
[1] Bárdossy, A., Bogárdi, I. and Duckstein, L. (1993), "Fuzzy nonlinear regression analysis of dose-response relationships", European J. of Oper. Res., 66, 36-51.
[2] Bárdossy, A. (1990), "Note on fuzzy regression", Fuzzy Sets and Systems, 37, 65-75.
[3] Bull, S. B., Mak, C. and Greenwood, C. M. T. (2002), "A modified score function estimator for multinomial logistic regression in small samples", Computational Statistics & Data Analysis, 39, 57-74.
[4] Dubois, D. and Prade, H. (1980), Fuzzy Sets and Systems: Theory and Applications, Academic Press, New York.
[5] Firth, D. (1993), "Bias reduction of maximum likelihood estimates", Biometrika, 80, 27-38.
[6] Heinze, G. and Schemper, M. (2002), "A solution for the problem of separation in logistic regression", Statistics in Medicine, 21, 2409-2419.
[7] Hirji, K. F., Mehta, C. R. and Patel, N. R. (1987), "Computing distributions for exact logistic regression", Journal of American Statistical Association, 82 (400), 1110-1117.
[8] Hosmer, D. W. and Lemeshow, S. (2000), Applied Logistic Regression, John Wiley & Sons, Inc., USA.
[9] Kim, K. J., Moskowitz, H. and Koksalan, M. (1996), "Fuzzy versus statistical linear regression", European Journal of Operational Research, 92, 417-434.
[10] King, E. N. and Ryan, T. P. (2002), "A preliminary investigation of maximum likelihood logistic regression versus exact logistic regression", The American Statistician, 56 (3), 99-170.
[11] Mehta, C. R., Patel, N. R. and Senchaudhuri, P. (2000), "Efficient Monte Carlo methods for conditional logistic regression", Journal of American Statistical Association, 95 (449), 99-108.
[12] Nasrabadi, M. M. and Nasrabadi, E. (2004), "A mathematical-programming approach to fuzzy linear regression analysis", Applied Mathematics and Computation, 155, 873-881.
[13] Ryan, T. P. (1997), Modern Regression Methods, John Wiley & Sons, Inc., New York.
[14] Tanaka, H., Uejima, S. and Asai, K. (1982), "Linear regression analysis with fuzzy model", IEEE Trans. Systems, Man, Cybernet., 12, 903-907.
[15] Xie, X. (2005), "A Goodness Of Fit Test For Logistic Regression Models With Continuous Predictors", Ph.D. Dissertation, Graduate College, The University of Iowa, Iowa City.
[16] Yang, M.-S., Hung, W. L. and Chang-Chien, S. J. (2005), "On a similarity measure between LR-type fuzzy numbers and its application to database acquisition", International Journal of Intelligent Systems, 20, 1001-1016.
[17] Zadeh, L. A. (1965), "Fuzzy sets", Information and Control, 8, 338-353.
[18] Zadeh, L. A. (1975a), "The concept of linguistic variable and its application to approximate reasoning-I", Information Sciences, 8, 199-249.
[19] Zadeh, L. A. (1975b), "The concept of linguistic variable and its application to approximate reasoning-II", Information Sciences, 8, 301-357.
[20] Zadeh, L. A. (1975c), "The concept of linguistic variable and its application to approximate reasoning-III", Information Sciences, 9, 43-60.
[21] Zadeh, L. A. (1978), "Fuzzy sets as a basis for a theory of possibility", Fuzzy Sets and Systems, 1, 3-28.
[22] Zimmermann, H.-J. (1991), Fuzzy Set Theory And Its Applications, 2nd ed., Kluwer Academic.
