
2014 Fourth International Conference on Advanced Computing & Communication Technologies

Mining Students Data for Performance Prediction


Tripti Mishra
Research Scholar, Mewar University
mishratripti2007@gmail.com

Dr. Dharminder Kumar
Professor & Chairman, CSE Department, G.J. University, Hisar
dr_dk_kumar_02@yahoo.com

Dr. Sangeeta Gupta
Professor, Management and IT, Bhagwan Parushram Institute of Technology, Delhi
sangeet_gju@yahoo.com

978-1-4799-4910-6/14 $31.00 © 2014 IEEE
DOI 10.1109/ACCT.2014.105

Abstract-A country's growth is strongly measured by the quality of its education system. The education sector across the globe has witnessed a sea change in its functioning. Today it is recognized as an industry and, like any other industry, it faces challenges; the major challenges of higher education are the decrease in students' success rate and students leaving a course without completing it. An early prediction of students' failure may help the management provide timely counseling as well as coaching to increase the success rate and student retention. We use different classification techniques to build a performance prediction model based on students' social integration, academic integration, and various emotional skills, which have not been considered so far. Two algorithms, J48 (an implementation of C4.5) and Random Tree, have been applied to the records of MCA students of colleges affiliated to Guru Gobind Singh Indraprastha University to predict third semester performance. Random Tree is found to be more accurate in predicting performance than the J48 algorithm.

Keywords- classification, data mining, prediction

I. INTRODUCTION

Preliminary education adds to a nation's literacy rate, but higher education has a direct impact on the work force provided to industry and hence directly affects the economy.

Many institutions of higher learning have been set up across India. However, the quality of education is judged by the success rate of students and by the extent to which an institute is capable of retaining its students. Predicting students' performance can help identify the students who are at risk of failure, so that management can provide timely help and take essential steps to coach these students to improve their performance.

Data mining techniques have been applied to predict the academic performance of students based on their socio-economic condition and previous academic performance. This paper explores the link between the emotional skills of the students, along with socio-economic and previous academic performance parameters, to predict academic performance using data mining techniques. Emotional skills such as assertion, leadership and stress management are obtained using the standard Emotional Skills Assessment Process (ESAP).

Data mining tasks can be either descriptive or predictive. Descriptive data mining uses techniques such as association rule mining and clustering to find patterns hidden in large data sets and to help in intelligent decision making. Predictive data mining constructs models using rule sets, decision trees, neural nets, support vectors, etc. to predict the class of a new data set.

The objective of this paper is to predict the third semester performance of MCA students. The rationale behind considering the third semester for prediction is the observation that most students drop out of the course after the first year, and that students normally take a year to get integrated into an institute's academic environment. Two decision tree algorithms, J48 and Random Tree, have been used to build the model. The main contribution of this paper is the model comparison, along with finding the impact of various attributes on students' performance.

The remainder of this paper is organized as follows. Section 2 discusses previous work, followed by the experimental setting in Section 3. Section 4 presents the results, and conclusions are discussed in Section 5.

II. LITERATURE SURVEY

The most cited literature survey papers in educational data mining are by Romero and Ventura [4], Baker and Yacef [16], and Romero and Ventura [5], which identify performance prediction as one of the emerging fields of educational data mining. Paris, Affendy and Musthafa [9] have used various subject performance attributes to predict the final CGPA of Bachelor of Computer Science students of a Malaysian university. Various Bayesian classification techniques have been used, and a comparative study suggests that the ensemble method gives the best overall accuracy.

Nghe, Janecek and Haddawy [17] have considered two diverse populations, the large Can Tho University of Vietnam and the Asian Institute of Technology, a small international postgraduate institute, and achieved similar levels of prediction accuracy for both populations.

Cheewaprakobkit [14] considered 1600 student records between 2001 and 2011 at a university in Thailand and applied decision tree and neural network techniques to find the most important factors affecting students' academic achievement. The decision tree proved to be a better classifier than the neural network, with 1.31% higher accuracy. Number of hours worked per semester, an additional English course, number of credits enrolled per semester and marital status of the students are the major factors affecting performance.

The importance of 24 predictor variables, including demography, scores in maths, Turkish, religion and ethics, science and technology, and level determination exams, has been ranked by Sen, Uçar and Delen [1] for predicting Turkish secondary education placement results. Application of artificial neural networks, support vector machines, multiple regression and decision trees indicated that the most important predictor variables are the determination exam, scholarship, number of siblings, success level in Turkish language, etc.

Wook et al. [11] have considered a few personality traits such as motivation to study, interests and learning environment, along with demographic details and previous academic performance, to predict the CGPA of computer science graduates and to identify students who are at risk of failing.

Bidgoli, Koshy, Kortemeyer and Punch [3] applied tree classifiers as well as non-tree classifiers to predict the grades of students enrolled in the online education system Learning Online Network with Computer Assisted Personalized Approach (LON-CAPA), developed at Michigan State University, and found that a combination of multiple classifiers enhanced the prediction accuracy.

Osmanbegovic and Suljic [7] have considered students' attitude towards studying, apart from demographic variables and scores earned at high school, to predict the grade in Business Information for first year students. Entrance exam score, study material and average weekly hours devoted to studying have been found to have maximum impact, while number of household members, distance of residence and gender have been found to have least impact. Naive Bayes is found to be a better classifier than J48.

Chi-squared Automatic Interaction Detector (CHAID) has been used by Ramaswami and Bhaskaran [10] to classify 12th class students of select Tamilnadu schools. Apart from demographic details, students' health, tuition, care of study at home, etc. have been studied. The prediction accuracy obtained was 44.69%, and the potentially influential variables were found to be Xth grade marks, location of school, private tuition, etc.

Bharadwaj and Pal [2] base their experiment only on previous semester marks, class test grade, seminar performance, assignment, attendance and lab work to predict end semester marks. Records of 50 MCA students of the 2007 to 2010 session at Purvanchal University were considered. The paper calculates the split info and gain ratio of each predictor and produces prediction rules.

N. S. Shah [13] has applied various decision tree algorithms (C4.5, Random Forest, BF Tree, REP Tree), functions (Logistic, RBF Network), rules (JRip), Bayes Net and Naive Bayes to categorize students of the BBA program of the University of Karachi. Out of 42 independent variables, the 5 variables having the highest effect in determining performance are considered. Random Forest proved to be the most accurate classifier, ahead of J48 decision tree, BF Tree, REP Tree and JRip rule.

Kabakchieva [6] applies data mining classification techniques to 10330 students with 14 attributes, including personal profile, secondary education score, entrance exam score, admission year, etc. The students are classified into five categories: excellent, very good, good, average and bad. 10-fold cross-validation and percentage split are used for all the classifiers: J48, Bayesian, k-nearest neighbor, OneR and JRip. J48 has been found to be the most suitable of all the classifiers.

Kabra and Bichkar [15] experimented with 346 first year students of an engineering college, collecting their demographic data (category, gender, etc.), past performance data (SSC or 10th marks, HSC or 10+2 exam marks, etc.), address and contact number to predict whether a student will PASS, FAIL or get promoted (when he fails in 3 theory and 2 practical subjects). The J48 algorithm in WEKA produces a prediction model with accuracy 60.46%. The most important attribute in predicting students' performance is found to be HSCCET. Social attributes like category, parents' occupation and living location, and other attributes like gender and medium at secondary level, are found to be less relevant.

Students dropping out of the Open Polytechnic of New Zealand due to failure have been explored by Kovačić [18]. Enrollment data consisting of socio-demographic variables (age, gender, ethnicity, education, work status and disability) and study environment (course programme and course block) of 435 polytechnic students of the Information Systems course were collected. A final label consisting of two categories, PASS (those who completed the course) and FAIL (those who did not complete it), was considered. Feature selection indicated that the most important attributes for prediction are ethnicity, course programme and course block.

III. EXPERIMENTAL SETTING

The major objective of the proposed methodology is to build a classification model that classifies a student's third semester performance as BAVG (<60%), AVG (60% to less than 70%), ABVG (70% to less than 79%) or EXCL (>=80%). The classifier has been built by following the standard process for data mining, which includes business understanding, data understanding, data preparation, modeling and finally application of data mining techniques, which is classification in the present study.

A. Data Understanding

The data of MCA students from various institutions affiliated to GGSIP University was collected through a structured questionnaire. A sample of 250 students was collected with 25 attributes, which included academic integration, social integration and emotional skills, as shown in Table I.

B. Data Preprocessing

The data collected was saved as Excel spreadsheets. The cleaning process required eliminating data with missing values, correcting inconsistent data, identifying outliers, and removing duplicate data. From the total of 250 instances in the raw data, the data cleaning process ended up with 215 instances.

C. Modeling

The open source data mining tool Waikato Environment for Knowledge Analysis (WEKA) has been used for classification. WEKA provides built-in algorithms that can be applied to any data set.

D. Classification

Tree-based methods classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of a particular instance. Each node in the tree specifies a test of some attribute of the instance, and each branch descending from that node corresponds to one of the possible values for this attribute [12]. J48 is a class for generating a pruned or unpruned C4.5 decision tree, while Random Tree constructs a tree that considers K randomly chosen attributes at each node without pruning. We have used cross-validation for testing, as it has been shown to be more suitable for limited datasets and to give a reliable estimate of error [8].
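The idea of sorting an instance down a tree from the root to a leaf can be illustrated with a minimal Python sketch. The tree below is hand-written for illustration only, using attribute names from Table I; it is an assumption, not the tree learned by WEKA in this study, and the `default` fallback label is likewise hypothetical.

```python
# Toy decision tree: inner nodes test one attribute, leaves are class labels.
# Hand-written for illustration; NOT the model produced by J48 or Random Tree.
toy_tree = {
    "attribute": "SECSEM",
    "branches": {
        "EXCL": "EXCL",
        "ABAVG": "ABAVG",
        "BAVG": "BAVG",
        "AVG": {
            "attribute": "LEADERSHIP",
            "branches": {"D": "BAVG", "S": "AVG", "E": "AVG"},
        },
    },
}


def classify(tree, instance, default="AVG"):
    """Sort an instance down the tree from the root to a leaf label."""
    node = tree
    while isinstance(node, dict):
        value = instance.get(node["attribute"])
        if value not in node["branches"]:
            # Unseen or missing attribute value: fall back to a default label.
            return default
        node = node["branches"][value]
    return node


student = {"SECSEM": "AVG", "LEADERSHIP": "E"}
print(classify(toy_tree, student))
```

Each pass through the loop corresponds to one attribute test at a node; the loop stops when a leaf (a plain label) is reached.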

TABLE I. Attributes Description

Attribute Name   Values                                  Description
GENDER           Male, Female                            Gender
FE               Midschool, Inter, Grad, Postgrad        Father's education
ME               Midschool, Inter, Grad, Postgrad        Mother's education
FO               Govtjob, Pvtjob, Business               Father's occupation
MO               Govtjob, Pvtjob, Business, Housewife    Mother's occupation
FI               MIG, HIG, LIG, VHIG                     Annual family income
LOAN             Yes, No                                 Educational loan at any level of education
EARLYLIFE        Metro, City, Village                    Where 15 years of life were spent
MEDIUM           English, Other                          Medium of instruction at school level
TENTH            BLAVG, AVG, ABAVG, EXCL                 % marks in 10th
TWELVTH          BLAVG, AVG, ABAVG, EXCL                 % marks in 12th
GRAD             BLAVG, AVG, ABAVG, EXCL                 % marks in graduation
FIRST_SEM        BLAVG, AVG, ABAVG, EXCL                 % marks in 1st semester of MCA
SECSEM           BLAVG, AVG, ABAVG, EXCL                 % marks in 2nd semester of MCA
THIRDSEM         BLAVG, AVG, ABAVG, EXCL                 % marks in 3rd semester of MCA
GRADDEGTYPE      Regular, Distance                       Type of graduation degree
GRADDEGSTREAM    CS, NCS                                 Graduation degree stream
GAPYEAR          Yes, No                                 Gap year in education
ACADEMICHRS      INSUF, SUF, OPTIMAL                     Hours spent on academic activities
ASSERTION        D, S, E*                                Assertiveness of the student
EMPATHY          D, S, E                                 Empathy of the student
DECISIONMAKING   D, S, E                                 Decision making ability of the student
LEADERSHIP       D, S, E                                 Leadership ability of the student
DRIVE            D, S, E                                 Drive of the student
STRESSMGMT       D, S, E                                 Stress management skill of the student

*D, S, E represent need to develop, need to strengthen and need to enhance, respectively.
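The percentage-valued attributes in Table I are stored as four nominal bands. A minimal Python sketch of that discretization follows, using the thresholds stated in Section III and the label spellings of Table I (Section III abbreviates the same bands as BAVG and ABVG). The function name `grade_band` is hypothetical, and because the stated ranges leave marks between 79% and 80% unassigned, placing them in ABAVG is an assumption of this sketch.

```python
def grade_band(percent):
    """Map a percentage of marks to one of the four nominal bands."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must lie between 0 and 100")
    if percent < 60:
        return "BLAVG"   # below average: < 60%
    if percent < 70:
        return "AVG"     # average: 60% to less than 70%
    if percent < 80:
        return "ABAVG"   # above average; 79-80% placed here by assumption
    return "EXCL"        # excellent: >= 80%


for marks in (55.0, 64.5, 72.0, 91.0):
    print(marks, grade_band(marks))
```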

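The k-fold cross-validation used for testing can be sketched in plain Python as follows. This is an illustrative, unstratified split over instance indices; the function name `k_fold_indices` and the fixed seed are assumptions, and WEKA's own procedure may differ (for example, by stratifying folds on the class attribute).

```python
import random


def k_fold_indices(n_instances, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        # Train on every fold except the one held out for testing.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 215 is the number of instances left after data cleaning.
splits = list(k_fold_indices(215, k=10))
for train, test in splits:
    print(len(train), len(test))
```

Every instance is used for testing exactly once, and each model is evaluated on data it was not trained on, which is why the approach suits a small dataset.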
IV. RESULT AND DISCUSSION

J48 and Random Tree were applied to the data set using 10-fold cross-validation. The summary and the rules obtained by J48 are listed in Fig. 1 and Table II, while the summary and rules of Random Tree are shown in Fig. 2 and Table III. The performance of the algorithms is evaluated on the basis of recall, precision and true positive (TP) rate. Precision is defined as the number of correct positive predictions over the total number of positive predictions, and recall is defined as the number of correct positive predictions over the total number of positive cases. A high precision indicates that the algorithm returns more relevant results than irrelevant ones, and a high recall means that the algorithm returns most of the relevant results. The performance comparison of J48 and Random Tree is shown in Table IV.
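The definitions of precision and recall above can be made concrete with a short Python sketch. The confusion counts are hypothetical and chosen only to show the arithmetic; they are not taken from the experiments. Recall is computed exactly as the true positive rate, which is why those two measures coincide for each class.

```python
def precision_recall(confusion, label):
    """Compute precision and recall for one class.

    confusion[actual][predicted] holds the number of instances of class
    `actual` that the classifier labelled as `predicted`.
    """
    true_positive = confusion[label][label]
    predicted_positive = sum(row[label] for row in confusion.values())
    actual_positive = sum(confusion[label].values())
    precision = true_positive / predicted_positive if predicted_positive else 0.0
    recall = true_positive / actual_positive if actual_positive else 0.0
    return precision, recall


# Hypothetical two-class counts, for illustration only.
confusion = {
    "PASS": {"PASS": 8, "FAIL": 2},
    "FAIL": {"PASS": 1, "FAIL": 9},
}

for cls in confusion:
    p, r = precision_recall(confusion, cls)
    print(cls, round(p, 3), round(r, 3))
```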

Figure 1. J48 Result Summary

TABLE II. Rules Derived from J48

1. If (SECSEM = AVG) and (TWELFTH = AVG) and (MEDIUM = English) and (EARLYLIFE = Metro) and (ME = Grad): ABAVG
2. If (SECSEM = AVG) and (TWELFTH = AVG) and (MEDIUM = English) and (EARLYLIFE = City) and (GRAD = AVG): AVG
3. If (SECSEM = AVG) and (TWELFTH = EXCL): AVG
4. If (SECSEM = AVG) and (TWELFTH = ABAVG): AVG
5. If (SECSEM = AVG) and (TWELFTH = BAVG) and (LEADERSHIP = S) and (GRAD = BAVG): BAVG
6. If (SECSEM = AVG) and (TWELFTH = BAVG) and (LEADERSHIP = E): AVG
7. If (SECSEM = EXCL) and (GRADDEGTYPE = Regular): EXCL
8. If (SECSEM = ABAVG) and (FIRSTSEM = ABAVG) and (TENTH = ABVG): ABAVG
9. If (SECSEM = BAVG) and (ME = Inter): BAVG
Figure 2. Random Tree Result Summary

TABLE III. Rules Derived from Random Tree

1. If (SECSEM = BAVG) and (ASSERTION = D) and (FI = MIG): BAVG
2. If (SECSEM = ABAVG) and (LEADERSHIP = E): ABAVG
3. If (SECSEM = EXCL) and (FE = Grad): EXCL
4. If (SECSEM = AVG) and (FO = Govtjob) and (ACADEMICHRS = SUF) and (DRIVE = S): AVG

TABLE IV. Performance Comparison of J48 and Random Tree

                                   J48                            Random Tree
Class                              TP Rate  Precision  Recall     TP Rate  Precision  Recall
ABVG                               0.792    0.924      0.792      0.948    0.924      0.948
EXCL                               1.000    0.857      1.000      0.952    0.952      0.952
AVG                                0.900    0.851      0.900      0.914    0.970      0.914
BAVG                               0.923    0.923      0.923      1.000    0.929      1.000
Weighted Average                   0.884    0.887      0.884      0.944    0.945      0.944
Correctly Classified Instances     88.3721%                       94.4186%
Incorrectly Classified Instances   11.6279%                       5.5814%
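The accuracy percentages in Table IV are consistent with whole-number counts over the 215 cleaned instances. The Python sketch below checks this arithmetic. The counts of 190 and 203 correctly classified instances are inferred from the reported percentages and are not stated in the paper, so they should be read as an assumption.

```python
TOTAL_INSTANCES = 215  # instances remaining after data cleaning


def percentage(count, total=TOTAL_INSTANCES):
    """Express a count as a percentage of the total, to 4 decimal places."""
    return round(100 * count / total, 4)


# Correct counts inferred from the reported accuracies (assumption).
inferred_correct = {"J48": 190, "Random Tree": 203}

for name, correct in inferred_correct.items():
    incorrect = TOTAL_INSTANCES - correct
    print(name, percentage(correct), percentage(incorrect))
```

Under this reading, Random Tree misclassifies 12 instances against 25 for J48.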

It is evident from the rules derived from J48 and Random Tree that:
• The result of the second semester is the key influencer of the third semester result. This is expected, as the programming subjects of the second semester form the foundation of the programming subjects of the third semester.
• Consistently good academic performance is clearly a good indication of good performance in the third semester too.
• Out of all emotional attributes, leadership and drive of the students have been found to affect performance.
• Socio-economic conditions have only a marginal effect on performance.

The performance of both algorithms is satisfactory; however, higher overall accuracy (94.4186%) was attained by Random Tree as compared to J48 with 88.3721% accuracy. The weighted-average True Positive Rate, Precision and Recall of Random Tree are also higher than those of J48, in line with the corresponding accuracy.

V. CONCLUSIONS AND FUTURE DIRECTIONS

Today the academic success of students of any professional institution has become a major issue for the management. An early prediction of students at risk of poor performance helps the management take timely action to improve their performance through extra coaching and counseling. This paper focused on identifying attributes that influenced students' third semester performance. The effect of emotional quotient parameters on performance has been established. Random Tree gave higher prediction accuracy than J48. Future research will include professional courses of B.Tech as well as the development of a decision support system to help authorities identify weak students and take timely measures.

REFERENCES

[1] B. Sen, E. Uçar and D. Delen, "Predicting and Analyzing Secondary Education Placement-Test Scores: A Data Mining Approach," Expert Systems with Applications, Volume 39, Issue 10, 2012.

[2] B. K. Bharadwaj and S. Pal, "Mining Educational Data to Analyze Students' Performance," International Journal of Advanced Computer Science and Applications, Vol. 2, No. 6, 2011.

[3] B. M. Bidgoli, D. Koshy, G. Kortemeyer and W. F. Punch, "Predicting Student Performance: An Application of Data Mining Methods with an Educational Web-Based System," 33rd ASEE/IEEE Frontiers in Education Conference, 2003.

[4] C. Romero and S. Ventura, "Educational Data Mining: A Survey from 1995 to 2005," Expert Systems with Applications, no. 33, pp. 135-146, 2007.

[5] C. Romero and S. Ventura, "Educational Data Mining: A Review of the State of the Art," IEEE Transactions on Systems, Man, and Cybernetics, Vol. 40, No. 6, 2010.

[6] D. Kabakchieva, "Predicting Student Performance by Using Data Mining Methods for Classification," Cybernetics and Information Technologies, Volume 13, 2013.

[7] E. Osmanbegovic and M. Suljic, "Data Mining Approach for Prediction of Student Performance," Economic Review - Journal of Economics & Business, Vol. 10, Issue 1, 2012.

[8] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. San Francisco: Morgan Kaufmann, 2005.

[9] I. H. M. Paris, L. S. Affendy and N. Musthafa, "Improving Performance Prediction Using Voting Technique in Data Mining," World Academy of Science, Engineering and Technology, Vol. 38, 2010.

[10] M. Ramaswami and R. Bhaskaran, "A CHAID Based Performance Prediction Model in Educational Data Mining," International Journal of Computer Science, Vol. 7, Issue 1, No. 1, 2010.

[11] M. Wook, Y. H. Yahaya, N. Wahab, M. R. M. Isa, N. F. Awang and H. Y. Seong, "Predicting NDUM Student's Academic Performance Using Data Mining Techniques," Paper presented at International Conference of Computer and Electrical Engineering, ICCEE, December 28-30, 2009.

[12] T. Mitchell, Machine Learning. New York: McGraw Hill, 1997.

[13] N. S. Shah, "Predicting Factors that Affect Students' Academic Performance by Using Data Mining," Pakistan Business Review, January 2012.

[14] P. Cheewaprakobkit, "Study of Factor Analysis Affecting Achievements of Undergraduate," Paper presented at International Multi Conference of Engineers and Computer Scientists, IMECS, Hong Kong, HK, March 13-15, 2013.

[15] R. R. Kabra and R. R. Bichkar, "Performance Prediction of Engineering Students Using Decision Trees," International Journal of Computer Applications, Volume 36, No. 11, 2011.

[16] R. S. J. D. Baker and K. Yacef, "The State of Educational Data Mining in 2009: A Review and Future Visions," Journal of Educational Data Mining, Vol. 1, No. 1, 2009.

[17] N. T. Nghe, P. Janecek and P. Haddawy, "A Comparative Analysis of Techniques for Predicting Academic Performance," Paper presented at 37th ASEE/IEEE Frontiers in Education Conference - Global Engineering: Knowledge Without Borders, Opportunities Without Passports, Milwaukee, WI, October 10-13, 2007.

[18] Z. J. Kovačić, "Early Prediction of Student Success: Mining Students Enrolment Data," Paper presented at Proceedings of Informing Science & IT Education Conference (InSITE), Cassino, Italy, June 19-24, 2010.
