
4/25/2019 Adult Income Analysis

Prediction of Earning Based on Employment Data

Muhammad Bilal
03-243181-008
Abstract:
The prominent inequality of wealth and income is a huge concern, especially in the United States. The
likelihood of diminishing poverty is one valid reason to reduce the world's surging level of economic
inequality. The principle of universal moral equality ensures sustainable development and improves
the economic stability of a nation. Governments in different countries have been trying their best to
address this problem and provide an optimal solution. This study aims to show the use of machine
learning and data mining techniques in addressing the income inequality problem. The UCI Adult
dataset has been used for this purpose. Classification has been performed to predict whether a
person's yearly income in the US falls into the category of greater than 50K dollars or less than or
equal to 50K dollars, based on a certain set of attributes. We applied different preprocessing
techniques, i.e. discretization and Principal Component Analysis (PCA), and classification techniques,
i.e. the Naïve Bayes algorithm, the J48 decision tree, and logistic regression. J48 and logistic regression
achieved the highest accuracy of 93% – 94%, while the Naïve Bayes algorithm gave an accuracy of 83%,
which can serve as the minimum benchmark for any classification algorithm on this dataset.
Table of Contents
1. Introduction
1.1. Background
1.2. Scope
 Naïve Bayes Algorithm
 Decision Tree Algorithm
 Logistic Regression
2. Literature Review
3. Problem Statement
4. Dataset Acquisition & Description
4.1. Training Set
4.2. Testing Set
5. Data Preprocessing
5.1. Data Preparation for Removing Outliers and Missing Values
5.2. Outliers
5.3. Missing Values
5.5. Discretization
5.6. Principal Component Analysis
7. Classification
7.1. Naïve Bayes Algorithm
7.2. J48 Algorithm
7.3. Logistic Regression
8. Performance Experiments & Post Processing
8.1. Training Set
8.2. K-fold Cross Validation
9. Conclusion
10. References
1. Introduction:
1.1. Background
Society produces vast amounts of raw data that record facts, and patterns can be discovered from that
data. Without techniques to extract information from it, this raw data is useless. The process of mining
previously unknown and potentially useful information from large amounts of data is called Data
Mining. Classification, association rules, and sequence analysis are major components of Data Mining.

Classification involves discovering rules that assign instances to predefined classes. In this
procedure, a training data set is analyzed and a set of rules is generated to classify the testing data
set.

Association rule mining involves finding rules that imply certain association relationships among a set
of attributes in the given data. In this process, a set of association rules is generated at multiple levels
of abstraction from relevant sets of attributes in the data.

In sequential analysis, patterns that occur in sequence are discovered.

1.2. Scope
The scope of this research is limited to classification. The following data classification methods are
used in this research.

 Naïve Bayes Algorithm

Naïve Bayes is a classification technique based on Bayes' theorem with an assumption of independence
among predictors. In simple terms, a Naïve Bayes classifier assumes that the presence of a specific
feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered
to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on
each other or on the existence of the other features, all of these properties independently
contribute to the probability that this fruit is an apple, and that is why it is known as 'Naïve'.

 Decision Tree Algorithm

The Decision Tree algorithm belongs to the family of supervised learning algorithms. It can be used
for classification as well as regression problems. Generally, decision trees are used to create a
training model that can predict the class or value of target variables by learning decision rules from
the training data.

i. C4.5 Algorithm
This algorithm constructs a decision tree for the training data by recursively splitting that data. The
decision tree is grown using a depth-first approach. The algorithm considers all possible tests that can
split the data and selects the test that gives the highest information gain. C4.5 removes the
bias towards many-valued attributes that the ID3 algorithm exhibits. C4.5 also allows the resulting
tree to be pruned. Pruning increases the error rate on the training data, but on the other hand it
reduces the error on unseen data. The algorithm also deals with missing values, noisy data, and
numeric attributes.

ii. J48 Algorithm

The J48 algorithm is an open-source implementation of the C4.5 algorithm that is supplied with the WEKA toolset.

 Logistic Regression
Logistic regression is a technique borrowed by machine learning from the field of statistics. It is the
go-to method for binary classification problems. Logistic regression is used in various fields, including
machine learning, most medical fields, and the social sciences. For example, it may be used to predict
the risk of developing a given disease (e.g. diabetes or coronary heart disease) based on observed
characteristics of the patient (age, sex, body mass index, results of various blood tests, etc.).
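The independence assumption behind Naïve Bayes can be made concrete with a toy version of the fruit example above. This is a minimal sketch with invented features and counts, not the WEKA implementation used in this report:

```python
# Toy Naive Bayes sketch (illustrative only): per-feature likelihoods are
# multiplied as if the features were independent given the class.
from collections import defaultdict

def train_nb(samples):
    """samples: list of (feature_dict, label). Returns class counts and
    per-class feature-value counts."""
    priors = defaultdict(int)
    counts = defaultdict(lambda: defaultdict(int))  # counts[label][(feat, val)]
    for feats, label in samples:
        priors[label] += 1
        for f, v in feats.items():
            counts[label][(f, v)] += 1
    return priors, counts

def predict_nb(priors, counts, feats):
    total = sum(priors.values())
    best, best_p = None, -1.0
    for label, n in priors.items():
        p = n / total
        for f, v in feats.items():
            # independence assumption: multiply per-feature likelihoods,
            # with add-one (Laplace) smoothing to avoid zero probabilities
            p *= (counts[label][(f, v)] + 1) / (n + 2)
        if p > best_p:
            best, best_p = label, p
    return best

data = [
    ({"color": "red", "shape": "round"}, "apple"),
    ({"color": "red", "shape": "round"}, "apple"),
    ({"color": "yellow", "shape": "long"}, "banana"),
    ({"color": "yellow", "shape": "long"}, "banana"),
]
priors, counts = train_nb(data)
print(predict_nb(priors, counts, {"color": "red", "shape": "round"}))  # apple
```

Even though color and shape may in reality be correlated, the classifier treats them as independent evidence, which is exactly the "naïve" simplification described above.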
2. Literature Review:
Several attempts have been made in the past by researchers to predict income levels using machine
learning models.

 Chockalingam et al. [1] explored and analyzed the Adult dataset and used several machine
learning models such as logistic regression, stepwise logistic regression, Naïve Bayes, decision
trees, extra trees, k-nearest neighbors, SVM, gradient boosting, and six configurations of
artificial neural networks. They also drew a comparative analysis of their predictive performance.
 Bekena [2] implemented the Random Forest classifier algorithm to predict the income levels of
individuals.
 Topiwalla [3] made use of complex algorithms such as XGBoost, Random Forest, and
stacking of models for the prediction task, including a logistic stack on XGBoost and
an SVM stack on logistic regression, to scale up the accuracy.
 Lazar [4] applied Principal Component Analysis (PCA) and Support Vector Machine
methods to generate and evaluate income predictions based on the
Current Population Survey provided by the U.S. Census Bureau.
 Deepajothi and Selvarajan [5] applied Bayesian networks, decision tree induction, lazy
classifiers, and rule-based learning techniques to the Adult dataset and presented a
comparative analysis of their predictive performance.
 Lemon et al. [6] attempted to identify the important features in the data that could
reduce the complexity of the different machine learning models used in classification tasks.
 Haojun Zhu [7] used logistic regression as a statistical modeling tool along with four
different machine learning techniques, namely neural networks, classification and regression trees,
Random Forest, and Support Vector Machines, for predicting income levels.
3. Problem Statement
There are many arguments about how to become a member of the high-income social level in the US, but
no conclusion has been reached. Some people believe education is the key, while others insist that
capital gain is the only way to become richer. At the same time, the middle classes in the
emerging countries want to know how the middle class in the developed countries gained
their fortune. As a group of international students, we want to know the crucial factors for reaching the
higher income level in the US, based on the data from the UCI Machine Learning Repository
(http://archive.ics.uci.edu/ml/datasets/Census+Income).

We try to predict high-level income using different data mining techniques and thereby address
the above problem.

4. Dataset Acquisition & Description


The dataset used in this research is called "Adult" and is used to predict whether an individual's annual
income exceeds 50K US dollars.

The dataset comprises:

 48842 instances (32561 in the training set and 16281 in the testing set)
 14 attributes (8 nominal and 6 continuous numerical)
 2 classes
 Missing values (7%)

This dataset is taken from the Data Extraction System (DES) of the US Census Bureau:
http://www.census.gov/ftp/pub/DES/www/welcome.html
This dataset can be downloaded from the following:
ftp://ftp.ics.uci.edu/pub/machine-learning-databases/adult/
The following table provides details and descriptions of the 14 attributes that will be used to train and
test the classifiers.

Attribute (Data Type): Values / Range

Age (numerical, years): 16 to 150
Work class (nominal): Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked
Fnlwgt (numerical, final sampling weight): 000000 to 999999
Education (nominal): Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool
Education-num (numerical, years): 0 to 17
Marital-status (nominal): Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse
Occupation (nominal): Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces
Relationship (nominal): Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
Race (nominal): White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
Gender / Sex (nominal): Male, Female
Capital-gain (numerical, US dollars): 0 to 50K
Capital-loss (numerical, US dollars): 0 to 50K
Hours-per-week (numerical, hours): 0 to 168
Native-country (nominal): United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands
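For illustration, a single record in the published comma-separated format can be parsed as sketched below. The field order follows the attribute table above; the sample line mirrors the format of the adult.data file.

```python
# Sketch of parsing one record of the Adult dataset (14 attributes + class).
FIELDS = ["age", "workclass", "fnlwgt", "education", "education-num",
          "marital-status", "occupation", "relationship", "race", "sex",
          "capital-gain", "capital-loss", "hours-per-week", "native-country",
          "income"]

def parse_record(line):
    values = [v.strip() for v in line.split(",")]
    record = dict(zip(FIELDS, values))
    # numeric attributes are cast to int; "?" marks a missing value and is kept
    for f in ("age", "fnlwgt", "education-num", "capital-gain",
              "capital-loss", "hours-per-week"):
        if record[f] != "?":
            record[f] = int(record[f])
    return record

row = parse_record("39, State-gov, 77516, Bachelors, 13, Never-married, "
                   "Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, "
                   "United-States, <=50K")
print(row["age"], row["income"])  # 39 <=50K
```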
4.1. Training Set
The training set is the set of independent instances that are used to build the classifier. Generally,
the larger the training dataset, the better the classifier. Two thirds of the whole dataset is used for
training. Due to the large size of the training set, we expect to obtain a good classifier.

4.2. Testing Set


The testing set is the set of independent instances that have played no part in the construction of the
classifier. The larger the testing set, the more accurate the error estimate. One third of the whole
dataset is used for testing.
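The two-thirds / one-third split described above can be sketched as follows; the fixed random seed is an assumption added here for reproducibility.

```python
# Minimal sketch of a shuffled two-thirds / one-third train-test split.
import random

def split_dataset(instances, train_frac=2/3, seed=42):
    shuffled = instances[:]                # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed -> reproducible split
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

data = list(range(48842))                  # stand-in for the 48842 instances
train, test = split_dataset(data)
print(len(train), len(test))               # 32561 16281
```

Note that a two-thirds cut of 48842 instances reproduces exactly the 32561 / 16281 split of the published dataset.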

5. Data Preprocessing
5.1. Data Preparation for Removing Outliers and Missing Values
The original dataset contained some missing values and outliers. We converted the dataset into a
Microsoft Excel spreadsheet and separated it into 14 columns, one per attribute. We used WEKA to
clean outliers and missing values.

5.2. Outliers
Data objects that vary considerably and/or fall outside the expected or accepted range can be considered
outliers. Outliers can be caused by measurement or execution errors. Outliers can worsen the
performance of data mining algorithms and may lead them to inaccurate results.

Mining outliers from a dataset can be split into two main sub-problems:

 Defining what data can be considered inconsistent in the dataset.
 Finding an efficient method to mine the data found to be inconsistent.

For this purpose we use the InterquartileRange filter from WEKA. This filter uses the IQR formula to
designate some values as outliers or extreme values. Any value outside the range
[Q1 − k(Q3 − Q1), Q3 + k(Q3 − Q1)] is considered an outlier, where k is a constant and IQR = Q3 − Q1.

By default WEKA uses k = 3 to flag a value as an outlier, and 2k = 6 to flag it as an extreme value
(extreme outlier). The formula guarantees that at least 50% of the values are considered non-outliers.

Note that this filter can also be applied to a whole data frame. When applied this way, it will consider
as an outlier any instance of the data frame that has at least one attribute value flagged as an
outlier for that variable.
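The IQR rule above can be sketched in a few lines of Python, using WEKA's default k = 3 for outliers and 2k = 6 for extreme values; the age values below are invented, with one implanted outlier.

```python
# Sketch of the interquartile-range outlier rule described above.
from statistics import quantiles

def iqr_flags(values, k=3.0):
    q1, _, q3 = quantiles(values, n=4)     # quartiles (Python 3.8+)
    iqr = q3 - q1
    def outside(kk):
        # values below Q1 - kk*IQR or above Q3 + kk*IQR
        return [v for v in values if v < q1 - kk * iqr or v > q3 + kk * iqr]
    return outside(k), outside(2 * k)      # (outliers, extreme values)

ages = [25, 28, 31, 33, 35, 38, 41, 44, 47, 250]   # 250 is an injected outlier
outliers, extremes = iqr_flags(ages)
print(outliers, extremes)                  # [250] [250]
```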
Figure 1: Before Applying Outlier Detection

Outliers were detected using WEKA's weka.filters.unsupervised.attribute.InterquartileRange filter.

Figure 2: After Applying Outlier Algorithm


5.3. Missing Values
A major part of the data cleaning process involves detecting and correcting missing values in the
dataset. Missing values may occur for various reasons, such as malfunctioning measurement
instruments or changes in experimental design during data collection.

The following table shows the different methods used to fix missing values in the
dataset for this research.

Missing Values Correction Method: Description

Replace with "Missing" attribute: Takes an input file with missing values denoted by "?" and replaces them with a "Missing" value. Used for both nominal and numeric attributes.
ReplaceNomMostFrequent: Replaces all missing nominal values with the most frequent value of that attribute. Used for nominal attributes.
ReplaceNomMostFrequentSameClass: Replaces all missing nominal values with the most frequent value of the same class as the missing value. Used for nominal attributes.
Replacedbygloballyvalues(): Replaces all missing numerical values with a global value entered by the user. Used for numeric attributes.
Replacedbyaverage_of_present_feature(): Replaces all missing numerical values with the average of the present values of that feature. Used for numeric attributes.
Replacedbyaverage_of_corresponding_class(): Replaces all missing numerical values with the average of the corresponding class. Used for numeric attributes.
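Two of the strategies above, the most-frequent value for nominal attributes and the feature mean for numeric ones, can be sketched as follows. The function names and sample columns are illustrative, not the actual implementation used in this research.

```python
# Sketch of missing-value replacement: "?" marks a missing entry.
from collections import Counter

def replace_nominal_most_frequent(column):
    # most frequent non-missing value (the mode) fills the gaps
    mode = Counter(v for v in column if v != "?").most_common(1)[0][0]
    return [mode if v == "?" else v for v in column]

def replace_numeric_mean(column):
    # mean of the present values fills the gaps
    present = [v for v in column if v != "?"]
    mean = sum(present) / len(present)
    return [mean if v == "?" else v for v in column]

workclass = ["Private", "?", "Private", "State-gov", "?"]
hours = [40, "?", 35, 45]
print(replace_nominal_most_frequent(workclass))
print(replace_numeric_mean(hours))
```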
5.5. Discretization
Data discretization converts a large number of data values into a smaller number of intervals, which
makes data evaluation and data management much easier. We therefore use the Discretize filter from
WEKA to discretize our dataset.
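A minimal sketch of unsupervised equal-width discretization, the default behaviour of WEKA's Discretize filter; the bin count and age values below are illustrative assumptions.

```python
# Sketch of equal-width binning: the value range is split into n_bins
# intervals of equal width, and each value is mapped to its bin index.
def equal_width_bins(values, n_bins):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    def bin_of(v):
        # the maximum value falls in the last bin, not a new one
        return min(int((v - lo) / width), n_bins - 1)
    return [bin_of(v) for v in values]

ages = [17, 25, 33, 41, 49, 57, 65, 90]
print(equal_width_bins(ages, 4))   # [0, 0, 0, 1, 1, 2, 2, 3]
```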

Figure 3: Before Discretization

Figure 4: After Applying Discretization


5.6. Principal Component Analysis
Principal Component Analysis (PCA) is a simple yet popular and useful linear transformation technique
used in numerous applications, such as stock market prediction and the analysis of gene expression
data. For this dataset we use PCA to select the top-ranked attributes.
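For two variables, PCA reduces to an eigendecomposition of the 2×2 covariance matrix, which has a closed form. The sketch below uses invented toy data; it is not the WEKA attribute ranker used in this report.

```python
# Sketch of 2D PCA: the first principal component is the eigenvector of the
# sample covariance matrix with the largest eigenvalue.
import math

def pca_2d(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # sample covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # eigenvalues of a symmetric 2x2 matrix via trace and determinant
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    disc = math.sqrt(tr * tr / 4 - det)
    l1, l2 = tr / 2 + disc, tr / 2 - disc
    # unit eigenvector for the top eigenvalue l1
    if abs(sxy) > 1e-12:
        v = (sxy, l1 - sxx)
    else:
        v = (1.0, 0.0) if sxx >= syy else (0.0, 1.0)
    norm = math.hypot(*v)
    return (l1, l2), (v[0] / norm, v[1] / norm)

xs = [1, 2, 3, 4, 5]
ys = [1.1, 1.9, 3.2, 3.9, 5.1]   # nearly y = x, so PC1 is close to (0.707, 0.707)
(l1, l2), pc1 = pca_2d(xs, ys)
print(round(l1, 3), [round(c, 3) for c in pc1])
```

Ranking attributes by the variance each component explains (l1 versus l2 here) is the idea behind selecting the top-ranked attributes.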

Figure 5: Ranker Attributes selected from PCA

The following graphs illustrate the components of Figure 5, which we obtained by using PCA.

Figure 6: After applying PCA


7. Classification
7.1. Naïve Bayes Algorithm
The first algorithm used was the Naïve Bayes algorithm, which produces a very simple probabilistic
classification. We use the default parameters provided in WEKA to classify the testing data.

Figure 7: Summary of Naive Bayes Algorithm

Using the Naïve Bayes algorithm, 13849 (83.39%) instances are classified correctly and 2757 (16.60%)
instances are classified incorrectly.
7.2. J48 algorithm
J48 builds a decision tree model by analyzing the training data and uses this model to classify the
testing data. We use the default parameters provided in WEKA. For example, the confidence
threshold for pruning is set to 0.25, the minimum number of instances in a leaf is 2, and
reduced-error pruning is set to false by default.

Figure 8: Tree Generated by J48 Pruned Tree

Using the J48 algorithm, 15481 (93.22%) instances are classified correctly and 1125 (6.77%) instances
are classified incorrectly. The confusion matrix for this algorithm is as follows.

Figure 9: Confusion Matrix for J48 algorithm


7.3. Logistic Regression
We also used logistic regression. We use the default parameters provided in WEKA to classify the
testing data.

Figure 10: Summary of Logistic Regression

Using logistic regression, 15668 (94.22%) instances are classified correctly and 938 (5.77%) instances
are classified incorrectly.
8. Performance Experiments & Post Processing
The three different learning schemes used are Naïve Bayes, the J48 decision tree, and logistic
regression.

8.1. Training Set


In this section we compare the performance of these schemes by the accuracy of each algorithm.

Algorithm            Correctly Classified   Incorrectly Classified   Accuracy (%)
Naïve Bayes          13849                  2757                     83
J48 Decision Tree    15481                  1125                     93
Logistic Regression  15668                  938                      94

We can see that J48 and logistic regression perform better than Naïve Bayes on the given
dataset.

8.2. K fold cross validation


We saw the performance of each algorithm using the training and testing data. It is possible that the
sample used for training (or testing) may not be representative. One way to remove the bias
caused by any particular sample is to run many iterations with different random samples. To reduce
the effect of any uneven representation of the training or testing data set, we use the stratified
tenfold cross-validation method.

The data is divided randomly into ten parts. Each part is held out in turn, and the learning scheme is
trained on the remaining nine parts; its error rate is then calculated on the holdout set. The procedure
is executed a total of ten times, and the error estimates from each portion are averaged together to
yield an overall error estimate.
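The fold construction described above can be sketched as follows. Note that this simple version is unstratified; stratified cross-validation additionally preserves the class proportions in each fold.

```python
# Sketch of k-fold splitting: each fold is held out once while the other
# k-1 folds form the training set.
def k_fold_indices(n, k=10):
    folds = [list(range(i, n, k)) for i in range(k)]  # round-robin assignment
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

splits = list(k_fold_indices(20, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 18 2
```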

In order to analyze the performance of the different algorithms, we ran tenfold cross-validation. The
results are recorded below:

Number of runs J48 (%) Naïve Bayes (%) Logistic Regression (%)
1 93.1172 83.203 94.0123
2 92.9884 83.1518 94.0401
3 93.1254 83.2931 94.0021
4 93.1148 83.3331 94.0232
5 93.2141 83.3331 94.0999
6 92.9945 83.3421 94.1032
7 93.1358 83.3431 94.1066
8 93.1854 83.1518 94.1100
9 93.1175 83.1538 94.1323
10 93.2117 83.1395 94.1464
MEAN 93.1204 83.24444 94.07761
9. Conclusion
We obtained the raw data and cleaned the outliers and missing values in the training and testing sets.
We then used the training dataset to train classifiers to predict whether a person makes over 50K a
year, and used the testing dataset to evaluate those predictions. Based on the experimental results,
we compared the accuracy and performance of several data mining algorithms. The algorithms used
were the Naïve Bayes algorithm, the J48 decision tree, and logistic regression (LR).

Our results show that Naïve Bayes did not reach the accuracy achieved by J48 and LR, as we aimed to
find attributes that accurately determine whether an individual's income exceeds 50K US dollars. It is
nevertheless useful for setting a benchmark performance before progressing toward more
sophisticated learning algorithms.
10. References
[1] V. Chockalingam, S. Shah, and R. Shaw, "Income Classification using Adult Census Data."
[2] S. M. Bekena, "Using decision tree classifier to predict income levels," Munich Personal RePEc
Archive, 2017.
[3] M. Topiwalla, "Machine Learning on UCI Adult data Set Using Various Classifier Algorithms and
Scaling Up the Accuracy Using Extreme Gradient Boosting."
[4] A. Lazar, "Income prediction via support vector machine," in 2004 International Conference on
Machine Learning and Applications, 2004, Proceedings, pp. 143–149.
[5] S. Deepajothi and S. Selvarajan, "A Comparative Study of Classification Techniques on Adult
Data Set."
[6] C. L. A, C. Z. A, and K. M. A, "No Title," 1994.
[7] "A Comparative Study of Classification Techniques in Data Mining Algorithms," Int. J. Mod.
Trends Eng. Res., vol. 4, no. 7, pp. 58–63, 2017.
