Overview
This project served as an introduction to image classification using a convolutional neural
network (CNN). The CNN was an 18-layer cascaded network composed of basic image
processing operations: normalization, convolution, rectified linear unit (ReLU), maxpool,
full-connect, and softmax. The motivation of the project was to show that complex image
processing tasks can in fact be performed by applying many elementary operations many
times. Image normalization scaled the input image values into the range -0.5 to 0.5.
Convolution was performed using pretrained filter data supplied with the project. The
rectified linear unit thresholded the images so that negative values were set to 0. Maxpool
downsampled the image by a factor of two by selecting the maximum intensity value in
every 2x2 pixel block. Full-connect assigned all the processed information a value for each
possible image class, and softmax converted these values into probabilities. The class with
the highest probability is taken as the predicted class of the image.
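The project code itself is MATLAB; as an illustration, the normalization step can be sketched in Python/NumPy (the function name and the assumption of 8-bit input values are ours, not from the project):

```python
import numpy as np

def normalize(image):
    """Scale 8-bit pixel values into the range [-0.5, 0.5], as the
    normalization layer does (divisor of 255 assumed for uint8 input)."""
    return image.astype(float) / 255.0 - 0.5
```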
The image set used for this project was cifar10testdata.mat, which contains 10,000 images
with their corresponding image class from 1 to 10. In order, the classes are airplane (1),
automobile (2), bird (3), cat (4), deer (5), dog (6), frog (7), horse (8), ship (9), and truck (10).
Each image is 32x32x3: 32x32 pixels with red, green, and blue intensity channels.
This project passed all images in cifar10testdata into our 18-layer CNN and compared the
predicted class to the true class. It assessed the quality of the CNN by measuring the
accuracy of the predicted classes and by observing how the network responded to
user-defined lamp, truck, and bird images outside of the cifar10testdata imageset. The
CNN's accuracy is 43.71% for correctly guessing the class from the cifar10testdata
imageset. For guessing the correct class within the top-3 most probable classes, the CNN's
accuracy is 78.64%. For the additional images we decided to test the CNN on, it was able to
guess the truck image correctly, but it incorrectly guessed the bird image as an airplane. The
lamp image was used to evaluate the output when the CNN did not have any pretrained data
for lamp objects.
layer1 is then used to compute layer2, where we convolve the image data from layer1 with
the filter banks for this stage (filterbanks{2}). We used two nested for loops to go through
the filter banks and layers: the outer loop uses a variable i in the range 1 to 10 (one per
output array) and the inner loop uses a variable k in the range 1 to 3 (the RGB channels).
We chose nested for loops for the convolution steps because it is a traversal method we are
both familiar with, so we could both understand and write code that functions correctly. i is
used to access the 10 arrays of the 32x32x10 layer-2 output and the 3x3x10 filter bank
arrays. k is used to access the 3 arrays of the 32x32x3 input. The inner for loop calculates
the convolution for each channel and sums them, and the outer for loop then adds the bias
value to the calculated sum.
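The i/k loop structure above can be sketched in Python/NumPy (our MATLAB code uses conv2-style filtering; this sketch uses zero-padded correlation for brevity, and all function names and shapes are illustrative assumptions):

```python
import numpy as np

def conv2_same(x, k):
    """2-D correlation with zero padding; output is the same size as x.
    (Correlation and convolution coincide for symmetric kernels.)"""
    kh, kw = k.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    return np.array([[np.sum(xp[r:r + kh, c:c + kw] * k)
                      for c in range(x.shape[1])]
                     for r in range(x.shape[0])])

def conv_layer(image, filters, biases):
    """Layer-2-style convolution: the outer loop (i) builds one output map
    per filter, the inner loop (k) accumulates contributions from each
    input channel, and the bias is added after the channel sum."""
    H, W, C = image.shape
    D = filters.shape[3]
    out = np.zeros((H, W, D))
    for i in range(D):
        for k in range(C):
            out[:, :, i] += conv2_same(image[:, :, k], filters[:, :, k, i])
        out[:, :, i] += biases[i]
    return out
```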
The result from layer 2 is then used in the ReLU phase (activation function) that is layer 3,
where we simply take the values calculated in layer 2 and set any negative values to zero.
We do this using the max function, where each value within the 10 arrays is compared with
0 to see which is larger.
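The same elementwise comparison can be sketched in Python/NumPy (function name is ours):

```python
import numpy as np

def relu(x):
    """Elementwise max with zero: negative activations are clamped to 0,
    mirroring the max-function approach used in layer 3."""
    return np.maximum(x, 0)
```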
Layers 4 and 5 follow, and their layout is very similar to layers 2 and 3. In layer 6, however,
we calculate the maxpool. In this project, each of the 10 arrays from the previous layer is
downscaled by a factor of 2 every time maxpool is called. To calculate the downscaled
arrays, we used three nested for loops. The outer for loop uses a variable l which cycles
through each of the 10 arrays from the input. The next for loop cycles through every other
row in the array, using the variable i with values 1:2:(M-1), where M is the height of the
array. The innermost for loop cycles through every other column in the array using the
variable j with values 1:2:(N-1), where N is the width of the array. Inside the innermost for
loop, a 2x2 block is taken from the input array and its maximum is calculated using the max
function. This process starts at the upper left 2x2 block and works row by row, ending at the
lower right 2x2 block. Each iteration of this loop reads four values and outputs a single
maximum; in layer 6, for instance, each of the 10 arrays is reduced from 32x32 to 16x16.
We again chose nested for loops to calculate these values because the approach is easily
understood and easily portable to the maxpool calculations done in other layers.
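The three-loop maxpool can be sketched in Python/NumPy (zero-based indices replace MATLAB's 1:2:(M-1) stepping; the function name is ours):

```python
import numpy as np

def maxpool(layer):
    """Downsample each HxWxD stack by 2 using non-overlapping 2x2 blocks:
    l cycles over the D arrays, i over every other row, j over every other
    column, taking the max of each 2x2 block."""
    H, W, D = layer.shape
    out = np.zeros((H // 2, W // 2, D))
    for l in range(D):
        for i in range(0, H - 1, 2):
            for j in range(0, W - 1, 2):
                out[i // 2, j // 2, l] = layer[i:i + 2, j:j + 2, l].max()
    return out
```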
Layers 7-16 use the techniques defined for previous layers, but layer 17 is different as it
implements fullconnect. We took a similar approach to previous layers by using two nested
for loops. The outer for loop uses a variable l which takes values from 1 to 10 and is used to
access each of the output's 10 values as well as each of the 10 sets of filter arrays in
filterbanks. The inner for loop uses a variable k which also takes values 1 to 10 and is used
to access each of the 10 filter arrays within a set, as well as each of the 10 input arrays from
layer 16. Inside the inner for loop, we multiply each filter elementwise with the corresponding
input array and sum the result (in code: sum(sum(filterbanks{17}(:, :, k,
l).*layer16(:, :, k)))). The outer for loop then adds the corresponding bias value to the
summed total.
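The layer-17 full-connect loops can be sketched in Python/NumPy (shapes and names are illustrative; in the project, filters has shape 8x8x10x10 after the earlier maxpools, but any consistent shapes work the same way):

```python
import numpy as np

def fullconnect(x, filters, biases):
    """Full-connect as in layer 17: for each of D output classes (l), sum
    the elementwise products of the input maps (k) with that class's
    filter stack, then add the bias."""
    D = filters.shape[3]
    out = np.zeros(D)
    for l in range(D):
        for k in range(filters.shape[2]):
            out[l] += np.sum(filters[:, :, k, l] * x[:, :, k])
        out[l] += biases[l]
    return out
```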
The final layer then converts the 10 values from layer 17 into a probability for each class. To
emulate the softmax equation

    p_i = exp(x_i - α) / Σ_j exp(x_j - α),  where α = max_j x_j,

we set α to the maximum of the values calculated in layer 17 (for numerical stability). From
there, we used a for loop to go through each of the 10 values and calculate their probability.
The output from this layer is then returned to the main function.
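The stable softmax described above looks like this in Python/NumPy (vectorized here rather than looped; the function name is ours):

```python
import numpy as np

def softmax(x):
    """Stable softmax: subtract the maximum (alpha) before exponentiating
    so that exp never overflows, then normalize to probabilities."""
    e = np.exp(x - np.max(x))
    return e / e.sum()
```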
In the main function, we create a table for our confusion matrix. From there, we test each of
the 10,000 images using a for loop. main calls convnn for each image, and the returned
value is split into its probability and its predicted class. The predicted class is then compared
to the actual class and recorded in the confusion matrix based on the actual class index and
predicted class index. Images whose two indices match correspond to a diagonal value in
the table (i.e. (1,1), (2,2), ..., (10,10)). Once each of the 10,000 images is classified, we find
the accuracy by summing the correctly identified (diagonal) values and dividing by the total
of 10,000.
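The accuracy calculation from the confusion matrix reduces to a trace divided by the total count, sketched here in Python/NumPy (function name is ours):

```python
import numpy as np

def accuracy(confusion):
    """Fraction correct: sum of the diagonal (correctly classified) counts
    divided by the total number of classified images."""
    return np.trace(confusion) / confusion.sum()
```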
The flowchart showing how these subroutines interact can be seen below:
For our project, we passed each image in the imageset cifar10testdata.mat into our CNN
and saved the results into the tableAccuracy.mat variable. This variable contains ten 10x10
confusion matrices, which correspond to the top-k classifications. The total run time of our
CNN over the imageset was 34.6 minutes. The submitted code loads the tableAccuracy.mat
variable to calculate the accuracy without needing to rerun the network for another 34.6
minutes.
Experimental Observations
Based upon our intermediate and final results for the test case, photo number 490, our code
seems to be working as it should. The difference between our results and the test results at
each layer, given by layerResults, was zero or very close to zero (on the order of 10^-14 to
10^-18) for each pixel. Some examples of the intermediate results are shown below, where
the console output is our layer result (e.g. layer1, layer2, etc.) minus the test/expected layer
result (e.g. layerResult{1}, layerResult{2}, etc.):
(Console output differences for Layers 1 through 8.)
We additionally checked the images from several of the intermediate layers to further verify
the functionality. These images are shown below:
Each convolution yields an altered version of the image from the layer before it, highlighting
a specific feature or lack thereof. Additionally, maxpool creates an image that is ¼ the size of
the layer before it as we expected.
The output of running image 490 does in fact match the output of the debugging test (see
image below).
The results of the debugging test are shown in the following image. We can see that class 1,
or airplane, has the highest probability, which matches the command window output above.
The following figure shows the accuracy curve for the top-k classes. As previously
mentioned, the accuracy of the CNN is 43.71% when guessing the correct class from the
highest probability alone. The CNN correctly guesses the class within the top two
probabilities with an accuracy of 65.91%. As expected, the CNN reaches 100% accuracy
when considering the top-10 probabilities, since there are only 10 classes.
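A single point on the top-k curve checks whether the true class appears among the k most probable classes; a sketch in Python/NumPy (function name and zero-based class indices are ours):

```python
import numpy as np

def topk_correct(probs, true_class, k):
    """True if true_class is among the k highest-probability classes;
    averaging this over all images gives one point on the top-k curve."""
    topk = np.argsort(probs)[::-1][:k]  # class indices, best first
    return true_class in topk
```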
Figure 2 - Additional images to evaluate the effectiveness of the CNN.
Each image is 256x256 pixels with 3 color channels. For these images to be used in the
CNN, they were downsampled to 32x32 pixels with 3 color channels. To do so, the
downsampled image was created by selecting every 8th pixel of the original image after
convolving it with a Gaussian filter with a standard deviation of 2. The Gaussian smoothing
helps prevent high-frequency (aliasing) artifacts when downsampling. The standard
deviation of 2 was chosen after some trial and error. In this project, all standard deviations
lower than 4 gave consistent outputs; standard deviations greater than 4 blurred the image
too much, and classifications became inconsistent as the standard deviation increased
beyond 4.
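The blur-then-subsample step can be sketched in Python, using SciPy's gaussian_filter in place of our MATLAB filtering (function name and the per-channel loop are ours):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def shrink(image, sigma=2, step=8):
    """Blur each color channel with a Gaussian (sigma=2 as chosen in the
    report), then keep every 8th pixel: 256x256x3 -> 32x32x3."""
    out = np.empty((image.shape[0] // step, image.shape[1] // step, 3))
    for c in range(3):
        out[:, :, c] = gaussian_filter(image[:, :, c], sigma)[::step, ::step]
    return out
```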
Using the configuration listed above, the results are as follows.
Finding the class for the bird you picked
The estimated class is airplane with probability 0.3042
Finding the class for the truck you picked
The estimated class is truck with probability 0.3055
Finding the class for the lamp you picked
The estimated class is bird with probability 0.3318
From these results, it appears that the bird is classified as an airplane, probably due to the
presence of the sky in the bird image. As most of the training data for the airplane class
consists of regions of sky (or blue), input images containing a lot of blue may correlate
highly with the airplane training data. The truck correlates well with the truck training data
and is classified correctly. The lamp, however, has no training data, so it is classified as
whichever object it has the most similarity to in the training data.
Using a standard deviation of 4, the results are as follows.
Finding the class for the bird you picked
The estimated class is airplane with probability 0.3721
Finding the class for the truck you picked
The estimated class is airplane with probability 0.3118
Finding the class for the lamp you picked
The estimated class is frog with probability 0.3168
Here we see that the classifications are not correct for all images. The truck image is now
classified as an airplane, the bird remains classified as an airplane, and the lamp is now
classified as a frog. With a standard deviation of 10, the others remain the same, but the
truck is now classified as a ship.
For further testing, it is best to use an image that contains the object of interest as large as
possible, with only that object in the image. To account for an unknown object in an input
image, we could report an unknown class when many of the known objects' probabilities are
too close to each other. If, for example, the class probabilities approach a uniform
distribution, the output could be "unable to detect class". Another enhancement could be
adding a threshold for declaring what an object is: if no class reaches at least 0.25
probability, the classification could be reported as too ambiguous. If an unknown image is
supplied, as with the lamp, this would catch the case where the object's class does not exist
within the training data.
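The proposed rejection threshold could be sketched in Python/NumPy (this is a hypothetical enhancement, not part of the submitted code; function name, return values, and the 0.25 default are ours):

```python
import numpy as np

def classify_with_reject(probs, threshold=0.25):
    """If no class reaches the threshold, declare the result too
    ambiguous; otherwise return the most probable class index."""
    if probs.max() < threshold:
        return "unable to detect class"
    return int(np.argmax(probs))
```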
Documentation of Roles on Project
MATLAB Coding for CNN: Peter & Matthew
Overview: Matthew
Exploration: Matthew