
Study Project Report

On

Multiple Object Tracking in Real Time


Prepared for

Mr. Ashish Chittora


Dept. of Electrical Engineering, BITS Pilani KK Birla Goa Campus

Prepared by

Kishan Kumar Gupta 2016A8PS0406G

In partial fulfillment of the requirement of the course


INSTR F266

STUDY PROJECT
ACKNOWLEDGEMENT
I would like to express my deepest gratitude to Ashish Chittora Sir, who
continuously conveyed a spirit of enthusiasm in his teaching. He was always
willing to meet me at any time and gave me constructive feedback on my
study. Sir's constant involvement kept me motivated to achieve my objective. It
has been a great privilege to work with him.

ABSTRACT
Multiple object tracking consists of detecting and identifying objects in video. In
some applications, such as robotics and surveillance, the tracking must be
performed in real time. This is challenging because it requires the algorithm to
run as fast as the frame rate of the video. Today's top-performing tracking methods
run at only a few frames per second and thus cannot be used in real time. Further,
when the speed of a tracker is reported, it is common not to include the time it
takes to detect objects. I propose that one way of running a method in real time is
to not look at every frame, but to skip frames so that the effective frame rate of
the video matches that of the tracking method. However, I expect that this will lead
to decreased performance. In this project, I studied a multiple object tracker,
following the tracking-by-detection paradigm, as an extension of an existing
method.
Contents
1. Introduction
2. Methods
3. Object Detection
3.1. Background Subtraction
3.1.1. GMM
3.1.2. Shadow Detection and Removal
3.1.3. Occlusion Detection
4. Object Tracking
4.1. Traditional Methods
4.1.1. Centroid Tracking
4.2. Modern Methods
4.2.1. MeanShift
4.2.2. CamShift
5. Person Re-identification
6. Conclusion
7. References
1. Introduction
When a video contains multiple moving objects that we wish to track, we refer to this as
multiple object tracking. Object detection is still an unsolved problem, and the most powerful
methods are limited by their speed. Adding tracking capabilities on top of the detector
usually slows the algorithm down further. Because of this, multiple object tracking is difficult
to do in real time, since even the best algorithms can only analyse a few frames per second,
even on powerful hardware. For such algorithms to run in real time, it would be
necessary to skip multiple frames in order to prevent an ever-increasing delay. Object
tracking is an area within computer vision which has many practical applications, such as
video surveillance, human-computer interaction, and robot navigation. Surveillance is the
monitoring of behaviour, activities, or other changing information, usually of people and
often in a surreptitious manner. Video surveillance is commonly used for event detection and
human identification, but detecting an event or tracking an object is not as easy as it sounds.
Object tracking is a well-studied and, in many cases, complex problem. It can be
summarized as the task of finding the position of an object in every frame of a video.
The ability to track an object depends on multiple factors, such as knowledge about the
target object, the type of parameters being tracked, and the type of video showing the
object. Video-based multiple vehicle tracking, for example, is essential for many vision-based
intelligent transportation system applications. Although many tracking methods have been
studied, challenges remain: tracked vehicles are lost when occlusion occurs through
vehicles overlapping or touching. Partial occlusion commonly changes the features of a
vehicle, such as its size, texture, and colour, as seen from the camera. Three kinds of
objects can occlude a vehicle: other moving objects, background scene objects, and other
vehicles. Such occlusions corrupt the tracking trajectories, because the tracked vehicles
are no longer the same objects. There are several important steps towards effective object
tracking, including the choice of model to represent the object and a tracking method
suitable for the task. This project aims to survey methods for multiple object tracking,
their benefits and disadvantages, and which methods are more suitable for certain
applications and environments.

2. Methods
There are three basic steps in video analysis: object detection, object tracking,
and recognition of object activities by analysing their tracks.
3. Object Detection
Object detection and tracking play an important role in many pattern recognition and
computer vision applications, such as autonomous robot navigation, surveillance, and
vehicle navigation. An object detection mechanism is used when the object first appears,
or in every frame of the video. In order to reduce the number of false detections and
increase the accuracy rate, some object detection methods use temporal information
computed by analysing a sequence of frames. Object detection can be done by various
techniques, such as temporal differencing, frame differencing, optical flow, and background
subtraction.

3.1. Background Subtraction


Background subtraction is a popular technique for segmenting out the objects of interest
in a frame. It involves subtracting a background image that contains no foreground objects
of interest from the current image that contains the objects. The areas of the image plane
where there is a significant difference between these images indicate the pixel locations of
the moving objects. These objects, represented as groups of pixels, are then separated
from the background image using a thresholding technique.
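
As an illustration, the following minimal OpenCV sketch implements this idea by frame
differencing. The file name video.mp4, the assumption that its first frame contains only
the background, and the threshold value of 30 are all illustrative choices, not values from
this project:

```python
import cv2

# A minimal sketch of background subtraction by frame differencing.
# Assumes the first frame of "video.mp4" contains only the background.
cap = cv2.VideoCapture("video.mp4")
ok, background = cap.read()
background_gray = cv2.cvtColor(background, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Absolute difference between the current frame and the background image
    diff = cv2.absdiff(gray, background_gray)
    # Threshold the difference: pixels that changed enough become foreground
    _, mask = cv2.threshold(diff, 30, 255, cv2.THRESH_BINARY)
    cv2.imshow("foreground mask", mask)
    if cv2.waitKey(30) & 0xFF == 27:  # Esc to quit
        break

cap.release()
cv2.destroyAllWindows()
```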

3.1.1. Gaussian Mixture Model


The Gaussian Mixture Model (GMM) is one of the most popular techniques for constructing
a background model for segmenting moving objects from the background. The GMM
technique assigns a number of Gaussian distributions to each pixel to estimate a reference
frame. If there is no variation in a pixel's values, all the Gaussian distributions approximate
the same values; in that case only one distribution matters and the others are unimportant.
On the other hand, if the pixel values change continuously, a constant number of Gaussians
is not always sufficient to estimate the background model, and it becomes necessary to
determine an appropriate number of Gaussians. A Gaussian Mixture Model is a parametric
probability density function, expressed as a weighted sum of Gaussian component densities.
The GMM is used as a parametric model that estimates the probability distribution function
based on various object features. In computer vision, it is difficult to detect multiple moving
objects under severe occlusion and dynamic scene changes. To implement the background
subtraction method for identifying moving objects in each portion of the video frames,
background modeling is always the first step; it can be done with the Gaussian Mixture
Model. To get the desired result, every incoming frame of the video sequence is subtracted
from a reference background model frame, and the difference is compared with a threshold
value to segment the image into foreground and background. Finally, any remaining random
pixels that were detected as foreground can be eliminated to improve the foreground mask.

Algorithm of Gaussian Mixture Model


To give a better understanding of the algorithm used for background subtraction, the
following steps were adopted to achieve the desired results (a minimal code sketch follows
the list):
a) Firstly, compare each input pixel with the means of the associated elements. If the
value of a pixel is close enough to a chosen element's mean, that element is considered
the matched element. To count as a match, the difference between the pixel and the
mean must be less than the element's standard deviation scaled by a factor D.
b) Secondly, update the Gaussian weight, mean, and standard deviation (variance) to
account for the newly obtained pixel value. For non-matching elements, the weight 'w'
decreases while the mean and standard deviation remain constant. How fast they
change depends on the learning rate 'p'.
c) Thirdly, evaluate which elements are part of the background model. For this, a
threshold is applied to the element weight 'w'.
d) Finally, identify as foreground the pixels that do not match any element determined
to be background.
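
OpenCV ships a GMM-based background subtractor (MOG2) that follows essentially these
steps. The sketch below is one way to apply it; the file name and parameter values are
assumptions for illustration, not values used in this project:

```python
import cv2

# MOG2 implements the per-pixel GMM background model described above:
# history controls the learning window, varThreshold plays the role of the
# scaled-deviation match test (factor D), detectShadows marks shadows in gray.
subtractor = cv2.createBackgroundSubtractorMOG2(
    history=500, varThreshold=16, detectShadows=True)

cap = cv2.VideoCapture("video.mp4")  # assumed input file
while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)   # 255 = foreground, 127 = shadow
    mask = cv2.medianBlur(mask, 5)   # remove isolated foreground pixels
    cv2.imshow("GMM foreground mask", mask)
    if cv2.waitKey(30) & 0xFF == 27:
        break
cap.release()
cv2.destroyAllWindows()
```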

3.1.2. Shadow Detection and Removal


Once the foreground object is identified, each foreground pixel is checked to determine
whether it is part of a shadow or of the object. This step is necessary because the
shadows of some background objects may merge with the foreground object, which makes
the tracking task more complicated. For a pixel (x, y) in a candidate region, the normalized
cross-covariance (NCC) over a neighbouring region B(x, y) is computed, and the pixel is
classified as shadow when

NCC(x, y) ≥ L_ncc      (4)

where L_ncc is a fixed threshold. If L_ncc is low, several foreground pixels corresponding
to moving objects may be misclassified as shadows. On the other hand, selecting a larger
value for L_ncc results in fewer false positives, but pixels belonging to actual shadows may
not be detected [25].
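
As a sketch of this test, the NCC over a small neighbourhood can be computed as below.
NCC definitions vary slightly between papers; the normalized-correlation form and the
neighbourhood radius used here are assumptions for illustration:

```python
import numpy as np

def ncc(bg: np.ndarray, frame: np.ndarray, x: int, y: int, r: int = 2) -> float:
    """Normalized cross-covariance between background and current frame over
    a (2r+1)x(2r+1) neighbourhood B(x, y); inputs are grayscale arrays."""
    B = bg[y - r:y + r + 1, x - r:x + r + 1].astype(np.float64)
    F = frame[y - r:y + r + 1, x - r:x + r + 1].astype(np.float64)
    num = np.sum(B * F)
    den = np.sqrt(np.sum(B * B) * np.sum(F * F))
    return num / den if den > 0 else 0.0

# A foreground pixel is labeled shadow when its texture matches the background:
# is_shadow = ncc(bg, frame, x, y) >= L_ncc   (L_ncc is the fixed threshold)
```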

3.1.3. Occlusion Detection


When two moving objects come close to each other, the background-subtracted frame
shows them as a single object. This situation is called occlusion, and it creates problems
when tracking the two objects. In this approach, an algorithm is proposed for detecting
occlusion that reports the frame number where the occlusion has taken place. A sudden
increase in the number of objects in a frame indicates either the entry of new objects into
the frame or the separation of previously occluded objects. Conversely, a sudden reduction
in the number of objects indicates either the occlusion of two or more objects or the exit of
objects from the frame.
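A minimal sketch of this counting heuristic is given below; foreground_masks stands in for
the per-frame binary masks produced by the background subtraction step, and the
minimum blob area is an assumed noise filter (the contour call uses the OpenCV 4 return
signature):

```python
import cv2

def count_objects(mask, min_area: float = 200.0) -> int:
    """Count foreground blobs in a binary mask, ignoring tiny noise regions."""
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return sum(1 for c in contours if cv2.contourArea(c) >= min_area)

# Flag frames where the blob count drops (possible occlusion or exit)
# or rises (possible separation or entry), as described above.
prev_count = None
for frame_no, mask in enumerate(foreground_masks):  # assumed mask iterator
    count = count_objects(mask)
    if prev_count is not None and count < prev_count:
        print(f"frame {frame_no}: {prev_count} -> {count}, possible occlusion")
    elif prev_count is not None and count > prev_count:
        print(f"frame {frame_no}: {prev_count} -> {count}, possible separation")
    prev_count = count
```
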
4. Object Tracking
The goal of object tracking is to generate the trajectory of an object over time by
discovering its exact position in every frame of the video sequence. I have studied several
object tracking algorithms (MeanShift, CamShift). The tracking algorithm is composed of
three modules: a module that selects the object in the first frame of the sequence, the
MeanShift module, and the CamShift module. The selection module selects the position of
the object in the first frame, extracting the initialization parameters: the position, size,
width, and length of the search window around the object in the first frame of the sequence.

4.1. Traditional Methods


4.1.1. Centroid Tracking
The primary assumption of the centroid tracking algorithm is that a given object may move
between subsequent frames, but the distance between its centroids in consecutive frames
will be smaller than all other distances between objects.
It relies on the Euclidean distance between
(1) existing object centroids (i.e., objects the centroid tracker has already seen before) and
(2) new object centroids between subsequent frames in a video.

When we detect an object, we enclose it with a bounding box using our previous methods.
Once we have the bounding box coordinates, we compute the centroid, or more simply, the
centre (x, y)-coordinates of the bounding box. For every subsequent frame in our video
stream we compute object centroids; we first need to determine whether we can associate
the new object centroids with the old ones. To accomplish this, we compute the Euclidean
distance between each pair of existing object centroids (from frame t-1) and input object
centroids (from the current frame). If the distance between two centroids is less than a
threshold, we assume they belong to the same object. However, this method fails when
two objects overlap.
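
The following sketch shows the nearest-centroid association described above. The
distance threshold max_dist is an assumed parameter, and the greedy matching is one
simple choice among several:

```python
import numpy as np
from scipy.spatial import distance

def centroids(boxes):
    """Centres (x, y) of bounding boxes given as (x1, y1, x2, y2)."""
    return np.array([((x1 + x2) / 2.0, (y1 + y2) / 2.0)
                     for (x1, y1, x2, y2) in boxes])

def match_centroids(old, new, max_dist=50.0):
    """Greedy nearest-centroid association between consecutive frames.
    Returns (old_index, new_index) pairs; unmatched new centroids are
    treated as new objects."""
    if len(old) == 0 or len(new) == 0:
        return []
    D = distance.cdist(old, new)          # pairwise Euclidean distances
    matches, used_old, used_new = [], set(), set()
    # Visit candidate pairs in order of increasing distance
    for i, j in sorted(np.ndindex(D.shape), key=lambda ij: D[ij]):
        if i in used_old or j in used_new or D[i, j] > max_dist:
            continue
        matches.append((i, j))
        used_old.add(i)
        used_new.add(j)
    return matches
```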

4.2. Modern Methods


4.2.1. MeanShift - An Object Tracking Algorithm
The meanshift algorithm is a non-parametric method. It provides accurate localization and
efficient matching without an expensive exhaustive search. The size of the search window
is fixed. It is an iterative process: first compute the meanshift value for the current point
position, then move the point to its meanshift value as the new position, then recompute
the meanshift, and repeat until a convergence condition is fulfilled. For a frame, we use the
distribution of grey levels, which gives a description of the shape, and we converge on the
centre of mass of the object, calculated by means of moments. The flow chart of meanshift
in figure 2 describes the steps of the algorithm. The algorithm converges after a number of
iterations, at which point the subject is located within the image sequence.

The intuition behind meanshift is simple. Consider a set of points (this can be a pixel
distribution such as a histogram backprojection). You are given a small window (perhaps
a circle), and you have to move that window to the area of maximum pixel density (or
maximum number of points).

How to Calculate the Mean Shift Algorithm


1. Choose a search window size.
2. Choose the initial location of the search window.
3. Compute the mean location in the search window.
4. Center the search window at the mean location computed in Step 3.
5. Repeat Steps 3 and 4 until convergence (or until the mean location moves less than a
preset threshold).
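
These steps map directly onto OpenCV's meanShift call. The sketch below tracks a colour
histogram via backprojection; the file name and the initial window coordinates are
assumptions standing in for the output of the detection stage:

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("video.mp4")
ok, frame = cap.read()
x, y, w, h = 300, 200, 100, 120          # assumed initial search window
roi = frame[y:y + h, x:x + w]

# Hue histogram of the target; backprojection turns each frame into the
# probability image over which the mean location (step 3) is computed.
hsv_roi = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
roi_hist = cv2.calcHist([hsv_roi], [0], None, [180], [0, 180])
cv2.normalize(roi_hist, roi_hist, 0, 255, cv2.NORM_MINMAX)

# Stop after 10 iterations or when the window moves less than 1 pixel (step 5).
term_crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
track_window = (x, y, w, h)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    prob = cv2.calcBackProject([hsv], [0], roi_hist, [0, 180], 1)
    _, track_window = cv2.meanShift(prob, track_window, term_crit)
    x, y, w, h = track_window
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("meanshift", frame)
    if cv2.waitKey(30) & 0xFF == 27:
        break
cap.release()
cv2.destroyAllWindows()
```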

4.2.2. CamShift - An Object Tracking Algorithm

The CamShift algorithm is based on the principles of the MeanShift algorithm. CamShift is
able to handle dynamic distributions by adjusting the size of the search window for the next
frame based on the zeroth moment of the current image distribution. In contrast to
MeanShift, which is designed for static distributions, CamShift is designed for distributions
that evolve dynamically. It adjusts the size of the search window using image moments,
which allows the algorithm to anticipate the movement of objects and quickly track the
object in the next frame. Even during fast movements of an object, CamShift is still capable
of tracking it well. This matters when objects in video sequences are tracked and the object
moves such that the size and location of its probability distribution change over time. The
initial search window is determined by a detection algorithm or by software dedicated to
video processing. The CamShift algorithm calls upon MeanShift to find the target centre in
the probability distribution image, but it also computes the orientation of the principal axis
and the dimensions of the probability distribution from the moments for x and y. For a
probability image I(x, y) over the search window, the zeroth moment is
M00 = Σx Σy I(x, y), the first moments are M10 = Σx Σy x·I(x, y) and
M01 = Σx Σy y·I(x, y), and the window centre is (xc, yc) = (M10/M00, M01/M00); the
second moments determine the orientation and size of the tracked region.

How to Calculate the Continuously Adaptive Mean Shift Algorithm


1. Choose the initial location of the search window.
2. Mean Shift as above (one or many iterations); store the zeroth moment.
3. Set the search window size equal to a function of the zeroth moment found in Step 2.
4. Repeat Steps 2 and 3 until convergence (mean location moves less than a preset
threshold).
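Continuing the meanshift sketch above (and reusing its cap, roi_hist, track_window, and
term_crit setup), only the tracking call changes. cv2.CamShift additionally returns a rotated
rectangle whose size and orientation adapt to the target, as in steps 2 and 3:

```python
import cv2
import numpy as np

# Same loop as the meanshift sketch; cv2.CamShift resizes and rotates the
# window from the moments of the backprojection instead of keeping it fixed.
while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    prob = cv2.calcBackProject([hsv], [0], roi_hist, [0, 180], 1)
    rot_rect, track_window = cv2.CamShift(prob, track_window, term_crit)
    pts = cv2.boxPoints(rot_rect).astype(np.int32)  # corners of adapted window
    cv2.polylines(frame, [pts], True, (0, 255, 0), 2)
    cv2.imshow("camshift", frame)
    if cv2.waitKey(30) & 0xFF == 27:
        break
```
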
5. Person Re-identification in Videos

Version 0

Approach
The model proposed in the paper uses a CNN to perform person re-identification in static
images. Given two images, the model predicts whether they show the same person or not. I
have used the TensorFlow object detection API to adapt this model for videos.
My code processes each frame independently. It runs the TensorFlow object detector on
each frame and obtains the bounding boxes for the people in that frame. Using the obtained
bounding boxes, the people are cropped out and sent to the person re-identification model.
The people in the current frame are compared pairwise with all the previously detected
unique people, which determines whether each person was previously detected or not. If a
person does not match any of the previous people, then he/she is added to the list as a
new unique person. His/her image is also added to the database of previously detected
people so that people in future frames can be compared with this person.
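
In outline, the version 0 loop looks like the sketch below. detect_people() and
same_person() are hypothetical wrappers standing in for the TensorFlow object-detection
call and the re-identification CNN; they are not actual library functions:

```python
# Sketch of the version 0 pipeline: detect, crop, and compare pairwise.
known_people = []   # crops of every unique person seen so far

for frame in video_frames:                      # assumed frame iterator
    boxes = detect_people(frame)                # bounding boxes for this frame
    for (x1, y1, x2, y2) in boxes:
        crop = frame[y1:y2, x1:x2]
        # Pairwise comparison with every previously seen unique person
        if not any(same_person(crop, p) for p in known_people):
            known_people.append(crop)           # first appearance: register
```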

Results

Time taken for 240 frames: 15 hours

Version 1

Approach
I implemented a tracking mechanism using intersection-over-union. Previously, the model
would iterate over every frame; in each frame, the object detection module would return
bounding boxes for the people, and each detected person in each frame was compared with
all the previously detected people. This led to a large number of comparisons.
In this version, the model iterates over all frames and in each frame obtains the person
bounding boxes, but it does not re-identify each person, only the uncertain ones. The model
stores a list of bounding box positions from the previous frame. It matches the bounding
boxes from the current frame to previous-frame boxes that are very close in position. In
other words, if the current bounding box is very close to a bounding box from the previous
frame, it is the same person, who has just moved slightly between frames. This way the
model identifies each person only once (when they are first seen) using the neural network;
for subsequent frames, it just tracks the person.
To determine that two bounding boxes from consecutive frames are very close, it uses
intersection-over-union.

IOU between 2 boxes = Area of the intersection of boxes / Area of the union of boxes
IOU is a measure of the overlap between 2 boxes. If the boxes overlap a lot, IOU is close to
1. If the 2 boxes don't overlap at all, IOU is 0.
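
A minimal helper that computes this measure for boxes given as (x1, y1, x2, y2) corner
coordinates might look like this:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```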

So, for each bounding box in the current frame, the model tries to find a bounding box from
the previous frame that greatly overlaps with it (IOU > 0.9). If such a box is found, the
model assigns the previous box's person to the new bounding box. In this way, people are
identified without actually running the neural network on them. If a bounding box cannot be
matched with any previous box, it means that this person just entered the frame and was
not there in the previous frame. In this case, the neural network is run on the person to
determine whether he has appeared before or is an entirely new person, in which case he
is added to the list of unique people detected. The new person is compared with all the
other previously detected people. If a match is found, it means the person had appeared
before but then disappeared for an intermediate period. If no match is found, the person is
appearing in the video for the first time and is added to the list of unique people.
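
Putting the two together, the per-frame decision can be sketched as follows, reusing the
iou() helper above. run_reid() is a hypothetical stand-in for the re-identification network
call, and boxes are assumed to be coordinate tuples:

```python
# Per-frame decision of version 1: track by IOU when possible, otherwise
# fall back to the re-identification network.
IOU_THRESHOLD = 0.9

def assign_ids(curr_boxes, prev_ids, frame):
    """prev_ids maps the last frame's boxes (tuples) to person IDs."""
    assignments = {}
    for box in curr_boxes:
        # Previous box that overlaps this one the most (iou() defined above)
        best = max(prev_ids, key=lambda pb: iou(box, pb), default=None)
        if best is not None and iou(box, best) > IOU_THRESHOLD:
            assignments[box] = prev_ids[best]        # tracked: reuse the ID
        else:
            assignments[box] = run_reid(frame, box)  # new or re-entering person
    return assignments
```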

The following two screenshots are only a few frames apart. Hence, for every person, the
bounding box in the next frame is very close in position to the bounding box in the previous
frame, and the current and previous bounding boxes overlap strongly. This overlap is
measured by intersection-over-union: for two consecutive frames, the two bounding boxes
of the same person are extremely close and hence have an IOU close to 1.
Code optimization. Another improvement I made was in the code that calls the person
re-identification model. In version 0, each time my code called the person re-id module,
the entire neural network was built again. This repeated construction of the network also
slowed things down. In version 1, I encapsulated the person re-id model inside a class. In
my main program, I created an object of this class once, and I moved the network-building
code inside the constructor so the network was built only once, during the creation of the
object. This also increased the execution speed.
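
A sketch of this refactor is shown below; load_reid_network() and the predict call are
hypothetical placeholders for the actual TensorFlow model-building and inference code:

```python
# The network is built once, in the constructor, instead of on every call.
class PersonReID:
    def __init__(self, model_path):
        # Expensive graph construction happens exactly once, here
        self.network = load_reid_network(model_path)

    def same_person(self, image_a, image_b):
        # Each call now only runs inference on the already-built network
        return self.network.predict(image_a, image_b)

reid = PersonReID("reid_model/")   # created once, in the main program
```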

Results

Time taken for 240 frames: 10 hours 20 min

False positives and negatives while tracking


I added some code to the version 1 model to measure how well the tracking and
re-identification models were working. Ideally, the re-id model should be called only when a
person cannot be tracked back to the last frame. This is supposed to happen only when the
person has just entered the video in the current frame and was not present in the previous
frame. For all subsequent frames he should be tracked for as long as he is in the video.

The new code for assessing the performance works like this. Every time a person is
tracked, the re-id model is also run on the person to verify that the tracked person is in fact
the same; this tells us whether the tracking was correct. Also, if a person is not tracked but
re-identified instead, the current and previous bounding boxes of the person are compared
to determine whether they are close enough; this tells us whether the person should have
been tracked instead.

The people who are not being tracked should then be re-identified, but that is also not
happening. This may mean that the re-identification model is not working accurately.

I wrote further code to measure the false positives and negatives. It shows that after 130
frames, 93% of all bounding boxes were identified correctly (with version 0 as ground
truth). However, only 8% of these were tracked. This is bad, because most of the boxes
should have been tracked and only a few should have been sent to the re-identification
neural net. This means that the tracking is not working correctly.

Version 2: Only tracking


This time I implemented only a tracking mechanism, with no re-identification. Every
detected person is compared with the k = 25 previous frames to find the most overlapping
bounding box, and using intersection-over-union the person is tracked across frames. If a
new person enters, he cannot be matched to anyone in the previous frames, and hence a
new folder is created for him.
Accuracy: compared with the baseline, the percentage of bounding boxes correctly
identified is 7%. However, from the output images it is clear that most of the people are
being tracked accurately. This means that the baseline is wrong: the baseline is
re-identifying the wrong people, so it does not match the tracking output, which is the
correct one.

Time: 5 min for 240 frames

References
1. Samuel Murray, "Real-Time Multiple Object Tracking: A Study on the Importance of Speed".
2. B. Tharanidevi, R. Vadivu, K. B. Sethupathy, "Moving Object Tracking Distance and Velocity Determination based on
Background Subtraction Algorithm", IOSR Journal of Electronics and Communication Engineering (IOSR-JECE).
3. Imran Khan Pathan, Chetan Chauhan, "A Survey on Moving Object Detection and Tracking Methods", International
Journal of Computer Science and Information Technologies.
4. Deep Learning in Computer Vision - Coursera.
5. Afef Salhi and Ameni Yengui Jammoussi, "Object tracking system using Camshift, Meanshift and Kalman filter".
6. Rohini Chavan, "Multiple Object Detection using GMM Technique and Tracking using Kalman Filter", International
Journal of Computer Applications (0975 - 8887).
7. B. Y. Lee, L. H. Liew, W. S. Cheah and Y. C. Wang, "Occlusion handling in videos object tracking: A survey",
published under licence by IOP Publishing Ltd.
8. Multi-object tracking with dlib: https://www.pyimagesearch.com/2018/10/29/multi-object-tracking-with-dlib/
9. Detecting Cars Using Gaussian Mixture Models:
https://in.mathworks.com/help/vision/examples/detecting-cars-using-gaussian-mixture-models.html
