STUDY PROJECT
ACKNOWLEDGEMENT
I would like to express my deepest gratitude to Ashish Chittora Sir, who
continuously conveyed a spirit of enthusiasm in his teaching. He was always
willing to meet me at any time and gave me constructive feedback on my
study. Sir's constant involvement kept me motivated to achieve my objective. It
has been a great privilege to work with him.
ABSTRACT
Multiple object tracking consists of detecting and identifying objects in video. In
some applications, such as robotics and surveillance, the tracking must be
performed in real-time. This is challenging because it requires the algorithm to
run as fast as the frame rate of the video. Today's top-performing tracking methods
run at only a few frames per second, and thus cannot be used in real-time. Further,
when determining the speed of a tracker, it is common not to include the time it
takes to detect objects. I propose that one way of running a method in real-time is
not to look at every frame, but to skip frames so that the video has the same
effective frame rate as the tracking method. However, I believe that this will lead to
decreased performance. In this project, I studied a multiple object tracker,
following the tracking-by-detection paradigm, as an extension of an existing
method.
Contents
1. Introduction
2. Methods
3. Object Detection
3.1. Background Subtraction
3.1.1. GMM
4. Object Tracking
4.1. Traditional Methods
4.1.1. Centroid Tracking
4.2. Modern Methods
4.2.1. MeanShift
4.2.2. CamShift
5. Person Re-identification
6. Conclusion
7. References
1. Introduction
When a video contains multiple moving objects that we wish to track, we refer to this as
multiple object tracking. Object detection is still an unsolved problem, and the most powerful
methods are limited by their speed. Adding tracking capabilities on top of the detector
usually slows down the algorithm further. Because of this, multiple object tracking is difficult
to do in real-time: the best algorithms can analyse only a few frames per second at best,
even on powerful hardware. For such algorithms to run in real-time, it would be necessary to
skip multiple frames in order to prevent an ever-increasing delay. Object tracking is an area
of computer vision with many practical applications, such as video surveillance,
human-computer interaction, and robot navigation. Surveillance is the monitoring of
behaviour, activities, or other changing information, usually of people and often in a
surreptitious manner. Video surveillance is commonly used for event detection and human
identification, but detecting events and tracking objects is not as easy as it may seem.
Object tracking is a well-studied and in many cases complex problem. It can be
summarized as the task of finding the position of an object in every frame. The ability to
track an object in a video depends on multiple factors, such as knowledge about the target
object, the type of parameters being tracked, and the type of video showing the object.
Video-based multiple vehicle tracking is essential for many vision-based intelligent
transportation system applications. Although many tracking methods have been studied,
several challenges remain. For example, vehicles will be missed when occlusion occurs
through vehicles overlapping or touching. Partial occlusion commonly causes the features of
a vehicle, such as its size, texture, and colors, to change from the viewpoint of the camera.
Three kinds of objects can occlude a vehicle: other moving objects, background scene
objects, and other vehicles. Such occlusions make the tracking trajectories meaningless
when the tracked vehicles are not the same one. There are several important steps towards
effective object tracking, including the choice of model to represent the object and an object
tracking method suitable for the task. This project aims to survey methods for multiple
object tracking, their benefits and disadvantages, and which methods are more suitable for
particular applications and environments.
2. Methods
There are three basic steps in video analysis: object detection, object tracking, and
recognition of object activities by analysing their tracks.
3. Object Detection
The Object detection and tracking are playing an important role in many pattern recognition
and computer vision pattern recognition applications like autonomous robot navigation,
surveillance and vehicle navigation. An object detection mechanism used in when the object
first appears or in every frame in the video. In order to reduce the number of false detection
and increase accuracy rate, some object detection methods use the temporal information
computed from analyzing a sequence of frames. Object detection can be done by various
techniques such as temporal differencing, frame differencing, Optical flow and Background
subtraction.
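As a minimal illustration of one of these techniques, frame differencing can be sketched in a few lines of NumPy. The threshold value, grayscale input, and function name here are illustrative assumptions, not taken from the project's actual code.

```python
import numpy as np

def frame_difference_mask(prev_frame, curr_frame, threshold=25):
    """Return a boolean foreground mask: True wherever the absolute
    per-pixel difference between consecutive grayscale frames
    exceeds the threshold."""
    # Cast to a signed type so the subtraction cannot wrap around.
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return diff > threshold
```

In a real pipeline the mask would then be cleaned up with morphological operations before connected components are extracted as detections.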
4. Object Tracking
4.1. Traditional Methods
4.1.1. Centroid Tracking
When we detect an object, we enclose it with a bounding box using our previous methods.
Once we have the bounding box coordinates, we compute the centroid, or more simply, the
center (x, y)-coordinates of the bounding box. For every subsequent frame in our video
stream we compute object centroids; we first need to determine whether we can associate
the new object centroids with the old object centroids. To accomplish this, we compute the
Euclidean distance between each pair of existing object centroids and input object
centroids (i.e. the object centroids at frame t-1). If the distance between two centroids is
less than a threshold, we assume it is the same object. However, this method does not work
when two objects overlap.
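The centroid-matching step described above can be sketched as follows. The distance threshold and the simple greedy nearest-neighbour matching are illustrative assumptions, not the exact association scheme of any particular implementation.

```python
import math

def match_centroids(prev_centroids, new_centroids, max_dist=50.0):
    """Greedily associate each new centroid with the nearest unused
    previous centroid (object id -> (x, y)) within max_dist pixels.
    Returns ({object_id: new_centroid}, [unmatched new centroids])."""
    matches, unmatched, used = {}, [], set()
    for c in new_centroids:
        best_id, best_d = None, max_dist
        for obj_id, p in prev_centroids.items():
            if obj_id in used:
                continue
            d = math.hypot(c[0] - p[0], c[1] - p[1])  # Euclidean distance
            if d < best_d:
                best_id, best_d = obj_id, d
        if best_id is None:
            unmatched.append(c)  # treated as a newly appeared object
        else:
            used.add(best_id)
            matches[best_id] = c
    return matches, unmatched
```

A production tracker would typically solve this assignment optimally (e.g. with the Hungarian algorithm) rather than greedily.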
4.2. Modern Methods
4.2.1. MeanShift
The intuition behind MeanShift is simple. Consider you have a set of points (it can be a
pixel distribution like a histogram backprojection). You are given a small window (possibly a
circle) and you have to move that window to the area of maximum pixel density (or maximum
number of points). It is illustrated in the simple image given below:
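The window-shifting idea can be sketched on a plain point set, with no image machinery at all. The window radius, iteration cap, and convergence tolerance below are arbitrary choices for illustration.

```python
import math

def mean_shift(points, center, radius=2.0, max_iters=100, tol=1e-3):
    """Repeatedly move a circular window to the mean of the points it
    currently covers, stopping when the shift is smaller than tol."""
    cx, cy = center
    for _ in range(max_iters):
        inside = [(x, y) for x, y in points
                  if math.hypot(x - cx, y - cy) <= radius]
        if not inside:
            break  # window covers no points; give up
        mx = sum(x for x, _ in inside) / len(inside)
        my = sum(y for _, y in inside) / len(inside)
        if math.hypot(mx - cx, my - cy) < tol:
            return (mx, my)  # converged on the density mode
        cx, cy = mx, my
    return (cx, cy)
```

In the image-tracking setting, `points` is replaced by the histogram-backprojection probability image, but the convergence loop is the same.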
4.2.2. CamShift
The CamShift algorithm is based on the principles of MeanShift. CamShift is able to handle
a dynamic distribution by adjusting the size of the search window for the next frame based
on the zeroth moment of the current image distribution. In contrast to MeanShift, which is
designed for static distributions, CamShift is designed for dynamically evolving distributions.
It adjusts the size of the search window using invariant moments. This allows the algorithm
to anticipate the movement of objects and to quickly track the object in the next frame. Even
during fast movements of an object, CamShift is still capable of tracking well. This matters
when objects in a video sequence are tracked and the object moves such that the size and
location of the probability distribution change over time. The initial search window is
determined by a detection algorithm or by software dedicated to video processing. The
CamShift algorithm calls upon MeanShift to calculate the target centre in the probability
distribution image, but also the orientation of the principal axis and the dimensions of the
probability distribution. These parameters are obtained from the first and second moments
for x and y, as defined by equations (4, 5 and 6).
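Equations (4), (5) and (6) are not reproduced here; as a sketch, the standard CamShift moment definitions they correspond to have the following form, where I(x, y) is the probability-distribution image:

```latex
M_{00} = \sum_x \sum_y I(x,y), \qquad
M_{10} = \sum_x \sum_y x\, I(x,y), \qquad
M_{01} = \sum_x \sum_y y\, I(x,y)

x_c = \frac{M_{10}}{M_{00}}, \qquad
y_c = \frac{M_{01}}{M_{00}}

M_{20} = \sum_x \sum_y x^2 I(x,y), \qquad
M_{02} = \sum_x \sum_y y^2 I(x,y), \qquad
M_{11} = \sum_x \sum_y x y\, I(x,y)

\theta = \frac{1}{2}\arctan\!\left(
  \frac{2\left(M_{11}/M_{00} - x_c y_c\right)}
       {\left(M_{20}/M_{00} - x_c^2\right) - \left(M_{02}/M_{00} - y_c^2\right)}
\right)
```

The zeroth moment M00 sets the search-window scale for the next frame, (x_c, y_c) gives the target centre, and the second moments give the orientation θ of the principal axis.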
5. Person Re-identification
Version 0
Approach
The model proposed in the paper uses a CNN to perform person re-identification in static
images. Given two images, the model predicts whether they are of the same person or not. I
have used the TensorFlow Object Detection API to adapt this model for videos.
My code processes each frame independently. It runs the TensorFlow object detector on
each frame and obtains the bounding boxes for people in that frame. Using the obtained
bounding boxes, the people are cropped out and sent to the person re-identification model.
The people in the current frame are compared pairwise with all the previously detected
unique people. This determines whether each person was previously detected or not. If a
person is not matched with any of the previous people, then he/she is added to the list as a
new unique person. His/her image is also added to the database of previously detected
people so that people in future frames can be compared with this person.
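The version 0 logic above can be sketched as follows. Here `detect_people` and `same_person` are hypothetical stand-ins for the TensorFlow detector and the re-identification CNN, which this sketch treats as black boxes.

```python
def track_unique_people(frames, detect_people, same_person):
    """Version 0 logic: compare every detected person in every frame
    against all previously seen unique people.

    detect_people(frame) -> list of person crops in that frame
    same_person(a, b)    -> True if crops a and b show the same person
    """
    unique_people = []  # database of previously detected people
    for frame in frames:
        for crop in detect_people(frame):
            # Pairwise comparison against every known person so far.
            if not any(same_person(crop, seen) for seen in unique_people):
                unique_people.append(crop)  # new unique person
    return unique_people
```

The inner `any(...)` is exactly the source of the large number of comparisons that version 1 sets out to reduce.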
Results
Version 1
Approach
I implemented a tracking mechanism using intersection-over-union. Previously, the model
would iterate over every frame; in each frame, the object detection module would return
bounding boxes for the people, and each detected person in each frame was compared with
all the previously detected people. This led to a large number of comparisons.
In this version, the model iterates over all frames and in each frame obtains the person
bounding boxes, but it doesn't re-identify each person, only the uncertain ones. The model
stores a list of bounding box positions from the previous frame. It matches the bounding
boxes from the current frame to the previous-frame boxes that are very close in position. In
other words, if the current bounding box is very close to a bounding box from the previous
frame, that means it is the same person, who has just moved slightly between frames.
This way the model identifies each person only once (when they are first seen) using the
neural network. For subsequent frames, it just tracks the person.
To determine that two bounding boxes from consecutive frames are very close, it uses
intersection-over-union.
IOU between 2 boxes = Area of the intersection of boxes / Area of the union of boxes
IOU is a measure of the overlap between 2 boxes. If the boxes overlap a lot, IOU is close to
1. If the 2 boxes don't overlap at all, IOU is 0.
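This formula translates directly into code. The `(x1, y1, x2, y2)` corner representation below is an assumption about the box format, not necessarily the one the project uses.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp at zero: disjoint boxes have no intersection area.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```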
So, for each bounding box from the current frame, the model tries to find a bounding box
from the previous frame which greatly overlaps with it (IOU > 0.9). If such a box is found, the
model assigns the previous box's person to the new bounding box. In this way, people are
identified without actually running the neural network on them. If a bounding box cannot be
matched with any previous box, it means that this person just entered the frame and was not
there in the previous frame. In this case, the neural network is run on the person to
determine whether he has appeared before or is an entirely new person, in which case he is
added to the list of unique people detected. This new person is compared with all the other
previously detected people. If a match is found, that means the person had appeared
before but then disappeared for an intermediate period. If a match is not found, that means
the person is appearing in the video for the first time and needs to be added to the list of
unique people.
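Putting the two branches together, the per-frame decision can be sketched like this. The 0.9 threshold and the track-or-reidentify split follow the description above, while `reidentify`, the `iou` callable, and the dictionary data structures are illustrative assumptions.

```python
def assign_ids(prev_boxes, curr_boxes, reidentify, iou, threshold=0.9):
    """Assign a person id to each current-frame box.

    prev_boxes: {person_id: box} from the previous frame
    reidentify(box) -> person id (runs the re-id neural network)
    iou(a, b)       -> intersection-over-union of two boxes
    Returns {person_id: box} for the current frame.
    """
    assigned = {}
    for box in curr_boxes:
        # Cheap path: track by overlap with a previous-frame box.
        match = next((pid for pid, prev in prev_boxes.items()
                      if iou(box, prev) > threshold), None)
        if match is None:
            # Expensive path: fall back to the re-id network.
            match = reidentify(box)
        assigned[match] = box
    return assigned
```

The returned dictionary becomes `prev_boxes` for the next frame, so the network runs only on boxes that could not be tracked.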
The following two screenshots are very few frames apart. Hence, for every person, the
bounding box in the next frame is very close in position to the bounding box in the previous
frame, so the current and previous bounding boxes overlap a lot. This overlap is measured
by intersection-over-union: for two consecutive frames, the two bounding boxes of the same
person are extremely close and hence have an IOU close to 1.
Code optimization
Another improvement I made was to the code that calls the person re-identification model.
In version 0, each time my code called the person re-id module, the entire neural network
would be built again. This repeated construction of the network also slowed things down. In
version 1, I encapsulated the person re-id model inside a class. In my main program, I
create an object of this class once. I moved the network-building code inside the
constructor, so the network is built only once, during the creation of the object. This also
increased the execution speed.
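The shape of this optimization can be sketched as follows. `ReIdModel`, `_build_network`, and `compare` are hypothetical names standing in for the actual TensorFlow code, whose graph construction is the expensive step being moved into the constructor.

```python
class ReIdModel:
    """Builds the (stand-in) network once, in the constructor,
    instead of rebuilding it on every comparison call."""

    def __init__(self):
        self.network = self._build_network()  # expensive, done exactly once

    def _build_network(self):
        # Placeholder for the real TensorFlow graph construction.
        return object()

    def compare(self, image_a, image_b):
        # Placeholder for running the already-built network on two crops.
        return self.network is not None

model = ReIdModel()  # network built here, once, at object creation
# model.compare(a, b) can now be called every frame with no rebuild cost
```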
Results
The new code for assessing the performance works like this. Every time a person is tracked,
the re-id model is also run on the person to verify that the tracked person is in fact the
same. This tells us whether the tracking was correct or not. Also, if a person is not tracked
but re-identified instead, the current and previous bounding boxes of the person are
compared to determine whether they are close enough. This tells us whether this person
should have been tracked instead.
The people who are not being tracked should then have been re-identified, but that is also
not happening. This may mean that the re-identification model is not working accurately.
I wrote another script to measure the false positives and negatives. It shows that after 130
frames, 93% of the total bounding boxes identified were correct (with version 0 as ground
truth). However, only 8% of these were tracked. This is bad because most of the boxes
should have been tracked and only a few should have been sent to the re-identification
neural net. This means that tracking is not working correctly.
7. References
1. Samuel Murray, "Real-Time Multiple Object Tracking: A Study on the Importance of Speed".
2. B. Tharanidevi, R. Vadivu, K. B. Sethupathy, "Moving Object Tracking Distance and Velocity Determination based on Background Subtraction Algorithm", IOSR Journal of Electronics and Communication Engineering (IOSR-JECE).
3. Imran Khan Pathan, Chetan Chauhan, "A Survey on Moving Object Detection and Tracking Methods", International Journal of Computer Science and Information Technologies.
4. Deep Learning in Computer Vision, Coursera.
5. Afef Salhi and Ameni Yengui Jammoussi, "Object tracking system using Camshift, Meanshift and Kalman filter".
6. Rohini Chavan, "Multiple Object Detection using GMM Technique and Tracking using Kalman Filter", International Journal of Computer Applications (0975-8887).
7. B. Y. Lee, L. H. Liew, W. S. Cheah and Y. C. Wang, "Occlusion handling in videos object tracking: A survey", published under licence by IOP Publishing Ltd.
8. Multi-object tracking with dlib, https://www.pyimagesearch.com/2018/10/29/multi-object-tracking-with-dlib/
9. Detecting Cars Using Gaussian Mixture Models, https://in.mathworks.com/help/vision/examples/detecting-cars-using-gaussian-mixture-models.html