A Dissertation submitted to
Degree of
Sandhya R. Sharma
DEAN
Faculty of Technology & Engineering
The Maharaja Sayajirao University of Baroda
ACKNOWLEDGEMENT
I would like to express my deep sense of respect and gratitude towards my guide,
Mrs. Sandhya R. Sharma (Assistant Professor) at the Faculty of Technology and Engineering,
The Maharaja Sayajirao University of Baroda, for her exemplary guidance, cordial support,
valuable information, monitoring and constant encouragement throughout my dissertation
work. I want to thank her for giving me the opportunity to work under her. Without her
experience and insights, it would have been very difficult for me to do quality work.
Table of Contents
Abstract
List of Figures
List of Tables
1.1 Overview
1.3 Applications
2.2 Representation
2.4 Methods
Chapter 4 System Block Diagram
4.2.2 Segmentation
5.2.1 MATLAB
6.1 Result
Chapter 7 Conclusion and Future Scope
7.1 Conclusion
References
ABSTRACT
A new method for monitoring calorie burning during human activity, based on an SVM model
for human action recognition, has been developed. The first task is action recognition and the
second is to find the calories burnt during that action. Action recognition can be defined as the
ability to determine whether a given action occurs in a video. This problem is complicated by the
high complexity of human actions, such as appearance variation, motion pattern variation,
occlusions, etc. Here action recognition is done by focusing on a local spatio-temporal
neighbourhood. In this method, local space-time features capture local events in video and can
be adapted to the size, frequency and velocity of moving patterns. First the video is represented
in terms of local space-time features, and then this representation is integrated with an SVM
(Support Vector Machine) classification scheme for recognition. Calorie monitoring is done by
identifying the type of human activity and then calculating the average calories burnt during
that activity. For evaluation, I have used the KTH video dataset, which contains video sequences
of different human actions performed by different people in different scenarios. The presented
results of action recognition justify the proposed method and demonstrate its advantage
compared to other related approaches for action recognition.
List of Figures
Figure 2.3 Examples of matching local features for pairs of sequences with complex non-stationary backgrounds
Figure 4.5 Examples of scale and Galilean adapted spatio-temporal interest points
List of Tables
Table 3.1 Average Calories Data Sheet
CHAPTER 1
INTRODUCTION
1.1 Overview: -
At present, physical fitness is of prime importance. There is no doubt that inactivity
can lead to a number of health and personal issues, including weight gain, onset of
chronic and acute illness, and even low productivity in school, work and daily life.
Conversely, regular activity can prevent and may even reverse many of these issues.
Moving around, by walking, running, or even fidgeting in your seat, can help boost a
person's overall health.
For physical fitness, calorie burning throughout the entire day should be
monitored. Many applications, like Hike Messenger, fitness trackers and more,
are available for monitoring calorie burning. But these applications have
some disadvantages: some of them mistake insignificant activities and
mannerisms for exercise, so sometimes the application says you have worked out
a lot when you have actually spent much of your time just shaking your legs
underneath your office desk.
For this reason, spatio-temporal information for human action recognition is explored.
To generate the recognition result, there are three important and necessary steps:
• Feature extraction: A raw video sequence consists of massive spatio-temporal pixel-intensity
variations that contribute nothing to the action itself, such as pixels related to the colour of
clothes and a cluttered background. Feature extraction is a process that detects and extracts
the most representative information from raw data as features.
• Feature representation: Any video sequence will generate a specific number of features, and
different video sequences will have distinctive numbers of features. Feature representation is a
process that gives a unique representation for every video sequence based on the extracted
features. The final representation should be of the same dimension among different videos.
• Classification: Given the fixed-dimensional representation of each video, a classifier (here
an SVM) is trained to assign an action label to a new video sequence.
1.2 Dissertation Objectives: -
(1) To identify the type of human action in a video.
(2) To find how many calories are burnt during that human activity.
1.3 Applications: -
(1) In fitness trackers.
(2) In sports, to measure calorie burning during boxing, running, jumping, etc.
(3) In gyms, for monitoring calorie burning during exercise and keeping a
record of your exercise statistics.
1.4 Literature Review: -
Several researchers have carried out work on monitoring calorie burning during human
action and on different methods of human action recognition.
In this section a modest attempt is made to review some pertinent research
papers related to this topic, with emphasis on human action recognition methods.
[1] This paper studies local space-time features, which capture local events in video and
can be adapted to the size, frequency and velocity of moving patterns such as human
actions. The authors demonstrate how such features can be used for recognizing
complex motion patterns. They construct video representations in terms of local space-time
features and integrate such representations with SVM classification schemes for
recognition. For the purpose of evaluation, they introduce a new video database containing
2391 sequences of 6 human actions performed by 25 people in four different scenarios.
The presented results of action recognition justify the proposed method and demonstrate
its advantage compared to other related approaches for action recognition. In this paper
only six types of human action can be recognized; more actions could be recognized by
modifying the algorithm.
[2] In this paper, a new method for human action recognition is proposed using LBP-based
dynamic texture operators. It captures the similarity of motion around key points tracked
by a semi-dense point tracking method. The use of a self-similarity operator allows
highlighting the geometric shape of rigid parts of the foreground object in a video sequence.
Inheriting the efficient representation of LBP-based methods and the appearance
invariance of patch-matching methods, the method is well designed for capturing action
primitives in unconstrained videos. Action recognition experiments were made on several
academic action video datasets. The authors note several perspectives related to this
method, such as multi-scale SMPs and extension to moving backgrounds.
[3] In this paper, the recent shift in computer vision from static images to video sequences
has focused research on the understanding of action and behaviour. The lure of wireless
interfaces and interactive environments has heightened interest in understanding human
actions. Recently a number of approaches have appeared attempting the full 3-D
reconstruction of the human form from image sequences, with the presumption that such
information would be useful and perhaps even necessary to understand the action taking
place. The authors instead develop a view-based approach to the representation and
recognition of action that does not require such a full 3-D reconstruction.
1.5 Organization of Dissertation: -
Chapter 1: This chapter introduces an overview of the title, the dissertation objectives,
applications, the literature review and the organization of the dissertation.
Chapter 2: This chapter gives a brief overview of the human action recognition
method and the procedure for identifying the type of human action in a video.
Chapter 3: This chapter explains how calorie burning is measured for different
human actions.
Chapter 4: This chapter presents the system block diagram and its brief explanation.
Chapter 5: This chapter describes the hardware and software tools used in this
dissertation.
Chapter 6: This chapter presents the results of the dissertation and their analysis.
Chapter 7: This chapter gives the conclusion of the dissertation and its future scope.
CHAPTER 2
HUMAN ACTION RECOGNITION
2.1 Introduction of human action recognition method: -
Applications such as video retrieval and human–computer interaction require
methods for recognizing human actions in various scenarios. Typical scenarios include
scenes with cluttered, moving backgrounds, scale variations, individual variations in the
appearance and clothing of people, changes in lighting and viewpoint, and so forth. All of
these conditions introduce challenging problems that have been addressed in computer
vision in the past.
Recently, several successful methods for learning and recognizing human actions
directly from image sequences have been proposed, such as (a) probabilistic recognition of
activity using local appearance, (b) the representation and recognition of action using
temporal templates, and (c) recognizing action at a distance.
This motivates the need for alternative video representations that are stable with
respect to changes in recording conditions.
In spatial recognition, local features have recently been combined with SVMs in a
robust classification approach. In a similar manner, here we explore the combination of
local features and SVMs and apply the resulting approach to the recognition of human
actions. For the purpose of evaluation we introduce a new video database and present
results of recognizing six human actions. [1]
2.2 Representation: -
To represent motion patterns, we use local space-time features, which can be considered
as primitive events corresponding to moving two-dimensional image structures at
moments of non-constant motion (see Figure 2.1).
To detect local features in an image sequence f(x, y, t), we construct its scale-space
representation by convolution with a spatio-temporal Gaussian kernel:

L(x, y, t; σ², τ²) = g(x, y, t; σ², τ²) ∗ f(x, y, t)

The spatio-temporal neighbourhood of features in space and time is then defined by the
spatial and temporal scale parameters (σ, τ) of the associated Gaussian kernel. The size of
features can be adapted to match the spatio-temporal extent of the underlying image
structures by automatically selecting the scale parameters (σ, τ).
Moreover, the shape of the features can be adapted to the velocity of the local pattern,
hence, making the features stable with respect to different amounts of camera motion.
Here we use both of these methods and adapt features with respect to the scale and
velocity to obtain invariance with respect to the size of the moving pattern in the image as
well as the relative velocity of the camera.
Each detected feature is then described by a spatio-temporal jet descriptor l computed
using the selected scale values (σ², τ²). To enable invariance with respect to relative
camera motions, we also warp the neighbourhood of each feature using the estimated
velocity values prior to the computation of l.
K-means clustering of the descriptors l in the training set gives a vocabulary of primitive
events hᵢ. The number of features with labels hᵢ in a particular sequence defines a feature
histogram

H = (h₁, …, hₙ).
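As a concrete illustration of this bag-of-features step, the following Python/NumPy sketch (function names are my own; the dissertation's implementation is in MATLAB) clusters jet descriptors into a vocabulary with a naive k-means and builds the histogram H for one sequence:

```python
import numpy as np

def build_vocabulary(descriptors, k, iters=20, seed=0):
    """Cluster training-set jet descriptors into k primitive events (naive k-means)."""
    descriptors = np.asarray(descriptors, dtype=float)
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), size=k, replace=False)]
    for _ in range(iters):
        # assign every descriptor to its nearest vocabulary word
        dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = descriptors[labels == j].mean(axis=0)
    return centers

def feature_histogram(descriptors, centers):
    """Histogram H = (h1, ..., hn): share of descriptors assigned to each word."""
    descriptors = np.asarray(descriptors, dtype=float)
    dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    h = np.bincount(labels, minlength=len(centers)).astype(float)
    return h / h.sum()  # normalise so sequences of different length are comparable
```

The normalisation in the last line is one way to satisfy the requirement that representations of different videos have the same dimension and comparable scale.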
Consider the problem of separating the set of training data (x₁, y₁), (x₂, y₂), …, (xₘ, yₘ)
into two classes, where xᵢ ∈ ℝᴺ is a feature vector and yᵢ ∈ {−1, 1} its class label. If we
assume that the two classes can be separated by a hyperplane

ω · x + b = 0

in some space, and that we have no prior knowledge about the data distribution, then the
optimal hyperplane is the one that maximizes the margin. Solving the corresponding
maximization problem using Lagrange multipliers αᵢ (i = 1, 2, …, m) gives the decision function

f(x) = sgn( ∑ᵢ₌₁ᵐ αᵢ yᵢ K(xᵢ, x) + b )                         (3)

where αᵢ and b are found by the SVC learning algorithm. Those xᵢ with nonzero αᵢ are the
"support vectors". For K(x, y) = x · y, this corresponds to constructing an optimal separating
hyperplane in the input space ℝᴺ.
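Equation (3) can be evaluated directly once the support vectors, multipliers and bias are known. A minimal Python sketch (illustrative only; a real system would obtain the αᵢ and b from an SVC training routine rather than by hand):

```python
import numpy as np

def linear_kernel(x, y):
    # K(x, y) = x . y -- gives a separating hyperplane in the input space R^N
    return float(np.dot(x, y))

def svm_decision(x, support_vectors, alphas, labels, b, kernel):
    """Equation (3): f(x) = sgn( sum_i alpha_i * y_i * K(x_i, x) + b )."""
    s = sum(a * y * kernel(xi, x)
            for a, y, xi in zip(alphas, labels, support_vectors))
    return 1 if s + b >= 0 else -1
```

Swapping `linear_kernel` for a histogram or local-feature kernel changes the classifier without touching the decision rule, which is exactly why kernels such as (4) and (5) below can be plugged into the same SVM machinery.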
For histogram features H we use the χ² kernel (see Section 2.4), while for local features we
use the kernel

K_L(L_h, L_k) = ½ [K̂(L_h, L_k) + K̂(L_k, L_h)]                  (4)

with

K̂(L_h, L_k) = (1/n_h) ∑_{j_h} max_{j_k} K_l(l_{j_h}, l_{j_k})

where L_i = { l_j }_{j=1}^{n_i} and l_j is a jet descriptor of interest point j in sequence i, and

K_l(x, y) = exp{ −ρ (1 − ⟨x − μ_x | y − μ_y⟩ / (‖x − μ_x‖ ‖y − μ_y‖)) }   (5)
2.4 Methods: -
We compare results of combining three different representations and two classifiers.
The representations are
(1) Local features described by spatio –temporal jets l of order four (LF).
(2) 128-bin histograms of local features (HistLF).
(3) Marginalized histograms of normalized spatio-temporal gradients (HistSTG)
computed at 4 temporal scale of a temporal pyramid.
In the latest approach we only used image points with temporal derivative higher than
some threshold which value was optimised on the validation set.
(1) SVM with either local feature kernel in combination with LF or SVM with X2 kernel
for classifying histogram based representation HistLF and HistSTG .
(2) Nearest neighbour classification in combination with HistLF and HistSTG. [1]
19
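The χ² kernel used for the histogram representations can be sketched as follows (one common exponential-χ² variant; the exact normalization used in the original experiments may differ):

```python
import numpy as np

def chi2_kernel(h1, h2, gamma=1.0):
    """Exponential chi-square kernel K(H1, H2) = exp(-gamma * chi2(H1, H2))."""
    h1 = np.asarray(h1, dtype=float)
    h2 = np.asarray(h2, dtype=float)
    num = (h1 - h2) ** 2
    den = np.maximum(h1 + h2, 1e-12)   # guard against empty bins
    chi2 = float(np.sum(num / den))
    return float(np.exp(-gamma * chi2))
```

Identical histograms give a kernel value of 1, and increasingly dissimilar histograms decay towards 0, which is the behaviour an SVM expects from a similarity measure.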
2.5 Matching of local features: -
A necessary requirement for action recognition using the local feature kernel in
Equation (5) is the matching of corresponding features in different sequences. Figure
2.2 presents a few pairs of matched features in different sequences with human actions.
The pairs correspond to features with jet descriptors l_jh and l_jk selected by maximizing
the feature kernel over jk in Equation (4). As can be seen, matches are found for similar
parts (legs, arms and hands) at moments of similar motion. The locality of the descriptors
allows for matching of similar events in spite of variations in clothing, lighting and individual
patterns of motion.
Due to the local nature of features and corresponding jet descriptors, however, some
of the matched features correspond to different parts of action which are difficult to
distinguish based on local information only. Hence, there is an obvious possibility for
improvement of our method by taking the spatial and the temporal consistency of local
features into account.
The locality of our method also allows for matching similar events in sequences with
complex non-stationary backgrounds, as illustrated in Figure 2.3. This indicates that local
space-time features could be used for motion interpretation in complex scenes.
Figure 2.2 Examples of matched features in different sequences. [1]
Figure 2.3 Examples of matching local features for pairs of sequences with complex non-
stationary backgrounds. [1]
CHAPTER 3
CALORIES MEASUREMENT METHOD
The average calories burnt are computed from the average calorie rate for the recognized
type of action and the duration of the video:

Average calories burnt = (average calories per minute for the recognized action type / 60)
× (number of frames / frame rate)

Table 3.1 Average Calories Data Sheet

Action          Average calories per minute
Jogging         5
Walking         3
Running         9.5
Boxing          7
Hand clapping   1
Cycling         9.8
Surfing         11.2
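Interpreting the rates in Table 3.1 as calories per minute and recovering the activity duration from the frame count and frame rate (an assumption, since the original formula is ambiguous), the calculation can be sketched in Python (the dissertation's own implementation is in MATLAB):

```python
# Per-minute rates taken from Table 3.1 (unit assumed: calories per minute)
CALORIES_PER_MINUTE = {
    "jogging": 5.0, "walking": 3.0, "running": 9.5, "boxing": 7.0,
    "hand clapping": 1.0, "cycling": 9.8, "surfing": 11.2,
}

def calories_burnt(action, n_frames, frame_rate):
    """Duration (s) = frames / fps; calories = (rate / 60) * duration."""
    duration_seconds = n_frames / frame_rate
    return CALORIES_PER_MINUTE[action] / 60.0 * duration_seconds
```

For example, 1800 frames of walking at 30 frames per second is one minute of activity, giving 3 calories under the assumed rates.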
CHAPTER 4
SYSTEM BLOCK DIAGRAM
Camera system (action video) → Human Action Recognition method → Type of Human Action
→ Calculation of average calorie burning system
4.1 Camera System (Action video): -
It captures the human action and gives it to the simulator.
It contains:
(1) Camera
(2) TWAIN driver
4.2 Human Action Recognition Method: -
Video read → Segmentation → RGB image to gray-scale image conversion → Gaussian
convolution kernel (Gaussian smoothing) → Spatio-temporal jet descriptor → Harris corner
detector → Lagrange multiplier
In MATLAB, the frames of the input video are read as follows:

v = VideoReader('action.avi');   % open the input action video
while hasFrame(v)
    frame = readFrame(v);        % read the next frame
end
4.2.2 Segmentation: -
Segmentation as the partition of an image into a set of non- overlapping regions whose
union is the entire image, some rules to be followed for regions resulting from the image
segmentation can be stated as:
(1) They should be uniform and homogenous with respect to some characteristics;
(2) Their interiors should be simple and without many holes;
(3) Adjacent regions should have significantly different values with respect to the
characteristics on which they are uniform
(4) Boundaries of each segment should be simple and must be spatially accurate.
If the basic 2-D still gray-level image is represented by f(x, y), then the extension of 2-D
images to 3-D can be represented by f(x, y) ⟹ f(x, y, z); the extension of still images to
moving images or sequences of images by f(x, y) ⟹ f(x, y, t); a combination of the above
extensions by f(x, y) ⟹ f(x, y, z, t); and the extension of gray-level images to, for example,
colour or multi-band images by f(x, y) ⟹ f(x, y, b), where b indexes the colour band.
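A minimal motion-based segmentation of a sequence f(x, y, t) can be sketched by thresholding the difference of consecutive gray-level frames (a simplified stand-in for the segmentation step, with an assumed threshold value):

```python
import numpy as np

def segment_moving(prev_frame, frame, thresh=25):
    """Binary mask of pixels whose gray level changed by more than `thresh`
    between consecutive frames f(x, y, t-1) and f(x, y, t)."""
    diff = np.abs(frame.astype(int) - prev_frame.astype(int))
    return diff > thresh
```

The resulting mask marks the moving foreground; connected regions of the mask then play the role of the non-overlapping segments described above.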
4.2.3 RGB image to Gray scale image conversion: -
Colour to gray image conversion is widely used in real-world applications such
as printing colour images in black and white and pre-processing in image processing.
The main reason for using a gray-scale representation instead of operating on the colour
image directly is that gray-scale simplifies the algorithm and reduces computational
requirements. For many applications of image processing, colour information does not help
us to identify important edges or other features, and it can also produce unwanted
information which increases the number of training data required to achieve good
performance.
A gray-scale image consists of different shades of gray. A true-colour image can be
converted to a gray-scale image by maintaining the brightness of the image.
MATLAB provides the 'rgb2gray' function, which converts RGB to gray-scale by removing
hue and saturation information:

I = rgb2gray(image)

The above function converts the true-colour image to the gray-scale image I. An RGB image
is a combination of the RED, GREEN and BLUE colours and is a three-dimensional array. At a
particular position (i, j), image(i, j, 1) gives the value of the RED pixel, image(i, j, 2) the value
of the GREEN pixel, and image(i, j, 3) the value of the BLUE pixel. The combination of these
primary colours is normalized with R + G + B = 1, which gives neutral white. The gray-scale
image is obtained from the RGB image by combining 30% of RED, 59% of GREEN and
11% of BLUE. This captures the brightness information of the image, and the resulting image
is two-dimensional. The value 0 represents black and the value 255 represents white, so the
range runs from black to white.
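The weighted combination described above can be sketched in Python/NumPy (using the standard 0.2989/0.5870/0.1140 weights that rgb2gray approximates):

```python
import numpy as np

def rgb_to_gray(img):
    """Weighted sum of R, G and B channels (about 30%, 59% and 11%),
    the same combination that MATLAB's rgb2gray approximates."""
    img = np.asarray(img, dtype=float)
    gray = 0.2989 * img[..., 0] + 0.5870 * img[..., 1] + 0.1140 * img[..., 2]
    return np.round(gray).astype(np.uint8)
```

A pure-red pixel (255, 0, 0) maps to gray level 76, while a white pixel maps to 255, matching the black-to-white range described above.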
Figure 4.3: An RGB Image
4.2.4 Gaussian Convolution Kernel (Gaussian smoothing): -
Gaussian convolution is used to blur images and remove noise and detail. The
Gaussian function is used in numerous research areas.
When working with images we need the two-dimensional Gaussian function. The
Gaussian filter works by using this 2-D distribution as a point-spread function, which is
achieved by convolving the 2-D Gaussian distribution with the image. We therefore need a
discrete approximation to the Gaussian function. In theory this requires an infinitely large
convolution kernel, as the Gaussian distribution is non-zero everywhere. Fortunately, the
distribution is very close to zero beyond about three standard deviations from the mean
(about 99.7% of the distribution falls within 3 standard deviations). This means we can
normally limit the kernel size to contain only values within three standard deviations of
the mean.
G(x, y) = (1 / (2πσ²)) exp( −(x² + y²) / (2σ²) )

where σ is the standard deviation of the distribution, which is assumed to have a mean of
zero. An integer-valued 5 × 5 convolution kernel approximating a Gaussian with a σ of 1 is
shown below:

(1/273) ×
1   4   7   4   1
4  16  26  16   4
7  26  41  26   7
4  16  26  16   4
1   4   7   4   1
The Gaussian filter is a non-uniform low-pass filter. The kernel coefficients diminish
with increasing distance from the kernel's centre: central pixels have a higher weighting
than those on the periphery. Larger values of σ produce a wider peak, and the kernel size
must increase with increasing σ to maintain the Gaussian nature of the filter. The Gaussian
kernel coefficients depend on the value of σ; at the edge of the mask, the coefficients must
be close to 0. The kernel is rotationally symmetric with no directional bias. Gaussian filters
might not preserve image brightness.
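A discrete Gaussian kernel truncated at three standard deviations, as described above, can be constructed as follows (Python/NumPy sketch):

```python
import numpy as np

def gaussian_kernel(sigma, radius=None):
    """Discrete 2-D Gaussian, truncated at about three standard deviations
    and normalised so the coefficients sum to 1."""
    if radius is None:
        radius = int(3 * sigma)          # ~99.7% of the mass lies within 3 sigma
    ax = np.arange(-radius, radius + 1)
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return kernel / kernel.sum()
```

With σ = 1 this yields a 7 × 7 kernel whose largest coefficient sits at the centre, rotationally symmetric as described above; normalising the coefficients to sum to 1 is what keeps the overall image brightness approximately unchanged.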
A local interest point approach captures spatio-temporal events in video data.
Consider an image sequence f and construct a spatio-temporal scale-space
representation L by convolution with a spatio-temporal Gaussian kernel

g(x, y, t; σ², τ²) = (1 / ((2π)^{3/2} σ² τ)) exp( −(x² + y²)/(2σ²) − t²/(2τ²) )

with spatial and temporal scale parameters σ and τ. Then, at any point p = (x, y, t) in
space-time, define a spatio-temporal second-moment matrix µ as

µ = g(·; σᵢ², τᵢ²) ∗ ( ∇L (∇L)ᵀ )                               (1)

where ∇L = (Lx, Ly, Lt)ᵀ denotes the spatio-temporal gradient vector and (σᵢ = γσ, τᵢ = γτ) are
spatial and temporal integration scales with γ = 2.
Interest points are then detected at space-time positions maximizing all eigenvalues
λ₁, …, λ₃ of µ or, similarly, by searching for maxima of the interest-point operator
H = det µ − k (trace µ)³ = λ₁λ₂λ₃ − k(λ₁ + λ₂ + λ₃)³ over (x, y, t), subject to H ≥ 0,
with k ≈ 0.005.
To estimate the spatial and the temporal extents (σ₀, τ₀) of events, we maximize a
normalized feature-strength measure over spatial and temporal scales at each
detected interest point p₀ = (x₀, y₀, t₀).
Velocity adaptation. Moreover, to compensate for relative motion between the camera
and the moving pattern, we perform velocity adaptation by locally warping the
neighbourhoods of each interest point with a Galilean transformation using image velocity
u estimated by computing optic flow at the interest point.
Figure 2.7.1 shows a few examples of spatio-temporal interest points computed in this
way from image sequences with human activities. the method allows us to extract scale-
adaptive regions of interest around spatiotemporal events in a manner that is invariant to
spatial and temporal scale changes as well as to local Galilean transformations. [1]
Fig. 4.5: Examples of scale and Galilean adapted spatio-temporal interest points.
The illustrations show one image from the image sequence and a level surface of
image brightness over space-time with the space-time interest points illustrated as
dark ellipsoids.
4.2.6 Harris Corner Detector: -
The Harris detector measures the change E produced by shifting a window by (u, v):

E(u, v) = Σₓ,ᵧ w(x, y) [ I(x + u, y + v) − I(x, y) ]²

where E is the difference between the original and the moved window, and w(x, y) is the
window at position (x, y); this acts like a mask, ensuring that only the desired window is
used. A first-order approximation gives

E(u, v) ≈ [u v] M [u v]ᵀ

where

M = Σₓ,ᵧ w(x, y) [ Ix²   IxIy
                   IxIy  Iy² ]
The eigenvalues of the matrix M determine the suitability of a window:

det M = λ₁ λ₂
trace M = λ₁ + λ₂
R = det M − k (trace M)²

In short, the Harris corner detector is a mathematical way of determining which windows
produce large variations when moved in any direction.
The Lagrange-multiplier stage that follows involves three steps:
(1) isolate any possible singular points of the solution set of the constraining equations;
(2) find all the stationary points of the Lagrange function;
(3) establish which of those stationary points and singular points are global maxima of
the objective function.
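The Harris response R = det M − k (trace M)², with M accumulated over a local window, can be sketched as follows (Python/NumPy; a simple 3 × 3 box window stands in for w(x, y)):

```python
import numpy as np

def _window_sum(a):
    """Sum each pixel's 3x3 neighbourhood (acts as the window w(x, y))."""
    p = np.pad(a, 1)
    h, w = a.shape
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3))

def harris_response(img, k=0.04):
    """Per-pixel R = det(M) - k * trace(M)^2 with M summed over a 3x3 window."""
    Iy, Ix = np.gradient(img.astype(float))   # image gradients (rows = y, cols = x)
    Ixx = _window_sum(Ix * Ix)
    Iyy = _window_sum(Iy * Iy)
    Ixy = _window_sum(Ix * Iy)
    det = Ixx * Iyy - Ixy ** 2
    trace = Ixx + Iyy
    return det - k * trace ** 2
```

R is large and positive at corners (both eigenvalues large), near zero in flat regions, and negative along edges (one dominant eigenvalue), which is how the detector separates the three cases.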
4.3 Human Action: -
This specifies the output of the human action recognition method. It gives the type of human
action, such as running, walking, jogging, cycling, surfing, etc. The type of human action is
given as an input to the calorie calculation system.
4.4 Calculation of average calories burning system: -
It contains the average calorie data sheet for each type of human action. The calories
burnt during human activities such as running, walking, boxing, hand clapping, hand waving
and cycling can be measured by various methods.
In our method, after the human action is recognized, the average calories for that
particular action are calculated.
CHAPTER 5
HARDWARE AND SOFTWARE TOOLS
Figure 5.1 Logitech Web Camera (C170)
Technical Specifications: -
Megapixels: 5
Frame rate: 30 fps
System Requirements: -
OS: Windows 10 onwards
Features: -
Operating systems: Linux, macOS, Microsoft Windows
Supported Technologies: -
TWAIN provides support for:
5.2 SOFTWARE TOOLS: -
(1) MATLAB
5.2.1 MATLAB: -
Figure 5.3 MATLAB
Features of MATLAB: -
(1) It is a high-level language for numerical computation, visualization and application
development.
(2) It provides an interactive environment for iterative exploration, design and problem
solving.
(3) It provides a vast library of mathematical functions for linear algebra, statistics,
Fourier analysis, filtering, numerical integration, and solving ordinary differential
equations.
(4) It provides built-in graphics for visualizing data and tools for creating custom plots.
(5) MATLAB's programming interface gives development tools for improving code
quality and maintainability and maximizing performance.
(6) It provides tools for building applications with custom graphical interfaces.
(7) It provides functions for integrating MATLAB-based algorithms with external
applications and languages such as C, Java, .NET, and Microsoft Excel.
CHAPTER 6
RESULT AND RESULT ANALYSIS
6.1 Result: -
Figure 6.1 Segmentation Result
(a)
(b)
Figure 6.4 Harris Corner Detector Result
Figure 6.6 Calorie Monitoring result
Global motion of subjects in the database is a strong cue for discriminating between the
leg and arm actions when using histograms of spatio-temporal gradients. This information
is cancelled when representing the actions in terms of velocity-adapted local features. Hence,
the LF and HistLF representations can be expected to give similar recognition performance
when disregarding the global motion of the person relative to the stationary camera.
(In the confusion matrix, the boxing row, for example, shows 97.9% correct classification.)
CHAPTER 7
CONCLUSION AND FUTURE SCOPE
7.1 Conclusion: -
The core idea of this dissertation is to develop a calorie-burning monitoring system which
can monitor calorie burning during human activities such as running, walking, boxing, hand
clapping, hand waving, jogging, surfing and cycling. The human action is identified by the
local SVM technique described in Chapter 2 (Human Action Recognition). How the calories
burnt are measured for each action is described in Chapter 3 (Calories Measurement
Method). The experiments show that this approach identifies human actions more
accurately than the other methods considered, and that the resulting calorie monitoring
compares favourably with other methods and devices.
7.2 Future Scope: -
The work presented in this dissertation has numerous opportunities for future
research in the field of calorie burning monitoring during Human action:
The calorie-burning monitoring system can be extended to a live calorie monitoring
system by capturing live human action with a camera. Such a live system would be helpful
for calorie monitoring during athletics, boxing, etc.
Human action recognition against non-stationary backgrounds is another possible
extension.
REFERENCES: -
[1] Christian Schuldt, Ivan Laptev, Barbara Caputo: Recognizing Human Actions: A
Local SVM Approach. In: IEEE International Conference on Pattern
Recognition (ICPR'04). 1051-4651/04.
[2] d'Angelo, E., Paratte, J., Puy, G., Vandergheynst, P.: Fast TV-L1 optical flow for
interactivity. In: IEEE International Conference on Image Processing (ICIP'11),
pp. 1925–1928. Brussels, Belgium (September 2011)
[3] Aggarwal, J., Ryoo, M.: Human activity analysis: A review. ACM Comput. Surv.
43, 16:1–16:43 (2011)
[4] Kellokumpu, V., Zhao, G., Pietikäinen, M.: Human activity recognition using a
dynamic texture-based method. In: BMVC (2008)
[5] Kellokumpu, V., Zhao, G., Pietikäinen, M.: Texture based description of
movements for activity analysis. In: VISAPP (2), pp. 206–213 (2008)
[6] Bobick, A.F., Davis, J.W.: The recognition of human movement using temporal
templates. PAMI 23, 257–267 (2001)
[7] Nanni, L., Brahnam, S., Lumini, A.: Local ternary patterns from three orthogonal
planes for human action classification. Expert Syst. Appl. 38, 5125–5128 (2011)
[8] Ojala, T., Pietikäinen, M., Mäenpää, T.: Multiresolution gray-scale and
rotation invariant texture classification with local binary patterns. PAMI 24,
971–987 (2002)
[9] Liu, J., Luo, J., Shah, M.: Recognizing realistic actions from videos "in the wild".
In: CVPR, pp. 1996–2003 (2009)
[10] Aggarwal, J., Cai, Q.: Human motion analysis: A review. CVIU 73(3), 428–440
(1999)