

An Optical Flow-Based Full Reference Video Quality Assessment Algorithm

Manasa K.1, Sumohana S. Channappayya1

Abstract: We present a simple yet effective optical flow-based


full-reference video quality assessment (FR-VQA) algorithm for
assessing the perceptual quality of natural videos. Our algorithm
is based on the premise that local optical flow statistics are
affected by distortions and that the deviation from pristine
flow statistics is proportional to the amount of distortion. We
characterize local flow statistics using the mean, the standard
deviation, the coefficient of variation (CV), and the minimum
eigenvalue (λ_min) of the local flow patches. Temporal distortion
is estimated as the change in the CV of the distorted flow
with respect to the reference flow, and the correlation between
the λ_min of the reference and of the distorted patches. We
rely on the robust Multi-scale Structural SIMilarity (MS-SSIM)
index for spatial quality estimation. The temporal and spatial
distortions thus computed are then pooled using a perceptually
motivated heuristic to generate a spatio-temporal quality score.
The proposed method is shown to be competitive with the
state-of-the-art when evaluated on the LIVE SD database, the
EPFL-PoliMI SD database, and the LIVE Mobile HD database.
The distortions considered in these databases include those due
to compression, packet-loss, wireless channel errors, and rate adaptation. Our algorithm is flexible enough to allow for any robust FR spatial distortion metric for spatial distortion estimation.
Additionally, the proposed method is not only parameter-free
but also independent of the choice of the optical flow algorithm.
Finally, we show that the replacement of the optical flow vectors
in our proposed method with the much coarser block motion
vectors also results in an acceptable FR-VQA algorithm. Our
algorithm is called the FLOw SIMilarity (FLOSIM) index.
Index Terms: Full reference video quality assessment, optical
flow, MS-SSIM.

I. INTRODUCTION
The explosive growth of video content over the past decade
has led to a very urgent need to effectively manage this content
[1]. This includes better acquisition, compression, storage and
transport of video data. In other words, these systems must
be designed to minimize perceptual distortion while optimally
utilizing available storage and communication resources. Distortions can potentially be introduced at various stages of video
processing, ranging from acquisition, storage, and transport to
the rendering process itself. Distortions can lead to
a loss in the visual quality of the video. In a majority of
the cases, the ultimate consumer of the video content is a
human subject. Humans have the ability to rate the perceptual
quality of a video based on their prior experience of the world
and the training that they have acquired over time. However,
the subjective evaluation of the video is time consuming and
expensive; with the huge volumes of data being generated,
it becomes impractical.

1 The authors are with the Lab for Video and Image Analysis (LFOVIA), Department of Electrical Engineering, Indian Institute of Technology Hyderabad, Kandi, India, 502285. E-mail: {ee12p1002, sumohana}@iith.ac.in.

This calls for the development of an


effective video quality assessment (VQA) algorithm that is
motivated by the human visual system (HVS), and correlates
well with mean opinion scores (MOS) of subjective evaluation.
The MOS depends on the visually perceived quality of the
video. Visual perception in the brain results from the cognitive
processes involved in interpreting the physical senses. The
optic nerve transmits the images captured by the eye to the
lateral geniculate nucleus (LGN), which relays the information
to the cortex. The visual cortex is responsible for processing
visual information. The visual cortex is made up of the
primary visual cortex (V1) and visual areas from V2 to V5.
The area V1 receives the sensory inputs from the thalamus
and the tuning properties of its neurons are time dependent.
For approximately the first 40 ms, the neurons in area V1 are
sensitive to even a small set of stimuli; after about 100 ms,
the neurons become more sensitive to global effects
[2], [3]. The neurons in area V2 are tuned to simple
visual characteristics such as orientation, spatial frequency,
size, color, and shape [4]–[6]. Neurons in area V3 respond
to coherent motion of large patterns [7] and the area V4 is
involved in recognizing geometric shapes. Visual area V5/MT
plays a major role in the perception of motion and the guidance
of some eye movements [8].
In addition to feedforward connections, there are feedback
connections from the higher-tier cortical areas (V2, V3, V4,
V5) to the lower-tier cortical areas such as V1; these feedback
connections are engaged during occlusion or illusions and by
brain states such as attention and expectation [9], [10]. The
brain state of expectation implies that the annoyance levels are
different at the onset and during the persistence of distortion;
this state of the brain serves as a foundation of our approach.
The saliency of motion in the visual world is emphasised
by the fact that an entire area in the brain (area MT/V5) is
dedicated to visual motion processing. The neurons in this
area are associated with the perception of motion [3], [8]. It
is hypothesized that the optical flow field is computed and
represented by neurons in area MT [3], [11], and is central
to our perception of motion. Breakthroughs in VQA [12],
[13] have been possible due to the effective modeling of the
functions of the visual cortex, including optical flow estimation
models and speed estimation models. Further, optical flow
represents motion information at its finest resolution. These
observations form the primary motivation for us to work with
the optical flow for motion processing.
Our contribution is summarized as follows: we first present
an FR technique for measuring the perceptual annoyance that
results from temporal distortions. This technique is based on


the fact that optical flow is able to capture the temporal


onset and clearing of distortion (i.e., sudden changes in frame
quality) very well. We hypothesize that visual annoyance is
higher at such temporal locations and assign higher perceptual
importance to them. Toward this end, we propose features
based on optical flow statistics that are able to reliably capture
temporal distortions. However, when distortion persists over
a few frames, its effect on the observer is lower (than at
distortion onset/clearing locations) but certainly not negligible.
Due to the differential nature of optical flow computation,
it does not capture persistent distortions well. By persistent
distortions we mean spatial distortions that appear in several
consecutive frames. Therefore, persistent distortion is estimated using a spatial quality metric. Any robust FR image
quality assessment (IQA) algorithm serves this purpose. While
we have chosen the MS-SSIM index for performing robust
FR IQA, we demonstrate that other FR IQA methods could
also be employed. We then propose a heuristic for pooling the
temporal and spatial scores into an effective spatio-temporal
score. This heuristic is based on classifying video frames
according to their perceptual importance and pooling their
scores in proportion to their importance. We show that our
algorithm performs competitively with the state-of-the-art FR-VQA methods over both SD and HD databases.
The rest of the paper is organized as follows. We review
relevant literature in Section II, followed by a description of
the proposed algorithm in Section III. We present our results
and discuss them in Section IV, and make concluding remarks
in Section V.
II. BACKGROUND
The objective assessment of the perceptual quality of natural
videos is a challenging and open research problem. This
statement is true for all the flavors of video quality assessment
(VQA): full-reference (FR), reduced-reference (RR), and no-reference (NR). The evidence for this claim is the fact that
the state-of-the-art methods for all three flavors have only
recently been approaching a reasonable level of correlation (in
the range (0.75, 0.9)) with subjective scores on a moderately
complex video database (LIVE) [14]–[16]. In the following,
we non-exhaustively review recent and relevant FR-VQA
algorithms in order to place our work in context.
FR-VQA is a challenging task due to several reasons of
which we opine the primary ones are the highly non-stationary
nature of video signals, and an incomplete understanding of
the human visual system (HVS). FR-VQA is a well-studied
problem with several approaches having been explored. One
classification of FR-VQA algorithms could be based on
their domain of operation as either compressed-domain methods or uncompressed-domain methods. We briefly review
the compressed-domain FR-VQA literature. The compresseddomain FR-VQA approaches operate with limited compressed
bitstream information and are primarily geared toward addressing artifacts arising out of communication errors such as
bit-errors and packet losses. The effect of these packet losses
on videos has been estimated using mean squared error
estimation techniques that are based on bitstream parsing [17],

a spatio-temporal technique by Yang et al. [18], a weighted


combination of blockiness, blurriness and noise by Farias et al.
[19], temporal quality estimation based on frame drop by Yang
et al. [20], generalized linear models for packet loss visibility
[21], [22], to name a few.
Uncompressed-domain FR-VQA approaches on the other
hand utilize all the information available in the spatio-temporal
domain. Both image and video quality measures based on
human visual perception [23]–[28] have been explored from
the very beginning. The properties of the HVS that have been
employed include modeling the visual sensitivity to predict the
visibility of the error [23], modeling the human visual sensitivity to spatial and chromatic signals [26], incorporating aspects
of early visual processing, such as light adaptation, luminance
and chromatic channels, spatial and temporal filtering, spatial
frequency channels, contrast masking, and probability summation [27], measurement of sensory scales based on Bayesian
estimation [28]. A significant number of I/VQA approaches in
the past were based on the error sensitivity philosophy [29]–[32],
motivated by psychological vision science research: the distorted
signal is treated as the sum of a reference signal and an error
signal, and the extent to which the HVS perceives the error
signal is determined. The error sensitivity based
quality measurement is as follows: the original and test signals
are subject to preprocessing procedures, such as alignment,
luminance transformation, and color transformation. A channel
decomposition method (wavelet transforms, discrete cosine
transform (DCT), and Gabor decompositions) is then applied
to these two preprocessed signals.
The errors between the two signals in each channel are
calculated and weighted, usually by a Contrast Sensitivity
Function (CSF). The weighted error signals are adjusted by
a visual masking effect model, which reflects the reduced
visibility of errors presented on a background reference signal.
The Minkowski error pooling of the weighted and masked
error signal is then employed to obtain a single quality score.
Wang et al. [33] proposed a structural distortion estimation
approach to FR-VQA which hypothesized that the amount of
distortion in the structure is a measure of perceived distortion. It is an extension of the Structural SIMilarity (SSIM)
index [34] with two adjustments including local spatial and
temporal weighting based on the luminance and global motion
respectively. Another popular approach to the quality assessment problem is to cast it in an information communication
framework [13], [35], [36], where the HVS is modeled as
an error-prone communication channel. Yet another popular
approach is based on the observation that the HVS does not
perceive all the distortions equally [12]–[14], [37]–[39].
Wang and Li [13] incorporated the model proposed by
Stocker and Simoncelli [40] for human visual speed perception. Spatio-temporal weighting is done based on the motion
information content and the perceptual uncertainty.
Seshadrinathan and Bovik employed an approach to tune
the orientation of a set of three-dimensional Gabor filters
according to local motion based on optical flow [41], [42]. The
adapted Gabor filter responses are then incorporated into the
SSIM and the visual information fidelity (VIF) [43] measures
for the purpose of VQA.


Ninassi et al. [44] hypothesize that the temporal distortion


is the temporal evolution of the spatial distortion and is closely
linked to the visual attention mechanisms. Hence they resort to
short-term temporal pooling and long-term temporal pooling.
In the short-term evaluation of the temporal distortions, the
spatiotemporal perceptual distortion maps are computed from
the spatial distortion maps (wavelet based quality assessment
metric (WQA) [45]) and the motion information. In the long-term evaluation, the quality score for the whole video sequence
is computed based on the concepts of perceptual saturation and
the asymmetric behavior of human observers.
The Motion-tuned Video Integrity Evaluator (MOVIE) index [12] is an HVS-inspired algorithm where the response of
the visual system to a video stimulus is modeled as a function of
linear spatio-temporal bandpass filter outputs. The central idea
is that distortions cause the optical flow plane of the distorted
video to move away from the reference video's optical flow
plane. The filters close to the reference video's optical flow
plane are given excitatory weights and those away from it are
given inhibitory weights.
A video quality assessment model based on Most Apparent Distortion (MAD) [46], called Spatiotemporal MAD (ST-MAD) [47], is designed on the assumption that motion artifacts
will manifest as spatial artifacts and the appearance-based
model of MAD can measure these changes to agree with
human perception. MAD has two stages; a detection-based
stage, which computes the perceived degradation due to visual
detection of distortions and an appearance-based stage, which
computes the perceived degradation due to visual appearance
changes.
Park et al. [14] hypothesize that non-uniform local distortion
is perceptually more annoying than distortion that is more or
less uniform both spatially and temporally. This hypothesis
is demonstrated with a strategy for pooling local quality scores
into a global score. Local quality scores are sorted in ascending
order and higher weights are assigned to patches that occupy
the steepest ascent region in the sorted quality curve.
Wolf and Pinson [48] present a VQA algorithm called
VQM VFD to quantify the effects of temporal distortion due
to frame delay on perceptual video quality. This is an extension
to their previous work called video quality metric (VQM)
[37]. In VQM VFD, the authors extract hand-crafted spatio-temporal features, primarily using edge detection filters. A
neural network is trained using these features extracted from
a training set composed of a varied collection of video data.
The VQM VFD was evaluated on the LIVE Mobile database
in [49] and shown to have the best performance across all
VQA algorithms.
In this work we present a simple yet effective FR-VQA
algorithm based on local optical flow statistics and a robust
FR-IQA algorithm. We also propose a perceptually inspired
pooling strategy and demonstrate the efficacy of our approach
on popular SD and HD video databases. Our algorithm is
presented in Section III, followed by results and discussion
in Section IV and concluding remarks in Section V.

Fig. 1: Overview of the Proposed Approach.

III. PROPOSED APPROACH

A large fraction of VQA algorithms (and quality assessment


algorithms in general) have been inspired by the properties
of the HVS. The HVS consists of the eye, the optic nerve, the
optic chiasm, the optic tract, and the visual cortex. The responses of the
neurons in the area 18 of the visual cortex are shown to be
almost separable in the spatial and temporal fields [50]. This
motivated us to propose a three-stage approach, where the
spatial and temporal features are computed individually and
later pooled to obtain a single quality score for the entire video.
Temporal features are extracted from the optical flow. The
temporal feature computation stage is based on the property
that the spatially proximate flows are highly correlated and
temporally proximate flows have strong dependencies. The
optical flow is computed for the entire video sequence on a
frame-by-frame basis. The flow is computed for the reference
and distorted videos and the features are extracted from the
flow information. The deviation of the distorted video features
from the reference video features is considered as a measure
of distortion.
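To make the flow-computation step concrete, the following is a minimal sketch using OpenCV's Farneback estimator (one of the flow algorithms evaluated in Section IV); the parameter values shown are OpenCV's commonly used example settings rather than values prescribed here, and the function name is illustrative.

```python
import cv2

def farneback_flow(prev_frame, next_frame):
    """Dense optical flow between two consecutive BGR frames.

    Returns an (H, W, 2) array holding the horizontal and vertical
    flow components for every pixel."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    return cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
```

The same call is made for the reference and the distorted videos, frame pair by frame pair, before any features are extracted.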
Our algorithm is flexible enough to allow any robust full-reference image quality assessment (IQA) algorithm for spatial
quality assessment. We chose the MS-SSIM index [51] as a
representative for the class of robust FR-IQA algorithms. The
MS-SSIM index is used to compute the spatial score on a
frame-by-frame basis. The temporal and spatial features are
pooled using a perceptual importance classification strategy
that is inspired by our previous work [52]. Fig. 1 shows the
block diagram of the proposed approach and each stage of the
algorithm is described in the following subsections.


A. Stage 1: Temporal Feature Extraction


Motion trajectory is among the most salient features of
a video. The optical flow gives the motion/flow information
in its finest resolution. Flow-based methods have been very
successful in assessing temporal video quality [53]. Natural
images (and video frames) are shown to exhibit strong local
correlation which is inherited by the optical flow frames [54].
Therefore, local flow statistics can be used to estimate the
naturalness of a video.
The optical flow in a pristine natural video is generally
smooth and is highly correlated spatially as well as temporally.
When distortion sets in, there is inconsistency in the flow and
the spatial and temporal correlation is affected. We hypothesize
that distortion results in the deviation of the local statistical
properties of the flow relative to undistorted flow statistics.
Further, we claim that these local flow inconsistencies can be
captured well using the local mean μ and the local standard
deviation σ. Additionally, we claim that the local flow
randomness is represented well by the minimum eigenvalue
λ_min of the patch flow components (after performing a PCA-like eigen decomposition on the patch flow vectors).
The empirical justification for our claims is clearly illustrated in Fig. 2. Fig. 2a shows a video frame with high
perceptual quality and Fig. 2b shows a video frame with
low perceptual quality. Both these frames are derived from
the same reference video and are temporally identical (96th
frame). Fig. 2c shows a 7×7 optical flow patch from the
high quality frame having consistent flow with the local mean
μ = 5.72, local standard deviation σ = 0.057, and the
minimum eigenvalue λ_min = 0.0021. Similarly, Fig. 2d shows
a 7×7 optical flow patch from the low quality frame at the
same spatial location as the high quality patch. The values of
the local statistics for the low quality patch are μ = 8.937,
σ = 3.017, and λ_min = 23.6829. These values clearly depict
the deviation in the local statistical properties of a low quality
patch from a good quality patch. While Fig. 2 is an illustrative
example, we found these local statistics to work consistently
well over a large set of frames and videos. The local statistics
defined above form the key features of our proposed algorithm
and are formalized in the following.
1) Feature 1: μ is the mean of the flow magnitudes in a local patch.
2) Feature 2: σ is the standard deviation of the flow magnitudes in a local patch.
3) Feature 3: λ_min is the minimum eigenvalue of a flow patch's covariance matrix. The covariance matrix is of size 2×2, with the two dimensions corresponding to the horizontal and vertical flow components respectively.
Assuming K×L non-overlapping patches in a frame, the per-frame feature vectors for the ith frame are denoted f_1^i, f_2^i, and f_3^i for the patch means (μ), patch standard deviations (σ), and patch minimum eigenvalues (λ_min) respectively. These feature vectors are defined as follows:

f_1^i = [μ_1^i, μ_2^i, ..., μ_{KL}^i]^T,    (1)

f_2^i = [σ_1^i, σ_2^i, ..., σ_{KL}^i]^T,    (2)

f_3^i = [λ_1^i, λ_2^i, ..., λ_{KL}^i]^T.    (3)

It should be noted that the subscript in the vector elements corresponds to the patch index, and the superscript corresponds to the frame index.
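As an illustrative sketch (not the authors' code), the per-patch statistics in (1)–(3) can be computed from a dense flow field as follows; the 7×7 patch size used later in Section IV is taken as the default, and the function names are assumptions.

```python
import numpy as np

def flow_patch_features(flow, patch=7):
    """Per-patch flow statistics for one frame.

    flow : (H, W, 2) array of horizontal and vertical flow components
           (e.g., from the Farneback sketch above).
    Returns three 1-D arrays (mu, sigma, lambda_min), one entry per
    non-overlapping patch x patch block, i.e., the vectors f1, f2, f3."""
    H, W, _ = flow.shape
    mus, sigmas, lmins = [], [], []
    for r in range(0, H - patch + 1, patch):
        for c in range(0, W - patch + 1, patch):
            block = flow[r:r + patch, c:c + patch, :].reshape(-1, 2)
            mag = np.linalg.norm(block, axis=1)           # flow magnitudes in the patch
            mus.append(mag.mean())                        # Feature 1: local mean
            sigmas.append(mag.std())                      # Feature 2: local standard deviation
            cov = np.cov(block, rowvar=False)             # 2x2 covariance of the (u, v) components
            lmins.append(np.linalg.eigvalsh(cov).min())   # Feature 3: minimum eigenvalue
    return np.array(mus), np.array(sigmas), np.array(lmins)
```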
Having identified features that effectively capture local
temporal distortion, the challenge lies in identifying the frames
with perceivable distortion. The optical flow in a pristine
natural video is generally smooth; when the distortion sets in,
this smoothness is lost. Fig. 2e shows the scatter plot of the
features μ and σ. This plot illustrates a higher dispersion of
the features in a low quality frame relative to the corresponding
reference frame. These observations show that the amount of
dispersion of each feature in a distorted frame, with respect
to the corresponding frame in the pristine video, is a measure
of the temporal distortion in the frame. The distortion might be
uniform throughout the frame, in which case it shifts the
mean of the features uniformly; however, such distortion
might not be visually annoying [14]. Therefore a measure
of the dispersion of the features can effectively characterize
the perceivable distortion. The coefficient of variation is a
standardized measure of the dispersion of data and is defined as

CV(x) = σ_x / μ_x,    (4)

where x is the data vector, σ_x is the standard deviation of x,
and μ_x is the mean of x. The coefficient of variation (CV)
allows us to compare data with different means, and we pool the f_1^i
features in a frame i into CV(f_1^i) and the f_2^i features into
CV(f_2^i). The difference in dispersion between the reference
and the test data vectors is defined as

D(x_r, x_t) = |CV(x_r) - CV(x_t)|,    (5)

where the subscripts r and t denote the reference and test sets
respectively.
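A small sketch of (4) and (5); the absolute value reflects the reading of D as an unsigned amount of dispersion difference, which is an assumption of this sketch, and the helper names are illustrative.

```python
import numpy as np

def coefficient_of_variation(x):
    """CV(x) = sigma_x / mu_x for a 1-D feature vector (Eq. 4)."""
    x = np.asarray(x, dtype=float)
    return x.std() / x.mean()

def dispersion_difference(x_ref, x_test):
    """Frame-level dispersion difference D(x_r, x_t) between the reference and
    test feature vectors (Eq. 5); no patch-wise correspondence is used."""
    return abs(coefficient_of_variation(x_ref) - coefficient_of_variation(x_test))
```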
The difference in dispersion defined in (5) is applied to the
per-frame features f1i , f2i to effectively measure the amount
of temporal distortion across the frames. This is illustrated in
Fig. 2, where Fig. 2f is the scatter plot of D(f1r , f1t ) versus
D(f2r , f2t ) for the video sequences tr6 and tr13 over all the
frames of the video. This plot clearly shows the distorted
frames of the tr6 video sequence having high dispersion difference of both the features compared to the higher perceptual
quality frames of the tr13 video sequence. Figs. 2g and 2h
illustrate that the low quality video (high DMOS) has high
fluctuations in the dispersion of the features across frames
compared to the high quality video (low DMOS).
The physical meaning of the difference in dispersion is as
follows: a higher D(f_1r^i, f_1t^i) for a frame implies that a majority
of the frame has inter-patch inconsistencies and that the inconsistency
is spread across the frame. This in turn implies irregular
motion in a few patches, resulting in non-uniform motion in
the frame. A higher D(f_2r^i, f_2t^i) implies greater distortion within
the patches, implying random or haphazard flow in the patch.
It is important to note that D(xr , xt ) presents a frame level
measure of distortion using patch level features but without
making use of patch-wise correspondence. In other words, this
measure does not directly compare the statistics of a test patch
with the corresponding reference patch statistics. Rather, the
comparison happens at the frame level based on the statistics


Fig. 2: Illustration of the effectiveness of the selected features. (a) 96th frame of the tr13 sequence with a DMOS of 33.47 (good quality). (b) 96th frame of the tr6 sequence with a DMOS of 73.473 (bad quality). (c) 7×7 block showing the flow regularity in the 96th frame of a low DMOS (good quality) video (μ = 5.72, σ = 0.057, λ_min = 0.0021). (d) 7×7 block showing the flow irregularity in the 96th frame of a high DMOS (bad quality) video (μ = 8.937, σ = 3.017, λ_min = 23.6829). (e) Scatter plot of feature 1 (μ) and feature 2 (σ) in the 96th frame of the tr1 (reference) and tr6 (high DMOS) sequences. (f) Amount of dispersion difference of feature 1 (μ) and feature 2 (σ) for the tr13 (low DMOS) and tr6 (high DMOS) sequences. (g) Dispersion difference of feature 1 (μ) across frames in tr13 (low DMOS) and tr6 (high DMOS). (h) Dispersion difference of feature 2 (σ) across frames in tr13 (low DMOS) and tr6 (high DMOS).


of the pool of patch level statistics. This observation plays a


crucial role in our frame classification and pooling strategy.
Further, undistorted flow patches that depict object motion
exhibit consistent flow vectors while distorted flow patches
exhibit random flow vectors as shown in Fig. 2. Fig. 2d shows
the randomness of the flow in the distorted patch. An eigen
decomposition of the local flow vectors in a patch gives the
direction of major and minor flow components. To perform
this eigen decomposition, the covariance matrix of the flow
components (i.e., the horizontal and vertical components of
the flow patch) is found first. The major flow component
is represented by the dominant eigenvalue of the covariance
matrix while the minor flow component is represented by the
smaller eigenvalue [55]. The smaller (or minimum) eigenvalue
is a measure of the randomness of the flow that effectively
quantifies distortion. The deviation in the overall trend of the
minimum eigenvalues of the distorted patch from the reference
patch is measured to compute the non-uniformity in the flow of
a frame. This deviation in trend is quantified by the correlation
of the minimum eigenvalues of the reference and distorted
patches in the frame and is defined as

C(f_3r^i, f_3t^i) = 1 - corr(f_3r^i, f_3t^i),    (6)

where, as before, the subscripts r and t denote the reference
and test sets respectively, and corr(x, y) is the correlation
coefficient between the data vectors x and y. Unlike D(x_r, x_t),
C(f_3r^i, f_3t^i) measures frame-level distortion by explicitly doing
a patch-wise comparison, thereby capturing deviations in local
flow behavior.
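A one-line sketch of (6), assuming the per-patch minimum-eigenvalue vectors f_3r^i and f_3t^i have been computed as above; the function name is illustrative.

```python
import numpy as np

def eigenvalue_trend_distortion(l3_ref, l3_test):
    """C(f3_r, f3_t) = 1 - corr(f3_r, f3_t): deviation of the minimum-eigenvalue
    trend of the distorted patches from that of the reference patches (Eq. 6)."""
    # np.corrcoef returns the 2x2 correlation matrix; entry [0, 1] is the coefficient.
    return 1.0 - np.corrcoef(l3_ref, l3_test)[0, 1]
```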
B. Stage 2: Spatial Quality Assessment
As mentioned previously, any robust FR-IQA algorithm can
be used for spatial quality assessment. We chose to work
with the MS-SSIM index in our reference implementation.
The MS-SSIM index [51] is an extension of the single scale
SSIM index [34] that measures image similarity at multiple
spatial scales. The primary motivation for this extension was
to handle image details at different resolutions. It computes
contrast and structural similarity at all spatial scales and
luminance similarity only at the coarsest spatial resolution.
The overall score is computed by a product of the scores at
each spatial scale that are raised to an exponent. The exponents
at each scale are used to assign different levels of importance
to different scales. The strength of the MS-SSIM index has
been reported by several authors with the most comprehensive
evaluation provided by Sheikh et al. [56]. We leverage the
strength of the MS-SSIM index to measure the spatial quality
of video frames by applying it on a frame-by-frame basis.
The spatial quality Q_i of the ith test frame V_t^i relative to the
reference frame V_r^i is defined as

Q_i = 1 - MS-SSIM(V_r^i, V_t^i).    (7)
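A minimal sketch of (7). The paper's reference implementation uses the MS-SSIM index; the single-scale SSIM from scikit-image is used here purely as a stand-in, since the framework accepts any robust FR-IQA index.

```python
from skimage.metrics import structural_similarity

def spatial_distortion(ref_frame, test_frame):
    """Q_i = 1 - similarity(reference frame, test frame), cf. Eq. (7).

    ref_frame, test_frame: 2-D grayscale arrays with values in [0, 255].
    Single-scale SSIM is used here as a substitute for MS-SSIM."""
    return 1.0 - structural_similarity(ref_frame, test_frame, data_range=255)
```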

C. Stage 3: Pooling
While the features mentioned in the previous section capture
the changes in the local flow statistics, the effectiveness of a

good FR-VQA metric lies in its ability to temporally localise


the distortions. Distortions that occur in bursts or that affect
frame rate have proven to be challenging for VQA algorithms
[33]. For instance, if a small percentage of frames in a video
are of low quality and the majority of them are of very good
quality, according to visual psychophysics the observer gives
a low quality rating to the video [33]. To overcome such
issues, temporally adaptive thresholds are chosen so as to
distinguish between perceptually visible and invisible distortion. Further, a non-linear pooling strategy is also suggested
in [33]. The issue of variable frame rate or frame delay was
clearly identified and addressed first by Wolf and Pinson [48]
in their solution to the variable frame delay issue. We propose
a pooling strategy to handle these issues effectively.
We first present a method to classify frames according to
their temporal importance. Each frame is then assigned a
temporal quality score in a non-linear fashion based on its
temporal importance (or its class). The spatial score for each
frame is computed next using a robust FR-IQA algorithm and
pooled with the temporal score in a multiplicative fashion.
At the outset, we note that only the temporal features f1
and f2 are employed in our temporal importance classification
strategy. This is motivated by the fact that large differences
in dispersion (i.e., large values of D(xr , xt )) of these features
are clear indicators of temporal distortion. As noted earlier,
the definition of dispersion does not depend on pair-wise patch
comparison (i.e., is unordered) but rather measures behavior
at the frame level. Importantly, this also ensures that frames
that have uniform (or consistent) distortion are not classified
as having high perceptual annoyance [14]. We illustrate our
strategy in Fig. 3. Fig. 3d illustrates the fluctuations in
these features across frames of varying levels of distortions.
However, not all distortions are perceivable. For instance, the
distortions are not perceivable in Fig. 3a, as compared to the
obvious distortion in Fig. 3b. This can be explained by the fact
that there is a sudden onset of distortion in Fig. 3b relative
to Fig. 3a and no new distortion in Fig. 3c compared to Fig.
3b. Hence, the perceptibility of the distortion is a function
of the temporally proximate frames. Therefore, the frames
should be categorized based on a threshold that is designed
by taking into account the temporal neighbours. The median
of the dispersion difference of temporally proximate frames
centered at the frame under consideration is chosen as the
threshold and sets a bar on the amount of distortion that is
visually perceived.
Therefore, the temporal thresholds for the per-frame features
are computed using f_1^i and f_2^i, and are defined as follows:

τ_{f_1}^i = median([D(f_1r^{i-1}, f_1t^{i-1}), D(f_1r^i, f_1t^i), D(f_1r^{i+1}, f_1t^{i+1})]),    (8)

τ_{f_2}^i = median([D(f_2r^{i-1}, f_2t^{i-1}), D(f_2r^i, f_2t^i), D(f_2r^{i+1}, f_2t^{i+1})]).    (9)
We draw inspiration from our previous work [52] in using
these thresholds for a frame's temporal importance classification. The basic idea is to classify every frame into one of four
classes based on its contribution to the amount of perceivable
distortion. We call these classes R1, R2, R3, and R4, and define
them as follows.


Fig. 3: (a) 16th frame of the pa4 sequence. (b) 17th frame of the pa4 sequence, where the distortion sets in. (c) 18th frame of the pa4 sequence, where the distortion persists. (d) Frame 17, where both feature 1 and feature 2 have high dispersion compared to their temporal neighbours; the frame therefore falls into the first quadrant (R1) while pooling.

Fig. 4: (a) 121st frame of the pa4 sequence. (b) 122nd frame of the pa4 sequence, where the distortion sets in. (c) 123rd frame of the pa4 sequence, where the distortion persists. (d) Frame 122, where feature 1 has high dispersion compared to its temporal neighbours; the frame therefore falls into the fourth quadrant (R4) while pooling.

Fig. 5: (a) 21st frame of the pa4 sequence. (b) 22nd frame of the pa4 sequence, where the distortion sets in. (c) 23rd frame of the pa4 sequence, where the distortion persists. (d) Frame 22, where feature 2 has high dispersion compared to its temporal neighbours; the frame therefore falls into the second quadrant (R2) while pooling. Illustration of the pooling strategy.
A test video frame V_t^i is classified into one of these classes
according to the following rule:

R1 = (D(f_1r^i, f_1t^i) > τ_{f_1}^i) & (D(f_2r^i, f_2t^i) > τ_{f_2}^i),
R2 = (D(f_1r^i, f_1t^i) < τ_{f_1}^i) & (D(f_2r^i, f_2t^i) > τ_{f_2}^i),
R3 = (D(f_1r^i, f_1t^i) < τ_{f_1}^i) & (D(f_2r^i, f_2t^i) < τ_{f_2}^i),
R4 = (D(f_1r^i, f_1t^i) > τ_{f_1}^i) & (D(f_2r^i, f_2t^i) < τ_{f_2}^i),

where & is the logical AND operator.

R1: Points to the region with a high dispersion difference for both features, indicating non-uniform motion across the patches in a frame and a large fraction of patches having irregularity within them. In short, a frame is classified into this region if it has high irregularity at the intra- and inter-patch level compared to its temporal neighbours. Therefore, the frames that fall into this class are visually the most annoying.
R2: Points to the region with a higher dispersion difference in feature 2, which implies that such a frame exhibits intra-patch irregularity. These frames have localised distortions.
R3: Points to the region of acceptable temporal distortion.
R4: Points to the region having high inter-patch distortion and low intra-patch distortion, which implies that the distortion is spread across the frame.


Figs. 3, 4 and 5 vividly illustrate the visual effectiveness
of the classification strategy. Fig. 3a shows frame number 16
from the pa4 sequence, which has no perceivable distortion;
hence the plot in Fig. 3d depicts lower values of feature 1 and
feature 2 compared to its immediate temporal neighbours. Fig.
3b shows frame number 17 from the same sequence, where a
perceivable distortion has set in; Fig. 3d shows higher values
for both feature 1 and feature 2 compared to its immediate
temporal neighbours. Therefore, this frame is assigned region
R1 while pooling. Subsequently, Fig. 3c shows frame number
18, where the distortion persists from frame 17, and therefore
Fig. 3d shows lower values for the features.
Similarly, Fig. 4d at frame 122 shows high dispersion of
feature 1 compared to its immediate temporal neighbours,
whereas feature 2 has low dispersion compared to its temporal
neighbours; the frame is hence assigned region R4 while pooling.
The same argument applies to Fig. 5d, where frame 22
shows high dispersion of feature 2 compared to its immediate
temporal neighbours while feature 1 has low dispersion
compared to its temporal neighbours; the frame is hence assigned R2
while pooling.
At this point, we recall that C(f_3r, f_3t) computes the
temporal quality of the distorted frame by performing a patch-wise comparison of the minimum eigenvalues with the reference frame. Therefore, this measure captures local similarities
between the reference and distorted frames. We hypothesize
that a non-linear combination of the frame-level similarity
(D(x_r, x_t)) and the patch-wise similarity (C(f_3r, f_3t)) results
in a good overall temporal distortion measure. We propose the
following heuristic to achieve this goal. The temporal score G_i
for the ith test frame V_t^i is given by

        (D(f_1r^i, f_1t^i) + D(f_2r^i, f_2t^i)) · C(f_3r^i, f_3t^i),   V_t^i ∈ R1
G_i =   D(f_2r^i, f_2t^i) · C(f_3r^i, f_3t^i),                         V_t^i ∈ R2     (10)
        C(f_3r^i, f_3t^i),                                             V_t^i ∈ R3
        D(f_1r^i, f_1t^i) · C(f_3r^i, f_3t^i),                         V_t^i ∈ R4

The choice of the frame-level score depends on the frame class.
C(f_3r^i, f_3t^i) is a common factor since it is the only measure of
patch-wise similarity. Specifically, in region R1, irregularities
are high at both the intra- and inter-patch level and hence
both D(f_1r^i, f_1t^i) and D(f_2r^i, f_2t^i) are used for the temporal score
computation. Similarly, in R2 and R4, the regions corresponding to high intra- and inter-patch irregularities respectively, the
corresponding feature scores D(f_2r^i, f_2t^i) and D(f_1r^i, f_1t^i) are
used for the score computation. Finally, in R3, the region where
both the intra- and inter-patch irregularities are low, there
is still a need to account for patch-wise similarity with the
reference video and hence only C(f_3r^i, f_3t^i) is used for the score
computation.
We pool the spatial and temporal scores to come up with a
spatio-temporal score according to

H_i = G_i · Q_i,    (11)

where H_i is the spatio-temporal score assigned to the ith
frame. As discussed in Section I, the responses of the
neurons in the visual cortex are almost separable in the spatial
and temporal fields [50], and hence the spatio-temporal score
is obtained by a simple product of the spatial and temporal
scores.
The final quality score for the video is given by the weighted
mean of the spatio-temporal scores H_i assigned to each frame:

FLOSIM = Σ_{i=1}^{4} w_i Σ_{j ∈ Ri} H_j,    (12)

w_i = N_{Ri} / N,   i ∈ {1, 2, 3, 4},    (13)

where w_i is the weight assigned to each frame based on the
region to which it belongs, N_{Ri} is the number of frames in
region Ri, and N is the total number of frames in the video.
The temporal quality score T for the video is calculated by
taking the mean of the temporal scores of the frames and is
given by

T = (1/N) Σ_{i=1}^{N} G_i.    (14)

Similarly, the spatial quality score S for a video is given by

S = (1/N) Σ_{i=1}^{N} Q_i,    (15)

where Q_i is the spatial quality of the ith frame of the video
obtained from (7) and N is the total number of frames in the
video. The temporal and spatial scores (T and S respectively)
are defined and used primarily for performance comparison.
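Putting (10)–(13) together, a sketch of the per-frame temporal score and the final pooled score, assuming the per-frame arrays produced by the earlier sketches (dispersion differences d1 and d2, the eigenvalue term c3 from (6), the spatial term q from (7), and the labels from classify_frames); the function name is illustrative.

```python
import numpy as np

def flosim_score(d1, d2, c3, q, labels):
    """Per-frame temporal scores G_i (Eq. 10), spatio-temporal scores H_i (Eq. 11),
    and the final weighted pooling over the regions (Eqs. 12-13)."""
    n = len(q)
    g = np.empty(n)
    for i, region in enumerate(labels):
        if region == "R1":
            g[i] = (d1[i] + d2[i]) * c3[i]
        elif region == "R2":
            g[i] = d2[i] * c3[i]
        elif region == "R4":
            g[i] = d1[i] * c3[i]
        else:                       # R3
            g[i] = c3[i]
    h = g * np.asarray(q)           # Eq. (11): H_i = G_i * Q_i
    score = 0.0
    for region in ("R1", "R2", "R3", "R4"):
        idx = [i for i, r in enumerate(labels) if r == region]
        w = len(idx) / n            # Eq. (13): w = N_R / N
        score += w * h[idx].sum()   # Eq. (12)
    return score
```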
D. Distortion Map
Another embellishment of the proposed solution is that it
provides a spatio-temporal distortion map. The distortion map
per frame is formed by taking the product of the difference
of the flow features (f_1^i, f_2^i) of the distorted patches and the
reference patches:

m_i = [f_1r^i - f_1t^i] ∘ [f_2r^i - f_2t^i],

where the difference and the product (∘) are performed element-wise.


The 2D map is obtained by appropriately reshaping the vector m_i.
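A short sketch of the per-frame distortion map, assuming the per-patch feature vectors of the reference and distorted frames and the patch-grid dimensions are available; names are illustrative.

```python
import numpy as np

def distortion_map(f1_ref, f1_test, f2_ref, f2_test, grid_rows, grid_cols):
    """Element-wise product of the per-patch feature differences, reshaped to
    the patch grid, giving one distortion value per patch location."""
    m = (np.asarray(f1_ref) - np.asarray(f1_test)) * (np.asarray(f2_ref) - np.asarray(f2_test))
    return m.reshape(grid_rows, grid_cols)
```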
The map is a visual representation of the spatial distortion
occurring due to temporal inconsistency between frames. Fig.
6 depicts the effectiveness of the map obtained from the
proposed approach, where Fig. 6c shows the distortion map
of the 96th frame for a good quality video with low DMOS
(tr13) and Fig. 6d shows the distortion map of the 96th frame
for a bad quality video with high DMOS (tr6). The annoying
regions are clearly visible in the distortion map. In contrast,
Fig. 6e represents a map obtained by taking the MSE of the
luminance values of the 96th frame of the reference video


(tr1) and the less distorted video (tr13) which shows spurious
regions which are not perceivable distortions. Fig. 6f shows a
MSE map of a more distorted frame (tr6) with the reference
(tr1) luminance values where the localisation of distortions
are not clearly seen. Fig. 6g is MSE map for the flow of 96th
frame of reference (tr1) and less distorted (tr13) sequences
which highlights areas that do not appear visually distorted.
Fig. 6h is MSE map for the flow of 96th frame of reference
(tr1) and more distorted (tr6) sequences which misses out on
the more annoying regions. It is clear from these error maps
that the proposed distortion map is able to localize distortions
that neither the MSE map of luminance nor the MSE map
of the flow is able to. The ability of the distortion map to
discern distortion is limited by the patch size used for feature
computation.
IV. RESULTS AND DISCUSSION
We report the performance of our algorithm on two SD
databases, the LIVE video database [57] and the EPFL-PoliMI
database [58]–[60], and on an HD database, the LIVE Mobile
VQA database [61], [62]; all three are described in Table I. The
database independence of the proposed approach is validated
by the results on these three databases. To demonstrate the
robustness of the algorithm, and to show that our approach
does not depend on the type of flow used, we report results
using different flow algorithms. Finally, the effectiveness of
our approach is shown by replacing optical flow with the much
coarser motion vectors. Such a replacement still results in an
acceptable FR-VQA algorithm.
The LIVE SD video database has 10 reference videos and
150 distorted videos. The distortions include wireless distortions, IP distortions, and H.264 and MPEG2 compression artifacts.
The resolution of all the LIVE videos is 768×432. Every
reference sequence has 15 distorted videos, each of which corresponds to one of the 4 distortion types at varied levels. The EPFL-PoliMI database contains 156 sequences in total, of which 12
are reference sequences and 144 are distorted sequences. Half
of the sequences are of CIF resolution (352×288) and the rest
are of 4CIF resolution (704×576). The reference sequences are
encoded in the H.264 format and these bitstreams are subjected
to 6 different packet loss rates (0.1%, 0.4%, 1%, 3%, 5%,
10%) to produce the distorted bitstreams. We used the MOS
scores provided by EPFL to tabulate the results in this paper,
for consistency in comparing performance with the
MOVIE index [14]. The LIVE Mobile VQA database consists
of 10 HD reference videos and 200 distorted videos. The
distortions include compression, wireless packet loss, frame-freeze, rate adaptation, and temporal dynamics, per reference. Each
video is of HD resolution (1280×720) at a frame rate of 30
fps and has a duration of 15 seconds. We have omitted the frame-freeze distortion in the performance analysis on this database.

Database         | Resolution                        | Frame Rate | Number of videos | Distortions
LIVE SD          | 768×432                           | 25/50 fps  | 150              | Wireless distortions, IP distortions, H.264 and MPEG2 compression artifacts
EPFL-PoliMI      | CIF (352×288) and 4CIF (704×576)  | 30 fps     | 144              | 6 different packet loss rates (0.1%, 0.4%, 1%, 3%, 5%, 10%)
LIVE Mobile VQA  | 1280×720                          | 30 fps     | 160              | Compression, wireless packet-loss, rate-adaptation, temporal dynamics

TABLE I: Description of the databases used for evaluation.


[65]. Since brightness constancy and spatial smoothness constraint violations are common issues in flow estimation algorithms, the Black and Anandan flow estimation algorithm
addresses these violations using a robust estimation approach.
This increases its accuracy while it is comparable in efficiency
with other flow algorithms. The Farneback [65] approach is
based on describing the image structure by a second order
polynomial and estimating the displacement by observing the
polynomial transforms. Additionally, we have also considered
replacing the optical flow with motion vectors. The proposed
approach is tested using more than one flow algorithm and
with motion vectors to demonstrate the robustness of the
approach.
The patch size for temporal feature extraction is chosen
empirically to be 77. Specifically, we experimented with
patch sizes varying from 5 5 to 9 9 in steps of 2 and
found 7 7 to give best performance. A frame of size M N
is divided into non-overlapping patches of size 77. The flow
features |.| , |.| and |.| are computed for every patch in the
frame resulting in (M/7N/7) element per feature vector per
frame. The amount of dispersion of the distorted features from
the reference features is measured using (5). The thresholds
for the temporal features are computed using (8) and (9).
The correlation in the minimum eigenvalue trend is computed
according to (6) and is used in weighting the score of each
frame. FLOSIM has the ability to work with any robust FRIQA method. We get the best performance using the MS-SSIM
index. We report results on other popular FR-IQA algorithms
including the SSIM index [34], the VIF index [43], and the
FSIM index [66]. We would like to note that none of the FRIQA methods have been altered in our implementation.
For the performance evaluation with motion vectors, block
motion estimation was done using the adaptive rood pattern
search algorithm [67] for a macroblock of size 44. For
a frame of size M N , M/4 N/4 motion vectors are
available. To compute the temporal features, 22 such blocks
are combined resulting in a patch size of 88 as opposed to
the 77 patch size used in the flow based algorithm.

A. Feature Computation
The choice of the flow algorithm is a trade off between
efficiency and accuracy. We have worked with two flow algorithms; Black and Anandan algorithm [63], [64], Farneback

B. Performance Evaluation
The performance of the algorithm is tested on the three
databases specified previously. Table II shows the performance


of the proposed approach on the LIVE SD database and


the EPFL-PoliMI database in terms of the Linear Correlation
Coefficient (LCC) after applying the logistic fit mentioned in
[68], and the Spearman Rank Order Correlation Coefficient
(SROCC). It compares the performance of FLOSIM with
the state-of-the-art FR-VQA algorithms as well as FR-IQA
algorithms (applied on a frame-by-frame basis and averaged).
It is clear from this table that FLOSIM performs very competitively with the state-of-the-art methods including the MOVIE
index [53], the video quality pooling method applied to the
MOVIE index [14], and the ST-MAD method [47]. We present
FLOSIM's performance on individual distortions in the LIVE
SD database in Tables III and IV. These tables demonstrate
FLOSIM's consistent and competitive performance across
distortion types.
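For reference, a sketch of the standard evaluation protocol used to produce such numbers: the objective scores are passed through a monotonic logistic fit before computing the LCC, while the SROCC is computed on the raw scores. The 4-parameter logistic below is one common choice; the exact form prescribed in [68] may differ, and the function names are illustrative.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import pearsonr, spearmanr

def logistic(x, b1, b2, b3, b4):
    # A common 4-parameter logistic mapping from objective scores to subjective scores.
    return b1 / (1.0 + np.exp(-b2 * (x - b3))) + b4

def evaluate(objective, dmos):
    """Return (LCC after the logistic fit, SROCC) over a set of videos."""
    objective, dmos = np.asarray(objective, float), np.asarray(dmos, float)
    p0 = [dmos.max() - dmos.min(), 1.0, objective.mean(), dmos.min()]  # rough initial guess
    params, _ = curve_fit(logistic, objective, dmos, p0=p0, maxfev=20000)
    lcc, _ = pearsonr(logistic(objective, *params), dmos)
    srocc, _ = spearmanr(objective, dmos)
    return lcc, srocc
```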
Table V presents an important contribution of this work,
namely its flexibility to accommodate any robust FR-IQA
algorithm. It presents FLOSIM's performance using various
popular FR-IQA algorithms (MS-SSIM index [51], SSIM
index [34], VIF index [56], and the FSIM index [66]) in
terms of LCC, SROCC, OR (Outlier Ratio) and RMSE (Root
Mean Squared Error) across the two flow algorithms. The
effectiveness of the features and the pooling strategy is clearly
evident from the table which demonstrates the improved
performance of the spatio-temporal metric compared to using
only spatial metrics, irrespective of the spatial metric used. For
example, this table shows that the FSIM index is also effective in
the proposed framework. Further, the outlier ratio is as low as
0.0133, demonstrating the consistency of the algorithm across
a variety of video content.
Table VI shows the performance of FLOSIM on the LIVE
Mobile database. Again, its performance is very competitive
for both the Mobile and the Tablet cases. Specifically, it
outperforms the state-of-the-art VQM VFD algorithm [49] in
the Mobile case and is close to it in the Tablet case. Table VII
shows yet another important facet of the proposed method. It
demonstrates that the FLOSIM strategy works effectively even
when optical flow vectors are replaced by the much coarser
motion vectors. FLOSIM's performance using motion vectors
is better than the best spatial metric and the vanilla VQM
algorithm on the LIVE SD database. It performs competitively
on the Polimi database as well. This also presents a fast
and efficient method for implementing a coarse FR-VQA
algorithm.
Tables VIII and IX compare FLOSIM's performance with
the benchmark MOVIE index in terms of improvements in
correlation and reduction in computational complexity. While
both methods compute flow, FLOSIM does so only at one scale
thereby contributing to reduced complexity. The reference C++
MOVIE implementation was used in this comparison. The
computation time was found by instrumenting this code on
a system running Windows 7 with 16 GB RAM and a 3.40
GHz Intel Core i7 processor. FLOSIM's computation time was
computed similarly. From these tables it is clear that FLOSIM
achieves better performance at a lower computational cost.
We now discuss the effectiveness of the proposed temporal
features in terms of their contribution to FLOSIM's performance. Our hypothesis for the proposed algorithm was that

local flow statistics are affected in the presence of distortions.


We defined three features to represent the local flow statistics
and visually demonstrated their ability to capture distortions in
Fig. 2. Table X reports the statistical performance of each of
these features individually and in combination with the MS-SSIM index on the SD databases. We conclude that while
(f1 + f2) and f3 are quite effective (in combination with MS-SSIM), the proposed pooling strategy results in the best overall
performance.
Metrics
Only PSNR
Only MS-SSIM
Only SSIM
Only VIF
Only FSIM
MOVIE
MOVIE with VQ
Pooling [14]
STMAD
VQM
BA
FLOSIM with
Classic
MS-SSIM
Farne

SD databases
LIVE SD
LCC
SROCC
0.4035
0.3684
0.7642
0.7482
0.5498
0.5381
0.5721
0.574
0.7376
0.7278
0.8116
0.789

Polimi
LCC
SROCC
0.8475
0.9034
0.812
0.963
0.786
0.9098
0.896
0.956
0.863
0.965
0.9302
0.9203

0.8611

0.8427

0.9422

0.9422

0.8299
0.7236

0.8242
0.7026

0.8433

0.8375

0.859

0.8537

0.956

0.965

0.8236

0.8227

0.956

0.9674

TABLE II: Comparison of performance of the proposed FR-VQA


algorithm with standard VQ Metrics on SD databases.

Distortions (LIVE SD)        | Wireless | IP     | H264   | MPEG2  | All
FLOSIM (BA)                  | 0.874    | 0.823  | 0.935  | 0.836  | 0.859
FLOSIM (Farne)               | 0.8495   | 0.75   | 0.8491 | 0.7431 | 0.8236
MOVIE [14]                   | 0.8386   | 0.7622 | 0.7902 | 0.7595 | 0.8116
STMAD [47]                   | 0.8123   | 0.79   | 0.9097 | 0.8422 | 0.8299
MOVIE with VQ Pooling [14]   | 0.8502   | 0.8015 | 0.8444 | 0.8453 | 0.8611

TABLE III: LCC comparison with the state-of-the-art across distortion types in the LIVE SD database.
types in the LIVE SD database.

Distortions (LIVE SD)        | Wireless | IP     | H264   | MPEG2  | All
FLOSIM (BA)                  | 0.8672   | 0.7637 | 0.9394 | 0.8204 | 0.8537
FLOSIM (Farne)               | 0.8396   | 0.6801 | 0.8394 | 0.721  | 0.8227
MOVIE [14]                   | 0.8109   | 0.7157 | 0.7664 | 0.7733 | 0.789
STMAD [47]                   | 0.806    | 0.7686 | 0.9043 | 0.8478 | 0.8242
MOVIE with VQ Pooling [14]   | 0.8026   | 0.806  | 0.8309 | 0.8504 | 0.8427

TABLE IV: SROCC comparison with the state-of-the-art across distortion types in the LIVE SD database.

V. CONCLUSIONS
We presented a simple and novel optical flow based FR-VQA algorithm that is highly competitive with the state-of-the-art methods. We demonstrated that local flow statistics
and their dispersion form good features for assessing temporal
quality. We showed that the proposed approach supports any
robust FR-IQA algorithm for spatial quality assessment. We
reported our results using the MS-SSIM index as a representative for robust FR-IQA algorithms for measuring spatial


Fig. 6: Distortion map representing annoying regions. (a) 96th frame of the tr13 sequence with a DMOS of 33.473 (good quality). (b) 96th frame of the tr6 sequence with a DMOS of 73.473 (bad quality). (c) Distortion map for the 96th frame of tr13 (low DMOS). (d) Distortion map for the 96th frame of tr6 (high DMOS); it clearly shows the regions of disturbance. (e) MSE of the 96th frame of the reference (tr1) and the less distorted sequence (tr13). (f) MSE of the 96th frame of the reference (tr1) and the more distorted sequence (tr6). (g) MSE of the flow of the 96th frame of the reference (tr1) and the less distorted sequence (tr13). (h) MSE of the flow of the 96th frame of the reference (tr1) and the more distorted sequence (tr6).

TABLE V: Performance of popular spatial metrics on the LIVE SD and Polimi databases. The proposed FLOSIM algorithm was implemented using two flow algorithms: BA and Farne. The LIVE SD results are:

Metric mode                     LCC      SROCC    OR       RMSE
Temporal-only                   0.6439   0.6271   0.0267   8.39
Spatial-only (MS-SSIM)          0.7642   0.7482   0.0200   7.079
FLOSIM (BA) with MS-SSIM        0.859    0.8537   0.0133   5.62
FLOSIM (Farne) with MS-SSIM     0.8236   0.8227   0.0333   6.23
Spatial-only (SSIM)             0.5498   0.5381   0.0467   9.17
FLOSIM (BA) with SSIM           0.6513   0.6280   0.0333   8.833
FLOSIM (Farne) with SSIM        0.6391   0.6077   0.033    8.44
Spatial-only (VIF)              0.5721   0.5740   0.0400   9.000
FLOSIM (BA) with VIF            0.628    0.6196   0.033    8.543
FLOSIM (Farne) with VIF         0.637    0.6363   0.0467   8.46
Spatial-only (FSIM)             0.7376   0.7278   0.0067   7.41
FLOSIM (BA) with FSIM           0.8236   0.8213   0.0133   6.23
FLOSIM (Farne) with FSIM        0.8111   0.8055   0.0133   6.42

The table also lists the corresponding LCC, SROCC, OR, and RMSE values on the Polimi database.

TABLE VI: Performance comparison of the proposed FR-VQA algorithm with the state-of-the-art metrics on the LIVE Mobile VQA database. For each metric, LCC, SROCC, and RMSE are reported separately on the Mobile and Tablet subsets. The metrics compared are PSNR [49], MS-SSIM, SSIM [49], VIF [49], MOVIE [49], STMAD, VQM [49], VQM-VFD [49], and the proposed FLOSIM with MS-SSIM.

TABLE VII: Performance of the proposed metric using MVs on the LIVE SD and Polimi databases. The LIVE SD results are:

FLOSIM variant   LCC      SROCC    OR       RMSE
MV + MS-SSIM     0.7824   0.7686   0.0133   6.836
MV + SSIM        0.5194   0.4901   0.08     9.380
MV + VIF         0.4675   0.467    0.08     9.7037
MV + FSIM        0.7208   0.7163   0.02     7.6084

The corresponding LCC, SROCC, OR, and RMSE values on the Polimi database are also reported in the table.
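Since Table VII replaces dense optical flow with block motion vectors, a brief sketch of block motion estimation follows. The paper obtains MVs with the adaptive rood pattern search of [67]; the sketch instead uses a brute-force full search with illustrative block-size and search-range settings, purely to show the kind of coarse, per-block motion field being substituted for dense flow.

import numpy as np

def block_motion_vectors(prev, curr, block=16, search=8):
    prev = prev.astype(np.float32)
    curr = curr.astype(np.float32)
    h, w = curr.shape
    mvs = np.zeros((h // block, w // block, 2), dtype=np.float32)
    for by in range(h // block):
        for bx in range(w // block):
            y0, x0 = by * block, bx * block
            target = curr[y0:y0 + block, x0:x0 + block]
            best, best_mv = np.inf, (0.0, 0.0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y1, x1 = y0 + dy, x0 + dx
                    if y1 < 0 or x1 < 0 or y1 + block > h or x1 + block > w:
                        continue
                    cand = prev[y1:y1 + block, x1:x1 + block]
                    sad = np.sum(np.abs(target - cand))  # sum of absolute differences
                    if sad < best:
                        best, best_mv = sad, (dx, dy)
            mvs[by, bx] = best_mv
    return mvs  # one (dx, dy) vector per block, a subsampled stand-in for dense flow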
TABLE VIII: Percentage of improvement compared to the MOVIE index (LIVE SD database).

Metrics               LCC      SROCC
MOVIE                 0.8116   0.789
FLOSIM (BA)           0.859    0.8537
FLOSIM (Farne)        0.8236   0.8227
Improvement (BA)      5.84%    8.20%
Improvement (Farne)   1.48%    4.27%
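The improvement percentages in Table VIII follow directly from the LCC and SROCC values as improvement = (FLOSIM / MOVIE - 1) x 100. A quick check (not part of the paper):

movie = {"LCC": 0.8116, "SROCC": 0.789}
flosim = {"BA": {"LCC": 0.859, "SROCC": 0.8537},
          "Farne": {"LCC": 0.8236, "SROCC": 0.8227}}
for variant, vals in flosim.items():
    for measure, value in vals.items():
        gain = (value / movie[measure] - 1.0) * 100.0
        print(f"{variant} {measure}: {gain:.2f}%")  # BA: 5.84% / 8.20%, Farne: 1.48% / 4.27%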


TABLE IX: Complexity comparison of FLOSIM with the MOVIE index (LIVE SD database).

Metric           Time taken (s)
MOVIE            11438
FLOSIM (BA)      3816
FLOSIM (Farne)   396.4117

In relative terms, FLOSIM (BA) requires 33.36% of the MOVIE runtime, while MOVIE requires 2885.38% of the FLOSIM (Farne) runtime.
TABLE X: Performance of the individual features on the LIVE SD and Polimi databases.

                                                 LIVE SD                       Polimi
Features                                         LCC      SROCC    RMSE       LCC      SROCC    RMSE
f1                                               0.2738   0.2049   10.55      0.6836   0.6239   0.912
f2                                               0.2276   0.1984   10.68      0.6098   0.6787   0.991
f3                                               0.46     0.452    9.747      0.848    0.9158   0.662
f1 with MS-SSIM                                  0.5563   0.5302   9.122      0.7828   0.7684   0.778
f2 with MS-SSIM                                  0.4016   0.3935   10.05      0.0409   0.0584   1.249
f3 with MS-SSIM                                  0.855    0.8458   5.69       0.9568   0.9655   0.363
(f1 + f2) with MS-SSIM (frame classification)    0.8431   0.8353   5.91       0.9584   0.9572   0.357
FLOSIM                                           0.859    0.8537   5.62       0.956    0.965    0.36



REFERENCES
[1] Cisco Inc., Cisco visual networking index: Global mobile data traffic forecast update, 2011-2016, February 2012.
[2] F. S. Chance, S. B. Nelson, and L. F. Abbott, Synaptic depression and the temporal response characteristics of V1 cells, The Journal of Neuroscience, vol. 18, no. 12, pp. 4785-4799, 1998.
[3] E. P. Simoncelli and D. J. Heeger, A model of neuronal responses in visual area MT, Vision Research, vol. 38, no. 5, pp. 743-761, 1998.
[4] J. Hegde and D. C. Van Essen, Selectivity for complex shapes in primate visual area V2, J. Neurosci., vol. 20, no. 5, pp. 61-66, 2000.
[5] J. Hegde and D. C. Van Essen, Temporal dynamics of shape analysis in macaque visual area V2, Journal of Neurophysiology, vol. 92, no. 5, pp. 3030-3042, 2004.
[6] A. Anzai, X. Peng, and D. C. Van Essen, Neurons in monkey visual area V2 encode combinations of orientations, Nature Neuroscience, vol. 10, no. 10, pp. 1313-1321, 2007.
[7] L. L. Lui, J. A. Bourne, and M. G. Rosa, Functional response properties of neurons in the dorsomedial visual area of new world monkeys (Callithrix jacchus), Cerebral Cortex, vol. 16, no. 2, pp. 162-177, 2006.
[8] R. T. Born and D. C. Bradley, Structure and function of visual area MT, Annu. Rev. Neurosci., vol. 28, pp. 157-189, 2005.
[9] L. S. Petro, L. Vizioli, and L. Muckli, Contributions of cortical feedback to sensory processing in primary visual cortex, Frontiers in Psychology, vol. 5, 2014.
[10] H.-J. Park and K. Friston, Structural and functional brain networks: from connections to cognition, Science, vol. 342, no. 6158, p. 1238411, 2013.
[11] B. A. Wandell, Foundations of Vision, vol. 8. Sinauer Associates, Sunderland, MA, 1995.
[12] K. Seshadrinathan and A. Bovik, Motion tuned spatio-temporal quality assessment of natural videos, Image Processing, IEEE Transactions on, vol. 19, pp. 335-350, Feb. 2010.
[13] Z. Wang and Q. Li, Video quality assessment using a statistical model of human visual speed perception, J. Opt. Soc. Am. A Opt. Image Sci. Vis., Dec. 2007.
[14] J. Park, K. Seshadrinathan, S. Lee, and A. C. Bovik, Video quality pooling adaptive to perceptual distortion severity, Image Processing, IEEE Transactions on, vol. 22, no. 2, pp. 610-620, 2013.

[15] R. Soundararajan and A. C. Bovik, Video quality assessment by reduced reference spatio-temporal entropic differencing, Circuits and Systems for Video Technology, IEEE Transactions on, vol. 23, no. 4, pp. 684-694, 2013.
[16] M. A. Saad, A. C. Bovik, and C. Charrier, Blind prediction of natural video quality, IEEE Transactions on Image Processing, vol. 23, no. 3, pp. 1352-1365, 2014.
[17] A. R. Reibman, V. A. Vaishampayan, and Y. Sermadevi, Quality monitoring of video over a packet network, Multimedia, IEEE Transactions on, vol. 6, no. 2, pp. 327-334, 2004.
[18] F. Yang, S. Wan, Y. Chang, and H. R. Wu, A novel objective no-reference metric for digital video quality assessment, Signal Processing Letters, IEEE, vol. 12, pp. 685-688, Oct. 2005.
[19] M. C. Farias and S. K. Mitra, No-reference video quality metric based on artifact measurements, in Image Processing, 2005. ICIP 2005. IEEE International Conference on, vol. 3, pp. III-141, IEEE, 2005.
[20] K.-C. Yang, C. Guest, K. El-Maleh, and P. Das, Perceptual temporal quality metric for compressed video, Multimedia, IEEE Transactions on, vol. 9, pp. 1528-1535, Nov. 2007.
[21] T.-L. Lin, S. Kanumuri, Y. Zhi, D. Poole, P. C. Cosman, and A. R. Reibman, A versatile model for packet loss visibility and its application to packet prioritization, Image Processing, IEEE Transactions on, vol. 19, no. 3, pp. 722-735, 2010.
[22] Y.-L. Chang, T.-L. Lin, and P. C. Cosman, Network-based H.264/AVC whole-frame loss visibility model and frame dropping methods, Image Processing, IEEE Transactions on, vol. 21, no. 8, pp. 3353-3363, 2012.
[23] S. A. Karunasekera and N. G. Kingsbury, A distortion measure for image artifacts based on human visual sensitivity, in Acoustics, Speech, and Signal Processing, 1994. ICASSP-94., 1994 IEEE International Conference on, pp. V-117, IEEE, 1994.
[24] C. J. V. D. B. Lambrecht, V. Bhaskaran, A. Kovalick, and M. Kunt, Automatically assessing MPEG coding fidelity, IEEE Design & Test of Computers, vol. 12, no. 4, pp. 28-33, 1995.
[25] C. J. van den Branden Lambrecht, Perceptual models and architectures for video coding applications, 1996.
[26] A. B. Watson, Toward a perceptual video-quality metric, in Photonics West '98 Electronic Imaging, pp. 139-147, International Society for Optics and Photonics, 1998.
[27] A. B. Watson, Q. J. Hu, J. F. McGowan III, and J. B. Mulligan, Design and performance of a digital video quality metric, in Electronic Imaging '99, pp. 168-174, International Society for Optics and Photonics, 1999.
[28] A. B. Watson and L. Kreslake, Measurement of visual impairment scales for digital video, in Photonics West 2001 - Electronic Imaging, pp. 79-89, International Society for Optics and Photonics, 2001.
[29] Z. Wang, Rate scalable foveated image and video communications, PhD thesis, Dept. of ECE, The University of Texas at Austin, Dec. 2001.
[30] B. L. Evans and W. S. Geisler, Rate scalable foveated image and video communications.
[31] Z. Wang, A. C. Bovik, and L. Lu, Why is image quality assessment so difficult?, in Acoustics, Speech, and Signal Processing (ICASSP), 2002 IEEE International Conference on, vol. 4, pp. IV-3313, IEEE, 2002.
[32] Z. Wang, H. R. Sheikh, and A. C. Bovik, Objective video quality assessment, The Handbook of Video Databases: Design and Applications, pp. 1041-1078, 2003.
[33] Z. Wang, L. Lu, and A. C. Bovik, Video quality assessment based on structural distortion measurement, Signal Processing: Image Communication, vol. 19, no. 2, pp. 121-132, 2004.
[34] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, Image quality assessment: From error visibility to structural similarity, Image Processing, IEEE Transactions on, vol. 13, no. 4, pp. 600-612, 2004.
[35] H. R. Sheikh and A. C. Bovik, A visual information fidelity approach to video quality assessment, in The First International Workshop on Video Processing and Quality Metrics for Consumer Electronics, pp. 23-25, 2005.
[36] H. R. Sheikh and A. C. Bovik, Image information and visual quality, Image Processing, IEEE Transactions on, vol. 15, no. 2, pp. 430-444, 2006.
[37] M. H. Pinson and S. Wolf, A new standardized method for objectively measuring video quality, Broadcasting, IEEE Transactions on, vol. 50, no. 3, pp. 312-322, 2004.
[38] D. M. Chandler and S. S. Hemami, VSNR: A wavelet-based visual signal-to-noise ratio for natural images, Image Processing, IEEE Transactions on, vol. 16, no. 9, pp. 2284-2298, 2007.
[39] Z. Wang, L. Lu, and A. C. Bovik, Video quality assessment based on structural distortion measurement, Signal Processing: Image Communication, vol. 19, no. 2, pp. 121-132, 2004.


[40] A. A. Stocker and E. P. Simoncelli, Noise characteristics and prior expectations in human visual speed perception, Nature Neuroscience, vol. 9, no. 4, pp. 578-585, 2006.
[41] K. Seshadrinathan and A. C. Bovik, A structural similarity metric for video based on motion models, in Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, vol. 1, pp. I-869, IEEE, 2007.
[42] K. Seshadrinathan and A. C. Bovik, An information theoretic video quality metric based on motion models, in Third International Workshop on Video Processing and Quality Metrics for Consumer Electronics, pp. 25-26, Citeseer, 2007.
[43] H. R. Sheikh and A. C. Bovik, Image information and visual quality, Image Processing, IEEE Transactions on, vol. 15, no. 2, pp. 430-444, 2006.
[44] A. Ninassi, O. Le Meur, P. Le Callet, and D. Barba, Considering temporal variations of spatial visual distortions in video quality assessment, Selected Topics in Signal Processing, IEEE Journal of, vol. 3, no. 2, pp. 253-265, 2009.
[45] A. Ninassi, O. Le Meur, P. Le Callet, and D. Barba, On the performance of human visual system based image quality assessment metric using wavelet domain, in SPIE Conference Human Vision and Electronic Imaging XIII, vol. 6806, pp. 6806101, 2008.
[46] E. C. Larson and D. M. Chandler, Most apparent distortion: full-reference image quality assessment and the role of strategy, Journal of Electronic Imaging, vol. 19, no. 1, pp. 011006, 2010.
[47] P. V. Vu, C. T. Vu, and D. M. Chandler, A spatiotemporal most-apparent-distortion model for video quality assessment, in Image Processing (ICIP), 2011 18th IEEE International Conference on, pp. 2505-2508, IEEE, 2011.
[48] S. Wolf and M. Pinson, Video quality model for variable frame delay (VQM VFD), US Dept. Commer., Nat. Telecommun. Inf. Admin., Boulder, CO, USA, Tech. Memo TM-11-482, 2011.
[49] M. H. Pinson, L. K. Choi, and A. C. Bovik, Temporal video quality model accounting for variable frame delay distortions, Broadcasting, IEEE Transactions on, vol. 60, no. 4, pp. 637-649, 2014.
[50] S. M. Friend and C. L. Baker, Spatio-temporal frequency separability in area 18 neurons of the cat, Vision Research, vol. 33, no. 13, pp. 1765-1771, 1993.
[51] Z. Wang, E. P. Simoncelli, and A. C. Bovik, Multiscale structural similarity for image quality assessment, in Signals, Systems and Computers, 2004. Conference Record of the Thirty-Seventh Asilomar Conference on, vol. 2, pp. 1398-1402, IEEE, 2003.
[52] K. Manasa, K. V. S. N. L. M. Priya, and S. S. Channappayya, A perceptually motivated no-reference video quality assessment algorithm for packet loss artifacts, in QoMEX 2014, Singapore, September, IEEE.
[53] K. Seshadrinathan, R. Soundararajan, A. C. Bovik, and L. K. Cormack, Study of subjective and objective quality assessment of video, Image Processing, IEEE Transactions on, vol. 19, no. 6, pp. 1427-1441, 2010.
[54] S. Roth and M. J. Black, On the spatial statistics of optical flow, International Journal of Computer Vision, vol. 74, no. 1, pp. 33-50, 2007.
[55] K. Liu, Q. Du, H. Yang, and B. Ma, Optical flow and principal component analysis-based motion detection in outdoor videos, EURASIP Journal on Advances in Signal Processing, vol. 2010, p. 1, 2010.
[56] H. R. Sheikh, M. F. Sabir, and A. C. Bovik, A statistical evaluation of recent full reference image quality assessment algorithms, Image Processing, IEEE Transactions on, vol. 15, no. 11, pp. 3440-3451, 2006.
[57] K. Seshadrinathan, R. Soundararajan, A. C. Bovik, and L. K. Cormack, Study of subjective and objective quality assessment of video, Image Processing, IEEE Transactions on, vol. 19, no. 6, pp. 1427-1441, 2010.
[58] EPFL-PoliMI Video Quality Assessment Database, 2009.
[59] F. De Simone, M. Naccari, M. Tagliasacchi, F. Dufaux, S. Tubaro, and T. Ebrahimi, Subjective assessment of H.264/AVC video sequences transmitted over a noisy channel, in Quality of Multimedia Experience, 2009. QoMEx 2009. International Workshop on, pp. 204-209, IEEE, 2009.
[60] F. De Simone, M. Tagliasacchi, M. Naccari, S. Tubaro, and T. Ebrahimi, A H.264/AVC video database for the evaluation of quality metrics, in Acoustics Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, pp. 2430-2433, IEEE, 2010.
[61] A. K. Moorthy, L. K. Choi, A. C. Bovik, and G. De Veciana, Video quality assessment on mobile devices: Subjective, behavioral and objective studies, Selected Topics in Signal Processing, IEEE Journal of, vol. 6, no. 6, pp. 652-671, 2012.
[62] A. K. Moorthy, L. K. Choi, G. De Veciana, and A. C. Bovik, Subjective analysis of video quality on mobile devices, in Sixth International Workshop on Video Processing and Quality Metrics for Consumer Electronics (VPQM), Scottsdale, Arizona, Citeseer, 2012.
[63] M. J. Black and P. Anandan, A framework for the robust estimation of optical flow, in Computer Vision, 1993. Proceedings., Fourth International Conference on, pp. 231-236, IEEE, 1993.
[64] D. Sun, S. Roth, and M. J. Black, Secrets of optical flow estimation and their principles, in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 2432-2439, IEEE, 2010.
[65] G. Farneback, Very high accuracy velocity estimation using orientation tensors, parametric motion, and simultaneous segmentation of the motion field, in Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, vol. 1, pp. 171-177, IEEE, 2001.
[66] L. Zhang, D. Zhang, and X. Mou, FSIM: a feature similarity index for image quality assessment, Image Processing, IEEE Transactions on, vol. 20, no. 8, pp. 2378-2386, 2011.
[67] Y. Nie and K.-K. Ma, Adaptive rood pattern search for fast block-matching motion estimation, Image Processing, IEEE Transactions on, vol. 11, no. 12, pp. 1442-1449, 2002.
[68] Final report from the Video Quality Experts Group on the validation of objective quality metrics for video quality assessment, 2000. [Online].
