This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TIP.2016.2548247, IEEE Transactions on Image Processing
I. INTRODUCTION
The explosive growth of video content over the past decade has led to an urgent need to manage this content effectively [1]. This includes better acquisition, compression, storage and transport of video data. In other words, video processing systems must be designed to minimize perceptual distortion while optimally utilizing the available storage and communication resources. Distortions can potentially be introduced at various stages of video processing, from acquisition, storage and transport through to rendering, and can lead to a loss in the visual quality of the video. In most cases, the ultimate consumer of the video content is a human subject. Humans can rate the perceptual quality of a video based on their prior experience of the world and the training they have acquired over time. However, subjective evaluation of video is time consuming and expensive; with the huge volumes of data being generated, it simply does not scale.
1 The authors are with the Lab for Video and Image Analysis (LFOVIA), Department of Electrical Engineering, Indian Institute of Technology Hyderabad, Kandi, India, 502285 (e-mail: {ee12p1002, sumohana}@iith.ac.in).
1057-7149 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
[Equations (1)-(3), defining the per-frame feature vectors f1i, f2i and f3i, and equation (5), defining the dispersion difference D(·,·), are not recoverable from the source.]
where the subscripts r and t denote the reference and test sets
respectively.
The difference in dispersion defined in (5) is applied to the per-frame features f1i and f2i to effectively measure the amount of temporal distortion across frames. This is illustrated in Fig. 2, where Fig. 2f is the scatter plot of D(f1r, f1t) versus D(f2r, f2t) for the video sequences tr6 and tr13 over all the frames of the video. The plot clearly shows that the distorted frames of the tr6 sequence have a high dispersion difference for both features compared to the higher perceptual quality frames of the tr13 sequence. Figs. 2g and 2h illustrate that the low quality video (high DMOS) has high fluctuations in the dispersion of the features across frames compared to the high quality video (low DMOS).
The physical meaning of the difference in dispersion is as follows: a higher D(f1r^i, f1t^i) across frames implies that a majority of the frame has inter-patch inconsistencies, and that the inconsistency is spread across the frame. This in turn implies irregular motion in a few patches, resulting in non-uniform motion in a frame. A higher D(f2r^i, f2t^i) implies greater distortions within the patches, i.e., random or haphazard flow in the patch.
It is important to note that D(xr, xt) presents a frame-level measure of distortion using patch-level features, but without making use of patch-wise correspondence. In other words, this measure does not directly compare the statistics of a test patch with the statistics of the corresponding reference patch. Rather, the comparison happens at the frame level, based on the aggregate patch statistics of each frame.
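The exact form of D(·,·) is lost in this copy, so the following sketch is only illustrative: it assumes D compares a robust spread statistic (here the median absolute deviation, an assumption on our part) of the reference and test per-frame feature vectors, with no patch-wise pairing.

```python
import numpy as np

def mad(x):
    """Median absolute deviation: a robust measure of dispersion."""
    x = np.asarray(x, dtype=float)
    return float(np.median(np.abs(x - np.median(x))))

def dispersion_difference(x_ref, x_test):
    """Illustrative dispersion difference D(x_r, x_t) between the
    patch-level feature values of a reference and a test frame.

    Note: the spread statistic (MAD) is an assumption; the paper's
    exact definition of D is given in its equation (5).
    """
    return abs(mad(x_ref) - mad(x_test))
```

Identical feature sets give D = 0, and the value is unchanged if the patches are reordered, which matches the point above that no patch-wise correspondence is used.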
[Equations (6) and (7) are not recoverable from the source.]
C. Stage 3: Pooling
While the features mentioned in the previous section capture the changes in the local flow statistics, the effectiveness of a quality metric also depends on how these frame-level measurements are pooled.
Fig. 3: a) 16th frame of the pa4 sequence. b) 17th frame of the pa4 sequence, where the distortion sets in. c) 18th frame of the pa4 sequence, where the distortion persists. d) Frame 17, where both feature 1 and feature 2 have high dispersion compared to their temporal neighbours; the frame hence falls into the first quadrant (R1) while pooling.
Fig. 4: a) 121st frame of the pa4 sequence. b) 122nd frame of the pa4 sequence, where the distortion sets in. c) 123rd frame of the pa4 sequence, where the distortion persists. d) Frame 122, where feature 1 has high dispersion compared to its temporal neighbours; the frame hence falls into the fourth quadrant (R4) while pooling.
Fig. 5: a) 21st frame of the pa4 sequence. b) 22nd frame of the pa4 sequence, where the distortion sets in. c) 23rd frame of the pa4 sequence, where the distortion persists. d) Frame 22, where feature 2 has high dispersion compared to its temporal neighbours; the frame hence falls into the second quadrant (R2) while pooling. Illustration of the pooling strategy.
A test video frame Vt^i is classified into one of these classes according to the following rule:

R1 = (D(f1r^i, f1t^i) > T_f1t^i) & (D(f2r^i, f2t^i) > T_f2t^i),
R2 = (D(f1r^i, f1t^i) < T_f1t^i) & (D(f2r^i, f2t^i) > T_f2t^i),
R3 = (D(f1r^i, f1t^i) < T_f1t^i) & (D(f2r^i, f2t^i) < T_f2t^i),
R4 = (D(f1r^i, f1t^i) > T_f1t^i) & (D(f2r^i, f2t^i) < T_f2t^i),

where T_f1t^i and T_f2t^i are thresholds on the two dispersion differences, set relative to the temporal neighbours of frame i. The four regions have the following interpretation:

R1: Points to the region where irregularities are high at both the intra- and the inter-patch level.
R2: Points to regions having high intra-patch distortion and low inter-patch distortion, which implies that the distortion is confined within patches.
R3: Points to the region of acceptable temporal distortion.
R4: Points to regions having high inter-patch distortions and low intra-patch distortion, which implies that the distortion is spread across the frame.

Figs. 3, 4 and 5 vividly illustrate the visual effectiveness of the classification strategy. Fig. 3a shows frame number 16 from the pa4 sequence, which has no perceivable distortion; hence the plot in Fig. 3d depicts lower values of feature 1 and feature 2 compared to its immediate temporal neighbours. Fig. 3b shows frame number 17 from the same sequence, where the distortion sets in: both features show high dispersion relative to their temporal neighbours, and the frame falls into R1. Figs. 4 and 5 similarly show frames falling into R4 and R2 respectively.

The temporal score Gi assigned to the ith frame depends on its class:

Gi = (D(f1r^i, f1t^i) + D(f2r^i, f2t^i)) · C(f3r^i, f3t^i),   Vt^i ∈ R1,
     D(f2r^i, f2t^i) · C(f3r^i, f3t^i),                       Vt^i ∈ R2,
     C(f3r^i, f3t^i),                                         Vt^i ∈ R3,
     D(f1r^i, f1t^i) · C(f3r^i, f3t^i),                       Vt^i ∈ R4.    (10)

The choice of the frame-level score depends on the frame class. C(f3r^i, f3t^i) is a common factor since it is the only measure of patch-wise similarity. Specifically, in region R1, irregularities are high at both the intra- and the inter-patch level, and hence both D(f1r^i, f1t^i) and D(f2r^i, f2t^i) are used for the temporal score computation. Similarly, in R2 and R4, the regions corresponding to high intra- and inter-patch irregularities respectively, the corresponding feature scores D(f2r^i, f2t^i) and D(f1r^i, f1t^i) are used for score computation. Finally, in R3, the region where both the intra- and the inter-patch irregularities are low, there is still a need to account for patch-wise similarity with the reference video, and hence only C(f3r^i, f3t^i) is used for score computation.

We pool the spatial and temporal scores to come up with a spatio-temporal score according to:

Hi = Gi · Qi,    (11)

where Hi is the spatio-temporal score assigned to the ith frame. As discussed in Section I, the responses of the neurons in the visual cortex are almost separable in the spatial and temporal fields [50], and hence the spatio-temporal score is obtained by a simple product of the spatial and temporal scores.

The final quality score for the video is given by the weighted mean of the spatio-temporal scores Hi assigned to the frames:

FLOSIM = Σ_{i=1}^{4} wi Σ_{j ∈ Ri} Hj,    (12)

wi = N_Ri / N,  i ∈ {1, 2, 3, 4},    (13)

where wi is the weight assigned to each frame based on the region to which it belongs, N_Ri is the number of frames in the region Ri, and N is the total number of frames in the video.

The temporal quality score T for the video is calculated by taking the mean of the temporal score over all frames:

T = (Σ_{i=1}^{N} Gi) / N.    (14)

Similarly, the spatial quality score S for the video is given by:

S = (Σ_{i=1}^{N} Qi) / N.    (15)
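The region classification and pooling of equations (10)-(13) can be sketched as follows. The thresholds t1 and t2 stand in for the per-frame threshold terms, whose exact derivation is garbled in this copy, so this is a sketch under that assumption rather than the paper's implementation:

```python
def frame_temporal_score(d1, d2, c3, t1, t2):
    """Piecewise temporal score G_i of equation (10).

    d1, d2 : dispersion differences D(f1r, f1t) and D(f2r, f2t) of frame i
    c3     : patch-wise similarity C(f3r, f3t) of frame i
    t1, t2 : thresholds on d1 and d2 (their exact derivation is assumed)
    Returns (region, G_i), with region in {1, 2, 3, 4}.
    """
    if d1 > t1 and d2 > t2:      # R1: inter- and intra-patch irregularity
        return 1, (d1 + d2) * c3
    if d1 <= t1 and d2 > t2:     # R2: intra-patch irregularity only
        return 2, d2 * c3
    if d1 <= t1 and d2 <= t2:    # R3: acceptable temporal distortion
        return 3, c3
    return 4, d1 * c3            # R4: inter-patch irregularity only

def flosim(d1s, d2s, c3s, q, t1, t2):
    """Pool per-frame scores into a final value, as in eqs (11)-(13)."""
    n = len(q)
    regions, h = [], []
    for d1, d2, c3, qi in zip(d1s, d2s, c3s, q):
        r, g = frame_temporal_score(d1, d2, c3, t1, t2)
        regions.append(r)
        h.append(g * qi)                  # H_i = G_i * Q_i   (11)
    score = 0.0
    for r in (1, 2, 3, 4):
        members = [h[j] for j in range(n) if regions[j] == r]
        w = len(members) / n              # w_r = N_Rr / N    (13)
        score += w * sum(members)         # weighted mean     (12)
    return score
```

Note how each region's weight is simply the fraction of frames that fall into it, so transiently distorted frames (R1, R2, R4) contribute in proportion to how often they occur.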
Fig. 6e shows the MSE map of the 96th frame of the reference (tr1) and the less distorted video (tr13), which shows spurious regions that are not perceivable distortions. Fig. 6f shows the MSE map of a more distorted frame (tr6) against the reference (tr1) luminance values, where the localisation of distortions is not clearly seen. Fig. 6g is the MSE map for the flow of the 96th frame of the reference (tr1) and less distorted (tr13) sequences, which highlights areas that do not appear visually distorted. Fig. 6h is the MSE map for the flow of the 96th frame of the reference (tr1) and more distorted (tr6) sequences, which misses out on the more annoying regions. It is clear from these error maps that the proposed distortion map is able to localize distortions that neither the MSE map of the luminance nor the MSE map of the flow can. The ability of the distortion map to discern distortion is limited by the patch size used for feature computation.
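The exact construction of the distortion map is not retained in this copy, but since the map's resolution is limited by the patch size, a minimal sketch consistent with that description paints the absolute difference of per-patch statistics back onto the frame grid. The choice of statistic and the nearest-neighbour upsampling here are assumptions:

```python
import numpy as np

def distortion_map(stat_ref, stat_test, patch, shape):
    """Illustrative patch-level distortion map.

    stat_ref, stat_test : per-patch statistics of the reference and test
                          frame, laid out row-major over the frame grid
    patch               : patch size in pixels
    shape               : (height, width) of the frame
    The per-patch absolute difference is replicated over each patch, so
    the map cannot resolve distortions finer than one patch.
    """
    h, w = shape
    rows, cols = h // patch, w // patch
    diff = np.abs(np.asarray(stat_ref, dtype=float)
                  - np.asarray(stat_test, dtype=float)).reshape(rows, cols)
    # upsample each patch value to a patch x patch pixel block
    return np.kron(diff, np.ones((patch, patch)))
```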
IV. RESULTS AND DISCUSSION
We report the performance of our algorithm on two SD databases, the LIVE video database [57] and the EPFL-PoliMI database [58]-[60], and on an HD database, the LIVE Mobile VQA database [61], [62]; all three are described in Table I. The database independence of the proposed approach is validated by the results on these three databases. To demonstrate the robustness of the algorithm, and to show that our approach is independent of the type of flow used, we report results using different flow algorithms. Finally, the effectiveness of our approach is shown by replacing optical flow with much coarser motion vectors; such a replacement still results in an acceptable FR-VQA algorithm.
The LIVE SD video database has 10 reference videos and 150 distorted videos. The distortions include wireless distortions, IP distortions, and H.264 and MPEG2 compression artifacts. The resolution of all the LIVE videos is 768×432. Every reference sequence has 15 distorted videos, each of which corresponds to one of four distortion types at varying severities. The EPFL-PoliMI database contains 156 sequences in total, of which 12 are reference sequences and 144 are distorted sequences. Half of the sequences are of CIF resolution (352×288) and the rest are of 4CIF resolution (704×576). The reference sequences are encoded in the H.264 format, and these bitstreams are subjected to 6 different packet loss rates (0.1%, 0.4%, 1%, 3%, 5%, 10%) to produce the distorted bitstreams. We used the MOS scores provided by EPFL to tabulate the results in this paper, for consistency in comparing performance with the MOVIE index [14]. The LIVE Mobile VQA database consists of 10 HD reference videos and 200 distorted videos. The distortions include compression, wireless packet-loss, frame-freeze, rate-adaptation, and temporal dynamics per reference. Each video is of HD resolution (1280×720) at a frame rate of 30 fps with a duration of 15 seconds. We have omitted the frame-freeze distortion from the performance analysis on this database.
TABLE I: Summary of the databases used for evaluation.

Database        | Resolution                       | Frame rate | Number of videos | Distortions
LIVE SD         | 768×432                          | 25/50 fps  | 150              | Wireless distortions, IP distortions, H.264 and MPEG2 compression artifacts
EPFL-PoliMI     | CIF (352×288) and 4CIF (704×576) | 30 fps     | 144              | H.264 packet losses
LIVE Mobile VQA | 1280×720                         | 30 fps     | 160              | Compression, wireless packet-loss, rate-adaptation, temporal dynamics
A. Feature Computation
The choice of the flow algorithm is a trade-off between efficiency and accuracy. We have worked with two flow algorithms: the Black and Anandan algorithm [63], [64] and the Farneback algorithm.
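Given a dense flow field from either algorithm, the per-patch flow statistics can be sketched as below. The choice of patch-wise mean and standard deviation of the flow magnitude as stand-ins for the inter- and intra-patch features is an assumption, since the exact feature equations are garbled in this copy:

```python
import numpy as np

def patch_flow_stats(flow, patch=16):
    """Per-patch statistics of a dense flow field.

    flow  : (H, W, 2) array of per-pixel flow vectors, produced by any
            flow algorithm (e.g. Black-Anandan or Farneback)
    patch : patch size in pixels
    Returns (means, stds) over non-overlapping patches, as candidate
    per-frame feature vectors (the paper's exact features may differ).
    """
    mag = np.linalg.norm(flow, axis=2)     # per-pixel flow magnitude
    h, w = mag.shape
    means, stds = [], []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            block = mag[y:y + patch, x:x + patch]
            means.append(block.mean())     # inter-patch feature candidate
            stds.append(block.std())       # intra-patch feature candidate
    return np.array(means), np.array(stds)
```

Because the features are per-patch scalars, swapping the flow algorithm only changes the input field, not the feature pipeline, which is what makes the flow-independence experiments above possible.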
B. Performance Evaluation
The performance of the algorithm is tested on the three databases specified previously. Table II shows the performance comparison.
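Performance is reported in terms of LCC and SROCC. As a reference point, these two measures can be sketched with NumPy; note that this simple SROCC ignores tied ranks (unlike the usual fractional-rank definition):

```python
import numpy as np

def lcc(x, y):
    """Pearson linear correlation coefficient (LCC)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

def srocc(x, y):
    """Spearman rank-order correlation coefficient (SROCC).

    Computed as the Pearson correlation of the ranks; ties are not
    given fractional ranks in this sketch.
    """
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return lcc(rank(x), rank(y))
```

SROCC depends only on the ordering of the scores, which is why it is reported alongside LCC: it measures prediction monotonicity without assuming a linear relation to the subjective scores.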
[Performance tables: LCC and SROCC of FLOSIM (BA), FLOSIM (Farne), MOVIE [14], STMAD [47], and MOVIE with VQ Pooling [14] on the SD databases (LIVE SD and EPFL-PoliMI), overall and per distortion type (Wireless, IP, H264, MPEG2, All); the individual cell values are not reliably recoverable from the source.]
V. CONCLUSIONS
We presented a simple and novel optical flow based FR-VQA algorithm that is highly competitive with state-of-the-art methods. We demonstrated that local flow statistics and their dispersion form good features for assessing temporal quality. We showed that the proposed approach supports any robust FR-IQA algorithm for spatial quality assessment. We reported our results using the MS-SSIM index as a representative of robust FR-IQA algorithms for measuring spatial quality.
Fig. 6: Distortion map representing annoying regions. (a) 96th frame of the tr13 sequence with a DMOS of 33.473 (good quality). (b) 96th frame of the tr6 sequence with a DMOS of 73.473 (bad quality). (c) Distortion map for the 96th frame of tr13 (low DMOS). (d) Distortion map for the 96th frame of tr6 (high DMOS); it clearly shows the regions of disturbance. (e) MSE of the 96th frame of the reference (tr1) and the less distorted sequence (tr13). (f) MSE of the 96th frame of the reference (tr1) and the more distorted sequence (tr6). (g) MSE of the flow of the 96th frame of the reference (tr1) and the less distorted sequence (tr13). (h) MSE of the flow of the 96th frame of the reference (tr1) and the more distorted sequence (tr6).
[Ablation table: temporal-only, spatial-only, and combined FLOSIM (BA/Farne) modes with MS-SSIM, SSIM, VIF, and FSIM as the spatial component, reported as LCC, SROCC, OR, and RMSE on the LIVE SD and EPFL-PoliMI databases; the individual cell values are not reliably recoverable from the source.]
[LIVE Mobile VQA table: temporal-only and spatio-temporal results, including the Tablet study, reported as LCC, SROCC, and RMSE; the individual cell values are not reliably recoverable from the source.]
[Summary table: overall LCC, SROCC, OR, and RMSE of FLOSIM variants on LIVE SD and EPFL-PoliMI, with associated percentage figures; the individual cell values are not reliably recoverable from the source.]
Computational complexity on the LIVE SD database:

Metric         | Time taken (s) | % of complexity
MOVIE          | 11438          | -
FLOSIM (BA)    | 3816           | BA: 33.36%
FLOSIM (Farne) | 396.41         | Farne: 2885.38%
[Table: additional LCC, SROCC, and RMSE results on the LIVE SD and EPFL-PoliMI databases; the row labels and the mapping of values are not recoverable from the source.]