Digital Image Processing: Bernd Girod, © 2013-2018 Stanford University -- Image Matching 1
A Bag of Words
[Figure: a bag of words containing self-evident, truths, Liberty, happiness, endowed, inalienable, Creator, pursuit, Life]
Representing a Text as a “Bag of Words”
We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit of Happiness. That to secure these rights, Governments are instituted among Men, deriving their just powers from the consent of the governed, That whenever any Form of Government becomes destructive of these ends, it is the Right of the People to alter or to abolish it, and to institute new Government, laying its foundation on such principles and organizing its powers in such form, as to them shall seem most likely to effect their Safety and Happiness. Prudence, indeed, will dictate that Governments long established should not be changed for light and transient causes; and accordingly all experience hath shewn, that mankind are more disposed to suffer, while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, pursuing invariably the same Object evinces a design to reduce them under absolute Despotism, it is their right, it is their duty, to throw off such Government, and to provide new Guards for their future security.
[Figure: the text reduced to its bag of words: self-evident, truths, Liberty, happiness, endowed, inalienable, Creator, pursuit, Life]
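To make the idea concrete, here is a minimal Python sketch that turns a text into a bag of words: word order is discarded and only word counts remain. The tokenizer regex and the tiny stopword list `STOP` are illustrative assumptions, not part of the original slides.

```python
import re
from collections import Counter

def bag_of_words(text, stopwords=frozenset()):
    """Lower-case the text, split it into word tokens, drop stopwords, count the rest."""
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())  # keeps "self-evident" whole
    return Counter(t for t in tokens if t not in stopwords)

# Hypothetical stopword list, for illustration only.
STOP = frozenset({"we", "hold", "these", "to", "be", "that", "all", "are",
                  "the", "and", "of", "by", "with", "they", "their"})

bow = bag_of_words("We hold these truths to be self-evident, that all men are created equal", STOP)
```

The counts ignore where each word occurred, which is exactly what makes two texts (or, later, two images) comparable through their histograms alone.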
Representing an Image as a “Bag of Visual Words”
Feature descriptors
Scale/rotation invariant feature descriptors
[Figure: example panels labeled 72 deg, 144 deg, and 180 deg]
SIFT descriptors
• SIFT: Scale-Invariant Feature Transform [Lowe, 1999, 2004]
• Sample thresholded image gradients at 16×16 locations in scale space (in a local coordinate system, for rotation and scale invariance)
• For each of the 4×4 subregions, generate an orientation histogram with 8 directions; each observation is weighted by the magnitude of the image gradient and by a window function
• Result: 128-dimensional feature vector (4 × 4 subregions × 8 directions)
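The steps above can be sketched as a simplified SIFT-style descriptor in Python (NumPy). This is a sketch under assumptions: the 16×16 patch is taken to be already resampled in the keypoint's local coordinate system (so rotation and scale normalization are omitted), there is no trilinear interpolation between bins, the window is a Gaussian with σ equal to half the patch width, and entries are clipped at 0.2 after normalization as described by Lowe (2004). The function name `sift_like_descriptor` is hypothetical.

```python
import numpy as np

def sift_like_descriptor(patch, clip=0.2):
    """Simplified SIFT-style descriptor of a 16x16 patch assumed to be already
    expressed in the keypoint's local (rotation/scale-normalized) frame."""
    assert patch.shape == (16, 16)
    p = patch.astype(float)
    gy, gx = np.gradient(p)                        # gradients along rows (y) and columns (x)
    mag = np.hypot(gx, gy)                         # gradient magnitude
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)    # gradient orientation in [0, 2*pi)
    c = np.arange(16) - 7.5                        # pixel offsets from the patch center
    win = np.exp(-(c[:, None] ** 2 + c[None, :] ** 2) / (2 * 8.0 ** 2))  # Gaussian window
    weight = mag * win                             # each observation: magnitude x window
    bins = np.minimum((ang / (2 * np.pi) * 8).astype(int), 7)  # 8 orientation bins
    desc = np.zeros((4, 4, 8))
    for r in range(16):
        for s in range(16):
            # 4x4 subregions, each covering 4x4 pixels
            desc[r // 4, s // 4, bins[r, s]] += weight[r, s]
    v = desc.ravel()                               # 4 * 4 * 8 = 128 dimensions
    v = v / (np.linalg.norm(v) + 1e-12)
    v = np.minimum(v, clip)                        # limit the influence of large gradients
    return v / (np.linalg.norm(v) + 1e-12)         # renormalize to unit length
```

For a patch whose intensity increases only from top to bottom, every gradient points the same way, so all the energy lands in one orientation bin of each subregion.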
SURF descriptors
Computing feature descriptors
[Figure: SURF descriptor computation. Color image converted to gray → box filters Dxx, Dyy, Dxy → blob response DxxDyy − (0.9Dxy)² → maxima over x, y, and scale → patch oriented along the dominant gradient → gradient field → per-subregion sums Σdx, Σdy, Σ|dx|, Σ|dy| → SURF descriptor]
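The blob response and the final stage of the pipeline can be sketched as follows. Assumptions: `dx` and `dy` are 20×20 arrays of gradient responses (Haar wavelet responses in SURF) sampled on the patch already oriented along the dominant gradient, split into 4×4 subregions of 5×5 samples; the Gaussian weighting used in SURF is omitted. Both function names are hypothetical.

```python
import numpy as np

def hessian_blob_response(Dxx, Dyy, Dxy, w=0.9):
    """Approximate Hessian determinant used as the SURF blob response."""
    return Dxx * Dyy - (w * Dxy) ** 2

def surf_like_descriptor(dx, dy):
    """64-D SURF-style descriptor from 20x20 gradient responses on an oriented patch."""
    assert dx.shape == dy.shape == (20, 20)
    feats = []
    for i in range(4):
        for j in range(4):
            sx = dx[5 * i:5 * i + 5, 5 * j:5 * j + 5]   # 5x5 samples of one subregion
            sy = dy[5 * i:5 * i + 5, 5 * j:5 * j + 5]
            # four sums per subregion: signed and absolute responses
            feats += [sx.sum(), sy.sum(), np.abs(sx).sum(), np.abs(sy).sum()]
    v = np.array(feats)                               # 4 * 4 * 4 = 64 dimensions
    return v / (np.linalg.norm(v) + 1e-12)            # unit length for contrast invariance
```

Keeping both Σdx and Σ|dx| lets the descriptor distinguish a uniform gradient (both large) from alternating edges (Σdx near zero, Σ|dx| large).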
“Bag of Visual Words” Matching
[Figure: pairwise comparison]
Geometric mapping
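• Notation:
  ◦ Homogeneous coordinates, reference image: $\mathbf{x} = (x \;\; y \;\; 1)^T$
  ◦ Inhomogeneous coordinates, target image: $\mathbf{x}' = (x' \;\; y')^T$
• Translation
$$\mathbf{x}' = [\, I \;\; \mathbf{t} \,]\, \mathbf{x} \quad \text{or} \quad \mathbf{x}' = (x + t_x \;\; y + t_y)^T$$
• Euclidean transformation (rotation and translation)
$$\mathbf{x}' = \begin{bmatrix} \cos\theta & -\sin\theta & t_x \\ \sin\theta & \cos\theta & t_y \end{bmatrix} \mathbf{x}$$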
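A minimal NumPy sketch of the Euclidean mapping: each reference point is lifted to homogeneous coordinates (x, y, 1) and multiplied by the 2×3 matrix [R | t]; θ = 0 reduces it to a pure translation. The helper names are hypothetical.

```python
import numpy as np

def euclidean_matrix(theta, tx, ty):
    """2x3 matrix [R | t] acting on homogeneous reference coordinates (x, y, 1)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, tx],
                     [s,  c, ty]])

def apply_2x3(M, pts):
    """Map an (N, 2) array of reference points to (N, 2) target points."""
    homog = np.hstack([pts, np.ones((len(pts), 1))])  # rows are (x, y, 1)
    return homog @ M.T                                  # rows are (x', y')

out = apply_2x3(euclidean_matrix(np.pi / 2, 1.0, 2.0), np.array([[1.0, 0.0], [0.0, 2.0]]))
```

Here (1, 0) is rotated by 90° to (0, 1) and then shifted by t = (1, 2) to (1, 3).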
Geometric mapping
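• Motion of a planar surface in 3-D under perspective projection
• Homography
$$\mathbf{x}' \sim \begin{pmatrix} h_{00} & h_{01} & h_{02} \\ h_{10} & h_{11} & h_{12} \\ h_{20} & h_{21} & h_{22} \end{pmatrix} \mathbf{x}$$
• Inhomogeneous coordinates (after normalization)
$$x' = \frac{h_{00}x + h_{01}y + h_{02}}{h_{20}x + h_{21}y + h_{22}}, \qquad y' = \frac{h_{10}x + h_{11}y + h_{12}}{h_{20}x + h_{21}y + h_{22}}$$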
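The normalization step can be sketched as: multiply homogeneous points by H, then divide by the third coordinate. Because x′ ∼ Hx holds only up to scale, H and any nonzero multiple of H give the same inhomogeneous result. `apply_homography` is a hypothetical helper name.

```python
import numpy as np

def apply_homography(H, pts):
    """Map (N, 2) reference points with a 3x3 homography H."""
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T  # x' ~ H x, up to scale
    return homog[:, :2] / homog[:, 2:3]                      # divide by third coordinate
```

With a nonzero bottom row (h20, h21), the denominator varies across the image, which is what produces the perspective foreshortening an affine model cannot represent.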
RANSAC
• RANdom SAmple Consensus [Fischler, Bolles, 1981]
• Randomly select a subset of k correspondences
• Compute the geometric mapping parameters by linear regression
• Apply the geometric mapping to all keypoints
• Count the number of inliers (keypoints mapped closer than ε to the corresponding keypoint; typically ε = 1 to 3 pixels)
• Repeat the process S times; keep the geometric mapping with the largest number of inliers
• Required number of trials
$$S = \frac{\log(1-P)}{\log(1-q^k)}$$
  with total probability of success P = 0.99 and probability of a valid correspondence q = 0.3 (q^k is the probability that all k sampled correspondences are valid):
  ◦ k = 3 → S = 168
  ◦ k = 4 → S = 566
• Use a small number of correspondences k, since S grows rapidly with k
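The trial-count formula and the loop above can be sketched in Python for the affine case (k = 3). Assumptions beyond the slide: the default ε = 2 pixels is picked from the 1 to 3 pixel range, the default S is computed from the slide's example values P = 0.99 and q = 0.3, and the final least-squares refit on all inliers of the best hypothesis is a common addition, not a step listed above. All function names are hypothetical.

```python
import math
import numpy as np

def required_trials(P, q, k):
    """S such that, with probability P, at least one k-sample is all valid."""
    return math.log(1 - P) / math.log(1 - q ** k)

def fit_affine(src, dst):
    """Least-squares (linear regression) affine fit: dst ~ [x y 1] @ M, M is 3x2."""
    X = np.hstack([src, np.ones((len(src), 1))])
    M, *_ = np.linalg.lstsq(X, dst, rcond=None)
    return M

def apply_affine(M, pts):
    return np.hstack([pts, np.ones((len(pts), 1))]) @ M

def ransac_affine(src, dst, k=3, trials=None, eps=2.0, seed=0):
    """src, dst: (N, 2) arrays of corresponding keypoint locations."""
    if trials is None:
        trials = math.ceil(required_trials(0.99, 0.3, k))  # 169 for k = 3
    rng = np.random.default_rng(seed)
    best = np.zeros(len(src), dtype=bool)
    for _ in range(trials):
        idx = rng.choice(len(src), size=k, replace=False)    # random subset of k
        M = fit_affine(src[idx], dst[idx])
        err = np.linalg.norm(apply_affine(M, src) - dst, axis=1)
        inliers = err < eps                                  # closer than eps pixels
        if inliers.sum() > best.sum():                       # keep largest consensus
            best = inliers
    if best.sum() < k:
        return None, best                                    # no usable consensus
    return fit_affine(src[best], dst[best]), best            # refit on all inliers
```

Note that `required_trials(0.99, 0.3, 3)` evaluates to about 168.2, consistent with the slide's S = 168; the sketch rounds up to be safe.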
RANSAC with Affine Model
RANSAC with Homography
SURF features & affine RANSAC
[Figure: pairwise comparison]
Local Feature Descriptor Aggregation
• Nearest-neighbor matching of variable-size sets of local features is costly
• Instead, compare images based on a global binary signature of constant size (“hash”)
• Simple: vector quantization (VQ) of feature vectors to generate a histogram; compare non-empty histogram bins (“bag of features,” “bag of visual words”)
• Better: binarize the gradient of the log-likelihood of the features w.r.t. the model parameter vector (“Fisher vector”)
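A minimal sketch of the “simple” option: vector quantization of each local feature to its nearest codeword, a histogram of codeword counts, and a binary signature marking the non-empty bins. The codebook is assumed to be given (for example, trained offline); `bovw_histogram`, `binary_signature`, and `shared_words` are hypothetical names.

```python
import numpy as np

def bovw_histogram(features, codebook):
    """VQ each local feature (row) to its nearest codeword and count per codeword."""
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    words = d2.argmin(axis=1)                        # index of the nearest codeword
    return np.bincount(words, minlength=len(codebook))

def binary_signature(hist):
    """Constant-size binary signature: which bins (visual words) are non-empty."""
    return hist > 0

def shared_words(sig_a, sig_b):
    """Number of visual words present in both images."""
    return int(np.count_nonzero(sig_a & sig_b))
```

The signature length depends only on the codebook size, not on how many features an image produced, which is what makes it cheap to compare.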
Comparing Feature Histograms
• Speed up by comparing histograms of features: pairwise image comparison only for similar histograms
• Histogram intersection [Swain, Ballard 1991]
$$\rho = \frac{\sum_{i=1}^{n} \min(Q_i, D_i)}{\sum_{i=1}^{n} D_i}$$
  where Q is the query histogram and D is the histogram of a database entry, each with n bins
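Histogram intersection and the shortlisting step can be sketched as below; only the top `m` database entries would then go on to pairwise comparison. `shortlist` and its parameter `m` are hypothetical.

```python
import numpy as np

def histogram_intersection(Q, D):
    """rho = sum_i min(Q_i, D_i) / sum_i D_i."""
    Q, D = np.asarray(Q, dtype=float), np.asarray(D, dtype=float)
    return np.minimum(Q, D).sum() / D.sum()

def shortlist(Q, database, m):
    """Indices of the m database histograms most similar to the query Q."""
    scores = np.array([histogram_intersection(Q, D) for D in database])
    return np.argsort(-scores, kind="stable")[:m]   # highest rho first
```

For example, Q = (1, 2, 0, 3) and D = (2, 1, 1, 3) give ρ = (1 + 1 + 0 + 3) / 7 = 5/7.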
Growing Vocabulary Tree
[Figure: vocabulary tree with k = 3 at each level, shown with a query]
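The figure shows the tree with k = 3 at each level. Assuming the tree is grown by hierarchical k-means (as in the vocabulary tree of Nistér and Stewénius, 2006), a minimal sketch of building it and quantizing a descriptor by greedy descent looks like this; the leaf reached, written as a path of child indices, acts as the visual word. The depth, seed, and all function names are assumptions.

```python
import numpy as np

def nearest(X, centers):
    """Index of the closest center for each row of X."""
    return ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(axis=1)

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd iterations; returns (k, d) centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        labels = nearest(X, centers)
        for j in range(k):
            if np.any(labels == j):           # keep the old center if a cluster empties
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def build_tree(X, k=3, depth=2, seed=0):
    """Hierarchical k-means: split the data k ways, then recurse into each cell."""
    if depth == 0 or len(X) < k:
        return None                           # leaf
    centers = kmeans(X, k, seed=seed)
    labels = nearest(X, centers)
    return {"centers": centers,
            "children": [build_tree(X[labels == j], k, depth - 1, seed) for j in range(k)]}

def quantize(tree, x):
    """Greedy descent: at each level follow the closest of the k child centers."""
    path, node = [], tree
    while node is not None:
        j = int(nearest(x[None], node["centers"])[0])
        path.append(j)
        node = node["children"][j]
    return tuple(path)                        # leaf identifier (visual word)
```

Descent costs k distance computations per level, so a deep tree can index a very large vocabulary while keeping quantization cheap.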
Hard Binning vs. Soft Binning
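[Figure: a query feature and a database feature on either side of the boundary between node 1 and node 2, at distances $d_1^{q}$ and $d_1^{db}$ from node 1]
Hard binning: $w_1^{db} = 0$, $w_2^{db} = 1$; $w_1^{q} = 1$, $w_2^{q} = 0$
Soft binning:
$$w_i^{db} \sim \exp\left(-\frac{(d_i^{db})^2}{\sigma^2}\right), \qquad w_i^{q} \sim \exp\left(-\frac{(d_i^{q})^2}{\sigma^2}\right), \qquad i = 1, 2$$
with the weights normalized so that $w_1^{db} + w_2^{db} + w_3^{db} = 1$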
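Soft binning can be sketched as follows: find the m nearest nodes, weight each by exp(−d²/σ²), and normalize so the weights sum to 1 (m = 3 matches the three-term normalization above; m = 1 reduces to hard binning). The function name and the choice of σ are assumptions.

```python
import numpy as np

def soft_weights(x, node_centers, sigma, m=3):
    """Soft-assign descriptor x to its m nearest nodes.

    Returns (node indices, weights) with w_i ~ exp(-d_i^2 / sigma^2), summing to 1.
    """
    d2 = ((node_centers - x) ** 2).sum(axis=1)          # squared distances d_i^2
    nearest = np.argsort(d2, kind="stable")[:m]         # m closest nodes
    w = np.exp(-d2[nearest] / sigma ** 2)
    return nearest, w / w.sum()
```

A feature near a bin boundary now contributes to both neighboring nodes, so a query and a database feature that fall on opposite sides still share weight.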
Stanford Mobile Visual Search Dataset
Querying: Hard Binning vs. Soft Binning
Precision ≈ 97%
• SURF features
• 6-level vocabulary tree, 1M leaf nodes
• Affine RANSAC for the top 100 tree results
• Minimum of 25 inliers
Fisher Vector
• Discriminative score function
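The gradient-of-log-likelihood idea from the aggregation slide can be sketched for a diagonal-covariance GMM. Assumptions: only the gradient with respect to the means is used, it is normalized by N·√w_k as in Perronnin et al.'s Fisher vector, and it is binarized by sign (1 bit per dimension); the GMM parameters are assumed to be trained beforehand. Since σ > 0, the sign of this normalized gradient equals the sign of the raw gradient. The function name is hypothetical.

```python
import numpy as np

def binarized_fisher_vector(X, weights, means, sigmas):
    """Sign-binarized Fisher vector (mean component) of descriptors X under a GMM.

    X: (N, D) descriptors; weights: (K,); means, sigmas: (K, D), diagonal GMM.
    """
    N, D = X.shape
    z = (X[:, None, :] - means[None]) / sigmas[None]         # (N, K, D) standardized offsets
    log_p = (-0.5 * (z ** 2).sum(-1)                         # log w_k N(x_t | mu_k, sigma_k)
             - np.log(sigmas).sum(-1)[None]
             - 0.5 * D * np.log(2 * np.pi)
             + np.log(weights)[None])                        # (N, K)
    log_p -= log_p.max(axis=1, keepdims=True)                # numerical stability
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)                # posteriors gamma_t(k)
    G = (gamma[:, :, None] * z).sum(0) / (N * np.sqrt(weights)[:, None])  # (K, D)
    return (G.ravel() > 0).astype(np.uint8)                  # K * D bits
```

The output length is K·D bits regardless of how many descriptors the image has, which gives the constant-size global signature described earlier.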
MPEG standard “Compact Descriptors for Visual Search” (CDVS)
[Figure: CDVS pipeline. Query → LoG peaks → feature selection statistically optimized based on peak response, scale, location, … → SIFT descriptor → non-orthogonal transform + quantization. The xy-locations are needed for object location (and geometric verification). Global descriptor: Fisher vector based on GMM. Query sizes: 512, 1K, 2K, 4K, 8K, 16K bytes; Fisher vector sizes shown: 304, 384, 404, 1117, 1117, 1117 bytes.]
CDVS Evaluation Framework
Graphics
Paintings
Video Frames
Landmarks
Common Objects
1M Distractor Images
MPEG CDVS Performance
On-Device Image Matching Demo
Demo Video
Database of 100K Images
On-Device Timing Measurements
Samsung Galaxy S3 Smartphone
1.4 GHz Processor
1 GB RAM
Database of 100K Images
400 queries
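[Figure: histogram of query time (horizontal axis: Time (sec), 0.5 to 1; vertical axis: Frequency) and breakdown of processing time: feature extraction 54%; global signature database search and geometric verification account for the remaining 32% and 14%]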
Augmented Reality Glasses
[Figure: augmented reality glasses with labeled camera and Android controller]