
Feature-based methods for image matching

• Bag of Visual Words approach
• Feature descriptors
  ◦ SIFT descriptor
  ◦ SURF descriptor
• Geometric consistency check
• Aggregation of local descriptors into global descriptors
  ◦ Vocabulary trees
  ◦ Fisher vectors
• Image-based retrieval
  ◦ MPEG CDVS standard
  ◦ Mobile visual search
  ◦ Augmented reality

Digital Image Processing: Bernd Girod, © 2013-2018 Stanford University -- Image Matching 1
A Bag of Words

[Figure: word cloud with the terms self-evident, Liberty, truths, happiness, endowed, inalienable, Creator, pursuit, Life]

Representing a Text as a “Bag of Words”

We hold these truths to be self-evident, that all men are created equal, that
they are endowed by their Creator with certain unalienable Rights, that
among these are Life, Liberty and the pursuit of Happiness. That to secure
these rights, Governments are instituted among Men, deriving their just
powers from the consent of the governed, That whenever any Form of
Government becomes destructive of these ends, it is the Right of the People
to alter or to abolish it, and to institute new Government, laying its foundation
on such principles and organizing its powers in such form, as to them shall
seem most likely to effect their Safety and Happiness. Prudence, indeed, will
dictate that Governments long established should not be changed for light
and transient causes; and accordingly all experience hath shewn, that mankind
are more disposed to suffer, while evils are sufferable, than to right themselves
by abolishing the forms to which they are accustomed. But when a long train
of abuses and usurpations, pursuing invariably the same Object evinces a
design to reduce them under absolute Despotism, it is their right, it is their
duty, to throw off such Government, and to provide new Guards for their
future security.

[Figure: the same word cloud (self-evident, Liberty, truths, happiness, endowed, inalienable, Creator, pursuit, Life) overlaid on the text]
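
The text-to-bag step can be sketched in a few lines of Python. This is only an illustration of the idea: the function name `bag_of_words` and the small stop-word set are assumptions made for the example, not part of the lecture material.

```python
import re
from collections import Counter

def bag_of_words(text, stop_words=frozenset()):
    """Count words, ignoring case, punctuation, and word order."""
    # Keep hyphenated words such as "self-evident" as a single token.
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    return Counter(w for w in words if w not in stop_words)

text = ("We hold these truths to be self-evident, that all men are created "
        "equal, that they are endowed by their Creator with certain "
        "unalienable Rights, that among these are Life, Liberty and the "
        "pursuit of Happiness.")

# Hypothetical stop-word list, chosen only for this example.
stop = {"the", "of", "to", "be", "that", "are", "and", "by", "with"}
bag = bag_of_words(text, stop)
print(bag.most_common(3))
```

Two texts can then be compared through their word counts alone, which is the property the visual-word analogy relies on.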

Representing an Image as a “Bag of Visual Words”

Feature descriptors

• Represent the local pattern around a keypoint by a vector (“feature descriptor”)
• Establish feature correspondences by finding the nearest neighbor in descriptor space
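
Nearest-neighbor matching in descriptor space can be sketched as a brute-force search. The Euclidean distance and the function name `match_nearest_neighbor` are assumptions for this example; practical systems typically add a ratio test or an approximate search structure.

```python
import math

def match_nearest_neighbor(query_desc, ref_desc):
    """For each query descriptor, return (index of the closest reference
    descriptor, Euclidean distance to it)."""
    matches = []
    for q in query_desc:
        dists = [math.dist(q, r) for r in ref_desc]
        best = min(range(len(ref_desc)), key=dists.__getitem__)
        matches.append((best, dists[best]))
    return matches

# Toy 3-dimensional descriptors (real SIFT/SURF vectors have 128 or 64 entries).
ref = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
query = [[0.9, 0.1, 0.0], [0.1, 0.0, 0.8]]
matches = match_nearest_neighbor(query, ref)
print(matches)
```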

Scale/rotation invariant feature descriptors
[Figure: orientation examples labeled 72 deg, 144 deg, 180 deg]

• Scale invariance: extract features at the scale provided by keypoint detection
• Rotation invariance:
  ◦ Detect dominant orientation by finding the peak in the orientation histogram
  ◦ Rotate coordinate system to dominant orientation
  ◦ Multiple strong orientation peaks: generate second feature point
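
The orientation-histogram step can be sketched as follows. The 36 bins and the 0.8 peak ratio follow common SIFT practice and are assumptions here, as is the simplification of returning every bin above the ratio rather than interpolated local maxima.

```python
import math

def dominant_orientations(gx, gy, num_bins=36, peak_ratio=0.8):
    """Return bin-center angles (degrees) of all histogram bins whose
    magnitude-weighted count reaches peak_ratio times the maximum."""
    hist = [0.0] * num_bins
    bin_width = 360.0 / num_bins
    for dx, dy in zip(gx, gy):
        angle = math.degrees(math.atan2(dy, dx)) % 360.0
        hist[int(angle // bin_width) % num_bins] += math.hypot(dx, dy)
    peak = max(hist)
    if peak == 0.0:
        return []
    return [(b + 0.5) * bin_width
            for b, h in enumerate(hist) if h >= peak_ratio * peak]

# Synthetic gradients: two equally strong directions and one weak one.
angles = [72.0] * 5 + [144.0] * 5 + [10.0] * 2
gx = [math.cos(math.radians(a)) for a in angles]
gy = [math.sin(math.radians(a)) for a in angles]
orientations = dominant_orientations(gx, gy)
print(orientations)
```

Each returned angle would spawn its own feature point, with the local coordinate system rotated to that angle before the descriptor is computed.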

SIFT descriptors
• SIFT - Scale-Invariant Feature Transform [Lowe, 1999, 2004]
• Sample thresholded image gradients at 16x16 locations in scale space (in local coordinate system for rotation and scale invariance)
• For each of the 4x4 subregions, generate an orientation histogram with 8 directions; each observation is weighted with the magnitude of the image gradient and a window function
• 128-dimensional feature vector (4 x 4 subregions x 8 directions)
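
A simplified sketch of the 4x4x8 histogram layout is given below. It assumes the 16x16 gradient samples are already expressed in the rotated, scaled local coordinate system, uses a Gaussian window with an assumed sigma of 8 samples, and clips at 0.2 before renormalizing (the value used in Lowe's paper). Trilinear interpolation between bins is omitted, so this is not a full SIFT implementation.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean length (no-op for the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def sift_like_descriptor(gx, gy, clip=0.2):
    """Build a 128-d descriptor from 16x16 gradient samples (lists of rows)."""
    desc = [0.0] * 128
    for r in range(16):
        for c in range(16):
            mag = math.hypot(gx[r][c], gy[r][c])
            # Window function centered on the patch (sigma = 8 is assumed).
            w = math.exp(-((r - 7.5) ** 2 + (c - 7.5) ** 2) / (2 * 8.0 ** 2))
            angle = math.atan2(gy[r][c], gx[r][c]) % (2 * math.pi)
            direction = int(angle / (2 * math.pi) * 8) % 8  # 8 directions
            cell = (r // 4) * 4 + (c // 4)                  # 4x4 subregions
            desc[cell * 8 + direction] += w * mag
    # Normalize, clip large entries, and normalize again.
    desc = [min(x, clip) for x in normalize(desc)]
    return normalize(desc)

# Constant horizontal gradient: only direction 0 of each subregion is filled.
gx = [[1.0] * 16 for _ in range(16)]
gy = [[0.0] * 16 for _ in range(16)]
descriptor = sift_like_descriptor(gx, gy)
print(len(descriptor))
```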

SURF descriptors

• SURF – Speeded Up Robust Features [Bay et al. 2006]
• Compute horizontal and vertical pixel differences, dx, dy (in local coordinate system for rotation and scale invariance, window size 20σ x 20σ, where σ² is feature scale)
• Sum dx, dy, and |dx|, |dy| over 4x4 subregions (SURF-64) or 3x3 subregions (SURF-36)
• Normalize vector for gain invariance, but distinguish bright blobs and dark blobs based on sign of Laplacian (trace of Hessian matrix)
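
The summation and normalization steps can be sketched as below. The sketch assumes the pixel differences dx, dy have already been sampled on a square grid in the local coordinate system and that the side length is divisible by the number of subregions; Haar-wavelet filtering, Gaussian weighting, and the separately stored sign of the Laplacian are left out.

```python
import math

def surf_like_descriptor(dx, dy, grid=4):
    """Sum dx, dy, |dx|, |dy| over grid x grid subregions and normalize.
    dx, dy are square lists of rows; returns 4 * grid * grid values."""
    step = len(dx) // grid
    desc = []
    for i in range(grid):
        for j in range(grid):
            cells = [(r, c)
                     for r in range(i * step, (i + 1) * step)
                     for c in range(j * step, (j + 1) * step)]
            desc += [sum(dx[r][c] for r, c in cells),
                     sum(dy[r][c] for r, c in cells),
                     sum(abs(dx[r][c]) for r, c in cells),
                     sum(abs(dy[r][c]) for r, c in cells)]
    # Unit length gives invariance to a gain (contrast) change.
    norm = math.sqrt(sum(v * v for v in desc))
    return [v / norm for v in desc] if norm > 0 else desc

patch_dx = [[1.0] * 20 for _ in range(20)]
patch_dy = [[1.0 if c % 2 == 0 else -1.0 for c in range(20)] for _ in range(20)]
surf64 = surf_like_descriptor(patch_dx, patch_dy, grid=4)
print(len(surf64))
```

Calling the same function with `grid=3` on an 18x18 patch yields the 36-entry variant.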

Computing feature descriptors
[Figure: block diagram of descriptor computation. Color → Gray → Filters (Dxx, Dxy, Dyy) → Blob Response DxxDyy − (0.9Dxy)² → Maxima in (x, y, scale) → Oriented Patch (orient along dominant gradient) → Gradient Field → SIFT Descriptor or SURF Descriptor (Σ dx, Σ dy, Σ|dx|, Σ|dy| per subregion)]

“Bag of Visual Words” Matching

Pairwise
Comparison

Geometric mapping
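• Notation:
  ◦ Homogeneous coordinates; reference image $\mathbf{x} = (x \;\; y \;\; 1)^T$
  ◦ Inhomogeneous coordinates; target image $\mathbf{x}' = (x' \;\; y')^T$
• Translation
$$\mathbf{x}' = \mathbf{x} + \mathbf{t} \quad \text{or} \quad \mathbf{x}' = \begin{bmatrix} \mathbf{I} & \mathbf{t} \end{bmatrix} \mathbf{x}$$
• Euclidean transformation (rotation and translation)
$$\mathbf{x}' = \begin{bmatrix} \cos\theta & -\sin\theta & t_x \\ \sin\theta & \cos\theta & t_y \end{bmatrix} \mathbf{x}$$
• Scaled rotation (similarity transform)
$$\mathbf{x}' = \begin{bmatrix} s \cos\theta & -s \sin\theta & t_x \\ s \sin\theta & s \cos\theta & t_y \end{bmatrix} \mathbf{x}$$
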
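
As a small illustration, the 2x3 matrices above can be built and applied to a point as follows. The helper names `similarity_matrix` and `apply_2x3` are made up for this sketch; setting s = 1 gives the Euclidean case, and additionally θ = 0 gives pure translation.

```python
import math

def similarity_matrix(s, theta, tx, ty):
    """2x3 matrix [s*R(theta) | t] acting on homogeneous points (x, y, 1)."""
    c, si = s * math.cos(theta), s * math.sin(theta)
    return [[c, -si, tx],
            [si, c, ty]]

def apply_2x3(M, point):
    """Map an (x, y) point of the reference image to the target image."""
    x, y = point
    return (M[0][0] * x + M[0][1] * y + M[0][2],
            M[1][0] * x + M[1][1] * y + M[1][2])

# Scale by 2, rotate by 90 degrees, then shift by (1, 0).
M = similarity_matrix(2.0, math.pi / 2, 1.0, 0.0)
mapped = apply_2x3(M, (1.0, 0.0))
print(mapped)
```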
Geometric mapping
• Affine transformation
$$\mathbf{x}' = \begin{bmatrix} a_{00} & a_{01} & a_{02} \\ a_{10} & a_{11} & a_{12} \end{bmatrix} \mathbf{x}$$
• Motion of planar surface in 3D under orthographic projection
• Parallel lines are preserved

Geometric mapping
• Motion of planar surface in 3D under perspective projection
• Homography
$$\mathbf{x}' \sim \begin{pmatrix} h_{00} & h_{01} & h_{02} \\ h_{10} & h_{11} & h_{12} \\ h_{20} & h_{21} & h_{22} \end{pmatrix} \mathbf{x}$$
• Inhomogeneous coordinates (after normalization)
$$x' = \frac{h_{00} x + h_{01} y + h_{02}}{h_{20} x + h_{21} y + h_{22}}, \qquad y' = \frac{h_{10} x + h_{11} y + h_{12}}{h_{20} x + h_{21} y + h_{22}}$$
• Straight lines are preserved
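
The normalization step can be made concrete with a short sketch. The function name `apply_homography` is an assumption for this example; the arithmetic is exactly the pair of ratios given above.

```python
def apply_homography(H, point):
    """Map (x, y) through a 3x3 homography H and divide by the third
    homogeneous coordinate to obtain inhomogeneous (x', y')."""
    x, y = point
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity")
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)

# A homography with a perspective term in the last row.
H = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.5, 0.0, 1.0]]
projected = apply_homography(H, (2.0, 4.0))
print(projected)  # (1.0, 2.0)
```

Because of the division, multiplying H by any non-zero constant leaves the mapped point unchanged, which is why the homography is only defined up to scale.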

RANSAC
• RANdom SAmple Consensus [Fischler, Bolles, 1981]
• Randomly select subset of k correspondences
• Compute geometric mapping parameters by linear regression
• Apply geometric mapping to all keypoints
• Count no. of inliers (closer than ε from the corresponding keypoint, typical ε = 1…3 pixels)
• Repeat process S times, keep geometric mapping with largest no. of inliers
• Required number of trials
$$S = \frac{\log(1 - P)}{\log(1 - q^k)}$$
  where P is the total probability of success and q is the probability of a valid correspondence. For P = 0.99 and q = 0.3: k=3 -> S=168, k=4 -> S=566
• Use small number of correspondences
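
The procedure can be sketched for an affine model as follows. This is a minimal illustration under stated assumptions: with the minimal sample of k = 3 correspondences the regression reduces to an exact solve, done here with Cramer's rule instead of a general least-squares fit, and the function names, the default ε = 2 pixels, and the fixed random seed are choices made for the example.

```python
import math
import random

def ransac_trials(P, q, k):
    """Number of trials S = log(1 - P) / log(1 - q**k), not rounded."""
    return math.log(1 - P) / math.log(1 - q ** k)

def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def fit_affine(src, dst):
    """Exact 2x3 affine map from 3 correspondences; None if collinear."""
    A = [[x, y, 1.0] for x, y in src]
    d = det3(A)
    if abs(d) < 1e-9:
        return None
    M = []
    for coord in range(2):                 # solve for row 0 (x'), row 1 (y')
        b = [p[coord] for p in dst]
        row = []
        for col in range(3):               # Cramer's rule, column by column
            Ac = [r[:] for r in A]
            for i in range(3):
                Ac[i][col] = b[i]
            row.append(det3(Ac) / d)
        M.append(row)
    return M

def ransac_affine(src, dst, trials, eps=2.0, seed=0):
    """Keep the affine map with the largest number of inliers."""
    rng = random.Random(seed)
    best_M, best_inliers = None, []
    for _ in range(trials):
        idx = rng.sample(range(len(src)), 3)
        M = fit_affine([src[i] for i in idx], [dst[i] for i in idx])
        if M is None:
            continue
        inliers = [i for i, ((x, y), (u, v)) in enumerate(zip(src, dst))
                   if math.hypot(M[0][0] * x + M[0][1] * y + M[0][2] - u,
                                 M[1][0] * x + M[1][1] * y + M[1][2] - v) < eps]
        if len(inliers) > len(best_inliers):
            best_M, best_inliers = M, inliers
    return best_M, best_inliers

# Seven correspondences follow a known affine map, three are outliers.
true_M = [[1.0, 0.1, 5.0], [-0.1, 1.0, -3.0]]
src = [(0, 0), (10, 0), (0, 10), (10, 10), (5, 2), (2, 7), (8, 4),
       (3, 3), (7, 9), (1, 5)]
dst = [(true_M[0][0] * x + true_M[0][1] * y + true_M[0][2],
        true_M[1][0] * x + true_M[1][1] * y + true_M[1][2]) for x, y in src]
dst[7:] = [(60.0, 60.0), (-30.0, 20.0), (25.0, -40.0)]
model, inliers = ransac_affine(src, dst, trials=200)
print(round(ransac_trials(0.99, 0.3, 3)), round(ransac_trials(0.99, 0.3, 4)))
```

The trial-count function shows why a small k is preferred: the number of required trials grows quickly as more correspondences must all be valid at once.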

RANSAC with Affine Model

RANSAC with Homography

SURF features & affine RANSAC

Pairwise
Comparison

Local Feature Descriptor Aggregation
• Nearest-neighbor matching of variable-size sets of local features is costly
• Compare images based on a global binary signature of constant size (“hash”) instead
• Simple: vector quantization (VQ) of feature vectors to generate a histogram, compare non-empty histogram bins (“bag of features,” “bag of visual words”)
• Better: binarize the gradient of the log-likelihood with respect to the parameter vector (“Fisher vector”)
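
The simple VQ variant can be sketched as below. The codebook is assumed to be given (for instance from k-means clustering of training descriptors, which is not shown), and the function names are made up for this example.

```python
import math
from collections import Counter

def quantize(descriptor, codebook):
    """Index of the nearest codeword (the 'visual word') for a descriptor."""
    return min(range(len(codebook)),
               key=lambda i: math.dist(descriptor, codebook[i]))

def bag_of_visual_words(descriptors, codebook):
    """Fixed-length histogram of visual-word counts for one image."""
    counts = Counter(quantize(d, codebook) for d in descriptors)
    return [counts.get(i, 0) for i in range(len(codebook))]

# Toy 2-dimensional codebook with three visual words.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
descriptors = [[0.1, 0.0], [0.9, 0.2], [0.0, 0.2]]
histogram = bag_of_visual_words(descriptors, codebook)
print(histogram)  # [2, 1, 0]
```

The histogram length depends only on the codebook size, so images with different numbers of local features become directly comparable.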

Comparing Feature Histograms
• Speed up by comparing histograms of features: pairwise image comparison only for similar histograms
• Histogram intersection [Swain, Ballard 1991]
$$\rho = \frac{\sum_{i=1}^{n} \min(Q_i, D_i)}{\sum_{i=1}^{n} D_i}$$
  where Q is the query histogram and D is the histogram of the database entry
• Equivalent to mean absolute difference, if both histograms contain same number of samples
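
The intersection score and its relation to the mean absolute difference can be checked numerically. When both histograms hold N samples in n bins, min(Q_i, D_i) = (Q_i + D_i − |Q_i − D_i|)/2, so ρ = 1 − n·MAD/(2N); the sketch below verifies this on a toy example with assumed function names.

```python
def histogram_intersection(Q, D):
    """rho = sum_i min(Q_i, D_i) / sum_i D_i."""
    return sum(min(q, d) for q, d in zip(Q, D)) / sum(D)

def mean_absolute_difference(Q, D):
    return sum(abs(q - d) for q, d in zip(Q, D)) / len(Q)

Q = [3, 0, 2, 5]   # query histogram, 10 samples
D = [1, 4, 2, 3]   # database histogram, 10 samples
rho = histogram_intersection(Q, D)
mad = mean_absolute_difference(Q, D)
n, N = len(Q), sum(D)
print(rho, 1 - n * mad / (2 * N))  # 0.6 0.6
```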

Growing Vocabulary Tree

[Nistér and Stewenius, 2006]


[Figure sequence: successive steps of growing the vocabulary tree, branching factor k=3]

Querying Vocabulary Tree

Query

Hard Binning vs. Soft Binning
[Figure: a query feature and a database feature shown relative to node 1, node 2, and node 3; $d_i^{q}$ and $d_i^{db}$ denote the distances of the query feature and the database feature to node i]

Hard Binning:
$$w_1^{db} = 0,\; w_1^{q} = 1, \qquad w_2^{db} = 1,\; w_2^{q} = 0, \qquad w_3^{db} = 0,\; w_3^{q} = 0$$

Soft Binning:
$$w_i^{db} \sim \exp\left(-\frac{(d_i^{db})^2}{\sigma^2}\right), \qquad w_i^{q} \sim \exp\left(-\frac{(d_i^{q})^2}{\sigma^2}\right), \qquad i = 1, 2, 3$$
$$w_1^{db} + w_2^{db} + w_3^{db} = 1, \qquad w_1^{q} + w_2^{q} + w_3^{q} = 1$$


[Nistér and Stewenius, CVPR 2006] [Philbin et al., CVPR 2008]
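
The difference between the two binning schemes can be sketched numerically. The distances and σ below are invented for the example; the weights follow the exponential form and the sum-to-one constraint given above.

```python
import math

def hard_weights(distances):
    """Weight 1 for the nearest node, 0 for all others."""
    nearest = min(range(len(distances)), key=distances.__getitem__)
    return [1.0 if i == nearest else 0.0 for i in range(len(distances))]

def soft_weights(distances, sigma):
    """Weights proportional to exp(-d^2 / sigma^2), normalized to sum to 1."""
    raw = [math.exp(-(d * d) / (sigma * sigma)) for d in distances]
    total = sum(raw)
    return [r / total for r in raw]

# Query and database feature lie close together, but on opposite sides of
# the boundary between node 1 and node 2 (hypothetical distances).
d_q = [1.0, 1.2, 3.0]
d_db = [1.2, 1.0, 3.0]

hard_score = sum(a * b for a, b in zip(hard_weights(d_q), hard_weights(d_db)))
soft_q, soft_db = soft_weights(d_q, 1.0), soft_weights(d_db, 1.0)
soft_score = sum(a * b for a, b in zip(soft_q, soft_db))
print(hard_score, soft_score)
```

With hard binning the two features share no node and contribute nothing to the match score, whereas soft binning still gives them a substantial overlap.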

Stanford Mobile Visual Search Dataset

Querying: Hard Binning vs. Soft Binning
• Precision ~ 97%
• SURF features
• 6-level vocab tree, 1M leaf nodes
• Affine RANSAC for 100 top tree results, 25 inliers min.

Fisher Vector
• Discriminative score function
$$U(X) = \nabla_{\lambda} \log p(X \mid \lambda)$$
  where U(X) is a d-dimensional vector, X is a k-dimensional feature vector, λ contains the d parameters, and d ≫ k
• Typically, we use a Gaussian mixture model (GMM) for $p(X \mid \lambda)$
• Parameters λ: mean (and variance) of Gaussian clusters
• For GMM, feature scores U(X) are soft-assigned distance vectors (and squared distance vectors) relative to cluster centers
• Sums of feature scores of an image are the “Fisher vector” that can be used to compare images
• Binarization & Hamming distance comparison results in only minor performance loss (“Binarized Fisher vector”)
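
A reduced sketch of this construction is shown below. It assumes a GMM with diagonal variances, differentiates the log-likelihood only with respect to the cluster means, and leaves out the Fisher-information normalization used in practical systems, so it illustrates the idea rather than any standardized descriptor. All function names are assumptions.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(-0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
               for xi, m, v in zip(x, mean, var))

def fisher_vector_means(features, weights, means, variances):
    """Sum over features of d/d(mean_c) log p(x): the soft assignment
    gamma_c(x) times the scaled offset (x - mean_c) / var_c."""
    n_clusters, dim = len(means), len(means[0])
    fv = [0.0] * (n_clusters * dim)
    for x in features:
        logp = [math.log(w) + log_gauss(x, m, v)
                for w, m, v in zip(weights, means, variances)]
        top = max(logp)                       # subtract max for stability
        post = [math.exp(l - top) for l in logp]
        total = sum(post)
        for c in range(n_clusters):
            gamma = post[c] / total           # soft assignment to cluster c
            for j in range(dim):
                fv[c * dim + j] += gamma * (x[j] - means[c][j]) / variances[c][j]
    return fv

def binarize(fv):
    return [1 if v > 0 else 0 for v in fv]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

# Two well-separated clusters in 2-D, one feature near each cluster center.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [10.0, 10.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
fv = fisher_vector_means([[1.0, 0.0], [9.0, 10.0]], weights, means, variances)
bits = binarize(fv)
print([round(v, 3) for v in fv])
```

The vector length is the number of clusters times the feature dimension, independent of how many local features the image contains, and two binarized vectors can be compared with `hamming`.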

MPEG standard “Compact Descriptors for Visual Search” (CDVS)
[Figure: CDVS block diagram. Query → LoG peaks → block labeled “Statistically optimized, based on peak response, scale, location, …” → SIFT descriptor → non-orthogonal transform + quantization; a second branch computes a Fisher vector based on GMM; the xy-location is needed for object location (and geometric verification). Size labels in the diagram: 304, 384, 404, 1117, 1117, 1117 bytes and 512, 1K, 2K, 4K, 8K, 16K bytes]

CDVS Evaluation Framework

[Figure: example images from the categories Graphics, Paintings, Video Frames, Landmarks, Common Objects]


1M Distractor Images

MPEG CDVS Performance

On-Device Image Matching Demo
Demo Video
Database of 100K Images

Samsung Galaxy S3 Smartphone

On-Device Timing Measurements
Samsung Galaxy S3 Smartphone, 1.4 GHz Processor, 1 GB RAM
Database of 100K Images

[Figure: histogram of query times for 400 queries; x-axis: Time (sec), 0.5 to 1; y-axis: Frequency, 0 to 100]
[Figure: pie chart with segments Global signature database search, Feature extraction, Geometric verification; percentage labels 54%, 32%, 14%]

Augmented Reality Glasses

[Figure: augmented reality glasses with labeled components: Right-eye LCD, Left-eye LCD, Camera, Android controller]
