
THE GEOMETRY OF PERCEPTRONS

MATT RAYMOND

Lemma 0.1. Define D to be the set of input-output pairs; we call the elements of D data-points. Fix (xi, yi) ∈ D with yi ∈ {−1, 1}, let w ∈ Rn, and let ϕ : R → {−1, 1} be the binary step activation function.
(a) The data-point (xi, yi) is misclassified if and only if yi(ϕ(wT xi)) ≤ 0.
(b) The inequality yi(ϕ(wT xi)) > 0 holds if and only if (xi, yi) is classified correctly.

Proof. If yi(ϕ(wT xi)) ≤ 0, then either yi > 0 and ϕ(wT xi) < 0, or yi < 0 and ϕ(wT xi) > 0. It follows that either yi > ϕ(wT xi) or yi < ϕ(wT xi); in both cases, yi ≠ ϕ(wT xi). Hence, (xi, yi) is misclassified. The converse follows by a similar argument, so (a) holds. Since (b) is the contrapositive of (a), (b) holds. This completes the proof. □
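
The criterion in Lemma 0.1 is easy to check numerically. The sketch below (a minimal example in Python; the weight vector, data-point, and the convention ϕ(z) = 1 for z ≥ 0 and ϕ(z) = −1 otherwise are illustrative assumptions) evaluates yi(ϕ(wT xi)) and reports whether the data-point is misclassified.

    # Numerical check of Lemma 0.1 (illustrative w, x, and sign convention for phi).
    import numpy as np

    def phi(z):
        # binary step activation onto {-1, 1}; phi(0) = 1 is an assumed convention
        return 1.0 if z >= 0 else -1.0

    def is_misclassified(w, x, y):
        # Lemma 0.1(a): (x, y) is misclassified iff y * phi(w^T x) <= 0
        return y * phi(w @ x) <= 0

    w = np.array([1.0, -2.0])
    x = np.array([0.5, 1.0])           # w^T x = -1.5, so phi(w^T x) = -1
    print(is_misclassified(w, x, +1))  # True: the label +1 disagrees with phi
    print(is_misclassified(w, x, -1))  # False: the label -1 agrees with phi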

Definition 0.2. Let V be a k-dimensional vector space over R. A subspace H is called a hyperplane if it has codimension 1.

Theorem 0.3. For each xi, define H(xi) to be the hyperplane orthogonal to xi; that is, wT xi = 0 for every w ∈ H(xi). Define W↑(xi) to be the set of w ∈ Rn with wT xi > 0, and W↓(xi) to be the set of w ∈ Rn with wT xi < 0.
(a) If yi > 0, then any w with yi(ϕ(wT xi)) > 0 is an element of W↑(xi).
(b) If yi < 0, then any w with yi(ϕ(wT xi)) > 0 is an element of W↓(xi).
(c) The complement of W↑(xi) ∪ W↓(xi) is H(xi).

Proof. If yi > 0 and yi(ϕ(wT xi)) > 0, then ϕ(wT xi) > 0. It follows that wT xi > 0, so w ∈ W↑(xi). Hence (a) holds. The proof of (b) is similar and is omitted. If w is in the complement of W↑(xi) ∪ W↓(xi), then 0 ≤ wT xi and wT xi ≤ 0. It follows that wT xi = 0, so w ∈ H(xi). The other inclusion is similar, so (c) holds. This completes the proof. □

Corollary 0.4. Fix xi ∈ Rn.
(a) The union H(xi) ∪ W↑(xi) ∪ W↓(xi) = Rn.
(b) The intersection H(xi) ∩ W↑(xi) ∩ W↓(xi) = ∅.
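
Corollary 0.4 states that the sign of wT xi places every weight vector in exactly one of the three regions H(xi), W↑(xi), W↓(xi). The sketch below (illustrative vectors only; a small numerical tolerance stands in for exact comparison with zero) classifies a few weight vectors accordingly.

    # Sort weight vectors into the three regions of Corollary 0.4
    # (illustrative vectors; tol stands in for exact comparison with zero).
    import numpy as np

    def region(w, x, tol=1e-12):
        s = float(w @ x)
        if s > tol:
            return "W_up(x)"    # w^T x > 0
        if s < -tol:
            return "W_down(x)"  # w^T x < 0
        return "H(x)"           # w^T x = 0: the hyperplane orthogonal to x

    x = np.array([1.0, 1.0])
    print(region(np.array([2.0, 1.0]), x))   # W_up(x)
    print(region(np.array([-3.0, 1.0]), x))  # W_down(x)
    print(region(np.array([1.0, -1.0]), x))  # H(x)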

Definition 0.5. Fix w ∈ Rn . Define the geometric margin of H(w) as

γH(w) = min {‖wT xi‖ : (xi, yi) ∈ D}.
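
As a small worked example (with a hypothetical weight vector and dataset), the geometric margin is simply the smallest value of ‖wT xi‖ over the data-points.

    # Geometric margin of H(w) per Definition 0.5 (hypothetical w and dataset):
    # the smallest value of |w^T x_i| over the data-points.
    import numpy as np

    def geometric_margin(w, X):
        return min(abs(float(w @ x)) for x in X)

    w = np.array([1.0, -1.0])
    X = [np.array([2.0, 0.0]), np.array([0.0, 3.0]), np.array([1.0, 0.5])]
    print(geometric_margin(w, X))  # 0.5, attained at the third data-point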

Theorem 0.6. If there exists some w∗ with yi(ϕ(w∗T xi)) > 0 for every (xi, yi) ∈ D, then the perceptron learning algorithm converges in a finite number of steps.

Proof. Fix R > 0, set ‖w∗‖ = R, and constrain ‖xi‖ ≤ R. Each update of the perceptron learning algorithm is made on a misclassified data-point, that is, on some (xi, yi) with yi(ϕ(wT xi)) ≤ 0. Then after k updates, wT w∗ ≥ kγH(w∗) and wT w ≤ k. It suffices to show that k is bounded above. By the Cauchy-Schwarz inequality, kγH(w∗) ≤ wT w∗ ≤ ‖w‖ ‖w∗‖ ≤ √k R, so k ≤ (R/γH(w∗))². The result follows. □
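
The update rule behind Theorem 0.6 can be sketched as follows (a minimal version of the perceptron learning algorithm; the dataset is an assumed, linearly separable example, and the cap on updates is only a safeguard). On separable data the loop terminates after finitely many updates, consistent with the bound above.

    # A minimal perceptron learning algorithm (assumed example dataset,
    # which is linearly separable; max_updates guards against non-separable input).
    import numpy as np

    def phi(z):
        return 1.0 if z >= 0 else -1.0

    def perceptron(X, y, max_updates=1000):
        w = np.zeros(X.shape[1])
        updates = 0
        changed = True
        while changed and updates < max_updates:
            changed = False
            for xi, yi in zip(X, y):
                if yi * phi(w @ xi) <= 0:  # misclassified, by Lemma 0.1
                    w = w + yi * xi        # standard perceptron update
                    updates += 1
                    changed = True
        return w, updates

    X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    w, k = perceptron(X, y)
    print(w, k)  # a separating w is found after finitely many updates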

Remark 0.7. This result, of course, depends on the existence of such a w∗. In 1969, Minsky and Papert showed, among other things, that the perceptron cannot correctly classify datasets that are not linearly separable. Perhaps the most famous example of this is the XOR problem.
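
A crude way to see this for XOR (an illustrative brute-force search over a grid of candidate weight vectors, with a bias coordinate appended to each input) is to check that no candidate satisfies yi(wT xi) > 0 on all four data-points.

    # Brute-force search for a linear separator of XOR (illustrative grid only;
    # a bias coordinate is appended to each input). No candidate in the grid
    # separates the four points, illustrating that XOR is not linearly separable.
    import itertools
    import numpy as np

    X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
    y = np.array([-1.0, 1.0, 1.0, -1.0])  # XOR labels in {-1, 1}

    def separates(w):
        return all(yi * float(w @ xi) > 0 for xi, yi in zip(X, y))

    grid = np.linspace(-2.0, 2.0, 21)
    found = any(separates(np.array(w)) for w in itertools.product(grid, repeat=3))
    print(found)  # False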

Trinity Grammar School

Date: March 29, 2020.

