
Machine Learning: Exercise Sheet 2

Manuel Blum
AG Maschinelles Lernen und Natürlichsprachliche Systeme
Albert-Ludwigs-Universität Freiburg

mblum@informatik.uni-freiburg.de

Exercise 1: Version Spaces
Task (a)

What are the elements of the version space?


- hypotheses (descriptions of concepts)
- VS_{H,D} ⊆ H is the subset of the hypothesis space H that contains exactly those hypotheses that are consistent with the training data D

How are they ordered?


- arranged in a general-to-specific ordering
- partial order: ≤_g , <_g (a small sketch of this ordering and of the consistency test is given below)
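
A minimal Python sketch of these notions, assuming a simple encoding in which each constraint is one of '+', '-', '*' (any value) or '0' (the empty constraint); the identifiers are my own and not part of the exercise sheet:

    def matches(h, x):
        """True if hypothesis h classifies instance x (a tuple of '+'/'-') as positive."""
        return all(c == '*' or c == v for c, v in zip(h, x))

    def consistent(h, example):
        """True if h agrees with a labelled example (x, label), label in {'pos', 'neg'}."""
        x, label = example
        return matches(h, x) == (label == 'pos')

    def more_general_or_equal(h1, h2):
        """The version-space ordering: h1 >=_g h2 iff h1 covers every instance covered by h2."""
        if '0' in h2:                      # a hypothesis containing '0' matches nothing
            return True
        return all(a == '*' or a == b for a, b in zip(h1, h2))

    def version_space(hypotheses, D):
        """The elements of VS_{H,D}: all hypotheses consistent with every example in D."""
        return [h for h in hypotheses if all(consistent(h, d) for d in D)]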

Exercise 1: Version Spaces
Task (a)

What can be said about the meaning and sizes of G and S?


- They are the sets containing the most general and the most specific hypotheses that are consistent with the training data. They thus form the general and the specific boundary of the version space.
- For conjunctive hypotheses (which we consider here), |S| = 1 always holds, assuming consistent training data. G attains its maximal size if negative examples with maximal Hamming distance have been presented. Thus, in the case of binary constraints, |G| ≤ n(n − 1), where n denotes the number of constraints per hypothesis.
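
For instance, with the n = 3 binary constraints used in the following tasks, this bound amounts to |G| ≤ 3 · 2 = 6.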

Exercise 1: Version Spaces
Task (b)

In the following, we want to describe whether a person is ill.


We use a representation based on conjunctive constraints (three per subject) to describe an individual person. These constraints are “running nose”, “coughing”, and “reddened skin”, each of which can take the value true (‘+’) or false (‘−’). We say that somebody is ill if he has a running nose and is coughing; a single symptom on its own does not mean that the person is ill.

- Specify the space of hypotheses that is managed by the version space approach. To do so, arrange all hypotheses in a graph structure using the more-specific-than relation (a small enumeration sketch is given below).
- hypotheses are vectors of constraints, denoted by ⟨N, C, R⟩
- with N, C, R ∈ {−, +, ∅, ∗}
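
A small Python sketch (identifiers are my own) that enumerates this hypothesis space and lists the edges of the more-specific-than ordering; printing or plotting these edges yields the requested graph:

    from itertools import product

    # Constraint values: '-' / '+' (fixed value), '*' (any value), '0' (empty constraint).
    # Every hypothesis containing '0' matches no instance, so semantically only the
    # single all-'0' hypothesis is kept as the most specific element.
    VALUES = ['-', '+', '*']
    HYPOTHESES = [('0', '0', '0')] + list(product(VALUES, repeat=3))   # 1 + 27 hypotheses

    def more_general_or_equal(h1, h2):
        """h1 >=_g h2: h1 matches every instance that h2 matches."""
        if '0' in h2:
            return True
        return all(a == '*' or a == b for a, b in zip(h1, h2))

    # All pairs (h2, h1) with h2 more specific than h1 (transitive edges included).
    edges = [(h2, h1) for h1 in HYPOTHESES for h2 in HYPOTHESES
             if h1 != h2 and more_general_or_equal(h1, h2)]
    print(len(HYPOTHESES), "hypotheses,", len(edges), "ordering edges")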

Exercise 1: Version Spaces
Task (c)

Apply the candidate elimination (CE) algorithm to the sequence of training examples specified in the table and name the contents of the sets S and G after each step.

Training example   N (running nose)   C (coughing)   R (reddened skin)   Classification
d1                 +                  +              +                   positive (ill)
d2                 +                  +              −                   positive (ill)
d3                 +                  −              +                   negative (healthy)
d4                 −                  +              +                   negative (healthy)
d5                 −                  −              +                   negative (healthy)
d6                 −                  −              −                   negative (healthy)

Exercise 1: Version Spaces
Task (c)

- Start (init): G = {⟨∗ ∗ ∗⟩}, S = {⟨∅ ∅ ∅⟩}
- foreach d ∈ D do
  - d1 = [⟨+ + +⟩, pos] ⇒ G = {⟨∗ ∗ ∗⟩}, S = {⟨+ + +⟩}
  - d2 = [⟨+ + −⟩, pos] ⇒ G = {⟨∗ ∗ ∗⟩}, S = {⟨+ + ∗⟩}
  - d3 = [⟨+ − +⟩, neg]
    - no change to S: S = {⟨+ + ∗⟩}
    - specializations of G: G = {⟨− ∗ ∗⟩, ⟨∗ + ∗⟩, ⟨∗ ∗ −⟩}
    - there is no element in S that is more specific than the first and the third element of G
      → remove them from G ⇒ G = {⟨∗ + ∗⟩}

Exercise 1: Version Spaces
Task (c)

- foreach d ∈ D do (loop continued)
  - so far we have S = {⟨+ + ∗⟩} and G = {⟨∗ + ∗⟩}
  - d4 = [⟨− + +⟩, neg]
    - no change to S: S = {⟨+ + ∗⟩}
    - specializations of G: G = {⟨+ + ∗⟩, ⟨∗ + −⟩}
    - there is no element in S that is more specific than the second element of G
      → remove it from G ⇒ G = {⟨+ + ∗⟩}
- Note:
  - At this point, the algorithm might be stopped, since S = G and no further changes to S and G are to be expected.
  - However, by continuing we might detect inconsistencies in the training data.

Exercise 1: Version Spaces
Task (c)

- foreach d ∈ D do (loop continued)
  - d5 = [⟨− − +⟩, neg] ⇒ both G = {⟨+ + ∗⟩} and S = {⟨+ + ∗⟩} are consistent with d5.
  - d6 = [⟨− − −⟩, neg] ⇒ both G = {⟨+ + ∗⟩} and S = {⟨+ + ∗⟩} are consistent with d6.
- return S and G (a compact implementation sketch of the whole run is given below)
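
The trace above can be reproduced with a compact implementation. The following Python sketch covers only the simplified setting of this exercise (binary attribute values, conjunctive hypotheses, hence |S| never holds more than one element) rather than the general algorithm from the lecture; all identifiers and the data encoding are my own:

    def matches(h, x):
        return all(c == '*' or c == v for c, v in zip(h, x))

    def more_general_or_equal(h1, h2):
        if '0' in h2:                      # the empty hypothesis matches nothing
            return True
        return all(a == '*' or a == b for a, b in zip(h1, h2))

    def candidate_elimination(D, n):
        S = [tuple('0' for _ in range(n))]             # most specific boundary
        G = [tuple('*' for _ in range(n))]             # most general boundary
        for x, label in D:
            if label == 'pos':
                G = [g for g in G if matches(g, x)]    # drop inconsistent members of G
                new_S = []
                for s in S:
                    if matches(s, x):
                        new_S.append(s)
                        continue
                    # minimal generalization of s that covers the positive example x
                    gen = tuple(v if c == '0' else (c if c == v else '*')
                                for c, v in zip(s, x))
                    if any(more_general_or_equal(g, gen) for g in G):
                        new_S.append(gen)
                S = new_S
            else:
                S = [s for s in S if not matches(s, x)]  # drop inconsistent members of S
                new_G = []
                for g in G:
                    if not matches(g, x):
                        new_G.append(g)
                        continue
                    # minimal specializations of g that exclude the negative example x
                    for i, c in enumerate(g):
                        if c == '*':
                            spec = g[:i] + ('-' if x[i] == '+' else '+',) + g[i + 1:]
                            if any(more_general_or_equal(spec, s) for s in S):
                                new_G.append(spec)
                new_G = list(dict.fromkeys(new_G))       # remove duplicates
                # keep only the maximally general members of G
                G = [g for g in new_G
                     if not any(h != g and more_general_or_equal(h, g) for h in new_G)]
        return S, G

    D = [(('+', '+', '+'), 'pos'), (('+', '+', '-'), 'pos'),
         (('+', '-', '+'), 'neg'), (('-', '+', '+'), 'neg'),
         (('-', '-', '+'), 'neg'), (('-', '-', '-'), 'neg')]
    print(candidate_elimination(D, 3))   # expected: ([('+', '+', '*')], [('+', '+', '*')])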

Exercise 1: Version Spaces
Task (d)

Does the order of presentation of the training examples to the learner affect the finally learned hypothesis?

- No, but it may influence the algorithm’s running time.

Exercise 1: Version Spaces
Task (e)

Assume a domain with two attributes, i.e. any instance is described by two constraints. How many positive and negative training examples are minimally required by the candidate elimination algorithm in order to learn an arbitrary concept?

- By learning an arbitrary concept we mean that the algorithm arrives at S = G.
- The algorithm is started with S = {⟨∅, ∅⟩} and G = {⟨∗, ∗⟩}.
- We consider only the best case, i.e. situations in which the training instances given to the CE algorithm allow S or G to be adapted.

Exercise 1: Version Spaces
Task (e)

Clearly, three appropriately chosen examples are sufficient.

- Negative examples change G from ⟨∗, ∗⟩ to ⟨v, ∗⟩ or ⟨∗, w⟩, or from ⟨v, ∗⟩ or ⟨∗, w⟩ to ⟨v, w⟩.
- Positive examples change S from ⟨∅, ∅⟩ to ⟨v, w⟩, from ⟨v, w⟩ to ⟨v, ∗⟩ or ⟨∗, w⟩, or from ⟨v, ∗⟩ or ⟨∗, w⟩ to ⟨∗, ∗⟩.
- At least one positive example is required (otherwise S remains ⟨∅, ∅⟩).
- Special case: two positive patterns ⟨d1, d2⟩ and ⟨e1, e2⟩ are sufficient if d1 ≠ e1 and d2 ≠ e2:
  ⇒ S = ⟨∅, ∅⟩ → ⟨d1, d2⟩ → ⟨∗, ∗⟩

Exercise 1: Version Spaces
Task (f)

We are now extending the number of constraints used for describing training instances by one additional constraint named “fever”. We say that somebody is ill if he has a running nose and is coughing (as we did before), or if he has a fever.

Training example   N (running nose)   C (coughing)   R (reddened skin)   F (fever)   Classification
d1                 +                  +              +                   −           positive (ill)
d2                 +                  +              −                   −           positive (ill)
d3                 −                  −              +                   +           positive (ill)
d4                 +                  −              −                   −           negative (healthy)
d5                 −                  −              −                   −           negative (healthy)
d6                 −                  +              +                   −           negative (healthy)

Exercise 1: Version Spaces
Task (f)

How does the version space approach using the CE algorithm perform now, given the training examples specified on the previous slide?

- Initially: S = {⟨∅ ∅ ∅ ∅⟩}, G = {⟨∗ ∗ ∗ ∗⟩}
- d1 = [⟨+ + + −⟩, pos] ⇒ S = {⟨+ + + −⟩}, G = {⟨∗ ∗ ∗ ∗⟩}
- d2 = [⟨+ + − −⟩, pos] ⇒ S = {⟨+ + ∗ −⟩}, G = {⟨∗ ∗ ∗ ∗⟩}
- d3 = [⟨− − + +⟩, pos] ⇒ S = {⟨∗ ∗ ∗ ∗⟩}, G = {⟨∗ ∗ ∗ ∗⟩}
  → We already arrive at S = G.
- d4 = [⟨+ − − −⟩, neg], processed with S = {⟨∗ ∗ ∗ ∗⟩} and G = {⟨∗ ∗ ∗ ∗⟩}:
  - S becomes empty, since ⟨∗ ∗ ∗ ∗⟩ is inconsistent with d4 and is removed from S.
  - G would be specialized to {⟨− ∗ ∗ ∗⟩, ⟨∗ + ∗ ∗⟩, ⟨∗ ∗ + ∗⟩, ⟨∗ ∗ ∗ +⟩}, but every element of G must be more general than at least one element of S.
    → This requirement cannot be fulfilled since S = ∅. ⇒ G = ∅
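
As a quick check, running the CE sketch given for task (c) on this data (with n = 4) should likewise end with S = ∅ and G = ∅.
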
Exercise 1: Version Spaces
Task (f)

What happens if the order of presentation of the training examples is altered?

- Even a change in the order of presentation does not lead to a learning success (i.e. to S = G ≠ ∅).
- When applying the CE algorithm, S and G become empty independently of the presentation order.
- Reason: the informally specified target concept of an “ill person” is a disjunctive concept.
- The target concept is not an element of the hypothesis space H (which consists of conjunctive hypotheses only).

Exercise 2: Decision Tree Learning with ID3
Task (a)

Apply the ID3 algorithm to the training data in the table.

Training example   fever     vomiting   diarrhea   shivering   Classification
d1                 no        no         no         no          healthy (H)
d2                 average   no         no         no          influenza (I)
d3                 high      no         no         yes         influenza (I)
d4                 high      yes        yes        no          salmonella poisoning (S)
d5                 average   no         yes        no          salmonella poisoning (S)
d6                 no        yes        yes        no          bowel inflammation (B)
d7                 average   yes        yes        no          bowel inflammation (B)

Exercise 2: Decision Tree Learning with ID3
Task (a)

Exemplary calculation for the first (root) node.


- entropy of the given data set S:
  Entropy(S) = −(1/7)·log2(1/7) − (2/7)·log2(2/7) − (2/7)·log2(2/7) − (2/7)·log2(2/7) = 1.950
- consider attribute x = “Fever”

  Value          H   I   S   B   class distribution     Entropy(S_i)
  S1 (no)        1   0   0   1   [1/2, 0, 0, 1/2]       1
  S2 (average)   0   1   1   1   [0, 1/3, 1/3, 1/3]     1.585
  S3 (high)      0   1   1   0   [0, 1/2, 1/2, 0]       1

  ⇒ Entropy(S|Fever) = (2/7)·1 + (3/7)·1.585 + (2/7)·1 = 1.251
  (the sketch below reproduces these two values)
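
These numbers can be checked with a few lines of Python; the probabilities are typed in directly from the table above, and all names are my own:

    from math import log2

    def entropy(distribution):
        """Entropy of a class distribution given as a list of probabilities."""
        return -sum(p * log2(p) for p in distribution if p > 0)

    print(entropy([1/7, 2/7, 2/7, 2/7]))                    # Entropy(S) ~ 1.950
    fever_parts = [(2/7, [1/2, 1/2]),                       # S1 (no)
                   (3/7, [1/3, 1/3, 1/3]),                  # S2 (average)
                   (2/7, [1/2, 1/2])]                       # S3 (high)
    print(sum(w * entropy(d) for w, d in fever_parts))      # Entropy(S|Fever) ~ 1.251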

Exercise 2: Decision Tree Learning with ID3
Task (a)

- consider attribute x = “Vomiting”

  Value      H   I   S   B   class distribution     Entropy(S_i)
  S1 (yes)   0   0   1   2   [0, 0, 1/3, 2/3]       0.918
  S2 (no)    1   2   1   0   [1/4, 2/4, 1/4, 0]     1.5

  ⇒ Entropy(S|Vomiting) = (3/7)·0.918 + (4/7)·1.5 = 1.251

- consider attribute x = “Diarrhea”

  Value      H   I   S   B   class distribution     Entropy(S_i)
  S1 (yes)   0   0   2   2   [0, 0, 2/4, 2/4]       1
  S2 (no)    1   2   0   0   [1/3, 2/3, 0, 0]       0.918

  ⇒ Entropy(S|Diarrhea) = (4/7)·1 + (3/7)·0.918 = 0.965

- consider attribute x = “Shivering”

  Value      H   I   S   B   class distribution     Entropy(S_i)
  S1 (yes)   0   1   0   0   [0, 1, 0, 0]           0
  S2 (no)    1   1   2   2   [1/6, 1/6, 2/6, 2/6]   1.918

  ⇒ Entropy(S|Shivering) = (1/7)·0 + (6/7)·1.918 = 1.644

Exercise 2: Decision Tree Learning with ID3
Task (a)

Choose the attribute that maximizes the information gain:

- Fever:     Gain(S, Fever)     = Entropy(S) − Entropy(S|Fever)     = 1.95 − 1.251 = 0.699
- Vomiting:  Gain(S, Vomiting)  = Entropy(S) − Entropy(S|Vomiting)  = 1.95 − 1.251 = 0.699
- Diarrhea:  Gain(S, Diarrhea)  = Entropy(S) − Entropy(S|Diarrhea)  = 1.95 − 0.965 = 0.985
- Shivering: Gain(S, Shivering) = Entropy(S) − Entropy(S|Shivering) = 1.95 − 1.644 = 0.306

⇒ Attribute “Diarrhea” is the most effective one, maximizing the information gain (a small selection sketch is given below).
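
For completeness, a self-contained Python sketch of this root-node attribute selection; the column order follows the table on the earlier slide, and the identifiers are my own:

    from collections import Counter
    from math import log2

    # Each record: (fever, vomiting, diarrhea, shivering, class)
    records = [('no', 'no', 'no', 'no', 'H'),
               ('average', 'no', 'no', 'no', 'I'),
               ('high', 'no', 'no', 'yes', 'I'),
               ('high', 'yes', 'yes', 'no', 'S'),
               ('average', 'no', 'yes', 'no', 'S'),
               ('no', 'yes', 'yes', 'no', 'B'),
               ('average', 'yes', 'yes', 'no', 'B')]
    attributes = ['fever', 'vomiting', 'diarrhea', 'shivering']

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * log2(c / n) for c in Counter(labels).values())

    def information_gain(records, col):
        n = len(records)
        labels = [r[-1] for r in records]
        cond = sum(len(part) / n * entropy([r[-1] for r in part])
                   for value in {r[col] for r in records}
                   for part in [[r for r in records if r[col] == value]])
        return entropy(labels) - cond

    gains = {a: information_gain(records, i) for i, a in enumerate(attributes)}
    print(gains)                        # ~ fever 0.699, vomiting 0.699, diarrhea 0.985, shivering 0.306
    print(max(gains, key=gains.get))    # 'diarrhea' becomes the root attribute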

Exercise 2: Decision Tree Learning with ID3
Task (b)

Does the resulting decision tree provide a disjoint definition of the classes?

- Yes, the resulting decision tree provides disjoint class definitions: every instance is sorted down exactly one path from the root to a leaf, so the leaves partition the instance space into non-overlapping regions.

Exercise 2: Decision Tree Learning with ID3
Task (c)

Consider the use of real-valued attributes when learning decision trees, as described in the lecture.
The data in the table below shows the relationship between the body height and the gender of a group of persons (the records have been sorted with respect to the value of height in cm).

  Height   161   164   169   175   176   179   180   184   185
  Gender    F     F     M     M     F     F     M     M     F

- Calculate the information gain for the potential splitting thresholds (recall that cut points must always lie at class boundaries) and determine the best one.
- Potential cut points must lie in the intervals (164, 169), (175, 176), (179, 180), or (184, 185); a small sketch for enumerating these candidates is given below.
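
A short Python sketch (variable names are my own) that enumerates one candidate threshold per class boundary; taking the midpoint of each boundary interval is one common convention, not prescribed by the exercise:

    heights = [161, 164, 169, 175, 176, 179, 180, 184, 185]
    genders = ['F', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'F']

    cuts = [(heights[i] + heights[i + 1]) / 2
            for i in range(len(heights) - 1)
            if genders[i] != genders[i + 1]]
    print(cuts)   # [166.5, 175.5, 179.5, 184.5], one candidate per class boundary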

Exercise 2: Decision Tree Learning with ID3
Task (c)

- Calculate the information gain for the potential splitting thresholds (continued).
- C1 ∈ (164, 169)
  - resulting class distribution: if x < C1 then 2 F / 0 M, else 3 F / 4 M
  - conditional entropy: if x < C1 then E = 0, else E = −(3/7)·log2(3/7) − (4/7)·log2(4/7) = 0.985
  - entropy: E(C1|S) = (2/9)·0 + (7/9)·0.985 = 0.766
- C2 ∈ (175, 176)
  - resulting class distribution: if x < C2 then 2 F / 2 M, else 3 F / 2 M
  - entropy: E(C2|S) = (4/9)·1 + (5/9)·0.971 = 0.984
- C3 ∈ (179, 180)
  - resulting class distribution: if x < C3 then 4 F / 2 M, else 1 F / 2 M
  - entropy: E(C3|S) = (6/9)·0.918 + (3/9)·0.918 = 0.918

Exercise 2: Decision Tree Learning with ID3
Task (c)

- Calculate the information gain for the potential splitting thresholds (continued).
- C4 ∈ (184, 185)
  - resulting class distribution: if x < C4 then 4 F / 4 M, else 1 F / 0 M
  - entropy: E(C4|S) = (8/9)·1 + (1/9)·0 = 0.889
- Prior entropy of S: −(5/9)·log2(5/9) − (4/9)·log2(4/9) = 0.991
- Information gain: Gain(S, C1) = 0.225, Gain(S, C2) = 0.007, Gain(S, C3) = 0.073, and Gain(S, C4) = 0.102
  → The first splitting point (C1) is the best one (a small sketch reproducing these gains is given below).
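
A Python sketch reproducing this comparison, using the interval midpoints as concrete thresholds (any value strictly inside the respective interval yields the same split); identifiers are my own:

    from math import log2

    heights = [161, 164, 169, 175, 176, 179, 180, 184, 185]
    genders = ['F', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'F']

    def entropy(labels):
        n = len(labels)
        return -sum(p * log2(p) for p in (labels.count(c) / n for c in set(labels)))

    def split_gain(threshold):
        left = [g for h, g in zip(heights, genders) if h < threshold]
        right = [g for h, g in zip(heights, genders) if h >= threshold]
        n = len(genders)
        cond = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        return entropy(genders) - cond

    for t in [166.5, 175.5, 179.5, 184.5]:   # midpoints of the four boundary intervals
        print(t, round(split_gain(t), 3))
    # expected approximately: 166.5 -> 0.225, 175.5 -> 0.007, 179.5 -> 0.073, 184.5 -> 0.102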

