Decision Trees in Machine Learning

 
Decision Trees
 
 
Outline
 
** Figures are from the textbook site (or by the instructor).
 
I. Decision trees
 
II. Learning decision trees from examples
 
III. Entropy and attribute choice
Restaurant Waiting Problem

Decide whether to wait for a table at a restaurant.

Output: a Boolean variable WillWait (true when we do wait for a table).

Input: a vector of ten attributes, each with discrete values:
 
1. Alternate: whether there is a suitable alternative restaurant nearby.
2. Bar: whether the restaurant has a comfortable bar area to wait in.
3. Fri/Sat: true on Fridays and Saturdays.
4. Hungry: whether we are hungry right now.
5. Patrons: how many people are in the restaurant (values: None, Some, and Full).
6. Price: the restaurant’s price range ($, $$, $$$).
7. Raining: whether it is raining outside.
8. Reservation: whether we made a reservation.
9. Type: the kind of restaurant (French, Italian, Thai, or burger).
10. WaitEstimate: host’s wait estimate: 0-10, 10-30, 30-60, or >60 minutes.

There are 2⁶ · 3² · 4² = 9,216 possible combinations of attribute values.
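As a quick check of that count, here is a minimal Python sketch that multiplies the number of distinct values of each attribute (the dictionary simply restates the list above):

```python
from math import prod

# Number of distinct values for each of the ten input attributes.
domain_sizes = {
    "Alternate": 2, "Bar": 2, "Fri/Sat": 2,
    "Hungry": 2, "Raining": 2, "Reservation": 2,   # six Boolean attributes
    "Patrons": 3, "Price": 3,                      # two three-valued attributes
    "Type": 4, "WaitEstimate": 4,                  # two four-valued attributes
}

print(prod(domain_sizes.values()))   # 2**6 * 3**2 * 4**2 = 9216
```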
Training Examples
 
 
The correct output is given for only 12 out of 9,216 examples.
 
 
We need to make our best guess at the missing 9,204 output values.
 
I. What is a Decision Tree?
 
 
A decision tree maps a set of attribute values to a “decision”.

It performs a sequence of tests, starting at the root and descending down to a leaf (which specifies the output value).

Each internal node corresponds to a test of the value of an attribute.

[Figure: a decision tree for the restaurant waiting problem]
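That test sequence can be sketched in a few lines of Python, with each internal node represented as a dict that maps the tested attribute to its branches; the tiny tree below is a hypothetical fragment for illustration, not the full tree from the figure:

```python
def classify(tree, example):
    """Follow tests from the root down to a leaf; leaves are output values."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))   # the attribute tested at this node
        tree = branches[example[attribute]]              # descend along the matching branch
    return tree

# Hypothetical fragment of a restaurant-waiting tree (for illustration only).
tree = {"Patrons": {"None": "No",
                    "Some": "Yes",
                    "Full": {"Hungry": {"Yes": "Yes", "No": "No"}}}}

print(classify(tree, {"Patrons": "Full", "Hungry": "Yes"}))   # -> Yes
```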
Boolean Classification
 
 
Inputs are discrete values.
 
Outputs are either true (a positive example) or false (a negative one).
 
Positive examples:
Patrons = Some
Patrons = Full ∧ WaitEstimate = 0-10
Patrons = Full ∧ WaitEstimate = 10-30 ∧ Hungry = Yes ∧ Alternate = No

Negative examples:
Patrons = None
Patrons = Full ∧ WaitEstimate > 60
 
Shape Recognition Tree
 
A robotic hand can recognize curved shapes
from differential invariants (expressions of curvature,
torsion, etc.) constructed over tactile data.
 
https://faculty.sites.iastate.edu/jia/files/inline-files/IJRR05.pdf
 
Why Decision Trees?
 
 
They yield nice, concise, and understandable results.

They represent simple rules for classifying instances that are described by discrete attribute values.

Decision tree learning algorithms are relatively efficient: linear in the size of the decision tree and the size of the data set.

Decision trees are often among the first to be tried on a new data set.

They are not good with real-valued attributes.
Expressiveness of a Decision Tree
 
Equivalent logical statement of a decision tree (for a Boolean output):

Output ⇔ Path_1 ∨ Path_2 ∨ … ∨ Path_n

where each Path_i = (A_1 = v_1 ∧ A_2 = v_2 ∧ …) is a conjunction of attribute = value tests along one path from the root to a Yes leaf. This is a disjunctive normal form (DNF); one such statement is written for Output = Yes and another for Output = No.

Any sentence in propositional logic can be converted into a DNF.

Any sentence in propositional logic can be expressed as a decision tree.
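As a toy illustration (a hypothetical two-attribute fragment, not the tree from the lecture's figure), a tree that answers Yes when Patrons = Some, and otherwise only when Patrons = Full and Hungry = Yes, reads off as the DNF:

```latex
% Hypothetical fragment: Yes-leaves reached via Patrons = Some,
% or via Patrons = Full followed by Hungry = Yes.
\[
  \mathit{WillWait} \;\Leftrightarrow\;
    (\mathit{Patrons} = \mathit{Some})
    \;\lor\;
    (\mathit{Patrons} = \mathit{Full} \land \mathit{Hungry} = \mathit{Yes})
\]
```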
II. Learning DTs from Examples
 
Problem

Find a tree that is a) consistent with the training examples (in, e.g., restaurant waiting) and b) as small as possible.

Generally, it’s intractable to find a smallest consistent tree.

A suboptimal tree can be constructed with some simple heuristics.

Greedy divide-and-conquer strategy:

Test the most important attribute first (the one that makes the most difference to the classification).

Recursively solve the smaller subproblems.
Poor and Good Attributes
 
Poor attribute Type: four outcomes, each with the same number of positive as negative examples.

Good attributes Patrons and Hungry: effective separations of positive and negative examples.
Four Cases
 
1. All positive (or all negative) remaining examples. Done.

2. Mixed positive and negative remaining examples. Choose the best attribute to split them.

3. No examples left. Return the most common output value for the example set used in constructing the parent.

4. No attributes left but both positive and negative examples. Return the most common output value of these examples (by a majority vote).
Algorithm for Learning DTs
 
[Figure: pseudocode of the decision-tree learning algorithm; its annotations mark case 3, case 1, case 4, and case 2 from the previous slide]
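The pseudocode figure itself does not survive in this transcript, so here is a minimal Python sketch of the same recursive procedure, following the four cases above. The example representation (each example as a dict of attribute values with its label stored under the hypothetical key "WillWait") and the importance callback are assumptions; in Part III, importance would be the information gain of an attribute.

```python
from collections import Counter

def plurality_value(examples):
    """Most common output value among the examples (majority vote)."""
    return Counter(e["WillWait"] for e in examples).most_common(1)[0][0]

def learn_decision_tree(examples, attributes, parent_examples, importance):
    if not examples:                                  # case 3: no examples left
        return plurality_value(parent_examples)
    labels = {e["WillWait"] for e in examples}
    if len(labels) == 1:                              # case 1: all positive or all negative
        return labels.pop()
    if not attributes:                                # case 4: no attributes left
        return plurality_value(examples)
    # case 2: mixed examples -- split on the most important remaining attribute
    best = max(attributes, key=lambda a: importance(a, examples))
    tree = {best: {}}
    for value in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = learn_decision_tree(subset, remaining, examples, importance)
    return tree
```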
Output Decision Tree
 
 
[Figure: the tree learned from the 12 examples, next to the original decision tree] Considerably simpler!
 
Learning Curve
 
 
 100 randomly generated examples in the restaurant domain.
 
 Split them randomly into a training set and a test set.
 
 Split 20 times for each size (1 – 99) of the training set.
 
 Average the results of the 20 trials.
 
[Figure: a “happy graph”: prediction accuracy on the test set increases with the size of the training set]
III. Entropy and Attribute Choice
 
The importance of an attribute is measured using the notion of information gain, which is defined in terms of entropy, a measure of the uncertainty of a random variable: the more information, the less entropy.

A fair coin has 1 bit of entropy, as it is equally likely to come up heads and tails when flipped.

A four-sided die has 2 bits of entropy, since there are 2² equally likely outcomes.

An unfair coin that comes up heads 99% of the time would have an entropy measure that is close to zero.
Entropy
 
 
A random variable X takes values x_i, each with probability P(x_i). The entropy of X is

H(X) = Σ_i P(x_i) log2(1 / P(x_i)) = −Σ_i P(x_i) log2 P(x_i).

Entropy measures the expected amount of information conveyed by identifying the outcome of a random trial.
Entropy Examples
 
 
A fair coin: H(Fair) = −(0.5 log2 0.5 + 0.5 log2 0.5) = 1 bit. There is maximum surprise in this case.

A coin with known outcome (e.g., heads): H(Head) = −(1 · log2 1) = 0 bits. There is no uncertainty at all, hence no information.

A loaded coin with 99% heads: H(Loaded) = −(0.99 log2 0.99 + 0.01 log2 0.01) ≈ 0.08 bits.

A four-sided die: H(Die4) = −(4 · 0.25 log2 0.25) = 2 bits.
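These four values can be reproduced with a small Python sketch (the helper name entropy is just for illustration):

```python
from math import log2

def entropy(probs):
    """H(X) = sum over i of P(x_i) * log2(1 / P(x_i)), skipping zero-probability outcomes."""
    return sum(p * log2(1 / p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))     # fair coin       -> 1.0 bit
print(entropy([1.0]))          # known outcome   -> 0.0 bits
print(entropy([0.99, 0.01]))   # loaded coin     -> ~0.0808 bits
print(entropy([0.25] * 4))     # four-sided die  -> 2.0 bits
```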
Entropy of a Boolean Variable

A Boolean random variable is true with probability q. Its entropy is

B(q) = −(q log2 q + (1 − q) log2(1 − q)),  e.g., B(Loaded) = B(0.99) ≈ 0.08.

A training set with p positive and n negative examples has output entropy

H(Output) = B(p / (p + n)).

For the restaurant data, p = n = 6, so H(Output) = B(0.5) = 1 bit.

Choice of an Attribute
 
 
An attribute A with d distinct values divides the training set E into subsets E_1, …, E_d. Each subset E_k has p_k positive and n_k negative examples, where Σ_k p_k = p and Σ_k n_k = n.

Consider a randomly chosen example from the training set. Its value of A falls in E_k with probability (p_k + n_k) / (p + n), and classifying it then requires an additional B(p_k / (p_k + n_k)) bits of information.

The expected entropy remaining after testing attribute A is

Remainder(A) = Σ_{k=1..d} (p_k + n_k) / (p + n) · B(p_k / (p_k + n_k)).

Information Gain

The expected reduction in entropy from the attribute test on A is

Gain(A) = B(p / (p + n)) − Remainder(A).

Choose the attribute A, among the remaining attributes, that maximizes Gain(A).

Information Gain

For the 12 restaurant examples (p = n = 6):

Gain(Type) = 1 − [2/12 · B(1/2) + 2/12 · B(1/2) + 4/12 · B(2/4) + 4/12 · B(2/4)] = 0 bits

Gain(Patrons) = 1 − [2/12 · B(0/2) + 4/12 · B(4/4) + 6/12 · B(2/6)] ≈ 0.541 bits
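A minimal Python sketch of these formulas, reproducing both gains; the per-value positive/negative counts are read directly off the expressions above, and the helper names are just for illustration:

```python
from math import log2

def B(q):
    """Entropy of a Boolean variable that is true with probability q."""
    if q in (0.0, 1.0):
        return 0.0
    return -(q * log2(q) + (1 - q) * log2(1 - q))

def remainder(splits, p, n):
    """splits: one (p_k, n_k) pair per value of the attribute being tested."""
    return sum((pk + nk) / (p + n) * B(pk / (pk + nk)) for pk, nk in splits)

def gain(splits, p, n):
    return B(p / (p + n)) - remainder(splits, p, n)

# 12 restaurant examples: 6 positive, 6 negative.
print(gain([(1, 1), (1, 1), (2, 2), (2, 2)], 6, 6))   # Type    -> ~0.0 bits
print(gain([(0, 2), (4, 0), (2, 4)], 6, 6))           # Patrons -> ~0.541 bits
```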
 