Chapter 3

Lazy Learning – Classification Using Nearest Neighbors
 
The approach
 
An adage: if it smells like a duck and tastes like a duck, then you are probably eating duck.
A maxim: birds of a feather flock together.
The idea: things that are alike are likely to have properties that are alike.
The principle for machine learning: classify data by placing it in the same category as similar or "nearest" neighbors.
 
Topics of this chapter
 
The key concepts that define nearest neighbor classifiers, and why they are considered "lazy" learners
Methods to measure the similarity of two examples using distance
How to apply a popular nearest neighbor classifier called k-NN
 
Understanding nearest neighbor classification
 
Nearest neighbor classifiers are defined by their characteristic of classifying unlabeled examples by assigning them the class of the most similar labeled examples.
Nearest neighbor methods are extremely powerful. They have been used successfully for:
Computer vision applications, including optical character recognition and
facial recognition in both still images and video
Predicting whether a person will enjoy a movie or music recommendation
Identifying patterns in genetic data, perhaps to use them in detecting specific
proteins or diseases
 
Suited and unsuited areas
 
Suited: relationships among the features and the target classes are numerous, complicated, or extremely difficult to understand, yet items of a similar class type tend to be fairly homogeneous.
Or, if a concept is difficult to define, but you know it when you see it, then nearest neighbors might be appropriate.
Unsuited: if the data is noisy and thus no clear distinction exists among the groups, nearest neighbor algorithms may struggle to identify the class boundaries.
 
The k-NN algorithm
 
The k-nearest neighbors algorithm (k-NN):
one of the simplest machine learning algorithms
widely used
strengths and weaknesses:
 
Strengths:
• Simple and effective
• Makes no assumptions about the underlying data distribution
• Fast training phase

Weaknesses:
• Does not produce a model, limiting the ability to understand how the features are related to the class
• Requires selection of an appropriate k
• Slow classification phase
• Nominal features and missing data require additional processing
 
The process
 
k-NN uses information about an example's k-nearest neighbors to classify unlabeled examples.
k is a variable term implying that any number of nearest neighbors could be used.
The algorithm requires a training dataset made up of examples that have been classified into several categories, as labeled by a nominal variable.
For each unlabeled record in the test dataset, k-NN identifies the k records in the training data that are the "nearest" in similarity. The unlabeled test instance is assigned the class of the majority of the k nearest neighbors.
 
Blind tasting experience
 
To classify ingredients:
rate two features of each ingredient: crunchiness (1-10) and sweetness (1-10)
label each ingredient as one of three types of food: fruits, vegetables, or proteins
 
Feature space
 
The k-NN algorithm treats the features as coordinates in a multidimensional feature space.
(Scatter plot of the ingredient dataset: sweetness vs. crunchiness.)
 
The pattern
 
Similar types of food tend to be grouped closely together:
vegetables tend to be crunchy but not sweet
fruits tend to be sweet and either crunchy or not crunchy
proteins tend to be neither crunchy nor sweet
 
Is a tomato a fruit or a vegetable?
 
Measuring similarity with distance
 
Distance function - a formula that measures the similarity between two instances
Euclidean distance - the shortest direct route, measured "as the crow flies"; traditionally used by k-NN, including R's implementation
Manhattan distance - based on the paths a pedestrian would take by walking city blocks
 
Euclidean distance
 
p and q - the examples to be compared, each having n features:

dist(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)

E.g., the distance between the tomato (sweetness = 6, crunchiness = 4) and the green bean (sweetness = 3, crunchiness = 7):

dist(tomato, green bean) = sqrt((6 - 3)^2 + (4 - 7)^2) = sqrt(18) ≈ 4.2
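The same arithmetic in R:
> sqrt((6 - 3)^2 + (4 - 7)^2)
[1] 4.242641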
 
Computing the distance in this way between the tomato and several of its closest neighbors identifies its nearest matches.
 
Classification of tomato
 
1-NN classification: assign the tomato the food type of its single nearest neighbor, the orange - the tomato is a fruit.
k-NN with k = 3: perform a vote among the three nearest neighbors - orange, grape, and nuts. Since the fruits (orange and grape) hold the majority, the tomato is a fruit.
 
Choosing an appropriate k
 
Bias-variance tradeoff - the balance between overfitting and underfitting the training data
A large k:
reduces the impact or variance caused by noisy data
biases the learner so that it runs the risk of ignoring small but important patterns
Extreme stance: k as large as the total number of observations in the training data:
every training instance is represented in the final vote
the most common class always has a majority of the voters
the model would always predict the majority class, regardless of the nearest neighbors
 
The opposite extreme
 
Using a single nearest neighbor (k = 1):
noisy data or outliers can unduly influence the classification of examples
any unlabeled example that happens to be nearest to an incorrectly labeled neighbor will be predicted to have the incorrect class
The best k value is somewhere between these two extremes.
Choosing k depends on:
the difficulty of the concept to be learned
the number of records in the training data
One common practice is to begin with k equal to the square root of the number of training examples (see the sketch below).
E.g., k = 4 because there were 15 example ingredients in the training data (sqrt(15) ≈ 3.9).
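A one-line version of this rule of thumb in R (train_data here is a hypothetical data frame standing in for any training set):
> k <- round(sqrt(nrow(train_data)))  # sqrt(15) is about 3.9, so k = 4 for 15 examples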
 
Decision boundary
 
Smaller values of k allow more complex decision boundaries that
more carefully fit the training data
 
Choosing k
 
Rules of thumb may not always result in the single best k.
An alternative approach is to test several k values on a variety of test datasets and choose the one that performs best.
A large training dataset can make the choice of k less important:
even subtle concepts will have a sufficiently large pool of examples to vote as nearest neighbors
A less common solution: choose a larger k, but apply a weighted voting process in which the vote of closer neighbors is considered more authoritative than the vote of neighbors farther away (a sketch follows).
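A minimal sketch of distance-weighted voting, not from the chapter: train is assumed to be a numeric feature matrix, labels a factor of classes, and query a single numeric vector with the same columns as train.
weighted_knn_vote <- function(train, labels, query, k) {
  d <- sqrt(rowSums(sweep(train, 2, query)^2))  # Euclidean distance to each training row
  nn <- order(d)[1:k]                           # indices of the k nearest neighbors
  w <- 1 / (d[nn] + 1e-8)                       # closer neighbors cast heavier votes
  votes <- tapply(w, labels[nn], sum)           # total vote weight per class
  names(which.max(votes))                       # class with the largest weighted vote
}
Inverse distance is only one possible weighting scheme; any decreasing function of distance would serve.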
 
Preparing data for use with k-NN
 
Features are typically transformed to a standard range prior to applying the k-NN algorithm.
The distance formula is highly dependent on how features are measured:
if certain features have a much larger range of values than the others, the distance measurements will be strongly dominated by the features with larger ranges
The solution is to rescale the features by shrinking or expanding their range such that each one contributes relatively equally to the distance formula.
 
Min-max normalization
 
The traditional method of rescaling features for k-NN.
Transforms a feature such that all of its values fall in a range between 0 and 1:

X_new = (X - min(X)) / (max(X) - min(X))

Normalized feature values can be interpreted as indicating how far, from 0 percent to 100 percent, the original value fell along the range between the original minimum and maximum.
 
Z-score standardization
 
 
 
z-score: how many standard deviations a feature's values fall above or below the mean value:

X_new = (X - mean(X)) / sd(X)

z-scores fall in an unbounded range of negative and positive numbers.
Issue with min-max normalization: the minimum or maximum of future cases might be outside the range of values observed in the training data.
Issue with z-score standardization: future examples may not have a similar mean and standard deviation as the training examples.
 
Distance between nominal features
 
The Euclidean distance formula is not defined for nominal data.
Dummy coding: a value of 1 indicates one category, and 0 the other
E.g., dummy coding for a two-category gender variable
An n-category nominal feature can be dummy coded by creating binary indicator variables for (n - 1) levels of the feature
E.g., dummy coding for a three-category temperature variable (for example, hot, medium, or cold) could be set up as (3 - 1) = 2 features, as sketched below
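A minimal illustration in R (the temperature vector here is hypothetical):
> temperature <- c("hot", "medium", "cold", "medium", "hot")
> hot <- ifelse(temperature == "hot", 1, 0)
> medium <- ifelse(temperature == "medium", 1, 0)
> # hot = 0 and medium = 0 implies the omitted baseline category, cold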
 
Distance between dummy coded features
 
The distance between dummy coded features is always one or zero, so no additional transformation is necessary.
An alternative to dummy coding for an ordinal feature: number the categories and apply normalization
E.g., cold, warm, and hot could be numbered as 1, 2, and 3, which normalizes to 0, 0.5, and 1
Caveat: this approach should only be used if the steps between the categories are equivalent
E.g., the steps between groups of poor, middle class, and wealthy are not equal
 
Why is the k-NN algorithm lazy?
 
Lazy learning - no abstraction occurs; the abstraction and generalization processes are skipped altogether.
Under the strict definition of learning, a lazy learner is not really learning anything:
it merely stores the training data verbatim
the training phase can be done very rapidly
the process of making predictions tends to be relatively slow in comparison to training
Instance-based learning or rote learning: heavy reliance on the training instances rather than an abstracted model
 
Non-parametric learning methods
 
No parameters are learned about the data.
This allows the learner to find natural patterns rather than trying to fit the data into a preconceived and potentially biased functional form.
 
Example – diagnosing breast cancer with the k-NN algorithm
 
Detecting cancer by applying the k-NN algorithm to measurements of biopsied cells from women with abnormal breast masses.
Step 1 – collecting data
Wisconsin Breast Cancer Diagnostic dataset from the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml
donated by researchers of the University of Wisconsin
includes measurements from digitized images of fine-needle aspirate of a breast mass
values represent the characteristics of the cell nuclei present in the digital image
 
The dataset
 
569 examples of cancer biopsies, each with 32 features:
identification number
cancer diagnosis - "M" to indicate malignant or "B" to indicate benign
30 numeric-valued laboratory measurements - the mean, standard error, and worst (that is, largest) value for 10 different characteristics of the digitized cell nuclei:

• Radius
• Texture
• Perimeter
• Area
• Smoothness
• Compactness
• Concavity
• Concave points
• Symmetry
• Fractal dimension
 
Step 2 – exploring and preparing the data
 
Download the wisc_bc_data.csv file from the Packt website
a header line was added
rows of data were randomly ordered
Importing the CSV data file:
> wbcd <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)
> str(wbcd)
 
Exclude ID
 
ID variables should always be excluded.
Neglecting to do so can lead to erroneous findings: an ID can be used to uniquely "predict" each example, so a model that includes an identifier will suffer from overfitting.
Exclude the ID (first column):
> wbcd <- wbcd[-1]
Target feature – diagnosis:
> table(wbcd$diagnosis)
  B   M
357 212
 
target feature as a factor
 
Many R machine learning classifiers require that the target feature is coded as a factor:
> wbcd$diagnosis <- factor(wbcd$diagnosis, levels = c("B", "M"),
                           labels = c("Benign", "Malignant"))
> round(prop.table(table(wbcd$diagnosis)) * 100, digits = 1)
   Benign Malignant
     62.7      37.3
 
30 numeric features

The remaining 30 features are numeric and span very different ranges. Observation: normalization is needed!
 
Transformation – normalizing numeric data
 
create a normalize() function in R:
> normalize <- function(x) { return ((x - min(x)) / (max(x) - min(x))) }
test the function:
> normalize(c(1, 2, 3, 4, 5))
[1] 0.00 0.25 0.50 0.75 1.00
> normalize(c(10, 20, 30, 40, 50))
[1] 0.00 0.25 0.50 0.75 1.00
 
lapply() function
 
The lapply() function takes a list and applies a specified function to each list element:
> wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))
Confirm the correctness of the transformation, for example as shown below.
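One quick way to confirm (assuming the dataset's area_mean column name; after min-max normalization every feature should span exactly 0 to 1):
> summary(wbcd_n$area_mean)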
 
Data preparation – creating training and test datasets
 
Simulate the scenario of applying the trained model to an unlabeled dataset.
Divide the data into two portions:
a training dataset to build the k-NN model: the first 469 records
a test dataset to estimate the predictive accuracy of the model: the remaining 100 records
> wbcd_train <- wbcd_n[1:469, ]
> wbcd_test <- wbcd_n[470:569, ]
Store class labels in factor vectors:
> wbcd_train_labels <- wbcd[1:469, 1]
> wbcd_test_labels <- wbcd[470:569, 1]
 
Step 3 – training a model on the data
 
Training a lazy learner like k-NN simply involves storing the input data in a structured format.
The class package:
> install.packages("class")
> library(class)
The knn() function in the class package provides:
a standard, classic implementation of the k-NN algorithm
Euclidean distance as the similarity measure
classification by taking a "vote" among the k nearest neighbors
tie votes broken at random
 
knn() function
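A sketch of the call, following the interface documented in the class package:
p <- knn(train, test, cl, k)
train: a data frame containing numeric training data
test: a data frame containing numeric test data
cl: a factor vector with the class for each row in the training data
k: an integer specifying the number of nearest neighbors
Returns: a factor vector of predicted classes for each row of the test data frame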
 
Apply the k-NN algorithm, trying k = 21:
the training data includes 469 instances, and sqrt(469) ≈ 21.7
with a two-category outcome, using an odd number eliminates the chance of ending with a tie vote
> wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
                        cl = wbcd_train_labels, k = 21)
 
Step 4 – evaluating model performance
 
The CrossTable() function is provided by the gmodels package:
> library(gmodels)
> CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,
             prop.chisq = FALSE)
 
Four categories
 
True negative results: top-left cell – 61%
True positive results: bottom-right cell – 37%
False negative results: lower-left cell – 2%
errors in this direction could be extremely costly: they may lead a patient to believe that she is cancer-free when, in reality, the disease may continue to spread
False positive results: top-right cell – 0%
less dangerous than a false negative result, but false positives lead to an additional financial burden on the health care system
 
Step 5 – improving model performance
 
Two simple variations on our previous classifier:
an alternative method for rescaling our numeric features
several different values for k
Transformation – z-score standardization
The scale() function:
rescales values using z-score standardization
can be applied directly to a data frame
> wbcd_z <- as.data.frame(scale(wbcd[-1]))
rescales all the features, with the exception of diagnosis
Confirm the correctness of the transformation (see the check below).
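One quick check (again assuming the area_mean column name; a correctly standardized feature should have a mean of 0 and values mostly between -3 and 3):
> summary(wbcd_z$area_mean)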
 
Divide the data into training and test sets
 
> wbcd_train <- wbcd_z[1:469, ]
> wbcd_test <- wbcd_z[470:569, ]
> wbcd_train_labels <- wbcd[1:469, 1]
> wbcd_test_labels <- wbcd[470:569, 1]
 
> wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
                        cl = wbcd_train_labels, k = 21)
 
Analyze results
 
> CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,
             prop.chisq = FALSE)

Result: a 5% false negative rate – worse than the min-max normalized version.
 
Testing alternative values of k
 
It would be unwise to tailor our approach too closely to our test data.
To be certain that a learner will generalize to future data, create several sets of 100 patients at random and repeatedly retest the result; a sketch comparing candidate k values follows.
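A minimal sketch of such a comparison, reusing the objects defined earlier (accuracy here is simple agreement with the held-out labels):
for (k_val in c(1, 5, 11, 15, 21, 27)) {
  pred <- knn(train = wbcd_train, test = wbcd_test,
              cl = wbcd_train_labels, k = k_val)
  cat("k =", k_val, "accuracy =", mean(pred == wbcd_test_labels), "\n")
}
These accuracies should themselves be averaged over several random train/test splits before settling on a final k.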
 
The End of Chapter 3