Machine Learning Techniques: K-Nearest Neighbour, K-fold Cross Validation, and K-Means Clustering

COMP 307 — Lecture 05
Machine Learning 2
3-K Techniques: K-Nearest Neighbour, K-fold Cross Validation and K-Means Clustering
Dr Bing Xue (bing.xue@ecs.vuw.ac.nz)
 
Outline

Nearest neighbour method
- Basic nearest neighbour method
- K-nearest neighbour method
- Distance measures/similarity measures

K-fold cross validation
- Leave-one-out cross validation
- K-fold cross validation vs validation set

K-means clustering
 
Dataset (Classification)

Iris Dataset: 150 examples/instances/observations/objects
- Each instance is represented as a feature vector and a desired/target class label

3 classes: Iris-setosa, Iris-versicolor, Iris-virginica

4 features/variables/attributes:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm

@Iris:
5.1,3.5,1.4,0.2, Iris-setosa
4.9,3.0,1.4,0.2, Iris-setosa
4.7,3.2,1.3,0.2, Iris-setosa
4.6,3.1,1.5,0.2, Iris-setosa
5.0,3.6,1.4,0.2, Iris-setosa
...
7.0,3.2,4.7,1.4, Iris-versicolor
6.4,3.2,4.5,1.5, Iris-versicolor
6.9,3.1,4.9,1.5, Iris-versicolor
5.5,2.3,4.0,1.3, Iris-versicolor
6.5,2.8,4.6,1.5, Iris-versicolor
...
6.3,3.3,6.0,2.5, Iris-virginica
5.8,2.7,5.1,1.9, Iris-virginica
7.1,3.0,5.9,2.1, Iris-virginica
6.3,2.9,5.6,1.8, Iris-virginica
...
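The @Iris lines above are plain comma-separated values: four numeric attributes followed by the class label. As a minimal illustration (not part of the lecture; the function name `parse_iris_lines` is made up for the example), such lines could be parsed into feature vectors and labels like this:

```python
def parse_iris_lines(lines):
    """Parse lines like '5.1,3.5,1.4,0.2, Iris-setosa' into (features, label)."""
    instances = []
    for line in lines:
        parts = [p.strip() for p in line.split(",")]
        features = [float(v) for v in parts[:-1]]  # the 4 numeric attributes
        label = parts[-1]                          # the desired/target class label
        instances.append((features, label))
    return instances

data = parse_iris_lines([
    "5.1,3.5,1.4,0.2, Iris-setosa",
    "7.0,3.2,4.7,1.4, Iris-versicolor",
])
print(data[0])  # ([5.1, 3.5, 1.4, 0.2], 'Iris-setosa')
```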
 
Nearest Neighbour

Given a training set with a number of instances, the nearest neighbour method works as follows:
- Each unseen instance (in the test set) is compared with all the instances in the training set
- Find the "nearest neighbour" (instance) from the training set
- The unseen instance is classified as the class of the nearest neighbour

@Iris:
5.1,3.5,1.4,0.2, Iris-setosa
4.9,3.0,1.4,0.2, Iris-setosa
...
7.0,3.2,4.7,1.4, Iris-versicolor
6.4,3.2,4.5,1.5, Iris-versicolor
...
6.3,3.3,6.0,2.5, Iris-virginica
5.8,2.7,5.1,1.9, Iris-virginica
 
K-Nearest Neighbour

K-nearest neighbour method:
- Similar to the nearest neighbour method
- But find the k nearest instances from the training set
- Then choose the majority class as the class label of the unseen instance

But how do we find the nearest neighbours?
 
Nearest Neighbour — Distance Measures

Given two feature vectors with numeric values, A = (a1, a2, ..., an) and B = (b1, b2, ..., bn), use the distance measure:

d = sqrt( (a1 - b1)^2/R1^2 + (a2 - b2)^2/R2^2 + ... + (an - bn)^2/Rn^2 )

i.e. d = sqrt( sum over i = 1..n of (ai - bi)^2 / Ri^2 ), where Ri is the range of the ith component.

The (k-)nearest neighbour method is simple, easy to use, and can achieve good results in many cases.

What problems can you find?
- Does this method explicitly learn a classifier? If yes, what is it?
- Is it efficient?
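Putting the last two slides together, a k-nearest neighbour classifier with the range-normalised distance above could be sketched as follows (an illustrative sketch, not the lecture's code; `knn_classify` and the six-instance training set are made up for the example):

```python
import math
from collections import Counter

def knn_classify(train, unseen, k=3):
    """Classify one unseen feature vector by the majority class of its k
    nearest training instances, using the range-normalised distance
    d = sqrt(sum_i (a_i - b_i)^2 / R_i^2).

    train: list of (feature_vector, class_label) pairs.
    """
    n = len(unseen)
    # R_i: range of the i-th feature over the training set
    ranges = []
    for i in range(n):
        col = [x[i] for x, _ in train]
        r = max(col) - min(col)
        ranges.append(r if r > 0 else 1.0)  # avoid division by zero

    def dist(a, b):
        return math.sqrt(sum((a[i] - b[i]) ** 2 / ranges[i] ** 2 for i in range(n)))

    # find the k nearest instances, then take the majority class
    neighbours = sorted(train, key=lambda inst: dist(inst[0], unseen))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

train = [
    ([5.1, 3.5, 1.4, 0.2], "Iris-setosa"),
    ([4.9, 3.0, 1.4, 0.2], "Iris-setosa"),
    ([7.0, 3.2, 4.7, 1.4], "Iris-versicolor"),
    ([6.4, 3.2, 4.5, 1.5], "Iris-versicolor"),
    ([6.3, 3.3, 6.0, 2.5], "Iris-virginica"),
    ([5.8, 2.7, 5.1, 1.9], "Iris-virginica"),
]
print(knn_classify(train, [5.0, 3.4, 1.5, 0.2], k=3))  # Iris-setosa
```

Note that no classifier is explicitly learned here: every classification scans the whole training set, which is one answer to the efficiency question above.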
 
K-fold Cross Validation

Idea: chop the available data into K equal chunks.

For each chunk in turn:
- Treat it as the test data set
- Treat the remaining K − 1 chunks as the training data set
- The classifier trained/learned from the training set is applied to the test set

The process is then repeated K times (the folds), with each of the K chunks used exactly once as the test data set.

The K results from the folds can then be averaged (or otherwise combined) to produce a single estimate.

This can be used for comparing two algorithms, or for measuring the performance of a particular algorithm when the data set is small.
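The chunking-and-folding procedure above can be sketched as follows (illustrative code, not from the lecture; for simplicity it assumes the data size is divisible by K, as in the 150-instance example):

```python
def k_fold_splits(instances, k):
    """Yield (training_set, test_set) pairs for K-fold cross validation:
    each of the k chunks is used exactly once as the test set, with the
    remaining k - 1 chunks forming the training set."""
    chunk_size = len(instances) // k
    chunks = [instances[i * chunk_size:(i + 1) * chunk_size] for i in range(k)]
    for i in range(k):
        test_set = chunks[i]
        training_set = [x for j, c in enumerate(chunks) if j != i for x in c]
        yield training_set, test_set

data = list(range(150))      # stand-in for the 150 Iris instances
folds = list(k_fold_splits(data, 10))
print(len(folds))            # 10
print(len(folds[0][1]))      # 15 test instances per fold
print(len(folds[0][0]))      # 135 training instances per fold
```

In each fold a classifier would be trained on `training_set` and evaluated on `test_set`; the 10 results are then averaged into a single estimate.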
 
10-fold Cross Validation

Example: 150 instances for 10-fold cross validation
- Split into 10 chunks of 15 instances each: 15, 15, 15, 15, 15, 15, 15, 15, 15, 15
- In each fold, one chunk serves as the test set and the remaining nine form the training set
 
Leave-one-out Cross Validation

It is very similar to the K-fold cross validation method:
- Each time, only one instance is used as the test data set
- The process needs to be repeated m times, where m is the total number of examples/instances in the entire data set

K-fold cross validation (including leave-one-out cross validation) is NOT a machine learning or classification method or technique.

It is an experimental design method for setting up experiments for supervised learning tasks such as classification and regression.
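Since leave-one-out is simply K-fold cross validation with K = m, each split can be generated by holding out one instance at a time. A minimal self-contained sketch (illustrative, not from the lecture):

```python
def leave_one_out_splits(instances):
    """Yield (training_set, test_instance) pairs, one per instance:
    each instance is the test set exactly once."""
    for i, test_instance in enumerate(instances):
        training_set = instances[:i] + instances[i + 1:]
        yield training_set, test_instance

data = ["a", "b", "c", "d"]
splits = list(leave_one_out_splits(data))
print(len(splits))   # m = 4 repetitions
print(splits[0])     # (['b', 'c', 'd'], 'a')
```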
 
Validation Set vs Cross Validation

Validation set (vs training set vs test set):
- The validation set is a data set
- The validation set is a separate data set from the training set and the test set
- The validation set is used for monitoring the training process, but is not directly used for learning the classifier
- The validation set is used to avoid overfitting or overtraining
- Assume 100 examples: split 50 vs 50 (training vs test), or 40 vs 30 vs 30 (training vs validation vs test)

Cross validation:
- Is an experimental design method, NOT a data set
- In this method, there are only training sets and test sets
- No validation set exists

In short: a validation set is a data subset, while cross validation is a way to use the data.
 
K-Means Clustering

Unlabelled data: we expect to obtain good partitions of the instances using learning techniques.
This requires clustering techniques, i.e. unsupervised learning.

K-means clustering is a method of cluster analysis which aims to partition m instances into k clusters, in which each instance belongs to the cluster with the nearest mean.

- Needs some kind of distance measure, such as Euclidean distance
- Needs to assume the number of clusters
 
K-Means Clustering: Algorithm

1. Set k initial means (in this case k = 3) randomly from the data set (shown in color).
2. Create k clusters by associating every instance with the nearest mean, based on a distance measure.
3. Replace the old means with the centroid of each of the k clusters (as the new means).
4. Repeat steps 2 and 3 until convergence (no change in any cluster center).
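The four steps above can be sketched in code as follows (an illustrative sketch, not the lecture's implementation; the sample points and the fixed seed are made up for the example):

```python
import math
import random

def k_means(instances, k, max_iters=100, seed=0):
    """K-means following the four steps above: pick k initial means
    randomly from the data set, assign each instance to the nearest mean
    (Euclidean distance), replace the means with the cluster centroids,
    and repeat until no cluster center changes."""
    rng = random.Random(seed)
    means = [list(p) for p in rng.sample(instances, k)]  # step 1

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    for _ in range(max_iters):
        # Step 2: associate every instance with its nearest mean
        clusters = [[] for _ in range(k)]
        for p in instances:
            nearest = min(range(k), key=lambda j: dist(p, means[j]))
            clusters[nearest].append(p)
        # Step 3: replace the old means with the centroids of the clusters
        new_means = []
        for j, cluster in enumerate(clusters):
            if cluster:
                new_means.append([sum(c) / len(cluster) for c in zip(*cluster)])
            else:
                new_means.append(means[j])  # keep the mean of an empty cluster
        # Step 4: stop at convergence (no change in any cluster center)
        if new_means == means:
            break
        means = new_means
    return means, clusters

points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
means, clusters = k_means(points, k=2)
print(len(means))  # 2 cluster centers
```

Note that, as with k-means generally, the result depends on the random initial means, so different seeds can converge to different partitions.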
 
K-Means Clustering: An Example

[Figure: snapshots (1)-(4) of the k-means iterations on an example data set]

K-means Algorithm Demo: https://www.youtube.com/watch?v=zHbxbb2ye3E
 
Summary

Nearest neighbour method for classification
- K-nearest neighbour method — a classification method
- Measures for comparing two feature vectors

K-fold cross validation
- An experimental design method, NOT a learning method
- A validation set is a data set, NOT a method

K-means method — a clustering method, NOT for classification

Next Lecture: Decision tree learning for classification
Suggested reading: Section 18.3 (both 2nd and 3rd editions) and online materials
Slide Note

These are among the easiest methods: the three simplest and most basic techniques, but probably the most widely used and most important (if you do not know the 3 Ks, you do not know data mining/machine learning).


