K-Nearest neighbor
Introduction
Since KNN is a simple algorithm, I will just use this small project as a quick and dirty way to demonstrate the implementation of KNN.
Get the data
I will use the popular IRIS data set for this project. It’s a small data set with flower features that can be used to attempt to predict the species of an iris flower.
I will load the library to get the data set and check the head of iris data fram
library(ISLR)
head(iris)
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Standardize Data
In this case, the iris data set has all its features in the same order of magnitude, but its good practice (especially with KNN) to standardize features in the data. Lets go ahead and do this even though its not necessary for this data!
I will use scale() to standardize the feature columns of the iris dataset. Set this standardized version of the data as a new variable.
stand.features <- scale(iris[1:4])
Check that the scaling worked by checking the variance of one of the new columns.
var(stand.features[,1])
1
Now, let’s join the standardized data with the response/target/label column
final.data <- cbind(stand.features,iris[5])
head(final.data)
Train and Test Splits
Now, let’s split Training and test datas using the caTools library to split standardized data into train and test sets. I will use a 70/30 split.
library(caTools)
sample <- sample.split(final.data$Species, SplitRatio = .70)
train <- subset(final.data, sample == TRUE)
test <- subset(final.data, sample == FALSE)
Build KNN model
library(class)
Use the knn function to predict Species of the test set. Use k=1
predicted.species <- knn(train[1:4],test[1:4],train$Species,k=1)
predicted.species
setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa versicolor versicolor versicolor versicolor versicolor virginica versicolor versicolor versicolor versicolor versicolor virginica versicolor versicolor versicolor virginica virginica virginica virginica virginica virginica virginica virginica virginica virginica virginica virginica virginica virginica virginica
What was your misclassification rate?
mean(test$Species != predicted.species)
0.0444444444444444
Choosing a K Value
Let’s create a plot of the error (misclassification) rate for k values ranging from 1 to 10.
predicted.species <- NULL
error.rate <- NULL
for(i in 1:10){
set.seed(101)
predicted.species <- knn(train[1:4],test[1:4],train$Species,k=i)
error.rate[i] <- mean(test$Species != predicted.species)
}
library(ggplot2)
k.values <- 1:10
error.df <- data.frame(error.rate,k.values)
pl <- ggplot(error.df,aes(x=k.values,y=error.rate)) + geom_point()
pl + geom_line(lty="dotted",color='red')
We notice that the error drops to its lowest for k values between 2-6. Then it begins to jump back up again, this is due to how small the data set it. At k=10, it begins to approach setting k=10% of the data, which is quite large.