K-Nearest Neighbor


Kaggle project

Posted by Rishabh Pande on 02 Feb 2016


Introduction

Since KNN is a simple algorithm, I will just use this small project as a quick and dirty way to demonstrate the implementation of KNN.
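For intuition before reaching for a library: KNN classifies a point by finding the k closest training points (typically by Euclidean distance) and taking a majority vote of their labels. Here is a minimal from-scratch sketch of the k = 1 case; the function name and structure are my own, purely illustrative:

# Illustrative 1-nearest-neighbor classifier (hypothetical helper):
# for each test row, find the closest training row by Euclidean
# distance and return that training row's label.
nn1 <- function(train_x, train_y, test_x) {
  apply(test_x, 1, function(row) {
    dists <- sqrt(rowSums(sweep(as.matrix(train_x), 2, row)^2))
    as.character(train_y[which.min(dists)])
  })
}

The class library used later does the same job (generalized to any k) far more efficiently.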

Get the data

I will use the popular IRIS data set for this project. It’s a small data set with flower features that can be used to attempt to predict the species of an iris flower.

The iris data set ships with base R's datasets package, so no extra library is needed. Let's check the head and structure of the iris data frame.

# iris ships with base R's datasets package; no library() call is required

head(iris)

str(iris)

'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Standardize Data

In this case, the iris data set already has all its features on the same order of magnitude, but it's good practice (especially with KNN, which is driven by distances between points) to standardize the features. Let's go ahead and do this even though it's not strictly necessary for this data!

I will use scale() to standardize the feature columns of the iris data set and save the standardized version as a new variable.

stand.features <- scale(iris[1:4])  # standardize the four numeric feature columns

Check that the scaling worked by computing the variance of one of the new columns; a standardized column should have variance 1.

var(stand.features[,1])
1
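Under the hood, scale() centers each column to mean 0 and divides by its standard deviation, which is why the variance comes out as exactly 1. A quick manual check of the same computation (my own verification, not part of the original write-up):

# (x - mean(x)) / sd(x), done by hand for the first column
manual <- (iris$Sepal.Length - mean(iris$Sepal.Length)) / sd(iris$Sepal.Length)
all.equal(as.numeric(stand.features[, 1]), manual)  # should be TRUE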

Now, let's join the standardized features back with the response/target/label column.

final.data <- cbind(stand.features, iris[5])  # re-attach the Species column
head(final.data)

Train and Test Splits

Now, let's split the standardized data into training and test sets using the caTools library. I will use a 70/30 split.

library(caTools)

set.seed(101)  # for a reproducible split
sample <- sample.split(final.data$Species, SplitRatio = 0.70)  # stratified on the label
train <- subset(final.data, sample == TRUE)
test <- subset(final.data, sample == FALSE)
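Since sample.split stratifies on the label we passed it, each species should appear in roughly its original proportion in both splits. A quick sanity check (an addition of mine, not in the original analysis):

# class proportions should be close to 1/3 each in both sets
prop.table(table(train$Species))
prop.table(table(test$Species))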

Build KNN model

library(class)  # provides the knn() function

Use the knn() function to predict the Species of the test set, with k = 1:

predicted.species <- knn(train[1:4], test[1:4], train$Species, k = 1)
predicted.species

setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa versicolor versicolor versicolor versicolor versicolor virginica versicolor versicolor versicolor versicolor versicolor virginica versicolor versicolor versicolor virginica virginica virginica virginica virginica virginica virginica virginica virginica virginica virginica virginica virginica virginica virginica

What is the misclassification rate?

mean(test$Species != predicted.species)  # proportion of test points misclassified

0.0444444444444444
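Beyond the single error number, a confusion matrix shows which species get confused with which. This is an extra diagnostic I'm adding, not part of the original analysis:

# cross-tabulate predictions against the true test labels
table(Predicted = predicted.species, Actual = test$Species)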

Choosing a K Value

Let’s create a plot of the error (misclassification) rate for k values ranging from 1 to 10.

error.rate <- numeric(10)  # one slot per value of k

for (i in 1:10) {
    set.seed(101)  # knn() breaks distance ties at random, so seed for reproducibility
    predicted.species <- knn(train[1:4], test[1:4], train$Species, k = i)
    error.rate[i] <- mean(test$Species != predicted.species)
}
library(ggplot2)

k.values <- 1:10
error.df <- data.frame(k.values, error.rate)

pl <- ggplot(error.df, aes(x = k.values, y = error.rate)) + geom_point()
pl + geom_line(lty = "dotted", color = "red")

We notice that the error drops to its lowest for k values between 2 and 6, and then begins to climb back up. This is a consequence of how small the data set is: the training set has only about 105 observations, so k = 10 already amounts to roughly 10% of the training data, which is a very large neighborhood.
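One caveat: with a data set this small, a single 70/30 split gives a noisy error estimate, so the "best" k can shift from split to split. As a sketch of a more stable alternative, the class library's knn.cv() runs leave-one-out cross-validation on the training set; the k range below mirrors the loop above, and the whole snippet should be treated as illustrative:

# leave-one-out CV error on the training set for k = 1..10
set.seed(101)
cv.error <- sapply(1:10, function(k) {
  cv.pred <- knn.cv(train[1:4], train$Species, k = k)
  mean(cv.pred != train$Species)
})
which.min(cv.error)  # k with the lowest LOO-CV error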