k-Nearest Neighbors is a supervised machine learning algorithm for object classification that is widely used in data science and business analytics.

In this post, I will show how to use R’s *knn()* function, which implements the k-Nearest Neighbors (kNN) algorithm, in a simple scenario that you can extend to cover your more complex and practical scenarios. R is free, and kNN has not been patented by some evil patent trolls (“patent assertion entities”), so there are no legal or other restrictions to stop us from going ahead with the demonstration.

kNN has many useful characteristics, one of which is its relative insensitivity to outliers, which makes it resilient to errors in the classification (training) data. As a downside, the algorithm is noted for its CPU and memory greediness. The R interpreter may also fail to take advantage of the multi-core CPU capabilities of modern computers. If and when such constraints hamper your data analytics work, the kNN algorithm can be implemented in a language that supports multi-threaded applications (e.g. Java or C#), which would evenly load all the CPU cores of your computer and improve its computational throughput.

To simplify the demonstration of kNN and make it easy to follow, we will have only two classes used in object classification, which we label A and B.

Objects in classes A and B have two numeric attributes/properties that we map to X and Y Cartesian coordinates so that we can plot class instances (cases) as points on a 2-D chart. In other words, our cases are represented as points with X and Y coordinates (p(X,Y)).

Our simple classes A and B will have 3 object instances (cases) each.

Class A will include points with coordinates (0,0), (1,1), and (2,2).

Class B will include points with coordinates (6,6), (5.5, 7), and (6.5, 5).

In R, we can write down the above arrangement as follows:

# Class A training object instances (cases)

A1=c(0,0)

A2=c(1,1)

A3=c(2,2)

# Class B training object instances (cases)

B1=c(6,6)

B2=c(5.5,7)

B3=c(6.5,5)

Here is how the classification training objects for class A and class B are arranged on the chart.
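Since the chart itself is not reproduced here, you can draw it yourself with base R graphics (a minimal sketch, assuming the A and B vectors defined above):

```r
# Stack the six training points into a matrix and plot them,
# coloring class A blue and class B red
A1 <- c(0, 0); A2 <- c(1, 1); A3 <- c(2, 2)
B1 <- c(6, 6); B2 <- c(5.5, 7); B3 <- c(6.5, 5)
train <- rbind(A1, A2, A3, B1, B2, B3)

plot(train, col = rep(c("blue", "red"), each = 3), pch = 19,
     xlab = "X", ylab = "Y",
     main = "Training cases: class A (blue), class B (red)")
text(train, labels = rownames(train), pos = 3)  # label each point
```

Class A’s points hug the origin while class B’s points cluster around (6, 6), which is exactly the separation kNN will exploit.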

The *knn()* function is housed in the *class* package and is invoked as follows:

knn(train, test, cl, k)

where

*train* **is a matrix or a data frame of training (classification) cases**

*test* **is a matrix or a data frame of test case(s) (one or more rows)**

*cl* **is a vector of classification labels (with one element per case in the training data set)**

*k* **is the integer number of closest cases to consider (the k in k-Nearest Neighbors); normally, it is a small odd number**

For a full description of the *knn()* function, read its help page, which you can open by typing `?knn` in the R console.

The points (cases) from both classification classes A and B must be packed in the same matrix used in the classification exercise.

In R you do this with a one-liner:

train=rbind(A1,A2,A3, B1,B2,B3)

This command will build the following matrix:

A1: 0.0 0

A2: 1.0 1

A3: 2.0 2

B1: 6.0 6

B2: 5.5 7

B3: 6.5 5

Now, when we classify a test object (with properties expressed as X and Y coordinates), the kNN algorithm computes the Euclidean distance between the test point and every row (case) in the training matrix, then picks the single closest case for k = 1, or takes a majority vote among the k closest cases for k > 1 (where k should be an odd number).
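To see the metric at work, you can compute the distances by hand before calling *knn()*. This sketch rebuilds the train matrix from above and finds the single nearest training case to the point (4, 4):

```r
# Euclidean distance from the test point to every training case
A1 <- c(0, 0); A2 <- c(1, 1); A3 <- c(2, 2)
B1 <- c(6, 6); B2 <- c(5.5, 7); B3 <- c(6.5, 5)
train <- rbind(A1, A2, A3, B1, B2, B3)
test <- c(4, 4)

# Subtract the test coordinates from each row, square, sum, take the root
d <- sqrt(rowSums(sweep(train, 2, test)^2))
round(d, 2)                    # distance of each training case from (4, 4)
rownames(train)[which.min(d)]  # the 1-nearest neighbor: "B3"
```

B3 at (6.5, 5) is the closest case (distance ≈ 2.69), which is why *knn()* with k = 1 will label this point B.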

We also need to construct the *cl* parameter (the vector of classification labels). We can do this with this command:

cl=factor(c(rep("A",3),rep("B",3)))

This command uses the *factor()* function to create a vector of factors (discrete, enumerated values) that are used as class literals. In our setup, we have two factors (a.k.a. levels): A and B.

The *rep()* function replicates the first parameter (a printable symbol) the number of times passed on to it as the second parameter.

The resulting factor vector has the following content: `A A A B B B`

(3 As followed by 3 Bs to match the layout of our train cases – we have 3 cases of A followed by 3 cases of B).
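You can verify the vector’s contents and levels in the console:

```r
cl <- factor(c(rep("A", 3), rep("B", 3)))
cl          # prints: A A A B B B  with  Levels: A B
levels(cl)  # the two class labels: "A" "B"
table(cl)   # three cases of each class
```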

To run the *knn()* function, we need to supply the test case(s). In our runs, we will start with a single test object that we create as follows:

test=c(4,4)

which corresponds to a point sitting roughly midway between the class A and class B clusters.

At this point, we have everything in place to run *knn()*. Let’s do it for k = 1 (classification by its proximity to a single neighbor).

For more informative reports of test object classification results, we are going to use the *summary()* function that can polymorphically act on its input parameter depending on its type. In our case, the input parameter to the *summary()* function is the output of the *knn()* function.

Here is how the nested call looks:

summary(knn(train, test, cl, k = 1))

Here is the complete code listing that you can copy and paste into the R console.

# Class A cases

A1=c(0,0)

A2=c(1,1)

A3=c(2,2)

# Class B cases

B1=c(6,6)

B2=c(5.5,7)

B3=c(6.5,5)

# Build the classification matrix

train=rbind(A1,A2,A3, B1,B2,B3)

# Class labels vector (attached to each class instance)

cl=factor(c(rep("A",3),rep("B",3)))

# The object to be classified

test=c(4, 4)

# Load the class package that holds the knn() function

library(class)

# call knn() and get its summary

summary(knn(train, test, cl, k = 1))

# End of listing

After pasting the above code in the R console, press **ENTER** to submit it to the R interpreter for execution.

You should see the following output in your console:

A B

0 1

This result indicates that the test case has been classified as belonging to class B.

Type the following command in the console to redefine the test point:

test=c(3.5, 3.5)

Visually, this test point appears closer to the cluster of class A cases (points). Let’s verify our assumption.

Repeat the same *knn()* summary command as before:

summary(knn(train, test, cl, k = 1))

The result comes back as we expected:

A B

1 0

The point has been classified as belonging to class A.

Let’s increase the number of closest neighbors that are involved in voting during the classification step.

Type in the following command:

summary(knn(train, test, cl, k = 3))

Now, the positions of the class B points make them, as a whole, closer (according to the Euclidean distance metric) to the test point, so the (3.5, 3.5) case is classified as belonging to class B this time. (Be aware that when several training points are equidistant from the test point, *knn()* may break ties at random, so borderline cases like this one can occasionally come back differently.)

A B

0 1

If you wish, you can further experiment with the *k* value (this is one of the tuning exercises data scientists perform when searching for a better-fitting classification model).
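Here is one way to sketch such an experiment, looping over a few k values for the (3.5, 3.5) test point (the particular loop and k values are just an illustration). Recall that tied distances can make borderline results vary between runs:

```r
library(class)  # provides knn()

# Rebuild the training data and labels from earlier in the post
A1 <- c(0, 0); A2 <- c(1, 1); A3 <- c(2, 2)
B1 <- c(6, 6); B2 <- c(5.5, 7); B3 <- c(6.5, 5)
train <- rbind(A1, A2, A3, B1, B2, B3)
cl <- factor(c(rep("A", 3), rep("B", 3)))
test <- c(3.5, 3.5)

# Classify the same point with several neighborhood sizes
for (k in c(1, 3, 5)) {
  result <- knn(train, test, cl, k = k)
  cat("k =", k, "->", as.character(result), "\n")
}
```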

Now, let’s build a matrix of test cases containing four points:

(4,4) - should be classified as B

(3,3) - should be classified as A

(5,6) - should be classified as B

(7,7) - should be classified as B

As a result, we expect our test batch to yield three cases (points) classified as belonging to class B and one classified as A.

Type in the following code:

test = matrix(c(4,4,3,3,5,6,7,7), ncol=2, byrow=TRUE)

This command packs our points into a two-column matrix containing the X and Y coordinates of our test points.
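You can print the matrix to confirm that `byrow=TRUE` laid the coordinates out one point per row:

```r
# byrow = TRUE fills the matrix row by row, one (X, Y) point per row
test <- matrix(c(4, 4, 3, 3, 5, 6, 7, 7), ncol = 2, byrow = TRUE)
test
#      [,1] [,2]
# [1,]    4    4
# [2,]    3    3
# [3,]    5    6
# [4,]    7    7
```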

Now run the previous *knn()* summary command again:

summary(knn(train, test, cl, k = 3))

R should print the following summary of its classification job:

A B

1 3

This supports our expectations.

I hope you are now well equipped to start applying R’s *knn()* function in your own problem domain.
