
R: Text Classification using a K Nearest Neighbour Model


In this example, I will show you how to use a KNN (k-nearest-neighbour) model to classify text in R. This is machine learning in the sense that we train the model on "known" data (i.e. data that is already categorized) and test it on "unknown" data (i.e. data where the model should tell us the category).

I created a dataset with text snippets on Ford and Mazda cars (from carbuyer.co.uk). The text snippets are stored in the column Text and the car brand in Category. To download the data, click here.

You can use the model for any kind of category prediction by putting your own choice of text into the text and category columns. Basically, by categorizing a few observations of text, you can run the model on large datasets and quickly classify or categorize huge amounts of free-form text.

I owe a big thanks to Tim D’Auria at Boston Decision. His video tutorial is the inspiration for this tool.

First, I will show you the full code. Then I will explain the code bit by bit.

Full code
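
# Packages
library(tm) # Text mining: Corpus and Document Term Matrix
library(class) # KNN model
library(SnowballC) # Stemming words

# Read csv with two columns: text and category
df <- read.csv("knn.csv", sep = ";", header = TRUE)

# Create corpus
docs <- Corpus(VectorSource(df$Text))

# Clean corpus
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, stemDocument, language = "english")

# Create dtm
dtm <- DocumentTermMatrix(docs)

# Transform dtm to matrix to data frame - df is easier to work with
mat.df <- as.data.frame(data.matrix(dtm), stringsAsFactors = FALSE)

# Column bind category (known classification)
mat.df <- cbind(mat.df, df$Category)
# Change name of new column to "category"
colnames(mat.df)[ncol(mat.df)] <- "category"

# Split data by rownumber into two equal portions
train <- sample(nrow(mat.df), ceiling(nrow(mat.df) * .50))
test <- (1:nrow(mat.df))[-train]
# Isolate classifier
cl <- mat.df[, "category"]

# Create model data and remove "category"
modeldata <- mat.df[, !colnames(mat.df) %in% "category"]

# Create model: training set, test set, training set classifier
knn.pred <- knn(modeldata[train, ], modeldata[test, ], cl[train])

# Confusion matrix
conf.mat <- table("Predictions" = knn.pred, Actual = cl[test])

# Accuracy
(accuracy <- sum(diag(conf.mat))/length(test) * 100)

# Create data frame with test data and predicted category
df.pred <- cbind(knn.pred, modeldata[test, ])
write.table(df.pred, file = "output.csv", sep = ";")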

Now, let's go through the code bit by bit.

The first step is to load the necessary packages. We will be using tm for text mining, class for the KNN model and SnowballC for stemming words (I'll explain word stemming later).

# Packages
library(tm) # Text mining: Corpus and Document Term Matrix
library(class) # KNN model
library(SnowballC) # Stemming words

Since I like to use this model for all kinds of text classification, I have created a template csv file containing two columns: Text and Category. If you use the code for your own model, you can put any kind of text in the Text column and any kind of category in the Category column.

So, below we read the csv file knn.csv into the data frame df, specifying that the separator is a semicolon and that the file contains a header row (Text and Category).

# Read csv with two columns: text and category
df <- read.csv("knn.csv", sep = ";", header = TRUE)
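
For reference, knn.csv should look something like this. The two rows below are made-up illustrations of the format, not actual rows from the dataset:

Text;Category
"The Ford Fiesta is agile and fun to drive";Ford
"The Mazda3 offers sharp handling and an upmarket interior";Mazda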

Now, we load the text data into a corpus. Think of a corpus as a collection of documents or text snippets. In our example, each line in the csv file represents a document that we have categorized.

So, we create a corpus called docs consisting of the Text column in our data frame df. We specify that the source is a vector (i.e. a list of elements).

# Create corpus
docs <- Corpus(VectorSource(df$Text))

The next step is to clean our corpus. Basically, we strip all the noise out of the text and leave only the important parts. Most of the code below is pretty self-explanatory. First, we convert all text to lower case. Then we remove all numbers, punctuation and extra white space.
Then, we remove stop words. These are common words like "then", "I", "it", "there" etc. And finally, we stem the words. This means removing common endings from words. E.g. instead of having the two words "bicycle" and "bicycles", we end up with the single word "bicycl" because we only keep the stem of each word.

# Clean corpus
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, stemDocument, language = "english")
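
If you are curious what stemming actually does, you can try SnowballC's wordStem function directly on a couple of words. This is just a quick illustration and not part of the classification pipeline:

# Stemming reduces words to their common stem
wordStem(c("bicycle", "bicycles"), language = "english")
# [1] "bicycl" "bicycl"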

Now, we create the document term matrix (DTM). This is a matrix with one row per document and one column per term, where each cell counts how many times that term occurs in that document. So, basically a numeric representation of the frequency of each word in each document. This matrix is what we will run the model on.

# Create dtm
dtm <- DocumentTermMatrix(docs)
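
If you want to peek at the matrix, tm's inspect function prints a sample. The row and column ranges here are just an example and assume your corpus has at least five documents and five terms:

# Peek at the first few documents and terms
inspect(dtm[1:5, 1:5])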

After we have created the DTM, we convert it to a data frame mat.df.

# Transform dtm to matrix to data frame - df is easier to work with
mat.df <- as.data.frame(data.matrix(dtm), stringsAsFactors = FALSE)

The next step is to create a new column with the known category of each text. We use cbind to bind a new column to mat.df. Afterwards, we change the name of the last column of mat.df to "category".

# Column bind category (known classification)
mat.df <- cbind(mat.df, df$Category)
# Change name of new column to "category"
colnames(mat.df)[ncol(mat.df)] <- "category"
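
At this point it can be worth a quick sanity check that the categories came through as expected:

# Count the number of documents per category
table(mat.df$category)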

Remember that the knn function takes three sets of data: a training set, a test set and a classifier. The classifier holds the known categories and must have exactly one entry per row of the training set (in the code below we ensure this by indexing cl with train).

Let’s create those three sets of data.

We take a random sample of 50% of the rows of the full data and call this train.
Test will hold all the remaining rows.

Train and test contain just the row numbers, so we can use them to index our data when we create the KNN model.

The last step is to create the classifier. We isolate all the known categories and put them into cl.

# Split data by rownumber into two equal portions
train <- sample(nrow(mat.df), ceiling(nrow(mat.df) * .50))
test <- (1:nrow(mat.df))[- train]
# Isolate classifier
cl <- mat.df[, "category"]
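
Note that sample() draws a random split, so train and test will differ from run to run. If you want a reproducible split, you can set a seed before sampling. The seed value 123 below is arbitrary:

# Optional: fix the random seed for a reproducible split
set.seed(123)
train <- sample(nrow(mat.df), ceiling(nrow(mat.df) * .50))
test <- (1:nrow(mat.df))[-train]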

In order to use our data in the KNN model, we need it without the categories. The categories are what we are trying to predict.
We create a data frame modeldata with all columns from mat.df except the category.

# Create model data and remove "category"
modeldata <- mat.df[, !colnames(mat.df) %in% "category"]

Now we are finally ready to run the model! From our modeldata, we take the rows we picked for training and testing, and we also feed the known categories of the training rows into the model. Note that knn() does not return a fitted model object: it directly returns the predicted categories for the test rows, which we store in knn.pred.

# Create model: training set, test set, training set classifier
knn.pred <- knn(modeldata[train, ], modeldata[test, ], cl[train])
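
By default, knn() uses k = 1, so each test document simply gets the category of its single nearest training document. You can experiment with a larger neighbourhood through the k argument, e.g.:

# Use the 3 nearest neighbours instead of 1
knn.pred <- knn(modeldata[train, ], modeldata[test, ], cl[train], k = 3)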

And now comes the cool part: the confusion matrix.
This is a table that compares the predicted categories with the actual categories, so you can see which documents the model classified correctly and which it did not.

# Confusion matrix
conf.mat <- table("Predictions" = knn.pred, Actual = cl[test])

And here is how to calculate the accuracy of the model from the confusion matrix. The diagonal holds the correctly classified documents, so we sum it and divide by the number of test documents to get the accuracy in percent.

# Accuracy
(accuracy <- sum(diag(conf.mat))/length(test) * 100)

This last step is optional, but it can be useful to export a data frame with your test data and the predictions from the model.

# Create data frame with test data and predicted category
df.pred <- cbind(knn.pred, modeldata[test, ])
write.table(df.pred, file="output.csv", sep=";")
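
If you plan to open the file in a spreadsheet, it can be convenient to drop the row names, which write.table includes by default:

# Same export, but without row names
write.table(df.pred, file = "output.csv", sep = ";", row.names = FALSE)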

 

That’s it! :-) Feel free to ask any questions in the comments below.




