This article details the implementation of cross-validation for k-NN classifiers, discusses classification ties, and touches on confusion matrices.
What is Cross-Validation
Cross-validation is a technique used to assess the performance and generalization ability of machine learning models. It involves splitting the data into multiple subsets so that the model can be trained and evaluated multiple times. Among its benefits, it helps estimate how well a model will perform on unseen data and can be used to compare different models. It can also be used to determine the optimal value of k.
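As a rough illustration of the splitting described above, the following sketch runs 5-fold cross-validation by hand. The iris dataset bundled with scikit-learn, the choice of 3 neighbors, and the shuffled StratifiedKFold splitter are assumptions made for the example, not part of this article's analysis.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Assumption: iris is used only as a small, bundled stand-in dataset
X, y = load_iris(return_X_y=True)

# Split the data into 5 folds, keeping class proportions similar in each fold
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in folds.split(X, y):
    # Train on 4 folds, evaluate on the held-out fold
    model = KNeighborsClassifier(n_neighbors=3)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

# The average held-out accuracy is the cross-validated estimate
mean_accuracy = float(np.mean(scores))
print(scores, mean_accuracy)
```

Each data point is used for evaluation exactly once, so the averaged score reflects performance on data the model did not see during training.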
When thinking about a
Dealing With Ties
Fig 1. Example of a tie (image by author)
For a
This raises a problem since our classifier selects the class label with the minimum distance from our new point, yet in this case, the distances are equal.
Some k-NN implementations break these ties randomly, largely to avoid systematically favoring one class. This is also why counts of truth values may be inconsistent from run to run. Note that sklearn.metrics.confusion_matrix() only tabulates predictions and does not break ties itself; in scikit-learn's KNeighborsClassifier, neighbors at identical distances with different labels are resolved by the ordering of the training data. Obviously, the example above is a very oversimplified one, but breaking ties with limited bias is important when making predictions. Model bias is an area under constant scrutiny, especially with text-based models such as GPT and BERT, and also with classification models.
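To make the tie concrete, here is a minimal sketch of a 1-nearest-neighbor prediction that handles equal distances explicitly. The nearest_label helper and the two-point dataset are hypothetical; the sketch only contrasts a deterministic rule with a random one and is not how any particular library implements it.

```python
import numpy as np

def nearest_label(train_X, train_y, query, rng=None):
    """Hypothetical 1-NN prediction with explicit handling of distance ties."""
    dists = np.linalg.norm(train_X - query, axis=1)
    # Indices of every training point tied for the minimum distance
    tied = np.flatnonzero(np.isclose(dists, dists.min()))
    if rng is None:
        # Deterministic rule: keep the first tied point in training order
        return str(train_y[tied[0]])
    # Random rule: pick uniformly among the tied points
    return str(train_y[rng.choice(tied)])

# Two training points, one per class, equally far from the query
train_X = np.array([[0.0, 0.0], [2.0, 0.0]])
train_y = np.array(["A", "B"])
query = np.array([1.0, 0.0])

first = nearest_label(train_X, train_y, query)
rng = np.random.default_rng(0)
random_picks = [nearest_label(train_X, train_y, query, rng) for _ in range(20)]
print(first, random_picks)
```

The deterministic rule always favors whichever tied point comes first in the training data, while the random rule spreads tied predictions across both classes at the cost of run-to-run variation.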
Implementing Cross-Validation
For this example, we will be using the Howells dataset, available here. It is a dataset that contains craniometric measurements taken from over 2,500 human crania from 28 populations between 1965 and 1980 by Dr. William W. Howells. This example will attempt to classify BERG males/females and NORSE males/females based on several features of the dataset.
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
# read in the data
url = "https://www.statsmachine.net/databases/STAT_139/Howells.csv"
Howells = pd.read_csv(url)
Now that we have the dataset in memory, we can begin the filtering process.
First, we will filter the data to contain only people from the NORSE or BERG populations. These are the locations, and the PopSex column splits each into male or female, hence the F and M after the population name.
The 'GOL', 'NOL', 'BNL', 'BBH', 'XCB' features measure different dimensions of the human cranium. Descriptions of specific features can be found here and more here. These columns, excluding PopSex, will become our training data. The crosstab() function is for visualization purposes, and the same result can be obtained using the value_counts() function.
# filter the data to the NORSE and BERG populations and keep the columns of interest
HBNMF = (
    Howells.query("Pop == 'NORSE' or Pop == 'BERG'")
    .sort_values(['Pop', 'Sex'])
    [['HID', 'Sex', 'Pop', 'PopSex', 'GOL', 'NOL', 'BNL', 'BBH', 'XCB']]
)
# choose which measurements to use in classification
train = HBNMF[['GOL', 'NOL', 'BNL', 'BBH', 'XCB']].values
# choose which group labels to use in classification
trl = HBNMF['PopSex'].values
# table showing the counts of each unique value in the desired column
ct = pd.crosstab(index=HBNMF['PopSex'], columns='count').T
ct
PopSex  BERGF  BERGM  NORSEF  NORSEM
col_0
count      53     56      55      55
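A small sketch of the crosstab() and value_counts() equivalence mentioned above, using a hypothetical seven-row stand-in for the PopSex column rather than the real data:

```python
import pandas as pd

# Hypothetical stand-in for the filtered HBNMF frame (labels only)
demo = pd.DataFrame(
    {"PopSex": ["BERGF", "BERGM", "BERGF", "NORSEM", "NORSEF", "BERGM", "BERGF"]}
)

# Same call shape as in the article: one row named 'count', one column per label
ct = pd.crosstab(index=demo["PopSex"], columns="count").T

# Alternative: count each unique label, sorted by label to match crosstab's order
vc = demo["PopSex"].value_counts().sort_index()

print(ct)
print(vc)
```

Both report the same counts; crosstab() lays them out as a one-row table, while value_counts() returns a Series.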
Now we can begin with the
A similar result can be obtained by drawing a circle with a given radius,
# 1 nearest neighbor
knn = KNeighborsClassifier(n_neighbors=1)
# perform 5-fold cross-validation and collect out-of-fold class probabilities
kcres1 = cross_val_predict(knn, train, trl, cv=5, method='predict_proba')
# convert probabilities to predicted class indices (positions in the sorted labels)
kcres1 = np.argmax(kcres1, axis=1)
# print confusion matrix
print("Confusion Matrix (1 neighbor):\n", pd.crosstab(kcres1, HBNMF['PopSex']), sep="")
Confusion Matrix (1 neighbor):
PopSex  BERGF  BERGM  NORSEF  NORSEM
row_0
0          30     10       7       1
1          11     33       8       7
2          10      5      32      12
3           2      8       8      35
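The row labels 0 to 3 in this table are positions returned by np.argmax, not class names. Assuming the probability columns follow the sorted class labels, as scikit-learn documents for cross_val_predict, they can be mapped back to names. The sketch below uses synthetic data with the same four labels; the data and variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in: four well-separated clusters, 20 points each
rng = np.random.default_rng(0)
names = np.array(["BERGF", "BERGM", "NORSEF", "NORSEM"])
y = np.repeat(names, 20)
centers = np.repeat(np.arange(4), 20)[:, None] * 3.0
X = centers + rng.normal(size=(80, 2))

knn = KNeighborsClassifier(n_neighbors=1)
proba = cross_val_predict(knn, X, y, cv=5, method="predict_proba")

# Column j of proba corresponds to the j-th label in sorted order
classes = np.unique(y)
pred_labels = classes[np.argmax(proba, axis=1)]
print(pred_labels[:5])
```

Calling cross_val_predict with its default method='predict' returns the labels directly and avoids the index step altogether.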
Because we are cross-validating, let’s pick another value for k.
# 3 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=3)
# perform 3-fold cross-validation and collect out-of-fold class probabilities
kcres3 = cross_val_predict(knn, train, trl, cv=3, method='predict_proba')
# convert probabilities to predicted class indices (positions in the sorted labels)
kcres3 = np.argmax(kcres3, axis=1)
# print confusion matrix
print("Confusion Matrix (3 neighbors):\n", pd.crosstab(kcres3, HBNMF['PopSex']), sep="")
Confusion Matrix (3 neighbors):
PopSex  BERGF  BERGM  NORSEF  NORSEM
row_0
0          30      9      12       1
1           9     34       2       5
2          14      5      33       9
3           0      8       8      40
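Since the two runs above differ mainly in n_neighbors, the pattern can be wrapped in a helper and repeated for several values. The following is a sketch on synthetic data from make_blobs; knn_cv_table is a hypothetical helper, and it uses the default method='predict' so the table rows carry labels directly.

```python
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

def knn_cv_table(X, y, k, cv=5):
    """Hypothetical helper: cross-validated predictions for one k, tabulated against the truth."""
    knn = KNeighborsClassifier(n_neighbors=k)
    # Default method='predict' returns out-of-fold predicted labels
    pred = cross_val_predict(knn, X, y, cv=cv)
    return pd.crosstab(pd.Series(pred, name="predicted"), pd.Series(y, name="true"))

# Assumption: synthetic four-class data stands in for the Howells measurements
X, y = make_blobs(n_samples=120, centers=4, random_state=0)

# One table per candidate number of neighbors, all with the same number of folds
tables = {k: knn_cv_table(X, y, k) for k in (1, 3, 7, 9)}
for k, table in tables.items():
    print(f"Confusion Matrix ({k} neighbors):\n{table}\n")
```

Keeping cv fixed across all values of k is a design choice in this sketch; it means the tables differ only because of the number of neighbors.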
We’ll do this two more times, with 7 and 9 neighbors.
Confusion Matrix (7 neighbors):
PopSex  BERGF  BERGM  NORSEF  NORSEM
row_0
0          39     12       7       0
1           5     32       1       5
2           9      4      40      10
3           0      8       7      40
Confusion Matrix (9 neighbors):
PopSex  BERGF  BERGM  NORSEF  NORSEM
row_0
0          39      9       9       0
1           4     33       2       6
2          10      6      38      11
3           0      8       6      38
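One simple calculation for comparing these tables is overall accuracy: correct predictions on the diagonal divided by the total count. The sketch below copies the four matrices printed above into arrays; accuracy is one reasonable criterion here, assumed for illustration.

```python
import numpy as np

# Confusion matrices copied from the printed output (rows: predicted, columns: true)
matrices = {
    1: np.array([[30, 10, 7, 1], [11, 33, 8, 7], [10, 5, 32, 12], [2, 8, 8, 35]]),
    3: np.array([[30, 9, 12, 1], [9, 34, 2, 5], [14, 5, 33, 9], [0, 8, 8, 40]]),
    7: np.array([[39, 12, 7, 0], [5, 32, 1, 5], [9, 4, 40, 10], [0, 8, 7, 40]]),
    9: np.array([[39, 9, 9, 0], [4, 33, 2, 6], [10, 6, 38, 11], [0, 8, 6, 38]]),
}

# Accuracy = sum of the diagonal (correct predictions) / total number of points
accuracy = {k: np.trace(m) / m.sum() for k, m in matrices.items()}
best_k = max(accuracy, key=accuracy.get)

for k, acc in accuracy.items():
    print(f"k={k}: {np.trace(matrices[k])} of {matrices[k].sum()} correct ({acc:.3f})")
print("best k by accuracy:", best_k)
```

Note that the runs shown for 1 and 3 neighbors use different cv values, so small differences in accuracy between them should be read with some caution.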
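Doing some simple calculations, it seems that the optimal value of k among those tried is 7: its matrix has the most correct predictions on the diagonal (151 of 219).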
The Resulting Matrix
The outputs above are called confusion matrices. Each row (row_0) corresponds to a predicted class, given as an index into the sorted labels (0 = BERGF, 1 = BERGM, 2 = NORSEF, 3 = NORSEM), and each column corresponds to a true PopSex label. The count in each cell is the number of data points with that combination of predicted and true class, so the diagonal holds the correctly classified points and every other cell holds a specific kind of error.
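A confusion matrix also supports per-class summaries. In the tables above, rows come from the predictions (row_0) and columns from the true PopSex labels, so dividing the diagonal by column sums gives recall and dividing by row sums gives precision. The sketch below applies this to the 7-neighbor matrix copied from the output; the choice of these two metrics is an assumption made for illustration.

```python
import numpy as np

labels = ["BERGF", "BERGM", "NORSEF", "NORSEM"]

# 7-neighbor confusion matrix from the output (rows: predicted, columns: true)
cm7 = np.array([[39, 12, 7, 0], [5, 32, 1, 5], [9, 4, 40, 10], [0, 8, 7, 40]])

correct = np.diag(cm7)
# Recall: share of each true class that was predicted correctly
recall = correct / cm7.sum(axis=0)
# Precision: share of each predicted class that was actually correct
precision = correct / cm7.sum(axis=1)

for name, r, p in zip(labels, recall, precision):
    print(f"{name}: recall={r:.3f}, precision={p:.3f}")
```

Per-class figures like these can reveal whether a given number of neighbors favors some groups over others, which a single accuracy number hides.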
These truth values give us a means to evaluate our values for k.