
Wednesday, October 7, 2020

K-Nearest Neighbor (supervised, classification)



A supervised classification algorithm.
It considers the (given) labels of the k nearest neighboring points of a new data point to estimate the label of that data point.
"""
Codes from "COGNITIVE CLASS.ai - K-Nearest Neighbors", author: Saeed Aghabozorgi
K-Nearest Neighbors (KNN) is an algorithm for supervised learning.
Once a point is to be predicted, it takes into account the 'K' nearest points to it 
to determine it's classification.

The value K is supposed to be specified by the user.
To choose right value for K, the general solution is to reserve a part of your data for testing the accuracy of the model. By increasing k and conduct accuracy evaluation for each k, see which k is the best for your model.
"""

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import preprocessing

##### Step 1: Data #####
df = pd.read_csv("data.csv")
X = df[['X1','X2',...,'Xn']].values # feature columns ('X1'...'Xn' stand in for your column names)
y = df['Y'].values                  # target column
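# Optional sanity check: inspect the first rows to confirm the columns
# loaded as expected (standard pandas call).
print(df.head())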

##### Step 2: Normalize Data #####
# Data standardization gives the data zero mean and unit variance.
# This is good practice, especially for algorithms such as KNN
# that are based on the distance between cases.
X = preprocessing.StandardScaler().fit(X).transform(X.astype(float))
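# Optional check: after scaling, each column should have (approximately)
# zero mean and unit variance.
print(X.mean(axis=0).round(6))
print(X.std(axis=0).round(6))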

##### Step 3: Train Test Split #####
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
# random_state: pass an int for reproducible output across repeated calls
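# Checking the shapes confirms the 80/20 split:
print('Train set:', X_train.shape, y_train.shape)
print('Test set: ', X_test.shape, y_test.shape)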

##### Step 4: Classification by KNN #####
from sklearn.neighbors import KNeighborsClassifier
# Training
k = 4
neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)

# Predicting
yhat = neigh.predict(X_test)
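# Peek at the first few predictions alongside the true labels:
print(yhat[0:5])
print(y_test[0:5])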

##### Step 5: Accuracy Evaluation #####
# The accuracy_score function is equivalent to the (older) jaccard_similarity_score
# function here - it measures how closely the predicted labels match the actual
# labels in the test set.
from sklearn import metrics
print("Train set Accuracy:", metrics.accuracy_score(y_train, neigh.predict(X_train)))
print("Test set Accuracyt:", metrics.accuracy_score(y_test, yhat))

##### Step 6: Choose best k #####
Ks = 10
mean_acc = np.zeros(Ks-1)
std_acc = np.zeros(Ks-1)
for n in range(1, Ks):
    # Train the model and predict for each candidate k
    neigh = KNeighborsClassifier(n_neighbors=n).fit(X_train, y_train)
    yhat = neigh.predict(X_test)
    mean_acc[n-1] = metrics.accuracy_score(y_test, yhat)
    # standard error of the accuracy estimate
    std_acc[n-1] = np.std(yhat == y_test) / np.sqrt(yhat.shape[0])

# Plot model accuracy for different numbers of neighbors
plt.plot(range(1, Ks), mean_acc, 'g')
plt.fill_between(range(1, Ks), mean_acc - 1 * std_acc, mean_acc + 1 * std_acc, alpha=0.1)
plt.legend(('Accuracy', '+/- 1 std'))
plt.ylabel('Accuracy')
plt.xlabel('Number of Neighbors (K)')
print("The best accuracy was", mean_acc.max(), "with k =", mean_acc.argmax() + 1)
plt.show()
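Picking k from a single train/test split can be noisy. As a closing note, here is a sketch of a cross-validated alternative (my own addition, assuming the same X and y and using scikit-learn's cross_val_score, which defaults to accuracy scoring for classifiers):

from sklearn.model_selection import cross_val_score

# Average accuracy over 5 folds for each candidate k; this is less sensitive
# to how one particular split happens to fall than the single hold-out above.
cv_acc = [cross_val_score(KNeighborsClassifier(n_neighbors=n), X, y, cv=5).mean()
          for n in range(1, Ks)]
print("Best k by 5-fold CV:", int(np.argmax(cv_acc)) + 1)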