Thursday, October 8, 2020

Logistic Regression (classification, supervised)



Supervised classification algorithm.
It models the probability that a datapoint belongs to a given class as a function of its features, by passing a linear combination of the features through the logistic/sigmoid function.
"""
Code from "COGNITIVE CLASS.ai - Logistic Regression", author: Saeed Aghabozorgi
While Linear Regression is suited for estimating continuous values, it is not the best tool for predicting the class of an observed data point.
Logistic Regression uses the logistic/sigmoid function and is suited for datasets where the observed dependent variable is categorical.
It produces a formula that predicts the *probability of the class label* as a function of the independent variables.
Logistic Regression passes a linear combination of the inputs through the logistic/sigmoid function and treats the result as the probability that the datapoint belongs to one of two binary categories.
"""

import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt

##### Step 1: Data & Preprocessing #####
df = pd.read_csv("data.csv")
X = np.asarray(df[['X1','X2',...,'Xn']])  # placeholder feature columns; replace with the dataset's actual column names
y = np.asarray(df['target'])

# Preprocessing: standardize features (zero mean, unit variance)
X = preprocessing.StandardScaler().fit(X).transform(X)

# Train/Test dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=4)

##### Step 2: Modeling #####
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)
# different solvers, such as 'newton-cg', 'liblinear', 'lbfgs', 'sag', 'saga', can be used
# C: inverse of regularization strength (a positive float). Regularization is a technique used to counter overfitting; smaller values of C specify stronger regularization.
yhat = LR.predict(X_test)

# predict_proba returns one column per class, ordered by LR.classes_; for 0/1 labels the first column is P(Y=0|X) and the second is P(Y=1|X)
yhat_prob = LR.predict_proba(X_test)
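
# A quick check (my addition) of how predict() relates to predict_proba():
# for 0/1 labels, column 1 holds P(Y=1|X) and predict() effectively applies
# a 0.5 threshold to it (up to ties at exactly 0.5)
manual_yhat = (yhat_prob[:, 1] >= 0.5).astype(int)  # assumes LR.classes_ == [0, 1]
print(np.array_equal(manual_yhat, yhat))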

##### Step 3: Evaluation #####
# Jaccard index (jaccard_similarity_score was deprecated and removed from scikit-learn; use jaccard_score)
from sklearn.metrics import jaccard_score
jaccard_score(y_test, yhat)
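
# For binary 0/1 labels the Jaccard index of the positive class reduces to
# TP / (TP + FP + FN); a hand-rolled sanity check (my addition):
tp = np.sum((y_test == 1) & (yhat == 1))
fp = np.sum((y_test == 0) & (yhat == 1))
fn = np.sum((y_test == 1) & (yhat == 0))
print(tp / (tp + fp + fn))  # should match jaccard_score(y_test, yhat)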

# confusion matrix
from sklearn.metrics import classification_report, confusion_matrix
import itertools
def plot_confusion_matrix(cm, classes,
			  normalize=False,
			  title='Confusion matrix',
			  cmap=plt.cm.Blues):
	"""
	This function prints and plots the confusion matrix.
	Normalization can be applied by setting `normalize=True`.
	"""
	if normalize:
		cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
		print("Normalized confusion matrix")
	else:
		print("Confusion matrix, without normalization")
	print(cm)

	plt.imshow(cm, interpolation='nearest',cmap=cmap)
	plt.title(title)
	plt.colorbar()
	tick_marks = np.arange(len(classes))
	plt.xticks(tick_marks, classes, rotation=45)
	plt.yticks(tick_marks, classes)

	fmt = '.2f' if normalize else 'd'
	thresh = cm.max() / 2.
	for i,j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
		plt.text(j, i, format(cm[i,j], fmt),
			horizontalalignment="center",
			color="white" if cm[i,j] > thresh else "black")

	plt.ylabel("True label")
	plt.xlabel("Predicted label")

print(confusion_matrix(y_test, yhat, labels=[1,0]))

# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, yhat, labels=[1,0])
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['category=1', 'category=0'], normalize=False, title='Confusion matrix')
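
# The same helper with normalize=True (as its docstring suggests) shows
# per-class rates instead of raw counts:
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['category=1', 'category=0'], normalize=True, title='Normalized confusion matrix')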

# classification_report gives precision, recall, f1-score, support
# precision = TP / (TP + FP)
# recall = TP / (TP + FN) ;true positive rate
# F1 score: the harmonic mean of precision and recall; it reaches its best value at 1 (perfect precision and recall) and its worst at 0.
print(classification_report(y_test, yhat))
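
# Hand computation of the formulas above (my addition); with labels=[1,0],
# cnf_matrix is laid out as [[TP, FN], [FP, TN]]:
tp, fn, fp, tn = cnf_matrix.ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)  # should match the positive-class row of the report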

# log loss: measures the performance of a classifier whose predicted output is a probability value between 0 and 1
from sklearn.metrics import log_loss
log_loss(y_test, yhat_prob)
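
# Equivalent hand computation (my addition), assuming 0/1 labels so that
# yhat_prob[:, 1] is P(Y=1|X):
# log loss = -mean(y*log(p) + (1-y)*log(1-p))
p1 = yhat_prob[:, 1]
print(-np.mean(y_test * np.log(p1) + (1 - y_test) * np.log(1 - p1)))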