Sunday, October 4, 2020

Simple Regression

Mean Absolute Error (MAE): the mean of the absolute values of the errors. This is the easiest metric to understand, since it is just the average error.
Mean Squared Error (MSE): the mean of the squared errors. It is more popular than MAE because it focuses on large errors: squaring penalizes larger errors far more heavily than smaller ones.
Root Mean Squared Error (RMSE): the square root of the MSE. Taking the square root brings the metric back to the units of the target variable.
R-squared: not an error metric, but a popular measure of model accuracy. It represents how close the data are to the fitted regression line. The higher the R-squared, the better the model fits your data. The best possible score is 1.0, and it can be negative (because the model can be arbitrarily worse). The sketch below shows how each of these metrics can be computed.
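As a quick illustration, here is a minimal sketch of how these four metrics can be computed; the arrays are made-up placeholder values for demonstration only.

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])  # hypothetical observed values
y_pred = np.array([2.5, 5.5, 7.5, 8.0])  # hypothetical model predictions

errors = y_pred - y_true
mae = np.mean(np.abs(errors))   # Mean Absolute Error
mse = np.mean(errors ** 2)      # Mean Squared Error
rmse = np.sqrt(mse)             # Root Mean Squared Error
r2 = r2_score(y_true, y_pred)   # R-squared (y_true is the first argument)
print('MAE: %.2f, MSE: %.2f, RMSE: %.2f, R2: %.2f' % (mae, mse, rmse, r2))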
#!/usr/bin/env python3

"""
Code from "COGNITIVE CLASS.ai - Simple Linear Regression", author: Saeed Aghabozorgi
Regression:  to predict a continuous dependent variable from a number of independent variables
"""

import matplotlib.pyplot as plt
import pandas as pd
import pylab as pl
import numpy as np

##### Data #####
df = pd.read_csv("data.csv") # placeholder path to your dataset

##### Step 1: Understanding the Data #####
# Use plots, histograms, etc. to understand the data

cols = ['column1', 'column2', 'column3'] # placeholder column names: features to explore
cdf = df[cols]

X = 'column1' # placeholder: name of the feature column used below
Y = 'column2' # placeholder: name of the target column used below


##### Step 2: Create train and test dataset #####
# Training and testing sets should be mutually exclusive.

msk = np.random.rand(len(df)) < 0.8 # random mask: roughly 80% of the data for training, 20% for testing
train = cdf[msk]
test = cdf[~msk]


##### Step 3: Simple Regression Model #####
# Linear regression fits a linear model with coefficients theta = (th1, th2, ..., thN) to minimize the 'residual sum of squares' between the observed targets in the dataset and the targets predicted by the linear approximation.
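# For a single feature, the fitted line is yhat = intercept + coef * x, with the
# parameters chosen to minimize sum_i (y_i - yhat_i)^2.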

# Modeling
from sklearn import linear_model
regr = linear_model.LinearRegression()
train_x = np.asanyarray(train[[X]]) # a feature variable (X:string) in the training data
train_y = np.asanyarray(train[[Y]]) # target variable (Y:string) in the training data
regr.fit(train_x, train_y)	     # fitting
print('Coefficients:', regr.coef_)
print('Intercept:', regr.intercept_)

# Plot outputs
plt.scatter(train[X], train[Y], color='blue')
plt.plot(train_x, regr.coef_[0][0]*train_x + regr.intercept_[0], '-r')
plt.xlabel(X)
plt.ylabel(Y)
plt.show()

##### Step 4: Evaluation #####
# There are different model evaluation metrics. Let's use MAE, MSE, and the R2-score here to evaluate the model on the test set
from sklearn.metrics import r2_score

test_x = np.asanyarray(test[[X]]) # a feature variable (X) in test data
test_y = np.asanyarray(test[[Y]]) # target variable (Y) in test data
test_y_hat = regr.predict(test_x) # predict target variable using the same model (regr) and test data (test_x)

print('Mean absolute error: %.2f' % np.mean(np.absolute(test_y_hat - test_y)))
print('Mean squared error (MSE): %.2f' % np.mean((test_y_hat - test_y) ** 2))
print('R2-score: %.2f' % r2_score(test_y, test_y_hat)) # best possible score is 1.0 and it can be negative
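
# Side note: the random mask above gives only an approximate 80/20 split.
# scikit-learn's train_test_split is a common alternative that guarantees exact
# proportions; a minimal sketch, reusing the same placeholder columns:
from sklearn.model_selection import train_test_split
train_x2, test_x2, train_y2, test_y2 = train_test_split(
    df[[X]], df[[Y]], test_size=0.2, random_state=42)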