[wikipedia]
Null hypothesis: samples in all groups are drawn from populations with the same mean values.
Assumptions that need to be met in ANOVA are:
Response variable residuals are normally distributed (or approximately normally distributed).
Variances of populations are equal.
Responses for a given group are independent and identically distributed normal random variables (not a simple random sample (SRS)).
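The first two assumptions above can be checked before running the test. Below is a minimal sketch, assuming SciPy is available: it applies the Shapiro-Wilk test (`scipy.stats.shapiro`) to the residuals and Levene's test (`scipy.stats.levene`) to the groups. The helper name `check_anova_assumptions` and the synthetic data are hypothetical and used only for illustration.

```python
import numpy as np
from scipy import stats


def check_anova_assumptions(groups):
    """Hypothetical helper. groups: list of 1-D samples, one per group."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    # Residuals: each observation minus the mean of its own group
    residuals = np.concatenate([g - g.mean() for g in groups])
    # Shapiro-Wilk: null hypothesis is that the residuals are normally distributed
    _, shapiro_p = stats.shapiro(residuals)
    # Levene: null hypothesis is that all groups have equal variance
    _, levene_p = stats.levene(*groups)
    return {"shapiro_p": float(shapiro_p), "levene_p": float(levene_p)}


# Synthetic example data (made up): three normal groups with equal variance
rng = np.random.default_rng(0)
example_groups = [rng.normal(loc=m, scale=1.0, size=30) for m in (0.0, 0.5, 1.0)]
result = check_anova_assumptions(example_groups)
print(result)
```

A small p-value in either test suggests that the corresponding assumption may not hold. The independence assumption cannot be tested this way and has to be justified by the study design.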
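import pandas as pd
data = pd.DataFrame  # placeholder: replace with your own DataFrame
X = ['feature1', 'feature2', ...]  # names of the numeric response columns
GM = data[X].mean()  # overall (grand) mean of each column in X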
# total variation: squared deviations of every observation from the overall mean
SST = 0
for i in data[X].values:
    SST += (i - GM)**2
print('total variation\n', SST)
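# between-group variation: squared deviations of each group mean from the overall mean,
# weighted by group size
SSB = 0
for group in set(data.GroupingColumnName):
    d = data.loc[data.GroupingColumnName == group]
    n = len(d)
    # select X before averaging so non-numeric columns (e.g. the grouping column) are skipped
    SSB += n * (d[X].mean() - GM)**2
print('between-group variation\n', SSB)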
# within-group variation
SSW = SST - SSB
print('within-group variation\n',SSW)
N = len(data)
k = len(set(data.GroupingColumnName))
MSB = SSB / (k-1)
MSW = SSW / (N-k)
print('F-value\n',MSB/MSW)
# degrees of freedom to calculate p-value in this ANOVA: numerator (k-1), denominator (N-k).
# GroupingColumnName: the name of the column in the DataFrame that categorizes the data into the groups of interest.
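The p-value follows from the F-value and the two degrees of freedom noted above. Here is a minimal sketch, assuming SciPy is available: it repeats the sums-of-squares calculation on a small toy table, obtains the p-value from the survival function of the F distribution (`scipy.stats.f.sf`), and compares the result with `scipy.stats.f_oneway`. The DataFrame `toy`, its column names, and its values are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "group": ["a"] * 4 + ["b"] * 4 + ["c"] * 4,
    "feature1": [4.1, 3.9, 4.3, 4.0, 5.2, 5.0, 5.4, 5.1, 6.1, 5.8, 6.3, 6.0],
})

N = len(toy)                  # total number of observations
k = toy["group"].nunique()    # number of groups
grand_mean = toy["feature1"].mean()

# Total, between-group, and within-group sums of squares
sst = ((toy["feature1"] - grand_mean) ** 2).sum()
group_stats = toy.groupby("group")["feature1"].agg(["mean", "count"])
ssb = (group_stats["count"] * (group_stats["mean"] - grand_mean) ** 2).sum()
ssw = sst - ssb

f_value = (ssb / (k - 1)) / (ssw / (N - k))
# Survival function: P(F >= f_value) under the null hypothesis,
# with (k - 1) numerator and (N - k) denominator degrees of freedom
p_value = stats.f.sf(f_value, k - 1, N - k)

# Cross-check against SciPy's one-way ANOVA
samples = [g["feature1"].to_numpy() for _, g in toy.groupby("group")]
f_ref, p_ref = stats.f_oneway(*samples)
print("manual:", f_value, p_value)
print("scipy: ", f_ref, p_ref)
```

The two printed lines should agree up to floating-point error, which is a quick way to validate the manual calculation on your own data.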