News

New paper in The American Naturalist!

Monday, September 6, 2021

Linear regression (OLS) and p-value in Python (and F-test in multiple regression)

-----------------------------------------
import numpy as np
import pandas as pd
import statsmodels.api as sm

X = data['x1']  # single predictor; for multiple regression use e.g.:
# X = np.column_stack((data['x1'], data['x2'], data['x3']))
Y = data['y']

X2 = sm.add_constant(X)   # add the intercept column
est = sm.OLS(Y, X2)
est2 = est.fit()
print(est2.summary())
-----------------------------------------
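The snippet above assumes a pandas DataFrame called data already exists. A minimal self-contained sketch (the column names x1, x2, x3, y and the coefficients are made up for illustration) would be:

-----------------------------------------
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 100
data = pd.DataFrame({'x1': rng.normal(size=n),
                     'x2': rng.normal(size=n),
                     'x3': rng.normal(size=n)})
# y depends on x1 and x2; x3 is pure noise
data['y'] = 2.0 * data['x1'] - 1.0 * data['x2'] + rng.normal(size=n)

X = np.column_stack((data['x1'], data['x2'], data['x3']))
X2 = sm.add_constant(X)
est2 = sm.OLS(data['y'], X2).fit()
print(est2.summary())
-----------------------------------------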

which gives the regression summary table. Two parts of it to note:

t-test: one for each coefficient (the t and P>|t| columns)

F-statistic (the F-test of the overall significance)
"In multiple regression, since we are fitting many predictors, we need to consider a case where there are a lot of features. With a very large amount of predictors, there will always be about 5% of them that will have, by chance, a very small p-value even though they are not statistically significant. Therefore, we use the F-statistic to avoid considering unimportant predictors as significant predictors. " link

What does this F-test do? link
"The F-test of the overall significance is a specific form of the F-test. It compares a model with no predictors (intercept-only model) to the model that you specify." 
That is,
"Null hypothesis: The fit of the intercept-only model and your model are equal.
Alternative hypothesis: The fit of the intercept-only model is significantly reduced compared to your model."
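In other words, the overall F-statistic compares the residual sum of squares of the intercept-only model (RSS_0) with that of the full model (RSS): F = ((RSS_0 - RSS) / p) / (RSS / (n - p - 1)), with p predictors and n observations. This can be checked by hand against statsmodels (a sketch with made-up data):

-----------------------------------------
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 3))
y = 1.5 * X[:, 0] + rng.normal(size=n)

full = sm.OLS(y, sm.add_constant(X)).fit()
null = sm.OLS(y, np.ones(n)).fit()    # intercept-only model

p = X.shape[1]
F = ((null.ssr - full.ssr) / p) / (full.ssr / (n - p - 1))
print(F, full.fvalue)                 # the hand-computed F matches statsmodels
-----------------------------------------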