Thursday, August 6, 2020

Statistics: F test and feature selection

F-test: "any statistical test in which the test statistic has an F-distribution under the null hypothesis" (Wikipedia)

Given U1 and U2 that follow chi-square distributions with degrees of freedom d1 and d2, respectively, the following F follows an F-distribution if U1 and U2 are statistically independent:
        F = (U1/d1)/(U2/d2)
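
As a quick numerical sanity check (a minimal sketch; the degrees of freedom, sample size, and random seed below are arbitrary choices for illustration):
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(0)
        d1, d2, n = 5, 10, 100_000

        # two independent chi-square samples
        U1 = rng.chisquare(d1, size=n)
        U2 = rng.chisquare(d2, size=n)
        F = (U1 / d1) / (U2 / d2)

        # empirical quantiles of F should match the theoretical F(d1, d2) quantiles
        for q in (0.5, 0.9, 0.99):
            print(q, np.quantile(F, q), stats.f.ppf(q, d1, d2))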

In an F-test, the test statistic is the ratio of two scaled sums of squares reflecting different sources of variability* (U1/d1 and U2/d2 above).
These sums of squares are constructed so that the statistic tends to be greater when the null hypothesis is not true.

*For ANOVA, 
        F = (explained variance) / (unexplained variance), or
        F = (between-group variability) / (within-group variability) .
In both cases, the numerator and the denominator each take the form of a sum of squared differences between a set of values of interest and a corresponding mean (which is conceptually related to variance or variability).
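
As a rough illustration, the two forms give the same statistic as scipy.stats.f_oneway on toy data (a minimal sketch; the group means, group sizes, and seed are arbitrary):
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(1)
        # three hypothetical groups with slightly different means
        groups = [rng.normal(loc=m, scale=1.0, size=30) for m in (0.0, 0.3, 0.6)]

        all_data = np.concatenate(groups)
        grand_mean = all_data.mean()
        k = len(groups)      # number of groups
        n = all_data.size    # total number of observations

        # between-group variability: group means vs. the grand mean
        ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
        # within-group variability: observations vs. their own group mean
        ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

        F = (ss_between / (k - 1)) / (ss_within / (n - k))
        print(F, stats.f_oneway(*groups).statistic)  # the two values agree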

*For regression problems,
One has two models, 1 and 2. Model 1 is nested within model 2; that is, model 1 is a restricted version of model 2.
The model with more parameters (model 2) will always be able to fit the data at least as well as the model with fewer parameters (model 1). The question is whether model 2 gives a significantly better fit to the data than model 1 does. In this context, if there are n data points from which to estimate the parameters of both models,
        F = [(RSS1 - RSS2)/ (p2 - p1)] / [RSS2 / (n - p2)]
p1 and p2: the numbers of parameters in models 1 and 2 (p1 < p2)
RSSi: the residual sum of squares of model i
n: the number of data points used to estimate the parameters of both models
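
A minimal sketch of this nested-model F-test on made-up data, using ordinary least squares fits via numpy (the variables x1, x2 and the helper rss below are hypothetical, just for illustration):
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(2)
        n = 200
        x1 = rng.normal(size=n)
        x2 = rng.normal(size=n)
        y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(size=n)

        def rss(X, y):
            # residual sum of squares of an ordinary least squares fit
            beta, *_ = np.linalg.lstsq(X, y, rcond=None)
            return ((y - X @ beta) ** 2).sum()

        # model 1 (restricted): intercept + x1  -> p1 = 2 parameters
        X1 = np.column_stack([np.ones(n), x1])
        # model 2 (full): intercept + x1 + x2   -> p2 = 3 parameters
        X2 = np.column_stack([np.ones(n), x1, x2])

        RSS1, RSS2 = rss(X1, y), rss(X2, y)
        p1, p2 = X1.shape[1], X2.shape[1]

        F = ((RSS1 - RSS2) / (p2 - p1)) / (RSS2 / (n - p2))
        p_value = stats.f.sf(F, p2 - p1, n - p2)  # upper-tail probability under the null
        print(F, p_value)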

*F test in Feature Selection
"Modern day datasets are very rich in information .... This makes the data high dimensional ... Feature Selection is a very critical component in a Data Scientist's workflow. ... Feature Selection methods helps ... by reducing the dimensions without much loss of the total information. It also helps to make sense of the features and its importance."
In Python, sklearn.feature_selection.f_regression performs univariate linear regression tests:
        X: {array-like, sparse matrix} of shape (n_samples, n_features)
        y: array of shape (n_samples,)
        F, p_val = f_regression(X, y)
This function first computes the correlation between each regressor and the target, i.e. [{X[:,i] - mean(X[:,i])} * {y - mean(y)}] / [std(X[:,i]) * std(y)], averaged over samples.
This correlation is then converted to an F value and then to a p-value.
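
A small, self-contained usage sketch (the synthetic X and y below are made up; only the first two columns of X actually influence y):
        import numpy as np
        from sklearn.feature_selection import f_regression

        rng = np.random.default_rng(3)
        n_samples, n_features = 500, 4
        X = rng.normal(size=(n_samples, n_features))
        # only the first two features are linearly related to the target
        y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=n_samples)

        F, p_val = f_regression(X, y)
        print(F)      # large F scores for the two informative features
        print(p_val)  # correspondingly small p-values for those features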

A drawback of using the F-test to select features is that it relies on correlation. When the use of correlation doesn't make sense, the F-test isn't a good indicator. It captures only linear relationships between features and labels, giving higher scores to more highly correlated features.
Using mutual information can resolve this problem. It does well even when the features and the target are non-linearly related.
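
For a rough comparison (a minimal sketch on made-up data; the feature names are hypothetical), mutual_info_regression from sklearn picks up a quadratic relationship that f_regression effectively misses:
        import numpy as np
        from sklearn.feature_selection import f_regression, mutual_info_regression

        rng = np.random.default_rng(4)
        n = 1000
        x_linear = rng.uniform(-1, 1, size=n)
        x_quadratic = rng.uniform(-1, 1, size=n)
        X = np.column_stack([x_linear, x_quadratic])
        # y depends linearly on the first feature and quadratically on the second
        y = x_linear + 2.0 * x_quadratic ** 2 + 0.1 * rng.normal(size=n)

        F, _ = f_regression(X, y)
        mi = mutual_info_regression(X, y, random_state=0)
        print(F)   # the quadratic feature gets a near-zero F score
        print(mi)  # mutual information assigns both features non-trivial scores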