News

Tuesday, August 11, 2020

Statistics: Mutual information and feature selection in regression

The description here is based on the following paper:

Frénay B., Doquire G., and Verleysen M. (2013). Is mutual information adequate for feature selection in regression? Neural Networks (letter).

See also here for feature selection with a linear regression model (using the F-test).

What is feature selection?

The objective is to select a small subset of features (variables) that together minimize the error between the model's predictions and the observed values.

Error functions frequently used to characterize this error include the mean squared error (MSE) and the mean absolute error (MAE).
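
As a minimal sketch (using NumPy and made-up numbers, not data from the paper), both criteria can be computed directly from observed and predicted values:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Mean absolute error between observed and predicted values."""
    return np.mean(np.abs(y_true - y_pred))

# Made-up observations and predictions, for illustration only
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.7])
print(mse(y_true, y_pred))  # 0.0375
print(mae(y_true, y_pred))  # 0.175
```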

Mutual information in regression problems

The mutual information describes the reduction of uncertainty (measured by the entropy, H) about the target variable (Y) once a variable (X) has been observed:

I(X; Y) = H(Y) - H(Y|X) .

For a given regression problem with multiple input variables (X) and target Y, H(Y) is fixed and does not depend on which features are selected from the available variables X. According to the above equation, feature selection in regression therefore looks for a subset of variables that minimizes H(Y|X), which at the same time maximizes I(X; Y).
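
As a simplified illustration of this idea (a univariate ranking, not the multivariate subset selection discussed in the paper), scikit-learn's mutual_info_regression estimates the mutual information between each individual feature and the target with a nearest-neighbour estimator; a feature carrying information about Y scores clearly higher than pure noise. A minimal sketch with simulated data:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)                      # informative feature
x2 = rng.normal(size=n)                      # pure noise feature
y = np.sin(x1) + 0.1 * rng.normal(size=n)    # target depends on x1 only

X = np.column_stack([x1, x2])
mi = mutual_info_regression(X, y, random_state=0)
print(mi)  # I(x1; Y) is much larger than I(x2; Y), so x1 would be selected first
```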

The issue addressed in this paper

Although maximizing mutual information seems conceptually reasonable in regression, because a high mutual information with the target variable means a large reduction of uncertainty about the target, no formal connection had been established in machine learning between the MSE or MAE criteria and mutual information.

Conclusion

"when the conditional distribution of the estimation error is uniform, Laplacian or Gaussian, choosing the feature subset which minimises the conditional target entropy H(Y|X) is equivalent to minimizing either the MSE or the MAE criterion."