Types of regressions and when/why we may want to use GAM
https://www.analyticsvidhya.com/blog/2018/03/introduction-regression-splines-python-codes/
- Linear regression assumes linear relationship
- Polynomial regression allows non-linear relationship
- However, polynomial regression tends to overfit.
- Also, it is quite sensitive to local values; “changing the value of Y at one point in the training set can affect the fit of the polynomial for data points that are very far away”, because it uses the existing feature X to generate new global features X^2, X^3, … .
> To overcome this, we can divide the data into multiple bins to fit linear or low degree polynomial functions separately, which is called piecewise polynomials.
Piecewise polynomials look like this (cubic, with a single knot at c):
y_i = beta_01 + beta_11 x_i + beta_21 x_i^2 + beta_31 x_i^3 + eps_i   (if x_i < c);
y_i = beta_02 + beta_12 x_i + beta_22 x_i^2 + beta_32 x_i^3 + eps_i   (if x_i >= c).
Each of these polynomial functions can be fit using the least squares error metric.
In this particular case, this family of piecewise cubic polynomials has 8 degrees of freedom in total (4 coefficients on each side of the knot). Step functions are piecewise polynomials of degree 0, and piecewise linear functions are of degree 1.
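A minimal numpy sketch of the idea above: fit a separate cubic polynomial by least squares on each side of a single knot (synthetic data and the knot location c = 5 are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 100))
y = np.sin(x) + rng.normal(0, 0.2, 100)

c = 5.0  # single knot

def fit_cubic(x_bin, y_bin):
    # least-squares fit of a degree-3 polynomial on one bin
    X = np.vander(x_bin, 4)  # columns: x^3, x^2, x, 1
    coef, *_ = np.linalg.lstsq(X, y_bin, rcond=None)
    return coef

left = x < c
coef_left = fit_cubic(x[left], y[left])
coef_right = fit_cubic(x[~left], y[~left])
# 4 coefficients per piece -> 8 degrees of freedom in total, as in the notes
```

Note that nothing forces the two pieces to agree at x = c, which is exactly the discontinuity problem the constraints below address.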
Constraints and Splines (knot: the points where the division occurs)
- “the polynomials on either side of a knot should be continuous at the knot.” Otherwise the fitted model won’t produce a unique output for every input.
- “the first derivative of both the cubic polynomials must be the same.” This constraint makes the fitted model a smooth function.
- “the double derivative of both the cubic polynomials at a knot must be the same.”
*Note that each constraint imposed on the piecewise polynomials reduces one degree of freedom.
Particularly, such a piecewise polynomial of degree m (m=3 for cubic) with m-1 continuous derivatives (the first and second derivatives) is called a Spline.
In general, a cubic spline with K knots has a total of 4 + K degrees of freedom. There is seldom any good reason to go beyond cubic splines.
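One concrete way to see the 4 + K degrees of freedom is the truncated power basis: 1, x, x^2, x^3, plus one term (x − k)_+^3 per knot. A sketch with numpy and three assumed knot locations:

```python
import numpy as np

def cubic_spline_basis(x, knots):
    """Truncated power basis for a cubic spline: 1, x, x^2, x^3,
    plus (x - k)_+^3 for each knot k -> 4 + K columns (= 4 + K dof)."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - k, 0, None) ** 3 for k in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(0, 0.2, 200)

knots = [2.5, 5.0, 7.5]               # K = 3 knots (assumed for illustration)
B = cubic_spline_basis(x, knots)      # 200 x 7 design matrix (4 + K)
beta, *_ = np.linalg.lstsq(B, y, rcond=None)
y_hat = B @ beta
```

Each (x − k)_+^3 term is zero left of its knot and cubic to the right, so the fit is automatically continuous with continuous first and second derivatives at every knot.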
Polynomial models are known to behave erratically near the boundaries, and this problem carries over to regression splines. Thus, “to smooth the polynomial beyond the boundary knots, we will use a special type of spline known as Natural Spline”, which imposes a constraint of linearity beyond the boundary knots. This removes 2 more degrees of freedom at each of the two ends of the curve, reducing K + 4 to K.
Choosing the number and locations of the knots
- In practice, it is common to place knots in a uniform fashion, although placing more knots in areas of high variability should also work well.
- A more objective approach is to repeat cross-validation with different numbers of knots K:
1. Remove a portion of the data
2. fit a spline with a certain number of knots to the remaining data, and then
3. Use the spline to make predictions for the held-out portion.
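The three steps above can be sketched as a small cross-validation loop (synthetic data, uniform interior knots, and the candidate K values are assumptions for illustration):

```python
import numpy as np

def cv_error_for_knots(x, y, n_knots, n_folds=5):
    """Estimate held-out MSE of a cubic spline with `n_knots` uniform knots."""
    def basis(x, knots):
        cols = [np.ones_like(x), x, x**2, x**3]
        cols += [np.clip(x - k, 0, None) ** 3 for k in knots]
        return np.column_stack(cols)
    # uniform interior knots over the observed range
    knots = np.linspace(x.min(), x.max(), n_knots + 2)[1:-1]
    idx = np.arange(len(x)) % n_folds
    errs = []
    for f in range(n_folds):
        tr, te = idx != f, idx == f               # 1. hold out a portion
        beta, *_ = np.linalg.lstsq(basis(x[tr], knots), y[tr], rcond=None)  # 2. fit
        errs.append(np.mean((y[te] - basis(x[te], knots) @ beta) ** 2))     # 3. predict
    return np.mean(errs)

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 10, 300))
y = np.sin(x) + rng.normal(0, 0.3, 300)
scores = {k: cv_error_for_knots(x, y, k) for k in (1, 3, 5, 10)}
best_k = min(scores, key=scores.get)  # pick K with lowest held-out error
```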
Q. While dealing with non-linear relationships, how do we know the functions that describe the nonlinear relationships anyways!?!? Here comes GAM!
https://www.hds.utc.fr/~tdenoeux/dokuwiki/_media/en/splines.pdf
“The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Trevor Hastie, Robert Tibshirani, Jerome Friedman.”
Generalized additive model
may be used to identify and characterize nonlinear regression effects. In the regression setting, a GAM has the form
E[Y | X1, X2, …, Xp] = alpha + f1(X1) + f2(X2) + … + fp(Xp)
where the fj are unspecified smooth (nonparametric) functions. Interest in a GAM focuses on inference about these smooth functions.
In general, the conditional mean mu(X) of a response Y is related to an additive function of the predictors via a link function g:
g[mu(X)] = alpha + f1(X1) + … + fp(Xp)
- g(mu) = mu is the identity link, used for linear and additive models for Gaussian response data.
- g(mu) = logit(mu) or g(mu) = probit(mu) for classification: both are link functions for modeling binomial probabilities.
- g(mu) = log(mu) for log-linear or log-additive models for Poisson count data.
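The three links above can be written out directly; the GAM models g(mu) = alpha + sum fj(Xj), so predictions come from the inverse link applied to the additive predictor. A small numpy sketch:

```python
import numpy as np

# Each entry: (link g, inverse link g^{-1})
links = {
    "identity": (lambda mu: mu, lambda eta: eta),                 # Gaussian response
    "logit": (lambda mu: np.log(mu / (1 - mu)),
              lambda eta: 1 / (1 + np.exp(-eta))),                # binomial probabilities
    "log": (np.log, np.exp),                                      # Poisson counts
}

eta = 0.7  # some value of the additive predictor alpha + sum f_j(X_j)
for name, (g, g_inv) in links.items():
    mu = g_inv(eta)           # mean response on its natural scale
    assert np.isclose(g(mu), eta)  # applying g recovers the additive predictor
```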
Mixing linear and nonlinear effects, interactions
Need special consideration when qualitative (categorical) features are involved.
Fitting GAMs
- If we model each function fj as a natural spline, then we can use a simple least squares algorithm (regression) or a likelihood-maximization algorithm (classification) to fit the resulting model.
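The least-squares route can be sketched by stacking one spline basis block per feature into a single design matrix and solving it jointly. This uses a plain (not natural) truncated-power cubic basis for brevity, and the data, knots, and the two-feature setup are assumptions for illustration:

```python
import numpy as np

def spline_basis(x, knots):
    # truncated power basis without an intercept column;
    # a single global alpha (intercept) is added once below
    cols = [x, x**2, x**3] + [np.clip(x - k, 0, None) ** 3 for k in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(3)
n = 400
X = rng.uniform(0, 10, (n, 2))
y = np.sin(X[:, 0]) + 0.1 * X[:, 1] ** 2 + rng.normal(0, 0.2, n)

knots = [2.5, 5.0, 7.5]
# One basis block per feature -> the design matrix of the whole additive
# model g(mu) = alpha + f1(x1) + f2(x2), fit jointly by ordinary least squares.
D = np.column_stack([np.ones(n)] +
                    [spline_basis(X[:, j], knots) for j in range(2)])
beta, *_ = np.linalg.lstsq(D, y, rcond=None)
y_hat = D @ beta
```

Each fj is then the fitted linear combination of its own basis block, which is what the partial plots of a GAM display.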
Evaluation of fitted GAMs
- AIC will fail with regression splines. Permutation feature importance is probably a better way to evaluate relative variable importance in the model.
- GCV for finding a good K value (the number of bins)
lambda: the smoothing parameter, which also needs to be estimated; in pyGAM, both K and lambda can be tuned by gam.gridsearch(X, y) with objective=’GCV’.
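The GCV criterion that the grid search minimizes can be sketched directly: GCV(lambda) = n · RSS / (n − tr(H))^2, where H is the hat matrix of the penalized fit and tr(H) is the effective degrees of freedom. A real GAM penalizes curvature; the identity penalty here is a simplifying assumption:

```python
import numpy as np

def gcv_score(B, y, lam):
    """Generalized cross-validation score for a penalized least-squares smoother.
    H = B (B'B + lam*I)^{-1} B' is the hat matrix; tr(H) plays the role of the
    effective degrees of freedom. (Identity penalty is a simplification of the
    usual curvature penalty.)"""
    n = len(y)
    A = np.linalg.solve(B.T @ B + lam * np.eye(B.shape[1]), B.T)
    H = B @ A
    rss = np.sum((y - H @ y) ** 2)
    return n * rss / (n - np.trace(H)) ** 2

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(0, 0.3, 200)
# truncated-power cubic spline basis with 3 assumed knots
B = np.column_stack([x**p for p in range(4)] +
                    [np.clip(x - k, 0, None) ** 3 for k in (2.5, 5.0, 7.5)])
# pick the lambda from an assumed grid that minimizes GCV
best_lam = min((0.01, 0.1, 1.0, 10.0), key=lambda lam: gcv_score(B, y, lam))
```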
Partial Dependency Plot (PDP) Lucas 2020 Ecological Monographs
A PDP analyzes how a fitted model behaves as the covariate of interest changes. "This plot is created by evaluating the model n x m times, with all but one covariate taking their values from the n data points and the covariate of interest taking m equally spaced values. The mean response for each value of the covariate of interest is then plotted."
*"While PDPs are computed as the mean of the response over the data set, the variable importance measures calculated above are evaluated over all training data. There can therefore be a mismatch where a PDP looks flat while the variable importance is high. Relatedly, the PDP gives no information on interactions because only one curve is plotted. Once we have identified covariates with important interactions, we can use individual conditional expectation plots."
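The n x m evaluation scheme from the quote can be sketched in a few lines; the toy multiplicative model below (an assumption for illustration) also reproduces the flat-PDP-despite-high-importance mismatch, since the interaction averages out:

```python
import numpy as np

def partial_dependence(model, X, j, m=20):
    """PDP for covariate j: evaluate the model n x m times, replacing column j
    with each of m equally spaced grid values and averaging the predictions."""
    grid = np.linspace(X[:, j].min(), X[:, j].max(), m)
    pd_vals = np.empty(m)
    for i, v in enumerate(grid):
        Xv = X.copy()
        Xv[:, j] = v                    # pin covariate of interest to grid value v
        pd_vals[i] = model(Xv).mean()   # mean response over the n data points
    return grid, pd_vals

# toy model with a pure interaction (assumed for illustration)
model = lambda X: X[:, 0] * X[:, 1]
rng = np.random.default_rng(5)
X = rng.normal(0, 1, (500, 2))
grid, pd_vals = partial_dependence(model, X, j=0, m=20)
# pd_vals is roughly flat (the mean of X[:, 1] is near 0) even though x0
# strongly affects the output: the PDP/importance mismatch from the quote
```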