Linear Methods for Regression
====

A linear regression model assumes that the regression function $E[Y|X]$ is linear in the inputs $X_1, \ldots, X_p$.
Before a concrete model is specified, $E[Y|X]$ is used generically for a discriminative model; a generative model is written $E[Y, X]$.

> A generative model has the feel of a god creating everything, but for many tasks finding a suitable generative process is very hard, while "recognizing and judging" it is easy. It is like writing: becoming a good writer is hard, but recognizing good work is much easier.

Linear Regression assumes that $E[Y|X]$ is linear, or at least approximately linear; concretely:

$$
f(X) = \beta_0 + \sum_{j=1}^p X_j\beta_j
$$

The loss function is defined as:

$$
\mathrm{RSS}(\beta) = \sum_{i=1}^N (y_i - f(x_i))^2 = \sum_{i=1}^N \Big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big)^2
$$


Minimizing Linear Regression's loss function amounts to finding the hyperplane $f(X;\beta)$ in feature space that minimizes the error on the training set.


Assuming $X^TX$ is full rank, differentiating $\mathrm{RSS}(\beta)$ and setting the derivative to zero gives $X^T(y - X\beta) = 0$, hence $\hat{\beta} = (X^TX)^{-1}X^Ty$. From $X^T(y - X\beta) = 0$ it also follows that $X^T \perp (y - X\beta)$: the residual is orthogonal to the column space of $X$.
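
As a sanity check, here is a minimal numpy sketch (synthetic data, all names illustrative) that solves the normal equations and verifies that the residual is orthogonal to the columns of $X$:

```python
import numpy as np

# Synthetic data: prepend an intercept column, draw y = X beta + noise.
rng = np.random.default_rng(0)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta_true = np.array([2.0, 1.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=N)

# beta_hat = (X^T X)^{-1} X^T y; np.linalg.solve avoids an explicit inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

residual = y - X @ beta_hat
print(beta_hat)            # close to beta_true
print(X.T @ residual)      # ~ 0: residual orthogonal to the column space of X
```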


Assume the deviations of $Y$ around its expectation are additive and Gaussian; then:

$$
Y = E[Y|X_1, \ldots, X_p] + \varepsilon = \beta_0 + \sum_{j=1}^p X_j\beta_j + \varepsilon
$$

where $\varepsilon \sim N(0, \sigma^2)$.
Then:

1. $E[\hat{\beta}] = E[(X^TX)^{-1}X^Ty] = (X^TX)^{-1}X^TE[y]$
   $\Longrightarrow E[\hat{\beta}] = (X^TX)^{-1}X^TX\beta = \beta$
2. $\hat{\beta} - E[\hat{\beta}] = (X^TX)^{-1}X^Ty - (X^TX)^{-1}X^TX\beta$
   $\Longrightarrow \hat{\beta} - E[\hat{\beta}] = \big((X^TX)^{-1}X^T\big)(y - X\beta) = (X^TX)^{-1}X^T\varepsilon$
3. $\mathrm{Var}(\hat{\beta}) = E[(\hat{\beta} - E[\hat{\beta}])(\hat{\beta} - E[\hat{\beta}])^T]$
   $\Longrightarrow \mathrm{Var}(\hat{\beta}) = E[(X^TX)^{-1}X^T\varepsilon\varepsilon^TX(X^TX)^{-1}]$
   $= (X^TX)^{-1}X^TE[\varepsilon\varepsilon^T]X(X^TX)^{-1}$
   $= (X^TX)^{-1}X^T\mathrm{Var}(\varepsilon)X(X^TX)^{-1}$
   $= \sigma^2(X^TX)^{-1}X^TX(X^TX)^{-1}$
   $= \sigma^2(X^TX)^{-1}$

Hence $\hat{\beta} \sim N(\beta, (X^TX)^{-1}\sigma^2)$.
Usually $\sigma^2$ is replaced by the unbiased estimate ($E[\hat{\sigma}^2] = \sigma^2$):

$$
\hat{\sigma}^2 = \frac{1}{N-p-1}\sum_{i=1}^N (y_i - \hat{y}_i)^2
$$

Rearranging slightly: $(N-p-1)\hat{\sigma}^2 = \sum_{i=1}^N (y_i - \hat{y}_i)^2 \sim \sigma^2\chi^2_{N-p-1}$.
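
The following simulation (synthetic data, illustrative only) checks both facts empirically: the sample covariance of $\hat{\beta}$ approaches $\sigma^2(X^TX)^{-1}$, and $\mathrm{RSS}/\sigma^2$ has mean close to $N-p-1$, as a $\chi^2_{N-p-1}$ variable should:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, sigma = 50, 2, 0.5
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # fixed design
beta = np.array([1.0, 2.0, -1.0])

betas, scaled_rss = [], []
for _ in range(5000):
    y = X @ beta + rng.normal(scale=sigma, size=N)
    b = np.linalg.solve(X.T @ X, X.T @ y)
    r = y - X @ b
    betas.append(b)
    scaled_rss.append(r @ r / sigma**2)        # = (N-p-1) * sigma_hat^2 / sigma^2

print(np.cov(np.array(betas), rowvar=False))   # ~ sigma^2 (X^T X)^{-1}
print(sigma**2 * np.linalg.inv(X.T @ X))
print(np.mean(scaled_rss), "vs", N - p - 1)    # chi^2_{N-p-1} has mean N-p-1
```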


With $v_j$ the $j$th diagonal element of $(X^TX)^{-1}$, the z-score is $z_j = \hat{\beta}_j / (\hat{\sigma}\sqrt{v_j})$. Under the null hypothesis that $\beta_j = 0$, $z_j$ is distributed as $t_{N-p-1}$, and hence a large (absolute) value of $z_j$ will lead to rejection of this null hypothesis.

Page 47:
The variance-covariance matrix of the least squares parameter estimates is easily derived from (3.6) and is given by

$$
\mathrm{Var}(\hat{\beta}) = (X^TX)^{-1}\sigma^2.
$$

Typically one estimates the variance $\sigma^2$ by

$$
\hat{\sigma}^2 = \frac{1}{N-p-1}\sum_{i=1}^N (y_i - \hat{y}_i)^2
$$
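
Putting these pieces together, here is a hedged sketch of the z-score computation (synthetic data; the two-sided t-test via scipy is one standard way to get p-values):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
N, p = 80, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ np.array([0.5, 1.5, 0.0, -2.0]) + rng.normal(size=N)  # beta_2 is truly 0

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
df = N - p - 1
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / df           # unbiased estimate
se = np.sqrt(np.diag(np.linalg.inv(X.T @ X)) * sigma2_hat)  # standard errors
z = beta_hat / se                                           # z-scores (3.12)
p_values = 2 * stats.t.sf(np.abs(z), df)                    # test of beta_j = 0
for j, (zj, pv) in enumerate(zip(z, p_values)):
    print(f"beta_{j}: z = {zj:+.2f}, p = {pv:.3f}")
```

The coefficient whose true value is zero typically shows a small $|z_j|$ and a large p-value, so it would not be rejected.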


Extended to the multiple-output case, the linear model is formalized as:

$$
Y_k = \beta_{0k} + \sum_{j=1}^p X_j\beta_{jk} + \varepsilon_k = f_k(X) + \varepsilon_k
$$

Further, in matrix notation:

$$
Y = XB + E
$$

Then the loss function for the multiple-output case is:

$$
\mathrm{RSS}(B) = \sum_{k=1}^K\sum_{i=1}^N (y_{ik} - f_k(x_i))^2 = \mathrm{tr}[(Y - XB)^T(Y - XB)].
$$

The estimated parameters are:

$$
\hat{B} = (X^TX)^{-1}X^TY
$$
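
A minimal sketch of the multiple-output estimate (synthetic data), illustrating that $\hat{B}$ decouples into $K$ separate univariate fits:

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, K = 60, 4, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
B_true = rng.normal(size=(p + 1, K))
Y = X @ B_true + 0.1 * rng.normal(size=(N, K))

B_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # one solve handles all K outputs

# Column-by-column check: the coefficients for output k ignore the other outputs.
for k in range(K):
    b_k = np.linalg.solve(X.T @ X, X.T @ Y[:, k])
    assert np.allclose(b_k, B_hat[:, k])
```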

Least squares estimates of the parameter $\beta$ have the smallest variance among all linear unbiased estimates (the Gauss-Markov theorem).
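
This can be checked empirically: any matrix $A$ with $AX = I$ yields a linear unbiased estimator $Ay$, and its variance is never below that of least squares. A small sketch (the weight matrix $W$ is an arbitrary illustrative choice, not anything prescribed by the book):

```python
import numpy as np

rng = np.random.default_rng(5)
N, p = 40, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
W = np.diag(rng.uniform(0.5, 2.0, size=N))     # arbitrary positive weights

A_ols = np.linalg.solve(X.T @ X, X.T)          # OLS: (X^T X)^{-1} X^T
A_w = np.linalg.solve(X.T @ W @ X, X.T @ W)    # alternative with A X = I (unbiased)

# For y = X beta + sigma * noise, Var(A y) = sigma^2 A A^T; compare diagonals.
var_ols = np.diag(A_ols @ A_ols.T)
var_w = np.diag(A_w @ A_w.T)
print(var_ols)
print(var_w)    # elementwise >= var_ols, as Gauss-Markov guarantees
```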


- Best-subset: a brute-force approach. Its limitations are computational complexity and the question of how to choose the subset size; the ultimate goal is to minimize expected error, but in practice cross-validation or AIC is used. Higher variance.
- Forward stepwise: a greedy strategy with lower computational complexity and lower variance; widely used (see the sketch after this list).
- Forward stagewise.

These subset-selection methods amount to a 0/1 coding of the features and typically exhibit relatively high variance; shrinkage methods are smoother and have comparatively lower variance.
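
A minimal sketch of the greedy idea behind forward stepwise selection (the RSS-based scoring, helper names, and fixed subset size are illustrative simplifications, not ESL's exact algorithm):

```python
import numpy as np

def fit_rss(X, y, cols):
    """RSS of a least squares fit on the selected columns (intercept included)."""
    Xs = np.column_stack([np.ones(len(y)), X[:, cols]])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    r = y - Xs @ beta
    return r @ r

def forward_stepwise(X, y, k):
    """Greedily add, k times, the predictor that most reduces the RSS."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        best = min(remaining, key=lambda j: fit_rss(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 2] - 2 * X[:, 7] + rng.normal(size=100)
print(forward_stepwise(X, y, 2))  # typically recovers [2, 7]
```

In practice the subset size $k$ would itself be chosen by cross-validation or AIC, as noted above.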


**Ridge regression does a proportional shrinkage. Lasso translates each coefficient by a constant factor $\lambda$, truncating at zero. This is called "soft thresholding".
Best-subset selection drops all variables with coefficients smaller than the $M$th largest; this is a form of "hard-thresholding."**
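
In the orthonormal-input case these three rules are simple coordinate-wise maps; a small sketch (function names are mine, not from the book):

```python
import numpy as np

def soft_threshold(beta, lam):
    """Lasso (orthonormal case): shift toward zero by lam, truncate at zero."""
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

def hard_threshold(beta, M):
    """Best-subset (orthonormal case): keep only the M largest coefficients."""
    keep = np.argsort(np.abs(beta))[-M:]
    out = np.zeros_like(beta)
    out[keep] = beta[keep]
    return out

def ridge_shrink(beta, lam):
    """Ridge (orthonormal case): proportional shrinkage by 1 / (1 + lam)."""
    return beta / (1.0 + lam)

beta = np.array([3.0, -1.5, 0.4, -0.2, 2.0])
print(soft_threshold(beta, 0.5))  # small coefficients zeroed, rest shifted
print(hard_threshold(beta, 2))    # only the 2 largest survive
print(ridge_shrink(beta, 1.0))    # every coefficient scaled down
```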


references:
[Chapter 3: Linear Methods for Regression](http://www.csc.kth.se/utbildning/kth/kurser/DD3364/Lectures/Lecture2.pdf)
[Regression (statistics): What is Least Angle Regression and when should it be used?](http://www.quora.com/Regression-statistics/What-is-Least-Angle-Regression-and-when-should-it-be-used)
[More Notes for Linear Regression](http://pan.baidu.com/s/1pJFgy9P)
[统计学习那些事 (Stories about Statistical Learning)](https://cloud.github.com/downloads/cosname/editor/stories-about-statistical-learning1.pdf)
[Bias of an estimator](http://en.wikipedia.org/wiki/Bias_of_an_estimator)
Random Vectors and the Variance-Covariance Matrix
[LaTeX:Symbols](http://www.artofproblemsolving.com/Wiki/index.php/LaTeX:Symbols)
[Mean Vector and Covariance Matrix](http://www.itl.nist.gov/div898/handbook/pmc/section5/pmc541.htm)

(3.11) chi-squared distribution:
Covers the definition and a simple example; does not cover the distribution's properties.
[Definition](http://en.wikipedia.org/wiki/Chi-squared_distribution#Definition)
[Applications](http://en.wikipedia.org/wiki/Chi-squared_distribution#Applications)

(3.12) Z-score:
Introduces the definition and usage of the Z-score.
[Standard Score](https://statistics.laerd.com/statistical-guides/standard-score.php)
[Hypothesis Testing](https://statistics.laerd.com/statistical-guides/hypothesis-testing.php)

Multicollinearity:
[统计建模与R软件 (Statistical Modeling and R Software), chapter on multicollinearity](http://book.douban.com/subject/2120492/)


