Linear Methods for Regression
====

A linear regression model assumes that the regression function $E[Y|X]$ is linear in the inputs $X_1, \ldots, X_p$.
Before a concrete model is specified, $E[Y|X]$ is used generically for a discriminative model; a generative model is written $E[Y, X]$.

> A generative model has the feel of a god creating everything, but for many tasks finding a suitable generative process is very hard, while "recognizing and judging" it is easy. It is like writing: becoming a good writer is hard, but recognizing good work is much easier.

Linear Regression assumes that $E[Y|X]$ is linear, or at least approximately linear; concretely:

$$
f(X) = \beta_0 + \sum_{j=1}^p X_j\beta_j
$$

The loss function is defined as:

$$
\mathrm{RSS}(\beta) = \sum_{i=1}^N (y_i - f(x_i))^2 = \sum_{i=1}^N \Big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big)^2
$$


Minimizing Linear Regression's loss function amounts to finding the hyperplane $f(X;\beta)$ in feature space that minimizes the error on the training set.


Assuming $X^TX$ is full rank, differentiating $\mathrm{RSS}(\beta)$ and setting the derivative to zero gives $X^T(y - X\beta) = 0$, hence $\hat{\beta} = (X^TX)^{-1}X^Ty$. From $X^T(y - X\beta) = 0$ it also follows that $X^T \perp (y - X\beta)$: the residual is orthogonal to the column space of $X$.
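
As a sanity check, here is a minimal numpy sketch (synthetic data, all names illustrative) that solves the normal equations and verifies that the residual is orthogonal to the columns of $X$:

```python
import numpy as np

# Synthetic data: prepend an intercept column, draw y = X beta + noise.
rng = np.random.default_rng(0)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta_true = np.array([2.0, 1.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=N)

# beta_hat = (X^T X)^{-1} X^T y; np.linalg.solve avoids an explicit inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

residual = y - X @ beta_hat
print(beta_hat)            # close to beta_true
print(X.T @ residual)      # ~ 0: residual orthogonal to the column space of X
```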


Assume the deviations of $Y$ around its expectation are additive and Gaussian; then:

$$
Y = E[Y|X_1, \ldots, X_p] + \varepsilon = \beta_0 + \sum_{j=1}^p X_j\beta_j + \varepsilon
$$

where $\varepsilon \sim N(0, \sigma^2)$.
Then:

1. $E[\hat{\beta}] = E[(X^TX)^{-1}X^Ty] = (X^TX)^{-1}X^TE[y]$
   $\Longrightarrow E[\hat{\beta}] = (X^TX)^{-1}X^TX\beta = \beta$
2. $\hat{\beta} - E[\hat{\beta}] = (X^TX)^{-1}X^Ty - (X^TX)^{-1}X^TX\beta$
   $\Longrightarrow \hat{\beta} - E[\hat{\beta}] = \big((X^TX)^{-1}X^T\big)(y - X\beta) = (X^TX)^{-1}X^T\varepsilon$
3. $\mathrm{Var}(\hat{\beta}) = E[(\hat{\beta} - E[\hat{\beta}])(\hat{\beta} - E[\hat{\beta}])^T]$
   $\Longrightarrow \mathrm{Var}(\hat{\beta}) = E[(X^TX)^{-1}X^T\varepsilon\varepsilon^TX(X^TX)^{-1}]$
   $= (X^TX)^{-1}X^TE[\varepsilon\varepsilon^T]X(X^TX)^{-1}$
   $= (X^TX)^{-1}X^T\mathrm{Var}(\varepsilon)X(X^TX)^{-1}$
   $= \sigma^2(X^TX)^{-1}X^TX(X^TX)^{-1}$
   $= \sigma^2(X^TX)^{-1}$

Hence $\hat{\beta} \sim N(\beta, (X^TX)^{-1}\sigma^2)$.
Usually $\sigma^2$ is replaced by the unbiased estimate ($E[\hat{\sigma}^2] = \sigma^2$):

$$
\hat{\sigma}^2 = \frac{1}{N-p-1}\sum_{i=1}^N (y_i - \hat{y}_i)^2
$$

Rearranging slightly: $(N-p-1)\hat{\sigma}^2 = \sum_{i=1}^N (y_i - \hat{y}_i)^2 \sim \sigma^2\chi^2_{N-p-1}$.
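
The following simulation (synthetic data, illustrative only) checks both facts empirically: the sample covariance of $\hat{\beta}$ approaches $\sigma^2(X^TX)^{-1}$, and $\mathrm{RSS}/\sigma^2$ has mean close to $N-p-1$, as a $\chi^2_{N-p-1}$ variable should:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, sigma = 50, 2, 0.5
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # fixed design
beta = np.array([1.0, 2.0, -1.0])

betas, scaled_rss = [], []
for _ in range(5000):
    y = X @ beta + rng.normal(scale=sigma, size=N)
    b = np.linalg.solve(X.T @ X, X.T @ y)
    r = y - X @ b
    betas.append(b)
    scaled_rss.append(r @ r / sigma**2)        # = (N-p-1) * sigma_hat^2 / sigma^2

print(np.cov(np.array(betas), rowvar=False))   # ~ sigma^2 (X^T X)^{-1}
print(sigma**2 * np.linalg.inv(X.T @ X))
print(np.mean(scaled_rss), "vs", N - p - 1)    # chi^2_{N-p-1} has mean N-p-1
```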


With $v_j$ the $j$th diagonal element of $(X^TX)^{-1}$, the z-score is $z_j = \hat{\beta}_j / (\hat{\sigma}\sqrt{v_j})$. Under the null hypothesis that $\beta_j = 0$, $z_j$ is distributed as $t_{N-p-1}$, and hence a large (absolute) value of $z_j$ will lead to rejection of this null hypothesis.

Page 47:
The variance-covariance matrix of the least squares parameter estimates is easily derived from (3.6) and is given by

$$
\mathrm{Var}(\hat{\beta}) = (X^TX)^{-1}\sigma^2.
$$

Typically one estimates the variance $\sigma^2$ by

$$
\hat{\sigma}^2 = \frac{1}{N-p-1}\sum_{i=1}^N (y_i - \hat{y}_i)^2
$$
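
Putting these pieces together, here is a hedged sketch of the z-score computation (synthetic data; the two-sided t-test via scipy is one standard way to get p-values):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
N, p = 80, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ np.array([0.5, 1.5, 0.0, -2.0]) + rng.normal(size=N)  # beta_2 is truly 0

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
df = N - p - 1
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / df           # unbiased estimate
se = np.sqrt(np.diag(np.linalg.inv(X.T @ X)) * sigma2_hat)  # standard errors
z = beta_hat / se                                           # z-scores (3.12)
p_values = 2 * stats.t.sf(np.abs(z), df)                    # test of beta_j = 0
for j, (zj, pv) in enumerate(zip(z, p_values)):
    print(f"beta_{j}: z = {zj:+.2f}, p = {pv:.3f}")
```

The coefficient whose true value is zero typically shows a small $|z_j|$ and a large p-value, so it would not be rejected.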


Extended to the multiple-output case, the linear model is formalized as:

$$
Y_k = \beta_{0k} + \sum_{j=1}^p X_j\beta_{jk} + \varepsilon_k = f_k(X) + \varepsilon_k
$$

Further, in matrix notation:

$$
Y = XB + E
$$

Then the loss function for the multiple-output case is:

$$
\mathrm{RSS}(B) = \sum_{k=1}^K\sum_{i=1}^N (y_{ik} - f_k(x_i))^2 = \mathrm{tr}[(Y - XB)^T(Y - XB)].
$$

The estimated parameters are:

$$
\hat{B} = (X^TX)^{-1}X^TY
$$
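
A minimal sketch of the multiple-output estimate (synthetic data), illustrating that $\hat{B}$ decouples into $K$ separate univariate fits:

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, K = 60, 4, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
B_true = rng.normal(size=(p + 1, K))
Y = X @ B_true + 0.1 * rng.normal(size=(N, K))

B_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # one solve handles all K outputs

# Column-by-column check: the coefficients for output k ignore the other outputs.
for k in range(K):
    b_k = np.linalg.solve(X.T @ X, X.T @ Y[:, k])
    assert np.allclose(b_k, B_hat[:, k])
```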

Least squares estimates of the parameter $\beta$ have the smallest variance among all linear unbiased estimates (the Gauss-Markov theorem).
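
This can be checked empirically: any matrix $A$ with $AX = I$ yields a linear unbiased estimator $Ay$, and its variance is never below that of least squares. A small sketch (the weight matrix $W$ is an arbitrary illustrative choice, not anything prescribed by the book):

```python
import numpy as np

rng = np.random.default_rng(5)
N, p = 40, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
W = np.diag(rng.uniform(0.5, 2.0, size=N))     # arbitrary positive weights

A_ols = np.linalg.solve(X.T @ X, X.T)          # OLS: (X^T X)^{-1} X^T
A_w = np.linalg.solve(X.T @ W @ X, X.T @ W)    # alternative with A X = I (unbiased)

# For y = X beta + sigma * noise, Var(A y) = sigma^2 A A^T; compare diagonals.
var_ols = np.diag(A_ols @ A_ols.T)
var_w = np.diag(A_w @ A_w.T)
print(var_ols)
print(var_w)    # elementwise >= var_ols, as Gauss-Markov guarantees
```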


- Best-subset: a brute-force approach. Its limitations are computational complexity and the question of how to choose the subset size; the ultimate goal is to minimize expected error, but in practice cross-validation or AIC is used. Higher variance.
- Forward stepwise: a greedy strategy with lower computational complexity and lower variance; widely used (see the sketch after this list).
- Forward stagewise.

These subset-selection methods amount to a 0/1 coding of the features and typically exhibit relatively high variance; shrinkage methods are smoother and have comparatively lower variance.
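
A minimal sketch of the greedy idea behind forward stepwise selection (the RSS-based scoring, helper names, and fixed subset size are illustrative simplifications, not ESL's exact algorithm):

```python
import numpy as np

def fit_rss(X, y, cols):
    """RSS of a least squares fit on the selected columns (intercept included)."""
    Xs = np.column_stack([np.ones(len(y)), X[:, cols]])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    r = y - Xs @ beta
    return r @ r

def forward_stepwise(X, y, k):
    """Greedily add, k times, the predictor that most reduces the RSS."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        best = min(remaining, key=lambda j: fit_rss(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 2] - 2 * X[:, 7] + rng.normal(size=100)
print(forward_stepwise(X, y, 2))  # typically recovers [2, 7]
```

In practice the subset size $k$ would itself be chosen by cross-validation or AIC, as noted above.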


**Ridge regression does a proportional shrinkage. Lasso translates each coefficient by a constant factor $\lambda$, truncating at zero. This is called "soft thresholding".
Best-subset selection drops all variables with coefficients smaller than the $M$th largest; this is a form of "hard-thresholding."**
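
In the orthonormal-input case these three rules are simple coordinate-wise maps; a small sketch (function names are mine, not from the book):

```python
import numpy as np

def soft_threshold(beta, lam):
    """Lasso (orthonormal case): shift toward zero by lam, truncate at zero."""
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

def hard_threshold(beta, M):
    """Best-subset (orthonormal case): keep only the M largest coefficients."""
    keep = np.argsort(np.abs(beta))[-M:]
    out = np.zeros_like(beta)
    out[keep] = beta[keep]
    return out

def ridge_shrink(beta, lam):
    """Ridge (orthonormal case): proportional shrinkage by 1 / (1 + lam)."""
    return beta / (1.0 + lam)

beta = np.array([3.0, -1.5, 0.4, -0.2, 2.0])
print(soft_threshold(beta, 0.5))  # small coefficients zeroed, rest shifted
print(hard_threshold(beta, 2))    # only the 2 largest survive
print(ridge_shrink(beta, 1.0))    # every coefficient scaled down
```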


references:
[Chapter 3: Linear Methods for Regression](http://www.csc.kth.se/utbildning/kth/kurser/DD3364/Lectures/Lecture2.pdf)
[Regression (statistics): What is Least Angle Regression and when should it be used?](http://www.quora.com/Regression-statistics/What-is-Least-Angle-Regression-and-when-should-it-be-used)
[More Notes for Linear Regression](http://pan.baidu.com/s/1pJFgy9P)
[统计学习那些事 (Stories about Statistical Learning)](https://cloud.github.com/downloads/cosname/editor/stories-about-statistical-learning1.pdf)
[Bias of an estimator](http://en.wikipedia.org/wiki/Bias_of_an_estimator)
Random Vectors and the Variance-Covariance Matrix
[LaTeX:Symbols](http://www.artofproblemsolving.com/Wiki/index.php/LaTeX:Symbols)
[Mean Vector and Covariance Matrix](http://www.itl.nist.gov/div898/handbook/pmc/section5/pmc541.htm)

(3.11) chi-squared distribution:
Covers the definition and a simple example; does not cover the distribution's properties.
[Definition](http://en.wikipedia.org/wiki/Chi-squared_distribution#Definition)
[Applications](http://en.wikipedia.org/wiki/Chi-squared_distribution#Applications)

(3.12) Z-score:
Introduces the definition and usage of the Z-score.
[Standard Score](https://statistics.laerd.com/statistical-guides/standard-score.php)
[Hypothesis Testing](https://statistics.laerd.com/statistical-guides/hypothesis-testing.php)

Multicollinearity:
[统计建模与R软件 (Statistical Modeling and R Software), chapter on multicollinearity](http://book.douban.com/subject/2120492/)


