11.2 다중 선형 회귀 (Multiple Linear Regression)

만약 다음과 같이 x 변수가 하나 더 추가되어 다중 회귀가 되어도 위의 식들은 그대로 성립한다. 이렇듯 행렬은 \(\Sigma\) (summation)와 index (i, j, k 등)를 이용한 표기법에 비하여 식 표현을 매우 간결하고 일관되게 해서 좋다.

y x1 x2
60 1 10
74 3 14
73 5 12
95 7 16
y = c(60, 74, 73, 95)
x1 = c(1, 3, 5, 7)
x2 = c(10, 14, 12, 16)
d2 = data.frame(y, x1, x2)
r2 = lm(y ~ x1 + x2, d2)
summary(r2)

Call:
lm(formula = y ~ x1 + x2, data = d2)

Residuals:
 1  2  3  4 
 2 -2 -2  2 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   23.667     15.173   1.560    0.363
x1             2.667      1.491   1.789    0.325
x2             3.167      1.491   2.124    0.280

Residual standard error: 4 on 1 degrees of freedom
Multiple R-squared:  0.9746,    Adjusted R-squared:  0.9237 
F-statistic: 19.16 on 2 and 1 DF,  p-value: 0.1595
predict(r2, d2, interval="confidence")
  fit      lwr      upr
1  58 13.98442 102.0156
2  76 31.98442 120.0156
3  75 30.98442 119.0156
4  93 48.98442 137.0156
predict(r2, d2, interval="prediction")
  fit       lwr      upr
1  58 -9.234916 125.2349
2  76  8.765084 143.2349
3  75  7.765084 142.2349
4  93 25.765084 160.2349
intercept = TRUE
X = cbind(intercept, x1, x2) ; X
     intercept x1 x2
[1,]         1  1 10
[2,]         1  3 14
[3,]         1  5 12
[4,]         1  7 16
XpX = t(X) %*% X                       # t(X) %*% X = crossprod(X)
iXpX = solve(XpX)                      # inverse of XpX
b = iXpX %*% t(X) %*% y                # beta = solve(XpX, crossprod(X, y))
# b = solve(crossprod(X), crossprod(X, y))  # fastest way
t(b)                                   # Betas
     intercept       x1       x2
[1,]  23.66667 2.666667 3.166667
yhat = X %*% b ; t(yhat)               # fitted y value
     [,1] [,2] [,3] [,4]
[1,]   58   76   75   93
e = y - yhat ; t(e)                    # residual
     [,1] [,2] [,3] [,4]
[1,]    2   -2   -2    2
SSE = sum(e^2)                         # Sum of Square Error
DFr = length(y) - qr(XpX)$rank ; DFr   # Degree of Freedom for residual
[1] 1
MSE = as.numeric(SSE/DFr) ; MSE        # estimate of error variance
[1] 16
Vcov = iXpX * MSE ; Vcov               # variance-covariance matrix of beta hat (b)
          intercept        x1         x2
intercept 230.22222 14.222222 -21.777778
x1         14.22222  2.222222  -1.777778
x2        -21.77778 -1.777778   2.222222
SE = sqrt(diag(Vcov)) ; SE             # SEs of beta hats
intercept        x1        x2 
15.173076  1.490712  1.490712 
t.val = b/SE ; t(t.val)                # t values of beta hats
     intercept       x1       x2
[1,]   1.55978 1.788854 2.124265
p.val = 2*(1 - pt(abs(t.val), DFr)); t(p.val) # p values of beta hats
     intercept        x1        x2
[1,] 0.3629397 0.3245104 0.2800974
Vyhat = X %*% Vcov %*% t(X)
alpha = 0.05
B = qt(1 - alpha/2, DFr)*sqrt(diag(Vyhat))        # margin of error for mean response
data.frame(yhat, lwr=yhat - B, upr=yhat + B)      # confidence interval
  yhat      lwr      upr
1   58 13.98442 102.0156
2   76 31.98442 120.0156
3   75 30.98442 119.0156
4   93 48.98442 137.0156
B2 = qt(1 - alpha/2, DFr)*sqrt(MSE + diag(Vyhat)) # error margin for individual prediction
data.frame(yhat, lwr=yhat - B2, upr=yhat + B2)    # prediction interval
  yhat       lwr      upr
1   58 -9.234916 125.2349
2   76  8.765084 143.2349
3   75  7.765084 142.2349
4   93 25.765084 160.2349

[예제] 혈중 빌리루빈의 농도는 나이에 따라 정상치가 달라져야 할 것으로 보인다. 이를 통계적으로 입증하도록 노력해 보시오.