11.2 다중 선형 회귀 (Multiple Linear Regression)
만약 다음과 같이 x 변수가 하나 더 추가되어 다중 회귀가 되어도 위의 식들은 그대로 성립한다. 이렇듯 행렬은 \(\Sigma\) (summation)와 index (i, j, k 등)를 이용한 표기법에 비하여 식 표현을 매우 간결하고 일관되게 해서 좋다.
y | x1 | x2 |
---|---|---|
60 | 1 | 10 |
74 | 3 | 14 |
73 | 5 | 12 |
95 | 7 | 16 |
= c(60, 74, 73, 95)
y = c(1, 3, 5, 7)
x1 = c(10, 14, 12, 16)
x2 = data.frame(y, x1, x2)
d2 = lm(y ~ x1 + x2, d2)
r2 summary(r2)
Call:
lm(formula = y ~ x1 + x2, data = d2)
Residuals:
1 2 3 4
2 -2 -2 2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.667 15.173 1.560 0.363
x1 2.667 1.491 1.789 0.325
x2 3.167 1.491 2.124 0.280
Residual standard error: 4 on 1 degrees of freedom
Multiple R-squared: 0.9746, Adjusted R-squared: 0.9237
F-statistic: 19.16 on 2 and 1 DF, p-value: 0.1595
predict(r2, d2, interval="confidence")
fit lwr upr
1 58 13.98442 102.0156
2 76 31.98442 120.0156
3 75 30.98442 119.0156
4 93 48.98442 137.0156
predict(r2, d2, interval="prediction")
fit lwr upr
1 58 -9.234916 125.2349
2 76 8.765084 143.2349
3 75 7.765084 142.2349
4 93 25.765084 160.2349
= TRUE
intercept = cbind(intercept, x1, x2) ; X X
intercept x1 x2
[1,] 1 1 10
[2,] 1 3 14
[3,] 1 5 12
[4,] 1 7 16
= t(X) %*% X # t(X) %*% X = crossprod(X)
XpX = solve(XpX) # inverse of XpX
iXpX = iXpX %*% t(X) %*% y # beta = solve(XpX, crossprod(X, y))
b # b = solve(crossprod(X), crossprod(X, y)) # fastest way
t(b) # Betas
intercept x1 x2
[1,] 23.66667 2.666667 3.166667
= X %*% b ; t(yhat) # fitted y value yhat
[,1] [,2] [,3] [,4]
[1,] 58 76 75 93
= y - yhat ; t(e) # residual e
[,1] [,2] [,3] [,4]
[1,] 2 -2 -2 2
= sum(e^2) # Sum of Square Error
SSE = length(y) - qr(XpX)$rank ; DFr # Degree of Freedom for residual DFr
[1] 1
= as.numeric(SSE/DFr) ; MSE # estimate of error variance MSE
[1] 16
= iXpX * MSE ; Vcov # variance-covariance matrix of beta hat (b) Vcov
intercept x1 x2
intercept 230.22222 14.222222 -21.777778
x1 14.22222 2.222222 -1.777778
x2 -21.77778 -1.777778 2.222222
= sqrt(diag(Vcov)) ; SE # SEs of beta hats SE
intercept x1 x2
15.173076 1.490712 1.490712
= b/SE ; t(t.val) # t values of beta hats t.val
intercept x1 x2
[1,] 1.55978 1.788854 2.124265
= 2*(1 - pt(abs(t.val), DFr)); t(p.val) # p values of beta hats p.val
intercept x1 x2
[1,] 0.3629397 0.3245104 0.2800974
= X %*% Vcov %*% t(X)
Vyhat = 0.05
alpha = qt(1 - alpha/2, DFr)*sqrt(diag(Vyhat)) # margin of error for mean response
B data.frame(yhat, lwr=yhat - B, upr=yhat + B) # confidence interval
yhat lwr upr
1 58 13.98442 102.0156
2 76 31.98442 120.0156
3 75 30.98442 119.0156
4 93 48.98442 137.0156
= qt(1 - alpha/2, DFr)*sqrt(MSE + diag(Vyhat)) # error margin for individual prediction
B2 data.frame(yhat, lwr=yhat - B2, upr=yhat + B2) # prediction interval
yhat lwr upr
1 58 -9.234916 125.2349
2 76 8.765084 143.2349
3 75 7.765084 142.2349
4 93 25.765084 160.2349
[예제] 혈중 빌리루빈의 농도는 나이에 따라 정상치가 달라져야 할 것으로 보인다. 이를 통계적으로 입증하도록 노력해 보시오.