9.8 상관관계 (Correlation)
Correlation은 기본적으로 두 변수 모두 확률변수로 간주하는 것이다. 여기서는 두 변수가 연속형 변수인 경우이다.
\[\begin{align*} Y_1 & = \alpha_0 + \alpha_1 x_1 + \epsilon_1 \\ Y_2 & = \beta_0 + \beta_1 x_2 + \epsilon_2 \\ E(Y_1)E(Y_2) & = (\alpha_0 + \alpha_1 x_1)(\beta_0 + \beta_1 x_2) \\ & = \alpha_0 \beta_0 + \alpha_0 \beta_1 x_2 + \alpha_1 \beta_0 x_1 + \alpha_1 \beta_1 x_1 x_2 \\ Y_1 Y_2 & = \alpha_0 \beta_0 + \alpha_0 \beta_1 x_2 + \alpha_0 \epsilon_2 \\ & \quad + \alpha_1 \beta_0 x_1 + \alpha_1 \beta_1 x_1 x_2 + \alpha_1 x_1 \epsilon_2 \\ & \quad + \beta_0 \epsilon_1 + \beta_1 x_2 \epsilon_1 + \epsilon_1 \epsilon_2 \\ E(Y_1 Y_2) & = \alpha_0 \beta_0 + \alpha_0 \beta_1 x_2 + \alpha_1 \beta_0 x_1 + \alpha_1 \beta_1 x_1 x_2 + E(\epsilon_1 \epsilon_2) \\ COV(Y_1 , Y_2) & = E(Y_1 Y_2) - E(Y_1) E(Y_2) \\ & = E(\epsilon_1 \epsilon_2) \\ COV(\epsilon_1 , \epsilon_2) & = E(\epsilon_1 \epsilon_2) - E(\epsilon_1)E(\epsilon_2) \\ & = E(\epsilon_1 \epsilon_2) \\ \therefore COV(Y_1 , Y_2) & = COV(\epsilon_1 , \epsilon_2) \\ SD(Y_1) & = SD(\epsilon_1) \\ SD(Y_2) & = SD(\epsilon_2) \\ \therefore COR(Y_1 , Y_2) & = COR(\epsilon_1 , \epsilon_2) \\ \end{align*}\]
위 식을 보면 아래 기본식이 이해된다.
\[\begin{align*} COV(aX + b, cY + d) & = acCOV(X, Y) \\ COR(aX + b, cY + d) & = sign(ac)COR(X, Y) \end{align*}\]
다음은 correlation과 관련한 simulation이다.
= function(a0, a1, b0, b1, x1min, x1max, x2min, x2max, nSim, sig1, sig2, rho)
Sim
{= seq(x1min, x1max, length.out=nSim)
x1 = seq(x2min, x2max, length.out=nSim)
x2 = cbind(x1, x2)
xs
= sig1^2
ve1 = sig2^2
ve2 = rho*sig1*sig2
coves = matrix(c(ve1, coves, coves, ve2), ncol=2)
mCov
require(MASS)
= mvrnorm(n=nSim, mu=c(0, 0), Sigma=mCov)
es = cbind(a0 + a1*x1 + es[,1], b0 + b1*x2 + es[,2])
ys
= (x1max - x1min)^2/12
vx1 = (x2max - x2min)^2/12
vx2 = sqrt(vx1)*sqrt(vx2)
covxs
= a1^2*vx1 + ve1
vy1 = b1^2*vx2 + ve2
vy2 = a1*b1*covxs + coves
covys = covys/sqrt(vy1)/sqrt(vy2)
corys
list(covE = var(es), covEt = matrix(c(ve1, coves, coves, ve2), ncol=2),
corE = cor(es), corEt = matrix(c(1, rho, rho, 1), ncol=2),
covX = var(xs), covXt = matrix(c(vx1, covxs, covxs, vx2), ncol=2),
corX = cor(xs), corXt = matrix(c(1, 1, 1, 1), ncol=2),
covY = var(ys), covYt = matrix(c(vy1, covys, covys, vy2), ncol=2),
corY = cor(ys), corYt = matrix(c(1, corys, corys, 1), ncol=2))
}
위에서 만든 함수로 simulation한 결과이다.
= Sim(0, 1, 2, -0.5, 1, 10, 100, 200, 2000, 1, 2, 0.7)
r1 print(cbind(r1$corE, NA, r1$corEt), na.print="")
[,1] [,2] [,3] [,4] [,5]
[1,] 1.0000000 0.7140977 1.0 0.7
[2,] 0.7140977 1.0000000 0.7 1.0
print(cbind(r1$covX, NA, r1$covXt), na.print="")
x1 x2
x1 6.760133 75.11259 6.75 75.0000
x2 75.112594 834.58438 75.00 833.3333
print(cbind(r1$corX, NA, r1$corXt), na.print="")
x1 x2
x1 1 1 1 1
x2 1 1 1 1
print(cbind(r1$covY, NA, r1$covYt), na.print="")
[,1] [,2] [,3] [,4] [,5]
[1,] 7.709293 -35.94947 7.75 -36.1000
[2,] -35.949467 212.15903 -36.10 212.3333
print(cbind(r1$corY, NA, r1$corYt), na.print="")
[,1] [,2] [,3] [,4] [,5]
[1,] 1.0000000 -0.8889031 1.0000000 -0.8899124
[2,] -0.8889031 1.0000000 -0.8899124 1.0000000
앞의 결과가 어떻게 나왔는지 식을 유도해 보면 다음과 같다. 이제는 x가 변하는 값이어서, x를 잠정적으로 확률변수(X)로 취급하여야 한다.
\[\begin{align*} Y_1 & = \alpha_0 + \alpha_1 X_1 + \epsilon_1 \\ Y_2 & = \beta_0 + \beta_1 X_2 + \epsilon_2 \\ COV(Y_1, Y_2) & = COV(\alpha_1 X_1 + \epsilon_1, \beta_1 X_2 + \epsilon_2) \\ COR(Y_1, Y_2) & = COR(\alpha_1 X_1 + \epsilon_1, \beta_1 X_2 + \epsilon_2) \\ V(\alpha_1 X_1 + \epsilon_1) & = \alpha_1^2V(X_1) + V(\epsilon_1) + 2 \alpha_1 COV(X_1 , \epsilon_1) \\ & = \alpha_1^2V(X_1) + V(\epsilon_1) \\ V(\beta_1 X_2 + \epsilon_2) & = \beta_1^2V(X_2) + V(\epsilon_2) + 2 \beta_1 COV(X_2 , \epsilon_2) \\ & = \beta_1^2V(X_2) + V(\epsilon_2) \\ E(\alpha_1 X_1 + \epsilon_1) & = \alpha_1 E(X_1) \approx \alpha_1 \overline{x_1} \\ E(\beta_1 X_2 + \epsilon_2) & = \beta_1 E(X_2) \approx \beta_1 \overline{x_2} \\ E[ (\alpha_1 X_1 + \epsilon_1)(\beta_1 X_2 + \epsilon_2) ] & = E[ \alpha_1 \beta_1 X_1 X_2 + \alpha_1 X_1 \epsilon_2 + \beta_1 X_2 \epsilon_1 + \epsilon_1 \epsilon_2 ] \\ & = E(\alpha_1 \beta_1 X_1 X_2) + 0 + 0 + E(\epsilon_1 \epsilon_2) \\ & = COV(\alpha_1 X_1, \beta_1 X_2) + \alpha_1 E(X_1) \beta_1 E(X_2) + E(\epsilon_1 \epsilon_2) \\ COV(Y_1, Y_2) & = E(Y_1 Y_2) - E(Y_1)E(Y_2) \\ & = COV(\alpha_1 X_1, \beta_1 X_2) + E(\epsilon_1 \epsilon_2) \\ & = \alpha_1 \beta_1 COV(X_1, X_2) + COV(\epsilon_1, \epsilon_2) \\ COR(Y_1, Y_2) & = \frac{\alpha_1 \beta_1 COV(X_1, X_2) + COV(\epsilon_1, \epsilon_2)}{\sqrt{\alpha_1^2 V(X_1) + V(\epsilon_1)} \sqrt{\beta_1^2 V(X_2) + V(\epsilon_2)}} \end{align*}\]
만약 \[\alpha_1 \beta_1 COV(X_1, X_2) \gg COV(\epsilon_1, \epsilon_2)\] AND \[\alpha_1^2 V(X_1) \gg V(\epsilon_1)\] AND \[\beta_1^2 V(X_2) \gg V(\epsilon_2)\] 이라면, \[COR(Y_1,Y_2) \approx \frac{COV(X_1,X_2)}{\sqrt{V(X_1)}\sqrt{V(X_2)}} = COR(X_1,X_2)\]
만약 \[\alpha_1^2 V(X_1) \gg V(\epsilon_1)\] AND \[\beta_1^2 V(X_2) \gg V(\epsilon_2)\] 이라면, \[COR(Y_1,Y_2) \approx COR(X_1, X_2) + \frac{\rho_{\epsilon_1, \epsilon_2}}{\alpha_1 \sigma_{X_1} / \sigma_{\epsilon_1} \cdot \beta_1 \sigma_{X_2} / \sigma_{\epsilon_2}}\]
단순 선형 관계 simulation에서는
\[\begin{align*} X_1 & \sim Uniform(x_{1,min}, x_{1,max}) \\ X_2 & \sim Uniform(x_{2,min}, x_{2,max}) \\ X_2 & = \frac{x_{2,max} - x_{2,min}}{x_{1,max} - x_{1,min}} (X_1 - x_{1,min}) + x_{2,min} \\ E(X_1) & = (x_{1,max} + x_{1,min}) / 2 \\ E(X_2) & = (x_{2,max} + x_{2,min}) / 2 \\ V(X_1) & = (x_{1,max} - x_{1,min})^2 / 12 \\ V(X_2) & = (x_{2,max} - x_{2,min})^2 / 12 \\ COV(X_1, X_2) & = \sqrt{V(X_1)} \sqrt{V(X_2)} = \sigma(X_1) \sigma(X_2) \qquad \because COR(X_1, X_2) = 1\\ COV(\epsilon_1, \epsilon_2) & = \rho \sigma(\epsilon_1) \sigma(\epsilon_2) \\ \end{align*}\]
일반적인 선형 회귀 모형과의 관계는 다음과 같다.
\[\begin{align*} Y & = \alpha_0 + \alpha_1 X + \epsilon \\ V(Y) & = \alpha_1^2 V(X) + V(\epsilon) + 2 \alpha_1 COV(X, \epsilon) \\ & = \alpha_1^2 V(X) + V(\epsilon) \\ COV(X, Y) & = COV(X, \alpha_0 + \alpha_1 X + \epsilon) \\ & = E(\alpha_0 X + \alpha_1 X^2 + X \epsilon) - E(X) E(\alpha_0 + \alpha_1 X + \epsilon) \\ & = \alpha_0 E(X) + \alpha_1 E(X^2) + E(X \epsilon) - (\alpha_0 E(X) + \alpha_1 E(X)^2 + E(X)E(\epsilon)) \\ & = \alpha_1 (V(X) + E(X)^2) + COV(X, \epsilon) + E(X)E(\epsilon) - \alpha_1 E(X)^2 - E(X)E(\epsilon) \\ & = \alpha_1 V(X) \\ COR(X, Y) & = \frac{COV(X, Y)}{\sqrt{V(X)}\sqrt{V(Y)}} = \frac{\alpha_1 V(X)}{\sqrt{V(X)} \sqrt{V(Y)}} = \alpha_1 \frac{\sqrt{V(X)}}{\sqrt{V(Y)}}\\ & = \frac{\alpha_1 \sqrt{V(X)}}{\sqrt{\alpha_1^2 V(X) + V(\epsilon)}} \\ COV(Y, \epsilon) & = E(Y \epsilon) - E(Y)E(\epsilon) \\ & = E(\alpha_0 \epsilon + \alpha_1 X \epsilon + \epsilon^2) \\ & = \alpha_0 E(\epsilon) + \alpha_1 E(X)E( \epsilon) + E(\epsilon^2) \\ & = V(\epsilon) \\ COR(Y, \epsilon) & = \frac{V(\epsilon)}{\sqrt{\alpha_1^2 V(X) + V(\epsilon)} \sqrt{V(\epsilon)}} \\ & = \frac{\sqrt{V(\epsilon)}}{\sqrt{\alpha_1^2 V(X) + V(\epsilon)}} \\ & = \frac{1}{\sqrt{\alpha_1^2 V(X) / V(\epsilon) + 1}} \end{align*}\]
다음은 앞의 simulation이다.
= function(a0, a1, x1min, x1max, nSim, sig1)
Sim2
{= seq(x1min, x1max, length.out=nSim)
x1 = rnorm(n=nSim, mean=0, sd=sig1)
e1 = a0 + a1*x1 + e1
y1
= sig1^2
ve1 = (x1max - x1min)^2/12
vx1 = a1^2*vx1 + ve1
vy1 = summary(lm(y1 ~ x1))
r1 = cbind(e1, x1, y1)
d1 = 0
covex = ve1
covey = a1*vx1
covxy = matrix(c(ve1, covex, covey, covex, vx1, covxy, covey, covxy, vy1), ncol=3)
m1 list(coveXY = var(d1), coveXYt = m1,
coreXY = cor(d1), coreXYt = cov2cor(m1),
rx = r1 , r = sqrt(r1$r.squared))
}
= Sim2(a0 = 1, a1 = 2, x1min = 0, x1max = 1, nSim = 20000, sig1 = 0.5)
r2 print(cbind(r2$coveXY, NA, r2$coveXYt), na.print="")
e1 x1 y1
e1 0.2515003583 -0.0003798788 0.2507406 0.25 0.00000000 0.2500000
x1 -0.0003798788 0.0833458344 0.1663118 0.00 0.08333333 0.1666667
y1 0.2507406006 0.1663117899 0.5833642 0.25 0.16666667 0.5833333
print(cbind(r2$coreXY, NA, r2$coreXYt), na.print="")
e1 x1 y1
e1 1.000000000 -0.002623819 0.6546143 1.0000000 0.0000000 0.6546537
x1 -0.002623819 1.000000000 0.7542429 0.0000000 1.0000000 0.7559289
y1 0.654614285 0.754242862 1.0000000 0.6546537 0.7559289 1.0000000
$rx r2
Call:
lm(formula = y1 ~ x1)
Residuals:
Min 1Q Median 3Q Max
-2.04656 -0.34449 -0.00258 0.34225 1.78274
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.004498 0.007092 141.6 <2e-16 ***
x1 1.995442 0.012284 162.4 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5015 on 19998 degrees of freedom
Multiple R-squared: 0.5689, Adjusted R-squared: 0.5689
F-statistic: 2.639e+04 on 1 and 19998 DF, p-value: < 2.2e-16
$r r2
[1] 0.7542429
마지막으로 다음의 경우를 생각해보자.
\[\begin{align*} Y & = \beta_0 + \beta_1 (X + \epsilon_1) + \epsilon_2 \\ E(Y) & = \beta_0 + \beta_1 E(X) + \beta_1 E(\epsilon_1) + E(\epsilon_2) \\ & = \beta_0 + \beta_1 E(X) \\ V(Y) & = \beta_1^2 V(X) + \beta_1^2 V(\epsilon_1) + V(\epsilon_2) + 2\beta_1^2 COV(X, \epsilon_1) \\ & \quad + 2\beta_1 COV(X, \epsilon_2) + 2\beta_1 COV(\epsilon_1, \epsilon_2) \\ & = \beta_1^2 V(X) + \beta_1^2 V(\epsilon_1) + V(\epsilon_2) \\ COV(X + \epsilon_1, Y) & = COV(X + \epsilon_1, \beta_0 + \beta_1 X + \beta_1 \epsilon_1 + \epsilon_2) \\ & = E[(X + \epsilon_1)Y] - E(X + \epsilon_1)E(Y) \\ & = \beta_0 E(X) + \beta_1 E(X^2) + \beta_1 E(X\epsilon_1) + E(X\epsilon_2) \\ & \quad + \beta_0 E(\epsilon_1) + \beta_1 E(X\epsilon_1) + \beta_1 E(\epsilon_1^2) + E(\epsilon_1 \epsilon_2) \\ & \quad - \beta_0 E(X) - \beta_1 E(X)^2 \\ & = \beta_1 \{ V(X) + E(X)^2 \} + \beta_1 V(\epsilon_1) - \beta_1 E(X)^2 \\ & = \beta_1 V(X) + \beta_1 V(\epsilon_1) \\ COV(Y, \epsilon_1) & = E(Y \epsilon_1) - E(Y)E(\epsilon_1) \\ & = E(\beta_0 \epsilon_1 + \beta_1 X \epsilon_1 + \beta_1 \epsilon_1^2 + \epsilon_1 \epsilon_2) \\ & = \beta_0 E(\epsilon_1) + \beta_1 E(X \epsilon_1) + \beta_1 V(\epsilon_1) + E(\epsilon_1 \epsilon_2) \\ & = \beta_1 V(\epsilon_1) \\ COV(Y, \epsilon_2) & = E(Y \epsilon_2) - E(Y)E(\epsilon_2) \\ & = E(\beta_0 \epsilon_2 + \beta_1 X \epsilon_2 + \beta_1 \epsilon_1 \epsilon_2 + \epsilon_2^2) \\ & = \beta_0 E(\epsilon_2) + \beta_1 E(X \epsilon_2) + \beta_1 E(\epsilon_1 \epsilon_2) + E(\epsilon_2^2) \\ & = V(\epsilon_2) \end{align*}\]
Y와 잔차(residual)간에는 correlation을 가지므로, Y가 아닌 \(\hat{Y}\)와 잔차(residual)간의 plot을 해야 한다.
앞 모형의 simulation은 다음과 같다.
= function(b0, b1, x1min, x1max, nSim, sig1, sig2)
Sim3
{= seq(x1min, x1max, length.out=nSim)
x1 = rnorm(n=nSim, mean=0, sd=sig1)
e1 = x1 + e1
x2 = rnorm(n=nSim, mean=0, sd=sig2)
e2 = b0 + b1*(x1 + e1) + e2
y1
= sig1^2
ve1 = sig2^2
ve2 = (x1max - x1min)^2/12
vx1 = vx1 + ve1
vx2 = b1^2*vx1 + b1^2*ve1 + ve2
vy1
= 0
ce1x1 = 0
ce2x1 = ve1
ce1x2 = 0
ce2x2 = vx1
cx1x2 = b1*vx1
cx1y = b1*vx1 + b1*ve1
cx2y = b1*ve1
ce1y = ve2
ce2y = 0
ce1e2
= summary(lm(y1 ~ x2))
r1 = cbind(y1, x1, x2, e1, e2)
d1
= matrix(c(vy1, cx1y, cx2y, ce1y, ce2y,
m1
cx1y, vx1, cx1x2, ce1x1, ce2x1,
cx2y, cx1x2, vx2, ce1x2, ce2x2,
ce1y, ce1x1, ce1x2, ve1, ce1e2,ncol=5)
ce2y, ce2x1, ce2x2, ce1e2, ve2), list(coveXY = var(d1), coveXYt = m1,
coreXY = cor(d1), coreXYt = cov2cor(m1),
rx = r1 , r = sqrt(r1$r.squared))
}
Sim3(1, 2, 1, 2, 2000, 0.3, 0.5)
$coveXY
y1 x1 x2 e1 e2
y1 0.9806339 0.1715200320 0.362991505 0.191471473 0.2546508899
x1 0.1715200 0.0834584376 0.086046032 0.002587594 -0.0005720318
x2 0.3629915 0.0860460319 0.178574925 0.092528893 0.0058416557
e1 0.1914715 0.0025875943 0.092528893 0.089941298 0.0064136875
e2 0.2546509 -0.0005720318 0.005841656 0.006413688 0.2429675784
$coveXYt
[,1] [,2] [,3] [,4] [,5]
[1,] 0.9433333 0.16666667 0.34666667 0.18 0.25
[2,] 0.1666667 0.08333333 0.08333333 0.00 0.00
[3,] 0.3466667 0.08333333 0.17333333 0.09 0.00
[4,] 0.1800000 0.00000000 0.09000000 0.09 0.00
[5,] 0.2500000 0.00000000 0.00000000 0.00 0.25
$coreXY
y1 x1 x2 e1 e2
y1 1.0000000 0.599551195 0.86742667 0.64471986 0.521696093
x1 0.5995512 1.000000000 0.70483179 0.02986630 -0.004017084
x2 0.8674267 0.704831791 1.00000000 0.73010875 0.028044749
e1 0.6447199 0.029866304 0.73010875 1.00000000 0.043386444
e2 0.5216961 -0.004017084 0.02804475 0.04338644 1.000000000
$coreXYt
[,1] [,2] [,3] [,4] [,5]
[1,] 1.0000000 0.5944383 0.8573111 0.6177584 0.5147987
[2,] 0.5944383 1.0000000 0.6933752 0.0000000 0.0000000
[3,] 0.8573111 0.6933752 1.0000000 0.7205767 0.0000000
[4,] 0.6177584 0.0000000 0.7205767 1.0000000 0.0000000
[5,] 0.5147987 0.0000000 0.0000000 0.0000000 1.0000000
$rx
Call:
lm(formula = y1 ~ x2)
Residuals:
Min 1Q Median 3Q Max
-1.8509 -0.3302 0.0100 0.3409 1.6557
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.95359 0.04048 23.56 <2e-16 ***
x2 2.03271 0.02609 77.93 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4928 on 1998 degrees of freedom
Multiple R-squared: 0.7524, Adjusted R-squared: 0.7523
F-statistic: 6072 on 1 and 1998 DF, p-value: < 2.2e-16
$r
[1] 0.8674267