13.3 다항 로지스틱 회귀 (Multinomial Logistic Regression)
어떤 자동차 특성에 대해 성별, 연령별 어느 정도 중요하다고 생각하는 지에 대한 조사 결과이다.
Cars
sex age response frequency
1 women 18-23 no/little 26
2 women 18-23 important 12
3 women 18-23 very important 7
4 women 24-40 no/little 9
5 women 24-40 important 21
6 women 24-40 very important 15
7 women > 40 no/little 5
8 women > 40 important 14
9 women > 40 very important 41
10 men 18-23 no/little 40
11 men 18-23 important 17
12 men 18-23 very important 8
13 men 24-40 no/little 17
14 men 24-40 important 15
15 men 24-40 very important 12
16 men > 40 no/little 8
17 men > 40 important 15
18 men > 40 very important 18
반응이 ‘중요하지 않다’, ‘중요하다’, ’매우 중요하다’와 같이 3가지로 분류되어 있으므로 다항분포를 이용해야 한다.
먼저 자료를 분석하기 좋게 다음과 같이 변형한다.
= reshape(Cars, direction="wide", idvar=c("sex", "age"), timevar="response")
d7 colnames(d7) = c("sex", "age", "y1", "y2", "y3") ; d7
sex age y1 y2 y3
1 women 18-23 26 12 7
4 women 24-40 9 21 15
7 women > 40 5 14 41
10 men 18-23 40 17 8
13 men 24-40 17 15 12
16 men > 40 8 15 18
"sexN"] = as.numeric(d7$sex == "men")
d7[, "ageN"] = rep(0:2, 2)
d7[, d7
sex age y1 y2 y3 sexN ageN
1 women 18-23 26 12 7 0 0
4 women 24-40 9 21 15 0 1
7 women > 40 5 14 41 0 2
10 men 18-23 40 17 8 1 0
13 men 24-40 17 15 12 1 1
16 men > 40 8 15 18 1 2
Fitting (목적함수)에 필요한 response matrix (Y), design matrix (X), Y 수준별 합계 (ni), Y수준 수 (nyLevel)를 준비한다.
= d7[, c("y1", "y2", "y3")] ; Y Y
y1 y2 y3
1 26 12 7
4 9 21 15
7 5 14 41
10 40 17 8
13 17 15 12
16 8 15 18
= model.matrix(~ sexN + as.factor(ageN), d7) ; X X
(Intercept) sexN as.factor(ageN)1 as.factor(ageN)2
1 1 0 1 0
4 1 0 0 1
7 1 0 -1 -1
10 1 1 1 0
13 1 1 0 1
16 1 1 -1 -1
attr(,"assign")
[1] 0 1 2 2
attr(,"contrasts")
attr(,"contrasts")$`as.factor(ageN)`
[1] "contr.sum"
= rowSums(Y) ; ni ni
1 4 7 10 13 16
45 45 60 65 44 41
= ncol(Y) nyLevel
다항 분포의 경우 목적함수(여기서는 -log likelihood)를 다음과 같이 만들 수 있다.
= function(Beta) # -log likelihood for multinomial distribution
LLm
{= matrix(Beta, ncol=nyLevel - 1)
b = X %*% b
Xb = sum(Y[,-1]*Xb) - sum(ni*log(1 + rowSums(exp(Xb))))
LogLik return(-LogLik)
}
다음은 적합 결과이다.
= nlm(LLm, c(-1, -1, 1, 1, -1, -1, 1, 1), hessian=TRUE)
r7 r7
$minimum
[1] 290
$estimate
[1] 0.3145 -0.3881 -0.9053 0.2229 0.4259 -0.8130 -1.4650 0.0132
$gradient
[1] -2.22e-05 7.56e-06 3.62e-05 -2.74e-05 -1.46e-05 6.89e-05 -7.28e-05 -3.43e-05
$hessian
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 63.337 31.58 0.704 0.841 -31.72 -12.60 12.71 5.86
[2,] 31.580 31.58 3.102 1.697 -12.60 -12.60 5.01 2.31
[3,] 0.704 3.10 41.897 20.596 12.70 5.01 -20.81 -16.76
[4,] 0.841 1.70 20.596 42.034 5.85 2.31 -16.76 -27.67
[5,] -31.720 -12.60 12.705 5.854 55.35 24.28 -11.33 -5.63
[6,] -12.600 -12.60 5.006 2.308 24.28 24.28 -4.28 -2.19
[7,] 12.706 5.01 -20.814 -16.760 -11.33 -4.28 36.88 24.11
[8,] 5.855 2.31 -16.760 -27.665 -5.63 -2.19 24.11 42.58
$code
[1] 1
$iterations
[1] 16
= solve(r7$hessian) ; COV2 COV2
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] 0.05277 -0.049665 -0.007145 -2.97e-03 0.032246 -3.07e-02 -0.009749 -0.00110
[2,] -0.04966 0.090303 -0.002955 -3.70e-04 -0.029786 5.15e-02 0.000522 -0.00105
[3,] -0.00715 -0.002955 0.042159 -1.35e-02 -0.009151 -2.98e-04 0.020028 -0.00358
[4,] -0.00297 -0.000370 -0.013477 4.76e-02 -0.001656 5.52e-05 -0.003637 0.02789
[5,] 0.03225 -0.029786 -0.009151 -1.66e-03 0.054064 -5.11e-02 -0.000546 -0.00267
[6,] -0.03069 0.051513 -0.000298 5.52e-05 -0.051080 1.03e-01 -0.000385 0.00011
[7,] -0.00975 0.000522 0.020028 -3.64e-03 -0.000546 -3.85e-04 0.056254 -0.02511
[8,] -0.00110 -0.001054 -0.003584 2.79e-02 -0.002666 1.10e-04 -0.025105 0.05427
= sqrt(diag(COV2)) ; SE2 SE2
[1] 0.230 0.301 0.205 0.218 0.233 0.321 0.237 0.233
= exp(r7$estimate)
expBeta = exp(r7$estimate - 1.96*SE2)
expLL = exp(r7$estimate + 1.96*SE2)
expUL data.frame(PE=r7$estimate, SE=SE2, expBeta, expLL, expUL)
PE SE expBeta expLL expUL
1 0.3145 0.230 1.370 0.873 2.148
2 -0.3881 0.301 0.678 0.376 1.222
3 -0.9053 0.205 0.404 0.270 0.605
4 0.2229 0.218 1.250 0.815 1.917
5 0.4259 0.233 1.531 0.971 2.415
6 -0.8130 0.321 0.444 0.236 0.832
7 -1.4650 0.237 0.231 0.145 0.368
8 0.0132 0.233 1.013 0.642 1.600
그런데, age를 범주형으로 다루지 않고, 연속형(수량형)으로 다룬다면 다음과 같이 된다.
= model.matrix(~ sexN + ageN, d7) ; X X
(Intercept) sexN ageN
1 1 0 0
4 1 0 1
7 1 0 2
10 1 1 0
13 1 1 1
16 1 1 2
attr(,"assign")
[1] 0 1 2
= nlm(LLm, c(-1, -1, 1, -1, -1, 1), hessian=TRUE)
r8 r8
$minimum
[1] 291.0502
$estimate
[1] -0.5019268 -0.3889759 0.8303839 -1.0923183 -0.8130358 1.5214547
$gradient
[1] 3.507239e-05 -3.217338e-05 5.542233e-05 -5.542179e-05 -3.069545e-06 8.327818e-05
$hessian
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 64.12992 31.98101 63.32066 -32.08023 -12.74038 -45.70887
[2,] 31.98101 31.98101 28.62844 -12.74038 -12.74038 -18.11368
[3,] 63.32066 28.62844 106.22907 -45.71002 -18.11384 -81.08478
[4,] -32.08023 -12.74038 -45.71002 55.50818 24.32150 67.74634
[5,] -12.74038 -12.74038 -18.11384 24.32150 24.32150 28.96011
[6,] -45.70887 -18.11368 -81.08478 67.74634 28.96011 116.31146
$code
[1] 1
$iterations
[1] 22
= solve(r8$hessian) ; COV4 COV4
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0.07273270 -5.224193e-02 -0.0289805949 0.03472141 -0.0309597237 -1.227136e-02
[2,] -0.05224193 8.947685e-02 0.0030164757 -0.02974715 0.0514799434 1.565474e-05
[3,] -0.02898059 3.016476e-03 0.0378800475 -0.01315869 0.0002861731 2.308138e-02
[4,] 0.03472141 -2.974715e-02 -0.0131586859 0.09517886 -0.0514140235 -4.279704e-02
[5,] -0.03095972 5.147994e-02 0.0002861731 -0.05141402 0.1031061890 3.241872e-04
[6,] -0.01227136 1.565474e-05 0.0230813829 -0.04279704 0.0003241872 4.471508e-02
= sqrt(diag(COV4)) ; SE4 SE4
[1] 0.2696900 0.2991268 0.1946280 0.3085107 0.3211015 0.2114594
= exp(r8$estimate)
expBeta = exp(r8$estimate - 1.96*SE4)
expLL = exp(r8$estimate + 1.96*SE4)
expUL data.frame(PE=r8$estimate, SE=SE4, expBeta, expLL, expUL)
PE SE expBeta expLL expUL
1 -0.5019268 0.2696900 0.6053631 0.3568216 1.0270243
2 -0.3889759 0.2991268 0.6777506 0.3770926 1.2181249
3 0.8303839 0.1946280 2.2941994 1.5666085 3.3597103
4 -1.0923183 0.3085107 0.3354379 0.1832325 0.6140755
5 -0.8130358 0.3211015 0.4435096 0.2363610 0.8322047
6 1.5214547 0.2114594 4.5788815 3.0252518 6.9303837
모형의 자유도가 2 줄어들면서 (잔차의 자유도는 2 늘어난다.) 설명력은 유사해 보인다. 더 단순한 모형이 선호되므로 후자의 모형을 선택할 수 있다.
다음은 R nnet package를 이용한 풀이이다.
require(nnet)
= reshape(d7, timevar="Y", times=c("0-no", "1-yes", "2-very"), v.names="Count",
d8 varying=c("y1", "y2", "y3"), direction="long")
d8
sex age sexN ageN Y Count id
1.0-no women 18-23 0 0 0-no 26 1
2.0-no women 24-40 0 1 0-no 9 2
3.0-no women > 40 0 2 0-no 5 3
4.0-no men 18-23 1 0 0-no 40 4
5.0-no men 24-40 1 1 0-no 17 5
6.0-no men > 40 1 2 0-no 8 6
1.1-yes women 18-23 0 0 1-yes 12 1
2.1-yes women 24-40 0 1 1-yes 21 2
3.1-yes women > 40 0 2 1-yes 14 3
4.1-yes men 18-23 1 0 1-yes 17 4
5.1-yes men 24-40 1 1 1-yes 15 5
6.1-yes men > 40 1 2 1-yes 15 6
1.2-very women 18-23 0 0 2-very 7 1
2.2-very women 24-40 0 1 2-very 15 2
3.2-very women > 40 0 2 2-very 41 3
4.2-very men 18-23 1 0 2-very 8 4
5.2-very men 24-40 1 1 2-very 12 5
6.2-very men > 40 1 2 2-very 18 6
= multinom(Y ~ sexN + as.factor(ageN), weights=Count, data=d8) r9
# weights: 15 (8 variable)
initial value 329.583687
iter 10 value 290.936860
final value 290.351098
converged
summary(r9)
Call:
multinom(formula = Y ~ sexN + as.factor(ageN), data = d8, weights = Count)
Coefficients:
(Intercept) sexN as.factor(ageN)1 as.factor(ageN)2
1-yes 0.3145071 -0.3881172 -0.9053048 0.22296071
2-very 0.4258621 -0.8129978 -1.4649126 0.01315929
Std. Errors:
(Intercept) sexN as.factor(ageN)1 as.factor(ageN)2
1-yes 0.2297206 0.3005107 0.2053270 0.2181834
2-very 0.2325160 0.3210363 0.2371824 0.2329585
Residual Deviance: 580.7022
AIC: 596.7022
= multinom(Y ~ sexN + ageN, weights=Count, data=d8) r10
# weights: 12 (6 variable)
initial value 329.583687
iter 10 value 291.051804
final value 291.050160
converged
summary(r10)
Call:
multinom(formula = Y ~ sexN + ageN, data = d8, weights = Count)
Coefficients:
(Intercept) sexN ageN
1-yes -0.5019284 -0.3889719 0.8303865
2-very -1.0923016 -0.8130325 1.5214495
Std. Errors:
(Intercept) sexN ageN
1-yes 0.2696958 0.2991332 0.1946359
2-very 0.3085171 0.3211054 0.2114579
Residual Deviance: 582.1003
AIC: 594.1003
anova(r10, r9)
Likelihood ratio tests of Multinomial Models
Response: Y
Model Resid. df Resid. Dev Test Df LR stat. Pr(Chi)
1 sexN + ageN 30 582.1003
2 sexN + as.factor(ageN) 28 580.7022 1 vs 2 2 1.398123 0.4970516
두 모형의 비교 결과 통계적으로 유의한 차이가 없으므로 더 단순한 모형을 선택할 수 있다.