13.3 다항 로지스틱 회귀 (Multinomial Logistic Regression)

어떤 자동차 특성에 대해 성별, 연령별 어느 정도 중요하다고 생각하는 지에 대한 조사 결과이다.

Cars
     sex   age       response frequency
1  women 18-23      no/little        26
2  women 18-23      important        12
3  women 18-23 very important         7
4  women 24-40      no/little         9
5  women 24-40      important        21
6  women 24-40 very important        15
7  women  > 40      no/little         5
8  women  > 40      important        14
9  women  > 40 very important        41
10   men 18-23      no/little        40
11   men 18-23      important        17
12   men 18-23 very important         8
13   men 24-40      no/little        17
14   men 24-40      important        15
15   men 24-40 very important        12
16   men  > 40      no/little         8
17   men  > 40      important        15
18   men  > 40 very important        18

반응이 ‘중요하지 않다’, ‘중요하다’, ’매우 중요하다’와 같이 3가지로 분류되어 있으므로 다항분포를 이용해야 한다.

먼저 자료를 분석하기 좋게 다음과 같이 변형한다.

d7 = reshape(Cars, direction="wide", idvar=c("sex", "age"), timevar="response")
colnames(d7) = c("sex", "age", "y1", "y2", "y3") ; d7
     sex   age y1 y2 y3
1  women 18-23 26 12  7
4  women 24-40  9 21 15
7  women  > 40  5 14 41
10   men 18-23 40 17  8
13   men 24-40 17 15 12
16   men  > 40  8 15 18
d7[, "sexN"] = as.numeric(d7$sex == "men")
d7[, "ageN"] = rep(0:2, 2)
d7
     sex   age y1 y2 y3 sexN ageN
1  women 18-23 26 12  7    0    0
4  women 24-40  9 21 15    0    1
7  women  > 40  5 14 41    0    2
10   men 18-23 40 17  8    1    0
13   men 24-40 17 15 12    1    1
16   men  > 40  8 15 18    1    2

Fitting (목적함수)에 필요한 response matrix (Y), design matrix (X), Y 수준별 합계 (ni), Y수준 수 (nyLevel)를 준비한다.

Y = d7[, c("y1", "y2", "y3")] ; Y
   y1 y2 y3
1  26 12  7
4   9 21 15
7   5 14 41
10 40 17  8
13 17 15 12
16  8 15 18
X = model.matrix(~ sexN + as.factor(ageN), d7) ; X
   (Intercept) sexN as.factor(ageN)1 as.factor(ageN)2
1            1    0                1                0
4            1    0                0                1
7            1    0               -1               -1
10           1    1                1                0
13           1    1                0                1
16           1    1               -1               -1
attr(,"assign")
[1] 0 1 2 2
attr(,"contrasts")
attr(,"contrasts")$`as.factor(ageN)`
[1] "contr.sum"
ni = rowSums(Y) ; ni
 1  4  7 10 13 16 
45 45 60 65 44 41 
nyLevel = ncol(Y)

다항 분포의 경우 목적함수(여기서는 -log likelihood)를 다음과 같이 만들 수 있다.

LLm = function(Beta) # -log likelihood for multinomial distribution
{
  b = matrix(Beta, ncol=nyLevel - 1)
  Xb = X %*% b
  LogLik = sum(Y[,-1]*Xb) - sum(ni*log(1 + rowSums(exp(Xb))))
  return(-LogLik)
}

다음은 적합 결과이다.

r7 = nlm(LLm, c(-1, -1, 1, 1, -1, -1, 1, 1), hessian=TRUE)
r7
$minimum
[1] 290

$estimate
[1]  0.3145 -0.3881 -0.9053  0.2229  0.4259 -0.8130 -1.4650  0.0132

$gradient
[1] -2.22e-05  7.56e-06  3.62e-05 -2.74e-05 -1.46e-05  6.89e-05 -7.28e-05 -3.43e-05

$hessian
        [,1]   [,2]    [,3]    [,4]   [,5]   [,6]   [,7]   [,8]
[1,]  63.337  31.58   0.704   0.841 -31.72 -12.60  12.71   5.86
[2,]  31.580  31.58   3.102   1.697 -12.60 -12.60   5.01   2.31
[3,]   0.704   3.10  41.897  20.596  12.70   5.01 -20.81 -16.76
[4,]   0.841   1.70  20.596  42.034   5.85   2.31 -16.76 -27.67
[5,] -31.720 -12.60  12.705   5.854  55.35  24.28 -11.33  -5.63
[6,] -12.600 -12.60   5.006   2.308  24.28  24.28  -4.28  -2.19
[7,]  12.706   5.01 -20.814 -16.760 -11.33  -4.28  36.88  24.11
[8,]   5.855   2.31 -16.760 -27.665  -5.63  -2.19  24.11  42.58

$code
[1] 1

$iterations
[1] 16
COV2 = solve(r7$hessian) ; COV2
         [,1]      [,2]      [,3]      [,4]      [,5]      [,6]      [,7]     [,8]
[1,]  0.05277 -0.049665 -0.007145 -2.97e-03  0.032246 -3.07e-02 -0.009749 -0.00110
[2,] -0.04966  0.090303 -0.002955 -3.70e-04 -0.029786  5.15e-02  0.000522 -0.00105
[3,] -0.00715 -0.002955  0.042159 -1.35e-02 -0.009151 -2.98e-04  0.020028 -0.00358
[4,] -0.00297 -0.000370 -0.013477  4.76e-02 -0.001656  5.52e-05 -0.003637  0.02789
[5,]  0.03225 -0.029786 -0.009151 -1.66e-03  0.054064 -5.11e-02 -0.000546 -0.00267
[6,] -0.03069  0.051513 -0.000298  5.52e-05 -0.051080  1.03e-01 -0.000385  0.00011
[7,] -0.00975  0.000522  0.020028 -3.64e-03 -0.000546 -3.85e-04  0.056254 -0.02511
[8,] -0.00110 -0.001054 -0.003584  2.79e-02 -0.002666  1.10e-04 -0.025105  0.05427
SE2 = sqrt(diag(COV2)) ; SE2
[1] 0.230 0.301 0.205 0.218 0.233 0.321 0.237 0.233
expBeta = exp(r7$estimate)
expLL = exp(r7$estimate - 1.96*SE2)
expUL = exp(r7$estimate + 1.96*SE2)
data.frame(PE=r7$estimate, SE=SE2, expBeta, expLL, expUL)
       PE    SE expBeta expLL expUL
1  0.3145 0.230   1.370 0.873 2.148
2 -0.3881 0.301   0.678 0.376 1.222
3 -0.9053 0.205   0.404 0.270 0.605
4  0.2229 0.218   1.250 0.815 1.917
5  0.4259 0.233   1.531 0.971 2.415
6 -0.8130 0.321   0.444 0.236 0.832
7 -1.4650 0.237   0.231 0.145 0.368
8  0.0132 0.233   1.013 0.642 1.600

그런데, age를 범주형으로 다루지 않고, 연속형(수량형)으로 다룬다면 다음과 같이 된다.

X = model.matrix(~ sexN + ageN, d7) ; X
   (Intercept) sexN ageN
1            1    0    0
4            1    0    1
7            1    0    2
10           1    1    0
13           1    1    1
16           1    1    2
attr(,"assign")
[1] 0 1 2
r8 = nlm(LLm, c(-1, -1, 1, -1, -1, 1), hessian=TRUE)
r8
$minimum
[1] 291.0502

$estimate
[1] -0.5019268 -0.3889759  0.8303839 -1.0923183 -0.8130358  1.5214547

$gradient
[1]  3.507239e-05 -3.217338e-05  5.542233e-05 -5.542179e-05 -3.069545e-06  8.327818e-05

$hessian
          [,1]      [,2]      [,3]      [,4]      [,5]      [,6]
[1,]  64.12992  31.98101  63.32066 -32.08023 -12.74038 -45.70887
[2,]  31.98101  31.98101  28.62844 -12.74038 -12.74038 -18.11368
[3,]  63.32066  28.62844 106.22907 -45.71002 -18.11384 -81.08478
[4,] -32.08023 -12.74038 -45.71002  55.50818  24.32150  67.74634
[5,] -12.74038 -12.74038 -18.11384  24.32150  24.32150  28.96011
[6,] -45.70887 -18.11368 -81.08478  67.74634  28.96011 116.31146

$code
[1] 1

$iterations
[1] 22
COV4 = solve(r8$hessian) ; COV4
            [,1]          [,2]          [,3]        [,4]          [,5]          [,6]
[1,]  0.07273270 -5.224193e-02 -0.0289805949  0.03472141 -0.0309597237 -1.227136e-02
[2,] -0.05224193  8.947685e-02  0.0030164757 -0.02974715  0.0514799434  1.565474e-05
[3,] -0.02898059  3.016476e-03  0.0378800475 -0.01315869  0.0002861731  2.308138e-02
[4,]  0.03472141 -2.974715e-02 -0.0131586859  0.09517886 -0.0514140235 -4.279704e-02
[5,] -0.03095972  5.147994e-02  0.0002861731 -0.05141402  0.1031061890  3.241872e-04
[6,] -0.01227136  1.565474e-05  0.0230813829 -0.04279704  0.0003241872  4.471508e-02
SE4 = sqrt(diag(COV4)) ; SE4
[1] 0.2696900 0.2991268 0.1946280 0.3085107 0.3211015 0.2114594
expBeta = exp(r8$estimate)
expLL = exp(r8$estimate - 1.96*SE4)
expUL = exp(r8$estimate + 1.96*SE4)
data.frame(PE=r8$estimate, SE=SE4, expBeta, expLL, expUL)
          PE        SE   expBeta     expLL     expUL
1 -0.5019268 0.2696900 0.6053631 0.3568216 1.0270243
2 -0.3889759 0.2991268 0.6777506 0.3770926 1.2181249
3  0.8303839 0.1946280 2.2941994 1.5666085 3.3597103
4 -1.0923183 0.3085107 0.3354379 0.1832325 0.6140755
5 -0.8130358 0.3211015 0.4435096 0.2363610 0.8322047
6  1.5214547 0.2114594 4.5788815 3.0252518 6.9303837

모형의 자유도가 2 줄어들면서 (잔차의 자유도는 2 늘어난다.) 설명력은 유사해 보인다. 더 단순한 모형이 선호되므로 후자의 모형을 선택할 수 있다.

다음은 R nnet package를 이용한 풀이이다.

require(nnet)
d8 = reshape(d7, timevar="Y", times=c("0-no", "1-yes", "2-very"), v.names="Count",
             varying=c("y1", "y2", "y3"), direction="long")
d8
           sex   age sexN ageN      Y Count id
1.0-no   women 18-23    0    0   0-no    26  1
2.0-no   women 24-40    0    1   0-no     9  2
3.0-no   women  > 40    0    2   0-no     5  3
4.0-no     men 18-23    1    0   0-no    40  4
5.0-no     men 24-40    1    1   0-no    17  5
6.0-no     men  > 40    1    2   0-no     8  6
1.1-yes  women 18-23    0    0  1-yes    12  1
2.1-yes  women 24-40    0    1  1-yes    21  2
3.1-yes  women  > 40    0    2  1-yes    14  3
4.1-yes    men 18-23    1    0  1-yes    17  4
5.1-yes    men 24-40    1    1  1-yes    15  5
6.1-yes    men  > 40    1    2  1-yes    15  6
1.2-very women 18-23    0    0 2-very     7  1
2.2-very women 24-40    0    1 2-very    15  2
3.2-very women  > 40    0    2 2-very    41  3
4.2-very   men 18-23    1    0 2-very     8  4
5.2-very   men 24-40    1    1 2-very    12  5
6.2-very   men  > 40    1    2 2-very    18  6
r9 = multinom(Y ~ sexN + as.factor(ageN), weights=Count, data=d8)
# weights:  15 (8 variable)
initial  value 329.583687 
iter  10 value 290.936860
final  value 290.351098 
converged
summary(r9)
Call:
multinom(formula = Y ~ sexN + as.factor(ageN), data = d8, weights = Count)

Coefficients:
       (Intercept)       sexN as.factor(ageN)1 as.factor(ageN)2
1-yes    0.3145071 -0.3881172       -0.9053048       0.22296071
2-very   0.4258621 -0.8129978       -1.4649126       0.01315929

Std. Errors:
       (Intercept)      sexN as.factor(ageN)1 as.factor(ageN)2
1-yes    0.2297206 0.3005107        0.2053270        0.2181834
2-very   0.2325160 0.3210363        0.2371824        0.2329585

Residual Deviance: 580.7022 
AIC: 596.7022 
r10 = multinom(Y ~ sexN + ageN, weights=Count, data=d8)
# weights:  12 (6 variable)
initial  value 329.583687 
iter  10 value 291.051804
final  value 291.050160 
converged
summary(r10)
Call:
multinom(formula = Y ~ sexN + ageN, data = d8, weights = Count)

Coefficients:
       (Intercept)       sexN      ageN
1-yes   -0.5019284 -0.3889719 0.8303865
2-very  -1.0923016 -0.8130325 1.5214495

Std. Errors:
       (Intercept)      sexN      ageN
1-yes    0.2696958 0.2991332 0.1946359
2-very   0.3085171 0.3211054 0.2114579

Residual Deviance: 582.1003 
AIC: 594.1003 
anova(r10, r9)
Likelihood ratio tests of Multinomial Models

Response: Y
                   Model Resid. df Resid. Dev   Test    Df LR stat.   Pr(Chi)
1            sexN + ageN        30   582.1003                                
2 sexN + as.factor(ageN)        28   580.7022 1 vs 2     2 1.398123 0.4970516

두 모형의 비교 결과 통계적으로 유의한 차이가 없으므로 더 단순한 모형을 선택할 수 있다.