7.1 추정량의 바람직한 성질

A라는 추정량(통계량, 확률변수)으로 \(\alpha\)라는 모수(parameter)를 추정하고자 하는 경우 다음과 같은 measure를 생각할 수 있다.

\(\hat{\mu} = \bar{X} = \frac{\sum X_i}{n}\)

\(\hat{\sigma^2} = S^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1}\)

\(\hat{\pi} = \frac{X}{n}\)

\(\hat{\rho} = \frac{\sum (X_i - \bar{X}) (Y_i - \bar{Y})}{\left[ \sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2 \right]^{1/2}}\)

여기에서 모수는 모평균(\(\mu\)), 모분산(\(\sigma^2\)), 모비율(\(\pi\)), 모상관계수(\(\rho\))나 pdf 내에 있는 상수들이다.

위에서 모수 위에 hat 기호가 붙어 있는데, 이것은 해당 모수에 대한 추정량이라는 상징적 기호(symbol)이다. 실제로 어떤 확률변수를 그 추정량으로 사용할 지에 대해서는 한 가지 방법만 있는 것이 아니다.

7.1.1 불편성(unbiasedness)

\(E(A) = \alpha\)

확률변수 A의 기대값이 \(\alpha\)이다.

[예제 7-1] 앞에서 정의한 \(\bar{X}\)와 \(S^2\)이 불편추정량임을 보이시오. 반면에 \(\hat{\sigma^2}\)를 MLE인 \(\frac{\sum(X_i - \bar{X})^2}{n}\) 를 사용하면 편의 추정량(biased estimator)이 됨을 보이시오.

7.1.2 일치성(consistency)

\(\begin{aligned} &\lim_{n \to \infty} A = \alpha \end{aligned}\)

표본크기(n)이 클수록 통계량 A가 모수\(\alpha\)에 가까이 간다.

더 자세히는 약 대수의 법칙(weak law of large numbers)과 강 대수의 법칙(strong law of large numbers)이 있다.

Weak law of large numbers

\(\bar{X} \stackrel{P}{\to} \mu\) when \(n \to \infty\) (P = probablistically) 또는

\(\begin{aligned} &\lim_{n \to \infty} Pr \left(| \bar{X}_n - \mu | < \epsilon \right) = 1 \end{aligned}\)

Strong law of large numbers

\(\bar{X} \stackrel{a.s.}{\to} \mu\) when \(n \to \infty\) (a.s. = almost surely) 또는

\(\begin{aligned} &Pr(\lim_{n \to \infty} |\bar{X}_n - \mu| < \epsilon) = 1 \end{aligned}\)

7.1.3 최소분산성(minimum variance) 또는 최대 효율성(maximal efficiency)

\(Var(A) \leq Var(\text{any other estimator})\)

통계에서는 추정량의 분산(variance), 즉 표준오차(standard error)가 작을수록 효율적(efficient)이라고 한다.

[예제 7-2] \(Y_1, Y_2, \cdots, Y_n\) 이 1부터 1보다 큰 자연수 \(\theta\) 까지의 일양 분포(discrete uniform distribution)로부터 왔다. \(\theta\) 를 추정하기 위해 다음의 두 추정량을 고안해 보았다. 어느 것이 더 efficient한지 계산해 보시오.

\(\hat{\theta}_1 = 2\bar{Y} - 1\)

\(\hat{\theta}_2 = \frac{n + 1}{n} Y_{(n)}\)

[Hint] Wackerly’s Mathematical Statistics with Applications 7e p446 Example 9.1 (Duxbury 2008)

\(V(\hat{\theta_1}) = \frac{\theta^2 - 1}{3n}\)

\(V(\hat{\theta_2}) = \frac{\theta^2}{n(n + 2)}\)

[예제 7-3] 일군의 사람들이 모여서 오징어게임을 하고 있다. 당신은 경찰로서 몰래 이들을 관찰하고 있는데, 옷에는 일련번호가 붙어 있었다. 20명을 관측한 결과는 다음과 같다.

196, 51, 244, 207, 56, 420, 146, 151, 340, 132, 382, 31, 366, 241, 294, 166, 127, 286, 390, 361

(안 보이는 번호는 사망했을 수도 있다.) 이들이 번호를 1번부터 중복 없이 하나씩 증가시켜가면서 붙였다고 가정하면, 총 몇 명이 게임을 시작했는지 추정하고 여러 가지 방법에 의해 신뢰구간을 구하시오.

[풀이 R script]

d0 = c(196, 51, 244, 207, 56, 420, 146, 151, 340, 132, 382, 31, 366, 241, 294,
       166, 127, 286, 390, 361)
n = length(d0)

th1 = mean(d0)*2 - 1; th1                       # 457.7
sd1 = sqrt((th1^2 - 1)/(3*n)) ; sd1             #  59.08867
ci1 = th1 + c(-1, 1)*qt(0.975, n - 1)*sd1 ; ci1 # 334.026 581.374

Yn = max(d0)
th2 = Yn*(n + 1)/n ; th2                        # 441
sd2 = sqrt(th2^2/(n*(n + 2))) ; sd2             #  21.02385
ci2 = th2 + c(-1, 1)*qt(0.975, n - 1)*sd2 ; ci2 # 396.9966 485.0034
ci3 = c(Yn, ci2[2]) ; ci3                       # 420.0000 485.0034

\(\hat{\theta_1}\)은 bootstrap에 의한 신뢰구간이 적절하지만, \(\hat{\theta_2}\)는 유효하지 않다. 그 이유를 생각해 보시오.

모수에 대한 신뢰구간을 추정할 때, 위의 두 가지 \(\hat{\theta}\)의 \(\pm2SE\)를 이용한 것 외에도 3번째 신뢰구간을 생각할 수 있는데, 이것은 다음과 같다.

논리적으로 생각해보면 추정의 대상이 되는 \(\theta\) (sampling space의 upper bound)가 이미 관찰된 Y(n) (관찰값들의 최대값)보다 작을 수는 없다. 즉, 신뢰구간의 하한이 Y(n)보다 크거나 같아야 한다. 아래 simulation을 보면 이 3번째 신뢰구간이 가장 좁고 coverage는 동일함을 알 수 있다. 다만, 3번째 신뢰구간에서 하한에 등호가 있어야 함에 유의한다.

[Simulation R script]

N = 2000
Nboot = 2000
nMax = 456
nSamp = 20
q0975 = qnorm(0.975)
qt0975 = qt(0.975, nSamp - 1)
Res = as.data.frame(matrix(ncol=10, nrow=N))
colnames(Res) = c("th1", "LL1", "UL1", "in1", "th2", "LL2", "UL2", "in2", "LL3", "in3")
for (i in 1:N) {
  d1 = sample(nMax, nSamp)

  th1 = mean(d1)*2 - 1
  sd1 = sqrt((th1^2 - 1)/(3*nSamp))
  ci1 = th1 + c(-1, 1)*qt0975*sd1
  Res[i, 1] = th1
  Res[i, 2:3] = ci1
  Res[i, 4] = ifelse(ci1[1] < nMax & ci1[2] > nMax, T, F)

  Yn = max(d1)
  th2 = Yn*(nSamp + 1)/nSamp
  sd2 = sqrt(th2^2/(nSamp*(nSamp + 2)))
  ci2 = th2 + c(-1, 1)*qt0975*sd2
  Res[i, 5] = th2
  Res[i, 6:7] = ci2
  Res[i, 8] = ifelse(ci2[1] < nMax & ci2[2] > nMax, T, F)
  Res[i, 9] = Yn
  Res[i, 10] = ifelse(Yn <= nMax & ci2[2] > nMax, T, F)
}
colMeans(Res)
mean(Res[,3] - Res[,2]) # width of ci1, 246.0836  -> Worst
mean(Res[,7] - Res[,6]) # width of ci2,  91.15236
mean(Res[,7] - Res[,9]) # width of ci3,  67.32675 -> Best
dev.new()
hist(Res[,1], breaks=30)
dev.new()
hist(Res[,5], breaks=30)

엄밀하게는 복원추출이어야 하므로 위 R script에서 sample 함수에 replace=T 옵션이 필요하다.

Bootstrap에 의한 \(\hat{\theta_1}\)과 \(\hat{\theta_2}\)의 신뢰구간이 coverage하는 확률을 구하는 R script는 다음과 같다. 두 번째의 coverage가 불충분함을 알 수 있다.

N = 2000
Nboot = 2000
nMax = 456
nSamp = 20
Res2 = matrix(ncol=2, nrow=N)
for (i in 1:N) {
  d1 = sample(nMax, nSamp)
  ResBoot1 = rep(NA, Nboot)
  ResBoot2 = ResBoot1
  for (j in 1:Nboot) {
    b1 = sample(d1, nSamp, replace=T)
    ResBoot1[j] = 2*mean(b1) - 1
    ResBoot2[j] = max(b1)*(nSamp + 1)/nSamp
  }
  ci3 = quantile(ResBoot1, c(0.025, 0.975))
  ci4 = quantile(ResBoot2, c(0.025, 0.975))
  Res2[i, 1] = ifelse(ci3[1] <= nMax & ci3[2] >= nMax, T, F)
  Res2[i, 2] = ifelse(ci4[1] <= nMax & ci4[2] >= nMax, T, F)
}

colMeans(Res2) # coverage

이 외에도 충분성(sufficiency) 등이 있으나, 이론 통계가 아닌 경우에는 크게 관심을 두지 않아도 된다.