Odchylenie standardowe obserwacji podzielonych

Mam zestaw danych z przykładowymi obserwacjami, przechowywanymi jako liczby w przedziałach zakresu. na przykład:

min/max  count
40/44    1
45/49    2
50/54    3
55/59    4
70/74    1

Teraz oszacowanie średniej z tego jest dość proste. Po prostu użyj średniej (lub mediany) każdego przedziału zakresu jako obserwacji i zliczenia jako wagi i znajdź średnią ważoną:

{\bar{x}}^{*} = \frac{1}{\sum_{i = 1}^{N} w_{i}} \sum_{i = 1}^{N} w_{i} x_{i}

$\bar{x}^* = \frac{1}{\sum_{i=1}^N w_i} \sum_{i=1}^N w_ix_i$

For my test case, this gives me 53.82.

My question now is, what's the correct method of finding the standard deviation (or variance)?

Through my searching, I've found several answers, but I'm unsure which, if any, is actually appropriate for my dataset. I was able to find the following formula both on another question here and a random NIST document.

s^{2 *} = \frac{\sum_{i = 1}^{N} w_{i} (x_{i} - {\bar{x}}^{*})^{2}}{\frac{(M - 1)}{M} \sum_{i = 1}^{N} w_{i}}

$s^{2*} = \frac{ \sum_{i=1}^N w_i (x_i - \bar{x}^*)^2 }{ \frac{(M-1)}{M} \sum_{i=1}^N w_i }$

Which gives a standard deviation of 8.35 for my test case. However, the Wikipedia article on weighted means gives both the formula:

s^{2 *} = \frac{\sum_{i = 1}^{N} w_{i}}{(\sum_{i = 1}^{N} w_{i})^{2} - \sum_{i = 1}^{N} w_{i}^{2}} \sum_{i = 1}^{N} w_{i} (x_{i} - {\bar{x}}^{*})^{2}

$s^{2*} = \frac{ \sum_{i=1}^N w_i}{(\sum_{i=1}^N w_i)^2 - \sum_{i=1}^N w_i^2} \sum_{i=1}^N w_i(x_i-\bar{x}^*)^2$

and

s^{2 *} = \frac{1}{(\sum_{i = 1}^{N} w_{i}) - 1} \sum_{i = 1}^{N} w_{i} (x_{i} - {\bar{x}}^{*})^{2}

$s^{2*} = \frac{1}{(\sum_{i=1}^N w_i) - 1} \sum_{i=1}^N w_i(x_i-\bar{x}^*)^2$

Which give standard deviations of 8.66 and 7.83, respectively, for my test case.

Update

Thanks to @whuber who suggested looking into Sheppard's Corrections, and your helpful comments related to them. Unfortunately, I'm having a difficult time understanding the resources I can find about it (and I can't find any good examples). To recap though, I understand that the following is a biased estimate of variance:

s^{2 *} = \frac{1}{\sum_{i = 1}^{N} w_{i}} \sum_{i = 1}^{N} w_{i} (x_{i} - {\bar{x}}^{*})^{2}

$s^{2*} = \frac{1}{\sum_{i=1}^N w_i} \sum_{i=1}^N w_i(x_i-\bar{x}^*)^2$

I also understand that most standard corrections for the bias are for direct random samples of a normal distribution. Therefore, I see two potential issues for me:

Są to losowe próbki skumulowane (co, jestem pewien, że właśnie tam pojawiają się Korekcje Shepparda).
Nie wiadomo, czy dane mają rozkład normalny (dlatego zakładam, że nie, co, jestem całkiem pewien, unieważnia Korekcje Shepparda).

Moje zaktualizowane pytanie brzmi: Jaka jest właściwa metoda radzenia sobie z błędem wynikającym z „prostej” ważonej formuły odchylenia standardowego / wariancji dla rozkładu niestandardowego? W szczególności w odniesieniu do danych binarnych.

Uwaga: używam następujących terminów:

$s^{2*}$ jest ważoną wariancją
$N$ to liczba obserwacji. (tj. liczba pojemników)
$M$ to liczba niezerowych wag. (tj. liczba pojemników z zliczeniami)
$w_i$ są wagami (tj. zliczeniami)
$x_i$ są obserwacjami. (tzn. bin oznacza)
$\bar{x}^*$ is the weighted mean.

variance standard-deviation weighted-sampling

— chezy525
źródło

Google "Sheppard's corrections" for the standard solutions to this problem.

— whuber

@whuber, I'm afraid my google-foo is failing me... I'm not finding much about how to use Sheppard's corrections. As far as I can tell, it's a correction for the binned nature of the data, and in my test case would be used like

s^{2 *} - \frac{c^{2}}{12}

$s^{2*} - \frac{c^2}{12}$ , where

c

$c$ is the size of the bins (in my test case, 4). Is this correct? In any case, what I'm finding still doesn't seem to help me with computing

s^{2 *}

$s^{2*}$ .

— chezy525

Drugie trafienie w mojej wyszukiwarce Google zapewnia wyraźny wzór (równanie 9).

— whuber

@whuber, it's been a couple months, and I've tried reading the document you linked a couple times. I think I'm still missing something, but the best I've come up with is that the final equation I listed is correct as the unbiased estimator. Is this right?

— chezy525

Poprawki Shepparda nie zakładają normalności.

— Glen_b

This reply presents two solutions: Sheppard's corrections and a maximum likelihood estimate. Both closely agree on an estimate of the standard deviation: $7.70$ for the first and $7.69$ for the second (when adjusted to be comparable to the usual "unbiased" estimator).

Sheppard's corrections

"Sheppard's corrections" are formulas that adjust moments computed from binned data (like these) where

the data are assumed to be governed by a distribution supported on a finite interval $[a,b]$
that interval is divided sequentially into equal bins of common width $h$ that is relatively small (no bin contains a large proportion of all the data)
the distribution has a continuous density function.

They are derived from the Euler-Maclaurin sum formula, which approximates integrals in terms of linear combinations of values of the integrand at regularly spaced points, and therefore generally applicable (and not just to Normal distributions).

Although strictly speaking a Normal distribution is not supported on a finite interval, to an extremely close approximation it is. Essentially all its probability is contained within seven standard deviations of the mean. Therefore Sheppard's corrections are applicable to data assumed to come from a Normal distribution.

The first two Sheppard's corrections are

Use the mean of the binned data for the mean of the data (that is, no correction is needed for the mean).
Subtract $h^2/12$ from the variance of the binned data to obtain the (approximate) variance of the data.

Where does $h^2/12$ come from? This equals the variance of a uniform variate distributed over an interval of length $h$ . Intuitively, then, Sheppard's correction for the second moment suggests that binning the data--effectively replacing them by the midpoint of each bin--appears to add an approximately uniformly distributed value ranging between $-h/2$ and $h/2$ , whence it inflates the variance by $h^2/12$ .

Let's do the calculations. I use R to illustrate them, beginning by specifying the counts and the bins:

counts <- c(1,2,3,4,1)
bin.lower <- c(40, 45, 50, 55, 70)
bin.upper <- c(45, 50, 55, 60, 75)

The proper formula to use for the counts comes from replicating the bin widths by the amounts given by the counts; that is, the binned data are equivalent to

42.5, 47.5, 47.5, 52.5, 52.5, 57.5, 57.5, 57.5, 57.5, 72.5

Their number, mean, and variance can be directly computed without having to expand the data in this way, though: when a bin has midpoint $x$ and a count of $k$ , then its contribution to the sum of squares is $kx^2$ . This leads to the second of the Wikipedia formulas cited in the question.

bin.mid <- (bin.upper + bin.lower)/2
n <- sum(counts)
mu <- sum(bin.mid * counts) / n
sigma2 <- (sum(bin.mid^2 * counts) - n * mu^2) / (n-1)

The mean (mu) is $1195/22 \approx 54.32$ (needing no correction) and the variance (sigma2) is $675/11 \approx 61.36$ . (Its square root is $7.83$ as stated in the question.) Because the common bin width is $h=5$ , we subtract $h^2/12 = 25/12 \approx 2.08$ from the variance and take its square root, obtaining $\sqrt{675/11 - 5^2/12} \approx 7.70$ for the standard deviation.

Maximum Likelihood Estimates

An alternative method is to apply a maximum likelihood estimate. When the assumed underlying distribution has a distribution function $F_\theta$ (depending on parameters $\theta$ to be estimated) and the bin $(x_0, x_1]$ contains $k$ values out of a set of independent, identically distributed values from $F_\theta$ , then the (additive) contribution to the log likelihood of this bin is

\log \prod_{i = 1}^{k} (F_{θ} (x_{1}) - F_{θ} (x_{0})) = k \log (F_{θ} (x_{1}) - F_{θ} (x_{0}))

$\log \prod_{i=1}^k \left(F_\theta(x_1) - F_\theta(x_0)\right) = k\log\left(F_\theta(x_1) - F_\theta(x_0)\right)$

(see MLE/Likelihood of lognormally distributed interval).

Summing over all bins gives the log likelihood $\Lambda(\theta)$ for the dataset. As usual, we find an estimate $\hat\theta$ which minimizes $-\Lambda(\theta)$ . This requires numerical optimization and that is expedited by supplying good starting values for $\theta$ . The following R code does the work for a Normal distribution:

sigma <- sqrt(sigma2) # Crude starting estimate for the SD
likelihood.log <- function(theta, counts, bin.lower, bin.upper) {
  mu <- theta[1]; sigma <- theta[2]
  -sum(sapply(1:length(counts), function(i) {
    counts[i] * 
      log(pnorm(bin.upper[i], mu, sigma) - pnorm(bin.lower[i], mu, sigma))
  }))
}
coefficients <- optim(c(mu, sigma), function(theta) 
  likelihood.log(theta, counts, bin.lower, bin.upper))$par

The resulting coefficients are $(\hat\mu, \hat\sigma) = (54.32, 7.33)$ .

Remember, though, that for Normal distributions the maximum likelihood estimate of $\sigma$ (when the data are given exactly and not binned) is the population SD of the data, not the more conventional "bias corrected" estimate in which the variance is multiplied by $n/(n-1)$ . Let us then (for comparison) correct the MLE of $\sigma$ , finding $\sqrt{n/(n-1)} \hat\sigma = \sqrt{11/10}\times 7.33 = 7.69$ . This compares favorably with the result of Sheppard's correction, which was $7.70$ .

Verifying the Assumptions

To visualize these results we can plot the fitted Normal density over a histogram:

hist(unlist(mapply(function(x,y) rep(x,y), bin.mid, counts)),
     breaks = breaks, xlab="Values", main="Data and Normal Fit")
curve(dnorm(x, coefficients[1], coefficients[2]), 
      from=min(bin.lower), to=max(bin.upper), 
      add=TRUE, col="Blue", lwd=2)

To some this might not look like a good fit. However, because the dataset is small (only $11$ values), surprisingly large deviations between the distribution of the observations and the true underlying distribution can occur.

Let's more formally check the assumption (made by the MLE) that the data are governed by a Normal distribution. An approximate goodness of fit test can be obtained from a $\chi^2$ test: the estimated parameters indicate the expected amount of data in each bin; the $\chi^2$ statistic compares the observed counts to the expected counts. Here is a test in R:

breaks <- sort(unique(c(bin.lower, bin.upper)))
fit <- mapply(function(l, u) exp(-likelihood.log(coefficients, 1, l, u)),
              c(-Inf, breaks), c(breaks, Inf))
observed <- sapply(breaks[-length(breaks)], function(x) sum((counts)[bin.lower <= x])) -
  sapply(breaks[-1], function(x) sum((counts)[bin.upper < x]))
chisq.test(c(0, observed, 0), p=fit, simulate.p.value=TRUE)

The output is

Chi-squared test for given probabilities with simulated p-value (based on 2000 replicates)

data:  c(0, observed, 0) 
X-squared = 7.9581, df = NA, p-value = 0.2449

The software has performed a permutation test (which is needed because the test statistic does not follow a chi-squared distribution exactly: see my analysis at How to Understand Degrees of Freedom). Its p-value of $0.245$ , which is not small, shows very little evidence of departure from normality: we have reason to trust the maximum likelihood results.

— whuber
źródło