Can I reconstruct a normal distribution from the sample size and the minimum and maximum values? I can use the midpoint as a proxy for the mean

I know this may be a bit shaky statistically, but this is my problem.

I have a lot of range data, that is, the minimum, maximum, and sample size of a variable. For some of these data I also have a mean, but not for many. I want to compare these ranges with each other to quantify the variability of each range, and also to compare the means. I have good reason to suppose that the distribution is symmetric around the mean and that the data will have a Gaussian distribution. For this reason I think I can justify using the midpoint of the range as a proxy for the mean when it is absent.

What I want to do is reconstruct a distribution for each range, and then use it to provide a standard deviation or standard error for that distribution. The only information I have is the maximum and minimum observed in a sample, and the midpoint as a proxy for the mean.

In this way I hope to be able to calculate weighted means for each group, and also to work out a coefficient of variation for each group, based on the range data and my assumptions (a symmetric and normal distribution).

I plan to use R for this, so any help with code would be appreciated.

— green_thinlake

I was puzzled as to why you say you have minimum, maximum and sample size data, and then that you have information on the expected minimum and maximum. Which is it: observed or expected?

— Scortchi - Reinstate Monica

Sorry, that is my mistake. The maximum and minimum data are observed (measured from real-life objects). I have amended the post.

— green_thinlake

Answers:

The joint cumulative distribution function for the minimum $x_{(1)}$ and maximum $x_{(n)}$ of a sample of size $n$ from a Gaussian distribution with mean $\mu$ and standard deviation $\sigma$ is

$$\begin{aligned}
F(x_{(1)},x_{(n)};\mu,\sigma) &= \Pr(X_{(1)}<x_{(1)}, X_{(n)}<x_{(n)})\\
&= \Pr(X_{(n)}<x_{(n)}) - \Pr(X_{(1)}>x_{(1)}, X_{(n)}<x_{(n)})\\
&= \Phi\left(\tfrac{x_{(n)}-\mu}{\sigma}\right)^n - \left[\Phi\left(\tfrac{x_{(n)}-\mu}{\sigma}\right) - \Phi\left(\tfrac{x_{(1)}-\mu}{\sigma}\right)\right]^n
\end{aligned}$$

where $\Phi(\cdot)$ is the standard Gaussian CDF. Differentiating with respect to $x_{(1)}$ and $x_{(n)}$ gives the joint probability density function

$$f(x_{(1)},x_{(n)};\mu,\sigma) = n(n-1)\left[\Phi\left(\tfrac{x_{(n)}-\mu}{\sigma}\right) - \Phi\left(\tfrac{x_{(1)}-\mu}{\sigma}\right)\right]^{n-2}\cdot\phi\left(\tfrac{x_{(n)}-\mu}{\sigma}\right)\cdot\phi\left(\tfrac{x_{(1)}-\mu}{\sigma}\right)\cdot\tfrac{1}{\sigma^2}$$

where $\phi(\cdot)$ is the standard Gaussian PDF. Taking logs and dropping terms that do not contain the parameters gives the log-likelihood function

$$\ell(\mu,\sigma;x_{(1)},x_{(n)}) = (n-2)\log\left[\Phi\left(\tfrac{x_{(n)}-\mu}{\sigma}\right) - \Phi\left(\tfrac{x_{(1)}-\mu}{\sigma}\right)\right] + \log\phi\left(\tfrac{x_{(n)}-\mu}{\sigma}\right) + \log\phi\left(\tfrac{x_{(1)}-\mu}{\sigma}\right) - 2\log\sigma$$

This doesn't look very tractable, but it's easy to see that, whatever the value of $\sigma$, it is maximized by setting $\mu=\hat\mu=\frac{x_{(n)}+x_{(1)}}{2}$, i.e. the midpoint: the first term is maximized when the argument of one CDF is the negative of the argument of the other, and the second and third terms represent the joint likelihood of two independent normal variates.

Substituting $\hat\mu$ into the log-likelihood and writing $r=x_{(n)}-x_{(1)}$ gives

$$\ell(\sigma;x_{(1)},x_{(n)},\hat\mu)=(n-2)\log\left[1 - 2\Phi\left(\tfrac{-r}{2\sigma}\right)\right] - \frac{r^2}{4\sigma^2} - 2\log\sigma$$

This expression has to be maximized numerically (e.g. with `optimize` from R's `stats` package) to find $\hat\sigma$. (It turns out that $\hat\sigma=k(n)\cdot r$, where $k$ is a constant depending only on $n$; perhaps someone more mathematically adroit than I could show why.)
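A minimal R sketch of that numerical maximization, assuming the profile log-likelihood above with $\mu$ fixed at the midpoint. The names `profile_loglik` and `sigma_hat_from_range`, the search bracket, and the example numbers are illustrative assumptions rather than part of the original answer:

```r
# Profile log-likelihood for sigma, with mu fixed at the midpoint.
# r: observed range (max - min); n: sample size (n >= 2).
profile_loglik <- function(sigma, r, n) {
  (n - 2) * log(1 - 2 * pnorm(-r / (2 * sigma))) -
    r^2 / (4 * sigma^2) - 2 * log(sigma)
}

# Hypothetical helper: maximize over sigma with stats::optimize().
# The bracket (r/100, 10*r) is an assumption; widen it if the
# reported optimum sits on a boundary.
sigma_hat_from_range <- function(r, n) {
  optimize(profile_loglik, interval = c(r / 100, 10 * r),
           r = r, n = n, maximum = TRUE, tol = 1e-8)$maximum
}

# Made-up example: minimum 12, maximum 30, sample size 25.
sigma_hat_from_range(r = 30 - 12, n = 25)

# k(n) is the estimate for a unit range.
k <- function(n) sigma_hat_from_range(r = 1, n = n)
```

For $n=2$ the first term vanishes and the maximum can be found by hand at $\hat\sigma=r/2$, which gives a quick sanity check on the optimizer.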

Estimates are of no use without a measure of precision to accompany them. The observed Fisher information can be evaluated numerically (e.g. with `hessian` from the R package `numDeriv`) and used to calculate approximate standard errors:

$$I(\mu)=-\left.\frac{\partial^2{\ell(\mu;\hat\sigma)}}{(\partial\mu)^2}\right|_{\mu=\hat\mu}$$

$$I(\sigma)=-\left.\frac{\partial^2{\ell(\sigma;\hat\mu)}}{(\partial\sigma)^2}\right|_{\sigma=\hat\sigma}$$
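As a rough sketch of how those standard errors might be computed, the block below takes each one as $1/\sqrt{I}$. It swaps in `optimHess` from base R's `stats` package for `numDeriv::hessian`, so that it runs without extra packages. The helper names `loglik` and `range_mle` and the search bracket are assumptions made for illustration:

```r
# Log-likelihood of (mu, sigma) given the observed minimum x1,
# maximum xn and sample size n, up to an additive constant.
loglik <- function(par, x1, xn, n) {
  a <- (x1 - par[1]) / par[2]
  b <- (xn - par[1]) / par[2]
  (n - 2) * log(pnorm(b) - pnorm(a)) +
    dnorm(b, log = TRUE) + dnorm(a, log = TRUE) - 2 * log(par[2])
}

# Hypothetical helper: estimates plus approximate standard errors.
range_mle <- function(x1, xn, n) {
  r <- xn - x1
  mu_hat <- (x1 + xn) / 2            # the midpoint maximizes over mu
  sigma_hat <- optimize(function(s) loglik(c(mu_hat, s), x1, xn, n),
                        interval = c(r / 100, 10 * r),  # assumed bracket
                        maximum = TRUE, tol = 1e-8)$maximum
  # Numerical Hessian of the log-likelihood at the estimates.
  H <- optimHess(c(mu_hat, sigma_hat), loglik, x1 = x1, xn = xn, n = n)
  info <- -diag(H)                   # I(mu) and I(sigma)
  list(estimate = c(mu = mu_hat, sigma = sigma_hat),
       se = c(mu = 1 / sqrt(info[1]), sigma = 1 / sqrt(info[2])))
}

# Made-up example: minimum 12, maximum 30, sample size 25.
range_mle(x1 = 12, xn = 30, n = 25)
```

These are large-sample approximations applied to just two order statistics, so they are probably best read as rough indications of precision.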

It would be interesting to compare the likelihood and method-of-moments estimates of $\sigma$ in terms of bias (is the MLE consistent?), variance, and mean-square error. There is also the issue of estimation for those groups where the sample mean is known in addition to the minimum and maximum.

— Scortchi - Reinstate Monica

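Apart from an additive term $2\log(r)$, the log-likelihood depends only on $\sigma/r$ and $n$. The value of $\sigma/r$ that maximizes it is therefore a function of $n$ alone, $n\to k(n)$, whence $\hat\sigma=k(n)r$. See also the studentized range.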

— whuber

@whuber: Thanks! In hindsight it seems obvious. I'll incorporate this into the answer.

— Scortchi - Reinstate Monica

You need to relate the range to the standard deviation/variance. Let $\mu$ be the mean, $\sigma$ the standard deviation and $R=x_{(n)} - x_{(1)}$ the range. For the normal distribution, $99.7\%$ of the probability mass lies within 3 standard deviations of the mean. As a practical rule, this means that with very high probability,

$$\mu + 3\sigma \approx x_{(n)}$$

and

$$\mu - 3\sigma \approx x_{(1)}$$

Subtracting the second from the first we obtain

$$6\sigma \approx x_{(n)} - x_{(1)}= R$$

(this, by the way, is where the "six-sigma" quality assurance methodology in industry comes from). Then you can obtain an estimate for the standard deviation by

$$\hat \sigma = \frac 16 \Big(\bar x_{(n)} - \bar x_{(1)}\Big)$$

where the bar denotes averages. This applies when you assume that all sub-samples come from the same distribution (you wrote about having expected ranges). If each sample is a different normal, with a different mean and variance, then you can use the formula for each sample, but the uncertainty / possible inaccuracy in the estimated value of the standard deviation will be much larger.

Having a value for the mean and for the standard deviation completely characterizes the normal distribution.

— Alecos Papadopoulos

That's neither a close approximation for small $n$ nor an asymptotic result for large $n$.

— Scortchi - Reinstate Monica

@Scortchi Well, I didn't say that it is a good estimate, but I believe that it is always good to have easily implemented solutions, even very rough ones, in order to get a quantitative sense of the issue at hand, alongside the more sophisticated and efficient approaches such as the one outlined in the other answer to this question.

— Alecos Papadopoulos

I wouldn't carp at "the expectation of the sample range turns out to be about 6 times the standard deviation for values of $n$ from 200 to 1000". But am I missing something subtle in your derivation, or wouldn't it work just as well to justify dividing the range by any number?

— Scortchi - Reinstate Monica

@Scortchi Well, the spirit of the approach is "if we expect almost all realizations to fall within 6 sigmas, then it is reasonable to expect that the extreme realizations will be near the border"; that's all there is to it, really. Perhaps I am too used to operating under extremely incomplete information, and being obliged to say something quantitative about it... :)

— Alecos Papadopoulos

I could reply that even more observations would fall within $10\sigma$ of the mean, giving a better estimate $\hat\sigma=\frac{R}{10}$. I shan't because it's nonsense. Any number over $1.13$ will be a rough estimate for some value of $n$.

— Scortchi - Reinstate Monica

It is straightforward to get the distribution function of the maximum of a normal sample (see "P.max.norm" in the code). From it (with some calculus) you can get the quantile function (see "Q.max.norm").

Using "Q.max.norm" and "Q.min.norm" you can get the medians of the maximum and the minimum for a given N; their difference gives a typical range in units of the standard deviation. Using the idea presented by Alecos Papadopoulos (in the previous answer), you can then calculate the sd by dividing the observed range by that difference.

Try this:

```r
N = 100000    # the size of the sample

# Distribution function (CDF) of the maximum of N i.i.d. normal draws
P.max.norm <- function(q, N=1, mean=0, sd=1){
    pnorm(q,mean,sd)^N
}
# Quantile function of the maximum of N i.i.d. normal draws
Q.max.norm <- function(p, N=1, mean=0, sd=1){
    qnorm(p^(1/N),mean,sd)
}
# Quantile function of the minimum, by symmetry about the mean:
# the p-quantile of the minimum mirrors the (1-p)-quantile of the maximum
Q.min.norm <- function(p, N=1, mean=0, sd=1){
    mean-(Q.max.norm(1-p, N=N, mean=mean, sd=sd)-mean)
}

### let's test it (takes some time)
Q.max.norm(0.5, N=N)  # the median of the maximum
Q.min.norm(0.5, N=N)  # the median of the minimum

iter = 100
median(replicate(iter, max(rnorm(N))))
median(replicate(iter, min(rnorm(N))))
# it is quite OK

### Let's try to get estimates
true_mean = -3
true_sd = 2
N = 100000

x = rnorm(N, true_mean, true_sd)  # simulation
x.vec = range(x)                  # observations: minimum and maximum

# estimation
est_mean = mean(x.vec)            # midpoint of the range
est_sd = diff(x.vec)/(Q.max.norm(0.5, N=N)-Q.min.norm(0.5, N=N))

c(true_mean, true_sd)
c(est_mean, est_sd)

# Quite good, but only for large N
# Output from the author's run (no seed is set, so results vary):
# -3  2
# -3.252606  1.981593
```

— Vyga

Continuing this approach, $\operatorname{E}(R) = \sigma \int_{-\infty}^{\infty} \left[1-(1-\Phi(x))^n -\Phi(x)^n\right] \mathrm{d} x = \sigma d_2(n)$, where $R$ is the range & $\Phi(\cdot)$ the standard normal cumulative distribution function. You can find tabulated values of $d_2$ for small $n$ in the statistical process control literature, numerically evaluate the integral, or simulate for your $n$.

— Scortchi - Reinstate Monica
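The integral for $d_2(n)$ in the comment above can be evaluated numerically. Here is a minimal sketch using base R's `integrate`, exploiting the fact that the integrand is symmetric about zero. The names `d2` and `sigma_from_range` and the example numbers are illustrative assumptions:

```r
# d2(n) = E(R) / sigma for a normal sample of size n.
# The integrand is even in x, so integrate over [0, Inf) and double.
d2 <- function(n) {
  integrand <- function(x) {
    # 1 - Phi(x)^n (via expm1 for accuracy) minus (1 - Phi(x))^n
    -expm1(n * pnorm(x, log.p = TRUE)) - pnorm(x, lower.tail = FALSE)^n
  }
  2 * integrate(integrand, lower = 0, upper = Inf)$value
}

# Hypothetical helper: moment-style estimate of sigma from a range.
sigma_from_range <- function(r, n) r / d2(n)

sapply(c(2, 5, 10, 200, 1000), d2)
sigma_from_range(r = 18, n = 25)   # made-up range and sample size
```

Dividing an observed range by `d2(n)` gives a moment-style estimate of $\sigma$, as opposed to the maximum-likelihood estimate discussed in the first answer. For very large $n$ it may be worth checking the error bound that `integrate` reports.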