Stan

I was going through the Stan documentation, which can be downloaded from here. I was particularly interested in their implementation of the Gelman-Rubin diagnostic. The original paper, Gelman & Rubin (1992), defines the potential scale reduction factor (PSRF) as follows:

Let $X_{i,1}, \dots , X_{i,N}$ be the $i$th Markov chain sampled, and let there be $M$ independent chains sampled in total. Let $\bar{X}_{i\cdot}$ be the mean of the $i$th chain, and $\bar{X}_{\cdot \cdot}$ be the overall mean. Define

$W = \dfrac{1}{M} \sum_{m=1}^{M} {s^2_m},$

where

$s^2_m = \dfrac{1}{N-1} \sum_{t=1}^{N} (X_{m,t} - \bar{X}_{m \cdot})^2\,.$

And define $B$ as

$B = \dfrac{N}{M-1} \sum_{m=1}^{M} (\bar{X}_{m \cdot} - \bar{X}_{\cdot \cdot})^2 \,.$

Define

$\hat{V} = \left(\dfrac{N-1}{N} \right)W + \left( \dfrac{M+1}{MN} \right)B\,.$

The PSRF is estimated with $\sqrt{\hat{R}}$, where

$\hat{R} = \dfrac{\hat{V}}{W} \cdot \dfrac{df+3}{df+1}\,,$

where $df = 2\hat{V}/Var(\hat{V})$.
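The definitions above can be sketched in a few lines of Python. This is only an illustration, assuming the chains are stored as plain lists of equal length; the function names are made up for this example and are not taken from Stan or coda. The $df$ factor is left out because it needs an estimate of $Var(\hat{V})$, which is not defined in this post.

```python
from statistics import mean, variance


def within_between(chains):
    """Return (W, B, N, M) for a list of M equal-length chains."""
    M = len(chains)
    N = len(chains[0])
    if M < 2 or N < 2 or any(len(c) != N for c in chains):
        raise ValueError("need at least 2 chains of equal length >= 2")
    # W: average of the within-chain sample variances s_m^2 (divisor N - 1).
    W = mean([variance(c) for c in chains])
    # B: N times the sample variance of the chain means (divisor M - 1).
    B = N * variance([mean(c) for c in chains])
    return W, B, N, M


def v_hat(chains):
    """Pooled variance estimate V-hat as defined above."""
    W, B, N, M = within_between(chains)
    return (N - 1) / N * W + (M + 1) / (M * N) * B


def uncorrected_ratio(chains):
    """V-hat / W, that is, R-hat without the (df + 3) / (df + 1) factor."""
    W, _, _, _ = within_between(chains)
    return v_hat(chains) / W


# Toy data, two short chains that differ only by a shift of 1.
chains = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 3.0, 4.0, 5.0],
]
print(within_between(chains))
print(uncorrected_ratio(chains))
```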

The Stan documentation on page 349 ignores the term with $df$ and also removes the $(M+1)/M$ multiplicative term. This is their formula:

The variance estimator is

$\widehat{\text{var}}^{+}(\theta \, | \, y) = \frac{N-1}{N} W + \frac{1}{N} B\,.$

Finally, the potential scale reduction statistic is defined by

$\hat{R} = \sqrt{\frac{\widehat{\text{var}}^{+}(\theta \, | \, y) }{W}}\,.$
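For comparison, here is a small Python sketch of this simplified statistic. It assumes chains stored as equal-length lists and is not Stan's actual implementation; the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def rhat_simplified(chains):
    """sqrt(var_plus / W) for a list of M equal-length chains."""
    N = len(chains[0])
    if len(chains) < 2 or N < 2 or any(len(c) != N for c in chains):
        raise ValueError("need at least 2 chains of equal length >= 2")
    # Within-chain variance: mean of the per-chain sample variances.
    W = mean([variance(c) for c in chains])
    if W == 0:
        raise ValueError("within-chain variance is zero")
    # Between-chain variance: N times the sample variance of the chain means.
    B = N * variance([mean(c) for c in chains])
    # Variance estimator without the (M + 1) / M factor on B.
    var_plus = (N - 1) / N * W + B / N
    return sqrt(var_plus / W)


# Toy data, two short chains that differ only by a shift of 1.
chains = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 3.0, 4.0, 5.0],
]
print(rhat_simplified(chains))
```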

From what I could see, they don't provide a reference for this change of formula, and they don't discuss it either. Usually $M$ is not too large, and can often be as low as $2$, so $(M+1)/M$ should not be ignored, even if the $df$ term can be approximated with 1.

So where does this formula come from?

EDIT: I have found a partial answer to the question "where does this formula come from?": the book Bayesian Data Analysis by Gelman, Carlin, Stern, and Rubin (second edition) has exactly the same formula. However, the book does not explain how or why it is justifiable to ignore those terms.

— Greenparker

There is no published paper yet, and the formula will probably change in the next few months.

— Ben Goodrich

@BenGoodrich Thanks for the comment. Can you say more about the motivation for using this formula? And why exactly is the formula going to change?

— Greenparker

The current split R-hat formula is primarily motivated by making it applicable to the case where there is only one chain. The upcoming changes mostly address the fact that the underlying marginal distribution may not be normal, or may not have a mean and/or variance.

— Ben Goodrich

@BenGoodrich Yes, I understand why Stan uses split R-hat. However, even in that case $M = 2$, and so the constant $(M+1)/M = 3/2$, which is not negligible.

— Greenparker

Gelman & Rubin (1992) has

$\hat{\sigma} = \frac{n-1}{n}W+ \frac{1}{n}B$

as in the later versions, although $\hat{\sigma}$ is written $\hat{\sigma}_+$ in Brooks & Gelman (1998) and $\widehat{\rm var}^+$ in BDA. Brooks & Gelman (1998) has equation 1.1,

$\hat{R} = \frac{m+1}{m}\frac{\hat{\sigma}_+}{W} - \frac{n-1}{mn},$

which can be rearranged as

$\hat{R} = \frac{\hat{\sigma}_+}{W} + \frac{\hat{\sigma}_+}{Wm}- \frac{n-1}{mn}.$

We can see that the effect of the second and third terms is negligible for decision making when $n$ is large. See also the discussion in the paragraph before Section 3.1 in Brooks & Gelman (1998).
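To make the size of the second and third terms explicit, one can substitute $\hat{\sigma}_+ = \frac{n-1}{n}W + \frac{1}{n}B$ into them. The step below is a worked check using only the symbols already defined here; it is not quoted from either paper.

```latex
\begin{aligned}
\frac{\hat{\sigma}_+}{W m} - \frac{n-1}{mn}
  &= \frac{1}{m}\left(\frac{n-1}{n} + \frac{B}{nW}\right) - \frac{n-1}{mn}
   = \frac{B}{mnW}, \\
\hat{R} &= \frac{\hat{\sigma}_+}{W} + \frac{B}{mnW}.
\end{aligned}
```

So the $(m+1)/m$ factor and the $-\frac{n-1}{mn}$ term together contribute only $B/(mnW)$. Assuming $B/W$ stays of moderate size, this shrinks like $1/n$: for example, with $m = 2$, $n = 1000$ and $B/W = 1$ it equals $0.0005$.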

Gelman & Rubin (1992) also had the term with df, as df/(df-2). Brooks & Gelman (1998) have a section describing why this df correction is incorrect and define (df+3)/(df+1) instead. The paragraph before Section 3.1 in Brooks & Gelman (1998) explains why (df+3)/(df+1) can be dropped.

It seems your source for the equations was something post Brooks & Gelman (1998), as you had (df+3)/(df+1) there, whereas Gelman & Rubin (1992) had df/(df-2). Otherwise Gelman & Rubin (1992) and Brooks & Gelman (1998) have equivalent equations (with slightly different notation and some terms arranged differently). BDA2 (Gelman et al., 2003) no longer has the terms $\frac{\hat{\sigma}_+}{Wm}- \frac{n-1}{mn}$. BDA3 (Gelman et al., 2013) and Stan introduced the split-chains version.

My interpretation of the papers and experiences using different versions of $\hat{R}$ is that the terms which have been eventually dropped can be ignored when $n$ is large, even when $m$ is not. I also vaguely remember discussing this with Andrew Gelman years ago, but if you want to be certain of the history, you should ask him.

"Usually $M$ is not too large, and can often be as low as 2"

I really do hope that this is not often the case. In cases where you want to use the split-$\hat{R}$ convergence diagnostic, you should use at least 4 chains, split them, and thus have $M=8$. You may use fewer chains if you already know that in your specific case convergence and mixing are fast.
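As a rough illustration of the splitting step, the sketch below cuts each chain into a first and a second half and applies the simplified $\hat{R}$ to the resulting half-chains. It assumes equal-length chains stored as lists and drops the middle draw when the length is odd; both choices, and the function names, are assumptions made for this example rather than a description of Stan's code.

```python
from math import sqrt
from statistics import mean, variance


def split_chains(chains):
    """Split every chain into a first and a second half of equal length."""
    halves = []
    for chain in chains:
        half = len(chain) // 2
        halves.append(list(chain[:half]))
        # For odd lengths the middle draw is dropped so both halves match.
        halves.append(list(chain[len(chain) - half:]))
    return halves


def split_rhat(chains):
    """Simplified R-hat, sqrt(var_plus / W), computed on the half-chains."""
    halves = split_chains(chains)
    n = len(halves[0])
    if n < 2 or any(len(h) != n for h in halves):
        raise ValueError("chains must have equal length of at least 4")
    W = mean([variance(h) for h in halves])
    B = n * variance([mean(h) for h in halves])
    var_plus = (n - 1) / n * W + B / n
    return sqrt(var_plus / W)


# Four toy chains of length 6 become M = 8 half-chains of length 3.
chains = [
    [0.1, 0.4, 0.2, 0.5, 0.3, 0.6],
    [0.3, 0.2, 0.6, 0.1, 0.5, 0.4],
    [0.5, 0.1, 0.4, 0.6, 0.2, 0.3],
    [0.2, 0.6, 0.3, 0.4, 0.1, 0.5],
]
print(len(split_chains(chains)), split_rhat(chains))
```

Because the halves of one chain are compared with each other, a single chain that is still drifting (for instance `[1, 2, 3, 4, 5, 6, 7, 8]`) already produces a value well above 1.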

Additional reference:

Brooks, S. P. and Gelman, A. (1998). General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7(4), 434-455.

— Aki Vehtari

Yes, it has the same $\hat{\sigma}^2$ as you mention, but their $\hat{R}$ statistic is $(\hat{\sigma}^2 + B/mn)/W * df_{term}$ (look at the equation on top of page 495 in the Stat Science official version), which introduces the $(m+1)/m$ term I was talking about. In addition, look at the code and description in the R package coda, which has had the GR diagnostic since 1999.

— Greenparker

I'm confused. The article via the link you provided and the article from the Stat Science web pages both have only pages 457-472. I didn't check now, but years ago and last year when I checked coda, it didn't have the currently recommended version.

— Aki Vehtari

Note that I edited my answer. Brooks & Gelman (1998) shows that $(m+1)/m$ term more clearly, and it seems you missed the last term, which mostly cancels the effect of the $(m+1)/m$ term for decision making. See the paragraph before Section 3.1.

— Aki Vehtari

Sorry about that, that was a typo. It's page 465, and Gelman and Rubin have the same exact definition as Brooks and Gelman (which you state above). Equation 1.1 in Brooks and Gelman is exactly what I wrote down as well (when you rearrange some terms).

— Greenparker

"We can see that the effect of second and third term are negligible for decision making when n is large", so what you are saying is that the expression in BDA and hence STAN comes from essentially ignoring these terms for large n?

— Greenparker