Linear regression prediction interval

If the best linear approximation (using least squares) of my data points is the line $y=mx+b$, how can I calculate the approximation error? If I compute the standard deviation of the differences between observations and predictions, $e_i=real(x_i)-(mx_i+b)$, can I later say that a real (but not observed) value $y_r=real(x_0)$ belongs to the interval $[y_p-\sigma, y_p+\sigma]$ (where $y_p=mx_0+b$) with probability ~68%, assuming a normal distribution?

To clarify:

I made observations of a function $f(x)$ by evaluating it at some points $x_i$. I fit these observations to a line $l(x)=mx+b$. For an $x_0$ that I did not observe, I'd like to know how big $f(x_0)-l(x_0)$ can be. Using the method above, is it correct to say that $f(x_0) \in [l(x_0)-\sigma, l(x_0)+\sigma]$ with probability ~68%?

— BMX

I think you are asking about prediction intervals. Note, however, that you use "$x_i$" instead of "$y_i$". Is that a typo? We don't predict $x$s.

— gung - Reinstate Monica

@gung: I use $x$ to denote, for example, time, and $y$ the value of some variable at that time, so $y=f(x)$ means that I made an observation $y$ at time $x$. I want to know how far the predictions of the fit function can be from the actual values of $y$. Does that make sense? The function $real(x_i)$ returns the "correct" value of $y$ at $x_i$, and my data points consist of ${(x_i, real(x_i))}$.

— bmx

That seems perfectly reasonable. The parts I'm focusing on are, e.g., "$e_i=real(x_i)-(mx_i+b)$"; we usually think of the errors / residuals in a regression model as "$e_i=y_i-(mx_i+b)$". The SD of the residuals does play a role in calculating prediction intervals. It's that "$x_i$" that's odd to me; I'm wondering if it's a typo, or if you're asking about something I don't recognize.

— gung - Reinstate Monica

I think I see; I missed your edit. That suggests that the system is fully deterministic and, if you had access to the real underlying function, you could always predict $y_i$ perfectly, without error. That's not the way we typically think about regression models.

— gung - Reinstate Monica

bmx, it seems to me you have a clear idea of your question and a good awareness of some of the issues. You might be interested in reviewing three closely related threads. stats.stackexchange.com/questions/17773 describes prediction intervals in non-technical terms; stats.stackexchange.com/questions/26702 gives a more mathematical description; and in stats.stackexchange.com/questions/9131, Rob Hyndman provides the formula you seek. If these don't fully answer your question, at least they may give you a standard notation and vocabulary for clarifying it.

— whuber

@whuber has pointed you to three good answers, but perhaps I can still write something of value. Your explicit question, as I understand it, is:

Given my fitted model, $\hat y_i=\hat mx_i + \hat b$ (notice I added 'hats'), and assuming my residuals are normally distributed, $\mathcal N(0, \hat\sigma^2_e)$, can I predict that an as yet unobserved response, $y_{new}$, with a known predictor value, $x_{new}$, will fall within the interval $(\hat y -\sigma_e, \hat y +\sigma_e)$, with probability 68%?

Intuitively, the answer seems like it should be 'yes', but the true answer is maybe. This will be the case when the parameters (i.e., $m, b,$ & $\sigma$ ) are known and without error. Since you estimated these parameters, we need to take their uncertainty into account.
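A small simulation can make the "maybe" concrete. The sketch below is only an illustration under stated assumptions: the true line, the noise level, the five $x$ locations, and the helper names `fit_line` and `naive_coverage` are all hypothetical choices, not part of the original question. It repeatedly fits a line to noisy data and checks how often a new observation lands inside $\hat y \pm s$.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + b; returns (m, b, s), s = residual SD."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = y_bar - m * x_bar
    sse = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))  # n - 2 error degrees of freedom
    return m, b, s


def naive_coverage(xs, x_new, n_sims=20000, seed=0):
    """Fraction of simulations in which y_new falls inside y_hat +/- s."""
    rng = random.Random(seed)
    # Hypothetical "true" parameters, chosen only for this illustration.
    true_m, true_b, true_sigma = 2.0, 1.0, 1.0
    hits = 0
    for _ in range(n_sims):
        ys = [true_m * x + true_b + rng.gauss(0.0, true_sigma) for x in xs]
        m, b, s = fit_line(xs, ys)
        y_new = true_m * x_new + true_b + rng.gauss(0.0, true_sigma)
        if abs(y_new - (m * x_new + b)) <= s:
            hits += 1
    return hits / n_sims


coverage = naive_coverage(xs=[0.0, 1.0, 2.0, 3.0, 4.0], x_new=6.0)
print(f"coverage of y_hat +/- s: {coverage:.3f}")
```

With so few points and a new $x$ outside the observed range, the printed coverage should come out well below the nominal 0.68, which is the effect of ignoring the uncertainty in the estimated parameters.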

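Let's first think about the standard deviation of your residuals, $s$. Because it is estimated rather than known, the reference distribution should be $t_\text{df error}$, not the normal; the $t$ has heavier tails, so its critical values are larger. So can we simply use $\hat y_\text{new}\pm t_{(1-\alpha/2,\ \text{df error})}s$, instead of $\hat y_\text{new}\pm z_{(1-\alpha/2)}s$, and go about our merry way? Unfortunately, no. The bigger issue is that there is uncertainty about your estimate of the conditional mean of the response at that location due to the uncertainty in your estimates $\hat m$ & $\hat b$. Thus, the standard deviation of your predictions needs to incorporate more than just $s_\text{error}$. Because variances add (the error of the new observation is independent of the fitted line), the estimated variance of the predictions will be: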

$$s^2_\text{predictions(new)}=s^2_\text{error}+\text{Var}(\hat mx_\text{new}+\hat b)$$

Notice that the "$x$" is subscripted to represent the specific value for the new observation, and that the "$s^2$" is correspondingly subscripted. That is, your prediction interval is contingent on the location of the new observation along the $x$ axis. The standard deviation of your predictions can be more conveniently estimated with the following formula:

$$s_\text{predictions(new)}=\sqrt{s^2_\text{error}\left(1+\frac{1}{N}+\frac{(x_\text{new}-\bar x)^2}{\sum(x_i-\bar x)^2}\right)}$$

As an interesting side note, we can infer a few facts about prediction intervals from this equation. First, prediction intervals will be narrower the more data we had when we built the prediction model (this is because there's less uncertainty in $\hat m$ & $\hat b$). Second, predictions will be most precise if they are made at the mean of the $x$ values you used to develop your model, as the numerator for the third term will be $0$. The reason is that, at the mean of $x$, uncertainty about the estimated slope does not affect the prediction; only the uncertainty about the true vertical position of the regression line remains. Thus, some lessons to be learned for building prediction models are: that more data is helpful, not with finding 'significance', but with improving the precision of future predictions; and that you should center your data collection efforts on the interval where you will need to be making predictions in the future (to minimize that numerator), but spread the observations as widely from that center as you can (to maximize that denominator).
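For readers who want to see how the variance sum turns into the convenient formula, here is a short sketch of the step. It assumes the usual simple-regression setup (independent errors with common variance $\sigma^2$, fixed $x_i$) and uses the identity $\hat b=\bar y-\hat m\bar x$, under which $\bar y$ and $\hat m$ are uncorrelated:

$$
\begin{aligned}
\hat mx_\text{new}+\hat b &= \bar y+\hat m\,(x_\text{new}-\bar x),\\
\text{Var}(\hat mx_\text{new}+\hat b) &= \text{Var}(\bar y)+(x_\text{new}-\bar x)^2\,\text{Var}(\hat m)
= \sigma^2\left(\frac{1}{N}+\frac{(x_\text{new}-\bar x)^2}{\sum(x_i-\bar x)^2}\right).
\end{aligned}
$$

Replacing $\sigma^2$ with its estimate $s^2_\text{error}$ and adding the $s^2_\text{error}$ of the new observation itself gives the three terms inside the parentheses: $1$, $\frac{1}{N}$, and the distance term.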

Having calculated the correct value in this manner, we can then use it with the appropriate $t$ distribution as noted above.
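Putting the pieces together, the calculation might look like the following minimal sketch. The function name `prediction_interval`, the toy data, and the choice to pass the $t$ critical value in as an argument are assumptions made for illustration; the value 3.182 is the standard table entry for $t_{(0.975,\ 3)}$, which matches a 95% interval with $N-2=3$ error degrees of freedom.

```python
import math


def prediction_interval(xs, ys, x_new, t_crit):
    """Return (lower, upper, s_pred) for a new observation at x_new.

    t_crit is the critical value of the t distribution with N - 2 degrees
    of freedom, supplied by the caller (e.g. from a t table).
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m_hat = sxy / sxx
    b_hat = y_bar - m_hat * x_bar
    # Residual variance with N - 2 error degrees of freedom.
    sse = sum((y - (m_hat * x + b_hat)) ** 2 for x, y in zip(xs, ys))
    s2_error = sse / (n - 2)
    # Standard deviation of the prediction at x_new (formula from the answer).
    s_pred = math.sqrt(s2_error * (1 + 1 / n + (x_new - x_bar) ** 2 / sxx))
    y_hat = m_hat * x_new + b_hat
    return y_hat - t_crit * s_pred, y_hat + t_crit * s_pred, s_pred


# Toy data, made up purely for illustration.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
T_975_DF3 = 3.182  # t table value for 95% coverage with 3 df

at_mean = prediction_interval(xs, ys, x_new=2.0, t_crit=T_975_DF3)
far_out = prediction_interval(xs, ys, x_new=6.0, t_crit=T_975_DF3)
print("interval at the mean of x:", at_mean[:2])
print("interval at x_new = 6:   ", far_out[:2])
```

The interval at $x_\text{new}=6$ is wider than the one at $\bar x=2$, reflecting the distance term in the formula.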

— gung - Reinstate Monica