Why back propagate through time in an RNN?

In a recurrent neural network, you would usually forward propagate through several time steps, "unroll" the network, and then back propagate across the sequence of inputs.

Why wouldn't you just update the weights after each individual step in the sequence? (This is the equivalent of using a truncation length of 1, so there is nothing to unroll.) It completely eliminates the vanishing gradient problem, greatly simplifies the algorithm, would probably reduce the chances of getting stuck in local minima, and, most importantly, seems to work fine. I trained a model this way to generate text, and the results seemed comparable to the results I have seen from BPTT-trained models. I am only confused because every tutorial on RNNs I have seen says to use BPTT, almost as if it were required for proper learning, which is not the case.

Update: I added an answer

— Frobot

An interesting direction to take this research would be to compare your results against benchmarks published in the literature on standard RNN problems. That would make a really cool article.

— Sycorax says Reinstate Monica

Your "Update: I added an answer" replaced the previous edit with the architecture description and an illustration. Was that on purpose?

— amoeba says Reinstate Monica

Yes, I took it out because it didn't really seem relevant to the actual question and it took up a lot of space, but I can add it back if it helps.

— Frobot

People seem to have massive problems understanding your architecture, so I guess any additional explanations are useful. You can add it to your answer instead of your question, if you prefer.

— amoeba says Reinstate Monica

Answers:

Edit: I made a big mistake when comparing the two methods and have to change my answer. It turns out that the way I was doing it, back propagating only the current time step, actually starts out learning faster. The quick updates learn the most basic patterns very quickly. But on a larger data set and with longer training time, BPTT does in fact come out on top. I was testing a small sample for just a few epochs and assumed that whoever started out winning the race would be the winner. But this did lead me to an interesting find. If you start your training by back propagating only a single time step, then change to BPTT and slowly increase how far back you propagate, you get faster convergence.
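The warm-up idea described above could be expressed as a schedule for the truncation length. This is only a hypothetical sketch: the function name `truncation_length` and the parameters `start`, `max_len` and `grow_every` are made up for illustration, and the answer does not say how fast the length was actually increased.

```python
def truncation_length(epoch, start=1, max_len=32, grow_every=2):
    """Hypothetical schedule: how many time steps to back propagate in a given epoch.

    Starts at `start` (single-step updates) and grows by one step every
    `grow_every` epochs until it reaches `max_len` (ordinary truncated BPTT).
    """
    return min(max_len, start + epoch // grow_every)


# Print the truncation length a training loop would use at a few epochs.
for epoch in (0, 1, 2, 10, 100):
    print(epoch, truncation_length(epoch))
```

A training loop would call this once per epoch and cut each training sequence into chunks of that length before back propagating.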

— Frobot

Thank you for the update. In the source of that last image, he says this about the one-to-one setting: "Vanilla mode of processing without RNN, from fixed-sized input to fixed-sized output (e.g. image classification)". That's what we were saying. If it is as you described, it has no state and it is not an RNN. "Forward propagating through a single input before back propagating" - I'd call that an ANN. But these wouldn't perform as well with text, so something is up and I have no idea what, because I don't have the code.

— ragulpr

I didn't read that part and you're right. The model I'm using is actually the "many to many" on the far right. I assumed that in the "one to one" section there were really many of these all connected and the drawing just left it out, but that is actually one of the options on the far right that I didn't notice (it's weird to have that one there in a blog about RNNs, so I assumed they were all recurrent). I will edit

I imagined that was the case, which is why I insisted on seeing your loss function. If it is many to many, your loss is similar to $error=\sum_t(y_t-\hat{y}_t)^2$ and the model is identically an RNN. You are propagating/inputting the whole sequence but then just truncating BPTT, i.e. you'd calculate the red part in my post but not recurse further.

— ragulpr

My loss function does not sum over time. I take one input, get one output, then calculate a loss and update the weights, then move on to t+1, so there is nothing to sum. I'll add the exact loss function to the original post.

— Frobot

Just post your code. I'm not making any more guesses, this is silly.

— ragulpr

An RNN is a Deep Neural Network (DNN) where each layer may take in new input but has the same parameters. BPTT is a fancy word for back propagation on such a network, which itself is a fancy word for gradient descent.

Say that the RNN outputs $\hat{y}_t$ in each step and

$\begin{equation} error_t=(y_t-\hat{y}_t)^2 \end{equation}$

In order to learn the weights we need gradients, so that we can answer the question "how much does a change in a parameter affect the loss function?" and move the parameters in the direction given by:

$\begin{equation} \nabla error_t=-2(y_t-\hat{y}_t)\nabla \hat{y}_t \end{equation}$

That is, we have a DNN where we get feedback on how good the prediction is at each layer. Since a change in a parameter will change every layer in the DNN (timestep), and every layer contributes to the upcoming outputs, this needs to be accounted for.

Take a simple one-neuron, one-layer network to see this semi-explicitly:

$\begin{align*} \hat{y}_{t+1} =& f(a+bx_t+c\hat{y}_t)\\ \frac{\partial}{\partial a}\hat{y}_{t+1} = & f'(a+bx_t+c\hat{y}_t)\cdot (1+c\cdot \frac{\partial}{\partial a}\hat{y}_{t}) \\ \frac{\partial}{\partial b}\hat{y}_{t+1} = & f'(a+bx_t+c\hat{y}_t)\cdot (x_t+c\cdot\frac{\partial}{\partial b}\hat{y}_{t})\\ \frac{\partial}{\partial c}\hat{y}_{t+1} = & f'(a+bx_t+c\hat{y}_t)\cdot (\hat{y}_t+c\cdot\frac{\partial}{\partial c}\hat{y}_{t})\\ \iff\\ \nabla \hat{y}_{t+1} =& f'(a+bx_t+c\hat{y}_t)\cdot \left(\begin{bmatrix}1\\x_t\\\hat{y}_t \end{bmatrix} + c \mathbin{\color{red}{\nabla \hat{y}_{t}}} \right) \end{align*}$

With $\delta$ the learning rate one training step is then:

$\begin{equation} \begin{bmatrix}\tilde{a}\\\tilde{b}\\\tilde{c}\end{bmatrix} \leftarrow \begin{bmatrix}a\\b\\c\end{bmatrix} + \delta (y_{t}-\hat{y}_{t})\nabla \hat{y}_t \end{equation}$
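The recursion and the training step can be checked numerically. The following is a minimal sketch under assumptions that are not in the post: $f=\tanh$, a constant initial output `y0 = 0`, and made-up parameter and input values. The direct term for $a$ is taken to be 1, because $a$ enters the pre-activation additively. All function names are hypothetical.

```python
import math


def forward_with_gradient(params, xs, y0=0.0):
    """Return the final output y_hat and its gradient w.r.t. (a, b, c).

    The gradient is carried forward with the recursion
    grad_{t+1} = f'(z_t) * ([1, x_t, y_hat_t] + c * grad_t), with f = tanh (assumed).
    """
    a, b, c = params
    y_hat = y0
    grad = [0.0, 0.0, 0.0]  # the constant initial output does not depend on a, b, c
    for x in xs:
        z = a + b * x + c * y_hat
        f_prime = 1.0 - math.tanh(z) ** 2
        direct = [1.0, x, y_hat]  # direct dependence of z on a, b, c
        grad = [f_prime * (d + c * g) for d, g in zip(direct, grad)]
        y_hat = math.tanh(z)
    return y_hat, grad


def numerical_gradient(params, xs, eps=1e-6):
    """Central finite differences of the final output, used as a sanity check."""
    grads = []
    for i in range(3):
        hi, lo = list(params), list(params)
        hi[i] += eps
        lo[i] -= eps
        diff = forward_with_gradient(hi, xs)[0] - forward_with_gradient(lo, xs)[0]
        grads.append(diff / (2 * eps))
    return grads


def training_step(params, xs, y_target, delta=0.1):
    """One update: params <- params + delta * (y - y_hat) * grad(y_hat)."""
    y_hat, grad = forward_with_gradient(params, xs)
    return [p + delta * (y_target - y_hat) * g for p, g in zip(params, grad)]


params = [0.1, 0.5, -0.3]    # made-up values for a, b, c
xs = [0.2, -0.4, 0.9, 0.3]   # made-up input sequence
print(forward_with_gradient(params, xs)[1])
print(numerical_gradient(params, xs))
print(training_step(params, xs, y_target=0.5))
```

The two printed gradients should agree closely, which is the kind of check mentioned later in the comments.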

What we see is that in order to calculate $\nabla \hat{y}_{t+1}$ you need to calculate, i.e. roll out, $\nabla \hat{y}_{t}$. What you propose is to ~~simply disregard the red part~~ calculate the red part for $t$ but not recurse further. I assume that your loss is something like

$\begin{equation} error=\sum_t(y_t-\hat{y}_t)^2 \end{equation}$

Maybe each step will then contribute a crude direction which is enough in aggregation? This could explain your results but I'd be really interested in hearing more about your method/loss function! Also would be interested in a comparison with a two timestep windowed ANN.
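To get a feel for how crude the truncated direction is, one could compare it with the fully rolled-out gradient on a toy problem. The sketch below makes several assumptions: $f=\tanh$, a constant initial output, the summed squared error above, and made-up data. The parameter `k` (a hypothetical name) is how many steps the recursion is allowed to go back. With `k=1` only the direct term survives, and with `k` equal to the sequence length nothing is truncated.

```python
import math


def rollout(params, xs, y0=0.0):
    """Forward pass; returns outputs ys (with ys[0] = y0) and pre-activations zs."""
    a, b, c = params
    ys, zs = [y0], []
    for x in xs:
        zs.append(a + b * x + c * ys[-1])
        ys.append(math.tanh(zs[-1]))
    return ys, zs


def loss(params, xs, targets):
    """Summed squared error over all time steps."""
    ys, _ = rollout(params, xs)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(targets, ys[1:]))


def grad_output(params, xs, ys, zs, t, k):
    """Gradient of ys[t + 1] w.r.t. (a, b, c), recursing at most k steps back."""
    c = params[2]
    grad = [0.0, 0.0, 0.0]  # the output k steps back is treated as a constant
    for s in range(max(0, t - k + 1), t + 1):
        f_prime = 1.0 - math.tanh(zs[s]) ** 2
        direct = [1.0, xs[s], ys[s]]
        grad = [f_prime * (d + c * g) for d, g in zip(direct, grad)]
    return grad


def loss_gradient(params, xs, targets, k):
    """Gradient of the summed loss, with every per-step gradient truncated to k steps."""
    ys, zs = rollout(params, xs)
    total = [0.0, 0.0, 0.0]
    for t, y in enumerate(targets):
        resid = y - ys[t + 1]
        g = grad_output(params, xs, ys, zs, t, k)
        total = [acc - 2.0 * resid * gi for acc, gi in zip(total, g)]
    return total


def cosine(u, v):
    """Cosine similarity between two gradient vectors."""
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    return dot / (norm_u * norm_v)


params = [0.1, 0.5, -0.3]               # made-up a, b, c
xs = [0.2, -0.4, 0.9, 0.3, -0.7]        # made-up inputs
targets = [0.1, -0.2, 0.4, 0.0, -0.3]   # made-up targets
full = loss_gradient(params, xs, targets, k=len(xs))
for k in range(1, len(xs) + 1):
    print(k, cosine(loss_gradient(params, xs, targets, k), full))
```

A cosine close to 1 would mean the truncated gradient points almost the same way as the full one. How close it gets depends on the recurrent weight $c$ and on the data, so this toy run says nothing about a real text model.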

edit4: After reading comments it seems like your architecture is not an RNN.

RNN: Stateful - carries the hidden state $h_t$ forward indefinitely. This is your model, but the training is different.

~~Your model: Stateless - hidden state rebuilt in each step~~

edit2: added more refs to DNNs
edit3: fixed gradstep and some notation
edit5: fixed the interpretation of your model after your answer/clarification.

— ragulpr

Thank you for your answer. I think you may have misunderstood what I am doing, though. In the forward propagation I only do one step, so the back propagation is also only one step. I don't forward propagate across multiple inputs in the training sequence. I see what you mean about a crude direction that is enough in aggregation to allow learning, but I have checked my gradients against numerically calculated gradients and they match to 10+ decimal places. The back prop works fine. I am using cross entropy loss.

— Frobot

I am working on taking my same model and retraining it with BPTT as we speak to have a clear comparison. I have also trained a model using this "one step" algorithm to predict whether a stock price will rise or fall for the next day, which is getting decent accuracy, so I will have two different models to compare BPTT vs single step back prop.

— Frobot

If you only forward propagate one step, isn't this a two-layered ANN with the feature input of the last step going to the first layer and the feature input of the current step going to the second layer, but with the same weights/parameters for both layers? I'd expect similar or better results from an ANN that takes input $\hat{y}_{t+1}=f(x_t,x_{t-1})$, i.e. one that uses a fixed time window of size 2. If it only carries forward one step, can it learn long-term dependencies?

— ragulpr

I'm using a sliding window of size 1, but the results are vastly different from those of a sliding-window ANN of size 2 with inputs $(x_t, x_{t-1})$. I can purposely let it overfit when learning a huge body of text and it can reproduce the entire text with 0 errors, which requires knowing long-term dependencies that would be impossible to learn if you only had $(x_t, x_{t-1})$ as input. The only question I have left is whether using BPTT would allow the dependencies to become longer, but it honestly doesn't look like it would.

— Frobot

Look at my updated post. Your architecture is not an RNN; it's stateless, so long-term dependencies not explicitly baked into the features can't be learned. Previous predictions do not influence future predictions. You can see this as if $\frac{\partial}{\partial \hat{y}_{t-2}}\hat{y}_t =0$ for your architecture. BPTT is in theory identical to BP but performed on an RNN architecture, so you can't, but I see what you mean, and the answer is no. It would be really interesting to see experiments on a stateful RNN with only one-step BPTT though ^^
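The difference between a stateful and a stateless model can be illustrated with a finite-difference check. This is a sketch under assumptions: it uses the one-neuron model $\hat{y}_{t+1}=f(a+bx_t+c\hat{y}_t)$ with $f=\tanh$ as a stand-in for a stateful RNN, a hypothetical two-input window model as the stateless case, and made-up values. It is not the architecture under discussion.

```python
import math


def stateful_outputs(params, xs, y0=0.0):
    """Outputs of the stateful model; ys[i] is the output after i inputs."""
    a, b, c = params
    ys = [y0]
    for x in xs:
        ys.append(math.tanh(a + b * x + c * ys[-1]))
    return ys


def continue_from(params, state, xs_segment):
    """Run the stateful model forward from a given previous output."""
    a, b, c = params
    for x in xs_segment:
        state = math.tanh(a + b * x + c * state)
    return state


def stateful_sensitivity(params, xs, t, lag, eps=1e-6):
    """Finite-difference estimate of d y_hat_t / d y_hat_{t-lag}."""
    ys = stateful_outputs(params, xs)
    segment = xs[t - lag:t]
    hi = continue_from(params, ys[t - lag] + eps, segment)
    lo = continue_from(params, ys[t - lag] - eps, segment)
    return (hi - lo) / (2 * eps)


def windowed_output(params, x_t, x_prev):
    """Stateless window model: no earlier output appears among its arguments,
    so its derivative w.r.t. any earlier output is zero by construction."""
    a, b, c = params
    return math.tanh(a + b * x_t + c * x_prev)


params = [0.1, 0.5, -0.3]          # made-up a, b, c
xs = [0.2, -0.4, 0.9, 0.3, -0.7]   # made-up inputs
print(stateful_sensitivity(params, xs, t=4, lag=2))
```

In the stateful model the printed sensitivity is nonzero whenever $c \neq 0$. That is the path along which information from earlier steps can reach later predictions, whether or not the gradient is propagated back along it during training.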

— ragulpr

"Unfolding through time" is simply an application of the chain rule,

$\frac{dF(g(x), h(x), m(x))}{dx} = \frac{\partial F}{\partial g}\frac{dg}{dx} + \frac{\partial F}{\partial h}\frac{dh}{dx} + \frac{\partial F}{\partial m}\frac{dm}{dx}$

The output of an RNN at time step $t$ , $H_t$ is a function of the parameters $\theta$ , the input $x_t$ and the previous state, $H_{t-1}$ (note that instead $H_t$ may be transformed again at time step $t$ to obtain the output, that is not important here). Remember the goal of gradient descent: given some error function $L$ , let's look at our error for the current example (or examples), and then let's adjust $\theta$ in such a way, that given the same example again, our error would be reduced.

How exactly did $\theta$ contribute to our current error? We took a weighted sum with our current input, $x_t$, so we'll need to backpropagate through the input to find $\nabla_\theta a(x_t, \theta)$, to work out how to adjust $\theta$. But our error was also the result of some contribution from $H_{t-1}$, which was also a function of $\theta$, right? So we need to find out $\nabla_\theta H_{t-1}$, which was a function of $x_{t-1}$, $\theta$ and $H_{t-2}$. But $H_{t-2}$ was also a function of $\theta$. And so on.
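Written out, the "and so on" is a recursion that unrolls into a sum over all earlier time steps. As a sketch, assume the initial state $H_0$ does not depend on $\theta$, and write $\frac{\partial H_t}{\partial \theta}$ for the direct dependence of $H_t$ on $\theta$ with $H_{t-1}$ held fixed (notation introduced here for illustration):

$\begin{align*} \frac{dH_t}{d\theta} =& \frac{\partial H_t}{\partial \theta} + \frac{\partial H_t}{\partial H_{t-1}}\frac{dH_{t-1}}{d\theta}\\ =& \sum_{k=1}^{t}\left(\prod_{j=k+1}^{t}\frac{\partial H_j}{\partial H_{j-1}}\right)\frac{\partial H_k}{\partial \theta} \end{align*}$

For vector-valued states the product is ordered from $j=t$ down to $j=k+1$. The term with $k=t$ has an empty product and is just the direct dependence $\frac{\partial H_t}{\partial \theta}$. Back propagating only a single step keeps that one term and drops the rest of the sum.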

— Matthew Hampsey

I understand why you back propagate through time in a traditional RNN. I'm trying to find out why a traditional RNN uses multiple inputs at once for training, when using just one at a time is much simpler and also works.

— Frobot

The only sense in which you can feed in multiple inputs at once into an RNN is feeding in multiple training examples, as part of a batch. The batch size is arbitrary, and convergence is guaranteed for any size, but higher batch sizes may lead to more accurate gradient estimations and faster convergence.

— Matthew Hampsey

That's not what I meant by "multiple inputs at once". I didn't word it very well. I meant that you usually forward propagate through several inputs in the training sequence, then back propagate through them all, then update the weights. So the question is: why propagate through a whole sequence when doing just one input at a time is much easier and still works?

— Frobot

I think some clarification here is required. When you say "inputs", are you referring to multiple training examples, or are you referring to multiple time steps within a single training example?

— Matthew Hampsey

I will post an answer to this question by the end of today. I finished making a BPTT version, just have to train and compare. After that if you still want to see some code let me know what you want to see and I guess I could still post it

— Frobot