Deep Neural Network - Backpropagation with ReLU

I am having some difficulty deriving backpropagation with ReLU. I have done some work, but I am not sure if I am on the right track.

Cost function: $\frac{1}{2}(y-\hat y)^2$, where $y$ is the real value and $\hat y$ is the predicted value. Also assume that $x$ > 0 always.

1 layer ReLU, where the weight at the 1st layer is $w_1$

$\frac{dC}{dw_1}=\frac{dC}{dR}\frac{dR}{dw_1}$

$\frac{dC}{dw_1}=(y-ReLU(w_1x))(x)$

2 layer ReLU, where the weight at the 1st layer is $w_2$ and at the 2nd layer is $w_1$. I wanted to update the 1st layer, $w_2$.

$\frac{dC}{dw_2}=\frac{dC}{dR}\frac{dR}{dw_2}$

$\frac{dC}{dw_2}=(y-ReLU(w_1*ReLU(w_2x)))(w_1x)$

Since $ReLU(w_1*ReLU(w_2x))=w_1w_2x$

3 layer ReLU, where the weight at the 1st layer is $w_3$, at the 2nd layer $w_2$ and at the 3rd layer $w_1$

$\frac{dC}{dw_3}=\frac{dC}{dR}\frac{dR}{dw_3}$

$\frac{dC}{dw_3}=(y-ReLU(w_1*ReLU(w_2*ReLU(w_3x))))(w_1w_2x)$

Since $ReLU(w_1*ReLU(w_2*ReLU(w_3x)))=w_1w_2w_3x$

So the chain rule only ever has 2 derivatives, compared to a sigmoid, where it could be as long as $n$, the number of layers.

Say I wanted to update the weights of all 3 layers, where $w_1$ is the 3rd layer, $w_2$ is the 2nd layer and $w_3$ is the 1st layer

$\frac{dC}{dw_1}=(y-ReLU(w_1x))(x)$

$\frac{dC}{dw_2}=(y-ReLU(w_1*ReLU(w_2x)))(w_1x)$

$\frac{dC}{dw_3}=(y-ReLU(w_1*ReLU(w_2*ReLU(w_3x))))(w_1w_2x)$

If this derivation is correct, how does this prevent vanishing? With sigmoid we have a lot of multiplication by 0.25 in the equation, whereas ReLU does not have any constant value multiplication. If there's thousands of layers, there would be a lot of multiplication due to weights, then wouldn't this cause vanishing or exploding gradient?

neural-network backpropagation

— user1157751
@NeilSlater Thanks for your reply! Can you elaborate? I'm not sure what you meant.

— user1157751

Ah, I think I know what you meant. The reason I raised this question is that I want to be sure the derivation is correct. I searched around and didn't find an example of ReLU derived entirely from scratch.

— user1157751

Working definitions of the ReLU function and its derivative:

$ReLU(x) = \begin{cases} 0, & \text{if } x < 0, \\ x, & \text{otherwise}. \end{cases}$

$\frac{d}{dx} ReLU(x) = \begin{cases} 0, & \text{if } x < 0, \\ 1, & \text{otherwise}. \end{cases}$

The derivative is the unit step function. This ignores a problem at $x=0$, where the gradient is not strictly defined, but that is not a practical concern for neural networks. With the above formula, the derivative at 0 is 1, but you could equally treat it as 0 or 0.5 with no real impact on neural network performance.
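As a minimal sketch, the two working definitions can be written as plain Python functions. The names `relu` and `step` are illustrative choices, and the value at exactly 0 follows the convention in the formula above (derivative 1).

```python
def relu(x):
    """ReLU as defined above: 0 for x < 0, x otherwise."""
    return x if x >= 0 else 0.0


def step(x):
    """Derivative of ReLU under the convention above: 0 for x < 0, 1 otherwise."""
    return 0.0 if x < 0 else 1.0


# Evaluate both functions on a negative value, zero, and a positive value.
for x in (-2.0, 0.0, 3.0):
    print(x, relu(x), step(x))
```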

Simplified network

With those definitions, let's take a look at your example networks.

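You are running regression with cost function $C = \frac{1}{2}(y-\hat{y})^2$. You have written $R$ for the output of the neuron; I will also name its input $z$ and index both by layer, so $r^{(1)}$ is the output of the first layer, $z^{(1)}$ is its input, and $W^{(0)}$ is the weight connecting the neuron to its input $x$ (in a larger network, that might connect to a deeper $r$ value instead). I have also adjusted the index number for the weight matrix - why will become clearer for the larger network. NB I am ignoring having more than one neuron in each layer for now.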

Looking at your simple 1 layer, 1 neuron network, the feed-forward equations are:

$z^{(1)} = W^{(0)}x$

$\hat{y} = r^{(1)} = ReLU(z^{(1)})$

The derivative of the cost function with respect to an example estimate is:

$\frac{\partial C}{\partial \hat{y}} = \frac{\partial C}{\partial r^{(1)}} = \frac{\partial}{\partial r^{(1)}}\frac{1}{2}(y-r^{(1)})^2 = \frac{1}{2}\frac{\partial}{\partial r^{(1)}}(y^2 - 2yr^{(1)} + (r^{(1)})^2) = r^{(1)} - y$

Using the chain rule to back propagate to the pre-transform ($z$) value:

$\frac{\partial C}{\partial z^{(1)}} = \frac{\partial C}{\partial r^{(1)}} \frac{\partial r^{(1)}}{\partial z^{(1)}} = (r^{(1)} - y)Step(z^{(1)}) = (ReLU(z^{(1)}) - y)Step(z^{(1)})$

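This $\frac{\partial C}{\partial z^{(1)}}$ is an intermediate value. To get the gradient with respect to the weight $W^{(0)}$, apply the chain rule once more: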

$\frac{\partial C}{\partial W^{(0)}} = \frac{\partial C}{\partial z^{(1)}} \frac{\partial z^{(1)}}{\partial W^{(0)}} = (ReLU(z^{(1)}) - y)Step(z^{(1)})x = (ReLU(W^{(0)}x) - y)Step(W^{(0)}x)x$

...because $z^{(1)} = W^{(0)}x$, therefore $\frac{\partial z^{(1)}}{\partial W^{(0)}} = x$

That is the full solution for your simplest network.
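A quick way to sanity-check the single-neuron result is to compare it with a finite-difference estimate. The sketch below assumes scalar values; the function names and the numbers `w0 = 0.8`, `x = 2.0`, `y = 1.0` are illustrative assumptions, not values from the question.

```python
def relu(z):
    return z if z >= 0 else 0.0


def step(z):
    return 0.0 if z < 0 else 1.0


def cost(w0, x, y):
    """C = 1/2 (y - ReLU(W0 x))^2 for the 1 layer, 1 neuron network."""
    return 0.5 * (y - relu(w0 * x)) ** 2


def grad_w0(w0, x, y):
    """dC/dW0 = (ReLU(W0 x) - y) * Step(W0 x) * x, as derived above."""
    z1 = w0 * x
    return (relu(z1) - y) * step(z1) * x


# Illustrative values (assumptions for this sketch).
w0, x, y = 0.8, 2.0, 1.0

# Central finite difference as an independent estimate of dC/dW0.
eps = 1e-6
numeric = (cost(w0 + eps, x, y) - cost(w0 - eps, x, y)) / (2 * eps)

print("analytic:", grad_w0(w0, x, y))
print("numeric: ", numeric)
```

With these values the neuron is active ($W^{(0)}x = 1.6 > 0$), so the two numbers should agree closely. With a negative weight the step function is 0 and the analytic gradient is 0 as well.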

However, in a layered network, you also need to carry the same logic down to the next layer. Also, you typically have more than one neuron in a layer.

More general ReLU network

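In more generic terms, we can work with two arbitrary adjacent layers. Call them layer $(k)$, indexed by $i$, and layer $(k+1)$, indexed by $j$. The weights are now a matrix, so the feed-forward equations look like this: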

$z^{(k+1)}_j = \sum_{\forall i} W^{(k)}_{ij}r^{(k)}_i$

$r^{(k+1)}_j = ReLU(z^{(k+1)}_j)$

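In the output layer, the initial gradient with respect to $r^{output}_j$ is still $r^{output}_j - y_j$. For the generic backward step, assume we have already found $\frac{\partial C}{\partial r^{(k+1)}_j}$. Then there are 3 equations we can write out following the chain rule: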

1. First we need to get to the neuron input before applying ReLU:

$\frac{\partial C}{\partial z^{(k+1)}_j} = \frac{\partial C}{\partial r^{(k+1)}_j} \frac{\partial r^{(k+1)}_j}{\partial z^{(k+1)}_j} = \frac{\partial C}{\partial r^{(k+1)}_j}Step(z^{(k+1)}_j)$

2. We also need to propagate the gradient to previous layers, which involves summing up all connected influences to each neuron:

$\frac{\partial C}{\partial r^{(k)}_i} = \sum_{\forall j} \frac{\partial C}{\partial z^{(k+1)}_j} \frac{\partial z^{(k+1)}_j}{\partial r^{(k)}_i} = \sum_{\forall j} \frac{\partial C}{\partial z^{(k+1)}_j} W^{(k)}_{ij}$

3. We need to connect this to the weights matrix in order to make adjustments later:

$\frac{\partial C}{\partial W^{(k)}_{ij}} = \frac{\partial C}{\partial z^{(k+1)}_j} \frac{\partial z^{(k+1)}_j}{\partial W^{(k)}_{ij}} = \frac{\partial C}{\partial z^{(k+1)}_j} r^{(k)}_{i}$

You can resolve these further (by substituting in previous values), or combine them (often steps 1 and 2 are combined to relate pre-transform gradients layer by layer). However, the above is the most general form. You can also substitute the $Step(z^{(k+1)}_j)$ in equation 1 for the derivative of whatever activation function you are using - this is the only place where it affects the calculations.
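The three layer-by-layer equations can be sketched in plain Python, using nested lists so that the code mirrors the index notation ($W^{(k)}_{ij}$ becomes `W[i][j]`). The function names `forward` and `backward`, the 2-input, 2-output layer, and all numeric values are assumptions made for illustration. The layer is treated as the output layer so that the starting gradient is $r_j - y_j$.

```python
def relu(z):
    return z if z >= 0 else 0.0


def step(z):
    return 0.0 if z < 0 else 1.0


def forward(W, r_prev):
    """z_j = sum_i W[i][j] * r_prev[i];  r_j = ReLU(z_j)."""
    n_in, n_out = len(W), len(W[0])
    z = [sum(W[i][j] * r_prev[i] for i in range(n_in)) for j in range(n_out)]
    return z, [relu(v) for v in z]


def backward(W, r_prev, z, dC_dr):
    """Apply the three chain-rule equations to one layer."""
    n_in, n_out = len(W), len(W[0])
    # 1. Gradient at the neuron inputs: dC/dz_j = dC/dr_j * Step(z_j)
    dC_dz = [dC_dr[j] * step(z[j]) for j in range(n_out)]
    # 2. Gradient passed to the previous layer: dC/dr_i = sum_j dC/dz_j * W_ij
    dC_dr_prev = [sum(dC_dz[j] * W[i][j] for j in range(n_out))
                  for i in range(n_in)]
    # 3. Gradient for the weights: dC/dW_ij = dC/dz_j * r_i
    dC_dW = [[dC_dz[j] * r_prev[i] for j in range(n_out)]
             for i in range(n_in)]
    return dC_dr_prev, dC_dW


# Illustrative values (assumptions): the second neuron ends up inactive.
W = [[0.5, -1.0],
     [0.25, 0.25]]
r_prev = [1.0, 2.0]
y = [0.5, -1.0]

z, r = forward(W, r_prev)
dC_dr = [r[j] - y[j] for j in range(len(y))]  # output-layer gradient r_j - y_j
dC_dr_prev, dC_dW = backward(W, r_prev, z, dC_dr)

print("z:", z, "r:", r)
print("dC/dr (previous layer):", dC_dr_prev)
print("dC/dW:", dC_dW)
```

In this example the second neuron has a negative input, so its step function is 0 and no gradient flows through it: its column of `dC_dW` is zero and it contributes nothing to `dC_dr_prev`. Swapping `step` for another activation derivative is the only change needed to use a different transfer function.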

Back to your questions:

If this derivation is correct, how does this prevent vanishing?

Your derivation was not correct. However, that does not completely address your concerns.

The difference between using sigmoid versus ReLU is just in the step function compared to e.g. sigmoid's $y(1-y)$ , applied once per layer. As you can see from the generic layer-by-layer equations above, the gradient of the transfer function appears in one place only. The sigmoid's best case derivative adds a factor of 0.25 (when $x = 0, y = 0.5$ ), and it gets worse than that and saturates quickly to near zero derivative away from $x=0$ . The ReLU's gradient is either 0 or 1, and in a healthy network will be 1 often enough to have less gradient loss during backpropagation. This is not guaranteed, but experiments show that ReLU has good performance in deep networks.
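To put rough numbers on this, the sketch below multiplies only the activation-derivative factors across a hypothetical 10-layer path, ignoring the weights entirely. The depth and the saturated input value of 5.0 are assumptions chosen for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # y(1 - y)


layers = 10  # hypothetical depth

# Best case for sigmoid: every neuron sits at x = 0, factor 0.25 per layer.
best_case_sigmoid = sigmoid_grad(0.0) ** layers
# A saturated case: every neuron sits at x = 5, far from 0.
saturated_sigmoid = sigmoid_grad(5.0) ** layers
# ReLU along a path where every neuron is active: factor 1 per layer.
active_relu = 1.0 ** layers

print("sigmoid, best case:", best_case_sigmoid)
print("sigmoid, saturated:", saturated_sigmoid)
print("ReLU, active path: ", active_relu)
```

Even in its best case the sigmoid path scales the gradient by $0.25^{10}$, which is below $10^{-6}$, while an active ReLU path leaves it unchanged. A ReLU path containing an inactive neuron contributes 0 instead, which is why this is not a guarantee.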

If there's thousands of layers, there would be a lot of multiplication due to weights, then wouldn't this cause vanishing or exploding gradient?

Yes, this can have an impact too. This can be a problem regardless of transfer function choice. In some combinations, ReLU may help keep exploding gradients under control too, because it does not saturate (so large weight norms will tend to be poor direct solutions and an optimiser is unlikely to move towards them). However, this is not guaranteed.
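The weight contribution can be illustrated with a deliberately simplified case: a single path of always-active ReLU neurons where every layer has the same scalar weight. The helper `weight_factor`, the depth of 50 and the three weight values are assumptions for this sketch, not a model of a real network.

```python
def weight_factor(w, layers):
    """Product of the weights along a single always-active ReLU path."""
    factor = 1.0
    for _ in range(layers):
        factor *= w
    return factor


layers = 50  # hypothetical depth
for w in (0.5, 1.0, 1.5):
    print("w =", w, "-> gradient scale", weight_factor(w, layers))
```

Weights below 1 shrink the gradient towards zero and weights above 1 blow it up, independently of the activation function, since here the activation derivative is 1 at every layer.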

— Neil Slater
Was a chain rule performed on $\frac{dC}{d \hat y}$?

— user1157751

@user1157751: No, $\frac{\partial C}{\partial \hat{y}} = \frac{\partial C}{\partial r^{(1)}}$ because $\hat{y} = r^{(1)}$. The cost function C is simple enough that you can take its derivative immediately. The only thing I haven't shown there is the expansion of the square - would you like me to add it?

— Neil Slater

But $C$ is $\frac{1}{2}(y- \hat y)^2$, don't we need to perform chain rule so that we can perform the derivative on $\hat y$? $\frac{dC}{d \hat y}=\frac{dC}{dU}\frac{dU}{d \hat y}$, where $U = y - \hat y$. Apologies for asking really simple questions, my maths ability is probably causing trouble for you :(

— user1157751

If you can make things simpler by expanding, then please do expand the square.

— user1157751

@user1157751: Yes you could use the chain rule in that way, and it would give the same answer as I show. I just expanded the square - I'll show it.

— Neil Slater
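
For reference, a short sketch of the chain-rule route discussed in these comments. With $U = y - \hat y$ the cost is $C = \frac{1}{2}U^2$, so $\frac{dC}{dU} = U$ and $\frac{dU}{d \hat y} = -1$, giving $\frac{dC}{d \hat y} = \frac{dC}{dU}\frac{dU}{d \hat y} = -(y - \hat y) = \hat y - y$. This matches the $r^{(1)} - y$ obtained by expanding the square, since $\hat{y} = r^{(1)}$.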