Why is using Newton's method to optimize logistic regression called iteratively reweighted least squares?

It doesn't seem clear to me, because logistic loss and least squares loss are completely different things.

— Haitao Du

I don't think they are the same. IRLS is Newton-Raphson with the expected Hessian rather than the observed Hessian.

— Dimitriy V. Masterov

@DimitriyV.Masterov thanks, could you tell me more about the expected Hessian versus the observed one? Also, what do you think about this explanation

— Haitao Du

See also stats.stackexchange.com/questions/236676/…

— kjetil b halvorsen

Summary: GLMs are fit via Fisher scoring which, as Dimitriy V. Masterov notes, is Newton-Raphson with the expected Hessian instead (i.e. we use an estimate of the Fisher information instead of the observed information). If we are using the canonical link function it turns out that the observed Hessian equals the expected Hessian, so NR and Fisher scoring are the same in that case. Either way, we'll see that Fisher scoring is actually fitting a weighted least squares linear model, and the coefficient estimates from this converge$^*$ on a maximum of the logistic regression likelihood. Aside from reducing fitting a logistic regression to an already solved problem, we also get the benefit of being able to use linear regression diagnostics on the final WLS fit to learn about our logistic regression.

I'm going to keep this focused on logistic regression, but for a more general perspective on maximum likelihood in GLMs I recommend section 15.3 of this chapter, which goes through this and derives IRLS in a more general setting (I think it's from John Fox's Applied Regression Analysis and Generalized Linear Models).

$^*$ see comments at the end

Likelihood and score function

We will be fitting our GLM by iterating something of the form

$b^{(m+1)} = b^{(m)} - J^{-1}_{(m)}\nabla \ell(b^{(m)})$

where $\ell$ is the log likelihood and $J_{(m)}$ will be either the observed or expected Hessian of the log likelihood.

Our link function is a function $g$ that maps the conditional mean $\mu_i = E(y_i | x_i)$ to our linear predictor, so our model for the mean is $g(\mu_i) = x_i^T\beta$. Let $h$ be the inverse link function mapping the linear predictor to the mean.

For a logistic regression we have a Bernoulli likelihood with independent observations, so

$\ell(b; y) = \sum_{i=1}^n y_i\log h(x_i^T b) + (1 - y_i) \log(1 - h(x_i^Tb)).$

Taking derivatives,

$\frac{\partial \ell}{\partial b_j} = \sum_{i=1}^n \frac{y_i}{h(x_i^T b)} h'(x_i^T b) x_{ij} - \frac{1 - y_i}{1 - h(x_i^T b)} h'(x_i^T b) x_{ij}$

$= \sum_{i=1}^n x_{ij} h'(x_i^T b) \left(\frac{y_i}{h(x_i^T b)} - \frac{1 - y_i}{1 - h(x_i^T b)} \right)$

$= \sum_i x_{ij} \frac{h'(x_i^T b)}{h(x_i^T b)(1 - h(x_i^T b))}(y_i - h(x_i^T b)).$
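As a sanity check on the score formula above, here is a minimal sketch that compares it with central finite differences of the log likelihood under a probit inverse link. The simulated data, the evaluation point `b0`, and the helper names `loglik`, `score` and `num_grad` are all assumptions made for illustration; none of them come from the answer itself.

```r
set.seed(1)
n <- 200; p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, pnorm(X %*% c(0.5, -1, 0.25)))

# Bernoulli log likelihood for an arbitrary inverse link h
loglik <- function(b, h) {
  mu <- as.vector(h(X %*% b))
  sum(y * log(mu) + (1 - y) * log(1 - mu))
}

# Analytic score: sum_i x_ij * h' / (h (1 - h)) * (y_i - h)
score <- function(b, h, h.prime) {
  eta <- as.vector(X %*% b)
  mu <- h(eta)
  as.vector(t(X) %*% (h.prime(eta) / (mu * (1 - mu)) * (y - mu)))
}

# Central finite differences, one coordinate at a time
num_grad <- function(f, b, eps = 1e-6) {
  sapply(seq_along(b), function(j) {
    e <- numeric(length(b))
    e[j] <- eps
    (f(b + e) - f(b - e)) / (2 * eps)
  })
}

b0 <- c(0.1, -0.2, 0.3)
g_analytic <- score(b0, pnorm, dnorm)
g_numeric <- num_grad(function(b) loglik(b, pnorm), b0)
max(abs(g_analytic - g_numeric))  # largest discrepancy between the two gradients
```

If the derivation is right, the two gradients should agree up to finite-difference error.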

Using the canonical link

Now let's suppose we're using the canonical link function $g_c = \text{logit}$. Then $g^{-1}_c(x) := h_c(x) = \frac{1}{1+e^{-x}}$ so $h_c' = h_c \cdot (1-h_c)$, which means this simplifies to

$\frac{\partial \ell}{\partial b_j} = \sum_i x_{ij} (y_i - h_c(x_i^T b))$

so, with $\hat y_i = h_c(x_i^T b)$,

$\nabla \ell (b; y) = X^T (y - \hat y).$

Furthermore, still using $h_c$,

$\frac{\partial^2 \ell}{\partial b_k \partial b_j} = - \sum_i x_{ij} \frac{\partial}{\partial b_k} h_c(x_i^T b) = - \sum_i x_{ij}x_{ik} \left[h_c(x_i^T b) (1 - h_c(x_i^T b))\right].$
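The identity $h_c' = h_c \cdot (1-h_c)$ used in both the gradient and the Hessian can be checked directly from the definition of $h_c$:

$h_c'(x) = \frac{e^{-x}}{(1+e^{-x})^2} = \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}} = h_c(x)\,(1 - h_c(x)),$

since $1 - h_c(x) = \frac{e^{-x}}{1+e^{-x}}$.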

Let

$W = \text{diag}\left(h_c(x_1^T b)(1 - h_c(x_1^T b)), \dots, h_c(x_n^T b)(1 - h_c(x_n^T b))\right) = \text{diag}\left(\hat y_1(1 - \hat y_1), \dots, \hat y_n (1 - \hat y_n)\right).$

Then we have

$H = -X^TWX$

and note how this doesn't have any $y_i$ in it anymore, so $E(H) = H$ (we're viewing this as a function of $b$ so the only random thing is $y$ itself). Thus we've shown that Fisher scoring is equivalent to Newton-Raphson when we use the canonical link in logistic regression. Also by virtue of $\hat y_i \in (0,1)$, $-X^TWX$ will always be strictly negative definite (given that $X$ has full column rank), although numerically if $\hat y_i$ gets too close to $0$ or $1$ then we may have weights round to $0$, which can make $H$ negative semidefinite and therefore computationally singular.

Now define the working response $z = W^{-1}(y - \hat y)$ and note that

$\nabla \ell = X^T(y - \hat y) = X^T W z.$

All together this means that we can optimize the log likelihood by iterating

$b^{(m+1)} = b^{(m)} + (X^T W_{(m)} X)^{-1}X^T W_{(m)} z_{(m)}$

and $(X^T W_{(m)} X)^{-1}X^T W_{(m)} z_{(m)}$ is exactly $\hat \beta$ for a weighted least squares regression of $z_{(m)}$ on $X$.
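A single update can be sketched in isolation to see that the Newton-Raphson step and the WLS coefficients are the same vector. The data and the starting point `b` below are made up for illustration, and `plogis` is just base R's logistic function playing the role of $h_c$.

```r
set.seed(3)
n <- 100; p <- 2
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(X %*% c(1, -1)))
b <- c(0, 0)                      # arbitrary current iterate

y.hat <- as.vector(plogis(X %*% b))
w <- y.hat * (1 - y.hat)          # diagonal of W
z <- (y - y.hat) / w              # working response z = W^{-1} (y - y.hat)

# Newton-Raphson step -H^{-1} grad, with H = -X^T W X and grad = X^T (y - y.hat)
newton_step <- as.vector(solve(t(X) %*% (w * X), t(X) %*% (y - y.hat)))

# The same step obtained as a weighted least squares regression of z on X
wls_step <- unname(coef(lm(z ~ X - 1, weights = w)))

max(abs(newton_step - wls_step))  # difference between the two steps
```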

Checking this in R:

set.seed(123)
p <- 5
n <- 500
x <- matrix(rnorm(n * p), n, p)
betas <- runif(p, -2, 2)
hc <- function(x) 1 /(1 + exp(-x)) # inverse canonical link
p.true <- hc(x %*% betas)
y <- rbinom(n, 1, p.true)

# fitting with our procedure
my_IRLS_canonical <- function(x, y, b.init, hc, tol=1e-8) {
  change <- Inf
  b.old <- b.init
  while(change > tol) {
    eta <- x %*% b.old  # linear predictor
    y.hat <- hc(eta)
    h.prime_eta <- y.hat * (1 - y.hat)
    z <- (y - y.hat) / h.prime_eta

    b.new <- b.old + lm(z ~ x - 1, weights = h.prime_eta)$coef  # WLS regression
    change <- sqrt(sum((b.new - b.old)^2))
    b.old <- b.new
  }
  b.new
}

my_IRLS_canonical(x, y, rep(1,p), hc)
# x1         x2         x3         x4         x5 
# -1.1149687  2.1897992  1.0271298  0.8702975 -1.2074851

glm(y ~ x - 1, family=binomial())$coef
# x1         x2         x3         x4         x5 
# -1.1149687  2.1897992  1.0271298  0.8702975 -1.2074851

and they agree.
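The point from the summary about reusing linear regression diagnostics can be sketched the same way. The idea, under the assumptions of this toy example, is to rebuild the last WLS problem from a converged `glm` fit and compare leverages. Regressing $Xb + z$ on $X$ returns $b$ plus the WLS step, so at convergence it should return the fitted coefficients themselves. The data and the names `z_work`, `coef_gap` and `hat_gap` are hypothetical.

```r
set.seed(5)
n <- 300; p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(X %*% c(1, -1, 0.5)))

fit <- glm(y ~ X - 1, family = binomial())

# Rebuild the final WLS problem from the converged fit
eta <- fit$linear.predictors
mu <- fit$fitted.values
w <- mu * (1 - mu)               # diagonal of W at convergence
z_work <- eta + (y - mu) / w     # X b + z: working response with the fit added back

wls <- lm(z_work ~ X - 1, weights = w)

coef_gap <- max(abs(coef(wls) - coef(fit)))
hat_gap <- max(abs(hatvalues(wls) - hatvalues(fit)))
c(coef_gap = coef_gap, hat_gap = hat_gap)
```

Both gaps should be negligible, which is what lets standard WLS tools such as `hatvalues` be read as diagnostics for the logistic fit.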

Non-canonical link functions

Now if we're not using the canonical link we don't get the simplification of $\frac{h'}{h(1-h)} = 1$ in $\nabla \ell$ so $H$ becomes much more complicated, and we therefore see a noticeable difference by using $E(H)$ in our Fisher scoring.

Here's how this will go: we already worked out the general $\nabla \ell$ so the Hessian will be the main difficulty. We need

$\frac{\partial^2 \ell}{\partial b_k \partial b_j} = \sum_i x_{ij} \frac{\partial}{\partial b_k}\left[h'(x_i^T b) \left(\frac{y_i}{h(x_i^T b)} - \frac{1 - y_i}{1 - h(x_i^T b)} \right)\right]$

$= \sum_i x_{ij}x_{ik} \left[h''(x_i^T b) \left(\frac{y_i}{h(x_i^T b)} - \frac{1 - y_i}{1 - h(x_i^T b)} \right) - h'(x_i^T b)^2\left(\frac{y_i}{h(x_i^T b)^2} + \frac{1-y_i}{(1-h(x_i^T b))^2} \right)\right]$

Via the linearity of expectation all we need to do to get $E(H)$ is replace each occurrence of $y_i$ with its mean under our model, which is $\mu_i=h(x_i^T\beta)$. Each term in the summand will therefore contain a factor of the form

$h''(x_i^T b) \left(\frac{h(x_i^T \beta)}{h(x_i^T b)} - \frac{1 - h(x_i^T \beta)}{1 - h(x_i^T b)} \right) - h'(x_i^T b)^2\left(\frac{h(x_i^T \beta)}{h(x_i^T b)^2} + \frac{1-h(x_i^T \beta)}{(1-h(x_i^T b))^2} \right).$

But to actually do our optimization we'll need to estimate each $\beta$, and at step $m$, $b^{(m)}$ is the best guess we have. This means that this will reduce to

$h''(x_i^T b) \left(\frac{h(x_i^T b)}{h(x_i^T b)} - \frac{1 - h(x_i^T b)}{1 - h(x_i^T b)} \right) - h'(x_i^T b)^2\left(\frac{h(x_i^T b)}{h(x_i^T b)^2} + \frac{1-h(x_i^T b)}{(1-h(x_i^T b))^2} \right)$

$= - h'(x_i^T b)^2\left(\frac{1}{h(x_i^T b)} + \frac{1}{1-h(x_i^T b)} \right)$

$= -\frac{h'(x_i^T b)^2}{h(x_i^T b)(1-h(x_i^T b))}.$

This means we will use $J$ with

$J_{jk} = -\sum_i x_{ij}x_{ik} \frac{h'(x_i^T b)^2}{h(x_i^T b)(1-h(x_i^T b))}.$

Now let

$W^* = \text{diag}\left(\frac{h'(x_1^T b)^2}{h(x_1^T b)(1-h(x_1^T b))} ,\dots, \frac{h'(x_n^T b)^2}{h(x_n^T b)(1-h(x_n^T b))}\right)$

and note how under the canonical link $h_c' = h_c \cdot (1-h_c)$ reduces $W^*$ to $W$ from the previous section. This lets us write

$J = -X^TW^*X$

except this is now $\hat E(H)$ rather than necessarily being $H$ itself, so this can differ from Newton-Raphson. For all $i$, $W_{ii}^* > 0$, so aside from numerical issues $J$ will be negative definite.
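To make the difference between $H$ and $J$ concrete, the following sketch evaluates both at an arbitrary point `b` for the logit and probit inverse links. The simulated data and the function names `observed_hessian` and `expected_hessian` are illustrative assumptions. The second derivatives are $h_c'' = h_c'(1 - 2h_c)$ for logit and $h''(\eta) = -\eta\,\phi(\eta)$ for probit.

```r
set.seed(2)
n <- 300; p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(X %*% c(1, -0.5, 0.25)))
b <- c(0.2, 0.1, -0.3)            # arbitrary evaluation point

# Observed Hessian from the general second-derivative formula
observed_hessian <- function(b, h, h1, h2) {
  eta <- as.vector(X %*% b)
  mu <- h(eta)
  d <- h2(eta) * (y / mu - (1 - y) / (1 - mu)) -
    h1(eta)^2 * (y / mu^2 + (1 - y) / (1 - mu)^2)
  t(X) %*% (d * X)
}

# Estimated expected Hessian J = -X^T W* X
expected_hessian <- function(b, h, h1) {
  eta <- as.vector(X %*% b)
  mu <- h(eta)
  -t(X) %*% (h1(eta)^2 / (mu * (1 - mu)) * X)
}

logit_h2 <- function(eta) dlogis(eta) * (1 - 2 * plogis(eta))
probit_h2 <- function(eta) -eta * dnorm(eta)

gap_logit <- max(abs(observed_hessian(b, plogis, dlogis, logit_h2) -
                     expected_hessian(b, plogis, dlogis)))
gap_probit <- max(abs(observed_hessian(b, pnorm, dnorm, probit_h2) -
                      expected_hessian(b, pnorm, dnorm)))
c(logit = gap_logit, probit = gap_probit)
```

The logit gap should be at the level of floating point error, while the probit gap should not, which is the sense in which Fisher scoring and Newton-Raphson part ways outside the canonical link.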

We have

$\frac{\partial \ell}{\partial b_j} = \sum_i x_{ij} \frac{h'(x_i^T b)}{h(x_i^T b)(1 - h(x_i^T b))}(y_i - h(x_i^T b))$

so letting our new working response be $z^* = D^{-1}(y-\hat y)$ with $D=\text{diag}\left(h'(x_1^T b), \dots, h'(x_n^T b)\right)$, we have $\nabla \ell = X^TW^*z^*$.

All together we are iterating

$b^{(m+1)} = b^{(m)} + (X^T W_{(m)}^* X)^{-1}X^T W_{(m)}^* z_{(m)}^*$

so this is still a sequence of WLS regressions except now it's not necessarily Newton-Raphson.

I've written it out this way to emphasize the connection to Newton-Raphson, but frequently people will factor the updates so that each new point $b^{(m+1)}$ is itself the WLS solution, rather than a WLS solution added to the current point $b^{(m)}$. If we wanted to do this, we can do the following:

$b^{(m+1)} = b^{(m)} + (X^T W_{(m)}^* X)^{-1}X^T W_{(m)}^* z_{(m)}^*$

$= (X^T W_{(m)}^* X)^{-1}\left(X^T W_{(m)}^* Xb^{(m)}+ X^TW^*_{(m)}z_{(m)}^* \right)$

$= (X^T W_{(m)}^* X)^{-1}X^TW_{(m)}^*\left(Xb^{(m)}+ z_{(m)}^* \right)$

so if we're going this way you'll see the working response take the form $\eta^{(m)} + D^{-1}_{(m)}(y - \hat y^{(m)})$, but it's the same thing.
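A minimal sketch of this factored form, where each iterate is itself the WLS solution on the working response $\eta + D^{-1}(y - \hat y)$, might look as follows. The function name `irls_direct`, the `max.iter` cap, and the simulated data are assumptions added for illustration; the update itself is the one derived above.

```r
set.seed(4)
n <- 400; p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, pnorm(X %*% c(0.5, -0.75, 0.25)))

irls_direct <- function(X, y, b, h, h.prime, tol = 1e-8, max.iter = 50) {
  for (m in seq_len(max.iter)) {
    eta <- as.vector(X %*% b)            # linear predictor
    mu <- h(eta)
    d <- h.prime(eta)                    # diagonal of D
    w_star <- d^2 / (mu * (1 - mu))      # diagonal of W*
    z_work <- eta + (y - mu) / d         # eta + D^{-1} (y - y.hat)
    b.new <- coef(lm(z_work ~ X - 1, weights = w_star))  # new iterate is the WLS fit
    if (sqrt(sum((b.new - b)^2)) < tol) break
    b <- b.new
  }
  b.new
}

fit_direct <- irls_direct(X, y, rep(0, p), pnorm, dnorm)
fit_glm <- coef(glm(y ~ X - 1, family = binomial(link = "probit")))
max(abs(fit_direct - fit_glm))           # distance from glm's probit fit
```

Capping the number of iterations is a design choice of this sketch rather than part of the derivation; it only keeps the loop from running forever if the iterates fail to settle.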

Let's confirm that this works by using it to perform a probit regression on the same simulated data as before (and this is not the canonical link, so we need this more general form of IRLS).

my_IRLS_general <- function(x, y, b.init, h, h.prime, tol=1e-8) {
  change <- Inf
  b.old <- b.init
  while(change > tol) {
    eta <- x %*% b.old  # linear predictor
    y.hat <- h(eta)
    h.prime_eta <- h.prime(eta)
    w_star <- h.prime_eta^2 / (y.hat * (1 - y.hat))
    z_star <- (y - y.hat) / h.prime_eta

    b.new <- b.old + lm(z_star ~ x - 1, weights = w_star)$coef  # WLS

    change <- sqrt(sum((b.new - b.old)^2))
    b.old <- b.new
  }
  b.new
}

# probit inverse link and derivative
h_probit <- function(x) pnorm(x, 0, 1)
h.prime_probit <- function(x) dnorm(x, 0, 1)

my_IRLS_general(x, y, rep(0,p), h_probit, h.prime_probit)
# x1         x2         x3         x4         x5 
# -0.6456508  1.2520266  0.5820856  0.4982678 -0.6768585 

glm(y~x-1, family=binomial(link="probit"))$coef
# x1         x2         x3         x4         x5 
# -0.6456490  1.2520241  0.5820835  0.4982663 -0.6768581

and again the two agree.

Comments on convergence

Finally, a few quick comments on convergence (I'll keep this brief as this is getting really long and I'm no expert at optimization). Even though theoretically each $J_{(m)}$ is negative definite, bad initial conditions can still prevent this algorithm from converging. In the probit example above, changing the initial conditions to b.init=rep(1,p) makes the procedure fail, and that doesn't even look like a suspicious initial condition. If you step through the IRLS procedure with that initialization and these simulated data, by the second time through the loop there are some $\hat y_i$ that round to exactly $1$, so the weights become undefined. If we're using the canonical link in the algorithm I gave we won't ever be dividing by $\hat y_i (1 - \hat y_i)$ to get undefined weights, but if we've got a situation where some $\hat y_i$ are approaching $0$ or $1$, such as in the case of perfect separation, then we'll still get non-convergence as the gradient dies without us reaching anything.
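The undefined weights are easy to reproduce in isolation. The sketch below evaluates the probit weights $h'^2 / (\hat y (1 - \hat y))$ at a few illustrative values of the linear predictor, first as written in my_IRLS_general and then assembled on the log scale. The log-scale version is only one possible mitigation and an assumption of this sketch, not something the procedure above does, and it does nothing about the working response, which still divides by $h'$.

```r
eta <- c(0, 5, 10, 40)   # illustrative linear predictor values

# Weights as in my_IRLS_general: h'^2 / (y.hat * (1 - y.hat))
y.hat <- pnorm(eta)
w_naive <- dnorm(eta)^2 / (y.hat * (1 - y.hat))

# Same quantity assembled on the log scale, so 1 - y.hat is never formed directly
log_w <- 2 * dnorm(eta, log = TRUE) -
  pnorm(eta, log.p = TRUE) -
  pnorm(eta, lower.tail = FALSE, log.p = TRUE)
w_stable <- exp(log_w)

rbind(naive = w_naive, stable = w_stable)
```

Once `pnorm(eta)` rounds to exactly 1, the naive weight divides by zero, whereas the log-scale weight stays finite and simply becomes very small.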

— jld

+1. I love how detailed your answers often are.

— amoeba says Reinstate Monica

You stated "the coefficient estimates from this converge on a maximum of the logistic regression likelihood." Is that necessarily so, from any initial values?

— Mark L. Stone

@MarkL.Stone ah I was being too casual there, didn't mean to offend the optimization people :) I'll add some more details (and would appreciate your thoughts on them when I do)

— jld

any chance you watched the link I posted? Seems that video is talking from a machine learning perspective, just optimizing the logistic loss, without talking about the Hessian expectation?

— Haitao Du

@hxd1011 in that pdf i linked to (link again: sagepub.com/sites/default/files/upm-binaries/…) on page 24 of it the author goes into the theory and explains what exactly makes a link function canonical. I found that pdf extremely helpful when I first came across this (although it took me a while to get through).

— jld