Does MLE require i.i.d. data? Or just independent parameters?

16

Estimating parameters using maximum likelihood estimation (MLE) involves evaluating the likelihood function, which maps the probability of the sample (X) occurring to values (x) on the parameter space (θ) given a distribution family (P(X = x | θ)) over possible values of θ (note: am I right on this?). All the examples I've seen involve calculating P(X = x | θ) by taking the product of F(X), where F is the distribution with the local value for θ and X is the sample (a vector).

Since we're just multiplying the data, does it follow that the data must be independent? E.g., could we not use MLE to fit time-series data? Or do the parameters just have to be independent?

maximum-likelihood

— Felix
source

14

The likelihood function is defined as the probability of an event $E$ (the data set ${\bf x}$) as a function of the model parameters $\theta$:

${\mathcal L}(\theta;{\bf x})\propto {\mathbb P}(\text{Event }E;\theta)= {\mathbb P}(\text{observing } {\bf x};\theta).$

Therefore, there is no assumption of independence of the observations. In the classical approach there is no definition of independence of parameters, since they are not random variables; some related concepts are identifiability, parameter orthogonality, and independence of the maximum likelihood estimators (which are random variables).

Some examples:

(1). Discrete case. Let ${\bf x}=(x_1,...,x_n)$ be a sample of (independent) discrete observations with ${\mathbb P}(\text{observing } x_j ; \theta)>0$; then

${\mathcal L}(\theta;{\bf x})\propto \prod_{j=1}^n{\mathbb P}(\text{observing } x_j ; \theta).$

In particular, if $x_j\sim \text{Binomial}(N,\theta)$, with $N$ known, we have that

${\mathcal L}(\theta;{\bf x})\propto \prod_{j=1}^n \theta^{x_j}(1-\theta)^{N-x_j}.$
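As a rough illustration of the binomial case, the sketch below evaluates the log of the product above on a grid of $\theta$ values and compares the maximizer with the closed-form estimate $\hat\theta = \sum_j x_j / (nN)$. The data, the grid size, and the function names are assumptions made up for this example.

```python
import math

def binomial_log_likelihood(theta, xs, N):
    """Log of prod_j theta^x_j (1 - theta)^(N - x_j); binomial coefficients are dropped."""
    return sum(x * math.log(theta) + (N - x) * math.log(1.0 - theta) for x in xs)

def maximize_on_grid(xs, N, steps=9999):
    """Crude grid search over theta in (0, 1); returns the best grid point."""
    grid = [(k + 1) / (steps + 1) for k in range(steps)]
    return max(grid, key=lambda t: binomial_log_likelihood(t, xs, N))

# Hypothetical data: n = 5 counts, each out of N = 10 trials.
xs = [3, 5, 4, 6, 2]
N = 10

theta_grid = maximize_on_grid(xs, N)
theta_closed = sum(xs) / (len(xs) * N)  # total successes / total trials

print(theta_grid, theta_closed)
```

A grid search is used only to keep the example dependency-free; any one-dimensional optimizer would do the same job.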

(2). Continuous approximation. Let ${\bf x}=(x_1,...,x_n)$ be a sample from a continuous random variable $X$, with distribution $F$ and density $f$, observed with measurement error $\epsilon$; that is, you observe the sets $(x_j-\epsilon,x_j+\epsilon)$. Then

$\begin{eqnarray*} {\mathcal L}(\theta;{\bf x})\propto \prod_{j=1}^n {\mathbb P}[\text{observing } (x_j-\epsilon,x_j+\epsilon);\theta] = \prod_{j=1}^n[F(x_j+\epsilon;\theta)-F(x_j-\epsilon;\theta)] \end{eqnarray*}$

When $\epsilon$ is small, this can be approximated (using the mean value theorem) by

$\begin{eqnarray*} {\mathcal L}(\theta;{\bf x})\propto \prod_{j=1}^n f(x_j;\theta) \end{eqnarray*}$
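To see the approximation numerically, here is a small sketch that assumes a normal model (the sample, $\epsilon$, and the parameter values are invented for illustration). Each interval probability is close to $2\epsilon f(x_j;\theta)$, so the two log-likelihoods differ by roughly the constant $n\log(2\epsilon)$, which does not depend on $\theta$ and therefore does not move the maximizer.

```python
import math

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def interval_log_likelihood(mu, sigma, xs, eps):
    """Sum of log[F(x_j + eps) - F(x_j - eps)]."""
    return sum(
        math.log(normal_cdf(x + eps, mu, sigma) - normal_cdf(x - eps, mu, sigma))
        for x in xs
    )

def density_log_likelihood(mu, sigma, xs):
    """Sum of log f(x_j)."""
    return sum(math.log(normal_pdf(x, mu, sigma)) for x in xs)

# Hypothetical sample and measurement precision.
xs = [0.3, -1.2, 0.8, 1.9, -0.4]
eps = 1e-3

# After removing the theta-free constant n*log(2*eps), the two should nearly agree.
offset = len(xs) * math.log(2.0 * eps)
gap = (interval_log_likelihood(0.5, 1.2, xs, eps) - offset
       - density_log_likelihood(0.5, 1.2, xs))

print(gap)
```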

For an example in the normal case, take a look at this.

(3). Dependent and Markov model. Suppose that ${\bf x}=(x_1,...,x_n)$ is a set of possibly dependent observations and let $f$ be the joint density of ${\bf x}$; then

$\begin{eqnarray*} {\mathcal L}(\theta;{\bf x})\propto f({\bf x}; \theta). \end{eqnarray*}$

If additionally the Markov property is satisfied, then

$\begin{eqnarray*} {\mathcal L}(\theta;{\bf x})\propto f({\bf x}; \theta) = f(x_1;\theta)\prod_{j=1}^{n-1} f(x_{j+1} \vert x_j ;\theta). \end{eqnarray*}$
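One concrete instance of the Markov factorization is a stationary Gaussian AR(1) series, which also speaks to the time-series part of the question. The sketch below is only an illustration under assumed choices: $x_1 \sim N(0, \sigma^2/(1-\phi^2))$, $x_{j+1}\mid x_j \sim N(\phi x_j, \sigma^2)$, $\sigma$ treated as known, simulated data, and a grid search over $\phi$.

```python
import math
import random

def normal_logpdf(x, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def ar1_log_likelihood(phi, sigma, xs):
    """log f(x_1) + sum_j log f(x_{j+1} | x_j) for the assumed stationary AR(1) model."""
    ll = normal_logpdf(xs[0], 0.0, sigma ** 2 / (1.0 - phi ** 2))
    for prev, curr in zip(xs, xs[1:]):
        ll += normal_logpdf(curr, phi * prev, sigma ** 2)
    return ll

def simulate_ar1(phi, sigma, n, rng):
    """Draw x_1 from the stationary distribution, then iterate the recursion."""
    xs = [rng.gauss(0.0, sigma / math.sqrt(1.0 - phi ** 2))]
    for _ in range(n - 1):
        xs.append(phi * xs[-1] + rng.gauss(0.0, sigma))
    return xs

rng = random.Random(0)
xs = simulate_ar1(phi=0.6, sigma=1.0, n=2000, rng=rng)

# Maximize over phi on a grid in (-1, 1), with sigma held at its true value.
grid = [k / 100.0 for k in range(-99, 100)]
phi_hat = max(grid, key=lambda p: ar1_log_likelihood(p, 1.0, xs))

print(phi_hat)
```

The observations here are clearly dependent, yet the likelihood is still a product: a product of conditional densities rather than marginal ones.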

Take also a look at this.

— Community
source

3

By writing the likelihood function as a product, you are implicitly assuming a dependence structure among the observations. So MLE needs two assumptions: (a) one about the distribution of each individual outcome and (b) one about the dependence among the outcomes.

10

(+1) Very good question.

Minor thing, MLE stands for maximum likelihood estimate (not multiple), which means that you just maximize the likelihood. This does not specify that the likelihood has to be produced by IID sampling.

If the dependence of the sampling can be written in the statistical model, you just write the likelihood accordingly and maximize it as usual.

One case worth mentioning in which you do not assume independence is multivariate Gaussian sampling (in time series analysis, for example). The dependence between two Gaussian variables can be modelled by their covariance term, which you incorporate in the likelihood.

To give an example, suppose you draw a sample of size $2$ from correlated Gaussian variables with the same mean and variance. You would write the likelihood as

$\frac{1}{2\pi\sigma^2\sqrt{1-\rho^2}}\exp\left(-\frac{z}{2\sigma^2(1-\rho^2)}\right),$

where $z$ is

$z = (x_1-\mu)^2-2\rho(x_1-\mu)(x_2-\mu)+(x_2-\mu)^2.$

This is not the product of the individual likelihoods. Still, you would maximize this with parameters $(\mu, \sigma, \rho)$ to get their MLE.
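A quick numerical check of this point, using made-up values for $x_1$, $x_2$ and the parameters: the joint density above coincides with the product of the two marginal normal densities when $\rho = 0$ and differs from it otherwise. The function names are illustrative only.

```python
import math

def bivariate_likelihood(x1, x2, mu, sigma, rho):
    """Joint density of two correlated Gaussians with common mean and variance."""
    z = (x1 - mu) ** 2 - 2.0 * rho * (x1 - mu) * (x2 - mu) + (x2 - mu) ** 2
    norm = 2.0 * math.pi * sigma ** 2 * math.sqrt(1.0 - rho ** 2)
    return math.exp(-z / (2.0 * sigma ** 2 * (1.0 - rho ** 2))) / norm

def univariate_density(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

# Hypothetical pair of observations.
x1, x2 = 1.0, 1.5

product = univariate_density(x1, 0.0, 1.0) * univariate_density(x2, 0.0, 1.0)
joint_independent = bivariate_likelihood(x1, x2, 0.0, 1.0, 0.0)
joint_correlated = bivariate_likelihood(x1, x2, 0.0, 1.0, 0.7)

print(product, joint_independent, joint_correlated)
```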

— gui11aume
source

2

These are good answers and examples. The only thing I would add, to put this in simple terms, is that likelihood estimation only requires that a model for the generation of the data be specified in functional form in terms of some unknown parameters.

— Michael R. Chernick

(+1) Absolutely true! Do you have an example of model that cannot be specified in those terms?

— gui11aume

@gui11aume I think you are referring to my remark. I would say that I was not giving a direct answer to the question. The answer to the question is yes, because there are examples where the likelihood function can be written down when the data are generated by dependent random variables.

— Michael R. Chernick

2

Examples where this cannot be done would be where the data are given without any description of the data generating mechanism or the model is not presented in a parametric form such as when you are given two iid data sets and are asked to test whether they come from the same distribution where you only specify that the distributions are absolutely continuous.

— Michael R. Chernick

4

Of course, Gaussian ARMA models possess a likelihood, as their covariance function can be derived explicitly. This is basically an extension of gui11aume's answer to more than 2 observations. Minimal googling produces papers like this one where the likelihood is given in the general form.

Another, to an extent more intriguing, class of examples is given by multilevel random effect models. If you have data of the form

$y_{ij} = x_{ij}'\beta + u_i + \epsilon_{ij},$

where indices $j$ are nested in $i$ (think of students $j$ in classrooms $i$, say, for a classic application of multilevel models), then, assuming $\epsilon_{ij} \perp u_i$, the likelihood is

$\ln L \sim \sum_i \ln \int \prod_j f(y_{ij}|\beta,u_i) {\rm d}F(u_i)$

and is a sum over the likelihood contributions defined at the level of clusters, not individual observations. (Of course, in the Gaussian case, you can push the integrals around to produce an analytic ANOVA-like solution. However, if you have, say, a logit model for your response $y_{ij}$, then there is no way out of numerical integration.)
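To make the numerical integration concrete, here is a minimal sketch for a random-intercept logit model. Everything specific in it is an assumption for illustration: a scalar covariate, $u_i \sim N(0,\tau^2)$, a plain midpoint rule on $[-8\tau, 8\tau]$, and invented data. In practice Gauss-Hermite or adaptive quadrature is the more usual choice.

```python
import math

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

def cluster_likelihood(ys, xs, beta, tau, n_nodes=400):
    """Integral over u of prod_j f(y_ij | beta, u) dF(u), with F = N(0, tau^2) (midpoint rule)."""
    lo, hi = -8.0 * tau, 8.0 * tau
    h = (hi - lo) / n_nodes
    total = 0.0
    for k in range(n_nodes):
        u = lo + (k + 0.5) * h
        weight = math.exp(-0.5 * (u / tau) ** 2) / (tau * math.sqrt(2.0 * math.pi))
        prod = 1.0
        for y, x in zip(ys, xs):
            p = logistic(x * beta + u)  # P(y_ij = 1 | beta, u)
            prod *= p if y == 1 else 1.0 - p
        total += prod * weight * h
    return total

def log_likelihood(clusters, beta, tau):
    """Sum over clusters i of the log of the integrated cluster contribution."""
    return sum(math.log(cluster_likelihood(ys, xs, beta, tau)) for ys, xs in clusters)

# Hypothetical data: two clusters, each a (responses, covariates) pair.
clusters = [
    ([1, 0, 1], [0.2, -0.5, 1.0]),
    ([0, 0, 1, 0], [0.1, 0.4, 0.9, -1.2]),
]
ll = log_likelihood(clusters, beta=0.5, tau=1.0)

print(ll)
```

Observations within a cluster share the same $u_i$, so the product over $j$ sits inside the integral and the log-likelihood sums over clusters only.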

— StasK
source

2

@StasK and @gui11aume, these three answers are nice, but I think they miss a point: what about the consistency of the MLE for dependent data?

— Stéphane Laurent