How to rigorously define the likelihood?



The likelihood can be defined in several ways, for instance:

  • a function $L$ from $\Theta \times \mathcal{X}$ which maps $(\theta, x)$ to $L(\theta \mid x)$, that is, $L : \Theta \times \mathcal{X} \to \mathbb{R}$

  • the random function $L(\cdot \mid X)$

  • we could also consider that the likelihood is only the "observed" likelihood $L(\cdot \mid x^{\mathrm{obs}})$

  • in practice the likelihood conveys information about $\theta$ only up to a multiplicative constant, hence we could regard the likelihood as an equivalence class of functions rather than as a function

Another question arises when considering a change of parametrization: if $\phi = \theta^2$ is the new parametrization, we usually denote by $L(\phi \mid x)$ the likelihood on $\phi$, and this is not the evaluation of the previous function $L(\cdot \mid x)$ at $\theta^2$ but a new function of $\phi$. This is an abusive but useful notation which could cause difficulties for beginners if it is not emphasized.
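To make the abuse of notation explicit, here is a minimal worked sketch (assuming $\theta > 0$ so that the map $\theta \mapsto \theta^2$ is invertible; the subscript "new" is only a label used here to distinguish the two functions):
$$L_{\mathrm{new}}(\phi \mid x) = L\bigl(\sqrt{\phi} \mid x\bigr), \qquad \phi = \theta^2,$$
so writing $L(\phi \mid x)$ for the left-hand side silently changes which function the symbol $L$ denotes.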

What is your favorite rigorous definition of the likelihood?

In addition, how do you call $L(\theta \mid x)$? I usually say something like "the likelihood on $\theta$ when $x$ is observed".

EDIT: In view of some comments below, I realize I should have made the context more precise. I consider a statistical model given by a parametric family $\{f(\cdot \mid \theta), \theta \in \Theta\}$ of densities with respect to some dominating measure, with each $f(\cdot \mid \theta)$ defined on the observation space $\mathcal{X}$. Hence we define $L(\theta \mid x) = f(x \mid \theta)$ and the question is "what is $L$?" (the question is not about a general definition of the likelihood).
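For concreteness, here is one standard instance of such a family (the Gaussian location model, used purely as an illustration): with $n$ i.i.d. observations and Lebesgue measure as the dominating measure,
$$L(\theta \mid x_1, \ldots, x_n) = f(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{(x_i - \theta)^2}{2}\right),$$
and the four readings listed above correspond to viewing this expression as a function of $(\theta, x)$, as a random function of $\theta$ with $X$ random, as the function of $\theta$ at the fixed $x^{\mathrm{obs}}$, or as its equivalence class under multiplication by positive constants.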


(1) Because $\int L(\theta \mid x)\,dx = 1$ for all $\theta$, I believe even the constant in $L$ is defined. (2) If you think of parameters like $\phi$ and $\theta$ as merely being coordinates for a manifold of distributions, then change of parameterization has no intrinsic mathematical meaning; it's merely a change of description. (3) Native English speakers would more naturally say "likelihood of $\theta$" rather than "on." (4) The clause "when $x$ is observed" has philosophical difficulties, because most $x$ will never be observed. Why not just say "likelihood of $\theta$ given $x$"?
whuber
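Written out in the notation of the question's EDIT (with $\nu$ denoting the dominating measure, a symbol borrowed from the answer below): for every fixed $\theta$,
$$\int L(\theta \mid x)\,d\nu(x) = \int f(x \mid \theta)\,d\nu(x) = 1,$$
since each $f(\cdot \mid \theta)$ is a density; this is the sense in which integrating over $x$ pins down the constant.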

@whuber: For (1), I don't think the constant is well-defined. See E. T. Jaynes's book, where he writes that "a likelihood is not a probability because its normalization is arbitrary."
Neil G

You appear to be confusing two kinds of normalization, Neil: Jaynes was referring to normalization by integration over θ, not x.
whuber

@whuber: I don't think a scaling factor will matter for the Cramér–Rao bound because changing $k$ adds a constant amount to the log-likelihood, which then disappears when the partial derivative is taken.
Neil G
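A one-line check of this point: for any constant $k > 0$,
$$\frac{\partial}{\partial \theta} \log\bigl(k\, L(\theta \mid x)\bigr) = \frac{\partial}{\partial \theta}\bigl(\log k + \log L(\theta \mid x)\bigr) = \frac{\partial}{\partial \theta} \log L(\theta \mid x),$$
so the score, and hence the Fisher information appearing in the Cramér–Rao bound, is unaffected by the multiplicative constant.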

I agree with Neil; I do not see any application where the constant plays a role.
Stéphane Laurent

Answers:



Your third item is the one I have seen most often used as a rigorous definition.

The others are interesting too (+1). In particular the first is appealing, with the difficulty that, the sample size not being defined (yet), it is harder to define the "from" set.

To me, the fundamental intuition of the likelihood is that it is a function of the model + its parameters, not a function of the random variables (also an important point for teaching purposes). So I would stick to the third definition.

The source of the abuse of notation is that the "from" set of the likelihood is implicit, which is usually not the case for well-defined functions. Here, the most rigorous approach is to realize that after the transformation, the likelihood relates to another model. It is equivalent to the first, but it is still another model. So the likelihood notation should show which model it refers to (by a subscript or otherwise). I never do it of course, but for teaching, I might.

Finally, to be consistent with my previous answers, I say the "likelihood of θ" in your last formula.


Thanks. And what is your advice about the equality up to a multiplicative constant?
Stéphane Laurent

Personally I prefer to call it up when needed rather than hard-code it in the definition. And I think that for model selection/comparison this 'up-to-a-multiplicative-constant' equality does not hold.
gui11aume

Ok. Concerning the name, you could imagine discussing the likelihoods $L(\theta \mid x_1)$ and $L(\theta \mid x_2)$ for two possible observations. In such a case, would you say "the likelihood of $\theta$ when $x_1$ is observed", or "the likelihood of $\theta$ for the observation $x_1$", or something else?
Stéphane Laurent

If you re-parametrize your model with $\phi = \theta^2$ you actually compute the likelihood as a composition of functions $L(\cdot \mid x) \circ g(\cdot)$ where $g(y) = y^2$. In this case, $g$ goes from $\mathbb{R}$ to $\mathbb{R}^+$, so the set of definition (mentioned as the "from" set) of the likelihood is no longer the same. You could call the first function $L_1(\cdot \mid x)$ and the second $L_2(\cdot \mid x)$ because they are not the same functions.
gui11aume

How is the third definition rigorous? And what is the problem with the sample size not being defined? Since we say $P(x_1, x_2, \ldots, x_n \mid \theta)$, which naturally brings into existence a corresponding sigma algebra for the sample space $\Omega^n$, why can't we have the parallel definition for likelihoods?
Neil G


I think I would call it something different. Likelihood is the probability density for the observed $x$ given the value of the parameter $\theta$, expressed as a function of $\theta$ for the given $x$. I don't share the view about the proportionality constant. I think that only comes into play because maximizing any monotonic function of the likelihood gives the same solution for $\theta$. So you can maximize $cL(\theta \mid x)$ for $c > 0$ or other monotonic functions such as $\log(L(\theta \mid x))$, which is commonly done.
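In equation form (a routine identity, stated here only to make the point explicit): for any $c > 0$ and any strictly increasing function $h$, taking $L(\theta \mid x) > 0$ so that the logarithm is defined,
$$\arg\max_\theta\, c\,L(\theta \mid x) = \arg\max_\theta\, h\bigl(L(\theta \mid x)\bigr) = \arg\max_\theta\, \log L(\theta \mid x) = \arg\max_\theta\, L(\theta \mid x),$$
so the maximizer, unlike the value of the likelihood itself, is insensitive to the multiplicative constant.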


Not only the maximization: the up-to-proportionality also comes into play in the notion of the likelihood ratio, and in Bayes' formula for Bayesian statistics.
Stéphane Laurent

I thought someone might downvote my answer. But I think it is quite reasonable to define likelihood this way as a definitive probability without calling anything proportional to it a likelihood. @StéphaneLaurent, to your comment about priors: if the function is integrable it can be normalized to a density. The posterior is proportional to the likelihood times the prior. Since the posterior must be normalized by dividing by an integral we might as well specify the prior to be the distribution. It is only in an extended sense that this gets applied to improper priors.
Michael R. Chernick
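In symbols (standard Bayesian notation, with $\pi$ denoting the prior density; this only restates the comment above):
$$\pi(\theta \mid x) = \frac{L(\theta \mid x)\,\pi(\theta)}{\int_\Theta L(\theta' \mid x)\,\pi(\theta')\,d\theta'} \propto L(\theta \mid x)\,\pi(\theta),$$
so any multiplicative constant in $L$ cancels between numerator and denominator once the posterior is normalized.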

I'm not quite sure why someone would downvote this answer. It seems you are trying to respond more to the OP's second and third questions than the first. Perhaps that was not entirely clear to other readers. Cheers. :)
cardinal

@Michael I don't see the need to downvote this answer either. Concerning noninformative priors (this is another discussion and) I intend to open a new discussion about this subject. I will not do it soon, because I am not at ease with English, and it is more difficult for me to write "philosophy" than mathematics.
Stéphane Laurent

@Stephane: If you'd like, please consider posting your other question directly in French. We have several native French speakers on this site that likely would help translate any passages you're unsure about. This includes a moderator and also an editor of one of the very top English-language statistics journals. I look forward to the question.
cardinal


Here's an attempt at a rigorous mathematical definition:

Let $X : \Omega \to \mathbb{R}^n$ be a random vector which admits a density $f(x \mid \theta_0)$ with respect to some measure $\nu$ on $\mathbb{R}^n$, where $\{f(x \mid \theta) : \theta \in \Theta\}$ is a family of densities on $\mathbb{R}^n$ with respect to $\nu$. Then, for any $x \in \mathbb{R}^n$, we define the likelihood function $L(\theta \mid x)$ to be $f(x \mid \theta)$; for clarity, for each $x$ we have $L_x : \Theta \to \mathbb{R}$. One can think of $x$ as a particular potential $x^{\mathrm{obs}}$ and $\theta_0$ as the "true" value of $\theta$.

A couple of observations about this definition:

  1. The definition is robust enough to handle discrete, continuous, and other sorts of families of distributions for X.
  2. We are defining the likelihood at the level of density functions instead of at the level of probability distributions/measures. The reason for this is that densities are not unique, and it turns out that this isn't a situation where one can pass to equivalence classes of densities and still be safe: different choices of densities lead to different MLE's in the continuous case. However, in most cases there is a natural choice of family of densities that are desirable theoretically.
  3. I like this definition because it incorporates the random variables we are working with into it and, by design since we have to assign them a distribution, we have also rigorously built in the notion of the "true but unknown" value of θ, here denoted θ0. For me, as a student, the challenge of being rigorous about likelihood was always how to reconcile the real world concepts of a "true" θ and "observed" xobs with the mathematics; this was often not helped by instructors claiming that these concepts weren't formal but then turning around and using them formally when proving things! So we deal with them formally in this definition.
  4. EDIT: Of course, we are free to consider the usual random elements $L(\theta \mid X)$, $S(\theta \mid X)$ and $I(\theta \mid X)$ under this definition, with no real problems with rigor as long as you are careful (or even if you aren't, if that level of rigor is not important to you).
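To make the definition above concrete, here is one standard instantiation (the exponential family below is chosen only as an illustration): take $\nu$ to be Lebesgue measure on $\mathbb{R}^n$, $\Theta = (0, \infty)$, and the product densities
$$f(x \mid \theta) = \prod_{i=1}^n \theta\, e^{-\theta x_i}\, I[x_i > 0],$$
so that for a fixed $x$ with positive coordinates, $L_x(\theta) = L(\theta \mid x) = \theta^n e^{-\theta \sum_i x_i}$ is an ordinary function from $\Theta$ to $\mathbb{R}$, while $L(\theta \mid X)$ from item 4 is the corresponding random element.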

@Xi'an Let $X_1, \ldots, X_n$ be uniform on $(0, \theta)$. Consider two densities $f_1(x) = \theta^{-1} I[0 < x < \theta]$ versus $f_2(x) = \theta^{-1} I[0 \le x \le \theta]$. Both $f_1$ and $f_2$ are valid densities for $U(0, \theta)$, but under $f_2$ the MLE exists and is equal to $\max X_i$, whereas under $f_1$ we have $\prod_j f_1(x_j \mid \max x_i) = 0$, so that if you set $\hat\theta = \max X_i$ you end up with a likelihood of 0, and in fact the MLE doesn't exist because $\sup_\theta \prod_j f_1(x_j \mid \theta)$ is not attained for any $\theta$.
guy
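A tiny numerical check of this point, using made-up data $x = (0.5, 0.8)$ so that $\max x_i = 0.8$: at $\theta = 0.8$ the strict inequality kills the product,
$$\prod_j f_1(x_j \mid 0.8) = 0.8^{-2}\, I[0 < 0.5 < 0.8]\, I[0 < 0.8 < 0.8] = 0,$$
$$\prod_j f_2(x_j \mid 0.8) = 0.8^{-2}\, I[0 \le 0.5 \le 0.8]\, I[0 \le 0.8 \le 0.8] = 0.8^{-2},$$
so the two versions of the density, although equal almost everywhere, give different answers at $\hat\theta = \max x_i$.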

@guy: thanks, I did not know about this interesting counter-example.
Xi'an

@guy You said that $\sup_\theta \prod_j f_1(x_j \mid \theta)$ is not attained for any $\theta$. However, this supremum is attained at some point, as I show below:
$$L_1(\theta; x) = \prod_{j=1}^n f_1(x_j \mid \theta) = \theta^{-n} \prod_{j=1}^n I(0 < x_j < \theta) = \theta^{-n} I(0 < M < \theta),$$
where $M = \max\{x_1, \ldots, x_n\}$. I am assuming that $x_j > 0$ for all $j = 1, \ldots, n$. It is simple to see that 1. $L_1(\theta; x) = 0$ if $0 < \theta \le M$; 2. $L_1(\theta; x) = \theta^{-n}$ if $M < \theta < \infty$. Continuing...
Alexandre Patriota

@guy: continuing... That is,
$$L_1(\theta; x) \in [0, M^{-n}),$$
for all $\theta \in (0, \infty)$. We do not have a maximum value but the supremum does exist and it is given by
$$\sup_{\theta \in (0, \infty)} L_1(\theta; x) = M^{-n}$$
and the argument is
$$M = \arg\sup_{\theta \in (0, \infty)} L_1(\theta; x).$$
Perhaps the usual asymptotics do not apply here and some other tools should be employed. But the supremum of $L_1(\theta; x)$ does exist, or I missed some very basic concepts.
Alexandre Patriota

@AlexandrePatriota The supremum exists, obviously, but it is not attained by the function. I'm not sure what the notation $\arg\sup$ is supposed to mean - there is no argument of $L_1(\theta; x)$ which yields the sup, because $L_1(M; x) = 0$. The MLE is defined as any $\hat\theta$ which attains the sup (typically) and no $\hat\theta$ attains the sup here. Obviously there are ways around it - the asymptotics we appeal to require that there exists a likelihood with such-and-such properties, and there does. It's just $L_2$ rather than $L_1$.
guy
Licensed under cc by-sa 3.0 with attribution required.