Basics of Probability Theory
Definitions
\(\sigma\)-algebra and Measurable Space
Let \(\Omega\) be a set. A \(\sigma\)-algebra on \(\Omega\) is any collection of its subsets \(\mathcal{F} \subseteq 2^{\Omega}\) such that:
- The empty set belongs to the \(\sigma\)-algebra, i.e. \(\emptyset \in \mathcal{F}\).
- The \(\sigma\)-algebra is closed under complementation, i.e. if \(A \in \mathcal{F}\), then \(A^c \in \mathcal{F}\).
- The \(\sigma\)-algebra is closed under countable unions, i.e. if \(A_1, A_2, \ldots \in \mathcal{F}\), then \(\cup_{i=1}^{\infty} A_i \in \mathcal{F}\).
Other equivalent definitions are possible. In particular, from the given definition one can prove that: the set \(\Omega\) itself belongs to the \(\sigma\)-algebra; the \(\sigma\)-algebra is closed under countable intersections; and the difference of two sets in the \(\sigma\)-algebra belongs to the \(\sigma\)-algebra.
We say that the pair \((\Omega, \mathcal{F})\) is a measurable space and the elements of \(\mathcal{F}\) are measurable sets.
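As a concrete illustration, the three axioms can be checked mechanically on a finite sample space (on a finite set, closure under countable unions reduces to closure under pairwise unions). A minimal Python sketch; the names `is_sigma_algebra`, `F1`, `F2` are illustrative, not standard:

```python
def is_sigma_algebra(omega, F):
    """Check the sigma-algebra axioms on a finite sample space.

    On a finite set, closure under countable unions reduces to
    closure under pairwise unions.
    """
    F = {frozenset(A) for A in F}
    if frozenset() not in F:  # must contain the empty set
        return False
    if any(frozenset(omega - A) not in F for A in F):  # closed under complement
        return False
    if any(A | B not in F for A in F for B in F):  # closed under unions
        return False
    return True

omega = {1, 2, 3, 4}
F1 = [set(), {1, 2}, {3, 4}, {1, 2, 3, 4}]          # a valid sigma-algebra
F2 = [set(), {1}, {2, 3, 4}, {1, 2}, {1, 2, 3, 4}]  # {1,2} lacks its complement

print(is_sigma_algebra(omega, F1))  # True
print(is_sigma_algebra(omega, F2))  # False
```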
Probability Measure and Probability Space
Let \((\Omega, \mathcal{F})\) be a measurable space. A probability measure is a function \(\mathbb{P}: \mathcal{F} \rightarrow [0, 1]\) such that
- \(\mathbb{P}(\emptyset) = 0\).
- For any sequence of disjoint sets \(A_1, A_2, \ldots \in \mathcal{F}\), \(\mathbb{P}(\cup_{i=1}^{\infty} A_i) = \sum_{i=1}^{\infty} \mathbb{P}(A_i)\).
- \(\mathbb{P}(\Omega) = 1\).
We say that the triplet \((\Omega, \mathcal{F}, \mathbb{P})\) is a probability space. The set \(\Omega\) is called the sample space, each element \(E \in \mathcal{F}\) is called an event, and its measure \(\mathbb{P}(E)\) is called the probability of the event \(E\).
Random Variable
Let \((\Omega, \mathcal{F}, \mathbb{P})\) be a probability space. A random variable is any real function \(X: \Omega \rightarrow \mathbb{R}\), such that the pre-image of any real interval (or more generally, any Borel set) is a measurable set, i.e.
\[X^{-1}(B) \in \mathcal{F} \text{ for any Borel set } B \in \mathcal{B}(\mathbb{R}).\]Note that the random variable takes values in \(\mathbb{R}\), but the sample space \(\Omega\) can be any set.
Let us take a sample of the random variable \(X\). To do that, we take an element \(\omega \in \Omega\) from the sample space, and we get the real value \(X(\omega) \in \mathbb{R}\). Let us fix ideas with an example. When we toss a coin, the outcome of the process can be described by a random variable \(X:\Omega \rightarrow \mathbb{R}\) which assumes only two values: -1 (tail) and 1 (head). Notice that we did not specify any particular sample space, but only the values of the random variable. Indeed, we may choose any convenient sample space, according to the problem that we are dealing with. Denoting by \(H\) and \(T\) the head and tail outcomes, respectively, the simplest sample space we can choose is \(\Omega = \{H, T\}\). If we were considering two coins, the simplest sample space would be \(\Omega = \{HH, HT, TH, TT\}\). However, we could use the second sample space also for the first coin, by setting \(X(HH) = X(HT) = 1\) and \(X(TH) = X(TT) = -1\).
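The sampling procedure described above (draw \(\omega \in \Omega\), then evaluate \(X(\omega)\)) can be sketched in Python for the two-coin sample space; the variable names are illustrative:

```python
import random

# Sample space for two coin tosses, as in the text.
omega = ["HH", "HT", "TH", "TT"]

# Random variable describing only the FIRST coin: +1 for head, -1 for tail.
def X(w):
    return 1 if w[0] == "H" else -1

# Sampling: draw an outcome w from Omega (uniform P), then evaluate X(w).
random.seed(0)
samples = [X(random.choice(omega)) for _ in range(10_000)]
print(sum(samples) / len(samples))  # close to 0: heads and tails equally likely
```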
\(\sigma\)-algebra generated by a random variable
Let \(X\) be a random variable on the measurable space \((\Omega, \mathcal{F})\). One can show that the collection of sets:
\[\sigma(X) = \{X^{-1}(B) \mid B \in \mathcal{B}(\mathbb{R})\}\]is a \(\sigma\)-algebra on \(\Omega\), called the \(\sigma\)-algebra generated by \(X\).
Of course, the \(\sigma\)-algebra generated by a random variable is a subset of the \(\sigma\)-algebra of the measurable space, i.e. \(\sigma(X) \subseteq \mathcal{F}\). One can show that the \(\sigma\)-algebra generated by a random variable is the smallest \(\sigma\)-algebra that makes the random variable measurable. The definition is easily extended to a set of random variables, as the smallest \(\sigma\)-algebra that makes all the random variables measurable.
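As a concrete illustration, on a finite sample space \(\sigma(X)\) can be enumerated directly as the collection of preimages; a small Python sketch (names are illustrative):

```python
from itertools import combinations

omega = frozenset({"HH", "HT", "TH", "TT"})

def X(w):
    return 1 if w[0] == "H" else -1  # +1 if the first coin is head

# On a finite space it is enough to take the preimages X^{-1}(B)
# for every subset B of the (finite) range of X.
values = {X(w) for w in omega}
subsets_of_values = [set(c) for r in range(len(values) + 1)
                     for c in combinations(values, r)]
sigma_X = {frozenset(w for w in omega if X(w) in B) for B in subsets_of_values}

for A in sorted(sigma_X, key=len):
    print(sorted(A))  # {}, {HH,HT}, {TH,TT}, and all of Omega
```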
Distribution Function
Let \(X\) be a random variable on the probability space \((\Omega, \mathcal{F}, \mathbb{P})\). One can show that the function \(\mu_X: \mathcal{B}(\mathbb{R}) \rightarrow [0, 1]\) defined by
\[\mu_X(A) = \mathbb{P}(X^{-1}(A))\]is a probability measure on \((\mathbb{R}, \mathcal{B}(\mathbb{R}))\), called the distribution of \(X\). Note that the definition is well-posed, because \(X\) is a measurable function, so \(X^{-1}(A) \in \mathcal{F}\) (the domain of \(\mathbb{P}\)) for any Borel set \(A\). The function \(\mu_X\) inherits the properties of the probability measure from \(\mathbb{P}\).
There might exist (though it is not guaranteed) a function \(f_X: \mathbb{R} \rightarrow [0, +\infty]\) such that
\[\mu_X(A) = \int_{A} f_X(x) dx \quad \text{for any Borel set } A \in \mathcal{B}(\mathbb{R}).\]If it exists, the function \(f_X\) is called the probability density function of \(X\).
Let us consider a real interval \([a, b]\) and a random variable \(X\) on the probability space \((\Omega, \mathcal{F}, \mathbb{P})\). What is the probability that a sample from the random variable \(X\) falls in \([a, b]\)? First, we need to know which portion of the sample space \(\Omega\) is mapped by \(X\) into \([a, b]\), i.e. \(X^{-1}([a, b])\). Since \(X\) is by definition a measurable function, \(X^{-1}([a, b]) \in \mathcal{F}\). Thus, we can compute the probability of that set using the probability measure, i.e.
\[p(X\in [a,b]) = \mathbb{P}(X^{-1}([a, b])) = \mu_X([a,b]) = \int_a^b f_X(x) dx,\]where the last equality holds if the probability density function \(f_X\) exists.
Cumulative Distribution Function
Let \(X\) be a random variable on the probability space \((\Omega, \mathcal{F}, \mathbb{P})\). The cumulative distribution function of \(X\) is the function \(F_X: \mathbb{R} \rightarrow [0, 1]\) defined by
\[F_X(x) = \mathbb{P}(X \leq x) = \mu_X((-\infty, x]).\]Notice that, if a probability density function \(f_X\) exists, it follows that \(F_X(x) = \int_{-\infty}^x f_X(t) dt.\)
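The relation between the density, the probability of an interval, and the CDF can be checked numerically. A minimal sketch in Python for an exponential density \(f_X(x) = \lambda e^{-\lambda x}\) (the rate and interval values are arbitrary):

```python
import math

lam = 2.0  # rate of an exponential random variable

def f_X(x):
    # probability density function
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def F_X(x):
    # CDF obtained by integrating the density: 1 - exp(-lam x) for x >= 0
    return 1.0 - math.exp(-lam * x) if x >= 0 else 0.0

a, b = 0.5, 1.5

# mu_X([a, b]) by numerical integration of the density (midpoint rule)
n = 100_000
h = (b - a) / n
integral = sum(f_X(a + (i + 0.5) * h) for i in range(n)) * h

print(integral)         # numerical value of the integral of f_X over [a, b]
print(F_X(b) - F_X(a))  # same probability, computed from the CDF
```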
Conditional probability of an event given another event
Let \((\Omega, \mathcal{F}, \mathbb{P})\) be a probability space. Given two events \(A, B \in \mathcal{F}\) with \(\mathbb{P}(A) > 0\), we define the conditional probability of \(B\) given \(A\) as:
\[\mathbb{P}(B \mid A) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(A)}.\]
Conditional Expectation of a random variable with respect to an event
Let \((\Omega, \mathcal{F}, \mathbb{P})\) be a probability space. Given a random variable \(X\) on \(\Omega\) and an event \(A \in \mathcal{F}\) with \(\mathbb{P}(A) > 0\), we define the conditional expectation of \(X\) given \(A\) as:
\[E[X \mid A] = \frac{\int_A X \, d\mathbb{P}}{\mathbb{P}(A)}\]Notice that this can be regarded as the expectation of a new random variable \(X|_A\) defined on the probability space \((A, \mathcal{F}_A, \frac{\mathbb{P}}{\mathbb{P}(A)})\), where \(\mathcal{F}_A = \{F \cap A \mid F \in \mathcal{F}\}\) is the restriction of \(\mathcal{F}\) to subsets of \(A\). The restriction of the sample space to \(A\) and of the \(\sigma\)-algebra to subsets of \(A\) ensures that \(\frac{\mathbb{P}}{\mathbb{P}(A)}\) is a probability measure.
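On a finite sample space both formulas can be evaluated exactly. A small Python sketch for a fair die (the choice of events is illustrative): \(X\) is the face value, \(A\) the event "the outcome is even", \(B\) the event "the outcome is at least 4".

```python
from fractions import Fraction

# Fair die: Omega = {1, ..., 6}, P uniform, X(w) = w.
omega = range(1, 7)
P = {w: Fraction(1, 6) for w in omega}

A = {w for w in omega if w % 2 == 0}  # "the outcome is even"
B = {w for w in omega if w >= 4}      # "the outcome is at least 4"

P_A = sum(P[w] for w in A)
P_B_given_A = sum(P[w] for w in A & B) / P_A  # P(A ∩ B) / P(A)
E_X_given_A = sum(w * P[w] for w in A) / P_A  # (integral of X over A) / P(A)

print(P_B_given_A)  # 2/3
print(E_X_given_A)  # 4
```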
Conditional Expectation of a random variable with respect to a \(\sigma\)-algebra
Let \((\Omega, \mathcal{F}, \mathbb{P})\) be a probability space, and let \(\mathcal{G} \subset \mathcal{F}\) be a sub-\(\sigma\)-algebra of \(\mathcal{F}\). If \(X\) is an integrable \(\mathcal{F}\)-measurable random variable, there exists a \(\mathcal{G}\)-measurable random variable \(Y\), unique up to almost-sure equality, such that:
\[\int_A Y \, d\mathbb{P} = \int_A X \, d\mathbb{P} \quad \forall A \in \mathcal{G},\]or equivalently \(E[Y \mid A] = E[X \mid A]\) for every \(A \in \mathcal{G}\) with \(\mathbb{P}(A) > 0\). We call \(Y\) the conditional expectation of \(X\) given \(\mathcal{G}\), and we denote it by \(E[X \mid \mathcal{G}]\). Notice that the conditional expectation of a random variable with respect to a \(\sigma\)-algebra is itself a random variable. We can consider it a “restricted” version of the original random variable, adapted to the information contained in the sub-\(\sigma\)-algebra \(\mathcal{G}\). This definition is the natural extension of the conditional expectation of a random variable with respect to an event (indeed, a sub-\(\sigma\)-algebra is a collection of events). The conditional expectation with respect to a single event is a number: the average of the random variable over that event, which captures all the information about the random variable that is relevant to the event. A single event is the most restrictive piece of information we can condition on. A progressively less restrictive piece of information is a collection of events, i.e. a \(\sigma\)-algebra. In this case, the information does not reduce the random variable to a single number, but to another random variable, measurable with respect to the smaller \(\sigma\)-algebra. The restriction is again called “conditional expectation”, and it is realized by averaging the random variable over each element of the \(\sigma\)-algebra: as the conditional expectation with respect to an event is one number, the conditional expectation with respect to a \(\sigma\)-algebra amounts to a collection of numbers, one for each element of the \(\sigma\)-algebra.
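On a finite sample space, when the sub-\(\sigma\)-algebra is generated by a partition, the conditional expectation is obtained by averaging \(X\) over each block of the partition. A minimal Python sketch (the two-coin setup is illustrative):

```python
from fractions import Fraction

# Two coin tosses, uniform P; X counts the number of heads.
omega = ["HH", "HT", "TH", "TT"]
P = {w: Fraction(1, 4) for w in omega}
X = {w: w.count("H") for w in omega}

# Sub-sigma-algebra generated by the partition "first toss is H / is T":
# a measurable variable with respect to it must be constant on each block.
partition = [{"HH", "HT"}, {"TH", "TT"}]

# The conditional expectation Y averages X over each block, so that the
# integrals of Y and X agree on every set of the sub-sigma-algebra.
Y = {}
for block in partition:
    P_block = sum(P[w] for w in block)
    avg = sum(X[w] * P[w] for w in block) / P_block
    for w in block:
        Y[w] = avg

print(Y)  # constant 3/2 on {HH, HT}, constant 1/2 on {TH, TT}
```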
Conditional Expectation of a random variable with respect to a random variable
Let \((\Omega, \mathcal{F}, \mathbb{P})\) be a probability space, and let \(X, Y\) be two \(\mathcal{F}\)-measurable random variables. The conditional expectation of \(X\) given \(Y\), denoted by \(E[X | Y]\), is the conditional expectation of \(X\) with respect to the \(\sigma\)-algebra generated by \(Y\), i.e.:
\[E[X|Y] = E[X | \sigma(Y)].\]Thus, the conditional expectation of \(X\) with respect to \(Y\) is the \(\sigma(Y)\)-measurable random variable that, on each event of \(\sigma(Y)\), takes the expectation value of \(X\) on that event. Again, we are restricting the random variable \(X\) by restricting the underlying \(\sigma\)-algebra to the one generated by \(Y\), which represents the information contained in \(Y\).
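For discrete variables this reduces to averaging \(X\) within each level set \(\{Y = y\}\). A small Python sketch with two fair dice (an illustrative choice): \(X\) the sum of the dice, \(Y\) the first die, for which \(E[X \mid Y] = Y + 7/2\).

```python
from fractions import Fraction
from itertools import product

# Two fair dice: Omega = {1..6}^2, P uniform.
omega = list(product(range(1, 7), repeat=2))
P = Fraction(1, 36)

X = lambda w: w[0] + w[1]  # sum of the two dice
Y = lambda w: w[0]         # first die

# sigma(Y) is generated by the events {Y = y}; E[X|Y] is constant on each,
# equal to the average of X there, so it can be tabulated as a function of y.
E_X_given_Y = {}
for y in range(1, 7):
    block = [w for w in omega if Y(w) == y]
    E_X_given_Y[y] = sum(X(w) * P for w in block) / (len(block) * P)

print(E_X_given_Y)  # y + 7/2 for each y = 1, ..., 6
```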
Moment Generating Function
Let \(X\) be a random variable on the probability space \((\Omega, \mathcal{F}, \mathbb{P})\). The moment generating function of \(X\) is:
\[M_X(t) = E[e^{t X}]\](provided that the expectation value exists).
Let us see an example. If \(X\) is a random variable with positive values and probability density function \(f(x)=\lambda e^{-\lambda x}\) (an exponential random variable), its moment generating function is:
\[M_X(t) = \int_0^\infty e^{t x} \lambda e^{-\lambda x} dx = \frac{\lambda}{\lambda - t} \quad \text{for } t < \lambda.\]Another example. If \(X\) is a random variable uniformly distributed in the interval \([a,b]\), its moment generating function is:
\[M_X(t) = \int_a^b \frac{e^{t x}}{b-a} dx = \frac{\left[e^{tx}\right]_a^b}{(b-a)t} = \frac{e^{bt} - e^{at}}{(b-a)t}\]Notice that this expression is not defined at \(t=0\). However, the definition directly gives \(M_X(0) = E[e^{0}] = 1\), and indeed the expression has a removable singularity at \(t=0\) whose limit is \(1\), so it is standard to set \(M_X(0)\equiv 1\).
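Both closed forms can be checked against a Monte Carlo estimate of \(E[e^{tX}]\); a sketch in Python (the parameter values are arbitrary):

```python
import math
import random

random.seed(1)
N = 200_000

# Exponential with rate lam: closed form lam / (lam - t), valid for t < lam.
lam, t = 2.0, 0.5
mc_exp = sum(math.exp(t * random.expovariate(lam)) for _ in range(N)) / N
print(mc_exp, lam / (lam - t))  # both close to 4/3

# Uniform on [a, b]: closed form (e^{bt} - e^{at}) / ((b - a) t).
a, b = 0.0, 1.0
mc_unif = sum(math.exp(t * random.uniform(a, b)) for _ in range(N)) / N
print(mc_unif, (math.exp(b * t) - math.exp(a * t)) / ((b - a) * t))
```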
Theorems
Weak Law of Large Numbers
Let \(X_1, X_2, \ldots\) be a sequence of independent and identically distributed random variables with finite mean \(\mu\) and finite variance \(\sigma^2\). Let \(S_n = \sum_{i=1}^n X_i\); then the sequence of sample means \(S_n/n\) converges in probability to \(\mu\), i.e.
\[\lim_{n \rightarrow \infty} \mathbb{P}\left(\left|\frac{S_n}{n} - \mu \right| > \epsilon \right) = 0 \quad \text{for any } \epsilon > 0.\]
Probability Density Function of monotonic transformation of a random variable
Let \(X\) be a random variable with a continuous probability density function \(f_X\). Let \(g(x)\) be a strictly monotonic, differentiable function. Then, the probability density function of the random variable \(Y = g(X)\) is given by:
\[f_Y(y) = \frac{f_X(g^{-1}(y))}{\left| g'(g^{-1}(y)) \right|}.\]Let us see an example. Let \(T\) be a random variable with values in \([0, \infty)\) and probability density function \(f_T(t) = \lambda e^{-\lambda t}\), and let \(X = g(T) = e^{-T/\tau}\), which takes values in \((0, 1]\). The transformation is strictly decreasing, with inverse \(g^{-1}(x) = -\tau \ln x\) and derivative \(g'(t) = -\frac{1}{\tau} e^{-t/\tau}\), so that \(|g'(g^{-1}(x))| = x/\tau\). Applying the formula:
\[f_X(x) = \frac{f_T(-\tau \ln x)}{x/\tau} = \lambda \tau \, x^{\lambda \tau - 1} \quad \text{for } x \in (0, 1].\]
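The change-of-variables formula can be checked numerically against direct simulation of \(X = e^{-T/\tau}\); a Python sketch (the parameter values and the test interval are arbitrary):

```python
import math
import random

random.seed(2)
lam, tau = 1.5, 2.0

def f_T(t):
    # exponential density of T
    return lam * math.exp(-lam * t) if t >= 0 else 0.0

g = lambda t: math.exp(-t / tau)        # strictly decreasing on [0, inf)
g_inv = lambda x: -tau * math.log(x)    # inverse, defined for x in (0, 1]
g_prime = lambda t: -math.exp(-t / tau) / tau

def f_X(x):
    # change-of-variables formula from the theorem above
    return f_T(g_inv(x)) / abs(g_prime(g_inv(x)))

# Monte Carlo: fraction of samples of X = g(T) falling in [0.2, 0.6]
N = 200_000
count = sum(1 for _ in range(N) if 0.2 <= g(random.expovariate(lam)) <= 0.6)

# Numerical integral of f_X over the same interval (midpoint rule)
n, a, b = 10_000, 0.2, 0.6
h = (b - a) / n
integral = sum(f_X(a + (i + 0.5) * h) for i in range(n)) * h

print(count / N, integral)  # the two probabilities agree
```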
Moment Generating Function of the sum of independent random variables
If \(X_1\) and \(X_2\) are independent random variables with moment generating functions \(M_{X_1}(t)\) and \(M_{X_2}(t)\), then \(Y = X_1 + X_2\) has the following moment generating function:
\[M_Y(t) = M_{X_1}(t) M_{X_2}(t)\]For example, if \(X_1\) and \(X_2\) are both uniformly distributed in \([a,b]\), the moment generating function of their sum is \(M_Y(t) = \left(\frac{e^{bt} - e^{at}}{(b-a)t}\right)^2\).
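A quick numerical sanity check of the product rule, for two independent uniform random variables (Python sketch, parameter values arbitrary):

```python
import math
import random

random.seed(3)
a, b, t = 0.0, 1.0, 0.7
N = 200_000

def M_uniform(t):
    # MGF of a uniform random variable on [a, b], from the earlier example
    return (math.exp(b * t) - math.exp(a * t)) / ((b - a) * t)

# Monte Carlo estimate of E[exp(t (X1 + X2))] for independent uniforms
mc = sum(math.exp(t * (random.uniform(a, b) + random.uniform(a, b)))
         for _ in range(N)) / N

print(mc, M_uniform(t) ** 2)  # both close to 2.10
```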
Discrete Random Variables
Most of the definitions given above adapt naturally to the discrete case. In particular, the definition of probability space needs no adaptation. A discrete random variable is obtained by substituting the codomain \(\mathbb{R}\) with a discrete set such as \(\mathbb{N}\). The distribution function is obtained by substituting integration with summation: the role of the probability density function is played by the probability mass function \(p_X(n) = \mathbb{P}(X = n)\).
Special Distributions
Poisson Distribution
This is a probability distribution for a discrete random variable \(N\) (Poisson random variable) defined by:
\[P(N=n) = \frac{\lambda^n}{n!} e^{-\lambda}\]where \(\lambda > 0\) is called the rate parameter.
The random variable \(N\) represents the number of events occurring in a fixed interval (of time, space, or anything else), given a mean number of events \(\lambda\).
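A numerical illustration: sampling Poisson counts (here via Knuth's multiplication algorithm, a standard sampling method not discussed in the text) and comparing the empirical frequencies with the formula above:

```python
import math
import random

random.seed(4)
lam = 3.0

def sample_poisson(lam):
    # Knuth's algorithm: multiply uniforms until the product drops
    # below exp(-lam); the number of factors needed, minus one, is Poisson.
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

N = 100_000
samples = [sample_poisson(lam) for _ in range(N)]

# Empirical frequencies vs the Poisson pmf lam^n e^{-lam} / n!
for n in range(6):
    pmf = lam ** n * math.exp(-lam) / math.factorial(n)
    freq = samples.count(n) / N
    print(n, round(freq, 4), round(pmf, 4))
```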