Concept
Multivariate Gaussian
(Multivariate Gaussian Distribution)
Let \(\boldsymbol{X} = \begin{bmatrix} X_1 & X_2 & \cdots & X_D \end{bmatrix}^{\intercal}_{D \times 1}\) be a \(D\)-dimensional random vector with sample space \(\mathcal{X} = \mathbb{R}^D\).
Then we say this random vector \(\boldsymbol{X}\) is distributed according to the multivariate Gaussian distribution with mean vector

\[
\boldsymbol{\mu} = \begin{bmatrix} \mu_1 & \mu_2 & \cdots & \mu_D \end{bmatrix}^{\intercal} \in \mathbb{R}^{D}
\]

and covariance matrix

\[
\boldsymbol{\Sigma} = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1D} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2D} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{D1} & \sigma_{D2} & \cdots & \sigma_{DD} \end{bmatrix} \in \mathbb{R}^{D \times D},
\]

if its probability density function (PDF) is given by

\[
f_{\boldsymbol{X}}(\boldsymbol{x}) = \frac{1}{\sqrt{(2\pi)^{D} \lvert \boldsymbol{\Sigma} \rvert}} \exp\left( -\frac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu})^{\intercal} \boldsymbol{\Sigma}^{-1} (\boldsymbol{x} - \boldsymbol{\mu}) \right),
\]
where \(\lvert \boldsymbol{\Sigma} \rvert\) is the determinant of \(\boldsymbol{\Sigma}\).
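As a quick sanity check, the following sketch evaluates this formula directly and compares it against `scipy.stats.multivariate_normal` (the helper `gaussian_pdf` and all numbers here are our own illustrative choices, not part of the definition):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_pdf(x, mu, sigma):
    """Evaluate the multivariate Gaussian PDF at x via the formula above."""
    D = mu.shape[0]
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2.0 * np.pi) ** D * np.linalg.det(sigma))
    quadratic = diff @ np.linalg.inv(sigma) @ diff  # (x - mu)^T Sigma^{-1} (x - mu)
    return norm_const * np.exp(-0.5 * quadratic)

mu = np.array([0.0, 1.0])
sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([0.3, 0.8])

print(gaussian_pdf(x, mu, sigma))             # direct evaluation of the formula
print(multivariate_normal(mu, sigma).pdf(x))  # SciPy's reference implementation
```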
(Expectation and Covariance of Multivariate Gaussian)
By definition, the expectation of a multivariate Gaussian random vector \(\boldsymbol{X}\) is given by

\[
\mathbb{E}[\boldsymbol{X}] = \boldsymbol{\mu},
\]

parameterized by the mean vector \(\boldsymbol{\mu}\).
The covariance matrix of a multivariate Gaussian random vector \(\boldsymbol{X}\) is given by

\[
\operatorname{Cov}[\boldsymbol{X}] = \mathbb{E}\left[ (\boldsymbol{X} - \boldsymbol{\mu})(\boldsymbol{X} - \boldsymbol{\mu})^{\intercal} \right] = \boldsymbol{\Sigma},
\]

parameterized by the covariance matrix \(\boldsymbol{\Sigma}\).
Therefore, both parameters are contained in the definition.
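Both identities are easy to check empirically. The sketch below (with parameters and sample size of our own choosing) draws many samples and confirms that the sample mean and sample covariance approach \(\boldsymbol{\mu}\) and \(\boldsymbol{\Sigma}\):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.0, 1.0])
sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

# Draw many samples and estimate the first two moments empirically.
samples = rng.multivariate_normal(mu, sigma, size=100_000)
print(samples.mean(axis=0))           # close to mu
print(np.cov(samples, rowvar=False))  # close to sigma
```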
One can also easily see that if \(\boldsymbol{X}\) is a scalar (1-dimensional) random variable \(X\), then \(D = 1\), \(\boldsymbol{\mu} = \mu\), and \(\boldsymbol{\Sigma} = \sigma^2\), and by plugging into (207), the PDF of \(X\) is merely

\[
f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right),
\]

the familiar univariate Gaussian PDF.
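This reduction can be verified numerically in one line each (a sketch with arbitrary values):

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

mu, sigma2 = 1.5, 0.49  # mean and variance sigma^2
x = 2.0

# With D = 1, the multivariate PDF collapses to the univariate one.
print(multivariate_normal(mean=[mu], cov=[[sigma2]]).pdf([x]))
print(norm(loc=mu, scale=np.sqrt(sigma2)).pdf(x))
```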
Independence
If all the individual random variables \(X_d\) in a multivariate Gaussian random vector \(\boldsymbol{X}\) are independent, then the PDF can be greatly simplified to a product of univariate Gaussian PDFs.
(PDF of Independent Multivariate Gaussian Random Vectors)
Let \(\boldsymbol{X} = \begin{bmatrix} X_1 & X_2 & \cdots & X_D \end{bmatrix}^{\intercal}_{D \times 1}\) be a random vector following a multivariate Gaussian distribution with mean vector \(\boldsymbol{\mu}\) and covariance matrix \(\boldsymbol{\Sigma}\).
Suppose that all entries of \(\boldsymbol{X}\) are independent random variables (i.e., \(X_i\) and \(X_j\) are independent for all \(i \neq j\)).
Then the PDF of \(\boldsymbol{X}\) is given by

\[
f_{\boldsymbol{X}}(\boldsymbol{x}) = \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi\sigma_d^2}} \exp\left( -\frac{(x_d - \mu_d)^2}{2\sigma_d^2} \right) = \prod_{d=1}^{D} f_{X_d}(x_d),
\]

which is indeed a product of univariate Gaussian PDFs.
Proof. Intuitively, if all the individual random variables \(X_d\) in a multivariate Gaussian random vector \(\boldsymbol{X}\) are independent, then drawing a particular value \(\boldsymbol{x} = \begin{bmatrix} x_1 & x_2 & \cdots & x_D \end{bmatrix}^{\intercal}_{D \times 1}\) from the sample space amounts to asking for the probability of drawing \(X_1 = x_1\) and \(X_2 = x_2\) and \(\cdots\) and \(X_D = x_D\) simultaneously. However, since they are all independent, drawing \(X_i = x_i\) does not affect the probability of drawing \(X_j = x_j\) for \(i \neq j\). Therefore, the probability of drawing \(\boldsymbol{x}\) is simply the product of the probabilities of drawing each individual \(X_d = x_d\). This is in line with what we understand of independence from earlier.
Formally, suppose \(X_i\) and \(X_j\) are independent for all \(i \neq j\). Then Property 23 states that \(\operatorname{Cov}(X_i, X_j) = 0\). Consequently, the covariance matrix \(\boldsymbol{\Sigma}\) is a diagonal matrix:

\[
\boldsymbol{\Sigma} = \begin{bmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_D^2 \end{bmatrix},
\]

where \(\sigma_d^2 = \operatorname{Var}[X_d]\). When this occurs, \(\boldsymbol{\Sigma}^{-1} = \operatorname{diag}\left( \sigma_1^{-2}, \ldots, \sigma_D^{-2} \right)\), and the exponential term in the Gaussian PDF is [Chan, 2021]

\[
-\frac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu})^{\intercal} \boldsymbol{\Sigma}^{-1} (\boldsymbol{x} - \boldsymbol{\mu}) = -\sum_{d=1}^{D} \frac{(x_d - \mu_d)^2}{2\sigma_d^2}.
\]

Moreover, the determinant \(\lvert \boldsymbol{\Sigma} \rvert\) is

\[
\lvert \boldsymbol{\Sigma} \rvert = \prod_{d=1}^{D} \sigma_d^2.
\]
Substituting these results into the joint Gaussian PDF, we obtain

\[
f_{\boldsymbol{X}}(\boldsymbol{x}) = \frac{1}{\sqrt{(2\pi)^{D} \prod_{d=1}^{D} \sigma_d^2}} \exp\left( -\sum_{d=1}^{D} \frac{(x_d - \mu_d)^2}{2\sigma_d^2} \right) = \prod_{d=1}^{D} \frac{1}{\sqrt{2\pi\sigma_d^2}} \exp\left( -\frac{(x_d - \mu_d)^2}{2\sigma_d^2} \right),
\]

which is a product of individual univariate Gaussian PDFs.
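The factorization is easy to verify numerically. The sketch below (arbitrary diagonal covariance and query point) compares the joint PDF against the product of the marginal univariate PDFs:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

mu = np.array([0.0, 1.0, -2.0])
var = np.array([1.0, 0.25, 4.0])  # independence => diagonal covariance
x = np.array([0.5, 1.2, -1.0])

joint = multivariate_normal(mu, np.diag(var)).pdf(x)
product = np.prod(norm(loc=mu, scale=np.sqrt(var)).pdf(x))
print(joint, product)  # the two values agree up to floating-point error
```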
(IID Assumption)
Do not confuse this with the i.i.d. assumption. In most machine learning contexts, when we say i.i.d., we refer to the samples, not the individual random variables in the random vector of each sample.
In supervised learning, implicitly or explicitly, one always assumes that the training set

\[
\mathcal{S} = \left\{ \left( \mathbf{x}^{(1)}, y^{(1)} \right), \left( \mathbf{x}^{(2)}, y^{(2)} \right), \ldots, \left( \mathbf{x}^{(N)}, y^{(N)} \right) \right\}
\]

is composed of \(N\) input/response tuples

\[
\left( \mathbf{X}^{(n)}, Y^{(n)} \right), \quad n = 1, 2, \ldots, N,
\]

that are independently drawn from the same (identical) joint distribution

\[
\mathbb{P}(\mathbf{X}, Y)
\]

with

\[
\mathbb{P}(\mathbf{X} = \mathbf{x}, Y = y) = \mathbb{P}(Y = y \mid \mathbf{X} = \mathbf{x}) \, \mathbb{P}(\mathbf{X} = \mathbf{x}),
\]
where \(\mathbb{P}(Y = y \mid \mathbf{X} = \mathbf{x})\) is the conditional probability of \(Y\) given \(\mathbf{X}\), i.e. the relationship that the learning algorithm/concept \(c\) is trying to capture.
Then, in this case, the i.i.d. assumption writes (also defined in Definition 92):

\[
\left( \mathbf{X}^{(n)}, Y^{(n)} \right) \overset{\text{i.i.d.}}{\sim} \mathbb{P}(\mathbf{X}, Y), \quad n = 1, 2, \ldots, N,
\]

and we sometimes denote

\[
\mathcal{S} \overset{\text{i.i.d.}}{\sim} \mathbb{P}(\mathbf{X}, Y).
\]
This does not assume any independence within each sample \(\mathbf{X}^{(n)}\).
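To make the distinction concrete, the sketch below (entirely synthetic; the linear response and all parameters are illustrative assumptions) draws i.i.d. sample pairs \(\left( \mathbf{x}^{(n)}, y^{(n)} \right)\) whose features are strongly correlated *within* each sample:

```python
import numpy as np

rng = np.random.default_rng(42)
N = 5
mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])  # strong correlation *within* each sample

# Each row of X is one draw of X^{(n)}: the N rows are i.i.d. across samples,
# while the two entries inside each row are highly correlated.
X = rng.multivariate_normal(mu, sigma, size=N)
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=N)  # toy responses
print(X)
print(y)
```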