Gaussian Discriminant Analysis (GDA) - Python and Machine Learning for Integrated Circuits - An Online Book
Python and Machine Learning for Integrated Circuits - http://www.globalsino.com/ICs/
=================================================================================

Gaussian Discriminant Analysis (GDA) is a classification algorithm used for prediction in supervised machine learning tasks. It is closely related to Linear Discriminant Analysis (LDA) and assumes that the features follow a Gaussian distribution. In GDA, you model the probability distribution of the features for each class and then use these distributions to make predictions. Specifically, GDA estimates the parameters of the Gaussian distribution (mean and covariance) for each class and then uses these estimates to calculate the likelihood that a new data point belongs to each class. It is a generative model that can be used for both binary and multiclass classification problems. In GDA, we typically have a dataset with labeled examples. Each example is represented by a feature vector x^(i) and an associated label y^(i). The feature vector x^(i) contains the values of the different features for the i-th example, and y^(i) is the corresponding label or class for that example. The assumption is often made that the features x^(i) are generated from a multivariate Gaussian distribution for each class. The parameters of these distributions, such as the mean and covariance matrix, are estimated from the training data. At a high level, the GDA process is to estimate the class priors and the per-class Gaussian parameters from the training data, and then to assign a new data point to the class with the highest posterior probability.
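A minimal sketch of this workflow is shown below, using scikit-learn's LinearDiscriminantAnalysis, which implements the shared-covariance form of GDA; the synthetic two-class data and the chosen means and covariance are assumptions for illustration only.

# Minimal sketch of the GDA workflow: fit class-wise Gaussians, then predict.
# The synthetic data below is made up for illustration only.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Two classes, each drawn from its own 2-D Gaussian distribution.
X0 = rng.multivariate_normal(mean=[0, 0], cov=[[1, 0.3], [0.3, 1]], size=100)
X1 = rng.multivariate_normal(mean=[2, 2], cov=[[1, 0.3], [0.3, 1]], size=100)
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

# LinearDiscriminantAnalysis implements GDA with a shared covariance matrix.
gda = LinearDiscriminantAnalysis(store_covariance=True)
gda.fit(X, y)

x_new = np.array([[1.0, 1.0]])
print("Predicted class:", gda.predict(x_new))          # class label with highest posterior
print("Posterior P(y|x):", gda.predict_proba(x_new))   # posterior probabilities per class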
GDA has some assumptions, such as the assumption that the features follow a Gaussian distribution within each class, which may not always hold in practice. When these assumptions are met, GDA can work well. If the assumptions are not met, other classification algorithms such as logistic regression or support vector machines may be more appropriate. Gaussian Discriminant Analysis (GDA) is a probabilistic model. In GDA, the core idea is to model the data distribution using Gaussian distributions, and probabilities are used to make predictions and classify new data points. GDA is considered a probabilistic model because it explicitly models the class prior p(y) and the class-conditional densities p(x|y) as probability distributions, and it makes predictions by computing the posterior probability p(y|x) with Bayes' rule.
In Gaussian Discriminant Analysis (GDA), we assume that the feature variables x are continuous and belong to the n-dimensional real space, denoted as ℝⁿ. In linear and logistic regression it is common to include an additional constant feature x₀ = 1 to account for the bias term; if this convention is dropped, as is typical in GDA, then x is considered to be in ℝⁿ rather than ℝⁿ⁺¹. The key assumption in Gaussian Discriminant Analysis is that the conditional probability distribution of the feature variables p(x|y) is Gaussian (normal) for each class y. In other words, the feature variables for each class follow a Gaussian distribution. A random variable z that follows a multivariate Gaussian distribution is written as,

z ~ N(μ, Σ) ------------------------------------ [3848a]

where,
μ represents the mean vector (μ ∈ ℝⁿ) of the Gaussian distribution for z.
Σ represents the covariance matrix (Σ ∈ Sⁿ₊₊, the set of symmetric positive definite n×n matrices). For instance, for Σ = [[2, 1], [1, 3]], the values on the diagonal (2 and 3) represent the variances of the first and second dimensions of z, respectively, and the off-diagonal value (1) represents the covariance between the two dimensions. This covariance matrix defines how the two dimensions are related and how spread out or concentrated the data is in those dimensions.
z is the random variable for which the probability density is calculated, and n is its dimensionality. In this example both μ and Σ are two-dimensional, matching the dimensionality of z.

The expected value of z is,

E[z] = μ ------------------------------- [3848b]

The covariance of z, which represents how the components of the random variable z are correlated, is,

Cov(z) = E[(z − E[z])(z − E[z])ᵀ] ------------------------------------ [3848c]

Cov(z) = E[zzᵀ] − (E[z])(E[z])ᵀ ------------------------------------ [3848d]

where E[z] is sometimes written as Ez for simplified notation. A vector-valued random variable X = [X₁, X₂, ..., Xₙ]ᵀ has a probability density function given by the multivariate Gaussian distribution [1],

p(x; μ, Σ) = 1/((2π)^(n/2) |Σ|^(1/2)) · exp(−(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ)) ------- [3848e]

The right-hand side of the equation is a mathematical expression for the probability density of the random variable x with respect to the parameters n, μ, and Σ. With the parameters n = 2, Σ = [[2, 1], [1, 3]], and μ = [1, 2], representing the dimension, covariance matrix, and mean vector of the distribution, the probability density function is plotted in Figure 3848a. The probability density function is defined over a two-dimensional space, and the X-axis and Y-axis correspond to the values of x in those two dimensions. In Figure 3848a (a), the contour lines represent the values of the probability density function at different points in this two-dimensional space. Note that there are no specific X-axis and Y-axis on the right-hand side of Equation 3848e; the equation represents the probability density as a function of the random variable x, which can be evaluated at different values of x to obtain the probability density at those points. The X-axis and Y-axis are only used when visualizing the function in a plot. The parameters n, μ, and Σ define the shape, location, and scale of the PDF, and thus control the mean and the variance of this density. The mean is a vector μ, which determines the center of the distribution in the multi-dimensional space. The variance-related quantity is the covariance matrix Σ, which contains information about the variances and covariances of the different components of the random variable and describes how spread out or concentrated the data is in different dimensions.

Figure 3848a. Probability density function of the multivariate Gaussian with n = 2, μ = [1, 2], and Σ = [[2, 1], [1, 3]], panels (a), (b), and (c). (figure not shown)
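A minimal sketch that evaluates Equation 3848e directly with the example parameters n = 2, μ = [1, 2], and Σ = [[2, 1], [1, 3]], and compares the result against scipy.stats.multivariate_normal; the evaluation point x is an arbitrary choice for illustration.

# Evaluate the multivariate Gaussian PDF of Equation 3848e directly,
# using the example parameters n = 2, mu = [1, 2], Sigma = [[2, 1], [1, 3]].
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, 2.0])                 # mean vector
Sigma = np.array([[2.0, 1.0],
                  [1.0, 3.0]])            # covariance matrix (symmetric positive definite)

def gaussian_pdf(x, mu, Sigma):
    """p(x; mu, Sigma) = exp(-0.5 (x-mu)^T Sigma^-1 (x-mu)) / ((2 pi)^(n/2) |Sigma|^(1/2))"""
    n = mu.shape[0]
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

x = np.array([0.5, 1.5])                  # an arbitrary evaluation point
print(gaussian_pdf(x, mu, Sigma))                       # manual Equation 3848e
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # same value from scipy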
Figure 3848b shows the probability density function in 3D for Σ = [[1, 0], [0, 1]] compared with Σ = [[1.5, 0], [0, 1.5]], for Σ = [[1, 0], [0, 1]] compared with Σ = [[1, 0.9], [0.9, 1]], and for Σ = [[1, 0], [0, 1]] compared with Σ = [[1, -0.5], [-0.5, 1]]. When the covariance is reduced, the spread of the Gaussian density is reduced and the distribution becomes taller, because the density must integrate to 1. Furthermore, changing μ shifts the center of the Gaussian density.

Figure 3848b. 3D plots of the probability density function for different covariance matrices, panels (a), (b), and (c). (figure not shown)
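A small sketch illustrating this point: shrinking the covariance concentrates the density and raises its peak value at the mean, since the density must always integrate to 1. The three covariance matrices below are chosen for illustration.

# Illustrate how the covariance matrix controls the spread and peak height
# of the Gaussian density (the density must always integrate to 1).
import numpy as np
from scipy.stats import multivariate_normal

mu = np.zeros(2)
for cov in ([[1.0, 0.0], [0.0, 1.0]],      # identity covariance
            [[1.5, 0.0], [0.0, 1.5]],      # larger covariance -> wider, flatter
            [[0.5, 0.0], [0.0, 0.5]]):     # smaller covariance -> narrower, taller
    peak = multivariate_normal(mean=mu, cov=cov).pdf(mu)  # density at the mean
    print(f"Sigma = {cov}, peak density = {peak:.4f}")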
Based on Equation 3848e, in the GDA model p(x|y=0) and p(x|y=1) are given by Gaussian distributions,

p(x|y=0) = 1/((2π)^(n/2) |Σ|^(1/2)) · exp(−(1/2)(x − μ₀)ᵀ Σ⁻¹ (x − μ₀)) ------- [3848f]

p(x|y=1) = 1/((2π)^(n/2) |Σ|^(1/2)) · exp(−(1/2)(x − μ₁)ᵀ Σ⁻¹ (x − μ₁)) ----- [3848g]

Equations 3848f and 3848g describe conditional probability density functions; they are commonly associated with Gaussian (normal) distributions in classification problems and Bayesian statistics.
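A minimal sketch evaluating Equations 3848f and 3848g for a single point; the class means μ₀ and μ₁ and the shared covariance Σ are hypothetical values chosen for illustration.

# Class-conditional densities of Equations 3848f and 3848g with a shared covariance.
# The class means and covariance below are hypothetical values for illustration.
import numpy as np
from scipy.stats import multivariate_normal

mu0 = np.array([0.0, 0.0])          # mean of class y = 0
mu1 = np.array([2.0, 2.0])          # mean of class y = 1
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])      # shared covariance matrix

x = np.array([1.5, 1.0])            # a new data point

p_x_given_y0 = multivariate_normal(mean=mu0, cov=Sigma).pdf(x)   # Equation 3848f
p_x_given_y1 = multivariate_normal(mean=mu1, cov=Sigma).pdf(x)   # Equation 3848g
print(p_x_given_y0, p_x_given_y1)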
In machine learning, the prior distribution of a parameter vector θ can also be given by a Gaussian distribution,

θ ~ N(0, τ²I) ------------------- [3848gb]

This is a Gaussian (normal) distribution with mean zero and covariance matrix τ²I, where I is the identity matrix. It is a common choice of prior distribution for Bayesian regularization in machine learning. In a classification context, Equations 3848f and 3848g are used for modeling the conditional probability of an input x belonging to class 0 (y = 0) or class 1 (y = 1), under the assumption that the data follows a Gaussian distribution with class-specific mean vectors (μ₀ and μ₁) and a shared covariance matrix (Σ). The class label y follows a Bernoulli distribution (as in logistic regression, where y is also modeled as a Bernoulli random variable),

p(y) = φ^y (1 − φ)^(1−y) ----------------------- [3848h]

where,
p(y) represents the probability of the binary random variable y taking on a specific value, which can be either 0 or 1.
y is the Bernoulli random variable, which can take on one of two values, either 0 or 1.
φ = p(y=1) is the parameter of the Bernoulli distribution; it represents the probability of success, that is, the probability that y equals 1.
(1 − φ) is the probability of failure, that is, the probability that y equals 0. Since there are only two possible outcomes (0 and 1), the sum of φ and (1 − φ) must equal 1.
The parameters satisfy μ₀ ∈ ℝⁿ, μ₁ ∈ ℝⁿ, Σ ∈ ℝ^(n×n), and φ ∈ [0, 1].

Equation 3848h is the probability mass function of a Bernoulli distribution. It gives the probability of y taking on a particular value based on the parameter φ: when y is 1 the probability is φ, and when y is 0 the probability is (1 − φ). This is a fundamental model for situations with binary outcomes, such as heads or tails in a coin toss or success or failure in a trial. A training set in machine learning can be given by,

{(x^(i), y^(i)); i = 1, ..., m} ---------------------------------- [3848i]

The parameters φ, μ₀, μ₁, and Σ are estimated by Maximum Likelihood Estimation (MLE), that is, by maximizing the joint likelihood of the training data,

L(φ, μ₀, μ₁, Σ) = ∏_{i=1}^{m} p(x^(i), y^(i); φ, μ₀, μ₁, Σ) ----------------------------------- [3848j]

Maximizing this likelihood gives, for the class prior,

φ = (1/m) Σ_{i=1}^{m} 1{y^(i) = 1} ----------------------------------- [3848k]

μ₀ can be given by,

μ₀ = Σ_{i=1}^{m} 1{y^(i) = 0} x^(i) / Σ_{i=1}^{m} 1{y^(i) = 0} ----------------------------------- [3848l]

where 1{·} is the indicator function, equal to 1 when the condition inside the braces holds and 0 otherwise.
μ₁ can be given by,

μ₁ = Σ_{i=1}^{m} 1{y^(i) = 1} x^(i) / Σ_{i=1}^{m} 1{y^(i) = 1} ----------------------------------- [3848m]

and the shared covariance matrix can be given by,

Σ = (1/m) Σ_{i=1}^{m} (x^(i) − μ_{y^(i)})(x^(i) − μ_{y^(i)})ᵀ ----------------------------------- [3848n]

where (x^(i) − μ_{y^(i)}) is the difference between the data point x^(i) and its corresponding class mean μ_{y^(i)}. This is a vector subtraction, resulting in a vector that represents the deviation of x^(i) from its class mean. GDA is particularly useful when the data within each class is approximately normally distributed; when the class-conditional distributions have different covariances, a separate covariance matrix can be estimated for each class instead of the shared one in Equation 3848n.
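A minimal numpy sketch of the maximum likelihood estimates in Equations 3848k through 3848n, assuming the training set is stored as an (m, n) feature matrix X and an (m,) array y of 0/1 labels; the synthetic data is for illustration only.

# Maximum likelihood estimates for GDA (Equations 3848k, 3848l, 3848m, 3848n).
# X is an (m, n) feature matrix and y an (m,) array of 0/1 labels (synthetic here).
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(2.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)
m = X.shape[0]

phi = np.mean(y == 1)                           # Equation 3848k: class prior
mu0 = X[y == 0].mean(axis=0)                    # Equation 3848l: mean of class 0
mu1 = X[y == 1].mean(axis=0)                    # Equation 3848m: mean of class 1

# Equation 3848n: shared covariance, using each point's own class mean.
mu_y = np.where((y == 1)[:, None], mu1, mu0)    # mu_{y^(i)} for every example
diff = X - mu_y
Sigma = diff.T @ diff / m

print("phi =", phi)
print("mu0 =", mu0, " mu1 =", mu1)
print("Sigma =\n", Sigma)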
When GDA is used for prediction in supervised machine learning tasks, the process is as follows:
a. For a given input feature vector, calculate the likelihood of the data under each class's Gaussian distribution. This involves computing the probability density function (PDF) of the data for each class. In the step of making predictions, when the input data point is assigned to a class based on the posterior probabilities calculated for each class, the equation below can be used,

P(y = k | x) = p(x | y = k) P(y = k) / p(x) ----------------------------------- [3848p]

where,
P(y = k | x) represents the posterior probability of class k given the input data point x,
p(x | y = k) is the likelihood of the data given that it belongs to class k, and
P(y = k) is the prior probability of class k.
In this step, the posterior probabilities for each class are compared, and the class with the highest posterior probability is the one to which the input data point is assigned. Equation 3848p is part of the decision-making process in this step, helping to determine the class with the maximum posterior probability.
b. Multiply the likelihood by the prior probability of the class (the proportion of the training data that belongs to that class).
c. Normalize the results to obtain posterior probabilities for each class. This can be done using Bayes' theorem.
d. Assign the input data point to the class with the highest posterior probability. Note that it is a common simplification in many classification tasks to treat p(x) as a constant across all classes when you are only interested in finding the class with the maximum posterior probability, because p(x) does not depend on the class and therefore does not change which class attains the maximum.
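A minimal sketch of prediction steps a through d based on Equation 3848p; the parameter values passed to predict() are hypothetical (they could instead come from the estimation sketch above), and p(x) is handled implicitly by normalizing the likelihood-times-prior products.

# Prediction steps a-d: likelihood x prior, normalize (Bayes' theorem), take argmax.
# Assumes phi, mu0, mu1, Sigma have already been estimated, as in the sketch above.
import numpy as np
from scipy.stats import multivariate_normal

def predict(x, phi, mu0, mu1, Sigma):
    # a. Likelihoods p(x | y = k) from each class's Gaussian density.
    lik0 = multivariate_normal(mean=mu0, cov=Sigma).pdf(x)
    lik1 = multivariate_normal(mean=mu1, cov=Sigma).pdf(x)
    # b. Multiply by the class priors P(y = 0) = 1 - phi and P(y = 1) = phi.
    joint = np.array([lik0 * (1.0 - phi), lik1 * phi])
    # c. Normalize to get posteriors P(y = k | x); the denominator plays the role of p(x).
    posterior = joint / joint.sum()
    # d. Assign the class with the highest posterior probability.
    return int(np.argmax(posterior)), posterior

label, posterior = predict(np.array([1.0, 1.2]), phi=0.5,
                           mu0=np.array([0.0, 0.0]), mu1=np.array([2.0, 2.0]),
                           Sigma=np.array([[1.0, 0.3], [0.3, 1.0]]))
print("Predicted class:", label, " posteriors:", posterior)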
Table 3848. Applications of Gaussian Discriminant Analysis.
============================================
[1] Chuong B. Do, The Multivariate Gaussian Distribution, 2008.