=================================================================================
In statistics and probability theory, the likelihood of parameters θ is typically denoted as ℒ(θ). The likelihood function represents the probability of observing the given data under different values of the parameter θ. It is a fundamental concept in statistical inference, particularly in maximum likelihood estimation (MLE).
The likelihood function is often denoted as:
 ℒ(θ) = P(data | θ)  [3997a]
 ℒ(θ) = P(x₁, x₂, ..., xₙ | θ)  [3997b]
 ℒ(θ) = ∏ᵢ₌₁ⁿ P(xᵢ | θ)  [3997c]
where:

ℒ(θ) is the likelihood function, which is a function of the parameter θ given the observed data.

P(data | θ) represents the probability of observing the data given a specific value of the parameter θ. It quantifies how well the parameter θ explains or fits the observed data.
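As a minimal sketch of the product form of the likelihood for i.i.d. data, the snippet below evaluates ℒ(θ) under an assumed Bernoulli model with a made-up sample:

```python
import math

def bernoulli_likelihood(theta, data):
    """Likelihood L(theta) = product of P(x_i | theta) for i.i.d. Bernoulli data."""
    return math.prod(theta if x == 1 else (1 - theta) for x in data)

data = [1, 0, 1, 1, 0, 1]  # 4 successes out of 6 trials (made-up sample)

# The likelihood is a function of theta for fixed data:
print(bernoulli_likelihood(0.5, data))    # 0.5**6 = 0.015625
print(bernoulli_likelihood(4 / 6, data))  # larger: theta near the sample mean fits better
```

A θ close to the sample mean yields a larger likelihood, which is exactly the intuition behind maximum likelihood estimation discussed below.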
The cross-entropy loss function can be given by
 L = −[y log(ŷ) + (1 − y) log(1 − ŷ)]  [3997cb]
where:
 ŷ typically represents the predicted probability that a given input belongs to the positive class (class 1).
 y represents the actual binary label (0 or 1) of the instance.
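As a minimal sketch, the binary cross-entropy for a single instance can be implemented directly; the eps clipping constant is an assumed safeguard against log(0):

```python
import math

def binary_cross_entropy(y, y_hat, eps=1e-12):
    """Cross-entropy loss for one instance with true label y in {0, 1}
    and predicted positive-class probability y_hat."""
    y_hat = min(max(y_hat, eps), 1 - eps)  # clip to avoid log(0)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# A confident correct prediction gives a small loss,
# a confident wrong prediction a large one:
print(binary_cross_entropy(1, 0.9))  # ~0.105
print(binary_cross_entropy(1, 0.1))  # ~2.303
```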
The terms "likelihood" and "probability" are related but distinct concepts in statistics, and they are used in different contexts.

Likelihood:
 Likelihood is a function that measures how well a particular set of parameters (θ) in a statistical model explains or fits observed data.
 It is not a probability distribution over data; instead, it's a function of the parameters given the data.
 The likelihood function is used in statistical inference to estimate parameters. In maximum likelihood estimation (MLE), for example, you find the values of θ that maximize the likelihood function given the observed data.
 The likelihood function is not constrained to sum to 1 over all possible values of θ; its values can vary widely depending on the data and the model.
Probability:
 Probability is a measure of the uncertainty associated with an event or outcome.
 It is typically defined over the possible outcomes of a random process or experiment, and it quantifies how likely each outcome is to occur.
 Probability distributions represent the set of possible outcomes and their associated probabilities, and they must sum to 1 over all possible outcomes.
 Probability is used to describe uncertainty before an event has occurred (prior probability) or to compute the probability of future events (posterior probability) given prior information.
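The normalization asymmetry above can be checked numerically; the Bernoulli model and the particular likelihood θ⁴(1 − θ)² below are illustrative assumptions:

```python
# For fixed theta, the Bernoulli probabilities sum to 1 over the outcomes {0, 1}:
theta = 0.3
outcome_probs = {1: theta, 0: 1 - theta}
assert abs(sum(outcome_probs.values()) - 1.0) < 1e-12

# For fixed data, the likelihood over theta is NOT a probability distribution.
# Approximate the integral of L(theta) = theta^4 * (1 - theta)^2 over [0, 1]
# with a left Riemann sum:
n = 100_000
integral = sum((i / n) ** 4 * (1 - i / n) ** 2 for i in range(n)) / n
print(integral)  # ~0.00952, far from 1
```

The likelihood happens to integrate to roughly 0.0095 here; it is not constrained to 1, which is why it is not a probability distribution over θ.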
It might seem like these two concepts are similar because both involve expressing the likelihood of something happening. However, they are not the same for the following reasons:

Different Focus:
 Likelihood focuses on how well a set of parameters explains observed data, and it's used in parameter estimation.
 Probability focuses on the likelihood of specific events or outcomes occurring and is used in predicting or describing events in a probabilistic manner.
Different Objectives:
 Likelihood helps us find the best-fitting parameters for a statistical model, given the observed data.
 Probability is used for modeling the inherent randomness or uncertainty in a system, often in the context of random variables and their distributions.
While likelihood and probability are distinct concepts, they are related in Bayesian statistics through Bayes' theorem. Bayes' theorem relates the likelihood, prior probability, and posterior probability of a parameter, allowing us to update our beliefs about the parameter using observed data. In this context, likelihood plays a crucial role in Bayesian inference. In statistics, it is common to say "likelihood of the parameters" and "probability of the data" to make the distinction between these two related but distinct concepts explicit.
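One way to see this relationship is a conjugate Beta-Binomial sketch, with an assumed prior and made-up data; the posterior is proportional to the likelihood times the prior:

```python
# Beta(a, b) prior on a coin's heads probability theta,
# updated with Binomial data via Bayes' theorem.
# With k heads in n flips, the posterior is Beta(a + k, b + n - k).
a, b = 2.0, 2.0  # prior pseudo-counts (assumed for illustration)
k, n = 7, 10     # observed: 7 heads in 10 flips (made-up data)

post_a, post_b = a + k, b + (n - k)

prior_mean = a / (a + b)
posterior_mean = post_a / (post_a + post_b)
mle = k / n

print(prior_mean)      # 0.5
print(posterior_mean)  # 9/14 ~ 0.643: pulled from the prior toward the data
print(mle)             # 0.7
```

The posterior mean sits between the prior mean and the maximum likelihood estimate, showing how the likelihood updates the prior belief.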
The log of the likelihood can be given by
 [3997d]
 [3997e]
 [3997f]
 [3997g]
Maximum Likelihood (ML) is a statistical method used for estimating the parameters of a probability distribution or a statistical model. It is a common approach in both frequentist and Bayesian statistics and is widely used in various fields, including machine learning, economics, biology, and more. Maximum likelihood estimation (MLE) chooses θ to maximize ℒ(θ); in practice, however, it is much easier to maximize log(ℒ(θ)), which yields the same maximizer because the logarithm is a monotonically increasing function.
Since the first term in Equation 3997f is constant, in order to maximize ℒ(θ), we need to minimize the term after the minus sign in Equation 3997g.
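That maximizing log(ℒ(θ)) gives the same answer as maximizing ℒ(θ) can be checked numerically; the Bernoulli sample and the grid below are illustrative assumptions:

```python
import math

data = [1, 0, 1, 1, 0, 1, 1, 1]  # made-up Bernoulli sample

def likelihood(theta):
    return math.prod(theta if x else 1 - theta for x in data)

def log_likelihood(theta):
    return sum(math.log(theta) if x else math.log(1 - theta) for x in data)

# Search a grid of theta values in (0, 1):
thetas = [i / 1000 for i in range(1, 1000)]
argmax_L = max(thetas, key=likelihood)
argmax_logL = max(thetas, key=log_likelihood)
print(argmax_L, argmax_logL)  # both 0.75 (the sample mean): same maximizer
```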
The basic idea behind maximum likelihood estimation is to find the values of the model parameters that maximize the likelihood function, which measures how well the model explains the observed data. In other words, ML seeks to find the parameter values that make the observed data most probable under the assumed statistical model.
Here's a simplified step-by-step explanation of how maximum likelihood works:

Define a Probability Model: Start by assuming a probability distribution or a statistical model that describes the data. This model depends on one or more parameters that you want to estimate.

Construct the Likelihood Function: The likelihood function is a measure of how likely the observed data is, given the parameter values. It's essentially the probability of observing the data as it is, given the model and parameter values. This function is often denoted as L(θ | data), where θ represents the parameters of the model.

Maximize the Likelihood: To find the maximum likelihood estimates, you aim to find the values of θ that maximize the likelihood function. This can be done analytically by taking derivatives and solving for the maximum or numerically using optimization algorithms like gradient descent.

Obtain Parameter Estimates: Once you've found the parameter values that maximize the likelihood function, these values are considered the maximum likelihood estimates for the model's parameters.
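The four steps above can be sketched for an assumed Gaussian model with made-up observations; step 3 here uses the closed-form solution obtained by setting the derivatives of the log-likelihood to zero:

```python
import math

# Step 1: assume the data come from Normal(mu, sigma^2) (illustrative model).
data = [4.8, 5.1, 5.3, 4.9, 5.6, 5.0, 5.2]  # made-up observations

# Step 2: the log-likelihood of the sample under the model.
def log_likelihood(mu, sigma):
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in data)

# Step 3: maximize -- for the Gaussian this has a closed form:
# d/dmu = 0 gives the sample mean; d/dsigma = 0 gives the RMS deviation.
n = len(data)
mu_hat = sum(data) / n
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in data) / n)

# Step 4: the estimates maximize the log-likelihood, e.g. versus nearby values:
assert log_likelihood(mu_hat, sigma_hat) > log_likelihood(mu_hat + 0.1, sigma_hat)
print(mu_hat, sigma_hat)
```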
Maximum Likelihood Estimation is a powerful and widely used technique in statistics because it provides a principled way to estimate the parameters of a model based on observed data. The resulting parameter estimates are often used for making predictions, inference, and statistical hypothesis testing. In many cases, ML estimates are also asymptotically efficient, meaning they achieve the best possible performance as the sample size increases.
Performing maximum likelihood estimation (MLE) on the exponential family of probability distributions is a common statistical technique used to estimate the parameters of these distributions. When you perform MLE on an exponential family distribution, the following typically happens:

Likelihood Function: You start with a probability density function (PDF) or probability mass function (PMF) that belongs to the exponential family. The PDF or PMF depends on one or more parameters, which you want to estimate. The likelihood function is essentially the product of the densities (for continuous distributions) or of the probabilities (for discrete distributions) of the observed data points, given the parameter(s).

Log-Likelihood: It's often more convenient to work with the log-likelihood, which is the natural logarithm of the likelihood function. This is because it simplifies the mathematical calculations and avoids potential numerical precision issues when dealing with small probabilities.

Optimization: You then find the parameter values that maximize the log-likelihood function. This is done by taking the derivative of the log-likelihood with respect to the parameters and setting it equal to zero, or by using optimization techniques like gradient descent or the Newton-Raphson method. Solving this equation or using optimization methods yields the MLE estimates of the parameters.

Interpretation: The MLE estimates represent the parameter values that make the observed data most probable under the assumed exponential family distribution. In other words, they are the values that maximize the likelihood of the data given the model.

Statistical Properties: MLE estimators often have desirable statistical properties, such as being asymptotically unbiased and efficient (i.e., they have the smallest possible variance among unbiased estimators). However, the exact properties depend on the specific distribution and sample size.

Confidence Intervals and Hypothesis Testing: Once you have MLE estimates, you can construct confidence intervals to assess the uncertainty associated with the parameter estimates. You can also perform hypothesis tests to determine if the estimated parameters are significantly different from specific values of interest.

Model Fit Assessment: After obtaining MLE estimates, it's essential to assess how well the chosen exponential family distribution fits the data. This can be done through various goodness-of-fit tests and graphical methods.
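As a sketch of these steps, the exponential distribution (a member of the exponential family) admits a closed-form MLE, λ̂ = n/Σxᵢ = 1/x̄; the sample below is a made-up illustration:

```python
import math

data = [0.8, 1.9, 0.3, 1.2, 2.5, 0.7]  # made-up sample (assumed Exponential)

# Log-likelihood for Exponential(lam): each term is log(lam) - lam * x_i.
def log_likelihood(lam):
    return sum(math.log(lam) - lam * x for x in data)

# Setting d/dlam [n*log(lam) - lam*sum(x)] = n/lam - sum(x) = 0
# yields the closed-form MLE lam_hat = n / sum(x) = 1 / mean(x):
lam_hat = len(data) / sum(data)

# A coarse grid search agrees with the analytic solution:
grid = [i / 1000 for i in range(1, 5000)]
lam_grid = max(grid, key=log_likelihood)
print(lam_hat, lam_grid)  # both ~0.811
```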
The log-likelihood with respect to η (the natural parameter) is concave in the natural parameterization of the exponential family, namely, when the exponential family is parameterized in its natural parameters. This is a fundamental property of MLE for exponential family distributions:

Exponential Family Structure: The exponential family of probability distributions is characterized by a specific mathematical structure. In the natural parameterization, the log-likelihood function for a sample of independent and identically distributed (i.i.d.) random variables has a specific form: a term that is linear in the natural parameter η minus the log-partition function, which is convex in η.

Log-Likelihood Function: The log-likelihood function, which is the logarithm of the likelihood function, takes the form of a sum over the data points, with each term being a linear combination of the natural parameter and the sufficient statistics, minus the log-partition term.

Linearity in Parameters: The key property here is that the data-dependent part of the log-likelihood, η·ΣT(xᵢ) (where T denotes the sufficient statistics), is linear (affine) in the natural parameter η, while the log-partition term n·A(η) (where A is the log-partition function) is convex in η. Their difference, the log-likelihood, is therefore concave as a function of η.

Concavity: A function is said to be concave if, roughly speaking, it "curves downward" as you move along the function from left to right. In the context of MLE, when the log-likelihood function is concave with respect to the natural parameter η, it means that the function forms a "bowl-like" shape with a single maximum point. In other words, it is a concave function with a unique maximum.

Optimization: The fact that the log-likelihood function is concave with respect to η is crucial for optimization techniques like gradient descent, Newton-Raphson, or other optimization algorithms. A strictly concave function has a single maximum that can be efficiently found through these optimization methods.

Uniqueness of MLE: The concavity of the log-likelihood function ensures that there is a unique solution for the MLE of the natural parameter η. This makes the estimation process well-defined, and the MLE is the value of η that maximizes the likelihood of the observed data.
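This concavity can be checked numerically for the Bernoulli distribution in its natural parameterization η = log(θ/(1 − θ)), where the log-likelihood is ℓ(η) = η·Σxᵢ − n·log(1 + e^η); the sample is a made-up illustration:

```python
import math

data = [1, 0, 1, 1, 0]  # made-up Bernoulli sample
n, s = len(data), sum(data)

def log_likelihood(eta):
    # Linear term in eta minus the log-partition term n * A(eta),
    # with A(eta) = log(1 + exp(eta)) for the Bernoulli distribution.
    return eta * s - n * math.log(1 + math.exp(eta))

# Concavity: second-order finite differences are negative everywhere we check.
h = 1e-3
for eta in [-3, -1, 0, 1, 3]:
    d2 = (log_likelihood(eta + h) - 2 * log_likelihood(eta)
          + log_likelihood(eta - h)) / h ** 2
    assert d2 < 0  # curves downward: concave in eta

# The maximizer recovers theta_hat = s/n through eta = log(theta / (1 - theta)):
eta_hat = math.log((s / n) / (1 - s / n))
print(eta_hat)  # log(0.6/0.4) ~ 0.405
```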
Figure 3997a shows the concave nature of the log-likelihood function in an exponential family distribution by using the Gaussian distribution as an example, which is a member of the exponential family and is parameterized by the natural parameters. The log-likelihood function is represented by the blue curve in the plot, and it forms a "bowl-like" shape, curving downward as you move along the natural parameter (η) axis. This characteristic of the curve indicates that the log-likelihood function is concave with respect to η. Concave functions have a single maximum point, which is often referred to as the MLE of the parameter. The red dashed line marks the location of the MLE of η, which corresponds to the point where the log-likelihood function reaches its maximum value.
Figure 3997a. Concave nature of the log-likelihood function in an exponential family distribution. (Python code)
Negative log likelihood (NLL) is a commonly used mathematical function in the field of statistics and machine learning, particularly in probability models and maximum likelihood estimation. It is used to measure the goodness of fit between a probability distribution (usually a model's predicted distribution) and a set of observed data points, for the following reasons:

Likelihood: In statistics, the likelihood function measures how well a probability distribution or statistical model explains the observed data. Given a set of observed data points (often denoted as x) and a probability distribution or model parameterized by θ, the likelihood L(θ | x) measures the probability of observing the given data under the assumed model.

Log Likelihood: To simplify calculations and avoid numerical underflow/overflow issues, it's common to work with the logarithm of the likelihood function. This is called the log likelihood and is denoted as log L(θ | x).

Negative Log Likelihood (NLL): To turn the measure of fit into a loss function (something to be minimized), the negative log likelihood is often used. It's simply the negative of the log likelihood: −log L(θ | x).
The idea behind using the negative log likelihood as a loss function is to find the model parameters (θ) that maximize the likelihood of observing the given data. Maximizing the likelihood is equivalent to minimizing the negative log likelihood.
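This equivalence can be verified on a small grid; the Bernoulli model and the data below are illustrative assumptions:

```python
import math

data = [1, 1, 0, 1, 0, 1, 1]  # made-up binary observations

def nll(theta):
    """Negative log likelihood: -log L(theta | x)."""
    return -sum(math.log(theta) if x else math.log(1 - theta) for x in data)

def likelihood(theta):
    return math.prod(theta if x else 1 - theta for x in data)

# Minimizing the NLL and maximizing the likelihood pick the same theta:
thetas = [i / 100 for i in range(1, 100)]
theta_min_nll = min(thetas, key=nll)
theta_max_lik = max(thetas, key=likelihood)
print(theta_min_nll, theta_max_lik)  # both 0.71 (grid point nearest 5/7)
```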
Figure 3997b shows the negative log likelihood (NLL), which is obtained by negating the log-likelihood values. The NLL is often used in optimization problems because minimizing it is equivalent to maximizing the likelihood.
Figure 3997b. Negative log likelihood (NLL). (Python code)
Table 3997. Applications of Maximum Likelihood Estimation (MLE).
 Applications                                                       Details
 Single parameter estimation versus multiple parameter estimation   page3843
============================================
