Input data (sample and feature) (multiple and single sample/example) - Python for Integrated Circuits (http://www.globalsino.com/ICs/) - An Online Book
Figure 3907a shows how supervised learning works. To characterize the supervised learning problem more precisely, the objective is to learn a function h: X → Y from a given training set, such that h(x) is a good predictor of the corresponding value y. For historical reasons, this function h is called a "hypothesis".

Figure 3907a. Workflow of supervised learning.

In machine learning, a training example is typically represented as a pair of input features (x) and their corresponding target variable (y). This pair, denoted as (x, y), is called a training example because it is used during the training phase of a machine learning algorithm to teach the model how to make predictions or learn patterns from data.
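As a concrete illustration of a training set made of such (x, y) pairs, the short Python sketch below uses made-up numbers (a single feature such as house size and a numeric target such as price) and a hypothetical hypothesis h(x); none of the values come from the book.

# A minimal sketch of a training set as (x, y) pairs, assuming a single
# feature (e.g. house size) and a numeric target (e.g. price).
# All names and values are illustrative only.
training_set = [
    (2104, 400),   # (x, y): feature value, target value
    (1416, 232),
    (1534, 315),
    (852,  178),
]

def h(x, theta0=0.0, theta1=0.2):
    """A candidate hypothesis h(x) that maps an input x to a predicted y."""
    return theta0 + theta1 * x

for x, y in training_set:
    print(f"x = {x}, actual y = {y}, predicted h(x) = {h(x):.1f}")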
When you have multiple training samples (that is, a dataset with multiple data points), the equations for the hypothesis and the cost function change to accommodate the entire dataset. This is often referred to as "batch" gradient descent, where you update the model parameters using the average of the gradients computed across all training samples.

Figure 3907b. Multiple training samples, features, and outputs in CSV format.

Table 3907a lists some examples of features and corresponding variables of some regression models.

Table 3907a. Examples of features and corresponding variables of some regression models.
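Since Figure 3907b shows the training samples in CSV format, a minimal sketch of loading such a file into feature and target lists is given below; the file name "training_data.csv" and the column names "feature1", "feature2", and "target" are assumptions for illustration, not the actual columns of the figure.

import csv

# A minimal sketch of loading multiple training samples from a CSV file
# like the one in Figure 3907b. The file name and column names are
# assumptions for illustration.
X, y = [], []
with open("training_data.csv", newline="") as f:
    reader = csv.DictReader(f)
    for row in reader:
        X.append([float(row["feature1"]), float(row["feature2"])])  # features of one sample
        y.append(float(row["target"]))                              # target of one sample

m = len(X)       # number of training samples
n = len(X[0])    # number of features
print(f"{m} training samples with {n} features each")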
In Figure 3907b, the pair (x⁽ⁱ⁾, y⁽ⁱ⁾) is the ith training example. On the other hand, n denotes the number of features used in the training process. Whether the hypothesis obtained from a machine learning process is a random variable depends on various factors; it is not determined only by whether the input data is a random variable. Table 3907ab lists the randomness of the hypothesis depending on the learning algorithm and the input data in the learning pipeline "data > learning algorithm > hypothesis".

Table 3907ab. Randomness of hypothesis depending on learning algorithms and input data.
Hypothesis (for multiple training samples): The hypothesis for linear regression with multiple training samples is represented as a matrix multiplication. Let m be the number of training samples, n be the number of features, X be the m × (n+1) feature (design) matrix, and y be the vector of target values. The hypothesis can be expressed as:

h_θ(X) = Xθ ------------------------------ [3907a]

where θ is the (n+1)-dimensional vector of model parameters and each row of X is one training example with x₀ = 1.
In machine learning, hypothesis representation is a fundamental concept, especially in supervised learning tasks like regression and classification. The hypothesis is essentially the model's way of making predictions or approximating the relationship between input data and the target variable. It is typically represented as a mathematical function or a set of parameters that map input features to output predictions. In linear regression, the hypothesis represents a linear relationship between the input features and the target variable. The hypothesis function is defined as:

h_θ(x) = θ₀ + θ₁x₁ + θ₂x₂ + ... + θₙxₙ ------------------------------ [3907b]

Here, θ₀, θ₁, ..., θₙ are the model parameters and x₁, x₂, ..., xₙ are the input features.
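The sketch below shows how the hypothesis of Equations [3907a] and [3907b] might be evaluated with NumPy, both for a single example and for a stack of examples in a design matrix; the parameter and feature values are made up for illustration.

import numpy as np

# A sketch of the hypothesis in Equations [3907a] and [3907b], assuming
# n = 2 features and illustrative parameter values.
theta = np.array([1.0, 0.5, -0.3])          # [theta0, theta1, theta2]

# Single training example: prepend x0 = 1 so theta0 acts as the intercept.
x = np.array([1.0, 2104.0, 3.0])            # [x0, x1, x2]
h_single = theta @ x                        # theta0 + theta1*x1 + theta2*x2, Equation [3907b]

# Multiple training examples stacked into a design matrix X (m x (n+1)).
X = np.array([
    [1.0, 2104.0, 3.0],
    [1.0, 1416.0, 2.0],
    [1.0, 1534.0, 3.0],
])
h_all = X @ theta                           # vector of predictions, Equation [3907a]
print(h_single, h_all)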
The goal of linear regression is to find the values of θ that minimize the difference between the predictions and the actual target values in the training data. For the multinomial Naive Bayes algorithm, the class-conditional likelihood of the features factorizes as:

p(x|y) = ∏ⱼ₌₁ⁿ p(xⱼ|y) --------------------------------------- [3907ba]

where xⱼ is the jth feature (for example, the jth token in a document) and y is the class label.
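For the multinomial Naive Bayes model, a minimal sketch using scikit-learn's MultinomialNB is shown below; the count data and labels are invented purely to show the fit/predict workflow, not taken from the book.

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# A sketch of Equation [3907ba] using scikit-learn's MultinomialNB.
# The count data (e.g. word counts per document) is made up for illustration.
X = np.array([
    [2, 1, 0, 0],   # counts of each of 4 tokens in one sample
    [0, 0, 3, 1],
    [1, 2, 0, 0],
    [0, 1, 2, 2],
])
y = np.array([0, 1, 0, 1])                  # class labels

clf = MultinomialNB()
clf.fit(X, y)                               # estimates p(x_j | y) and p(y) from the counts
print(clf.predict([[1, 1, 0, 0]]))          # predicted class for a new sample
print(clf.predict_proba([[1, 1, 0, 0]]))    # posterior probabilities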
For instance, for house prices depending on the house size, we have the table below.

Table 4026b. House prices depending on the house size.
The xⱼ in Table 4026b are considered features. Features are also known as independent variables or predictors; they are the input variables used to make predictions or estimate an outcome, which in this case is house prices.

In matrix linear algebra notation for the case of multiple examples, you can represent the derivative of a scalar function J(θ) with respect to a vector θ as the gradient, which is a vector itself. The gradient is denoted as ∇J(θ) and is defined as a vector of partial derivatives of J(θ) with respect to each element of θ. Let θ be an (n+1)-dimensional column vector:

θ = [θ₀, θ₁, ..., θₙ]ᵀ ------------------------------ [3907c]

If J(θ) is a scalar function, then the gradient ∇J(θ) with respect to θ is an (n+1)-dimensional column vector:

∇J(θ) = [∂J/∂θ₀, ∂J/∂θ₁, ..., ∂J/∂θₙ]ᵀ ------------------------------ [3907d]

Each element of ∇J(θ) represents the rate of change of J(θ) with respect to the corresponding element of θ. In this case, there are n+1 terms in the gradient vector because θ is an (n+1)-dimensional vector, including the bias term θ₀. If n = 2, then the gradient ∇J(θ) with respect to θ is a 3-dimensional column vector:

∇J(θ) = [∂J/∂θ₀, ∂J/∂θ₁, ∂J/∂θ₂]ᵀ ------------------------------ [3907e]

These three terms represent the rate of change of J(θ) with respect to each of the three elements of θ: θ₀, θ₁, and θ₂. The specific values of ∂J/∂θ₀, ∂J/∂θ₁, and ∂J/∂θ₂ depend on the function J(θ) and need to be computed from its mathematical expression.

If A is a 2 × 2 matrix, then A can be written as:

A = [A₁₁  A₁₂; A₂₁  A₂₂] ------------------------------ [3907f]

In some cases, e.g. when used as a regularization term, we have the function:

f(A) = A₁₁ + (A₁₂)² ------------------------ [3907g]

Since:

∂f/∂A₁₁ = 1 ------------------------ [3907h]
∂f/∂A₁₂ = 2A₁₂ ------------------------ [3907i]
∂f/∂A₂₁ = 0 ------------------------ [3907j]
∂f/∂A₂₂ = 0 ------------------------ [3907k]

the derivative of f(A) with respect to the elements of the matrix A is:

∇_A f(A) = [1  2A₁₂; 0  0] ------------------------ [3907l]

Here, each element of this matrix represents the rate of change of f(A) with respect to the corresponding element of A. The derivative (gradient) of the function f(θ) = θ₁₁ + (θ₁₂)² with respect to θ, and the function itself, can be plotted by a Python script as shown in Figure 3907c(a) and (b).
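The original plotting script for Figure 3907c is not reproduced here; the sketch below is one possible way to plot f = θ₁₁ + (θ₁₂)² and the magnitude of its gradient with Matplotlib, with the grid range chosen arbitrarily.

import numpy as np
import matplotlib.pyplot as plt

# One possible sketch for Figure 3907c: panel (a) plots f = theta11 + theta12**2
# and panel (b) plots the magnitude of its gradient (1, 2*theta12).
# The plotting range [-2, 2] is an arbitrary choice.
t11, t12 = np.meshgrid(np.linspace(-2, 2, 100), np.linspace(-2, 2, 100))
f = t11 + t12**2
grad_mag = np.sqrt(1.0**2 + (2.0 * t12)**2)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
c1 = ax1.contourf(t11, t12, f, levels=30)
ax1.set_title(r"(a) $f = \theta_{11} + \theta_{12}^2$")
fig.colorbar(c1, ax=ax1)
c2 = ax2.contourf(t11, t12, grad_mag, levels=30)
ax2.set_title(r"(b) $|\nabla f|$")
fig.colorbar(c2, ax=ax2)
for ax in (ax1, ax2):
    ax.set_xlabel(r"$\theta_{11}$")
    ax.set_ylabel(r"$\theta_{12}$")
plt.tight_layout()
plt.show()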
In global optimization, we set:

∇J(θ) = 0 ------------------------ [3907m]

In this sense, setting the gradient to zero is an easy way to characterize the θ that minimizes J(θ). The cost function in linear regression is typically represented using the mean squared error (MSE) for multiple training samples. The cost function is defined as:

J(θ) = (1/2m) Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)² ------------------------------ [3907n]

where m is the number of training samples, h_θ(x⁽ⁱ⁾) is the prediction for the ith training example, and y⁽ⁱ⁾ is its actual target value.
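A sketch of Equation [3907n] in code is given below, assuming the design matrix X carries a leading column of ones so that θ₀ acts as the intercept; the numbers are illustrative.

import numpy as np

# A sketch of the cost function in Equation [3907n]:
# J(theta) = (1 / (2m)) * sum_i (h_theta(x(i)) - y(i))**2,
# with X an m x (n+1) design matrix whose first column is all ones.
def cost(theta, X, y):
    m = len(y)
    residuals = X @ theta - y               # h_theta(x(i)) - y(i) for every sample
    return (residuals @ residuals) / (2 * m)

# Illustrative data (values are made up).
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 2.5, 3.5])
theta = np.array([1.0, 0.8])
print(cost(theta, X, y))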
We call the capital X the design matrix; it stacks the training inputs as rows:

X = [(x⁽¹⁾)ᵀ; (x⁽²⁾)ᵀ; ...; (x⁽ᵐ⁾)ᵀ] ----------------------------- [3907o]

Then,

Xθ = [(x⁽¹⁾)ᵀθ; (x⁽²⁾)ᵀθ; ...; (x⁽ᵐ⁾)ᵀθ] ----------------------------- [3907p]

θ is a vector, so that we have,

(x⁽ⁱ⁾)ᵀθ = h_θ(x⁽ⁱ⁾) ----------------------------- [3907q]

Xθ = [h_θ(x⁽¹⁾); h_θ(x⁽²⁾); ...; h_θ(x⁽ᵐ⁾)] ------------------------ [3907s]

For the y vector, the targets stack up into a big column vector,

y = [y⁽¹⁾; y⁽²⁾; ...; y⁽ᵐ⁾] ------------------------------ [3907t]

Then, J(θ) can be given by,

J(θ) = (1/2m) Σᵢ₌₁ᵐ (h_θ(x⁽ⁱ⁾) − y⁽ⁱ⁾)² ------------------------------ [3907u]

Combining Equations 3907s and 3907t, we have,

Xθ − y = [h_θ(x⁽¹⁾) − y⁽¹⁾; ...; h_θ(x⁽ᵐ⁾) − y⁽ᵐ⁾] ------------------------------ [3907v]

We know that, for any vector z,

zᵀz = Σᵢ zᵢ² ------------------------------ [3907w]

Therefore,

J(θ) = (1/2m)(Xθ − y)ᵀ(Xθ − y) ------------------------------ [3907x]

Then, we have,

∇_θ J(θ) = ∇_θ (1/2m)(Xθ − y)ᵀ(Xθ − y) ----------------------------- [3907y]

= (1/2m) ∇_θ (θᵀXᵀXθ − θᵀXᵀy − yᵀXθ + yᵀy) ----------------------------- [3907z]

= (1/m)(XᵀXθ − Xᵀy) ----------------------------- [3907za]

Setting this gradient to zero, we have,

XᵀXθ = Xᵀy ----------------------------- [3907zb]

Therefore, we can get the "Normal Equation", given by,

θ = (XᵀX)⁻¹Xᵀy ----------------------------- [3907zc]

where X is the m × (n+1) design matrix, y is the m-dimensional vector of target values, and θ is the parameter vector that minimizes J(θ).
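The Normal Equation [3907zc] can be evaluated directly with NumPy, as in the sketch below; np.linalg.solve is applied to XᵀXθ = Xᵀy instead of forming an explicit inverse, which is numerically safer, and the data values are made up.

import numpy as np

# A sketch of the Normal Equation [3907zc]: theta = (X^T X)^(-1) X^T y.
# np.linalg.solve is used on (X^T X) theta = X^T y; np.linalg.lstsq would
# also work. The data below is illustrative only.
X = np.array([
    [1.0, 2104.0],
    [1.0, 1416.0],
    [1.0, 1534.0],
    [1.0,  852.0],
])                                          # design matrix with a leading column of ones
y = np.array([400.0, 232.0, 315.0, 178.0])  # target values

theta = np.linalg.solve(X.T @ X, X.T @ y)   # solves (X^T X) theta = X^T y
print("theta:", theta)
print("predictions:", X @ theta)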
The Normal Equation is a mathematical formula used in linear regression to find the coefficients (parameters) of a linear model that best fits a given set of data points. Linear regression is a statistical method used to model the relationship between a dependent variable (the target or output) and one or more independent variables (predictors or features) by fitting a linear equation to the observed data. By solving the Normal Equation, we obtain the values of the coefficients θ that minimize the sum of squared differences between the predicted values of the dependent variable and the actual observed values. These coefficients define the best-fitting linear model for the given data.

While the Normal Equation provides a closed-form solution for linear regression, there are also iterative optimization methods like gradient descent that can be used to find the coefficients, especially when dealing with more complex models or large datasets. Nonetheless, the Normal Equation is a valuable tool for understanding the fundamental principles of linear regression and for solving simple linear regression problems analytically.

When you use the Normal Equation to solve for the coefficients (θ) in linear regression, you are essentially finding the values of θ that correspond to the global minimum of the cost function in a single step. In linear regression, the goal is to find the values of θ that minimize a cost function, often represented as J(θ). This cost function measures the error, i.e. the difference between the predicted values (obtained using the linear model with θ) and the actual observed values in your dataset. For SVM, we can transform (θ₀, θ₁, ..., θₙ) into (b, w₁, ..., wₙ):

θᵀx = wᵀx + b, with b = θ₀ and w = [θ₁, ..., θₙ]ᵀ ------------------- [3907zd]

To find the values of θ that minimize this cost function, you can use the Normal Equation, which provides an analytical solution. When you solve the Normal Equation, you find the exact values of θ that minimize J(θ) by setting the gradient of J(θ) with respect to θ equal to zero. The key point is that this solution is obtained directly, without the need for iterative optimization algorithms like gradient descent. Gradient descent, for example, iteratively adjusts the parameters θ to minimize the cost function, which may take many steps to converge to the global minimum. In contrast, the Normal Equation provides a closed-form solution that directly computes the optimal θ values in a single step by finding the point where the gradient is zero. However, note that the Normal Equation has some limitations: computing (XᵀX)⁻¹ is expensive (roughly cubic in the number of features), so it scales poorly when n is very large, and XᵀX can be non-invertible when features are linearly dependent (redundant) or when there are more features than training samples.
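For comparison with the closed-form solution, the sketch below runs batch gradient descent on the same cost and checks that it approaches the Normal Equation result; the learning rate, iteration count, and data are illustrative choices, and the final lines show the (b, w) split of Equation [3907zd].

import numpy as np

# A sketch of batch gradient descent on J(theta), compared with the
# closed-form Normal Equation. Learning rate and iteration count are
# illustrative and may need tuning for other data.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([2.0, 2.9, 4.1, 5.2])
m = len(y)

theta = np.zeros(X.shape[1])
alpha = 0.05                                 # learning rate
for _ in range(5000):
    grad = (X.T @ (X @ theta - y)) / m       # gradient of (1/2m) * ||X theta - y||^2
    theta -= alpha * grad

theta_closed = np.linalg.solve(X.T @ X, X.T @ y)   # Normal Equation result
print("gradient descent:", theta)
print("normal equation: ", theta_closed)

# The reparametrization in Equation [3907zd]: b = theta0, w = theta[1:].
b, w = theta[0], theta[1:]
print("b =", b, "w =", w)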
One example of feature illustration is that, in binary classification (e.g. Table 3907b), you have two classes, typically denoted as "positive" (P) and "negative" (N). Given a set of features (X), you want to determine the probability that an observation belongs to the positive class. The probability of an observation belonging to the positive class, given the features, is calculated using Bayes' theorem as follows:

P(P|X) = P(X|P)·P(P) / [P(X|P)·P(P) + P(X|N)·P(N)] ------------------------------------------ [3907ze]

where P(P|X) is the posterior probability that the observation is positive given the features X, P(X|P) and P(X|N) are the likelihoods of observing the features under each class, and P(P) and P(N) are the prior probabilities of the two classes.
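A numerical sketch of Equation [3907ze] is shown below with assumed prior and likelihood values; the probabilities are invented solely to illustrate how the posterior is computed.

# A sketch of Equation [3907ze] with made-up probabilities.
p_pos = 0.3                    # P(P), prior probability of the positive class
p_neg = 1.0 - p_pos            # P(N), prior probability of the negative class
p_x_given_pos = 0.8            # P(X | P), likelihood of the features under P
p_x_given_neg = 0.1            # P(X | N), likelihood of the features under N

# Bayes' theorem: P(P | X) = P(X | P) P(P) / [P(X | P) P(P) + P(X | N) P(N)]
evidence = p_x_given_pos * p_pos + p_x_given_neg * p_neg
p_pos_given_x = p_x_given_pos * p_pos / evidence
print(f"P(P | X) = {p_pos_given_x:.3f}")    # ~0.774 for these numbers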
Table 4026c. Binary classification.

On the other hand, the bias-variance trade-off is also a crucial concept in machine learning. If you increase the complexity of a model (e.g., by adding more features or using a more complex algorithm), you reduce bias but increase variance, and vice versa.
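The sketch below illustrates the bias-variance trade-off by fitting polynomials of increasing degree to noisy samples of a sine curve; the data, noise level, and degrees are arbitrary illustrative choices.

import numpy as np

# A sketch of the bias-variance trade-off: low-degree fits underfit
# (high bias), while high-degree fits match the noisy training points
# closely but generalize poorly (high variance).
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)       # least-squares polynomial fit
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE = {train_err:.3f}, test MSE = {test_err:.3f}")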