=================================================================================
The Normal Equation is a mathematical formula used in linear regression to find the coefficients (parameters) of a linear model that best fits a given set of data points. Linear regression is a statistical method used to model the relationship between a dependent variable (the target or output) and one or more independent variables (predictors or features) by fitting a linear equation to the observed data.
By solving the Normal Equation, we can obtain the values of the coefficients θ that minimize the sum of squared differences between the predicted values of the dependent variable and the actual observed values. These coefficients define the best-fitting linear model for the given data. While the Normal Equation provides a closed-form solution for linear regression, iterative optimization methods such as gradient descent can also be used to find the coefficients, especially for more complex models or large datasets. Nonetheless, the Normal Equation is a valuable tool for understanding the fundamental principles of linear regression and for solving simple linear regression problems analytically.
When you use the Normal Equation to solve for the coefficients (θ) in linear regression, you are essentially finding the values of θ that correspond to the global minimum of the cost function in a single step. In linear regression, the goal is to find the values of θ that minimize a cost function, often represented as J(θ). This cost function measures the error or the difference between the predicted values (obtained using the linear model with θ) and the actual observed values in your dataset.
To find the values of θ that minimize this cost function, you can use the Normal Equation, which provides an analytical solution. When you solve the Normal Equation, you find the exact values of θ that minimize J(θ) by setting the gradient of J(θ) with respect to θ equal to zero.
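Written out explicitly (using X for the design matrix, y for the vector of observed targets, and m for the number of training examples; this notation is assumed here, since the section does not fix symbols), the cost function, its gradient, and the resulting Normal Equation are:

```latex
J(\theta) = \frac{1}{2m}\,(X\theta - y)^{\top}(X\theta - y)

\nabla_{\theta} J(\theta) = \frac{1}{m}\,X^{\top}(X\theta - y) = 0
\;\Longrightarrow\;
X^{\top}X\,\theta = X^{\top}y
\;\Longrightarrow\;
\theta = (X^{\top}X)^{-1}\,X^{\top}y
```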
The key point is that this solution is obtained directly, without iterative optimization algorithms like gradient descent. Gradient descent iteratively adjusts the parameters θ to minimize the cost function, which may take many steps to converge to the global minimum. In contrast, the Normal Equation provides a closed-form solution that computes the optimal θ values in a single step by finding the point where the gradient is zero.
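As a minimal sketch of this one-step computation (the toy data below is a synthetic example, not from the text):

```python
import numpy as np

# Toy data: y = 4 + 3x plus a little noise (illustrative assumption).
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=(100, 1))
y = 4 + 3 * x + rng.normal(0, 0.1, size=(100, 1))

# Design matrix with a bias column of ones prepended.
X = np.hstack([np.ones_like(x), x])

# Normal Equation: theta = (X^T X)^{-1} X^T y, computed in one step.
theta = np.linalg.inv(X.T @ X) @ X.T @ y

print(theta.ravel())  # close to the true parameters [4, 3]
```

In practice, `np.linalg.solve(X.T @ X, X.T @ y)` (or `np.linalg.lstsq`) is preferred over forming the explicit inverse, for numerical stability.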
However, note that the Normal Equation has some limitations:
 It may not be suitable for very large datasets because of the matrix inversion operation, which can be computationally expensive.
 It requires that the matrix X^T X (where X is the design matrix) be invertible. When it is not invertible (e.g., due to multicollinearity), you may need to use regularization techniques.
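The non-invertible case can be sketched as follows: a perfectly collinear duplicate feature makes X^T X singular, and either L2 (ridge) regularization or the Moore-Penrose pseudo-inverse recovers a usable solution (the data and the λ value below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 2, size=(100, 1))
y = 4 + 3 * x + rng.normal(0, 0.1, size=(100, 1))

# Duplicating the feature column makes X^T X singular (multicollinearity).
X = np.hstack([np.ones_like(x), x, x])

# Ridge regularization adds lambda*I, which makes the matrix invertible
# again. (The bias term is often left unregularized; it is regularized
# here only to keep the sketch short.)
lam = 1e-3
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Alternatively, the pseudo-inverse handles the singular case directly,
# returning the minimum-norm least-squares solution.
theta_pinv = np.linalg.pinv(X) @ y

# Both split the weight of the duplicated feature between the two
# collinear columns; their predictions agree.
print(theta_ridge.ravel(), theta_pinv.ravel())
```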
Table 3892. Comparison between machine learning with small and large datasets.

Dataset
 Small dataset: Limited data. Small datasets contain only a limited number of examples, making it challenging for models to learn and generalize effectively. Each data point is more valuable and requires careful handling.
 Large dataset: Abundance of data. Large datasets typically contain a vast amount of data, providing ample examples for training and evaluation. This abundance helps machine learning models generalize well and learn complex patterns.

Complexity of models
 Small dataset: Simpler models. Due to the risk of overfitting, simpler models, such as linear models or shallow neural networks, are often preferred for small datasets. Complex models may perform poorly.
 Large dataset: Complex models. With a large dataset, you can afford to use more complex and deep models, such as deep neural networks. These models have a greater capacity to capture intricate relationships within the data.

Overfitting
 Small dataset: Small datasets are highly susceptible to overfitting. Regularization techniques, such as L1 and L2 regularization, are typically employed to combat this problem.
 Large dataset: Reduced risk of overfitting. Large datasets reduce the risk of overfitting, as models are less likely to memorize the data and more likely to learn meaningful patterns. This allows more flexibility in model selection and hyperparameter tuning.

Computation intensity
 Small dataset: Computationally less demanding. Training models on smaller datasets typically requires less computational power and time, and you may not need specialized hardware or distributed computing frameworks. Smaller datasets also enable faster prototyping, since you can iterate through different algorithms, hyperparameters, and feature-engineering ideas more rapidly; and they are more cost-efficient because they require fewer computational resources, which is especially important for organizations with budget constraints.
 Large dataset: Training models on large datasets can be computationally intensive. You might require high-performance hardware and distributed computing frameworks to handle the scale efficiently.

Feature engineering
 Small dataset: Feature engineering and careful feature selection become crucial, as the limited data requires more attention to ensure that relevant information is extracted.
 Large dataset: Large datasets can benefit from automated feature selection and extraction techniques due to the sheer volume of data. Feature engineering can be less critical, as models can learn relevant features from the data.

Robustness to noise
 Large dataset: Large datasets tend to be more robust to noise and outliers, as they are less likely to be influenced by a few erroneous data points.

Cross-validation
 Small dataset: Cross-validation can be challenging, as splitting the data into smaller subsets for validation may lead to less reliable performance estimates. For small datasets (e.g., 100 samples), k-fold cross-validation is suitable; for extremely small datasets (e.g., 20-50 samples), leave-one-out cross-validation (LOOCV) can be used.
 Large dataset: Cross-validation is effective in assessing model performance, and standard techniques like k-fold cross-validation can be used to ensure robust evaluations.

Code and algorithm
 Small dataset: Efficient coding and algorithm selection are essential to make the most of the available data. This includes optimizing code for preprocessing, training, and evaluation.
 Large dataset: While large datasets can tolerate more computationally intensive algorithms, the efficiency of your code and algorithms still matters; efficient implementations save time and resources during training. However, model performance may be less sensitive to minor inefficiencies than with a small dataset.

Model interpretability
 Small dataset: In certain applications, interpretability is more critical when working with small datasets, as decision-making may need to be explained to stakeholders.
 Large dataset: Model interpretability remains important, especially when the stakes are high, such as in medical or legal applications. Even with a large dataset, you might need to explain the model's decisions, ensure fairness and ethics, or comply with regulations. The need for interpretability depends on the specific use case and context.

Domain knowledge
 Small dataset: Leveraging domain knowledge and prior expertise is crucial for addressing the limitations of small datasets and making informed modeling decisions.
 Large dataset: Domain knowledge can guide feature engineering, model selection, and the understanding of the problem. With large datasets you have more data to help uncover patterns, but domain knowledge still enhances your ability to make informed decisions.
Figure 3892a shows linear regression learning curves, which are produced by training the model on subsets of the training data of increasing size and plotting the training and test errors for each subset size.
Figure 3892a. Linear regression learning curves. (code)
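The linked code is not reproduced here; as a minimal sketch, learning curves of this kind might be generated as follows (the synthetic data and scikit-learn usage are assumptions, since the original code is not shown):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic linear data with noise (illustrative assumption).
rng = np.random.default_rng(42)
X = rng.uniform(0, 3, size=(200, 1))
y = 4 + 3 * X[:, 0] + rng.normal(0, 1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

train_errors, test_errors = [], []
sizes = list(range(2, len(X_train) + 1, 10))
for m in sizes:
    # Fit on the first m training examples only.
    model = LinearRegression().fit(X_train[:m], y_train[:m])
    train_errors.append(mean_squared_error(y_train[:m], model.predict(X_train[:m])))
    test_errors.append(mean_squared_error(y_test, model.predict(X_test)))

# Training error rises toward the noise floor as m grows, while test
# error falls: the two curves converge, as in the figure.
print(train_errors[0], test_errors[0], train_errors[-1], test_errors[-1])
```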
When the training sample size is small, the training error is also small. For instance, if we have only one training example, any algorithm can fit it perfectly. When the sample size is large, it is harder to fit the training data perfectly. However, when the training sample size is small, the test error still decreases as the sample size increases, which suggests that a larger training set will help. The larger gap between training and test errors at small sample sizes is often attributed to high model variance. This is a common occurrence when the training set is relatively small and the model has a higher tendency to overfit the training data:

Overfitting at Small Sample Sizes:
 With a small training set, the model may have a higher chance of fitting the noise or specific patterns present in that limited dataset.
 The model can become overly complex and memorize the training examples, capturing random fluctuations instead of learning the underlying patterns.
Model Complexity:
 Models with higher complexity, such as those with a large number of parameters, are more prone to overfitting, especially when the amount of training data is limited.
 At small sample sizes, complex models may perform well on the training set but struggle to generalize to unseen data.
Variance Dominance:
 When the training set is small, the impact of random variability in the data (variance) can dominate the learning process.
 The model may become overly sensitive to the specific examples in the training set, leading to a larger gap between the training and test errors.
Limited Diversity in Training Data:
 A small training set may not adequately represent the diversity of the underlying data distribution.
 The model may not generalize well to new examples that differ from those in the training set.
To address these issues, practitioners often consider strategies such as:

Increasing the Training Set Size: Collecting more data can help mitigate overfitting by providing a more comprehensive representation of the underlying patterns.

Simplifying the Model: Using simpler models with fewer parameters or incorporating regularization techniques can reduce the risk of overfitting.

Cross-Validation: Employing techniques like cross-validation can provide a more robust estimate of model performance by assessing it on multiple train-test splits.
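The k-fold and leave-one-out strategies mentioned above can be sketched with scikit-learn as follows (the tiny synthetic dataset is an illustrative assumption):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

# Deliberately small sample to mimic the small-dataset setting.
rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=(30, 1))
y = 4 + 3 * X[:, 0] + rng.normal(0, 0.5, size=30)

model = LinearRegression()

# K-fold cross-validation; k=5 is a common default.
kfold_scores = cross_val_score(
    model, X, y,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    scoring="neg_mean_squared_error",
)

# LOOCV for an extremely small dataset: each sample serves once as the
# single-element test set, so there are as many folds as samples.
loo_scores = cross_val_score(
    model, X, y, cv=LeaveOneOut(), scoring="neg_mean_squared_error"
)

print(-kfold_scores.mean(), -loo_scores.mean())
```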
The general trend is that the test error decreases roughly in proportion to 1/√m, where m is the training set size, as shown in Figure 3892b, until it reaches some floor, often referred to as the "Bayes error" or "irreducible error." However, it's important to note that learning algorithms may not always drive the test error to zero, even with an infinite amount of data.
Figure 3892b. General trend of the test error decreasing with the inverse square root of the training set size. (Code).
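A quick numeric sketch of this 1/√m trend; the irreducible-error floor and the scaling constant below are illustrative assumptions, not values from the figure:

```python
import numpy as np

irreducible_error = 0.5   # assumed Bayes-error floor
c = 2.0                   # assumed constant scaling the 1/sqrt(m) term

sizes = np.array([10, 100, 1_000, 10_000, 100_000])
test_error = irreducible_error + c / np.sqrt(sizes)

for m, e in zip(sizes, test_error):
    print(f"m = {m:>6d}  expected test error ~ {e:.3f}")
# The error keeps falling but approaches the irreducible floor,
# never reaching zero no matter how large m grows.
```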
With a small dataset (e.g. 100 images), in some cases we can still get good results. However, with a small dataset, some insightful design of the machine learning pipeline is needed. (see page3700)
DL (deep learning) is extremely data-hungry [1, 2]: it demands an extensively large amount of data to achieve a well-behaved model, i.e., as the amount of data increases, an even better-performing model can be achieved, as shown in Figure 3892c.
Figure 3892c. Performance of DL with respect to the amount of data. [3]
============================================
[1] Karimi H, Derr T, Tang J. Characterizing the decision boundary of deep neural networks; 2019. arXiv preprint arXiv:1912.11460.
[2] Li Y, Ding L, Gao X. On the decision boundary of deep neural networks; 2018. arXiv preprint arXiv:1808.05385.
[3] Laith Alzubaidi, Jinglan Zhang, Amjad J. Humaidi, Ayad Al-Dujaili, Ye Duan, Omran Al-Shamma, J. Santamaría, Mohammed A. Fadhel, Muthana Al-Amidie and Laith Farhan, Review of deep learning: concepts, CNN architectures, challenges, applications, future directions, Journal of Big Data, 8:53, https://doi.org/10.1186/s40537-021-00444-8, (2021).
