Example of ML debugging/diagnostic: Anti-Spam

- Python Automation and Machine Learning for ICs -
- An Online Book -

Python Automation and Machine Learning for ICs http://www.globalsino.com/ICs/


=================================================================================

Debugging an ML system can be very complicated if you want to do it well; doing it thoroughly can amount to PhD-thesis-level work.

Debugging example A:

A machine learning (ML) example is anti-spam. [1] In this example, we carefully choose a small set of 100 words as features, instead of using all 50,000+ words in English.

In this ML example, we used logistic regression with regularization (Bayesian logistic regression), implemented with gradient descent. We got 20% test error, which is unacceptably high.

Anti-spam ----------------------------------- [3710a]
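Assuming an L2 penalty on θ (the usual choice for regularized logistic regression; the exact scaling of the penalty in the original equation may differ), the objective being maximized can be written as:

```latex
\max_{\theta} \; J(\theta) = \sum_{i=1}^{m} \log p\big(y^{(i)} \mid x^{(i)}; \theta\big) - \lambda \lVert \theta \rVert^{2}
```

Here m is the number of training emails, and λ controls how strongly large parameter values are penalized.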

Then, how can we debug it?

One common approach to debugging an ML algorithm is to try to improve it in different ways:

          i) Try to get more training examples. This almost always helps, since more data normally improves training. This fixes high variance.
          ii) Try to use a smaller set of features. Some of the hundred features are probably not relevant. This fixes high variance.
          iii) Try to use a larger set of features. A hundred features may be too few. A larger set of features fixes high bias.
          iv) Try to change or re-design the features. For instance, you can use email header features instead of only email body features, or use which services the emails passed through to get to you.
          v) Try to run gradient descent for more iterations. This fixes a problem with the optimization algorithm.
          vi) Try to switch to Newton’s method. This fixes a problem with the optimization algorithm.
          vii) Try to use a different value for λ. This fixes a problem with the optimization objective.
          viii) Try to use a totally different algorithm, e.g. an SVM.

Many teams pick one of the ideas above more or less at random and then spend a few days or a few weeks trying it, which may not be the best use of time. Each of the first four methods above fixes either a high-variance or a high-bias problem.

The most common diagnostic is the bias-variance diagnostic. Understanding the contributions of bias and variance to the algorithm's performance is crucial for diagnosing a model and helps guide your debugging efforts. Without such an analysis, it is hard to assess which option above is the best, because we do not have a clear understanding of the impact of each option on the problem at hand. Techniques such as cross-validation, learning curves, and error analysis can be used to gain insight into bias and variance and to make informed adjustments that improve model generalization.

Note that becoming proficient at the bias-variance diagnostic can take a couple of years of deep practice, although this varies from person to person. In this process, you practice machine learning algorithms, find the problems, and then fix them:

Complexity of Machine Learning: Machine learning is a complex field that encompasses a wide range of algorithms, techniques, and domains. Mastering it involves not only understanding theoretical concepts like bias and variance but also gaining practical experience in applying these concepts to real-world problems.
Deep Practice: Deep practice refers to deliberate, focused, and often challenging practice that pushes one's limits. Becoming adept at diagnosing and addressing bias and variance issues involves hands-on experience, experimenting with different models, datasets, and strategies for mitigating these issues.
Iterative Nature: Improving as a machine learning practitioner typically involves an iterative process. You learn from your experiences, identify challenges, and continually refine your approach. This iterative cycle contributes to a deeper understanding of the nuances involved in managing bias and variance.
Domain Knowledge: Deep practice in machine learning also requires domain-specific knowledge. Understanding the characteristics of the data and the context of the problem is crucial for effective model diagnosis and debugging.
Evolution of Techniques: The field of machine learning is dynamic, with new techniques, algorithms, and best practices continually emerging. Staying updated and adapting to these changes is part of the ongoing learning process.

The diagnostic will be:

i) Variance: Training error will be much lower than test error.

ii) Bias: Training error will also be high.
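The two-part diagnostic above can be sketched as a small helper. This is only an illustration: the function name, the target error, and the gap tolerance are hypothetical choices, and the suggested fixes are taken from options i) to iii) in the list of improvements.

```python
def diagnose_bias_variance(train_err, test_err, target_err, gap_tol=0.05):
    """Apply the two-part diagnostic to a pair of error rates (fractions in [0, 1]).

    target_err and gap_tol are hypothetical thresholds chosen for illustration.
    """
    findings = {}
    if test_err - train_err > gap_tol:
        # Training error is much lower than test error.
        findings["high variance"] = [
            "get more training examples",
            "try a smaller set of features",
        ]
    if train_err > target_err:
        # Training error itself is too high.
        findings["high bias"] = ["try a larger set of features"]
    return findings


# 20% test error as in the example; the 2% and 18% training errors are made up.
print(diagnose_bias_variance(train_err=0.02, test_err=0.20, target_err=0.05))
print(diagnose_bias_variance(train_err=0.18, test_err=0.20, target_err=0.05))
```

With the same 20% test error, the two made-up training errors lead to different findings, which is why the training error has to be measured before choosing a fix.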

Figure 3710a shows linear regression learning curves, which are obtained by training the model on subsets of the training data of increasing size and plotting the training and test errors for each subset size. When the training sample size is small, the training error is also small. For instance, if we have only one training example, almost any algorithm can fit it perfectly. When the sample size is large, it is harder to fit the training data perfectly, so the training error rises. However, if the test error is still decreasing as the sample size increases, a larger training set is likely to help. A large gap between the training and test errors, which is most pronounced at small sample sizes, is often attributed to high model variance. On the other hand, a large deviation of the training error from the desired performance is a sign of high bias. Another sign of high bias is a very small gap between the training error and the test error, combined with a training error that never comes back down to the desired performance line no matter how many samples we have. Note that a common “standard” of desired performance (maximum allowed error) is what humans can achieve. In general, the test error is expected to be higher than the training error, regardless of the number of samples.

Figure 3710a. Linear regression learning curves.
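Learning curves of this kind can be produced with a few lines of code. The sketch below is not the code behind Figure 3710a; it assumes a one-feature least-squares fit on synthetic data (true line, noise level, and subset sizes are made up) and prints the two errors instead of plotting them.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def mse(a, b, xs, ys):
    """Mean squared error of the line y = a + b*x on the given points."""
    return sum((a + b * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def learning_curve(train, test, sizes):
    """train and test are (xs, ys) pairs; returns a list of (m, train_mse, test_mse)."""
    rows = []
    for m in sizes:
        xs, ys = train[0][:m], train[1][:m]  # first m training examples
        a, b = fit_line(xs, ys)
        rows.append((m, mse(a, b, xs, ys), mse(a, b, test[0], test[1])))
    return rows


def make_data(n):
    """Synthetic data: y = 1 + 2x plus Gaussian noise (illustrative only)."""
    xs = [random.uniform(-3.0, 3.0) for _ in range(n)]
    ys = [1.0 + 2.0 * x + random.gauss(0.0, 1.0) for x in xs]
    return xs, ys


random.seed(0)
train, test = make_data(200), make_data(200)
curve = learning_curve(train, test, sizes=[2, 5, 10, 50, 200])
for m, train_mse, test_mse in curve:
    print(f"m={m:4d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

With only two training points the line passes through both, so the training error is essentially zero; this is the small-sample end of the curves described above.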

Debugging example B:

Another debugging example is: [1]

i) Logistic regression gets 2% error on spam, and 2% error on non-spam. (unacceptably high error on non-spam)

ii) SVM using a linear kernel gets 10% error on spam and 0.01% error on non-spam, outperforming logistic regression on non-spam. (Acceptable performance)

iii) But you want to use logistic regression because of computational efficiency, etc.

Question: What to do next?

Analysis:

A 2% error on spam is OK, since we just need to read a small number of spam emails. But a 2% error on non-spam is unacceptable, because we lose 2% of important emails. However, logistic regression is more computationally efficient, which makes it easier to update, so we would still prefer to use it.
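Because the two kinds of error matter so differently, it helps to report them separately rather than as a single error rate. A minimal sketch, assuming labels are encoded as 1 for spam and 0 for non-spam (an encoding chosen here for illustration) and using made-up predictions:

```python
def per_class_error(y_true, y_pred):
    """Return (spam_error, non_spam_error); labels are 1 = spam, 0 = non-spam."""
    errors = {0: 0, 1: 0}
    counts = {0: 0, 1: 0}
    for truth, pred in zip(y_true, y_pred):
        counts[truth] += 1
        if pred != truth:
            errors[truth] += 1
    # Guard against a class that does not occur in the sample.
    spam_err = errors[1] / counts[1] if counts[1] else 0.0
    non_spam_err = errors[0] / counts[0] if counts[0] else 0.0
    return spam_err, non_spam_err


# Made-up labels: 4 spam emails followed by 6 non-spam emails.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]
spam_err, non_spam_err = per_class_error(y_true, y_pred)
print(f"error on spam: {spam_err:.2%}, error on non-spam: {non_spam_err:.2%}")
```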

Then, we need to ask some common questions:

i) Understand the comparison between the two algorithms (see page3814).

Is the algorithm (gradient ascent for logistic regression) converging?

ii) Do we need to extend the training time (see page3707)?

iii) Are we optimizing the right loss function?

For instance, what do we care about in the equation below?

Anti-spam ----------------------------------- [3710b]

The weights w⁽ⁱ⁾ are higher for non-spam than for spam, so we need to consider the weighted accuracy criterion.

iv) For logistic regression, do we have the correct value for λ in Equation 3710a? We know we are optimizing J(θ), but with a poorly chosen λ we are probably optimizing the wrong cost function.
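A weighted accuracy of this kind can be computed as the weight of the correctly classified emails divided by the total weight. In the sketch below the normalization by the total weight, the 10-to-1 weighting, and the label encoding (1 = spam, 0 = non-spam) are assumptions made for illustration.

```python
def weighted_accuracy(y_true, y_pred, weights):
    """Sum of the weights of correctly classified examples over the total weight."""
    correct = sum(w for y, p, w in zip(y_true, y_pred, weights) if y == p)
    return correct / sum(weights)


# Two spam emails (label 1) and two non-spam emails (label 0).
y_true = [1, 1, 0, 0]
# Hypothetical weighting: a non-spam email counts ten times as much as a spam email.
weights = [10.0 if y == 0 else 1.0 for y in y_true]

pred_misses_spam = [1, 0, 0, 0]      # lets one spam email through
pred_blocks_non_spam = [1, 1, 1, 0]  # blocks one legitimate email

print(weighted_accuracy(y_true, pred_misses_spam, weights))
print(weighted_accuracy(y_true, pred_blocks_non_spam, weights))
```

Both predictions get three of four emails right, so their plain accuracy is identical, but the weighted accuracy separates them clearly.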

v) How about SVM? Do we have correct value for C?

Minimization of objective function (see page4270):

Anti-spam ----------------------------------- [3710c]

Subject to constraint (see page4270):

Anti-spam ------------------------ [3710d]
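For reference, the soft-margin linear SVM is commonly written in the following form, with labels y⁽ⁱ⁾ in {−1, +1}. The scaling of the first term and the sign convention for b vary between texts, so this may differ in detail from the equations referenced above:

```latex
\min_{w,\,b,\,\xi} \; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{m} \xi_i
\quad \text{subject to} \quad
y^{(i)}\big(w^{T} x^{(i)} + b\big) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \quad i = 1, \dots, m
```

A larger C penalizes margin violations more heavily, so C plays a role opposite to that of λ in regularized logistic regression.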

Questions to answer: Does the gradient descent converge (J versus θ)? Are you optimizing a wrong function?
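The convergence question can be checked empirically by recording the objective at every iteration. The sketch below assumes that J(θ) is the log-likelihood with an L2 penalty, uses plain batch gradient ascent, and runs on a made-up one-feature dataset; the function names, learning rate, and tolerance are illustrative choices.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def objective(theta, X, y, lam):
    """Regularized log-likelihood J(theta); larger is better."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # avoid log(0)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total - lam * sum(t * t for t in theta)


def gradient_ascent(X, y, lam=0.1, lr=0.1, iters=200):
    """Batch gradient ascent on J; returns (theta, history of J values)."""
    theta = [0.0] * len(X[0])
    history = [objective(theta, X, y, lam)]
    for _ in range(iters):
        grad = [-2.0 * lam * t for t in theta]  # gradient of the penalty term
        for xi, yi in zip(X, y):
            err = yi - sigmoid(sum(t * v for t, v in zip(theta, xi)))
            for j, v in enumerate(xi):
                grad[j] += err * v
        theta = [t + lr * g for t, g in zip(theta, grad)]
        history.append(objective(theta, X, y, lam))
    return theta, history


def has_converged(history, tol=1e-6):
    """True when the last iteration changed J by less than tol."""
    return abs(history[-1] - history[-2]) < tol


# Made-up data: an intercept column plus one centered feature; not linearly separable.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, -0.5], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 0, 1, 1]
theta, history = gradient_ascent(X, y)
print(f"J: {history[0]:.4f} -> {history[-1]:.4f}, converged: {has_converged(history)}")
```

If J is still rising when training stops, running more iterations is the first thing to try; if J has flattened out, the optimizer is probably not the problem.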

In some cases, an SVM outperforms logistic regression, but we really want to deploy logistic regression for our application (page3814). The criterion on which the two classifiers are compared is the weighted accuracy, which is given by,

\max_{\theta} a(\theta), \quad a(\theta) = \sum_{i} w^{(i)} \, 1\{h_\theta(x^{(i)}) = y^{(i)}\} -------------------------------------- [3710e]

where:

w⁽ⁱ⁾ represents the weight associated with each training example.
h_θ(x⁽ⁱ⁾) is the hypothesis function, which is the output for input x⁽ⁱ⁾ using parameters θ.
y⁽ⁱ⁾ is the true label of the training example x⁽ⁱ⁾.
1{h_θ(x⁽ⁱ⁾) = y⁽ⁱ⁾} is an indicator function that equals 1 if the predicted output h_θ(x⁽ⁱ⁾) matches the true label (y⁽ⁱ⁾), and 0 otherwise.
max_θ: This marks a(θ) as the objective that the algorithms aim to maximize. For a linear classifier, this means finding the hyperplane that classifies as much of the weighted data correctly as possible.
Σ_i: This sum part of the equation is a summation over all training examples (i).

The objective is to find the values of the parameters θ that maximize this sum, that is, the total weight of the correctly classified examples. One hypothesis for the 10% error on spam with the SVM in the current example is that the gradient descent is not doing well.

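For logistic regression, the typical objective function (cost function) for binary classification is as follows,

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^{2} -------------------------------------- [3710f]

where:

J(θ) is the cost function.
m is the number of training examples.
y⁽ⁱ⁾ is the true label of the i-th example.
h_θ(x⁽ⁱ⁾) is the sigmoid (logistic) function applied to the linear combination of input features x⁽ⁱ⁾ with parameters θ.
λ is the regularization parameter.
n is the number of features.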

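The logistic regression cost function measures the difference between the predicted probabilities and the true labels, and training aims to minimize it. Minimizing this cost is equivalent to maximizing the regularized log-likelihood, which is its negative up to a constant factor; in the diagnostic below, J(θ) refers to that maximized form. One hypothesis for the 2% error on non-spam with logistic regression in the current example is that J(θ) is the wrong function to optimize: J(θ) is too different from the weighted accuracy a(θ). We cannot maximize a(θ) directly because a(θ) is not differentiable.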

Since the parameters learned by the SVM, θ_SVM, outperform the parameters learned by logistic regression, θ_LR, on the weighted accuracy, we have,

a(θ_SVM) > a(θ_LR) -------------------------------------- [3710g]

One possible problem is that, for some reason, the gradient ascent used for logistic regression is not converging.

While both logistic regression and linear SVM are methods for binary classification, they have different objective functions and are optimized using different approaches. SVM uses a hinge loss function, which corresponds to the constrained objective in Equations 3710c and 3710d. SVM is designed for margin maximization and does not directly model probabilities as logistic regression does.

One way to diagnose the current problem is to answer the question below,

J(θ_SVM) ? J(θ_LR) -------------------------------------- [3710h]

With Formulas 3710g and 3710h, we then have two cases:

Case A:

a(θ_SVM) > a(θ_LR)

J(θ_SVM) > J(θ_LR)

In this case, LR is trying to maximize J(θ), but the parameters θ_LR it found give a lower value of the objective function J than θ_SVM does. That is, LR failed to maximize J, and the problem is with the convergence of the algorithm.

Case B:

a(θ_SVM) > a(θ_LR)

J(θ_SVM) ≤ J(θ_LR)

In this case, LR succeeded at maximizing J(θ). However, the SVM, which does worse on J(θ), actually does better on the weighted accuracy a(θ). Therefore, J(θ) is the wrong function to maximize if we care about a(θ): the best value of J(θ) does not correspond to the best value of a(θ). This is a problem with the objective function rather than with the optimization algorithm, so we probably need to find a different function to maximize.
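The two cases can be turned into a small decision helper. The function name and the numbers in the usage lines are hypothetical; in practice the four inputs would come from evaluating a(θ) and J(θ) at the parameters learned by each model.

```python
def diagnose_objective(a_svm, a_lr, j_svm, j_lr):
    """Decide between Case A and Case B from a(theta) and J(theta) for both models."""
    if a_svm <= a_lr:
        return "not applicable: the SVM does not beat logistic regression on a(theta)"
    if j_svm > j_lr:
        # LR did not reach the best value of its own objective.
        return "Case A: optimization algorithm problem"
    # LR maximized J, yet the SVM is better on what we care about.
    return "Case B: optimization objective problem"


# Hypothetical numbers, for illustration only.
print(diagnose_objective(a_svm=0.98, a_lr=0.95, j_svm=-120.0, j_lr=-150.0))
print(diagnose_objective(a_svm=0.98, a_lr=0.95, j_svm=-180.0, j_lr=-150.0))
```

Case A points to the optimizer-related fixes listed in example A (more iterations, Newton's method), while Case B points to the objective-related ones (a different value for λ, or a different algorithm such as an SVM).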

============================================

[1] Andrew Ng.

=================================================================================