Example of ML Debugging/Diagnostic: Anti-Spam - Python Automation and Machine Learning for ICs - An Online Book
Python Automation and Machine Learning for ICs http://www.globalsino.com/ICs/
Debugging problems in ML can be very complicated to do well; a thorough treatment can amount to PhD thesis work.

Debugging example A: A machine learning (ML) example is anti-spam. [1] In this example, we carefully choose a small set of 100 words as features instead of using all 50,000+ words in English. We used logistic regression with regularization (Bayesian logistic regression), implemented with gradient descent. We got 20% test error, which is unacceptably high.

----------------------------------- [3710a]

Then, how can we debug it? One common approach to debugging an ML algorithm is to try to improve the algorithm in different ways:
i) Try to get more training examples. This almost always helps, since more data normally helps training. This fixes high variance.

A lot of teams pick one of the ideas above more or less at random and then spend a few days or a few weeks trying it, which may not be the best thing to do. Each of the first four methods above fixes either a high variance or a high bias problem.

The most common diagnostic is the bias-variance diagnostic. Understanding the bias-variance tradeoff is crucial for diagnosing the performance of a model and making informed decisions about improving its generalization ability, and knowing how much bias and variance each contribute to the algorithm's error helps guide your debugging efforts. Without a thorough analysis, it is hard to assess which option above is best, because we would not have a clear understanding of the impact of each option on the problem at hand. Techniques such as cross-validation, learning curves, and error analysis can be employed to gain insight into bias and variance and to make informed adjustments that enhance model generalization.

Note that becoming proficient with the bias-variance diagnostic can take a couple of years of deep practice, although this varies from person to person.
In this process, you will practice machine learning algorithms, find the problems, and then fix the problems.
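As one illustration of a learning-curve diagnostic, the sketch below fits a straight line to synthetic data and records training and test errors for growing training-set sizes. Everything in it is an assumption made for illustration and is not taken from the book: the quadratic synthetic data, the function names (fit_line, mse, make_data, learning_curve, diagnose), and the gap_factor threshold, which is only a rough rule of thumb.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0.0:
        # A single point (or identical x values): fall back to a constant fit.
        return 0.0, my
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx


def mse(model, xs, ys):
    """Mean squared error of the line (a, b) on the given points."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def make_data(n, rng):
    """Hypothetical data: y = x^2 + noise, which a straight line underfits."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [x * x + rng.gauss(0.0, 0.05) for x in xs]
    return xs, ys


def learning_curve(sizes, n_test=500, seed=0):
    """Return (m, training error, test error) for each training-set size m."""
    rng = random.Random(seed)
    train_x, train_y = make_data(max(sizes), rng)
    test_x, test_y = make_data(n_test, rng)
    rows = []
    for m in sizes:
        model = fit_line(train_x[:m], train_y[:m])
        rows.append((m,
                     mse(model, train_x[:m], train_y[:m]),
                     mse(model, test_x, test_y)))
    return rows


def diagnose(train_err, test_err, desired_err, gap_factor=2.0):
    """Rough heuristic (an assumption, not a standard rule)."""
    if train_err > desired_err:
        return "high bias"      # even the training set is fit poorly
    if test_err > desired_err and test_err > gap_factor * train_err:
        return "high variance"  # training error fine, large gap to test error
    return "ok"


if __name__ == "__main__":
    for m, train_err, test_err in learning_curve([1, 2, 5, 20, 100, 400]):
        print(m, round(train_err, 4), round(test_err, 4),
              diagnose(train_err, test_err, desired_err=0.01))
```

With one training example the training error is exactly zero, and with many examples the training and test errors settle close together but well above the desired error, which is the high-bias pattern. Plotting the rows instead of printing them gives curves of the kind shown in a learning-curve figure.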
The diagnostic will be:
i) Variance: Training error will be much lower than test error.
ii) Bias: Training error will also be high.

Figure 3710a shows linear regression learning curves, which are obtained by training the model on subsets of the training data of different sizes and plotting the training and test errors for each subset size. When the training sample size is small, the training error is also small. For instance, if we have only one training example, almost any algorithm can fit it perfectly. When the sample size is large, it is harder to fit the training data perfectly. However, the test error keeps decreasing as the sample size increases, which suggests that a larger sample set will help. The larger gap between training and test errors at small sample sizes is often attributed to high model variance. On the other hand, a large deviation of the training error from the desired performance is a sign of large bias. Another sign of large bias is that the gap between the training error and the test error is very small while the training error never comes back down to the desired performance line, no matter how many samples we have. Note that a common “standard” of desired performance (maximum allowed error) is what humans can achieve. In general, the test error is expected to be higher than the training error, no matter how many samples we have.

Figure 3710a. Linear regression learning curves.

Debugging example B: Another debugging example is: [1]
i) Logistic regression gets 2% error on spam and 2% error on non-spam (unacceptably high error on non-spam).
ii) SVM using a linear kernel gets 10% error on spam and 0.01% error on non-spam (it outperforms logistic regression on non-spam; acceptable performance).
iii) But you want to use logistic regression because of computational efficiency, etc.

Question: What to do next?

Analysis: 2% error on spam is OK since we just need to read a small number of spam emails.
But 2% error on non-spam is not acceptable because we lose 2% of important emails. However, logistic regression is more computationally efficient and therefore easier to update, so we still want to use it. Then, we need to ask some common questions:
i) Understand the comparisons (see page3814). Is the algorithm (gradient ascent for logistic regression) converging?
ii) Do we need to extend the training time (see page3707)?
iii) Are we optimizing the right loss function? For instance, what do we care about in the equation below?

----------------------------------- [3710b]

The weights w(i) are higher for non-spam than for spam, so we need to consider a weighted accuracy criterion.
iv) For logistic regression, do we have the correct value for λ in Equation 3710a? We know we are optimizing J(θ), but with a poorly chosen λ we may be optimizing the wrong cost function.
v) How about SVM? Do we have the correct value for C?

Minimization of objective function (see page4270):

----------------------------------- [3710c]

Subject to constraint (see page4270):

------------------------ [3710d]

Questions to answer: Does the gradient descent converge (J versus θ)? Are you optimizing the wrong function?

In some cases, an SVM outperforms logistic regression, but we really want to deploy logistic regression for our application (page3814). The objective function of a linear Support Vector Machine (SVM) in machine learning can be given by,

-------------------------------------- [3710e]

where:
The objective is to find the values of the parameters that maximize this sum, which effectively maximizes the margin between different classes in the feature space. One hypothesis for the 10% error on spam with the SVM in the current example is that gradient descent is not doing well.

For logistic regression, the typical objective function (cost function) for binary classification is as follows,

-------------------------------------- [3710f]

where:
The logistic regression cost function aims to minimize the difference between the predicted probabilities and the true labels. One hypothesis for the 2% error on non-spam with logistic regression in the current example is that J(θ) is the wrong function to optimize: J(θ) is too different from the weighted accuracy a(θ). We cannot maximize a(θ) directly because a(θ) is not differentiable. Since the parameters learned by the SVM, θSVM, outperform the parameters learned by logistic regression, θLR, we have,

a(θSVM) > a(θLR) -------------------------------------- [3710g]

The other hypothesis is that, for some reason, gradient descent is not converging. While both logistic regression and linear SVM are methods for binary classification, they have different objective functions and are optimized using different approaches. SVM uses a hinge loss function, as reflected in Equation 3710e. SVM is designed for margin maximization and does not directly model probabilities as logistic regression does. One way to diagnose the current problem is to answer the question below,

J(θSVM) ? J(θLR) -------------------------------------- [3710g]

With the two relations labeled 3710g above, we then have two cases:

Case A: a(θSVM) > a(θLR) and J(θSVM) > J(θLR)
In this case, LR is trying to maximize J(θ), but θLR fails to maximize J, since J(θSVM) > J(θLR). The problem is with the convergence of the optimization algorithm.

Case B: a(θSVM) > a(θLR) and J(θSVM) ≤ J(θLR)
In this case, LR succeeded at maximizing J(θ). However, the SVM, which does worse on J(θ), actually does better on the weighted accuracy a(θ). Therefore, J(θ) is the wrong function to maximize if we care about a(θ): the maximizer of J(θ) does not correspond to the best value of a(θ). The problem lies in the objective function, so we probably need to find a different function to maximize.
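The two-case diagnostic can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the book's implementation: J(θ) is taken to be the regularized log-likelihood (sum of log p(y|x, θ) minus λ times the squared norm of θ), a(θ) is taken to be the weight-normalized fraction of correctly classified emails with a decision threshold at θ·x ≥ 0, labels are 1 for spam and 0 for non-spam, and the function names and the toy data in the usage example are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(theta, x):
    return sum(t * v for t, v in zip(theta, x))


def objective_j(theta, X, y, lam):
    """Assumed form: J(theta) = sum_i log p(y_i | x_i, theta) - lam * ||theta||^2."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(dot(theta, xi))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total += math.log(p) if yi == 1 else math.log(1.0 - p)
    return total - lam * sum(t * t for t in theta)


def weighted_accuracy(theta, X, y, w):
    """Assumed form: a(theta) = sum_i w_i * 1{prediction_i == y_i} / sum_i w_i."""
    correct = 0.0
    for xi, yi, wi in zip(X, y, w):
        pred = 1 if dot(theta, xi) >= 0 else 0
        if pred == yi:
            correct += wi
    return correct / sum(w)


def compare_objectives(theta_svm, theta_lr, X, y, w, lam):
    """Return 'case A', 'case B', or 'not applicable'."""
    if weighted_accuracy(theta_svm, X, y, w) <= weighted_accuracy(theta_lr, X, y, w):
        return "not applicable"  # the SVM does not beat LR on a(theta)
    if objective_j(theta_svm, X, y, lam) > objective_j(theta_lr, X, y, lam):
        return "case A"          # LR failed to maximize J: convergence problem
    return "case B"              # LR maximized J, yet a(theta) is worse: wrong objective


if __name__ == "__main__":
    # Hypothetical toy data: a bias feature plus one real feature.
    X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
    y = [0, 0, 1, 1]            # 0 = non-spam, 1 = spam
    w = [10.0, 10.0, 1.0, 1.0]  # non-spam errors weigh more
    print(compare_objectives([0.0, 1.0], [0.0, -0.1], X, y, w, lam=0.0))
    print(compare_objectives([0.0, 10.0], [1.5, 1.0], X, y, w, lam=1.0))
```

In practice θSVM and θLR would come from the two trained models evaluated on the same features and the same data, and the same J and a would be used for both, since the comparison is only meaningful when the two parameter vectors are scored by identical functions.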
[1] Andrew NG.