Supervised machine learning

- Python for Integrated Circuits -
- An Online Book -

Python for Integrated Circuits http://www.globalsino.com/ICs/


=================================================================================

Supervised learning is a machine learning paradigm where an algorithm learns to make predictions or decisions based on labeled training data. In supervised learning, the algorithm is provided with a dataset that consists of input-output pairs, where the input represents the features or attributes of the data, and the output represents the corresponding labels or target values that you want the algorithm to predict.

The main goal of supervised learning is to learn a mapping or a function from the input data to the output labels so that the algorithm can make accurate predictions on new, unseen data. During the training process, the algorithm adjusts its internal parameters through optimization techniques to minimize the difference between its predictions and the true labels in the training data.
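
The training process described above can be illustrated with a minimal sketch. It assumes a one-feature linear model and a mean-squared-error loss minimized by plain gradient descent; the toy data, the function name `train_linear_model`, the learning rate, and the step count are all hypothetical choices for illustration, not taken from the source.

```python
# Hypothetical labeled data: each input x has the true label y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]


def train_linear_model(xs, ys, learning_rate=0.05, n_steps=2000):
    """Fit y ~ w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0  # internal parameters, adjusted during training
    n = len(xs)
    for _ in range(n_steps):
        # Difference between the current predictions and the true labels.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        # Move the parameters a small step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


w, b = train_linear_model(xs, ys)
mse = sum(((w * x + b) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(f"w = {w:.3f}, b = {b:.3f}, training MSE = {mse:.6f}")

# Prediction on a new, unseen input.
print(f"prediction for x = 10: {w * 10 + b:.3f}")
```

Real models usually have many more parameters and use library optimizers, but the loop has the same shape: predict, measure the error against the labels, and update the parameters.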

Supervised learning is widely used in various fields, including natural language processing, computer vision, healthcare, finance, and many other domains where predictive modeling is essential. It's called "supervised" learning because the algorithm learns from a "teacher" or supervisor who provides the correct answers (labels) during training, and the algorithm's performance is evaluated based on its ability to make accurate predictions on new, unseen data.

Figure 4323a shows the supervised machine learning architecture.

Figure 4323a. Supervised machine learning architecture. [8]

Figure 4323b. Comparison between machine learning algorithms (supervised learning, unsupervised learning, and reinforcement learning). Error = target output - actual output.

In the keyword analysis based on supervised machine learning by Kurian et al. [1], a total of 15,000 incidents were manually classified: descriptive labels, actual and potential risk scores, and consequence labels (environment, finance, health/safety, and reputation) were applied to each incident. The incident reports were then divided into training and test data; the machine learning algorithm was trained on the training data and then predicted labels for the test data. The result was a machine learning algorithm that could apply labels to incidents with 75–90% accuracy (depending on the label), and its outputs were used to develop risk matrices and to analyze trends in incidents. Such machine learning can help remove human bias, and the method allowed for consistent reporting of incidents. However, some incident reports lacked the detail required for classification, and bias could not be removed completely because a supervised learning model relies on manually labeled training data. Additional keyword analysis can be applied to increase the accuracy of the machine learning classification. This research thus offers a significant change to the current system of incident reporting.

Example applications of the supervised machine learning approach are [1]:
i) Text and keyword analysis (page4511). In this approach, supervised machine learning uses predictor features to forecast class labels, with the aim of categorizing data by utilizing prior information, [2] in the steps below:
i.a) The first step in applying supervised machine learning to the classification of incident reports is to manually classify the reports by labelling them with consistent identifiers such as key descriptors, immediate and latent causes, and contributing factors. The labels used by Kurian et al. [1] (page4511) were communication, health/safety, leak/spill, miscellaneous, operation, uncategorized, and vehicle (see Table 4323a). The label "uncategorized" was assigned to incident reports that could not be classified.

Table 4323a. Labels used by Kurian et al. [1].

i.b) The data in the incident database was prepared for machine learning classification. The TfidfVectorizer class from Python’s scikit-learn library was used to transform each incident report into a numerical vector, so that the whole incident database becomes a matrix with one row per report and one column per term in the vocabulary (the incident dictionary). [3] For every term found in an incident document, a count is recorded at the position of that term in the incident dictionary. The counts are then weighted by comparing how frequently a term occurs in a document (report) with how frequently it occurs across the entire collection of reports. The result is the transformation of text into a numerical vector. Note that a few other libraries can also be used to convert incident documents into numerical vectors (see page4316). The manually classified incidents were then separated into training and test data sets containing 70% and 30% of the data, respectively. [4] The numerical vectors of the incident reports in the training set were expressed graphically, and a classifier was used to generate the decision boundaries that classify the data.
In the analysis above, many classifiers from the scikit-learn library that are compatible with sparse matrices were used to classify the incident reports. Each supervised machine learning algorithm attempted to identify the features in an incident report that connect it to a given label, and metrics were calculated for the different classifiers to identify the most suitable classifier for the data.
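
The workflow in step i.b can be sketched roughly as follows. This is only an illustration under stated assumptions: the twelve short incident reports and their labels are invented (they are not data from Kurian et al.), and the choice of three classifiers, the `stop_words` setting, and `random_state=0` are arbitrary. Only the overall sequence (TF-IDF vectorization, 70/30 split, comparison of sparse-compatible classifiers) follows the text.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Hypothetical, manually labeled incident reports (invented for illustration).
reports = [
    "Hydraulic oil leak found under the pump skid",
    "Diesel spill during refuelling of the generator",
    "Glycol leak from a cracked pipe flange",
    "Small tailings spill near the pond berm",
    "Haul truck backed into a light pole",
    "Pickup truck slid off the icy road",
    "Loader struck a parked truck in the yard",
    "Truck windshield cracked by a flying rock",
    "Worker cut a hand while opening a crate",
    "Employee slipped on ice and injured a knee",
    "Worker reported dizziness after welding fumes",
    "Contractor strained a back lifting a valve",
]
labels = ["leak/spill"] * 4 + ["vehicle"] * 4 + ["health/safety"] * 4

# 70% training data and 30% test data, keeping the label proportions similar.
train_text, test_text, y_train, y_test = train_test_split(
    reports, labels, test_size=0.3, stratify=labels, random_state=0
)

# Learn the vocabulary and TF-IDF weights on the training reports only,
# then reuse them to transform the test reports into sparse vectors.
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_text)
X_test = vectorizer.transform(test_text)
print("training matrix shape (reports, terms):", X_train.shape)

# A few scikit-learn classifiers that accept sparse input.
classifiers = {
    "LinearSVC": LinearSVC(),
    "MultinomialNB": MultinomialNB(),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: test accuracy = {scores[name]:.2f}")
```

With so few reports the accuracies are not meaningful; the sketch only shows how the pieces fit together. Fitting the vectorizer on the training set alone keeps information from the test reports out of the model.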

The most accurate classifier for categorizing incident reports was the Linear Support Vector Classifier (Linear SVC), with accuracies close to 90% when predicting labels. [5] The metrics used were the confusion matrix, the classification report, and the accuracy score. [6] The confusion matrix requires the true and predicted classifications of the model and shows how a classifier makes predictions for each label; it is built by counting the numbers of true positives, true negatives, false positives, and false negatives. In a confusion matrix, the true label is on the y-axis and the predicted label on the x-axis. The classification report takes the actual and predicted labels as inputs and delivers precision, recall, F1-score, and support:
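                         $\text{Precision} = \frac{TP}{TP + FP}$ --------------------------------- [4323a]
                         $\text{Recall} = \frac{TP}{TP + FN}$ --------------------------------- [4323b]
                         $F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$ --------------------------------- [4323c]
where TP, FP, and FN are the numbers of true positives, false positives, and false negatives for a given label. Values for precision, recall, and F1-score will be between 0 and 1, where values closer to 1 represent a more robust model. Support is the count of true occurrences for each label. The accuracy score is the percentage of predicted labels that the model correctly identifies:
                         $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$ ------------------------------ [4323d]
where TN is the number of true negatives.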
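
The metrics described above (Equations 4323a to 4323d) can be computed with scikit-learn, as in the following sketch. The true and predicted labels are invented for illustration. The manual calculation for the "leak/spill" label is included only to show how the confusion-matrix counts map onto precision and recall.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    precision_recall_fscore_support,
)

# Hypothetical true and predicted labels for eight incident reports.
y_true = ["leak/spill", "leak/spill", "leak/spill", "vehicle",
          "vehicle", "operation", "operation", "operation"]
y_pred = ["leak/spill", "leak/spill", "vehicle", "vehicle",
          "vehicle", "operation", "leak/spill", "operation"]
label_names = ["leak/spill", "operation", "vehicle"]

# Rows are true labels (y-axis), columns are predicted labels (x-axis).
cm = confusion_matrix(y_true, y_pred, labels=label_names)
print(cm)

# Per-label precision, recall, F1-score, and support.
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=label_names
)
print(classification_report(y_true, y_pred, labels=label_names))

# Manual check for the first label ("leak/spill") from the matrix counts.
tp = cm[0, 0]
fp = cm[:, 0].sum() - tp  # predicted "leak/spill" but truly another label
fn = cm[0, :].sum() - tp  # truly "leak/spill" but predicted as another label
precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

# Accuracy: correctly predicted labels (the diagonal) over all predictions.
accuracy = accuracy_score(y_true, y_pred)
accuracy_manual = cm.trace() / cm.sum()
print(f"accuracy = {accuracy:.2f}")
```

For more than two classes, the counts TP, FP, and FN are taken per label (one label against all others), which is why the classification report lists one row of scores per label.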
i.c) After the accuracies of the machine learning classification were determined, Natural Language Processing (NLP) was used to analyze keywords. NLP allows computers to interact with humans by processing and analyzing natural language data. [7]

       ii.a) Regression. Within supervised learning, predicting numeric values is called regression. For instance, if the aim is to predict the score a student will obtain (a numeric value), the task is a regression problem.
          ii.a.1) k-nearest neighbors.
       ii.b) Classification. For instance, if the aim is to determine whether a customer will buy a product from an online store, then this is a classification problem. Classification can be further divided into binary and multi-class classification.
          ii.b.1) AdaBoost classifier.
          ii.b.2) Decision tree classifier.
          ii.b.3) Multi-layer perceptron classifier.
          ii.b.4) Multinomial Naïve Bayes classifier.
          ii.b.5) Random forest classifier.
          ii.b.6) Support vector machine classifier (including linear support vector classifier).
      iii) Predicting the retail value of a new house on the market. The type of machine learning typically used for this task is regression, a type of supervised learning used when the target (or dependent) variable is continuous. In the case of predicting house prices, the target variable is the price of the house, which is a continuous variable. Regression algorithms, such as linear regression, polynomial regression, decision tree regression, random forest regression, or support vector regression, can be applied to this task. These algorithms learn the relationship between the features (independent variables) of the house (such as square footage, number of bedrooms, and location) and the target variable (house price) from historical data. Once the model is trained, it can be used to predict the price of new houses based on their features.
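
The house-price case in item iii can be sketched with scikit-learn's LinearRegression. The historical data below is invented for illustration: the prices are generated from a made-up linear rule (150 per square foot, 10,000 per bedroom, plus 50,000), so the model can recover it exactly. Real prices are noisy and depend on many more features, such as location.

```python
from sklearn.linear_model import LinearRegression

# Hypothetical historical data: [square footage, number of bedrooms].
features = [
    [1000, 2],
    [1200, 3],
    [1500, 3],
    [1800, 4],
    [2000, 3],
    [2400, 4],
]
# Invented prices following price = 150 * sqft + 10000 * bedrooms + 50000.
prices = [150 * sqft + 10000 * beds + 50000 for sqft, beds in features]

# Learn the relationship between the features and the continuous target.
model = LinearRegression()
model.fit(features, prices)
print("learned coefficients:", model.coef_)
print("learned intercept:", model.intercept_)

# Predict the price of a new house that was not in the training data.
new_house = [[1600, 3]]
predicted_price = model.predict(new_house)[0]
print(f"predicted price: {predicted_price:.0f}")
```

The same fit/predict pattern applies to the other regressors named in the text (for example, decision tree or random forest regression); only the model class changes.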

============================================

[1] Daniel Kurian, Fereshteh Sattari, Lianne Lefsrud, Yongsheng Ma, Using machine learning and keyword analysis to analyze incidents and reduce risk in oil sands operations, Safety Science, 130(2020), 104873.
[2] S. B. Kotsiantis, I. D. Zaharakis and P. E. Pintelas, Artificial Intelligence Review, 26, 159–190 (2006).
[3] Imani, A., Forman, J.E., Amir, W., 2018. A Clustering Analysis of Codes of Conduct and Ethics in the Practice of Chemistry.
[4] Ng, A. (n.d.). Machine Learning. Retrieved from https://www.coursera.org/learn/machine-learning/.
[5] Kurian, D., Ma, Y., Lefsrud, L., Sattari, F., 2020. Seeing the Forest and the Trees: Using Machine Learning to Categorize and Analyze Incident Reports for Alberta Oil Sands Operators. J. Loss Prev. Process Ind. 64, 104069. https://doi.org/10.1016/j.jlp.2020.104069.
[6] Garreta, R., Hauck, T., Hackeling, G., 2017. Scikit-learn: machine learning simplified. Packt Publishing, Birmingham, UK.
[7] Srinivasa-Desikan, B., 2018. Natural language processing and computational linguistics: a practical guide to text analysis with Python, Gensim, spaCy, and Keras. Packt Publishing, Birmingham, UK.
[8] Pramod Singh and Avinash Manure, Learn TensorFlow 2.0: Implement Machine Learning and Deep Learning Models with Python, 2022.

=================================================================================