Text/keyword classification/sort/prediction, train/test e.g. Youtube spam

- Integrated Circuits -
- An Online Book -

Integrated Circuits http://www.globalsino.com/ICs/


=================================================================================

In the keyword analysis with a supervised machine learning approach by Kurian et al. [1], a total of 15,000 incidents were manually classified: descriptive labels, actual and potential risk scores, and consequence labels (environment, finance, health/safety, and reputation) were applied to each incident. The Linear Support Vector Classifier (Linear SVC) was used to predict labels for incidents because it provided the highest accuracy. The incident reports were divided into training and test data; the algorithm was fitted on the training data and then used to predict labels for the test data. The result was a machine learning model that could apply labels to incidents with 75–90% accuracy (depending on the label), and its outputs were used to develop risk matrices and to analyze trends in incidents.

Such machine learning can reduce human bias and allows consistent reporting of incidents. However, some incident reports lacked the detail required for classification, and bias could not be removed completely because a supervised learning model depends on manually labeled training data. Additional keyword analysis can be applied to increase the accuracy of the classification. This research offers significant changes to the current system of incident reporting.
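The train/test procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of Kurian et al. [1]: the eight reports are hypothetical toy data, all function names are invented for this sketch, and a small hinge-loss linear classifier trained by sub-gradient descent stands in for a library Linear SVC (it has no bias term and handles only two labels).

```python
import random
from collections import Counter

# Hypothetical toy incident reports (text, consequence label); not data from [1].
REPORTS = [
    ("diesel spill reached creek", "environment"),
    ("tailings leak near wetland", "environment"),
    ("oil spill contaminated soil", "environment"),
    ("creek contaminated after leak", "environment"),
    ("worker fractured wrist", "health/safety"),
    ("operator burned hand", "health/safety"),
    ("worker injured hand", "health/safety"),
    ("operator fractured ankle", "health/safety"),
]

def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle reproducibly, then hold out a fraction of the rows as test data."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]

def featurize(text):
    """Bag-of-words counts for one report."""
    return Counter(text.lower().split())

def train_linear_svm(rows, positive, epochs=20, lr=0.1, lam=0.01):
    """Hinge-loss linear classifier (no bias term), a stand-in for Linear SVC."""
    w = {}
    for _ in range(epochs):
        for text, label in rows:
            y = 1 if label == positive else -1
            x = featurize(text)
            margin = y * sum(w.get(t, 0.0) * c for t, c in x.items())
            for t in w:                      # L2 regularization: shrink all weights
                w[t] *= 1 - lr * lam
            if margin < 1:                   # inside the margin: move toward y * x
                for t, c in x.items():
                    w[t] = w.get(t, 0.0) + lr * y * c
    return w

def predict(w, text, positive, negative):
    score = sum(w.get(t, 0.0) * c for t, c in featurize(text).items())
    return positive if score > 0 else negative

def accuracy(w, rows, positive, negative):
    hits = sum(predict(w, text, positive, negative) == label for text, label in rows)
    return hits / len(rows)

train, test = split_train_test(REPORTS)
weights = train_linear_svm(train, positive="environment")
train_acc = accuracy(weights, train, "environment", "health/safety")
test_acc = accuracy(weights, test, "environment", "health/safety")
```

In this toy data the two labels use disjoint vocabularies, so the training reports are separated perfectly. The accuracy on the held-out reports (`test_acc`) is the kind of figure that [1] reports as 75–90%; here it depends on whether the held-out words were seen during training.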

The objectives of the research by Kurian et al. [1] on keyword analysis with a supervised machine learning approach, with the methodology shown in Figure 4511a, were to:
          i) Strengthen the incident reporting system by creating a customized library using artificial intelligence, machine learning, and statistics.
          ii) Support the design of more sensitive risk prevention and mitigation strategies, as well as leading factors.
          iii) Enhance organizational learning from incidents and create opportunities to reduce losses.


Figure 4511a. Methodology of the supervised machine learning approach (see Figure 4511b for details). [1] In the "Input Data" step, data from past incidents are used. The second step involves designing a customized library for analyzing future reported incidents. The last step provides a detailed analysis and then gives suggestions for preventing incidents or minimizing the damage they cause. When an incident report is entered, four outputs are delivered based on statements selected by the user. For instance, the risk matrix is generated by calculating frequency and consequence.

Figure 4511b shows the details of the steps involved in the methodology of the supervised machine learning approach. The reports of 15,000 incidents were used to train a machine learning algorithm to predict class labels, in conjunction with keyword analysis, for new incident reports. In the multi-step process in 2.2, machine learning and keyword analysis were applied to the incident reports. A supervised machine learning algorithm (page4323) was used to classify incident reports in this step.


Figure 4511b. Detailed description of methodology. [1] In this research, several collaborating companies provided access to their incident databases containing incident reports from 2013 to 2017, inclusive.

After the accuracies of the machine learning classification were determined, Natural Language Processing (NLP) was used to analyze keywords (page4323). Keyword analysis can be completed by lemmatizing all the words found in the incident database [1]; e.g., "running" and "ran" are both reverted to "run". A counter can then be used to identify and tally the lemmatized words, which are arranged from most frequent to least frequent. The keywords that could be used to classify incidents were selected for inclusion in the customized library (stop words, punctuation, names of individuals, etc. were removed).
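The lemmatize, count, and rank steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the small `LEMMAS` table and the `STOP_WORDS` set are hand-written stand-ins for a real lemmatizer and stop-word list (such as those in spaCy), the three reports are hypothetical, and removing names of individuals is not attempted here.

```python
import re
from collections import Counter

# Hand-written stand-ins for a real lemmatizer and stop-word list (assumptions).
LEMMAS = {
    "running": "run", "ran": "run", "runs": "run",
    "leaked": "leak", "leaking": "leak", "leaks": "leak",
    "pumps": "pump", "valves": "valve",
}
STOP_WORDS = {"the", "a", "an", "and", "was", "were", "is", "of",
              "in", "on", "to", "when", "while", "from"}

def lemmatize(token):
    """Map an inflected form to its base form; unknown tokens pass through."""
    return LEMMAS.get(token, token)

def keyword_counts(reports):
    """Tally lemmatized words over all reports, skipping stop words."""
    counts = Counter()
    for report in reports:
        # Keeping only runs of letters also drops punctuation and digits.
        tokens = re.findall(r"[a-z]+", report.lower())
        counts.update(lemmatize(t) for t in tokens if t not in STOP_WORDS)
    return counts

# Hypothetical incident reports, for illustration only.
reports = [
    "The pump was running when the valve leaked.",
    "Operator ran the pump while valves were leaking.",
    "A leak was found on the pump.",
]

counts = keyword_counts(reports)
ranked = counts.most_common()   # (word, count) pairs, most frequent first
```

Here "running" and "ran" are both tallied under "run", and "leaked", "leaking", and "leak" under "leak", so the ranking reflects word meaning rather than surface form.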

A customized library can be created with two variables in the analysis:
i) The identifying labels used to train the machine learning algorithm. The labels and keywords found in the customized library were used to generate a list of statements, linked to the parameters of the inputted incidents.
ii) The keywords identified using the spaCy library.

The labels and keywords stored in the customized library were then matched to statements that could be used to analyze and evaluate events. Machine learning was combined with a "manual" keyword approach to increase accuracy and to ensure that the generated statements could accurately describe any incident. To some extent, the keyword analysis also served as a buffer to compensate for misclassification by the machine learning algorithm.
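One way to picture how keywords can act as a buffer for the classifier is sketched below. The structure of `LIBRARY`, its keywords, its statements, and the function `select_statements` are all hypothetical and are not taken from [1]; the sketch only shows the idea that a statement can be reached either through the predicted label or through a matching keyword.

```python
# Hypothetical customized library: each label carries keywords and a statement.
LIBRARY = {
    "environment": {
        "keywords": {"spill", "leak", "creek"},
        "statement": "Release to the environment; check containment.",
    },
    "health/safety": {
        "keywords": {"injure", "burn", "fracture"},
        "statement": "Personal injury; review protective equipment.",
    },
}

def select_statements(predicted_label, lemmas):
    """Statements for the predicted label, plus any label whose keywords occur."""
    chosen = []
    if predicted_label in LIBRARY:
        chosen.append(predicted_label)
    found = set(lemmas)
    for label, entry in LIBRARY.items():
        # Keyword match acts as a fallback if the classifier picked another label.
        if label not in chosen and entry["keywords"] & found:
            chosen.append(label)
    return [LIBRARY[label]["statement"] for label in chosen]

# The classifier says health/safety, but the lemmatized report mentions a spill.
statements = select_statements("health/safety", ["diesel", "spill", "road"])
```

In this example the keyword "spill" brings in the environment statement even though the predicted label was health/safety, so a misclassified report still yields a relevant statement.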

============================================

Prediction of YouTube spam (code):
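A minimal sketch of such a spam predictor is given here. It is an assumption-laden illustration rather than the listing of this page: the six training comments are hypothetical toy data, the function names are invented, and a multinomial naive Bayes classifier with add-one smoothing is used as a simple, self-contained choice of model. The label convention follows this page: 1 means not-spam and 0 means spam.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy comments; label 1 = not-spam, label 0 = spam.
TRAIN = [
    ("check out my channel and subscribe", 0),
    ("free gift cards click this link", 0),
    ("subscribe to my channel for free music", 0),
    ("love this song so much", 1),
    ("great video the chorus is amazing", 1),
    ("this song brings back memories", 1),
]

def tokenize(text):
    return text.lower().split()

def train_naive_bayes(rows):
    """Collect per-class word counts, class counts, and the vocabulary."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in rows:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def predict(model, text):
    """Return the label with the highest log posterior (add-one smoothing)."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for w in tokenize(text):
            if w in vocab:   # words never seen in training are ignored
                score += math.log((word_counts[label][w] + 1)
                                  / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train_naive_bayes(TRAIN)
predictions = [predict(model, c) for c in ["subscribe to my channel", "amazing song"]]
```

With the toy data above, `predictions` evaluates to `[0, 1]`: the first comment is flagged as spam and the second is not.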


In the output, [1] represents not-spam and [0] represents spam.

[1] Daniel Kurian, Fereshteh Sattari, Lianne Lefsrud, Yongsheng Ma, Using machine learning and keyword analysis to analyze incidents and reduce risk in oil sands operations, Safety Science, 130(2020), 104873.

=================================================================================