Knuth-Morris-Pratt (KMP) algorithm (a string-searching algorithm)

Knuth-Morris-Pratt (KMP) Algorithm
- Python and Machine Learning for Integrated Circuits -
- An Online Book -

Python and Machine Learning for Integrated Circuits http://www.globalsino.com/ICs/

Chapter/Index: Introduction | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z | Appendix

=================================================================================

The Knuth-Morris-Pratt (KMP) algorithm is a string-searching algorithm that efficiently finds all occurrences of a pattern (substring) within a text (string). It was developed by Donald Knuth, Vaughan Pratt, and James H. Morris in 1977.

The main advantage of the KMP algorithm is its linear time complexity, making it one of the most efficient algorithms for substring searching. It works by precomputing a partial match table (often called the "failure function" or "LPS array") based on the pattern, which allows it to avoid unnecessary character comparisons during the search phase. Note that the KMP algorithm is not a machine learning algorithm. The KMP algorithm is a classic computer science algorithm designed for efficient string matching and does not involve learning from data or making predictions.

Table 3807. Process of Knuth-Morris-Pratt (KMP) algorithm.

Preprocessing	Given a pattern string P, the KMP algorithm constructs a failure function (LPS array) that records the length of the longest proper prefix (a prefix that is not the whole string) that is also a suffix for each prefix of the pattern. This is done in a way that allows you to efficiently "slide" the pattern over the text without re-checking characters that have already been matched.
Searching	Start at the beginning of the text string T and the beginning of the pattern string P. Compare characters of T and P from left to right. If a mismatch occurs at some character in P, you use the information from the LPS array to determine how far you can shift the pattern to the right without missing a potential match in the text. This is the key to the algorithm's efficiency.
Repeating	Continue this process until you either find a match or reach the end of the text. If you find a match, record the position in the text where the match starts.

Knuth-Morris-Pratt (KMP) algorithm for pattern matching in strings and is known for its efficiency in finding occurrences of a pattern in a text. Code:
          Upload Files to Webpages
       Output:

The key characteristic of the KMP algorithm is the construction of the Longest Prefix Suffix (LPS) array, which is used to avoid unnecessary character comparisons when a mismatch occurs. Here's how the script utilizes the KMP algorithm:

build_lps_array(pattern) function: This function builds the Longest Prefix Suffix (LPS) array for the given pattern. It iteratively processes the pattern and calculates the length of the longest proper prefix of the substring that is also a suffix. This LPS array is a crucial part of the KMP algorithm. The preprocessing step for the KMP algorithm is primarily performed in the build_lps_array(pattern) function. Preprocessing is the step where the Longest Prefix Suffix (LPS) array is constructed. The LPS array is used to determine the next position to start comparing characters in case of a mismatch during the search. The specific lines responsible for preprocessing are within this function.
kmp_search(text, pattern) function: This function performs the KMP search in the text for occurrences of the pattern. It uses the LPS array created by build_lps_array to efficiently handle mismatches. When a mismatch occurs, it doesn't start comparing characters from the beginning of the pattern; instead, it uses the information in the LPS array to determine the next position to start comparing. This is the core principle of the KMP algorithm. Within the kmp_search function, the code uses i and j to keep track of the positions in the text and pattern, respectively. It iterates through the text, comparing characters between the pattern and text. When a mismatch is encountered, it uses the information in the LPS array to determine where to continue the comparison without starting from the beginning of the pattern. If a complete match is found (i.e., j == m), it records the position of the pattern in the text (i - j) and continues the search.

The KMP algorithm can be applied to find occurrences of a pattern within a text so that it can be used to find the list of tables in a text file. For instance, this script can be used to find how many sections (tables) there are in a text file. Each section starts with <h5> and ends with </table>. This script finally stores the contents of each section (table) to a different dataframe set as a dataframe table. The content of the text file is:

<table>

>tbody>

<tr><th>Hellow </th><th>Have a good day! </th><th> The country </th></tr> <tr><td class=“group”> DUX </td><td class=“BBY”> 1234 </td><td class=“fers”> hmDy </td></tr> </tbody> </table>

<table>

<tbody>

<tr><th>DDYIN </th><th>Bob </th><th>UHY</th></tr> <tr><td class=“group”>BBY </td><td class=“BYHZ”>com</td><td class=“fers”> ICs </td></tr>

</tbody>

</table>

Another KMP algorithm which can be used to search all the patterns of the tables (see below) in a webpage file, and then store the tables as dataframes, is the script:

In this case, the output from the script is:

============================================

=================================================================================