Skip to content

Instantly share code, notes, and snippets.

@Nicksil
Created May 24, 2013 19:40
Show Gist options
  • Select an option

  • Save Nicksil/5646003 to your computer and use it in GitHub Desktop.

Select an option

Save Nicksil/5646003 to your computer and use it in GitHub Desktop.

Revisions

  1. Nicksil created this gist May 24, 2013.
    330 changes: 330 additions & 0 deletions gistfile1.py
    Original file line number Diff line number Diff line change
    @@ -0,0 +1,330 @@
    # File: KnuthMorrisPratt.py
    # Author: Keith Schwarz (htiek@cs.stanford.edu)
    #
    # An implementation of the Knuth-Morris-Pratt (KMP) string-matching algorithm.
    # This algorithm takes as input a pattern string P and target string T, then
    # finds the first occurrence of the string T in the pattern P, doing so in time
    # O(|P| + |T|). The logic behind the algorithm is not particularly complex,
    # though getting it to run in linear time requires a few non-obvious tricks.
    #
    # To motivate KMP, consider the naive algorithm for trying to match a pattern
    # string P against a target T. This would work by considering all possible
    # start positions for the pattern P in the target T, then checking whether a
    # match exists at each of those positions. For example, to match the string
    # ABC against the target string ABABABACCABC, we'd get
    #
    # ABABABACCABC
    # ABX (first two characters match, last does not)
    # X (first character doesn't match)
    # ABX (first two characters match, last does not)
    # X (first character doesn't match)
    # ABX (first two characters match, last does not)
    # X (first character doesn't match)
    # AX (first character matches, second doesn't)
    # X (first character doesn't match)
    # X (first character doesn't match)
    # ABC (match found)
    #
    # This algorithm runs in O(mn) in the worst case, where m = |T| and n = |P|,
    # because it has to do O(n) work to check whether the string matches O(m) times
    # for each spot in the string.
    #
    # However, a lot of this is wasted work. For example, in the above example,
    # consider what happens when we know that the string ABC does not match the
    # first part of the string, ABA. At this point, it would be silly to actually
    # try to match the string at the string starting with the B, since there's no
    # possible way that the string could match there. Instead, it would make more
    # sense to instead start over and try matching ABC at the next A. In fact,
    # more generally, if we can use the information we have about what characters
    # we already matched to determine where we should try to resume the search in
    # the string, we can avoid revisiting characters multiple times when there's no
    # hope that they could ever match.
    #
    # The idea we'll use is to look for "borders" of a string, which are substrings
    # that are both a prefix and suffix of the string. For example, the string
    # "aabcaa" has "aa" as a border, while the string "abc" just has the empty
    # string as a border. Borders are useful in KMP because they encode
    # information about where we might need to pick up the search when a particular
    # match attempt fails. For example, suppose that we want to match ABABC
    # against the string ABABABC. If we start off by trying to match the string,
    # we'll find that they overlap like this:
    #
    # ABABABC
    # ABABx
    #
    # That is, the first four characters match, but the fifth does not. At this
    # point, rather than naively restarting the search at the second character (B),
    # or even restarting it at the third position (A), we can instead note that we
    # can treat the last two characters of the string we matched (AB) as the first
    # two characters of the pattern string ABABC if we just treated it instead as
    # though we had
    #
    # ABABABC
    # ABABx
    # ABABC
    #
    # If we can somehow remember the fact that we already matched the AB at the
    # start of this string, we could just confirm that the three characters after
    # it are ABC and be done. There's no need to confirm that the characters at
    # the front match.
    #
    # In order to make this possible, we'll construct a special data structure
    # called the "fail table." This table stores, for each possible prefix of the
    # string to match, the length of the longest border of that prefix. That way,
    # when we find a mismatch, we know where the next possible start location could
    # be found. In particular, once we have a mismatch, if there's any border of
    # the prefix of the pattern that we matched so far, then we can treat the end
    # of that matching prefix as the start of a prefix of the word that occurs
    # later in the target.
    #
    # The basic idea behind KMP is, given this table, to execute the following:
    #
    # - Guess that the string starts at the beginning of the target.
    # - Match as much of the string as possible.
    # - If the whole string matched, we're done.
    # - Otherwise, a mismatch was found. Look up the largest border of the
    # string that was matched so far in the failure table. Suppose it has
    # length k.
    # - Update our guess of the start position to be where that border occurs
    # in the portion matched so far, then repeat this process.
    #
    # Notice that once we've matched a character against part of the pattern (or
    # found that it can't possibly match), we never visit that character again.
    # This is responsible for the fast runtime of the algorithm (though I'll give a
    # more formal description later on).

    # Function: failTable(pattern)
    # Usage: failTable("This is a string!")
    # -----------------------------------------------------------------------------
    # Given a string, constructs the KMP failure table for that string. The values
    # in the table are defined as
    #
    # table[i] = |LongestProperBoundary(pattern[0:i)])|
    #
    # Where the longest proper boundary of a string is the longest proper substring
    # of that string that is both a prefix and a suffix. For example, given the
    # string "abcabc," the longest proper boundary is abc. Similarly, given the
    # string "apple," the longest proper boundary is the empty string.
    #
    # As a sample output of this function, given the string "ababcac", the table
    # would be
    #
    # a b a b c a c
    # * 0 0 1 2 0 1 0
    #
    # This means, for example, that the longest proper boundary of the prefix "aba"
    # has length 1, while the longest proper boundary of the string as a whole is
    # the empty string. Notice that the first entry is *, which we have chosen
    # because there is no mathematically well-defined proper substring of the empty
    # string. We can put anything we want there, and we'll go with None.
    #
    # To compute the values of this table, we use a dynamic programming algorithm
    # to compute a slightly stronger version of the function. We define the
    # function "Extended Longest Proper Boundary" (xLPB) as follows:
    #
    # xLPB(string, n, char) = The longest proper boundary of string[0:n] + char
    #
    # The idea behind this function is that we want to be able to recycle the
    # values of the longest proper boundary function for smaller prefixes of the
    # string in order to compute the longest proper boundary for longer prefixes.
    # To make this easier, the xLPB function allows us to talk about what would
    # happen if we extended the longest proper boundary of some prefix of the
    # string by a single character. Notice that for any nonzero n, we have that
    #
    # LongestProperBoundary(string[0:n]) = xLPB(string, n - 1, string[n])
    #
    # That is, we simply tear off the last character and use it as the final
    # argument to xLPB. Given this xLPB function, we can compute its values
    # recursively using the following logic. As a base case, xLPB(string, 0, char)
    # is the longest proper boundary of string[0:0] + char = char. But this has
    # only one proper boundary, the empty string, and so its value must be zero.
    #
    # Now suppose that for all n' < n we have the value of xLPB(string, n', char)
    # for any character char. Suppose we want to go and compute
    # xLPB(string, n, char). Let's think about what this would mean. Given that
    # n is not zero, we can think of this problem as trying to find the longest
    # proper boundary of this string:
    #
    # +------------+---+------------+------------+---+
    # | LPB | ? | ... | LPB | c |
    # +------------+---+------------+------------+---+
    #
    # ^ ^ ^
    # +----------------------+-------------------+ |
    # | |
    # String of length n New character
    #
    # The idea is that we have the original string of length n, followed by our new
    # character char (which we'll abbreviate c). In this diagram, I've marked the
    # LPB of the string of length n. Notice that right after the LPB at the prefix
    # of the string, we have some character whose value is unknown (since n != 0
    # and the LPB can't be the whole string). If this value is equal to c, then
    # the LPB of the whole string can be formed by simply extending the LPB of the
    # first n characters. There can't be a longer proper boundary, since otherwise
    # we could show that by taking that longer boundary and dropping off the
    # character c, we'd end up with a longer proper boundary for the first n
    # characters of the string, contradicting that we chose the longest proper
    # boundary.
    #
    # By our above argument, remember that the length of the longest proper
    # boundary of the first n characters of the string is given by
    #
    # xLPB(string, n - 1, string[n - 1])
    #
    # Thus we have the first part of our recurrence, which is defined as
    #
    # xLPB(string, n, char) =
    # if n = 0, then 0.
    # let k = xLPB(string, n - 1, string[n - 1])
    # if string[k] == char, return k + 1
    # else, ???
    #
    # Now, suppose that we find that the character after the LPB does not match.
    # If this happens, we can then make the following observation. Below I've
    # reprinted the above diagram:
    #
    # +------------+---+------------+------------+---+
    # | LPB | ? | ... | LPB | c |
    # +------------+---+------------+------------+---+
    #
    # ^ ^ ^
    # +----------------------+-------------------+ |
    # | |
    # String of length n New character
    #
    # Notice that any LPB of this new string must be a prefix of the LPB of the
    # first n characters and a suffix of the LPB followed by the character c.
    # Since by definition the LPB of the first n characters must be a prefix of
    # those n characters, we have the following elegant conclusion to our
    # recurrence:
    #
    # xLPB(string, n, char) =
    # if n = 0, then 0.
    # let k = xLPB(string, n - 1, string[n - 1])
    # if string[k] == char, return k + 1
    # else, xLPB(string, k, char)
    #
    # The reason for this is that xLPB(string, k, char) asks for the longest
    # proper boundary of the LPB of the string formed from the first n characters
    # of the string followed by the character c, which is exactly what we described
    # above.
    #
    # As written, filling in the table of LPB values would take O(n^2) time, where
    # n is the length of the string. However, using dynamic programming and an
    # amortized analysis, we can show that this function can be made to run in
    # O(n) time. In particular, suppose that for all n' < n, we know the value of
    # LPB(string[0:n]). Then in the above formulation of xLPB, the first
    # recursive call is known, and the only recursive call we may actually need to
    # make is the second.
    #
    # However, this doesn't seem to say anything about the runtime of the second
    # recursive call, which seems as though it might cause the evaluation of this
    # function to run in time O(n). This is correct, but in an *amortized* sense
    # the whole table can still be computed in O(n) time overall. To see this,
    # let's define a potential function Phi(k) that associates a potential at each
    # point of the computation of the table. In particular, define Phi(k) as
    #
    # Phi(0) = 0
    # Phi(k + 1) = result[k - 1]
    #
    # Here, result is the resulting table of LPB values. Because of this, we can
    # remark that result[k] < k, since the longest proper border of a string can't
    # be any longer than that string.
    #
    # Let's now show that this potential function gives an amortized O(1) cost for
    # each table entry computation, and thus an O(n) overall runtime for the table-
    # building algorithm. To see this, consider what happens when the logic to
    # compute the next value runs. The runtime for this step is bounded by the
    # number of recursive calls made to a subproblem. However, each subproblem is
    # then of size given by the LPB of a slightly smaller problem. This subproblem
    # must then have size at most the size of that smaller subproblem. In other
    # words, we can say that each recursive call drops the maximum possible value
    # of the LPB for the current prefix by at least one. Consequently, if k
    # recursive calls are made, the LPB of the current prefix is at least k smaller
    # than the LPB of the previous prefix, and so
    #
    # D Phi = -k
    #
    # And so the amortized cost of computing the next term is 1 + k - k = O(1).

    def failTable(pattern):
    # Create the resulting table, which for length zero is None.
    result = [None]

    # Iterate across the rest of the characters, filling in the values for the
    # rest of the table.
    for i in range(0, len(pattern)):
    # Keep track of the size of the subproblem we're dealing with, which
    # starts off using the first i characters of the string.
    j = i

    while True:
    # If j hits zero, the recursion says that the resulting value is
    # zero since we're looking for the LPB of a single-character
    # string.
    if j == 0:
    result.append(0)
    break

    # Otherwise, if the character one step after the LPB matches the
    # next character in the sequence, then we can extend the LPB by one
    # character to get an LPB for the whole sequence.
    if pattern[result[j]] == pattern[i]:
    result.append(result[j] + 1)
    break

    # Finally, if neither of these hold, then we need to reduce the
    # subproblem to the LPB of the LPB.
    j = result[j]

    return result

    # Function: kmpMatch(needle, haystack)
    # Usage: print kmpMatch("0101", "0011001011") # Prints 5
    # -----------------------------------------------------------------------------
    # Uses the KMP algorithm to find an occurrence of the specified needle string
    # in the haystack string. To do this, we compute the failure table, which
    # is done above. Next, we iterate across the string, keeping track of a
    # candidate start point and length matched so far. Whenever a match occurs, we
    # update the length of the match we've made. On a failure, we update these
    # values by trying to preserve the maximum proper border of the string we were
    # able to manage by that point.
    def kmpMatch(needle, haystack):
    # Compute the failure table for the needle we're looking up.
    fail = failTable(needle)

    # Keep track of the start index and next match position, both of which
    # start at zero since our candidate match is at the beginning and is trying
    # to match the first character.
    index = 0
    match = 0

    # Loop until we fall off the string or match.
    while index + match < len(haystack):
    print index, match

    # If the current character matches the expected character, then bump up
    # the match index.
    if haystack[index + match] == needle[match]:
    match = match + 1

    # If we completely matched everything, we're done.
    if match == len(needle):
    return index

    # Otherwise, we need to look at the fail table to determine what to do
    # next.
    else:
    # If we couldn't match the first character, then just advance the
    # start index. We need to try again.
    if match == 0:
    index = index + 1

    # Otherwise, see how much we need to skip forward before we have
    # another feasible match.
    else:
    index = index + match - fail[match]
    match = fail[match]

    # If we made it here, then no match was found.
    return None