Nicksil · May 24, 2013 19:40 · May 24, 2013
diff --git a/gistfile1.py b/gistfile1.py
@@ -0,0 +1,330 @@
+# File: KnuthMorrisPratt.py
+# Author: Keith Schwarz (htiek@cs.stanford.edu)
+#
+# An implementation of the Knuth-Morris-Pratt (KMP) string-matching algorithm.
+# This algorithm takes as input a pattern string P and target string T, then
+# finds the first occurrence of the string T in the pattern P, doing so in time
+# O(|P| + |T|).  The logic behind the algorithm is not particularly complex,
+# though getting it to run in linear time requires a few non-obvious tricks.
+#
+# To motivate KMP, consider the naive algorithm for trying to match a pattern
+# string P against a target T.  This would work by considering all possible
+# start positions for the pattern P in the target T, then checking whether a
+# match exists at each of those positions.  For example, to match the string
+# ABC against the target string ABABABACCABC, we'd get
+#
+#     ABABABACCABC
+#     ABX               (first two characters match, last does not)
+#      X                (first character doesn't match)
+#       ABX             (first two characters match, last does not)
+#        X              (first character doesn't match)
+#         ABX           (first two characters match, last does not)
+#          X            (first character doesn't match)
+#           AX          (first character matches, second doesn't)
+#            X          (first character doesn't match)
+#             X         (first character doesn't match)
+#              ABC      (match found)
+#
+# This algorithm runs in O(mn) in the worst case, where m = |T| and n = |P|,
+# because it has to do O(n) work to check whether the string matches O(m) times
+# for each spot in the string.
+#
+# However, a lot of this is wasted work.  For example, in the above example,
+# consider what happens when we know that the string ABC does not match the
+# first part of the string, ABA.  At this point, it would be silly to actually
+# try to match the string at the string starting with the B, since there's no
+# possible way that the string could match there.  Instead, it would make more
+# sense to instead start over and try matching ABC at the next A.  In fact,
+# more generally, if we can use the information we have about what characters
+# we already matched to determine where we should try to resume the search in
+# the string, we can avoid revisiting characters multiple times when there's no
+# hope that they could ever match.
+#
+# The idea we'll use is to look for "borders" of a string, which are substrings
+# that are both a prefix and suffix of the string.  For example, the string
+# "aabcaa" has "aa" as a border, while the string "abc" just has the empty
+# string as a border.  Borders are useful in KMP because they encode
+# information about where we might need to pick up the search when a particular
+# match attempt fails.  For example, suppose that we want to match ABABC
+# against the string ABABABC.  If we start off by trying to match the string,
+# we'll find that they overlap like this:
+#
+#    ABABABC
+#    ABABx
+#
+# That is, the first four characters match, but the fifth does not.  At this
+# point, rather than naively restarting the search at the second character (B),
+# or even restarting it at the third position (A), we can instead note that we
+# can treat the last two characters of the string we matched (AB) as the first
+# two characters of the pattern string ABABC if we just treated it instead as
+# though we had
+#
+#    ABABABC
+#    ABABx
+#      ABABC
+#
+# If we can somehow remember the fact that we already matched the AB at the
+# start of this string, we could just confirm that the three characters after
+# it are ABC and be done.  There's no need to confirm that the characters at
+# the front match.
+#
+# In order to make this possible, we'll construct a special data structure
+# called the "fail table."  This table stores, for each possible prefix of the
+# string to match, the length of the longest border of that prefix.  That way,
+# when we find a mismatch, we know where the next possible start location could
+# be found.  In particular, once we have a mismatch, if there's any border of
+# the prefix of the pattern that we matched so far, then we can treat the end
+# of that matching prefix as the start of a prefix of the word that occurs
+# later in the target.
+#
+# The basic idea behind KMP is, given this table, to execute the following:
+#
+#  - Guess that the string starts at the beginning of the target.
+#  - Match as much of the string as possible.
+#  - If the whole string matched, we're done.
+#  - Otherwise, a mismatch was found.  Look up the largest border of the
+#    string that was matched so far in the failure table.  Suppose it has
+#    length k.
+#  - Update our guess of the start position to be where that border occurs
+#    in the portion matched so far, then repeat this process.
+#
+# Notice that once we've matched a character against part of the pattern (or
+# found that it can't possibly match), we never visit that character again.
+# This is responsible for the fast runtime of the algorithm (though I'll give a
+# more formal description later on).
+
+# Function: failTable(pattern)
+# Usage: failTable("This is a string!")
+# -----------------------------------------------------------------------------
+# Given a string, constructs the KMP failure table for that string.  The values
+# in the table are defined as
+#
+#    table[i] = |LongestProperBoundary(pattern[0:i)])|
+#
+# Where the longest proper boundary of a string is the longest proper substring
+# of that string that is both a prefix and a suffix.  For example, given the
+# string "abcabc," the longest proper boundary is abc.  Similarly, given the
+# string "apple," the longest proper boundary is the empty string.
+#
+# As a sample output of this function, given the string "ababcac", the table
+# would be
+#
+#     a b a b c a c
+#    * 0 0 1 2 0 1 0
+#
+# This means, for example, that the longest proper boundary of the prefix "aba"
+# has length 1, while the longest proper boundary of the string as a whole is
+# the empty string.  Notice that the first entry is *, which we have chosen
+# because there is no mathematically well-defined proper substring of the empty
+# string.  We can put anything we want there, and we'll go with None.
+#
+# To compute the values of this table, we use a dynamic programming algorithm
+# to compute a slightly stronger version of the function.  We define the
+# function "Extended Longest Proper Boundary" (xLPB) as follows:
+#
+#    xLPB(string, n, char) = The longest proper boundary of string[0:n] + char
+#
+# The idea behind this function is that we want to be able to recycle the
+# values of the longest proper boundary function for smaller prefixes of the
+# string in order to compute the longest proper boundary for longer prefixes.
+# To make this easier, the xLPB function allows us to talk about what would
+# happen if we extended the longest proper boundary of some prefix of the
+# string by a single character.  Notice that for any nonzero n, we have that
+#
+#   LongestProperBoundary(string[0:n]) = xLPB(string, n - 1, string[n])
+#
+# That is, we simply tear off the last character and use it as the final
+# argument to xLPB.  Given this xLPB function, we can compute its values
+# recursively using the following logic.  As a base case, xLPB(string, 0, char)
+# is the longest proper boundary of string[0:0] + char = char.  But this has
+# only one proper boundary, the empty string, and so its value must be zero.
+#
+# Now suppose that for all n' < n we have the value of xLPB(string, n', char)
+# for any character char.  Suppose we want to go and compute
+# xLPB(string, n, char).  Let's think about what this would mean.  Given that
+# n is not zero, we can think of this problem as trying to find the longest
+# proper boundary of this string:
+#
+#     +------------+---+------------+------------+---+
+#     |     LPB    | ? |    ...     |     LPB    | c |
+#     +------------+---+------------+------------+---+
+#
+#     ^                                          ^ ^
+#     +----------------------+-------------------+ |
+#                            |                     |
+#                   String of length n      New character
+#
+# The idea is that we have the original string of length n, followed by our new
+# character char (which we'll abbreviate c).  In this diagram, I've marked the
+# LPB of the string of length n.  Notice that right after the LPB at the prefix
+# of the string, we have some character whose value is unknown (since n != 0
+# and the LPB can't be the whole string).  If this value is equal to c, then
+# the LPB of the whole string can be formed by simply extending the LPB of the
+# first n characters.  There can't be a longer proper boundary, since otherwise
+# we could show that by taking that longer boundary and dropping off the
+# character c, we'd end up with a longer proper boundary for the first n
+# characters of the string, contradicting that we chose the longest proper
+# boundary.
+#
+# By our above argument, remember that the length of the longest proper
+# boundary of the first n characters of the string is given by
+#
+#    xLPB(string, n - 1, string[n - 1])
+#
+# Thus we have the first part of our recurrence, which is defined as
+#
+#    xLPB(string, n, char) =
+#        if n = 0, then 0.
+#        let k = xLPB(string, n - 1, string[n - 1])
+#        if string[k] == char, return k + 1
+#        else, ???
+#
+# Now, suppose that we find that the character after the LPB does not match.
+# If this happens, we can then make the following observation.  Below I've
+# reprinted the above diagram:
+#
+#     +------------+---+------------+------------+---+
+#     |     LPB    | ? |    ...     |     LPB    | c |
+#     +------------+---+------------+------------+---+
+#
+#     ^                                          ^ ^
+#     +----------------------+-------------------+ |
+#                            |                     |
+#                   String of length n      New character
+#
+# Notice that any LPB of this new string must be a prefix of the LPB of the
+# first n characters and a suffix of the LPB followed by the character c.
+# Since by definition the LPB of the first n characters must be a prefix of
+# those n characters, we have the following elegant conclusion to our
+# recurrence:
+#
+#    xLPB(string, n, char) =
+#        if n = 0, then 0.
+#        let k = xLPB(string, n - 1, string[n - 1])
+#        if string[k] == char, return k + 1
+#        else, xLPB(string, k, char)
+#
+# The reason for this is that xLPB(string, k, char) asks for the longest
+# proper boundary of the LPB of the string formed from the first n characters
+# of the string followed by the character c, which is exactly what we described
+# above.
+#
+# As written, filling in the table of LPB values would take O(n^2) time, where
+# n is the length of the string.  However, using dynamic programming and an
+# amortized analysis, we can show that this function can be made to run in
+# O(n) time.  In particular, suppose that for all n' < n, we know the value of
+# LPB(string[0:n]).  Then in the above formulation of xLPB, the first
+# recursive call is known, and the only recursive call we may actually need to
+# make is the second.
+#
+# However, this doesn't seem to say anything about the runtime of the second
+# recursive call, which seems as though it might cause the evaluation of this
+# function to run in time O(n).  This is correct, but in an *amortized* sense
+# the whole table can still be computed in O(n) time overall.  To see this,
+# let's define a potential function Phi(k) that associates a potential at each
+# point of the computation of the table.  In particular, define Phi(k) as
+#
+#   Phi(0)     = 0
+#   Phi(k + 1) = result[k - 1]
+#
+# Here, result is the resulting table of LPB values.  Because of this, we can
+# remark that result[k] < k, since the longest proper border of a string can't
+# be any longer than that string.
+#
+# Let's now show that this potential function gives an amortized O(1) cost for
+# each table entry computation, and thus an O(n) overall runtime for the table-
+# building algorithm.  To see this, consider what happens when the logic to
+# compute the next value runs.  The runtime for this step is bounded by the
+# number of recursive calls made to a subproblem.  However, each subproblem is
+# then of size given by the LPB of a slightly smaller problem.  This subproblem
+# must then have size at most the size of that smaller subproblem.  In other
+# words, we can say that each recursive call drops the maximum possible value
+# of the LPB for the current prefix by at least one.  Consequently, if k
+# recursive calls are made, the LPB of the current prefix is at least k smaller
+# than the LPB of the previous prefix, and so
+#
+#   D Phi = -k
+#
+# And so the amortized cost of computing the next term is 1 + k - k = O(1).
+
+def failTable(pattern):
+    # Create the resulting table, which for length zero is None.
+    result = [None]
+
+    # Iterate across the rest of the characters, filling in the values for the
+    # rest of the table.
+    for i in range(0, len(pattern)):
+        # Keep track of the size of the subproblem we're dealing with, which
+        # starts off using the first i characters of the string.
+        j = i
+
+        while True:
+            # If j hits zero, the recursion says that the resulting value is
+            # zero since we're looking for the LPB of a single-character
+            # string.
+            if j == 0:
+                result.append(0)
+                break
+
+            # Otherwise, if the character one step after the LPB matches the
+            # next character in the sequence, then we can extend the LPB by one
+            # character to get an LPB for the whole sequence.
+            if pattern[result[j]] == pattern[i]:
+                result.append(result[j] + 1)
+                break
+
+            # Finally, if neither of these hold, then we need to reduce the
+            # subproblem to the LPB of the LPB.
+            j = result[j]
+
+    return result
+
+# Function: kmpMatch(needle, haystack)
+# Usage: print kmpMatch("0101", "0011001011") # Prints 5
+# -----------------------------------------------------------------------------
+# Uses the KMP algorithm to find an occurrence of the specified needle string
+# in the haystack string.  To do this, we compute the failure table, which
+# is done above.  Next, we iterate across the string, keeping track of a
+# candidate start point and length matched so far.  Whenever a match occurs, we
+# update the length of the match we've made.  On a failure, we update these
+# values by trying to preserve the maximum proper border of the string we were
+# able to manage by that point.
+def kmpMatch(needle, haystack):
+    # Compute the failure table for the needle we're looking up.
+    fail = failTable(needle)
+
+    # Keep track of the start index and next match position, both of which
+    # start at zero since our candidate match is at the beginning and is trying
+    # to match the first character.
+    index = 0
+    match = 0
+
+    # Loop until we fall off the string or match.
+    while index + match < len(haystack):
+        print index, match
+
+        # If the current character matches the expected character, then bump up
+        # the match index.
+        if haystack[index + match] == needle[match]:
+            match = match + 1
+
+            # If we completely matched everything, we're done.
+            if match == len(needle):
+                return index
+
+        # Otherwise, we need to look at the fail table to determine what to do
+        # next.
+        else:
+            # If we couldn't match the first character, then just advance the
+            # start index.  We need to try again.
+            if match == 0:
+                index = index + 1
+
+            # Otherwise, see how much we need to skip forward before we have
+            # another feasible match.
+            else:
+                index = index + match - fail[match]
+                match = fail[match]
+
+    # If we made it here, then no match was found.
+    return None
No results found