Keywords: Levenshtein, String Matching, Optimization
In many service oriented businesses, the problem comes up to search your customers for names which are published on a government watchlist. The input names on either list may contain typographical errors, so a fault insensitive matching algorithm must be run on each name pair. A model algorithm to consider for this is the Levenshtein string matching algorithm. Levenshtein is a nice algorithm to consider because it is well behaved, easy to understand and work with, and it is unbiased in the sense that it is not parametric and tuned for “English” sounding names only. The Levenshtein algorithm is (somewhat whimsically) explained in another posting here with references.
The time to compare all possible combinations of names varies linearly with the length of each input list, quadratically if they are both growing. Typical string matching algorithms that are insensitive to typographical errors run more slowly than linear time. Levenshtein is itself quadratic in the length of the strings it is comparing, so it makes sense to attempt to find a fast, linear-time pre-scoring algorithm that enables fast rejection of as many comparisons as possible beforehand. Furthermore, and perhaps as importantly, it is desirable to find a pre-scoring strategy that is separable over two strings being matched so that the pre-score can be calculated “offline” and stored with each string for quick comparison later.
There are some interesting caveats here. First, it should be noted that the output of the Levenshtein algorithm is often called the “string distance” or “edit distance”, and is thus suggestive of some physical distance between two strings. This is not the case: the set of all character strings does not easily support a topology or a metric. (That’s college-ese for “I can’t think of one”!) For example, if each character in a string represents a dimension, then you might imagine character strings as defining a vector in some hyperdimensional space, and the distance is some Cartesian metric in that space. But even small typographical changes to the string can cause the vector to change position wildly, so “close in vector space” in this construction doesn’t correspond very neatly at all to “close typographically”. (And this doesn’t even address the problem of padding up the dimension of either of a pair of character strings so that they match in dimensionality!)
Second, one might try to address this problem by using a different measure that maps higher dimensional data into fewer dimensions while preserving locality. A simple example of such a mapping is binary prefix ordering. Imagine a 2-dimensional square and a string of 1’s and 0’s. As you go down the string, each successive character will pick left/right if the character is in an odd position, and it will pick top/bottom if it is in an even position. Specifically, let a 0 pick the left or top region, and let a 1 pick the right/bottom region. This is essentially a code to label every successively smaller region with a prefix code so that strings that share a common prefix also share a common sub-region of the square.
Now you can imagine doing this with a character string in binary representation. In the above figure, each string would map into a region of the square based on its 1’s and 0’s. Two typographically close strings would map to the same region (assuming they were exact up to some prefix) so the typographical metric appears to be preserved. However, it is horribly biased in the sense that many typographical errors far down the length of a string could result in dissimilar words ending up in the same region, while a single typographical error could put similar strings in different quadrants provided the error happened in the first character.
Other more advanced schemes include Z-ordering or Hilbert ordering. These fractal measures are designed for mapping multidimensional data to a single parameter while preserving locality better than in prefix ordering. These are used in modern spatial databases as a means of clustering images. But even for fractal measures, you just can’t map higher dimensional data to fewer dimensions without incurring some “cuts” and discontinuities. (Think of the Mercator projection “Map of the Globe”.) It is also true that if strings don’t match in the first few letters, then they start out rather far apart in either Z-ordering or Hilbert-ordering – and I don’t think OFAC would buy a “Fractals ate my homework!” excuse if you happened to miss matching “Usama” to “Osama” ;-)
In what follows, I will be working with the Levenshtein algorithm because it has nice mathematical properties. Much of what follows can be applied also to Wagner-Fischer with appropriate choices for the cost-weightings. The pre-scoring strategy that I would like to employ involves some basic set theory and second order logic. Imagine the set of all string pairs (s1,s2). The Levenshtein Algorithm defines an integer function L on (s1,s2), namely it is the number of individual, reversible typographical operations that must be done in order to turn one string in the pair (s1) into the other (s2). The operations include character insertion, character deletion, and character transformation. In our particular problem, we will place a bound N on L(s1,s2) such that if L(s1,s2) <= N, we say that s1 matches s2. The bound N generates a set in (s1,s2), which we call LN, which is simply the set of all pairs (s1,s2) such that L(s1,s2) <= N. So the question becomes: is there another set X in (s1,s2) such that LN is a proper subset of X and X is computable in linear time? And as a bonus, can we at least partially compute X ahead of time by looking at s1 and s2 separately?
Well, there is an obvious choice here: string length difference DLEN(s1,s2) = |LEN(s1) – LEN(s2)|. This should be obvious by construction: If we know that L(s1,s2) <= N, then certainly the strings can’t be farther apart than N in length, so L(s1,s2) <= N implies DLEN(s1,s2) <= N. If we follow convention and name the latter set as DLENN, and given the existence of strings that differ in length by N but don’t match at all, then this shows that LN is a proper subset of DLENN.
A more formal proof can be made by induction. Note that the original computation of the Levenshtein result involves the construction of a cost matrix, shown below.
Consider a string pair (s1, s2). L is computed by assigning a cost of 1 to insertions (moving right), deletions (moving down), and exchanges (moving diagonally right and down) if the corresponding characters in s1 and s2 are different. Start in the upper left hand corner with a 0 in the square if the first character of s1 and s2 match and 1 otherwise. In subsequent squares, you compute the cost of moving into that square from adjacent squares to the left and above, and keep only the minimum. The final value L appears in the lower right square.
Now consider that as we compute L, we also keep track of the string length difference. For both insertions (moving right) and deletions (moving down), the partial result for L changes by 1, and the difference in string lengths changes by plus or minus 1. For changes, L may change by 0 or 1 depending on the corresponding characters in each string. But going diagonally means that the string length change is 0. Therefore in each square, the partial DLEN(s1,s2) <= partial L(s1,s2), and so the final DLEN(s1,S2) <= L(s1,s2). But then placing a bound on L(s1,s2) also places the same bound on DLEN(s1,s2), so that L(s1,s2) <= N implies DLEN(s1,s2) <=N.
DLEN furthermore has nice properties in that one can compute the string lengths of all s1 and s2 separately ahead of time, in linear time, and store the results in a database. Later, when one has picked a bound on L(s1,s2) <= N one only needs to search buckets in the database that satisfy DLEN(s1,s2) <= N.
So an interesting question is: are there other functions that may be more efficient at rejecting non-matching string pairs? Another quickly computable set can be generated by counting the maximum number of occurrences of any of the characters in the string. For example, if string s = “BARBER”, MAXC(s) = 2 because there are 2 “B”‘s or because there are two “R”‘s. Or if string s = “BERBERRY”, MAXC(s) = 3 because there are 3 “R”‘s. What about the difference function DMAXC(s1,s2) = |MAXC(s1) -MAXC(s2)| ?
First, DMAXC(s1,s2) satisfies our set criterion above. Let’s consider our computation of L again. As we compute L, let’s also keep track of DMAXC. For both insertions (moving right) and deletions (moving down), the partial result for L changes by 1. However, DMAXC changes by 0 or plus or minus 1 because you could be inserting or removing one of the maximum repeating characters in the string. For character changes, L may change by 0 or 1 depending on whether the corresponding characters in each string are the same or not. Let’s say that the change in L was 0. But then that means the characters in s1 and s2 match in that position, and so obviously the change in DMAXC is 0 also going diagonally. Say that the change in L is 1 when going diagonally. DMAXC again changes by 0 or plus or minus 1 because you could be changing one (but only one) of the maximum repeating characters in the string. Therefore in each square, the partial DMAXC(s1,s2) <= partial L(s1,s2), and so the final DMAXC(s1,S2) <= L(s1,s2). But then placing a bound on L(s1,s2) also places the same bound on DMAXC(s1,s2), so that L(s1,s2) <= N implies DMAXC(s1,s2) <= N.
Second, DMAXC also has nice properties in that one can compute MAXC of all s1 and s2 separately ahead of time, in linear time, and store the results in a database. Later, when one has picked a bound on L(s1,s2) <= N one only needs to search buckets in the database that satisfy DMAXC(s1,s2) <= N.
Is this any better than DLEN? Probably not… because the maximum number of repeating characters in a string is less than or equal to the length of the string and likely much smaller, and therefore the number of bins you can sort strings into based on MAXC is fewer than the number of bins using LEN, and so the number of strings per bin is going to be higher and the rejection of non-matching strings is going to be less efficient.
(But boy that was fun, wasn’t it?)
Certainly MAXC has some discriminating power, and clearly it is different than the power you get from using LEN alone, so is there a way to combine the two? Unfortunately, while you get a lot of bins if you sum the two, the sum no longer obeys the bounds on L. But if you subtract, it does obey the bounds.
So if we define UD(s) = LEN(s) – MAXC(s), and then DUD(s1,s2) = |UD(s1) – UD(s2)|, then we can construct a proof much like those above to show that a bound on L(s1,s2) is also a bound on DUD(s1,s2), and so that LN is a subset of the set of all (s1,s2) such that DUD(s1,s2) <= N, a.k.a. DUDN by our convention. I won’t go through that again. DUD also has nice properties in that one can compute UD of all s1 and s2 separately ahead of time, in linear time, and store the results in a database. Later, when one has picked a bound on L(s1,s2) <= N one only needs to search buckets in the database that satisfy DUD(s1,s2) <= N.
Will DUD(s1,s2) outperform DLEN(s1,s2)? Again probably not, and again because there are fewer bins to sort strings into. I only bring it up because UD(s) has another interesting interpretation: it is the Levenshtein distance between the string s and a string specially constructed by taking any one of the maximum repeating characters in s up to the length of s. In other words, if s = “BERBERRY”, then the specially constructed string s’ is “RRRRRRRR”. Or if s = “CAT”, then the specially constructed string s’ is any of “CCC”, “AAA”, or “TTT”. UD(s) = LD(s,s’) simply because you don’t have to insert/delete any characters to turn s into s’, you only need change the LEN(s) – DMAXC(s) characters in s that don’t have corresponding matches in s’.
Two distinct lists of names were obtained for the test. After filtering out obvious confounding sub-strings like “INC” and “COMPANY”, and after including “Doing Business As” names, there were approximately 475 million name pairs. The following table shows the number (in millions) of remaining pairs on which to compute L(s1,s2) completely in the given sets for various cuts on L(s1,s2) with noise rejection factors (ratio of number before cut to number after cut).
|4||228 (2.1)||473 (1.0)||252 (1.8)|
|3||183 (2.6)||468 (1.0)||203 (2.3)|
|2||134 (3.5)||447 (1.1)||150 (3.2)|
|1||82 (5.9)||366 (1.3)||92 (5.2)|
The results clearly show that DLENN has the best rejection factors. While DMAXCN has the worst, it was a bit surprising to see it perform so badly. However, most words just don’t have more than 8 repeating characters, and so there are really much fewer bins for strings to go into, and most of the cuts in N are essentially “all bins in” for DMAXCN. And instead of enhancing the performance of DLENN, the mixing of DLENN and DMAXCN in DUDN made it slightly worse. Again, this is because DUDN has slightly fewer bins than DLENN.
It is an interesting question: is there a discriminator X(s) that you can define on a single string that (a) sorts strings more evenly across a greater number of bins than LEN(s), and (b) is there a simple way to combine it with another result DX(s,s’) = |X(s) – X(s’)| such that if L(s,s’) is bounded above by N then DX(s,s’) is also bounded above by N? Or is DLEN(s,s’) maximal in the sense that there is no function better at doing this?
(If you liked this post, feel free to find me at Linked In, Gregory Graham, Greater Chicago Area, Banking.)