**Keywords: Levenshtein, Wagner-Fischer, Zaxxon
**

It comes frequently in the service sector to check your customer names against some list provided by regulatory agencies, and the checking has to be tolerant against slightly mistyped names. So for example if “Hugo Chavez” is on the list, and “Huge Shavez” is in your list of customers, you might want to at least flag it as something to investigate further and see if it is a real match. One of the classic methods for evaluating near matches for words and names is the Levenshtein algorithm, or its closely related cousin Wagner-Fischer. But how does it work? And what does it have to do with Zaxxon?

**The Cost of Changing One Word into Another**

How do you change one word into another on a computer? You can do one of three things: insert a letter into the word, remove a letter from the word, or change a letter. Try it out! Open up a new text file in notepad.exe and see what it takes to turn “abra” into “cadabra”. For example you can just add “cad” to the front, or you can change the “abr” into “cad” and then add “bra” to the end. Try it out with different words. There are many things you could do, and the particular things you might do depend on what the words are. In order to figure out if one word matches another even if there was a typo, one way to do it is to figure out what is the *minimum* number of such things you have to do to turn one word into the other. And that is what the Levenshtein algorithm does.

To figure out the minimum cost, imagine a custom built checkerboard that has the first word along the left side and the second word along the top, one letter per column or row, as in the above figure. In the above picture, the incremental cost of each kind of change is illustrated at left with the corresponding movements through the checkerboard. Let’s say you start at the upper left hand corner. When you insert a letter, it’s like moving one square to the right. When you remove a character, it’s like moving one square down. And when you change a character, it’s like moving one square diagonal (right and then down.) When you reach the lower right hand corner, you’re done: the word on the left has been transformed into the word on the right.

Now it always costs something to insert or remove a letter, ie- when moving straight right or down. But let’s say at some point in this checkerboard, a letter in the first word is the same as a letter in the second word: you don’t have to do anything! In other words, the cost of going diagonally is nothing if the letters already match there, but it still costs something if they don’t. So let’s consider finding the minimum cost. Make your checkerboard, and make a 0 in the upper left corner if the first letters of each word match and make a 1 if they don’t. Spread out from the upper left corner by filling in a running total of the minimum cost in the boxes. You can get to each box by inserting (ie- you moved in from the left), by removing (ie- you moved in from the top), or by changing (ie- you moved in from the upper left diagonal.) To get the minimum cost, you just take the minimum of whatever number is in the box to the left plus 1, whatever number is in the box on top plus one, whatever number is in the upper left diagonal plus one *if the letters in your box are the different*, or just whatever number is in the upper left diagonal *if the letters in your box are the same*, because you didn’t have to do anything there. (Note if a box is on an edge, just ignore boxes “above” or “to the left” if they don’t exist!)

If you go through this procedure for “cad” and “cat”, you get the checkerboard above right. Reading off the value in the lower right box, it costs a minimum of 1 “thing to do” to change “cad” into “cat”. Namely, you can start out going diagonal on “c” and “a” at a cost of zero because they already match, and then you have to change the “d” into a “t” so the last box contains a 1. Again there may be other ways to get there, but 1 was the minimum cost of all possible ways, and Levenshtein only keeps track of the minimum.

// assume words are non-empty and in the character arrays a and b

// Never compiled, for demonstration purposes only! Feel free to copy!

// No warrantry expressed or implied.

int[][] costs = new int[a.Length][b.Length];

costs[0][0] = ( a[0] == b[0] ? 0 : 1 );

for (int i = 0; i < a.Length; i++ ) {
for ( int j = 0; j < b.Length; j++ ) {
if ( i == 0 and j == 0 ) continue;
if ( i == 0 )
{
cost[i][j] = cost[i][j-1] + 1;
continue;
}
if ( j == 0 )
{
cost[i][j] = cost [i-1][j] + 1;
continue;
}
cost[i][j] = Min (
cost[i-1][j] + 1,
cost[i][j-1] + 1,
cost[i-1][j-1] + (a[i] == b[j] ? 0 : 1)
);
}
}
// The minimum cost is in cost[a.Length-1][b.Length-1]
[/sourcecode]
(The only difference with Wagner-Fischer is that W-F allows you to set the costs of inserts, removals, and changes differently. So instead of adding "1" to various costs in the above, you'd be adding one number for inserts, a different number for removals, and maybe a third number for changes.)
The Zaxxon Interpretation
OK, so what does all of this have to do with Zaxxon? For younger readers, Zaxxon was an early 1980's video game that featured a space shuttle flying through an enemy base and shooting it up. The shuttle moved forward at a constant rate and never backwards. You could move left and right, and if anything was in your way you had to shoot it.

When you think about it, this is somewhat similar to what you’re doing with Levenshtein. Think of the checkerboard we made above, except that instead of containing running totals of the minimum cost, each square contains only the cost of “changing” a letter at that point in the word. In other words, each square contains a 1 if the row and column letters are different, and it contains a 0 if they are the same. Finding the minimum cost is the same as finding the path through the new checkerboard that goes over the most zeros and avoids the ones as much as possible. Remember, on the checkerboard it’s a given that moving straight right or straight down costs one, so you can think of it as having to shoot your way into squares going straight down or straight right, and shooting to go diagonal if the letters are different but holding your fire if the letters are the same. Then the word matching problem becomes equivalent to finding the minimum number of “shots fired” to get through an enemy base!

So for example to turn “Jaded” into “Jabber”, I can fly over the “Ja”, shoot to add a “b”, shoot again to turn a “d” into a”b”, fly over the “e” and shoot to turn the last “d” into an “r”.

Or to turn “Zaxon” into “Zaxxon” I only need to shoot once to add another “x”. Crazy.

**But seriously folks…**

OK, if you’re not ROFLOL by now, I’ll admit that it’s not exactly like Zaxxon. In the real Zaxxon, you use ammo but you also get more or less points for hitting various targets, you can nose up and avoid all of the targets altogether, and you can destroy the shuttle by crashing into something. But in my suggested “modified” Zaxxon above, if I had a map of an enemy base, and if I had a good algorithm for finding the minimum number of shots to clear the base without nosing up or crashing, then I’d have a good algorithm for string matching. And vice versa.

The field of algorithms is filled with such equivalences. It’s what makes this field so much fun. In fact, professionals often use these kinds of techniques of turning one problem into another in order to answer questions relating to the running time of algorithms. So for example if it is known that if the running time of a particular algorithm goes as the square of the length of the input, and if you can transform the original problem into a new problem (without wasting too much time in the process), then you essentially have an algorithm that solves the new problem with running time that goes as the square of the length of the input. Some interesting questions to ponder:

Does every string matching problem have a Zaxxon equivalent? I think pretty clearly the answer is yes.

Does evey Zaxxon problem have a string matching equivalent? Not so clearly, but I think that answer is no. Zaxxon is more general – in Zaxxon you can place obstacles anywhere, but in string matching your zeros have to be consistent: if I have a zero in square (2,3), a zero in square (2,5), and a zero in square (1,3), then I had better have a zero in square (1,5). However, even though the Zaxxon problem is more general, Levenshtein should still work since it assumes nothing special about the cost matrix.

While the suggested equivalence with Zaxxon above is somewhat whimsical, I’m going to be on the lookout from now on for other optimization problems that I can solve using Levenshtein.

**References**

(If you liked this post, feel free to find me at Linked In, Gregory Graham, Greater Chicago Area, Banking.)

on August 25, 2009 at 7:32 pmPre-scoring Candidates for String Matching « The Solarium[…] names only. The Levenshtein algorithm is (somewhat whimsically) explained in another posting here with […]