Into NLP 2 – Fuzzy String Matching and the Edit Distance
In my last article I started with a dive into the wonderful world of Regular Expressions. We’ve seen how RegEx are really useful for search tasks. However… they are not perfect, and today we will look at one particular problem: misspelling.
Language is messy and humans (including me, who is definitely a human) make mistakes. Unfortunately, RegEx doesn’t handle spelling errors very well. To a RegEx, something is either a match or not, and often enough a misspelled word won’t get found. If I write “doin” instead of “doing”, my RegEx matching words ending in “ing” will not find it. Sure, one could try to rewrite the expression to allow for spelling errors, but the result probably won’t work very well and will look even messier than Regular Expressions usually do.
RegEx simply aren’t designed for this kind of task. So we have to expand our toolkit and look for some alternatives.
The root of the problem
The main problem we are facing is that regular expressions are a binary thing, and as so often, the binary distinction is an oversimplification and not the whole truth. But there are alternatives. Instead of asking “does Pattern match Text?” we could be asking “how well does Pattern match Text?” The result is then some match score. This concept is called “fuzzy matching”. But let’s look at an example. Say it’s late at night and for some reason I have to finish my epic novel about the magical adventures of a talking lion, but in my sleep-deprived stupor I managed to type the non-word “Lliom” instead of the intended “Lion”. Usually there is some spell checker that would now try to correct me, but how would it know what I meant? As we have already established, Regular Expressions are not the way to go, so let’s try to build a simple spell checker to help us understand how one could use fuzzy matching. Once we’ve done that, we can look into different applications like searching.
Correcting the Lliom
So first, our spell checker can make the assumption that I meant to type a correct English word (since that is what I do most of the time). Some smarter algorithm might be able to learn my personal vocabulary, but let’s stick with that assumption for now. Luckily, there are dictionaries of English words. So our simple spell checker could now compare my non-word with a list of actual words, find the one that resembles it most, and suggest it (or maybe even the top five closest ones). Of course, this has only moved the problem: how do we know how close two words are? So let’s talk about edit distance.
What is the edit distance?
The edit distance between two words describes how many edit operations one has to perform on one word to turn it into the other. Commonly these operations are deleting a letter, inserting a letter, and replacing a letter.
This gives us a score of how different two texts are (and, by inversion, how close they are). So we could say an insertion or deletion has a cost of 1 and a replacement has a cost of 2. Then we sum up the cost of each operation needed to get from one word to the other; that sum is the distance. A distance of zero is a perfect match; the higher the score, the further apart the two words are. Do that for all words in the dictionary, find the lowest scores, bada bing bada boom, spelling fixed. Well, you’d probably have to do some optimization or use some heuristics, since that would be a lot of comparisons, but you get the idea: once we have a measure of how similar two words are, we can run a search for the most similar ones. For our example we would compare “Lliom” to the words in the English dictionary and simply compute the edit distance…
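As a quick taste of this look-up-the-closest-word idea, Python’s standard library already ships a fuzzy matcher in difflib. It ranks candidates by a similarity ratio rather than by edit distance, so it’s only a stand-in for what we’re about to build, and the tiny dictionary here is of course just a placeholder for a real word list:

```python
from difflib import get_close_matches

# A toy stand-in for a real English dictionary.
dictionary = ["lion", "tiger", "zebra", "leopard", "gazelle"]

# Find the dictionary words most similar to the non-word "lliom".
print(get_close_matches("lliom", dictionary, n=3))  # -> ['lion']
```

Under the hood this compares character sequences much like an edit distance would, just normalized to a 0–1 similarity score with a cutoff below which candidates are dropped.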
Uhm okay… But how…
So we know what to do, but how do we find out which operations are needed to get from “Lliom” to “Lion”? Thankfully, many other people have asked this question before and have come up with solutions; the best known is the Wagner–Fischer algorithm. It uses some quite nice dynamic programming tricks and reduces the edit distance problem to simply filling out a grid based on some stupidly simple rules:
First we create a grid with one more column than the word in question has letters (in our case “Lliom”, so 5+1 columns) and one more row than the word we want to compare against (so for “Lion” 4+1 rows). To help visualize, I’ve added an additional row and column with the actual letters in them. First we fill in a zero in the top left corner of the grid and then count up to the right and towards the bottom.
This is the starting grid. Now all we have to do is fill in the blanks row by row until we reach the bottom right corner, which will contain the final edit distance. We fill in the grid with the following rules:
- Take the entry to the left + 1 for a deletion operation
- Take the entry to the top + 1 for an insertion operation
- Take the entry to the top left + 2 for a substitution operation
- Take the entry to the top left directly if the letters of the row and column are the same
Of these three (or four) options we pick the lowest number.
This would be the result for the next row. Since the letters in the row and column with the “L” are identical, we get a free substitution; from that point onwards we just have deletion operations. As you have probably guessed, this method is pure dynamic programming. We compute the edit distance of the two words by computing the edit distance of partial words. Each cell of the table represents such a partial edit distance. Take for example the cell in bold in the “i” column and the “L” row: it represents the edit distance between the words “Lli” (the first three columns) and “L” (the first row). Fully filling out the table gets us the following result:
So according to Wagner–Fischer, one can derive my “late-night accident” from the word “Lion” by inserting an “l” after the first one and substituting the “n” for an “m”, all for the low, low cost of 3 (1 insertion + 1 substitution) “distance points”.
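The whole table-filling procedure fits in a few lines of Python. Here is a minimal sketch using the costs from above (1 for insertion and deletion, 2 for substitution); the function name is mine, not part of any library:

```python
def edit_distance(source: str, target: str) -> int:
    """Wagner-Fischer with costs: insertion/deletion 1, substitution 2."""
    cols, rows = len(source) + 1, len(target) + 1
    # grid[r][c] = edit distance between source[:c] and target[:r]
    grid = [[0] * cols for _ in range(rows)]
    grid[0] = list(range(cols))        # top row: count up to the right
    for r in range(1, rows):
        grid[r][0] = r                 # first column: count towards the bottom
    for r in range(1, rows):
        for c in range(1, cols):
            if source[c - 1] == target[r - 1]:
                grid[r][c] = grid[r - 1][c - 1]      # same letter: free
            else:
                grid[r][c] = min(
                    grid[r][c - 1] + 1,              # deletion
                    grid[r - 1][c] + 1,              # insertion
                    grid[r - 1][c - 1] + 2,          # substitution
                )
    return grid[-1][-1]                # bottom right corner

print(edit_distance("Lliom", "Lion"))  # -> 3
```

Note that only the previous row is ever read while filling in the current one, so if you just need the distance (and not the whole table), you can keep two rows instead of the full grid and cut the memory use down considerably.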
Searching with Wagner-Fischer?
Now let’s come back to our original problem: searching with Regular Expressions doesn’t work when the text is misspelled. There is a variant of the Wagner–Fischer algorithm that allows us to use a very similar method to search for a string in a piece of text. We do this by slightly modifying the rules we use to fill in the table: First we set the first column to all zeros instead of counting up. This means that inserting letters before the beginning of our search string is free. Additionally, the cost of insertion is zero for the last column, meaning that inserting letters after our search string is free as well. This variant is actually mentioned in the Wikipedia article linked above as Sellers’ variant, and I have in fact used a similar method in the past when I had to search through some corrupted text. I knew a given sentence would be on a page, but there was a high likelihood of noise in the text; maybe some words would be missing or misspelled. But a large grid with the page and my sentence as rows and columns, plus some fine-tuning, made short work of these kinds of issues.
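Sellers’ variant is only a small change to the code. A minimal sketch (the function name is again my own; note that I’ve transposed the grid relative to the description above, putting the text along the columns, so here it is the zero *row* that lets a match start anywhere in the text; the idea is identical):

```python
def fuzzy_find(pattern: str, text: str) -> tuple[int, int]:
    """Sellers' variant of Wagner-Fischer: approximate search for
    pattern in text; returns (cost, end position of the match in text)."""
    # Row 0 is all zeros: a match may begin anywhere in the text for free.
    prev = [0] * (len(text) + 1)
    for r in range(1, len(pattern) + 1):
        curr = [r]                            # skipping pattern letters still costs
        for c in range(1, len(text) + 1):
            if pattern[r - 1] == text[c - 1]:
                curr.append(prev[c - 1])      # same letter: free
            else:
                curr.append(min(
                    curr[c - 1] + 1,          # extra text letter inside the match
                    prev[c] + 1,              # pattern letter missing from the text
                    prev[c - 1] + 2,          # substituted letter
                ))
        prev = curr
    # The match may end anywhere in the text: take the minimum of the last row.
    best = min(prev)
    return best, prev.index(best)

# "ljon" is one substitution away from "lion", so the best cost is 2.
cost, end = fuzzy_find("lion", "my talking ljon speaks")
```

Taking the minimum of the last row also tells you where the best match ends in the text, which is handy when you want to extract the surrounding passage afterwards.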
This just works… Kind of…
While this is definitely a workable approach to string searching, it has some definite drawbacks. One is the fact that the runtime and memory requirements can get out of hand quickly (usually O(n*m), where n and m are the lengths of the search term and the document). If your documents are relatively short, or if they can be chunked into smaller pieces, this might not be a problem. Additionally, it might be useful to play with the costs for the different operations and the threshold for what counts as a match. For example, if you were to search for the word “unknown” in the text “It is unlikely that he did know this.”, then depending on your costs, Sellers’ variant could try to match the “un” of “unlikely” and the “know” with a bunch of insertions in between. One could counteract this by penalizing the insertion of spaces. While this approach is definitely not perfect, it worked well enough for me even in multi-page search jobs, and I’m sure there are ways to optimize it. So the next time you have to search for some text in an unreliable document, maybe give Sellers’ variant a try. Next time we will look at a more down-to-earth tool and continue our dive into NLP tooling with tokenization.