Into NLP 1 – Regular Expressions
Into the Fire – A somewhat less nonsense introduction to NLP
Natural Language Processing? – What is NLP?
Language is messy. In our attempts to convey meaning and emotions to each other, we have come up with some extraordinarily complex structures that take years of learning to grasp. There are countless rules and even more exceptions to those rules, but somehow we manage to communicate with each other. The name scientists have come up with for this mess is natural language.
And then there are computers, machines that require a lot of structure to work. NLP is the attempt to make those two worlds meet, to have computers parse, process, and understand the language we use in our daily (natural) lives. In the coming articles we will have a look at tools, techniques, and methods that help us deal with the chaotic complexity of natural language. We will see the many ways in which NLP can make dealing with language easier, one method at a time. Today we will start with the first:
Narly Letter Pulp
Probably one of the easiest tasks one might encounter when dealing with NLP is text search. You might have to search a long e-mail thread for an address, or a PDF document for a specific text section. The common approach would be to press “CMD + F” (Ctrl + F for you Windows peeps) and type in the word you are looking for. Depending on the task at hand this might be sufficient, but more often than not it is not good enough. Today we will look at a method that gives us a lot more power when searching through documents: Regular Expressions. Now, if you have encountered Regular Expressions (RegEx for short) before, you might remember them as an utterly unreadable soup of letters and characters. But the basics actually aren’t all that convoluted.
My goal with this article isn’t to give you a full rundown of all of RegEx, since it is quite a large topic. Instead I want to give you a quick overview:
Into the letter soup
Say we have a document and we want to search for the word “lion”. A regular text search would bring it up, but maybe we want more: maybe we are searching for instances of “lion” or “wolf”. We could do two searches, but with a lot more words we would suddenly be occupied for a day searching for individual words. So how can RegEx help?
The first construct RegEx gives us is the “or”. It is written as a vertical bar (|), so if we were to search for
`lion|wolf` it would match every instance of either word. We can chain as many words as we want:
`lion|wolf|eagle|whale` would search for any of the four given words. We can also use this construct to search for upper- and lower-case variants. Assume we want to find both the upper- and lower-case variants of lion, but only the lower-case variant of wolf. We could write this as
`Lion|lion|wolf`. There is also a shorter notation:
`(L|l)ion|wolf`
This might look a bit strange, but let’s dissect it: the parentheses are not part of the match; they help us structure the search. If we were to write
`L|lion|wolf` it would read as if we were looking for the lone letter “L”, the word “lion”, or the word “wolf”. With the parentheses it is clear: we want “L” or “l” followed by “ion”, which gives us the upper- and lower-case variants of the word lion.
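To see the “or” and the parentheses in action, here is a minimal sketch. It assumes Python and its built-in `re` module (the article itself is not tied to any language), and the sample sentence is made up for illustration.

```python
import re

# Made-up sample sentence for illustration.
text = "The Lion chased a wolf, while another lion and a Wolf watched an eagle."

# Plain alternation: any of the listed words, exactly as spelled.
animals = re.findall(r"lion|wolf|eagle|whale", text)

# Grouped alternation: "L" or "l" followed by "ion", or lower-case "wolf".
# finditer + group(0) returns the whole match even though the pattern
# contains a capturing group.
grouped = [m.group(0) for m in re.finditer(r"(L|l)ion|wolf", text)]

# Without the parentheses the pattern means: "L", or "lion", or "wolf".
ungrouped = re.findall(r"L|lion|wolf", text)

print(animals)    # ['wolf', 'lion', 'eagle']
print(grouped)    # ['Lion', 'wolf', 'lion']
print(ungrouped)  # ['L', 'wolf', 'lion']
```

Note how the ungrouped variant only picks up the single letter “L” from “Lion”, which is exactly the misreading the parentheses prevent.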
Next there is the optional. Say you have an English document of unknown origin. You don’t yet know whether it is written in American or British English, and you want to find the word “color” or “colour”. Again you could do two searches, or use the tool from above. But there is an easier way: the “optional” operator, written as a question mark (?). With that our search can be written as
`colou?r`. The optional operator matches whether or not the preceding letter is present. In our case, the “u” followed by the question mark means that the letter “u” is optional, so the expression matches both spelling variants. Using parentheses we can also make multiple letters optional, for example
`do(ing)?` would match both “do” and “doing”.
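The optional operator can be tried out the same way. This is again only a sketch assuming Python’s built-in `re` module; the sample text and word list are invented for the example.

```python
import re

# Made-up sample text containing both spelling variants.
text = "The colour of the sky and the color of the sea."

# The "u" is optional, so both spellings are found.
spellings = re.findall(r"colou?r", text)

# fullmatch requires the whole string to fit the pattern.
# The parentheses make the three letters "ing" optional as a unit.
candidates = ["do", "doing", "doin", "done"]
results = {word: re.fullmatch(r"do(ing)?", word) is not None for word in candidates}

print(spellings)  # ['colour', 'color']
print(results)    # {'do': True, 'doing': True, 'doin': False, 'done': False}
```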
Now let’s imagine we are looking for a number. We aren’t sure anymore if it is ten thousand, or maybe ten million? Just a one, followed by a bunch of zeros. There is a construct in RegEx that allows searching for a string of unspecified length… Actually there are two: the star (*) and the plus operator (+). In our case we could write something like
`10+`. The two operators have slightly different meanings. Let’s start with the second: `10+` means a “1” followed by one or more “0”s. This would match “10”, “100”, “1000”, “10000” etc. `10*` means a “1” followed by zero or more “0”s. So unlike the plus example, it would also match a lone “1” without any zeros.
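The difference between the plus and the star is easiest to see side by side. The following sketch assumes Python’s built-in `re` module and uses a hand-picked list of candidate strings.

```python
import re

# Hand-picked candidates; "11" and "01" are there as counterexamples.
candidates = ["1", "10", "1000", "10000000", "11", "01"]

# "+" demands at least one "0" after the "1".
plus_matches = [c for c in candidates if re.fullmatch(r"10+", c)]

# "*" also accepts zero "0"s, so the lone "1" is matched as well.
star_matches = [c for c in candidates if re.fullmatch(r"10*", c)]

print(plus_matches)  # ['10', '1000', '10000000']
print(star_matches)  # ['1', '10', '1000', '10000000']
```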
The star and the plus don’t sound all that useful until you encounter some additional tools, like the dot (.). The dot matches any character (in most RegEx flavors, any character except a line break by default). So have a look at the following example:
`.+ing` What does this mean? Let’s break it down:
`.` matches any character, so
`.+` means we match a string of one or more characters. Additionally, we have the requirement that we are matching
`ing`, so this means we match an arbitrary string followed by “ing”. This would match “doing”, “making”, “running” etc. As long as it ends in “ing” it gets matched.
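Here is a small sketch of the dot combined with the plus, assuming Python’s built-in `re` module. The word list and the sentence are made up. The second part shows a side effect worth knowing: the dot also matches spaces, and the plus is greedy, so inside a longer text the match can span several words.

```python
import re

# Made-up word list; "ing" and "sings" are counterexamples.
words = ["doing", "making", "running", "ing", "ring", "sings"]

# At least one arbitrary character, followed by "ing" at the end.
ends_in_ing = [w for w in words if re.fullmatch(r".+ing", w)]

# In running text the dot happily matches spaces too, and ".+" grabs
# as much as it can before the last "ing" it can find.
sentence = "I am doing fine"
greedy_match = re.search(r".+ing", sentence).group(0)

print(ends_in_ing)   # ['doing', 'making', 'running', 'ring']
print(greedy_match)  # I am doing
```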
There are a lot more things to discover, like character groups:
`[a-z]` matches any single lower-case letter,
`[0-9]` any single digit. Additionally there are pre-defined groups:
`\s` matches whitespace,
`\w` word characters,
`\d` digits, and especially useful is
`\b`, which matches the beginning and end of words, so we can write
`\b.+ing\b` to make sure the match starts and ends at a word boundary. Keep in mind that the dot also matches spaces, so to stay within a single word `\b\w+ing\b` is the stricter choice. But as I have said, my goal is not to give you a full tutorial, so let’s look at some applications:
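The character groups and the word boundary can be sketched as follows, again assuming Python’s built-in `re` module and an invented sample sentence. The last two patterns compare `\b.+ing\b` with the variant `\b\w+ing\b`, which is an addition for illustration: because the dot also matches spaces, only the `\w` version stays within single words.

```python
import re

# Invented sample sentence for illustration.
text = "Call 555 0199 before 9am: running, singing and string theory are king."

# A character group and its pre-defined shorthand find the same digit runs here.
digits_group = re.findall(r"[0-9]+", text)
digits_shorthand = re.findall(r"\d+", text)

# \s+ matches any run of whitespace, which is handy for splitting.
tokens = re.split(r"\s+", "one  two\tthree")

# Word characters only, framed by word boundaries: one match per word.
ing_words = re.findall(r"\b\w+ing\b", text)

# With the dot, a single greedy match runs from the first word boundary
# all the way to the last word ending in "ing".
dot_matches = re.findall(r"\b.+ing\b", text)

print(digits_group)      # ['555', '0199', '9']
print(tokens)            # ['one', 'two', 'three']
print(ing_words)         # ['running', 'singing', 'string', 'king']
print(len(dot_matches))  # 1
```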
RegEx? Huh! What is it good for?
RegEx are everywhere. Basically every programming language supports them. A variant of RegEx can even be used when addressing a database in SQL. There are extensions for browsers and PDF editors that use them, and almost every IDE allows RegEx in its text search. Outside of the area of searching, RegEx can be used for simple validation, for example to check if the user entered a valid phone number or e-mail address, and for some simple parsing operations.
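As a rough sketch of what such a validation might look like, here is a phone number check assuming Python’s built-in `re` module. The accepted format (an optional “+”, then groups of digits separated by single spaces or hyphens) is a hypothetical one chosen for this example, and the names `PHONE_PATTERN` and `looks_like_phone_number` are made up. Real phone number rules are considerably messier.

```python
import re

# Hypothetical format, assumed for illustration only:
# an optional "+", then digit groups separated by single spaces or hyphens.
PHONE_PATTERN = re.compile(r"\+?[0-9]+([ -][0-9]+)*")


def looks_like_phone_number(value: str) -> bool:
    """Return True if the whole input fits the assumed phone format."""
    return PHONE_PATTERN.fullmatch(value) is not None


print(looks_like_phone_number("+49 30 1234567"))  # True
print(looks_like_phone_number("555-0199"))        # True
print(looks_like_phone_number("12--34"))          # False
print(looks_like_phone_number("call me"))         # False
```

Using `fullmatch` instead of a plain search matters here: a search would accept any input that merely contains something phone-like.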
While RegEx is a fun tool, it is easy to overuse. Once you have the hammer of RegEx, suddenly every problem looks like a text-searching nail. In my experience the usefulness of an expression goes down as its length goes up. So if you find yourself with an expression that spans multiple lines, you are probably overdoing it, especially since Regular Expressions are notoriously annoying to read. Before you take your validation expression into code review, be prepared that no one (including your future self) wants to read and decode an expression that spans three lines. Documentation can help, but at this point you might want to look into other validation techniques.
Additionally, there are some things RegEx simply cannot do. Language is messy and sometimes too difficult for the humble Regular Expression. As one user on Stack Overflow noted, trying to parse HTML using Regular Expressions summons tainted souls into the realm of the living. That is because Regular Expressions can only parse what are known as “Regular Languages”, and things like HTML are not regular. Basically everything that involves arbitrarily nested, balanced parentheses (including RegEx itself) is not regular.
Another quick way of summoning Ba’al the soul-eater is the escaping problem. Since many symbols in RegEx are reserved (e.g. “.”, “?”, “*”, etc.), if you want to match one of them you have to escape it with a backslash, so
`\.` matches a literal dot,
`\?` a question mark and
`\\` matches a backslash. The problem comes when using RegEx in a language that requires strings to be escaped as well: you suddenly find yourself using an ever-increasing number of backslashes to escape backslashes, and before you know it you have summoned Ba’al. But other than that, you should now be equipped to write search expressions that look like a cat jumped on the keyboard, and isn’t this the reason why we are doing all of this?
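The escaping problem described above can be made concrete with a short sketch. It assumes Python, where raw string literals (`r"..."`) and the helper `re.escape` from the built-in `re` module keep the backslash count in check; other languages have their own mechanisms. The sample text is invented.

```python
import re

# Invented sample text; it contains two literal backslashes.
text = "Is v1.0 the same as v1x0? See C:\\temp\\notes.txt"

# An unescaped dot matches any character, an escaped dot only a literal dot.
unescaped = re.findall(r"v1.0", text)
escaped = re.findall(r"v1\.0", text)

# To match one backslash the RegEx needs two. In a normal string literal
# each of those must be escaped again, giving four. A raw string avoids
# the second layer of escaping.
normal_literal = "\\\\"
raw_literal = r"\\"
same_pattern = normal_literal == raw_literal
backslashes = re.findall(raw_literal, text)

# re.escape adds the RegEx-level backslashes for us.
literal_question = re.findall(re.escape("v1x0?"), text)

print(unescaped)         # ['v1.0', 'v1x0']
print(escaped)           # ['v1.0']
print(same_pattern)      # True
print(len(backslashes))  # 2
print(literal_question)  # ['v1x0?']
```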