Into NLP 4 ~ Normal Language Perfection – Text Normalization
In my last article I talked about the benefits of tokenization for text processing: essentially, we make processing text less awkward by separating the tokens / words from the whitespace around them. In this article we continue cleaning up after natural languages. This time we tackle the contents of the words themselves.
Many languages have different grammatical forms of the same word. These forms are used to indicate things like the time an event happened, the order in which events happened, who is participating or passively standing by, and many more things.
Take for example the sentence “The lion hunts the gazelle.” and let’s focus on the word “hunt”. Even the active form “hunts” is already modified from the base form “hunt”.
Similarly one can have:
– “The lion will hunt the gazelle.”
– “The lion was hunting the gazelle”, or
– “The gazelle was hunted by the lion”
These modifications are very useful for communication, since it makes a huge difference when the hunting takes place and who is being hunted (especially if you are a gazelle). But for those of us who have to process text containing these modifications, they are super annoying, since our search for the word “hunt” suddenly got a lot more complicated. If you have to write a program that finds all instances of a lion hunting a gazelle, good luck with your regex search.
Even more so since nothing in English is ever regular, and we can have irregular verb forms like “be/was/been” or “run/ran/run”.
And it is not just verbs that get messy: nouns can differ in number (“lion” vs. “lions”), and adjectives can sometimes differ based on gender (“blond” vs. “blonde”). And let’s not even start with other languages (here is a video on why we don’t want to do that)… The question is: how do we deal with all of this mess?
Stemming is the first method one can use when the words get too annoying. A stemming algorithm tries to identify the endings of a word and simply chop them off. This can be done using some simple rules or a lookup table. In English you can have a list of common endings like “-ed”, “-ing”, “-s”, and so on, and then task an algorithm with “search-and-destroy”. A common algorithm for this task is the “Porter Stemmer”.
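To make the idea concrete, here is a toy suffix-stripping stemmer in Python. It is a heavily simplified sketch of the rule-based approach described above, not the actual Porter Stemmer (which applies its rules in ordered phases with extra conditions):

```python
# Toy rule-based stemmer: chop off a few common English endings.
# This is NOT the real Porter Stemmer, just an illustration of the idea.

SUFFIXES = ("ing", "ed", "s")  # checked longest-first

def toy_stem(word: str) -> str:
    for suffix in SUFFIXES:
        # keep at least three characters, so e.g. "is" does not become "i"
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("hunts", "hunted", "hunting", "arguing", "went"):
    print(w, "->", toy_stem(w))
# "hunts", "hunted" and "hunting" all map to "hunt",
# "arguing" becomes "argu", and the irregular "went" stays "went".
```

Note how all the regular forms of “hunt” now collapse onto the same stem, which is exactly what makes searching easier.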
Rule-based stemming can do the trick in many cases, but you should be aware of its problems and quirks.
The first being irregular forms: Take the word “go”, which for some reason turns into “went” in past forms (rather than the much more sensible “goed”). Since the difference here is more than just the ending, those two will not get mapped to the same stem.
Additionally, some stems are not what you would expect. Take “to argue”: after going through the language-guillotine it will be reduced to “argu”, which looks wrong. It is, however, also the result you would get with “arguing” and “argued”, so just be aware that stems can look… off…
Lemmatization is a bit more involved. Unlike a stemming algorithm, a lemmatizer tries to find the lemma (sometimes also called the “dictionary form”) of the word. This usually involves determining the part-of-speech (i.e. the word-type: whether it is a verb, a noun, etc.; how this is done will be the topic of next time) and also handles special cases for irregular forms. So “went” does get mapped to “go”. The result is usually a bit easier to work with, since lemmas are actual words (so no “argu”). But be aware that lemmatizers can sometimes struggle with uncommon words: they tend to rely much more on lookup tables, and if the word is not in the table, we won’t get a good result.
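A minimal sketch of the lookup-table idea, assuming a tiny hand-made table of irregular forms (real lemmatizers use part-of-speech information and much larger resources):

```python
# Toy lemmatizer: look up irregular forms first, then fall back to
# stripping a regular ending. The table below is a hypothetical
# mini-example for illustration only.

IRREGULARS = {
    "went": "go",
    "was": "be",
    "ran": "run",
}

def toy_lemmatize(word: str) -> str:
    w = word.lower()
    if w in IRREGULARS:
        return w and IRREGULARS[w]
    # fall back to a couple of regular rules
    if w.endswith("ing") and len(w) > 5:
        return w[:-3]              # hunting -> hunt
    if w.endswith("s") and not w.endswith("ss"):
        return w[:-1]              # lions -> lion
    return w

print(toy_lemmatize("went"))   # -> go   (handled, unlike with a stemmer)
print(toy_lemmatize("lions"))  # -> lion
```

The weakness mentioned above is visible here too: any irregular word missing from the table silently falls through to the crude fallback rules.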
Morphological parsing is the third and most complex method, since it involves actually identifying the form of the word. So a word like “lions” will be parsed to something like “lion <plural-form affix>”. This can be extremely useful if you need more information about the contents and the structure of a sentence. However, getting this information is, as you might have guessed, no easy task. In practice it is often easier to just stick to a lemmatizer and use other techniques like dependency parsing (which will be a topic for the article after part-of-speech tagging) for structural information. But maybe you’ll use it one day. It’s not bad to have some options…
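A very rough sketch of what a morphological parser produces, splitting a word into a base form plus an affix tag (the tags and the irregular lookup are hypothetical; real parsers cover far more affixes and use part-of-speech information):

```python
# Toy morphological parser: return (base form, affix tag),
# e.g. "lions" -> ("lion", "<plural-or-3rd-person>").

IRREGULARS = {"went": ("go", "<past>"), "was": ("be", "<past>")}

def toy_parse(word: str) -> tuple:
    w = word.lower()
    if w in IRREGULARS:
        return IRREGULARS[w]
    if w.endswith("ing"):
        return (w[:-3], "<progressive>")
    if w.endswith("ed"):
        return (w[:-2], "<past>")
    if w.endswith("s"):
        return (w[:-1], "<plural-or-3rd-person>")
    return (w, "<base>")

print(toy_parse("lions"))    # ('lion', '<plural-or-3rd-person>')
print(toy_parse("hunting"))  # ('hunt', '<progressive>')
```

Unlike a stem or lemma, the parse keeps the grammatical information around instead of throwing it away.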
Bringing it together
Let’s think back to our lion hunting the gazelle. As we have seen, dealing with all the different variants of every word can be really frustrating and error-prone, resulting in unnecessarily complex expressions, bugs and other annoyances. But luckily normalization can help us out. With stemming, lemmatization, or even morphological parsing we can drastically reduce the number of variants of every single word, making our job a lot easier.
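To close the loop on the regex frustration from earlier, here is a sketch of the payoff: normalize every token first, and the search for all forms of “hunt” becomes a plain equality check (a toy suffix-stripper stands in for the normalizer; any stemmer or lemmatizer would do):

```python
import re

# Toy stemmer standing in for any normalizer.
def toy_stem(word: str) -> str:
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

text = ("The lion hunts the gazelle. The lion was hunting the gazelle. "
        "The gazelle was hunted by the lion.")

# Tokenize crudely, normalize, then match: no regex over all the
# surface forms of "hunt" needed.
tokens = re.findall(r"[A-Za-z]+", text.lower())
hits = sum(1 for t in tokens if toy_stem(t) == "hunt")
print(hits)  # -> 3: "hunts", "hunting" and "hunted" are all found
```

The same trick works for search indexes: store the normalized forms once, and every query only needs to be normalized the same way.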