Into NLP 3 ~ Numerous Language Pieces – Tokenization

Previously we started our journey into the world of Natural Language Processing by learning the basics of Regular Expressions (RegEx) and fuzzy matching. 

While RegEx is certainly a powerful tool, it has its problems. As I wrote, Regular Expressions have a tendency to devolve quickly into alphabet soup. Take this example:

\b(T|t)he\b\s+\blion\b\s+\broars\b

The goal of this expression is simple: match the phrase “the lion roars”. There is a bit of additional flourish to allow a capitalized “The” and arbitrary whitespace between the words, but we already have something that will make someone stop in their tracks when trying to read it. Half of the expression is used just to check word boundaries. While there are more efficient ways of doing this, in general you can’t get around dealing with words at some level. The word is the base unit for many applications. In professional settings there may even be norms and guidelines for how many words a sentence or a document may contain.

But language is messy, and Regular Expressions on their own are often not enough to cope with that. That’s why today we will get a handle on a language processing tool that will make dealing with something like the example above a lot easier: the tokenizer.

So, what is a tokenizer?

A tokenizer is an algorithm that turns a piece of text into a list of tokens.  

Okay, this definition isn’t all that useful, since now we have to answer the next question: what is a token?

Well… generally a token is a piece of the text, usually a word or a punctuation mark. Essentially, if we have the text “The lion roars.” and pass it through a tokenizer, what we get out is something like ["The", "lion", "roars", "."].

The benefit of having the text in such a form is obvious: We no longer have to deal with stuff like word boundaries and spaces when matching. Our alphabet-soup expression from before can simply be broken apart into three easy expressions:

  • (T|t)he
  • lion
  • roars

It doesn’t get much easier than this. So tokenizers are quite handy to have, but we should be aware of some pitfalls. Before we get to those, here is a rough sketch of what token-level matching can look like:
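This is only a toy illustration: the regex tokenizer and the matches_phrase helper below are hypothetical stand-ins, not a real library, but they show the idea of checking one small expression per token instead of one big expression for the whole phrase.

import re

def tokenize(text):
    # A deliberately simple tokenizer: runs of word characters and single
    # punctuation marks each become a token.
    return re.findall(r"\w+|[^\w\s]", text)

def matches_phrase(tokens, patterns):
    # Slide a window over the tokens and match each pattern against one token.
    for i in range(len(tokens) - len(patterns) + 1):
        window = tokens[i:i + len(patterns)]
        if all(re.fullmatch(p, t) for p, t in zip(patterns, window)):
            return True
    return False

tokens = tokenize("Suddenly the lion roars.")
print(tokens)  # ['Suddenly', 'the', 'lion', 'roars', '.']
print(matches_phrase(tokens, ["(T|t)he", "lion", "roars"]))  # True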

String.split – “Is this a tokenizer?”

When I showed the example above, you may have excitedly jumped up, pointed at your screen, and yelled, “That is just String.split!”. Well, calm down, you are right… kind of…

Using String.split with some special cases for punctuation will do the trick in a pinch. Splitting on \s (“spaces”) or even \b (“word boundaries”) gets you 80–90% of the way. But this may not be enough. Tokenizers can do more than that, and if you use them right they might save you a lot of work. For this we have to get into the details of what constitutes a token, and that is a large and not very tasty can of worms. So, in order to save this article from the depths of linguistic minutiae, let’s look at some examples. First, here is roughly where a plain split leaves you:
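A quick comparison in Python, with str.split standing in for String.split (splitting on a zero-width pattern like \b needs Python 3.7 or newer):

import re

text = "The lion roars. Doesn't it?"

# Plain whitespace split: punctuation stays glued to the words.
print(text.split())
# ['The', 'lion', 'roars.', "Doesn't", 'it?']

# Splitting on word boundaries gets closer, but leaves whitespace fragments
# behind and still doesn't know what to do with the contraction.
print(re.split(r"\b", text))
# ['', 'The', ' ', 'lion', ' ', 'roars', '. ', 'Doesn', "'", 't', ' ', 'it', '?']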

Contractions

In English (especially colloquial English), words are commonly contracted:

– “does not” becomes “doesn’t”

– “I will” becomes “I’ll”

– “that is” becomes “that’s”  

We usually want a tokenizer that splits these back into two tokens, so you’d get results like 

  • ["does", "n't"]
  • ["I", "'ll"]
  • ["that", "'s"]

This can save you a lot of special cases: imagine you are looking for negations in your text. Normally you have to search for “no”, “not”, and of course all contractions involving negations: “doesn’t”, “isn’t”, etc. Again we end up with long match expressions. If we tokenize this way, we suddenly only have to match “no”, “not”, and “n’t” and we are done. This makes our life much easier than using plain Regular Expressions.
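You don’t have to write these splitting rules yourself; existing tokenizers already handle them. For instance, a quick check with NLTK’s Treebank tokenizer (assuming NLTK is installed) produces exactly the splits above:

from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

# The Treebank rules split common English contractions into two tokens.
for text in ["doesn't", "I'll", "that's"]:
    print(tokenizer.tokenize(text))
# ['does', "n't"]
# ['I', "'ll"]
# ['that', "'s"]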

Hyphens

Sometimes words are joined in another way: by adding a hyphen. Phrases like “cost-effective” or “non-hyphenated” are examples of this. Again, it might save you some trouble to split them into two (or better, three, if you keep the hyphen) tokens.
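Which way to go is a design decision; with a simple regex tokenizer (the same kind of toy stand-in as before), both choices are one pattern away:

import re

text = "A cost-effective, non-hyphenated solution."

# Keep hyphenated words together as a single token:
print(re.findall(r"\w+(?:-\w+)*|[^\w\s]", text))
# ['A', 'cost-effective', ',', 'non-hyphenated', 'solution', '.']

# Or split at the hyphen and keep it as a token of its own:
print(re.findall(r"\w+|[^\w\s]", text))
# ['A', 'cost', '-', 'effective', ',', 'non', '-', 'hyphenated', 'solution', '.']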

Abbreviations

Let’s look at some abbreviations, e.g. i.e., max., etc.  

Depending on your tokenizer these could be interpreted differently. Take “i.e.”: there are three common ways of handling it:

  • Four tokens “i”, “.”, “e”, “.”
    The simplest outcome. This is what a simple “split-based” tokenizer might return; it makes the fewest assumptions about the text. It works… but I think there are better options.
  • Two tokens “i.”, “e.”
    Technically “i.e.” is two abbreviations: “id est” (Latin for “that is”). Splitting it this way, the tokenizer recognizes the dots as markers of abbreviation, as opposed to sentence-ending dots.
  • One token “i.e.”
    The tokenizer treats “i.e.” as a single token. This is usually my preferred method. I think of “i.e.” as a single unit: both “i.” and “e.” simply aren’t all that useful on their own, and it makes writing match queries a lot easier.
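Which of these you get depends on your tokenizer’s rules, and good libraries let you add your own. As a sketch with spaCy (assuming it is installed; the “max.” special case is purely illustrative):

import spacy
from spacy.symbols import ORTH

# A blank English pipeline still ships with the English tokenization rules.
nlp = spacy.blank("en")

# Register "max." as a single token so the trailing dot is not split off.
nlp.tokenizer.add_special_case("max.", [{ORTH: "max."}])

doc = nlp("Set it to the max. value, i.e. one hundred.")
print([t.text for t in doc])
# With the default English exceptions, "i.e." should stay one token,
# and thanks to the special case, "max." should too.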

Other language weirdness

Language is messy, and there are many more edge cases that might be useful to know. Here are some examples of phrases that you might want to keep in mind when thinking about tokenization:

  • “New York” might be just one token, since it is a name (one way to get there is sketched after this list)
  • “$15.00” could be anything from one to four tokens
  • “123,561” again, this could be one token or three
  • “I’m-a” or “Imma” is a contraction of “I am going to”, so should it be four tokens?
  • “¯\_(ツ)_/¯” Have fun getting this or other emoticons through your split method 😀
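For the “New York” case, one common approach is to tokenize normally and merge afterwards, for example based on a named-entity recognizer or a phrase list. A sketch with spaCy’s retokenizer (assuming spaCy is installed, with the span hard-coded purely for illustration):

import spacy

nlp = spacy.blank("en")
doc = nlp("I love New York.")

# Merge the two tokens "New" and "York" into a single token after the fact.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[2:4])

print([t.text for t in doc])
# ['I', 'love', 'New York', '.']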

Okay, now what?

So let’s say we have figured it out. We know what we want to count as a token and have dealt with all the edge cases. What now?

Well, there is a second property a tokenizer should have: bi-directionality.

That is, for every token you want to know its position in the original text. This is important if you want to do any kind of visualization or modification of the original text, and it is yet another thing to keep in mind before just reaching for good ol’ String.split. But once you have figured all of this out (or, more likely, once you have found a library that meets your requirements), you are good to go.
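A minimal way to keep that mapping is to record character offsets alongside each token; the regex tokenizer below is the same kind of toy stand-in as before:

import re

def tokenize_with_spans(text):
    # Return (token, start, end) triples so every token can be mapped
    # back to its exact position in the source text.
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]

text = "The lion roars."
for token, start, end in tokenize_with_spans(text):
    print(token, start, end, text[start:end])
# The 0 3 The
# lion 4 8 lion
# roars 9 14 roars
# . 14 15 .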

Connecting the Dots

To wrap up, we have now seen what a tokenizer can do and how it can help us. But there is one more point I would like to end with: tokenization builds the foundation for any NLP pipeline. No matter what you want to do: POS-Tagging, Dependency Parsing, Sentiment Analysis, even Deep Learning / Deep NLP solutions usually start by running a tokenizer and then using the tokens as their input. It forms the basis for everything that follows. It is the linchpin of almost all other NLP tools. And in the next few months we will explore many of the tools utilizing the humble tokenizer.
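You can see this directly in library interfaces: a POS tagger, for example, typically takes a list of tokens rather than raw text. A quick sketch with NLTK (assuming the library and its tagger data are installed):

from nltk import pos_tag
from nltk.tokenize import TreebankWordTokenizer

# The tagger's input is the token list, not the raw string.
tokens = TreebankWordTokenizer().tokenize("The lion roars.")
print(pos_tag(tokens))
# Something like [('The', 'DT'), ('lion', 'NN'), ('roars', 'VBZ'), ('.', '.')]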

For those of you who still want to find out more about the wonderful world of “what-counts-as-a-word” edge cases, I can highly recommend this video by Tom Scott.