Natural Language Processing: Timeline Extraction with Regexes and spaCy

New text is generated in a mindblowing speed today. Think about news articles, social media messages, reports, e-mails etc. However, we cannot do much with unstructured text. But as soon as we extract some structured information from it, we can use it to run analyses, to make predictions, to inform decisions etc.

In this blog post, we show how to use natural language processing techniques to retrieve information from unstructured text automatically. In a first attempt we use a simple regular expression to extract events. However, this is cumbersome and rather imprecise. Therefore, in a second attempt, we use spaCy and its Named Entity Recognition and dependency parsing features.

But be warned! This artical is quite technical and dives into details of modern deep learning techniques. If you are just interested in the results, you better jump directly to the TL;DR section. You want to try the code yourself? This article is also available as Google Colab Notebook.

The History of Germany

Often we are interested in finding out if some event happend – and when it happend. For example, from the piece of text On March 1st 2020, Joe Biden wins the South Carolina primary. we want to extract the event [2020/03/01] - [Joe Biden wins the South Carolina primary]

This blog post, we try to generate a timeline that should roughly look like the timeline of german history on wikipedia. Here is an excerpt:

DateEvent
1338The prince-electors of the Holy Roman Empire declared in the Declaration of Rhense that the election of the Holy Roman Emperor was not subject to the approval of the pope.
1356The Imperial Diet issued the Golden Bull of 1356, which fixed the offices of the seven prince-electors and established that the Holy Roman Emperor could be elected by a simple majority vote. 1356
1370The Treaty of Stralsund was signed, ending a war between Denmark and the Hanseatic League.
1392The Victual Brothers were hired by the Duchy of Mecklenburg to assist in its fight against Denmark.
1400The period of Meistersinger lyric poets began.
1400The period of Minnesänger singers ended.

Preparation

Dates and date-ranges come in a lot of different formats. In order to compare dates and to pinpoint them on a timeline, we use dataparser. You can install it with the following command:

pip install daterangeparser
RESULT
Collecting daterangeparser
  Downloading https://files.pythonhosted.org/packages/46/0e/67d6a9912b216417d9a4adf2c716ff628a5fc1797fcafd5a7599928622c8/DateRangeParser-1.3.2-py3-none-any.whl
Requirement already satisfied: pyparsing in /usr/local/lib/python3.6/dist-packages (from daterangeparser) (2.4.6)
Installing collected packages: daterangeparser
Successfully installed daterangeparser-1.3.2

We need a couple of imports

import re

import spacy
import requests
import re
import IPython
from daterangeparser import parse

Obtain Data

This notebook uses the wikipedia article on the history of germany as an example, as it contains a lot of dates and events. In order to avoid complexities such as using the wikipedia API or to remove references, htmls tas etc., we use a version that only contains the main content as text. You can find it here: https://github.com/qualicen/timeline/

response = requests.get('https://raw.githubusercontent.com/qualicen/timeline/master/history_of_germany.txt')
text = response.text
print('Loaded {} lines'.format(text.count('\n')))
RESULT:
Loaded 744 lines

First Attempt: Regular Expressions

Our first attempt will be to use regular expressions in order to find events. The difficulty is to come up with a pattern to capture the many ways a date can be written (e.g. “in april 1975” or “in the spring of 1975” etc.). A second difficulty is to identify numbers or expressions where we can be reasonable sure that the number signifies a date and not something else. Look at the following sentence: Charlemagne ended 200 years of Royal Lombard rule with the Siege of Pavia, and in 774 he installed himself as King of the Lombards. There are two numbers (200 and 774) which could be dates, but only one of them is in fact a date.

In fact, this is a quite difficult task to do with regular expressions. We need to define a lot of patterns to capture the most common situations. As an example, we could define a pattern like this: In [DATE], [EVENT]. In the example above, this would capture the following event: in 774 he installed himself as King of the Lombards.

The pattern from above is exactly what the following function extracts. We will apply the function on each line of the text to extract events.

def extract_events_regex(line):
  matches = []
  # capture thee digit and four digit years (1975) and ranges (1975-1976)
  found = re.findall('In (\d\d\d\d?[/\–]?\d?\d?\d?\d?),? ?([^\\.]*)', line)
  try:
    matches = matches + list(map(lambda f: (f[0] if len(f[0])>3 else "0"+f[0] ,f[0],f[1]),found))
  except:
   return []
  return matches
def extract_all_events(text, extract_function):
  all_events = []
  processed = 0
  # Process the events
  for processed,line in enumerate(text.splitlines()):
    events = extract_function(line)
    all_events = all_events + events
    if processed % 100 == 0:
      print('Processed: {}'.format(processed))

  print("Extracted {} events.".format(len(all_events)))

  # Print out the events
  for event in sorted(all_events, key=lambda e: e[0]):
    print("{} - {}".format(event[1],event[2]))

Let’s look at the results: With the simple patterns we were already able to extract 48 events. While this is not too bad there are certainly many events missing. We would now need to go back to our regexes and refine them. Refine the pattern for single dates (maybe we should use a library for that) but also the patterns for sentences that describe an event. You can imagine that is is rather cumbersome.

extract_all_events(text,extract_events_regex)
RESULT:
Processed: 0
Processed: 100
Processed: 200
Processed: 300
Processed: 400
Processed: 500
Processed: 600
Processed: 700
Extracted 48 events.
718 - Charles Martel waged war against the Saxons in support of the Neustrians
743 - his son Carloman in his role as Mayor of the Palace renewed the war against the Saxons, who had allied with and aided the duke Odilo of Bavaria
751 - Pippin III, Mayor of the Palace under the Merovingian king, himself assumed the title of king and was anointed by the Church
936 - Otto I was crowned German king at Aachen, in 961 King of Italy in Pavia and crowned emperor by Pope John XII in Rome in 962
1122 - a temporary reconciliation was reached between Henry V and the Pope with the Concordat of Worms
1137 - the prince-electors turned back to the Hohenstaufen family for a candidate, Conrad III
1180 - Henry the Lion was outlawed, Saxony was divided, and Bavaria was given to Otto of Wittelsbach, who founded the Wittelsbach dynasty, which was to rule Bavaria until 1918
1230 - the Catholic monastic order of the Teutonic Knights launched the Prussian Crusade 
...

Second Attempt: Use Named Entity Recognition and Dependency Parsing

I hope you are convinced that designing regular expressions to capture a wide range of sentences that signify an event would mean a lot of effort. In order to save us the effort we can use a technique from Natural Language Processing that is called Named Entity Recognition (NER). Named Entity Recognition amounts to annotating phrases in a text that are not “regular” words but instead refer to some specific entity in the real word. For example the following text contains named entities such as :

  • the Western Roman Empire,
  • the 5th century, or
  • the Franks

Note especially the second example (the 5th century), these are the kind of named entities we are interested in: dates.

For this article, we use spaCy, a library for performing different kinds of NLP tasks, among them named entity recognition. Let’s load spaCy and perform NER on the sentence from above. We print out all found named entities together with their label, that is the type of named entity.

nlp = spacy.load("en_core_web_sm")
doc = nlp("After the fall of the Western Roman Empire in the 5th century, the Franks, like other post-Roman Western Europeans, emerged as a tribal confederacy in the Middle Rhine-Weser region, among the territory soon to be called Austrasia (the 'eastern land'), the northeastern portion of the future Kingdom of the Merovingian Franks.")
for ent in doc.ents:
  print("{} -> {}".format(ent.text,ent.label_))
RESULT:
the Western Roman Empire -> LOC
the 5th century -> DATE
Franks -> PERSON
Western Europeans -> NORP
the Middle Rhine-Weser -> LOC
Austrasia -> GPE
Kingdom -> GPE
the Merovingian Franks -> ORG

If we apply NER to our original text we see that there is a whole range of dates in many different formats.

doc = nlp(text)
for ent in filter(lambda e: e.label_=='DATE',doc.ents):
  print(ent.text)
RESULT:
1907
at least 600,000 years
Between 1994 and 1998
1856
around 40,000 years old
1858
1864
1988
3rd century
the 1st century
...

At this point we have to extract the actual events from the named entities relating to dates that we found. We could go the easy route and simply take the whole sentence containting the date. I want to make it a bit more interesting and extract only the core of the sentence that describes the event. For this we use another NLP technique called dependency parsing. This techniques reconstructs the grammatical structure from a text.

Let’s look at an example to see how it works. You see that the words are connected via arcs. Earch arc posseses a label that describes the relation between the words. Unluckily, the labels of the arcs are not shown in the colab version. Together, the arcs form a tree. In the example, the root of the tree is at the word “emerged”.

doc = nlp("After the fall of the Western Roman Empire in the 5th century, the Franks, like other post-Roman Western Europeans, emerged as a tribal confederacy in the Middle Rhine-Weser region, among the territory soon to be called Austrasia (the 'eastern land'), the northeastern portion of the future Kingdom of the Merovingian Franks.")
IPython.display.HTML(spacy.displacy.render(doc,style="dep", page=True, options={"compact":True}))

We will use the dependency structure to extract only a part of the text. The idea is to start from the detected date and walk up the tree until we are at the root (there may be more than one root, if the current line contains more sentences). When we are at the root we select a couple of subtrees and build a new text from that. Often, the root will be a verb. In this case we will e.g. select the subject and the object and use this as a summarization of the sentence. It is a rather simple approach but is good enough to get the idea.

def dep_subtree(token, dep):
  deps =[child.dep_ for child in token.children]
  child=next(filter(lambda c: c.dep_==dep, token.children), None)
  if child != None:
    return " ".join([c.text for c in child.subtree])
  else:
    return ""

# to remove citations, e.g. "[91]" as this makes problems with spaCy
p = re.compile(r'\[\d+\]')

def extract_events_spacy(line):
  line=p.sub('', line)
  events = []
  doc = nlp(line)
  for ent in filter(lambda e: e.label_=='DATE',doc.ents):
    try:
      start,end = parse(ent.text)
    except:
      # could not parse the dates, hence ignore it
      continue
    current = ent.root
    while current.dep_ != "ROOT":
      current = current.head
    desc = " ".join(filter(None,[
                                 dep_subtree(current,"nsubj"),
                                 dep_subtree(current,"nsubjpass"),
                                 dep_subtree(current,"auxpass"),
                                 dep_subtree(current,"amod"),
                                 dep_subtree(current,"det"),
                                 current.text, 
                                 dep_subtree(current,"acl"),
                                 dep_subtree(current,"dobj"),
                                 dep_subtree(current,"attr"),
                                 dep_subtree(current,"advmod")]))
    events = events + [(start,ent.text,desc)]
  return events

Let’s look at an example to see how this works out:

extract_events_spacy("The Protestant Reformation was the first successful challenge to the Catholic Church and began in 1521 as Luther was outlawed at the Diet of Worms after his refusal to repent. ")
RESULT:
[(datetime.datetime(1521, 1, 1, 0, 0),
  '1521',
  'The Protestant Reformation was the first successful challenge to the Catholic Church')]

All we have to do to get our (hopefully) improved timeline is use our new extraction function on the whole text

extract_all_events(text,extract_events_spacy)
RESULT:
Processed: 0
Processed: 100
Processed: 200
Processed: 300
Processed: 400
Processed: 500
Processed: 600
Processed: 700
Extracted 446 events.
1027–1125 - The Salian emperors ( reigned 1027–1125 ) retained the stem duchies
1039 to 1056 - the empire supported the Cluniac reforms of the Church , the Peace of God , prohibition of simony ( the purchase of clerical offices )
1056 - Total population estimates of the German territories range around 5 to 6 million
1077 - The subsequent conflict in which emperor Henry IV was compelled to submit to the Pope at Canossa in 1077 , after having been excommunicated came
1111 - Henry V ( 1086–1125 ) , great - grandson of Conrad II , who had overthrown his father Henry IV became Holy Roman Emperor
1122 - a temporary reconciliation was reached
1135 - Emperor Lothair II re - established sovereignty
1137 - the prince - electors turned back to the Hohenstaufen family
1155 to 1190 - Friedrich Barbarossa was Holy Roman Emperor
...

TL;DR

Extracting events from text, such as news articles, socical media content, reports, etc. is useful to gather data in order to run analyses, make predictions or inform decisions. There are sophisticated techniques around, but you can get some good results with simple techniques, too. This article looked at two methods to extract a timeline from a text. We started with a simple approach based on regular expressions. We saw that fine-tuning the regular expressions is cumbersome and therefore we swithed to an approach using named entity recognition (NER). With NER we can recognize different types of dates out of the box. We then used dependency parsing to build an extract of the event. Here is how the results looks like:

DateEvent
1027–1125The Salian emperors ( reigned 1027–1125 ) retained the stem duchies
1039 to 1056the empire supported the Cluniac reforms of the Church , the Peace of God , prohibition of simony ( the purchase of clerical offices )
1056Total population estimates of the German territories range around 5 to 6 million