Introduction to natural language processing: rule-based methods, named entity recognition (NER), and text classification
01.02.2019 - Jay M. Patel - Reading time ~3 Minutes
The ability of computers to understand human languages is referred to as Natural Language Processing (NLP). This is a vast field, and practitioners frequently include machine translation and natural language generation (NLG) as part of core NLP. However, in this section we will only look at NLP techniques that aim to extract insights from unstructured text.
Regular expressions (Regex) and rule-based methods
Regular expressions (regex) match patterns against sequences of characters, and they are supported in a wide variety of programming languages. A common use case for regex is extracting email addresses, phone numbers, etc. from a text document. They are also widely used for search and replace in many commonly used programs and text processing applications. There is no “learning” happening per se, and since we are more focused on using machine learning algorithms for NLP, we will not talk about regex and other rule-based methods extensively.
Regex makes your code harder to read and debug, and its performance degrades when the pattern is very complex. For these reasons and more, most coders seem to agree with Jamie Zawinski’s famous quote: “Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.”
Below is a sample of the regex I used to extract email addresses from a large corpus of text.
import re
# Pattern for common email address formats; re.IGNORECASE also matches upper case
regex = re.compile(r"([a-z0-9!#$%&'*+/=?^_`{|.}~-]+@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)", re.IGNORECASE)
sample_text = "Reach me at jay.patel@example.com or admin@example.org."  # stand-in for a large corpus
email_address_set = {x.strip() for x in regex.findall(sample_text)}
Named entity recognition (NER)
Named entity recognition refers to a group of statistical and machine learning methods that extract specific tokens (words or phrases) belonging to specific categories, such as names of persons, companies, and geographical locations, from unstructured text. Essentially, NER consists of two separate tasks: the first is text segmentation (similar to chunking), where a “name” is extracted, and the second is classifying it into predefined categories.
NER models are trained on labeled text where the start and end offsets of each word or phrase are manually annotated with the categories you want the NER model to recognize. The categories can be anything, as long as the text can be labeled with start and end offsets for training purposes.
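Pretrained NER models are also available in open source libraries, so you can try the technique in a few lines. Below is a minimal sketch using spaCy, assuming its small English model (en_core_web_sm) is installed; the sentence is made up purely for illustration.

import spacy

# Load a pretrained English pipeline that includes an NER component
nlp = spacy.load("en_core_web_sm")
doc = nlp("Tim Cook announced that Apple will open a new office in Austin.")

# Each recognized entity exposes its character offsets and predicted category label
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)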
Text classification
A classic example of text classification is classifying a given email as spam or not spam based on the unstructured text of the email itself. Other examples include classifying the topic of a news article as financial, entertainment, sports, etc. Document-level sentiment classification, such as customer reviews on e-commerce sites (Amazon, eBay) or movie reviews on sites such as IMDb, is also an example of text classification.
A first step in applying machine learning algorithms to text classification is generating features from the unstructured text by converting it into numerical vectors. We can use count vectorization, term frequency-inverse document frequency (tf-idf), or word embeddings for this task.
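As a quick illustration, the sketch below vectorizes a tiny made-up corpus with scikit-learn’s tf-idf vectorizer and fits a naive Bayes spam classifier on top of the resulting vectors; treat it as a minimal example rather than a complete pipeline.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus and labels (1 = spam, 0 = not spam), made up for illustration
texts = [
    "win a free prize now",
    "limited time offer, claim your free cash",
    "meeting rescheduled to friday afternoon",
    "please review the attached quarterly report",
]
labels = [1, 1, 0, 0]

# Convert the unstructured text into tf-idf weighted numerical vectors
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Fit a simple classifier on the vectorized features and classify a new email
clf = MultinomialNB()
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["claim your free prize now"])))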