Natural language processing (NLP): text vectorization and bag of words approach

04.02.2019 - Jay M. Patel - Reading time ~4 Minutes

Let us consider a simple task of classifying a given email as spam or not spam. As you might have already guessed, this is an example of a simple binary classification problem, with target or label values of 0 or 1, within the larger field of supervised machine learning (ML).

However, we still need to convert the non-numerical text contained in emails into numerical features that can be fed into an ML algorithm of our choice. This process is known as “text vectorization”, and there are many ways of doing it, from the simple count-based vectorization discussed here, to more sophisticated term frequency-inverse document frequency (tf-idf) weighting, to word embedding and neural network methods such as GloVe, word2vec, etc.

Introduction to bag of words based text vectorization

In this model, any text is represented as a bag, or multiset, of its words (known as “tokens”), disregarding grammar, punctuation, and word order but keeping the multiplicity of individual words/tokens. Splitting a sentence into its constituent words or tokens is referred to as “tokenization”.

As an example, let us consider the two sentences below.

Sam likes to watch baseball. Christine likes baseball too.

Let us remove punctuation and tokenize it.

"Sam","likes","to","watch","baseball","Christine","likes","baseball","too"

In a bag of words representation, we can convert the list above into a dictionary, with keys being the words, and values being the number of occurrences.

BoW_dict = {"Sam":1,"likes":2,"to":1,"watch":1,"baseball":2,"Christine":1,"too":1}

For our purposes, we are more interested in getting a list of “word frequencies” from the above dictionary, which serves as a sparse vector representation of the underlying sentence, with the number of dimensions equal to the total number of unique words in the corpus. As you might have already noticed, this term frequency vector does not preserve the order of the words in the sentence. However, for many text classification tasks the bag of words model works quite satisfactorily.

vect = [1, 2, 1, 1, 2, 1, 1]
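To make this concrete in code, here is a minimal sketch that reproduces the dictionary and the frequency vector using Python’s built-in collections.Counter; the variable names are my own:

from collections import Counter

sentence = "Sam likes to watch baseball. Christine likes baseball too."

# strip the periods and split on whitespace to get tokens
tokens = sentence.replace(".", "").split()

# Counter gives us the bag of words dictionary directly
bow_dict = Counter(tokens)

# word frequency vector, ordered by each token's first appearance
vect = [bow_dict[token] for token in dict.fromkeys(tokens)]
print(vect)  # [1, 2, 1, 1, 2, 1, 1]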

Count vectorization with scikit-learn

Let us take the block of text below from a random page on OpenLearn.

text = '''As a psychologist and coach, I have long been interested in the power of stories to convey important lessons. Some of my earliest recollections were fables, stories involving animals as protagonists. Fables are an oral story-telling tradition typically involving animals, occasionally humans or forces of nature interacting with each other. The best-known fables are attributed to the ancient Greek storyteller Aesop, although there are fables found in almost all human societies and oral traditions making this an almost universal feature of human culture. A more modern example of an allegorical tale involving animals is George Orwell’s Animal Farm, in which animals rebel against their human masters only to recreate the oppressive system humans had subjected them to. Fables work by casting animals as characters with minds that are recognisably human. This superimposition of human minds and animals serves as a medium for transmitting and teaching moral and ethical values, in particular to children. It does this by drawing on the symbolic connection between nature and society and the dilemmas that this connection can be used to represent. Fables have survived into the modern age, although their role seems to have waned under the emergence of other storytelling media. Animals are, thus, still a feature of modern day stories (e.g. David Attenborough’s wildlife programmes, stories about animals in the news, and CGI-heavy children’s cartoons involving cute animal characters). Our more immediate and visceral connection with nature and with the different type of animal species seems to have been reduced to being observers rather than participants. Wild animals in ancient times were revered as avatars of ancient gods but also viewed with fear. By and large, our modern position as observer in relation to animal stories is that we remotely view animals in peril from climate change on television and social media news coverage, but we may remain disconnected to the fundamental moral lessons their stories could teach us today.'''

Let us remove punctuation from the text using a regex-based preprocessing function.

import re

def preprocessor(text):
    # strip any HTML tags
    text = re.sub(r'<[^>]*>', '', text)
    # capture emoticons such as :), :( or :-D before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # lowercase, replace all non-word characters with spaces,
    # then append the emoticons (with the hyphen "nose" removed)
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text
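Assuming the variable text holds the block quoted above, applying the function is a one-liner (a small usage sketch):

text = preprocessor(text)
print(text)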

After preprocessing, the text looks as shown below:

Figure 1: Text after applying preprocessing function

Now, this is ready to be fed into scikit-learn’s CountVectorizer, which will convert it into a bag of words.

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

cv = CountVectorizer(analyzer='word')
#cv.get_feature_names_out()  # lists the vocabulary once the vectorizer has been fitted

You can see the sparse vectors intuitively by fitting the vectorizer and converting the resulting document-term matrix into a pandas DataFrame.

dtm = cv.fit_transform([text])  # fit learns the vocabulary, transform builds the document-term matrix
pd.DataFrame(dtm.toarray(), columns=cv.get_feature_names_out())

Figure 2: pandas dataframe from sparse bag of words vectors
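To make the “sparse” part concrete, here is a short sketch that vectorizes two documents at once (the second document is invented for illustration). Each row is one document, the columns span every unique word in the corpus, and most entries are zero:

docs = ["Sam likes to watch baseball. Christine likes baseball too.",
        "Christine also likes to watch football."]

cv2 = CountVectorizer(analyzer='word')
dtm2 = cv2.fit_transform(docs)  # 2 x V sparse matrix, V = number of unique words in the corpus
pd.DataFrame(dtm2.toarray(), columns=cv2.get_feature_names_out())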

You can directly use the vectors created above for text classification. However, there are some disadvantages to using these raw count vectors, and in the next post we will talk about term frequency-inverse document frequency (tf-idf) and discuss more robust approaches and preprocessing techniques we can use instead.
