Natural language processing (NLP): word embeddings, word2vec, GloVe based text vectorization in Python

08.02.2019 - Jay M. Patel - Reading time ~8 Minutes

In the previous posts, we looked at count vectorization and term frequency-inverse document frequency (tf-idf) to convert a given text document into vectors which can be used as features for text classification tasks such as classifying emails into spam or not spam.

The major disadvantages of such an approach are that the vectors generated are sparse (mostly made of zeros) and very high-dimensional (same dimensionality as the number of words in the vocabulary). They also miss out on learning the structure of the sentences, since bag of words isn’t order preserving. To alleviate this problem, we often have to resort to n-grams while generating features.
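
A small sketch of the order problem, using only the standard library. The two sentences and the helper name `ngram_counts` are made up for illustration: the unigram counts of both sentences are identical, while the bigram counts tell them apart.

```python
from collections import Counter

def ngram_counts(sentence, n=1):
    """Count the n-grams (as tuples of tokens) in a whitespace-tokenized sentence."""
    tokens = sentence.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

a = "dog bites man"
b = "man bites dog"

# Bag of words (unigrams): the order is lost, so both sentences look the same.
same_unigrams = ngram_counts(a, 1) == ngram_counts(b, 1)

# Bigrams keep some local order, so the two sentences now differ.
same_bigrams = ngram_counts(a, 2) == ngram_counts(b, 2)

print(same_unigrams, same_bigrams)
```

The price of n-grams is an even larger and sparser feature space, which is part of the motivation for dense vectors.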

A much better approach is to convert words into dense vectors (50-1000 dimensions), which are learned by training on a large corpus of text.

If you have a large amount of training data, then you can simply train the word embeddings on it as part of training the classifier, using a deep learning library such as Keras or TensorFlow.

However, if you don’t have a large corpus of training data, then you’ll usually get higher accuracy and reduce overfitting if you use precomputed vectors such as GloVe or word2vec, which already have word vectors calculated for a large number of words. You’ll have to handle words in your training data that aren’t found in the precomputed word vectors, for example by initializing their vectors to zero.

Let us first explore the word vectors called GloVe, and in the next section write some code to extract GloVe vectors for use as features in text classification using neural networks.
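
The zero-initialization fallback can be sketched as follows. This is a minimal illustration, not a fixed recipe: the `pretrained` dict is a toy stand-in for real GloVe vectors, and `build_embedding_matrix` is a hypothetical helper name.

```python
import numpy as np

def build_embedding_matrix(vocabulary, pretrained, dim):
    """Return a (len(vocabulary), dim) matrix; rows of unknown words stay zero."""
    matrix = np.zeros((len(vocabulary), dim), dtype="float32")
    for row, word in enumerate(vocabulary):
        vector = pretrained.get(word)
        if vector is not None:
            matrix[row] = vector
    return matrix

# Toy stand-in for precomputed vectors (made-up 3-dimensional values).
pretrained = {
    "spam": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "email": np.array([0.4, 0.5, 0.6], dtype="float32"),
}
vocabulary = ["spam", "email", "xyzzy"]  # "xyzzy" is not in the pretrained vectors

embedding_matrix = build_embedding_matrix(vocabulary, pretrained, dim=3)
print(embedding_matrix)
```

A matrix of this shape, one row per vocabulary word, is what an embedding layer in a neural network is typically initialized with.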

Explore word similarities with pretrained GloVe vectors using Gensim

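from gensim.test.utils import get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

# Path to the downloaded GloVe file; a plain path is enough here
# (datapath() is meant for gensim's bundled test data).
glove_file = r'C:\Users\Jay M. Patel\Downloads\glove.6B\glove.6B.300d.txt'
# Convert the GloVe text format to the word2vec text format.
tmp_file = get_tmpfile("test_word2vec.txt")
_ = glove2word2vec(glove_file, tmp_file)

# Load the converted vectors for keyed lookup and similarity queries.
model = KeyedVectors.load_word2vec_format(tmp_file)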

# Access vectors for specific words with a keyed lookup:
vector = model['difficult']
vector

Out:
array([-4.0606e-02,  1.4324e-01,  1.9288e-01, -1.0436e-01, -5.6030e-02,
       -3.0935e-01,  1.5655e-01,  2.3725e-01, -5.3723e-02, -2.1004e+00,
        5.9859e-02, -3.3431e-01, -1.3793e-01,  1.0236e-01, -3.0622e-01,
       -7.6803e-01, -3.1541e-01, -3.1179e-02,  1.6535e-01,  1.5336e-02,
        1.8997e-02,  1.0863e-01,  3.8125e-01,  1.4823e-01, -5.4693e-01,
        4.4363e-02,  1.4309e-01, -1.5470e-01, -1.4463e-01,  5.0835e-01,
        1.7206e-01,  9.9587e-02, -4.2409e-01, -2.0810e-01, -1.0157e+00,
        3.4243e-01, -1.9691e-01,  6.5321e-03, -6.8339e-01, -1.6867e-01,
        2.0538e-01, -4.9454e-02,  4.7625e-01, -3.1684e-01,  5.0577e-02,
        2.2495e-01, -2.3178e-01,  4.9180e-02, -2.1754e-01,  8.3796e-02,
        3.6140e-01,  1.9740e-01,  2.2693e-01, -3.1942e-01, -2.9624e-01,
        2.7698e-01, -4.2097e-01, -3.9311e-02,  8.4818e-03,  5.1844e-01,
        1.1367e-01, -3.3604e-01, -1.0600e-01,  2.5175e-01,  3.4666e-02,
       -1.1251e-01,  1.0034e-01,  2.3927e-01,  9.4413e-02,  2.7226e-02,
       -5.8804e-01,  2.3139e-01, -5.2413e-01,  3.2389e-01,  2.1242e-01,
       -6.1664e-01, -1.0549e-01,  1.7353e-01,  2.1278e-01, -2.5790e-01,
       -2.8885e-01, -1.4372e-01,  3.8583e-01, -3.4501e-01, -3.3514e-01,
        3.4023e-01,  2.0493e-01,  3.9131e-01, -3.3880e-01,  6.3694e-01,
       -2.8986e-01,  4.5282e-01, -5.3680e-01, -2.1792e-01,  2.2253e-01,
       -1.0863e-01, -2.4295e-01, -2.3874e-01,  1.4042e-02, -3.2659e-01,
       -2.9056e-01,  2.1324e-01, -1.8131e-01,  1.6505e-02,  3.8621e-02,
        1.0543e-01,  2.2498e-01, -8.8775e-02,  1.6820e-01,  2.1441e-01,
        4.6674e-02,  1.3272e-02, -1.7832e-01, -2.8319e-01,  1.7978e-01,
       -4.9054e-01,  1.8272e-01, -4.4948e-01,  2.5928e-01,  3.8793e-01,
        1.2461e-03, -8.0123e-02,  3.5682e-01,  7.2455e-02,  2.4277e-02,
        2.7593e-01,  7.8571e-02,  1.7030e-01, -7.3869e-02, -4.8485e-01,
        6.6072e-01,  4.6059e-01, -8.2610e-02,  2.4914e-01,  6.7816e-02,
       -1.1211e-01, -3.3612e-01, -1.1851e-01,  4.8579e-02,  2.9305e-02,
       -6.0241e-03,  4.8158e-02, -1.6371e-01,  2.8233e-01, -5.9021e-01,
       -3.3825e-01,  4.8388e-02, -1.3091e-01, -2.1689e-02,  1.7659e-01,
       -5.3690e-01, -4.6085e-01, -8.9463e-02, -3.0241e-01,  2.6071e-01,
       -2.3258e-02, -1.0427e-01, -1.0499e-01, -5.8492e-02,  3.8111e-01,
       -6.4881e-02, -3.5313e-01,  1.9272e-01,  7.1384e-02, -4.2484e-02,
        1.1687e-01, -6.7063e-02, -4.5600e-01, -2.7347e-01,  2.7672e-01,
        3.0203e-01,  2.3025e-01, -4.4759e-01, -1.7419e-01, -1.6792e-01,
       -6.5255e-02, -2.3110e-01,  2.9205e-02, -2.4940e-01,  3.3786e-01,
        3.2858e-01,  2.2003e-01, -3.1040e-01,  1.7974e-01, -1.0873e-01,
        1.4175e-01,  1.8542e-01, -2.2056e-01, -1.8736e-01,  4.1422e-01,
       -2.2709e-01,  1.5384e-01,  7.2828e-01, -1.9905e-01,  4.9462e-02,
       -3.2603e-01, -2.2826e-01,  1.2393e-01,  1.7079e-01, -3.9508e-01,
        6.5985e-01,  2.1349e-01,  4.9656e-03, -1.6015e-01,  1.7824e-01,
        2.3835e-01,  5.4046e-02,  1.7623e-01,  2.5032e-01, -8.9558e-02,
       -5.1746e-02, -5.6780e-02,  4.6783e-01, -4.0191e-01, -1.4609e-01,
       -4.1840e-01,  4.7808e-02,  8.5309e-02,  2.2981e-01, -1.3150e-01,
        3.5223e-01, -2.9983e-02, -1.5872e-01,  4.8105e-01, -2.1813e-01,
        2.6121e-02,  2.9155e-01, -1.6576e-01, -6.4291e-01,  1.1702e-01,
        2.1869e-02,  2.3049e-01, -3.0427e-04, -3.0246e-01, -5.5311e-01,
        2.1771e-01, -2.7757e-01,  3.6217e-01,  3.0001e-01,  1.3424e-01,
       -6.9047e-02,  4.1382e-01,  6.5554e-01, -1.4646e-01, -7.5479e-01,
        1.5657e-01, -3.1010e-02,  2.6448e-01, -1.0306e-01,  1.8929e-01,
       -4.0320e-02, -2.3840e-01, -3.4240e-01, -3.6037e-01,  4.9534e-01,
        2.2218e-01, -1.4530e-01, -3.0531e-02, -1.9072e-02,  4.3804e-01,
       -2.1604e-01, -2.1227e-01, -3.7679e-01, -3.7617e-01,  2.9929e-01,
        4.8557e-01, -4.1022e-01, -3.5577e-01,  1.5126e-01, -7.7749e-02,
        3.0098e-01,  5.3022e-01, -7.3720e-02,  3.1710e-01,  2.7338e-01,
        1.2067e-01, -1.6816e+00,  5.4383e-02,  4.4323e-01, -1.8093e-02,
        1.5088e-01, -1.7838e-02,  5.6679e-01,  2.8874e-01,  2.0058e-01,
       -5.9464e-01, -2.1689e-02,  3.7966e-01, -3.5294e-01, -1.6049e-01,
        5.1990e-02, -2.3183e-01,  4.7882e-01,  1.3886e-01,  4.3026e-01,
        1.6122e-01, -8.2338e-02,  9.2464e-02, -3.7833e-01, -5.2110e-02],
      dtype=float32)

Let us query to see which words have vectors most similar to a given word within GloVe.

model.most_similar("difficult")

Out: 

[('impossible', 0.7520476579666138),
 ('very', 0.6883023977279663),
 ('extremely', 0.681058943271637),
 ('complicated', 0.674138069152832),
 ('quite', 0.6580551862716675),
 ('harder', 0.647814929485321),
 ('easier', 0.6476973295211792),
 ('able', 0.6425892114639282),
 ('tough', 0.6402384042739868),
 ('so', 0.638898491859436)]
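
The scores in this list are cosine similarities between word vectors. A minimal sketch of that formula with NumPy, using made-up 3-dimensional vectors instead of real GloVe vectors (`cosine_similarity` is a helper defined here, not a Gensim function):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up vectors, used only to illustrate the formula.
u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])   # same direction as u
w = np.array([3.0, 0.0, -1.0])  # orthogonal to u

print(cosine_similarity(u, v))
print(cosine_similarity(u, w))
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score close to 0. With the model loaded above, `cosine_similarity(model['difficult'], model['impossible'])` should come out close to the first score in the list.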

Explore word similarities with pretrained word embedding vectors in spaCy

Let us take another example: spaCy ships with pretrained models for English and a few other languages. It includes a similarity method to compare different words, and its larger models also include pretrained word embeddings. The one shown below is one of their “sm” models, which does NOT include pretrained word embeddings, and hence it will be less accurate than the Gensim/GloVe approach shown above.

import spacy

nlp = spacy.load('en_core_web_sm')
tokens = nlp(u'difficult impossible')
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

Out:
difficult difficult 1.0
difficult impossible 0.49030286
impossible difficult 0.49030286
impossible impossible 1.0

Document level vectors for running shallow machine learning algorithms such as Support vector machines (SVMs)

After the previous sections, you may be wondering how to generate document-level vectors from pretrained embeddings such as GloVe, so that we can apply shallow machine learning algorithms such as support vector machines and random forests, as we would normally do with the sparse vectors coming directly from count vectorization and tf-idf.

The most common method in the literature (see this paper) and in many award-winning Kaggle kernels is simply to take the average of all the word vectors in a particular document to get an overall document-level vector. A further refinement of this strategy (Lilleberg & Zhang, 2015) is to weight each word vector by its tf-idf score, so that words that are important in a given corpus count more than the others.
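
A rough sketch of the tf-idf weighted variant, under several assumptions: the corpus and the 2-dimensional embeddings are toy values, the helper names are hypothetical, and the IDF formula `log(N / df)` is only one common variant, not necessarily the exact weighting used in the cited work.

```python
import math
from collections import Counter
import numpy as np

def idf_weights(tokenized_docs):
    """Inverse document frequency, log(N / df), for every word in the corpus."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(word for doc in tokenized_docs for word in set(doc))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}

def weighted_doc_vector(tokens, embeddings, idf, dim):
    """Average of word vectors, each weighted by term frequency * idf."""
    counts = Counter(tokens)
    total = np.zeros(dim)
    weight_sum = 0.0
    for word, tf in counts.items():
        if word in embeddings:
            weight = tf * idf.get(word, 0.0)
            total += weight * embeddings[word]
            weight_sum += weight
    return total / weight_sum if weight_sum > 0 else total

# Toy corpus and toy embeddings, for illustration only.
docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "dog", "barked"]]
embeddings = {
    "the": np.array([1.0, 1.0]),
    "cat": np.array([1.0, 0.0]),
    "sat": np.array([0.0, 1.0]),
}
idf = idf_weights(docs)
doc_vector = weighted_doc_vector(docs[0], embeddings, idf, dim=2)
print(doc_vector)
```

In this toy corpus "the" appears in every document, so its weight is zero and it drops out of the average, while the rarer "cat" pulls the document vector towards its own direction.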

Let us take a block of text, preprocess and tokenize it as in the previous two sections, and convert the tokens into bytes objects.

text = '''As a psychologist and coach, I have long been interested in the power of stories to convey important lessons. Some of my earliest     recollections were fables, stories involving animals as protagonists. Fables are an oral story-telling tradition typically involving animals, occasionally humans or forces of nature interacting with each other. The best-known fables are attributed to the ancient Greek storyteller Aesop, although there are fables found in almost all human societies and oral traditions making this an almost universal feature of human culture. A more modern example of an allegorical tale involving animals is George Orwell’s Animal Farm , in which animals rebel against their human masters only to recreate the oppressive system humans had subjected them to. Fables work by casting animals as characters with minds that are recognisably human. This superimposition of human minds and animals serves as a medium for transmitting and teaching moral and ethical values, in particular to children. It does this by drawing on the symbolic connection between nature and society and the dilemmas that this connection can be used to represent. Fables have survived into the modern age, although their role seems to have waned under the emergence of other storytelling media. Animals are, thus, still a feature of modern day stories (e.g. David Attenborough’s wildlife programmes, stories about animals in the news, and and CGI-heavy children’s cartoons involving cute animal characters). Our more immediate and visceral connection with nature and with the different type of animal species seems to have been reduced to being observers rather than participants. Wild animals in ancient times were revered as avatars of ancient gods but also viewed with fear. 
By and large, our modern position as observer in relation to animal stories is that we remotely view animals in peril from climate change on television and social media news coverage, but we may remain disconnected to the fundamental moral lessons their stories could teach us today.'''

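import re

def preprocessor(text):
    # Strip HTML tags.
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) ;-( =D so they survive punctuation removal.
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, replace runs of non-word characters with a space,
    # then append the emoticons (without the "-" nose).
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text

text = preprocessor(text)
# Tokenize on whitespace and encode each token to bytes so that the tokens
# match the bytes keys read from the GloVe file, which is opened in binary mode.
text = [token.encode() for token in text.split()]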

Let us read the downloaded GloVe vectors and convert them into a dict.

import numpy as np

# Map each word (as bytes, since the file is opened in binary mode) to its vector.
embeddings_index = {}
with open(r"C:\Users\Jay M. Patel\Downloads\glove.6B\glove.6B.50d.txt", "rb") as lines:
    for line in lines:
        # Each line holds a word followed by its vector components.
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
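
Each line of a GloVe text file holds a word followed by its space-separated vector components. As an alternative sketch, the same parsing can be done in text mode so that the keys are `str` rather than `bytes`. Here `load_glove` is a hypothetical helper, and an in-memory `io.StringIO` with two made-up lines stands in for the real file.

```python
import io
import numpy as np

def load_glove(lines):
    """Parse lines of the form 'word v1 v2 ... vn' into {word: vector}."""
    index = {}
    for line in lines:
        values = line.rstrip().split(" ")
        index[values[0]] = np.asarray(values[1:], dtype="float32")
    return index

# Two made-up lines in the GloVe text layout, standing in for the real file.
sample = io.StringIO("hello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")
toy_index = load_glove(sample)
print(toy_index["world"])
```

With a real file you would pass `open(path, encoding="utf-8")` instead of `sample`; the tokens then do not need to be encoded to bytes before the lookup.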

Finally, let’s take the average of all the word vectors to get a vector for the entire document, which we can use directly for text classification.

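# Take the average of the word vectors for a given document.
total_words = len(text)

# Use the dimensionality of any stored vector; indexing with text[0]
# would raise a KeyError if the first token is not in GloVe.
vector_sum = np.zeros(next(iter(embeddings_index.values())).shape, dtype=float)

for token in text:
    word_vector = embeddings_index.get(token)
    if word_vector is not None:
        vector_sum += word_vector

# Dividing by total_words treats out-of-vocabulary tokens as zero vectors.
vector_average = vector_sum / total_words

vector_average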

Out:
array([ 4.27346652e-01,  1.57514317e-01, -2.39625478e-01, -9.35632864e-02,
        3.44963874e-01,  3.40743040e-01, -2.63750508e-01, -3.66529997e-01,
       -8.27876303e-02,  3.41370854e-02,  4.36536945e-02,  1.38812064e-01,
       -7.28143505e-02, -8.25220181e-02,  3.11417515e-01,  4.72280507e-02,
        1.04038140e-01,  1.44413003e-02, -3.03484470e-01, -2.69421163e-01,
       -7.93770321e-04,  1.88174378e-01,  2.59898154e-01,  2.18157735e-02,
        2.42991123e-01, -1.24372694e+00, -5.44038484e-01, -1.15947643e-01,
        1.31132527e-01, -1.05934229e-01,  2.97637301e+00,  1.27288132e-02,
       -1.00509887e-01, -5.74834689e-01, -2.73898482e-02,  1.25999838e-01,
       -8.16523744e-02, -1.20244182e-02, -1.90150194e-01, -2.36316453e-02,
       -1.41432983e-01,  2.43932903e-02,  1.20964539e-01,  2.92627208e-01,
        5.57823928e-02,  1.55617574e-01, -1.29049102e-01, -2.56062194e-02,
       -6.94451796e-02, -2.09271689e-01])
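
For a classifier, every document in the corpus needs such a vector. A minimal sketch of stacking averaged vectors into a feature matrix with one row per document, using toy 2-dimensional embeddings and a hypothetical helper `average_vector`. Note one deliberate difference from the code above: this version averages over known tokens only, rather than dividing by the total token count.

```python
import numpy as np

def average_vector(tokens, embeddings, dim):
    """Mean of the vectors of known tokens; all zeros if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(known, axis=0) if known else np.zeros(dim)

# Toy embeddings and corpus, for illustration only.
embeddings = {
    "good": np.array([1.0, 0.0]),
    "bad": np.array([-1.0, 0.0]),
    "movie": np.array([0.0, 1.0]),
}
corpus = [["good", "movie"], ["bad", "movie"], ["unknown", "words"]]

# One row per document, one column per embedding dimension.
X = np.vstack([average_vector(doc, embeddings, dim=2) for doc in corpus])
print(X.shape)
```

A dense matrix like `X`, together with one label per row, is the kind of input that shallow classifiers such as SVMs generally expect.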

You can also generate sentence- and document-level vectors with more sophisticated methods such as doc2vec.

Using word embeddings in neural networks

We will talk about this in a detailed post, which will walk you through all the steps of not only using pretrained word vectors but also training them on your own data.
