How to do full text searching in Python using Whoosh library
02.08.2020 - Jay M. Patel - Reading time ~3 Minutes
Full text searching with fast indexing is essential for going through large quantities of text in Pandas dataframes to power your data science workflows. Traditionally, we would use a full text search engine database like Elasticsearch, Apache Solr, or Amazon CloudSearch for this; however, that is pretty impractical for one-off requirements or when working with only a few GBs of data.
In that case, you can simply load your text into a Pandas dataframe and use a regex to perform the search, as shown below.
import pandas as pd
import numpy as np
import re
df = pd.read_csv("your_csv_file_path")
df = df.fillna("")
df = df[df.col_name.str.contains(r'^(?=.*trump)(?=.*biden)', flags=re.IGNORECASE, regex=True)] # https://stackoverflow.com/questions/37011734/pandas-dataframe-str-contains-and-operation
# this is equivalent to:
df = df[df['col_name'].str.contains('trump', flags=re.IGNORECASE) & df['col_name'].str.contains('biden', flags=re.IGNORECASE)]
You can build the regex expression programmatically as shown:
base = r'^{}'
expr = '(?=.*{})'
words = ['trump', 'biden', 'elections'] # example
search_regex = base.format(''.join(expr.format(w) for w in words))
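As a quick sanity check, here is the generated pattern applied to a toy dataframe (the sample rows are made up for illustration); only rows containing all of the listed words survive the filter:

```python
import re
import pandas as pd

base = r'^{}'
expr = '(?=.*{})'
words = ['trump', 'biden', 'elections']
search_regex = base.format(''.join(expr.format(w) for w in words))

# hypothetical sample rows for illustration
df = pd.DataFrame({"col_name": [
    "Trump and Biden clash ahead of the elections",
    "Biden comments on the elections",
    "Local weather report",
]})

matches = df[df.col_name.str.contains(search_regex, flags=re.IGNORECASE, regex=True)]
print(matches.col_name.tolist())
# only the first row mentions all three words
```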
However, this approach is still pretty imperfect. For starters, there is no stop word removal happening, and words are not being normalized into their base forms, so words like rain, raining, etc. are considered distinct even though they share the same root form and semantic meaning. A lot of text preprocessing is necessary to get all the relevant search results.
There is a full text search library in Java called Lucene which powers Elasticsearch; similarly, Whoosh is a pure Python library that fills the same niche and can be installed simply by pip install Whoosh. In this post, we will create and populate a Whoosh search index and use it to run some full text queries.
Firstly, let us get some data to fill our Whoosh index. One of the best ways to see the power of full text searching is by using it on recent news articles. Let us grab some articles from the Latest News API at Algorithmia. You get about 10,000 free credits when you sign up for free using your email address, and that should be plenty to run hundreds of queries.
import Algorithmia
results_list = []
for i in range(1, 3):
    print("fetching page ", str(i))
    input = {
        "domains": "",
        "topic": "politics",
        "q": "",
        "qInTitle": "",
        "content": "true",
        "author_only": "true",
        "page": str(i)
    }
    client = Algorithmia.client(YOUR_ALGO_KEY)
    algo = client.algo('specrom/LatestNewsAPI/0.1.6')
    response_dict = algo.pipe(input).result
    results_list = results_list + response_dict["Article"]
df = pd.DataFrame(results_list)
df.head()
Let us create a search schema for Whoosh. For our purposes, let us only index a few fields: title, content, and path, which is just the index of the dataframe row. Let us also create an empty index directory.
from whoosh.fields import Schema, TEXT, ID
from whoosh import index
import os, os.path
from whoosh import qparser
from whoosh.qparser import QueryParser
schema = Schema(title=TEXT(stored=True), path=ID(stored=True), content=TEXT(stored=True))
# create empty index directory
if not os.path.exists("index_dir"):
    os.mkdir("index_dir")
Now, we will use the schema to initialize a Whoosh index in the above directory.
ix = index.create_in("index_dir", schema)
writer = ix.writer()
Lastly, let us fill this index with the data from the dataframe.
for i in range(len(df)):
    writer.add_document(title=str(df.title.iloc[i]), content=str(df.content.iloc[i]),
                        path=str(i))
writer.commit()
Now, all we need to do is write a function that can search this index.
# https://stackoverflow.com/questions/19477319/whoosh-accessing-search-page-result-items-throws-readerclosed-exception
# http://annamarbut.blogspot.com/2018/08/whoosh-pandas-and-redshift-implementing.html
# https://ai.intelligentonlinetools.com/ml/search-text-documents-whoosh/
def index_search(dirname, search_fields, search_query):
    ix = index.open_dir(dirname)
    schema = ix.schema
    # OrGroup with a scaling factor rewards documents that match more of the query terms
    og = qparser.OrGroup.factory(0.9)
    mp = qparser.MultifieldParser(search_fields, schema, group=og)
    q = mp.parse(search_query)
    results_dict = {}
    # results must be read inside the searcher context; accessing them
    # afterwards raises a ReaderClosed exception
    with ix.searcher() as s:
        results = s.search(q, terms=True, limit=10)
        print("Search Results: ")
        print(results[0:10])
        for i, hit in enumerate(results):
            results_dict[i] = dict(hit)
    return results_dict
results_dict = index_search("index_dir", ['title','content'], u"northern minnesota")
The simplicity of using Whoosh makes it an attractive choice for quick and easy searching before moving on to more heavyweight options.