Introduction to web scraping in Python using Beautiful Soup

14.02.2019 - Jay M. Patel

The first step for any web scraping project is getting the webpage you want to parse. There are many Python libraries, such as urllib, urllib2, and urllib3, for requesting pages over HTTP, but none of them beats the elegance of the requests library, which we have been using in earlier posts on REST APIs and will continue to use here. Before we get into the workings of Beautiful Soup, let us first get a basic understanding of HTML structure, common tags, and styling sheets.
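
Below is a minimal sketch of that first step, assuming the requests and beautifulsoup4 packages are installed; the URL is only a placeholder, not one used in the post.

    import requests
    from bs4 import BeautifulSoup

    # Fetch the page over HTTP with requests (placeholder URL)
    response = requests.get("https://example.com")
    response.raise_for_status()

    # Parse the returned HTML with Beautiful Soup's built-in parser
    soup = BeautifulSoup(response.text, "html.parser")

    # Pull out a couple of common tags: the page title and all link targets
    print(soup.title.get_text())
    for link in soup.find_all("a"):
        print(link.get("href"))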

Read more… (~5 Min.)

Why is web scraping essential and who uses web scraping?

10.02.2019 - Jay M. Patel

Web scraping, also called web harvesting, web data extraction, or even web data mining, is defined as a software program or code designed to automate the downloading and parsing of data from the web. Nowadays, many websites such as Twitter and Facebook provide REST-based application programming interfaces (APIs) to programmatically consume the structured data available on their sites, and data obtained that way is usually not only “cleaner” but also easier and more hassle-free to work with than data obtained by web scraping.

Read more… (~3 Min.)

Natural language processing (NLP): word embeddings, word2vec, GloVe based text vectorization in Python

08.02.2019 - Jay M. Patel

In the previous posts, we looked at count vectorization and term frequency-inverse document frequency (tf-idf) to convert a given text document into vectors which can be used as features for a text classification task, such as classifying emails into spam or not spam. The major disadvantages of such an approach are that the vectors generated are sparse (mostly made of zeros) and very high-dimensional (the same dimensionality as the number of words in the vocabulary).
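
As a rough sketch of the dense, low-dimensional alternative the post covers, here is word2vec trained with gensim (assuming the gensim 4.x API, where the embedding dimension is set with vector_size; the toy corpus is made up for illustration):

    from gensim.models import Word2Vec

    # A tiny toy corpus: each document is a list of tokens
    sentences = [
        ["free", "money", "claim", "your", "prize"],
        ["meeting", "scheduled", "for", "monday", "morning"],
        ["win", "a", "free", "prize", "now"],
    ]

    # Train word2vec embeddings; each word maps to a 50-dimensional dense vector,
    # far smaller than the vocabulary-sized sparse vectors from count or tf-idf
    model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

    print(model.wv["free"].shape)          # (50,)
    print(model.wv.most_similar("free"))   # nearest words in embedding space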

Read more… (~8 Min.)

Natural language processing (NLP): term frequency-inverse document frequency (tf-idf) based vectorization in Python

07.02.2019 - Jay M. Patel

In the previous post, we looked at count vectorization to convert a given document into sparse vectors which can be used as features for a text classification task, such as classifying emails into spam or not spam. Count vectorization has one major drawback: it gives equal weight to all the words (or tokens) present in the corpus, and this makes it a poor representation for the semantic analysis of a sentence.
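
A quick illustrative sketch of tf-idf weighting with scikit-learn (assuming scikit-learn 1.x; the toy documents are placeholders, not the examples from the post):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Toy corpus standing in for email bodies
    docs = [
        "win a free prize now",
        "meeting agenda for monday",
        "free money waiting for you",
    ]

    # Tf-idf down-weights tokens that appear in many documents
    # and up-weights tokens that are distinctive to one document
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)

    print(vectorizer.get_feature_names_out())
    print(X.toarray().round(2))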

Read more… (~5 Min.)

Natural language processing (NLP): text vectorization and bag of words approach

04.02.2019 - Jay M. Patel

Let us consider a simple task of classifying a given email into spam or not spam. As you might have already guessed, this is an example of a simple binary classification problem, with target or label values being 0 or 1, within the larger field of supervised machine learning (ML). However, we still need to convert the non-numerical text contained in emails into numerical features which can be fed into an ML algorithm of our choice.
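
A minimal bag-of-words sketch with scikit-learn's CountVectorizer (assuming scikit-learn 1.x; the two toy emails and labels are made up for illustration):

    from sklearn.feature_extraction.text import CountVectorizer

    # Two toy "emails": one spam-like, one not
    emails = [
        "free money claim your prize now",
        "project update attached see meeting notes",
    ]
    labels = [1, 0]  # 1 = spam, 0 = not spam

    # Bag of words: each email becomes a vector of raw token counts
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(emails)

    print(vectorizer.get_feature_names_out())  # learned vocabulary
    print(X.toarray())                         # numerical features for an ML model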

Read more… (~4 Min.)

Top data science interview questions and answers

02.02.2019 - Jay M. Patel

Machine learning basics 1. What is machine learning? Machine learning (ML) is a field of computer science that gives computer systems the ability to progressively improve performance on a specific task, i.e., to learn from data, without being explicitly programmed. Taking a 50,000 ft view, we either want to model a given dataset to make predictions, or we want a model that describes a given dataset so we can gain valuable insights.

Read more… (~7 Min.)

Using Twitter REST APIs in Python to search and download tweets in bulk

01.02.2019 - Jay M. Patel

Getting Twitter data Let’s use the tweepy package in Python instead of handling the Twitter API directly. The two things we will do with the package are: authorize ourselves to use the API, and then use the cursor to access the Twitter search APIs. Let’s go ahead and get our imports loaded. import tweepy import pandas as pd import matplotlib.pyplot as plt import seaborn as sns import numpy as np sns.
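
A sketch of those two steps, assuming the tweepy 3.x API that was current when this post was written (in tweepy 4.x, api.search was renamed api.search_tweets); the credential strings and search query are placeholders:

    import tweepy

    # Placeholder credentials from a Twitter developer account
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
    api = tweepy.API(auth, wait_on_rate_limit=True)

    # Use a cursor to page through the search API for a placeholder query
    for tweet in tweepy.Cursor(api.search, q="data science", lang="en").items(10):
        print(tweet.created_at, tweet.text)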

Read more… (~4 Min.)

Introduction to natural language processing: rule-based methods, named entity recognition (NER), and text classification

01.02.2019 - Jay M. Patel

The ability of computers to understand human languages is referred to as Natural Language Processing (NLP). This is a vast field, and practitioners frequently include machine translation and natural language generation (NLG) as part of core NLP. However, in this section we will only look at NLP techniques which aim to extract insights from unstructured text. Regular expressions (regex) and rule-based methods Regular expressions (regex) match patterns with sequences of characters, and they are supported in a wide variety of programming languages.
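
A small illustration of the rule-based idea with Python's built-in re module (the sample text and patterns are made up here and deliberately simplified):

    import re

    text = "Contact Jay at jay.patel@example.com or call +1-555-010-1234 by 15/03/2019."

    # Simplified patterns for illustration; they will not cover every real-world format
    email_pattern = r"[\w.+-]+@[\w-]+\.[\w.]+"
    date_pattern = r"\d{2}/\d{2}/\d{4}"

    print(re.findall(email_pattern, text))  # ['jay.patel@example.com']
    print(re.findall(date_pattern, text))   # ['15/03/2019']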

Read more… (~3 Min.)

Get started with Git and GitHub in under 10 Minutes

28.10.2018, last updated 20.02.2019 - Jay M. Patel

Git is probably the most popular version control system in use right now, and if you have any experience with other version control systems like Subversion, then it’s strongly recommended that you go through this section, since a lot of concepts from other systems don’t directly translate to Git. Git basics You can create a local Git repository on your computer by either: Clone an existing Git repository from a remote repository on the internet (e.

Read more… (~6 Min.)