# Machine learning

## Basics

### 1. What is machine learning?

Machine learning (ML) is a field of computer science that gives computer systems the ability to progressively improve their performance on a specific task, that is, to learn from data without being explicitly programmed. At a high level, we model a given dataset either to make predictions or to describe the data and gain valuable insights from it.

### 2. What are features and targets in the context of machine learning?

Features are the independent variables (also known as predictors) that describe a given dataset. For example, in the equation Y = a1x1 + a2x2 + ... + c, the independent variables x1, x2, etc. are called features. If our dataset is a text document, then the individual words (known as “tokens”) are features; similarly, for an image, the pixel intensities are features.

Targets or labels are the dependent variables, that is, the variables we want to predict for a given dataset. In the equation Y = a1x1 + a2x2 + ... + c, Y is called the target variable. If the dataset is a text document, then the topic of the document represents its target variable; similarly, for an image, the object in the image, such as a cat or a dog, is the target variable.
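As a minimal sketch of the linear equation above, the snippet below builds a tiny feature table and computes the target from it. The data values and the coefficients `a1`, `a2` and `c` are made up for illustration.

```python
# Hypothetical toy dataset: each row holds two features (x1, x2).
X = [
    [1.0, 2.0],
    [2.0, 0.5],
    [3.0, 1.5],
]

# Assumed coefficients for Y = a1*x1 + a2*x2 + c (illustrative values only).
a1, a2, c = 2.0, -1.0, 0.5

# The target is the dependent variable computed from the features.
y = [a1 * x1 + a2 * x2 + c for x1, x2 in X]
print(y)
```

In a real supervised task the coefficients are unknown, and the model estimates them from pairs of features and targets.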

### 3. What are the most important machine learning techniques or algorithms?

Supervised learning: These algorithms require both features and targets for training. Classification and regression are both types of supervised learning.

Unsupervised learning: These algorithms draw inferences from datasets containing only features, without any targets or labels. Clustering and density estimation are types of unsupervised learning.

### 4. Define classification.

Classification: These are supervised ML algorithms that assign a new observation to one of a set of categories (sub-populations), using a training set of observations (or instances) whose category membership is known.
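One of the simplest possible classifiers is a nearest-centroid rule, sketched below in plain Python. It is only an illustration of the idea: the training data, the labels and the helper names `fit_centroids` and `classify` are hypothetical.

```python
from statistics import mean

# Hypothetical training set: one numeric feature per observation, with a known label.
train = [(1.0, "small"), (1.5, "small"), (2.0, "small"),
         (8.0, "large"), (9.0, "large"), (10.0, "large")]

def fit_centroids(samples):
    """Return the mean feature value of each class."""
    by_label = {}
    for value, label in samples:
        by_label.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_label.items()}

def classify(value, centroids):
    """Assign the class whose centroid is closest to the new observation."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))

centroids = fit_centroids(train)
print(classify(2.5, centroids))
print(classify(7.0, centroids))
```

The labelled training observations determine the centroids, and each new observation is then assigned to one of the known categories.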

### 5. Define Clustering.

Clustering: These are unsupervised ML algorithms that group a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups.
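The sketch below runs a plain k-means loop on one-dimensional data to show how unlabelled points can be grouped by similarity. The points, the starting centers and the function name `kmeans_1d` are assumptions chosen for illustration, and k-means is only one of many clustering algorithms.

```python
def kmeans_1d(points, centers, iterations=10):
    """Plain k-means on 1-D data: assign each point to its nearest center, then update centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]  # made-up, unlabelled observations
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
print(centers)
print(clusters)
```

No labels are supplied at any point; the two groups emerge only from the distances between the points.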

### 6. What is underfitting and overfitting?

Underfitting occurs when a model misses relevant relations between features and outputs. The resulting error is known as bias.

Overfitting occurs when a model is overly sensitive to small fluctuations in the training data. The resulting error is known as variance.
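The contrast can be illustrated with two deliberately extreme models, sketched below on made-up data that roughly follows y = 2x. The data and the model names `mean_model` and `memorizer` are hypothetical and chosen only to make the effect visible.

```python
# Made-up data that roughly follows y = 2x, with noise in the training targets.
train = [(0.0, 0.1), (1.0, 2.3), (2.0, 3.8), (3.0, 6.2)]
test = [(0.4, 0.8), (1.6, 3.2), (2.6, 5.2)]

def mean_model(x):
    """Underfits: ignores x and always predicts the mean training target."""
    return sum(y for _, y in train) / len(train)

def memorizer(x):
    """Overfits: returns the target of the nearest training point, noise included."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def mse(model, data):
    """Mean squared error of a model on (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

for name, model in [("mean model", mean_model), ("memorizer", memorizer)]:
    print(name, "train MSE:", round(mse(model, train), 3),
          "test MSE:", round(mse(model, test), 3))
```

The mean model has a large error on both sets because it ignores the relation between x and y. The memorizer reproduces the training data perfectly but carries its noise over to unseen points.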

### 7. What is threshold?

Classifiers such as logistic regression return probability values between 0 and 1. While you can use these directly, most applications require you to convert them into labels by specifying a threshold probability: observations whose predicted probability is higher than the threshold are assigned the corresponding categorical label.

For example, suppose we are trying to classify emails as spam or not spam with a threshold of 0.9. If a particular email gets a probability of 0.95, then we label it as spam.
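The spam example can be sketched as a one-line rule. The function name `to_label` is hypothetical, and the default threshold of 0.9 is taken from the example above.

```python
def to_label(probability, threshold=0.9):
    """Map a predicted spam probability to a categorical label."""
    return "spam" if probability > threshold else "not spam"

print(to_label(0.95))  # above the 0.9 threshold
print(to_label(0.60))  # below the 0.9 threshold
```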

### 8. What are precision, recall and F1 score?

Precision (P) is defined as the number of true positives (Tp) over the number of true positives plus the number of false positives (Fp).

$$P = \frac{T_p}{T_p+F_p}$$

Recall (R) is defined as the number of true positives (Tp) over the number of true positives plus the number of false negatives (Fn).

$$R = \frac{T_p}{T_p + F_n}$$

These quantities are also related to the F1 score, which is defined as the harmonic mean of precision and recall.

$$F1 = 2\frac{P \times R}{P+R}$$

Precision and recall depend on the threshold set for a given classifier. If the threshold is very low, there will be more false positives; precision can usually be increased by raising the threshold, typically at the cost of lower recall.
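The sketch below computes the three metrics directly from the definitions and repeats the calculation at several thresholds. The labels and scores are made up, and the helper name `precision_recall_f1` is hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0]                      # made-up ground truth
scores = [0.95, 0.80, 0.40, 0.85, 0.30, 0.10]    # made-up classifier probabilities

for threshold in (0.2, 0.5, 0.9):
    y_pred = [1 if s > threshold else 0 for s in scores]
    print(threshold, precision_recall_f1(y_true, y_pred))
```

On this toy data, raising the threshold removes false positives, so precision rises, while more true positives are missed, so recall falls.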

### 9. What are common data transformations?

Datasets with unequal variances or skewed distributions are difficult to use directly as training data for predictive machine learning models, so an appropriate transformation is often applied first to make them suitable for further analysis.

Log(x) Transformation

Taking the logarithm of each value reduces a strong positive skew in a distribution.

Log(x+1) and Log(x+c) Transformation

Logarithms are defined only for positive numbers, and log(0) is undefined. Hence, if a distribution contains zeros, we can add 1 to every value before taking the log; zeros then map to 0, since log(1) = 0. Similarly, if the distribution contains negative numbers, a fixed constant c can be added to every value so that all values become positive.
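A minimal sketch of both variants using only the standard library is shown below. The sample values, the choice of c and the function names `log_plus_one` and `log_plus_c` are assumptions made for illustration.

```python
import math

def log_plus_one(values):
    """log(x + 1): safe for zeros, and maps 0 to 0 because log(1) = 0."""
    return [math.log1p(v) for v in values]

def log_plus_c(values, c):
    """log(x + c): c must be large enough that every x + c is positive."""
    if min(values) + c <= 0:
        raise ValueError("c is too small: x + c must be positive for every x")
    return [math.log(v + c) for v in values]

skewed = [0, 1, 2, 5, 20, 400]          # made-up, positively skewed values
print(log_plus_one(skewed))

with_negatives = [-3, 0, 4, 50]         # made-up values including negatives
print(log_plus_c(with_negatives, c=4))  # smallest shifted value is 1, so its log is 0
```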

Square Root Transformation

The results of a square root transformation are similar to those of a log transformation: the square root changes a larger number far more than a smaller one, so the net effect is that the entire dataset is pulled more tightly towards a central value. Negative numbers don't have real square roots, so you can only apply this to non-negative values.
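As a quick illustration on made-up values, the snippet below shows how the spread between the smallest and largest value shrinks after taking square roots.

```python
import math

values = [1, 4, 100, 10000]              # made-up, positively skewed values
roots = [math.sqrt(v) for v in values]   # only valid for non-negative inputs

print(roots)
print(max(values) - min(values), "->", max(roots) - min(roots))
```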

Reverse Score Transformation

If we have a negatively skewed distribution, we need to reverse it before applying the transformations above. Take the highest value in the dataset and subtract each score from it (xhighest - x). This gives a lowest score of 0; if you want the lowest score to be 1, add 1 to all the values after the subtraction. Each value is now reversed: the highest value has become the lowest and vice versa.
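The reversal can be sketched as follows. The function name `reverse_scores` and its `lowest` parameter are hypothetical, and the scores are made up.

```python
def reverse_scores(values, lowest=0):
    """Subtract each score from the highest one; `lowest` sets the new minimum (0 or 1)."""
    highest = max(values)
    return [highest - v + lowest for v in values]

scores = [1, 8, 9, 10]                     # made-up, negatively skewed scores
print(reverse_scores(scores))              # lowest reversed score is 0
print(reverse_scores(scores, lowest=1))    # lowest reversed score is 1
```

After the reversal, the long tail lies on the right, so a log or square root transformation can be applied.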

Reciprocal Transformation

Taking the reciprocal of each value also shrinks large values and narrows the distribution. However, it reverses the order of the scores: small values become large and vice versa. One way to avoid that is to apply a reverse score transformation before the reciprocal transformation, which ensures that values that were large originally are still large after taking reciprocals, while the distribution becomes much tighter. The reciprocal can be applied to both positive and negative values, but not to zero.
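The effect on ordering can be checked with a small sketch on made-up positive values. The helper names are hypothetical, and adding 1 in the reverse step is an assumption made here so that no reversed score is zero before the reciprocal is taken.

```python
def reciprocal(values):
    """1/x for every value; zero has no reciprocal."""
    return [1 / v for v in values]

def reverse_scores(values):
    """Reverse the ordering and shift so the lowest reversed score is 1, not 0."""
    highest = max(values)
    return [highest - v + 1 for v in values]

values = [1, 2, 4, 10]                                # made-up positive values
plain = reciprocal(values)                            # largest input becomes the smallest output
reversed_first = reciprocal(reverse_scores(values))   # largest input stays the largest output

print(plain)
print(reversed_first)
```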

Cube Root Transformations

There are cases when it's preferable to retain the sign of each value. In those cases, taking the cube root is a good idea, especially if the reciprocal transformation hasn't had the desired effect.
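A sign-preserving cube root can be sketched with the standard library as below. The function name `cube_root` and the sample values are assumptions. The sign is handled explicitly because raising a negative float to the power 1/3 in Python does not return a real number.

```python
import math

def cube_root(x):
    """Real cube root that keeps the sign of x."""
    return math.copysign(abs(x) ** (1 / 3), x)

values = [-1000.0, -8.0, 0.0, 8.0, 1000.0]   # made-up values with both signs
print([round(cube_root(v), 6) for v in values])
```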

### 10. Define hyperparameters.

A hyperparameter is a parameter whose value is set before the learning process begins. By contrast, the values of other parameters are derived via training. The learning rate and the regularization strength are examples of hyperparameters.
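The distinction can be seen in a small gradient descent sketch: `learning_rate` and `epochs` are hyperparameters chosen up front, while the slope `w` is a parameter learned from the data. The data, the function name `fit_slope` and the chosen values are made up for illustration.

```python
def fit_slope(xs, ys, learning_rate, epochs):
    """Fit y ~ w * x by gradient descent on the mean squared error."""
    w = 0.0  # parameter: its value is learned from the data
    n = len(xs)
    for _ in range(epochs):
        gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * gradient
    return w

xs = [1.0, 2.0, 3.0]   # made-up feature values
ys = [2.0, 4.0, 6.0]   # made-up targets, exactly y = 2x

# Hyperparameters: fixed before training starts.
w = fit_slope(xs, ys, learning_rate=0.05, epochs=200)
print(w)
```

Changing the hyperparameters changes how training proceeds; with a learning rate that is too large for this data, the same loop would diverge instead of converging.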

## Feature engineering for text analytics and natural language processing

### What is meant by text tokenization?

Splitting individual sentences into their constituent words or tokens is referred to as “tokenization”.

Let us consider the sentences below:

Sam likes to watch baseball. Christine likes baseball too.


After removing punctuation, we can represent the text as a list of individual “tokens” or words.

"Sam","likes","to","watch","baseball","Christine","likes","baseball","too"

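A minimal sketch of this step with a regular expression is shown below. The pattern `\w+` is one simple choice, assumed here for illustration; real tokenizers handle contractions, hyphens and other cases more carefully.

```python
import re

text = "Sam likes to watch baseball. Christine likes baseball too."

# \w+ matches runs of letters, digits and underscores, so punctuation is dropped.
tokens = re.findall(r"\w+", text)
print(tokens)
```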

### What is bag of words based text vectorization?

In a bag of words model, text is represented as the bag (or multiset) of its words (known as “tokens”), disregarding grammar, punctuation and word order but keeping the multiplicity of individual words/tokens.

In a bag of words representation, we can convert the list above into a dictionary, with keys being the words, and values being the number of occurrences.

```python
BoW_dict = {"Sam": 1, "likes": 2, "to": 1, "watch": 1, "baseball": 2, "Christine": 1, "too": 1}
```
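The same dictionary can be built from the token list with `collections.Counter` from the standard library, as in the sketch below.

```python
from collections import Counter

tokens = ["Sam", "likes", "to", "watch", "baseball",
          "Christine", "likes", "baseball", "too"]

# Counter maps each distinct token to its number of occurrences.
bow = Counter(tokens)
print(dict(bow))
```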



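### How to create bag of words vectors in scikit-learn?

```python
from sklearn.feature_extraction.text import CountVectorizer

text = "Sam likes to watch baseball. Christine likes baseball too."

# fit_transform learns the vocabulary and returns the document-term matrix.
cv = CountVectorizer(analyzer='word')
dtm = cv.fit_transform([text])
print(cv.get_feature_names_out())  # vocabulary, lowercased by default
print(dtm.toarray())               # counts for the single document
```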

### What is a major drawback of count vectorization compared to other vectorization methods?

It gives equal weight to all the words (or tokens) present in the corpus, which makes it a poor representation for semantic analysis. Words such as “it”, “is”, “that” and “this” don't contribute much to the meaning of a sentence and are common across all English documents; these are known as stop words. Ideally, you want to decrease the weight given to stop words and increase the weight given to words that carry the semantic meaning of the sentence.

### Define Term Frequency.

Term Frequency (TF) is the ratio of the number of times a word appears in a document to the total number of words in that document. It is expressed below, where $n_{i,j}$ is the number of times word $i$ occurs in document $j$ and the sum in the denominator runs over all words $k$ in document $j$. TF will be high for stop words, whereas it will be low for rare words, which often carry much of the meaning of a sentence.

$$TF_{i,j} = \frac{n_{i,j}}{\sum\limits_{k} n_{k,j}}$$
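A direct translation of this definition into plain Python might look like the sketch below. The function name `term_frequency` is hypothetical, and the tokens are the lowercased example sentences used earlier.

```python
from collections import Counter

def term_frequency(tokens):
    """Return count / total number of tokens for every distinct token in one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {token: count / total for token, count in counts.items()}

tokens = "sam likes to watch baseball christine likes baseball too".split()
tf = term_frequency(tokens)
print(tf["likes"], tf["sam"])
```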

### Define Inverse Document Frequency.

Inverse Document Frequency idf(w) of a given word w is defined as the logarithm of the total number of documents (N) divided by the document frequency $df_w$, which is the number of documents in the collection containing the word w.

$$idf(w) = \log\frac{N}{df_w}$$
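The definition can be sketched as below on a made-up corpus of three tokenized documents. The function name `inverse_document_frequency` is hypothetical, and the natural logarithm is assumed since the formula does not fix a base.

```python
import math

def inverse_document_frequency(word, documents):
    """log(N / df): N documents in total, df of them containing the word."""
    n_docs = len(documents)
    df = sum(1 for doc in documents if word in doc)
    if df == 0:
        raise ValueError("word does not occur in any document")
    return math.log(n_docs / df)

# Hypothetical corpus: each document is a list of lowercased tokens.
documents = [
    ["sam", "likes", "baseball"],
    ["christine", "likes", "tennis"],
    ["the", "game", "was", "long"],
]

print(inverse_document_frequency("likes", documents))   # appears in 2 of 3 documents
print(inverse_document_frequency("tennis", documents))  # appears in 1 of 3 documents
```

Words that occur in fewer documents receive a higher idf, and a word that occurs in every document receives an idf of 0.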

### Define Term Frequency - Inverse Document Frequency (Tf-idf)

Term Frequency - Inverse Document Frequency tfidf(w) is simply a product of term frequency and inverse document frequency.

$$tfidf(w) = {TF}\times{idf(w)}$$
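Putting the two definitions together gives the sketch below. It follows the unsmoothed formulas in this section with a natural logarithm; the function name `tfidf` and the two-document corpus are assumptions. Note that scikit-learn's `TfidfVectorizer` uses a smoothed idf and normalizes each vector by default, so its numbers will differ from these.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return one {word: tf * idf} dictionary per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents containing each word.
    df = Counter(word for doc in documents for word in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights

# Hypothetical corpus: each document is a list of lowercased tokens.
documents = [
    ["sam", "likes", "baseball"],
    ["christine", "likes", "baseball", "too"],
]

weights = tfidf(documents)
print(weights[0])
```

Words shared by every document, such as "likes" here, get a weight of 0, while words unique to one document get a positive weight.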