Introduction to machine learning metrics

01.01.2020 - Jay M. Patel - Reading time ~6 Minutes

Machine learning (ML) is a field of computer science that gives computer systems the ability to progressively improve performance on a specific task, that is, to learn from data without being explicitly programmed. Taking a 50,000 ft view, we either want to model a given dataset to make predictions, or we want a model that describes the dataset so that we can gain valuable insights.

There are three general areas of machine learning models:

Numerical prediction, also called Regression: These are supervised ML algorithms which use a training set of independent variables to derive a relationship to a dependent variable.
Classification: These are supervised ML algorithms which assign a new observation to one of a set of subpopulations, using a training set of observations (or instances) whose category membership is known.
Clustering: These are unsupervised ML algorithms which group a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups/clusters.

Basic statistics

Almost everyone knows basic statistics such as the mean and the standard deviation, but their formulae are listed below just in case.

Mean, X_M, of N observations can be defined as

$$X_M = \dfrac{\sum_{i=1}^{N} X_i}{N}$$

Degrees of freedom, Df, with N observations is defined as

$$Df = N-1$$

Standard deviation, s, is defined as

$$s = \sqrt{\dfrac{\sum_{i=1}^{N} (X_i-X_M)^2}{Df}} = \sqrt{\dfrac{\sum_{i=1}^{N} (X_i-X_M)^2}{N-1}}$$
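As a quick check of the formulae above, here is a minimal Python sketch. The function names and the sample data are made up for illustration; the result is compared against the standard library's `statistics.stdev`, which also divides by N-1.

```python
import math
import statistics


def sample_mean(xs):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(xs) / len(xs)


def sample_std(xs):
    """Sample standard deviation using N - 1 degrees of freedom."""
    m = sample_mean(xs)
    df = len(xs) - 1
    return math.sqrt(sum((x - m) ** 2 for x in xs) / df)


# Illustrative data, not taken from the article.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean_value = sample_mean(data)
std_value = sample_std(data)

print(mean_value)
print(std_value)
print(statistics.stdev(data))  # should agree with std_value
```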

Evaluating goodness of fit for regression

If a regression model were perfect, then we would see zero deviation between the predicted values of Y and the actual Y. In practice this is seldom the case, and the differences between the actual and predicted values are known as residuals.

Intuitively, the model with the lowest total residuals will perform the best; however, residuals can be positive or negative, so simply adding them up may cancel them out and lead to incorrect conclusions.

We can easily avoid this problem by squaring each residual and summing the squares (sum of squared differences, SS).

Thus we obtain the best model by selecting the one with the lowest sum of squared differences, and this is exactly how the ordinary least squares method for linear regression works.
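The cancellation problem and the least squares idea can be sketched in a few lines of Python. This is only an illustration: the data points, the function names, and the hand-picked comparison line are assumptions, and the fit uses the textbook closed-form solution for a single predictor.

```python
def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


def residuals(xs, ys, intercept, slope):
    """Observed minus predicted value for every data point."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def sum_of_squares(values):
    """Sum of squared values (SS when applied to residuals)."""
    return sum(v * v for v in values)


# Illustrative data, not taken from the article.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(xs, ys)
ols_res = residuals(xs, ys, intercept, slope)
guess_res = residuals(xs, ys, 0.0, 2.0)  # an arbitrary hand-picked line

print(sum(ols_res))               # close to zero: the signs cancel out
print(sum_of_squares(ols_res))    # SS for the least squares line
print(sum_of_squares(guess_res))  # SS for the hand-picked line, which is larger
```

The raw residuals of the fitted line sum to roughly zero even though the fit is not perfect, which is why the squared version is the quantity being minimized.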

Coefficient of determination (R²)

In order to assess goodness of fit, we need to compare our model to something else. In statistics, the most basic model available is the mean.

$$S_T = \sum_{i=1}^{n} (Y_{iObs}-mean)^2$$

Where S_T is the total sum of squares (the sum of squared differences between the observed Y and the mean).

$$S_R = \sum_{i=1}^{n} (Y_{iObs}-Y_{iM})^2$$

Where S_R is the residual sum of squares between Y_iObs and the predicted Y (Y_iM) from a given model. It is also called the sum of squared errors of prediction (SSE).

$$S_M=S_T-S_R$$

S_M is defined as the model sum of squares and represents the improvement made by a given regression model over the basic model of just using the mean. It is also called the explained sum of squares (ESS) or the sum of squares due to regression (SSR).

In simple regression cases, it can be calculated by the above expression, whereas for more complicated cases, we need to add other terms.

If S_M is small, then the model is only a little better than simply using the mean as the best guess.

The performance of a model can be represented as a fraction between 0 and 1 called R². A value of 1 indicates that the model explains all the variance in the data, whereas 0 means that it doesn’t explain any of the variance in the dataset.

$$R^2 = S_M/S_T$$

For simple linear regression, taking the square root of R² gives the absolute value of Pearson’s correlation coefficient between X and Y.

In general, R² can be represented as:

$$R^2 = 1-\frac{S_R}{S_T} = 1-\frac{\sum_{i=1}^{n} (Y_{iObs}-Y_{iM})^2}{\sum_{i=1}^{n} (Y_{iObs}-mean)^2}$$
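To make the sums of squares concrete, the sketch below computes S_T, S_R, S_M and R² for a simple linear regression, and checks the relationship with Pearson's correlation coefficient. The data and variable names are made up for illustration, and the line is fitted with the closed-form least squares solution for one predictor.

```python
import math

# Illustrative data, not taken from the article.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Closed-form least squares fit for a single predictor.
sxx = sum((x - x_mean) ** 2 for x in xs)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
slope = sxy / sxx
intercept = y_mean - slope * x_mean
predicted = [intercept + slope * x for x in xs]

s_t = sum((y - y_mean) ** 2 for y in ys)                 # total sum of squares
s_r = sum((y - p) ** 2 for y, p in zip(ys, predicted))   # residual sum of squares
s_m = s_t - s_r                                          # model sum of squares

r_squared = s_m / s_t
pearson_r = sxy / math.sqrt(sxx * s_t)

print(r_squared)
print(1 - s_r / s_t)        # same value, written in terms of S_R
print(math.sqrt(r_squared)) # matches abs(pearson_r) for simple regression
print(pearson_r)
```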

Akaike information criterion (AIC) and Bayesian information criterion (BIC)

As you add variables to a multiple linear regression (MLR) model, R² goes up regardless of whether you are meaningfully improving the fit or not. Because of this, we need other parsimony-adjusted measures of fit to figure out which of several models best fits a given dataset.

$$AIC=2k-2\ln(L_h)$$

$$BIC=\ln(n)k-2\ln(L_h)$$

Where AIC is the Akaike information criterion and BIC is the Bayesian information criterion.
- k is the number of parameters estimated by the model; for MLR it is the number of slope coefficients (q), plus the intercept and the error variance, so k = q+2
- n is the number of observations.
- L_h is the maximum value of the likelihood function for the model (this of course varies for different types of regression).

Leave-one-out cross-validation is asymptotically equivalent to AIC for ordinary linear regression models.

AIC introduces a penalty term of 2k for the addition of new parameters whereas BIC uses ln(n)k as its penalty term. As you add more variables, and consequently new parameters, the values of AIC and BIC go up, indicating a worse model, unless the likelihood improves enough to offset the penalty.

In general it's a good idea to use both for model selection. There are no hard guidelines on what counts as an acceptable AIC value; AIC and BIC values are only meaningful when compared with those of other models fitted to the same outcome variable, and in general a lower value is preferred over a higher one.
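As a rough illustration, the sketch below compares a mean-only model with a straight-line model using AIC and BIC. It assumes normally distributed errors, in which case the maximized log-likelihood of a least squares fit can be written in terms of the residual sum of squares; it uses the standard definition BIC = ln(n)k - 2ln(L_h) and the k = q+2 convention described above. The data and function names are made up for illustration.

```python
import math


def gaussian_max_log_likelihood(ss_res, n):
    """Maximized log-likelihood of a least squares fit, assuming normal errors.

    The maximum likelihood estimate of the error variance is ss_res / n.
    """
    sigma2 = ss_res / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)


def aic(log_l, k):
    return 2 * k - 2 * log_l


def bic(log_l, k, n):
    return math.log(n) * k - 2 * log_l


# Illustrative data, not taken from the article.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Model A: mean only (q = 0 slopes, so k = 2).
ss_mean = sum((y - y_mean) ** 2 for y in ys)

# Model B: straight line (q = 1 slope, so k = 3).
sxx = sum((x - x_mean) ** 2 for x in xs)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
slope = sxy / sxx
intercept = y_mean - slope * x_mean
ss_line = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

ll_mean = gaussian_max_log_likelihood(ss_mean, n)
ll_line = gaussian_max_log_likelihood(ss_line, n)

aic_mean, aic_line = aic(ll_mean, 2), aic(ll_line, 3)
bic_mean, bic_line = bic(ll_mean, 2, n), bic(ll_line, 3, n)

print(aic_mean, aic_line)  # the lower value marks the preferred model
print(bic_mean, bic_line)
```

Here the extra slope parameter improves the likelihood by far more than the penalty it incurs, so both criteria prefer the straight-line model.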

p-value and t-test

The p-value (probability value, or asymptotic significance) is the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true for a given statistical model. It is not the probability that the null hypothesis is true. A small p-value (typically < 0.05) indicates strong evidence against the null hypothesis, whereas a large p-value (> 0.05) indicates weak evidence against it, so you fail to reject the null hypothesis.

In the case of regression models, we can calculate a p-value for each coefficient of the model, with the null hypothesis being that the coefficient is zero. A low p-value indicates that an estimate this far from zero would be unlikely if the true coefficient were zero.

We use a t-test to assess the null hypothesis that a coefficient in a regression equation is zero. If the test is significant, then we are more confident in the hypothesis that the coefficient is different from zero.

The degrees of freedom in a regression are N-p-1, where N is the number of observations and p is the number of predictors. As explained above, a low p-value from the t-test shows that the coefficient is significantly different from 0.
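The t-test for a slope coefficient can be sketched as follows for a simple regression with one predictor (p = 1). The data and function names are assumptions made for illustration. To stay within the standard library, the two-sided p-value is obtained by numerically integrating the Student t density with Simpson's rule; in practice you would more likely call a statistics library for this step.

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p_value(t_stat, df, steps=2000):
    """Two-sided p-value via Simpson's rule on the t density (steps must be even)."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * total * h / 3  # area between -|t| and +|t|
    return max(0.0, 1.0 - central_area)


# Illustrative data, not taken from the article.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)
num_predictors = 1
x_mean = sum(xs) / n
y_mean = sum(ys) / n

sxx = sum((x - x_mean) ** 2 for x in xs)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
slope = sxy / sxx
intercept = y_mean - slope * x_mean
s_r = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

df = n - num_predictors - 1             # N - p - 1
se_slope = math.sqrt((s_r / df) / sxx)  # standard error of the slope
t_stat = slope / se_slope               # null hypothesis: slope is zero
p_value = two_sided_p_value(t_stat, df)

print(df)
print(t_stat)
print(p_value)  # below 0.05 here, so the slope differs significantly from zero
```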

F-ratio

S_M and S_R depend on the number of values that were summed, so it is better to use the mean sum of squares (MSS) and the residual mean squares (MSR), which are obtained by dividing S_M and S_R by their respective degrees of freedom. The F-ratio can then be defined as:

$$ F = \dfrac{MSS}{MSR}$$

A good model yields a large F-ratio (F > 1). Its exact significance should be assessed by comparing F against the F distribution with the appropriate degrees of freedom for both the numerator and the denominator. The Df for MSS is the number of variables in the model, and the Df for MSR is the number of observations minus the number of parameters being estimated (N-p-1 when the intercept is counted).
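The F-ratio for a simple regression can be sketched as below. The data and variable names are made up for illustration. With a single predictor, F equals the square of the slope's t-statistic, which the last lines use as a sanity check.

```python
# Illustrative data, not taken from the article.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)
num_predictors = 1
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Closed-form least squares fit for a single predictor.
sxx = sum((x - x_mean) ** 2 for x in xs)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
slope = sxy / sxx
intercept = y_mean - slope * x_mean

s_t = sum((y - y_mean) ** 2 for y in ys)
s_r = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
s_m = s_t - s_r

mss = s_m / num_predictors            # Df = number of variables in the model
msr = s_r / (n - num_predictors - 1)  # Df = N - p - 1
f_ratio = mss / msr

# Sanity check: with one predictor, F equals the squared t-statistic of the slope.
t_squared = slope ** 2 / (msr / sxx)

print(f_ratio)
print(t_squared)
```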

Evaluating metrics for classification

Classification metrics such as precision, recall, and F1 score are all covered in detail in our article about top data science FAQs; check it out for more information.
