# Model Dimensionality and Overfitting

## Introduction

Overfitting is one of the most common situations any data scientist will face, yet it remains one of the most challenging to understand and master in depth.

Overfitting is the production of a model that matches a portion of (noisy) data too closely, thus losing the ability to capture the real features of the underlying phenomenon.

The immediate result of overfitting is a great fit in-sample, while the results do not extend out of sample, which in turn makes the model unusable.

There are several reasons that can lead to overfitting. In the literature, the most commonly reported one is noisy data. This is a quite common condition and it can be addressed fairly easily: the literature is rich in devices for the detection of ill-posed problems and for regularization. However, when working with real data there are many other causes of overfitting. One quite common and much more difficult to pinpoint is model bias, i.e. the poor ability of the model to capture and explain the underlying phenomenon. The effects of model bias are multiple and can span from omitted variables to partial or unsuitable predictors.

## Goals

In this experiment we show how a sufficiently broad set of very poor (random) predictors can fit a noiseless function to high levels of accuracy, while not being able to capture the underlying process.

This example is based on linear regression, as it is a very simple and intuitive forecasting method, but the results we achieve here easily extend to more complex machine learning methods.

## Pre-Requisites

This article is suitable for mid- to senior-level statisticians and can be proposed to junior levels under due supervision.

We assume the reader is at ease with:

- Mathematical notation
- The core concepts of Linear Regression
- The R-square, what it represents and how it is interpreted
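As a quick refresher, the R-square (coefficient of determination) compares the residual sum of squares to the total sum of squares. A minimal sketch of the computation (the helper `r_squared` is ours, written for illustration, not a library function):

```python
import numpy as np

def r_squared(y, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)      # total sum of squares
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
print(r_squared(y, y))                          # perfect fit -> 1.0
print(r_squared(y, np.full(4, y.mean())))       # mean-only prediction -> 0.0
```

An R-square of 1 means the prediction explains all the variance of the observations; 0 means it does no better than predicting the mean.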

## Method

We generate noiseless observations from a sinusoidal target function.

We generate a set of monotonic predictors by sorting a set of random values in the range [0, 1).
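This step can be sketched in a few lines of NumPy (the seed and the array sizes here are illustrative assumptions, smaller than those used in the experiment):

```python
import numpy as np

rng = np.random.default_rng(0)                  # seed fixed only for reproducibility
raw = rng.uniform(0.0, 1.0, size=(8, 3))        # 8 samples, 3 random predictors
A = np.sort(raw, axis=0)                        # sort each column independently

# Every column is now non-decreasing, i.e. a monotonic predictor
print(np.all(np.diff(A, axis=0) >= 0))
```

Sorting each column independently turns pure noise into monotonically increasing curves, which is the only structural property the experiment relies on.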

We run a linear fit of the synthetic random predictors on the noise-free observations and observe the quality of fit.

$b \approx Ax$

where $A$ is the matrix formed by compounding the random predictors, $b$ is the vector of observations, and $x$ collects the coefficients of the linear model fitted by a linear regression.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# %matplotlib inline  (uncomment when running in a notebook)

n_predictors = 10
n_samples = 64
t_max = 2 * np.pi

# noiseless observations
t = np.linspace(0, t_max, n_samples)
b = np.sin(t)

# random monotonic predictors
A = np.sort(np.random.uniform(0.0, 1.0, size=(n_samples, n_predictors)), axis=0)

# linear model
model = LinearRegression()
model.fit(A, b)
b_pred = model.predict(A)

print('Intercept of the model:', model.intercept_)
print('Coefficients of the model:', model.coef_)
r_sq = model.score(A, b)
print('Coefficient of determination:', r_sq)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))  # 1 row, 2 columns
ax1.plot(t, A)
ax1.set_title('A set of random predictors')
ax1.set_xlabel('t')
ax1.set_ylabel('a.u.')
ax2.plot(t, b)
ax2.plot(t, b_pred)
ax2.set_title('A noiseless sine function and fitted response')
ax2.set_xlabel('t')
ax2.set_ylabel('a.u.')
plt.tight_layout()  # optional: often improves the layout
plt.show()
```

## Results

On the left graph we can observe the predictors: they are somewhat noisy, monotonically increasing functions, whose actual shape strongly depends on the particular realization of the noise. Nothing hints at the fact that they could actually fit a sine function. On the right-hand side we can see the actual noise-free sine function (blue) and the response predicted by the linear regression (orange). Quite surprisingly, the predicted response fits the curve rather well: in most cases the R-square is above 0.90, indicating a very good quality of fit. Also note that the coefficients of the fit are much larger than one would expect. This situation usually points to some issue with the model, most likely omitted-variable bias: the model is not good at explaining the underlying process, so the regression compounds the predictors and exploits the differences in their noise to reproduce the missing variable.
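To check how typical the high in-sample R-square is, one can repeat the experiment over many independent draws of the predictors. A quick sketch (the seed and the number of repetitions are arbitrary assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)                 # seed fixed only for reproducibility
n_predictors, n_samples = 10, 64
t = np.linspace(0, 2 * np.pi, n_samples)
b = np.sin(t)

scores = []
for _ in range(200):
    # fresh realization of the random monotonic predictors
    A = np.sort(rng.uniform(0.0, 1.0, size=(n_samples, n_predictors)), axis=0)
    scores.append(LinearRegression().fit(A, b).score(A, b))

scores = np.array(scores)
print('median in-sample R^2:', np.median(scores))
print('fraction above 0.90 :', np.mean(scores > 0.90))
```

The distribution of in-sample scores gives a better picture than a single run, since each realization of the noise produces a different fit.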

## Discussion

Although the result might look quite unexpected to the novice, it is quite usual for the more experienced practitioner, and it is a direct effect of the high dimensionality of the model. Under fairly mild conditions, given a sufficient number of degrees of freedom (in the case of regression, degrees of freedom are simply predictors), any ML algorithm is "smart" enough to find an apparently good fit to any set of observations, for no reason at all.

The only mild condition we need for this example to work is the monotonicity of the predictors. Therefore, the only little trick we use is sorting the random values so that we obtain monotonic predictors. In our specific case the predictors are also quite correlated, but that is not a necessary condition.

Note that most neural networks have monotonic (usually sigmoidal) response functions. Therefore, they fully qualify to fit (and overfit) any continuous function with extreme ease. Indeed, the higher the complexity of the model, the higher the risk of seeing unexpected behaviours. Hence the need to use a meaningful set of predictors, control for collinearity and correlation, regularize the fit or the learning process, and validate the results out of sample.
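The last point, out-of-sample validation, is easy to demonstrate in this setup: fit the regression on one realization of the random predictors and score it against a fresh, independent draw. A sketch under the same setup as the experiment (the seed is an arbitrary assumption):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)                  # seed fixed only for reproducibility
n_predictors, n_samples = 10, 64
t = np.linspace(0, 2 * np.pi, n_samples)
b = np.sin(t)

def draw_predictors():
    """One realization of the random monotonic predictors."""
    return np.sort(rng.uniform(0.0, 1.0, size=(n_samples, n_predictors)), axis=0)

A_train = draw_predictors()
model = LinearRegression().fit(A_train, b)
print('in-sample R^2     :', model.score(A_train, b))

A_fresh = draw_predictors()                     # independent draw of the same noise
print('out-of-sample R^2 :', model.score(A_fresh, b))
```

Since the predictors carry no real information about the sine function, the quality of fit collapses on the fresh realization: the large coefficients tuned to one draw of the noise amplify the differences in the next one.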

## References

A. N. Kolmogorov, On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition, Dokl. Akad. Nauk SSSR, 114 (1957), 953–956.

## Are you ready to make smarter decisions?

Otherwise, you can always drop a comment…