TL;DR - We often pick the model with the lowest CV error, but this leaves out valuable information. Specifically, it ignores the uncertainty around the estimated out-of-sample error. It's useful to calculate the standard error of a CV score and incorporate this uncertainty into our model selection process. Doing so avoids false precision in model selection and allows us to better balance out-of-sample error with other factors like model complexity.
We build trust in our models by demonstrating that they make good predictions on out-of-sample data. This process, called cross validation, is at the heart of most model evaluation procedures. It allows us to easily compare the performance of black-box models, even when they have differing structures or assumptions.
Most commonly, methods like K-fold cross validation are used to compute point estimates of out-of-sample error.
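To make this concrete, here is a minimal sketch of K-fold cross validation that reports not just the mean CV score but also its standard error across folds. The `fit`/`predict` callables and the `kfold_cv_score_with_se` helper are hypothetical names for illustration; any black-box model could be plugged in. The standard error here is the naive one (fold standard deviation over the square root of K), which ignores correlation between folds but is the common starting point.

```python
import numpy as np

def kfold_cv_score_with_se(X, y, fit, predict, k=5, seed=0):
    """K-fold CV: return the mean squared error and its standard error.

    fit(X, y) -> model and predict(model, X) -> predictions are
    stand-ins for any black-box model.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    fold_errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])
        preds = predict(model, X[test_idx])
        fold_errors.append(np.mean((preds - y[test_idx]) ** 2))
    fold_errors = np.array(fold_errors)
    # Naive standard error of the mean CV score across the k folds
    return fold_errors.mean(), fold_errors.std(ddof=1) / np.sqrt(k)

# Example: ordinary least squares as the black-box model
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda beta, X: X @ beta

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=200)
score, se = kfold_cv_score_with_se(X, y, fit, predict, k=5)
```

With the score and its standard error in hand, two models whose CV scores differ by less than a standard error or two can reasonably be treated as tied, freeing us to prefer the simpler one.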