
Louis Cialdella lmc2179

lmc2179 / window.py
Created September 11, 2020 02:04
SQL-style window function in pandas
import pandas as pd
df = pd.DataFrame({'group': [0, 0, 2, 2], 'values': [1,2,3,4]})
mean_df = df.groupby('group').mean()
mean_df.columns = ['mean']
finished_df = df.merge(mean_df, left_on='group', right_index=True)
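The same window-style column can be produced in one step with `groupby(...).transform`, which is closer to SQL's `AVG(values) OVER (PARTITION BY group)` and avoids the merge; a small sketch:

```python
import pandas as pd

df = pd.DataFrame({'group': [0, 0, 2, 2], 'values': [1, 2, 3, 4]})

# transform broadcasts the per-group mean back onto the original rows,
# like AVG(values) OVER (PARTITION BY group) in SQL
df['mean'] = df.groupby('group')['values'].transform('mean')
```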
lmc2179 / post.md
Created July 8, 2020 01:15
kfoldci post temp

Picking the model with the lowest cross validation error is not enough

TL;DR - We often pick the model with the lowest CV error, but this leaves out valuable information. Specifically, it ignores the uncertainty around the estimated out-of-sample error. It's useful to calculate the standard errors of a CV score, and incorporate this uncertainty into our model selection process. Doing so avoids false precision in model selection and allows us to better balance out-of-sample error with other factors like model complexity.

We build trust in our models by demonstrating that they make good predictions on out-of-sample data. This process, called cross validation, is at the heart of most model evaluation procedures. It allows us to easily compare the performance of any black-box models, though they may have differing structures or assumptions.

Most commonly, methods like K-fold cross validation are used to compute point est…
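The excerpt argues for attaching a standard error to a cross-validation score; a minimal sketch of one way to do that, treating the per-fold scores as a sample (the dataset and model below are illustrative, not from the post):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Per-fold MSE from 10-fold CV; negate because sklearn reports negated losses
fold_scores = -cross_val_score(LinearRegression(), X, y,
                               scoring='neg_mean_squared_error', cv=10)

cv_mean = fold_scores.mean()
# Standard error of the mean across folds (this ignores correlation between
# folds, so it is only a rough uncertainty estimate)
cv_se = fold_scores.std(ddof=1) / np.sqrt(len(fold_scores))
```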

lmc2179 / supervised_unsupervised.py
Created June 29, 2020 02:43
supervised unsupervised learning ESL 14.2.4
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample
x = np.random.normal(0, 1, 5000)
x_sim = np.random.uniform(np.min(x), np.max(x), 5000)
X = np.concatenate((x, x_sim)).reshape(-1, 1)
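The preview stops before the labels and classifier; the ESL 14.2.4 trick is to label the real sample 1 and the uniform reference sample 0, then fit a classifier to tell them apart, so that regions with high predicted probability of "real" are high-density regions. A hedged completion of that idea (the labels and tree settings are my guesses, not the gist's code):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

np.random.seed(0)
x = np.random.normal(0, 1, 5000)                       # observed sample
x_sim = np.random.uniform(np.min(x), np.max(x), 5000)  # uniform reference sample
X = np.concatenate((x, x_sim)).reshape(-1, 1)

# Label real points 1 and reference points 0, then learn to separate them
y = np.concatenate((np.ones(len(x)), np.zeros(len(x_sim))))
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# P(real | x) is high where the data's density exceeds the uniform reference
p_near_zero = clf.predict_proba([[0.0]])[0, 1]
p_in_tail = clf.predict_proba([[np.max(x)]])[0, 1]
```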
lmc2179 / curve_fit_from_summaries.py
Created June 21, 2020 21:40
curve fit from summaries - wls-type with known sigma
import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize
from scipy.stats import multivariate_normal
np.random.seed(0)
x_data = np.array([0, 1, 2])
y_data = np.array([0.9, 2.1, 5])
s = np.array([1, 1, 100])
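The preview ends before the fit itself; one way to finish it is `curve_fit` with the known per-point `sigma` and `absolute_sigma=True`, which down-weights the noisy third point. The linear model below is an assumption, not the gist's actual function:

```python
import numpy as np
from scipy.optimize import curve_fit

x_data = np.array([0, 1, 2])
y_data = np.array([0.9, 2.1, 5])
s = np.array([1, 1, 100])   # known per-point standard deviations

def linear(x, a, b):
    return a * x + b

# sigma + absolute_sigma=True gives weighted least squares with the stated
# noise levels taken at face value, so the third point barely matters
popt, pcov = curve_fit(linear, x_data, y_data, sigma=s, absolute_sigma=True)
a_hat, b_hat = popt
```

The fit essentially interpolates the two precise points, landing near a slope of 1.2 and intercept of 0.9.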
lmc2179 / patsy_b_spline.py
Created June 17, 2020 15:20
b spline patsy
from statsmodels.api import formula as smf
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
x = np.linspace(0, 10, 100)
y = 2*np.sin(x) + x + np.random.normal(0, 1, 100)
df = pd.DataFrame({'x': x, 'y':y})
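The preview stops at the data; the usual next step with the statsmodels formula API is to put a B-spline basis directly in the formula via patsy's `bs()`. A sketch (the `df=6` basis size is an arbitrary choice of mine):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(0)
x = np.linspace(0, 10, 100)
y = 2 * np.sin(x) + x + np.random.normal(0, 1, 100)
df = pd.DataFrame({'x': x, 'y': y})

# patsy expands bs(x, df=6) into a 6-column B-spline basis inside the formula
model = smf.ols('y ~ bs(x, df=6)', df).fit()
y_hat = model.predict(df)
```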
lmc2179 / simple_correlation_matrix_p_value.py
Created June 17, 2020 03:23
simple_correlation_matrix_p_value.py
from scipy.stats import pearsonr
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.datasets import load_boston
def pairwise_correlation_analysis(X, alpha):
# Pairwise pearson correlation analysis of the columns of X, with insignificant correlations dropped,
# and a Bonferroni correction applied.
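The preview cuts off at the docstring; a self-contained sketch of the same idea — pairwise Pearson tests with a Bonferroni-corrected threshold — using synthetic data instead of `load_boston`, which newer scikit-learn versions have removed. The function body is my reconstruction, not the gist's:

```python
import numpy as np
from scipy.stats import pearsonr

def pairwise_correlation_analysis(X, alpha):
    # Pairwise Pearson correlations over the columns of X; correlations whose
    # Bonferroni-corrected p-value exceeds alpha are left as NaN
    n_cols = X.shape[1]
    n_tests = n_cols * (n_cols - 1) // 2
    corr = np.full((n_cols, n_cols), np.nan)
    for i in range(n_cols):
        for j in range(i + 1, n_cols):
            r, p = pearsonr(X[:, i], X[:, j])
            if p * n_tests < alpha:   # Bonferroni correction
                corr[i, j] = corr[j, i] = r
    return corr

np.random.seed(0)
Z = np.random.normal(size=(500, 3))
Z[:, 1] = Z[:, 0] + 0.1 * np.random.normal(size=500)  # strong pair (0, 1)
corr = pairwise_correlation_analysis(Z, alpha=0.05)
```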
lmc2179 / partial correlation analysis
Last active June 15, 2020 21:37
partial correlation analysis
from statsmodels.regression.linear_model import OLS
from statsmodels.tools import add_constant
from statsmodels.stats.multitest import multipletests
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.datasets import load_boston
# Analysis of internal partial correlation structure of a matrix, with appropriate multiple testing correction
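The preview shows only imports and a comment; one standard way to compute a partial correlation is to regress each of the two variables on all the remaining columns and correlate the residuals. A minimal sketch of that step (synthetic data in place of the removed `load_boston`; the helper below is mine, not the gist's):

```python
import numpy as np
from scipy.stats import pearsonr
from statsmodels.regression.linear_model import OLS
from statsmodels.tools import add_constant

def partial_correlation(X, i, j):
    # Correlation between columns i and j after removing the linear
    # effect of every other column
    others = [k for k in range(X.shape[1]) if k not in (i, j)]
    Z = add_constant(X[:, others])
    res_i = OLS(X[:, i], Z).fit().resid
    res_j = OLS(X[:, j], Z).fit().resid
    return pearsonr(res_i, res_j)

np.random.seed(0)
z = np.random.normal(size=1000)
a = z + 0.1 * np.random.normal(size=1000)
b = z + 0.1 * np.random.normal(size=1000)
X = np.column_stack([a, b, z])

r_raw, _ = pearsonr(a, b)                     # large: both driven by z
r_partial, _ = partial_correlation(X, 0, 1)   # near zero once z is controlled for
```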
lmc2179 / compare_with_control.py
Last active June 9, 2020 04:06
standard errors k fold cross validation
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.utils import resample
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from copy import deepcopy
from matplotlib import pyplot as plt
import seaborn as sns
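The preview is imports only; a hedged sketch of what the filename suggests — per-fold errors for a candidate model alongside a `DummyRegressor` control, so the gap and its fold-to-fold spread are both visible. The dataset and models are illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=300, n_features=4, noise=5.0, random_state=0)

def fold_errors(model, X, y, n_splits=5):
    # Out-of-fold MSE for each fold of a K-fold split
    errors = []
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train, test in kf.split(X):
        model.fit(X[train], y[train])
        errors.append(mean_squared_error(y[test], model.predict(X[test])))
    return np.array(errors)

control = fold_errors(DummyRegressor(strategy='mean'), X, y)
candidate = fold_errors(LinearRegression(), X, y)

# Paired per-fold differences; their spread gives a sense of the
# uncertainty in the comparison, not just its point estimate
diff = control - candidate
```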
lmc2179 / variable_root_regression.py
Created June 3, 2020 00:13
variable_root_regression.py
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from scipy.optimize import minimize
n = 1000
x = np.linspace(0, 100, n)
y = 50*np.sqrt(x) + x + np.random.normal(0, 1, n)
plt.scatter(x, y)
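The preview ends at the scatter plot; the `norm` and `minimize` imports suggest fitting the variable root by maximum likelihood. A sketch of one way to do that, minimizing the Gaussian negative log-likelihood of an assumed mean function `a*x**p + b*x` (the model form and starting values are guesses from the data-generating line above, not the gist's code):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

np.random.seed(0)
n = 1000
x = np.linspace(0, 100, n)
y = 50 * np.sqrt(x) + x + np.random.normal(0, 1, n)

def neg_log_likelihood(params):
    # Gaussian NLL for the assumed mean function a*x**p + b*x;
    # sigma is parameterized on the log scale so it stays positive
    a, p, b, log_sigma = params
    mu = a * x**p + b * x
    return -norm.logpdf(y, mu, np.exp(log_sigma)).sum()

x0 = np.array([48.0, 0.52, 1.2, 0.5])  # rough starting guess
result = minimize(neg_log_likelihood, x0, method='Nelder-Mead',
                  options={'maxiter': 20000, 'maxfev': 20000})
a_hat, p_hat, b_hat, log_sigma_hat = result.x
```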
lmc2179 / standard_error_curve_fit.py
Created May 29, 2020 23:56
scipy curve fit standard errors
import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize
from scipy.stats import multivariate_normal
np.random.seed(0)
x_data = np.linspace(-5, 5, num=50)
y_data = 2.9 * np.sin(1.5 * x_data) + np.random.normal(size=50)
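The preview stops at the data; `curve_fit` already returns the estimated parameter covariance as its second output, and the square roots of its diagonal are the standard errors. A completion sketch (the model function and starting values are mine):

```python
import numpy as np
from scipy.optimize import curve_fit

np.random.seed(0)
x_data = np.linspace(-5, 5, num=50)
y_data = 2.9 * np.sin(1.5 * x_data) + np.random.normal(size=50)

def model(x, a, b):
    return a * np.sin(b * x)

# pcov is the estimated covariance matrix of (a, b); the square roots
# of its diagonal are the parameter standard errors
popt, pcov = curve_fit(model, x_data, y_data, p0=[2, 1.5])
standard_errors = np.sqrt(np.diag(pcov))
a_hat, b_hat = popt
```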