import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pymc as pm
import arviz as az
seed = 42
rng = np.random.default_rng(seed)
INSEA 2025 Bayesian Statistics
TP7: Bayesian logistic regression for churn prediction
Author: Hicham Janati
Objectives:
- Apply a Bayesian model to perform binary prediction
- Understand how Bayesian modeling helps avoid underfitting and overfitting
- Compare Bayesian and frequentist approaches
1. Problem Statement
In companies that offer services (such as a mobile phone operator), customer retention is a major challenge. The churn rate refers to the percentage of customers who decide to cancel their subscription (e.g., to switch to another provider). If the company can predict which customers are likely to churn, it can take proactive steps — such as offering additional services or special deals — to retain them.
We will work with a real dataset from a telecom operator (filtered and adapted from: Kaggle).
df = pd.read_csv("http://hichamjanati.github.io/data/churn.csv", index_col=0)
df.head()
 | Dependents | TechSupport | Contract | InternetService | CustomerID_Region | MonthlyCharges | Months | Churn |
---|---|---|---|---|---|---|---|---|
0 | Yes | No | One year | Fiber optic | MIS-1 | 78.95 | 34.0 | 0 |
1 | Yes | Yes | Two year | DSL | DAL-1 | 85.95 | 70.0 | 0 |
2 | No | Yes | Two year | Fiber optic | SAN-1 | 104.00 | 69.0 | 0 |
3 | No | No internet service | Month-to-month | No | HOU-1 | 20.55 | 5.0 | 0 |
4 | Yes | Yes | Two year | Fiber optic | HOU-1 | 113.10 | 72.0 | 0 |
df.columns
Index(['Dependents', 'TechSupport', 'Contract', 'InternetService',
'CustomerID_Region', 'MonthlyCharges', 'Months', 'Churn'],
dtype='object')
# get the X and y variables
y = df.Churn
X = df.drop("Churn", axis=1)
Let's check how many instances of each class we have:
y.value_counts()
Churn
0 100
1 100
Name: count, dtype: int64
This binary classification task is balanced. (In practice, churn rates are significantly lower: real-world churn prediction is heavily imbalanced. I made the problem easier here by resampling from the original data for the sake of simplicity.) Let's check the types of the variables we have:
X.dtypes
Dependents object
TechSupport object
Contract object
InternetService object
CustomerID_Region object
MonthlyCharges float64
Months float64
dtype: object
X.Contract.value_counts()
Contract
Month-to-month 133
One year 37
Two year 30
Name: count, dtype: int64
2. Data preprocessing
2.1 One-hot encoding / dummy variables
We need to pre-process the data: turn the categorical variables into binary dummy variables. This is called one-hot encoding. Here is how it goes: for a variable like Contract, which takes the values Month-to-month, One year or Two year (three categories), we can transform it into 3 binary variables:
… | Contract | … |
---|---|---|
… | Month-to-Month | … |
… | One Year | … |
… | Two Year | … |
… | Month-to-Month | … |
… | Two Year | … |
… | Two Year | … |
↓
… | Month-to-Month | One Year | Two Year | … |
---|---|---|---|---|
… | 1 | 0 | 0 | … |
… | 0 | 1 | 0 | … |
… | 0 | 0 | 1 | … |
… | 1 | 0 | 0 | … |
… | 0 | 0 | 1 | … |
… | 0 | 0 | 1 | … |
These binary variables are called dummy variables, or the one-hot encoding of Contract. However, notice that the sum of these columns is always 1: the 3 binary variables are linearly dependent, which is bad for linear models (particularly if no regularization is used). This is called the dummy variable trap. We should drop one of the columns to avoid it. Thus, a categorical variable with K categories is transformed into K-1 binary variables. We can do this with the sklearn transformer OneHotEncoder:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(drop='first')
encoded_data = encoder.fit_transform(df[["Contract"]])
encoded_data
<200x2 sparse matrix of type '<class 'numpy.float64'>'
with 67 stored elements in Compressed Sparse Row format>
The output is a sparse matrix (CSR), which is a compressed way of storing matrices with lots of zeros: instead of storing all the entries, we only keep the necessary information in memory (e.g. the triplets (i, j, v) such that M[i, j] = v and v is not zero). Here the data is small, so there is no need to use sparse matrices:
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoded_data = encoder.fit_transform(X[["Contract"]])
encoded_data[:10]
array([[1., 0.],
[0., 1.],
[0., 1.],
[0., 0.],
[0., 1.],
[0., 0.],
[1., 0.],
[1., 0.],
[0., 0.],
[0., 0.]])
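As a side note (not in the original notebook), the same drop-first encoding can also be obtained directly in pandas with get_dummies; a minimal sketch:

# illustrative alternative to OneHotEncoder: drop-first dummy encoding with pandas
contract_dummies = pd.get_dummies(X["Contract"], prefix="Contract", drop_first=True, dtype=float)
contract_dummies.head()

sklearn's OneHotEncoder remains preferable in a modeling workflow because it can be fit on the training data and then reused, with the same columns, on new data.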
We can apply this to all categorical variables:
encoder = OneHotEncoder(drop='first', sparse_output=False)
categorical_features = ["Dependents", "TechSupport", "Contract", "InternetService", "CustomerID_Region"]
encoded_data = encoder.fit_transform(X[categorical_features])
encoded_data[:10]
array([[1., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
0., 0.],
[1., 0., 1., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
0., 0.],
[0., 0., 1., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
1., 0.],
[0., 1., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
0., 0.],
[1., 0., 1., 0., 1., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
0., 0.],
[0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
0., 0.],
[0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
0., 0.],
[1., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
0., 0.],
[0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0.],
[0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0.]])
But later we will have a regression coefficient for each one of these columns; how do we know which coefficient belongs to which column? We can get their names from the encoder:
encoder.get_feature_names_out(categorical_features)
array(['Dependents_Yes', 'TechSupport_No internet service',
'TechSupport_Yes', 'Contract_One year', 'Contract_Two year',
'InternetService_Fiber optic', 'InternetService_No',
'CustomerID_Region_CHI-1', 'CustomerID_Region_DAL-1',
'CustomerID_Region_HOU-1', 'CustomerID_Region_LAX-1',
'CustomerID_Region_MIA-1', 'CustomerID_Region_MIS-1',
'CustomerID_Region_NYC-1', 'CustomerID_Region_PHL-1',
'CustomerID_Region_PHX-1', 'CustomerID_Region_SAN-1',
'CustomerID_Region_SEA-1'], dtype=object)
2.2 Scaling numerical features
We also have continuous variables (numerical features): Months and MonthlyCharges. As explained in the last class, it’s important for them to have the same scale (order of magnitude) in a linear model. Scaling a variable means centering it and dividing by its standard deviation:
= X["Months"]
variable = variable - variable.mean()
variable = variable / variable.std()
variable variable.head()
0 0.326288
1 1.850800
2 1.808453
3 -0.901791
4 1.935495
Name: Months, dtype: float64
It is tedious to do this for each variable; the best way is to use the sklearn transformer StandardScaler:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
numeric_features = ["MonthlyCharges", "Months"]
scaled_data = scaler.fit_transform(X[numeric_features])
scaled_data[:10]
array([[ 0.37846895, 0.32710679],
[ 0.63721955, 1.85544479],
[ 1.30442645, 1.81299095],
[-1.78025031, -0.90405438],
[ 1.64080222, 1.94035245],
[ 0.56698725, -1.03141588],
[ 1.37835519, 0.58182979],
[ 1.04937229, 0.66673745],
[ 0.19734354, -0.77669288],
[ 0.79062169, -0.73423905]])
2.3 Merging both in one transformer:
We can handle both categorical and numerical features in one ColumnTransformer object:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
= ["Dependents", "TechSupport", "Contract", "InternetService", "CustomerID_Region"]
categorical_features = ["MonthlyCharges", "Months"]
numeric_features
= OneHotEncoder(drop="first", sparse_output=False)
categorical_transformer = StandardScaler()
numeric_transformer
= ColumnTransformer(
preprocessor =[
transformers'cat', categorical_transformer, categorical_features),
('num', numeric_transformer, numeric_features),
(
],
)
= preprocessor.fit_transform(X)
transformed_data print(f"The shape of the transformed data is {transformed_data.shape}")
# we construct a new pandas with the column names:
= preprocessor.get_feature_names_out()
column_names = pd.DataFrame(transformed_data, columns=column_names)
transformed_df transformed_df.head()
The shape of the transformed data is (200, 20)
 | cat__Dependents_Yes | cat__TechSupport_No internet service | cat__TechSupport_Yes | cat__Contract_One year | cat__Contract_Two year | cat__InternetService_Fiber optic | cat__InternetService_No | cat__CustomerID_Region_CHI-1 | cat__CustomerID_Region_DAL-1 | cat__CustomerID_Region_HOU-1 | cat__CustomerID_Region_LAX-1 | cat__CustomerID_Region_MIA-1 | cat__CustomerID_Region_MIS-1 | cat__CustomerID_Region_NYC-1 | cat__CustomerID_Region_PHL-1 | cat__CustomerID_Region_PHX-1 | cat__CustomerID_Region_SAN-1 | cat__CustomerID_Region_SEA-1 | num__MonthlyCharges | num__Months |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.378469 | 0.327107 |
1 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.637220 | 1.855445 |
2 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.304426 | 1.812991 |
3 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.780250 | -0.904054 |
4 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.640802 | 1.940352 |
2.4 Preprocessing and the train-test split
When developing a machine learning model, we always follow the train-test paradigm: we split the data into train and test sets. Therefore, when transforming / encoding / scaling the variables, we should only use the train data; otherwise information from the test data leaks into the model (for example, the scaling operation would use test samples to compute the mean and std). The column transformer should thus be fit on the train data only, and then we apply its transform method to the test data:
from sklearn.model_selection import train_test_split
= ["Dependents", "TechSupport", "Contract", "InternetService", "CustomerID_Region"]
categorical_features = ["MonthlyCharges", "Months"]
numeric_features
= OneHotEncoder(drop="first", sparse_output=False)
categorical_transformer = StandardScaler()
numeric_transformer
= ColumnTransformer(
preprocessor =[
transformers'cat', categorical_transformer, categorical_features),
('num', numeric_transformer, numeric_features),
(
],
)
= train_test_split(df, y, test_size=0.3, random_state=42, stratify=y)
X_train, X_test, y_train, y_test
= preprocessor.fit_transform(X_train)
X_train_processed
= preprocessor.transform(X_test)
X_test_processed
# we get the column names:
= preprocessor.get_feature_names_out() column_names
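As an optional aside (not in the original notebook), the preprocessing and a classifier can also be chained in a single sklearn Pipeline (imported earlier), which enforces the fit-on-train / transform-on-test discipline automatically. A minimal sketch, where the LogisticRegression settings are purely illustrative:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# chain the preprocessing and the classifier: fit() fits both on the train data,
# predict()/score() apply the stored transformations to new data before predicting
clf = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", LogisticRegression(penalty=None, max_iter=1000)),
])
clf.fit(X_train, y_train)
print(f"Pipeline test accuracy: {clf.score(X_test, y_test):.4f}")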
3. Logistic regression
We can now start fitting ML models. First, we use logistic regression without regularization:
3.1 Without regularization
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# we can fit the logistic regression model with no regularization:
model = LogisticRegression(penalty=None)
model.fit(X_train_processed, y_train.values)

y_train_pred = model.predict(X_train_processed)
y_test_pred = model.predict(X_test_processed)
# Evaluate the model accuracy
print(f"Training accuracy: {accuracy_score(y_train_pred, y_train):.4f}")
print(f"Test accuracy: {accuracy_score(y_test_pred, y_test):.4f}")
Training accuracy: 0.8357
Test accuracy: 0.6500
We can see that the model isn’t quite good: 1. 83% accuracy on the training data means that the model isnt complex enough to predict labels it has already seen 2. 65% accuracy on the test data suggest a big difference between train and test: the models learns a bit of noise in the training data
We can extract the coefficients and visualize their values:
coef = pd.DataFrame(model.coef_, columns=column_names)
coef.T.plot(kind="bar", legend=False)
plt.grid()
plt.show()
We can extract the change in the odds ratios and visualize their importance in percentages (see TD6):
odds_changes = (np.exp(model.coef_) - 1) * 100
coef = pd.DataFrame(odds_changes, columns=column_names)
coef.T.plot(kind="bar", legend=False)
plt.grid()
plt.show()

odds_changes
array([[ 184.33978368, -73.24588216, 10.11408019, -69.01164118,
-42.3604937 , 1867.04450222, -73.24588216, -57.30827598,
-67.89164809, -27.90134379, 87.71595551, -47.99574828,
-49.09364831, -69.35050994, 10.37215112, 190.19346862,
-73.42643779, 875.62186833, -54.79865072, -77.0485994 ]])
"Months"].std(), X["MonthlyCharges"].std() X[
(23.61410818827673, 27.120965039260753)
Some odds percentage changes are off the charts! A 1867% increase with Fiber optic and 875% if the customer is in the region SEA-1! Given that the performance of the model is poor and the coefficient values are very large, it is very likely the model has learned noise: regularization is needed.
3.2 With L2 / L1 regularization
We can now try to improve the model by adding regularization. We can use the LogisticRegressionCV
class which will automatically find the best regularization parameter for us:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score
# we fit the logistic regression model with L2 regularization, choosing C by cross-validation:
model = LogisticRegressionCV(Cs=np.logspace(-3, 3, 100), penalty="l2")
model.fit(X_train_processed, y_train.values)

y_train_pred = model.predict(X_train_processed)
y_test_pred = model.predict(X_test_processed)
# Evaluate the model accuracy
print(f"Training accuracy: {accuracy_score(y_train_pred, y_train):.4f}")
print(f"Test accuracy: {accuracy_score(y_test_pred, y_test):.4f}")
Training accuracy: 0.7857
Test accuracy: 0.7167
The test accuracy is slightly improved. The coefficients look very similar:
coef = pd.DataFrame(model.coef_, columns=column_names)
coef.T.plot(kind="bar", legend=False)
plt.grid()
plt.show()

odds_changes = (np.exp(model.coef_) - 1) * 100
coef = pd.DataFrame(odds_changes, columns=column_names)
coef.T.plot(kind="bar", legend=False)
plt.grid()
plt.show()
"Months"].std(), X["MonthlyCharges"].std() X[
(23.61410818827673, 27.120965039260753)
These values are more reasonable! We can see that the most impactful features are the contract type, the months, the monthly charges, InternetService with Fiber optic and the SEA-1 region:
1. The longer the subscription (contract type and months), the less likely the customer is to churn.
2. In particular, each additional "Months std" (= 23.6 months) corresponds to about -45% on the odds.
3. Having Fiber optic internet / high monthly charges makes the customer more likely to churn.
4. In particular, the odds of churning are about 26% higher if the customer has fiber: this could perhaps reflect more competition between providers.
5. For each additional 27$ per month, the odds of churning increase by about 23.65%.
6. Finally, being in SEA-1 increases the odds by about 11%.
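To make the arithmetic explicit, here is a small sketch of how such percentages are obtained from the coefficient of a standardized feature (the coefficient value below is illustrative, not read from the fit above):

# illustrative: a coefficient of about -0.6 on the standardized Months feature
beta_months = -0.6               # hypothetical coefficient (per one std of Months)
months_std = X["Months"].std()   # ~23.6 months

# odds change per one standard deviation of Months
odds_change_per_std = (np.exp(beta_months) - 1) * 100
# odds change per single month (rescale the coefficient before exponentiating)
odds_change_per_month = (np.exp(beta_months / months_std) - 1) * 100

print(f"{odds_change_per_std:.1f}% odds change per {months_std:.1f} months")
print(f"{odds_change_per_month:.2f}% odds change per month")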
We can also try L1 regularization for sparse coefficients (feature selection):
model = LogisticRegressionCV(Cs=np.logspace(-4, 4, 100), penalty="l1", solver="liblinear")
model.fit(X_train_processed, y_train.values)

y_train_pred = model.predict(X_train_processed)
y_test_pred = model.predict(X_test_processed)
# Evaluate the model accuracy
print(f"Training accuracy: {accuracy_score(y_train_pred, y_train):.4f}")
print(f"Test accuracy: {accuracy_score(y_test_pred, y_test):.4f}")
Training accuracy: 0.8429
Test accuracy: 0.6667
coef = pd.DataFrame(model.coef_, columns=column_names)
coef.T.plot(kind="bar", legend=False)
plt.grid()
plt.show()

coef = pd.DataFrame(100 * (np.exp(model.coef_) - 1), columns=column_names)
coef.T.plot(kind="bar", legend=False)
plt.grid()
plt.show()
The model is similar to the unregularized case. We have 20 dimensions for 140 training observations: feature selection is perhaps not a great idea here since the dimension is not that large compared to the number of observations; the Lasso keeps the largest coefficients and shrinks the others to (near) 0.
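We can check how sparse the selected L1 solution actually is (a small sketch; the exact count depends on the C value picked by cross-validation):

# count how many coefficients the L1-regularized fit set exactly to zero
n_zero = int(np.sum(model.coef_.ravel() == 0))
print(f"{n_zero} out of {model.coef_.size} coefficients are exactly zero")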
With all these models, we can obtain the predicted probability (sigmoid) for, e.g., 10 samples using:
model.predict_proba(X_test_processed[:10])
array([[0.22842486, 0.77157514],
[0.29932415, 0.70067585],
[0.22107175, 0.77892825],
[0.45511437, 0.54488563],
[0.28544249, 0.71455751],
[0.85739176, 0.14260824],
[0.67825299, 0.32174701],
[0.91052171, 0.08947829],
[0.78664728, 0.21335272],
[0.50537597, 0.49462403]])
The sigmoid probability of the model corresponds to the second column of the output above. For each sample, the model outputs a vector of probabilities that sums to 1. We get the predicted class using the argmax, or by comparing the second column to 0.5:
model.predict_proba(X_test_processed[:10]).argmax(axis=1)
array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
(model.predict_proba(X_test_processed[:10])[:, 1] > 0.5).astype(int)
array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
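Accuracy alone hides which kinds of errors the model makes; as an optional addition (not in the original notebook), a confusion matrix on the test set separates false positives from false negatives:

from sklearn.metrics import confusion_matrix

# rows: true class (0 = stays, 1 = churns), columns: predicted class
cm = confusion_matrix(y_test, model.predict(X_test_processed))
print(cm)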
4. Bayesian logistic regression
We can now build a Bayesian logistic regression with a Gaussian prior using pymc:
4.1 Fitting the model with MCMC
import pymc as pm
import arviz as az
import seaborn as sns
# Build the model
n_mcmc_samples = 1000
coords = dict(var_names=column_names)

with pm.Model(coords=coords) as logistic_model:
    # Priors for weights and intercept
    sigma = pm.HalfCauchy('sigma', beta=1)
    intercept = pm.Normal('intercept', mu=0, sigma=sigma)
    betas = pm.Normal('betas', mu=0, sigma=sigma, shape=X_train_processed.shape[1])

    # Linear predictor
    mu = pm.math.dot(X_train_processed, betas) + intercept

    # Likelihood (observed outcome)
    theta = pm.math.sigmoid(mu)
    y_obs = pm.Bernoulli('y_obs', p=theta, observed=y_train)

    # Sample from posterior
    trace = pm.sample(n_mcmc_samples, tune=1000, return_inferencedata=True)

az.summary(trace)
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, intercept, betas]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 2 seconds.
 | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
---|---|---|---|---|---|---|---|---|---|
intercept | -0.288 | 0.438 | -1.101 | 0.542 | 0.008 | 0.007 | 2776.0 | 2395.0 | 1.0 |
betas[0] | 0.345 | 0.487 | -0.564 | 1.270 | 0.008 | 0.007 | 4078.0 | 2611.0 | 1.0 |
betas[1] | -0.544 | 0.733 | -1.884 | 0.875 | 0.012 | 0.011 | 3709.0 | 2799.0 | 1.0 |
betas[2] | -0.066 | 0.457 | -0.935 | 0.788 | 0.007 | 0.007 | 4514.0 | 2947.0 | 1.0 |
betas[3] | -0.838 | 0.554 | -1.978 | 0.110 | 0.009 | 0.007 | 3821.0 | 3307.0 | 1.0 |
betas[4] | -0.381 | 0.634 | -1.627 | 0.768 | 0.009 | 0.009 | 4762.0 | 3000.0 | 1.0 |
betas[5] | 1.041 | 0.618 | -0.081 | 2.221 | 0.014 | 0.010 | 2181.0 | 2661.0 | 1.0 |
betas[6] | -0.527 | 0.751 | -1.927 | 0.908 | 0.012 | 0.011 | 3877.0 | 2795.0 | 1.0 |
betas[7] | -0.058 | 0.688 | -1.452 | 1.156 | 0.010 | 0.011 | 4470.0 | 2692.0 | 1.0 |
betas[8] | -0.581 | 0.590 | -1.675 | 0.533 | 0.010 | 0.008 | 3326.0 | 3316.0 | 1.0 |
betas[9] | -0.242 | 0.628 | -1.407 | 0.969 | 0.009 | 0.010 | 5203.0 | 2794.0 | 1.0 |
betas[10] | 0.250 | 0.626 | -0.921 | 1.442 | 0.008 | 0.011 | 6243.0 | 2790.0 | 1.0 |
betas[11] | -0.174 | 0.596 | -1.304 | 0.934 | 0.008 | 0.009 | 5396.0 | 3216.0 | 1.0 |
betas[12] | -0.254 | 0.556 | -1.341 | 0.760 | 0.008 | 0.008 | 4633.0 | 2792.0 | 1.0 |
betas[13] | -0.555 | 0.563 | -1.572 | 0.517 | 0.010 | 0.008 | 3570.0 | 2551.0 | 1.0 |
betas[14] | 0.069 | 0.605 | -1.032 | 1.308 | 0.009 | 0.010 | 4284.0 | 2719.0 | 1.0 |
betas[15] | 0.511 | 0.619 | -0.663 | 1.683 | 0.009 | 0.009 | 4406.0 | 2800.0 | 1.0 |
betas[16] | -0.596 | 0.588 | -1.736 | 0.443 | 0.010 | 0.008 | 3500.0 | 2826.0 | 1.0 |
betas[17] | 0.978 | 0.645 | -0.129 | 2.278 | 0.013 | 0.009 | 2661.0 | 3099.0 | 1.0 |
betas[18] | 0.215 | 0.381 | -0.521 | 0.927 | 0.007 | 0.005 | 3335.0 | 2299.0 | 1.0 |
betas[19] | -1.345 | 0.328 | -1.953 | -0.729 | 0.006 | 0.004 | 2985.0 | 2600.0 | 1.0 |
sigma | 0.828 | 0.234 | 0.439 | 1.262 | 0.007 | 0.005 | 1155.0 | 1620.0 | 1.0 |
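Besides reading the summary table, we can inspect the chains and count divergent transitions programmatically; a quick sketch using standard ArviZ/PyMC tools:

# trace plots for a visual check that the 4 chains mix well
az.plot_trace(trace, var_names=["intercept", "sigma"])
plt.show()

# number of divergent transitions recorded by NUTS (should be 0 or very small)
n_divergences = int(trace.sample_stats["diverging"].sum())
print(f"Number of divergences: {n_divergences}")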
No warnings, all r_hat values are equal to 1, the ESS values are all very large, and there are no divergence red flags: the MCMC chains pass all diagnostics. To interpret the coefficients, we should check their HDIs. If an HDI contains 0, then 0 is included in the 94% credible interval, and the coefficient is not statistically different from 0:
az.plot_forest(trace)
plt.grid()
plt.vlines(0, 0, ymax=100, color="red")
plt.show()
None of the parameters are statistically different from 0 except sigma and betas[19], which corresponds to the last variable, "Months":
column_names[19]
'num__Months'
100 * (np.exp(-1.352) - 1)
-74.127770174036
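We can also extract the same information programmatically instead of reading it off the forest plot; a sketch using az.hdi (94% intervals by default):

# compute the 94% HDI of each coefficient and flag those whose interval excludes 0
hdi = az.hdi(trace, var_names=["betas"])["betas"]   # shape: (n_coefficients, 2)
excludes_zero = (hdi.sel(hdi="lower") > 0) | (hdi.sel(hdi="higher") < 0)
print(np.array(column_names)[excludes_zero.values])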
4.2 Making predictions
How do we make predictions with this model? We can use the MCMC samples of beta to compute the sigmoid probabilities on the test data:
= trace.posterior["betas"].values.reshape(-1, 20)
beta_samples beta_samples.shape
(4000, 20)
= trace.posterior["intercept"].values.reshape(-1, 1)
intercept_samples intercept_samples.shape
(4000, 1)
from scipy.special import expit as sigmoid
def get_bayes_probas(X, trace):
    beta_samples = trace.posterior["betas"].values.reshape(-1, 20)
    # matrix of size 4000 x 20
    intercept_samples = trace.posterior["intercept"].values.reshape(1, -1)
    # matrix of size 1 x 4000
    # X is of size n_samples x 20, so we transpose beta_samples to get a 20 x 4000 matrix;
    # the product X @ beta_samples.T is then n_samples x 4000, which broadcasts with
    # intercept_samples of size 1 x 4000
    logits = X.dot(beta_samples.T) + intercept_samples
    # logits is a matrix of size n_samples x 4000
    return sigmoid(logits)
probas_bayes_train = get_bayes_probas(X_train_processed, trace)
probas_bayes_test = get_bayes_probas(X_test_processed, trace)
probas_bayes_train.shape, probas_bayes_test.shape
((140, 4000), (60, 4000))
We have 4000 different predicted probabilities for each of the 60 test samples; we can compute the average prediction and the standard deviation to evaluate our uncertainty:
mean_proba_bayes_train = probas_bayes_train.mean(axis=1)
std_proba_bayes_train = probas_bayes_train.std(axis=1)

mean_proba_bayes_test = probas_bayes_test.mean(axis=1)
std_proba_bayes_test = probas_bayes_test.std(axis=1)

bayes_predictions_train = (mean_proba_bayes_train > 0.5).astype(int)
bayes_predictions_test = (mean_proba_bayes_test > 0.5).astype(int)
print(f"Training accuracy: {accuracy_score(y_train, bayes_predictions_train):.4f}")
print(f"Test accuracy: {accuracy_score(y_test, bayes_predictions_test):.4f}")
Training accuracy: 0.8071
Test accuracy: 0.6833
The performance is similar to, or even a bit worse than, the frequentist approach (Ridge). Why go through the trouble of MCMC then? Because we can also compute uncertainties around those mean predictions from the MCMC samples. For example, with the frequentist approach we would get a single probability of churn, say 0.8. With the Bayesian approach we have 4000 probabilities of churn for each sample; assume their mean is also 0.8. With the 4000 MCMC samples we can also compute an HDI of those probabilities. If the HDI is very wide, say [0.3, 1.0], then we cannot say for sure that 0.8 is statistically meaningful. If however the HDI is [0.7, 0.9] (far from 0.5), then we are much more confident in our prediction.
In practice, the company does not want many false positives (predicting churn for customers who are actually satisfied and won't leave) because it costs money (ads, promo deals to retain them, ...). So it might use the Bayesian approach to only target the customers with predicted churn and high certainty. For the predicted churners with low certainty, it may send a satisfaction survey to become more certain.
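A minimal sketch of this targeting idea: compute a 94% interval of the posterior churn probability for each test customer (here an equal-tailed interval from percentiles, as a simple stand-in for the HDI) and keep only the predictions whose whole interval sits above an arbitrary illustrative threshold of 0.7:

# 3% and 97% posterior quantiles of the churn probability for each test sample
lower = np.percentile(probas_bayes_test, 3, axis=1)
upper = np.percentile(probas_bayes_test, 97, axis=1)

# flag customers predicted to churn with high certainty: the whole interval is above 0.7
confident_churners = lower > 0.7
print(f"{confident_churners.sum()} of {len(lower)} test customers are confident churn predictions")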
5. How I manipulated the data
To illustrate regularization, I truncated the data to only 200 samples (from 10K samples in the Kaggle dataset) and kept only a few variables, otherwise MCMC would be too slow. I also added fake variables: the region variable is purely random, completely unrelated to the churn variable. Yet the unregularized regression (and the Lasso) found a large coefficient for one of the regions! This illustrates how models with little data can learn noise and lead to wrong interpretations of the coefficients; L2 regularization (and the Bayesian approach), however, correctly reduced their amplitudes.
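To see that Region indeed carries no signal, one quick check (not in the original notebook) is a chi-square test of independence between the region and the churn label; a sketch (the exact p-value depends on the resampled data, but it should be far from significant):

from scipy.stats import chi2_contingency

# contingency table of region vs churn, then a chi-square test of independence
contingency = pd.crosstab(df["CustomerID_Region"], df["Churn"])
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.3f}")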
Let’s remove the region variable and see what happens. Before, we obtained with the unregularized model using the Regions:
- Training accuracy: 0.8357
- Test accuracy: 0.6500
= ["Dependents", "TechSupport", "Contract", "InternetService"]
categorical_features = ["MonthlyCharges", "Months"]
numeric_features
= OneHotEncoder(drop="first", sparse_output=False)
categorical_transformer = StandardScaler()
numeric_transformer
= ColumnTransformer(
preprocessor =[
transformers'cat', categorical_transformer, categorical_features),
('num', numeric_transformer, numeric_features),
(
],
)
= train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
X_train, X_test, y_train, y_test
= preprocessor.fit_transform(X_train)
X_train_processed
= preprocessor.transform(X_test)
X_test_processed
# we get the column names:
= preprocessor.get_feature_names_out() column_names
# we can fit the logistic regression model with no regularization:
model = LogisticRegression(penalty=None)
model.fit(X_train_processed, y_train.values)

y_train_pred = model.predict(X_train_processed)
y_test_pred = model.predict(X_test_processed)
# Evaluate the model accuracy
print(f"Training accuracy: {accuracy_score(y_train_pred, y_train):.4f}")
print(f"Test accuracy: {accuracy_score(y_test_pred, y_test):.4f}")
Training accuracy: 0.8214
Test accuracy: 0.6833
Slightly lower train accuracy, higher test accuracy: the model's overfitting is reduced a little.