The data is based on the marketing campaign efforts of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y).
Source: [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from scipy import stats
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)
raw_data = pd.read_csv('Bank-data.csv')
raw_data
Note that interest_rate indicates the 3-month interest rate between banks and duration indicates the time since the last contact was made with a given consumer. The previous variable shows whether the last marketing campaign was successful with this customer. The march and may variables are Boolean and account for when the call was made to the specific customer, and credit shows if the customer has enough credit to avoid defaulting.
We want to know whether the bank marketing strategy was successful, so we need to transform the outcome variable into Boolean values in order to run regressions.
# We make sure to create a copy of the data before we start altering it. Note that we don't change the original data we loaded.
data = raw_data.copy()
# Removes the index column that comes with the data
data = data.drop(['Unnamed: 0'], axis = 1)
# We use the map function to change any 'yes' values to 1 and 'no' values to 0.
data['y'] = data['y'].map({'yes':1, 'no':0})
data
data.describe()
Use 'duration' as the independent variable.
y = data['y']
x1 = data['duration']
Run the regression and graph the scatter plot.
x = sm.add_constant(x1)
reg_log = sm.Logit(y,x)
results_log = reg_log.fit()
# Get the regression summary
results_log.summary()
# Create a scatter plot of x1 (Duration, no constant) and y (Subscribed)
plt.scatter(x1,y,color = 'C0')
# Label our axes
plt.xlabel('Duration', fontsize = 20)
plt.ylabel('Subscription', fontsize = 20)
plt.show()
# The odds ratio for duration is the exponential of its coefficient (the log odds) from the summary table
np.exp(0.0051)
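To make the interpretation concrete, here is a minimal sketch of how a logit coefficient translates into an odds ratio. The coefficient 0.0051 is the value used above; the helper name odds_ratio and the 100-unit change are illustrative assumptions, not part of the original analysis.

```python
import math

def odds_ratio(coef, delta=1.0):
    """Multiplicative change in the odds when the predictor rises by `delta` units.

    Hypothetical helper for illustration; `coef` is a logit coefficient (log odds).
    """
    return math.exp(coef * delta)

# Coefficient of duration as read from the summary table above
duration_coef = 0.0051

# Odds multiplier for a one-unit increase in duration
per_unit = odds_ratio(duration_coef)

# Odds multiplier for a 100-unit increase (assumed change, for illustration)
per_100_units = odds_ratio(duration_coef, delta=100)

print(round(per_unit, 4), round(per_100_units, 4))
```

Because the effect is multiplicative, the 100-unit multiplier equals the one-unit multiplier raised to the 100th power rather than 100 times the one-unit effect.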
We may be omitting many causal factors in our simple logistic model, so we switch to a multivariate logistic regression model instead. Add the 'interest_rate', 'march', 'credit' and 'previous' estimators to our model and run the regression again.
# To avoid writing them out every time, we save the names of the estimators of our model in a list.
estimators=['interest_rate','credit','march','previous','duration']
X1_all = data[estimators]
y = data['y']
# Import the scaling module
from sklearn.preprocessing import StandardScaler
# Create a scaler object
scaler = StandardScaler()
# Fit the inputs (calculate the mean and standard deviation feature-wise)
scaler.fit(X1_all)
# Scale the features and store them in a new variable (the actual scaling procedure)
X1_scaled = scaler.transform(X1_all)
# Import the module for the split
from sklearn.model_selection import train_test_split
# Split the variables with an 80-20 split and some random state
# To have the same split each time, use random_state = 365
x1_train, x1_test, y_train, y_test = train_test_split(X1_scaled, y, test_size=0.2, random_state=365)
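The scaler above is fitted on all rows before the split, so the test rows contribute to the means and standard deviations. A common alternative, sketched below on made-up toy data, is to compute the scaling statistics on the training rows only and reuse them for the test rows. The arrays train and test are assumptions for illustration, and plain NumPy stands in for StandardScaler.

```python
import numpy as np

# Toy feature matrices (made-up numbers, for illustration only)
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

# Column-wise statistics computed from the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

# Apply the same training statistics to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0))
print(test_scaled)
```

With scikit-learn, the equivalent would be to call fit on the training features and transform on both sets. Doing so in this notebook would change the fitted coefficients, so treat it as an option rather than a correction.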
x_train = sm.add_constant(x1_train)
reg_logit = sm.Logit(y_train,x_train)
results_logit = reg_logit.fit()
results_logit.summary2()
# interest rate, credit, march, previous, duration
print(np.exp(-0.7707), np.exp(2.1763), np.exp(-2.0283), np.exp(1.5255), np.exp(0.0070))
# interest rate, credit, march, previous, duration
print(np.exp(-1.4452), np.exp(0.3986), np.exp(-0.8967), np.exp(0.5087), np.exp(2.4048))
Find the confusion matrix of the model and estimate its accuracy.
def confusion_matrix(data, actual_values, model):
    # Confusion matrix
    # Parameters
    # ----------
    # data: data frame or array
    #     data is a data frame formatted in the same way as your input data (without the actual values)
    #     e.g. const, var1, var2, etc. Order is very important!
    # actual_values: data frame or array
    #     These are the actual values from the test_data
    #     In the case of a logistic regression, it should be a single column with 0s and 1s
    # model: a LogitResults object
    #     this is the variable where you have the fitted model
    #     e.g. results_log in this course
    # ----------
    # Predict the values using the Logit model
    pred_values = model.predict(data)
    # Specify the bins
    bins = np.array([0, 0.5, 1])
    # Create a 2D histogram: predictions between 0 and 0.5 are considered 0,
    # predictions between 0.5 and 1 are considered 1
    cm = np.histogram2d(actual_values, pred_values, bins=bins)[0]
    # Calculate the accuracy
    accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()
    # Return the confusion matrix and the accuracy
    return cm, accuracy
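The binning trick in the function can be checked on a toy example. The sketch below uses made-up labels and predicted probabilities (not values from the bank data) to show how np.histogram2d with bins [0, 0.5, 1] yields a matrix whose rows are actual classes and whose columns are predicted classes.

```python
import numpy as np

# Made-up actual labels and predicted probabilities, for illustration only
actual = np.array([0, 0, 1, 1, 1])
predicted = np.array([0.1, 0.7, 0.4, 0.8, 0.9])

# Two bins per axis: [0, 0.5) counts as class 0 and [0.5, 1] as class 1
bins = np.array([0, 0.5, 1])
cm = np.histogram2d(actual, predicted, bins=bins)[0]

# Diagonal entries are the correctly classified cases
accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()

print(cm)
print(accuracy)
```

In this layout cm[0, 1] counts false positives and cm[1, 0] counts false negatives.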
confusion_matrix(x_train,y_train,results_logit)
x_test = sm.add_constant(x1_test)
Determine the test confusion matrix and the test accuracy and compare them with the train confusion matrix and the train accuracy.
# Determine the Confusion Matrix and the accuracy of the model with the new data. Note that the model itself stays the same (results_logit).
# test accuracy
confusion_matrix(x_test, y_test, results_logit)
# Compare these values to the Confusion Matrix and the accuracy of the model with the old data.
# train accuracy
confusion_matrix(x_train,y_train, results_logit)
Looking at the test accuracy, we see a number that is slightly higher: 87.5%, compared to 86.7% for the train accuracy.
In general, we expect the test accuracy to be lower than the train accuracy. It is higher in this case, but a gap this small is most likely due to chance in how the data was split.
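One rough way to judge whether such a gap could be chance is the binomial standard error of an accuracy estimate. The sketch below assumes a test set of 100 observations purely for illustration (the actual test-set size is not stated here), and the helper name accuracy_std_error is hypothetical.

```python
import math

def accuracy_std_error(accuracy, n_obs):
    """Binomial standard error of an accuracy measured on n_obs independent cases."""
    return math.sqrt(accuracy * (1 - accuracy) / n_obs)

test_accuracy = 0.875   # reported above
train_accuracy = 0.867  # reported above
n_test = 100            # assumed size, for illustration only

se = accuracy_std_error(test_accuracy, n_test)
gap = test_accuracy - train_accuracy

print(round(se, 4), round(gap, 4))
```

Under this assumed size, the gap between test and train accuracy is smaller than one standard error, which is consistent with it being noise from the split.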