Minimizing Churn Rate Through Analysis of Financial Habits
Background
Subscription products are often the main source of revenue for companies across industries. These products can take the form of a “one size fits all” all-encompassing subscription, or of multi-level memberships. Regardless of how memberships are structured, or what industry a company is in, it will almost always try to minimize customer churn (i.e., subscription cancellations).
To retain their customers, companies first need to identify behavioral patterns that act as catalysts for disengagement with the product.
- Market: The target audience is the entirety of a company’s subscription base. They are the customers the company wants to keep.
- Product: The subscription products that customers are already enrolled in can provide value that users may not have imagined, or that they may have forgotten.
Objective
The objective of this model is to find out which users are likely to churn, so that the company can focus on re-engaging those users with the product. These efforts can include email reminders about the benefits of the product, especially highlighting features that are new or that the user has shown they value.
In this case study we will be working for a fintech company that provides a subscription product which allows users to manage their bank accounts (savings accounts, credit cards, etc.), provides them with personalized coupons, informs them about the latest low-APR loans available in the market, and educates them on the best available ways to save money (e.g., free courses on financial health).
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import random
dataset = pd.read_csv('churn_data.csv')  # users enrolled for 60 days; churn measured over the following 30 days
Exploratory data analysis
Now we will do some Exploratory Data Analysis (EDA), an approach to data analysis that employs a variety of techniques to:
- maximize insight into a data set
- uncover underlying structure
- extract important variables
- detect outliers and anomalies
- test underlying assumptions
- develop parsimonious models
- determine optimal factor settings
dataset.columns
Index(['user', 'churn', 'age', 'housing', 'credit_score', 'deposits',
'withdrawal', 'purchases_partners', 'purchases', 'cc_taken',
'cc_recommended', 'cc_disliked', 'cc_liked', 'cc_application_begin',
'app_downloaded', 'web_user', 'app_web_user', 'ios_user',
'android_user', 'registered_phones', 'payment_type', 'waiting_4_loan',
'cancelled_loan', 'received_loan', 'rejected_loan', 'zodiac_sign',
'left_for_two_month_plus', 'left_for_one_month', 'rewards_earned',
'reward_rate', 'is_referred'],
dtype='object')
dataset.head(5)
user | churn | age | housing | credit_score | deposits | withdrawal | purchases_partners | purchases | cc_taken | ... | waiting_4_loan | cancelled_loan | received_loan | rejected_loan | zodiac_sign | left_for_two_month_plus | left_for_one_month | rewards_earned | reward_rate | is_referred | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 55409 | 0 | 37.0 | na | NaN | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | Leo | 1 | 0 | NaN | 0.00 | 0 |
1 | 23547 | 0 | 28.0 | R | 486.0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | Leo | 0 | 0 | 44.0 | 1.47 | 1 |
2 | 58313 | 0 | 35.0 | R | 561.0 | 47 | 2 | 86 | 47 | 0 | ... | 0 | 0 | 0 | 0 | Capricorn | 1 | 0 | 65.0 | 2.17 | 0 |
3 | 8095 | 0 | 26.0 | R | 567.0 | 26 | 3 | 38 | 25 | 0 | ... | 0 | 0 | 0 | 0 | Capricorn | 0 | 0 | 33.0 | 1.10 | 1 |
4 | 61353 | 1 | 27.0 | na | NaN | 0 | 0 | 2 | 0 | 0 | ... | 0 | 0 | 0 | 0 | Aries | 1 | 0 | 1.0 | 0.03 | 0 |
5 rows × 31 columns
dataset.describe()
user | churn | age | credit_score | deposits | withdrawal | purchases_partners | purchases | cc_taken | cc_recommended | ... | registered_phones | waiting_4_loan | cancelled_loan | received_loan | rejected_loan | left_for_two_month_plus | left_for_one_month | rewards_earned | reward_rate | is_referred | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 27000.000000 | 27000.000000 | 26996.000000 | 18969.000000 | 27000.000000 | 27000.000000 | 27000.000000 | 27000.000000 | 27000.000000 | 27000.000000 | ... | 27000.000000 | 27000.000000 | 27000.000000 | 27000.000000 | 27000.000000 | 27000.000000 | 27000.000000 | 23773.000000 | 27000.000000 | 27000.000000 |
mean | 35422.702519 | 0.413852 | 32.219921 | 542.944225 | 3.341556 | 0.307000 | 28.062519 | 3.273481 | 0.073778 | 92.625778 | ... | 0.420926 | 0.001296 | 0.018815 | 0.018185 | 0.004889 | 0.173444 | 0.018074 | 29.110125 | 0.907684 | 0.318037 |
std | 20321.006678 | 0.492532 | 9.964838 | 61.059315 | 9.131406 | 1.055416 | 42.219686 | 8.953077 | 0.437299 | 88.869343 | ... | 0.912831 | 0.035981 | 0.135873 | 0.133623 | 0.069751 | 0.378638 | 0.133222 | 21.973478 | 0.752016 | 0.465723 |
min | 1.000000 | 0.000000 | 17.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 |
25% | 17810.500000 | 0.000000 | 25.000000 | 507.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 10.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 9.000000 | 0.200000 | 0.000000 |
50% | 35749.000000 | 0.000000 | 30.000000 | 542.000000 | 0.000000 | 0.000000 | 9.000000 | 0.000000 | 0.000000 | 65.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 25.000000 | 0.780000 | 0.000000 |
75% | 53244.250000 | 1.000000 | 37.000000 | 578.000000 | 1.000000 | 0.000000 | 43.000000 | 1.000000 | 0.000000 | 164.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 48.000000 | 1.530000 | 1.000000 |
max | 69658.000000 | 1.000000 | 91.000000 | 838.000000 | 65.000000 | 29.000000 | 1067.000000 | 63.000000 | 29.000000 | 522.000000 | ... | 5.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 114.000000 | 4.000000 | 1.000000 |
8 rows × 28 columns
A few quick things to note:
- About 41% of users have churned
- The average user age is about 32
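These figures can be read straight off the raw columns; a quick check using the `dataset` already loaded:
# Quick confirmation of the headline numbers from describe()
print("Churn rate:  %.1f%%" % (dataset['churn'].mean() * 100))
print("Average age: %.1f" % dataset['age'].mean())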
dataset.shape
(27000, 31)
dataset.dtypes
user int64
churn int64
age float64
housing object
credit_score float64
deposits int64
withdrawal int64
purchases_partners int64
purchases int64
cc_taken int64
cc_recommended int64
cc_disliked int64
cc_liked int64
cc_application_begin int64
app_downloaded int64
web_user int64
app_web_user int64
ios_user int64
android_user int64
registered_phones int64
payment_type object
waiting_4_loan int64
cancelled_loan int64
received_loan int64
rejected_loan int64
zodiac_sign object
left_for_two_month_plus int64
left_for_one_month int64
rewards_earned float64
reward_rate float64
is_referred int64
dtype: object
**Data Cleaning:** Next, we will clean the data and then continue with more EDA.
# Inspect rows with invalid credit scores (standard scores start at 300)
dataset[dataset.credit_score < 300]
# Keep only rows with valid scores; note this also drops rows where credit_score is NaN
dataset = dataset[dataset.credit_score >= 300]
# Check null values
dataset.isna().sum()
user 0
churn 0
age 0
housing 0
credit_score 0
deposits 0
withdrawal 0
purchases_partners 0
purchases 0
cc_taken 0
cc_recommended 0
cc_disliked 0
cc_liked 0
cc_application_begin 0
app_downloaded 0
web_user 0
app_web_user 0
ios_user 0
android_user 0
registered_phones 0
payment_type 0
waiting_4_loan 0
cancelled_loan 0
received_loan 0
rejected_loan 0
zodiac_sign 0
left_for_two_month_plus 0
left_for_one_month 0
rewards_earned 1190
reward_rate 0
is_referred 0
dtype: int64
Credit score and rewards_earned had a significant number of null values in the original data. Rather than impute them, we will drop both columns from our model.
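Before dropping them, we can quantify how much is still missing at this point; a small sketch:
# Share of missing values per column, highest first (only columns with any nulls)
missing_frac = dataset.isna().mean().sort_values(ascending=False)
print(missing_frac[missing_frac > 0])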
dataset = dataset.drop(columns = ['credit_score', 'rewards_earned'])
# Features Histograms
fig, ax = plt.subplots(3,3, figsize=(20, 14))
sns.distplot(dataset.age, bins = 20, ax=ax[0,0])
sns.distplot(dataset.purchases_partners, bins = 20, ax=ax[0,1])
sns.distplot(dataset.app_downloaded, bins = 20, ax=ax[0,2])
sns.distplot(dataset.deposits, bins = 20, ax=ax[1,0])
sns.distplot(dataset.withdrawal, bins = 20, ax=ax[1,1])
sns.distplot(dataset.cc_application_begin, bins = 20, ax=ax[1,2])
sns.distplot(dataset.cc_recommended, bins = 20, ax=ax[2,0])
sns.distplot(dataset.cancelled_loan, bins = 20, ax=ax[2,1])
sns.distplot(dataset.reward_rate, bins = 20, ax=ax[2,2])
plt.show()
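Note that `sns.distplot` is deprecated in newer seaborn releases; on recent versions each panel can be drawn with `histplot` instead, for example:
# Equivalent single panel on newer seaborn, where distplot is deprecated
fig, ax = plt.subplots(figsize=(6, 4))
sns.histplot(dataset['age'], bins=20, kde=True, ax=ax)
plt.show()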
A few things to note:
- Age: The distribution is right-skewed, which makes intuitive sense, as older people are less likely to use the service
- Deposits/withdrawals: The majority of users have no deposits or withdrawals (the data covers only the first couple of months of enrollment, so activity can still be low)
## Pie Plots
dataset2 = dataset[['housing', 'is_referred', 'app_downloaded',
                    'web_user', 'app_web_user', 'ios_user',
                    'android_user', 'registered_phones', 'payment_type',
                    'waiting_4_loan', 'cancelled_loan',
                    'received_loan', 'rejected_loan', 'zodiac_sign',
                    'left_for_two_month_plus', 'left_for_one_month']]
fig = plt.figure(figsize=(15, 12))
#plt.suptitle('Pie Chart Distributions', fontsize=20)
for i in range(1, dataset2.shape[1] + 1):
    plt.subplot(6, 3, i)
    f = plt.gca()
    f.axes.get_yaxis().set_visible(False)
    f.set_title(dataset2.columns.values[i - 1])
    values = dataset2.iloc[:, i - 1].value_counts(normalize = True).values
    index = dataset2.iloc[:, i - 1].value_counts(normalize = True).index
    plt.pie(values, labels = index, autopct='%1.1f%%')
    plt.axis('equal')
fig.tight_layout(rect=[0, 0.03, 0.9, 2.1])
plt.show()
A few things we notice:
- Housing: The majority of users are not homeowners; there is a good share of renters, and most users are unclassified ('na')
- Payment type: Bi-weekly is the most common
- Zodiac sign: Fairly evenly distributed, except perhaps for Capricorn
It is also worth noting that features such as 'waiting_4_loan', 'cancelled_loan', 'received_loan', 'rejected_loan' and 'left_for_one_month' are very unevenly distributed. We will explore them further to make sure these features will be useful for building our models.
## Exploring Uneven Features
dataset[dataset2.waiting_4_loan == 1].churn.value_counts()
0 15
1 3
Name: churn, dtype: int64
dataset[dataset2.cancelled_loan == 1].churn.value_counts()
0 194
1 187
Name: churn, dtype: int64
dataset[dataset2.received_loan == 1].churn.value_counts()
1 233
0 162
Name: churn, dtype: int64
dataset[dataset2.rejected_loan == 1].churn.value_counts()
1 64
0 17
Name: churn, dtype: int64
dataset[dataset2.left_for_one_month == 1].churn.value_counts()
1 207
0 184
Name: churn, dtype: int64
Within each of these subsets, both churn outcomes are reasonably well represented, so we do not see a strong reason to believe these fields are biased.
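The same check can be written as a single loop over the sparse flags, showing the group size and churn rate for each; a compact sketch:
# Churn rate among users with each rare flag set, plus the group size
rare_flags = ['waiting_4_loan', 'cancelled_loan', 'received_loan',
              'rejected_loan', 'left_for_one_month']
for col in rare_flags:
    subset = dataset[dataset[col] == 1]
    print("%-25s n=%4d  churn rate=%.2f" % (col, len(subset), subset['churn'].mean()))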
Next, we will check the correlation of each feature with the response variable.
## Correlation with Response Variable
dataset_corr = dataset.drop(columns = ['churn', 'user'])  # drop the response and the user id before computing correlations
dataset_corr.corrwith(dataset.churn).plot.bar(figsize=(20,10),
title = 'Correlation with Response variable',
fontsize = 15, rot = 45,
grid = True)
plt.show()
Age is negatively correlated with churn: the younger the user, the more likely they are to churn.
The same holds for deposits and withdrawals: the fewer deposits or withdrawals a user makes, the more likely they are to churn. This makes sense, because the less activity a user has, the more likely they are to leave.
Interestingly, 'cc_taken' is positively correlated with churn, meaning users who have taken a credit card are more likely to churn (i.e., they may not be happy with the credit card). This will be interesting to explore further.
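As a first pass at that follow-up, we can compare churn rates for users who did and did not take a credit card; a minimal sketch:
# Churn rate and group size, split by whether the user took a credit card
print(dataset.groupby(dataset['cc_taken'] > 0)['churn'].agg(['mean', 'count']))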
#Correlation matrix
corr=dataset_corr.corr()
sns.set(font_scale=1.3)
plt.figure(figsize=(24, 27))
sns.heatmap(corr, vmax=.8, linewidths=0.01,
square=True,annot=True,cmap='YlGnBu',linecolor="black")
plt.title('Correlation between features', fontsize = '32')
plt.show()
Ideally, every feature would be independent of the others and the matrix above would be close to 0 everywhere, meaning the features are not linearly related.
However, that is not the case here. As we see in the matrix, the correlation between 'android_user' and 'ios_user' is strongly negative. This makes sense: if you are an Android user, you are most likely not an iOS user. The correlation is not exactly -1 because some users probably use both platforms. We will drop one of these columns, as it does not bring any new information.
Additionally, the 'app_web_user' field is 1 only when 'web_user' and 'app_downloaded' are both 1 (i.e., it is a function of the two fields), so it is not an independent variable. Since we want independent variables, we will drop this field as well.
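That relationship is easy to verify before dropping the column; a quick check (run before the drops below):
# Check that app_web_user is exactly the AND of app_downloaded and web_user
derived = dataset['app_downloaded'] & dataset['web_user']
print((dataset['app_web_user'] == derived).all())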
# Removing Correlated Fields
dataset = dataset.drop(columns = ['app_web_user'])
dataset = dataset.drop(columns = ['ios_user'])
## Data Preparation
user_identifier = dataset['user']
dataset = dataset.drop(columns = ['user'])
One-hot encoding
We will use one-hot encoding, a simple process by which categorical variables are converted into binary indicator columns that ML algorithms can work with directly.
# One-Hot Encoding
dataset.housing.value_counts()  # distribution of the housing categories ('na', 'R', 'O')
dataset.groupby('housing')['churn'].nunique().reset_index()  # check that every housing category contains both churn outcomes
dataset = pd.get_dummies(dataset)  # one-hot encode all remaining object columns
dataset.columns
# Drop the dummy columns for the unknown ('na') categories
dataset = dataset.drop(columns = ['housing_na', 'zodiac_sign_na', 'payment_type_na'])
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(dataset.drop(columns = 'churn'), dataset['churn'],
test_size = 0.2,
random_state = 0)
# Balancing the Training Set (downsample the majority class)
y_train.value_counts()
pos_index = y_train[y_train.values == 1].index
neg_index = y_train[y_train.values == 0].index
if len(pos_index) > len(neg_index):
    higher = pos_index
    lower = neg_index
else:
    higher = neg_index
    lower = pos_index
np.random.seed(0)  # seed numpy's RNG, which np.random.choice below actually uses
higher = np.random.choice(higher, size=len(lower), replace=False)  # sample majority class without replacement
lower = np.asarray(lower)
new_indexes = np.concatenate((lower, higher))
X_train = X_train.loc[new_indexes]
y_train = y_train[new_indexes]
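A quick check confirms the two classes now have equal counts. An alternative worth noting (a sketch, not what is used here) would be to keep all rows and let the model reweight the classes instead:
# Verify the downsampled training set is balanced
print(y_train.value_counts())
# Alternative sketch: skip downsampling and reweight classes inside the model
# classifier = LogisticRegression(class_weight='balanced', random_state=0)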
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train2 = pd.DataFrame(sc_X.fit_transform(X_train))
X_test2 = pd.DataFrame(sc_X.transform(X_test))
X_train2.columns = X_train.columns.values
X_test2.columns = X_test.columns.values
X_train2.index = X_train.index.values
X_test2.index = X_test.index.values
X_train = X_train2
X_test = X_test2
Model building
# Fitting Model to the Training Set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=0, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
# Predicting Test Set
y_pred = classifier.predict(X_test)
# Evaluating Results
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score
cm = confusion_matrix(y_test, y_pred)
accuracy_score(y_test, y_pred)
precision_score(y_test, y_pred) # tp / (tp + fp)
recall_score(y_test, y_pred) # tp / (tp + fn)
f1_score(y_test, y_pred)
df_cm = pd.DataFrame(cm, index = (0, 1), columns = (0, 1))
plt.figure(figsize = (10,7))
sns.set(font_scale=1.4)
sns.heatmap(df_cm, annot=True, fmt='g')
print("Test Data Accuracy: %0.4f" % accuracy_score(y_test, y_pred))
Test Data Accuracy: 0.6399
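The other metrics computed above are worth printing alongside accuracy, since accuracy alone gives an incomplete picture; a small sketch:
# Report the full set of test metrics, not just accuracy
print("Precision: %0.4f" % precision_score(y_test, y_pred))
print("Recall:    %0.4f" % recall_score(y_test, y_pred))
print("F1 score:  %0.4f" % f1_score(y_test, y_pred))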
# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print("SVM Accuracy: %0.3f (+/- %0.3f)" % (accuracies.mean(), accuracies.std() * 2))
SVM Accuracy: 0.652 (+/- 0.023)
# Analyzing Coefficients
pd.concat([pd.DataFrame(X_train.columns, columns = ["features"]),
pd.DataFrame(np.transpose(classifier.coef_), columns = ["coef"])
],axis = 1)
features | coef | |
---|---|---|
0 | age | -0.164204 |
1 | deposits | 0.034083 |
2 | withdrawal | 0.031360 |
3 | purchases_partners | -0.761356 |
4 | purchases | -0.188282 |
5 | cc_taken | 0.043743 |
6 | cc_recommended | 0.069329 |
7 | cc_disliked | 0.035589 |
8 | cc_liked | -0.003735 |
9 | cc_application_begin | 0.100384 |
10 | app_downloaded | -0.057729 |
11 | web_user | 0.146670 |
12 | android_user | -0.051212 |
13 | registered_phones | 0.098973 |
14 | waiting_4_loan | -0.024642 |
15 | cancelled_loan | 0.098474 |
16 | received_loan | 0.102405 |
17 | rejected_loan | 0.124107 |
18 | left_for_two_month_plus | 0.025957 |
19 | left_for_one_month | 0.056692 |
20 | reward_rate | -0.282190 |
21 | is_referred | 0.041676 |
22 | housing_O | -0.038284 |
23 | housing_R | 0.046987 |
24 | payment_type_Bi-Weekly | -0.064121 |
25 | payment_type_Monthly | -0.054434 |
26 | payment_type_Semi-Monthly | -0.038202 |
27 | payment_type_Weekly | 0.043886 |
28 | zodiac_sign_Aquarius | 0.004534 |
29 | zodiac_sign_Aries | 0.042943 |
30 | zodiac_sign_Cancer | 0.041694 |
31 | zodiac_sign_Capricorn | 0.066223 |
32 | zodiac_sign_Gemini | 0.032779 |
33 | zodiac_sign_Leo | 0.016060 |
34 | zodiac_sign_Libra | 0.005363 |
35 | zodiac_sign_Pisces | 0.056116 |
36 | zodiac_sign_Sagittarius | 0.032343 |
37 | zodiac_sign_Scorpio | 0.002033 |
38 | zodiac_sign_Taurus | 0.013963 |
39 | zodiac_sign_Virgo | 0.041674 |
# Recursive Feature Elimination
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# Model to Test
classifier = LogisticRegression()
# Select the best 20 features
rfe = RFE(classifier, n_features_to_select = 20)
rfe = rfe.fit(X_train, y_train)
# summarize the selection of the attributes
print(rfe.support_)
[ True False False True True True True False False True True True
True True False True True True False True True True True True
False False False True False False False True False False False False
False False False False]
print(rfe.ranking_)
[ 1 9 7 1 1 1 1 5 18 1 1 1 1 1 15 1 1 1 12 1 1 1 1 1
3 2 4 1 20 8 10 1 13 16 19 6 14 21 17 11]
X_train.columns[rfe.support_]
Index(['age', 'purchases_partners', 'purchases', 'cc_taken', 'cc_recommended',
'cc_application_begin', 'app_downloaded', 'web_user', 'android_user',
'registered_phones', 'cancelled_loan', 'received_loan', 'rejected_loan',
'left_for_one_month', 'reward_rate', 'is_referred', 'housing_O',
'housing_R', 'payment_type_Weekly', 'zodiac_sign_Capricorn'],
dtype='object')
# New Correlation Matrix
sns.set(style="white")
# Compute the correlation matrix
corr = X_train[X_train.columns[rfe.support_]].corr()
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(18, 15))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.show()
# Fitting Model to the Training Set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train[X_train.columns[rfe.support_]], y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
# Predicting Test Set
y_pred = classifier.predict(X_test[X_train.columns[rfe.support_]])
# Evaluating Results
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score
cm = confusion_matrix(y_test, y_pred)
accuracy_score(y_test, y_pred)
precision_score(y_test, y_pred) # tp / (tp + fp)
recall_score(y_test, y_pred) # tp / (tp + fn)
f1_score(y_test, y_pred)
0.6318327974276529
df_cm = pd.DataFrame(cm, index = (0, 1), columns = (0, 1))
plt.figure(figsize = (10,7))
sns.set(font_scale=1.4)
sns.heatmap(df_cm, annot=True, fmt='g')
print("Test Data Accuracy: %0.4f" % accuracy_score(y_test, y_pred))
Test Data Accuracy: 0.6378
#Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier,
X = X_train[X_train.columns[rfe.support_]],
y = y_train, cv = 10)
print("SVM Accuracy: %0.3f (+/- %0.3f)" % (accuracies.mean(), accuracies.std() * 2))
SVM Accuracy: 0.651 (+/- 0.027)
# Analyzing Coefficients
pd.concat([pd.DataFrame(X_train[X_train.columns[rfe.support_]].columns, columns = ["features"]),
pd.DataFrame(np.transpose(classifier.coef_), columns = ["coef"])
],axis = 1)
features | coef | |
---|---|---|
0 | age | -0.164930 |
1 | purchases_partners | -0.753390 |
2 | purchases | -0.139449 |
3 | cc_taken | 0.050407 |
4 | cc_recommended | 0.071737 |
5 | cc_application_begin | 0.104003 |
6 | app_downloaded | -0.057866 |
7 | web_user | 0.147560 |
8 | android_user | -0.051496 |
9 | registered_phones | 0.099842 |
10 | cancelled_loan | 0.098161 |
11 | received_loan | 0.101879 |
12 | rejected_loan | 0.122173 |
13 | left_for_one_month | 0.056976 |
14 | reward_rate | -0.287096 |
15 | is_referred | 0.042179 |
16 | housing_O | -0.038413 |
17 | housing_R | 0.048422 |
18 | payment_type_Weekly | 0.089569 |
19 | zodiac_sign_Capricorn | 0.051997 |
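Because the features were standardized before fitting, the coefficient magnitudes give a rough ranking of feature influence; a quick sketch to sort them:
# Rank features of the reduced model by the absolute size of their coefficients
coef_df = pd.DataFrame({'features': X_train.columns[rfe.support_],
                        'coef': classifier.coef_[0]})
coef_df['abs_coef'] = coef_df['coef'].abs()
print(coef_df.sort_values('abs_coef', ascending=False).drop(columns='abs_coef'))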
# Formatting Final Results
final_results = pd.concat([y_test, user_identifier], axis = 1).dropna()
# Assign predictions as a Series indexed like y_test so they align with the correct users
final_results['predicted_churn'] = pd.Series(y_pred, index = y_test.index)
final_results = final_results[['user', 'churn', 'predicted_churn']].reset_index(drop=True)
final_results
user | churn | predicted_churn | |
---|---|---|---|
0 | 20839 | 0.0 | 1 |
1 | 15359 | 1.0 | 0 |
2 | 34210 | 1.0 | 0 |
3 | 57608 | 1.0 | 1 |
4 | 11790 | 0.0 | 0 |
5 | 1826 | 1.0 | 1 |
6 | 8508 | 0.0 | 1 |
7 | 50946 | 1.0 | 1 |
8 | 50130 | 1.0 | 0 |
9 | 55422 | 0.0 | 0 |
10 | 259 | 1.0 | 1 |
11 | 17451 | 0.0 | 0 |
12 | 41909 | 0.0 | 0 |
13 | 38825 | 0.0 | 1 |
14 | 19314 | 1.0 | 1 |
15 | 26916 | 0.0 | 0 |
16 | 30614 | 0.0 | 1 |
17 | 30329 | 1.0 | 1 |
18 | 38853 | 0.0 | 1 |
19 | 15592 | 1.0 | 1 |
20 | 40888 | 0.0 | 1 |
21 | 17918 | 1.0 | 0 |
22 | 52613 | 0.0 | 0 |
23 | 725 | 0.0 | 1 |
24 | 51797 | 0.0 | 0 |
25 | 2601 | 0.0 | 0 |
26 | 33990 | 0.0 | 1 |
27 | 10006 | 0.0 | 0 |
28 | 19296 | 1.0 | 0 |
29 | 12135 | 1.0 | 1 |
... | ... | ... | ... |
3763 | 64494 | 1.0 | 0 |
3764 | 1185 | 0.0 | 0 |
3765 | 17908 | 0.0 | 1 |
3766 | 52426 | 0.0 | 1 |
3767 | 41552 | 0.0 | 0 |
3768 | 52762 | 1.0 | 0 |
3769 | 35892 | 1.0 | 1 |
3770 | 28025 | 1.0 | 0 |
3771 | 55416 | 0.0 | 0 |
3772 | 14997 | 0.0 | 1 |
3773 | 25667 | 0.0 | 1 |
3774 | 44166 | 0.0 | 1 |
3775 | 50893 | 1.0 | 0 |
3776 | 10975 | 1.0 | 1 |
3777 | 38184 | 0.0 | 0 |
3778 | 31601 | 0.0 | 0 |
3779 | 31167 | 0.0 | 0 |
3780 | 51126 | 0.0 | 1 |
3781 | 58440 | 0.0 | 0 |
3782 | 65088 | 0.0 | 1 |
3783 | 26821 | 0.0 | 0 |
3784 | 25599 | 0.0 | 1 |
3785 | 3369 | 0.0 | 1 |
3786 | 33587 | 1.0 | 0 |
3787 | 22318 | 0.0 | 1 |
3788 | 67681 | 0.0 | 1 |
3789 | 49145 | 1.0 | 0 |
3790 | 47206 | 0.0 | 0 |
3791 | 22377 | 0.0 | 0 |
3792 | 47663 | 1.0 | 0 |
3793 rows × 3 columns
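These predictions can feed directly into the re-engagement effort described in the conclusion; for example, the list of users flagged as likely churners could be pulled with something like:
# Users the model flags as likely to churn: candidates for re-engagement outreach
at_risk_users = final_results.loc[final_results['predicted_churn'] == 1, 'user']
print("%d users flagged for re-engagement" % len(at_risk_users))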
Conclusion
Our model has given us an indication of which users are likely to churn. We have purposefully left the date of the expected churn open-ended, because we are focused on gauging the features that indicate disengagement with the product, not the exact manner in which users will disengage. We chose this open-ended emphasis to capture even users who are only slightly likely to churn: we are not aiming to create new products for people who are certain to leave, but for people who are starting to lose interest in the app.
If, after creating new product features, our model starts predicting that fewer of our users are going to churn, we can assume our customers are feeling more engaged with what we are offering them.
We can move forward with these new efforts by asking users for their opinion of the new features (e.g., via a survey). If we want to transition into predicting churn more accurately, in order to put emphasis directly on the users who are about to leave the product, we can add a time dimension to churn, which would make the model more precise.