Introduction

A leading retail bank aims to optimize the pricing of its credit portfolio by predicting how sensitive its customers are to changes in interest rates.

This case study provides a step-by-step walkthrough of a predictive modeling task: a binary classification problem solved with a Random Forest algorithm.

By following along, you’ll gain practical insight into each stage of the modeling process and will be able to adapt the code and methodology to your own use case.

I. Basic Data Description

1. Variable Description

We have a dataset of 10,000 customers/contracts with:

  • Demographics (age, income)
  • Credit behavior (loan amount, credit score)
  • Response to rate changes (binary: sensitive = 1 if the customer churned or refinanced when rates increased)
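The write-up does not show how the data were loaded or generated. As a purely illustrative stand-in, the following simulates a data frame with the same column names and roughly the scales seen in the summary tables below; every distribution and parameter here is an assumption, not the bank's actual data-generating process:

```r
set.seed(42)
n <- 10000

# Purely illustrative simulation -- column names match the case study,
# but all distributions and parameters are assumptions
customers_sim <- data.frame(
  customer_id  = seq_len(n),
  age          = rnorm(n, mean = 45, sd = 10),
  income       = rlnorm(n, meanlog = log(22000), sdlog = 0.55),
  credit_score = rnorm(n, mean = 700, sd = 50),
  loan_amount  = runif(n, min = 20000, max = 190000),
  current_rate = runif(n, min = 0.02, max = 0.08),
  sensitive    = rbinom(n, size = 1, prob = 0.34)
)

str(customers_sim)               # 10000 obs. of 7 variables
table(customers_sim$sensitive)   # class balance of the target
```

Any real analysis would of course start from the bank's own extract rather than a simulation.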

First Data Summary

## # A tibble: 6 × 7
##   customer_id   age income credit_score loan_amount current_rate sensitive
##         <dbl> <dbl>  <dbl>        <dbl>       <dbl>        <dbl>     <dbl>
## 1           1  58.7 39401.         670.     151648.       0.0308         0
## 2           2  39.4 20028.         689.      65800.       0.0548         1
## 3           3  48.6 19059.         704.     169695.       0.0774         0
## 4           4  51.3 18044.         727.     183907.       0.0718         0
## 5           5  49.0 31402.         738.     101792.       0.0686         1
## 6           6  43.9  9786.         696.      81714.       0.0664         0

| Characteristic | N = 10,000¹               |
|:---------------|:--------------------------|
| age            | 45 (38, 52)               |
| income         | 22,070 (15,717, 30,950)   |
| loan_amount    | 102,841 (54,977, 150,259) |
| credit_score   | 701 (665, 735)            |
| current_rate   | 0.050 (0.035, 0.065)      |

¹ Median (Q1, Q3)

II. Exploratory Data Analysis

## Dataset Structure:
## tibble [10,000 × 7] (S3: tbl_df/tbl/data.frame)
##  $ customer_id : num [1:10000] 1 2 3 4 5 6 7 8 9 10 ...
##  $ age         : num [1:10000] 58.7 39.4 48.6 51.3 49 ...
##  $ income      : num [1:10000] 39401 20028 19059 18044 31402 ...
##  $ credit_score: num [1:10000] 670 689 704 727 738 ...
##  $ loan_amount : num [1:10000] 151648 65800 169695 183907 101792 ...
##  $ current_rate: num [1:10000] 0.0308 0.0548 0.0774 0.0718 0.0686 ...
##  $ sensitive   : num [1:10000] 0 1 0 0 1 0 0 1 0 0 ...
## 
## 
## Table: Descriptive Statistics by Sensitivity Group (Mean ± SD)
## 
## |Group         |    N|Age        |Income            |Loan Amount        |Credit Score |Current Rate  |
## |:-------------|----:|:----------|:-----------------|:------------------|:------------|:-------------|
## |Not Sensitive | 6638|45 (10.1)  |$24,861 ($13,167) |$100,287 ($56,017) |697.1 (48.1) |0.050 (0.020) |
## |Sensitive     | 3362|44.8 (9.9) |$25,301 ($13,708) |$107,005 ($55,536) |706.9 (54.6) |0.050 (0.020) |
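The by-group table above (mean and SD within each sensitivity group) can be produced with a grouped summary. A self-contained sketch using base R `aggregate` on a small made-up data frame, since the real data are not shown (group proportions and moments are only loosely matched):

```r
# Small made-up example with the same structure as the case-study data
set.seed(7)
toy <- data.frame(
  sensitive    = rbinom(200, 1, 0.34),
  age          = rnorm(200, 45, 10),
  credit_score = rnorm(200, 700, 50)
)

# Mean and SD of each variable within each sensitivity group
aggregate(cbind(age, credit_score) ~ sensitive, data = toy,
          FUN = function(x) c(mean = mean(x), sd = sd(x)))
```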

plot of chunk plot-all-predictorssummaryday2newplot2

## 
## 
## Table: T-test Results: Comparing Means by Sensitivity Group
## 
## |   |Variable     | t-statistic| Not Sensitive| Sensitive|p-value |
## |:--|:------------|-----------:|-------------:|---------:|:-------|
## |t  |age          |       0.782|         44.95|     44.79|0.434   |
## |t1 |income       |      -1.536|      24861.26|  25301.13|0.125   |
## |t2 |loan_amount  |      -5.698|     100287.34| 107005.12|< 0.001 |
## |t3 |credit_score |      -8.855|        697.07|    706.92|< 0.001 |
## |t4 |current_rate |      -0.809|          0.05|      0.05|0.418   |
## 
## === SUMMARY INSIGHTS ===
## 1. Dataset contains 10000 observations
## 2. Sensitivity distribution: 6638 not sensitive, 3362 sensitive
## 3. Variables with significant differences (p < 0.05):
##    - loan_amount, credit_score
## 
## 4. Correlation highlights (|r| > 0.5):
##    - No strong correlations (|r| > 0.5) found
## 
## === END OF DESCRIPTIVE ANALYSIS ===
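The group comparisons in the t-test table are consistent with Welch two-sample t-tests run per variable. A self-contained sketch on simulated credit scores, where the group sizes, means, and SDs are taken from the descriptive table but the draws themselves are made up:

```r
set.seed(1)
# Simulated stand-ins mirroring the credit_score summary
# (697.1 +/- 48.1 for not sensitive, 706.9 +/- 54.6 for sensitive)
cs_not_sensitive <- rnorm(6638, mean = 697.1, sd = 48.1)
cs_sensitive     <- rnorm(3362, mean = 706.9, sd = 54.6)

# Welch two-sample t-test (unequal variances), R's t.test default
t.test(cs_sensitive, cs_not_sensitive)
```

With samples this large and a roughly 10-point mean gap, the test is expected to be highly significant, matching the p < 0.001 reported for credit_score.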

III. Predictive Modelling

Train/Test Split

The dataset was randomly partitioned, with 80% allocated for training and 20% for testing.
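A minimal sketch of that split; the seed value is an arbitrary choice, and `customers` denotes the full data frame:

```r
set.seed(123)   # seed chosen arbitrarily for reproducibility
n <- 10000

# Draw 80% of row indices for training; the remainder form the test set
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
test_idx  <- setdiff(seq_len(n), train_idx)

length(train_idx)   # 8000
length(test_idx)    # 2000
# train <- customers[train_idx, ]; test <- customers[test_idx, ]
```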

Implement Random Forest with Optimized Hyperparameter Tuning
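A sketch of the fitting step matching the configuration reported further down (100 trees, 3 candidate variables per split), using the `randomForest` package. Because the bank's data are not public, the training set here is simulated, and its data-generating process is entirely made up; the cross-validated tuning loop (e.g. via caret) is not shown:

```r
library(randomForest)
set.seed(123)

# Made-up training data with the case study's predictor columns
n <- 2000
train_sim <- data.frame(
  age          = rnorm(n, 45, 10),
  income       = rlnorm(n, log(22000), 0.55),
  credit_score = rnorm(n, 700, 50),
  loan_amount  = runif(n, 20000, 190000),
  current_rate = runif(n, 0.02, 0.08)
)
# Sensitivity driven (arbitrarily) by credit score and loan amount
p <- plogis(-1 + 0.01 * (train_sim$credit_score - 700) +
              1e-5 * (train_sim$loan_amount - 100000))
train_sim$sensitive <- factor(rbinom(n, 1, p))

# 100 trees, mtry = 3 variables per split, as in the configuration table
rf_fit <- randomForest(sensitive ~ ., data = train_sim,
                       ntree = 100, mtry = 3, importance = TRUE)
importance(rf_fit)   # per-variable importance measures
```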

## #  Random Forest Model Performance Dashboard
## ## Executive Summary
## ## Performance Metrics
Random Forest Model Performance Summary

| Metric    |     Value | Percentage |
|:----------|----------:|-----------:|
| Accuracy  | 0.6793397 |      67.9% |
| Precision | 0.6937853 |      69.4% |
| Recall    | 0.9253956 |      92.5% |
| F1-Score  | 0.7930255 |      79.3% |
## 
## 
## ## 🔍 Detailed Analysis

plot of chunk professional-display

## ## 🎉 Key Insights
## **Good Accuracy**: The model achieves 67.9% accuracy, a reasonable level of predictive performance.
## **Fair Precision**: 69.4% precision for sensitive customers limits false positives.
## **High Recall**: 92.5% recall means the model identifies most rate-sensitive customers, though at the cost of some false positives.
## **Optimized Parameters**: Hyperparameter tuning identified the best-performing settings among those tested.
## **Feature Importance**: The model reveals which variables are most influential in its predictions.
## **Top Predictor**: credit_score shows the highest importance in determining customer sensitivity.
## ## Model Configuration
| Configuration       | Value                   |
|:--------------------|:------------------------|
| Algorithm           | Random Forest           |
| Number of Trees     | 100                     |
| Variables per Split | 3                       |
| Training Method     | Bootstrap Sampling      |
| Validation          | 5-Fold Cross-Validation |
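As a formula check, the four dashboard metrics all derive from the test-set confusion matrix. The counts below are hypothetical, since the write-up does not report the actual matrix; only the formulas are the point:

```r
# Hypothetical confusion-matrix counts for a 2,000-row test set
TP <- 1230; FP <- 545; FN <- 100; TN <- 125

accuracy  <- (TP + TN) / (TP + FP + FN + TN)   # share of correct predictions
precision <- TP / (TP + FP)                    # correct among predicted sensitive
recall    <- TP / (TP + FN)                    # sensitive customers caught
f1        <- 2 * precision * recall / (precision + recall)  # harmonic mean

round(c(accuracy = accuracy, precision = precision,
        recall = recall, f1 = f1), 4)
```

Note that the F1-score always lies between precision and recall, which is why the reported 79.3% sits between 69.4% and 92.5%.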

IV. Conclusion

  • Key Drivers: loan_amount and credit_score are the most predictive variables.
  • Optimal Strategy: Offer competitive rates to high-credit-score customers.
  • Next Steps: Deploy the model via an API for real-time pricing decisions.
  • Evaluate additional supervised learning models alongside the Random Forest for performance benchmarking.
  • A more realistic setup would use customer data spanning several years (panel data), with predictions and forecasts made accordingly.
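For the API deployment mentioned above, one common option in the R ecosystem is plumber. The sketch below is only an illustration: the endpoint name, the saved-model filename `rf_fit.rds`, and all fields are assumptions, not part of the case study.

```r
# plumber.R -- minimal sketch of a real-time scoring endpoint
# (assumes the trained model was saved with saveRDS(rf_fit, "rf_fit.rds"))
library(plumber)
library(randomForest)

rf_fit <- readRDS("rf_fit.rds")

#* Predict rate sensitivity for one customer
#* @param age:numeric
#* @param income:numeric
#* @param credit_score:numeric
#* @param loan_amount:numeric
#* @param current_rate:numeric
#* @post /sensitivity
function(age, income, credit_score, loan_amount, current_rate) {
  newdata <- data.frame(
    age = as.numeric(age), income = as.numeric(income),
    credit_score = as.numeric(credit_score),
    loan_amount = as.numeric(loan_amount),
    current_rate = as.numeric(current_rate)
  )
  # Return the predicted probability of the "sensitive" class
  as.numeric(predict(rf_fit, newdata, type = "prob")[, "1"])
}
# Launch with: plumber::pr("plumber.R") |> plumber::pr_run(port = 8000)
```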