Introduction
A leading retail bank aims to optimize the pricing of its credit portfolio by predicting how sensitive its customers are to changes in interest rates.
This case study provides a step-by-step walkthrough of a predictive modeling task: a binary classification problem tackled with a Random Forest algorithm.
By following along, you’ll gain practical insight into each stage of the modeling process and will be able to adapt the code and methodology to your own use case.
I. Basic Data Description
1. Variable Description
We have a dataset of 10,000 customers/contracts with:
- Demographics (age, income)
- Credit behavior (loan amount, credit score)
- Response to rate changes (binary: sensitive = 1 if the customer churned/refinanced when rates increased, 0 otherwise).
First Data Summary
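The preview below was generated along these lines (a minimal sketch; the file name `customers.csv` and the object name `customers` are illustrative assumptions, not details from the original analysis):

```r
# Load the customer dataset and preview the first rows
# (sketch; file and object names are assumptions)
library(tidyverse)

customers <- read_csv("customers.csv")
head(customers)
```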
```
## # A tibble: 6 × 7
##   customer_id   age income credit_score loan_amount current_rate sensitive
##         <dbl> <dbl>  <dbl>        <dbl>       <dbl>        <dbl>     <dbl>
## 1           1  58.7 39401.         670.     151648.       0.0308         0
## 2           2  39.4 20028.         689.      65800.       0.0548         1
## 3           3  48.6 19059.         704.     169695.       0.0774         0
## 4           4  51.3 18044.         727.     183907.       0.0718         0
## 5           5  49.0 31402.         738.     101792.       0.0686         1
## 6           6  43.9  9786.         696.      81714.       0.0664         0
```
| Characteristic | N = 10,000¹ |
|---|---|
| age | 45 (38, 52) |
| income | 22,070 (15,717, 30,950) |
| loan_amount | 102,841 (54,977, 150,259) |
| credit_score | 701 (665, 735) |
| current_rate | 0.050 (0.035, 0.065) |

¹ Median (Q1, Q3)
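A summary table in this Median (Q1, Q3) format can be produced with the gtsummary package; the sketch below assumes the `customers` data frame loaded above:

```r
# Median (Q1, Q3) summary of the numeric variables
# (sketch; assumes the data frame `customers`)
library(dplyr)
library(gtsummary)

customers %>%
  select(age, income, loan_amount, credit_score, current_rate) %>%
  tbl_summary(statistic = all_continuous() ~ "{median} ({p25}, {p75})")
```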
II. Exploratory Data Analysis
Dataset Structure

```
## tibble [10,000 × 7] (S3: tbl_df/tbl/data.frame)
##  $ customer_id : num [1:10000] 1 2 3 4 5 6 7 8 9 10 ...
##  $ age         : num [1:10000] 58.7 39.4 48.6 51.3 49 ...
##  $ income      : num [1:10000] 39401 20028 19059 18044 31402 ...
##  $ credit_score: num [1:10000] 670 689 704 727 738 ...
##  $ loan_amount : num [1:10000] 151648 65800 169695 183907 101792 ...
##  $ current_rate: num [1:10000] 0.0308 0.0548 0.0774 0.0718 0.0686 ...
##  $ sensitive   : num [1:10000] 0 1 0 0 1 0 0 1 0 0 ...
```
Table: Descriptive Statistics by Sensitivity Group (Mean ± SD)

| Group | N | Age | Income | Loan Amount | Credit Score | Current Rate |
|---|---|---|---|---|---|---|
| Not Sensitive | 6,638 | 45.0 (10.1) | $24,861 ($13,167) | $100,287 ($56,017) | 697.1 (48.1) | 0.050 (0.020) |
| Sensitive | 3,362 | 44.8 (9.9) | $25,301 ($13,708) | $107,005 ($55,536) | 706.9 (54.6) | 0.050 (0.020) |
Table: T-test Results: Comparing Means by Sensitivity Group

| Variable | t-statistic | Not Sensitive (mean) | Sensitive (mean) | p-value |
|---|---|---|---|---|
| age | 0.782 | 44.95 | 44.79 | 0.434 |
| income | -1.536 | 24,861.26 | 25,301.13 | 0.125 |
| loan_amount | -5.698 | 100,287.34 | 107,005.12 | < 0.001 |
| credit_score | -8.855 | 697.07 | 706.92 | < 0.001 |
| current_rate | -0.809 | 0.05 | 0.05 | 0.418 |
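These group comparisons can be reproduced with two-sample t-tests over each predictor; a minimal sketch, again assuming the `customers` data frame:

```r
# Two-sample t-tests of each predictor across sensitivity groups
# (sketch; assumes the data frame `customers`)
vars <- c("age", "income", "loan_amount", "credit_score", "current_rate")

ttest_table <- do.call(rbind, lapply(vars, function(v) {
  tt <- t.test(customers[[v]] ~ customers$sensitive)
  data.frame(Variable    = v,
             t_statistic = round(unname(tt$statistic), 3),
             p_value     = signif(tt$p.value, 3))
}))
ttest_table
```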
Summary Insights

1. The dataset contains 10,000 observations.
2. Sensitivity distribution: 6,638 not sensitive vs. 3,362 sensitive (about 34% sensitive).
3. Variables with significant group differences (p < 0.05): loan_amount and credit_score.
4. Correlation screen: no strong pairwise correlations (|r| > 0.5) among the predictors.
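The correlation screen in point 4 can be reproduced as follows (a sketch, assuming the `customers` data frame):

```r
# Pairwise correlations among the numeric predictors
# (sketch; assumes the data frame `customers`)
num_vars <- c("age", "income", "credit_score", "loan_amount", "current_rate")
round(cor(customers[num_vars]), 2)
```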
III. Predictive Modelling
Train/Test Split
The dataset was randomly partitioned, with 80% allocated for training and 20% for testing.
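A minimal sketch of the split, assuming the `customers` data frame from earlier and a fixed seed for reproducibility (the seed value is an assumption, not taken from the original analysis):

```r
# Randomly partition the data: 80% training, 20% testing
set.seed(123)  # seed chosen for reproducibility (illustrative assumption)
train_idx  <- sample(seq_len(nrow(customers)), size = floor(0.8 * nrow(customers)))
train_data <- customers[train_idx, ]
test_data  <- customers[-train_idx, ]
```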
Implement Random Forest with Optimized Hyperparameter Tuning
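The original write-up does not show the tuning code itself; the sketch below is one plausible implementation with caret that matches the configuration reported further down (100 trees, mtry = 3, 5-fold cross-validation):

```r
library(caret)
library(randomForest)

# Encode the target as a factor so caret treats this as classification
train_data$sensitive <- factor(train_data$sensitive,
                               levels = c(0, 1),
                               labels = c("not_sensitive", "sensitive"))

# 5-fold cross-validation over a small mtry grid; 100 trees per forest
ctrl <- trainControl(method = "cv", number = 5)
rf_fit <- train(sensitive ~ age + income + credit_score + loan_amount + current_rate,
                data      = train_data,
                method    = "rf",
                ntree     = 100,
                tuneGrid  = expand.grid(mtry = 2:4),
                trControl = ctrl)

rf_fit$bestTune  # the reported run selected mtry = 3
```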
Random Forest Model Performance Dashboard
Executive Summary
Performance Metrics
| Performance Metric | Value | Percentage |
|---|---|---|
| Accuracy | 0.6793 | 67.9% |
| Precision | 0.6938 | 69.4% |
| Recall | 0.9254 | 92.5% |
| F1-Score | 0.7930 | 79.3% |
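These metrics are computed from predictions on the held-out test set; a minimal sketch, assuming the `rf_fit` and `test_data` objects from the previous steps:

```r
# Score the test set and compute accuracy, precision, recall and F1
# (sketch; assumes `rf_fit` and `test_data` from above)
test_data$sensitive <- factor(test_data$sensitive,
                              levels = c(0, 1),
                              labels = c("not_sensitive", "sensitive"))
pred <- predict(rf_fit, newdata = test_data)

cm <- caret::confusionMatrix(pred, test_data$sensitive, positive = "sensitive")
cm$overall["Accuracy"]
cm$byClass[c("Precision", "Recall", "F1")]
```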
🔍 Detailed Analysis
🎉 Key Insights
- **Accuracy**: The model achieves 68% accuracy, modestly above the roughly 66% baseline of always predicting the majority (not sensitive) class.
- **Precision**: 69% precision on the sensitive class means about three in ten customers flagged as sensitive are false positives.
- **Precision/recall balance**: Recall (92.5%) is markedly higher than precision (69%); the model captures most truly sensitive customers at the cost of additional false positives.
- **Optimized parameters**: Hyperparameter tuning with 5-fold cross-validation selected the settings reported under Model Configuration.
- **Feature importance**: The model quantifies how influential each variable is in the predictions.
- **Top predictor**: credit_score shows the highest importance in determining customer sensitivity.
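The importance ranking behind the last two insights can be inspected directly (sketch; assumes the `rf_fit` object from above):

```r
# Variable importance of each predictor in the tuned forest
library(caret)
varImp(rf_fit)
plot(varImp(rf_fit), main = "Variable importance (Random Forest)")
```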
Model Configuration
| Configuration | Value |
|---|---|
| Algorithm | Random Forest |
| Number of Trees | 100 |
| Variables per Split | 3 |
| Training Method | Bootstrap Sampling |
| Validation | 5-Fold Cross-Validation |
IV. Conclusion
- Key Drivers: loan_amount and credit_score are the most predictive variables.
- Optimal Strategy: Offer competitive rates to high-credit-score customers, who are the most likely to refinance or churn when rates rise.
- Next Steps: Deploy the model via an API for real-time pricing decisions (a minimal deployment sketch follows this list).
- It is advisable to benchmark Random Forest against additional supervised learning models.
- A more realistic setup would use customer data observed over several years (panel data) and adapt the predictions and forecasting accordingly.
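As a sketch of the deployment idea mentioned above, the model could be exposed through a small plumber endpoint. The file names, route, and parameter handling below are illustrative assumptions, not the bank's actual service:

```r
# plumber.R: minimal scoring endpoint
# (sketch; assumes the tuned model was saved with saveRDS(rf_fit, "rf_fit.rds"))
library(plumber)

rf_fit <- readRDS("rf_fit.rds")

#* Predict whether a customer is rate-sensitive
#* @param age:numeric
#* @param income:numeric
#* @param credit_score:numeric
#* @param loan_amount:numeric
#* @param current_rate:numeric
#* @post /predict_sensitivity
function(age, income, credit_score, loan_amount, current_rate) {
  newdata <- data.frame(age          = as.numeric(age),
                        income       = as.numeric(income),
                        credit_score = as.numeric(credit_score),
                        loan_amount  = as.numeric(loan_amount),
                        current_rate = as.numeric(current_rate))
  list(prediction = as.character(predict(rf_fit, newdata = newdata)))
}

# Run locally with: plumber::pr_run(plumber::pr("plumber.R"), port = 8000)
```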
We help businesses and researchers solve complex challenges by providing expert guidance in statistics, machine learning, and tailored education.
Our core services include:
– Statistical Consulting:
Comprehensive consulting tailored to your data-driven needs.
– Training and Coaching:
In-depth instruction in statistics, machine learning, and the use of statistical software such as SAS, R, and Python.
– Reproducible Data Analysis Pipelines:
Development of documented, reproducible workflows using SAS macros and customized R and Python code.
– Interactive Data Visualization and Web Applications:
Creation of dynamic visualizations and web apps with R (Shiny, Plotly), Python (Streamlit, Dash by Plotly), and SAS (SAS Viya, SAS Web Report Studio).
– Automated Reporting and Presentation:
Generation of automated reports and presentations using Markdown and Quarto.
– Scientific Data Analysis:
Advanced analytical support for scientific research projects.