Introduction
A leading retail bank aims to optimize the pricing of its credit portfolio by predicting how sensitive its customers are to changes in interest rates.
This case study provides a step-by-step walkthrough of a predictive modeling task: a binary classification problem tackled with a Random Forest algorithm.
By following along, you’ll gain practical insight into each stage of the modeling process and will be able to adapt the code and methodology to your own use case.
I. Basic Data Description
1. Variable Description
We have a dataset of 10,000 customers/contracts with:
- Demographics (age, income)
- Credit behavior (loan amount, credit score)
- Response to rate changes (binary: sensitive = 1 if the customer churned/refinanced when rates increased, 0 otherwise).
First Data Summary
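The preview below was generated along these lines (a minimal sketch; the file name `customers.csv` and the object name `customers` are illustrative assumptions, not details from the original analysis):

```r
# Load the customer dataset and preview the first rows
# (sketch; file and object names are assumptions)
library(tidyverse)

customers <- read_csv("customers.csv")
head(customers)
```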
```
## # A tibble: 6 × 7
##   customer_id   age income credit_score loan_amount current_rate sensitive
##         <dbl> <dbl>  <dbl>        <dbl>       <dbl>        <dbl>     <dbl>
## 1           1  58.7 39401.         670.     151648.       0.0308         0
## 2           2  39.4 20028.         689.      65800.       0.0548         1
## 3           3  48.6 19059.         704.     169695.       0.0774         0
## 4           4  51.3 18044.         727.     183907.       0.0718         0
## 5           5  49.0 31402.         738.     101792.       0.0686         1
## 6           6  43.9  9786.         696.      81714.       0.0664         0
```
| Characteristic | N = 10,000¹ |
|---|---|
| age | 45 (38, 52) |
| income | 22,070 (15,717, 30,950) |
| loan_amount | 102,841 (54,977, 150,259) |
| credit_score | 701 (665, 735) |
| current_rate | 0.050 (0.035, 0.065) |

¹ Median (Q1, Q3)
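A summary table in this Median (Q1, Q3) format can be produced with the gtsummary package; the sketch below assumes the `customers` data frame loaded above:

```r
# Median (Q1, Q3) summary of the numeric variables
# (sketch; assumes the data frame `customers`)
library(dplyr)
library(gtsummary)

customers %>%
  select(age, income, loan_amount, credit_score, current_rate) %>%
  tbl_summary(statistic = all_continuous() ~ "{median} ({p25}, {p75})")
```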
II. Exploratory Data Analysis
Dataset Structure

```
## tibble [10,000 × 7] (S3: tbl_df/tbl/data.frame)
##  $ customer_id : num [1:10000] 1 2 3 4 5 6 7 8 9 10 ...
##  $ age         : num [1:10000] 58.7 39.4 48.6 51.3 49 ...
##  $ income      : num [1:10000] 39401 20028 19059 18044 31402 ...
##  $ credit_score: num [1:10000] 670 689 704 727 738 ...
##  $ loan_amount : num [1:10000] 151648 65800 169695 183907 101792 ...
##  $ current_rate: num [1:10000] 0.0308 0.0548 0.0774 0.0718 0.0686 ...
##  $ sensitive   : num [1:10000] 0 1 0 0 1 0 0 1 0 0 ...
```
Table: Descriptive Statistics by Sensitivity Group (Mean ± SD)

| Group | N | Age | Income | Loan Amount | Credit Score | Current Rate |
|---|---|---|---|---|---|---|
| Not Sensitive | 6,638 | 45.0 (10.1) | $24,861 ($13,167) | $100,287 ($56,017) | 697.1 (48.1) | 0.050 (0.020) |
| Sensitive | 3,362 | 44.8 (9.9) | $25,301 ($13,708) | $107,005 ($55,536) | 706.9 (54.6) | 0.050 (0.020) |
Table: T-test Results: Comparing Means by Sensitivity Group

| Variable | t-statistic | Not Sensitive (mean) | Sensitive (mean) | p-value |
|---|---|---|---|---|
| age | 0.782 | 44.95 | 44.79 | 0.434 |
| income | -1.536 | 24,861.26 | 25,301.13 | 0.125 |
| loan_amount | -5.698 | 100,287.34 | 107,005.12 | < 0.001 |
| credit_score | -8.855 | 697.07 | 706.92 | < 0.001 |
| current_rate | -0.809 | 0.05 | 0.05 | 0.418 |
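These group comparisons can be reproduced with two-sample t-tests over each predictor; a minimal sketch, again assuming the `customers` data frame:

```r
# Two-sample t-tests of each predictor across sensitivity groups
# (sketch; assumes the data frame `customers`)
vars <- c("age", "income", "loan_amount", "credit_score", "current_rate")

ttest_table <- do.call(rbind, lapply(vars, function(v) {
  tt <- t.test(customers[[v]] ~ customers$sensitive)
  data.frame(Variable    = v,
             t_statistic = round(unname(tt$statistic), 3),
             p_value     = signif(tt$p.value, 3))
}))
ttest_table
```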
Summary Insights

1. The dataset contains 10,000 observations.
2. Sensitivity distribution: 6,638 not sensitive vs. 3,362 sensitive (about 34% sensitive).
3. Variables with significant group differences (p < 0.05): loan_amount and credit_score.
4. Correlation screen: no strong pairwise correlations (|r| > 0.5) among the predictors.
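The correlation screen in point 4 can be reproduced as follows (a sketch, assuming the `customers` data frame):

```r
# Pairwise correlations among the numeric predictors
# (sketch; assumes the data frame `customers`)
num_vars <- c("age", "income", "credit_score", "loan_amount", "current_rate")
round(cor(customers[num_vars]), 2)
```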
III. Predictive Modelling
Train/Test Split
The dataset was randomly partitioned, with 80% allocated for training and 20% for testing.
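A minimal sketch of the split, assuming the `customers` data frame from earlier and a fixed seed for reproducibility (the seed value is an assumption, not taken from the original analysis):

```r
# Randomly partition the data: 80% training, 20% testing
set.seed(123)  # seed chosen for reproducibility (illustrative assumption)
train_idx  <- sample(seq_len(nrow(customers)), size = floor(0.8 * nrow(customers)))
train_data <- customers[train_idx, ]
test_data  <- customers[-train_idx, ]
```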
Implement Random Forest with Optimized Hyperparameter Tuning
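The original write-up does not show the tuning code itself; the sketch below is one plausible implementation with caret that matches the configuration reported further down (100 trees, mtry = 3, 5-fold cross-validation):

```r
library(caret)
library(randomForest)

# Encode the target as a factor so caret treats this as classification
train_data$sensitive <- factor(train_data$sensitive,
                               levels = c(0, 1),
                               labels = c("not_sensitive", "sensitive"))

# 5-fold cross-validation over a small mtry grid; 100 trees per forest
ctrl <- trainControl(method = "cv", number = 5)
rf_fit <- train(sensitive ~ age + income + credit_score + loan_amount + current_rate,
                data      = train_data,
                method    = "rf",
                ntree     = 100,
                tuneGrid  = expand.grid(mtry = 2:4),
                trControl = ctrl)

rf_fit$bestTune  # the reported run selected mtry = 3
```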
Random Forest Model Performance Dashboard
Executive Summary
Performance Metrics
| Performance Metric | Value | Percentage |
|---|---|---|
| Accuracy | 0.6793 | 67.9% |
| Precision | 0.6938 | 69.4% |
| Recall | 0.9254 | 92.5% |
| F1-Score | 0.7930 | 79.3% |
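These metrics are computed from predictions on the held-out test set; a minimal sketch, assuming the `rf_fit` and `test_data` objects from the previous steps:

```r
# Score the test set and compute accuracy, precision, recall and F1
# (sketch; assumes `rf_fit` and `test_data` from above)
test_data$sensitive <- factor(test_data$sensitive,
                              levels = c(0, 1),
                              labels = c("not_sensitive", "sensitive"))
pred <- predict(rf_fit, newdata = test_data)

cm <- caret::confusionMatrix(pred, test_data$sensitive, positive = "sensitive")
cm$overall["Accuracy"]
cm$byClass[c("Precision", "Recall", "F1")]
```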
🔍 Detailed Analysis
🎉 Key Insights
- **Accuracy**: The model achieves 68% accuracy, modestly above the roughly 66% baseline of always predicting the majority (not sensitive) class.
- **Precision**: 69% precision on the sensitive class means about three in ten customers flagged as sensitive are false positives.
- **Precision/recall balance**: Recall (92.5%) is markedly higher than precision (69%); the model captures most truly sensitive customers at the cost of additional false positives.
- **Optimized parameters**: Hyperparameter tuning with 5-fold cross-validation selected the settings reported under Model Configuration.
- **Feature importance**: The model quantifies how influential each variable is in the predictions.
- **Top predictor**: credit_score shows the highest importance in determining customer sensitivity.
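The importance ranking behind the last two insights can be inspected directly (sketch; assumes the `rf_fit` object from above):

```r
# Variable importance of each predictor in the tuned forest
library(caret)
varImp(rf_fit)
plot(varImp(rf_fit), main = "Variable importance (Random Forest)")
```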
Model Configuration
| Configuration | Value |
|---|---|
| Algorithm | Random Forest |
| Number of Trees | 100 |
| Variables per Split | 3 |
| Training Method | Bootstrap Sampling |
| Validation | 5-Fold Cross-Validation |
IV. Conclusion
- Key Drivers: loan_amount and credit_score are the most predictive variables.
- Optimal Strategy: Offer competitive rates to high-credit-score customers, who are the most likely to refinance or churn when rates rise.
- Next Steps: Deploy the model via an API for real-time pricing decisions (a minimal deployment sketch follows this list).
- It is advisable to benchmark Random Forest against additional supervised learning models.
- A more realistic setup would use customer data observed over several years (panel data) and adapt the predictions and forecasting accordingly.
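As a sketch of the deployment idea mentioned above, the model could be exposed through a small plumber endpoint. The file names, route, and parameter handling below are illustrative assumptions, not the bank's actual service:

```r
# plumber.R: minimal scoring endpoint
# (sketch; assumes the tuned model was saved with saveRDS(rf_fit, "rf_fit.rds"))
library(plumber)

rf_fit <- readRDS("rf_fit.rds")

#* Predict whether a customer is rate-sensitive
#* @param age:numeric
#* @param income:numeric
#* @param credit_score:numeric
#* @param loan_amount:numeric
#* @param current_rate:numeric
#* @post /predict_sensitivity
function(age, income, credit_score, loan_amount, current_rate) {
  newdata <- data.frame(age          = as.numeric(age),
                        income       = as.numeric(income),
                        credit_score = as.numeric(credit_score),
                        loan_amount  = as.numeric(loan_amount),
                        current_rate = as.numeric(current_rate))
  list(prediction = as.character(predict(rf_fit, newdata = newdata)))
}

# Run locally with: plumber::pr_run(plumber::pr("plumber.R"), port = 8000)
```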
We help businesses and researchers solve complex challenges by providing expert guidance in statistics, machine learning, and tailored education.
Our core services include:
– Statistical Consulting:
Comprehensive consulting tailored to your data-driven needs.
– Training and Coaching:
In-depth instruction in statistics, machine learning, and the use of statistical software such as SAS, R, and Python.
– Reproducible Data Analysis Pipelines:
Development of documented, reproducible workflows using SAS macros and customized R and Python code.
– Interactive Data Visualization and Web Applications:
Creation of dynamic visualizations and web apps with R (Shiny, Plotly), Python (Streamlit, Dash by Plotly), and SAS (SAS Viya, SAS Web Report Studio).
– Automated Reporting and Presentation:
Generation of automated reports and presentations using Markdown and Quarto.
– Scientific Data Analysis:
Advanced analytical support for scientific research projects.