Use Case Description: Prediction of Contribution Margin (DB) per Insurance Day (VT)

Objective: Predict the Contribution Margin (DB) per Insurance Day (VT) using machine learning techniques.

Data Description: Two datasets are used, one each for the years 2019 and 2020:

  • Data_DB2019: Contains simulated data on insurance communication with customers, including various attributes such as regional information, number of members, age distribution, and Contribution Margin per Insurance Day.
  • Data_DB2020: Similar to Data_DB2019 but for the year 2020.

Variable Description:

  1. Regionalität_WL: Regional categorization of the insurance communication.
  2. Wirtschaftszweig: Economic sector of the insured entity.
  3. Branche: Industry sector of the insured entity.
  4. KREIS: Geographic area of the insured entity.
  5. ORT: Location of the insured entity.
  6. Anzahl_Mitglieder: Number of members associated with the insured entity.
  7. Anzahl_Azubis: Number of apprentices associated with the insured entity.
  8. VersTage_bei_AG: Duration of insurance coverage, in days, with the employer (AG, Arbeitgeber).
  9. Durchschnittsalter: Average age of members.
  10. Alter_36_50: Number of members aged between 36 and 50.
  11. Alter_größer_50: Number of members older than 50.
  12. TTS1: Type of insurance service 1.
  13. VersTage_bei_TTS1: Duration of insurance coverage with TTS1.
  14. TTS2: Type of insurance service 2.
  15. VersTage_bei_TTS2: Duration of insurance coverage with TTS2.
  16. AU_Tage: Number of days of sick leave.
  17. KG_Tage: Number of days of parental leave.
  18. DB_je_VT: Contribution Margin per Insurance Day (target variable).

Modeling Steps:

  1. Data Preparation:

    • Data from both years are combined and cleaned.
    • Variables are appropriately formatted for analysis.
  2. Model Training:

    • The dataset is split into training and testing sets.
    • The mlr3 package is used for modeling, with the regr.ranger learner (a random forest from the ranger package) selected for regression.
  3. Parameter Tuning:

    • Hyperparameters of the model are optimized using a grid search approach.
    • The optimal configuration is chosen based on mean squared error (MSE) performance.
  4. Feature Selection:

    • Random forest-based feature selection is performed to identify the most important variables.
    • Features are selected based on their contribution to the model’s predictive performance.
  5. Model Evaluation:

    • The performance of the trained models is evaluated using resampling techniques.
    • Models are benchmarked against baseline models and evaluated on several metrics, including MSE and mean absolute error (MAE).
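
The training, tuning, and evaluation steps above can be illustrated with a small sketch. The analysis itself uses R with mlr3 and the regr.ranger learner; the Python below is not that code. It is a minimal stand-in, assuming synthetic data and a one-feature k-nearest-neighbour regressor in place of the random forest, that shows the mechanics only: cross-validated MSE and MAE, a mean-estimator baseline, and a grid search that keeps the hyperparameter value with the lowest MSE. All function names, the grid values, and the data are hypothetical.

```python
import random
from statistics import mean


def mse(y_true, y_pred):
    """Mean squared error."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def fit_mean(xs, ys):
    """Baseline: always predict the training mean of the target."""
    m = mean(ys)
    return lambda x: m


def make_knn(k):
    """Return a fit function for a 1-D k-nearest-neighbour regressor.

    Assumption: this simple model stands in for the random forest.
    """
    def fit(xs, ys):
        pairs = list(zip(xs, ys))

        def predict(x):
            nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
            return mean(y for _, y in nearest)

        return predict
    return fit


def cross_validate(fit, xs, ys, folds=5, seed=0):
    """Average MSE and MAE over `folds` held-out folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    scores = {"mse": [], "mae": []}
    for f in range(folds):
        test = idx[f::folds]
        test_set = set(test)
        train = [i for i in idx if i not in test_set]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        preds = [model(xs[i]) for i in test]
        truth = [ys[i] for i in test]
        scores["mse"].append(mse(truth, preds))
        scores["mae"].append(mae(truth, preds))
    return {name: mean(vals) for name, vals in scores.items()}


def grid_search(grid, xs, ys):
    """Evaluate every candidate k and return the one with the lowest MSE."""
    results = {k: cross_validate(make_knn(k), xs, ys) for k in grid}
    best_k = min(results, key=lambda k: results[k]["mse"])
    return best_k, results


# Synthetic data (hypothetical, not taken from Data_DB2019 or Data_DB2020).
rng = random.Random(42)
xs = [rng.uniform(20, 60) for _ in range(200)]
ys = [0.05 * x + rng.gauss(0, 0.3) for x in xs]

baseline = cross_validate(fit_mean, xs, ys)
best_k, results = grid_search([1, 5, 15, 45], xs, ys)
print("baseline:", baseline)
print("best k:", best_k, "scores:", results[best_k])
```

Every candidate is scored on the same folds (fixed seed), so differences in MSE reflect the hyperparameter rather than the split. In mlr3 the corresponding pieces would be a resampling strategy, a featureless baseline learner, and a grid-search tuner.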

Results and Insights:

  • Without optimization, none of the models outperformed the mean estimator based on MSE.
  • Random forests showed promising results even without optimization, outperforming simple decision trees and mean estimators.
  • Further optimization of random forests led to significant improvements in predictive performance.
  • Excluding extreme values of the Contribution Margin may improve model performance.
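
One way to exclude extreme target values is to trim by percentiles. The sketch below is an illustration only: the 1st and 99th percentile cut-offs, the function name trim_extremes, and the sample rows are assumptions, since the source does not state how extreme values were identified.

```python
from statistics import quantiles


def trim_extremes(rows, target="DB_je_VT", lower_pct=1, upper_pct=99):
    """Keep rows whose target lies between two percentile cut-offs.

    The cut-offs are hypothetical defaults, not values from the analysis.
    """
    values = [row[target] for row in rows]
    cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
    lo, hi = cuts[lower_pct - 1], cuts[upper_pct - 1]
    kept = [row for row in rows if lo <= row[target] <= hi]
    return kept, (lo, hi)


# Hypothetical sample: 100 ordinary values plus two extreme ones.
rows = [{"DB_je_VT": float(v)} for v in list(range(100)) + [-500, 500]]
kept, bounds = trim_extremes(rows)
print(len(rows), "rows before,", len(kept), "rows after; bounds:", bounds)
```

To avoid leaking information from the test data, the cut-offs would normally be computed on the training set only and then applied to both sets.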

Conclusion: The use of random forest models, along with proper parameter tuning and feature selection, shows promise for accurately predicting the Contribution Margin per Insurance Day. Further optimization and refinement of the models could lead to even better predictive performance.

Note: The provided R code and analysis outline the process and techniques used for Contribution Margin prediction, highlighting the importance of data preprocessing, modeling, and evaluation in achieving accurate predictions.