Statistical Learning Dr. D. Djeudeu

 
Project Description: Predicting Sick Payments for a Health Insurance Company

Problem Statement: A health insurance company in Germany aims to predict sick payments (Krankengeld) for its policyholders based on historical data. The goal is to develop a predictive model that accurately forecasts the likelihood of sick payments, enabling the company to optimize resource allocation and enhance risk management strategies.

Background

If you are unable to work because of illness, your employer continues to pay your salary for up to six weeks. This is called wage continuation. If you cannot work for more than six weeks, you can receive sick pay from your health insurance for a specific period, which provides you with financial security. Sick pay usually amounts to 70% of the employee’s last gross salary, but it can vary depending on the health insurance and the individual case.

Before your employer begins wage continuation, you must furnish a Proof of Incapacity for Work. Once your physician determines that you are unable to work, the data from your sick note is transmitted directly to your health insurance provider, so the information about your sick leave is available to the insurer from day one. In most circumstances, no proof is needed if you are unable to work for up to three days.
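The split between wage continuation and sick pay described above can be illustrated with a small calculation. This is a minimal sketch under stated assumptions: absences are counted in calendar days, six weeks equal 42 days, and the rate is a flat 70% of a hypothetical daily gross salary. The function names and the `daily_gross` parameter are illustrative only, and real entitlements can differ by insurer and individual case.

```python
WAGE_CONTINUATION_DAYS = 42  # six weeks paid by the employer
SICK_PAY_RATE = 0.70         # "usually" 70% of last gross salary (may vary)


def split_absence(absence_days: int) -> tuple[int, int]:
    """Return (days covered by the employer, days covered by the insurer)."""
    employer_days = min(absence_days, WAGE_CONTINUATION_DAYS)
    insurer_days = max(0, absence_days - WAGE_CONTINUATION_DAYS)
    return employer_days, insurer_days


def estimated_sick_pay(absence_days: int, daily_gross: float) -> float:
    """Rough sick pay estimate; ignores any caps or deductions (assumption)."""
    _, insurer_days = split_absence(absence_days)
    return round(insurer_days * SICK_PAY_RATE * daily_gross, 2)


# A 60-day absence: 42 days of wage continuation, 18 days of sick pay.
print(split_absence(60))
print(estimated_sick_pay(60, daily_gross=100.0))
```

An absence of 42 days or fewer produces no sick pay in this sketch, which matches the idea that the insurer only becomes liable after the wage continuation period ends.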

Data analytics can be used to identify and prevent fraudulent claims. Health insurance providers also aim to prevent cases of prolonged work incapacity among insured patients with sick notes. Ideally, a patient's incapacity ends within the six weeks during which the employer provides wage continuation. If the incapacity lasts longer than six weeks, the health insurance provider covers 70% of the patient’s wage.

 

Variables Description:

  • S_BWKGPKEY: Unique identifier for each policyholder.
  • RECORDMODE: Mode of record (not described in the provided code).
  • IS_CLAIM: Indicator variable for claim occurrence.
  • S_BWAUINT: Categorical variable indicating AU cases (AU = Arbeitsunfähigkeit, incapacity for work) based on AU duration relative to a model threshold.
  • S_BWAUDAUE: Integer representing AU duration.
  • S_BWPARTIT: Participant indicator.
  • S_BWALTER: Age of the policyholder.
  • S_BWANZDIAG: Number of AU diagnoses.
  • S_BWAUFAVJ: Integer indicating the number of AU cases in the previous year.
  • S_BWAUFAVJF: Integer indicating the number of AU cases in the previous year (month).
  • S_BWAUFAVJM: Integer indicating the number of AU cases in the previous year (day).
  • S_BWAUFAVJI: Integer indicating the number of AU cases in the previous year (hour).
  • And other variables representing various attributes of policyholders and their sickness data.
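To make the variable list more concrete, the following sketch models a handful of the fields as a typed record with basic sanity checks. The field types, the 0/1 coding of IS_CLAIM, and the non-negativity checks are assumptions made for illustration; the class name `PolicyholderRecord` and the sample values are hypothetical and do not come from the actual dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyholderRecord:
    S_BWKGPKEY: str   # unique policyholder key (string type is an assumption)
    IS_CLAIM: int     # claim indicator (0/1 coding is an assumption)
    S_BWAUDAUE: int   # AU duration
    S_BWALTER: int    # age of the policyholder
    S_BWANZDIAG: int  # number of AU diagnoses
    S_BWAUFAVJ: int   # number of AU cases in the previous year

    def __post_init__(self) -> None:
        # Reject values that cannot be valid counts, ages, or indicators.
        if self.IS_CLAIM not in (0, 1):
            raise ValueError("IS_CLAIM must be 0 or 1")
        for name in ("S_BWAUDAUE", "S_BWALTER", "S_BWANZDIAG", "S_BWAUFAVJ"):
            if getattr(self, name) < 0:
                raise ValueError(f"{name} must be non-negative")


# Made-up example row, for illustration only.
record = PolicyholderRecord("K-0001", 1, 57, 45, 2, 1)
print(record)
```

Validating records this early keeps impossible values, such as negative durations, out of the later modeling stages.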

Goal of the Analysis: The primary objective is to predict the probability of sick payments for policyholders based on their demographic and health-related attributes. By accurately forecasting sick payments, the health insurance company can effectively manage its financial reserves and provide better services to its policyholders.

Methodology: The project proceeds in several stages as outlined below:

  1. Introduction:

    • Describes the decision tree methodology for sick payment prediction.
    • Aims to reproduce and optimize the procedure defined in reference [21].
    • Includes descriptive analysis of the data to formulate hypotheses.
  2. Datasets and Variable Descriptions:

    • Describes core and environmental datasets available for modeling.
    • Provides brief descriptions of variables in the datasets.
    • Includes univariate variable descriptions and explanations.
  3. Univariate Variable Descriptions:

    • Describes individual variables in the core dataset.
    • Includes scales, central tendencies, and dispersion characteristics essential for modeling.
  4. Bivariate Analyses of Variables:

    • Examines the relationship between explanatory variables and the target variable.
    • Conducts statistical tests after stratifying variables by the target class.
  5. Outcome Variable Description:

    • The outcome variable, KG-Intervall, represents sick payment intervals categorized into:
      • 1: No sick payment: AU shorter than 6 weeks (42 days)
      • 2: Sick payment days <= Sick payment limit (Normal sick payment case)
      • 3: Sick payment days > Sick payment limit (Extended sick payment)
    • The sick payment limit is standardized to 63 days.
  6. Decision Tree and Training Process:

    • Utilizes decision tree-based classification models for prediction.
    • Features:
      • Data Preprocessing: Cleansing and formatting datasets for analysis.
      • Data Joining: Merging datasets containing policyholder and sick payment information.
      • Data Sampling: Random selection of dataset subsets for analysis.
      • Handling Imbalanced Data: Addresses class imbalance through undersampling and oversampling techniques.
      • Feature Engineering: Selects relevant features and encodes categorical variables.
      • Model Training: Utilizes decision tree algorithms to train predictive models.
      • Hyperparameter Tuning: Optimizes model performance via techniques like random search.
      • Model Evaluation: Assesses model accuracy using metrics and confusion matrices.
      • Prediction: Generates sick payment predictions based on the trained model.
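
The outcome encoding from stage 5 and the training steps from stage 6 can be sketched end to end as follows. This is an illustrative sketch, not the procedure from reference [21]. It assumes scikit-learn and NumPy, uses purely synthetic data, and treats sick payment days as AU days minus 42. The boundary handling (an AU of exactly 42 days counts as class 1), the three stand-in features, the search grid, and the helper names `kg_interval` and `random_oversample` are all assumptions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

WAGE_CONTINUATION_DAYS = 42   # six weeks paid by the employer
SICK_PAYMENT_LIMIT_DAYS = 63  # standardized sick payment limit


def kg_interval(au_days: int) -> int:
    """Map an AU duration in days to a KG-Intervall class (1, 2 or 3)."""
    sick_payment_days = max(0, au_days - WAGE_CONTINUATION_DAYS)
    if sick_payment_days == 0:
        return 1  # no sick payment
    if sick_payment_days <= SICK_PAYMENT_LIMIT_DAYS:
        return 2  # normal sick payment case
    return 3      # extended sick payment


def random_oversample(X, y, rng):
    """Duplicate random minority-class rows until all classes are equal in size."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx = [np.arange(len(y))]
    for c, k in zip(classes, counts):
        if k < target:
            idx.append(rng.choice(np.flatnonzero(y == c), size=target - k, replace=True))
    idx = np.concatenate(idx)
    return X[idx], y[idx]


# Synthetic stand-ins for age, number of diagnoses, and previous-year AU cases.
rng = np.random.default_rng(0)
n = 600
age = rng.integers(18, 66, size=n)
n_diag = rng.integers(1, 6, size=n)
prev_cases = rng.integers(0, 5, size=n)
au_days = rng.poisson(lam=5 + 15 * n_diag + 12 * prev_cases)

X = np.column_stack([age, n_diag, prev_cases])
y = np.array([kg_interval(int(d)) for d in au_days])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Balance only the training split so the test set keeps the original class mix.
# Simplification: oversampling before cross-validation lets duplicated rows
# appear in several folds, which makes the CV scores optimistic.
X_bal, y_bal = random_oversample(X_train, y_train, rng)

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions={
        "max_depth": [2, 3, 4, 5, 6, 8],
        "min_samples_leaf": [1, 5, 10, 20],
        "criterion": ["gini", "entropy"],
    },
    n_iter=10,
    cv=3,
    scoring="balanced_accuracy",
    random_state=0,
)
search.fit(X_bal, y_bal)

y_pred = search.predict(X_test)
proba = search.predict_proba(X_test)  # per-class probabilities for each row
cm = confusion_matrix(y_test, y_pred, labels=[1, 2, 3])

print("best parameters:", search.best_params_)
print(cm)
```

Rows of the confusion matrix correspond to the true KG-Intervall classes and columns to the predicted ones, so off-diagonal entries in the third row show extended sick payment cases that the tree missed.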

Conclusion: Developing an accurate predictive model for sick payments empowers the health insurance company to bolster its risk management capabilities and provide enhanced financial protection to policyholders during periods of illness. The insights derived from this analysis can guide strategic decision-making and improve overall operational efficiency within the organization.