3 D Statistical Learning

Introduction

In the statutory health insurance system in Germany, statutory health insurance funds and outpatient physicians agree on budgets for a portion of the services the physicians provide. These budgets are updated annually using a so-called morbidity change rate, which is based on the development of the diseases that physicians document among those insured under statutory health insurance and on the services the physicians bill. Diseases are documented according to the International Statistical Classification of Diseases and Related Health Problems, 10th Revision, German Modification (ICD-10-GM), as 0/1 codes. These individual diseases (approximately 15,000 different codes) are grouped into about 200 risk categories, primarily on the basis of medically justified classification systems. These insured-related risk categories are supplemented by 0/1 age/gender categories (currently 34 categories). The risk categories and age/gender categories serve as independent variables in a linear regression model whose dependent variable is the services billed by physicians. The individual cases are statutory health-insured individuals with their annually documented diseases and billed services. Cases are drawn from a birthday sample of all statutory health-insured individuals in Germany, covering 7-8 birth calendar days of three consecutive years. The sample also includes insured individuals for whom no diseases or services were documented over the specified period. The regression model is strongly restricted in that the risk categories are predetermined; the final model specification is a surcharge model in which, at most, some risk categories are combined. The resulting regression coefficients are then used to calculate the change in the aforementioned budget from one year to the next, a quantity that is financially significant for both the statutory health insurance funds and the medical profession.
In this context, the question arises of how the regression coefficients change when the prevalence (relative frequency) of documented diagnoses changes, i.e., when coding variations occur in which physicians document a prevalence that is too high or too low compared to the actual prevalence. Such variations result in a regressor matrix that differs from the one reflecting the actual conditions, and this work aims to investigate the question in more detail. In linear regression, changes in the values of the regressor matrix can cause large or small changes in the regression coefficients. A compression algorithm is used to calculate the so-called morbidity change rate of the insured, within which several preliminary regression calculations and a final one are conducted to determine the so-called cost weights of the insured. Each of these is a multiple linear regression without an intercept and with dichotomous independent variables, which reflect the 0/1 age/gender categories and the risk categories (indicating the presence or absence of a specific group of diseases). The dependent variable is a continuous, non-negative measure of the ‘service requirement’ that an insured person incurs through treatment by outpatient physicians. The aim of this study is to derive theoretically some properties of the behavior of the regression coefficients when the prevalences of individual independent variables, that is, the frequencies of occurrence of individual disease groups, change. Chapter 2 motivates the problem of prevalence change and presents it mathematically. Chapter 3 explains the statistical and mathematical fundamentals needed for the rest of the work. Chapter 4 examines theoretically the impact of a change in the regressor matrix on the regression coefficients.
Results on the relative error of the regression coefficients that take the component-wise behavior of these coefficients into account have rarely been reported (see [4], [9]), and previous work has imposed some restrictions on the structure of the change. The first part of Chapter 4 therefore examines the component-wise impact of a change in the regressor matrix on the regression coefficients without assumptions about the structure of the change. In the second part, some restrictions on the change are introduced for simplification, and in the third part, some theoretical properties of the regression coefficients after the change are derived. The chapter closes with an interpretation of the theoretical results in the setting of the regression model for those insured under statutory health insurance. To make the presentation more accessible, each part is illustrated with examples. Finally, the results of the analysis are briefly discussed and the methods presented in the paper are summarized.
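The question posed above can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not the procedure used for the morbidity change rate: the sample size, the two age/gender cells, the two risk categories, their prevalences, the cost weights, and the helper names `simulate`, `fit_no_intercept`, and `overcode` are all hypothetical. It fits an ordinary least-squares regression without an intercept on 0/1 regressors, then artificially raises the documented prevalence of one risk category and refits with the same dependent variable.

```python
import numpy as np


def simulate(n=5000, seed=0):
    """Simulate a toy 0/1 regressor matrix and a non-negative response.

    All numbers here are illustrative assumptions, not values from the text.
    """
    rng = np.random.default_rng(seed)
    # Two mutually exclusive age/gender cells: exactly one is 1 per person,
    # so the model can be fitted without an intercept.
    cell = rng.integers(0, 2, size=n)
    age_gender = np.column_stack([cell == 0, cell == 1]).astype(float)
    # Two risk categories with assumed "actual" prevalences of 10% and 5%.
    risk = (rng.random((n, 2)) < np.array([0.10, 0.05])).astype(float)
    X = np.hstack([age_gender, risk])
    beta_true = np.array([200.0, 300.0, 500.0, 900.0])  # assumed cost weights
    # Continuous response, truncated at zero to keep it non-negative.
    y = np.maximum(X @ beta_true + rng.normal(0.0, 50.0, size=n), 0.0)
    return X, y, beta_true


def fit_no_intercept(X, y):
    """Least-squares coefficients of y on X without an intercept column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def overcode(X, col, extra_share, rng):
    """Return a copy of X in which a share of the 0 entries in `col` is set to 1."""
    X_new = X.copy()
    zeros = np.flatnonzero(X_new[:, col] == 0)
    n_flip = int(extra_share * zeros.size)
    flipped = rng.choice(zeros, size=n_flip, replace=False)
    X_new[flipped, col] = 1.0
    return X_new


if __name__ == "__main__":
    X, y, beta_true = simulate()
    X_over = overcode(X, col=2, extra_share=0.05, rng=np.random.default_rng(1))
    print("prevalence before:", X[:, 2].mean(), "after:", X_over[:, 2].mean())
    print("coefficients before:", fit_no_intercept(X, y))
    print("coefficients after: ", fit_no_intercept(X_over, y))
```

In this toy setting, the additionally coded individuals do not incur the extra cost, so the coefficient of the over-coded category is pulled toward zero. How the remaining coefficients react depends on the structure of the change, which is the subject of the theoretical analysis.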

 

Motivation and Problem Statement 

Statistical methods are widely used in many studies to explain relationships between variables. The variable to be explained is referred to as the target variable, response, or dependent variable. The variables whose influence is being examined are referred to as predictors or regressors. The functional relationship between the target variable and the regressors is usually assumed to be linear. Goals of regression analysis may include demonstrating a known relationship, estimating the parameters of a known functional relationship, identifying a functional relationship, or forecasting future values of the dependent variable given specific values of the predictors. Changes in the predictors lead to different outcomes regarding the aforementioned objectives. In several applications, the predictors change in measurements or over time, and the impact on parameter estimation needs to be investigated. An example is the classification model used to calculate the so-called ‘diagnosis-related’ morbidity change rate of statutory health insurances.

2.2 Problem Statement

2.2.1 The Empirical Problem

This subsection primarily introduces the empirical problem of prevalence change. The calculation of the diagnosis-related morbidity change rate of statutory health insurances is carried out through the steps outlined in Figure 2.2.1. The focus of this figure is on determining the so-called cost weights per risk class through several preliminary and a final regression calculation. As mentioned in the introduction, each involves multiple linear regressions with dichotomous independent variables without an intercept, where the independent variables reflect the 0/1 age/gender categories and the risk categories. The problem of prevalence changes arises from physicians documenting diseases among the insured individuals, where there are different developments from year to year or even in different regions (federal states).
Through the documentation of diseases, the columns of the regressor matrix X change from year to year as the prevalences of the considered diagnoses change.
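One way to write this situation down is sketched below. Only the regressor matrix $X$ is named in the text; the symbols $y$, $\beta$, $\varepsilon$, $\Delta X$, $\tilde{X}$, $\tilde{\beta}$, $\pi_j$, $n$, and $p$ are notation assumed here for illustration, and the sketch further assumes that $X$ and $\tilde{X}$ have full column rank and that the dependent variable $y$ is the same before and after the change.

```latex
\begin{align*}
  y &= X\beta + \varepsilon, \qquad X \in \{0,1\}^{n \times p}
      \quad \text{(no intercept column)}, \\
  \hat{\beta} &= (X^{\top} X)^{-1} X^{\top} y, \\
  \pi_j &= \frac{1}{n} \sum_{i=1}^{n} x_{ij}
      \quad \text{(documented prevalence of category } j\text{)}, \\
  \tilde{X} &= X + \Delta X, \qquad \Delta X \in \{-1, 0, 1\}^{n \times p},
      \qquad \tilde{X} \in \{0,1\}^{n \times p}, \\
  \tilde{\beta} &= (\tilde{X}^{\top} \tilde{X})^{-1} \tilde{X}^{\top} y .
\end{align*}
```

In this notation, a change in the prevalence $\pi_j$ corresponds to nonzero entries in column $j$ of $\Delta X$, and the question of interest is how the components of $\tilde{\beta}$ differ from those of $\hat{\beta}$.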