3 D Statistical Learning

Overview

The project describes an empirical investigation of how the coding of certain diagnoses by ambulatory physicians develops over time, in connection with determining changes in the morbidity structure of insured persons in statutory health insurance in Germany. The report aims to answer whether selected coded ICD diagnoses (three-digit, four-digit, or five-digit) are persistent. Typically acute illnesses have a short-term course, so their codes are expected to appear only briefly. A frequent, consecutive occurrence of an acute diagnosis across quarters can therefore indicate that the diagnosis was improperly carried over into the subsequent quarter. Such a diagnosis, which is non-persistent by nature but coded persistently and thus incorrectly, can strongly distort the recorded morbidity structure of insured persons in statutory health insurance.

Data Basis

The data body and data model for the subsequent investigations cannot be fully presented here for data privacy reasons. In this context, the term data model refers to a model of the data of insured individuals and the relationships between those data. The term data body refers to the dataset that populates this model; it consists of several files with variables typical of health insurance data. For each insured person, each selected diagnosis, and each quarter, a binary indicator is assigned: 1 if the diagnosis is coded for that person in that quarter, and 0 otherwise.
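The binary coding described above can be illustrated with a small Python sketch. It is not the SAS macro used in the project; the record layout (person, quarter, ICD code), the function name build_indicator, and the sample values are assumptions made purely for illustration, and codes are matched exactly rather than by prefix.

```python
def build_indicator(records, persons, quarters, diagnosis):
    """Return {person: [0/1 per quarter]} for one selected ICD code.

    records is assumed to be an iterable of (person, quarter, icd_code)
    tuples; this layout is hypothetical, not the project's data model.
    """
    # Collect every (person, quarter) pair in which the diagnosis was coded.
    positive = {(p, q) for p, q, code in records if code == diagnosis}
    # One row per insured person, one 0/1 entry per quarter.
    return {
        p: [1 if (p, q) in positive else 0 for q in quarters]
        for p in persons
    }


# Made-up sample data for three insured persons and three quarters.
records = [
    ("A", "2020Q1", "J06"),
    ("A", "2020Q2", "J06"),
    ("B", "2020Q1", "E11"),
    ("B", "2020Q3", "J06"),
]
persons = ["A", "B", "C"]
quarters = ["2020Q1", "2020Q2", "2020Q3"]

indicator = build_indicator(records, persons, quarters, "J06")
```

Persons without any matching record, such as C here, receive a row of zeros, so every insured person is represented in every quarter.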

Analysis 

The operationalization and measurement of the persistence of diagnoses are described here. In the first part of the chapter, persistence is operationalized and measured in terms of a prevalence course. Prevalence is an epidemiological parameter for disease frequency. In the second part, a measure of persistence is defined that is based on the idea of the mean estimator and takes into account that a persistent diagnosis appears again and again over time. In the third part, the operationalization and measurement of persistence is formulated as a classification problem in the sense of unsupervised learning. The temporal course of diagnoses is to be analyzed automatically, i.e., a complete analysis is to be performed for any selected diagnosis. The operationalization and measurement of persistence should also be independent of the data body and data model, i.e., the same procedure should be applicable to any data body and data model with the same structure. The section General Presentation of the SAS Syntax briefly presents the macro programs used for this purpose, with which the operationalizations and measurements of persistence described above can be produced for any ICD diagnosis and for various given data bodies and data models. Finally, the results are summarized.
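The first two measurement ideas can be sketched in Python under stated assumptions. The source does not give the exact definitions, so the following is only one plausible reading: prevalence per quarter is taken as the share of insured persons with the diagnosis coded in that quarter, and the mean-based persistence measure is taken as the average share of positive quarters among persons with at least one positive quarter. The function names and the toy data are hypothetical, and this is not the project's SAS implementation.

```python
def prevalence_course(indicator):
    """Share of insured persons with the diagnosis, per quarter."""
    rows = list(indicator.values())
    n = len(rows)
    # zip(*rows) walks the matrix column by column, i.e. quarter by quarter.
    return [sum(col) / n for col in zip(*rows)]


def mean_persistence(indicator):
    """Average share of positive quarters among persons ever diagnosed.

    Assumed definition for illustration only; the report's own measure
    may be defined differently.
    """
    shares = [sum(row) / len(row) for row in indicator.values() if any(row)]
    return sum(shares) / len(shares) if shares else 0.0


# Toy indicator matrix: one row per person, one 0/1 entry per quarter.
indicator = {
    "A": [1, 1, 1, 1],  # coded in every quarter
    "B": [1, 0, 0, 0],  # coded once
    "C": [0, 0, 0, 0],  # never coded
    "D": [0, 1, 1, 0],  # coded in two consecutive quarters
}

course = prevalence_course(indicator)
persistence = mean_persistence(indicator)
```

In this toy data the prevalence is 0.5 in each of the first three quarters although only person A is coded throughout, which suggests why a person-level measure can complement the prevalence course.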

Results

The SAS macro has been made available and deployed, and the measurement of persistence has been successfully completed for each selected ICD diagnosis.