MEASURING CLINICAL HEALTH DATABASE SIMILARITY USING CLUSTERING AND CLASSIFICATION
Abstract
Clustering data derived from Electronic Health Record (EHR) systems is important both for discovering
relationships between the clinical profiles of patients and as a preprocessing step for analysis
tasks, such as classification. However, the heterogeneity of these data makes the application of
existing clustering methods difficult and calls for new clustering approaches. In this paper, we
propose the first approach for clustering a dataset in which each record contains a patient's
values in demographic attributes and their set of diagnosis codes. Our approach represents the
dataset in a binary form in which the features are selected demographic values, as well as
combinations (patterns) of frequent and correlated diagnosis codes. This representation enables
measuring similarity between records using cosine similarity, an effective measure for binary-represented
data, and finding compact, well-separated clusters through hierarchical clustering.
Our experiments using two publicly available EHR datasets, comprising over 26,000 and
52,000 records, demonstrate that our approach is able to construct clusters with correlated
demographics and diagnosis codes, and that it is efficient and scalable.
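
As an illustration of the pipeline summarized above (binary representation, cosine similarity, hierarchical clustering), the following minimal Python sketch may be helpful. It is not the implementation evaluated in this paper. The toy records, the hand-picked feature list, the helper names (encode, cosine, agglomerate) and the choice of average linkage are all assumptions made for illustration. In particular, the selection of frequent and correlated diagnosis-code patterns is not implemented here.

```python
from itertools import combinations
from math import sqrt

# Toy records: (demographic values, diagnosis codes). Invented for illustration only.
records = [
    ({"age:60-70", "sex:F"}, {"250.00", "401.9"}),
    ({"age:60-70", "sex:F"}, {"250.00", "401.9", "272.4"}),
    ({"age:20-30", "sex:M"}, {"493.90"}),
    ({"age:20-30", "sex:M"}, {"493.90", "477.9"}),
]

# Hypothetical features: selected demographic values and diagnosis-code patterns.
# Assumption: these are hand-picked here, not mined from the data.
features = [
    ("demo", "age:60-70"),
    ("demo", "age:20-30"),
    ("pattern", frozenset({"250.00", "401.9"})),
    ("pattern", frozenset({"493.90"})),
]

def encode(record):
    """Map a record to a binary vector with one entry per feature."""
    demo, codes = record
    return [
        int(value in demo) if kind == "demo" else int(value <= codes)
        for kind, value in features
    ]

def cosine(u, v):
    """Cosine similarity of two binary vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    # For binary vectors, the squared norm equals the number of ones.
    norm = sqrt(sum(u) * sum(v))
    return dot / norm if norm else 0.0

def agglomerate(vectors, k):
    """Merge the closest pair of clusters (average linkage) until k remain."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        # Mean cosine distance over all cross-cluster pairs of records.
        total = sum(1 - cosine(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[x] = clusters[x] + clusters[y]  # x < y, so deleting y is safe
        del clusters[y]
    return clusters

vectors = [encode(r) for r in records]
clusters = agglomerate(vectors, k=2)
print(clusters)  # [[0, 1], [2, 3]]
```

In this toy example, the two older patients who share the diabetes and hypertension pattern fall into one cluster and the two younger patients with the asthma code into the other. The pairwise linkage computation is quadratic in the number of records per merge, so this sketch is suitable only for small inputs.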