In this article, Antonio Javier Sutil Jiménez discusses the study “Machine Learning Prediction of Incidence of Alzheimer’s Disease Using Large-Scale Administrative Health Data” (Park et al., 2020).
Why is the Study of Alzheimer’s Prediction Using Machine Learning Important?
The advancement of technology can sometimes provide unexpected solutions to medical problems. One example of this is the use of administrative health data to create predictive risk models for Alzheimer’s disease.
The great novelty of the work by Park and colleagues is its use of this massive amount of data which, as the researchers describe, remains largely unexplored. The digitization of medical records has turned them into a valuable resource that reduces the effort and cost of data collection.
Despite this, their application to diseases like Alzheimer’s has been limited. This has been partly remedied by the increase in computing power, which allows machine learning techniques to be applied to the analysis of these data and predictive models to be built that, thanks to sufficiently large samples, can be representative of the population.
Premise of the Study
The study starts from the premise that using data from individuals at risk of developing Alzheimer’s disease will allow earlier detection of preclinical-stage cases and, therefore, better therapeutic strategies.
To this end, the research team had access to the database of the South Korean national health system, which contained the health records of over 40,000 individuals over 65 years of age, with a wealth of information including personal history, family history, sociodemographic data, diagnoses, and medications.
What Has Been Done?
Dataset
To carry out the study, a cohort from the NHIS-NSC (The National Health Insurance Service–National Sample Cohort) of South Korea was used, which included over one million participants, followed for eleven years (2002 to 2013).
The database contained information on health services, diagnoses, and prescriptions for each individual, as well as clinical characteristics including demographic data, income levels based on monthly salary, disease and medication codes, laboratory values, health profiles, and personal and family disease history. From this sample, 40,736 adults over 65 years old were selected for the study.
Operational Definition of Alzheimer’s Disease
Next, an operational definition of Alzheimer’s disease was created based on the algorithm from a previous Canadian study.
This algorithm, which combines hospitalization codes, medical claims, and Alzheimer’s-specific prescriptions, achieved a sensitivity of 79% and a specificity of 99%.
To improve the accuracy of disease detection, a “definite AD” label was used for cases identified with a high degree of certainty and a “probable AD” label for cases confirmed solely through ICD-10 codes (the International Classification of Diseases, 10th revision), the latter to minimize false negatives. With these labels, the prevalence of Alzheimer’s disease was 1.5% for “definite AD” and 4.9% for “probable AD.”
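As an illustration, below is a minimal sketch of how such a labelling rule might look in code. The column names, ICD-10 codes, and drug list are assumptions made for the example, not the study’s actual implementation of the Canadian algorithm.

```python
import pandas as pd

# Hypothetical code lists; the real operational definition combines
# hospitalization codes, medical claims, and AD-specific prescriptions.
AD_ICD10_CODES = {"F00", "G30"}  # dementia in Alzheimer's disease
AD_DRUG_CODES = {"donepezil", "rivastigmine", "galantamine", "memantine"}

def label_ad(person: pd.DataFrame) -> str:
    """Assign an operational AD label to one person's claim records."""
    has_icd = person["icd10"].str[:3].isin(AD_ICD10_CODES).any()
    has_drug = person["drug"].isin(AD_DRUG_CODES).any()
    if has_icd and has_drug:
        return "definite AD"  # diagnosis plus AD-specific prescription
    if has_icd:
        return "probable AD"  # ICD-10 code alone, to limit false negatives
    return "non-AD"

# Usage, given a claims table with person_id, icd10, and drug columns:
# labels = claims.groupby("person_id").apply(label_ad)
```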
Analysis
For the analysis and processing of the data, characteristics such as age and sex were used, in addition to 21 variables from the NHIS-NSC database, which included health profiles and family disease history, along with more than 6,000 variables derived from ICD-10 codes and medications.
Once described, the characteristics were aligned with respect to the year of diagnosis for each individual, according to ICD-10 codes and medication codes. Disease and medication codes with a low frequency of occurrence were then excluded, as were individuals with no new health data in the last two years. The final set of variables used in the models comprised 4,894 unique characteristics.
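As an illustration of this construction and filtering step, here is a minimal sketch; the column names and frequency threshold are hypothetical, not taken from the study.

```python
import pandas as pd

def build_feature_matrix(claims: pd.DataFrame, min_count: int = 50) -> pd.DataFrame:
    """Binary person-by-code matrix, dropping rare ICD-10/medication codes.

    `claims` is assumed to have person_id and code columns; `min_count`
    is an illustrative cutoff, not the one used in the study.
    """
    # One row per person, one column per code, 1 if the code ever occurs
    X = pd.crosstab(claims["person_id"], claims["code"]).clip(upper=1)
    # Keep only codes observed in at least `min_count` individuals
    return X.loc[:, X.sum(axis=0) >= min_count]
```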
To make predictions “n” years in advance, time windows running from 2002 up to n years before the year of incidence were used for the group with Alzheimer’s disease. For the group without the disease, data from 2002 to the year 2010 − n were taken.
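This windowing logic can be expressed as follows; the sketch simply encodes the description above, not the authors’ code.

```python
def observation_window(incidence_year: int | None, n: int) -> tuple[int, int]:
    """(start, end) years of records used to predict AD n years in advance.

    AD cases: 2002 up to n years before the year of incidence.
    Non-AD individuals (incidence_year=None): 2002 up to 2010 - n.
    """
    end = incidence_year - n if incidence_year is not None else 2010 - n
    return 2002, end

# Predicting 4 years ahead for a case diagnosed in 2012:
# observation_window(2012, 4) -> (2002, 2008)
```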
Finally, before fitting the models, training, validation, and test subsets were created from both a balanced, randomly sampled dataset and an unbalanced dataset.
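A minimal sketch of class balancing by random undersampling, assuming a pandas DataFrame with a label column; the study’s exact sampling procedure may differ.

```python
import pandas as pd

def balanced_sample(df: pd.DataFrame, label_col: str = "label",
                    random_state: int = 0) -> pd.DataFrame:
    """Randomly undersample every class down to the size of the smallest
    one; the unbalanced variant simply skips this step."""
    n = df[label_col].value_counts().min()
    return df.groupby(label_col).sample(n=n, random_state=random_state)
```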
Application of Machine Learning (ML) Techniques
Data analysis was then performed using machine learning techniques: random forest, support vector machine with a linear kernel, and logistic regression.
Training, validation, and testing were conducted using stratified 5-fold cross-validation.
Feature selection was performed within the training samples using a variance threshold method, and the generalization of the model’s performance was evaluated on the test samples.
To assess the models’ performance, standard metrics were used: the area under the ROC curve (AUC), sensitivity, and specificity.
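To make the pipeline concrete, below is a minimal scikit-learn sketch combining variance-threshold feature selection, stratified 5-fold cross-validation, the three classifiers, and the reported metrics. The synthetic data and all hyperparameters are placeholders, since the NHIS-NSC data are not public.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Synthetic stand-in for the feature matrix (1 = AD, 0 = non-AD)
rng = np.random.default_rng(0)
X = rng.random((1000, 200))
y = rng.integers(0, 2, 1000)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "linear SVM": LinearSVC(max_iter=5000),
    "random forest": RandomForestClassifier(n_estimators=200),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, model in models.items():
    aucs, sens, spec = [], [], []
    for train_idx, test_idx in cv.split(X, y):
        # Feature selection is fit on the training fold only
        pipe = make_pipeline(VarianceThreshold(threshold=0.01), model)
        pipe.fit(X[train_idx], y[train_idx])
        tn, fp, fn, tp = confusion_matrix(
            y[test_idx], pipe.predict(X[test_idx])).ravel()
        sens.append(tp / (tp + fn))
        spec.append(tn / (tn + fp))
        # LinearSVC has no predict_proba; use decision_function for AUC
        score = (pipe.decision_function(X[test_idx])
                 if hasattr(pipe, "decision_function")
                 else pipe.predict_proba(X[test_idx])[:, 1])
        aucs.append(roc_auc_score(y[test_idx], score))
    print(f"{name}: AUC={np.mean(aucs):.3f} "
          f"sensitivity={np.mean(sens):.3f} specificity={np.mean(spec):.3f}")
```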
For more details on how this study was conducted, it is recommended to refer to the original article.
What Are the Main Conclusions of This Alzheimer’s Prediction Study Using Machine Learning?
The work highlights the potential of data-driven machine learning techniques as a promising tool for predicting the risk of Alzheimer’s-type dementia.
Main Advantage of the Study
This study presents a significant advantage compared to other approaches based on information obtained from neuroimaging tests or neuropsychological assessments, as it was conducted using exclusively administrative data.
While other studies focus on populations that are already in a state of actual clinical risk or have shown enough concern to consult a healthcare professional, this approach leverages the availability of administrative data to identify risks without the need for prior clinical assessments.
The characteristics of the sample, by diagnostic group, were as follows:

| | Definite AD | Probable AD | Non-AD |
| --- | --- | --- | --- |
| Number | 614 | 2,026 | 38,710 |
| Mean age (years) | 80.7 | 79.2 | 74.5 |
| Gender (male, female) | 229, 385 | 733, 1,293 | 18,200, 20,510 |
Next, comparative tables between definite AD and non-AD, and between probable AD and non-AD, are shown for prediction horizons of 0 and 4 years, with all the classifiers used in the study.
Definite AD vs. non-AD:

| Prediction years | Classifier | Accuracy | AUC | Sensitivity | Specificity |
| --- | --- | --- | --- | --- | --- |
| 0 years | Logistic regression | 0.76 | 0.794 | 0.726 | 0.793 |
| 0 years | Support vector machine | 0.763 | 0.817 | 0.795 | 0.811 |
| 0 years | Random forest | 0.823 | 0.898 | 0.509 | 0.852 |
| 4 years | Logistic regression | 0.627 | 0.661 | 0.509 | 0.745 |
| 4 years | Support vector machine | 0.646 | 0.685 | 0.538 | 0.754 |
| 4 years | Random forest | 0.663 | 0.725 | 0.621 | 0.705 |
Probable AD vs. non-AD:

| Prediction years | Classifier | Accuracy | AUC | Sensitivity | Specificity |
| --- | --- | --- | --- | --- | --- |
| 0 years | Logistic regression | 0.763 | 0.783 | 0.689 | 0.783 |
| 0 years | Support vector machine | 0.734 | 0.794 | 0.652 | 0.816 |
| 0 years | Random forest | 0.788 | 0.850 | 0.723 | 0.853 |
| 4 years | Logistic regression | 0.611 | 0.644 | 0.516 | 0.707 |
| 4 years | Support vector machine | 0.601 | 0.641 | 0.465 | 0.738 |
| 4 years | Random forest | 0.641 | 0.683 | 0.603 | 0.679 |
Both tables are simplifications of the corresponding tables in the original article: the prediction horizons have been reduced to just two (0 and 4 years).
Findings for Prediction
Another highlight of the article is the set of characteristics found to be important for prediction, described as positively or negatively related to the incidence of Alzheimer’s disease. Characteristics positively related to the development of the disease include age, the presence of protein in urine, and the prescription of zotepine (an antipsychotic).
In contrast, characteristics that were found to be negatively related to the incidence of the disease include decreased hemoglobin, the prescription of nicametate citrate (a vasodilator), degenerative disorders of the nervous system, and external ear disorders.
Additionally, the predictive model was retrained using only the 20 most important characteristics, and its accuracy at 0 and 1 years was found to be very similar to that of the original model.
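A sketch of this kind of check, ranking features by random-forest importance, keeping the top 20, and retraining; the data here are a synthetic stand-in.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the ICD-10/medication/health-profile features
rng = np.random.default_rng(1)
X = rng.random((1000, 200))
y = rng.integers(0, 2, 1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Rank features by importance in a model trained on all of them ...
full = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
top20 = np.argsort(full.feature_importances_)[::-1][:20]

# ... then retrain on the 20 most important features and compare accuracy
small = RandomForestClassifier(n_estimators=200, random_state=0)
small.fit(X_tr[:, top20], y_tr)
print("full accuracy:  ", full.score(X_te, y_te))
print("top-20 accuracy:", small.score(X_te[:, top20], y_te))
```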
Is Detection Based on Administrative Health Data Possible?
The study therefore concludes that detecting individuals at risk of Alzheimer’s disease based solely on administrative health data is possible. The authors nevertheless leave it to future studies in other nations and health systems to corroborate these results; such replication would be a milestone, allowing earlier and more accurate detection of at-risk individuals.
How Could NeuronUP Contribute to a Study Like This?
NeuronUP has experience in the scientific field in two main areas:
- providing support to research groups interested in technology, and
- conducting its own research for publication in high-impact scientific journals.
Specifically, for studies similar to the one reviewed in this article, and given access to large datasets such as those described, we believe NeuronUP has the team and expertise needed to:
- implement sophisticated machine learning techniques, such as those mentioned in the article; and
- contribute to study design, with a trained team able to formulate questions grounded in the existing scientific literature as well as to conduct “data-driven” studies.
The particularity of data-driven studies is that they are focused on the analysis and interpretation of data. This perspective is based on using large amounts of data to uncover hidden patterns and trends.
The new technologies and advanced analytical techniques needed to work with such large datasets were hardly accessible to most researchers until a few years ago. This perspective is therefore important and necessary when large volumes of data are available, as it can yield novel conclusions that would not be reached by methods based solely on theory.
Bibliography
- Park, J.H., Cho, H.E., Kim, J.H. et al. Machine learning prediction of incidence of Alzheimer’s disease using large-scale administrative health data. npj Digit. Med. 3, 46 (2020). https://doi.org/10.1038/s41746-020-0256-0