In this article, Antonio Javier Sutil Jiménez talks about the study “Prediction of the incidence of Alzheimer’s disease using machine learning with large-scale administrative health data”.
Why is the study of Alzheimer’s prediction with machine learning important?
Technological advances can sometimes provide unexpected solutions to medical problems. One example of this is the use of administrative health data to create predictive risk models for Alzheimer’s disease.
The major novelty of Park and colleagues’ work was the exploitation of this massive amount of data which, as the researchers describe, in many cases remains untapped. The digitization of medical records has thus become a valuable resource for reducing the effort and cost of data collection.
Even so, its application to diseases such as Alzheimer’s had been limited. This has changed partly thanks to the increase in computing power, which makes it possible to apply machine learning techniques to data analysis and to build predictive models that are representative of the population, given sufficiently large samples.
Premise of the study
The study starts from the premise that using data from individuals at risk of developing Alzheimer’s disease allows better early detection of cases in the preclinical stage and, therefore, improved therapeutic strategies.
To achieve this objective, the research group had access to the database of South Korea’s national health system, which contained more than 40,000 health records of people over 65 years old, with a wealth of information such as personal history, family history, sociodemographic data, diagnoses and medications.
What was done?
Dataset
To carry out the study, a cohort from South Korea’s NHIS-NSC (National Health Insurance Service–National Sample Cohort) was used, comprising more than one million participants followed for eleven years (2002 to 2013).
The database contained information about health services, diagnoses and prescriptions for each individual, as well as clinical characteristics including demographic data, income levels based on monthly salary, disease and medication codes, laboratory values, health profiles, and personal and family disease histories. From this sample, 40,736 adults over 65 years of age were selected for this study.
Operational definition of Alzheimer’s disease
Next, an operational definition of Alzheimer’s disease was created, based on the algorithm from a previous Canadian study.
This algorithm, which combined hospitalization codes, medical claims and Alzheimer’s-specific prescriptions, achieved a sensitivity of 79% and a specificity of 99%.
To improve detection accuracy, the label “definite AD” was used for cases with a high degree of certainty, and “probable AD” for cases confirmed only by ICD-10 codes (the International Classification of Diseases, 10th revision), in order to minimize false negatives. With these labels, the prevalence of Alzheimer’s disease was 1.5% for “definite AD” and 4.9% for “probable AD”.
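This labeling logic can be sketched as follows. This is an illustrative example, not the authors’ code: the ICD-10 prefixes for Alzheimer’s (F00, G30) are real, but the record fields and the list of AD-specific drugs are assumptions.

```python
# Illustrative sketch of the "definite AD" / "probable AD" labeling scheme.
AD_ICD10_PREFIXES = ("F00", "G30")  # ICD-10 codes for Alzheimer's disease
AD_DRUGS = {"donepezil", "rivastigmine", "galantamine", "memantine"}  # assumed list

def label_individual(diagnoses, prescriptions):
    """Return 'definite AD', 'probable AD' or 'non-AD' for one person.

    diagnoses     -- iterable of ICD-10 code strings from claims
    prescriptions -- iterable of drug-name strings
    """
    has_ad_code = any(code.startswith(AD_ICD10_PREFIXES) for code in diagnoses)
    has_ad_drug = any(drug in AD_DRUGS for drug in prescriptions)
    if has_ad_code and has_ad_drug:
        return "definite AD"   # diagnosis code plus AD-specific prescription
    if has_ad_code:
        return "probable AD"   # confirmed only by ICD-10 codes
    return "non-AD"

print(label_individual(["G30.9", "I10"], ["donepezil"]))  # definite AD
print(label_individual(["F00.1"], ["metformin"]))         # probable AD
```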
Analysis
For data processing and analysis, features such as age and sex were used, in addition to 21 variables from the NHIS-NSC database, which included health profiles and family disease history, along with more than 6,000 variables derived from ICD-10 codes and medication codes.
Once the features were defined, they were aligned to the year of diagnosis incidence for each individual, according to ICD-10 and medication codes. Disease and medication codes with a low frequency of occurrence were then removed, and individuals without new health data in the last two years were excluded. The final set of variables used in the models comprised 4,894 unique features.
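The rare-code filtering step can be sketched in a few lines. The threshold and data layout here are assumptions for illustration, not values from the paper.

```python
# Minimal sketch: keep only diagnosis/medication codes that appear in at
# least `min_count` individuals, dropping rare codes from every record.
from collections import Counter

def filter_rare_codes(records, min_count=10):
    """records -- list of sets of codes, one set per individual."""
    counts = Counter(code for person in records for code in person)
    frequent = {code for code, n in counts.items() if n >= min_count}
    return [person & frequent for person in records]

records = [{"E11", "I10"}, {"I10"}, {"I10", "Z99"}]
print(filter_rare_codes(records, min_count=2))  # rare codes E11, Z99 removed
```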
To make predictions n years in advance for the Alzheimer’s disease group, time windows between 2002 and the year of incidence were used. For the group without the disease, data from 2002 up to the year 2010 − n were taken.
Before implementing the models, training, validation and test subsets were created from both a balanced, randomly sampled dataset and an unbalanced dataset.
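One common way to build such a balanced dataset is to randomly undersample the majority (non-AD) class, as in this hedged sketch; the exact sampling procedure used in the paper may differ.

```python
# Sketch: balance the dataset by undersampling controls to match case count.
import random

def balance_by_undersampling(cases, controls, seed=0):
    """Return cases plus an equally sized random subset of controls."""
    rng = random.Random(seed)
    sampled_controls = rng.sample(controls, k=len(cases))
    return cases + sampled_controls

cases = ["ad_%d" % i for i in range(5)]
controls = ["ctrl_%d" % i for i in range(100)]
balanced = balance_by_undersampling(cases, controls)
print(len(balanced))  # 10 records, 5 per class
```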
Application of machine learning techniques (ML)
The data analysis was then carried out by implementing machine learning techniques: random forest, support vector machine with a linear kernel, and logistic regression.
Training, validation and testing were performed using 5-fold stratified cross-validation.
Feature selection was performed within the training samples using a variance threshold method, and the generalization of the model’s performance was evaluated on the test samples.
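The evaluation setup described above can be sketched with scikit-learn on synthetic data: variance-threshold feature selection fitted inside each training fold, 5-fold stratified cross-validation, and the three classifiers named in the study. Hyperparameters here are assumptions, not the authors’ settings.

```python
# Sketch of the study's evaluation setup on synthetic data (not the real cohort).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=400, n_features=50, random_state=0)

classifiers = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "linear SVM": LinearSVC(),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, clf in classifiers.items():
    # Feature selection lives inside the pipeline, so it is fitted on the
    # training folds only, avoiding information leakage into the test fold.
    pipe = make_pipeline(VarianceThreshold(threshold=0.0), clf)
    scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```

Wrapping the selector and classifier in one pipeline is the standard way to keep per-fold feature selection honest during cross-validation.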
To check model performance, common metrics were used, such as the area under the ROC curve, sensitivity and specificity.
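Of these metrics, sensitivity and specificity follow directly from the confusion matrix, as this toy sketch shows (the counts are illustrative, not the study’s results):

```python
# Sensitivity and specificity computed from confusion-matrix counts.
def sensitivity(tp, fn):
    """True positive rate: proportion of actual cases correctly detected."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: proportion of non-cases correctly ruled out."""
    return tn / (tn + fp)

# Toy example: 80 true positives, 20 false negatives,
# 90 true negatives, 10 false positives.
print(sensitivity(80, 20))  # 0.8
print(specificity(90, 10))  # 0.9
```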
For more details on how this study was carried out, it is recommended to consult the original article.
What are the main conclusions of this Alzheimer’s prediction study with machine learning?
The work highlights the potential of data-driven machine learning techniques as a promising tool to predict the risk of Alzheimer’s-type dementia.
Main advantage of the study
This study presents a major advantage compared to other approaches based on information obtained from neuroimaging tests or neuropsychological assessments, since it was carried out using exclusively administrative data.
While other studies focus on populations that are already in a real clinical risk situation or that have shown sufficient concern to consult a health professional, this approach leverages the availability of administrative data to identify risks without the need for prior clinical assessments.
| | Definite AD | Probable AD | Non-AD |
|---|---|---|---|
| N | 614 | 2,026 | 38,710 |
| Age | 80.7 | 79.2 | 74.5 |
| Sex (male, female) | 229, 285 | 733, 1,293 | 18,200, 20,510 |
Below are the comparative tables between definite AD and non-AD, and between probable AD and non-AD, for prediction years 0 and 4 with all the classifiers used in the study.
Definite AD vs. non-AD:

| Prediction years | Classifier | Accuracy | AUC | Sensitivity | Specificity |
|---|---|---|---|---|---|
| 0 years | Logistic regression | 0.760 | 0.794 | 0.726 | 0.793 |
| | Support Vector Machine | 0.763 | 0.817 | 0.795 | 0.811 |
| | Random Forest | 0.823 | 0.898 | 0.509 | 0.852 |
| 4 years | Logistic regression | 0.627 | 0.661 | 0.509 | 0.745 |
| | Support Vector Machine | 0.646 | 0.685 | 0.538 | 0.754 |
| | Random Forest | 0.663 | 0.725 | 0.621 | 0.705 |
Probable AD vs. non-AD:

| Prediction years | Classifier | Accuracy | AUC | Sensitivity | Specificity |
|---|---|---|---|---|---|
| 0 years | Logistic regression | 0.763 | 0.783 | 0.689 | 0.783 |
| | Support Vector Machine | 0.734 | 0.794 | 0.652 | 0.816 |
| | Random Forest | 0.788 | 0.850 | 0.723 | 0.853 |
| 4 years | Logistic regression | 0.611 | 0.644 | 0.516 | 0.707 |
| | Support Vector Machine | 0.601 | 0.641 | 0.465 | 0.738 |
| | Random Forest | 0.641 | 0.683 | 0.603 | 0.679 |
Both tables are simplifications of those in the original article; the number of prediction years shown here has been reduced to two (0 and 4 years).
Findings for prediction
Another noteworthy point of the article is the set of features found to be important for prediction. These are described as positively or negatively associated with the incidence of Alzheimer’s disease. Among the features positively related to disease development are age, the presence of protein in the urine and the prescription of zotepine (an antipsychotic).
In contrast, features that were negatively associated with disease incidence were also detected, such as decreased hemoglobin, the prescription of nicametate citrate (a vasodilator), degenerative disorders of the nervous system and disorders of the external ear.
Additionally, the predictive model was tested using only the 20 most important features, and its accuracy for years 0 and 1 was found to be very similar to that of the original model.
Is detection based on administrative health data possible?
Therefore, the conclusion of the study is that detection of individuals at risk of Alzheimer’s based solely on administrative health data is possible. However, the authors leave open the possibility that future studies in different nations and health systems could corroborate these results. Their replication would be a milestone that would allow earlier and more accurate detection of people at risk.
Where could NeuronUP contribute in a study like this?
NeuronUP has scientific experience in two main areas:
- providing support to research groups interested in technology, and
- carrying out its own work for publication in high-impact scientific journals.
Specifically, for studies with characteristics similar to the one reviewed in this article, and given access to large datasets like those described, we believe NeuronUP has the team and experience necessary to:

- implement sophisticated machine learning techniques, such as those mentioned in the article; and
- contribute to study design, with a team capable of formulating questions based on the existing scientific literature as well as conducting data-driven studies.
The particularity of data-driven studies is that they are focused on data analysis and interpretation. This perspective is based on the use of large amounts of data to discover hidden patterns and trends.
The use of new technologies and advanced analysis techniques, necessary to work with these large datasets, was hardly accessible to most researchers until a few years ago. Therefore, this perspective is important and necessary when large volumes of data are available, as they can offer novel conclusions that would not be reached using methods based solely on theory.
Bibliography
- Park, J.H., Cho, H.E., Kim, J.H. et al. Machine learning prediction of incidence of Alzheimer’s disease using large-scale administrative health data. npj Digit. Med. 3, 46 (2020). https://doi.org/10.1038/s41746-020-0256-0
If you liked this blog post about the prediction of the incidence of Alzheimer’s disease using machine learning with large-scale administrative health data, you will likely be interested in these articles from NeuronUP:
This article has been translated. Link to the original article in Spanish: “Predicción de la incidencia de la enfermedad de Alzheimer mediante machine learning utilizando datos sanitarios administrativos a gran escala”.