In this article, Antonio Javier Sutil Jiménez talks about the study “Prediction of the incidence of Alzheimer’s disease using machine learning with large-scale administrative health data”.
Why is the study of Alzheimer’s prediction with machine learning important?
Technological advances can sometimes provide unexpected solutions to medical problems. One example of this is the use of administrative health data to create predictive risk models for Alzheimer’s disease.
The major novelty of Park and colleagues’ work was the exploitation of this massive amount of data which, as the researchers describe, in many cases remains untapped. The digitization of medical records has thus become a valuable resource for reducing the effort and cost of data collection.
Even so, its application to diseases such as Alzheimer’s had been limited. This has changed partly thanks to the increase in computing power, which makes it possible to apply machine learning techniques to data analysis and to build predictive models that are representative of the population, given sufficiently large samples.
Premise of the study
The study starts from the premise that using data from individuals at risk of developing Alzheimer’s disease allows better early detection of cases in the preclinical stage and, therefore, improved therapeutic strategies.
To achieve this objective, the research group had access to the database of South Korea’s national health system, which contained more than 40,000 health records of people over 65 years old, with a wealth of information such as personal history, family history, sociodemographic data, diagnoses and medications.
What was done?
Dataset
To carry out the study, a cohort from South Korea’s NHIS-NSC (National Health Insurance Service–National Sample Cohort) was used, comprising more than one million participants followed for eleven years (2002 to 2013).
The database contained information about health services, diagnoses and prescriptions for each individual, as well as clinical characteristics including demographic data, income levels based on monthly salary, disease and medication codes, laboratory values, health profiles, and personal and family disease histories. From this sample, 40,736 adults over 65 years of age were selected for this study.
Operational definition of Alzheimer’s disease
Next, an operational definition of Alzheimer’s disease was created, based on the algorithm from a previous Canadian study.
This algorithm, which combined hospitalization codes, medical claims and Alzheimer’s-specific prescriptions, achieved a sensitivity of 79% and a specificity of 99%.
To improve detection accuracy, the label “definite AD” was used for cases with a high degree of certainty, and “probable AD” for cases confirmed only by ICD-10 codes (the International Classification of Diseases, 10th revision), in order to minimize false negatives. With these labels, the prevalence of Alzheimer’s disease was 1.5% for “definite AD” and 4.9% for “probable AD”.
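This labeling logic can be sketched as follows. This is an illustrative example, not the authors’ code: the ICD-10 prefixes for Alzheimer’s (F00, G30) are real, but the record fields and the list of AD-specific drugs are assumptions.

```python
# Illustrative sketch of the "definite AD" / "probable AD" labeling scheme.
AD_ICD10_PREFIXES = ("F00", "G30")  # ICD-10 codes for Alzheimer's disease
AD_DRUGS = {"donepezil", "rivastigmine", "galantamine", "memantine"}  # assumed list

def label_individual(diagnoses, prescriptions):
    """Return 'definite AD', 'probable AD' or 'non-AD' for one person.

    diagnoses     -- iterable of ICD-10 code strings from claims
    prescriptions -- iterable of drug-name strings
    """
    has_ad_code = any(code.startswith(AD_ICD10_PREFIXES) for code in diagnoses)
    has_ad_drug = any(drug in AD_DRUGS for drug in prescriptions)
    if has_ad_code and has_ad_drug:
        return "definite AD"   # diagnosis code plus AD-specific prescription
    if has_ad_code:
        return "probable AD"   # confirmed only by ICD-10 codes
    return "non-AD"

print(label_individual(["G30.9", "I10"], ["donepezil"]))  # definite AD
print(label_individual(["F00.1"], ["metformin"]))         # probable AD
```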
Analysis
For data processing and analysis, features such as age and sex were used, in addition to 21 variables from the NHIS-NSC database, which included health profiles and family disease history, along with more than 6,000 variables derived from ICD-10 codes and medication codes.
Once the features were defined, they were aligned to the year of diagnosis incidence for each individual, according to ICD-10 and medication codes. Disease and medication codes with a low frequency of occurrence were then removed, and individuals without new health data in the last two years were excluded. The final set of variables used in the models comprised 4,894 unique features.
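The rare-code filtering step can be sketched in a few lines. The threshold and data layout here are assumptions for illustration, not values from the paper.

```python
# Minimal sketch: keep only diagnosis/medication codes that appear in at
# least `min_count` individuals, dropping rare codes from every record.
from collections import Counter

def filter_rare_codes(records, min_count=10):
    """records -- list of sets of codes, one set per individual."""
    counts = Counter(code for person in records for code in person)
    frequent = {code for code, n in counts.items() if n >= min_count}
    return [person & frequent for person in records]

records = [{"E11", "I10"}, {"I10"}, {"I10", "Z99"}]
print(filter_rare_codes(records, min_count=2))  # rare codes E11, Z99 removed
```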
To make predictions n years in advance for the Alzheimer’s disease group, time windows between 2002 and the year of incidence were used. For the group without the disease, data from 2002 up to the year 2010 − n were taken.
Before implementing the models, training, validation and test subsets were created from both a balanced, randomly sampled dataset and an unbalanced dataset.
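One common way to build such a balanced dataset is to randomly undersample the majority (non-AD) class, as in this hedged sketch; the exact sampling procedure used in the paper may differ.

```python
# Sketch: balance the dataset by undersampling controls to match case count.
import random

def balance_by_undersampling(cases, controls, seed=0):
    """Return cases plus an equally sized random subset of controls."""
    rng = random.Random(seed)
    sampled_controls = rng.sample(controls, k=len(cases))
    return cases + sampled_controls

cases = ["ad_%d" % i for i in range(5)]
controls = ["ctrl_%d" % i for i in range(100)]
balanced = balance_by_undersampling(cases, controls)
print(len(balanced))  # 10 records, 5 per class
```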
Application of machine learning techniques (ML)
The data analysis was then carried out by implementing machine learning techniques: random forest, support vector machine with a linear kernel, and logistic regression.
Training, validation and testing were performed using 5-fold stratified cross-validation.
Feature selection was performed within the training samples using a variance threshold method, and the generalization of the model’s performance was evaluated on the test samples.
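The evaluation setup described above can be sketched with scikit-learn on synthetic data: variance-threshold feature selection fitted inside each training fold, 5-fold stratified cross-validation, and the three classifiers named in the study. Hyperparameters here are assumptions, not the authors’ settings.

```python
# Sketch of the study's evaluation setup on synthetic data (not the real cohort).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=400, n_features=50, random_state=0)

classifiers = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "linear SVM": LinearSVC(),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, clf in classifiers.items():
    # Feature selection lives inside the pipeline, so it is fitted on the
    # training folds only, avoiding information leakage into the test fold.
    pipe = make_pipeline(VarianceThreshold(threshold=0.0), clf)
    scores = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```

Wrapping the selector and classifier in one pipeline is the standard way to keep per-fold feature selection honest during cross-validation.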
To check model performance, common metrics were used, such as the area under the ROC curve, sensitivity and specificity.
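Of these metrics, sensitivity and specificity follow directly from the confusion matrix, as this toy sketch shows (the counts are illustrative, not the study’s results):

```python
# Sensitivity and specificity computed from confusion-matrix counts.
def sensitivity(tp, fn):
    """True positive rate: proportion of actual cases correctly detected."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: proportion of non-cases correctly ruled out."""
    return tn / (tn + fp)

# Toy example: 80 true positives, 20 false negatives,
# 90 true negatives, 10 false positives.
print(sensitivity(80, 20))  # 0.8
print(specificity(90, 10))  # 0.9
```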
For more details on how this study was carried out, it is recommended to consult the original article.
What are the main conclusions of this Alzheimer’s prediction study with machine learning?
The work highlights the potential of data-driven machine learning techniques as a promising tool to predict the risk of Alzheimer’s-type dementia.
Main advantage of the study
This study presents a major advantage compared to other approaches based on information obtained from neuroimaging tests or neuropsychological assessments, since it was carried out using exclusively administrative data.
While other studies focus on populations that are already in a real clinical risk situation or that have shown sufficient concern to consult a health professional, this approach leverages the availability of administrative data to identify risks without the need for prior clinical assessments.
| | Definite AD | Probable AD | Non-AD |
|---|---|---|---|
| N | 614 | 2,026 | 38,710 |
| Age | 80.7 | 79.2 | 74.5 |
| Sex (male, female) | 229, 285 | 733, 1,293 | 18,200, 20,510 |
Below are the comparative tables between definite AD and non-AD, and between probable AD and non-AD, for prediction years 0 and 4 with all the classifiers used in the study.
Definite AD vs. non-AD:

| Prediction years | Classifier | Accuracy | AUC | Sensitivity | Specificity |
|---|---|---|---|---|---|
| 0 years | Logistic regression | 0.760 | 0.794 | 0.726 | 0.793 |
| | Support Vector Machine | 0.763 | 0.817 | 0.795 | 0.811 |
| | Random Forest | 0.823 | 0.898 | 0.509 | 0.852 |
| 4 years | Logistic regression | 0.627 | 0.661 | 0.509 | 0.745 |
| | Support Vector Machine | 0.646 | 0.685 | 0.538 | 0.754 |
| | Random Forest | 0.663 | 0.725 | 0.621 | 0.705 |
Probable AD vs. non-AD:

| Prediction years | Classifier | Accuracy | AUC | Sensitivity | Specificity |
|---|---|---|---|---|---|
| 0 years | Logistic regression | 0.763 | 0.783 | 0.689 | 0.783 |
| | Support Vector Machine | 0.734 | 0.794 | 0.652 | 0.816 |
| | Random Forest | 0.788 | 0.850 | 0.723 | 0.853 |
| 4 years | Logistic regression | 0.611 | 0.644 | 0.516 | 0.707 |
| | Support Vector Machine | 0.601 | 0.641 | 0.465 | 0.738 |
| | Random Forest | 0.641 | 0.683 | 0.603 | 0.679 |
Both tables are simplifications of those in the original article; the number of prediction years shown here has been reduced to two (0 and 4 years).
Findings for prediction
Another noteworthy point of the article is the set of features found to be important for prediction. These are described as positively or negatively associated with the incidence of Alzheimer’s disease. Among the features positively related to disease development are age, the presence of protein in the urine and the prescription of zotepine (an antipsychotic).
In contrast, features that were negatively associated with disease incidence were also detected, such as decreased hemoglobin, the prescription of nicametate citrate (a vasodilator), degenerative disorders of the nervous system and disorders of the external ear.
Additionally, the predictive model was tested using only the 20 most important features, and its accuracy for years 0 and 1 was found to be very similar to that of the original model.
Is detection based on administrative health data possible?
Therefore, the conclusion of the study is that detection of individuals at risk of Alzheimer’s based solely on administrative health data is possible. However, the authors leave open the possibility that future studies in different nations and health systems could corroborate these results. Their replication would be a milestone that would allow earlier and more accurate detection of people at risk.
Where could NeuronUP contribute in a study like this?
NeuronUP has scientific experience in two main areas:
- providing support to research groups interested in technology, and
- carrying out its own work for publication in high-impact scientific journals.
Specifically, for studies with characteristics similar to the one reviewed in this article, and given access to large datasets like those described, we believe NeuronUP has the team and experience necessary to:

- implement sophisticated machine learning techniques, such as those mentioned in the article; and
- contribute to study design, with a team capable of formulating questions based on the existing scientific literature as well as conducting data-driven studies.
The particularity of data-driven studies is that they are focused on data analysis and interpretation. This perspective is based on the use of large amounts of data to discover hidden patterns and trends.
The use of new technologies and advanced analysis techniques, necessary to work with these large datasets, was hardly accessible to most researchers until a few years ago. Therefore, this perspective is important and necessary when large volumes of data are available, as they can offer novel conclusions that would not be reached using methods based solely on theory.
Bibliography
- Park, J.H., Cho, H.E., Kim, J.H. et al. Machine learning prediction of incidence of Alzheimer’s disease using large-scale administrative health data. npj Digit. Med. 3, 46 (2020). https://doi.org/10.1038/s41746-020-0256-0
If you liked this blog post about the prediction of the incidence of Alzheimer’s disease using machine learning with large-scale administrative health data, you will likely be interested in these articles from NeuronUP:
This article has been translated. Link to the original article in Spanish: “Predicción de la incidencia de la enfermedad de Alzheimer mediante machine learning utilizando datos sanitarios administrativos a gran escala”.