Accuracy of Machine Learning Algorithms for the Diagnosis of Autism Spectrum Disorder: Systematic Review and Meta-Analysis of Brain Magnetic Resonance Imaging Studies

Background: In recent years, machine learning algorithms have been increasingly applied in biomedical fields. In particular, their application has been drawing more attention in psychiatry, for instance, as diagnostic tests or tools for autism spectrum disorder (ASD). However, given their complexity and potential clinical implications, there is an ongoing need for further research on their accuracy.
Objective: This study aimed to perform a systematic review and meta-analysis to summarize the available evidence for the accuracy of machine learning algorithms in diagnosing ASD.
Methods: The following databases were searched on November 28, 2018: MEDLINE, EMBASE, CINAHL Complete (with Open Dissertations), PsycINFO, and Institute of Electrical and Electronics Engineers Xplore Digital Library. Studies that used a machine learning algorithm partially or fully for distinguishing individuals with ASD from control subjects and provided accuracy measures were included in our analysis. The bivariate random effects model was applied to the pooled data in a meta-analysis. A subgroup analysis was used to investigate and resolve the sources of heterogeneity between studies. True-positive, false-positive, false-negative, and true-negative values from individual studies were used to calculate the pooled sensitivity and specificity values, draw Summary Receiver Operating Characteristics curves, and obtain the area under the curve (AUC) and partial AUC (pAUC).
Results: A total of 43 studies were included for the final analysis, of which a meta-analysis was performed on 40 studies (53 samples with 12,128 participants). A structural magnetic resonance imaging (sMRI) subgroup meta-analysis (12 samples with 1776 participants) showed a sensitivity of 0.83 (95% CI 0.76-0.89), a specificity of 0.84 (95% CI 0.74-0.91), and AUC/pAUC of 0.90/0.83. A functional magnetic resonance imaging/deep neural network subgroup meta-analysis (5 samples with 1345 participants) showed a sensitivity of 0.69 (95% CI 0.62-0.75), a specificity of 0.66 (95% CI 0.61-0.70), and AUC/pAUC of 0.71/0.67.
Conclusions: The accuracy of machine learning algorithms for the diagnosis of ASD was considered acceptable by a few accuracy measures, and only in cases of sMRI use; however, given the many limitations indicated in our study, further well-designed studies are warranted to extend the potential use of machine learning algorithms to clinical settings.
Trial Registration: PROSPERO CRD42018117779; https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=117779


Background
Autism spectrum disorder (ASD), behaviorally characterized by deficits in social communication and rigidity in interests or behavior in both the Diagnostic and Statistical Manual of Mental Disorders-5 (DSM-5) and the International Statistical Classification of Diseases-11 (ICD-11), is believed to be a product of complex interactions between genetic and environmental factors [1-3]. The latest prevalence of ASD has been reported to be 1 in 59 children aged 8 years, based on the 2014 Centers for Disease Control and Prevention (CDC) surveillance data [4], and 1 in 40 children aged 3-17 years, based on parental reports of the diagnosis in a national survey [5]. Despite advances in many biomarkers with potential for prediction or early detection of ASD (eg, structural magnetic resonance imaging [sMRI] or functional magnetic resonance imaging [fMRI]), a diagnosis is not made until the age of 4-5 years, on average [4,6].
Machine learning has been increasingly studied as a novel tool to enhance the accuracy of diagnosis and early detection of ASD [7]. Unlike traditional rule-based algorithms, in which computers generate answers from preprogrammed rules, machine learning allows building of an algorithm that can learn, predict, and improve with experience, based on big data [3,8-10]. Psychiatric decision making is more sophisticated and difficult to characterize, compared with machine learning, although there are some common elements. Psychiatrists diagnose patients by observing their behaviors and registering all collected and collateral data into their own cognitive system as sensory input values (eg, voice and vision). Similarly, machine learning requires a series of steps, including preprocessing (eg, noise removal from data before input into an algorithm), segmentation, and feature extraction [7]. In particular, machine learning in the field of ASD diagnostics incorporates big data (eg, neuroimaging), making the input data immense and complex [11]. The application of machine learning algorithms in the field of neuroimaging therefore often requires an extra step called feature selection, in which key features are selected from the complex dataset before the learning process [11].
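As a rough illustration of the feature selection step described above, the following minimal Python sketch ranks features by a simple standardized group difference and keeps the top k. The scoring rule, the function name `select_top_features`, and the toy data are assumptions made for illustration only; the included studies used a variety of selection methods that are not reproduced here.

```python
from statistics import mean, pvariance


def select_top_features(samples, labels, k):
    """Return the indices of the k features that best separate two groups.

    samples: list of equal-length feature vectors (one per participant).
    labels:  1 for ASD, 0 for control (hypothetical coding).
    The score is |mean difference| / sqrt(sum of group variances),
    an illustrative choice and not a method taken from the review.
    """
    n_features = len(samples[0])
    scores = []
    for j in range(n_features):
        asd = [row[j] for row, y in zip(samples, labels) if y == 1]
        ctl = [row[j] for row, y in zip(samples, labels) if y == 0]
        gap = abs(mean(asd) - mean(ctl))
        spread = (pvariance(asd) + pvariance(ctl)) ** 0.5
        if spread > 0:
            score = gap / spread
        else:
            # No within-group variation: perfect separation or no signal.
            score = float("inf") if gap > 0 else 0.0
        scores.append((score, j))
    # Highest score first; ties broken by feature index.
    ranked = sorted(scores, key=lambda item: (-item[0], item[1]))
    return sorted(j for _, j in ranked[:k])


# Toy data: feature 0 differs clearly between groups, features 1 and 2 do not.
toy_samples = [[1.0, 5.0, 0.2], [1.1, 5.1, 0.9], [3.0, 5.0, 0.1], [3.2, 5.2, 0.8]]
toy_labels = [1, 1, 0, 0]
selected = select_top_features(toy_samples, toy_labels, k=1)
```

In practice, the selection should be fitted on the training data only; selecting features on the full dataset before validation is one route to the overfitting discussed later in this paper.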

Objective
Currently, machine learning is widely applied to the field of bioinformatics, including genetics and imaging, and many applications require signal recognition and processing [12]. Machine learning algorithms are currently applied to the field of psychiatry in areas such as genomics, electroencephalogram (EEG), and neuroimaging. However, owing to the complex workflows involved in machine learning itself, the accuracy of such algorithms varies [8]. This study aimed to provide a pooled estimate of the accuracy of machine learning algorithms in distinguishing individuals with ASD from control groups through a systematic review and meta-analysis of the available studies.

Systematic Review
This systematic review and meta-analysis was conducted based on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses of Diagnostic Test Accuracy Studies (PRISMA-DTA) [13]. The study protocol was written before initiation of the study and registered in the International Prospective Register of Systematic Reviews (PROSPERO) database (registration: CRD42018117779).

Data Sources and Search Strategy
MEDLINE, EMBASE, CINAHL Complete (with Open Dissertations), and PsycINFO were selected as core search databases, and the Institute of Electrical and Electronics Engineers (IEEE) Xplore Digital Library was added to maximize the sensitivity of the search. The IEEE Xplore Digital Library is a database created by the IEEE, the largest of its kind worldwide, and includes more than 1800 peer-reviewed conference proceedings. Default search filters provided by journals were not used. There was no restriction by publication type (eg, conference proceedings) or language. The initial search was conducted on November 28, 2018. The search strategy and query per search database are listed in Multimedia Appendix 1. The primary consideration for study inclusion was whether machine learning was partially or fully applied in distinguishing individuals clinically diagnosed with ASD from controls and whether the accuracy of such applications was assessed. Multimedia Appendix 2 lists the inclusion/exclusion criteria. An author (SM) retrieved the initial search results and removed duplicates by using the find duplicates command in a reference management software (EndNote X9, Clarivate Analytics, Philadelphia, Pennsylvania). Subsequently, another author (JK) manually searched for and removed any residual duplicates. Finally, the studies were screened independently by two authors (SM and JK) by title, abstract, and keywords, after which the full texts of the selected studies were screened by the same two authors against the inclusion/exclusion criteria. If any discrepancy was found in the final selection, the two authors reached a consensus via discussion.

Data Extraction
A data extraction form was created through discussion among the authors before the extraction process to define specific subgroups and coding (categorizing) processes for a meta-analysis (Multimedia Appendix 3). The process is provided in detail in Multimedia Appendix 4. General characteristics such as author, publication year, sample size, average age, gender ratio, and data characteristics were extracted from individual studies. Information regarding the reference standard used in individual studies, the definitions of disease-positive and disease-negative status (autism positive/control), and the methodologies used to distinguish individuals with autism from the control group was collected. Specific methodologies used to process and classify data for use in machine learning algorithms were also recorded (Multimedia Appendices 3 and 4). All accuracy values were extracted, and true-positive/true-negative/false-positive/false-negative (TP/TN/FP/FN) values were calculated from individual studies for a meta-analysis. If the TP/TN/FP/FN values could not be calculated from the accuracy values provided in a study, an email was sent to the corresponding author to request raw data. If there was no response within 14 days, the study was not included in the meta-analysis. The extraction was performed independently by two authors (SM and JK). If there was any discrepancy in the extracted data, a consensus was reached by thorough discussion after repeating the same extraction process.
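When a study reports sensitivity, specificity, and group sizes but not the 2x2 counts, the counts can usually be back-calculated. The following is a minimal sketch of that arithmetic, assuming the ASD and control group sizes are known; the function name `confusion_from_rates` and the example numbers are hypothetical and do not come from any included study.

```python
def confusion_from_rates(sensitivity, specificity, n_asd, n_control):
    """Back-calculate TP/FN/FP/TN from reported rates and group sizes.

    Assumes sensitivity = TP / n_asd and specificity = TN / n_control.
    Counts are rounded to whole participants, so rates reported with few
    decimals may not reproduce the original table exactly.
    """
    for name, value in (("sensitivity", sensitivity), ("specificity", specificity)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie between 0 and 1")
    tp = round(sensitivity * n_asd)
    fn = n_asd - tp
    tn = round(specificity * n_control)
    fp = n_control - tn
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}


# Hypothetical report: sensitivity 0.80 in 50 ASD cases, specificity 0.90 in 40 controls.
example_table = confusion_from_rates(0.80, 0.90, n_asd=50, n_control=40)
```

If only overall accuracy is reported without group-wise rates, the four cells cannot be recovered this way, which is the situation in which raw data had to be requested from the corresponding author.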

Quality Assessment
Two authors (SM and JK) independently assessed the quality of individual studies based on the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2). QUADAS-2 is a validated tool used to evaluate the quality of diagnostic accuracy studies across the patient selection, index test, reference standard, and flow and timing domains, in terms of risk of bias (RoB) for internal validity and applicability concerns for external validity [14]. There was no disagreement between authors in the assessment of the patient selection and reference standard domains. The index test, that is, the target tool of our investigation in this study, is a machine learning algorithm. The accuracy of a machine learning algorithm is reported through a process called validation. When a study provided no detailed information about the validation process, low RoB was assumed if independent datasets were used for training, building a model, and validation [15]. Otherwise, the level of RoB was determined by thoroughly reviewing the validation processes.

Evidence Synthesis
In our meta-analysis, a bivariate random effects model was used to consider both within- and between-study variability and the threshold effect [16]. A Summary Receiver Operating Characteristics (SROC) curve was generated based on parameter estimates extracted from the bivariate random effects model [17]. The SROC curve was specified by the pooled sensitivity and specificity point, its 95% confidence region, and the 95% prediction region. Area under the curve (AUC) and partial AUC (pAUC) were calculated based on the SROC curve [18]. Studies that were visually deviant from the 95% prediction region on the SROC curve were considered heterogeneous [19]. Attempts were made to resolve the heterogeneity by performing a subgroup analysis, generating individual SROC curves for subgroups (minimum 5 studies) [20]. If most studies were within the 95% prediction region on the SROC curves of the subgroups, the sample was determined to be homogeneous, and integrated sensitivity, specificity, and SROC curve results were provided.
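For orientation, one common formulation of the bivariate random effects model is sketched below. The notation is an assumption introduced here for illustration and is not taken from this study or from the software used; in particular, some implementations model the false-positive rate rather than the specificity.

```latex
% Observed logit-transformed sensitivity and specificity of sample i
\hat\theta_{A,i} = \operatorname{logit}\!\left(\frac{TP_i}{TP_i + FN_i}\right), \qquad
\hat\theta_{B,i} = \operatorname{logit}\!\left(\frac{TN_i}{TN_i + FP_i}\right)

% Within-sample level: sampling error around the sample-specific true values
\begin{pmatrix} \hat\theta_{A,i} \\ \hat\theta_{B,i} \end{pmatrix}
\sim N\!\left(
\begin{pmatrix} \theta_{A,i} \\ \theta_{B,i} \end{pmatrix},\;
\begin{pmatrix} s_{A,i}^2 & 0 \\ 0 & s_{B,i}^2 \end{pmatrix}
\right), \qquad
s_{A,i}^2 = \frac{1}{TP_i} + \frac{1}{FN_i}, \quad
s_{B,i}^2 = \frac{1}{TN_i} + \frac{1}{FP_i}

% Between-sample level: true values vary around the summary point
\begin{pmatrix} \theta_{A,i} \\ \theta_{B,i} \end{pmatrix}
\sim N\!\left(
\begin{pmatrix} \mu_A \\ \mu_B \end{pmatrix},\;
\begin{pmatrix} \sigma_A^2 & \sigma_{AB} \\ \sigma_{AB} & \sigma_B^2 \end{pmatrix}
\right)

% Pooled estimates on the probability scale
\text{pooled sensitivity} = \operatorname{logit}^{-1}(\mu_A), \qquad
\text{pooled specificity} = \operatorname{logit}^{-1}(\mu_B)
```

Under this formulation, the covariance term $\sigma_{AB}$ captures the trade-off between sensitivity and specificity across samples, which is how a threshold effect enters the model.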
If any of the TP/FP/TN/FN values was 0, 0.5 was added to prevent the zero cell count problem [21]. The TP/FP/TN/FN values were extracted or calculated from each independent sample in a study, and if multiple machine learning algorithms were applied to the same sample, the algorithm with the best accuracy (calculated as [TP+TN]/[TP+FP+TN+FN]) was selected for data extraction.
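The two rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure actually coded for this study: the text does not specify whether 0.5 was added to every cell or only to the zero cell, so the sketch assumes all four cells of an affected table; the function names and the example counts are hypothetical.

```python
def accuracy(tp, fp, tn, fn):
    """Overall accuracy, (TP + TN) / (TP + FP + TN + FN)."""
    return (tp + tn) / (tp + fp + tn + fn)


def best_algorithm(results):
    """Pick the algorithm with the highest accuracy for one sample.

    results: dict mapping an algorithm name to its (tp, fp, tn, fn) counts.
    """
    return max(results, key=lambda name: accuracy(*results[name]))


def correct_zero_cells(tp, fp, tn, fn, increment=0.5):
    """Apply a continuity correction when any cell is zero.

    Assumption: the increment is added to all four cells of the table.
    """
    cells = (tp, fp, tn, fn)
    if any(cell == 0 for cell in cells):
        return tuple(cell + increment for cell in cells)
    return tuple(float(cell) for cell in cells)


# Hypothetical sample analyzed with two algorithms, counts as (TP, FP, TN, FN).
sample_results = {"svm": (40, 5, 35, 10), "dnn": (38, 10, 30, 12)}
chosen = best_algorithm(sample_results)
corrected = correct_zero_cells(10, 0, 20, 5)
```

Selecting the best-performing algorithm per sample keeps samples independent, but it also tends to favor optimistic estimates, which is worth bearing in mind when reading the pooled results.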
A meta-analysis was conducted via the mada package in R (version 3.4.3, R Core Team, Vienna, Austria), and statistical significance was expressed with 95% CIs. Publication bias was not assessed in our analysis, as there are currently no statistically adequate models in the field of meta-analysis of diagnostic test accuracy [22].

Search, Selection, and General Characteristics
After duplicate removal, of the 280 studies extracted from five databases and one additional database, 43 studies were selected, of which 40 studies were included in the meta-analysis. Figure 1 provides details for each screening stage.
The publication years ranged from 2007 to 2018 for the final selection of 43 studies, of which 40 were journal articles and 3 were gray literature items (eg, conference proceedings). A total of 10 studies used a public database that was available on the internet and open to anyone, 18 used a private sector database (eg, clinic and hospital), 3 used both public and private databases, and the remaining 12 used databases from other sources. Regarding the average age of the sample, 5 studies included adults, 22 studies included school-aged participants, 11 included preschool-aged participants, and the remaining 5 did not provide any information. For the machine learning algorithm, 20 studies used a support vector machine (SVM), 3 used a deep neural network (DNN), 13 used others, and the remaining 10 used and compared multiple algorithms. For prediction, 11 studies used sMRI features, 9 used fMRI features, 9 used behavior traits, 5 used biochemical features, 4 used EEG features, and the remaining 2 used text or voice features. For reference standards, 24 studies used DSM-IV, DSM-IV-Text Revision, or DSM-5; 10 used the Autism Diagnostic Observation Schedule (ADOS) or the Autism Diagnostic Interview (ADI); 2 used ICD; and the remaining 7 did not provide relevant information. For the validation methodology, 37 studies used only internal validation, 2 used only external validation, and 4 used both. The abovementioned information is summarized in Table 1, and the extracted raw data are presented in Multimedia Appendices 5 and 6.

Qualitative Assessment
Of the 43 studies in total, more than half were assessed to have an unclear RoB in the patient selection domain (33 studies) and the index test domain (29 studies). More than half were considered to have a low RoB in the reference standard domain (35 studies) and the flow and timing domain (35 studies). For applicability concerns, about half (22 studies) were shown to have unclear or high concern in the patient selection domain, whereas most were considered to have low concern in the index test domain (42 studies) and the reference standard domain (36 studies). The qualitative assessment of all the individual studies is summarized in Multimedia Appendix 7, and the distribution is shown in Figure 2.

Quantitative Analysis (Meta-Analysis)
Of the final selection of 43 studies, only the 40 from which TP/FP/FN/TN values were extractable were considered for the meta-analysis. A total of 53 independent samples were extracted from the 40 studies and included in the meta-analysis (Table 1). The 53 samples comprised 12,128 participants, with sensitivity and specificity ranging from 0.55 to 1.00 and from 0.56 to 0.99, respectively. TP/FP/FN/TN, sensitivity, and specificity values for the 53 individual samples are summarized in Multimedia Appendix 8, and their visual distribution is provided as an SROC curve in Figure 3. Of the 53 samples, 12 were found outside the 95% prediction region of the SROC curve, indicating heterogeneity between samples (Figure 3).
In an attempt to resolve this heterogeneity, a subgroup analysis was conducted with 19 variables that had been predefined and coded. For replicability, a raw data sheet listing the precoded variables is available in Multimedia Appendix 9. Among the 19 variables, the type of predictor was the only one by which the heterogeneity could be partially resolved. In the subgroup that used sMRI features as predictors, all 12 samples were found to be within the prediction region of the SROC curve, thus resolving the heterogeneity (Figure 4).
For the sMRI subgroup, the pooled sensitivity was 0.83 (95% CI 0.76-0.89), specificity was 0.84 (95% CI 0.74-0.91), and AUC/pAUC was 0.90/0.83. Meta-analysis was also attempted for the remaining subgroups, such as the fMRI (15 samples), behavior traits (14 samples), and biochemical features (7 samples) subgroups, but the pooled sensitivity and specificity could not be provided owing to a significant degree of heterogeneity between samples: a few samples were shown to be far off the prediction region of the SROC curves (Multimedia Appendices 10-12). However, a sub-subgroup meta-analysis of 5 samples that used fMRI as a predictor and DNN as a classifier allowed the heterogeneity to be resolved and provided a pooled sensitivity of 0.69 (95% CI 0.62-0.75), specificity of 0.66 (95% CI 0.61-0.70), and AUC/pAUC of 0.71/0.67 (Figure 5).
Similarly, another sub-subgroup meta-analysis of six samples that used sMRI as a predictor and SVM as a classifier resolved the heterogeneity and resulted in a pooled sensitivity of 0.87 (95% CI 0.78-0.93), specificity of 0.87 (95% CI 0.71-0.95), and AUC/pAUC of 0.92/0.88 (Multimedia Appendix 12). Sensitivity and specificity values and types of classifiers used for samples of individual subgroups that used neuroimaging features (sMRI and fMRI subgroups) as predictors are provided in Table 2, and a forest plot is provided in Multimedia Appendix 13.
Figure 5. Summary Receiver Operating Characteristics curve for the functional magnetic resonance imaging/deep neural network sub-subgroup (5 samples). Note that the confidence region is the 95% confidence region around the summary sensitivity and specificity point, and the prediction region is the 95% prediction region for the true sensitivity and specificity of future observations. SROC: Summary Receiver Operating Characteristics.
The sensitivity and specificity for the behavior traits (14 samples) subgroup ranged from 0.68 to 1.00 and 0.56 to 0.9, respectively. The sensitivity and specificity for the biochemical features (7 samples) subgroup ranged from 0.77 to 0.94 and 0.72 to 0.93, respectively. The sensitivity and specificity for the EEG subgroup (3 samples) ranged from 0.94 to 0.97 and 0.81 to 0.94, respectively. The results are summarized in Multimedia Appendix 8. Information for other measures not included in the meta-analysis is provided in Multimedia Appendix 14.
Figure 4. Summary Receiver Operating Characteristics curve for the structural magnetic resonance imaging subgroup (12 samples). Note that the confidence region is the 95% confidence region around the summary sensitivity and specificity point, and the prediction region is the 95% prediction region for the true sensitivity and specificity of future observations. SROC: Summary Receiver Operating Characteristics.

Principal Findings
On the basis of the meta-analysis in this study, the summary sensitivity and specificity of machine learning algorithms in ASD diagnosis for the sMRI subgroup are 0.83 (95% CI 0.76-0.89) and 0.84 (95% CI 0.74-0.91), respectively, and the AUC/pAUC is 0.90/0.83. On the basis of the opinion that an AUC/pAUC value is considered acceptable when above 0.7, both values can be thought to be acceptable for the sMRI subgroup [44]. However, given the wide confidence intervals for the summary sensitivity and specificity, the clinical usefulness of those values can be difficult to determine. In addition, caution is warranted in interpreting the accuracy results, as the 95% prediction region is larger than the 95% confidence region on the SROC curve, indicating a high degree of uncertainty for the pooled sensitivity and specificity calculated [19]. Furthermore, only one sample from the sMRI subgroup utilized an external validation method, where demographic characteristics of the training dataset were independent of those of the validation dataset. In other words, the rest of the samples in the sMRI subgroup built their validation datasets from participants who were similar to or the same as those recruited in the training datasets. Hence, those samples are believed to have a high risk of overfitting, which compromises the generalizability of the machine learning models and may have led to overestimation in the meta-analysis of the sMRI subgroup [15].
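The distinction between internal and external validation can be made concrete with a small sketch. One simple way to approximate external validation, assumed here for illustration and not drawn from any included study, is to hold out every participant from one recruitment site so that the validation data share no participants with the training data. The record layout (`id`, `site`, `label`) and the function name `hold_out_site` are hypothetical.

```python
def hold_out_site(records, held_out_site):
    """Split participant records so that one whole site is kept for validation.

    records: list of dicts with at least the keys "id" and "site".
    Returns (training_records, external_validation_records).
    """
    train = [r for r in records if r["site"] != held_out_site]
    external = [r for r in records if r["site"] == held_out_site]
    if not train or not external:
        raise ValueError("the held-out site must leave both splits non-empty")
    # No participant may appear on both sides of the split.
    overlap = {r["id"] for r in train} & {r["id"] for r in external}
    if overlap:
        raise ValueError(f"participants appear in both splits: {sorted(overlap)}")
    return train, external


# Hypothetical multi-site dataset (label 1 = ASD, 0 = control).
toy_records = [
    {"id": "p01", "site": "A", "label": 1},
    {"id": "p02", "site": "A", "label": 0},
    {"id": "p03", "site": "B", "label": 1},
    {"id": "p04", "site": "B", "label": 0},
    {"id": "p05", "site": "C", "label": 1},
]
train_set, external_set = hold_out_site(toy_records, held_out_site="C")
```

By contrast, a random split or cross-validation within a single site reuses participants drawn from the same population for training and validation, which is the situation described above for most sMRI samples.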
Machine learning algorithms can be divided into supervised, unsupervised, or reinforcement learning by learning pattern [9]. SVM, for which subgroup analysis was performed for sMRI, is a long-established method of supervised learning, whereas DNN, for which subgroup analysis was conducted for fMRI, is among the most advanced of the neural network methods (supervised learning), modeled after the mechanism of neurons [9]. Contrary to what might be expected, the accuracy values for the fMRI subgroup using one of the latest machine learning algorithms, DNN, were found to be lower than those for the sMRI subgroup. This may, in part, be attributable to possible overestimation secondary to overfitting in the sMRI subgroup. In addition, one of the studies in the fMRI/DNN sub-subgroup composed its dataset by recruiting over 1000 participants from various sites to minimize limitations such as overfitting in its analysis.

Limitations
Our study has several limitations. Of the final selection of 43 studies, 33 did not provide clear information regarding the process of obtaining an original database or recruiting a training/validation dataset from the real clinical world, or raw data such as basic demographic characteristics of the participants before the input process, thus increasing the RoB in the patient selection processes. For example, more than half of the finally selected studies did not match the samples for age or gender, and the number of images or signals per participant was not specified in most of the neuroimaging and EEG studies. Subgroups other than the sMRI subgroup included studies that used the same database, thus raising concerns about possible sample overlap, which was challenging to handle statistically owing to the lack or absence of information on the patient selection process. If overlapping datasets lowered the accuracy, the pooled estimates of the subgroup meta-analysis would have been underestimated, and vice versa. In addition, the behavior, EEG, and voice/text subgroups did not consist of enough studies to attempt to resolve the heterogeneity and provide pooled accuracy values. Furthermore, owing to the heterogeneity, summary accuracy values could not be obtained for the adult (aged over 18 years), school-age (between 6 and 18 years), and preschool-age (less than 6 years) subgroups, thus limiting the ability to draw a conclusion on accuracy by age group. Corresponding authors of individual studies with small and high TP values (ie, 100% accurate machine learning test) were contacted, and one responded. Even if more had responded, to our knowledge, there would not have been any way to perform the aggregation.

Comparison With Prior Work
To our knowledge, there is currently no study that has performed a systematic review and/or a meta-analysis on diagnostic test accuracy for the use of machine learning in diagnosing ASD and suggested pooled accuracy estimates. In this analysis, many individual studies reported small TP and high TP (ie, 100% accurate machine learning test) and caused significant heterogeneity for a meta-analysis (see Figure 3). We resolved the heterogeneity by using subgroup analyses. As a result, individual studies with small and high TP values (ie, 100% accurate machine learning test) were barely included in the fMRI and sMRI subgroup analyses, thereby resolving the heterogeneity and allowing the meta-analysis to be conducted. In addition, recommendations from our results may improve the quality of prospective studies using machine learning algorithms in ASD diagnosis. First, the Standards for Reporting of Diagnostic Accuracy Studies (STARD) can guide machine learning diagnostic studies to enhance the reporting of patient selection processes. Second, there is a comprehensive guideline for algorithm developers on choosing an adequate predictive model for a target sample; setting the parameters, definition, or threshold; and minimizing errors such as overfitting and perfect separation [45]. Use of the STARD and other guidelines [45] would facilitate more transparent and comprehensive work in this space. Although not discussed in the studies included in our analysis, the decision or running time of a machine learning algorithm in ASD diagnosis could become an important quality measure in the near future, when these algorithms might be employed in busy daily clinical practice.

Conclusions
The accuracy of diagnosing ASD by machine learning algorithms was found to be acceptable by select accuracy measures only in studies that utilized sMRI. However, because of the high heterogeneity in the analyzed studies, it is impossible