Original Paper
Abstract
Background: The use of artificial intelligence (AI) in the medical domain has attracted considerable research interest. Inference applications in the medical domain require energy-efficient AI models. In contrast to other types of data in visual AI, data from medical laboratories usually comprise features with strong signals. Numerous energy optimization techniques have been developed to relieve the burden on the hardware required to deploy a complex learning model. However, the energy efficiency levels of different AI models used for medical applications have not been studied.
Objective: The aim of this study was to explore and compare the energy efficiency levels of commonly used machine learning algorithms—logistic regression (LR), k-nearest neighbor, support vector machine, random forest (RF), and extreme gradient boosting (XGB) algorithms, as well as four different variants of neural network (NN) algorithms—when applied to clinical laboratory datasets.
Methods: We applied the aforementioned algorithms to two distinct clinical laboratory data sets: a mass spectrometry data set regarding Staphylococcus aureus for predicting methicillin resistance (3338 cases; 268 features) and a urinalysis data set for predicting Trichomonas vaginalis infection (839,164 cases; 9 features). We compared the performance of the nine inference algorithms in terms of accuracy, area under the receiver operating characteristic curve (AUROC), time consumption, and power consumption. The time and power consumption levels were determined using performance counter data from Intel Power Gadget 3.5.
Results: The experimental results indicated that the RF and XGB algorithms achieved the two highest AUROC values for both data sets (84.7% and 83.9%, respectively, for the mass spectrometry data set; 91.1% and 91.4%, respectively, for the urinalysis data set). The XGB and LR algorithms exhibited the shortest inference time for both data sets (0.47 milliseconds for both in the mass spectrometry data set; 0.39 and 0.47 milliseconds, respectively, for the urinalysis data set). Compared with the RF algorithm, the XGB and LR algorithms exhibited a 45% and 53%-60% reduction in inference time for the mass spectrometry and urinalysis data sets, respectively. In terms of energy efficiency, the XGB algorithm exhibited the lowest power consumption for the mass spectrometry data set (9.42 Watts) and the LR algorithm exhibited the lowest power consumption for the urinalysis data set (9.98 Watts). Compared with a five-hidden-layer NN, the XGB and LR algorithms achieved 16%-24% and 9%-13% lower power consumption levels for the mass spectrometry and urinalysis data sets, respectively. In all experiments, the XGB algorithm exhibited the best performance in terms of accuracy, run time, and energy efficiency.
Conclusions: The XGB algorithm achieved balanced performance levels in terms of AUROC, run time, and energy efficiency for the two clinical laboratory data sets. Considering the energy constraints in real-world scenarios, the XGB algorithm is ideal for medical AI applications.
doi:10.2196/28036
Keywords
Introduction
Machine learning (ML) methods have been successfully employed in various medical fields, and energy consumption during ML inference has been attracting increasing attention. The increasing focus on inference energy can primarily be attributed to two reasons. First, energy constraints constitute a major issue when ML is deployed into battery-powered medical devices. Second, to achieve high predictive performance, the computation and memory requirements of ML models have increased. The growth of model size has been well reflected in neural networks (NNs) over the last decade, which are considered the main ML algorithms implemented during this period.
An optimal ML model should achieve balanced predictive performance and energy efficiency. However, most relevant studies have only focused on comparing the predictive performance of different ML algorithms and have not thoroughly explored the energy efficiency of different ML algorithms in the medical domain. Data formats in the medical field are diverse, and clinical laboratory data are a common type of medical data. In real-world settings, single laboratory tests must be subjected to strict validation procedures before their clinical use. Thus, the data obtained from such tests usually comprise features that are highly associated with the prediction targets. The characteristics of clinical laboratory data sets are unique, and the energy efficiency of different ML algorithms for processing clinical laboratory data sets warrants investigation.
A partial explanation for the poor understanding of energy efficiency is that estimating energy consumption is more difficult than estimating other metrics (eg, accuracy). Several methods exist for evaluating the energy consumption of ML models. Computational complexity can be used for theoretically approximating the number of operations; thus, it can be used to estimate energy consumption. Studies have established formulas for estimating energy consumption; these formulas sum the energy consumption levels of different elementary operations on the basis of complexity theory and benchmark results. However, these formulas are available for only specific ML models and cannot be expanded to all algorithms. In addition to the aforementioned estimation formulas, experimental approaches can be used for estimating energy consumption. Currently, simulation and performance counters are the two main approaches for experimentally estimating energy consumption. Although simulations enable fine-grained energy estimation at the architecture and instruction levels, the use of simulations for large-scale ML tasks is not feasible due to the considerable overhead involved. By contrast, performance counters, which are a set of registers in processors that log specific hardware-related events, do not generate any overhead; therefore, these counters are suitable for use in different ML applications.
In this study, we estimated the power consumption of nine algorithms during ML inference: logistic regression (LR), k-nearest neighbor (kNN), support vector machine (SVM), random forest (RF), extreme gradient boosting (XGB), and four NN-based algorithms. These algorithms were used to classify two clinical laboratory data sets: a large binary feature set and a small integer feature set. The following performance measures were recorded: accuracy, area under the receiver operating characteristic curve (AUROC), time consumption, and power consumption. The time and power consumption were determined using performance counter data from Intel Power Gadget 3.5. Finally, we performed statistical tests to validate our results. The results indicated the energy efficiency of each investigated ML algorithm in medical applications.
Methods
Study Design and Environmental Settings
illustrates the process flowchart of this study. We used two preprocessed data sets to train ML models with the nine considered algorithms: a mass spectrometry data set, based on matrix-assisted laser desorption/ionization time-of-flight mass spectrometry data of Staphylococcus aureus for predicting methicillin resistance, and a urinalysis data set, based on urinalysis data for predicting Trichomonas vaginalis infection. Subsequently, we comprehensively evaluated the trained models in terms of predictive performance, time consumption, and power consumption using independent testing data.
All experiments were run on a Windows 10 personal computer with 4 GB RAM and a 2.3 GHz Intel Core i5-8300H central processing unit (CPU). All ML models were implemented using Python 3.7.1 with the following Python libraries: scikit-learn 0.23.1, xgboost 0.90, and PyTorch 1.8.1. Intel Power Gadget 3.5 was used to acquire time and power measurements. Additional details regarding the power measurement and Intel Power Gadget 3.5 are provided in the “Time and Power Consumption” section below. All statistical analyses were performed using the “rstatix” package of R software (version 4.0.2).
Data Source
Data Set Characteristics
The mass spectrometry and urinalysis data sets adopted in this study represent distinct feature patterns in ML; their characteristics are presented in the table below. The mass spectrometry data set has a relatively large feature set comprising 268 binary features. By contrast, the urinalysis data set has a small feature set comprising only nine features, most of which are integer features with a larger range. These data sets have been applied and validated in previous studies.
Data set | Cases, n | Features, n | Binary features, n | Integer features, n | Majority class | Percentage of majority class (Nmax/N) | Gini impurity |
Mass spectrometry | 3338 | 268 | 268 | 0 | Methicillin-resistant Staphylococcus aureus | 53.0% | 0.50 |
Urinalysis | 2898 | 9 | 2 | 7 | Trichomonas vaginalis-negative | 57.1% | 0.49 |
Mass Spectrometry Data Set
The mass spectrometry data set comprises mass spectral information on methicillin-resistant and methicillin-sensitive S. aureus isolates. We collected routine mass spectrometry data about S. aureus samples consecutively from Chang Gung Memorial Hospital in 2016, and we identified the methicillin resistance of every S. aureus isolate by employing the paper disk method with cefoxitin.
In the original mass spectral data, intensity values are a function of the mass-to-charge ratio. We preprocessed these data using a validated binning method. After data preprocessing, every feature in the mass spectrometry data set corresponded to a 10-Da interval of the mass-to-charge ratio. All the features are binary features, and the values 1 and 0 represent the presence and absence of a peak, respectively (ie, the presence and absence of a sufficient intensity, respectively), in a specific interval. The final size of the mass spectrometry data set was 3338×268 entries, with no missing values.
Urinalysis Data Set
The urinalysis data set comprises routine urinalysis data (including leukocyte esterase, nitrite, protein, occult blood, red blood cell count, white blood cell count, and epithelial cell count) and demographic data (age and sex) of patients with and without T. vaginalis infections. The diagnosis of T. vaginalis infection was made according to microscopic tests. The urinalysis data set comprises the data of all patients who received at least one urinalysis test at Chang Gung Memorial Hospital between January 2009 and December 2013. The original data set consists of 839,164 cases; because the outcome distribution is imbalanced in the original data, we applied random undersampling and the synthetic minority oversampling technique. The final size of the urinalysis data set was 2898×9 entries and the percentage of the majority class was 57.1%.
In the urinalysis data set, “sex” and “nitrite” are binary features, which are represented by 1 and 0. “Leukocyte esterase,” “protein,” and “occult blood” data are semiquantitative features, which are represented on scales ranging from 0 to 4 (“negative,” “trace,” 1+, 2+, and 3+), 0 to 5 (“negative,” “trace,” 1+, 2+, 3+, and 4+), and 0 to 5 (“negative,” “trace,” 1+, 2+, 3+, and 4+), respectively. “Age,” “red blood cell count,” “white blood cell count,” and “epithelial cell count” are nonnegative integer features with maximum values of 103, 501, 501, and 101, respectively. The urinalysis data set does not contain missing values, and we did not perform further feature selection for this data set.
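The feature encoding described above can be illustrated with a minimal sketch. The dictionary keys, the feature order, and the direction of the binary coding (which sex or nitrite result maps to 1) are assumptions made for illustration only; the source specifies only the value ranges.

```python
# Ordinal scales as described in the text: the list position is the encoded value.
LEUKOCYTE_ESTERASE_SCALE = ["negative", "trace", "1+", "2+", "3+"]   # 0 to 4
PROTEIN_SCALE = ["negative", "trace", "1+", "2+", "3+", "4+"]        # 0 to 5
OCCULT_BLOOD_SCALE = ["negative", "trace", "1+", "2+", "3+", "4+"]   # 0 to 5


def encode_ordinal(label, scale):
    """Map a semiquantitative label to its integer position on the scale."""
    return scale.index(label)


def encode_case(record):
    """Turn one urinalysis record (a dict with hypothetical keys) into 9 integers."""
    return [
        int(record["age"]),
        1 if record["sex"] == "female" else 0,          # assumed coding direction
        encode_ordinal(record["leukocyte_esterase"], LEUKOCYTE_ESTERASE_SCALE),
        1 if record["nitrite"] == "positive" else 0,    # assumed coding direction
        encode_ordinal(record["protein"], PROTEIN_SCALE),
        encode_ordinal(record["occult_blood"], OCCULT_BLOOD_SCALE),
        int(record["rbc_count"]),
        int(record["wbc_count"]),
        int(record["epithelial_count"]),
    ]


example = {
    "age": 42, "sex": "female", "leukocyte_esterase": "2+", "nitrite": "negative",
    "protein": "trace", "occult_blood": "negative",
    "rbc_count": 3, "wbc_count": 12, "epithelial_count": 5,
}
features = encode_case(example)
```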
Model Training and Validation
Algorithms
LR Algorithm
LR is one of the simplest binary classification algorithms. In the LR algorithm, the predictive outcome ŷ(x) of given data x is defined as follows:
$$\hat{y}(x) = \left[1 + \exp\left(-(w_0 + w^{T}x)\right)\right]^{-1}$$
where w and w0 represent the weight vector and bias of the LR model, respectively. LR is an example of a generalized linear model, and the output of the LR model represents the estimated probability of a certain class.
kNN Algorithm
The kNN algorithm is a memory-based algorithm. Accordingly, predictions of the kNN algorithm are directly based on the training data set and no additional training is required. Predictions of the kNN algorithm, which are denoted by ŷ(x), are based on the voting results of the k most similar instances in the training data set. The prediction ŷ(x) is defined as follows:
$$\hat{y}(x) = \arg\max_{c} \sum_{x_i \in K} w(x_i)\,\mathbb{1}[y_i = c]$$
where K denotes the set of k instances in the training data set that are most similar to the given data x, y_i denotes the class label of x_i, and w(x_i) denotes the voting weight of the corresponding instance x_i. In the kNN algorithm, the distance between two data points x and x′ is typically defined as follows:
$$d(x, x') = \left(\sum_{j} \lvert x_j - x'_j \rvert^{q}\right)^{1/q}$$
where q is a given positive number. The Manhattan distance (for q=1) and the Euclidean distance (for q=2) are the two most common distance metrics. In this study, the numbers of nearest neighbors (k) were 27 and 7 in the final models of the mass spectrometry and urinalysis data sets, respectively.
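The kNN prediction rule can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in the study (which relied on scikit-learn); the function names and the default uniform weighting are assumptions.

```python
from collections import defaultdict


def minkowski_distance(x, x_prime, q=2):
    """Distance with exponent q: q=1 is Manhattan, q=2 is Euclidean."""
    return sum(abs(a - b) ** q for a, b in zip(x, x_prime)) ** (1.0 / q)


def knn_predict(x, train_X, train_y, k, q=2, weight=lambda dist: 1.0):
    """Weighted vote among the k training instances closest to x.

    `weight` maps a distance to a voting weight; the default gives
    every neighbor one vote (an assumption, not taken from the study).
    """
    ranked = sorted(
        ((minkowski_distance(x, xi, q), yi) for xi, yi in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )
    votes = defaultdict(float)
    for dist, label in ranked[:k]:
        votes[label] += weight(dist)
    return max(votes, key=votes.get)


train_X = [[0, 0], [0, 1], [1, 0], [1, 1], [5, 5]]
train_y = [0, 0, 0, 1, 1]
prediction = knn_predict([0, 0], train_X, train_y, k=3)
```

Because every prediction scans the full training set, inference cost grows with the number of stored cases, which is one reason a memory-based method can be expensive at inference time.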
SVM Algorithm
The SVM algorithm is a commonly used binary classification method. The purpose of the SVM algorithm is to find a hyperplane that separates two classes of data with the maximum margin in the feature space. In the original linear SVM, the two output classes are labeled as +1 and −1, and the predictive outcome ŷ(x) of given data x is defined as follows:
$$\hat{y}(x) = \operatorname{sign}(w_0 + w^{T}x)$$
where w and w0 represent the weight vector and bias of the SVM model, respectively.
The SVM algorithm is frequently applied with kernel transformations. Kernel functions represent the similarity between two data points. The radial basis function is one of the most commonly used kernel functions and is defined as follows:
$$\kappa(x, x') = \exp\left(-\frac{\lVert x - x' \rVert^{2}}{2\sigma^{2}}\right)$$
where σ is the bandwidth. Through kernel transformations, the original feature space can be mapped into a higher dimension, which may improve the predictive performance of the SVM algorithm. For a kernelized SVM algorithm, the following equation is obtained:
$$\hat{y}(x) = \operatorname{sign}\left(w_0 + \sum_{i} \alpha_i\,\kappa(x_i, x)\right)$$
where α is a sparse vector. For all nonzero αi values, corresponding xi terms represent support vectors. Accordingly, the final prediction ŷ(x) depends on only the support vectors and is independent of the remaining training data. In this study, we selected the kernelized SVM algorithm with the radial basis function kernel as our final SVM model according to its validation performance.
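The dependence of the kernelized prediction on the support vectors alone can be made concrete with a short sketch. It assumes that each coefficient αi already carries the sign of its class label, and the function names and toy values are hypothetical.

```python
import math


def rbf_kernel(x, x_prime, sigma):
    """Radial basis function kernel with bandwidth sigma."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def svm_predict(x, support_vectors, alphas, w0, sigma):
    """Return +1 or -1 from a kernel expansion over the support vectors only.

    `alphas` holds the nonzero coefficients (assumed to be signed by class),
    so the rest of the training data is never touched at inference time.
    """
    score = w0 + sum(
        alpha * rbf_kernel(sv, x, sigma)
        for sv, alpha in zip(support_vectors, alphas)
    )
    return 1 if score >= 0 else -1


support_vectors = [[0.0, 0.0], [2.0, 2.0]]
alphas = [1.0, -1.0]
near_first = svm_predict([0.1, 0.1], support_vectors, alphas, w0=0.0, sigma=1.0)
near_second = svm_predict([1.9, 1.9], support_vectors, alphas, w0=0.0, sigma=1.0)
```

Each prediction costs one kernel evaluation per support vector, so a model that retains many support vectors will tend to have a longer inference time.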
RF Algorithm
RF is an ensemble decision tree classifier. The prediction of the RF algorithm, namely ŷ(x), depends on the voting results of numerous decision trees. The prediction ŷ(x) is defined as follows:
$$\hat{y}(x) = \operatorname{mode}\{\,T_i(x) : T_i \in T\,\}$$
where T represents the set of decision trees in the RF and Ti(x) represents the predictive outcome of a given decision tree.
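A majority vote over trees can be sketched as follows. The trees are represented as plain Python callables (hypothetical one-split stumps), which is an illustrative simplification; library implementations may instead average class probabilities across trees.

```python
from collections import Counter


def forest_predict(x, trees):
    """Return the class predicted by the largest number of trees."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]


# Three hypothetical stumps, each splitting on a single binary feature.
trees = [
    lambda x: 1 if x[0] == 1 else 0,
    lambda x: 1 if x[1] == 1 else 0,
    lambda x: 1 if x[2] == 1 else 0,
]
prediction = forest_predict([1, 1, 0], trees)
```

Every tree must be evaluated for each prediction, so inference work scales with the number of trees in the forest.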
RF training involves the “bagging” (ie, bootstrap aggregating) technique. Accordingly, each decision tree in the RF considers only a subset of training cases to improve model generalizability. In addition to the bagging technique, each split of the decision trees only considers a subset of the input features during training to prevent the growth of highly correlated trees. In this study, the number of trees was set to 1000 and 1500 in the final RF models for the mass spectrometry and urinalysis data sets, respectively.
XGB Algorithm
The XGB algorithm is a type of ensemble algorithm, which uses the “boosting” technique to reduce the overall bias by sequentially combining weak classifiers into a model. In practice, shallow decision trees are typical weak classifiers. The outcome ŷ(x) of XGB models represents the log odds of a certain class in binary classification tasks. The outcome ŷ(x) is defined as follows:
$$\hat{y}(x) = \sum_{i=1}^{M} T_i(x)$$
where Ti(x) denotes the output of the ith of the M decision tree regressors in the model.
In each iteration, the training of decision trees in XGB is equivalent to a process of minimizing a certain objective function. Because XGB is a regularized algorithm, its objective function is different from those of the original gradient boosting algorithms. The objective function of a given decision tree in the XGB algorithm can be expressed as follows:
$$\mathrm{obj} = -\frac{1}{2}\sum_{j=1}^{N}\frac{G_j^{2}}{H_j + \lambda} + \gamma N$$
where N denotes the number of leaf nodes in a decision tree and γ and λ denote given positive numbers. The parameters Gj and Hj are defined as follows:
$$G_j = \sum_{x_i \in L_j} \frac{\partial\, l(y_i, \hat{y}_i)}{\partial \hat{y}_i}, \qquad H_j = \sum_{x_i \in L_j} \frac{\partial^{2} l(y_i, \hat{y}_i)}{\partial \hat{y}_i^{2}}$$
where ŷi represents the predictive outcome of training data xi after a certain number of iterations, l(yi,ŷi) represents a certain loss function between the predictive and actual outcomes, and Lj represents the set of data points xi belonging to the jth leaf node.
In this study, we implemented the XGB model by using the “xgboost 0.90” Python library.
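To make the regularized objective more tangible, the sketch below evaluates the standard XGBoost structure score for a fixed tree, assuming the logistic loss (the source does not state which loss was used). All function names and the toy values of λ and γ are illustrative and do not belong to the xgboost library.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_hess_logistic(y, y_hat):
    """First and second derivatives of the logistic loss w.r.t. the log odds y_hat."""
    p = sigmoid(y_hat)
    return p - y, p * (1.0 - p)


def structure_score(leaves, lam, gamma):
    """Objective of one tree: -1/2 * sum_j G_j^2 / (H_j + lam) + gamma * N.

    `leaves` is a list with one entry per leaf node; each entry is a list of
    (y, y_hat) pairs for the training points that fall into that leaf.
    """
    total = 0.0
    for leaf in leaves:
        G = sum(grad_hess_logistic(y, y_hat)[0] for y, y_hat in leaf)
        H = sum(grad_hess_logistic(y, y_hat)[1] for y, y_hat in leaf)
        total += -0.5 * G * G / (H + lam)
    return total + gamma * len(leaves)


def xgb_log_odds(x, trees):
    """Model output: the sum of the outputs of all tree regressors."""
    return sum(tree(x) for tree in trees)


# Two leaves, all current predictions at log odds 0 (probability 0.5).
leaves = [[(1, 0.0), (1, 0.0)], [(0, 0.0)]]
score = structure_score(leaves, lam=1.0, gamma=0.1)
```

The γN term penalizes every additional leaf, which is one way the regularization keeps individual trees shallow.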
NN Algorithms
NN is a type of ML model that is inspired by the human nervous system. An NN consists of multiple layers of nodes (or neurons). The layer that receives the initial data is the input layer and the layer that exports the predictive results is the output layer. One or more hidden layers exist between the input and output layers. The outputs of each node in an NN are obtained according to the outputs of the nodes in the previous layer.
Several types of connection patterns are possible between two adjacent layers. For example, in a classic fully connected layer, a series of weighted sums of the inputs is first calculated according to the given model parameters. These weighted sums are subjected to nonlinear transformation to obtain the output of the aforementioned layer. In practice, these steps are implemented using vectorized expressions, and the output vector of the nth hidden layer, namely an+1, is expressed as follows:
$$a_{n+1} = g(\Theta_n \cdot a_n + b_n)$$
where an, Θn, and bn represent the input vector, weight matrix, and bias vector of the hidden layer, respectively, and g represents a nonlinear activation function (eg, the sigmoid function or rectified linear unit activation function).
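The layer-by-layer computation can be sketched without any framework. The following is a minimal illustration of forward propagation through fully connected layers, not the PyTorch models used in the study; the tiny weights are hypothetical.

```python
import math


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def dense(theta, a, b, g):
    """One fully connected layer: g(theta . a + b).

    `theta` has one row of weights per output unit, so each output costs
    len(a) multiply-add operations.
    """
    z = [sum(w * ai for w, ai in zip(row, a)) + bi for row, bi in zip(theta, b)]
    return g(z)


def forward(x, layers):
    """Apply each (theta, b, g) layer in order, starting from the input x."""
    a = x
    for theta, b, g in layers:
        a = dense(theta, a, b, g)
    return a


layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], relu),   # hidden layer, 2 units
    ([[1.0, 1.0]], [-1.0], sigmoid),                 # output layer, 1 unit
]
output = forward([2.0, 1.0], layers)
```

The multiply-add count of a layer is the product of its input and output sizes, which is why wide and deep architectures require far more arithmetic per prediction than a handful of tree comparisons.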
NNs are ML models that are flexible in terms of the numbers of hidden layers and nodes in each layer. According to previous studies, one hidden layer is sufficient for approximating most continuous functions. By contrast, NNs with more than one hidden layer are called deep NNs, and they have superior generalization ability to one-hidden-layer NNs. With improvement of the hardware, deep learning models have become increasingly popular over the past few years. In this study, we constructed two types of NNs, namely a one-hidden-layer NN (NN1) and a five-hidden-layer NN (NN5), as our underlying architectures. NN1 represents the simplest form of NNs, whereas NN5 represents a deep learning model.
To determine the appropriate architecture of an NN model, some previous studies offered theoretical heuristics regarding the number of hidden units in an NN layer. However, the recommended number of nodes in a hidden layer varied widely across these studies. Accordingly, in this study, we selected the final number of hidden units according to the cross-validation results. After hyperparameter tuning, we determined that the final sizes of the NN1 and NN5 architectures were 268×2048×1 and 268×1024×1024×1024×1024×1024×1, respectively, for the mass spectrometry data set and 9×128×1 and 9×512×512×512×512×1, respectively, for the urinalysis data set.
Pruned NNs
Pruning is a method for eliminating redundant connections in NNs. In this method, an NN is converted into a sparse model to reduce its size. Pruning methods can be unstructured or structured. Unstructured pruning eliminates the individual parameters in an NN, whereas structured pruning eliminates the connections in large units such as hidden units in a fully connected layer or channels in a convolutional layer.
In this study, we applied global unstructured pruning to eliminate connections from the entire NN. The pruned NNs displayed in the figures are NN5s with a sparsity of 50%. However, we implemented pruning with sparsity values of 25%, 50%, and 75% for the NN1 and NN5 models. The detailed results regarding other pruned NNs are provided in - .
Quantized NNs
Quantization is a common method for model compression. In this method, the model size is reduced by computing and storing parameters with low bit widths. Two main quantization methods exist in the PyTorch framework: dynamic and static quantization. Dynamic quantization is the simplest quantization method. In dynamic quantization, the weights of the quantized layers in an NN are replaced with low-precision data, and the activations are quantized just before entering each quantized layer during inference. By contrast, in static quantization, the parameters for activation quantization are determined before the inference phase. Therefore, static quantization requires an additional calibration with a data set before inference.
In this study, the parameters of the original NN1 and NN5 models were tensors in the single-precision floating-point format; the quantized models had a quantized 8-bit signed integer data format. The quantized NNs displayed in the figures are NN5s. However, we implemented dynamic quantization for both the NN1 and NN5 models, and the detailed results regarding quantized NNs are provided in - .
Model Construction
In this study, we selected the aforementioned supervised ML algorithms according to their maturity and popularity. For every ML model, we tuned the hyperparameters in each algorithm through 5-fold cross-validation. The cutoff with the highest Youden index was selected as the final cutoff in each model.
Model Comparison on Deployment
Predictive Performance
We evaluated the predictive performance of all final models using independent testing data sets. We selected accuracy and AUROC as the predictive performance metrics. The 95% CIs of both accuracy and AUROC were calculated.
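As a minimal sketch of the two metrics, accuracy can be computed at a fixed cutoff and AUROC can be computed as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The function names are hypothetical, and the CI calculation is omitted because the source does not state which method was used.

```python
def accuracy(y_true, scores, cutoff):
    """Fraction of cases classified correctly when scores >= cutoff count as positive."""
    predictions = [1 if s >= cutoff else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, y_true)) / len(y_true)


def auroc(y_true, scores):
    """AUROC via pairwise comparison of positive and negative scores (ties count 0.5).

    This quadratic-time form is for illustration only; rank-based
    formulations are preferable for large test sets.
    """
    positives = [s for s, y in zip(scores, y_true) if y == 1]
    negatives = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
test_accuracy = accuracy(y_true, scores, cutoff=0.5)
test_auroc = auroc(y_true, scores)
```

Unlike accuracy, AUROC does not depend on the chosen cutoff, which is why the two metrics can rank models differently.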
Time and Power Consumption
We derived the inference time and power data from Intel Power Gadget 3.5. This commercial product provides power data on the basis of Intel Running Average Power Limit (RAPL) interface estimation. RAPL is a driver that provides a set of performance counter data on time, power, and energy.
We implemented the ML models using command lines and logged the time and power data using PowerLog3.0.exe, a command line version of Intel Power Gadget that allows users to log the time and power data of a specific command line. In addition, because Intel Power Gadget only provides the energy data of the entire processor, all testing procedures were performed without background programs. The measurement for each algorithm was repeated 100 times.
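Average power and run time can be combined into an approximate energy per inference, because energy is the product of power and time. The sketch below applies this relation to two of the values reported in this paper; the function name is hypothetical, and the result should be read as a processor-level estimate because the measured power covers the entire processor rather than the model alone.

```python
def energy_millijoules(mean_power_watts, time_milliseconds):
    """Energy = power x time; watts times milliseconds gives millijoules."""
    return mean_power_watts * time_milliseconds


# Values reported in this paper (mean power, inference time).
xgb_mass_spectrometry = energy_millijoules(9.42, 0.47)   # about 4.43 mJ
lr_urinalysis = energy_millijoules(9.98, 0.47)           # about 4.69 mJ
```

Because energy depends on both factors, an algorithm with moderate power draw but a long run time can still consume more energy per inference than one with higher power and a short run time.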
Statistical Analysis
We initially employed the Shapiro-Wilk test to check for normality. If the assumption of normality did not hold, we subsequently adopted the Friedman test to compare the means of different groups. The pairwise Wilcoxon signed-rank test was used to identify which groups were different. P values were adjusted using the Bonferroni multiple testing correction method. All statistical tests were two-sided with an α error level of .05.
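The Bonferroni adjustment multiplies each P value by the number of comparisons and caps the result at 1. A minimal sketch follows; the function name is hypothetical, and the number of comparisons in the example (three) is chosen only for illustration.

```python
def bonferroni_adjust(p_values):
    """Multiply each P value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


adjusted = bonferroni_adjust([0.001, 0.02, 0.5])
significant = [p < 0.05 for p in adjusted]
```

If every pair among nine algorithms were compared, there would be 36 comparisons, so an unadjusted P value would need to fall below roughly .0014 to remain significant at the .05 level.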
Results
Predictive Performance of ML Algorithms
and display the classification accuracy rates and AUROC values for the various ML models, respectively. Almost all models had high accuracy rates. All algorithms, except for the kNN algorithm, achieved an accuracy rate of at least 70% for the mass spectrometry data set. Moreover, all algorithms, except for the SVM algorithm, achieved an accuracy rate of at least 70% for the urinalysis data set. As displayed in , the two tree-based methods, namely the RF and XGB algorithms, achieved the two highest AUROC values for both datasets (84.7% and 83.9% for the mass spectrometry data set, respectively; 91.1% and 91.4% for the urinalysis data set, respectively). In particular, the RF and XGB algorithms exhibited significantly higher AUROC values than those of most of the other algorithms (eg, kNN, SVM, pruned five-hidden layer NN [PNN], and NN5) for the urinalysis data set. The results regarding the algorithms’ predictive performance are detailed in and .
Inference Times of ML Algorithms
presents a comparison of the inference times of the various ML algorithms. All algorithms completed the inference process within 1 millisecond. The XGB and LR algorithms had the shortest runtimes (0.47 milliseconds for both in the mass spectrometry data set; 0.39 and 0.47 milliseconds, respectively, for the urinalysis data set). The Wilcoxon signed-rank test results revealed that the run times of these two algorithms differed significantly (P<.001) from those of the other algorithms, except for NN1. The SVM and RF algorithms exhibited the highest time consumption for the mass spectrometry and urinalysis data sets, respectively. In particular, the RF algorithm exhibited a higher run time compared with that of all other algorithms, except for the SVM algorithm, for both data sets (P<.001). The results regarding the time consumption of the algorithms are detailed in - , and the corresponding P values derived from the Wilcoxon signed-rank test are presented in - .
Power Consumption of ML Algorithms
presents a comparison of the power consumption levels of the ML algorithms. Algorithms of the same type consumed similar amounts of power. For example, both tree-based algorithms (RF and XGB) consumed limited power, whereas all NN-based models (NN1, quantized five-layer hidden NN [QNN], PNN, and NN5) consumed considerable power. The XGB algorithm exhibited the lowest power consumption for the mass spectrometry data set (9.42 Watts) and the LR algorithm exhibited the lowest power consumption for the urinalysis data set (9.98 Watts). According to the results of the Wilcoxon signed-rank tests ( - ), the LR and XGB algorithms exhibited lower power consumption levels than did the kNN algorithm and all NN-based algorithms for both datasets (P≤.001). The NN5, kNN, PNN, QNN, and NN1 algorithms exhibited higher power consumption levels compared with those of the other algorithms. Although pruning and quantization reduced the power consumption levels of the NN algorithms, the energy efficiency levels of the PNN and QNN algorithms did not surpass those of all the non-NN–based algorithms, except for the kNN algorithm. The results regarding power consumption are detailed in - , and the corresponding P values derived from the Wilcoxon signed-rank test are presented in - .
Overall Comparison
and display scatter plots of the performance of the various algorithms in predicting S. aureus methicillin resistance and T. vaginalis infection. The horizontal and vertical axes in these figures represent the AUROC and average power consumption, respectively. The two dashed lines in the figures represent the average AUROC values and mean power consumption levels for the nine algorithms. Only the XGB and RF algorithms had higher than average AUROC and power consumption results for both data sets. and also illustrate the difference between the NN-based and non-NN–based algorithms. All NN-based algorithms are located in the lower half-plane in these figures and all the other algorithms, except for the kNN algorithm, are located in the upper half-plane in the figures. These results indicate that the NN-based algorithms had higher power consumption levels than those of the non-NN–based algorithms, even when model compression was executed through methods such as pruning or quantization.
Discussion
Principal Findings and Related Works
In this study, we compared the predictive performance, time consumption, and power consumption of nine algorithms using two clinical laboratory data sets. The XGB algorithm achieved a balanced performance with respect to the aforementioned metrics, indicating that the XGB algorithm is ideal for medical artificial intelligence applications with energy constraints.
In addition to this study, previous studies have performed comparative analyses of various ML algorithms in the medical domain. However, only a few studies have considered the inference efficiency in addition to the predictive performance. Zhang et al compared the simplicity of seven algorithms by assessing their memory usage and training time for 12 public biomedical data sets. In another study, Deng et al assessed the inference time of decision tree, SVM, RF, and NN algorithms. In this study, we executed our efficiency evaluation by directly exploring and comparing the power consumption levels of ML algorithms. Furthermore, all power consumption data were obtained according to real-time experimental results from performance counters.
Predictive Performance of ML Algorithms
The RF and XGB algorithms exhibited higher AUROC values than did the other algorithms for both data sets. This finding is similar to those of previous studies. In a study that considered 11 performance metrics, the RF algorithm and probability-calibrated boosted trees exhibited the best performance among 10 algorithms. Other previous analyses also indicated that the RF and XGB algorithms consistently exhibit good performance for most biomedical data sets. These algorithms have certain advantages; for example, they exhibit adequate scalability to large data sets and are more robust than other types of algorithms. Medical data sets usually comprise features with strong signals; this is because only well-validated markers are routinely tested in clinical scenarios. Under this condition, tree-based methods would not be inferior to relatively complex models such as NN-based models. However, one should remember the “no free lunch theorem,” which suggests that no model exhibits superior performance universally. This statement is true because every algorithm is proposed on the basis of different underlying assumptions, which may fit only specific types of data. Therefore, different algorithms should be investigated when the predictive performance of a certain model does not match the expectation.
Inference Time of ML Algorithms
In this study, the XGB and LR algorithms exhibited the shortest run times (both 0.47 milliseconds for the mass spectrometry data set; 0.39 and 0.47 milliseconds, respectively, for the urinalysis data set). The SVM and RF algorithms exhibited the highest time consumption levels for the mass spectrometry and urinalysis data sets, respectively. Notably, although the XGB and RF algorithms are ensemble algorithms based on decision trees, the XGB algorithm consumed less time than the RF algorithm. This finding is possibly due to differences in the depth and number of trees between these algorithms. For both data sets, the XGB model had shallower trees than did the RF model (for the mass spectrometry and urinalysis data sets, the maximum depths of the XGB decision trees were 6 and 10, respectively, and the average depths of the RF trees were 29 and 21, respectively). An explanation for this finding is that boosting reduces the bias of weak classifiers and that bagging reduces the variance of complex classifiers. Thus, the XGB algorithm may have shallower decision trees compared with those of the RF algorithm for the same prediction task. In addition to the depth difference, the number of trees may be another cause of the run time difference between the two tree-based algorithms (for the mass spectrometry and urinalysis data sets, the XGB algorithm contained 120 and 32 decision trees, respectively, and the RF algorithm contained 1000 or more decision trees). In an RF model, increasing the number of decision trees does not engender overfitting. However, this characteristic may result in a final model with excessive decision trees after conventional grid-search cross-validation. By contrast, because an excessive number of decision trees results in overfitting in an XGB model, an XGB model with optimal predictive performance would have an appropriate number of trees. Furthermore, to identify the suitable tree numbers, the early stopping technique is frequently used during training of XGB models in practice. In conclusion, the shorter run time of the XGB algorithm compared with the RF algorithm is possibly due to the different characteristics of these algorithms.
Power Consumption of ML Algorithms
The NN algorithms (NN1, QNN, PNN, and NN5) and the kNN algorithm exhibited the highest power consumption levels in this study, and the two tree-based algorithms (ie, RF and XGB) exhibited the lowest power consumption levels. Tree-based algorithms use the data structure of search trees for making inferences. The inference process mainly involves comparison operations at tree nodes and irregular memory access operations for subtree retrievals. In contrast to several other ML algorithms, tree-based algorithms typically do not use multiplication operations. The comparison and memory access operations in tree-based algorithms consume less energy than do multiplication operations [
, ]. The experiments in this study were run on a general-purpose CPU. Therefore, if necessary, the energy efficiency of tree-based algorithms could be increased further by using specialized hardware acceleration [ - ].
NNs have been regarded as the main tools for implementing ML in the last few years. The development of different NN architectures (eg, convolutional NNs and recurrent NNs) has contributed to considerable improvements in unstructured data analyses [
, ]. However, NNs have high power consumption. Thus, NNs should not be considered the preferred algorithm for implementing ML unless they exhibit superior predictive performance compared with other algorithms. In this study, the adopted NNs consumed considerable power because of their high computational and communication demands. The computational demand of an NN refers to the large number of multiply-add operations in the forward propagation process, and the communication demand of an NN refers to the energy cost of frequently moving large quantities of data between the processor and memory [ , ].
Several methods are available for reducing the power consumption of NNs. NNs have diverse architectures, and constructing an NN with a small architecture is an effective method for improving energy efficiency, as reflected by the difference in power consumption between the NN1 and NN5 models in this study (see
and ). In addition to constructing a small model, a given NN model can be compressed to reduce power consumption. In this study, we implemented and evaluated two common methods for NN compression, namely pruning [ , ] and quantization [ ]. According to the obtained results, these model compression methods reduced the power consumption levels of the NNs. However, even after model compression, the NN-based algorithms did not exhibit higher energy efficiency levels than the non-NN–based algorithms. Furthermore, although energy optimization methods such as quantization are frequently used for NNs, these methods are not specific to NNs [ , ]. Thus, quantization can feasibly be applied to other ML algorithms if their power consumption must be decreased.
Overall Comparison
In summary, the XGB algorithm achieved balanced predictive performance and energy efficiency levels.
and display the predictive performance–power consumption plots of the nine algorithms for the mass spectrometry and urinalysis data sets, respectively. In these figures, the two tree-based algorithms, namely the XGB and RF algorithms, are located in the upper-right quadrant, which indicates that they had higher than average predictive performance and lower than average power consumption. However, the XGB algorithm consumed less time than the RF algorithm (P<.001, according to the Wilcoxon signed-rank test; and - ). Thus, the XGB algorithm achieved a higher energy efficiency level than the RF algorithm because the overall energy consumption for ML inference depends not only on power consumption but also on inference time.
Deep learning models are the main ML algorithms applied currently. These algorithms achieve state-of-the-art predictive performance for unstructured data sets (eg, data sets for computer vision and natural language processing) [
]. However, deep learning algorithms may be unnecessary for making predictions based on clinical laboratory data sets. In and , all of the NN-based algorithms are located in the lower half-plane, signifying that the NN-based algorithms consumed more power than did most of the other algorithms. Pruning and quantization increased the efficiency levels of the NN-based algorithms; however, the increase was limited, and the energy efficiency levels of these algorithms did not surpass that of the XGB algorithm. Moreover, the NN-based algorithms did not exhibit higher AUROC values compared with those of the simple tree-based algorithms. The experimental results indicate that for data analysis in the clinical laboratory domain, simpler models such as the XGB model may be sufficient to achieve state-of-the-art predictive performance. Deep NNs are unsuitable for such data sets due to the high power consumption of these networks.
Limitations
This study has some limitations. First, because Intel Power Gadget 3.5 only provides the energy consumption of the entire processor [
], one should focus on the comparison of the investigated ML algorithms and not on the absolute power consumption obtained. Second, this study considered only two clinical laboratory data sets. Because energy consumption varies between data sets, a large-scale study based on a variety of medical data sets is essential for confirming the results of this study. Finally, the results were obtained using a general-purpose CPU; however, energy consumption may vary across different processors. Currently, ML is frequently implemented using hardware acceleration techniques. Although hardware devices such as discrete graphics processing units or tensor processing units are not ubiquitous equipment in clinical settings, their energy efficiency levels are worth investigating. Energy efficiency is a major issue in embedded systems, and studies have been performed on the energy optimization of different algorithms [ , , ]. Executing a fair comparison of energy efficiency under different hardware implementations is difficult. Hence, a well-designed comparative analysis of energy efficiency across different optimized methods is essential for obtaining general conclusions.
Conclusions
This study comprehensively compared various ML algorithms in terms of their predictive performance, time consumption, and power consumption when implemented on two clinical laboratory data sets. According to the results, the XGB algorithm attained balanced performance levels in terms of the aforementioned parameters for the two data sets. Thus, the XGB algorithm is ideal for application in real-world clinical settings.
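Because the energy used per inference is the product of average power and inference time, the trade-off among these parameters can be illustrated with a short calculation. The Python sketch below is not part of the study's code. It applies this product to the power and inference time values reported in the Abstract for the XGB algorithm (mass spectrometry data set) and the LR algorithm (urinalysis data set). The type and function names are hypothetical, the two "hypothetical model" entries are invented solely to show that a lower-power model can still use more energy if it is slower, and the results inherit the limitation that the reported power is that of the entire processor.

```python
from typing import NamedTuple


class Measurement(NamedTuple):
    """Average power and inference time for one algorithm on one data set."""
    label: str
    avg_power_watts: float
    inference_time_ms: float


def energy_mj(m: Measurement) -> float:
    """Energy in millijoules: 1 W sustained for 1 ms equals 1 mJ."""
    if m.avg_power_watts < 0 or m.inference_time_ms < 0:
        raise ValueError("power and time must be non-negative")
    return m.avg_power_watts * m.inference_time_ms


# Values reported in the Abstract of this paper (whole-processor power).
REPORTED = [
    Measurement("XGB, mass spectrometry", 9.42, 0.47),
    Measurement("LR, urinalysis", 9.98, 0.47),
]

# ASSUMPTION: invented numbers, used only to illustrate the trade-off.
HYPOTHETICAL = [
    Measurement("hypothetical model A (lower power, slower)", 9.0, 1.0),
    Measurement("hypothetical model B (higher power, faster)", 10.0, 0.5),
]

if __name__ == "__main__":
    for m in REPORTED + HYPOTHETICAL:
        print(f"{m.label}: {energy_mj(m):.4f} mJ")
```

With the reported values, the estimate is approximately 4.43 mJ for the XGB algorithm on the mass spectrometry data set and 4.69 mJ for the LR algorithm on the urinalysis data set. In the hypothetical pair, model B draws more power than model A but uses less energy (5.0 mJ vs 9.0 mJ) because it finishes in half the time.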
Acknowledgments
This work was supported by Chang Gung Memorial Hospital (Linkou) (CMRPG3J1791, CMRPG3L0401, CMRPG3L0431, and CMRPG3L1011) and Ministry of Science and Technology, Taiwan (MOST 110-2636-E-008-008). This manuscript was edited by Wallace Academic Editing.
Authors' Contributions
HW conceptualized the study. JY, CHC, and TH wrote the manuscript, analyzed the data, plotted the figures, and created the tables. JY performed the experiments. TH, JL, CRC, TL, MW, YT, and HW reviewed and edited the manuscript for important intellectual content. YT and HW obtained funding and supervised the study. All authors discussed the results and revised the manuscript.
Conflicts of Interest
None declared.
Multimedia Appendix 2
Classification performance of nonneural network–based machine learning algorithms implemented on the mass spectrometry and urinalysis data sets.
DOCX File, 18 KB
Multimedia Appendix 3
Classification performance of different neural networks (NNs) implemented on the mass spectrometry and urinalysis data sets.
DOCX File, 16 KB
Multimedia Appendix 4
Inference time and average power consumption levels of nonneural network–based algorithms implemented on the mass spectrometry and urinalysis data sets.
DOCX File, 15 KB
Multimedia Appendix 5
Inference time and average power consumption levels of different neural networks (NNs) implemented on the mass spectrometry and urinalysis data sets.
DOCX File, 15 KB
Multimedia Appendix 6
P values derived from pairwise Wilcoxon signed-rank tests comparing the time and power consumption of each pair of algorithms on the mass spectrometry data set. The P values were adjusted using the Bonferroni correction for multiple testing. LR, logistic regression; kNN, k-nearest neighbors; SVM, support vector machine; RF, random forest; XGB, extreme gradient boosting; NN1, one-hidden-layer neural network; QNN, quantized five-hidden-layer neural network; PNN, pruned five-hidden-layer neural network; NN5, five-hidden-layer neural network.
DOCX File, 15 KB
Multimedia Appendix 7
P values derived from pairwise Wilcoxon signed-rank tests comparing the time and power consumption of each pair of algorithms on the urinalysis data set. The P values were adjusted using the Bonferroni correction for multiple testing. LR, logistic regression; kNN, k-nearest neighbors; SVM, support vector machine; RF, random forest; XGB, extreme gradient boosting; NN1, one-hidden-layer neural network; QNN, quantized five-hidden-layer neural network; PNN, pruned five-hidden-layer neural network; NN5, five-hidden-layer neural network.
DOCX File, 15 KB
References
- Wang H, Chen C, Shi S, Chung C, Wen Y, Wu M, et al. Improving multi-tumor biomarker health check-up tests with machine learning algorithms. Cancers (Basel) 2020 Jun 01;12(6):1442.
- Wang H, Hsieh C, Wen C, Wen Y, Chen C, Lu J. Cancers screening in an asymptomatic population by using multiple tumour markers. PLoS One 2016;11(6):e0158285.
- Wang H, Chen C, Lee T, Horng J, Liu T, Tseng Y, et al. Rapid detection of heterogeneous vancomycin-intermediate based on matrix-assisted laser desorption ionization time-of-flight: using a machine learning approach and unbiased validation. Front Microbiol 2018;9:2393.
- Chatterjee A, Gerdes MW, Martinez SG. Identification of risk factors associated with obesity and overweight-a machine learning overview. Sensors (Basel) 2020 May 11;20(9):2734.
- Kohli M, Prevedello LM, Filice RW, Geis JR. Implementing machine learning in radiology practice and research. AJR Am J Roentgenol 2017 Apr;208(4):754-760.
- Kang M, Gonugondla S, Shanbhag N. A 19.4 nJ/decision 364K decisions/s in-memory random forest classifier in 6T SRAM array. 2017 Presented at: ESSCIRC 2017-43rd IEEE European Solid State Circuits Conference; September 11-14, 2017; Leuven, Belgium.
- Rouhani B, Mirhoseini A, Koushanfar F. DeLight: adding energy dimension to deep neural networks. 2016 Presented at: 2016 International Symposium on Low Power Electronics and Design; August 8-10, 2016; San Francisco, CA.
- Shoaib M, Jha N, Verma N. A low-energy computation platform for data-driven biomedical monitoring algorithms. 2011 Presented at: The 48th Annual Design Automation Conference 2011; June 5-10, 2011; San Diego, CA.
- Gauen K, Rangan R, Mohan A, Lu Y, Liu W, Berg A. Low-power image recognition challenge. 2017 Presented at: 2017 22nd Asia and South Pacific Design Automation Conference; January 16-19, 2017; Chiba, Japan.
- Yang T, Chen Y, Sze V. Designing energy-efficient convolutional neural networks using energy-aware pruning. 2017 Presented at: 2017 IEEE Conference on Computer Vision and Pattern Recognition; July 21-26, 2017; Honolulu, HI.
- Ayinala M, Parhi K. Low-energy architectures for support vector machine computation. 2013 Presented at: 2013 Asilomar Conference on Signals, Systems and Computers; November 3-6, 2013; Pacific Grove, CA.
- Uddin S, Khan A, Hossain ME, Moni MA. Comparing different supervised machine learning algorithms for disease prediction. BMC Med Inform Decis Mak 2019 Dec 21;19(1):281.
- Zhang Y, Xin Y, Li Q, Ma J, Li S, Lv X, et al. Empirical study of seven data mining algorithms on different characteristics of datasets for biomedical classification applications. Biomed Eng Online 2017 Nov 02;16(1):125.
- Harper PR. A review and comparison of classification algorithms for medical decision making. Health Policy 2005 Mar;71(3):315-331.
- García-Martín E, Rodrigues CF, Riley G, Grahn H. Estimation of energy consumption in machine learning. J Parallel Distrib Comput 2019 Dec;134:75-88.
- Kibriya A, Frank E. An empirical comparison of exact nearest neighbour algorithms. 2007 Presented at: European Conference on Principles of Data Mining and Knowledge Discovery; September 17-21, 2007; Warsaw, Poland.
- Murphy K. Machine learning: a probabilistic perspective. Cambridge, MA: MIT Press; 2012.
- Louppe G. Understanding random forests: From theory to practice. arXiv. 2014. URL: https://arxiv.org/abs/1407.7502 [accessed 2021-03-19]
- Zhang B, Davoodi A, Hu YH. Exploring energy and accuracy tradeoff in structure simplification of trained deep neural networks. IEEE J Emerg Sel Topics Circuits Syst 2018 Dec;8(4):836-848.
- Abraham A, Pedregosa F, Eickenberg M, Gervais P, Mueller A, Kossaifi J, et al. Machine learning for neuroimaging with scikit-learn. Front Neuroinform 2014;8:14.
- Chen T, Guestrin C. Xgboost: A scalable tree boosting system. 2016 Presented at: The 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; August 13-17, 2016; San Francisco, CA.
- Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G. Pytorch: An imperative style, high-performance deep learning library. arXiv. 2019. URL: https://arxiv.org/abs/1912.01703 [accessed 2021-07-31]
- Intel® 64 and ia-32 architectures software developer's manual, Volume 3B: System Programming Guide, Part 2. Intel Corporation. 2011. URL: https://www.intel.com.tw/ [accessed 2021-03-19]
- Wang H, Chung C, Wang Z, Li S, Chu B, Horng J, et al. A large-scale investigation and identification of methicillin-resistant Staphylococcus aureus based on peaks binning of matrix-assisted laser desorption ionization-time of flight MS spectra. Brief Bioinform 2021 May 20;22(3):bbaa138.
- Wang Z, Wang H, Chung C, Horng J, Lu J, Lee T. Large-scale mass spectrometry data combined with demographics analysis rapidly predicts methicillin resistance in Staphylococcus aureus. Brief Bioinform 2021 Jul 20;22(4):bbaa293.
- Wang H, Hung C, Chen C, Lee T, Huang K, Ning H, et al. Increase Trichomonas vaginalis detection based on urine routine analysis through a machine learning approach. Sci Rep 2019 Aug 19;9(1):11074.
- Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: Synthetic Minority Over-sampling Technique. J Artif Intell Res 2002 Jun 01;16:321-357.
- Manning C, Schütze H, Raghavan P. Introduction to information retrieval. Cambridge, UK: Cambridge University Press; 2008.
- Rashidi HH, Sen S, Palmieri TL, Blackmon T, Wajda J, Tran NK. Early recognition of burn- and trauma-related acute kidney injury: a pilot comparison of machine learning techniques. Sci Rep 2020 Jan 14;10(1):205.
- Flaxman AD, Vahdatpour A, Green S, James SL, Murray CJ, Population Health Metrics Research Consortium (PHMRC). Random forests for verbal autopsy analysis: multisite validation study using clinical diagnostic gold standards. Popul Health Metr 2011 Aug 04;9:29.
- Couronné R, Probst P, Boulesteix A. Random forest versus logistic regression: a large-scale benchmark experiment. BMC Bioinformatics 2018 Jul 17;19(1):270.
- Schapire RE, Freund Y. Boosting: Foundations and Algorithms. Cambridge, MA: MIT Press; Jan 04, 2013:164-166.
- Hecht-Nielsen R. On the algebraic structure of feedforward network weight spaces. In: Eckmiller R, editor. Advanced Neural Computers. North Holland: Elsevier; 1990:129-135.
- Basheer IA, Hajmeer M. Artificial neural networks: fundamentals, computing, design, and application. J Microbiol Methods 2000 Dec 01;43(1):3-31.
- Kriegeskorte N, Golan T. Neural network models and deep learning. Curr Biol 2019 Apr 01;29(7):R231-R236.
- Schmidhuber J. Deep learning in neural networks: an overview. Neural Netw 2015 Jan;61:85-117.
- Widrow B, Lehr M. 30 years of adaptive neural networks: perceptron, Madaline, and backpropagation. Proc IEEE 1990 Sep;78(9):1415-1442.
- Masters T. Signal And Image Processing With Neural Networks: A C++ Sourcebook. Hoboken, NJ: John Wiley & Sons, Inc; 1994.
- Lachtermacher G, Fuller JD. Back propagation in time-series forecasting. J Forecast 1995 Jul;14(4):381-393.
- Jadid MN, Fairbairn DR. Neural-network applications in predicting moment-curvature parameters from experimental data. Eng Appl Artif Intell 1996 Jun;9(3):309-319.
- Han S, Pool J, Tran J, Dally WJ. Learning both weights and connections for efficient neural networks. arXiv. 2015 Oct 30. URL: https://arxiv.org/abs/1506.02626 [accessed 2021-11-08]
- Liu Z, Sun M, Zhou T, Huang G, Darrell T. Rethinking the value of network pruning. arXiv. 2018. URL: https://arxiv.org/abs/1810.05270 [accessed 2021-07-31]
- Hubara I, Courbariaux M, Soudry D, El-Yaniv R, Bengio Y. Quantized neural networks: Training neural networks with low precision weights and activations. J Machine Learn Res 2017;18(1):6869-6898.
- Krzanowski W, Hand D. ROC curves for continuous data. Boca Raton, FL: CRC Press; 2009.
- David H, Gorbatov E, Hanebutte U, Khanna R, Le C. RAPL: Memory power estimation and capping. 2010 Presented at: 2010 ACM/IEEE International Symposium on Low-Power Electronics and Design; August 18-20, 2010; Austin, TX.
- Czarnul P, Proficz J, Krzywaniak A. Energy-aware high-performance computing: survey of state-of-the-art tools, techniques, and environments. Sci Program 2019 Apr 24;2019:8348791.
- Deng F, Huang J, Yuan X, Cheng C, Zhang L. Performance and efficiency of machine learning algorithms for analyzing rectangular biomedical data. Lab Invest 2021 Apr 11;101(4):430-441.
- Caruana R, Niculescu-Mizil A. An empirical comparison of supervised learning algorithms. 2006 Presented at: 23rd International Conference on Machine Learning; June 25-29, 2006; Pittsburgh, PA.
- Wolpert DH. The lack of a priori distinctions between learning algorithms. Neural Comput 1996 Oct;8(7):1341-1390.
- Lee H, Yoon H, Nam K, Cho YJ, Kim TK, Kim WH, et al. Derivation and validation of machine learning approaches to predict acute kidney injury after cardiac surgery. J Clin Med 2018 Oct 03;7(10):322.
- Breiman L. Bagging predictors. Mach Learn 1996 Aug;24(2):123-140.
- Zhang T, Yu B. Boosting with early stopping: convergence and consistency. Ann Statist 2005 Aug 1;33(4):1538-1579.
- McKeown M, Lavrov A, Shahrad M, Jackson P, Fu Y, Balkind J. Power and energy characterization of an open source 25-Core Manycore processor. 2018 Presented at: 2018 IEEE International Symposium on High Performance Computer Architecture; February 24-28, 2018; Vienna, Austria.
- Vasilakis E. An instruction level energy characterization of ARM processors. Technical Report FORTH-ICS/TR-450, March 2015. GreenVM. 2015 Mar. URL: https://projects.ics.forth.gr/carv/greenvm/files/tr450.pdf [accessed 2021-07-31]
- Shoaran M, Haghi BA, Taghavi M, Farivar M, Emami-Neyestanak A. Energy-efficient classification for resource-constrained biomedical applications. IEEE J Emerg Sel Topics Circuits Syst 2018 Dec;8(4):693-707.
- Takhirov Z, Wang J, Louis M, Saligrama V, Joshi A. Field of groves: an energy-efficient random forest. arXiv. 2017. URL: https://arxiv.org/abs/1704.02978 [accessed 2021-07-31]
- Van Essen B, Macaraeg C, Gokhale M, Prenger R. Accelerating a random forest classifier: Multi-Core, GP-GPU, or FPGA? 2012 Presented at: 2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines; April 29-May 1, 2012; Toronto, ON.
- Chen Y, Emer J, Sze V. Eyeriss. SIGARCH Comput Archit News 2016 Oct 12;44(3):367-379.
- Gong Y, Liu L, Yang M, Bourdev L. Compressing deep convolutional networks using vector quantization. arXiv. 2014. URL: https://arxiv.org/abs/1412.6115 [accessed 2021-07-31]
- Zhu B, Shoaran M. Hardware-efficient seizure detection. 2019 Presented at: 53rd Asilomar Conference on Signals, Systems, and Computers; November 3-6, 2019; Pacific Grove, CA.
- Jégou H, Douze M, Schmid C. Product quantization for nearest neighbor search. IEEE Trans Pattern Anal Mach Intell 2011 Jan;33(1):117-128.
Abbreviations
AUROC: area under the receiver operating characteristic curve
CPU: central processing unit
kNN: k-nearest neighbor
LR: logistic regression
ML: machine learning
NN: neural network
NN1: one-hidden-layer neural network
NN5: five-hidden-layer neural network
PNN: pruned five-hidden-layer neural network
QNN: quantized five-hidden-layer neural network
RAPL: Running Average Power Limit
RF: random forest
SVM: support vector machine
XGB: extreme gradient boosting
Edited by R Kukafka; submitted 19.03.21; peer-reviewed by A Chatterjee, J Yang; comments to author 29.04.21; revised version received 31.07.21; accepted 04.10.21; published 25.01.22
Copyright © Jia-Ruei Yu, Chun-Hsien Chen, Tsung-Wei Huang, Jang-Jih Lu, Chia-Ru Chung, Ting-Wei Lin, Min-Hsien Wu, Yi-Ju Tseng, Hsin-Yao Wang. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 25.01.2022.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.