Machine Learning–Based Text Analysis to Predict Severely Injured Patients in Emergency Medical Dispatch: Model Development and Validation

doi:10.2196/30210

Original Paper

¹Department of Emergency Medicine, Taipei Hospital, Ministry of Health and Welfare, New Taipei City, Taiwan

²Department of Civil Engineering, National Taiwan University, Taipei City, Taiwan

³Department of Emergency Medicine, Far Eastern Memorial Hospital, New Taipei City, Taiwan

⁴Emergency Medical Service Division, Taipei City Fire Department, Taipei City, Taiwan

⁵Department of Emergency Medicine, National Taiwan University Hospital, Taipei City, Taiwan

⁶Department of Emergency Medicine, National Taiwan University Hospital, Yun-Lin Branch, Yunlin County, Taiwan

*these authors contributed equally

Corresponding Author:

Albert Y Chen, PhD

Department of Civil Engineering

National Taiwan University

No 1, Section 4, Roosevelt Rd

Taipei City, 106

Taiwan

Phone: 886 2 3366 4255

Email: AlbertChen@ntu.edu.tw

Background: Early recognition of severely injured patients in prehospital settings is of paramount importance for timely treatment and transportation of patients to further treatment facilities. The dispatching accuracy has seldom been addressed in previous studies.

Objective: In this study, we aimed to build a machine learning–based model through text mining of emergency calls for the automated identification of severely injured patients after a road accident.

Methods: Audio recordings of road accidents in Taipei City, Taiwan, in 2018 were obtained and randomly sampled. Calls that were transferred or not conducted in Mandarin were excluded. To predict cases of severe trauma identified on-site by emergency medical technicians, all included cases were evaluated by both humans (6 dispatchers) and a machine learning model, that is, a prehospital-activated major trauma (PAMT) model. The PAMT model was developed using term frequency–inverse document frequency, rule-based classification, and a Bernoulli naïve Bayes classifier. Repeated random subsampling cross-validation was applied to evaluate the robustness of the model. The performance of the dispatchers and the PAMT model in predicting severe cases was compared. Performance was indicated by sensitivity, specificity, positive predictive value, negative predictive value, and accuracy.

Results: Although the mean sensitivity and negative predictive value obtained by the PAMT model were higher than those of the dispatchers, the dispatchers obtained higher mean specificity, positive predictive value, and accuracy. Across certainty levels 0 (lowest certainty) to 6 (highest certainty), the mean accuracy of the PAMT model was higher than that of the dispatchers except at levels 5 and 6. The overall performances of the dispatchers and the PAMT model were similar; however, the PAMT model had higher accuracy in cases where the dispatchers were less certain of their judgments.

Conclusions: A machine learning–based model, called the PAMT model, was developed to predict severe road accident trauma. The results of our study suggest that the accuracy of the PAMT model is not superior to that of the participating dispatchers; however, it may assist dispatchers when they lack confidence while making a judgment.

J Med Internet Res 2022;24(6):e30210


Keywords

emergency medical service; emergency medical dispatch; dispatcher; trauma; machine learning; frequency–inverse document frequency; Bernoulli naïve Bayes

Background

Trauma is a leading cause of accidental death globally. According to the World Health Organization, injuries contribute to >5 million deaths each year. Road traffic accidents accounted for most injuries and were the ninth leading cause of death in 2012 [Injuries and violence: the facts 2014. World Health Organization. 2014. URL: https://apps.who.int/iris/bitstream/handle/10665/149798/9789241508018_eng.pdf [accessed 2022-05-19] 1]. Severe trauma is a time-sensitive emergency condition. Prompt transport is beneficial for patients with neurotrauma and penetrating injuries with unstable hemodynamic features [Harmsen AM, Giannakopoulos GF, Moerbeek PR, Jansma EP, Bonjer HJ, Bloemers FW. The influence of prehospital time on trauma patients outcome: a systematic review. Injury 2015 Apr;46(4):602-609. [CrossRef] [Medline]2]. Delays in transportation are associated with poor functional outcome [Chen CH, Shin SD, Sun JT, Jamaluddin SF, Tanaka H, Song KJ, et al. Association between prehospital time and outcome of trauma patients in 4 Asian countries: a cross-national, multicenter cohort study. PLoS Med 2020 Oct 6;17(10):e1003360 [FREE Full text] [CrossRef] [Medline]3].

Prehospital triage allows severely ill patients to receive appropriate time-sensitive management. For cardiac arrest and stroke victims, dispatchers can obtain critical information on the phone, such as the patient’s level of consciousness, breath patterns, or prehospital stroke scales [Drennan IR, Geri G, Brooks S, Couper K, Hatanaka T, Kudenchuk P, Basic Life Support (BLS), Pediatric Life Support (PLS) and Education, Implementation and Teams (EIT) Taskforces of the International Liaison Committee on Resuscitation (ILCOR), BLS Task Force, Pediatric Task Force, EIT Task Force. Diagnosis of out-of-hospital cardiac arrest by emergency medical dispatch: a diagnostic systematic review. Resuscitation 2021 Feb;159:85-96. [CrossRef] [Medline]4,Zhelev Z, Walker G, Henschke N, Fridhandler J, Yip S. Prehospital stroke scales as screening tools for early identification of stroke and transient ischemic attack. Cochrane Database Syst Rev 2019 Apr 09;4(4):CD011427 [FREE Full text] [CrossRef] [Medline]5]. However, no standardized questions have been designed for dispatchers when they encounter severe trauma. Only a few studies on helicopter emergency medical services have addressed the accuracy of dispatch for trauma victims [Bohm K, Kurland L. The accuracy of medical dispatch - a systematic review. Scand J Trauma Resusc Emerg Med 2018 Nov 09;26(1):94 [FREE Full text] [CrossRef] [Medline]6]. Current trauma scales for predicting severity require either physiological or anatomical assessments [Gianola S, Castellini G, Biffi A, Porcu G, Fabbri A, Ruggieri MP, Italian National Institute of Health guideline working group. Accuracy of pre-hospital triage tools for major trauma: a systematic review with meta-analysis and net clinical benefit. World J Emerg Surg 2021 Jun 10;16(1):31 [FREE Full text] [CrossRef] [Medline]7]. Therefore, a victim’s condition cannot be identified or evaluated until the first batch of emergency medical technicians (EMTs) arrives at the scene.

Motivation

Content analysis has been conducted on emergency calls to discover the factors that affect dispatch and have the potential to assist prehospital triage [Richards CT, Wang B, Markul E, Albarran F, Rottman D, Aggarwal NT, et al. Identifying key words in 9-1-1 calls for stroke: a mixed methods approach. Prehosp Emerg Care 2017;21(6):761-766 [FREE Full text] [CrossRef] [Medline]8,Riou M, Ball S, Williams TA, Whiteside A, O'Halloran KL, Bray J, et al. The linguistic and interactional factors impacting recognition and dispatch in emergency calls for out-of-hospital cardiac arrest: a mixed-method linguistic analysis study protocol. BMJ Open 2017 Jul 09;7(7):e016510 [FREE Full text] [CrossRef] [Medline]9]. Specifically, text classification has demonstrated the effectiveness of classifying events recorded during phone calls [Trujillo A, Orellana M, Acosta MI. Design of emergency call record support system applying natural language processing techniques. In: Proceedings of the 6th Conference on Information and Communication Technologies of Ecuador. 2019 Presented at: TIC.EC '19; November 27-29, 2019; Cuenca City, Ecuador p. 53-65 URL: https://doi.org/10.1007/978-3-030-35740-5_4 [CrossRef]10]. In addition, natural language processing has been used in emergency medicine. Text mining techniques have been used to predict the triage level, length of stay, disposition, and mortality in emergency department patients [Choi SW, Ko T, Hong KJ, Kim KH. Machine learning-based prediction of Korean triage and acuity scale level in emergency department patients. Healthc Inform Res 2019 Oct;25(4):305-312 [FREE Full text] [CrossRef] [Medline]11-Fernandes M, Mendes R, Vieira SM, Leite F, Palos C, Johnson A, et al. Risk of mortality and cardiopulmonary arrest in critical patients presenting to the emergency department using machine learning and natural language processing. PLoS One 2020 Apr 2;15(4):e0230876 [FREE Full text] [CrossRef] [Medline]16]. 
A textual analysis–based machine learning framework was developed to assist dispatchers during the prehospital phase in out-of-hospital cardiac arrest (OHCA) recognition; this framework has been commercialized [AI for Patient Consultations. Corti. URL: https://www.corti.ai/ [accessed 2021-09-12] 17-Blomberg SN, Christensen HC, Lippert F, Ersbøll AK, Torp-Petersen C, Sayre MR, et al. Effect of machine learning on dispatcher recognition of out-of-hospital cardiac arrest during calls to emergency medical services: a randomized clinical trial. JAMA Netw Open 2021 Jan 04;4(1):e2032320 [FREE Full text] [CrossRef] [Medline]20]. These techniques make it possible to stratify the risk to patients when structured questions are unavailable, similar to the assessment of trauma patients over the phone.

The classic process of text classification includes text preprocessing, feature extraction, and classifier construction. Text preprocessing aims to remove noise and effectively retrieve information through text cleaning and organization [Uysal AK, Gunal S. The impact of preprocessing on text classification. Inf Process Manag 2014 Jan;50(1):104-112 [FREE Full text] [CrossRef]21]. Common feature extraction approaches can be loosely divided into two domains: word frequency and semantics [Sparck Jones K. A statistical interpretation of term specificity and its application in retrieval. J Doc 1972 Jan;28(1):11-21 [FREE Full text] [CrossRef]22,Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. In: Proceedings of the 2013 International Conference on Learning Representations. 2013 Presented at: ICLR '13; May 2-4, 2013; Scottsdale, AZ, USA.23]. Machine and deep learning models, such as k-nearest neighbors, decision trees, support vector machines, multilayer perceptron classifiers, and naïve Bayes, are widely used as classifiers [Kamath CN, Bukhari SS, Dengel A. Comparative study between traditional machine learning and deep learning approaches for text classification. In: Proceedings of the 2018 ACM Symposium on Document Engineering. 2018 Presented at: DocEng '18; August 28-31, 2018; Halifax, Canada p. 1-11 URL: https://doi.org/10.1145/3209280.3209526 [CrossRef]24-Li Y, Wang X, Xu P. Chinese text classification model based on deep learning. Future Internet 2018 Nov 20;10(11):113 [FREE Full text] [CrossRef]28].

Aim

We hypothesized that severe trauma cases could be recognized based on the content of communication between callers and call takers during emergency calls. The main research question and objective of this study was to develop a machine learning–based model through text mining of emergency calls to automatically identify severely injured patients in road accidents. We focused on road accidents instead of all trauma cases because they are the major cause of trauma, and compared with other types of injuries, the content of emergency calls for road accidents is homogeneous. As there are no suitable previous studies for comparison, our second objective was to compare the results of the model with 6 participating dispatchers’ judgment.

Study Design and Setting

This paper describes a cross-sectional study on identifying severely injured patients in road accidents by analyzing Mandarin text of emergency calls using machine learning. The results were compared with those of human judgment. We defined severely injured patients as those who fit the major trauma criteria of the EMT trauma triage protocol, that is, prehospital-activated major trauma (PAMT).

Data Acquisition

Data were obtained from the Taipei Trauma Registry, which is a database of trauma accident information from 8 out of 18 hospitals with first aid capabilities. A random sample of one-fourth of the total cases considered as PAMT in 2018 was retrieved. After excluding cases without complete information, 92 PAMT patients (92 of 377 registered cases) were enrolled. As control cases, 3 consecutive non-PAMT road accident calls from the dispatch system were matched with each PAMT call on the same day. If the number of non-PAMT cases available for matching on a given day was insufficient, only 1 or 2 calls were included. A total of 92 PAMT calls and 255 non-PAMT calls were considered in this study. The exclusion criteria were as follows: the caller was not by the side of the victim, the caller did not speak Mandarin, the accident was not vehicle-related, and the calls did not provide sufficient information. The final data for analysis included 114 cases in total, which comprised 42 PAMT and 72 non-PAMT cases (Figure 1).

Ethics Approval

This study was approved by the institutional review board of the National Taiwan University Hospital (case number 201902043RINB).

Model Development

As shown in Figure 2, formal model development comprises four steps: (1) text preprocessing, (2) feature engineering, (3) model classification, and (4) model enhancement, which was conducted to improve model performance.

Text Preprocessing (Step 1)

The purpose of text preprocessing is to organize the data such that useful information can be retrieved. This process includes word segmentation, stop word removal, and synonym grouping (Figure 2). First, each emergency call was manually transcribed into text. The continuous text string was then segmented into words, which were the shortest units of meaning, consisting of at least one character. Segmentation was performed using the Chinese word segmentation system developed by the Institute of Information Science and the Institute of Linguistics of Academia Sinica [Ma WY, Chen KJ. Introduction to CKIP Chinese word segmentation system for the first international Chinese Word Segmentation Bakeoff. In: Proceedings of the 2nd SIGHAN Workshop on Chinese Language Processing. 2003 Presented at: SIGHAN '03; July 11-12, 2003; Sapporo, Japan p. 168-171 URL: https://doi.org/10.3115/1119250.1119276 [CrossRef]29,Li PH, Fu TJ, Ma WY. Why Attention? Analyze BiLSTM deficiency and its remedies in the case of NER. In: Proceedings of the 34th AAAI Conference on Artificial Intelligence. 2020 Presented at: AAAI '20; February 7-12, 2020; New York, NY, USA p. 8236-8244. [CrossRef]30]. To eliminate segmentation errors caused by ambiguous Chinese compound words, a dictionary of special terms with specific weights was manually constructed based on experience and trial and error. The segmentation system uses these weights to force certain words to merge or separate. Subsequently, insignificant words (stop words), such as conjunctions, pronouns, and articles, were removed. Then, synonyms were grouped and treated as the same word, which potentially reduces overfitting of the model to specific words and thus provides a means of bias-variance control. From >27,000 characters in the original 114 texts, approximately 7000 different word meanings were identified.
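The stop word removal and synonym grouping steps can be sketched as follows, assuming the text has already been segmented into tokens. The stop word set, the synonym map, and the English tokens (standing in for Mandarin words) are hypothetical placeholders rather than the lists used in the study, and the segmentation step itself is not reproduced.

```python
# Hypothetical stop words and synonym groups, for illustration only.
STOP_WORDS = {"the", "and", "he", "she", "a"}
SYNONYMS = {
    "scooter": "motorcycle",
    "unresponsive": "unconscious",
    "passed_out": "unconscious",
}


def preprocess(tokens):
    """Drop stop words, then map each remaining token to its synonym group."""
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [SYNONYMS.get(t, t) for t in kept]


segmented = ["he", "fell", "scooter", "and", "unresponsive"]
cleaned = preprocess(segmented)
print(cleaned)
```

Mapping every member of a synonym group to one representative token means that later frequency counts treat the group as a single word.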

Feature Engineering (Step 2)

In feature engineering, the segmented words were transformed into a machine-readable format by feature extraction (step 2.1, Figure 2). As emergency calls are often short, and conversations are urgent, important words are frequently mentioned (Multimedia Appendix 1). Thus, we used term frequency–inverse document frequency (TF-IDF) to weigh each word. The TF-IDF calculation consists of two parts: TF and IDF. TF reflects how frequently a word occurs in a given text, whereas IDF reflects how rare the word is across the entire body of texts. A higher frequency of occurrence of a word in one specific text indicates its importance. In contrast, a higher frequency of occurrence of the word in the entire body of texts lowers its importance. By considering these 2 frequencies simultaneously, we ranked all words by importance to conduct feature selection (step 2.2, Figure 2). The 160 most important of the 7000 words were chosen based on experiments. The selected features were placed in a feature space to reduce the number of dimensions and to make the results more interpretable. The feature space included the selected features used to develop the model.
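The weighting and ranking described above can be illustrated with a minimal sketch. It assumes a common textbook form of TF-IDF (relative term frequency multiplied by the logarithm of the inverse document frequency) and ranks words by their summed score over all texts; the exact formula and ranking rule used in the study may differ, and the example tokens are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tfidf_scores(docs):
    """Return one {word: tf-idf} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each word appears.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            w: (c / len(doc)) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        })
    return scores


def select_features(docs, k):
    """Rank words by summed tf-idf (an assumed rule) and keep the top k."""
    totals = defaultdict(float)
    for doc_scores in tfidf_scores(docs):
        for word, score in doc_scores.items():
            totals[word] += score
    # Ties are broken alphabetically so that the ranking is deterministic.
    ranked = sorted(totals, key=lambda w: (-totals[w], w))
    return ranked[:k]


docs = [
    ["car", "hit", "bleeding", "bleeding"],
    ["car", "hit", "scratch"],
    ["car", "fell", "unconscious"],
]
top_words = select_features(docs, k=2)
print(top_words)
```

In this toy corpus, a word that appears in every text receives an IDF of zero and therefore drops to the bottom of the ranking, which is the behavior the paragraph describes for words that are frequent across the whole body of texts.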

Model Selection (Step 3)

For model selection, we evaluated several commonly used machine learning models for text classification, including k-nearest neighbors, decision tree, support vector machine, multilayer perceptron, multinomial naïve Bayes, and Bernoulli naïve Bayes (BNB). Repeated random subsampling cross-validation (RRS-CV) was conducted 100 times to avoid overfitting and to obtain more stable and reliable classification results. RRS-CV splits samples in a randomized and repeated manner without replacement. The performance of each model was reported as the average of its 100 RRS-CV scores. As shown in Table 1, the BNB-based model achieved the best results. The BNB classifier, which is a supervised learning model, is based on Bayes’ theorem. It assumes that each input variable is independent of the other variables. According to the BNB equation in Multimedia Appendix 2, the calculation concentrates on the binary information of whether a word appears in a document. The Boolean expression of the selected features forms the feature vector for each document. The category estimation of a document depends on the posterior probability of each class k, which is proportional to the product of the likelihood of the document given class k and the prior probability of class k. Each document was assigned to the class with the highest posterior probability (the maximum a posteriori estimate). To avoid a zero-probability situation, Laplace smoothing was applied, with the additive smoothing parameter set to 1. Consequently, no hyperparameter tuning was required for BNB. Compared with other text classification models, the BNB model has the advantages of simplicity, efficient computational speed, and ability to achieve a high level of accuracy without hyperparameter tuning. Furthermore, this model is suitable for processing small-scale data and short texts. The results and hyperparameter tuning of the other models are presented in the supplementary material.
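As an illustration of the classifier described above, the following is a minimal pure-Python sketch of a Bernoulli naïve Bayes model with additive smoothing set to 1. It is not the script from Multimedia Appendix 2; the function names, feature words, and training texts are hypothetical.

```python
import math


def binarize(tokens, features):
    """Boolean feature vector: 1 if the feature word appears in the text."""
    present = set(tokens)
    return [1 if f in present else 0 for f in features]


def fit_bnb(X, y, alpha=1.0):
    """Estimate the log prior and smoothed feature probabilities per class."""
    model = {}
    for k in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == k]
        n_k = len(rows)
        # Laplace smoothing: (count + alpha) / (n_k + 2 * alpha)
        theta = [(sum(col) + alpha) / (n_k + 2 * alpha) for col in zip(*rows)]
        model[k] = (math.log(n_k / len(y)), theta)
    return model


def predict_bnb(model, x):
    """Return the class with the highest log posterior (MAP estimate)."""
    def log_posterior(k):
        log_prior, theta = model[k]
        return log_prior + sum(
            math.log(p) if xi else math.log(1.0 - p)
            for xi, p in zip(x, theta)
        )
    return max(model, key=log_posterior)


features = ["unconscious", "bleeding", "scratch"]
train_texts = [
    ["unconscious", "bleeding"],
    ["unconscious"],
    ["scratch"],
    ["scratch", "bleeding"],
]
train_labels = [1, 1, 0, 0]  # 1 = PAMT, 0 = non-PAMT
X = [binarize(t, features) for t in train_texts]
model = fit_bnb(X, train_labels)
prediction = predict_bnb(model, binarize(["rider", "unconscious"], features))
print(prediction)
```

Because the Bernoulli model also scores the absence of a feature word, a text that lacks words typical of one class is pushed toward the other class, which suits short texts with a small vocabulary.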

Table 1. Comparison of machine learning models.

Model	SENS^a (%)	SPEC^b (%)	PPV^c (%)	NPV^d (%)	ACC^e (%)	Youden index
KNN^f	18.7	89.0	32.6	72.1	67.9	0.077
Decision tree	32.7	76.0	35.9	72.9	63.0	0.087
SVM^g	55.7	74.0	49.3	80.3	68.5	0.297
MNB^h	19.0	96.1	42.2	73.8	73.0	0.151
BNBⁱ	53.0^j	86.7^j	67.0^j	81.6^j	76.6^j	0.397^j
MLP^k	53.7	79.0	55.6	80.6	71.4	0.327

^aSENS: sensitivity.

^bSPEC: specificity.

^cPPV: positive predictive value.

^dNPV: negative predictive value.

^eACC: accuracy.

^fKNN: k-nearest neighbors.

^gSVM: support vector machine.

^hMNB: multinomial naïve Bayes.

ⁱBNB: Bernoulli naïve Bayes.

^jBNB-based model achieved the best ACC and Youden index.

^kMLP: multilayer perceptron.

For the split of training and validation data, we set a fixed ratio of PAMT to non-PAMT cases in the validation data. As shown in Figure 3, as the amount of training data increases relative to the validation data, the training score gradually decreases and the validation score increases. The 2 curves were closest when the training and validation data sizes were 104 and 10, respectively. This convergence indicates that, at this number of training samples, adding more training data does not significantly improve the classification performance. Therefore, for all text classification models, 104 texts were randomly selected as training data and the remaining 10 texts were used as validation data (Figure 2). The training data included 39 PAMT and 65 non-PAMT cases, and the validation data included 3 PAMT and 7 non-PAMT cases. The ground truth of model classification is the on-scene judgment of the EMT, which is presented in the form of binary labels. Figure 4 shows the scalability of the BNB-based model. As the training data increased, the model-fitting time fluctuated moderately around 0.002 seconds but increased substantially when the training data size was >104. In addition, 104 training data points with 10 validation data points had the highest validation score and the third shortest model-fitting time (Figure 5).
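The splitting scheme described above can be sketched as follows: each repetition draws 3 PAMT and 7 non-PAMT cases for validation and uses the remaining 104 cases for training. The function name, the ordering of the labels, and the random seed are assumptions made for illustration.

```python
import random


def rrs_cv_splits(labels, n_val_pos, n_val_neg, n_repeats, seed=0):
    """Yield (train_idx, val_idx) pairs with a fixed class mix in validation."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    for _ in range(n_repeats):
        # Sampling without replacement within each repetition.
        val = set(rng.sample(pos, n_val_pos) + rng.sample(neg, n_val_neg))
        train = [i for i in range(len(labels)) if i not in val]
        yield train, sorted(val)


# 42 PAMT (1) and 72 non-PAMT (0) cases, as in the final data set.
labels = [1] * 42 + [0] * 72
splits = list(rrs_cv_splits(labels, n_val_pos=3, n_val_neg=7, n_repeats=100))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

A model would be fitted on each training split and scored on the matching validation split, and the 100 scores would then be averaged.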

Figure 3. Learning curve of the Bernoulli naïve Bayes.

Figure 4. Scalability of the Bernoulli naïve Bayes.

Figure 5. Performance of the Bernoulli naïve Bayes.

Model Enhancement (Step 4)

To optimize the final performance of our model, we enhanced the BNB-based models using feature addition (step 4.1) and rule-based judgment (step 4.2). In feature addition, we gathered 37 keywords provided by the experts and combined them with the 160 words chosen in step 2.2 to form a new feature space (Figure 2). The experts included 6 participating dispatchers and 2 emergency physicians. After they had listened to the 114 audio recordings, they were asked, “Which keyword in an emergency call indicates whether a patient is a PAMT or non-PAMT patient?” They then provided keywords based on their personal experience. The 37 keywords were expected to expand the important feature set, which may be limited by the small amount of data. The feature space created by the union of 160 and 37 words was used to develop enhanced models. Although important features must be included, their contribution to the classification may be small if their frequencies are not significant. Therefore, a rule-based judgment (step 4.2) was designed to highlight the importance of the 37 suggested keywords. Specifically, any text used in the validation that contained at least 2 of the 37 words provided by the experts was classified as PAMT. Texts that did not fit this rule were further examined by a BNB classifier (Figure 2).
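The rule-based judgment followed by the classifier can be sketched as below. The sketch assumes that "at least 2 of the 37 words" means at least 2 distinct expert keywords; the keyword set is hypothetical, and the fallback is a stand-in for the trained BNB classifier.

```python
def classify_with_rule(tokens, expert_keywords, fallback, min_hits=2):
    """Label a text as PAMT (1) if enough distinct expert keywords appear.

    Texts that do not meet the rule are passed to `fallback`, a callable
    standing in for the trained BNB classifier.
    """
    hits = len(set(tokens) & set(expert_keywords))
    if hits >= min_hits:
        return 1
    return fallback(tokens)


# Hypothetical keywords; the study used 37 keywords provided by experts.
expert_keywords = {"unconscious", "trapped", "bleeding"}


def always_non_pamt(tokens):
    """Placeholder fallback that labels every text as non-PAMT (0)."""
    return 0


by_rule = classify_with_rule(
    ["rider", "unconscious", "bleeding"], expert_keywords, always_non_pamt
)
by_fallback = classify_with_rule(
    ["rider", "bleeding"], expert_keywords, always_non_pamt
)
print(by_rule, by_fallback)
```

Because the rule can only add PAMT labels, it is expected to raise sensitivity at some cost to specificity.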

The enhanced BNB-based model was compared with various derivative models based on combinations of different steps. The 4 derivative versions of the BNB-based model are presented in Table 2. Model A comprised manually selected features and rule-based judgment. Model B was a classical text classification model that included TF-IDF feature extraction and selection with BNB classification. Model C comprised feature engineering steps and manual feature addition with BNB classification. Finally, we named the best version the PAMT model. It comprises steps 1 to 4.2, including text preprocessing, feature engineering, model classification, and both model enhancement approaches.

Table 2. BNB-based models of different combinations of steps.

Model	SENS^c (%)	SPEC^d (%)	PPV^e (%)	NPV^f (%)	ACC^g (%)	Youden index	Step 1^a	Step 2^a	Step 4.1^a	Step 4.2^a	BNB^b classification
Model A	54.7	82.1	56.8	80.9	73.9	0.368	✓		✓	✓	
Model B	53.0	86.7	67.0	81.6	76.6	0.397	✓	✓			✓
Model C	54.0	87.3	67.8	82.1	77.3	0.413	✓	✓	✓		✓
PAMT^h model	68.0	78.0	60.6	85.8	75.0	0.460	✓	✓	✓	✓	✓

^aStep 1, text preprocessing; step 2, term frequency–inverse document frequency feature extraction and selection; step 4.1, manual feature addition; step 4.2, rule-based judgment.

^bBNB: Bernoulli naïve Bayes.

^cSENS: sensitivity.

^dSPEC: specificity.

^ePPV: positive predictive value.

^fNPV: negative predictive value.

^gACC: accuracy.

^hPAMT: prehospital-activated major trauma.

Human Participants

For a reference comparison with the PAMT model, we conducted a survey to collect severe trauma judgments from 6 volunteer dispatchers. They were from the fire departments of Taipei City and New Taipei City (Table 3). The participants were asked to listen to 114 road accident audio clips. As we focused on text analysis, the participants were not allowed to receive any information other than the text. Therefore, the transcripts were converted into computer-synthesized speech using a text-to-speech tool. The audio clips were played randomly in both female and male voices. In this way, the tone, speed, and emotions of the speech were neutralized. While listening to the clips, each participant classified the cases as PAMT or non-PAMT depending on their personal experience and intuition. They also indicated their certainty (certain or uncertain) in each case.

Table 3. Profiles of the participating dispatchers.

Participant	Sex	Age (years), range	Service city	EMT^a experience (years)	Dispatch experience (years)
A	Male	30-39	New Taipei City	13	6
B	Female	40-49	New Taipei City	10	2
C	Male	30-39	New Taipei City	14	1
D	Male	30-39	New Taipei City	10	1
E	Male	30-39	Taipei City	10	4
F	Male	30-39	Taipei City	9	4

^aEMT: emergency medical technician.

Data Analysis

The analysis determined the accuracy, positive predictive value, negative predictive value, sensitivity, and specificity of the PAMT model prediction and average judgments of the participants [Chatterjee A, Gerdes MW, Prinz A, Martinez S. A comparative study to analyze the performance of advanced pattern recognition algorithms for multi-class classification. In: Proceedings of the 2020 Conference on Emerging Technologies in Data Mining and Information Security. 2020 Presented at: IEMIS '20; July 2-4, 2020; Kolkata, India p. 111-124 URL: https://doi.org/10.1007/978-981-15-9774-9_11 [CrossRef]33,Chatterjee A, Gerdes MW, Martinez SG. Identification of risk factors associated with obesity and overweight-a machine learning overview. Sensors (Basel) 2020 May 11;20(9):2734 [FREE Full text] [CrossRef] [Medline]34].

Accuracy refers to the proportion of all cases that were correctly predicted as PAMT or non-PAMT. Positive predictive value and negative predictive value are the proportions of predicted PAMT and predicted non-PAMT cases, respectively, that were truly PAMT and non-PAMT. Sensitivity and specificity represent the ability of a classification system to correctly identify PAMT and non-PAMT cases, respectively. The Youden index was calculated for the different models and is expressed as the sum of sensitivity and specificity minus 1.
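These definitions can be written out as a short sketch. The function name and the example labels are hypothetical, and the sketch assumes that both classes occur among the true labels and among the predictions so that no denominator is zero.

```python
def diagnostic_metrics(y_true, y_pred):
    """Compute the reported metrics; 1 = PAMT, 0 = non-PAMT."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / len(pairs),
        "youden": sensitivity + specificity - 1,
    }


# Hypothetical validation split with 3 PAMT and 7 non-PAMT cases.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
metrics = diagnostic_metrics(y_true, y_pred)
print(metrics)
```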

All 114 cases were categorized into certainty levels from 0 to 6, depending on how many participants regarded a case as certain. For example, a case with certainty level 4 indicated that 4 participants were certain of their judgment, whereas the other two were not. The accuracy was also calculated for different certainty levels.
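The grouping by certainty level can be sketched as follows for a single set of predictions, such as those of the model. The data layout (one list of 6 certainty flags per case) and the example values are assumptions; for the participants, the same grouping would be applied to each participant's predictions and the resulting accuracies averaged.

```python
from collections import defaultdict


def accuracy_by_certainty(certain_flags, correct_flags):
    """Group cases by certainty level and compute accuracy per level.

    certain_flags[i][j] is True if participant j was certain about case i.
    correct_flags[i] is True if the prediction for case i was correct.
    Only certainty levels that occur in the data are returned.
    """
    buckets = defaultdict(list)
    for flags, correct in zip(certain_flags, correct_flags):
        level = sum(flags)  # number of participants who were certain
        buckets[level].append(correct)
    return {lvl: sum(v) / len(v) for lvl, v in sorted(buckets.items())}


# Hypothetical data: 4 cases rated by 6 participants.
certain_flags = [
    [True] * 6,
    [True] * 4 + [False] * 2,
    [True] * 4 + [False] * 2,
    [False] * 6,
]
correct_flags = [True, True, False, True]
per_level = accuracy_by_certainty(certain_flags, correct_flags)
print(per_level)
```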

Data management and statistical analyses were performed using Python (Python Software Foundation) and Excel (Microsoft Corporation).

Sample

In total, 114 patients were included in the final analysis. The transcribed texts ranged from 84 to 652 characters, with a mean of 241.4 (SD 106.7) characters; the mean character count of PAMT cases was greater than that of non-PAMT cases (266, SD 102 vs 227, SD 107). The transcribed computer-synthesized audio ranged from 24 to 145 seconds in length, with a mean of 58.9 (SD 24.5) seconds, and the mean call length of PAMT cases was longer than that of non-PAMT (64, SD 24 vs 54, SD 24 seconds) cases (Multimedia Appendix 4).

Outcome Data

In this study, the machine learning model was trained on a random sample of 104 cases and validated on the remaining 10 cases. RRS-CV was conducted 100 times to obtain less biased validation results; however, no external data were used to test the performance of the trained models. According to Table 1, BNB outperformed the other models because it had the highest overall metrics: accuracy (76.6%) and Youden index (0.397). The mean sensitivity, specificity, positive predictive value, and negative predictive value for BNB were 53.0%, 86.7%, 67.0%, and 81.6%, respectively. As there was still room for improvement, model enhancement was performed based on BNB to increase the performance. The enhanced BNB-based model, known as the PAMT model, exhibited the best performance. Its Youden index was 0.460, and it achieved a mean sensitivity, specificity, positive predictive value, negative predictive value, and accuracy of 68.0%, 78.0%, 60.6%, 85.8%, and 75.0%, respectively (Table 2). The performance of model C, which was enhanced only by adding the features provided by the experts, was ranked after the PAMT model. The mean sensitivity, specificity, positive predictive value, negative predictive value, accuracy, and Youden index of model C were 54.0%, 87.3%, 67.8%, 82.1%, 77.3%, and 0.413, respectively. Model A contained only the features provided by the experts and was classified based on rule-based judgment. It achieved the worst results (sensitivity 54.7%; specificity 82.1%; positive predictive value 56.8%; negative predictive value 80.9%; accuracy 73.9%; Youden index 0.368).

By comparison, the mean sensitivity, specificity, positive predictive value, negative predictive value, and accuracy of the 6 participants were 63.1%, 85.0%, 71.7%, 80.3%, and 76.8%, respectively (Multimedia Appendix 5). The PAMT model, which had the best performance among the models, had a higher sensitivity and negative predictive value but a lower specificity, positive predictive value, and accuracy than the participants. Overall, the PAMT model did not surpass the performance of the participating dispatchers (Figure 6).

Figure 6. Overall performance of participating dispatchers versus prehospital-activated major trauma (PAMT) model. ACC: accuracy; NPV: negative predictive value; PAMT: prehospital activated major trauma; PPV: positive predictive value; SENS: sensitivity; SPEC: specificity.

In the subgroup analysis, as shown in Figure 7, the mean accuracy of the participants at certainty levels from 0 to 6 was 66.7%, 64.3%, 68.2%, 76.4%, 56.9%, 79.8%, and 87.1%. The mean accuracy of the PAMT model at certainty levels from 0 to 6 was 83.3%, 70.4%, 72.7%, 91.7%, 58.3%, 64.3%, and 81.3%. After all cases were categorized based on different certainty levels, the accuracy of the participants for levels 0 to 6 generally increased, except for level 4, whereas the accuracy of the PAMT model did not show such a pattern. The results of the PAMT model did not display a clear trend; that is, they were not affected by the certainty level because the BNB model classified cases according to the feature distribution. If we define levels 5 and 6 as certain cases and levels 0 to 4 as uncertain cases, we can observe that, although the accuracy of the PAMT model was lower than that of the participants in certain cases (77.52% vs 85.48%), it was greater than the accuracy of the participants in uncertain cases (73.57% vs 66.34%; Figure 7).

Figure 7. Accuracy of predicting prehospital-activated major trauma (PAMT) by participating dispatchers and PAMT model over different certainty levels.

Principal Findings

Our study makes 3 major contributions to the field. First, this study is the first to use a machine learning–based model to identify severely injured patients during the dispatch phase. Second, the overall performance of the model was similar to that of human dispatchers (Figure 6). Third, the model produced favorable results for cases in which dispatchers were uncertain (Figure 7).

With no suitable previous studies as a reference, we enrolled 6 volunteer dispatchers in our study. Their judgment was regarded as a reference for comparison with the models. Although such a small sample size cannot represent all dispatchers, we were still able to observe heterogeneity in human performance. As shown in Multimedia Appendix 5, three participants (A, B, and E) had a high specificity and low sensitivity, whereas the other three (C, D, and F) had more balanced figures between specificity and sensitivity. We can speculate that different experiences may affect judgment, and that the policy each participant chose, either aggressive or conservative, also made a difference. With the assistance of the proposed model, which is more stable and adjustable, it is possible to narrow the range of human discrepancies and decrease the uncertainty.

The proposed machine learning models are text classification models. As important words were repeatedly mentioned in often short and intermittent emergency calls (