Published in Vol 26 (2024)

Preprints (earlier versions) of this paper are available at https://preprints.jmir.org/preprint/51706.
A 3D and Explainable Artificial Intelligence Model for Evaluation of Chronic Otitis Media Based on Temporal Bone Computed Tomography: Model Development, Validation, and Clinical Application


Original Paper

1ENT Institute and Department of Otorhinolaryngology, Eye & ENT Hospital, Fudan University, Shanghai, China

2NHC Key Laboratory of Hearing Medicine Research, Eye & ENT Hospital, Fudan University, Shanghai, China

3Department of Otolaryngology—Head and Neck Surgery, Vanderbilt University Medical Center, Nashville, TN, United States

4Department of Otorhinolaryngology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China

5State Key Laboratory of Digital Manufacturing Equipment and Technology, School of Mechanical Science and Engineering, Huazhong University of Science and Technology, Wuhan, China

6Department of Electrical and Computer Engineering, Vanderbilt University, Nashville, TN, United States

7Department of Radiology, Eye & ENT Hospital, Fudan University, Shanghai, China

8Department of Radiology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China

*these authors contributed equally

Corresponding Author:

Yike Li, MD, PhD

Department of Otolaryngology—Head and Neck Surgery

Vanderbilt University Medical Center

1215 Medical Center Drive

Nashville, TN, 37232

United States

Phone: 1 6153438146

Email: yike.li.1@vumc.org


Background: Temporal bone computed tomography (CT) helps diagnose chronic otitis media (COM). However, its interpretation requires training and expertise. Artificial intelligence (AI) can help clinicians evaluate COM through CT scans, but existing models lack transparency and may not fully leverage multidimensional diagnostic information.

Objective: We aimed to develop an explainable AI system based on 3D convolutional neural networks (CNNs) for automatic CT-based evaluation of COM.

Methods: Temporal bone CT scans were retrospectively obtained from patients who underwent surgery for COM between December 2015 and July 2021 at 2 independent institutes. A region of interest encompassing the middle ear was automatically segmented, and 3D CNNs were subsequently trained to identify pathological ears and cholesteatoma. An ablation study was performed to refine model architecture. Benchmark tests were conducted against a baseline 2D model and 7 clinical experts. Model performance was measured through cross-validation and external validation. Heat maps, generated using Gradient-Weighted Class Activation Mapping, were used to highlight critical decision-making regions. Finally, the AI system was assessed with a prospective cohort to aid clinicians in preoperative COM assessment.

Results: Internal and external data sets contained 1661 and 108 patients (3153 and 211 eligible ears), respectively. The 3D model exhibited decent performance with mean areas under the receiver operating characteristic curves of 0.96 (SD 0.01) and 0.93 (SD 0.01), and mean accuracies of 0.878 (SD 0.017) and 0.843 (SD 0.015), respectively, for detecting pathological ears on the 2 data sets. Similar outcomes were observed for cholesteatoma identification (mean area under the receiver operating characteristic curve 0.85, SD 0.03 and 0.83, SD 0.05; mean accuracies 0.783, SD 0.04 and 0.813, SD 0.033, respectively). The proposed 3D model achieved a commendable balance between performance and network size relative to alternative models. It significantly outperformed the 2D approach in detecting COM (P≤.05) and exhibited a substantial gain in identifying cholesteatoma (P<.001). The model also demonstrated superior diagnostic capabilities over resident fellows and the attending otologist (P<.05), rivaling all senior clinicians in both tasks. The generated heat maps properly highlighted the middle ear and mastoid regions, aligning with human knowledge in interpreting temporal bone CT. The resulting AI system achieved an accuracy of 81.8% in generating preoperative diagnoses for 121 patients and contributed to clinical decision-making in 90.1% of cases.

Conclusions: We present a 3D CNN model trained to detect pathological changes and identify cholesteatoma via temporal bone CT scans. In both tasks, this model significantly outperforms the baseline 2D approach, achieving levels comparable with or surpassing those of human experts. The model also exhibits decent generalizability and enhanced comprehensibility. This AI system facilitates automatic COM assessment and shows promising viability in real-world clinical settings. These findings underscore AI’s potential as a valuable aid for clinicians in COM evaluation.

Trial Registration: Chinese Clinical Trial Registry ChiCTR2000036300; https://www.chictr.org.cn/showprojEN.html?proj=58685

J Med Internet Res 2024;26:e51706

doi:10.2196/51706

Chronic otitis media (COM) represents a recurrent inflammatory condition inside the tympanic cavity [1]. COM encompasses various forms, including chronic suppurative otitis media (CSOM) and cholesteatoma, each with unique histological characteristics. CSOM involves the accumulation and discharge of purulent fluid, affecting an estimated 330 million people worldwide, with approximately half experiencing hearing loss [2]. Cholesteatoma is characterized by the buildup of keratinized squamous epithelium, which has the potential to erode auditory structures and exhibits a notable tendency for relapse. Accurate identification and differentiation of COM types are crucial for effective disease management and surgical planning [3]. Mastoidectomy, which involves the removal of part of the temporal bone, is the conventional surgical approach for COM. However, less invasive techniques such as endoscopic tympanoplasty are gaining favor for treating CSOM and other noncholesteatoma conditions due to their potential for reduced structural damage and faster recovery [4-9].

Temporal bone computed tomography (CT) is vital for assessing COM and aiding in surgical planning, especially when initial otoscopic examinations have restricted views and yield inconclusive findings [10]. Offering a cost-effective alternative to magnetic resonance imaging (MRI), CT is instrumental in distinguishing cholesteatoma from CSOM by detecting osseous erosion in the tympanum. Although studies have shown that clinicians are capable of diagnosing COM based on CT alone [11-17], distinguishing between COM subtypes poses greater challenges to the human eye. Moreover, interpreting temporal bone CT scans requires specialized training and experience, which may not be universally available across otolaryngologists.

Artificial intelligence (AI) is making remarkable advancements in health care. Deep learning (DL) models, particularly convolutional neural networks (CNNs), have demonstrated enhanced efficiency and reduced errors in disease diagnoses and prediction of clinical outcomes [18-21]. While a few recent papers have reported CNN models in evaluating COM with accuracy scores ranging from 0.77 to 0.85, these studies primarily relied on otoscopic or single-layer CT scans [22,23]. These 2D representations may not be optimal for revealing pathological changes in concealed or peripheral anatomical structures, such as the attic space and the mastoid air cells. In addition, the inherent “black box” nature of DL models, where decision-making strategies are challenging to understand, has been a common criticism [24,25]. This lack of comprehensibility hinders the widespread adoption of AI models in clinical practice.

In light of these challenges, this study aimed to create an explainable, 3D CNN model for the automatic interpretation of temporal bone CT scans. The model was designed to pinpoint the region of interest (ROI) and identify pathological and cholesteatomatous conditions in a 3D fashion. Comprehensive benchmarks against baseline methods and human experts on distinct data sets were conducted to demonstrate the robustness and generalizability of this model. In addition, heat map generation was used to highlight potential pathological changes in CT scans and elucidate the model’s rationale for making predictions. These features were integrated into an AI system for the automatic, end-to-end evaluation of COM, which was subsequently assessed in clinical settings. The overarching goal of this system is to support clinicians in making informed decisions for common otologic conditions, thereby enhancing efficiency, reliability, and transparency.


Ethical Considerations

This study was conducted in accordance with the principles of the Declaration of Helsinki. Ethical approval was granted by the institutional review boards at Vanderbilt University Medical Center (191804) and the Eye, Ear, Nose and Throat (EENT) Hospital of Fudan University (2019076). Informed consent was waived as all data were de-identified. The observational study, which aimed to assess the model’s viability in aiding preoperative assessment, was registered with the Chinese Clinical Trial Registry (ChiCTR2000036300). No compensation was provided to any study participants.

Participants

Data were retrospectively obtained from patients admitted for middle ear surgeries from December 2015 to July 2021 at EENT Hospital. Patients diagnosed with acute otitis media, any inner or external ear diseases, or those with missing temporal bone CT scans were excluded, resulting in 1661 patients eligible for model development. An extra data set containing 108 patients with COM was collected from Wuhan Union (WU) Hospital for external validation (Figure 1).

Figure 1. Flowchart of data retrieval. CT: computed tomography; EAC: external auditory canal; EENT: Eye, Ear, Nose, and Throat Hospital of Fudan University; TM: tympanic membrane; WU: Wuhan Union Hospital.

Temporal Bone CT Scans

As part of the routine preoperative assessment, each patient underwent at least 1 temporal bone CT, conducted from the lower margin of the external auditory meatus to the top margin of the petrous bone using a SOMATOM Sensation 10 CT scanner (Siemens Inc) at the EENT Hospital. The scanning parameters were as follows: matrix (512 × 512), field of view (220 mm × 220 mm), tube voltage (140 kV), tube current (100 mAs), section thickness (0.6-0.75 mm), window width (4000 HU), and window level (700 HU). CT scans from the WU Hospital were obtained using a SOMATOM Plus 4 model (Siemens Inc) with different settings for field of view (100 mm), voltage (120 kV), and thickness (0.75 mm). All images were saved in the DICOM format.
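For illustration, the following minimal sketch shows how such a DICOM series could be loaded and windowed in Python before further processing. The use of pydicom, the file paths, and the conversion details are assumptions for demonstration, not the authors' released preprocessing code.

```python
# Illustrative sketch only: load a temporal bone CT series and apply the bone
# window described above (width 4000 HU, level 700 HU). Paths are hypothetical.
import numpy as np
import pydicom

def load_series(paths):
    """Read DICOM slices, sort by axial position, and convert to Hounsfield units."""
    slices = sorted((pydicom.dcmread(p) for p in paths),
                    key=lambda s: float(s.ImagePositionPatient[2]))
    volume = np.stack([s.pixel_array.astype(np.float32) for s in slices])
    # Raw pixel values are mapped to HU using the DICOM rescale tags
    return volume * float(slices[0].RescaleSlope) + float(slices[0].RescaleIntercept)

def apply_window(volume_hu, level=700.0, width=4000.0):
    """Clip intensities to the window and scale to [0, 1] for model input."""
    lo, hi = level - width / 2, level + width / 2
    return (np.clip(volume_hu, lo, hi) - lo) / (hi - lo)
```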

Label Assignment

All eligible ears were treated as independent cases and assigned ground truth labels based on their diagnoses (Table 1). Each label was verified according to intraoperative findings and pathology reports for operated ears and using a combination of history, ear examination, audiogram results, and imaging findings for unoperated ears. In cases of unoperated ears, a “normal” label was assigned when there was an absence of ear symptoms, hearing loss, or signs of inflammation. A diagnosis of CSOM was assigned when chronic purulent discharge, conductive hearing loss, and the presence of a perforated tympanic membrane or soft tissue shadow in the tympanic cavity were observed. Cholesteatoma was considered if keratin debris was identified, or if there were signs of osseous damage along with retraction or perforation of the pars flaccida [22]. Two otolaryngology residents with full access to patients’ medical records independently reviewed these labels as unblinded annotators. Any discrepancies were addressed with senior specialists until a consensus was reached. All data were deidentified and stored on password-protected computers.

Table 1. Summary of patient characteristics and label assignment.
| Characteristic | EENTa data set (N=1661; number of ears=3153) | WUb data set (N=108; number of ears=211) |
|---|---|---|
| Patient age (years), mean (SD) | 41.1 (16.6) | 39.8 (14.0) |
| Patient sex, n (%) |  |  |
| Male | 832 (50.1) | 49 (45.4) |
| Female | 829 (49.9) | 59 (54.6) |
| Diagnosis per ear, n (%) |  |  |
| Normal | 1130 (35.8) | 101 (47.9) |
| Cholesteatoma | 728 (23.1) | 30 (14.2) |
| CSOMc | 1011 (32.1) | 69 (32.7) |
| Tympanosclerosis | 142 (4.5) | 2 (0.9) |
| Cholesterol granuloma | 72 (2.3) | 1 (0.5) |
| OMEd | 41 (1.3) | 7 (3.3) |
| Adhesive otitis media | 29 (0.9) | 1 (0.5) |
| Task 1 labels, n (%) |  |  |
| Normal | 1130 (35.8) | 101 (47.9) |
| Pathological | 2023 (64.2) | 110 (52.1) |
| Task 2 labels, n (%) |  |  |
| Cholesteatoma | 728 (36.7) | 28 (26.4) |
| Noncholesteatoma | 1258 (63.3) | 78 (73.6) |

aEENT: Eye, Ear, Nose, and Throat Hospital of Fudan University.

bWU: Wuhan Union Hospital.

cCSOM: chronic suppurative otitis media.

dOME: otitis media with effusion.

Model Architecture

The framework consists of 2 functionally distinct units: a region proposal network for 3D segmentation of the ROI and a classification network for generating predictions. Both networks are established based on CNN models.

Region Proposal Network

This network is designed to extract the middle ear on each side from a full set of temporal bone CT scans (Figure 2A). It contains a YOLO (You Only Look Once; v5) model that is trained to detect and locate 2 auditory structures, namely the internal auditory canal and the horizontal semicircular canal, in a series of 2D axial CT scans [26]. These landmarks, positioned at or around the central level of the middle ear, possess unique graphical appearances recognizable by the object detection model. In our recent study, this model demonstrated a 100% success rate in identifying the middle ear region from temporal bone CT scans [22]. Subsequently, a 3D data matrix (150 × 150 × 32) of the ROI is extracted based on the center coordinates of these 2 structures on each side.
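As a rough sketch of this cropping step (with illustrative variable names; the released code may differ), the ROI extraction given the detected landmark centers could look as follows, assuming the volume is indexed as slices × rows × columns:

```python
# Minimal sketch: crop a 150 x 150 x 32 ROI around the detected middle ear
# center. The helper and its arguments are illustrative, not the authors' code.
import numpy as np

def extract_roi(volume, center_row, center_col, center_slice,
                size_xy=150, depth=32):
    """Crop a block centered on the middle ear of one side."""
    z0 = int(center_slice) - depth // 2
    r0 = int(center_row) - size_xy // 2
    c0 = int(center_col) - size_xy // 2
    roi = volume[z0:z0 + depth, r0:r0 + size_xy, c0:c0 + size_xy]
    assert roi.shape == (depth, size_xy, size_xy), "ROI crosses the scan border"
    # Reorder to rows x columns x slices to match the 150 x 150 x 32 input
    return np.transpose(roi, (1, 2, 0))
```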

Figure 2. An overview of the AI framework. (A) The region proposal network used to locate landmark structures and segment the 3D ROI from the original CT scans. (B) The classification network based on a 3D convolutional neural network architecture and trained to perform 2 classification tasks. (C) The gradient heat maps generated to highlight the critical regions for decision-making. Conv: convolution; CT: computed tomographic; FC: fully connected; MP: max pooling; ReLU: rectified linear unit; ROI: region of interest.
Classification Network

A 3D CNN model is built to interpret the extracted ROI and classify different types of conditions (Figure 2B). This model features 4 convolution blocks and 2 dense blocks (Table 2). Each convolution block consists of a 3D convolutional layer to summarize graphical features along all axes of the input image, followed by a max-pooling layer for downsampling these features and another layer for batch normalization. These high-level features are then pooled and passed to the fully connected layers of the dense blocks, where the diagnosis is predicted based on the calculated probability of each class by a softmax function. A dropout layer is applied to prevent overfitting [27].

Table 2. Architecture of the 3D convolutional neural network model.
| Block and kernel inputs | Settings |
|---|---|
| Convolution 1 |  |
| Conv3Da | (3,3,3,64) |
| MaxPooling3Db | (2,2,2) |
| BatchNormalizationc |  |
| Convolution 2 |  |
| Conv3D | (3,3,3,64) |
| MaxPooling3D | (2,2,2) |
| BatchNormalization |  |
| Convolution 3 |  |
| Conv3D | (3,3,3,128) |
| MaxPooling3D | (2,2,2) |
| BatchNormalization |  |
| Convolution 4 |  |
| Conv3D | (3,3,3,256) |
| MaxPooling3D | (2,2,2) |
| BatchNormalization |  |
| GlobalAveragePooling3Dd |  |
| Dense 1 |  |
| Fully connected | 64 |
| Dropout | 0.3 |
| Output |  |
| Fully connected | 2 |

aConv3D: 3D convolutional layer.

bMaxPooling3D: 3D max pooling layer.

cBatchNormalization: batch normalization layer.

dGlobalAveragePooling3D: layer performing global average pooling for 3D data.
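A minimal Keras sketch of the Table 2 network is shown below. The kernel padding, activation functions, and single-channel input shape are assumptions not specified in the table, so details may differ from the released implementation.

```python
# Illustrative sketch of the 3D classification network in Table 2.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_classifier(input_shape=(150, 150, 32, 1), n_classes=2):
    inputs = layers.Input(shape=input_shape)
    x = inputs
    # Four convolution blocks: Conv3D -> MaxPooling3D -> BatchNormalization
    for filters in (64, 64, 128, 256):
        x = layers.Conv3D(filters, kernel_size=3, padding="same",
                          activation="relu")(x)
        x = layers.MaxPooling3D(pool_size=2)(x)
        x = layers.BatchNormalization()(x)
    x = layers.GlobalAveragePooling3D()(x)
    x = layers.Dense(64, activation="relu")(x)
    x = layers.Dropout(0.3)(x)  # regularization against overfitting
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)

model = build_classifier()
model.summary()
```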

Model Training and Testing

Task 1—Detection of COM

The first classification model was trained in a binary task distinguishing between normal and pathological ears in all cases (n=3153). The training and testing procedures involved 5-fold cross-validation on the internal (EENT) data set. Specifically, the data set was evenly partitioned into 5 nonoverlapping subsets in a random, stratified fashion. In each iteration, 1 subset was reserved for testing (n=631), while the remaining 4 were used for training (n=2522). Model performance metrics were averaged over 5 iterations of this process. During each training session, a random 20% of training images (n=504) were allocated for validation. Training was set for 1000 epochs with an initial learning rate of 0.0001, and the Adam optimizer was used to dynamically adjust the algorithm’s learning capability and minimize errors [28]. Early termination was implemented if no further decrease in validation loss was observed for 10 consecutive epochs. These hyperparameters were determined based on the resultant model performance and training efficiency shown in a preliminary study. The trained model was also evaluated on the external data set (n=211) in each round.
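The following sketch outlines this procedure under stated assumptions: the batch size, random seed, and restoration of the best weights are illustrative choices not given in the text, and build_classifier refers to the sketch above.

```python
# Illustrative sketch of 5-fold stratified cross-validation with Adam (1e-4)
# and early stopping (patience of 10 epochs on validation loss).
import numpy as np
import tensorflow as tf
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def cross_validate(X, y, build_fn, n_splits=5):
    aurocs = []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(X, y):
        model = build_fn()
        model.compile(
            optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
        early_stop = tf.keras.callbacks.EarlyStopping(
            monitor="val_loss", patience=10, restore_best_weights=True)
        model.fit(X[train_idx], y[train_idx],
                  validation_split=0.2,  # hold out 20% of training data
                  epochs=1000, batch_size=8,
                  callbacks=[early_stop], verbose=0)
        prob = model.predict(X[test_idx], verbose=0)[:, 1]  # P(pathological)
        aurocs.append(roc_auc_score(y[test_idx], prob))
    return float(np.mean(aurocs)), float(np.std(aurocs))
```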

Task 2—Identification of Cholesteatoma

The second classification model was trained to specifically identify cholesteatoma on selected CT scans that displayed signs of inflammation in the middle ears. This task was designed to simulate a common clinical scenario where clinicians need to differentiate cholesteatoma from other types of COM in patients with positive imaging findings. The aim was to provide a preoperative assessment of the risk of cholesteatoma, assisting clinicians in surgical planning [3,29]. For this task, a subset of CT scans with visible soft tissue density or increased opacification in the middle ear or mastoid was selected from both the internal (n=1986) and external sets (n=106). The remaining methods, including extraction of ROI, network architecture, and the training and testing procedures, were consistent with those used in the first task.

Ablation Study

To refine model selection and gain a better understanding of the network’s behavior, an ablation study was performed to compare the proposed classification network with 3 alternative models, each incorporating modifications to certain features. Specifically, the number of convolutional blocks was decreased and increased by 1 in alternative model 1 and model 2, respectively, and a different size of filter was applied in model 3 (Tables S1-S3 in Multimedia Appendix 1). To ensure adequate statistical power for detecting differences across models, experiments were conducted on the main data set using the same methodology as outlined in the preceding sections.

Benchmarking Against the 2D Approach

To investigate whether the use of 3D CT scans may enhance diagnostic performance, a benchmark study was designed to compare the proposed system with a baseline model using 2D images. This baseline model, previously established by our team, uses transfer learning on a pretrained Inception-V3 (Google LLC) model [22]. In this study, the base model of Inception-V3 was retained, and the final classification layer was customized with a binary output. Training and validation were conducted in the same manner as the 3D model, except that only a single CT scan at the central layer of the ROI was used as the input for the 2D model. All image-preprocessing techniques and hyperparameter settings remained consistent with those outlined in the previous study [22].
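A minimal sketch of such a 2D transfer-learning baseline is given below. The input resolution and the tiling of the single-channel CT slice across 3 channels are common-practice assumptions rather than details taken from the original study.

```python
# Illustrative sketch of the 2D baseline: a pretrained Inception-V3 backbone
# with a customized binary classification head.
import tensorflow as tf
from tensorflow.keras import layers, models

base = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet", input_shape=(299, 299, 3))
x = layers.GlobalAveragePooling2D()(base.output)
outputs = layers.Dense(2, activation="softmax")(x)  # customized binary head
model_2d = models.Model(base.input, outputs)
# A grayscale CT slice can be tiled to 3 channels, eg, np.repeat(img, 3, axis=-1)
```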

Benchmarking Against Human Experts

Another benchmark test was performed against human experts to provide an additional unbiased evaluation of the proposed system. Seven human specialists with a broad range of qualifications were recruited to perform both tasks based on the same image data. The participants included 2 senior otologists, each with 12 years of clinical experience, 1 senior head and neck radiologist with 21 years of experience, 1 attending otologist with 7 years of experience, and 3 otolaryngology residents with 3, 3, and 2 years of experience, respectively. Each expert was provided only with the CT scans and instructed to make a task-specific diagnosis for each ear (task 1: normal or pathological; task 2: cholesteatoma or noncholesteatoma). The test data for clinicians comprised a random selection of 244 ears from the EENT set and all eligible ears from the WU set. To assess intrarater reliability, a random replication of 10% of test cases (n=48) was mixed with these data. All test cases (N=502) had not been previously seen by any experts. They were anonymized, shuffled, and stored on a password-protected computer along with spreadsheets to record each expert’s diagnoses for these cases.

Generation of Heat Maps

Gradient-Weighted Class Activation Mapping was used to visualize the model’s rationale for decision-making (Figure 2C). In essence, this approach leverages the gradients of the target class flowing into the final convolutional layer to produce a coarse localization heat map, highlighting the critical regions in the image [30]. In this study, heat maps were generated in a 3D fashion and rescaled to match the original images using TensorFlow 2.11 in Python 3.9.1 (Python Core Team) [31].
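A minimal 3D adaptation of this technique might look as follows. The last_conv argument names the final convolutional layer and is an assumption; in practice it would be read from model.summary().

```python
# Illustrative sketch of Grad-CAM for the 3D classifier, following [30].
import numpy as np
import tensorflow as tf

def grad_cam_3d(model, volume, class_index, last_conv):
    """Coarse 3D localization map for one class."""
    grad_model = tf.keras.models.Model(
        model.inputs, [model.get_layer(last_conv).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(volume[np.newaxis, ...])
        score = preds[:, class_index]
    grads = tape.gradient(score, conv_out)
    # Global-average the gradients to obtain one weight per feature map
    weights = tf.reduce_mean(grads, axis=(1, 2, 3))
    cam = tf.nn.relu(tf.einsum("bxyzc,bc->bxyz", conv_out, weights))[0]
    # Normalize to [0, 1]; the map is then upsampled to the CT resolution
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()
```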

Clinical Applications

The validated model was integrated into a Python program, enabling the automated assessment of COM from raw CT inputs to the generation of explainable diagnoses in an end-to-end fashion (see the section “Data Availability Statements” and Multimedia Appendix 2). To evaluate its viability in assisting otologists in clinical settings, this system was used with a prospective cohort of patients undergoing middle ear surgeries at the EENT Hospital from November 2023 to January 2024 in a single-arm observational study. Preoperative model predictions, along with routine assessments, were provided to 2 senior otologists, who were given autonomy to determine surgical strategies based on their discretion. Surgeons were surveyed regarding the use of model-generated information in their decision-making processes for these cases. Model predictions were used to analyze the selection of surgical approaches and to measure model performance against pathological findings. Hearing gain was assessed by comparing the air conduction threshold at 2 weeks postoperatively with the baseline.

Statistical Analysis

Descriptive statistics were applied as appropriate. The overall predictability of a model was evaluated by the area under the receiver operating characteristic (AUROC) curve. The optimal cutoff threshold on the curve was determined at the point with minimal distance to the upper left corner on the validation set and subsequently applied to the test set. The numbers of correctly and incorrectly classified cases were displayed in a confusion matrix, and these were used to calculate the performance metrics, including accuracy, recall, specificity, precision, and F1-score. These metrics offer comprehensive insights into the model’s performance, covering overall correctness in identifying both positives and negatives (accuracy), sensitivity in detecting positive cases (recall), capacity to rule in disease (specificity), propensity for preventing false alarms (precision), and effectiveness in identifying positive cases while minimizing false positives and false negatives (F1-score). They were derived as shown in Textbox 1. Results are averaged over 5 iterations of cross-validation or external validation and presented as mean (SD). Intrarater consistency was evaluated using Cohen kappa. Significance was determined through pairwise 2-tailed t tests for differences in performance between models and via 1-way analysis of variance between the proposed model and human experts. The alpha level was set at .05. Statistical analyses were conducted using Python 3.9.1 [31].

Textbox 1. The calculation of performance metrics.

Accuracy = (True positive + True negative)/Total sample size

Recall = True positive/(True positive + False negative)

Specificity = True negative/(True negative + False positive)

Precision = True positive/(True positive + False positive)

F1-score = 2 × True positive/(2 × True positive + False positive + False negative)
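As a minimal sketch of this evaluation pipeline (the use of scikit-learn is an assumption; the released code may implement it differently):

```python
# Illustrative sketch: pick the cutoff on the validation ROC curve closest to
# the upper left corner, then derive the Textbox 1 metrics on the test set.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

def optimal_threshold(y_val, p_val):
    """Cutoff with minimal Euclidean distance to the (FPR 0, TPR 1) corner."""
    fpr, tpr, thresholds = roc_curve(y_val, p_val)
    return thresholds[np.argmin(np.sqrt(fpr ** 2 + (1 - tpr) ** 2))]

def evaluate(y_test, p_test, threshold):
    tn, fp, fn, tp = confusion_matrix(y_test, p_test >= threshold).ravel()
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
        "auroc": roc_auc_score(y_test, p_test),
    }
```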


ROI Extraction

The region proposal network successfully extracted the 3D ROI containing the critical anatomies on each side, including the tympanic cavity and sinus tympani (Figure 3). This has been confirmed by manual inspection of the generated images in all cases from both data sets.

Figure 3. Generation of the 3D ROI. The region proposal network identifies landmark structures in each of the full-sized sequential CT slices and determines the center of the middle ear on each side. A 3D image comprising 32 stacks of axial slices in 150 × 150 pixels is subsequently segmented. This ROI encompasses an extensive range of critical anatomies within the temporal bone for the evaluation of COM. CT: computed tomographic; HSC: horizontal semicircular canal; IAC: internal auditory canal; ROI: region of interest.

Task 1

Our model exhibited decent performance in identifying pathological changes in the middle ear, achieving a mean accuracy of 87.8%, recall of 85.3%, specificity of 91.3%, and precision of 93.3% on the internal data set (Table 3). It also demonstrated a near-perfect AUROC score of 0.96. These performance metrics remained generally consistent on the external data set, with a comparable AUROC score of 0.93, indicating reasonable generalizability (Figure 4).

Table 3. Performance of the baseline 2D and the proposed 3D models.


| Task | Model | Size (MB) | Data set | Accuracy, mean (SD) | Recall, mean (SD) | Specificity, mean (SD) | Precision, mean (SD) | F1-score, mean (SD) | AUROCa, mean (SD) | P value |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3D | 14.2 | EENTb | 0.878 (0.017) | 0.853 (0.032) | 0.913 (0.067) | 0.933 (0.045) | 0.89 (0.012) | 0.959 (0.011) | .003 |
| 1 | 2D | 274 | EENT | 0.861 (0.019) | 0.845 (0.028) | 0.883 (0.052) | 0.909 (0.036) | 0.875 (0.016) | 0.939 (0.013) | N/Ac |
| 1 | 3D | 14.2 | WUd | 0.843 (0.015) | 0.756 (0.047) | 0.934 (0.021) | 0.924 (0.018) | 0.83 (0.022) | 0.933 (0.010) | .05 |
| 1 | 2D | 274 | WU | 0.821 (0.023) | 0.744 (0.078) | 0.901 (0.046) | 0.891 (0.039) | 0.808 (0.036) | 0.918 (0.012) | N/A |
| 2 | 3D | 14.2 | EENT | 0.783 (0.04) | 0.808 (0.025) | 0.77 (0.054) | 0.652 (0.06) | 0.721 (0.042) | 0.853 (0.030) | <.001 |
| 2 | 2D | 274 | EENT | 0.67 (0.037) | 0.716 (0.144) | 0.646 (0.119) | 0.523 (0.044) | 0.596 (0.036) | 0.744 (0.025) | N/A |
| 2 | 3D | 14.2 | WU | 0.812 (0.033) | 0.614 (0.085) | 0.878 (0.031) | 0.626 (0.078) | 0.618 (0.069) | 0.826 (0.055) | <.001 |
| 2 | 2D | 274 | WU | 0.676 (0.103) | 0.479 (0.224) | 0.741 (0.185) | 0.41 (0.086) | 0.411 (0.096) | 0.714 (0.049) | N/A |

aAUROC: area under the receiver operating characteristic curve.

bEENT: Eye, Ear, Nose, and Throat Hospital of Fudan University.

cN/A: not applicable.

dWU: Wuhan Union Hospital.

Figure 4. Receiver operating characteristic plots for the benchmark tests. The curve and the shaded area indicate the mean and 1 SD of a model, respectively. Clinical experts are marked by colored asterisks for individual performance and by an open circle for averaged performance. The dotted diagonal line represents a random classifier. AUC: area under the curve; EENT: Eye, Ear, Nose, and Throat Hospital of Fudan University; WU: Wuhan Union Hospital.

Task 2

This model also demonstrated satisfactory predictive capabilities in differentiating between cholesteatoma and noncholesteatomatous cases. On both data sets, the model managed to correctly identify whether a case involved cholesteatoma in approximately 4 out of 5 instances (with accuracies of 78.3% and 81.3%). Generalizability was further supported by the comparable AUROC scores of 0.85 and 0.83 on the internal and the external data set, respectively (Table 3).

Ablation Study

The proposed model exhibited a reasonable balance between predictability and efficiency (Table 4). Compared with models 1 and 3, it achieved significantly better performance in both tasks (P<.01). In addition, despite having approximately 60% fewer parameters, the proposed model demonstrated equivalent performance to model 2 in both tasks (P=.26 and .91, respectively), indicating its enhanced computational efficiency.

Table 4. Ablation study on the 3D classification network.


| Task | Model | Size (MB) | Accuracy, mean (SD) | Recall, mean (SD) | Specificity, mean (SD) | Precision, mean (SD) | F1-score, mean (SD) | AUROCa, mean (SD) | P value |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Proposed | 14.2 | 0.878 (0.017) | 0.853 (0.032) | 0.913 (0.067) | 0.933 (0.045) | 0.89 (0.012) | 0.959 (0.011) | N/Ab |
| 1 | Model 1 | 4.0 | 0.858 (0.03) | 0.827 (0.046) | 0.901 (0.058) | 0.921 (0.043) | 0.87 (0.028) | 0.947 (0.019) | <.001 |
| 1 | Model 2 | 34.5 | 0.884 (0.014) | 0.862 (0.021) | 0.914 (0.041) | 0.933 (0.03) | 0.895 (0.012) | 0.961 (0.009) | .26 |
| 1 | Model 3 | 64.8 | 0.864 (0.022) | 0.851 (0.062) | 0.887 (0.074) | 0.914 (0.053) | 0.878 (0.019) | 0.950 (0.019) | .003 |
| 2 | Proposed | 14.2 | 0.783 (0.04) | 0.808 (0.025) | 0.77 (0.054) | 0.652 (0.06) | 0.721 (0.042) | 0.853 (0.030) | N/A |
| 2 | Model 1 | 4.0 | 0.758 (0.048) | 0.712 (0.118) | 0.783 (0.065) | 0.636 (0.064) | 0.668 (0.075) | 0.817 (0.060) | .006 |
| 2 | Model 2 | 34.5 | 0.782 (0.036) | 0.795 (0.071) | 0.775 (0.074) | 0.659 (0.071) | 0.716 (0.032) | 0.862 (0.031) | .91 |
| 2 | Model 3 | 64.8 | 0.756 (0.056) | 0.76 (0.059) | 0.754 (0.109) | 0.634 (0.088) | 0.685 (0.037) | 0.826 (0.0047) | .003 |

aAUROC: area under the receiver operating characteristic curve.

bN/A: not applicable.

Benchmarks

Compared with the 2D approach, the 3D network demonstrated significantly superior performance in both tasks across data sets (P≤.05). In particular, the proposed model exhibited a substantial performance gain in differentiating between cholesteatoma and noncholesteatoma, with an increase of more than 10% in all outcome metrics on both data sets (Table 3).

This model also matched or even surpassed the diagnostic capabilities of human experts in both tasks (Figure 4). It exhibited marginally superior performance compared with human experts in the first task (P=.05) and significantly outperformed them in the visually challenging task 2 (P<.001). Post hoc pairwise comparisons revealed that the model excelled over the attending otologist in task 1 and over 2 resident fellows in task 2, rivaling all senior clinicians (Table 5). Similar results were shown across the breakdown of data sources, with a notable finding that the model outperformed a senior otologist in task 2 on the EENT subset (Table S4 in Multimedia Appendix 1). Moreover, the proposed model demonstrated perfect consistency, surpassing all human experts, who exhibited higher SDs in all outcome metrics and lower intrarater reliability scores.

Table 5. Benchmark performance against human experts.
| Task and rater | Accuracy | Recall | Specificity | Precision | F1-score | Kappa values | P value |
|---|---|---|---|---|---|---|---|
| Task 1 |  |  |  |  |  |  |  |
| The 3D model, mean (SD) | 0.878 (0.017) | 0.853 (0.032) | 0.913 (0.067) | 0.933 (0.045) | 0.89 (0.012) | 1.00 (0.00) | N/Aa |
| Expert average, mean (SD) | 0.857 (0.022) | 0.898 (0.042) | 0.804 (0.094) | 0.86 (0.05) | 0.876 (0.013) | 0.82 (0.09) | .05 |
| Senior otologist A: 12 Yb | 87.3% | 89.8% | 84.1% | 87.9% | 88.8% | 0.75 | .79 |
| Senior otologist B: 12 Y | 85.7% | 89.9% | 80.4% | 85.6% | 87.7% | 0.92 | .49 |
| Senior radiologist: 21 Y | 85.4% | 92.4% | 76.3% | 83.4% | 87.7% | 0.87 | .37 |
| Attending otologist: 7 Y | 81.1% | 96.0% | 61.9% | 76.5% | 85.2% | 0.73 | .002 |
| Resident A: 3 Y | 85.9% | 85.5% | 86.4% | 89.1% | 87.2% | 0.71 | .56 |
| Resident B: 3 Y | 87.4% | 91.2% | 82.5% | 87.1% | 89.1% | 0.81 | .74 |
| Resident C: 2 Y | 86.8% | 83.5% | 91.2% | 92.4% | 87.7% | 0.92 | .96 |
| Task 2 |  |  |  |  |  |  |  |
| The 3D model, mean (SD) | 0.843 (0.015) | 0.756 (0.047) | 0.934 (0.021) | 0.924 (0.018) | 0.83 (0.022) | 1.00 (0.00) | N/A |
| Expert average, mean (SD) | 0.741 (0.052) | 0.549 (0.123) | 0.865 (0.139) | 0.772 (0.135) | 0.622 (0.061) | 0.72 (0.12) | <.001 |
| Senior otologist A: 12 Y | 73.8% | 36.7% | 98.2% | 93.0% | 52.6% | 0.70 | .07 |
| Senior otologist B: 12 Y | 75.8% | 60.6% | 85.7% | 73.3% | 66.3% | 0.47 | .25 |
| Senior radiologist: 21 Y | 79.5% | 52.3% | 97.0% | 91.9% | 66.7% | 0.86 | .82 |
| Attending otologist: 7 Y | 74.5% | 55.0% | 87.0% | 73.2% | 62.8% | 0.74 | .11 |
| Resident A: 3 Y | 63.8% | 74.3% | 56.9% | 52.9% | 61.8% | 0.67 | <.001 |
| Resident B: 3 Y | 72.3% | 44.0% | 90.9% | 76.2% | 55.8% | 0.77 | .02 |
| Resident C: 2 Y | 78.8% | 61.5% | 89.9% | 79.8% | 69.4% | 0.80 | .96 |

aN/A: not applicable.

bY: years of experience in clinical practice.

Visual Assessment of Heat Maps

Heat maps from both models consistently highlighted the tympanic cavity and mastoid regions that manifested pathological findings characteristic of the target condition (Figure 5). Specifically, the first model generated a hot signal indicative of soft tissue density in an affected middle ear (Figure 5A), while the signal remained subdued in a normal ear (Figure 5B). Similarly, the second model revealed a distinct hot spot in a cholesteatomatous ear exhibiting the classic patterns of tympanum widening and ossicular destruction [17,32,33] (Figure 5C). In contrast, a case of CSOM showing intact ossicles surrounded by soft tissue shadows in a normal-sized tympanic cavity did not exhibit a corresponding hot spot (Figure 5D). These observations reflect that the AI’s decision-making strategy aligns reasonably well with established human knowledge for both tasks.

Figure 5. Examples of heat maps. The heat maps, generated in 3D fashion, are superimposed on the original computed tomographic scans and flattened to a series of 2D images for demonstration purposes. (A-B) A pathological and a normal ear, respectively. (C-D) A cholesteatoma and a noncholesteatoma case, respectively. Areas marked by hot signals indicate the presence of graphic patterns contributing to a “positive” prediction (ie, a pathological ear in task 1 and a cholesteatoma in task 2).

Clinical Use

The automatic evaluation system, incorporating the validated 3D model and the heat map visualization technique, was evaluated for its viability in aiding preoperative assessment in 121 patients with COM (mean age 46.8, SD 16.1 years; 40.5% male). This system achieved an overall accuracy of 81.8% in distinguishing between cholesteatoma and noncholesteatoma cases. Sixty-nine ears were identified as free of cholesteatoma by the model, all of which received minimally invasive tympanoplasty under endoscopy. During the procedure, 9 ears (13.0%) revealed signs of cholesteatoma, and 5 of them required an additional bone-grinding technique for complete removal of the mass. Cholesteatoma was initially predicted in 52 ears, with 37 (71.2%) of them undergoing canal-wall-down mastoidectomy. In the remaining 15 ears, the treating surgeons opted for endoscopic tympanoplasty, overriding the conventional technique indicated by the model’s predicted diagnosis. Clinicians reported that the model predictions aligned with their initial judgment or helped with their decision-making in 90.1% (109/121) of cases. Postoperative hearing results were obtained in 87.6% (106/121) of patients who maintained follow-up. Both groups of ears showed normal recovery, with mean hearing gains of 8.5 (SD 15.6) and 5.5 (SD 18.1) dB, respectively (Figure 6).

Figure 6. Postoperative hearing gain for the operated ears with available audiometry outcomes (n=106). Data are categorized according to model predictions. Predictions that agree with the pathological results are denoted by closed symbols, while open symbols indicate disagreements. Circles and triangles represent the treatment of endoscopic tympanoplasty and mastoidectomy, respectively. The error bars indicate ±1 SD from the mean.

Principal Results

This study demonstrates the robustness and generalizability of an AI model based on 3D CNNs for the detection and differential diagnosis of COM using temporal bone CT scans. This model leverages multidimensional diagnostic information from the middle ear, resulting in a significant performance improvement over the traditional 2D approach. The framework exhibits performance comparable with or even superior to that of human experts in otologic tasks with clinical significance and visual challenges, especially in classifying between cholesteatoma and noncholesteatomatous cases. In addition, the novel heat map technique allows inspection of the AI’s logic for decision-making, thereby enhancing the transparency of this model. The resulting AI system automates the summarization of critical radiologic findings and enables efficient evaluation of COM with minimal manual input. It provides tangible benefits in assisting otologists during preoperative assessment and results in favorable clinical outcomes that are comparable with historical results [34-37]. These findings further support the clinical viability and advantages of AI technology, which is expected to improve efficiency, reduce errors, and facilitate precision medicine in health care in the new era of big data.

Comparison With Prior Work

A few AI models have recently been developed to classify common middle ear conditions, such as CSOM, otitis media with effusion, and cholesteatoma [38-41]. However, these models were primarily based on traditional otoscopic images, which are potentially limited by a narrow field of view and insufficient diagnostic information. Temporal bone CT scans, which are increasingly used in otologic workup by virtue of their accessibility, rich anatomical information, and adequate sensitivity in revealing pathological changes, have also been explored in a limited number of studies [22,42-45]. Although these AI models demonstrated decent AUROC scores (eg, >0.9) in common classification tasks, they were all trained to generate predictions based on 2D single-layer CT scans. A potential drawback is the increased likelihood of missing small or peripheral pathological changes (eg, an attic cholesteatoma) and the resultant false negatives.

Efforts were made in this study to establish a 3D approach that takes full advantage of all available anatomical information and achieves better coverage of the tympanum and the mastoid. Inspection of the extracted ROI suggests that all critical anatomies are visible. Results from the benchmark test indicate that the proposed 3D model outperforms the state-of-the-art 2D approach by a modest performance gain in the detection of COM and by a much larger extent in differentiating between cholesteatoma and noncholesteatoma. This finding has several implications. First, both models are generally adequate in identifying common abnormal patterns on CT, which are graphically characterized by increased opacification or soft tissue shadows in the middle ear cavity and indicative of pathological conditions in general. This is a relatively simple visual task, in which the diagnostic information obtained from a single 2D CT slice is likely sufficient, and extra findings from other layers provide only minimal contribution to the decision-making. Second, the 3D model has a clear advantage over the 2D approach in differentiating cholesteatoma from other types of COM. This task is known to be more visually challenging for humans, often requiring detection of subtle osseous erosions across multiple CT slices, as quite a few pathological changes caused by cholesteatoma are peripheral or noncharacteristic [32,33]. A substantial increase in each outcome measure justifies the advantage of the current 3D model for this task. Moreover, this 3D model has a simple network structure with a small size (14.2 MB), as opposed to the complex and large 2D network (274 MB), suggesting higher computational efficiency alongside better performance of the 3D approach. Finally, the AUROC of 0.92-0.94 and accuracy scores of 82.1%-86.1% achieved by the 2D network in detecting COM in this study were equivalent to the historical results (0.92 and 86%, respectively) in our previous study [22], further indicating the reliability of these findings and potentially the intrinsic limit of using single-layer CT scans for this task. To the best of our knowledge, this is the first study showing quantitative evidence to support the advantage of a 3D CNN model in 2 common otologic tasks based on temporal bone CT scans. It also advances beyond prior retrospective research by showcasing the practicality and benefits of the model in a clinical environment.

Clinical Implications

Cholesteatoma exhibits distinct histology marked by local invasiveness and a propensity for recurrence. Successful outcomes necessitate complete removal of the mass, particularly because recurrent cholesteatoma complicates revision surgery [46]. Suspected cases often require a canal-wall-down mastoidectomy to expose the tympanum, resulting in an open cavity and a permanently altered sound conduction pathway [46]. Accumulating evidence suggests that noncholesteatomatous ears may be spared from mastoidectomy and benefit from minimally invasive procedures such as endoscopic tympanoplasty [47,48]. Therefore, the current AI system holds potential value for otologists in surgical planning. Ears with a low risk of cholesteatoma, as identified by the model, could potentially be treated by less invasive procedures that retain the integrity of the canal wall, leading to reduced procedural time and enhanced recovery [6,7,9,49,50]. This clinical merit is supported by the superior benchmark performance in identifying cholesteatoma and the favorable outcomes observed in the prospective study.

While detecting COM in task 1 involves spotting any pathological patterns on CT, which may not fully capture the differences between models in diagnostic capabilities, the increased visual challenges in identifying cholesteatoma substantiate the advantages of the proposed 3D approach for this task. In this study, the 3D model outperformed junior clinicians and demonstrated equivalent or superior performance to senior experts in identifying cholesteatoma based on CT. Notably, the 3D model achieved outcomes that were on par with or better than those based on human interpretation of MRI, which, despite its higher sensitivity, is a more expensive diagnostic method [22,43,45,51-53]. These findings underscore the 3D model’s potential as a reliable and cost-effective alternative, offering sufficient COM evaluation with CT alone, thereby reducing the need for the pricier MRI.

The findings from the prospective study indicate that the model is efficacious in clinical environments, especially in distinguishing cholesteatoma from noncholesteatoma. Feedback from our clinical team highlights that the system serves as a reliable and streamlined source for a second opinion. Before surgery, the treating physician can rapidly identify essential details such as the lesion’s location and properties, using the model’s diagnostic output and heat maps. Concordance between the model’s predictions and the physician’s initial assessment bolsters confidence in surgical planning, thereby streamlining the diagnostic and therapeutic process. In contrast, discrepancies between the model’s results and the physician’s judgment prompt a detailed case reassessment or team consultation, aiding in the validation of a suitable treatment plan or preparing for intraoperative modifications. This process provides timely advisory support for complex cases, encouraging meticulous evaluation by the physician, minimizing errors, and keeping the clinician’s cognitive load in check without compromising their autonomy in decision-making.

It should be noted that even for seasoned otologists and radiologists, who are adept at quickly and accurately reading temporal bone CT scans, a second opinion can add an extra layer of confidence to their assessments. For novice clinicians, who may find the diagnostic process more challenging and time-intensive (Table 5), the model may offer substantial improvements in both the accuracy and the speed of diagnosing and managing COM. This is particularly beneficial for physicians in smaller medical facilities or those early in their careers. Looking ahead, the integration of this model into electronic medical systems or cloud-based servers stands to streamline the provision of immediate second opinions or enable physicians from diverse locations to upload imaging data for dependable diagnostic insights. Such technological progress is poised to advance individualized COM treatments in the big data era, boosting efficiency, reducing costs, and enhancing the quality of patient care.

Research Insights

Efforts were undertaken in this study to demystify the criticized nontransparency of DL models, characterized by intricate decision-making strategies within multilayer architectures [30,54]. The nonlinear interactions among these components can yield incomprehensible logic and untraceable predictions vulnerable to bias or errors, posing a significant challenge to the widespread application of AI in health care. To address this issue, heat maps, and specifically the Gradient-Weighted Class Activation Mapping technique, have been used to inspect the AI’s strategy and enhance human interpretation in a parsimonious manner [55-57]. In this study, the strategy learned by our models, focusing on the middle ear and mastoid regions, appeared reasonable and aligned with human knowledge in interpreting CT for COM, reinforcing the reliability of this framework. These informative heat maps can aid clinicians in understanding and validating AI predictions for specific cases, or serve as educational tools for training medical students and junior residents in reading temporal bone CT scans. Ultimately, this approach presents a viable solution for developing explainable AI models for clinical tasks.

Overfitting is a common concern with DL models, especially when data are limited or sourced from a single institute. It can lead to poor performance on new data despite promising results on the original data set. Previous DL models were trained on monocentric CT scans with participant counts ranging from 61 to 562. Lack of external validation and small sample sizes may raise concern about potential overfitting of these models [22,42,43]. Several approaches were used in this study to enhance the generalizability of our framework. First, our models underwent cross-validation on a major data set comprising more than 3000 ears, the largest sample size reported to date. Second, these models were evaluated on external data with different patient origins and image properties. Third, several machine learning methods were applied to minimize the risk of models being tuned to the random features, including early termination of training and the use of a dropout function to decrease the interdependency among network nodes [27]. Consistent performance metrics across data sets in both tasks substantiated the generalizability of this framework. Moreover, the region proposal method proved applicable to CT scans from both sources, demonstrating adaptability despite differences in CT scanner, scan settings, and image quality.

Limitations

This study has several limitations. First, although an external data set was obtained from a hospital in a different city, patients in both data sets shared a common racial background. Further validation on data collected from patients with diverse origins may be necessary to ensure the generalizability of these models. Second, the research was constrained to 2 binary classification tasks relevant to COM. Incorporating additional diagnostic tasks, such as assessing the ossicular chain’s integrity and forecasting auditory outcomes, may enrich the diagnostic toolkit. Third, the models were exclusively trained to analyze CT scans, potentially not leveraging AI’s full potential in COM evaluation. Comprehensive diagnostics often involve synthesizing information from patient history, clinical symptoms, ear examinations, audiological testing, otoscopy, and various imaging techniques. Overreliance on CT scans alone may introduce limitations in performance and may not always lead to conclusive diagnoses (Figure 7). Fourth, the ablation study examined a limited array of model alternatives. Despite achieving notable performance through initial model structure refinement, future endeavors should include ongoing optimization of the model architecture and detailed analysis of network component functions to optimize the trade-off between model efficacy and computational demands. In addition, this study did not place extensive emphasis on exploring common ethical issues, such as patient privacy, data security, and human autonomy, which are critical considerations in the clinical application of AI and warrant ongoing attention. Finally, this study reported initial findings from the clinical application of the AI system in a small, prospective cohort without a control group. Although the main objective was to show that the current model is ready for clinical implementation, a thorough assessment of the model’s clinical benefits will be conducted in an upcoming clinical trial with a more rigorous research design.

Figure 7. Examples of misclassified cases. (A) A pathological ear showing a small-sized soft tissue density near the ossicles (arrows) with no evident sign of osseous erosion or mastoid opacification. (B) A case of cholesteatoma showing soft tissue density (asterisks) but with a visually intact ossicular chain (arrows) and a normal-sized tympanic cavity.

Future Research

Future studies will focus on leveraging novel techniques to enhance model performance and on evaluating their effectiveness in larger-scale controlled trials. For example, new models will be trained to perform additional tasks, including evaluation of the ossicular chain and forecasting of postoperative hearing, which may extend the features of the current AI framework. A broader data set will be compiled from hospitals worldwide to assess and refine the generalizability of these models. Moreover, future models will potentially incorporate multiple sources of clinical information with a fusion layer for generating predictions, mimicking human decision-making strategies and potentially enhancing model robustness. Ongoing efforts will also be made to refine model architectures and to address ethical issues associated with the use of AI in health care. An active learning framework may be established to integrate feedback loops, allowing clinicians to provide input to the model. This approach is expected to support ongoing model enhancement and reinforcement learning based on human feedback. In the next stage, multicenter, prospective human trials will be conducted to assess the practical benefits of implementing these AI models in clinical contexts. The ultimate goal of this research line is to establish a robust AI system that can assist clinicians with reliability, efficiency, and transparency in the evaluation and management of ear diseases.

Conclusions

This study presents a 3D CNN model trained to detect pathological changes and identify cholesteatoma based on temporal bone CT scans. The model’s performance significantly surpasses the baseline 2D approach, reaching a level comparable with or even exceeding that of human experts in both tasks. The model also exhibits decent generalizability and enhanced comprehensibility through the gradient heat maps. The resulting AI system allows automatic assessment of COM and shows promising viability in real-world clinical settings. These findings imply the potential of AI as a valuable tool for aiding clinicians in the evaluation of COM. Future research will involve enhancing models with additional sources of diagnostic information to perform various clinical tasks and evaluating the benefits of AI models in large-scale controlled trials.

Acknowledgments

This study is supported by the National Natural Science Foundation of China (grants 81970889 to Fanglu Chi; grants 82271166, 81970880, and 81771017 to Dongdong Ren; and grants U22A20249, 52188102, and 52027806 to Jiajie Guo); the Natural Science Foundation of Shanghai (grant 22ZR1410100 to Dongdong Ren); the “Zhuo-Xue Plan” of Fudan University (Dongdong Ren); the Heng-Jie special technical support plan (Dongdong Ren); and the Shanghai Outstanding Young Medical Talent Program (Dongdong Ren). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Data Availability

The complete source code and trained models are available in Multimedia Appendices 1 and 2 as well as in a public repository, which can be accessed in GitHub [58]. The automatic evaluation system is available for individual use with a detailed instruction manual and a walk-through tutorial. The data sets generated during this study may be obtained from the corresponding authors upon reasonable request.

Authors' Contributions

YL conceptualized and designed the study, reviewed and analyzed the data, performed computer programming, developed and evaluated the AI models, wrote and edited the manuscript, prepared the figures, and supervised the entire project; BC retrieved, validated and analyzed the data, drafted the manuscript, and prepared the figures; YS provided data resources and validated the data. HS, YW, and JL retrieved data and evaluated the models. XN and LY retrieved and validated the data. JG acquired funding support, validated the model, and edited the manuscript. SB performed model validation and deployment. YC, JX, and JY evaluated the models. YH and BL provided data resources; FC provided data resources and funding support; and DR conceptualized the study, provided funding support and data resources, and edited the manuscript. All authors have reviewed, discussed, and approved the manuscript. No generative artificial intelligence tool was used during the preparation of this manuscript. BC, YL, and YS have contributed equally to this study and are credited as co-first authors. YL, BL, and DR are designated as the corresponding authors. Contact details for correspondence are as follows: Yike Li, Department of Otolaryngology-Head and Neck Surgery, Vanderbilt University Medical Center, 1215 21st Avenue South, Rm. 10410, Medical Center East, Nashville, TN 37232, USA. E-mail: yike.li.1@vumc.org. ORCID: 0000-0001-8465-130X; Bo Liang, Department of Radiology, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Jiefang Avenue #1277, Wuhan, Hubei, 430022, China. E-mail: xiehelb@163.com. ORCID: 0000-0002-3494-4187; DongDong Ren, Department of Otorhinolaryngology, Eye, Ear, Nose and Throat Hospital, 83 Fenyang Road, Shanghai, 200031, China. E-mail: dongdongren@fudan.edu.cn. ORCID: 0000-0002-2889-9375.

Conflicts of Interest

YL served as an associate editor for the Journal of Medical Internet Research at the time of manuscript submission. YL has abstained from participating in any peer-reviewed or editorial decision-making processes related to this article.

Multimedia Appendix 1

Additional tables.

DOCX File , 24 KB

Multimedia Appendix 2

Source code for model development and validation.

ZIP File (Zip Archive), 69 KB

  1. Schilder AGM, Chonmaitree T, Cripps AW, Rosenfeld RM, Casselbrant ML, Haggard MP, et al. Otitis media. Nat Rev Dis Primers. 2016;2(1):16063. [CrossRef] [Medline]
  2. World Health Organization. Chronic Suppurative Otitis Media: Burden of Illness and Management Options. Geneva, Switzerland. World Health Organization; 2004.
  3. Lustig L, Limb C, Baden R, LaSalvia M. Chronic Otitis Media, Cholesteatoma, and Mastoiditis in Adults. Waltham, MA: UpToDate; 2018.
  4. Takahashi M, Motegi M, Yamamoto K, Yamamoto Y, Kojima H. Endoscopic tympanoplasty type I using interlay technique. J Otolaryngol Head Neck Surg. 2022;51(1):45. [CrossRef] [Medline]
  5. Ohki M, Kikuchi S, Tanaka S. Endoscopic type 1 tympanoplasty in chronic otitis media: comparative study with a postauricular microscopic approach. Otolaryngol Head Neck Surg. 2019;161(2):315-323. [CrossRef]
  6. Hsu Y, Kuo C, Huang T. A retrospective comparative study of endoscopic and microscopic tympanoplasty. J Otolaryngol Head Neck Surg. 2018;47(1):44. [CrossRef]
  7. Yang Q, Wang B, Zhang J, Liu H, Xu M, Zhang W. Comparison of endoscopic and microscopic tympanoplasty in patients with chronic otitis media. Eur Arch Otorhinolaryngol. 2022;279(10):4801-4807. [CrossRef]
  8. Tsetsos N, Vlachtsis K, Stavrakas M, Fyrmpas G. Endoscopic versus microscopic ossiculoplasty in chronic otitis media: a systematic review of the literature. Eur Arch Otorhinolaryngol. 2020;278(4):917-923. [CrossRef]
  9. Tarabichi M, Ayache S, Nogueira JF, Al Qahtani M, Pothier DD. Endoscopic management of chronic otitis media and tympanoplasty. Otolaryngol Clin North Am. 2013;46(2):155-163. [CrossRef]
  10. Watts S, Flood LM, Clifford K. A systematic approach to interpretation of computed tomography scans prior to surgery of middle ear cholesteatoma. J Laryngol Otol. 2000;114(4):248-253. [CrossRef]
  11. Selwyn D, Howard J, Cuddihy P. Pre-operative prediction of cholesteatomas from radiology: retrospective cohort study of 106 cases. J Laryngol Otol. 2019;133(06):477-481. [CrossRef]
  12. Songu M, Altay C, Onal K, Arslanoglu S, Balci MK, Ucar M, et al. Correlation of computed tomography, echo-planar diffusion-weighted magnetic resonance imaging and surgical outcomes in middle ear cholesteatoma. Acta Otolaryngol. 2015;135(8):776-780. [CrossRef]
  13. Mahmutoğlu AS, Celebi I, Sahinoğlu S, Cakmakçi E, Sözen E. Reliability of preoperative multidetector computed tomography scan in patients with chronic otitis media. J Craniofac Surg. 2013;24(4):1472-1476. [CrossRef]
  14. Pandey AK, Bapuraj JR, Gupta AK, Khandelwal N. Is there a role for virtual otoscopy in the preoperative assessment of the ossicular chain in chronic suppurative otitis media? Comparison of HRCT and virtual otoscopy with surgical findings. Eur Radiol. 2009;19(6):1408-1416. [CrossRef] [Medline]
  15. Chee NW, Tan TY. The value of pre-operative high resolution CT scans in cholesteatoma surgery. Singapore Med J. 2001;42(4):155-159.
  16. Zaman SU, Rangankar V, Muralinath K, Shah V, Pawar R. Temporal bone cholesteatoma: typical findings and evaluation of diagnostic utility on high resolution computed tomography. Cureus. 2022;14(3):e22730. [CrossRef] [Medline]
  17. Gaurano JL, Joharjy IA. Middle ear cholesteatoma: characteristic CT findings in 64 patients. Ann Saudi Med. 2004;24(6):442-447. [CrossRef] [Medline]
  18. Ardila D, Kiraly AP, Bharadwaj S, Choi B, Reicher JJ, Peng L, et al. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nat Med. 2019;25(6):954-961. [CrossRef] [Medline]
  19. Mikhael PG, Wohlwend J, Yala A, Karstens L, Xiang J, Takigami AK, et al. Sybil: a validated deep learning model to predict future lung cancer risk from a single low-dose chest computed tomography. J Clin Oncol. 2023;41(12):2191-2200. [CrossRef]
  20. Yamashita R, Long J, Longacre T, Peng L, Berry G, Martin B, et al. Deep learning model for the prediction of microsatellite instability in colorectal cancer: a diagnostic study. Lancet Oncol. 2021;22(1):132-141. [CrossRef] [Medline]
  21. Li Y, Guo J, Yang P. Developing an image-based deep learning framework for automatic scoring of the pentagon drawing test. J Alzheimers Dis. 2022;85(1):129-139. [CrossRef]
  22. Wang Y, Li Y, Cheng Y, He Z, Yang J, Xu J, et al. Deep learning in automated region proposal and diagnosis of chronic otitis media based on computed tomography. Ear Hear. 2020;41(3):669-677. [CrossRef]
  23. Sundgaard JV, Harte J, Bray P, Laugesen S, Kamide Y, Tanaka C, et al. Deep metric learning for otitis media classification. Med Image Anal. 2021;71:102034. [CrossRef] [Medline]
  24. Watson DS, Krutzinna J, Bruce IN, Griffiths CE, McInnes IB, Barnes MR, et al. Clinical applications of machine learning algorithms: beyond the black box. BMJ. 2019;364:l886. [CrossRef]
  25. Castelvecchi D. Can we open the black box of AI? Nature. 2016;538(7623):20-23. [CrossRef] [Medline]
  26. Redmon J, Divvala S, Girshick R, Farhadi A. You only look once: unified, real-time object detection. IEEE; 2016. Presented at: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016:779-788; Las Vegas, NV, USA. [CrossRef]
  27. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res. 2014;15(1):1929-1958.
  28. Kingma DP, Ba J. Adam: a method for stochastic optimization. arXiv:1412.6980. Preprint published online December 2014. [FREE Full text]
  29. Tseng C, Lai M, Wu C, Yuan S, Ding Y. Comparison of the efficacy of endoscopic tympanoplasty and microscopic tympanoplasty: a systematic review and meta‐analysis. Laryngoscope. 2016;127(8):1890-1896. [CrossRef] [Medline]
  30. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-cam: visual explanations from deep networks via gradient-based localization. 2017. Presented at: Proceedings of the IEEE International Conference on Computer Vision; 2017:618-626; Venice, Italy.
  31. Python Core Team. Python: a dynamic, open source programming language. Python Software Foundation. URL: https://www.python.org/ [accessed 2024-06-01]
  32. Baráth K, Huber AM, Stämpfli P, Varga Z, Kollias S. Neuroradiology of cholesteatomas. AJNR Am J Neuroradiol. 2011;32(2):221-229. [CrossRef]
  33. Gulati M, Gupta S, Prakash A, Garg A, Dixit R. HRCT imaging of acquired cholesteatoma: a pictorial review. Insights Imaging. 2019;10(1):92. [CrossRef] [Medline]
  34. Daneshi A, Daneshvar A, Asghari A, Farhadi M, Mohebbi S, Mohseni M. Endoscopic versus microscopic cartilage myringoplasty in chronic otitis media. Iran J Otorhinolaryngol. 2020;32(112):263-269. [CrossRef]
  35. Prasad SC, Melia CL, Medina M, Vincenti V, Bacciu A, Bacciu S, et al. Long-term surgical and functional outcomes of the intact canal wall technique for middle ear cholesteatoma in the paediatric population. Acta Otorhinolaryngol Ital. 2014;34(5):354-361.
  36. Wood CB, O’Connell BP, Lowery AC, Bennett ML, Wanna GB. Hearing outcomes following type 3 tympanoplasty with stapes columella grafting in canal wall down mastoidectomy. Ann Otol Rhinol Laryngol. 2019;128(8):736-741. [CrossRef]
  37. Chamoli P, Singh CV, Radia S, Shah AK. Functional and anatomical outcome of inside out technique for cholesteatoma surgery. Am J Otolaryngol. 2018;39(4):423-430. [CrossRef] [Medline]
  38. Wu Z, Lin Z, Li L, Pan H, Chen G, Fu Y, et al. Deep learning for classification of pediatric otitis media. Laryngoscope. 2020;131(7):E2344-E2351. [CrossRef]
  39. Pichichero ME. Can machine learning and AI replace otoscopy for diagnosis of otitis media? Pediatrics. 2021;147(4):e2020049584. [CrossRef]
  40. Tseng CC, Lim V, Jyung RW. Use of artificial intelligence for the diagnosis of cholesteatoma. Laryngoscope Invest Otolaryngol. 2023;8(1):201-211. [CrossRef]
  41. Livingstone D, Chau J. Otoscopic diagnosis using computer vision: an automated machine learning approach. Laryngoscope. 2020;130(6):1408-1413. [CrossRef] [Medline]
  42. Duan B, Guo Z, Pan L, Xu Z, Chen W. Temporal bone CT-based deep learning models for differential diagnosis of primary ciliary dyskinesia related otitis media and simple otitis media with effusion. Am J Transl Res. 2020;14(7):4728-4735.
  43. Eroğlu O, Eroğlu Y, Yıldırım M, Karlıdag T, Çınar A, Akyiğit A, et al. Is it useful to use computerized tomography image-based artificial intelligence modelling in the differential diagnosis of chronic otitis media with and without cholesteatoma? Am J Otolaryngol. 2022;43(3):103395. [CrossRef]
  44. Khosravi M, Jabbari Moghaddam Y, Esmaeili M, Keshtkar A, Jalili J, Tayefi Nasrabadi H. Classification of mastoid air cells by CT scan images using deep learning method. J Big Data. 2022;9(1):1-14. [CrossRef]
  45. Wang Z, Song J, Su R, Hou M, Qi M, Zhang J, et al. Structure-aware deep learning for chronic middle ear disease. Expert Syst Appl. 2022;194:116519. [CrossRef]
  46. Tomlin J, Chang D, McCutcheon B, Harris J. Surgical technique and recurrence in cholesteatoma: a meta-analysis. Audiol Neurotol. 2013;18(3):135-142. [CrossRef]
  47. Lee S, Lee DY, Seo Y, Kim YH. Can endoscopic tympanoplasty be a good alternative to microscopic tympanoplasty? a systematic review and meta-analysis. Clin Exp Otorhinolaryngol. 2019;12(2):145-155. [CrossRef] [Medline]
  48. Trinidade A, Page JC, Dornhoffer JL. Therapeutic mastoidectomy in the management of noncholesteatomatous chronic otitis media. Otolaryngol Head Neck Surg. 2016;155(6):914-922. [CrossRef] [Medline]
  49. Wu L, Liu Q, Gao B, Huang S, Yang N. Comparison of endoscopic and microscopic management of attic cholesteatoma: a randomized controlled trial. Am J Otolaryngol. 2022;43(3):103378. [CrossRef] [Medline]
  50. Toulouie S, Block‐Wheeler NR, Rivero A. Postoperative pain after endoscopic vs microscopic otologic surgery: a systematic review and meta-analysis. Otolaryngol Head Neck Surg. 2022;167(1):25-34. [CrossRef] [Medline]
  51. Profant M, Sláviková K, Kabátová Z, Slezák P, Waczulíková I. Predictive validity of MRI in detecting and following cholesteatoma. Eur Arch Otorhinolaryngol. 2012;269(3):757-765. [CrossRef]
  52. Lin M, Sha Y, Sheng Y, Chen W. Accuracy of 2D blade turbo gradient- and spin-echo diffusion weighted imaging for the diagnosis of primary middle ear cholesteatoma. Otol Neurotol. 2022;43(6):e651-e657. [CrossRef] [Medline]
  53. Sharifian H, Taheri E, Borghei P, Shakiba M, Jalali AH, Roshanfekr M, et al. Diagnostic accuracy of non‐echo‐planar diffusion‐weighted MRI versus other MRI sequences in cholesteatoma. J Med Imaging Radiat Oncol. 2012;56(4):398-408. [CrossRef] [Medline]
  54. Montavon G, Lapuschkin S, Binder A, Samek W, Müller K. Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognit. 2017;65:211-222. [CrossRef]
  55. Panwar H, Gupta PK, Siddiqui MK, Morales-Menendez R, Bhardwaj P, Singh V. A deep learning and grad-CAM based color visualization approach for fast detection of COVID-19 cases using chest X-ray and CT-scan images. Chaos Solitons Fractals. 2020;140:110190. [CrossRef] [Medline]
  56. Cheng C, Ho T, Lee T, Chang C, Chou C, Chen C, et al. Application of a deep learning algorithm for detection and visualization of hip fractures on plain pelvic radiographs. Eur Radiol. 2019;29(10):5469-5477. [CrossRef] [Medline]
  57. He T, Guo J, Chen N, Xu X, Wang Z, Fu K, et al. MediMLP: using grad-CAM to extract crucial variables for lung cancer postoperative complication prediction. IEEE J Biomed Health Inform. 2020;24(6):1762-1771. [CrossRef] [Medline]
  58. huntlylee/3D-Otitis-Media. GitHub. URL: https://github.com/huntlylee/3D-Otitis-Media [accessed 2024-08-01]


AI: artificial intelligence
AUROC: area under the receiver operating characteristic curve
CNN: convolutional neural network
COM: chronic otitis media
CSOM: chronic suppurative otitis media
CT: computed tomography
DL: deep learning
EENT: Eye, Ear, Nose, and Throat Hospital of Fudan University
MRI: magnetic resonance imaging
ROI: region of interest
WU: Wuhan Union Hospital
YOLO: You Only Look Once


Edited by S Ma; submitted 09.08.23; peer-reviewed by Q Chen, SA Javed, CN Hang; comments to author 02.11.23; revised version received 30.11.23; accepted 29.05.24; published 08.08.24.

Copyright

©Binjun Chen, Yike Li, Yu Sun, Haojie Sun, Yanmei Wang, Jihan Lyu, Jiajie Guo, Shunxing Bao, Yushu Cheng, Xun Niu, Lian Yang, Jianghong Xu, Juanmei Yang, Yibo Huang, Fanglu Chi, Bo Liang, Dongdong Ren. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 08.08.2024.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.