e-Learning for Instruction and to Improve Reproducibility of Scoring Tumor-Stroma Ratio in Colon Carcinoma: Performance and Reproducibility Assessment in the UNITED Study

Background: The amount of stroma in the primary tumor is an important prognostic parameter. The tumor-stroma ratio (TSR) was previously validated by international research groups as a robust parameter with good interobserver agreement.
Objective: The Uniform Noting for International Application of the Tumor-Stroma Ratio as an Easy Diagnostic Tool (UNITED) study was developed to bring the TSR to clinical implementation. As part of the study, an e-Learning module was constructed to confirm the reproducibility of scoring the TSR after proper instruction.
Methods: The e-Learning module consists of an autoinstruction for TSR determination (instruction video or written protocol) and three sets of 40 cases (training, test, and repetition sets). Scoring the TSR is performed on hematoxylin and eosin–stained sections and takes only 1-2 minutes. Cases are considered stroma-low if the amount of stroma is ≤50%, whereas a stroma-high case is defined as >50% stroma. Inter- and intraobserver agreements were determined based on the Cohen κ score after each set to evaluate the reproducibility.
Results: Pathologists and pathology residents (N=63) with special interest in colorectal cancer participated in the e-Learning. Forty-nine participants started the e-Learning and 31 (63%) finished the whole cycle (3 sets). A significant improvement was observed from the training set to the test set; the median κ score improved from 0.72 to 0.77 (P=.002).
Conclusions: e-Learning is an effective method to instruct pathologists and pathology residents for scoring the TSR. The reliability of scoring improved from the training to the test set and did not fall back with the repetition set, confirming the reproducibility of the TSR scoring method.
Trial Registration: The Netherlands Trial Registry NTR7270; https://www.trialregister.nl/trial/7072
International Registered Report Identifier (IRRID): RR2-10.2196/13464


Introduction
The prediction of disease outcome for an individual patient as part of personalized medicine is becoming routine practice in the management of cancer patient treatment. Staging of colon cancer by pathologists is based on hematoxylin and eosin (H&E)-stained sections of the primary tumor. The tumor-node-metastasis (TNM) classification, applied according to the American Joint Committee on Cancer staging algorithm, is used as the main selection criterion for additional treatment, along with characteristics such as depth of invasion and differentiation grade [1]. However, conventional H&E sections provide more information than previously recognized.
In the last decade, research has not only focused on the tumor and its characteristics but increasingly also on the tumor microenvironment. The tumor microenvironment consists of the stromal background with a variety of cells such as fibroblasts, endothelial cells, and lymphocytes. The tumor-stroma ratio (TSR) is the amount of tumor relative to the amount of stroma in the primary tumor [2][3][4]. Patients with a high amount of stroma (stroma-high) have a worse prognosis compared to those harboring tumors with a low amount of stroma (stroma-low) in multiple types of cancer [5][6][7][8][9][10][11][12][13].
Scoring the TSR is performed on H&E-stained sections in only 1-2 minutes, with good to excellent interobserver agreement [3]. This implies that TSR scoring is easy to learn and useful in daily practice.
The Uniform Noting for International Application of the Tumor-Stroma Ratio as an Easy Diagnostic Tool (UNITED) study was developed to prepare for implementation of the TSR as an additional high-risk indicator along with traditional TNM classification. As part of the UNITED study, an instruction protocol and reproducibility study were initiated using an e-Learning module as described in the published research protocol [14].
Digital pathology is increasingly being implemented in daily diagnostic practice as well as for teaching. Digital pathology for instruction, most commonly used for instruction of students, has multiple advantages: more students can be reached because it is web-based; all students look at exactly the same case; annotations can be shared with the teacher, resulting in direct feedback; and students can complete the course when and where it suits them [15][16][17][18].
The use of e-Learning for education has been adopted in different medical specialties worldwide. An example of e-Learning used in pathology is a module developed for Dutch pathologists [19,20]. The module focuses on decreasing the variation in grading dysplasia in adenomas and increasing the consistency of scoring serrated lesions. Two separate studies have shown that e-Learning is a good method to improve performance [19,20]. The value of e-Learning for instructing professionals has also been confirmed in specialties other than pathology [21,22]. For instance, based on a systematic review of surgical training (students, residents, and surgeons), Maertens et al [23] concluded that e-Learning is as effective as other methods for training.
The aim of this study was to confirm the high reproducibility of scoring the TSR using an e-Learning module to train a variety of pathologists.

Case Selection
The e-Learning module was based on H&E-stained sections of stage II and III colon cancer resection specimens. The cases were randomly selected from the archives of the Pathology Department of Leiden University Medical Centre (LUMC). The number of cases with very low stroma (ie, 10% or 20% stroma area) was limited in the selection to increase the number of cases that are generally more difficult for pathologists to score and are therefore more suitable for training purposes. In the e-Learning, 55% of the cases were stroma-low (≤50% stroma) and 45% were stroma-high (>50% stroma). None of the patients had received neoadjuvant treatment at the time of sample collection. The sample size was based on a workable number of cases to maintain quality without the case load being too high.
Slides were scanned using a 20× objective with the Panoramic 250 scanner (3D Histech) or with the IntelliSite Digital pathology slide scanner (Philips).

Participants
Pathologists and pathology residents from all over the world could participate in the UNITED study and in the e-Learning. The UNITED study started in 2018, and pathologists were invited to start (and complete) the e-Learning in the period of December 2017 to April 2019. Data collection ended in April 2019. Multimedia Appendix 1 provides an overview of participating countries, and the numbers of participating pathologists and residents.

TSR Scoring Method
The previously published protocol for scoring the TSR was used in this study [3,4]. In brief, the section of the deepest part of the tumor, usually the section used for determining the T-stage, was chosen. The area with the highest amount of stroma was selected and scored at 100× magnification in increments of 10%. The field of evaluation should contain tumor cells on all four sides.

e-Learning
The e-Learning module was developed in PathXL Tutor version 6.1.1.1 (Philips, Belfast, UK). This software uses digital images and was developed to easily share cases with, and teach, a network of pathology students or, in this case, pathologists. The PathXL software allows the pathologist to analyze the slide in a manner comparable to using a microscope. The e-Learning was prepared to resemble real-life microscopy as far as possible by using round annotations with a fixed size of 3.4 mm². This represents the size of the field of vision of microscopes, even from different brands, when using 100× magnification.
Participants were blinded to clinical data of the sections and were only informed that the patients did not receive neoadjuvant treatment.
Before starting the first set of the e-Learning module, participants were asked to watch the instruction video on the study website [24]. The training set consisted of 40 cases. This set started with 5 multiple choice questions where annotations were placed upfront in different areas of the section (Multimedia Appendix 2). Participants were asked to select the correctly placed annotation and determine the percentage of stroma. In the other 35 cases (and in the other sets), the participant was asked to place the annotations themselves at the most optimal position and to determine the stroma percentage.
The test set also consisted of 40 cases, including 37 (93%) new cases. To determine the intraobserver agreement of scoring the TSR, the test set was repeated (repetition set) after a 2-month washout period with the same cases placed in a different order.
The answer model used for evaluating the results was established by two experienced observers (GP, MS) together with a pathologist (VS) of the LUMC. The coordinators of the UNITED study checked all finished e-Learning sets for stroma percentage and for correct placement of the annotation. The answers of the participants were compared with the answer model. Participants were allowed to continue with the next set when an interobserver agreement (κ) of ≥0.7 with the predefined scores was reached. If a participant did not pass a set because of a κ score below 0.7, the same set had to be rescored after feedback was given (see Multimedia Appendix 3 for the flowchart of the e-Learning module [14]).
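The progression rule described above can be sketched as a small decision function. This is only an illustration: the function name, the set labels, and the returned messages are hypothetical, and applying the same threshold to the repetition set is an assumption rather than something the module's software is known to implement.

```python
PASS_KAPPA = 0.7  # pass threshold stated in the text
SET_ORDER = ["training", "test", "repetition"]  # labels assumed for this sketch


def next_step(current_set, kappa):
    """Return what a participant does after a finished set has been checked."""
    if current_set not in SET_ORDER:
        raise ValueError(f"unknown set: {current_set!r}")
    if kappa < PASS_KAPPA:
        # Not passed: feedback is given and the same set is scored again.
        return f"rescore the {current_set} set after feedback"
    if current_set == "training":
        return "continue with the test set"
    if current_set == "test":
        return "continue with the repetition set after the 2-month washout"
    return "e-Learning completed"
```

For example, `next_step("training", 0.65)` asks for the training set to be rescored, whereas `next_step("training", 0.72)` lets the participant move on to the test set.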

Statistical Analysis
Data collected in this study comprised: (1) whether the participant is a pathologist or resident, (2) participating country, (3) the stroma percentage of the different questions, and (4) whether or not the participant considered a question to be difficult. In this study, a possible bias could be that participating pathologists/residents are generally more motivated for participation.
Stroma percentages were classified as stroma-low (≤50% stroma) or as stroma-high (>50% stroma) [3,4,25]. This dichotomous output and the outcome of whether or not the annotation was placed correctly were used for measuring observer agreement. The Cohen κ coefficient was used to measure inter- and intraobserver agreement. This score is quantitative and was used as a continuous variable. Histograms were used to visualize the distribution of the κ scores for each set.
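As an illustration of the agreement measure, the following minimal sketch dichotomizes stroma percentages at the 50% cut-off and computes the Cohen κ between a participant and the answer model. The function names and the example scores are hypothetical and are not data from the study; the sketch also uses only the stroma-low/stroma-high category and ignores whether the annotation was placed correctly.

```python
from collections import Counter

STROMA_CUTOFF = 50  # percent; <=50 is stroma-low, >50 is stroma-high


def dichotomize(stroma_percent):
    """Map a stroma percentage to one of the two TSR categories."""
    return "stroma-high" if stroma_percent > STROMA_CUTOFF else "stroma-low"


def cohen_kappa(labels_a, labels_b):
    """Cohen kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of cases with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from the marginal label frequencies of each rater.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one and the same category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical scores (% stroma) for 10 cases; not data from the study.
answer_model = [30, 70, 50, 80, 20, 60, 40, 90, 50, 70]
participant = [40, 60, 60, 80, 30, 50, 40, 80, 50, 70]

kappa = cohen_kappa(
    [dichotomize(p) for p in answer_model],
    [dichotomize(p) for p in participant],
)
```

In this made-up example the two raters agree on 8 of 10 cases while chance agreement is 0.5, so κ is approximately 0.6, which would fall below the 0.7 pass threshold even though most individual percentages differ by only 10 points.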
Nonnormally distributed continuous variables are described with the median and range (minimum and maximum values). For the median κ scores of a set, the first κ score of a participant of each set was used. The Wilcoxon signed-rank test was used to compare paired nonnormally distributed continuous variables (eg, measuring the progress between different e-Learning sets by participants).
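The paired comparison can be illustrated with a self-contained sketch of an exact two-sided Wilcoxon signed-rank test. The source does not state which statistical software was used, so this is not the authors' implementation; in practice a library routine such as `scipy.stats.wilcoxon` would normally be preferred. The function names and the κ values below are hypothetical.

```python
from statistics import median


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test; returns (W_plus, p_value).

    Zero differences are dropped; differences are rounded so that
    floating-point noise does not break ties.
    """
    diffs = [round(b - a, 10) for a, b in zip(x, y)]
    diffs = [d for d in diffs if d != 0]
    if not diffs:
        raise ValueError("all paired differences are zero")
    ranks = average_ranks([abs(d) for d in diffs])
    doubled = [int(round(2 * r)) for r in ranks]  # integers even with ties
    w_plus2 = sum(r for r, d in zip(doubled, diffs) if d > 0)
    # counts[s] = number of sign assignments whose positive ranks sum to s
    total = sum(doubled)
    counts = [0] * (total + 1)
    counts[0] = 1
    for r in doubled:
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]
    n_assign = 2 ** len(doubled)
    lower = sum(counts[: w_plus2 + 1]) / n_assign
    upper = sum(counts[w_plus2:]) / n_assign
    return w_plus2 / 2, min(1.0, 2 * min(lower, upper))


# Hypothetical first-attempt kappa scores of 8 participants; not study data.
training = [0.65, 0.70, 0.72, 0.60, 0.75, 0.68, 0.80, 0.71]
test = [0.75, 0.77, 0.74, 0.72, 0.80, 0.79, 0.78, 0.83]

summary = {
    "training": (median(training), min(training), max(training)),
    "test": (median(test), min(test), max(test)),
}
w_plus, p_value = wilcoxon_signed_rank(training, test)
```

The `summary` dictionary holds the median and range per set, matching how the κ scores are described in the text, and `p_value` is the exact two-sided P value for the paired change between the two sets.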

Ethical Considerations
The UNITED study protocol has been approved by the Medical Research Ethics Committee of the LUMC (study number p17.302). All samples were handled in accordance with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.

Participants
In total, 63 participants (49 pathologists and 14 residents) were registered for e-Learning. However, 14 participants were nonresponsive after registration, and thus 36 pathologists and 13 residents from 19 countries started the e-Learning; these 49 participants were used for analysis. All participating pathologists had gastrointestinal pathology as a subspecialty; however, most of them had more than one subspecialty. The residents had not yet chosen a pathology subspecialty. Thirty-six (73%) participants (28 pathologists and 8 residents) finished the training set and continued with the test set. In total, 31 (63%) participants finished the whole cycle (3 sets) of the e-Learning. A complete overview and reasons for participants to drop out are shown in Figure 1. The participants who quit the study were left out of the analysis.

Observer Agreement
After finishing the training set of the e-Learning, the observer agreement was determined. Twenty-four (67%) participants (19 pathologists and 5 residents) passed the training set at the first attempt (ie, κ≥0.7). Two participants (both residents) needed a third chance to pass the training set. The median κ score for the training set was 0.72. The test set was passed the first time by 31 (91%) participants (23 pathologists and 8 residents) and 3 (9%) pathologists had to repeat the test set. The median κ score for the test set was 0.77. After a 2-month washout period, 28 (90%) participants (21 pathologists and 7 residents) directly passed the repetition set, with a median κ score of 0.76 (Table 1, Figure 2). A significant improvement was observed from the training set to the test set (P=.002). No significant changes of the κ scores were observed between the test set and the repetition set (P=.30, Figure 3). Intraobserver agreement was measured for the 31 participants who finished the repetition set. The median κ score of the intraobserver agreement was 0.77 (Table 1). Scoring results from pathologists showed significant improvement from the training set to the test set (P=.006) and no fall back (P=.74) after a washout period of 2 months. The scoring results of the residents showed no significant changes between sets (P=.26 and P=.13). Details are shown in Multimedia Appendix 4.

Difficulty of the Questions
For each case, participants were asked whether scoring the TSR of the section was easy, normal, or difficult. If a case was rated as difficult, a reason could be given. A case was classified as difficult when at least 40% of the participants rated it as difficult. Eleven cases (9 different cases) were classified as difficult (4 in the training set, 2 in the test set, and 5 in the repetition set). The cases classified as difficult in the test set remained difficult in the repetition set. The three other cases of the repetition set were more difficult than average in the test set. As expected, most of the difficult cases were those close to the cut-off value of 50%, mucinous tumors, tumors with a lot of necrosis, and tumors in which the distinction between the stroma and the smooth muscle was difficult (see Multimedia Appendix 5 for examples of difficult cases). Overall, the 11 cases were answered incorrectly more often (29% of the answers) by participants who classified (one of) these cases as difficult, compared with 19% incorrect answers when the cases were assessed as not difficult (see Multimedia Appendix 6 for the subdivision per case).
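The 40% rule for labeling a case as difficult can be written as a short check. This is a minimal sketch; the function name, the rating labels, and the example ratings are hypothetical and are not taken from the e-Learning software or the study data.

```python
def is_difficult_case(ratings, threshold_percent=40):
    """True when at least `threshold_percent` of the ratings are 'difficult'."""
    if not ratings:
        raise ValueError("no ratings for this case")
    difficult = sum(r == "difficult" for r in ratings)
    # Integer comparison avoids floating-point issues exactly at the threshold.
    return difficult * 100 >= threshold_percent * len(ratings)


# Hypothetical ratings from 10 participants for a single case.
ratings = ["difficult"] * 4 + ["normal"] * 3 + ["easy"] * 3
flagged = is_difficult_case(ratings)  # 4 of 10 ratings reach the 40% threshold
```

With one fewer "difficult" rating (3 of 10), the same case would no longer be flagged.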

Drawing Annotations
A few cases were used in both the training and test sets. Analysis of these cases showed progress of scoring at the hotspot, as the annotations were more centered (Figure 4). Furthermore, in stroma-low cases, annotations were more widely spread over the entire tumor area, whereas one or two hotspots were more often identified in stroma-high cases (Figure 5).

Discussion
This study shows that e-Learning is an accurate method to instruct pathologists and pathology residents for scoring the TSR. A κ score ≥0.7 was used to define the reproducibility of the method. The median κ score improved and the minimum value increased after completing the consecutive sets for scoring the TSR. Significant progress (P=.002) was observed from the training to the test set. If a set was not passed the first time, feedback was given to the participant before they repeated the same set; however, participants did not get insight into their precise mistakes. The feedback was personalized, but not case-specific.
A decrease was observed between the number of created accounts, the number of participants who started the e-Learning, and the number of participants who completed the whole module. There are multiple reasons for this decrease. Most of the participants were registered by the principal investigator of an institute; however, not everyone responded to their registration. Another reason for dropout was withdrawal of the center from the UNITED study.
In daily diagnostics, H&E-stained sections are used for determining cancer stage. To ensure the high quality of the sections used for e-Learning, the original H&E-stained slides of the cases used for diagnostic purposes were scanned. Furthermore, all slides were scanned at the same magnification to avoid differences in the quality of the digital images.
When reviewing the e-Learning sets, in some cases it seemed as if the participants had scored the tumor percentage instead of the stroma percentage. This might be explained by the fact that pathologists are accustomed to scoring the neoplastic cell percentage for molecular analysis. Another explanation might be more related to semantics. The amount of tumor is in the numerator, which might be a source of confusion. In these doubtful cases, the participant was asked to reevaluate the case as well as some others. Thus, scoring the percentage of stroma remains a point of attention.
Overall, participants were well able to choose the right area for scoring the stroma percentage and to estimate whether a section was stroma-high or stroma-low. A common misinterpretation was scoring at the invasive front instead of looking for an area with as much stroma as possible within the section. This might be explained by the fact that pathologists are accustomed to scoring the tumor budding at the invasive front [26]. Furthermore, as the scoring protocol describes the use of the section from the deepest part of the tumor (usually the section used for determining the T-status) [3], the distinction between scoring the TSR in the whole tumor area and not necessarily at the invasive front might not have been made clear enough in the instructions. When comparing the results of pathologists and residents, pathologists showed significant improvement from the training to the test set, whereas residents did not. A possible explanation is the small group of residents (n=8) or the fact that pathologists are more experienced than residents.
Research performed on other pathology biomarkers used in daily practice such as lymphovascular invasion [27], tumor grading [28,29], classification and grading of colorectal polyps or adenomas [19,20,30-33], and the estimation of tumor cell percentage [34] has shown weak to moderate interobserver agreement. With three median κ scores above 0.7, the interobserver agreement for scoring the TSR in this study was found to be good. Although no comparison arm was included in this study, the median κ values obtained in this study are lower than those reported previously. This can be explained by the fact that the scores were low in the training set, which improved in the test set.
Digital pathology is increasingly entering pathology practice, although most pathology departments are not yet (fully) digitalized. In the future, a digital image analysis program will be useful for more accurate scoring and even better reproducibility. Digital pathology for teaching goals has some advantages and disadvantages. The advantages include easy distribution of samples and being able to score a set at the same time while reaching a worldwide group of pathologists, and disadvantages include possible software flaws or bugs when using a digitized workflow. In this study, the placed annotations were not always saved correctly, which sometimes made it difficult to analyze the results for a participant. In these particular cases, the participant was asked to reevaluate the case.
In conclusion, this study showed that e-Learning is a good and effective method to instruct pathologists and residents in scoring