Correction: Predicting Health Material Accessibility: Development of Machine Learning Algorithms

[This corrects the article DOI: 10.2196/29175.]


Introduction
• Under "Material Collection and Classification," the last sentence "The final classification contained two sets of texts: easy (n=499) versus difficult (n=501;..." has been replaced by "The final classification contained two sets of texts: easy (n=495) versus difficult (n=505;...."
• Under "Material Annotation and Semantic Feature Extraction," the sentence "With USAS, we collected 108 semantic features" has been replaced by "With USAS, we collected 113 semantic features."
• Under "Statistical Analysis of Multidimensional Semantic Features in English Educational Health Texts," in the first paragraph, the sentence "A total of 29 of the 113 semantic features were identified as statistically significant…" has been replaced by "A total of 26 of the 113 semantic features were identified as statistically significant...."
• Under "Statistical Analysis of Multidimensional Semantic Features in English Educational Health Texts," in the first paragraph, the sentence "The mean score of Z8 in health texts of higher understandability was 52.91, this dropped to 20.15 ..." has been replaced by "The mean score of Z8 in health texts of higher understandability was 52.84, this dropped to 20.48...."
• Under "Statistical Analysis of Multidimensional Semantic Features in English Educational Health Texts," in the first paragraph, the sentence "…was 0.929 ...easy reading was 0.929..." has been replaced by "...was 0.928 (95% CI 0.905-0.951)… easy reading was 0.928...."
• Under "Statistical Analysis of Multidimensional Semantic Features in English Educational Health Texts," in the first paragraph, the sentence "…were identified as statistically significant (P=.005)" has been replaced by "...were identified as statistically significant (P=.01

• Under "Statistical Analysis of Multidimensional Semantic Features in English Educational Health Texts," in the second paragraph, the sentence "To a lesser extent, the odds ratio of 1.082 of L2 (living creatures including microorganisms) indicates that with the increase of one word in this class, the perceived difficulty level (hard-to-understand class) of the health text increased by a mean 8.2% (95% CI 0.3%-16.7%) depending on the vocabulary range of English health terms of the readers" has been replaced by "To a lesser extent, the odds ratio of 1.080 of L2 (living creatures including microorganisms) indicates that with the increase of one word in this class, the perceived difficulty level (hard-to-understand class) of the health text increased by a mean 8.0% (95% CI 0.5%-16.2%) depending on the vocabulary range of English health terms of the readers."
• Under "Statistical Analysis of Multidimensional Semantic Features in English Educational Health Texts," in the second paragraph, the sentence "These include A2 (general or abstract terms denoting the propensity for changes, such as adapt, adjust for, conversion, and alter; odds ratio 1.057, 95% CI 1.005-1.111; P=.03), A7 (abstract terms of modality, such as possibility, necessity, and certainty; odds ratio 1.099, 95% CI 1.006-1.2; P=.04), A11 (abstract terms denoting importance, significance, noticeability, or markedness; odds ratio 1.164, 95% CI 1.003-1.351; P=.045)" has been replaced by "These include A11 (abstract terms denoting importance, significance, noticeability, or markedness; odds ratio 1.219, 95% CI 1.070-1.388; P=.003)."
• Under "Statistical Analysis of Multidimensional Semantic Features in English Educational Health Texts," in the second paragraph, the sentence "This means that with the increase of one word in the A11 class, the odds of the health text being seen as a hard-to-understand text over the text being seen as an easy text was 1.164, or an increase of 16.4%" has been replaced by "This means that with the increase of one unit in the A11 class, the odds of the health text being seen as a hard-to-understand text over the text being seen as an easy text was 1.219, or an increase of 21.9%."
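
As an arithmetic check on the corrected figures above, an odds ratio converts to a percentage change in odds as (odds ratio − 1) × 100. The minimal Python sketch below applies this to the corrected values reported for L2 and A11. The helper name `odds_ratio_to_percent` is hypothetical and is not taken from the paper.

```python
def odds_ratio_to_percent(odds_ratio: float) -> float:
    """Percentage change in odds for a one-unit increase in the predictor."""
    return round((odds_ratio - 1.0) * 100.0, 1)


# Corrected odds ratios as quoted in this notice.
corrected_odds_ratios = {
    "L2": 1.080,
    "A11": 1.219,
}

# Corrected 95% CI for the A11 odds ratio, as quoted in this notice.
a11_ci = (1.070, 1.388)

for feature, odds_ratio in corrected_odds_ratios.items():
    change = odds_ratio_to_percent(odds_ratio)
    print(f"{feature}: odds ratio {odds_ratio:.3f} -> {change}% increase in odds")

low, high = (odds_ratio_to_percent(bound) for bound in a11_ci)
print(f"A11 95% CI expressed as percentage change: {low}% to {high}%")
```

The percentages obtained this way match the 8.0% and 21.9% stated in the replacement sentences.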

• Under "Methods," in the second paragraph, the sentences "For a decision tree classifier, the best-point hyperparameters (Figure 1) were the maximum number of tree splits (n=22)

• Under "Results," in the first paragraph, the sentences "The mean scores and SDs of the area under the operating characteristic curve (AUC), sensitivity, specificity, and accuracy were obtained through 10-fold cross-validation. The cross-validation divided the entire data set into 10 folds of equal size. In each iteration, 9 folds were used for the training data, and the remaining fold was used as the testing data. As a result, on completion of the 10-fold cross-validation, each fold was used as the testing data exactly once. We used pairwise corrected resampled t test to counteract the issue of multiple comparisons. As the result, the significance level was adjusted to .008 (n=6; α=.05) using Bonferroni correction" have been replaced by "The mean scores and standard deviations of the area under the operating characteristic curve (AUC), sensitivity, specificity, and accuracy were obtained through 5-fold cross-validation. The cross-validation divided the entire data set into 5 folds of equal size. In each iteration, 4 folds were used for the training data, and the remaining fold was used as the testing data. As a result, on completion of the 5-fold cross-validation, each fold was used as the testing data exactly once. We used paired-sample comparisons to investigate the area under the operating characteristic curve (AUC), sensitivity, specificity, and accuracy differences of four machine learning algorithms (n=6; α=.05)."
• Under "Results," the second paragraph "Table 2 . These results suggest that, when using semantic features as predictor variables, the most stable and highest-performing algorithm is ensemble classifier (LogitBoost), followed by optimized decision tree. LogitBoost, decision tree, and SVM all achieved statistically significant improvement over logistic regression in AUC, specificity, and accuracy. Decision tree and SVM did not improve over logistic regression in terms of sensitivity, but LogitBoost did. Overall, the best AUC, sensitivity, specificity, and accuracy were achieved by LogitBoost as an ensemble classifier (Figure 4)" has been replaced by "(P=.001 and P<.001, respectively), and accuracy (P=.022 and P=.001, respectively). Only ensemble classifier outperformed decision tree significantly in terms of model sensitivity (P=.024), and specificity (P=.010), using the paired-sample comparisons (n=6; α=.05). These results suggest that, when using semantic features as predictor variables, the most stable and highest-performing algorithm is ensemble classifier (LogitBoost), followed by SVM. Ensemble classifier, decision tree, and SVM all achieved statistically significant improvement over logistic regression in AUC, specificity, sensitivity, and accuracy. SVM did not improve significantly over decision tree in terms of sensitivity and specificity, but ensemble classifier did. Overall, the best AUC, sensitivity, specificity, and accuracy were achieved by LogitBoost as an ensemble classifier (Figure 4)."
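
The corrected 5-fold procedure described in the "Results" bullets can be illustrated with a minimal Python sketch. It only shows how indices are partitioned so that each fold serves as the testing data exactly once. The shuffling, the seed, and the function name `k_fold_splits` are assumptions for illustration, since the notice does not state how the authors partitioned or stratified the data. The sketch also counts the pairs among four algorithms, on the assumption that "n=6" refers to the number of pairwise comparisons.

```python
import random
from itertools import combinations


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of k folds.

    Assumption: plain shuffled partitioning, with no stratification by class.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding by k gives folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for test_fold in range(k):
        test = folds[test_fold]
        train = [idx for j, fold in enumerate(folds) if j != test_fold for idx in fold]
        yield train, test


# Corrected class sizes quoted in this notice: easy (n=495), difficult (n=505).
n_texts = 495 + 505

splits = list(k_fold_splits(n_texts, k=5))
for fold_number, (train, test) in enumerate(splits, start=1):
    print(f"fold {fold_number}: {len(train)} training texts, {len(test)} testing texts")

# Four algorithms compared pairwise.
algorithms = ["logistic regression", "decision tree", "SVM", "ensemble classifier (LogitBoost)"]
pairs = list(combinations(algorithms, 2))
print(f"{len(pairs)} pairwise comparisons")
```

With 1000 texts and 5 folds, each iteration trains on 4 folds and tests on the remaining one, as the replacement text describes.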

Discussion
• Under "Principal Findings," in the second paragraph, the sentence "…(measured in pairwise resampled t tests, with P value adjusted to .008 using Bonferroni correction)" has been replaced by "…(measured in pairwise resampled t tests)."
• Under "Principal Findings," in the last paragraph, the sentence "…or those requiring higher cognitive abilities, such as assessing the propensity for changes and expressions of modality describing possibility, necessity, and certainty of health events and situations" has been replaced by "…or those requiring higher cognitive abilities, such as abstract terms denoting importance, significance, noticeability or markedness of health events and situations."

Authors' Contributions
In the originally published paper, the following "Authors' Contributions" section was not included.
MJ and TH were responsible for overall research design; MJ was responsible for paper writing and revision, and YL was responsible for formal analysis and data curation.

Multimedia Appendices
The information presented in Multimedia Appendix 1, entitled "Variables in the logistic regression of health text understandability membership," has been updated. The originally published Multimedia Appendix 1 is in Multimedia Appendix 2.
The authors confirm that the results and conclusions of the corrected data are consistent with those in the originally published version.

These corrections will appear in the online version of the paper on the JMIR website on September 21, 2021, together with the publication of this correction notice. Because this was made after submission to PubMed, PubMed Central, and other full-text repositories, the corrected article has also been resubmitted to those repositories.