@Article{info:doi/10.2196/74345, author="Borycki, Elizabeth", title="2024: A Year of Nursing Informatics Research in Review", journal="JMIR Nursing", year="2025", month="May", day="7", volume="8", pages="e74345", keywords="nursing informatics", keywords="health informatics", keywords="research", keywords="practice", keywords="education", keywords="trends", keywords="artificial intelligence", keywords="data science", doi="10.2196/74345", url="https://nursing.jmir.org/2025/1/e74345" }

@Article{info:doi/10.2196/65961, author="Klopfenstein, In{\`e}s Sophie Anne and Flint, Rike Anne and Heeren, Patrick and Prendke, Mona and Chaoui, Amin and Ocker, Thomas and Chromik, Jonas and Arnrich, Bert and Balzer, Felix and Poncette, Akira-Sebastian", title="Developing a Scalable Annotation Method for Large Datasets That Enhances Alarms With Actionability Data to Increase Informativeness: Mixed Methods Approach", journal="J Med Internet Res", year="2025", month="May", day="5", volume="27", pages="e65961", keywords="alarm management", keywords="alarm fatigue", keywords="alarm informativeness", keywords="patient monitoring", keywords="dataset annotation", keywords="intensive care unit", keywords="transdisciplinary research", keywords="machine learning", keywords="technological innovation", keywords="patient-centered care", keywords="digital health", abstract="Background: Alarm fatigue, a multifactorial desensitization of staff to alarms, can harm both patients and health care staff in intensive care units (ICUs), especially due to false and nonactionable alarms. Increasing amounts of routinely collected alarm and ICU patient data are paving the way for training machine learning (ML) models that may help reduce the number of nonactionable alarms, potentially increasing alarm informativeness and reducing alarm fatigue.
At present, however, there is no publicly available dataset or process that routinely collects information on alarm actionability (ie, whether an alarm triggers a medical intervention or not), which is a key feature for developing meaningful ML models for alarm management. Furthermore, case-based manual annotation is too slow and resource intensive for large amounts of data. Objective: We propose a scalable method to annotate patient monitoring alarms associated with patient-related variables regarding their actionability. While the method is intended primarily for use in our institution, other clinicians, scientists, and industry stakeholders could reuse it to build their own datasets. Methods: The interdisciplinary research team followed a mixed methods approach to develop the annotation method, using data-driven, qualitative, and empirical strategies. The iterative process consisted of six steps: (1) defining alarm terms; (2) reaching a consensus on an annotation concept and documentation structure; (3) defining physiological alarm conditions, related medical interventions, and time windows to assess; (4) developing mapping tables; (5) creating the annotation rule set; and (6) evaluating the generated content. All decisions were made based on feasibility criteria, clinical relevance, occurrence frequency, data availability and quantity, structure, and storage mode. The annotation guideline development process was preceded by the analysis of the institution's data and systems, the evaluation of device manuals, and a systematic literature review. Results: In a multidisciplinary consensus-based approach, we defined preprocessing steps and a rule-based annotation method to classify alarms as either actionable or nonactionable based on data from the patient data management system. We have presented our experience in developing the annotation method and provided the generated resources.
The method focuses on respiratory and medication management interventions and includes 8 general rules in a tabular format that are accompanied by graphical examples. Mapping tables enable handling unstructured information and are referenced in the annotation rule set. Conclusions: Our annotation method will enable a large number of alarms to be labeled semiautomatically, retrospectively, and quickly, and will provide information on their actionability based on further patient data. This will make it possible to generate annotated datasets for ML models in alarm management and alarm fatigue research. We believe that our annotation method and the resources provided are universal enough and could be used by others to prepare data for future ML projects, even beyond the topic of alarms. ", doi="10.2196/65961", url="https://www.jmir.org/2025/1/e65961" }

@Article{info:doi/10.2196/70576, author="Paradise Vit, Abigail and Magid, Avi", title="Exploring Topics, Emotions, and Sentiments in Health Organization Posts and Public Responses on Instagram: Content Analysis", journal="JMIR Infodemiology", year="2025", month="May", day="2", volume="5", pages="e70576", keywords="emotion analysis", keywords="fear", keywords="health communication", keywords="health care", keywords="Instagram", keywords="official health organizations", keywords="sentiment analysis", keywords="social media", keywords="vaccines", abstract="Background: Social media is a vital tool for health organizations, enabling them to share evidence-based information, educate the public, correct misinformation, and support a more informed and healthier society. Objective: This study aimed to categorize health organizations' content on social media into topics; examine public engagement, sentiment, and emotional responses to these topics; and identify gaps in fear between health organizations' messages and the public response.
Methods: Real data were collected from the official Instagram accounts of health organizations worldwide. The BERTopic algorithm for topic modeling was used to categorize health organizations' posts into distinct topics. For each identified topic, we analyzed the engagement metrics (number of comments and likes) of posts categorized under the same topic, calculating the average engagement received. We examined the sentiment and emotional content of both posts and responses within the same topic, providing insights into the distributions of sentiment and emotions for each topic. Special attention was given to identifying emotions, such as fear, expressed in the posts and responses. In addition, a linguistic analysis and an analysis of sentiments and emotions over time were conducted. Results: A total of 6082 posts and 82,982 comments were collected from the official Instagram accounts of 8 health organizations. The study revealed that topics related to COVID-19, vaccines, and humanitarian crises (such as the Ukraine conflict and the war in Gaza) generated the highest engagement. Our sentiment analysis of the responses to health organizations' posts showed that topics related to vaccines and monkeypox generated the highest percentage of negative responses. Fear was the dominant emotion expressed in the posts' text, while the public's responses showed more varied emotions, with anger notably high in discussions around vaccines. Gaps were observed between the level of fear conveyed in posts published by health organizations and the fear conveyed in the public's responses to such posts, especially regarding mask wearing during COVID-19 and the influenza vaccine. Conclusions: This study underscores the importance of transparent communication that considers the emotional and sentiment-driven responses of the public on social media, particularly regarding vaccines.
Understanding the psychological and social dynamics associated with public interaction with health information online can help health organizations achieve public health goals, fostering trust, countering misinformation, and promoting informed health behavior. ", doi="10.2196/70576", url="https://infodemiology.jmir.org/2025/1/e70576" }

@Article{info:doi/10.2196/63765, author="Acosta-Perez, Fernando and Boutilier, Justin and Zayas-Caban, Gabriel and Adelaine, Sabrina and Liao, Frank and Patterson, Brian", title="Toward Real-Time Discharge Volume Predictions in Multisite Health Care Systems: Longitudinal Observational Study", journal="J Med Internet Res", year="2025", month="Apr", day="30", volume="27", pages="e63765", keywords="discharge", keywords="machine learning", keywords="predict", keywords="capacity management", keywords="discharge predictions", keywords="predictive modeling", keywords="emergency department", keywords="hospital admission", keywords="hospital data", abstract="Background: Emergency department (ED) admissions are one of the most critical decisions made in health care, with 40\% of ED visits resulting in inpatient hospitalization for Medicare patients. A main challenge with the ED admissions process is the inability to move patients from the ED to an inpatient unit quickly. Identifying hospital discharge volume in advance may be valuable in helping hospitals determine capacity management mechanisms to reduce ED boarding, such as transferring low-complexity patients to neighboring hospitals. Although previous research has studied the prediction of discharges in the context of inpatient care, most of the work is on long-term predictions (ie, discharges within the next 24 to 48 hours) in single-site health care systems.
In this study, we approach the problem of inpatient discharge prediction from a system-wide lens and evaluate the potential interactions between the two facilities in our partner multisite system to predict short-term discharge volume. Objective: The objective of this paper was to predict discharges from the general care units within a large tertiary teaching hospital network in the Midwest and evaluate the impact of external information from other hospitals on model performance. Methods: We conducted 2 experiments with 174,799 discharge records from 2 hospitals. In Experiment 1, we predicted the number of discharges across 2 time points (within the hour and the next 4 hours) using random forest (RF) and linear regression (LR) models. Models with access to internal hospital data (ie, system-agnostic) were compared with models with access to additional data from the other hospitals in the network (ie, system-aware). In Experiment 2, we evaluated the performance of an RF model to predict afternoon discharges (ie, 12 PM to 4 PM) 1 to 4 hours in advance. Results: In Experiment 1 and Hospital 1, RF and LR models performed equivalently, with R2 scores varying from 0.76 (hourly) to 0.89 (4 hours). In Hospital 2, the RF model performed best, with scores varying from 0.68 (hourly) to 0.84 (4 hours), while scores for LR models ranged from 0.63 to 0.80. There was no significant difference in performance between a system-aware approach and a system-agnostic one. In Experiment 2, the mean absolute percentage error increased from 11\% to 16\% when predicting 4 hours in advance relative to zero hours in Hospital 1 and from 24\% to 35\% in Hospital 2. Conclusions: Short-term discharges in multisite hospital systems can be locally predicted with high accuracy, even when predicting hours in advance. ", doi="10.2196/63765", url="https://www.jmir.org/2025/1/e63765" }

@Article{info:doi/10.2196/66231, author="Varner, J.
Kristin and Keeler Bruce, Lauryn and Soltani, Severine and Hartogensis, Wendy and Dilchert, Stephan and Hecht, M. Frederick and Chowdhary, Anoushka and Pandya, Leena and Dasgupta, Subhasis and Altintas, Ilkay and Gupta, Amarnath and Mason, E. Ashley and Smarr, L. Benjamin", title="Sex Differences in the Variability of Physical Activity Measurements Across Multiple Timescales Recorded by a Wearable Device: Observational Retrospective Cohort Study", journal="J Med Internet Res", year="2025", month="Apr", day="28", volume="27", pages="e66231", keywords="wearables", keywords="activity", keywords="sex as a biological variable", keywords="time series variance", keywords="timescales of change", keywords="metabolic equivalents", keywords="metabolic equivalent of task", keywords="sex differences", abstract="Background: A substantially lower proportion of female individuals participate in sufficient daily activity compared to male individuals despite the known health benefits of exercise. Investment in female sports and exercise medicine research may help close this gap; however, female individuals are underrepresented in this research. Hesitancy to include female participants is partly due to assumptions that biological rhythms driven by menstrual cycles and occurring on the timescale of approximately 28 days increase intraindividual biological variability and weaken statistical power. An analysis of continuous skin temperature data measured using a commercial wearable device found that temperature cycles indicative of menstrual cycles did not substantially increase variability in female individuals' skin temperature. In this study, we explore physical activity (PA) data as a variable more related to behavior, whereas temperature is more reflective of physiological changes. 
Objective: We aimed to determine whether intraindividual variability of PA is affected by biological sex, and if so, whether having menstrual cycles (as indicated by temperature rhythms) contributes to increased female intraindividual PA variability. We then sought to compare the effect of sex and menstrual cycles on PA variability to the effect of PA rhythms on the timescales of days and weeks and to the effect of nonrhythmic temporal structure in PA on the timescale of decades of life (age). Methods: We used minute-level metabolic equivalent of task data collected using a wearable device across a 206-day study period for each of 596 individuals as an index of PA to assess the magnitudes of variability in PA accounted for by biological sex and temporal structure on different timescales. Intraindividual variability in PA was represented by the consecutive disparity index. Results: Female individuals (regardless of whether they had menstrual cycles) demonstrated lower intraindividual variability in PA than male individuals (Kruskal-Wallis H=29.51; P<.001). Furthermore, individuals with menstrual cycles did not have greater intraindividual variability than those without menstrual cycles (Kruskal-Wallis H=0.54; P=.46). PA rhythms differed at the weekly timescale: individuals with increased or decreased PA on weekends had larger intraindividual variability (Kruskal-Wallis H=10.13; P=.001). In addition, intraindividual variability differed by decade of life, with older age groups tending to have less variability in PA (Kruskal-Wallis H=40.55; P<.001; Bonferroni-corrected significance threshold for 15 comparisons: P=.003). A generalized additive model predicting the consecutive disparity index of 24-hour metabolic equivalent of task sums (intraindividual variability of PA) showed that sex, age, and weekly rhythm accounted for only 11\% of the population variability in intraindividual PA variability. 
Conclusions: The exclusion of people from PA research based on their biological sex, age, the presence of menstrual cycles, or the presence of weekly rhythms in PA is not supported by our analysis. ", doi="10.2196/66231", url="https://www.jmir.org/2025/1/e66231" }

@Article{info:doi/10.2196/57530, author="Dammas, Shaima and Weyde, Tillman and Tapper, Katy and Spanakis, Gerasimos and Roefs, Anne and Pothos, M. Emmanuel", title="Prediction of Snacking Behavior Involving Snacks Having High Levels of Saturated Fats, Salt, or Sugar Using Only Information on Previous Instances of Snacking: Survey- and App-Based Study", journal="JMIR Med Inform", year="2025", month="Apr", day="23", volume="13", pages="e57530", keywords="high fats, salt, or sugar snacks", keywords="machine learning algorithms", keywords="internet data collection", keywords="just-in-time interventions", abstract="Background: Consuming high amounts of foods or beverages with high levels of saturated fats, salt, or sugar (HFSS) can be harmful for health. Many snacks fall into this category (HFSS snacks). However, the palatability of these snacks means that people can sometimes struggle to reduce their intake. Machine learning algorithms could help in predicting the likely occurrence of HFSS snacking so that just-in-time adaptive interventions can be deployed. However, HFSS snacking data have certain characteristics, such as sparseness and incompleteness, which make snacking prediction a challenge for machine learning approaches. Previous attempts have employed several potential predictor variables and have achieved considerable success. Nevertheless, collecting information from several dimensions requires several potentially burdensome user questionnaires, and thus, this approach may be less acceptable for the general public.
Objective: Our aim was to consider the capacity of standard (unmodified in any way to tailor to the specific learning problem) machine learning algorithms to predict HFSS snacking based on the following minimal data that can be collected in a mostly automated way: day of the week, time of the day (divided into time bins), and location (divided into work, home, and other). Methods: A total of 111 participants in the United Kingdom were asked to record HFSS snacking occurrences and the location category over a period of 28 days, and this was considered the UK dataset. Data collection was facilitated by a purpose-specific app (Snack Tracker). Additionally, a similar dataset from the Netherlands was used (Dutch dataset). Both datasets were analyzed using machine learning methods, including random forest regressor, Extreme Gradient Boosting regressor, feed forward neural network, and long short-term memory. We additionally employed 2 baseline statistical models for prediction. In all cases, the prediction problem was the time to the next HFSS snack from the current one, and the evaluation metric was the mean absolute error. Results: The ability of machine learning methods to predict the time of the next HFSS snack was assessed. The quality of the prediction depended on the dataset, temporal resolution, and machine learning algorithm employed. In some cases, predictions were accurate to as low as 17 minutes on average. In general, machine learning methods outperformed the baseline models, but no machine learning method was clearly better than the others. Feed forward neural network showed a very marginal advantage. Conclusions: The prediction of HFSS snacking using sparse data is possible with reasonable accuracy. Our findings offer a foundation for further exploring how machine learning methods can be used in health psychology and provide directions for further research.
", doi="10.2196/57530", url="https://medinform.jmir.org/2025/1/e57530" }

@Article{info:doi/10.2196/69813, author="Hussein, Rada and Gyrard, Amelie and Abedian, Somayeh and Gribbon, Philip and Mart{\'i}nez, Alabart Sara", title="Interoperability Framework of the European Health Data Space for the Secondary Use of Data: Interactive European Interoperability Framework--Based Standards Compliance Toolkit for AI-Driven Projects", journal="J Med Internet Res", year="2025", month="Apr", day="23", volume="27", pages="e69813", keywords="artificial intelligence", keywords="European Health Data Space", keywords="European interoperability framework", keywords="healthcare standards interoperability", keywords="secondary use of health data", doi="10.2196/69813", url="https://www.jmir.org/2025/1/e69813", url="http://www.ncbi.nlm.nih.gov/pubmed/40266673" }

@Article{info:doi/10.2196/66616, author="Kapitan, Daniel and Heddema, Femke and Dekker, Andr{\'e} and Sieswerda, Melle and Verhoeff, Bart-Jan and Berg, Matt", title="Data Interoperability in Context: The Importance of Open-Source Implementations When Choosing Open Standards", journal="J Med Internet Res", year="2025", month="Apr", day="15", volume="27", pages="e66616", keywords="FHIR", keywords="OMOP", keywords="openEHR", keywords="health care informatics", keywords="information standards", keywords="secondary use", keywords="digital platform", keywords="data sharing", keywords="data interoperability", keywords="open source implementations", keywords="open standards", keywords="Fast Health Interoperability Resources", keywords="Observational Medical Outcomes Partnership", keywords="clinical care", keywords="data exchange", keywords="longitudinal analysis", keywords="low income", keywords="middle-income", keywords="LMIC", keywords="low and middle-income countries", keywords="developing countries", keywords="developing nations", keywords="health information exchange", doi="10.2196/66616", url="https://www.jmir.org/2025/1/e66616",
url="http://www.ncbi.nlm.nih.gov/pubmed/40232773" }

@Article{info:doi/10.2196/70983, author="Schmit, D. Cason and O'Connell, Curry Meghan and Shewbrooks, Sarah and Abourezk, Charles and Cochlin, J. Fallon and Doerr, Megan and Kum, Hye-Chung", title="Dying in Darkness: Deviations From Data Sharing Ethics in the US Public Health System and the Data Genocide of American Indian and Alaska Native Communities", journal="J Med Internet Res", year="2025", month="Mar", day="26", volume="27", pages="e70983", keywords="ethics", keywords="information dissemination", keywords="indigenous peoples", keywords="public health surveillance", keywords="privacy", keywords="data sharing", keywords="deidentification", keywords="data anonymization", keywords="public health ethics", keywords="data governance", doi="10.2196/70983", url="https://www.jmir.org/2025/1/e70983" }

@Article{info:doi/10.2196/65001, author="Huang, Tracy and Ngan, Chun-Kit and Cheung, Ting Yin and Marcotte, Madelyn and Cabrera, Benjamin", title="A Hybrid Deep Learning--Based Feature Selection Approach for Supporting Early Detection of Long-Term Behavioral Outcomes in Survivors of Cancer: Cross-Sectional Study", journal="JMIR Bioinform Biotech", year="2025", month="Mar", day="13", volume="6", pages="e65001", keywords="machine learning", keywords="data driven", keywords="clinical domain--guided framework", keywords="survivors of cancer", keywords="cancer", keywords="oncology", keywords="behavioral outcome predictions", keywords="behavioral study", keywords="behavioral outcomes", keywords="feature selection", keywords="deep learning", keywords="neural network", keywords="hybrid", keywords="prediction", keywords="predictive modeling", keywords="patients with cancer", keywords="deep learning models", keywords="leukemia", keywords="computational study", keywords="computational biology", abstract="Background: The number of survivors of cancer is growing, and they often experience negative long-term behavioral outcomes due to
cancer treatments. There is a need for better computational methods to handle and predict these outcomes so that physicians and health care providers can implement preventive treatments. Objective: This study aimed to create a new feature selection algorithm to improve the performance of machine learning classifiers to predict negative long-term behavioral outcomes in survivors of cancer. Methods: We devised a hybrid deep learning--based feature selection approach to support early detection of negative long-term behavioral outcomes in survivors of cancer. Within a data-driven, clinical domain--guided framework to select the best set of features among cancer treatments, chronic health conditions, and socioenvironmental factors, we developed a 2-stage feature selection algorithm, that is, a multimetric, majority-voting filter and a deep dropout neural network, to dynamically and automatically select the best set of features for each behavioral outcome. We also conducted an experimental case study on existing study data with 102 survivors of acute lymphoblastic leukemia (aged 15-39 years at evaluation and >5 years postcancer diagnosis) who were treated in a public hospital in Hong Kong. Finally, we designed and implemented radial charts to illustrate the significance of the selected features on each behavioral outcome to support clinical professionals' future treatment and diagnoses. Results: In this pilot study, we demonstrated that our approach outperforms the traditional statistical and computation methods, including linear and nonlinear feature selectors, for the addressed top-priority behavioral outcomes. Our approach holistically has higher F1, precision, and recall scores compared to existing feature selection methods. The models in this study select several significant clinical and socioenvironmental variables as risk factors associated with the development of behavioral problems in young survivors of acute lymphoblastic leukemia. 
Conclusions: Our novel feature selection algorithm has the potential to improve machine learning classifiers' capability to predict adverse long-term behavioral outcomes in survivors of cancer. ", doi="10.2196/65001", url="https://bioinform.jmir.org/2025/1/e65001", url="http://www.ncbi.nlm.nih.gov/pubmed/40080820" }

@Article{info:doi/10.2196/64354, author="Ehrig, Molly and Bullock, S. Garrett and Leng, Iris Xiaoyan and Pajewski, M. Nicholas and Speiser, Lynn Jaime", title="Imputation and Missing Indicators for Handling Missing Longitudinal Data: Data Simulation Analysis Based on Electronic Health Record Data", journal="JMIR Med Inform", year="2025", month="Mar", day="13", volume="13", pages="e64354", keywords="missing indicator method", keywords="missing data", keywords="imputation", keywords="longitudinal data", keywords="electronic health record data", keywords="electronic health records", keywords="EHR", keywords="simulation study", keywords="clinical prediction model", keywords="prediction model", keywords="older adults", keywords="falls", keywords="logistic regression", keywords="prediction modeling", abstract="Background: Missing data in electronic health records are highly prevalent and result in analytical concerns such as heterogeneous sources of bias and loss of statistical power. One simple analytic method for addressing missing or unknown covariate values is to treat missingness for a particular variable as a category onto itself, which we refer to as the missing indicator method. For cross-sectional analyses, recent work suggested that there was minimal benefit to the missing indicator method; however, it is unclear how this approach performs in the setting of longitudinal data, in which correlation among clustered repeated measures may be leveraged for potentially improved model performance.
Objectives: This study aims to conduct a simulation study to evaluate whether the missing indicator method improved model performance and imputation accuracy for longitudinal data mimicking an application of developing a clinical prediction model for falls in older adults based on electronic health record data. Methods: We simulated a longitudinal binary outcome using mixed effects logistic regression that emulated a falls assessment at annual follow-up visits. Using multivariate imputation by chained equations, we simulated time-invariant predictors such as sex and medical history, as well as dynamic predictors such as physical function, BMI, and medication use. We induced missing data in predictors under scenarios that had both random (missing at random) and dependent missingness (missing not at random). We evaluated aggregate performance using the area under the receiver operating characteristic curve (AUROC) for models with and without missing indicators as predictors, as well as complete case analysis, across simulation replicates. We evaluated imputation quality using normalized root-mean-square error for continuous variables and percent falsely classified for categorical variables. Results: Independent of the mechanism used to simulate missing data (missing at random or missing not at random), overall model performance via AUROC was similar regardless of whether missing indicators were included in the model. The root-mean-square error and percent falsely classified measures were similar for models including missing indicators versus those with no missing indicators. Model performance and imputation quality were similar regardless of whether the outcome was related to missingness. Imputation with or without missing indicators had similar mean values of AUROC compared with complete case analysis, although complete case analysis had the largest range of values.
Conclusions: The results of this study suggest that the inclusion of missing indicators in longitudinal data modeling neither improves nor worsens overall performance or imputation accuracy. Future research is needed to address whether the inclusion of missing indicators is useful in prediction modeling with longitudinal data in different settings, such as high-dimensional data analysis. ", doi="10.2196/64354", url="https://medinform.jmir.org/2025/1/e64354" }

@Article{info:doi/10.2196/63216, author="Yang, Zhongbao and Xu, Shan-Shan and Liu, Xiaozhu and Xu, Ningyuan and Chen, Yuqing and Wang, Shuya and Miao, Ming-Yue and Hou, Mengxue and Liu, Shuai and Zhou, Yi-Min and Zhou, Jian-Xin and Zhang, Linlin", title="Large Language Model--Based Critical Care Big Data Deployment and Extraction: Descriptive Analysis", journal="JMIR Med Inform", year="2025", month="Mar", day="12", volume="13", pages="e63216", keywords="big data", keywords="critical care--related databases", keywords="database deployment", keywords="large language model", keywords="database extraction", keywords="intensive care unit", keywords="ICU", keywords="GPT", keywords="artificial intelligence", keywords="AI", keywords="LLM", abstract="Background: Publicly accessible critical care--related databases contain enormous clinical data, but their utilization often requires advanced programming skills. The growing complexity of large databases and unstructured data presents challenges for clinicians who need programming or data analysis expertise to utilize these systems directly. Objective: This study aims to simplify critical care--related database deployment and extraction via large language models. Methods: The development of this platform was a 2-step process. First, we enabled automated database deployment using Docker container technology, with incorporated web-based analytics interfaces Metabase and Superset.
Second, we developed the intensive care unit--generative pretrained transformer (ICU-GPT), a large language model fine-tuned on intensive care unit (ICU) data that integrated LangChain and Microsoft AutoGen. Results: The automated deployment platform was designed with user-friendliness in mind, enabling clinicians to deploy 1 or multiple databases in local, cloud, or remote environments without the need for manual setup. After successfully overcoming GPT's token limit and supporting multischema data, ICU-GPT could generate Structured Query Language (SQL) queries and extract insights from ICU datasets based on request input. A front-end user interface was developed for clinicians to achieve code-free SQL generation on the web-based client. Conclusions: By harnessing the power of our automated deployment platform and ICU-GPT model, clinicians are empowered to easily visualize, extract, and arrange critical care--related databases more efficiently and flexibly than manual methods. Our research could decrease the time and effort spent on complex bioinformatics methods and advance clinical research. 
", doi="10.2196/63216", url="https://medinform.jmir.org/2025/1/e63216" }

@Article{info:doi/10.2196/51804, author="Portela, Diana and Freitas, Alberto and Costa, El{\'i}sio and Giovannini, Mattia and Bousquet, Jean and Almeida Fonseca, Jo{\~a}o and Sousa-Pinto, Bernardo", title="Impact of Demographic and Clinical Subgroups in Google Trends Data: Infodemiology Case Study on Asthma Hospitalizations", journal="J Med Internet Res", year="2025", month="Mar", day="10", volume="27", pages="e51804", keywords="infodemiology", keywords="asthma", keywords="administrative databases", keywords="multimorbidity", keywords="co-morbidity", keywords="respiratory", keywords="pulmonary", keywords="Google Trends", keywords="correlation", keywords="hospitalization", keywords="admissions", keywords="autoregressive", keywords="information seeking", keywords="searching", keywords="searches", keywords="forecasting", abstract="Background: Google Trends (GT) data have shown promising results as a complementary tool to classical surveillance approaches. However, GT data are not necessarily provided by a representative sample of patients and may be skewed toward demographic and clinical groups that are more likely to use the internet to search for their health. Objective: In this study, we aimed to assess whether GT-based models perform differently in distinct population subgroups. To assess that, we analyzed a case study on asthma hospitalizations. Methods: We analyzed all hospitalizations with a main diagnosis of asthma occurring in 3 different countries (Portugal, Spain, and Brazil) for a period of approximately 5 years (January 1, 2012-December 17, 2016). Data on web-based searches on common cold for the same countries and time period were retrieved from GT. We estimated the correlation between GT data and the weekly occurrence of asthma hospitalizations (considering separate asthma admissions data according to patients' age, sex, ethnicity, and presence of comorbidities).
In addition, we built autoregressive models to forecast the weekly number of asthma hospitalizations (for the different aforementioned subgroups) for a period of 1 year (June 2015-June 2016) based on admissions and GT data from the 3 previous years. Results: Overall, correlation coefficients between GT on the pseudo-influenza syndrome topic and asthma hospitalizations ranged between 0.33 (in Portugal for admissions with at least one Charlson comorbidity group) and 0.86 (for admissions in women and in White people in Brazil). In the 3 assessed countries, forecasted hospitalizations for 2015-2016 correlated more strongly with observed admissions of older versus younger individuals (Portugal: Spearman $\rho$=0.70 vs $\rho$=0.56; Spain: $\rho$=0.88 vs $\rho$=0.76; Brazil: $\rho$=0.83 vs $\rho$=0.82). In Portugal and Spain, forecasted hospitalizations had a stronger correlation with admissions occurring for women than men (Portugal: $\rho$=0.75 vs $\rho$=0.52; Spain: $\rho$=0.83 vs $\rho$=0.51). In Brazil, stronger correlations were observed for admissions of White than of Black or Brown individuals ($\rho$=0.92 vs $\rho$=0.87). In Portugal, stronger correlations were observed for admissions of individuals without any comorbidity compared with admissions of individuals with comorbidities ($\rho$=0.68 vs $\rho$=0.66). Conclusions: We observed that the models based on GT data may perform differently in demographic and clinical subgroups of participants, possibly reflecting differences in the composition of internet users' health-seeking behaviors. ", doi="10.2196/51804", url="https://www.jmir.org/2025/1/e51804" } @Article{info:doi/10.2196/64721, author="Tawfik, Daniel and Rule, Adam and Alexanian, Aram and Cross, Dori and Holmgren, Jay A. and Lou, S. Sunny and McPeek Hinz, Eugenia and Rose, Christian and Viswanadham, N. Ratnalekha V. and Mishuris, G. Rebecca and Rodr{\'i}guez-Fern{\'a}ndez, M. Jorge and Ford, W. Eric and Florig, T. Sarah and Sinsky, A. 
Christine and Apathy, C. Nate", title="Emerging Domains for Measuring Health Care Delivery With Electronic Health Record Metadata", journal="J Med Internet Res", year="2025", month="Mar", day="6", volume="27", pages="e64721", keywords="metadata", keywords="health services research", keywords="audit logs", keywords="event logs", keywords="electronic health record data", keywords="health care delivery", keywords="patient care", keywords="healthcare teams", keywords="clinician-patient relationship", keywords="cognitive environment", doi="10.2196/64721", url="https://www.jmir.org/2025/1/e64721", url="http://www.ncbi.nlm.nih.gov/pubmed/40053814" } @Article{info:doi/10.2196/54543, author="Zhang, Chunyan and Wang, Ting and Dong, Caixia and Dai, Duwei and Zhou, Linyun and Li, Zongfang and Xu, Songhua", title="Exploring Psychological Trends in Populations With Chronic Obstructive Pulmonary Disease During COVID-19 and Beyond: Large-Scale Longitudinal Twitter Mining Study", journal="J Med Internet Res", year="2025", month="Mar", day="5", volume="27", pages="e54543", keywords="COVID-19", keywords="chronic obstructive pulmonary disease (COPD)", keywords="psychological trends", keywords="Twitter", keywords="data mining", keywords="deep learning", abstract="Background: Chronic obstructive pulmonary disease (COPD) ranks among the leading causes of global mortality, and COVID-19 has intensified its challenges. Beyond the evident physical effects, the long-term psychological effects of COVID-19 are not fully understood. Objective: This study aims to unveil the long-term psychological trends and patterns in populations with COPD throughout the COVID-19 pandemic and beyond via large-scale Twitter mining. Methods: A 2-stage deep learning framework was designed in this study. The first stage involved a data retrieval procedure to identify COPD and non-COPD users and to collect their daily tweets. 
In the second stage, a data mining procedure leveraged various deep learning algorithms to extract demographic characteristics, hashtags, topics, and sentiments from the collected tweets. Based on these data, multiple analytical methods, namely, odds ratio (OR), difference-in-difference, and emotion pattern methods, were used to examine the psychological effects. Results: A cohort of 15,347 COPD users was identified from the data that we collected in the Twitter database, comprising over 2.5 billion tweets, spanning from January 2020 to June 2023. The attentiveness toward COPD was significantly affected by gender, age, and occupation; it was lower in females (OR 0.91, 95\% CI 0.87-0.94; P<.001) than in males, higher in adults aged 40 years and older (OR 7.23, 95\% CI 6.95-7.52; P<.001) than in those younger than 40 years, and higher in individuals with lower socioeconomic status (OR 1.66, 95\% CI 1.60-1.72; P<.001) than in those with higher socioeconomic status. Across the study duration, COPD users showed decreasing concerns for COVID-19 and increasing health-related concerns. After the middle phase of COVID-19 (July 2021), a distinct decrease in sentiments among COPD users contrasted sharply with the upward trend among non-COPD users. Notably, in the post-COVID era (June 2023), COPD users showed reduced levels of joy and trust and increased levels of fear compared to their levels of joy and trust in the middle phase of COVID-19. Moreover, males, older adults, and individuals with lower socioeconomic status showed heightened fear compared to their counterparts. Conclusions: Our data analysis results suggest that populations with COPD experienced heightened mental stress in the post-COVID era. This underscores the importance of developing tailored interventions and support systems that account for diverse population characteristics. 
", doi="10.2196/54543", url="https://www.jmir.org/2025/1/e54543", url="http://www.ncbi.nlm.nih.gov/pubmed/40053739" } @Article{info:doi/10.2196/68083, author="Parciak, Marcel and Pierlet, No{\"e}lla and Peeters, M. Liesbet", title="Empowering Health Care Actors to Contribute to the Implementation of Health Data Integration Platforms: Retrospective of the medEmotion Project", journal="J Med Internet Res", year="2025", month="Mar", day="4", volume="27", pages="e68083", keywords="data science", keywords="health data integration", keywords="health data platform", keywords="real-world evidence", keywords="health care", keywords="health data", keywords="data", keywords="integration platforms", keywords="collaborative", keywords="platform", keywords="Belgium", keywords="Europe", keywords="personas", keywords="communication", keywords="health care providers", keywords="hospital-specific requirements", keywords="digital health", doi="10.2196/68083", url="https://www.jmir.org/2025/1/e68083", url="http://www.ncbi.nlm.nih.gov/pubmed/40053761" } @Article{info:doi/10.2196/64422, author="Park, ChulHyoung and Lee, Hee So and Lee, Yun Da and Choi, Seoyoon and You, Chan Seng and Jeon, Young Ja and Park, Jun Sang and Park, Woong Rae", title="Analysis of Retinal Thickness in Patients With Chronic Diseases Using Standardized Optical Coherence Tomography Data: Database Study Based on the Radiology Common Data Model", journal="JMIR Med Inform", year="2025", month="Feb", day="21", volume="13", pages="e64422", keywords="data standardization", keywords="ophthalmology", keywords="radiology", keywords="optical coherence tomography", keywords="retinal thickness", abstract="Background: The Observational Medical Outcome Partners-Common Data Model (OMOP-CDM) is an international standard for harmonizing electronic medical record (EMR) data. 
However, since it does not standardize unstructured data, such as medical imaging, using this data in multi-institutional collaborative research becomes challenging. To overcome this limitation, extensions such as the Radiology Common Data Model (R-CDM) have emerged to include and standardize these data types. Objective: This work aims to demonstrate that by standardizing optical coherence tomography (OCT) data into an R-CDM format, multi-institutional collaborative studies analyzing changes in retinal thickness in patients with long-standing chronic diseases can be performed efficiently. Methods: We standardized OCT images collected from two tertiary hospitals for research purposes using the R-CDM. As a proof of concept, we conducted a comparative analysis of retinal thickness between patients who have chronic diseases and those who do not. Patients diagnosed or treated for retinal and choroidal diseases, which could affect retinal thickness, were excluded from the analysis. Using the existing OMOP-CDM at each institution, we extracted cohorts of patients with chronic diseases and control groups, performing large-scale 1:2 propensity score matching (PSM). Subsequently, we linked the OMOP-CDM and R-CDM to extract the OCT image data of these cohorts and analyzed central macular thickness (CMT) and retinal nerve fiber layer (RNFL) thickness using a linear mixed model. Results: OCT data of 261,874 images from Ajou University Medical Center (AUMC) and 475,626 images from Seoul National University Bundang Hospital (SNUBH) were standardized in the R-CDM format. The R-CDM databases established at each institution were linked with the OMOP-CDM database. Following 1:2 PSM, the type 2 diabetes mellitus (T2DM) cohort included 957 patients, and the control cohort had 1603 patients. 
During the follow-up period, significant reductions in CMT were observed in the T2DM cohorts at AUMC (P=.04) and SNUBH (P=.007), without significant changes in RNFL thickness (AUMC: P=.56; SNUBH: P=.39). Notably, a significant reduction in CMT during the follow-up was observed only at AUMC in the hypertension cohort, compared to the control group (P=.04); no other significant differences in retinal thickness were found in the remaining analyses. Conclusions: The significance of our study lies in demonstrating the efficiency of multi-institutional collaborative research that simultaneously uses clinical data and medical imaging data by leveraging the OMOP-CDM for standardizing EMR data and the R-CDM for standardizing medical imaging data. ", doi="10.2196/64422", url="https://medinform.jmir.org/2025/1/e64422" } @Article{info:doi/10.2196/66910, author="Seinen, M. Tom and Kors, A. Jan and van Mulligen, M. Erik and Rijnbeek, R. Peter", title="Using Structured Codes and Free-Text Notes to Measure Information Complementarity in Electronic Health Records: Feasibility and Validation Study", journal="J Med Internet Res", year="2025", month="Feb", day="13", volume="27", pages="e66910", keywords="natural language processing", keywords="named entity recognition", keywords="clinical concept extraction", keywords="machine learning", keywords="electronic health records", keywords="EHR", keywords="word embeddings", keywords="clinical concept similarity", keywords="text mining", keywords="code", keywords="free-text", keywords="information", keywords="electronic record", keywords="data", keywords="patient records", keywords="framework", keywords="structured data", keywords="unstructured data", abstract="Background: Electronic health records (EHRs) consist of both structured data (eg, diagnostic codes) and unstructured data (eg, clinical notes). It is commonly believed that unstructured clinical narratives provide more comprehensive information. 
However, this assumption lacks large-scale validation and direct validation methods. Objective: This study aims to quantitatively compare the information in structured and unstructured EHR data and directly validate whether unstructured data offers more extensive information across a patient population. Methods: We analyzed both structured and unstructured data from patient records and visits in a large Dutch primary care EHR database between January 2021 and January 2024. Clinical concepts were identified from free-text notes using an extraction framework tailored for Dutch and compared with concepts from structured data. Concept embeddings were generated to measure semantic similarity between structured and extracted concepts through cosine similarity. A similarity threshold was systematically determined via annotated matches and minimized weighted Gini impurity. We then quantified the concept overlap between structured and unstructured data across various concept domains and patient populations. Results: In a population of 1.8 million patients, only 13\% of extracted concepts from patient records and 7\% from individual visits had similar structured counterparts. Conversely, 42\% of structured concepts in records and 25\% in visits had similar matches in unstructured data. Condition concepts had the highest overlap, followed by measurements and drug concepts. Subpopulation visits, such as those with chronic conditions or psychological disorders, showed different proportions of data overlap, indicating varied reliance on structured versus unstructured data across clinical contexts. Conclusions: Our study demonstrates the feasibility of quantifying the information difference between structured and unstructured data, showing that the unstructured data provides important additional information in the studied database and populations. The annotated concept matches are made publicly available for the clinical natural language processing community. 
Despite some limitations, our proposed methodology proves versatile, and its application can lead to more robust and insightful observational clinical research. ", doi="10.2196/66910", url="https://www.jmir.org/2025/1/e66910" } @Article{info:doi/10.2196/48775, author="Bhavnani, K. Suresh and Zhang, Weibin and Bao, Daniel and Raji, Mukaila and Ajewole, Veronica and Hunter, Rodney and Kuo, Yong-Fang and Schmidt, Susanne and Pappadis, R. Monique and Smith, Elise and Bokov, Alex and Reistetter, Timothy and Visweswaran, Shyam and Downer, Brian", title="Subtyping Social Determinants of Health in the ``All of Us'' Program: Network Analysis and Visualization Study", journal="J Med Internet Res", year="2025", month="Feb", day="11", volume="27", pages="e48775", keywords="social determinants of health", keywords="All of Us", keywords="bipartite networks", keywords="financial resources", keywords="health care", keywords="health outcomes", keywords="precision medicine", keywords="decision support", keywords="health industry", keywords="clinical implications", keywords="machine learning methods", abstract="Background: Social determinants of health (SDoH), such as financial resources and housing stability, account for between 30\% and 55\% of people's health outcomes. While many studies have identified strong associations between specific SDoH and health outcomes, little is known about how SDoH co-occur to form subtypes critical for designing targeted interventions. Such analysis has only now become possible through the All of Us program. Objective: This study aims to analyze the All of Us dataset for addressing two research questions: (1) What are the range of and responses to survey questions related to SDoH? and (2) How do SDoH co-occur to form subtypes, and what are their risks for adverse health outcomes? Methods: For question 1, an expert panel analyzed the range of and responses to SDoH questions across 6 surveys in the full All of Us dataset (N=372,397; version 6). 
For question 2, due to systematic missingness and uneven granularity of questions across the surveys, we selected all participants with valid and complete SDoH data and used inverse probability weighting to adjust their imbalance in demographics. Next, an expert panel grouped the SDoH questions into SDoH factors to enable more consistent granularity. To identify the subtypes, we used bipartite modularity maximization for identifying SDoH biclusters and measured their significance and replicability. Next, we measured their association with 3 outcomes (depression, delayed medical care, and emergency room visits in the last year). Finally, the expert panel inferred the subtype labels, potential mechanisms, and targeted interventions. Results: The question 1 analysis identified 110 SDoH questions across 4 surveys covering all 5 domains in Healthy People 2030. As the SDoH questions varied in granularity, they were categorized by an expert panel into 18 SDoH factors. The question 2 analysis (n=12,913; d=18) identified 4 biclusters with significant biclusteredness (Q=0.13; random-Q=0.11; z=7.5; P<.001) and significant replication (real Rand index=0.88; random Rand index=0.62; P<.001). Each subtype had significant associations with specific outcomes and had meaningful interpretations and potential targeted interventions. For example, the Socioeconomic barriers subtype included 6 SDoH factors (eg, not employed and food insecurity) and had a significantly higher odds ratio (4.2, 95\% CI 3.5-5.1; P<.001) for depression when compared to other subtypes. The expert panel inferred implications of the results for designing interventions and health care policies based on SDoH subtypes. Conclusions: This study identified SDoH subtypes that had statistically significant biclusteredness and replicability, each of which had significant associations with specific adverse health outcomes and with translational implications for targeted SDoH interventions and health care policies. 
However, the high degree of systematic missingness requires repeating the analysis as the data become more complete by using our generalizable and scalable machine learning code available on the All of Us workbench. ", doi="10.2196/48775", url="https://www.jmir.org/2025/1/e48775", url="http://www.ncbi.nlm.nih.gov/pubmed/39932771" } @Article{info:doi/10.2196/64445, author="Abdulazeem, Hebatullah and Borges do Nascimento, J{\'u}nior Israel and Weerasekara, Ishanka and Sharifan, Amin and Grandi Bianco, Victor and Cunningham, Ciara and Kularathne, Indunil and Deeken, Genevieve and de Barros, Jerome and Sathian, Brijesh and {\O}stengaard, Lasse and Lamontagne-Godwin, Frederique and van Hoof, Joost and Lazeri, Ledia and Redlich, Cassie and Marston, R. Hannah and Dos Santos, Alistair Ryan and Azzopardi-Muscat, Natasha and Yon, Yongjie and Novillo-Ortiz, David", title="Use of Digital Health Technologies for Dementia Care: Bibliometric Analysis and Report", journal="JMIR Ment Health", year="2025", month="Feb", day="10", volume="12", pages="e64445", keywords="people living with dementia", keywords="digital health technologies", keywords="bibliometric analysis", keywords="evidence-based medicine", abstract="Background: Dementia is a syndrome that compromises neurocognitive functions of the individual and that is affecting 55 million individuals globally, as well as global health care systems, national economic systems, and family members. Objective: This study aimed to determine the status quo of scientific production on use of digital health technologies (DHTs) to support (older) people living with dementia, their families, and care partners. In addition, our study aimed to map the current landscape of global research initiatives on DHTs on the prevention, diagnosis, treatment, and support of people living with dementia and their caregivers. 
Methods: A bibliometric analysis was performed as part of a systematic review protocol using MEDLINE, Embase, Scopus, Epistemonikos, the Cochrane Database of Systematic Reviews, and Google Scholar for systematic and scoping reviews on DHTs and dementia up to February 21, 2024. Search terms included various forms of dementia and DHTs. Two independent reviewers conducted a 2-stage screening process with disagreements resolved by a third reviewer. Eligible reviews were then subjected to a bibliometric analysis using VOSviewer to evaluate document types, authorship, countries, institutions, journal sources, references, and keywords, creating social network maps to visualize emergent research trends. Results: A total of 704 records met the inclusion criteria for bibliometric analysis. Most reviews were systematic, with a substantial number covering mobile health, telehealth, and computer-based cognitive interventions. Bibliometric analysis revealed that the Journal of Medical Internet Research had the highest number of reviews and citations. Researchers from 66 countries contributed, with the United Kingdom and the United States as the most prolific. Overall, the number of publications covering the intersection of DHTs and dementia has increased steadily over time. However, the diversity of reviews conducted on a single topic has resulted in duplicated scientific efforts. Our assessment of contributions from countries, institutions, and key stakeholders reveals significant trends and knowledge gaps, particularly highlighting the dominance of high-income countries in this research domain. Furthermore, our findings emphasize the critical importance of interdisciplinary, collaborative teams and offer clear directions for future research, especially in underrepresented regions. 
Conclusions: Our study shows a steady increase in dementia- and DHT-related publications, particularly in areas such as mobile health, virtual reality, artificial intelligence, and sensor-based technology interventions. This increase underscores the importance of systematic approaches and interdisciplinary collaborations, while identifying knowledge gaps, especially in lower-income regions. It is crucial that researchers worldwide adhere to evidence-based medicine principles to avoid duplication of efforts. This analysis offers a valuable foundation for policy makers and academics, emphasizing the need for an international collaborative task force to address knowledge gaps and advance dementia care globally. Trial Registration: PROSPERO CRD42024511241; https://www.crd.york.ac.uk/prospero/display\_record.php?RecordID=511241 ", doi="10.2196/64445", url="https://mental.jmir.org/2025/1/e64445" } @Article{info:doi/10.2196/53434, author="Alshanik, Farah and Khasawneh, Rawand and Dalky, Alaa and Qawasmeh, Ethar", title="Unveiling Topics and Emotions in Arabic Tweets Surrounding the COVID-19 Pandemic: Topic Modeling and Sentiment Analysis Approach", journal="JMIR Infodemiology", year="2025", month="Feb", day="10", volume="5", pages="e53434", keywords="topic modeling", keywords="sentiment analysis", keywords="COVID-19", keywords="social media", keywords="Twitter", keywords="public discussion", abstract="Background: The worldwide effects of the COVID-19 pandemic have been profound, and the Arab world has not been exempt from its wide-ranging consequences. Within this context, social media platforms such as Twitter have become essential for sharing information and expressing public opinions during this global crisis. Careful investigation of Arabic tweets related to COVID-19 can provide invaluable insights into the common topics and underlying sentiments that shape discussions about the COVID-19 pandemic. 
Objective: This study aimed to understand the concerns and feelings of Twitter users in Arabic-speaking countries about the COVID-19 pandemic. This was accomplished through analyzing the themes and sentiments that were expressed in Arabic tweets about the COVID-19 pandemic. Methods: In this study, 1 million Arabic tweets about COVID-19 posted between March 1 and March 31, 2020, were analyzed. Machine learning techniques, such as topic modeling and sentiment analysis, were applied to understand the main topics and emotions that were expressed in these tweets. Results: The analysis of Arabic tweets revealed several prominent topics related to COVID-19. The analysis identified and grouped 16 different conversation topics that were organized into eight themes: (1) preventive measures and safety, (2) medical and health care aspects, (3) government and social measures, (4) impact and numbers, (5) vaccine development and research, (6) COVID-19 and religious practices, (7) global impact of COVID-19 on sports and countries, and (8) COVID-19 and national efforts. Across all the topics identified, the prevailing sentiments regarding the spread of COVID-19 were primarily centered around anger, followed by disgust, joy, and anticipation. Notably, when conversations revolved around new COVID-19 cases and fatalities, public tweets revealed a notably heightened sense of anger in comparison to other subjects. Conclusions: The study offers valuable insights into the topics and emotions expressed in Arabic tweets related to COVID-19. It demonstrates the significance of social media platforms, particularly Twitter, in capturing the Arabic-speaking community's concerns and sentiments during the COVID-19 pandemic. The findings contribute to a deeper understanding of the prevailing discourse, enabling stakeholders to tailor effective communication strategies and address specific public concerns. 
This study underscores the importance of monitoring social media conversations in Arabic to support public health efforts and crisis management during the COVID-19 pandemic. ", doi="10.2196/53434", url="https://infodemiology.jmir.org/2025/1/e53434", url="http://www.ncbi.nlm.nih.gov/pubmed/39928401" } @Article{info:doi/10.2196/63550, author="Ruta, R. Michael and Gaidici, Tony and Irwin, Chase and Lifshitz, Jonathan", title="ChatGPT for Univariate Statistics: Validation of AI-Assisted Data Analysis in Healthcare Research", journal="J Med Internet Res", year="2025", month="Feb", day="7", volume="27", pages="e63550", keywords="ChatGPT", keywords="data analysis", keywords="statistics", keywords="chatbot", keywords="artificial intelligence", keywords="biomedical research", keywords="programmers", keywords="bioinformatics", keywords="data processing", abstract="Background: ChatGPT, a conversational artificial intelligence developed by OpenAI, has rapidly become an invaluable tool for researchers. With the recent integration of Python code interpretation into the ChatGPT environment, there has been a significant increase in the potential utility of ChatGPT as a research tool, particularly in terms of data analysis applications. Objective: This study aimed to assess ChatGPT as a data analysis tool and provide researchers with a framework for applying ChatGPT to data management tasks, descriptive statistics, and inferential statistics. Methods: A subset of the National Inpatient Sample was extracted. Data analysis trials were divided into data processing, categorization, and tabulation, as well as descriptive and inferential statistics. For data processing, categorization, and tabulation assessments, ChatGPT was prompted to reclassify variables, subset variables, and present data, respectively. Descriptive statistics assessments included mean, SD, median, and IQR calculations. 
Inferential statistics assessments were conducted at varying levels of prompt specificity (``Basic,'' ``Intermediate,'' and ``Advanced''). Specific tests included chi-square, Pearson correlation, independent 2-sample t test, 1-way ANOVA, Fisher exact, Spearman correlation, Mann-Whitney U test, and Kruskal-Wallis H test. Outcomes from consecutive prompt-based trials were assessed against expected statistical values calculated in Python (Python Software Foundation), SAS (SAS Institute), and RStudio (Posit PBC). Results: ChatGPT accurately performed data processing, categorization, and tabulation across all trials. For descriptive statistics, it provided accurate means, SDs, medians, and IQRs across all trials. Inferential statistics accuracy against expected statistical values varied with prompt specificity: 32.5\% accuracy for ``Basic'' prompts, 81.3\% for ``Intermediate'' prompts, and 92.5\% for ``Advanced'' prompts. Conclusions: ChatGPT shows promise as a tool for exploratory data analysis, particularly for researchers with some statistical knowledge and limited programming expertise. However, its application requires careful prompt construction and human oversight to ensure accuracy. As a supplementary tool, ChatGPT can enhance data analysis efficiency and broaden research accessibility. ", doi="10.2196/63550", url="https://www.jmir.org/2025/1/e63550" } @Article{info:doi/10.2196/64479, author="Beuken, JM Maik and Kleynen, Melanie and Braun, Susy and Van Berkel, Kees and van der Kallen, Carla and Koster, Annemarie and Bosma, Hans and Berendschot, TJM Tos and Houben, JHM Alfons and Dukers-Muijrers, Nicole and van den Bergh, P. Joop and Kroon, A. Abraham and and Kanera, M. 
Iris", title="Identification of Clusters in a Population With Obesity Using Machine Learning: Secondary Analysis of The Maastricht Study", journal="JMIR Med Inform", year="2025", month="Feb", day="5", volume="13", pages="e64479", keywords="Maastricht Study", keywords="participant clusters", keywords="cluster analysis", keywords="factor probabilistic distance clustering", keywords="FPDC algorithm", keywords="statistically equivalent signature", keywords="SES feature selection", keywords="unsupervised machine learning", keywords="obesity", keywords="hypothesis free", keywords="risk factor", keywords="physical inactivity", keywords="poor nutrition", keywords="physical activity", keywords="chronic disease", keywords="type 2 diabetes", keywords="diabetes", keywords="heart disease", keywords="long-term behavior change", abstract="Background: Modern lifestyle risk factors, like physical inactivity and poor nutrition, contribute to rising rates of obesity and chronic diseases like type 2 diabetes and heart disease. Particularly personalized interventions have been shown to be effective for long-term behavior change. Machine learning can be used to uncover insights without predefined hypotheses, revealing complex relationships and distinct population clusters. New data-driven approaches, such as the factor probabilistic distance clustering algorithm, provide opportunities to identify potentially meaningful clusters within large and complex datasets. Objective: This study aimed to identify potential clusters and relevant variables among individuals with obesity using a data-driven and hypothesis-free machine learning approach. Methods: We used cross-sectional data from individuals with abdominal obesity from The Maastricht Study. Data (2971 variables) included demographics, lifestyle, biomedical aspects, advanced phenotyping, and social factors (cohort 2010). 
The factor probabilistic distance clustering algorithm was applied in order to detect clusters within this high-dimensional data. To identify a subset of distinct, minimally redundant, predictive variables, we used the statistically equivalent signature algorithm. To describe the clusters, we applied measures of central tendency and variability, and we assessed the distinctiveness of the clusters through the emerged variables using the F test for continuous variables and the chi-square test for categorical variables at a confidence level of $\alpha$=.001. Results: We identified 3 distinct clusters (including 4128/9188, 44.93\% of all data points) among individuals with obesity (n=4128). The most significant continuous variable for distinguishing cluster 1 (n=1458) from clusters 2 and 3 combined (n=2670) was the lower energy intake (mean 1684, SD 393 kcal/day vs mean 2358, SD 635 kcal/day; P<.001). The most significant categorical variable was occupation (P<.001). A significantly higher proportion (1236/1458, 84.77\%) in cluster 1 did not work compared to clusters 2 and 3 combined (1486/2670, 55.66\%; P<.001). For cluster 2 (n=1521), the most significant continuous variable was a higher energy intake (mean 2755, SD 506.2 kcal/day vs mean 1749, SD 375 kcal/day; P<.001). The most significant categorical variable was sex (P<.001). A significantly higher proportion (997/1521, 65.55\%) in cluster 2 were male compared to the other 2 clusters (885/2607, 33.95\%; P<.001). For cluster 3 (n=1149), the most significant continuous variable was overall higher cognitive functioning (mean 0.2349, SD 0.5702 vs mean --0.3088, SD 0.7212; P<.001), and educational level was the most significant categorical variable (P<.001). A significantly higher proportion (475/1149, 41.34\%) in cluster 3 received higher vocational or university education in comparison to clusters 1 and 2 combined (729/2979, 24.47\%; P<.001). 
Conclusions: This study demonstrates that a hypothesis-free and fully data-driven approach can be used to identify distinguishable participant clusters in large and complex datasets and find relevant variables that differ within populations with obesity. ", doi="10.2196/64479", url="https://medinform.jmir.org/2025/1/e64479" } @Article{info:doi/10.2196/59452, author="Willem, Theresa and Wollek, Alessandro and Cheslerean-Boghiu, Theodor and Kenney, Martha and Buyx, Alena", title="The Social Construction of Categorical Data: Mixed Methods Approach to Assessing Data Features in Publicly Available Datasets", journal="JMIR Med Inform", year="2025", month="Jan", day="28", volume="13", pages="e59452", keywords="machine learning", keywords="categorical data", keywords="social context dependency", keywords="mixed methods", keywords="dermatology", keywords="dataset analysis", abstract="Background: In data-sparse areas such as health care, computer scientists aim to leverage as much available information as possible to increase the accuracy of their machine learning models' outputs. As a standard, categorical data, such as patients' gender, socioeconomic status, or skin color, are used to train models in fusion with other data types, such as medical images and text-based medical information. However, the effects of including categorical data features for model training in such data-scarce areas are underexamined, particularly regarding models intended to serve individuals equitably in a diverse population. Objective: This study aimed to explore categorical data's effects on machine learning model outputs, rooted the effects in the data collection and dataset publication processes, and proposed a mixed methods approach to examining datasets' data categories before using them for machine learning training. 
Methods: Against the theoretical background of the social construction of categories, we suggest a mixed methods approach to assess categorical data's utility for machine learning model training. As an example, we applied our approach to a Brazilian dermatological dataset (Dermatological and Surgical Assistance Program at the Federal University of Esp{\'i}rito Santo [PAD-UFES] 20). We first present an exploratory, quantitative study that assesses the effects when including or excluding each of the unique categorical data features of the PAD-UFES 20 dataset for training a transformer-based model using a data fusion algorithm. We then pair our quantitative analysis with a qualitative examination of the data categories based on interviews with the dataset authors. Results: Our quantitative study suggests scattered effects of including categorical data for machine learning model training across predictive classes. Our qualitative analysis gives insights into how the categorical data were collected and why they were published, explaining some of the quantitative effects that we observed. Our findings highlight the social constructedness of categorical data in publicly available datasets, meaning that the data in a category heavily depend on both how these categories are defined by the dataset creators and the sociomedico context in which the data are collected. This reveals relevant limitations of using publicly available datasets in contexts different from those of the collection of their data. Conclusions: We caution against using data features of publicly available datasets without reflection on the social construction and context dependency of their categorical data features, particularly in data-sparse areas. We conclude that social scientific, context-dependent analysis of available data features using both quantitative and qualitative methods is helpful in judging the utility of categorical data for the population for which a model is intended. 
", doi="10.2196/59452", url="https://medinform.jmir.org/2025/1/e59452" } @Article{info:doi/10.2196/54133, author="Yang, Doris and Zhou, Doudou and Cai, Steven and Gan, Ziming and Pencina, Michael and Avillach, Paul and Cai, Tianxi and Hong, Chuan", title="Robust Automated Harmonization of Heterogeneous Data Through Ensemble Machine Learning: Algorithm Development and Validation Study", journal="JMIR Med Inform", year="2025", month="Jan", day="22", volume="13", pages="e54133", keywords="ensemble learning", keywords="semantic learning", keywords="distribution learning", keywords="variable harmonization", keywords="machine learning", keywords="cardiovascular health study", keywords="intracohort comparison", keywords="intercohort comparison", keywords="gold standard labels", abstract="Background: Cohort studies contain rich clinical data across large and diverse patient populations and are a common source of observational data for clinical research. Because large scale cohort studies are both time and resource intensive, one alternative is to harmonize data from existing cohorts through multicohort studies. However, given differences in variable encoding, accurate variable harmonization is difficult. Objective: We propose SONAR (Semantic and Distribution-Based Harmonization) as a method for harmonizing variables across cohort studies to facilitate multicohort studies. Methods: SONAR used semantic learning from variable descriptions and distribution learning from study participant data. Our method learned an embedding vector for each variable and used pairwise cosine similarity to score the similarity between variables. This approach was built off 3 National Institutes of Health cohorts, including the Cardiovascular Health Study, the Multi-Ethnic Study of Atherosclerosis, and the Women's Health Initiative. We also used gold standard labels to further refine the embeddings in a supervised manner. 
Results: The method was evaluated using manually curated gold standard labels from the 3 National Institutes of Health cohorts. We evaluated both the intracohort and intercohort variable harmonization performance. The supervised SONAR method outperformed existing benchmark methods for almost all intracohort and intercohort comparisons using area under the curve and top-k accuracy metrics. Notably, SONAR was able to significantly improve harmonization of concepts that were difficult for existing semantic methods to harmonize. Conclusions: SONAR achieves accurate variable harmonization within and between cohort studies by harnessing the complementary strengths of semantic learning and variable distribution learning. ", doi="10.2196/54133", url="https://medinform.jmir.org/2025/1/e54133" } @Article{info:doi/10.2196/69742, author="Bousquet, Cedric and Beltramin, Div{\`a}", title="Advantages and Inconveniences of a Multi-Agent Large Language Model System to Mitigate Cognitive Biases in Diagnostic Challenges", journal="J Med Internet Res", year="2025", month="Jan", day="20", volume="27", pages="e69742", keywords="large language model", keywords="multi-agent system", keywords="diagnostic errors", keywords="cognition", keywords="clinical decision-making", keywords="cognitive bias", keywords="generative artificial intelligence", doi="10.2196/69742", url="https://www.jmir.org/2025/1/e69742" } @Article{info:doi/10.2196/52385, author="Mumtaz, Shahzad and McMinn, Megan and Cole, Christian and Gao, Chuang and Hall, Christopher and Guignard-Duff, Magalie and Huang, Huayi and McAllister, A. David and Morales, R. 
Daniel and Jefferson, Emily and Guthrie, Bruce", title="A Digital Tool for Clinical Evidence--Driven Guideline Development by Studying Properties of Trial Eligible and Ineligible Populations: Development and Usability Study", journal="J Med Internet Res", year="2025", month="Jan", day="16", volume="27", pages="e52385", keywords="multimorbidity", keywords="clinical practice guideline", keywords="gout", keywords="Trusted Research Environment", keywords="National Institute for Health and Care Excellence", keywords="Scottish Intercollegiate Guidelines Network", keywords="clinical practice", keywords="development", keywords="efficacy", keywords="validity", keywords="epidemiological data", keywords="epidemiology", keywords="epidemiological", keywords="digital tool", keywords="tool", keywords="age", keywords="gender", keywords="ethnicity", keywords="mortality", keywords="feedback", keywords="availability", abstract="Background: Clinical guideline development preferentially relies on evidence from randomized controlled trials (RCTs). RCTs are gold-standard methods to evaluate the efficacy of treatments with the highest internal validity but limited external validity, in the sense that their findings may not always be applicable to or generalizable to clinical populations or population characteristics. The external validity of RCTs for the clinical population is constrained by the lack of tailored epidemiological data analysis designed for this purpose due to data governance, consistency of disease or condition definitions, and reduplicated effort in analysis code. Objective: This study aims to develop a digital tool that characterizes the overall population and differences between clinical trial eligible and ineligible populations from the clinical populations of a disease or condition regarding demography (eg, age, gender, ethnicity), comorbidity, coprescription, hospitalization, and mortality. 
Currently, the process is complex, onerous, and time-consuming, whereas a real-time tool may be used to rapidly inform a guideline developer's judgment about the applicability of evidence. Methods: The National Institute for Health and Care Excellence---particularly the gout guideline development group---and the Scottish Intercollegiate Guidelines Network guideline developers were consulted to gather their requirements and evidential data needs when developing guidelines. An R Shiny (R Foundation for Statistical Computing) tool was designed and developed using electronic primary health care data linked with hospitalization and mortality data built upon an optimized data architecture. Disclosure control mechanisms were built into the tool to ensure data confidentiality. The tool was deployed within a Trusted Research Environment, allowing only trusted preapproved researchers to conduct analysis. Results: The tool supports 128 chronic health conditions as index conditions and 161 conditions as comorbidities (33 in addition to the 128 index conditions). It enables 2 types of analyses via the graphic interface: overall population and stratified by user-defined eligibility criteria. The analyses produce an overview of statistical tables (eg, age, gender) of the index condition population and, within the overview groupings, produce details on, for example, electronic frailty index, comorbidities, and coprescriptions. The disclosure control mechanism is integral to the tool, limiting tabular counts to meet local governance needs. An exemplary result for gout as an index condition is presented to demonstrate the tool's functionality. Guideline developers from the National Institute for Health and Care Excellence and the Scottish Intercollegiate Guidelines Network provided positive feedback on the tool. Conclusions: The tool is a proof-of-concept, and the user feedback has demonstrated that this is a step toward computer-interpretable guideline development. 
Using the digital tool can potentially improve evidence-driven guideline development through the availability of real-world data in real time. ", doi="10.2196/52385", url="https://www.jmir.org/2025/1/e52385" } @Article{info:doi/10.2196/59113, author="de Groot, Rowdy and van der Graaff, Frank and van der Doelen, Dani{\"e}l and Luijten, Michiel and De Meyer, Ronald and Alrouh, Hekmat and van Oers, Hedy and Tieskens, Jacintha and Zijlmans, Josjan and Bartels, Meike and Popma, Arne and de Keizer, Nicolette and Cornet, Ronald and Polderman, C. Tinca J.", title="Implementing Findable, Accessible, Interoperable, Reusable (FAIR) Principles in Child and Adolescent Mental Health Research: Mixed Methods Approach", journal="JMIR Ment Health", year="2024", month="Dec", day="19", volume="11", pages="e59113", keywords="FAIR data", keywords="research data management", keywords="data interoperability", keywords="data standardization", keywords="OMOP CDM", keywords="implementation", keywords="health data", keywords="data quality", keywords="FAIR principles", abstract="Background: The FAIR (Findable, Accessible, Interoperable, Reusable) data principles are a guideline to improve the reusability of data. However, properly implementing these principles is challenging due to a wide range of barriers. Objectives: To further the field of FAIR data, this study aimed to systematically identify barriers regarding implementing the FAIR principles in the area of child and adolescent mental health research, define the most challenging barriers, and provide recommendations for these barriers. Methods: Three sources were used as input to identify barriers: (1) evaluation of the implementation process of the Observational Medical Outcomes Partnership Common Data Model by 3 data managers; (2) interviews with experts on mental health research, reusable health data, and data quality; and (3) a rapid literature review. 
All barriers were categorized according to type as described previously, the affected FAIR principle, a category to add detail about the origin of the barrier, and whether a barrier was mental health specific. The barriers were assessed and ranked on impact with the data managers using the Delphi method. Results: Thirteen barriers were identified by the data managers, 7 were identified by the experts, and 30 barriers were extracted from the literature. This resulted in 45 unique barriers. The characteristics that were most assigned to the barriers were, respectively, external type (n=32/45; eg, organizational policy preventing the use of required software), tooling category (n=19/45; ie, software and databases), all FAIR principles (n=15/45), and not mental health specific (n=43/45). Consensus on ranking the scores of the barriers was reached after 2 rounds of the Delphi method. The most important recommendations to overcome the barriers are adding a FAIR data steward to the research team, accessible step-by-step guides, and ensuring sustainable funding for the implementation and long-term use of FAIR data. Conclusions: By systematically listing these barriers and providing recommendations, we intend to enhance the awareness of researchers and grant providers that making data FAIR demands specific expertise, available tooling, and proper investments. 
", doi="10.2196/59113", url="https://mental.jmir.org/2024/1/e59113" } @Article{info:doi/10.2196/60665, author="Cao, Lang and Sun, Jimeng and Cross, Adam", title="An Automatic and End-to-End System for Rare Disease Knowledge Graph Construction Based on Ontology-Enhanced Large Language Models: Development Study", journal="JMIR Med Inform", year="2024", month="Dec", day="18", volume="12", pages="e60665", keywords="rare disease", keywords="clinical informatics", keywords="LLM", keywords="natural language processing", keywords="machine learning", keywords="artificial intelligence", keywords="large language models", keywords="data extraction", keywords="ontologies", keywords="knowledge graphs", keywords="text mining", abstract="Background: Rare diseases affect millions worldwide but sometimes face limited research focus individually due to low prevalence. Many rare diseases do not have specific International Classification of Diseases, Ninth Edition (ICD-9) and Tenth Edition (ICD-10), codes and therefore cannot be reliably extracted from granular fields like ``Diagnosis'' and ``Problem List'' entries, which complicates tasks that require identification of patients with these conditions, including clinical trial recruitment and research efforts. Recent advancements in large language models (LLMs) have shown promise in automating the extraction of medical information, offering the potential to improve medical research, diagnosis, and management. However, most LLMs lack professional medical knowledge, especially concerning specific rare diseases, and cannot effectively manage rare disease data in its various ontological forms, making it unsuitable for these tasks. Objective: Our aim is to create an end-to-end system called automated rare disease mining (AutoRD), which automates the extraction of rare disease--related information from medical text, focusing on entities and their relations to other medical concepts, such as signs and symptoms. 
AutoRD integrates up-to-date ontologies with other structured knowledge and demonstrates superior performance in rare disease extraction tasks. We conducted various experiments to evaluate AutoRD's performance, aiming to surpass common LLMs and traditional methods. Methods: AutoRD is a pipeline system that involves data preprocessing, entity extraction, relation extraction, entity calibration, and knowledge graph construction. We implemented this system using GPT-4 and medical knowledge graphs developed from the open-source Human Phenotype and Orphanet ontologies, using techniques such as chain-of-thought reasoning and prompt engineering. We quantitatively evaluated our system's performance in entity extraction, relation extraction, and knowledge graph construction. The experiment used the well-curated dataset RareDis2023, which contains medical literature focused on rare disease entities and their relations, making it an ideal dataset for training and testing our methodology. Results: On the RareDis2023 dataset, AutoRD achieved an overall entity extraction F1-score of 56.1\% and a relation extraction F1-score of 38.6\%, marking a 14.4\% improvement over the baseline LLM. Notably, the F1-score for rare disease entity extraction reached 83.5\%, indicating high precision and recall in identifying rare disease mentions. These results demonstrate the effectiveness of integrating LLMs with medical ontologies in extracting complex rare disease information. Conclusions: AutoRD is an automated end-to-end system for extracting rare disease information from text to build knowledge graphs, addressing critical limitations of existing LLMs by improving identification of these diseases and connecting them to related clinical features. This work underscores the significant potential of LLMs in transforming health care, particularly in the rare disease domain. 
By leveraging ontology-enhanced LLMs, AutoRD constructs a robust medical knowledge base that incorporates up-to-date rare disease information, facilitating improved identification of patients and resulting in more inclusive research and trial candidacy efforts. ", doi="10.2196/60665", url="https://medinform.jmir.org/2024/1/e60665" } @Article{info:doi/10.2196/59844, author="Baer, J. Rebecca and Bandoli, Gretchen and Jelliffe-Pawlowski, Laura and Chambers, D. Christina", title="The University of California Study of Outcomes in Mothers and Infants (a Population-Based Research Resource): Retrospective Cohort Study", journal="JMIR Public Health Surveill", year="2024", month="Dec", day="3", volume="10", pages="e59844", keywords="birth certificate", keywords="vital statistics", keywords="hospital discharge", keywords="administrative data", keywords="linkage", keywords="pregnancy outcome", keywords="birth outcome", keywords="infant outcome", keywords="adverse outcome", keywords="preterm birth", keywords="birth defects", keywords="pregnancy", keywords="prenatal", keywords="California", keywords="policy", keywords="disparities", keywords="children", keywords="data collection", abstract="Background: Population-based databases are valuable for perinatal research. The California Department of Health Care Access and Information (HCAI) created a linked birth file covering the years 1991 through 2012. This file includes birth and fetal death certificate records linked to the hospital discharge records of the birthing person and infant. In 2019, the University of California Study of Outcomes in Mothers and Infants received approval to create similar linked birth files for births from 2011 onward, with 2 years of overlapping birth files to allow for linkage comparison. Objective: This paper aims to describe the University of California Study of Outcomes in Mothers and Infants linkage methodology, examine the linkage quality, and discuss the benefits and limitations of the approach. 
Methods: Live birth and fetal death certificates were linked to hospital discharge records for California infants between 2005 and 2020. The linkage algorithm includes variables such as birth hospital and date of birth, and linked record selection is made based on a ``link score.'' The complete file includes California Vital Statistics and HCAI hospital discharge records for the birthing person (1 y before delivery and 1 y after delivery) and infant (1 y after delivery). Linkage quality was assessed through a comparison of linked files and California Vital Statistics only. Comparisons were made to previous linked birth files created by the HCAI for 2011 and 2012. Results: Of the 8,040,000 live births, 7,427,738 (92.38\%) California Vital Statistics live birth records were linked to HCAI records for birthing people, 7,680,597 (95.53\%) birth records were linked to HCAI records for the infant, and 7,285,346 (90.61\%) California Vital Statistics birth records were linked to HCAI records for both the birthing person and the infant. The linkage rates were 92.44\% (976,526/1,056,358) for Asian and 86.27\% (28,601/33,151) for Hawaiian or Pacific Islander birthing people. Of the 44,212 fetal deaths, 33,355 (75.44\%) had HCAI records linked to the birthing person. When assessing variables in both California Vital Statistics and hospital records, the percentage was greatest when using both sources: the rates of gestational diabetes were 4.52\% (329,128/7,285,345) in the California Vital Statistics records, 8.2\% (597,534/7,285,345) in the HCAI records, and 9.34\% (680,757/7,285,345) when using both data sources. Conclusions: We demonstrate that the linkage strategy used for this data platform is similar in linkage rate and linkage quality to the previous linked birth files created by the HCAI. The linkage provides higher rates of crucial variables, such as diabetes, compared to birth certificate records alone, although selection bias from the linkage must be considered. 
This platform has been used independently to examine health outcomes, has been linked to environmental datasets and residential data, and has been used to obtain and examine maternal serum and newborn blood spots. ", doi="10.2196/59844", url="https://publichealth.jmir.org/2024/1/e59844", url="http://www.ncbi.nlm.nih.gov/pubmed/39625748" } @Article{info:doi/10.2196/65784, author="Bialke, Martin and Stahl, Dana and Leddig, Torsten and Hoffmann, Wolfgang", title="The University Medicine Greifswald's Trusted Third Party Dispatcher: State-of-the-Art Perspective Into Comprehensive Architectures and Complex Research Workflows", journal="JMIR Med Inform", year="2024", month="Nov", day="29", volume="12", pages="e65784", keywords="architecture", keywords="scalability", keywords="trusted third party", keywords="application", keywords="security", keywords="consent", keywords="identifying data", keywords="infrastructure", keywords="modular", keywords="software", keywords="implementation", keywords="user interface", keywords="health platform", keywords="data management", keywords="data privacy", keywords="health record", keywords="electronic health record", keywords="EHR", keywords="pseudonymization", doi="10.2196/65784", url="https://medinform.jmir.org/2024/1/e65784" } @Article{info:doi/10.2196/67429, author="W{\"u}ndisch, Eric and Hufnagl, Peter and Brunecker, Peter and Meier zu Ummeln, Sophie and Tr{\"a}ger, Sarah and Prasser, Fabian and Weber, Joachim", title="Authors' Reply: The University Medicine Greifswald's Trusted Third Party Dispatcher: State-of-the-Art Perspective Into Comprehensive Architectures and Complex Research Workflows", journal="JMIR Med Inform", year="2024", month="Nov", day="29", volume="12", pages="e67429", keywords="architecture", keywords="scalability", keywords="trusted third party", keywords="application", keywords="security", keywords="consent", keywords="identifying data", keywords="infrastructure", keywords="modular", keywords="software", 
keywords="implementation", keywords="user interface", keywords="health platform", keywords="data management", keywords="data privacy", keywords="health record", keywords="electronic health record", keywords="EHR", keywords="pseudonymization", doi="10.2196/67429", url="https://medinform.jmir.org/2024/1/e67429" } @Article{info:doi/10.2196/64726, author="Yang, Rick and Yang, Alina", title="Strengthening the Backbone: Government-Academic Data Collaborations for Crisis Response", journal="JMIR Public Health Surveill", year="2024", month="Nov", day="28", volume="10", pages="e64726", keywords="data infrastructure", keywords="data sharing", keywords="cross-sector collaboration", keywords="government-academic partnerships", keywords="public health", keywords="crisis response", doi="10.2196/64726", url="https://publichealth.jmir.org/2024/1/e64726" } @Article{info:doi/10.2196/66479, author="Lee, Jian-Sin and Tyler, B. Allison R. and Veinot, Christine Tiffany and Yakel, Elizabeth", title="Authors' Reply to: Strengthening the Backbone: Government-Academic Data Collaborations for Crisis Response", journal="JMIR Public Health Surveill", year="2024", month="Nov", day="28", volume="10", pages="e66479", keywords="COVID-19", keywords="crisis response", keywords="cross-sector collaboration", keywords="data infrastructures", keywords="data science", keywords="data sharing", keywords="pandemic", keywords="public health informatics", doi="10.2196/66479", url="https://publichealth.jmir.org/2024/1/e66479" } @Article{info:doi/10.2196/60878, author="Lukmanjaya, Wilson and Butler, Tony and Taflan, Patricia and Simpson, Paul and Ginnivan, Natasha and Buchan, Iain and Nenadic, Goran and Karystianis, George", title="Population Characteristics in Justice Health Research Based on PubMed Abstracts From 1963 to 2023: Text Mining Study", journal="JMIR Form Res", year="2024", month="Nov", day="22", volume="8", pages="e60878", keywords="epidemiology", keywords="PubMed", keywords="criminology", 
keywords="text mining", keywords="justice health", keywords="offending and incarcerated populations", keywords="population characteristics", keywords="open research", keywords="health research", keywords="text mining study", keywords="epidemiological criminology", keywords="public health", keywords="justice systems", keywords="bias", keywords="population", keywords="men", keywords="women", keywords="prison", keywords="prisoner", keywords="researcher", abstract="Background: The field of epidemiological criminology (or justice health research) has emerged in the past decade, studying the intersection between the public health and justice systems. To ensure research efforts are focused and equitable, it is important to reflect on the outputs in this area and address knowledge gaps. Objective: This study aimed to examine the characteristics of populations researched in a large sample of published outputs and identify research gaps and biases. Methods: A rule-based, text mining method was applied to 34,481 PubMed abstracts published from 1963 to 2023 to identify 4 population characteristics (sex, age, offender type, and nationality). Results: We evaluated our method in a random sample of 100 PubMed abstracts. Microprecision was 94.3\%, with microrecall at 85.9\% and micro--F1-score at 89.9\% across the 4 characteristics. Half (n=17,039, 49.4\%) of the 34,481 abstracts did not have any characteristic mentions and only 1.3\% (n=443) reported sex, age, offender type, and nationality. From the 5170 (14.9\%) abstracts that reported age, 3581 (69.3\%) mentioned young people (younger than 18 years) and 3037 (58.7\%) mentioned adults. Since 1990, studies reporting female-only populations increased, and in 2023, these accounted for almost half (105/216, 48.6\%) of the research outputs, as opposed to 33.3\% (72/216) for male-only populations. Nordic countries (Sweden, Norway, Finland, and Denmark) had the highest number of abstracts proportional to their incarcerated populations. 
Offenders with mental illness were the most common group of interest (840/4814, 17.4\%), with an increase from 1990 onward. Conclusions: Research reporting on female populations increased, surpassing that involving male individuals, despite female individuals representing 5\% of the incarcerated population; this suggests that male prisoners are underresearched. Although calls have been made for the justice health area to focus more on young people, our results showed that among the abstracts reporting age, most mentioned a population aged <18 years, reflecting a rise of youth involvement in the youth justice system. Those convicted of sex offenses and crimes relating to children were not as researched as the existing literature suggests, with a focus instead on populations with mental illness, whose rates rose steadily in the last 30 years. After adjusting for the size of the incarcerated population, Nordic countries have conducted proportionately the most research. Our findings highlight that despite the presence of several research reporting guidelines, justice health abstracts still do not adequately describe the investigated populations. Our study offers new insights in the field of justice health with implications for promoting diversity in the selection of research participants. 
", doi="10.2196/60878", url="https://formative.jmir.org/2024/1/e60878" } @Article{info:doi/10.2196/63031, author="Maa{\ss}, Laura and Badino, Manuel and Iyamu, Ihoghosa and Holl, Felix", title="Assessing the Digital Advancement of Public Health Systems Using Indicators Published in Gray Literature: Narrative Review", journal="JMIR Public Health Surveill", year="2024", month="Nov", day="20", volume="10", pages="e63031", keywords="digital public health", keywords="health system", keywords="indicator", keywords="interdisciplinary", keywords="information and communications technology", keywords="maturity assessment", keywords="readiness assessment", keywords="narrative review", keywords="gray literature", keywords="digital health", keywords="mobile phone", abstract="Background: Revealing the full potential of digital public health (DiPH) systems requires a wide-ranging tool to assess their maturity and readiness for emerging technologies. Although a variety of indices exist to assess digital health systems, questions arise about the inclusion of indicators of information and communications technology maturity and readiness, digital (health) literacy, and interest in DiPH tools by the society and workforce, as well as the maturity of the legal framework and the readiness of digitalized health systems. Existing tools frequently target one of these domains while overlooking the others. In addition, no review has yet holistically investigated the available national DiPH system maturity and readiness indicators using a multidisciplinary lens. Objective: We used a narrative review to map the landscape of DiPH system maturity and readiness indicators published in the gray literature. Methods: As original indicators were not published in scientific databases, we applied predefined search strings to the DuckDuckGo and Google search engines for 11 countries from all continents that had reached level 4 of 5 in the latest Global Digital Health Monitor evaluation. 
In addition, we searched the literature published by 19 international organizations for maturity and readiness indicators concerning DiPH. Results: Of the 1484 identified references, 137 were included, and they yielded 15,806 indicators. We deemed 286 indicators from 90 references relevant for DiPH system maturity and readiness assessments. The majority of these indicators (133/286, 46.5\%) had legal relevance (targeting big data and artificial intelligence regulation, cybersecurity, national DiPH strategies, or health data governance), and the smallest number of indicators (37/286, 12.9\%) were related to social domains (focusing on internet use and access, digital literacy and digital health literacy, or the use of DiPH tools, smartphones, and computers). Another 14.3\% (41/286) of indicators analyzed the information and communications technology infrastructure (such as workforce, electricity, internet, and smartphone availability or interoperability standards). The remaining 26.2\% (75/286) of indicators described the degree to which DiPH was applied (including health data architecture, storage, and access; the implementation of DiPH interventions; or the existence of interventions promoting health literacy and digital inclusion). Conclusions: Our work is the first to conduct a multidisciplinary analysis of the gray literature on DiPH maturity and readiness assessments. Although new methods for systematically researching gray literature are needed, our study holds the potential to develop more comprehensive tools for DiPH system assessments. We contributed toward a more holistic understanding of DiPH. Further examination is required to analyze the suitability and applicability of all identified indicators in diverse health care settings. 
By developing a standardized method to assess DiPH system maturity and readiness, we aim to foster informed decision-making among health care planners and practitioners to improve resource distribution and continue to drive innovation in health care delivery. ", doi="10.2196/63031", url="https://publichealth.jmir.org/2024/1/e63031", url="http://www.ncbi.nlm.nih.gov/pubmed/39566910" } @Article{info:doi/10.2196/50235, author="Jefferson, Emily and Milligan, Gordon and Johnston, Jenny and Mumtaz, Shahzad and Cole, Christian and Best, Joseph and Giles, Charles Thomas and Cox, Samuel and Masood, Erum and Horban, Scott and Urwin, Esmond and Beggs, Jillian and Chuter, Antony and Reilly, Gerry and Morris, Andrew and Seymour, David and Hopkins, Susan and Sheikh, Aziz and Quinlan, Philip", title="The Challenges and Lessons Learned Building a New UK Infrastructure for Finding and Accessing Population-Wide COVID-19 Data for Research and Public Health Analysis: The CO-CONNECT Project", journal="J Med Internet Res", year="2024", month="Nov", day="20", volume="26", pages="e50235", keywords="COVID-19", keywords="infrastructure", keywords="trusted research environments", keywords="safe havens", keywords="feasibility analysis", keywords="cohort discovery", keywords="federated analytics", keywords="federated discovery", keywords="lessons learned", keywords="population wide", keywords="data", keywords="public health", keywords="analysis", keywords="CO-CONNECT", keywords="challenges", keywords="data transformation", doi="10.2196/50235", url="https://www.jmir.org/2024/1/e50235" } @Article{info:doi/10.2196/59439, author="Ke, Yuhe and Yang, Rui and Lie, An Sui and Lim, Yi Taylor Xin and Ning, Yilin and Li, Irene and Abdullah, Rizal Hairil and Ting, Wei Daniel Shu and Liu, Nan", title="Mitigating Cognitive Biases in Clinical Decision-Making Through Multi-Agent Conversations Using Large Language Models: Simulation Study", journal="J Med Internet Res", year="2024", month="Nov", day="19", 
volume="26", pages="e59439", keywords="clinical decision-making", keywords="cognitive bias", keywords="generative artificial intelligence", keywords="large language model", keywords="multi-agent", abstract="Background: Cognitive biases in clinical decision-making significantly contribute to errors in diagnosis and suboptimal patient outcomes. Addressing these biases presents a formidable challenge in the medical field. Objective: This study aimed to explore the role of large language models (LLMs) in mitigating these biases through the use of the multi-agent framework. We simulate the clinical decision-making processes through multi-agent conversation and evaluate its efficacy in improving diagnostic accuracy compared with humans. Methods: A total of 16 published and unpublished case reports where cognitive biases have resulted in misdiagnoses were identified from the literature. In the multi-agent framework, we leveraged GPT-4 (OpenAI) to facilitate interactions among different simulated agents to replicate clinical team dynamics. Each agent was assigned a distinct role: (1) making the final diagnosis after considering the discussions, (2) acting as a devil's advocate to correct confirmation and anchoring biases, (3) serving as a field expert in the required medical subspecialty, (4) facilitating discussions to mitigate premature closure bias, and (5) recording and summarizing findings. We tested varying combinations of these agents within the framework to determine which configuration yielded the highest rate of correct final diagnoses. Each scenario was repeated 5 times for consistency. The accuracy of the initial diagnoses and the final differential diagnoses were evaluated, and comparisons with human-generated answers were made using the Fisher exact test. Results: A total of 240 responses were evaluated (3 different multi-agent frameworks). The initial diagnosis had an accuracy of 0\% (0/80). 
However, following multi-agent discussions, the accuracy for the top 2 differential diagnoses increased to 76\% (61/80) for the best-performing multi-agent framework (Framework 4-C). This was significantly higher compared with the accuracy achieved by human evaluators (odds ratio 3.49; P=.002). Conclusions: The multi-agent framework demonstrated an ability to re-evaluate and correct misconceptions, even in scenarios with misleading initial investigations. In addition, the LLM-driven, multi-agent conversation framework shows promise in enhancing diagnostic accuracy in diagnostically challenging medical scenarios. ", doi="10.2196/59439", url="https://www.jmir.org/2024/1/e59439" } @Article{info:doi/10.2196/57754, author="Liu, Shuimei and Guo, Raymond L.", title="Data Ownership in the AI-Powered Integrative Health Care Landscape", journal="JMIR Med Inform", year="2024", month="Nov", day="19", volume="12", pages="e57754", keywords="data ownership", keywords="integrative healthcare", keywords="artificial intelligence", keywords="AI", keywords="ownership", keywords="data science", keywords="governance", keywords="consent", keywords="privacy", keywords="security", keywords="access", keywords="model", keywords="framework", keywords="transparency", doi="10.2196/57754", url="https://medinform.jmir.org/2024/1/e57754" } @Article{info:doi/10.2196/53622, author="Camirand Lemyre, F{\'e}lix and L{\'e}vesque, Simon and Domingue, Marie-Pier and Herrmann, Klaus and Ethier, Jean-Fran{\c{c}}ois", title="Distributed Statistical Analyses: A Scoping Review and Examples of Operational Frameworks Adapted to Health Analytics", journal="JMIR Med Inform", year="2024", month="Nov", day="14", volume="12", pages="e53622", keywords="distributed algorithms", keywords="generalized linear models", keywords="horizontally partitioned data", keywords="GLMs", keywords="learning health systems", keywords="distributed analysis", keywords="federated analysis", keywords="data science", keywords="data 
custodians", keywords="algorithms", keywords="statistics", keywords="synthesis", keywords="review methods", keywords="searches", keywords="scoping", abstract="Background: Data from multiple organizations are crucial for advancing learning health systems. However, ethical, legal, and social concerns may restrict the use of standard statistical methods that rely on pooling data. Although distributed algorithms offer alternatives, they may not always be suitable for health frameworks. Objective: This study aims to support researchers and data custodians in three ways: (1) providing a concise overview of the literature on statistical inference methods for horizontally partitioned data, (2) describing the methods applicable to generalized linear models (GLMs) and assessing their underlying distributional assumptions, and (3) adapting existing methods to make them fully usable in health settings. Methods: A scoping review methodology was used for the literature mapping, from which methods presenting a methodological framework for GLM analyses with horizontally partitioned data were identified and assessed from the perspective of applicability in health settings. Statistical theory was used to adapt methods and derive the properties of the resulting estimators. Results: From the review, 41 articles were selected and 6 approaches were extracted to conduct standard GLM-based statistical analysis. However, these approaches assumed evenly and identically distributed data across nodes. Consequently, statistical procedures were derived to accommodate uneven node sample sizes and heterogeneous data distributions across nodes. Workflows and detailed algorithms were developed to highlight information sharing requirements and operational complexity. 
Conclusions: This study contributes to the field of health analytics by providing an overview of the methods that can be used with horizontally partitioned data by adapting these methods to the context of heterogeneous health data and clarifying the workflows and quantities exchanged by the methods discussed. Further analysis of the confidentiality preserved by these methods is needed to fully understand the risk associated with the sharing of summary statistics. ", doi="10.2196/53622", url="https://medinform.jmir.org/2024/1/e53622" } @Article{info:doi/10.2196/59634, author="Parsons, Rex and Blythe, Robin and Cramb, Susanna and Abdel-Hafez, Ahmad and McPhail, Steven", title="An Electronic Medical Record--Based Prognostic Model for Inpatient Falls: Development and Internal-External Cross-Validation", journal="J Med Internet Res", year="2024", month="Nov", day="13", volume="26", pages="e59634", keywords="clinical prediction model", keywords="falls", keywords="patient safety", keywords="prognostic", keywords="electronic medical record", keywords="EMR", keywords="intervention", keywords="hospital", keywords="risk assessment", keywords="clinical decision", keywords="support system", keywords="in-hospital fall", keywords="survival model", keywords="inpatient falls", abstract="Background: Effective fall prevention interventions in hospitals require appropriate allocation of resources early in admission. To address this, fall risk prediction tools and models have been developed with the aim to provide fall prevention strategies to patients at high risk. However, fall risk assessment tools have typically been inaccurate for prediction, ineffective in prevention, and time-consuming to complete. Accurate, dynamic, individualized estimates of fall risk for admitted patients using routinely recorded data may assist in prioritizing fall prevention efforts. 
Objective: The objective of this study was to develop and validate an accurate and dynamic prognostic model for inpatient falls among a cohort of patients using routinely recorded electronic medical record data. Methods: We used routinely recorded data from 5 Australian hospitals to develop and internally-externally validate a prediction model for inpatient falls using a Cox proportional hazards model with time-varying covariates. The study cohort included patients admitted during 2018-2021 to any ward, with no age restriction. Predictors used in the model included admission-related administrative data, length of stay, and number of previous falls during the admission (updated every 12 hours up to 14 days after admission). Model calibration was assessed using Poisson regression and discrimination using the area under the time-dependent receiver operating characteristic curve. Results: There were 1,107,556 inpatient admissions, 6004 falls, and 5341 unique fallers. The area under the time-dependent receiver operating characteristic curve was 0.899 (95\% CI 0.88-0.91) at 24 hours after admission and declined throughout admission (eg, 0.765, 95\% CI 0.75-0.78 on the seventh day after admission). Site-dependent overestimation and underestimation of risk were observed on the calibration plots. Conclusions: Using a large dataset from multiple hospitals and robust methods for model development and validation, we developed a prognostic model for inpatient falls. It had high discrimination, suggesting the model has the potential for operationalization in clinical decision support for prioritizing inpatients for fall prevention. Performance was site dependent, and model recalibration may lead to improved performance. ", doi="10.2196/59634", url="https://www.jmir.org/2024/1/e59634" } @Article{info:doi/10.2196/54335, author="Gao, Hongxin and Schneider, Stefan and Hernandez, Raymond and Harris, Jenny and Maupin, Danny and Junghaenel, U. 
Doerte and Kapteyn, Arie and Stone, Arthur and Zelinski, Elizabeth and Meijer, Erik and Lee, Pey-Jiuan and Orriens, Bart and Jin, Haomiao", title="Early Identification of Cognitive Impairment in Community Environments Through Modeling Subtle Inconsistencies in Questionnaire Responses: Machine Learning Model Development and Validation", journal="JMIR Form Res", year="2024", month="Nov", day="13", volume="8", pages="e54335", keywords="machine learning", keywords="artificial intelligence", keywords="cognitive impairments", keywords="surveys and questionnaires", keywords="community health services", keywords="public health", keywords="early identification", keywords="elder care", keywords="dementia", abstract="Background: The underdiagnosis of cognitive impairment hinders timely intervention of dementia. Health professionals working in the community play a critical role in the early detection of cognitive impairment, yet still face several challenges such as a lack of suitable tools, necessary training, and potential stigmatization. Objective: This study explored a novel application integrating psychometric methods with data science techniques to model subtle inconsistencies in questionnaire response data for early identification of cognitive impairment in community environments. Methods: This study analyzed questionnaire response data from participants aged 50 years and older in the Health and Retirement Study (waves 8-9, n=12,942). Predictors included low-quality response indices generated using the graded response model from four brief questionnaires (optimism, hopelessness, purpose in life, and life satisfaction) assessing aspects of overall well-being, a focus of health professionals in communities. The primary and supplemental predicted outcomes were current cognitive impairment derived from a validated criterion and dementia or mortality in the next ten years. Seven predictive models were trained, and the performance of these models was evaluated and compared. 
Results: The multilayer perceptron exhibited the best performance in predicting current cognitive impairment. In the selected four questionnaires, the area under curve values for identifying current cognitive impairment ranged from 0.63 to 0.66 and was improved to 0.71 to 0.74 when combining the low-quality response indices with age and gender for prediction. We set the threshold for assessing cognitive impairment risk in the tool based on the ratio of underdiagnosis costs to overdiagnosis costs, and a ratio of 4 was used as the default choice. Furthermore, the tool outperformed the efficiency of age or health-based screening strategies for identifying individuals at high risk for cognitive impairment, particularly in the 50- to 59-year and 60- to 69-year age groups. The tool is available on a portal website for the public to access freely. Conclusions: We developed a novel prediction tool that integrates psychometric methods with data science to facilitate ``passive or backend'' cognitive impairment assessments in community settings, aiming to promote early cognitive impairment detection. This tool simplifies the cognitive impairment assessment process, making it more adaptable and reducing burdens. Our approach also presents a new perspective for using questionnaire data: leveraging, rather than dismissing, low-quality data. ", doi="10.2196/54335", url="https://formative.jmir.org/2024/1/e54335" } @Article{info:doi/10.2196/58116, author="Mayito, Jonathan and Tumwine, Conrad and Galiwango, Ronald and Nuwamanya, Elly and Nakasendwa, Suzan and Hope, Mackline and Kiggundu, Reuben and Byonanebye, M. 
Dathan and Dhikusooka, Flavia and Twemanye, Vivian and Kambugu, Andrew and Kakooza, Francis", title="Combating Antimicrobial Resistance Through a Data-Driven Approach to Optimize Antibiotic Use and Improve Patient Outcomes: Protocol for a Mixed Methods Study", journal="JMIR Res Protoc", year="2024", month="Nov", day="8", volume="13", pages="e58116", keywords="antimicrobial resistance", keywords="AMR database", keywords="AMR", keywords="machine learning", keywords="antimicrobial use", keywords="artificial intelligence", keywords="antimicrobial", keywords="data-driven", keywords="mixed-method", keywords="patient outcome", keywords="drug-resistant infections", keywords="drug resistant", keywords="surveillance data", keywords="economic", keywords="antibiotic", abstract="Background: It is projected that drug-resistant infections will lead to 10 million deaths annually by 2050 if left unabated. Despite this threat, surveillance data from resource-limited settings are scarce and often lack antimicrobial resistance (AMR)--related clinical outcomes and economic burden. We aim to build an AMR and antimicrobial use (AMU) data warehouse, describe the trends of resistance and antibiotic use, determine the economic burden of AMR in Uganda, and develop a machine learning algorithm to predict AMR-related clinical outcomes. Objective: The overall objective of the study is to use data-driven approaches to optimize antibiotic use and combat antimicrobial-resistant infections in Uganda. 
We aim to (1) build a dynamic AMR and antimicrobial use and consumption (AMUC) data warehouse to support research in AMR and AMUC to inform AMR-related interventions and public health policy, (2) evaluate the trends in AMR and antibiotic use based on annual antibiotic and point prevalence survey data collected at 9 regional referral hospitals over a 5-year period, (3) develop a machine learning model to predict the clinical outcomes of patients with bacterial infectious syndromes due to drug-resistant pathogens, and (4) estimate the annual economic burden of AMR in Uganda using the cost-of-illness approach. Methods: We will conduct a study involving data curation, machine learning--based modeling, and cost-of-illness analysis using AMR and AMU data abstracted from procurement, human resources, and clinical records of patients with bacterial infectious syndromes at 9 regional referral hospitals in Uganda collected between 2018 and 2026. We will use data curation procedures, FLAIR (Findable, Linkable, Accessible, Interactable and Repeatable) principles, and role-based access control to build a robust and dynamic AMR and AMU data warehouse. We will also apply machine learning algorithms to model AMR-related clinical outcomes, advanced statistical analysis to study AMR and AMU trends, and cost-of-illness analysis to determine the AMR-related economic burden. Results: The study received funding from the Wellcome Trust through the Centers for Antimicrobial Optimisation Network (CAMO-Net) in April 2023. As of October 28, 2024, we completed data warehouse development, which is now under testing; completed data curation of the historical Fleming Fund surveillance data (2020-2023); and collected retrospective AMR records for 599 patients that contained clinical outcomes and cost-of-illness economic burden data across 9 surveillance sites for objectives 3 and 4, respectively. 
Conclusions: The data warehouse will promote access to rich and interlinked AMR and AMU data sets to answer AMR program and research questions using a wide evidence base. The AMR-related clinical outcomes model and cost data will facilitate improvement in the clinical management of AMR patients and guide resource allocation to support AMR surveillance and interventions. International Registered Report Identifier (IRRID): PRR1-10.2196/58116 ", doi="10.2196/58116", url="https://www.researchprotocols.org/2024/1/e58116" } @Article{info:doi/10.2196/53337, author="Bhavaraju, L. Vasudha and Panchanathan, Sarada and Willis, C. Brigham and Garcia-Filion, Pamela", title="Leveraging the Electronic Health Record to Measure Resident Clinical Experiences and Identify Training Gaps: Development and Usability Study", journal="JMIR Med Educ", year="2024", month="Nov", day="6", volume="10", pages="e53337", keywords="clinical informatics", keywords="electronic health record", keywords="pediatric resident", keywords="COVID-19", keywords="competence-based medical education", keywords="pediatric", keywords="children", keywords="SARS-CoV-2", keywords="clinic", keywords="urban", keywords="diagnosis", keywords="health informatics", keywords="EHR", keywords="individualized learning plan", abstract="Background: Competence-based medical education requires robust data to link competence with clinical experiences. The SARS-CoV-2 (COVID-19) pandemic abruptly altered the standard trajectory of clinical exposure in medical training programs. Residency program directors were tasked with identifying and addressing the resultant gaps in each trainee's experiences using existing tools. 
Objective: This study aims to demonstrate a feasible and efficient method to capture electronic health record (EHR) data that measure the volume and variety of pediatric resident clinical experiences from a continuity clinic; generate individual-, class-, and graduate-level benchmark data; and create a visualization for learners to quickly identify gaps in clinical experiences. Methods: This pilot was conducted in a large, urban pediatric residency program from 2016 to 2022. Through consensus, 5 pediatric faculty identified diagnostic groups that pediatric residents should see to be competent in outpatient pediatrics. Information technology consultants used International Classification of Diseases, Tenth Revision (ICD-10) codes corresponding with each diagnostic group to extract EHR patient encounter data as an indicator of exposure to the specific diagnosis. The frequency (volume) and diagnosis types (variety) seen by active residents (classes of 2020-2022) were compared with class and graduated resident (classes of 2016-2019) averages. These data were converted to percentages and translated to a radar chart visualization for residents to quickly compare their current clinical experiences with peers and graduates. Residents were surveyed on the use of these data and the visualization to identify training gaps. Results: Patient encounter data about clinical experiences for 102 residents (N=52 graduates) were extracted. Active residents (n=50) received data reports with radar graphs biannually: 3 for the classes of 2020 and 2021 and 2 for the class of 2022. Radar charts distinctly demonstrated gaps in diagnoses exposure compared with classmates and graduates. Residents found the visualization useful in setting clinical and learning goals. 
Conclusions: This pilot describes an innovative method of capturing and presenting data about resident clinical experiences, compared with peer and graduate benchmarks, to identify learning gaps that may result from disruptions or modifications in medical training. This methodology can be aggregated across specialties and institutions and potentially inform competence-based medical education. ", doi="10.2196/53337", url="https://mededu.jmir.org/2024/1/e53337" } @Article{info:doi/10.2196/58130, author="Penev, P. Yordan and Buchanan, R. Timothy and Ruppert, M. Matthew and Liu, Michelle and Shekouhi, Ramin and Guan, Ziyuan and Balch, Jeremy and Ozrazgat-Baslanti, Tezcan and Shickel, Benjamin and Loftus, J. Tyler and Bihorac, Azra", title="Electronic Health Record Data Quality and Performance Assessments: Scoping Review", journal="JMIR Med Inform", year="2024", month="Nov", day="6", volume="12", pages="e58130", keywords="electronic health record", keywords="EHR", keywords="record", keywords="data quality", keywords="data performance", keywords="clinical informatics", keywords="performance", keywords="data science", keywords="synthesis", keywords="review methods", keywords="review methodology", keywords="search", keywords="scoping", abstract="Background: Electronic health records (EHRs) have an enormous potential to advance medical research and practice through easily accessible and interpretable EHR-derived databases. Attainability of this potential is limited by issues with data quality (DQ) and performance assessment. Objective: This review aims to streamline the current best practices on EHR DQ and performance assessments as a replicable standard for researchers in the field. Methods: PubMed was systematically searched for original research articles assessing EHR DQ and performance from inception until May 7, 2023. Results: Our search yielded 26 original research articles. 
Most articles had 1 or more significant limitations, including incomplete or inconsistent reporting (n=6, 30\%), poor replicability (n=5, 25\%), and limited generalizability of results (n=5, 25\%). Completeness (n=21, 81\%), conformance (n=18, 69\%), and plausibility (n=16, 62\%) were the most cited indicators of DQ, while correctness or accuracy (n=14, 54\%) was most cited for data performance, with context-specific supplementation by recency (n=7, 27\%), fairness (n=6, 23\%), stability (n=4, 15\%), and shareability (n=2, 8\%) assessments. Artificial intelligence--based techniques, including natural language data extraction, data imputation, and fairness algorithms, were demonstrated to play a rising role in improving both dataset quality and performance. Conclusions: This review highlights the need for incentivizing DQ and performance assessments and their standardization. The results suggest the usefulness of artificial intelligence--based techniques for enhancing DQ and performance to unlock the full potential of EHRs to improve medical research and practice. 
", doi="10.2196/58130", url="https://medinform.jmir.org/2024/1/e58130" } @Article{info:doi/10.2196/59674, author="Subramanian, Hemang and Sengupta, Arijit and Xu, Yilin", title="Patient Health Record Protection Beyond the Health Insurance Portability and Accountability Act: Mixed Methods Study", journal="J Med Internet Res", year="2024", month="Nov", day="6", volume="26", pages="e59674", keywords="security", keywords="privacy", keywords="security breach", keywords="breach report", keywords="health care", keywords="health care infrastructure", keywords="regulatory", keywords="law enforcement", keywords="Omnibus Rule", keywords="qualitative analysis", keywords="AI-generated data", keywords="artificial intelligence", keywords="difference-in-differences", keywords="best practice", keywords="data privacy", keywords="safe practice", abstract="Background: The security and privacy of health care information are crucial for maintaining the societal value of health care as a public good. However, governance over electronic health care data has proven inefficient, despite robust enforcement efforts. Both federal (HIPAA [Health Insurance Portability and Accountability Act]) and state regulations, along with the ombudsman rule, have not effectively reduced the frequency or impact of data breaches in the US health care system. While legal frameworks have bolstered data security, recent years have seen a concerning increase in breach incidents. This paper investigates common breach types and proposes best practices derived from the data as potential solutions. Objective: The primary aim of this study is to analyze health care and hospital breach data, comparing it against HIPAA compliance levels across states (spatial analysis) and the impact of the Omnibus Rule over time (temporal analysis). The goal is to establish guidelines for best practices in handling sensitive information within hospitals and clinical environments. 
Methods: The study used data from the Department of Health and Human Services on reported breaches, assessing the severity and impact of each breach type. We then analyzed secondary data to examine whether HIPAA's storage and retention rule amendments have influenced security and privacy incidents across all 50 states. Finally, we conducted a qualitative analysis of textual data from vulnerability and breach reports to identify actionable best practices for health care settings. Results: Our findings indicate that hacking or IT incidents have the most significant impact on the number of individuals affected, highlighting this as a primary breach category. The overall difference-in-differences trend reveals no significant reduction in breach rates (P=.50), despite state-level regulations exceeding HIPAA requirements and the introduction of the Omnibus Rule. This persistence in breach trends implies that even strengthened protections and additional guidelines have not effectively curbed the rising number of affected individuals. Through qualitative analysis, we identified 15 unique values and associated best practices from industry standards. Conclusions: Combining quantitative and qualitative insights, we propose the ``SecureSphere framework'' to enhance data security in health care institutions. This framework presents key security values structured in concentric circles: core values at the center and peripheral values around them. The core values include employee management, policy, procedures, and IT management. Peripheral values encompass the remaining security attributes that support these core elements. This structured approach provides a comprehensive security strategy for protecting patient health information and is designed to help health care organizations develop sustainable practices for data security. 
", doi="10.2196/59674", url="https://www.jmir.org/2024/1/e59674" } @Article{info:doi/10.2196/54246, author="Paiva, Bruno and Gon{\c{c}}alves, Andr{\'e} Marcos and da Rocha, Dutra Leonardo Chaves and Marcolino, Soriano Milena and Lana, Barbosa Fernanda Cristina and Souza-Silva, Rego Maira Viana and Almeida, M. Jussara and Pereira, Delfino Polianna and de Andrade, Valiense Claudio Mois{\'e}s and Gomes, Reis Ang{\'e}lica Gomides dos and Ferreira, Pires Maria Ang{\'e}lica and Bartolazzi, Frederico and Sacioto, Furtado Manuela and Boscato, Paula Ana and Guimar{\~a}es-J{\'u}nior, Henriques Milton and dos Reis, Pereira Priscilla and Costa, Roberto Fel{\'i}cio and Jorge, Oliveira Alzira de and Coelho, Reis Laryssa and Carneiro, Marcelo and Sales, Souza Tha{\'i}s Lorenna and Ara{\'u}jo, Ferreira Silvia and Silveira, Vit{\'o}rio Daniel and Ruschel, Brasil Karen and Santos, Veloso Fernanda Caldeira and Cenci, Almeida Evelin Paola de and Menezes, Monteiro Luanna Silva and Anschau, Fernando and Bicalho, Camargos Maria Aparecida and Manenti, Fernandes Euler Roberto and Finger, Goulart Renan and Ponce, Daniela and de Aguiar, Carrilho Filipe and Marques, Margoto Luiza and de Castro, C{\'e}sar Lu{\'i}s and Vietta, Gr{\"u}newald Giovanna and Godoy, de Mariana Frizzo and Vila{\c{c}}a, Nascimento Mariana do and Morais, Costa Vivian", title="A New Natural Language Processing--Inspired Methodology (Detection, Initial Characterization, and Semantic Characterization) to Investigate Temporal Shifts (Drifts) in Health Care Data: Quantitative Study", journal="JMIR Med Inform", year="2024", month="Oct", day="28", volume="12", pages="e54246", keywords="health care", keywords="machine learning", keywords="data drifts", keywords="temporal drifts", abstract="Background: Proper analysis and interpretation of health care data can significantly improve patient outcomes by enhancing services and revealing the impacts of new technologies and treatments. 
Understanding the substantial impact of temporal shifts in these data is crucial. For example, COVID-19 vaccination initially lowered the mean age of at-risk patients and later changed the characteristics of those who died. This highlights the importance of understanding these shifts for assessing factors that affect patient outcomes. Objective: This study aims to propose detection, initial characterization, and semantic characterization (DIS), a new methodology for analyzing changes in health outcomes and variables over time while discovering contextual changes for outcomes in large volumes of data. Methods: The DIS methodology involves 3 steps: detection, initial characterization, and semantic characterization. Detection uses metrics such as Jensen-Shannon divergence to identify significant data drifts. Initial characterization offers a global analysis of changes in data distribution and predictive feature significance over time. Semantic characterization uses natural language processing--inspired techniques to understand the local context of these changes, helping identify factors driving changes in patient outcomes. By integrating the outcomes from these 3 steps, our results can identify specific factors (eg, interventions and modifications in health care practices) that drive changes in patient outcomes. DIS was applied to the Brazilian COVID-19 Registry and the Medical Information Mart for Intensive Care, version IV (MIMIC-IV) data sets. Results: Our approach allowed us to (1) identify drifts effectively, especially using metrics such as the Jensen-Shannon divergence, and (2) uncover reasons for the decline in overall mortality in both the COVID-19 and MIMIC-IV data sets, as well as changes in the cooccurrence between different diseases and this particular outcome. Factors such as vaccination during the COVID-19 pandemic and reduced iatrogenic events and cancer-related deaths in MIMIC-IV were highlighted. 
The methodology also pinpointed shifts in patient demographics and disease patterns, providing insights into the evolving health care landscape during the study period. Conclusions: We developed a novel methodology combining machine learning and natural language processing techniques to detect, characterize, and understand temporal shifts in health care data. This understanding can enhance predictive algorithms, improve patient outcomes, and optimize health care resource allocation, ultimately improving the effectiveness of machine learning predictive algorithms applied to health care data. Our methodology can be applied to a variety of scenarios beyond those discussed in this paper. ", doi="10.2196/54246", url="https://medinform.jmir.org/2024/1/e54246" } @Article{info:doi/10.2196/55531, author="Lee, Eun and Kim, Heejun and Esener, Yildiz and McCall, Terika", title="Internet-Based Social Connections of Black American College Students in Pre--COVID-19 and Peri--COVID-19 Pandemic Periods: Network Analysis", journal="J Med Internet Res", year="2024", month="Oct", day="28", volume="26", pages="e55531", keywords="COVID-19 pandemic", keywords="college students", keywords="Black American", keywords="African American", keywords="social network analysis", keywords="social media", keywords="mental health", keywords="depression", abstract="Background: A global-scale pandemic, such as the COVID-19 pandemic, greatly impacted communities of color. Moreover, physical distancing recommendations during the height of the COVID-19 pandemic negatively affected people's sense of social connection, especially among young individuals. More research is needed on the use of social media and communication about depression, with a specific focus on young Black Americans. 
Objective: This paper aims to examine whether there are any differences in social-networking characteristics before and during the pandemic periods (ie, pre--COVID-19 pandemic vs peri--COVID-19 pandemic) among the students of historically Black colleges and universities (HBCUs). The researchers focus on students who have posted or retweeted depression-related tweets on their timelines, as well as those who have not, to understand the collective patterns of both groups. Methods: This paper analyzed the social networks on Twitter (currently known as X; X Corp) of HBCU students by comparing pre--COVID-19 and peri--COVID-19 pandemic data. The researchers quantified the structural properties, such as reciprocity, homophily, and communities, to test the differences in internet-based socializing patterns between the depression-related and non--depression related groups for the 2 periods. Results: During the COVID-19 pandemic period, the group with depression-related tweets saw an increase in internet-based friendships, with the average number of friends rising from 1194 (SD 528.14) to 1371 (SD 824.61; P<.001). Their mutual relationships strengthened (reciprocity: 0.78-0.8; P=.01), and they showed higher assortativity with other depression-related group members (0.6-0.7; P<.001). In a network with only HBCU students, internet-based and physical affiliation memberships aligned closely during the peri--COVID-19 pandemic period, with membership entropy decreasing from 1.0 to 0.5. While users without depression-related tweets engaged more on the internet with other users who shared physical affiliations, those who posted depression-related tweets maintained consistent entropy levels (modularity: 0.75-0.76). Compared with randomized networks before and during the COVID-19 pandemic (P<.001), the users also exhibited high homophily with other members who posted depression-related tweets. 
Conclusions: The findings of this study provided insight into the social media activities of HBCU students' social networks and communication about depression on social media. Future social media interventions focused on the mental health of Black college students may prioritize providing resources to students who communicate about depression. Efforts aimed at providing relevant resources and information to internet-based communities that share institutional affiliation may enhance access to social support, particularly for those who may not proactively seek assistance. This approach may contribute to increased social support for individuals within these communities, especially those with a limited social capacity. ", doi="10.2196/55531", url="https://www.jmir.org/2024/1/e55531" } @Article{info:doi/10.2196/62678, author="Ball Dunlap, A. Patricia and Michalowski, Martin", title="Advancing AI Data Ethics in Nursing: Future Directions for Nursing Practice, Research, and Education", journal="JMIR Nursing", year="2024", month="Oct", day="25", volume="7", pages="e62678", keywords="artificial intelligence", keywords="AI data ethics", keywords="data-centric AI", keywords="nurses", keywords="nursing informatics", keywords="machine learning", keywords="data literacy", keywords="health care AI", keywords="responsible AI", doi="10.2196/62678", url="https://nursing.jmir.org/2024/1/e62678" } @Article{info:doi/10.2196/54653, author="Manion, J. Frank and Du, Jingcheng and Wang, Dong and He, Long and Lin, Bin and Wang, Jingqi and Wang, Siwei and Eckels, David and Cervenka, Jan and Fiduccia, C. 
Peter and Cossrow, Nicole and Yao, Lixia", title="Accelerating Evidence Synthesis in Observational Studies: Development of a Living Natural Language Processing--Assisted Intelligent Systematic Literature Review System", journal="JMIR Med Inform", year="2024", month="Oct", day="23", volume="12", pages="e54653", keywords="machine learning", keywords="deep learning", keywords="natural language processing", keywords="systematic literature review", keywords="artificial intelligence", keywords="software development", keywords="data extraction", keywords="epidemiology", abstract="Background: Systematic literature review (SLR), a robust method to identify and summarize evidence from published sources, is considered to be a complex, time-consuming, labor-intensive, and expensive task. Objective: This study aimed to present a solution based on natural language processing (NLP) that accelerates and streamlines the SLR process for observational studies using real-world data. Methods: We followed an agile software development and iterative software engineering methodology to build a customized intelligent end-to-end living NLP-assisted solution for observational SLR tasks. Multiple machine learning--based NLP algorithms were adopted to automate article screening and data element extraction processes. The NLP prediction results can be further reviewed and verified by domain experts, following the human-in-the-loop design. The system integrates explainable artificial intelligence to provide evidence for NLP algorithms and add transparency to extracted literature data elements. The system was developed based on 3 existing SLR projects of observational studies, including the epidemiology studies of human papillomavirus--associated diseases, the disease burden of pneumococcal diseases, and cost-effectiveness studies on pneumococcal vaccines. 
Results: Our Intelligent SLR Platform covers major SLR steps, including study protocol setting, literature retrieval, abstract screening, full-text screening, data element extraction from full-text articles, results summary, and data visualization. The NLP algorithms achieved accuracy scores of 0.86-0.90 on article screening tasks (framed as text classification tasks) and macroaverage F1 scores of 0.57-0.89 on data element extraction tasks (framed as named entity recognition tasks). Conclusions: Cutting-edge NLP algorithms expedite SLR for observational studies, thus allowing scientists to have more time to focus on the quality of data and the synthesis of evidence in observational studies. Aligning with the living SLR concept, the system has the potential to update literature data and enable scientists to easily stay current with the literature related to observational studies prospectively and continuously. ", doi="10.2196/54653", url="https://medinform.jmir.org/2024/1/e54653" } @Article{info:doi/10.2196/57569, author="Kruse, Jesse and Wiedekopf, Joshua and Kock-Schoppenhauer, Ann-Kristin and Essenwanger, Andrea and Ingenerf, Josef and Ulrich, Hannes", title="A Generic Transformation Approach for Complex Laboratory Data Using the Fast Healthcare Interoperability Resources Mapping Language: Method Development and Implementation", journal="JMIR Med Inform", year="2024", month="Oct", day="18", volume="12", pages="e57569", keywords="FHIR", keywords="StructureMaps", keywords="FHIR mapping language", keywords="laboratory data", keywords="mapping", keywords="standardization", keywords="data science", keywords="healthcare system", keywords="HIS", keywords="information system", keywords="electronic healthcare record", keywords="health care system", keywords="electronic health record", keywords="health information system", abstract="Background: Reaching meaningful interoperability between proprietary health care systems is a ubiquitous task in medical informatics, where 
communication servers are traditionally used for transferring and transforming data from source to target systems. The Mirth Connect Server, an open-source communication server, offers, in addition to the exchange functionality, functions for simultaneous manipulation of data. The Fast Healthcare Interoperability Resources (FHIR) standard has recently become increasingly prevalent in national health care systems. FHIR specifies its own standardized mechanisms for transforming data structures using StructureMaps and the FHIR mapping language (FML). Objective: In this study, a generic approach is developed, which allows for the application of declarative mapping rules defined using FML in an exchangeable manner. A transformation engine is required to execute the mapping rules. Methods: FHIR natively defines resources to support the conversion of instance data, such as a FHIR StructureMap. This resource encodes all information required to transform data from a source system to a target system. In our approach, this information is defined in an implementation-independent manner using FML. Once the mapping has been defined, executable Mirth channels are automatically generated from the resources containing the mapping in JavaScript format. These channels can then be deployed to the Mirth Connect Server. Results: The resulting tool is called FML2Mirth, a Java-based transformer that derives Mirth channels from detailed declarative mapping rules based on the underlying StructureMaps. Implementation of the translate functionality is provided by the integration of a terminology server, and to achieve conformity with existing profiles, validation via the FHIR validator is built in. 
The system was evaluated for its practical use by transforming Labordatentr{\"a}ger version 2 (LDTv.2) laboratory results into Medical Information Object (Medizinisches Informationsobjekt) laboratory reports in accordance with the National Association of Statutory Health Insurance Physicians' specifications and into the HL7 (Health Level Seven) Europe Laboratory Report. The system could generate complex structures, but LDTv.2 lacks some information to fully comply with the specification. Conclusions: The tool for the auto-generation of Mirth channels was successfully presented. Our tests reveal the feasibility of using the complex structures of the mapping language in combination with a terminology server to transform instance data. Although the Mirth Connect Server and FHIR are well established in medical informatics, their combination offers space for more research, especially with regard to FML. At the same time, the mapping language still has implementation-related shortcomings that can be compensated for by Mirth Connect as a base technology. 
", doi="10.2196/57569", url="https://medinform.jmir.org/2024/1/e57569" } @Article{info:doi/10.2196/43954, author="Simblett, Sara and Dawe-Lane, Erin and Gilpin, Gina and Morris, Daniel and White, Katie and Erturk, Sinan and Devonshire, Julie and Lees, Simon and Zormpas, Spyridon and Polhemus, Ashley and Temesi, Gergely and Cummins, Nicholas and Hotopf, Matthew and Wykes, Til and ", title="Data Visualization Preferences in Remote Measurement Technology for Individuals Living With Depression, Epilepsy, and Multiple Sclerosis: Qualitative Study", journal="J Med Internet Res", year="2024", month="Oct", day="18", volume="26", pages="e43954", keywords="mHealth", keywords="qualitative", keywords="technology", keywords="depression", keywords="epilepsy", keywords="multiple sclerosis", keywords="wearables", keywords="devices", keywords="smartphone apps", keywords="application", keywords="feedback", keywords="users", keywords="data", keywords="data visualization", keywords="mobile phone", abstract="Background: Remote measurement technology (RMT) involves the use of wearable devices and smartphone apps to measure health outcomes in everyday life. RMT with feedback in the form of data visual representations can facilitate self-management of chronic health conditions, promote health care engagement, and present opportunities for intervention. Studies to date focus broadly on multiple dimensions of service users' design preferences and RMT user experiences (eg, health variables of perceived importance and perceived quality of medical advice provided) as opposed to data visualization preferences. Objective: This study aims to explore data visualization preferences and priorities in RMT, with individuals living with depression, those with epilepsy, and those with multiple sclerosis (MS). 
Methods: This triangulated qualitative study compared and thematically synthesized focus group discussions with user reviews of existing self-management apps and a systematic review of RMT data visualization preferences. A total of 45 people participated in 6 focus groups across the 3 health conditions (depression, n=17; epilepsy, n=11; and MS, n=17). Results: Thematic analysis validated a major theme around design preferences and recommendations and identified a further 4 minor themes: (1) data reporting, (2) impact of visualization, (3) moderators of visualization preferences, and (4) system-related factors and features. Conclusions: When used effectively, data visualizations are valuable, engaging components of RMT. Easy-to-use and intuitive data visualization design was lauded by individuals with neurological and psychiatric conditions. App design needs to consider the unique requirements of service users. Overall, this study offers RMT developers a comprehensive outline of the data visualization preferences of individuals living with depression, epilepsy, and MS. 
", doi="10.2196/43954", url="https://www.jmir.org/2024/1/e43954" } @Article{info:doi/10.2196/50730, author="Kaafarani, Rima and Ismail, Leila and Zahwe, Oussama", title="Automatic Recommender System of Development Platforms for Smart Contract--Based Health Care Insurance Fraud Detection Solutions: Taxonomy and Performance Evaluation", journal="J Med Internet Res", year="2024", month="Oct", day="18", volume="26", pages="e50730", keywords="blockchain", keywords="blockchain development platform", keywords="eHealth", keywords="fraud detection", keywords="fraud scenarios", keywords="health care", keywords="health care insurance", keywords="health insurance", keywords="machine learning", keywords="medical informatics", keywords="recommender system", keywords="smart contract", keywords="taxonomy", abstract="Background: Health care insurance fraud is on the rise in many ways, such as falsifying information and hiding third-party liability. This can result in significant losses for the medical health insurance industry. Consequently, fraud detection is crucial. Currently, companies employ auditors who manually evaluate records and pinpoint fraud. However, an automated and effective method is needed to detect fraud with the continually increasing number of patients seeking health insurance. Blockchain is an emerging technology and is constantly evolving to meet business needs. With its characteristics of immutability, transparency, traceability, and smart contracts, it demonstrates its potential in the health care domain. In particular, self-executable smart contracts are essential to reduce the costs associated with traditional paradigms, which are mostly manual, while preserving privacy and building trust among health care stakeholders, including the patient and the health insurance networks. However, with the proliferation of blockchain development platform options, selecting the right one for health care insurance can be difficult. 
This study addressed this void and developed an automated decision map recommender system to select the most effective blockchain platform for insurance fraud detection. Objective: This study aims to develop smart contracts for detecting health care insurance fraud efficiently. Therefore, we provided a taxonomy of fraud scenarios and implemented their detection using a blockchain platform that was suitable for health care insurance fraud detection. To automatically and efficiently select the best platform, we proposed and implemented a decision map--based recommender system. For developing the decision map, we proposed a taxonomy of 102 blockchain platforms. Methods: We developed smart contracts for 12 fraud scenarios that we identified in the literature. We used the top 2 blockchain platforms selected by our proposed decision map--based recommender system, which is tailored for health care insurance fraud. The map used our taxonomy of 102 blockchain platforms classified according to their application domains. Results: The recommender system demonstrated that Hyperledger Fabric was the best blockchain platform for identifying health care insurance fraud. We validated our recommender system by comparing the performance of the top 2 platforms selected by our system. The blockchain platform taxonomy that we created revealed that 59 blockchain platforms are suitable for all application domains, 25 are suitable for financial services, and 18 are suitable for various application domains. We implemented fraud detection based on smart contracts. Conclusions: Our decision map recommender system, which was based on our proposed taxonomy of 102 platforms, automatically selected the top 2 platforms, which were Hyperledger Fabric and Neo, for the implementation of health care insurance fraud detection. Our performance evaluation of the 2 platforms indicated that Fabric surpassed Neo in all performance metrics, as depicted by our recommender system. 
We provided an implementation of fraud detection based on smart contracts. ", doi="10.2196/50730", url="https://www.jmir.org/2024/1/e50730", url="http://www.ncbi.nlm.nih.gov/pubmed/39423005" } @Article{info:doi/10.2196/53024, author="Yusuf, K. Zainab and Dixon, G. William and Sharp, Charlotte and Cook, Louise and Holm, S{\o}ren and Sanders, Caroline", title="Building and Sustaining Public Trust in Health Data Sharing for Musculoskeletal Research: Semistructured Interview and Focus Group Study", journal="J Med Internet Res", year="2024", month="Oct", day="15", volume="26", pages="e53024", keywords="data sharing", keywords="public trust", keywords="musculoskeletal", keywords="marginalized communities", keywords="underserved communities", abstract="Background: Although many people are supportive of their deidentified health care data being used for research, concerns about privacy, safety, and security of health care data remain. There is low awareness about how data are used for research and related governance. Transparency about how health data are used for research is crucial for building public trust. One proposed solution is to ensure that affected communities are notified, particularly marginalized communities where there has previously been a lack of engagement and mistrust. Objective: This study aims to explore patient and public perspectives on the use of deidentified data from electronic health records for musculoskeletal research and to explore ways to build and sustain public trust in health data sharing for a research program (known as ``the Data Jigsaw'') piloting new ways of using and analyzing electronic health data. Views and perspectives about how best to engage with local communities informed the development of a public notification campaign about the research. 
Methods: Qualitative data were generated from 20 semistructured interviews and 8 focus groups, comprising 48 participants in total with musculoskeletal conditions or symptoms, including 3 carers. A presentation about the use of health data for research and examples from the specific research projects within the program were used to trigger discussion. We worked in partnership with a patient and public involvement group throughout the research and cofacilitated wider community engagement. Results: Respondents were supportive of their health care data being shared for research purposes, but there was low awareness about how electronic health records are used for research. Security and governance concerns about data sharing were noted, including collaborations with external companies and accessing social care records. Project examples from the Data Jigsaw program were viewed positively after respondents knew more about how their data were being used to improve patient care. A range of different methods to build and sustain trust were deemed necessary by participants. Information was requested about: data management; individuals with access to the data (including any collaboration with external companies); the National Health Service's national data opt-out; and research outcomes. It was considered important to enable in-person dialogue with affected communities in addition to other forms of information. Conclusions: The findings have emphasized the need for transparency and awareness about health data sharing for research, and the value of tailoring this to reflect current and local research where residents might feel more invested in the focus of research and the use of local records. Thus, the provision for targeted information within affected communities with accessible messages and community-based dialogue could help to build and sustain public trust. 
These findings can also be extrapolated to other conditions beyond musculoskeletal conditions, making the findings relevant to a much wider community. ", doi="10.2196/53024", url="https://www.jmir.org/2024/1/e53024" } @Article{info:doi/10.2196/49781, author="Grothman, Allison and Ma, J. William and Tickner, G. Kendra and Martin, A. Elliot and Southern, A. Danielle and Quan, Hude", title="Case Identification of Depression in Inpatient Electronic Medical Records: Scoping Review", journal="JMIR Med Inform", year="2024", month="Oct", day="14", volume="12", pages="e49781", keywords="electronic medical records", keywords="EMR phenotyping", keywords="depression", keywords="algorithms", keywords="health services research", keywords="precision public health", keywords="inpatient", keywords="clinical information", keywords="phenotyping", keywords="data accessibility", keywords="scoping review", keywords="disparity", keywords="development", keywords="phenotype", keywords="PRISMA-ScR", keywords="Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews", abstract="Background: Electronic medical records (EMRs) contain large amounts of detailed clinical information. Using medical record review to identify conditions within large quantities of EMRs can be time-consuming and inefficient. EMR-based phenotyping using machine learning and natural language processing algorithms is a continually developing area of study that holds potential for numerous mental health disorders. Objective: This review evaluates the current state of EMR-based case identification for depression and provides guidance on using current algorithms and constructing new ones. Methods: A scoping review of EMR-based algorithms for phenotyping depression was completed. This research encompassed studies published from January 2000 to May 2023. The search involved 3 databases: Embase, MEDLINE, and APA PsycInfo. 
This was carried out using selected keywords that fell into 3 categories: terms connected to EMRs, terms connected to case identification, and terms pertaining to depression. This study adhered to the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) guidelines. Results: A total of 20 papers were assessed and summarized in the review. Most of these studies were undertaken in the United States, accounting for 75\% (15/20). The United Kingdom and Spain followed, accounting for 15\% (3/20) and 10\% (2/20) of the studies, respectively. Both data-driven and clinical rule-based methodologies were identified. The development of EMR-based phenotypes and algorithms indicates the data accessibility permitted by each health system, which led to varying performance levels among different algorithms. Conclusions: Better use of structured and unstructured EMR components through techniques such as machine learning and natural language processing has the potential to improve depression phenotyping. However, more validation must be carried out to have confidence in depression case identification algorithms in general. ", doi="10.2196/49781", url="https://medinform.jmir.org/2024/1/e49781" } @Article{info:doi/10.2196/62924, author="Chang, Eunsuk and Sung, Sumi", title="Use of SNOMED CT in Large Language Models: Scoping Review", journal="JMIR Med Inform", year="2024", month="Oct", day="7", volume="12", pages="e62924", keywords="SNOMED CT", keywords="ontology", keywords="knowledge graph", keywords="large language models", keywords="natural language processing", keywords="language models", abstract="Background: Large language models (LLMs) have substantially advanced natural language processing (NLP) capabilities but often struggle with knowledge-driven tasks in specialized domains such as biomedicine. Integrating biomedical knowledge sources such as SNOMED CT into LLMs may enhance their performance on biomedical tasks. 
However, the methodologies and effectiveness of incorporating SNOMED CT into LLMs have not been systematically reviewed. Objective: This scoping review aims to examine how SNOMED CT is integrated into LLMs, focusing on (1) the types and components of LLMs being integrated with SNOMED CT, (2) which contents of SNOMED CT are being integrated, and (3) whether this integration improves LLM performance on NLP tasks. Methods: Following the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) guidelines, we searched ACM Digital Library, ACL Anthology, IEEE Xplore, PubMed, and Embase for relevant studies published from 2018 to 2023. Studies were included if they incorporated SNOMED CT into LLM pipelines for natural language understanding or generation tasks. Data on LLM types, SNOMED CT integration methods, end tasks, and performance metrics were extracted and synthesized. Results: The review included 37 studies. Bidirectional Encoder Representations from Transformers and its biomedical variants were the most commonly used LLMs. Three main approaches for integrating SNOMED CT were identified: (1) incorporating SNOMED CT into LLM inputs (28/37, 76\%), primarily using concept descriptions to expand training corpora; (2) integrating SNOMED CT into additional fusion modules (5/37, 14\%); and (3) using SNOMED CT as an external knowledge retriever during inference (5/37, 14\%). The most frequent end task was medical concept normalization (15/37, 41\%), followed by entity extraction or typing and classification. While most of the studies that provided direct comparisons (17/19, 89\%) reported performance improvements after SNOMED CT integration, only about half of all studies (19/37, 51\%) provided such comparisons. The reported gains varied widely across different metrics and tasks, ranging from 0.87\% to 131.66\%. However, some studies showed either no improvement or a decline in certain performance metrics. 
Conclusions: This review demonstrates diverse approaches for integrating SNOMED CT into LLMs, with a focus on using concept descriptions to enhance biomedical language understanding and generation. While the results suggest potential benefits of SNOMED CT integration, the lack of standardized evaluation methods and comprehensive performance reporting hinders definitive conclusions about its effectiveness. Future research should prioritize consistent reporting of performance comparisons and explore more sophisticated methods for incorporating SNOMED CT's relational structure into LLMs. In addition, the biomedical NLP community should develop standardized evaluation frameworks to better assess the impact of ontology integration on LLM performance. ", doi="10.2196/62924", url="https://medinform.jmir.org/2024/1/e62924", url="http://www.ncbi.nlm.nih.gov/pubmed/39374057" } @Article{info:doi/10.2196/58085, author="Conderino, Sarah and Anthopolos, Rebecca and Albrecht, S. Sandra and Farley, M. Shannon and Divers, Jasmin and Titus, R. Andrea and Thorpe, E. Lorna", title="Addressing Information Biases Within Electronic Health Record Data to Improve the Examination of Epidemiologic Associations With Diabetes Prevalence Among Young Adults: Cross-Sectional Study", journal="JMIR Med Inform", year="2024", month="Oct", day="1", volume="12", pages="e58085", keywords="information bias", keywords="electronic health record", keywords="EHR", keywords="epidemiologic method", keywords="confounding factor", keywords="diabetes", keywords="epidemiology", keywords="young adult", keywords="cross-sectional study", keywords="risk factor", keywords="asthma", keywords="race", keywords="ethnicity", keywords="diabetic", keywords="diabetic adult", abstract="Background: Electronic health records (EHRs) are increasingly used for epidemiologic research to advance public health practice. 
However, key variables are susceptible to missing data or misclassification within EHRs, including demographic information or disease status, which could affect the estimation of disease prevalence or risk factor associations. Objective: In this paper, we applied methods from the literature on missing data and causal inference to assess whether we could mitigate information biases when estimating measures of association between potential risk factors and diabetes among a patient population of New York City young adults. Methods: We estimated the odds ratio (OR) for diabetes by race or ethnicity and asthma status using EHR data from NYU Langone Health. Methods from the missing data and causal inference literature were then applied to assess the ability to control for misclassification of health outcomes in the EHR data. We compared EHR-based associations with associations observed from 2 national health surveys, the Behavioral Risk Factor Surveillance System (BRFSS) and the National Health and Nutrition Examination Survey, representing traditional public health surveillance systems. Results: Observed EHR-based associations between race or ethnicity and diabetes were comparable to health survey-based estimates, but the association between asthma and diabetes was significantly overestimated (OREHR 3.01, 95\% CI 2.86-3.18 vs ORBRFSS 1.23, 95\% CI 1.09-1.40). Missing data and causal inference methods reduced information biases in these estimates, yielding relative differences from traditional estimates below 50\% (ORMissingData 1.79, 95\% CI 1.67-1.92 and ORCausal 1.42, 95\% CI 1.34-1.51). Conclusions: Findings suggest that without bias adjustment, EHR analyses may yield biased measures of association, driven in part by subgroup differences in health care use. However, applying missing data or causal inference frameworks can help control for and, importantly, characterize residual information biases in these estimates. 
", doi="10.2196/58085", url="https://medinform.jmir.org/2024/1/e58085" } @Article{info:doi/10.2196/53711, author="Lim, Sachiko and Johannesson, Paul", title="An Ontology to Bridge the Clinical Management of Patients and Public Health Responses for Strengthening Infectious Disease Surveillance: Design Science Study", journal="JMIR Form Res", year="2024", month="Sep", day="26", volume="8", pages="e53711", keywords="infectious disease", keywords="ontology", keywords="IoT", keywords="infectious disease surveillance", keywords="patient monitoring", keywords="infectious disease management", keywords="risk analysis", keywords="early warning", keywords="data integration", keywords="semantic interoperability", keywords="public health", abstract="Background: Novel surveillance approaches using digital technologies, including the Internet of Things (IoT), have evolved, enhancing traditional infectious disease surveillance systems by enabling real-time detection of outbreaks and reaching a wider population. However, disparate, heterogenous infectious disease surveillance systems often operate in silos due to a lack of interoperability. As a life-changing clinical use case, the COVID-19 pandemic has manifested that a lack of interoperability can severely inhibit public health responses to emerging infectious diseases. Interoperability is thus critical for building a robust ecosystem of infectious disease surveillance and enhancing preparedness for future outbreaks. The primary enabler for semantic interoperability is ontology. Objective: This study aims to design the IoT-based management of infectious disease ontology (IoT-MIDO) to enhance data sharing and integration of data collected from IoT-driven patient health monitoring, clinical management of individual patients, and disparate heterogeneous infectious disease surveillance. 
Methods: The ontology modeling approach was chosen for its semantic richness in knowledge representation, flexibility, ease of extensibility, and capability for knowledge inference and reasoning. The IoT-MIDO was developed using the basic formal ontology (BFO) as the top-level ontology. We reused the classes from existing BFO-based ontologies as much as possible to maximize the interoperability with other BFO-based ontologies and databases that rely on them. We formulated the competency questions as requirements for the ontology to achieve the intended goals. Results: We designed an ontology to integrate data from heterogeneous sources, including IoT-driven patient monitoring, clinical management of individual patients, and infectious disease surveillance systems. This integration aims to facilitate the collaboration between clinical care and public health domains. We also demonstrate five use cases using the simplified ontological models to show the potential applications of IoT-MIDO: (1) IoT-driven patient monitoring, risk assessment, early warning, and risk management; (2) clinical management of patients with infectious diseases; (3) epidemic risk analysis for timely response at the public health level; (4) infectious disease surveillance; and (5) transforming patient information into surveillance information. Conclusions: The development of the IoT-MIDO was driven by competency questions. Being able to answer all the formulated competency questions, we successfully demonstrated that our ontology has the potential to facilitate data sharing and integration for orchestrating IoT-driven patient health monitoring in the context of an infectious disease epidemic, clinical patient management, infectious disease surveillance, and epidemic risk analysis. 
The novelty and uniqueness of the ontology lie in building a bridge to link IoT-based individual patient monitoring and early warning based on patient risk assessment to infectious disease epidemic surveillance at the public health level. The ontology can also serve as a starting point to enable potential decision support systems, providing actionable insights to support public health organizations and practitioners in making informed decisions in a timely manner. ", doi="10.2196/53711", url="https://formative.jmir.org/2024/1/e53711", url="http://www.ncbi.nlm.nih.gov/pubmed/39325530" } @Article{info:doi/10.2196/59505, author="AlSaad, Rawan and Abd-alrazaq, Alaa and Boughorbel, Sabri and Ahmed, Arfan and Renault, Max-Antoine and Damseh, Rafat and Sheikh, Javaid", title="Multimodal Large Language Models in Health Care: Applications, Challenges, and Future Outlook", journal="J Med Internet Res", year="2024", month="Sep", day="25", volume="26", pages="e59505", keywords="artificial intelligence", keywords="large language models", keywords="multimodal large language models", keywords="multimodality", keywords="multimodal generative artificial intelligence", keywords="multimodal generative AI", keywords="generative artificial intelligence", keywords="generative AI", keywords="health care", doi="10.2196/59505", url="https://www.jmir.org/2024/1/e59505" } @Article{info:doi/10.2196/56804, author="Callaghan-Koru, A. Jennifer and Newman Chargois, Paige and Tiwari, Tanvangi and Brown, C. 
Clare and Greenfield, William and Koru, G{\"u}ne{\c{s}}", title="Public Maternal Health Dashboards in the United States: Descriptive Assessment", journal="J Med Internet Res", year="2024", month="Sep", day="17", volume="26", pages="e56804", keywords="dashboard", keywords="maternal health", keywords="data visualization", keywords="data communication", keywords="perinatal health", abstract="Background: Data dashboards have become more widely used for the public communication of health-related data, including in maternal health. Objective: We aimed to evaluate the content and features of existing publicly available maternal health dashboards in the United States. Methods: Through systematic searches, we identified 80 publicly available, interactive dashboards presenting US maternal health data. We abstracted and descriptively analyzed the technical features and content of identified dashboards across four areas: (1) scope and origins, (2) technical capabilities, (3) data sources and indicators, and (4) disaggregation capabilities. Where present, we abstracted and qualitatively analyzed dashboard text describing the purpose and intended audience. Results: Most reviewed dashboards reported state-level data (58/80, 72\%) and were hosted on a state health department website (48/80, 60\%). Most dashboards reported data from only 1 (33/80, 41\%) or 2 (23/80, 29\%) data sources. Key indicators, such as the maternal mortality rate (10/80, 12\%) and severe maternal morbidity rate (12/80, 15\%), were absent from most dashboards. Included dashboards used a range of data visualizations, and most allowed some disaggregation by time (65/80, 81\%), geography (65/80, 81\%), and race or ethnicity (55/80, 69\%). Among dashboards that identified their audience (30/80, 38\%), legislators or policy makers and public health agencies or organizations were the most common audiences. Conclusions: While maternal health dashboards have proliferated, their designs and features are not standard. 
This assessment of maternal health dashboards in the United States found substantial variation among dashboards, including inconsistent data sources, health indicators, and disaggregation capabilities. Opportunities to strengthen dashboards include integrating a greater number of data sources, increasing disaggregation capabilities, and considering end-user needs in dashboard design. ", doi="10.2196/56804", url="https://www.jmir.org/2024/1/e56804" } @Article{info:doi/10.2196/57853, author="Ohlsen, Tessa and Ingenerf, Josef and Essenwanger, Andrea and Drenkhahn, Cora", title="PCEtoFHIR: Decomposition of Postcoordinated SNOMED CT Expressions for Storage as HL7 FHIR Resources", journal="JMIR Med Inform", year="2024", month="Sep", day="17", volume="12", pages="e57853", keywords="SNOMED CT", keywords="HL7 FHIR", keywords="TermInfo", keywords="postcoordination", keywords="semantic interoperability", keywords="terminology", keywords="OWL", keywords="semantic similarity", abstract="Background: To ensure interoperability, both structural and semantic standards must be followed. For exchanging medical data between information systems, the structural standard FHIR (Fast Healthcare Interoperability Resources) has recently gained popularity. Regarding semantic interoperability, the reference terminology SNOMED Clinical Terms (SNOMED CT), as a semantic standard, allows for postcoordination, offering advantages over many other vocabularies. These postcoordinated expressions (PCEs) make SNOMED CT an expressive and flexible interlingua, allowing for precise coding of medical facts. However, this comes at the cost of increased complexity, as well as challenges in storage and processing. Additionally, the boundary between semantic (terminology) and structural (information model) standards becomes blurred, leading to what is known as the TermInfo problem. 
Although often viewed critically, the TermInfo overlap can also be explored for its potential benefits, such as enabling flexible transformation of parts of PCEs. Objective: In this paper, an alternative solution for storing PCEs is presented, which involves combining them with the FHIR data model. Ultimately, all components of a PCE should be expressible solely through precoordinated concepts that are linked to the appropriate elements of the information model. Methods: The approach involves storing PCEs decomposed into their components in alignment with FHIR resources. By utilizing the Web Ontology Language (OWL) to generate an OWL ClassExpression, and combining it with an external reasoner and semantic similarity measures, a precoordinated SNOMED CT concept that most accurately describes the PCE is identified as a Superconcept. In addition, the nonmatching attribute relationships between the Superconcept and the PCE are identified as the ``Delta.'' Once SNOMED CT attributes are manually mapped to FHIR elements, FHIRPath expressions can be defined for both the Superconcept and the Delta, allowing the identified precoordinated codes to be stored within FHIR resources. Results: A web application called PCEtoFHIR was developed to implement this approach. In a validation process with 600 randomly selected precoordinated concepts, the formal correctness of the generated OWL ClassExpressions was verified. Additionally, 33 PCEs were used for two separate validation tests. Based on these validations, it was demonstrated that a previously proposed semantic similarity calculation is suitable for determining the Superconcept. Additionally, the 33 PCEs were used to confirm the correct functioning of the entire approach. Furthermore, the FHIR StructureMaps were reviewed and deemed meaningful by FHIR experts. Conclusions: PCEtoFHIR offers services to decompose PCEs for storage within FHIR resources. 
When creating structure mappings for specific subdomains of SNOMED CT concepts (eg, allergies) to desired FHIR profiles, the use of SNOMED CT Expression Templates has proven highly effective. Domain experts can create templates with appropriate mappings, which can then be easily reused in a constrained manner by end users. ", doi="10.2196/57853", url="https://medinform.jmir.org/2024/1/e57853" } @Article{info:doi/10.2196/55182, author="Mess, Veronica Elisabeth and Kramer, Frank and Krumme, Julia and Kanelakis, Nico and Teynor, Alexandra", title="Use of Creative Frameworks in Health Care to Solve Data and Information Problems: Scoping Review", journal="JMIR Hum Factors", year="2024", month="Sep", day="13", volume="11", pages="e55182", keywords="creative frameworks", keywords="data and information problems", keywords="data collection", keywords="data processing", keywords="data provision", keywords="health care", keywords="information visualization", keywords="interdisciplinary teams", keywords="user-centered design", keywords="user-centered data design", keywords="user-centric development", abstract="Background: Digitization is vital for data management, especially in health care. However, problems still hinder health care stakeholders in their daily work while collecting, processing, and providing health data or information. Data are missing, incorrect, cannot be collected, or information is inadequately presented. These problems can be seen as data or information problems. A proven way to elicit requirements for (software) systems is by using creative frameworks (eg, user-centered design, design thinking, lean UX [user experience], or service design) or creative methods (eg, mind mapping, storyboarding, 6 thinking hats, or interaction room). However, to what extent they are used to solve data or information-related problems in health care is unclear. 
Objective: The primary objective of this scoping review is to investigate the use of creative frameworks in addressing data and information problems in health care. Methods: Following JBI guidelines and the PRISMA-ScR framework, this paper analyzes selected papers, answering whether creative frameworks addressed health care data or information problems. Focusing on data problems (elicitation or collection, processing) and information problems (provision or visualization), the review examined German and English papers published between 2018 and 2022 using keywords related to ``data,'' ``design,'' and ``user-centered.'' The database SCOPUS was used. Results: Of the 898 query results, only 23 papers described a data or information problem and a creative method to solve it. These were included in the follow-up analysis and divided into different problem categories: data collection (n=7), data processing (n=1), information visualization (n=11), and mixed problems meaning data and information problem present (n=4). The analysis showed that most identified problems fall into the information visualization category. This could indicate that creative frameworks are particularly suitable for solving information or visualization problems and less for other, more abstract areas such as data problems. The results also showed that most researchers applied a creative framework after they knew what specific (data or information) problem they had (n=21). Only a minority chose a creative framework to identify a problem and realize it was a data or information problem (n=2). In response to these findings, the paper discusses the need for a new approach that addresses health care data and information challenges by promoting collaboration, iterative feedback, and user-centered development. Conclusions: Although the potential of creative frameworks is undisputed, applying these in solving data and information problems is a minority. 
To harness this potential, a suitable method needs to be developed to support health care system stakeholders. This method could be the User-Centered Data Approach. ", doi="10.2196/55182", url="https://humanfactors.jmir.org/2024/1/e55182" } @Article{info:doi/10.2196/58705, author="Manuilova, Iryna and Bossenz, Jan and Weise, Bianka Annemarie and Boehm, Dominik and Strantz, Cosima and Unberath, Philipp and Reimer, Niklas and Metzger, Patrick and Pauli, Thomas and Werle, D. Silke and Schulze, Susann and Hiemer, Sonja and Ustjanzew, Arsenij and Kestler, A. Hans and Busch, Hauke and Brors, Benedikt and Christoph, Jan", title="Identifications of Similarity Metrics for Patients With Cancer: Protocol for a Scoping Review", journal="JMIR Res Protoc", year="2024", month="Sep", day="4", volume="13", pages="e58705", keywords="patient similarity", keywords="cancer research", keywords="patient similarity applications", keywords="precision medicine", keywords="cancer similarity metrics", keywords="scoping review protocol", abstract="Background: Understanding the similarities of patients with cancer is essential to advancing personalized medicine, improving patient outcomes, and developing more effective and individualized treatments. It enables researchers to discover important patterns, biomarkers, and treatment strategies that can have a significant impact on cancer research and oncology. In addition, the identification of previously successfully treated patients supports oncologists in making treatment decisions for a new patient who is clinically or molecularly similar to the previous patient. Objective: The planned review aims to systematically summarize, map, and describe existing evidence to understand how patient similarity is defined and used in cancer research and clinical care. 
Methods: To systematically identify relevant studies and to ensure reproducibility and transparency of the review process, a comprehensive literature search will be conducted in several bibliographic databases, including Web of Science, PubMed, LIVIVO, and MEDLINE, covering the period from 1998 to February 2024. After the initial duplicate deletion phase, a study selection phase will be applied using Rayyan, which consists of 3 distinct steps: title and abstract screening, disagreement resolution, and full-text screening. To ensure the integrity and quality of the selection process, each of these steps is preceded by a pilot testing phase. This methodological process will culminate in the presentation of the final research results in a structured form according to the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) flowchart. The protocol has been registered in the Journal of Medical Internet Research. Results: This protocol outlines the methodologies used in conducting the scoping review. A search of the specified electronic databases, after removing duplicates, resulted in 1183 unique records. As of March 2024, the review process has moved to the full-text evaluation phase. At this stage, data extraction will be conducted using a pretested chart template. Conclusions: The scoping review protocol, centered on these main concepts, aims to systematically map the available evidence on patient similarity among patients with cancer. By defining the types of data sources, approaches, and methods used in the field, and aligning these with the research questions, the review will provide a foundation for future research and clinical application in personalized cancer care. This protocol will guide the literature search, data extraction, and synthesis of findings to achieve the review's objectives. 
International Registered Report Identifier (IRRID): DERR1-10.2196/58705 ", doi="10.2196/58705", url="https://www.researchprotocols.org/2024/1/e58705", url="http://www.ncbi.nlm.nih.gov/pubmed/39230952" } @Article{info:doi/10.2196/51297, author="Gierend, Kerstin and Kr{\"u}ger, Frank and Genehr, Sascha and Hartmann, Francisca and Siegel, Fabian and Waltemath, Dagmar and Ganslandt, Thomas and Zeleke, Alamirrew Atinkut", title="Provenance Information for Biomedical Data and Workflows: Scoping Review", journal="J Med Internet Res", year="2024", month="Aug", day="23", volume="26", pages="e51297", keywords="provenance", keywords="biomedical research", keywords="data management", keywords="scoping review", keywords="health care data", keywords="software life cycle", abstract="Background: The record of the origin and the history of data, known as provenance, holds importance. Provenance information leads to higher interpretability of scientific results and enables reliable collaboration and data sharing. However, the lack of comprehensive evidence on provenance approaches hinders the uptake of good scientific practice in clinical research. Objective: This scoping review aims to identify approaches and criteria for provenance tracking in the biomedical domain. We reviewed the state-of-the-art frameworks, associated artifacts, and methodologies for provenance tracking. Methods: This scoping review followed the methodological framework developed by Arksey and O'Malley. We searched the PubMed and Web of Science databases for English-language articles published from 2006 to 2022. Title and abstract screening were carried out by 4 independent reviewers using the Rayyan screening tool. A majority vote was required for consent on the eligibility of papers based on the defined inclusion and exclusion criteria. Full-text reading and screening were performed independently by 2 reviewers, and information was extracted into a pretested template for the 5 research questions. 
Disagreements were resolved by a domain expert. The study protocol has previously been published. Results: The search resulted in a total of 764 papers. Of 624 identified, deduplicated papers, 66 (10.6\%) studies fulfilled the inclusion criteria. We identified diverse provenance-tracking approaches ranging from practical provenance processing and managing to theoretical frameworks distinguishing diverse concepts and details of data and metadata models, provenance components, and notations. A substantial majority investigated underlying requirements to varying extents and validation intensities but lacked completeness in provenance coverage. Mostly, cited requirements concerned the knowledge about data integrity and reproducibility. Moreover, these revolved around robust data quality assessments, consistent policies for sensitive data protection, improved user interfaces, and automated ontology development. We found that different stakeholder groups benefit from the availability of provenance information. Thereby, we recognized that the term provenance is subjected to an evolutionary and technical process with multifaceted meanings and roles. Challenges included organizational and technical issues linked to data annotation, provenance modeling, and performance, amplified by subsequent matters such as enhanced provenance information and quality principles. Conclusions: As data volumes grow and computing power increases, the challenge of scaling provenance systems to handle data efficiently and assist complex queries intensifies, necessitating automated and scalable solutions. With rising legal and scientific demands, there is an urgent need for greater transparency in implementing provenance systems in research projects, despite the challenges of unresolved granularity and knowledge bottlenecks. 
We believe that our recommendations enable quality and guide the implementation of auditable and measurable provenance approaches as well as solutions in the daily tasks of biomedical scientists. International Registered Report Identifier (IRRID): RR2-10.2196/31750 ", doi="10.2196/51297", url="https://www.jmir.org/2024/1/e51297" } @Article{info:doi/10.2196/58502, author="Burns, James and Chen, Kelly and Flathers, Matthew and Currey, Danielle and Macrynikola, Natalia and Vaidyam, Aditya and Langholm, Carsten and Barnett, Ian and Byun, Andrew (Jin Soo) and Lane, Erlend and Torous, John", title="Transforming Digital Phenotyping Raw Data Into Actionable Biomarkers, Quality Metrics, and Data Visualizations Using Cortex Software Package: Tutorial", journal="J Med Internet Res", year="2024", month="Aug", day="23", volume="26", pages="e58502", keywords="digital phenotyping", keywords="mental health", keywords="data visualization", keywords="data analysis", keywords="smartphones", keywords="smartphone", keywords="Cortex", keywords="open-source", keywords="data processing", keywords="mindLAMP", keywords="app", keywords="apps", keywords="data set", keywords="clinical", keywords="real world", keywords="methodology", keywords="mobile phone", doi="10.2196/58502", url="https://www.jmir.org/2024/1/e58502", url="http://www.ncbi.nlm.nih.gov/pubmed/39178032" } @Article{info:doi/10.2196/53821, author="Tanaka, L. Hideaki and Rees, R. Judy and Zhang, Ziyin and Ptak, A. Judy and Hannigan, M. Pamela and Silverman, M. Elaine and Peacock, L. Janet and Buckey, C. 
Jay and ", title="Emerging Indications for Hyperbaric Oxygen Treatment: Registry Cohort Study", journal="Interact J Med Res", year="2024", month="Aug", day="20", volume="13", pages="e53821", keywords="hyperbaric oxygen", keywords="inflammatory bowel disease", keywords="calciphylaxis", keywords="post--COVID-19 condition", keywords="PCC", keywords="postacute sequelae of COVID-19", keywords="PASC", keywords="infected implanted hardware", keywords="hypospadias", keywords="frostbite", keywords="facial filler", keywords="pyoderma gangrenosum", abstract="Background: Hyperbaric oxygen (HBO2) treatment is used across a range of medical specialties for a variety of applications, particularly where hypoxia and inflammation are important contributors. Because of its hypoxia-relieving and anti-inflammatory effects HBO2 may be useful for new indications not currently approved by the Undersea and Hyperbaric Medical Society. Identifying these new applications for HBO2 is difficult because individual centers may only treat a few cases and not track the outcomes consistently. The web-based International Multicenter Registry for Hyperbaric Oxygen Therapy captures prospective outcome data for patients treated with HBO2 therapy. These data can then be used to identify new potential applications for HBO2, which has relevance for a range of medical specialties. Objective: Although hyperbaric medicine has established indications, new ones continue to emerge. One objective of this registry study was to identify cases where HBO2 has been used for conditions falling outside of current Undersea and Hyperbaric Medical Society--approved indications and present outcome data for them. Methods: This descriptive study used data from a web-based, multicenter, international registry of patients treated with HBO2. Participating centers agree to collect data on all patients treated using standard outcome measures, and individual centers send deidentified data to the central registry. 
HBO2 treatment programs in the United States, the United Kingdom, and Australia participate. Demographic, outcome, complication, and treatment data, including pre- and posttreatment quality of life questionnaires (EQ-5D-5L) were collected for individuals referred for HBO2 treatment. Results: Out of 9726 patient entries, 378 (3.89\%) individuals were treated for 45 emerging indications. Post--COVID-19 condition (PCC; also known as postacute sequelae of COVID-19; 149/378, 39.4\%), ulcerative colitis (47/378, 12.4\%), and Crohn disease (40/378, 10.6\%) accounted for 62.4\% (n=236) of the total cases. Calciphylaxis (20/378, 5.3\%), frostbite (18/378, 4.8\%), and peripheral vascular disease--related wounds (12/378, 3.2\%) accounted for a further 13.2\% (n=50). Patients with PCC reported significant improvement on the Neurobehavioral Symptom Inventory (NSI score: pretreatment=30.6; posttreatment=14.4; P<.001). Patients with Crohn disease reported significantly improved quality of life (EQ-5D score: pretreatment=53.8; posttreatment=68.8), and 5 (13\%) reported closing a fistula. Patients with ulcerative colitis and complete pre- and post-HBO2 data reported improved quality of life and lower scores on a bowel questionnaire examining frequency, blood, pain, and urgency. A subset of patients with calciphylaxis and arterial ulcers also reported improvement. Conclusions: HBO2 is being used for a wide range of possible applications across various medical specialties for its hypoxia-relieving and anti-inflammatory effects. Results show statistically significant improvements in patient-reported outcomes for inflammatory bowel disease and PCC. HBO2 is also being used for frostbite, pyoderma gangrenosum, pterygium, hypospadias repair, and facial filler procedures. Other indications show evidence for improvement, and the case series for all indications is growing in the registry. 
International Registered Report Identifier (IRRID): RR2-10.2196/18857 ", doi="10.2196/53821", url="https://www.i-jmr.org/2024/1/e53821", url="http://www.ncbi.nlm.nih.gov/pubmed/39078624" } @Article{info:doi/10.2196/58548, author="Julian, Silva Guilherme and Shau, Wen-Yi and Chou, Hsu-Wen and Setia, Sajita", title="Bridging Real-World Data Gaps: Connecting Dots Across 10 Asian Countries", journal="JMIR Med Inform", year="2024", month="Aug", day="15", volume="12", pages="e58548", keywords="Asia", keywords="electronic medical records", keywords="EMR", keywords="health care databases", keywords="health technology assessment", keywords="HTA", keywords="real-world data", keywords="real-world evidence", doi="10.2196/58548", url="https://medinform.jmir.org/2024/1/e58548", url="http://www.ncbi.nlm.nih.gov/pubmed/39026427" } @Article{info:doi/10.2196/49542, author="Fruchart, Mathilde and Quindroit, Paul and Jacquemont, Chlo{\'e} and Beuscart, Jean-Baptiste and Calafiore, Matthieu and Lamer, Antoine", title="Transforming Primary Care Data Into the Observational Medical Outcomes Partnership Common Data Model: Development and Usability Study", journal="JMIR Med Inform", year="2024", month="Aug", day="13", volume="12", pages="e49542", keywords="data reuse", keywords="Observational Medical Outcomes Partnership", keywords="common data model", keywords="data warehouse", keywords="reproducible research", keywords="primary care", keywords="dashboard", keywords="electronic health record", keywords="patient tracking system", keywords="patient monitoring", keywords="EHR", keywords="primary care data", abstract="Background: Patient-monitoring software generates a large amount of data that can be reused for clinical audits and scientific research. 
The Observational Health Data Sciences and Informatics (OHDSI) consortium developed the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) to standardize electronic health record data and promote large-scale observational and longitudinal research. Objective: This study aimed to transform primary care data into the OMOP CDM format. Methods: We extracted primary care data from electronic health records at a multidisciplinary health center in Wattrelos, France. We performed structural mapping between the design of our local primary care database and the OMOP CDM tables and fields. Local French vocabulary concepts were mapped to OHDSI standard vocabularies. To validate the implementation of primary care data into the OMOP CDM format, we applied a set of queries. A practical application was achieved through the development of a dashboard. Results: Data from 18,395 patients were implemented into the OMOP CDM, corresponding to 592,226 consultations over a period of 20 years. A total of 18 OMOP CDM tables were implemented. A total of 17 local vocabularies were identified as being related to primary care and corresponded to patient characteristics (sex, location, year of birth, and race), units of measurement, biometric measures, laboratory test results, medical histories, and drug prescriptions. During semantic mapping, 10,221 primary care concepts were mapped to standard OHDSI concepts. Five queries were used to validate the OMOP CDM by comparing the results obtained after the completion of the transformations with the results obtained in the source software. Lastly, a prototype dashboard was developed to visualize the activity of the health center, the laboratory test results, and the drug prescription data. Conclusions: Primary care data from a French health care facility have been implemented into the OMOP CDM format. 
Data concerning demographics, units, measurements, and primary care consultation steps were already available in OHDSI vocabularies. Laboratory test results and drug prescription data were mapped to available vocabularies and structured in the final model. A dashboard application provided health care professionals with feedback on their practice. ", doi="10.2196/49542", url="https://medinform.jmir.org/2024/1/e49542" } @Article{info:doi/10.2196/59924, author="Jia, Si Si and Luo, Xinwei and Gibson, Anne Alice and Partridge, Ruth Stephanie", title="Developing the DIGIFOOD Dashboard to Monitor the Digitalization of Local Food Environments: Interdisciplinary Approach", journal="JMIR Public Health Surveill", year="2024", month="Aug", day="13", volume="10", pages="e59924", keywords="online food delivery", keywords="food environment", keywords="dashboard", keywords="web scraping", keywords="big data", keywords="surveillance", keywords="monitoring", keywords="prevention", keywords="food", keywords="food delivery", keywords="development study", keywords="development", keywords="accessibility", keywords="Australia", keywords="monitoring tool", keywords="tool", keywords="tools", abstract="Background: Online food delivery services (OFDS) enable individuals to conveniently access foods from any deliverable location. The increased accessibility to foods may have implications on the consumption of healthful or unhealthful foods. Concerningly, previous research suggests that OFDS offer an abundance of energy-dense and nutrient-poor foods, which are heavily promoted through deals or discounts. Objective: In this paper, we describe the development of the DIGIFOOD dashboard to monitor the digitalization of local food environments in New South Wales, Australia, resulting from the proliferation of OFDS. Methods: Together with a team of data scientists, we designed a purpose-built dashboard using Microsoft Power BI. 
The development process involved three main stages: (1) data acquisition of food outlets via web scraping, (2) data cleaning and processing, and (3) visualization of food outlets on the dashboard. We also describe the categorization process of food outlets to characterize the healthfulness of local, online, and hybrid food environments. These categories included takeaway franchises, independent takeaways, independent restaurants and cafes, supermarkets or groceries, bakeries, alcohol retailers, convenience stores, and sandwich or salad shops. Results: To date, the DIGIFOOD dashboard has mapped 36,967 unique local food outlets (locally accessible and scraped from Google Maps) and 16,158 unique online food outlets (accessible online and scraped from Uber Eats) across New South Wales, Australia. In 2023, the market-leading OFDS operated in 1061 unique suburbs or localities in New South Wales. The Sydney-Parramatta region, a major urban area in New South Wales accounting for 28 postcodes, recorded the highest number of online food outlets (n=4221). In contrast, the Far West and Orana region, a rural area in New South Wales with only 2 postcodes, recorded the lowest number of food outlets accessible online (n=7). Urban areas appeared to have the greatest increase in total food outlets accessible via online food delivery. In both local and online food environments, it was evident that independent restaurants and cafes comprised the largest proportion of food outlets at 47.2\% (17,437/36,967) and 51.8\% (8369/16,158), respectively. However, compared to local food environments, the online food environment has relatively more takeaway franchises (2734/16,158, 16.9\% compared to 3273/36,967, 8.9\%) and independent takeaway outlets (2416/16,158, 14.9\% compared to 4026/36,967, 10.9\%). 
Conclusions: The DIGIFOOD dashboard leverages the current rich data landscape to display and contrast the availability and healthfulness of food outlets that are locally accessible versus accessible online. The DIGIFOOD dashboard can be a useful monitoring tool for the evolving digital food environment at a regional scale and has the potential to be scaled up at a national level. Future iterations of the dashboard, including data from additional prominent OFDS, can be used by policy makers to identify high-priority areas with limited access to healthful foods both online and locally. ", doi="10.2196/59924", url="https://publichealth.jmir.org/2024/1/e59924" } @Article{info:doi/10.2196/50667, author="Rohani, Narjes and Sowa, Stephen and Manataki, Areti", title="Identifying Learning Preferences and Strategies in Health Data Science Courses: Systematic Review", journal="JMIR Med Educ", year="2024", month="Aug", day="12", volume="10", pages="e50667", keywords="health data science", keywords="bioinformatics", keywords="learning approach", keywords="learning preference", keywords="learning tactic", keywords="learning strategy", keywords="interdisciplinary", keywords="systematic review", keywords="medical education", abstract="Background: Learning and teaching interdisciplinary health data science (HDS) is highly challenging, and despite the growing interest in HDS education, little is known about the learning experiences and preferences of HDS students. Objective: We conducted a systematic review to identify learning preferences and strategies in the HDS discipline. Methods: We searched 10 bibliographic databases (PubMed, ACM Digital Library, Web of Science, Cochrane Library, Wiley Online Library, ScienceDirect, SpringerLink, EBSCOhost, ERIC, and IEEE Xplore) from the date of inception until June 2023. 
We followed the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines and included primary studies written in English that investigated the learning preferences or strategies of students in HDS-related disciplines, such as bioinformatics, at any academic level. Risk of bias was independently assessed by 2 screeners using the Mixed Methods Appraisal Tool, and we used narrative data synthesis to present the study results. Results: After abstract screening and full-text reviewing of the 849 papers retrieved from the databases, 8 (0.9\%) studies, published between 2009 and 2021, were selected for narrative synthesis. The majority of these papers (7/8, 88\%) investigated learning preferences, while only 1 (12\%) paper studied learning strategies in HDS courses. The systematic review revealed that most HDS learners prefer visual presentations as their primary learning input. In terms of learning process and organization, they mostly tend to follow logical, linear, and sequential steps. Moreover, they focus more on abstract information, rather than detailed and concrete information. Regarding collaboration, HDS students sometimes prefer teamwork, and sometimes they prefer to work alone. Conclusions: The studies' quality, assessed using the Mixed Methods Appraisal Tool, ranged between 73\% and 100\%, indicating excellent quality overall. However, the number of studies in this area is small, and the results of all studies are based on self-reported data. Therefore, more research needs to be conducted to provide insight into HDS education. We provide some suggestions, such as using learning analytics and educational data mining methods, for conducting future research to address gaps in the literature. We also discuss implications for HDS educators, and we make recommendations for HDS course design; for example, we recommend including visual materials, such as diagrams and videos, and offering step-by-step instructions for students. 
", doi="10.2196/50667", url="https://mededu.jmir.org/2024/1/e50667" } @Article{info:doi/10.2196/47100, author="O'Shea, MJ Amy and Mulligan, Kailey and Carter, D. Knute and Haraldsson, Bjarni and Wray, M. Charlie and Shahnazi, Ariana and Kaboli, J. Peter", title="Comparing Federal Communications Commission and Microsoft Estimates of Broadband Access for Mental Health Video Telemedicine Among Veterans: Retrospective Cohort Study", journal="J Med Internet Res", year="2024", month="Aug", day="8", volume="26", pages="e47100", keywords="broadband", keywords="telemedicine", keywords="Federal Communications Commission", keywords="veterans", keywords="United States Department of Veterans Affairs", keywords="internet", keywords="mental health care", keywords="veteran health", keywords="broadband access", keywords="web-based", keywords="digital", abstract="Background: The COVID-19 pandemic highlighted the importance of telemedicine in health care. However, video telemedicine requires adequate broadband internet speeds. As video-based telemedicine grows, variations in broadband access must be accurately measured and characterized. Objective: This study aims to compare the Federal Communications Commission (FCC) and Microsoft US broadband use data sources to measure county-level broadband access among veterans receiving mental health care from the Veterans Health Administration (VHA). Methods: Retrospective observational cohort study using administrative data to identify mental health visits from January 1, 2019, to December 31, 2020, among 1161 VHA mental health clinics. The exposure is county-level broadband percentages calculated as the percentage of the county population with access to adequate broadband speeds (ie, download >25 megabits per second) as measured by the FCC and Microsoft. All veterans receiving VHA mental health services during the study period were included and categorized based on their use of video mental health visits. 
Broadband access was compared between and within data sources, stratified by video versus no video telemedicine use. Results: Over the 2-year study period, 1,474,024 veterans with VHA mental health visits were identified. Average broadband percentages varied by source (FCC mean 91.3\%, SD 12.5\% vs Microsoft mean 48.2\%, SD 18.1\%; P<.001). Within each data source, broadband percentages generally increased from 2019 to 2020. Adjusted regression analyses estimated the change after pandemic onset versus before the pandemic in quarterly county-based mental health visit counts at prespecified broadband percentages. Using FCC model estimates, given all other covariates are constant and assuming an FCC percentage set at 70\%, the incidence rate ratio (IRR) of county-level quarterly mental health video visits during the COVID-19 pandemic was 6.81 times (95\% CI 6.49-7.13) the rate before the pandemic. In comparison, the model using Microsoft data exhibited a stronger association (IRR 7.28; 95\% CI 6.78-7.81). This relationship held across all broadband access levels assessed. Conclusions: This study found that FCC broadband data estimated higher and less variable county-level broadband percentages compared to those estimated using Microsoft data. Regardless of the data source, veterans without mental health video visits lived in counties with lower broadband access, highlighting the need for accurate broadband speed measurement to prioritize infrastructure and intervention development based on the greatest community-level impacts. Future work should link broadband access to differences in clinical outcomes. 
", doi="10.2196/47100", url="https://www.jmir.org/2024/1/e47100" } @Article{info:doi/10.2196/53369, author="Metsallik, Janek and Draheim, Dirk and Sabic, Zlatan and Novak, Thomas and Ross, Peeter", title="Assessing Opportunities and Barriers to Improving the Secondary Use of Health Care Data at the National Level: Multicase Study in the Kingdom of Saudi Arabia and Estonia", journal="J Med Internet Res", year="2024", month="Aug", day="8", volume="26", pages="e53369", keywords="health data governance", keywords="secondary use", keywords="health information sharing maturity", keywords="large-scale interoperability", keywords="health data stewardship", keywords="health data custodianship", keywords="health information purpose", keywords="health data policy", abstract="Background: Digitization shall improve the secondary use of health care data. The Government of the Kingdom of Saudi Arabia ordered a project to compile the National Master Plan for Health Data Analytics, while the Government of Estonia ordered a project to compile the Person-Centered Integrated Hospital Master Plan. Objective: This study aims to map these 2 distinct projects' problems, approaches, and outcomes to find the matching elements for reuse in similar cases. Methods: We assessed both health care systems' abilities for secondary use of health data by exploratory case studies with purposive sampling and data collection via semistructured interviews and documentation review. The collected content was analyzed qualitatively and coded according to a predefined framework. The analytical framework consisted of data purpose, flow, and sharing. The Estonian project used the Health Information Sharing Maturity Model from the Mitre Corporation as an additional analytical framework. The data collection and analysis in the Kingdom of Saudi Arabia took place in 2019 and covered health care facilities, public health institutions, and health care policy. 
The project in Estonia collected its inputs in 2020 and covered health care facilities, patient engagement, public health institutions, health care financing, health care policy, and health technology innovations. Results: In both cases, the assessments resulted in a set of recommendations focusing on the governance of health care data. In the Kingdom of Saudi Arabia, the health care system consists of multiple isolated sectors, and there is a need for an overarching body coordinating data sets, indicators, and reports at the national level. The National Master Plan of Health Data Analytics proposed a set of organizational agreements for proper stewardship. Despite Estonia's national Digital Health Platform, the requirements remain uncoordinated between various data consumers. We recommended reconfiguring the stewardship of the national health data to include multipurpose data use into the scope of interoperability standardization. Conclusions: Proper data governance is the key to improving the secondary use of health data at the national level. The data flows from data providers to data consumers shall be coordinated by overarching stewardship structures and supported by interoperable data custodians. 
", doi="10.2196/53369", url="https://www.jmir.org/2024/1/e53369" } @Article{info:doi/10.2196/52180, author="Wiertz, Svenja and Boldt, Joachim", title="Ethical, Legal, and Practical Concerns Surrounding the Implemention of New Forms of Consent for Health Data Research: Qualitative Interview Study", journal="J Med Internet Res", year="2024", month="Aug", day="7", volume="26", pages="e52180", keywords="health data", keywords="health research", keywords="informed consent", keywords="broad consent", keywords="tiered consent", keywords="consent management", keywords="digital infrastructure", keywords="data safety", keywords="GDPR", abstract="Background: In Europe, within the scope of the General Data Protection Regulation, more and more digital infrastructures are created to allow for large-scale access to patients' health data and their use for research. When the research is performed on the basis of patient consent, traditional study-specific consent appears too cumbersome for many researchers. Alternative models of consent are currently being discussed and introduced in different contexts. Objective: This study explores stakeholder perspectives on ethical, legal, and practical concerns regarding models of consent for health data research at German university medical centers. Methods: Semistructured focus group interviews were conducted with medical researchers at German university medical centers, health IT specialists, data protection officers, and patient representatives. The interviews were analyzed using a software-supported structuring qualitative content analysis. Results: Stakeholders regarded broad consent to be only marginally less laborious to implement and manage than tiered consent. Patient representatives favored specific consent, with tiered consent as a possible alternative. All stakeholders lamented that information material was difficult to understand. Oral information and videos were mentioned as a means of improvement. 
Patient representatives doubted that researchers had a sufficient degree of data security expertise to act as sole information providers. They were afraid of undue pressure if obtaining health data research consent were part of medical appointments. IT specialists and other stakeholders regarded the withdrawal of consent to be a major challenge and called for digital consent management solutions. On the one hand, the transfer of health data to non-European countries and for-profit organizations is seen as a necessity for research. On the other hand, there are data security concerns with regard to these actors. Research without consent is legally possible under certain conditions but deemed problematic by all stakeholder groups, albeit for differing reasons and to different degrees. Conclusions: More efforts should be made to determine which options of choice should be included in health data research consent. Digital tools could improve patient information and facilitate consent management. A unified and strict regulation for research without consent is required at the national and European Union level. Obtaining consent for health data research should be independent of medical appointments, and additional personnel should be trained in data security to provide information on health data research. 
", doi="10.2196/52180", url="https://www.jmir.org/2024/1/e52180" } @Article{info:doi/10.2196/56627, author="Naseem, Usman and Thapa, Surendrabikram and Masood, Anum", title="Advancing Accuracy in Multimodal Medical Tasks Through Bootstrapped Language-Image Pretraining (BioMedBLIP): Performance Evaluation Study", journal="JMIR Med Inform", year="2024", month="Aug", day="5", volume="12", pages="e56627", keywords="biomedical text mining", keywords="BioNLP", keywords="vision-language pretraining", keywords="multimodal models", keywords="medical image analysis", abstract="Background: Medical image analysis, particularly in the context of visual question answering (VQA) and image captioning, is crucial for accurate diagnosis and educational purposes. Objective: Our study aims to introduce BioMedBLIP models, fine-tuned for VQA tasks using specialized medical data sets such as Radiology Objects in Context and Medical Information Mart for Intensive Care-Chest X-ray, and evaluate their performance in comparison to the state of the art (SOTA) original Bootstrapping Language-Image Pretraining (BLIP) model. Methods: We present 9 versions of BioMedBLIP across 3 downstream tasks in various data sets. The models are trained on a varying number of epochs. The findings indicate the strong overall performance of our models. We proposed BioMedBLIP for the VQA generation model, VQA classification model, and BioMedBLIP image caption model. We conducted pretraining in BLIP using medical data sets, producing an adapted BLIP model tailored for medical applications. Results: In VQA generation tasks, BioMedBLIP models outperformed the SOTA on the Semantically-Labeled Knowledge-Enhanced (SLAKE) data set, VQA in Radiology (VQA-RAD), and Image Cross-Language Evaluation Forum data sets. In VQA classification, our models consistently surpassed the SOTA on the SLAKE data set. Our models also showed competitive performance on the VQA-RAD and PathVQA data sets. 
Similarly, in image captioning tasks, our model beat the SOTA, suggesting the importance of pretraining with medical data sets. Overall, across 20 different data set and task combinations, BioMedBLIP represents a new SOTA in 15 (75\%) out of 20 tasks, and our responses were rated higher in all 20 tasks (P<.005) in comparison to SOTA models. Conclusions: Our BioMedBLIP models show promising performance and suggest that incorporating medical knowledge through pretraining with domain-specific medical data sets helps models achieve higher performance. Our models thus demonstrate their potential to advance medical image analysis, impacting diagnosis, medical education, and research. However, data quality, task-specific variability, computational resources, and ethical considerations should be carefully addressed. In conclusion, our models represent a contribution toward the synergy of artificial intelligence and medicine. We have made BioMedBLIP freely available, which will help in further advancing research in multimodal medical tasks. ", doi="10.2196/56627", url="https://medinform.jmir.org/2024/1/e56627", url="http://www.ncbi.nlm.nih.gov/pubmed/39102281" } @Article{info:doi/10.2196/52257, author="Chan, Lisa Jennifer and Tsay, Sarah and Sambara, Sraavya and Welch, B. 
Sarah", title="Understanding the Use of Mobility Data in Disasters: Exploratory Qualitative Study of COVID-19 User Feedback", journal="JMIR Hum Factors", year="2024", month="Aug", day="1", volume="11", pages="e52257", keywords="mobility data", keywords="disasters", keywords="surveillance", keywords="COVID-19", keywords="qualitative", keywords="user feedback", keywords="policy making", keywords="emergency", keywords="pandemic", keywords="disaster response", keywords="data usage", keywords="situational awareness", keywords="data translation", keywords="big data", abstract="Background: Human mobility data have been used as a potential novel data source to guide policies and response planning during the COVID-19 global pandemic. The COVID-19 Mobility Data Network (CMDN) facilitated the use of human mobility data around the world. Both researchers and policy makers assumed that mobility data would provide insights to help policy makers and response planners. However, evidence that human mobility data were operationally useful and provided added value for public health response planners remains largely unknown. Objective: This exploratory study focuses on advancing the understanding of the use of human mobility data during the early phase of the COVID-19 pandemic. The study explored how researchers and practitioners around the world used these data in response planning and policy making, focusing on processing data and human factors enabling or hindering use of the data. Methods: Our project was based on phenomenology and used an inductive approach to thematic analysis. Transcripts were open-coded to create the codebook that was then applied by 2 team members who blind-coded all transcripts. Consensus coding was used for coding discrepancies. Results: Interviews were conducted with 45 individuals during the early period of the COVID-19 pandemic. 
Although some teams used mobility data for response planning, few were able to describe their uses in policy making, and there were no standardized ways that teams used mobility data. Mobility data played a larger role in providing situational awareness for government partners, helping to understand where people were moving in relation to the spread of COVID-19 variants and reactions to stay-at-home orders. Interviewees who felt they were more successful using mobility data often cited an individual who was able to answer general questions about mobility data; provide interactive feedback on results; and enable a 2-way communication exchange about data, meaning, value, and potential use. Conclusions: Human mobility data were used as a novel data source in the COVID-19 pandemic by a network of academic researchers and practitioners using privacy-preserving and anonymized mobility data. This study reflects the processes in analyzing and communicating human mobility data, as well as how these data were used in response planning and how the data were intended for use in policy making. The study reveals several valuable use cases. Ultimately, the role of a data translator was crucial in understanding the complexities of this novel data source. With this role, teams were able to adapt workflows, visualizations, and reports to align with end users and decision makers while communicating this information meaningfully to address the goals of responders and policy makers. 
", doi="10.2196/52257", url="https://humanfactors.jmir.org/2024/1/e52257", url="http://www.ncbi.nlm.nih.gov/pubmed/39088256" } @Article{info:doi/10.2196/56237, author="Amadi, David and Kiwuwa-Muyingo, Sylvia and Bhattacharjee, Tathagata and Taylor, Amelia and Kiragga, Agnes and Ochola, Michael and Kanjala, Chifundo and Gregory, Arofan and Tomlin, Keith and Todd, Jim and Greenfield, Jay", title="Making Metadata Machine-Readable as the First Step to Providing Findable, Accessible, Interoperable, and Reusable Population Health Data: Framework Development and Implementation Study", journal="Online J Public Health Inform", year="2024", month="Aug", day="1", volume="16", pages="e56237", keywords="FAIR data principles", keywords="metadata", keywords="machine-readable metadata", keywords="DDI", keywords="Data Documentation Initiative", keywords="standardization", keywords="JSON-LD", keywords="JavaScript Object Notation for Linked Data", keywords="OMOP CDM", keywords="Observational Medical Outcomes Partnership Common Data Model", keywords="data science", keywords="data models", abstract="Background: Metadata describe and provide context for other data, playing a pivotal role in enabling findability, accessibility, interoperability, and reusability (FAIR) data principles. By providing comprehensive and machine-readable descriptions of digital resources, metadata empower both machines and human users to seamlessly discover, access, integrate, and reuse data or content across diverse platforms and applications. However, the limited accessibility and machine-interpretability of existing metadata for population health data hinder effective data discovery and reuse. 
Objective: To address these challenges, we propose a comprehensive framework using standardized formats, vocabularies, and protocols to render population health data machine-readable, significantly enhancing their FAIRness and enabling seamless discovery, access, and integration across diverse platforms and research applications. Methods: The framework implements a 3-stage approach. The first stage is Data Documentation Initiative (DDI) integration, which involves leveraging the DDI Codebook metadata and documentation of detailed information for data and associated assets, while ensuring transparency and comprehensiveness. The second stage is Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) standardization. In this stage, the data are harmonized and standardized into the OMOP CDM, facilitating unified analysis across heterogeneous data sets. The third stage involves the integration of Schema.org and JavaScript Object Notation for Linked Data (JSON-LD), in which machine-readable metadata are generated using Schema.org entities and embedded within the data using JSON-LD, boosting discoverability and comprehension for both machines and human users. We demonstrated the implementation of these 3 stages using the Integrated Disease Surveillance and Response (IDSR) data from Malawi and Kenya. Results: The implementation of our framework significantly enhanced the FAIRness of population health data, resulting in improved discoverability through seamless integration with platforms such as Google Dataset Search. The adoption of standardized formats and protocols streamlined data accessibility and integration across various research environments, fostering collaboration and knowledge sharing. Additionally, the use of machine-interpretable metadata empowered researchers to efficiently reuse data for targeted analyses and insights, thereby maximizing the overall value of population health resources. 
The JSON-LD codes are accessible via a GitHub repository and the HTML code integrated with JSON-LD is available on the Implementation Network for Sharing Population Information from Research Entities website. Conclusions: The adoption of machine-readable metadata standards is essential for ensuring the FAIRness of population health data. By embracing these standards, organizations can enhance diverse resource visibility, accessibility, and utility, leading to a broader impact, particularly in low- and middle-income countries. Machine-readable metadata can accelerate research, improve health care decision-making, and ultimately promote better health outcomes for populations worldwide. ", doi="10.2196/56237", url="https://ojphi.jmir.org/2024/1/e56237", url="http://www.ncbi.nlm.nih.gov/pubmed/39088253" } @Article{info:doi/10.2196/48595, author="Ben Yehuda, Ori and Itelman, Edward and Vaisman, Adva and Segal, Gad and Lerner, Boaz", title="Early Detection of Pulmonary Embolism in a General Patient Population Immediately Upon Hospital Admission Using Machine Learning to Identify New, Unidentified Risk Factors: Model Development Study", journal="J Med Internet Res", year="2024", month="Jul", day="30", volume="26", pages="e48595", keywords="pulmonary embolism", keywords="deep vein thrombosis", keywords="venous thromboembolism", keywords="imbalanced data", keywords="clustering", keywords="risk factors", keywords="Wells score", keywords="revised Geneva score", keywords="hospital admission", keywords="machine learning", abstract="Background: Under- or late identification of pulmonary embolism (PE)---a thrombosis of 1 or more pulmonary arteries that seriously threatens patients' lives---is a major challenge confronting modern medicine. 
Objective: We aimed to establish accurate and informative machine learning (ML) models to identify patients at high risk for PE as they are admitted to the hospital, before their initial clinical checkup, by using only the information in their medical records. Methods: We collected demographics, comorbidities, and medications data for 2568 patients with PE and 52,598 control patients. We focused on data available prior to emergency department admission, as these are the most universally accessible data. We trained an ML random forest algorithm to detect PE at the earliest possible time during a patient's hospitalization---at the time of his or her admission. We developed and applied 2 ML-based methods specifically to address the data imbalance between PE and non-PE patients, which causes misdiagnosis of PE. Results: The resulting models predicted PE based on age, sex, BMI, past clinical PE events, chronic lung disease, past thrombotic events, and usage of anticoagulants, obtaining an 80\% geometric mean value for the PE and non-PE classification accuracies. Although on hospital admission only 4\% (1942/46,639) of the patients had a diagnosis of PE, we identified 2 clustering schemes comprising subgroups with more than 61\% (705/1120 in clustering scheme 1; 427/701 and 340/549 in clustering scheme 2) positive patients for PE. One subgroup in the first clustering scheme included 36\% (705/1942) of all patients with PE who were characterized by a definite past PE diagnosis, a 6-fold higher prevalence of deep vein thrombosis, and a 3-fold higher prevalence of pneumonia, compared with patients of the other subgroups in this scheme. In the second clustering scheme, 2 subgroups (1 of only men and 1 of only women) included patients who all had a past PE diagnosis and a relatively high prevalence of pneumonia, and a third subgroup included only those patients with a past diagnosis of pneumonia. 
Conclusions: This study established an ML tool for early diagnosis of PE almost immediately upon hospital admission. Despite the highly imbalanced scenario undermining accurate PE prediction and using information available only from the patient's medical history, our models were both accurate and informative, enabling the identification of patients already at high risk for PE upon hospital admission, even before the initial clinical checkup was performed. The fact that we did not restrict our patients to those at high risk for PE according to previously published scales (eg, Wells or revised Geneva scores) enabled us to accurately assess the application of ML on raw medical data and identify new, previously unidentified risk factors for PE, such as previous pulmonary disease, in general populations. ", doi="10.2196/48595", url="https://www.jmir.org/2024/1/e48595" } @Article{info:doi/10.2196/52896, author="Ghasemi, Peyman and Lee, Joon", title="Unsupervised Feature Selection to Identify Important ICD-10 and ATC Codes for Machine Learning on a Cohort of Patients With Coronary Heart Disease: Retrospective Study", journal="JMIR Med Inform", year="2024", month="Jul", day="26", volume="12", pages="e52896", keywords="unsupervised feature selection", keywords="ICD-10", keywords="International Classification of Diseases", keywords="ATC", keywords="Anatomical Therapeutic Chemical", keywords="concrete autoencoder", keywords="Laplacian score", keywords="unsupervised feature selection for multicluster data", keywords="autoencoder-inspired unsupervised feature selection", keywords="principal feature analysis", keywords="machine learning", keywords="artificial intelligence", keywords="case study", keywords="coronary artery disease", keywords="artery disease", keywords="patient cohort", keywords="artery", keywords="mortality prediction", keywords="mortality", keywords="data set", keywords="interpretability", keywords="International Classification of Diseases, Tenth Revision", 
abstract="Background: The application of machine learning in health care often necessitates the use of hierarchical codes such as the International Classification of Diseases (ICD) and Anatomical Therapeutic Chemical (ATC) systems. These codes classify diseases and medications, respectively, thereby forming extensive data dimensions. Unsupervised feature selection tackles the ``curse of dimensionality'' and helps to improve the accuracy and performance of supervised learning models by reducing the number of irrelevant or redundant features and avoiding overfitting. Techniques for unsupervised feature selection, such as filter, wrapper, and embedded methods, are implemented to select the most important features with the most intrinsic information. However, they face challenges due to the sheer volume of ICD and ATC codes and the hierarchical structures of these systems. Objective: The objective of this study was to compare several unsupervised feature selection methods for ICD and ATC code databases of patients with coronary artery disease in different aspects of performance and complexity and select the best set of features representing these patients. Methods: We compared several unsupervised feature selection methods for 2 ICD and 1 ATC code databases of 51,506 patients with coronary artery disease in Alberta, Canada. Specifically, we used the Laplacian score, unsupervised feature selection for multicluster data, autoencoder-inspired unsupervised feature selection, principal feature analysis, and concrete autoencoders with and without ICD or ATC tree weight adjustment to select the 100 best features from over 9000 ICD and 2000 ATC codes. We assessed the selected features based on their ability to reconstruct the initial feature space and predict 90-day mortality following discharge. 
We also compared the complexity of the selected features by mean code level in the ICD or ATC tree and the interpretability of the features in the mortality prediction task using Shapley analysis. Results: In feature space reconstruction and mortality prediction, the concrete autoencoder--based methods outperformed other techniques. Particularly, a weight-adjusted concrete autoencoder variant demonstrated improved reconstruction accuracy and significant predictive performance enhancement, confirmed by DeLong and McNemar tests (P<.05). Concrete autoencoders preferred more general codes, and they consistently reconstructed all features accurately. Additionally, features selected by weight-adjusted concrete autoencoders yielded higher Shapley values in mortality prediction than most alternatives. Conclusions: This study scrutinized 5 feature selection methods in ICD and ATC code data sets in an unsupervised context. Our findings underscore the superiority of the concrete autoencoder method in selecting salient features that represent the entire data set, offering a potential asset for subsequent machine learning research. We also present a novel weight adjustment approach for the concrete autoencoders specifically tailored for ICD and ATC code data sets to enhance the generalizability and interpretability of the selected features. ", doi="10.2196/52896", url="https://medinform.jmir.org/2024/1/e52896" } @Article{info:doi/10.2196/54281, author="G{\'o}mez, Gustavo and Hufstedler, Heather and Montenegro Morales, Carlos and Roell, Yannik and Lozano-Parra, Anyela and Tami, Adriana and Magalhaes, Tereza and Marques, A. Ernesto T. 
and Balmaseda, Angel and Calvet, Guilherme and Harris, Eva and Brasil, Patricia and Herrera, Victor and Villar, Luis and Maxwell, Lauren and Jaenisch, Thomas and ", title="Pooled Cohort Profile: ReCoDID Consortium's Harmonized Acute Febrile Illness Arbovirus Meta-Cohort", journal="JMIR Public Health Surveill", year="2024", month="Jul", day="23", volume="10", pages="e54281", keywords="infectious disease", keywords="harmonized meta-cohort", keywords="IPD-MA", keywords="arbovirus", keywords="dengue", keywords="zika", keywords="chikungunya", keywords="surveillance", keywords="public health", keywords="open access data", keywords="FAIR principles", keywords="febrile illness", keywords="clinical-epidemiological data", keywords="cross-disease interaction", keywords="epidemiology", keywords="consortium", keywords="innovation", keywords="statistical tool", keywords="Latin America", keywords="Maelstrom's", keywords="methodology", keywords="CDISC", keywords="immunological interaction", keywords="flavivirus", keywords="infection", keywords="arboviral disease", doi="10.2196/54281", url="https://publichealth.jmir.org/2024/1/e54281", url="http://www.ncbi.nlm.nih.gov/pubmed/39042429" } @Article{info:doi/10.2196/57005, author="Rosenau, Lorenz and Behrend, Paul and Wiedekopf, Joshua and Gruendner, Julian and Ingenerf, Josef", title="Uncovering Harmonization Potential in Health Care Data Through Iterative Refinement of Fast Healthcare Interoperability Resources Profiles Based on Retrospective Discrepancy Analysis: Case Study", journal="JMIR Med Inform", year="2024", month="Jul", day="23", volume="12", pages="e57005", keywords="Health Level 7 Fast Healthcare Interoperability Resources", keywords="HL7 FHIR", keywords="FHIR profiles", keywords="interoperability", keywords="data harmonization", keywords="discrepancy analysis", keywords="data quality", keywords="cross-institutional data exchange", keywords="Medical Informatics Initiative", keywords="federated data access challenges", 
abstract="Background: Cross-institutional interoperability between health care providers remains a recurring challenge worldwide. The German Medical Informatics Initiative, a collaboration of 37 university hospitals in Germany, aims to enable interoperability between partner sites by defining Fast Healthcare Interoperability Resources (FHIR) profiles for the cross-institutional exchange of health care data, the Core Data Set (CDS). The current CDS and its extension modules define elements representing patients' health care records. All university hospitals in Germany have made significant progress in providing routine data in a standardized format based on the CDS. In addition, the central research platform for health, the German Portal for Medical Research Data feasibility tool, allows medical researchers to query the available CDS data items across many participating hospitals. Objective: In this study, we aimed to evaluate a novel approach of combining the current top-down generated FHIR profiles with the bottom-up generated knowledge gained by the analysis of respective instance data. This allowed us to derive options for iteratively refining FHIR profiles using the information obtained from a discrepancy analysis. Methods: We developed an FHIR validation pipeline and opted to derive more restrictive profiles from the original CDS profiles. This decision was driven by the need to align more closely with the specific assumptions and requirements of the central feasibility platform's search ontology. While the original CDS profiles offer a generic framework adaptable for a broad spectrum of medical informatics use cases, they lack the specificity to model the nuanced criteria essential for medical researchers. A key example of this is the necessity to represent specific laboratory codings and values interdependencies accurately. 
The validation results allow us to identify discrepancies between the instance data at the clinical sites and the profiles specified by the feasibility platform, which can be addressed in the future. Results: A total of 20 university hospitals participated in this study. Historical factors, lack of harmonization, a wide range of source systems, and case sensitivity of coding are some of the causes for the discrepancies identified. While in our case study, Conditions, Procedures, and Medications have a high degree of uniformity in the coding of instance data due to legislative requirements for billing in Germany, we found that laboratory values pose a significant data harmonization challenge due to their interdependency between coding and value. Conclusions: While the CDS achieves interoperability, different challenges for federated data access arise, requiring more specificity in the profiles to make assumptions on the instance data. We also argue that further harmonization of the instance data can significantly lower the required retrospective harmonization effort. We recognize that discrepancies cannot be resolved solely at the clinical site; therefore, our findings have a wide range of implications and will require action on multiple levels and by various stakeholders. ", doi="10.2196/57005", url="https://medinform.jmir.org/2024/1/e57005" } @Article{info:doi/10.2196/54994, author="Levinson, T. Rebecca and Paul, Cinara and Meid, D. 
Andreas and Schultz, Jobst-Hendrik and Wild, Beate", title="Identifying Predictors of Heart Failure Readmission in Patients From a Statutory Health Insurance Database: Retrospective Machine Learning Study", journal="JMIR Cardio", year="2024", month="Jul", day="23", volume="8", pages="e54994", keywords="statutory health insurance", keywords="readmission", keywords="machine learning", keywords="heart failure", keywords="heart", keywords="cardiology", keywords="cardiac", keywords="hospitalization", keywords="insurance", keywords="predict", keywords="predictive", keywords="prediction", keywords="predictions", keywords="predictor", keywords="predictors", keywords="all cause", abstract="Background: Patients with heart failure (HF) are the most commonly readmitted group of adult patients in Germany. Most patients with HF are readmitted for noncardiovascular reasons. Understanding the relevance of HF management outside the hospital setting is critical to understanding HF and factors that lead to readmission. Application of machine learning (ML) on data from statutory health insurance (SHI) allows the evaluation of large longitudinal data sets representative of the general population to support clinical decision-making. Objective: This study aims to evaluate the ability of ML methods to predict 1-year all-cause and HF-specific readmission after initial HF-related admission of patients with HF in outpatient SHI data and identify important predictors. Methods: We identified individuals with HF using outpatient data from 2012 to 2018 from the AOK Baden-W{\"u}rttemberg SHI in Germany. We then trained and applied regression and ML algorithms to predict the first all-cause and HF-specific readmission in the year after the first admission for HF. 
We fitted a random forest, an elastic net, a stepwise regression, and a logistic regression to predict readmission by using diagnosis codes, drug exposures, demographics (age, sex, nationality, and type of coverage within SHI), degree of rurality for residence, and participation in disease management programs for common chronic conditions (diabetes mellitus type 1 and 2, breast cancer, chronic obstructive pulmonary disease, and coronary heart disease). We then evaluated the predictors of HF readmission according to their importance and direction to predict readmission. Results: Our final data set consisted of 97,529 individuals with HF, and 78,044 (80\%) were readmitted within the observation period. Of the tested modeling approaches, the random forest approach best predicted 1-year all-cause and HF-specific readmission with a C-statistic of 0.68 and 0.69, respectively. Important predictors for 1-year all-cause readmission included prescription of pantoprazole, chronic obstructive pulmonary disease, atherosclerosis, sex, rurality, and participation in disease management programs for type 2 diabetes mellitus and coronary heart disease. Relevant features for HF-specific readmission included a large number of canonical HF comorbidities. Conclusions: While many of the predictors we identified were known to be relevant comorbidities for HF, we also uncovered several novel associations. Disease management programs have been widely shown to be effective at managing chronic disease; however, our results indicate that in the short term they may be useful for targeting patients with HF and comorbidities who are at increased risk of readmission. Our results also show that living in a more rural location increases the risk of readmission. Overall, factors beyond comorbid disease were relevant for risk of HF readmission. This finding may impact how outpatient physicians identify and monitor patients at risk of HF readmission. 
", doi="10.2196/54994", url="https://cardio.jmir.org/2024/1/e54994" } @Article{info:doi/10.2196/53624, author="Jalali, Alireza and Nyman, Jacline and Loeffelholz, Ouida and Courtney, Chantelle", title="Data-Driven Fundraising: Strategic Plan for Medical Education", journal="JMIR Med Educ", year="2024", month="Jul", day="22", volume="10", pages="e53624", keywords="fundraising", keywords="philanthropy", keywords="crowdfunding", keywords="funding", keywords="charity", keywords="higher education", keywords="university", keywords="medical education", keywords="educators", keywords="advancement", keywords="data analytics", keywords="ethics", keywords="ethical", keywords="education", keywords="medical school", keywords="school", keywords="support", keywords="financial", keywords="community", doi="10.2196/53624", url="https://mededu.jmir.org/2024/1/e53624" } @Article{info:doi/10.2196/45030, author="Congy, Juliette and Rahib, Delphine and Leroy, C{\'e}line and Bouyer, Jean and de La Rochebrochard, Elise", title="Contraceptive Use Measured in a National Population--Based Approach: Cross-Sectional Study of Administrative Versus Survey Data", journal="JMIR Public Health Surveill", year="2024", month="Jul", day="22", volume="10", pages="e45030", keywords="contraception", keywords="administrative data", keywords="health data", keywords="implant", keywords="oral contraceptives", keywords="intrauterine device", keywords="IUD", keywords="contraceptive prevalence", keywords="contraceptive", keywords="birth control", keywords="monitoring", keywords="public health issue", keywords="population-based survey", keywords="prevalence", abstract="Background: Prescribed contraception is used worldwide by over 400 million women of reproductive age. Monitoring contraceptive use is a major public health issue that usually relies on population-based surveys. However, these surveys are conducted on average every 6 years and do not allow close follow-up of contraceptive use. 
Moreover, their sample size is often too limited for the study of specific population subgroups such as people with low income. Health administrative data could be an innovative and less costly source to study contraceptive use. Objective: We aimed to explore the potential of health administrative data to study prescribed contraceptive use and compare these data with observations based on survey data. Methods: We selected all women aged 15-49 years, covered by French health insurance and living in France, in the health administrative database, which covers 98\% of the resident population (n=14,788,124), and in the last French population--based representative survey, the Health Barometer Survey, conducted in 2016 (n=4285). In health administrative data, contraceptive use was recorded with detailed information on the product delivered, whereas in the survey, it was self-declared by the women. In both sources, the prevalence of contraceptive use was estimated globally for all prescribed contraceptives and by type of contraceptive: oral contraceptives, intrauterine devices (IUDs), and implants. Prevalences were analyzed by age. Results: There were more low-income women in health administrative data than in the population-based survey (1,576,066/14,770,256, 11\% vs 188/4285, 7\%, respectively; P<.001). In health administrative data, 47.6\% (7,034,710/14,770,256; 95\% CI 47.6\%-47.7\%) of women aged 15-49 years used a prescribed contraceptive versus 50.5\% (2297/4285; 95\% CI 49.1\%-52.0\%) in the population-based survey. Considering prevalences by the type of contraceptive in health administrative data versus survey data, they were 26.9\% (95\% CI 26.9\%-26.9\%) versus 27.7\% (95\% CI 26.4\%-29.0\%) for oral contraceptives, 17.7\% (95\% CI 17.7\%-17.8\%) versus 19.6\% (95\% CI 18.5\%-20.8\%) for IUDs, and 3\% (95\% CI 3.0\%-3.0\%) versus 3.2\% (95\% CI 2.7\%-3.7\%) for implants. 
In both sources, the same overall tendency in prevalence was observed for these 3 contraceptives. Implants remained little used at all ages; oral contraceptives were highly used among young women, whereas IUD use was low among young women. Conclusions: Compared with survey data, health administrative data exhibited the same overall tendencies for oral contraceptives, IUDs, and implants. One of the main strengths of health administrative data is the high quality of information on contraceptive use and the large number of observations, allowing studies of population subgroups. Health administrative data therefore appear to be a promising new source to monitor contraception in a population-based approach. They could open new perspectives for research and be a valuable new asset to guide public policies on reproductive and sexual health. ", doi="10.2196/45030", url="https://publichealth.jmir.org/2024/1/e45030", url="http://www.ncbi.nlm.nih.gov/pubmed/39037774" } @Article{info:doi/10.2196/54590, author="Lamer, Antoine and Saint-Dizier, Chlo{\'e} and Paris, Nicolas and Chazard, Emmanuel", title="Data Lake, Data Warehouse, Datamart, and Feature Store: Their Contributions to the Complete Data Reuse Pipeline", journal="JMIR Med Inform", year="2024", month="Jul", day="17", volume="12", pages="e54590", keywords="data reuse", keywords="data lake", keywords="data warehouse", keywords="feature extraction", keywords="datamart", keywords="feature store", doi="10.2196/54590", url="https://medinform.jmir.org/2024/1/e54590" } @Article{info:doi/10.2196/54044, author="Ito, Genta and Yada, Shuntaro and Wakamiya, Shoko and Aramaki, Eiji", title="Predictive Model for Extended-Spectrum $\beta$-Lactamase--Producing Bacterial Infections Using Natural Language Processing Technique and Open Data in Intensive Care Unit Environment: Retrospective Observational Study", journal="JMIR Form Res", year="2024", month="Jul", day="10", volume="8", pages="e54044", keywords="predictive modeling", 
keywords="MIMIC-3 dataset", keywords="natural language processing", keywords="NLP", keywords="QuickUMLS", keywords="named entity recognition", keywords="ESBL-producing bacterial infections", abstract="Background: Machine learning has advanced medical event prediction, mostly using private data. The public MIMIC-3 (Medical Information Mart for Intensive Care III) data set, which contains detailed data on over 40,000 intensive care unit patients, stands out as it can help develop better models including structured and textual data. Objective: This study aimed to build and test a machine learning model using the MIMIC-3 data set to determine the effectiveness of information extracted from electronic medical record text using a named entity recognition tool, specifically QuickUMLS, for predicting important medical events. Using the prediction of extended-spectrum $\beta$-lactamase (ESBL)--producing bacterial infections as an example, this study shows how open data sources and simple technology can be useful for making clinically meaningful predictions. Methods: The MIMIC-3 data set, including demographics, vital signs, laboratory results, and textual data, such as discharge summaries, was used. This study specifically targeted patients diagnosed with Klebsiella pneumoniae or Escherichia coli infection. Predictions were based on ESBL-producing bacterial standards and the minimum inhibitory concentration criteria. Both the structured data and extracted patient histories were used as predictors. In total, 2 models, an L1-regularized logistic regression model and a LightGBM model, were evaluated using the receiver operating characteristic area under the curve (ROC-AUC) and the precision-recall curve area under the curve (PR-AUC). Results: Of 46,520 MIMIC-3 patients, 4046 were identified with bacterial cultures, indicating the presence of K pneumoniae or E coli. After excluding patients who lacked discharge summary text, 3614 patients remained. 
The L1-penalized model, with variables from only the structured data, displayed a ROC-AUC of 0.646 and a PR-AUC of 0.307. The LightGBM model, combining structured and textual data, achieved a ROC-AUC of 0.707 and a PR-AUC of 0.369. Key contributors to the LightGBM model included patient age, duration since hospital admission, and specific medical history such as diabetes. The structured data-based model showed improved performance compared to the reference models. Performance was further improved when textual medical history was included. Compared to other models predicting drug-resistant bacteria, the results of this study ranked in the middle. Some misidentifications, potentially due to the limitations of QuickUMLS, may have affected the accuracy of the model. Conclusions: This study successfully developed a predictive model for ESBL-producing bacterial infections using the MIMIC-3 data set, yielding results consistent with existing literature. This model stands out for its transparency and reliance on open data and open-named entity recognition technology. The performance of the model was enhanced using textual information. With advancements in natural language processing tools such as BERT and GPT, the extraction of medical data from text holds substantial potential for future model optimization. ", doi="10.2196/54044", url="https://formative.jmir.org/2024/1/e54044" } @Article{info:doi/10.2196/51347, author="Wang, Meng and Peng, Yun and Wang, Ya and Luo, Dehong", title="Research Trends and Evolution in Radiogenomics (2005-2023): Bibliometric Analysis", journal="Interact J Med Res", year="2024", month="Jul", day="9", volume="13", pages="e51347", keywords="bibliometric", keywords="radiogenomics", keywords="multiomics", keywords="genomics", keywords="radiomics", abstract="Background: Radiogenomics is an emerging technology that integrates genomics and medical image--based radiomics, which is considered a promising approach toward achieving precision medicine. 
Objective: The aim of this study was to quantitatively analyze the research status, dynamic trends, and evolutionary trajectory in the radiogenomics field using bibliometric methods. Methods: The relevant literature published up to 2023 was retrieved from the Web of Science Core Collection. Excel was used to analyze the annual publication trend. VOSviewer was used for constructing the keywords co-occurrence network and the collaboration networks among countries and institutions. CiteSpace was used for citation keywords burst analysis and visualizing the references timeline. Results: A total of 3237 papers were included and exported in plain-text format. The annual number of publications showed an increasing trend. China and the United States have published the most papers in this field, with the highest number of citations in the United States and the highest average number of citations per item in the Netherlands. Keywords burst analysis revealed that several keywords, including ``big data,'' ``magnetic resonance spectroscopy,'' ``renal cell carcinoma,'' ``stage,'' and ``temozolomide,'' experienced a citation burst in recent years. The timeline views demonstrated that the references can be categorized into 8 clusters: lower-grade glioma, lung cancer histology, lung adenocarcinoma, breast cancer, radiation-induced lung injury, epidermal growth factor receptor mutation, late radiotherapy toxicity, and artificial intelligence. Conclusions: The field of radiogenomics is attracting increasing attention from researchers worldwide, with the United States and the Netherlands being the most influential countries. Exploration of artificial intelligence methods based on big data to predict the response of tumors to various treatment methods represents a hot spot research topic in this field at present. 
", doi="10.2196/51347", url="https://www.i-jmr.org/2024/1/e51347", url="http://www.ncbi.nlm.nih.gov/pubmed/38980713" } @Article{info:doi/10.2196/55013, author="Liu, Chuchu and Holme, Petter and Lehmann, Sune and Yang, Wenchuan and Lu, Xin", title="Nonrepresentativeness of Human Mobility Data and its Impact on Modeling Dynamics of the COVID-19 Pandemic: Systematic Evaluation", journal="JMIR Form Res", year="2024", month="Jun", day="28", volume="8", pages="e55013", keywords="human mobility", keywords="data representativeness", keywords="population composition", keywords="COVID-19", keywords="epidemiological modeling", abstract="Background: In recent years, a range of novel smartphone-derived data streams about human mobility have become available on a near--real-time basis. These data have been used, for example, to perform traffic forecasting and epidemic modeling. During the COVID-19 pandemic in particular, human travel behavior has been considered a key component of epidemiological modeling to provide more reliable estimates about the volumes of the pandemic's importation and transmission routes, or to identify hot spots. However, nearly universally in the literature, the representativeness of these data, how they relate to the underlying real-world human mobility, has been overlooked. This disconnect between data and reality is especially relevant in the case of socially disadvantaged minorities. Objective: The objective of this study is to illustrate the nonrepresentativeness of data on human mobility and the impact of this nonrepresentativeness on modeling dynamics of the epidemic. This study systematically evaluates how real-world travel flows differ from census-based estimations, especially in the case of socially disadvantaged minorities, such as older adults and women, and further measures biases introduced by this difference in epidemiological studies. 
Methods: To understand the demographic composition of population movements, a nationwide mobility data set from 318 million mobile phone users in China from January 1 to February 29, 2020, was curated. Specifically, we quantified the disparity in the population composition between actual migrations and resident composition according to census data, and showed how this nonrepresentativeness impacts epidemiological modeling by constructing an age-structured SEIR (Susceptible-Exposed-Infected-Recovered) model of COVID-19 transmission. Results: We found a significant difference in the demographic composition between those who travel and the overall population. In the population flows, 59\% (n=20,067,526) of travelers are young and 36\% (n=12,210,565) of them are middle-aged (P<.001), which is completely different from the overall adult population composition of China (where 36\% of individuals are young and 40\% of them are middle-aged). This difference would introduce a striking bias in epidemiological studies: the estimation of maximum daily infections differs nearly 3 times, and the peak time has a large gap of 46 days. Conclusions: The difference between actual migrations and resident composition strongly impacts outcomes of epidemiological forecasts, which typically assume that flows represent underlying demographics. Our findings imply that it is necessary to measure and quantify the inherent biases related to nonrepresentativeness for accurate epidemiological surveillance and forecasting. 
", doi="10.2196/55013", url="https://formative.jmir.org/2024/1/e55013" } @Article{info:doi/10.2196/50437, author="Faust, Louis and Wilson, Patrick and Asai, Shusaku and Fu, Sunyang and Liu, Hongfang and Ruan, Xiaoyang and Storlie, Curt", title="Considerations for Quality Control Monitoring of Machine Learning Models in Clinical Practice", journal="JMIR Med Inform", year="2024", month="Jun", day="28", volume="12", pages="e50437", keywords="artificial intelligence", keywords="machine learning", keywords="implementation science", keywords="quality control", keywords="monitoring", keywords="patient safety", doi="10.2196/50437", url="https://medinform.jmir.org/2024/1/e50437", url="http://www.ncbi.nlm.nih.gov/pubmed/38941140" } @Article{info:doi/10.2196/56759, author="Reshetnikov, Aleksey and Shaikhattarova, Natalia and Mazurok, Margarita and Kasatkina, Nadezhda", title="Dental Tissue Density in Healthy Children Based on Radiological Data: Retrospective Analysis", journal="JMIRx Med", year="2024", month="Jun", day="20", volume="5", pages="e56759", keywords="density", keywords="teeth", keywords="tooth", keywords="dental", keywords="dentist", keywords="dentists", keywords="dentistry", keywords="oral", keywords="tissue", keywords="enamel", keywords="dentin", keywords="Hounsfield", keywords="pathology", keywords="pathological", keywords="radiology", keywords="radiological", keywords="image", keywords="images", keywords="imaging", keywords="teeth density", keywords="Hounsfield unit", keywords="diagnostic imaging", abstract="Background: Information about the range of Hounsfield values for healthy teeth tissues could become an additional tool in assessing dental health and could be used, among other data, for subsequent machine learning. Objective: The purpose of our study was to determine dental tissue densities in Hounsfield units (HU). Methods: The total sample included 36 healthy children (n=21, 58\% girls and n=15, 42\% boys) aged 10-11 years at the time of the study. 
The densities of 320 teeth tissues were analyzed. Data were expressed as means and SDs. The significance was determined using the Student (1-tailed) t test. The statistical significance was set at P<.05. Results: The densities of 320 teeth tissues were analyzed: 72 (22.5\%) first permanent molars, 72 (22.5\%) permanent central incisors, 27 (8.4\%) second primary molars, 40 (12.5\%) tooth germs of second premolars, 37 (11.6\%) second premolars, 9 (2.8\%) second permanent molars, and 63 (19.7\%) tooth germs of second permanent molars. The analysis of the data showed that tissues of healthy teeth in children have different density ranges: enamel, from mean 2954.69 (SD 223.77) HU to mean 2071.00 (SD 222.86) HU; dentin, from mean 1899.23 (SD 145.94) HU to mean 1323.10 (SD 201.67) HU; and pulp, from mean 420.29 (SD 196.47) HU to mean 183.63 (SD 97.59) HU. The tissues (enamel and dentin) of permanent central incisors in the mandible and maxilla had the highest mean densities. No gender differences concerning the density of dental tissues were reliably identified. Conclusions: The evaluation of Hounsfield values for dental tissues can be used as an objective method for assessing their densities. If the determined densities of the enamel, dentin, and pulp of the tooth do not correspond to the range of values for healthy tooth tissues, then it may indicate a pathology. 
", doi="10.2196/56759", url="https://xmed.jmir.org/2024/1/e56759" } @Article{info:doi/10.2196/50209, author="Abdullahi, Tassallah and Mercurio, Laura and Singh, Ritambhara and Eickhoff, Carsten", title="Retrieval-Based Diagnostic Decision Support: Mixed Methods Study", journal="JMIR Med Inform", year="2024", month="Jun", day="19", volume="12", pages="e50209", keywords="clinical decision support", keywords="rare diseases", keywords="ensemble learning", keywords="retrieval-augmented learning", keywords="machine learning", keywords="electronic health records", keywords="natural language processing", keywords="retrieval augmented generation", keywords="RAG", keywords="electronic health record", keywords="EHR", keywords="data sparsity", keywords="information retrieval", abstract="Background: Diagnostic errors pose significant health risks and contribute to patient mortality. With the growing accessibility of electronic health records, machine learning models offer a promising avenue for enhancing diagnosis quality. Current research has primarily focused on a limited set of diseases with ample training data, neglecting diagnostic scenarios with limited data availability. Objective: This study aims to develop an information retrieval (IR)--based framework that accommodates data sparsity to facilitate broader diagnostic decision support. Methods: We introduced an IR-based diagnostic decision support framework called CliniqIR. It uses clinical text records, the Unified Medical Language System Metathesaurus, and 33 million PubMed abstracts to classify a broad spectrum of diagnoses independent of training data availability. CliniqIR is designed to be compatible with any IR framework. Therefore, we implemented it using both dense and sparse retrieval approaches. We compared CliniqIR's performance to that of pretrained clinical transformer models such as Clinical Bidirectional Encoder Representations from Transformers (ClinicalBERT) in supervised and zero-shot settings. 
Subsequently, we combined the strength of supervised fine-tuned ClinicalBERT and CliniqIR to build an ensemble framework that delivers state-of-the-art diagnostic predictions. Results: On a complex diagnosis data set (DC3) without any training data, CliniqIR models returned the correct diagnosis within their top 3 predictions. On the Medical Information Mart for Intensive Care III data set, CliniqIR models surpassed ClinicalBERT in predicting diagnoses with <5 training samples by an average difference in mean reciprocal rank of 0.10. In a zero-shot setting where models received no disease-specific training, CliniqIR still outperformed the pretrained transformer models with a greater mean reciprocal rank of at least 0.10. Furthermore, in most conditions, our ensemble framework surpassed the performance of its individual components, demonstrating its enhanced ability to make precise diagnostic predictions. Conclusions: Our experiments highlight the importance of IR in leveraging unstructured knowledge resources to identify infrequently encountered diagnoses. In addition, our ensemble framework benefits from combining the complementary strengths of the supervised and retrieval-based models to diagnose a broad spectrum of diseases. 
", doi="10.2196/50209", url="https://medinform.jmir.org/2024/1/e50209", url="http://www.ncbi.nlm.nih.gov/pubmed/38896468" } @Article{info:doi/10.2196/55118, author="Akiya, Ippei and Ishihara, Takuma and Yamamoto, Keiichi", title="Comparison of Synthetic Data Generation Techniques for Control Group Survival Data in Oncology Clinical Trials: Simulation Study", journal="JMIR Med Inform", year="2024", month="Jun", day="18", volume="12", pages="e55118", keywords="oncology clinical trial", keywords="survival analysis", keywords="synthetic patient data", keywords="machine learning", keywords="SPD", keywords="simulation", abstract="Background: Synthetic patient data (SPD) generation for survival analysis in oncology trials holds significant potential for accelerating clinical development. Various machine learning methods, including classification and regression trees (CART), random forest (RF), Bayesian network (BN), and conditional tabular generative adversarial network (CTGAN), have been used for this purpose, but their performance in reflecting actual patient survival data remains under investigation. Objective: The aim of this study was to determine the most suitable SPD generation method for oncology trials, specifically focusing on both progression-free survival (PFS) and overall survival (OS), which are the primary evaluation end points in oncology trials. To achieve this goal, we conducted a comparative simulation of 4 generation methods, including CART, RF, BN, and the CTGAN, and the performance of each method was evaluated. Methods: Using multiple clinical trial data sets, 1000 data sets were generated by using each method for each clinical trial data set and evaluated as follows: (1) median survival time (MST) of PFS and OS; (2) hazard ratio distance (HRD), which indicates the similarity between the actual survival function and a synthetic survival function; and (3) visual analysis of Kaplan-Meier (KM) plots. 
Each method's ability to mimic the statistical properties of real patient data was evaluated from these multiple angles. Results: In most simulation cases, CART demonstrated high percentages of MSTs for synthetic data falling within the 95\% CI range of the MST of the actual data. These percentages ranged from 88.8\% to 98.0\% for PFS and from 60.8\% to 96.1\% for OS. In the evaluation of HRD, CART revealed that HRD values were concentrated at approximately 0.9. Conversely, for the other methods, no consistent trend was observed for either PFS or OS. CART may have demonstrated better similarity than RF because CART allowed overfitting, whereas RF (an ensemble learning approach) prevented it. In SPD generation, the focus should be on statistical properties close to those of the actual data, not on a well-generalized prediction model. Neither the BN nor the CTGAN method can accurately reflect the statistical properties of the actual data because these methods are not suited to small data sets. Conclusions: As a method for generating SPD for survival data from small data sets, such as clinical trial data, CART was demonstrated to be the most effective method compared with RF, BN, and CTGAN. Additionally, it is possible to improve CART-based generation methods by incorporating feature engineering and other methods in future work. ", doi="10.2196/55118", url="https://medinform.jmir.org/2024/1/e55118" } @Article{info:doi/10.2196/50182, author="Singla, Ashwani and Khanna, Ritvik and Kaur, Manpreet and Kelm, Karen and Zaiane, Osmar and Rosenfelt, Scott Cory and Bui, An Truong and Rezaei, Navid and Nicholas, David and Reformat, Z. 
Marek and Majnemer, Annette and Ogourtsova, Tatiana and Bolduc, Francois", title="Developing a Chatbot to Support Individuals With Neurodevelopmental Disorders: Tutorial", journal="J Med Internet Res", year="2024", month="Jun", day="18", volume="26", pages="e50182", keywords="chatbot", keywords="user interface", keywords="knowledge graph", keywords="neurodevelopmental disability", keywords="autism", keywords="intellectual disability", keywords="attention-deficit/hyperactivity disorder", doi="10.2196/50182", url="https://www.jmir.org/2024/1/e50182", url="http://www.ncbi.nlm.nih.gov/pubmed/38888947" } @Article{info:doi/10.2196/47560, author="Syed, Ahmed Toufeeq and Thompson, L. Erika and Latif, Zainab and Johnson, Jay and Javier, Damaris and Stinson, Katie and Saleh, Gabrielle and Vishwanatha, K. Jamboor", title="Diverse Mentoring Connections Across Institutional Boundaries in the Biomedical Sciences: Innovative Graph Database Analysis", journal="J Med Internet Res", year="2024", month="Jun", day="17", volume="26", pages="e47560", keywords="online platform", keywords="mentorship", keywords="diversity", keywords="network analysis", keywords="graph database", keywords="online communities", abstract="Background: With an overarching goal of increasing diversity and inclusion in biomedical sciences, the National Research Mentoring Network (NRMN) developed a web-based national mentoring platform (MyNRMN) that seeks to connect mentors and mentees to support the persistence of underrepresented minorities in the biomedical sciences. As of May 15, 2024, the MyNRMN platform, which provides mentoring, networking, and professional development tools, has facilitated more than 12,100 unique mentoring connections between faculty, students, and researchers in the biomedical domain. 
Objective: This study aimed to examine the large-scale mentoring connections facilitated by our web-based platform between students (mentees) and faculty (mentors) across institutional and geographic boundaries. Using an innovative graph database, we analyzed diverse mentoring connections between mentors and mentees across demographic characteristics in the biomedical sciences. Methods: Through the MyNRMN platform, we observed profile data and analyzed mentoring connections made between students and faculty across institutional boundaries by race, ethnicity, gender, institution type, and educational attainment between July 1, 2016, and May 31, 2021. Results: In total, there were 15,024 connections with 2222 mentees and 1652 mentors across 1625 institutions contributing data. Female mentees participated in the highest number of connections (3996/6108, 65\%), whereas female mentors participated in 58\% (5206/8916) of the connections. Black mentees made up 38\% (2297/6108) of the connections, whereas White mentors participated in 56\% (5036/8916) of the connections. Mentees were predominately from institutions classified as Research 1 (R1; doctoral universities---very high research activity) and historically Black colleges and universities (556/2222, 25\% and 307/2222, 14\%, respectively), whereas 31\% (504/1652) of mentors were from R1 institutions. Conclusions: To date, the utility of mentoring connections across institutions throughout the United States and how mentors and mentees are connected is unknown. This study examined these connections and the diversity of these connections using an extensive web-based mentoring network. 
", doi="10.2196/47560", url="https://www.jmir.org/2024/1/e47560", url="http://www.ncbi.nlm.nih.gov/pubmed/38885013" } @Article{info:doi/10.2196/57209, author="Lai, Peixuan and Cai, Weicong and Qu, Lin and Hong, Chuangyue and Lin, Kaihao and Tan, Weiguo and Zhao, Zhiguang", title="Pulmonary Tuberculosis Notification Rate Within Shenzhen, China, 2010-2019: Spatial-Temporal Analysis", journal="JMIR Public Health Surveill", year="2024", month="Jun", day="14", volume="10", pages="e57209", keywords="tuberculosis", keywords="spatial analysis", keywords="spatial-temporal cluster", keywords="Shenzhen", keywords="China", abstract="Background: Pulmonary tuberculosis (PTB) is a chronic communicable disease of major public health and social concern. Although spatial-temporal analysis has been widely used to describe distribution characteristics and transmission patterns, few studies have revealed the changes in the small-scale clustering of PTB at the street level. Objective: The aim of this study was to analyze the temporal and spatial distribution characteristics and clusters of PTB at the street level in the Shenzhen municipality of China to provide a reference for PTB prevention and control. Methods: Data of reported PTB cases in Shenzhen from January 2010 to December 2019 were extracted from the China Information System for Disease Control and Prevention to describe the epidemiological characteristics. Time-series, spatial-autocorrelation, and spatial-temporal scanning analyses were performed to identify the spatial and temporal patterns and high-risk areas at the street level. Results: A total of 58,122 PTB cases from 2010 to 2019 were notified in Shenzhen. The annual notification rate of PTB decreased significantly from 64.97 per 100,000 population in 2010 to 43.43 per 100,000 population in 2019. PTB cases exhibited seasonal variations with peaks in late spring and summer each year. 
The PTB notification rate was nonrandomly distributed and spatially clustered with a Moran I value of 0.134 (P=.02). One most-likely cluster and 10 secondary clusters were detected, and the most-likely clustering area was centered at Nanshan Street of Nanshan District covering 6 streets, with the clustering time spanning from January 2010 to November 2012. Conclusions: This study identified seasonal patterns and spatial-temporal clusters of PTB cases at the street level in the Shenzhen municipality of China. Resources should be prioritized to the identified high-risk areas for PTB prevention and control. ", doi="10.2196/57209", url="https://publichealth.jmir.org/2024/1/e57209", url="http://www.ncbi.nlm.nih.gov/pubmed/38875687" } @Article{info:doi/10.2196/55632, author="Robertson, J. Alan and Mallett, J. Andrew and Stark, Zornitza and Sullivan, Clair", title="It Is in Our DNA: Bringing Electronic Health Records and Genomic Data Together for Precision Medicine", journal="JMIR Bioinform Biotech", year="2024", month="Jun", day="13", volume="5", pages="e55632", keywords="genomics", keywords="digital health", keywords="genetics", keywords="precision medicine", keywords="genomic", keywords="genomic data", keywords="electronic health records", keywords="DNA", keywords="supports", keywords="decision-making", keywords="timeliness", keywords="diagnosis", keywords="risk reduction", keywords="electronic medical records", doi="10.2196/55632", url="https://bioinform.jmir.org/2024/1/e55632", url="http://www.ncbi.nlm.nih.gov/pubmed/38935958" } @Article{info:doi/10.2196/56686, author="Shau, Wen-Yi and Santoso, Handoko and Jip, Vincent and Setia, Sajita", title="Integrated Real-World Data Warehouses Across 7 Evolving Asian Health Care Systems: Scoping Review", journal="J Med Internet Res", year="2024", month="Jun", day="11", volume="26", pages="e56686", keywords="Asia", keywords="health care databases", keywords="cross-country comparison", keywords="electronic health records", 
keywords="electronic medical records", keywords="data warehousing", keywords="information storage and retrieval", keywords="real-world data", keywords="real-world evidence", keywords="registries", keywords="scoping review", abstract="Background: Asia consists of diverse nations with extremely variable health care systems. Integrated real-world data (RWD) research warehouses provide vast interconnected data sets that uphold statistical rigor. Yet, their intricate details remain underexplored, restricting their broader applications. Objective: Building on our previous research that analyzed integrated RWD warehouses in India, Thailand, and Taiwan, this study extends the research to 7 distinct health care systems: Hong Kong, Indonesia, Malaysia, Pakistan, the Philippines, Singapore, and Vietnam. We aimed to map the evolving landscape of RWD, preferences for methodologies, and database use and archetype the health systems based on existing intrinsic capability for RWD generation. Methods: A systematic scoping review methodology was used, centering on contemporary English literature on PubMed (search date: May 9, 2023). Rigorous screening as defined by eligibility criteria identified RWD studies from multiple health care facilities in at least 1 of the 7 target Asian nations. Point estimates and their associated errors were determined for the data collected from eligible studies. Results: Of the 1483 real-world evidence citations identified on May 9, 2023, a total of 369 (24.9\%) fulfilled the requirements for data extraction and subsequent analysis. Singapore, Hong Kong, and Malaysia contributed to ≥100 publications, with each country marked by a higher proportion of single-country studies at 51\% (80/157), 66.2\% (86/130), and 50\% (50/100), respectively, and were classified as solo scholars. 
Indonesia, Pakistan, Vietnam, and the Philippines had fewer publications and a higher proportion of cross-country collaboration studies (CCCSs) at 79\% (26/33), 58\% (18/31), 74\% (20/27), and 86\% (19/22), respectively, and were classified as global collaborators. Collaboration with countries outside the 7 target nations appeared in 84.2\% to 97.7\% of the CCCSs of each nation. Among target nations, Singapore and Malaysia emerged as preferred research partners for other nations. From 2018 to 2023, most nations showed an increasing trend in study numbers, with Vietnam (24.5\%) and Pakistan (21.2\%) leading the growth; the only exception was the Philippines, which declined by --14.5\%. Clinical registry databases were predominant across all CCCSs from every target nation. For single-country studies, Indonesia, Malaysia, and the Philippines favored clinical registries; Singapore had a balanced use of clinical registries and electronic medical or health records, whereas Hong Kong, Pakistan, and Vietnam leaned toward electronic medical or health records. Overall, 89.9\% (310/345) of the studies took >2 years from completion to publication. Conclusions: The observed variations in contemporary RWD publications across the 7 nations in Asia exemplify distinct research landscapes across nations that are partially explained by their diverse economic, clinical, and research settings. Nevertheless, recognizing these variations is pivotal for fostering tailored, synergistic strategies that amplify RWD's potential in guiding future health care research and policy decisions. 
International Registered Report Identifier (IRRID): RR2-10.2196/43741 ", doi="10.2196/56686", url="https://www.jmir.org/2024/1/e56686", url="http://www.ncbi.nlm.nih.gov/pubmed/38749399" } @Article{info:doi/10.2196/50049, author="Stellmach, Caroline and Hopff, Marie Sina and Jaenisch, Thomas and Nunes de Miranda, Marina Susana and Rinaldi, Eugenia and ", title="Creation of Standardized Common Data Elements for Diagnostic Tests in Infectious Disease Studies: Semantic and Syntactic Mapping", journal="J Med Internet Res", year="2024", month="Jun", day="10", volume="26", pages="e50049", keywords="core data element", keywords="CDE", keywords="case report form", keywords="CRF", keywords="interoperability", keywords="semantic standards", keywords="infectious disease", keywords="diagnostic test", keywords="covid19", keywords="COVID-19", keywords="mpox", keywords="ZIKV", keywords="patient data", keywords="data model", keywords="syntactic interoperability", keywords="clinical data", keywords="FHIR", keywords="SNOMED CT", keywords="LOINC", keywords="virus infection", keywords="common element", abstract="Background: It is necessary to harmonize and standardize data variables used in case report forms (CRFs) of clinical studies to facilitate the merging and sharing of the collected patient data across several clinical studies. This is particularly true for clinical studies that focus on infectious diseases. Public health may be highly dependent on the findings of such studies. Hence, there is an elevated urgency to generate meaningful, reliable insights, ideally based on a high sample number and quality data. The implementation of core data elements and the incorporation of interoperability standards can facilitate the creation of harmonized clinical data sets. 
Objective: This study's objective was to compare, harmonize, and standardize variables focused on diagnostic tests used as part of CRFs in 6 international clinical studies of infectious diseases in order to, ultimately, then make available the panstudy common data elements (CDEs) for ongoing and future studies to foster interoperability and comparability of collected data across trials. Methods: We reviewed and compared the metadata that comprised the CRFs used for data collection in and across all 6 infectious disease studies under consideration in order to identify CDEs. We examined the availability of international semantic standard codes within the Systematized Nomenclature of Medicine - Clinical Terms, the National Cancer Institute Thesaurus, and the Logical Observation Identifiers Names and Codes system for the unambiguous representation of diagnostic testing information that makes up the CDEs. We then proposed 2 data models that incorporate semantic and syntactic standards for the identified CDEs. Results: Of 216 variables that were considered in the scope of the analysis, we identified 11 CDEs to describe diagnostic tests (in particular, serology and sequencing) for infectious diseases: viral lineage/clade; test date, type, performer, and manufacturer; target gene; quantitative and qualitative results; and specimen identifier, type, and collection date. Conclusions: The identification of CDEs for infectious diseases is the first step in facilitating the exchange and possible merging of a subset of data across clinical studies (and with that, large research projects) for possible shared analysis to increase the power of findings. The path to harmonization and standardization of clinical study data in the interest of interoperability can be paved in 2 ways. First, a map to standard terminologies ensures that each data element's (variable's) definition is unambiguous and that it has a single, unique interpretation across studies. 
Second, the exchange of these data is assisted by ``wrapping'' them in a standard exchange format, such as Fast Healthcare Interoperability Resources or the Clinical Data Interchange Standards Consortium's Clinical Data Acquisition Standards Harmonization Model. ", doi="10.2196/50049", url="https://www.jmir.org/2024/1/e50049", url="http://www.ncbi.nlm.nih.gov/pubmed/38857066" } @Article{info:doi/10.2196/51323, author="Hopcroft, EM Lisa and Curtis, J. Helen and Croker, Richard and Pretis, Felix and Inglesby, Peter and Evans, David and Bacon, Sebastian and Goldacre, Ben and Walker, J. Alex and MacKenna, Brian", title="Data-Driven Identification of Potentially Successful Intervention Implementations Using 5 Years of Opioid Prescribing Data: Retrospective Database Study", journal="JMIR Public Health Surveill", year="2024", month="Jun", day="5", volume="10", pages="e51323", keywords="electronic health records", keywords="primary care", keywords="general practice", keywords="opioid analgesics", keywords="data science", keywords="implementation science", keywords="data-driven", keywords="identification", keywords="intervention", keywords="implementations", keywords="proof of concept", keywords="opioid", keywords="unbiased", keywords="prescribing data", keywords="analysis tool", abstract="Background: We have previously demonstrated that opioid prescribing increased by 127\% between 1998 and 2016. New policies aimed at tackling this increasing trend have been recommended by public health bodies, and there is some evidence that progress is being made. Objective: We sought to extend our previous work and develop a data-driven approach to identify general practices and clinical commissioning groups (CCGs) whose prescribing data suggest that interventions to reduce the prescribing of opioids may have been successfully implemented. 
Methods: We analyzed 5 years of prescribing data (December 2014 to November 2019) for 3 opioid prescribing measures---total opioid prescribing as oral morphine equivalent per 1000 registered population, the number of high-dose opioids prescribed per 1000 registered population, and the number of high-dose opioids as a percentage of total opioids prescribed. Using a data-driven approach, we applied a modified version of our change detection Python library to identify reductions in these measures over time, which may be consistent with the successful implementation of an intervention to reduce opioid prescribing. This analysis was carried out for general practices and CCGs, and organizations were ranked according to the change in prescribing rate. Results: We identified a reduction in total opioid prescribing in 94 (49.2\%) out of 191 CCGs, with a median reduction of 15.1 (IQR 11.8-18.7; range 9.0-32.8) in total oral morphine equivalence per 1000 patients. We present data for the 3 CCGs and practices demonstrating the biggest reduction in opioid prescribing for each of the 3 opioid prescribing measures. We observed a 40\% proportional drop (8.9\% absolute reduction) in the regular prescribing of high-dose opioids (measured as a percentage of regular opioids) in the highest-ranked CCG (North Tyneside); a 99\% drop in this same measure was found in several practices (44\%-95\% absolute reduction). Decile plots demonstrate that CCGs exhibiting large reductions in opioid prescribing do so via slow and gradual reductions over a long period of time (typically over a period of 2 years); in contrast, practices exhibiting large reductions do so rapidly over a much shorter period of time. Conclusions: By applying 1 of our existing analysis tools to a national data set, we were able to identify rapid and maintained changes in opioid prescribing within practices and CCGs and rank organizations by the magnitude of reduction. 
Highly ranked organizations are candidates for further qualitative research into intervention design and implementation. ", doi="10.2196/51323", url="https://publichealth.jmir.org/2024/1/e51323", url="http://www.ncbi.nlm.nih.gov/pubmed/38838327" } @Article{info:doi/10.2196/50976, author="Xu, Yucan and Chan, Shaunlyn Christian and Chan, Evangeline and Chen, Junyou and Cheung, Florence and Xu, Zhongzhi and Liu, Joyce and Yip, Fai Paul Siu", title="Tracking and Profiling Repeated Users Over Time in Text-Based Counseling: Longitudinal Observational Study With Hierarchical Clustering", journal="J Med Internet Res", year="2024", month="May", day="30", volume="26", pages="e50976", keywords="web-based counseling", keywords="text-based counseling", keywords="repeated users", keywords="frequent users", keywords="hierarchical clustering", keywords="service effectiveness", keywords="risk profiling", keywords="psychological profiles", keywords="psycholinguistic analysis", abstract="Background: Due to their accessibility and anonymity, web-based counseling services are expanding at an unprecedented rate. One of the most prominent challenges such services face is repeated users, who represent a small fraction of total users but consume significant resources by continually returning to the system and reiterating the same narrative and issues. A deeper understanding of repeated users and tailoring interventions may help improve service efficiency and effectiveness. Previous studies on repeated users were mainly on telephone counseling, and the classification of repeated users tended to be arbitrary and failed to capture the heterogeneity in this group of users. Objective: In this study, we aimed to develop a systematic method to profile repeated users and to understand what drives their use of the service. 
By doing so, we aimed to provide insight and practical implications that can inform the provision of service catering to different types of users and improve service effectiveness. Methods: We extracted session data from 29,400 users from a free 24/7 web-based counseling service from 2018 to 2021. To systematically investigate the heterogeneity of repeated users, hierarchical clustering was used to classify the users based on 3 indicators of service use behaviors, including the duration of their user journey, use frequency, and intensity. We then compared the psychological profile of the identified subgroups including their suicide risks and primary concerns to gain insights into the factors driving their patterns of service use. Results: Three clusters of repeated users with clear psychological profiles were detected: episodic, intermittent, and persistent-intensive users. Generally, compared with one-time users, repeated users showed higher suicide risks and more complicated backgrounds, including more severe presenting issues such as suicide or self-harm, bullying, and addictive behaviors. Higher frequency and intensity of service use were also associated with elevated suicide risk levels and a higher proportion of users citing mental disorders as their primary concerns. Conclusions: This study presents a systematic method of identifying and classifying repeated users in web-based counseling services. The proposed bottom-up clustering method identified 3 subgroups of repeated users with distinct service behaviors and psychological profiles. The findings can facilitate frontline personnel in delivering more efficient interventions and the proposed method can also be meaningful to a wider range of services in improving service provision, resource allocation, and service effectiveness. 
", doi="10.2196/50976", url="https://www.jmir.org/2024/1/e50976", url="http://www.ncbi.nlm.nih.gov/pubmed/38815258" } @Article{info:doi/10.2196/52655, author="Invernici, Francesco and Bernasconi, Anna and Ceri, Stefano", title="Searching COVID-19 Clinical Research Using Graph Queries: Algorithm Development and Validation", journal="J Med Internet Res", year="2024", month="May", day="30", volume="26", pages="e52655", keywords="big data corpus", keywords="clinical research", keywords="co-occurrence network", keywords="COVID-19 Open Research Dataset", keywords="CORD-19", keywords="graph search", keywords="Named Entity Recognition", keywords="Neo4j", keywords="text mining", abstract="Background: Since the beginning of the COVID-19 pandemic, >1 million studies have been collected within the COVID-19 Open Research Dataset, a corpus of manuscripts created to accelerate research against the disease. Their related abstracts hold a wealth of information that remains largely unexplored and difficult to search due to its unstructured nature. Keyword-based search is the standard approach, which allows users to retrieve the documents of a corpus that contain (all or some of) the words in a target list. This type of search, however, does not provide visual support to the task and is not suited to expressing complex queries or compensating for missing specifications. Objective: This study aims to consider small graphs of concepts and exploit them for expressing graph searches over existing COVID-19--related literature, leveraging the increasing use of graphs to represent and query scientific knowledge and providing a user-friendly search and exploration experience. Methods: We considered the COVID-19 Open Research Dataset corpus and summarized its content by annotating the publications' abstracts using terms selected from the Unified Medical Language System and the Ontology of Coronavirus Infectious Disease. 
Then, we built a co-occurrence network that includes all relevant concepts mentioned in the corpus, establishing connections when their mutual information is relevant. A sophisticated graph query engine was built to allow the identification of the best matches of graph queries on the network. It also supports partial matches and suggests potential query completions using shortest paths. Results: We built a large co-occurrence network, consisting of 128,249 entities and 47,198,965 relationships; the GRAPH-SEARCH interface allows users to explore the network by formulating or adapting graph queries; it produces a bibliography of publications, which are globally ranked; and each publication is further associated with the specific parts of the query that it explains, thereby allowing the user to understand each aspect of the matching. Conclusions: Our approach supports the process of query formulation and evidence search upon a large text corpus; it can be reapplied to any scientific domain where documents corpora and curated ontologies are made available. 
", doi="10.2196/52655", url="https://www.jmir.org/2024/1/e52655", url="http://www.ncbi.nlm.nih.gov/pubmed/38814687" } @Article{info:doi/10.2196/46160, author="Wang, Guanyi and Chen, Chen and Jiang, Ziyu and Li, Gang and Wu, Can and Li, Sheng", title="Efficient Use of Biological Data in the Web 3.0 Era by Applying Nonfungible Token Technology", journal="J Med Internet Res", year="2024", month="May", day="28", volume="26", pages="e46160", keywords="NFTs", keywords="biobanks", keywords="blockchains", keywords="health care", keywords="medical big data", keywords="sustainability", keywords="blockchain platform", keywords="platform", keywords="tracing", keywords="virtual", keywords="biomedical data", keywords="transformation", keywords="development", keywords="promoted", doi="10.2196/46160", url="https://www.jmir.org/2024/1/e46160", url="http://www.ncbi.nlm.nih.gov/pubmed/38805706" } @Article{info:doi/10.2196/54332, author="Thomas, Mara and Mackes, Nuria and Preuss-Dodhy, Asad and Wieland, Thomas and Bundschus, Markus", title="Assessing Privacy Vulnerabilities in Genetic Data Sets: Scoping Review", journal="JMIR Bioinform Biotech", year="2024", month="May", day="27", volume="5", pages="e54332", keywords="genetic privacy", keywords="privacy", keywords="data anonymization", keywords="reidentification", abstract="Background: Genetic data are widely considered inherently identifiable. However, genetic data sets come in many shapes and sizes, and the feasibility of privacy attacks depends on their specific content. Assessing the reidentification risk of genetic data is complex, yet there is a lack of guidelines or recommendations that support data processors in performing such an evaluation. Objective: This study aims to gain a comprehensive understanding of the privacy vulnerabilities of genetic data and create a summary that can guide data processors in assessing the privacy risk of genetic data sets. 
Methods: We conducted a 2-step search, in which we first identified 21 reviews published between 2017 and 2023 on the topic of genomic privacy and then analyzed all references cited in the reviews (n=1645) to identify 42 unique original research studies that demonstrate a privacy attack on genetic data. We then evaluated the type and components of genetic data exploited for these attacks as well as the effort and resources needed for their implementation and their probability of success. Results: From our literature review, we derived 9 nonmutually exclusive features of genetic data that are both inherent to any genetic data set and informative about privacy risk: biological modality, experimental assay, data format or level of processing, germline versus somatic variation content, content of single nucleotide polymorphisms, short tandem repeats, aggregated sample measures, structural variants, and rare single nucleotide variants. Conclusions: On the basis of our literature review, the evaluation of these 9 features covers the great majority of privacy-critical aspects of genetic data and thus provides a foundation and guidance for assessing genetic data risk. ", doi="10.2196/54332", url="https://bioinform.jmir.org/2024/1/e54332", url="http://www.ncbi.nlm.nih.gov/pubmed/38935957" } @Article{info:doi/10.2196/53437, author="Cummins, R. Mollie and Shishupal, Sukrut and Wong, Bob and Wan, Neng and Han, Jiuying and Johnny, D. Jace and Mhatre-Owens, Amy and Gouripeddi, Ramkiran and Ivanova, Julia and Ong, Triton and Soni, Hiral and Barrera, Janelle and Wilczewski, Hattie and Welch, M. Brandon and Bunnell, E. 
Brian", title="Travel Distance Between Participants in US Telemedicine Sessions With Estimates of Emissions Savings: Observational Study", journal="J Med Internet Res", year="2024", month="May", day="15", volume="26", pages="e53437", keywords="air pollution", keywords="environmental health", keywords="telemedicine", keywords="greenhouse gases", keywords="clinical research informatics", keywords="informatics", keywords="data science", keywords="telehealth", keywords="eHealth", keywords="travel", keywords="air quality", keywords="pollutant", keywords="pollution", keywords="polluted", keywords="environment", keywords="environmental", keywords="greenhouse gas", keywords="emissions", keywords="retrospective", keywords="observational", keywords="United States", keywords="USA", keywords="North America", keywords="North American", keywords="cost", keywords="costs", keywords="economic", keywords="economics", keywords="saving", keywords="savings", keywords="finance", keywords="financial", keywords="finances", keywords="CO2", keywords="carbon dioxide", keywords="carbon footprint", abstract="Background: Digital health and telemedicine are potentially important strategies to decrease health care's environmental impact and contribution to climate change by reducing transportation-related air pollution and greenhouse gas emissions. However, we currently lack robust national estimates of emissions savings attributable to telemedicine. Objective: This study aimed to (1) determine the travel distance between participants in US telemedicine sessions and (2) estimate the net reduction in carbon dioxide (CO2) emissions attributable to telemedicine in the United States, based on national observational data describing the geographical characteristics of telemedicine session participants. Methods: We conducted a retrospective observational study of telemedicine sessions in the United States between January 1, 2022, and February 21, 2023, on the doxy.me platform. 
Using Google Distance Matrix, we determined the median travel distance between participating providers and patients for a proportional sample of sessions. Further, based on the best available public data, we estimated the total annual emissions costs and savings attributable to telemedicine in the United States. Results: The median round trip travel distance between patients and providers was 49 (IQR 21-145) miles. The median CO2 emissions savings per telemedicine session was 20 (IQR 8-59) kg CO2. Accounting for the energy costs of telemedicine and US transportation patterns, among other factors, we estimate that the use of telemedicine in the United States during the years 2021-2022 resulted in approximate annual CO2 emissions savings of 1,443,800 metric tons. Conclusions: These estimates of travel distance and telemedicine-associated CO2 emissions costs and savings, based on national data, indicate that telemedicine may be an important strategy in reducing the health care sector's carbon footprint. ", doi="10.2196/53437", url="https://www.jmir.org/2024/1/e53437", url="http://www.ncbi.nlm.nih.gov/pubmed/38536065" } @Article{info:doi/10.2196/49445, author="Pilgram, Lisa and Meurers, Thierry and Malin, Bradley and Schaeffner, Elke and Eckardt, Kai-Uwe and Prasser, Fabian and ", title="The Costs of Anonymization: Case Study Using Clinical Data", journal="J Med Internet Res", year="2024", month="Apr", day="24", volume="26", pages="e49445", keywords="data sharing", keywords="anonymization", keywords="deidentification", keywords="privacy-utility trade-off", keywords="privacy-enhancing technologies", keywords="medical informatics", keywords="privacy", keywords="anonymized", keywords="security", keywords="identification", keywords="confidentiality", keywords="data science", abstract="Background: Sharing data from clinical studies can accelerate scientific progress, improve transparency, and increase the potential for innovation and collaboration. 
However, privacy concerns remain a barrier to data sharing. Certain concerns, such as reidentification risk, can be addressed through the application of anonymization algorithms, whereby data are altered so that they are no longer reasonably related to a person. Yet, such alterations have the potential to influence the data set's statistical properties, such that the privacy-utility trade-off must be considered. This has been studied in theory, but evidence based on real-world individual-level clinical data is rare, and anonymization has not been broadly adopted in clinical practice. Objective: The goal of this study is to contribute to a better understanding of anonymization in the real world by comprehensively evaluating the privacy-utility trade-off of differently anonymized data using data and scientific results from the German Chronic Kidney Disease (GCKD) study. Methods: The GCKD data set extracted for this study consists of 5217 records and 70 variables. A 2-step procedure was followed to determine which variables constituted reidentification risks. To capture a large portion of the risk-utility space, we decided on risk thresholds ranging from 0.02 to 1. The data were then transformed via generalization and suppression, and the anonymization process was varied using a generic and a use case--specific configuration. To assess the utility of the anonymized GCKD data, general-purpose metrics (ie, data granularity and entropy), as well as use case--specific metrics (ie, reproducibility), were applied. Reproducibility was assessed by measuring the overlap of the 95\% CI lengths between anonymized and original results. Results: Reproducibility measured by 95\% CI overlap was higher than utility obtained from general-purpose metrics. For example, granularity varied between 68.2\% and 87.6\%, and entropy varied between 25.5\% and 46.2\%, whereas the average 95\% CI overlap was above 90\% for all risk thresholds applied. 
A nonoverlapping 95\% CI was detected in 6 estimates across all analyses, but the overwhelming majority of estimates exhibited an overlap over 50\%. The use case--specific configuration outperformed the generic one in terms of actual utility (ie, reproducibility) at the same level of privacy. Conclusions: Our results illustrate the challenges that anonymization faces when aiming to support multiple likely and possibly competing uses, while use case--specific anonymization can provide greater utility. This aspect should be taken into account when evaluating the associated costs of anonymized data and attempting to maintain sufficiently high levels of privacy for anonymized data. Trial Registration: German Clinical Trials Register DRKS00003971; https://drks.de/search/en/trial/DRKS00003971 International Registered Report Identifier (IRRID): RR2-10.1093/ndt/gfr456 ", doi="10.2196/49445", url="https://www.jmir.org/2024/1/e49445", url="http://www.ncbi.nlm.nih.gov/pubmed/38657232" } @Article{info:doi/10.2196/51880, author="Lee, Jian-Sin and Tyler, B. Allison R. 
and Veinot, Christine Tiffany and Yakel, Elizabeth", title="Now Is the Time to Strengthen Government-Academic Data Infrastructures to Jump-Start Future Public Health Crisis Response", journal="JMIR Public Health Surveill", year="2024", month="Apr", day="24", volume="10", pages="e51880", keywords="COVID-19", keywords="crisis response", keywords="cross-sector collaboration", keywords="data infrastructures", keywords="data science", keywords="data sharing", keywords="pandemic", keywords="public health", doi="10.2196/51880", url="https://publichealth.jmir.org/2024/1/e51880", url="http://www.ncbi.nlm.nih.gov/pubmed/38656780" } @Article{info:doi/10.2196/50958, author="Liao, Qiuyan and Yuan, Jiehu and Wong, Ling Irene Oi and Ni, Yuxuan Michael and Cowling, John Benjamin and Lam, Tak Wendy Wing", title="Motivators and Demotivators for COVID-19 Vaccination Based on Co-Occurrence Networks of Verbal Reasons for Vaccination Acceptance and Resistance: Repetitive Cross-Sectional Surveys and Network Analysis", journal="JMIR Public Health Surveill", year="2024", month="Apr", day="22", volume="10", pages="e50958", keywords="COVID-19", keywords="vaccination acceptance", keywords="vaccine hesitancy", keywords="motivators", keywords="co-occurrence network analysis", abstract="Background: Vaccine hesitancy is complex and multifaced. People may accept or reject a vaccine due to multiple and interconnected reasons, with some reasons being more salient in influencing vaccine acceptance or resistance and hence the most important intervention targets for addressing vaccine hesitancy. Objective: This study was aimed at assessing the connections and relative importance of motivators and demotivators for COVID-19 vaccination in Hong Kong based on co-occurrence networks of verbal reasons for vaccination acceptance and resistance from repetitive cross-sectional surveys. 
Methods: We conducted a series of random digit dialing telephone surveys to examine COVID-19 vaccine hesitancy among general Hong Kong adults between March 2021 and July 2022. A total of 5559 and 982 participants provided verbal reasons for accepting and resisting (rejecting or hesitating) a COVID-19 vaccine, respectively. The verbal reasons were initially coded to generate categories of motivators and demotivators for COVID-19 vaccination using a bottom-up approach. Then, all the generated codes were mapped onto the 5C model of vaccine hesitancy. On the basis of the identified reasons, we conducted a co-occurrence network analysis to understand how motivating or demotivating reasons were comentioned to shape people's vaccination decisions. Each reason's eigenvector centrality was calculated to quantify its relative importance in the network. Analyses were also stratified by age group. Results: The co-occurrence network analysis found that the perception of personal risk to the disease (eigencentrality=0.80) and the social responsibility to protect others (eigencentrality=0.58) were the most important comentioned reasons that motivate COVID-19 vaccination, while lack of vaccine confidence (eigencentrality=0.89) and complacency (perceived low disease risk and low importance of vaccination; eigencentrality=0.45) were the most important comentioned reasons that demotivate COVID-19 vaccination. For older people aged ≥65 years, protecting others was a more important motivator (eigencentrality=0.57), while the concern about poor health status was a more important demotivator (eigencentrality=0.42); for young people aged 18 to 24 years, recovering life normalcy (eigencentrality=0.20) and vaccine mandates (eigencentrality=0.26) were the more important motivators, while complacency (eigencentrality=0.77) was a more important demotivator for COVID-19 vaccination uptake. 
Conclusions: When disease risk is perceived to be high, promoting social responsibility to protect others is more important for boosting vaccination acceptance. However, when disease risk is perceived to be low and complacency exists, fostering confidence in vaccines to address vaccine hesitancy becomes more important. Interventions for promoting vaccination acceptance and reducing vaccine hesitancy should be tailored by age. ", doi="10.2196/50958", url="https://publichealth.jmir.org/2024/1/e50958", url="http://www.ncbi.nlm.nih.gov/pubmed/38648099" } @Article{info:doi/10.2196/46777, author="Romano, D. Joseph and Truong, Van and Kumar, Rachit and Venkatesan, Mythreye and Graham, E. Britney and Hao, Yun and Matsumoto, Nick and Li, Xi and Wang, Zhiping and Ritchie, D. Marylyn and Shen, Li and Moore, H. Jason", title="The Alzheimer's Knowledge Base: A Knowledge Graph for Alzheimer Disease Research", journal="J Med Internet Res", year="2024", month="Apr", day="18", volume="26", pages="e46777", keywords="Alzheimer disease", keywords="knowledge graph", keywords="knowledge base", keywords="artificial intelligence", keywords="drug repurposing", keywords="drug discovery", keywords="open source", keywords="Alzheimer", keywords="etiology", keywords="heterogeneous graph", keywords="therapeutic targets", keywords="machine learning", keywords="therapeutic discovery", abstract="Background: As global populations age and become susceptible to neurodegenerative illnesses, new therapies for Alzheimer disease (AD) are urgently needed. Existing data resources for drug discovery and repurposing fail to capture relationships central to the disease's etiology and response to drugs. Objective: We designed the Alzheimer's Knowledge Base (AlzKB) to alleviate this need by providing a comprehensive knowledge representation of AD etiology and candidate therapeutics. 
Methods: We designed the AlzKB as a large, heterogeneous graph knowledge base assembled using 22 diverse external data sources describing biological and pharmaceutical entities at different levels of organization (eg, chemicals, genes, anatomy, and diseases). AlzKB uses a Web Ontology Language 2 ontology to enforce semantic consistency and allow for ontological inference. We provide a public version of AlzKB and allow users to run and modify local versions of the knowledge base. Results: AlzKB is freely available on the web and currently contains 118,902 entities with 1,309,527 relationships between those entities. To demonstrate its value, we used graph data science and machine learning to (1) propose new therapeutic targets based on similarities of AD to Parkinson disease and (2) repurpose existing drugs that may treat AD. For each use case, AlzKB recovers known therapeutic associations while proposing biologically plausible new ones. Conclusions: AlzKB is a new, publicly available knowledge resource that enables researchers to discover complex translational associations for AD drug discovery. Through 2 use cases, we show that it is a valuable tool for proposing novel therapeutic hypotheses based on public biomedical knowledge. ", doi="10.2196/46777", url="https://www.jmir.org/2024/1/e46777", url="http://www.ncbi.nlm.nih.gov/pubmed/38635981" } @Article{info:doi/10.2196/47125, author="Wang, Echo H. and Weiner, P. 
Jonathan and Saria, Suchi and Kharrazi, Hadi", title="Evaluating Algorithmic Bias in 30-Day Hospital Readmission Models: Retrospective Analysis", journal="J Med Internet Res", year="2024", month="Apr", day="18", volume="26", pages="e47125", keywords="algorithmic bias", keywords="model bias", keywords="predictive models", keywords="model fairness", keywords="health disparity", keywords="hospital readmission", keywords="retrospective analysis", abstract="Background: The adoption of predictive algorithms in health care comes with the potential for algorithmic bias, which could exacerbate existing disparities. Fairness metrics have been proposed to measure algorithmic bias, but their application to real-world tasks is limited. Objective: This study aims to evaluate the algorithmic bias associated with the application of common 30-day hospital readmission models and assess the usefulness and interpretability of selected fairness metrics. Methods: We used 10.6 million adult inpatient discharges from Maryland and Florida from 2016 to 2019 in this retrospective study. Models predicting 30-day hospital readmissions were evaluated: LACE Index, modified HOSPITAL score, and modified Centers for Medicare \& Medicaid Services (CMS) readmission measure, which were applied as-is (using existing coefficients) and retrained (recalibrated with 50\% of the data). Predictive performances and bias measures were evaluated for all, between Black and White populations, and between low- and other-income groups. Bias measures included the parity of false negative rate (FNR), false positive rate (FPR), 0-1 loss, and generalized entropy index. Racial bias represented by FNR and FPR differences was stratified to explore shifts in algorithmic bias in different populations. 
Results: The retrained CMS model demonstrated the best predictive performance (area under the curve: 0.74 in Maryland and 0.68-0.70 in Florida), and the modified HOSPITAL score demonstrated the best calibration (Brier score: 0.16-0.19 in Maryland and 0.19-0.21 in Florida). Calibration was better in White (compared to Black) populations and other-income (compared to low-income) groups, and the area under the curve was higher or similar in the Black (compared to White) populations. The retrained CMS and modified HOSPITAL score had the lowest racial and income bias in Maryland. In Florida, both of these models overall had the lowest income bias and the modified HOSPITAL score showed the lowest racial bias. In both states, the White and higher-income populations showed a higher FNR, while the Black and low-income populations resulted in a higher FPR and a higher 0-1 loss. When stratified by hospital and population composition, these models demonstrated heterogeneous algorithmic bias in different contexts and populations. Conclusions: Caution must be taken when interpreting fairness measures' face value. A higher FNR or FPR could potentially reflect missed opportunities or wasted resources, but these measures could also reflect health care use patterns and gaps in care. Simply relying on the statistical notions of bias could obscure or underplay the causes of health disparity. The imperfect health data, analytic frameworks, and the underlying health systems must be carefully considered. Fairness measures can serve as a useful routine assessment to detect disparate model performances but are insufficient to inform mechanisms or policy changes. However, such an assessment is an important first step toward data-driven improvement to address existing health disparities. 
", doi="10.2196/47125", url="https://www.jmir.org/2024/1/e47125", url="http://www.ncbi.nlm.nih.gov/pubmed/38422347" } @Article{info:doi/10.2196/53075, author="W{\"u}ndisch, Eric and Hufnagl, Peter and Brunecker, Peter and Meier zu Ummeln, Sophie and Tr{\"a}ger, Sarah and Kopp, Marcus and Prasser, Fabian and Weber, Joachim", title="Development of a Trusted Third Party at a Large University Hospital: Design and Implementation Study", journal="JMIR Med Inform", year="2024", month="Apr", day="17", volume="12", pages="e53075", keywords="pseudonymisation", keywords="architecture", keywords="scalability", keywords="trusted third party", keywords="application", keywords="security", keywords="consent", keywords="identifying data", keywords="infrastructure", keywords="modular", keywords="software", keywords="implementation", keywords="user interface", keywords="health platform", keywords="data management", keywords="data privacy", keywords="health record", keywords="electronic health record", keywords="EHR", keywords="pseudonymization", abstract="Background: Pseudonymization has become a best practice to securely manage the identities of patients and study participants in medical research projects and data sharing initiatives. This method offers the advantage of not requiring the direct identification of data to support various research processes while still allowing for advanced processing activities, such as data linkage. Often, pseudonymization and related functionalities are bundled in specific technical and organization units known as trusted third parties (TTPs). However, pseudonymization can significantly increase the complexity of data management and research workflows, necessitating adequate tool support. Common tasks of TTPs include supporting the secure registration and pseudonymization of patient and sample identities as well as managing consent. 
Objective: Despite the challenges involved, little has been published about successful architectures and functional tools for implementing TTPs in large university hospitals. The aim of this paper is to fill this research gap by describing the software architecture and tool set developed and deployed as part of a TTP established at Charit{\'e} -- Universit{\"a}tsmedizin Berlin. Methods: The infrastructure for the TTP was designed to provide a modular structure while keeping maintenance requirements low. Basic functionalities were realized with the free MOSAIC tools. However, supporting common study processes requires implementing workflows that span different basic services, such as patient registration, followed by pseudonym generation and concluded by consent collection. To achieve this, an integration layer was developed to provide a unified Representational state transfer (REST) application programming interface (API) as a basis for more complex workflows. Based on this API, a unified graphical user interface was also implemented, providing an integrated view of information objects and workflows supported by the TTP. The API was implemented using Java and Spring Boot, while the graphical user interface was implemented in PHP and Laravel. Both services use a shared Keycloak instance as a unified management system for roles and rights. Results: By the end of 2022, the TTP has already supported more than 10 research projects since its launch in December 2019. Within these projects, more than 3000 identities were stored, more than 30,000 pseudonyms were generated, and more than 1500 consent forms were submitted. In total, more than 150 people regularly work with the software platform. By implementing the integration layer and the unified user interface, together with comprehensive roles and rights management, the effort for operating the TTP could be significantly reduced, as personnel of the supported research projects can use many functionalities independently. 
Conclusions: With the architecture and components described, we created a user-friendly and compliant environment for supporting research projects. We believe that the insights into the design and implementation of our TTP can help other institutions to efficiently and effectively set up corresponding structures. ", doi="10.2196/53075", url="https://medinform.jmir.org/2024/1/e53075" } @Article{info:doi/10.2196/48330, author="Ke, Yuhe and Yang, Rui and Liu, Nan", title="Comparing Open-Access Database and Traditional Intensive Care Studies Using Machine Learning: Bibliometric Analysis Study", journal="J Med Internet Res", year="2024", month="Apr", day="17", volume="26", pages="e48330", keywords="BERTopic", keywords="critical care", keywords="eICU", keywords="machine learning", keywords="MIMIC", keywords="Medical Information Mart for Intensive Care", keywords="natural language processing", abstract="Background: Intensive care research has predominantly relied on conventional methods like randomized controlled trials. However, the increasing popularity of open-access, free databases in the past decade has opened new avenues for research, offering fresh insights. Leveraging machine learning (ML) techniques enables the analysis of trends in a vast number of studies. Objective: This study aims to conduct a comprehensive bibliometric analysis using ML to compare trends and research topics in traditional intensive care unit (ICU) studies and those done with open-access databases (OADs). Methods: We used ML for the analysis of publications in the Web of Science database in this study. Articles were categorized into ``OAD'' and ``traditional intensive care'' (TIC) studies. OAD studies comprised those using the Medical Information Mart for Intensive Care (MIMIC), eICU Collaborative Research Database (eICU-CRD), Amsterdam University Medical Centers Database (AmsterdamUMCdb), High Time Resolution ICU Dataset (HiRID), and Pediatric Intensive Care database. 
TIC studies included all other intensive care studies. Uniform manifold approximation and projection was used to visualize the corpus distribution. The BERTopic technique was used to generate 30 topic-unique identification numbers and to categorize topics into 22 topic families. Results: A total of 227,893 records were extracted. After exclusions, 145,426 articles were identified as TIC and 1301 articles as OAD studies. TIC studies experienced exponential growth over the last 2 decades, culminating in a peak of 16,378 articles in 2021, while OAD studies demonstrated a consistent upsurge since 2018. Sepsis, ventilation-related research, and pediatric intensive care were the most frequently discussed topics. TIC studies exhibited broader coverage than OAD studies, suggesting a more extensive research scope. Conclusions: This study analyzed ICU research, providing valuable insights from a large number of publications. OAD studies complement TIC studies, focusing on predictive modeling, while TIC studies capture essential qualitative information. Integrating both approaches in a complementary manner is the future direction for ICU research. Additionally, natural language processing techniques offer a transformative alternative for literature review and bibliometric analysis. 
", doi="10.2196/48330", url="https://www.jmir.org/2024/1/e48330", url="http://www.ncbi.nlm.nih.gov/pubmed/38630522" } @Article{info:doi/10.2196/50897, author="Ndlovu, Kagiso and Mauco, Leonard Kabelo and Makhura, Onalenna and Hu, Robin and Motlogelwa, Peace Nkwebi and Masizana, Audrey and Lo, Emily and Mphoyakgosi, Thongbotho and Moyo, Sikhulile", title="Experiences, Lessons, and Challenges With Adapting REDCap for COVID-19 Laboratory Data Management in a Resource-Limited Country: Descriptive Study", journal="JMIR Form Res", year="2024", month="Apr", day="16", volume="8", pages="e50897", keywords="REDCap", keywords="DHIS2", keywords="COVID-19", keywords="National Health Laboratory", keywords="eHealth", keywords="interoperability", keywords="data management", keywords="Botswana", abstract="Background: The COVID-19 pandemic brought challenges requiring timely health data sharing to inform accurate decision-making at national levels. In Botswana, we adapted and integrated the Research Electronic Data Capture (REDCap) and the District Health Information System version 2 (DHIS2) platforms to support timely collection and reporting of COVID-19 cases. We focused on establishing an effective COVID-19 data flow at the national public health laboratory, being guided by the needs of health care professionals at the National Health Laboratory (NHL). This integration contributed to automated centralized reporting of COVID-19 results at the Ministry of Health (MOH). Objective: This paper reports the experiences, challenges, and lessons learned while designing, adapting, and implementing the REDCap and DHIS2 platforms to support COVID-19 data management at the NHL in Botswana. Methods: A participatory design approach was adopted to guide the design, customization, and implementation of the REDCap platform in support of COVID-19 data management at the NHL. Study participants included 29 NHL and 4 MOH personnel, and the study was conducted from March 2, 2020, to June 30, 2020. 
Participants' requirements for an ideal COVID-19 data management system were established. NVivo 11 software supported thematic analysis of the challenges and resolutions identified during this study. These were categorized according to the 4 themes of infrastructure, capacity development, platform constraints, and interoperability. Results: Overall, REDCap supported the majority of perceived technical and nontechnical requirements for an ideal COVID-19 data management system at the NHL. Although some implementation challenges were identified, each had mitigation strategies such as procurement of mobile Internet routers, engagement of senior management to resolve conflicting policies, continuous REDCap training, and the development of a third-party web application to enhance REDCap's capabilities. Lessons learned informed next steps and further refinement of the REDCap platform. Conclusions: Implementation of REDCap at the NHL to streamline COVID-19 data collection and integration with the DHIS2 platform was feasible despite the urgency of implementation during the pandemic. By implementing the REDCap platform at the NHL, we demonstrated the possibility of achieving a centralized reporting system of COVID-19 cases, hence enabling timely and informed decision-making at a national level. Challenges faced presented lessons learned to inform sustainable implementation of digital health innovations in Botswana and similar resource-limited countries. 
", doi="10.2196/50897", url="https://formative.jmir.org/2024/1/e50897", url="http://www.ncbi.nlm.nih.gov/pubmed/38625736" } @Article{info:doi/10.2196/54490, author="Wu, MeiJung and Islam, Mohaimenul Md and Poly, Nasrin Tahmina and Lin, Ming-Chin", title="Application of AI in Sepsis: Citation Network Analysis and Evidence Synthesis", journal="Interact J Med Res", year="2024", month="Apr", day="15", volume="13", pages="e54490", keywords="AI", keywords="artificial intelligence", keywords="bibliometric analysis", keywords="bibliometric", keywords="citation", keywords="deep learning", keywords="machine learning", keywords="network analysis", keywords="publication", keywords="sepsis", keywords="trend", keywords="visualization", keywords="VOSviewer", keywords="Web of Science", keywords="WoS", abstract="Background: Artificial intelligence (AI) has garnered considerable attention in the context of sepsis research, particularly in personalized diagnosis and treatment. Conducting a bibliometric analysis of existing publications can offer a broad overview of the field and identify current research trends and future research directions. Objective: The objective of this study is to leverage bibliometric data to provide a comprehensive overview of the application of AI in sepsis. Methods: We conducted a search in the Web of Science Core Collection database to identify relevant articles published in English until August 31, 2023. A predefined search strategy was used, evaluating titles, abstracts, and full texts as needed. We used the Bibliometrix and VOSviewer tools to visualize networks showcasing the co-occurrence of authors, research institutions, countries, citations, and keywords. Results: A total of 259 relevant articles published between 2014 and 2023 (until August) were identified. Over the past decade, the annual publication count has consistently risen. 
Leading journals in this domain include Critical Care Medicine (17/259, 6.6\%), Frontiers in Medicine (17/259, 6.6\%), and Scientific Reports (11/259, 4.2\%). The United States (103/259, 39.8\%), China (83/259, 32\%), United Kingdom (14/259, 5.4\%), and Taiwan (12/259, 4.6\%) emerged as the most prolific countries in terms of publications. Notable institutions in this field include the University of California System, Emory University, and Harvard University. The key researchers working in this area include Ritankar Das, Chris Barton, and Rishikesan Kamaleswaran. Although the initial period witnessed a relatively low number of articles focused on AI applications for sepsis, there has been a significant surge in research within this area in recent years (2014-2023). Conclusions: This comprehensive analysis provides valuable insights into AI-related research conducted in the field of sepsis, aiding health care policy makers and researchers in understanding the potential of AI and formulating effective research plans. Such analysis serves as a valuable resource for determining the advantages, sustainability, scope, and potential impact of AI models in sepsis. 
", doi="10.2196/54490", url="https://www.i-jmr.org/2024/1/e54490", url="http://www.ncbi.nlm.nih.gov/pubmed/38621231" } @Article{info:doi/10.2196/55988, author="Hadar-Shoval, Dorit and Asraf, Kfir and Mizrachi, Yonathan and Haber, Yuval and Elyoseph, Zohar", title="Assessing the Alignment of Large Language Models With Human Values for Mental Health Integration: Cross-Sectional Study Using Schwartz's Theory of Basic Values", journal="JMIR Ment Health", year="2024", month="Apr", day="9", volume="11", pages="e55988", keywords="large language models", keywords="LLMs", keywords="large language model", keywords="LLM", keywords="machine learning", keywords="ML", keywords="natural language processing", keywords="NLP", keywords="deep learning", keywords="ChatGPT", keywords="Chat-GPT", keywords="chatbot", keywords="chatbots", keywords="chat-bot", keywords="chat-bots", keywords="Claude", keywords="values", keywords="Bard", keywords="artificial intelligence", keywords="AI", keywords="algorithm", keywords="algorithms", keywords="predictive model", keywords="predictive models", keywords="predictive analytics", keywords="predictive system", keywords="practical model", keywords="practical models", keywords="mental health", keywords="mental illness", keywords="mental illnesses", keywords="mental disease", keywords="mental diseases", keywords="mental disorder", keywords="mental disorders", keywords="mobile health", keywords="mHealth", keywords="eHealth", keywords="mood disorder", keywords="mood disorders", abstract="Background: Large language models (LLMs) hold potential for mental health applications. However, their opaque alignment processes may embed biases that shape problematic perspectives. Evaluating the values embedded within LLMs that guide their decision-making have ethical importance. 
Schwartz's theory of basic values (STBV) provides a framework for quantifying cultural value orientations and has shown utility for examining values in mental health contexts, including cultural, diagnostic, and therapist-client dynamics. Objective: This study aimed to (1) evaluate whether the STBV can measure value-like constructs within leading LLMs and (2) determine whether LLMs exhibit distinct value-like patterns from humans and each other. Methods: In total, 4 LLMs (Bard, Claude 2, Generative Pretrained Transformer [GPT]-3.5, GPT-4) were anthropomorphized and instructed to complete the Portrait Values Questionnaire---Revised (PVQ-RR) to assess value-like constructs. Their responses over 10 trials were analyzed for reliability and validity. To benchmark the LLMs' value profiles, their results were compared to published data from a diverse sample of 53,472 individuals across 49 nations who had completed the PVQ-RR. This allowed us to assess whether the LLMs diverged from established human value patterns across cultural groups. Value profiles were also compared between models via statistical tests. Results: The PVQ-RR showed good reliability and validity for quantifying value-like infrastructure within the LLMs. However, substantial divergence emerged between the LLMs' value profiles and population data. The models lacked consensus and exhibited distinct motivational biases, reflecting opaque alignment processes. For example, all models prioritized universalism and self-direction, while de-emphasizing achievement, power, and security relative to humans. Successful discriminant analysis differentiated the 4 LLMs' distinct value profiles. Further examination found the biased value profiles strongly predicted the LLMs' responses when presented with mental health dilemmas requiring choosing between opposing values. This provided further validation for the models embedding distinct motivational value-like constructs that shape their decision-making. 
Conclusions: This study leveraged the STBV to map the motivational value-like infrastructure underpinning leading LLMs. Although the study demonstrated the STBV can effectively characterize value-like infrastructure within LLMs, substantial divergence from human values raises ethical concerns about aligning these models with mental health applications. The biases toward certain cultural value sets pose risks if integrated without proper safeguards. For example, prioritizing universalism could promote unconditional acceptance even when clinically unwise. Furthermore, the differences between the LLMs underscore the need to standardize alignment processes to capture true cultural diversity. Thus, any responsible integration of LLMs into mental health care must account for their embedded biases and motivation mismatches to ensure equitable delivery across diverse populations. Achieving this will require transparency and refinement of alignment techniques to instill comprehensive human values. ", doi="10.2196/55988", url="https://mental.jmir.org/2024/1/e55988", url="http://www.ncbi.nlm.nih.gov/pubmed/38593424" } @Article{info:doi/10.2196/55779, author="Tsafnat, Guy and Dunscombe, Rachel and Gabriel, Davera and Grieve, Grahame and Reich, Christian", title="Converge or Collide? 
Making Sense of a Plethora of Open Data Standards in Health Care", journal="J Med Internet Res", year="2024", month="Apr", day="9", volume="26", pages="e55779", keywords="interoperability", keywords="clinical data", keywords="open data standards", keywords="health care", keywords="digital health", keywords="health care data", doi="10.2196/55779", url="https://www.jmir.org/2024/1/e55779", url="http://www.ncbi.nlm.nih.gov/pubmed/38593431" } @Article{info:doi/10.2196/48963, author="Loeb, Talia and Willis, Kalai and Velishavo, Frans and Lee, Daniel and Rao, Amrita and Baral, Stefan and Rucinski, Katherine", title="Leveraging Routinely Collected Program Data to Inform Extrapolated Size Estimates for Key Populations in Namibia: Small Area Estimation Study", journal="JMIR Public Health Surveill", year="2024", month="Apr", day="4", volume="10", pages="e48963", keywords="female sex workers", keywords="HIV", keywords="key populations", keywords="men who have sex with men", keywords="Namibia", keywords="population size estimation", keywords="small area estimation", abstract="Background: Estimating the size of key populations, including female sex workers (FSW) and men who have sex with men (MSM), can inform planning and resource allocation for HIV programs at local and national levels. In geographic areas where direct population size estimates (PSEs) for key populations have not been collected, small area estimation (SAE) can help fill in gaps using supplemental data sources known as auxiliary data. However, routinely collected program data have not historically been used as auxiliary data to generate subnational estimates for key populations, including in Namibia. Objective: To systematically generate regional size estimates for FSW and MSM in Namibia, we used a consensus-informed estimation approach with local stakeholders that included the integration of routinely collected HIV program data provided by key populations' HIV service providers. 
Methods: We used quarterly program data reported by key population implementing partners, including counts of the number of individuals accessing HIV services over time, to weight existing PSEs collected through bio-behavioral surveys using a Bayesian triangulation approach. SAEs were generated through simple imputation, stratified imputation, and multivariable Poisson regression models. We selected final estimates using an iterative qualitative ranking process with local key population implementing partners. Results: Extrapolated national estimates for FSW ranged from 4777 to 13,148 across Namibia, comprising 1.5\% to 3.6\% of female individuals aged between 15 and 49 years. For MSM, estimates ranged from 4611 to 10,171, comprising 0.7\% to 1.5\% of male individuals aged between 15 and 49 years. After the inclusion of program data as priors, the estimated proportion of FSW derived from simple imputation increased from 1.9\% to 2.8\%, and the proportion of MSM decreased from 1.5\% to 0.75\%. When stratified imputation was implemented using HIV prevalence to inform strata, the inclusion of program data increased the proportion of FSW from 2.6\% to 4.0\% in regions with high prevalence and decreased the proportion from 1.4\% to 1.2\% in regions with low prevalence. When population density was used to inform strata, the inclusion of program data also increased the proportion of FSW in high-density regions (from 1.1\% to 3.4\%) and decreased the proportion of MSM in all regions. Conclusions: Using SAE approaches, we combined epidemiologic and program data to generate subnational size estimates for key populations in Namibia. Overall, estimates were highly sensitive to the inclusion of program data. Program data represent a supplemental source of information that can be used to align PSEs with real-world HIV programs, particularly in regions where population-based data collection methods are challenging to implement. 
Future work is needed to determine how best to include and validate program data in target settings and in key population size estimation studies, ultimately bridging research with practice to support a more comprehensive HIV response. ", doi="10.2196/48963", url="https://publichealth.jmir.org/2024/1/e48963", url="http://www.ncbi.nlm.nih.gov/pubmed/38573760" } @Article{info:doi/10.2196/54580, author="Wang, Lei and Ma, Yinyao and Bi, Wenshuai and Lv, Hanlin and Li, Yuxiang", title="An Entity Extraction Pipeline for Medical Text Records Using Large Language Models: Analytical Study", journal="J Med Internet Res", year="2024", month="Mar", day="29", volume="26", pages="e54580", keywords="clinical data extraction", keywords="large language models", keywords="feature hallucination", keywords="modular approach", keywords="unstructured data processing", abstract="Background: The study of disease progression relies on clinical data, including text data, and extracting valuable features from text data has been a research hot spot. With the rise of large language models (LLMs), semantic-based extraction pipelines are gaining acceptance in clinical research. However, the security and feature hallucination issues of LLMs require further attention. Objective: This study aimed to introduce a novel modular LLM pipeline, which could semantically extract features from textual patient admission records. Methods: The pipeline was designed to process a systematic succession of concept extraction, aggregation, question generation, corpus extraction, and question-and-answer scale extraction, which was tested via 2 low-parameter LLMs: Qwen-14B-Chat (QWEN) and Baichuan2-13B-Chat (BAICHUAN). A data set of 25,709 pregnancy cases from the People's Hospital of Guangxi Zhuang Autonomous Region, China, was used for evaluation with the help of a local expert's annotation. The pipeline was evaluated with the metrics of accuracy and precision, null ratio, and time consumption. 
Additionally, we evaluated its performance via a quantified version of Qwen-14B-Chat on a consumer-grade GPU. Results: The pipeline demonstrates a high level of precision in feature extraction, as evidenced by the accuracy and precision results of Qwen-14B-Chat (95.52\% and 92.93\%, respectively) and Baichuan2-13B-Chat (95.86\% and 90.08\%, respectively). Furthermore, the pipeline exhibited low null ratios and variable time consumption. The INT4-quantified version of QWEN delivered an enhanced performance with 97.28\% accuracy and a 0\% null ratio. Conclusions: The pipeline exhibited consistent performance across different LLMs and efficiently extracted clinical features from textual data. It also showed reliable performance on consumer-grade hardware. This approach offers a viable and effective solution for mining clinical research data from textual records. ", doi="10.2196/54580", url="https://www.jmir.org/2024/1/e54580", url="http://www.ncbi.nlm.nih.gov/pubmed/38551633" } @Article{info:doi/10.2196/49822, author="Baek, Jinyoung and Lawson, Jonathan and Rahimzadeh, Vasiliki", title="Investigating the Roles and Responsibilities of Institutional Signing Officials After Data Sharing Policy Reform for Federally Funded Research in the United States: National Survey", journal="JMIR Form Res", year="2024", month="Mar", day="20", volume="8", pages="e49822", keywords="biomedical research", keywords="survey", keywords="surveys", keywords="data sharing", keywords="data management", keywords="secondary use", keywords="National Institutes of Health", keywords="signing official", keywords="information sharing", keywords="exchange", keywords="access", keywords="data science", keywords="accessibility", keywords="policy", keywords="policies", abstract="Background: New federal policies along with rapid growth in data generation, storage, and analysis tools are together driving scientific data sharing in the United States. 
At the same time, triangulating human research data from diverse sources can also create situations where data are used for future research in ways that individuals and communities may consider objectionable. Institutional gatekeepers, namely, signing officials (SOs), are therefore at the helm of compliant management and sharing of human data for research. Of those with data governance responsibilities, SOs most often serve as signatories for investigators who deposit, access, and share research data between institutions. Although SOs play important leadership roles in compliant data sharing, we know surprisingly little about their scope of work, roles, and oversight responsibilities. Objective: The purpose of this study was to describe existing institutional policies and practices of US SOs who manage human genomic data access, as well as how these may change in the wake of new Data Management and Sharing requirements for National Institutes of Health--funded research in the United States. Methods: We administered an anonymous survey to institutional SOs recruited from biomedical research institutions across the United States. Survey items probed where data generated from extramurally funded research are deposited, how researchers outside the institution access these data, and what happens to these data after extramural funding ends. Results: In total, 56 institutional SOs participated in the survey. We found that SOs frequently approve duplicate data deposits and impose stricter access controls when data use limitations are unclear or unspecified. In addition, 21\% (n=12) of SOs knew where data from federally funded projects are deposited after project funding sunsets. As a consequence, most investigators deposit their scientific data into ``a National Institutes of Health--funded repository'' to meet the Data Management and Sharing requirements but also within the ``institution's own repository'' or a third-party repository. 
Conclusions: Our findings inform 5 policy recommendations and best practices for US SOs to improve coordination and develop comprehensive and consistent data governance policies that balance the need for scientific progress with effective human data protections. ", doi="10.2196/49822", url="https://formative.jmir.org/2024/1/e49822", url="http://www.ncbi.nlm.nih.gov/pubmed/38506894" } @Article{info:doi/10.2196/50518, author="Park, Daemin and Kim, Dasom and Park, Ah-hyun", title="Agendas on Nursing in South Korea Media: Natural Language Processing and Network Analysis of News From 2005 to 2022", journal="J Med Internet Res", year="2024", month="Mar", day="19", volume="26", pages="e50518", keywords="nurses", keywords="news", keywords="South Korea", keywords="natural language processing", keywords="NLP", keywords="network analysis", keywords="politicization", abstract="Background: In recent years, Korean society has increasingly recognized the importance of nurses in the context of population aging and infectious disease control. However, nurses still face difficulties with regard to policy activities that are aimed at improving the nursing workforce structure and working environment. Media coverage plays an important role in public awareness of a particular issue and can be an important strategy in policy activities. Objective: This study analyzed data from 18 years of news coverage on nursing-related issues. The focus of this study was to examine the drivers of the social, local, economic, and political agendas that were emphasized in the media by the analysis of main sources and their quotes. This analysis revealed which nursing media agendas were emphasized (eg, social aspects), neglected (eg, policy aspects), and negotiated. Methods: Descriptive analysis, natural language processing, and semantic network analysis were applied to analyze data collected from 2005 to 2022. 
BigKinds was used for the collection of data, automatic multi-categorization of news, named entity recognition of news sources, and extraction and topic modeling of quotes. The main news sources were identified by conducting a 1-mode network analysis with SNAnalyzer. The main agendas of nursing-related news coverage were examined through the qualitative analysis of major sources' quotes by section. The common and individual interests of the top-ranked sources were analyzed through a 2-mode network analysis using UCINET. Results: In total, 128,339 articles from 54 media outlets on nursing-related issues were analyzed. Descriptive analysis showed that nursing-related news was mainly covered in social (99,868/128,339, 77.82\%) and local (48,056/128,339, 48.56\%) sections, whereas it was rarely covered in economic (9439/128,339, 7.35\%) and political (7301/128,339, 5.69\%) sections. Furthermore, 445 sources that had made the top 20 list at least once by year and section were analyzed. Other than ``nurse,'' the main sources for each section were ``labor union,'' ``local resident,'' ``government,'' and ``Moon Jae-in.'' ``Nursing Bill'' emerged as a common interest among nurses and doctors, although the topic did not garner considerable attention from the Ministry of Health and Welfare. Analyzing quotes showed that nurses were portrayed as heroes, laborers, survivors of abuse, and perpetrators. The economic section focused on employment of youth and women in nursing. In the political section, conflicts between nurses and doctors, which may have caused policy confusion, were highlighted. Policy formulation processes were not adequately reported. Media coverage of the enactment of nursing laws tended to relate to confrontations between political parties. Conclusions: The media plays a crucial role in highlighting various aspects of nursing practice. However, policy formulation processes to solve nursing issues were not adequately reported in South Korea. 
This study suggests that nurses should secure policy compliance by persuading the public to understand their professional perspectives. ", doi="10.2196/50518", url="https://www.jmir.org/2024/1/e50518", url="http://www.ncbi.nlm.nih.gov/pubmed/38393293" } @Article{info:doi/10.2196/42904, author="Reiter, Vittoria Alisa Maria and Pantel, Tori Jean and Danyel, Magdalena and Horn, Denise and Ott, Claus-Eric and Mensah, Atta Martin", title="Validation of 3 Computer-Aided Facial Phenotyping Tools (DeepGestalt, GestaltMatcher, and D-Score): Comparative Diagnostic Accuracy Study", journal="J Med Internet Res", year="2024", month="Mar", day="13", volume="26", pages="e42904", keywords="facial phenotyping", keywords="DeepGestalt", keywords="facial recognition", keywords="Face2Gene", keywords="medical genetics", keywords="diagnostic accuracy", keywords="genetic syndrome", keywords="machine learning", keywords="GestaltMatcher", keywords="D-Score", keywords="genetics", abstract="Background: While characteristic facial features provide important clues for finding the correct diagnosis in genetic syndromes, valid assessment can be challenging. The next-generation phenotyping algorithm DeepGestalt analyzes patient images and provides syndrome suggestions. GestaltMatcher matches patient images with similar facial features. The new D-Score provides a score for the degree of facial dysmorphism. Objective: We aimed to test state-of-the-art facial phenotyping tools by benchmarking GestaltMatcher and D-Score and comparing them to DeepGestalt. Methods: Using a retrospective sample of 4796 images of patients with 486 different genetic syndromes (London Medical Database, GestaltMatcher Database, and literature images) and 323 inconspicuous control images, we determined the clinical use of D-Score, GestaltMatcher, and DeepGestalt, evaluating sensitivity; specificity; accuracy; the number of supported diagnoses; and potential biases such as age, sex, and ethnicity. 
Results: DeepGestalt suggested 340 distinct syndromes and GestaltMatcher suggested 1128 syndromes. The top-30 sensitivity was higher for DeepGestalt (88\%, SD 18\%) than for GestaltMatcher (76\%, SD 26\%). DeepGestalt generally assigned lower scores but provided higher scores for patient images than for inconspicuous control images, thus allowing the 2 cohorts to be separated with an area under the receiver operating characteristic curve (AUROC) of 0.73. GestaltMatcher could not separate the 2 classes (AUROC 0.55). Trained for this purpose, D-Score achieved the highest discriminatory power (AUROC 0.86). D-Score's levels increased with the age of the depicted individuals. Male individuals yielded higher D-scores than female individuals. Ethnicity did not appear to influence D-scores. Conclusions: If used with caution, algorithms such as D-score could help clinicians with constrained resources or limited experience in syndromology to decide whether a patient needs further genetic evaluation. Algorithms such as DeepGestalt could support diagnosing rather common genetic syndromes with facial abnormalities, whereas algorithms such as GestaltMatcher could suggest rare diagnoses that are unknown to the clinician in patients with a characteristic, dysmorphic face. ", doi="10.2196/42904", url="https://www.jmir.org/2024/1/e42904", url="http://www.ncbi.nlm.nih.gov/pubmed/38477981" } @Article{info:doi/10.2196/48186, author="Fahimi, Mansour and Hair, C. Elizabeth and Do, K. Elizabeth and Kreslake, M. Jennifer and Yan, Xiaolu and Chan, Elisa and Barlas, M. 
Frances and Giles, Abigail and Osborn, Larry", title="Improving the Efficiency of Inferences From Hybrid Samples for Effective Health Surveillance Surveys: Comprehensive Review of Quantitative Methods", journal="JMIR Public Health Surveill", year="2024", month="Mar", day="7", volume="10", pages="e48186", keywords="hybrid samples", keywords="composite estimation", keywords="optimal composition factor", keywords="unequal weighting effect", keywords="composite weighting", keywords="weighting", keywords="surveillance", keywords="sample survey", keywords="data collection", keywords="risk factor", abstract="Background: Increasingly, survey researchers rely on hybrid samples to improve coverage and increase the number of respondents by combining independent samples. For instance, it is possible to combine 2 probability samples with one relying on telephone and another on mail. More commonly, however, researchers are now supplementing probability samples with those from online panels that are less costly. Setting aside ad hoc approaches that are void of rigor, traditionally, the method of composite estimation has been used to blend results from different sample surveys. This means individual point estimates from different surveys are pooled together, 1 estimate at a time. Given that for a typical study many estimates must be produced, this piecemeal approach is computationally burdensome and subject to the inferential limitations of the individual surveys that are used in this process. Objective: In this paper, we will provide a comprehensive review of the traditional method of composite estimation. Subsequently, the method of composite weighting is introduced, which is significantly more efficient, both computationally and inferentially when pooling data from multiple surveys. With the growing interest in hybrid sampling alternatives, we hope to offer an accessible methodology for improving the efficiency of inferences from such sample surveys without sacrificing rigor. 
Methods: Specifically, we will illustrate why the many ad hoc procedures for blending survey data from multiple surveys are void of scientific integrity and subject to misleading inferences. Moreover, we will demonstrate how the traditional approach of composite estimation fails to offer a pragmatic and scalable solution in practice. By relying on theoretical and empirical justifications, in contrast, we will show how our proposed methodology of composite weighting is both scientifically sound and inferentially and computationally superior to the old method of composite estimation. Results: Using data from 3 large surveys that have relied on hybrid samples composed of probability-based and supplemental sample components from online panels, we illustrate that our proposed method of composite weighting is superior to the traditional method of composite estimation in 2 distinct ways. Computationally, it is vastly less demanding and hence more accessible for practitioners. Inferentially, it produces more efficient estimates with higher levels of external validity when pooling data from multiple surveys. Conclusions: The new realities of the digital age have brought about a number of resilient challenges for survey researchers, which in turn have exposed some of the inefficiencies associated with the traditional methods this community has relied upon for decades. The resilience of such challenges suggests that piecemeal approaches that may have limited applicability or restricted accessibility will prove to be inadequate and transient. It is from this perspective that our proposed method of composite weighting has aimed to introduce a durable and accessible solution for hybrid sample surveys. 
", doi="10.2196/48186", url="https://publichealth.jmir.org/2024/1/e48186", url="http://www.ncbi.nlm.nih.gov/pubmed/38451620" } @Article{info:doi/10.2196/50421, author="Baines, Rebecca and Stevens, Sebastian and Austin, Daniela and Anil, Krithika and Bradwell, Hannah and Cooper, Leonie and Maramba, Daniel Inocencio and Chatterjee, Arunangsu and Leigh, Simon", title="Patient and Public Willingness to Share Personal Health Data for Third-Party or Secondary Uses: Systematic Review", journal="J Med Internet Res", year="2024", month="Mar", day="5", volume="26", pages="e50421", keywords="data sharing", keywords="personal health data", keywords="patient", keywords="public attitudes", keywords="systematic review", keywords="secondary use", keywords="third party", keywords="willingness to share", keywords="data privacy and security", abstract="Background: International advances in information communication, eHealth, and other digital health technologies have led to significant expansions in the collection and analysis of personal health data. However, following a series of high-profile data sharing scandals and the emergence of COVID-19, critical exploration of public willingness to share personal health data remains limited, particularly for third-party or secondary uses. Objective: This systematic review aims to explore factors that affect public willingness to share personal health data for third-party or secondary uses. Methods: A systematic search of 6 databases (MEDLINE, Embase, PsycINFO, CINAHL, Scopus, and SocINDEX) was conducted with review findings analyzed using inductive-thematic analysis and synthesized using a narrative approach. Results: Of the 13,949 papers identified, 135 were included. Factors most commonly identified as a barrier to data sharing from a public perspective included data privacy, security, and management concerns. 
Other factors found to influence willingness to share personal health data included the type of data being collected (ie, perceived sensitivity); the type of user requesting their data to be shared, including their perceived motivation, profit prioritization, and ability to directly impact patient care; trust in the data user, as well as in associated processes, often established through individual choice and control over what data are shared with whom, when, and for how long, supported by appropriate models of dynamic consent; the presence of a feedback loop; and clearly articulated benefits or issue relevance including valued incentivization and compensation at both an individual and collective or societal level. Conclusions: There is general, yet conditional public support for sharing personal health data for third-party or secondary use. Clarity, transparency, and individual control over who has access to what data, when, and for how long are widely regarded as essential prerequisites for public data sharing support. Individual levels of control and choice need to operate within the auspices of assured data privacy and security processes, underpinned by dynamic and responsive models of consent that prioritize individual or collective benefits over and above commercial gain. Failure to understand, design, and refine data sharing approaches in response to changeable patient preferences will only jeopardize the tangible benefits of data sharing practices being fully realized. 
", doi="10.2196/50421", url="https://www.jmir.org/2024/1/e50421", url="http://www.ncbi.nlm.nih.gov/pubmed/38441944" } @Article{info:doi/10.2196/53627, author="Boehm, Dominik and Strantz, Cosima and Christoph, Jan and Busch, Hauke and Ganslandt, Thomas and Unberath, Philipp", title="Data Visualization Support for Tumor Boards and Clinical Oncology: Protocol for a Scoping Review", journal="JMIR Res Protoc", year="2024", month="Mar", day="5", volume="13", pages="e53627", keywords="clinical oncology", keywords="tumor board", keywords="cancer conference", keywords="multidisciplinary", keywords="visualization", keywords="software", keywords="tool", keywords="scoping review", keywords="tumor", keywords="malignant", keywords="benign", keywords="data sets", keywords="oncology", keywords="interactive visualization", keywords="data", keywords="patient", keywords="patients", keywords="physicians", keywords="medical practitioners", keywords="medical practitioner", keywords="conference", abstract="Background: Complex and expanding data sets in clinical oncology applications require flexible and interactive visualization of patient data to provide the maximum amount of information to physicians and other medical practitioners. Interdisciplinary tumor conferences in particular profit from customized tools to integrate, link, and visualize relevant data from all professions involved. Objective: The scoping review proposed in this protocol aims to identify and present currently available data visualization tools for tumor boards and related areas. The objective of the review will be to provide not only an overview of digital tools currently used in tumor board settings, but also the data included, the respective visualization solutions, and their integration into hospital processes. Methods: The planned scoping review process is based on the Arksey and O'Malley scoping study framework. 
The following electronic databases will be searched for articles published in English: PubMed, Web of Knowledge, and SCOPUS. Eligible articles will first undergo a deduplication step, followed by the screening of titles and abstracts. Second, a full-text screening will be used to reach the final decision about article selection. At least 2 reviewers will independently screen titles, abstracts, and full-text reports. Conflicting inclusion decisions will be resolved by a third reviewer. The remaining literature will be analyzed using a data extraction template proposed in this protocol. The template includes a variety of meta information as well as specific questions aiming to answer the research question: ``What are the key features of data visualization solutions used in molecular and organ tumor boards, and how are these elements integrated and used within the clinical setting?'' The findings will be compiled, charted, and presented as specified in the scoping study framework. Data for included tools may be supplemented with additional manual literature searches. The entire review process will be documented in alignment with the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) flowchart. Results: The results of this scoping review will be reported per the expanded PRISMA-ScR guidelines. A preliminary search using PubMed, Web of Knowledge, and Scopus resulted in 1320 articles after deduplication that will be included in the further review process. We expect the results to be published during the second quarter of 2024. Conclusions: Visualization is a key process in leveraging a data set's potentially available information and enabling its use in an interdisciplinary setting. The scoping review described in this protocol aims to present the status quo of visualization solutions for tumor board and clinical oncology applications and their integration into hospital processes. 
International Registered Report Identifier (IRRID): DERR1-10.2196/53627 ", doi="10.2196/53627", url="https://www.researchprotocols.org/2024/1/e53627", url="http://www.ncbi.nlm.nih.gov/pubmed/38441925" } @Article{info:doi/10.2196/47846, author="Oehm, Benedict Johannes and Riepenhausen, Luise Sarah and Storck, Michael and Dugas, Martin and Pryss, R{\"u}diger and Varghese, Julian", title="Integration of Patient-Reported Outcome Data Collected Via Web Applications and Mobile Apps Into a Nation-Wide COVID-19 Research Platform Using Fast Healthcare Interoperability Resources: Development Study", journal="J Med Internet Res", year="2024", month="Feb", day="27", volume="26", pages="e47846", keywords="Fast Healthcare Interoperability Resources", keywords="FHIR", keywords="FHIR Questionnaire", keywords="patient-reported outcome", keywords="mobile health", keywords="mHealth", keywords="research compatibility", keywords="interoperability", keywords="Germany", keywords="harmonized data collection", keywords="findable, accessible, interoperable, and reusable", keywords="FAIR data", keywords="mobile phone", abstract="Background: The Network University Medicine projects are an important part of the German COVID-19 research infrastructure. They comprise 2 subprojects: COVID-19 Data Exchange (CODEX) and Coordination on Mobile Pandemic Apps Best Practice and Solution Sharing (COMPASS). CODEX provides a centralized and secure data storage platform for research data, whereas in COMPASS, expert panels were gathered to develop a reference app framework for capturing patient-reported outcomes (PROs) that can be used by any researcher. Objective: Our study aims to integrate the data collected with the COMPASS reference app framework into the central CODEX platform, so that they can be used by secondary researchers. Although both projects used the Fast Healthcare Interoperability Resources (FHIR) standard, it was not used in a way that data could be shared directly. 
Given the short time frame and the parallel developments within the CODEX platform, a pragmatic and robust solution for an interface component was required. Methods: We have developed a means to facilitate and promote the use of the German Corona Consensus (GECCO) data set, a core data set for COVID-19 research in Germany. In this way, we ensured semantic interoperability for the app-collected PRO data with the COMPASS app. We also developed an interface component to sustain syntactic interoperability. Results: The use of different FHIR types by the COMPASS reference app framework (the general-purpose FHIR Questionnaire) and the CODEX platform (eg, Patient, Condition, and Observation) was found to be the most significant obstacle. Therefore, we developed an interface component that realigns the Questionnaire items with the corresponding items in the GECCO data set and provides the correct resources for the CODEX platform. We extended the existing COMPASS questionnaire editor with an import function for GECCO items, which also tags them for the interface component. This ensures syntactic interoperability and eases the reuse of the GECCO data set for researchers. Conclusions: This paper shows how PRO data, which are collected across various studies conducted by different researchers, can be captured in a research-compatible way. This means that the data can be shared with a central research infrastructure and be reused by other researchers to gain more insights about COVID-19 and its sequelae. 
", doi="10.2196/47846", url="https://www.jmir.org/2024/1/e47846", url="http://www.ncbi.nlm.nih.gov/pubmed/38411999" } @Article{info:doi/10.2196/49381, author="Hood, Nicole and Benbow, Nanette and Jaggi, Chandni and Whitby, Shamaya and Sullivan, Sean Patrick and ", title="AIDSVu Cities' Progress Toward HIV Care Continuum Goals: Cross-Sectional Study", journal="JMIR Public Health Surveill", year="2024", month="Feb", day="26", volume="10", pages="e49381", keywords="HIV", keywords="epidemiology", keywords="surveillance", keywords="HIV care continuum", keywords="cities", keywords="HIV public health", keywords="HIV prevention", keywords="diagnosis", keywords="HIV late diagnosis", abstract="Background: Public health surveillance data are critical to understanding the current state of the HIV and AIDS epidemics. Surveillance data provide significant insight into patterns within and progress toward achieving targets for each of the steps in the HIV care continuum. Such targets include those outlined in the National HIV/AIDS Strategy (NHAS) goals. If these data are disseminated, they can be used to prioritize certain steps in the continuum, geographic locations, and groups of people. Objective: We sought to develop and report indicators of progress toward the NHAS goals for US cities and to characterize progress toward those goals with categorical metrics. Methods: Health departments used standardized SAS code to calculate care continuum indicators from their HIV surveillance data to ensure comparability across jurisdictions. We report 2018 descriptive statistics for continuum steps (timely diagnosis, linkage to medical care, receipt of medical care, and HIV viral load suppression) for 36 US cities and their progress toward 2020 NHAS goals as of 2018. Indicators are reported categorically as met or surpassed the goal, within 25\% of attaining the goal, or further than 25\% from achieving the goal. 
Results: Cities were closest to meeting NHAS goals for timely diagnosis compared to the goals for linkage to care, receipt of care, and viral load suppression, with all cities (n=36, 100\%) within 25\% of meeting the goal for timely diagnosis. Only 8\% (n=3) of cities were >25\% from achieving the goal for receipt of care, but 69\% (n=25) of cities were >25\% from achieving the goal for viral suppression. Conclusions: Display of progress with graphical indicators enables communication of progress to stakeholders. AIDSVu analyses of HIV surveillance data facilitate cities' ability to benchmark their progress against that of other cities with similar characteristics. By identifying peer cities (eg, cities with analogous populations or similar NHAS goal concerns), the public display of indicators can promote dialogue between cities with comparable challenges and opportunities. ", doi="10.2196/49381", url="https://publichealth.jmir.org/2024/1/e49381", url="http://www.ncbi.nlm.nih.gov/pubmed/38407961" } @Article{info:doi/10.2196/54681, author="Castiglione, Angela Sonia and Lavoie-Tremblay, M{\'e}lanie and Kilpatrick, Kelley and Gifford, Wendy and Semenic, Elizabeth Sonia", title="Exploring Shared Implementation Leadership of Point of Care Nursing Leadership Teams on Inpatient Hospital Units: Protocol for a Collective Case Study", journal="JMIR Res Protoc", year="2024", month="Feb", day="19", volume="13", pages="e54681", keywords="case study", keywords="evidence-based practices", keywords="implementation leadership", keywords="inpatient hospital units", keywords="nursing leadership", keywords="point of care", abstract="Background: Nursing leadership teams at the point of care (POC), consisting of both formal and informal leaders, are regularly called upon to support the implementation of evidence-based practices (EBPs) in hospital units. 
However, current conceptualizations of effective leadership for successful implementation typically focus on the behaviors of individual leaders in managerial roles. Little is known about how multiple nursing leaders in formal and informal roles share implementation leadership (IL), representing an important knowledge gap. Objective: This study aims to explore shared IL among formal and informal nursing leaders in inpatient hospital units. The central research question is as follows: How is IL shared among members of POC nursing leadership teams on inpatient hospital units? The subquestions are as follows: (1) What IL behaviors are enacted and shared by formal and informal leaders? (2) What social processes enable shared IL by formal and informal leaders? and (3) What factors influence shared IL in nursing leadership teams? Methods: We will use a collective case study approach to describe and generate an in-depth understanding of shared IL in nursing. We will select nursing leadership teams on 2 inpatient hospital units that have successfully implemented an EBP as instrumental cases. We will construct data through focus groups and individual interviews with key informants (leaders, unit staff, and senior nurse leaders), review of organizational documents, and researcher-generated field notes. We have developed a conceptual framework of shared IL to guide data analysis, which describes effective IL behaviors, formal and informal nursing leaders' roles at the POC, and social processes generating shared leadership and influencing contextual factors. We will use the Framework Method to systematically generate data matrices from deductive and inductive thematic analysis of each case. We will then generate assertions about shared IL following a cross-case analysis. Results: The study protocol received research ethics approval (2022-8408) on February 24, 2022. Data collection began in June 2022, and we have recruited 2 inpatient hospital units and 25 participants. 
Data collection was completed in December 2023, and data analysis is ongoing. We anticipate findings to be published in a peer-reviewed journal by late 2024. Conclusions: The anticipated results will shed light on how multiple and diverse members of the POC nursing leadership team enact and share IL. This study addresses calls to advance knowledge in promoting effective implementation of EBPs to ensure high-quality health care delivery by further developing the concept of shared IL in a nursing context. We will identify strategies to strengthen shared IL in nursing leadership teams at the POC, informing future intervention studies. International Registered Report Identifier (IRRID): DERR1-10.2196/54681 ", doi="10.2196/54681", url="https://www.researchprotocols.org/2024/1/e54681", url="http://www.ncbi.nlm.nih.gov/pubmed/38373024" } @Article{info:doi/10.2196/47703, author="Benda, Natalie and Dougherty, Kylie and Gebremariam Gobezayehu, Abebe and Cranmer, N. John and Zawtha, Sakie and Andreadis, Katerina and Biza, Heran and Masterson Creber, Ruth", title="Designing Electronic Data Capture Systems for Sustainability in Low-Resource Settings: Viewpoint With Lessons Learned From Ethiopia and Myanmar", journal="JMIR Public Health Surveill", year="2024", month="Feb", day="12", volume="10", pages="e47703", keywords="low and middle income countries", keywords="LMIC", keywords="electronic data capture", keywords="population health surveillance", keywords="sociotechnical system", keywords="data infrastructure", keywords="electronic data system", keywords="health care system", keywords="technology", keywords="information system", keywords="health program development", keywords="intervention", doi="10.2196/47703", url="https://publichealth.jmir.org/2024/1/e47703", url="http://www.ncbi.nlm.nih.gov/pubmed/38345833" } @Article{info:doi/10.2196/47092, author="Hollestelle, J. Marieke and van der Graaf, Rieke and Sturkenboom, M. Miriam C. J. and Cunnington, Marianne and van Delden, M. 
Johannes J.", title="Building a Sustainable Learning Health Care System for Pregnant and Lactating People: Interview Study Among Data Access Providers", journal="JMIR Pediatr Parent", year="2024", month="Feb", day="8", volume="7", pages="e47092", keywords="ethics", keywords="learning health care systems", keywords="pregnancy", keywords="lactation", keywords="real-world data", keywords="governance", keywords="qualitative research", abstract="Background: In many areas of health care, learning health care systems (LHSs) are seen as promising ways to accelerate research and outcomes for patients by reusing health and research data. For example, considering pregnant and lactating people, for whom there is still a poor evidence base for medication safety and efficacy, an LHS presents an interesting way forward. Combining unique data sources across Europe in an LHS could help clarify how medications affect pregnancy outcomes and lactation exposures. In general, a remaining challenge of data-intensive health research, which is at the core of an LHS, has been obtaining meaningful access to data. These unique data sources, also called data access providers (DAPs), are both public and private organizations and are important stakeholders in the development of a sustainable and ethically responsible LHS. Sustainability is often discussed as a challenge in LHS development. Moreover, DAPs are increasingly expected to move beyond regulatory compliance and are seen as moral agents tasked with upholding ethical principles, such as transparency, trustworthiness, responsibility, and community engagement. Objective: This study aims to explore the views of people working for DAPs who participate in a public-private partnership to build a sustainable and ethically responsible LHS. 
Methods: Using a qualitative interview design, we interviewed 14 people involved in the Innovative Medicines Initiative (IMI) ConcePTION (Continuum of Evidence from Pregnancy Exposures, Reproductive Toxicology and Breastfeeding to Improve Outcomes Now) project, a public-private collaboration with the goal of building an LHS for pregnant and lactating people. The pseudonymized transcripts were analyzed thematically. Results: A total of 3 themes were identified: opportunities and responsibilities, conditions for participation and commitment, and challenges for a knowledge-generating ecosystem. The respondents generally regarded the collaboration as an opportunity for various reasons beyond the primary goal of generating knowledge about medication safety during pregnancy and lactation. Respondents had different interpretations of responsibility in the context of data-intensive research in a public-private network. Respondents explained that resources (financial and other), scientific output, motivation, agreements, collaboration with the pharmaceutical industry, trust, and transparency are important conditions for participating in and committing to the ConcePTION LHS. Respondents also discussed the challenges of an LHS, including the limitations to (real-world) data analyses and governance procedures. Conclusions: Our respondents were motivated by diverse opportunities to contribute to an LHS for pregnant and lactating people, primarily centered on advancing knowledge on medication safety. Although a shared responsibility for enabling real-world data analyses is acknowledged, their focus remains on their work and contribution to the project rather than on safeguarding ethical data handling. The results of our interviews underline the importance of a transparent governance structure, emphasizing the trust between DAPs and the public for the success and sustainability of an LHS. 
", doi="10.2196/47092", url="https://pediatrics.jmir.org/2024/1/e47092", url="http://www.ncbi.nlm.nih.gov/pubmed/38329780" } @Article{info:doi/10.2196/51640, author="Kim, Jina and Choi, Sung Yong and Lee, Joo Young and Yeo, Geun Seung and Kim, Won Kyung and Kim, Seo Min and Rahmati, Masoud and Yon, Keon Dong and Lee, Jinseok", title="Limitations of the Cough Sound-Based COVID-19 Diagnosis Artificial Intelligence Model and its Future Direction: Longitudinal Observation Study", journal="J Med Internet Res", year="2024", month="Feb", day="6", volume="26", pages="e51640", keywords="COVID-19 variants", keywords="cough sound", keywords="artificial intelligence", keywords="diagnosis", keywords="human lifestyle", keywords="SARS-CoV-2", keywords="AI model", keywords="cough", keywords="sound-based", keywords="sounds app", keywords="development", keywords="COVID-19", keywords="AI", abstract="Background: The outbreak of SARS-CoV-2 in 2019 has necessitated the rapid and accurate detection of COVID-19 to manage patients effectively and implement public health measures. Artificial intelligence (AI) models analyzing cough sounds have emerged as promising tools for large-scale screening and early identification of potential cases. Objective: This study aimed to investigate the efficacy of using cough sounds as a diagnostic tool for COVID-19, considering the unique acoustic features that differentiate positive and negative cases. We investigated whether an AI model trained on cough sound recordings from specific periods, especially the early stages of the COVID-19 pandemic, was applicable to the ongoing situation with persistent variants. Methods: We used cough sound recordings from 3 data sets (Cambridge, Coswara, and Virufy) representing different stages of the pandemic and variants. Our AI model was trained using the Cambridge data set with subsequent evaluation against all data sets. 
The performance was analyzed based on the area under the receiver operating characteristic curve (AUC) across different data measurement periods and COVID-19 variants. Results: The AI model demonstrated a high AUC when tested with the Cambridge data set, indicative of its initial effectiveness. However, the performance varied significantly with other data sets, particularly in detecting later variants such as Delta and Omicron, with a marked decline in AUC observed for the latter. These results highlight the challenges in maintaining the efficacy of AI models against the backdrop of an evolving virus. Conclusions: While AI models analyzing cough sounds offer a promising noninvasive and rapid screening method for COVID-19, their effectiveness is challenged by the emergence of new virus variants. Ongoing research and adaptations in AI methodologies are crucial to address these limitations. The adaptability of AI models to evolve with the virus underscores their potential as a foundational technology for not only the current pandemic but also future outbreaks, contributing to a more agile and resilient global health infrastructure. 
", doi="10.2196/51640", url="https://www.jmir.org/2024/1/e51640", url="http://www.ncbi.nlm.nih.gov/pubmed/38319694" } @Article{info:doi/10.2196/52080, author="Xu, Jian", title="The Current Status and Promotional Strategies for Cloud Migration of Hospital Information Systems in China: Strengths, Weaknesses, Opportunities, and Threats Analysis", journal="JMIR Med Inform", year="2024", month="Feb", day="5", volume="12", pages="e52080", keywords="hospital information system", keywords="HIS", keywords="cloud computing", keywords="cloud migration", keywords="Strengths, Weaknesses, Opportunities, and Threats analysis", abstract="Background: In the 21st century, Chinese hospitals have witnessed innovative medical business models, such as online diagnosis and treatment, cross-regional multidepartment consultation, and real-time sharing of medical test results, that surpass traditional hospital information systems (HISs). The introduction of cloud computing provides an excellent opportunity for hospitals to address these challenges. However, there is currently no comprehensive research assessing the cloud migration of HISs in China. This lack may hinder the widespread adoption and secure implementation of cloud computing in hospitals. Objective: The objective of this study is to comprehensively assess external and internal factors influencing the cloud migration of HISs in China and propose promotional strategies. Methods: Academic articles from January 1, 2007, to February 21, 2023, on the topic were searched in PubMed and HuiyiMd databases, and relevant documents such as national policy documents, white papers, and survey reports were collected from authoritative sources for analysis. A systematic assessment of factors influencing cloud migration of HISs in China was conducted by combining a Strengths, Weaknesses, Opportunities, and Threats (SWOT) analysis and literature review methods. 
Then, various promotional strategies based on different combinations of external and internal factors were proposed. Results: After conducting a thorough search and review, this study included 94 academic articles and 37 relevant documents. The analysis of these documents reveals the increasing application of and research on cloud computing in Chinese hospitals, and that it has expanded to 22 disciplinary domains. However, more than half (n=49, 52\%) of the documents primarily focused on task-specific cloud-based systems in hospitals, while only 22\% (n=21 articles) discussed integrated cloud platforms shared across the entire hospital, medical alliance, or region. The SWOT analysis showed that cloud computing adoption in Chinese hospitals benefits from policy support, capital investment, and social demand for new technology. However, it also faces threats like loss of digital sovereignty, supplier competition, cyber risks, and insufficient supervision. Factors driving cloud migration for HISs include medical big data analytics and use, interdisciplinary collaboration, health-centered medical service provision, and successful cases. Barriers include system complexity, security threats, lack of strategic planning and resource allocation, relevant personnel shortages, and inadequate investment. This study proposes 4 promotional strategies: encouraging more hospitals to migrate, enhancing hospitals' capabilities for migration, establishing a provincial-level unified medical hybrid multi-cloud platform, and strengthening legal frameworks and providing robust technical support. Conclusions: Cloud computing is an innovative technology that has gained significant attention from both the Chinese government and the global community. 
In order to effectively support the rapid growth of a novel, health-centered medical industry, it is imperative for Chinese health authorities and hospitals to seize this opportunity by implementing comprehensive strategies aimed at encouraging hospitals to migrate their HISs to the cloud. ", doi="10.2196/52080", url="https://medinform.jmir.org/2024/1/e52080", url="http://www.ncbi.nlm.nih.gov/pubmed/38315519" } @Article{info:doi/10.2196/53302, author="Recsky, Chantelle and Rush, L. Kathy and MacPhee, Maura and Stowe, Megan and Blackburn, Lorraine and Muniak, Allison and Currie, M. Leanne", title="Clinical Informatics Team Members' Perspectives on Health Information Technology Safety After Experiential Learning and Safety Process Development: Qualitative Descriptive Study", journal="JMIR Form Res", year="2024", month="Feb", day="5", volume="8", pages="e53302", keywords="informatics", keywords="community health services", keywords="knowledge translation", keywords="qualitative research", keywords="patient safety", abstract="Background: Although intended to support improvement, the rapid adoption and evolution of technologies in health care can also bring about unintended consequences related to safety. In this project, an embedded researcher with expertise in patient safety and clinical education worked with a clinical informatics team to examine safety and harm related to health information technologies (HITs) in primary and community care settings. The clinical informatics team participated in learning activities around relevant topics (eg, human factors, high reliability organizations, and sociotechnical systems) and cocreated a process to address safety events related to technology (ie, safety huddles and sociotechnical analysis of safety events). Objective: This study aimed to explore clinical informaticians' experiences of incorporating safety practices into their work. 
Methods: We used a qualitative descriptive design and conducted web-based focus groups with clinical informaticians. Thematic analysis was used to analyze the data. Results: A total of 10 informants participated. Barriers to addressing safety and harm in their context included limited prior knowledge of HIT safety, previous assumptions and perspectives, competing priorities and organizational barriers, difficulty with the reporting system and processes, and a limited number of reports for learning. Enablers to promoting safety and mitigating harm included participating in learning sessions, gaining experience analyzing reported events, participating in safety huddles, and role modeling and leadership from the embedded researcher. Individual outcomes included increased ownership and interest in HIT safety, the development of a sociotechnical systems perspective, thinking differently about safety, and increased consideration for user perspectives. Team outcomes included enhanced communication within the team, using safety events to inform future work and strategic planning, and an overall promotion of a culture of safety. Conclusions: As HITs are integrated into care delivery, it is important for clinical informaticians to recognize the risks related to safety. Experiential learning activities, including reviewing safety event reports and participating in safety huddles, were identified as particularly impactful. An HIT safety learning initiative is a feasible approach for clinical informaticians to become more knowledgeable and engaged in HIT safety issues in their work. ", doi="10.2196/53302", url="https://formative.jmir.org/2024/1/e53302", url="http://www.ncbi.nlm.nih.gov/pubmed/38315544" } @Article{info:doi/10.2196/50339, author="Charles, M. Wendy and van der Waal, B. Mark and Flach, Joost and Bisschop, Arno and van der Waal, X. Raymond and Es-Sbai, Hadil and McLeod, J. 
Christopher", title="Blockchain-Based Dynamic Consent and its Applications for Patient-Centric Research and Health Information Sharing: Protocol for an Integrative Review", journal="JMIR Res Protoc", year="2024", month="Feb", day="5", volume="13", pages="e50339", keywords="best practices", keywords="blockchain", keywords="clinical trial", keywords="data reuse", keywords="data sharing", keywords="dynamic consent", keywords="health care data", keywords="integrative research review", keywords="scientific rigor", keywords="technology implementation", abstract="Background: Blockchain has been proposed as a critical technology to facilitate more patient-centric research and health information sharing. For instance, it can be applied to coordinate and document dynamic informed consent, a procedure that allows individuals to continuously review and renew their consent to the collection, use, or sharing of their private health information. This has been suggested to facilitate ethical, compliant longitudinal research and patient engagement. However, blockchain-based dynamic consent is a relatively new concept, and it is not yet clear how well the suggested implementations will work in practice. Efforts to critically evaluate implementations in health research contexts are limited. Objective: The objective of this protocol is to guide the identification and critical appraisal of implementations of blockchain-based dynamic consent in health research contexts, thereby facilitating the development of best practices for future research, innovation, and implementation. Methods: The protocol describes methods for an integrative review to allow evaluation of a broad range of quantitative and qualitative research designs. The PRISMA-P (Preferred Reporting Items for Systematic Review and Meta-Analysis Protocols) framework guided the review's structure and nature of reporting findings. We developed search strategies and syntax with the help of an academic librarian. 
Multiple databases were selected to identify pertinent academic literature (CINAHL, Embase, Ovid MEDLINE, PubMed, Scopus, and Web of Science) and gray literature (Electronic Theses Online Service, ProQuest Dissertations and Theses, Open Access Theses and Dissertations, and Google Scholar) for a comprehensive picture of the field's progress. Eligibility criteria were defined based on PROSPERO (International Prospective Register of Systematic Reviews) requirements and a criteria framework for technology readiness. A total of 2 reviewers will independently review and extract data, while a third reviewer will adjudicate discrepancies. Quality appraisal of articles and discussed implementations will proceed based on the validated Mixed Method Appraisal Tool, and themes will be identified through thematic data synthesis. Results: Literature searches were conducted, and after duplicates were removed, 492 articles were eligible for screening. Title and abstract screening allowed the removal of 312 articles, leaving 180 eligible articles for full-text review against inclusion criteria and confirming a sufficient body of literature for project feasibility. Results will synthesize the quality of evidence on blockchain-based dynamic consent for patient-centric research and health information sharing, covering effectiveness, efficiency, satisfaction, regulatory compliance, and methods of managing identity. Conclusions: The review will provide a comprehensive picture of the progress of emerging blockchain-based dynamic consent technologies and the rigor with which implementations are approached. Resulting insights are expected to inform best practices for future research, innovation, and implementation to benefit patient-centric research and health information sharing. 
Trial Registration: PROSPERO CRD42023396983; http://tinyurl.com/cn8a5x7t International Registered Report Identifier (IRRID): DERR1-10.2196/50339 ", doi="10.2196/50339", url="https://www.researchprotocols.org/2024/1/e50339", url="http://www.ncbi.nlm.nih.gov/pubmed/38315514" } @Article{info:doi/10.2196/49031, author="Ma, Shaoying and Jiang, Shuning and Yang, Olivia and Zhang, Xuanzhi and Fu, Yu and Zhang, Yusen and Kaareen, Aadeeba and Ling, Meng and Chen, Jian and Shang, Ce", title="Use of Machine Learning Tools in Evidence Synthesis of Tobacco Use Among Sexual and Gender Diverse Populations: Algorithm Development and Validation", journal="JMIR Form Res", year="2024", month="Jan", day="24", volume="8", pages="e49031", keywords="machine learning", keywords="natural language processing", keywords="tobacco control", keywords="sexual and gender diverse populations", keywords="lesbian", keywords="gay", keywords="bisexual", keywords="transgender", keywords="queer", keywords="LGBTQ+", keywords="evidence synthesis", abstract="Background: From 2016 to 2021, the volume of peer-reviewed publications related to tobacco has experienced a significant increase. This presents a considerable challenge in efficiently summarizing, synthesizing, and disseminating research findings, especially when it comes to addressing specific target populations, such as the LGBTQ+ (lesbian, gay, bisexual, transgender, queer, intersex, asexual, Two Spirit, and other persons who identify as part of this community) populations. 
Objective: In order to expedite evidence synthesis and research gap discoveries, this pilot study has the following three aims: (1) to compile a specialized semantic database for tobacco policy research to extract information from journal article abstracts, (2) to develop natural language processing (NLP) algorithms that comprehend the literature on nicotine and tobacco product use among sexual and gender diverse populations, and (3) to compare the discoveries of the NLP algorithms with an ongoing systematic review of tobacco policy research among LGBTQ+ populations. Methods: We built a tobacco research domain--specific semantic database using data from 2993 paper abstracts from 4 leading tobacco-specific journals, with enrichment from other publicly available sources. We then trained an NLP model to extract named entities after learning patterns and relationships between words and their context in text, which further enriched the semantic database. Using this iterative process, we extracted and assessed studies relevant to LGBTQ+ tobacco control issues, further comparing our findings with an ongoing systematic review that also focuses on evidence synthesis for this demographic group. Results: In total, 33 studies were identified as relevant to sexual and gender diverse individuals' nicotine and tobacco product use. Consistent with the ongoing systematic review, the NLP results showed that there is a scarcity of studies assessing policy impact on this demographic using causal inference methods. In addition, the literature is dominated by US data. We found that the product drawing the most attention in the body of existing research is cigarettes or cigarette smoking and that the number of studies of various age groups is almost evenly distributed between youth or young adults and adults, consistent with the research needs identified by the US health agencies. 
Conclusions: Our pilot study serves as a compelling demonstration of the capabilities of NLP tools in expediting the processes of evidence synthesis and the identification of research gaps. While future research is needed to statistically test the NLP tool's performance, there is potential for NLP tools to fundamentally transform the approach to evidence synthesis. ", doi="10.2196/49031", url="https://formative.jmir.org/2024/1/e49031", url="http://www.ncbi.nlm.nih.gov/pubmed/38265858" } @Article{info:doi/10.2196/47761, author="Wurster, Florian and Beckmann, Marina and Cecon-Stabel, Natalia and Dittmer, Kerstin and Hansen, Jes Till and Jaschke, Julia and K{\"o}berlein-Neu, Juliane and Okumu, Mi-Ran and Rusniok, Carsten and Pfaff, Holger and Karbach, Ute", title="The Implementation of an Electronic Medical Record in a German Hospital and the Change in Completeness of Documentation: Longitudinal Document Analysis", journal="JMIR Med Inform", year="2024", month="Jan", day="19", volume="12", pages="e47761", keywords="clinical documentation", keywords="digital transformation", keywords="document analysis", keywords="electronic medical record", keywords="EMR", keywords="Germany", keywords="health services research", keywords="hospital", keywords="implementation", abstract="Background: Electronic medical records (EMR) are considered a key component of the health care system's digital transformation. The implementation of an EMR promises various improvements, for example, in the availability of information, coordination of care, or patient safety, and is required for big data analytics. To ensure those possibilities, the included documentation must be of high quality. In this matter, the most frequently described dimension of data quality is the completeness of documentation. In this regard, little is known about how and why the completeness of documentation might change after the implementation of an EMR. 
Objective: This study aims to compare the completeness of documentation in paper-based medical records and EMRs and to discuss the possible impact of an EMR on the completeness of documentation. Methods: A retrospective document analysis was conducted, comparing the completeness of paper-based medical records and EMRs. Data were collected before and after the implementation of an EMR on an orthopaedic ward in a German academic teaching hospital. The anonymized records represent all treated patients for a 3-week period each. Unpaired, 2-tailed t tests, chi-square tests, and relative risks were calculated to analyze and compare the mean completeness of the 2 record types in general and of 10 specific items in detail (blood pressure, body temperature, diagnosis, diet, excretions, height, pain, pulse, reanimation status, and weight). For this purpose, each of the 10 items received a dichotomous score of 1 if it was documented on the first day of patient care on the ward; otherwise, it was scored as 0. Results: The analysis consisted of 180 medical records. The average completeness was 6.25 (SD 2.15) out of 10 in the paper-based medical record, significantly rising to an average of 7.13 (SD 2.01) in the EMR (t178=--2.469; P=.01; d=--0.428). When looking at the significant changes of the 10 items in detail, the documentation of diet (P<.001), height (P<.001), and weight (P<.001) was more complete in the EMR, while the documentation of diagnosis (P<.001), excretions (P=.02), and pain (P=.008) was less complete in the EMR. The completeness remained unchanged for the documentation of pulse (P=.28), blood pressure (P=.47), body temperature (P=.497), and reanimation status (P=.73). Conclusions: Implementing EMRs can influence the completeness of documentation, with a possible change in both increased and decreased completeness. However, the mechanisms that determine those changes are often neglected. 
There are mechanisms that might facilitate an improved completeness of documentation and could decrease or increase the staff's burden caused by documentation tasks. Research is needed to take advantage of these mechanisms and use them for mutual profit in the interests of all stakeholders. Trial Registration: German Clinical Trials Register DRKS00023343; https://drks.de/search/de/trial/DRKS00023343 ", doi="10.2196/47761", url="https://medinform.jmir.org/2024/1/e47761", url="http://www.ncbi.nlm.nih.gov/pubmed/38241076" } @Article{info:doi/10.2196/49007, author="Mehra, Tarun and Wekhof, Tobias and Keller, Iris Dagmar", title="Additional Value From Free-Text Diagnoses in Electronic Health Records: Hybrid Dictionary and Machine Learning Classification Study", journal="JMIR Med Inform", year="2024", month="Jan", day="17", volume="12", pages="e49007", keywords="electronic health records", keywords="free text", keywords="natural language processing", keywords="NLP", keywords="artificial intelligence", keywords="AI", abstract="Background: Physicians are hesitant to forgo the opportunity of entering unstructured clinical notes for structured data entry in electronic health records. Does free text increase informational value in comparison with structured data? Objective: This study aims to compare information from unstructured text-based chief complaints harvested and processed by a natural language processing (NLP) algorithm with clinician-entered structured diagnoses in terms of their potential utility for automated improvement of patient workflows. Methods: Electronic health records of 293,298 patient visits at the emergency department of a Swiss university hospital from January 2014 to October 2021 were analyzed. 
Using emergency department overcrowding as a case in point, we compared supervised NLP-based keyword dictionaries of symptom clusters from unstructured clinical notes and clinician-entered chief complaints from a structured drop-down menu with the following 2 outcomes: hospitalization and high Emergency Severity Index (ESI) score. Results: Of 12 symptom clusters, the NLP clusters were significant in predicting hospitalization in 11 (92\%) clusters; 8 (67\%) clusters remained significant even after controlling for the cluster of clinician-determined chief complaints in the model. All 12 NLP symptom clusters were significant in predicting a low ESI score, of which 9 (75\%) remained significant when controlling for clinician-determined chief complaints. The correlation between NLP clusters and chief complaints was low (r=--0.04 to 0.6), indicating complementarity of information. Conclusions: The NLP-derived features and clinicians' knowledge were complementary in explaining patient outcome heterogeneity. They can provide an efficient approach to patient flow management, for example, in an emergency medicine setting. We further demonstrated the feasibility of creating extensive and precise keyword dictionaries with NLP by medical experts without requiring programming knowledge. Using the dictionary, we could classify short and unstructured clinical texts into diagnostic categories defined by the clinician. ", doi="10.2196/49007", url="https://medinform.jmir.org/2024/1/e49007", url="http://www.ncbi.nlm.nih.gov/pubmed/38231569" } @Article{info:doi/10.2196/47673, author="Ramos, P. Pablo Ivan and Marcilio, Izabel and Bento, I. Ana and Penna, O. Gerson and de Oliveira, F. Juliane and Khouri, Ricardo and Andrade, S. Roberto F. and Carreiro, P. Roberto and Oliveira, A. Vinicius de and Galv{\~a}o, C. Luiz Augusto and Landau, Luiz and Barreto, L. 
Mauricio and van der Horst, Kay and Barral-Netto, Manoel and ", title="Combining Digital and Molecular Approaches Using Health and Alternate Data Sources in a Next-Generation Surveillance System for Anticipating Outbreaks of Pandemic Potential", journal="JMIR Public Health Surveill", year="2024", month="Jan", day="9", volume="10", pages="e47673", keywords="data integration", keywords="digital public health", keywords="infectious disease surveillance", keywords="pandemic preparedness", keywords="prevention", keywords="response", doi="10.2196/47673", url="https://publichealth.jmir.org/2024/1/e47673", url="http://www.ncbi.nlm.nih.gov/pubmed/38194263" } @Article{info:doi/10.2196/50379, author="Galvez-Hernandez, Pablo and Gonzalez-Viana, Angelina and Gonzalez-de Paz, Luis and Shankardass, Ketan and Muntaner, Carles", title="Generating Contextual Variables From Web-Based Data for Health Research: Tutorial on Web Scraping, Text Mining, and Spatial Overlay Analysis", journal="JMIR Public Health Surveill", year="2024", month="Jan", day="8", volume="10", pages="e50379", keywords="web scraping", keywords="text mining", keywords="spatial overlay analysis", keywords="program evaluation", keywords="social environment", keywords="contextual variables", keywords="health assets", keywords="social connection", keywords="multilevel analysis", keywords="health services research", abstract="Background: Contextual variables that capture the characteristics of delimited geographic or jurisdictional areas are vital for health and social research. However, obtaining data sets with contextual-level data can be challenging in the absence of monitoring systems or public census data. Objective: We describe and implement an 8-step method that combines web scraping, text mining, and spatial overlay analysis (WeTMS) to transform extensive text data from government websites into analyzable data sets containing contextual data for jurisdictional areas. 
Methods: This tutorial describes the method and provides resources for its application by health and social researchers. We used this method to create data sets of health assets aimed at enhancing older adults' social connections (eg, activities and resources such as walking groups and senior clubs) across the 374 health jurisdictions in Catalonia from 2015 to 2022. These assets are registered on a web-based government platform by local stakeholders from various health and nonhealth organizations as part of a national public health program. Steps 1 to 3 involved defining the variables of interest, identifying data sources, and using Python to extract information from 50,000 websites linked to the platform. Steps 4 to 6 comprised preprocessing the scraped text, defining new variables to classify health assets based on social connection constructs, analyzing word frequencies in titles and descriptions of the assets, creating topic-specific dictionaries, implementing a rule-based classifier in R, and verifying the results. Steps 7 and 8 integrate the spatial overlay analysis to determine the geographic location of each asset. We conducted a descriptive analysis of the data sets to report the characteristics of the assets identified and the patterns of asset registrations across areas. Results: We identified and extracted data from 17,305 websites describing health assets. The titles and descriptions of the activities and resources contained 12,560 and 7301 unique words, respectively. After applying our classifier and spatial analysis algorithm, we generated 2 data sets containing 9546 health assets (5022 activities and 4524 resources) with the potential to enhance social connections among older adults. Stakeholders from 318 health jurisdictions registered identified assets on the platform between July 2015 and December 2022. The agreement rate between the classification algorithm and verified data sets ranged from 62.02\% to 99.47\% across variables. 
Leisure and skill development activities were the most prevalent (1844/5022, 36.72\%). Leisure and cultural associations, such as social clubs for older adults, were the most common resources (878/4524, 19.41\%). Health asset registration varied across areas, ranging between 0 and 263 activities and 0 and 265 resources. Conclusions: The sequential use of WeTMS offers a robust method for generating data sets containing contextual-level variables from internet text data. This study can guide health and social researchers in efficiently generating ready-to-analyze data sets containing contextual variables. ", doi="10.2196/50379", url="https://publichealth.jmir.org/2024/1/e50379", url="http://www.ncbi.nlm.nih.gov/pubmed/38190245" } @Article{info:doi/10.2196/53365, author="Tumaliuan, Beatriz Faye and Grepo, Lorelie and Jalao, Rex Eugene", title="Development of Depression Data Sets and a Language Model for Depression Detection: Mixed Methods Study", journal="JMIR Data", year="2024", month="Sep", day="4", volume="5", pages="e53365", keywords="depression data set", keywords="depression detection", keywords="social media", keywords="natural language processing", keywords="Filipino", abstract="Background: Depression detection in social media has gained attention in recent years with the help of natural language processing (NLP) techniques. Because of the low-resource standing of Filipino depression data, valid data sets need to be created to aid various machine learning techniques in depression detection classification tasks. Objective: The primary objective is to build a depression corpus of Philippine Twitter users who were clinically diagnosed with depression by mental health professionals and develop from this a corpus of depression symptoms that can later serve as a baseline for predicting depression symptoms in the Filipino and English languages. 
Methods: The proposed process included the implementation of clinical screening methods with the help of clinical psychologists in the recruitment of study participants who were young adults aged 18 to 30 years. A total of 72 participants were assessed by clinical psychologists and provided their Twitter data: 60 with depression and 12 with no depression. Six participants provided 2 Twitter accounts each, making 78 Twitter accounts. A data set was developed consisting of depression symptom--annotated tweets with 13 depression categories. These were created through manual annotation in a process constructed, guided, and validated by clinical psychologists. Results: Three annotators completed the process for approximately 79,614 tweets, resulting in a substantial interannotator agreement score of 0.735 using Fleiss $\kappa$ and a 95.59\% psychologist validation score. A word2vec language model was developed using Filipino and English data sets to create a 300-feature word embedding that can be used in various machine learning techniques for NLP. Conclusions: This study contributes to depression research by constructing depression data sets from social media to aid NLP in the Philippine setting. These 2 validated data sets can be significant in user detection or tweet-level detection of depression in young adults in further studies. 
", doi="10.2196/53365", url="https://data.jmir.org/2024/1/e53365" } @Article{info:doi/10.2196/39336, author="Braun, David and Ingram, Daniel and Ingram, David and Khan, Bilal and Marsh, Jessecae and McAndrew, Thomas", title="Crowdsourced Perceptions of Human Behavior to Improve Computational Forecasts of US National Incident Cases of COVID-19: Survey Study", journal="JMIR Public Health Surveill", year="2022", month="Dec", day="30", volume="8", number="12", pages="e39336", keywords="crowdsourcing", keywords="COVID-19", keywords="forecasting", keywords="human judgment", abstract="Background: Past research has shown that various signals associated with human behavior (eg, social media engagement) can benefit computational forecasts of COVID-19. One behavior that has been shown to reduce the spread of infectious agents is compliance with nonpharmaceutical interventions (NPIs). However, the extent to which the public adheres to NPIs is difficult to measure and consequently difficult to incorporate into computational forecasts of infectious diseases. Soliciting judgments from many individuals (ie, crowdsourcing) can lead to surprisingly accurate estimates of both current and future targets of interest. Therefore, asking a crowd to estimate community-level compliance with NPIs may prove to be an accurate and predictive signal of an infectious disease such as COVID-19. Objective: We aimed to show that crowdsourced perceptions of compliance with NPIs can be a fast and reliable signal that can predict the spread of an infectious agent. We showed this by measuring the correlation between crowdsourced perceptions of NPIs and US incident cases of COVID-19 1-4 weeks ahead, and evaluating whether incorporating crowdsourced perceptions improves the predictive performance of a computational forecast of incident cases. 
Methods: For 36 weeks from September 2020 to April 2021, we asked 2 crowds 21 questions about their perceptions of community adherence to NPIs and public health guidelines, and collected 10,120 responses. Self-reported state residency was compared to estimates from the US census to determine the representativeness of the crowds. Crowdsourced NPI signals were mapped to 21 mean perceived adherence (MEPA) signals and analyzed descriptively to investigate features, such as how MEPA signals changed over time and whether MEPA time series could be clustered into groups based on response patterns. We investigated whether MEPA signals were associated with incident cases of COVID-19 1-4 weeks ahead by (1) estimating correlations between MEPA and incident cases, and (2) including MEPA into computational forecasts. Results: The crowds were mostly geographically representative of the US population with slight overrepresentation in the Northeast. MEPA signals tended to converge toward moderate levels of compliance throughout the survey period, and an unsupervised analysis revealed signals clustered into 4 groups roughly based on the type of question being asked. Several MEPA signals linearly correlated with incident cases of COVID-19 1-4 weeks ahead at the US national level. Including questions related to social distancing, testing, and limiting large gatherings increased out-of-sample predictive performance for probabilistic forecasts of incident cases of COVID-19 1-3 weeks ahead when compared to a model that was trained on only past incident cases. Conclusions: Crowdsourced perceptions of nonpharmaceutical adherence may be an important signal to improve forecasts of the trajectory of an infectious agent and increase public health situational awareness. 
", doi="10.2196/39336", url="https://publichealth.jmir.org/2022/12/e39336", url="http://www.ncbi.nlm.nih.gov/pubmed/36219845" } @Article{info:doi/10.2196/42754, author="Wang, Bin and Lai, Junkai and Jin, Feifei and Liao, Xiwen and Zhu, Huan and Yao, Chen", title="Clinical Source Data Production and Quality Control in Real-world Studies: Proposal for Development of the eSource Record System", journal="JMIR Res Protoc", year="2022", month="Dec", day="23", volume="11", number="12", pages="e42754", keywords="electronic medical record", keywords="electronic health record", keywords="eSource", keywords="real-world data", keywords="eSource record", keywords="clinical research", keywords="data collection", keywords="data transcription", keywords="data quality", keywords="interoperability", abstract="Background: An eSource generally includes the direct capture, collection, and storage of electronic data to simplify clinical research. It can improve data quality and patient safety and reduce clinical trial costs. There has been some eSource-related research progress in relatively large projects. However, most of these studies focused on technical explorations to improve interoperability among systems to reuse retrospective data for research. Few studies have explored source data collection and quality control during prospective data collection from a methodological perspective. Objective: This study aimed to design a clinical source data collection method that is suitable for real-world studies and meets the data quality standards for clinical research and to improve efficiency when writing electronic medical records (EMRs). Methods: On the basis of our group's previous research experience, TransCelerate BioPharm Inc eSource logical architecture, and relevant regulations and guidelines, we designed a source data collection method and invited relevant stakeholders to optimize it. 
On the basis of this method, we proposed the eSource record (ESR) system as a solution and invited experts with different roles in the contract research organization company to discuss and design a flowchart for data connection between the ESR and electronic data capture (EDC). Results: The ESR method included 5 steps: research project preparation, initial survey collection, in-hospital medical record writing, out-of-hospital follow-up, and electronic case report form (eCRF) traceability. The data connection between the ESR and EDC covered the clinical research process from creating the eCRF to collecting data for the analysis. The intelligent data acquisition function of the ESR will automatically complete the empty eCRF to create an eCRF with values. When the clinical research associate and data manager conduct data verification, they can query the certified copy database through interface traceability and send data queries. The data queries are transmitted to the ESR through the EDC interface. The EDC and EMR systems interoperate through the ESR. The EMR and EDC systems transmit data to the ESR system through the data standards of the Health Level Seven Clinical Document Architecture and the Clinical Data Interchange Standards Consortium operational data model, respectively. When the implemented data standards for a given system are not consistent, the ESR will approach the problem by first automating mappings between standards and then handling extensions or corrections to a given data format through human evaluation. Conclusions: The source data collection method proposed in this study will help to realize eSource's new strategy. The ESR solution is standardized and sustainable. It aims to ensure that research data meet the attributable, legible, contemporaneous, original, accurate, complete, consistent, enduring, and available standards for clinical research data quality and to provide a new model for prospective data collection in real-world studies. 
", doi="10.2196/42754", url="https://www.researchprotocols.org/2022/12/e42754", url="http://www.ncbi.nlm.nih.gov/pubmed/36563036" } @Article{info:doi/10.2196/41200, author="MacKenna, Brian and Curtis, J. Helen and Hopcroft, M. Lisa E. and Walker, J. Alex and Croker, Richard and Macdonald, Orla and Evans, W. Stephen J. and Inglesby, Peter and Evans, David and Morley, Jessica and Bacon, J. Sebastian C. and Goldacre, Ben", title="Identifying Patterns of Clinical Interest in Clinicians' Treatment Preferences: Hypothesis-free Data Science Approach to Prioritizing Prescribing Outliers for Clinical Review", journal="JMIR Med Inform", year="2022", month="Dec", day="20", volume="10", number="12", pages="e41200", keywords="prescribing", keywords="NHS England", keywords="antipsychotics", keywords="promazine hydrochloride", keywords="pericyazine", keywords="clinical audit", keywords="data science", abstract="Background: Data analysis is used to identify signals suggestive of variation in treatment choice or clinical outcome. Analyses to date have generally focused on a hypothesis-driven approach. Objective: This study aimed to develop a hypothesis-free approach to identify unusual prescribing behavior in primary care data. We aimed to apply this methodology to a national data set in a cross-sectional study to identify chemicals with significant variation in use across Clinical Commissioning Groups (CCGs) for further clinical review, thereby demonstrating proof of concept for prioritization approaches. Methods: We developed a data-driven approach to identify unusual prescribing behavior in primary care data. This approach first applies a set of filtering steps to identify chemicals with prescribing rate distributions likely to contain outliers and then applies 2 ranking approaches to identify the most extreme outliers among those candidates. This methodology was applied to 3 months of national prescribing data (June-August 2017). 
Results: Our methodology provides rankings for all chemicals by administrative region. We provide illustrative results for 2 antipsychotic drugs of particular clinical interest: promazine hydrochloride and pericyazine, which rank highly by outlier metrics. Specifically, our method identifies that, while promazine hydrochloride and pericyazine are barely used by most clinicians (with national prescribing rates of 11.1 and 6.2 per 1000 antipsychotic prescriptions, respectively), they make up a substantial proportion of antipsychotic prescribing in 2 small geographic regions in England during the study period (with maximum regional prescribing rates of 298.7 and 241.1 per 1000 antipsychotic prescriptions, respectively). Conclusions: Our hypothesis-free approach is able to identify candidates for audit and review in clinical practice. To illustrate this, we provide 2 examples of very unusual antipsychotics used disproportionately in 2 small geographic areas of England. ", doi="10.2196/41200", url="https://medinform.jmir.org/2022/12/e41200", url="http://www.ncbi.nlm.nih.gov/pubmed/36538350" } @Article{info:doi/10.2196/40743, author="Jin, Qiao and Tan, Chuanqi and Chen, Mosha and Yan, Ming and Zhang, Ningyu and Huang, Songfang and Liu, Xiaozhong", title="State-of-the-Art Evidence Retriever for Precision Medicine: Algorithm Development and Validation", journal="JMIR Med Inform", year="2022", month="Dec", day="15", volume="10", number="12", pages="e40743", keywords="precision medicine", keywords="evidence-based medicine", keywords="information retrieval", keywords="active learning", keywords="pretrained language models", keywords="digital health intervention", keywords="data retrieval", keywords="big data", keywords="algorithm development", abstract="Background: Under the paradigm of precision medicine (PM), patients with the same disease can receive different personalized therapies according to their clinical and genetic features. 
These therapies are determined by the totality of all available clinical evidence, including results from case reports, clinical trials, and systematic reviews. However, it is increasingly difficult for physicians to find such evidence from scientific publications, whose size is growing at an unprecedented pace. Objective: In this work, we propose the PM-Search system to facilitate the retrieval of clinical literature that contains critical evidence for or against giving specific therapies to certain cancer patients. Methods: The PM-Search system combines a baseline retriever that selects document candidates at a large scale and an evidence reranker that finely reorders the candidates based on their evidence quality. The baseline retriever uses query expansion and keyword matching with the ElasticSearch retrieval engine, and the evidence reranker fits pretrained language models to expert annotations that are derived from an active learning strategy. Results: The PM-Search system achieved the best performance in the retrieval of high-quality clinical evidence at the Text Retrieval Conference PM Track 2020, outperforming the second-ranking systems by large margins (0.4780 vs 0.4238 for standard normalized discounted cumulative gain at rank 30 and 0.4519 vs 0.4193 for exponential normalized discounted cumulative gain at rank 30). Conclusions: We present PM-Search, a state-of-the-art search engine to assist the practicing of evidence-based PM. PM-Search uses a novel Bidirectional Encoder Representations from Transformers for Biomedical Text Mining--based active learning strategy that models evidence quality and improves the model performance. Our analyses show that evidence quality is a distinct aspect from general relevance, and specific modeling of evidence quality beyond general relevance is required for a PM search engine. 
", doi="10.2196/40743", url="https://medinform.jmir.org/2022/12/e40743", url="http://www.ncbi.nlm.nih.gov/pubmed/36409468" } @Article{info:doi/10.2196/37239, author="Bhavnani, K. Suresh and Zhang, Weibin and Visweswaran, Shyam and Raji, Mukaila and Kuo, Yong-Fang", title="A Framework for Modeling and Interpreting Patient Subgroups Applied to Hospital Readmission: Visual Analytical Approach", journal="JMIR Med Inform", year="2022", month="Dec", day="7", volume="10", number="12", pages="e37239", keywords="visual analytics", keywords="Bipartite Network analysis", keywords="hospital readmission", keywords="precision medicine", keywords="modeling", keywords="Medicare", abstract="Background: A primary goal of precision medicine is to identify patient subgroups and infer their underlying disease processes with the aim of designing targeted interventions. Although several studies have identified patient subgroups, there is a considerable gap between the identification of patient subgroups and their modeling and interpretation for clinical applications. Objective: This study aimed to develop and evaluate a novel analytical framework for modeling and interpreting patient subgroups (MIPS) using a 3-step modeling approach: visual analytical modeling to automatically identify patient subgroups and their co-occurring comorbidities and determine their statistical significance and clinical interpretability; classification modeling to classify patients into subgroups and measure its accuracy; and prediction modeling to predict a patient's risk of an adverse outcome and compare its accuracy with and without patient subgroup information. 
Methods: The MIPS framework was developed using bipartite networks to identify patient subgroups based on frequently co-occurring high-risk comorbidities, multinomial logistic regression to classify patients into subgroups, and hierarchical logistic regression to predict the risk of an adverse outcome using subgroup membership compared with standard logistic regression without subgroup membership. The MIPS framework was evaluated for 3 hospital readmission conditions: chronic obstructive pulmonary disease (COPD), congestive heart failure (CHF), and total hip arthroplasty/total knee arthroplasty (THA/TKA) (COPD: n=29,016; CHF: n=51,550; THA/TKA: n=16,498). For each condition, we extracted cases defined as patients readmitted within 30 days of hospital discharge. Controls were defined as patients not readmitted within 90 days of discharge, matched by age, sex, race, and Medicaid eligibility. Results: In each condition, the visual analytical model identified patient subgroups that were statistically significant (Q=0.17, 0.17, 0.31; P<.001, <.001, <.05), significantly replicated (Rand Index=0.92, 0.94, 0.89; P<.001, <.001, <.01), and clinically meaningful to clinicians. In each condition, the classification model had high accuracy in classifying patients into subgroups (mean accuracy=99.6\%, 99.34\%, 99.86\%). In 2 conditions (COPD and THA/TKA), the hierarchical prediction model had a small but statistically significant improvement in discriminating between readmitted and not readmitted patients as measured by net reclassification improvement (0.059, 0.11) but not as measured by the C-statistic or integrated discrimination improvement. Conclusions: Although the visual analytical models identified statistically and clinically significant patient subgroups, the results pinpoint the need to analyze subgroups at different levels of granularity for improving the interpretability of intra- and intercluster associations. 
The high accuracy of the classification models reflects the strong separation of patient subgroups, despite the size and density of the data sets. Finally, the small improvement in predictive accuracy suggests that comorbidities alone were not strong predictors of hospital readmission and that more sophisticated subgroup modeling methods are needed. Such advances could improve the interpretability and predictive accuracy of patient subgroup models for reducing the risk of hospital readmission, and beyond. ", doi="10.2196/37239", url="https://medinform.jmir.org/2022/12/e37239", url="http://www.ncbi.nlm.nih.gov/pubmed/35537203" } @Article{info:doi/10.2196/42185, author="Tang, Ri and Zhang, Shuyi and Ding, Chenling and Zhu, Mingli and Gao, Yuan", title="Artificial Intelligence in Intensive Care Medicine: Bibliometric Analysis", journal="J Med Internet Res", year="2022", month="Nov", day="30", volume="24", number="11", pages="e42185", keywords="intensive care medicine", keywords="artificial intelligence", keywords="bibliometric analysis", keywords="machine learning", keywords="sepsis", abstract="Background: Interest in critical care--related artificial intelligence (AI) research is growing rapidly. However, the literature is still lacking in comprehensive bibliometric studies that measure and analyze scientific publications globally. Objective: The objective of this study was to assess the global research trends in AI in intensive care medicine based on publication outputs, citations, coauthorships between nations, and co-occurrences of author keywords. Methods: A total of 3619 documents published until March 2022 were retrieved from the Scopus database. After selecting the document type as articles, the titles and abstracts were checked for eligibility. In the final bibliometric study using VOSviewer, 1198 papers were included. 
The growth rate of publications, preferred journals, leading research countries, international collaborations, and top institutions were computed. Results: The number of publications increased steeply between 2018 and 2022, accounting for 72.53\% (869/1198) of all the included papers. The United States and China contributed to approximately 55.17\% (661/1198) of the total publications. Of the 15 most productive institutions, 9 were among the top 100 universities worldwide. Detecting clinical deterioration, monitoring, predicting disease progression, mortality, prognosis, and classifying disease phenotypes or subtypes were some of the research hot spots for AI in patients who are critically ill. Neural networks, decision support systems, machine learning, and deep learning were all commonly used AI technologies. Conclusions: This study highlights popular areas in AI research aimed at improving health care in intensive care units, offers a comprehensive look at the research trend in AI application in the intensive care unit, and provides an insight into potential collaboration and prospects for future research. The 30 articles that received the most citations were listed in detail. For AI-based clinical research to be sufficiently convincing for routine critical care practice, collaborative research efforts are needed to increase the maturity and robustness of AI-driven models. 
", doi="10.2196/42185", url="https://www.jmir.org/2022/11/e42185", url="http://www.ncbi.nlm.nih.gov/pubmed/36449345" } @Article{info:doi/10.2196/42261, author="Ljaji{\'c}, Adela and Prodanovi{\'c}, Nikola and Medvecki, Darija and Ba{\vs}aragin, Bojana and Mitrovi{\'c}, Jelena", title="Uncovering the Reasons Behind COVID-19 Vaccine Hesitancy in Serbia: Sentiment-Based Topic Modeling", journal="J Med Internet Res", year="2022", month="Nov", day="17", volume="24", number="11", pages="e42261", keywords="topic modeling", keywords="sentiment analysis", keywords="LDA", keywords="NMF", keywords="BERT", keywords="vaccine hesitancy", keywords="COVID-19", keywords="Twitter", keywords="Serbian language processing", keywords="vaccine", keywords="public health", keywords="NLP", keywords="vaccination", keywords="Serbia", abstract="Background: Since the first COVID-19 vaccine appeared, there has been a growing tendency to automatically determine public attitudes toward it. In particular, it was important to find the reasons for vaccine hesitancy, since it was directly correlated with pandemic protraction. Natural language processing (NLP) and public health researchers have turned to social media (eg, Twitter, Reddit, and Facebook) for user-created content from which they can gauge public opinion on vaccination. To automatically process such content, they use a number of NLP techniques, most notably topic modeling. Topic modeling enables the automatic uncovering and grouping of hidden topics in the text. When applied to content that expresses a negative sentiment toward vaccination, it can give direct insight into the reasons for vaccine hesitancy. Objective: This study applies NLP methods to classify vaccination-related tweets by sentiment polarity and uncover the reasons for vaccine hesitancy among the negative tweets in the Serbian language. 
Methods: To study the attitudes and beliefs behind vaccine hesitancy, we collected 2 batches of tweets that mention some aspects of COVID-19 vaccination. The first batch of 8817 tweets was manually annotated as either relevant or irrelevant regarding the COVID-19 vaccination sentiment, and then the relevant tweets were annotated as positive, negative, or neutral. We used the annotated tweets to train a sequential bidirectional encoder representations from transformers (BERT)-based classifier for 2 tweet classification tasks to augment this initial data set. The first classifier distinguished between relevant and irrelevant tweets. The second classifier used the relevant tweets and classified them as negative, positive, or neutral. This sequential classifier was used to annotate the second batch of tweets. The combined data sets resulted in 3286 tweets with a negative sentiment: 1770 (53.9\%) from the manually annotated data set and 1516 (46.1\%) as a result of automatic classification. Topic modeling methods (latent Dirichlet allocation [LDA] and nonnegative matrix factorization [NMF]) were applied using the 3286 preprocessed tweets to detect the reasons for vaccine hesitancy. Results: The relevance classifier achieved an F-score of 0.91 and 0.96 for relevant and irrelevant tweets, respectively. The sentiment polarity classifier achieved an F-score of 0.87, 0.85, and 0.85 for negative, neutral, and positive sentiments, respectively. By summarizing the topics obtained in both models, we extracted 5 main groups of reasons for vaccine hesitancy: concern over vaccine side effects, concern over vaccine effectiveness, concern over insufficiently tested vaccines, mistrust of authorities, and conspiracy theories. Conclusions: This paper presents a combination of NLP methods applied to find the reasons for vaccine hesitancy in Serbia. Given these reasons, it is now possible to better understand the concerns of people regarding the vaccination process. 
", doi="10.2196/42261", url="https://www.jmir.org/2022/11/e42261", url="http://www.ncbi.nlm.nih.gov/pubmed/36301673" } @Article{info:doi/10.2196/38450, author="van der Ploeg, Tjeerd and Gobbens, J. Robbert J.", title="Prediction of COVID-19 Infections for Municipalities in the Netherlands: Algorithm Development and Interpretation", journal="JMIR Public Health Surveill", year="2022", month="Oct", day="20", volume="8", number="10", pages="e38450", keywords="municipality properties", keywords="data merging", keywords="modeling technique", keywords="variable selection", keywords="prediction model", keywords="public health", keywords="COVID-19", keywords="surveillance", keywords="static data", keywords="Dutch public domain", keywords="pandemic", keywords="Wuhan", keywords="virus", keywords="public", keywords="infections", keywords="fever", keywords="cough", keywords="congestion", keywords="fatigue", keywords="symptoms", keywords="pneumonia", keywords="dyspnea", keywords="death", abstract="Background: COVID-19 was first identified in December 2019 in the city of Wuhan, China. The virus quickly spread and was declared a pandemic on March 11, 2020. After infection, symptoms such as fever, a (dry) cough, nasal congestion, and fatigue can develop. In some cases, the virus causes severe complications such as pneumonia and dyspnea and could result in death. The virus also spread rapidly in the Netherlands, a small and densely populated country with an aging population. Health care in the Netherlands is of a high standard, but there were nevertheless problems with hospital capacity, such as the number of available beds and staff. There were also regions and municipalities that were hit harder than others. In the Netherlands, there are important data sources available for daily COVID-19 numbers and information about municipalities. 
Objective: We aimed to predict the cumulative number of confirmed COVID-19 infections per 10,000 inhabitants per municipality in the Netherlands, using a data set with the properties of 355 municipalities in the Netherlands and advanced modeling techniques. Methods: We collected relevant static data per municipality from data sources that were available in the Dutch public domain and merged these data with the dynamic daily number of infections from January 1, 2020, to May 9, 2021, resulting in a data set with 355 municipalities in the Netherlands and variables grouped into 20 topics. The modeling techniques random forest and multiple fractional polynomials were used to construct a prediction model for predicting the cumulative number of confirmed COVID-19 infections per 10,000 inhabitants per municipality in the Netherlands. Results: The final prediction model had an R2 of 0.63. Important properties for predicting the cumulative number of confirmed COVID-19 infections per 10,000 inhabitants in a municipality in the Netherlands were exposure to particulate matter with diameters <10 $\mu$m (PM10) in the air, the percentage of Labour party voters, and the number of children in a household. Conclusions: Data about municipality properties in relation to the cumulative number of confirmed infections in a municipality in the Netherlands can give insight into the most important properties of a municipality for predicting the cumulative number of confirmed COVID-19 infections per 10,000 inhabitants in a municipality. This insight can provide policy makers with tools to cope with COVID-19 and may also be of value in the event of a future pandemic, so that municipalities are better prepared. 
", doi="10.2196/38450", url="https://publichealth.jmir.org/2022/10/e38450", url="http://www.ncbi.nlm.nih.gov/pubmed/36219835" } @Article{info:doi/10.2196/39373, author="Karystianis, George and Cabral, Carines Rina and Adily, Armita and Lukmanjaya, Wilson and Schofield, Peter and Buchan, Iain and Nenadic, Goran and Butler, Tony", title="Mental Illness Concordance Between Hospital Clinical Records and Mentions in Domestic Violence Police Narratives: Data Linkage Study", journal="JMIR Form Res", year="2022", month="Oct", day="20", volume="6", number="10", pages="e39373", keywords="data linkage", keywords="mental health", keywords="domestic violence", keywords="police records", keywords="hospital records", keywords="text mining", abstract="Background: To better understand domestic violence, data sources from multiple sectors such as police, justice, health, and welfare are needed. Linking police data to data collections from other agencies could provide unique insights and promote an all-of-government response to domestic violence. The New South Wales Police Force attends domestic violence events and records information in the form of both structured data and a free-text narrative, with the latter shown to be a rich source of information on the mental health status of persons of interest (POIs) and victims, abuse types, and sustained injuries. Objective: This study aims to examine the concordance (ie, matching) between mental illness mentions extracted from the police's event narratives and mental health diagnoses from hospital and emergency department records. Methods: We applied a rule-based text mining method on 416,441 domestic violence police event narratives between December 2005 and January 2016 to identify mental illness mentions for POIs and victims. 
Using different window periods (1, 3, 6, and 12 months) before and after a domestic violence event, we linked the extracted mental illness mentions of victims and POIs to clinical records from the Emergency Department Data Collection and the Admitted Patient Data Collection in New South Wales, Australia using a unique identifier for each individual in the same cohort. Results: Using a 2-year window period (ie, 12 months before and after the domestic violence event), less than 1\% (3020/416,441, 0.73\%) of events had a mental illness mention and also a corresponding hospital record. About 16\% of domestic violence events for both POIs (382/2395, 15.95\%) and victims (101/631, 16.01\%) had an agreement between hospital records and police narrative mentions of mental illness. A total of 51,025/416,441 (12.25\%) events for POIs and 14,802/416,441 (3.55\%) events for victims had mental illness mentions in their narratives but no hospital record. Only 841 events for POIs and 919 events for victims had a documented hospital record within 48 hours of the domestic violence event. Conclusions: Our findings suggest that current surveillance systems used to report on domestic violence may be enhanced by accessing rich information (ie, mental illness) contained in police text narratives, made available for both POIs and victims through the application of text mining. Additional insights can be gained by linkage to other health and welfare data collections. 
", doi="10.2196/39373", url="https://formative.jmir.org/2022/10/e39373", url="http://www.ncbi.nlm.nih.gov/pubmed/36264613" } @Article{info:doi/10.2196/33720, author="Kavianpour, Sanaz and Sutherland, James and Mansouri-Benssassi, Esma and Coull, Natalie and Jefferson, Emily", title="Next-Generation Capabilities in Trusted Research Environments: Interview Study", journal="J Med Internet Res", year="2022", month="Sep", day="20", volume="24", number="9", pages="e33720", keywords="data safe haven", keywords="health data analysis", keywords="trusted research environment", keywords="TRE", abstract="Background: A Trusted Research Environment (TRE; also known as a Safe Haven) is an environment supported by trained staff and agreed processes (principles and standards), providing access to data for research while protecting patient confidentiality. Accessing sensitive data without compromising the privacy and security of the data is a complex process. Objective: This paper presents the security measures, administrative procedures, and technical approaches adopted by TREs. Methods: We contacted 73 TRE operators, 22 (30\%) of whom, in the United Kingdom and internationally, agreed to be interviewed remotely under a nondisclosure agreement and to complete a questionnaire about their TRE. Results: We observed many similar processes and standards that TREs follow to adhere to the Seven Safes principles. The security processes and TRE capabilities for supporting observational studies using classical statistical methods were mature, and the requirements were well understood. However, we identified limitations in the security measures and capabilities of TREs to support ``next-generation'' requirements such as wide ranges of data types, ability to develop artificial intelligence algorithms and software within the environment, handling of big data, and timely import and export of data. 
Conclusions: We found a lack of software or other automation tools to support the community and limited knowledge of how to meet the next-generation requirements from the research community. Disclosure control for exporting artificial intelligence algorithms and software was found to be particularly challenging, and there is a clear need for additional controls to support this capability within TREs. ", doi="10.2196/33720", url="https://www.jmir.org/2022/9/e33720", url="http://www.ncbi.nlm.nih.gov/pubmed/36125859" } @Article{info:doi/10.2196/38319, author="Singh, Lisa and Gresenz, Roan Carole and Wang, Yanchen and Hu, Sonya", title="Assessing Social Media Data as a Resource for Firearm Research: Analysis of Tweets Pertaining to Firearm Deaths", journal="J Med Internet Res", year="2022", month="Aug", day="25", volume="24", number="8", pages="e38319", keywords="firearms", keywords="fatalities", keywords="Twitter", keywords="firearm research", keywords="social media data", abstract="Background: Historic constraints on research dollars and reliable information have limited firearm research. At the same time, interest in the power and potential of social media analytics, particularly in health contexts, has surged. Objective: The aim of this study is to contribute toward the goal of establishing a foundation for how social media data may best be used, alone or in conjunction with other data resources, to improve the information base for firearm research. Methods: We examined the value of social media data for estimating a firearm outcome for which robust benchmark data exist---specifically, firearm mortality, which is captured in the National Vital Statistics System (NVSS). We hand curated tweet data from the Twitter application programming interface spanning January 1, 2017, to December 31, 2018. We developed machine learning classifiers to identify tweets that pertain to firearm deaths and develop estimates of the volume of Twitter firearm discussion by month. 
We compared within-state variation over time in the volume of tweets pertaining to firearm deaths with within-state trends in NVSS-based estimates of firearm fatalities using Pearson linear correlations. Results: The correlation between the monthly number of firearm fatalities measured by the NVSS and the monthly volume of tweets pertaining to firearm deaths was weak (median 0.081) and highly dispersed across states (range --0.31 to 0.535). The median correlation between month-to-month changes in firearm fatalities in the NVSS and firearm deaths discussed in tweets was moderate (median 0.30) and exhibited less dispersion among states (range --0.06 to 0.69). Conclusions: Our findings suggest that Twitter data may hold value for tracking dynamics in firearm-related outcomes, particularly for relatively populous cities that are identifiable through location mentions in tweet content. The data are likely to be particularly valuable for understanding firearm outcomes not currently measured, not measured well, or not measurable through other available means. This research provides an important building block for future work that continues to develop the usefulness of social media data for firearm research. ", doi="10.2196/38319", url="https://www.jmir.org/2022/8/e38319", url="http://www.ncbi.nlm.nih.gov/pubmed/36006693" } @Article{info:doi/10.2196/41122, author="Kubben, Pieter", title="JMIR Neurotechnology: Connecting Clinical Neuroscience and (Information) Technology", journal="JMIR Neurotech", year="2022", month="Aug", day="11", volume="1", number="1", pages="e41122", keywords="neurotechnology", keywords="neurological disorders", keywords="treatment tools", keywords="chronic neurological disease", keywords="information technology", doi="10.2196/41122", url="https://neuro.jmir.org/2022/1/e41122" } @Article{info:doi/10.2196/39888, author="Kaur, Manpreet and Costello, Jeremy and Willis, Elyse and Kelm, Karen and Reformat, Z. Marek and Bolduc, V. 
Francois", title="Deciphering the Diversity of Mental Models in Neurodevelopmental Disorders: Knowledge Graph Representation of Public Data Using Natural Language Processing", journal="J Med Internet Res", year="2022", month="Aug", day="5", volume="24", number="8", pages="e39888", keywords="concept map", keywords="neurodevelopmental disorder", keywords="knowledge graph", keywords="text analysis", keywords="semantic relatedness", keywords="PubMed", keywords="forums", keywords="mental model", abstract="Background: Understanding how individuals think about a topic, known as the mental model, can significantly improve communication, especially in the medical domain where emotions and implications are high. Neurodevelopmental disorders (NDDs) represent a group of diagnoses, affecting up to 18\% of the global population, involving differences in the development of cognitive or social functions. In this study, we focus on 2 NDDs, attention deficit hyperactivity disorder (ADHD) and autism spectrum disorder (ASD), which involve multiple symptoms and interventions requiring interactions between 2 important stakeholders: parents and health professionals. There is a gap in our understanding of differences between mental models for each stakeholder, making communication between stakeholders more difficult than it could be. Objective: We aim to build knowledge graphs (KGs) from web-based information relevant to each stakeholder as proxies of mental models. These KGs will accelerate the identification of shared and divergent concerns between stakeholders. The developed KGs can help improve knowledge mobilization, communication, and care for individuals with ADHD and ASD. Methods: We created 2 data sets by collecting the posts from web-based forums and PubMed abstracts related to ADHD and ASD. 
We utilized the Unified Medical Language System (UMLS) to detect biomedical concepts and applied Positive Pointwise Mutual Information followed by truncated Singular Value Decomposition to obtain corpus-based concept embeddings for each data set. Each data set is represented as a KG using a property graph model. Semantic relatedness between concepts is calculated to rank the relation strength of concepts and stored in the KG as relation weights. UMLS disorder-relevant semantic types are used to provide additional categorical information about each concept's domain. Results: The developed KGs contain concepts from both data sets, with node sizes representing the co-occurrence frequency of concepts and edge sizes representing relevance between concepts. ADHD- and ASD-related concepts from different semantic types show diverse areas of concern and the complex needs of the conditions. The KGs identify converging and diverging concepts between the health professionals' literature (PubMed) and parental concerns (web-based forums), which may correspond to the differences between mental models for each stakeholder. Conclusions: We show for the first time that generating KGs from web-based data can capture the complex needs of families dealing with ADHD or ASD. Moreover, we showed points of convergence between families' and health professionals' KGs. Natural language processing--based KGs provide access to a large sample size, which is often a limiting factor for traditional in-person mental model mapping. Our work offers high-throughput access to mental model maps, which could be used for further in-person validation, knowledge mobilization projects, and as a basis for communication about potential blind spots from stakeholders in interactions about NDDs. Future research will be needed to identify how concepts could interact together differently for each stakeholder. 
", doi="10.2196/39888", url="https://www.jmir.org/2022/8/e39888", url="http://www.ncbi.nlm.nih.gov/pubmed/35930346" } @Article{info:doi/10.2196/37486, author="Huang, Yanqun and Zheng, Zhimin and Ma, Moxuan and Xin, Xin and Liu, Honglei and Fei, Xiaolu and Wei, Lan and Chen, Hui", title="Improving the Performance of Outcome Prediction for Inpatients With Acute Myocardial Infarction Based on Embedding Representation Learned From Electronic Medical Records: Development and Validation Study", journal="J Med Internet Res", year="2022", month="Aug", day="3", volume="24", number="8", pages="e37486", keywords="representation learning", keywords="skip-gram", keywords="feature association strengths", keywords="feature importance", keywords="mortality risk prediction", keywords="acute myocardial infarction", abstract="Background: The widespread secondary use of electronic medical records (EMRs) promotes health care quality improvement. Representation learning that can automatically extract hidden information from EMR data has gained increasing attention. Objective: We aimed to propose a patient representation with more feature associations and task-specific feature importance to improve the outcome prediction performance for inpatients with acute myocardial infarction (AMI). Methods: Medical concepts, including patients' age, gender, disease diagnoses, laboratory tests, structured radiological features, procedures, and medications, were first embedded into real-value vectors using the improved skip-gram algorithm, where concepts in the context windows were selected by feature association strengths measured by association rule confidence. Then, each patient was represented as the sum of the feature embeddings weighted by the task-specific feature importance, which was applied to facilitate predictive model prediction from global and local perspectives. 
We finally applied the proposed patient representation into mortality risk prediction for 3010 and 1671 AMI inpatients from a public data set and a private data set, respectively, and compared it with several reference representation methods in terms of the area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), and F1-score. Results: Compared with the reference methods, the proposed embedding-based representation showed consistently superior predictive performance on the 2 data sets, achieving mean AUROCs of 0.878 and 0.973, AUPRCs of 0.220 and 0.505, and F1-scores of 0.376 and 0.674 for the public and private data sets, respectively, while the greatest AUROCs, AUPRCs, and F1-scores among the reference methods were 0.847 and 0.939, 0.196 and 0.283, and 0.344 and 0.361 for the public and private data sets, respectively. Feature importance integrated in patient representation reflected features that were also critical in prediction tasks and clinical practice. Conclusions: The introduction of feature associations and feature importance facilitated an effective patient representation and contributed to prediction performance improvement and model interpretation. 
", doi="10.2196/37486", url="https://www.jmir.org/2022/8/e37486", url="http://www.ncbi.nlm.nih.gov/pubmed/35921141" } @Article{info:doi/10.2196/37817, author="Tang, Wentai and Wang, Jian and Lin, Hongfei and Zhao, Di and Xu, Bo and Zhang, Yijia and Yang, Zhihao", title="A Syntactic Information--Based Classification Model for Medical Literature: Algorithm Development and Validation Study", journal="JMIR Med Inform", year="2022", month="Aug", day="2", volume="10", number="8", pages="e37817", keywords="medical relation extraction", keywords="syntactic features", keywords="pruning method", keywords="neural networks", keywords="medical literature", keywords="medical text", keywords="extraction", keywords="syntactic", keywords="classification", keywords="interaction", keywords="text", keywords="literature", keywords="semantic", abstract="Background: The ever-increasing volume of medical literature necessitates the classification of medical literature. Medical relation extraction is a typical method of classifying a large volume of medical literature. With the development of arithmetic power, medical relation extraction models have evolved from rule-based models to neural network models. The single neural network model discards the shallow syntactic information while discarding the traditional rules. Therefore, we propose a syntactic information--based classification model that complements and equalizes syntactic information to enhance the model. Objective: We aim to complete a syntactic information--based relation extraction model for more efficient medical literature classification. Methods: We devised 2 methods for enhancing syntactic information in the model. First, we introduced shallow syntactic information into the convolutional neural network to enhance nonlocal syntactic interactions. Second, we devise a cross-domain pruning method to equalize local and nonlocal syntactic interactions. 
Results: We experimented with 3 data sets related to the classification of medical literature. The F1 values were 65.5\% and 91.5\% on the BioCreative VI CPR and Phenotype-Gene Relationship data sets, respectively, and the accuracy was 88.7\% on the PubMed data set. Our model outperforms the current state-of-the-art baseline model in the experiments. Conclusions: Our model based on syntactic information effectively enhances medical relation extraction. Furthermore, the results of the experiments show that shallow syntactic information helps obtain nonlocal interaction in sentences and effectively reinforces syntactic features. It also provides new ideas for future research directions. ", doi="10.2196/37817", url="https://medinform.jmir.org/2022/8/e37817", url="http://www.ncbi.nlm.nih.gov/pubmed/35917162" } @Article{info:doi/10.2196/27990, author="Rom{\'a}n-Villar{\'a}n, Esther and Alvarez-Romero, Celia and Mart{\'i}nez-Garc{\'i}a, Alicia and Escobar-Rodr{\'i}guez, Antonio German and Garc{\'i}a-Lozano, Jos{\'e} Mar{\'i}a and Bar{\'o}n-Franco, Bosco and Moreno-Gavi{\~n}o, Lourdes and Moreno-Conde, Jes{\'u}s and Rivas-Gonz{\'a}lez, Antonio Jos{\'e} and Parra-Calder{\'o}n, Luis Carlos", title="A Personalized Ontology-Based Decision Support System for Complex Chronic Patients: Retrospective Observational Study", journal="JMIR Form Res", year="2022", month="Aug", day="2", volume="6", number="8", pages="e27990", keywords="adherence", keywords="ontology", keywords="clinical decision support system", keywords="CDSS", keywords="complex chronic patients", keywords="functional validation", keywords="multimorbidity", keywords="polypharmacy", keywords="atrial fibrillation", keywords="anticoagulants", abstract="Background: Due to an increase in life expectancy, the prevalence of chronic diseases is also on the rise. 
Clinical practice guidelines (CPGs) provide recommendations for suitable interventions regarding different chronic diseases, but a deficiency in the implementation of these CPGs has been identified. The PITeS-TiiSS (Telemedicine and eHealth Innovation Platform: Information Communications Technology for Research and Information Challenges in Health Services) tool, a personalized ontology-based clinical decision support system (CDSS), aims to reduce variability, prevent errors, and consider interactions between different CPG recommendations, among other benefits. Objective: The aim of this study is to design, develop, and validate an ontology-based CDSS that provides personalized recommendations related to drug prescription. The target population is older adult patients with chronic diseases and polypharmacy, and the goal is to reduce complications related to these types of conditions while offering integrated care. Methods: A study scenario about atrial fibrillation and treatment with anticoagulants was selected to validate the tool. After this, a series of knowledge sources were identified, including CPGs, PROFUND index, LESS/CHRON criteria, and STOPP/START criteria, to extract the information. Modeling was carried out using an ontology, and mapping was done with Health Level 7 Fast Healthcare Interoperability Resources (HL7 FHIR) and Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT; International Health Terminology Standards Development Organisation). Once the CDSS was developed, validation was carried out by using a retrospective case study. Results: This project was funded in January 2015 and approved by the Virgen del Rocio University Hospital ethics committee on November 24, 2015. Two different tasks were carried out to test the functioning of the tool. First, retrospective data from a real patient who met the inclusion criteria were used. 
Second, the analysis of an adoption model was performed through the study of the requirements and characteristics that a CDSS must meet in order to be well accepted and used by health professionals. The results are favorable and allow the proposed research to continue to the next phase. Conclusions: An ontology-based CDSS was successfully designed, developed, and validated. However, in future work, validation in a real environment should be performed to ensure the tool is usable and reliable. ", doi="10.2196/27990", url="https://formative.jmir.org/2022/8/e27990", url="http://www.ncbi.nlm.nih.gov/pubmed/35916719" } @Article{info:doi/10.2196/38068, author="Xu, Ran and Divito, Joseph and Bannor, Richard and Schroeder, Matthew and Pagoto, Sherry", title="Predicting Participant Engagement in a Social Media--Delivered Lifestyle Intervention Using Microlevel Conversational Data: Secondary Analysis of Data From a Pilot Randomized Controlled Trial", journal="JMIR Form Res", year="2022", month="Jul", day="28", volume="6", number="7", pages="e38068", keywords="weight loss", keywords="social media intervention", keywords="engagement", keywords="data science", keywords="natural language processing", keywords="NLP", keywords="social media", keywords="lifestyle", keywords="machine learning", keywords="mobile phone", abstract="Background: Social media--delivered lifestyle interventions have shown promising outcomes, often generating modest but significant weight loss. Participant engagement appears to be an important predictor of weight loss outcomes; however, engagement generally declines over time and is highly variable both within and across studies. Research on factors that influence participant engagement remains scant in the context of social media--delivered lifestyle interventions. 
Objective: This study aimed to identify predictors of participant engagement from the content generated during a social media--delivered lifestyle intervention, including characteristics of the posts, the conversation that followed the post, and participants' previous engagement patterns. Methods: We performed secondary analyses using data from a pilot randomized trial that delivered 2 lifestyle interventions via Facebook. We analyzed 80 participants' engagement data over a 16-week intervention period and linked them to predictors, including characteristics of the posts, conversations that followed the post, and participants' previous engagement, using a mixed-effects model. We also performed machine learning--based classification to confirm the importance of the significant predictors previously identified and explore how well these measures can predict whether participants will engage with a specific post. Results: The probability of participants' engagement with each post decreased by 0.28\% each week (P<.001; 95\% CI 0.16\%-0.4\%). The probability of participants engaging with posts generated by interventionists was 6.3\% (P<.001; 95\% CI 5.1\%-7.5\%) higher than posts generated by other participants. Participants also had a 6.5\% (P<.001; 95\% CI 4.9\%-8.1\%) and 6.1\% (P<.001; 95\% CI 4.1\%-8.1\%) higher probability of engaging with posts that directly mentioned weight and goals, respectively, than other types of posts. Participants were 44.8\% (P<.001; 95\% CI 42.8\%-46.9\%) and 46\% (P<.001; 95\% CI 44.1\%-48.0\%) more likely to engage with a post when they were replied to by other participants and by interventionists, respectively. A 1 SD decrease in the sentiment of the conversation on a specific post was associated with a 5.4\% (P<.001; 95\% CI 4.9\%-5.9\%) increase in the probability of participants' subsequent engagement with the post. 
Participants' engagement in previous posts was also a predictor of engagement in subsequent posts (P<.001; 95\% CI 0.74\%-0.79\%). Moreover, using a machine learning approach, we confirmed the importance of the predictors previously identified and achieved an accuracy of 90.9\% in terms of predicting participants' engagement using a balanced testing sample with 1600 observations. Conclusions: Findings revealed several predictors of engagement derived from the content generated by interventionists and other participants. Results have implications for increasing engagement in asynchronous, remotely delivered lifestyle interventions, which could improve outcomes. Our results also point to the potential of data science and natural language processing to analyze microlevel conversational data and identify factors influencing participant engagement. Future studies should validate these results in larger trials. Trial Registration: ClinicalTrials.gov NCT02656680; https://clinicaltrials.gov/ct2/show/NCT02656680 ", doi="10.2196/38068", url="https://formative.jmir.org/2022/7/e38068", url="http://www.ncbi.nlm.nih.gov/pubmed/35900824" } @Article{info:doi/10.2196/37201, author="Ahne, Adrian and Khetan, Vivek and Tannier, Xavier and Rizvi, Hassan Md Imbesat and Czernichow, Thomas and Orchard, Francisco and Bour, Charline and Fano, Andrew and Fagherazzi, Guy", title="Extraction of Explicit and Implicit Cause-Effect Relationships in Patient-Reported Diabetes-Related Tweets From 2017 to 2021: Deep Learning Approach", journal="JMIR Med Inform", year="2022", month="Jul", day="19", volume="10", number="7", pages="e37201", keywords="causality", keywords="deep learning", keywords="natural language processing", keywords="diabetes", keywords="social media", keywords="causal relation extraction", keywords="social media data", keywords="machine learning", abstract="Background: Intervening in and preventing diabetes distress requires an understanding of its causes and, in particular, from a 
patient's perspective. Social media data provide direct access to how patients see and understand their disease and consequently show the causes of diabetes distress. Objective: Leveraging machine learning methods, we aim to extract both explicit and implicit cause-effect relationships in patient-reported diabetes-related tweets and provide a methodology to better understand the opinions, feelings, and observations shared within the diabetes online community from a causality perspective. Methods: More than 30 million diabetes-related tweets in English were collected between April 2017 and January 2021. Deep learning and natural language processing methods were applied to focus on tweets with personal and emotional content. A cause-effect tweet data set was manually labeled and used to train (1) a fine-tuned BERTweet model to detect causal sentences containing a causal relation and (2) a conditional random field model with Bidirectional Encoder Representations from Transformers (BERT)-based features to extract possible cause-effect associations. Causes and effects were clustered in a semisupervised approach and visualized in an interactive cause-effect network. Results: Causal sentences were detected with a recall of 68\% in an imbalanced data set. A conditional random field model with BERT-based features outperformed a fine-tuned BERT model for cause-effect detection with a macro recall of 68\%. This led to 96,676 sentences with cause-effect relationships. ``Diabetes'' was identified as the central cluster followed by ``death'' and ``insulin.'' Insulin pricing--related causes were frequently associated with death. Conclusions: A novel methodology was developed to detect causal sentences and identify both explicit and implicit, single and multiword cause, and the corresponding effect, as expressed in diabetes-related tweets leveraging BERT-based architectures and visualized as cause-effect network. 
Extracting causal associations in real life, patient-reported outcomes in social media data provide a useful complementary source of information in diabetes research. ", doi="10.2196/37201", url="https://medinform.jmir.org/2022/7/e37201", url="http://www.ncbi.nlm.nih.gov/pubmed/35852829" } @Article{info:doi/10.2196/29056, author="Florensa, D{\'i}dac and Mateo-Forn{\'e}s, Jordi and Solsona, Francesc and Pedrol Aige, Teresa and Mesas Juli{\'o}, Miquel and Pi{\~n}ol, Ramon and Godoy, Pere", title="Use of Multiple Correspondence Analysis and K-means to Explore Associations Between Risk Factors and Likelihood of Colorectal Cancer: Cross-sectional Study", journal="J Med Internet Res", year="2022", month="Jul", day="19", volume="24", number="7", pages="e29056", keywords="colorectal cancer", keywords="cancer registry", keywords="multiple correspondence analysis", keywords="k-means", keywords="risk factors", abstract="Background: Previous works have shown that risk factors are associated with an increased likelihood of colorectal cancer. Objective: The purpose of this study was to detect these associations in the region of Lleida (Catalonia) by using multiple correspondence analysis (MCA) and k-means. Methods: This cross-sectional study was made up of 1083 colorectal cancer episodes between 2012 and 2015, extracted from the population-based cancer registry for the province of Lleida (Spain), the Primary Care Centers database, and the Catalan Health Service Register. The data set included risk factors such as smoking and BMI as well as sociodemographic information and tumor details. The relations between the risk factors and patient characteristics were identified using MCA and k-means. Results: The combination of these techniques helps to detect clusters of patients with similar risk factors. Risk of death is associated with being elderly and obesity or being overweight. 
Stage III cancer is associated with people aged {$\geq$}65 years and rural/semiurban populations, while younger people were associated with stage 0. Conclusions: MCA and k-means were significantly useful for detecting associations between risk factors and patient characteristics. These techniques have proven to be effective tools for analyzing the incidence of some factors in colorectal cancer. The outcomes obtained help corroborate suspected trends and stimulate the use of these techniques for finding the association of risk factors with the incidence of other cancers. ", doi="10.2196/29056", url="https://www.jmir.org/2022/7/e29056", url="http://www.ncbi.nlm.nih.gov/pubmed/35852835" } @Article{info:doi/10.2196/38584, author="Jiang, Chao and Ngo, Victoria and Chapman, Richard and Yu, Yue and Liu, Hongfang and Jiang, Guoqian and Zong, Nansu", title="Deep Denoising of Raw Biomedical Knowledge Graph From COVID-19 Literature, LitCovid, and Pubtator: Framework Development and Validation", journal="J Med Internet Res", year="2022", month="Jul", day="6", volume="24", number="7", pages="e38584", keywords="adversarial generative network", keywords="knowledge graph", keywords="deep denoising", keywords="machine learning", keywords="COVID-19", keywords="biomedical", keywords="neural network", keywords="network model", keywords="training data", abstract="Background: Multiple types of biomedical associations of knowledge graphs, including COVID-19--related ones, are constructed based on co-occurring biomedical entities retrieved from recent literature. However, the applications derived from these raw graphs (eg, association predictions among genes, drugs, and diseases) have a high probability of false-positive predictions as co-occurrences in the literature do not always mean there is a true biomedical association between two entities. 
Objective: Data quality plays an important role in training deep neural network models; however, most of the current work in this area has been focused on improving a model's performance with the assumption that the preprocessed data are clean. Here, we studied how to remove noise from raw knowledge graphs with limited labeled information. Methods: The proposed framework used generative-based deep neural networks to generate a graph that can distinguish the unknown associations in the raw training graph. Two generative adversarial network models, NetGAN and Cross-Entropy Low-rank Logits (CELL), were adopted for the edge classification (ie, link prediction), leveraging unlabeled link information based on a real knowledge graph built from LitCovid and Pubtator. Results: The performance of link prediction, especially in the extreme case of training data versus test data at a ratio of 1:9, demonstrated that the proposed method still achieved favorable results (area under the receiver operating characteristic curve >0.8 for the synthetic data set and 0.7 for the real data set), despite the limited amount of testing data available. Conclusions: Our preliminary findings showed the proposed framework achieved promising results for removing noise during data preprocessing of the biomedical knowledge graph, potentially improving the performance of downstream applications by providing cleaner data. 
", doi="10.2196/38584", url="https://www.jmir.org/2022/7/e38584", url="http://www.ncbi.nlm.nih.gov/pubmed/35658098" } @Article{info:doi/10.2196/32728, author="Xu, Yantao and Jiang, Zixi and Kuang, Xinwei and Chen, Xiang and Liu, Hong", title="Research Trends in Immune Checkpoint Blockade for Melanoma: Visualization and Bibliometric Analysis", journal="J Med Internet Res", year="2022", month="Jun", day="27", volume="24", number="6", pages="e32728", keywords="melanoma", keywords="immune checkpoint blockade", keywords="bibliometric", keywords="research trends", keywords="dermatology", keywords="cancer", abstract="Background: Melanoma is one of the most life-threatening skin cancers; immune checkpoint blockade is widely used in the treatment of melanoma because of its remarkable efficacy. Objective: This study aimed to conduct a comprehensive bibliometric analysis of research conducted in recent decades on immune checkpoint blockade for melanoma, while exploring research trends and public interest in this topic. Methods: We summarized the articles in the Web of Science Core Collection on immune checkpoint blockade for melanoma in each year from 1999 to 2020. The R package bibliometrix was used for data extraction and visualization of the distribution of publication year and the top 10 core authors. Keyword citation burst analysis and cocitation networks were calculated with CiteSpace. A Gunn online world map was used to evaluate distribution by country and region. Ranking was performed using the Standard Competition Ranking method. Coauthorship analysis and co-occurrence were analyzed and visualized with VOSviewer. Results: After removing duplicates, a total of 9169 publications were included. The distribution of publications by year showed that the number of publications rose sharply from 2015 onwards and either reached a peak in 2020 or has yet to reach a peak. 
The geographical distribution indicated that there was a large gap between the number of publications in the United States and other countries. The coauthorship analysis showed that the 149 top institutions were grouped into 8 clusters, each covering approximately a single country, suggesting that international cooperation among institutions should be strengthened. The core author extraction revealed changes in the most prolific authors. The keyword analysis revealed clustering and top citation bursts. The cocitation analysis of references from 2010 to 2020 revealed the number of citations and the centrality of the top articles. Conclusions: This study revealed trends in research and public interest in immune checkpoint blockade for melanoma. Our findings suggest that the field is growing rapidly, has several core authors, and that the United States is taking the lead position. Moreover, cooperation between countries should be strengthened, and future research hot spots might focus on deeper exploration of drug mechanisms, prediction of treatment efficacy, prediction of adverse events, and new modes of administration, such as combination therapy, which may pave the way for further research. ", doi="10.2196/32728", url="https://www.jmir.org/2022/6/e32728", url="http://www.ncbi.nlm.nih.gov/pubmed/35759331" } @Article{info:doi/10.2196/34366, author="Park, Jinkyung and Arunachalam, Ramanathan and Silenzio, Vincent and Singh, K. Vivek", title="Fairness in Mobile Phone--Based Mental Health Assessment Algorithms: Exploratory Study", journal="JMIR Form Res", year="2022", month="Jun", day="14", volume="6", number="6", pages="e34366", keywords="algorithmic bias", keywords="mental health", keywords="health equity", keywords="medical informatics", keywords="health information systems", keywords="gender bias", keywords="mobile phone", abstract="Background: Approximately 1 in 5 American adults experience mental illness every year. 
Thus, mobile phone--based mental health prediction apps that use phone data and artificial intelligence techniques for mental health assessment have become increasingly important and are being rapidly developed. At the same time, multiple artificial intelligence--related technologies (eg, face recognition and search results) have recently been reported to be biased regarding age, gender, and race. This study moves this discussion to a new domain: phone-based mental health assessment algorithms. It is important to ensure that such algorithms do not contribute to gender disparities through biased predictions across gender groups. Objective: This research aimed to analyze the susceptibility of multiple commonly used machine learning approaches for gender bias in mobile mental health assessment and explore the use of an algorithmic disparate impact remover (DIR) approach to reduce bias levels while maintaining high accuracy. Methods: First, we performed preprocessing and model training using the data set (N=55) obtained from a previous study. Accuracy levels and differences in accuracy across genders were computed using 5 different machine learning models. We selected the random forest model, which yielded the highest accuracy, for a more detailed audit and computed multiple metrics that are commonly used for fairness in the machine learning literature. Finally, we applied the DIR approach to reduce bias in the mental health assessment algorithm. Results: The highest observed accuracy for the mental health assessment was 78.57\%. Although this accuracy level raises optimism, the audit based on gender revealed that the performance of the algorithm was statistically significantly different between the male and female groups (eg, difference in accuracy across genders was 15.85\%; P<.001). Similar trends were obtained for other fairness metrics. 
This disparity in performance was found to reduce significantly after the application of the DIR approach by adapting the data used for modeling (eg, the difference in accuracy across genders was 1.66\%, and the reduction is statistically significant with P<.001). Conclusions: This study grounds the need for algorithmic auditing in phone-based mental health assessment algorithms and the use of gender as a protected attribute to study fairness in such settings. Such audits and remedial steps are the building blocks for the widespread adoption of fair and accurate mental health assessment algorithms in the future. ", doi="10.2196/34366", url="https://formative.jmir.org/2022/6/e34366", url="http://www.ncbi.nlm.nih.gov/pubmed/35699997" } @Article{info:doi/10.2196/32845, author="Tang, Chunlei and Ma, Jing and Zhou, Li and Plasek, Joseph and He, Yuqing and Xiong, Yun and Zhu, Yangyong and Huang, Yajun and Bates, David", title="Improving Research Patient Data Repositories From a Health Data Industry Viewpoint", journal="J Med Internet Res", year="2022", month="May", day="11", volume="24", number="5", pages="e32845", keywords="data science", keywords="big data", keywords="data mining", keywords="data warehousing", keywords="information storage and retrieval", doi="10.2196/32845", url="https://www.jmir.org/2022/5/e32845", url="http://www.ncbi.nlm.nih.gov/pubmed/35544299" } @Article{info:doi/10.2196/30898, author="Ye, Jiancheng and Wang, Zidan and Hai, Jiarui", title="Social Networking Service, Patient-Generated Health Data, and Population Health Informatics: National Cross-sectional Study of Patterns and Implications of Leveraging Digital Technologies to Support Mental Health and Well-being", journal="J Med Internet Res", year="2022", month="Apr", day="29", volume="24", number="4", pages="e30898", keywords="patient-generated health data", keywords="social network", keywords="population health informatics", keywords="mental health", keywords="social determinants of health", 
keywords="health data sharing", keywords="technology acceptability", keywords="mobile phone", keywords="mobile health", abstract="Background: The emerging health technologies and digital services provide effective ways of collecting health information and gathering patient-generated health data (PGHD), which provide a more holistic view of a patient's health and quality of life over time, increase visibility into a patient's adherence to a treatment plan or study protocol, and enable timely intervention before a costly care episode. Objective: Through a national cross-sectional survey in the United States, we aimed to describe and compare the characteristics of populations with and without mental health issues (depression or anxiety disorders), including physical health, sleep, and alcohol use. We also examined the patterns of social networking service use, PGHD, and attitudes toward health information sharing and activities among the participants, which provided nationally representative estimates. Methods: We drew data from the 2019 Health Information National Trends Survey of the National Cancer Institute. The participants were divided into 2 groups according to mental health status. Then, we described and compared the characteristics of the social determinants of health, health status, sleeping and drinking behaviors, and patterns of social networking service use and health information data sharing between the 2 groups. Multivariable logistic regression models were applied to assess the predictors of mental health. All the analyses were weighted to provide nationally representative estimates. Results: Participants with mental health issues were significantly more likely to be younger, White, female, and lower-income; have a history of chronic diseases; and be less capable of taking care of their own health. Regarding behavioral health, they slept <6 hours on average, had worse sleep quality, and consumed more alcohol. 
In addition, they were more likely to visit and share health information on social networking sites, write online diary blogs, participate in online forums or support groups, and watch health-related videos. Conclusions: This study illustrates that individuals with mental health issues have inequitable social determinants of health, poor physical health, and poor behavioral health. However, they are more likely to use social networking platforms and services, share their health information, and actively engage with PGHD. Leveraging these digital technologies and services could be beneficial for developing tailored and effective strategies for self-monitoring and self-management. ", doi="10.2196/30898", url="https://www.jmir.org/2022/4/e30898", url="http://www.ncbi.nlm.nih.gov/pubmed/35486428" } @Article{info:doi/10.2196/35789, author="Rosenau, Lorenz and Majeed, W. Raphael and Ingenerf, Josef and Kiel, Alexander and Kroll, Bj{\"o}rn and K{\"o}hler, Thomas and Prokosch, Hans-Ulrich and Gruendner, Julian", title="Generation of a Fast Healthcare Interoperability Resources (FHIR)-based Ontology for Federated Feasibility Queries in the Context of COVID-19: Feasibility Study", journal="JMIR Med Inform", year="2022", month="Apr", day="27", volume="10", number="4", pages="e35789", keywords="federated queries", keywords="feasibility study", keywords="Fast Healthcare Interoperability Resource", keywords="FHIR Search", keywords="CQL", keywords="ontology", keywords="terminology server", keywords="query", keywords="feasibility", keywords="FHIR", keywords="terminology", keywords="development", keywords="COVID-19", keywords="automation", keywords="user interface", keywords="map", keywords="input", keywords="hospital", keywords="data", keywords="Germany", keywords="accessibility", keywords="harmonized", abstract="Background: The COVID-19 pandemic highlighted the importance of making research data from all German hospitals available to scientists to respond to current and future 
pandemics promptly. The heterogeneous data originating from proprietary systems at hospitals' sites must be harmonized and accessible. The German Corona Consensus Dataset (GECCO) specifies how data for COVID-19 patients will be standardized in Fast Healthcare Interoperability Resources (FHIR) profiles across German hospitals. However, given the complexity of the FHIR standard, the data harmonization is not sufficient to make the data accessible. A simplified visual representation is needed to reduce the technical burden, while allowing feasibility queries. Objective: This study investigates how a search ontology can be automatically generated using FHIR profiles and a terminology server. Furthermore, it describes how this ontology can be used in a user interface (UI) and how a mapping and a terminology tree created together with the ontology can translate user input into FHIR queries. Methods: We used the FHIR profiles from the GECCO data set combined with a terminology server to generate an ontology and the required mapping files for the translation. We analyzed the profiles and identified search criteria for the visual representation. In this process, we reduced the complex profiles to code value pairs for improved usability. We enriched our ontology with the necessary information to display it in a UI. We also developed an intermediate query language to transform the queries from the UI to federated FHIR requests. Separation of concerns resulted in discrepancies between the criteria used in the intermediate query format and the target query language. Therefore, a mapping was created to reintroduce all information relevant for creating the query in its target language. Further, we generated a tree representation of the ontology hierarchy, which allows resolving child concepts in the process. Results: In the scope of this project, 82 (99\%) of 83 elements defined in the GECCO profile were successfully implemented. 
We verified our solution based on an independently developed test patient. A discrepancy between the test data and the criteria was found in 6 cases due to different versions used to generate the test data and the UI profiles, the support for specific code systems, and the evaluation of postcoordinated Systematized Nomenclature of Medicine (SNOMED) codes. Our results highlight the need for governance mechanisms for version changes, concept mapping between values from different code systems encoding the same concept, and support for different unit dimensions. Conclusions: We developed an automatic process to generate ontology and mapping files for FHIR-formatted data. Our tests found that this process works for most of our chosen FHIR profile criteria. The process established here works directly with FHIR profiles and a terminology server, making it extendable to other FHIR profiles and demonstrating that automatic ontology generation on FHIR profiles is feasible. ", doi="10.2196/35789", url="https://medinform.jmir.org/2022/4/e35789", url="http://www.ncbi.nlm.nih.gov/pubmed/35380548" } @Article{info:doi/10.2196/32776, author="Yuan, Junyi and Wang, Sufen and Pan, Changqing", title="Mechanism of Impact of Big Data Resources on Medical Collaborative Networks From the Perspective of Transaction Efficiency of Medical Services: Survey Study", journal="J Med Internet Res", year="2022", month="Apr", day="21", volume="24", number="4", pages="e32776", keywords="medical collaborative networks", keywords="big data resources", keywords="transaction efficiency", abstract="Background: The application of big data resources and the development of medical collaborative networks (MCNs) boost each other. However, MCNs are often assumed to be exogenous. How big data resources affect the emergence, development, and evolution of endogenous MCNs has not been well explained. 
Objective: This study aimed to explore and understand the influence of the mechanism of a wide range of shared and private big data resources on the transaction efficiency of medical services to reveal the impact of big data resources on the emergence and development of endogenous MCNs. Methods: This study was conducted by administering a survey questionnaire to information technology staff and medical staff from 132 medical institutions in China. Data from information technology staff and medical staff were integrated. Structural equation modeling was used to test the direct impact of big data resources on transaction efficiency of medical services. For those big data resources that had no direct impact, we analyzed their indirect impact. Results: Sharing of diagnosis and treatment data ($\beta$=.222; P=.03) and sharing of medical research data ($\beta$=.289; P=.04) at the network level (as big data itself) positively directly affected the transaction efficiency of medical services. Network protection of the external link systems ($\beta$=.271; P=.008) at the level of medical institutions (as big data technology) positively directly affected the transaction efficiency of medical services. Encryption security of web-based data (as big data technology) at the level of medical institutions, medical service capacity available for external use, real-time data of diagnosis and treatment services (as big data itself) at the level of medical institutions, and policies and regulations at the network level indirectly affected the transaction efficiency through network protection of the external link systems at the level of medical institutions. Conclusions: This study found that big data technology, big data itself, and policy at the network and organizational levels interact with, and influence, each other to form the transaction efficiency of medical services. 
On the basis of the theory of neoclassical economics, the study highlighted the implications of big data resources for the emergence and development of endogenous MCNs. ", doi="10.2196/32776", url="https://www.jmir.org/2022/4/e32776", url="http://www.ncbi.nlm.nih.gov/pubmed/35318187" } @Article{info:doi/10.2196/33213, author="Cooper, Drew and Ubben, Tebbe and Knoll, Christine and Ballhausen, Hanne and O'Donnell, Shane and Braune, Katarina and Lewis, Dana", title="Open-source Web Portal for Managing Self-reported Data and Real-world Data Donation in Diabetes Research: Platform Feasibility Study", journal="JMIR Diabetes", year="2022", month="Mar", day="31", volume="7", number="1", pages="e33213", keywords="diabetes", keywords="type 1 diabetes", keywords="automated insulin delivery", keywords="diabetes technology", keywords="open-source", keywords="patient-reported outcomes", keywords="real-world data", keywords="research methods", keywords="mixed methods", keywords="insulin", keywords="digital health", keywords="web portal", abstract="Background: People with diabetes and their support networks have developed open-source automated insulin delivery systems to help manage their diabetes therapy, as well as to improve their quality of life and glycemic outcomes. Under the hashtag \#WeAreNotWaiting, a wealth of knowledge and real-world data have been generated by users of these systems but have been left largely untapped by research; opportunities for such multimodal studies remain open. Objective: We aimed to evaluate the feasibility of several aspects of open-source automated insulin delivery systems including challenges related to data management and security across multiple disparate web-based platforms and challenges related to implementing follow-up studies. 
Methods: We developed a mixed methods study to collect questionnaire responses and anonymized diabetes data donated by participants---which included adults and children with diabetes and their partners or caregivers recruited through multiple diabetes online communities. We managed both front-end participant interactions and back-end data management with our web portal (called the Gateway). Participant questionnaire data from electronic data capture (REDCap) and personal device data aggregation (Open Humans) platforms were pseudonymously and securely linked and stored within a custom-built database that used both open-source and commercial software. Participants were later given the option to include their health care providers in the study to validate their questionnaire responses; the database architecture was designed specifically with this kind of extensibility in mind. Results: Of 1052 visitors to the study landing page, 930 participated and completed at least one questionnaire. After the implementation of health care professional validation of self-reported clinical outcomes to the study, an additional 164 individuals visited the landing page, with 142 completing at least one questionnaire. Of the optional study elements, 7 participant--health care professional dyads participated in the survey, and 97 participants who completed the survey donated their anonymized medical device data. Conclusions: The platform was accessible to participants while maintaining compliance with data regulations. The Gateway formalized a system of automated data matching between multiple data sets, which was a major benefit to researchers. Scalability of the platform was demonstrated with the later addition of self-reported data validation. This study demonstrated the feasibility of custom software solutions in addressing complex study designs. The Gateway portal code has been made available open-source and can be leveraged by other research groups. 
", doi="10.2196/33213", url="https://diabetes.jmir.org/2022/1/e33213", url="http://www.ncbi.nlm.nih.gov/pubmed/35357312" } @Article{info:doi/10.2196/30258, author="Douze, Laura and Pelayo, Sylvia and Messaadi, Nassir and Grosjean, Julien and Kerdelhu{\'e}, Ga{\'e}tan and Marcilly, Romaric", title="Designing Formulae for Ranking Search Results: Mixed Methods Evaluation Study", journal="JMIR Hum Factors", year="2022", month="Mar", day="25", volume="9", number="1", pages="e30258", keywords="information retrieval", keywords="search engine", keywords="topical relevance", keywords="search result ranking", keywords="user testing", keywords="human factors", abstract="Background: A major factor in the success of any search engine is the relevance of the search results; a tool should sort the search results to present the most relevant documents first. Assessing the performance of the ranking formula is an important part of search engine evaluation. However, the methods currently used to evaluate ranking formulae mainly collect quantitative data and do not gather qualitative data, which help to understand what needs to be improved to tailor the formulae to their end users. Objective: This study aims to evaluate 2 different parameter settings of the ranking formula of LiSSa (the French acronym for scientific literature in health care; Department of Medical Informatics and Information), a tool that provides access to health scientific literature in French, to adapt the formula to the needs of the end users. Methods: To collect quantitative and qualitative data, user tests were carried out with representative end users of LiSSa: 10 general practitioners and 10 registrars. Participants first assessed the relevance of the search results and then rated the ranking criteria used in the 2 formulae. Verbalizations were analyzed to characterize each criterion. Results: A formula that prioritized articles representing a consensus in the field was preferred. 
When users assess an article's relevance, they judge its topic, methods, and value in clinical practice. Conclusions: Following the evaluation, several improvements were implemented to give more weight to articles that match the search topic and to downgrade articles that have less informative or scientific value for the reader. Applying a qualitative methodology generates valuable user inputs to improve the ranking formula and move toward a highly usable search engine. ", doi="10.2196/30258", url="https://humanfactors.jmir.org/2022/1/e30258", url="http://www.ncbi.nlm.nih.gov/pubmed/35333180" } @Article{info:doi/10.2196/31021, author="Almowil, Zahra and Zhou, Shang-Ming and Brophy, Sinead and Croxall, Jodie", title="Concept Libraries for Repeatable and Reusable Research: Qualitative Study Exploring the Needs of Users", journal="JMIR Hum Factors", year="2022", month="Mar", day="15", volume="9", number="1", pages="e31021", keywords="electronic health records", keywords="record linkage", keywords="reproducible research", keywords="clinical codes", keywords="concept libraries", abstract="Background: Big data research in the field of health sciences is hindered by a lack of agreement on how to identify and define different conditions and their medications. This means that researchers and health professionals often have different phenotype definitions for the same condition. This lack of agreement makes it difficult to compare different study findings and hinders the ability to conduct repeatable and reusable research. Objective: This study aims to examine the requirements of various users, such as researchers, clinicians, machine learning experts, and managers, in the development of a data portal for phenotypes (a concept library). Methods: This was a qualitative study using interviews and focus group discussion. 
One-to-one interviews were conducted with researchers, clinicians, machine learning experts, and senior research managers in health data science (N=6) to explore their specific needs in the development of a concept library. In addition, a focus group discussion with researchers (N=14) working with the Secured Anonymized Information Linkage databank, a national eHealth data linkage infrastructure, was held to perform a SWOT (strengths, weaknesses, opportunities, and threats) analysis for the phenotyping system and the proposed concept library. The interviews and focus group discussion were transcribed verbatim, and 2 thematic analyses were performed. Results: Most of the participants thought that the prototype concept library would be a very helpful resource for conducting repeatable research, but they specified that many requirements are needed before its development. Although all the participants stated that they were aware of some existing concept libraries, most of them expressed negative perceptions about them. The participants mentioned several facilitators that would stimulate them to share their work and reuse the work of others, and they pointed out several barriers that could inhibit them from sharing their work and reusing the work of others. The participants suggested some developments that they would like to see to improve reproducible research output using routine data. Conclusions: The study indicated that most interviewees valued a concept library for phenotypes. However, only half of the participants felt that they would contribute by providing definitions for the concept library, and they reported many barriers regarding sharing their work on a publicly accessible platform. Analysis of interviews and the focus group discussion revealed that different stakeholders have different requirements, facilitators, barriers, and concerns about a prototype concept library. 
", doi="10.2196/31021", url="https://humanfactors.jmir.org/2022/1/e31021", url="http://www.ncbi.nlm.nih.gov/pubmed/35289755" } @Article{info:doi/10.2196/35104, author="Jung, Hyesil and Yoo, Sooyoung and Kim, Seok and Heo, Eunjeong and Kim, Borham and Lee, Ho-Young and Hwang, Hee", title="Patient-Level Fall Risk Prediction Using the Observational Medical Outcomes Partnership's Common Data Model: Pilot Feasibility Study", journal="JMIR Med Inform", year="2022", month="Mar", day="11", volume="10", number="3", pages="e35104", keywords="common data model", keywords="accidental falls", keywords="Observational Medical Outcomes Partnership", keywords="nursing records", keywords="medical informatics", keywords="health data", keywords="electronic health record", keywords="data model", keywords="prediction model", keywords="risk prediction", keywords="fall risk", abstract="Background: Falls in acute care settings threaten patients' safety. Researchers have been developing fall risk prediction models and exploring risk factors to provide evidence-based fall prevention practices; however, such efforts are hindered by insufficient samples, limited covariates, and a lack of standardized methodologies that aid study replication. Objective: The objectives of this study were to (1) convert fall-related electronic health record data into the standardized Observational Medical Outcomes Partnership's (OMOP) common data model format and (2) develop models that predict fall risk during 2 time periods. Methods: As a pilot feasibility test, we converted fall-related electronic health record data (nursing notes, fall risk assessment sheet, patient acuity assessment sheet, and clinical observation sheet) into standardized OMOP common data model format using an extraction, transformation, and load process. 
We developed fall risk prediction models for 2 time periods (within 7 days of admission and during the entire hospital stay) using 2 algorithms (least absolute shrinkage and selection operator logistic regression and random forest). Results: In total, 6277 nursing statements, 747,049,486 clinical observation sheet records, 1,554,775 fall risk scores, and 5,685,011 patient acuity scores were converted into OMOP common data model format. All our models (area under the receiver operating characteristic curve 0.692-0.726) performed better than the Hendrich II Fall Risk Model. Patient acuity score, fall history, age $\geq$60 years, movement disorder, and central nervous system agents were the most important predictors in the logistic regression models. Conclusions: To enhance model performance further, we are currently converting all nursing records into the OMOP common data model data format, which will then be included in the models. Thus, in the near future, the performance of fall risk prediction models could be improved through the application of abundant nursing records and external validation. ", doi="10.2196/35104", url="https://medinform.jmir.org/2022/3/e35104", url="http://www.ncbi.nlm.nih.gov/pubmed/35275076" } @Article{info:doi/10.2196/31684, author="Gao, Chuang and McGilchrist, Mark and Mumtaz, Shahzad and Hall, Christopher and Anderson, Ann Lesley and Zurowski, John and Gordon, Sharon and Lumsden, Joanne and Munro, Vicky and Wozniak, Artur and Sibley, Michael and Banks, Christopher and Duncan, Chris and Linksted, Pamela and Hume, Alastair and Stables, L. 
Catherine and Mayor, Charlie and Caldwell, Jacqueline and Wilde, Katie and Cole, Christian and Jefferson, Emily", title="A National Network of Safe Havens: Scottish Perspective", journal="J Med Internet Res", year="2022", month="Mar", day="9", volume="24", number="3", pages="e31684", keywords="electronic health records", keywords="Safe Haven", keywords="data governance", doi="10.2196/31684", url="https://www.jmir.org/2022/3/e31684", url="http://www.ncbi.nlm.nih.gov/pubmed/35262495" } @Article{info:doi/10.2196/27146, author="Wang, Liya and Qiu, Hang and Luo, Li and Zhou, Li", title="Age- and Sex-Specific Differences in Multimorbidity Patterns and Temporal Trends on Assessing Hospital Discharge Records in Southwest China: Network-Based Study", journal="J Med Internet Res", year="2022", month="Feb", day="25", volume="24", number="2", pages="e27146", keywords="multimorbidity pattern", keywords="temporal trend", keywords="network analysis", keywords="multimorbidity prevalence", keywords="administrative data", keywords="longitudinal study", keywords="regional research", abstract="Background: Multimorbidity represents a global health challenge, which requires a more global understanding of multimorbidity patterns and trends. However, the majority of studies completed to date have often relied on self-reported conditions, and a simultaneous assessment of the entire spectrum of chronic disease co-occurrence, especially in developing regions, has not yet been performed. Objective: We attempted to provide a multidimensional approach to understand the full spectrum of chronic disease co-occurrence among general inpatients in southwest China, in order to investigate multimorbidity patterns and temporal trends, and assess their age and sex differences. Methods: We conducted a retrospective cohort analysis based on 8.8 million hospital discharge records of about 5.0 million individuals of all ages from 2015 to 2019 in a megacity in southwest China. 
We examined all chronic diagnoses using the ICD-10 (International Classification of Diseases, 10th revision) codes at 3 digits and focused on chronic diseases with $\geq$1\% prevalence for each of the age and sex strata, which resulted in a total of 149 and 145 chronic diseases in males and females, respectively. We constructed multimorbidity networks in the general population based on sex and age, and used the cosine index to measure the co-occurrence of chronic diseases. Then, we divided the networks into communities and assessed their temporal trends. Results: The results showed complex interactions among chronic diseases, with more intensive connections among males and inpatients $\geq$40 years old. A total of 9 chronic diseases were simultaneously classified as central diseases, hubs, and bursts in the multimorbidity networks. Among them, 5 diseases were common to both males and females, including hypertension, chronic ischemic heart disease, cerebral infarction, other cerebrovascular diseases, and atherosclerosis. The earliest leaps (degree leaps $\geq$6) appeared at a disorder of glycoprotein metabolism that happened at 25-29 years in males, about 15 years earlier than in females. The number of chronic diseases in the community increased over time, but the new entrants did not replace the root of the community. Conclusions: Our multimorbidity network analysis identified specific differences in the co-occurrence of chronic diagnoses by sex and age, which could help in the design of clinical interventions for inpatient multimorbidity. ", doi="10.2196/27146", url="https://www.jmir.org/2022/2/e27146", url="http://www.ncbi.nlm.nih.gov/pubmed/35212632" } @Article{info:doi/10.2196/34560, author="Bove, Riley and Schleimer, Erica and Sukhanov, Paul and Gilson, Michael and Law, M. Sindy and Barnecut, Andrew and Miller, L. Bruce and Hauser, L. Stephen and Sanders, J. Stephan and Rankin, P. 
Katherine", title="Building a Precision Medicine Delivery Platform for Clinics: The University of California, San Francisco, BRIDGE Experience", journal="J Med Internet Res", year="2022", month="Feb", day="15", volume="24", number="2", pages="e34560", keywords="precision medicine", keywords="clinical implementation", keywords="in silico trials", keywords="clinical dashboard", keywords="precision", keywords="implementation", keywords="dashboard", keywords="design", keywords="experience", keywords="analytic", keywords="tool", keywords="analysis", keywords="decision-making", keywords="real time", keywords="platform", keywords="human-centered design", doi="10.2196/34560", url="https://www.jmir.org/2022/2/e34560", url="http://www.ncbi.nlm.nih.gov/pubmed/35166689" } @Article{info:doi/10.2196/34573, author="Petsani, Despoina and Ahmed, Sara and Petronikolou, Vasileia and Kehayia, Eva and Alastalo, Mika and Santonen, Teemu and Merino-Barbancho, Beatriz and Cea, Gloria and Segkouli, Sofia and Stavropoulos, G. Thanos and Billis, Antonis and Doumas, Michael and Almeida, Rosa and Nagy, Enik{\H o} and Broeckx, Leen and Bamidis, Panagiotis and Konstantinidis, Evdokimos", title="Digital Biomarkers for Supporting Transitional Care Decisions: Protocol for a Transnational Feasibility Study", journal="JMIR Res Protoc", year="2022", month="Jan", day="19", volume="11", number="1", pages="e34573", keywords="Living Lab", keywords="cocreation", keywords="transitional care", keywords="technology", keywords="feasibility study", abstract="Background: Virtual Health and Wellbeing Living Lab Infrastructure is a Horizon 2020 project that aims to harmonize Living Lab procedures and facilitate access to European health and well-being research infrastructures. 
In this context, this study presents a joint research activity that will be conducted within Virtual Health and Wellbeing Living Lab Infrastructure in the transitional care domain to test and validate the harmonized Living Lab procedures and infrastructures. The collection of data from various sources (information and communications technology and clinical and patient-reported outcome measures) demonstrated the capacity to assess risk and support decisions during care transitions, but there is no harmonized way of combining this information. Objective: This study primarily aims to evaluate the feasibility and benefit of collecting multichannel data across Living Labs on the topic of transitional care and to harmonize data processes and collection. In addition, the authors aim to investigate the collection and use of digital biomarkers and explore initial patterns in the data that demonstrate the potential to predict transition outcomes, such as readmissions and adverse events. Methods: The current research protocol presents a multicenter, prospective, observational cohort study that will consist of three phases, running consecutively in multiple sites: a cocreation phase, a testing and simulation phase, and a transnational pilot phase. The cocreation phase aims to build a common understanding among different sites and to investigate both the differences in hospitalization discharge management among countries and the willingness of different stakeholders to use technological solutions in the transitional care process. The testing and simulation phase aims to explore ways of integrating observation of a patient's clinical condition, patient involvement, and discharge education in transitional care. The objective of the simulation phase is to evaluate the feasibility and the barriers faced by health care professionals in assessing transition readiness. Results: The cocreation phase will be completed by April 2022. 
The testing and simulation phase will begin in September 2022 and will partially overlap with the deployment of the transnational pilot phase that will start in the same month. The data collection of the transnational pilots will be finalized by the end of June 2023. Data processing is expected to be completed by March 2024. The results will consist of guidelines and implementation pathways for large-scale studies and an analysis for identifying initial patterns in the acquired data. Conclusions: The knowledge acquired through this research will lead to harmonized procedures and data collection for Living Labs that support transitions in care. International Registered Report Identifier (IRRID): PRR1-10.2196/34573 ", doi="10.2196/34573", url="https://www.researchprotocols.org/2022/1/e34573", url="http://www.ncbi.nlm.nih.gov/pubmed/35044303" } @Article{info:doi/10.2196/25440, author="Ulrich, Hannes and Kock-Schoppenhauer, Ann-Kristin and Deppenwiese, Noemi and G{\"o}tt, Robert and Kern, Jori and Lablans, Martin and Majeed, W. Raphael and St{\"o}hr, R. Mark and Stausberg, J{\"u}rgen and Varghese, Julian and Dugas, Martin and Ingenerf, Josef", title="Understanding the Nature of Metadata: Systematic Review", journal="J Med Internet Res", year="2022", month="Jan", day="11", volume="24", number="1", pages="e25440", keywords="metadata", keywords="metadata definition", keywords="systematic review", keywords="data integration", keywords="data identification", keywords="data classification", abstract="Background: Metadata are created to describe the corresponding data in a detailed and unambiguous way and are used for various applications in different research areas, for example, data identification and classification. However, a clear definition of metadata is crucial for further use. Unfortunately, extensive experience with the processing and management of metadata has shown that the term ``metadata'' and its use are not always unambiguous. 
Objective: This study aimed to understand the definition of metadata and the challenges resulting from metadata reuse. Methods: A systematic literature search was performed in this study following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines for reporting on systematic reviews. Five research questions were identified to streamline the review process, addressing metadata characteristics, metadata standards, use cases, and problems encountered. This review was preceded by a harmonization process to achieve a general understanding of the terms used. Results: The harmonization process resulted in a clear set of definitions for metadata processing focusing on data integration. The following literature review was conducted by 10 reviewers with different backgrounds and using the harmonized definitions. This study included 81 peer-reviewed papers from the last decade after applying various filtering steps to identify the most relevant papers. The 5 research questions could be answered, resulting in a broad overview of the standards, use cases, problems, and corresponding solutions for the application of metadata in different research areas. Conclusions: Metadata can be a powerful tool for identifying, describing, and processing information, but its meaningful creation is costly and challenging. This review process uncovered many standards, use cases, problems, and solutions for dealing with metadata. The presented harmonized definitions and the new schema have the potential to improve the classification and generation of metadata by creating a shared understanding of metadata and its context. 
", doi="10.2196/25440", url="https://www.jmir.org/2022/1/e25440", url="http://www.ncbi.nlm.nih.gov/pubmed/35014967" } @Article{info:doi/10.2196/30557, author="Vaidyam, Aditya and Halamka, John and Torous, John", title="Enabling Research and Clinical Use of Patient-Generated Health Data (the mindLAMP Platform): Digital Phenotyping Study", journal="JMIR Mhealth Uhealth", year="2022", month="Jan", day="7", volume="10", number="1", pages="e30557", keywords="digital phenotyping", keywords="mHealth", keywords="apps", keywords="FHIR", keywords="digital health", keywords="health data", keywords="patient-generated health data", keywords="mobile health", keywords="smartphones", keywords="wearables", keywords="mobile apps", keywords="mental health", keywords="mobile phone", abstract="Background: There is a growing need for the integration of patient-generated health data (PGHD) into research and clinical care to enable personalized, preventive, and interactive care, but technical and organizational challenges, such as the lack of standards and easy-to-use tools, preclude the effective use of PGHD generated from consumer devices, such as smartphones and wearables. Objective: This study outlines how we used mobile apps and semantic web standards such as HTTP 2.0, Representational State Transfer, JSON (JavaScript Object Notation), JSON Schema, Transport Layer Security (version 1.3), Advanced Encryption Standard-256, OpenAPI, HTML5, and Vega, in conjunction with patient and provider feedback to completely update a previous version of mindLAMP. Methods: The Learn, Assess, Manage, and Prevent (LAMP) platform addresses the abovementioned challenges in enhancing clinical insight by supporting research, data analysis, and implementation efforts around PGHD as an open-source solution with freely accessible and shared code. 
Results: With a simplified programming interface and novel data representation that captures additional metadata, the LAMP platform enables interoperability with existing Fast Healthcare Interoperability Resources--based health care systems as well as consumer wearables and services such as Apple HealthKit and Google Fit. The companion Cortex data analysis and machine learning toolkit offers robust support for artificial intelligence, behavioral feature extraction, interactive visualizations, and high-performance data processing through parallelization and vectorization techniques. Conclusions: The LAMP platform incorporates feedback from patients and clinicians alongside a standards-based approach to address these needs and functions across a wide range of use cases through its customizable and flexible components. These range from simple survey-based research to international consortiums capturing multimodal data to simple delivery of mindfulness exercises through personalized, just-in-time adaptive interventions. 
", doi="10.2196/30557", url="https://mhealth.jmir.org/2022/1/e30557", url="http://www.ncbi.nlm.nih.gov/pubmed/34994710" } @Article{info:doi/10.2196/30720, author="Wang, Ni and Wang, Muyu and Zhou, Yang and Liu, Honglei and Wei, Lan and Fei, Xiaolu and Chen, Hui", title="Sequential Data--Based Patient Similarity Framework for Patient Outcome Prediction: Algorithm Development", journal="J Med Internet Res", year="2022", month="Jan", day="6", volume="24", number="1", pages="e30720", keywords="patient similarity", keywords="electronic medical records", keywords="time series", keywords="acute myocardial infarction", keywords="natural language processing", keywords="machine learning", keywords="deep learning", keywords="outcome prediction", keywords="informatics", keywords="health data", abstract="Background: Sequential information in electronic medical records is valuable and helpful for patient outcome prediction but is rarely used for patient similarity measurement because of its unevenness, irregularity, and heterogeneity. Objective: We aimed to develop a patient similarity framework for patient outcome prediction that makes use of sequential and cross-sectional information in electronic medical record systems. Methods: Sequence similarity was calculated from timestamped event sequences using edit distance, and trend similarity was calculated from time series using dynamic time warping and Haar decomposition. We also extracted cross-sectional information, namely, demographic, laboratory test, and radiological report data, for additional similarity calculations. We validated the effectiveness of the framework by constructing k--nearest neighbors classifiers to predict mortality and readmission for acute myocardial infarction patients, using data from (1) a public data set and (2) a private data set, at 3 time points---at admission, on Day 7, and at discharge---to provide early warning patient outcomes. 
We also constructed state-of-the-art Euclidean-distance k--nearest neighbor, logistic regression, random forest, long short-term memory network, and recurrent neural network models, which were used for comparison. Results: With all available information during a hospitalization episode, predictive models using the similarity model outperformed baseline models based on both public and private data sets. For mortality predictions, all models except for the logistic regression model showed improved performances over time. There were no such increasing trends in predictive performances for readmission predictions. The random forest and logistic regression models performed best for mortality and readmission predictions, respectively, when using information from the first week after admission. Conclusions: For patient outcome predictions, the patient similarity framework facilitated sequential similarity calculations for uneven electronic medical record data and helped improve predictive performance. 
", doi="10.2196/30720", url="https://www.jmir.org/2022/1/e30720", url="http://www.ncbi.nlm.nih.gov/pubmed/34989682" } @Article{info:doi/10.2196/34567, author="Santonen, Teemu and Petsani, Despoina and Julin, Mikko and Garschall, Markus and Kropf, Johannes and Van der Auwera, Vicky and Bernaerts, Sylvie and Losada, Raquel and Almeida, Rosa and Garatea, Jokin and Mu{\~n}oz, Idoia and Nagy, Eniko and Kehayia, Eva and de Guise, Elaine and Nadeau, Sylvie and Azevedo, Nancy and Segkouli, Sofia and Lazarou, Ioulietta and Petronikolou, Vasileia and Bamidis, Panagiotis and Konstantinidis, Evdokimos", title="Cocreating a Harmonized Living Lab for Big Data--Driven Hybrid Persona Development: Protocol for Cocreating, Testing, and Seeking Consensus", journal="JMIR Res Protoc", year="2022", month="Jan", day="6", volume="11", number="1", pages="e34567", keywords="Living Lab", keywords="everyday living", keywords="technology", keywords="big data", keywords="harmonization", keywords="personas", keywords="small-scale real-life testing", keywords="mobile phone", abstract="Background: Living Labs are user-centered, open innovation ecosystems based on a systematic user cocreation approach, which integrates research and innovation processes in real-life communities and settings. The Horizon 2020 Project VITALISE (Virtual Health and Wellbeing Living Lab Infrastructure) unites 19 partners across 11 countries. The project aims to harmonize Living Lab procedures and enable effective and convenient transnational and virtual access to key European health and well-being research infrastructures, which are governed by Living Labs. The VITALISE consortium will conduct joint research activities in the fields included in the care pathway of patients: rehabilitation, transitional care, and everyday living environments for older adults. This protocol focuses on health and well-being research in everyday living environments. 
Objective: The main aim of this study is to cocreate and test a harmonized research protocol for developing big data--driven hybrid personas, which are hypothetical user archetypes created to represent a user community. In addition, the use and applicability of innovative technologies will be investigated in the context of various everyday living and Living Lab environments. Methods: In phase 1, surveys and structured interviews will be used to identify the most suitable Living Lab methods, tools, and instruments for health-related research among VITALISE project Living Labs (N=10). A series of web-based cocreation workshops and iterative cowriting processes will be applied to define the initial protocols. In phase 2, five small-scale case studies will be conducted to test the cocreated research protocols in various real-life everyday living settings and Living Lab infrastructures. In phase 3, a cross-case analysis grounded on semistructured interviews will be conducted to identify the challenges and benefits of using the proposed research protocols. Furthermore, a series of cocreation workshops and the consensus-seeking Delphi study process will be conducted in parallel to cocreate and validate the acceptance of the defined harmonized research protocols among wider Living Lab communities. Results: As of September 30, 2021, project deliverables Ethics and safety manual and Living lab standard version 1 have been submitted to the European Commission review process. The study will be finished by March 2024. Conclusions: The outcome of this research will lead to harmonized procedures and protocols in the context of big data--driven hybrid persona development among health and well-being Living Labs in Europe and beyond. Harmonized protocols enable Living Labs to exploit similar research protocols, devices, hardware, and software for interventions and complex data collection purposes. 
Economies of scale and improved use of resources will speed up and improve research quality and offer novel possibilities for open data sharing, multidisciplinary research, and comparative studies beyond current practices. Case studies will also provide novel insights for implementing innovative technologies in the context of everyday Living Lab research. International Registered Report Identifier (IRRID): DERR1-10.2196/34567 ", doi="10.2196/34567", url="https://www.researchprotocols.org/2022/1/e34567", url="http://www.ncbi.nlm.nih.gov/pubmed/34989697" } @Article{info:doi/10.2196/31365, author="Horvath, Correia Jaqueline Driemeyer and Bessel, Marina and Kops, Luiza Nat{\'a}lia and Souza, Alves Fl{\'a}via Moreno and Pereira, Mendes Gerson and Wendland, Marcia Eliana", title="A Nationwide Evaluation of the Prevalence of Human Papillomavirus in Brazil (POP-Brazil Study): Protocol for Data Quality Assurance and Control", journal="JMIR Res Protoc", year="2022", month="Jan", day="5", volume="11", number="1", pages="e31365", keywords="quality control", keywords="quality assurance", keywords="evidence-based medicine", keywords="quality data", abstract="Background: The credibility of a study and its internal and external validity depend crucially on the quality of the data produced. An in-depth knowledge of quality control processes is essential as large and integrative epidemiological studies are increasingly prioritized. Objective: This study aimed to describe the stages of quality control in the POP-Brazil study and to present an analysis of the quality indicators. Methods: Quality assurance and control were initiated with the planning of this nationwide, multicentric study and continued through the development of the project. All quality control protocol strategies, such as training, protocol implementation, audits, and inspection, were discussed one by one. 
We highlight the importance of conducting a pilot study, which gives the researcher the opportunity to refine or modify the research methodology, and of validating the results through double data entry, test-retest, and analysis of nonresponse rates. Results: This cross-sectional, nationwide, multicentric study recruited 8628 sexually active young adults (16-25 years old) in 119 public health units between September 2016 and November 2017. The Human Research Ethics Committee of the Moinhos de Vento Hospital approved this project. Conclusions: Quality control processes are a continuum, not restricted to a single event, and are fundamental to the success of data integrity and the minimization of bias in epidemiological studies. The quality control steps described can be used as a guide to implement evidence-based, valid, reliable, and useful procedures in most observational studies to ensure data integrity. International Registered Report Identifier (IRRID): RR1-10.2196/31365 ", doi="10.2196/31365", url="https://www.researchprotocols.org/2022/1/e31365", url="http://www.ncbi.nlm.nih.gov/pubmed/34989680" } @Article{info:doi/10.2196/48892, author="Schapranow, Matthieu-P and Bayat, Mozhgan and Rasheed, Aadil and Naik, Marcel and Graf, Verena and Schmidt, Danilo and Budde, Klemens and Cardinal, H{\'e}lo{\"i}se and Sapir-Pichhadze, Ruth and Fenninger, Franz and Sherwood, Karen and Keown, Paul and G{\"u}nther, P. Oliver and Pandl, D. 
Konstantin and Leiser, Florian and Thiebes, Scott and Sunyaev, Ali and Niemann, Matthias and Schimanski, Andreas and Klein, Thomas", title="NephroCAGE---German-Canadian Consortium on AI for Improved Kidney Transplantation Outcome: Protocol for an Algorithm Development and Validation Study", journal="JMIR Res Protoc", year="2023", month="Dec", day="22", volume="12", pages="e48892", keywords="posttransplant risks", keywords="kidney transplantation", keywords="federated learning infrastructure", keywords="clinical prediction model", keywords="donor-recipient matching", keywords="multinational transplant data set", abstract="Background: Recent advances in hardware and software enabled the use of artificial intelligence (AI) algorithms for analysis of complex data in a wide range of daily-life use cases. We aim to explore the benefits of applying AI to a specific use case in transplant nephrology: risk prediction for severe posttransplant events. For the first time, we combine multinational real-world transplant data, which require specific legal and technical protection measures. Objective: The German-Canadian NephroCAGE consortium aims to develop and evaluate specific processes, software tools, and methods to (1) combine transplant data of more than 8000 cases over the past decades from leading transplant centers in Germany and Canada, (2) implement specific measures to protect sensitive transplant data, and (3) use multinational data as a foundation for developing high-quality prognostic AI models. Methods: To protect sensitive transplant data addressing the first and second objectives, we aim to implement a decentralized NephroCAGE federated learning infrastructure upon a private blockchain. Our NephroCAGE federated learning infrastructure enables a switch of paradigms: instead of pooling sensitive data into a central database for analysis, it enables the transfer of clinical prediction models (CPMs) to clinical sites for local data analyses. 
Thus, sensitive transplant data reside protected in their original sites while the comparably small algorithms are exchanged instead. For our third objective, we will compare the performance of selected AI algorithms, for example, random forest and extreme gradient boosting, as a foundation for CPMs to predict severe short- and long-term posttransplant risks, for example, graft failure or mortality. The CPMs will be trained on donor and recipient data from retrospective cohorts of kidney transplant patients. Results: We received initial funding for NephroCAGE in February 2021. All clinical partners have applied for and received ethics approval as of 2022. Exploration of the clinical transplant databases for variable extraction started at all centers in 2022. In total, 8120 patient records have been retrieved as of August 2023. The development and validation of CPMs are ongoing as of 2023. Conclusions: For the first time, we will (1) combine kidney transplant data from nephrology centers in Germany and Canada, (2) implement federated learning as a foundation to use such real-world transplant data as a basis for the training of CPMs in a privacy-preserving way, and (3) develop a learning software system to investigate population specifics, for example, to understand population heterogeneity, treatment specificities, and individual impact on selected posttransplant outcomes. 
International Registered Report Identifier (IRRID): DERR1-10.2196/48892 ", doi="10.2196/48892", url="https://www.researchprotocols.org/2023/1/e48892", url="http://www.ncbi.nlm.nih.gov/pubmed/38133915" } @Article{info:doi/10.2196/51471, author="Dolezel, Diane and Beauvais, Brad and Stigler Granados, Paula and Fulton, Lawrence and Kruse, Scott Clemens", title="Effects of Internal and External Factors on Hospital Data Breaches: Quantitative Study", journal="J Med Internet Res", year="2023", month="Dec", day="21", volume="25", pages="e51471", keywords="data breach", keywords="security", keywords="geospatial", keywords="predictive", keywords="mobile phone", abstract="Background: Health care data breaches are the most rapidly increasing type of cybercrime; however, the predictors of health care data breaches are uncertain. Objective: This quantitative study aims to develop a predictive model to explain the number of hospital data breaches at the county level. Methods: This study evaluated data consolidated at the county level from 1032 short-term acute care hospitals. We considered the association between data breach occurrence (a dichotomous variable), predictors based on county demographics, and socioeconomics, average hospital workload, facility type, and average performance on several hospital financial metrics using 3 model types: logistic regression, perceptron, and support vector machine. Results: The model coefficient performance metrics indicated convergent validity across the 3 model types for all variables except bad debt and the factor level accounting for counties with >20\% and up to 40\% Hispanic populations, both of which had mixed coefficient directionality. The support vector machine model performed the classification task best based on all metrics (accuracy, precision, recall, F1-score). All the 3 models performed the classification task well with directional congruence of weights. 
From the logistic regression model, the top 5 odds ratios (indicating a higher risk of breach) included inpatient workload, medical center status, pediatric trauma center status, accounts receivable, and the number of outpatient visits, in high to low order. The bottom 5 odds ratios (indicating the lowest odds of experiencing a data breach) occurred for counties with Black populations of >20\% and <40\%, >80\% and <100\%, and >40\% but <60\%, as well as counties with $\leq$20\% Asian or between 80\% and 100\% Hispanic individuals. Our results are in line with those of other studies that determined that patient workload, facility type, and financial outcomes were associated with the likelihood of health care data breach occurrence. Conclusions: The results of this study provide a predictive model for health care data breaches that may guide health care managers to reduce the risk of data breaches by raising awareness of the risk factors. ", doi="10.2196/51471", url="https://www.jmir.org/2023/1/e51471", url="http://www.ncbi.nlm.nih.gov/pubmed/38127426" } @Article{info:doi/10.2196/44599, author="Autio, Reija and Virta, Joni and Nordhausen, Klaus and Fogelholm, Mikael and Erkkola, Maijaliisa and Nevalainen, Jaakko", title="Tensorial Principal Component Analysis in Detecting Temporal Trajectories of Purchase Patterns in Loyalty Card Data: Retrospective Cohort Study", journal="J Med Internet Res", year="2023", month="Dec", day="15", volume="25", pages="e44599", keywords="tensorial data", keywords="principal components", keywords="loyalty card data", keywords="purchase pattern", keywords="food expenditure", keywords="seasonality", keywords="food", keywords="diet", abstract="Background: Loyalty card data automatically collected by retailers provide an excellent source for evaluating health-related purchase behavior of customers. The data comprise information on every grocery purchase, including expenditures on product groups and the time of purchase for each customer. 
Such data where customers have an expenditure value for every product group for each time can be formulated as 3D tensorial data. Objective: This study aimed to use the modern tensorial principal component analysis (PCA) method to uncover the characteristics of health-related purchase patterns from loyalty card data. Another aim was to identify card holders with distinct purchase patterns. We also considered the interpretation, advantages, and challenges of tensorial PCA compared with standard PCA. Methods: Loyalty card program members from the largest retailer in Finland were invited to participate in this study. Our LoCard data consist of the purchases of 7251 card holders who consented to the use of their data from the year 2016. The purchases were reclassified into 55 product groups and aggregated across 52 weeks. The data were then analyzed using tensorial PCA, allowing us to effectively reduce the time and product group-wise dimensions simultaneously. The augmentation method was used for selecting the suitable number of principal components for the analysis. Results: Using tensorial PCA, we were able to systematically search for typical food purchasing patterns across time and product groups as well as detect different purchasing behaviors across groups of card holders. For example, we identified customers who purchased large amounts of meat products and separated them further into groups based on time profiles, that is, customers whose purchases of meat remained stable, increased, or decreased throughout the year or varied between seasons of the year. Conclusions: Using tensorial PCA, we can effectively examine customers' purchasing behavior in more detail than with traditional methods because it can handle time and product group dimensions simultaneously. When interpreting the results, both time and product dimensions must be considered. 
In further analyses, these time and product groups can be directly associated with additional consumer characteristics such as socioeconomic and demographic predictors of dietary patterns. In addition, they can be linked to external factors that impact grocery purchases such as inflation and unexpected pandemics. This enables us to identify what types of people have specific purchasing patterns, which can help in the development of ways in which consumers can be steered toward making healthier food choices. ", doi="10.2196/44599", url="https://www.jmir.org/2023/1/e44599", url="http://www.ncbi.nlm.nih.gov/pubmed/38100168" } @Article{info:doi/10.2196/50027, author="Gierend, Kerstin and Waltemath, Dagmar and Ganslandt, Thomas and Siegel, Fabian", title="Traceable Research Data Sharing in a German Medical Data Integration Center With FAIR (Findability, Accessibility, Interoperability, and Reusability)-Geared Provenance Implementation: Proof-of-Concept Study", journal="JMIR Form Res", year="2023", month="Dec", day="7", volume="7", pages="e50027", keywords="provenance", keywords="traceability", keywords="data management", keywords="metadata", keywords="data integrity", keywords="data integration center", keywords="medical informatics", abstract="Background: Secondary investigations into digital health records, including electronic patient data from German medical data integration centers (DICs), pave the way for enhanced future patient care. However, only limited information is captured regarding the integrity, traceability, and quality of the (sensitive) data elements. This lack of detail diminishes trust in the validity of the collected data. From a technical standpoint, adhering to the widely accepted FAIR (Findability, Accessibility, Interoperability, and Reusability) principles for data stewardship necessitates enriching data with provenance-related metadata. 
Provenance offers insights into the readiness for the reuse of a data element and serves as a supplier of data governance. Objective: The primary goal of this study is to augment the reusability of clinical routine data within a medical DIC for secondary utilization in clinical research. Our aim is to establish provenance traces that underpin the status of data integrity, reliability, and consequently, trust in electronic health records, thereby enhancing the accountability of the medical DIC. We present the implementation of a proof-of-concept provenance library integrating international standards as an initial step. Methods: We adhered to a customized road map for a provenance framework, and examined the data integration steps across the ETL (extract, transform, and load) phases. Following a maturity model, we derived requirements for a provenance library. Using this research approach, we formulated a provenance model with associated metadata and implemented a proof-of-concept provenance class. Furthermore, we seamlessly incorporated the internationally recognized World Wide Web Consortium (W3C) provenance standard, aligned the resultant provenance records with the interoperable health care standard Fast Healthcare Interoperability Resources, and presented them in various representation formats. Ultimately, we conducted a thorough assessment of provenance trace measurements. Results: This study marks the inaugural implementation of integrated provenance traces at the data element level within a German medical DIC. We devised and executed a practical method that synergizes the robustness of quality- and health standard--guided (meta)data management practices. Our measurements indicate commendable pipeline execution times, attaining notable levels of accuracy and reliability in processing clinical routine data, thereby ensuring accountability in the medical DIC. 
These findings should inspire the development of additional tools aimed at providing evidence-based and reliable electronic health record services for secondary use. Conclusions: The research method outlined for the proof-of-concept provenance class has been crafted to promote effective and reliable core data management practices. It aims to enhance biomedical data by imbuing it with meaningful provenance, thereby bolstering the benefits for both research and society. Additionally, it facilitates the streamlined reuse of biomedical data. As a result, the system mitigates risks, as data analysis without knowledge of the origin and quality of all data elements is rendered futile. While the approach was initially developed for the medical DIC use case, these principles can be universally applied throughout the scientific domain. ", doi="10.2196/50027", url="https://formative.jmir.org/2023/1/e50027", url="http://www.ncbi.nlm.nih.gov/pubmed/38060305" } @Article{info:doi/10.2196/44639, author="Keszthelyi, Daniel and Gaudet-Blavignac, Christophe and Bjelogrlic, Mina and Lovis, Christian", title="Patient Information Summarization in Clinical Settings: Scoping Review", journal="JMIR Med Inform", year="2023", month="Nov", day="28", volume="11", pages="e44639", keywords="summarization", keywords="electronic health records", keywords="EHR", keywords="medical record", keywords="visualization", keywords="dashboard", keywords="natural language processing", abstract="Background: Information overflow, a common problem in the present clinical environment, can be mitigated by summarizing clinical data. Although there are several solutions for clinical summarization, there is a lack of a complete overview of the research relevant to this field. Objective: This study aims to identify state-of-the-art solutions for clinical summarization, to analyze their capabilities, and to identify their properties. Methods: A scoping review of articles published between 2005 and 2022 was conducted. 
With a clinical focus, PubMed and Web of Science were queried to find an initial set of reports, later extended by articles found through a chain of citations. The included reports were analyzed to answer the questions of where, what, and how medical information is summarized; whether summarization conserves temporality, uncertainty, and medical pertinence; and how the propositions are evaluated and deployed. To answer how information is summarized, methods were compared through a new framework ``collect---synthesize---communicate'' referring to information gathering from data, its synthesis, and communication to the end user. Results: Overall, 128 articles were included, representing various medical fields. Exclusively structured data were used as input in 46.1\% (59/128) of papers, text in 41.4\% (53/128) of articles, and both in 10.2\% (13/128) of papers. Using the proposed framework, 42.2\% (54/128) of the records contributed to information collection, 27.3\% (35/128) contributed to information synthesis, and 46.1\% (59/128) presented solutions for summary communication. Numerous summarization approaches have been presented, including extractive (n=13) and abstractive summarization (n=19); topic modeling (n=5); summary specification (n=11); concept and relation extraction (n=30); visual design considerations (n=59); and complete pipelines (n=7) using information extraction, synthesis, and communication. Graphical displays (n=53), short texts (n=41), static reports (n=7), and problem-oriented views (n=7) were the most common types in terms of summary communication. Although temporality and uncertainty information were usually not conserved in most studies (74/128, 57.8\% and 113/128, 88.3\%, respectively), some studies presented solutions to treat this information. 
Overall, 115 (89.8\%) articles showed results of an evaluation, and methods included evaluations with human participants (median 15, IQR 24 participants): measurements in experiments with human participants (n=31), real situations (n=8), and usability studies (n=28). Methods without human involvement included intrinsic evaluation (n=24), performance on a proxy (n=10), or domain-specific tasks (n=11). Overall, 11 (8.6\%) reports described a system deployed in clinical settings. Conclusions: The scientific literature contains many propositions for summarizing patient information but reports very few comparisons of these proposals. This work proposes to compare these algorithms through how they conserve essential aspects of clinical information and through the ``collect---synthesize---communicate'' framework. We found that current propositions usually address these 3 steps only partially. Moreover, they conserve and use temporality, uncertainty, and pertinent medical aspects to varying extents, and solutions are often preliminary. 
", doi="10.2196/44639", url="https://medinform.jmir.org/2023/1/e44639", url="http://www.ncbi.nlm.nih.gov/pubmed/38015588" } @Article{info:doi/10.2196/47859, author="Kang, Jin Ha Ye and Batbaatar, Erdenebileg and Choi, Dong-Woo and Choi, Son Kui and Ko, Minsam and Ryu, Sun Kwang", title="Synthetic Tabular Data Based on Generative Adversarial Networks in Health Care: Generation and Validation Using the Divide-and-Conquer Strategy", journal="JMIR Med Inform", year="2023", month="Nov", day="24", volume="11", pages="e47859", keywords="generative adversarial networks", keywords="GAN", keywords="synthetic data generation", keywords="synthetic tabular data", keywords="lung cancer", keywords="machine learning", keywords="mortality prediction", abstract="Background: Synthetic data generation (SDG) based on generative adversarial networks (GANs) is used in health care, but research on preserving data with logical relationships with synthetic tabular data (STD) remains challenging. Filtering methods for SDG can lead to the loss of important information. Objective: This study proposed a divide-and-conquer (DC) method to generate STD based on the GAN algorithm, while preserving data with logical relationships. Methods: The proposed method was evaluated on data from the Korea Association for Lung Cancer Registry (KALC-R) and 2 benchmark data sets (breast cancer and diabetes). The DC-based SDG strategy comprises 3 steps: (1) We used 2 different partitioning methods (the class-specific criterion distinguished between survival and death groups, while the Cramer V criterion identified the highest correlation between columns in the original data); (2) the entire data set was divided into a number of subsets, which were then used as input for the conditional tabular generative adversarial network and the copula generative adversarial network to generate synthetic data; and (3) the generated synthetic data were consolidated into a single entity. 
For validation, we compared DC-based SDG and conditional sampling (CS)--based SDG through the performances of machine learning models. In addition, we generated imbalanced and balanced synthetic data for each of the 3 data sets and compared their performance using 4 classifiers: decision tree (DT), random forest (RF), Extreme Gradient Boosting (XGBoost), and light gradient-boosting machine (LGBM) models. Results: The synthetic data of the 3 diseases (non--small cell lung cancer [NSCLC], breast cancer, and diabetes) generated by our proposed model outperformed the 4 classifiers (DT, RF, XGBoost, and LGBM). The CS- versus DC-based model performances were compared using the mean area under the curve (SD) values: 74.87 (SD 0.77) versus 63.87 (SD 2.02) for NSCLC, 73.31 (SD 1.11) versus 67.96 (SD 2.15) for breast cancer, and 61.57 (SD 0.09) versus 60.08 (SD 0.17) for diabetes (DT); 85.61 (SD 0.29) versus 79.01 (SD 1.20) for NSCLC, 78.05 (SD 1.59) versus 73.48 (SD 4.73) for breast cancer, and 59.98 (SD 0.24) versus 58.55 (SD 0.17) for diabetes (RF); 85.20 (SD 0.82) versus 76.42 (SD 0.93) for NSCLC, 77.86 (SD 2.27) versus 68.32 (SD 2.37) for breast cancer, and 60.18 (SD 0.20) versus 58.98 (SD 0.29) for diabetes (XGBoost); and 85.14 (SD 0.77) versus 77.62 (SD 1.85) for NSCLC, 78.16 (SD 1.52) versus 70.02 (SD 2.17) for breast cancer, and 61.75 (SD 0.13) versus 61.12 (SD 0.23) for diabetes (LGBM). In addition, we found that balanced synthetic data performed better. Conclusions: This study is the first attempt to generate and validate STD based on a DC approach and shows improved performance using STD. The necessity for balanced SDG was also demonstrated. 
", doi="10.2196/47859", url="https://medinform.jmir.org/2023/1/e47859", url="http://www.ncbi.nlm.nih.gov/pubmed/37999942" } @Article{info:doi/10.2196/49314, author="Rose, Christian and Barber, Rachel and Preiksaitis, Carl and Kim, Ireh and Mishra, Nikesh and Kayser, Kristen and Brown, Italo and Gisondi, Michael", title="A Conference (Missingness in Action) to Address Missingness in Data and AI in Health Care: Qualitative Thematic Analysis", journal="J Med Internet Res", year="2023", month="Nov", day="23", volume="25", pages="e49314", keywords="machine learning", keywords="artificial intelligence", keywords="health care data", keywords="data quality", keywords="thematic analysis", keywords="AI", keywords="implementation", keywords="digital conference", keywords="trust", keywords="privacy", keywords="predictive model", keywords="health care community", abstract="Background: Missingness in health care data poses significant challenges in the development and implementation of artificial intelligence (AI) and machine learning solutions. Identifying and addressing these challenges is critical to ensuring the continued growth and accuracy of these models as well as their equitable and effective use in health care settings. Objective: This study aims to explore the challenges, opportunities, and potential solutions related to missingness in health care data for AI applications through the conduct of a digital conference and thematic analysis of conference proceedings. Methods: A digital conference was held in September 2022, attracting 861 registered participants, with 164 (19\%) attending the live event. The conference featured presentations and panel discussions by experts in AI, machine learning, and health care. Transcripts of the event were analyzed using the stepwise framework of Braun and Clark to identify key themes related to missingness in health care data. 
Results: Three principal themes---data quality and bias, human input in model development, and trust and privacy---emerged from the analysis. Topics included the accuracy of predictive models, lack of inclusion of underrepresented communities, partnership with physicians and other populations, challenges with sensitive health care data, and fostering trust with patients and the health care community. Conclusions: Addressing the challenges of data quality, human input, and trust is vital when devising and using machine learning algorithms in health care. Recommendations include expanding data collection efforts to reduce gaps and biases, involving medical professionals in the development and implementation of AI models, and developing clear ethical guidelines to safeguard patient privacy. Further research and ongoing discussions are needed to ensure these conclusions remain relevant as health care and AI continue to evolve. ", doi="10.2196/49314", url="https://www.jmir.org/2023/1/e49314", url="http://www.ncbi.nlm.nih.gov/pubmed/37995113" } @Article{info:doi/10.2196/47066, author="Biasiotto, Roberta and Viberg Johansson, Jennifer and Alemu, Birhanu Melaku and Romano, Virginia and Bentzen, Beate Heidi and Kaye, Jane and Ancillotti, Mirko and Blom, Catharina Johanna Maria and Chassang, Gauthier and Hallinan, Dara and J{\'o}nsd{\'o}ttir, Andrea Gu{\dh}bj{\"o}rg and Monasterio Astobiza, An{\'i}bal and Rial-Sebbag, Emmanuelle and Rodr{\'i}guez-Arias, David and Shah, Nisha and Skovgaard, Lea and Staunton, Ciara and Tschigg, Katharina and Veldwijk, Jorien and Mascalzoni, Deborah", title="Public Preferences for Digital Health Data Sharing: Discrete Choice Experiment Study in 12 European Countries", journal="J Med Internet Res", year="2023", month="Nov", day="23", volume="25", pages="e47066", keywords="governance", keywords="digital health data", keywords="preferences", keywords="Europe", keywords="discrete choice experiment", keywords="data use", keywords="data sharing", 
keywords="secondary use of data", abstract="Background: With new technologies, health data can be collected in a variety of different clinical, research, and public health contexts, and then can be used for a range of new purposes. Establishing the public's views about digital health data sharing is essential for policy makers to develop effective harmonization initiatives for digital health data governance at the European level. Objective: This study investigated public preferences for digital health data sharing. Methods: A discrete choice experiment survey was administered to a sample of European residents in 12 European countries (Austria, Denmark, France, Germany, Iceland, Ireland, Italy, the Netherlands, Norway, Spain, Sweden, and the United Kingdom) from August 2020 to August 2021. Respondents answered whether hypothetical situations of data sharing were acceptable for them. Each hypothetical scenario was defined by 5 attributes (``data collector,'' ``data user,'' ``reason for data use,'' ``information on data sharing and consent,'' and ``availability of review process''), which had 3 to 4 attribute levels each. A latent class model was run across the whole data set and separately for different European regions (Northern, Central, and Southern Europe). Attribute relative importance was calculated for each latent class's pooled and regional data sets. Results: A total of 5015 completed surveys were analyzed. In general, the most important attribute for respondents was the availability of information and consent during health data sharing. In the latent class model, 4 classes of preference patterns were identified. While respondents in 2 classes strongly expressed their preferences for data sharing with opposing positions, respondents in the other 2 classes preferred not to share their data, but attribute levels of the situation could have had an impact on their preferences. 
Respondents generally found the following to be the most acceptable: a national authority or academic research project as the data user; being informed and asked to consent; and a review process for data transfer and use, or transfer only. On the other hand, collection of their data by a technological company and data use for commercial communication were the least acceptable. There was preference heterogeneity across Europe and within European regions. Conclusions: This study showed the importance of transparency in data use and oversight of health-related data sharing for European respondents. Regional and intraregional preference heterogeneity for ``data collector,'' ``data user,'' ``reason,'' ``type of consent,'' and ``review'' calls for governance solutions that would grant data subjects the ability to control their digital health data being shared within different contexts. These results suggest that the use of data without consent will demand weighty and exceptional reasons. An interactive and dynamic informed consent model combined with oversight mechanisms may be a solution for policy initiatives aiming to harmonize health data use across Europe. ", doi="10.2196/47066", url="https://www.jmir.org/2023/1/e47066", url="http://www.ncbi.nlm.nih.gov/pubmed/37995125" } @Article{info:doi/10.2196/50998, author="Yu, Shirui and Wang, Ziyang and Nan, Jiale and Li, Aihua and Yang, Xuemei and Tang, Xiaoli", title="Potential Schizophrenia Disease-Related Genes Prediction Using Metagraph Representations Based on a Protein-Protein Interaction Keyword Network: Framework Development and Validation", journal="JMIR Form Res", year="2023", month="Nov", day="15", volume="7", pages="e50998", keywords="disease gene prediction", keywords="metagraph", keywords="protein representations", keywords="schizophrenia", keywords="keyword network", abstract="Background: Schizophrenia is a serious mental disease. 
With increased research funding for this disease, schizophrenia has become one of the key areas of focus in the medical field. Searching for associations between diseases and genes is an effective approach to study complex diseases, which may enhance research on schizophrenia pathology and lead to the identification of new treatment targets. Objective: The aim of this study was to identify potential schizophrenia risk genes by employing machine learning methods to extract topological characteristics of proteins and their functional roles in a protein-protein interaction (PPI)-keywords (PPIK) network and understand the complex disease--causing property. Consequently, a PPIK-based metagraph representation approach is proposed. Methods: To enrich the PPI network, we integrated keywords describing protein properties and constructed a PPIK network. We extracted features that describe the topology of this network through metagraphs. We further transformed these metagraphs into vectors and represented proteins with a series of vectors. We then trained and optimized our model using random forest (RF), extreme gradient boosting, light gradient boosting machine, and logistic regression models. Results: Comprehensive experiments demonstrated the good performance of our proposed method with an area under the receiver operating characteristic curve (AUC) value between 0.72 and 0.76. Our model also outperformed baseline methods for overall disease protein prediction, including the random walk with restart, average commute time, and Katz models. Compared with the PPI network constructed from the baseline models, complementation of keywords in the PPIK network improved the performance (AUC) by 0.08 on average, and the metagraph-based method improved the AUC by 0.30 on average compared with that of the baseline methods. 
According to the comprehensive performance of the four models, RF was selected as the best model for disease protein prediction, with precision, recall, F1-score, and AUC values of 0.76, 0.73, 0.72, and 0.76, respectively. We transformed these proteins to their encoding gene IDs and identified the top 20 genes as the most probable schizophrenia-risk genes, including the EYA3, CNTN4, HSPA8, LRRK2, and AFP genes. We further validated these outcomes against metagraph features and evidence from the literature, performed a features analysis, and exploited evidence from the literature to interpret the correlation between the predicted genes and diseases. Conclusions: The metagraph representation based on the PPIK network framework was found to be effective for potential schizophrenia risk genes identification. The results are quite reliable as evidence can be found in the literature to support our prediction. Our approach can provide more biological insights into the pathogenesis of schizophrenia. ", doi="10.2196/50998", url="https://formative.jmir.org/2023/1/e50998", url="http://www.ncbi.nlm.nih.gov/pubmed/37966892" } @Article{info:doi/10.2196/48030, author="Pirmani, Ashkan and De Brouwer, Edward and Geys, Lotte and Parciak, Tina and Moreau, Yves and Peeters, M. 
Liesbet", title="The Journey of Data Within a Global Data Sharing Initiative: A Federated 3-Layer Data Analysis Pipeline to Scale Up Multiple Sclerosis Research", journal="JMIR Med Inform", year="2023", month="Nov", day="9", volume="11", pages="e48030", keywords="data analysis pipeline", keywords="federated model sharing", keywords="real-world data", keywords="evidence-based decision-making", keywords="end-to-end pipeline", keywords="multiple sclerosis", keywords="data analysis", keywords="pipeline", keywords="data science", keywords="federated", keywords="neurology", keywords="brain", keywords="spine", keywords="spinal nervous system", keywords="neuroscience", keywords="data sharing", keywords="rare", keywords="low prevalence", abstract="Background: Investigating low-prevalence diseases such as multiple sclerosis is challenging because of the rather small number of individuals affected by this disease and the scattering of real-world data across numerous data sources. These obstacles impair data integration, standardization, and analysis, which negatively impact the generation of significant meaningful clinical evidence. Objective: This study aims to present a comprehensive, research question--agnostic, multistakeholder-driven end-to-end data analysis pipeline that accommodates 3 prevalent data-sharing streams: individual data sharing, core data set sharing, and federated model sharing. Methods: A demand-driven methodology is employed for standardization, followed by 3 streams of data acquisition, a data quality enhancement process, a data integration procedure, and a concluding analysis stage to fulfill real-world data-sharing requirements. This pipeline's effectiveness was demonstrated through its successful implementation in the COVID-19 and multiple sclerosis global data sharing initiative. Results: The global data sharing initiative yielded multiple scientific publications and provided extensive worldwide guidance for the community with multiple sclerosis. 
The pipeline facilitated gathering pertinent data from various sources, accommodating distinct sharing streams and assimilating them into a unified data set for subsequent statistical analysis or secure data examination. This pipeline contributed to the assembly of the largest data set of people with multiple sclerosis infected with COVID-19. Conclusions: The proposed data analysis pipeline exemplifies the potential of global stakeholder collaboration and underlines the significance of evidence-based decision-making. It serves as a paradigm for how data sharing initiatives can propel advancements in health care, emphasizing its adaptability and capacity to address diverse research inquiries. ", doi="10.2196/48030", url="https://medinform.jmir.org/2023/1/e48030", url="http://www.ncbi.nlm.nih.gov/pubmed/37943585" } @Article{info:doi/10.2196/48809, author="Gierend, Kerstin and Freiesleben, Sherry and Kadioglu, Dennis and Siegel, Fabian and Ganslandt, Thomas and Waltemath, Dagmar", title="The Status of Data Management Practices Across German Medical Data Integration Centers: Mixed Methods Study", journal="J Med Internet Res", year="2023", month="Nov", day="8", volume="25", pages="e48809", keywords="data management", keywords="provenance", keywords="traceability", keywords="metadata", keywords="data integration center", keywords="maturity model", abstract="Background: In the context of the Medical Informatics Initiative, medical data integration centers (DICs) have implemented complex data flows to transfer routine health care data into research data repositories for secondary use. Data management practices are of importance throughout these processes, and special attention should be given to provenance aspects. Insufficient knowledge can lead to validity risks and reduce the confidence and quality of the processed data. The need to implement maintainable data management practices is undisputed, but there is a great lack of clarity on the status. 
Objective: Our study examines the current data management practices throughout the data life cycle within the Medical Informatics in Research and Care in University Medicine (MIRACUM) consortium. We present a framework for the maturity status of data management practices and provide recommendations to enable trustworthy dissemination and reuse of routine health care data. Methods: In this mixed methods study, we conducted semistructured interviews with stakeholders from 10 DICs between July and September 2021. We used a self-designed questionnaire, tailored to the MIRACUM DICs, to collect qualitative and quantitative data. Our study method is compliant with the Good Reporting of a Mixed Methods Study (GRAMMS) checklist. Results: Our study provides insights into the data management practices at the MIRACUM DICs. We identify several traceability issues that can be partially explained by a lack of contextual information within nonharmonized workflow steps, unclear responsibilities, missing or incomplete data elements, and incomplete information about the computational environment. Based on the identified shortcomings, we suggest a data management maturity framework to reach more clarity and to help define enhanced data management strategies. Conclusions: The data management maturity framework supports the production and dissemination of accurate and provenance-enriched data for secondary use. Our work serves as a catalyst for the derivation of an overarching data management strategy, upholding data integrity and provenance characteristics as key factors. We envision that this work will lead to the generation of fairer and well-maintained health research data of high quality. 
", doi="10.2196/48809", url="https://www.jmir.org/2023/1/e48809", url="http://www.ncbi.nlm.nih.gov/pubmed/37938878" } @Article{info:doi/10.2196/41446, author="Bernardi, Andrade Filipe and Alves, Domingos and Crepaldi, Nathalia and Yamada, Bettiol Diego and Lima, Costa Vin{\'i}cius and Rijo, Rui", title="Data Quality in Health Research: Integrative Literature Review", journal="J Med Internet Res", year="2023", month="Oct", day="31", volume="25", pages="e41446", keywords="data quality", keywords="research", keywords="digital health", keywords="review", keywords="decision-making", keywords="health data", keywords="research network", keywords="artificial intelligence", keywords="e-management", keywords="digital governance", keywords="reliability", keywords="database", keywords="health system", keywords="health services", keywords="health stakeholders", abstract="Background: Decision-making and strategies to improve service delivery must be supported by reliable health data to generate consistent evidence on health status. The data quality management process must ensure the reliability of collected data. Consequently, various methodologies to improve the quality of services are applied in the health field. At the same time, scientific research is constantly evolving to improve data quality through better reproducibility and empowerment of researchers, and it offers patient groups tools for secure data sharing and privacy compliance. Objective: Through an integrative literature review, the aim of this work was to identify and evaluate digital health technology interventions designed to support the conduct of health research based on data quality. Methods: A search was conducted in 6 electronic scientific databases in January 2022: PubMed, SCOPUS, Web of Science, Institute of Electrical and Electronics Engineers Digital Library, Cumulative Index of Nursing and Allied Health Literature, and Latin American and Caribbean Health Sciences Literature. 
The Preferred Reporting Items for Systematic Reviews and Meta-Analyses checklist and flowchart were used to visualize the search strategy results in the databases. Results: After analyzing and extracting the outcomes of interest, 33 papers were included in the review. The studies covered the period of 2017-2021 and were conducted in 22 countries. Key findings revealed variability and a lack of consensus in assessing data quality domains and metrics. Data quality factors included the research environment, application time, and development steps. Strategies for improving data quality involved using business intelligence models, statistical analyses, data mining techniques, and qualitative approaches. Conclusions: The main barriers to health data quality are technical, motivational, economic, political, legal, ethical, organizational, human resources, and methodological. The data quality process and techniques, from precollection to gathering, postcollection, and analysis, are critical for the final result of a study or the quality of processes and decision-making in a health care organization. The findings highlight the need for standardized practices and collaborative efforts to enhance data quality in health research. Finally, context guides decisions regarding data quality strategies and techniques. 
International Registered Report Identifier (IRRID): RR2-10.1101/2022.05.31.22275804 ", doi="10.2196/41446", url="https://www.jmir.org/2023/1/e41446", url="http://www.ncbi.nlm.nih.gov/pubmed/37906223" } @Article{info:doi/10.2196/45225, author="Lou, Pei and Fang, An and Zhao, Wanqing and Yao, Kuanda and Yang, Yusheng and Hu, Jiahui", title="Potential Target Discovery and Drug Repurposing for Coronaviruses: Study Involving a Knowledge Graph--Based Approach", journal="J Med Internet Res", year="2023", month="Oct", day="20", volume="25", pages="e45225", keywords="coronavirus", keywords="heterogeneous data integration", keywords="knowledge graph embedding", keywords="drug repurposing", keywords="interpretable prediction", keywords="COVID-19", abstract="Background: The global pandemics of severe acute respiratory syndrome, Middle East respiratory syndrome, and COVID-19 have caused unprecedented crises for public health. Coronaviruses are constantly evolving, and it is unknown which new coronavirus will emerge and when the next coronavirus will sweep across the world. Knowledge graphs are expected to help discover the pathogenicity and transmission mechanism of viruses. Objective: The aim of this study was to discover potential targets and candidate drugs to repurpose for coronaviruses through a knowledge graph--based approach. Methods: We propose a computational and evidence-based knowledge discovery approach to identify potential targets and candidate drugs for coronaviruses from biomedical literature and well-known knowledge bases. To organize the semantic triples extracted automatically from biomedical literature, a semantic conversion model was designed. The literature knowledge was associated and integrated with existing drug and gene knowledge through semantic mapping, and the coronavirus knowledge graph (CovKG) was constructed. 
We adopted both the knowledge graph embedding model and the semantic reasoning mechanism to discover unrecorded mechanisms of drug action as well as potential targets and drug candidates. Furthermore, we have provided evidence-based support with a scoring and backtracking mechanism. Results: The constructed CovKG contains 17,369,620 triples, of which 641,195 were extracted from biomedical literature, covering 13,065 concept unique identifiers, 209 semantic types, and 97 semantic relations of the Unified Medical Language System. Through multi-source knowledge integration, 475 drugs and 262 targets were mapped to existing knowledge, and 41 new drug mechanisms of action were found by semantic reasoning, which were not recorded in the existing knowledge base. Among the knowledge graph embedding models, TransR outperformed others (mean reciprocal rank=0.2510, Hits@10=0.3505). A total of 33 potential targets and 18 drug candidates were identified for coronaviruses. Among them, 7 novel drugs (ie, quinine, nelfinavir, ivermectin, asunaprevir, tylophorine, Artemisia annua extract, and resveratrol) and 3 highly ranked targets (ie, angiotensin converting enzyme 2, transmembrane serine protease 2, and M protein) were further discussed. Conclusions: We showed the effectiveness of a knowledge graph--based approach in potential target discovery and drug repurposing for coronaviruses. Our approach can be extended to other viruses or diseases for biomedical knowledge discovery and relevant applications. 
", doi="10.2196/45225", url="https://www.jmir.org/2023/1/e45225", url="http://www.ncbi.nlm.nih.gov/pubmed/37862061" } @Article{info:doi/10.2196/47254, author="Blatter, Ueli Tobias and Witte, Harald and Fasquelle-Lopez, Jules and Nakas, Theodoros Christos and Raisaro, Louis Jean and Leichtle, Benedikt Alexander", title="The BioRef Infrastructure, a Framework for Real-Time, Federated, Privacy-Preserving, and Personalized Reference Intervals: Design, Development, and Application", journal="J Med Internet Res", year="2023", month="Oct", day="18", volume="25", pages="e47254", keywords="personalized health", keywords="laboratory medicine", keywords="reference interval", keywords="research infrastructure", keywords="sensitive data", keywords="confidential data", keywords="data security", keywords="differential privacy", keywords="precision medicine", abstract="Background: Reference intervals (RIs) for patient test results are in standard use across many medical disciplines, allowing physicians to identify measurements indicating potentially pathological states with relative ease. The process of inferring cohort-specific RIs is, however, often ignored because of the high costs and cumbersome efforts associated with it. Sophisticated analysis tools are required to automatically infer relevant and locally specific RIs directly from routine laboratory data. These tools would effectively connect clinical laboratory databases to physicians and provide personalized target ranges for the respective cohort population. Objective: This study aims to describe the BioRef infrastructure, a multicentric governance and IT framework for the estimation and assessment of patient group--specific RIs from routine clinical laboratory data using an innovative decentralized data-sharing approach and a sophisticated, clinically oriented graphical user interface for data analysis. 
Methods: A common governance agreement and interoperability standards have been established, allowing the harmonization of multidimensional laboratory measurements from multiple clinical databases into a unified ``big data'' resource. International coding systems, such as the International Classification of Diseases, Tenth Revision (ICD-10); unique identifiers for medical devices from the Global Unique Device Identification Database; type identifiers from the Global Medical Device Nomenclature; and a universal transfer logic, such as the Resource Description Framework (RDF), are used to align the routine laboratory data of each data provider for use within the BioRef framework. With a decentralized data-sharing approach, the BioRef data can be evaluated by end users from each cohort site following a strict ``no copy, no move'' principle, that is, only data aggregates for the intercohort analysis of target ranges are exchanged. Results: The TI4Health distributed and secure analytics system was used to implement the proposed federated and privacy-preserving approach and comply with the limitations applied to sensitive patient data. Under the BioRef interoperability consensus, clinical partners enable the computation of RIs via the TI4Health graphical user interface for query without exposing the underlying raw data. The interface was developed for use by physicians and clinical laboratory specialists and allows intuitive and interactive data stratification by patient factors (age, sex, and personal medical history) as well as laboratory analysis determinants (device, analyzer, and test kit identifier). This consolidated effort enables the creation of extremely detailed and patient group--specific queries, allowing the generation of individualized, covariate-adjusted RIs on the fly. 
Conclusions: With the BioRef-TI4Health infrastructure, a framework has been implemented that allows clinical physicians and researchers to define precise RIs immediately in a convenient, privacy-preserving, and reproducible manner, promoting a vital part of practicing precision medicine while streamlining compliance and avoiding transfers of raw patient data. This new approach can provide a crucial update on RIs and improve patient care for personalized medicine. ", doi="10.2196/47254", url="https://www.jmir.org/2023/1/e47254", url="http://www.ncbi.nlm.nih.gov/pubmed/37851984" } @Article{info:doi/10.2196/44892, author="Zhang, Ying and Li, Xiaoying and Liu, Yi and Li, Aihua and Yang, Xuemei and Tang, Xiaoli", title="A Multilabel Text Classifier of Cancer Literature at the Publication Level: Methods Study of Medical Text Classification", journal="JMIR Med Inform", year="2023", month="Oct", day="5", volume="11", pages="e44892", keywords="text classification", keywords="publication-level classifier", keywords="cancer literature", keywords="deep learning", abstract="Background: Given the threat posed by cancer to human health, there is a rapid growth in the volume of data in the cancer field, and interdisciplinary and collaborative research is becoming increasingly important for fine-grained classification. The low-resolution classifier of reported studies at the journal level fails to satisfy advanced searching demands, and a single label does not adequately characterize the literature originating from interdisciplinary research. There is thus a need to establish a multilabel classifier with higher resolution to support literature retrieval for cancer research and reduce the burden of screening papers for clinical relevance. 
Objective: The primary objective of this research was to address the low-resolution issue of cancer literature classification due to the ambiguity of the existing journal-level classifier, in order to support gaining high-relevance evidence for clinical consideration and comprehensive results for literature retrieval. Methods: We trained a multilabel classifier with scalability for classifying the literature on cancer research directly at the publication level to assign proper content-derived labels based on the ``Bidirectional Encoder Representation from Transformers (BERT) + X'' model and obtain the best option for X. First, a corpus of 70,599 cancer publications retrieved from the Dimensions database was divided into a training and a testing set in a ratio of 7:3. Second, using the classification terminology of International Cancer Research Partnership cancer types, we compared the performance of classifiers developed using BERT and 5 classical deep learning models, such as the text recurrent neural network (TextRNN) and FastText, followed by metrics analysis. Results: After comparing various combined deep learning models, we obtained a classifier based on the optimal combination ``BERT + TextRNN,'' with a precision of 93.09\%, a recall of 87.75\%, and an F1-score of 90.34\%. Moreover, we quantified the distinctive characteristics in the text structure and multilabel distribution in order to generalize the model to other fields with similar characteristics. Conclusions: The ``BERT + TextRNN'' model was trained for high-resolution classification of cancer literature at the publication level to support accurate retrieval and academic statistics. The model automatically assigns 1 or more labels to each cancer paper, as required. Quantitative comparison verified that the ``BERT + TextRNN'' model is the best fit for multilabel classification of cancer literature compared to other models. 
More data from diverse fields will be collected to test the scalability and extensibility of the proposed model in the future. ", doi="10.2196/44892", url="https://medinform.jmir.org/2023/1/e44892", url="http://www.ncbi.nlm.nih.gov/pubmed/37796584" } @Article{info:doi/10.2196/44310, author="Guo, Manping and Wang, Yiming and Yang, Qiaoning and Li, Rui and Zhao, Yang and Li, Chenfei and Zhu, Mingbo and Cui, Yao and Jiang, Xin and Sheng, Song and Li, Qingna and Gao, Rui", title="Normal Workflow and Key Strategies for Data Cleaning Toward Real-World Data: Viewpoint", journal="Interact J Med Res", year="2023", month="Sep", day="21", volume="12", pages="e44310", keywords="data cleaning", keywords="data quality", keywords="key technologies", keywords="real-world data", keywords="viewpoint", doi="10.2196/44310", url="https://www.i-jmr.org/2023/1/e44310", url="http://www.ncbi.nlm.nih.gov/pubmed/37733421" } @Article{info:doi/10.2196/48115, author="Zhang, Zeyu and Fang, Meng and Wu, Rebecca and Zong, Hui and Huang, Honglian and Tong, Yuantao and Xie, Yujia and Cheng, Shiyang and Wei, Ziyi and Crabbe, C. M. James and Zhang, Xiaoyan and Wang, Ying", title="Large-Scale Biomedical Relation Extraction Across Diverse Relation Types: Model Development and Usability Study on COVID-19", journal="J Med Internet Res", year="2023", month="Sep", day="20", volume="25", pages="e48115", keywords="biomedical text mining", keywords="biomedical relation extraction", keywords="pretrained language model", keywords="task-adaptive pretraining", keywords="knowledge graph", keywords="knowledge discovery", keywords="clinical drug path", keywords="COVID-19", abstract="Background: Biomedical relation extraction (RE) is of great importance for researchers to conduct systematic biomedical studies. It not only helps knowledge mining, such as knowledge graphs and novel knowledge discovery, but also promotes translational applications, such as clinical diagnosis, decision-making, and precision medicine. 
However, the relations between biomedical entities are complex and diverse, and comprehensive biomedical RE is not yet well established. Objective: We aimed to investigate and improve large-scale RE with diverse relation types and conduct usability studies with application scenarios to optimize biomedical text mining. Methods: Data sets containing 125 relation types with different entity semantic levels were constructed to evaluate the impact of entity semantic information on RE, and performance analysis was conducted on different model architectures and domain models. This study also proposed a continued pretraining strategy and integrated models with scripts into a tool. Furthermore, this study applied RE to the COVID-19 corpus with article topics and application scenarios of clinical interest to assess and demonstrate its biological interpretability and usability. Results: The performance analysis revealed that RE achieves the best performance when the detailed semantic type is provided. For a single model, PubMedBERT with continued pretraining performed the best, with an F1-score of 0.8998. Usability studies on COVID-19 demonstrated the interpretability and usability of RE, and a relation graph database was constructed, which was used to reveal existing and novel drug paths with edge explanations. The models (including pretrained and fine-tuned models), integrated tool (Docker), and generated data (including the COVID-19 relation graph database and drug paths) have been made publicly available to the biomedical text mining community and clinical researchers. Conclusions: This study provided a comprehensive analysis of RE with diverse relation types. Optimized RE models and tools for diverse relation types were developed, which can be widely used in biomedical text mining. Our usability studies provided a proof-of-concept demonstration of how large-scale RE can be leveraged to facilitate novel research. 
", doi="10.2196/48115", url="https://www.jmir.org/2023/1/e48115", url="http://www.ncbi.nlm.nih.gov/pubmed/37632414" } @Article{info:doi/10.2196/48636, author="Butters, Alexandra and Blanch, Bianca and Kemp-Casey, Anna and Do, Judy and Yeates, Laura and Leslie, Felicity and Semsarian, Christopher and Nedkoff, Lee and Briffa, Tom and Ingles, Jodie and Sweeting, Joanna", title="The Australian Genetic Heart Disease Registry: Protocol for a Data Linkage Study", journal="JMIR Res Protoc", year="2023", month="Sep", day="20", volume="12", pages="e48636", keywords="data linkage", keywords="genetic heart diseases", keywords="health care use", keywords="cardiomyopathies", keywords="arrhythmia", keywords="cardiology", keywords="heart", keywords="genetics", keywords="registry", keywords="registries", keywords="risk", keywords="mortality", keywords="national", keywords="big data", keywords="harmonization", keywords="probabilistic matching", abstract="Background: Genetic heart diseases such as hypertrophic cardiomyopathy can cause significant morbidity and mortality, ranging from syncope, chest pain, and palpitations to heart failure and sudden cardiac death. These diseases are inherited in an autosomal dominant fashion, meaning family members of affected individuals have a 1 in 2 chance of also inheriting the disease (``at-risk relatives''). The health care use patterns of individuals with a genetic heart disease, including emergency department presentations and hospital admissions, are poorly understood. By linking genetic heart disease registry data to routinely collected health data, we aim to provide a more comprehensive clinical data set to examine the burden of disease on individuals, families, and health care systems. 
Objective: The objective of this study is to link the Australian Genetic Heart Disease (AGHD) Registry with routinely collected whole-population health data sets to investigate the health care use of individuals with a genetic heart disease and their at-risk relatives. This linked data set will allow for the investigation of differences in outcomes and health care use due to disease, sex, socioeconomic status, and other factors. Methods: The AGHD Registry is a nationwide data set that began in 2007 and aims to recruit individuals with a genetic heart disease and their family members. In this study, demographic, clinical, and genetic data (available from 2007 to 2019) for AGHD Registry participants and at-risk relatives residing in New South Wales (NSW), Australia, were linked to routinely collected health data. These data included NSW-based data sets covering hospitalizations (2001-2019), emergency department presentations (2005-2019), and both state-wide and national mortality registries (2007-2019). The linkage was performed by the Centre for Health Record Linkage. Investigations stratifying by diagnosis, age, sex, socioeconomic status, and gene status will be undertaken and reported using descriptive statistics. Results: NSW AGHD Registry participants were linked to routinely collected health data sets using probabilistic matching (November 2019). Of 1720 AGHD Registry participants, 1384 had linkages with 11,610 hospital records, 7032 emergency department records, and 60 death records. Data assessment and harmonization were performed, and descriptive data analysis is underway. Conclusions: We intend to provide insights into the health care use patterns of individuals with a genetic heart disease and their at-risk relatives, including frequency of hospital admissions and differences due to factors such as disease, sex, and socioeconomic status. 
Identifying disparities and potential barriers to care may highlight specific health care needs (eg, between sexes) and factors impacting health care access and use. International Registered Report Identifier (IRRID): DERR1-10.2196/48636 ", doi="10.2196/48636", url="https://www.researchprotocols.org/2023/1/e48636", url="http://www.ncbi.nlm.nih.gov/pubmed/37728963" } @Article{info:doi/10.2196/47540, author="Tajabadi, Mohammad and Grabenhenrich, Linus and Ribeiro, Ad{\`e}le and Leyer, Michael and Heider, Dominik", title="Sharing Data With Shared Benefits: Artificial Intelligence Perspective", journal="J Med Internet Res", year="2023", month="Aug", day="29", volume="25", pages="e47540", keywords="federated learning", keywords="machine learning", keywords="medical data", keywords="fairness", keywords="data sharing", keywords="artificial intelligence", keywords="development", keywords="artificial intelligence model", keywords="applications", keywords="data analysis", keywords="diagnostic tool", keywords="tool", doi="10.2196/47540", url="https://www.jmir.org/2023/1/e47540", url="http://www.ncbi.nlm.nih.gov/pubmed/37642995" } @Article{info:doi/10.2196/45013, author="Inau, Thea Esther and Sack, Jean and Waltemath, Dagmar and Zeleke, Alamirrew Atinkut", title="Initiatives, Concepts, and Implementation Practices of the Findable, Accessible, Interoperable, and Reusable Data Principles in Health Data Stewardship: Scoping Review", journal="J Med Internet Res", year="2023", month="Aug", day="28", volume="25", pages="e45013", keywords="data stewardship", keywords="findable, accessible, interoperable, and reusable data principles", keywords="FAIR data principles", keywords="health research", keywords="Preferred Reporting Items for Systematic Reviews and Meta-Analyses", keywords="PRISMA", keywords="qualitative analysis", keywords="scoping review", keywords="information retrieval", keywords="health information exchange", abstract="Background: Thorough data stewardship is a key 
enabler of comprehensive health research. Processes such as data collection, storage, access, sharing, and analytics require researchers to follow elaborate data management strategies properly and consistently. Studies have shown that findable, accessible, interoperable, and reusable (FAIR) data leads to improved data sharing in different scientific domains. Objective: This scoping review identifies and discusses concepts, approaches, implementation experiences, and lessons learned in FAIR initiatives in health research data. Methods: The Arksey and O'Malley stage-based methodological framework for scoping reviews was applied. PubMed, Web of Science, and Google Scholar were searched to access relevant publications. Articles written in English, published between 2014 and 2020, and addressing FAIR concepts or practices in the health domain were included. The 3 data sources were deduplicated using reference management software. In total, 2 independent authors reviewed the eligibility of each article based on defined inclusion and exclusion criteria. A charting tool was used to extract information from the full-text papers. The results were reported using the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) guidelines. Results: A total of 2.18\% (34/1561) of the screened articles were included in the final review. The authors reported FAIRification approaches, which include interpolation, inclusion of comprehensive data dictionaries, repository design, semantic interoperability, ontologies, data quality, linked data, and requirement gathering for FAIRification tools. Challenges and mitigation strategies associated with FAIRification, such as high setup costs, data politics, technical and administrative issues, privacy concerns, and difficulties encountered in sharing health data owing to its sensitive nature, were also reported. 
We found various workflows, tools, and infrastructures designed by different groups worldwide to facilitate the FAIRification of health research data. We also uncovered a wide range of problems and questions that researchers are trying to address by using the different workflows, tools, and infrastructures. Although the concept of FAIR data stewardship in the health research domain is relatively new, almost all continents have been reached by at least one network trying to achieve health data FAIRness. Documented outcomes of FAIRification efforts include peer-reviewed publications, improved data sharing, facilitated data reuse, return on investment, and new treatments. Successful FAIRification of data has informed the management and prognosis of various diseases such as cancer, cardiovascular diseases, and neurological diseases. Efforts to FAIRify data on a wider variety of diseases have been ongoing since the COVID-19 pandemic. Conclusions: This work summarizes projects, tools, and workflows for the FAIRification of health research data. The comprehensive review shows that implementing the FAIR concept in health data stewardship carries the promise of improved research data management and transparency in the era of big data and open research publishing. 
International Registered Report Identifier (IRRID): RR2-10.2196/22505 ", doi="10.2196/45013", url="https://www.jmir.org/2023/1/e45013", url="http://www.ncbi.nlm.nih.gov/pubmed/37639292" } @Article{info:doi/10.2196/46275, author="Xu, Ting and Ma, Yuming and Pan, Tianya and Chen, Yifei and Liu, Yuhua and Zhu, Fudong and Zhou, Zhiguang and Chen, Qianming", title="Visual Analytics of Multidimensional Oral Health Surveys: Data Mining Study", journal="JMIR Med Inform", year="2023", month="Aug", day="1", volume="11", pages="e46275", keywords="visual analytics", keywords="oral health data mining", keywords="knowledge graph", keywords="multidimensional data visualization", abstract="Background: Oral health surveys largely facilitate the prevention and treatment of oral diseases as well as the awareness of population health status. As oral health is always surveyed from a variety of perspectives, it is a difficult and complicated task to gain insights from multidimensional oral health surveys. Objective: We aimed to develop a visualization framework for the visual analytics and deep mining of multidimensional oral health surveys. Methods: First, diseases and groups were embedded into data portraits based on their multidimensional attributes. Subsequently, group classification and correlation pattern extraction were conducted to explore the correlation features among diseases, behaviors, symptoms, and cognitions. On the basis of the feature mining of diseases, groups, behaviors, and their attributes, a knowledge graph was constructed to reveal semantic information, integrate the graph query function, and describe the features of interest to users. Results: A visualization framework was implemented for the exploration of multidimensional oral health surveys. A series of user-friendly interactions were integrated to propose a visual analysis system that can help users further discern the regularities of oral health conditions. 
Conclusions: A visualization framework is provided in this paper with a set of meaningful user interactions integrated, enabling users to intuitively understand the oral health situation and conduct in-depth data exploration and analysis. Case studies based on real-world data sets demonstrate the effectiveness of our system in the exploration of oral diseases. ", doi="10.2196/46275", url="https://medinform.jmir.org/2023/1/e46275", url="http://www.ncbi.nlm.nih.gov/pubmed/37526971" } @Article{info:doi/10.2196/42404, author="Mitchell, Eileen and O'Reilly, Dermot and O'Donovan, Diarmuid and Bradley, Declan", title="Predictors and Consequences of Homelessness: Protocol for a Cohort Study Design Using Linked Routine Data", journal="JMIR Res Protoc", year="2023", month="Jul", day="27", volume="12", pages="e42404", keywords="administrative data", keywords="data linkage", keywords="health care use", keywords="homelessness", keywords="housing", keywords="mortality", abstract="Background: Homelessness is a global burden, estimated to impact more than 100 million people worldwide. Individuals and families experiencing homelessness are more likely to have poorer physical and mental health than the general population. Administrative data is being increasingly used in homelessness research. Objective: The objective of this study is to combine administrative health care data and social housing data to better understand the consequences and predictors associated with being homeless. Methods: We will be linking health and social care administrative databases from Northern Ireland, United Kingdom. We will conduct descriptive analyses to examine trends in homelessness and investigate risk factors for key outcomes. Results: The results of our analyses will be shared with stakeholders, reported at conferences and in academic journals, and summarized in policy briefing notes for policymakers. 
Conclusions: This study will aim to identify predictors and consequences of homelessness in Northern Ireland using linked housing, health, and social care data. The findings of this study will examine trends and outcomes in this vulnerable population using routinely collected health and social care administrative data. International Registered Report Identifier (IRRID): DERR1-10.2196/42404 ", doi="10.2196/42404", url="https://www.researchprotocols.org/2023/1/e42404", url="http://www.ncbi.nlm.nih.gov/pubmed/37498664" } @Article{info:doi/10.2196/41858, author="Huang, Shih-Tsung and Hsiao, Fei-Yuan and Tsai, Tsung-Hsien and Chen, Pei-Jung and Peng, Li-Ning and Chen, Liang-Kung", title="Using Hypothesis-Led Machine Learning and Hierarchical Cluster Analysis to Identify Disease Pathways Prior to Dementia: Longitudinal Cohort Study", journal="J Med Internet Res", year="2023", month="Jul", day="26", volume="25", pages="e41858", keywords="dementia", keywords="machine learning", keywords="cluster analysis", keywords="disease", keywords="condition", keywords="symptoms", keywords="data", keywords="data set", keywords="cardiovascular", keywords="neuropsychiatric", keywords="infection", keywords="mobility", keywords="mental conditions", keywords="development", abstract="Background: Dementia development is a complex process in which the occurrence and sequential relationships of different diseases or conditions may construct specific patterns leading to incident dementia. Objective: This study aimed to identify patterns of disease or symptom clusters and their sequences prior to incident dementia using a novel approach incorporating machine learning methods. Methods: Using Taiwan's National Health Insurance Research Database, data from 15,700 older people with dementia and 15,700 nondementia controls matched on age, sex, and index year (n=10,466, 67\% for the training data set and n=5234, 33\% for the testing data set) were retrieved for analysis. 
Using machine learning methods to capture specific hierarchical disease triplet clusters prior to dementia, we designed a study algorithm with four steps: (1) data preprocessing, (2) disease or symptom pathway selection, (3) model construction and optimization, and (4) data visualization. Results: Among 15,700 identified older people with dementia, 10,466 and 5234 subjects were randomly assigned to the training and testing data sets, and 6215 hierarchical disease triplet clusters with positive correlations with dementia onset were identified. We subsequently generated 19,438 features to construct prediction models, and the model with the best performance was support vector machine (SVM) with the by-group LASSO (least absolute shrinkage and selection operator) regression method (total corresponding features=2513; accuracy=0.615; sensitivity=0.607; specificity=0.622; positive predictive value=0.612; negative predictive value=0.619; area under the curve=0.639). In total, this study captured 49 hierarchical disease triplet clusters related to dementia development, and the most characteristic patterns leading to incident dementia started with cardiovascular conditions (mainly hypertension), cerebrovascular disease, mobility disorders, or infections, followed by neuropsychiatric conditions. Conclusions: Dementia development in the real world is an intricate process involving various diseases or conditions, their co-occurrence, and sequential relationships. Using a machine learning approach, we identified 49 hierarchical disease triplet clusters with leading roles (cardio- or cerebrovascular disease) and supporting roles (mental conditions, locomotion difficulties, infections, and nonspecific neurological conditions) in dementia development. Further studies using data from other countries are needed to validate the prediction algorithms for dementia development, allowing the development of comprehensive strategies to prevent or care for dementia in the real world. 
", doi="10.2196/41858", url="https://www.jmir.org/2023/1/e41858", url="http://www.ncbi.nlm.nih.gov/pubmed/37494081" } @Article{info:doi/10.2196/46542, author="Pujolar-D{\'i}az, Georgina and Vidal-Alaball, Josep and Forcada, Anna and Descals-Singla, Elisabet and Basora, Josep and ", title="Creation of a Laboratory for Statistics and Analysis of Dependence and Chronic Conditions: Protocol for the Bages Territorial Specialization and Competitiveness Project (PECT BAGESS)", journal="JMIR Res Protoc", year="2023", month="Jul", day="26", volume="12", pages="e46542", keywords="chronic disease", keywords="multiple chronic conditions", keywords="primary health care", keywords="diffusion of innovation", keywords="health data", keywords="data sharing", abstract="Background: With the increasing prevalence of chronic diseases, partly due to the increase in life expectancy and the aging of the population, the complexity of the approach faced by the structures, dynamics, and actors that are part of the current care and attention systems is evident. The territory of Bages (Catalonia, Spain) presents characteristics of a highly complex ecosystem where there is a need to develop new, more dynamic structures for the various actors in the health and social systems, aimed at incorporating new actors in the technological and business field that would allow innovation in the management of this context. Within the framework of the Bages Territorial Specialization and Competitiveness Project (PECT BAGESS), the aim is to address these challenges through various entities that will develop 7 interrelated operations. 
Of these, the operation of the IDIAP Jordi Gol-Catalan Health Institute focuses on the creation of a Laboratory for Statistics and Analysis of Dependence and Chronic Conditions in the Bages region, in the form of a database that will collect the most relevant information from the different environments that affect the management of chronic conditions and dependence: health, social, economic, and environment. Objective: This study aims to create a laboratory for statistical, dependence, and chronic condition analysis in the Bages region, to determine the chronic conditions and conditions that generate dependence in the Bages area, in order to propose products and services that respond to the needs of people in these situations. Methods: PECT BAGESS originated from the Shared Agenda initiative, which was established in the Bages region with the goal of enhancing the quality of life and fostering social inclusion for individuals with chronic diseases. This study presents part of this broader project, consisting of the creation of a database. Data from chronic conditions and dependence service providers will be combined, using a unique identifier for the different sources of information. A thorough legal analysis was conducted to establish a secure data sharing mechanism among the entities participating in the project. Results: The laboratory will be a key piece in the structure generated in the environment of the PECT BAGESS, which will allow relevant information to be passed on from the different sectors involved to respond to the needs of people with chronic conditions and dependence, as well as to generate opportunities for products and services. Conclusions: The emerging organizational dynamics and structures are expected to demonstrate a health and social management model that may have a remarkable impact on these sectors. 
Products and services developed may be very useful for generating synergies and facilitating the living conditions of people who can benefit from all these services. However, secure data sharing circuits must be considered. International Registered Report Identifier (IRRID): PRR1-10.2196/46542 ", doi="10.2196/46542", url="https://www.researchprotocols.org/2023/1/e46542", url="http://www.ncbi.nlm.nih.gov/pubmed/37494102" } @Article{info:doi/10.2196/45948, author="Wolfien, Markus and Ahmadi, Najia and Fitzer, Kai and Grummt, Sophia and Heine, Kilian-Ludwig and Jung, Ian-C and Krefting, Dagmar and K{\"u}hn, Andreas and Peng, Yuan and Reinecke, Ines and Scheel, Julia and Schmidt, Tobias and Schm{\"u}cker, Paul and Sch{\"u}ttler, Christina and Waltemath, Dagmar and Zoch, Michele and Sedlmayr, Martin", title="Ten Topics to Get Started in Medical Informatics Research", journal="J Med Internet Res", year="2023", month="Jul", day="24", volume="25", pages="e45948", keywords="medical informatics", keywords="health informatics", keywords="interdisciplinary communication", keywords="research data", keywords="clinical data", keywords="digital health", doi="10.2196/45948", url="https://www.jmir.org/2023/1/e45948", url="http://www.ncbi.nlm.nih.gov/pubmed/37486754" } @Article{info:doi/10.2196/47934, author="Yune, Jung So and Kim, Youngjon and Lee, Woog Jea", title="Data Analysis of Physician Competence Research Trend: Social Network Analysis and Topic Modeling Approach", journal="JMIR Med Inform", year="2023", month="Jul", day="19", volume="11", pages="e47934", keywords="physician competency", keywords="research trend", keywords="competency-based education", keywords="professionalism", keywords="topic modeling", keywords="latent Dirichlet allocation", keywords="LDA algorithm", keywords="data science", keywords="social network analysis", abstract="Background: Studies on competency in medical education often explore the acquisition, performance, and evaluation of particular skills, 
knowledge, or behaviors that constitute physician competency. As physician competency reflects social demands according to changes in the medical environment, analyzing the research trends of physician competency by period is necessary to derive major research topics for future studies. Therefore, a more macroscopic method is required to analyze the core competencies of physicians in this era. Objective: This study aimed to analyze research trends related to physicians' competency in reflecting social needs according to changes in the medical environment. Methods: We used topic modeling to identify potential research topics by analyzing data from studies related to physician competency published between 2011 and 2020. We preprocessed 1354 articles and extracted 272 keywords. Results: The terms that appeared most frequently in the research related to physician competency since 2010 were knowledge, hospital, family, job, guidelines, management, and communication. The terms that appeared in most studies were education, model, knowledge, and hospital. Topic modeling revealed that the main topics about physician competency included Evidence-based clinical practice, Community-based healthcare, Patient care, Career and self-management, Continuous professional development, and Communication and cooperation. We divided the studies into 4 periods (2011-2013, 2014-2016, 2017-2019, and 2020-2021) and performed a linear regression analysis. The results showed a change in topics by period. The hot topics that have shown increased interest among scholars over time include Community-based healthcare, Career and self-management, and Continuous professional development. Conclusions: On the basis of the analysis of research trends, it is predicted that physician professionalism and community-based medicine will continue to be studied in future studies on physician competency. 
", doi="10.2196/47934", url="https://medinform.jmir.org/2023/1/e47934", url="http://www.ncbi.nlm.nih.gov/pubmed/37467028" } @Article{info:doi/10.2196/45059, author="Agnello, Marie Danielle and Loisel, Armand Quentin Emile and An, Qingfan and Balaskas, George and Chrifou, Rabab and Dall, Philippa and de Boer, Janneke and Delfmann, Rahel Lea and Gin{\'e}-Garriga, Maria and Goh, Kunshan and Longworth, Raffaella Giuliana and Messiha, Katrina and McCaffrey, Lauren and Smith, Niamh and Steiner, Artur and Vogelsang, Mira and Chastin, Sebastien", title="Establishing a Health CASCADE--Curated Open-Access Database to Consolidate Knowledge About Co-Creation: Novel Artificial Intelligence--Assisted Methodology Based on Systematic Reviews", journal="J Med Internet Res", year="2023", month="Jul", day="18", volume="25", pages="e45059", keywords="co-creation", keywords="co-production", keywords="co-design", keywords="database", keywords="participatory", keywords="methodology", keywords="artificial intelligence", abstract="Background: Co-creation is an approach that aims to democratize research and bridge the gap between research and practice, but the potential fragmentation of knowledge about co-creation has hindered progress. A comprehensive database of published literature from multidisciplinary sources can address this fragmentation through the integration of diverse perspectives, identification and dissemination of best practices, and increase clarity about co-creation. However, two considerable challenges exist. First, there is uncertainty about co-creation terminology, making it difficult to identify relevant literature. Second, the exponential growth of scientific publications has led to an overwhelming amount of literature that surpasses the human capacity for a comprehensive review. These challenges hinder progress in co-creation research and underscore the need for a novel methodology to consolidate and investigate the literature. 
Objective: This study aimed to synthesize knowledge about co-creation across various fields through the development and application of an artificial intelligence (AI)--assisted selection process. The ultimate goal of this database was to provide stakeholders interested in co-creation with relevant literature. Methods: We created a novel methodology for establishing a curated database. To accommodate the variation in terminology, we used a broad definition of co-creation that encompassed the essence of existing definitions. To filter out irrelevant information, an AI-assisted selection process was used. In addition, we conducted bibliometric analyses and quality control procedures to assess content and accuracy. Overall, this approach allowed us to develop a robust and reliable database that serves as a valuable resource for stakeholders interested in co-creation. Results: The final version of the database included 13,501 papers, which are indexed in Zenodo and accessible in an open-access downloadable format. The quality assessment revealed that 20.3\% (140/688) of the database likely contained irrelevant material, whereas the methodology captured 91\% (58/64) of the relevant literature. Participatory and variations of the term co-creation were the most frequent terms in the title and abstracts of included literature. The predominant source journals included health sciences, sustainability, environmental sciences, medical research, and health services research. Conclusions: This study produced a high-quality, open-access database about co-creation. The study demonstrates that it is possible to perform a systematic review selection process on a fragmented concept using human-AI collaboration. Our unified concept of co-creation includes the co-approaches (co-creation, co-design, and co-production), forms of participatory research, and user involvement. 
Our analysis of authorship, citations, and source landscape highlights the potential lack of collaboration among co-creation researchers and underscores the need for future investigation into the different research methodologies. The database provides a resource for relevant literature and can support rapid literature reviews about co-creation. It also offers clarity about the current co-creation landscape and helps to address barriers that researchers may face when seeking evidence about co-creation. ", doi="10.2196/45059", url="https://www.jmir.org/2023/1/e45059", url="http://www.ncbi.nlm.nih.gov/pubmed/37463024" } @Article{info:doi/10.2196/44700, author="Woods, Andrew and Kramer, T. Skyler and Xu, Dong and Jiang, Wei", title="Secure Comparisons of Single Nucleotide Polymorphisms Using Secure Multiparty Computation: Method Development", journal="JMIR Bioinform Biotech", year="2023", month="Jul", day="18", volume="4", pages="e44700", keywords="secure multiparty computation", keywords="single nucleotide polymorphism", keywords="Variant Call Format", keywords="Jaccard similarity", abstract="Background: While genomic variations can provide valuable information for health care and ancestry, the privacy of individual genomic data must be protected. Thus, a secure environment is desirable for a human DNA database such that the total data are queryable but not directly accessible to involved parties (eg, data hosts and hospitals) and that the query results are learned only by the user or authorized party. Objective: In this study, we provide efficient and secure computations on panels of single nucleotide polymorphisms (SNPs) from genomic sequences as computed under the following set operations: union, intersection, set difference, and symmetric difference. Methods: Using these operations, we can compute similarity metrics, such as the Jaccard similarity, which could allow querying a DNA database to find the same person and genetic relatives securely. 
We analyzed various security paradigms and show metrics for the protocols under several security assumptions, such as semihonest, malicious with honest majority, and malicious with a malicious majority. Results: We show that our methods can be used practically on realistically sized data. Specifically, we can compute the Jaccard similarity of two genomes when considering sets of SNPs, each with 400,000 SNPs, in 2.16 seconds with the assumption of a malicious adversary in an honest majority and 0.36 seconds under a semihonest model. Conclusions: Our methods may help adopt trusted environments for hosting individual genomic data with end-to-end data security. ", doi="10.2196/44700", url="https://bioinform.jmir.org/2023/1/e44700" } @Article{info:doi/10.2196/45651, author="Yang, Dan and Su, Zihan and Mu, Runqing and Diao, Yingying and Zhang, Xin and Liu, Yusi and Wang, Shuo and Wang, Xu and Zhao, Lei and Wang, Hongyi and Zhao, Min", title="Effects of Using Different Indirect Techniques on the Calculation of Reference Intervals: Observational Study", journal="J Med Internet Res", year="2023", month="Jul", day="17", volume="25", pages="e45651", keywords="comparative study", keywords="data transformation", keywords="indirect method", keywords="outliers", keywords="reference interval", keywords="clinical decision-making", keywords="complete blood count", keywords="red blood cells", keywords="white blood cells", keywords="platelets", keywords="laboratory", keywords="clinical", abstract="Background: Reference intervals (RIs) play an important role in clinical decision-making. However, due to the time, labor, and financial costs involved in establishing RIs using direct means, the use of indirect methods, based on big data previously obtained from clinical laboratories, is getting increasing attention. Different indirect techniques combined with different data transformation methods and outlier removal might cause differences in the calculation of RIs. 
However, there are few systematic evaluations of this. Objective: This study used data derived from direct methods as reference standards and evaluated the accuracy of combinations of different data transformation, outlier removal, and indirect techniques in establishing complete blood count (CBC) RIs for large-scale data. Methods: The CBC data of populations aged ≥18 years undergoing physical examination from January 2010 to December 2011 were retrieved from the First Affiliated Hospital of China Medical University in northern China. After exclusion of repeated individuals, we performed parametric, nonparametric, Hoffmann, Bhattacharya, and truncation points and Kolmogorov--Smirnov distance (kosmic) indirect methods, combined with log or BoxCox transformation, and Reed--Dixon, Tukey, and iterative mean (3SD) outlier removal methods in order to derive the RIs of 8 CBC parameters and compared the results with those directly and previously established. Furthermore, bias ratios (BRs) were calculated to assess which combination of indirect technique, data transformation pattern, and outlier removal method is preferable. Results: Raw data showed that the degrees of skewness of the white blood cell (WBC) count, platelet (PLT) count, mean corpuscular hemoglobin (MCH), mean corpuscular hemoglobin concentration (MCHC), and mean corpuscular volume (MCV) were much more obvious than those of other CBC parameters. After log or BoxCox transformation combined with Tukey or iterative mean (3SD) processing, the distribution types of these data were close to Gaussian distribution. Tukey-based outlier removal yielded the maximum number of outliers. The lower-limit bias of WBC (male), PLT (male), hemoglobin (HGB; male), MCH (male/female), and MCV (female) was greater than that of the corresponding upper limit for more than half of 30 indirect methods. Computational indirect choices of CBC parameters for males and females were inconsistent.
The RIs of MCHC established by the direct method for females were narrow. For this, the kosmic method was markedly superior, which contrasted with the RI calculation of CBC parameters with high |BR| qualification rates for males. Among the top 10 methodologies for the WBC count, PLT count, HGB, MCV, and MCHC with a high-BR qualification rate among males, the Bhattacharya, Hoffmann, and parametric methods were superior to the other 2 indirect methods. Conclusions: Compared to results derived by the direct method, outlier removal methods and indirect techniques markedly influence the final RIs, whereas data transformation has negligible effects, except for obviously skewed data. Specifically, the outlier removal efficiency of Tukey and iterative mean (3SD) methods is almost equivalent. Furthermore, the choice of indirect techniques depends more on the characteristics of the studied analyte itself. This study provides scientific evidence for clinical laboratories to use their previous data sets to establish RIs. 
", doi="10.2196/45651", url="https://www.jmir.org/2023/1/e45651", url="http://www.ncbi.nlm.nih.gov/pubmed/37459170" } @Article{info:doi/10.2196/42262, author="Besculides, Melanie and Mazumdar, Madhu and Phlegar, Sydney and Freeman, Robert and Wilson, Sara and Joshi, Himanshu and Kia, Arash and Gorbenko, Ksenia", title="Implementing a Machine Learning Screening Tool for Malnutrition: Insights From Qualitative Research Applicable to Other Machine Learning--Based Clinical Decision Support Systems", journal="JMIR Form Res", year="2023", month="Jul", day="13", volume="7", pages="e42262", keywords="machine learning", keywords="AI", keywords="CDSS", keywords="evaluation", keywords="nutrition", keywords="screening", keywords="clinical", keywords="usability", keywords="effectiveness", keywords="treatment", keywords="malnutrition", keywords="decision-making", keywords="tool", keywords="data", keywords="acceptability", abstract="Background: Machine learning (ML)--based clinical decision support systems (CDSS) are popular in clinical practice settings but are often criticized for being limited in usability, interpretability, and effectiveness. Evaluating the implementation of ML-based CDSS is critical to ensure CDSS is acceptable and useful to clinicians and helps them deliver high-quality health care. Malnutrition is a common and underdiagnosed condition among hospital patients, which can have serious adverse impacts. Early identification and treatment of malnutrition are important. Objective: This study aims to evaluate the implementation of an ML tool, Malnutrition Universal Screening Tool (MUST)--Plus, that predicts hospital patients at high risk for malnutrition and identify best implementation practices applicable to this and other ML-based CDSS. Methods: We conducted a qualitative postimplementation evaluation using in-depth interviews with registered dietitians (RDs) who use MUST-Plus output in their everyday work. 
After coding the data, we mapped emergent themes onto select domains of the nonadoption, abandonment, scale-up, spread, and sustainability (NASSS) framework. Results: We interviewed 17 of the 24 RDs approached (71\%), representing 37\% of those who use MUST-Plus output. Several themes emerged: (1) enhancements to the tool were made to improve accuracy and usability; (2) MUST-Plus helped identify patients that would not otherwise be seen; perceived usefulness was highest in the original site; (3) perceived accuracy varied by respondent and site; (4) RDs valued autonomy in prioritizing patients; (5) depth of tool understanding varied by hospital and level; (6) MUST-Plus was integrated into workflows and electronic health records; and (7) RDs expressed a desire to eventually have 1 automated screener. Conclusions: Our findings suggest that continuous involvement of stakeholders at new sites given staff turnover is vital to ensure buy-in. Qualitative research can help identify the potential bias of ML tools and should be widely used to ensure health equity. Ongoing collaboration among CDSS developers, data scientists, and clinical providers may help refine CDSS for optimal use and improve the acceptability of CDSS in the clinical context. ", doi="10.2196/42262", url="https://formative.jmir.org/2023/1/e42262", url="http://www.ncbi.nlm.nih.gov/pubmed/37440303" } @Article{info:doi/10.2196/42621, author="Matschinske, Julian and Sp{\"a}th, Julian and Bakhtiari, Mohammad and Probul, Niklas and Kazemi Majdabadi, Mahdi Mohammad and Nasirigerdeh, Reza and Torkzadehmahani, Reihaneh and Hartebrodt, Anne and Orban, Balazs-Attila and Fej{\'e}r, S{\'a}ndor-J{\'o}zsef and Zolotareva, Olga and Das, Supratim and Baumbach, Linda and Pauling, K. Josch and Toma{\vs}evi{\'c}, Olivera and Bihari, B{\'e}la and Bloice, Marcus and Donner, C. 
Nina and Fdhila, Walid and Frisch, Tobias and Hauschild, Anne-Christin and Heider, Dominik and Holzinger, Andreas and H{\"o}tzendorfer, Walter and Hospes, Jan and Kacprowski, Tim and Kastelitz, Markus and List, Markus and Mayer, Rudolf and Moga, M{\'o}nika and M{\"u}ller, Heimo and Pustozerova, Anastasia and R{\"o}ttger, Richard and Saak, C. Christina and Saranti, Anna and Schmidt, W. Harald H. H. and Tschohl, Christof and Wenke, K. Nina and Baumbach, Jan", title="The FeatureCloud Platform for Federated Learning in Biomedicine: Unified Approach", journal="J Med Internet Res", year="2023", month="Jul", day="12", volume="25", pages="e42621", keywords="privacy-preserving machine learning", keywords="federated learning", keywords="interactive platform", keywords="artificial intelligence", keywords="AI store", keywords="privacy-enhancing technologies", keywords="additive secret sharing", abstract="Background: Machine learning and artificial intelligence have shown promising results in many areas and are driven by the increasing amount of available data. However, these data are often distributed across different institutions and cannot be easily shared owing to strict privacy regulations. Federated learning (FL) allows the training of distributed machine learning models without sharing sensitive data. In addition, the implementation is time-consuming and requires advanced programming skills and complex technical infrastructures. Objective: Various tools and frameworks have been developed to simplify the development of FL algorithms and provide the necessary technical infrastructure. Although there are many high-quality frameworks, most focus only on a single application case or method. To our knowledge, there are no generic frameworks, meaning that the existing solutions are restricted to a particular type of algorithm or application field. Furthermore, most of these frameworks provide an application programming interface that needs programming knowledge. 
There is no collection of ready-to-use FL algorithms that are extendable and allow users (eg, researchers) without programming knowledge to apply FL. A central FL platform for both FL algorithm developers and users does not exist. This study aimed to address this gap and make FL available to everyone by developing FeatureCloud, an all-in-one platform for FL in biomedicine and beyond. Methods: The FeatureCloud platform consists of 3 main components: a global frontend, a global backend, and a local controller. Our platform uses a Docker to separate the local acting components of the platform from the sensitive data systems. We evaluated our platform using 4 different algorithms on 5 data sets for both accuracy and runtime. Results: FeatureCloud removes the complexity of distributed systems for developers and end users by providing a comprehensive platform for executing multi-institutional FL analyses and implementing FL algorithms. Through its integrated artificial intelligence store, federated algorithms can easily be published and reused by the community. To secure sensitive raw data, FeatureCloud supports privacy-enhancing technologies to secure the shared local models and assures high standards in data privacy to comply with the strict General Data Protection Regulation. Our evaluation shows that applications developed in FeatureCloud can produce highly similar results compared with centralized approaches and scale well for an increasing number of participating sites. Conclusions: FeatureCloud provides a ready-to-use platform that integrates the development and execution of FL algorithms while reducing the complexity to a minimum and removing the hurdles of federated infrastructure. Thus, we believe that it has the potential to greatly increase the accessibility of privacy-preserving and distributed data analyses in biomedicine and beyond. 
", doi="10.2196/42621", url="https://www.jmir.org/2023/1/e42621", url="http://www.ncbi.nlm.nih.gov/pubmed/37436815" } @Article{info:doi/10.2196/46328, author="Park, Woo Han and Yoon, Young Ho", title="Global COVID-19 Policy Engagement With Scientific Research Information: Altmetric Data Study", journal="J Med Internet Res", year="2023", month="Jun", day="29", volume="25", pages="e46328", keywords="altmetrics", keywords="government policy report", keywords="citation analysis", keywords="COVID-19", keywords="World Health Organization", keywords="WHO", keywords="COVID-19 research", keywords="online citation network", keywords="policy domains", abstract="Background: Previous studies on COVID-19 scholarly articles have primarily focused on bibliometric characteristics, neglecting the identification of institutional actors that cite recent scientific contributions related to COVID-19 in the policy domain, and their locations. Objective: The purpose of this study was to assess the online citation network and knowledge structure of COVID-19 research across policy domains over 2 years from January 2020 to January 2022, with a particular emphasis on geographical frequency. Two research questions were addressed. The first question was related to who has been the most active in policy engagement with science and research information sharing during the COVID-19 pandemic, particularly in terms of countries and organization types. The second question was related to whether there are significant differences in the types of coronavirus research shared among countries and continents. Methods: The Altmetric database was used to collect policy report citations of scientific articles for 3 topic terms (COVID-19, COVID-19 vaccine, and COVID-19 variants). Altmetric provides the URLs of policy agencies that have cited COVID-19 research. The scientific articles used for Altmetric citations are extracted from journals indexed by PubMed. 
The numbers of COVID-19, COVID-19 vaccine, and COVID-19 variant research outputs between January 1, 2020, and January 31, 2022, were 216,787, 16,748, and 2777, respectively. The study examined the frequency of citations based on policy institutional domains, such as intergovernmental organizations, national and domestic governmental organizations, and nongovernmental organizations (think tanks and academic institutions). Results: The World Health Organization (WHO) stood out as the most notable institution citing COVID-19--related research outputs. The WHO actively sought and disseminated information regarding the COVID-19 pandemic. The COVID-19 vaccine citation network exhibited the most extensive connections in terms of degree centrality, 2-local eigenvector centrality, and eigenvector centrality among the 3 key terms. The Netherlands, the United States, the United Kingdom, and Australia were the countries that sought and shared the most information on COVID-19 vaccines, likely due to their high numbers of COVID-19 cases. Developing nations, although gaining quicker access to COVID-19 vaccine information, appeared to be relatively isolated from the enriched COVID-19 pandemic content in the global network. Conclusions: The global scientific network ecology during the COVID-19 pandemic revealed distinct types of links primarily centered around the WHO. Western countries demonstrated effective networking practices in constructing these networks. The prominent position of the key term ``COVID-19 vaccine'' demonstrates that nation-states align with global authority regardless of their national contexts. In summary, the citation networking practices of policy agencies have the potential to uncover the global knowledge distribution structure as a proxy for the networking strategy employed during a pandemic. 
", doi="10.2196/46328", url="https://www.jmir.org/2023/1/e46328", url="http://www.ncbi.nlm.nih.gov/pubmed/37384384" } @Article{info:doi/10.2196/42149, author="Tai, Shu-Yu and Chi, Ying-Chen and Chien, Yu-Wen and Kawachi, Ichiro and Lu, Tsung-Hsueh", title="Dashboard With Bump Charts to Visualize the Changes in the Rankings of Leading Causes of Death According to Two Lists: National Population-Based Time-Series Cross-Sectional Study", journal="JMIR Public Health Surveill", year="2023", month="Jun", day="27", volume="9", pages="e42149", keywords="COVID-19", keywords="dashboard", keywords="data visualization", keywords="leading causes of death", keywords="mortality/trend", keywords="ranking", keywords="surveillance", keywords="cause of mortality", keywords="cause of death", keywords="monitoring", keywords="surveillance indicator", keywords="health statistics", keywords="mortality data", abstract="Background: Health advocates and the media often use the rankings of the leading causes of death (CODs) to draw attention to health issues with relatively high mortality burdens in a population. The National Center for Health Statistics (NCHS) publishes ``Deaths: leading causes'' annually. The ranking list used by the NCHS and statistical offices in several countries includes broad categories such as cancer, heart disease, and accidents. However, the list used by the World Health Organization (WHO) subdivides broad categories (17 for cancer, 8 for heart disease, and 6 for accidents) and classifies Alzheimer disease and related dementias and hypertensive diseases more comprehensively compared to the NCHS list. Regarding the data visualization of the rankings of leading CODs, the bar chart is the most commonly used graph; nevertheless, bar charts may not effectively reveal the changes in the rankings over time. 
Objective: The aim of this study is to use a dashboard with bump charts to visualize the changes in the rankings of the leading CODs in the United States by sex and age from 1999 to 2021, according to 2 lists (NCHS vs WHO). Methods: Data on the number of deaths in each category from each list for each year were obtained from the Wide-ranging Online Data for Epidemiologic Research system, maintained by the Centers for Disease Control and Prevention. Rankings were based on the absolute number of deaths. The dashboard enables users to filter by list (NCHS or WHO) and demographic characteristics (sex and age) and highlight a particular COD. Results: Several CODs that were only on the WHO list, including brain, breast, colon, hematopoietic, lung, pancreas, prostate, and uterus cancer (all classified as cancer on the NCHS list); unintentional transport injury; poisoning; drowning; and falls (all classified as accidents on the NCHS list), were among the 10 leading CODs in several sex and age subgroups. In contrast, several CODs that appeared among the 10 leading CODs according to the NCHS list, such as pneumonia, kidney disease, cirrhosis, and sepsis, were excluded from the 10 leading CODs if the WHO list was used. The rank of Alzheimer disease and related dementias and hypertensive diseases according to the WHO list was higher than their ranks according to the NCHS list. A marked increase in the ranking of unintentional poisoning among men aged 45-64 years was noted from 2008 to 2021. Conclusions: A dashboard with bump charts can be used to improve the visualization of the changes in the rankings of leading CODs according to the WHO and NCHS lists as well as demographic characteristics; the visualization can help users make informed decisions regarding the most appropriate ranking list for their needs. 
", doi="10.2196/42149", url="https://publichealth.jmir.org/2023/1/e42149", url="http://www.ncbi.nlm.nih.gov/pubmed/37368475" } @Article{info:doi/10.2196/45614, author="Boussina, Aaron and Wardi, Gabriel and Shashikumar, Prajwal Supreeth and Malhotra, Atul and Zheng, Kai and Nemati, Shamim", title="Representation Learning and Spectral Clustering for the Development and External Validation of Dynamic Sepsis Phenotypes: Observational Cohort Study", journal="J Med Internet Res", year="2023", month="Jun", day="23", volume="25", pages="e45614", keywords="sepsis", keywords="phenotype", keywords="emergency service, hospital", keywords="disease progression", keywords="artificial intelligence", keywords="machine learning", keywords="emergency", keywords="infection", keywords="clinical phenotype", keywords="clinical phenotyping", keywords="transition model", keywords="transition modeling", abstract="Background: Recent attempts at clinical phenotyping for sepsis have shown promise in identifying groups of patients with distinct treatment responses. Nonetheless, the replicability and actionability of these phenotypes remain an issue because the patient trajectory is a function of both the patient's physiological state and the interventions they receive. Objective: We aimed to develop a novel approach for deriving clinical phenotypes using unsupervised learning and transition modeling. Methods: Forty commonly used clinical variables from the electronic health record were used as inputs to a feed-forward neural network trained to predict the onset of sepsis. Using spectral clustering on the representations from this network, we derived and validated consistent phenotypes across a diverse cohort of patients with sepsis. We modeled phenotype dynamics as a Markov decision process with transitions as a function of the patient's current state and the interventions they received. 
Results: Four consistent and distinct phenotypes were derived from over 11,500 adult patients who were admitted from the University of California, San Diego emergency department (ED) with sepsis between January 1, 2016, and January 31, 2020. Over 2000 adult patients admitted from the University of California, Irvine ED with sepsis between November 4, 2017, and August 4, 2022, were involved in the external validation. We demonstrate that sepsis phenotypes are not static and evolve in response to physiological factors and based on interventions. We show that roughly 45\% of patients change phenotype membership within the first 6 hours of ED arrival. We observed consistent trends in patient dynamics as a function of interventions including early administration of antibiotics. Conclusions: We derived and describe 4 sepsis phenotypes present within 6 hours of triage in the ED. We observe that the administration of a 30 mL/kg fluid bolus may be associated with worse outcomes in certain phenotypes, whereas prompt antimicrobial therapy is associated with improved outcomes. ", doi="10.2196/45614", url="https://www.jmir.org/2023/1/e45614", url="http://www.ncbi.nlm.nih.gov/pubmed/37351927" } @Article{info:doi/10.2196/41576, author="Chen, Yang and Liu, Xuejiao and Gao, Lei and Zhu, Miao and Shia, Ben-Chang and Chen, Mingchih and Ye, Linglong and Qin, Lei", title="Using the H2O Automatic Machine Learning Algorithms to Identify Predictors of Web-Based Medical Record Nonuse Among Patients in a Data-Rich Environment: Mixed Methods Study", journal="JMIR Med Inform", year="2023", month="Jun", day="19", volume="11", pages="e41576", keywords="web-based medical record", keywords="predictors", keywords="H2O's automatic machine learning", keywords="Health Information National Trends Survey", keywords="HINTS", keywords="mobile phone", abstract="Background: With the advent of electronic storage of medical records and the internet, patients can access web-based medical records. 
This has facilitated doctor-patient communication and built trust between them. However, many patients avoid using web-based medical records despite their greater availability and readability. Objective: On the basis of demographic and individual behavioral characteristics, this study explores the predictors of web-based medical record nonuse among patients. Methods: Data were collected from the National Cancer Institute 2019 to 2020 Health Information National Trends Survey. First, based on the data-rich environment, the chi-square test (categorical variables) and 2-tailed t tests (continuous variables) were performed on the response variables and the variables in the questionnaire. According to the test results, the variables were initially screened, and those that passed the test were selected for subsequent analysis. Second, participants were excluded from the study if any of the initially screened variables were missing. Third, the data obtained were modeled using 5 machine learning algorithms, namely, logistic regression, automatic generalized linear model, automatic random forest, automatic deep neural network, and automatic gradient boosting machine, to identify and investigate factors affecting web-based medical record nonuse. The aforementioned automatic machine learning algorithms were based on the R interface (R Foundation for Statistical Computing) of the H2O (H2O.ai) scalable machine learning platform. Finally, 5-fold cross-validation was adopted for 80\% of the data set, which was used as the training data to determine hyperparameters of 5 algorithms, and 20\% of the data set was used as the test data for model comparison. Results: Among the 9072 respondents, 5409 (59.62\%) had no experience using web-based medical records. Using the 5 algorithms, 29 variables were identified as crucial predictors of nonuse of web-based medical records. 
These 29 variables comprised 6 (21\%) sociodemographic variables (age, BMI, race, marital status, education, and income) and 23 (79\%) variables related to individual lifestyles and behavioral habits (such as electronic device and internet use, individuals' health status and their level of health concern, etc). H2O's automatic machine learning methods achieved high model accuracy. On the basis of the performance of the validation data set, the optimal model was the automatic random forest with the highest area under the curve in the validation set (88.52\%) and the test set (82.87\%). Conclusions: When monitoring web-based medical record use trends, research should focus on social factors such as age, education, BMI, and marital status, as well as personal lifestyle and behavioral habits, including smoking, use of electronic devices and the internet, patients' personal health status, and their level of health concern. The use of electronic medical records can be targeted to specific patient groups, allowing more people to benefit from their usefulness. 
", doi="10.2196/41576", url="https://medinform.jmir.org/2023/1/e41576", url="http://www.ncbi.nlm.nih.gov/pubmed/37335616" } @Article{info:doi/10.2196/45823, author="Diniz, Miguel Jos{\'e} and Vasconcelos, Henrique and Souza, J{\'u}lio and Rb-Silva, Rita and Ameijeiras-Rodriguez, Carolina and Freitas, Alberto", title="Comparing Decentralized Learning Methods for Health Data Models to Nondecentralized Alternatives: Protocol for a Systematic Review", journal="JMIR Res Protoc", year="2023", month="Jun", day="19", volume="12", pages="e45823", keywords="decentralized learning", keywords="distributed learning", keywords="federated learning", keywords="centralized learning", keywords="privacy", keywords="health", keywords="health data", keywords="secondary data use", keywords="health data model", keywords="blockchain", keywords="health care", keywords="data science", abstract="Background: Considering the soaring health-related costs directed toward a growing, aging, and comorbid population, the health sector needs effective data-driven interventions while managing rising care costs. While health interventions using data mining have become more robust and adopted, they often demand high-quality big data. However, growing privacy concerns have hindered large-scale data sharing. In parallel, recently introduced legal instruments require complex implementations, especially when it comes to biomedical data. New privacy-preserving technologies, such as decentralized learning, make it possible to create health models without mobilizing data sets by using distributed computation principles. Several multinational partnerships, including a recent agreement between the United States and the European Union, are adopting these techniques for next-generation data science. While these approaches are promising, there is no clear and robust evidence synthesis of health care applications. 
Objective: The main aim is to compare the performance among health data models (eg, automated diagnosis and mortality prediction) developed using decentralized learning approaches (eg, federated and blockchain) to those using centralized or local methods. Secondary aims are comparing the privacy compromise and resource use among model architectures. Methods: We will conduct a systematic review using the first-ever registered research protocol for this topic following a robust search methodology, including several biomedical and computational databases. This work will compare health data models differing in development architecture, grouping them according to their clinical applications. For reporting purposes, a PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) 2020 flow diagram will be presented. CHARMS (Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies)--based forms will be used for data extraction and to assess the risk of bias, alongside PROBAST (Prediction Model Risk of Bias Assessment Tool). All effect measures in the original studies will be reported. Results: The queries and data extractions are expected to start on February 28, 2023, and end by July 31, 2023. The research protocol was registered with PROSPERO, under the number 393126, on February 3, 2023. With this protocol, we detail how we will conduct the systematic review. With this study, we aim to summarize the progress and findings from state-of-the-art decentralized learning models in health care in comparison to their local and centralized counterparts. Results are expected to clarify the consensuses and heterogeneities reported and help guide the research and development of new robust and sustainable applications to address the health data privacy problem, with applicability in real-world settings. Conclusions: We expect to clearly present the status quo of these privacy-preserving technologies in health care. 
With this robust synthesis of the currently available scientific evidence, the review will inform health technology assessment and evidence-based decisions by health professionals, data scientists, and policy makers alike. Importantly, it should also guide the development and application of new tools in service of patients' privacy and future research. Trial Registration: PROSPERO 393126; https://www.crd.york.ac.uk/prospero/display\_record.php?RecordID=393126 International Registered Report Identifier (IRRID): PRR1-10.2196/45823 ", doi="10.2196/45823", url="https://www.researchprotocols.org/2023/1/e45823", url="http://www.ncbi.nlm.nih.gov/pubmed/37335606" } @Article{info:doi/10.2196/44567, author="Kusejko, Katharina and Smith, Daniel and Scherrer, Alexandra and Paioni, Paolo and Kohns Vasconcelos, Malte and Aebi-Popp, Karoline and Kouyos, D. Roger and G{\"u}nthard, F. Huldrych and Kahlert, R. Christian and ", title="Migrating a Well-Established Longitudinal Cohort Database From Oracle SQL to Research Electronic Data Entry (REDCap): Data Management Research and Design Study", journal="JMIR Form Res", year="2023", month="May", day="31", volume="7", pages="e44567", keywords="REDCap", keywords="cohort study", keywords="data collection", keywords="electronic case report forms", keywords="eCRF", keywords="software", keywords="digital solution", keywords="electronic data entry", keywords="HIV", abstract="Background: Providing user-friendly electronic data collection tools for large multicenter studies is key for obtaining high-quality research data. Research Electronic Data Capture (REDCap) is a software solution developed for setting up research databases with integrated graphical user interfaces for electronic data entry. The Swiss Mother and Child HIV Cohort Study (MoCHiV) is a longitudinal cohort study with around 2 million data entries dating back to the early 1980s. Until 2022, data collection in MoCHiV was paper-based. 
Objective: The objective of this study was to provide a user-friendly graphical interface for electronic data entry for physicians and study nurses reporting MoCHiV data. Methods: MoCHiV collects information on obstetric events among women living with HIV and children born to mothers living with HIV. Until 2022, MoCHiV data were stored in an Oracle SQL relational database. In this project, R and REDCap were used to develop an electronic data entry platform for MoCHiV with migration of already collected data. Results: The key steps for providing an electronic data entry option for MoCHiV were (1) design, (2) data cleaning and formatting, (3) migration and compliance, and (4) add-on features. In the first step, the database structure was defined in REDCap, including the specification of primary and foreign keys, definition of study variables, and the hierarchy of questions (termed ``branching logic''). In the second step, data stored in Oracle were cleaned and formatted to adhere to the defined database structure. Systematic data checks ensured compliance to all branching logic and levels of categorical variables. REDCap-specific variables and numbering of repeated events for enabling a relational data structure in REDCap were generated using R. In the third step, data were imported to REDCap and then systematically compared to the original data. In the last step, add-on features, such as data access groups, redirections, and summary reports, were integrated to facilitate data entry in the multicenter MoCHiV study. Conclusions: By combining different software tools---Oracle SQL, R, and REDCap---and building a systematic pipeline for data cleaning, formatting, and comparing, we were able to migrate a multicenter longitudinal cohort study from Oracle SQL to REDCap. REDCap offers a flexible way for developing customized study designs, even in the case of longitudinal studies with different study arms (ie, obstetric events, women, and mother-child pairs). 
However, REDCap does not offer built-in tools for preprocessing large data sets before data import. Additional software is needed (eg, R) for data formatting and cleaning to achieve the predefined REDCap data structure. ", doi="10.2196/44567", url="https://formative.jmir.org/2023/1/e44567", url="http://www.ncbi.nlm.nih.gov/pubmed/37256686" } @Article{info:doi/10.2196/45171, author="Cao, Yiding and Rajendran, Suraj and Sundararajan, Prathic and Law, Royal and Bacon, Sarah and Sumner, A. Steven and Masuda, Naoki", title="Web-Based Social Networks of Individuals With Adverse Childhood Experiences: Quantitative Study", journal="J Med Internet Res", year="2023", month="May", day="30", volume="25", pages="e45171", keywords="adverse childhood experience", keywords="ACE", keywords="social networks", keywords="Twitter", keywords="Reddit", keywords="childhood", keywords="abuse", keywords="neglect", keywords="violence", keywords="substance use", keywords="coping strategy", keywords="coping", keywords="interpersonal connection", keywords="web-based connection", keywords="behavior", keywords="social connection", keywords="resilience", abstract="Background: Adverse childhood experiences (ACEs), which include abuse and neglect and various household challenges such as exposure to intimate partner violence and substance use in the home, can have negative impacts on the lifelong health of affected individuals. Among various strategies for mitigating the adverse effects of ACEs is to enhance connectedness and social support for those who have experienced them. However, how the social networks of those who experienced ACEs differ from the social networks of those who did not is poorly understood. Objective: In this study, we used Reddit and Twitter data to investigate and compare social networks between individuals with and without ACE exposure. Methods: We first used a neural network classifier to identify the presence or absence of public ACE disclosures in social media posts. 
We then analyzed egocentric social networks comparing individuals with self-reported ACEs with those with no reported history. Results: We found that, although individuals reporting ACEs had fewer total followers in web-based social networks, they had higher reciprocity in following behavior (ie, mutual following with other users), a higher tendency to follow and be followed by other individuals with ACEs, and a higher tendency to follow back individuals with ACEs rather than individuals without ACEs. Conclusions: These results imply that individuals with ACEs may try to actively connect with others who have similar previous traumatic experiences as a positive connection and coping strategy. Supportive interpersonal connections on the web for individuals with ACEs appear to be a prevalent behavior and may be a way to enhance social connectedness and resilience in those who have experienced ACEs. ", doi="10.2196/45171", url="https://www.jmir.org/2023/1/e45171", url="http://www.ncbi.nlm.nih.gov/pubmed/37252791" } @Article{info:doi/10.2196/45662, author="Hou, Jue and Zhao, Rachel and Gronsbell, Jessica and Lin, Yucong and Bonzel, Clara-Lea and Zeng, Qingyi and Zhang, Sinian and Beaulieu-Jones, K. Brett and Weber, M. 
Griffin and Jemielita, Thomas and Wan, Sabrina Shuyan and Hong, Chuan and Cai, Tianrun and Wen, Jun and Ayakulangara Panickan, Vidul and Liaw, Kai-Li and Liao, Katherine and Cai, Tianxi", title="Generate Analysis-Ready Data for Real-world Evidence: Tutorial for Harnessing Electronic Health Records With Advanced Informatic Technologies", journal="J Med Internet Res", year="2023", month="May", day="25", volume="25", pages="e45662", keywords="electronic health records", keywords="real-world evidence", keywords="data curation", keywords="medical informatics", keywords="randomized controlled trials", keywords="reproducibility", doi="10.2196/45662", url="https://www.jmir.org/2023/1/e45662", url="http://www.ncbi.nlm.nih.gov/pubmed/37227772" } @Article{info:doi/10.2196/44330, author="Hadley, Emily and Marcial, Haak Laura and Quattrone, Wes and Bobashev, Georgiy", title="Text Analysis of Trends in Health Equity and Disparities From the Internal Revenue Service Tax Documentation Submitted by US Nonprofit Hospitals Between 2010 and 2019: Exploratory Study", journal="J Med Internet Res", year="2023", month="May", day="24", volume="25", pages="e44330", keywords="text mining", keywords="natural language processing", keywords="health care disparities", keywords="hospital administration", abstract="Background: Many US hospitals are classified as nonprofits and receive tax-exempt status partially in exchange for providing benefits to the community. Proof of compliance is collected with the Schedule H form submitted as part of the annual Internal Revenue Service Form 990 (F990H), including a free-response text section that is known for being ambiguous and difficult to audit. This research is among the first to use natural language processing approaches to evaluate this text section with a focus on health equity and disparities. 
Objective: This study aims to determine the extent to which the free-response text in F990H reveals how nonprofit hospitals address health equity and disparities, including alignment with public priorities. Methods: We used free-response text submitted by hospital reporting entities in Parts V and VI of the Internal Revenue Service Form 990 Schedule H between 2010 and 2019. We identified 29 main themes connected to health equity and disparities, and 152 related key phrases. We tallied occurrences of these phrases through term frequency analysis, calculated the Moran I statistic to assess geographic variation in 2018, analyzed Google Trends use for the same terms during the same period, and used semantic search with Sentence-BERT in Python to understand contextual use. Results: We found increased use from 2010 to 2019 across all the 29 phrase themes related to health equity and disparities. More than 90\% of hospital reporting entities used terms in 2018 and 2019 related to affordability (2018: 2117/2131, 99.34\%; 2019: 1620/1627, 99.57\%), government organizations (2018: 2053/2131, 96.33\%; 2019: 1577/1627, 96.93\%), mental health (2018: 1937/2131, 90.9\%; 2019: 1517/1627, 93.24\%), and data collection (2018: 1947/2131, 91.37\%; 2019: 1502/1627, 92.32\%). The themes with the largest relative increase were LGBTQ (lesbian, gay, bisexual, transgender, and queer; 1676\%; 2010: 12/2328, 0.51\%; 2019: 149/1627, 9.16\%) and social determinants of health (958\%; 2010: 68/2328, 2.92\%; 2019: 503/1627, 30.92\%). Terms related to homelessness varied geographically from 2010 to 2018, and terms related to equity, health IT, immigration, LGBTQ, oral health, rural, social determinants of health, and substance use showed statistically significant (P<.05) geographic variation in 2018. The largest percentage point increase was for terms related to substance use (2010: 403/2328, 17.31\%; 2019: 1149/1627, 70.62\%). 
However, use in themes such as LGBTQ, disability, oral health, and race and ethnicity ranked lower than public interest in these topics, and some increased mentions of themes were to explicitly say that no action was taken. Conclusions: Hospital reporting entities demonstrate an increasing awareness of health equity and disparities in community benefit tax documentation, but these do not necessarily correspond with general population interests or additional action. We propose further investigation of alignment with community health needs assessments and make suggestions for improvements to F990H reporting requirements. ", doi="10.2196/44330", url="https://www.jmir.org/2023/1/e44330", url="http://www.ncbi.nlm.nih.gov/pubmed/37223985" } @Article{info:doi/10.2196/42375, author="Johnson, D. Rhodri and Griffiths, J. Lucy and Cowley, E. Laura and Broadhurst, Karen and Bailey, Rowena", title="Risk Factors Associated With Primary Care--Reported Domestic Violence for Women Involved in Family Law Care Proceedings: Data Linkage Observational Study", journal="J Med Internet Res", year="2023", month="May", day="24", volume="25", pages="e42375", keywords="data linkage", keywords="domestic violence", keywords="domestic abuse", keywords="health data", keywords="family justice data", abstract="Background: Domestic violence and abuse (DVA) has a detrimental impact on the health and well-being of children and families but is commonly underreported, with an estimated prevalence of 5.5\% in England and Wales in 2020. DVA is more common in groups considered vulnerable, including those involved in public law family court proceedings; however, there is a lack of evidence regarding risk factors for DVA among those involved in the family justice system. Objective: This study examines risk factors for DVA within a cohort of mothers involved in public law family court proceedings in Wales and a matched general population comparison group. 
Methods: We linked family justice data from the Children and Family Court Advisory and Support Service (Cafcass Cymru [Wales]) to demographic and electronic health records within the Secure Anonymised Information Linkage (SAIL) Databank. We constructed 2 study cohorts: mothers involved in public law family court proceedings (2011-2019) and a general population group of mothers not involved in public law family court proceedings, matched on key demographics (age and deprivation). We used published clinical codes to identify mothers with exposure to DVA documented in their primary care records and who therefore reported DVA to their general practitioner. Multiple logistic regression analyses were used to examine risk factors for primary care--recorded DVA. Results: Mothers involved in public law family court proceedings were 8 times more likely to have had exposure to DVA documented in their primary care records than the general population group (adjusted odds ratio [AOR] 8.0, 95\% CI 6.6-9.7). Within the cohort of mothers involved in public law family court proceedings, risk factors for DVA with the greatest effect sizes included living in sparsely populated areas (AOR 3.9, 95\% CI 2.8-5.5), assault-related emergency department attendances (AOR 2.2, 95\% CI 1.5-3.1), and mental health conditions (AOR 1.7, 95\% CI 1.3-2.2). An 8-fold increased risk of DVA emphasizes increased vulnerabilities for individuals involved in public law family court proceedings. Conclusions: Previously reported DVA risk factors do not necessarily apply to this group of women. The additional risk factors identified in this study could be considered for inclusion in national guidelines. The evidence that living in sparsely populated areas and assault-related emergency department attendances are associated with increased risk of DVA could be used to inform policy and practice interventions targeting prevention as well as tailored support services for those with exposure to DVA. 
However, further work should also explore other sources of DVA, such as that recorded in secondary health care, family, and criminal justice records, to understand the true scale of the problem. ", doi="10.2196/42375", url="https://www.jmir.org/2023/1/e42375", url="http://www.ncbi.nlm.nih.gov/pubmed/37223967" } @Article{info:doi/10.2196/41808, author="Liman, Leon and May, Bernd and Fette, Georg and Krebs, Jonathan and Puppe, Frank", title="Using a Clinical Data Warehouse to Calculate and Present Key Metrics for the Radiology Department: Implementation and Performance Evaluation", journal="JMIR Med Inform", year="2023", month="May", day="22", volume="11", pages="e41808", keywords="data warehouse", keywords="electronic health records", keywords="radiology", keywords="statistics and numerical data", keywords="hospital data", keywords="eHealth", keywords="medical records", abstract="Background: Due to the importance of radiologic examinations, such as X-rays or computed tomography scans, for many clinical diagnoses, the optimal use of the radiology department is 1 of the primary goals of many hospitals. Objective: This study aims to calculate the key metrics of this use by creating a radiology data warehouse solution, where data from radiology information systems (RISs) can be imported and then queried using a query language as well as a graphical user interface (GUI). Methods: Using a simple configuration file, the developed system allowed for the processing of radiology data exported from any kind of RIS into a Microsoft Excel, comma-separated value (CSV), or JavaScript Object Notation (JSON) file. These data were then imported into a clinical data warehouse. Additional values based on the radiology data were calculated during this import process by implementing 1 of several provided interfaces. Afterward, the query language and GUI of the data warehouse were used to configure and calculate reports on these data. 
For the most common types of requested reports, a web interface was created to view their numbers as graphics. Results: The tool was successfully tested with the data of 4 different German hospitals from 2018 to 2021, with a total of 1,436,111 examinations. User feedback was positive, as all user queries could be answered whenever the available data were sufficient. The initial processing of the radiology data for use with the clinical data warehouse took (depending on the amount of data provided by each hospital) between 7 minutes and 1 hour 11 minutes. Calculating 3 reports of different complexities on the data of each hospital was possible in 1-3 seconds for reports with up to 200 individual calculations and in up to 1.5 minutes for reports with up to 8200 individual calculations. Conclusions: A system was developed with the main advantage of being generic concerning the export of different RISs as well as concerning the configuration of queries for various reports. The queries could be configured easily using the GUI of the data warehouse, and their results could be exported into the standard formats Excel and CSV for further processing. 
", doi="10.2196/41808", url="https://medinform.jmir.org/2023/1/e41808", url="http://www.ncbi.nlm.nih.gov/pubmed/37213191" } @Article{info:doi/10.2196/41048, author="Ranchon, Florence and Chanoine, S{\'e}bastien and Lambert-Lacroix, Sophie and Bosson, Jean-Luc and Moreau-Gaudry, Alexandre and Bedouch, Pierrick", title="Development of Indirect Health Data Linkage on Health Product Use and Care Trajectories in France: Systematic Review", journal="J Med Internet Res", year="2023", month="May", day="18", volume="25", pages="e41048", keywords="data linkage", keywords="health database", keywords="deterministic approach", keywords="probabilistic approach", keywords="health products", keywords="public health activity", keywords="health data", keywords="linkage", keywords="France", keywords="big data", keywords="usability", keywords="integration", keywords="care trajectories", abstract="Background: European national disparities in the integration of data linkage (ie, being able to match patient data between databases) into routine public health activities were recently highlighted. In France, the claims database covers almost the whole population from birth to death, offering a great research potential for data linkage. As the use of a common unique identifier to directly link personal data is often limited, linkage with a set of indirect key identifiers has been developed, which is associated with the linkage quality challenge to minimize errors in linked data. Objective: The aim of this systematic review is to analyze the type and quality of research publications on indirect data linkage on health product use and care trajectories in France. Methods: A comprehensive search for all papers published in PubMed/Medline and Embase databases up to December 31, 2022, involving linked French database focusing on health products use or care trajectories was realized. 
Only studies based on the use of indirect identifiers were included (ie, without a unique personal identifier available to easily link the databases). A descriptive analysis of data linkage with quality indicators and adherence to the Bohensky framework for evaluating data linkage studies was also performed. Results: In total, 16 papers were selected. Data linkage was performed at the national level in 7 (43.8\%) studies and at the local level in 9 (56.2\%) studies. The number of patients included in the different databases and resulting from data linkage varied greatly, respectively, from 713 to 75,000 patients and from 210 to 31,000 linked patients. The diseases studied were mainly chronic diseases and infections. The objectives of the data linkage were multiple: to estimate the risk of adverse drug reactions (ADRs; n=6, 37.5\%), to reconstruct the patient's care trajectory (n=5, 31.3\%), to describe therapeutic uses (n=2, 12.5\%), to evaluate the benefits of treatments (n=2, 12.5\%), and to evaluate treatment adherence (n=1, 6.3\%). Registries were the databases most frequently linked with French claims data. No studies have looked at linking with a hospital data warehouse, a clinical trial database, or patient self-reported databases. The linkage approach was deterministic in 7 (43.8\%) studies, probabilistic in 4 (25.0\%) studies, and not specified in 5 (31.3\%) studies. The linkage rate was mainly from 80\% to 90\% (reported in 11/15, 73.3\%, studies). Adherence to the Bohensky framework for evaluating data linkage studies showed that the description of the source databases for the linkage was always performed but that the completion rate and accuracy of the variables to be linked were not systematically described. Conclusions: This review highlights the growing interest in health data linkage in France. Nevertheless, regulatory, technical, and human constraints remain major obstacles to its deployment. 
The volume, variety, and validity of the data represent a real challenge, and advanced expertise and skills in statistical analysis and artificial intelligence are required to process these big data. ", doi="10.2196/41048", url="https://www.jmir.org/2023/1/e41048", url="http://www.ncbi.nlm.nih.gov/pubmed/37200084" } @Article{info:doi/10.2196/37306, author="Kochuthakidiyel Suresh, Praveenkumar and Sekar, Gnanasoundari and Mallady, Kavya and Wan Ab Rahman, Suriana Wan and Shima Shahidan, Nazatul Wan and Venkatesan, Gokulakannan", title="The Identification of Potential Drugs for Dengue Hemorrhagic Fever: Network-Based Drug Reprofiling Study", journal="JMIR Bioinform Biotech", year="2023", month="May", day="9", volume="4", pages="e37306", keywords="dengue hemorrhagic fever", keywords="drug reprofiling", keywords="network pharmacology", keywords="network medicine", keywords="DHF", keywords="repurposable drugs", keywords="viral fevers", keywords="drug repurposing", abstract="Background: Dengue fever can progress to dengue hemorrhagic fever (DHF), a more serious and occasionally fatal form of the disease. Indicators of serious disease arise around the time the fever begins to subside (typically 3 to 7 days following symptom onset). There are currently no effective antivirals available. Drug repurposing is an emerging drug discovery process for rapidly developing effective DHF therapies. Through network pharmacology modeling, several US Food and Drug Administration (FDA)-approved medications have already been researched for various viral outbreaks. Objective: We aimed to identify potentially repurposable drugs for DHF among existing FDA-approved drugs for viral attacks, symptoms of viral fevers, and DHF. Methods: Using target identification databases (GeneCards and DrugBank), we identified human--DHF virus interacting genes and drug targets against these genes. We determined hub genes and potential drugs with a network-based analysis. 
We performed functional enrichment and network analyses to identify pathways, protein-protein interactions, tissues where the gene expression was high, and disease-gene associations. Results: Analyzing virus-host interactions and therapeutic targets in the human genome network revealed 45 repurposable medicines. Hub network analysis of host-virus-drug associations suggested that aspirin, captopril, and rilonacept might efficiently treat DHF. Gene enrichment analysis supported these findings. According to a Mayo Clinic report, using aspirin in the treatment of dengue fever may increase the risk of bleeding complications, but several studies from around the world suggest that thrombosis is associated with DHF. The human interactome contains the genes prostaglandin-endoperoxide synthase 2 (PTGS2), angiotensin converting enzyme (ACE), and coagulation factor II, thrombin (F2), which have been documented to have a role in the pathogenesis of disease progression in DHF. Our analysis of most of the drugs targeting these genes showed that the hub gene module (human-virus-drug) was highly enriched in tissues associated with the immune system (P=7.29 {\texttimes} 10\textsuperscript{-24}) and human umbilical vein endothelial cells (P=1.83 {\texttimes} 10\textsuperscript{-20}); this group of tissues acts as an anticoagulant barrier between the vessel walls and blood. KEGG analysis showed an association with genes linked to cancer (P=1.13 {\texttimes} 10\textsuperscript{-14}) and the advanced glycation end products--receptor for advanced glycation end products signaling pathway in diabetic complications (P=3.52 {\texttimes} 10\textsuperscript{-14}), which indicates that DHF patients with diabetes and cancer are at risk of higher pathogenicity. Thus, gene-targeting medications may play a significant part in limiting or worsening the condition of DHF patients. 
Conclusions: Aspirin is not usually prescribed for dengue fever because of bleeding complications, but it has been reported that using aspirin in lower doses is beneficial in the management of diseases with thrombosis. Drug repurposing is an emerging field in which clinical validation and dosage identification are required before the drug is prescribed. Further retrospective and collaborative international trials are essential for understanding the pathogenesis of this condition. ", doi="10.2196/37306", url="https://bioinform.jmir.org/2023/1/e37306" } @Article{info:doi/10.2196/45534, author="Fitzpatrick, K. Natalie and Dobson, Richard and Roberts, Angus and Jones, Kerina and Shah, D. Anoop and Nenadic, Goran and Ford, Elizabeth", title="Understanding Views Around the Creation of a Consented, Donated Databank of Clinical Free Text to Develop and Train Natural Language Processing Models for Research: Focus Group Interviews With Stakeholders", journal="JMIR Med Inform", year="2023", month="May", day="3", volume="11", pages="e45534", keywords="consent", keywords="databank", keywords="electronic health records", keywords="free text", keywords="governance", keywords="natural language processing", keywords="public involvement", keywords="unstructured text", abstract="Background: Information stored within electronic health records is often recorded as unstructured text. Special computerized natural language processing (NLP) tools are needed to process this text; however, complex governance arrangements make such data in the National Health Service hard to access, and therefore, it is difficult to use for research in improving NLP methods. The creation of a donated databank of clinical free text could provide an important opportunity for researchers to develop NLP methods and tools and may circumvent delays in accessing the data needed to train the models. 
However, to date, there has been little or no engagement with stakeholders on the acceptability and design considerations of establishing a free-text databank for this purpose. Objective: This study aimed to ascertain stakeholder views around the creation of a consented, donated databank of clinical free text to help create, train, and evaluate NLP for clinical research and to inform the potential next steps for adopting a partner-led approach to establish a national, funded databank of free text for use by the research community. Methods: Web-based in-depth focus group interviews were conducted with 4 stakeholder groups (patients and members of the public, clinicians, information governance leads and research ethics members, and NLP researchers). Results: All stakeholder groups were strongly in favor of the databank and saw great value in creating an environment where NLP tools can be tested and trained to improve their accuracy. Participants highlighted a range of complex issues for consideration as the databank is developed, including communicating the intended purpose, the approach to access and safeguarding the data, who should have access, and how to fund the databank. Participants recommended that a small-scale, gradual approach be adopted to start to gather donations and encouraged further engagement with stakeholders to develop a road map and set of standards for the databank. Conclusions: These findings provide a clear mandate to begin developing the databank and a framework for stakeholder expectations, which we would aim to meet with the databank delivery. 
", doi="10.2196/45534", url="https://medinform.jmir.org/2023/1/e45534", url="http://www.ncbi.nlm.nih.gov/pubmed/37133927" } @Article{info:doi/10.2196/40524, author="Hearn, Jason and Van den Eynde, Jef and Chinni, Bhargava and Cedars, Ari and Gottlieb Sen, Danielle and Kutty, Shelby and Manlhiot, Cedric", title="Data Quality Degradation on Prediction Models Generated From Continuous Activity and Heart Rate Monitoring: Exploratory Analysis Using Simulation", journal="JMIR Cardio", year="2023", month="May", day="3", volume="7", pages="e40524", keywords="wearables", keywords="time series", keywords="data reliability", keywords="prediction models", keywords="hear rate", keywords="monitoring", keywords="data", keywords="reliability", keywords="clinical", keywords="sleep", keywords="data set", keywords="cardiac", keywords="physiological", keywords="accuracy", keywords="consumer", keywords="device", abstract="Background: Limited data accuracy is often cited as a reason for caution in the integration of physiological data obtained from consumer-oriented wearable devices in care management pathways. The effect of decreasing accuracy on predictive models generated from these data has not been previously investigated. Objective: The aim of this study is to simulate the effect of data degradation on the reliability of prediction models generated from those data and thus determine the extent to which lower device accuracy might or might not limit their use in clinical settings. Methods: Using the Multilevel Monitoring of Activity and Sleep in Healthy People data set, which includes continuous free-living step count and heart rate data from 21 healthy volunteers, we trained a random forest model to predict cardiac competence. Model performance in 75 perturbed data sets with increasing missingness, noisiness, bias, and a combination of all 3 perturbations was compared to model performance for the unperturbed data set. 
Results: The unperturbed data set achieved a mean root mean square error (RMSE) of 0.079 (SD 0.001) in predicting cardiac competence index. For all types of perturbations, RMSE remained stable up to 20\%-30\% perturbation. Above this level, RMSE started increasing and reached the point at which the model was no longer predictive at 80\% for noise, 50\% for missingness, and 35\% for the combination of all perturbations. Introducing systematic bias in the underlying data had no effect on RMSE. Conclusions: In this proof-of-concept study, the performance of predictive models for cardiac competence generated from continuously acquired physiological data was relatively stable with declining quality of the source data. As such, lower accuracy of consumer-oriented wearable devices might not be an absolute contraindication for their use in clinical prediction models. ", doi="10.2196/40524", url="https://cardio.jmir.org/2023/1/e40524", url="http://www.ncbi.nlm.nih.gov/pubmed/37133921" } @Article{info:doi/10.2196/43802, author="Annis, Ann and Reaves, Crista and Sender, Jessica and Bumpus, Sherry", title="Health-Related Data Sources Accessible to Health Researchers From the US Government: Mapping Review", journal="J Med Internet Res", year="2023", month="Apr", day="27", volume="25", pages="e43802", keywords="data sets as topic", keywords="federal government", keywords="data collection", keywords="survey", keywords="questionnaire", keywords="health surveys", keywords="big data", keywords="government", keywords="data set", keywords="public domain", keywords="data source", keywords="systematic review", keywords="mapping review", keywords="review method", keywords="open data", keywords="health research", abstract="Background: Big data from large, government-sponsored surveys and data sets offers researchers opportunities to conduct population-based studies of important health issues in the United States, as well as develop preliminary data to support proposed future work. 
Yet, navigating these national data sources is challenging. Despite the widespread availability of national data, there is little guidance for researchers on how to access and evaluate the use of these resources. Objective: Our aim was to identify and summarize a comprehensive list of federally sponsored, health- and health care--related data sources that are accessible in the public domain in order to facilitate their use by researchers. Methods: We conducted a systematic mapping review of government sources of health-related data on US populations and with active or recent (previous 10 years) data collection. The key measures were government sponsor, overview and purpose of data, population of interest, sampling design, sample size, data collection methodology, type and description of data, and cost to obtain data. Convergent synthesis was used to aggregate findings. Results: Among 106 unique data sources, 57 met the inclusion criteria. Data sources were classified as survey or assessment data (n=30, 53\%), trends data (n=27, 47\%), summative processed data (n=27, 47\%), primary registry data (n=17, 30\%), and evaluative data (n=11, 19\%). Most (n=39, 68\%) served more than 1 purpose. The population of interest included individuals/patients (n=40, 70\%), providers (n=15, 26\%), and health care sites and systems (n=14, 25\%). The sources collected data on demographic (n=44, 77\%) and clinical information (n=35, 61\%), health behaviors (n=24, 42\%), provider or practice characteristics (n=22, 39\%), health care costs (n=17, 30\%), and laboratory tests (n=8, 14\%). Most (n=43, 75\%) offered free data sets. Conclusions: A broad scope of national health data is accessible to researchers. These data provide insights into important health issues and the nation's health care system while eliminating the burden of primary data collection. Data standardization and uniformity were uncommon across government entities, highlighting a need to improve data consistency. 
Secondary analyses of national data are a feasible, cost-efficient means to address national health concerns. ", doi="10.2196/43802", url="https://www.jmir.org/2023/1/e43802", url="http://www.ncbi.nlm.nih.gov/pubmed/37103987" } @Article{info:doi/10.2196/40805, author="Wu, Zhiyue and Peng, Suyuan and Zhou, Liang", title="Visualization of Traditional Chinese Medicine Formulas: Development and Usability Study", journal="JMIR Form Res", year="2023", month="Apr", day="21", volume="7", pages="e40805", keywords="visualization", keywords="Chinese medicine formulas", keywords="interactive data analysis", keywords="traditional Chinese medicine", keywords="multifaceted data visualization", keywords="five elements", abstract="Background: Traditional Chinese medicine (TCM) formulas are combinations of Chinese herbal medicines. Knowledge of classic medicine formulas is the basis of TCM diagnosis and treatment and is the core of TCM inheritance. The large number and flexibility of medicine formulas make memorization difficult, and understanding their composition rules is even more difficult. The multifaceted and multidimensional properties of herbal medicines are important for understanding the formula; however, these are usually separated from the formula information. Furthermore, these data are presented as text and cannot be analyzed jointly and interactively. Objective: We aimed to devise a visualization method for TCM formulas that shows the composition of medicine formulas and the multidimensional properties of herbal medicines involved and supports the comparison of medicine formulas. Methods: A TCM formula visualization method with multiple linked views is proposed and implemented as a web-based tool after close collaboration between visualization and TCM experts. 
The composition of medicine formulas is visualized in a formula view with a similarity-based layout supporting the comparison of constituent herbs; a shared herb view complements the formula view by showing all pairwise overlaps between formulas; and a dimensionality-reduction plot of herbs enables the visualization of multidimensional herb properties. The usefulness of the tool was evaluated through a usability study with TCM experts. Results: Our method was applied to 2 typical categories of medicine formulas, namely tonic formulas and heat-clearing formulas, which contain 20 and 26 formulas composed of 58 and 73 herbal medicines, respectively. Each herbal medicine has a 23-dimensional characterizing attribute. In the usability study, TCM experts explored the 2 data sets with our web-based tool and quickly gained insight into formulas and herbs of interest, as well as the overall features of the formula groups that are difficult to identify with the traditional text-based method. Moreover, feedback from the experts indicated the usefulness of the proposed method. Conclusions: Our TCM formula visualization method is able to visualize and compare complex medicine formulas and the multidimensional attributes of herbal medicines using a web-based tool. TCM experts gained insights into 2 typical medicine formula categories using our method. Overall, the new method is a promising first step toward new TCM formula education and analysis methodologies. 
", doi="10.2196/40805", url="https://formative.jmir.org/2023/1/e40805", url="http://www.ncbi.nlm.nih.gov/pubmed/37083631" } @Article{info:doi/10.2196/39259, author="Armbruster, David Friedrich Aaron and Br{\"u}ggmann, D{\"o}rthe and Groneberg, Alexander David and Bendels, Michael", title="The Influence of Paid Memberships on Physician Rating Websites With the Example of the German Portal Jameda: Descriptive Cross-sectional Study", journal="J Med Internet Res", year="2023", month="Apr", day="4", volume="25", pages="e39259", keywords="physician rating websites", keywords="physician rating portals", keywords="paid influence", keywords="Germany", abstract="Background: The majority of Germans see a deficit in information availability for choosing a physician. An increasing number of people use physician rating websites and decide upon the information provided. In Germany, the most popular physician rating website is Jameda.de, which offers monthly paid membership plans. The platform operator states that paid memberships have no influence on the rating indicators or list placement. Objective: The goal of this study was to investigate whether a physician's membership status might be related to his or her quantitative evaluation factors and to possibly quantify these effects. Methods: Physician profiles were retrieved through the search mask on Jameda.de website. Physicians from 8 disciplines in Germany's 12 most populous cities were specified as search criteria. Data Analysis and visualization were done with Matlab. Significance testing was conducted using a single factor ANOVA test followed by a multiple comparison test (Tukey Test). For analysis, the profiles were grouped according to member status (nonpaying, Gold, and Platinum) and analyzed according to the target variables---physician rating score, individual patient's ratings, number of evaluations, recommendation quota, number of colleague recommendations, and profile views. 
Results: A total of 21,837 nonpaying profiles, 2904 Gold, and 808 Platinum member profiles were acquired. Statistically significant differences were found between paying (Gold and Platinum) and nonpaying profiles in all parameters we examined. The distribution of patient reviews also differed by membership status. Paying profiles had more ratings, a better overall physician rating, a higher recommendation quota, and more colleague recommendations, and they were visited more frequently than nonpaying physicians' profiles. Statistically significant differences were also found in most evaluation parameters between the paid membership packages in the sample analyzed. Conclusions: Paid physician profiles could be interpreted as being optimized for the decision-making criteria of potential patients. With our data, it is not possible to draw any conclusions about the mechanisms that alter physicians' ratings. Further research is needed to investigate the causes for the observed effects. ", doi="10.2196/39259", url="https://www.jmir.org/2023/1/e39259", url="http://www.ncbi.nlm.nih.gov/pubmed/37014690" } @Article{info:doi/10.2196/42615, author="Syed, Rehan and Eden, Rebekah and Makasi, Tendai and Chukwudi, Ignatius and Mamudu, Azumah and Kamalpour, Mostafa and Kapugama Geeganage, Dakshi and Sadeghianasl, Sareh and Leemans, J. Sander J. and Goel, Kanika and Andrews, Robert and Wynn, Thandar Moe and ter Hofstede, Arthur and Myers, Trina", title="Digital Health Data Quality Issues: Systematic Review", journal="J Med Internet Res", year="2023", month="Mar", day="31", volume="25", pages="e42615", keywords="data quality", keywords="digital health", keywords="electronic health record", keywords="eHealth", keywords="systematic reviews", abstract="Background: The promise of digital health is principally dependent on the ability to electronically capture data that can be analyzed to improve decision-making. 
However, the ability to effectively harness data has proven elusive, largely because of the quality of the data captured. Despite the importance of data quality (DQ), an agreed-upon DQ taxonomy is still lacking in the literature. When consolidated frameworks are developed, the dimensions are often fragmented, without consideration of the interrelationships among the dimensions or their resultant impact. Objective: The aim of this study was to develop a consolidated digital health DQ dimension and outcome (DQ-DO) framework to provide insights into 3 research questions: What are the dimensions of digital health DQ? How are the dimensions of digital health DQ related? and What are the impacts of digital health DQ? Methods: Following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines, a developmental systematic literature review was conducted of peer-reviewed literature focusing on digital health DQ predominantly in hospital settings. A total of 227 relevant articles were retrieved and inductively analyzed to identify digital health DQ dimensions and outcomes. The inductive analysis was performed through open coding, constant comparison, and card sorting with subject matter experts to identify digital health DQ dimensions and digital health DQ outcomes. Subsequently, a computer-assisted analysis was performed and verified by DQ experts to identify the interrelationships among the DQ dimensions and relationships between DQ dimensions and outcomes. The analysis resulted in the development of the DQ-DO framework. 
Results: The digital health DQ-DO framework consists of 6 dimensions of DQ, namely accessibility, accuracy, completeness, consistency, contextual validity, and currency; interrelationships among the dimensions of digital health DQ, with consistency being the most influential dimension impacting all other digital health DQ dimensions; 5 digital health DQ outcomes, namely clinical, clinician, research-related, business process, and organizational outcomes; and relationships between the digital health DQ dimensions and DQ outcomes, with the consistency and accessibility dimensions impacting all DQ outcomes. Conclusions: The DQ-DO framework developed in this study demonstrates the complexity of digital health DQ and the necessity for reducing digital health DQ issues. The framework further provides health care executives with holistic insights into DQ issues and resultant outcomes, which can help them prioritize which DQ-related problems to tackle first. ", doi="10.2196/42615", url="https://www.jmir.org/2023/1/e42615", url="http://www.ncbi.nlm.nih.gov/pubmed/37000497" } @Article{info:doi/10.2196/41588, author="Brauneck, Alissa and Schmalhorst, Louisa and Kazemi Majdabadi, Mahdi Mohammad and Bakhtiari, Mohammad and V{\"o}lker, Uwe and Baumbach, Jan and Baumbach, Linda and Buchholtz, Gabriele", title="Federated Machine Learning, Privacy-Enhancing Technologies, and Data Protection Laws in Medical Research: Scoping Review", journal="J Med Internet Res", year="2023", month="Mar", day="30", volume="25", pages="e41588", keywords="federated learning", keywords="data protection regulation", keywords="data protection by design", keywords="privacy protection", keywords="General Data Protection Regulation compliance", keywords="GDPR compliance", keywords="privacy-preserving technologies", keywords="differential privacy", keywords="secure multiparty computation", abstract="Background: The collection, storage, and analysis of large data sets are relevant in many sectors. 
Especially in the medical field, the processing of patient data promises great progress in personalized health care. However, it is strictly regulated, such as by the General Data Protection Regulation (GDPR). These regulations mandate strict data security and data protection and, thus, create major challenges for collecting and using large data sets. Technologies such as federated learning (FL), especially paired with differential privacy (DP) and secure multiparty computation (SMPC), aim to solve these challenges. Objective: This scoping review aimed to summarize the current discussion on the legal questions and concerns related to FL systems in medical research. We were particularly interested in whether and to what extent FL applications and training processes are compliant with the GDPR data protection law and whether the use of the aforementioned privacy-enhancing technologies (DP and SMPC) affects this legal compliance. We placed special emphasis on the consequences for medical research and development. Methods: We performed a scoping review according to the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews). We reviewed articles on Beck-Online, SSRN, ScienceDirect, arXiv, and Google Scholar published in German or English between 2016 and 2022. We examined 4 questions: whether local and global models are ``personal data'' as per the GDPR; what the ``roles'' as defined by the GDPR of various parties in FL are; who controls the data at various stages of the training process; and how, if at all, the use of privacy-enhancing technologies affects these findings. Results: We identified and summarized the findings of 56 relevant publications on FL. Local and likely also global models constitute personal data according to the GDPR. FL strengthens data protection but is still vulnerable to a number of attacks and the possibility of data leakage. 
These concerns can be successfully addressed through the privacy-enhancing technologies SMPC and DP. Conclusions: Combining FL with SMPC and DP is necessary to fulfill the legal data protection requirements (GDPR) in medical research dealing with personal data. Even though some technical and legal challenges remain, for example, the possibility of successful attacks on the system, combining FL with SMPC and DP creates enough security to satisfy the legal requirements of the GDPR. This combination thereby provides an attractive technical solution for health institutions willing to collaborate without exposing their data to risk. From a legal perspective, the combination provides enough built-in security measures to satisfy data protection requirements, and from a technical perspective, the combination provides secure systems with performance comparable to that of centralized machine learning applications. ", doi="10.2196/41588", url="https://www.jmir.org/2023/1/e41588", url="http://www.ncbi.nlm.nih.gov/pubmed/36995759" } @Article{info:doi/10.2196/46700, author="Tewari, Ambuj", title="mHealth Systems Need a Privacy-by-Design Approach: Commentary on ``Federated Machine Learning, Privacy-Enhancing Technologies, and Data Protection Laws in Medical Research: Scoping Review''", journal="J Med Internet Res", year="2023", month="Mar", day="30", volume="25", pages="e46700", keywords="mHealth", keywords="differential privacy", keywords="private synthetic data", keywords="federated learning", keywords="data protection regulation", keywords="data protection by design", keywords="privacy protection", keywords="General Data Protection Regulation", keywords="GDPR compliance", keywords="privacy-preserving technologies", keywords="secure multiparty computation", keywords="multiparty computation", keywords="machine learning", keywords="privacy", doi="10.2196/46700", url="https://www.jmir.org/2023/1/e46700", url="http://www.ncbi.nlm.nih.gov/pubmed/36995757" } 
@Article{info:doi/10.2196/41882, author="Li, Yue and Gee, William and Jin, Kun and Bond, Robert", title="Examining Homophily, Language Coordination, and Analytical Thinking in Web-Based Conversations About Vaccines on Reddit: Study Using Deep Neural Network Language Models and Computer-Assisted Conversational Analyses", journal="J Med Internet Res", year="2023", month="Mar", day="23", volume="25", pages="e41882", keywords="vaccine hesitancy", keywords="social media", keywords="web-based conversations", keywords="neural network language models", keywords="computer-assisted conversational analyses", abstract="Background: Vaccine hesitancy has been deemed one of the top 10 threats to global health. Antivaccine information on social media is a major barrier to addressing vaccine hesitancy. Understanding how vaccine proponents and opponents interact with each other on social media may help address vaccine hesitancy. Objective: We aimed to examine conversations between vaccine proponents and opponents on Reddit to understand whether homophily in web-based conversations impedes opinion exchange, whether people are able to accommodate their languages to each other in web-based conversations, and whether engaging with opposing viewpoints stimulates higher levels of analytical thinking. Methods: We analyzed large-scale conversational text data about human vaccines on Reddit from 2016 to 2018. Using deep neural network language models and computer-assisted conversational analyses, we obtained each Redditor's stance on vaccines, each post's stance on vaccines, each Redditor's language coordination score, and each post or comment's analytical thinking score. We then performed chi-square tests, 2-tailed t tests, and multilevel modeling to test 3 questions of interest. Results: The results show that both provaccine and antivaccine Redditors are more likely to selectively respond to Redditors who indicate similar views on vaccines (P<.001). 
When Redditors interact with others who hold opposing views on vaccines, both provaccine and antivaccine Redditors accommodate their language to out-group members (provaccine Redditors: P=.044; antivaccine Redditors: P=.047) and show no difference in analytical thinking compared with interacting with congruent views (P=.63), suggesting that Redditors do not engage in motivated reasoning. Antivaccine Redditors, on average, showed higher analytical thinking in their posts and comments than provaccine Redditors (P<.001). Conclusions: This study shows that although vaccine proponents and opponents selectively communicate with their in-group members on Reddit, they accommodate their language and do not engage in motivated reasoning when communicating with out-group members. These findings may have implications for the design of provaccine campaigns on social media. ", doi="10.2196/41882", url="https://www.jmir.org/2023/1/e41882", url="http://www.ncbi.nlm.nih.gov/pubmed/36951921" } @Article{info:doi/10.2196/35568, author="{\vS}uster, Simon and Baldwin, Timothy and Lau, Han Jey and Jimeno Yepes, Antonio and Martinez Iraola, David and Otmakhova, Yulia and Verspoor, Karin", title="Automating Quality Assessment of Medical Evidence in Systematic Reviews: Model Development and Validation Study", journal="J Med Internet Res", year="2023", month="Mar", day="13", volume="25", pages="e35568", keywords="critical appraisal", keywords="evidence synthesis", keywords="systematic reviews", keywords="bias detection", keywords="automated quality assessment", abstract="Background: Assessment of the quality of medical evidence available on the web is a critical step in the preparation of systematic reviews. Existing tools that automate parts of this task validate the quality of individual studies but not of entire bodies of evidence and focus on a restricted set of quality criteria. 
Objective: We proposed a quality assessment task that provides an overall quality rating for each body of evidence (BoE), as well as finer-grained justification for different quality criteria according to the Grading of Recommendation, Assessment, Development, and Evaluation formalization framework. For this purpose, we constructed a new data set and developed a machine learning baseline system (EvidenceGRADEr). Methods: We algorithmically extracted quality-related data from all summaries of findings found in the Cochrane Database of Systematic Reviews. Each BoE was defined by a set of population, intervention, comparison, and outcome criteria and assigned a quality grade (high, moderate, low, or very low) together with quality criteria (justification) that influenced that decision. Different statistical data, metadata about the review, and parts of the review text were extracted as support for grading each BoE. After pruning the resulting data set with various quality checks, we used it to train several neural-model variants. The predictions were compared against the labels originally assigned by the authors of the systematic reviews. Results: Our quality assessment data set, Cochrane Database of Systematic Reviews Quality of Evidence, contains 13,440 instances, or BoEs labeled for quality, originating from 2252 systematic reviews published on the internet from 2002 to 2020. On the basis of a 10-fold cross-validation, the best neural binary classifiers for quality criteria detected risk of bias at 0.78 F1 (P=.68; R=0.92) and imprecision at 0.75 F1 (P=.66; R=0.86), while the performance on inconsistency, indirectness, and publication bias criteria was lower (F1 in the range of 0.3-0.4). The prediction of the overall quality grade into 1 of the 4 levels resulted in 0.5 F1. 
When casting the task as a binary problem by merging the Grading of Recommendation, Assessment, Development, and Evaluation classes (high+moderate vs low+very low-quality evidence), we attained 0.74 F1. We also found that the results varied depending on the supporting information that is provided as an input to the models. Conclusions: Different factors affect the quality of evidence in the context of systematic reviews of medical evidence. Some of these (risk of bias and imprecision) can be automated with reasonable accuracy. Other quality dimensions such as indirectness, inconsistency, and publication bias prove more challenging for machine learning, largely because they are much rarer. This technology could substantially reduce reviewer workload in the future and expedite quality assessment as part of evidence synthesis. ", doi="10.2196/35568", url="https://www.jmir.org/2023/1/e35568", url="http://www.ncbi.nlm.nih.gov/pubmed/36722350" } @Article{info:doi/10.2196/42822, author="Sinaci, Anil A. and Gencturk, Mert and Teoman, Alper Huseyin and Laleci Erturkmen, Banu Gokce and Alvarez-Romero, Celia and Martinez-Garcia, Alicia and Poblador-Plou, Beatriz and Carmona-P{\'i}rez, Jon{\'a}s and L{\"o}be, Matthias and Parra-Calderon, Luis Carlos", title="A Data Transformation Methodology to Create Findable, Accessible, Interoperable, and Reusable Health Data: Software Design, Development, and Evaluation Study", journal="J Med Internet Res", year="2023", month="Mar", day="8", volume="25", pages="e42822", keywords="Health Level 7 Fast Healthcare Interoperability Resources", keywords="HL7 FHIR", keywords="Findable, Accessible, Interoperable, and Reusable principles", keywords="FAIR principles", keywords="health data sharing", keywords="health data transformation", keywords="secondary use", abstract="Background: Sharing health data is challenging because of several technical, ethical, and regulatory issues. 
The Findable, Accessible, Interoperable, and Reusable (FAIR) guiding principles have been conceptualized to enable data interoperability. Many studies provide implementation guidelines, assessment metrics, and software to achieve FAIR-compliant data, especially for health data sets. Health Level 7 (HL7) Fast Healthcare Interoperability Resources (FHIR) is a health data content modeling and exchange standard. Objective: Our goal was to devise a new methodology to extract, transform, and load existing health data sets into HL7 FHIR repositories in line with FAIR principles, develop a Data Curation Tool to implement the methodology, and evaluate it on health data sets from 2 different but complementary institutions. We aimed to increase the level of compliance with FAIR principles of existing health data sets through standardization and facilitate health data sharing by eliminating the associated technical barriers. Methods: Our approach automatically processes the capabilities of a given FHIR end point and directs the user while configuring mappings according to the rules enforced by FHIR profile definitions. Code system mappings can be configured for terminology translations through automatic use of FHIR resources. The validity of the created FHIR resources can be automatically checked, and the software does not allow invalid resources to be persisted. At each stage of our data transformation methodology, we used particular FHIR-based techniques so that the resulting data set could be evaluated as FAIR. We performed a data-centric evaluation of our methodology on health data sets from 2 different institutions. Results: Through an intuitive graphical user interface, users are prompted to configure the mappings into FHIR resource types with respect to the restrictions of selected profiles. 
Once the mappings are developed, our approach can syntactically and semantically transform existing health data sets into HL7 FHIR without loss of data utility according to our privacy-concerned criteria. In addition to the mapped resource types, behind the scenes, we create additional FHIR resources to satisfy several FAIR criteria. According to the data maturity indicators and evaluation methods of the FAIR Data Maturity Model, we achieved the maximum level (level 5) for being Findable, Accessible, and Interoperable and level 3 for being Reusable. Conclusions: We developed and extensively evaluated our data transformation approach to unlock the value of existing health data residing in disparate data silos to make them available for sharing according to the FAIR principles. We showed that our method can successfully transform existing health data sets into HL7 FHIR without loss of data utility, and the result is FAIR in terms of the FAIR Data Maturity Model. We support institutional migration to HL7 FHIR, which not only leads to FAIR data sharing but also eases the integration with different research networks. 
", doi="10.2196/42822", url="https://www.jmir.org/2023/1/e42822", url="http://www.ncbi.nlm.nih.gov/pubmed/36884270" } @Article{info:doi/10.2196/41100, author="Karapetian, Karina and Jeon, Min Soo and Kwon, Jin-Won and Suh, Young-Kyoon", title="Supervised Relation Extraction Between Suicide-Related Entities and Drugs: Development and Usability Study of an Annotated PubMed Corpus", journal="J Med Internet Res", year="2023", month="Mar", day="8", volume="25", pages="e41100", keywords="suicide", keywords="adverse drug events", keywords="information extraction", keywords="relation classification", keywords="bidirectional encoder representations from transformers", keywords="pharmacovigilance", keywords="natural language processing", keywords="PubMed", keywords="corpus", keywords="language model", abstract="Background: Drug-induced suicide has been debated as a crucial issue in both clinical and public health research. Published research articles contain valuable data on the drugs associated with suicidal adverse events. An automated process that extracts such information and rapidly detects drugs related to suicide risk is essential but has not been well established. Moreover, few data sets are available for training and validating classification models on drug-induced suicide. Objective: This study aimed to build a corpus of drug-suicide relations containing annotated entities for drugs, suicidal adverse events, and their relations. To confirm the effectiveness of the drug-suicide relation corpus, we evaluated the performance of a relation classification model using the corpus in conjunction with various embeddings. Methods: We collected the abstracts and titles of research articles associated with drugs and suicide from PubMed and manually annotated them along with their relations at the sentence level (adverse drug events, treatment, suicide means, or miscellaneous). 
To reduce the manual annotation effort, we preliminarily selected sentences with a pretrained zero-shot classifier or sentences containing only drug and suicide keywords. We trained a relation classification model using various Bidirectional Encoder Representations from Transformer embeddings with the proposed corpus. We then compared the performances of the model with different Bidirectional Encoder Representations from Transformer--based embeddings and selected the most suitable embedding for our corpus. Results: Our corpus comprised 11,894 sentences extracted from the titles and abstracts of the PubMed research articles. Each sentence was annotated with drug and suicide entities and the relationship between these 2 entities (adverse drug events, treatment, means, and miscellaneous). All of the tested relation classification models that were fine-tuned on the corpus accurately detected sentences of suicidal adverse events regardless of their pretrained type and data set properties. Conclusions: To our knowledge, this is the first and most extensive corpus of drug-suicide relations. ", doi="10.2196/41100", url="https://www.jmir.org/2023/1/e41100", url="http://www.ncbi.nlm.nih.gov/pubmed/36884281" } @Article{info:doi/10.2196/42231, author="Mali, Namrata and Restrepo, Felipe and Abrahams, Alan and Sands, Laura and Goldberg, M. 
David and Gruss, Richard and Zaman, Nohel and Shields, Wendy and Omaki, Elise and Ehsani, Johnathon and Ractham, Peter and Kaewkitipong, Laddawan", title="Safety Concerns in Mobility-Assistive Products for Older Adults: Content Analysis of Online Reviews", journal="J Med Internet Res", year="2023", month="Mar", day="2", volume="25", pages="e42231", keywords="injury prevention", keywords="consumer-reported injuries", keywords="older adults", keywords="online reviews", keywords="mobility-assistive devices", keywords="product failures", abstract="Background: Older adults who have difficulty moving around are commonly advised to adopt mobility-assistive devices to prevent injuries. However, limited evidence exists on the safety of these devices. Existing data sources such as the National Electronic Injury Surveillance System tend to focus on injury description rather than the underlying context, thus providing little to no actionable information regarding the safety of these devices. Although online reviews are often used by consumers to assess the safety of products, prior studies have not explored consumer-reported injuries and safety concerns within online reviews of mobility-assistive devices. Objective: This study aimed to investigate injury types and contexts stemming from the use of mobility-assistive devices, as reported by older adults or their caregivers in online reviews. It not only identified injury severities and mobility-assistive device failure pathways but also shed light on the development of safety information and protocols for these products. Methods: Reviews concerning assistive devices were extracted from the ``assistive aid'' categories, which are typically intended for older adult use, on Amazon's US website. The extracted reviews were filtered so that only those pertaining to mobility-assistive devices (canes, gait or transfer belts, ramps, walkers or rollators, and wheelchairs or transport chairs) were retained. 
We conducted large-scale content analysis of these 48,886 retained reviews by coding them according to injury type (no injury, potential future injury, minor injury, and major injury) and injury pathway (device critical component breakage or decoupling; unintended movement; instability; poor, uneven surface handling; and trip hazards). Coding efforts were carried out across 2 separate phases in which the team manually verified all instances coded as minor injury, major injury, or potential future injury and established interrater reliability to validate coding efforts. Results: The content analysis provided a better understanding of the contexts and conditions leading to user injury, as well as the severity of injuries associated with these mobility-assistive devices. Injury pathways---device critical component failures; unintended device movement; poor, uneven surface handling; instability; and trip hazards---were identified for 5 product types (canes, gait and transfer belts, ramps, walkers and rollators, and wheelchairs and transport chairs). Outcomes were normalized per 10,000 posting counts (online reviews) mentioning minor injury, major injury, or potential future injury by product category. Overall, per 10,000 reviews, 240 (2.4\%) described mobility-assistive equipment--related user injuries, whereas 2318 (23.18\%) revealed potential future injuries. Conclusions: This study highlights mobility-assistive device injury contexts and severities, suggesting that consumers who posted online reviews attribute most serious injuries to a defective item, rather than user misuse. It implies that many mobility-assistive device injuries may be preventable through patient and caregiver education on how to evaluate new and existing equipment for risk of potential future injury. 
", doi="10.2196/42231", url="https://www.jmir.org/2023/1/e42231", url="http://www.ncbi.nlm.nih.gov/pubmed/36862459" } @Article{info:doi/10.2196/36477, author="Chen, Min and Tan, Xuan and Padman, Rema", title="A Machine Learning Approach to Support Urgent Stroke Triage Using Administrative Data and Social Determinants of Health at Hospital Presentation: Retrospective Study", journal="J Med Internet Res", year="2023", month="Jan", day="30", volume="25", pages="e36477", keywords="stroke", keywords="diagnosis", keywords="triage", keywords="decision support", keywords="social determinants of health", keywords="prediction", keywords="machine learning", keywords="interpretability", keywords="medical decision-making", keywords="retrospective study", keywords="claims data", abstract="Background: The key to effective stroke management is timely diagnosis and triage. Machine learning (ML) methods developed to assist in detecting stroke have focused on interpreting detailed clinical data such as clinical notes and diagnostic imaging results. However, such information may not be readily available when patients are initially triaged, particularly in rural and underserved communities. Objective: This study aimed to develop an ML stroke prediction algorithm based on data widely available at the time of patients' hospital presentations and assess the added value of social determinants of health (SDoH) in stroke prediction. Methods: We conducted a retrospective study of the emergency department and hospitalization records from 2012 to 2014 from all the acute care hospitals in the state of Florida, merged with the SDoH data from the American Community Survey. A case-control design was adopted to construct stroke and stroke mimic cohorts. We compared the algorithm performance and feature importance measures of the ML models (ie, gradient boosting machine and random forest) with those of the logistic regression model based on 3 sets of predictors. 
To provide insights into the prediction and ultimately assist care providers in decision-making, we used TreeSHAP for tree-based ML models to explain the stroke prediction. Results: Our analysis included 143,203 hospital visits of unique patients, and it was confirmed based on the principal diagnosis at discharge that 73\% (n=104,662) of these patients had a stroke. The approach proposed in this study has high sensitivity and is particularly effective at reducing the misdiagnosis of dangerous stroke chameleons (false-negative rate <4\%). ML classifiers consistently outperformed the benchmark logistic regression in all 3 input combinations. We found significant consistency across the models in the features that explain their performance. The most important features are age, the number of chronic conditions on admission, and primary payer (eg, Medicare or private insurance). Although both the individual- and community-level SDoH features helped improve the predictive performance of the models, the inclusion of the individual-level SDoH features led to a much larger improvement (area under the receiver operating characteristic curve increased from 0.694 to 0.823) than the inclusion of the community-level SDoH features (area under the receiver operating characteristic curve increased from 0.823 to 0.829). Conclusions: Using data widely available at the time of patients' hospital presentations, we developed a stroke prediction model with high sensitivity and reasonable specificity. The prediction algorithm uses variables that are routinely collected by providers and payers and might be useful in underresourced hospitals with limited availability of sensitive diagnostic tools or incomplete data-gathering capabilities. ", doi="10.2196/36477", url="https://www.jmir.org/2023/1/e36477", url="http://www.ncbi.nlm.nih.gov/pubmed/36716097" } @Article{info:doi/10.2196/40565, author="Wamala-Andersson, Sarah and Richardson, X. 
Matt and Landerdahl Stridsberg, Sara and Ryan, Jillian and Sukums, Felix and Goh, Yong-Shian", title="Artificial Intelligence and Precision Health Through Lenses of Ethics and Social Determinants of Health: Protocol for a State-of-the-Art Literature Review", journal="JMIR Res Protoc", year="2023", month="Jan", day="24", volume="12", pages="e40565", keywords="artificial intelligence", keywords="clinical outcome", keywords="detection", keywords="diagnosis", keywords="diagnostic", keywords="disease management", keywords="ethical framework", keywords="ethical", keywords="ethics", keywords="health outcome", keywords="health promotion", keywords="literature review", keywords="patient centered", keywords="person centered", keywords="precision health", keywords="precision medicine", keywords="prevention", keywords="review methodology", keywords="search strategy", keywords="social determinant", abstract="Background: Precision health is a rapidly developing field, largely driven by the development of artificial intelligence (AI)--related solutions. AI facilitates complex analysis of numerous health data for risk assessment, early detection of disease, and initiation of timely preventative health interventions that can be highly tailored to the individual. Despite such promise, ethical concerns arising from the rapid development and use of AI-related technologies have led to development of national and international frameworks to address responsible use of AI. Objective: We aimed to address research gaps and provide new knowledge regarding (1) examples of existing AI applications and what role they play regarding precision health, (2) what salient features can be used to categorize them, (3) what evidence exists for their effects on precision health outcomes, (4) how these AI applications comply with established ethical and responsible frameworks, and (5) how these AI applications address equity and social determinants of health (SDOH). 
Methods: This protocol delineates a state-of-the-art literature review of novel AI-based applications in precision health. Published and unpublished studies were retrieved from 6 electronic databases. Articles included in this study were from the inception of the databases to January 2023. The review will encompass applications that use AI as a primary or supporting system or method when primarily applied for precision health purposes in human populations. It includes any geographical location or setting, including the internet, community-based, and acute or clinical settings, reporting clinical, behavioral, and psychosocial outcomes, including detection-, diagnosis-, promotion-, prevention-, management-, and treatment-related outcomes. Results: This is step 1 toward a full state-of-the-art literature review with data analyses, results, and discussion of findings, which will also be published. The anticipated consequences on equity from the perspective of SDOH will be analyzed. Keyword cluster relationships and analyses will be visualized to indicate which research foci are leading the development of the field and where research gaps exist. Results will be presented based on the data analysis plan that includes primary analyses, visualization of sources, and secondary analyses. Implications for future research and person-centered public health will be discussed. Conclusions: Results from the review will potentially guide the continued development of AI applications, future research in reducing the knowledge gaps, and improvement of practice related to precision health. New insights regarding examples of existing AI applications, their salient features, their role regarding precision health, and the existing evidence for their effects on precision health outcomes will be demonstrated. Additionally, a demonstration of how existing AI applications address equity and SDOH and comply with established ethical and responsible frameworks will be provided. 
International Registered Report Identifier (IRRID): PRR1-10.2196/40565 ", doi="10.2196/40565", url="https://www.researchprotocols.org/2023/1/e40565", url="http://www.ncbi.nlm.nih.gov/pubmed/36692922" } @Article{info:doi/10.2196/38590, author="Chen, Xiaojie and Chen, Han and Nan, Shan and Kong, Xiangtian and Duan, Huilong and Zhu, Haiyan", title="Dealing With Missing, Imbalanced, and Sparse Features During the Development of a Prediction Model for Sudden Death Using Emergency Medicine Data: Machine Learning Approach", journal="JMIR Med Inform", year="2023", month="Jan", day="20", volume="11", pages="e38590", keywords="emergency medicine", keywords="prediction model", keywords="data preprocessing", keywords="imbalanced data", keywords="missing value interpolation", keywords="sparse features", keywords="clinical informatics", keywords="machine learning", keywords="medical informatics", abstract="Background: In emergency departments (EDs), early diagnosis and timely rescue, which are supported by prediction models using ED data, can increase patients' chances of survival. Unfortunately, ED data usually contain missing, imbalanced, and sparse features, which makes it challenging to build early identification models for diseases. Objective: This study aims to propose a systematic approach to deal with the problems of missing, imbalanced, and sparse features for developing sudden-death prediction models using emergency medicine (or ED) data. Methods: We proposed a 3-step approach to deal with data quality issues: a random forest (RF) for missing values, k-means for imbalanced data, and principal component analysis (PCA) for sparse features. For continuous and discrete variables, the decision coefficient R2 and the $\kappa$ coefficient were used to evaluate performance, respectively. The area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC) were used to estimate the model's performance. 
To further evaluate the proposed approach, we carried out a case study using an ED data set obtained from the Hainan Hospital of Chinese PLA General Hospital. A logistic regression (LR) prediction model for patient condition worsening was built. Results: A total of 1085 patients with rescue records and 17,959 patients without rescue records were selected and significantly imbalanced. We extracted 275, 402, and 891 variables from laboratory tests, medications, and diagnosis, respectively. After data preprocessing, the median R2 of the RF continuous variable interpolation was 0.623 (IQR 0.647), and the median of the $\kappa$ coefficient for discrete variable interpolation was 0.444 (IQR 0.285). The LR model constructed using the initial diagnostic data showed poor performance and variable separation, which was reflected in the abnormally high odds ratio (OR) values of the 2 variables of cardiac arrest and respiratory arrest (201568034532 and 1211118945, respectively) and an abnormal 95\% CI. Using processed data, the recall of the model reached 0.746, the F1-score was 0.73, and the AUROC was 0.708. Conclusions: The proposed systematic approach is valid for building a prediction model for emergency patients. ", doi="10.2196/38590", url="https://medinform.jmir.org/2023/1/e38590", url="http://www.ncbi.nlm.nih.gov/pubmed/36662548" } @Article{info:doi/10.2196/43521, author="Kim, Donghun and Jung, Woojin and Jiang, Ting and Zhu, Yongjun", title="An Exploratory Study of Medical Journal's Twitter Use: Metadata, Networks, and Content Analyses", journal="J Med Internet Res", year="2023", month="Jan", day="19", volume="25", pages="e43521", keywords="medical journals", keywords="social networks", keywords="Twitter", abstract="Background: An increasing number of medical journals are using social media to promote themselves and communicate with their readers. However, little is known about how medical journals use Twitter and what their social media management strategies are. 
Objective: This study aimed to understand how medical journals use Twitter from a global standpoint. We conducted a broad, in-depth analysis of all the available Twitter accounts of medical journals indexed by major indexing services, with a particular focus on their social networks and content. Methods: The Twitter profiles and metadata of medical journals were analyzed along with the social networks on their Twitter accounts. Results: The results showed that overall, publishers used different strategies regarding Twitter adoption, Twitter use patterns, and their subsequent decisions. The following specific findings were noted: journals with Twitter accounts had a significantly higher number of publications and a greater impact than their counterparts; subscription journals had a slightly higher Twitter adoption rate (2\%) than open access journals; journals with higher impact had more followers; and prestigious journals rarely followed other lesser-known journals on social media. In addition, an in-depth analysis of 2000 randomly selected tweets from 4 prestigious journals revealed that The Lancet had dedicated considerable effort to communicating with people about health information and fulfilling its social responsibility by organizing committees and activities to engage with a broad range of health-related issues; The New England Journal of Medicine and the Journal of the American Medical Association focused on promoting research articles and attempting to maximize the visibility of their research articles; and the British Medical Journal provided copious amounts of health information and discussed various health-related social problems to increase social awareness of the field of medicine. Conclusions: Our study used various perspectives to investigate how medical journals use Twitter and explored the Twitter management strategies of 4 of the most prestigious journals. 
Our study provides a detailed understanding of medical journals' use of Twitter from various perspectives and can help publishers, journals, and researchers to better use Twitter for their respective purposes. ", doi="10.2196/43521", url="https://www.jmir.org/2023/1/e43521", url="http://www.ncbi.nlm.nih.gov/pubmed/36656626" } @Article{info:doi/10.2196/31618, author="Cho, Sylvia and Weng, Chunhua and Kahn, G. Michael and Natarajan, Karthik", title="Identifying Data Quality Dimensions for Person-Generated Wearable Device Data: Multi-Method Study", journal="JMIR Mhealth Uhealth", year="2021", month="Dec", day="23", volume="9", number="12", pages="e31618", keywords="patient-generated health data", keywords="data accuracy", keywords="data quality", keywords="wearable device", keywords="fitness trackers", keywords="qualitative research", abstract="Background: There is a growing interest in using person-generated wearable device data for biomedical research, but there are also concerns regarding the quality of data such as missing or incorrect data. This emphasizes the importance of assessing data quality before conducting research. In order to perform data quality assessments, it is essential to define what data quality means for person-generated wearable device data by identifying the data quality dimensions. Objective: This study aims to identify data quality dimensions for person-generated wearable device data for research purposes. Methods: This study was conducted in 3 phases: literature review, survey, and focus group discussion. The literature review was conducted following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guideline to identify factors affecting data quality and its associated data quality challenges. 
In addition, we conducted a survey to confirm and complement results from the literature review and to understand researchers' perceptions of data quality dimensions that were previously identified as dimensions for the secondary use of electronic health record (EHR) data. We sent the survey to researchers with experience in analyzing wearable device data. Focus group discussion sessions were conducted with domain experts to derive data quality dimensions for person-generated wearable device data. On the basis of the results from the literature review and survey, a facilitator proposed potential data quality dimensions relevant to person-generated wearable device data, and the domain experts accepted or rejected the suggested dimensions. Results: In total, 19 studies were included in the literature review, and 3 major themes emerged: device- and technical-related, user-related, and data governance--related factors. The associated data quality problems were incomplete data, incorrect data, and heterogeneous data. A total of 20 respondents answered the survey. The major data quality challenges faced by researchers were completeness, accuracy, and plausibility. The importance ratings on data quality dimensions in an existing framework showed that the dimensions for secondary use of EHR data are applicable to person-generated wearable device data. There were 3 focus group sessions with domain experts in data quality and wearable device research. The experts concluded that intrinsic data quality features, such as conformance, completeness, and plausibility, and contextual and fitness-for-use data quality features, such as completeness (breadth and density) and temporal data granularity, are important data quality dimensions for assessing person-generated wearable device data for research purposes. Conclusions: In this study, intrinsic and contextual and fitness-for-use data quality dimensions for person-generated wearable device data were identified. 
The dimensions were adapted from data quality terminologies and frameworks for the secondary use of EHR data with a few modifications. Further research on how data quality can be assessed with respect to each dimension is needed. ", doi="10.2196/31618", url="https://mhealth.jmir.org/2021/12/e31618", url="http://www.ncbi.nlm.nih.gov/pubmed/34941540" } @Article{info:doi/10.2196/25414, author="Bartlett Ellis, Rebecca and Wright, Julie and Miller, Soederberg Lisa and Jake-Schoffman, Danielle and Hekler, B. Eric and Goldstein, M. Carly and Arigo, Danielle and Nebeker, Camille", title="Lessons Learned: Beta-Testing the Digital Health Checklist for Researchers Prompts a Call to Action by Behavioral Scientists", journal="J Med Internet Res", year="2021", month="Dec", day="22", volume="23", number="12", pages="e25414", keywords="digital health", keywords="mHealth", keywords="research ethics", keywords="institutional review board", keywords="IRB", keywords="behavioral medicine", keywords="wearable sensors", keywords="social media", keywords="bioethics", keywords="data management", keywords="usability", keywords="privacy", keywords="access", keywords="risks and benefits", keywords="mobile phone", doi="10.2196/25414", url="https://www.jmir.org/2021/12/e25414", url="http://www.ncbi.nlm.nih.gov/pubmed/34941548" } @Article{info:doi/10.2196/19250, author="Yeng, Kandabongee Prosper and Nweke, Obiora Livinus and Yang, Bian and Ali Fauzi, Muhammad and Snekkenes, Arthur Einar", title="Artificial Intelligence--Based Framework for Analyzing Health Care Staff Security Practice: Mapping Review and Simulation Study", journal="JMIR Med Inform", year="2021", month="Dec", day="22", volume="9", number="12", pages="e19250", keywords="artificial intelligence", keywords="machine learning", keywords="health care", keywords="security practice", keywords="framework", keywords="security", keywords="modeling", keywords="analysis", abstract="Background: Blocklisting malicious activities in health care 
is challenging in relation to access control in health care security practices due to the fear of preventing legitimate access for therapeutic reasons. Inadvertent prevention of legitimate access can contravene the availability trait of the confidentiality, integrity, and availability triad, and may result in worsening health conditions, leading to serious consequences, including deaths. Therefore, health care staff are often provided with a wide range of access such as a ``breaking-the-glass'' or ``self-authorization'' mechanism for emergency access. However, this broad access can undermine the confidentiality and integrity of sensitive health care data because breaking-the-glass can lead to vast unauthorized access, which could be problematic when determining illegitimate access in security practices. Objective: A review was performed to pinpoint appropriate artificial intelligence (AI) methods and data sources that can be used for effective modeling and analysis of health care staff security practices. Based on knowledge obtained from the review, a framework was developed and implemented with simulated data to provide a comprehensive approach toward effective modeling and analyzing security practices of health care staff in real access logs. Methods: The flow of our approach was a mapping review to provide AI methods, data sources and their attributes, along with other categories as input for framework development. To assess implementation of the framework, electronic health record (EHR) log data were simulated and analyzed, and the performance of various approaches in the framework was compared. Results: Among the total 130 articles initially identified, 18 met the inclusion and exclusion criteria. A thorough assessment and analysis of the included articles revealed that K-nearest neighbor, Bayesian network, and decision tree (C4.5) algorithms were predominantly applied to EHR and network logs with varying input features of health care staff security practices. 
Based on the review results, a framework was developed and implemented with simulated logs. The decision tree obtained the best precision of 0.655, whereas the best recall was achieved by the support vector machine (SVM) algorithm at 0.977. However, the best F1-score was obtained by random forest at 0.775. In brief, three classifiers (random forest, decision tree, and SVM) in the two-class approach achieved the best precision of 0.998. Conclusions: The security practices of health care staff can be effectively analyzed using a two-class approach to detect malicious and nonmalicious security practices. Based on our comparative study, the algorithms that can effectively be used in related studies include random forest, decision tree, and SVM. Deviations of security practices from required health care staff's security behavior in the big data context can be analyzed with real access logs to define appropriate incentives for improving conscious care security practice. ", doi="10.2196/19250", url="https://medinform.jmir.org/2021/12/e19250", url="http://www.ncbi.nlm.nih.gov/pubmed/34941549" } @Article{info:doi/10.2196/30368, author="Mudaranthakam, Pal Dinesh and Brown, Alexandra and Kerling, Elizabeth and Carlson, E. Susan and Valentine, J. 
Christina and Gajewski, Byron", title="The Successful Synchronized Orchestration of an Investigator-Initiated Multicenter Trial Using a Clinical Trial Management System and Team Approach: Design and Utility Study", journal="JMIR Form Res", year="2021", month="Dec", day="22", volume="5", number="12", pages="e30368", keywords="data management", keywords="data quality", keywords="metrics", keywords="trial execution", keywords="clinical trials", keywords="cost", keywords="accrual", keywords="accrual inequality", keywords="rare diseases", keywords="healthcare", keywords="health care", keywords="health operations", abstract="Background: As the cost of clinical trials continues to rise, novel approaches are required to ensure ethical allocation of resources. Multisite trials have been increasingly utilized in phase 1 trials for rare diseases and in phase 2 and 3 trials to meet accrual needs. The benefits of multisite trials include easier patient recruitment, expanded generalizability, and more robust statistical analyses. However, there are several problems more likely to arise in multisite trials, including accrual inequality, protocol nonadherence, data entry mistakes, and data integration difficulties. Objective: The Biostatistics \& Data Science department at the University of Kansas Medical Center developed a clinical trial management system (comprehensive research information system [CRIS]) specifically designed to streamline multisite clinical trial management. Methods: A National Institute of Child Health and Human Development--funded phase 3 trial, the ADORE (assessment of docosahexaenoic acid [DHA] on reducing early preterm birth) trial fully utilized CRIS to provide automated accrual reports, centralize data capture, automate trial completion reports, and streamline data harmonization. 
Results: Using the ADORE trial as an example, we describe the utility of CRIS in database design, regulatory compliance, training standardization, study management, and automated reporting. Our goal is to continue to build a CRIS through use in subsequent multisite trials. Reports generated to suit the needs of future studies will be available as templates. Conclusions: The implementation of similar tools and systems could provide significant cost-saving and operational benefit to multisite trials. Trial Registration: ClinicalTrials.gov NCT02626299; https://tinyurl.com/j6erphcj ", doi="10.2196/30368", url="https://formative.jmir.org/2021/12/e30368", url="http://www.ncbi.nlm.nih.gov/pubmed/34941552" } @Article{info:doi/10.2196/34286, author="Divi, Nomita and Smolinski, Mark", title="EpiHacks, a Process for Technologists and Health Experts to Cocreate Optimal Solutions for Disease Prevention and Control: User-Centered Design Approach", journal="J Med Internet Res", year="2021", month="Dec", day="15", volume="23", number="12", pages="e34286", keywords="epidemiology", keywords="public health", keywords="diagnostic", keywords="tool", keywords="disease surveillance", keywords="technology solution", keywords="innovative approaches to disease surveillance", keywords="One Health", keywords="surveillance", keywords="hack", keywords="innovation", keywords="expert", keywords="solution", keywords="prevention", keywords="control", abstract="Background: Technology-based innovations that are created collaboratively by local technology specialists and health experts can optimize the addressing of priority needs for disease prevention and control. An EpiHack is a distinct, collaborative approach to developing solutions that combines the science of epidemiology with the format of a hackathon. Since 2013, a total of 12 EpiHacks have collectively brought together over 500 technology and health professionals from 29 countries. 
Objective: We aimed to define the EpiHack process and summarize the impacts of the technology-based innovations that have been created through this approach. Methods: The key components and timeline of an EpiHack were described in detail. The focus areas, outputs, and impacts of the 12 EpiHacks conducted between 2013 and 2021 were summarized. Results: EpiHack solutions have served to improve surveillance for influenza, dengue, and mass gatherings, as well as laboratory sample tracking and One Health surveillance, in rural and urban communities. Several EpiHack tools were scaled during the COVID-19 pandemic to support local governments in conducting active surveillance. All tools were designed to be open source to allow for easy replication and adaptation by other governments or parties. Conclusions: EpiHacks provide an efficient, flexible, and replicable new approach to generating relevant and timely innovations that are locally developed and owned, are scalable, and are sustainable. 
", doi="10.2196/34286", url="https://www.jmir.org/2021/12/e34286", url="http://www.ncbi.nlm.nih.gov/pubmed/34807832" } @Article{info:doi/10.2196/30970, author="Paris, Nicolas and Lamer, Antoine and Parrot, Adrien", title="Transformation and Evaluation of the MIMIC Database in the OMOP Common Data Model: Development and Usability Study", journal="JMIR Med Inform", year="2021", month="Dec", day="14", volume="9", number="12", pages="e30970", keywords="data reuse", keywords="open data", keywords="OMOP", keywords="common data model", keywords="critical care", keywords="machine learning", keywords="big data", keywords="health informatics", keywords="health data", keywords="health database", keywords="electronic health records", keywords="open access database", keywords="digital health", keywords="intensive care", keywords="health care", abstract="Background: In the era of big data, the intensive care unit (ICU) is likely to benefit from real-time computer analysis and modeling based on close patient monitoring and electronic health record data. The Medical Information Mart for Intensive Care (MIMIC) is the first open access database in the ICU domain. Many studies have shown that common data models (CDMs) improve database searching by allowing code, tools, and experience to be shared. The Observational Medical Outcomes Partnership (OMOP) CDM is spreading all over the world. Objective: The objective was to transform MIMIC into an OMOP database and to evaluate the benefits of this transformation for analysts. Methods: We transformed MIMIC (version 1.4.21) into OMOP format (version 5.3.3.1) through semantic and structural mapping. The structural mapping aimed at moving the MIMIC data into the right place in OMOP, with some data transformations. The mapping was divided into 3 phases: conception, implementation, and evaluation. The conceptual mapping aimed at aligning the MIMIC local terminologies to OMOP's standard ones. 
It consisted of 3 phases: integration, alignment, and evaluation. A documented, tested, versioned, exemplified, and open repository was set up to support the transformation and improvement of the MIMIC community's source code. The resulting data set was evaluated over a 48-hour datathon. Results: With an investment of 2 people for 500 hours, 64\% of the data items of the 26 MIMIC tables were standardized into the OMOP CDM and 78\% of the source concepts mapped to reference terminologies. The model proved its ability to support community contributions and was well received during the datathon, with 160 participants and 15,000 requests executed with a maximum duration of 1 minute. Conclusions: The resulting MIMIC-OMOP data set is the first MIMIC-OMOP data set available free of charge with real disidentified data ready for replicable intensive care research. This approach can be generalized to any medical field. ", doi="10.2196/30970", url="https://medinform.jmir.org/2021/12/e30970", url="http://www.ncbi.nlm.nih.gov/pubmed/34904958" } @Article{info:doi/10.2196/29286, author="Bannay, Aur{\'e}lie and Bories, Mathilde and Le Corre, Pascal and Riou, Christine and Lemordant, Pierre and Van Hille, Pascal and Chazard, Emmanuel and Dode, Xavier and Cuggia, Marc and Bouzill{\'e}, Guillaume", title="Leveraging National Claims and Hospital Big Data: Cohort Study on a Statin-Drug Interaction Use Case", journal="JMIR Med Inform", year="2021", month="Dec", day="13", volume="9", number="12", pages="e29286", keywords="drug interactions", keywords="statins", keywords="administrative claims", keywords="health care", keywords="big data", keywords="data linking", keywords="data warehousing", abstract="Background: Linking different sources of medical data is a promising approach to analyze care trajectories. 
The aim of the INSHARE (Integrating and Sharing Health Big Data for Research) project was to provide the blueprint for a technological platform that facilitates integration, sharing, and reuse of data from 2 sources: the clinical data warehouse (CDW) of the Rennes academic hospital, called eHOP (entrep{\^o}t H{\^o}pital), and a data set extracted from the French national claim data warehouse (Syst{\`e}me National des Donn{\'e}es de Sant{\'e} [SNDS]). Objective: This study aims to demonstrate how the INSHARE platform can support big data analytic tasks in the health field using a pharmacovigilance use case based on statin consumption and statin-drug interactions. Methods: A Spark distributed cluster-computing framework was used for the record linkage procedure and all analyses. A semideterministic record linkage method based on the common variables between the chosen data sources was developed to identify all patients discharged after at least one hospital stay at the Rennes academic hospital between 2015 and 2017. The use-case study focused on a cohort of patients treated with statins prescribed by their general practitioner or during their hospital stay. Results: The whole process (record linkage procedure and use-case analyses) required 88 minutes. Of the 161,532 and 164,316 patients from the SNDS and eHOP CDW data sets, respectively, 159,495 patients were successfully linked (98.74\% and 97.07\% of patients from SNDS and eHOP CDW, respectively). Of the 16,806 patients with at least one statin delivery, 8293 patients started the consumption before and continued during the hospital stay, 6382 patients stopped statin consumption at hospital admission, and 2131 patients initiated statins in hospital. Statin-drug interactions occurred more frequently during hospitalization than in the community (3800/10,424, 36.45\% and 3253/14,675, 22.17\%, respectively; P<.001). Only 121 patients had the most severe level of statin-drug interaction. 
Hospital stay burden (length of stay and in-hospital mortality) was more severe in patients with statin-drug interactions during hospitalization. Conclusions: This study demonstrates the added value of combining and reusing clinical and claim data to provide large-scale measures of drug-drug interaction prevalence and care pathways outside hospitals. It builds a path to move the current health care system toward a Learning Health System using knowledge generated from research on real-world health data. ", doi="10.2196/29286", url="https://medinform.jmir.org/2021/12/e29286", url="http://www.ncbi.nlm.nih.gov/pubmed/34898457" } @Article{info:doi/10.2196/32698, author="Pan, Youcheng and Wang, Chenghao and Hu, Baotian and Xiang, Yang and Wang, Xiaolong and Chen, Qingcai and Chen, Junjie and Du, Jingcheng", title="A BERT-Based Generation Model to Transform Medical Texts to SQL Queries for Electronic Medical Records: Model Development and Validation", journal="JMIR Med Inform", year="2021", month="Dec", day="8", volume="9", number="12", pages="e32698", keywords="electronic medical record", keywords="text-to-SQL generation", keywords="BERT", keywords="grammar-based decoding", keywords="tree-structured intermediate representation", abstract="Background: Electronic medical records (EMRs) are usually stored in relational databases that require SQL queries to retrieve information of interest. Effectively completing such queries can be a challenging task for medical experts due to the barriers in expertise. Existing text-to-SQL generation studies have not been fully embraced in the medical domain. Objective: The objective of this study was to propose a neural generation model that can jointly consider the characteristics of medical text and the SQL structure to automatically transform medical texts to SQL queries for EMRs. 
Methods: We proposed a medical text--to-SQL model (MedTS), which employed a pretrained Bidirectional Encoder Representations From Transformers model as the encoder and leveraged a grammar-based long short-term memory network as the decoder to predict the intermediate representation that can easily be transformed into the final SQL query. We adopted the syntax tree as the intermediate representation rather than directly regarding the SQL query as an ordinary word sequence, which is more in line with the tree-structured nature of SQL and can also effectively reduce the search space during generation. Experiments were conducted on the MIMICSQL dataset, and 5 competitor methods were compared. Results: Experimental results demonstrated that MedTS achieved accuracies of 0.784 and 0.899 on the test set in terms of logic form and execution, respectively, which significantly outperformed the existing state-of-the-art methods. Further analyses proved that the performance on each component of the generated SQL was relatively balanced and offered substantial improvements. Conclusions: The proposed MedTS was effective and robust for improving the performance of medical text--to-SQL generation, indicating strong potential to be applied in real medical scenarios. ", doi="10.2196/32698", url="https://medinform.jmir.org/2021/12/e32698", url="http://www.ncbi.nlm.nih.gov/pubmed/34889749" } @Article{info:doi/10.2196/25022, author="Singh, Janmajay and Sato, Masahiro and Ohkuma, Tomoko", title="On Missingness Features in Machine Learning Models for Critical Care: Observational Study", journal="JMIR Med Inform", year="2021", month="Dec", day="8", volume="9", number="12", pages="e25022", keywords="electronic health records", keywords="informative missingness", keywords="machine learning", keywords="missing data", keywords="hospital mortality", keywords="sepsis", abstract="Background: Missing data in electronic health records is inevitable and considered to be nonrandom. 
Several studies have found that features indicating missing patterns (missingness) encode useful information about a patient's health and advocate for their inclusion in clinical prediction models. However, their effectiveness has not been comprehensively evaluated. Objective: The goal of the research is to study the effect of including informative missingness features in machine learning models for various clinically relevant outcomes and explore the robustness of these features across patient subgroups and task settings. Methods: A total of 48,336 electronic health records from the 2012 and 2019 PhysioNet Challenges were used, and mortality, length of stay, and sepsis outcomes were chosen. The latter dataset was multicenter, allowing external validation. Gated recurrent units were used to learn sequential patterns in the data and classify or predict labels of interest. Models were evaluated on various criteria and across population subgroups, assessing discriminative ability and calibration. Results: Generally improved model performance in retrospective tasks was observed when including missingness features. The extent of improvement depended on the outcome of interest (area under the curve of the receiver operating characteristic [AUROC] improved from 1.2\% to 7.7\%) and even patient subgroup. However, missingness features did not display utility in a simulated prospective setting, being outperformed (0.9\% difference in AUROC) by the model relying only on pathological features. This was despite leading to earlier detection of disease (true positives), since including these features led to a concomitant rise in false positive detections. Conclusions: This study comprehensively evaluated the effectiveness of missingness features on machine learning models. 
A detailed understanding of how these features affect model performance may lead to their informed use in clinical settings, especially for administrative tasks like length of stay prediction, where they present the greatest benefit. While missingness features, representative of health care processes, vary greatly due to intra- and interhospital factors, they may still be used in prediction models for clinically relevant outcomes. However, their use in prospective models producing frequent predictions needs to be explored further. ", doi="10.2196/25022", url="https://medinform.jmir.org/2021/12/e25022", url="http://www.ncbi.nlm.nih.gov/pubmed/34889756" } @Article{info:doi/10.2196/20767, author="Liu, Dianbo and Zheng, Ming and Sepulveda, Andres Nestor", title="Using Artificial Neural Network Condensation to Facilitate Adaptation of Machine Learning in Medical Settings by Reducing Computational Burden: Model Design and Evaluation Study", journal="JMIR Form Res", year="2021", month="Dec", day="8", volume="5", number="12", pages="e20767", keywords="artificial neural network", keywords="electronic medical records", keywords="parameter pruning", keywords="machine learning", keywords="computational burden", abstract="Background: Machine learning applications in the health care domain can have a great impact on people's lives. At the same time, medical data is usually big, requiring a significant number of computational resources. Although this might not be a problem for the wide adoption of machine learning tools in high-income countries, the availability of computational resources can be limited in low-income countries and on mobile devices. This can limit many people from benefiting from the advancement in machine learning applications in the field of health care. 
Objective: In this study, we explore 3 methods to increase the computational efficiency and reduce the model sizes of either recurrent neural networks (RNNs) or feedforward deep neural networks (DNNs) without compromising their accuracy. Methods: We used inpatient mortality prediction as our case analysis upon review of an intensive care unit dataset. We reduced the size of RNN and DNN by applying pruning of ``unused'' neurons. Additionally, we modified the RNN structure by adding a hidden layer to the RNN cell but reducing the total number of recurrent layers to reduce the total number of parameters used in the network. Finally, we implemented quantization on DNN by forcing the weights to be 8 bits instead of 32 bits. Results: We found that all methods increased implementation efficiency, including training speed, memory size, and inference speed, without reducing the accuracy of mortality prediction. Conclusions: Our findings suggest that neural network condensation allows for the implementation of sophisticated neural network algorithms on devices with lower computational resources. ", doi="10.2196/20767", url="https://formative.jmir.org/2021/12/e20767", url="http://www.ncbi.nlm.nih.gov/pubmed/34889747" } @Article{info:doi/10.2196/30308, author="St{\"o}hr, R. Mark and G{\"u}nther, Andreas and Majeed, W. 
Raphael", title="The Collaborative Metadata Repository (CoMetaR) Web App: Quantitative and Qualitative Usability Evaluation", journal="JMIR Med Inform", year="2021", month="Nov", day="29", volume="9", number="11", pages="e30308", keywords="usability", keywords="metadata", keywords="data visualization", keywords="semantic web", keywords="data management", keywords="data warehousing", keywords="communication barriers", keywords="quality improvement", keywords="biological ontologies", keywords="data curation", abstract="Background: In the field of medicine and medical informatics, the importance of comprehensive metadata has long been recognized, and the composition of metadata has become its own field of profession and research. To ensure sustainable and meaningful metadata are maintained, standards and guidelines such as the FAIR (Findability, Accessibility, Interoperability, Reusability) principles have been published. The compilation and maintenance of metadata is performed by field experts supported by metadata management apps. The usability of these apps, for example, in terms of ease of use, efficiency, and error tolerance, crucially determines their benefit to those interested in the data. Objective: This study aims to provide a metadata management app with high usability that assists scientists in compiling and using rich metadata. We aim to evaluate our recently developed interactive web app for our collaborative metadata repository (CoMetaR). This study reflects how real users perceive the app by assessing usability scores and explicit usability issues. Methods: We evaluated the CoMetaR web app by measuring the usability of 3 modules: core module, provenance module, and data integration module. We defined 10 tasks in which users must acquire information specific to their user role. The participants were asked to complete the tasks in a live web meeting. We used the System Usability Scale questionnaire to measure the usability of the app. 
For qualitative analysis, we applied a modified think-aloud method, followed by thematic analysis and categorization into the ISO 9241-110 usability categories. Results: A total of 12 individuals participated in the study. We found that over 97\% (85/88) of all the tasks were completed successfully. We measured usability scores of 81, 81, and 72 for the 3 evaluated modules. The qualitative analysis resulted in 24 issues with the app. Conclusions: A usability score of 81 implies very good usability for the 2 modules, whereas a usability score of 72 still indicates acceptable usability for the third module. We identified 24 issues that serve as starting points for further development. Our method proved to be effective and efficient in terms of effort and outcome. It can be adapted to evaluate apps within the medical informatics field and potentially beyond. ", doi="10.2196/30308", url="https://medinform.jmir.org/2021/11/e30308", url="http://www.ncbi.nlm.nih.gov/pubmed/34847059" } @Article{info:doi/10.2196/33124, author="Holub, Karl and Hardy, Nicole and Kallmes, Kevin", title="Toward Automated Data Extraction According to Tabular Data Structure: Cross-sectional Pilot Survey of the Comparative Clinical Literature", journal="JMIR Form Res", year="2021", month="Nov", day="24", volume="5", number="11", pages="e33124", keywords="table structure", keywords="systematic review", keywords="automated data extraction", keywords="data reporting conventions", keywords="clinical comparative data", keywords="data elements", keywords="statistic formats", abstract="Background: Systematic reviews depend on time-consuming extraction of data from the PDFs of underlying studies. To date, automation efforts have focused on extracting data from the text, and no approach has yet succeeded in fully automating ingestion of quantitative evidence. 
However, the majority of relevant data is generally presented in tables, and the tabular structure is more amenable to automated extraction than free text. Objective: The purpose of this study was to classify the structure and format of descriptive statistics reported in tables in the comparative medical literature. Methods: We sampled 100 published randomized controlled trials from 2019 based on a search in PubMed; these results were imported to the AutoLit platform. Studies were excluded if they were nonclinical, noncomparative, not in English, protocols, or not available in full text. In AutoLit, tables reporting baseline or outcome data in all studies were characterized based on reporting practices. Measurement context, meaning the structure in which the interventions of interest, patient arm breakdown, measurement time points, and data element descriptions were presented, was classified based on the number of contextual pieces and metadata reported. The statistic formats for reported metrics (specific instances of reporting of data elements) were then classified by location and broken down into reporting strategies for continuous, dichotomous, and categorical metrics. Results: We included 78 of 100 sampled studies, one of which (1.3\%) did not report data elements in tables. The remaining 77 studies reported baseline and outcome data in 174 tables, and 96\% (69/72) of these tables broke down reporting by patient arms. Fifteen structures were found for the reporting of measurement context, which were broadly grouped into: 1{\texttimes}1 contexts, where two pieces of context are reported in total (eg, arms in columns, data elements in rows); 2{\texttimes}1 contexts, where two pieces of context are given on row headers (eg, time points in columns, arms nested in data elements on rows); and 1{\texttimes}2 contexts, where two pieces of context are given on column headers. 
The 1{\texttimes}1 contexts were present in 57\% of tables (99/174), compared to 20\% (34/174) for 2{\texttimes}1 contexts and 15\% (26/174) for 1{\texttimes}2 contexts; the remaining 8\% (15/174) used unique/other stratification methods. Statistic formats were reported in the headers or descriptions of 84\% (65/77) of studies. Conclusions: In this cross-sectional pilot review, we found a high density of information in tables, but with major heterogeneity in presentation of measurement context. The highest-density studies reported both baseline and outcome measures in tables, with arm-level breakout, intervention labels, and arm sizes present, and reported both the statistic formats and units. The measurement context formats presented here, broadly classified into three classes that cover 92\% (71/78) of studies, form a basis for understanding the frequency of different reporting styles, supporting automated detection of the data format for extraction of metrics. ", doi="10.2196/33124", url="https://formative.jmir.org/2021/11/e33124", url="http://www.ncbi.nlm.nih.gov/pubmed/34821562" } @Article{info:doi/10.2196/31750, author="Gierend, Kerstin and Kr{\"u}ger, Frank and Waltemath, Dagmar and F{\"u}nfgeld, Maximilian and Ganslandt, Thomas and Zeleke, Alamirrew Atinkut", title="Approaches and Criteria for Provenance in Biomedical Data Sets and Workflows: Protocol for a Scoping Review", journal="JMIR Res Protoc", year="2021", month="Nov", day="22", volume="10", number="11", pages="e31750", keywords="provenance", keywords="biomedical", keywords="workflow", keywords="data sharing", keywords="lineage", keywords="scoping review", keywords="data genesis", keywords="scientific data", keywords="digital objects", keywords="healthcare data", abstract="Background: Provenance supports the understanding of data genesis, and it is a key factor to ensure the trustworthiness of digital objects containing (sensitive) scientific data. 
Provenance information contributes to a better understanding of scientific results and fosters collaboration on existing data as well as data sharing. This encompasses defining comprehensive concepts and standards for transparency and traceability, reproducibility, validity, and quality assurance during clinical and scientific data workflows and research. Objective: The aim of this scoping review is to investigate existing evidence regarding approaches and criteria for provenance tracking as well as disclosing current knowledge gaps in the biomedical domain. This review covers modeling aspects as well as metadata frameworks for meaningful and usable provenance information during creation, collection, and processing of (sensitive) scientific biomedical data. This review also covers the examination of quality aspects of provenance criteria. Methods: This scoping review will follow the methodological framework by Arksey and O'Malley. Relevant publications will be obtained by querying PubMed and Web of Science. All papers in English language will be included, published between January 1, 2006 and March 23, 2021. Data retrieval will be accompanied by manual search for grey literature. Potential publications will then be exported into a reference management software, and duplicates will be removed. Afterwards, the obtained set of papers will be transferred into a systematic review management tool. All publications will be screened, extracted, and analyzed: title and abstract screening will be carried out by 4 independent reviewers. Majority vote is required for consent to eligibility of papers based on the defined inclusion and exclusion criteria. Full-text reading will be performed independently by 2 reviewers and in the last step, key information will be extracted on a pretested template. If agreement cannot be reached, the conflict will be resolved by a domain expert. 
Charted data will be analyzed by categorizing and summarizing the individual data items based on the research questions. Tabular or graphical overviews will be given, if applicable. Results: The reporting follows the extension of the Preferred Reporting Items for Systematic reviews and Meta-Analyses statements for Scoping Reviews. Electronic database searches in PubMed and Web of Science resulted in 469 matches after deduplication. As of September 2021, the scoping review is in the full-text screening stage. The data extraction using the pretested charting template will follow the full-text screening stage. We expect the scoping review report to be completed by February 2022. Conclusions: Information about the origin of healthcare data has a major impact on the quality and the reusability of scientific results as well as follow-up activities. This protocol outlines plans for a scoping review that will provide information about current approaches, challenges, or knowledge gaps with provenance tracking in biomedical sciences. International Registered Report Identifier (IRRID): DERR1-10.2196/31750 ", doi="10.2196/31750", url="https://www.researchprotocols.org/2021/11/e31750", url="http://www.ncbi.nlm.nih.gov/pubmed/34813494" } @Article{info:doi/10.2196/29176, author="Greulich, Leonard and Hegselmann, Stefan and Dugas, Martin", title="An Open-Source, Standard-Compliant, and Mobile Electronic Data Capture System for Medical Research (OpenEDC): Design and Evaluation Study", journal="JMIR Med Inform", year="2021", month="Nov", day="19", volume="9", number="11", pages="e29176", keywords="electronic data capture", keywords="open science", keywords="data interoperability", keywords="metadata reuse", keywords="mobile health", keywords="data standard", keywords="mobile phone", abstract="Background: Medical research and machine learning for health care depend on high-quality data. 
Electronic data capture (EDC) systems have been widely adopted for metadata-driven digital data collection. However, many systems use proprietary and incompatible formats that inhibit clinical data exchange and metadata reuse. In addition, the configuration and financial requirements of typical EDC systems frequently prevent small-scale studies from benefiting from their inherent advantages. Objective: The aim of this study is to develop and publish an open-source EDC system that addresses these issues. We aim to plan a system that is applicable to a wide range of research projects. Methods: We conducted a literature-based requirements analysis to identify the academic and regulatory demands for digital data collection. After designing and implementing OpenEDC, we performed a usability evaluation to obtain feedback from users. Results: We identified 20 frequently stated requirements for EDC. According to the International Organization for Standardization/International Electrotechnical Commission (ISO/IEC) 25010 norm, we categorized the requirements into functional suitability, availability, compatibility, usability, and security. We developed OpenEDC based on the regulatory-compliant Clinical Data Interchange Standards Consortium Operational Data Model (CDISC ODM) standard. Mobile device support enables the collection of patient-reported outcomes. OpenEDC is publicly available and released under the MIT open-source license. Conclusions: Adopting an established standard without modifications supports metadata reuse and clinical data exchange, but it limits item layouts. OpenEDC is a stand-alone web app that can be used without a setup or configuration. This should foster compatibility between medical research and open science. OpenEDC is targeted at observational and translational research studies by clinicians. 
", doi="10.2196/29176", url="https://medinform.jmir.org/2021/11/e29176", url="http://www.ncbi.nlm.nih.gov/pubmed/34806987" } @Article{info:doi/10.2196/34493, author="Clay, Ieuan and Angelopoulos, Christian and Bailey, Lord Anne and Blocker, Aaron and Carini, Simona and Carvajal, Rodrigo and Drummond, David and McManus, F. Kimberly and Oakley-Girvan, Ingrid and Patel, B. Krupal and Szepietowski, Phillip and Goldsack, C. Jennifer", title="Sensor Data Integration: A New Cross-Industry Collaboration to Articulate Value, Define Needs, and Advance a Framework for Best Practices", journal="J Med Internet Res", year="2021", month="Nov", day="9", volume="23", number="11", pages="e34493", keywords="digital measures", keywords="data integration", keywords="patient centricity", keywords="utility", doi="10.2196/34493", url="https://www.jmir.org/2021/11/e34493", url="http://www.ncbi.nlm.nih.gov/pubmed/34751656" } @Article{info:doi/10.2196/26914, author="Sung, MinDong and Cha, Dongchul and Park, Rang Yu", title="Local Differential Privacy in the Medical Domain to Protect Sensitive Information: Algorithm Development and Real-World Validation", journal="JMIR Med Inform", year="2021", month="Nov", day="8", volume="9", number="11", pages="e26914", keywords="privacy-preserving", keywords="differential privacy", keywords="medical informatics", keywords="medical data", keywords="privacy", keywords="electronic health record", keywords="algorithm", keywords="development", keywords="validation", keywords="big data", keywords="feasibility", keywords="machine learning", keywords="synthetic data", abstract="Background: Privacy is of increasing interest in the present big data era, particularly the privacy of medical data. Specifically, differential privacy has emerged as the standard method for preservation of privacy during data analysis and publishing. 
Objective: Using machine learning techniques, we applied differential privacy to medical data with diverse parameters and checked the feasibility of our algorithms with synthetic data as well as the balance between data privacy and utility. Methods: All data were normalized to a range between --1 and 1, and the bounded Laplacian method was applied to prevent the generation of out-of-bound values after applying the differential privacy algorithm. To preserve the cardinality of the categorical variables, we performed postprocessing via discretization. The algorithm was evaluated using both synthetic and real-world data (from the eICU Collaborative Research Database). We evaluated the difference between the original data and the perturbed data using misclassification rates and the mean squared error for categorical data and continuous data, respectively. Further, we compared the performance of classification models that predict in-hospital mortality using real-world data. Results: The misclassification rate of categorical variables ranged between 0.49 and 0.85 when the value of $\epsilon$ was 0.1, and it converged to 0 as $\epsilon$ increased. When $\epsilon$ was between $10^{2}$ and $10^{3}$, the misclassification rate rapidly dropped to 0. Similarly, the mean squared error of the continuous variables decreased as $\epsilon$ increased. The performance of the model developed from perturbed data converged to that of the model developed from original data as $\epsilon$ increased. In particular, the accuracy of a random forest model developed from the original data was 0.801, and this value ranged from 0.757 to 0.81 when $\epsilon$ was $10^{-1}$ and $10^{4}$, respectively. Conclusions: We applied local differential privacy to medical domain data, which are diverse and high dimensional. Higher noise may offer enhanced privacy, but it simultaneously hinders utility. 
We should choose an appropriate degree of noise for data perturbation to balance privacy and utility depending on specific situations. ", doi="10.2196/26914", url="https://medinform.jmir.org/2021/11/e26914", url="http://www.ncbi.nlm.nih.gov/pubmed/34747711" } @Article{info:doi/10.2196/29871, author="Zuo, Zheming and Watson, Matthew and Budgen, David and Hall, Robert and Kennelly, Chris and Al Moubayed, Noura", title="Data Anonymization for Pervasive Health Care: Systematic Literature Mapping Study", journal="JMIR Med Inform", year="2021", month="Oct", day="15", volume="9", number="10", pages="e29871", keywords="healthcare", keywords="privacy-preserving", keywords="GDPR", keywords="DPA 2018", keywords="EHR", keywords="SLM", keywords="data science", keywords="anonymization", keywords="reidentification risk", keywords="usability", abstract="Background: Data science offers an unparalleled opportunity to identify new insights into many aspects of human life with recent advances in health care. Using data science in digital health raises significant challenges regarding data privacy, transparency, and trustworthiness. Recent regulations enforce the need for a clear legal basis for collecting, processing, and sharing data, for example, the European Union's General Data Protection Regulation (2016) and the United Kingdom's Data Protection Act (2018). For health care providers, legal use of the electronic health record (EHR) is permitted only in clinical care cases. Any other use of the data requires thoughtful considerations of the legal context and direct patient consent. Identifiable personal and sensitive information must be sufficiently anonymized. Raw data are commonly anonymized to be used for research purposes, with risk assessment for reidentification and utility. 
Although health care organizations have internal policies defined for information governance, there is a significant lack of practical tools and intuitive guidance about the use of data for research and modeling. Off-the-shelf data anonymization tools are developed frequently, but privacy-related functionalities are often incomparable with regard to use in different problem domains. In addition, tools to support measuring the risk of the anonymized data with regard to reidentification against the usefulness of the data exist, but there are question marks over their efficacy. Objective: In this systematic literature mapping study, we aim to alleviate the aforementioned issues by reviewing the landscape of data anonymization for digital health care. Methods: We used Google Scholar, Web of Science, Elsevier Scopus, and PubMed to retrieve academic studies published in English up to June 2020. Noteworthy gray literature was also used to initialize the search. We focused on review questions covering 5 bottom-up aspects: basic anonymization operations, privacy models, reidentification risk and usability metrics, off-the-shelf anonymization tools, and the lawful basis for EHR data anonymization. Results: We identified 239 eligible studies, of which 60 were chosen for general background information; 16 were selected for 7 basic anonymization operations; 104 covered 72 conventional and machine learning--based privacy models; four and 19 papers included seven and 15 metrics, respectively, for measuring the reidentification risk and degree of usability; and 36 explored 20 data anonymization software tools. In addition, we also evaluated the practical feasibility of performing anonymization on EHR data with reference to their usability in medical decision-making. Furthermore, we summarized the lawful basis for delivering guidance on practical EHR data anonymization. 
Conclusions: This systematic literature mapping study indicates that anonymization of EHR data is theoretically achievable; yet, it requires more research efforts in practical implementations to balance privacy preservation and usability to ensure more reliable health care applications. ", doi="10.2196/29871", url="https://medinform.jmir.org/2021/10/e29871", url="http://www.ncbi.nlm.nih.gov/pubmed/34652278" } @Article{info:doi/10.2196/30697, author="Foraker, Randi and Guo, Aixia and Thomas, Jason and Zamstein, Noa and Payne, RO Philip and Wilcox, Adam and ", title="The National COVID Cohort Collaborative: Analyses of Original and Computationally Derived Electronic Health Record Data", journal="J Med Internet Res", year="2021", month="Oct", day="4", volume="23", number="10", pages="e30697", keywords="synthetic data", keywords="protected health information", keywords="COVID-19", keywords="electronic health records and systems", keywords="data analysis", abstract="Background: Computationally derived (``synthetic'') data can enable the creation and analysis of clinical, laboratory, and diagnostic data as if they were the original electronic health record data. Synthetic data can support data sharing to answer critical research questions to address the COVID-19 pandemic. Objective: We aim to compare the results from analyses of synthetic data to those from original data and assess the strengths and limitations of leveraging computationally derived data for research purposes. Methods: We used the National COVID Cohort Collaborative's instance of MDClone, a big data platform with data-synthesizing capabilities (MDClone Ltd). 
We downloaded electronic health record data from 34 National COVID Cohort Collaborative institutional partners and tested three use cases, including (1) exploring the distributions of key features of the COVID-19--positive cohort; (2) training and testing predictive models for assessing the risk of admission among these patients; and (3) determining geospatial and temporal COVID-19--related measures and outcomes, and constructing their epidemic curves. We compared the results from synthetic data to those from original data using traditional statistics, machine learning approaches, and temporal and spatial representations of the data. Results: For each use case, the results of the synthetic data analyses successfully mimicked those of the original data such that the distributions of the data were similar and the predictive models demonstrated comparable performance. Although the synthetic and original data yielded overall nearly the same results, there were exceptions that included an odds ratio on either side of the null in multivariable analyses (0.97 vs 1.01) and differences in the magnitude of epidemic curves constructed for zip codes with low population counts. Conclusions: This paper presents the results of each use case and outlines key considerations for the use of synthetic data, examining their role in collaborative research for faster insights. 
", doi="10.2196/30697", url="https://www.jmir.org/2021/10/e30697", url="http://www.ncbi.nlm.nih.gov/pubmed/34559671" } @Article{info:doi/10.2196/15739, author="Daniels, Helen and Jones, Helen Kerina and Heys, Sharon and Ford, Vincent David", title="Exploring the Use of Genomic and Routinely Collected Data: Narrative Literature Review and Interview Study", journal="J Med Internet Res", year="2021", month="Sep", day="24", volume="23", number="9", pages="e15739", keywords="genomic data", keywords="routine data", keywords="electronic health records", keywords="health data science", keywords="genome", keywords="data regulation", keywords="case study", keywords="eHealth", abstract="Background: Advancing the use of genomic data with routinely collected health data holds great promise for health care and research. Increasing the use of these data is a high priority to understand and address the causes of disease. Objective: This study aims to provide an outline of the use of genomic data alongside routinely collected data in health research to date. As this field prepares to move forward, it is important to take stock of the current state of play in order to highlight new avenues for development, identify challenges, and ensure that adequate data governance models are in place for safe and socially acceptable progress. Methods: We conducted a literature review to draw information from past studies that have used genomic and routinely collected data and conducted interviews with individuals who use these data for health research. We collected data on the following: the rationale of using genomic data in conjunction with routinely collected data, types of genomic and routinely collected data used, data sources, project approvals, governance and access models, and challenges encountered. Results: The main purpose of using genomic and routinely collected data was to conduct genome-wide and phenome-wide association studies. 
Routine data sources included electronic health records, disease and death registries, health insurance systems, and deprivation indices. The types of genomic data included polygenic risk scores, single nucleotide polymorphisms, and measures of genetic activity, and biobanks generally provided these data. Although the literature search showed that biobanks released data to researchers, the case studies revealed a growing tendency for use within a data safe haven. Challenges of working with these data revolved around data collection, data storage, technical, and data privacy issues. Conclusions: Using genomic and routinely collected data holds great promise for progressing health research. Several challenges are involved, particularly in terms of privacy. Overcoming these barriers will ensure that the use of these data to progress health research can be exploited to its full potential. ", doi="10.2196/15739", url="https://www.jmir.org/2021/9/e15739", url="http://www.ncbi.nlm.nih.gov/pubmed/34559060" } @Article{info:doi/10.2196/28229, author="Stojanov, Riste and Popovski, Gorjan and Cenikj, Gjorgjina and Korou{\vs}i{\'c} Seljak, Barbara and Eftimov, Tome", title="A Fine-Tuned Bidirectional Encoder Representations From Transformers Model for Food Named-Entity Recognition: Algorithm Development and Validation", journal="J Med Internet Res", year="2021", month="Aug", day="9", volume="23", number="8", pages="e28229", keywords="food information extraction", keywords="named-entity recognition", keywords="fine-tuning BERT", keywords="semantic annotation", keywords="information extraction", keywords="BERT", keywords="bidirectional encoder representations from transformers", keywords="natural language processing", keywords="machine learning", abstract="Background: Recently, food science has been garnering a lot of attention. 
There are many open research questions on food interactions, as one of the main environmental factors, with other health-related entities such as diseases, treatments, and drugs. In the last 2 decades, a large amount of work has been done in natural language processing and machine learning to enable biomedical information extraction. However, machine learning in food science domains remains inadequately resourced, which brings to attention the problem of developing methods for food information extraction. There are only few food semantic resources and few rule-based methods for food information extraction, which often depend on some external resources. However, an annotated corpus with food entities along with their normalization was published in 2019 by using several food semantic resources. Objective: In this study, we investigated how the recently published bidirectional encoder representations from transformers (BERT) model, which provides state-of-the-art results in information extraction, can be fine-tuned for food information extraction. Methods: We introduce FoodNER, which is a collection of corpus-based food named-entity recognition methods. It consists of 15 different models obtained by fine-tuning 3 pretrained BERT models on 5 groups of semantic resources: food versus nonfood entity, 2 subsets of Hansard food semantic tags, FoodOn semantic tags, and Systematized Nomenclature of Medicine Clinical Terms food semantic tags. Results: All BERT models provided very promising results with 93.30\% to 94.31\% macro F1 scores in the task of distinguishing food versus nonfood entity, which represents the new state-of-the-art technology in food information extraction. Considering the tasks where semantic tags are predicted, all BERT models obtained very promising results once again, with their macro F1 scores ranging from 73.39\% to 78.96\%. 
Conclusions: FoodNER can be used to extract and annotate food entities in 5 different tasks: food versus nonfood entities and distinguishing food entities on the level of food groups by using the closest Hansard semantic tags, the parent Hansard semantic tags, the FoodOn semantic tags, or the Systematized Nomenclature of Medicine Clinical Terms semantic tags. ", doi="10.2196/28229", url="https://www.jmir.org/2021/8/e28229", url="http://www.ncbi.nlm.nih.gov/pubmed/34383671" } @Article{info:doi/10.2196/19824, author="Wongkoblap, Akkapon and Vadillo, A. Miguel and Curcin, Vasa", title="Deep Learning With Anaphora Resolution for the Detection of Tweeters With Depression: Algorithm Development and Validation Study", journal="JMIR Ment Health", year="2021", month="Aug", day="6", volume="8", number="8", pages="e19824", keywords="depression", keywords="mental health", keywords="Twitter", keywords="social media", keywords="deep learning", keywords="anaphora resolution", keywords="multiple-instance learning", keywords="depression markers", abstract="Background: Mental health problems are widely recognized as a major public health challenge worldwide. This concern highlights the need to develop effective tools for detecting mental health disorders in the population. Social networks are a promising source of data wherein patients publish rich personal information that can be mined to extract valuable psychological cues; however, these data come with their own set of challenges, such as the need to disambiguate between statements about oneself and third parties. Traditionally, natural language processing techniques for social media have looked at text classifiers and user classification models separately, hence presenting a challenge for researchers who want to combine text sentiment and user sentiment analysis. 
Objective: The objective of this study is to develop a predictive model that can detect users with depression from Twitter posts and instantly identify textual content associated with mental health topics. The model can also address the problem of anaphoric resolution and highlight anaphoric interpretations. Methods: We retrieved the data set from Twitter by using a regular expression or stream of real-time tweets comprising 3682 users, of which 1983 self-declared their depression and 1699 declared no depression. Two multiple instance learning models were developed---one with and one without an anaphoric resolution encoder---to identify users with depression and highlight posts related to the mental health of the author. Several previously published models were applied to our data set, and their performance was compared with that of our models. Results: The maximum accuracy, F1 score, and area under the curve of our anaphoric resolution model were 92\%, 92\%, and 90\%, respectively. The model outperformed alternative predictive models, which ranged from classical machine learning models to deep learning models. Conclusions: Our model with anaphoric resolution shows promising results when compared with other predictive models and provides valuable insights into textual content that is relevant to the mental health of the tweeter. ", doi="10.2196/19824", url="https://mental.jmir.org/2021/8/e19824", url="http://www.ncbi.nlm.nih.gov/pubmed/34383688" } @Article{info:doi/10.2196/26823, author="Barata, Carolina and Rodrigues, Maria Ana and Canh{\~a}o, Helena and Vinga, Susana and Carvalho, M. 
Alexandra", title="Predicting Biologic Therapy Outcome of Patients With Spondyloarthritis: Joint Models for Longitudinal and Survival Analysis", journal="JMIR Med Inform", year="2021", month="Jul", day="30", volume="9", number="7", pages="e26823", keywords="data mining", keywords="survival analysis", keywords="joint models", keywords="spondyloarthritis", keywords="drug survival", keywords="rheumatic disease", keywords="electronic medical records", keywords="medical records", abstract="Background: Rheumatic diseases are one of the most common chronic diseases worldwide. Among them, spondyloarthritis (SpA) is a group of highly debilitating diseases, with an early onset age, which significantly impacts patients' quality of life, health care systems, and society in general. Recent treatment options consist of using biologic therapies, and establishing the most beneficial option according to the patients' characteristics is a challenge that needs to be overcome. Meanwhile, the emerging availability of electronic medical records has made necessary the development of methods that can extract insightful information while handling all the challenges of dealing with complex, real-world data. Objective: The aim of this study was to achieve a better understanding of SpA patients' therapy responses and identify the predictors that affect them, thereby enabling the prognosis of therapy success or failure. Methods: A data mining approach based on joint models for the survival analysis of the biologic therapy failure is proposed, which considers the information of both baseline and time-varying variables extracted from the electronic medical records of SpA patients from the database, Reuma.pt. 
Results: Our results show that being a male, starting biologic therapy at an older age, having a larger time interval between disease start and initiation of the first biologic drug, and being human leukocyte antigen (HLA)--B27 positive are indicators of a good prognosis for the biological drug survival; meanwhile, having disease onset or biologic therapy initiation occur in more recent years, a larger number of education years, and higher values of C-reactive protein or Bath Ankylosing Spondylitis Functional Index (BASFI) at baseline are all predictors of a greater risk of failure of the first biologic therapy. Conclusions: Among this Portuguese subpopulation of SpA patients, those who were male, HLA-B27 positive, and with a later biologic therapy starting date or a larger time interval between disease start and initiation of the first biologic therapy showed longer therapy adherence. Joint models proved to be a valuable tool for the analysis of electronic medical records in the field of rheumatic diseases and may allow for the identification of potential predictors of biologic therapy failure. ", doi="10.2196/26823", url="https://medinform.jmir.org/2021/7/e26823", url="http://www.ncbi.nlm.nih.gov/pubmed/34328435" } @Article{info:doi/10.2196/25482, author="Feusner, D. 
Jamie and Mohideen, Reza and Smith, Stephen and Patanam, Ilyas and Vaitla, Anil and Lam, Christopher and Massi, Michelle and Leow, Alex", title="Semantic Linkages of Obsessions From an International Obsessive-Compulsive Disorder Mobile App Data Set: Big Data Analytics Study", journal="J Med Internet Res", year="2021", month="Jun", day="21", volume="23", number="6", pages="e25482", keywords="OCD", keywords="natural language processing", keywords="clinical subtypes", keywords="semantic", keywords="word embedding", keywords="clustering", abstract="Background: Obsessive-compulsive disorder (OCD) is characterized by recurrent intrusive thoughts, urges, or images (obsessions) and repetitive physical or mental behaviors (compulsions). Previous factor analytic and clustering studies suggest the presence of three or four subtypes of OCD symptoms. However, these studies have relied on predefined symptom checklists, which are limited in breadth and may be biased toward researchers' previous conceptualizations of OCD. Objective: In this study, we examine a large data set of freely reported obsession symptoms obtained from an OCD mobile app as an alternative to uncovering potential OCD subtypes. From this, we examine data-driven clusters of obsessions based on their latent semantic relationships in the English language using word embeddings. Methods: We extracted free-text entry words describing obsessions in a large sample of users of a mobile app, NOCD. Semantic vector space modeling was applied using the Global Vectors for Word Representation algorithm. A domain-specific extension, Mittens, was also applied to enhance the corpus with OCD-specific words. The resulting representations provided linear substructures of the word vector in a 100-dimensional space. We applied principal component analysis to the 100-dimensional vector representation of the most frequent words, followed by k-means clustering to obtain clusters of related words. 
Results: We obtained 7001 unique words representing obsessions from 25,369 individuals. Heuristics for determining the optimal number of clusters pointed to a three-cluster solution for grouping subtypes of OCD. The first had themes relating to relationship and just-right; the second had themes relating to doubt and checking; and the third had themes relating to contamination, somatic, physical harm, and sexual harm. All three clusters showed close semantic relationships with each other in the central area of convergence, with themes relating to harm. An equal-sized split-sample analysis across individuals and a split-sample analysis over time both showed overall stable cluster solutions. Words in the third cluster were the most frequently occurring words, followed by words in the first cluster. Conclusions: The clustering of naturally acquired obsessional words resulted in three major groupings of semantic themes, which partially overlapped with predefined checklists from previous studies. Furthermore, the closeness of the overall embedded relationships across clusters and their central convergence on harm suggests that, at least at the level of self-reported obsessional thoughts, most obsessions have close semantic relationships. Harm to self or others may be an underlying organizing theme across many obsessions. Notably, relationship-themed words, not previously included in factor-analytic studies, clustered with just-right words. These novel insights have potential implications for understanding how an apparent multitude of obsessional symptoms are connected by underlying themes. This observation could aid exposure-based treatment approaches and could be used as a conceptual framework for future research. 
", doi="10.2196/25482", url="https://www.jmir.org/2021/6/e25482", url="http://www.ncbi.nlm.nih.gov/pubmed/33892466" } @Article{info:doi/10.2196/17137, author="An, Ning and Mattison, John and Chen, Xinyu and Alterovitz, Gil", title="Team Science in Precision Medicine: Study of Coleadership and Coauthorship Across Health Organizations", journal="J Med Internet Res", year="2021", month="Jun", day="14", volume="23", number="6", pages="e17137", keywords="precision medicine", keywords="team science", abstract="Background: Interdisciplinary collaborations bring many benefits to researchers in multiple areas, including precision medicine. Objective: This viewpoint aims to study how cross-institution team science would affect the development of precision medicine. Methods: Publications of organizations on the eHealth Catalogue of Activities were collected in 2015 and 2017. The significance of the correlation between coleadership and coauthorship among different organizations was calculated using the Pearson chi-square test of independence. Other nonparametric tests examined whether organizations with coleaders publish more and better papers than organizations without coleaders. Results: A total of 374 publications from 69 organizations were analyzed in 2015, and 7064 papers from 87 organizations were analyzed in 2017. Organizations with coleadership published more papers (P<.001, 2015 and 2017), which received higher citations (Z=--13.547, P<.001, 2017), compared to those without coleadership. Organizations with coleaders tended to publish papers together (P<.001, 2015 and 2017). Conclusions: Our findings suggest that organizations in the field of precision medicine could greatly benefit from institutional-level team science. As a result, stronger collaboration is recommended. 
", doi="10.2196/17137", url="https://www.jmir.org/2021/6/e17137", url="http://www.ncbi.nlm.nih.gov/pubmed/34125070" } @Article{info:doi/10.2196/26681, author="Blitz, Rog{\'e}rio and Storck, Michael and Baune, T. Bernhard and Dugas, Martin and Opel, Nils", title="Design and Implementation of an Informatics Infrastructure for Standardized Data Acquisition, Transfer, Storage, and Export in Psychiatric Clinical Routine: Feasibility Study", journal="JMIR Ment Health", year="2021", month="Jun", day="9", volume="8", number="6", pages="e26681", keywords="medical informatics", keywords="digital mental health", keywords="digital data collection", keywords="psychiatry", keywords="single-source metadata architecture transformation", keywords="mental health", keywords="design", keywords="implementation", keywords="feasibility", keywords="informatics", keywords="infrastructure", keywords="data", abstract="Background: Empirically driven personalized diagnostic applications and treatment stratification are widely perceived as a major hallmark in psychiatry. However, data-based personalized decision making requires standardized data acquisition and data access, which are currently absent in psychiatric clinical routine. Objective: Here, we describe the informatics infrastructure implemented at the psychiatric M{\"u}nster University Hospital, which allows standardized acquisition, transfer, storage, and export of clinical data for future real-time predictive modelling in psychiatric routine. Methods: We designed and implemented a technical architecture that includes an extension of the electronic health record (EHR) via scalable standardized data collection and data transfer between EHRs and research databases, thus allowing the pooling of EHRs and research data in a unified database and technical solutions for the visual presentation of collected data and analysis results in the EHR. The Single-source Metadata ARchitecture Transformation (SMA:T) was used as the software architecture. 
SMA:T is an extension of the EHR system and uses model-driven engineering to generate standardized applications and interfaces. The Operational Data Model was used as the standard. Standardized data were entered on iPads via the Mobile Patient Survey (MoPat) and the web application Mopat@home, and the standardized transmission, processing, display, and export of data were realized via SMA:T. Results: The technical feasibility of the informatics infrastructure was demonstrated in the course of this study. We created 19 standardized documentation forms with 241 items. For 317 patients, 6451 instances were automatically transferred to the EHR system without errors. Moreover, 96,323 instances were automatically transferred from the EHR system to the research database for further analyses. Conclusions: In this study, we present the successful implementation of the informatics infrastructure enabling standardized data acquisition and data access for future real-time predictive modelling in clinical routine in psychiatry. The technical solution presented here might guide similar initiatives at other sites and thus help to pave the way toward future application of predictive models in psychiatric clinical routine. ", doi="10.2196/26681", url="https://mental.jmir.org/2021/6/e26681", url="http://www.ncbi.nlm.nih.gov/pubmed/34106072" } @Article{info:doi/10.2196/26075, author="Patr{\'i}cio, Andr{\'e} and Costa, S. 
Rafael and Henriques, Rui", title="Predictability of COVID-19 Hospitalizations, Intensive Care Unit Admissions, and Respiratory Assistance in Portugal: Longitudinal Cohort Study", journal="J Med Internet Res", year="2021", month="Apr", day="28", volume="23", number="4", pages="e26075", keywords="COVID-19", keywords="machine learning", keywords="intensive care admissions", keywords="respiratory assistance", keywords="predictive models", keywords="data modeling", keywords="clinical informatics", abstract="Background: In the face of the current COVID-19 pandemic, the timely prediction of upcoming medical needs for infected individuals enables better and quicker care provision when necessary and management decisions within health care systems. Objective: This work aims to predict the medical needs (hospitalizations, intensive care unit admissions, and respiratory assistance) and survivability of individuals testing positive for SARS-CoV-2 infection in Portugal. Methods: A retrospective cohort of 38,545 infected individuals during 2020 was used. Predictions of medical needs were performed using state-of-the-art machine learning approaches at various stages of a patient's cycle, namely, at testing (prehospitalization), at posthospitalization, and during postintensive care. A thorough optimization of state-of-the-art predictors was undertaken to assess the ability to anticipate medical needs and infection outcomes using demographic and comorbidity variables, as well as dates associated with symptom onset, testing, and hospitalization. Results: For the target cohort, 75\% of hospitalization needs could be identified at the time of testing for SARS-CoV-2 infection. Over 60\% of respiratory needs could be identified at the time of hospitalization. Both predictions had >50\% precision. 
Conclusions: The conducted study pinpoints the relevance of the proposed predictive models as good candidates to support medical decisions in the Portuguese population, including both monitoring and in-hospital care decisions. A clinical decision support system is further provided to this end. ", doi="10.2196/26075", url="https://www.jmir.org/2021/4/e26075", url="http://www.ncbi.nlm.nih.gov/pubmed/33835931" } @Article{info:doi/10.2196/27275, author="Borges do Nascimento, J{\'u}nior Israel and Marcolino, Soriano Milena and Abdulazeem, Mohamed Hebatullah and Weerasekara, Ishanka and Azzopardi-Muscat, Natasha and Gon{\c{c}}alves, Andr{\'e} Marcos and Novillo-Ortiz, David", title="Impact of Big Data Analytics on People's Health: Overview of Systematic Reviews and Recommendations for Future Studies", journal="J Med Internet Res", year="2021", month="Apr", day="13", volume="23", number="4", pages="e27275", keywords="public health", keywords="big data", keywords="health status", keywords="evidence-based medicine", keywords="big data analytics", keywords="secondary data analysis", keywords="machine learning", keywords="systematic review", keywords="overview", keywords="World Health Organization", abstract="Background: Although the potential of big data analytics for health care is well recognized, evidence is lacking on its effects on public health. Objective: The aim of this study was to assess the impact of the use of big data analytics on people's health based on the health indicators and core priorities in the World Health Organization (WHO) General Programme of Work 2019/2023 and the European Programme of Work (EPW), approved and adopted by its Member States, in addition to SARS-CoV-2--related studies. Furthermore, we sought to identify the most relevant challenges and opportunities of these tools with respect to people's health. 
Methods: Six databases (MEDLINE, Embase, Cochrane Database of Systematic Reviews via Cochrane Library, Web of Science, Scopus, and Epistemonikos) were searched from the inception date to September 21, 2020. Systematic reviews assessing the effects of big data analytics on health indicators were included. Two authors independently performed screening, selection, data extraction, and quality assessment using the AMSTAR-2 (A Measurement Tool to Assess Systematic Reviews 2) checklist. Results: The literature search initially yielded 185 records, 35 of which met the inclusion criteria, involving more than 5,000,000 patients. Most of the included studies used patient data collected from electronic health records, hospital information systems, private patient databases, and imaging datasets, and involved the use of big data analytics for noncommunicable diseases. ``Probability of dying from any of cardiovascular, cancer, diabetes or chronic renal disease'' and ``suicide mortality rate'' were the most commonly assessed health indicators and core priorities within the WHO General Programme of Work 2019/2023 and the EPW 2020/2025. Big data analytics have shown moderate to high accuracy for the diagnosis and prediction of complications of diabetes mellitus as well as for the diagnosis and classification of mental disorders; prediction of suicide attempts and behaviors; and the diagnosis, treatment, and prediction of important clinical outcomes of several chronic diseases. Confidence in the results was rated as ``critically low'' for 25 reviews, as ``low'' for 7 reviews, and as ``moderate'' for 3 reviews. The most frequently identified challenges were establishment of a well-designed and structured data source, and a secure, transparent, and standardized database for patient data. 
Conclusions: Although the overall quality of included studies was limited, big data analytics has shown moderate to high accuracy for the diagnosis of certain diseases, improvement in managing chronic diseases, and support for prompt and real-time analyses of large sets of varied input data to diagnose and predict disease outcomes. Trial Registration: International Prospective Register of Systematic Reviews (PROSPERO) CRD42020214048; https://www.crd.york.ac.uk/prospero/display\_record.php?RecordID=214048 ", doi="10.2196/27275", url="https://www.jmir.org/2021/4/e27275", url="http://www.ncbi.nlm.nih.gov/pubmed/33847586" } @Article{info:doi/10.2196/24656, author="Chatterjee, Ayan and Prinz, Andreas and Gerdes, Martin and Martinez, Santiago", title="An Automatic Ontology-Based Approach to Support Logical Representation of Observable and Measurable Data for Healthy Lifestyle Management: Proof-of-Concept Study", journal="J Med Internet Res", year="2021", month="Apr", day="9", volume="23", number="4", pages="e24656", keywords="activity", keywords="nutrition", keywords="sensor", keywords="questionnaire", keywords="SSN", keywords="ontology", keywords="SNOMED CT", keywords="eCoach", keywords="personalized", keywords="recommendation", keywords="automated", keywords="CDSS", keywords="healthy lifestyle", keywords="interoperability", keywords="eHealth", keywords="goal setting", keywords="semantics", keywords="simulation", keywords="proposition", abstract="Background: Lifestyle diseases, because of adverse health behavior, are the foremost cause of death worldwide. An eCoach system may encourage individuals to lead a healthy lifestyle with early health risk prediction, personalized recommendation generation, and goal evaluation. Such an eCoach system needs to collect and transform distributed heterogenous health and wellness data into meaningful information to train an artificially intelligent health risk prediction model. However, it may produce a data compatibility dilemma. 
Our proposed eHealth ontology can increase interoperability between different heterogeneous networks, provide situation awareness, help in data integration, and discover inferred knowledge. This ``proof-of-concept'' study will help organize sensor, questionnaire, and interview data for health risk prediction and personalized recommendation generation targeting obesity as a study case. Objective: The aim of this study is to develop an OWL-based ontology (UiA eHealth Ontology/UiAeHo) model to annotate personal, physiological, behavioral, and contextual data from heterogeneous sources (sensor, questionnaire, and interview), followed by structuring and standardizing of diverse descriptions to generate meaningful, practical, personalized, and contextual lifestyle recommendations based on the defined rules. Methods: We have developed a simulator to collect dummy personal, physiological, behavioral, and contextual data related to artificial participants involved in health monitoring. We have integrated the concepts of ``Semantic Sensor Network Ontology'' and ``Systematized Nomenclature of Medicine---Clinical Terms'' to develop our proposed eHealth ontology. The ontology has been created using Prot{\'e}g{\'e} (version 5.x). We have used the Java-based ``Jena Framework'' (version 3.16) for building a semantic web application that includes resource description framework (RDF) application programming interface (API), OWL API, native tuple store (tuple database), and the SPARQL (Simple Protocol and RDF Query Language) query engine. The logical and structural consistency of the proposed ontology has been evaluated with the ``HermiT 1.4.3.x'' ontology reasoner available in Prot{\'e}g{\'e} 5.x. Results: The proposed ontology has been implemented for the study case ``obesity.'' However, it can be extended further to other lifestyle diseases. 
``UiA eHealth Ontology'' has been constructed using logical axioms, declaration axioms, classes, object properties, and data properties. The ontology can be visualized with ``Owl Viz,'' and the formal representation has been used to infer a participant's health status using the ``HermiT'' reasoner. We have also developed a module for ontology verification that behaves like a rule-based decision support system to predict the probability for health risk, based on the evaluation of the results obtained from SPARQL queries. Furthermore, we discussed the potential lifestyle recommendation generation plan against adverse behavioral risks. Conclusions: This study has led to the creation of a meaningful, context-specific ontology to model massive, unintuitive, raw, unstructured observations for health and wellness data (eg, sensors, interviews, questionnaires) and to annotate them with semantic metadata to create a compact, intelligible abstraction for health risk predictions for individualized recommendation generation. ", doi="10.2196/24656", url="https://www.jmir.org/2021/4/e24656", url="http://www.ncbi.nlm.nih.gov/pubmed/33835031" } @Article{info:doi/10.2196/24288, author="Ossom-Williamson, Peace and Williams, Maximilian Isaac and Kim, Kukhyoung and Kindratt, B. Tiffany", title="Reporting and Availability of COVID-19 Demographic Data by US Health Departments (April to October 2020): Observational Study", journal="JMIR Public Health Surveill", year="2021", month="Apr", day="6", volume="7", number="4", pages="e24288", keywords="coronavirus disease 2019", keywords="COVID-19", keywords="SARS-CoV-2", keywords="race", keywords="ethnicity", keywords="age", keywords="sex", keywords="health equity", keywords="open data", keywords="dashboards", abstract="Background: There is an urgent need for consistent collection of demographic data on COVID-19 morbidity and mortality and sharing it with the public in open and accessible ways. 
Due to the lack of consistency in data reporting during the initial spread of COVID-19, the Equitable Data Collection and Disclosure on COVID-19 Act was introduced into the Congress; it mandates collection and reporting of demographic COVID-19 data on testing, treatments, and deaths by age, sex, race and ethnicity, primary language, socioeconomic status, disability, and county. To our knowledge, no studies have evaluated how COVID-19 demographic data have been collected before and after the introduction of this legislation. Objective: This study aimed to evaluate differences in reporting and public availability of COVID-19 demographic data by US state health departments and Washington, District of Columbia (DC) before (pre-Act), immediately after (post-Act), and 6 months after (6-month follow-up) the introduction of the Equitable Data Collection and Disclosure on COVID-19 Act in the Congress on April 21, 2020. Methods: We reviewed health department websites of all 50 US states and Washington, DC (N=51). We evaluated how each state reported age, sex, and race and ethnicity data for all confirmed COVID-19 cases and deaths and how they made this data available (ie, charts and tables only or combined with dashboards and machine-actionable downloadable formats) at the three timepoints. Results: We found statistically significant increases in the number of health departments reporting age-specific data for COVID-19 cases (P=.045) and resulting deaths (P=.002), sex-specific data for COVID-19 deaths (P=.003), and race- and ethnicity-specific data for confirmed cases (P=.003) and deaths (P=.005) post-Act and at the 6-month follow-up (P<.05 for all). The largest increases were race and ethnicity state data for confirmed cases (pre-Act: 18/51, 35\%; post-Act: 31/51, 61\%; 6-month follow-up: 46/51, 90\%) and deaths due to COVID-19 (pre-Act: 13/51, 25\%; post-Act: 25/51, 49\%; and 6-month follow-up: 39/51, 76\%). 
Although more health departments reported race and ethnicity data based on federal requirements (P<.001), over half (29/51, 56.9\%) still did not report all racial and ethnic groups as per the Office of Management and Budget guidelines (pre-Act: 5/51, 10\%; post-Act: 21/51, 41\%; and 6-month follow-up: 27/51, 53\%). The number of health departments that made COVID-19 data available for download significantly increased from 7 to 23 (P<.001) from our initial data collection (April 2020) to the 6-month follow-up (October 2020). Conclusions: Although the increased demand for disaggregation has improved public reporting of demographics across health departments, an urgent need persists for the introduced legislation to be passed by the Congress for the US states to consistently collect and make characteristics of COVID-19 cases, deaths, and vaccinations available in order to allocate resources to mitigate disease spread. ", doi="10.2196/24288", url="https://publichealth.jmir.org/2021/4/e24288", url="http://www.ncbi.nlm.nih.gov/pubmed/33821804" } @Article{info:doi/10.2196/25645, author="Gruendner, Julian and Gulden, Christian and Kampf, Marvin and Mate, Sebastian and Prokosch, Hans-Ulrich and Zierk, Jakob", title="A Framework for Criteria-Based Selection and Processing of Fast Healthcare Interoperability Resources (FHIR) Data for Statistical Analysis: Design and Implementation Study", journal="JMIR Med Inform", year="2021", month="Apr", day="1", volume="9", number="4", pages="e25645", keywords="data analysis", keywords="data science", keywords="data standardization", keywords="digital medical information", keywords="eHealth", keywords="Fast Healthcare Interoperability Resources", keywords="data harmonization", keywords="medical information", keywords="patient privacy", keywords="data repositories", keywords="HL7 FHIR", abstract="Background: The harmonization and standardization of digital medical information for research purposes is a challenging and ongoing 
collaborative effort. Current research data repositories typically require extensive efforts in harmonizing and transforming original clinical data. The Fast Healthcare Interoperability Resources (FHIR) format was designed primarily to represent clinical processes; therefore, it closely resembles the clinical data model and is more widely available across modern electronic health records. However, no common standardized data format is directly suitable for statistical analyses, and data need to be preprocessed before statistical analysis. Objective: This study aimed to elucidate how FHIR data can be queried directly with a preprocessing service and be used for statistical analyses. Methods: We propose that the binary JavaScript Object Notation format of the PostgreSQL (PSQL) open source database is suitable for not only storing FHIR data, but also extending it with preprocessing and filtering services, which directly transform data stored in FHIR format into prepared data subsets for statistical analysis. We specified an interface for this preprocessor, implemented and deployed it at University Hospital Erlangen-N{\"u}rnberg, generated 3 sample data sets, and analyzed the available data. Results: We imported real-world patient data from 2016 to 2018 into a standard PSQL database, generating a dataset of approximately 35.5 million FHIR resources, including ``Patient,'' ``Encounter,'' ``Condition'' (diagnoses specified using International Classification of Diseases codes), ``Procedure,'' and ``Observation'' (laboratory test results). We then integrated the developed preprocessing service with the PSQL database and the locally installed web-based KETOS analysis platform. Advanced statistical analyses were feasible using the developed framework using 3 clinically relevant scenarios (data-driven establishment of hemoglobin reference intervals, assessment of anemia prevalence in patients with cancer, and investigation of the adverse effects of drugs). 
Conclusions: This study shows how the standard open source database PSQL can be used to store FHIR data and be integrated with a specifically developed preprocessing and analysis framework. This enables dataset generation with advanced medical criteria and the integration of subsequent statistical analysis. The web-based preprocessing service can be deployed locally at the hospital level, protecting patients' privacy while being integrated with existing open source data analysis tools currently being developed across Germany. ", doi="10.2196/25645", url="https://medinform.jmir.org/2021/4/e25645", url="http://www.ncbi.nlm.nih.gov/pubmed/33792554" } @Article{info:doi/10.2196/23011, author="Coetzee, Timothy and Ball, Price Mad and Boutin, Marc and Bronson, Abby and Dexter, T. David and English, A. Rebecca and Furlong, Patricia and Goodman, D. Andrew and Grossman, Cynthia and Hernandez, F. Adrian and Hinners, E. Jennifer and Hudson, Lynn and Kennedy, Annie and Marchisotto, Jane Mary and Matrisian, Lynn and Myers, Elizabeth and Nowell, Benjamin W. and Nosek, A. Brian and Sherer, Todd and Shore, Carolyn and Sim, Ida and Smolensky, Luba and Williams, Christopher and Wood, Julie and Terry, F. 
Sharon", title="Data Sharing Goals for Nonprofit Funders of Clinical Trials", journal="J Participat Med", year="2021", month="Mar", day="29", volume="13", number="1", pages="e23011", keywords="clinical trial", keywords="biomedical research", keywords="data sharing", keywords="patients", doi="10.2196/23011", url="https://jopm.jmir.org/2021/1/e23011", url="http://www.ncbi.nlm.nih.gov/pubmed/33779573" } @Article{info:doi/10.2196/23328, author="Park, Young Ho and Bae, Hyun-Jin and Hong, Gil-Sun and Kim, Minjee and Yun, JiHye and Park, Sungwon and Chung, Jung Won and Kim, NamKug", title="Realistic High-Resolution Body Computed Tomography Image Synthesis by Using Progressive Growing Generative Adversarial Network: Visual Turing Test", journal="JMIR Med Inform", year="2021", month="Mar", day="17", volume="9", number="3", pages="e23328", keywords="generative adversarial network", keywords="unsupervised deep learning", keywords="computed tomography", keywords="synthetic body images", keywords="visual Turing test", abstract="Background: Generative adversarial network (GAN)--based synthetic images can be viable solutions to current supervised deep learning challenges. However, generating highly realistic images is a prerequisite for these approaches. Objective: The aim of this study was to investigate and validate the unsupervised synthesis of highly realistic body computed tomography (CT) images by using a progressive growing GAN (PGGAN) trained to learn the probability distribution of normal data. Methods: We trained the PGGAN by using 11,755 body CT scans. Ten radiologists (4 radiologists with <5 years of experience [Group I], 4 radiologists with 5-10 years of experience [Group II], and 2 radiologists with >10 years of experience [Group III]) evaluated the results in a binary approach by using an independent validation set of 300 images (150 real and 150 synthetic) to judge the authenticity of each image. 
Results: The mean accuracy of the 10 readers in the entire image set was higher than random guessing (1781/3000, 59.4\% vs 1500/3000, 50.0\%, respectively; P<.001). However, in terms of identifying synthetic images as fake, there was no significant difference in the specificity between the visual Turing test and random guessing (779/1500, 51.9\% vs 750/1500, 50.0\%, respectively; P=.29). The accuracy between the 3 reader groups with different experience levels was not significantly different (Group I, 696/1200, 58.0\%; Group II, 726/1200, 60.5\%; and Group III, 359/600, 59.8\%; P=.36). Interreader agreements were poor ($\kappa$=0.11) for the entire image set. In subgroup analysis, the discrepancies between real and synthetic CT images occurred mainly in the thoracoabdominal junction and in the anatomical details. Conclusions: The GAN can synthesize highly realistic high-resolution body CT images that are indistinguishable from real images; however, it has limitations in generating body images of the thoracoabdominal junction and lacks accuracy in the anatomical details. ", doi="10.2196/23328", url="https://medinform.jmir.org/2021/3/e23328", url="http://www.ncbi.nlm.nih.gov/pubmed/33609339" } @Article{info:doi/10.2196/18766, author="Frias, Mario and Moyano, M. Jose and Rivero-Juarez, Antonio and Luna, M. Jose and Camacho, {\'A}ngela and Fardoun, M. Habib and Machuca, Isabel and Al-Twijri, Mohamed and Rivero, Antonio and Ventura, Sebastian", title="Classification Accuracy of Hepatitis C Virus Infection Outcome: Data Mining Approach", journal="J Med Internet Res", year="2021", month="Feb", day="24", volume="23", number="2", pages="e18766", keywords="HIV/HCV", keywords="data mining", keywords="PART", keywords="ensemble", keywords="classification accuracy", abstract="Background: The dataset from genes used to predict hepatitis C virus outcome was evaluated in a previous study using a conventional statistical methodology. 
Objective: The aim of this study was to reanalyze this same dataset using the data mining approach in order to find models that improve the classification accuracy of the genes studied. Methods: We built predictive models using different subsets of factors, selected according to their importance in predicting patient classification. We then evaluated each independent model and also a combination of them, leading to a better predictive model. Results: Our data mining approach identified genetic patterns that escaped detection using conventional statistics. More specifically, the partial decision trees and ensemble models increased the classification accuracy of hepatitis C virus outcome compared with conventional methods. Conclusions: Data mining can be used more extensively in biomedicine, facilitating knowledge building and management of human diseases. ", doi="10.2196/18766", url="https://www.jmir.org/2021/2/e18766", url="http://www.ncbi.nlm.nih.gov/pubmed/33624609" } @Article{info:doi/10.2196/21679, author="Parikh, Soham and Davoudi, Anahita and Yu, Shun and Giraldo, Carolina and Schriver, Emily and Mowery, Danielle", title="Lexicon Development for COVID-19-related Concepts Using Open-source Word Embedding Sources: An Intrinsic and Extrinsic Evaluation", journal="JMIR Med Inform", year="2021", month="Feb", day="22", volume="9", number="2", pages="e21679", keywords="natural language processing", keywords="word embedding", keywords="COVID-19", keywords="intrinsic", keywords="open-source", keywords="computation", keywords="model", keywords="prediction", keywords="semantic", keywords="syntactic", keywords="pattern", abstract="Background: Scientists are developing new computational methods and prediction models to better clinically understand COVID-19 prevalence, treatment efficacy, and patient outcomes. 
These efforts could be improved by leveraging documented COVID-19--related symptoms, findings, and disorders from clinical text sources in an electronic health record. Word embeddings can identify terms related to these clinical concepts from both the biomedical and nonbiomedical domains, and are being shared with the open-source community at large. However, it is unclear how useful openly available word embeddings are for developing lexicons for COVID-19--related concepts. Objective: Given an initial lexicon of COVID-19--related terms, this study aims to characterize the returned terms by similarity across various open-source word embeddings and determine common semantic and syntactic patterns between the COVID-19 queried terms and returned terms specific to the word embedding source. Methods: We compared seven openly available word embedding sources. Using a series of COVID-19--related terms for associated symptoms, findings, and disorders, we conducted an interannotator agreement study to determine how accurately the most similar returned terms could be classified according to semantic types by three annotators. We conducted a qualitative study of COVID-19 queried terms and their returned terms to detect informative patterns for constructing lexicons. We demonstrated the utility of applying such learned synonyms to discharge summaries by reporting the proportion of patients identified by concept among three patient cohorts: pneumonia (n=6410), acute respiratory distress syndrome (n=8647), and COVID-19 (n=2397). Results: We observed high pairwise interannotator agreement (Cohen kappa) for symptoms (0.86-0.99), findings (0.93-0.99), and disorders (0.93-0.99). Word embedding sources generated based on characters tend to return more synonyms (mean count of 7.2 synonyms) compared to token-based embedding sources (mean counts range from 2.0 to 3.4). 
Word embedding sources queried using a qualifier term (eg, dry cough or muscle pain) more often returned qualifiers of the similar semantic type (eg, ``dry'' returns consistency qualifiers like ``wet'' and ``runny'') compared to a single term (eg, cough or pain) queries. A higher proportion of patients had documented fever (0.61-0.84), cough (0.41-0.55), shortness of breath (0.40-0.59), and hypoxia (0.51-0.56) retrieved than other clinical features. Terms for dry cough returned a higher proportion of patients with COVID-19 (0.07) than the pneumonia (0.05) and acute respiratory distress syndrome (0.03) populations. Conclusions: Word embeddings are valuable technology for learning related terms, including synonyms. When leveraging openly available word embedding sources, choices made for the construction of the word embeddings can significantly influence the words learned. ", doi="10.2196/21679", url="https://medinform.jmir.org/2021/2/e21679", url="http://www.ncbi.nlm.nih.gov/pubmed/33544689" } @Article{info:doi/10.2196/16348, author="Hassan, Lamiece and Nenadic, Goran and Tully, Patricia Mary", title="A Social Media Campaign (\#datasaveslives) to Promote the Benefits of Using Health Data for Research Purposes: Mixed Methods Analysis", journal="J Med Internet Res", year="2021", month="Feb", day="16", volume="23", number="2", pages="e16348", keywords="social media", keywords="public engagement", keywords="social network analysis", keywords="medical research", abstract="Background: Social media provides the potential to engage a wide audience about scientific research, including the public. However, little empirical research exists to guide health scientists regarding what works and how to optimize impact. We examined the social media campaign \#datasaveslives established in 2014 to highlight positive examples of the use and reuse of health data in research. 
Objective: This study aims to examine how the \#datasaveslives hashtag was used on social media, how often, and by whom; thus, we aim to provide insights into the impact of a major social media campaign in the UK health informatics research community and further afield. Methods: We analyzed all publicly available posts (tweets) that included the hashtag \#datasaveslives (N=13,895) on the microblogging platform Twitter between September 1, 2016, and August 31, 2017. Using a combination of qualitative and quantitative analyses, we determined the frequency and purpose of tweets. Social network analysis was used to analyze and visualize tweet sharing (retweet) networks among hashtag users. Results: Overall, we found 4175 original posts and 9720 retweets featuring \#datasaveslives by 3649 unique Twitter users. In total, 66.01\% (2756/4175) of the original posts were retweeted at least once. Higher frequencies of tweets were observed during the weeks of prominent policy publications, popular conferences, and public engagement events. Cluster analysis based on retweet relationships revealed an interconnected series of groups of \#datasaveslives users in academia, health services and policy, and charities and patient networks. Thematic analysis of tweets showed that \#datasaveslives was used for a broader range of purposes than indexing information, including event reporting, encouraging participation and action, and showing personal support for data sharing. Conclusions: This study shows that a hashtag-based social media campaign was effective in encouraging a wide audience of stakeholders to disseminate positive examples of health research. Furthermore, the findings suggest that the campaign supported community building and bridging practices within and between the interdisciplinary sectors related to the field of health data science and encouraged individuals to demonstrate personal support for sharing health data. 
", doi="10.2196/16348", url="http://www.jmir.org/2021/2/e16348/", url="http://www.ncbi.nlm.nih.gov/pubmed/33591280" } @Article{info:doi/10.2196/22505, author="Inau, Thea Esther and Sack, Jean and Waltemath, Dagmar and Zeleke, Alamirrew Atinkut", title="Initiatives, Concepts, and Implementation Practices of FAIR (Findable, Accessible, Interoperable, and Reusable) Data Principles in Health Data Stewardship Practice: Protocol for a Scoping Review", journal="JMIR Res Protoc", year="2021", month="Feb", day="2", volume="10", number="2", pages="e22505", keywords="data stewardship", keywords="FAIR data principles", keywords="health research", keywords="PRISMA", keywords="scoping review", abstract="Background: Data stewardship is an essential driver of research and clinical practice. Data collection, storage, access, sharing, and analytics are dependent on the proper and consistent use of data management principles among the investigators. Since 2016, the FAIR (findable, accessible, interoperable, and reusable) guiding principles for research data management have been resonating in scientific communities. Enabling data to be findable, accessible, interoperable, and reusable is currently believed to strengthen data sharing, reduce duplicated efforts, and move toward harmonization of data from heterogeneous unconnected data silos. FAIR initiatives and implementation trends are rising in different facets of scientific domains. It is important to understand the concepts and implementation practices of the FAIR data principles as applied to human health data by studying the flourishing initiatives and implementation lessons relevant to improved health research, particularly for data sharing during the coronavirus pandemic. Objective: This paper aims to conduct a scoping review to identify concepts, approaches, implementation experiences, and lessons learned in FAIR initiatives in the health data domain. 
Methods: The Arksey and O'Malley stage-based methodological framework for scoping reviews will be used for this review. PubMed, Web of Science, and Google Scholar will be searched to access relevant primary and grey publications. Articles written in English and published from 2014 onwards with FAIR principle concepts or practices in the health domain will be included. Duplicates among the 3 data sources will be removed using reference management software. The articles will then be exported to a systematic review management software. At least two independent authors will review the eligibility of each article based on defined inclusion and exclusion criteria. A pretested charting tool will be used to extract relevant information from the full-text papers. Qualitative thematic synthesis analysis methods will be employed by coding and developing themes. Themes will be derived from the research questions and contents in the included papers. Results: The results will be reported using the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-analyses Extension for Scoping Reviews) reporting guidelines. We anticipate finalizing the manuscript for this work in 2021. Conclusions: We believe comprehensive information about the FAIR data principles, initiatives, implementation practices, and lessons learned in the FAIRification process in the health domain is paramount to supporting both evidence-based clinical practice and research transparency in the era of big data and open research publishing. 
International Registered Report Identifier (IRRID): PRR1-10.2196/22505 ", doi="10.2196/22505", url="https://www.researchprotocols.org/2021/2/e22505", url="http://www.ncbi.nlm.nih.gov/pubmed/33528373" } @Article{info:doi/10.2196/21252, author="Spasic, Irena and Button, Kate", title="Patient Triage by Topic Modeling of Referral Letters: Feasibility Study", journal="JMIR Med Inform", year="2020", month="Nov", day="6", volume="8", number="11", pages="e21252", keywords="natural language processing", keywords="machine learning", keywords="data science", keywords="medical informatics", keywords="computer-assisted decision making", abstract="Background: Musculoskeletal conditions are managed within primary care, but patients can be referred to secondary care if a specialist opinion is required. The ever-increasing demand for health care resources emphasizes the need to streamline care pathways with the ultimate aim of ensuring that patients receive timely and optimal care. Information contained in referral letters underpins the referral decision-making process but is yet to be explored systematically for the purposes of treatment prioritization for musculoskeletal conditions. Objective: This study aims to explore the feasibility of using natural language processing and machine learning to automate the triage of patients with musculoskeletal conditions by analyzing information from referral letters. Specifically, we aim to determine whether referral letters can be automatically assorted into latent topics that are clinically relevant, that is, considered relevant when prescribing treatments. Here, clinical relevance is assessed by posing 2 research questions. Can latent topics be used to automatically predict treatment? Can clinicians interpret latent topics as cohorts of patients who share common characteristics or experiences such as medical history, demographics, and possible treatments? 
Methods: We used latent Dirichlet allocation to model each referral letter as a finite mixture over an underlying set of topics and model each topic as an infinite mixture over an underlying set of topic probabilities. The topic model was evaluated in the context of automating patient triage. Given a set of treatment outcomes, a binary classifier was trained for each outcome using previously extracted topics as the input features of the machine learning algorithm. In addition, a qualitative evaluation was performed to assess the human interpretability of topics. Results: The prediction accuracy of binary classifiers outperformed the stratified random classifier by a large margin, indicating that topic modeling could be used to predict the treatment, thus effectively supporting patient triage. The qualitative evaluation confirmed the high clinical interpretability of the topic model. Conclusions: The results established the feasibility of using natural language processing and machine learning to automate triage of patients with knee or hip pain by analyzing information from their referral letters. 
", doi="10.2196/21252", url="https://medinform.jmir.org/2020/11/e21252", url="http://www.ncbi.nlm.nih.gov/pubmed/33155985" } @Article{info:doi/10.2196/20324, author="Shirakawa, Toru and Sonoo, Tomohiro and Ogura, Kentaro and Fujimori, Ryo and Hara, Konan and Goto, Tadahiro and Hashimoto, Hideki and Takahashi, Yuji and Naraba, Hiromu and Nakamura, Kensuke", title="Institution-Specific Machine Learning Models for Prehospital Assessment to Predict Hospital Admission: Prediction Model Development Study", journal="JMIR Med Inform", year="2020", month="Oct", day="27", volume="8", number="10", pages="e20324", keywords="prehospital", keywords="prediction", keywords="hospital admission", keywords="emergency medicine", keywords="machine learning", keywords="data science", abstract="Background: Although multiple prediction models have been developed to predict hospital admission to emergency departments (EDs) to address overcrowding and patient safety, only a few studies have examined prediction models for prehospital use. Development of institution-specific prediction models is feasible in this age of data science, provided that predictor-related information is readily collectable. Objective: We aimed to develop a hospital admission prediction model based on patient information that is commonly available during ambulance transport before hospitalization. Methods: Patients transported by ambulance to our ED from April 2018 through March 2019 were enrolled. Candidate predictors were age, sex, chief complaint, vital signs, and patient medical history, all of which were recorded by emergency medical teams during ambulance transport. Patients were divided into two cohorts for derivation (3601/5145, 70.0\%) and validation (1544/5145, 30.0\%). For statistical models, logistic regression, logistic lasso, random forest, and gradient boosting machine were used. Prediction models were developed in the derivation cohort. 
Model performance was assessed by area under the receiver operating characteristic curve (AUROC) and association measures in the validation cohort. Results: Of 5145 patients transported by ambulance, including deaths in the ED and hospital transfers, 2699 (52.5\%) required hospital admission. Prediction performance was higher with the addition of predictive factors, attaining the best performance with an AUROC of 0.818 (95\% CI 0.792-0.839) with a machine learning model and predictive factors of age, sex, chief complaint, and vital signs. Sensitivity and specificity of this model were 0.744 (95\% CI 0.716-0.773) and 0.745 (95\% CI 0.709-0.776), respectively. Conclusions: For patients transferred to EDs, we developed a well-performing hospital admission prediction model based on routinely collected prehospital information including chief complaints. ", doi="10.2196/20324", url="http://medinform.jmir.org/2020/10/e20324/", url="http://www.ncbi.nlm.nih.gov/pubmed/33107830" } @Article{info:doi/10.2196/21980, author="Wu, Jun and Wang, Jian and Nicholas, Stephen and Maitland, Elizabeth and Fan, Qiuyan", title="Application of Big Data Technology for COVID-19 Prevention and Control in China: Lessons and Recommendations", journal="J Med Internet Res", year="2020", month="Oct", day="9", volume="22", number="10", pages="e21980", keywords="big data", keywords="COVID-19", keywords="disease prevention and control", abstract="Background: In the prevention and control of infectious diseases, previous research on the application of big data technology has mainly focused on the early warning and early monitoring of infectious diseases. Although the application of big data technology for COVID-19 warning and monitoring remain important tasks, prevention of the disease's rapid spread and reduction of its impact on society are currently the most pressing challenges for the application of big data technology during the COVID-19 pandemic. 
After the outbreak of COVID-19 in Wuhan, the Chinese government and nongovernmental organizations actively used big data technology to prevent, contain, and control the spread of COVID-19. Objective: The aim of this study is to discuss the application of big data technology to prevent, contain, and control COVID-19 in China; draw lessons; and make recommendations. Methods: We discuss the data collection methods and key data information that existed in China before the outbreak of COVID-19 and how these data contributed to the prevention and control of COVID-19. Next, we discuss China's new data collection methods and new information assembled after the outbreak of COVID-19. Based on the data and information collected in China, we analyzed the application of big data technology from the perspectives of data sources, data application logic, data application level, and application results. In addition, we analyzed the issues, challenges, and responses encountered by China in the application of big data technology from four perspectives: data access, data use, data sharing, and data protection. Suggestions for improvements are made for data collection, data circulation, data innovation, and data security to help understand China's response to the epidemic and to provide lessons for other countries' prevention and control of COVID-19. Results: In the process of the prevention and control of COVID-19 in China, big data technology has played an important role in personal tracking, surveillance and early warning, tracking of the virus's sources, drug screening, medical treatment, resource allocation, and production recovery. The data used included location and travel data, medical and health data, news media data, government data, online consumption data, data collected by intelligent equipment, and epidemic prevention data. 
We identified a number of big data problems including low efficiency of data collection, difficulty in guaranteeing data quality, low efficiency of data use, lack of timely data sharing, and data privacy protection issues. To address these problems, we suggest unified data collection standards, innovative use of data, accelerated exchange and circulation of data, and a detailed and rigorous data protection system. Conclusions: China has used big data technology to prevent and control COVID-19 in a timely manner. To prevent and control infectious diseases, countries must collect, clean, and integrate data from a wide range of sources; use big data technology to analyze a wide range of big data; create platforms for data analyses and sharing; and address privacy issues in the collection and use of big data. ", doi="10.2196/21980", url="http://www.jmir.org/2020/10/e21980/", url="http://www.ncbi.nlm.nih.gov/pubmed/33001836" } @Article{info:doi/10.2196/19879, author="Gruendner, Julian and Wolf, Nicolas and T{\"o}gel, Lars and Haller, Florian and Prokosch, Hans-Ulrich and Christoph, Jan", title="Integrating Genomics and Clinical Data for Statistical Analysis by Using GEnome MINIng (GEMINI) and Fast Healthcare Interoperability Resources (FHIR): System Design and Implementation", journal="J Med Internet Res", year="2020", month="Oct", day="7", volume="22", number="10", pages="e19879", keywords="next-generation sequencing", keywords="data analysis", keywords="genetic databases", keywords="GEnome MINIng", keywords="Fast Healthcare Interoperability Resources", keywords="data standardization", abstract="Background: The introduction of next-generation sequencing (NGS) into molecular cancer diagnostics has led to an increase in the data available for the identification and evaluation of driver mutations and for defining personalized cancer treatment regimens. 
The meaningful combination of omics data, ie, pathogenic gene variants and alterations, with other patient data, to understand the full picture of malignancy has been challenging. Objective: This study describes the implementation of a system capable of processing, analyzing, and subsequently combining NGS data with other clinical patient data for analysis within and across institutions. Methods: On the basis of the already existing NGS analysis workflows for the identification of malignant gene variants at the Institute of Pathology of the University Hospital Erlangen, we defined basic requirements for an NGS processing and analysis pipeline and implemented a pipeline based on the GEMINI (GEnome MINIng) open source genetic variation database. For the purpose of validation, this pipeline was applied to data from the 1000 Genomes Project and subsequently to NGS data derived from 206 patients of a local hospital. We further integrated the pipeline into existing structures of data integration centers at the University Hospital Erlangen and combined NGS data with local nongenomic patient-derived data available in Fast Healthcare Interoperability Resources format. Results: Using data from the 1000 Genomes Project and from the patient cohort as input, the implemented system produced the same results as already established methodologies. Further, it satisfied all our identified requirements and was successfully integrated into the existing infrastructure. Finally, we showed in an exemplary analysis how the data could be quickly loaded into and analyzed in KETOS, a web-based analysis platform for statistical analysis and clinical decision support. Conclusions: This study demonstrates that the GEMINI open source database can be augmented to create an NGS analysis pipeline. The pipeline generates high-quality results consistent with the already established workflows for gene variant annotation and pathological evaluation. 
We further demonstrate how NGS-derived genomic and other clinical data can be combined for further statistical analysis, thereby providing for data integration using standardized vocabularies and methods. Finally, we demonstrate the feasibility of the pipeline integration into hospital workflows by providing an exemplary integration into the data integration center infrastructure, which is currently being established across Germany. ", doi="10.2196/19879", url="http://www.jmir.org/2020/10/e19879/", url="http://www.ncbi.nlm.nih.gov/pubmed/33026356" } @Article{info:doi/10.2196/17818, author="Sultana, Madeena and Al-Jefri, Majed and Lee, Joon", title="Using Machine Learning and Smartphone and Smartwatch Data to Detect Emotional States and Transitions: Exploratory Study", journal="JMIR Mhealth Uhealth", year="2020", month="Sep", day="29", volume="8", number="9", pages="e17818", keywords="mHealth", keywords="mental health", keywords="emotion detection", keywords="emotional transition detection", keywords="spatiotemporal context", keywords="supervised machine learning", keywords="artificial intelligence", keywords="mobile phone", keywords="digital biomarkers", keywords="digital phenotyping", abstract="Background: Emotional state in everyday life is an essential indicator of health and well-being. However, daily assessment of emotional states largely depends on active self-reports, which are often inconvenient and prone to incomplete information. Automated detection of emotional states and transitions on a daily basis could be an effective solution to this problem. However, the relationship between emotional transitions and everyday context remains to be unexplored. Objective: This study aims to explore the relationship between contextual information and emotional transitions and states to evaluate the feasibility of detecting emotional transitions and states from daily contextual information using machine learning (ML) techniques. 
Methods: This study was conducted on the data of 18 individuals from a publicly available data set called ExtraSensory. Contextual and sensor data were collected using smartphone and smartwatch sensors in a free-living condition, where the number of days for each person varied from 3 to 9. Sensors included an accelerometer, a gyroscope, a compass, location services, a microphone, a phone state indicator, light, temperature, and a barometer. The users self-reported approximately 49 discrete emotions at different intervals via a smartphone app throughout the data collection period. We mapped the 49 reported discrete emotions to the 3 dimensions of the pleasure, arousal, and dominance model and considered 6 emotional states: discordant, pleased, dissuaded, aroused, submissive, and dominant. We built general and personalized models for detecting emotional transitions and states every 5 min. The transition detection problem is a binary classification problem that detects whether a person's emotional state has changed over time, whereas state detection is a multiclass classification problem. In both cases, a wide range of supervised ML algorithms were leveraged, in addition to data preprocessing, feature selection, and data imbalance handling techniques. Finally, an assessment was conducted to shed light on the association between everyday context and emotional states. Results: This study obtained promising results for emotional state and transition detection. The best area under the receiver operating characteristic (AUROC) curve for emotional state detection reached 60.55\% in the general models and an average of 96.33\% across personalized models. Despite the highly imbalanced data, the best AUROC curve for emotional transition detection reached 90.5\% in the general models and an average of 88.73\% across personalized models. 
In general, feature analyses show that spatiotemporal context, phone state, and motion-related information are the most informative factors for emotional state and transition detection. Our assessment showed that lifestyle has an impact on the predictability of emotion. Conclusions: Our results demonstrate a strong association of daily context with emotional states and transitions as well as the feasibility of detecting emotional states and transitions using data from smartphone and smartwatch sensors. ", doi="10.2196/17818", url="http://mhealth.jmir.org/2020/9/e17818/", url="http://www.ncbi.nlm.nih.gov/pubmed/32990638" } @Article{info:doi/10.2196/18920, author="Brown, Paul Adrian and Randall, M. Sean", title="Secure Record Linkage of Large Health Data Sets: Evaluation of a Hybrid Cloud Model", journal="JMIR Med Inform", year="2020", month="Sep", day="23", volume="8", number="9", pages="e18920", keywords="cloud computing", keywords="medical record linkage", keywords="confidentiality", keywords="data science", abstract="Background: The linking of administrative data across agencies provides the capability to investigate many health and social issues with the potential to deliver significant public benefit. Despite its advantages, the use of cloud computing resources for linkage purposes is scarce, with the storage of identifiable information on cloud infrastructure assessed as high risk by data custodians. Objective: This study aims to present a model for record linkage that utilizes cloud computing capabilities while assuring custodians that identifiable data sets remain secure and local. Methods: A new hybrid cloud model was developed, including privacy-preserving record linkage techniques and container-based batch processing. An evaluation of this model was conducted with a prototype implementation using large synthetic data sets representative of administrative health data. 
Results: The cloud model kept identifiers on premises and used privacy-preserved identifiers to run all linkage computations on cloud infrastructure. Our prototype used a managed container cluster in Amazon Web Services to distribute the computation using existing linkage software. Although the cost of computation was relatively low, the use of existing software resulted in a processing overhead of 35.7\% (149/417 min execution time). Conclusions: The result of our experimental evaluation shows the operational feasibility of such a model and the exciting opportunities for advancing the analysis of linkage outputs. ", doi="10.2196/18920", url="http://medinform.jmir.org/2020/9/e18920/", url="http://www.ncbi.nlm.nih.gov/pubmed/32965236" } @Article{info:doi/10.2196/19572, author="Hu, Guangyu and Li, Peiyi and Yuan, Changzheng and Tao, Chenglin and Wen, Hai and Liu, Qiannan and Qiu, Wuqi", title="Information Disclosure During the COVID-19 Epidemic in China: City-Level Observational Study", journal="J Med Internet Res", year="2020", month="Aug", day="27", volume="22", number="8", pages="e19572", keywords="information disclosure", keywords="COVID-19", keywords="website", keywords="risk", keywords="communication", keywords="China", keywords="disclosure", keywords="pandemic", keywords="health information", keywords="public health", abstract="Background: Information disclosure is a top priority for official responses to the COVID-19 pandemic. The timely and standardized information published by authorities as a response to the crisis can better inform the public and enable better preparations for the pandemic; however, there is limited evidence of any systematic analyses of the disclosed epidemic information. This in turn has important implications for risk communication. 
Objective: This study aimed to describe and compare the officially released content regarding local epidemic situations as well as analyze the characteristics of information disclosure through local communication in major cities in China. Methods: The 31 capital cities in mainland China were included in this city-level observational study. Data were retrieved from local municipalities and health commission websites as of March 18, 2020. A checklist was employed as a rapid qualitative assessment tool to analyze the information disclosure performance of each city. Descriptive analyses and data visualizations were produced to present and compare the comparative performances of the cities. Results: In total, 29 of 31 cities (93.5\%) established specific COVID-19 webpages to disclose information. Among them, 12 of the city webpages were added to their corresponding municipal websites. A majority of the cities (21/31, 67.7\%) published their first cases of infection in a timely manner on the actual day of confirmation. Regarding the information disclosures highlighted on the websites, news updates from local media or press briefings were the most prevalent (28/29, 96.6\%), followed by epidemic surveillance (25/29, 86.2\%), and advice for the public (25/29, 86.2\%). Clarifications of misinformation and frequently asked questions were largely overlooked as only 2 cities provided this valuable information. The median daily update frequency of epidemic surveillance summaries was 1.2 times per day (IQR 1.0-1.3 times), and the majority of these summaries (18/25, 72.0\%) also provided detailed information regarding confirmed cases. The reporting of key indicators in the epidemic surveillance summaries, as well as critical facts included in the confirmed case reports, varied substantially between cities. 
In general, the best performance in terms of timely reporting and the transparency of information disclosures was observed in the municipalities directly administered by the central government compared to the other cities. Conclusions: Timely and effective efforts to disclose information related to the COVID-19 epidemic have been made in major cities in China. Continued improvements to local authority reporting will contribute to more effective public communication and efficient public health research responses. The development of protocols and the standardization of epidemic message templates---as well as the use of uniform operating procedures to provide regular information updates---should be prioritized to ensure a coordinated national response. ", doi="10.2196/19572", url="http://www.jmir.org/2020/8/e19572/", url="http://www.ncbi.nlm.nih.gov/pubmed/32790640" } @Article{info:doi/10.2196/18087, author="Suver, Christine and Thorogood, Adrian and Doerr, Megan and Wilbanks, John and Knoppers, Bartha", title="Bringing Code to Data: Do Not Forget Governance", journal="J Med Internet Res", year="2020", month="Jul", day="28", volume="22", number="7", pages="e18087", keywords="data management", keywords="privacy", keywords="ethics, research", keywords="data science", keywords="machine learning", doi="10.2196/18087", url="http://www.jmir.org/2020/7/e18087/", url="http://www.ncbi.nlm.nih.gov/pubmed/32540846" } @Article{info:doi/10.2196/14591, author="Wang, Karen and Grossetta Nardini, Holly and Post, Lori and Edwards, Todd and Nunez-Smith, Marcella and Brandt, Cynthia", title="Information Loss in Harmonizing Granular Race and Ethnicity Data: Descriptive Study of Standards", journal="J Med Internet Res", year="2020", month="Jul", day="20", volume="22", number="7", pages="e14591", keywords="continental population groups", keywords="multiracial populations", keywords="multiethnic groups", keywords="data standards", keywords="health status disparities", keywords="race 
factors", keywords="demography", abstract="Background: Data standards for race and ethnicity have significant implications for health equity research. Objective: We aim to describe a challenge encountered when working with a multiple--race and ethnicity assessment in the Eastern Caribbean Health Outcomes Research Network (ECHORN), a research collaborative of Barbados, Puerto Rico, Trinidad and Tobago, and the US Virgin Islands. Methods: We examined the data standards guiding harmonization of race and ethnicity data for multiracial and multiethnic populations, using the Office of Management and Budget (OMB) Statistical Policy Directive No. 15. Results: Of 1211 participants in the ECHORN cohort study, 901 (74.40\%) selected 1 racial category. Of those that selected 1 category, 13.0\% (117/901) selected Caribbean; 6.4\% (58/901), Puerto Rican or Boricua; and 13.5\% (122/901), the mixed or multiracial category. A total of 17.84\% (216/1211) of participants selected 2 or more categories, with 15.19\% (184/1211) selecting 2 categories and 2.64\% (32/1211) selecting 3 or more categories. With aggregation of ECHORN data into OMB categories, 27.91\% (338/1211) of the participants can be placed in the ``more than one race'' category. Conclusions: This analysis exposes the fundamental informatics challenges that current race and ethnicity data standards present to meaningful collection, organization, and dissemination of granular data about subgroup populations in diverse and marginalized communities. Current standards should reflect the science of measuring race and ethnicity and the need for multidisciplinary teams to improve evolving standards throughout the data life cycle. 
", doi="10.2196/14591", url="http://www.jmir.org/2020/7/e14591/", url="http://www.ncbi.nlm.nih.gov/pubmed/32706693" } @Article{info:doi/10.2196/17257, author="Du, Zhenzhen and Yang, Yujie and Zheng, Jing and Li, Qi and Lin, Denan and Li, Ye and Fan, Jianping and Cheng, Wen and Chen, Xie-Hui and Cai, Yunpeng", title="Accurate Prediction of Coronary Heart Disease for Patients With Hypertension From Electronic Health Records With Big Data and Machine-Learning Methods: Model Development and Performance Evaluation", journal="JMIR Med Inform", year="2020", month="Jul", day="6", volume="8", number="7", pages="e17257", keywords="coronary heart disease", keywords="machine learning", keywords="electronic health records", keywords="predictive algorithms", keywords="hypertension", abstract="Background: Predictions of cardiovascular disease risks based on health records have long attracted broad research interests. Despite extensive efforts, the prediction accuracy has remained unsatisfactory. This raises the question as to whether the data insufficiency, statistical and machine-learning methods, or intrinsic noise have hindered the performance of previous approaches, and how these issues can be alleviated. Objective: Based on a large population of patients with hypertension in Shenzhen, China, we aimed to establish a high-precision coronary heart disease (CHD) prediction model through big data and machine-learning methods. Methods: Data from a large cohort of 42,676 patients with hypertension, including 20,156 patients with CHD onset, were investigated from electronic health records (EHRs) 1-3 years prior to CHD onset (for CHD-positive cases) or during a disease-free follow-up period of more than 3 years (for CHD-negative cases). The population was divided evenly into independent training and test datasets. 
Various machine-learning methods were adopted on the training set to achieve high-accuracy prediction models and the results were compared with traditional statistical methods and well-known risk scales. Comparison analyses were performed to investigate the effects of training sample size, factor sets, and modeling approaches on the prediction performance. Results: An ensemble method, XGBoost, achieved high accuracy in predicting 3-year CHD onset for the independent test dataset with an area under the receiver operating characteristic curve (AUC) value of 0.943. Comparison analysis showed that nonlinear models (K-nearest neighbor AUC 0.908, random forest AUC 0.938) outperformed linear models (logistic regression AUC 0.865) on the same datasets, and machine-learning methods significantly surpassed traditional risk scales or fixed models (eg, Framingham cardiovascular disease risk models). Further analyses revealed that using time-dependent features obtained from multiple records, including both statistical variables and changing-trend variables, helped to improve the performance compared to using only static features. Subpopulation analysis showed that feature design had a more significant effect on model accuracy than the population size. Marginal effect analysis showed that both traditional and EHR factors exhibited highly nonlinear characteristics with respect to the risk scores. Conclusions: We demonstrated that accurate risk prediction of CHD from EHRs is possible given a sufficiently large population of training data. Sophisticated machine-learning methods played an important role in tackling the heterogeneity and nonlinear nature of disease prediction. Moreover, accumulated EHR data over multiple time points provided additional features that were valuable for risk prediction. Our study highlights the importance of accumulating big data from EHRs for accurate disease predictions. 
", doi="10.2196/17257", url="https://medinform.jmir.org/2020/7/e17257", url="http://www.ncbi.nlm.nih.gov/pubmed/32628616" } @Article{info:doi/10.2196/19170, author="Mavian, Carla and Marini, Simone and Prosperi, Mattia and Salemi, Marco", title="A Snapshot of SARS-CoV-2 Genome Availability up to April 2020 and its Implications: Data Analysis", journal="JMIR Public Health Surveill", year="2020", month="Jun", day="1", volume="6", number="2", pages="e19170", keywords="covid-19", keywords="sars-cov-2", keywords="phylogenetics", keywords="genome", keywords="evolution", keywords="genetics", keywords="pandemic", keywords="infectious disease", keywords="virus", keywords="sequence", keywords="transmission", keywords="tracing", keywords="tracking", abstract="Background: The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic has been growing exponentially, affecting over 4 million people and causing enormous distress to economies and societies worldwide. A plethora of analyses based on viral sequences has already been published both in scientific journals and through non--peer-reviewed channels to investigate the genetic heterogeneity and spatiotemporal dissemination of SARS-CoV-2. However, a systematic investigation of phylogenetic information and sampling bias in the available data is lacking. Although the number of available genome sequences of SARS-CoV-2 is growing daily and the sequences show increasing phylogenetic information, country-specific data still present severe limitations and should be interpreted with caution. Objective: The objective of this study was to determine the quality of the currently available SARS-CoV-2 full genome data in terms of sampling bias as well as phylogenetic and temporal signals to inform and guide the scientific community. 
Methods: We used maximum likelihood--based methods to assess the presence of sufficient information for robust phylogenetic and phylogeographic studies in several SARS-CoV-2 sequence alignments assembled from GISAID (Global Initiative on Sharing All Influenza Data) data released between March and April 2020. Results: Although the number of high-quality full genomes is growing daily, and sequence data released in April 2020 contain sufficient phylogenetic information to allow reliable inference of phylogenetic relationships, country-specific SARS-CoV-2 data sets still present severe limitations. Conclusions: At the present time, studies assessing within-country spread or transmission clusters should be considered preliminary or hypothesis-generating at best. Hence, current reports should be interpreted with caution, and concerted efforts should continue to increase the number and quality of sequences required for robust tracing of the epidemic. ", doi="10.2196/19170", url="http://publichealth.jmir.org/2020/2/e19170/", url="http://www.ncbi.nlm.nih.gov/pubmed/32412415" } @Article{info:doi/10.2196/19273, author="Chen, Emily and Lerman, Kristina and Ferrara, Emilio", title="Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set", journal="JMIR Public Health Surveill", year="2020", month="May", day="29", volume="6", number="2", pages="e19273", keywords="COVID-19", keywords="SARS-CoV-2", keywords="social media", keywords="network analysis", keywords="computational social sciences", abstract="Background: At the time of this writing, the coronavirus disease (COVID-19) pandemic outbreak has already put tremendous strain on many countries' citizens, resources, and economies around the world. Social distancing measures, travel bans, self-quarantines, and business closures are changing the very fabric of societies worldwide. 
With people forced out of public spaces, much of the conversation about these phenomena now occurs online on social media platforms like Twitter. Objective: In this paper, we describe a multilingual COVID-19 Twitter data set that we are making available to the research community via our COVID-19-TweetIDs GitHub repository. Methods: We started this ongoing data collection on January 28, 2020, leveraging Twitter's streaming application programming interface (API) and Tweepy to follow certain keywords and accounts that were trending at the time data collection began. We used Twitter's search API to query for past tweets, resulting in the earliest tweets in our collection dating back to January 21, 2020. Results: Since the inception of our collection, we have actively maintained and updated our GitHub repository on a weekly basis. We have published over 123 million tweets, with over 60\% of the tweets in English. This paper also presents basic statistics that show that Twitter activity responds and reacts to COVID-19-related events. Conclusions: It is our hope that our contribution will enable the study of online conversation dynamics in the context of a planetary-scale epidemic outbreak of unprecedented proportions and implications. This data set could also help track COVID-19-related misinformation and unverified rumors or enable the understanding of fear and panic---and undoubtedly more. 
", doi="10.2196/19273", url="http://publichealth.jmir.org/2020/2/e19273/", url="http://www.ncbi.nlm.nih.gov/pubmed/32427106" } @Article{info:doi/10.2196/18828, author="Ayyoubzadeh, Mohammad Seyed and Ayyoubzadeh, Mehdi Seyed and Zahedi, Hoda and Ahmadi, Mahnaz and R Niakan Kalhori, Sharareh", title="Predicting COVID-19 Incidence Through Analysis of Google Trends Data in Iran: Data Mining and Deep Learning Pilot Study", journal="JMIR Public Health Surveill", year="2020", month="Apr", day="14", volume="6", number="2", pages="e18828", keywords="coronavirus", keywords="COVID-19", keywords="prediction", keywords="incidence", keywords="Google Trends", keywords="linear regression", keywords="LSTM", keywords="pandemic", keywords="outbreak", keywords="public health", abstract="Background: The recent global outbreak of coronavirus disease (COVID-19) is affecting many countries worldwide. Iran is one of the top 10 most affected countries. Search engines provide useful data from populations, and these data might be useful to analyze epidemics. Utilizing data mining methods on electronic resources' data might provide a better insight into the COVID-19 outbreak to manage the health crisis in each country and worldwide. Objective: This study aimed to predict the incidence of COVID-19 in Iran. Methods: Data were obtained from the Google Trends website. Linear regression and long short-term memory (LSTM) models were used to estimate the number of positive COVID-19 cases. All models were evaluated using 10-fold cross-validation, and root mean square error (RMSE) was used as the performance metric. Results: The linear regression model predicted the incidence with an RMSE of 7.562 (SD 6.492). The most effective factors besides previous day incidence included the search frequency of handwashing, hand sanitizer, and antiseptic topics. The RMSE of the LSTM model was 27.187 (SD 20.705). Conclusions: Data mining algorithms can be employed to predict trends of outbreaks. 
This prediction might support policymakers and health care managers to plan and allocate health care resources accordingly. ", doi="10.2196/18828", url="http://publichealth.jmir.org/2020/2/e18828/", url="http://www.ncbi.nlm.nih.gov/pubmed/32234709" } @Article{info:doi/10.2196/15816, author="Wilson, James and Herron, Daniel and Nachev, Parashkev and McNally, Nick and Williams, Bryan and Rees, Geraint", title="The Value of Data: Applying a Public Value Model to the English National Health Service", journal="J Med Internet Res", year="2020", month="Mar", day="27", volume="22", number="3", pages="e15816", keywords="health policy", keywords="innovation", keywords="public value", keywords="intellectual property", keywords="NHS Constitution", doi="10.2196/15816", url="http://www.jmir.org/2020/3/e15816/", url="http://www.ncbi.nlm.nih.gov/pubmed/32217501" } @Article{info:doi/10.2196/16102, author="Grundstrom, Casandra and Korhonen, Olli and V{\"a}yrynen, Karin and Isomursu, Minna", title="Insurance Customers' Expectations for Sharing Health Data: Qualitative Survey Study", journal="JMIR Med Inform", year="2020", month="Mar", day="26", volume="8", number="3", pages="e16102", keywords="data sharing", keywords="qualitative research", keywords="survey", keywords="health insurance", keywords="insurance", keywords="medical informatics", keywords="health services", abstract="Background: Insurance organizations are essential stakeholders in health care ecosystems. For addressing future health care needs, insurance companies require access to health data to deliver preventative and proactive digital health services to customers. However, extant research is limited in examining the conditions that incentivize health data sharing. 
Objective: This study aimed to (1) identify the expectations of insurance customers when sharing health data, (2) determine the perceived intrinsic value of health data, and (3) explore the conditions that aid in incentivizing health data sharing in the relationship between an insurance organization and its customer. Methods: A Web-based survey was distributed to randomly selected customers from a Finnish insurance organization through email. A single open-text answer was used for a qualitative data analysis through inductive coding, followed by a thematic analysis. Furthermore, the 4 constructs of commitment, power, reciprocity, and trust from the social exchange theory (SET) were applied as a framework. Results: From the 5000 customers invited to participate, we received 452 surveys (response rate: 9.0\%). Customer characteristics were found to reflect customer demographics. Of the 452 surveys, 48 (10.6\%) open-text responses were skipped by the customer, 57 (12.6\%) customers had no expectations from sharing health data, and 44 (9.7\%) customers preferred to abstain from a data sharing relationship. Using the SET framework, we found that customers expected different conditions to be fulfilled by their insurance provider based on the commitment, power, reciprocity, and trust constructs. Of the 452 customers who completed the surveys, 64 (14.2\%) customers required that the insurance organization meets their data treatment expectations (commitment). Overall, 4.9\% (22/452) of customers were concerned about their health data being used against them to profile their health, to increase insurance prices, or to deny health insurance claims (power). A total of 28.5\% (129/452) of customers expected some form of benefit, such as personalized digital health services, and 29.9\% (135/452) of customers expected finance-related compensation (reciprocity). 
Furthermore, 7.5\% (34/452) of customers expected some form of empathy from the insurance organization through enhanced transparency or an emotional connection (trust). Conclusions: To aid in the design and development of digital health services, insurance organizations need to address the customers' expectations when sharing their health data. We established the expectations of customers in the social exchange of health data and explored the perceived values of data as intangible goods. Actions by the insurance organization should aim to increase trust through a culture of transparency, commitment to treat health data in a prescribed manner, provide reciprocal benefits through digital health services that customers deem valuable, and assuage fears of health data being used to prevent providing insurance coverage or increase costs. ", doi="10.2196/16102", url="http://medinform.jmir.org/2020/3/e16102/", url="http://www.ncbi.nlm.nih.gov/pubmed/32213467" } @Article{info:doi/10.2196/14777, author="McDonough, W. Caitrin and Smith, M. Steven and Cooper-DeHoff, M. Rhonda and Hogan, R. William", title="Optimizing Antihypertensive Medication Classification in Electronic Health Record-Based Data: Classification System Development and Methodological Comparison", journal="JMIR Med Inform", year="2020", month="Feb", day="27", volume="8", number="2", pages="e14777", keywords="antihypertensive agents", keywords="electronic health records", keywords="classification", keywords="RxNorm", keywords="phenotype", abstract="Background: Computable phenotypes have the ability to utilize data within the electronic health record (EHR) to identify patients with certain characteristics. Many computable phenotypes rely on multiple types of data within the EHR including prescription drug information. 
Hypertension (HTN)-related computable phenotypes are particularly dependent on the correct classification of antihypertensive prescription drug information, as well as corresponding diagnoses and blood pressure information. Objective: This study aimed to create an antihypertensive drug classification system to be utilized with EHR-based data as part of HTN-related computable phenotypes. Methods: We compared 4 different antihypertensive drug classification systems based on 4 different methodologies and terminologies, including 3 RxNorm Concept Unique Identifier (RxCUI)--based classifications and 1 medication name--based classification. The RxCUI-based classifications utilized data from (1) the Drug Ontology, (2) the new Medication Reference Terminology, and (3) the Anatomical Therapeutic Chemical Classification System and DrugBank, whereas the medication name--based classification relied on antihypertensive drug names. Each classification system was applied to EHR-based prescription drug data from hypertensive patients in the OneFlorida Data Trust. Results: There were 13,627 unique RxCUIs and 8025 unique medication names from the 13,879,046 prescriptions. We observed a broad overlap between the 4 methods, with 84.1\% (691/822) to 95.3\% (695/729) of terms overlapping pairwise between the different classification methods. Key differences arose from drug products with multiple dosage forms, drug products with an indication of benign prostatic hyperplasia, drug products that contain more than 1 ingredient (combination products), and terms within the classification systems corresponding to retired or obsolete RxCUIs. Conclusions: In total, 2 antihypertensive drug classifications were constructed, one based on RxCUIs and one based on medication name, that can be used in future computable phenotypes that require antihypertensive drug classifications. 
", doi="10.2196/14777", url="http://medinform.jmir.org/2020/2/e14777/", url="http://www.ncbi.nlm.nih.gov/pubmed/32130152" } @Article{info:doi/10.2196/16153, author="Nam, Min Sang and Peterson, A. Thomas and Butte, J. Atul and Seo, Yul Kyoung and Han, Wook Hyun", title="Explanatory Model of Dry Eye Disease Using Health and Nutrition Examinations: Machine Learning and Network-Based Factor Analysis From a National Survey", journal="JMIR Med Inform", year="2020", month="Feb", day="20", volume="8", number="2", pages="e16153", keywords="dry eye disease", keywords="epidemiology", keywords="machine learning", keywords="systems analysis", keywords="patient-specific modeling", abstract="Background: Dry eye disease (DED) is a complex disease of the ocular surface, and its associated factors are important for understanding and effectively treating DED. Objective: This study aimed to provide an integrative and personalized model of DED by making an explanatory model of DED using as many factors as possible from the Korea National Health and Nutrition Examination Survey (KNHANES) data. Methods: Using KNHANES data for 2012 (4391 sample cases), a point-based scoring system was created for ranking factors associated with DED and assessing patient-specific DED risk. First, decision trees and lasso were used to classify continuous factors and to select important factors, respectively. Next, a survey-weighted multiple logistic regression was trained using these factors, and points were assigned using the regression coefficients. Finally, network graphs of partial correlations between factors were utilized to study the interrelatedness of DED-associated factors. Results: The point-based model achieved an area under the curve of 0.70 (95\% CI 0.61-0.78), and 13 of 78 factors considered were chosen. 
Important factors included sex (+9 points for women), corneal refractive surgery (+9 points), current depression (+7 points), cataract surgery (+7 points), stress (+6 points), age (54-66 years; +4 points), rhinitis (+4 points), lipid-lowering medication (+4 points), and intake of omega-3 (0.43\%-0.65\% kcal/day; -4 points). Among these, the age group 54 to 66 years had high centrality in the network, whereas omega-3 had low centrality. Conclusions: Integrative understanding of DED was possible using the machine learning--based model and network-based factor analysis. This method for finding important risk factors and identifying patient-specific risk could be applied to other multifactorial diseases. ", doi="10.2196/16153", url="http://medinform.jmir.org/2020/2/e16153/", url="http://www.ncbi.nlm.nih.gov/pubmed/32130150" } @Article{info:doi/10.2196/13046, author="Gong, Mengchun and Wang, Shuang and Wang, Lezi and Liu, Chao and Wang, Jianyang and Guo, Qiang and Zheng, Hao and Xie, Kang and Wang, Chenghong and Hui, Zhouguang", title="Evaluation of Privacy Risks of Patients' Data in China: Case Study", journal="JMIR Med Inform", year="2020", month="Feb", day="5", volume="8", number="2", pages="e13046", keywords="patient privacy", keywords="privacy risk", keywords="Chinese patients' data", keywords="data sharing", keywords="re-identification", abstract="Background: Patient privacy is a ubiquitous problem around the world. Many existing studies have demonstrated the potential privacy risks associated with sharing of biomedical data. Owing to the increasing need for data sharing and analysis, health care data privacy is drawing more attention. However, to better protect biomedical data privacy, it is essential to assess the privacy risk in the first place. Objective: In China, there is no clear regulation for health systems to deidentify data. 
It is also not known whether a mechanism such as the Health Insurance Portability and Accountability Act (HIPAA) safe harbor policy will achieve sufficient protection. This study aimed to conduct a pilot study using patient data from Chinese hospitals to understand and quantify the privacy risks of Chinese patients. Methods: We used g-distinct analysis to evaluate the reidentification risks with regard to the HIPAA safe harbor approach when applied to Chinese patients' data. More specifically, we estimated the risks based on the HIPAA safe harbor and limited dataset policies by assuming an attacker has background knowledge of the patient from the public domain. Results: The experiments were conducted on 0.83 million patients (with data field of date of birth, gender, and surrogate ZIP codes generated based on home address) across 33 provincial-level administrative divisions in China. Under the Limited Dataset policy, 19.58\% (163,262/833,235) of the population could be uniquely identifiable under the g-distinct metric (ie, 1-distinct). In contrast, the Safe Harbor policy is able to significantly reduce privacy risk, where only 0.072\% (601/833,235) of individuals are uniquely identifiable, and the majority of the population is 3000-indistinguishable (ie, the population is expected to share common attributes with 3000 or fewer people). Conclusions: Through experiments based on real-world patient data, this work illustrates that the results of g-distinct analysis of Chinese patient privacy risk are similar to those from a previous US study, in which data from different organizations/regions might be vulnerable to different reidentification risks under different policies. This work provides a reference for Chinese health care entities estimating patients' privacy risk during data sharing, laying the foundation for future studies of the privacy risks of Chinese patients' data. 
", doi="10.2196/13046", url="https://medinform.jmir.org/2020/2/e13046", url="http://www.ncbi.nlm.nih.gov/pubmed/32022691" } @Article{info:doi/10.2196/16816, author="Wang, Jing and Deng, Huan and Liu, Bangtao and Hu, Anbin and Liang, Jun and Fan, Lingye and Zheng, Xu and Wang, Tong and Lei, Jianbo", title="Systematic Evaluation of Research Progress on Natural Language Processing in Medicine Over the Past 20 Years: Bibliometric Study on PubMed", journal="J Med Internet Res", year="2020", month="Jan", day="23", volume="22", number="1", pages="e16816", keywords="natural language processing", keywords="clinical", keywords="medicine", keywords="information extraction", keywords="electronic medical record", abstract="Background: Natural language processing (NLP) is an important traditional field in computer science, but its application in medical research has faced many challenges. With the extensive digitalization of medical information globally and increasing importance of understanding and mining big data in the medical field, NLP is becoming more crucial. Objective: The goal of the research was to perform a systematic review on the use of NLP in medical research with the aim of understanding the global progress on NLP research outcomes, content, methods, and study groups involved. Methods: A systematic review was conducted using the PubMed database as a search platform. All published studies on the application of NLP in medicine (except biomedicine) during the 20 years between 1999 and 2018 were retrieved. The data obtained from these published studies were cleaned and structured. Excel (Microsoft Corp) and VOSviewer (Nees Jan van Eck and Ludo Waltman) were used to perform bibliometric analysis of publication trends, author orders, countries, institutions, collaboration relationships, research hot spots, diseases studied, and research methods. 
Results: A total of 3498 articles were obtained during initial screening, and 2336 articles were found to meet the study criteria after manual screening. The number of publications increased every year, with a significant growth after 2012 (number of publications ranged from 148 to a maximum of 302 annually). The United States has occupied the leading position since the inception of the field, with the largest number of articles published. The United States contributed to 63.01\% (1472/2336) of all publications, followed by France (5.44\%, 127/2336) and the United Kingdom (3.51\%, 82/2336). The author with the largest number of articles published was Hongfang Liu (70), while St{\'e}phane Meystre (17) and Hua Xu (33) published the largest number of articles as the first and corresponding authors. Among first authors' affiliated institutions, Columbia University published the largest number of articles, accounting for 4.54\% (106/2336) of the total. Specifically, approximately one-fifth (17.68\%, 413/2336) of the articles involved research on specific diseases, and the subject areas primarily focused on mental illness (16.46\%, 68/413), breast cancer (5.81\%, 24/413), and pneumonia (4.12\%, 17/413). Conclusions: NLP is in a period of robust development in the medical field, with an average of approximately 100 publications annually. Electronic medical records were the most used research materials, but social media such as Twitter have become important research materials since 2015. Cancer (24.94\%, 103/413) was the most common subject area in NLP-assisted medical research on diseases, with breast cancers (23.30\%, 24/103) and lung cancers (14.56\%, 15/103) accounting for the highest proportions of studies. Columbia University and the talents trained therein were the most active and prolific research forces on NLP in the medical field. 
", doi="10.2196/16816", url="http://www.jmir.org/2020/1/e16816/", url="http://www.ncbi.nlm.nih.gov/pubmed/32012074" } @Article{info:doi/10.2196/14340, author="Jin, Yonghao and Li, Fei and Vimalananda, G. Varsha and Yu, Hong", title="Automatic Detection of Hypoglycemic Events From the Electronic Health Record Notes of Diabetes Patients: Empirical Study", journal="JMIR Med Inform", year="2019", month="Nov", day="8", volume="7", number="4", pages="e14340", keywords="natural language processing", keywords="convolutional neural networks", keywords="hypoglycemia", keywords="adverse events", abstract="Background: Hypoglycemic events are common and potentially dangerous conditions among patients being treated for diabetes. Automatic detection of such events could improve patient care and is valuable in population studies. Electronic health records (EHRs) are valuable resources for the detection of such events. Objective: In this study, we aim to develop a deep-learning--based natural language processing (NLP) system to automatically detect hypoglycemic events from EHR notes. Our model is called the High-Performing System for Automatically Detecting Hypoglycemic Events (HYPE). Methods: Domain experts reviewed 500 EHR notes of diabetes patients to determine whether each sentence contained a hypoglycemic event or not. We used this annotated corpus to train and evaluate HYPE, the high-performance NLP system for hypoglycemia detection. We built and evaluated both a classical machine learning model (ie, support vector machines [SVMs]) and state-of-the-art neural network models. Results: We found that neural network models outperformed the SVM model. The convolutional neural network (CNN) model yielded the highest performance in a 10-fold cross-validation setting: mean precision=0.96 (SD 0.03), mean recall=0.86 (SD 0.03), and mean F1=0.91 (SD 0.03). 
Conclusions: Despite the challenges posed by small and highly imbalanced data, our CNN-based HYPE system still achieved a high performance for hypoglycemia detection. HYPE can be used for EHR-based hypoglycemia surveillance and population studies in diabetes patients. ", doi="10.2196/14340", url="http://medinform.jmir.org/2019/4/e14340/", url="http://www.ncbi.nlm.nih.gov/pubmed/31702562" } @Article{info:doi/10.2196/15511, author="Tran, Xuan Bach and Nghiem, Son and Sahin, Oz and Vu, Manh Tuan and Ha, Hai Giang and Vu, Thu Giang and Pham, Quang Hai and Do, Thi Hoa and Latkin, A. Carl and Tam, Wilson and Ho, H. Cyrus S. and Ho, M. Roger C.", title="Modeling Research Topics for Artificial Intelligence Applications in Medicine: Latent Dirichlet Allocation Application Study", journal="J Med Internet Res", year="2019", month="Nov", day="1", volume="21", number="11", pages="e15511", keywords="artificial intelligence", keywords="applications", keywords="medicine", keywords="scientometric", keywords="bibliometric", keywords="latent Dirichlet allocation", abstract="Background: Artificial intelligence (AI)--based technologies develop rapidly and have myriad applications in medicine and health care. However, there is a lack of comprehensive reporting on the productivity, workflow, topics, and research landscape of AI in this field. Objective: This study aimed to evaluate the global development of scientific publications and constructed interdisciplinary research topics on the theory and practice of AI in medicine from 1977 to 2018. Methods: We obtained bibliographic data and abstract contents of publications published between 1977 and 2018 from the Web of Science database. A total of 27,451 eligible articles were analyzed. Research topics were classified by latent Dirichlet allocation, and principal component analysis was used to identify the construct of the research landscape. 
Results: The applications of AI have mainly impacted clinical settings (enhanced prognosis and diagnosis, robot-assisted surgery, and rehabilitation), data science and precision medicine (collecting individual data for precision medicine), and policy making (raising ethical and legal issues, especially regarding privacy and confidentiality of data). However, AI applications have not been commonly used in resource-poor settings owing to limited infrastructure and human resources. Conclusions: The application of AI in medicine has grown rapidly and focuses on three leading platforms: clinical practices, clinical material, and policies. AI might be one of the methods to narrow down the inequality in health care and medicine between developing and developed countries. Technology transfer and support from developed countries are essential measures for the advancement of AI application in health care in developing countries. ", doi="10.2196/15511", url="https://www.jmir.org/2019/11/e15511", url="http://www.ncbi.nlm.nih.gov/pubmed/31682577" } @Article{info:doi/10.2196/14083, author="Kim, Mina and Shin, Soo-Yong and Kang, Mira and Yi, Byoung-Kee and Chang, Kyung Dong", title="Developing a Standardization Algorithm for Categorical Laboratory Tests for Clinical Big Data Research: Retrospective Study", journal="JMIR Med Inform", year="2019", month="Aug", day="29", volume="7", number="3", pages="e14083", keywords="standardization", keywords="electronic health records", keywords="data quality", keywords="data science", abstract="Background: Data standardization is essential in electronic health records (EHRs) for both clinical practice and retrospective research. However, it is still not easy to standardize EHR data because of nonidentical duplicates, typographical errors, or inconsistencies. To overcome this drawback, standardization efforts have been undertaken for collecting data in a standardized format as well as for curating the stored data in EHRs. 
To perform clinical big data research, the stored data in EHRs should be standardized, starting from laboratory results, given their importance. However, most of the previous efforts have been based on labor-intensive manual methods. Objective: We aimed to develop an automatic standardization method for eliminating the noise from categorical laboratory data, grouping, and mapping of cleaned data using standard terminology. Methods: We developed a method called standardization algorithm for laboratory test--categorical result (SALT-C) that can process categorical laboratory data, such as pos +, 250 4+ (urinalysis results), and reddish (urinalysis color results). SALT-C consists of five steps. First, it applies data cleaning rules to categorical laboratory data. Second, it categorizes the cleaned data into 5 predefined groups (urine color, urine dipstick, blood type, presence-finding, and pathogenesis tests). Third, all data in each group are vectorized. Fourth, similarity is calculated between the vectors of data and those of each value in the predefined value sets. Finally, the value closest to the data is assigned. Results: The performance of SALT-C was validated using 59,213,696 data points (167,938 unique values) generated over 23 years from a tertiary hospital. Apart from the data whose original meaning could not be interpreted correctly (eg, ** and \_^), SALT-C mapped unique raw data to the correct reference value for each group with an accuracy of 97.6\% (123/126; urine color tests), 97.5\% (198/203; urine dipstick tests), 95\% (53/56; blood type tests), 99.68\% (162,291/162,805; presence-finding tests), and 99.61\% (4643/4661; pathogenesis tests). Conclusions: The proposed SALT-C successfully standardized the categorical laboratory test results with high reliability. SALT-C can be beneficial for clinical big data research by reducing laborious manual standardization efforts. 
", doi="10.2196/14083", url="http://medinform.jmir.org/2019/3/e14083/", url="http://www.ncbi.nlm.nih.gov/pubmed/31469075" } @Article{info:doi/10.2196/14126, author="Kim, Heon Ho and Kim, Bora and Joo, Segyeong and Shin, Soo-Yong and Cha, Soung Hyo and Park, Rang Yu", title="Why Do Data Users Say Health Care Data Are Difficult to Use? A Cross-Sectional Survey Study", journal="J Med Internet Res", year="2019", month="Aug", day="06", volume="21", number="8", pages="e14126", keywords="data anonymization", keywords="privacy act", keywords="data sharing", keywords="data protection", keywords="data linking", keywords="health care data demand", abstract="Background: There has been significant effort in attempting to use health care data. However, laws that protect patients' privacy have restricted data use because health care data contain sensitive information. Thus, discussions on privacy laws now focus on the active use of health care data beyond protection. However, current literature does not clarify the obstacles that make data usage and deidentification processes difficult or elaborate on users' needs for data linking from practical perspectives. Objective: The objective of this study is to investigate (1) the current status of data use in each medical area, (2) institutional efforts and difficulties in deidentification processes, and (3) users' data linking needs. Methods: We conducted a cross-sectional online survey. To recruit people who have used health care data, we publicized the promotion campaign and sent official documents to an academic society encouraging participation in the online survey. Results: In total, 128 participants responded to the online survey; 10 participants were excluded for either inconsistent responses or lack of demand for health care data. Finally, 118 participants' responses were analyzed. The majority of participants worked in general hospitals or universities (62/118, 52.5\% and 51/118, 43.2\%, respectively, multiple-choice answers). 
More than half of participants responded that they have a need for clinical data (82/118, 69.5\%) and public data (76/118, 64.4\%). Furthermore, 85.6\% (101/118) of respondents conducted deidentification measures when using data, and they considered rigid social culture as an obstacle for deidentification (28/101, 27.7\%). In addition, they required data linking (98/118, 83.1\%), and they noted deregulation and data standardization as necessary to allow health care data linking (33/98, 33.7\% and 38/98, 38.8\%, respectively). There were no significant differences in the proportions of reported data needs and linking needs between groups that used health care data for public or commercial purposes. Conclusions: This study provides a cross-sectional view from a practical, user-oriented perspective on the kinds of data users want to utilize, efforts and difficulties in deidentification processes, and the needs for data linking. Most users want to use clinical and public data, and most participants conduct deidentification processes and express a desire to conduct data linking. Our study confirmed that users noted regulation as a primary obstacle, whether their purpose was commercial or public. A legal system based on both data utilization and data protection needs is required. 
", doi="10.2196/14126", url="https://www.jmir.org/2019/8/e14126/", url="http://www.ncbi.nlm.nih.gov/pubmed/31389335" } @Article{info:doi/10.2196/11672, author="Fiske, Amelia and Prainsack, Barbara and Buyx, Alena", title="Data Work: Meaning-Making in the Era of Data-Rich Medicine", journal="J Med Internet Res", year="2019", month="Jul", day="09", volume="21", number="7", pages="e11672", keywords="big data", keywords="data work", keywords="medical informatics", keywords="internet", keywords="data interpretation", keywords="decision support systems", doi="10.2196/11672", url="https://www.jmir.org/2019/7/e11672/", url="http://www.ncbi.nlm.nih.gov/pubmed/31290397" } @Article{info:doi/10.2196/12702, author="Dankar, K. Fida and Madathil, Nisha and Dankar, K. Samar and Boughorbel, Sabri", title="Privacy-Preserving Analysis of Distributed Biomedical Data: Designing Efficient and Secure Multiparty Computations Using Distributed Statistical Learning Theory", journal="JMIR Med Inform", year="2019", month="Apr", day="29", volume="7", number="2", pages="e12702", keywords="data analytics", keywords="data aggregation", keywords="personal genetic information", keywords="patient data privacy", abstract="Background: Biomedical research often requires large cohorts and necessitates the sharing of biomedical data with researchers around the world, which raises many privacy, ethical, and legal concerns. In the face of these concerns, privacy experts are trying to explore approaches to analyzing the distributed data while protecting its privacy. Many of these approaches are based on secure multiparty computations (SMCs). SMC is an attractive approach allowing multiple parties to collectively carry out calculations on their datasets without having to reveal their own raw data; however, it incurs heavy computation time and requires extensive communication between the involved parties. 
Objective: This study aimed to develop usable and efficient SMC applications that meet the needs of the potential end-users and to raise general awareness about SMC as a tool that supports data sharing. Methods: We have introduced distributed statistical computing (DSC) into the design of secure multiparty protocols, which allows us to conduct computations on each of the parties' sites independently and then combine these computations to form 1 estimator for the collective dataset, thus limiting communication to the final step and reducing complexity. The effectiveness of our privacy-preserving model is demonstrated through a linear regression application. Results: Our secure linear regression algorithm was tested for accuracy and performance using real and synthetic datasets. The results showed no loss of accuracy (over nonsecure regression) and very good performance (20 min for 100 million records). Conclusions: We used DSC to securely calculate a linear regression model over multiple datasets. Our experiments showed very good performance (in terms of the number of records it can handle). We plan to extend our method to other estimators such as logistic regression. ", doi="10.2196/12702", url="http://medinform.jmir.org/2019/2/e12702/", url="http://www.ncbi.nlm.nih.gov/pubmed/31033449" } @Article{info:doi/10.2196/13043, author="McPadden, Jacob and Durant, JS Thomas and Bunch, R. Dustin and Coppi, Andreas and Price, Nathaniel and Rodgerson, Kris and Torre Jr, J. Charles and Byron, William and Hsiao, L. Allen and Krumholz, M. Harlan and Schulz, L. 
Wade", title="Health Care and Precision Medicine Research: Analysis of a Scalable Data Science Platform", journal="J Med Internet Res", year="2019", month="Apr", day="09", volume="21", number="4", pages="e13043", keywords="data science", keywords="monitoring, physiologic", keywords="computational health care", keywords="medical informatics computing", keywords="big data", abstract="Background: Health care data are increasing in volume and complexity. Storing and analyzing these data to implement precision medicine initiatives and data-driven research has exceeded the capabilities of traditional computer systems. Modern big data platforms must be adapted to the specific demands of health care and designed for scalability and growth. Objective: The objectives of our study were to (1) demonstrate the implementation of a data science platform built on open source technology within a large, academic health care system and (2) describe 2 computational health care applications built on such a platform. Methods: We deployed a data science platform based on several open source technologies to support real-time, big data workloads. We developed data-acquisition workflows for Apache Storm and NiFi in Java and Python to capture patient monitoring and laboratory data for downstream analytics. Results: Emerging data management approaches, along with open source technologies such as Hadoop, can be used to create integrated data lakes to store large, real-time datasets. This infrastructure also provides a robust analytics platform where health care and biomedical research data can be analyzed in near real time for precision medicine and computational health care use cases. Conclusions: The implementation and use of integrated data science platforms offer organizations the opportunity to combine traditional datasets, including data from the electronic health record, with emerging big data sources, such as continuous patient monitoring and real-time laboratory results. 
These platforms can enable cost-effective and scalable analytics for the information that will be key to the delivery of precision medicine initiatives. Organizations that can take advantage of the technical advances found in data science platforms will have the opportunity to provide comprehensive access to health care data for computational health care and precision medicine research. ", doi="10.2196/13043", url="https://www.jmir.org/2019/4/e13043/", url="http://www.ncbi.nlm.nih.gov/pubmed/30964441" }