Machine Learning and Natural Language Processing in Mental Health: Systematic Review

Background: Machine learning systems are part of the field of artificial intelligence and automatically learn models from data to make better decisions. Natural language processing (NLP), by using corpora and learning approaches, provides good performance in statistical tasks, such as text classification or sentiment mining.

Objective: The primary aim of this systematic review was to summarize and characterize, in methodological and technical terms, studies that used machine learning and NLP techniques for mental health. The secondary aim was to consider the potential use of these methods in mental health clinical practice.

Methods: This systematic review follows the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines and is registered with PROSPERO (Prospective Register of Systematic Reviews; number CRD42019107376). The search was conducted using 4 medical databases (PubMed, Scopus, ScienceDirect, and PsycINFO) with the following keywords: machine learning, data mining, psychiatry, mental health, and mental disorder. The exclusion criteria were as follows: languages other than English, anonymization process, case studies, conference papers, and reviews. No limitations on publication dates were imposed.

Results: A total of 327 articles were identified, of which 269 (82.3%) were excluded and 58 (17.7%) were included in the review. The results were organized through a qualitative perspective. Although studies had heterogeneous topics and methods, some themes emerged. Study populations could be grouped into 3 categories: patients included in medical databases, patients who came to the emergency room, and social media users. The main objectives were to extract symptoms, classify severity of illness, compare therapy effectiveness, provide psychopathological clues, and challenge the current nosography. Medical records and social media were the 2 major data sources. With regard to the methods used, preprocessing relied on the standard methods of NLP and on unique identifier extraction dedicated to medical texts. Efficient classifiers were preferred over transparent classifiers. Python was the most frequently used platform.

Conclusions: Machine learning and NLP models have been highly topical issues in medicine in recent years and may be considered a new paradigm in medical research. However, these processes tend to confirm clinical hypotheses rather than developing entirely new information, and only one major category of the population (ie, social media users) is an imprecise cohort. Moreover, some language-specific features can improve the performance of NLP methods, and their extension to other languages should be more closely investigated. However, machine learning and NLP techniques provide useful information from unexplored data (ie, patients’ daily habits that are usually inaccessible to care providers). Before these methods can be considered an additional tool of mental health care, ethical issues remain and should be discussed in a timely manner. Machine learning and NLP methods may offer multiple perspectives in mental health research but should also be considered as tools to support clinical practice.


Population
Patients mechanically restrained between 1 hour and 3 days after admission to psychiatry

Danish
EHRs (MidtEPJ) unstructured clinical notes
Volume: 5,050 patients with a total of 8,869 admissions
Goal: To investigate whether incident mechanical restraint occurring in the first 3 days following admission can be predicted based on analysis of electronic health data available after the first hour of admission
Method: The data are EHRs with a few natural-language notes, to which topic detection is applied, although the topics are apparently not used for classification. POS tagging and lemmatization; only nouns, verbs, adjectives, and pronouns kept. Classification by logistic regression, LASSO, neural networks, random forest, and SVM. SAS platform used.
Results: Best performance by a random forest algorithm that predicted mechanical restraint (MR) with an area under the curve of 0.87 (95% CI 0.79-0.93).
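The area under the curve reported above can be read as the probability that a randomly chosen positive case receives a higher predicted risk than a randomly chosen negative case. The following is a minimal sketch of that computation, not the study's SAS pipeline; the function name `auc_from_scores` and the toy scores and labels are assumptions made purely for illustration.

```python
def auc_from_scores(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair (Mann-Whitney U formulation).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted risks for six admissions (1 = mechanical restraint).
toy_scores = [0.91, 0.40, 0.75, 0.20, 0.65, 0.30]
toy_labels = [1, 0, 1, 0, 0, 1]
toy_auc = auc_from_scores(toy_scores, toy_labels)
```

The pairwise loop is quadratic in the number of cases, which is acceptable for a sketch; a rank-based formulation would be preferable on cohorts of several thousand admissions.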

Population
Suicides and undetermined deaths among adults (55 years old and older) living in or transitioning to residential long-term care and listed in the National Violent Death Reporting System (NVDRS)

Data
Narratives abstracted from coroner reports
Volume: 47,759 deaths, including 42,576 suicides, 279 deaths due to unintentional firearm injury, and 4,904 undetermined deaths
Goal: To estimate the number of suicides associated with residential long-term care among adults 55 and older. To identify whether machine learning tools could improve the quality of suicide surveillance data.
Method: Search for terms from a curated list, then training on the corpus of documents containing those terms, with TF-IDF weights. Classification by random forests. Python NLTK platform used.
Results: Among 47,759 deaths, the algorithm identified 1,037 associated with long-term care.

Goal: To analyze sentiment in children with neurodevelopmental disorders
Method: Google Sentiment Analysis tool applied to short Spanish texts written by the subjects after watching videos. Chi-square test, Fisher test, linear regression. Stata software used.
Results: Although everybody knew the rules of soccer, when the participants punished the transgressor, a preference for members of their own group was observed, except for the ASD group. Children with ASD seem not to base their opinion on their group membership, but rather on precise adherence to regulations.

Partners HealthCare System Research Patient Data Registry, a clinical data warehouse that gathers medical records for nearly 4.6 million patients from Massachusetts General Hospital (MGH) and Brigham and Women's Hospital (BWH)

Data
Ambulatory notes, discharge summaries, EPIC progress reports (such as emergency department (ED) observation progress notes, labor and delivery notes, lactation notes, progress notes, and significant event notes), operative notes. Pathology, cardiology, endoscopy, pulmonary, and radiology reports.
Volume: 273,410 women with at least one CUI related to pregnancy or delivery, of whom 23,098 had mentions of CUIs related to suicidal behavior during pregnancy or within 42 days after abortion or delivery
Goal: To develop a classification algorithm that would accurately identify pregnant women with suicidal behavior
Method: UMLS CUI extraction via cTAKES, with additional suicide-related features added manually. Classification using logistic regression (elastic net). R platform used.
Results: Best AUC value: 0.83 for an algorithm using ICDs, extracted CUIs, and additional expert-curated features: feeling hopeless, feeling relief, tired, love, feeling empty, feeling content, low self-esteem, impulsive character, isolation, distractibility, childhood adversity, adult sexual abuse, severe depression, substance abuse problem, personality disorders, psychotic disorders, seizures, anxiety disorders, wound and injury, abortion.

Population
Patients registered in the Clinical Record Interactive Search (CRIS) system (South London and Maudsley). Suicide attempt cohort, suicidal ideation cohort.

Data
EHRs: free text and correspondence between patient and clinical staff
Volume: 500 documents selected out of 188,843 from a suicidal ideation cohort, and 500 documents selected out of 542,769 from a suicide attempt cohort.
Goal: To develop two NLP tools (one for detecting the presence of recorded suicidal ideation, and one for detecting a recorded suicide attempt) and to compare them with manual text annotation.
Method: EHRs and correspondence, bag-of-words, POS tags, stemming, detection of negation, mention of another person, temporal irrelevance. Manually constructed list of 150 terms. Classification by SVM. GATE platform used.
Results: The rule-based algorithm achieved a sensitivity of 87.8% and a precision of 91.7%. The hybrid algorithm achieved a sensitivity of 98.2% and a precision of 82.8%.
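Sensitivity (recall) and precision pull in opposite directions for the two algorithms above, so a single summary such as the F-measure can help compare them. The sketch below derives F1 from the reported percentages; the resulting values are arithmetic on those figures and are not reported in the source, and the names `f1_score`, `reported`, and `derived_f1` are illustrative assumptions.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (sensitivity)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures reported above, stored as (sensitivity, precision) per algorithm.
reported = {
    "rule-based": (0.878, 0.917),
    "hybrid": (0.982, 0.828),
}

# Derived values only: the source reports sensitivity and precision, not F1.
derived_f1 = {
    name: f1_score(precision, sensitivity)
    for name, (sensitivity, precision) in reported.items()
}
```

Both derived values fall just under 0.90, which suggests the two tools trade precision for sensitivity at a roughly constant overall level.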

Population
Twitter profiles with at least 100 posted tweets

Data
Tweets
Volume: 4,000 Twitter users randomly selected (half taken from a set of 7,046 users (21M tweets) who had self-declared their depression, and the other half having no depression terms in their profile descriptions)
Goal: To detect depressive symptoms in Twitter profiles
Method: Use of a lexicon and of topics. Because LDA alone is insufficient, supervision is added to it (FOL LDA by Andrzejewski 2011), so that terms "strongly related" to a list of curated terms are detected. The algorithm (ssToT) is described in the paper. It is applied to tweets, and topic coherence is measured. The method is compared with other topic detection methods. Python NLTK platform used.
Results: The ssToT model allowed identification of clinical depressive symptoms with an accuracy of 68% and a precision of 72%. The ssToT model is competitive with supervised approaches in terms of F-score.

Maguen et al. 2018 [63] Context
Evidence-based psychotherapy in PTSD

Goal: To create an initial training set for digital dating abuse, and to classify text messages as abusive or nonabusive.
Method: Texts lemmatized, count-vectorized, and TF-IDF weighted. Classification by linear SVMs, multinomial Naïve Bayes, and decision trees on n-grams (n ≤ 3). Python scikit-learn platform used.
Results: Best accuracy (0.89) achieved by a linear SVM with unigrams and TF-IDF weights.
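The feature construction described above (word n-grams up to length 3, weighted by TF-IDF) can be sketched in plain Python. This is an illustration under stated assumptions, not the study's scikit-learn pipeline: it uses raw term frequency and an unsmoothed `log(N / df)` inverse document frequency, and the helper names and toy messages are invented for the example.

```python
import math
from collections import Counter


def ngrams(tokens, max_n=3):
    """Return all word n-grams of length 1..max_n as space-joined strings."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tfidf(documents, max_n=3):
    """Compute one {n-gram: weight} dictionary per document.

    Uses raw counts and idf = log(N / df); scikit-learn applies a smoothed
    variant with normalization, so its values will differ.
    """
    tokenized = [ngrams(doc.lower().split(), max_n) for doc in documents]
    doc_freq = Counter()
    for grams in tokenized:
        doc_freq.update(set(grams))  # count each n-gram once per document
    n_docs = len(documents)
    weights = []
    for grams in tokenized:
        counts = Counter(grams)
        weights.append(
            {g: c * math.log(n_docs / doc_freq[g]) for g, c in counts.items()}
        )
    return weights


# Hypothetical messages, invented for illustration only.
toy_messages = [
    "where are you right now",
    "you never answer me",
    "see you at lunch",
]
toy_weights = tfidf(toy_messages)
```

An n-gram that occurs in every message (here the unigram "you") receives a weight of zero, which is the property that lets TF-IDF down-weight uninformative terms before classification.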

Volume
An ad hoc dataset of 4,247 posts and 34,118 comments by 3,029 users of the proed forum on Reddit.
Goal: To analyze data on an eating-disorder social forum.
Method: Punctuation, numbers, stop words, and hapaxes removed, then LDA applied (with 9 and 11 topics). No classification, only Spearman correlation between topics. R platform used.
Results: The aim was not to report results but to demonstrate strategies and the potential of big data approaches in social media.
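Spearman correlation between two topics compares how their per-document proportions rank, rather than their raw values. A minimal sketch follows, assuming hypothetical topic proportions; the function names are illustrative, tied values share their average rank, and a topic with constant proportions would cause a division by zero that this sketch does not guard against.

```python
def ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(ranks(x), ranks(y))


# Hypothetical proportions of two topics across five posts.
topic_a = [0.10, 0.40, 0.25, 0.60, 0.05]
topic_b = [0.30, 0.50, 0.35, 0.90, 0.20]
rho = spearman(topic_a, topic_b)
```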

Population
Twitter users in Los Angeles, New York, San Diego, San Francisco

Data
Neuropsychiatric clinical records (Transnosographic CEGS N-GRID 2016)
Volume: 1,000 records
Goal: To establish associations of clinical and social parameters with violent behavior among psychiatric patients
Method: Text spell-corrected. UMLS CUI extraction (cTAKES). To detect violent behavior, 49 questions were manually selected and extracted by manual rules. The words "violent" and "violence" were then detected (also considering negation). Clinical data combined with linguistic data using association rule mining. .NET platform used.
Results: Stimulants, family history of violent behavior, suicidal behaviors, and financial stress are strongly associated with violent behavior.
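Detecting the words "violent" and "violence" while accounting for negation can be approximated with a small rule. The sketch below is an assumption-laden illustration, not the study's rule set: the list of negation cues and the four-token look-back window are hypothetical choices, and `mentions_violence` is an invented name.

```python
import re

VIOLENCE_TERMS = re.compile(r"\b(violent|violence)\b", re.IGNORECASE)
# Hypothetical negation cues; the study's actual rules are not given here.
NEGATION_CUES = re.compile(
    r"\b(no|not|denies|denied|without|never)\b", re.IGNORECASE
)


def mentions_violence(sentence, window=4):
    """Return True if a violence term appears with no negation cue
    among the `window` tokens that precede it."""
    tokens = sentence.split()
    for i, token in enumerate(tokens):
        if VIOLENCE_TERMS.search(token):
            preceding = " ".join(tokens[max(0, i - window):i])
            if not NEGATION_CUES.search(preceding):
                return True
    return False
```

For instance, `mentions_violence("Patient denies any violent behavior")` returns `False`, whereas a sentence describing a history of violent behavior returns `True`. A fixed window is a crude stand-in for proper negation scope detection and will miss cues that follow the term.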

3,611,528 tweets
Goal: To demonstrate that the geographic variation of social media posts mentioning prescription opioid misuse strongly correlates with government estimates of MUPO (Misuse of Prescription Opioids) in the last month
Method: Twitter corpus with geolocation (Carmen). Lemmatization and removal of stop words, then WordNet-induced semantic similarity between tweets, K-means with silhouette and PCA. Python NLTK platform used.
Results: Mentions of MUPO on Twitter correlate strongly with state-by-state NSDUH estimates of MUPO. Natural language processing can be used to analyze social media to provide insights for syndromic toxicosurveillance.

Population
Male or female, aged 18 or older discharged after self-harm from emergency services or after being hospitalized for less than 7 days, able to be contacted by phone.

Population
Patients following motivational interviewing

Data
Notes taken by clinicians in motivational interviewing sessions
Volume: 1.7 million words.
Goal: To compare two NLP methods for automated coding of motivational interviewing (MI)
Method: The texts of psychotherapy sessions were manually labeled as a gold-standard corpus. Two approaches were used to automate the task: (a) n-grams (n ≤ 3) and dependency parse trees (Stanford parser) with multinomial regression, and (b) GloVe embeddings with POS-dependent weights and RNNs. The Emulab platform was used.
Results: Dependency trees performed equally well or better than RNNs.

Luo et al. 2016 [48] Context Autism
Population 27 adults with ASD and 132 matched controls

Data
Written questionnaires with verbal descriptions of the patients' relatives

participants
Goal: To reveal patterns in descriptions of social relations by adults with ASD
Method: What the authors call a "semantic network" is in fact a graph of co-occurrences of the words with the largest contributions to LSA dimensions. The density of the graphs (called "connectivity" by the authors) and their clustering coefficients are measured. Classification with linear/quadratic regression and SVMs. Matlab platform used.
Results: There is a difference in word connectivity patterns between the ASD and the typical participants, with the ASD participants' semantic network exhibiting fewer "small-world" characteristics.

Goal: To identify suicidal subjects in EDs.
Method: Corpus of interview transcriptions. No information is given on text preprocessing other than "converting it to a matrix", meaning a matrix with interviewees as rows and responses as cells. SVM was chosen for classification, followed by k-means to show the shape of the two groups. Leave-one-out was used to separate testing and training data when the sets were small. Platform not mentioned.
Results: The number of unique words was significantly different between suicidal and non-suicidal subjects. SVM classified 96.67% of the subjects accurately compared with the Columbia Suicide Severity Rating Scale (C-SSRS).
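The leave-one-out scheme mentioned above holds out each subject once, trains on all the others, and scores the held-out subject. The sketch below illustrates the loop only: it swaps in a nearest-centroid classifier in place of the SVM used in the study so that it stays dependency-free, and the feature vectors and labels are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, sample):
    """Predict the label whose class centroid is closest to `sample`.

    Stand-in classifier for illustration; the study used an SVM.
    """
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        members = [x for x, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        dist = sum((a - b) ** 2 for a, b in zip(sample, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def leave_one_out_accuracy(features, labels):
    """Hold out each subject once, train on the rest, score the held-out one."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if nearest_centroid_predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)


# Hypothetical two-feature vectors for six subjects in two groups.
toy_features = [[12, 3.0], [14, 3.5], [11, 2.8], [30, 6.0], [28, 5.5], [33, 6.4]]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_accuracy = leave_one_out_accuracy(toy_features, toy_labels)
```

Leave-one-out uses nearly all the data for training in each round, which is why it suits small cohorts, at the cost of fitting the model once per subject.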

Population
Patients admitted to ED at Hôpital de la Croix-Rousse (Lyon) in 2011 and 2012, for suicide attempt or suicidal ideation plus a control group

records
Goal: To predict the annual rate of emergency department visits for suicide (compared to the national surveillance system based on manual coding by emergency practitioners)
Method: UMLS CUI extraction. Seven classification methods (predictive association rules, decision trees, neural networks, logistic regression, random forests, SVMs, Naïve Bayes). R platform used.
Results: Methods with the best F-measures were the random forest method (95.3%) and the Naïve Bayes classifier (95.3%). The number of cases of suicidal ideation (false positive suicide attempts) detected by the random forest method was higher (94 vs. 93). Random forests, Naïve Bayes, SVM, association rules, and decision trees gave estimates close to the gold standard method (manual classification) and would be valuable for epidemiological surveillance of suicide attempts.

Results: The best method achieved a precision of 57.2% and a recall of 46.8% for negative event extraction. For depression tendency analysis, the model has better recall than the legacy method (66.8% vs. 57.1%) but weaker precision (59.3% vs. 66.6%).

Data
Free-text, demographic questionnaire.

participants
Goal: To give an overview of the procedure for automated textual assessment of patients' self-narratives for PTSD screening. To compare the performance of different classification models in conjunction with n-gram representations in the screening process.
Method: Bag-of-words, n-grams, stop-word removal, stemming, feature selection by χ², classification by decision trees, Naïve Bayes, SVM, and product score model. Platform not mentioned.
Results: Best recall (95%) obtained by trigrams and SVM. Best specificity (81%) by unigrams+bigrams and product score model.
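Feature selection by χ² scores each term by how strongly its presence depends on the class label and keeps the highest-scoring terms. The following is a minimal sketch under stated assumptions: binary labels, term presence or absence per document, and invented toy documents; `chi_square` and `select_top_k` are illustrative names, not functions from the study.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/class contingency table.

    n11: positive-class documents containing the term
    n10: negative-class documents containing the term
    n01: positive-class documents without the term
    n00: negative-class documents without the term
    """
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return numerator / denominator if denominator else 0.0


def select_top_k(doc_terms, labels, k):
    """Rank terms by chi-square against binary labels and keep the top k.

    Ties are broken alphabetically so the result is deterministic.
    """
    vocabulary = sorted(set().union(*doc_terms))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    scored = []
    for term in vocabulary:
        n11 = sum(1 for terms, y in zip(doc_terms, labels) if y == 1 and term in terms)
        n10 = sum(1 for terms, y in zip(doc_terms, labels) if y == 0 and term in terms)
        scored.append((chi_square(n11, n10, n_pos - n11, n_neg - n10), term))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:k]]


# Hypothetical documents as term sets; 1 marks a positive screening label.
toy_docs = [
    {"nightmare", "sleep"},
    {"nightmare", "work"},
    {"sleep", "work"},
    {"work", "lunch"},
]
toy_labels = [1, 1, 0, 0]
top_terms = select_top_k(toy_docs, toy_labels, k=1)
```

In the toy data, "nightmare" occurs only in positive documents and therefore ranks first, while "sleep" is evenly split between the classes and scores zero.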