Social Media as a Research Tool (SMaaRT) for Risky Behavior Analytics: Methodological Review

Background: Modifiable risky health behaviors, such as tobacco use, excessive alcohol use, being overweight, lack of physical activity, and unhealthy eating habits, are some of the major factors for developing chronic health conditions. Social media platforms have become indispensable means of communication in the digital era. They provide an opportunity for individuals to express themselves, as well as share their health-related concerns with peers and health care providers, with respect to risky behaviors. Such peer interactions can be utilized as valuable data sources to better understand inter- and intrapersonal psychosocial mediators and the mechanisms of social influence that drive behavior change.
Objective: The objective of this review is to summarize computational and quantitative techniques facilitating the analysis of data generated through peer interactions pertaining to risky health behaviors on social media platforms.
Methods: We performed a systematic review of the literature in September 2020 by searching three databases (PubMed, Web of Science, and Scopus) using relevant keywords, such as “social media,” “online health communities,” “machine learning,” and “data mining.” The reporting of the studies was directed by the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines. Two reviewers independently assessed the eligibility of studies based on the inclusion and exclusion criteria. We extracted the required information from the selected studies.
Results: The initial search returned a total of 1554 studies, and after careful analysis of titles, abstracts, and full texts, a total of 64 studies were included in this review. We extracted the following key characteristics from all of the studies: social media platform used for conducting the study, risky health behavior studied, the number of posts analyzed, study focus, key methodological functions and tools used for data analysis, evaluation metrics used, and summary of the key findings. The most commonly used social media platform was Twitter, followed by Facebook, QuitNet, and Reddit. The most commonly studied risky health behavior was nicotine use, followed by drug or substance abuse and alcohol use. Various supervised and unsupervised machine learning approaches were used for analyzing textual data generated from online peer interactions. Few studies utilized deep learning methods for analyzing textual data as well as image or video data. Social network analysis was also performed, as reported in some studies.
Conclusions: Our review consolidates the methodological underpinnings for analyzing risky health behaviors and has enhanced our understanding of how social media can be leveraged for nuanced behavioral modeling and representation. The knowledge gained from our review can serve as a foundational component for the development of persuasive health communication and effective behavior modification technologies aimed at the individual and population levels.


Introduction
Modifiable risky health behaviors, such as tobacco use, excessive alcohol use, being overweight, lack of physical activity, and unhealthy eating habits, are some of the major factors for developing chronic health conditions [1]. Chronic health conditions, such as cancer and heart disease, lead to approximately 1.5 million deaths per year in the United States [2]. These chronic health conditions together with diabetes are also responsible for nearly US $3.5 trillion in annual economic costs; hence, it becomes crucial to prevent and/or efficiently manage such conditions [2]. Behavior modification is pivotal for managing chronic health conditions, and a range of psychological and social processes have been shown to influence the engagement of an individual in the adoption of positive healthy behaviors [3,4]. Traditionally, the methods used for measuring and studying health-related behaviors in populations include telephone or internet-based surveys [5], motivational interviews [6], commercial wearables and smartphone apps [7], and ecological momentary assessment [8].
Recently, social media has emerged as a viable platform for studying and analyzing health-related behaviors and promoting behavior change [9]. The field of infodemiology [10] examines the determinants and distribution of health information in the electronic medium (eg, social media and internet) for public health purposes: preventing diseases via predictive modeling [11-13], informing policy regulations [14], assessing the quality of health information on websites [15], and analyzing the health-related behaviors of individuals [16-18]. The recent COVID-19 pandemic has also shown how analyzing communication on such platforms can provide insights into the attitudes and behaviors of individuals as well as health care providers [19,20]. Social media, through its various mobile and web-based technologies, provides interactive platforms for individuals and communities to share, create, modify, and discuss content in the form of ideas, messages, or information [21]. In recent years, the penetration of social media platforms has increased in all spheres of life. According to the Global Digital Report of 2019, there are about 3.5 billion active social media users throughout the world, with Facebook being the most dominant social networking website. More than two-thirds of the world's population use a mobile device, mostly a smartphone. Powered by these connected devices, many older adults as well as teenagers have also started incorporating social media into their daily routines [22].
Consequently, social media has become an important part of the public health landscape, given that these platforms are increasingly being used by health care consumers for gaining knowledge on a variety of health-related topics as well as for interacting with their peers and health care providers to garner social support, mostly informational and emotional in nature [23,24]. These platforms are widely used by health care consumers to (1) meet their health-related goals [25] and (2) adopt positive health behaviors [26,27]. Research has shown that an individual is more likely to comply with health-related goals and adhere to preventive practices provided their social ties also engage in similar behaviors [28,29]. The major advantages of using such platforms over standard approaches for studying and analyzing health promotion and behavior change include their ability to reach a wider and less accessible audience, cost-effective recruitment of participants for research, and their round-the-clock accessibility via mobile and web-based connections [30]. These platforms can leverage group norms; thus, behavior change interventions implemented through these platforms have the potential to make a significant impact through widespread diffusion of preventive programs to meet the needs of individuals, communities, and populations.
These online platforms can be broadly classified into two major categories: (1) open social media platforms (eg, Facebook, Twitter, and Reddit), which are generic platforms used for networking, information sharing, and collaboration, and (2) intentionally designed health-related social media platforms (eg, QuitNet [31] and BecomeAnEX.org [32]), which focus on providing health-specific support to their members. Even though open social media platforms provide opportunities for large-scale inferences about behaviors of individuals, they fall short in providing context-specific interactional observations, for which we need to turn to intentionally designed social media platforms [33]. Depending on whether or not a social media platform has a specific focus on health topics, the environmental factors affecting an individual's attempt to sustain positive health changes can greatly vary, thus affecting contextual granularities that inform the accuracy and reliability of computational and quantitative data modeling approaches. Despite these differences, the universal presence of these platforms has led to the generation of invaluable and large data sets of electronic traces of peer interactions in the form of text, images, or videos (eg, traditional forums like Facebook and YouTube). These data sets capture the attitudes and behaviors of individuals in near real time and in natural settings as compared to conventional settings, which involve the presence of a researcher and are prone to instrument bias [34]. The analysis of such data sets provides us with an opportunity to understand the individualistic as well as environmental factors underlying behavior change, which can eventually guide the design and development of network interventions for health-related behavior change [35-37].
Traditional methods of qualitative data analysis are not conducive to analyzing large amounts of data generated by social media platforms. Recent advances in automated text analysis provide us with suitable methods for analyzing digital content generated from social media platforms. A recent review highlights the breakthroughs in computational technologies that are currently being applied to the field of health care in the form of digitized data acquisition, machine learning (ML) techniques, and computing infrastructure [38]. In addition to advances in predictive analytics and combinatorial forces from mobile computing and the internet, participatory social media has resulted in rich, just-in-time data that can be leveraged to conduct digital phenotyping of health consumer engagement in self-management of risky health behaviors.
The objective of this review is to summarize computational and quantitative approaches that highlight the potential of using social media as a research tool (SMaaRT) to understand the patterns of inter- and intrapersonal psychosocial factors associated with the prevention and management of risky health behaviors. These methodologies can provide a comprehensive understanding of the most common practices, their utility, limitations, and resulting inferences, thus providing health researchers with capabilities to better describe health behaviors at scale. The enhanced understanding from these secondary analyses can ultimately be infused into the design processes of effective behavioral interventions through the translation of data-driven insights into practical public health solutions via scalable techniques, such as tailored messaging and persuasive environment design.

Overview
We conducted a systematic review of the literature to summarize the computational and quantitative methods for analyzing social media data that have been used to study risky health behaviors. We followed the guidelines outlined by PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) [39] to retrieve relevant studies.

Literature Search Strategy
We searched the literature in September 2020, collecting studies published between 2011 and September 11, 2020. We searched three different databases (PubMed, Web of Science, and Scopus) using a specific set of keywords. Our search keywords lie at the intersection of two key clusters: social media and ML. We also included Medical Subject Headings (MeSH) for relevant keywords to ensure our search was as inclusive as possible. The search was conducted using the following query: ("Social Media" [MeSH] OR "social media" OR "Online Health Community" OR "Online Health Communities" OR "Online Social Network" OR "Online Social Networks" OR "peer to peer" OR "Peer Influence" [MeSH]) AND ("Machine Learning" [MeSH] OR "machine learning" OR "text mining" OR "Natural Language Processing" [MeSH] OR "natural language processing" OR "Data Mining" [MeSH] OR "data mining" OR "network models"). In addition, we also examined the reference lists of studies that met our inclusion criteria for any additional sources.
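The structure of the query (two keyword clusters, each joined with OR, then combined with AND) can be illustrated with a short Python sketch. The function names `or_group` and `build_query` are hypothetical helpers introduced only for illustration; the terms are copied from the query above, and database-specific syntax (eg, field tags in Web of Science or Scopus) is not handled.

```python
# Minimal sketch: assemble the boolean search string from two keyword clusters.
# Helper names are illustrative assumptions, not part of any database API.

SOCIAL_MEDIA_TERMS = [
    '"Social Media" [MeSH]',
    '"social media"',
    '"Online Health Community"',
    '"Online Health Communities"',
    '"Online Social Network"',
    '"Online Social Networks"',
    '"peer to peer"',
    '"Peer Influence" [MeSH]',
]

ML_TERMS = [
    '"Machine Learning" [MeSH]',
    '"machine learning"',
    '"text mining"',
    '"Natural Language Processing" [MeSH]',
    '"natural language processing"',
    '"Data Mining" [MeSH]',
    '"data mining"',
    '"network models"',
]


def or_group(terms):
    """Join the terms of one cluster with OR and wrap them in parentheses."""
    return "(" + " OR ".join(terms) + ")"


def build_query(*clusters):
    """Combine clusters with AND so a record must match every cluster."""
    return " AND ".join(or_group(cluster) for cluster in clusters)


query = build_query(SOCIAL_MEDIA_TERMS, ML_TERMS)
print(query)
```

Keeping the clusters as separate lists makes it straightforward to add a third cluster (eg, behavior-specific terms) without editing the string by hand.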

Inclusion and Exclusion Criteria
The inclusion and exclusion criteria to determine eligibility of studies for the review are listed in Textbox 1.
Textbox 1. Eligibility criteria for the studies.
Inclusion criteria:
1. Studies conducted original research that was published in a peer-reviewed journal.
2. Studies used English language-based social media platforms (ie, the language of generated content is in the English language).
3. Studies conducted data analysis at scale using computational or quantitative methods like machine learning techniques, network modeling, and/or visualization techniques.
4. Studies focused on risky health behaviors, or related attitudes or beliefs, of the patients or health consumers such as nicotine use, alcohol use, drug or substance abuse, physical activity or inactivity patterns, or obesity-related behaviors.
5. Studies focused primarily on analyzing textual content from online social media platforms (eg, YouTube comments instead of YouTube videos).
Exclusion criteria:
1. Studies described the use of social media platforms for other purposes (eg, recruitment and data collection).
2. Studies focused on health care providers instead of patients or health consumers.
3. Studies focused on behaviors unrelated to health.

Data Extraction
Two authors (TS and SM) independently assessed the retrieved studies against the inclusion criteria in two stages. In the first stage, the authors reviewed the titles and abstracts of all the retrieved studies for their inclusion in full-text screening. In the second stage, the authors performed the full-text screening of the relevant studies identified from the first stage for final inclusion in this review. Disagreements were resolved through discussion between the two authors. The interrater agreement, Cohen κ, was calculated at both stages. After screening the studies that met our inclusion criteria, we extracted the relevant data from the main text, which included the following:
1. Risky health behavior studied, such as nicotine use, alcohol use, drug or substance abuse, physical activity or inactivity patterns, obesity-related behaviors, etc.
2. Social media platform used for the study, whether it was an open social network, such as Twitter or Facebook, or a disease-specific social network, such as QuitNet (ie, smoking cessation).
3. Number of posts: total number of posts used for analysis and number of posts used for manual annotations.
4. Study focus: what were the underlying aims of the study for analyzing risky health behaviors?
5. Key methodological functions and tools; for example, topic modeling (ie, function) was performed using latent Dirichlet allocation (LDA) (ie, method).
6. Evaluation metrics used by the study (eg, precision, recall, and F1 score).
7. Key findings of the study: results obtained after analyzing the data generated from online peer interactions.
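For readers unfamiliar with the metrics named in this section, the following Python sketch shows one common way to compute Cohen κ for two raters and precision, recall, and F1 score for a binary label. It is an illustrative sketch, not the procedure used in this review: the screening decisions in the example are invented, and the function names are hypothetical. Cohen κ is computed as (p_o − p_e)/(1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty item set.")
    n = len(rater_a)
    # Observed agreement: share of items given identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # Both raters used a single identical label throughout.
    return (observed - expected) / (1 - expected)


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 score for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Invented screening decisions for 10 hypothetical abstracts.
reviewer_1 = ["include", "include", "exclude", "exclude", "exclude",
              "include", "exclude", "exclude", "exclude", "exclude"]
reviewer_2 = ["include", "exclude", "exclude", "exclude", "exclude",
              "include", "exclude", "exclude", "include", "exclude"]

kappa = cohen_kappa(reviewer_1, reviewer_2)
precision, recall, f1 = precision_recall_f1(reviewer_1, reviewer_2, "include")
print(round(kappa, 3), round(precision, 3), round(recall, 3), round(f1, 3))
```

In this toy example the raters agree on 8 of 10 items, yet κ is noticeably lower than 0.8 because a large share of that agreement would be expected by chance when most items are excluded.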

Overview
The initial search resulted in a total of 1554 studies. From these, we removed 203 studies because of duplication. In the first stage, we reviewed the titles and abstracts of the remaining studies to ensure that they met the inclusion and exclusion criteria for further thorough analysis. The interrater agreement at the first stage was 81.37%. After resolving disagreements through discussion, we initially excluded 1246 studies that did not meet the inclusion criteria and included the remaining 105 studies for full-text screening in the second stage. The interrater agreement at the second stage was 83.50%. A total of 52 studies meeting the inclusion criteria were included in the review. We further identified 12 additional studies through the snowballing technique that were also included in this review. Thus, a total of 64 studies were included in the final review. Of the studies reviewed, 55 (86%) studies were published from 2016 onward 97,98,[100][101][102], while only 9 (14%) studies were published between 2013 and 2015 [62-67,96,99,103]. None of the studies were published before 2013. Figure 1 shows the PRISMA diagram highlighting the overall process of selecting the final studies for the review.
The results of our review showed that the focus of social media analysis has been on a variety of risky health behaviors, including nicotine use, alcohol use, drug abuse, physical activity patterns, and obesity-related behaviors. Social media platforms have been widely used for secondary data analysis as well as for follow-up analysis of data generated from active interventions or campaigns conducted using such platforms. Multiple computational and quantitative functions and tools were utilized for analyzing the data generated from online peer interactions on social media platforms. A detailed exposition of our results is included in Multimedia Appendix 1, which shows the key characteristics of the selected studies grouped by risky health behaviors and then ordered by year published.
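The study counts reported above are internally consistent, which can be verified with simple arithmetic. The short Python sketch below reproduces the selection flow; the input counts are taken from the text, while the number of studies excluded at full-text screening (53) is derived here and is not stated in the text. The function name `prisma_flow` is a hypothetical helper.

```python
def prisma_flow(identified, duplicates, excluded_at_title_abstract,
                included_after_full_text, added_by_snowballing):
    """Recompute each stage of the study selection flow from reported counts."""
    screened = identified - duplicates
    full_text = screened - excluded_at_title_abstract
    return {
        "screened": screened,
        "full_text": full_text,
        # Derived value: not reported directly in the text.
        "excluded_at_full_text": full_text - included_after_full_text,
        "final": included_after_full_text + added_by_snowballing,
    }


flow = prisma_flow(
    identified=1554,
    duplicates=203,
    excluded_at_title_abstract=1246,
    included_after_full_text=52,
    added_by_snowballing=12,
)
print(flow)

# Share of included studies by publication period, as whole percentages.
share_2016_onward = round(100 * 55 / flow["final"])
share_2013_2015 = round(100 * 9 / flow["final"])
print(share_2016_onward, share_2013_2015)
```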
In the following sections, we aggregate the results of our review to highlight the usage patterns of various social media platforms for secondary analysis purposes, the prevalence of risky health behaviors studied on these platforms, and the methodological tools and functions used to understand these behaviors.

Social Media Platforms
Most of the studies that used Twitter as their data source relied on Twitter application programming interfaces (APIs) for extracting the data. The majority of these studies utilized streaming APIs, which push a subset of the data in near real time [47,50,51,59,61,70,74,78-81,92,94,95], and some of these studies also used search APIs, which provide access to tweets that have already been posted [68,76,82,98,99]. Some studies also used Twitter's data provider called Gnip [54,59,60,63,92], which guarantees access to all the tweets that match the researcher's criteria. Some studies did not indicate which specific kind of API was used for accessing Twitter's data [40,41,48,66,73,77,88,100,102]. For Reddit, the data were obtained through one of the following: (1) Pushshift, which is a publicly available archive of Reddit submissions [42], (2) a web crawler called Wget [62], (3) the Python Reddit API Wrapper [97], (4) a data set released by a Reddit member [101], and (5) Reddit's official API [103]. The data from Facebook were extracted using either Facebook's API and the Facebook platform's Python software development kit [87] or the extraction feature in NVivo (QSR International) [71]. A similar approach was used for extracting data using Instagram's API [44,72].

Methodological Details and Related Tools
The methodological functions used across various studies are discussed in the following sections, as well as the specific tools used for performing those functions.

Deep Learning Techniques
Out of 64 studies, 6 (9%) used deep learning models for text classification, such as convolutional neural networks (CNNs) [41,70,73-75,100], long short-term memory (LSTM) [41,72], LSTM-CNN [41], bidirectional LSTM [41], shallow neural network [100], and reinforcement neural network-gated recurrent unit [100]. Hassanpour et al [72] optimized their deep learning model through the stochastic gradient descent optimization algorithm. One study used an ensemble deep learning model consisting of a word-level CNN and a character-level CNN [73]. One of these studies also performed image classification using image features extracted through a residual neural network [72], which is a state-of-the-art CNN architecture for computer vision tasks. Another study [87] performed image as well as video classification using a neural network called AlexNet, which is another well-known deep CNN used for computer vision problems.

Empirical Distributional Semantics
Some studies applied distributional semantics to recognize meaningful relationships between terms, for instance, between messages and identified themes, applying techniques such as latent semantic analysis (LSA) [64,65], random indexing (RI) [55], and the skip-gram with negative sampling (SGNS) algorithm [56] using the Semantic Vectors package. Some of these studies used pretraining on general domain corpora: RI with the Touchstone Applied Science Associates (TASA) corpus [55], the SGNS algorithm with the Wiki corpus [56], and LSA with the TASA corpus [64,65].
Various unsupervised ML models were also utilized for identifying e-cigarette communities using k-means clustering [42] and pattern or theme recognition through a technique called the biterm topic model [78]. One study performed clustering analysis through an agglomerative hierarchical clustering technique [102] to group the temporal patterns of alcohol consumption among members of an online community.
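As a rough illustration of how k-means clustering groups observations, the sketch below implements the standard assign-then-update loop in plain Python. It is not the implementation used in the cited studies: the two-dimensional points are invented stand-ins for per-user features, and the initial centroids are fixed by hand (an assumption made so the example is deterministic), whereas practical implementations usually choose them randomly or with a seeding heuristic.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, initial_centroids, max_iterations=100):
    """Plain k-means: assign points to the nearest centroid, then recompute means."""
    centroids = list(initial_centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its assigned points.
        updated = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if updated == centroids:
            break  # Assignments can no longer change.
        centroids = updated
    return centroids, clusters


# Invented feature vectors for six hypothetical community members.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, initial_centroids=[(1.0, 1.0), (8.0, 8.0)])
print(centroids)
print(clusters)
```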

Language Modeling
Out of 64 studies, 5 (8%) performed linguistic text analysis using Linguistic Inquiry and Word Count (LIWC), which is used to count words in psychologically meaningful categories [45,71,83,88,89]. Linguistic analysis performed by Singh et al [45] for analyzing smoking cessation behaviors showed that interrogatives in the form of seeking information were more frequently expressed in an individual's language if they belonged to the contemplation stage of behavior change; however, numbers were more frequently expressed in an individual's language if they belonged to the action stage of behavior change. Another study showed that words carrying negative affect were more frequently associated with greater substance abuse [71]. In one study, LIWC was used to measure personal pronoun use within each community to understand if the individual was tweeting about one's drinking behavior or was referencing others' behavior [83]. One study extracted psycholinguistic features from the language used on social media platforms to train a classifier to predict recovery from alcoholism [88]. Similarly, another study showed that negative emotion or swear words, inhibition words, and love words were significantly associated with increased risk of relapse for individuals suffering from alcohol use disorder [89].
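The general idea of dictionary-based category counting can be sketched in a few lines of Python. The example below is an assumption-laden illustration and not LIWC itself: the category word lists in `TOY_CATEGORIES` are invented for this sketch (the LIWC dictionaries are licensed and far larger), and the tokenizer is deliberately simple. Each category is reported as a percentage of all tokens in the post.

```python
import re
from collections import Counter

# Invented toy word lists; these are NOT the LIWC dictionaries.
TOY_CATEGORIES = {
    "interrogative": {"how", "what", "when", "why", "who", "where", "which"},
    "negative_emotion": {"sad", "angry", "hate", "worried", "awful"},
    "first_person_singular": {"i", "me", "my", "mine", "myself"},
}


def category_rates(text, categories=TOY_CATEGORIES):
    """Return the percentage of tokens that fall in each category."""
    tokens = re.findall(r"[a-z']+|\d+", text.lower())
    total = len(tokens)
    counts = Counter()
    for token in tokens:
        if token.isdigit():
            counts["number"] += 1  # Numerals form their own category.
        for name, words in categories.items():
            if token in words:
                counts[name] += 1
    names = list(categories) + ["number"]
    return {name: (100.0 * counts[name] / total if total else 0.0) for name in names}


rates = category_rates("How do I quit? Day 12 and I feel awful.")
print(rates)
```

Rates of this kind can then serve as features for a downstream classifier or be compared across groups, such as members at different stages of behavior change.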

Quantitative Modeling Using Social Network Analysis
Out of 64 studies, 9 (13%) performed social network analysis [42,48,50,53,64,65,86,91,103]:
1. One study generated network graphs to visualize presence and co-occurrence of e-cigarette topics across different subreddits [42].
2. One study created network graphs to understand the reach of a campaign targeted to educate young individuals about harmful effects of smoking [48].
3. One study identified topics of e-cigarette-related conversations by creating a Twitter hashtag co-occurrence network [50].
4. One study analyzed structural differences in social networks of smokers and nonsmokers by analyzing the relationship of network metrics with smoking status of individuals [53].
5. One study performed affiliation network analysis by constructing two-mode network graphs to understand the association of the members of a smoking cessation community with different communication themes [64].
6. One study visualized topological and theme-based differences in social networks of members of an online smoking cessation community [65].
7. One study analyzed how an individual's social network connectivity affected their alcohol use behaviors based on the topics of discussion [86].
8. One study showed that individuals who expressed negative sentiment about drinking were more centrally located within the social network compared to other members of the community [91].
9. One study quantified the peer interactions between the members of the community using social network features (eg, in-degree, out-degree, degree, reciprocity, and clustering coefficient) [103].
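The social network features named in the last item can be computed directly from a list of directed interactions (eg, replies from one member to another). The following Python sketch is a minimal illustration under stated assumptions: the edge list is invented, reciprocity is defined as the share of directed edges whose reverse edge also exists, and the clustering coefficient is computed on the undirected version of the graph. Other definitions exist, and the cited studies may have used different ones.

```python
from collections import Counter, defaultdict
from itertools import combinations


def network_features(edges):
    """Compute degree counts, reciprocity, and local clustering coefficients."""
    # Ignore self-loops and repeated interactions between the same ordered pair.
    edge_set = {(source, target) for source, target in edges if source != target}
    in_degree = Counter(target for _, target in edge_set)
    out_degree = Counter(source for source, _ in edge_set)

    # Undirected neighborhoods, used for the clustering coefficient.
    neighbors = defaultdict(set)
    for source, target in edge_set:
        neighbors[source].add(target)
        neighbors[target].add(source)

    reciprocated = sum((target, source) in edge_set for source, target in edge_set)
    reciprocity = reciprocated / len(edge_set) if edge_set else 0.0

    clustering = {}
    for node, nbrs in neighbors.items():
        k = len(nbrs)
        if k < 2:
            clustering[node] = 0.0  # Fewer than two neighbors: no possible triangle.
            continue
        links = sum(1 for u, v in combinations(nbrs, 2) if v in neighbors[u])
        clustering[node] = links / (k * (k - 1) / 2)

    degree = {node: in_degree[node] + out_degree[node] for node in neighbors}
    return {
        "in_degree": in_degree,
        "out_degree": out_degree,
        "degree": degree,
        "reciprocity": reciprocity,
        "clustering": clustering,
    }


# Invented reply network: (sender, receiver).
replies = [("A", "B"), ("B", "A"), ("A", "C"), ("C", "B"), ("D", "A")]
features = network_features(replies)
print(features["reciprocity"], features["clustering"])
```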
Table 3. Summary of methods and related tools used by various studies.

Principal Findings
The purpose of this review was to investigate the current state of computational and quantitative techniques available for analyzing risky health behaviors, beliefs, and attitudes using online peer interactions from social media platforms. From the initial set of studies retrieved and snowballing techniques, 64 studies that met our inclusion criteria were included in this review, out of which 75% (48/64) [40-57,68-79,82-94,97,98,100-102] were published from 2017 onward. This suggests that there is a growing trend in utilizing computational approaches to characterize risky health behaviors by analyzing conversational data generated from online peer interactions.
Several platforms were used as the source of data for analyzing risky health behaviors, with the most popular being open social media platforms, since 80% (51/64) of the studies utilized them as compared to intentionally designed health-related social media platforms. In terms of data collection, our results showed that Twitter was a popular source of social media data, as it provides three easy ways to access the data: Twitter Search API, Twitter Streaming API, and Twitter Firehose [104]. Some studies utilized platforms (eg, Facebook, Instagram, and Reddit) that also provide access to data through their APIs [105-107] but were not as widely used as Twitter. A few studies utilized intentionally designed health-related social media platforms, such as QuitNet, Cancer Survivors Network, patient.info/forums, BecomeAnEX.org, Hello Sunday Morning blog, and the A-CHESS online discussion forum, but they did not provide any information about their data collection techniques. In terms of data types, this review included studies that primarily focused on analyzing textual data generated from online peer interactions. Thus, we excluded two studies during the full-text screening that focused on analyzing risky health behaviors through image analysis only [108,109].
Sentiments toward smoking-related products (eg, cigars, e-cigarettes, hookah, vaping, and JUUL) and identification of various themes related to the discussion of such products were widely studied using online social media platforms. Prescription drug abuse, opioid misuse, and binge drinking-related behaviors were another set of widely analyzed risky health behaviors using online social media platforms. This highlights the potential of using such platforms for the dissemination of behavioral change interventions targeting uncharted and evolving domains (eg, e-cigarettes) as well as well-charted domains (eg, alcohol use). In addition to addictive behaviors, uptake behaviors were analyzed, such as the association of physical activity patterns, sentiments, and types of behaviors (eg, running, walking, and jogging) with different geographical locations (eg, in Canada) and population demographics (eg, genders). Social media platforms were used for identifying the themes related to weight loss and obesity-related behaviors. None of the studies focused on analyzing unprotected sex-related behaviors, an important public health focus and priority, which can likely be an interesting avenue for future research. However, given the stigma, privacy concerns, and the opaque nature of the domain, access to such data sets might be limited.
The LIWC tool was widely used for linguistic feature extraction, as it is an easily accessible tool that extracts features like style words, emotional words, and parts of speech from the texts [110]. Language modeling performed using LIWC showed how the usage of language among members can be used to predict their relapse or behavior transition patterns. For topic modeling, LDA was the most commonly used tool; it analyzes latent topics based on word distribution and then assigns a distribution of topics to each document [111]. The topics discussed varied from one risky health behavior to another but mostly highlighted the attitudes and behavior patterns of individuals engaging in such behaviors. A few examples include highlighting the controversial topics related to e-cigarette and marijuana use (eg, legalization and prohibition) [101], identifying topics related to the normative or cultural context surrounding e-cigarette use and alcoholic preferences [60,83], and understanding how the social environment of individuals affects their behaviors toward weight loss [98].
A wide range of supervised ML algorithms were used for the content and sentiment analysis of the data generated from online peer interactions. Most of the studies utilized traditional ML models (eg, SVM, LR, RF, DT, and KNN) for text classification purposes. Only a few studies [41,70,72-75,87,100] utilized deep learning models (eg, CNNs and LSTMs) for text as well as image and video classification tasks. In terms of performance evaluation, the following results were observed:
1. In 4 out of 64 (6%) studies [41,72-74], the performance of deep learning models on classification tasks was better compared to the traditional ML classifiers (eg, the deep learning model had an AUROC of 0.65 as compared to the baseline LR model, which had an AUROC of 0.54 [72]).
2. In 1 study out of 64 (2%) [75], the deep learning model marginally outperformed the traditional ML classifier: RF (accuracy 70.1%) and deep CNN (accuracy 70.4%).
3. In another 2 studies out of 64 (3%) [70,100], the performance of deep learning models on classification tasks was lower compared to the traditional ML classifiers (eg, RF [accuracy 93.4%] performed better than CNN [accuracy 60.1%] [100]).
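Because the comparisons above rely on accuracy and AUROC, a brief sketch of how the two metrics differ may be helpful. The Python example below is illustrative only: the labels and scores are invented, and AUROC is computed through its pairwise interpretation (the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half), which is equivalent to the area under the ROC curve but is quadratic in the number of examples.

```python
def accuracy(labels, scores, threshold=0.5):
    """Share of examples classified correctly at a fixed score threshold."""
    predictions = [1 if score >= threshold else 0 for score in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def auroc(labels, scores):
    """AUROC as the probability that a positive outranks a negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative.")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # Ties contribute half a win.
    return wins / (len(positives) * len(negatives))


# Invented ground-truth labels and classifier scores.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.6]

print(round(accuracy(labels, scores), 3), round(auroc(labels, scores), 3))
```

Accuracy depends on the chosen threshold, whereas AUROC summarizes ranking quality across all thresholds, which is one reason results reported with different metrics are not directly comparable across studies.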
Some of the studies included in this review also performed network analysis [42,48,50,53,64,65,86,91,103]. The Gephi platform [112] and UCINET software [113] were widely used tools for analyzing online social ties. One study characterized the role of content-specific social influence patterns underlying peer-to-peer communication using affiliation exposure models and the two-mode version of the network autocorrelation model [64]. One study analyzed the social network structure of smokers and compared it with the network structure of nonsmokers to understand the factors related to the social influence that might affect addictive tobacco-related behaviors [53]. Such network analysis can help us understand the context of communication, which can eventually guide the development of tangible technology features by health researchers and technology developers [114,115].
One study [85] analyzed online peer interactions based on a communication model called the dynamic transactional model [116], which is suitable for modeling two-way communication between individuals. Very few studies [42,45,55,64,65,97] linked theoretical constructs that define behavior change in analyzing content generated from social media platforms, such as social cognitive theory [117], the transtheoretical model of change [118], the health belief model [119], and the taxonomy of behavior change techniques [120]. The online peer interactions should be analyzed using theoretical frameworks that can lead to the development of empirically grounded digital health interventions for promoting health and positive behavior changes [121,122]. Theory-driven large-scale analysis of social media data sets will yield insights into the specific processes of behavior change that manifest in peer interactions. The analysis of these data sets in conjunction with theoretical constructs can aid in enhancing our knowledge of how social influence plays a major role in diffusing health information and modifying individual health behaviors. This can have implications for the development of high-yield interventions for individuals and populations based on their risky health behavior, thereby enabling individuals to make positive lifestyle changes and improving their quality of life.
It is also important to understand that online social media platforms can be used for disseminating health-related misinformation as well [123]. The COVID-19 pandemic has provided us with abundant evidence that highlights the urgency to address public concerns related to misinformation that is plaguing social media, which can negatively impact health-related behaviors of individuals [124,125]. Also, the ground truth of aggregated trends extracted from information disseminated through these platforms is reflective of community perceptions only to a certain extent because of the large amount of content pushed by automated bots [126]. Studies have shown how misinformation also impacts risky health behaviors (eg, misleading marketing claims about e-cigarettes [127] and alcohol use [128]). Future work should focus on leveraging the techniques described in this review for analysis of misinformation diffused throughout online social media platforms to enhance the utility and positive impact of these platforms.

Limitations
Our review is not without limitations. First, we included only studies related to risky health behaviors; however, studies focusing on other public health domains (eg, epidemiology [129] and surveillance [130]) or on chronic health conditions (eg, diabetes [131,132] and cancer [133]), as well as clinical and health outcomes [134,135], can provide a comprehensive understanding of how data generated from social media platforms are analyzed for various public health applications by leveraging computational modeling and high-throughput analytics. The domain of infodemiology and infoveillance is quite broad and includes various other aspects of risky health behaviors that were not included in this review (eg, mining consumer opinions toward online marketing of e-cigarettes [136,137], or understanding their reactions toward media coverage [138,139] or policy regulations [140,141] concerning such products). Second, we focused only on studies that primarily performed textual data analysis. Even though we did include studies that reported image or video data analysis along with textual data analysis [72,87], we did not include studies that solely described image or video data analysis [108,109]. These studies can provide useful insights into ML trade-offs and computational scalability as related to varying data density, heterogeneity, and inferential granularity.
Finally, given the constraints of our search strategy, we might have missed some studies from the infodemiology and infoveillance domain; for example, an initial exploration of the literature search in this domain [142] resulted in a total of 397 studies, of which 23 were relevant for inclusion in this review. Of these, 15 studies were captured by our search strategy and included in the review [40,41,43,50,51,54,61-63,66,68-70,80,95], and an additional one was included as part of the snowballing efforts [47]. However, the remaining seven were not identified by our search strategy [143-149]. These studies describe ML methods in their metadata, titles, abstracts, and keywords using either broad methodological descriptions or excessively granular terminology. For consistency and to limit bias with studies in other journals, we have not included these studies in the review. Future researchers conducting similar reviews should ensure the inclusion of terms that capture the interdisciplinary nature of studies (eg, infodemiology), analytical functions (eg, text classification, content analysis, and topic modeling), and analytical techniques (eg, LDA) for the exhaustive representation of related works that leverage SMaaRT for risky behavior modeling and analysis.
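As an illustration of the recommendation above, the following minimal Python sketch assembles a Boolean search string that combines platform terms with interdisciplinary, functional, and technique-level method terms. It is not the search strategy used in this review: the grouping of terms, the helper names (`or_group`, `build_query`), and the generic AND/OR syntax are assumptions made for illustration, and each database (eg, PubMed, Web of Science, or Scopus) has its own field tags and query syntax that would need to be applied.

```python
# Hypothetical helper names and term grouping; only the example terms
# themselves are taken from the text of this review.

def or_group(terms):
    """Join alternative terms with OR, quoting multiword phrases."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"

def build_query(groups):
    """Require one match from every non-empty group by joining groups with AND."""
    return " AND ".join(or_group(g) for g in groups if g)

# Terms describing the data source.
platform_terms = ["social media", "online health communities"]

# Terms describing the method at three levels of granularity:
# discipline, analytical function, and analytical technique.
method_terms = [
    "infodemiology",
    "text classification",
    "content analysis",
    "topic modeling",
    "LDA",
    "machine learning",
    "data mining",
]

query = build_query([platform_terms, method_terms])
print(query)
```

Placing broad and granular method terms in the same OR group means that a study indexed only under a specific technique (eg, LDA) or only under a broad label (eg, infodemiology) can still be retrieved.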

Conclusions
Our review shows that online discourse related to risky health behaviors on social media platforms can span multiple topics that include nicotine dependence, alcohol use, drug or substance abuse, physical activity patterns, and obesity-related behaviors. This results in the generation of large amounts of digitally archived data, which can provide a deeper understanding of the organic manifestation and natural evolution of health-related behavior change processes.
Our review highlights the characteristics of social media platforms (eg, general-purpose vs health-focused platforms and ease of data access for secondary analysis), the robustness of methods used for analyzing peer interactions within these platforms, and the wide variety of text mining and network modeling tools available for analyzing social media data sets at scale. It consolidates the methodological underpinnings and enhances our understanding of how social media can be leveraged for nuanced behavioral modeling and representation. This can ultimately inform the formulation of persuasive health communication and effective behavior modification technologies targeting inter- and intrapersonal psychosocial processes distributed at the individual and population levels. It is also important to understand the merits and shortfalls of existing computational studies to assess the generalizability and strength of the downstream predictive models and data-driven interventions resulting from such large-scale analyses.

Figure 1. PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) diagram for study selection.

Table 1. Social media platforms used by various studies. Percentages do not add up to 100% due to rounding and one study that used multiple social media platforms.
A-CHESS: Addiction-Comprehensive Health Enhancement Support System.