Change in Threads on Twitter Regarding Influenza, Vaccines, and Vaccination During the COVID-19 Pandemic: Artificial Intelligence–Based Infodemiology Study

Background: Discussions of health issues on social media are a crucial information source reflecting real-world responses regarding events and opinions. They are often important in public health care, since these are influencing pathways that affect vaccination decision-making by hesitant individuals. Artificial intelligence methodologies based on internet search engine queries have been suggested to detect disease outbreaks and population behavior. Among social media, Twitter is a common platform of choice to search and share opinions and (mis)information about health care issues, including vaccination and vaccines.

Objective: Our primary objective was to support the design and implementation of future eHealth strategies and interventions on social media to increase the quality of targeted communication campaigns and therefore increase influenza vaccination rates. Our goal was to define an artificial intelligence–based approach to elucidate how threads in Twitter on influenza vaccination changed during the COVID-19 pandemic. Such findings may support adapted vaccination campaigns and could be generalized to other health-related mass communications.

Methods: The study comprised the following 5 stages: (1) collecting tweets from Twitter related to influenza, vaccines, and vaccination in the United States; (2) data cleansing and storage using machine learning techniques; (3) identifying terms, hashtags, and topics related to influenza, vaccines, and vaccination; (4) building a dynamic folksonomy of the previously defined vocabulary (terms and topics) to support the understanding of its trends; and (5) labeling and evaluating the folksonomy.

Results: We collected and analyzed 2,782,720 tweets of 420,617 unique users between December 30, 2019, and April 30, 2021. These tweets were in English, were from the United States, and included at least one of the following terms: “flu,” “influenza,” “vaccination,” “vaccine,” and “vaxx.” We noticed that the prevalence of the terms vaccine and vaccination increased over 2020, and that “flu” and “covid” occurrences were inversely correlated as “flu” disappeared over time from the tweets. By combining word embedding and clustering, we then identified a folksonomy built around the following 3 topics dominating the content of the collected tweets: “health and medicine (biological and clinical aspects),” “protection and responsibility,” and “politics.” By analyzing terms frequently appearing together, we noticed that the tweets were related mainly to COVID-19 pandemic events.

Conclusions: This study focused initially on vaccination against influenza and moved to vaccination against COVID-19. Infoveillance supported by machine learning on Twitter and other social media about topics related to vaccines and vaccination against communicable diseases and their trends can lead to the design of personalized messages encouraging targeted subpopulations’ engagement in vaccination. A greater likelihood that a targeted population receives a personalized message is associated with higher response, engagement, and proactiveness of the target population for the vaccination process.


Introduction

Background
As online-mediated communication environments increase, social media platforms enable individuals to discuss diverse issues, express their thoughts, and debate [1][2][3]. Twitter is a leading social network that provides microblogging services. Users can publish posts, called tweets, with a limited length of 280 characters. Thereby, users can interact with others by responding, sharing, or showing their interest by "liking" a tweet. These interactive abilities are the fundamental building blocks of the connective nature of social networks and serve as an echo of ideas transferred among users on the platform around the globe [4]. Retrieving information in tweets' contents is challenging but is more manageable than in other social media platforms with long messages [5]. Indeed, the amount of structured and unstructured data from social media and Twitter has been increasing exponentially over the years [6,7]. Data mining and text mining enable the discovery of potentially new knowledge and contribute to developing efficient evidence-based decision-making tools [8][9][10] by extracting meaningful summaries, such as statistical ones, or controlled vocabularies (eg, terminology, folksonomy, taxonomy, and ontology) [11][12][13][14][15].
One of the most critical achievements of modern medicine is the development and widespread use of safe and efficacious vaccines. Nevertheless, their partial acceptance due to vaccine hesitancy and refusal is a significant health threat. Compliance with the influenza vaccine is relatively low compared with other vaccines, mainly because vaccination must be repeated annually [16]. Like other vaccines, the influenza vaccine generates discussions both in the real world and online [17][18][19][20]. The COVID-19 vaccine is no exception.
Moreover, the global spread of the COVID-19 epidemic [21], its significant impact on daily life, and the relatively fast development of a vaccine against it have made the COVID-19 vaccine a critical health topic of discussion on social media. Reducing the incidence of transmissible diseases, such as influenza and COVID-19, requires achieving herd immunity [22,23], preferably by vaccination. This public health objective is achievable only with population engagement [18,19].
Social media platforms, such as Twitter, are a place of choice to share opinions and to search for (mis)information [24,25] about health care issues [26,27], including vaccines [17,18,28]. These open forums can influence opinions and vaccination decisions by hesitant individuals [29]. Discussions between provaccine advocates and "anti-vaxx" militants about vaccines' necessity, effectiveness, and safety are continuous. Moreover, the internet as a whole enables the detection of early warnings of disease outbreaks, their dissemination tracking and resilience [30], and the spread of evidence-based information [31,32]. Artificial intelligence methods and algorithms (ie, data mining, text mining, and natural language processing) have been efficiently used in the last decade to detect outbreaks, such as influenza, based on emerging trends in internet search engine queries and social media threads [33][34][35][36]. There is a need for public health interventions [37] to make drastic stands against the spread of misinformation like that disseminated by vaccine opponents [19,38]. Related tools should be based on artificial intelligence to analyze efficiently and in an automated manner the big data generated over social media [39,40].
Understanding the changes happening during some health-related event discussions is crucial to improving health communication efficiency [41,42]. Disease prevention programs need to incorporate methods to make evidence-based information accessible to widespread populations using online resources and to increase control of biased and misleading announcements. The main focus is on advertising policies and campaigns on social media [30,43].

Aims, Objectives, and Hypotheses
Our primary objective was to support the design and implementation of future eHealth strategies and interventions on social media to increase the quality of targeted communication campaigns and therefore increase influenza vaccination rates [18,19,44,45].
Our main aim was to define an artificial intelligence-based approach to analyze tweets that include terms related to vaccination against influenza and COVID-19. We focused on detecting co-occurring terms related to influenza vaccination and highlighting the dominant topics related to these terms. These results can then be used to build a folksonomy [46][47][48][49], which may in turn support the enhancement of vaccination campaigns. The methodology could be generalized to other health-related mass communications. Our research goal was to build a timely and dynamic vocabulary of the various topics related to influenza, vaccines, and vaccination posted in the English language. This vocabulary can be used as a decision support tool for health communication specialists and health policymakers, facilitating the understanding of the variations over time of different topics, such as those suggested in this study (tweets related to "influenza," "vaccines," and "vaccination").
The following 4 hypotheses guided this research:
1. Tweets are a source for understanding the reasoning behind the decision to take a vaccine.
2. "Influenza," "vaccines," and "vaccination" topics are not linked directly to other topics (such as politics, economics, and fears) but are related to health matters.
3. Current events and news impact tweet content related to vaccines and vaccination.
4. The terms and hashtags of tweets about influenza, vaccines, and vaccination can be organized in a dynamic vocabulary [50]. This vocabulary can reflect the main topics and their terms discussed over time on the social media platform.

Overview
This study included the following 5 stages:
1. data sourcing to collect tweets and related data using the Twitter streaming application programming interface (API) [51];
2. data cleansing and storage;
3. identifying the terms, hashtags, and topics related to "influenza," "vaccines," and "vaccination";
4. building a dynamic vocabulary, a folksonomy, to support the understanding of the relations between them; and
5. evaluating the vocabulary clusters.

Data Sourcing
We extracted and collected tweets via the Twitter API for 16 months, between December 30, 2019, and April 30, 2021. These tweets were in English, from North America, and included at least one of the following terms: "flu," "vaccination," "vaccine," and "vaxx" (this last term was used to capture messages related to vaccination opponents as these individuals use it). We selected these terms to maximize the chance to retrieve discussions concerning a vaccine as a product, vaccination as an act or a policy, vaccination hesitancy, and influenza. Moreover, since Twitter participants use informal language, for extracting influenza-related content, we used the popular term "flu." The extraction omitted retweets and likes. The 16-month follow-up period allowed us to capture terms and topics of Twitter threads related to influenza, vaccines, and vaccination. Indeed, in the United States, 2020 involved the COVID-19 pandemic and the presidential elections.

Data Preprocessing and Cleansing
To ensure efficient use of machine learning methods [52] on the tweet collection [53], we preprocessed it by cleansing and lemmatizing similar words appearing in posts. Data cleansing consisted of removing punctuation marks [54], mentions of users, glyphs, website addresses, and stop words [55]. Moreover, as the tweets were written in natural language and in a concise manner (due to the limitation of 280 characters), a word may be written in several ways (eg, typos and short forms), all of which have the same or a similar meaning. Lemmatization is one of the methods for overcoming this issue. It consists of replacing words by their root form (eg, "vaccine" for "vaccines") [56]. For example, due to the COVID-19 pandemic, the tweets retrieved during the collection process contained multiple representations of the term "covid," such as "COVID19," "COVID-19," and "coronavirus." We used the Python Natural Language Toolkit (NLTK) package for lemmatization [55]. Since the nature of tweets is informal, we assumed that using a single representation of those words would not significantly change the tweet's context while improving the model's accuracy. Therefore, the frequent representations of the term "COVID" were replaced with the single form "covid," and the terms related to "influenza" were lemmatized to "flu," the popular term used on Twitter. All the lemmas were stored in lowercase.
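The cleansing and term unification steps described above can be illustrated with a minimal sketch. It is not the pipeline used in the study: it relies only on the Python standard library instead of NLTK, and the stop-word list, the `CANONICAL` mapping, and the function name `preprocess` are hypothetical choices made for illustration.

```python
import re

# Hypothetical, deliberately tiny stop-word list (the study used NLTK resources).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "for", "my", "i", "this"}

# Hypothetical variant-to-canonical map mimicking the unification described above.
CANONICAL = {
    "covid19": "covid",
    "covid-19": "covid",
    "coronavirus": "covid",
    "influenza": "flu",
    "vaccines": "vaccine",
}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Alphanumeric tokens, optionally joined by inner hyphens (keeps "covid-19" whole).
TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def preprocess(tweet):
    """Lowercase a tweet, strip URLs, mentions, and punctuation, unify variants, drop stop words."""
    text = tweet.lower()
    text = URL_RE.sub(" ", text)       # remove website addresses
    text = MENTION_RE.sub(" ", text)   # remove mentions of users
    tokens = TOKEN_RE.findall(text)    # tokenizing this way discards punctuation and glyphs
    tokens = [CANONICAL.get(tok, tok) for tok in tokens]
    return [tok for tok in tokens if tok not in STOP_WORDS]


print(preprocess("@CDCgov The COVID-19 vaccines are here! https://example.org #Influenza"))
```

In this sketch, hashtags lose their leading "#" but keep their term, so a hashtag and the plain word count as the same lemma; whether the original pipeline did the same is an assumption.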

Identifying the Terms and Topics Related to Influenza, Vaccines, and Vaccination
We handled the identification of the terms, hashtags, and topics related to influenza, vaccines, and vaccination by a 3-step process as follows: (1) clustering with word embedding and n-grams, (2) building a folksonomy, and (3) evaluating folksonomy clusters.

Clustering
The objective of clustering is to segregate a set of points into groups, with each one as similar as possible and different from the others [57]. For example, in the context of text mining and specifically mining a tweet corpus, clustering can be used to group terms that are semantically similar or frequently appearing in the same message. Each cluster, according to its content, can then be annotated with a topic.

Word Embedding
Handling the high volume of collected tweets over time means dealing with the curse of dimensionality [58]. Therefore, a symbolic-numeric reformulation associated with dimension reduction [59] must be used to handle a large amount of data in a reasonable time and reduce the processing complexity. Word embedding is a relevant approach supporting these 2 goals; it consists of a learned numerical representation of text where words having a similar meaning in a specific context have a similar numerical representation as vectors. Globally, word embedding allows the prediction of words in a specific context. Word2Vec is a word embedding algorithm based on a neural network model that learns the association between words or terms from a large corpus of text (ie, a context). After the first training step, Word2Vec can detect synonymous words or terms, or suggest complete sentences. This is done by searching for vectors, and thus words, with a close semantic similarity, represented by the cosine or Euclidean distance (ie, the similarity or the relation) between two vectors (ie, words and terms) in a space (ie, corpus) of n dimensions (ie, number of words or terms in the corpus) [60]. As an example, words related to time, such as "day," "week," "month," "season," and "year," will be used in similar contexts and will be defined as semantically close. The preprocessed data were used for creating the Gensim Word2Vec model in Python [61]. To see each word in the context it shares with other words, we produced clusters with the K-means algorithm [62,63] to assist decision makers in better understanding the public's perceptions of vaccines and vaccination against influenza and COVID-19. As discussions constantly evolve, the word embedding and clustering process was repeated monthly on newly collected tweets.
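To make the notion of semantic closeness concrete, the following minimal sketch computes the cosine similarity between word vectors and retrieves the nearest neighbor of a word. The 3-dimensional vectors are invented for illustration and are not taken from the study's Word2Vec model; the function names are likewise hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def most_similar(word, vectors):
    """Return (neighbor, similarity) for the word closest to `word` by cosine similarity."""
    candidates = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return max(candidates, key=lambda pair: pair[1])


# Hypothetical toy embeddings: "week" and "month" point in nearly the same direction.
toy_vectors = {
    "week": [0.9, 0.1, 0.0],
    "month": [0.8, 0.2, 0.1],
    "vaccine": [0.0, 0.3, 0.9],
}

print(most_similar("week", toy_vectors))
```

With a trained model, the same idea applies to vectors of a few hundred dimensions; words used in similar contexts end up with a cosine similarity close to 1.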

N-grams
As a complementary approach to word embedding, we built an n-gram language model predicting the probability of a sequence of words (after stop-word cleaning) appearing in our corpus of tweets. We extracted the most frequent n-grams comprising between 1 and 4 terms (n) for each week. This process also used the Gensim Python library [61]. This approach enables health communication decision makers to learn about isolated terms and sets of terms that are growing or shrinking in the discussions related to vaccination and influenza.
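The extraction of frequent n-grams can be sketched as follows. This is an illustrative standard-library version, not the Gensim-based implementation used in the study; the function names and the sample tweets are hypothetical, and the sketch assumes that n-grams never span two tweets.

```python
from collections import Counter


def extract_ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(tokenized_tweets, max_n=4, top_k=10):
    """Count n-grams for n = 1..max_n over preprocessed tweets and keep the top_k per n."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for tokens in tokenized_tweets:
        for n in counts:
            counts[n].update(extract_ngrams(tokens, n))
    return {n: counter.most_common(top_k) for n, counter in counts.items()}


# Hypothetical preprocessed tweets for one week.
week_tweets = [
    ["flu", "shot", "today"],
    ["get", "flu", "shot"],
    ["covid", "vaccine"],
]

for n, ranked in top_ngrams(week_tweets, top_k=3).items():
    print(n, ranked)
```

Running this per week and comparing the rankings from one week to the next gives the kind of growing or shrinking term sets described above.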

Defining the Numbers of Clusters as Topics
Clustering is an unsupervised learning task and is challenging due to the need to define k, the number of clusters to build. The "silhouette method" allows assessing the quality of clustering, as it determines the similarity of an object (eg, a word, also called a unigram) with the content of its own cluster and its likeness with the other clusters. A silhouette shows which objects (eg, words, vectors, and values) lie well within a cluster and which are less related. The graphical combination of the silhouettes of an entire clustering (eg, with k clusters) into a single plot allows the appreciation of each cluster's relative quality and of the overall clustering itself. The overall average silhouette width (ie, the average of the silhouette widths over all clustered objects) provides an evaluation of clustering validity. A higher value of the overall average silhouette width (ie, silhouette score) is associated with better clustering with k clusters, and the k with the highest value should therefore be selected as the better partitioning. The silhouette method is independent of the partitioning algorithm used [64]. From our research perspective, each term must have a minimum number of occurrences to be included in the analysis. Moreover, 2 terms must be within a maximum distance (number of other terms) of each other in a tweet for their potential semantic link to be considered.
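As an illustration of the silhouette method, the sketch below computes the silhouette width s(i) = (b(i) − a(i)) / max(a(i), b(i)) of each object, where a(i) is the mean distance to the other members of its own cluster and b(i) is the smallest mean distance to the members of any other cluster. It is a plain-Python sketch under stated assumptions (Euclidean distance, toy 2-dimensional points, hypothetical function names), not the sklearn routine used in the study.

```python
import math


def silhouette_values(points, labels):
    """Silhouette width of every point, using Euclidean distance."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least 2 clusters")

    values = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            values.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of another cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        values.append(0.0 if denom == 0 else (b - a) / denom)
    return values


def silhouette_score(points, labels):
    """Overall average silhouette width of a clustering."""
    values = silhouette_values(points, labels)
    return sum(values) / len(values)


# Hypothetical toy data: two tight groups far apart.
toy_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(silhouette_score(toy_points, [0, 0, 1, 1]))  # sensible partition
print(silhouette_score(toy_points, [0, 1, 0, 1]))  # partition mixing the two groups
```

The sensible partition yields a score close to 1, whereas the mixed partition yields a negative score, which is the behavior used to compare candidate values of k.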

Cluster Visualization
Cluster visualization is produced by using t-distributed stochastic neighbor embedding (t-SNE), which is a nonlinear dimensionality reduction technique for embedding high-dimensional data and visualizing it in a low-dimensional (ie, 2 or 3) space [65].

Evaluation of the Terms in the Clusters and as N-grams
To evaluate our approach and the results of identifying the terms, hashtags, and topics related to influenza, vaccines, and vaccination, we implemented a validation process built on complementary approaches. One focused on the word embedding results, the second focused on n-grams, and the third focused on the whole by involving social media users. Thus, the terms were grouped once from a semantic perspective with word embedding and once by high coappearance frequency as n-grams, which describe the content of the explored Twitter threads in summarized ways.
The second evaluation approach consisted of using Google Trends [66] to obtain the relative frequency of search terms during a specific period and in a specific geographic area. In this study, the n-grams (n between 1 and 4) were extracted from the tweets, and their weekly frequency was calculated. Next, the n-grams that appeared in the top 150 list continuously for at least 12 weeks were used as an input for a Google Trends query over the time frame in which they were published on Twitter. Finally, the n-grams (bi-grams) and the Google Trends query results were normalized. Their Pearson correlation coefficients were calculated by considering the weekly tweet-based n-grams and the weekly relative number of queries (comprising the n-gram terms) on the Google search engine.
The third evaluation consisted of computing Pearson correlations between the weekly frequency (between December 2020 and April 2021) of n-grams specific to vaccines, vaccination, influenza, and COVID-19, and the proportion of the population vaccinated against COVID-19.
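The correlation step shared by the second and third evaluations can be sketched as below. The weekly series are invented for illustration, the function names are hypothetical, and min-max scaling is only one possible reading of "normalized"; the sketch also omits the P value computation.

```python
import math


def min_max_normalize(series):
    """Rescale a series to the [0, 1] range."""
    lo, hi = min(series), max(series)
    if hi == lo:
        raise ValueError("a constant series cannot be min-max normalized")
    return [(x - lo) / (hi - lo) for x in series]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical weekly data: n-gram counts on Twitter and a relative search index.
weekly_tweet_counts = [120, 150, 90, 60, 30]
weekly_search_index = [80, 100, 70, 40, 25]

r = pearson_r(min_max_normalize(weekly_tweet_counts), min_max_normalize(weekly_search_index))
print(round(r, 3))
```

Because the Pearson coefficient is invariant to linear rescaling, normalizing both series does not change r; it mainly makes the two series comparable on a single plot.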

Informed Consent Statement
The social network data were collected in an anonymized way and following Twitter's rules. The participants of the evaluation survey provided anonymous informed consent in an electronic way on the platform before they could proceed to the completion of the questionnaire.

Data Availability Statement
The Twitter data that support the findings of this study are not available owing to Twitter's rules and regulations. The survey data that support the findings are available from the corresponding author (AB) upon reasonable request, which will need to undergo ethical and legal approvals by the investigators' institutions. The methodology of this research will be reported in the AIMe registry for artificial intelligence in biomedical research [67].

Descriptive Statistics
A total of 2,782,720 tweets of 420,617 unique users between December 30, 2019, and April 30, 2021, were collected. The graph in Figure 1 shows the number of tweets per month (bar columns) containing at least one of the following terms (or similar after cleansing and lemmatization): (1) "flu," (2) "vaccination," (3) "vaccine," (4) "vaxx," and (5) "covid." The lines in Figure 1 show the proportion in percentage of each of these terms in the collected tweets. Although the term "covid" and its synonyms were not part of the initial keywords used for querying tweets, its emergence reflects the effect of the COVID-19 pandemic as an important topic in the discussions regarding vaccination and influenza in 2020 and 2021. Figure 1 also shows that globally the number of tweets comprising at least one of the terms "flu," "vaccination," "vaccine," "vaxx," and "covid" has dramatically increased over the period from December 2020 to April 2021 (see also Multimedia Appendix 1). Two peaks were noticed. The first was in March 2020, with the World Health Organization declaring COVID-19 a pandemic (March 11, 2020) and President Donald Trump declaring COVID-19 a national emergency (March 13, 2020). The second peak in December 2020 was related mainly to "vaccines" in response to the approval of COVID-19 vaccines (Food and Drug Administration [FDA] emergency use authorizations for the Pfizer BioNTech vaccine on December 11, 2020, and the Moderna vaccine on December 18, 2020). Thus, the term "vaccine" increased from approximately 35% in January 2020 to approximately 80% one year later. In contrast, the term "vaxx" (for the terms "antivaxx," "antivaxxer," "anti-vaxx," and "anti-vaxxer") was stable at 1% to 3% over the whole data collection period. Nevertheless, it is essential to take into account that vaccination opponents used various tools and forms of discourse that do not mention the "anti-vaxx" term itself [68][69][70].
The terms related to influenza ("flu") and COVID-19 ("covid") showed an inverse correlation (r=−0.83, P<.001) at the monthly level (Multimedia Appendix 1). The use of "covid" increased linearly, starting in January 2020, with the first cases of COVID-19 spreading from China to Europe and the United States [71], until February 2021, when it was part of approximately 35% of the collected tweets. In parallel, the use of the term "flu" decreased steadily, probably due to the low influenza activity during the 2020-2021 season [72,73].

Figure 1. Distribution of the number of tweets by month comprising at least one of the terms "flu," "vaccination," "vaccine," "vaxx," and "covid" between December 30, 2019, and April 30, 2021.

Word Embedding
The Word2Vec algorithm was run monthly to find the optimal parameters supporting the finding of the dominant trending topics. The optimal parameter values were determined by creating models using a different value for each parameter and calculating the silhouette score for each iteration with the "silhouette_score" function of sklearn.metrics in Python [74]. Multimedia Appendix 2 shows the parameters' values and the silhouette scores of the various models of each month. Moreover, each week, only the terms having the highest occurrence relative to the overall number of terms detected in the tweets collected in the same week were investigated. The values of these attributes were changed over time to account for the dynamic changes in social media users' lexicons driven by current events.

K-means Clustering
Using the monthly word embedding model as an input, word clusters were generated with the NLTK KMeansClusterer [75]. This clustering method partitions a given data set into a predetermined number k of clusters [66,76]. The partition is performed while aiming to minimize the in-cluster variance and maximize the variance between the elements from different clusters. To determine the optimal number of clusters [77], we computed the silhouette scores of k-means clustering runs with k ∈ [3;6]. The silhouette scores of the clustering models were generated on the 2,782,720 tweets of 420,617 unique users between December 30, 2019, and April 30, 2021, related to 141,407 n-grams with n ∈ [2;4]. The highest silhouette score reflects the grouping in which the objects are well assigned to their clusters and only weakly linked to neighboring, less relevant clusters. The highest silhouette score (s=0.72) was achieved with k=3. This score can be considered good, as we clustered terms that can relate to different topics and the clusters can overlap partially [78,79]. Furthermore, we computed the Ray-Turi index [80] for k between 2 and 10 and plotted the resulting values; applying the Elbow method to this curve also gave an optimal k of 3 [81].
Explicitly, this visualization (Figure 2) allows us to see the 1000 most used terms in the tweets of each one of the previously computed clusters. Overlaps are noticeable between the clusters, which is logical given that tweets often relate to several topics at the same time (eg, from an account dealing with political issues: "The vaccines offer good protection with more than 80% effectiveness. Most people will not be sick and the ones that will, will not get seriously ill or die"). Orange, seafoam (a green-blue that facilitates reading of the figure by color-blind individuals), and violet represent "health and medicine (biological and clinical aspects)," "protection and responsibility," and "politics," respectively.
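For readers unfamiliar with the Ray-Turi index, the following minimal sketch computes it for a given partition as the ratio of the mean squared distance between each point and its cluster centroid to the smallest squared distance between two centroids; lower values indicate more compact and better separated clusters. The toy points and the function names are hypothetical, and the sketch does not reproduce the study's computation over k between 2 and 10.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(members):
    """Coordinate-wise mean of a non-empty list of points."""
    dim = len(members[0])
    return tuple(sum(p[d] for p in members) / len(members) for d in range(dim))


def ray_turi_index(points, labels):
    """Intra-cluster compactness divided by the minimum inter-centroid separation."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the index needs at least 2 clusters")

    centroids = {label: centroid(members) for label, members in clusters.items()}
    # Mean squared distance of every point to the centroid of its own cluster.
    intra = sum(
        squared_distance(point, centroids[label]) for point, label in zip(points, labels)
    ) / len(points)
    # Smallest squared distance between any two distinct centroids.
    keys = list(centroids)
    inter = min(
        squared_distance(centroids[a], centroids[b])
        for i, a in enumerate(keys)
        for b in keys[i + 1:]
    )
    if inter == 0:
        raise ValueError("two clusters share the same centroid")
    return intra / inter


# Hypothetical toy data: two tight groups far apart.
toy_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(ray_turi_index(toy_points, [0, 0, 1, 1]))  # compact, well separated clusters
print(ray_turi_index(toy_points, [0, 1, 0, 1]))  # clusters mixing the two groups
```

In practice, the index would be evaluated on the partition produced by k-means for each candidate k, and the resulting curve inspected for an elbow.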

N-grams
The preprocessed tweets were used to extract n-grams for each week. Multimedia Appendix 4 shows the 10 most common n-grams for each n ∈ [1;4]. For example, the words "flu" and "bad" were found close to each other in the word embedding model over the months of this study (Multimedia Appendix 3, list of the 1000 most frequent n-grams for cluster 1). Those 2 words were also a common n-gram, whether a bigram or a part of a higher degree of an n-gram. Although included in the word embedding representation, we see the relations between those 2 words in general, as they get closer to each other and in the same semantic cluster.
Following the extraction, each n-gram received its growth value, indicating an increased or decreased n-gram frequency from the previous week. The growth is used to highlight the significant changes in the n-grams and therefore in general discussions. For example, on November 9, 2020, Pfizer BioNTech published the initial results of the COVID-19 vaccine trial, which showed high efficacy against the disease. The n-grams of the same week showed a significant increase as follows: "take,

Google Trends Validation
As a component of the internet, social media like Twitter are a part of how people get and share information and knowledge. Therefore, looking at queries on search engines like Google allows the evaluation of global interests in terms and topics detected on social media. Thus, we computed Pearson correlations between the weekly occurrences of n-grams in tweets and weekly queries in the Google search engine and those reported on Google Trends [84]. As an example of the consistency of the previously disclosed results, the n-gram of "flu, symptom" on Twitter and the number of queries on Google were highly correlated (r=0.85, P<.001) between January 1, 2020, and March 4, 2021 (Table 1). During these 65 weeks, this n-gram (ie, "flu, symptom") was also used to search for information about "influenza" and "symptoms." Moreover, as we noticed the decreasing popularity of its use on Twitter, we also noticed similar behavior on Google. Additionally, the n-gram "covid, vaccine" also showed a high correlation between Twitter and Google (r=0.85, P<.001), and on the 2 platforms, its occurrence increased between January Globally, the top topics related to vaccines, vaccination, and COVID-19 were similar on social networks and search engines (Table 1). Thus, internet users' queries on search engines relate with the timing of topics defined by analysis of the text of our Twitter message data set.

Real-World Validation
On December 11, 2020, the FDA issued an emergency use authorization for a COVID-19 vaccine. A few days later, on December 20, 2020, vaccination of the population with the Pfizer BioNTech vaccine was started. We downloaded the daily vaccination rate from Centers for Disease Control and Prevention (CDC) publications and aggregated them at the weekly level [85]. We noticed that starting in December 2020 and ending on April 30, 2021, Pearson correlations between the weekly occurrences of COVID-19 vaccination n-grams and the weekly vaccination rates (Table 2) were high and significant (r>0.81, P<.001) [86]. These results demonstrate that the tweets of this study mirror "real-life" significant events during the pandemic.

Principal Findings
This research was initiated to elucidate online public perceptions regarding vaccination, mainly against seasonal influenza. However, the COVID-19 pandemic in 2020 was impressively reflected by major changes in the focus of Twitter-based discussions. The most important aspect of this study is the building of a folksonomy based on tweet text analysis, word embedding, and clustering. The 3 topics that were identified in this folksonomy were as follows:
1. General issues from the "health and medicine (biological and clinical aspects)" perspective. The initial terms used for the tweet extraction were "flu," "vaccination," "vaccine," and "vaxx." These terms are de facto strongly related to health and medicine, and generate a large spectrum of threads (ie, from asking/answering questions about symptoms, reporting health conditions, and sharing positions). The presence of terms related to the COVID-19 pandemic is understandable given the period of the data collection.
2. "Protection and responsibility" as a central dimension of the decision to take a vaccine or not. The COVID-19 pandemic showed the need for social distancing and mask wearing to reduce the spread of the virus. For these reasons, tweets related to influenza ("flu") or immunization ("vaccine" and "vaccination") and, by extension, to COVID-19 comprise threads discussing protection measures (like vaccination) and the responsibility to use them (such as taking a vaccine). It is important to highlight, based on prior studies [19,87,88], that the intent to take a vaccine is considered by the younger adult US population as an act of collective responsibility.
3. "Politics" is a cluster showing the divergence of opinions and messages of US political leaders (ie, Republicans and Democrats) about the severity of the crisis and the efforts to reduce disease transmission [89].
Besides this cluster, it is important to remember that in parallel to the first year and first waves of the COVID-19 pandemic, 2020 was an election year. Thus, the local and national management of this global epidemic was a source of political debates, and support or criticism of governments, administrations, and the health care system.
The mechanisms behind the folksonomy rely on a complex set of factors. First, as pointed out above, the reasons for the emergence of each cluster depend on both culture and real-life events. Second, these mechanisms can be quantified by analyzing terms that frequently appear together (n-grams). Thus, in the context of this research, we observed that the tweets related mainly to COVID-19 pandemic events (disease, confinement, politician talks, vaccine approvals, and vaccination) and that this focus increased over time, as did the prevalence of the terms "vaccines" and "vaccination," in contrast with the term "flu," which disappeared over time from the tweets. This reflects the fact that COVID-19 measures, such as social distancing and mask wearing, significantly reduced the seasonal influenza rates in 2020-2021 [73,90,91]. However, a potential major reason for these changes in trends, and therefore in the folksonomy content, may be the diversion of citizens' attention from annual influenza spread, caused by the disruptive and menacing COVID-19 pandemic. These distractions induced different behaviors or feelings, such as devastation, fear, worry, and the need to understand [92,93].

Strengths and Limitations
Social media and social networks are increasingly being used to disseminate multimodal and multisource-based health-related information in a timely manner. In the context of epidemics and pandemics, such as seasonal influenza and COVID-19, health care organizations and governmental institutions nowadays spread information and run communication campaigns on social media, for example, to increase citizen engagement in vaccination. At the same time, individuals share their positions, including those associated with the antivax trend, and sometimes spread misinformation [94]. The strength of our study is its ability to provide health authorities with a weekly, monthly, and long-term folksonomy of the emerging or persisting topics of social media threads related to a health care issue or event, such as vaccination or a virus-related matter. Providing a folksonomy and the co-occurring terms in the same or additional clusters can enhance health-related social media campaigns by focusing on the general public's real-time interests and queries, similar to the approaches used in other business fields.
Timely reports make it possible to identify the topics, words, and terms frequently used on social media. This enables health communication specialists, and more specifically those dealing with social media, to focus on up-to-date campaigns to increase population engagement, as is done in other business fields [95], and on actions related to health promotion, especially during epidemics and crises [96] (eg, H1N1 [97] and Ebola [98]). Prior research that did not address the discovery or designation of terms, topics, and target populations has made a similar suggestion [99].
Exploring social media, and more particularly social networks, is limited by the passive exclusion of nonusers of these communication channels and of inactive users who only read posts but do not post themselves or respond to the messages of other users.
Another limitation of this study is that it was based only on tweets in English posted from the United States. This filtering limits the generalizability of the results. Given the diversity of the US population, running this kind of study in the United States in other languages would enable fine-tuning of health communication and could increase vaccination compliance in non-English speaking communities (ie, around 22.0% of the US population) [19,100].
In parallel with our study, another study dealing specifically and strictly with vaccination and COVID-19 was performed among Australian Twitter users (versus US Twitter users in our study) between January and October 2020 (versus between December 2019 and April 2021 in our study) and collected 31,100 tweets (versus 2,782,720 tweets collected by us). The analysis was based on latent Dirichlet allocation, an unsupervised learning approach that can be resource intensive at large scale [101]. The Australian tweet analysis revealed the following 3 dominant topics: (1) "COVID-19 and its vaccination," (2) "advocacy for infection control measures and vaccine trials," and (3) "conspiracy theories, complaints, and misinformation" [102]. Even though some convergence exists, these results differ from ours in that they focus more specifically on COVID-19-related issues.
Moreover, the set of words initially used for extracting the tweets ("influenza" OR "vaccine" OR "vaccination" OR "vaxx") allowed us to capture a larger spectrum of threads related to each of the terms of interest, rather than applying a strict filtering approach as in other prior research [101]. Nevertheless, because the extraction word set was not extended with terms related to the COVID-19 pandemic, potentially interesting tweets that did not contain one of these terms were not extracted. For example, the following tweet published in mid-April 2021, which included words detected in the n-gram analysis but none of the words used for tweet extraction, was not retrieved: "I am excited, I am in my county seat to get my first injection of the Pfizer." A future enhancement of trend tracking would be to update the terms of the tweet extraction query with newly emerging terms driven by current events (eg, "covid," "dose," "injection," and trade names of vaccines). This enhancement can be achieved by a domain expert (ie, human action) or by automatically selecting words emerging as trending in a cluster of the folksonomy and in the co-occurrence frequency analysis (ie, n-grams) [95].
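The automatic variant of this query update could look like the following minimal sketch. It is not the method used in the study: the function names, the growth thresholds, and the toy tweets are hypothetical, and whole-word matching is a simplification of how a Twitter search query behaves. Only the extraction terms and the missed tweet are taken from the text above.

```python
from collections import Counter
import re

# Extraction terms as listed in the text above.
QUERY_TERMS = ["influenza", "vaccine", "vaccination", "vaxx"]


def tokenize(text):
    """Lowercase a tweet and keep alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches(text, terms):
    """True if the tweet contains at least one query term as a whole word."""
    return bool(set(tokenize(text)) & set(terms))


def emerging_terms(previous, current, known, min_count=2, min_ratio=2.0):
    """Return terms whose frequency grew by at least min_ratio between periods.

    previous and current are lists of already retrieved tweets for two
    consecutive periods. The thresholds are hypothetical and would need
    tuning (and a stop word filter) on real data.
    """
    before = Counter(t for tweet in previous for t in tokenize(tweet))
    now = Counter(t for tweet in current for t in tokenize(tweet))
    picked = []
    for term, count in now.items():
        if term in known or count < min_count:
            continue
        # Add-one smoothing so previously unseen terms do not divide by zero.
        if count / (before[term] + 1) >= min_ratio:
            picked.append(term)
    return sorted(picked)


# Toy tweets invented for illustration; not drawn from the study data set.
previous_period = ["flu vaccine clinic open", "influenza vaccination week"]
current_period = [
    "second injection of the vaccine",
    "vaccine injection booked",
    "vaccination injection site sore",
]

# Tweet quoted in the text above as having been missed by the original query.
missed = ("I am excited, I am in my county seat to get my first "
          "injection of the Pfizer.")

new_terms = emerging_terms(previous_period, current_period, QUERY_TERMS)
extended_query = QUERY_TERMS + new_terms
print(new_terms, matches(missed, QUERY_TERMS), matches(missed, extended_query))
```

In this toy setting, the term "injection" is selected because it appears repeatedly in the current period and not at all in the previous one, so the extended query would retrieve the tweet that the original query missed. A domain expert could still review the candidate terms before they are added.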
Additionally, given the large volume of tweets generated each minute, examining all tweets in real time is impossible without deploying high-performance computing infrastructure, which is available in dedicated centers. Accordingly, the objective of this research was to define a framework enabling health system decision makers to focus on specific issues in order to enhance their social media campaigns by understanding the topics discussed in a particular context (ie, vaccination and influenza). Furthermore, the tweets are collected daily (because of Twitter constraints when not using a paid platform) and analyzed, with the machine learning flow described in the methodology, at the weekly, monthly, and all-time levels. To address other terms of interest, changing the terms of the tweet extraction query allows the expansion of the current data set or the start of new research with the same methodology. This study shows that combining social media data, such as tweets, with artificial intelligence approaches, such as machine learning algorithms for text and data mining, enables an infodemiology and infoveillance study as a whole. More specifically, in this study, we observed the strength of this combined approach by following the changes in the contents and topics of the tweets over time and the influence of real-world events. As in other Twitter-based public health research, collecting, analyzing, and assessing the content of messages in near real time provides powerful indications to health decision makers for adapting and enhancing communication in emergency response and in planning [103].
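The weekly and monthly levels of analysis described above can be sketched as a simple aggregation of daily collected tweets into time buckets. This is only an illustration under stated assumptions: the function name, the choice of ISO weeks, and the toy tweets are hypothetical, and the clustering step of the actual machine learning flow is replaced here by plain term counting.

```python
from collections import Counter, defaultdict
from datetime import date
import re


def term_counts_by_period(dated_tweets, terms):
    """Count tweets mentioning each term per ISO week and per calendar month.

    dated_tweets is an iterable of (date, text) pairs.
    """
    weekly = defaultdict(Counter)
    monthly = defaultdict(Counter)
    for day, text in dated_tweets:
        iso = day.isocalendar()
        week_key = (iso[0], iso[1])  # ISO year and ISO week number
        month_key = (day.year, day.month)
        tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
        for term in terms:
            if term in tokens:
                weekly[week_key][term] += 1
                monthly[month_key][term] += 1
    return weekly, monthly


# Toy tweets invented for illustration; not drawn from the study data set.
toy_stream = [
    (date(2020, 1, 6), "flu season is rough"),
    (date(2020, 1, 8), "got the flu shot"),
    (date(2021, 4, 12), "covid vaccine appointment"),
    (date(2021, 4, 14), "covid shot done"),
]

weekly, monthly = term_counts_by_period(toy_stream, ["flu", "covid"])
for key in sorted(monthly):
    print(key, dict(monthly[key]))
```

Comparing the buckets over time gives the kind of trend view discussed in this section, for example a term that is frequent in early 2020 and absent in spring 2021. The all-time level corresponds to summing the counters across all buckets.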
In other words, these indications should support social media-based health information in targeting advertisements of recommendations, instructions, and directives according to social media users' interests and focuses (ie, terms appearing in the clusters of the folksonomy) disclosed passively in previous posts, shares, or likes. Moreover, social media platforms allow accurate targeting by stratifying advertising campaigns on sociodemographic attributes, such as age, gender, marital status, location, spoken language, and educational and professional background [104]. Thus, social media-based health information is intended to increase population adherence to health policies, such as vaccination against epidemic or pandemic diseases (eg, influenza and COVID-19), by delivering personalized messages that take into account both sociodemographics and domains of interest. For example, a young person who plays basketball, lives in an area with recurrent high acute influenza incidence in the young population, follows social media groups dealing with basketball, and shares posts related to vaccination hesitancy would receive advertisements with personalized content targeting young vaccination-hesitant individuals who play team sports and emphasizing that vaccination is the best solution to continue this activity during an epidemic [105].

Conclusions
Twitter is one of the leading social network platforms allowing anyone to share positions and information in any domain. Therefore, any kind of information published and spread about influenza and COVID-19, and the vaccines against each, can be perceived as reliable and can influence social media users. Specifically, during the COVID-19 pandemic, world leaders have widely used Twitter to communicate public health information to citizens. These messages had a strong effect on vaccination compliance [106], and the content and targeting of such health communication campaigns on social media can be improved dynamically.
This study allowed us to validate our initial hypothesis. Tweets are a source of information for understanding why it is recommended to take a vaccine and the public perception about it [107-109]. Indeed, we defined a folksonomy of the 3 main topics coexisting in the collected messages over 16 months. Accordingly, the terms and hashtags of tweets concerning "influenza," "vaccines," and "vaccination" can be organized in a dynamic vocabulary, such as a folksonomy, reflecting the main topics and their terms discussed over time on the social media platform. Additionally, the emergence and dominance of terms related to COVID-19 over time, reported in the folksonomy with frequently co-occurring words, show that although the study did not initially focus on this theme, changes in the health situation are reflected in the Twitter threads related to vaccines and vaccination.
This study focused initially on vaccination against influenza and moved to vaccination against COVID-19. Infoveillance on Twitter (and other social media) about the topics related to vaccines and vaccination against communicable diseases can create opportunities to design and convey personalized messages encouraging specific targeted subpopulations' engagement in vaccination. A greater likelihood that a targeted population receives a personalized message is associated with a higher response, engagement, and proactiveness of the target population for vaccination or other public health measures [110].