Detecting Elevated Air Pollution Levels by Monitoring Web Search Queries: Algorithm Development and Validation

Background: Real-time air pollution monitoring is a valuable tool for public health and environmental surveillance. In recent years, there has been a dramatic increase in air pollution forecasting and monitoring research using artificial neural networks. Most prior work relied on modeling pollutant concentrations collected from ground-based monitors and meteorological data for long-term forecasting of outdoor ozone (O3), oxides of nitrogen, and fine particulate matter (PM2.5). Given that traditional, highly sophisticated air quality monitors are expensive and not universally available, these models cannot adequately serve those not living near pollutant monitoring sites. Furthermore, because prior models were built on physical measurement data collected from sensors, they may not be suitable for predicting the public health effects of pollution exposure.

Objective: This study aimed to develop and validate models to nowcast observed pollution levels using web search data, which are publicly available in near real time from major search engines.

Methods: We developed novel machine learning-based models, using both traditional supervised classification methods and state-of-the-art deep learning methods, to detect elevated air pollution levels at the US city level from generally available meteorological data and aggregate web-based search volume data derived from Google Trends. We validated the performance of these methods by predicting 3 critical air pollutants (O3, nitrogen dioxide [NO2], and PM2.5) across 10 major US metropolitan statistical areas in 2017 and 2018. We also explored different variations of the long short-term memory (LSTM) model and propose a novel search term dictionary learner-LSTM model to learn sequential patterns across multiple search terms for prediction.

Results: The top-performing model was a deep neural sequence model (LSTM) using meteorological and web search data; it reached an accuracy of 0.82 (F1-score 0.51) for O3, 0.74 (F1-score 0.41) for NO2, and 0.85 (F1-score 0.27) for PM2.5 when detecting elevated pollution levels. Compared with using only meteorological data, the proposed method achieved superior accuracy by incorporating web search data.

Conclusions: The results show that incorporating web search data with meteorological data improves nowcasting performance for all 3 pollutants and suggest promising novel applications for tracking global physical phenomena using web search data.


Introduction
Online crowd surveillance has been used as a means of tracking emergent risks to public health [1][2][3]. Most commonly, these efforts involve the collection of online search queries to document acute changes in the incidence of, or symptom occurrence due to, primary infectious disease agents such as influenza [4][5][6][7], Ebola [8], dengue fever [9], and COVID-19 [10]. These methods have the potential to provide public health and medical professionals with benefits over traditional health surveillance and environmental epidemiology through their ability to capture both personal exposures and response dynamics at more sensitive spatial and temporal scales [11].
Despite the promise of these approaches for infectious diseases, only a limited number of studies have examined how crowd-surveillance approaches can be used to track environmental exposures and, less frequently, responses to non-infectious, environmentally mediated disease processes [12][13][14]. The global burden of disease attributable to outdoor and indoor air pollution has been quantified by recent efforts and has increased public awareness of the severity of this public health crisis worldwide [15]. Urban air pollution therefore provides a key test case for the evaluation of online surveillance approaches for non-infectious environmental risks. The online surveillance approach is distinct from traditional approaches for measuring urban air pollution exposures and could therefore serve as a substitute for, or complement to, existing approaches. Traditional indicators of air pollution exposure, namely concentrations measured at ambient monitoring sites, are widely used to assess health effects associated with air pollution in epidemiological studies. However, the use of ambient monitoring measurements as surrogates of exposure may result in misclassification of health response and potential risk, especially for those not living near pollutant monitoring sites [16][17][18]. Moreover, ambient monitoring, by design, provides information on measured outdoor pollutant concentrations and may not accurately reflect personal exposures for individuals who spend the majority of their time indoors, or for those with preexisting biological susceptibility to air pollution. Several recent studies have focused on using smartphones within distributed air pollution sensing networks, where users record and upload local air pollution conditions to crowd-generated, geospatially refined pollution maps [12][13][14]. These studies demonstrate the feasibility of online crowd-generated participation in projects predicated on urban air pollution awareness.
To our knowledge, few studies have investigated the feasibility of using Web search data to produce accurate "nowcasts" of urban air pollution levels in real time. Conducting accurate predictions using Web search data poses two major challenges. The first is selecting search terms that capture people's responses comprehensively. Several approaches have been proposed for selecting search terms. For example, some studies first prepare keywords related to the target disease and then use these keywords to filter the search terms; this is often difficult because finding related keywords can be hard for some diseases, or costly when done for multiple diseases. The second is the selection of proper models. While the literature on data-driven nowcasting methods for estimating infectious disease activity is well developed from an epidemiological standpoint, the machine learning methods employed lag behind the state of the art. The nowcasting models introduced to date mainly use variations of regularized linear regression or, less often, random forests or support vector machines. From a machine learning perspective, the problem of disease activity estimation is best suited to more sophisticated, time-series-specific model architectures, and thanks to the growing volume of recorded environmentally mediated disease data, the use of recurrent neural networks (RNNs), and more specifically their variants, long short-term memory (LSTM) and gated recurrent unit (GRU) networks, is increasingly feasible. The vanilla LSTM model makes predictions relying solely on the time series of search activity while ignoring the semantic information in the search query phrases. Previous studies have pointed out that search queries can be semantically related, and that ignoring their correlation leads to a decrease in model performance [28,33]. Recent advances in natural language processing (NLP) have led to the development of word embeddings to represent the semantic information in phrases, and fine-tuning of word embeddings has been encouraged for downstream tasks [52]. However, there is still a lack of knowledge about how to incorporate both the semantic information of search queries and the time series of search activity to make predictions.
In this study, we investigate Web search data as one important source of an online crowd-based indicator. As Web search data is free and broadly accessible, we posit that it could serve as a scalable means of tracking urban air pollution exposures and the corresponding population-level health responses. To measure search interest, we use the freely accessible Google Trends service, which reports aggregate search volume data at city-level geographical resolution. For this analysis, we use known health endpoint terms and topics, such as "difficulty breathing", and observations (e.g., "haze") suggested by public health researchers, augmented by automatic term expansion based on semantic and temporal correlations, to estimate the levels of search activity related to air pollution, and ultimately to predict whether pollution levels were elevated.
Compared with existing air pollution classification models, this study explores the use of Web search anomalies as an auxiliary signal for detecting air pollution. We compare our approach to state-of-the-art physical sensor-based models, which incorporate various covariates such as historical pollutant concentrations and meteorological data [19]. Using Web search data for prediction introduces several challenges, including an unclear relationship between search interest and pollution levels, and the trade-off between model complexity and convergence when including Web search data in a data-deficient scenario.
In summary, our contributions are:
• We propose a novel search term Dictionary Learner-Long Short-Term Memory (DL-LSTM) model to learn sequential patterns from broad historical records of Web search data for air pollution nowcasting.
• We compare the DL-LSTM models to a variety of baseline models on the efficacy of using Web search data to indicate exposure to a non-infectious environmental stressor (i.e., air pollution) and demonstrate that the proposed models are effective across different experimental settings.
• We evaluate the efficacy of combining Web search data and meteorological data for air pollution prediction and show that the inclusion of Web search data improves prediction accuracy and provides a promising substitute when historical pollutant data are unavailable.


Methods
We now describe the methodology: we first formalize the problem setting, then describe the data, and finally introduce our modeling approaches.

Problem Statement
We formalized this task as a classification problem and adapted state-of-the-art machine learning models. As baseline models, we constructed a multivariate autoregressive model and a random forest model, fit on historical air pollutant concentrations as well as search and meteorological data.
We evaluate the performance of our proposed models (described below) against these baselines on prediction accuracy and other standard classification metrics.

Data Collection
We collected daily air pollutant concentrations as well as temperature and relative humidity for the ten largest US metropolitan statistical areas (MSAs) from January 2007 to December 2018. We focused on three air pollutants: O3, NO2, and PM2.5. The in-situ pollutant concentrations and meteorological data, such as temperature, relative humidity, and dew-point temperature, were retrieved from the US Environmental Protection Agency (EPA) Air Quality System (AQS) and AirNow databases. To create a single daily pollutant concentration for each city, we used the median concentration across all available monitoring sites within each city to avoid outlier bias. We collected the daily search frequency of pollution-related terms from Google Trends for the same 12-year period and cities. We created a curated list of 152 pollution-related terms based on our previous air pollution epidemiology studies and a review of the environmental health literature [41][42][43][57][58][59], and downloaded the trending results using PyTrends [44]. For each PyTrends request, we downloaded the search history of pollution-related terms over a six-month window, with one month of overlap for calibration. PyTrends provides search frequencies scaled to a range of 0 to 100 based on a topic's proportion of all searches on all topics. Because of this restriction, we downloaded the trending results multiple times, with the search frequencies scaled separately within each six-month window, which required us to calibrate the search frequencies across the 12-year period. We calibrated the search frequencies by joining the search logs on the overlapping periods (1 of every 6 months) for inter-calibration [45].
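The overlap-based inter-calibration can be illustrated with a short sketch. Assuming the scaling within each Trends window is linear, one simple approach is to rescale each new window by the ratio of means over the shared month; the exact procedure of [45] may differ in detail, and `calibrate_windows` and the toy series below are illustrative, not the study's code:

```python
import numpy as np

def calibrate_windows(prev: np.ndarray, curr: np.ndarray, overlap: int) -> np.ndarray:
    """Rescale `curr` onto the scale of `prev` using their shared overlap.

    `prev` and `curr` are daily search-frequency series (0-100 scale) from two
    consecutive Google Trends requests; the last `overlap` days of `prev`
    cover the same dates as the first `overlap` days of `curr`.
    """
    a = prev[-overlap:]
    b = curr[:overlap]
    # Ratio of means over the shared period gives the per-window scale factor.
    factor = a.mean() / b.mean()
    return curr * factor

# Toy example: the second window reports the same signal at half the scale.
w1 = np.array([10.0, 20.0, 30.0, 40.0])
w2 = np.array([15.0, 20.0, 25.0, 30.0])  # first 2 points overlap with w1's last 2
cal = calibrate_windows(w1, w2, overlap=2)  # rescaled onto w1's scale
```

Chaining this over consecutive windows stitches the separately scaled six-month reports into one consistent 12-year series.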
We investigated the available input features from meteorological data (temperature and relative humidity), historical pollutant concentrations, and Web search data (see Table 1).

Missing Data Imputation and Normalization
Smoothing and interpolation form a simple and efficient data imputation method [53]; we apply linear interpolation to fill missing values in historical pollutant concentration, temperature, and humidity, followed by a rolling mean with a window size of 3. To fill in missing data for infrequent search terms, for which Google Trends does not return a count, we use random numbers close to zero (between e^-10 and e^-5). We normalize all input features to standard scores by subtracting their mean values and dividing by the respective standard deviations.
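These three steps can be sketched with NumPy as follows; `impute_series`, `fill_sparse_search`, and the toy PM2.5 values are illustrative names and data, not the study's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def impute_series(x: np.ndarray, window: int = 3) -> np.ndarray:
    """Linearly interpolate NaNs, then smooth with a trailing rolling mean."""
    x = x.astype(float).copy()
    idx = np.arange(len(x))
    ok = ~np.isnan(x)
    x[~ok] = np.interp(idx[~ok], idx[ok], x[ok])
    # Trailing rolling mean of the given window (shorter at the start).
    return np.array([x[max(0, i - window + 1): i + 1].mean() for i in idx])

def fill_sparse_search(x: np.ndarray) -> np.ndarray:
    """Replace missing counts for infrequent terms with tiny random values."""
    x = x.astype(float).copy()
    miss = np.isnan(x)
    x[miss] = np.exp(rng.uniform(-10, -5, size=miss.sum()))
    return x

def zscore(x: np.ndarray) -> np.ndarray:
    """Normalize a feature to standard scores."""
    return (x - x.mean()) / x.std()

pm25 = np.array([12.0, np.nan, 18.0, 21.0])
smoothed = impute_series(pm25)  # NaN interpolated to 15.0, then smoothed
```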

Search Term Expansion (STE)
As online search queries may reflect individual exposures to ambient air pollution, the seed terms are mostly related to symptoms, observations, and emission sources (Appendix 1; Table S1). However, since an exhaustive list of user queries was not available, relying only on expert-generated seed words may result in poor prediction, due to the high mismatch rate between user queries and our expected search words.
Query expansion is a common approach to resolving this discrepancy. A recent study [28] showed that the initial set of seed words can be effectively expanded through semantic and temporal correlations. Thus, for each seed word, we use Google Correlate [46] to retrieve the top 100 correlated query terms. Then we use a pre-trained word2vec model [47] to retrieve the vector representation of each query; phrases are mapped to the centroid of their constituent terms. A utility score is calculated for each candidate query by measuring the maximum cosine similarity between the query and the seed words. Queries with a high utility score are retained and the remaining queries are eliminated; we empirically set the utility cutoff to 0.55. This method expands the set of search terms to a total of 152 terms to track (Appendix 1; Table S2).
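The utility-score filter can be sketched as below. The toy 2-dimensional vectors stand in for the pre-trained word2vec embeddings, and `utility` and the example terms are illustrative:

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def phrase_vector(phrase: str, emb: dict) -> np.ndarray:
    """Map a multi-word query to the centroid of its constituent word vectors."""
    return np.mean([emb[w] for w in phrase.split() if w in emb], axis=0)

def utility(candidate: str, seeds: list, emb: dict) -> float:
    """Maximum cosine similarity between the candidate query and any seed."""
    c = phrase_vector(candidate, emb)
    return max(cosine(c, phrase_vector(s, emb)) for s in seeds)

# Toy 2-d "embeddings" standing in for the pre-trained word2vec vectors.
emb = {
    "smog":  np.array([0.9, 0.1]),
    "haze":  np.array([0.8, 0.2]),
    "pizza": np.array([0.0, 1.0]),
}
CUTOFF = 0.55  # the empirically chosen utility cutoff from the text
keep = [q for q in ["haze", "pizza"] if utility(q, ["smog"], emb) >= CUTOFF]
```

With these toy vectors, "haze" clears the cutoff while the unrelated "pizza" is eliminated.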

Modeling and Evaluation
Problem Definition
Given sequences of physical sensor data P = [p_{t-L}, ..., p_{t-1}]^T ∈ R^(L×d_p) and search interest data S = [s_{t-L+2}, ..., s_{t+1}]^T ∈ R^(L×d_s), the task is to classify day t as "polluted" or not, where the positive class label indicates that air pollution was above a predefined threshold. L denotes the sequence length, and d_p and d_s are the number of physical sensor features and the number of search-related terms, respectively.
Autoregressive and Random Forest Classification Models
Previous work has shown that simple autoregressive models using Web search data can generate nowcast estimates for influenza-like illness (ILI) at the US national level [33]. We adapt the autoregressive models with a logistic regression (LR) classifier for classification purposes. Furthermore, we apply elastic net regularization, a linear combination of l1 and l2 regularization proposed in previous studies [28,33]. LR+Elastic Net was implemented using the Python scikit-learn package, using cross-validation to set the model's hyper-parameters to maximize the F1 score on the validation set, with class_weight set to "balanced".
Random forest is an ensemble learning model that is robust against over-fitting, providing a strong baseline for the development of nonlinear predictive models [50]. We used the scikit-learn implementation of random forests. The number of trees and the maximum depth of individual trees were selected to maximize the F1 score on the validation set, with balanced class_weight for positive and negative samples.
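A minimal scikit-learn sketch of the two baselines on synthetic data follows; the feature construction, labels, and hyperparameter values shown here are illustrative, not the cross-validated settings used in the study:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                    # stand-in lagged search + met features
y = (X[:, 0] + 0.5 * X[:, 1] > 0.8).astype(int)  # imbalanced "elevated" label

# Elastic net is a convex mix of l1 and l2 penalties; the saga solver supports it.
lr = LogisticRegression(penalty="elasticnet", solver="saga",
                        l1_ratio=0.5, C=1.0, class_weight="balanced",
                        max_iter=5000).fit(X, y)

# Random forest baseline with balanced class weights.
rf = RandomForestClassifier(n_estimators=100, max_depth=5,
                            class_weight="balanced", random_state=0).fit(X, y)
preds = rf.predict(X)
```

In the study, `l1_ratio`, `C`, the number of trees, and the maximum depth would be chosen by cross-validation to maximize the validation F1 score.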

Long Short-Term Memory and its Variants
Long short-term memory (LSTM) units [37] are recurrent neural network (RNN) models designed for sequence modeling, which can learn non-linear relationships in time series data [38]. We first describe a baseline LSTM model with two sub-networks that separate the search data and the meteorological data. As shown in (Figure 1), there are four layers in the model: the sequence embedding layer, the LSTM layer, the fully-connected hidden layer, and the output layer.
In the sub-network of the LSTM model that takes search data as input, we propose two methods to capture the semantic information in search terms. The first is the LSTM semantic model (LSTM-GloVe). As a variant of the vanilla LSTM model, in the sequence embedding layer of this sub-network in (Figure 1), we introduce a matrix multiplication operation that projects the search volumes of the search terms into their semantic embedding space (GloVe embeddings), as shown in (Equation 1).
Given the search interest data S = [s_1, ..., s_7]^T ∈ R^(7×d_s) and the GloVe embeddings G = [g_1, ..., g_{d_g}] ∈ R^(d_s×d_g), where d_g = 50 (GloVe 50-dimensional word vectors trained on tweets [52]), the matrix multiplication operation is defined as:

X = SG ∈ R^(7×d_g)    (1)

The tensor generated by this operation is then fed to the LSTM layer for further computation. The matrix multiplication is designed specifically to address the model consistency problem that arises when co-linear predictors are introduced after search term expansion.
The second variation of the LSTM model is the DL-LSTM model, which builds on the same matrix multiplication idea as LSTM-GloVe. However, instead of applying the GloVe embedding directly, it fine-tunes the word embeddings via a d_g × d_e ReLU-activated fully-connected layer. As shown in (Figure 2), this layer is applied to the initial GloVe embedding, where d_e = 100 is the size of the new embedding. In this architecture, the GloVe 50-dimensional word vectors are used to initialize the search term embedding dictionary, and the matrix multiplication operation transforms the search term inputs into the semantic embedding space.
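The two embedding projections can be sketched with NumPy as below; the random matrices stand in for the search volumes, the GloVe dictionary, and the fully-connected weights, which the real model would learn jointly with the LSTM:

```python
import numpy as np

rng = np.random.default_rng(0)
L, d_s, d_g, d_e = 7, 152, 50, 100  # sequence length, #terms, GloVe dim, new dim

S = rng.random((L, d_s))            # daily search volumes for all tracked terms
G = rng.normal(size=(d_s, d_g))     # stand-in for the terms' GloVe vectors

# LSTM-GloVe: project per-day search volumes into the semantic space (Eq. 1).
X_glove = S @ G                     # shape (7, 50), fed to the LSTM layer

# DL-LSTM: fine-tune the dictionary through a ReLU fully-connected layer.
W = rng.normal(size=(d_g, d_e))
b = np.zeros(d_e)
G_tuned = np.maximum(G @ W + b, 0.0)  # learnable embedding, ReLU-activated
X_dl = S @ G_tuned                    # shape (7, 100), fed to the LSTM layer
```

The key design point is that both variants multiply the raw (co-linear) search-volume matrix by a term-embedding dictionary before the recurrent layer ever sees it.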
In summary, we evaluate the following models in this paper:
• LR: Logistic regression classifier with elastic net regularization.
• RF: Random forest classifier with the number of trees and maximum depth tuned for prediction.
• LSTM: Baseline LSTM model as shown in (Figure 1), which combines physical sensor features, if available, with the search interest volume data directly, providing a direct adaptation of RNNs to this problem without any problem-specific extensions.
• LSTM-GloVe: The LSTM semantic model, a variant of the LSTM model as described by (Equation 1). We control the input of search interest data (i.e., 51 seed search terms vs. 152 terms after STE) in this model and refer to the variants as LSTM-GloVe and LSTM-GloVe w/ STE, respectively.
• DL-LSTM: The DL-LSTM model as shown in (Figure 2). We control the input of search interest data (i.e., 51 seed search terms vs. 152 terms after STE) in this model and refer to the variants as DL-LSTM and DL-LSTM w/ STE, respectively.

Evaluation metrics
Because we defined this task as a classification problem, we used standard classification evaluation metrics. We report the accuracy and the F1 score of the positive class (the harmonic mean of precision and recall) as evaluation metrics for all models. While accuracy measures the total fraction of correct predictions and can misrepresent model performance in the presence of heavily imbalanced classes, the F1 score takes class imbalance into account and is therefore a more appropriate metric for our problem:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
F1 = 2TP / (2TP + FP + FN)

where TP, TN, FP, and FN are the numbers of true positive, true negative, false positive, and false negative samples, respectively.
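Both metrics follow directly from the four confusion-matrix counts; a minimal sketch with illustrative counts:

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def f1(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts: under class imbalance the F1 score can be much
# lower than the accuracy, which is why both are reported.
acc = accuracy(tp=20, tn=60, fp=10, fn=10)
score = f1(tp=20, fp=10, fn=10)
```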

Overview
In this section, we first present the findings of the data exploration. Then we present the principal findings of this study.

Insights from Collected Data
In this section, we describe the thresholds of abnormal air pollutant concentrations, and then present the lag between search anomalies and air pollution.

Thresholds of Abnormal Air Pollutant Concentrations
The major MSAs chosen for this study have different distributions of pollutant concentrations over time, almost always falling below the EPA standard 24-hour threshold (Figure 3). Despite this, multiple studies have shown that even at low concentrations, chronic exposure to air pollution negatively affects human health [41,42]. Therefore, calibrating a meaningful threshold for each city, especially ones with generally lower levels of air pollution (e.g., Miami), may be critical for adequately protecting population health. A natural way to do this is to set the threshold at one standard deviation above the mean daily pollutant concentration within each city, which is the approach adopted in this study. The input predictors are also normalized within each city to reflect city-level dynamics. The resulting thresholds for the three pollutants and cities under investigation are reported in (Table 3).
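The city-specific labeling rule (one standard deviation above the city's mean daily concentration) can be sketched as follows, using toy concentrations for illustration:

```python
import numpy as np

def city_threshold(daily_conc: np.ndarray) -> float:
    """Elevated-pollution threshold: one SD above the city's daily mean."""
    return float(daily_conc.mean() + daily_conc.std())

def label_days(daily_conc: np.ndarray) -> np.ndarray:
    """1 = elevated (above the city-specific threshold), 0 = not elevated."""
    return (daily_conc > city_threshold(daily_conc)).astype(int)

# Toy daily concentrations: only the spike day exceeds mean + 1 SD.
conc = np.array([10.0, 12.0, 11.0, 30.0, 9.0, 10.0])
labels = label_days(conc)
```

Because the threshold adapts to each city's own distribution, a day that is "elevated" for a low-pollution city such as Miami may be unremarkable elsewhere.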

Lag between Search Anomalies and Air Pollution
A previous study showed that there can be a lag between incident occurrence and Google search activity [31]. As shown in (Figure 4), the normalized search frequency of the term "cough" is correlated with the concentration of NO2 in Atlanta with a certain time lag. To determine the lag between elevated pollution levels and consequent pollution-related searches, we calculated the mean absolute Spearman correlation between pollutant concentrations and search interest data, shifted forward in time by 0, 1, 2, and 3 days. As shown in (Table 4), for O3 and PM2.5, the mean absolute Spearman correlation increases with the number of shifted days. Considering that the task aims to detect elevated pollution levels as soon as possible, a lag of one day was applied to the search data. In other words, search interest data from the current day were used to estimate whether air pollution was elevated on the previous day.
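The lag analysis can be sketched with a rank-based Spearman correlation. This numpy-only version assumes distinct values (no tie handling) and uses toy series in which searches echo pollution one day later; the study's actual computation averages over many terms and cities:

```python
import numpy as np

def spearman(x: np.ndarray, y: np.ndarray) -> float:
    """Spearman correlation as the Pearson correlation of the ranks."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return float(np.corrcoef(rx, ry)[0, 1])

def lagged_abs_corr(pollutant: np.ndarray, search: np.ndarray, lag: int) -> float:
    """|Spearman| between pollution on day t and search interest on day t+lag."""
    if lag == 0:
        return abs(spearman(pollutant, search))
    return abs(spearman(pollutant[:-lag], search[lag:]))

# Toy series: search interest on day t mirrors pollution on day t-1.
pollution = np.array([3.0, 1.0, 4.0, 1.5, 5.9, 2.6, 5.3, 8.9])
searches = np.roll(pollution, 1)
best_lag = max(range(4), key=lambda k: lagged_abs_corr(pollution, searches, k))
```

Here the correlation peaks at a one-day shift, mirroring the choice of a one-day lag in the study.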

Evaluation Outcomes
In this section, we consider three conditions for evaluating the performance of using Web search data to detect elevated pollution: using only search data, using search data as auxiliary data alongside meteorological data, and using search data as auxiliary data alongside both meteorological data and historical pollutant concentrations.
Using Only Search Data: For areas where ambient pollution monitoring is unavailable, whether Web search data can serve as the only signal for nowcasting elevated air pollution is a vital question. When relying on search data alone, both the proposed DL-LSTM architecture and the search term expansion contribute to improved prediction accuracy. As shown in the "Search" section of (Table 5), the LSTM-based models exhibit superior accuracy over the baseline LR and RF models for O3 and NO2. For PM2.5, the proposed models do not perform better than the baseline LR or LSTM model because the validation and test datasets are heavily imbalanced (as shown in (Table 2)).

Using Search Data and Meteorological Data: When meteorological data is available, we investigated the feasibility of using it with or without search activity data to nowcast air pollution. As shown in the "Met" and "Met+Search" sections of (Table 5), the inclusion of Web search data improves the nowcasting accuracy for all three pollutants.

Using Search Data, Meteorological Data, and Historical Pollutant Concentration: When historical pollutant concentrations are available, search activity data is added as auxiliary data to both the meteorological data and the historical pollution data. As shown in the "Met+Pol" and "Met+Pol+Search" sections of (Table 5), the inclusion of Web search data improves the nowcasting accuracy for O3 and PM2.5. For NO2, however, the inclusion of Web search data does not improve the nowcasting accuracy, which suggests that increases in NO2 concentrations may not be noticeable enough to people to increase their search interest. This difference in performance across pollutants and locales merits further investigation.

City-level Analysis of Ozone Pollution Prediction
We investigated the potential of using search interest and meteorological data to replace ground-based ozone sensor data for ozone pollution prediction in individual cities. As shown in (Table 6), adding search interest data (Met+Search) to purely meteorological data (Met) increases both the accuracy and F1 metrics for most cities. While these metrics do not reach the performance achieved when ground-level pollution sensors are available (Met+Pol), for at least two of the major MSAs (Philadelphia and Houston), search volume data provides a usable alternative to pollution monitors, with only 1.6% and 0.14% degradation in accuracy, respectively. Moreover, the differences in model performance across cities indicate that online search patterns can vary from city to city. As shown in (Table 7), the top five correlated terms differ across US cities over the study period. This variation in search patterns can lead to degraded prediction performance in certain areas, leaving promising directions for improvement.

Sensitivity Analysis of Air Pollution Thresholds
Classification thresholds play an essential role in our model. In this study, one standard deviation above the mean of the corresponding pollutant was used as the threshold for detecting air pollution at a spatio-temporal resolution. However, the proposed method is sensitive to this threshold. We therefore investigated the performance of the proposed method across a variety of fixed classification thresholds. As shown in (Figure 5), (Figure 6), and (Figure 7), we fixed the classification thresholds across all ten cities for detecting ozone, NO2, and PM2.5 pollution. The results show that the meteorological and search data are complementary, and that combining them leads to better prediction performance at all classification thresholds.

Discussion
Principal Findings
In this study, we explored various existing air pollution prediction models and found that a time-series neural network approach achieves the highest predictive accuracy in most of our experiments. The results showed that the LSTM-based models achieve superior accuracy for all three air pollutants when both meteorological data and Web search data are available. Furthermore, our results on the inclusion of Web search data alongside meteorological data indicate that, under short reporting delays, the LSTM models can provide highly accurate predictions compared with baseline models using meteorological and/or historical pollutant concentration data. Compared with existing studies that predict urban air pollution concentrations using linear and nonlinear machine learning models [19][20][21][22][23][24][25], our proposed method can predict air pollution when source emissions and remotely sensed satellite data are unavailable (e.g., satellite data often suffer from a high missing rate due to frequent cloud cover [27]). Previous studies of online search behavior have emphasized the use of Google Trends [30,31] and applied regularized linear regression to co-linear Web search queries to estimate disease rates from social media or online search data [28,29,32,33]. Our research further explores the possibility of using LSTM models with semantic embeddings of search queries to predict air pollution. As shown in (Figure 8) and (Figure 9), the semantic embeddings of search terms fine-tuned by the DL-LSTM model are less correlated than their initial GloVe embeddings, which shows that the co-linearity between search terms is reduced during training.
We also explored various combinations of search terms and found that a comprehensive set of user queries is critical for accurately capturing people's responses to urban air pollution. In this study, we expanded the initial set of seed terms through semantic and temporal correlations with search queries from Google Correlate. We investigated the contribution of different search term groups by manually classifying the search terms into four categories, where the unclassified category includes terms with ambiguous meaning. (Table 8) shows the accuracy and F1 scores when we remove search terms by category for predicting ozone, NO2, and PM2.5 pollution. Removing the search terms in the symptom, observation, and source categories leads to a decrease in the accuracy score for detecting at least two pollutants. At the same time, removing the search terms with ambiguous meaning leads to only a slightly higher accuracy score for all three pollutants. Analyzing the coefficients of each search term shows that several search terms contribute more than others. We calculated the average feature importance of the seed search terms using the RF model. As shown in (Figure 10), (Figure 11), and (Figure 12), search terms including "particular matter", "rapid breathing", and "throat irritation" gained relatively high feature importance for detecting ozone, NO2, and PM2.5 pollution, respectively. The results also indicate that no single search term works best for all three pollutants.

Limitations
A key limitation of this study is the tuning of the neural network model. First, the performance of neural network models is sensitive to several hyperparameters, including optimization choices, depth, width, and regularization. Due to computational limits, we adopted a simple LSTM architecture with a single, 128-unit hidden layer and tuned the other hyperparameters using the validation datasets.
In addition, we noticed that stochastic components, such as the random seed for the RF model and the randomness in the optimization process of the LSTM models, could influence the interpretation of the results. Therefore, we repeated the experiments ten times with different random seeds for the RF and LSTM models. Because the time cost of repeating the LSTM models is high, we only repeated the RF, LSTM, and DL-LSTM models ten times for predicting ozone pollution with all input features. The accuracy of the DL-LSTM model was 0.8744 (SD 0.0046). Compared with the LSTM model (0.8714, SD 0.0036), the improvement is not significant (P=.11); compared with the RF model (0.8273, SD 0.0017), the improvement is significant (P<.001). The F1 score of the DL-LSTM model was 0.6314 (SD 0.0058). Compared with both the LSTM model (0.6019, SD 0.0096) and the RF model (0.5588, SD 0.0024), the improvements are significant (P<.001), which shows that the results of the LSTM models are stable.
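The repeated-seed comparison can be summarized as in the sketch below; the accuracy values are hypothetical (not the study's actual runs), and Welch's t statistic is shown as one common way to compare such repeated scores, without claiming it is the study's exact significance test:

```python
import numpy as np

def summarize_runs(scores) -> tuple:
    """Mean and sample standard deviation across repeated runs."""
    scores = np.asarray(scores, dtype=float)
    return float(scores.mean()), float(scores.std(ddof=1))

def welch_t(a, b) -> float:
    """Welch's t statistic for two independent samples of repeated scores."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    va = a.var(ddof=1) / len(a)
    vb = b.var(ddof=1) / len(b)
    return float((a.mean() - b.mean()) / np.sqrt(va + vb))

# Hypothetical per-seed accuracies for two models.
dl_lstm = [0.874, 0.871, 0.878, 0.869, 0.875]
rf      = [0.827, 0.829, 0.825, 0.828, 0.826]
t = welch_t(dl_lstm, rf)  # a large |t| suggests a real difference
```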
There is room for further exploration of more sophisticated neural network model architectures for non-infectious disease prediction.We leave the exploration of deeper and wider architecture as future work.
Another limitation relates to the biases introduced by relying on search data, which may not reflect the underlying population demographics or experiences. While some of these issues are alleviated automatically by training a model against ground sensor pollution levels, understanding and correcting for these data biases requires further study. In the future, we plan to investigate other sources of crowd-based surveillance data, such as self-reports in social media, to augment traditional physical sensor methods, thus providing a more direct, human-centered measure of how people experience elevated air pollution levels.

Conclusions
In this study, we posit that while Web search data cannot yet replace ground-based pollution monitors completely, it may already serve as a valuable additional signal to augment ground-based pollution data, providing significant accuracy improvements for detecting unusual spikes in air pollution. We also find that the correlation between search terms and pollutant concentrations varies at the city level; therefore, the model needs to be fine-tuned when applied to specific cities. For model and search term selection, we use the simplest LSTM architecture with a dictionary learner module, and we find that no single search term works best for all three pollutants. We propose using our model to learn semantic correlations between the available search terms to obtain better prediction results.

Figure 1: The Architecture of the LSTM Model.

Figure 2: The Architecture of the DL-LSTM Model.

Figure 3: Distribution of pollution values for Atlanta, Los Angeles, Philadelphia, and Miami, with city-specific elevated pollution level (dashed line) and the general EPA-mandated standard (dotted line), for O 3 (left column), NO 2 (middle column), and PM 2.5 (right column).

Figure 4: NO 2 levels and search interest for term "cough" in Atlanta, October 2016.

Figure 5: Accuracy (left figure) and F1 score (right figure) for detecting Ozone pollution on various classification thresholds, with Met (LSTM model) and Met+Search (DL-LSTM w/ STE) as features.

Figure 6: Accuracy (left figure) and F1 score (right figure) for detecting NO 2 pollution on various classification thresholds, with Met (LSTM model) and Met+Search (DL-LSTM w/ STE) as features.

Figure 7: Accuracy (left figure) and F1 score (right figure) for detecting PM 2.5 pollution on various classification thresholds, with Met (LSTM model) and Met+Search (DL-LSTM w/ STE) as features.

Figure 8: Cosine similarity between GloVe embeddings of seed search terms.

Figure 9: Cosine similarity between trained embeddings of seed search terms.

Figure 10: Average feature importance for detecting Ozone pollution using Met + Search (RF) model.

Figure 11: Average feature importance for detecting NO 2 pollution using Met + Search (RF) model.

Figure 12: Average feature importance for detecting PM2.5 pollution using Met + Search (RF) model.

Table 1: Input features calculated per time step in the input sequence.

Table 2: The distribution of classes in the train, validation, and test sets.

Validation
To tune model parameters and validate model performance, we split the available data into training (January 2007 to December 2014), validation (January 2015 to December 2016), and testing (January 2017 to December 2018) sets. This eight-year training period provides a broad history from which to learn the relationship between the input predictors and the output variables, and the predictive models are evaluated on their ability to make predictions for completely unseen periods. For evaluating our model, we make predictions for each day from January 2017 to December 2018 in the test dataset. The distribution of classes in the train, validation, and test datasets is reported in (Table 2). Note that the positive and negative classes are heavily imbalanced, with the positive class comprising, for instance, only 16% of training samples when PM2.5 is the target pollutant.

Table 3: Classification thresholds for three pollutants across 10 major MSAs in the U.S.

Table 4: Cross-correlation of the top five search terms with different lags for three pollutants in the Atlanta metropolitan area in 2016.

Table 5: Accuracy and F1 scores of the LR, RF, and LSTM models for detecting elevated pollution across 10 major U.S. cities, for varying input feature combinations: no prior knowledge; search data only (Search); meteorological data only (Met); meteorological data and search data (Met+Search); meteorological data and historical pollutant concentration (Met+Pol); and all input features (Met+Pol+Search).