Monitoring Information-Seeking Patterns and Obesity Prevalence in Africa With Internet Search Data: Observational Study

Background: The prevalence of chronic conditions such as obesity, hypertension, and diabetes is increasing in African countries. Many chronic diseases have been linked to risk factors such as poor diet and physical inactivity. Data for these behavioral risk factors are usually obtained from surveys, which can be delayed by years. Behavioral data from digital sources, including social media and search engines, could be used for timely monitoring of behavioral risk factors.
Objective: The objective of our study was to propose the use of digital data from internet sources for monitoring changes in behavioral risk factors in Africa.
Methods: We obtained the adjusted volume of search queries submitted to Google for 108 terms related to diet, exercise, and disease from 2010 to 2016. We also obtained the obesity and overweight prevalence for 52 African countries from the World Health Organization (WHO) for the same period. Machine learning algorithms (ie, random forest, support vector machine, Bayes generalized linear model, gradient boosting, and an ensemble of the individual methods) were used to identify search terms and patterns that correlate with changes in obesity and overweight prevalence across Africa. Out-of-sample predictions were used to assess and validate the model performance.
Results: The study included 52 African countries. In 2016, the WHO reported an overweight prevalence ranging from 20.9% (95% credible interval [CI] 17.1%-25.0%) to 66.8% (95% CI 62.4%-71.0%) and an obesity prevalence ranging from 4.5% (95% CI 2.9%-6.5%) to 32.5% (95% CI 27.2%-38.1%) in Africa. The highest obesity and overweight prevalence were noted in the northern and southern regions. Google searches for diet-, exercise-, and obesity-related terms explained 97.3% (root-mean-square error [RMSE] 1.15) of the variation in obesity prevalence across all 52 countries. Similarly, the search data explained 96.6% (RMSE 2.26) of the variation in the overweight prevalence. The search terms yoga, exercise, and gym were most correlated with changes in obesity and overweight prevalence in countries with the highest prevalence.
Conclusions: Information-seeking patterns for diet- and exercise-related terms could indicate changes in attitudes toward and engagement in risk factors or healthy behaviors. These trends could capture population changes in risk factor prevalence, inform digital and physical interventions, and supplement official data from surveys.


Introduction
Globally, obesity and overweight are the fifth leading cause of death, associated with at least 2.8 million adult deaths each year [1,2]. In Africa, the burden of obesity and overweight has increased significantly over the last two decades [3][4][5][6]. Among sub-Saharan African women, the prevalence of obesity increased by 12% between 1975 and 2016, while the prevalence of overweight increased by 24% [7][8][9]. Among men, obesity prevalence increased by 5%, while overweight prevalence increased by 15% in the same period [7][8][9].
Insufficient exercise and unhealthy diets (partly due to a nutrition transition from nutrient-dense foods to energy-dense foods) coupled with tobacco use and excessive alcohol consumption (factors predominantly associated with an urban lifestyle) are to blame for the increase in noncommunicable disease burden in Africa [10,11]. Specifically, urbanization and related economic advancements including higher income, higher education, and higher socioeconomic status have been associated with higher obesity prevalence [12][13][14][15][16]. Aging, cultural norms (eg, in some cultures female fatness symbolizes beauty, prosperity, and fertility), and television viewing habits have also correlated with increasing obesity prevalence [16][17][18][19][20].
Persons who are obese or overweight are at a higher risk of developing other medical conditions including hypertension, cardiovascular disease, type 2 diabetes, and stroke [21][22][23][24]. Joubert et al [25] noted that 68% of hypertensive disease, 38% of ischemic heart disease, 78% of type 2 diabetes, and 45% of ischemic stroke among adults in South Africa were due to obesity. The burden of obesity-associated noncommunicable diseases is expected to continue to increase in sub-Saharan African countries. Data suggest that millions of people living with diabetes in sub-Saharan Africa are unaware of their status and many lack access to necessary information and medications [4,[26][27][28][29]. Furthermore, obesity-related diseases have been associated with an increased risk of severe COVID-19 disease [30].
The rise in prevalence of noncommunicable diseases in Africa creates new challenges that many health care systems are not currently equipped to manage. Furthermore, the lack of high-quality data also creates a barrier in quantifying public health needs and addressing the impact of diseases [31]. This data limitation includes a substantial gap in the standard and availability of health data, especially where health information is not digitized or comprehensive [31].
Usually, data on behavioral risk factors are collected through surveys, which can be costly and capture only a single time point. In contrast, digital data from internet sources can capture timely changes in attitudes toward and engagement in risky behaviors. While computational and statistical approaches have been successfully used to process data from digital sources for monitoring infectious disease reports and chronic disease risk factors, few studies have focused on Africa [31][32][33][34][35][36][37][38][39][40][41][42][43]. As more people in Africa use internet platforms and mobile phones for seeking and sharing information, it is important to understand how behavioral data shared on digital platforms can be used to support and develop timely disease and risk factor surveillance platforms. Here, we assess how diet- and exercise-related searches submitted on an internet search engine can be used for monitoring information-seeking patterns and obesity prevalence in 52 African countries.

Data Collection
Search data were collected for 108 search terms (Multimedia Appendix 1) from Google application programming interfaces. The search terms included terms related to chronic diseases, risk factors, diet, and physical activity. To generate a comprehensive list of terms, we used the Google Trends website [44] to identify terms that had similar search trends for chronic diseases and their associated risk factors. We collected the yearly search volume for each country from 2010 to 2016 for 52 countries in English [45]. Google normalizes the search volume for each term relative to the search activity in the country and the specific time period. Two countries (South Sudan and Sudan) were excluded because obesity prevalence estimates were unavailable for these countries.
We also downloaded age-standardized obesity and overweight prevalence estimates for adults aged 18 years and older from 2010 to 2016 from the World Health Organization (WHO) website [46,47]. These estimates were obtained using data from population-based studies on cardiometabolic risk factors, multicountry and national measurement surveys, as well as the WHO STEPwise approach to surveillance (STEPS) surveys for estimating BMI [48]. Overweight was defined as a BMI ≥25 kg/m² and obesity was defined as a BMI ≥30 kg/m² [49]. The reported credible intervals (CIs) for the estimates represented the 2.5th and 97.5th percentiles of the posterior distributions.
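The BMI cutoffs above can be illustrated with a minimal Python sketch. It computes a crude (not age-standardized) prevalence from a list of BMI values, treating 25 and 30 kg/m² as inclusive thresholds. The function names and the toy BMI values are hypothetical, and the sketch does not reproduce the WHO estimation procedure.

```python
# Illustrative only: crude prevalence from individual BMI values.
# The WHO estimates used in the study are age-standardized and model-based;
# nothing here reproduces that procedure.

OVERWEIGHT_CUTOFF = 25.0  # kg/m^2, assumed inclusive
OBESE_CUTOFF = 30.0       # kg/m^2, assumed inclusive


def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index in kg/m^2."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


def crude_prevalence(bmis, cutoff: float) -> float:
    """Share of individuals whose BMI is at or above the cutoff."""
    if not bmis:
        raise ValueError("bmis must not be empty")
    return sum(1 for b in bmis if b >= cutoff) / len(bmis)


# Hypothetical sample of four adults.
sample = [22.0, 25.0, 31.0, 27.5]
overweight_share = crude_prevalence(sample, OVERWEIGHT_CUTOFF)
obese_share = crude_prevalence(sample, OBESE_CUTOFF)
```

Note that, with these thresholds, the overweight category includes everyone who is also classified as obese.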

Machine Learning Methods
We used machine learning methods to identify search patterns that were associated with changes in obesity and overweight prevalence across African countries. Specifically, we employed support vector machine (SVM), random forest (RF), gradient boosting, and Bayes generalized linear model (GLM). These methods were selected to cover a broad range of approaches, including decision tree methods, kernel-based approaches, and least squares regression methods. We implemented these methods using the SuperLearner package in R [50,51], which generates estimates for each individual method and an ensemble of the methods.
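As a rough illustration of how an ensemble can combine individual methods, the Python sketch below chooses a single mixing weight between two sets of held-out predictions by minimizing mean squared error over a grid. This is a simplified stand-in, not the weighting scheme of the SuperLearner package; the function name, the grid search, and the toy numbers are all assumptions made for illustration.

```python
# Simplified ensemble sketch: convex combination of two learners' held-out
# predictions. This does NOT reproduce the SuperLearner package's own
# weighting procedure; it only illustrates the general idea.

def best_mixing_weight(y, pred_a, pred_b, steps=100):
    """Return (alpha, mse) where alpha * pred_a + (1 - alpha) * pred_b
    has the lowest mean squared error on y, searching alpha over a grid."""
    best_alpha, best_mse = 0.0, float("inf")
    for i in range(steps + 1):
        alpha = i / steps
        mse = sum(
            (t - (alpha * a + (1 - alpha) * b)) ** 2
            for t, a, b in zip(y, pred_a, pred_b)
        ) / len(y)
        if mse < best_mse:
            best_alpha, best_mse = alpha, mse
    return best_alpha, best_mse


# Hypothetical held-out targets and predictions: learner A is biased low,
# learner B is biased high, so an equal mix cancels the two errors.
y_holdout = [2.0, 4.0]
learner_a = [1.0, 3.0]
learner_b = [3.0, 5.0]
alpha, mse = best_mixing_weight(y_holdout, learner_a, learner_b)
```

In practice the weights would be fitted on cross-validated predictions rather than on the data used to train the individual learners, so that the ensemble does not simply reward the most overfitted method.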
RF regression is an extension of bootstrap aggregating ("bagging"). It involves the construction of de-correlated decision trees, which are averaged to reduce the variance of the prediction function. Trees are preferred candidates for bagging because they capture the complex interaction structures in the data and have relatively low bias if grown deep. Since the trees generated in bagging are identically distributed, the expectation of the average of B such trees is the same as the expectation of any one of the trees; the bias of the ensemble therefore matches that of an individual tree, and the improvement comes from variance reduction. The gradient boosting algorithm also involves the generation of ensembles of predictive trees. However, trees are built using the gradient boosting approach, which involves a sequential iterative fitting procedure to reduce bias by assigning higher weights to poorly fit samples and optimization via a loss function. An advantage of the gradient boosting algorithm is that nonlinearities and interactions do not need to be explicitly specified.
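The averaging argument for bagged trees can be written out explicitly. The notation is introduced here for illustration and is not taken from the study: T_b(x) denotes the prediction of the b-th tree, σ² the variance of a single tree's prediction, and ρ the pairwise correlation between trees (assumed positive).

```latex
\mathbb{E}\!\left[\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right] = \mathbb{E}\!\left[T_1(x)\right],
\qquad
\operatorname{Var}\!\left[\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right]
  = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}
```

As B grows, the second variance term vanishes and the first remains, which is why reducing the correlation between trees (the de-correlation step in RF) lowers the variance of the averaged prediction without changing its expectation.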
In contrast, SVM regression is similar to multiple linear regression when the relationship between X and y is linear: y = f(X) = W · X + b. However, SVM regression involves the application of kernel functions (eg, gaussian, polynomial, radial basis, and sigmoid kernel) to model nonlinearity between X and y. The SVM regression model parameters are selected to minimize an epsilon-insensitive cost function, which ignores errors smaller than a tolerance epsilon and penalizes larger errors linearly. The model parameters were selected by applying cross-validation to the training data.
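A minimal Python sketch of the epsilon-insensitive cost may help make this concrete. It shows only the loss for a single observation and a linear prediction of the form W · X + b; the function names, the default epsilon, and the toy values are assumptions, and the sketch does not include kernels, regularization, or the optimization used by any SVM library.

```python
# Illustrative epsilon-insensitive loss for SVM regression.
# Errors within +/- epsilon of the target cost nothing; larger errors are
# penalized linearly by the amount they exceed epsilon.

def linear_predict(weights, features, bias):
    """Linear prediction f(X) = W . X + b for one observation."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Loss for one observation: max(0, |y_true - y_pred| - epsilon)."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# Hypothetical example: prediction of 12.0 against a target of 10.0
# with a tolerance of 0.5 leaves 1.5 units of penalized error.
prediction = linear_predict([2.0, 1.0], [5.0, 1.0], 1.0)
loss = epsilon_insensitive_loss(10.0, prediction, epsilon=0.5)
```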
Lastly, Bayes GLMs are a class of GLMs that are a generalization of linear regression models such that the distribution of the dependent variable is of the exponential family (eg, gaussian, poisson, binomial, categorical, multinomial, or beta). In the Bayesian approach, inferences are based on the posterior distribution, prior knowledge is captured quantitatively through the prior distribution, and the data are represented through the likelihood function [52,53]. Two advantages of Bayesian models include the incorporation of domain knowledge via the prior and uncertainty quantification via the posterior distribution.
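In symbols, and using notation introduced here for illustration (β for the regression coefficients, g for the link function, x_i for the predictors of observation i), the Bayesian GLM combines the prior and the likelihood as follows:

```latex
p(\beta \mid y, X) \;\propto\; p(y \mid X, \beta)\, p(\beta),
\qquad
\mathbb{E}\!\left[y_i \mid x_i\right] = g^{-1}\!\left(x_i^{\top}\beta\right)
```

The left-hand relation is Bayes' rule up to a normalizing constant; the right-hand relation is the GLM mean structure. With a gaussian outcome and identity link, this reduces to a linear regression with a prior on the coefficients.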

Data Analysis
First, we estimated the Pearson correlation coefficient (r) between the search data and obesity and overweight prevalence across Africa from 2010 to 2016. Next, we excluded all search terms that had zero variance (ie, 20 search terms) and search terms that were not significantly correlated with obesity or overweight prevalence at a significance level of P<.05. Additionally, because there were zero reported searches for some terms in some countries, we excluded all terms with fewer than 50% of observations greater than zero, so that only the most significant and comprehensive variables were used in the modeling. We then fitted separate models to estimate obesity and overweight prevalence using the search data. The coefficient of determination (R²) and root-mean-square error (RMSE) were used to assess the model fit. The out-of-sample estimation involved splitting the data into 2 sets: data from 2010 to 2014 were used to train the model, while data from 2015 to 2016 were used to evaluate the model. In machine learning, the data used to train the model are usually different from the data used to validate it. The training data are used to fit the model (ie, train the algorithm to identify patterns) and the evaluation data are used to assess the predictive performance of the fitted model by comparing the model estimates to true values. The aim is to assess whether the model generalizes to future data; in the absence of future data, the held-out evaluation data serve as a proxy. We also report the correlation between the out-of-sample predictions and WHO-estimated obesity and overweight prevalence. The following R packages were used: SuperLearner, randomForest, kernlab, and arm [51,54].
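The screening, splitting, and evaluation steps described above can be sketched in plain Python. This is an illustration under stated assumptions rather than the analysis code used in the study (which was written in R): the function names, the record layout with a "year" key, and the toy numbers are hypothetical, and the significance test on the correlation is omitted.

```python
import math

# Illustrative sketch of the analysis steps; not the study's R code.

def pearson_r(x, y):
    """Pearson correlation coefficient; NaN if either series has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def keep_term(values, min_nonzero_share=0.5):
    """Keep a search term only if it varies and at least half of its
    observations are greater than zero."""
    nonzero_share = sum(1 for v in values if v > 0) / len(values)
    has_variance = len(set(values)) > 1
    return has_variance and nonzero_share >= min_nonzero_share


def split_by_year(rows, last_train_year=2014):
    """Train on years up to last_train_year; evaluate on later years."""
    train = [r for r in rows if r["year"] <= last_train_year]
    test = [r for r in rows if r["year"] > last_train_year]
    return train, test


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and estimated values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination, computed as 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical country-year records for one country.
rows = [{"year": year, "gym": float(year - 2009)} for year in range(2010, 2017)]
train_rows, test_rows = split_by_year(rows)
```

Splitting by year rather than at random keeps the evaluation data strictly later than the training data, which mirrors the intended use of estimating prevalence for periods not yet covered by surveys.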

Estimating Obesity With Search Trends
Twelve of the terms that were significantly correlated with obesity prevalence (ie, hypertension, breakfast, diet, nutrition, obese, green tea, weight gain, lose weight, weight loss, weight, gym, and malnutrition) were used in modeling to estimate obesity prevalence. The estimated proportions of variance explained by the models were 0.97, 0.92, 0.77, and 0.30 for RF (Figure 3), gradient boosting, SVM, and Bayes GLM, respectively; the corresponding RMSEs were 1.15, 1.87, 3.53, and 5.60. Likewise, the correlations between the out-of-sample estimates (ie, estimates for data not used to train the model) and obesity prevalence were 0.96, 0.94, 0.87, and 0.56 for RF, gradient boosting, SVM, and Bayes GLM, respectively.
Similarly, 8 search terms (hypertension, breakfast, diet, nutrition, obese, lose weight, gym, and malnutrition) were used in modeling to estimate overweight prevalence. The RF model was also the best performing model for estimating overweight prevalence (Figure 4). The estimated proportions of variance explained by the models were 0.96 (RMSE 2.26), 0.91 (RMSE 3.56), 0.62 (RMSE 7.72), and 0.23 (RMSE 9.99) for RF, gradient boosting, SVM, and Bayes GLM, respectively; the corresponding correlations between the out-of-sample model estimates and overweight prevalence were 0.95, 0.94, 0.78, and 0.49.

Discussion
Our study assessed the potential use of information-seeking trends of obesity- and overweight-related terms for monitoring these conditions in Africa. Several of the search terms were correlated with changes in obesity and overweight prevalence and, when modeled together, produced estimates that were significantly correlated with data from the WHO. Data from internet sources, including social media and search engines, can capture detailed information on individuals' well-being that can collectively reflect community perceptions of health. Web searches, unlike social media, can more accurately reflect information-seeking patterns on sensitive or stigmatized health topics since individuals tend to consider their searches private [55].
As African nations become more urbanized, digital data and tools could be useful for monitoring changes in behavioral risk factors, which could help public health officers, policy makers, health providers, and nutritionists to make informed decisions on chronic disease prevention efforts in Africa. Similarly, health care professionals can also use digital platforms to seek information on advances in medical practice, disseminate health information, and communicate with and support patients [56,57]. However, digital health implementation in some African countries is constricted by systemic hurdles such as weak health systems and a lack of coordination of mushrooming pilot projects [58].
A research agenda around monitoring risk factors for noncommunicable diseases using digital platforms should focus on quantifying changes in the intent to engage in behavioral risk factors, postings of engagement on social media, and information seeking on poor diet, physical inactivity, and other risk factors. Interventions can target younger populations, who tend to use digital platforms and are at risk, to promote healthy behaviors (eg, to stop smoking or reduce intake of sugary drinks). Monitoring changes in discussion trends on digital platforms could make interventions designed for both online and offline targeting more beneficial and help avoid the unintended effects of poorly designed campaigns. Furthermore, in regions where large data sets are available, systems that combine digital data, hospital data, and demographic data can be developed for quantifying the prevalence of these risk factors at a granular level (ie, subnational or subregional) where survey estimates are unavailable or delayed.
A major limitation of this study is that we did not collect data in other languages spoken in Africa (including Swahili, Portuguese, Sesotho, Zulu, Afrikaans, Xhosa, Tswana, Hausa, Tsonga, Afar, French, Arabic, and Somali). However, other studies suggest that English is used on the internet in many African countries [31,45]. Also, the obesity and overweight data are estimates that might not accurately reflect current obesity rates due to limitations in data and methods. Furthermore, the differences in search patterns between countries suggest a need for country-specific analysis. For example, there are local dieting fads (such as herbal life in South Africa) that should be monitored to capture local context. However, the number of observations was insufficient for fitting individual models to each country. Additionally, access to the internet might be influenced by socioeconomic status, which means that individuals seeking information on Google might not be representative of the total population [59][60][61].
However, our approach demonstrates that the adoption of internet technologies in Africa provides opportunities for studying and improving health. Obesity and overweight are health challenges faced by countries in Africa, and population information-seeking behaviors can inform how we design interventions. Information-seeking patterns on obesity-related risk factors could capture changes in attitudes, behaviors, and risk factor prevalence that could supplement official estimates from surveys.