Published in Vol 16, No 12 (2014): December

Cumulative Query Method for Influenza Surveillance Using Search Engine Data

Original Paper

1 Asan Medical Center, Department of Emergency Medicine, University of Ulsan, College of Medicine, Seoul, Republic of Korea

2 Asan Medical Center, Department of Preventive Medicine, University of Ulsan, College of Medicine, Seoul, Republic of Korea

3 Asan Medical Center, Department of Biomedical Informatics, Seoul, Republic of Korea

4 Brigham and Women's Hospital, Division of General Medicine and Primary Care, Boston, MA, United States

5 Harvard Medical School, Boston, MA, United States

6 Daum Communications, Search Development Unit, Seoul, Republic of Korea

Corresponding Author:

Min-Woo Jo, MD, PhD

Asan Medical Center

Department of Preventive Medicine

University of Ulsan, College of Medicine

88, Olympic-Ro 43-Gil, Songpa-gu

Seoul, 138-769

Republic of Korea

Phone: 82 2 3010 3350

Fax: 82 2 3010 3356

Email: leiseo@hanmail.net


Background: Internet search queries have become an important data source in syndromic surveillance systems. However, there is currently no syndromic surveillance system using Internet search query data in South Korea.

Objectives: The objective of this study was to examine correlations between our cumulative query method and national influenza surveillance data.

Methods: Our study was based on the local search engine, Daum (approximately 25% market share), and influenza-like illness (ILI) data from the Korea Centers for Disease Control and Prevention. A quota sampling survey was conducted with 200 participants to obtain popular queries. We divided the study period into two sets: Set 1 (the 2009/10 epidemiological year for development set 1 and 2010/11 for validation set 1) and Set 2 (2010/11 for development Set 2 and 2011/12 for validation Set 2). Pearson’s correlation coefficients were calculated between the Daum data and the ILI data for the development set. We selected the combined queries for which the correlation coefficients were .7 or higher and listed them in descending order. Then, we created a cumulative query method n representing the number of cumulative combined queries in descending order of the correlation coefficient.

Results: In validation set 1, 13 cumulative query methods were applied, and 8 had higher correlation coefficients (min=.916, max=.943) than that of the highest single combined query. Further, 11 of 13 cumulative query methods had an r value of ≥.7, but 4 of 13 combined queries had an r value of ≥.7. In validation set 2, 8 of 15 cumulative query methods showed higher correlation coefficients (min=.975, max=.987) than that of the highest single combined query. All 15 cumulative query methods had an r value of ≥.7, but 6 of 15 combined queries had an r value of ≥.7.

Conclusions: The cumulative query method showed relatively higher correlation with national influenza surveillance data than combined queries in the development and validation sets.

J Med Internet Res 2014;16(12):e289

doi:10.2196/jmir.3680


Syndromic surveillance may alert public health care providers in the early phases of an outbreak, allowing them to decrease morbidity and mortality resulting from the outbreak [1-5]. Syndromic surveillance is defined as the real-time or near real-time collection, analysis, interpretation, and dissemination of health-related data to enable the early identification of the impact of potential human or veterinary public health threats that require effective public health action [1,3]. The 2009 H1N1 influenza pandemic highlighted the need for syndromic surveillance to inform policy and plan for effective responses.

Because conventional syndromic surveillance of indicators such as influenza-like illness (ILI) depends on case reporting to track disease activity, delays in reporting and case confirmation can interfere with the early detection of outbreaks or increases in influenza cases in the community. Thus, researchers have been investigating alternative data sources for the detection of outbreaks. For example, over-the-counter sales of medications and school absenteeism data have been used for earlier detection of outbreaks [6-12].

Internet search queries have become an important data source in recent years [13-22]. Internet search engines allow billions of people to have instant access to a vast amount of information online. New syndromic surveillance sources, such as Google Flu Trends (GFT), provide the potential to identify influenza outbreaks in real time [23]. Several studies have reported that GFT is highly correlated with conventional ILI surveillance data [23-28]. GFT has now been applied in many countries, but neither GFT nor other search query-based tools for disease surveillance are available in South Korea [24,27-30]. Generally, Google’s market share is dominant in the countries where GFT is available [24-29,31], but not in South Korea [32]. Studies using Google Trends for influenza surveillance show that it can be used as a complementary source of data but that its performance is insufficient for use as a model for prediction [33,34]. It is difficult to find queries that show high correlations for consecutive years because Internet searching behavior may change over time [33,34]. To reduce the effects from changes in search queries, we used a combination of queries and cumulation of combined queries from the search engine Daum. Daum is the second largest Web portal service provider in daily visits in South Korea (approximately 25% of the market share) [32,35]. Daum offers many Internet services to Web users, including email, messaging service, forums, shopping, and news. The main language is Korean.

In South Korea, influenza is generally seasonal, with most activity occurring during winter. The 2009/10 epidemiological year, called the Influenza A (H1N1) pandemic period, was an exceptional situation (see Multimedia Appendix 1). The primary objective of this study was to examine correlations between our cumulative query method and national influenza surveillance data.


Source of Data

Study Period

The study period ran from September 6, 2009 (week 36), through September 1, 2012 (week 34), giving 156 weeks of data over 3 consecutive epidemiological years. We divided the study period into two sets: Set 1 (the 2009/10 epidemiological year for development set 1 and 2010/11 for validation set 1) and Set 2 (2010/11 for development set 2 and 2011/12 for validation set 2).
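The 156-week figure can be checked with simple date arithmetic. The following is a minimal sketch using Python's standard library; the variable names are ours, and it assumes the period is counted inclusively from the first to the last day stated above.

```python
from datetime import date

# Study period boundaries as stated in the text.
start = date(2009, 9, 6)   # first day of week 36, 2009
end = date(2012, 9, 1)     # last day of week 34, 2012

# Inclusive day count, then split into whole weeks and leftover days.
n_days = (end - start).days + 1
n_weeks, leftover_days = divmod(n_days, 7)

print(n_weeks, leftover_days)
```
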

Collection of Influenza-Like Illness Data

We collected the ILI data from the Korea Centers for Disease Control and Prevention (KCDC) as a gold standard. KCDC ILI data were available from the KCDC website; we downloaded the ILI data for the study period from this site [36]. A KCDC case of ILI was defined as a person with a fever of 38°C with a cough and/or a sore throat [36]. ILI surveillance consisted of 850 sentinel clinics in South Korea, and the clinics reported weekly percentages of outpatients who met the case definition of ILI [36].

Survey for Obtaining Queries

To obtain population search queries related to influenza, we conducted a survey in September 2012 using quota sampling. The quotas were based on address in the resident registry, age, and sex. There were five quota groups by age: 20-29 years, 30-39 years, 40-49 years, 50-59 years, and 60 years or older. Half of each quota group was female. We randomly selected addresses from the resident registry in Seoul; if residents at a selected address met the quota criteria, we included the oldest eligible interviewee. We then conducted face-to-face interviews. The survey asked about participants’ history of searching for influenza and the queries they had typed. The survey was performed anonymously. The KCDC defines a case of ILI as a person with a fever (발열 in Korean) of 38°C with a cough (기침) and/or a sore throat (인후통). These three queries from the definition of ILI were included in the queries for the following operations, regardless of the survey result. When a query was originally submitted in English only, we translated it into Korean and added the translation as a new query.

Combination of Queries

We believe that people typically search for things of interest on the Internet using one or more queries at a time. To reflect people’s searching behavior and include as many queries as possible, we used a combination of queries. Queries from the survey results and the definition of KCDC ILI were divided into groups as follows: query group 1 consisted of queries specific to influenza (eg, “H1N1”, “Influenza”), and query group 2 contained queries not specific to influenza (eg, “Treatment”, “Symptom”). Then, we combined query groups 1 and 2. Combined queries consisted of query group 1 alone and a combination of query groups 1 and 2 (eg, “H1N1”, “H1N1 Treatment”, “H1N1 Symptom”, “Influenza”, “Influenza Treatment”, “Influenza Symptom”).
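The combination step can be sketched in a few lines of Python. This is an illustration only, not the authors' tooling: the function name `build_combined_queries` is hypothetical, and joining the two terms with a space is an assumption about how a combined query is typed. With 14 queries in each group, as reported in the Results, the same rule yields 14 + 14 × 14 = 210 combined queries.

```python
from itertools import product

def build_combined_queries(group1, group2):
    """Return group-1 queries alone plus every 'group1 group2' pairing."""
    singles = [str(q) for q in group1]
    pairs = [f"{specific} {general}" for specific, general in product(group1, group2)]
    return singles + pairs

# The example terms given in the text.
group1 = ["H1N1", "Influenza"]       # specific to influenza
group2 = ["Treatment", "Symptom"]    # not specific to influenza
combined = build_combined_queries(group1, group2)

for query in combined:
    print(query)
```
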

Collection of Data from Search Engine

We sent the combined queries and the queries that belonged to query group 1 (because these queries were searchable by themselves) to Daum and received proportional data in weekly form. Proportional data for these combined queries were extracted from the Daum search engine during development sets 1 and 2. Proportional data from the Daum search engine were calculated by dividing the number of each combined query by the total number of search queries for 1 week.
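As a small illustration of the proportional data described above, the sketch below divides a weekly count for one combined query by the total number of searches in the same week. The function name and the counts are hypothetical and are not Daum data.

```python
def weekly_proportions(query_counts, total_counts):
    """Divide each week's count for one combined query by that week's total search volume."""
    if len(query_counts) != len(total_counts):
        raise ValueError("both series must cover the same weeks")
    return [q / t for q, t in zip(query_counts, total_counts)]

# Hypothetical weekly counts, for illustration only (not Daum data).
query_counts = [120, 300, 90]
total_counts = [1_000_000, 1_200_000, 900_000]
proportions = weekly_proportions(query_counts, total_counts)
print(proportions)
```
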

Data Analysis

Creating Cumulative Query Methods and Data Analysis

Pearson’s correlation coefficients were calculated between the Daum data for the combined queries and the KCDC ILI data in development sets 1 and 2. We selected the combined queries for which the correlation coefficients were .7 or higher and listed them in descending order. To see the change of correlation coefficients over time, we also calculated correlation coefficients of the combined queries in subsequent epidemiological years. We then created a cumulative query method n representing the number of cumulative combined queries in descending order of the correlation coefficient. For example, cumulative query method 4 consisted of a summation of the proportional data from the 1st, 2nd, 3rd, and 4th highest combined queries on the correlation coefficient list. In validation sets 1 and 2, Pearson’s correlation coefficients were calculated between the cumulative query method n and the KCDC ILI data. Specifically in validation set 2, we analyzed the cumulative query methods from development set 2 as well as development set 1. Useful cumulative query methods in the validation sets were defined as those having a higher correlation coefficient than the highest correlation coefficient of a single combined query in the same development set. Analysis was performed using IBM SPSS Statistics software, version 20. Significance was set at P<.05.
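The procedure above can be sketched in plain Python. This is not the SPSS analysis used in the study: the function names, the toy series, and the query labels "A", "B", and "C" are all hypothetical. A constant series is treated as having no computable correlation.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's correlation coefficient, or None if either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # correlation cannot be computed for a constant series
    return sxy / sqrt(sxx * syy)

def rank_queries(dev_series, dev_ili, threshold=0.7):
    """Combined queries with r >= threshold in the development set, highest r first."""
    scored = []
    for name, series in dev_series.items():
        r = pearson_r(series, dev_ili)
        if r is not None and r >= threshold:
            scored.append((r, name))
    scored.sort(reverse=True)
    return [name for _, name in scored]

def cumulative_methods(ranked, val_series, val_ili):
    """r between ILI and the running sum of the top-n queries, for n = 1..len(ranked)."""
    running = [0.0] * len(val_ili)
    results = []
    for name in ranked:
        running = [s + v for s, v in zip(running, val_series[name])]
        results.append(pearson_r(running, val_ili))
    return results

# Toy data, for illustration only.
dev_ili = [1, 2, 4, 8, 4, 2]
dev_series = {"A": [1, 2, 4, 8, 4, 2], "B": [2, 2, 5, 7, 5, 2], "C": [5, 1, 5, 1, 5, 1]}
val_ili = [2, 3, 6, 9, 5, 2]
val_series = {"A": [0, 0, 0, 0, 0, 0], "B": [2, 3, 6, 9, 5, 2], "C": [1, 1, 1, 1, 1, 1]}

ranked = rank_queries(dev_series, dev_ili)            # "C" falls below the .7 threshold
results = cumulative_methods(ranked, val_series, val_ili)
print(ranked, results)
```

In this toy case, query "A" stops being searched in the validation year, so cumulative query method 1 cannot be computed, while cumulative query method 2 still tracks the ILI series because it also contains "B".
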

Institutional Review Board

This study was approved by the Institutional Review Board of Asan Medical Center (Seoul, Korea).


Survey for Obtaining Queries

We contacted 322 people and included 200 participants older than 20 years who lived in Seoul, Korea. Over a quarter (56/200, 28%) answered “Yes” to the question of searching history for influenza and provided search queries (Table 1).

Table 1. Results of the survey.
| Raw data | English translation | Frequency (%) |
|---|---|---|
| 신종 | New | 1 (1.8) |
| 신종플루 | New flu^a | 23 (41.1) |
| 신종플루 증상 | New flu symptom | 1 (1.8) |
| 신종플루 증세 | New flu sign | 1 (1.8) |
| 신종플루, 독감 | New flu, bad cold | 2 (3.6) |
| 신종플루, 목아픔 | New flu, neck pain | 1 (1.8) |
| 신종플루, 백신, Tamiflu | New flu, vaccine, Tamiflu (English)^b | 1 (1.8) |
| 신종플루, 신플 증상 | New flu, new flu (abbr.)^c symptom | 1 (1.8) |
| 신종플루, 인플루엔자, H1N1, PCR | New flu, influenza, H1N1 (English)^b, PCR^d (English)^b | 1 (1.8) |
| 신종플루, 조류독감 | New flu, bird flu | 1 (1.8) |
| 신종플루의 치료, 합병증 | New flu, treatment, complication | 1 (1.8) |
| 신종플루증상 | New flu symptom | 1 (1.8) |
| 신종플루증세, 예방, 마스크 | New flu sign, prevention, mask | 1 (1.8) |
| 신플증상 | New flu (abbr.)^c symptom | 1 (1.8) |
| 열, 기침 | Fever, cough | 1 (1.8) |
| 유행성독감, influenza | Epidemic bad cold, influenza (English)^b | 1 (1.8) |
| 인플루엔자 | Influenza | 7 (12.5) |
| 인플루엔자, 신종독감, 신종 플루 | Influenza, new bad cold, new flu | 1 (1.8) |
| 인플루엔자, 조류독감 | Influenza, bird flu | 1 (1.8) |
| 인플루엔자, 조류독감, 돼지독감, 신종플루 | Influenza, bird flu, swine flu, new flu | 1 (1.8) |
| 조류독감 | Bird flu | 5 (8.9) |
| 조류독감, 사망 | Bird flu, decease | 1 (1.8) |
| 증상, 목통증 | Symptom, throat pain | 1 (1.8) |
| Total | | 200 (100.0) |

^a Since the Influenza A (H1N1) pandemic period, media began to use “New flu (신종플루)” to distinguish the H1N1 influenza and previous influenzas in Korea. In 2010, KCDC announced that the official term was “Influenza (인플루엔자)”. But “New flu (신종플루)” and “Bad cold (독감)” are still more popular terms than “Flu (플루)” or “Influenza (인플루엔자)” in Korea. “Bad cold (독감)” in Korean has two meanings: one is influenza and the other, a severe common cold.

^b The query was originally submitted in English.

^c Abbreviation: “New flu (abbr.) (신플)” is the abbreviation of “New flu (신종플루)” in Korean.

^d PCR: polymerase chain reaction.

Combination of Queries From the Survey

Query group 1 contained 14 queries that were specific to influenza, and query group 2 had 14 queries that were not specific to influenza (Table 2). A total of 210 combined queries were submitted to Daum. Full data of combined queries are presented in Multimedia Appendix 2.

Table 2. Query groups 1 and 2 from the survey results and the KCDC definition of ILI^a.

| Query group 1 | Query group 2 |
|---|---|
| Flu | Vaccine |
| New flu | Prevention |
| New flu (abbr.)^b | Mask |
| Influenza | Symptom |
| Influenza (English)^c | Sign |
| New influenza | Cough |
| Bad cold^d | Fever |
| New bad cold | Neck pain |
| Epidemic bad cold | Sore throat |
| H1N1 (English)^c | Throat pain |
| Bird flu | PCR (English)^c,e |
| Swine flu | Treatment |
| Tamiflu | Complication |
| Tamiflu (English)^c | Decease |

^a Query group 1 consisted of queries specific to or related to influenza. Query group 2 contained queries not specific to influenza.

^b Abbreviation.

^c The query was originally submitted in English.

^d “Bad cold (독감)” in Korean has two meanings: one is influenza and the other, a severe common cold. “Flu” in query group 1 is “플루”, which is the English pronunciation written in Korean. In Korea, “Bad cold (독감)” is a more popular term than “Flu (플루)” or “Influenza (인플루엔자)”.

^e PCR: polymerase chain reaction.

Collection of Data From Search Engine

Correlation analysis was performed between the Daum data for combined queries and the KCDC ILI data in development sets 1 and 2 (Table 3). In development set 1, “New flu (abbr.)” had the highest correlation coefficient (r=.894, P<.001), and 13 combined queries had r values of ≥.7. Among these 13 combined queries, the number with r values of ≥.7 fell to 4 in validation set 1 and to 2 in validation set 2. In development set 2, “Bad cold + Symptom” had the highest correlation coefficient (r=.969, P<.001), and a total of 15 combined queries had an r value of ≥.7. Among these 15 combined queries, the number with r values of ≥.7 fell to 6 in validation set 2. Only “Tamiflu” and “New flu + Symptom” showed r values of ≥.7 for 3 consecutive years (Figure 1). Changes in the correlation coefficients for all combined queries over time are presented in Multimedia Appendix 2.

Table 3. Correlation analysis between the Daum data for combined queries and the KCDC ILI data in development sets 1 and 2.
| Order | Combined query (development set 1) | Development set 1 (2009/10) | Validation set 1 (2010/11) | Validation set 2 (2011/12) | Combined query (development set 2) | Development set 2 (2010/11) | Validation set 2 (2011/12) |
|---|---|---|---|---|---|---|---|
| 1 | New flu (abbr.)^a | .894^b | .622^b | ^c | Bad cold + Symptom | .969^b | .981^b |
| 2 | Flu + Vaccine | .871^b | -.062^d | -.157^e | New flu + Treatment | .951^b | .616^b |
| 3 | New flu + Cough | .849^b | .930^b | .291^b | New flu + Cough | .930^b | .291^b |
| 4 | New flu + Fever | .814^b | .591^b | .460^b | New flu + Sign | .919^b | .684^b |
| 5 | Tamiflu + Vaccine | .805^b | -.062^c | ^c | Tamiflu | .904^b | .981^b |
| 6 | Tamiflu + Symptom | .800^b | ^c | ^c | New influenza + Symptom | .896^b | .650^b |
| 7 | Flu + Symptom | .799^b | .815^b | .416^b | Bad cold + Treatment | .887^b | .814^b |
| 8 | H1N1 + Symptom | .791^b | ^c | ^c | Swine flu + Symptom | .877^b | .005^e |
| 9 | New flu + Sore throat | .738^b | .504^b | ^c | New flu + Symptom | .836^b | .936^b |
| 10 | New flu (abbr.)^a + Vaccine | .713^b | ^c | ^c | Flu + Symptom | .815^b | .416^b |
| 11 | New flu + Symptom | .709^b | .836^b | .936^b | Influenza + Symptom | .813^b | .782^b |
| 12 | Tamiflu | .703^b | .904^b | .981^b | Influenza (English)^g | .762^b | .751^b |
| 13 | Tamiflu (English)^g | .700^b,h | .523^b | .286^b | New influenza | .748^b | .503^b |
| 14 | | | | | Bird flu + Symptom | .747^b | .005^f |
| 15 | | | | | Bird flu | .709^b,h | .136^i |

^a abbr.: abbreviation.

^b P<.05.

^c Correlation cannot be computed because it has a constant value in that period (see Multimedia Appendix 2).

^d P=.66.

^e P=.27.

^f P=.98.

^g The query was originally submitted in English.

^h We selected the combined queries for which the correlation coefficients were ≥.7 and listed them in descending order.

^i P=.34.

Figure 1. Plot of combined queries that showed significant correlation coefficients (P<.05) in consecutive years (only “Tamiflu” and “New flu + Symptom” showed r values greater than .7 for 3 consecutive years).

Creating Cumulative Query Methods

A total of 13 cumulative query methods were created in development set 1 (see Table 4). In validation set 1, cumulative query methods 7, 8, 9, and 10 showed the highest correlation coefficients (r=.943, P<.001; see Multimedia Appendix 3). Eight of the 13 cumulative query methods were useful, which was defined as having a higher correlation coefficient than the highest correlation coefficient of a single combined query in development set 1 (min=.916, max=.943). However, only three of the cumulative query methods from development set 1 were useful in validation set 2 (min=.935, max=.953). The correlation did not increase when queries were added in cumulative query methods 5, 6, 8, 9, 10, and 13 in validation set 1. In validation set 2, cumulative query method 5 from development set 2 had the highest correlation coefficient (r=.987, P<.001; see Figure 2 and Multimedia Appendix 3). Eight of the 15 cumulative query methods from development set 2 were useful (min=.975, max=.987). The correlation did not increase when queries were added in cumulative query methods 3, 4, 7, 8, 10, 12, and 14 in validation set 2. Scatter plots between the KCDC ILI and other useful cumulative query methods are presented in Multimedia Appendix 4. Cumulative query methods for influenza virologic data are presented in Multimedia Appendix 5.

In each development set, cumulative query methods had a higher correlation coefficient than combined queries (see Tables 5 and 6). After 1 year, 11 of 13 cumulative query methods had an r value of ≥.7, but 4 of 13 combined queries had an r value of ≥.7 in validation set 1 (see Table 5 and Figure 3). All 15 cumulative query methods had an r value of ≥.7, but 6 of 15 combined queries had an r value of ≥.7 in validation set 2 (see Table 6 and Figure 4).
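The counts reported for validation set 1 can be reproduced from the 2010/11 columns of Table 5. The sketch below transcribes those values into lists. The helper name `count_at_least` is ours, and `None` marks a combined query whose correlation could not be computed.

```python
# Validation set 1 (2010/11) correlation coefficients transcribed from Table 5.
# None marks a combined query whose correlation could not be computed.
cumulative_r = [.622, .183, .916, .933, .933, .933, .943, .943, .943, .943, .838, .841, .841]
combined_r = [.622, -.062, .930, .591, -.062, None, .815, None, .504, None, .836, .904, .523]
best_single_dev_r = .894  # highest single combined query in development set 1

def count_at_least(values, threshold):
    """Count the computable coefficients that reach the threshold."""
    return sum(1 for v in values if v is not None and v >= threshold)

n_cumulative = count_at_least(cumulative_r, .7)   # cumulative query methods with r >= .7
n_combined = count_at_least(combined_r, .7)       # single combined queries with r >= .7
n_useful = sum(1 for v in cumulative_r if v > best_single_dev_r)  # "useful" methods

print(n_cumulative, n_combined, n_useful)
```
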

Table 4. Correlation coefficients of cumulative query method n in each validation set^a.

| Cumulative query method | Correlation coefficient in validation set 1 | Correlation coefficient in validation set 2 from development set 1 | Correlation coefficient in validation set 2 from development set 2 |
|---|---|---|---|
| 1 | .622^b | ^c | .981^b,d |
| 2 | .183^e | -.157^f | .975^b,d |
| 3 | .916^b,d | .092^g | .975^b,d |
| 4 | .933^b,d | .467^b | .975^b,d |
| 5 | .933^b,d | .467^b | .987^b,d |
| 6 | .933^b,d | .467^b | .986^b,d |
| 7 | .943^b,d | .486^b | .986^b,d |
| 8 | .943^b,d | .486^b | .986^b,d |
| 9 | .943^b,d | .486^b | .968^b |
| 10 | .943^b,d | .486^b | .968^b |
| 11 | .838^b | .935^b,d | .965^b |
| 12 | .841^b | .953^b,d | .965^b |
| 13 | .841^b | .953^b,d | .964^b |
| 14 | Not applicable | Not applicable | .964^b |
| 15 | Not applicable | Not applicable | .780^b |

^a We selected the combined queries for which the correlation coefficients were ≥.7 and listed them in descending order. We then created a cumulative query method n representing the number of cumulative combined queries in descending order of the correlation coefficients.

^b P<.05.

^c Correlation of cumulative query method 1 in validation set 2 from development set 1 cannot be computed because it has a constant value in that period (see Multimedia Appendix 2).

^d Useful cumulative query method in the validation set was defined as having a higher correlation coefficient than the highest correlation coefficient of a single combined query in the same development set.

^e P=.20.

^f P=.27.

^g P=.52.

Figure 2. Scatter plot between the KCDC ILI and cumulative query method 5 in validation set 2.
Table 5. Correlation coefficients of combined queries for which the correlation coefficients were ≥.7 and cumulative query methods in set 1.
| Cumulative query method from development set 1 (2009/10) | Correlation coefficient, 2009/10 | Correlation coefficient, 2010/11 | Combined query from development set 1 (2009/10) | Correlation coefficient, 2009/10 | Correlation coefficient, 2010/11 |
|---|---|---|---|---|---|
| 1 | .894 | .622 | New flu (abbr.)^a | .894 | .622 |
| 2 | .887 | .183 | Flu + Vaccine | .871 | -.062 |
| 3 | .883 | .916 | New flu + Cough | .849 | .930 |
| 4 | .861 | .933 | New flu + Fever | .814 | .591 |
| 5 | .860 | .933 | Tamiflu + Vaccine | .805 | -.062 |
| 6 | .859 | .933 | Tamiflu + Symptom | .800 | ^b |
| 7 | .849 | .943 | Flu + Symptom | .799 | .815 |
| 8 | .849 | .943 | H1N1 + Symptom | .791 | ^b |
| 9 | .851 | .943 | New flu + Sore throat | .738 | .504 |
| 10 | .853 | .943 | New flu (abbr.)^a + Vaccine | .713 | ^b |
| 11 | .712 | .838 | New flu + Symptom | .709 | .836 |
| 12 | .728 | .841 | Tamiflu | .703 | .904 |
| 13 | .728 | .841 | Tamiflu (English)^c | .700 | .523 |

^a abbr.: abbreviation.

^b Correlation cannot be computed because it has a constant value in that period (see Multimedia Appendix 2).

^c The query was originally submitted in English.

Table 6. Correlation coefficients of combined queries for which the correlation coefficients were ≥.7 and cumulative query methods in set 2.
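| Cumulative query method from development set 2 (2010/11) | Correlation coefficient, 2010/11 | Correlation coefficient, 2011/12 | Combined query from development set 2 (2010/11) | Correlation coefficient, 2010/11 | Correlation coefficient, 2011/12 |
|---|---|---|---|---|---|
| 1 | .969 | .981 | Bad cold + Symptom | .969 | .981 |
| 2 | .977 | .975 | New flu + Treatment | .951 | .616 |
| 3 | .978 | .975 | New flu + Cough | .930 | .291 |
| 4 | .982 | .975 | New flu + Sign | .919 | .684 |
| 5 | .970 | .987 | Tamiflu | .904 | .981 |
| 6 | .968 | .986 | New influenza + Symptom | .896 | .650 |
| 7 | .969 | .986 | Bad cold + Treatment | .887 | .814 |
| 8 | .967 | .986 | Swine flu + Symptom | .877 | .005 |
| 9 | .853 | .968 | New flu + Symptom | .836 | .936 |
| 10 | .853 | .968 | Flu + Symptom | .815 | .416 |
| 11 | .854 | .965 | Influenza + Symptom | .813 | .782 |
| 12 | .854 | .965 | Influenza (English)^a | .762 | .751 |
| 13 | .857 | .964 | New influenza | .748 | .503 |
| 14 | .857 | .964 | Bird flu + Symptom | .747 | .005 |
| 15 | .860 | .780 | Bird flu | .709 | .136 |

^a The query was originally submitted in English.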

Figure 3. Plot of combined queries for which the correlation coefficients were .7 or higher and cumulative query methods of set 1.

Figure 4. Plot of combined queries for which the correlation coefficients were .7 or higher and cumulative query methods of set 2.

Principal Findings

In this study, the cumulative query method showed relatively higher correlation with national influenza surveillance data than combined queries in the development and validation sets.

Many people use Internet searches for health information before visiting a doctor [18,20,23,33]. Hence, search query trends can reflect actual disease progression earlier than conventional surveillance. Queries used in earlier studies reflected only the authors’ opinions [13] or were obtained from databases [13,14,23,37]. To obtain population search queries, we carried out a survey.

Search queries may vary from country to country. In Korean, “Bad cold (독감)” has two meanings: one is influenza and the other, a severe common cold. Since the 2009/10 epidemiological season, the Influenza A (H1N1) pandemic period, the media have used “New flu (신종플루)” to distinguish H1N1 influenza from previous influenzas. In 2010, KCDC announced that the official term was “Influenza (인플루엔자)” [36]. However, “New flu (신종플루)” and “Bad cold (독감)” are still more popular terms than “Flu (플루)” or “Influenza (인플루엔자)” in Korea (Table 1).

For the 2009/10 epidemiological year (development set 1), 13 combined queries had correlation coefficient r values ≥.7. However, only 4 of these combined queries (“New flu + Cough”, “Flu + Symptom”, “New flu + Symptom”, and “Tamiflu”) had r values ≥.7 in the 2010/11 epidemiological year (validation set 1) (Table 3). In contrast, 11 of 13 cumulative query methods had an r value of ≥.7 in validation set 1 (see Table 5 and Figure 3). Among the 15 combined queries of development set 2, the number with r values of ≥.7 fell to 6 in validation set 2, whereas all 15 cumulative query methods had an r value of ≥.7 in validation set 2. We think that the cumulative query method is more robust over time and that this robustness helps improve the performance of surveillance using search queries. Internet searching behavior may change over time, and such changes could affect the performance of Web query-based surveillance models [31]. In this study, 20 of 210 combined queries had correlation coefficients for all 3 years, and only “Tamiflu” and “New flu + Symptom” showed r values of ≥.7 for 3 consecutive years (see Figure 1 and Multimedia Appendix 2). Recently, a study using Google Trends for influenza surveillance showed that Google Trends can be used as a complementary source of data [33]; however, its performance is insufficient for use as a model for prediction because its maximum correlation coefficient was .82 for only one query, “Fever”, in 2009, and the coefficient decreased to .64 in 2011 [33].

It is difficult to predict how search queries will change in the future. To reduce the effects of changes in search queries, we used a combination of queries and cumulation of combined queries to construct our method. Additionally, the method we wanted to develop was meaningful only when the cumulative query method had a higher correlation coefficient than the highest single combined query. In each validation set, 8 useful cumulative query methods were developed. The useful cumulative query methods in each validation set had a high correlation coefficient (Table 4). In validation set 2, the correlation coefficients of the useful cumulative query methods ranged from .975 to .987. These values are similar to or higher than those reported elsewhere [13,14,23,26,27,29,31]. In Europe, correlation coefficients of .716 to .940 were reported for GFT [27], and coefficients of .82 to .99 were reported in the United States [23,26,31]. In the 2009/10 epidemiological year, the Influenza A (H1N1) pandemic period, the proportional data of queries were likely to have differed from those of the other epidemiological years. This might have affected the performance of the cumulative query methods in set 1. The performance of the cumulative query methods from set 1 decreased with time (Table 4), which is thought to be related to the changes in queries (see Table 3). For some cumulative query methods, the correlation did not increase when queries were added. The added query gave no extra value in cumulative query methods 6, 8, and 10 in validation set 1 (see Table 4 and Multimedia Appendix 2) because combined queries 6, 8, and 10 from development set 1 had a constant value of 0 in validation set 1 (see Table 3 and Multimedia Appendix 2). The added queries were too small relative to the previous queries in cumulative query methods 5, 9, and 10 in validation set 1 and in cumulative query methods 3, 4, 7, 8, 10, 12, and 14 in validation set 2 (see Table 4 and Multimedia Appendix 2).
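The last point can be illustrated with a short sketch using made-up numbers (not study data; all names are ours). Adding a query that is constantly 0 leaves the running sum unchanged, so the correlation is exactly the same. Adding a query that is orders of magnitude smaller than the running sum changes the correlation only negligibly.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's correlation coefficient between two non-constant series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical weekly values, for illustration only.
ili = [2.0, 3.0, 6.0, 9.0, 5.0, 2.0]
current_sum = [20.0, 35.0, 50.0, 95.0, 45.0, 25.0]      # running sum of earlier queries
zero_query = [0.0] * 6                                  # constant 0 over the whole period
tiny_query = [0.001, 0.002, 0.001, 0.0, 0.002, 0.001]   # far smaller than the running sum

r_before = pearson_r(current_sum, ili)
r_plus_zero = pearson_r([s + z for s, z in zip(current_sum, zero_query)], ili)
r_plus_tiny = pearson_r([s + t for s, t in zip(current_sum, tiny_query)], ili)

print(r_before, r_plus_zero, r_plus_tiny)
```
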

We used proportional data from Daum, a non-dominant local search engine (approximately 25% of the market share) in South Korea [32]. Our cumulative query methods showed a strong correlation with KCDC ILI data. Generally, Google’s market share is dominant in countries where GFT is available [27,28]. Our study showed the possibility of developing a surveillance model using a non-dominant local search engine.

Limitations

There are several limitations to this study. The survey sample is not representative. Respondents were asked to provide typed queries without mention of the influenza pandemic of 2009/10, and because the survey was conducted recently, recent search queries were more likely to have been included. This might have affected the performance of the cumulative query method. Further, the data from the influenza pandemic of 2009/10 might have affected the outcome of this study. We did not combine queries from the same query group. Although important, the performance of using the symptoms in the KCDC definition of ILI was not tested. The learning effect from the influenza pandemic of 2009/10, news reports, outbreak briefs, health information from the Internet, and changing search behavior stemming from the diffusion of smartphones might have affected the outcome of this study. We did not determine the extent to which these factors affected searching behavior. More data for subsequent years are required to determine how long a cumulative query method remains useful.

Conclusion

We presented a cumulative query method using search engine data. We conducted a survey to obtain population search queries. To reduce the effects from changes in search queries, we used a combination of queries and cumulation of combined queries. Our method showed high correlation with national influenza surveillance data in South Korea. However, to further our method, additional research is needed.

Acknowledgments

This study was supported by grant 2012-0580 from the Asan Institute for Life Sciences, Seoul, Korea. The content is solely the responsibility of the authors and does not necessarily represent the official views of the Asan Institute for Life Sciences, Seoul, Korea. The technical consultation was supported by Daum Communications. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Conflicts of Interest

Our study was based on the search engine Daum. This study was partly supported by Daum Communications, the employer of author Maengsoo Yu.

Multimedia Appendix 1

Seasonality of influenza in South Korea.

PDF File (Adobe PDF File), 47KB

Multimedia Appendix 2

Full study data (proportional data from Daum are multiplied by 10 to the 12th power).

XLSX File (Microsoft Excel File), 393KB

Multimedia Appendix 3

Scaled proportional data based on the two best cumulative query methods.

PDF File (Adobe PDF File), 143KB

Multimedia Appendix 4

Scatter plots between the KCDC ILI and other useful cumulative query methods.

PDF File (Adobe PDF File), 397KB

Multimedia Appendix 5

Cumulative query method for influenza virologic data.

PDF File (Adobe PDF File), 54KB

  1. Triple S Project. Assessment of syndromic surveillance in Europe. Lancet 2011 Nov 26;378(9806):1833-1834.
  2. Hirshon JM. The rationale for developing public health surveillance systems based on emergency department data. Acad Emerg Med 2000 Dec;7(12):1428-1432.
  3. Henning KJ. What is syndromic surveillance? MMWR Morb Mortal Wkly Rep 2004 Sep 24;53 Suppl:5-11.
  4. Ferguson NM, Cummings DA, Cauchemez S, Fraser C, Riley S, Meeyai A, et al. Strategies for containing an emerging influenza pandemic in Southeast Asia. Nature 2005 Sep 8;437(7056):209-214.
  5. Longini IM, Nizam A, Xu S, Ungchusak K, Hanshaoworakul W, Cummings DA, et al. Containing pandemic influenza at the source. Science 2005 Aug 12;309(5737):1083-1087.
  6. Cheng CK, Cowling BJ, Lau EH, Ho LM, Leung GM, Ip DK. Electronic school absenteeism monitoring and influenza surveillance, Hong Kong. Emerg Infect Dis 2012 May;18(5):885-887.
  7. Egger JR, Hoen AG, Brownstein JS, Buckeridge DL, Olson DR, Konty KJ. Usefulness of school absenteeism data for predicting influenza outbreaks, United States. Emerg Infect Dis 2012 Aug;18(8):1375-1377.
  8. Galante M, Garin O, Sicuri E, Cots F, García-Altés A, Ferrer M, et al. Health services utilization, work absenteeism and costs of pandemic influenza A (H1N1) 2009 in Spain: a multicenter-longitudinal study. PLoS One 2012;7(2):e31696.
  9. Ohkusa Y, Shigematsu M, Taniguchi K, Okabe N. Experimental surveillance using data on sales of over-the-counter medications--Japan, November 2003-April 2004. MMWR Morb Mortal Wkly Rep 2005 Aug 26;54 Suppl:47-52.
  10. Patwardhan A, Bilkovski R. Comparison: Flu prescription sales data from a retail pharmacy in the US with Google Flu trends and US ILINet (CDC) data as flu activity indicator. PLoS One 2012;7(8):e43611.
  11. Vergu E, Grais RF, Sarter H, Fagot JP, Lambert B, Valleron AJ, et al. Medication sales and syndromic surveillance, France. Emerg Infect Dis 2006 Mar;12(3):416-421.
  12. Hill S, Mao J, Ungar L, Hennessy S, Leonard CE, Holmes J. Natural supplements for H1N1 influenza: retrospective observational infodemiology study of information and search activity on the Internet. J Med Internet Res 2011;13(2):e36.
  13. Eysenbach G. Infodemiology: tracking flu-related searches on the web for syndromic surveillance. AMIA Annu Symp Proc 2006:244-248.
  14. Hulth A, Rydevik G. Web query-based surveillance in Sweden during the influenza A(H1N1)2009 pandemic, April 2009 to February 2010. Euro Surveill 2011;16(18):-
  15. Yang AC, Huang NE, Peng CK, Tsai SJ. Do seasons have an influence on the incidence of depression? The use of an internet search engine query data as a proxy of human affect. PLoS One 2010;5(10):e13728.
  16. Eysenbach G. Infodemiology and infoveillance: framework for an emerging set of public health informatics methods to analyze search, communication and publication behavior on the Internet. J Med Internet Res 2009;11(1):e11.
  17. Chretien JP, George D, Shaman J, Chitale RA, McKenzie FE. Influenza forecasting in human populations: a scoping review. PLoS One 2014;9(4):e94130.
  18. Eysenbach G. Infodemiology and infoveillance tracking online health information and cyberbehavior for public health. Am J Prev Med 2011 May;40(5 Suppl 2):S154-S158.
  19. Zheluk A, Gillespie JA, Quinn C. Searching for truth: internet search patterns as a method of investigating online responses to a Russian illicit drug policy debate. J Med Internet Res 2012;14(6):e165.
  20. Bernardo TM, Rajic A, Young I, Robiadek K, Pham MT, Funk JA. Scoping review on search queries and social media for disease surveillance: a chronology of innovation. J Med Internet Res 2013;15(7):e147 [FREE Full text] [CrossRef] [Medline]
  21. Liang B, Scammon DL. Incidence of online health information search: a useful proxy for public health risk perception. J Med Internet Res 2013;15(6):e114 [FREE Full text] [CrossRef] [Medline]
  22. Yom-Tov E, Gabrilovich E. Postmarket drug surveillance without trial costs: discovery of adverse drug reactions through large-scale analysis of web search queries. J Med Internet Res 2013;15(6):e124 [FREE Full text] [CrossRef] [Medline]
  23. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L. Detecting influenza epidemics using search engine query data. Nature 2009 Feb 19;457(7232):1012-1014. [CrossRef] [Medline]
  24. Eurosurveillance editorial team. Google Flu Trends includes 14 European countries. Euro Surveill 2009;14(40):- [FREE Full text] [Medline]
  25. Malik MT, Gumel A, Thompson LH, Strome T, Mahmud SM. "Google flu trends" and emergency department triage data predicted the 2009 pandemic H1N1 waves in Manitoba. Can J Public Health 2011;102(4):294-297. [Medline]
  26. Ortiz JR, Zhou H, Shay DK, Neuzil KM, Fowlkes AL, Goss CH. Monitoring influenza activity in the United States: a comparison of traditional surveillance systems with Google Flu Trends. PLoS One 2011;6(4):e18687 [FREE Full text] [CrossRef] [Medline]
  27. Valdivia A, Lopez-Alcalde J, Vicente M, Pichiule M, Ruiz M, Ordobas M. Monitoring influenza activity in Europe with Google Flu Trends: comparison with the findings of sentinel physician networks - results for 2009-10. Euro Surveill 2010;15(29):- [FREE Full text] [Medline]
  28. Wilson N, Mason K, Tobias M, Peacey M, Huang QS, Baker M. Interpreting Google flu trends data for pandemic H1N1 influenza: the New Zealand experience. Euro Surveill 2009;14(44):- [FREE Full text] [Medline]
  29. Timpka T, Spreco A, Dahlström Ö, Eriksson O, Gursky E, Ekberg J, et al. Performance of eHealth data sources in local influenza surveillance: a 5-year open cohort study. J Med Internet Res 2014;16(4):e116 [FREE Full text] [CrossRef] [Medline]
  30. Pervaiz F, Pervaiz M, Abdur Rehman N, Saif U. FluBreaks: early epidemic detection from Google flu trends. J Med Internet Res 2012;14(5):e125 [FREE Full text] [CrossRef] [Medline]
  31. Cook S, Conrad C, Fowlkes AL, Mohebbi MH. Assessing Google flu trends performance in the United States during the 2009 influenza virus A (H1N1) pandemic. PLoS One 2011;6(8):e23610 [FREE Full text] [CrossRef] [Medline]
  32. Market share of search engine in South Korea. URL: http://www.webcitation.org/6QrnPuRwJ [accessed 2014-12-02] [WebCite Cache]
  33. Kang M, Zhong H, He J, Rutherford S, Yang F. Using Google Trends for influenza surveillance in South China. PLoS One 2013;8(1):e55205 [FREE Full text] [CrossRef] [Medline]
  34. Cho S, Sohn CH, Jo MW, Shin SY, Lee JH, Ryoo SM, et al. Correlation between national influenza surveillance data and google trends in South Korea. PLoS One 2013;8(12):e81422 [FREE Full text] [CrossRef] [Medline]
  35. Search engine Daum. URL: http://www.webcitation.org/6RvfPYTwI [accessed 2014-12-01] [WebCite Cache]
  36. Korea Centers for Disease Control and Prevention. URL: http://www.webcitation.org/6QrlqwNxO [accessed 2014-12-01] [WebCite Cache]
  37. Polgreen PM, Chen Y, Pennock DM, Nelson FD. Using internet searches for influenza surveillance. Clin Infect Dis 2008 Dec 1;47(11):1443-1448 [FREE Full text] [CrossRef] [Medline]


GFT: Google Flu Trends
ILI: influenza-like illness
KCDC: Korea Centers for Disease Control and Prevention
PCR: polymerase chain reaction


Edited by G Eysenbach; submitted 07.07.14; peer-reviewed by M Kang, E Lau; comments to author 28.07.14; revised version received 25.08.14; accepted 21.11.14; published 16.12.14

Copyright

©Dong-Woo Seo, Min-Woo Jo, Chang Hwan Sohn, Soo-Yong Shin, JaeHo Lee, Maengsoo Yu, Won Young Kim, Kyoung Soo Lim, Sang-Il Lee. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 16.12.2014.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.