Continuous Monitoring of Vital Signs With Wearable Sensors During Daily Life Activities: Validation Study

Background: Continuous telemonitoring of vital signs in a clinical or home setting may lead to improved knowledge of patients’ baseline vital signs and earlier detection of patient deterioration, and it may also facilitate the migration of care toward home. Little is known about the performance of available wearable sensors, especially during daily life activities, although accurate technology is critical for clinical decision-making.
Objective: The aim of this study is to assess the data availability, accuracy, and concurrent validity of vital sign data measured with wearable sensors in volunteers during various daily life activities in a simulated free-living environment.
Methods: Volunteers were equipped with 4 wearable sensors (Everion placed on the left and right arms, VitalPatch, and Fitbit Charge 3) and 2 reference devices (Oxycon Mobile and iButton) to obtain continuous measurements of heart rate (HR), respiratory rate (RR), oxygen saturation (SpO2), and temperature. Participants performed standardized activities, including resting, walking, metronome breathing, chores, stationary cycling, and recovery afterward. Data availability was measured as the percentage of missing data. Accuracy was evaluated by the median absolute percentage error (MAPE) and concurrent validity using the Bland-Altman plot with mean difference and 95% limits of agreement (LoA).
Results: A total of 20 volunteers (median age 64 years, range 20-74 years) were included. Data availability was high for all vital signs measured by VitalPatch and for HR and temperature measured by Everion. Data availability for HR was the lowest for Fitbit (4807/13,680, 35.14% missing data points). For SpO2 measured by Everion, median percentages of missing data of up to 100% were noted. The overall accuracy of HR was high for all wearable sensors, except during walking. For RR, an overall MAPE of 8.6% was noted for VitalPatch and that of 18.9% for Everion, with a higher MAPE noted during physical activity (up to 27.1%) for both sensors. The accuracy of temperature was high for VitalPatch (MAPE up to 1.7%), and it decreased for Everion (MAPE from 6.3% to 9%). Bland-Altman analyses showed small mean differences of VitalPatch for HR (0.1 beats/min [bpm]), RR (−0.1 breaths/min), and temperature (0.5 °C). Everion and Fitbit underestimated HR up to 5.3 (LoA of −39.0 to 28.3) bpm and 11.4 (LoA of −53.8 to 30.9) bpm, respectively. Everion had a small mean difference with large LoA (−10.8 to 10.4 breaths/min) for RR, underestimated SpO2 (>1%), and overestimated temperature up to 2.9 °C.
Conclusions: Data availability, accuracy, and concurrent validity of the studied wearable sensors varied and differed according to activity. In this study, the accuracy of all sensors decreased with physical activity. Of the tested sensors, VitalPatch was found to be the most accurate and valid for vital signs monitoring.


Background
Continuous telemonitoring of vital signs in daily life may lead to earlier detection of patient deterioration [1][2][3] and facilitate the migration of care toward home. In chronic diseases, telemonitoring is associated with improved clinical outcomes and cost-effectiveness of care [4,5]. It is expected that telemonitoring may also be of added value in other settings, such as the perioperative trajectory to monitor postoperative recovery in a ward or home setting. Preoperative monitoring at home may improve the knowledge of patients' baseline vital signs. Especially since the COVID-19 pandemic, the demand for remote monitoring of vital signs has grown [6].
Several wearable sensors are available for telemonitoring of patients both in hospital and at home [1,7], which mainly differ in the location of placement, being reusable or disposable, battery life, and data transmission. According to legislation, sensors must be certified as a medical device and be safe and beneficial in their intended use. However, wearable sensors should be accurate and reliable as well before implementation in health care [1]. Accurate technology for telemonitoring is essential when used for clinical decision-making, although little is known about the accuracy and reliability of current generation wearable sensors, especially during daily life activities. Wearable sensors for continuous monitoring of vital signs are often evaluated in the in-patient setting [7] using patches (eg, Sensium Vitals, Sensium), mattress sensors (eg, EarlySense, EarlySense Inc), or more extensive sensors worn on the arm (eg, Radius-7, Masimo) [8]. Results from in-patient settings cannot directly be translated to the home environment, where daily activities are performed with less supervision, and research using wearable sensors for vital sign monitoring at home is lacking.

Objectives
Information about the performance of wearable sensors in daily life is scarce and should be available before using these sensors for clinical decision-making. The aim of this study is to assess the data availability, accuracy, and concurrent validity of vital signs measured with currently available wearable sensors during daily life activities in a simulated living environment. We selected 3 types of recently available wearable sensors: arm-worn, chest-worn, and wrist-worn. This study investigates the technical performance of wearable sensors during daily life activities in volunteers to gain insight into their potential for telemonitoring.

Design
For this prospective observational validation study, experiments were performed at the eHealth House of the University of Twente, a simulated living environment (furnished apartment) used for research purposes [9]. The protocol was approved by the ethical committee of the University Medical Center Groningen and was executed according to the Declaration of Helsinki. Written consent was received from all participants for study participation and data use.

Participants
Volunteers aged >18 years were included, with at least half of the participants aged >60 years, to reflect a general patient population. Interested volunteers were contacted by one of the researchers (RV) to assess their eligibility for study participation. The exclusion criteria were having a medical condition uncontrolled with medication that interferes with the execution of the protocol (eg, cardiovascular diseases, neuromuscular diseases, immobility, or cognitive disorders), pacemaker, or plaster allergy. Because of the lack of preliminary data for power calculation, a sample size of 20 was chosen on the basis of previous experience in similar validation studies for wearable devices associated with vital sign monitoring in volunteers [10][11][12][13].

Devices
A total of 3 types of wearable sensors of interest for continuous and noninvasive measurement of vital signs were used: Everion (Biovotion AG), VitalPatch (MediBioSense), and Fitbit Charge 3 (Fitbit Inc). The VitalPatch is intended for the collection of physiological data in a health care setting, whereas Everion and Fitbit are intended to monitor fitness and general wellness only. These sensors differ in measurement location and techniques and have the potential to be used in clinical settings. All sensors used are depicted in Figure 1. Everion is a Conformité Européenne (CE) class 2a-certified sensor worn on the upper arm that measures heart rate (HR), respiratory rate (RR), and blood oxygen saturation (SpO2) by photoplethysmography (PPG) and skin temperature using a negative temperature coefficient thermistor. The vital signs were stored every 10 seconds. VitalPatch is a CE class 2a-certified and Food and Drug Administration 510(k)-cleared disposable patch worn on the chest to measure HR and RR by electrocardiography (ECG) and temperature by a thermistor with a sample storage frequency of once per 4 seconds. The Fitbit Charge 3 is a commercially available activity tracker worn at the wrist that measures HR using PPG with a sample storage frequency of once per second during exercise and once per 5 seconds at all other times [14].
A total of 2 devices were used as gold standard reference devices. Oxycon Mobile (CareFusion Germany 234 GmbH) is a portable metabolic measurement system certified as a CE class 2a medical product and has been used as the gold standard in several studies [15,16]. Oxycon Mobile used ECG and expired volume measurements to monitor HR and RR, respectively. Volume measurement is a reliable method for RR calculation compared with other measurement principles that derive RR from impedance, ECG, or waveform modulation, such as in other wearable devices. In addition, SpO2 was measured using a PPG sensor that was positioned using an ear probe instead of a finger probe to enable free hand movement during the experiment. If ECG was missing, HR was determined from the SpO2 curve as reference. For all vital signs, a storage frequency of once per 5 seconds was used. The Thermochron iButton (Maxim Integrated), a validated wireless skin temperature logger [17], was used as a reference device for monitoring temperature with a sample storage frequency of once per 10 seconds and a resolution of 0.5 °C. The iButtons enabled wireless temperature measurements right above the relevant wearable sensors.

Protocol
Before the start of the experiment, the protocol was explained and demographic data of participants were obtained and stored in Research Electronic Data Capture (REDCap; Vanderbilt University) version 10.0.23, including age, gender, BMI, occupation, physical activity lifestyle [18], and relevant medical history. The standardized protocol consisted of 17 different tasks performed consecutively by participants, with a total duration of 57 minutes. The detailed protocol, including task descriptions and durations, is provided in Multimedia Appendix 1. The task durations varied from 2 to 10 minutes, and the tasks were performed in 6 activity clusters: resting, walking, metronome breathing, daily household activities (chores), stationary cycling on an exercise bike, and recovery. Transition periods were present between all tasks, which were not included in the data analysis. For more intensive tasks, a transition period of several minutes was included in the protocol for physiological stabilization between tasks. Resting included lying in several positions, sitting, and standing. Walking included walking at normal and slow speeds and stair climbing. Metronome breathing comprised breathing at 6, 15, 20, and 24 breaths per minute (brpm) and was guided by a metronome app. Chores were performed in the kitchen, where the participant was instructed to do various household tasks such as preparing food and cleaning. Cycling was performed on an ergometer with increasing load and rotation until an HR of at least 120 beats per minute (bpm) was reached. Thereafter, the participants recovered in an armchair or on a couch. During each experiment, 2 researchers were present, of whom 1 instructed the participant, and the other logged the start time of each task.
All sensors were synchronized with the computer time before the start of the experiment. During the experiment, vital signs were simultaneously recorded by the 4 wearable sensors and 2 reference devices. The placement of all the sensors is shown in Figure 1 and Table 1. Two Everion sensors were placed, one on the left and one on the right arm, to investigate the performance for different sensor placements. The data availability of real-time measurements was monitored regularly during the protocol, and technical issues were resolved if needed.

Data Collection and Analysis
Data from all devices were exported from separate databases and processed in MATLAB R2018b (MathWorks, Inc) and SPSS Statistics 23 (IBM Corp). The logged start time and predefined duration of the respective tasks were used to select the data-recording windows for each task. Subsequently, nearest-neighbor resampling was used to pair wearable sensor data with the nearest data of reference devices for the combinations of sensors, as shown in Table 1. As the lowest data storage frequency was once per 10 seconds (for Everion and iButton), the maximum time shift between data points of the wearable sensor and reference device was 5 seconds. Data analysis was performed for each activity cluster and over the complete experiment (for all tasks).
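The pairing step can be illustrated with a short sketch. The study itself used MATLAB and SPSS; the Python code below is only an illustration under stated assumptions: the function name `pair_nearest`, the `(timestamp, value)` tuple layout, and the example values are hypothetical, and the 5-second limit mirrors the maximum time shift described above.

```python
from bisect import bisect_left

def pair_nearest(wearable, reference, max_shift_s=5.0):
    """Pair each wearable sample with the reference sample nearest in time.

    wearable, reference: lists of (timestamp_seconds, value), sorted by time.
    Returns a list of (timestamp, wearable_value, reference_value).
    Wearable samples without a reference sample within max_shift_s are dropped.
    """
    ref_times = [t for t, _ in reference]
    pairs = []
    for t, value in wearable:
        i = bisect_left(ref_times, t)
        # Candidates: the reference samples just before and just after t.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(reference)]
        if not candidates:
            continue
        j = min(candidates, key=lambda k: abs(ref_times[k] - t))
        if abs(ref_times[j] - t) <= max_shift_s:
            pairs.append((t, value, reference[j][1]))
    return pairs

# Hypothetical example: wearable stored every 4 s, reference every 5 s.
wearable = [(0, 70), (4, 71), (8, 73), (12, 74)]
reference = [(0, 69), (5, 72), (10, 75)]
pairs = pair_nearest(wearable, reference)
print(pairs)  # [(0, 70, 69), (4, 71, 72), (8, 73, 75), (12, 74, 75)]
```

With this approach a reference sample may be paired with more than one wearable sample when the wearable stores data more frequently than the reference device.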

Data Availability
The data availability of each sensor was assessed by the percentage of missing data points out of the expected data points per activity cluster and over all tasks per vital sign. In addition, the number and duration of missing data periods (epochs), that is, periods where the time between subsequent data points exceeded the expected sample period, were assessed.
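As a minimal sketch of these two availability measures, the Python code below computes the percentage of missing data points and lists gaps between subsequent samples. The function names and the `tolerance` factor are assumptions for illustration; the text only states that a gap is a period where the interval exceeds the expected sample period. The example assumes, as one plausible reading, that the expected number of Fitbit data points follows from 20 participants, 57 minutes of tasks, and one sample per 5 seconds, which reproduces the reported denominator of 13,680.

```python
def missing_percentage(n_received, duration_s, sample_period_s):
    """Percentage of expected data points that were not received."""
    expected = int(duration_s // sample_period_s)
    missing = max(expected - n_received, 0)
    return 100.0 * missing / expected

def find_gaps(timestamps, sample_period_s, tolerance=1.5):
    """Return (gap_start, gap_duration) for each interval between subsequent
    samples that exceeds tolerance * sample_period_s.

    The tolerance factor is a hypothetical choice to absorb small timing jitter.
    """
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > tolerance * sample_period_s:
            gaps.append((prev, cur - prev))
    return gaps

# Fitbit HR over all tasks: 4807 of 13,680 expected data points were missing.
total_duration_s = 20 * 57 * 60          # 20 participants, 57 minutes each
received = 13680 - 4807
print(round(missing_percentage(received, total_duration_s, 5), 2))  # 35.14

# Hypothetical timestamps (seconds) with one 15-second gap after t = 10.
print(find_gaps([0, 5, 10, 25, 30], sample_period_s=5))  # [(10, 15)]
```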

Vital Sign Agreement
Agreement in vital sign data between wearable sensors and reference devices was inspected visually over all tasks. The measured values and variability of each sensor were described using the median and median absolute deviation (MAD) calculated per minute for all sensors and all participants. The median and IQR of the median and MAD of all participants were calculated per activity cluster and over all tasks to compare the (differences in) measured values and variability between activities and sensors. Furthermore, the median absolute percentage error (MAPE) was calculated per minute per vital sign to evaluate the accuracy of each wearable sensor.
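The two summary statistics can be sketched as follows. This is an illustration only: the text does not specify whether the percentage error was computed per paired sample and then summarized per minute, or computed on per-minute medians, so the sketch assumes the former. The function names and example values are hypothetical.

```python
from collections import defaultdict
from statistics import median

def mad(values):
    """Median absolute deviation: median of |x - median(x)|."""
    m = median(values)
    return median([abs(v - m) for v in values])

def mape_per_minute(pairs):
    """Median absolute percentage error per minute.

    pairs: list of (timestamp_seconds, wearable_value, reference_value).
    Returns {minute_index: median of 100 * |wearable - reference| / reference}.
    Pairs with a reference value of 0 are skipped to avoid division by zero.
    """
    errors_by_minute = defaultdict(list)
    for t, w, r in pairs:
        if r != 0:
            errors_by_minute[int(t // 60)].append(100.0 * abs(w - r) / abs(r))
    return {minute: median(errs)
            for minute, errs in sorted(errors_by_minute.items())}

# Hypothetical HR pairs (bpm) spanning two minutes.
pairs = [(0, 72, 80), (20, 78, 80), (40, 80, 80), (60, 99, 100), (90, 101, 100)]
print(mape_per_minute(pairs))      # {0: 2.5, 1: 1.0}
print(mad([60, 62, 64, 70, 90]))   # 4
```

Using the median rather than the mean keeps a single outlying sample, such as the 10% error at t = 0 in the example, from dominating the per-minute value.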

Concurrent Validity
Concurrent validity was assessed using the data samples in a preselected activity cluster, with the aim of obtaining a large range of physiological variation with the least variation in position or task to minimize movement artifacts. Accordingly, the concurrent validity of HR was obtained in the cycling cluster, RR in the breathing cluster, and SpO2 and temperature in the recovery cluster after cycling. As the VitalPatch and Everion had an averaging duration of 45 and 60 seconds, respectively, to compute RR, measurements during the first minute of each breathing activity were not considered in the validity analysis. Before data selection, data of reference devices during the selected activity clusters were visually analyzed per participant by 2 researchers (MEH and MCVR) to exclude physiologically implausible reference data. If needed, periods with unexpected scattering, variation, or drops were excluded from further analysis. Concurrent validity was assessed using Bland-Altman analyses to evaluate the mean differences (bias) and 95% limits of agreement (LoA). Bland-Altman analyses were corrected for repeated measurements, where the variance between measurement pairs was the sum of between- and within-subject variances [19,20]. The root mean square error (RMSE) was calculated to obtain insights into the amplitude of deviations. Bland-Altman plots, mean differences, LoA, and RMSEs were also assessed using median values per minute during the same predefined activity cluster. The results of both Bland-Altman analyses were compared to evaluate the influence of averaging on the concurrent validity of the wearable sensors.
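A minimal sketch of a Bland-Altman analysis with repeated measurements is given below. It assumes that the correction corresponds to a one-way analysis of variance on the differences with participant as the factor, in which the total variance is the sum of the between-subject and within-subject components; whether this matches the exact procedure of the cited methods [19,20] is an assumption. Function names and example values are hypothetical, and the bias is taken here as the mean of all differences.

```python
from math import sqrt
from statistics import mean

def bland_altman_repeated(diffs_by_subject):
    """Bias and 95% limits of agreement for repeated measurements.

    diffs_by_subject: dict mapping subject id to a list of differences
    (wearable minus reference). Requires at least 2 subjects and more
    samples in total than subjects.
    Returns (bias, lower_loa, upper_loa).
    """
    groups = [d for d in diffs_by_subject.values() if d]
    n = len(groups)
    sizes = [len(g) for g in groups]
    total = sum(sizes)
    bias = mean([x for g in groups for x in g])

    # One-way ANOVA on the differences with subject as the factor.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - bias) ** 2 for g in groups)
    ms_within = ss_within / (total - n)
    ms_between = ss_between / (n - 1)

    # Effective number of samples per subject for unequal group sizes.
    m0 = (total ** 2 - sum(m ** 2 for m in sizes)) / ((n - 1) * total)
    var_between = max((ms_between - ms_within) / m0, 0.0)

    # Total variance = between-subject + within-subject variance.
    sd_total = sqrt(var_between + ms_within)
    return bias, bias - 1.96 * sd_total, bias + 1.96 * sd_total

def rmse(diffs):
    """Root mean square error of the differences."""
    return sqrt(mean([d ** 2 for d in diffs]))

# Hypothetical HR differences (bpm) for 3 participants.
diffs = {"A": [1, 3], "B": [-1, 1], "C": [2, 4]}
bias, lower, upper = bland_altman_repeated(diffs)
print(round(bias, 2), round(lower, 2), round(upper, 2))  # 1.67 -1.91 5.25
```

Treating all samples as independent would ignore that samples from the same participant tend to be more alike, which is the reason for separating the two variance components.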

Overview
Between September 2020 and October 2020, 20 volunteers were included in the study. A total of 2 experiments were redone because recording failures had left the reference device data incomplete. Data from 20 participants were analyzed, and the participant characteristics are shown in Table 2.
For RR, Everion had the most available data during the breathing activity (2865/2880, 99.48%) and the most missing data points during the more active clusters, walking and cycling, with median percentages of missing data of 8.3% to 26

Vital Sign Agreement
In most cases, wearable sensors showed similar trends compared with those of reference devices when measuring HR, RR, and temperature. Trends in vital signs during the complete experiment are shown in Figure 3 for 1 participant as an example. In half of the participants (10/20, 50%), an unexpected drop or low agreement in HR during cycling could be seen for Fitbit (9/10, 90%) and Everion (3/10, 30%), as illustrated in Multimedia Appendix 2, and also in the Oxycon Mobile (1/10, 10%).
The median values and MAD per minute for each vital sign and sensor over all tasks are shown in Table 3. Variability in terms of MAD per minute was generally low for all devices and vital signs.
The median MAPE of each wearable sensor as compared with the reference device per activity cluster per vital sign is shown in Table 4. For HR, all wearable sensors had an overall low median MAPE (2.3%-3.9%), with the highest MAPE for Fitbit. All sensors had the highest median MAPE during the walking cluster for HR (13.4%-23.4%). For RR, VitalPatch had the lowest median MAPE during the breathing and cycling cluster, whereas the Everion median MAPE was higher, especially during walking and cycling. The median MAPE of SpO2 measured by Everion was maximally 3.8% (during walking). The median MAPE for temperature of VitalPatch was very low (1%-1.7%). The lowest median MAPE for temperature of Everion was during the first activity cluster (resting: mean 6.3%) and the highest during the last cluster (recovery: mean 9%).
Figure 4 shows Bland-Altman plots for individual samples, whereas plots for median values per minute are shown in Figure 5. Table 5 shows mean differences and LoA from Bland-Altman analyses and RMSE per vital sign for the 2 methods for each wearable sensor compared with their reference devices.

Figure 4.
Bland-Altman plots for the 12 combinations of vital signs measured by the wearable sensors and reference devices for individual samples during the preselected activity cluster, where the x-axis represents the mean of and the y-axis represents the difference (Δ) between both sensors. Dotted lines represent the mean difference and limits of agreement for repeated measurements. bpm: beats per minute; brpm: breaths per minute; HR: heart rate; RR: respiratory rate; SpO2: oxygen saturation; T: temperature.

Figure 5.
Bland-Altman plots for the 12 combinations of vital signs measured by the wearable sensors and reference devices of median data per minute during the preselected activity cluster, where the x-axis represents the mean of and the y-axis represents the difference (Δ) between both sensors. Dotted lines represent the bias and limits of agreement for the repeated measurements. bpm: beats per minute; brpm: breaths per minute; HR: heart rate; RR: respiratory rate; SpO2: oxygen saturation; T: temperature.

Mean differences for RR were low with large LoA for both VitalPatch (LoA −7.6 brpm to 7.3 brpm) and Everion (LoA −10.6 brpm to 9.8 brpm). In addition, Figures 4 and 5 show higher differences for RR by Everion at the lowest breathing frequency (overestimation) and highest breathing frequency (underestimation). SpO2 was underestimated, with mean differences of over 1% by Everion and LoA of −4.6% to 2.5%. For temperature, VitalPatch had a small overestimation of 0.5 °C. Everion overestimated temperature with a mean difference of 2.8 °C, with slightly higher differences at lower temperature and vice versa. The mean differences and LoA for median values per minute were similar to those for the individual samples.

Principal Findings
Telemonitoring requires vital sign data from wearable sensors to be available, accurate, and valid when used for clinical decision-making, including during daily activities. Our results showed variable data availability and accuracy of vital signs measured for the evaluated wearable sensors during different daily life activities in a simulated free-living environment. VitalPatch was accurate and the least vulnerable to movement during daily activities. With regard to Everion, the mean difference, lower accuracy during physical activity, and limited data availability for RR and SpO2 must be considered when interpreting its measurements for diagnostic aims. Our results showed no relevant differences in performance between the left and right Everion because of sensor placement. Fitbit had a large mean difference and an activity-dependent storage frequency for HR.
Different results for the tested wearable sensors may be explained by differences in the underlying measurement technologies, processing algorithms, and sensor placement sites. Relevant findings and points of consideration will be discussed in the context of each sensor.
Our study showed low availability of Everion RR data during the more active clusters and of Everion SpO2 data, which might be because of the placement site of the Everions. The upper arm is a nontraditional and uncommon site for measuring PPG signals, for which its accuracy has not yet been established [21,22]. Everion calculates an accuracy metric per vital sign, which prevents data with an accuracy <50% from being stored. This accuracy metric could be low when the measurement of vital signs is affected by movement, which is a general limitation of PPG signals [22,23]. On the other hand, the fact that Everion is PPG-based creates the ability to monitor multiple vital signs (HR, RR, and SpO2) with only 1 sensor [24]. There is an increasing demand for such devices, as patients are becoming multimorbid.
We reported an underestimation of HR by Everion. Only Barrios et al [13] evaluated HR measured by Everion in 6 healthy volunteers compared with ECG Holter measurements during different activities and found a mean difference for HR of −0.2 bpm (LoA of −6.3 bpm to 6.0 bpm) during cycling. These results imply better accuracy compared with those of our study, which could be related to the small number and young age of their participants. Finally, our study showed unexpected drops in HR by Everion during the rapid increase of HR while cycling without extensive arm movement, which we suspect is caused by the sensor's algorithms.
In our study, VitalPatch measured all vital signs with the highest accuracy and validity. No previous studies have been reported on the performance of VitalPatch. Only similar patches have been studied previously, including the Sensium Vitals patch (Sensium) [25] and HealthPatch (VitalConnect) [26].
For Fitbit, our study showed the lowest data availability of HR, which might be related to its irregular storage frequency. Although the sensor specification [14] stated that the sample storage frequency should be once per second to once per 5 seconds, depending on the level of activity, data were collected at much lower frequencies between once per 5 seconds and once per 15 seconds in our study.
Our results showed high errors and mean differences for Fitbit compared with the reference device. Earlier validation of the Fitbit Charge HR (Fitbit Inc) for HR showed higher accuracy during walking or running on a treadmill and lower accuracy during daily activities, with a MAPE of 8.4% and 10.1%, respectively [27]. A second validation study using Fitbit Charge HR showed an even higher underestimation of HR of 16 bpm during moderate-to-vigorous physical activity compared with Polar H6 HR monitor in 10 healthy participants during daily life activities [28].
In general, our results showed that the mean difference and LoA did not improve using median values per minute instead of individual data samples. This was unexpected, as using median values minimizes the influence of potential outliers. Breteler et al [26] found an improvement in the mean difference and LoA of HR and RR when applying a median filter per 15 minutes, although this might be more relevant for long-term measurements. In addition, averaging substantially decreases the number of data points.

Strengths and Limitations
A strength of this study is that we evaluated the sensor performance during daily life activities in a general population with mixed characteristics. In addition, the study was performed in a simulated home environment, which is as close as possible to the target setting while enabling well-controlled study measurements. Accordingly, the current results give more insight into the sensor performance as compared with typically performed validation protocols that only include young, healthy participants and measurements at rest.
A limitation is that we assessed the wearable sensor performance over a relatively short period. A second limitation is the limited translatability of our results to patients because of the measurement of vital signs in volunteers without pathophysiological abnormalities. Other limitations are related to the reference devices; we had to repeat the experiment for 1 volunteer because of the recording failure of Oxycon Mobile, and the resolution of the iButtons was set at 0.5 °C. This might have influenced the bias of Everion and VitalPatch in temperature. Although Oxycon Mobile has been used as the gold standard for portable monitoring of vital signs before [15,16], validation studies have so far focused on its measurement of metabolic capacity [29][30][31][32][33].

Implications
Wearable sensors could assist in various areas of health care, such as detection of deviant values of vital signs to alarm health care professionals, trend analysis to monitor recovery or deterioration, and decision-making to operate or visit the hospital. Applications of vital sign telemonitoring are diverse, from trend monitoring to acute alarms, based on the clinical goal and which medical actions follow. The required accuracy of the sensor measurements depends on this. Sensor performance for patient monitoring still needs evaluation in specific patient groups, at home or in hospital, during longer periods, and on its diagnostic ability, which are the next steps toward clinical applicability. Patient acceptance and actual use (adherence) are important for clinical use [34]. Therefore, this should be the subject of future work. However, the potential of our tested wearable sensors for patient monitoring will be discussed in the context of the following technical factors to consider: (1) the vital signs to monitor, (2) a sensor's accuracy and trending ability, (3) data storage frequency or filtering, and (4) confounding factors such as movement during daily activities.
First, the vital signs that need to be monitored depend on the aforementioned application. For example, for in-hospital monitoring, detection of cardiac events might require ECG monitoring [35], whereas for detection of postoperative deterioration, all vital signs used in the modified early warning score might be preferred [2], which are HR, RR, temperature, SpO2, and blood pressure. In many cases, it is still unknown which parameters to monitor at home and how to interpret long-term measurements obtained in a remote setting, as current common practice is often that a patient returns to or contacts the hospital in case of (increasing) symptoms without further monitoring [7,36]. The ability of the tested sensors to measure the available parameters is discussed per vital sign.
VitalPatch and Everion both monitor multiple vital signs, whereas VitalPatch can also monitor raw ECG. HR is the most commonly measured vital sign and is often measured accurately [7,8,13]. Owing to its large mean difference and unexpected drops during rapidly increasing HR, Fitbit is the least suitable for HR monitoring in patients.
Everion measurements for RR were less accurate at breathing rates <15 brpm or >20 brpm, according to our Bland-Altman analyses. However, these ranges are especially important for the detection of deterioration and predicting cardiac arrest [37,38]. Algorithms for ECG and PPG can use the same techniques to derive RR, such as amplitude and frequency modulation, although algorithms based on ECG perform better than those based on PPG [39]. Respiratory-synchronized variations are subtle, and proximity to the chest improves the measurement of RR (less susceptible to vasoconstriction) [40,41]. Therefore, VitalPatch may be preferred for monitoring RR.
SpO2 is less commonly measured [7,8,13]. Available wearable SpO2 sensors are generally commercially available fingertip sensors, and few meet the International Organization for Standardization 80601-2-61 accuracy standards [42]. Fingertip probes are not ideal for long-term monitoring of SpO2 at home, although they enable transmission mode PPG with higher perfusion compared with more convenient measurement sites that require reflection mode PPG [41,42]. The low variability in SpO2 levels of volunteers precludes insight into the accuracy of Everion for monitoring SpO2 levels in patients. However, because of its limited data availability and underestimation of SpO2, our results indicate that Everion is not suitable for (high-frequency) clinical monitoring of SpO2.
Most available wearable sensors measure skin temperature (including Everion and VitalPatch), whereas core temperature may be clinically more relevant because of its current use in clinical practice. Nevertheless, the clinical relevance of skin temperature monitoring should be evaluated in future research [2].
Second, it is important to define what performance and trending ability are acceptable for clinical use. Currently, no criteria are available for MAPE, mean differences, and LoA of wearable sensors. Although all wearable sensors in our study followed similar trends compared with those followed by the reference devices for HR, RR, and temperature, their trending ability and diagnostic ability to detect clinically relevant changes should be assessed during longer assessments in patients.
A challenge for validation studies for vital sign monitoring is choosing the right reference devices to use as gold standard devices [43]. We experienced that ECG cables and electrodes used for the HR reference measurements are susceptible to movement as well, as also described by Barrios et al [13] using ECG Holter. The Oxycon Mobile reference device enabled ambulatory expired volume analysis, which is the best available solution to monitor RR wirelessly and continuously. Accordingly, the RR validation results are expected to be more accurate as compared with those of clinical validation studies that use intermittent nurse assessments as reference, which are often poorly reported or inaccurate [1].
Third, optimal filtering strategies and data storage frequencies should be investigated. Fourth, further reduction of movement artifacts, for example, using information from the present accelerometer [42], is essential for optimizing measurements at sites that enable long-term monitoring, such as the upper arm.

Conclusions
To use wearable sensors for clinical decision-making, information about their performance in daily life is needed. Of the tested sensors, VitalPatch was found to be the most accurate and valid for vital sign monitoring. For all sensors, movement during daily activities should be considered. Longer assessments of wearable sensors are needed to evaluate the technical performance and trending ability to work toward the clinical applicability of wearable sensors in patients.