Improving Current Glycated Hemoglobin Prediction in Adults: Use of Machine Learning Algorithms With Electronic Health Records

Background: Predicting the risk of glycated hemoglobin (HbA1c) elevation can help identify patients with the potential for developing serious chronic health problems, such as diabetes. Early preventive interventions based upon advanced predictive models using electronic health records data for identifying such patients can ultimately help provide better health outcomes.
Objective: Our study investigated the performance of predictive models to forecast HbA1c elevation levels by employing several machine learning models. We also examined the effect of using patient electronic health record longitudinal data on the performance of the predictive models. Explainable methods were employed to interpret the decisions made by the black box models.
Methods: This study employed multiple logistic regression, random forest, support vector machine, and logistic regression models, as well as a deep learning model (multilayer perceptron), to classify patients with normal (<5.7%) and elevated (≥5.7%) levels of HbA1c. We also integrated current visit data with historical (longitudinal) data from previous visits. Explainable machine learning methods were used to interrogate the models and provide an understanding of the reasons behind the decisions made by the models. All models were trained and tested using a large data set from Saudi Arabia with 18,844 unique patient records.
Results: The machine learning models achieved promising results for predicting current HbA1c elevation risk. When coupled with longitudinal data, the machine learning models outperformed the multiple logistic regression model used in the comparative study. The multilayer perceptron model achieved an area under the receiver operating characteristic curve of 83.22% when used with historical data. All models showed a close level of agreement on the contribution of the random blood sugar and age variables with and without longitudinal data.
Conclusions: This study shows that machine learning models can provide promising results for the task of classifying current HbA1c levels as elevated (≥5.7%) or normal (<5.7%). Using patients' longitudinal data improved the performance and affected the relative importance of the predictors used. The models showed results that are consistent with comparable studies.


Introduction
The level of glycated hemoglobin (HbA1c) is used to measure the average glucose concentration in red blood cells [1,2]. Unlike other glucose blood tests such as Random Blood Sugar (RBS) and Fasting Blood Sugar (FBS), HbA1c provides a long-term measure of a patient's blood glucose levels [3]. The HbA1c test can therefore provide physicians with a reliable means of monitoring a patient's hyperglycemia without requiring the patient to undertake overnight fasting prior to being tested.
An HbA1c concentration of 6.5% in a patient's blood is considered the cut-off point for the diagnosis of diabetes [4]. However, patients with a concentration of less than 6.5% are not completely excluded from a diabetes diagnosis, as an elevated level (5.7% ≤ HbA1c < 6.5%) can indicate the future onset of diabetes. Therefore, HbA1c can act as an early predictor for the potential development of Type-2 Diabetes Mellitus (T2DM) [2]. Ackermann et al suggested using the HbA1c test as a measure for identifying those adults who are at a greater risk of developing T2DM in the future [3].
Research has shown that reducing HbA1c levels can significantly reduce the possibility of developing serious complications. Hence, close monitoring of HbA1c levels is recommended for all diabetic patients and also for those with the potential for developing diabetes [5]. It is also suggested that diabetic and non-diabetic patients with raised HbA1c levels should be clinically checked and monitored as a preventive intervention to avoid developing T2DM [6].
Currently, the clinical data collected from patient visits consists of a set of readings for vital signs and lab tests, diagnosis, physician's notes, and treatments that are stored in Electronic Health Records (EHR). These are collected on an irregular basis, according to clinical needs, and stored with an associated timestamp.
In recent years, machine learning models have shown powerful capabilities for analyzing and understanding complex data across a wide variety of applications. Our research question for this study is: "Can HbA1c prediction be improved by using machine learning and utilizing longitudinal data that are normally available in EHR systems?". This paper reports an investigation into the performance of machine learning models to predict current HbA1c levels as a binary classification problem using the EHR data. Nondiabetic patients with an HbA1c level of 5.7% or more are considered to have an elevated HbA1c, while those with lower levels than that are considered normal. The models combine current visit data with extra features (independent variables) extracted from previous visits by patients. We used explainable methods to rank the features in order of their importance to the decision made by each of the models. To the best of our knowledge, this work is the first to employ machine learning models that use longitudinal data from EHR systems for the purpose of HbA1c elevation risk prediction. This work is also the first to utilize explainable machine learning techniques to explain the classification decisions made by the black box models (SVM and MLP) in predicting HbA1c elevation risk (≥5.7%), in order to better understand the behavior of the model.

Related Work
EHR data has been intensively investigated for a variety of medical decision support tasks [7]. These tasks include the analysis of complex patterns and the prediction of major medical events (for example, diagnostic imaging and gene interactions) [8,9]. Several studies have demonstrated the successful employment of EHR data with prediction models [10]. For instance, machine learning has been intensively used in diagnosing diabetes, and discovering its related patterns, using EHR data [11-15]. However, we are not aware of any studies that have explored machine learning models for the prediction of current elevated HbA1c levels using EHR data from a non-diabetic population, or the impact of patient longitudinal data on the effectiveness of such predictive machine learning models.
Several studies have investigated the association between HbA1c levels and clinical variables using statistical models [16,17]. A study by Rose et al [18] discussed the correlation between RBS and HbA1c levels. Stanley et al [19] used a linear regression model for imputation of missing HbA1c data. Their model calculates HbA1c levels, as continuous and categorical values, for patient records with missing HbA1c values in a diabetic population, using 4 predictors extracted from an EHR system: RBS, FBS, age and gender. Simone et al [20] used linear regression models to predict HbA1c levels after 6 years for non-diabetic patients using different populations.
A study by Wells et al [21] in 2018 was the first to focus on predicting current HbA1c elevation levels for non-diabetic patients using an EHR dataset. Multiple Logistic Regression (MLR) was employed to calculate the probability of a patient having an elevated HbA1c level (≥5.7%). The dataset was extracted from an EHR system used in the USA. The authors used 8 independent variables fitted to the model using Restricted Cubic Splines (RCS) with 3-knots to formulate the final equation. The performance of the MLR model was compared to that of the models used by Baan et al [22] and Griffin et al [23]. However, the models by Baan and Griffin aimed at predicting the onset of patients' diabetes rather than predicting HbA1c levels for non-diabetic patients. In addition, the experimental dataset used by Wells et al to train and test their model was imbalanced with 74% of the samples having normal HbA1c levels (<5.7%) and only 26% of the samples having elevated HbA1c levels (≥5.7%).
We have performed a differentiated replication of the study by Wells et al [21] using the more balanced KAIMRC dataset [24]. While the significant variables identified in our replication were in general agreement with those of the original study, there were some differences in the ranking of importance for these, suggesting that such models do need to be 'tuned' to the characteristics of different populations.

Methods
To study the impact of employing advanced predictive models with EHR data to predict current HbA1c levels, we employed the Multiple Logistic Regression (MLR), Random Forest (RF), Support Vector Machine (SVM) and Logistic Regression (LR) models, as well as a deep learning model, the Multi-layer Perceptron (MLP) [25]. The problem was formulated as a binary classification problem in which the target variable, HbA1c level, was encoded as 1 when the level of HbA1c is 5.7% or more and as 0 otherwise. The results obtained from using these models were compared to those obtained from employing the model used by Wells et al with the KAIMRC dataset (detailed in the Dataset subsection). The performance of the models was investigated using current visit data only and also with additional longitudinal data from current and previous visits. The performance of each model was evaluated using measures commonly employed in clinical applications. For the SVM and MLP models, the relative importance of the features was also calculated using explainable machine learning techniques.
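The target encoding described above can be illustrated with a minimal sketch. The function name and the example readings are hypothetical; only the 5.7% threshold and the 1/0 coding come from the text.

```python
# Threshold taken from the text: HbA1c >= 5.7% is the elevated class.
ELEVATED_THRESHOLD = 5.7


def encode_hba1c(hba1c_percent: float) -> int:
    """Return 1 for an elevated HbA1c reading (>= 5.7%), 0 otherwise."""
    return 1 if hba1c_percent >= ELEVATED_THRESHOLD else 0


# Hypothetical readings, for illustration only.
readings = [5.2, 5.7, 6.1, 5.69]
labels = [encode_hba1c(r) for r in readings]
```

Note that a reading of exactly 5.7% falls in the elevated class under this rule.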
Using black box machine learning models in healthcare can have adverse effects on the trust and confidence placed in their outcomes; the risk of misclassification is potentially too high for clinicians to confidently use black box models for high risk healthcare decisions, and not being able to interpret a model's decision exacerbates this problem [26]. Explainable methods for machine learning models allow interpretable outcomes that can expose the reasons behind the decision made by the model [27]. This transparency provides both health professionals and patients with the confidence and trust in the outcome of the models. The widely-used SHAP (SHapley Additive exPlanations) values [28] and LIME scores [29] techniques have therefore been employed to provide a degree of transparency to our deep learning model. SHAP values are derived from Shapley values used in game theory, and provide a method of calculating the contribution of each feature (variable) to the final prediction via the GradientSHAP approximation. This is achieved for each feature by comparing the prediction the model makes when the feature is present with the prediction obtained when the feature takes some baseline value [28]. Consequently, the SHAP values for a given input 'explain' how each feature affects the output of the model when compared to the baseline (or 'default') output of the model. We use SHAP values to interpret our black box models, as they can be efficiently calculated, and their use enables a global view of the model to be constructed through the computation of SHAP values from across the whole dataset. To ensure that the SHAP values we calculate are not too greatly affected by the approximation method used, we also compute the LIME [29] scores for the models, across the entire dataset. LIME tries to estimate locally faithful linear explanations (i.e. explanations that correspond to how the model behaves around the instance being explained) for any classifier. 
LIME achieves this by creating local linear classifiers that approximate the behavior of the original model in the vicinity of the data being explained. As linear models are inherently interpretable through their parameters, they can be used to generate explanations of the original model. Both SHAP and LIME have the advantage that they are model-agnostic techniques, and so we are able to apply both methods to both of our black box classification models (SVM and MLP).
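As an illustration of the baseline comparison behind SHAP values, the following minimal sketch computes exact Shapley values for a small toy model and then averages their absolute values across samples to obtain a global ranking. It is not the GradientSHAP approximation used in this study: exact enumeration is swapped in because it is easy to verify, and it is only feasible for a handful of features. The toy model, feature names, and sample values are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of `model` at `x` relative to `baseline`.

    Features outside a coalition are replaced by their baseline value.
    The cost grows as 2**n, so this only suits a few features.
    """
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(mixed)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                phi += weight * (value(members | {i}) - value(members))
        phis.append(phi)
    return phis


def mean_abs_importance(model, samples, baseline):
    """Average |Shapley value| per feature across samples (global view)."""
    totals = [0.0] * len(baseline)
    for x in samples:
        for i, phi in enumerate(shapley_values(model, x, baseline)):
            totals[i] += abs(phi)
    return [total / len(samples) for total in totals]


# Hypothetical toy model on scaled features [rbs, age, bmi]; not the paper's model.
def toy_model(features):
    rbs, age, bmi = features
    return 0.6 * rbs + 0.3 * age + 0.1 * rbs * bmi


baseline = [0.0, 0.0, 0.0]
samples = [[1.0, 0.5, 1.0], [0.2, 0.8, 0.4]]
phis = shapley_values(toy_model, samples[0], baseline)
importance = mean_abs_importance(toy_model, samples, baseline)
```

The per-sample values sum to the difference between the model output at the sample and at the baseline, which is the property that lets them be read as additive contributions.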

Dataset
The data used in this study are taken from the King Abdullah International Medical Research Center (KAIMRC) dataset. The data were collected from King Abdulaziz Medical City, located in the central and western regions of Saudi Arabia (KSA), a country that the World Health Organization (WHO) ranked as the second highest in the Middle East for prevalence of diabetes, and 17th in the world [30]. According to the International Diabetes Federation (IDF), the diabetes prevalence rate in Saudi Arabia is 18.3%. Therefore, the availability of the data from this population provides considerable opportunities for research into the early prediction of diabetes.
The dataset contains a full history of patient details, vital signs, and lab test readings for each patient visit for the period from 2016 to the end of 2018. As the aim of this study is to identify non-diabetic patients that are at a high risk of HbA1c elevation, all patients previously diagnosed with hyperglycemia were eliminated from the experimental dataset. The remaining cohort formed our experimental dataset, and was categorized using the American Diabetes Association's (ADA) guidelines [31]. Patients with HbA1c readings of 5.7% or more are considered as being in the pre-diabetic range, while those with readings of less than 5.7% are considered to be in the normal range.
Most medical datasets are imbalanced [32] [33] [34]. Such imbalances occur when the proportion of one class of patients in the dataset is greater than its counterpart class [35] [36]. However, unusually, our experimental dataset is not imbalanced. Slightly over half of the patients in our experimental dataset (52.1%) were found to have elevated levels of HbA1c (≥5.7%) while 47.9% of patients had normal HbA1c levels (<5.7%). This can be ascribed to the high incidence of diabetes in the region from which the dataset was collected [37].
A detailed illustration of the patients' class distribution (HbA1c levels) by age group and gender is shown in Figure 1. This shows that the proportion of patients who have elevated HbA1c levels increases steadily with age. The dataset also exhibits a balanced gender distribution, with 49.4% of the patients being male and 50.6% female. However, the proportion of male patients with elevated levels of HbA1c (≥5.7%) is greater than that of female patients. Also, female patients with normal levels of HbA1c (<5.7%) made more visits than males. Table 1 shows the profile for the distribution of HbA1c elevation levels organized by gender.

Feature Selection and Data Sampling
Six main variables (features) were extracted from the KAIMRC EHR dataset to be used in this study. These features were selected firstly for their theoretical association with hyperglycemia and secondly for their availability in the KAIMRC dataset, and are: Age, Body Mass Index (BMI), Estimated Glomerular Filtration Rate (eGFR), Random Blood Sugar (RBS), Total cholesterol (CHOL) and non-high-density lipoprotein (non-HDL).
For the lab codes of the features used, refer to Table 1 in Multimedia Appendix 1. The descriptive statistics (using the data for the current visit only for unique patients), units, and P values for the selected features are presented in Table 2. It is very common in clinical practice that physicians may require that some lab tests and vital signs be recorded frequently. In these cases, the average value of all readings taken on a given day (the basic time interval used for this study) was used. For inpatient visits, only data for the first day were considered and where there were missing values, the first available values from the visit were used.
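The daily averaging of repeated readings can be sketched as follows. This is a minimal illustration only: the tuple layout, patient identifiers, dates, and values are hypothetical, and the handling of inpatient visits described above is not covered.

```python
from collections import defaultdict
from statistics import mean


def daily_means(readings):
    """Average repeated readings of one feature per (patient, day).

    `readings` is an iterable of (patient_id, date_string, value) tuples.
    """
    grouped = defaultdict(list)
    for patient_id, day, value in readings:
        grouped[(patient_id, day)].append(value)
    return {key: mean(values) for key, values in grouped.items()}


# Hypothetical RBS readings; identifiers and values are illustrative only.
rbs_readings = [
    ("p1", "2018-03-01", 6.0),
    ("p1", "2018-03-01", 8.0),
    ("p1", "2018-03-02", 7.5),
    ("p2", "2018-03-01", 5.4),
]
rbs_by_day = daily_means(rbs_readings)
```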
For the purpose of this study we aim to predict the HbA1c levels (≥5.7%) for current (last) patient visits only. Unlike the sampling approach used by Wells et al, which was based on independent hospital visits (including multiple visits by the same patient), the sampling approach used in this study is based on independent patients, to ensure that only unseen patients' data are used for testing the models. Since we aim to identify patients with elevated levels of HbA1c in a non-diabetic population, patients previously diagnosed with diabetes were excluded. We also excluded non-adult patients and those with erroneous or missing values [24]. Figure 2 shows the details of the tasks performed to refine the sample selection. This resulted in a reduction in the size of the experimental dataset from 114,057 patients with 750,709 visits to 18,844 unique patients with 157,600 visits.
The longitudinal data for each patient are organized as a matrix:
$$X = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1N} \\ x_{21} & x_{22} & \cdots & x_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ x_{T1} & x_{T2} & \cdots & x_{TN} \end{bmatrix} \quad \text{(Eq1)}$$
Here $x_{ij}$ is the value of feature $j$ at patient visit $i$ ($0 < i \le T$, $0 < j \le N$); $T$ is the number of time-series steps (the length of the input sequence); and $N$ is the number of features for each time step, which is set to 6 as explained earlier.
If the number of visits (longitudinal time-series visits) for a patient is fewer than $T$ (the number of time-series steps), the input for this patient is padded out with the mean value of the available visits to compensate for the missing time-series data (Multimedia Appendix 3 shows an example of the padding approach used). Where the number of longitudinal visits for a patient is more than $T$, the Piecewise Aggregate Approximation (PAA) technique [38] is applied to the data for these visits to take account of all data from patient visits.
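The padding rule above can be sketched as follows, as a minimal illustration rather than the exact procedure of Multimedia Appendix 3. The function name and example values are hypothetical, and placing the padded rows before the real visits is an assumption of this sketch.

```python
def pad_with_mean(visits, steps):
    """Pad a patient's visit history to `steps` rows using column means.

    `visits` is a list of per-visit feature lists, oldest first. Placing the
    padded rows before the real visits is an assumption of this sketch.
    """
    if not visits:
        raise ValueError("at least one visit is required")
    if len(visits) >= steps:
        return [list(v) for v in visits]
    n_features = len(visits[0])
    column_means = [
        sum(v[j] for v in visits) / len(visits) for j in range(n_features)
    ]
    padding = [list(column_means) for _ in range(steps - len(visits))]
    return padding + [list(v) for v in visits]


# Two hypothetical previous visits with two features each (e.g. RBS, BMI).
history = [[5.0, 24.0], [7.0, 26.0]]
padded = pad_with_mean(history, steps=3)
```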
PAA transforms the longitudinal time-series data, using $T$ as the number of sliding windows (or segments), into data with a reduced number of time steps (approximated) by employing the mean value of the series falling within each window (segment) [39]. We tested the models with several values for the size of the sliding window ($T$), and 3 was shown to be the optimal value. The formula used to calculate the approximated time-series data is:
$$\tilde{x}_i = \frac{T}{n} \sum_{j=\frac{n}{T}(i-1)+1}^{\frac{n}{T}i} x_j \quad \text{(Eq2)}$$
where $\tilde{x}_i$ represents the approximated value for the $i$th segment of the series $x$, $n$ is the total number of visits for a patient, and $T$ is the reduced number of time-series steps (Multimedia Appendix 4 shows an example of the PAA technique used).
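A minimal sketch of PAA for a single feature series is given below. The function name and example values are hypothetical. The standard formula assumes that the number of visits is a multiple of the number of segments; the rounding rule used here for other cases is an assumption of the sketch and may differ from the one in Multimedia Appendix 4.

```python
def paa(series, segments):
    """Piecewise Aggregate Approximation: mean of each of `segments` windows.

    When len(series) is not a multiple of `segments`, window boundaries are
    rounded down; this boundary rule is an assumption of the sketch.
    """
    n = len(series)
    if segments <= 0 or n < segments:
        raise ValueError("need at least one value per segment")
    reduced = []
    for i in range(segments):
        start = i * n // segments
        end = (i + 1) * n // segments
        window = series[start:end]
        reduced.append(sum(window) / len(window))
    return reduced


# Hypothetical RBS values from six previous visits, reduced to 3 time steps.
rbs_history = [5.0, 6.0, 7.0, 9.0, 8.0, 10.0]
rbs_reduced = paa(rbs_history, segments=3)
```

With six visits and three segments, each approximated value is simply the mean of two consecutive visits.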
The approximated time-series data forming the output of the PAA is then concatenated with the current visit data to form the final input for the deep learning model. Since the MLR, RF, SVM and LR models are not capable of handling multi-dimensional data (formed as matrices), for these models the output of the PAA was re-organized into a single-dimensional input by vectorizing the $T \times N$ matrix used in equation (Eq1) as below:
$$x = [\, x_{11} \;\; x_{12} \;\; x_{13} \;\; \cdots \;\; x_{TN} \,] \quad \text{(Eq3)}$$
The last data pre-processing task before training the predictive models was data scaling. The experimental dataset was scaled using the normalization technique that re-scales the range of each feature to lie between 0 and 1 using the minimum and maximum values of that feature.
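The vectorization and min-max scaling steps can be sketched as follows. The function names and the small two-visit, two-feature matrices are hypothetical; mapping a constant feature to 0 is a choice made for this sketch, not something stated in the text.

```python
def flatten(matrix):
    """Row-major flattening of a T x N visit matrix into one vector."""
    return [value for row in matrix for value in row]


def min_max_scale(rows):
    """Scale each feature column of `rows` to the range [0, 1].

    A constant column is mapped to 0.0 (a choice made for this sketch).
    """
    n_features = len(rows[0])
    lows = [min(row[j] for row in rows) for j in range(n_features)]
    highs = [max(row[j] for row in rows) for j in range(n_features)]
    scaled = []
    for row in rows:
        scaled.append([
            (row[j] - lows[j]) / (highs[j] - lows[j])
            if highs[j] > lows[j] else 0.0
            for j in range(n_features)
        ])
    return scaled


# Three hypothetical patients, each with a 2 x 2 matrix (2 visits, 2 features).
patient_matrices = [
    [[5.0, 24.0], [7.0, 26.0]],
    [[6.0, 30.0], [9.0, 28.0]],
    [[5.5, 27.0], [8.0, 27.0]],
]
vectors = [flatten(m) for m in patient_matrices]
scaled_vectors = min_max_scale(vectors)
```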

Predictive Models and Experimental Setups
As a baseline comparison, we employed the Multiple Logistic Regression (MLR) model used by Wells et al, and compared the results from this with those from 4 commonly used machine learning models.
The MLR model is used to create a mathematical equation that can best calculate the probability of a value by assigning weights (coefficients) to the independent variables (features) based on their importance [40]. In this study we employed the same approach used by Wells et al, in which the continuous features were fitted into the MLR model using the Restricted Cubic Splines (RCS) technique with 3 knots. When using the longitudinal input, the variables that caused collinearity were excluded.
Random Forest (RF) is an algorithm very commonly used for classification. It combines several decision trees that are generated during the training process. Each decision tree is trained using a random subset of the training dataset. The final classification is then based on the majority vote of all generated decision trees [41]. The quality function used in the RF model employed is Gini, and the number of trees is set to 100.
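Two ingredients of the RF description above, the Gini quality function and majority voting, can be sketched in a few lines. This is an illustration of the definitions only, not the RF implementation used in the study; the function names and example labels are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a set of class labels: 1 - sum(p_k ** 2)."""
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


def majority_vote(tree_predictions):
    """Class predicted by most trees (ties go to the class seen first)."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical node contents and per-tree votes, for illustration only.
mixed_node = gini_impurity([0, 0, 1, 1])
pure_node = gini_impurity([1, 1, 1, 1])
forest_decision = majority_vote([1, 0, 1, 1, 0])
```

A split is preferred when it produces child nodes with lower impurity, so a pure node (impurity 0) is the ideal outcome.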
Logistic regression (LR) is commonly used to solve binary classification problems. It calculates the odds ratios of the independent variables (which may number more than 1) and is similar to multiple linear regression, except that the dependent variable follows a binomial distribution. Thus, it includes a logit function that handles different types of relationships between the dependent and independent variables [42,43].
Support Vector Machine (SVM) was introduced by Vapnik [44] in 1998. It can solve both classification and regression problems. It uses the training feature space to decide on the separation boundary (hyperplane) that best divides the training dataset into regions, one for each class. The points closest to the hyperplane are the support vectors. SVMs also use kernels to help enhance class separation by mapping the training features into a higher-dimensional space [44,45]. The kernel function used in the SVM model employed is the Radial Basis Function (RBF), with a value of 1 for the cost parameter (C).
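For reference, the RBF kernel mentioned above can be written as a short function. This is a sketch of the standard definition only; the kernel width `gamma` is not reported in the text, so the value used in the example is hypothetical.

```python
from math import exp


def rbf_kernel(a, b, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between a and b)."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return exp(-gamma * squared_distance)


# Hypothetical scaled feature vectors; gamma = 0.5 is an illustrative value.
same_point = rbf_kernel([0.2, 0.8], [0.2, 0.8], gamma=0.5)
far_points = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)
```

The kernel equals 1 for identical inputs and decays towards 0 as the inputs move apart, which is what lets the SVM form non-linear decision boundaries.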
Multi-layer perceptron (MLP), also known as a feed-forward neural network, is one of the most common deep learning approaches. MLP is mainly used to address supervised learning problems by learning the dependencies between the input layer (the features or variables) and output layer (the classification decision) using a fully connected hidden layer in-between. The layers, including hidden ones, contain a number of neurons that are connected to the neurons of the next and previous layers via weights and non-linear functions. MLP uses a backpropagation algorithm to update the weights and biases within the hidden layers to minimize the output error rate [46] [25].
To optimize the MLP model, fine tuning of the structure and hyperparameters was performed, involving the number of hidden layers and neurons, activation functions, optimizers and loss functions. The optimized structure of the MLP model used in this study contained 3 hidden layers. The numbers of neurons in the hidden layers were 48, 48, and 24, respectively. The final layer (the output layer) contained 2 neurons for the final output of the model (1 for normal or 2 for elevated HbA1c). A ReLU activation function was used in the 3 hidden layers and a sigmoid in the output layer. The detailed structure of the MLP model is shown in Figure 3. The model was trained using an Adam optimizer with Mean Squared Error as the loss function.
Figure 3. The structure used for multi-layer perceptron (MLP) trained with the longitudinal data.
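The layer structure described above can be sketched as a plain forward pass. This is a minimal illustration with random, untrained weights; training with Adam and Mean Squared Error is not shown. The input size of 24 is an assumption (6 features for each of 3 approximated time steps plus the current visit), and the weight initialization range is arbitrary.

```python
import random
from math import exp

# Three hidden layers (48, 48, 24) and a 2-neuron output layer, per the text.
LAYER_SIZES = [48, 48, 24, 2]


def init_layers(n_inputs, seed=0):
    """Create (weights, biases) per layer with small random weights."""
    rng = random.Random(seed)
    layers = []
    fan_in = n_inputs
    for fan_out in LAYER_SIZES:
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(fan_in)]
                   for _ in range(fan_out)]
        layers.append((weights, [0.0] * fan_out))
        fan_in = fan_out
    return layers


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def forward(layers, inputs):
    """ReLU on the hidden layers, sigmoid on the output layer."""
    activations = inputs
    for index, (weights, biases) in enumerate(layers):
        activation = sigmoid if index == len(layers) - 1 else relu
        activations = [
            activation(sum(w * a for w, a in zip(row, activations)) + bias)
            for row, bias in zip(weights, biases)
        ]
    return activations


def count_parameters(layers):
    """Total number of weights and biases."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


N_INPUTS = 24  # assumption: 6 features x (3 PAA steps + current visit)
layers = init_layers(N_INPUTS)
outputs = forward(layers, [0.5] * N_INPUTS)
n_parameters = count_parameters(layers)
```

Under the assumed input size, the network has 4778 trainable parameters, which is small by deep learning standards.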

Evaluation of Model Performance
The models all employed the same data pre-processing, training, and testing techniques. The models were validated using the 10-fold cross-validation technique. K-fold cross-validation is one of the most commonly used approximation approaches for validating the obtained results [47,48]. For the MLP model, 100 epochs were used to train each fold.
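A minimal sketch of 10-fold cross-validation over unique patients is shown below. Each index stands for one patient record, so a patient never appears in both the training and test parts of a fold. Shuffling with a fixed seed is an assumption of the sketch; the text does not state how the folds were drawn.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for shuffled k-fold CV."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 18,844 unique patients, as reported for the experimental dataset.
N_PATIENTS = 18844
splits = list(k_fold_indices(N_PATIENTS, k=10))
```

Each patient is used for testing exactly once and for training in the other nine folds.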
As our measure for evaluating and comparing the performance of the proposed models, we used the area under the receiver operating characteristic curve (AUC-ROC), which is equal to the concordance statistic [49]. We also report values for a set of measures that are commonly used in clinical applications: balanced accuracy (the average of the recall obtained for each class), overall accuracy, F1-score, precision, and the area under the precision-recall curve (PR-AUC).
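Several of these measures can be written directly from their definitions, as in the minimal sketch below. The AUC-ROC is computed as the concordance statistic, that is, the proportion of (elevated, normal) pairs in which the elevated patient receives the higher score. The labels, scores, and the 0.5 decision threshold are hypothetical, and degenerate inputs (for example, a single class) are not handled.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 1 and 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def balanced_accuracy(y_true, y_pred):
    """Average of the recall obtained for each class."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


def precision_and_f1(y_true, y_pred):
    """Precision and F1-score for the positive (elevated) class."""
    tp, _, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, 2 * precision * recall / (precision + recall)


def auc_roc(y_true, scores):
    """Concordance statistic: ties between a pair count as one half."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    concordant = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives for n in negatives
    )
    return concordant / (len(positives) * len(negatives))


# Hypothetical labels and model scores, for illustration only.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

bal_acc = balanced_accuracy(y_true, y_pred)
precision, f1 = precision_and_f1(y_true, y_pred)
auc = auc_roc(y_true, scores)
```

Unlike the threshold-based measures, the AUC-ROC depends only on how the scores rank the patients, which is why it is convenient for comparing models.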
To determine the importance that the black box models (SVM and MLP) place upon each variable, we first compute the SHAP values and LIME scores for all samples in our dataset and then calculate the average absolute SHAP value and LIME score for each predictor.

Results
Table 3 shows the performance metrics obtained using the MLR, RF, SVM, LR and MLP models with and without the longitudinal data. The results show that the models achieved competitive performance on the reported measures. The LR and MLP models, trained with and without the longitudinal data, achieved a higher AUC-ROC than the MLR (the statistical model employed by Wells et al), as well as the RF and SVM models. (More details about the AUC-ROC and PR-AUC curve plots are presented in Multimedia Appendix 5.) The results also show that the SVM, LR and MLP models trained with and without the longitudinal data achieved better performance than the MLR and RF on the balanced accuracy measure.
Figure 4 summarizes the 10-fold performance achieved for the set of measures where the models were trained without longitudinal data, and Figure 5 shows the performance where they were trained with the longitudinal data. Both figures show a more consistent prediction trend for RF, LR, SVM and MLP with and without longitudinal data, as the measures for these models show only a small variation between the folds. As shown in Figures 4 and 5, the SD values for MLR with and without longitudinal data are larger than for the rest of the models. This indicates that the machine learning models used can not only enhance the performance, but also improve the classification confidence for HbA1c prediction.
Table 4 shows the ranked order of importance of the set of predictors used for training the models. Further detail on the actual importance values for each model is provided in Multimedia Appendix 6. (Refer to Multimedia Appendix 7 for more details of the MLR and LR calculator.)
Calculating the importance of the predictors for the MLR models using vectorized longitudinal data was not possible due to the collinearity caused by having multiple variables for BMI. The order of importance obtained using the SHAP method for both the SVM and MLP is identical to that obtained using LIME, providing greater confidence in the explainability methods used (see Multimedia Appendix 6).
For all models trained with longitudinal data, BMI is ranked lower than when the models are trained without longitudinal data. However, the importance value produced for the BMI variable by the models is still not insignificant (see Figures in Multimedia Appendix 7). This indicates that the models are able to find subtle relationships in the longitudinal data that are more relevant to the prediction than BMI, rendering it less important.
When the MLP and LR models are trained on the longitudinal data, the eGFR variable is ranked higher than CHOL and BMI, in contrast to when these models are trained on the current visit only. None of the other models trained with the current visit only, except RF, consider it important. Again, we ascribe this to the information that the model learns from the variations in eGFR values between a patient's visits (longitudinal EHR data).
SHAP values are calculated at the sample level. Figures 8 and 9 illustrate the SHAP values for 2 randomly selected sample patients from our dataset. These figures highlight how different inputs have different SHAP values. The patient in Figure 8 (for whom our model correctly predicts elevated HbA1c levels, ≥5.7%) has a higher RBS value than the patient in Figure 9 (for whom our model correctly predicts normal HbA1c levels, <5.7%). This explains why our MLP model places much more importance on the RBS value of the patient in Figure 8.
The task of predicting HbA1c elevation risk can be challenging. Figure 10 provides a visualization of the datapoints for the 2 classes (pre-diabetic, ≥5.7%, and normal, <5.7%) after mapping the datapoints (for the test data) into 2 dimensions using t-SNE [50]. The overlap in the datapoints visualized in the figure demonstrates the challenge of separating the patients with and without elevated levels of HbA1c (≥5.7%) in the KAIMRC dataset. We avoided intensive feature engineering techniques in the sampling approach used. Nevertheless, the approaches adopted are able to achieve promising results, with an AUC-ROC of 83.22% using the MLP with historical data. In summary, all models show promising results for predicting the current HbA1c elevation levels (≥5.7%) using EHR data. The results emphasize that the HbA1c predictive models learn more effectively when they are trained with the patient longitudinal observations that are normally available from EHR systems.

Discussion and conclusion
EHR systems were adopted for the purpose of improving healthcare outcomes and were not originally intended for research purposes [19]. Patient data stored in EHR systems can be obtained at irregular intervals, as lab instructions are carried out with different frequencies based on the physician's decisions and a patient's visit patterns. It is very common that medical data extracted from EHR systems suffer from problems such as irregularity, incompleteness, and noisy and imbalanced data [13]. These can be challenging obstacles for any technology used for predictive analytics.
The sampling approach used did not affect the balanced nature of the dataset used. As shown in Figure 2, there were 56,185 unique patients before removing the records with 1 or more missing values. We would argue that the absence or the presence of the HbA1c readings is not random. As the sample was collected from the population of Saudi Arabia, the likelihood of a patient taking an HbA1c test is high because of the prevalence of diabetes [51]. This may affect the reproducibility of this work using different populations from different countries, especially those with lower rates of diabetes.
It is hoped that these outcomes will encourage further investigation into the predictability of current HbA1c levels (≥5.7%) using more of the readings normally provided in EHR data. For example, other important readings such as FBS and triglycerides have shown clinical correlations with diabetes [52]. In addition, our dataset contained only 3 years of patient data, which limits the number of patient visits recorded. Figure 11 shows the number of visits made by patients from 2016 to 2018. Figure 12 [...]. This also justifies a sliding window size of 3 as the optimal input size for the models employed. However, we hypothesize that the longitudinal behavior of the features used can be enriched by employing more values obtained over longer periods. Therefore, incorporating more features and their longitudinal behavior over longer periods into the models used in this study would be likely to improve the prediction performance of our chosen models.
Variations in the data or model produce slightly different attribution values. However, due to the critical nature of many healthcare applications, it is always important to verify that our models make 'sensible' predictions. Without the use of SHAP or LIME, this would be hard to verify for any non-linear model. Although it is possible to see that the models have high performance, we would be unable to verify that a model is not relying on spurious correlations. Furthermore, through the use of SHAP, we can verify that MLPs trained on the longitudinal data are learning to use the extra information contained in the longitudinal data (as indicated by the higher importance of eGFR), allowing us to pinpoint the reason these models gain higher performance.
To investigate the effect of temporal dependencies in the data, we also investigated the use of other deep learning models alongside the MLP, such as Long Short-Term Memory (LSTM) and Bidirectional LSTM (BiLSTM) [25,53], for HbA1c prediction. Table 5 reports the results of using these models. The MLP model achieved similar performance to the LSTM and BiLSTM models on all reported measures. This suggests that directly modelling the temporal dynamics in the data is not very helpful. This could be due to the short lengths of the time series, or to weak temporal dependency.
Generalizing our findings using other datasets is challenging because of the accessibility and privacy restrictions that apply to medical datasets. For this reason, and because of the lack of similar studies that have employed machine learning for HbA1c prediction using EHR data, comparing the performance achieved by the models outlined in this work with those developed by other researchers will require the availability of alternative anonymized datasets.

Conclusions
We believe that this study is the first to investigate the performance of machine learning models used with EHR data for predicting current HbA1c elevation risk (≥5.7%) for non-diabetic patients. It is also the first to investigate employing the longitudinal data that are normally stored on EHR systems to enhance the prediction of HbA1c elevation levels.
Our findings show that the MLP model achieves better results when a patient's longitudinal data are combined with current visit data, and the use of longitudinal data also affects the relative importance for the predictors used.
As this work formed a continuation of previous work [24], we avoided changing the sampling approach used. However, studying the impact of applying different sampling approaches could be valuable to explore in future work, as would the use of a larger dataset with more variables and the recording of longitudinal behavior over longer periods.