Self-Supervised Electroencephalogram Representation Learning for Automatic Sleep Staging: Model Development and Evaluation Study

Background: Deep learning models have shown great success in automating tasks in sleep medicine by learning from carefully annotated electroencephalogram (EEG) data. However, effectively using large amounts of raw EEG data remains a challenge.
Objective: In this study, we aim to learn robust vector representations from massive unlabeled EEG signals, such that the learned vectorized features (1) are expressive enough to replace the raw signals in the sleep staging task and (2) provide better predictive performance than supervised models in scenarios involving fewer labels and noisy samples.
Methods: We propose a self-supervised model, Contrast with the World Representation (ContraWR), for EEG signal representation learning. Unlike previous models that use a set of negative samples, our model uses global statistics (ie, the average representation) from the data set to distinguish signals associated with different sleep stages. The ContraWR model is evaluated on 3 real-world EEG data sets that cover both at-home and in-laboratory EEG recording settings.
Results: ContraWR outperforms 4 recently reported self-supervised learning methods on the sleep staging task across 3 large EEG data sets. ContraWR also outperforms supervised learning when fewer training labels are available (eg, 4% accuracy improvement when less than 2% of the data are labeled on the Sleep EDF data set). Moreover, the model provides informative, representative feature structures in 2D projection.
Conclusions: We show that ContraWR is robust to noise and can provide high-quality EEG representations for downstream prediction tasks. The proposed model can be generalized to other unsupervised physiological signal learning tasks. Future directions include exploring task-specific data augmentations and combining self-supervised methods with supervised methods, building on the initial success of self-supervised learning reported in this study.


Introduction
Deep learning models have shown great success in automating tasks in sleep medicine by learning from high-quality labeled EEG data [1]. EEG data are collected from patients wearing clinical sensors, which generate real-time multimodal signals. A common challenge in classifying physiological signals, including EEG, is the lack of high-quality labels. This paper introduces a novel self-supervised model that leverages the inherent structure within large, unlabeled, and noisy datasets to produce robust feature representations. These representations can significantly enhance the performance of downstream classification tasks, such as sleep staging, especially when only limited labeled data are available.
Self-supervised learning (specifically, self-supervised contrastive learning) aims at learning a feature encoder that maps input signals into a vector representation using unlabeled data. Self-supervised methods involve two steps: (I) a pretrain step, which learns the feature encoder without labels; and (II) a supervised step, which evaluates the learned encoder with a small amount of labeled data. During the pretrain step, some recent methods (e.g., MoCo [2], SimCLR [3]) use the feature encoder to construct positive and negative pairs from the unlabeled data and then optimize the encoder by pushing positive pairs closer together and negative pairs farther apart. A positive pair consists of two different augmented versions of the same sample (i.e., two data augmentation methods applied separately to the same sample), while a negative pair is generated from two different samples. For EEG data, the augmentation can be, for example, denoising or channel flipping. In this practice, existing negative sampling strategies often incur sampling bias [4,5], especially for noisy EEG data, which significantly hurts performance [6].
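For concreteness, the following is a minimal PyTorch sketch of the NCE-style objective that such positive/negative-pair methods optimize; the function name, temperature value, and tensor shapes are illustrative assumptions rather than details taken from the cited implementations.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchor, positive, negatives, temperature=0.5):
    """NCE-style contrastive loss: the anchor should be more similar to
    its positive view than to any of the negative samples.
    anchor, positive: (d,) L2-normalized projections of two augmented
    views of the same signal; negatives: (n, d) projections of other signals.
    """
    pos = torch.dot(anchor, positive) / temperature           # similarity to the positive
    neg = negatives @ anchor / temperature                    # similarities to negatives, (n,)
    logits = torch.cat([pos.unsqueeze(0), neg]).unsqueeze(0)  # (1, n+1)
    target = torch.zeros(1, dtype=torch.long)                 # index 0 = the positive pair
    return F.cross_entropy(logits, target)                    # -log softmax of the positive
```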
Technically, this paper contributes to the pretrain step, where we address the inherent limitations of the negative sampling strategy in existing self-supervised methods (e.g., MoCo [2], SimCLR [3]) by leveraging global data statistics. In contrastive learning, positive pairs bring similarity information, while negative pairs provide contrastive information. Both types of information are essential for learning an effective feature encoder. This paper proposes a new method to address the limitation of negative sampling, named Contrast with the World Representation (abbreviated ContraWR), where a global average representation over the dataset (called the world representation) provides the contrastive information. We thus propose robust contrastive guidance in the absence of labels: the representation similarity between positive pairs should be stronger than the similarity to the world representation. Derived from global data statistics, the world representation brings robust contrastive information even in noisy environments. Moreover, we strengthen our model with an instance-aware world representation for individual samples, where closer samples have larger weights in calculating the global average. Our experiments show that the instance-aware world representation makes the model more accurate, and this conclusion aligns with the findings of a previous paper [6].
We evaluate the proposed ContraWR on the sleep staging task with three real-world EEG datasets. Our model achieves results comparable to or better than recent popular self-supervised methods: MoCo [2], SimCLR [3], BYOL [7], and SimSiam [8]. The results also show that self-supervised contrastive methods, especially our ContraWR method, are much more powerful than supervised learning in low-label scenarios (e.g., 4% accuracy improvement on sleep staging with less than 2% of the Sleep EDF dataset used as training data).

EEG Datasets
We consider three real-world EEG datasets for this study:
• Sleep Heart Health Study (SHHS) [9,10] is a multi-center cohort study from the National Heart, Lung, and Blood Institute assembled to study sleep-disordered breathing, which contains 5,445 recordings. Each recording has 14 polysomnography (PSG) channels, and the recording frequency is 125.0 Hz. We use the C3/A2 and C4/A1 EEG channels.
• Sleep EDF [11] (cassette portion) is another benchmark dataset, collected in a 1987-1991 study of age effects on sleep in healthy Caucasians aged 25-101 who were not taking any sleep-related medication; it contains 153 full-night EEG recordings with a recording frequency of 100.0 Hz. We extract the Fpz-Cz/Pz-Oz EEG channels as the raw inputs to the model.
• MGH Sleep [1] is collected from the sleep laboratory at Massachusetts General Hospital (MGH), where six EEG channels (i.e., F3-M2, F4-M1, C3-M2, C4-M1, O1-M2, O2-M1) recorded at a frequency of 200.0 Hz are used for sleep staging. After filtering out mismatched signals and missing labels, we finally obtain 6,478 recordings.
The first two datasets are at-home PSG recordings, while MGH Sleep was collected in the laboratory. Dataset statistics can be found in Table 1, and the class label distribution is in Table 2. For these datasets, the ground truth labels were released by the original data publishers. To align with the problem setting, subjects are randomly assigned to the pretrain, training, and test sets with different proportions (90%: 5%: 5% for Sleep EDF and MGH, 98%: 1%: 1% for SHHS, since they have different amounts of data). All epochs segmented from one subject are placed within the same set. The pretrain set is used for self-supervised learning, so we remove its labels.
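A minimal sketch of this subject-level split is given below, assuming a NumPy array of per-subject IDs; the helper name and seed are hypothetical.

```python
import numpy as np

def split_by_subject(subject_ids, fractions=(0.90, 0.05, 0.05), seed=0):
    """Split at the subject level so that all epochs from one subject
    fall into the same set. `fractions` follows the Sleep EDF/MGH
    setting (90%/5%/5%); use (0.98, 0.01, 0.01) for SHHS."""
    rng = np.random.default_rng(seed)
    subjects = rng.permutation(np.unique(subject_ids))
    n_pre = int(fractions[0] * len(subjects))
    n_train = int(fractions[1] * len(subjects))
    pretrain = subjects[:n_pre]               # labels removed for pretraining
    train = subjects[n_pre:n_pre + n_train]   # small labeled training set
    test = subjects[n_pre + n_train:]         # held-out labeled test set
    return pretrain, train, test
```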
In the pretrain step, the EEG self-supervised representation learning problem requires building a feature encoder $f(\cdot)$ from the pretrain set (without labels), which maps one epoch $x$ into a vector representation $h \in \mathbb{R}^d$, where $d$ is the feature dimensionality, such that the representation $h$ can replace the raw signal in downstream classification tasks. The evaluation of the encoder $f(\cdot)$ is conducted on the training and test data (with labels). We focus on sleep staging as the supervised step, where the feature vector of a sample $x$ is mapped into one of five sleep stage labels: wake (W), rapid eye movement (REM, R), non-REM 1 (N1), non-REM 2 (N2), and non-REM 3 (N3), based on the American Academy of Sleep Medicine (AASM) scoring standards [12]. Specifically, based on the feature encoder from the pretrain step, the training set is used to learn a linear model on top of the feature vectors, and the test set is used to evaluate the linear classification performance.

Background and Concepts
Self-supervised learning happens in the pretrain step, and it uses representation similarity to exploit the unlabeled signals, with an encoder network $f(\cdot): \mathbb{R}^{C \times T} \rightarrow \mathbb{R}^d$ followed by a projection network. Given a signal, two augmented views are encoded and projected: one projection $z$ is called the anchor, and the projection $z'$ of the other augmented view of the same sample is called the positive sample; these two together are called a positive pair. For a large number of projections derived from other randomly selected signals (by a negative sampling strategy), their representations are commonly conceived of as negative samples (though they are random samples), and any one of them together with the anchor is called a negative pair. The loss function $\mathcal{L}$ is derived from the similarity comparison between the positive pair and the negative pairs (e.g., encouraging the similarity of the positive pair to be stronger than that of all the negative pairs, referred to as the noise contrastive estimation (NCE) loss [13] in the Appendix). To reduce clutter, we use $z$ to denote the L2-normalized version of the projection in the rest of the paper. A common forward flow of self-supervised learning on EEG signals is thus: augment the signal into two views, encode and project each view, and compare the resulting representations. For the data augmentation part, this paper uses four methods, illustrated in the supplementary material: (I) Bandpass filtering: to reduce noise, we use an order-1 Butterworth filter (the band is specified in the supplementary material); (II) Noising: we add extra high-frequency or low-frequency noise to each channel; (III) Channel flipping: corresponding sensors from the left and right sides of the head are swapped; (IV) Shifting: within one sample, we advance or delay the signal by a certain time span. Detailed configurations of the augmentation methods vary across the three datasets and are listed in the supplementary material. We conduct ablation studies on the augmentation methods in the experiments; a sketch of these augmentations is given below.
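The following sketch illustrates the four augmentation families in Python (NumPy/SciPy). Cutoff frequencies, noise scales, and shift ranges are illustrative assumptions, since the exact configurations are dataset specific and given in the supplementary material.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass_filter(x, fs, low=0.5, high=30.0):
    """(I) Bandpass filtering with an order-1 Butterworth filter.
    The cutoff frequencies here are illustrative."""
    sos = butter(1, [low, high], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x, axis=-1)

def add_noise(x, scale=0.1):
    """(II) Noising: simplified additive noise standing in for the
    paper's high-/low-frequency noise."""
    return x + scale * np.random.randn(*x.shape)

def channel_flip(x, left, right):
    """(III) Channel flipping: swap corresponding left/right sensors.
    `left` and `right` are index lists of paired channels."""
    y = x.copy()
    y[left], y[right] = x[right], x[left]
    return y

def time_shift(x, max_shift):
    """(IV) Shifting: advance or delay the signal by up to `max_shift`
    samples (circular shift for simplicity)."""
    return np.roll(x, np.random.randint(-max_shift, max_shift + 1), axis=-1)
```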

(I). ContraWR: Contrast with the World Representation
As mentioned above, negative sampling can introduce bias into the pretrain step and can undermine representation quality. We propose a new self-supervised learning method, Contrast with the World Representation (ContraWR). ContraWR replaces the large number of negative samples with a single average representation over the dataset, called the world representation. The world representation works as a reference to calibrate the model, making the pretrain step more effective and robust to noise. Our loss function in the pretrain step follows the principle that the representation similarity between a positive pair should be stronger than the similarity between the anchor and the world representation. The pipeline of ContraWR is shown in Figure 1. The online networks $f_\theta(\cdot)$, $g_\theta(\cdot)$ and the target networks $f_{\theta'}(\cdot)$, $g_{\theta'}(\cdot)$ share an identical network structure. The encoder networks $f_\theta(\cdot)$ and $f_{\theta'}(\cdot)$ map two augmented versions of the same signal to feature representations, respectively. Then, the projection networks $g_\theta(\cdot)$ and $g_{\theta'}(\cdot)$ project the feature representations onto a unit hypersphere, where a triplet loss is defined over the anchor $z$, the positive $z'$, and the world average $z_w$. During optimization, the online networks are updated by gradient descent, while the target networks update their parameters from the online networks with an exponential moving average (EMA) trick [2]:
$$\theta_t = \theta_{t-1} - \eta \nabla_{\theta}\mathcal{L}, \qquad \theta'_t = \tau\,\theta'_{t-1} + (1-\tau)\,\theta_t,$$
where $t$ indicates the $t$-th update, $\eta$ is the learning rate, and $\tau$ is a weight hyperparameter. After this optimization in the pretrain step, the encoder network $f_\theta(\cdot)$ is ready to be evaluated on the training and test sets in the supervised step.
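A minimal sketch of one pretrain update is given below, assuming L2-normalized batch embeddings. The batch mean stands in for the dataset-level world representation, and cosine similarity stands in for the paper's similarity measure, so treat the details as assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(target_net, online_net, tau=0.99):
    """Target parameters track the online parameters through an
    exponential moving average: theta' <- tau*theta' + (1-tau)*theta."""
    for p_t, p_o in zip(target_net.parameters(), online_net.parameters()):
        p_t.mul_(tau).add_((1.0 - tau) * p_o)

def contrawr_loss(z_online, z_target, margin=0.2):
    """Triplet-style objective: each anchor should be closer to its
    positive view (from the target network) than to the world
    representation, here approximated by the batch-average embedding.
    z_online, z_target: (batch, d), L2-normalized."""
    z_world = z_target.mean(dim=0, keepdim=True)                # (1, d) world representation
    sim_pos = F.cosine_similarity(z_online, z_target, dim=-1)   # anchor vs. positive
    sim_world = F.cosine_similarity(z_online, z_world, dim=-1)  # anchor vs. world average
    return F.relu(sim_world - sim_pos + margin).mean()
```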

(II). ContraWR+: Contrast with Instance-aware World Representation
To learn a better representation, we introduce a weighted-average world representation, based on a harder principle: the similarity between a positive pair should be stronger than the similarity between the anchor and the weighted average of the feature representations of the dataset, where the weight is set higher for closer samples. We call the new model ContraWR+. This is a more difficult objective than the simple global average used in ContraWR.

Instance-Aware World Representation. In this new model, the world representation is enhanced by making the sampling distribution instance specific. We define $p(\cdot \mid x)$ as the instance-aware sampling distribution for an anchor $x$, different from the sample distribution $p(\cdot)$ used in ContraWR:
$$p(\cdot \mid x) \propto \exp\!\left(\frac{\langle \cdot,\, z_x \rangle}{T}\right),$$
where $T > 0$ is a temperature hyperparameter, such that samples similar to the anchor are selected with higher probability under $p(\cdot \mid x)$. Consequently, for an anchor $x_i$, the instance-aware world representation becomes
$$z_{w(i)} = \mathbb{E}_{x_j \sim p(\cdot \mid x_i)}\left[z_j\right] = \frac{\mathbb{E}_{x_j \sim p}\!\left[\exp\!\left(\langle z_j, z_i\rangle / T\right)\, z_j\right]}{\mathbb{E}_{x_j \sim p}\!\left[\exp\!\left(\langle z_j, z_i\rangle / T\right)\right]}.$$
Here, $T$ controls the contrastive hardness of the world representation. When $T \rightarrow \infty$, $p(\cdot \mid x)$ is asymptotically identical to $p(\cdot)$, and the above equation reduces to the simple global average $z_w = \mathbb{E}_{x_j \sim p(\cdot)}[z_j]$; when $T \rightarrow 0^+$, the form becomes trivial: $z_{w(i)}$ collapses to the representation of the anchor's nearest neighbor. We tested different values of $T$ and found that the model is not sensitive to $T$ over a wide range. In practice, $z_{w(i)}$ is approximated by Monte Carlo sampling, and the similarity between the anchor $z_i$ and the new world representation $z_{w(i)}$ replaces the similarity to the simple global average in the objective. As in ContraWR, we use the triplet loss as the final objective. A code sketch of the weighted average is given below.
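A minimal sketch of the weighted average above, assuming a batch of embeddings as the Monte Carlo sample; names are illustrative.

```python
import torch

def instance_aware_world(z_batch, z_anchor, T=2.0):
    """Monte Carlo estimate of the instance-aware world representation:
    batch samples closer to the anchor get exponentially larger weights.
    z_batch: (n, d) embeddings drawn from the dataset; z_anchor: (d,).
    As T -> infinity the weights become uniform (plain ContraWR average);
    as T -> 0+ the result collapses to the anchor's nearest neighbor."""
    weights = torch.softmax(z_batch @ z_anchor / T, dim=0)  # exp(<z_j, z_i>/T), normalized
    return weights @ z_batch                                # (d,) weighted average
```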

Implementations
Baseline Methods. In the experiments, several recent self-supervised learning methods are implemented for comparison:
• MoCo [2] devises two parallel encoders updated with an exponential moving average (EMA). It also utilizes a large memory table to store negative samples, which are frequently updated.
• SimCLR [3] uses one encoder network to generate both anchor and positive samples, where negative samples are collected from the same batch.
• BYOL [7] also employs two encoders: one online network and one target network. It adds one more predictive layer on top of the online network to predict (reconstruct) the output of the target network, and no negative samples are used.
• SimSiam [8] uses the same encoder network on both sides and also does not utilize negative samples.
Model Architecture. Our proposed ContraWR and ContraWR+ use the same encoder architecture, as shown in Figure 2. This architecture cascades a short-time Fourier transform (STFT) operation, a 2D convolutional neural network layer, and three 2D convolutional blocks. Empirically, we find that applying the neural network to the STFT spectrogram yields better accuracy than applying it to the raw signals; similar practices can be found in [14,15]. For a fair comparison, the baseline approaches use the same augmentations and encoder architecture. A sketch of this design is given below.
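A PyTorch sketch of this design, assuming a single-channel input; the STFT parameters and filter counts are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class STFTEncoder(nn.Module):
    """Sketch of the encoder: an STFT spectrogram followed by a small
    stack of 2D convolutions and global pooling."""

    def __init__(self, n_fft=256, hop_length=64, d_out=128):
        super().__init__()
        self.n_fft, self.hop_length = n_fft, hop_length
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),            # global pooling over freq x time
        )
        self.fc = nn.Linear(64, d_out)

    def forward(self, x):                       # x: (batch, time), one EEG channel
        window = torch.hann_window(self.n_fft, device=x.device)
        spec = torch.stft(x, self.n_fft, self.hop_length, window=window,
                          return_complex=True).abs()   # magnitude spectrogram
        h = self.conv(spec.unsqueeze(1))        # add channel axis -> (batch, 64, 1, 1)
        return self.fc(h.flatten(1))            # (batch, d_out) representation
```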
We also consider a supervised model (called Supervised) with the same encoder, on top of which we further add a 2-layer fully connected network (128-unit for Sleep EDF, 256-unit for SHHS, and 192-unit for MGH) for the sleep staging classification task.The supervised model does not use the pretrain set but is trained from scratch on raw EEG signals in the training set and tested on the test set.We also include an Untrained Encoder model as a baseline, where the encoder is initialized but not optimized in the pretrain step.
Evaluation Protocol. We evaluate performance on the sleep staging task with overall five-class classification accuracy. Each experiment is conducted with five different random seeds. For the self-supervised methods, we optimize the encoder for 100 epochs (here, "epoch" means one full pass over the training data, not a 30-second EEG epoch) on unlabeled data, fit a logistic regression classifier on the training set, and evaluate on the test set, following [2,3]. For the supervised method, we train the model for 100 epochs on the training set. This setting ensures the convergence of all models.
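A minimal sketch of this linear evaluation step with scikit-learn, assuming feature matrices have already been extracted by the frozen encoder; names are illustrative.

```python
from sklearn.linear_model import LogisticRegression

def linear_probe(train_feats, train_labels, test_feats, test_labels):
    """Linear evaluation protocol: fit a logistic-regression classifier
    on frozen encoder features from the labeled training set and report
    five-class accuracy on the test set."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)
```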

Better Accuracy in Sleep Staging
Comparisons on the downstream sleep staging task are shown in Table 3.

Ablation Study on Data Augmentations
We also inspect the effectiveness of different augmentation methods on EEG signals, as shown in Table 4. We empirically test all possible combinations of the four considered augmentations: channel flipping, bandpass filtering, noising, and shifting. Since channel flipping cannot be applied on its own, we combine it with the other augmentations. The evaluation is conducted on Sleep EDF with the ContraWR+ model. In summary, all augmentation methods are beneficial, and collectively they can further boost the classification performance.

Varying Amount of Training Data
To further investigate the benefits of self-supervised learning, we evaluate the effectiveness of the learned feature representations with varying amounts of training data on Sleep EDF (Figure 3). The default setting splits all the data into pretrain/training/test sets by 90%: 5%: 5% (as stated in the problem formulation).
In this section, we keep the 5% test set unchanged and re-split the pretrain and training sets (after re-splitting, we ensure that all training set data have labels and remove the labels from the pretrain set), such that the training proportion becomes 0.5%, 1%, 2%, 5%, or 10%, with the rest used as the pretrain set. This re-splitting is conducted at the subject level, after which we again segment each subject's recording within the pretrain or training set. We compare our ContraWR+ to MoCo, SimCLR, BYOL, SimSiam, and the supervised baseline. Our model consistently outperforms the compared models across different amounts of training data. For example, with only 5% of the data for training, our model achieves performance similar to the best baseline, BYOL, trained with twice as much data (10%). Also, compared to the supervised model, the self-supervised methods perform better when labels are insufficient, e.g., when ≤ 2% of the data are labeled.

Representation Projection
We next assess the quality of the learned feature representations. To do this, we use the representations produced by ContraWR+ on the MGH dataset and randomly select 5,000 signal epochs per label. The ContraWR+ encoder is optimized in the pretrain step without using the labels. We extract a feature representation for each sample through the encoder network and use uniform manifold approximation and projection (UMAP) [16] to project the representations onto 2D space. Finally, we color-code the samples according to their sleep stage labels for illustration.
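A sketch of this projection step, assuming the umap-learn package and precomputed features; the helper name is hypothetical.

```python
import numpy as np
import umap  # from the umap-learn package

def project_to_2d(features, labels, per_class=5000, seed=0):
    """Subsample up to `per_class` epochs per sleep stage and project
    the frozen-encoder features to 2D with UMAP for visualization."""
    rng = np.random.default_rng(seed)
    keep = np.concatenate([
        rng.choice(np.flatnonzero(labels == c),
                   size=min(per_class, int((labels == c).sum())),
                   replace=False)
        for c in np.unique(labels)
    ])
    embedding = umap.UMAP(n_components=2, random_state=seed).fit_transform(features[keep])
    return embedding, labels[keep]   # 2D coordinates + stage labels for coloring
```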
The 2D projection is shown in Figure 4, together with the confusion matrix from the evaluation stage (based on the test set). In the UMAP projection, epochs from the same latent class are closely co-located, which means the pretrain step extracts information important for sleep stage classification from the raw unlabeled EEG signals. Stage N1 overlaps with stages W, N2, and N3, which is expected given that N1 is often ambiguous and thus difficult to classify even for well-trained experts [1]. Hyperparameter ablation results are shown in Figure 5; the red star indicates the default configuration. Each configuration is run with 5 different random seeds, and the error bars indicate the standard deviation over the 5 runs. Performance improves somewhat with larger batch sizes. Over a large range ($\sigma < 10$), the model is insensitive to the Gaussian width $\sigma$. For the temperature $T$, we noted previously that a very small $T$ may be problematic and a very large $T$ reduces ContraWR+ to ContraWR; the ablation experiments show that performance is relatively insensitive to the choice of $T$. For the margin $m$, the distance difference is bounded (given fixed $\sigma = 2$); thus, $m$ should simply be chosen large enough, i.e., $m \geq 0.1$. Overall, the model benefits from a larger batch size while being insensitive to the three hyperparameters.

Principal Results
Our proposed ContraWR and ContraWR+ models outperform 4 recent self-supervised learning methods on the sleep staging task across 3 large EEG datasets (p<.002). ContraWR+ also outperforms supervised learning when fewer labels are available (e.g., 4% accuracy improvement when less than 2% of the data are labeled). Moreover, the models provide well-separated, representative structures in 2D projection.
Recently, self-supervised contrastive learning [2,3,7,8,14] has become popular; its loss functions are devised from representation similarity and negative sampling. However, recent work [4] highlighted inherent limitations of negative sampling and showed that this strategy could significantly hurt the learned representations [5]. To address these limitations, Chuang et al. [5] utilized the law of total probability and approximated the per-class negative sample distribution using a weighted sum of the global data distribution and the expected class label distribution. However, without the actual labels, the true class label distribution is unknown. Grill et al. [7] and Chen et al. [8] proposed ignoring negative samples and learning latent representations using only positive pairs. In this paper, we still use negative information, but we replace negative sampling with the global average (i.e., the world representation). We argue, and show experimentally, that contrasting with the world representation is more powerful and robust in the noisy EEG setting.

EEG Sleep Staging
Before the emergence of deep learning, several traditional machine learning approaches [26-28] significantly advanced the field using hand-crafted features, as highlighted in [29]. Recently, deep learning models have been applied to various large sleep databases. SLEEPNET [29] built a comprehensive system combining many machine learning models to learn sleep signal representations. Biswal et al. [1] designed a multi-layer RCNN model to process multi-channel EEG signals. To provide interpretable stage prototypes, Al-Hussaini et al. [30] developed SLEEPER, a model that utilizes prototype learning guided by a decision tree to produce more interpretable results. These works rely on a large set of labeled training data. However, annotations are expensive, and the labeled set is often small. In this paper, we exploit the large set of unlabeled data to improve classification, which is more challenging.

Self-supervised Learning on Physiological Signals
While image [31,32], video [33], language [34,35], and speech [36] representations have benefited from contrastive learning, research on learning physiological signal representations has been limited [37,38]. Lemkhenter et al. [39] proposed phase and amplitude coupling for physiological data augmentation. Banville et al. [40] conducted representation learning on EEG signals, targeting monitoring and pathology screening tasks without utilizing frequency information. Cheng et al. [41] learned subject-aware representations for ECG data and tested various augmentation methods. While most of these methods are based on pairwise similarity comparison, our model derives contrastive information from global data statistics, providing more robust representations. In addition, we extract signal information from the spectral domain.

Strengths and Limitations
Strengths of our study are as follows: (I) we use three real-world datasets collected from different institutions and across different year ranges, and two of them are publicly available; (II) our PSG recordings are diverse, including two datasets collected at home and one collected in a laboratory setting, all of relatively large size, which supports generalizability; (III) we have open-sourced our data processing pipelines and all programs used for this study, including the baseline model implementations; (IV) we propose new data augmentation methods for PSG signals and systematically evaluate their effectiveness. However, limitations of our study should be noted: (I) we fixed the neural network encoder architecture in this study; in the future, we plan to explore other models, such as recurrent neural networks.

Figure 1. ContraWR Model Pipeline. We show the two-way model pipeline in this figure. The online network (upper) is updated by gradient descent, while the target network (lower) is updated by an exponential moving average (EMA). Finally, the outputs of the two networks form the triplet loss function.

Figure 2. STFT Convolutional Encoder Network. The encoder network first transforms raw signals into a spectrogram by short-time Fourier transform (STFT); a CNN-based encoder is then built on top of the spectrogram.

Figure 3. Model Performance with Different Amounts of Training Data (on Sleep EDF). Format: curves are the mean values and shaded areas are the standard deviations over 5 random seeds. All models have the same encoder network architecture. For the self-supervised methods, we train a logistic regression model on top of the frozen encoder with the training set; for the supervised model, we train the encoder along with the final nonlinear classification layer from scratch with the training set. The amount of training data is set to 0.5%, 1%, 2%, 5%, and 10%.

Figure 4. UMAP Projection and Confusion Matrix. Using the MGH dataset, we project the output representation of each signal into 2D space and color by the actual labels (left). We also show the confusion matrix for sleep staging (right).

Hyperparameter Ablation Study
To investigate the sensitivity of our model to hyperparameter settings, we test different batch sizes and different values of the Gaussian parameter $\sigma$, the temperature $T$, and the margin $m$. We focus on the ContraWR+ model and evaluate it on the Sleep EDF dataset. The default settings are batch size = 256, $\sigma$ = 2, $T$ = 2, $m$ = 0.2, learning rate = 2e-4, weight decay = 1e-4, and 100 epochs. When varying one hyperparameter, the others are held fixed.

Figure 5. Ablation Study on Batch Size and Three Hyperparameters. Format: curves are the mean values and shaded areas are the standard deviations over 5 random seeds. The red star denotes the default setting. The model performs better with a larger batch size, while it is insensitive to the three hyperparameters.

Table 1. Dataset Statistics.

The training and test sets are usually small, but their EEG recordings are labeled, while the pretrain set contains a large number of unlabeled recordings. Within each set, the long recordings are segmented into disjoint 30-second windows. Each window is called an epoch, denoted as $x \in \mathbb{R}^{C \times T}$: each epoch has the same format, with $C$ input channels and $T$ timestamps per channel.
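A minimal sketch of this segmentation, assuming a NumPy recording array of shape (channels, samples); the helper name is hypothetical.

```python
import numpy as np

def segment_epochs(recording, fs, win_sec=30):
    """Cut one (C, total_samples) recording into disjoint 30-second
    epochs of shape (C, T), dropping any trailing remainder."""
    T = int(win_sec * fs)                      # samples per epoch
    n = recording.shape[1] // T                # number of whole epochs
    trimmed = recording[:, :n * T]
    return trimmed.reshape(recording.shape[0], n, T).transpose(1, 0, 2)  # (n, C, T)
```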

Table 3. Sleep Staging Accuracy Comparison with Different Methods (%). All self-supervised methods outperform the Untrained Encoder model, indicating that the pretrain step does learn useful features from unlabeled data. We observe that ContraWR and ContraWR+ both outperform the supervised model, suggesting that the feature representations provided by the encoder better preserve the predictive features and filter out noise than the raw signals do for the sleep staging task when the amount of labeled data is insufficient (e.g., less than 2% in Sleep EDF). Compared to the other self-supervised methods, our proposed ContraWR+ also provides better predictive accuracy: about 1.3% on Sleep EDF, 0.8% on SHHS, and 1.3% on MGH Sleep. The performance improvements are mostly significant at p<.001, except for the comparison with MoCo on the Sleep EDF dataset, where the p-value is 1.7e-3. MGH Sleep contains more noise than the other two datasets (reflected by the relatively low accuracy of the supervised model on raw signals). Notably, the performance gain over other self-supervised or supervised models is much more significant on MGH (about 3.3% relative improvement in accuracy), which suggests that the proposed models handle noisy environments better.

Table 4. Evaluation Accuracy of Different Augmentations (%). Format: mean ± standard deviation over 5 random seeds.