Using a Large Margin Context-Aware Convolutional Neural Network to Automatically Extract Disease-Disease Association from Literature: Comparative Analytic Study

Background: Research on disease-disease association (DDA), like comorbidity and complication, provides important insights into disease treatment and drug discovery, and a large body of literature has been published in the field. However, using current search tools, it is not easy for researchers to retrieve information on the latest DDA findings. First, comorbidity and complication keywords pull up large numbers of PubMed studies. Second, disease is not highlighted in search results. Finally, DDA is not identified, as currently no disease-disease association extraction (DDAE) dataset or tools are available.

Objective: As there are no available DDAE datasets or tools, this study aimed to develop (1) a DDAE dataset and (2) a neural network model for extracting DDA from the literature.

Methods: In this study, we formulated DDAE as a supervised machine learning classification problem. To develop the system, we first built a DDAE dataset. We then employed two machine learning models, support vector machine and convolutional neural network, to extract DDA. Furthermore, we evaluated the effect of using the output layer of the convolutional neural network as features of the support vector machine-based model. Finally, we implemented a large margin context-aware convolutional neural network architecture to integrate context features and convolutional neural networks through the large margin function.

Results: Our DDAE dataset consisted of 521 PubMed abstracts. Experiment results showed that the support vector machine-based approach achieved an F1 measure of 80.32%, which is higher than that of the convolutional neural network-based approach (73.32%). Using the output layer of the convolutional neural network as a feature for the support vector machine did not further improve the performance of the support vector machine. However, our large margin context-aware convolutional neural network achieved the highest F1 measure of 84.18% and demonstrated that combining the hinge loss function of the support vector machine with a convolutional neural network in a single neural network architecture outperforms other approaches.

Conclusions: To facilitate the development of text-mining research for DDAE, we developed the first publicly available DDAE dataset consisting of disease mentions, Medical Subject Heading IDs, and relation annotations. We developed different conventional machine learning models and neural network architectures and evaluated their effects on our DDAE dataset. To further improve DDAE performance, we propose a large margin context-aware convolutional neural network model for DDAE that outperforms other approaches.


Background
The origin and treatment of disease is an important research field in the life sciences, covering a wide range of research topics such as comorbidity, complication, genetic disorder, drug treatment, and adverse drug reaction. As disease is involved in many areas, new scientific findings are frequently made or updated.
Disease-disease association (DDA) is an important research topic in the biomedical domain [1][2][3][4][5]. The influence of one disease on others is wide ranging and can manifest in any patient. Diabetes, for example, may cause macrovascular diseases [6], such as cardiovascular disease [7] and cerebrovascular disease [8]. Treating a disease without consideration of potential DDAs may result in poor treatment outcomes. Therefore, DDAs are often a prime concern for researchers and doctors involved in drug discovery and disease treatment. Figure 1 illustrates examples of DDAs in the literature (refer to Multimedia Appendix 1 for more examples, including comorbidity, complications, general associations, and risk factors). There have been several studies attempting to generate disease connectivity networks [3][4][5]. However, the enormous and rapidly growing disease-related literature has not been utilized.
Finding DDA in the literature is a time-consuming and challenging task for researchers. First, there are huge numbers of DDA papers to sort through, and existing search engines, such as PubMed, do not mark up all relevant disease mentions in search results. Although there are text-mining tools available that can automatically identify diseases [9][10][11], genes [10,12,13], chemicals [14,15], and associations among them [16][17][18][19][20][21][22], they have not been integrated into a single interface to assist researchers in searching through the latest DDA findings. The main obstacle in creating a DDA extraction (DDAE) system is the lack of a relevant dataset. Moreover, only a few text-mining approaches [23] are suitable for extracting DDA.
In this study, we compiled a DDAE dataset consisting of 521 annotated PubMed abstracts. As it is hard for a human annotator to distinguish one DDA type from another without reading a broader context, such as a whole paragraph, we annotated only 3 DDA types: positive, negative, and null associations:
1. Positive associations include comorbidity, complications, physical associations, and risk factors.
2. Negative associations are counted when the text clearly states that there is no association between 2 diseases.
3. Null associations are annotated when 2 diseases co-occur in a sentence, but no association is stated, suggested, or apparent.
In this study, we formulated DDAE as a supervised machine learning (ML) classification task in which, given a sentence containing a disease pair, the goal was to classify the pair into one of the DDA types. For classification, we employed 2 machine learning models, support vector machine (SVM) [24] and convolutional neural network (CNN) [25]. We compared different combinations of SVM and CNN to maximize performance, arriving at a novel neural network architecture, which we termed large margin context-aware CNN (LC-CNN). LC-CNN achieved the highest F1 measure of 84.18% on our DDAE test set.

Related Work
In this section, we first review published disease annotation datasets. Then, we briefly review different methods of relation extraction in biomedical domains.

Disease Annotation Datasets
Before identifying DDAs, we have to identify diseases in the text first. Fortunately, there are many datasets for developing such disease name recognition and normalization systems. The National Center for Biotechnology Information (NCBI) disease dataset [26] is the most widely used. For instance, Leaman and Lu [9] proposed a semi-Markov model trained on the NCBI disease dataset that achieved an F1 measure of 80.7%. However, DDAs are not annotated in the NCBI dataset abstracts, limiting its usefulness for the DDAE task.
As DDAs can give insights into disease etiology and treatment, many studies focus on generating DDA networks [1][2][3][4][5]. For example, Sun et al [4] used disease-gene associations in the Online Mendelian Inheritance in Man [27] to predict DDAs with similar phenotypes. Bang et al [3] used disease-gene relations to define a disease-disease network and confirmed the causalities of disease pairs using clinical results and metabolic pathways. However, the constructed networks lack text evidence and therefore cannot be used to develop a DDAE dataset.
Xu et al [23] proposed a semisupervised iterative pattern-learning approach to learn DDA patterns from PubMed abstracts. They constructed a disease-disease risk relationship knowledge base (dRiskKB) consisting of 34,000 unique disease pairs. However, some limitations of dRiskKB make it hard to use in developing DDAE systems. First, dRiskKB only provides positive DDA sentences. Owing to the lack of negative instances, it cannot be used to train ML-based classifiers. In addition, as the development of dRiskKB is based on a pattern-learning approach, it only includes DDA sentences with very simple structures and thus is not ideal for training a DDA system capable of analyzing complicated sentences.
To solve the above problems, we developed a DDAE dataset. Our dataset differed from dRiskKB in 3 aspects. First, our DDAE dataset contained positive, negative, and null DDAs. Second, it did not use patterns to annotate DDAs and therefore included DDA sentences with more complex expressions. Finally, it annotated DDAs in the entire abstract, allowing an ML-based classifier to use document-level features.

Relation Extraction
Rule-based approaches are commonly used in new domains or tasks that do not have large-scale annotated datasets. Lee et al's [28] approach is an example. They extracted protein-protein interactions (PPIs) from plain text using handcrafted dependency rules. Their approach did not require a training set, yet it achieved a high precision of 97.4% on the Artificial Intelligence in Medicine (AIMed) dataset [29]. However, it was difficult for them to create rules that could extract all PPIs, and their system therefore achieved a low recall of 23.6%. Nguyen et al [30] used predicate-argument structure (PAS) [31] rules to extract more general relations, including PPI and drug-drug interaction. Their rules detected PPIs by examining where relation verbs and proteins are located in the spans of predicates and arguments. Their approach required less effort to design rules and was able to adapt to different relation types. Compared with Lee et al's system, it achieved a higher recall of 52.6% on the AIMed dataset but a lower precision of 30.4%.
ML-based approaches can usually achieve higher performance than rule-based ones. For instance, Zhang et al [32] used hybrid feature-based and tree-based kernels implemented with SVM-LIGHT-TK [33] for PPI extraction. The feature-based kernel uses the pretrained word-embedding model of SENNA (Semantic/syntactic Extraction using a Neural Network Architecture) [34]. In the tree-based kernel configuration, the sentence dependency structure is used as input. The structure is decomposed into substructures and then transformed into one-hot encoding features for SVMs. For drug-drug interaction extraction, Zhao et al [20] proposed a syntax CNN (SCNN) that integrates syntactic features, including words, predicates, and shortest dependency paths, into a CNN. They trained their model with word2vec [40] and the Enju parser [31]. The Enju parser breaks the sentence into PASs, and non-PAS words or phrases are removed. The pruned sentences are then used to train the word-embedding model. Their approach achieved an F score of 68.6% on the 2013 DDIExtraction dataset.
Our LC-CNN was inspired by Zhao et al's [20] SCNN architecture, with 3 main differences. First, we replaced the log loss function with the hinge loss function. Second, SCNN uses a fully connected layer for traditional features before merging them with the CNN's output, whereas LC-CNN directly merges the CNN's output with traditional features. Finally, SCNN's traditional features only use sentence-level information, whereas LC-CNN uses both sentence-level and document-level features.

Study Process
In this section, we first describe the process of DDAE dataset construction. We then introduce our LC-CNN architecture in subsection The Neural Network Architecture. Next, we describe each layer of LC-CNN in subsections Composite Embedding Vector to Output Layer of Combined Sentence and Context Vector. Finally, we introduce backward propagation for learning the parameters of each layer.

Dataset Construction
The process of DDAE dataset construction is illustrated in Figure 2. Our DDAE dataset consisted of abstracts found in PubMed. To generate PubMed search queries related to DDA, we selected all disease nodes of the MeSH [41] tree whose tree number starts with the prefix C or F, indicating diseases. We then selected any nodes related to human diseases. This produced a list of approximately 4700 disease names, which we then used to retrieve 236,000 abstracts whose titles or content contained one or more query terms. As some of these abstracts did not contain any DDAs, we used simple heuristic rules and a disease name recognizer/normalizer to select abstracts with a higher likelihood of containing DDAs.
The process was as follows:
1. We selected only abstracts published from 2013 to 2017.
2. We used DNorm [42] to annotate disease mentions and their Medical Subject Headings (MeSH) IDs in these abstracts.
3. To ensure that the selected abstracts contained rich DDAs for training classifiers, we removed abstracts that had fewer than 3 sentences containing at least 2 different disease MeSH IDs.
4. To ensure that the selected abstracts contained at least one DDA, we applied a DDA-adapted version of Lee et al's [28] dependency tree-based relation rules and removed any abstract not matched by any rule.
5. We randomly selected 521 abstracts from the remaining abstracts for annotation.
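The year filter (step 1) and the sentence-density filter (step 3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the `Abstract` container, its field names, and both function names are hypothetical, and sentence splitting and DNorm annotation are assumed to have been done beforehand.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Abstract:
    """Hypothetical container for one disease-annotated abstract."""
    pmid: str
    year: int
    # One set of disease MeSH IDs per sentence, in sentence order.
    sentence_mesh_ids: List[Set[str]]


def count_multi_disease_sentences(abstract: Abstract) -> int:
    """Count sentences that mention at least 2 different disease MeSH IDs."""
    return sum(1 for ids in abstract.sentence_mesh_ids if len(ids) >= 2)


def keep_abstract(abstract: Abstract, first_year: int = 2013,
                  last_year: int = 2017, min_sentences: int = 3) -> bool:
    """Apply the publication-year filter (step 1) and the density filter (step 3)."""
    if not first_year <= abstract.year <= last_year:
        return False
    return count_multi_disease_sentences(abstract) >= min_sentences
```

The rule-based filter of step 4 is not covered here because the adapted dependency rules are not listed in the text.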
For the manual annotation step, we employed 2 biomedical specialists. Annotator 1 is a PhD candidate in a bioinformatics program, whereas Annotator 2 is a full-time research assistant in a hospital. Both have at least 6 years of biomedical experience. After agreeing on initial annotation guidelines (refer to Multimedia Appendix 1-Annotation Guideline), they used the brat rapid annotation tool [43] to annotate 10 abstracts and then compared results. In the first round of independent annotation, the Cohen kappa value was 34%. Once both annotators agreed that annotation consistency was satisfactory, they each annotated all remaining abstracts. Thus, each abstract was annotated independently twice. Inconsistent annotations were resolved afterward through discussion. The final Cohen kappa value was 76%.
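For readers unfamiliar with the agreement statistic, the following sketch computes the Cohen kappa for two annotators. It uses the standard definition (observed agreement corrected for chance agreement). The function name and the choice of one label per disease pair as the unit of agreement are assumptions, as the text does not state how items were counted.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)
```

For example, with annotator labels `["positive", "positive", "null", "null"]` and `["positive", "null", "null", "null"]`, observed agreement is 0.75 and chance agreement is 0.5, so the kappa is 0.5.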

The Neural Network Architecture
We formulated relation extraction as a classification problem in which, given a sentence containing a mention pair, the goal was to classify the pair into one of the relation types. For classification, we propose an LC-CNN architecture, as illustrated in Figure 3. The network is fed input in 2 forms: sentence representation and context representation (CR). Sentence representation is an $n_{emb} \times T$ matrix representing the sentence, where $n_{emb}$ and $T$ are the length of the composite embedding vector and the length of the sentence, respectively. The sentence representation uses only word embedding, part of speech (POS) encoding, and named entity (NE) distance information, and its parameters are learned through the subsequent CNN and max-pool layers, which output an $m$-dimensional sentence-level feature vector. The CR is a feature-rich $n$-dimensional vector containing both syntactic and document-level features, such as whether the disease pair also appears in the title. Next, the $m$-dimensional vector and the $n$-dimensional vector are concatenated to form the final feature vector with $(m+n)$ dimensions. To compute the confidence of each relation type, the feature vector is fed into a fully connected layer, where we use a linear activation function with categorical hinge loss [44]. The output layer is a three-dimensional vector, with each dimension value representing the confidence of a predefined relation type.

Composite Embedding Vector
In a sentence, each word is represented as a composite embedding vector, as shown in Figure 3 (or in Multimedia Appendix 2). A composite embedding vector consists of 3 parts: word embedding, POS one-hot encoding, and the distance between the word and the disease pair. A sentence is represented by a matrix that contains the composite embedding vectors of its words, arranged in word order. The sentence matrix has a size of $n_{emb} \times T$, where $n_{emb}$ is the dimension of the composite embedding vector and $T$ represents the maximum length of a sentence in the dataset.

Word Embedding
The embedding of a word is a mapping of the word to a vector of real values. Generally, the word embeddings of semantically similar words are closer together in the vector space. Word embedding learned by neural networks has been demonstrated to be able to capture linguistic regularities and patterns in language models [40]. Therefore, it is commonly used as a feature in popular NN approaches, such as CNN [20,39] and long short-term memory (LSTM) [19]. In general, word embeddings are learned from large corpora such as Wikipedia or PubMed. For example, Pyysalo et al [45] applied word2vec to learn word embeddings from different texts, including Wikipedia, PubMed abstracts, and PubMed Central full-text papers, and developed a word-embedding lookup dictionary. Here, we employed their dictionary to generate word embeddings.

Part of Speech
The embedding of a word is a single vector and, therefore, cannot fully represent the multiple syntactic/semantic roles of a word like good, which can be either an adjective or a noun. The POS feature is designed to provide syntactic information (part of speech) to help the model separate the different semantic senses of a word. We used Zhao et al's [20] approach, in which similar POSs are assigned to the same group. We divided POSs into 11 groups: adjectives, adverbs, articles, conjunctions, foreign words, interjections, nouns, prepositions, pronouns, punctuation, and verbs. If a word belongs to a POS group, the corresponding bit value will be 1; otherwise, it will be 0.
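The grouped one-hot POS encoding can be sketched as below. The 11 group names come from the text; the mapping from Penn Treebank tags to groups shown here is a partial, hypothetical example, because the exact tag-to-group assignment is not given.

```python
from typing import Dict, List

# The 11 POS groups named in the text, in a fixed (arbitrary) order.
POS_GROUPS: List[str] = [
    "adjective", "adverb", "article", "conjunction", "foreign word",
    "interjection", "noun", "preposition", "pronoun", "punctuation", "verb",
]

# Hypothetical, partial mapping from Penn Treebank tags to groups.
TAG_TO_GROUP: Dict[str, str] = {
    "JJ": "adjective", "JJR": "adjective", "RB": "adverb", "DT": "article",
    "CC": "conjunction", "FW": "foreign word", "UH": "interjection",
    "NN": "noun", "NNS": "noun", "NNP": "noun", "IN": "preposition",
    "PRP": "pronoun", ".": "punctuation", ",": "punctuation",
    "VB": "verb", "VBD": "verb", "VBZ": "verb",
}


def pos_one_hot(tag: str) -> List[int]:
    """Return an 11-bit vector with a 1 at the position of the tag's group."""
    bits = [0] * len(POS_GROUPS)
    group = TAG_TO_GROUP.get(tag)
    if group is not None:  # unmapped tags yield an all-zero vector
        bits[POS_GROUPS.index(group)] = 1
    return bits
```

Grouping similar tags keeps the POS part of the composite vector short (11 bits) compared with one bit per fine-grained tag.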

Named Entity Distance
Zeng et al [46] proposed the use of NE distance (position features) to improve a CNN by keeping track of how close words are to the target nouns. We adopted their NE distance in this study. The NE distance feature is a two-dimensional vector $(d_1, d_2)$, where $d_1$ and $d_2$ represent the distance (number of words) between the current word and the first and second diseases of the pair, respectively.
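A minimal sketch of the NE distance feature and of assembling the composite embedding vector follows. Two points are assumptions, not statements from the text: each disease is represented by a single token index, and the distance is a signed token offset (negative when the word precedes the disease). The function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def ne_distances(num_tokens: int, first_disease: int,
                 second_disease: int) -> List[Tuple[int, int]]:
    """Return (d1, d2) for every token position in a sentence."""
    for index in (first_disease, second_disease):
        if not 0 <= index < num_tokens:
            raise ValueError("disease index outside the sentence")
    return [(i - first_disease, i - second_disease) for i in range(num_tokens)]


def composite_vector(word_embedding: Sequence[float], pos_bits: Sequence[int],
                     distance: Tuple[int, int]) -> List[float]:
    """Concatenate word embedding, POS one-hot bits, and NE distance."""
    return [*word_embedding, *map(float, pos_bits), *map(float, distance)]
```

With a word embedding of dimension $d$, the composite vector has $d + 11 + 2$ entries, which is the value of $n_{emb}$ under these assumptions.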

Context Representation Layer
Contextual information, such as pair and document information, is very useful for classification and has been widely used in previous research.The purpose of using contextual representation is to introduce traditional contextual features into a neural network architecture through simple representation.
We can then apply the fully connected layer to the context vector to obtain a condensed vector that combines 2 different representations.
Here are the features used in our contextual representation (refer to Multimedia Appendix 3 for more details).

Bag of Words
Word embedding has been shown to represent abstract information about words. However, word embedding can sometimes obscure the original meaning of a word. For example, not usually appears in negative relation statements. However, in the word2vec model trained on news, the 3 words most similar to not are do, did, and anymore. This violates our intuition that don't, doesn't, and isn't are more similar to not in a relation statement. As the embedding vectors of certain words may differ between the news and biomedicine domains, we use bag-of-words (BOW) features for the context vector. Our BOW features include unigrams, bigrams, and surrounding diseases.
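The unigram and bigram parts of the BOW features can be sketched as follows. The feature-name prefixes, the lowercasing, and the function name are illustrative assumptions; the surrounding-disease feature is omitted because its definition is given only in Multimedia Appendix 3.

```python
from collections import Counter
from typing import List


def bow_features(tokens: List[str]) -> Counter:
    """Count unigram and bigram features of a tokenized sentence."""
    lowered = [token.lower() for token in tokens]
    features = Counter(f"uni={token}" for token in lowered)
    # Adjacent token pairs form the bigram features.
    features.update(f"bi={a}_{b}" for a, b in zip(lowered, lowered[1:]))
    return features
```

Unlike a pretrained embedding, a feature such as `uni=not` keeps the negation cue as an explicit, separately weighted input to the classifier.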

Part of Speech
POS tags are commonly used for relation extraction. We used one-hot encoding to represent each word's POS tag type.

Named Entity Information
The number of diseases is useful when classifying relations.We used 3 different features to capture information, including the following:

Output Layer of Combined Sentence and Context Vector
We used $m_{concat} = [sr\ cr]$ to represent the concatenation of sentence representation $sr$ and context representation $cr$. The size of the vector $m_{concat}$ is $n_{concat} = n_{sr} + n_{cr}$. We then applied a fully connected layer to $m_{concat}$ to obtain a three-dimensional vector $out$, each value of which refers to the confidence of a predefined category:

$$out = W_{out}\, m_{concat} + Bias_{out}$$

Here, $W_{out}$ is a matrix with a size of $n_{out} \times n_{concat}$, $Bias_{out}$ is a bias vector with a size of $n_{out} \times 1$, and $n_{out}$ is the number of predefined categories. Therefore, the size of $out$ is $n_{out} \times 1$. $out$ is the final output of the prediction, and each dimension value of $out$ refers to the score of its predefined category. As $out$ is calculated by a linear activation function, its values are unbounded real numbers, that is, $out \in \mathbb{R}^3$.
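The output layer described above amounts to one concatenation followed by one affine map. The sketch below uses plain Python lists so that the shapes are explicit; the function name is hypothetical, and the authors' actual implementation uses TensorFlow with Keras.

```python
from typing import List, Sequence


def output_scores(sr: Sequence[float], cr: Sequence[float],
                  w_out: Sequence[Sequence[float]],
                  bias_out: Sequence[float]) -> List[float]:
    """Compute out = W_out * [sr cr] + Bias_out with a linear activation."""
    m_concat = [*sr, *cr]  # size n_concat = n_sr + n_cr
    if len(bias_out) != len(w_out):
        raise ValueError("W_out and Bias_out must both have n_out rows")
    if any(len(row) != len(m_concat) for row in w_out):
        raise ValueError("each row of W_out must have n_concat entries")
    # One unbounded real-valued score per predefined relation type.
    return [sum(w * x for w, x in zip(row, m_concat)) + b
            for row, b in zip(w_out, bias_out)]
```

No softmax is applied, which is what allows the scores to be compared directly against the hinge-loss margin.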

Backward Propagation With Large Margin Loss
We used the following parameters:
1. $k$ weight matrices $convW_f$, each with a size of $n_e \times f$. Here, $n_e$ is the size of the input embedding vector of a word, and $f$ is the window size of the filter.
2. $k$ biases $convB_f$, each with a size of $n_e \times 1$.
3. Weight matrix $W_{sr}$ with a size of $n_{sr} \times n_{pool}$. Here, $n_{sr}$ is the output dimension of the sentence vector and is a hyperparameter.
4. Bias $Bias_{sr}$ with a size of $n_{sr} \times 1$.
5. Weight matrix $W_{out}$ with a size of $n_{out} \times n_{concat}$. Here, $n_{out}$ is the number of relation types.
6. Bias $Bias_{out}$ with a size of $n_{out} \times 1$.
In forward propagation, given those parameters, we calculated $out$ with the methods mentioned in sections The Neural Network Architecture to Context Representation Layer. In backward propagation, gradient descent is used to learn these parameters by minimizing the hinge loss of $out$. Given a sentence and its disease-disease pair, we defined a vector $y$ as the pair's relation label vector. $y$ is a three-dimensional vector, and each dimension of $y$ corresponds to one relation type. According to the definition of hinge loss [44], each value is either -1 or 1: 1 means that the pair belongs to the relation type, whereas -1 means it does not. Therefore, exactly one value of the vector must be 1, and the others must be -1. For instance, the 3 vectors <1, -1, -1>, <-1, 1, -1>, and <-1, -1, 1> represent the labels Positive, Negative, and Null, respectively. We used the hinge loss function to evaluate the loss between the prediction $out$ and its truth label $y$; a larger loss indicates a larger gap between $out$ and $y$. The hinge loss function is defined as follows:

$$\mathrm{loss}(out, y) = \frac{1}{n_{out}} \sum_{i=1}^{n_{out}} \max(1 - y_i \cdot out_i,\ 0)$$

Here, $y_i$ is the $i$-th dimension value of $y$. $out$ is calculated by forward propagation (sections The Neural Network Architecture to Context Representation Layer), and each dimension value of $out$ refers to the prediction score of one predefined relation type. $out_i$ is the $i$-th dimension value of $out$, with $out_i \in \mathbb{R}$. If $out_i$ is positive, the pair is more likely to be of the $i$-th relation type; if $out_i$ is negative, the pair is less likely to be of the $i$-th relation type.
In the equation, 1 is the value of the decision boundary. Ideally, $y_i \cdot out_i$ will be larger than the decision boundary value. If $y_i$ and $out_i$ have the same sign, then $y_i \cdot out_i$ is positive. In that case, if $y_i \cdot out_i$ is at least the decision boundary value 1, the $i$-th term of the loss is 0; if $y_i \cdot out_i$ is smaller than 1, the $i$-th term of the loss is $1 - y_i \cdot out_i$, which is the cost. If $y_i$ and $out_i$ have different signs, then $y_i \cdot out_i$ is negative, and the $i$-th term of the loss is therefore greater than 1.
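The three cases discussed above can be checked numerically with a direct transcription of the loss formula as written in the text. The function name is hypothetical, and this sketch is not the library routine used in the authors' Keras implementation.

```python
from typing import Sequence


def hinge_loss(out: Sequence[float], y: Sequence[int]) -> float:
    """Mean over relation types of max(1 - y_i * out_i, 0)."""
    if len(out) != len(y) or not out:
        raise ValueError("out and y must be non-empty and equally long")
    return sum(max(1.0 - y_i * out_i, 0.0)
               for y_i, out_i in zip(y, out)) / len(out)


positive_label = [1, -1, -1]  # label vector of a Positive pair

# Every score is on the correct side with a margin of at least 1.
confident = hinge_loss([2.0, -1.5, -3.0], positive_label)
# The first score has the correct sign but falls inside the margin.
inside_margin = hinge_loss([0.5, -2.0, -2.0], positive_label)
# The first two scores have the wrong sign.
wrong_sign = hinge_loss([-1.0, 2.0, -2.0], positive_label)
```

Here `confident` is 0, `inside_margin` is 0.5/3, and `wrong_sign` is (2 + 3 + 0)/3, which illustrates that wrong-sign scores are penalized more heavily than correct scores that merely fall short of the margin.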
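Given the training set $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$, where $x^{(i)}$ is the $i$-th instance in the training set, $y^{(i)}$ is its label vector, and $N$ is the number of training instances, weight learning consists of the following optimization:

$$\underset{convW_f,\ convB_f,\ W_{sr},\ Bias_{sr},\ W_{out},\ Bias_{out}}{\arg\min}\ \sum_{i=1}^{N} \mathrm{loss}(out^{(i)}, y^{(i)})$$

Here, $out^{(i)}$ is the output computed for $x^{(i)}$ by forward propagation. Finally, mini-batch stochastic gradient descent [47] is applied to update the learned parameters in each iteration.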
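For reference, the subgradients that gradient descent needs for the output layer follow from the hinge loss and the linear output layer defined earlier. This derivation is a standard consequence of those definitions rather than a formula taken from the original text; at the kink $y_i \cdot out_i = 1$, where the loss is not differentiable, the subgradient 0 is chosen by convention.

```latex
\frac{\partial\, \mathrm{loss}}{\partial\, out_i} =
\begin{cases}
-\dfrac{y_i}{n_{out}} & \text{if } y_i \cdot out_i < 1, \\[6pt]
0 & \text{otherwise,}
\end{cases}
\qquad
\frac{\partial\, \mathrm{loss}}{\partial\, W_{out}} =
\frac{\partial\, \mathrm{loss}}{\partial\, out}\; m_{concat}^{\top},
\qquad
\frac{\partial\, \mathrm{loss}}{\partial\, Bias_{out}} =
\frac{\partial\, \mathrm{loss}}{\partial\, out}.
```

Only relation types whose score violates the margin contribute to the update, which is the large margin behavior shared with SVM.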

Dataset
Currently, there are no available annotated datasets for training DDA extraction systems. To create one, we used our DDAE dataset development process, described in section Dataset Construction. The DDAE dataset consists of 521 annotated abstracts. After annotation, we used the Cohen kappa coefficient to evaluate annotation consistency. The final kappa value is 76%, suggesting a high level of agreement.
For the experiments in this study, we divided our DDAE dataset into a training set of 400 abstracts and a test set of 121 abstracts. Before testing, we tuned the hyperparameters on a tuning set consisting of one-third of the abstracts, randomly chosen from the training set. Finally, our classifiers were trained on the whole training set and evaluated on the test set. A summary of the final DDAE dataset is shown in Table 1.
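The split sizes can be reproduced with a sketch like the one below. Whether the authors assigned abstracts to the training and test sets at random is not stated, so the shuffle, the seed, and the function name are assumptions; only the sizes (400 training, 121 test, one-third of the training set for tuning) come from the text.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(pmids: Sequence[str], train_size: int = 400,
                  tuning_fraction: float = 1 / 3,
                  seed: int = 0) -> Tuple[List[str], List[str], List[str]]:
    """Return (training, tuning, test) lists of abstract identifiers."""
    rng = random.Random(seed)  # fixed seed for a reproducible split
    shuffled = list(pmids)
    rng.shuffle(shuffled)
    training, test = shuffled[:train_size], shuffled[train_size:]
    # The tuning set is a random subset of the training set.
    tuning = rng.sample(training, round(len(training) * tuning_fraction))
    return training, tuning, test
```

Because the tuning set is drawn from the training set, the final models are still trained on all 400 training abstracts.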

Experiment Setup
We conducted 3 experiments to evaluate our LC-CNN. The first experiment was designed to measure the effects of different NN architectures and ML models. In the second experiment, we evaluated the effects of different approaches combining context features with NN methods. In the third experiment, we evaluated the effects of different word embeddings. The hyperparameters are listed in Multimedia Appendix 4. The performances of experiments on the tuning set can be found in Multimedia Appendix 5.
Our system is implemented on TensorFlow with Keras and runs on an Nvidia GTX 1080 Ti GPU. The process used in our experiments to generate the word-embedding model can be found in Multimedia Appendix 6.

Evaluation Metric
We used the F1 measure to evaluate system performance. The precision and recall are defined as given in Figure 4.
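As Figure 4 is not reproduced here, the sketch below assumes the standard definitions: precision is TP/(TP + FP), recall is TP/(TP + FN), and F1 is their harmonic mean, where TP, FP, and FN are the counts of true positive, false positive, and false negative predictions. How predictions are counted across relation types is not specified in the text, and the function name is hypothetical.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 from prediction counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    return precision, recall, 2 * precision * recall / (precision + recall)
```

For example, 8 true positives with 2 false positives and 8 false negatives give a precision of 0.8, a recall of 0.5, and an F1 of 8/13.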

Experiment 1-Performance Comparison With Other Models
The performance comparison between LC-CNN and different methods is listed in Table 2, which shows the performances on the tuning and test sets. The NN models (models 1 to 3) use only sentence representation. The CR$_{\text{cross-entropy}}$ and SVM methods use only CR. CR$_{\text{cross-entropy}}$ is implemented using a single hidden fully connected layer with the context vector as its input layer, and its architecture can be found in Multimedia Appendix 7. We also compared LC-CNN with LSTM and bidirectional LSTM (BiLSTM) models, which have been used in many relation extraction tasks, such as those in the studies by Hsieh et al and Zhao et al [19,48]. In our experiment, we were surprised to find that LSTM achieved the lowest F1 measure (65.02%) on the test set among all tested models. Furthermore, we evaluated the performance of SCNN, Bidirectional Encoder Representations from Transformers (BERT) [49], and BioBERT [50]. To compare the architecture of SCNN with that of LC-CNN, we gave LC-CNN and SCNN the same sentence representation, CR, and hinge loss function. The architecture of SCNN is illustrated in Multimedia Appendix 8.
As shown in Table 2, NN models trained on the entire training set (models 1 to 3) performed worse on the test set than on the tuning set. One potential reason is that the selected hyperparameters and parameters generalize poorly to unseen data because the NN models overfit the tuning set. This problem is especially obvious in the LSTM and BiLSTM models. In contrast, the CR$_{\text{cross-entropy}}$, SVM, and LC-CNN models trained on the entire training set with context information performed better on the test set than on the tuning set.
Furthermore, as shown in Table 2, CNN and CR$_{\text{cross-entropy}}$ performed similarly on the tuning set. The F1 measures of CNN and CR$_{\text{cross-entropy}}$ were 75.35% and 75.76%, respectively. CNN's recall was 2.84% higher than that of CR$_{\text{cross-entropy}}$, whereas the precision of CR$_{\text{cross-entropy}}$ was 3.95% higher than that of CNN. This may be because the document feature provides CR$_{\text{cross-entropy}}$ with information on the entire document, thus causing the model to generate fewer false positive (FP) cases. As CNN does not directly encode document information, it predicts more FPs. However, as CNN does not use any particular feature to separate positive, negative, and null relation pairs, it may be able to extract potential positive and negative pairs missed by CR$_{\text{cross-entropy}}$, resulting in higher recall. In addition, SVM and CR$_{\text{cross-entropy}}$ use the same input features, but SVM mainly uses a large margin for learning. The result shows that the SVM implemented with LibSVM [24] outperforms CR$_{\text{cross-entropy}}$ by an F1 measure of 2.83%. Moreover, LC-CNN is able to combine the advantages of CNN and SVM to achieve the highest precision, recall, and F1 measure among the tested models, and it outperforms SCNN, BERT, and BioBERT by F1 measures of 3.25%, 2.06%, and 1.91%, respectively.

Experiment 2-Effect of Different Uses of Context Information
To demonstrate the advantage of integrating CNN and context information in a single LC-CNN architecture, we evaluated different ways of combining them. The performances of these combinations are shown in Table 3. There are 3 baseline models that use either CNN or context information alone. Baselines 1 to 3 are CR$_{\text{cross-entropy}}$, SVM, and CNN, as used in Experiment 1. Of these, only CR$_{\text{cross-entropy}}$ and SVM use contextual information. SVM + CNN is an intuitive method in which the output vector of CNN is considered an additional feature vector of SVM, and its architecture is illustrated in Multimedia Appendix 9. As shown in Table 3, the F1 measure of SVM + CNN is significantly lower than that of SVM by 6.98%. One possible reason is that the CNN used in SVM + CNN is adjusted on the tuning set, which causes the model to overfit the CNN predictions and makes it difficult to learn feature weights well.
We designed the LC-CNN to learn the model in a single stage. LC-CNN achieves an F1 measure of 84.18% on the test set, which is the highest score among all methods and outperforms SCNN. The results showed that LC-CNN can learn CNN and context information well in a single stage.

Experiment 3-Effect of Composite Embedding Vectors on Large Margin Context-Aware Convolutional Neural Networks
In our third experiment, we evaluated the effect of different composite embedding vectors on LC-CNN (the effect of different features on LC-CNN can be found in Multimedia Appendix 10). The performance on the test set is shown in Table 4. We compared 3 different word embeddings. The word embeddings of LC-CNN$_{\text{PubMed}}$ are from Pyysalo et al [45], who learned them from Wikipedia, PubMed abstracts, and PubMed Central full texts. The word embeddings of LC-CNN$_{\text{News}}$ are learned from Google News using word2vec. In contrast, LC-CNN$_{\text{no pretrain}}$ does not use any pretrained word embeddings. Its word embeddings are treated as parameters and are learned by training LC-CNN$_{\text{no pretrain}}$ on the training set. Moreover, we evaluated the effect of 3 different embedding features (word embedding, POS, and NE distance) by removing them individually from LC-CNN$_{\text{PubMed}}$.
As shown in Table 4, the model with PubMed word embeddings (LC-CNN$_{\text{PubMed}}$) outperformed LC-CNN$_{\text{News}}$ and LC-CNN$_{\text{no pretrain}}$. In addition, our removal tests indicated that both POS and NE distance have a strong impact on performance.

Large Margin Context-Aware Convolutional Neural Network Error Cases Distribution
We randomly sampled approximately 60 error cases of the LC-CNN's predictions, and their distribution is shown in Table 5. FP and FN denote the false positive and false negative cases, respectively. As shown in Table 5, symptom/subclass is a common error category in the FPs, accounting for 28% of the sampled error cases. The symptom/subclass category indicates that one disease is either a subclass or a symptom of the other disease in the FP/FN disease pair. For example, an FP case is as follows: "Other large-artery aneurysms, including carotid, subclavian, and iliac artery aneurysms [DISEASE1], have also been associated with Marfan syndrome [DISEASE2]." (PMID: 23891252) [51].
Here, carotid, subclavian, and iliac artery aneurysms are 3 symptoms of Marfan syndrome. Symptoms are not included in our DDA definition. Therefore, iliac artery aneurysms [DISEASE1] does not have a relation with Marfan syndrome [DISEASE2]. However, in this case, the keyword phrase "been associated with" makes LC-CNN predict a positive relation, resulting in an FP case.
In contrast with the FP cases, the FN cases are relatively sparse, and most of them cannot be categorized. For example, "CONCLUSION: Cataract [DISEASE1], uncorrected refractive error, and fundus diseases are ranked in the top 3 causes of moderate to severe visual impairment [DISEASE2] and blindness in adults aged 50 years or more in rural Shandong Province." (PMID: 23714032) [52].
In this sentence, cataract is one cause of visual impairment; however, the description also lists 2 other diseases that cause visual impairment. Another example is as follows: "it can be associated with any type of vision loss [DISEASE1] including that related to macular degeneration [DISEASE2], corneal disease [DISEASE3], diabetic retinopathy [DISEASE4], and occipital infarct [DISEASE5]." (PMID: 24339694) [53].
Here, the LC-CNN correctly identifies the relation between DISEASE1 and DISEASE2. However, it fails to identify the relations between DISEASE1 and the other diseases (DISEASE3, DISEASE4, and DISEASE5).

The Result of Using Automatic Annotated Disease Mentions
In our experiments, we used manually annotated disease mentions, which may not reflect the actual performance of a fully automated DDAE process. Hence, we conducted an additional experiment in which we used TaggerOne [9], a state-of-the-art disease mention recognizer/normalizer, to annotate the disease mentions of the test set. We then used the LC-CNN to extract DDAs from the TaggerOne-annotated test set. As the boundaries of some predicted mentions may be inconsistent with the gold mentions, we used approximate matching to tolerate such differences. In the fully automatic process, the LC-CNN achieved a precision, recall, and F1 measure of 75.28%, 55.03%, and 63.57%, respectively. The recall is considerably lower because TaggerOne failed to recognize some diseases. Although the performance remains reasonable, the F1 measure is 20.61 percentage points lower than that of the semiautomatic process (84.18%, using gold disease mentions).
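The text does not specify the approximate matching criterion. One common choice, shown here purely as an assumption, is to treat a predicted mention as matching a gold mention when their character spans overlap. The span representation and both function names are hypothetical.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def spans_overlap(a: Span, b: Span) -> bool:
    """Return True if the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def align_mentions(predicted: List[Span], gold: List[Span]) -> Dict[int, int]:
    """Map each predicted mention index to the first overlapping gold index."""
    alignment: Dict[int, int] = {}
    for p_index, p_span in enumerate(predicted):
        for g_index, g_span in enumerate(gold):
            if spans_overlap(p_span, g_span):
                alignment[p_index] = g_index
                break
    return alignment
```

Predicted mentions without any overlapping gold mention stay unaligned, and gold mentions that are never matched correspond to the unrecognized diseases that lower recall.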

Principal Findings
Our objective was to develop a DDAE dataset and a neural network-based approach to extract DDAs. In our experiments, the LC-CNN trained on our dataset achieved an F1 measure of 84.18%. We also compared LC-CNN with common models, including CNN, BiLSTM, and SVM. The results showed that the LSTM and BiLSTM models achieved relatively lower F1 measures of 65.02% and 65.40%, respectively. This may be because the hyperparameters and parameters tend to overfit the tuning set. The CNN and SVM models achieved relatively higher F1 measures of 73.32% and 80.32%, respectively, but LC-CNN still outperformed all tested methods. In addition, the results showed that the 2-stage SVM + CNN model scored significantly lower in terms of F1 than SVM and LC-CNN, by 6.98% and 10.84%, respectively. This suggests that simple methods may achieve better results than complex ones. Furthermore, in our experiments, the model with PubMed word embeddings (LC-CNN$_{\text{PubMed}}$) outperformed the LC-CNN$_{\text{News}}$ and LC-CNN$_{\text{no pretrain}}$ models, indicating that PubMed word embeddings may be more compatible with our DDAE dataset.

Conclusions
In this paper, we proposed a text-mining approach for automatically extracting DDAs from abstracts. We collected disease-related abstracts from PubMed and annotated the first publicly available DDAE dataset consisting of 521 abstracts and 3322 disease-disease pairs. Moreover, to extract DDAs, we used several different ML models, including BiLSTM, CNN, and SVM. We also evaluated the effect of combining CNN and context features. Finally, we implemented a novel neural network called LC-CNN to integrate context features and CNN through the large margin function. Our experiment results showed that LC-CNN achieved an F1 measure of 84.18%, the highest among the tested models.

Table 1. Summary of the disease-disease association extraction dataset.

Table 4. The effect of different composite embedding vectors on large margin context-aware convolutional neural network performance. POS: part of speech.

Table 5. The distribution of sampled large margin context-aware convolutional neural network error cases.