Human Disease Symptom Network: Understanding Disease Relationships Through Symptoms and Genes
The Human Symptoms-Disease Network (HSDN) is constructed from a large-scale database of medical bibliographic records to form a network of human diseases based on symptom similarity. By integrating disease-gene associations and protein-protein interaction data, the correlation between symptom similarity and shared genes is investigated. Networks of this kind aid in inferring comorbidity links between disorders and in understanding disease progression patterns. Constructing the HSDN relies on two basic datasets, Medical Subject Headings (MeSH) and the PubMed literature database, from which disease relations are extracted. Through this network, the entangled relationships between diseases, disease manifestations, and molecular mechanisms are explored.
Human Disease Symptom Network Zhou et al., 2014 VAISHNAVI GOWRISANKAR 665591371 04/07/2015
Outline: Introduction; Human Symptoms-Disease Network (HSDN); Construction of HSDN; Results (performance evaluation of HSDN; integrating gene-disease associations; integrating shared protein interactions; diversity of disease manifestations and molecular mechanisms; disease groups); Discussion; Limitations; Future Directions
Introduction Networks are used to study the entangled relationships between diseases. Network construction of this kind has been widely used to infer comorbidity links between disorders and the disease history of patients [1]. Disease phenotypic networks built from comorbidity patterns have been used to understand disease progression patterns [2].
Introduction The symptoms and signs that a patient presents have been overlooked. Symptoms are the most directly observable characteristics of a disease and the very basis of clinical disease classification. A connection between the shared symptoms and shared genes of 2 diseases could bridge the gap between biological discovery and bedside clinical observation.
Introduction In this article, large-scale medical bibliographic records (PubMed, including MEDLINE) and the related Medical Subject Headings (MeSH) metadata were used to generate a symptom-based network of human diseases, the HSDN. The link weight between 2 diseases quantifies the similarity of their respective symptoms. By integrating disease-gene association and protein-protein interaction data, the correlation between the symptom similarity of diseases and their degree of shared genes was investigated.
Construction of HSDN Basic Datasets: Construction of a symptom-based disease network requires a basic taxonomy for diseases and symptoms (MeSH) and a corpus of data from which to extract their relations (PubMed). The MeSH vocabulary and the PubMed literature database were chosen over several possible alternatives such as ICD9/10, HPO and OMIM. MeSH is used directly to index all articles in the massive PubMed database.
Construction of HSDN MeSH is designed as a hierarchical structure with general categories (e.g. Animals, Diseases, Phenomena and Processes). The Diseases category contains the sub-category Symptoms and Signs, whose terms relate to clinical manifestations observed by physicians and perceived by patients. All terms in the Diseases category except animal diseases were included.
Construction of HSDN Finally, 4,442 distinct MeSH disease terms and 322 distinct MeSH symptom terms were used in a PubMed query, which returned 7,109,429 PubMed records. These records were filtered for the co-occurrence of at least one disease term and one symptom term, leaving 849,103 records.
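The filtering step can be illustrated with a minimal sketch. The record layout (a dict with a `pmid` and a list of `mesh` terms), the function names and the toy term sets are assumptions made for illustration; they are not the authors' data format or code.

```python
# Toy vocabularies standing in for the MeSH disease and symptom terms.
DISEASE_TERMS = {"Diabetes Mellitus", "Asthma"}
SYMPTOM_TERMS = {"Fatigue", "Cough", "Dyspnea"}


def has_disease_and_symptom(mesh_terms, disease_terms, symptom_terms):
    """Return True if the record carries at least one disease and one symptom term."""
    terms = set(mesh_terms)
    return bool(terms & disease_terms) and bool(terms & symptom_terms)


def filter_records(records, disease_terms, symptom_terms):
    """Keep only records in which a disease term and a symptom term co-occur."""
    return [
        r for r in records
        if has_disease_and_symptom(r["mesh"], disease_terms, symptom_terms)
    ]


# Hypothetical records: only the first has both a disease and a symptom term.
records = [
    {"pmid": "1", "mesh": ["Asthma", "Cough", "Humans"]},
    {"pmid": "2", "mesh": ["Asthma", "Adult"]},
    {"pmid": "3", "mesh": ["Fatigue", "Humans"]},
]
kept = filter_records(records, DISEASE_TERMS, SYMPTOM_TERMS)
print([r["pmid"] for r in kept])
```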
Construction of HSDN Illustration of the protocol
Construction of HSDN Extracting the disease-symptom relationships from the PubMed bibliographic literature database. The associations between symptoms and diseases are based on their co-occurrence in the MeSH metadata fields of PubMed records.
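As a rough sketch of the co-occurrence extraction, the snippet below counts, for every (symptom, disease) pair, the number of records whose MeSH terms contain both. Representing a record as a plain list of MeSH terms, and the function name `cooccurrence_counts`, are assumptions for illustration only.

```python
from collections import Counter
from itertools import product


def cooccurrence_counts(records, disease_terms, symptom_terms):
    """Count records in which a symptom term and a disease term appear together."""
    counts = Counter()
    for mesh_terms in records:
        terms = set(mesh_terms)
        diseases = terms & disease_terms
        symptoms = terms & symptom_terms
        # Every symptom in the record is paired with every disease in the record.
        for symptom, disease in product(symptoms, diseases):
            counts[(symptom, disease)] += 1
    return counts


# Toy vocabularies and records (hypothetical).
disease_terms = {"Asthma", "Influenza, Human"}
symptom_terms = {"Cough", "Dyspnea", "Fever"}
records = [
    ["Asthma", "Cough", "Dyspnea"],
    ["Asthma", "Cough"],
    ["Influenza, Human", "Cough", "Fever"],
    ["Asthma", "Adult"],  # no symptom term, so it contributes nothing
]
counts = cooccurrence_counts(records, disease_terms, symptom_terms)
print(counts[("Cough", "Asthma")])
```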
Construction of HSDN Symptom-based disease similarity: To quantify the relationship between a symptom and a disease, Tf-Idf is used. Every disease j is described by a vector of symptoms $d_j$, in which $w_{i,j}$ quantifies the strength of the association between symptom i and disease j. To correct for highly abundant symptoms and for publication biases towards certain diseases, the Tf-Idf weight is used for $w_{i,j}$ instead of the absolute co-occurrence.
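Construction of HSDN Term frequency-inverse document frequency (Tf-Idf): $w_{i,j} = W_{i,j} \log(N / n_i)$, where $W_{i,j}$ is the absolute co-occurrence of symptom i and disease j, which measures the raw strength of their association, N is the number of all diseases in the dataset, and $n_i$ is the number of diseases in which symptom i appears.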
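A minimal sketch of this weighting, assuming the usual Tf-Idf form in which the raw co-occurrence count is multiplied by log(N / n_i). The natural logarithm, the dict-based input and the function name `tfidf_weights` are assumptions; the slides do not state the log base.

```python
import math


def tfidf_weights(cooc, n_diseases):
    """Weight raw co-occurrence counts by inverse document frequency.

    cooc maps (symptom, disease) to a raw co-occurrence count W.
    n_diseases is N, the number of all diseases in the dataset.
    Returns a dict mapping (symptom, disease) to W * log(N / n_i),
    where n_i is the number of diseases in which symptom i appears.
    """
    diseases_per_symptom = {}
    for (symptom, disease), count in cooc.items():
        if count > 0:
            diseases_per_symptom.setdefault(symptom, set()).add(disease)
    return {
        (symptom, disease): count
        * math.log(n_diseases / len(diseases_per_symptom[symptom]))
        for (symptom, disease), count in cooc.items()
        if count > 0
    }


# Toy counts: "Cough" occurs with both diseases, "Wheezing" with only one.
cooc = {
    ("Cough", "Asthma"): 4,
    ("Cough", "Influenza"): 2,
    ("Wheezing", "Asthma"): 3,
}
weights = tfidf_weights(cooc, n_diseases=2)
print(weights)
```

In this toy example the ubiquitous symptom receives weight 0 for every disease, while the symptom specific to one disease keeps a positive weight, which is the intended down-weighting of highly abundant symptoms.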
The similarity between 2 diseases x and y is defined as the cosine similarity of their respective disease vectors $d_x$ and $d_y$: $\cos(d_x, d_y) = \frac{\sum_i d_{x,i} d_{y,i}}{\sqrt{\sum_i d_{x,i}^2} \sqrt{\sum_i d_{y,i}^2}}$. The cosine similarity ranges from 0 (no shared symptoms) to 1 (identical symptoms).
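The cosine similarity of two symptom vectors can be sketched as follows. Storing each disease vector as a sparse dict from symptom name to weight is an assumption made for illustration, as are the function name and the toy vectors.

```python
import math


def cosine_similarity(dx, dy):
    """Cosine similarity of two sparse symptom vectors (dict: symptom -> weight)."""
    dot = sum(weight * dy.get(symptom, 0.0) for symptom, weight in dx.items())
    norm_x = math.sqrt(sum(w * w for w in dx.values()))
    norm_y = math.sqrt(sum(w * w for w in dy.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a disease with no weighted symptoms shares nothing
    return dot / (norm_x * norm_y)


# Hypothetical symptom vectors.
asthma = {"Cough": 1.0, "Wheezing": 2.0}
bronchitis = {"Cough": 2.0, "Fever": 1.0}
migraine = {"Headache": 3.0}

print(cosine_similarity(asthma, bronchitis))
print(cosine_similarity(asthma, migraine))
```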
Construction of HSDN A disease network is constructed, in which nodes represent diseases and links represent symptom similarities between diseases
Construction of HSDN Integrating gene-disease association and PPI databases to obtain shared genes/PPIs between diseases
Construction of HSDN Resulting disease network, in which links represent shared genes/PPIs
Construction of HSDN The backbone of the HSDN. Highly clustered regions of the network belong to the same broad disease categories
Results Performance Evaluation of HSDN: Manual evaluation of retrieved co-occurrences. 1,000 records were randomly selected from the 849,103 PubMed records, and disease-symptom relationships were extracted from them with the help of medical experts. The evaluation focused on three issues: (1) the disease-symptom relationship is direct and not influenced by drugs or coincidental co-occurrence; (2) the reported symptom-disease relations are very specific: 57% of the records point to one disease, 28.5% point to 2 diseases and only 14.5% point to more than 2; (3) false positives are minimal, only 0.8% (disease x is not related to symptom y).
Results Performance Evaluation of HSDN: Reliability test for the disease similarity score. A benchmark disease network was constructed from HPO and compared with the HSDN. HPO (Human Phenotype Ontology) is a manually curated database derived from OMIM (an online catalogue of human genes and genetic disorders). It covers phenotypic abnormalities commonly encountered in human monogenic diseases. MeSH disease terms are typically more general, and therefore several OMIM identifiers may map to one MeSH term. The final HPO network used to benchmark the HSDN contained 940 diseases that map to both OMIM and MeSH, with 121,945 links indicating shared symptoms.
Results This benchmark network is much smaller than the HSDN but arguably of higher quality (OMIM disease identifiers are much more specific than the MeSH terms they map to). Higher symptom similarity in the HSDN is related to higher edge overlap with the HPO network. The Pearson correlation coefficient between the ratio of shared disease links and the disease similarity is very high (0.96), indicating that the HSDN yields a reliable disease similarity score.
Results Shared symptoms indicate shared genes between diseases: 3 genotype-phenotype databases were integrated to construct a Human Disease Network (HDN) as described by Goh et al. [3]. In the HDN, 2 diseases are connected if they share a gene. Comparing the HSDN with the HDN, the overlapping link ratio shows a strong positive correlation with symptom similarity. The overlapping link ratio is the fraction of disease pairs with both shared symptoms and shared genes among all disease pairs with shared symptoms. It can be inferred that diseases with more similar symptoms are more likely to have common gene associations.
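The two definitions on this slide (linking diseases that share a gene, and the overlapping link ratio) can be sketched together. The disease and gene names are placeholders, and the function names and data layout are assumptions for illustration rather than the authors' implementation.

```python
from itertools import combinations


def shared_gene_links(disease_genes):
    """Link 2 diseases if their associated gene sets intersect (HDN-style)."""
    links = set()
    for a, b in combinations(sorted(disease_genes), 2):
        if disease_genes[a] & disease_genes[b]:
            links.add(frozenset((a, b)))
    return links


def overlapping_link_ratio(symptom_links, gene_links):
    """Fraction of symptom-linked disease pairs that are also gene-linked."""
    symptom_pairs = {frozenset(pair) for pair in symptom_links}
    gene_pairs = {frozenset(pair) for pair in gene_links}
    if not symptom_pairs:
        return 0.0
    return len(symptom_pairs & gene_pairs) / len(symptom_pairs)


# Placeholder disease-gene associations.
disease_genes = {
    "A": {"G1", "G2"},
    "B": {"G2"},
    "C": {"G3"},
    "D": {"G3", "G4"},
}
gene_links = shared_gene_links(disease_genes)

# Placeholder disease pairs that share symptoms.
symptom_links = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D")]
ratio = overlapping_link_ratio(symptom_links, gene_links)
print(ratio)
```

To study the dependence on symptom similarity, the same ratio would be recomputed on the subset of symptom links falling into each similarity bin.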
Results Shared symptoms indicate shared protein interactions: Symptom similarity relates not only to gene associations but also to close interactions between proteins. 5 publicly available PPI databases were integrated into 1 binary PPI network. Disease networks were constructed in which 2 diseases are linked if they share first- or second-order PPIs. Proteins associated with the same disease, disease category or phenotype tend to interact with each other. Because the HSDN focuses only on symptoms and includes all disease categories, it provides robust evidence that interacting proteins between diseases are also connected to similar higher-level manifestations.
Results High symptom similarity strongly correlates with shared genes as well as with first- and second-order protein interactions, suggesting a general relationship between phenotypic similarity on the one hand and path lengths in the PPI network on the other. To test this, the shortest path length (Dijkstra's algorithm) was calculated for all protein pairs, along with the minimum shortest PPI path length between each disease pair. The higher the symptom similarity, the shorter the PPI network distance between diseases.
Results Dijkstra's algorithm is used to find all shortest paths in the PPI network. To quantify the PPI distance between disease pairs, the single linkage distance $D_{SL}$ is used: $D_{SL}$ is the minimum of all shortest path lengths between the related proteins. For 2 diseases x and y with corresponding related protein sets $P_x$ and $P_y$, the single linkage distance is given by $D_{SL}(x, y) = \min_{p_i \in P_x,\, p_j \in P_y} D(p_i, p_j)$, where $D(p_i, p_j)$ is the shortest path length between the 2 proteins $p_i$ and $p_j$.
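A minimal sketch of the single linkage distance on a toy PPI network. Because the PPI network is binary (unweighted), this sketch uses breadth-first search, which gives the same shortest path lengths as Dijkstra's algorithm when every edge has unit weight; that substitution, the adjacency-dict layout and the protein names are assumptions for illustration.

```python
from collections import deque


def shortest_path_lengths(adjacency, source):
    """Shortest path lengths from source to every reachable node (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def single_linkage_distance(adjacency, proteins_x, proteins_y):
    """Minimum shortest path length between any protein of x and any of y."""
    best = float("inf")  # stays infinite if no pair of proteins is connected
    for p in proteins_x:
        dist = shortest_path_lengths(adjacency, p)
        for q in proteins_y:
            if q in dist:
                best = min(best, dist[q])
    return best


# Toy PPI network: a chain P1 - P2 - P3 - P4 plus an isolated protein P5.
ppi = {
    "P1": {"P2"},
    "P2": {"P1", "P3"},
    "P3": {"P2", "P4"},
    "P4": {"P3"},
    "P5": set(),
}
print(single_linkage_distance(ppi, {"P1"}, {"P4"}))
print(single_linkage_distance(ppi, {"P1", "P3"}, {"P4"}))
```

A distance of 0 corresponds to a shared protein, 1 to a direct (first-order) interaction and 2 to a second-order interaction.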
Results Diversity of disease manifestations and molecular mechanisms: Pleiotropism and genetic heterogeneity cause discrepancies between diverse clinical manifestations and the underlying cellular mechanisms. To understand these complex relations, genome components are mapped to intermediate phenotype components and environmental factors. To analyze the relation between the molecular and phenotypic diversity of diseases, the shared genes, proteins disease network (SGPDN) was constructed: an integrated disease network that combines phenotypic relations based on symptom similarity with shared molecular mechanisms based on protein interactions.
Results The HSDN is filtered for significant links with similarity score > 0.1, keeping only disease links supported by either shared genes or first-/second-order protein interactions. Betweenness and node diversity are used to measure disease diversity in this network. Betweenness is a centrality measure quantifying how many shortest paths run through a node. The diversity of a node i is based on the node bridging coefficient, which is defined in terms of the degree k(i) of node i and its neighborhood N(i).
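Both measures can be sketched on a toy graph. The bridging coefficient formula is not shown on the slide; the sketch assumes the commonly used definition, (1 / k(i)) divided by the sum of 1 / k(j) over the neighbors j in N(i), which may differ in detail from the paper. Betweenness is computed with Brandes' algorithm for undirected, unweighted graphs. All function and node names are illustrative.

```python
from collections import deque


def betweenness(adjacency):
    """Unnormalized betweenness of each node (Brandes, unweighted, undirected)."""
    score = {v: 0.0 for v in adjacency}
    for s in adjacency:
        stack = []
        preds = {v: [] for v in adjacency}
        sigma = {v: 0 for v in adjacency}  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adjacency[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adjacency}
        while stack:  # accumulate dependencies in order of decreasing distance
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair of endpoints was counted twice.
    return {v: c / 2 for v, c in score.items()}


def bridging_coefficient(adjacency, node):
    """Assumed definition: (1 / k(i)) / sum over neighbors j of 1 / k(j)."""
    k = len(adjacency[node])
    if k == 0:
        return 0.0
    return (1 / k) / sum(1 / len(adjacency[j]) for j in adjacency[node])


# Toy network: a path A - B - C.
graph = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
bc = betweenness(graph)
print(bc["B"], bridging_coefficient(graph, "B"))
```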
Results A strong positive correlation was found between the 2 quantities used to measure disease diversity in the SGPDN and the corresponding maximum diversities of the disease-related genes in the PPI network. These results demonstrate that a disease with diverse clinical manifestations will typically also have more diverse underlying cellular network mechanisms.
Results Disease Groups: To study the interrelationships between classes of diseases, the SGPDN was examined by disease category. Diseases within the same category form clear, highly interconnected communities, e.g. metabolic diseases and digestive system diseases. Exceptions include bacterial and viral diseases, which link to all the communities.
Discussion The results indicate strong associations between the symptom similarity of diseases and their shared genes and PPIs. There is a clear correspondence between the diversity of the clinical manifestations of diseases and the underlying diversity in their cellular mechanisms. Individual-level disease phenotypes (symptoms) and molecular-level disease components (genes/PPIs) show robust correlations, even though their direct associations are influenced by complicated intermediate factors.
Discussion The observed correlations between clinical manifestations and molecular mechanisms of disease can be highly valuable for functional annotation in genomics and can reveal regularities between different disease categories. Another promising use of this broad data across disease categories is a comparison between genetic and infectious diseases. Symptoms also play a crucial role in drug-related research, as most FDA-approved drugs are palliative (they treat symptoms rather than targeting disease-specific genes or pathways).
Limitations The MeSH vocabulary is relatively old and rigid, with only annual updates. This could limit the extent to which the identified associations capture the latest research results in the rapidly evolving field of medicine.
Future Directions How can full-text analysis of large-scale databases be improved to increase the accuracy of search? How can the distinction between symptoms and diseases, which is not yet well understood, be sharpened? How can techniques be developed that automatically extract information from clinical records? How can symptom similarity scores be developed that support gene prioritization and target identification for viral/bacterial infections?
References 1) Rzhetsky, A., Wajngurt, D., Park, N. & Zheng, T. Probing genetic overlap among complex human phenotypes. Proc. Natl Acad. Sci. USA 104, 11694-11699 (2007). 2) Hidalgo, C. A., Blumm, N., Barabasi, A. L. & Christakis, N. A. A dynamic network approach for the study of human phenotypes. PLoS Comput. Biol. 5, e1000353 (2009). 3) Goh, K. I. et al. The human disease network. Proc. Natl Acad. Sci. USA 104, 8685-8690 (2007). 4) Supplementary Methods, Zhou et al., Nature Communications, 1-22 (2014). 5) Wikipedia.
Definitions Genotype: The genetic makeup of a cell, an organism, or an individual, usually with reference to a specific characteristic under consideration. Phenotype: The outward appearance of an organism; the expression of the genotype in the form of traits that can be seen or measured.
Definitions HSDN: Human Symptoms Disease Network. MeSH: Medical Subject Headings, a vocabulary defined by experts that offers comprehensive coverage across all disease categories. PPI: Protein-protein interaction.
Definitions Polygenicity: Multiple-gene inheritance influencing a phenotypic trait. Pleiotropism: A single gene affects a number of phenotypic traits in the same organism; the affected traits often seem unrelated to each other. Genetic heterogeneity: A single phenotype or genetic disorder can be caused by any one of a number of alleles or loci. This is in contrast to pleiotropism.