Sequence, SNP and Mutation


An overview of SNP and mutation databases for genetic analysis, covering mapping of locations, SNP effects, data formats, and terminology. It also touches on genetic variation, allele frequencies, and association studies, with practical exercises on using the databases.

  • SNP databases
  • Mutation databases
  • Genetic analysis
  • Data formats
  • Allele frequencies

Uploaded on Feb 25, 2025 | 0 Views



Presentation Transcript


  1. Sequence, SNP and Mutation Databases. Mesut Erzurumluoglu (epmmee@bristol.ac.uk)

  2. Before we start: Very important lecture. Database knowledge is a must for analysis. Hundreds of databases! Pay special attention to: Ensembl Genome Browser, dbSNP, GeneCards. Exercise and Q&A time at the end.

  3. Two lines of data. (1) Mapping of locations: annotation; genomic landmarks, e.g. genes (exon, intron, splice site) and binding sites; known associations, e.g. the HBB gene and sickle cell disease, the rs9939609 SNP and BMI. (2) Recording of variation: summary stats; frequencies and counts; correlations (e.g. SNP-SNP).

  4. Why is it important? Mapping of locations: designing experiments (e.g. primers, candidate genes); location of variants (e.g. genes within region, exon, intergenic); interpretation of results (e.g. LD, consequence). Recording of variation: what's out there from other projects? Standardisation.

  5. Terminology. Single nucleotide polymorphism (SNP). Minor allele frequency (MAF). Difference between genotyping and sequencing. Linkage disequilibrium (LD): will be taught on the next short-course; slide in appendix. Genetic association study: also taught on the next short-course; slide in appendix.

  6. SNP: affects a single nucleotide. Common variants (frequency >1%). Mostly biallelic (e.g. A>C). Minor allele: the second most common allele. Minor allele frequency: the frequency of the minor allele. SNV or SNP?
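The definitions on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `minor_allele_frequency` and the genotype list are made up for the example, and it assumes diploid genotypes written as two-letter strings.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """Return (minor_allele, maf) for diploid genotypes such as 'AC'."""
    # Count every allele across all individuals (two alleles per genotype).
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    if len(counts) < 2:
        # Monomorphic site: no minor allele was observed.
        return None, 0.0
    total = sum(counts.values())
    # Minor allele = second most common allele, as defined on the slide.
    minor, minor_count = counts.most_common(2)[1]
    return minor, minor_count / total


# Hypothetical genotypes for five individuals at one biallelic site.
genotypes = ["AA", "AC", "CC", "AA", "AC"]
minor, maf = minor_allele_frequency(genotypes)
print(minor, maf)
```

Here A is seen 6 times and C 4 times out of 10 alleles, so C is the minor allele. Real tools also have to handle missing calls and ties, which this sketch ignores.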

  7. Data formats. FASTA: raw sequence data; can be nucleotide or protein. SAM/BAM: sequence alignment format. VCF: only variations from the reference (nucleotide); can contain SNPs, insertions and/or deletions (aka indels), microsatellites. PED/MAP: Plink default format; combines familial and phenotypic data with genetic data; slide in appendix. VEP: Ensembl's variation annotation format; slide in appendix.

  8. FASTA (and FASTQ): sample ID; DNA sequence (or amino acid); quality scores (FASTQ only).
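To make the two layouts concrete, here is a minimal parsing sketch. The function names and the example records are invented for illustration, and the FASTQ part assumes four-line records with Phred+33 quality encoding; a library such as Biopython would be the usual choice in practice.

```python
def parse_fasta(text):
    """Parse FASTA text into {record_id: sequence}."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the ID is the first word after '>'.
            current = line[1:].split()[0]
            records[current] = ""
        elif current is not None:
            # Sequence lines may be wrapped, so join them together.
            records[current] += line
    return records


def parse_fastq(text):
    """Parse 4-line FASTQ records into (id, sequence, phred_scores) tuples."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    reads = []
    for i in range(0, len(lines), 4):
        header, seq, _plus, qual = lines[i:i + 4]
        # Assumed Phred+33 encoding: quality score = ASCII code - 33.
        scores = [ord(char) - 33 for char in qual]
        reads.append((header[1:], seq, scores))
    return reads


# Made-up example records.
fasta_text = ">seq1 made-up example\nACGTAC\nGGTA\n"
fastq_text = "@read1\nACGT\n+\nII5!\n"
fasta_records = parse_fasta(fasta_text)
fastq_reads = parse_fastq(fastq_text)
print(fasta_records)
print(fastq_reads)
```

The only structural difference that matters here is that FASTQ carries one quality character per base, while FASTA carries the identifier and sequence alone.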

  9. VCF: small file size and widely used. Very useful for Mendelian disease studies. Software: VCFtools.
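As a rough illustration of why VCF files are small, each data line records only a position where a sample differs from the reference. The sketch below reads the first five standard columns (CHROM, POS, ID, REF, ALT) from tab-separated text. The function name, the variant IDs and the positions are made up, and the SNP/indel rule is deliberately simplified; for real work a tool such as VCFtools is more appropriate.

```python
def parse_vcf(text):
    """Parse VCF text into a list of dicts, one per alternate allele."""
    variants = []
    for line in text.splitlines():
        # Skip blank lines, '##' meta lines and the '#CHROM' header line.
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, var_id, ref, alt = line.split("\t")[:5]
        for allele in alt.split(","):
            # Simplified rule: single base to single base is a SNP,
            # anything else is treated as an indel.
            kind = "SNP" if len(ref) == 1 and len(allele) == 1 else "indel"
            variants.append({
                "chrom": chrom,
                "pos": int(pos),
                "id": var_id,
                "ref": ref,
                "alt": allele,
                "type": kind,
            })
    return variants


# Made-up two-variant file: one SNP and one single-base deletion.
vcf_text = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t1000\trsExample1\tA\tC\t.\tPASS\t.\n"
    "1\t2000\trsExample2\tAT\tA\t.\tPASS\t.\n"
)
variants = parse_vcf(vcf_text)
print([(v["id"], v["type"]) for v in variants])
```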

  10. Raw sequence databases: Ensembl (Europe), NCBI (USA). DNA: NC_123456 (complete genome), NG_123456 (genomic region). RNA: NM_123456 (mRNA), NR_123456 (non-coding RNA transcript). Protein: NP_123456 (protein). Also see the RefSeq (NCBI) slide in the appendix.
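The accession prefixes listed on this slide lend themselves to a small lookup table. The sketch below is an assumption-laden illustration: the helper name `classify_refseq` is invented, the table only covers the five prefixes mentioned here, and the pattern accepts an optional version suffix such as `.3`.

```python
import re

# Only the prefixes mentioned on the slide; RefSeq defines others as well.
REFSEQ_PREFIXES = {
    "NC": "DNA (complete genome)",
    "NG": "DNA (genomic region)",
    "NM": "mRNA",
    "NR": "non-coding RNA",
    "NP": "protein",
}


def classify_refseq(accession):
    """Return the molecule type for a RefSeq-style accession, or None."""
    # Two uppercase letters, underscore, digits, optional '.version'.
    match = re.match(r"^([A-Z]{2})_\d+(\.\d+)?$", accession)
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1))


# The placeholder accessions from the slide, plus one non-RefSeq string.
for accession in ["NC_123456", "NM_123456", "NP_123456.2", "rs9939609"]:
    print(accession, "->", classify_refseq(accession))
```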

  11. *Ensembl (www.ensembl.org/info/data/ftp/index.html). DNA: useful for mapping reads, designing primers. cDNA = mRNA (transcripts). CDS = coding sequence (the coding parts of the exons, without UTRs). VCF: HapMap, 1000 Genomes, Venter, Watson. VEP. 66 species (as of 26/02/14).

  12. Genome Browsers. Ensembl Genome Browser: view region; extract region; filter variants in region (prediction, MAF); conservation (GERP scores). UCSC Genome Browser (http://genome.ucsc.edu/). 1000 Genomes Browser (http://browser.1000genomes.org/index.html). VEGA Genome Browser.

  13. *Ensembl Genome Browser (EGB) (www.ensembl.org/index.html). Browse many genomes (>70 vertebrates). Example: BRCA2. Start and end; sequence (exons, introns); transcripts; orthologues (sequence homology in other species, indicative of similar function). Easy sequence extraction: http://www.ensembl.org/das/Homo_sapiens.GRCh37.reference/sequence?segment=13:32889611,32973805 Other EGBs: EnsemblPlants, EnsemblBacteria, EnsemblFungi, EnsemblMetazoa, EnsemblProtists.

  14. EGB Conservation (GERP) Location GERP elements

  15. SNP databases. Ensembl (and NCBI). dbSNP (largest database for SNPs). 1000 Genomes: whole genome sequencing of 1092 individuals; African, European, Far East. HapMap projects: genotyping of 270 individuals from 4 populations. Exome Variant Server (EVS): whole exomes of 6503 individuals; European American and African American. SNP and phenotype associations: GWAS Catalogue (slide in appendix).

  16. *dbSNP (http://ncbi.nlm.nih.gov/SNP). Search, annotate and submit SNPs. Apply filters, e.g. human, cited, clinical. >70 million SNPs just for humans. dbSNP handbook: www.ncbi.nlm.nih.gov/books/NBK21088/

  17. From GWAS catalogue

  18. Clinical databases. OMIM (Online Mendelian Inheritance in Man): database of disease-linked genes and associated phenotypes; links to Entrez, GDB and other databases. HGMD: database of sequences and phenotypes of disease-causing mutations; used to train mutation effect prediction algorithms (e.g. FATHMM); slide in appendix. mutDB (http://www.mutdb.org). Mitomap, mitochondrial DNA (http://www.mitomap.org).

  19. *OMIM (http://omim.org/): Online Mendelian Inheritance in Man. First address for Mendelian disorders. Disease > gene and phenotype; gene > disease and phenotype; phenotypes > disease and genes. Example search for autosomal recessive intellectual disability: +autosomal +recessive +intellectual +disability

  20. Protein databases. Uniprot (Swiss-Prot + TrEMBL). PDB (Protein Data Bank): slide in appendix. Pfam (protein families): slide in appendix. STRING (protein-protein interactions): slide in appendix. HMPD (mitochondrial DNA): http://bioinfo.nist.gov/

  21. Uniprot (http://www.uniprot.org/). Swiss-Prot is manually curated; TrEMBL is automatically annotated. Download protein sequence. Protein sequence BLAST. Align: conserved regions and/or residues. View AA properties. Try example: gene DNALI1; retrieve FASTA sequence, BLAST and align (3 species).

  22. Very useful integrative databases. GeneCards: integrated resource of information on human genes and their products; major emphasis on human disease; links to many kinds of biomedical information (sequence databases; OMIM, HGMD, MDB; Doctor's Guide to the Internet). Ensembl (and NCBI) fit in this category.

  23. *GeneCards (http://www.genecards.org/). Graphical view of many things about your gene. Links to Ensembl, OMIM and literature. Example: DNALI1. Entrez Gene summary; associated disorders; orthologues; STRING predictions.

  24. Google is your friend! PubMed: bibliographic database. Animal models: Zfin, zebrafish knockouts (http://zfin.org/); International Mouse Phenotyping Consortium (https://www.mousephenotype.org/). New ones coming out all the time! New cohorts and studies: UK10K project (http://www.uk10k.org/); Human Microbiome project (http://commonfund.nih.gov/hmp/index); ARIES, epigenetics (http://www.ariesepigenomics.org.uk/ariesexplorer).

  25. *PubMed (http://www.ncbi.nlm.nih.gov/pubmed/). >23 million citations for biomedical literature from MEDLINE, life science journals, and online books. Simple to use: searching; citation manager facility. PubMed help book: http://www.ncbi.nlm.nih.gov/books/NBK3827/

  26. Exercise 1 - SNPs. Find where rs9939609 is located. Is it in an exon or an intron? Minor allele? Global MAF? Which gene(s) are close by? Is it associated with any disorders? In which population(s) is the minor allele most frequent? How many other known human SNPs are in this gene?

  27. Answers. Chr 16 at position 53,820,527. Intron. A, 0.355. FTO. BMI, type 2 diabetes, menarche. Luhya in Webuye, Kenya (0.617); in dbSNP, scroll to bottom. 8099 (as of 21/02/14); search for FTO in dbSNP and filter for H. sapiens.

  28. Exercise 2 - Genes. Find the start and end coordinates of your favourite gene (e.g. DNAH5). How many exons does it have? How many different transcripts does it have? What is its function? Is it associated with any disorders? Which proteins are predicted to interact with it? Extract the coding sequence in FASTA format.

  29. Answers. Chr 5 from 13,690,440 to 13,944,652. 79 exons. 4 transcripts. Force-generating protein of respiratory cilia (from GeneCards UniprotKB section). Primary ciliary dyskinesia. DNAH1, DYNLL1, DNAL4 (from STRING). Many ways to do it, for example: export data in Ensembl (DNA); Q8TE73 in Uniprot (AA).

  30. Thank You Any questions? Please look back at the slides again once you complete the short-course(s)

  31. Appendices. Two additional terms which must be understood to make full use of the databases. 10 useful websites/databases we do not have the time to go through. Two additional must-know data formats.

  32. LD: non-random association of alleles at two or more loci. Simple example: A at chr1:1000 and T at chr3:500; C at chr1:1000 and A at chr3:500; no other haplotypes. Therefore chr1:1000 and chr3:500 are in LD. Thus not all SNPs are independent. Therefore carefully selected SNPs can save money (e.g. tag SNPs). Different from linkage: proximal loci on the same chromosome inherited together during meiosis.
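The simple example on this slide can be checked numerically with the two usual LD measures, D = p(AB) - p(A)p(B) and r² = D² / (p(A)(1 - p(A)) p(B)(1 - p(B))). The sketch below assumes two biallelic loci and made-up haplotype counts (6 copies of A-T and 4 of C-A, with no other haplotypes); the function name `ld_stats` is invented for the example.

```python
def ld_stats(haplotype_counts, allele1, allele2):
    """Return (D, r_squared) for allele1 at locus 1 and allele2 at locus 2."""
    total = sum(haplotype_counts.values())
    # Frequency of the haplotype carrying both alleles.
    p_ab = haplotype_counts.get((allele1, allele2), 0) / total
    # Marginal allele frequencies at each locus.
    p_a = sum(n for (a, _), n in haplotype_counts.items() if a == allele1) / total
    p_b = sum(n for (_, b), n in haplotype_counts.items() if b == allele2) / total
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    # r^2 is undefined when either locus is monomorphic.
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2


# Made-up counts matching the slide: only A-T and C-A haplotypes exist.
haplotypes = {("A", "T"): 6, ("C", "A"): 4}
d, r2 = ld_stats(haplotypes, "A", "T")
print(round(d, 4), round(r2, 4))
```

With only two haplotypes present, knowing the allele at one locus fully determines the allele at the other, which is why r² comes out at 1 and why one tag SNP would be enough to capture both sites.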

  33. Genetic association study. GWAS: SNP arrays, using LD; very cheap (23andme offers for $100); cases v controls. Whole exome sequencing (WES) and whole genome sequencing (WGS): expensive. Candidate gene analyses: e.g. ones identified by GWAS. Animal models.

  34. RefSeq (NCBI) (http://www.ncbi.nlm.nih.gov/refseq/). A comprehensive, integrated, non-redundant, well-annotated set of reference sequences including genomic, transcript, and protein. >33000 species (and/or strains). CCDS project.

  35. GWAS catalogue (www.genome.gov/gwastudies/). Search for SNP-phenotype associations from GWAS. Search, view and filter. Try example: BMI and P value of 1e-8. Result: 11 papers (as of 21/02/14).

  36. HGMD (http://www.hgmd.cf.ac.uk/ac/index.php). HGMD represents an attempt to collate known (published) gene lesions responsible for human inherited disease. Public version: registration required. Professional version: purchase a licence.

  37. PDB, Protein Data Bank (http://www.rcsb.org/pdb/home/home.do). Links biochemistry to your study, if there is data of course! 3D view of protein (in Jmol). Amino acid sequence. Try example: FTO (4IDZ).

  38. Pfam (http://pfam.sanger.ac.uk/). Protein families and domains: predicted to have similar functions. Domain organisation. Phylogenetic tree. Links to PDB. Try example: Dynein_heavy (PF03028).

  39. STRING (http://string-db.org/). Predicts protein-protein interactions from: coexpression; literature; experiments; genomic context. Try example: DNALI1.

  40. KEGG (http://www.kegg.jp/). A resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies. Similar to STRING but manually curated, so more reliable.

  41. Regulatory elements. Rfam, RNA family (http://rfam.sanger.ac.uk/). Noncoding RNA database (http://biobases.ibch.poznan.pl/ncRNA). Bioexplorer.net (http://www.bioexplorer.net/Databases).

  42. ENCODE (http://www.nature.com/encode/). Radical shake-up to the interrogation of genomic function. Data available: functional impact of variant sites in multiple tissues; multiple assay types. Analysis/visualisation software and scripts for the generation of figures.

  43. Locus-specific or disease-specific databases. By HGVS, the Human Genome Variation Society (http://www.hgvs.org/dblist/dblist.html). Example: Ciliome database (http://www.sfu.ca/~leroux/ciliome_home.htm).

  44. PED/MAP: the most used format. User-friendly software: Plink (pngu.mgh.harvard.edu/~purcell/plink/), one of the most cited software packages.
      Ped:
        FAM001 1 0 0 1 2 A A G G A C ...
        FAM001 2 0 0 1 2 A A A G 0 0 ...
        ....
      Map:
        1 rs123456 0 1234555
        1 rs234567 0 1237793
        1 rs233556 0 1337456
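The example rows on this slide can be read with a short sketch like the one below. It assumes the standard Plink text layout: six leading PED columns (family ID, individual ID, paternal ID, maternal ID, sex, phenotype) followed by two alleles per SNP, and four MAP columns (chromosome, SNP ID, genetic distance, base-pair position). The function names are invented, the trailing "..." from the slide is dropped, and an allele of 0 is treated as missing.

```python
def parse_ped_line(line):
    """Parse one PED row into a dict with paired genotypes."""
    fields = line.split()
    family, individual, father, mother, sex, phenotype = fields[:6]
    alleles = fields[6:]
    genotypes = []
    for i in range(0, len(alleles), 2):
        pair = (alleles[i], alleles[i + 1])
        # '0' marks a missing allele call in this format.
        genotypes.append(None if "0" in pair else pair)
    return {
        "family": family,
        "individual": individual,
        "father": father,
        "mother": mother,
        "sex": int(sex),
        "phenotype": int(phenotype),
        "genotypes": genotypes,
    }


def parse_map_line(line):
    """Parse one MAP row into a dict describing a SNP."""
    chrom, snp_id, distance, position = line.split()
    return {"chrom": chrom, "id": snp_id,
            "distance": float(distance), "position": int(position)}


# Rows taken from the slide, without the trailing '...'.
ped_rows = ["FAM001 1 0 0 1 2 A A G G A C",
            "FAM001 2 0 0 1 2 A A A G 0 0"]
map_rows = ["1 rs123456 0 1234555", "1 rs234567 0 1237793", "1 rs233556 0 1337456"]
samples = [parse_ped_line(row) for row in ped_rows]
snps = [parse_map_line(row) for row in map_rows]
print(samples[1]["genotypes"], [snp["id"] for snp in snps])
```

The genotype pairs in a PED row are in the same order as the rows of the MAP file, so the third pair of each sample belongs to rs233556.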

  45. Variant effect predictor (VEP) www.ensembl.org/info/docs/tools/vep/index.html Example file:
