Sequence, SNP and Mutation Databases
Delve into the significance of SNP and mutation databases in genetic analysis, emphasizing key concepts such as mapping of locations, SNP effects, data formats, and terminology. Explore the practical applications, terminologies, and data formats vital for genetic research. Gain insights into genetic variations, allele frequencies, and association studies to enhance your database knowledge for analysis.
Presentation Transcript
Sequence, SNP and Mutation Databases Mesut Erzurumluoglu epmmee@bristol.ac.uk
Before we start Very important lecture Database knowledge is a must for analysis Hundreds of databases! Pay special attention to: Ensembl Genome Browser dbSNP GeneCards Exercise and Q&A time at the end
Two lines of data Mapping of locations Annotation Genomic landmarks e.g. Genes (exon, intron, splice site), binding sites Known associations e.g. HBB gene and sickle cell disease rs9939609 SNP and BMI Recording of variation Summary stats Frequencies, counts Correlations (e.g. SNP-SNP)
Why is it important? Mapping of locations Designing experiments e.g. Primers, candidate genes Location of variants e.g. genes within region, exon, intergenic Interpretation of results e.g. LD, consequence Recording of variation What's out there from other projects? Standardisation
Terminology Single nucleotide polymorphism (SNP) Minor allele frequency (MAF) Difference between genotyping and sequencing Linkage disequilibrium (LD) Will be taught on next short-course Slide in appendix Genetic association study Also taught on next short-course Slide in appendix
SNP Affects a single nucleotide Common ones (>1%) Mostly bi-allelic (e.g. A>C) Minor allele Second most common one Minor allele frequency Frequency of the minor allele SNV or SNP?
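A minimal sketch of the minor allele frequency definition above, assuming genotypes are given as two-character strings (one character per allele). The function name and the sample genotypes are hypothetical and only illustrate the arithmetic.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """Return (minor_allele, frequency) for genotypes such as 'AC'.

    The minor allele is taken to be the second most common allele,
    as defined on the slide.
    """
    # Count every allele across all genotypes (two alleles per individual).
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    if len(counts) < 2:
        raise ValueError("site is monomorphic: no minor allele")
    total = sum(counts.values())
    # most_common() sorts by count, descending; index 1 is the second most common.
    allele, count = counts.most_common(2)[1]
    return allele, count / total


# Hypothetical genotypes for five individuals at one SNP: 6 x A, 4 x C.
sample = ["AA", "AC", "AA", "CC", "AC"]
print(minor_allele_frequency(sample))  # ('C', 0.4)
```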
Data formats FASTA raw sequencing data Can be nucleotide or protein SAM/BAM sequence alignment format VCF only variations from reference Nucleotide Can contain SNPs, insertions and/or deletions (aka indels), microsatellites PED/MAP Plink default format Incorporates familial and phenotypic data with genetic data Slide in appendix VEP Ensembl's variation annotation format Slide in appendix
FASTA (and FASTQ) Quality scores Sample ID DNA Sequence (or amino acid)
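As a rough illustration of the three parts named on this slide (sample ID, sequence, quality scores), the sketch below parses a single four-line FASTQ record. It assumes the common Phred+33 quality encoding; the function name and the example read are made up for illustration.

```python
def parse_fastq_record(lines, phred_offset=33):
    """Parse one four-line FASTQ record into (read_id, sequence, scores).

    phred_offset=33 is an assumption (the common Phred+33 encoding).
    """
    header, sequence, separator, quality = (line.strip() for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # Each quality character encodes one score: ASCII code minus the offset.
    scores = [ord(ch) - phred_offset for ch in quality]
    return header[1:], sequence, scores


# Hypothetical read: six high-quality bases followed by one low-quality base.
record = ["@read1", "GATTACA", "+", "IIIIII#"]
read_id, sequence, scores = parse_fastq_record(record)
print(read_id, sequence, scores)
```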
VCF Small file size and widely used Very useful for Mendelian disease studies Software: Vcftools
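To make the "only variations from reference" idea concrete, here is a small sketch that reads the first five tab-separated columns of a VCF data line (CHROM, POS, ID, REF, ALT) and checks whether the record is a SNP. The helper names are hypothetical, and the example line is illustrative: the position comes from the exercise answers later in the deck, while the reference allele shown is an assumption.

```python
def parse_vcf_line(line):
    """Parse the fixed leading columns of one VCF data line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt = fields[:5]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": var_id,
        "ref": ref,
        "alt": alt.split(","),  # ALT may list several alleles
    }


def is_snp(variant):
    """True if the reference and every alternate allele are one base long."""
    return len(variant["ref"]) == 1 and all(len(a) == 1 for a in variant["alt"])


# Illustrative record (REF allele assumed for the example).
snp_line = "16\t53820527\trs9939609\tT\tA\t.\tPASS\t."
# Hypothetical deletion, to show a non-SNP record.
indel_line = "1\t1000\t.\tAT\tA\t.\tPASS\t."

print(is_snp(parse_vcf_line(snp_line)))
print(is_snp(parse_vcf_line(indel_line)))
```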
Raw sequence databases Ensembl (Europe), NCBI (USA) DNA NC_123456 (complete genome) NG_123456 (genomic region) mRNA NM_123456 (mRNA) NR_123456 (non-coding RNA transcript) Protein NP_123456 (protein) Also see RefSeq (NCBI) slide in appendix
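The accession prefixes on this slide can be captured in a small lookup table. This is only a sketch covering the five prefixes listed above; the dictionary and function names are hypothetical, and the accession numbers are the slide's own placeholders.

```python
# Prefix -> molecule type, limited to the prefixes listed on the slide.
REFSEQ_PREFIXES = {
    "NC_": "complete genomic molecule",
    "NG_": "genomic region",
    "NM_": "mRNA",
    "NR_": "non-coding RNA transcript",
    "NP_": "protein",
}


def classify_accession(accession):
    """Return the molecule type for a RefSeq-style accession, or 'unknown'."""
    return REFSEQ_PREFIXES.get(accession[:3].upper(), "unknown")


for accession in ["NC_123456", "NM_123456", "NP_123456", "XYZ123"]:
    print(accession, "->", classify_accession(accession))
```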
*Ensembl www.ensembl.org/info/data/ftp/index.html DNA Useful for mapping reads, designing primers cDNA = mRNA (transcripts) CDS = coding sequence (protein-coding portions of exons) VCF HapMap, 1000 Genomes, Venter, Watson VEP 66 species (as of 26/02/14)
Genome Browsers Ensembl Genome Browser View region Extract region Filter variants in region Prediction MAF Conservation GERP scores UCSC Genome Browser http://genome.ucsc.edu/ 1000 Genomes Browser http://browser.1000genomes.org/index.html VEGA Genome Browser
*Ensembl Genome Browser (EGB) www.ensembl.org/index.html Browse many genomes (>70 vertebrates) Example: BRCA2 Start and end Sequence (exons, introns) Transcripts Orthologues Sequence homology in other species Indicative of similar function Easy sequence extraction: http://www.ensembl.org/das/Homo_sapiens.GRCh37.reference/sequence?segment=13:32889611,32973805 Other EGBs EnsemblPlants, EnsemblBacteria, EnsemblFungi, EnsemblMetazoa, EnsemblProtists
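The sequence-extraction URL above encodes a region as chromosome:start,end. As a small sketch, the snippet below parses that segment string and reports the region length, assuming 1-based inclusive coordinates. The function names are hypothetical and no request is sent to Ensembl.

```python
def parse_region(segment):
    """Split a 'chrom:start,end' segment into (chrom, start, end)."""
    chrom, span = segment.split(":")
    start, end = (int(value) for value in span.split(","))
    if start > end:
        raise ValueError("start must not exceed end")
    return chrom, start, end


def region_length(segment):
    """Length in bases, assuming 1-based inclusive coordinates."""
    _, start, end = parse_region(segment)
    return end - start + 1


# Segment taken from the BRCA2 example URL on the slide.
print(parse_region("13:32889611,32973805"))
print(region_length("13:32889611,32973805"))  # 84195
```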
EGB Conservation (GERP) Location GERP elements
SNP databases Ensembl (and NCBI) dbSNP (largest database for SNPs) 1000 Genomes 1092 whole genomes sequenced African, European, Far East HapMap projects Genotyping 270 individuals from 4 populations Exome Variant Server (EVS) 6503 individuals whole exomes European American and African American SNP and Phenotype associations GWAS Catalogue Slide in appendix
*dbSNP http://ncbi.nlm.nih.gov/SNP Search, annotate and submit SNPs Apply filters e.g. human, cited, clinical >70 million SNPs just for humans dbSNP handbook www.ncbi.nlm.nih.gov/books/NBK21088/
Clinical databases OMIM Online Mendelian Inheritance in Man Database of disease-linked genes and associated phenotypes Links to Entrez, GDB and other databases HGMD Database of sequences and phenotypes of disease-causing mutations Used to train mutation effect prediction algorithms (e.g. FATHMM) Slide in appendix mutDB http://www.mutdb.org Mitomap (mitochondrial DNA) http://www.mitomap.org
*OMIM http://omim.org/ Online Mendelian Inheritance in Man First address for Mendelian disorders Disease > gene and phenotype Gene > disease and phenotype Phenotypes > disease and genes Example search: Autosomal recessive intellectual disability +autosomal +recessive +intellectual +disability
Protein databases Uniprot (Swiss-Prot + TrEMBL) PDB Protein data bank Slide in appendix Pfam Protein families Slide in appendix STRING Protein-Protein Interactions Slide in appendix HMPD (mitochondrial DNA) http://bioinfo.nist.gov/
Uniprot http://www.uniprot.org/ Swiss-Prot is manually curated TrEMBL is automatically annotated Download protein sequence Protein sequence BLAST Align Conserved regions and/or residues View AA properties Try example: Gene: DNALI1 Retrieve FASTA sequence, Blast and align (3 species)
Very useful integrative databases GeneCards Integrated resource of information on human genes and their products Major emphasis on human disease Links to many kinds of biomedical information Sequence databases OMIM, HGMD, MDB Doctors Guide to the Internet Ensembl (and NCBI) fits in this category
*GeneCards http://www.genecards.org/ Graphical view of many things about your gene Links to Ensembl, OMIM and Literature Example: DNALI1 Entrez Gene Summary Associated disorders Orthologues STRING predictions
Google is your friend! PubMed Bibliographic database Animal models Zfin Zebrafish knockouts http://zfin.org/ International Mouse Phenotyping Consortium https://www.mousephenotype.org/ New ones coming out all the time! New cohorts and studies UK10K project http://www.uk10k.org/ Human Microbiome project http://commonfund.nih.gov/hmp/index ARIES - Epigenetics http://www.ariesepigenomics.org.uk/ariesexplorer
*PubMed http://www.ncbi.nlm.nih.gov/pubmed/ >23 million citations for biomedical literature from MEDLINE, life science journals, and online books Simple to use Searching Citation manager facility PubMed help book http://www.ncbi.nlm.nih.gov/books/NBK3827/
Exercise 1 - SNPs
1. Find where rs9939609 is located
2. Is it in an exon or an intron?
3. Minor allele? Global MAF?
4. Which gene(s) are close by?
5. Associated with any disorders?
6. Which population(s) is the minor allele most frequent in?
7. How many other known human SNPs in this gene?
Answers
1. Chr 16 at position 53,820,527
2. Intron
3. A, 0.355
4. FTO
5. BMI, Type 2 diabetes, Menarche
6. Luhya in Webuye, Kenya (0.617); see dbSNP, scroll to bottom
7. 8099 (as of 21/02/14); search for FTO in dbSNP and filter for H. sapiens
Exercise 2 - Genes
1. Find the start and end coordinates of your favourite gene (e.g. DNAH5)
2. How many exons does it have?
3. How many different transcripts does it have?
4. What is the function?
5. Associated with any disorders?
6. Which proteins are predicted to interact with it?
7. Extract the coding sequence in FASTA
Answers
1. Chr 5 from 13,690,440 to 13,944,652
2. 79
3. 4
4. Force generating protein of respiratory cilia (from GeneCards UniprotKB section)
5. Primary ciliary dyskinesia
6. DNAH1, DYNLL1, DNAL4 (from STRING)
7. Many ways to do this, for example: export data in Ensembl (DNA); Q8TE73 in Uniprot (AA)
Thank You Any questions? Please look back at the slides again once you complete the short-course(s)
Appendices Two additional terms which must be understood to make full use of the databases 10 useful websites/databases we do not have the time to go through Two additional must know data formats
LD Non-random association of alleles at two or more loci Simple example: A at chr1:1000 and T at chr3:500 C at chr1:1000 and A at chr3:500 No other haplotypes Therefore chr1:1000 and chr3:500 are in LD Thus not all SNPs are independent Therefore carefully selected SNPs can save money (e.g. Tag SNPs) Different from linkage Proximal loci on the same chromosome inherited together during meiosis
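The simple example above can be checked numerically with the standard LD measures D and r squared. This is a sketch only: the slide gives no haplotype frequencies, so equal frequencies of 0.5 for the two observed haplotypes are assumed, and the function name is hypothetical.

```python
def ld_stats(haplotype_freqs, allele_a, allele_b):
    """Return (D, r_squared) for allele_a at locus 1 and allele_b at locus 2.

    haplotype_freqs maps (allele_at_locus1, allele_at_locus2) -> frequency.
    """
    # Marginal allele frequencies, summed over haplotypes carrying each allele.
    p_a = sum(f for (a, _), f in haplotype_freqs.items() if a == allele_a)
    p_b = sum(f for (_, b), f in haplotype_freqs.items() if b == allele_b)
    p_ab = haplotype_freqs.get((allele_a, allele_b), 0.0)
    # D: observed haplotype frequency minus the frequency expected if independent.
    d = p_ab - p_a * p_b
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denominator if denominator > 0 else float("nan")
    return d, r_squared


# Slide example: only A-T and C-A haplotypes exist (frequencies assumed equal).
slide_example = {("A", "T"): 0.5, ("C", "A"): 0.5}
print(ld_stats(slide_example, "A", "T"))  # (0.25, 1.0)
```

An r squared of 1 means the allele at one locus fully predicts the allele at the other, which is why a single tag SNP can stand in for both.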
Genetic association study GWAS SNP arrays - using LD Very cheap (23andMe offers for $100) Cases v controls Whole exome sequencing (WES) Whole genome sequencing (WGS) Expensive Candidate gene analyses E.g. ones identified by GWAS Animal models
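One common way to compare cases and controls at a single SNP is an allelic chi-square test on a 2x2 table of allele counts. The sketch below uses the textbook 2x2 formula without continuity correction (1 degree of freedom); it is not necessarily the test used in any particular study, and the function name and counts are hypothetical.

```python
def allelic_chi_square(case_minor, case_major, control_minor, control_major):
    """Pearson chi-square statistic for a 2x2 table of allele counts."""
    a, b, c, d = case_minor, case_major, control_minor, control_major
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("a row or column of the table is empty")
    # Shortcut formula for 2x2 tables: N(ad - bc)^2 / product of the margins.
    return n * (a * d - b * c) ** 2 / denominator


# Hypothetical allele counts: minor allele enriched in cases.
print(allelic_chi_square(60, 40, 40, 60))  # 8.0
```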
RefSeq (NCBI) http://www.ncbi.nlm.nih.gov/refseq/ A comprehensive, integrated, non-redundant, well-annotated set of reference sequences including genomic, transcript, and protein >33000 species (and/or strains) CCDS project
GWAS catalogue www.genome.gov/gwastudies/ Search for SNP-phenotype associations from GWAS Search, view and filter Try example: BMI and P value of 1e-8 Result: 11 papers (as of 21/02/14)
HGMD http://www.hgmd.cf.ac.uk/ac/index.php HGMD represents an attempt to collate known (published) gene lesions responsible for human inherited disease Public version registration required Professional version Purchase a licence
PDB Protein Data Bank http://www.rcsb.org/pdb/home/home.do Links biochemistry to your study If there is data of course! 3D view of protein (in Jmol) Amino acid sequence Try example: FTO (4IDZ)
Pfam http://pfam.sanger.ac.uk/ Protein families and domains Predicted to have similar functions Domain organisation Phylogenetic tree Links to PDB Try example: Dynein_heavy (PF03028)
STRING http://string-db.org/ Predicts protein-protein interactions Coexpression Literature Experiments Genomic context Try example: DNALI1
KEGG http://www.kegg.jp/ Resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies Similar to STRING but manually curated More reliable
Regulatory elements Rfam (RNA family) http://rfam.sanger.ac.uk/ Noncoding RNA database http://biobases.ibch.poznan.pl/ncRNA Bioexplorer.net http://www.bioexplorer.net/Databases
ENCODE http://www.nature.com/encode/ Radical shake up to the interrogation of genomic function Data available Functional impact of variant sites in multiple tissues Multiple assay types. Analysis/visualisation software and scripts for the generation of figures
Locus specific or disease specific databases By HGVS Human Genome Variation Society http://www.hgvs.org/dblist/dblist.html Example: Ciliome database http://www.sfu.ca/~leroux/ciliome_home.htm
PED/MAP Most used format User friendly software Plink pngu.mgh.harvard.edu/~purcell/plink/ One of the most cited software packages
Ped:
FAM001 1 0 0 1 2 A A G G A C ...
FAM001 2 0 0 1 2 A A A G 0 0 ...
....
Map:
1 rs123456 0 1234555
1 rs234567 0 1237793
1 rs233556 0 1337456
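The PED and MAP rows shown here can be read with a few lines of code. This sketch assumes the standard Plink column order (PED: family ID, individual ID, father, mother, sex, phenotype, then two alleles per SNP; MAP: chromosome, SNP ID, genetic distance, base-pair position). The function names are hypothetical, and the trailing "..." from the slide is left out of the example rows.

```python
def parse_ped_line(line):
    """Parse one PED row into sample metadata plus a list of genotype pairs."""
    fields = line.split()
    family, individual, father, mother, sex, phenotype = fields[:6]
    alleles = fields[6:]
    if len(alleles) % 2:
        raise ValueError("expected two alleles per SNP")
    # Pair consecutive alleles; '0' marks a missing call in Plink files.
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return {
        "family": family,
        "individual": individual,
        "father": father,
        "mother": mother,
        "sex": sex,
        "phenotype": phenotype,
        "genotypes": genotypes,
    }


def parse_map_line(line):
    """Parse one MAP row: chromosome, SNP ID, genetic distance, bp position."""
    chrom, snp_id, genetic_distance, position = line.split()
    return {
        "chrom": chrom,
        "snp": snp_id,
        "cm": float(genetic_distance),
        "bp": int(position),
    }


sample = parse_ped_line("FAM001 1 0 0 1 2 A A G G A C")
marker = parse_map_line("1 rs123456 0 1234555")
print(sample["genotypes"])
print(marker)
```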
Variant effect predictor (VEP) www.ensembl.org/info/docs/tools/vep/index.html Example file: