Introduction to BED Files: An Overview of Browser Extensible Data Format
BED (Browser Extensible Data) files are commonly used for annotating genomic sequences by specifying ranges on chromosomes. They consist of required fields (chromosome name, start position, and end position) and optional fields for additional information such as name, score, strand, and display color. The examples show how BED files are laid out for display in genome browsers such as the UCSC Genome Browser.
Presentation Transcript
Lectures on Informatics: An Introduction to Computers and Informatics in the Health Sciences
Sequencing Application I: Sequencing of Microbiome Data
Christopher Taylor, Associate Professor
Department of Microbiology, Immunology & Parasitology
BED (Browser Extensible Data)
This format is useful for annotating an existing genomic sequence. It is based on a given reference sequence and specifies ranges along chromosomes.
Three required BED fields for each record:
1. chrom: the name of the chromosome (e.g. chr4)
2. chromStart: starting position of the feature (0-based)
3. chromEnd: ending position of the feature (not included)
For example, the first 100 bases (bases 0-99) of chromosome 4 would be specified with the BED line:
chr4 0 100
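The 0-based, end-exclusive convention is easy to check with a short Python sketch. The helper names `parse_bed3` and `to_one_based` are illustrative only (they do not come from any library), and the input is assumed to be whitespace-separated.

```python
def parse_bed3(line):
    """Parse the three required BED fields from one record."""
    chrom, start, end = line.split()[:3]
    return chrom, int(start), int(end)


def to_one_based(start, end):
    """Convert a 0-based, end-exclusive range to 1-based, inclusive positions."""
    return start + 1, end


chrom, start, end = parse_bed3("chr4 0 100")
print(chrom, end - start)        # chr4 100  (feature length is end - start)
print(to_one_based(start, end))  # (1, 100)  (same range, 1-based inclusive)
```

Because the end position is excluded, the feature length is simply chromEnd minus chromStart, with no "+1" correction.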
BED (Browser Extensible Data) (cont)
There are several optional BED fields:
4. name: defines a name for the BED line; this could be a gene name, for example
5. score: a score between 0 and 1000; the line is drawn lighter closer to 0 and darker closer to 1000, which could be used to represent expression level, for example
6. strand: whether the feature is on the + or - strand of the genome
7. thickStart: start of thick drawing, e.g. codon
8. thickEnd: end of thick drawing
9. itemRgb: RGB value to give color to the line
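As a rough sketch of how the required and optional fields fit together, the record below is parsed into a small Python data class. The class `BedRecord` and the function `parse_bed` are hypothetical names chosen for this example; the sketch assumes whitespace-separated fields and an itemRgb value written as R,G,B.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class BedRecord:
    chrom: str
    chrom_start: int
    chrom_end: int
    name: Optional[str] = None
    score: Optional[int] = None
    strand: Optional[str] = None
    thick_start: Optional[int] = None
    thick_end: Optional[int] = None
    item_rgb: Optional[Tuple[int, ...]] = None


def parse_bed(line):
    """Parse one BED record; optional fields stay None when absent."""
    f = line.split()
    rec = BedRecord(f[0], int(f[1]), int(f[2]))
    if len(f) > 3:
        rec.name = f[3]
    if len(f) > 4:
        rec.score = int(f[4])
    if len(f) > 5:
        rec.strand = f[5]
    if len(f) > 6:
        rec.thick_start = int(f[6])
    if len(f) > 7:
        rec.thick_end = int(f[7])
    if len(f) > 8:
        # Assumes the "R,G,B" form, e.g. 255,0,0
        rec.item_rgb = tuple(int(v) for v in f[8].split(","))
    return rec


rec = parse_bed("chr7 127471196 127472363 Pos1 0 + 127471196 127472363 255,0,0")
print(rec.name, rec.strand, rec.item_rgb)
```

Optional fields are positional, so a record that supplies strand must also supply name and score before it.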
BED File Example
browser position chr7:127471196-127495720
browser hide all
track name="ItemRGBDemo" description="Item RGB demonstration" visibility=2 itemRgb="On"
chr7 127471196 127472363 Pos1 0 + 127471196 127472363 255,0,0
chr7 127472363 127473530 Pos2 0 + 127472363 127473530 255,0,0
chr7 127473530 127474697 Pos3 0 + 127473530 127474697 255,0,0
chr7 127474697 127475864 Pos4 0 + 127474697 127475864 255,0,0
chr7 127475864 127477031 Neg1 0 - 127475864 127477031 0,0,255
chr7 127477031 127478198 Neg2 0 - 127477031 127478198 0,0,255
chr7 127478198 127479365 Neg3 0 - 127478198 127479365 0,0,255
chr7 127479365 127480532 Pos5 0 + 127479365 127480532 255,0,0
chr7 127480532 127481699 Neg4 0 - 127480532 127481699 0,0,255
Display this track in the UCSC Genome Browser
BED File Example with Scores
browser position chr22:0-20000
browser hide all
track name=pairedReads description="Clone Paired Reads" useScore=1
chr22 1000 5000 cloneA 960 +
chr22 2000 6000 cloneB 900 -
chr22 4000 7000 cloneC 700 +
chr22 6000 8000 cloneD 400 -
chr22 9000 14000 cloneE 200 +
chr22 15000 19000 cloneF 50 -
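The scored example can be read programmatically by skipping the `browser` and `track` lines and keeping only the data records. The following is a minimal Python sketch; the function name `read_records` and the score threshold of 700 are arbitrary choices made for illustration.

```python
BED_TEXT = """\
browser position chr22:0-20000
browser hide all
track name=pairedReads description="Clone Paired Reads" useScore=1
chr22 1000 5000 cloneA 960 +
chr22 2000 6000 cloneB 900 -
chr22 4000 7000 cloneC 700 +
chr22 6000 8000 cloneD 400 -
chr22 9000 14000 cloneE 200 +
chr22 15000 19000 cloneF 50 -
"""


def read_records(text):
    """Return (chrom, start, end, name, score, strand) tuples from BED text."""
    records = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the browser/track display directives.
        if not fields or fields[0] in ("browser", "track"):
            continue
        chrom, start, end, name, score, strand = fields[:6]
        records.append((chrom, int(start), int(end), name, int(score), strand))
    return records


records = read_records(BED_TEXT)
high = [name for _, _, _, name, score, _ in records if score >= 700]
print(high)  # ['cloneA', 'cloneB', 'cloneC']
```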
SAM (Sequence Alignment/Map)
This format is used for storing read alignments against reference sequences. A SAM file begins with an optional header. Each record following the header contains information on a read alignment:
1. QNAME: the name of the read that was aligned
2. FLAG: a bitwise flag describing the alignment. Are there multiple fragments? Are all fragments properly aligned? Is this fragment unmapped? etc.
SAM (Sequence Alignment/Map) (cont)
3. RNAME: the name of the reference sequence aligned to
4. POS: 1-based leftmost position of the sequence
5. MAPQ: mapping quality (Phred-scaled)
6. CIGAR: extended CIGAR string
7. MRNM: mate reference sequence name (= if same as field 3)
8. MPOS: 1-based mate position
9. LEN: inferred template length (insert size)
10. SEQ: query sequence on the same strand as the reference
11. QUAL: query quality (ASCII code minus 33 gives the Phred score)
12. OPT: variable optional fields in TAG:VTYPE:VALUE format
Current versions of the SAM specification name fields 7-9 RNEXT, PNEXT, and TLEN.
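To make the field layout concrete, here is a minimal Python sketch that splits one tab-separated alignment record into the fields listed above, converts QUAL characters to Phred scores, and breaks a CIGAR string into operations. The function names and the sample record are made up for illustration; real pipelines would normally rely on an established SAM/BAM library instead.

```python
import re

SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "MRNM", "MPOS", "LEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "MPOS", "LEN"}


def parse_sam_line(line):
    """Split one alignment record into the 11 mandatory fields plus OPT."""
    parts = line.rstrip("\n").split("\t")
    record = {}
    for name, value in zip(SAM_FIELDS, parts):
        record[name] = int(value) if name in INT_FIELDS else value
    # Optional fields look like TAG:VTYPE:VALUE; VALUE may itself contain ':'.
    record["OPT"] = [tuple(tag.split(":", 2)) for tag in parts[11:]]
    return record


def phred_scores(qual):
    """ASCII code minus 33 gives the Phred score of each base."""
    return [ord(c) - 33 for c in qual]


def parse_cigar(cigar):
    """Turn a CIGAR string such as '18M2D19M' into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


# Hypothetical record, invented for this example.
line = "read1\t99\tchr1\t100\t60\t5M\t=\t200\t105\tACGTA\tIIIII\tNM:i:0"
rec = parse_sam_line(line)
print(rec["POS"], rec["CIGAR"], rec["OPT"])
print(phred_scores(rec["QUAL"]))
print(parse_cigar("18M2D19M"))
```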
SAM File Example
@HD VN:1.0 SO:coordinate
@SQ SN:1 LN:249250621 AS:NCBI37 UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:1b22b98cdeb4a9304cb5d48026a85128
@SQ SN:2 LN:243199373 AS:NCBI37 UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:a0d9851da00400dec1098a9255ac712e
@SQ SN:3 LN:198022430 AS:NCBI37 UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:fdfd811849cc2fadebc929bb925902e5
@RG ID:UM0098:1 PL:ILLUMINA PU:HWUSI-EAS1707-615LHAAXX-L001 LB:80 DT:2010-05-05T20:00:00-0400 SM:SD37743 CN:UMCORE
@RG ID:UM0098:2 PL:ILLUMINA PU:HWUSI-EAS1707-615LHAAXX-L002 LB:80 DT:2010-05-05T20:00:00-0400 SM:SD37743 CN:UMCORE
@PG ID:bwa VN:0.5.4
@PG ID:GATK TableRecalibration VN:1.0.3471 CL:Covariates=[ReadGroupCovariate, QualityScoreCovariate, CycleCovariate, DinucCovariate, TileCovariate], default_read_group=null, default_platform=null, force_read_group=null, force_platform=null, solid_recal_mode=SET_Q_ZERO, window_size_nqs=5, homopolymer_nback=7, exception_if_no_tile=false, ignore_nocall_colorspace=false, pQ=5, maxQ=40, smoothing=1
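Header lines follow a simple pattern: a record type such as @SQ or @RG, followed by TAG:VALUE pairs. A small Python sketch of that pattern is shown below; `parse_header_line` is an illustrative name, and the sketch assumes the fields are tab-separated, as they are in an actual SAM file.

```python
def parse_header_line(line):
    """Split a SAM header line into its record type and a dict of tags."""
    fields = line.rstrip("\n").split("\t")
    record_type = fields[0].lstrip("@")
    # Split on the first ':' only, since values such as UR may contain ':'.
    tags = dict(field.split(":", 1) for field in fields[1:])
    return record_type, tags


kind, tags = parse_header_line(
    "@SQ\tSN:1\tLN:249250621\tUR:file:/data/local/ref/GATK/human_g1k_v37.fasta"
)
print(kind, tags["SN"], int(tags["LN"]))
print(tags["UR"])
```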
SAM File Example (cont)
1:497:R:-272+13M17D24M 113 1 497 37 37M 15 100338662 0 CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG 0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>> XT:A:U NM:i:0 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37
19:20389:F:275+18M2D19M 99 1 17644 0 37M = 17919 314 TATGACTGCTAATAATACCTACACATGTTAGAACCAT >>>>>>>>>>>>>>>>>>>><<>>><<>>4::>>:<9 RG:Z:UM0098:1 XT:A:R NM:i:0 SM:i:0 AM:i:0 X0:i:4 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37
19:20389:F:275+18M2D19M 147 1 17919 0 18M2D19M = 17644 -314 GTAGTACCAACTGTAAGTCCTTATCTTCATACTTTGT ;44999;499<8<8<<<8<<><<<<><7<;<<<>><< XT:A:R NM:i:2 SM:i:0 AM:i:0 X0:i:4 X1:i:0 XM:i:0 XO:i:1 XG:i:2 MD:Z:18^CA19
9:21597+10M2I25M:R:-209 83 1 21678 0 8M2I27M = 21469 -244 CACCACATCACATATACCAAGCCTGGCTGTGTCTTCT <;9<<5><<<<><<<>><<><>><9>><>>>9>>><> XT:A:R NM:i:2 SM:i:0 AM:i:0 X0:i:5 X1:i:0 XM:i:0 XO:i:1 XG:i:2 MD:Z:35
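The inferred template length of the read pair 19:20389:F:275+18M2D19M in this example can be reproduced from the positions and the CIGAR string. The Python sketch below assumes the usual rule that the M, D, N, = and X operations consume reference bases; `reference_span` is an illustrative helper name.

```python
import re


def reference_span(cigar):
    """Number of reference bases covered by an alignment."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(n) for n, op in ops if op in "MDN=X")


first_pos = 17644          # POS of the forward read (CIGAR 37M)
second_pos = 17919         # POS of its reverse-strand mate
second_cigar = "18M2D19M"  # 18 + 2 + 19 = 39 reference bases

second_end = second_pos + reference_span(second_cigar) - 1  # 1-based, inclusive
template_length = second_end - first_pos + 1
print(second_end, template_length)  # 17957 314
```

The result matches the value 314 reported for the pair; the reverse-strand mate carries the same value with a negative sign.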
Decoding SAM Flags
FLAGS
Is the read paired? 0 or 1
Is the read mapped in proper pair? 0 or 2
Is the read unmapped? 0 or 4
Is the mate unmapped? 0 or 8
Is the read mapped to reverse strand? 0 or 16
Is the mate mapped to reverse strand? 0 or 32
Is this first read in pair? 0 or 64
Is this second read in pair? 0 or 128
Not the primary alignment? 0 or 256
Read fails platform quality checks? 0 or 512
Read is PCR or optical duplicate? 0 or 1024
This is supplementary alignment? 0 or 2048
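Since each question corresponds to one bit, a FLAG value is the sum of the values for every "yes" answer, and it can be decoded with bitwise AND. The following Python sketch illustrates this; the table `FLAG_BITS` and the function `decode_flag` are names invented for this example.

```python
FLAG_BITS = [
    (1, "read paired"),
    (2, "read mapped in proper pair"),
    (4, "read unmapped"),
    (8, "mate unmapped"),
    (16, "read on reverse strand"),
    (32, "mate on reverse strand"),
    (64, "first in pair"),
    (128, "second in pair"),
    (256, "not primary alignment"),
    (512, "fails platform quality checks"),
    (1024, "PCR or optical duplicate"),
    (2048, "supplementary alignment"),
]


def decode_flag(flag):
    """List the properties whose bit is set in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS if flag & bit]


# 99 = 1 + 2 + 32 + 64
print(decode_flag(99))
# 147 = 1 + 2 + 16 + 128
print(decode_flag(147))
```

Flags 99 and 147 are a typical combination for a properly paired forward read and its reverse-strand mate.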
BAM (Binary Alignment/Map)
This is the binary version of a SAM file. The textual information in a SAM file is convenient for users to look at, but it is a very inefficient encoding. BAM contains all of the same information as SAM but is encoded in a compact binary format suitable for storage and indexing within the computer. This provides a more efficient representation in memory and a smaller file size on disk. BAM files can be indexed so that software can jump directly to the region of interest without processing the whole file.