Introduction to Galaxy for NGS Data Analysis

NGS data analysis
CCM Seminar series 11.26.2014
Michael Liang: m.liang@mail.utoronto.ca
Overview
Introduction to Galaxy
Aligning raw NGS data in Galaxy
Peak calling with MACS
Basic operations with genomic intervals (peaks)
Viewing results in UCSC
Introduction to Galaxy
Galaxy is an open, web-based platform for accessible, reproducible, and transparent computational biomedical research.
Accessible: Users without programming experience can easily specify parameters and run tools and workflows.
Reproducible: Galaxy captures information so that any user can repeat and understand a complete computational analysis.
Transparent: Users share and publish analyses via the web and create Pages: interactive, web-based documents that describe a complete analysis.
Accessing Galaxy
Main portal: https://usegalaxy.org/
Wiki: https://wiki.galaxyproject.org/
Registering for an account gives access to additional features
Importing data into Galaxy
Tools -> Get Data
Upload File
Local upload
Link through URL
GenomeSpace
Other online resources
Import History
Saved or shared Galaxy session
http://wilsonlab.org/public/presentations/CCM_data/CEBPA.fastq.gz
History and Job status
QUEUED
RUNNING
COMPLETE
FAILED
Raw sequencing data
FASTQ file format
Text files that encode both the nucleotide sequence and per-base quality information
Example of a FASTQ file:
@HWI-ST600:248:C1271ACXX:7:1101:1410:2127 1:N:0:TGACCA
TAATCGCTAAAATCAAAACGAAATGCTGCTTCTTACAGCAGCCTCCTTAG
+
B@@DDFFFGHHGHE@FIIGEHIFCHGIJIHIHHIEGIEHIIJIIHHIIIE
@HWI-ST600:248:C1271ACXX:7:1101:1508:2105 1:N:0:TGACCA
GGTTGTCCACTCATAAGATGTGACCTGGCTCTTAGAGGAACTTTACAAAT
+
?@:?AABDFFFHDGEGGIIIAECHCHHHH@FHIEF*?F9FDBFH<DGIII
Line 1: begins with @, followed by the sequence identifier
Line 2: raw sequence letters
Line 3: begins with +, optionally followed by the same identifier as line 1
Line 4: quality values for the sequence in line 2, one character per base
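The four-line record layout can be read with a few lines of Python. This is only a minimal sketch, not how Galaxy or FastQC parse files: the names FastqRecord and parse_fastq are made up for illustration, records are assumed to be unwrapped (exactly four lines each), and the quality characters are assumed to use the Sanger / Illumina 1.8+ encoding (ASCII code minus 33).

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str        # line 1 without the leading '@'
    sequence: str          # line 2
    qualities: List[int]   # line 4 decoded to Phred scores


def parse_fastq(lines: List[str], offset: int = 33) -> Iterator[FastqRecord]:
    """Yield one record per group of four lines (assumes unwrapped FASTQ)."""
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of four lines")
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality strings differ in length")
        # Assumed Phred+33 encoding: score = ASCII code - 33
        yield FastqRecord(header[1:], seq, [ord(c) - offset for c in qual])


# First record from the example above
example = [
    "@HWI-ST600:248:C1271ACXX:7:1101:1410:2127 1:N:0:TGACCA",
    "TAATCGCTAAAATCAAAACGAAATGCTGCTTCTTACAGCAGCCTCCTTAG",
    "+",
    "B@@DDFFFGHHGHE@FIIGEHIFCHGIJIHIHHIEGIEHIIJIIHHIIIE",
]
record = next(parse_fastq(example))
print(record.identifier.split()[0], len(record.sequence), record.qualities[:3])
```

Under the Phred+33 assumption the first quality character, B, decodes to a score of 33. Older Illumina pipelines used an offset of 64, and converting between such variants is what the FASTQ Groomer tool is for.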
NGS: QC and FASTQ manipulation
Tools -> NGS TOOLBOX BETA -> NGS: QC and Manipulation
FastQC: Perform basic quality checks on data
FASTQ Groomer: “Groom” a FASTQ file into the quality-score variant that downstream tools expect (Sanger in Galaxy)
NGS: Mapping
Tools -> NGS TOOLBOX BETA -> NGS: Mapping
Utilities to map raw reads to reference genomes
BWA and Bowtie are the most commonly used
Input FASTQ -> Output SAM/BAM
NB: Make sure the same reference genome build (here hg19) is used at every step!
Alignment output file
SAM (Sequence Alignment/Map format) file:
o A tab-delimited, human-readable text file that contains aligned sequence data
o Each alignment line has 11 mandatory fields containing information such as mapping position, mapping quality, and segment sequence
o Detailed description of the SAM file format: http://samtools.sourceforge.net/SAM1.pdf
NS500322:23:H0UM0AGXX:1:22305:20603:1636	0	chr1	93	0	61M	*	0	0	CCCTGTAGTTAAAATTGACTAAGTATTGGAAGGGGCCTATAGACCTTGAGTATTCTCAAGG	<AAAAFAFFF7FFFFFFFFF.FFFAFFFFFFFFFFFFFFF.F.F)FFFFFFFF<FAFFFFF	XT:A:R	NM:i:0	X0:i:2	X1:i:0	XM:i:0	XO:i:0	XG:i:0	MD:Z:61	XA:Z:chr7,-92852201,61M,0;
NS500322:23:H0UM0AGXX:1:13301:15368:13300	0	chr1	265	37	58M	*	0	0	AGTTATTTATTGGCCCTTCAATTTTCATTTTTATAACCTACTATTACCTTGCAAAAAA	7AAAAFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<<FFFFFFFFFFFFFFFFFFFFFF	XT:A:U	NM:i:0	X0:i:1	X1:i:0	XM:i:0	XO:i:0	XG:i:0	MD:Z:58
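To see how the 11 mandatory fields line up with the records above, here is a minimal Python sketch that splits one alignment line on tabs. The field names follow the SAM specification; SamRecord and parse_sam_line are hypothetical names chosen for this example, and the optional tags are kept as plain strings rather than converted by type.

```python
from typing import Dict, NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description, e.g. 58M
    rnext: str   # reference name of the mate / next read
    pnext: int   # position of the mate / next read
    tlen: int    # observed template length
    seq: str     # segment sequence
    qual: str    # base qualities
    tags: Dict[str, str]  # optional TAG:TYPE:VALUE fields, keyed by TAG


def parse_sam_line(line: str) -> SamRecord:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    tags = {}
    for tag in fields[11:]:
        name, _type, value = tag.split(":", 2)
        tags[name] = value
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tags,
    )


# Second record from the example above, rebuilt as one tab-delimited line
line = "\t".join([
    "NS500322:23:H0UM0AGXX:1:13301:15368:13300", "0", "chr1", "265", "37",
    "58M", "*", "0", "0",
    "AGTTATTTATTGGCCCTTCAATTTTCATTTTTATAACCTACTATTACCTTGCAAAAAA",
    "7AAAAFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<<FFFFFFFFFFFFFFFFFFFFFF",
    "XT:A:U", "NM:i:0", "X0:i:1", "X1:i:0", "XM:i:0", "XO:i:0", "XG:i:0",
    "MD:Z:58",
])
rec = parse_sam_line(line)
print(rec.rname, rec.pos, rec.mapq, rec.cigar, rec.tags["XT"])
```

In the two example records, the read at position 93 has mapping quality 0 and X0:i:2 (two equally good hits), while the read at position 265 has mapping quality 37 and X0:i:1 (a single best hit).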
NGS: SAM Tools
Tools -> NGS TOOLBOX BETA -> NGS: SAM Tools
Suite of tools for processing SAM files
Capable of filtering based on quality, location, duplicates, etc.
Can convert to BAM format (used by most analysis tools)
SAM-to-BAM
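Filtering on quality and duplicates comes down to checking two columns of each alignment line. The sketch below is plain Python for illustration, not the SAMtools implementation: filter_sam and the threshold of 30 are assumptions made for this example, while the flag bits 0x4 (unmapped) and 0x400 (duplicate) come from the SAM specification. The three reads are shortened, made-up records with * in place of the sequence and quality fields.

```python
from typing import Iterable, Iterator

UNMAPPED = 0x4     # SAM flag bit: segment unmapped
DUPLICATE = 0x400  # SAM flag bit: PCR or optical duplicate


def filter_sam(lines: Iterable[str], min_mapq: int = 30) -> Iterator[str]:
    """Keep header lines and mapped, non-duplicate reads with MAPQ >= min_mapq."""
    for line in lines:
        if line.startswith("@"):
            yield line  # header lines pass through unchanged
            continue
        fields = line.split("\t")
        flag, mapq = int(fields[1]), int(fields[4])
        if flag & (UNMAPPED | DUPLICATE):
            continue
        if mapq < min_mapq:
            continue
        yield line


# Hypothetical, shortened records for illustration only
sam_lines = [
    "@HD\tVN:1.0\tSO:unsorted",
    "readA\t0\tchr1\t93\t0\t61M\t*\t0\t0\t*\t*",       # MAPQ 0: dropped
    "readB\t0\tchr1\t265\t37\t58M\t*\t0\t0\t*\t*",     # kept
    "readC\t1024\tchr1\t300\t60\t58M\t*\t0\t0\t*\t*",  # duplicate flag: dropped
]
kept = list(filter_sam(sam_lines))
print([ln.split("\t")[0] for ln in kept])
```

Only the header and readB survive: readA fails the mapping-quality threshold and readC carries the duplicate flag.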
NGS Workflow Recap
Extracting a workflow and sharing a history
The processing steps in a history can be extracted as a generic workflow
Workflows can be saved, modified, shared, etc.
History -> Options -> Extract Workflow
A full history, including files and processing steps, can be shared and loaded.
History -> Options -> Share or Publish
ChIP-seq overview (figure: sequence and align to genome)
Alignment of ChIP-seq reads (figure: DNA binding protein)
Importing data into Galaxy: Shared Data
Access published datasets / histories
Shared Data -> Published Histories
Search by history name, e.g. “ChIP-seq sample (2: post-alignment)”
Search by username, e.g. “mimi31k”
NGS: Peak Calling
Tools -> NGS TOOLBOX BETA -> NGS: Peak Calling
Tools for identifying ChIP-seq Peaks
MACS
Accepts tag files in multiple formats (BED, BAM, etc.)
A control file helps reduce technical artifacts
Check genome size, tag size
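One way to check the tag size before running MACS is to look at the read lengths in the input. The sketch below is an assumption-laden illustration, not part of MACS or Galaxy: most_common_read_length is a hypothetical helper, and the genome size table lists the approximate effective genome sizes that the MACS documentation gives for its organism shortcuts.

```python
from collections import Counter
from typing import Iterable

# Approximate effective genome sizes for the MACS organism shortcuts
MACS_GENOME_SIZES = {"hs": 2.7e9, "mm": 1.87e9, "ce": 9e7, "dm": 1.2e8}


def most_common_read_length(sequences: Iterable[str]) -> int:
    """Return the read length that occurs most often (a stand-in for tag size)."""
    counts = Counter(len(s) for s in sequences)
    if not counts:
        raise ValueError("no sequences given")
    return counts.most_common(1)[0][0]


# The two reads from the FASTQ example earlier in the slides
reads = [
    "TAATCGCTAAAATCAAAACGAAATGCTGCTTCTTACAGCAGCCTCCTTAG",
    "GGTTGTCCACTCATAAGATGTGACCTGGCTCTTAGAGGAACTTTACAAAT",
]
tag_size = most_common_read_length(reads)
genome_size = MACS_GENOME_SIZES["hs"]  # human, matching the hg19 reference
print(tag_size, genome_size)
```

For a real dataset the read lengths would come from the aligned file rather than two example reads, and the genome size should match the reference used for mapping.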
Downstream analyses
Tools -> NGS TOOLBOX BETA -> Bedtools
Tools for manipulating genomic intervals
Overlapping peaks for multiple factors
Intersect multiple sorted BED files
Filtering and sorting files
Select rows in a file based on “rules”
Find combinatorial binding versus singletons
Visualize in genome browser
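The idea behind intersecting sorted peak files can be sketched in plain Python. This is not the Bedtools implementation: intersect_sorted and singletons are hypothetical helpers, the peak coordinates are made up, and each list is assumed to be sorted by chromosome name and start, with no overlapping peaks within a list. Coordinates are BED-style: 0-based start, end not included.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end), BED-style half-open


def intersect_sorted(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the overlapping segments between two sorted interval lists."""
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        chrom_a, start_a, end_a = a[i]
        chrom_b, start_b, end_b = b[j]
        if chrom_a != chrom_b:
            # Advance the list that is behind in chromosome order
            if chrom_a < chrom_b:
                i += 1
            else:
                j += 1
            continue
        start, end = max(start_a, start_b), min(end_a, end_b)
        if start < end:
            out.append((chrom_a, start, end))
        # Advance whichever interval ends first
        if end_a <= end_b:
            i += 1
        else:
            j += 1
    return out


def singletons(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return intervals of a that overlap nothing in b (simple quadratic scan)."""
    return [
        x for x in a
        if not any(x[0] == y[0] and x[1] < y[2] and y[1] < x[2] for y in b)
    ]


# Made-up peaks for two factors
factor_a = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)]
factor_b = [("chr1", 150, 250), ("chr1", 700, 800), ("chr2", 100, 120)]

shared = intersect_sorted(factor_a, factor_b)
alone = singletons(factor_a, factor_b)
print(shared)  # regions bound by both factors
print(alone)   # peaks of factor A with no factor B peak
```

With these inputs, the shared regions are chr1:150-200 and chr2:100-120, and the only factor A singleton is chr1:500-600.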
Exporting data for other analyses
Download to local drive
Send to GenomeSpace
Load from GenomeSpace into other Galaxy servers