Chimeric Artifacts in Bacterial 16S rRNA Gene Sequencing

undefined
Bacterial
chromosome
16S rRNA gene
Primers
16S rRNA gene
segments
PCR
Sequencing
Sample with bacteria
undefined
Bacterial
chromosome
16S rRNA gene
Primers
16S rRNA gene
segments
PCR
Amplified
segments
Biological
sequences
Chimeric artifacts 
formed from ≥2
biological sequences during PCR
Sequencing
Sample with bacteria
Biological
sequences
Chimeric artifacts formed from ≥2
biological sequences during PCR
Clustering
Biological OTUs
Chimeric
 
OTUs
From Haas 
et al.
 (2011) Chimeric 16S rRNA sequence formation and detection
in Sanger and 454-pyrosequenced PCR amplicons, 
Genome Research
.
Usually in region of high sequence similarity
Similarity usually due to homology
But not always!
Non-homologous cross-over
Chimera formed from single parent
Looks like deletion or tandem duplication
Proportional to parent abundance
Proportional to sequence similarity
P(Parent1, Parent2) 
      
(
proportional to
)
 
abundance(Parent1)
 
  × abundance(Parent2)
 
  × sequence_similarity(Parent1, Parent2)
  
(
cross-over at matching k-mer
)
Bimera (two segments)
Trimera (three segments)
Trimera (three segs., two parents)
From Lahr and Katz (2009) , Reducing the impact of PCR-mediated recombination in molecular evolution and
environmental studies using a new-generation high-fidelity DNA polymerase, 
Biotechniques
.
2-meras
3-meras
4-meras
= (nr segments – 1)
Homologous cross-over
Chimeras look like biological sequences
Often align well to reference sequences
How to distinguish?
Next-gen read 100 – 400nt
3% divergence = 3 – 12 diffs
Small amount of evidence
Reference database
Match segments to known parents
De novo
Find chimeric alignments (A-B-C)
Chimera is least abundant
UCHIME
Edgar 
et al
. (2001)
 
UCHIME improves speed and sensitivity of chimera detection,
Bioinformatics
.
Haas 
et al
. 2011.
Find 50-mers unique to single genus
Chimera if 50-mers indicate > 1 genus
Low sensitivity, genus level only
Ashelford 
et al
. 2005
At least 1 in 20 16S rRNA sequence records currently held in public repositories is estimated to contain
substantial anomalies. 
Appl Environ Microbiol 
71: 7724–7736
Find closest reference sequence(s)
Measure divergence in sliding window (300nt)
Compare with avg. variability in 16S gene
Conserved vs. variable regions
Anomaly (chimera) if variability far from avg.
Newer algorithms work better
ChimeraSlayer
Haas 
et al.
 2011
Similar to UCHIME reference database mode
Perseus
Quince 
et al.
 2011
 
Removing Noise From Pyrosequenced Amplicons ,
BMC Bioinformatics.
Similar to UCHIME 
de novo 
mode
Query
Split into four chunks
Search database
Save top hits
Hits
A
B
Query
Find & align closest pair (A, B)
User provides reference database
Should be high-quality sequences
Believed to be chimera-free
Advantages:
High confidence in predictions
Disadvantages:
Expect high false-negative rate
Ref DB usually doesn't cover all possible parents
Parents amplified more than chimera
At least one more round
So parents at least 2x more abundant
"Abundance skew" >= 2 (user-settable)
Input is estimated amplicons + abundances
NOT reads!
Sort amplicons by decreasing abundance
Start with empty DB
For each amplicon:
Search DB for parents with >= 2x abundance
If chimeric hit:
Classify as chimera and discard query
If not chimeric hit:
Add to reference DB
Ref DB
De novo
Hits found
by both
Hits found
by ref DB
only (rare?)
Hits found by
de novo
 only
(common?
Two modes check each other
De novo
 should have better coverage
All parents should be present
Should examine hits found by ref DB but not
by 
de novo
See UCHIME manual for more discussion.
A    81 
CCTTGGTAGGCCGtTGCCCTGCCAACTA GCTAATCAGACGC gggtCCATCtcaCACCaccggAgtTTTtcTCaCTgTacc
 160
Q    81 
CCTTGGTAGGCCGCTGCCCTGCCAACTA
 GCTAATCAGACGC 
ATCCCCATCCATCACCGATAAATCTTTAATCTCTTTCAG
 160
B    81 
TCTTGGTgGGCCGtTaCCCcGCCAACaA GCTAATCAGACGC ATCCCCATCCATCACCGATAAATCTTTAAaCTCTTTCAG
 160
Diffs   
A
      
A
     p 
A
   
A
      
A
                
BBBB
     
BBB    BBBBB BB   BB
a 
B  B BBB
Votes
   
Y      Y     A Y   Y      Y                YYYY     YYY    YYYYY YY   YYN Y  Y YYY
Model   
AAAAAAAAAAAAAAAAAAAAAAAAAAAA 
xxxxxxxxxxxxx 
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
No vote
Abstain vote
Model = segment of A + segment of B
Chimeric if model closer to Q than A or B
Left closer to A 
and 
right closer to B
Closer if Y > N
Ratio Y/N > 1
A    81 
CCTTGGTAGGCCGtTGCCCTGCCAACTA GCTAATCAGACGC gggtCCATCtcaCACCaccggAgtTTTtcTCaCTgTacc
 160
Q    81 
CCTTGGTAGGCCGCTGCCCTGCCAACTA
 GCTAATCAGACGC 
ATCCCCATCCATCACCGATAAATCTTTAATCTCTTTCAG
 160
B    81 
TCTTGGTgGGCCGtTaCCCcGCCAACaA GCTAATCAGACGC ATCCCCATCCATCACCGATAAATCTTTAAaCTCTTTCAG
 160
Diffs   
A
      
A
     p 
A
   
A
      
A
                
BBBB
     
BBB    BBBBB BB   BB
a 
B  B BBB
Votes
   
Y      Y     A Y   Y      Y                YYYY     YYY    YYYYY YY   YYN Y  Y YYY
Model   
AAAAAAAAAAAAAAAAAAAAAAAAAAAA 
xxxxxxxxxxxxx 
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
H
 
= 
Y
 
/ 
[
β
 
(
N
 
+
 
n
)
 +
 
A
]
Observations: 
Y
=Yes votes, 
N
=No votes, 
A
=abstain votes
Parameters:
 
β=
weight of No vote, 
n
=
prior number of No votes.
  Larger score = more likely to be chimera
  Default: chimera if 
H
Left
 
x 
H
Right
  
0.3
  Threshold 0.3 adjusts sensitivity vs. false-positive rate.
Real communities
Don't know all the biological sequences
Can't distinguish chimeras from real (that's the problem!)
Mock community
Do you really know all the 16S sequences -- no!
Communities too "easy"
To few biological sequences, too well separated
Simulation
How realistic is it -- we don't know.
No definitive validation
Length 300 simulated bimeras with 0 - 5% mutations.
Length 300 simulated multimeras with 1% substitutions.
Noisy regions
align well enough
Open-source version
Source code donated to public domain
USEARCH version
Leverages proprietary algorithms
10x or more faster than open-source version
100x faster than Perseus
1,000x faster than ChimeraSlayer
USEARCH version 10x or more faster again.
Many subtle issues
Read manual & Supp. Mat. carefully!
Database incomplete
Missing species
Missing paralogs
16S duplications are common
Probably high rates of false negatives
Parents should be present
Probably low false negative rate vs. ref db.
False positive rate not well known
Mock community validation may be optimistic
Error correction required
Input MUST be amplicons & abundances
Usually means starting from raw reads
Cannot use on processed seqs. (e.g. RDP)
Convergent evolution in different clades
Different rates in different regions
Biological chimeras
Bad sequences
Bad alignments
Full-length gene
Shotgun fragments
Paired-end reads
Reference database method
De novo
 mode not possible with shotgun
Screen each end separately
Using standard UCHIME in ref db. mode
For each end E1,E2
Find closest parent P1,P2
P2
P2
P1
P2
Gap 
d
1 
using P1
Pad gap with (
d
1
 + d
2
)/2
 
N
s
E2
E1
UCHIME ref db. on padded pair.
N
s don't count as diffs.
Ref db. mode can be run with any set of seqs.
Later is more efficient (fewer seqs.)
De novo
 mode no choice:
Requires full set of amplicons & abundances
Must follow denoise/error correction step
De novo
 first
Ref db. second
Two modes can check each other
Based on USEARCH/UCLUST/UCHIME
Described in afternoon talk on clustering.
http://drive5.com/uchime
Slide Note
Embed
Share

This content explores the formation of chimeric artifacts during PCR amplification of bacterial 16S rRNA gene segments, leading to challenges in biological sequence clustering and detection. It delves into the complexities of distinguishing between homologous and non-homologous crossover chimeras, providing insights on mitigating PCR-mediated recombination in molecular evolution studies using high-fidelity DNA polymerases.

  • Bacterial Sequencing
  • Chimeric Artifacts
  • 16S rRNA Gene
  • PCR Amplification
  • DNA Polymerase

Uploaded on Sep 26, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. Sample with bacteria 16S rRNA gene Primers 16S rRNA gene segments Bacterial chromosome PCR Sequencing

  2. Sample with bacteria 16S rRNA gene Primers 16S rRNA gene segments Bacterial chromosome PCR Amplified segments Biological sequences Sequencing Chimeric artifacts formed from 2 biological sequences during PCR

  3. Biological OTUs Biological sequences Clustering Chimeric artifacts formed from 2 biological sequences during PCR Chimeric OTUs

  4. From Haas et al. (2011) Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons, Genome Research.

  5. Usually in region of high sequence similarity Similarity usually due to homology But not always! Non-homologous cross-over Chimera formed from single parent Looks like deletion or tandem duplication

  6. Proportional to parent abundance Proportional to sequence similarity P(Parent1, Parent2) (proportional to) abundance(Parent1) abundance(Parent2) sequence_similarity(Parent1, Parent2) (cross-over at matching k-mer)

  7. Bimera (two segments) Trimera (three segments) Trimera (three segs., two parents)

  8. 2-meras 3-meras 4-meras = (nr segments 1) From Lahr and Katz (2009) , Reducing the impact of PCR-mediated recombination in molecular evolution and environmental studies using a new-generation high-fidelity DNA polymerase, Biotechniques.

  9. Homologous cross-over Chimeras look like biological sequences Often align well to reference sequences How to distinguish? Next-gen read 100 400nt 3% divergence = 3 12 diffs Small amount of evidence

  10. Reference database Match segments to known parents De novo Find chimeric alignments (A-B-C) Chimera is least abundant UCHIME Edgar et al. (2001) UCHIME improves speed and sensitivity of chimera detection, Bioinformatics.

  11. Haas et al. 2011. Find 50-mers unique to single genus Chimera if 50-mers indicate > 1 genus Low sensitivity, genus level only

  12. Ashelford et al. 2005 At least 1 in 20 16S rRNA sequence records currently held in public repositories is estimated to contain substantial anomalies. Appl Environ Microbiol 71: 7724 7736 Find closest reference sequence(s) Measure divergence in sliding window (300nt) Compare with avg. variability in 16S gene Conserved vs. variable regions Anomaly (chimera) if variability far from avg. Newer algorithms work better

  13. ChimeraSlayer Haas et al. 2011 Similar to UCHIME reference database mode Perseus Quince et al. 2011 Removing Noise From Pyrosequenced Amplicons ,BMC Bioinformatics. Similar to UCHIME de novo mode

  14. Query Split into four chunks Chunk Chunk Chunk Chunk Search database Save top hits Hits Find & align closest pair (A, B) A Query B

  15. User provides reference database Should be high-quality sequences Believed to be chimera-free Advantages: High confidence in predictions Disadvantages: Expect high false-negative rate Ref DB usually doesn't cover all possible parents

  16. Parents amplified more than chimera At least one more round So parents at least 2x more abundant "Abundance skew" >= 2 (user-settable) Input is estimated amplicons + abundances NOT reads!

  17. Sort amplicons by decreasing abundance Start with empty DB For each amplicon: Search DB for parents with >= 2x abundance If chimeric hit: Classify as chimera and discard query If not chimeric hit: Add to reference DB

  18. Hits found by ref DB only (rare?) De novo Ref DB Hits found by de novo only (common? Hits found by both

  19. Two modes check each other De novo should have better coverage All parents should be present Should examine hits found by ref DB but not by de novo See UCHIME manual for more discussion.

  20. A 81 CCTTGGTAGGCCGtTGCCCTGCCAACTA GCTAATCAGACGC gggtCCATCtcaCACCaccggAgtTTTtcTCaCTgTacc 160 Q 81 CCTTGGTAGGCCGCTGCCCTGCCAACTA GCTAATCAGACGC ATCCCCATCCATCACCGATAAATCTTTAATCTCTTTCAG 160 B 81 TCTTGGTgGGCCGtTaCCCcGCCAACaA GCTAATCAGACGC ATCCCCATCCATCACCGATAAATCTTTAAaCTCTTTCAG 160 Diffs A A p A A A BBBB BBB BBBBB BB BBa B B BBB VotesY Y A Y Y Y YYYY YYY YYYYY YY YYN Y Y YYY Model AAAAAAAAAAAAAAAAAAAAAAAAAAAA xxxxxxxxxxxxx BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB Abstain vote No vote

  21. A 81 CCTTGGTAGGCCGtTGCCCTGCCAACTA GCTAATCAGACGC gggtCCATCtcaCACCaccggAgtTTTtcTCaCTgTacc 160 Q 81 CCTTGGTAGGCCGCTGCCCTGCCAACTA GCTAATCAGACGC ATCCCCATCCATCACCGATAAATCTTTAATCTCTTTCAG 160 B 81 TCTTGGTgGGCCGtTaCCCcGCCAACaA GCTAATCAGACGC ATCCCCATCCATCACCGATAAATCTTTAAaCTCTTTCAG 160 Diffs A A p A A A BBBB BBB BBBBB BB BBa B B BBB VotesY Y A Y Y Y YYYY YYY YYYYY YY YYN Y Y YYY Model AAAAAAAAAAAAAAAAAAAAAAAAAAAA xxxxxxxxxxxxx BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB Model = segment of A + segment of B Chimeric if model closer to Q than A or B Left closer to A and right closer to B Closer if Y > N Ratio Y/N > 1

  22. Observations: Y=Yes votes, N=No votes, A=abstain votes H= Y/ [ (N+n) +A] Parameters: =weight of No vote, n=prior number of No votes. Larger score = more likely to be chimera Default: chimera if HLeftx HRight 0.3 Threshold 0.3 adjusts sensitivity vs. false-positive rate.

  23. Real communities Don't know all the biological sequences Can't distinguish chimeras from real (that's the problem!) Mock community Do you really know all the 16S sequences -- no! Communities too "easy" To few biological sequences, too well separated Simulation How realistic is it -- we don't know. No definitive validation

  24. Length 300 simulated bimeras with 0 - 5% mutations.

  25. Length 300 simulated multimeras with 1% substitutions.

  26. Noisy regions align well enough

  27. Distribution of chimeric read fraction in 1247 samples (V3V5 and V6V4 ) Number of samples 0 5 10 15 20 25 30 Percent Chimeric

  28. Chimera Detection Rate vs. GAST Distance 100% 90% 80% 70% Percent of Reads 60% 50% 40% 30% 20% 10% 0% 0 0.05 0.1 0.15 0.2 0.25 0.3 GAST Distance HMP V3-V5 HMP V6-V4 LSM V6-V4

  29. Open-source version Source code donated to public domain USEARCH version Leverages proprietary algorithms 10x or more faster than open-source version

  30. 100x faster than Perseus 1,000x faster than ChimeraSlayer USEARCH version 10x or more faster again.

  31. Many subtle issues Read manual & Supp. Mat. carefully!

  32. Database incomplete Missing species Missing paralogs 16S duplications are common Probably high rates of false negatives

  33. Parents should be present Probably low false negative rate vs. ref db. False positive rate not well known Mock community validation may be optimistic Error correction required Input MUST be amplicons & abundances Usually means starting from raw reads Cannot use on processed seqs. (e.g. RDP)

  34. Convergent evolution in different clades Different rates in different regions Biological chimeras Bad sequences Bad alignments

  35. Bad A Errors Good A Bad A Errors

  36. Full-length gene Shotgun fragments Paired-end reads Reference database method De novo mode not possible with shotgun

  37. Screen each end separately Using standard UCHIME in ref db. mode For each end E1,E2 Find closest parent P1,P2

  38. P2 P2 Gap d1 using P1 P2 E2 E1 P1 Gap d2 using P2 NNNN... E1 E2 Pad gap with (d1 + d2)/2 Ns

  39. UCHIME ref db. on padded pair. Ns don't count as diffs. NNNN... E1 E2

  40. Ref db. mode can be run with any set of seqs. Later is more efficient (fewer seqs.) De novo mode no choice: Requires full set of amplicons & abundances Must follow denoise/error correction step De novo first Ref db. second Two modes can check each other

  41. Based on USEARCH/UCLUST/UCHIME Described in afternoon talk on clustering.

  42. http://drive5.com/uchime

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#