Challenges in GWAS: Data Quality Control and SNP QC

Quality control for GWAS
Jeff Barrett
Challenges to GWAS?
Data quality control
No common, single SNP main effects (all epistasis or rare variants or …)
Sample size too small to detect effects
Computational burden
Multiple testing correction will drown signal
Unmatched controls / population structure
SNP chips don’t cover enough of the genome
What we want to work with
Getting from intensities to genotypes
SNP QC
 
SNP QC for GWAS aims to systematically identify these problems, using:
Departure from Hardy-Weinberg equilibrium (the expected frequencies of the three possible genotypes)
Fraction of missing genotypes
Frequency differences between separate control groups (if available)
…but the scale is huge: the biggest meta-analyses involve > 1 trillion genotypes!
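The first two checks on this slide can be sketched for a single SNP as below. This is a minimal illustration, not the pipeline used in the talk: the function name `snp_qc_stats`, the 0/1/2 genotype coding, and the use of `None` for a missing call are all assumptions made for the example. It uses the simple 1-degree-of-freedom chi-square test for Hardy-Weinberg; exact tests are generally preferred when genotype counts are low.

```python
import math

def snp_qc_stats(genotypes):
    """Return (missing_fraction, hwe_chi2, hwe_p_value) for one SNP.

    genotypes: list with one entry per sample, coded 0, 1 or 2
    (copies of the B allele), or None for a missing call.
    This coding is an assumption of the sketch.
    """
    called = [g for g in genotypes if g is not None]
    n = len(called)
    if n == 0:
        # Nothing was called, so HWE cannot be tested.
        return 1.0, float("nan"), float("nan")
    missing_fraction = 1.0 - n / len(genotypes)

    # Observed genotype counts: AA, AB, BB.
    observed = (called.count(0), called.count(1), called.count(2))

    # Allele frequency of A, estimated from the called genotypes.
    p = (2 * observed[0] + observed[1]) / (2 * n)
    q = 1.0 - p

    # Expected counts under Hardy-Weinberg: p^2, 2pq, q^2.
    expected = (p * p * n, 2 * p * q * n, q * q * n)

    # Pearson chi-square, skipping cells with zero expectation
    # (these occur only for monomorphic SNPs).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return missing_fraction, chi2, p_value
```

In practice a SNP would be flagged when the missing fraction is high or the Hardy-Weinberg p-value is very small; the cut-offs are study-specific and are not given on the slide.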
Calling wrinkles: > 3 clusters
Plate effects
 
Transition to SSF site
Calling wrinkles: monomorphics
Calling wrinkles: rare SNPs
Missing data a good predictor of bad calling
Sample QC
 
Collecting, processing and genotyping thousands of samples (often from many different clinicians, hospitals, countries...) is difficult.
Duplicates
Unexpected relatives
Samples with different ancestry
Low quality DNA samples
Sample mix-ups
The good news is that simple analyses at scale are very informative.
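As one example of a simple analysis at scale, duplicated samples can be found by comparing genotypes between every pair of samples. The sketch below is only an illustration: the names `concordance` and `find_duplicates`, the 0/1/2/`None` genotype coding, and the 0.99 threshold are assumptions, not values from the talk.

```python
from itertools import combinations

def concordance(a, b):
    """Fraction of SNPs with identical genotypes, over SNPs called in both samples."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return 0.0
    return sum(x == y for x, y in shared) / len(shared)

def find_duplicates(samples, threshold=0.99):
    """Return pairs of sample IDs whose genotype concordance is >= threshold.

    samples: dict mapping sample ID -> list of genotypes (0, 1, 2 or None),
    with SNPs in the same order for every sample.
    threshold: illustrative cut-off, chosen for this sketch.
    """
    return [
        (s1, s2)
        for s1, s2 in combinations(sorted(samples), 2)
        if concordance(samples[s1], samples[s2]) >= threshold
    ]
```

Comparing every pair scales quadratically with the number of samples, so real pipelines usually run this on a pruned subset of SNPs. Pairs with intermediate sharing would point to unexpected relatives instead of duplicates.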
Heterozygosity locally and globally
A key advantage of GWAS is the sheer volume of data, which allows simple analyses.
A heterozygous sample at one SNP isn’t particularly interesting, but what about across the entire genome?
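The genome-wide version of this idea can be sketched as a per-sample heterozygosity rate, with samples far from the cohort mean flagged for review. The function names, the 0/1/2/`None` coding, and the 3 standard deviation cut-off are assumptions made for this illustration only.

```python
import statistics

def heterozygosity_rate(genotypes):
    """Fraction of called genotypes that are heterozygous (coded 1)."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return float("nan")
    return sum(g == 1 for g in called) / len(called)

def flag_het_outliers(samples, n_sd=3.0):
    """Return sorted IDs of samples whose heterozygosity is an outlier.

    samples: dict mapping sample ID -> list of genotypes (0, 1, 2 or None).
    n_sd: number of standard deviations from the mean used as the cut-off
    (an illustrative default). Needs at least two samples with called data.
    """
    rates = {sid: heterozygosity_rate(g) for sid, g in samples.items()}
    mean = statistics.mean(rates.values())
    sd = statistics.stdev(rates.values())
    return sorted(sid for sid, r in rates.items() if abs(r - mean) > n_sd * sd)
```

Unusually high heterozygosity is commonly read as a sign of sample contamination, and unusually low heterozygosity as a sign of inbreeding or poor calling. This is why the rate is usually plotted against call rate, as on the next slide.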
Bad samples: call rate & heterozygosity
Data cleaning on X: gender
Bad samples: plate effects
Clean data matters!
Hit SNP 1
Hit SNP 2
The missed warning signs
The missed warning signs
The need for QC never dies
Useful references
Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Wellcome Trust Case Control Consortium. Nature. 2007 Jun;447(7145):661-78.
Data quality control in genetic case-control association studies. Anderson CA, Pettersson FH, Clarke GM, Cardon LR, Morris AP, Zondervan KT. Nat. Protoc. 2010 Sep;5(9):1564-73.

GWAS face challenges such as data quality control, small sample sizes, computational burden, multiple testing, population structure, and incomplete SNP coverage. SNP QC for GWAS involves checking for departures from Hardy-Weinberg equilibrium, missing genotypes, and frequency differences between control groups. Intensities must be converted to genotypes, and calling can go wrong when a SNP shows more than 3 clusters, is monomorphic, or is rare.


Uploaded on Feb 21, 2025



Presentation Transcript



  22. The missed warning signs
      Fig. S5. The Manhattan plot displays the maximum log10(Bayes Factor) (y-axis) for each of the analyzed SNPs. The SNPs are ordered by chromosome (alternate color bands) and, within chromosome, by physical position (x-axis). We tested the association of each SNP with EL using general, allelic, dominant and recessive models and the y-axis reports the maximum log10(Bayes factor) observed for each SNP. Genome wide significance is met with a Bayes factor > 1500 (log10(Bayes factor) > 3.2) and estimated global error rate of 6 × 10^-5 that was determined using simulations.

  23. The missed warning signs
      Table S1: Genome wide significant SNPs in the discovery, replication, and aggregated sets.

      SNP          Gene          Chrom            Alleles
      rs1036819    ZFAT          chr8:135681127   CC v AA/AC
      rs1455311    -             chr4:80183611    GG v AA/AG
      rs2075650    TOMM40/APOE   chr19:50087459   AG/GG v AA
      rs1436013    -             chr8:58940672    AC/CC v AA
      rs9615362    CELSR1        chr22:45173805   AC/CC v AA
      rs10521157   STX8          chr17:9277095    G v A
      rs10924270   KIF26B        chr1:243871287   AG/GG v AA
      rs7930940    MS4A15        chr11:60282362   AG/GG v AA
      rs415407     ANKRD55       chr5:55450713    AC/CC v AA
      rs508001     CRTAC1        chr10:99744963   AC/CC v AA
      rs6433379    RAPGEF4       chr2:173328711   AG/GG v AA

      Discovery Set (801, 926)
      SNP          LOG10(BF)   p-value    OR      p(A)
      rs1036819    10.92       2.89E-12   6.12    0.09/0.01
      rs1455311    7.48        2.11E-09   3.77    0.10/0.03
      rs2075650    6.39        8.81E-09   0.49    0.15/0.26
      rs1436013    6.31        8.69E-09   0.52    0.23/0.36
      rs9615362    5.89        5.00E-08   2.31    0.91/0.81
      rs10521157   3.95        1.40E-06   1.52    0.81/0.74
      rs10924270   5.23        1.10E-07   1.86    0.80/0.69
      rs7930940    5.03        4.50E-07   2.54    0.94/0.86
      rs415407     4.96        2.29E-07   1.88    0.82/0.70
      rs508001     4.93        2.55E-07   0.52    0.15/0.26
      rs6433379    4.86        1.82E-07   1.78    0.37/0.25

      Replication Set (Elix 254, 341)
      SNP          LOG10(BF)   p-value    OR      p(A)
      rs1036819    14.60       1.39E-14   26.73   0.19/0.01
      rs1455311    9.96        1.69E-11   7.89    0.21/0.03
      rs2075650    2.08        0.000429   0.47    0.15/0.27
      rs1436013    1.01        0.004532   0.57    0.22/0.34
      rs9615362    0.05        0.064279   1.60    0.87/0.81
      rs10521157   0.11        0.060676   1.41    0.82/0.77
      rs10924270   1.56        0.001309   1.94    0.80/0.69
      rs7930940    2.67        0.000298   3.31    0.95/0.84
      rs415407     1.10        4.80E-03   1.94    0.85/0.76
      rs508001     0.51        0.0164     0.60    0.18/0.28
      rs6433379    0.34        0.019598   1.55    0.37/0.28

      Combined (1055, 1267)
      SNP          LOG10(BF)   p-value    OR      p(A)
      rs1036819    24.67       1.09E-24   10.05   0.11/0.01
      rs1455311    16.99       1.07E-18   4.93    0.12/0.03
      rs2075650    9.39        8.05E-12   0.48    0.15/0.26
      rs1436013    8.14        1.10E-10   0.53    0.23/0.35
      rs9615362    6.53        9.07E-09   2.11    0.90/0.81
      rs10521157   4.92        1.21E-08   1.49    0.81/0.75
      rs10924270   7.72        3.06E-10   1.89    0.80/0.69
      rs7930940    8.54        1.85E-10   2.80    0.94/0.86
      rs415407     6.70        3.76E-09   1.89    0.83/0.72
      rs508001     6.40        7.22E-09   0.54    0.16/0.26
      rs6433379    5.92        1.20E-08   1.71    0.37/0.25

      Column legends:
      SNP = official dbSNP identifier.
      Gene = official gene name for SNPs that are within 20kb from transcribed regions.
      Chrom = Chromosome and physical position of SNP.
      Alleles = the two SNP alleles (allele 1 v allele 2) in the genetic model that reached strongest significance.
      LOG10(BF) = the logarithm 10 Bayes Factor for the association relative to the null model of no association. Assuming uniform prior probabilities for the two hypotheses, the BF represents the posterior odds for association. Simulations showed that a BF > 1,500 (or log10(BF) > 3.2) has a global false detection rate of 6 for every 100,000 independent tests.
      P-value = p-value for 1 degree of freedom test.
      OR = odds ratio for EL in subjects who carry allele 1 relative to allele 2. For example, subjects who carry the allele 1 (CC) of SNP rs1036819 have 6.12 times the odds for EL compared to subjects who carry the allele 2 (AA/AC: either the genotype AA or AC).
      P(A) = prevalence of allele 1 in cases and controls. For example, 9% of centenarians carry the allele CC of SNP rs1036819 compared to 1% of controls.

