Challenges in GWAS: Data Quality Control and SNP QC
GWAS face challenges such as data quality control issues, small sample sizes, computational burden, and inadequate SNP coverage. SNP QC for GWAS involves identifying problems such as deviation from Hardy-Weinberg equilibrium, high fractions of missing genotypes, and frequency differences between separate control sets. Raw intensities must be converted to genotypes, and SNPs whose intensities form more than 3 clusters complicate genotype calling.
Presentation Transcript
Quality control for GWAS Jeff Barrett
Challenges to GWAS?
- Data quality control
- No common, single-SNP main effects (all epistasis or rare variants or ...)
- Sample size too small to detect effects
- Computational burden
- Multiple testing correction will drown signal
- Unmatched controls / population structure
- SNP chips don't cover enough of the genome
SNP QC
SNP QC for GWAS aims to systematically identify these problems:
- Deviation from Hardy-Weinberg equilibrium (the expected frequencies of the three possible genotypes)
- Fraction of missing genotypes
- Frequency differences in separate controls (if available)
...but the scale is huge: the biggest meta-analyses involve > 1 trillion genotypes!
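The first two SNP-level checks can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the talk: it uses a 1-degree-of-freedom chi-square test for Hardy-Weinberg equilibrium (exact tests are often preferred for rare alleles), and the function names and the use of None to mark a missing call are assumptions made for the example.

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) of Hardy-Weinberg equilibrium from genotype counts.

    Returns (chi2, p_value). A very small p-value suggests the observed
    genotype counts deviate from the expected p^2 : 2pq : q^2 proportions.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no called genotypes")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

def missing_fraction(genotypes):
    """Fraction of missing calls for one SNP (None marks a missing genotype)."""
    return sum(g is None for g in genotypes) / len(genotypes)

if __name__ == "__main__":
    print(hwe_chi_square(25, 50, 25))   # counts in perfect HWE proportions
    print(hwe_chi_square(50, 0, 50))    # no heterozygotes at all
    print(missing_fraction([0, 1, None, 2]))
```

In practice each SNP would be flagged when its HWE p-value or missing fraction crosses a study-specific threshold; the thresholds themselves are a design choice and are not given in the slides.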
Plate effects Transition to SSF site
Sample QC
Collecting, processing and genotyping thousands of samples (often from many different clinicians, hospitals, countries...) is difficult.
- Duplicates
- Unexpected relatives
- Samples with different ancestry
- Low quality DNA samples
- Sample mix-ups
The good news is that simple analyses at scale are very informative.
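As one example of a simple analysis at scale, duplicated samples can be spotted by comparing genotypes between every pair of samples. The sketch below assumes genotypes are coded as 0, 1 or 2 copies of an allele with None for a missing call; the function names and the 0.99 threshold are hypothetical choices for illustration. Unexpected relatives would need a finer measure (for example, estimated identity by descent), which this sketch does not attempt.

```python
from itertools import combinations

def concordance(g1, g2):
    """Fraction of SNPs with identical genotypes, over SNPs called in both samples."""
    pairs = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    if not pairs:
        return float("nan")
    return sum(a == b for a, b in pairs) / len(pairs)

def likely_duplicates(samples, threshold=0.99):
    """Return (id1, id2, concordance) for sample pairs at or above the threshold.

    `samples` maps a sample ID to its list of genotypes. The threshold is an
    illustrative assumption, not a value taken from the presentation.
    """
    flagged = []
    for (id1, g1), (id2, g2) in combinations(sorted(samples.items()), 2):
        c = concordance(g1, g2)
        if c >= threshold:  # NaN never passes this comparison
            flagged.append((id1, id2, c))
    return flagged

if __name__ == "__main__":
    samples = {
        "s1": [0, 1, 2, 1, 0],
        "s2": [0, 1, 2, 1, 0],   # same genotypes as s1
        "s3": [2, 1, 0, 0, 1],
    }
    print(likely_duplicates(samples))
```

The all-pairs loop is quadratic in the number of samples, so for thousands of samples it is usually run on a pruned subset of SNPs.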
Heterozygosity locally and globally
A key advantage of GWAS is the sheer volume of data, which allows simple analyses. A heterozygous sample at one SNP isn't particularly interesting, but what about across the entire genome?
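A genome-wide heterozygosity check might look like the following sketch. It assumes genotypes coded as 0, 1 or 2 copies of an allele (1 = heterozygous, None = missing) and flags samples whose heterozygosity rate lies more than a chosen number of standard deviations from the cohort mean. The function names and the default cut-off of 3 standard deviations are illustrative assumptions.

```python
import statistics

def heterozygosity_rate(genotypes):
    """Fraction of called genotypes that are heterozygous (coded as 1)."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return float("nan")
    return sum(g == 1 for g in called) / len(called)

def heterozygosity_outliers(samples, n_sd=3.0):
    """Return {sample_id: rate} for samples far from the mean heterozygosity.

    `samples` maps a sample ID to its genome-wide list of genotypes.
    `n_sd` is a hypothetical cut-off chosen for this example.
    """
    rates = {sid: heterozygosity_rate(g) for sid, g in samples.items()}
    values = list(rates.values())
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return {}
    return {sid: r for sid, r in rates.items() if abs(r - mean) > n_sd * sd}

if __name__ == "__main__":
    typical = [1, 1, 1, 0, 0, 0, 0, 0, 2, 2]   # heterozygosity rate 0.3
    excess = [1] * 9 + [0]                      # heterozygosity rate 0.9
    samples = {f"ok{i}": typical for i in range(10)}
    samples["bad"] = excess
    print(heterozygosity_outliers(samples, n_sd=2.0))
```

Unusually high heterozygosity is commonly read as a sign of sample contamination and unusually low heterozygosity as a sign of inbreeding or poor-quality DNA, so both tails are flagged.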
The missed warning signs
Fig. S5. The Manhattan plot displays the maximum log10(Bayes factor) (y-axis) for each of the analyzed SNPs. The SNPs are ordered by chromosome (alternate color bands) and, within chromosome, by physical position (x-axis). We tested the association of each SNP with EL using general, allelic, dominant and recessive models and the y-axis reports the maximum log10(Bayes factor) observed for each SNP. Genome wide significance is met with a Bayes factor > 1500 (log10(Bayes factor) > 3.2) and estimated global error rate of 6 × 10^-5 that was determined using simulations.
The missed warning signs
Table S1: Genome wide significant SNPs in the discovery, replication, and aggregated sets.
Sets: Discovery Set (801, 926); Replication Set (Elix 254, 341); Combined (1055, 1267).

SNP | Gene | Chrom | Alleles
rs1036819 | ZFAT | chr8:135681127 | CC v AA/AC
rs1455311 | - | chr4:80183611 | GG v AA/AG
rs2075650 | TOMM40/APOE | chr19:50087459 | AG/GG v AA
rs1436013 | - | chr8:58940672 | AC/CC v AA
rs9615362 | CELSR1 | chr22:45173805 | AC/CC v AA
rs10521157 | STX8 | chr17:9277095 | G v A
rs10924270 | KIF26B | chr1:243871287 | AG/GG v AA
rs7930940 | MS4A15 | chr11:60282362 | AG/GG v AA
rs415407 | ANKRD55 | chr5:55450713 | AC/CC v AA
rs508001 | CRTAC1 | chr10:99744963 | AC/CC v AA
rs6433379 | RAPGEF4 | chr2:173328711 | AG/GG v AA

Discovery Set (801, 926)
SNP | LOG10(BF) | p-value | OR | p(A)
rs1036819 | 10.92 | 2.89E-12 | 6.12 | 0.09/0.01
rs1455311 | 7.48 | 2.11E-09 | 3.77 | 0.10/0.03
rs2075650 | 6.39 | 8.81E-09 | 0.49 | 0.15/0.26
rs1436013 | 6.31 | 8.69E-09 | 0.52 | 0.23/0.36
rs9615362 | 5.89 | 5.00E-08 | 2.31 | 0.91/0.81
rs10521157 | 3.95 | 1.40E-06 | 1.52 | 0.81/0.74
rs10924270 | 5.23 | 1.10E-07 | 1.86 | 0.80/0.69
rs7930940 | 5.03 | 4.50E-07 | 2.54 | 0.94/0.86
rs415407 | 4.96 | 2.29E-07 | 1.88 | 0.82/0.70
rs508001 | 4.93 | 2.55E-07 | 0.52 | 0.15/0.26
rs6433379 | 4.86 | 1.82E-07 | 1.78 | 0.37/0.25

Replication Set (Elix 254, 341)
SNP | LOG10(BF) | p-value | OR | p(A)
rs1036819 | 14.60 | 1.39E-14 | 26.73 | 0.19/0.01
rs1455311 | 9.96 | 1.69E-11 | 7.89 | 0.21/0.03
rs2075650 | 2.08 | 0.000429 | 0.47 | 0.15/0.27
rs1436013 | 1.01 | 0.004532 | 0.57 | 0.22/0.34
rs9615362 | 0.05 | 0.064279 | 1.60 | 0.87/0.81
rs10521157 | 0.11 | 0.060676 | 1.41 | 0.82/0.77
rs10924270 | 1.56 | 0.001309 | 1.94 | 0.80/0.69
rs7930940 | 2.67 | 0.000298 | 3.31 | 0.95/0.84
rs415407 | 1.10 | 4.80E-03 | 1.94 | 0.85/0.76
rs508001 | 0.51 | 0.0164 | 0.60 | 0.18/0.28
rs6433379 | 0.34 | 0.019598 | 1.55 | 0.37/0.28

Combined (1055, 1267)
SNP | LOG10(BF) | p-value | OR | p(A)
rs1036819 | 24.67 | 1.09E-24 | 10.05 | 0.11/0.01
rs1455311 | 16.99 | 1.07E-18 | 4.93 | 0.12/0.03
rs2075650 | 9.39 | 8.05E-12 | 0.48 | 0.15/0.26
rs1436013 | 8.14 | 1.10E-10 | 0.53 | 0.23/0.35
rs9615362 | 6.53 | 9.07E-09 | 2.11 | 0.90/0.81
rs10521157 | 4.92 | 1.21E-08 | 1.49 | 0.81/0.75
rs10924270 | 7.72 | 3.06E-10 | 1.89 | 0.80/0.69
rs7930940 | 8.54 | 1.85E-10 | 2.80 | 0.94/0.86
rs415407 | 6.70 | 3.76E-09 | 1.89 | 0.83/0.72
rs508001 | 6.40 | 7.22E-09 | 0.54 | 0.16/0.26
rs6433379 | 5.92 | 1.20E-08 | 1.71 | 0.37/0.25

Column legends:
SNP = official dbSNP identifier.
Gene = official gene name for SNPs that are within 20kb from transcribed regions.
Chrom = chromosome and physical position of SNP.
Alleles = the two SNP alleles (allele 1 v allele 2) in the genetic model that reached strongest significance.
LOG10(BF) = the base-10 logarithm of the Bayes factor for the association relative to the null model of no association. Assuming uniform prior probabilities for the two hypotheses, the BF represents the posterior odds for association. Simulations showed that a BF > 1,500 (or log10(BF) > 3.2) has a global false detection rate of 6 for every 100,000 independent tests.
P-value = p-value for 1 degree of freedom test.
OR = odds ratio for EL in subjects who carry allele 1 relative to allele 2. For example, subjects who carry the allele 1 (CC) of SNP rs1036819 have 6.12 times the odds for EL compared to subjects who carry the allele 2 (AA/AC: either the genotype AA or AC).
P(A) = prevalence of allele 1 in cases and controls. For example, 9% of centenarians carry the allele CC of SNP rs1036819 compared to 1% of controls.
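The relationship between the Bayes factor threshold, its log10 value and the posterior odds described in the legend can be checked with a short calculation. This is only a sketch of the arithmetic under the legend's stated assumption of uniform priors (prior odds of 1); the function name is hypothetical.

```python
import math

def posterior_probability(bayes_factor, prior_odds=1.0):
    """Posterior probability of association from a Bayes factor.

    Posterior odds = Bayes factor x prior odds; with uniform priors the
    prior odds are 1, so the Bayes factor equals the posterior odds.
    """
    posterior_odds = bayes_factor * prior_odds
    return posterior_odds / (1.0 + posterior_odds)

if __name__ == "__main__":
    threshold_bf = 1500
    # log10 of the significance threshold quoted in the legend (about 3.2)
    print(math.log10(threshold_bf))
    # posterior probability of association at the threshold, uniform priors
    print(posterior_probability(threshold_bf))
    # expected false detections among 100,000 independent tests at rate 6e-5
    print(6e-5 * 100_000)
```

Note that the uniform-prior assumption gives every SNP even odds of being associated before any data are seen; a smaller `prior_odds` value lowers the posterior probability for the same Bayes factor.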
Useful references
Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 2007 Jun;447(7145):661-78.
Anderson CA, Pettersson FH, Clarke GM, Cardon LR, Morris AP, Zondervan KT. Data quality control in genetic case-control association studies. Nat. Protoc. 2010 Sep;5(9):1564-73.