Understanding GWAS: A Brief Overview of Genetic Association Studies
A genome-wide association study (GWAS) maps genes associated with a trait or disease by analyzing genetic markers (SNPs) throughout the genome. Each SNP is tested statistically for association with the trait, using linear regression or a chi-squared test, in a hypothesis-free approach. The stages leading up to a GWAS are: assessing how much genetics contributes to phenotypic variation (heritability analysis), epidemiological studies, data collection, and finally the genome-wide association analysis itself. Genetic association testing looks for association between the phenotype and genotyped markers, which reflect the underlying genetic variants through LD (correlation). Quantitative traits are analyzed with a linear regression model, where the null hypothesis is that the regression slope is zero.
Presentation Transcript
GWAS: Data quality control and analysis. Beben Benyamin, CNSG Teaching, 1 June 2016 (Shell and R scripts are available)
What is GWAS? A method to map genes associated with a trait/disease using genetic markers (SNPs) throughout the genome. Statistically tests for association between SNPs and the trait/disease, using linear regression or a chi-squared test. Hypothesis-free approach.
Stages of GWAS: (1) How much does genetics account for the phenotypic variation? Heritability analysis using twin and family studies; epidemiological studies of risk factors for disease. (2) Collection of data: genetic (SNP genotypes) and phenotypic (height measure, disease status). (3) Where are the genes? Genome-wide association analysis (GWAS).
GWAS Data: The genotype (i.e. DD, Dd, dd) at 100,000s to millions of SNPs. Data normally collected on 1,000s to 10,000s of individuals. Phenotype or clinical data: age, sex, other covariates.
Genetic Association: Test for genetic association between the disease/phenotype and genetic variants in the genome. In practice the test is for association between phenotype and marker (SNP); markers and genetic variants are linked through LD / correlation. (Figure: Weiss and Lange)
Quantitative traits - Linear model (regression): What are we testing here? What is the null hypothesis? Is the slope of the regression (beta) significantly different from zero? The slope is fitted through the mean effects of the genotypes. Y = μ + b*x + e, where μ is the intercept, b is the slope (beta), and x = 0, 1, 2 for genotypes DD, Dd and dd. (Powell, lecture notes)
http://www.discoveryandinnovation.com/BIOL202/notes/lecture25.html
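As a rough illustration of the slope test described in the linear-model slide, the sketch below fits Y = mu + b*x + e by ordinary least squares and reports the t statistic for the slope. The genotype codes and trait values are hypothetical toy data, and the function name `slope_test` is made up for this example. In practice the t statistic would be compared with a t distribution with n - 2 degrees of freedom to obtain a p-value.

```python
import math

def slope_test(x, y):
    """Least-squares fit of y = mu + b*x + e; returns (mu, b, se_b, t)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx                      # slope (effect per copy of the coded allele)
    mu = mean_y - b * mean_x           # intercept
    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    rss = sum((yi - (mu + b * xi)) ** 2 for xi, yi in zip(x, y))
    se_b = math.sqrt(rss / (n - 2) / sxx)
    t = b / se_b                       # test statistic for H0: b = 0
    return mu, b, se_b, t

# Hypothetical toy data: genotype codes (0, 1, 2) and a quantitative trait.
genotypes = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
trait = [168.0, 170.5, 169.0, 171.0, 172.5, 170.0, 173.0, 174.5, 173.0, 176.0]

mu, b, se_b, t = slope_test(genotypes, trait)
print(f"intercept={mu:.2f} slope={b:.2f} se={se_b:.2f} t={t:.2f}")
```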
Case-control (Clarke et al, Nat Prot 2011)
Case / Control chi-squared test: Is there a difference in allele frequency between cases and controls for SNP rsxxxxxxx? E.g. a case / control study in 10651 cases and 10495 controls.

Observed   Case    Control   Total
D          11781   10327     22108
d           9521   10663     20184
Total      21302   20990     42292

Test for independence: chi-squared test on the (2x2) contingency table, or Fisher exact test. (Powell, lecture notes)
Case/Control chi-squared test: Expected = (Row Total x Column Total) / Grand Total. Example for Case D = (22108 x 21302)/42292 = 11136.

           Case                   Control
           Observed   Expected    Observed   Expected    Total
D          11781      11136       10327      10972       22108
d           9521      10166       10663      10018       20184
Total      21302                  20990                  42292

X^2 = (11781-11136)^2/11136 + (10327-10972)^2/10972 + (9521-10166)^2/10166 + (10663-10018)^2/10018 = 157.7
P-value = 5 x 10^-36 (Powell, lecture notes)
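The arithmetic in this slide can be reproduced with a short sketch. It uses the observed allele counts from the table and only the Python standard library; the function name `allelic_chi2` is made up for this example. Because the code keeps unrounded expected counts, the statistic can differ slightly from the slide's value, which was computed from expected counts rounded to whole numbers.

```python
import math

def allelic_chi2(case_D, case_d, control_D, control_d):
    """Chi-squared test of independence on a 2x2 table of allele counts."""
    observed = [[case_D, control_D], [case_d, control_d]]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected = (row total x column total) / grand total
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, P(X^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

chi2, p_value = allelic_chi2(11781, 9521, 10327, 10663)
print(f"chi2={chi2:.1f} p={p_value:.1e}")
```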
Logistic regression: Similar to linear regression, but used for binary outcomes instead of continuous outcomes. Let Yi be the phenotype for individual i: Yi = 0 for controls, Yi = 1 for cases. Let Xi be the genotype of individual i at a particular SNP: dd Xi = 0, Dd Xi = 1, DD Xi = 2. (Powell, lecture notes)
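To make the Yi / Xi coding concrete, here is a minimal sketch of a single-SNP logistic regression fitted by Newton-Raphson, using only the standard library. The genotype counts are hypothetical, and `fit_logistic` is an illustrative helper rather than the method used in the lecture; real analyses would normally rely on PLINK or a statistics package and would add covariates such as age and sex.

```python
import math

def fit_logistic(x, y, iterations=25):
    """Fit logit P(Y=1) = a + b*x by Newton-Raphson; returns (a, b, se_b)."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        # Score vector (g0, g1) and information matrix [[h00, h01], [h01, h11]].
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * delta = score.
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    # Standard error of b from the inverse information matrix
    # (taken from the final iteration, which is effectively at convergence).
    se_b = math.sqrt(h00 / det)
    return a, b, se_b

# Hypothetical genotype counts (dd=0, Dd=1, DD=2): 100 controls, then 100 cases.
geno = [0] * 40 + [1] * 40 + [2] * 20 + [0] * 25 + [1] * 45 + [2] * 30
status = [0] * 100 + [1] * 100

a, b, se_b = fit_logistic(geno, status)
print(f"log odds ratio per D allele={b:.3f} se={se_b:.3f} OR={math.exp(b):.2f}")
```

Here exp(b) is the estimated odds ratio per additional copy of allele D, and b / se_b gives a Wald statistic for the null hypothesis b = 0.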
Association models and methods. Models: Allelic: D versus d; Dominant: (DD, Dd) versus dd; Recessive: DD versus (Dd, dd); Genotypic: DD versus Dd versus dd. Methods: linear regression, ANOVA, other (maximum likelihood, Bayesian). Specialized software: PLINK (http://pngu.mgh.harvard.edu/~purcell/plink/), MERLIN (http://www.sph.umich.edu/csg/abecasis/merlin/)
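One way to see how these models differ is in how each one turns a genotype into a predictor for a regression or contingency test. The sketch below is illustrative only: the function name `encode` and the model labels are assumptions for this example. The allelic comparison (D versus d) counts alleles rather than individuals, so the per-allele "additive" coding is shown as its closest regression analogue, not as the same test.

```python
def encode(genotype, model):
    """Map a genotype string ("DD", "Dd", "dd") to a numeric predictor."""
    n_D = genotype.count("D")  # copies of allele D: 0, 1, or 2
    if model == "additive":    # per-allele coding (regression analogue of allelic)
        return n_D
    if model == "dominant":    # (DD, Dd) versus dd
        return 1 if n_D >= 1 else 0
    if model == "recessive":   # DD versus (Dd, dd)
        return 1 if n_D == 2 else 0
    if model == "genotypic":   # separate indicators for Dd and DD (dd is baseline)
        return (1 if n_D == 1 else 0, 1 if n_D == 2 else 0)
    raise ValueError(f"unknown model: {model}")

for model in ("additive", "dominant", "recessive", "genotypic"):
    print(model, [encode(g, model) for g in ("dd", "Dd", "DD")])
```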
Data quality control and cleaning: Individual-level QC. Individuals with discordant sex: use chr X (males have one copy, so they cannot be heterozygous); indicates plating errors or sample mix-ups. Individuals with high missing genotype rate: indicates a sample problem, e.g. low-quality DNA. Individuals with extreme heterozygosity rate: indicates sample contamination (high) or inbreeding (low). Related or duplicated individuals: use IBS or GRM; removing them avoids bias in the test statistic. Ancestry outliers: removing them avoids population stratification.
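Two of the individual-level checks above, missing genotype rate and heterozygosity rate, can be sketched in a few lines. The sample IDs, genotype calls, and thresholds below are all hypothetical; `sample_qc` is an illustrative helper, not part of the lecture's scripts. The toy call uses a loose 1.5 SD cut-off only because six samples are too few for a stricter outlier rule to trigger.

```python
from statistics import mean, stdev

def sample_qc(genotypes, max_missing=0.05, het_sd=3.0):
    """Flag samples with high missingness or outlying heterozygosity.

    genotypes maps sample ID -> list of calls coded 0/1/2, or None if missing.
    Thresholds are illustrative assumptions, not recommended values.
    """
    stats = {}
    for sample_id, calls in genotypes.items():
        called = [g for g in calls if g is not None]
        missing_rate = 1 - len(called) / len(calls)
        het_rate = sum(1 for g in called if g == 1) / len(called)
        stats[sample_id] = (missing_rate, het_rate)

    het_rates = [het for _, het in stats.values()]
    het_mean, het_spread = mean(het_rates), stdev(het_rates)

    flagged = {}
    for sample_id, (missing_rate, het_rate) in stats.items():
        reasons = []
        if missing_rate > max_missing:
            reasons.append("high missingness")
        if abs(het_rate - het_mean) > het_sd * het_spread:
            reasons.append("extreme heterozygosity")
        if reasons:
            flagged[sample_id] = reasons
    return stats, flagged

# Hypothetical toy data: 6 samples typed at 10 SNPs.
toy = {
    "S1": [0, 1, 0, 2, 1, 0, 0, 1, 2, 0],
    "S2": [1, 0, 0, 1, 2, 0, 1, 0, 0, 2],
    "S3": [0, 1, 1, 0, 2, 1, 0, 0, 1, 0],
    "S4": [2, 0, 1, 0, 0, 1, 0, 2, 1, 0],
    "S5": [0, None, 1, 0, 2, None, 1, 0, 0, 2],  # two missing calls
    "S6": [1, 1, 1, 1, 1, 0, 1, 1, 1, 1],        # nearly all heterozygous
}

stats, flagged = sample_qc(toy, max_missing=0.05, het_sd=1.5)
print(flagged)
```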
Data quality control and cleaning: SNP-level QC. SNPs with high missing genotype rate: indicates bad-quality SNPs. SNPs that deviate from Hardy-Weinberg equilibrium: genotype frequencies should be stable from generation to generation, (1-p)^2; 2p(1-p); p^2; tested in controls. SNPs with low minor allele frequency: difficult to call using the genotyping platform; indicates bad-quality SNPs. SNPs with differential missingness in cases and controls: indicates confounding with disease status, especially when cases and controls are processed separately. Comparing allele frequencies to a known database, such as HapMap or 1000G.
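The per-SNP quantities behind these filters (call rate, minor allele frequency, and a Hardy-Weinberg test) can be computed from genotype counts, as in the sketch below. The counts are hypothetical and `snp_qc` is an illustrative helper. The Hardy-Weinberg check here is a simple 1-degree-of-freedom chi-squared goodness-of-fit test, chosen for brevity; it is an assumption of this example, and dedicated software may use an exact test instead.

```python
import math

def snp_qc(n_DD, n_Dd, n_dd, n_missing):
    """Return (call_rate, maf, hwe_p) for one SNP from genotype counts."""
    n = n_DD + n_Dd + n_dd
    call_rate = n / (n + n_missing)
    p = (2 * n_DD + n_Dd) / (2 * n)        # frequency of allele D
    maf = min(p, 1 - p)
    if maf == 0:
        return call_rate, maf, 1.0         # monomorphic SNP: nothing to test
    # Expected counts under Hardy-Weinberg: p^2, 2p(1-p), (1-p)^2.
    expected = [n * p * p, n * 2 * p * (1 - p), n * (1 - p) ** 2]
    observed = [n_DD, n_Dd, n_dd]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    hwe_p = math.erfc(math.sqrt(chi2 / 2))  # chi-squared tail, 1 df
    return call_rate, maf, hwe_p

# Hypothetical counts: one SNP in Hardy-Weinberg proportions, one with too few hets.
in_hwe = snp_qc(360, 480, 160, n_missing=10)
off_hwe = snp_qc(450, 300, 250, n_missing=10)
print(in_hwe)
print(off_hwe)
```

A SNP would then be kept or dropped by comparing these values with study-specific thresholds for call rate, MAF, and the Hardy-Weinberg p-value.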
Example: Data before QC. (Figure: four histograms: number of individuals by genotyping rate, number of SNPs by genotyping rate, number of SNPs by HWE P value, and number of SNPs by MAF.)
Example: Ancestry information. (Figure: PC1 versus PC2 for study samples (CASE, CONTROL) and reference populations CEU, CHB, JPT, YRI. 6 SD from CHB mean: 34 ethnic outliers.)
Example: PCA after QCs. (Figure: two PC1 versus PC2 plots of study samples (CASE, CONTROL), one with reference populations CEU, CHB, JPT, YRI.)
Example: Samples relatedness. (Figure: four frequency histograms of PI HAT (PLINK): all individuals, PI HAT>0.25, PI HAT<0.25, and 0.1<PI HAT<0.3.)
Example: MAF Comparison: SNPs passed QC (N: 447,237). (Figure: two MAF comparison plots, r=0.990 and r=0.994.)