Genome-Wide Association Studies in Statistical Genetics Workshop

 
Data QC / cleaning in Genome-Wide Association Studies (GWAS)
 
2023 Statistical Genetics workshop
Presenter: Daniel Howrigan
Data group leader 
 Neale Lab (MGH, Broad Institute)
 
Slides adapted from previous workshop presenters:
Lucia Colodro Conde (QIMR), Katrina Grasby (QIMR), Shaun Purcell (HMS)
 
With help from:
John Kemp (University of Queensland) and Daniel Gustavson (IBG)
 
Session Outline – genetic data QC
 
Lecture portion (~40 minutes)
Goals of GWAS
What does genetic data look like?
GWAS Quality Control (QC)
 
Practical portion (~40 minutes)
Viewing genotype data
Sample and SNP QC
Relatedness checking
Principal components analysis (PCA)
 
Goals of Genome Wide Association Studies
 
Go from trait heritability towards biological mechanism
What genes/genetic variants drive heritable differences?
 
Genome-wide interrogation
Moving away from candidate gene studies
Technological advancement and dropping cost
 
Flexible application of study design
All heritable traits can be studied
Biological/mathematical properties of DNA quite robust
 
 
GWAS of Schizophrenia
 
GWAS of ~4,200 traits
 
Genetic variation: differences in the sequence of DNA among individuals.
Mutation: a newly arisen variant
adenine (A), thymine (T), cytosine (C), guanine (G)
 
What does genetic data look like?
 
Single Nucleotide Polymorphism
SNP
 
Allele 1 = C
Allele 2 = A
Bi-allelic combinations = C/C, C/A, A/A
 
Maternal Chromosome
 
Paternal Chromosome
 
Examples of genetic variation
 
 
GWAS
 
Genotyping on a chip
 
Affymetrix:
 
6.0 chip
 
>900,000 SNPs
CNV probes
82% coverage CEU HapMap
Accuracy 99.90%
 
Illumina:
 
Human1M BeadChip
 
>1 million SNPs
CNV probes
95% coverage CEU HapMap
Accuracy 99.94%
 
 
From DNA to data
 
Good SNP (Illumina chip example)
 
T/T
 
T/G
 
G/G
Raw Intensity
Normalized Intensity
 
Each dot is an individual genotype
 
Same SNP, different view
 
aa
 
AA
 
Aa
 
Overall Intensity
 
Angular position
 
SNPs with different allele frequencies
 
AA
 
AB
 
BB
 
MAF = Minor Allele Frequency
 
High MAF
 
Less common MAF
 
Monoallelic in the sample
 
“Common SNPs” = MAF > 5%? 1%? 0.1%?

“Low Frequency SNPs” = MAF < 1%

“Ultra-rare variants” = MAF < 1e-5 (1 in 100k)
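As a rough illustration of how MAF is obtained, the sketch below counts alleles from the three genotype counts and reports the frequency of the rarer allele. The function name maf_from_counts is made up for this example; the counts 10/68/115 are the rs4970383 "ALL" row from the .hardy example later in these slides.

```shell
# Minor allele frequency from genotype counts (A1/A1, A1/A2, A2/A2).
# maf_from_counts is a hypothetical helper, not a PLINK command.
maf_from_counts() {
    # $1 = homozygous A1 count, $2 = heterozygous count, $3 = homozygous A2 count
    awk -v aa="$1" -v ab="$2" -v bb="$3" 'BEGIN {
        n = aa + ab + bb                 # number of genotyped individuals
        p = (2 * aa + ab) / (2 * n)      # frequency of allele A1
        maf = (p < 0.5) ? p : 1 - p      # the minor allele is the rarer one
        printf "%.4f\n", maf
    }'
}

maf=$(maf_from_counts 10 68 115)
echo "MAF = $maf"
```

A SNP that is monoallelic in the sample has MAF = 0 under this calculation, which is why it carries no information for association testing.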
 
Bad SNP call examples
Cluster labels: A1/A1 (homref), A1/A2 (het), A2/A2 (homalt)
 
Bad SNP
 
Another bad SNP
 
Deletion?
 
Duplications?
 
Another bad SNP
 
PLINK data format of GWAS data
 
Samples: .fam file
FID       IID        PID        MID      SEX      AFF
FID = family ID
IID = individual ID
PID = paternal ID
MID = maternal ID
AFF = affection status
1 = control
2 = case
-9 or 0 = unknown

Genetic variants: .bim file (or .map file)
CHR     SNP ID               CM     POS      A1     A2
CHR = chromosome
POS = position
CM = centimorgan (often unused)
A1 = 0 allele
A2 = 1 allele

Genotype data: .ped file, compressed into the binary .bed file
0101010010101010101
1010011101010101010
1101110101001010101
1101001011101101010
1101010101010111010
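Because .fam and .bim are plain whitespace-delimited text, they can be inspected with standard tools. The sketch below tallies affection status from a toy .fam table using the coding above; all sample IDs in it are made up for illustration.

```shell
# Toy .fam content: FID IID PID MID SEX AFF (IDs are hypothetical)
toy_fam='F1 S1 0 0 1 2
F2 S2 0 0 2 1
F3 S3 0 0 1 1
F4 S4 0 0 2 -9'

# Column 6 is AFF: 2 = case, 1 = control, 0 or -9 = unknown
fam_summary=$(printf '%s\n' "$toy_fam" | awk '
    $6 == 2             { cases++ }
    $6 == 1             { controls++ }
    $6 == 0 || $6 == -9 { unknown++ }
    END { printf "cases=%d controls=%d unknown=%d\n", cases, controls, unknown }
')
echo "$fam_summary"
```

The same pattern applies to a real file by replacing the printf pipe with the .fam file name as the awk input.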
 
GWAS QC
 
 
GWAS Quality Control (QC)
 
GOAL: Remove bad samples/SNPs, keep good samples/SNPs
 
Preliminary strategies (first pass)
Poorly genotyped samples / SNP markers
Potential genotype/phenotype mismatches
Deviation away from expected heterozygosity
Related or duplicated samples (population-based data)
 
Follow-up strategies
Batch effects
Quality differences between datasets
Comparison with reference data
…and more
 
Sample QC
 
Poorly genotyped individuals
Poor quality DNA (high number of failed SNP calls)
Contaminated DNA (unusual levels of heterozygosity)
Reporting error
Indications of sample mix-up (sex check or ancestry match)
Related individuals
Family-based and population-based samples require different experimental designs
Related individuals can bias test statistics across the whole genome
In family-based association: Mendelian errors used as QC
 
SNP QC
 
Poorly genotyped SNPs
Poor primer design / nonspecific DNA binding (high number of failed SNP calls)
Poor clustering of genotype intensities (deviation from HWE)
Mendelian errors (if family-based data available)
Uninformative SNPs (too rare or mono-allelic)
 
Follow-up on association signals
No QC protocol will eliminate all instances of genotyping error
Re-analyze original intensity of significant associations (whenever possible)
For meta-analysis, examining heterogeneity of SNP effect
 
Preliminary QC steps
 
SAMPLE: Sex-check (chr X heterozygosity)
SNP: Genotyping Call Rate (genotypes missed in individuals)
SAMPLE: Sample Call Rate (individuals missing genotypes)
SNP: Hardy-Weinberg Equilibrium
SAMPLE: Proportion of Heterozygosity
SAMPLE/SNP: Mendelian errors
SAMPLE: Genetic Relatedness
 
Confirming genetic sex
 
Primary question: Does the sample-level data correctly match the SNP data?
 
    
FID         IID       PEDSEX       SNPSEX       STATUS            F
    T304      T30411            1            1           OK       0.9857
  A0641C   06410021C            1            1           OK       0.9841
  T06013    T2601310            2            2           OK     -0.06164
  T01533    T2153321            1            1           OK       0.9841
    T330      T33021            1            1           OK       0.9867
    T191      T19120            2            2           OK      0.01155
    T329      T32911            1            1           OK       0.9839
  T07981    T2798111            1            1           OK       0.9822
  A0601C   06010021C            1            1           OK       0.9858
  A1008C   10080011C            1            1           OK       0.9817
  A0880C   08800331C            1            1           OK       0.9818
  T00894    T2089420            2            2           OK      0.01927
  A0701C   07010011C            1            1           OK       0.9807
  T02911    T2291121            1            1           OK       0.9851
  T00588    T2058811            1            2      PROBLEM      -0.3396
  A0805C   08050031C            1            1           OK       0.9821
  T07755    T2775520            2            2           OK     -0.09906
  T03676    T2367611            1            1           OK       0.9845
    T082      T08220            2            1      PROBLEM       0.9833
 
Female sex = XX
Male sex = XY
 
Example .sexcheck file from PLINK (male=1, female=2)
Plot: chromosome X F-statistic, with groups labelled Male and Female
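One way to pull out the mismatches is to filter the .sexcheck table on its STATUS column. The sketch below does this with awk on a few rows copied from the example above. In the example, males have F near 1 (a single X looks fully homozygous) and females have F near 0; the variable names are illustrative.

```shell
# A few rows from the example .sexcheck output above
sexcheck='FID IID PEDSEX SNPSEX STATUS F
T304 T30411 1 1 OK 0.9857
T06013 T2601310 2 2 OK -0.06164
T00588 T2058811 1 2 PROBLEM -0.3396
T082 T08220 2 1 PROBLEM 0.9833'

# Keep rows where reported sex (PEDSEX) and genotype sex (SNPSEX) disagree
problems=$(printf '%s\n' "$sexcheck" |
    awk 'NR > 1 && $5 == "PROBLEM" { print $1, $2, "reported=" $3, "genotype=" $4 }')
echo "$problems"
```

Flagged samples are candidates for a sample mix-up or a reporting error, so they are usually checked against the records before being dropped.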
 
SNP genotyping call rate (“missingness”)
 
 
Usually done iteratively
Remove SNPs with < 95% call rate
Run sample QC
Remove SNPs with < 98% call rate

Example .lmiss file from PLINK
CHR  SNP         N_MISS  N_GENO  F_MISS
1    rs12565286  6       200     0.03
1    rs12124819  8       200     0.04
1    rs4970383   0       200     0
1    rs13303118  0       200     0
1    rs35940137  0       200     0
1    rs2465136   1       200     0.005
1    rs2488991   0       200     0
1    rs3766192   0       200     0
1    rs10907177  0       200     0

For case/control data
Look at difference in genotyping rate
Threshold usually at > 2% call rate difference

Example .missing file from PLINK
CHR  SNP         F_MISS_A  F_MISS_U  P
1    rs12565286  0.03125   0.03093   1
1    rs12124819  0.05208   0.03093   0.4974
1    rs2465136   0         0.01031   1
1    rs4970357   0         0.02062   0.4974
1    rs11466691  0         0.01031   1
1    rs11466681  0.01042   0.01031   1
1    rs34945898  0.03125   0         0.1211
1    rs715643    0.05208   0.02062   0.2787
1    rs13306651  0.01042   0.03093   0.6211

Bad SNP design, poor clustering…
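The two-pass call-rate filter and the case/control comparison can be sketched with awk on rows taken from the example tables. The helper names below are made up for this illustration; in the practical, PLINK applies these filters itself.

```shell
# Rows from the example .lmiss table (per-SNP missingness)
lmiss='CHR SNP N_MISS N_GENO F_MISS
1 rs12565286 6 200 0.03
1 rs12124819 8 200 0.04
1 rs4970383 0 200 0
1 rs2465136 1 200 0.005'

# Print SNPs whose missing rate exceeds (1 - minimum call rate)
snps_below_call_rate() {
    # $1 = minimum call rate, e.g. 0.95
    printf '%s\n' "$lmiss" | awk -v min="$1" 'NR > 1 && $5 > 1 - min { print $2 }'
}

pass1=$(snps_below_call_rate 0.95)   # lenient first pass
pass2=$(snps_below_call_rate 0.98)   # stricter pass after sample QC

# Rows from the example .missing table (cases vs controls)
ccmiss='CHR SNP F_MISS_A F_MISS_U P
1 rs12565286 0.03125 0.03093 1
1 rs34945898 0.03125 0 0.1211
1 rs11466681 0.01042 0.01031 1'

# Flag SNPs whose call rate differs by more than 2% between cases and controls
cc_flagged=$(printf '%s\n' "$ccmiss" |
    awk 'NR > 1 { d = $3 - $4; if (d < 0) d = -d; if (d > 0.02) print $2 }')

echo "fail at 95%: ${pass1:-none}"
echo "fail at 98%: $pass2"
echo "case/control difference: $cc_flagged"
```

Running the lenient pass first avoids discarding good samples because of a handful of very poor SNPs, and vice versa.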
 
Sample genotyping call rate
 
Example .imiss file from PLINK
FID      IID      MISS_PHENO  N_MISS  N_GENO  F_MISS
NA20505  NA20505  N           122     100310  0.001216
NA20504  NA20504  N           1406    100310  0.01402
NA20506  NA20506  N           204     100310  0.002034
NA20502  NA20502  N           847     100310  0.008444
NA20528  NA20528  N           219     100310  0.002183
NA20531  NA20531  N           96      100310  0.000957
NA20534  NA20534  N           338     100310  0.00337
NA20535  NA20535  N           182     100310  0.001814
NA20586  NA20586  N           214     100310  0.002133
 
http://zzz.bwh.harvard.edu/plink/summary.shtml#missing
 
Low quality DNA, degradation, lab error, contamination
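The F_MISS column is simply N_MISS / N_GENO, so the per-sample call rate can be recomputed directly. The sketch below does that for three rows of the example .imiss table and flags samples below a cut-off. The 1% missingness threshold is an assumption chosen for illustration; the slides do not state a sample-level threshold.

```shell
# Rows from the example .imiss table (per-sample missingness)
imiss='FID IID MISS_PHENO N_MISS N_GENO F_MISS
NA20505 NA20505 N 122 100310 0.001216
NA20504 NA20504 N 1406 100310 0.01402
NA20502 NA20502 N 847 100310 0.008444'

max_miss=0.01   # assumed threshold (99% call rate), not taken from the slides

# Recompute missingness from the counts and report call rate for failing samples
low_call=$(printf '%s\n' "$imiss" | awk -v max="$max_miss" 'NR > 1 {
    rate = $4 / $5
    if (rate > max) printf "%s %s %.4f\n", $1, $2, 1 - rate
}')
echo "$low_call"
```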
 
Hardy-Weinberg Equilibrium (HWE)
 
A genetic variant is said to be in HWE if the genotype proportions can be predicted by the allele frequencies in the following way:
If:
f(A1) = p
f(A2) = q
p + q = 1
Then:
f(A1/A1) = p^2
f(A1/A2) = 2pq
f(A2/A2) = q^2
p^2 + 2pq + q^2 = 1
 
Example:
 
p = 0.2
q = 0.8
 
p^2 = 0.04
2pq = 0.32
q^2 = 0.64
 
In C/T SNP terms:
 
C allele freq. = 20%
T allele freq. = 80%
 
C/C freq. = 4%
C/T freq. = 32%
T/T freq. = 64%
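The genotype proportions sum to 1 because they are the expansion of (p + q)^2. The short derivation below spells this out with the slide's numbers; the sample size N = 1000 in the last line is an arbitrary value chosen only to show expected genotype counts.

```latex
\begin{align*}
p + q &= 1 \\
(p + q)^2 &= p^2 + 2pq + q^2 = 1 \\
p = 0.2,\; q = 0.8:\quad 0.2^2 + 2(0.2)(0.8) + 0.8^2 &= 0.04 + 0.32 + 0.64 = 1 \\
N = 1000:\quad \left(Np^2,\; 2Npq,\; Nq^2\right) &= (40,\; 320,\; 640)
\end{align*}
```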
 
Testing for deviation from HWE
 
Deviations from HWE can be caused by:
Non-random mating (inbreeding, assortative mating, …)
Population stratification
Mutation
Limited population size
Random genetic drift
Gene flow
Genotyping errors
Selection (may be due to true association!)

So only extreme deviation from HWE (p < 10^-6) is worrisome.

Example .hardy output in PLINK
CHR  SNP         TEST   A1  A2  GENO       O(HET)   E(HET)   P
1    rs12565286  ALL    C   G   0/17/170   0.09091  0.08678  1
1    rs12565286  AFF    C   G   0/6/87     0.06452  0.06243  1
1    rs12565286  UNAFF  C   G   0/11/83    0.117    0.1102   1
1    rs12124819  ALL    G   A   0/77/108   0.4162   0.3296   6.919e-05
1    rs12124819  AFF    G   A   0/41/50    0.4505   0.3491   0.004878
1    rs12124819  UNAFF  G   A   0/36/58    0.383    0.3096   0.02001
1    rs4970383   ALL    A   C   10/68/115  0.3523   0.352    1
1    rs4970383   AFF    A   C   3/36/57    0.375    0.3418   0.5488
1    rs4970383   UNAFF  A   C   7/32/58    0.3299   0.3618   0.401
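The O(HET) and E(HET) columns can be reproduced from the GENO counts: observed heterozygosity is the heterozygote fraction, and expected heterozygosity is 2pq using the allele frequency estimated from the same counts. The sketch below checks this for two "ALL" rows of the example; het_from_counts is a made-up helper name, and the P column (the HWE test itself) is not recomputed here.

```shell
# Observed vs expected heterozygosity from genotype counts.
# het_from_counts is a hypothetical helper for illustration.
het_from_counts() {
    # $1/$2/$3 = counts of A1/A1, A1/A2, A2/A2 genotypes
    awk -v aa="$1" -v ab="$2" -v bb="$3" 'BEGIN {
        n = aa + ab + bb
        p = (2 * aa + ab) / (2 * n)     # A1 allele frequency
        q = 1 - p
        printf "O(HET)=%.4f E(HET)=%.4f\n", ab / n, 2 * p * q
    }'
}

het_12124819=$(het_from_counts 0 77 108)    # GENO 0/77/108
het_4970383=$(het_from_counts 10 68 115)    # GENO 10/68/115
echo "rs12124819 $het_12124819"
echo "rs4970383  $het_4970383"
```

For rs12124819 the observed heterozygosity is well above the expected value and no minor-allele homozygotes are seen, which is the kind of pattern that poor genotype clustering can produce.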
 
Proportion of heterozygosity (Fhet)
 
http://zzz.bwh.harvard.edu/plink/ibdibs.shtml#inbreeding
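PLINK's inbreeding coefficient compares each sample's observed count of homozygous genotypes with the count expected from allele frequencies: F = (O(HOM) - E(HOM)) / (N(NM) - E(HOM)). The sketch below applies that formula to two made-up samples; the sample IDs and counts are hypothetical values chosen so the arithmetic is easy to follow.

```shell
# Hypothetical per-sample counts: observed homozygous genotypes, expected
# homozygous genotypes, and number of non-missing genotypes
het='FID IID O_HOM E_HOM N_NM
S1 S1 70000 68000 100000
S2 S2 66400 68000 100000'

# F = (O_HOM - E_HOM) / (N_NM - E_HOM)
f_values=$(printf '%s\n' "$het" |
    awk 'NR > 1 { printf "%s %.4f\n", $2, ($3 - $4) / ($5 - $4) }')
echo "$f_values"
```

A clearly positive F means more homozygosity than expected (for example inbreeding), while a clearly negative F means excess heterozygosity, one of the signatures of contaminated DNA mentioned under Sample QC.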
 
Mendelian errors
 
https://www.cog-genomics.org/plink/1.9/basic_stats#mendel
 
Requires parent-offspring data

Similar to genotyping rate, can be examined at sample and SNP level

High sample-level Mendelian error rate: parental uncertainty

High SNP-level Mendelian error rate: poor genotype quality

Slide figure genotypes: AA, AA, AT

A de novo mutation also appears as a Mendelian error
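The logic of a Mendelian check for a single SNP in one trio is that the child must carry one allele found in the father and one found in the mother. The sketch below implements only that rule for autosomal, bi-allelic genotypes written as two letters; mendel_check is a hypothetical helper and is much simpler than PLINK's own Mendel error codes.

```shell
# Report "ok" or "error" for one SNP in one parent-offspring trio.
# mendel_check is a hypothetical helper; genotypes are strings such as AA or AT.
mendel_check() {
    # $1 = father, $2 = mother, $3 = child
    awk -v f="$1" -v m="$2" -v c="$3" 'BEGIN {
        c1 = substr(c, 1, 1); c2 = substr(c, 2, 1)
        # one child allele must be present in each parent
        ok = (index(f, c1) && index(m, c2)) || (index(f, c2) && index(m, c1))
        print (ok ? "ok" : "error")
    }'
}

mendel_check AA AA AT   # T is absent from both parents
mendel_check AT AA AT   # T can come from the father
```

The first call reports an error and the second does not. A genuine de novo mutation, a genotyping error and a mislabelled parent all produce the same flag, which is why errors are summarised per SNP and per sample before interpreting them.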
 
 
Linkage disequilibrium (LD) allows us to be more robust with our QC protocols

TL;DR: “Nearby SNPs are correlated”

Properties of linkage disequilibrium reduce the loss of signal sensitivity when removing SNPs

Strict multiple testing correction often requires very large samples - no single sample will drive a signal

LD must be taken into account when examining genetic relatedness, population stratification, and interpreting association
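One common way to quantify "nearby SNPs are correlated" is r^2, the squared correlation between allele counts (0, 1 or 2 copies) at two SNPs across samples. The sketch below computes it with awk for six made-up samples; the dosage values are hypothetical and chosen only to give a high but imperfect correlation.

```shell
# Allele counts at two SNPs for six hypothetical samples (one row per sample)
dosages='0 0
1 1
2 2
1 1
0 0
2 1'

# Squared Pearson correlation between the two columns
r2=$(printf '%s\n' "$dosages" | awk '
    { n++; sx += $1; sy += $2; sxx += $1 * $1; syy += $2 * $2; sxy += $1 * $2 }
    END {
        cov = n * sxy - sx * sy
        vx  = n * sxx - sx * sx
        vy  = n * syy - sy * sy
        printf "%.4f\n", cov * cov / (vx * vy)
    }')
echo "r2 = $r2"
```

When r^2 between two SNPs is high, removing one of them in QC loses little information, because the other still tags the same signal.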
 
Genetic relatedness using Identity-By-Descent (IBD) calculation

Question: Across the genome, how often does a pair of samples share 0, 1, or both alleles?

Identical twins: share both alleles across the entire genome (barring mutation events)

Requires using LD-pruned SNPs for accurate estimates
Want each SNP to be an “independent” marker

Used to both “confirm” and “filter” related individuals
 
 
Checking genotype relatedness across samples
FID1     IID1     FID2     IID2     RT  EZ  Z0      Z1      Z2      PI_HAT  PHE  DST       PPC     RATIO
NA20505  NA20505  NA20506  NA20506  UN  NA  0.9872  0.0000  0.0128  0.0128  -1   0.771435  0.3446  1.9712
NA20505  NA20505  NA20502  NA20502  UN  NA  0.9888  0.0096  0.0016  0.0064  -1   0.770233  0.3950  1.9808
NA20505  NA20505  NA20528  NA20528  UN  NA  0.9733  0.0267  0.0000  0.0133  -1   0.770068  0.2922  1.9606
NA20505  NA20505  NA20531  NA20531  UN  NA  0.9789  0.0205  0.0006  0.0109  -1   0.770976  0.7407  2.0479
NA20505  NA20505  NA20534  NA20534  UN  NA  0.9602  0.0398  0.0000  0.0199  -1   0.772123  0.3046  1.9631
NA20505  NA20505  NA20535  NA20535  UN  NA  0.9650  0.0350  0.0000  0.0175  -1   0.771054  0.6510  2.0285
NA20505  NA20505  NA20586  NA20586  UN  NA  0.9728  0.0272  0.0000  0.0136  -1   0.770687  0.4281  1.9869
NA20505  NA20505  NA20756  NA20756  UN  NA  0.9675  0.0325  0.0000  0.0163  -1   0.770762  0.6902  2.0365
NA20505  NA20505  NA20760  NA20760  UN  NA  0.9344  0.0656  0.0000  0.0328  0    0.770978  0.8856  2.0904
 
Example of .genome file in PLINK
 
Using genetic relatedness estimates
 
Confirm unrelated or “population-based” sample ascertainment
Filter out related samples (pi-hat > 0.2 often used)
“Cryptic relatedness”: related individuals identified in an “unrelated” sample
Confirm family structure (pedigree)
Ensure parent-child and sibling relationships
Watch out for distinct ancestries
Can skew IBD estimates and incorrectly identify recent relatedness
PC-Relate is more robust to these patterns
https://rdrr.io/bioc/GENESIS/man/pcrelate.html
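In the .genome output, PI_HAT is the estimated proportion of the genome shared IBD, which equals Z2 + 0.5 * Z1. The sketch below recomputes it for two pairs from the example table and applies the pi-hat > 0.2 filter. The third pair (SIB_A, SIB_B) is hypothetical, added with sibling-like sharing so that the filter has something to catch.

```shell
# Two pairs from the example .genome table plus one hypothetical related pair
genome='IID1 IID2 Z0 Z1 Z2
NA20505 NA20506 0.9872 0.0000 0.0128
NA20505 NA20502 0.9888 0.0096 0.0016
SIB_A SIB_B 0.2500 0.5000 0.2500'

# PI_HAT = P(IBD=2) + 0.5 * P(IBD=1); flag pairs above the cut-off
pairs=$(printf '%s\n' "$genome" | awk -v cut=0.2 'NR > 1 {
    pihat = $5 + 0.5 * $4
    printf "%s %s %.4f %s\n", $1, $2, pihat, (pihat > cut ? "related" : "keep")
}')
echo "$pairs"
```

For a flagged pair in a population-based sample, a common choice is to drop one member of the pair rather than both.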
 
Session Outline – genetic data QC
 
 
Practical portion (~40 minutes)
Data checking
Sample and SNP QC
Relatedness checking
Principal components analysis (PCA)
 
Go to: workshop.colorado.edu
Slides + practical: /faculty/daniel/2023/QC
Terminal: workshop.colorado.edu/ssh
RStudio: workshop.colorado.edu/rstudio
 
Script that you will be working through:
QC_practical_statgenWorkshop2023.txt
 
 
Full path: /faculty/daniel/2023/QC/QC_practical_statgenWorkshop2023.txt
 
Walk through this script and copy/paste commands to the ssh command line
 
 
Qualtrics version: https://ucsas.qualtrics.com/jfe/form/SV_eWpdYL7srw7Cy6W
 
Answers to be filled out by a single table member
 
 
See the ISGW forum for these and other useful links to start your practical session:
 
https://isgw-forum.colorado.edu/
 
 
 
# 1.1 Creating workspace
 
## Create the day1/QC subdirectory (-p also creates any missing parent directories)
mkdir -p ~/day1/QC

## Move into the new subdirectory
cd ~/day1/QC
 
# 1.2 Copying over genetic dataset
 
# Copy the files to your working subdirectory
cp /faculty/daniel/2023/QC/* .
 
# Check you have the required files:
 
ls -l
 
# HM3.bed
# HM3.bim
# HM3.fam
# QC_practical_BoulderWorkshop2023.R
# QC_practical_BoulderWorkshop2023.sh
# QC_practical_BoulderWorkshop2023.txt
# cc.ped
# cc.map
 
## === Main QC ===
 
# STEP 1. Data and Formats
 
# STEP 2. Check for reported/genotype sex discrepancies
 
# STEP 3. Obtain information on individuals missing SNP data
 
# STEP 4. Variant QC: SNPs missing data; MAF; Hardy-Weinberg
 
# STEP 5. Sample QC: genotype call rate and heterozygosity
 
# STEP 6. LD-pruned SNP set
 
# STEP 7. Sample QC: sex check filtering using LD-pruned SNP set
 
# STEP 8. Sample QC: Checking for cryptic relatedness
Uploaded on May 16, 2024

