Genome-Wide Association Studies in Statistical Genetics Workshop

Data QC / cleaning in Genome-Wide Association Studies (GWAS)

2023 Statistical Genetics workshop

Presenter: Daniel Howrigan, Data group leader, Neale Lab (MGH, Broad Institute)

Slides adapted from previous workshop presenters: Lucia Colodro Conde (QIMR), Katrina Grasby (QIMR), Shaun Purcell (HMS)

With help from: John Kemp (University of Queensland) and Daniel Gustavson (IBG)

Session Outline – genetic data QC

• Lecture portion (~40 minutes)
    • Goals of GWAS
    • What does genetic data look like?
    • GWAS Quality Control (QC)
• Practical portion (~40 minutes)
    • Viewing genotype data
    • Sample and SNP QC
    • Relatedness checking
    • Principal components analysis (PCA)

Goals of Genome Wide Association Studies

• Go from trait heritability towards biological mechanism
    • What genes/genetic variants drive heritable differences?
• Genome-wide interrogation
    • Moving away from candidate gene studies
    • Technological advancement and dropping cost
• Flexible application of study design
    • All heritable traits can be studied
    • Biological/mathematical properties of DNA quite robust

[Figures: GWAS of Schizophrenia; GWAS of ~4,200 traits]

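What does genetic data look like?

Single Nucleotide Polymorphism (SNP)

• DNA bases: adenine (A), thymine (T), cytosine (C), guanine (G)
• Genetic variation: differences in the sequence of DNA among individuals
• Mutation: a newly arisen variant
• Allele 1 = C
• Allele 2 = A
• Bi-allelic combinations = C/C, C/A, A/A

[Figure: the SNP shown on the Maternal Chromosome and the Paternal Chromosome]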

GWAS

Genotyping on a chip

Affymetrix 6.0 chip:
• >900,000 SNPs
• CNV probes
• 82% coverage CEU HapMap
• Accuracy 99.90%

Illumina Human1M BeadChip:
• >1 million SNPs
• CNV probes
• 95% coverage CEU HapMap
• Accuracy 99.94%

From DNA to data

Good SNP (Illumina chip example)

[Figure: Raw Intensity and Normalized Intensity plots with three genotype clusters, T/T, T/G, and G/G. Each dot is an individual genotype.]

Same SNP, different view

[Figure: the same genotype clusters (AA, Aa, aa) plotted with axes Overall Intensity and Angular position]

SNPs with different allele frequencies

MAF = Minor Allele Frequency

[Figure: cluster plots (genotypes AA, AB, BB) for three SNPs: High MAF, Less common MAF, and Monoallelic in the sample]

• “Common SNPs” = MAF > 5%? 1%? 0.1%?
• “Low Frequency SNPs” = MAF < 1%
• “Ultra-rare variants” = MAF < 1e-5 (1 in 100k)
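As a minimal sketch of how MAF follows from the genotype counts behind a cluster plot: each individual carries two alleles, so the frequency of one allele is (2 × homozygotes + heterozygotes) / (2 × N), and the MAF is the smaller of the two allele frequencies. The genotype counts and the 1% cutoff below are illustrative assumptions, not values from the workshop data.

```python
def minor_allele_frequency(n_aa, n_ab, n_bb):
    """Return the minor allele frequency from AA/AB/BB genotype counts."""
    n_alleles = 2 * (n_aa + n_ab + n_bb)
    if n_alleles == 0:
        raise ValueError("no called genotypes")
    freq_a = (2 * n_aa + n_ab) / n_alleles
    return min(freq_a, 1.0 - freq_a)


def frequency_class(maf, low_frequency_cutoff=0.01):
    """Label a SNP by MAF; the 1% cutoff is an assumed convention."""
    if maf == 0:
        return "monoallelic in the sample"
    if maf < low_frequency_cutoff:
        return "low frequency"
    return "common"


# Hypothetical genotype counts (AA, AB, BB) for three SNPs
for counts in [(60, 100, 40), (190, 10, 0), (200, 0, 0)]:
    maf = minor_allele_frequency(*counts)
    print(counts, round(maf, 4), frequency_class(maf))
```

A SNP that is monoallelic in the sample has MAF = 0 and carries no information for association testing, which is why it is treated as its own class here.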

Bad SNP call examples

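[Figures: three cluster plots of badly called SNPs, with genotype clusters labeled homref (A1/A1), het (A1/A2), and homalt (A2/A2). One plot is annotated “Deletion?” and “Duplications?”]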

PLINK data format of GWAS data

Samples: .fam file

FID       IID        PID        MID      SEX      AFF

• FID = family ID
• IID = individual ID
• PID = paternal ID
• MID = maternal ID
• AFF = affection status
    • 1 = control
    • 2 = case
    • -9 or 0 = unknown

Genetic variants: .bim file (or .map file)

CHR     SNP ID               CM     POS      A1     A2

• CHR = chromosome
• POS = position
• CM = Centimorgan (often unused)
• A1 = 0 allele
• A2 = 1 allele

Genotype data: .ped file → (compression) → .bed file

0101010010101010101
1010011101010101010
1101110101001010101
1101001011101101010
1101010101010111010
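The column layouts above can be read with a few lines of code. The sketch below parses one whitespace-delimited line from a .fam file and one from a .bim file into named fields, following the column order shown on the slide. The example lines and the SNP identifier are hypothetical, made up for illustration.

```python
from typing import NamedTuple


class FamRecord(NamedTuple):
    fid: str   # family ID
    iid: str   # individual ID
    pid: str   # paternal ID
    mid: str   # maternal ID
    sex: str
    aff: str   # affection status


class BimRecord(NamedTuple):
    chrom: str
    snp_id: str
    cm: float  # centimorgan position (often unused)
    pos: int   # base-pair position
    a1: str
    a2: str


AFF_LABELS = {"1": "control", "2": "case", "0": "unknown", "-9": "unknown"}


def parse_fam_line(line):
    fields = line.split()
    if len(fields) != 6:
        raise ValueError("expected 6 columns in a .fam line")
    return FamRecord(*fields)


def parse_bim_line(line):
    chrom, snp_id, cm, pos, a1, a2 = line.split()
    return BimRecord(chrom, snp_id, float(cm), int(pos), a1, a2)


# Hypothetical example lines
sample = parse_fam_line("FAM1 IND1 0 0 1 2")
variant = parse_bim_line("1 rs0000001 0 12345 C G")
print(sample.iid, AFF_LABELS[sample.aff])
print(variant.snp_id, variant.pos, variant.a1, variant.a2)
```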

GWAS QC

GWAS Quality Control (QC)

• GOAL: Remove bad samples/SNPs, keep good samples/SNPs
• Preliminary strategies (first pass)
    • Poorly genotyped samples / SNP markers
    • Potential genotype/phenotype mismatches
    • Deviation away from expected heterozygosity
    • Related or duplicated samples (population-based data)
• Follow-up strategies
    • Batch effects
    • Quality differences between datasets
    • Comparison with reference data
    • …and more

Sample QC

• Poorly genotyped individuals
    • Poor quality DNA (high number of failed SNP calls)
    • Contaminated DNA (unusual levels of heterozygosity)
• Reporting error
    • Indications of sample mix-up (sex check or ancestry match)
• Related individuals
    • Family-based and population-based samples require different experimental designs
    • Related individuals can bias test statistics across the whole genome
    • In family-based association: Mendelian errors used as QC

SNP QC

• Poorly genotyped SNPs
    • Poor primer design / nonspecific DNA binding (high number of failed SNP calls)
    • Poor clustering of genotype intensities (deviation from HWE)
    • Mendelian errors (if family-based data available)
• Uninformative SNPs (too rare or mono-allelic)
• Follow-up on association signals
    • No QC protocol will eliminate all instances of genotyping error
    • Re-analyze original intensity of significant associations (whenever possible)
    • For meta-analysis, examining heterogeneity of SNP effect

Preliminary QC steps

• SAMPLE: Sex-check (chr X heterozygosity)
• SNP: Genotyping Call Rate (genotypes missed in individuals)
• SAMPLE: Sample Call Rate (individuals missing genotypes)
• SNP: Hardy-Weinberg Equilibrium
• SAMPLE: Proportion of Heterozygosity
• SAMPLE/SNP: Mendelian errors
• SAMPLE: Genetic Relatedness

Confirming genetic sex

• Primary question: Does the sample-level data correctly match the SNP data?
• Female sex = XX
• Male sex = XY

Example .sexcheck file from PLINK (male=1, female=2). The F column is the chromosome X F-statistic: close to 1 in males, close to 0 in females.

     FID         IID       PEDSEX       SNPSEX       STATUS            F
    T304      T30411            1            1           OK       0.9857
  A0641C   06410021C            1            1           OK       0.9841
  T06013    T2601310            2            2           OK     -0.06164
  T01533    T2153321            1            1           OK       0.9841
    T330      T33021            1            1           OK       0.9867
    T191      T19120            2            2           OK      0.01155
    T329      T32911            1            1           OK       0.9839
  T07981    T2798111            1            1           OK       0.9822
  A0601C   06010021C            1            1           OK       0.9858
  A1008C   10080011C            1            1           OK       0.9817
  A0880C   08800331C            1            1           OK       0.9818
  T00894    T2089420            2            2           OK      0.01927
  A0701C   07010011C            1            1           OK       0.9807
  T02911    T2291121            1            1           OK       0.9851
  T00588    T2058811            1            2      PROBLEM      -0.3396
  A0805C   08050031C            1            1           OK       0.9821
  T07755    T2775520            2            2           OK     -0.09906
  T03676    T2367611            1            1           OK       0.9845
    T082      T08220            2            1      PROBLEM       0.9833
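The logic behind the STATUS column can be sketched as follows. Males carry a single X chromosome, so they look fully homozygous on chr X (F near 1), while females show roughly the expected heterozygosity (F near 0). The 0.2 and 0.8 cutoffs are PLINK 1.9's documented defaults for `--check-sex`; the function names are hypothetical, and this is an illustration of the comparison, not PLINK's own code.

```python
def infer_snp_sex(f_stat, female_max=0.2, male_min=0.8):
    """Return 1 (male), 2 (female), or 0 (ambiguous) from the chr X F-statistic."""
    if f_stat < female_max:
        return 2
    if f_stat > male_min:
        return 1
    return 0


def sexcheck_status(pedsex, f_stat):
    """Compare the reported sex (PEDSEX) with the genotype-inferred sex (SNPSEX)."""
    snpsex = infer_snp_sex(f_stat)
    return "OK" if pedsex != 0 and pedsex == snpsex else "PROBLEM"


# Three rows from the .sexcheck example: (IID, PEDSEX, F)
for iid, pedsex, f_stat in [("T30411", 1, 0.9857),
                            ("T2058811", 1, -0.3396),
                            ("T08220", 2, 0.9833)]:
    print(iid, pedsex, infer_snp_sex(f_stat), sexcheck_status(pedsex, f_stat))
```

A PROBLEM row does not say which record is wrong: the phenotype file and the genotype data disagree, which may point to a sample mix-up or a reporting error.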

SNP genotyping call rate (“missingness”)

• Usually done iteratively
    • Remove SNPs with < 95% call rate
    • Run sample QC
    • Remove SNPs with < 98% call rate
• For case/control data
    • Look at difference in genotyping rate
    • Threshold usually at > 2% call rate difference

Bad SNP design, poor clustering…

Example .lmiss file from PLINK

CHR  SNP         N_MISS  N_GENO  F_MISS
1    rs12565286  6       200     0.03
1    rs12124819  8       200     0.04
1    rs4970383   0       200     0
1    rs13303118  0       200     0
1    rs35940137  0       200     0
1    rs2465136   1       200     0.005
1    rs2488991   0       200     0
1    rs3766192   0       200     0
1    rs10907177  0       200     0

Example .missing file from PLINK

CHR  SNP         F_MISS_A  F_MISS_U  P
1    rs12565286  0.03125   0.03093   1
1    rs12124819  0.05208   0.03093   0.4974
1    rs2465136   0         0.01031   1
1    rs4970357   0         0.02062   0.4974
1    rs11466691  0         0.01031   1
1    rs11466681  0.01042   0.01031   1
1    rs34945898  0.03125   0         0.1211
1    rs715643    0.05208   0.02062   0.2787
1    rs13306651  0.01042   0.03093   0.6211
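The two-pass filter and the case/control comparison can be sketched from the columns in the example files. Call rate is 1 − N_MISS / N_GENO, and the differential check compares F_MISS_A (cases) with F_MISS_U (controls). The rows below are copied from the example tables; the function names are hypothetical, and the sample QC step between the two passes is omitted.

```python
def call_rate(n_miss, n_geno):
    """Fraction of samples with a called genotype at this SNP."""
    return 1.0 - n_miss / n_geno


def failing_snps(rows, min_call_rate):
    """Return SNP IDs whose call rate is below the threshold."""
    return [snp for snp, n_miss, n_geno in rows
            if call_rate(n_miss, n_geno) < min_call_rate]


def differential_missingness(f_miss_cases, f_miss_controls, max_diff=0.02):
    """Flag a SNP whose missingness differs by more than max_diff between groups."""
    return abs(f_miss_cases - f_miss_controls) > max_diff


# (SNP, N_MISS, N_GENO) rows from the .lmiss example
lmiss_rows = [("rs12565286", 6, 200), ("rs12124819", 8, 200),
              ("rs4970383", 0, 200), ("rs2465136", 1, 200)]

print(failing_snps(lmiss_rows, 0.95))  # first, lenient pass
print(failing_snps(lmiss_rows, 0.98))  # second, stricter pass
print(differential_missingness(0.03125, 0.0))
```

Running the lenient pass first keeps a few very poor samples from inflating per-SNP missingness before those samples have been removed.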

Sample genotyping call rate

Example .imiss file from PLINK

FID      IID      MISS_PHENO  N_MISS  N_GENO  F_MISS
NA20505  NA20505  N           122     100310  0.001216
NA20504  NA20504  N           1406    100310  0.01402
NA20506  NA20506  N           204     100310  0.002034
NA20502  NA20502  N           847     100310  0.008444
NA20528  NA20528  N           219     100310  0.002183
NA20531  NA20531  N           96      100310  0.000957
NA20534  NA20534  N           338     100310  0.00337
NA20535  NA20535  N           182     100310  0.001814
NA20586  NA20586  N           214     100310  0.002133

Low quality DNA, degradation, lab error, contamination

http://zzz.bwh.harvard.edu/plink/summary.shtml#missing

Hardy-Weinberg Equilibrium (HWE)

• A genetic variant is said to be in HWE if the genotype proportions can be predicted by the allele frequencies in the following way:
• If:
    • $f(A1) = p$
    • $f(A2) = q$
    • $p + q = 1$
• Then:
    • $f(A1/A1) = p^2$
    • $f(A1/A2) = 2pq$
    • $f(A2/A2) = q^2$
    • $p^2 + 2pq + q^2 = 1$

Example:
• $p = 0.2$, $q = 0.8$
• $p^2 = 0.04$
• $2pq = 0.32$
• $q^2 = 0.64$

In C/T SNP terms:
• C allele freq. = 20%
• T allele freq. = 80%
• C/C freq. = 4%
• C/T freq. = 32%
• T/T freq. = 64%
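The worked example can be checked with a short sketch that turns an allele frequency into the three expected genotype frequencies. The function name is hypothetical; the formulas are the HWE proportions given above.

```python
def hwe_genotype_frequencies(p):
    """Expected genotype frequencies under HWE for allele frequency p of A1."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    q = 1.0 - p
    return {"A1/A1": p * p, "A1/A2": 2 * p * q, "A2/A2": q * q}


# C/T SNP example from the slide: C allele frequency 20%
freqs = hwe_genotype_frequencies(0.2)
for genotype, freq in freqs.items():
    print(genotype, round(freq, 4))
print("sum:", round(sum(freqs.values()), 4))
```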

Testing for deviation from HWE

Deviations from HWE can be caused by:
• Non-random mating (inbreeding, assortative mating, …)
• Population stratification
• Mutation
• Limited population size
• Random genetic drift
• Gene flow
• Genotyping errors
• Selection (→ may be due to true association!)

So only extreme deviation from HWE ($p < 10^{-6}$) is worrisome.

Example .hardy output in PLINK

CHR  SNP         TEST   A1  A2  GENO       O(HET)   E(HET)   P
1    rs12565286  ALL    C   G   0/17/170   0.09091  0.08678  1
1    rs12565286  AFF    C   G   0/6/87     0.06452  0.06243  1
1    rs12565286  UNAFF  C   G   0/11/83    0.117    0.1102   1
1    rs12124819  ALL    G   A   0/77/108   0.4162   0.3296   6.919e-05
1    rs12124819  AFF    G   A   0/41/50    0.4505   0.3491   0.004878
1    rs12124819  UNAFF  G   A   0/36/58    0.383    0.3096   0.02001
1    rs4970383   ALL    A   C   10/68/115  0.3523   0.352    1
1    rs4970383   AFF    A   C   3/36/57    0.375    0.3418   0.5488
1    rs4970383   UNAFF  A   C   7/32/58    0.3299   0.3618   0.401
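The O(HET) and E(HET) columns can be reproduced from the GENO counts (A1/A1, A1/A2, A2/A2), which shows what the test compares. This sketch covers only the observed and expected heterozygosity. It does not reproduce the P column, which PLINK derives from an exact test. The function name is hypothetical.

```python
def het_from_geno(geno):
    """Return (observed, expected) heterozygosity from a GENO string like '0/77/108'."""
    n_a1a1, n_het, n_a2a2 = (int(x) for x in geno.split("/"))
    n = n_a1a1 + n_het + n_a2a2
    p = (2 * n_a1a1 + n_het) / (2 * n)  # frequency of allele A1
    observed = n_het / n
    expected = 2 * p * (1 - p)          # 2pq under HWE
    return observed, expected


# ALL rows from the .hardy example
for snp, geno in [("rs12565286", "0/17/170"),
                  ("rs12124819", "0/77/108"),
                  ("rs4970383", "10/68/115")]:
    obs, exp = het_from_geno(geno)
    print(snp, geno, round(obs, 4), round(exp, 4))
```

For rs12124819 the sample has many more heterozygotes than 2pq predicts and no A1/A1 homozygotes at all, which is the pattern behind its small p-value.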

Proportion of heterozygosity (Fhet)

http://zzz.bwh.harvard.edu/plink/ibdibs.shtml#inbreeding

Mendelian errors

• Requires parent-offspring data
• Similar to genotyping rate, can be examined at sample and SNP level
• High sample-level Mendelian error rate
    • Parental uncertainty
• High SNP-level Mendelian error rate
    • Poor genotype quality
• A de novo mutation is a type of Mendelian error

[Figure: trio with parental genotypes AA and AA and offspring genotype AT]

https://www.cog-genomics.org/plink/1.9/basic_stats#mendel
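A Mendelian error check for a single trio can be sketched as follows: the child's genotype is consistent only if it can be formed from one allele of each parent. This simplified version assumes autosomal SNPs with no missing calls and genotypes written as two-letter strings. The function name is hypothetical.

```python
from itertools import product


def is_mendelian_consistent(father, mother, child):
    """True if the child genotype can be formed from one allele of each parent."""
    child_alleles = sorted(child)
    return any(sorted((pa, ma)) == child_alleles
               for pa, ma in product(father, mother))


# (father, mother, child) genotypes at one SNP
trios = [("AA", "AA", "AT"),   # error: neither parent carries T
         ("AT", "AA", "AT"),   # consistent
         ("AT", "AT", "TT")]   # consistent
for father, mother, child in trios:
    print(father, mother, child, is_mendelian_consistent(father, mother, child))
```

Counting failures per sample points to pedigree problems, while counting them per SNP points to poor genotype quality, matching the two levels described above.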

Linkage disequilibrium (LD) allows us to be more robust with our QC protocols

• TL;DR: “Nearby SNPs are correlated”
• Properties of linkage disequilibrium reduce the loss of signal sensitivity when removing SNPs
• Strict multiple testing correction often requires very large samples - no single sample will drive a signal
• LD must be taken into account when examining genetic relatedness, population stratification, and interpreting association

Genetic relatedness using Identity-By-Descent (IBD) calculation

• Question: Across how much of the genome does a pair of samples share 0, 1, or both alleles?
    • Identical twins: Share both alleles across the entire genome (barring mutation events)
• Requires using LD-pruned SNPs for accurate estimates
    • Want each SNP to be an “independent” marker
• Used to both “confirm” and “filter” related individuals

Checking genotype relatedness across samples

Example of .genome file in PLINK

FID1     IID1     FID2     IID2     RT  EZ  Z0      Z1      Z2      PI_HAT  PHE  DST       PPC     RATIO
NA20505  NA20505  NA20506  NA20506  UN  NA  0.9872  0.0000  0.0128  0.0128  -1   0.771435  0.3446  1.9712
NA20505  NA20505  NA20502  NA20502  UN  NA  0.9888  0.0096  0.0016  0.0064  -1   0.770233  0.3950  1.9808
NA20505  NA20505  NA20528  NA20528  UN  NA  0.9733  0.0267  0.0000  0.0133  -1   0.770068  0.2922  1.9606
NA20505  NA20505  NA20531  NA20531  UN  NA  0.9789  0.0205  0.0006  0.0109  -1   0.770976  0.7407  2.0479
NA20505  NA20505  NA20534  NA20534  UN  NA  0.9602  0.0398  0.0000  0.0199  -1   0.772123  0.3046  1.9631
NA20505  NA20505  NA20535  NA20535  UN  NA  0.9650  0.0350  0.0000  0.0175  -1   0.771054  0.6510  2.0285
NA20505  NA20505  NA20586  NA20586  UN  NA  0.9728  0.0272  0.0000  0.0136  -1   0.770687  0.4281  1.9869
NA20505  NA20505  NA20756  NA20756  UN  NA  0.9675  0.0325  0.0000  0.0163  -1   0.770762  0.6902  2.0365
NA20505  NA20505  NA20760  NA20760  UN  NA  0.9344  0.0656  0.0000  0.0328  0    0.770978  0.8856  2.0904
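In the .genome output, Z0, Z1, and Z2 are the estimated proportions of the genome where the pair shares 0, 1, or 2 alleles IBD, and PI_HAT summarizes them as Z2 + 0.5 × Z1. A minimal sketch that recomputes PI_HAT for rows of the example table, and for the textbook expectations of a few relationships:

```python
def pi_hat(z1, z2):
    """Proportion of alleles shared IBD: P(IBD=2) + 0.5 * P(IBD=1)."""
    return z2 + 0.5 * z1


# (Z0, Z1, Z2, reported PI_HAT) taken from the .genome example
rows = [(0.9872, 0.0000, 0.0128, 0.0128),
        (0.9888, 0.0096, 0.0016, 0.0064),
        (0.9344, 0.0656, 0.0000, 0.0328)]
for z0, z1, z2, reported in rows:
    print(round(pi_hat(z1, z2), 4), reported)

# Expected (Z0, Z1, Z2) for well-known relationships
expected = {"identical twins": (0.0, 0.0, 1.0),
            "parent-offspring": (0.0, 1.0, 0.0),
            "full siblings": (0.25, 0.5, 0.25),
            "unrelated": (1.0, 0.0, 0.0)}
for name, (z0, z1, z2) in expected.items():
    print(name, pi_hat(z1, z2))
```

Parent-offspring and full-sibling pairs both have an expected PI_HAT of 0.5, so telling them apart requires looking at Z0, Z1, and Z2 themselves.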

Using genetic relatedness estimates

• Confirm unrelated or “population-based” sample ascertainment
    • Filter out related samples (pi-hat > 0.2 often used)
    • “Cryptic relatedness”: related individuals identified in an “unrelated” sample
• Confirm family structure (pedigree)
    • Ensure parent-child and sibling relationship
• Watch out for distinct ancestries
    • Can skew IBD estimates and incorrectly identify recent relatedness
    • PCrelate more robust to these patterns: https://rdrr.io/bioc/GENESIS/man/pcrelate.html
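One way to apply the pi-hat > 0.2 filter is sketched below: keep only the pairs above the threshold, then repeatedly drop the sample involved in the most remaining related pairs, so that as few samples as possible are removed. The greedy rule is an assumption chosen for illustration, not the workshop's procedure, and the sample IDs and pi-hat values are hypothetical.

```python
def samples_to_remove(pairs, threshold=0.2):
    """Greedily pick samples to drop so no pair with pi-hat > threshold remains."""
    related = [(a, b) for a, b, pi in pairs if pi > threshold]
    removed = set()
    while related:
        counts = {}
        for a, b in related:
            counts[a] = counts.get(a, 0) + 1
            counts[b] = counts.get(b, 0) + 1
        # Sample in the most related pairs; ties broken alphabetically
        worst = max(sorted(counts), key=counts.get)
        removed.add(worst)
        related = [(a, b) for a, b in related if worst not in (a, b)]
    return removed


# Hypothetical (IID1, IID2, PI_HAT) pairs
pairs = [("S1", "S2", 0.50), ("S2", "S3", 0.25), ("S4", "S5", 0.03)]
print(samples_to_remove(pairs))
```

In practice the choice of which member of a pair to drop is often guided by other criteria as well, such as call rate or case status.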

Session Outline – genetic data QC

• Practical portion (~40 minutes)
    • Data checking
    • Sample and SNP QC
    • Relatedness checking
    • Principal components analysis (PCA)
• Go to: workshop.colorado.edu
    • Slides + practical: /faculty/daniel/2023/QC
    • Terminal: workshop.colorado.edu/ssh
    • RStudio: workshop.colorado.edu/rstudio

Script that you will be working through: QC_practical_statgenWorkshop2023.txt

Full path: /faculty/daniel/2023/QC/QC_practical_statgenWorkshop2023.txt

Walk through this script and copy/paste commands to the ssh command line

Qualtrics version: https://ucsas.qualtrics.com/jfe/form/SV_eWpdYL7srw7Cy6W

Answers to be filled out by a single table member

See the ISGW forum for these and other useful links to start your practical session: https://isgw-forum.colorado.edu/

# 1.1 Creating workspace
## Create day1 subdirectory (-p creates any missing parent directories)
mkdir -p ~/day1/QC
## Change into the new subdirectory
cd ~/day1/QC

# 1.2 Copying over genetic dataset
## Copy the files to your working subdirectory
cp /faculty/daniel/2023/QC/* .
## Check you have the required files:
ls -l
# HM3.bed
# HM3.bim
# HM3.fam
# QC_practical_BoulderWorkshop2023.R
# QC_practical_BoulderWorkshop2023.sh
# QC_practical_BoulderWorkshop2023.txt
# cc.ped
# cc.map

## === Main QC ===
# STEP 1. Data and Formats
# STEP 2. Check for reported/genotype sex discrepancies
# STEP 3. Obtain information on individuals missing SNP data
# STEP 4. Variant QC: SNPs missing data; MAF; Hardy-Weinberg
# STEP 5. Sample QC: genotype call rate and heterozygosity
# STEP 6. LD-pruned SNP set
# STEP 7. Sample QC: sex check filtering using LD-pruned SNP set
# STEP 8. Sample QC: Checking for cryptic relatedness
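The practical script itself is not reproduced here, so the following is only a sketch of how steps 2 to 8 might map onto PLINK 1.9 commands. The flags are standard PLINK 1.9 options, but every threshold, the LD-pruning window, and all output file names are assumptions for illustration and may differ from the workshop script. The code only builds and prints the command lines; it does not run PLINK.

```python
def plink_cmd(bfile, out, *flags):
    """Build a PLINK command line as a list of arguments (nothing is executed)."""
    return ["plink", "--bfile", bfile, *flags, "--out", out]


# Assumed thresholds and hypothetical output names
steps = {
    "STEP 2 sex check": plink_cmd("HM3", "HM3_sexcheck", "--check-sex"),
    "STEP 3 missingness": plink_cmd("HM3", "HM3_missing", "--missing"),
    "STEP 4 variant QC": plink_cmd("HM3", "HM3_snpqc", "--geno", "0.05",
                                   "--maf", "0.01", "--hwe", "1e-6", "--make-bed"),
    "STEP 5 heterozygosity": plink_cmd("HM3_snpqc", "HM3_het", "--het"),
    "STEP 6 LD pruning": plink_cmd("HM3_snpqc", "HM3_pruned",
                                   "--indep-pairwise", "50", "5", "0.2"),
    "STEP 7 sex check (pruned)": plink_cmd("HM3_snpqc", "HM3_sexcheck_pruned",
                                           "--extract", "HM3_pruned.prune.in",
                                           "--check-sex"),
    "STEP 8 relatedness": plink_cmd("HM3_snpqc", "HM3_ibd",
                                    "--extract", "HM3_pruned.prune.in", "--genome"),
}

for name, cmd in steps.items():
    print(f"# {name}")
    print(" ".join(cmd))
```

Steps 7 and 8 read the `.prune.in` list written by the LD-pruning step, reflecting the point that relatedness estimates need approximately independent markers.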






