Overview of General Quality Control (QC) Steps for UK Biobank Data


The presentation outlines the QC steps for handling UK Biobank data, including identifying unrelated EUR-ancestry individuals, QC of imputed data, and relatedness analysis within the EUR-ancestry group. It emphasizes the need to adjust datasets due to participants opting out and provides specific steps for QC analysis and data handling.

Uploaded by jun_h on Sep 22, 2024


Presentation Transcript

General QC of the UK Biobank Data Luke Evans 2020/04/22 IBG StatGen Meeting

Goals of the QC steps described next
- Identify a set of unrelated EUR-ancestry individuals (the largest ancestry group represented, and so the most powerful for many analyses).
- Lightly QC the imputed data across all individuals (not just EUR-ancestry) to provide a set of SNPs that would be more manageable than the raw data (binary plink format). Note: This won't be adequate for many applications. You will likely have to do some additional QC depending on what you are planning for your analyses.
- More heavily QC the imputed data within the EUR-ancestry individuals to provide a set of bgen files appropriate for running GWAS with BOLT-LMM. You may want different QC, or different sets of individuals.
- Heavily QC the array SNPs for the EUR-ancestry individuals for a specific analysis I was running. This step is included here because it turns out some others have wanted to use those SNP data.

Note about changing UK Biobank participants
- As people have opted out, the UK Biobank sends lists of redacted individuals. These are individuals who originally consented to be included and who have since decided they do not want to be involved.
- Past work does not have to be rerun or reanalyzed.
- New work should remove those individuals from the datasets.
- Richard has compiled the most current list of known redacted individuals here: /pl/active/IBG/UKBiobank_redacted_ids/ (see the README file in that directory).
- The QC steps described in these slides were originally done on the then-current lists of individuals who were still involved in the study.
- Over time, some of the individuals in these QC'd datasets have removed themselves from the study.
- I have chosen not to rerun every step when there are so few individuals to remove.
- You should remove these individuals from the dataset when you run your analyses with the QC'd data.
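The removal step can be sketched in Python as follows. This is a minimal illustration, not the procedure documented in the README: it assumes the redacted list has one participant ID per line (the actual file layout in that directory is not described in these slides), that the sample list is a PLINK .fam-style file, and that FID equals IID. All function names and IDs are made up for the example.

```python
def parse_ids(lines):
    """Return the first whitespace-delimited field of each non-empty line."""
    return [line.split()[0] for line in lines if line.strip()]


def drop_redacted(sample_ids, redacted_ids):
    """Remove withdrawn participants, preserving the original order."""
    redacted = set(redacted_ids)
    return [sid for sid in sample_ids if sid not in redacted]


def plink_remove_lines(ids):
    """Format IDs as 'FID IID' rows for PLINK's --remove (assumes FID == IID)."""
    return [f"{iid} {iid}" for iid in ids]


if __name__ == "__main__":
    # Made-up IDs for illustration only.
    fam_lines = ["1001 1001 0 0 1 -9", "1002 1002 0 0 2 -9", "1003 1003 0 0 1 -9"]
    redacted_lines = ["1002", ""]
    kept = drop_redacted(parse_ids(fam_lines), parse_ids(redacted_lines))
    print(kept)
```

Writing the output of plink_remove_lines to a file gives something that could be passed to --remove, so the QC'd files themselves never need to be regenerated.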

Relatedness within the EUR-ancestry group
See the documentation: /pl/active/IBG/relatedness_eur/README
Steps:
1. Identify EUR individuals
   - UKB-identified individuals projected onto the 1KGv3 PCs and falling within the range of the 1KGv3-defined EUR group
   - UKB only included those who self-identified as White British
   - We (Gargi) identified the min/max of those individuals' first 4 PCs, and then identified all other individuals who fell within those ranges on the first 4 PCs.
   - Those are the EUR-identified individuals here, encompassing a few (10K+?) more individuals than the UKB-identified group
2. Pull out autosomal array SNPs for only those individuals
3. Heavily QC those SNPs, including MAF & LD-pruning:
   --geno 0.05 --maf 0.05 --hwe 0.00000001 --remove /work/IBG/imputed_QC/indiv_excl_sex2.txt --keep UKB_european_iids.txt --indep-pairwise 50 5 0.2
4. Use those SNPs to calculate PCs from within those EUR-ancestry individuals
5. Generate GRM & identify set of individuals with a relatedness < 0.05 threshold
GRMs, list of retained individuals, and PCs are all within that directory
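The min/max selection in step 1 amounts to a bounding box on the first 4 PCs. The sketch below illustrates that idea only; it is not the script that was actually run. The data layout (a dict from individual ID to a tuple of PC values) and all names are assumptions made for the example.

```python
def pc_bounds(reference_pcs, n_pcs=4):
    """Per-PC (min, max) over the reference individuals' PC tuples."""
    bounds = []
    for k in range(n_pcs):
        values = [pcs[k] for pcs in reference_pcs]
        bounds.append((min(values), max(values)))
    return bounds


def within_bounds(pcs, bounds):
    """True if every one of the first len(bounds) PCs lies inside its range."""
    return all(lo <= pcs[k] <= hi for k, (lo, hi) in enumerate(bounds))


def select_by_pc_range(all_pcs, reference_ids, n_pcs=4):
    """IDs of everyone whose first n_pcs PCs fall inside the reference ranges."""
    reference = [all_pcs[iid] for iid in reference_ids]
    bounds = pc_bounds(reference, n_pcs)
    return [iid for iid, pcs in all_pcs.items() if within_bounds(pcs, bounds)]


if __name__ == "__main__":
    # Toy values: "a" and "b" play the role of the UKB-identified group.
    toy = {
        "a": (0.0, 0.0, 0.0, 0.0),
        "b": (1.0, 1.0, 1.0, 1.0),
        "c": (0.5, 0.5, 0.5, 0.5),  # inside the box on all 4 PCs
        "d": (2.0, 0.0, 0.0, 0.0),  # outside on PC1
    }
    print(select_by_pc_range(toy, ["a", "b"]))
```

Because the reference individuals always fall inside their own ranges, the selected set is a superset of the reference group, which is consistent with the slide's remark that the EUR-identified set is somewhat larger than the UKB-identified one.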

Imputed Data QC
See file: /pl/active/IBG/ukb_imputed_QC/README
- Goal of this process: To lightly QC the imputed SNP data so that the files are more manageable than the giant raw data that was downloaded.
- Second goal: To have a set of SNPs that was useful for projects I was working on.
- Important: You may have different QC requirements than these, so you may have to do some separate or additional QC steps for the specific things you're interested in. One QC does not fit all projects.

Imputed Data QC
First Dataset: SNPs for ALL INDIVIDUALS. Useful for a more manageable dataset containing most of the SNPs that most people would want to use, most of the time. You probably will still need to do some additional QC steps, depending on your goals.
Basic filtering done:
- MAC > 3
- INFO >= 0.3
- Remove INDELS
- Remove sex mismatch (self-report vs. genetic)
- Remove singletons & doubletons
- Convert to plink binary files with hard calls only
LOCATION: By chromosome
/pl/active/IBG/ukb_imputed_QC/chrom/plink_bed/chr.*.qc.[bed/bim/fam]
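The variant-level filters listed above can be written as a single predicate. The following is a sketch under assumptions: the Variant record and its field names are hypothetical, and the real filtering was presumably done with standard tools on the imputed files rather than in Python. The sample-level filter (sex mismatch) and the hard-call conversion are not shown.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    """Hypothetical per-variant summary record (field names are assumptions)."""
    rsid: str
    ref: str
    alt: str
    mac: int     # minor allele count
    info: float  # imputation INFO score


def is_snp(v):
    """True for single-base substitutions; anything longer is treated as an indel."""
    return (len(v.ref) == 1 and len(v.alt) == 1
            and v.ref in "ACGT" and v.alt in "ACGT")


def passes_light_qc(v, mac_gt=3, min_info=0.3):
    """Apply the slide's variant filters: MAC > 3, INFO >= 0.3, no indels."""
    return is_snp(v) and v.mac > mac_gt and v.info >= min_info


if __name__ == "__main__":
    # Made-up variants for illustration.
    variants = [
        Variant("rs1", "A", "G", 4, 0.30),    # kept
        Variant("rs2", "A", "G", 3, 0.99),    # dropped: MAC not > 3
        Variant("rs3", "AT", "A", 500, 0.99), # dropped: indel
        Variant("rs4", "C", "T", 500, 0.29),  # dropped: INFO too low
    ]
    print([v.rsid for v in variants if passes_light_qc(v)])
```

Note that MAC > 3 already excludes singletons (MAC = 1) and doubletons (MAC = 2), so no separate check is needed for them in this sketch.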

Imputed Data QC
Second Dataset: More common, well-imputed SNPs within EUR-ancestry group of individuals. This is in a useful format for BOLT-LMM analyses of EUR-ancestry (basic GWAS, though you may still need to do some additional QC of your own, depending on what you want to do with it).
Basic filtering done:
- maf >= 0.0001
- INFO >= 0.9
- per-variant missingness < 0.05
- hwe p > 10^-6, European ancestry
LOCATION: By chromosome
/work/IBG/ukb_imputed_QC/chrom/bgen/gwas_bivariate_MAF01_INFO95_eur/chr.*.qc.[bgen/sample]
(I made the thresholds a bit more lax after having started, so the directory name stuck, even though it's inaccurate.)
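When scripting a per-chromosome analysis over these files, it can help to pair each .bgen with its .sample file and order the pairs by chromosome. The sketch below assumes the `*` in `chr.*.qc.[bgen/sample]` is a chromosome label such as `1` or `22`; that reading, and the function name, are assumptions rather than something stated in the slides.

```python
import re

# Assumed naming scheme: chr.<label>.qc.bgen and chr.<label>.qc.sample
_NAME = re.compile(r"^chr\.(?P<chrom>[^.]+)\.qc\.(?P<ext>bgen|sample)$")


def pair_bgen_files(filenames):
    """Return (chrom, bgen, sample) tuples for chromosomes that have both files."""
    found = {}
    for name in filenames:
        match = _NAME.match(name)
        if match:
            found.setdefault(match.group("chrom"), {})[match.group("ext")] = name

    def sort_key(chrom):
        # Numeric labels first in numeric order, then any other labels alphabetically.
        return (0, int(chrom), "") if chrom.isdigit() else (1, 0, chrom)

    pairs = []
    for chrom in sorted(found, key=sort_key):
        files = found[chrom]
        if "bgen" in files and "sample" in files:
            pairs.append((chrom, files["bgen"], files["sample"]))
    return pairs


if __name__ == "__main__":
    # Made-up directory listing for illustration.
    listing = ["chr.10.qc.bgen", "chr.10.qc.sample",
               "chr.2.qc.bgen", "chr.2.qc.sample",
               "chr.3.qc.bgen", "README"]
    for chrom, bgen, sample in pair_bgen_files(listing):
        print(chrom, bgen, sample)
```

In practice the listing would come from something like os.listdir on the directory above; chromosomes missing either file are skipped instead of raising an error, which is a design choice of this sketch.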

Array Data QC
Common, high quality SNPs within EUR-ancestry group of individuals. This dataset was prepared for a specific project I was working on, but some others have asked about it, so I'm listing it here.
Filtering done on raw array data:
- --geno 0.05
- --hwe 0.00000001
- --maf 0.05
- retaining only those individuals previously identified as EUR-ancestry
LOCATION: /pl/active/IBG/luke/eur/allind/snps/
See README in that directory
A set of merged, QC'd SNPs of these individuals is in ./merged_QC/eur.qc.[bed/bim/fam]
