A: Within each .tar.gz you will find three kinds of files:

  • Files of the form CODE-PHENOTYPE-TW_TISSUENAME-DATE.csv are association results based on GTEx v6p gene expression models.
  • The file CODE-PHENOTYPE-DGN_WB-DATE.csv contains association results based on gene expression models for whole blood generated with data from the Depression, Genes and Networks (DGN) study.
  • Finally, there is a file named CODE-PHENOTYPE-cer_0.01-DATE.txt. This is a meta-analysis across tissues that lets you quickly identify the genes that might be most relevant for the phenotype of interest.

A: This can be accomplished by running the command tar -xzvf filename.tar.gz. It will create a folder named CODE-PHENOTYPE and extract all the files into it.
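For scripted workflows, the extraction step and the three file-naming patterns can be handled from Python. The following is a minimal sketch using only the standard library; the function names and the category labels it returns are illustrative and not part of the S-PrediXcan distribution, and it assumes the file names follow the patterns listed above exactly.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir="."):
    """Extract a .tar.gz archive into dest_dir, like `tar -xzf`.

    Only use this on archives from a source you trust.
    """
    with tarfile.open(archive_path, mode="r:gz") as tar:
        tar.extractall(path=dest_dir)


def classify_result_file(filename):
    """Map a result file name to one of three (hypothetical) labels.

    Returns None when the name matches none of the documented patterns.
    """
    name = Path(filename).name
    if name.endswith(".csv") and "-TW_" in name:
        return "gtex_tissue"                  # CODE-PHENOTYPE-TW_TISSUENAME-DATE.csv
    if name.endswith(".csv") and "-DGN_WB-" in name:
        return "dgn_whole_blood"              # CODE-PHENOTYPE-DGN_WB-DATE.csv
    if name.endswith(".txt") and "-cer_0.01-" in name:
        return "multi_tissue_meta_analysis"   # CODE-PHENOTYPE-cer_0.01-DATE.txt
    return None
```

After extraction, iterating over `Path("CODE-PHENOTYPE").iterdir()` and calling `classify_result_file` on each entry is one way to separate the per-tissue files from the meta-analysis file.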

    A: On the home page, you can explore the AWS S3 bucket containing the S-PrediXcan files. You can enter a pattern in the “Search” box to filter for the files matching that pattern, and download files by clicking their links individually.

    For your convenience, we also provide this Google spreadsheet, which contains 1) the UK Biobank codes for the phenotypes, 2) the phenotype description, 3) file prefixes, 4) S3 links for each of the files and 5) wget commands to download them. These fields may be useful for scripting.

    For further information on the GWAS on which these results are based, you can consult this spreadsheet published by Neale Lab.

    If you prefer to use awscli (the AWS command line interface) to download the files, we also provide examples for some common use cases. To install awscli, you can follow the instructions given here. There is no need to set up credentials (see the NOTE below). In all cases, replace `destiny_folder` with the path where you want to download the data.

    To download the whole bucket (i.e. all the 2419 phenotypes): run aws --no-sign-request --region=us-east-1 s3 cp s3://gene2pheno/ destiny_folder --recursive

    To download all the files with UK Biobank code 20001 (cancer phenotypes): run aws --no-sign-request --region=us-east-1 s3 cp s3://gene2pheno/ destiny_folder --recursive --exclude "*" --include "20001*"

    NOTE: In general, it’s necessary to have AWS credentials to run aws commands, but the --no-sign-request option tells the CLI not to look for credentials. It works in this case because the AWS S3 bucket is public.
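If you want to drive these downloads from a script, the two commands above can be assembled programmatically. This is a minimal sketch, assuming awscli is installed and on your PATH; the function names are illustrative. The flags and bucket URL are exactly those shown in the commands above. Passing the arguments as a list means no shell is involved, so the `*` patterns need no quoting.

```python
import subprocess

BUCKET_URL = "s3://gene2pheno/"


def build_download_command(dest_folder, code_prefix=None):
    """Build the `aws s3 cp` argument list for the public bucket.

    With code_prefix=None the whole bucket is copied; otherwise only
    files whose names start with code_prefix (e.g. "20001").
    """
    cmd = [
        "aws", "--no-sign-request", "--region=us-east-1",
        "s3", "cp", BUCKET_URL, dest_folder, "--recursive",
    ]
    if code_prefix is not None:
        # Exclude everything, then re-include the matching files.
        cmd += ["--exclude", "*", "--include", f"{code_prefix}*"]
    return cmd


def download(dest_folder, code_prefix=None):
    """Run the download; raises CalledProcessError if aws fails."""
    subprocess.run(build_download_command(dest_folder, code_prefix), check=True)
```

For example, `download("destiny_folder", "20001")` would run the same command as the cancer-phenotype example above.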

    A: The results were generated using S-PrediXcan software.

    The prediction models for gene expression that we used here are based on the GTEx v6p release of RNA-seq data (44 models), as well as the DGN study (1 model). These models were generated by our group and are publicly available as SQLite databases. To download these models, or to get more information on how they are generated, you can access the PredictDB portal. Coming soon: the v7 release of the GTEx models.

    The software was run at CRI HPC at the University of Chicago. The total running time was around 12 hours (1.5 minutes for each phenotype/tissue pair, parallelized across roughly 300 nodes).
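As a rough sanity check on these figures, the back-of-the-envelope calculation below multiplies out the numbers quoted on this page. It assumes that every one of the 2419 phenotypes was run against all 45 models (44 GTEx plus 1 DGN) and that the work was spread perfectly evenly over 300 nodes; neither assumption is confirmed here, and scheduling or I/O overhead is not modeled.

```python
def ideal_wall_clock_hours(n_phenotypes, n_models, minutes_per_pair, n_nodes):
    """Wall-clock hours under perfectly even parallelization."""
    pairs = n_phenotypes * n_models          # phenotype/tissue pairs
    total_minutes = pairs * minutes_per_pair  # total compute time
    return total_minutes / n_nodes / 60


# Figures quoted on this page; the 45 = 44 + 1 model count is an assumption
# about how many models each phenotype was run against.
hours = ideal_wall_clock_hours(2419, 44 + 1, 1.5, 300)
print(f"{2419 * 45} pairs, about {hours:.1f} hours in the ideal case")
```

The ideal-case estimate comes out at roughly 9 hours, which is the same order of magnitude as the reported running time of around 12 hours.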

    The code for the multitissue meta-analysis will be released soon, as we are still working on it.

    A:

  • gene: a gene's ID, as listed in the Tissue Transcriptome model. Some models use Ensembl IDs, while others (mainly DGN Whole Blood) provide HUGO names
  • gene_name: gene name as listed by the Transcriptome Model (HUGO names)
  • zscore: S-PrediXcan's association result for the gene
  • effect_size: S-PrediXcan's association effect size for the gene (change in phenotype given one standard deviation change in predicted gene expression level)
  • pvalue: P-value
  • pred_perf_r2: R² of tissue model's correlation to gene's measured transcriptome (prediction performance)
  • pred_perf_pval: p-value of tissue model's correlation to gene's measured transcriptome (prediction performance)
  • pred_perf_qval: q-value of tissue model's correlation to gene's measured transcriptome (prediction performance)
  • n_snps_used: number of GWAS SNPs used in S-PrediXcan analysis
  • n_snps_in_cov: number of SNPs in the covariance matrix
  • n_snps_in_model: number of SNPs in the model
  • var_g: variance of the gene expression, calculated as W' * G * W (where W is the vector of SNP weights in a gene's model, W' is its transpose, and G is the covariance matrix)
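To make a few of the columns above concrete, here is a minimal sketch in plain Python. `var_g` implements the quadratic form W' * G * W from the definition above. `two_sided_pvalue` assumes the pvalue column is the two-sided normal tail probability of zscore, which is an assumption and not stated on this page. `load_results` assumes the .csv files are comma-separated with a header row using the column names listed above. All function names are illustrative.

```python
import csv
import io
import math


def var_g(weights, cov):
    """Variance of predicted expression: W' * G * W.

    weights: list of SNP weights (W); cov: square covariance matrix (G)
    given as a list of rows, in the same SNP order as weights.
    """
    n = len(weights)
    return sum(weights[i] * cov[i][j] * weights[j]
               for i in range(n) for j in range(n))


def two_sided_pvalue(zscore):
    """Two-sided normal p-value, 2 * (1 - Phi(|z|)). Assumed relationship."""
    return math.erfc(abs(zscore) / math.sqrt(2))


def load_results(text):
    """Parse CSV text into dicts, converting zscore and pvalue to float.

    Values that cannot be parsed as numbers (e.g. "NA") become None.
    """
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        for key in ("zscore", "pvalue"):
            try:
                row[key] = float(row[key])
            except (KeyError, TypeError, ValueError):
                row[key] = None
        rows.append(row)
    return rows


def top_genes(rows, n=10):
    """Return the n rows with the smallest non-missing pvalue."""
    valid = [r for r in rows if r["pvalue"] is not None]
    return sorted(valid, key=lambda r: r["pvalue"])[:n]
```

For a file on disk, `load_results(open(path).read())` followed by `top_genes(rows)` gives a quick ranked view of the strongest associations in one tissue.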