Welcome to DEcode!
The goal of this project is to enable you to utilize genomic big data in identifying regulatory mechanisms for differential expression (DE).
DEcode predicts inter-tissue variations and inter-person variations in gene expression levels from TF-promoter interactions, RNABP-mRNA interactions, and miRNA-mRNA interactions.
You can read more about this method in our paper (full text is available at https://rdcu.be/b5r3p), where we conducted a series of evaluations and applications: predicting transcript usage, drivers of aging DE, gene coexpression relationships on a genome-wide scale, and frequent DE in diverse conditions.
Run DEcode on Code Ocean
You can run DEcode on the Code Ocean platform without setting up a computational environment. Our Code Ocean capsule provides reproducible workflows, all processed data, and pre-trained models for tissue- and person-specific transcriptomes and DEprior, at the gene or transcript level.
Prepare input features
Prerequisites
Before running the scripts to create custom input data, you must have the following Python libraries installed on your system:
- pandas
- scipy
You also need to install the following R packages:
- GenomicFeatures
- tidyverse
- data.table
- rtracklayer
- plyranges
- optparse
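As a quick sanity check for the Python prerequisites listed above, the following minimal sketch reports which libraries cannot be imported in the current environment. The helper name `missing_modules` is hypothetical and not part of DEcode; the R packages need a separate check from within R.

```python
import importlib.util

# Python libraries named in the prerequisites above.
REQUIRED = ["pandas", "scipy"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing Python libraries:", ", ".join(missing))
    else:
        print("All required Python libraries are importable.")
```

If anything is reported missing, install it with your usual package manager before running the scripts.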
Usage
1. First, download the example files by running the following commands:
```bash
# GTF file from GTEXv7 (hg19)
mkdir gtf
wget https://storage.googleapis.com/gtex_analysis_v7/reference/gencode.v19.genes.v7.patched_contigs.gtf -P ./gtf/

# eCLIP-seq peaks from ENCODE (hg19)
mkdir bed_rna
wget https://www.encodeproject.org/files/ENCFF039BKT/@@download/ENCFF039BKT.bed.gz -P ./bed_rna/
wget https://www.encodeproject.org/files/ENCFF379UQU/@@download/ENCFF379UQU.bed.gz -P ./bed_rna/

# ChIP-seq peaks from ENCODE (hg19)
mkdir bed_promoter
wget https://www.encodeproject.org/files/ENCFF553GPK/@@download/ENCFF553GPK.bed.gz -P ./bed_promoter/
wget https://www.encodeproject.org/files/ENCFF549TYR/@@download/ENCFF549TYR.bed.gz -P ./bed_promoter/
```
This will create directories `gtf`, `bed_rna`, and `bed_promoter`, and download the GTF file, two eCLIP-seq bed files, and two ChIP-seq bed files into them.
Our input data for the gene-level model was constructed from the `gencode.v19.genes.v7.patched_contigs.gtf` file from the GTEXv7 dataset (the file downloaded above). This file contains only one representative transcript per gene for the human genome (hg19).
If you use a different gene model, make sure to filter the GTF file so that it contains one representative transcript per gene before running the pipeline.
The input data for the transcript-level model was created based on `https://storage.googleapis.com/gtex_analysis_v7/reference/gencode.v19.transcripts.patched_contigs.gtf`. It is not necessary to filter the GTF file for the transcript-level model.
2. Convert genome coordinates to RNA coordinates using the following command:
```bash
Rscript functions/bed_to_RNA_coord.R -b ./bed_rna/ -n 100 -g gtf/gencode.v19.genes.v7.patched_contigs.gtf -t rna -o custom_RNA
```
Arguments
- `-b` (bed_directory): Character string specifying the directory containing the bed files
- `-n` (bin): Numeric value specifying the size of bins for the genomic features
- `-g` (gtf_file): Character string specifying the path to the GTF file
- `-t` (input_type): Character string specifying the experiment type of the bed files, either `promoter` or `rna`
- `-o` (output): Character string specifying the path and filename of the output file

This converts the bed files in the `./bed_rna/` directory from genome coordinates to RNA coordinates using the `gencode.v19.genes.v7.patched_contigs.gtf` file and writes the result to `custom_RNA.txt`.
To map ChIP-seq peaks to promoters instead, set the `-t` option to `promoter`:
```bash
Rscript functions/bed_to_RNA_coord.R -b ./bed_promoter/ -n 100 -g gtf/gencode.v19.genes.v7.patched_contigs.gtf -t promoter -o custom_promoter
```
3. To convert the RNA-coordinate peaks to Pandas format, use the following commands:
```bash
python functions/to_sparse.py custom_RNA.txt
python functions/to_sparse.py custom_promoter.txt

# clean up
rm custom_RNA.txt
rm custom_promoter.txt
```
This will convert the RNA-coordinate peaks in the `custom_RNA.txt` file and the `custom_promoter.txt` file to sparse Pandas DataFrames (`custom_RNA.pkl` and `custom_promoter.pkl`).
4. Place `custom_RNA.pkl`, `custom_RNA_gene_name.txt.gz`, and `custom_RNA_feature_name.txt.gz` in the directory where RNA features are located, for example: `./data/toy/RNA_features/`. Also place `custom_promoter.pkl`, `custom_promoter_gene_name.txt.gz`, and `custom_promoter_feature_name.txt.gz` in the directory for promoter features, for example: `./data/toy/Promoter_features/`.
5. Modify the code (`Run_DEcode_toy.ipynb`) as follows:
```python
mRNA_data_loc = "./data/toy/RNA_features/"
mRNA_annotation_data = ["POSTAR","TargetScan","custom_RNA"]
promoter_data_loc = "./data/toy/Promoter_features/"
promoter_annotation_data = ["GTRD","custom_promoter"]
```
This modification instructs the code to use the `custom_RNA.pkl` and `custom_promoter.pkl` files as part of the RNA annotation data and promoter annotation data, respectively.
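Before running the notebook, it can help to confirm that the custom files from step 4 are where the notebook will look for them. The sketch below assumes only the three file names per annotation set that step 4 lists (`<name>.pkl`, `<name>_gene_name.txt.gz`, `<name>_feature_name.txt.gz`). The helper `missing_feature_files` is hypothetical and not part of DEcode, and whether the bundled sets such as POSTAR or GTRD follow the same naming is not something this check assumes.

```python
from pathlib import Path

# File suffixes that step 4 lists for each custom annotation set.
SUFFIXES = (".pkl", "_gene_name.txt.gz", "_feature_name.txt.gz")


def missing_feature_files(feature_dir, annotation_name):
    """Return the names of expected files that are absent from feature_dir."""
    base = Path(feature_dir)
    expected = [base / f"{annotation_name}{suffix}" for suffix in SUFFIXES]
    return [path.name for path in expected if not path.is_file()]


if __name__ == "__main__":
    # Directories and names taken from the example in steps 4 and 5.
    checks = {
        "./data/toy/RNA_features/": "custom_RNA",
        "./data/toy/Promoter_features/": "custom_promoter",
    }
    for directory, name in checks.items():
        missing = missing_feature_files(directory, name)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{name}: {status}")
```

Adjust the directories in `checks` if your features live somewhere other than `./data/toy/`.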
If you find DEcode useful in your work, please cite our manuscript.
Tasaki, S., Gaiteri, C., Mostafavi, S. & Wang, Y. Deep learning decodes the principles of differential gene expression. Nature Machine Intelligence (2020) (full text is available at https://rdcu.be/b5r3p)
Source databases for training data
- GTEx transcriptome data - GTEx portal
- Transcription factor binding peaks - GTRD
- RNA binding protein binding peaks - POSTAR2
- miRNA binding locations - TargetScan