VRMOD CRM Track Hub Description
Genomic DNA encodes regulatory information that determines where, when, and to what extent genes are expressed. Theoretically, we should be able to identify these transcriptional “instructions” by examining genomic DNA sequence alone, yet this capability has remained elusive. Here we present the Vertebrate Regulatory MOdule Detector (VRMOD), a method that accurately predicts gene expression regulatory sequences using only the query genomic sequences. We applied VRMOD to 309 genomes from the Ensembl database, generating a compendium of high-resolution, genome-position-fixed cis-regulatory modules without adjusting any parameters. We performed extensive computational evaluation and experimental validation of VRMOD predictions. Notably, VRMOD predicted three sub-enhancers within the human enhancer hs52 at the FTO locus from the VISTA database, one of which was missed by existing methods. Using a chicken embryo system combined with 3D tissue imaging, we showed that each sub-enhancer exhibits restricted spatiotemporal activity within specific subsets of the tissues where the full enhancer is active. We further demonstrated the utility of VRMOD in identifying evolutionarily non-conserved enhancers, annotating regulatory sequences in non-model organisms, and identifying candidate disease-causal variants associated with Alzheimer’s disease. Collectively, our work provides a universal coordinate reference system for regulatory sequences across 309 genomes, analogous to annotated gene models for protein-coding sequences. VRMOD thus enables genome-wide annotation of non-coding regulatory elements in any vertebrate species using genomic sequence information alone. The predicted cis-regulatory modules of the 309 genomes represent a significant resource for the research community.
Methods
- VRMOD CRM: Cis-regulatory modules (CRMs) predicted by the VRMOD algorithm, as described in Gonçalves et al.
- ExpCRMs: Literature-curated, experimentally defined human cis-regulatory modules (n = 60) that drive gene expression in diverse tissues, cell types, and developmental stages
- VISTA Enhancers: In vivo validated functional human enhancer elements (n = 998) obtained from the VISTA Enhancer Browser
- Enhancer Atlas: Putative mouse enhancers obtained from Enhancer Atlas 2.0
- scATAC-seq: Mouse scATAC-seq peaks derived from 45 brain regions and 160 cell types in the adult mouse cerebrum from the mouse scATAC-seq brain atlas
- ATAC Atlas: ATAC-seq Atlas database comprising 296,390 peaks derived from 66 ATAC-seq profiles of 20 primary tissues of adult mice
- Cistrome ATAC: Mouse ATAC-seq peaks from the Cistrome database derived from 1,059 samples from 25 different tissue and cell types
- Cistrome DHS: DHS peaks from human DNase-seq data obtained from the Cistrome database
- Cistrome H3K27ac: Mouse ChIP-seq peaks for the histone mark H3K27ac downloaded from the Cistrome database
- Cistrome H3K4me1: Mouse ChIP-seq data for the histone mark H3K4me1 downloaded from the Cistrome database
- Cistrome H3K4me3: Mouse ChIP-seq data for the histone mark H3K4me3 downloaded from the Cistrome database
- ENCODE_V2: Mouse ENCODE2 cCREs
- ENCODE_V3: Mouse ENCODE3 cCREs
- ENCODE_V4: Mouse ENCODE4 cCREs
Credits
Data were generated and processed at Washington University School of Medicine, St. Louis, MO.
For inquiries, please contact us at the following address: gzhao (at) wustl.edu
References
Gonçalves T.M., Stewart C.L., Baxley S.D., Xu J., George B., Li D., Yang C., Gabel H.W., Piao X., Cruchaga C., Li Y.E., Wang T., Avraham O., Zhao G. Unlocking cis-regulatory landscapes across 500 million years of evolution and disease mechanisms. (In Revision)
Data Access
VRMOD CRM mm10 bigBed file download:
https://genome.ucsc.edu/hubspace/41/gzhao/VRMOD_mm10/Mouse_VRMOD_mm10.bigBed
The data are stored in the binary bigBed format. The UCSC bigBedToBed tool accepts either a local file or the URL above as input and converts it to plain-text BED.
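The conversion described above can be scripted. The following is a minimal sketch, not part of the hub itself, and it assumes that the UCSC bigBedToBed executable is installed and on the PATH. The helper names (build_command, parse_bed_lines, convert) are hypothetical and introduced only for illustration. Because the column layout of the file beyond the first three BED fields is not documented here, any additional columns are kept as unparsed strings.

```python
import subprocess
from typing import Iterable, Iterator, List, NamedTuple, Optional, Tuple

# URL copied from the Data Access section of this page.
VRMOD_MM10_URL = (
    "https://genome.ucsc.edu/hubspace/41/gzhao/VRMOD_mm10/Mouse_VRMOD_mm10.bigBed"
)


class BedInterval(NamedTuple):
    chrom: str
    start: int              # 0-based, inclusive (BED convention)
    end: int                # 0-based, exclusive (BED convention)
    extra: Tuple[str, ...]  # remaining columns, left unparsed (layout is an assumption)


def build_command(source: str, out_path: str, chrom: Optional[str] = None,
                  start: Optional[int] = None, end: Optional[int] = None) -> List[str]:
    """Build a bigBedToBed command line; source may be a local file or a URL."""
    cmd = ["bigBedToBed"]
    # -chrom, -start and -end restrict the output to a region.
    if chrom is not None:
        cmd.append(f"-chrom={chrom}")
    if start is not None:
        cmd.append(f"-start={start}")
    if end is not None:
        cmd.append(f"-end={end}")
    cmd += [source, out_path]
    return cmd


def parse_bed_lines(lines: Iterable[str]) -> Iterator[BedInterval]:
    """Yield one BedInterval per data line, skipping blank and comment lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        yield BedInterval(fields[0], int(fields[1]), int(fields[2]), tuple(fields[3:]))


def convert(source: str, out_path: str, chrom: Optional[str] = None,
            start: Optional[int] = None, end: Optional[int] = None) -> List[BedInterval]:
    """Run bigBedToBed (must be installed) and return the parsed intervals."""
    subprocess.run(build_command(source, out_path, chrom, start, end), check=True)
    with open(out_path) as handle:
        return list(parse_bed_lines(handle))
```

For example, convert(VRMOD_MM10_URL, "vrmod_chr1.bed", chrom="chr1") would write the chr1 predictions to a local BED file and return them as a list. Restricting the request to a single chromosome or region avoids reading the whole file over the network.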
If you have an experimentally defined CRM that you would like to include in the ExpCRM collection, please provide its genomic location (hg38), the conditions under which the CRM is active (e.g., cell type, tissue, developmental stage), and the corresponding reference.