Evidence-ranked motif identification


This website provides the cERMIT executable (Linux) as well as a brief description of the ChIP-seq analysis pipeline described in "Evidence-ranked motif identification".

Abstract

The computational identification of functional sequence motifs has been a challenging problem in computational biology. Traditionally, regulatory motif finding has been phrased as the de novo identification of small DNA or RNA sequences enriched in a small subset of sequences such as promoters or untranslated regions of transcripts. However, the increasing availability of genome-wide data sets which directly or indirectly reflect gene regulation has allowed for an alternative problem definition: identify enriched sequence motifs, given quantitative experimental evidence for each regulatory region in a genome-wide set. We propose the (conserved) Evidence-Ranked Motif Identification Tool cERMIT, which implements an efficient enumerative strategy for identifying cis-regulatory elements to address this reformulation of the motif finding problem. cERMIT operates on a set of non-coding regulatory regions and their corresponding evidence, for example p-values resulting from chromatin- or RNA-immunoprecipitation experiments, or differential expression scores from knockdown assays. Candidate sets of target genes are defined by the presence of shared motif instances in the regulatory regions. Our strategy identifies motifs that correspond to gene sets with strong aggregate evidence of co-regulation using the information across all genes assayed in the high-throughput experiment. A p-value calibration strategy and conservation-based filters contribute to a considerable improvement of the predictive power of the procedure. cERMIT is validated extensively on curated yeast datasets and substantially outperforms existing state-of-the-art approaches. Our approach provides a new look at an old problem, and we demonstrate the ease with which it is extended to a wide range of applications.

Pipeline for analysis of deep sequencing data (ChIP-seq, DNase-seq)

Generally speaking, there are three main steps in the analysis of ChIP-seq data (excluding the initial data normalization steps). First, sequence reads are aligned against the genome of the corresponding species. Next, a subset of the aligned reads is selected and classified as regions enriched in evidence of binding (peak calling). Once the "interesting" regions, typically 50bp-3000bp in size, have been selected, they are further analyzed by motif finding algorithms to infer the binding affinity of the trans-acting element under study. cERMIT implements the third step in this analysis process.

Alignment of short sequence reads
Align ChIP-seq reads using MAQ [1], retaining the reads that align against 4 or fewer locations. To avoid single-base pile-ups of sequences, remove all sequence locations where, within a 30bp window, there are more than 10 sequences of which 70% map to a single base location. Trim locations with multiple identical sequences to a maximum of 5 sequences.
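
The read filtering above could be sketched as follows. This is only an illustration under stated assumptions, not the code used in the pipeline: reads are reduced to integer alignment start positions on a single chromosome and strand, "identical sequences" is taken to mean reads sharing a start position, the 30bp window is centered on the position being tested, and the function name filter_reads and its parameters are hypothetical.

```python
from bisect import bisect_left
from collections import Counter


def filter_reads(positions, window=30, min_reads=10, frac=0.7, max_dup=5):
    """Drop single-base pile-ups, then cap duplicate reads per position.

    positions: alignment start coordinates (ints) on one chromosome/strand.
    Assumed reading of the rule: a position is a pile-up when the window
    around it holds more than `min_reads` reads and at least `frac` of
    them start at that single position.
    """
    counts = Counter(positions)
    sorted_pos = sorted(positions)
    half = window // 2
    kept = []
    for pos, n_here in sorted(counts.items()):
        # Count the reads starting inside [pos - half, pos + half).
        lo = bisect_left(sorted_pos, pos - half)
        hi = bisect_left(sorted_pos, pos + half)
        n_window = hi - lo
        if n_window > min_reads and n_here >= frac * n_window:
            continue  # single-base pile-up: remove every read at this position
        # Trim identical reads to at most `max_dup` copies.
        kept.extend([pos] * min(n_here, max_dup))
    return kept


# Twelve reads at position 100 dominate their window and are removed.
print(filter_reads([100] * 12 + [105, 110]))  # [105, 110]
# Seven reads at 200 are not a pile-up (window holds only 9 reads), so they are capped at 5.
print(filter_reads([200] * 7 + [203] * 2))    # [200, 200, 200, 200, 200, 203, 203]
```

The default thresholds mirror the values quoted in the text; in practice the filter would be applied per chromosome and strand.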

Peak calling
1. ChIP-seq
Identify discrete ChIP peaks using the non-parametric kernel density estimation (KDE) procedure implemented in Fseq [2]. Based on the Fseq base pair scores, each peak is assigned a single score corresponding to the maximum KDE value across all locations within the peak. Discard regions with kernel density scores greater than 10, as those are most likely to be pile-ups within repeat regions. Extend/trim peaks to be at least 100bp and at most 1000bp in length (a different range can be specified by the user). The extension/trimming is proportional to the distance from the end of the peak region to the base pair location where the maximum KDE score is observed.
2. DNaseI Hypersensitive Sites (DHS)
Peak regions are called similarly to the ChIP-seq case. A detailed description of the procedure is included in [3].
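
The ChIP-seq peak scoring and resizing from item 1 could look roughly like the sketch below. It assumes half-open coordinates [start, end), one KDE value per base, and one plausible reading of "proportional": the number of bases added to or removed from each side is proportional to that side's distance from the summit. The function names are hypothetical, and chromosome boundaries are not handled.

```python
def score_peak(start, kde_values):
    """Return (score, summit) for a peak whose per-base KDE values begin at `start`."""
    score = max(kde_values)
    summit = start + kde_values.index(score)  # first base carrying the maximum
    return score, summit


def resize_peak(start, end, summit, min_len=100, max_len=1000):
    """Extend or trim [start, end) so its length falls within [min_len, max_len]."""
    length = end - start
    target = min(max(length, min_len), max_len)
    delta = target - length  # > 0: extend, < 0: trim, 0: leave unchanged
    if delta == 0:
        return start, end
    left_dist = summit - start
    # Assumed split: each side changes in proportion to its distance from the summit.
    left_share = round(delta * left_dist / length)
    right_share = delta - left_share
    return start - left_share, end + right_share


def process_peak(start, kde_values, max_score=10.0):
    """Score a peak, discard it if the score exceeds `max_score`, else resize it."""
    score, summit = score_peak(start, kde_values)
    if score > max_score:
        return None  # likely a pile-up within a repeat region
    end = start + len(kde_values)
    new_start, new_end = resize_peak(start, end, summit)
    return new_start, new_end, score


# A 40bp peak with its summit 10bp from the left end is extended to 100bp.
print(process_peak(1000, [0.5] * 10 + [2.0] + [0.5] * 29))  # (985, 1085, 2.0)
# A 2000bp peak with its summit at base 500 is trimmed to 1000bp.
print(resize_peak(0, 2000, 500))                             # (250, 1250)
```

Under this reading, a summit near one end of a short peak pulls most of the extension toward the opposite side; other interpretations of the proportionality rule are possible.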

Processing of Fseq peaks
1. Define sample space of sequence regions to be used as input to cERMIT
A major goal in defining the set of putative regulatory regions is for it to be enriched in functional binding sites for the factor of interest. Recent high-throughput sequencing technologies coupled with DHS assays have clearly demonstrated that regions of open chromatin are highly enriched in functional DNA elements [3]. Hence, we define the set of putative regulatory regions to be the DHS peaks assayed in the same species and call this the “DNaseI” approach to defining putative regulatory regions. Ideally, we would use DHS data combined with the factor-specific ChIP-seq data derived from the same cell type. To define the putative regulatory regions when DHS data is not available, we adopt a related "ensemble" approach, which relies on the assumption that ChIP-seq peaks tend to fall within open chromatin regions. The top ChIP-seq peaks across an ensemble of ChIP-seq datasets provide a set of open chromatin genomic regions.

2. Assign scores to regions in the sample space for each ChIP-seq dataset
To each region in the set of putative regulatory regions constructed with the “DNaseI” approach, assign the Fseq score of the corresponding overlapping ChIP-seq peak. If there is no overlapping ChIP-seq peak, assign 0. For the regions in the "ensemble" approach, assign the ChIP-seq scores for each individual dataset. Whenever two regions overlap, merge the two and assign the score of the longer region.
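
The score assignment and merging rules above could be sketched as follows. This is an illustration, not the Ruby implementation: regions are half-open (start, end) intervals on one chromosome, the function names are hypothetical, and taking the maximum score when several ChIP-seq peaks overlap one region is an assumption, since the text does not cover that case.

```python
def assign_scores(regions, chip_peaks):
    """Give each region the score of its overlapping ChIP-seq peak, or 0.

    regions:    list of (start, end)
    chip_peaks: list of (start, end, score)
    Assumption: when several peaks overlap a region, the highest score is used.
    """
    scored = []
    for r_start, r_end in regions:
        hits = [score for p_start, p_end, score in chip_peaks
                if p_start < r_end and r_start < p_end]
        scored.append((r_start, r_end, max(hits) if hits else 0.0))
    return scored


def merge_overlapping(scored):
    """Merge overlapping regions, keeping the score of the longest original region."""
    merged = []  # entries: [start, end, score, length of longest constituent]
    for start, end, score in sorted(scored):
        length = end - start
        if merged and start < merged[-1][1]:
            last = merged[-1]
            last[1] = max(last[1], end)
            if length > last[3]:
                last[2], last[3] = score, length
        else:
            merged.append([start, end, score, length])
    return [(s, e, sc) for s, e, sc, _ in merged]


dhs = [(0, 100), (200, 300)]
chip = [(50, 120, 3.5), (400, 450, 9.0)]
print(assign_scores(dhs, chip))
# [(0, 100, 3.5), (200, 300, 0.0)]

print(merge_overlapping([(0, 100, 1.0), (50, 300, 2.0), (500, 600, 4.0)]))
# [(0, 300, 2.0), (500, 600, 4.0)]
```

Tracking the length of the longest constituent, rather than the length of the growing merged interval, keeps the "score of the longer region" rule well defined when more than two regions chain together.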

Note: The processing steps described above have been implemented in a suite of Ruby scripts available here. All input parameters should be specified in a parameters file (see example) which is passed as an argument inside 'do_processing.sh'. Upon running 'do_processing.sh' all necessary input is created and cERMIT can be run from the command line.

Running cERMIT
Please refer to the provided Readme file for detailed instructions on how to run cERMIT and interpret the generated output.

Downloads

cERMIT + sample datasets
This archive includes a binary executable, compiled for Linux, as well as sample ChIP-chip, ChIP-seq, and microRNA datasets.
ChIP-seq/DNase-seq
This archive includes the Ruby pre-processing scripts implementing the pipeline for analysis of deep sequencing data, as well as the human and mouse ChIP-seq and DNase-seq peaks (called by Fseq) used as input in the paper. In order for these datasets to be analyzed by cERMIT, a compressed version of the corresponding genomic data in .2bit format needs to be supplied as input by the user and specified inside the input parameters file (see example).

[1] Li H, Ruan J, Durbin R: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research 2008, 18(11):1851-1858.
[2] Boyle AP, Guinney J, Crawford G, Furey T: F-Seq: a feature density estimator for high-throughput sequence tags. Bioinformatics 2008, 24:2537-2538.
[3] Boyle AP, Davis S, Shulha H, Meltzer P, Margulies E, Weng Z, Furey T, Crawford G: High-resolution mapping and characterization of open chromatin across the genome. Cell 2008, 132:311-322.


Last updated: 08/31/2009