Markov Chain Promoter Finder McPromoter006

Changes between versions MM:II and 006

Contrary to many (outdated) textbooks, a eukaryotic core promoter is not defined by one specific canonical motif such as a TATA box; instead, there are several possible motifs and configurations. McPromoter has therefore been extended to use several models for different core promoter architectures instead of one. Along with the predicted location of each transcription start site, we also provide a model number reflecting the particular class of core promoter.

See refs (7) and (8) below for more details. As motifs are somewhat specific to the particular organism and not conserved in all eukaryotes, we DO NOT recommend applying the 006 version of McPromoter to anything but Drosophila genomes.

The 006 version no longer uses features for physicochemical properties of DNA (e.g. bendability); the additional benefit of these features became negligible due to the larger size of the training data.

How do I submit my sequence?

Upload your DNA sequence, or paste your sequence into the sequence box. Your sequence should consist of one-letter nucleotides (A, C, G, T). Characters that do not uniquely determine a base (e.g. R or N) are replaced at random. The sequence should be in plain or FASTA format. FASTA format looks like this:

>gb|V00574|HSRAS1 Human germ line gene homologous to bladder carcinoma oncogene T24
GGATCCCAGCCTTTCCCCAGCCCGTAGCCCCGGGACCTCCGCGGTGGG
CGGCGCCGCGCTGCCGGCGCAGGGAGGGCCTCTGGT

Please be aware that lines longer than 1024 symbols will be truncated! You can choose whether to show predictions only for the forward strand or for the backward strand as well.
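
The input rules above can be checked locally before you upload anything. The following is only a rough sketch, not the code running on the server: the function names are made up for illustration, and we assume that an ambiguous character is replaced by one of the bases its IUPAC code allows (the text above only says "at random"). Instead of silently truncating an over-long line, the sketch raises an error so that you notice the problem.

```python
import random

# IUPAC ambiguity codes and the bases each one can stand for.
IUPAC = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

MAX_LINE = 1024   # longer lines would be truncated by the server
MIN_LENGTH = 300  # the server needs at least 300 bp


def read_sequence(text):
    """Return (header, sequence) from plain text or a single FASTA record."""
    header = None
    parts = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None or parts:
                raise ValueError("this sketch handles one sequence only")
            header = line[1:]
            continue
        if len(line) > MAX_LINE:
            raise ValueError("line longer than 1024 symbols")
        parts.append(line.upper())
    return header, "".join(parts)


def resolve_ambiguity(seq, rng=None):
    """Replace ambiguity codes by a random compatible base (assumption)."""
    rng = rng or random.Random()
    out = []
    for ch in seq:
        if ch in "ACGT":
            out.append(ch)
        elif ch in IUPAC:
            out.append(rng.choice(IUPAC[ch]))
        else:
            raise ValueError(f"unexpected character: {ch!r}")
    return "".join(out)


def prepare(text, rng=None):
    """Read, length-check and clean a sequence in one step."""
    header, seq = read_sequence(text)
    if len(seq) < MIN_LENGTH:
        raise ValueError("sequence shorter than 300 bp")
    return header, resolve_ambiguity(seq, rng)
```

Passing a seeded generator, e.g. `prepare(text, random.Random(1))`, makes the random replacement reproducible between runs.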

The program works by shifting a 300-base window over the sequence and evaluating its content every 10 bases. If a promoter is detected, the reported position is the point within the window at which the model enters the initiator state. As a consequence, NO predictions are made within the first 250 bases on the forward strand or the last 250 bases on the backward strand, and your sequence has to be at least 300 bp long.

The output of McPromoter is a list of predicted transcription start sites in GFF format. The score printed next to each predicted site is the output of the predictor and lies between approximately -0.5 and 0.1, with larger values being better. The threshold defines the minimum score for a promoter to be reported. If there are multiple predictions within 500 bases, only the best one is shown.
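
The windowing and reporting rules described above can be pictured roughly as follows. This is a sketch under assumptions, not the actual McPromoter code: the function names are hypothetical, the GFF parser relies only on the generic tab-separated GFF column order (seqname, source, feature, start, end, score, strand, ...), and "best one within 500 bases" is interpreted as a greedy selection by descending score.

```python
WINDOW = 300          # window length in bases
STEP = 10             # the window is evaluated every 10 bases
MERGE_DISTANCE = 500  # only the best hit within this distance is kept


def window_starts(seq_len):
    """0-based start positions of all complete windows."""
    return list(range(0, seq_len - WINDOW + 1, STEP))


def parse_gff_line(line):
    """Extract position, score and strand from one generic GFF line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-separated GFF fields")
    return {
        "seqname": fields[0],
        "start": int(fields[3]),
        "end": int(fields[4]),
        "score": float(fields[5]),
        "strand": fields[6],
    }


def select_predictions(predictions, threshold):
    """Apply the score threshold, then keep the best hit per neighborhood.

    predictions: list of (position, score) pairs for ONE strand.
    """
    passing = [p for p in predictions if p[1] >= threshold]
    kept = []
    # Visit candidates from best to worst; drop any that lie within
    # MERGE_DISTANCE of an already accepted, better-scoring hit.
    for pos, score in sorted(passing, key=lambda p: -p[1]):
        if all(abs(pos - k[0]) > MERGE_DISTANCE for k in kept):
            kept.append((pos, score))
    return sorted(kept)
```

For example, `window_starts(299)` is empty, which is why a sequence shorter than 300 bp cannot yield any prediction.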

How can I obtain the most meaningful results with McPromoter?

  1. McPromoter 006 has been greatly improved over the last release, but of course false positives still occur. Anything that you know about your sequence is thus helpful for restricting the search to meaningful parts of it. Look only at the strand on which your gene is located, and use other results (e.g. BLAST hits or cDNA/EST alignments) to discard parts of your sequence that most likely do not contain a promoter.

  2. Any cutoff threshold is a compromise between sensitivity and specificity. Thus, if you don't get a hit, try a lower threshold. You don't have to re-run your sequence for that; a look at the attached plots reveals almost everything.

  3. If the system detects multiple hits close to each other, the result list contains only the best one; if you expect multiple initiation sites, a look at the plots may also help to reveal neighboring maxima.

What are those plots attached to the result email good for?

We provide a plot for each strand, depicting the system output over your submitted sequence. This can help to quickly find local maxima that are below the threshold, or multiple hits that are close to each other (see the section above).
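
What you look for by eye in such a plot can also be written down as a small routine. The sketch below assumes that the per-window scores are available to you as plain numbers, which this page does not promise (the plots arrive as graphics); the function name and its arguments are hypothetical.

```python
def local_maxima(positions, scores, floor=None):
    """Return (position, score) for every strict local maximum.

    positions and scores are parallel lists, one entry per evaluated
    window. If floor is given, peaks scoring below it are skipped.
    Flat plateaus are not reported by this simple version.
    """
    if len(positions) != len(scores):
        raise ValueError("positions and scores must have equal length")
    peaks = []
    for i in range(1, len(scores) - 1):
        is_peak = scores[i] > scores[i - 1] and scores[i] > scores[i + 1]
        if is_peak and (floor is None or scores[i] >= floor):
            peaks.append((positions[i], scores[i]))
    return peaks
```

Calling it once without `floor` and once with the reporting threshold as `floor` shows which peaks were hidden by the threshold.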

And now I want to know: What is inside McPromoter?

Version 006 of McPromoter is a probabilistic method to look for eukaryotic polymerase II transcription start sites. The system contains a background model consisting of states for coding and non-coding sequences, and five promoter models which divide a promoter into a number of consecutive segments reflecting different characteristic sequence elements. The models are applied to a window of 300 bases, and the score reflects the difference between the normalized likelihood of the best promoter and non-promoter model. Because McPromoter is a statistical system, it does not require that certain patterns be present, only that the combination of all features is good enough. For example, even if the TATA box score is very low, there can still be predictions if the other features score well.
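
To make the scoring idea a little more concrete, here is a toy sketch. It is not the McPromoter implementation: we assume that likelihoods are handled as log-likelihoods, that "normalized" means divided by the window length, and that the background contributes its better-fitting state model. The function name and the numbers in the usage note are invented for illustration.

```python
def window_score(promoter_loglik, background_loglik, window_len=300):
    """Score one window from precomputed log-likelihoods.

    promoter_loglik:   dict mapping promoter model number -> log-likelihood
    background_loglik: dict mapping background model name -> log-likelihood
    Returns (score, best_promoter_model).
    """
    if not promoter_loglik or not background_loglik:
        raise ValueError("need at least one model of each kind")
    # Pick the promoter model that explains the window best.
    best_model = max(promoter_loglik, key=promoter_loglik.get)
    best_background = max(background_loglik.values())
    # Length-normalized log-likelihood difference (an assumption).
    score = (promoter_loglik[best_model] - best_background) / window_len
    return score, best_model
```

With invented values such as `{1: -400.0, 2: -395.0}` for the promoter models and `{"coding": -410.0, "noncoding": -401.0}` for the background, model 2 wins and the score is slightly positive, i.e. the window looks more like a promoter than like background.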

The models were trained on representative data sets: vertebrate promoters with human non-promoter sequences, and D. melanogaster promoters with non-promoters (see link below). Cross-validation on the fly promoter/non-promoter data set gave an equal recognition rate of 94.1%, with a correlation coefficient of 0.89. On a set of 92 Drosophila genes from the well-studied Adh region, we identified 52% of the promoters at a rate of one false hit per 16 kb.
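
For readers who want to compute comparable figures on their own test sets, the sketch below shows how sensitivity, specificity and a correlation coefficient can be obtained from classification counts. We assume here that the correlation coefficient is the one commonly used for binary classifiers (the Matthews correlation coefficient); the page itself does not spell out the formula, and the function name is hypothetical.

```python
import math


def classification_stats(tp, fp, tn, fn):
    """Return (sensitivity, specificity, correlation coefficient).

    tp/fn: promoters recognized / missed
    tn/fp: non-promoters rejected / wrongly accepted
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the coefficient is 0 if any marginal count is zero.
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, cc
```

An "equal recognition rate" corresponds to the threshold at which sensitivity and specificity coincide, so in practice one would evaluate this function over a range of thresholds.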


Get the whole story!

U. Ohler, Computational Promoter Recognition in Eukaryotic Genomic DNA.
All you (n)ever wanted to know about promoter finding... available either as a book from the LOGOS publishing house, or as a preprint of the thesis, submitted to the University of Erlangen in 2001 (pdf).

Our methods were also described in detail in the following papers. Please cite paper (8) when quoting results obtained with the new McPromoter006 for Drosophila, and paper (3) for results on human.

(1) U. Ohler, S. Harbeck, H. Niemann, E. Noeth and M. G. Reese
Interpolated Markov chains for eukaryotic promoter recognition.
Bioinformatics 15(5), p. 362-369, 1999.

(2) U. Ohler, S. Harbeck and H. Niemann
Discriminative training of language model classifiers.
Proc. European Conference on Speech Communication and Technology (EUROSPEECH), Budapest 1999.

(3) U. Ohler, G. Stemmer, S. Harbeck and H. Niemann
Stochastic segment models of eukaryotic promoter regions.
Proc. Pacific Symposium on Biocomputing 5:377-388, Honolulu 2000.

(4) U. Ohler
Promoter prediction on a genomic scale - the Adh experience.
Genome Res. 10(4):539-542, 2000.

(5) U. Ohler and H. Niemann
Identification and analysis of eukaryotic promoters: recent computational approaches.
Trends Genet. 17:56-60, 2001.

(6) U. Ohler, H. Niemann, G. Liao and G. M. Rubin
Joint modeling of DNA sequence and physical properties to improve eukaryotic promoter recognition.
Bioinformatics 17:S199-S206, 2001.

(7) U. Ohler, G. Liao, H. Niemann and G. M. Rubin
Computational analysis of core promoters in the Drosophila genome.
Genome Biol. 3:research0087.1-0087.12, 2002.

(8) U. Ohler
Identification of core promoter modules in Drosophila and their application in improved promoter prediction.
Submitted, 2006.

More information and our training and test sequences are publicly available!



McPromoter promoter predictor by Uwe Ohler.
Web interface by Moussa Sagna and Uwe Ohler