/* Copyright 1990, 1991, 1992, 1993, 1994, 1995 by Phil Green */

GENEFINDER DOCUMENTATION (Last revised 6/30/95)
Phil Green 


INSTALLATION

Use the makefile to create genefinder, Xgenefinder, and the other
programs in the package (unless I have already provided you with the
executables).  The "other programs" include: tables, which is used to
create the site and codon tables; randseq, which is used to generate
simulated sequences; translategenes, which finds the amino acid
translations for genes specified in the gene file; and several others.
Documentation on these will follow later.

For convenience and compatibility, it may be helpful for you to set up
a directory structure similar to the one I use when doing genefinder
runs.  I have a directory called "analyses", which contains the
tablenamefile "wormtables" and the reference table files (see
description of tablenamefile, below).  For each sequence to be
analyzed, I create a subdirectory of the analyses directory which
contains all the sequence-specific files (i.e. the sequencefile,
genefile, and blastorffiles, if any), and run genefinder from within
this subdirectory.


RUNNING GENEFINDER

text version:

 genefinder [options] sequencefile

graphics version:

 Xgenefinder ...

The name of the sequencefile must be provided on the command line.
Options are provided by giving the option name (which always starts
with a hyphen), followed by its value. The currently available
options, with their default values (if any), are as follows:

Usage: genefinder [optional args] %s
    -seqfile %s             Sequence filename (Required, positional)
    [-genefile %s]          gene file (default=(null))
    [-syncodon]             Split codon table into amino acid and synonymous
                            codon tables (default=False)
    [-normalize]            Normalize tables with analyzed sequence instead of
                            random sequence (default=False)
    [-exoncutoff %f]        Exon cutoff likelihood level (default=3)
    [-genecutoff %f]        Gene cutoff, default computed from intron penalty
                            (default=0)
    [-atgcutoff %f]         atg cutoff (default=0)
    [-intron3cutoff %f]     intron3 cutoff (default=0)
    [-intron5cutoff %f]     intron5 cutoff (default=0)
    [-orfcutoff %f]         orf cutoff for display (default=5)
    [-minexonlength %d]     Minimum exon length (default=33)
    [-minintronlength %d]   Minimum intron length (default=40)
    [-maxintronlength %d]   Maximum intron length (default=2000)
    [-minsequencelength %d] Minimum sequence length (default=0)
    [-minorflength %d]      min orf length (default=90)
    [-blastorffile %s]      blast orf filename (default=(null))
    [-histfile %s]          histfile (default=(null))
    [-refhistfile %s]       ref histfile (default=)
    [-tablenamefile %s]     table name file (default=../wormtables)
    [-codontable %s]        codon table (default=(null))
    [-nameprefix %s]        name prefix (default=(null))
    [-cpenalty %f]          minimum intron penalty (default=3)
    [-penaltyfactor %f]     intron penalty factor (default=1.5)
    [-penaltycluster %d]    true intron sizes around minintron length
                            (default=15)
    [-corrfactor %f]        correction factor to apply to compensate for coding
                            sequence within genomic sequence (default=1)
    [-intron3factor %f]     factor to multiply intron3 score by (default=1)
    [-intron5factor %f]     factor to multiply intron5 score by (default=1)
    [-selfsample]           do self-sample correction (default=False)


See "FILES" for descriptions of the various files, and
"OVERVIEW OF STRATEGY FOR FINDING GENES" for the meaning of the
various cutoffs.

Examples of valid calls to genefinder:

 genefinder zk637.seq

This specifies "zk637.seq" as the name of the sequence file, and
accepts the program default values for all other input.

 genefinder -genefile zk637.cand.gene -intron5cutoff -.5 zk637.seq

This specifies "zk637.cand.gene" as the name of the genefile,
"zk637.seq" as the name of the sequencefile, and changes the
score cutoff for splice donor sites to -0.5.


FILES

Most genefinder text output is written to the standard output (i.e.,
to the screen unless you redirect it to a file), and is described in
the section "OUTPUT" below.  In the present section I describe the
various input files, and one output file (the histogram file, number
(4)).  You must provide (1), and (optionally) (2), (3) and (4).  I
have provided (5)-(14), which you may replace with your own if you
desire (but probably won't need to as long as you are analyzing C.
elegans sequence).

(1) The sequencefile. The graphics version of genefinder (Xgenefinder)
currently requires that there be only a single sequence in this file,
while the text version allows multiple sequences.  Each sequence is
preceded by a header line, which must have a > as the first character
(as in FASTA format), followed immediately by a character string
giving the sequence name. All characters up to and including the first
colon in this string (if any) are omitted from the name. The rest of
the header line is ignored, except that, as an optional data integrity
checking feature, if the string "Length:" appears, it is assumed to be
followed by an integer giving the correct length of the sequence.
Genefinder will check whether the sequence that follows (up to the
next > or EOF) has this length, and abort (with an error message) if
it does not.

Sample header line: 

>In:zk637 Length: 40699 ..

In this case the "In:" would be stripped off, leaving the sequence
name as zk637.

The nucleotide sequence may be in upper or lower case. All alphabetic
ambiguity codes are effectively treated as N's.  Non-alphabetic
characters (e.g. digits) are deleted, except for '-' which is
converted to 'N'.

(2) The genefile, which specifies known genes (or candidate genes) in
the sequence. Each gene "entry" is in the following format:

sequencename genename sitetype listofsitepositions sitetype
  listofsitepositions ... *

sequencename must agree with the name occurring in the sequence file
(followed by .C if the gene is on the complementary strand).  sitetype
can currently be intron3, intron5, or atg (i.e. translation start
site); or you can use CDS (as in EMBL/GENBANK features tables), in
which case each successive pair of numbers indicates the first and
last coding nucleotides in an exon. Use top strand numbering to
indicate positions. The position of a 5' (resp. 3') splice site is the
last (resp. first) base of the adjacent exon, while the position of an
atg is at the 'a'. Each entry is terminated by a '*'.

Examples: 

zk637.C lin-9cand CDS 20160 20104 19994 19876 
 19826 19694 19647 19132 19006 18848
 18372 17962 17883 17779
 17732 17657 17582 17524 16827 16633 *

CELMYO1A myo-1 atg 3923 intron5 3992 4846 5558 5715 5935 10259
   10512 10797 intron3 4572 5214 5609 5766 5981 10306 10560 11246 *
 
For the initial analysis of an uncharacterized sequence or sequences,
no genefile need be provided.  Equivalently, one can provide a
genefile that specifies a "dummy" gene (with no sites) for each
strand of each sequence to be analyzed, e.g.:

zk637 zk637 *
zk637.C zk637cmp *

(This is to force analysis of both strands of the sequence zk637. To
analyze only one strand, specify a dummy gene for that strand only).
Once candidate genes have been decided upon, it is useful to create a
genefile that lists them; this serves to record the selections,
causes their names to appear in association with the appropriate ORFs
in subsequent genefinder runs, and allows easy creation of FASTA
format files containing the full inferred amino acid sequence (which
is necessary for TBLASTN and TFASTA runs) using the program
translategenes. In particular, it is useful to check accuracy of the
EMBL/GENBANK sequence submission headers by specifying the candidates
in the gene file using the CDS notation, rerunning genefinder, and
checking the output to make sure ORFs are correctly assigned.

(3) The blastorffile contains a list of names of BLASTX output files,
which must have been preprocessed with LaDeana's shell script.  See
LaDeana's description for the blastorffile format, and procedure for
creating them. If these files are present, the genefinder output will
display, for each ORF, any homologies involving that ORF that were
detected in the BLASTX run (see below).   (This has not been updated
to reflect the new output format of BLAST)

(4) The output histogram file. If this name is provided the
corresponding file will on completion of the run contain histograms of
the site and ORF scores occurring in the current sequence. Most users
will have no use for this file and can ignore this option.

(5) The tablenamefile is a file that lists the names of 9 other files
containing various reference tables required by genefinder.  This file
(and the 9 files it names) is required by genefinder, but need not be
specified on the command line. If it is not specified, genefinder
assumes that the name of the tablenamefile is "wormtables" and that it
resides in the parent directory of the current directory.

The default "wormtables" file provided to you reads as follows:

../tablefiles/nem/newnem.codon
../tablefiles/nem/newnem.intron3
../tablefiles/nem/newnem.intron5
../tablefiles/nem/newnem.atg
../tablefiles/nem/zk637.trinuc
../tablefiles/nem/zk637.intron3
../tablefiles/nem/zk637.intron5
../tablefiles/nem/zk637.atg
../tablefiles/nem/ref.hist

(The prefix "../tablefiles/nem/" before each filename indicates that
that file resides in the tablefiles/nem subdirectory of the parent
directory from which genefinder is invoked. If genefinder is run from a
subdirectory of the distribution (like the "analyses" directory) then
these prefixes will work properly. -- see INSTALLATION, above).

Files (6)-(14) must be listed in the order shown below, in the
tablenamefile. These contain the reference site and codon tables used
by genefinder to statistically evaluate candidate sites and coding
segments.  (6)-(9) contain the tables generated from known genes;
(10)-(14) are generated from simulated sequence (used as a control;
the versions provided to you are based on 1Mb of simulated sequence
having the same dinucleotide composition as the C. elegans cosmid
zk637.)  Versions of these files appropriate for C. elegans are
provided to you, although the genefinder package also includes the
programs necessary to generate them from your own gene database if you
desire (and you will need to do this if you wish to analyze sequence
from other organisms).  The format of these files is discussed later.
These files can be in any other directory, as long as their full path
names are included in tablenamefile.

These files are, in order:
(6) Codon table file
(7) 3' splice site table file 
(8) 5' splice site table file
(9) Translation start site table file
(10) Simulated sequence trinucleotide table file
(11) Simulated sequence 3' splice site table file
(12) Simulated sequence 5' splice site table file
(13) Simulated sequence translation start site table file
(14) File containing histograms of the site and ORF scores occurring
in the simulated sequence. (This is required for normalization of
scores.)  

If you wish to use scores based on tetramers, pentamers or hexamers, the
n-mer table file should be included after the codon table file (6) and
the n-mer simulated sequence table should be given in place of the
trinucleotide table file (10)  The codon table is still necessary, even
when gene scoring will not be done with it.  If you wish to use an amino
acid frequency table instead of the codon table, then replace (6) and
(10) with the appropriate tables.  A synonymous codon bias table may be
included after an amino acid table if desired.  The -syncodon command
line switch may be used to suppress the resulting warning message,
although it does not affect the results.

H. Sapiens amino acid, synonymous codon bias, tetramer, pentamer and
hexamer tables are included in the human directory as
humtab.{aa,syn,tetra,pent,hex} respectively.  There are also
tablenamefiles set up for each alternative in the human subdirectory.
Read the README in the human directory for more information on finding
human genes.

OVERVIEW OF STRATEGY FOR FINDING GENES. 

Genefinder systematically uses statistical criteria (primarily log
likelihood ratios, or LLRs) to attempt to identify likely genes within
a region of genomic sequence.  Candidate genes are evaluated on the
basis of "scores" that reflect their splice site, translation start
site, and coding potential LLRs, and intron sizes.  A dynamic
programming algorithm is used to find the set of non-overlapping
candidate genes (on a given strand) having the highest total score
(among all such sets). 

Specifically:

(a) 3' (acceptor) splice site, 5' (donor) splice site, and translation
start site scores are computed at each position within the sequence,
using a LLR approach (see below) that is similar but not identical to
the usual "weight matrix" method. (The LLR score matrix is derived
from the site tables for known genes and simulated sequence.)
Candidate sites are those whose LLR scores exceed the corresponding
(user-specifiable) cutoffs, namely intron3cutoff, intron5cutoff, and
atgcutoff.

(b) A list of candidate exons is then constructed, each exon being
specified by a pair of delimiting sites together with a reading frame
for which there is no in-frame stop codon between the sites.  An exon
may start with the atg in a candidate translation start site (in which
case the exon is called a "lead" exon), at a 3' splice site, or at the
beginning of the sequence being analyzed. It may end at a stop codon
(in which case it is called a trailing exon), at a 5' splice site, or
at the end of the sequence.  Internal exons are delimited by a 3'
splice site and a 5' splice site.  Exons whose sizes are smaller than
the user-specifiable parameter minexonlength are not considered.

(c) LLR scores for codons (defined similarly to site scores, using
codon tables for known genes and for simulated sequence) are used to
define "maximal coding segments" within each exon: these are contiguous
groups of codons that lie wholly between the two flanking sites, and
that are maximal in the sense that the segment score (summed LLR
score) decreases if the segment is extended to either the left or the
right.  The coding score of the exon is defined to be the maximum
segment score for any maximal coding segment in the exon.

(d) The exon "combined" score is then defined to be the sum of the LLR
scores of its two flanking sites (stop codons, and the ends of the
sequence, receive scores of 0) and of its coding score.  (Aside: More
generally, one could consider weighted sums of these scores.  From
likelihood ratio theory, giving equal weight to each score is optimal
if the scores are statistically independent; and this does appear to
be the optimal weighting in practice, even though the scores are not
actually independent.)

(e) This combined score is then "normalized" by reference to the
distribution of combined scores in simulated sequence (from the
"reference histogram file"), as follows: if a given combined score
occurs, on average, once in every 10^s nucleotides in simulated DNA,
then the corresponding normalized score is set to s.  (For example,
exons with a normalized score of 5.0 or greater will be found only
once in every 100kb of simulated DNA. With the current reference
simulated sequence, which is 1Mb in length, 6.0 is the maximum
normalized score that can occur.)  Exons whose normalized scores are
less than the (user-specifiable) parameter exoncutoff are ignored.
The default value of exoncutoff is currently 3.0, corresponding to a
frequency in random DNA of about once every kb. At present, this is
the only way in which normalized scores are used in the gene assembly
process.

(f) A candidate gene is defined to be a string of consecutive exons
that splice in frame, beginning with a lead exon and ending either
with a trailing exon or at the end of the sequence being analyzed. The
"total score" for a set of non-overlapping candidate genes is defined
to be the sum of the (unnormalized) combined scores of the exons in
all of the genes, minus a penalty for each gap (intron or intergenic
region).  Currently, the gap penalty is defined as
  p(n) = -1.5 * (2.0 + log10(n - minintronlength - 15)) 
  p(n) = p(maxintronlength), for n > maxintronlength
where n is the number of nucleotides in the gap and the log10 term is
replaced by 0 when n <= minintronlength + 15. Gaps smaller than
minintronlength, and introns larger than maxintronlength, are not
permitted. 
  Genefinder uses a dynamic programming algorithm to find, for each
strand, the set of non-overlapping candidate genes having the highest
total score. (The idea of using a dynamic programming approach in this
context is due to Richard Durbin.)  Note that the above penalty
function effectively imposes a minimum gene score of
p(maxintronlength) for widely separated genes: any gene that is
separated from an adjacent gene by at least maxintronlength, and that
has a score less than p(maxintronlength), will never increase the
total score and therefore will not appear in the candidate list. For
consistency, genefinder deletes ANY gene whose score is less than
p(maxintronlength) from the list of candidates (regardless of its
separation from other genes).  In addition, all genes are required to
be "complete", except that (i) the first gene in the list is allowed
to start with a 3' splice site, but only if this site is within
maxintronlength of the start of the sequence, and there is no
candidate translation start site within the first exon; (ii) the last
gene in the list may terminate at the end of the sequence rather than
a stop codon (however it is not allowed to end at a 5' site).
  In addition to the highest scoring list of candidate genes,
genefinder also prints out a list of high-scoring exons -- i.e. those
whose normalized scores exceed the user-definable parameter orfcutoff
-- that neither appear in, nor overlap, in-frame, any exon appearing
in the candidate gene list.  It also displays detailed information
(site and coding segment scores and positions, and homologies) about
selected ORFs, including all ORFs that contain an exon in the
candidate gene list, and all ORFs whose score exceeds orfcutoff. (By
definition, ORFs are maximal length, i.e.  flanked by stop codons at
each end). The score of an ORF (as opposed to an exon) is computed as
follows:
  
 (i) Consider all pairs of possible 3' and 5' splice sites within the
ORF that could potentially delimit an exon (i.e. that are separated
by at least minExonLength nucleotides). Compute the maximum, over all
such pairs, of the sum of the LLR scores for the two sites.  (ii)
Compute the maximum LLR score over all (non-pruned) individual splice
sites in the ORF.  Then set maxScoreSS = the maximum of the two
numbers computed in (i) and (ii). (The purpose of considering (ii) is
to allow for initial and terminal exons, which may have only a single
splice site rather than a pair of sites.)  The ORF score is then the
sum of maxScoreSS and the maximum coding segment score for the ORF;
for this purpose it is not required that the coding segment lie
between the splice sites. This score is then normalized to indicate
its frequency in simulated DNA, as with exon scores.

There are several reasons for focusing attention on ORFs as separate
entities.  First, they provide a way to summarize information about
several overlapping exons simultaneously (since overlapping exons with
the same reading frame will all lie in the same ORF).  One may in fact
think of the problem of exon identification as conceptually divided
into two parts: (1) identification of the "exon ORF", the (maximal
length) ORF that contains the exon; and (2) identification of the
exon within the exon ORF (i.e. of the correct splice site(s)).  The
original version of genefinder was based on this paradigm, and had the
goal of selecting a list of the most likely exon ORFs, providing for
each ORF a list of the most likely splice sites together with frame
information making it easy to see which splices would preserve reading
frame.  This made it possible to rapidly assemble candidate genes "by
hand". Even though this process has now largely been automated it is
still important to check the splice sites and coding segments in the
candidate genes, to see if alternative exons (in the same or a
different ORF) might be preferable. The ORF information allows this to
be done conveniently.  Finally, sequencing errors (or abnormal splice
sites) may tend to result in potential coding regions that do not fit
into an exon, and it is clearly important to be able to detect such
regions, which might be missed entirely in a list of candidate genes.
Since such regions often result in high-scoring ORFs without
appropriate splice sites they will still tend to be among the ORFs
found by genefinder.

(Need summary of performance on known genes here).

Roughly 85\% of "exon ORFs" (ORFs containing true exons) in C.
elegans genes currently in GENBANK have normalized scores above 5.0
(and many of the remaining 15\% are initial or terminal exons, having
a single splice site).  The rate for all C. elegans genes may be
lower, because of the bias towards highly expressed genes (which often
have very high coding segment scores) in the current database.
However, even for non- highly expressed genes in the current database
a majority of exon ORF scores exceed 5.0, so this should be an
effective criterion for identifying at least part of most genes.


OUTPUT

Text version:

The program first reads in the relevant input files, and prints out
miscellaneous information about the sequence nucleotide content, the
number of candidate sites of each type; if both strands of a sequence
are to be analyzed, this is done for each strand separately, starting
with the bottom (complementary) strand. The predicted genes, omitted
exons, and flagged ORFs on each strand are then displayed. For
example, the bottom strand of the cosmid zk637 yielded the following
predicted genes:

Predicted genes in zk637.C:

26.94 A 39321..39128, 38829..38547
14.78 A 33932..33832, 33784..33415, 33020..32544
37.12 A 20160..20104, 19994..19876, 19826..19694, 19647..19051, 19006..18848, 
        18372..17962, 17883..17779, 17732..17631, 16827..16633
11.32 A 4972..4825, 4728..4454
20.05 A 1509..1418, 1374..1188, 870..787, 610..486, 427..238, 
        137..1, 

Here each gene line starts with the gene score (unnormalized),
followed by an A indicating that the first exon starts with an atg. If
the first exon starts instead at the beginning of the sequence or at a
3' splice site (which is allowed only for the first gene on the
strand), the A is replaced by a 0, 1, or 2 indicating the reading
frame (i.e. that the site occurs after the 0th, 1st or 2d base of a
codon).  This is then followed by the list of exons (specified by
their first and last bases, separated by .. -- in trailing exons, the
last base is the final base of the stop codon). Note that numbering is
with respect to the top strand.  If two consecutive exons are
separated by '\/' rather than by a comma, then they are in the same
ORF -- i.e. the intron between them has no in-frame stop codons. In
some of these cases the intron may be spurious, and the detailed ORF
information should be checked to see how "good" the sites are. An exon
enclosed in parentheses overlaps an exon occurring in the gene list
for the opposite strand. In such cases, one of the two overlapping
genes (usually the lower-scoring one) is likely to be spurious; such
cases tend to arise in regions of high codon bias in a gene, which
often produce high coding segment scores on the complementary strand
due to the abnormal nucleotide content. Any such case should be
checked carefully by examining the ORF information on both strands.

The list of candidate genes is followed by a list of "omitted exons"
whose normalized score exceeds the orfcutoff but which do not overlap,
in-frame, any exon appearing in the candidate gene list.  The notation
is similar to that used for genes, except that the score is now
normalized. A comma following an exon indicates that it ends at a 5'
splice site or at the end of the sequence rather than at a stop codon.
Exons that overlap (in a different frame) an exon in the gene list
(for either the same or the complementary strand) are enclosed in
parentheses.  Such exons usually derive their high scores from sharing
splice sites or abnormal nucleotide content with the "true" exon, but
in any case should be checked.

The ORFs flagged for display include any ORF whose normalized score
exceeds the specified orfcutoff, any exon ORF for a predicted gene or
for a gene specified in the genefile, and any ORF having a BLAST
homology as indicated in the blastorffile.  The flagged ORFs on each
strand are listed in 5' to 3' order.  Top strand numbering is used.

Example: The following is an ORF from the zk637 output, using a gene
file that specified possible genes.

*ORF:    4747  4454   294   6.83   6.16     6.00
IE110-hom
*IIYFSATCAICLDNLQNNVDIPEDHVIKEELKIDPTTFGTTVIVMPCKHRFHYFCLTLWLEAQQTCPTCRQKVKTDKEVEEEERQRNLEELHDSMYG*
* 1  3'  4728     4.97
         4711  4607   105   2.76    
    atg  4612     1.76
         4609  4526    84  77.00  HSV11 IMMEDIATE-EARLY PROTEIN IE110 (GENE NAME: IE110). 133-160
         4570  4454   117   6.16    
  2  5'  4461     1.85

Explanation: 

First line: The leading * indicates that this ORF is an exon ORF for a
candidate gene either specified in the gene file or found by
genefinder's automated gene assembly procedure. The numbers give the
start, end, and length of the ORF (note that this ORF is on the bottom
strand), and the maxScoreSS, maximum coding segment score, and
normalized combined score. The "start" of an ORF is the position of
the first base in the leading stop codon; the "end" of an ORF is the
last coding base (i.e. the last base before the trailing stop codon).

Remaining lines: "IE110-hom" is the gene name (this line only appears
for an exon ORF for genes specified in the genefile).  The next line
is the amino acid translation of the ORF (here * indicates a stop
codon; X indicates undefined (i.e. any codon that contains a
nucleotide other than A,C,G, or T).  The following lines give the
sites, coding segments, and homology segments, in 5' to 3' order
within the ORF. For each splice site, the first number (0, 1, or 2)
indicates the position at which the site interrupts the codon; this is
followed by the site type, the position, and the (unnormalized) LLR
score. The "position" of a splice site is the position of the nearest
base of the corresponding exon; i.e. the first base of the exon for a
3' splice site, and the last exon base for a 5' site. An asterisk (by
the 3' site in the above example) indicates that the site is the one
specified for a (user-specified or automatically constructed)
candidate gene. For each segment, the beginning, end, length, and
score (coding segment score for coding segments, BLAST score for
homology segments) are indicated. For homology segments, the name of
the homologous sequence and amino acid positions within that sequence
are listed.

The list of sites is pruned to eliminate sites that by virtue of
their position within the ORF, or their score relative to other sites
in the ORF, are unlikely to be actual splice sites. Specifically, any
5' splice site must have either a 3' site or a translation start site
at least minexonlength nucleotides 5' to it within the ORF, any 3'
splice site must be at least minexonlength nucleotides from the 3' end
of the ORF, and any translation start site must either have a 5' site
at least minexonlength nucleotides 5' to it, or be at least 300
nucleotides from the end of the ORF. Any site is eliminated if its LLR
score is less than 1.0 and there is another site of the same type in
the ORF whose score exceeds it by 1.0 or more.


Graphics version:

Xgenefinder initially processes all ORFs and sends information about
the flagged ORFs to the Xterm window, as in the text version. The
(magnifiable and scrollable) graphic display shows each flagged ORF as
a black horizontal line segment with the appropriate length and
location, either above (top-strand ORFs) or below (bottom-strand ORFs)
the cosmid line.  Distance from the cosmid line indicates the ORF
normalized combined score. Exon ORFs for genes specified in the gene
file are labelled with the gene name. Features are displayed as
follows: sites are shown as vertical lines attached to the ORF, with
the length indicating the LLR score.  Splice sites are "hooked", with
the hook pointing in the direction of splicing: thus 3' sites point
upstream, and 5' sites point downstream.  Color of the site indicates
the position at which it interrupts the codon (corresponding to 0, 1,
or 2 in the text output): thus a 5' site in one ORF will splice to a
3' site in a downstream ORF so as to preserve frame if and only if
they have the same color.  Translation start sites are indicated by a
vertical (unhooked) black line. Coding segments are indicated by black
(high-scoring) or grey (lower-scoring) rectangles above the ORF line.
BLAST homology segments are indicated by blue rectangles below the ORF
line. Horizontal lines across the display indicate normalized score
levels of 3.0, 6.0 (tan) and the selected cutoff (grey).

Clicking at a particular point in the sequence (or on an ORF) pops up
a scrollable text window giving the nucleotide position in the
sequence and information about all three ORFs containing that
nucleotide, in the same format as the text version. Unlike the text
version output, however, no sites are pruned.

The normalized combined score cutoff can be changed and the display
redrawn using the "redraw" button. 


INTERPRETATION OF OUTPUT, AND ASSEMBLY OF GENES  

High-scoring ORFs: An ORF (or exon) whose normalized score exceeds 5.0
is unlikely to have occurred "by chance" in non-coding regions.
Usually, but not always, such ORFs are exon ORFs, i.e. they contain a
gene exon in the correct reading frame.  The exceptions are generally
either

(i) ORFs that overlap, in an incorrect reading frame, an exon
occurring on the same strand. The high score of the ORF can then be
due either to the fact that it contains one or both of the exon's
splice sites, or that the region of exon-like nucleotide composition
creates an apparent "coding segment" (even though the reading frame is
wrong), or both;

(ii) ORFs that overlap an exon occurring on the opposite strand. In
this case, the high score is due to an apparent coding segment
reflecting the abnormal nucleotide composition of the exon.

Usually, but not always, the true exon ORF in these cases has the
highest combined score.

In theory, high-scoring ORFs could arise in other ways.  For example,
intergenic or intronic regions having abnormal nucleotide composition
(for whatever reason) might spuriously appear to have coding segments,
and occasionally by chance may have apparent high-scoring splice
sites. So far, there seem to be relatively few such regions in C.
elegans genomic sequence.  These may account for the anomalous "orphan
exons" we occasionally find.

Low-scoring ORFs: A significant fraction of these are spurious. Some
are obvious, for example those that arise, as in (i) and (ii) above
because they overlap an exon, or those that are flagged only because
they have marginal spurious BLAST homologies. 

Selection of splice sites within the ORF. In known genes the true site
usually has a substantially higher LLR score (by 1.0 or more) than
other candidate sites in the ORF. Occasionally, there will be several
nearby sites with comparable scores, in which case the correct one may
or may not be the highest. In checking the candidate genes found by
genefinder, one should beware of cases in which a site with a low
score is chosen over a nearby high-scoring site. Usually, the
lowscoring site is the only one that preserves reading frame, but this
may be an indication of a sequencing error (e.g. a frameshift error
that changes the frame in one of the two adjacent exons).  Also, the
sites chosen for an ORF should have the property that any high-scoring
coding segment(s) lie entirely or almost entirely between them.

In general, 3' sites in C. elegans tend to have higher information
content (and thus higher LLR scores) than 5' sites. We have in fact seen
several cases of (apparently) true 5' sites with negative LLR scores. 
Thus in cases where an exon or coding segment appears to be part of a
gene but has no apparent 5' site, it may be useful to lower the
intron5cutoff to a slightly negative value (e.g. -.5). However it is
probably not a good idea to do this on a routine basis because it
tends to increase the "background" of spurious exons that could be
inappropriately appended by the assembly algorithm.

Note that occasionally more than one exon is contained in the same
ORF. In such cases there are 5' and 3' splice sites within the ORF
that splice so as to preserve frame.

Assembly of genes: The hardest (and most error-prone) part of this
process is probably identification of initial and terminal exons, i.e.
determining where one gene ends and the next begins.  In C. elegans,
initial exons frequently have an apparent trans-spliced leader site (a
high-scoring 3' splice site), followed closely by an apparent
high-scoring translation start site; but some internal exons may have
these, as well. Terminal exons frequently have an apparent, spurious
5' splice site (this has also been noticed in other organisms).


CHECKING FOR SEQUENCE ERRORS

The rule should be: any ORF that appears (on the basis of its score
or homologies) to contain part of a gene, but lacks one or both splice
sites or does not splice in frame to the next apparent exon ORF,
should be checked for possible sequence errors. Errors will usually be
inside the ORF (or inside the ORF to which it ought to splice), and
usually within the 60+ nucleotides nearest the non-splicing end of the
ORF.  (They are likely to be in this location because the portion of
the ORF past the location of the error is likely to be in an incorrect
reading frame, and so will tend to have a stop codon within a fairly
short distance).  Unfortunately, this will not catch all errors, in
particular those in the introns (which are probably not of much
concern) or those that reduce the ORF's score so much that it no
longer appears in the genefinder output. At present the only way to
look for such ORFs is to reduce the score threshhold for the entire
analysis. We hope to make genefinder more flexible at displaying
lower-scoring ORFs and other features in problematic regions. It
should be relatively easy, for example, to improve genefinder's
ability to detect whether an anomalous ORF has a type (ii) error (see
below), by having it look for nearby, high-scoring splice sites in
overlapping or adjacent ORFs.  Ultimately, it may be useful to have
genefinder available for the editing process itself, which will
require forging closer ties to xdap.

More specifically, sequence errors can disrupt exon ORFs in any of
three different ways, depending on the kind of error and where it
occurs:

(i) frameshifts or substitutions within a splice site can destroy it
(i.e. reduce its LLR score below the cutoff);

(ii) frameshifts well within the interior of an exon, or substitutions
that create a stop codon, usually separate a splice site from its ORF,
and/or split a coding segment;

(iii) frameshifts very near a splice site usually change the frame at
the site without separating it from the ORF.

Errors of types (i) and (ii) tend to create ORFs that have no splice
site where they apparently should -- i.e. they have coding segments
that extend almost to the end of the ORF, with no apparent way to
splice them to the next exon. One should check such ORFs for errors
near the end of the ORF lacking the splice site. (Unfortunately, lead
and terminal exons also look like this). Type (ii) errors often create
two ORFs each containing coding segments that either overlap or are
immediately adjacent to each other. 

Type (iii) errors are currently the easiest to detect because they
don't significantly reduce the score and so are less likely to remove
the ORF from the genefinder output. In favorable cases (when the ORF
with the error and the one it splices to are near each other, are
high-scoring, and have high-scoring splice sites), it is usually
obvious which sites should be spliced.  The sequence in the vicinity
of both sites should be checked whenever the frames don't match.

STILL TO BE DONE

(ii) Look for short introns (flanked by high-scoring splice sites).
(iii) Automatically suggest regions to check for sequence errors, and
allow more flexible viewing of low-scoring ORFs in problematic regions.
(iv) Extend graphics version to allow automatic recording of one's
choices for genes.
(v) Find probability models/window sizes for sites and codons that
give best discrimination (see below).
(vi) Incorporate display of repeated sequences (found using the
program "overlaps", documentation to follow).
(vii) Improve selection of low-scoring ORFs.
(viii) Indicate overlapping ORFs on the display.
(ix) Allow greater flexibility in graphics display: options to omit or
force pruning, display only high-scoring or homology ORFs or only exon
ORFs from gene file.

Known bugs: (i) Overlapping ORFs of identical normalized score are not
resolved on the display.
(ii) Homology segments that contain stop codons are only assigned to
the initial ORF. This can be eliminated by assigning large negative
scores to stop codons in the BLAST alignments; we do this for local
searches, but not for netblast searches of the NCBI databases.

TECHNICAL DISCUSSION, TO BE EXPANDED LATER:

(I) CREATION OF TABLE FILES

If you wish to modify any of the site or codon tables you must create
the corresponding table files using the program "tables" in the
genefinder package.  You will need to run tables on both a set of
known nematode genes and on simulated sequences (which can be
generated using the program randseq).  Following that, the reference
histogram file must be generated, by running genefinder on the
simulated sequence.  Details on how to carry out these steps will
follow later.

(i) Construction of nematode gene sequence database.
(ii) Construction of simulated sequence.
(iii) Format of table files.


(II) DEFINITION OF LLR SCORES 

LLR scores are log likelihood ratios for comparing the hypothesis that
a test sequence is a site of a given type, with the hypothesis that
the test sequence is "random".  The approach differs somewhat from the
customary weight matrix method.  In that method, nucleotide
frequencies estimated from the family of known sites are used as the
basis for computing the likelihood of the test sequence. In the LLR
approach, these frequencies are instead estimated from the known sites
together with the test site, so as to obtain the maximum likelihood
under the hypothesis that the test sequence and known sites are drawn
from the same family.  It turns out that a unique score matrix can be
defined for the score computation, avoiding the necessity to actually
re-estimate frequencies for each test site. The LLR approach is thus
conceptually cleaner, computationally just as simple, and has the
practical advantage that it effectively provides automatic small
sample correction (0 nucleotide frequencies, which cause problems for
the weight matrix approach, do not cause problems for the LLR
approach).

Genefinder allows a fairly general class of Markov-type probability
models for defining the likelihoods of the sites and reference
sequences. I have not yet done extensive testing to determine the
best-fitting model in this class, so the current version uses the
simplest models: the splice site model assumes individual nucleotides
within the site are independent of each other; the coding segment
model assumes codons are independent of each other. I have not yet
determined the optimal window size for the site tables, and
arbitrarily use a width of about 30 nucleotides. It would be desirable
to use a more general class of probability models that allows
mixtures of Markov models.

(III) DISTRIBUTION OF LLR SCORES IN KNOWN GENES