                                   fcontml



Wiki

   The master copies of EMBOSS documentation are available at
   http://emboss.open-bio.org/wiki/Appdocs on the EMBOSS Wiki.

   Please help by correcting and extending the Wiki pages.

Function

   Gene frequency and continuous character maximum likelihood

Description

   Estimates phylogenies from gene frequency data by maximum likelihood
   under a model in which all divergence is due to genetic drift in the
   absence of new mutations. Does not assume a molecular clock. An
   alternative method of analyzing this data is to compute Nei's genetic
   distance and use one of the distance matrix programs. This program can
   also do maximum likelihood analysis of continuous characters that
   evolve by a Brownian Motion model, but it assumes that the characters
   evolve at equal rates and in an uncorrelated fashion, so that it does
   not take into account the usual correlations of characters.

Algorithm

   This program estimates phylogenies by the restricted maximum likelihood
   method based on the Brownian motion model. It is based on the model of
   Edwards and Cavalli-Sforza (1964; Cavalli-Sforza and Edwards, 1967).
   Gomberg (1966), Felsenstein (1973b, 1981c) and Thompson (1975) have
   done extensive further work leading to efficient algorithms. CONTML
   uses restricted maximum likelihood estimation (REML), which is the
   criterion used by Felsenstein (1973b). The actual algorithm is an
   iterative EM Algorithm (Dempster, Laird, and Rubin, 1977) which is
   guaranteed to always give increasing likelihoods. The algorithm is
   described in detail in a paper of mine (Felsenstein, 1981c), which you
   should definitely consult if you are going to use this program. Some
   simulation tests of it are given by Rohlf and Wooten (1988) and Kim and
   Burgman (1988).

   The default (gene frequency) mode treats the input as gene frequencies
   at a series of loci, and square-root-transforms the allele frequencies
   (constructing the frequency of the missing allele at each locus first).
   This enables us to use the Brownian motion model on the resulting
   coordinates, in an approximation equivalent to using Cavalli-Sforza and
   Edwards's (1967) chord measure of genetic distance and taking that to
   give distance between particles undergoing pure Brownian motion. It
   assumes that each locus evolves independently by pure genetic drift.

   The alternative continuous characters mode (menu option C) treats the
   input as a series of coordinates of each species in N dimensions. It
   assumes that we have transformed the characters to remove correlations
   and to standardize their variances.

  A word about microsatellite data

   Many current users of CONTML use it to analyze microsatellite data.
   There are three ways to do this:
     * Coding each copy number as an allele, and feeding in the
       frequencies of these alleles. As CONTML's gene frequency mode
       assumes that all change is by genetic drift, this means that no
       copy number arises by mutation during the divergence of the
       populations. Since microsatellite loci have very high mutation
       rates, this is questionable.
     * Use some other program, one not in the PHYLIP package, to compute
       distances among the populations. Some of the programs that can do
       this are RSTCalc, poptrfdos, Microsat, and Populations. Links to
       them can be found at my Phylogeny Programs web site at
       http://evolution.gs.washington.edu/phylip/software.html.
       Those distance measures allow for mutation during the divergence of
       the populations. But even they are not perfect -- they do not allow
       us to use all the information contained in the gene frequency
       differences of within a copy number allele. There is a need for a
       more complete statistical treatment of inference of phylogenies
       from microsatellite models, ones that take both mutation and
       genetic drift fully into account.
     * Alternatively, there is the Brownian motion approximation to mean
       population copy number. This is described in my book (Felsenstein,
       2004, Chapter 15, pp. 242-245), and it is implicit also in the
       microsatellite distances. Each locus is coded as a single
       continuous character, the mean of the copy number in at that
       microsatellite locus in that species. Thus if the species (or
       population) has frequencies 0.10, 0.24, 0.60, and 0.06 of alleles
       that have 18, 19, 20, and 21 copies, it is coded as having
0.10 X 18 + 0.24 X 19 + 0.60 X 20 + 0.06 X 21   =  19.62

       copies. These values can, I believe, be calculated by a spreadsheet
       program. Each microsatellite is represented by one character, and
       the continuous character mode of CONTML is used (not the gene
       frequencies mode). This coding allows for mutation that changes
       copy number. It does not make complete use of all data, but neither
       does the treatment of microsatellite gene frequencies as changing
       only genetic drift. frequency

Usage

   Here is a sample session with fcontml


% fcontml -printdata
Gene frequency and continuous character maximum likelihood
Input file: contml.dat
Phylip tree file (optional):
Phylip contml program output file [contml.fcontml]:

Adding species:
   1. European
   2. African
   3. Chinese
   4. American
   5. Australian

Output written to file "contml.fcontml"

Tree also written onto file "contml.treefile"


Done.



   Go to the input files for this example
   Go to the output files for this example

Command line arguments

Gene frequency and continuous character maximum likelihood
Version: EMBOSS:6.6.0.0

   Standard (Mandatory) qualifiers:
  [-infile]            frequencies File containing one or more sets of data
  [-intreefile]        tree       Phylip tree file (optional)
  [-outfile]           outfile    [*.fcontml] Phylip contml program output
                                  file

   Additional (Optional) qualifiers (* if not always prompted):
   -datatype           menu       [g] Input type in infile (Values: g (Gene
                                  frequencies); i (Continuous characters))
*  -lengths            boolean    [N] Use branch lengths from user trees
*  -njumble            integer    [0] Number of times to randomise (Integer 0
                                  or more)
*  -seed               integer    [1] Random number seed between 1 and 32767
                                  (must be odd) (Integer from 1 to 32767)
*  -global             boolean    [N] Global rearrangements
   -outgrno            integer    [0] Species number to use as outgroup
                                  (Integer 0 or more)
   -[no]trout          toggle     [Y] Write out trees to tree file
*  -outtreefile        outfile    [*.fcontml] Phylip tree output file
                                  (optional)
   -printdata          boolean    [N] Print data at start of run
   -[no]progress       boolean    [Y] Print indications of progress of run
   -[no]treeprint      boolean    [Y] Print out tree

   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-outfile" associated qualifiers
   -odirectory3        string     Output directory

   "-outtreefile" associated qualifiers
   -odirectory         string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options and exit. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages
   -version            boolean    Report version number and exit


Input file format

   fcontml reads continuous character data.

  Continuous character data

   The programs in this group use gene frequencies and quantitative
   character values. One (CONTML) constructs maximum likelihood estimates
   of the phylogeny, another (GENDIST) computes genetic distances for use
   in the distance matrix programs, and the third (CONTRAST) examines
   correlation of traits as they evolve along a given phylogeny.

   When the gene frequencies data are used in CONTML or GENDIST, this
   involves the following assumptions:
    1. Different lineages evolve independently.
    2. After two lineages split, their characters change independently.
    3. Each gene frequency changes by genetic drift, with or without
       mutation (this varies from method to method).
    4. Different loci or characters drift independently.

   How these assumptions affect the methods will be seen in my papers on
   inference of phylogenies from gene frequency and continuous character
   data (Felsenstein, 1973b, 1981c, 1985c).

   The input formats are fairly similar to the discrete-character
   programs, but with one difference. When CONTML is used in the
   gene-frequency mode (its usual, default mode), or when GENDIST is used,
   the first line contains the number of species (or populations) and the
   number of loci and the options information. There then follows a line
   which gives the numbers of alleles at each locus, in order. This must
   be the full number of alleles, not the number of alleles which will be
   input: i. e. for a two-allele locus the number should be 2, not 1.
   There then follow the species (population) data, each species beginning
   on a new line. The first 10 characters are taken as the name, and
   thereafter the values of the individual characters are read
   free-format, preceded and separated by blanks. They can go to a new
   line if desired, though of course not in the middle of a number.
   Missing data is not allowed - an important limitation. In the default
   configuration, for each locus, the numbers should be the frequencies of
   all but one allele. The menu option A (All) signals that the
   frequencies of all alleles are provided in the input data -- the
   program will then automatically ignore the last of them. So without the
   A option, for a three-allele locus there should be two numbers, the
   frequencies of two of the alleles (and of course it must always be the
   same two!). Here is a typical data set without the A option:

     5    3
2 3 2
Alpha      0.90 0.80 0.10 0.56
Beta       0.72 0.54 0.30 0.20
Gamma      0.38 0.10 0.05  0.98
Delta      0.42 0.40 0.43 0.97
Epsilon    0.10 0.30 0.70 0.62

   whereas here is what it would have to look like if the A option were
   invoked:

     5    3
2 3 2
Alpha      0.90 0.10 0.80 0.10 0.10 0.56 0.44
Beta       0.72 0.28 0.54 0.30 0.16 0.20 0.80
Gamma      0.38 0.62 0.10 0.05 0.85  0.98 0.02
Delta      0.42 0.58 0.40 0.43 0.17 0.97 0.03
Epsilon    0.10 0.90 0.30 0.70 0.00 0.62 0.38


   The first line has the number of species (or populations) and the
   number of loci. The second line has the number of alleles for each of
   the 3 loci. The species lines have names (filled out to 10 characters
   with blanks) followed by the gene frequencies of the 2 alleles for the
   first locus, the 3 alleles for the second locus, and the 2 alleles for
   the third locus. You can start a new line after any of these allele
   frequencies, and continue to give the frequencies on that line (without
   repeating the species name).

   If all alleles of a locus are given, it is important to have them add
   up to 1. Roundoff of the frequencies may cause the program to conclude
   that the numbers do not sum to 1, and stop with an error message.

   While many compilers may be more tolerant, it is probably wise to make
   sure that each number, including the first, is preceded by a blank, and
   that there are digits both preceding and following any decimal points.

   CONTML and CONTRAST also treat quantitative characters (the
   continuous-characters mode in CONTML, which is option C). It is assumed
   that each character is evolving according to a Brownian motion model,
   at the same rate, and independently. In reality it is almost always
   impossible to guarantee this. The issue is discussed at length in my
   review article in Annual Review of Ecology and Systematics
   (Felsenstein, 1988a), where I point out the difficulty of transforming
   the characters so that they are not only genetically independent but
   have independent selection acting on them. If you are going to use
   CONTML to model evolution of continuous characters, then you should at
   least make some attempt to remove genetic correlations between the
   characters (usually all one can do is remove phenotypic correlations by
   transforming the characters so that there is no within-population
   covariance and so that the within-population variances of the
   characters are equal -- this is equivalent to using Canonical
   Variates). However, this will only guarantee that one has removed
   phenotypic covariances between characters. Genetic covariances could
   only be removed by knowing the coheritabilities of the characters,
   which would require genetic experiments, and selective covariances
   (covariances due to covariation of selection pressures) would require
   knowledge of the sources and extent of selection pressure in all
   variables.

   CONTRAST is a program designed to infer, for a given phylogeny that is
   provided to the program, the covariation between characters in a data
   set. Thus we have a program in this set that allow us to take
   information about the covariation and rates of evolution of characters
   and make an estimate of the phylogeny (CONTML), and a program that
   takes an estimate of the phylogeny and infers the variances and
   covariances of the character changes. But we have no program that
   infers both the phylogenies and the character covariation from the same
   data set.

   In the quantitative characters mode, a typical small data set would be:

     5   6
Alpha      0.345 0.467 1.213  2.2  -1.2 1.0
Beta       0.457 0.444 1.1    1.987 -0.2 2.678
Gamma      0.6 0.12 0.97 2.3  -0.11 1.54
Delta      0.68  0.203 0.888 2.0  1.67
Epsilon    0.297  0.22 0.90 1.9 1.74


   Note that in the latter case, there is no line giving the numbers of
   alleles at each locus. In this latter case no square-root
   transformation of the coordinates is done: each is assumed to give
   directly the position on the Brownian motion scale.

   For further discussion of options and modifiable constants in CONTML,
   GENDIST, and CONTRAST see the documentation files for those programs.

  Input files for usage example

  File: contml.dat

    5    10
2 2 2 2 2 2 2 2 2 2
European   0.2868 0.5684 0.4422 0.4286 0.3828 0.7285 0.6386 0.0205
0.8055 0.5043
African    0.1356 0.4840 0.0602 0.0397 0.5977 0.9675 0.9511 0.0600
0.7582 0.6207
Chinese    0.1628 0.5958 0.7298 1.0000 0.3811 0.7986 0.7782 0.0726
0.7482 0.7334
American   0.0144 0.6990 0.3280 0.7421 0.6606 0.8603 0.7924 0.0000
0.8086 0.8636
Australian 0.1211 0.2274 0.5821 1.0000 0.2018 0.9000 0.9837 0.0396
0.9097 0.2976

Output file format

   fcontml output has a standard appearance. The topology of the tree is
   given by an unrooted tree diagram. The lengths (in time or in expected
   amounts of variance) are given in a table below the topology, and a
   rough confidence interval given for each length. Negative lower bounds
   on length indicate that rearrangements may be acceptable.

   The units of length are amounts of expected accumulated variance (not
   time). The log likelihood (natural log) of each tree is also given, and
   it is indicated how many topologies have been tried. The tree does not
   necessarily have all tips contemporary, and the log likelihood may be
   either positive or negative (this simply corresponds to whether the
   density function does or does not exceed 1) and a negative log
   likelihood does not indicate any error. The log likelihood allows
   various formal likelihood ratio hypothesis tests. The description of
   the tree includes approximate standard errors on the lengths of
   segments of the tree. These are calculated by considering only the
   curvature of the likelihood surface as the length of the segment is
   varied, holding all other lengths constant. As such they are most
   probably underestimates of the variance, and hence may give too much
   confidence in the given tree.

   One should use caution in interpreting the likelihoods that are printed
   out. If the model is wrong, it will not be possible to use the
   likelihoods to make formal statistical statements. Thus, if gene
   frequencies are being analyzed, but the gene frequencies change not
   only by genetic drift, but also by mutation, the model is not correct.
   It would be as well-justified in this case to use GENDIST to compute
   the Nei (1972) genetic distance and then use FITCH, KITSCH or NEIGHBOR
   to make a tree. If continuous characters are being analyzed, but if the
   characters have not been transformed to new coordinates that evolve
   independently and at equal rates, then the model is also violated and
   no statistical analysis is possible. Doing such a transformation is not
   easy, and usually not even possible.

   If the U (User Tree) option is used and more than one tree is supplied,
   the program also performs a statistical test of each of these trees
   against the one with highest likelihood. If there are two user trees,
   the test done is one which is due to Kishino and Hasegawa (1989), a
   version of a test originally introduced by Templeton (1983). In this
   implementation it uses the mean and variance of log-likelihood
   differences between trees, taken across loci. If the two trees means
   are more than 1.96 standard deviations different then the trees are
   declared significantly different. This use of the empirical variance of
   log-likelihood differences is more robust and nonparametric than the
   classical likelihood ratio test, and may to some extent compensate for
   the any lack of realism in the model underlying this program.

   If there are more than two trees, the test done is an extension of the
   KHT test, due to Shimodaira and Hasegawa (1999). They pointed out that
   a correction for the number of trees was necessary, and they introduced
   a resampling method to make this correction. The version used here is a
   multivariate normal approximation to their test; it is due to
   Shimodaira (1998). The variances and covariances of the sum of log
   likelihoods across loci are computed for all pairs of trees. To test
   whether the difference between each tree and the best one is larger
   than could have been expected if they all had the same expected
   log-likelihood, log-likelihoods for all trees are sampled with these
   covariances and equal means (Shimodaira and Hasegawa's "least favorable
   hypothesis"), and a P value is computed from the fraction of times the
   difference between the tree's value and the highest log-likelihood
   exceeds that actually observed. Note that this sampling needs random
   numbers, and so the program will prompt the user for a random number
   seed if one has not already been supplied. With the two-tree KHT test
   no random numbers are used.

   In either the KHT or the SH test the program prints out a table of the
   log-likelihoods of each tree, the differences of each from the highest
   one, the variance of that quantity as determined by the log-likelihood
   differences at individual sites, and a conclusion as to whether that
   tree is or is not significantly worse than the best one.

   One problem which sometimes arises is that the program is fed two
   species (or populations) with identical transformed gene frequencies:
   this can happen if sample sizes are small and/or many loci are
   monomorphic. In this case the program "gets its knickers in a twist"
   and can divide by zero, usually causing a crash. If you suspect that
   this has happened, check for two species with identical coordinates. If
   you find them, eliminate one from the problem: the two must always show
   up as being at the same point on the tree anyway.

  Output files for usage example

  File: contml.fcontml


Continuous character Maximum Likelihood method version 3.69.650


   5 Populations,   10 Loci

Numbers of alleles at the loci:
------- -- ------- -- --- -----

   2   2   2   2   2   2   2   2   2   2

Name                 Gene Frequencies
----                 ---- -----------

  locus:         1         1         2         2         3         3
                 4         4         5         5         6         6
                 7         7         8         8         9         9
                10        10

European     0.28680   0.71320   0.56840   0.43160   0.44220   0.55780
             0.42860   0.57140   0.38280   0.61720   0.72850   0.27150
             0.63860   0.36140   0.02050   0.97950   0.80550   0.19450
             0.50430   0.49570
African      0.13560   0.86440   0.48400   0.51600   0.06020   0.93980
             0.03970   0.96030   0.59770   0.40230   0.96750   0.03250
             0.95110   0.04890   0.06000   0.94000   0.75820   0.24180
             0.62070   0.37930
Chinese      0.16280   0.83720   0.59580   0.40420   0.72980   0.27020
             1.00000   0.00000   0.38110   0.61890   0.79860   0.20140
             0.77820   0.22180   0.07260   0.92740   0.74820   0.25180
             0.73340   0.26660
American     0.01440   0.98560   0.69900   0.30100   0.32800   0.67200
             0.74210   0.25790   0.66060   0.33940   0.86030   0.13970
             0.79240   0.20760   0.00000   1.00000   0.80860   0.19140
             0.86360   0.13640
Australian   0.12110   0.87890   0.22740   0.77260   0.58210   0.41790
             1.00000   0.00000   0.20180   0.79820   0.90000   0.10000
             0.98370   0.01630   0.03960   0.96040   0.90970   0.09030
             0.29760   0.70240


  +-----------------------------------------------------------African
  !
  !             +-------------------------------Australian
  1-------------3
  !             !     +-----------------------American
  !             +-----2
  !                   +Chinese
  !
  +European


remember: this is an unrooted tree!

Ln Likelihood =    38.71914

Between     And             Length      Approx. Confidence Limits
-------     ---             ------      ------- ---------- ------
  1       African        0.09693444   (  0.03123910,  0.19853604)
  1          3           0.02252816   (  0.00089799,  0.05598045)
  3       Australian     0.05247405   (  0.01177094,  0.11542374)
  3          2           0.00945315   ( -0.00897717,  0.03795670)
  2       American       0.03806240   (  0.01095938,  0.07997877)
  2       Chinese        0.00208822   ( -0.00960622,  0.02017433)
  1       European       0.00000000   ( -0.01627246,  0.02516630)



  File: contml.treefile

(African:0.09693444,(Australian:0.05247405,(American:0.03806240,Chinese:0.002088
22):0.00945315):0.02252816,
European:0.00000000);

Data files

   None

Notes

   None.

References

   None.

Warnings

   None.

Diagnostic Error Messages

   None.

Exit status

   It always exits with status 0.

Known bugs

   None.

See also

                    Program name                   Description
                    egendist     Genetic distance matrix program
                    fgendist     Compute genetic distances from gene frequencies

Author(s)

                    This program is an EMBOSS conversion of a program written by Joe
                    Felsenstein as part of his PHYLIP package.

                    Please report all bugs to the EMBOSS bug team
                    (emboss-bug (c) emboss.open-bio.org) not to the original author.

History

                    Written (2004) - Joe Felsenstein, University of Washington.

                    Converted (August 2004) to an EMBASSY program by the EMBOSS team.

Target users

                    This program is intended to be used by everyone and everything, from
                    naive users to embedded scripts.

Comments

None
