GenomeThreader Frequently Asked Questions (FAQ)

  1. How can I estimate the memory requirements of GenomeThreader?
  2. How can I reduce the memory requirements of GenomeThreader?
  3. How can I split the input files to reduce the memory consumption and utilize multiple CPUs?

(1) How can I estimate the memory requirements of GenomeThreader?

The major driving forces are the size of the input files, the number of stored spliced alignments, and the maximum size of the Dynamic Programming matrix. For the genomic file(s) (which are used to construct the index) you need roughly 8 times the uncompressed file size(s). For the EST input file(s) you need roughly 2 times the uncompressed file size(s). The number of stored spliced alignments matters because they are all held in main memory (how many there are depends on your use case; see the statistics at the end of the GenomeThreader output). The maximum size of the Dynamic Programming matrix is very important, because it can create a "space peak" which makes you run out of memory. See the next question for advice on how to limit the size of the matrix.
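
As a rough back-of-the-envelope example (the factors above are approximations; the actual consumption depends on your data): with a 500 MB genomic FASTA file and a 100 MB EST file you should expect about 500 MB * 8 = 4 GB for the index plus about 100 MB * 2 = 200 MB for the ESTs, and on top of that the memory needed for the stored spliced alignments and the Dynamic Programming matrix.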

(2) How can I reduce the memory requirements of GenomeThreader?

Set option -gcmaxgapwidth to a value appropriate for your species to reduce the maximum possible size of the Dynamic Programming matrix. If you do not use option -introncutout, you can use -autointroncutout to prevent "space peaks" caused by large Dynamic Programming tables. You can also split up the input files, as described in the next question. An example invocation is sketched below.
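
A minimal sketch of such a call (the value for -gcmaxgapwidth is a placeholder that has to be chosen according to the maximum intron size expected for your species; whether -autointroncutout takes a matrix-size argument in your gth version should be checked with gth -help):

  # cap the gap width used during chaining (placeholder value of 50000 bp),
  # which bounds the size of the Dynamic Programming matrix, and enable the
  # automatic intron cutout technique; the 200 (MB matrix size) argument to
  # -autointroncutout is an assumption, check gth -help
  gth -genomic genomic.fas -cdna ests.fas -gcmaxgapwidth 50000 -autointroncutout 200 > out.txt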

(3) How can I split the input files to reduce the memory consumption and utilize multiple CPUs?

The basic strategy is to use gth with option -intermediate on different subsets of the input and to combine the results afterwards. The input files can be split with the gt splitfasta tool contained in the GenomeTools package (an open-source collection of bioinformatics tools). To determine the right sizes for the genomic and EST/protein input files, keep the answer to question (1) in mind. A common strategy in practice is to leave the genomic input files untouched (if you can afford it memory-wise) and to split the EST/protein input files into chunks of 50 MB.

There are two possible formats for storing the intermediate results: XML and GFF3 (options -xmlout and -gff3out, respectively). XML output is lossless but takes more resources to process. GFF3 is much more resource-efficient, but it only makes sense if you want GFF3 output in the end, because the other output formats cannot be reconstructed from GFF3.

To combine XML intermediate files, use gthconsensus, which is described in the GenomeThreader manual. Using gthconsensus is quite memory-intensive, because it takes the intermediate XML files and reconstructs the alignments exactly as they were during the gth run. That is, each spliced alignment needs to be stored in memory.

If you do not need the full alignments but can live with the structure in GFF3, the following strategy is recommended: call gth with the options -intermediate and -gff3out. This gives you the spliced alignments predicted by GenomeThreader in GFF3 format. You can then postprocess these spliced alignments with tools from the GenomeTools package. You cannot convert the GFF3 output back to the intermediate form, but you can perform the same analysis which gthconsensus performs with tools operating on the GFF3 output. To do so, sort the intermediate files with gt gff3 -sort and merge them with gt merge afterwards. Then compute the consensus spliced alignments with gt csa. If you also want to predict coding sequences, add them to the GFF3 with gt cds. Finally, gt filter allows you to filter spliced alignments according to their scores. An example of the whole workflow is sketched below.
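
The following sketch shows one possible run of this GFF3-based workflow with two EST chunks (the file names, the chunk size given to gt splitfasta via -targetsize, and the sequence-access option of gt cds are assumptions; check gt splitfasta -help and gt cds -help for your GenomeTools version):

  # split the EST input into chunks of about 50 MB (creates ests.fas.1, ests.fas.2, ...)
  gt splitfasta -targetsize 50 ests.fas

  # run gth on each chunk, writing intermediate results in GFF3
  # (one call per chunk; the calls can run in parallel on different CPUs)
  gth -genomic genomic.fas -cdna ests.fas.1 -intermediate -gff3out > part.1.gff3
  gth -genomic genomic.fas -cdna ests.fas.2 -intermediate -gff3out > part.2.gff3

  # sort each intermediate GFF3 file, then merge the sorted files
  gt gff3 -sort part.1.gff3 > part.1.sorted.gff3
  gt gff3 -sort part.2.gff3 > part.2.sorted.gff3
  gt merge part.1.sorted.gff3 part.2.sorted.gff3 > merged.gff3

  # compute consensus spliced alignments (the analysis gthconsensus performs)
  gt csa merged.gff3 > csa.gff3

  # optionally add coding sequences; -seqfile is an assumption, newer
  # GenomeTools versions may require additional sequence-access options
  gt cds -seqfile genomic.fas csa.gff3 > final.gff3

gt filter can then be applied to the resulting GFF3 to discard spliced alignments whose scores are too low; see gt filter -help for the available options.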