Genomics SU Calculator

This Genomics Service Unit (SU) Calculator helps life sciences researchers assess the resources a particular run of a common bioinformatics program (e.g., BWA, Bowtie, Tophat, or Cufflinks) will require to complete. The resources assessed are the estimated walltime (the expected real running time, i.e., the time expected to elapse on a clock over the entirety of a program's run) and Service Units (units that are allocated to users and then charged for usage of the supercomputing resources). Given a program, organism, read count, and read length configuration, the calculator gauges how long a specific run would take and how much memory it needs on the computing platforms at the Minnesota Supercomputing Institute, particularly the Itasca and lab nodes, and then calculates the CPU hours and Service Units where appropriate. It displays the estimated walltime in hours, CPU hours, Service Units, estimated peak memory in GB, and the recommended number of threads required to complete one run on either an Itasca or lab node.


[Interactive calculator form: inputs are Program, Organism, Read Count (millions), and Read Length; outputs are Est. Walltime (hr), CPU Hours, Service Units, Peak Memory (GB), Threads, and a PBS Script for Itasca, and Est. Walltime (hr), Peak Memory (GB), Threads, and a PBS Script for Lab.]

How to Use

Instructions
1. Select the program you want to run
2. Select the organism that most closely matches the organism of your data
3. Enter the read count of one end of your paired-end data, in millions (e.g., enter 10 if both ends each have 10 million reads)
4. Select the read length that most closely matches that of your reads (e.g., select 50 for 2x50 base pair reads)
5. Click "Calculate" to obtain resource estimates

What goes in?
Drop-down menus and fields are provided for entering the configuration of your run.
- Program: BWA, Bowtie2, Tophat2, or Cufflinks2
-- Tophat can be run with or without the --GTF option
-- Cufflinks can be run with the --GTF option, the --GTF-guide option, or no GTF option
- Organism: Mammalian, Plant, Fungi
- Read Count (in millions)
- Read Length: 50, 75, 100, or 150 base pairs (the lengths for which benchmark data are available)

What comes out?
The calculator will display the following for the Itasca nodes:
- Estimated walltime in hours on Itasca
- CPU hours of the run (walltime*maxThreads)
- Service Units (SUs) required (CPU hours/1.5)
- Estimated peak memory in GB
- Recommended number of threads
- Template PBS script for the run

and the following for the lab nodes:
- Estimated walltime in hours on lab queue
- Estimated peak memory in GB
- Recommended number of threads
- Template PBS script for the run

The estimated walltime is the expected real running time of the run, i.e., the time that is expected to elapse on a clock for the entirety of a program's run.
CPU hours is the walltime multiplied by the max number of threads on the node running the job, e.g., the CPU hours for a job that runs for 5.0 hours on Itasca will be 5.0*8=40.0 since the max number of threads on an Itasca batch node is 8.
Service Units (SUs) are allocated to users and then charged to allow usage of the supercomputing resources. It is how the Minnesota Supercomputing Institute keeps track of the usage of the supercomputers. More information about SUs can be found here and more information about MSI's HPC policies and procedures can be found here. For Itasca, 1 SU will provide 1.5 CPU hours of time.
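
As a quick check of that arithmetic, the two formulas can be reproduced in a few lines of shell (the 5.0-hour walltime is just the illustrative value from the example above):

walltime_hr=5.0                                   # estimated walltime from the calculator
threads=8                                         # max threads on an Itasca batch node
cpu_hours=$(echo "$walltime_hr * $threads" | bc)  # 40.0
sus=$(echo "scale=1; $cpu_hours / 1.5" | bc)      # 26.6 (1 SU buys 1.5 CPU hours on Itasca)
echo "CPU hours: $cpu_hours, SUs: $sus"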
The estimated peak memory is the amount of memory that the job will require on the node to run through successfully.
The recommended number of threads is the number of cores a user should run the program on, and also the number of cores the user should request in a PBS script. For Itasca and lab nodes, it is suggested to always run a program on the max number of threads, which is 8 for both computing platforms.
The template PBS script provided as a download will allow researchers to easily create the script to run their program on one of the nodes.
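
For illustration only, a template of the kind the calculator provides might look like the sketch below; the queue, resource values, module name, and file names here are placeholders, not the calculator's actual output:

#!/bin/bash -l
#PBS -l nodes=1:ppn=8,walltime=05:00:00,mem=20gb   # 8 threads, 5 hr, 20 GB (placeholder values)
#PBS -q batch                                       # Itasca batch queue
cd $PBS_O_WORKDIR                                   # run from the submission directory
module load bowtie2                                 # load the program (module name assumed)
bowtie2 --threads 8 -x [bowtie2_index] -1 [R1.fastq] -2 [R2.fastq] -S [output.sam]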

This output helps life sciences researchers understand the resources required for their particular run of a program.

Where do these estimates come from?

The estimates of the walltimes and peak memories of a particular run come from the results of an informatics benchmarking project. The purpose of the informatics benchmarking project was to help assess the resources needed to run a program for many different possible configurations. The time and memory results of the benchmarks were graphed and analyzed, and curves were fitted to the data points. The resulting equations estimate the walltime and peak memory when given a program, organism, read count, and read length.

The benchmarked programs include:
- BWA (0.7.4)
- Bowtie2 (2.1.0)
- Tophat2 (2.0.10)
- Cufflinks2 (2.1.1) (note: Cufflinks 2.2 performance is comparable to 2.1.1)

The organisms used for the benchmarks include:
- Human (Homo sapiens) for mammalian organisms
- Potato (Solanum tuberosum) for plant organisms
- Mushroom (Basidiomycetes) for fungi organisms

The resource estimates are limited by the data available to the informatics benchmarking project and by each computing platform's per-node memory and walltime constraints.

More information about the informatics benchmarking project, including the methods used for the benchmarking and the results thus far, can be found at the Informatics Benchmarking page on MSI's Intranet.

Notes

Itasca
It is suggested that jobs on Itasca be run on the batch nodes rather than the Sandy Bridge nodes: there are 1,134 batch nodes and only 51 Sandy Bridge nodes, so the time from job submission to the start of execution will generally be shorter on the batch nodes. The Itasca estimates given in the calculator are for the batch nodes, but performance between the two kinds of Itasca nodes is comparable. The only time to consider a Sandy Bridge node over a batch node is when the estimated walltime of a run is greater than 24 hours, which exceeds the maximum walltime of a batch node.

It is suggested that the max number of threads always be used when running jobs on Itasca, since the entirety of a requested node is dedicated to the user; not using all of the cores would be a waste. There are 8 cores per batch node and 16 cores per Sandy Bridge node. The output displayed for Itasca is always for a job running on 8 threads, the max number of threads on a batch node.

More information on Itasca.
Information about the Sandy Bridge nodes can be found here.

Lab
There are three kinds of lab nodes: old calhoun, mirror, and elmo nodes. It is suggested that only old calhoun and mirror nodes be used when running jobs on the lab queue. Elmo nodes are inferior in several ways: they are significantly slower, the time between job submission and the start of execution is unreliable, and there are only 6 elmo nodes compared to 16 mirror nodes and 64 old calhoun nodes. The old calhoun and mirror nodes perform comparably, and it is generally easy to get into the lab queue on either. The estimates shown in the calculator are for the old calhoun nodes and are essentially the same for the mirror nodes.

Service Units are not required to run jobs on the lab queue, so the output for the lab nodes does not display CPU hours or Service Units. When a particular job is estimated to take longer than 8 hours on the lab nodes, the output for the lab nodes will display "not recommended"; at that point, it is highly recommended that the job be run on Itasca instead of a lab node. The output displayed for the lab nodes is always for a job running on 8 threads, the max number of threads on both a mirror and an old calhoun node.

Use of the mirror or old calhoun nodes is requested with the PBS option:
#PBS -l feature=[node]
where [node] is replaced by "mirrornode" for the mirror nodes or "xenode" for the old calhoun nodes.
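
For example, a lab-queue script targeting the old calhoun nodes would include directives along these lines (the queue name and resource values here are illustrative, not prescribed):

#PBS -q lab                              # lab queue (queue name assumed)
#PBS -l nodes=1:ppn=8,walltime=04:00:00  # 8 threads, 4 hr (placeholder values)
#PBS -l feature=xenode                   # request an old calhoun node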

More information on the Lab Queue.
Old calhoun nodes are nodes lab001-064, mirror nodes are nodes mirror1-mirror16, and elmo nodes are nodes labh03-labh08.

Tophat
Tophat can be run with or without the --GTF (-G) option. If the --GTF option is provided, Tophat will build a virtual transcriptome from the annotation and align reads to it first when it calls Bowtie. Read more about this option and other Tophat options here.

Tophat was found to be fairly memory intensive, requiring roughly 3 GB/core. Tophat jobs run with the max number of threads on a lab node will terminate prematurely because they exceed the maximum memory available on a lab node, 16 GB (8 threads x 3 GB = 24 GB). Thus, if a Tophat job is to be run with the max number of threads on a node, it is highly suggested that Itasca be used, and the calculator will simply display "Not recommended" for the lab output. Since the calculator outputs estimates only for the max number of threads per node, it will never recommend running a Tophat job on a lab node.

For different organisms, the Tophat parameters should be changed to best fit the organism, as reflected in the usage examples below.
- Mammal:
-- Default parameters
- Plant:
-- Mate inner distance: 65 bp
-- Minimum intron length: 45
-- Maximum intron length: 5000
-- Minimum intron length that may be found during split-segment search: 45
-- Maximum intron length that may be found during split-segment search: 5000
- Fungi:
-- Mate inner distance: 49 bp
-- Mate standard deviation: 108
-- Minimum intron length: 10
-- Maximum intron length: 850
-- Minimum coverage intron: 15
-- Maximum coverage intron: 850
-- Minimum segment intron: 15
-- Maximum segment intron: 1000
-- Segment mismatches: 2
-- Segment length: 25
-- Library type: fr-unstranded
-- Coverage search
-- Microexon search

Usage for Mammalian data:
tophat --num-threads [#threads] --output-dir [output_directory] [bowtie2_index] [R1.fastq] [R2.fastq]

Usage for Plant data:
tophat --num-threads [#threads] --mate-inner-dist 65 --min-intron-length 45 --max-intron-length 5000 --min-segment-intron 45 --max-segment-intron 5000 --output-dir [output_directory] [bowtie2_index] [R1.fastq] [R2.fastq]

Usage for Fungi data:
tophat --num-threads [#threads] --mate-inner-dist 49 --mate-std-dev 108 --min-intron-length 10 --max-intron-length 850 --coverage-search --microexon-search --library-type fr-unstranded --segment-mismatches 2 --segment-length 25 --min-coverage-intron 15 --max-coverage-intron 850 --min-segment-intron 15 --max-segment-intron 1000 --output-dir [output_directory] [bowtie2_index] [R1.fastq] [R2.fastq]

If the --GTF option is desired, add the option --GTF [reference_annotation.(gtf/gff)], as in the example below.
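
For example, a mammalian run with a reference annotation (file names are placeholders) would be:
tophat --num-threads [#threads] --GTF [reference_annotation.gtf] --output-dir [output_directory] [bowtie2_index] [R1.fastq] [R2.fastq]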

Cufflinks
It is recommended that Cufflinks always be run with the --frag-bias-correct (-b) and --multi-read-correct (-u) options, as they will generally speed up the run. The --frag-bias-correct option tells Cufflinks to run its bias detection and correction algorithm, which improves the accuracy of transcript abundance estimates. The --multi-read-correct option tells Cufflinks to do an initial estimation procedure to more accurately weight reads that map to multiple locations in the genome.

In addition to these options, Cufflinks may be run with a GTF option, specifically --GTF (-G) or --GTF-guide (-g). If the --GTF option is provided, Cufflinks will use the reference annotation to estimate isoform expression. If the --GTF-guide option is provided, Cufflinks will use the reference annotation to guide RABT assembly.
Read more about these options and other Cufflinks options here.

Usage for Mammalian data:
cufflinks --num-threads [#threads] --frag-bias-correct [genome.fa] --multi-read-correct --output-dir [output_directory] [accepted_hits.bam]

Usage for Plant data:
cufflinks --num-threads [#threads] --frag-bias-correct [genome.fa] --multi-read-correct --min-intron-length 45 --max-intron-length 5000 --output-dir [output_directory] [accepted_hits.bam]

Usage for Fungi data:
cufflinks --num-threads [#threads] --frag-bias-correct [genome.fa] --multi-read-correct --library-type fr-unstranded --min-intron-length 10 --max-intron-length 850 --output-dir [output_directory] [accepted_hits.bam]

If the --GTF option is desired, add the option --GTF [reference_annotation.(gtf/gff)]
If the --GTF-guide option is desired, add the option --GTF-guide [reference_annotation.(gtf/gff)], as in the example below
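
For example, a mammalian run using RABT assembly with a reference annotation (file names are placeholders) would be:
cufflinks --num-threads [#threads] --frag-bias-correct [genome.fa] --multi-read-correct --GTF-guide [reference_annotation.gtf] --output-dir [output_directory] [accepted_hits.bam]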
