Trinity RNA Sequence Assembler Guide (in progress)
Trinity is a bioinformatics software package that carries out de novo assembly of RNA-Seq data generated on the Illumina platform. It produces high-quality contigs representing real RNA transcripts from both prokaryotic and eukaryotic organisms where a reference genome is not available. Trinity can also process meta-transcriptomic data; however, careful analysis of the resulting contigs is necessary.
This guide assumes that the user is already familiar with the theory of how Trinity operates and only covers issues regarding running Trinity on the HPC Cluster. For more information about Trinity and its general operation, please see the Trinity website.
Two versions of the Trinity package are currently installed on the cluster: r2012-05-18 and r2012-06-08. These versions are functionally identical; however, they take different command arguments, so users should be aware of which version they are using.
Before using Trinity, users must load the module for the Trinity version that they wish to use as follows:
module load trinityrnaseq/2012-06-08
or
module load trinityrnaseq/2012-05-18
Additionally, users must also load the module for the bowtie sequence aligner as follows:
module load bowtie/0.12.8
Users who run Trinity often will find it helpful to have the required modules load automatically at login. Please refer to the Modules_Guide for more information about how to have modules load automatically at login, as well as other useful information about module usage in general.
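One common way to arrange this is to append the module load commands to a shell startup file; a minimal sketch is shown below. The use of ~/.bashrc is an assumption here — consult the Modules_Guide for the cluster's recommended mechanism.

```shell
# Append the module loads to the shell startup file so they run at each login.
# (~/.bashrc is an assumed location; the Modules_Guide describes the
#  cluster's preferred way to auto-load modules.)
echo 'module load trinityrnaseq/2012-06-08' >> ~/.bashrc
echo 'module load bowtie/0.12.8' >> ~/.bashrc
```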
How to run a default Trinity job with SLURM
Standard practice for running Trinity jobs on the HPC Cluster uses a specialized shell script that confines all computation to a single processing node, using that node's local disk space exclusively. The shell script automatically transfers your designated input files to the compute node; after Trinity finishes, it compresses the output directory with bzip2 and transfers the result back to your home directory. Output files are labeled according to the date and time at which execution of the shell script began.
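The wrapper script performs the compression and labeling automatically; the sketch below only illustrates the equivalent commands, and the directory and archive names are assumptions, not the script's actual names.

```shell
# Sketch (assumed names): compress a Trinity output directory and label the
# archive with the date/time at which the run began.
STAMP=$(date +%Y-%m-%d_%H%M)                      # e.g. 2012-06-08_1430
mkdir -p trinity_out
echo "demo contig" > trinity_out/contigs.txt      # stand-in for real output
tar cjf "trinity_out_${STAMP}.tar.bz2" trinity_out
ls trinity_out_*.tar.bz2
```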
By default, Trinity uses 2 processing cores; this value can be changed using the --CPU argument of Trinity. When submitting a job, users must also use the SLURM-provided environment variable $SLURM_CPUS_PER_TASK to forward the number of allocated cores to Trinity.
$ cat Trinity.sh
#!/bin/bash
#SBATCH -n 1
#SBATCH -c 6   # users can change the number of processing cores here, but it cannot exceed the total number of cores on the node
Trinity.pl --CPU $SLURM_CPUS_PER_TASK <other options>
Then submit the script with sbatch:
$ sbatch Trinity.sh
NOTE: The job script generates no explicit error message when Trinity fails to execute properly, and the script may exit with status 0 regardless. Users are advised to examine the run.log file to ensure that Trinity execution was successful prior to proceeding with further analyses.
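A quick sanity check is to extract the returned archive and scan run.log for problems. In the sketch below, the archive and directory names are fabricated for the demo (the first two lines only simulate the archive the job script copies back), and the exact messages Trinity writes to run.log depend on the version in use.

```shell
# Demo setup only: simulate the compressed output the job script returns.
mkdir -p trinity_out
echo "all stages completed" > trinity_out/run.log   # stand-in log content
tar cjf trinity_out_demo.tar.bz2 trinity_out && rm -r trinity_out

# In practice, start here: extract the archive and scan run.log for trouble.
tar xjf trinity_out_demo.tar.bz2
if grep -qiE 'error|fail' trinity_out/run.log; then
    echo "run.log reports problems; investigate before further analysis"
else
    echo "no obvious errors in run.log"
fi
```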
Restarting failed jobs
In certain cases, Trinity execution will fail before completing the final compute phase (Butterfly). Trinity has the ability to automatically detect output files from each stage of program execution, and in many cases can resume execution from the last valid state.
In such cases where Trinity fails to complete execution on the cluster, intermediate results are compressed and transferred back to the user's home directory to allow for troubleshooting and eventual resumption of the run from the last valid state. To restart a failed Trinity run, users will need to modify the job script to specify the previous output as the new input.
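A modified job script might look like the following sketch. The directory name "previous_trinity_out" is a placeholder for the extracted output of the failed run, and passing it via --output reflects Trinity's general resume mechanism (rerunning against the previous output directory); confirm the exact arguments against the version of Trinity you are using.

```shell
#!/bin/bash
#SBATCH -n 1
#SBATCH -c 6
# Point Trinity at the extracted output directory from the failed run
# ("previous_trinity_out" is a placeholder). Trinity detects the intermediate
# files already present there and resumes from the last valid state.
Trinity.pl --CPU $SLURM_CPUS_PER_TASK --output previous_trinity_out <other options>
```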
Disk space requirements
Due to the standard account limits on directory size, users who wish to use Trinity on the HPC Cluster will need to request an increase in maximum directory size to at least double the size of the largest dataset they wish to process. For example, if a user wishes to process a paired-end dataset that totals 25GB, the user should request that their directory limit be increased to 50GB to allow for the input and output files to reside in the home directory at the same time.
Users are highly encouraged to utilize compressed file formats such as gzip or bzip2 as much as possible when working with large datasets.
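For example, bzip2 compresses a file in place, replacing the original with a .bz2 copy; the FASTQ file name below is a demo, not a real dataset.

```shell
# Demo: bzip2-compress a FASTQ file; the original is replaced by a .bz2 copy.
printf '@read1\nACGTACGT\n+\nIIIIIIII\n' > demo_reads.fq
bzip2 demo_reads.fq
ls demo_reads.fq.bz2   # only the compressed copy remains on disk
```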