RISSdb: Reference Bioinformatics Data

RISS maintains local copies of several commonly used public reference data repositories for use by the MSI user community. Many public websites hosting reference data don't allow high-throughput access to the data, and most software that makes use of public datasets runs much faster using a local copy of the dataset instead of accessing it over the internet.

Access

These datasets are located on the MSI filesystem at /panfs/roc/rissdb. They are accessible from MSI compute clusters, including the Lab cluster, Itasca, Calhoun, and login.msi.umn.edu. They are not accessible from MSI Windows servers or from Galaxy.
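
A quick way to confirm that the current host can see the datasets is to list the top-level folders under /panfs/roc/rissdb. The following is a minimal Python sketch; the function name list_datasets is hypothetical and chosen only for illustration, and the folder names it reports depend on what is actually present on the filesystem.

```python
from pathlib import Path

# Location of the reference datasets, as stated above.
RISSDB_ROOT = Path("/panfs/roc/rissdb")


def list_datasets(root=RISSDB_ROOT):
    """Return the sorted names of the top-level dataset folders.

    Returns an empty list when the root folder is not visible from this
    host (for example, on a machine outside the MSI compute clusters).
    """
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.is_dir())


if __name__ == "__main__":
    names = list_datasets()
    if names:
        print("Datasets visible on this host:", ", ".join(names))
    else:
        print(f"{RISSDB_ROOT} is not accessible from this host")
```

An empty result most likely means the script is running on a host that does not mount the MSI filesystem.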

Datasets

NCBI Blast Databases /panfs/roc/rissdb/blast/current

The following BLAST databases from NCBI are available: nt, nr, swissprot, taxdb, vector, human_genomic, refseq_genomic, refseq_protein, refseq_rna, refseqgene, pataa, patnt, pdbaa, pdbnt, gss, est, est_human, est_mouse, est_others, sts, htgs, and wgs. They are automatically updated once a week. The ncbi_blast+ module is configured to automatically use these databases. See the ncbi_blast+ software page for more details.
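
As an illustration of referring to these databases by name, the Python sketch below assembles a blastn command line against the nt database. It assumes that the ncbi_blast+ module has already been loaded in the shell, so that blastn is on the PATH and resolves the database name on its own. The function name build_blastn_command, the query file query.fasta, the output file hits.tsv, and the E-value cutoff are all hypothetical choices for the example. The sketch only prints the command; it does not run it.

```python
import shlex


def build_blastn_command(query_fasta, db="nt", out_path="hits.tsv", evalue=1e-5):
    """Build a blastn argument list for a nucleotide database such as nt.

    The database is given by name only, on the assumption that the
    ncbi_blast+ module has configured BLAST to find the local copies.
    """
    return [
        "blastn",
        "-query", str(query_fasta),   # input FASTA file (hypothetical name)
        "-db", db,                    # database name, e.g. nt or refseq_rna
        "-evalue", str(evalue),       # E-value cutoff (illustrative value)
        "-outfmt", "6",               # tabular output
        "-out", str(out_path),        # output file (hypothetical name)
    ]


if __name__ == "__main__":
    cmd = build_blastn_command("query.fasta")
    # Print the command for inspection; pass cmd to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Protein databases such as nr or swissprot would need a protein-capable program such as blastp or blastx in place of blastn.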

GATK Bundle /panfs/roc/rissdb/gatk

The GATK Bundle is a set of reference data for use with GATK. See the GATK website for more details of what is contained in the dataset and how to make use of it. This dataset is updated once a week.

iGenomes /panfs/roc/rissdb/igenomes

Illumina has provided the RNA-Seq user community with a set of genome sequence indices (including Bowtie indices) as well as GTF transcript annotation files for a few of the most heavily studied organisms. MSI only keeps a local copy of the GTF annotation files from this dataset. This dataset is updated once a week.

Protein Databank (PDB) /panfs/roc/rissdb/pdb

PDB is an archive of macromolecular structural data. See the PDB website for more information about using this dataset.

Reference Genomes /panfs/roc/rissdb/genomes

MSI maintains local copies of a wide range of reference genomes, organized by species. In the /panfs/roc/rissdb/genomes folder there is a folder for each species, and in each species folder there is a folder for each genome build. Each genome build folder contains the following folders and files:

  • seq: contains a single fasta file of the genome sequence; a .dict file generated by Picard Tools; a .fai file generated by Samtools; and the genome sequence in .2bit format
  • bowtie2: contains a genome index for Bowtie version 2.x
  • bwa: contains a genome index compatible with BWA versions 0.6+
  • gmap: contains a genome index for gmap
  • maq: contains a genome index for maq
  • annotation: contains the GTF annotation file from the iGenomes dataset (see above), if available
  • README.genome: a plain text file containing detailed information about the genome build, including where it was downloaded from
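
The folder layout described above (species, then genome build, then one folder per resource) can be turned into paths programmatically. The following Python sketch is one way to do that. The function names genome_build_paths and find_fasta are hypothetical, as are the species and build names used in the example, and the .fa/.fasta file extensions are an assumption, since the exact file names in the seq folder are not specified here.

```python
from pathlib import Path

# Root of the reference genomes, as stated above.
GENOMES_ROOT = Path("/panfs/roc/rissdb/genomes")

# Sub-folders listed above for each genome build.
BUILD_SUBFOLDERS = ("seq", "bowtie2", "bwa", "gmap", "maq", "annotation")


def genome_build_paths(species, build, root=GENOMES_ROOT):
    """Map each resource folder name to its path for one genome build."""
    build_dir = Path(root) / species / build
    return {name: build_dir / name for name in BUILD_SUBFOLDERS}


def find_fasta(seq_dir):
    """Return FASTA files in a seq folder, or [] if the folder is missing.

    Assumes the FASTA file ends in .fa or .fasta; index files such as
    .fai, .dict, and .2bit are skipped.
    """
    seq_dir = Path(seq_dir)
    if not seq_dir.is_dir():
        return []
    return sorted(p for p in seq_dir.iterdir() if p.suffix in {".fa", ".fasta"})


if __name__ == "__main__":
    # "species_x" and "build_y" are placeholders, not real folder names.
    paths = genome_build_paths("species_x", "build_y")
    for name, path in paths.items():
        print(f"{name}: {path}")
    print("FASTA files found:", [p.name for p in find_fasta(paths["seq"])])
```

Replace the placeholder species and build names with folder names that actually exist under /panfs/roc/rissdb/genomes.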

List of available genomes

Other datasets /panfs/roc/rissdb/adhoc

The adhoc folder contains additional datasets that may be of use, but are not maintained or documented.