RISSdb: Reference Bioinformatics Data
RISS maintains local copies of several commonly used public reference data repositories for use by the MSI user community. Many public websites hosting reference data don't allow high-throughput access to the data, and most software that makes use of public datasets runs much faster with a local copy of the dataset than when accessing it over the internet.
These datasets are located on the MSI filesystem at /panfs/roc/rissdb and are accessible from MSI compute clusters, including the Lab cluster, Itasca, and login.msi.umn.edu. They are not accessible from MSI Windows servers or from Galaxy.
NCBI Blast Databases /panfs/roc/rissdb/blast/current
The following BLAST databases from NCBI are available: nt, nr, swissprot, taxdb, vector, human_genomic, refseq_genomic, refseq_protein, refseq_rna, refseqgene, pataa, patnt, pdbaa, pdbnt, gss, est, est_human, est_mouse, est_others, sts, htgs, and wgs. They are automatically updated once a week. The ncbi_blast+ module is configured to automatically use these databases. See the ncbi_blast+ software page for more details.
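When the ncbi_blast+ module's automatic configuration is not in use, a BLAST command can point at a database explicitly. The following is a minimal Python sketch, assuming that each database listed above sits directly under /panfs/roc/rissdb/blast/current with the database name as its base name (the page does not confirm this layout). The helper name `blast_command` and the file names in the usage note are hypothetical; `-db`, `-query`, and `-out` are standard BLAST+ options.

```python
from pathlib import PurePosixPath

# Directory named on this page for the NCBI BLAST databases.
BLAST_DB_DIR = PurePosixPath("/panfs/roc/rissdb/blast/current")

# Database names listed on this page.
AVAILABLE_DBS = {
    "nt", "nr", "swissprot", "taxdb", "vector", "human_genomic",
    "refseq_genomic", "refseq_protein", "refseq_rna", "refseqgene",
    "pataa", "patnt", "pdbaa", "pdbnt", "gss", "est", "est_human",
    "est_mouse", "est_others", "sts", "htgs", "wgs",
}


def blast_command(program, db, query, out):
    """Build an argument list for a BLAST+ program (hypothetical helper).

    Assumes the database base name lives directly inside BLAST_DB_DIR.
    """
    if db not in AVAILABLE_DBS:
        raise ValueError(f"database not listed on this page: {db}")
    return [program, "-db", str(BLAST_DB_DIR / db), "-query", query, "-out", out]
```

For example, `blast_command("blastn", "nt", "query.fa", "hits.txt")` returns a list that can be passed to `subprocess.run` on a system where the databases are mounted. The program still has to match the database type (a nucleotide database such as nt for blastn, a protein database such as nr for blastp).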
GATK Bundle /panfs/roc/rissdb/gatk
The GATK Bundle is a set of reference data for use with GATK. See the GATK website for details on what the dataset contains and how to use it. This dataset is updated once a week.
Illumina has provided the RNA-Seq user community with the iGenomes dataset, a set of genome sequence indices (including Bowtie indices) and GTF transcript annotation files for a few of the most heavily studied organisms. MSI keeps a local copy of only the GTF annotation files from this dataset. This dataset is updated once a week.
Protein Databank (PDB) /panfs/roc/rissdb/pdb
PDB is an archive of macromolecular structural data. See the PDB website for more information about using this dataset.
Reference Genomes /panfs/roc/rissdb/genomes
MSI maintains local copies of a wide range of reference genomes, organized by species. In the /panfs/roc/rissdb/genomes folder there is a folder for each species, and in each species folder there is a folder for each genome build. Each genome build folder contains the following folders and files:
- seq: contains a single fasta file of the genome sequence; a .dict file generated by Picard Tools; a .fai file generated by Samtools; and the genome sequence in .2bit format
- bowtie2: contains a genome index for Bowtie version 2.x
- bwa: contains a genome index compatible with BWA versions 0.6+
- gmap: contains a genome index for gmap
- maq: contains a genome index for maq
- annotation: contains the GTF annotation file from the iGenomes dataset (see above), if available
- README.genome: a plain text file containing detailed information about the genome build, including where it was downloaded from
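The species/build layout described above can be turned into paths programmatically. The following is a minimal Python sketch, assuming the folder names are exactly as listed; the helper name `build_paths` and the species and build names used in the usage note are hypothetical placeholders, not confirmed folder names on the filesystem.

```python
from pathlib import PurePosixPath

# Root folder for reference genomes, as named on this page.
GENOMES_ROOT = PurePosixPath("/panfs/roc/rissdb/genomes")

# Subfolders that each genome build folder is described as containing.
BUILD_SUBFOLDERS = ("seq", "bowtie2", "bwa", "gmap", "maq", "annotation")


def build_paths(species, build, root=GENOMES_ROOT):
    """Return a dict mapping each subfolder name to its expected path.

    Layout assumed from this page: <root>/<species>/<build>/<subfolder>.
    The paths are only constructed, not checked for existence.
    """
    build_dir = root / species / build
    return {name: build_dir / name for name in BUILD_SUBFOLDERS}
```

For instance, `build_paths("Homo_sapiens", "hg19")["bwa"]` gives the folder where a BWA index would be expected for that (hypothetical) species and build. Listing /panfs/roc/rissdb/genomes on an MSI system shows the actual species and build folder names.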
Other datasets /panfs/roc/rissdb/adhoc
The adhoc folder contains additional datasets that may be of use but that are not maintained or documented.