MSI provides large-capacity, high-performance temporary storage for use while applications are running on the supercomputer. Depending on the system, scratch storage may be a set-aside area of primary storage (global scratch), or may consist of separate storage attached via fast data links. While global scratch is a shared space that is visible to all nodes, local scratch disk, SSD, and RAMdisk are each connected (and visible) to only a single node.
- Access to scratch storage is shared and does not need to be requested, except for SSD nodes, which must be requested in a PBS script as shown below.
- The available/free capacity of each scratch type will vary based on the aggregate utilization of these shared resources.
- Scratch storage has no quotas and is not backed up.
- Scratch storage should not be used for any valuable data or data intended to be stored longer than 30 days.
- With the exception of global scratch, PBS jobs must "clean up" after themselves (delete any files created in scratch) before exiting.
- Users must make a directory for themselves under the relevant Path for each storage type (e.g. mkdir /scratch.local/$USER).
- Files are written to scratch with the user's specified umask; by default, this means files are private to the individual.
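The directory and cleanup rules above can be combined in a small helper. The following is a minimal sketch, not an MSI-provided tool: the function name make_job_scratch and the job.XXXXXX naming are hypothetical, and the example assumes bash with the standard mktemp utility.

```shell
#!/bin/bash
# Sketch: create a private per-job working directory under a scratch path.
# make_job_scratch is a hypothetical helper name, not an MSI command.

# make_job_scratch BASE -> prints the path of a new private job directory.
# BASE is a scratch mount point such as /scratch.local.
make_job_scratch() {
    local base="$1"
    local user_dir="$base/${USER:-$(id -un)}"
    # Create the per-user directory if it does not exist yet.
    mkdir -p "$user_dir" || return 1
    # mktemp -d creates a uniquely named directory readable only by its owner.
    mktemp -d "$user_dir/job.XXXXXX"
}

# Typical use inside a PBS script (shown as comments, not executed here):
#   JOB_SCRATCH=$(make_job_scratch /scratch.local) || exit 1
#   trap 'rm -rf "$JOB_SCRATCH"' EXIT    # clean up even if the job fails
#   cd "$JOB_SCRATCH" && ./my_application   # my_application is hypothetical
```

Registering the cleanup with trap means the job directory is removed when the script exits, whether the application succeeded or not. Copy any results you need to persistent storage before the script ends.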
Example Use Cases
Given the performance and capacity differences, the best scratch type varies by case. In general, you want to use the fastest scratch storage type, but the choice is usually constrained by the capacity your application requires. When in doubt, contact email@example.com. A few examples of how scratch is used at MSI:
|Global Scratch||My multinode job will generate many large intermediate files, on the order of terabytes. The intermediate files are needed by each node in the job. Of these files, I only need to keep a few gigabytes, and my group quota is not large enough for all of the intermediate and persistent data.|
|Local Scratch||My job can run on one or more nodes, and each node needs its own unique space for gigabytes of output. At the end of execution, my PBS script will consolidate files to my home directory.|
|SSD||My image processing code needs all the memory on the node(s), but is limited by the network and I/O bandwidth when processing files. Each file must be read and written to many times. It is possible to stage these jobs to avoid a network filesystem, and instead use SSDs for faster I/O.|
|RAMdisk||I have an application that uses less than 30 gigabytes of memory, but performs significant I/O on a database or file. The database or file requires less than half the memory on the node.|
Global scratch is part of the same networked filesystem (Panasas) that also contains users' home directories. As such, it is accessible at /scratch.global from all cluster login and compute nodes at MSI, including those for Mesabi, Itasca, and interactive computing (lab). Since it is a networked filesystem, global scratch directories have the same contents wherever they are accessed, whereas local scratch, SSD, and RAMdisk directories are isolated to each node and have different contents. Global scratch capacity is typically on the order of terabytes (refer to Capacity below for more details).
Data in global scratch is deleted after 30 days.
Performance of global scratch can vary based on network connection and the user's shared data utilization.
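Because data in global scratch is removed after 30 days, it can help to list files that are approaching that age. The sketch below is only an illustration: the function name aging_files is hypothetical, and it assumes that age is measured by modification time, which this page does not specify.

```shell
#!/bin/bash
# Sketch: list files that may be approaching the 30-day limit.
# ASSUMPTION: age is judged by modification time (mtime); the actual
# deletion criterion is not stated on this page.

# aging_files DIR DAYS -> prints regular files under DIR whose last
# modification is more than DAYS full days in the past.
aging_files() {
    find "$1" -type f -mtime +"$2" -print
}

# Example (shown as a comment, not executed here):
#   aging_files "/scratch.global/$USER" 25
```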
Local scratch is a local filesystem on all cluster login and compute nodes at MSI. Local scratch is accessible at /scratch.local, but the contents on each node are unique because the filesystem is not networked with other nodes. It consists of spinning disk attached directly to each node. As such, capacity is limited to the order of gigabytes of space, and varies by queue.
Some Mesabi compute nodes have SSDs attached to them, which can be used as temporary storage that is generally faster than the local scratch.
Within the "small" queue there are 32 nodes with ~440 GB of SSD space available, accessible at /scratch.ssd. Please remember that data stored in /scratch.ssd needs to be deleted at the end of each PBS job. To request SSD node(s) in your job, add the "ssd" attribute to your PBS script or qsub command:
#PBS -l nodes=1:ssd:ppn=1,walltime=1:00:00
The current space available (in megabytes) on /scratch.ssd on an SSD node (in an SSD-enabled PBS script) can be seen with:
df -m /scratch.ssd
Filesystem 1M-blocks Used Available Use% Mounted on
/dev/sdb1 450547 71 427584 1% /scratch.ssd
A list of these SSD-capable nodes can be generated from a Mesabi login node using the command:
RAMdisk (/dev/shm) is available on all nodes, does not need to be explicitly requested in a PBS script, and has a capacity equal to half of the node's memory. Generally this is the fastest temporary storage option, but it has the lowest capacity. Available memory per node varies with the queue. Because files stored in RAMdisk occupy the node's memory, users must remember to account for RAMdisk usage in their PBS resource requests.
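The memory accounting described above can be worked out before submitting a job. This is a rough sketch under stated assumptions: the function name ramdisk_mem_request is hypothetical, all sizes are whole gigabytes, and the node memory figure must be supplied by you for the queue you intend to use.

```shell
#!/bin/bash
# Sketch: estimate the memory to request when a job also uses RAMdisk.
# ramdisk_mem_request is a hypothetical helper name.

# ramdisk_mem_request APP_GB RAMDISK_GB NODE_GB
# Prints the total memory (GB) to request, or returns 1 if the plan cannot fit.
ramdisk_mem_request() {
    local app_gb="$1" ramdisk_gb="$2" node_gb="$3"
    # RAMdisk capacity is half of the node's memory.
    if [ "$ramdisk_gb" -gt $(( node_gb / 2 )) ]; then
        echo "RAMdisk data exceeds half of node memory" >&2
        return 1
    fi
    # Application memory and RAMdisk files share the same physical memory.
    local total=$(( app_gb + ramdisk_gb ))
    if [ "$total" -gt "$node_gb" ]; then
        echo "application plus RAMdisk exceeds node memory" >&2
        return 1
    fi
    echo "$total"
}

# Example (shown as a comment, not executed here):
#   ramdisk_mem_request 30 20 64
```

For an application using 30 GB with a 20 GB database in /dev/shm on a node with 64 GB, the function prints 50, which is the amount the memory request would need to cover.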
Table comparing representative performance seen by the fio benchmark on a Mesabi SSD (small queue) node:
|Type||Path||Read Bandwidth (MB/sec)||Write Bandwidth (MB/sec)||Capacity (Order of Magnitude)|
* actual performance may be degraded by factors like network contention and concurrent utilization by other users.
The current capacity and free space of a given device can be found with the command:
df -h <PathToDeviceName>
For example, global scratch can be checked from any node (the other devices are local to a given node):
df -h /scratch.global
It is recommended that PBS jobs check for sufficient free space (based on your application's requirements) before using that device. If the job encounters insufficient free space, you can try running on another queue (one which has no node sharing), or adjust your script to use an alternate type of scratch storage.
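One way to perform this check inside a job script is sketched below. The function names free_mb and has_free_mb and the 100000 MB threshold are illustrative assumptions. The sketch uses the POSIX options df -P -k so that the output stays on one line per filesystem.

```shell
#!/bin/bash
# Sketch: check free space on a scratch device before using it.
# free_mb and has_free_mb are hypothetical helper names.

# free_mb PATH -> prints the space available on PATH's filesystem, in megabytes.
free_mb() {
    # -P gives one portable output line per filesystem; -k reports kilobytes.
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024) }'
}

# has_free_mb PATH REQUIRED_MB -> succeeds if at least REQUIRED_MB is available.
has_free_mb() {
    local avail
    avail=$(free_mb "$1")
    [ -n "$avail" ] && [ "$avail" -ge "$2" ]
}

# Example fallback (shown as comments, not executed here):
#   if has_free_mb /scratch.ssd 100000; then
#       WORK=/scratch.ssd/$USER
#   else
#       WORK=/scratch.global/$USER    # slower, but larger and shared
#   fi
```

Falling back to a slower scratch type is only one option. Exiting early with a clear message is also reasonable when the application cannot tolerate the slower device.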