MSI provides large-capacity, high-performance temporary storage to be used while applications are running on the supercomputer. Depending on the system, scratch storage may be a set-aside area of primary storage (global scratch), or may consist of separate storage attached via fast data links. While global scratch is a shared space that is visible to all nodes, local scratch disk, SSD, and RAMdisk are each connected (and visible) to a single node only.
- Access to scratch storage must be requested just like processors, memory, and GPUs.
- The available/free capacity of each scratch type will vary based on the aggregate utilization of these shared resources.
- There is a 40 TB and 10M file quota on global scratch storage.
- There are no backups or snapshots of local or global scratch storage.
- Scratch storage should not be used for any valuable data or data intended to be stored longer than 30 days.
- Users must make a directory for themselves when using global scratch (e.g. mkdir /scratch.global/$USER).
- Files are written to scratch with the user's specified umask; by default, this means files are private to the individual.
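The directory setup described in the list above can be sketched as a small shell helper. This is only an illustration: the function name `make_scratch_dir` is hypothetical, and the `077` umask is an assumption chosen to make the directory private to its owner; your own umask may already do this.

```shell
# Create a per-user directory under a scratch root and print its path.
# The function name and the 077 umask are illustrative assumptions.
make_scratch_dir() {
    scratch_root="${1:-/scratch.global}"
    user_dir="$scratch_root/${USER:-$(id -un)}"
    # The subshell keeps the umask change from leaking into the caller.
    ( umask 077 && mkdir -p "$user_dir" ) || return 1
    printf '%s\n' "$user_dir"
}

# Typical use on a login or compute node:
#   mydir=$(make_scratch_dir)
```

Running `umask` with no arguments prints the current mask, which is a quick way to confirm how new files will be protected.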
Example Use Cases
Given the performance and capacity differences, scratch type usage will vary by case. In general, you want to use the fastest scratch storage type, but this is usually determined by the capacity required for your application. When in doubt, contact email@example.com. A few examples of how scratch is used at MSI:
|Global Scratch||My multinode job will generate many large intermediate files, on the order of terabytes. The intermediate files are needed by each node in the job. Of these files, I only need to keep a few gigabytes, and my group quota is not large enough for all of the intermediate and persistent data.|
|Local Scratch||My job can run on one or more nodes, and each node needs its own unique space for gigabytes of output. At the end of execution, my script will consolidate files to my home directory.|
|RAMdisk||I have an application that uses less than 30 gigabytes of memory, but performs significant I/O on a database or file. The database or file requires less than half the memory on the node.|
Global scratch is part of the same networked filesystem (Panasas) that also contains users' home directories. As such, it is accessible at /scratch.global from all cluster login and compute nodes at MSI, including those for Mangi, Mesabi, and interactive computing. Since it is a networked filesystem, global scratch directories have the same contents wherever they are accessed, whereas local scratch, SSD, and RAMdisk directories are isolated to each node and have different contents. Global scratch capacity is typically on the order of terabytes (refer to Capacity below for more details).
Data in global scratch is deleted after 30 days.
Performance of global scratch can vary based on network connection and the user's shared data utilization.
Starting April 1, 2020, the default quota on Global Scratch will be set at 40 TB and 10,000,000 files. MSI will review requests for temporary increases of this quota for individual projects on an ongoing basis. It is important to note that global scratch is neither backed up nor snapshotted, on the assumption that data stored there is transitory.
Local scratch is a local filesystem on all cluster login and compute nodes at MSI. Local scratch is accessible at /scratch.local, but the contents on each node are unique because the filesystem is not networked with other nodes. It consists of spinning disk attached directly to each node. As such, capacity is limited to the order of gigabytes of space, and varies by queue.
Local scratch must be requested by the job using `--tmp <size>`, for example `--tmp 50G` for 50 gigabytes. If the job makes no disk request, /scratch.local points to RAMdisk.
Local scratch is isolated to the job; other jobs will have their own /scratch.local in a different namespace.
RAMdisk (/dev/shm) is available on all nodes, does not need to be explicitly requested in a job script, and has a capacity equal to half of the node's memory. Generally this is the fastest temporary storage option, but it has the lowest capacity. Available memory per node varies with the queue. Because files placed in RAMdisk occupy node memory, users must remember to account for RAMdisk usage in their job resource requests.
RAMdisk is isolated to the job; other jobs will have their own /dev/shm in a different namespace.
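One way to use RAMdisk for an I/O-heavy file is to copy it into /dev/shm at the start of the job and point the application at the copy. The sketch below assumes this pattern; `stage_to_ramdisk`, `input.db`, and `my_app` are hypothetical names, not part of any MSI tooling.

```shell
# Copy a file into a fresh directory on the RAMdisk and print the new path.
# The second argument defaults to /dev/shm; it is overridable for testing.
stage_to_ramdisk() {
    src="$1"
    ram_root="${2:-/dev/shm}"
    # mktemp -d creates a uniquely named directory from the template.
    stage_dir=$(mktemp -d "$ram_root/stage.XXXXXX") || return 1
    cp "$src" "$stage_dir"/ || return 1
    printf '%s\n' "$stage_dir/$(basename "$src")"
}

# Example use in a job script (input.db and my_app are hypothetical):
#   db=$(stage_to_ramdisk "$HOME/input.db") && my_app --database "$db"
#   rm -rf "$(dirname "$db")"   # release the memory when finished
```

Removing the staged copy at the end releases the memory it occupied.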
Table comparing representative performance seen by the fio benchmark on a Mesabi SSD (small queue) node:
|Type||Path||Read Bandwidth (MB/sec)||Write Bandwidth (MB/sec)||Capacity (Order of Magnitude)|
* actual performance may be degraded by factors like network contention and concurrent utilization by other users.
The current capacity and free space of a given device can be found with the command:
df -h <PathToDeviceName>
e.g. from any node (other devices are local to a given node):
df -h /scratch.global
Jobs should check for sufficient free space (based on the application's requirements) before using a device. If a job encounters insufficient free space, try running on another queue (one which has no node sharing), or adjust the script to use an alternate type of scratch storage.
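A free-space check of this kind could be scripted along the following lines. This is a minimal sketch: the function name `has_free_space` and the 50 GB threshold in the usage comment are illustrative assumptions. It relies on `df -Pk`, which reports sizes in 1024-byte blocks in a fixed column layout.

```shell
# Succeed if the filesystem holding PATH has at least NEED_KB kilobytes free.
has_free_space() {
    path="$1"
    need_kb="$2"
    # With -P the second output line is the data row; column 4 is "Available".
    avail_kb=$(df -Pk "$path" | awk 'NR==2 {print $4}')
    [ -n "$avail_kb" ] && [ "$avail_kb" -ge "$need_kb" ]
}

# Example: require roughly 50 GB before writing to local scratch.
#   if ! has_free_space /scratch.local $((50 * 1024 * 1024)); then
#       echo "Not enough free space on /scratch.local" >&2
#       exit 1
#   fi
```

Because scratch is shared, free space can change after the check, so a job that writes large files may still want to handle write errors.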