A cluster dedicated to MapReduce workflows using HDFS
The Big Nodes were installed in 2015 as part of the general-purpose storage system, where they were designed to move data into and out of that system. MSI has configured 20 of these compute nodes into a single Hadoop cluster.
Hadoop HDFS and MapReduce are installed on the cluster. HDFS is a distributed filesystem. Each node in the Hadoop cluster holds some of its data in HDFS.
Each big node has an Intel Ivy Bridge "E5-2680 v2" processor at 2.8 GHz. Each processor has ten cores and 25 MB of cache, and it communicates with memory over a dual QuickPath Interconnect (QPI) interface (2 x 7.2 giga-transfers/sec). All of the systems within the cluster are interconnected with 10-gigabit Ethernet.
Cluster specification: 20 nodes in Supermicro 4U FatTwin chassis (4 nodes per chassis), each with E5-2680 v2 (10-core, 2.8 GHz) processors. Each node has 20 x 6 TB SATA3 drives and 128 GB of memory. In aggregate, the Hadoop system has 1.2 PB of storage available.
The Hadoop cluster should always be accessed from the lab machines or the pool of nodes in the lab queue; HDFS and MapReduce applications are all run from the interactive lab nodes. Connect through login.msi.umn.edu or nx.msi.umn.edu (see MSI's interactive connections FAQ), then use isub to start an interactive session on a lab machine.
Any additional software you need should be installed in your home directory.
The Namenode holds the metadata for all files in HDFS. The HDFS filesystem on the current cluster should be considered a scratch filesystem; please maintain a copy of important data on other storage systems.
To move files to HDFS, begin by loading the default environment settings for Hadoop.
% module load hadoop/2.7.1
To list existing files:
% hadoop fs -ls /user/$USER
To copy a local directory to HDFS on Big:
% hadoop fs -put ~/mylocaldirectory /user/$USER
Note: By default, each user has an HDFS home directory at /user/user_name, as well as per-group directories at /groups/group_name/user_name.
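Beyond -put and -ls, a few other HDFS operations come up often when staging data on and off the cluster; the paths below are illustrative, not defaults. Because HDFS here is scratch space, copying results back with -get is especially important.

```shell
% hadoop fs -mkdir /user/$USER/results          # create a new HDFS directory
% hadoop fs -get /user/$USER/results ~/         # copy results back to your local home
% hadoop fs -rm -r /user/$USER/mylocaldirectory # delete an HDFS directory tree
```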
Supported Hadoop Platforms
- Spatial Hadoop
To run a job on the cluster, you must first log in to big:
% ssh my_user@big
Then, to run a MapReduce job, run the jar file of your choice:
% hadoop jar <jar> [mainClass] args...
Or, to run a Pig job:
% pig pig_script.pig
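Because a MapReduce job's map and reduce stages are just programs that read and write lines of text when run through Hadoop Streaming, they can be prototyped locally with an ordinary shell pipeline before a job is submitted. The sketch below is a minimal word-count example; the file name and the mapper/reducer definitions are illustrative, not part of the cluster's configuration.

```shell
# Sample input for a local dry run (illustrative file name).
printf 'to be or not to be\n' > sample.txt

# Mapper: emit one "word<TAB>1" line per word.
mapper() { tr -s ' ' '\n' | awk '{print $0 "\t1"}'; }

# Reducer: sum the counts for each word (streaming delivers
# reducer input grouped/sorted by key, which `sort` simulates).
reducer() { awk -F'\t' '{c[$1] += $2} END {for (w in c) print w "\t" c[w]}'; }

# Local simulation of the MapReduce pipeline:
mapper < sample.txt | sort | reducer | sort
```

On the cluster, the same mapper and reducer would be submitted with the Hadoop Streaming jar that ships with Hadoop (for this version, under share/hadoop/tools/lib/ in the Hadoop install; the exact path on this system is an assumption), via hadoop jar with -input, -output, -mapper, and -reducer arguments.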