Hadoop Cluster

A cluster dedicated to MapReduce workflows using HDFS

What is the Hadoop Cluster?

The Big nodes were installed in 2015 as part of the general-purpose storage system and were designed as data movers into and out of that system. MSI has configured 20 of these compute nodes into a single Hadoop cluster.

Hadoop HDFS and MapReduce are installed on the cluster. HDFS is a distributed filesystem; each node in the Hadoop cluster stores a portion of the filesystem's data.

Each Big node has an Intel Ivy Bridge "E5-2680 v2" processor at 2.8 GHz. Each processor has ten cores and 25 MB of cache, and communicates with memory over a dual QuickPath Interconnect (QPI) interface (2 x 7.2 Giga-Transfers/sec). All of the systems within the cluster are interconnected with 10-gigabit Ethernet.

Cluster specification: 20 nodes from Supermicro 4U FatTwin 4-node systems, each node with an E5-2680 v2 (10-core) 2.8 GHz processor, twenty 6 TB SATA3 drives, and 128 GB of memory. In aggregate, the Hadoop system has 1.2 PB of storage available.

How can I use the Hadoop Cluster?

Login Procedure

The interactive Hadoop cluster is an MSI Beta resource. Please email help@msi.umn.edu to get access to this system.

The Hadoop cluster should always be accessed remotely from the lab machines or the pool of nodes in the lab queue. HDFS and MapReduce applications are all run from the interactive lab nodes. Please connect through login.msi.umn.edu or nx.msi.umn.edu. See MSI's interactive connections FAQ, and use isub to start an interactive session on a lab machine. 
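
For example, a typical connection might look like the following (my_user is a placeholder for your MSI username; running isub with no options requests a default interactive session, and the interactive connections FAQ describes the available options):

% ssh my_user@login.msi.umn.edu
% isub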

Available Software

Hadoop HDFS and MapReduce are installed on the cluster. Any additional software you need should be installed in your home directory.
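
For example, if you have built a tool under your home directory, you can add it to your search path for the current session (the ~/software location below is purely illustrative, and a bash-style shell is assumed):

% export PATH=$HOME/software/bin:$PATH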

Moving Files Into HDFS

HDFS is a distributed filesystem. Each node in the Hadoop cluster stores a portion of the HDFS data, and the Namenode holds the metadata for all files in HDFS. The HDFS filesystem on the current cluster should be considered a scratch filesystem; please maintain a copy of important data on other storage systems.

To move files to HDFS, begin by loading the default environment settings for Hadoop.

% module load hadoop/2.7.1
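
To confirm that the module loaded correctly and the Hadoop client is on your path, you can print the installed version:

% hadoop version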

To list existing files:

% hadoop fs -ls /user/$USER

To copy a local directory to HDFS on Big:

% hadoop fs -put ~/mylocaldirectory /user/$USER

Note: By default, each user has an HDFS directory at /user/user_name, as well as directories for each of their groups at /groups/group_name/user_name.
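
Other common hadoop fs operations follow the same pattern. For example, to copy results back out of HDFS to your home directory, or to remove an HDFS directory you no longer need (the directory names here are illustrative):

% hadoop fs -get /user/$USER/myresults ~/myresults
% hadoop fs -rm -r /user/$USER/mylocaldirectory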

 

Running jobs on the Hadoop Cluster

Supported Hadoop Platforms

  • MapReduce
  • Pig
  • Spark
  • Spatial Hadoop

Prerequisites

To run a job on the cluster, you must first log in to big:

% ssh my_user@big

MapReduce

Then, to run a MapReduce job, submit the jar file containing it:

% hadoop jar <jar> [mainClass] args...
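
For instance, the examples jar that ships with Hadoop includes a word count job that can be run against a directory already in HDFS. The jar path below assumes a standard Hadoop 2.7.1 layout under $HADOOP_HOME, so adjust it to match the installation on big; note that the output directory must not already exist:

% hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar wordcount /user/$USER/input /user/$USER/output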

Pig

Or, to run a Pig job:

% pig pig_script.pig
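
By default Pig executes the script as MapReduce jobs on the cluster; the -x flag selects the execution mode explicitly, which can be useful for testing a script locally before running it at scale:

% pig -x local pig_script.pig
% pig -x mapreduce pig_script.pig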

Spark

Usage varies; see the additional resources below.
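
As a rough sketch only: Spark applications are commonly submitted with spark-submit, as shown below. This assumes spark-submit is on your path on big and that the cluster runs Spark on YARN, which may not match this installation, so consult the additional resources for the supported procedure.

% spark-submit --master yarn my_spark_app.py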