Hadoop

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

On our HORNET cluster, the Hadoop implementation is different from a typical one. Hadoop was originally designed for shared-nothing systems, where each node has its own CPU, memory, and large local disk, and Hadoop also has its own job scheduling system. However, our HPC cluster is designed to be shared between users. Therefore, to run Hadoop on our HORNET cluster, we use myHadoop, which starts up a Hadoop cluster each time you submit a Hadoop job as a batch job. myHadoop is set to use persistent mode, which saves data in our shared file system, GPFS. Your persistent-mode Hadoop data is stored in the following directory:

/scratch/scratch2/${your_NetID}_hadoop.${SLURM_JOBID}
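
These persistent directories can grow large, so it is worth keeping an eye on how much scratch space they use. A plain du over the default location shown above should be enough (adjust the path if you configured a different location):

$ du -sh /scratch/scratch2/${USER}_hadoop.*    # size of each persistent Hadoop data directory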

Load modules

Several modules are needed by myHadoop. Load all of them at once with:

module load zlib/1.2.8 openssl/1.0.1e java/1.8.0_31 protobuf/2.5.0 myhadoop/Sep012015
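
If you want to confirm that everything loaded cleanly before submitting a job, you can list the currently loaded modules:

$ module list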

Start Hadoop Cluster

Below is an example SLURM batch script for word counting using MapReduce. It is based on code written by Dr. Lockwood and has been adapted for our HORNET cluster.

Submission Script

#!/bin/bash
################################################################################
#  slurm.sbatch - A sample submit script for SLURM that illustrates how to
#  spin up a Hadoop cluster for a map/reduce task using myHadoop
#
# Created:
#  Glenn K. Lockwood, San Diego Supercomputer Center             February 2014
# Revised:
#  Tingyang Xu, BECAT                                            September 2015
#  Pariksheet Nanda, BECAT                                       December 2015
################################################################################
## -p Westmere sets the partition.
#SBATCH -p Westmere
## -N 4 allocates 4 nodes to run the Hadoop cluster (1 master namenode, 3 datanodes)
#SBATCH -N 4
## -c 12 uses 12 cores on each node.
#SBATCH -c 12

#################NO CHANGE############################
## --ntasks-per-node=1 so that each node runs a single datanode/namenode. 
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive
#################NO CHANGE END########################

# Download example data to your folder for the mapreduce script.
if [ ! -f ./pg2701.txt ]; then
  echo "*** Retrieving some sample input data"
  wget 'http://www.gutenberg.org/cache/epub/2701/pg2701.txt'
fi

# Set the storage directory for temporary Hadoop configuration files.
# It must be in a location accessible to all the compute nodes.
export HADOOP_CONF_DIR=$PWD/hadoop-conf.$SLURM_JOBID

#################NO CHANGE############################
if [ "z$HADOOP_OLD_DIR" == "z" ]; then
  myhadoop-configure.sh
else
  myhadoop-configure.sh -p $HADOOP_OLD_DIR
fi

# test if the HADOOP_CONF_DIR is accessible by the compute nodes.
if ! srun ls -d $HADOOP_CONF_DIR; then
  echo "The configure files are not accessible by the compute nodes. 
       Please consider the the shared, home, or scratch directory to put your HADOOP_CONF_DIR. 
       For example, export HADOOP_CONF_DIR=/scratch/scratch2/$USER_hadoop-conf.$SLURM_JOBID"
  myhadoop-cleanup.sh
  rm -rf $HADOOP_CONF_DIR
  exit 1
fi
start-all.sh
#################NO CHANGE END########################

# Make the data directory using Hadoop's hdfs
hdfs dfs -mkdir /data
hdfs dfs -put ./pg2701.txt /data
hdfs dfs -ls /data
# Run the word counting script
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount \
        /data /wordcount-output
# Copy the results
hdfs dfs -ls /wordcount-output
hdfs dfs -get /wordcount-output ./

#################NO CHANGE############################
stop-all.sh
myhadoop-cleanup.sh
rm -rf $HADOOP_CONF_DIR
#################NO CHANGE END########################

# Delete the data of the Hadoop namenode (master) and the datanodes (slaves).
# Comment out the line below for persistent sessions,
# for example if you are working on a project that needs to continue
# calculations from a previous batch job.
rm -rf /scratch/scratch2/${USER}_hadoop.${SLURM_JOBID}

Save the script with a name such as slurm.sbatch, then submit it with:

$ sbatch slurm.sbatch
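
While the job runs, the usual SLURM tools work for monitoring; these are standard SLURM commands, not specific to myHadoop (slurm-<jobid>.out is SLURM's default output file name, so substitute your own job ID):

$ squeue -u $USER              # is the Hadoop batch job pending or running?
$ tail -f slurm-<jobid>.out    # follow the job's output file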

Output

You will see a new folder "wordcount-output" containing a file "part-r-00000" when the job is done.

$ head part-r-00000 
"'A	3
"'Also	1
"'Are	1
"'Aye,	2
"'Aye?	1
"'Best	1
"'Better	1
"'Bout	1
"'But	2
"'Canallers!'	1

NOTE: Don't forget to stop the cluster after running your Hadoop job.

Interactive Hadoop (beta) use with SLURM (not recommended)

WARNING: *any* network interruption will cause your job to crash irrecoverably.

To run an interactive Hadoop session, do the following:

[NetID@cn65 ~]$ module load zlib/1.2.8 protobuf/2.5.0 openssl/1.0.1e java/1.8.0_31 myhadoop/Sep012015
# You can change number of nodes you want to run the Hadoop.
# Please always KEEP "--exclusive --ntasks-per-node=1". 
[NetID@cn65 ~]$ fisbatch.hadoop --exclusive --ntasks-per-node=1 -c12 -p Westmere  -N 4
FISBATCH -- the maximum time for the interactive screen is limited to 6 hours. You can add QoS to overwrite it.
FISBATCH -- waiting for JOBID 24187 to start on cluster=cluster and partition=Westmere
...!
FISBATCH -- booting the Hadoop nodes
*******************!
FISBATCH -- Connecting to head node (cn43)
[NetID@cn43 ~]$ mkdir workspace
[NetID@cn43 ~]$ cd workspace
[NetID@cn43 workspace]$ wget http://www.gutenberg.org/cache/epub/2701/pg2701.txt
--2015-09-10 22:06:09--  http://www.gutenberg.org/cache/epub/2701/pg2701.txt
Resolving www.gutenberg.org... 152.19.134.47
Connecting to www.gutenberg.org|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1257296 (1.2M) [text/plain]
Saving to: pg2701.txt 

100%[=============================>] 1,257,296   5.27M/s   in 0.2s

2015-09-10 22:06:10 (5.27 MB/s) - pg2701.txt

[NetID@cn43 workspace]$ hdfs dfs -mkdir /data
[NetID@cn43 workspace]$ hdfs dfs -put ./pg2701.txt /data
[NetID@cn43 workspace]$ hdfs dfs -ls /data
Found 1 items
-rw-r--r--   3 NetID supergroup    1257296 2015-09-10 22:11 /data/pg2701.txt
[NetID@cn43 workspace]$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar \
> wordcount /data /wordcount-output
15/09/10 22:23:16 INFO client.RMProxy: Connecting to ResourceManager at ib-cn43/172.16.100.43:8032
15/09/10 22:23:16 INFO input.FileInputFormat: Total input paths to process : 1
15/09/10 22:23:16 INFO mapreduce.JobSubmitter: number of splits:1
...
15/09/10 22:23:21 INFO mapreduce.Job:  map 0% reduce 0%
15/09/10 22:23:25 INFO mapreduce.Job:  map 100% reduce 0%
15/09/10 22:23:31 INFO mapreduce.Job:  map 100% reduce 100%
15/09/10 22:23:31 INFO mapreduce.Job: Job job_1441936501378_0003 completed successfully
15/09/10 22:23:31 INFO mapreduce.Job: Counters: 43
        File System Counters
                FILE: Number of bytes read=499462
                FILE: Number of bytes written=1156687
                ...
[NetID@cn43 workspace]$ hdfs dfs -ls /wordcount-output
Found 2 items
-rw-r--r--   3 NetID supergroup          0 2015-09-10 22:23 /wordcount-output/_SUCCESS
-rw-r--r--   3 NetID supergroup     366672 2015-09-10 22:23 /wordcount-output/part-r-00000
[NetID@cn43 workspace]$ hdfs dfs -get /wordcount-output ./
[NetID@cn43 workspace]$ exit
[screen is terminating]
Connection to cn43 closed.
FISBATCH -- exiting job
[NetID@cn65 ~]$ head workspace/wordcount-output/part-r-00000
"'A     3
"'Also  1
"'Are   1
"'Aye,  2
"'Aye?  1
"'Best  1
"'Better        1
"'Bout  1
"'But   2
"'Canallers!'   1
# This command is optional. See the following explanations.
[NetID@cn65 ~]$ sjobs
       JobID  Partition        QOS    JobName      User              Submit      State    Elapsed   NNodes      NCPUS        NodeList               Start
------------ ---------- ---------- ---------- --------- ------------------- ---------- ---------- -------- ---------- --------------- -------------------
24187          Westmere   standard HADOOP.FI+     NetID 2015-09-10T21:54:33    RUNNING   00:32:32        4         48       cn[43-46] 2015-09-10T21:54:34
24187.0                                screen           2015-09-10T21:55:03  COMPLETED   00:32:01        1         12            cn43 2015-09-10T21:55:03
24194          Westmere   standard HADOOP.CL+  tix11001 2015-09-10T22:27:04    PENDING   00:00:00        2          0   None assigned             Unknown

Your job may keep running for a short while to shut down the Hadoop daemons and clean up the temporary files. Alternatively, you may find another job called "HADOOP.CLEANUP" running or pending in your job queue. Just let it run; the cleanup step should take only 20-60 seconds.

NOTE: please do not forget to exit from the nodes so that other users can use them.

Help to debug

This is a beta version of interactive Hadoop, so please contact us at hpc@uconn.edu if you encounter any errors or find any bugs.

Start Hadoop with previous job

If you want to start a new job that reuses the HDFS created by a previous job, do the following. First, find your Hadoop HDFS folder. By default, it is located at:

/scratch/scratch2/${your_NetID}_hadoop.${SLURM_JOBID}
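
If you are not sure which directory belongs to the job you want to reuse, listing them by modification time is one quick way to find the most recent one (this assumes the default location shown above):

$ ls -dt /scratch/scratch2/${USER}_hadoop.* | head -1    # newest persistent Hadoop data directory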

Submission Script

First, point to the old HDFS folder by issuing:

$ export HADOOP_OLD_DIR=/path/to/old/hadoop/folder # by default, it is /scratch/scratch2/${USER}_hadoop.$SLURM_JOB_ID

Then submit your batch script:

$ sbatch slurm.sbatch
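
Once the Hadoop cluster from the new job is up, you can verify that the data written by the previous job is still in HDFS with the same command used earlier; for example, with the /data directory from the word-count example above, a quick check inside the batch script (or from an interactive head node) would be:

hdfs dfs -ls /data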

Interactive

Before issuing fisbatch.hadoop, you need to export HADOOP_OLD_DIR.

$ export HADOOP_OLD_DIR=/path/to/old/hadoop/folder # by default, it is /scratch/scratch2/${USER}_hadoop.$SLURM_JOB_ID
$ fisbatch.hadoop --exclusive --ntasks-per-node=1 -c12 -p Westmere  -N 4
...