Hadoop
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
On our HORNET cluster, the Hadoop setup differs from a typical installation. Hadoop was originally designed for shared-nothing systems, where each node has its own CPU, memory, and large disk space, and Hadoop has its own job scheduling system. However, our HPC cluster is designed to be shared between users. Therefore, to run Hadoop on our HORNET cluster, we use myHadoop, which spins up a Hadoop cluster each time you submit a Hadoop job as a batch job. myHadoop is set to use persistent mode, which saves data in our shared file system, GPFS. Your persistent-mode Hadoop installation data is stored in the following directory:
/scratch/scratch2/${your_NetID}_hadoop.${SLURM_JOBID}
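Because persistent mode keeps HDFS blocks on shared GPFS scratch, these folders count against shared space, so it is worth checking how much they consume from time to time. A quick check, assuming the default location above:

$ du -sh /scratch/scratch2/${USER}_hadoop.*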
Load modules
Several modules are needed by myHadoop. Load all of them at once with:
module load zlib/1.2.8 openssl/1.0.1e java/1.8.0_31 protobuf/2.5.0 myhadoop/Sep012015
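To double-check that everything loaded cleanly before submitting a job, you can list your currently loaded modules:

$ module list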
Start Hadoop Cluster
Below is an example SLURM batch script for word counting using MapReduce. It is based on code written by Dr. Lockwood and has been adapted for our HORNET cluster.
Submission Script
#!/bin/bash
################################################################################
# slurm.sbatch - A sample submit script for SLURM that illustrates how to
#   spin up a Hadoop cluster for a map/reduce task using myHadoop
#
# Created:
#   Glenn K. Lockwood, San Diego Supercomputer Center             February 2014
# Revised:
#   Tingyang Xu, BECAT                                           September 2015
#   Pariksheet Nanda, BECAT                                       December 2015
################################################################################
## -p Westmere sets the partition.
#SBATCH -p Westmere
## -N 4 allocates 4 nodes to run the Hadoop cluster (1 master namenode, 3 datanodes).
#SBATCH -N 4
## -c 12 uses 12 cores on each node.
#SBATCH -c 12
#################NO CHANGE############################
## --ntasks-per-node=1 so that each node runs a single datanode/namenode.
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive
#################NO CHANGE END########################

# Download example data to your folder for the mapreduce script.
if [ ! -f ./pg2701.txt ]; then
    echo "*** Retrieving some sample input data"
    wget 'http://www.gutenberg.org/cache/epub/2701/pg2701.txt'
fi

# Set the storage directory for temporary Hadoop configuration files.
# It must be in a location accessible to all the compute nodes.
export HADOOP_CONF_DIR=$PWD/hadoop-conf.$SLURM_JOBID

#################NO CHANGE############################
if [ "z$HADOOP_OLD_DIR" == "z" ]; then
    myhadoop-configure.sh
else
    myhadoop-configure.sh -p $HADOOP_OLD_DIR
fi

# Test whether HADOOP_CONF_DIR is accessible by the compute nodes.
if ! srun ls -d $HADOOP_CONF_DIR; then
    echo "The configuration files are not accessible by the compute nodes. Please consider using the shared, home, or scratch directory for your HADOOP_CONF_DIR. For example: export HADOOP_CONF_DIR=/scratch/scratch2/${USER}_hadoop-conf.$SLURM_JOBID"
    myhadoop-cleanup.sh
    rm -rf $HADOOP_CONF_DIR
    exit 1
fi

start-all.sh
#################NO CHANGE END########################

# Make the data directory using Hadoop's hdfs.
hdfs dfs -mkdir /data
hdfs dfs -put ./pg2701.txt /data
hdfs dfs -ls /data

# Run the word counting script.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount \
    /data /wordcount-output

# Copy the results.
hdfs dfs -ls /wordcount-output
hdfs dfs -get /wordcount-output ./

#################NO CHANGE############################
stop-all.sh
myhadoop-cleanup.sh
rm -rf $HADOOP_CONF_DIR
#################NO CHANGE END########################

# Delete data of the Hadoop namenode (master) and the datanodes (slaves).
# Comment the line below out for persistent sessions;
# for example, comment it out if you are working on a project that needs
# to continue calculations from a previous batch job.
rm -rf /scratch/scratch2/${your_NetID}_hadoop.${SLURM_JOBID}
Save the script under a name such as slurm.sbatch, then submit it with:
$ sbatch slurm.sbatch
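While the job waits in the queue or runs, you can watch it with the usual SLURM tools; by default SLURM writes the batch output to slurm-<jobid>.out in the submission directory:

$ squeue -u $USER          # check the job's state
$ tail -f slurm-24187.out  # follow the batch output (substitute your own job ID)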
Output
You will see a new folder "wordcount-output" containing a file "part-r-00000" when the job is done.
$ head wordcount-output/part-r-00000
"'A     3
"'Also  1
"'Are   1
"'Aye,  2
"'Aye?  1
"'Best  1
"'Better        1
"'Bout  1
"'But   2
"'Canallers!'   1
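part-r-00000 holds tab-separated word/count pairs sorted by word. If you would rather see the most frequent words, a one-line post-processing step (plain shell, nothing Hadoop-specific) does it:

$ sort -t$'\t' -k2,2nr wordcount-output/part-r-00000 | head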
NOTE: Don't forget to stop the cluster after running your Hadoop job.
Interactive Hadoop (beta) use with SLURM (not recommended)
WARNING: *any* network interruption will cause your job to crash irrecoverably.
To run an interactive Hadoop session, do the following:
[NetID@cn65 ~]$ module load zlib/1.2.8 protobuf/2.5.0 openssl/1.0.1e java/1.8.0_31 myhadoop/Sep012015
# You can change the number of nodes used to run Hadoop.
# Please always KEEP "--exclusive --ntasks-per-node=1".
[NetID@cn65 ~]$ fisbatch.hadoop --exclusive --ntasks-per-node=1 -c12 -p Westmere -N 4
FISBATCH -- the maximum time for the interactive screen is limited to 6 hours. You can add QoS to overwrite it.
FISBATCH -- waiting for JOBID 24187 to start on cluster=cluster and partition=Westmere
...!
FISBATCH -- booting the Hadoop nodes
*******************!
FISBATCH -- Connecting to head node (cn43)
[NetID@cn43 ~]$ mkdir workspace
[NetID@cn43 ~]$ cd workspace
[NetID@cn43 workspace]$ wget http://www.gutenberg.org/cache/epub/2701/pg2701.txt
--2015-09-10 22:06:09--  http://www.gutenberg.org/cache/epub/2701/pg2701.txt
Resolving www.gutenberg.org... 152.19.134.47
Connecting to www.gutenberg.org|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1257296 (1.2M) [text/plain]
Saving to: pg2701.txt

100%[=============================>] 1,257,296   5.27M/s   in 0.2s

2015-09-10 22:06:10 (5.27 MB/s) - pg2701.txt

[NetID@cn43 workspace]$ hdfs dfs -mkdir /data
[NetID@cn43 workspace]$ hdfs dfs -put ./pg2701.txt /data
[NetID@cn43 workspace]$ hdfs dfs -ls /data
Found 1 items
-rw-r--r--   3 NetID supergroup    1257296 2015-09-10 22:11 /data/pg2701.txt
[NetID@cn43 workspace]$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar \
> wordcount /data /wordcount-output
15/09/10 22:23:16 INFO client.RMProxy: Connecting to ResourceManager at ib-cn43/172.16.100.43:8032
15/09/10 22:23:16 INFO input.FileInputFormat: Total input paths to process : 1
15/09/10 22:23:16 INFO mapreduce.JobSubmitter: number of splits:1
...
15/09/10 22:23:21 INFO mapreduce.Job:  map 0% reduce 0%
15/09/10 22:23:25 INFO mapreduce.Job:  map 100% reduce 0%
15/09/10 22:23:31 INFO mapreduce.Job:  map 100% reduce 100%
15/09/10 22:23:31 INFO mapreduce.Job: Job job_1441936501378_0003 completed successfully
15/09/10 22:23:31 INFO mapreduce.Job: Counters: 43
        File System Counters
                FILE: Number of bytes read=499462
                FILE: Number of bytes written=1156687
...
[NetID@cn43 workspace]$ hdfs dfs -ls /wordcount-output
Found 2 items
-rw-r--r--   3 NetID supergroup          0 2015-09-10 22:23 /wordcount-output/_SUCCESS
-rw-r--r--   3 NetID supergroup     366672 2015-09-10 22:23 /wordcount-output/part-r-00000
[NetID@cn43 workspace]$ hdfs dfs -get /wordcount-output ./
[NetID@cn43 workspace]$ exit
[screen is terminating]
Connection to cn43 closed.
FISBATCH -- exiting job
[NetID@cn65 ~]$ head workspace/wordcount-output/part-r-00000
"'A     3
"'Also  1
"'Are   1
"'Aye,  2
"'Aye?  1
"'Best  1
"'Better        1
"'Bout  1
"'But   2
"'Canallers!'   1
# This command is optional. See the following explanations.
[NetID@cn65 ~]$ sjobs
       JobID  Partition        QOS    JobName      User              Submit      State    Elapsed   NNodes      NCPUS        NodeList               Start
------------ ---------- ---------- ---------- --------- ------------------- ---------- ---------- -------- ---------- --------------- -------------------
       24187   Westmere   standard HADOOP.FI+     NetID 2015-09-10T21:54:33    RUNNING   00:32:32        4         48       cn[43-46] 2015-09-10T21:54:34
     24187.0                           screen           2015-09-10T21:55:03  COMPLETED   00:32:01        1         12            cn43 2015-09-10T21:55:03
       24194   Westmere   standard HADOOP.CL+  tix11001 2015-09-10T22:27:04    PENDING   00:00:00        2          0   None assigned             Unknown
Your job may still run for a while to shut down the Hadoop daemons and clean up the temporary files, or you may find another job called "HADOOP.CLEANUP" running or pending in your job queue. Just let it run; the cleanup step should only take 20-60 seconds.
NOTE: Please do not forget to exit from the nodes so that other users can use them.
Help to debug
This is a beta version of interactive Hadoop, so please contact us at hpc@uconn.edu if you run into any errors or find any bugs.
Start Hadoop with previous job
If you want to start a new job using the HDFS created in a previous job, follow these steps. First, find your Hadoop HDFS folder. By default, it is located at:
/scratch/scratch2/${your_NetID}_hadoop.${SLURM_JOBID}
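If you no longer remember the job ID of the previous run, listing the persisted folders (again assuming the default location) shows the candidates:

$ ls -d /scratch/scratch2/${USER}_hadoop.*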
Submission Script
First, point to the old HDFS folder by issuing:
$ export HADOOP_OLD_DIR=/path/to/old/hadoop/folder # by default, it is /scratch/scratch2/${USER}_hadoop.$SLURM_JOB_ID
Then submit your job script:
$ sbatch slurm.sbatch
Interactive
Before issuing fisbatch.hadoop, you need to export HADOOP_OLD_DIR:
$ export HADOOP_OLD_DIR=/path/to/old/hadoop/folder # by default, it is /scratch/scratch2/${USER}_hadoop.$SLURM_JOB_ID
$ fisbatch.hadoop --exclusive --ntasks-per-node=1 -c12 -p Westmere -N 4
...
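Once the interactive session connects, the HDFS contents written by the previous job should already be in place. For instance, if the previous job was the word-count example above, listing /data should show pg2701.txt again:

[NetID@cn43 ~]$ hdfs dfs -ls /data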