Getting started with Hadoop development
Last modified : 20 November, 2017
In this article I would like to describe the environment I use for Hadoop development. The target audience is people who are interested in contributing to the Hadoop project but have not yet done so.
I’m sure there are several things that could be improved, and I would appreciate it if you could please leave a comment with any suggestions you may have.
I am going to describe the steps I had to follow on a Fedora virtual machine. The steps on a Mac or Windows should be similar, perhaps with different commands.
Set environment variables
Please set the following variables in your .bashrc
# This environment variable is NOT used by Hadoop. It has been created solely to help set up the development environment.
export HADOOP_SRC_PATH=~/Code/hadoop
# Set up alias commands for building Hadoop in one of several ways
# This is the alias I use most. The '-Pdist' maven profile assembles all projects into a distribution.
# '-Pnative' builds native code which speeds up compression and checksum etc.
alias mvnp='mvn -Pdist -Pnative -Dmaven.javadoc.skip -DskipTests install'
# Same as mvnp but added a clean (so all previous artifacts are blown away).
alias mvnc='mvn -Pdist -Pnative -Dmaven.javadoc.skip -DskipTests clean install'
# This environment variable is used by all Hadoop scripts to locate the binaries that ought to be loaded and run
export HADOOP_HOME=$HADOOP_SRC_PATH/trunk/hadoop-dist/target/hadoop-3.1.0-SNAPSHOT
# Since we built the native libraries, we can add them to LD_LIBRARY_PATH to be loaded
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HADOOP_HOME/lib/native/
# This is the directory to which logs for Hadoop daemons will be written
export HADOOP_LOG_DIR=$HADOOP_SRC_PATH/logs
# YARN could complain about this not being set in some versions of Hadoop.
export YARN_LOG_DIR=$HADOOP_LOG_DIR
# The directory which should contain core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml, log4j.properties etc.
export HADOOP_CONF_DIR=$HADOOP_SRC_PATH/config
# These append Java options to the various Hadoop daemons so that I can connect a debugger to them if needed.
export HADOOP_NAMENODE_OPTS='-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=1049'
export HADOOP_DATANODE_OPTS='-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=1050'
export HADOOP_SECONDARYNAMENODE_OPTS='-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=1051'
export YARN_RESOURCEMANAGER_OPTS='-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=1044'
export YARN_NODEMANAGER_OPTS='-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=1045'
export HADOOP_JOB_HISTORYSERVER_OPTS='-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=1046'
# Sometimes I uncomment this if I am debugging the client.
#export HADOOP_CLIENT_OPTS='-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=1047'
# Enable debugging output for HDFS commands
#export HADOOP_ROOT_LOGGER=DEBUG,console
# Point the JVM at the native libraries we built
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib:$HADOOP_HOME/lib/native"
# Put the Hadoop commands on your PATH
export PATH=$HADOOP_HOME/bin:$PATH
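After editing .bashrc, a quick sanity check helps confirm the variables are picked up (a minimal sketch; the hadoop command will only resolve after the build described below has produced the dist directory):
source ~/.bashrc
echo $HADOOP_HOME      # should point under hadoop-dist/target once trunk is cloned and built
echo $HADOOP_CONF_DIR  # should point to your config directory
which hadoop           # resolves only after a successful build
With the *_OPTS variables above, each daemon listens on its own JDWP port, so a debugger can be attached, for example to the NameNode on port 1049, via an IDE remote-debug configuration or with jdb -connect com.sun.jdi.SocketAttach:hostname=localhost,port=1049.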
Retrieve the source code
There are a few places from which you can get the Hadoop code. I prefer the Apache git server, which is the source of truth. Alternatively, you could clone the GitHub mirror ( https://github.com/apache/hadoop ), which tracks the Apache git repository. If you have a fork, you could add it as yet another remote.
I like to keep several copies of the code (one for trunk, another for branch-2, and yet another for my fork). That way I can load all versions of the project in my IDE (Eclipse) and compare them side by side. More on that later.
# Create the directory where source code will be cloned
mkdir -p $HADOOP_SRC_PATH
cd $HADOOP_SRC_PATH
# git clone the source.
git clone https://git-wip-us.apache.org/repos/asf/hadoop.git trunk
# We don't need to clone again; we can just copy the directory to create another working copy for branch-2.
cp -R trunk branch-2
# Go into branch-2 and check out branch-2
cd branch-2
git checkout remotes/origin/branch-2 -b branch-2
cd ..
# You could do this if you have your own fork
# cp -R trunk MyForkOfHadoop
# cd MyForkOfHadoop
# git remote add myfork git@github.com:MyForkOfHadoop/hadoop.git
# git fetch myfork
Building the source
Now that we have the source code, let’s build it. Please read BUILDING.txt for more details. I have set up alias commands (see the .bashrc snippet above) for the different ways in which I build the source.
Depending on what your operating system already ships with, you may need to install the following packages:
- Maven: the build system Hadoop uses.
- Protobuf-2.5: PAINFUL STEP. Please note the version here. Other versions (protobuf-2.4 or protobuf-3) will NOT work. I downloaded the release from https://github.com/google/protobuf/releases/tag/v2.5.0 and built it from source (./configure; make; make install; see the sketch after this list). I wish Hadoop would upgrade already, but really, it’s extremely complicated. Look at HADOOP-13363 and the JIRAs it links if you’re interested.
- CMake: some of the native (C/C++) code uses the CMake build system.
- zlib-devel: native compression is much faster than the Java implementation (which Hadoop falls back to if the native library isn’t available).
- openssl-devel: Hadoop Pipes needs it for some reason.
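For reference, this is roughly how I would build protobuf-2.5.0 from source (a sketch; it assumes the protobuf-2.5.0.tar.gz asset from the release page above and a working C++ toolchain such as gcc-c++):
# Download and unpack the 2.5.0 release (tarball name assumed from the release page)
wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
tar xzf protobuf-2.5.0.tar.gz
cd protobuf-2.5.0
# Standard autotools build; installs under /usr/local by default
./configure
make
sudo make install
# Refresh the linker cache and confirm the version
sudo ldconfig
protoc --version   # should print: libprotoc 2.5.0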
It’s quite likely that the build may fail somewhere in the middle. You can resume from the failed module by using -rf :<module-which-failed> (see the example after the build commands below).
cd $HADOOP_SRC_PATH/trunk
mvnp
# Create the Eclipse project files.
mvn eclipse:eclipse
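If the build stops partway through, Maven can resume from the failing module instead of starting over. A hedged example (hadoop-hdfs is just an illustrative module name; use whichever module the error message reports):
# Resume the same build from the module that failed
mvn -Pdist -Pnative -Dmaven.javadoc.skip -DskipTests install -rf :hadoop-hdfs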
Setting up a Single Node Hadoop Cluster
I almost always do my development and testing with this setup. Please follow the instructions from the Hadoop docs:
- Set up passwordless SSH to localhost.
- Create core-site.xml, hdfs-site.xml, yarn-site.xml and mapred-site.xml in $HADOOP_SRC_PATH/config (a minimal sketch of the HDFS-related files follows after the start/stop scripts below). You may also initialize a git repository in this directory to keep track of the changes you’d make to the configuration (and maybe branch for different versions of Hadoop).
- Create a file $HADOOP_CONF_DIR/workers and add localhost to it to run one DataNode on localhost.
- Copy capacity-scheduler.xml and log4j.properties from the defaults shipped with the build:
cp $HADOOP_HOME/etc/hadoop/capacity-scheduler.xml $HADOOP_CONF_DIR
cp $HADOOP_HOME/etc/hadoop/log4j.properties $HADOOP_CONF_DIR
- Format the HDFS NameNode using
$ hdfs namenode -format
- Start the Hadoop daemons. Usually there are seven: HDFS (NameNode, DataNode, SecondaryNameNode), YARN (ResourceManager, NodeManager, TimelineServer) and MapReduce (JobHistoryServer). The following scripts could help:
$HADOOP_SRC_PATH/start.sh
#!/bin/bash
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
nohup mapred historyserver > ~/Code/hadoop/logs/historyserver.log &
if [ $(hadoop version | head -n 1 | cut -d' ' -f2 | cut -d- -f1) == "2.7.4" ]; then
    nohup yarn timelineserver > ~/Code/hadoop/logs/timelineserver.log &
else
    yarn --daemon start timelineserver
fi
echo -n "Number of running hadoop servers: "
jps | egrep 'NameNode|DataNode|SecondaryNameNode|ResourceManager|NodeManager|JobHistoryServer|ApplicationHistoryServer' | wc -l
echo ""
$HADOOP_SRC_PATH/stop.sh
jps | egrep "SecondaryNameNode|NameNode|JobHistoryServer|DataNode|NodeManager|ResourceManager|ApplicationHistoryServer" | awk '{print $1}' | xargs -r kill -9
sleep 1
echo -n "Number of running hadoop servers: "
jps | egrep 'NameNode|DataNode|SecondaryNameNode|ResourceManager|NodeManager|JobHistoryServer|ApplicationHistoryServer' | wc -l
echo ""
Run a YARN Application
If everything goes well, you should be able to run a sleep job:
# Run a sleep job with 2 mappers and 1 reducer. The map tasks should sleep for 100ms and the reduce tasks should sleep for 200ms.
# Adjust the version in the jar name to match the build on your PATH (e.g. 3.1.0-SNAPSHOT for a trunk build).
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.7.4-tests.jar sleep -m 2 -r 1 -mt 100 -rt 200
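As an extra smoke test, the bundled examples jar and a quick HDFS round trip work as well (a sketch; again, adjust the jar version to your build):
# Estimate pi with 2 map tasks and 10 samples per map
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.4.jar pi 2 10
# Write a file into HDFS and read it back
hdfs dfs -mkdir -p /tmp/smoketest
hdfs dfs -put $HADOOP_CONF_DIR/core-site.xml /tmp/smoketest/
hdfs dfs -cat /tmp/smoketest/core-site.xml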