Hadoop installation

Hadoop: Setting up a Single Node Cluster.

This document explains how to set up and configure a single-node Hadoop installation so that you can quickly perform simple operations using Hadoop MapReduce and the Hadoop Distributed File System (HDFS).

Supported Platforms

  • Linux is supported as a development and production platform. Hadoop has been demonstrated on GNU/Linux clusters with 2000 nodes.
  • Windows is also a supported platform, but the following steps are for Linux only. To set up Hadoop on Windows, see the official Hadoop wiki page.

Prerequisites

  • Java™ must be installed. Apache Hadoop 2.7 and later requires Java 7, and Hadoop 3.x (the version installed in this guide) requires Java 8. Hadoop is built and tested on both OpenJDK and Oracle (HotSpot) JDK/JRE. Versions 2.6 and earlier support Java 6.

    Recommended Java versions are described in the official Hadoop documentation.

$ sudo add-apt-repository ppa:openjdk-r/ppa
$ sudo apt-get update
$ sudo apt-get install openjdk-8-jdk
  • ssh and rsync must be installed, and sshd must be running, to use the Hadoop scripts that manage remote Hadoop daemons.
$ sudo apt-get install ssh
$ sudo apt-get install rsync
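
Before continuing, it is worth confirming that sshd is actually running. The service name below is the Debian/Ubuntu one; adjust it for your distribution:

$ sudo service ssh status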

Hadoop Installation & Configuration

I recommend creating a dedicated, non-root account for Hadoop work, such as a hadoop user in a hadoop group. Create the account using the following commands.

$ sudo groupadd hadoop
$ sudo useradd -m -s /bin/bash -g hadoop hadoop
$ sudo passwd hadoop

After creating the account, you also need to set up key-based ssh access to the account itself. To do this, execute the following commands.

$ su - hadoop
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys

Let’s verify key-based login. The command below should not ask for a password, but the first time it will prompt you to add the host’s RSA key to the list of known hosts.

$ ssh localhost
$ exit

Hadoop Distribution

In this step, download the latest Hadoop release archive using the command below. You can also select an alternate download mirror to increase download speed.

$ cd ~
$ wget http://mirrors.gigenet.com/apache/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz
$ tar xzf hadoop-3.1.1.tar.gz
$ mv hadoop-3.1.1 hadoop

Then point HADOOP_HOME at the extracted directory:

$ export HADOOP_HOME=/home/hadoop/hadoop
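
An export at the shell prompt only lasts for the current session. A common convention (optional; Hadoop itself reads the hadoop-env.sh settings configured below) is to append the variables to ~/.bashrc so they survive logouts:

$ echo 'export HADOOP_HOME=/home/hadoop/hadoop' >> ~/.bashrc
$ echo 'export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin' >> ~/.bashrc
$ source ~/.bashrc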

Environment Variables & Configuration

To run the Hadoop distribution you have to define the JAVA_HOME variable (the path below matches the OpenJDK 8 package installed earlier; adjust it to your own JDK location):

$ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
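
If you are not sure where the JDK lives on your system, resolving the javac symlink is a quick heuristic (assuming a standard package install; JAVA_HOME is everything up to, but not including, /bin/javac):

$ readlink -f $(which javac)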

Then edit the file $HADOOP_HOME/etc/hadoop/hadoop-env.sh to define the following parameters:

$ vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh

and add the following lines (including the JAVA_HOME setting from above) to hadoop-env.sh:

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native"
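
After saving the file, a quick sanity check is to ask Hadoop for its version; if the environment is wired up correctly, it prints the release details:

$ $HADOOP_HOME/bin/hadoop version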

Configuration Files

Hadoop has a number of configuration files that need to be set up according to the requirements of your Hadoop infrastructure. Let’s start with the configuration for a basic single-node cluster. First, navigate to the location below:

$ cd $HADOOP_HOME/etc/hadoop

Edit core-site.xml

<configuration>
 <property>
  <name>fs.defaultFS</name>
  <value>hdfs://localhost:9000</value>
 </property>
</configuration>

Edit hdfs-site.xml

<configuration>
 <property>
  <name>dfs.replication</name>
  <value>1</value>
 </property>

 <property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///home/hadoop/hadoopdata/hdfs/namenode</value>
 </property>

 <property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///home/hadoop/hadoopdata/hdfs/datanode</value>
 </property>
</configuration>

Edit yarn-site.xml

<configuration>
 <property>
  <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
 </property>
</configuration>
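
One optional extra, not part of the original file list: to run MapReduce jobs on YARN, mapred-site.xml in the same directory needs the framework name set. Edit mapred-site.xml:

<configuration>
 <property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
 </property>
</configuration>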

Format Namenode

Before starting Hadoop for the first time, format the namenode. The hdfs command lives in $HADOOP_HOME/bin/, which is already on the PATH set in hadoop-env.sh. Run the command below and make sure the output reports that the storage directory has been successfully formatted.
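
It does no harm to create the storage directories configured in hdfs-site.xml up front (the format step would create the namenode directory itself; these paths simply mirror the values set earlier):

$ mkdir -p /home/hadoop/hadoopdata/hdfs/namenode
$ mkdir -p /home/hadoop/hadoopdata/hdfs/datanode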

$ hdfs namenode -format

Start Hadoop Cluster

Let’s start your Hadoop cluster using the scripts provided by Hadoop. Navigate to your $HADOOP_HOME/sbin directory and execute the scripts one by one.

$ cd $HADOOP_HOME/sbin/

Now run the start-dfs.sh script.

$ ./start-dfs.sh

Sample output:

Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [smart]
2018-05-02 18:00:32,565 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Now run the start-yarn.sh script.

$ ./start-yarn.sh

Sample output:

Starting resourcemanager
Starting nodemanagers
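
Optionally, verify that all daemons came up. The jps tool that ships with the JDK lists running Java processes:

$ jps

You should see NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager listed (plus Jps itself).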

Access Hadoop Services in Browser

The Hadoop NameNode web UI starts on port 9870 by default. Access your server on port 9870 in your favorite web browser.

http://localhost:9870/

Now access port 8042 for the NodeManager web UI, which shows information about the node and the applications running on it.

http://localhost:8042/

Access port 9864 to get details about the DataNode on this host.

http://localhost:9864/

Test Hadoop Single Node Setup

Make the HDFS directories required for the hadoop user using the following commands (run them from $HADOOP_HOME, since the bin/ paths are relative).

$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/hadoop

 Copy all files from the local file system directory /var/log/apache2 to the Hadoop distributed file system using the command below:

$ bin/hdfs dfs -put /var/log/apache2 logs
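
You can verify the copy from the command line as well; the files land under /user/hadoop/logs:

$ bin/hdfs dfs -ls logs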

Browse the Hadoop distributed file system by opening the URL below in the browser. You will find all the copied log files under the logs directory.

 http://localhost:9870/explorer.html#/user/hadoop/logs/

Now copy the logs directory from the Hadoop distributed file system back to the local file system.

$ bin/hdfs dfs -get logs /tmp/logs
$ ls -l /tmp/logs/
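
When you are done experimenting, the cluster can be stopped with the companion scripts in $HADOOP_HOME/sbin:

$ ./stop-yarn.sh
$ ./stop-dfs.sh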