
Install and deploy Hadoop single node

Every major industry is adopting Apache Hadoop as the standard framework for big data processing and storage. Hadoop is designed to be deployed across a network of hundreds or even thousands of dedicated servers, all of which work together to store and process very large data sets.

See more: Overview of Hadoop

Hadoop only shows its real power when it is installed and used across multiple nodes, but for beginners a single-node deployment is a great way to get acquainted with it. In this article, I will guide you through deploying Hadoop on one node (Hadoop single node).

Prerequisites

  • Your machine must have a JDK (version 8, 11, or 15 is fine; note that Hadoop 3.1.4 runs on JDK 8, but Hadoop 3.2.2 and later should use Java 11 or higher). If you do not have one, you can install it with the following command:
sudo apt-get install openjdk-11-jdk -y
  • Your machine must have an SSH client and SSH server. If you do not have them, you can install them with the following command (a quick version check for both prerequisites follows this list):
sudo apt-get install openssh-server openssh-client -y
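
To confirm the prerequisites are in place, you can check the installed versions; the exact version strings will depend on your system:

java -version
ssh -V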

Set up passwordless SSH for Hadoop

Generate an SSH key pair and determine where it will be stored:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

The system will create and save the SSH key pair.

Use the cat command to append the public key to authorized_keys in the .ssh directory:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Restrict permissions on the authorized_keys file with the chmod command:

chmod 0600 ~/.ssh/authorized_keys

Verify that everything is set up correctly by SSHing to localhost:

ssh localhost

Download and Install Hadoop on Ubuntu

Download a version of Hadoop from the official Hadoop distribution site at: https://hadoop.apache.org/releases.html

Click the binary link in the Binary download column.
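
Alternatively, you can fetch the archive directly from the command line; the URL below points at the Apache archive for release 3.2.2 and may need adjusting for whichever version you chose:

wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.2/hadoop-3.2.2.tar.gz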

Now put the compressed file you just downloaded anywhere and extract it with the command:

tar xvzf hadoop-3.2.2.tar.gz

Configuring and Deploying Hadoop Single Node (Pseudo-Distributed Mode)

To configure Hadoop for pseudo-distributed mode, we will edit the Hadoop configuration files under etc/hadoop as well as the shell environment file. The files involved are:

  • .bashrc
  • hadoop-env.sh
  • core-site.xml
  • hdfs-site.xml
  • mapred-site.xml
  • yarn-site.xml

Note: in the installation below my Hadoop is placed in the folder /opt/myapp; you can put Hadoop anywhere, it does not have to match mine.
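
For example, assuming you extracted the archive in your current directory and want the same layout as mine, commands along these lines would move it into place and give your user ownership of it:

sudo mkdir -p /opt/myapp
sudo mv hadoop-3.2.2 /opt/myapp/
sudo chown -R $USER:$USER /opt/myapp/hadoop-3.2.2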

Configure Hadoop environment variables (file .bashrc)

Open the .bashrc file with nano:

sudo nano ~/.bashrc

Define the Hadoop environment by adding the variables below at the end of the file (edit the Hadoop home path to match your own installation path):

#Hadoop Related Options
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=/opt/myapp/hadoop-3.2.2 
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME 
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME 
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native" 

Apply the changes with the command: source ~/.bashrc
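
To confirm the variables took effect, you can print HADOOP_HOME and ask Hadoop for its version; the output should mention 3.2.2 if that is the release you installed:

echo $HADOOP_HOME
hadoop version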

Edit file hadoop-env.sh

Open file hadoop-env.sh with nano:

sudo nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Find the JAVA_HOME line, uncomment it (remove the # sign), and set it to your OpenJDK path.
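
With the OpenJDK 11 package installed as in the prerequisites, the line would typically end up looking like this (adjust the path if your JDK lives elsewhere):

export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64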

Edit file core-site.xml

Open file core-site.xml with nano:

sudo nano $HADOOP_HOME/etc/hadoop/core-site.xml

Add the following between the two <configuration> tags so that the full content looks like this:

<configuration>
<property>
	<name>hadoop.tmp.dir</name>
	<value>/opt/myapp/hadoop-3.2.2/tmpdata</value>
</property>
<property> 
	<name>fs.default.name</name> 
	<value>hdfs://localhost:9000</value>
</property>
</configuration>

hadoop.tmp.dir sets the base directory Hadoop uses for its temporary files. fs.default.name (the legacy name for fs.defaultFS) configures the address of HDFS; here it is set to port 9000 on localhost. If that port is already in use on your machine, change it to another free port so Hadoop can operate normally.
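
If you suspect port 9000 is already taken, you can list the listening TCP sockets before starting Hadoop; the grep pattern is simply the port number used above:

sudo ss -lntp | grep 9000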

Edit file hdfs-site.xml

Open file hdfs-site.xml with nano:

sudo nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Add the following between the two <configuration> tags so that the full content looks like this:

<configuration>
<property>
	<name>dfs.namenode.name.dir</name>
	<value>/opt/myapp/hadoop-3.2.2/dfsdata/namenode</value>
</property>
<property>
	<name>dfs.datanode.data.dir</name>
	<value>/opt/myapp/hadoop-3.2.2/dfsdata/datanode</value>
</property> 
<property>
	<name>dfs.replication</name> 
	<value>2</value>
</property>
</configuration>

dfs.replication configures the number of copies (replicas) kept of each HDFS block. Note that a single-node cluster has only one DataNode, so a value of 1 is usually sufficient; a higher value simply leaves blocks under-replicated.

Edit file mapred-site.xml

Open file mapred-site.xml with nano:

sudo nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

Add the following between the two <configuration> tags so that the full content looks like this:

<configuration>
<property>
	<name>mapreduce.framework.name</name>
	<value>yarn</value>
</property>
</configuration>
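
One extra note for Hadoop 3.x: if a MapReduce job later fails because the MRAppMaster class cannot be found, the Apache single-node guide adds a classpath property to this same file; the value below assumes the standard share/hadoop/mapreduce layout and the HADOOP_MAPRED_HOME variable defined earlier in .bashrc:

<property>
	<name>mapreduce.application.classpath</name>
	<value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>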

Edit file yarn-site.xml

Open file yarn-site.xml with nano:

sudo nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

Add the following between the two <configuration> tags so that the full content looks like this:

<configuration> 
<property>
	<name>yarn.nodemanager.aux-services</name>
	<value>mapreduce_shuffle</value>
</property>
<property> 
	<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
	<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
	<name>yarn.resourcemanager.hostname</name>
	<value>127.0.0.1</value>
</property>
<property>
	<name>yarn.acl.enable</name>
	<value>0</value>
</property> 
<property>
	<name>yarn.nodemanager.env-whitelist</name>  
	<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>

Format HDFS namenode

Format the NameNode before starting the services for the first time:

hdfs namenode -format

Start Hadoop Cluster

From the sbin directory, start Hadoop with the command:

./start-all.sh
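
Because $HADOOP_HOME/sbin was added to the PATH in .bashrc, you can also start HDFS and YARN separately from any directory with the two scripts below; the result is the same set of daemons:

start-dfs.sh
start-yarn.sh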

Check the running daemons with the command:

jps

If the result shows 6 daemons (NameNode, DataNode, SecondaryNameNode, ResourceManager, NodeManager, and the Jps process itself), then you have configured everything correctly; you can ignore any extra entries such as XMLServerLauncher.
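
For reference, the jps output will look roughly like this; the process IDs are only examples and will differ on your machine:

11234 NameNode
11456 DataNode
11689 SecondaryNameNode
11892 ResourceManager
12013 NodeManager
12235 Jps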

Access Hadoop UI from the browser

You can check whether Hadoop has been installed successfully at the NameNode's default port 9870:

localhost:9870

Check the DataNode at its default port 9864:

localhost:9864

Check out the YARN resource manager at port 8088:

localhost:8088
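
Finally, as a quick sanity check from the command line, you can create a directory in HDFS and list the filesystem root; the directory name /test here is just an example:

hdfs dfs -mkdir /test
hdfs dfs -ls /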

Reference: https://phoenixnap.com/

This post is licensed under CC BY 4.0 by the author.