
Summary of questions about Apache Hadoop

The main goal of Apache Hadoop

Open, reliable data storage and powerful distributed data processing, while saving costs when storing and processing large amounts of data.

You can see more details about Hadoop’s goals HERE

What technique does Hadoop use to achieve fault tolerance?

  • Hadoop achieves fault tolerance through redundancy

  • Files are split into chunks, and each chunk is replicated to other nodes in the cluster

  • Computation jobs are divided into independent tasks, so a failed task can be re-executed on another node

Describe how a client gets data on HDFS

The client queries the namenode for the locations of the file's chunks. The namenode returns the chunk locations, and the client then connects to the datanodes in parallel to read the chunks.

You can see details of this data reading process HERE
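As a quick illustration of this read path, here is a minimal sketch using Hadoop's Java FileSystem API. The namenode address (hdfs://localhost:9000) and the file path are assumptions for the example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed namenode address

        try (FileSystem fs = FileSystem.get(conf);
             // open() asks the namenode for the chunk (block) locations...
             FSDataInputStream in = fs.open(new Path("/data/sample.txt"))) {
            // ...then the stream reads the bytes directly from the datanodes
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}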

Describe how a client writes data on HDFS

The client contacts the namenode and specifies how much data it wants to write. The namenode assigns locations for the chunks. The client streams each chunk to the first datanode, and the datanodes then replicate the chunk among themselves in a pipeline. The process ends when all chunks and their replicas have been written successfully.

You can see details of this data writing process HERE
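Similarly, here is a minimal write sketch with the same API; again, the namenode address and the path are assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.nio.charset.StandardCharsets;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed namenode address

        try (FileSystem fs = FileSystem.get(conf);
             // create() asks the namenode to assign chunk (block) locations
             FSDataOutputStream out = fs.create(new Path("/data/output.txt"))) {
            // Bytes are streamed to the first datanode of each block's pipeline;
            // the datanodes then replicate the block among themselves
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }
    }
}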

Main components in Hadoop Ecosystem

The Hadoop Ecosystem is a platform that provides solutions for storing and processing large amounts of data. The main components in the Hadoop Ecosystem are:

  • HDFS
  • MapReduce framework (a minimal example follows this list)
  • YARN
  • ZooKeeper
Fault tolerance mechanism of datanode in HDFS

HDFS uses a heartbeat mechanism: each datanode periodically sends a status report to the namenode, by default every 3 seconds. If the namenode stops receiving heartbeats from a datanode for long enough, it considers that node dead, re-replicates its blocks, and reassigns its unfinished work to other nodes. The dead-node timeout is computed as 2 × dfs.namenode.heartbeat.recheck-interval + 10 × dfs.heartbeat.interval, which with the default values below is 2 × 300000 ms + 10 × 3000 ms = 630000 ms, i.e. 10.5 minutes.

To reconfigure the heartbeat interval, you can add the following properties to hdfs-site.xml:

<!-- Heartbeat period, in seconds -->
<property>
  <name>dfs.heartbeat.interval</name>
  <value>3</value>
</property>

<!-- Recheck interval used when marking datanodes dead, in milliseconds -->
<property>
  <name>dfs.namenode.heartbeat.recheck-interval</name>
  <value>300000</value>
</property>

You can see the default configuration values of hdfs-site.xml HERE

Data organization mechanism of datanode in HDFS

Files in HDFS are divided into block-sized chunks called data blocks. Blocks are stored as independent units (the default block size is 128 MB).

On each datanode, blocks are kept as ordinary files on the server's local file system.

See more: What happens if you store small files on HDFS?
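You can observe this block organization from a client through the FileSystem API. The sketch below prints a file's block size and the datanodes holding each block; the file path and namenode address are assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed namenode address

        try (FileSystem fs = FileSystem.get(conf)) {
            FileStatus status = fs.getFileStatus(new Path("/data/sample.txt"));
            System.out.println("Block size: " + status.getBlockSize());
            // Each BlockLocation lists the datanodes holding one block of the file
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println(block);
            }
        }
    }
}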

Data replication mechanism in HDFS

File blocks are replicated to increase fault tolerance. The namenode makes all decisions regarding the replication of blocks.
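As a small example, a file's replication factor can be changed after the fact through the FileSystem API; the namenode then schedules datanodes to create or remove copies. The path and address here are assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed namenode address

        try (FileSystem fs = FileSystem.get(conf)) {
            // Ask the namenode to set this file's replication factor to 3;
            // the namenode then schedules the block copies on datanodes
            boolean ok = fs.setReplication(new Path("/data/sample.txt"), (short) 3);
            System.out.println("Replication change accepted: " + ok);
        }
    }
}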

How does HDFS solve the single-point-of-failure problem for the namenode?

HDFS High Availability runs two namenodes in an active-passive pair: a standby namenode takes over only when the active namenode fails. Note that this standby role is distinct from the Secondary Namenode, which is a checkpointing helper that periodically merges the edit log into the fsimage.

You can see details about Secondary Namenode HERE

What are the 3 modes in which Hadoop can run?

  • Standalone mode: this is the default mode; Hadoop uses the local file system and a single Java process to run all of its services.

  • Pseudo-distributed mode: Hadoop is deployed on a single node, which runs all of the services.

  • Fully-distributed mode: Hadoop is deployed on a cluster of machines, with the namenode and datanodes on separate hosts.
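One way to see which mode an installation is in is to inspect fs.defaultFS: standalone mode leaves it at its default, file:///, so Hadoop works against the local file system, while the distributed modes point it at a namenode. A minimal sketch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ModeCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            // file:/// indicates standalone mode; hdfs://host:port indicates
            // a pseudo- or fully-distributed deployment
            System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));
            System.out.println("FileSystem   = " + fs.getUri());
        }
    }
}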

Explain Big Data and Big Data's 5V criteria

Big Data is a term for large and complex data sets that are difficult to process with relational database tools and traditional data processing applications.

The 5 Vs of Big Data are:

  • Volume: the amount of data, which is growing at an exponential rate, i.e. into petabytes and exabytes.
  • Velocity: the rate at which data is growing, which is very fast; yesterday's data is already considered old. Today, social networks are a major contributor to this growth.
  • Variety: the heterogeneity of data types. The data collected comes in many formats, such as video, audio, CSV, etc., and these different formats represent many types of data.
  • Veracity: the doubtfulness or uncertainty of the available data, caused by inconsistency and incompleteness. The data can be messy and difficult to trust; with many forms of big data, quality and accuracy are hard to control, and sheer volume is often the reason behind the lack of quality and accuracy.
  • Value: it is all well and good to have access to big data, but it is of little use unless we can turn it into value.
This post is licensed under CC BY 4.0 by the author.