INTRODUCTION TO HDFS :-
Hadoop is an open-source framework that allows to store and process big data in a distributed environment across clusters of computers.It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant as it provides high-performance access to data across Hadoop clusters. Like other Hadoop-related technologies, HDFS has become a key tool for managing pools of big data and supporting big data analytics applications.It is the primary storage system used by Hadoop applications.
HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.It provides high throughput access to application data and is suitable for applications that have large data sets.
HDFS uses a master/slave architecture where master consists of a single NameNode that manages the file system metadata and one or more slave DataNodes that store the actual data.
What are NameNodes and DataNodes?
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself. The NameNode is a Single Point of Failure for the HDFS Cluster . When the NameNode goes down, the file system goes offline.
The DataNode is responsible for storing the files in HDFS. It manages the file blocks within the node. It sends information to the NameNode about the files and blocks stored in that node and responds to the NameNode for all filesystem operations. A functional filesystem has more than one DataNode , with data replicated across them.
Within HDFS, a given name node manages file system namespace operations like opening, closing, and renaming files and directories. A name node also maps data blocks to data nodes, which handle read and write requests from HDFS clients. Data nodes also create, delete, and replicate data blocks according to instructions from the governing name node.
HDFS ARCHITECTURE :-
HDFS is comprised of interconnected clusters of nodes where files and directories reside. An HDFS cluster has a NameNode, that manages the file system namespace and regulates client access to files. In addition, data nodes (DataNodes) store data as blocks within files.
GOALS OF HDFS :-
- Fault detection and recovery : Detection of faults and quick, automatic recovery from them is the core architectural goal.
- Huge datasets : HDFS should have hundreds of nodes per cluster to manage the applications having huge datasets.
- Simple coherency model : HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed except for appends.
- Large data sets : Since HDFS is tuned to support large files, it should support tens of millions of files in a single instance.