This tutorial explains about what and why of HDFS and its features. HDFS is the world’s most reliable storage-system, which can store large quantity of structured as well as unstructured data. HDFS provides reliable storage of data with its unique feature of Data Replication. HDFS is highly fault tolerant, reliable, available, scalable, distributed file system.
2. HDFS Introduction
HDFS is a distributed file system which provides redundant storage space for files having huge sizes. It is based on Google’s File system (GFS or GoogleFS). It is designed to run on commodity hardware. HDFS is highly fault-tolerant and it provides high throughput access to application data. It is suitable for applications with large data sets. HDFS is accepts as world’s most reliable storage system. It is anticipated that by the end of 2017 75% of the data available on the planet will be residing in HDFS. To learn more about HDFS follow this introductory guide .
3. Features of HDFS
3.1. Fault Tolerance
Fault tolerance in HDFS refers to the working strength of a system in unfavourable conditions and how that system can handle such situations. HDFS is highly fault tolerant, in HDFS data is divided into blocks and multiple copies of blocks are created on different machines in the cluster (this replica creation is configurable). So whenever if any machine in the cluster goes down, then client can easily access their data from the other machine which contains the same copy of data blocks. HDFS also maintains the replication factor by creating replica of blocks of data on other rack. Hence if suddenly a machine fails, then user can access data from other slaves present in other rack. To learn more about Fault Tolerance follow this Guide .
3.2. High Availability
HDFS is a highly available file system, data gets replicated among the nodes in the HDFS cluster by creating replica of the blocks on the other slaves present in HDFS cluster. Hence whenever a user wants to access his data, they can access their data from the slaves which contains its blocks and which is available on the nearest node in the cluster. And during unfavourable situations like failure of a node, user can easily access their data from the other nodes. Because duplicate copies of blocks which contain user data are created on the other nodes present in the HDFS cluster. To learn more about high availability follow this Guide .
3.3. Data Reliability
HDFS is a distributed file system which provides reliable data storage. HDFS can store data in the range of 100s of petabytes. It stores data reliably on cluster of nodes. HDFS divides the data into blocks and these blocks are stored on nodes present in HDFS cluster. It stores data reliably by creating replica of each and every block present on the nodes present in the cluster and hence provides fault tolerance facility. If node containing data goes down, then user can easily access that data from the other nodes which contains copy of same data in the HDFS cluster. HDFS by default creates 3 copies of blocks containing data present in the nodes in HDFS cluster. Hence data is quickly available to the users and hence user does not face the problem of data loss. Hence HDFS is highly reliable.
Data Replication is most important and unique feature of HDFS. In HDFS replication of data is done to solve the problem of data loss in unfavorable conditions like crashing of a node, hardware failure, and so on. As data is replicated across a number of machines in the cluster by creating blocks. The process of replication is maintained at regular intervals of time by HDFS and HDFS keeps creating replicas of user data on different machines present in the cluster. So whenever any machine in the cluster gets crashed, user can access their data from other machines which contains the blocks of that data. Hence there is no possibility of losing of user data. Follow this guide to learn more about the data read operation .
As HDFS stores data on multiple nodes in the cluster, when requirements increase we can scale the cluster. There are two scalability mechanism available: Vertical scalability – add more resources (CPU, Memory, Disk) on the existing nodes of the cluster. Another way is horizontal scalability – Add more machines in the cluster. Horizontal way is preferred as we can scale the cluster from 10s of nodes to 100s of nodes on the fly without any down-time.
3.6. Distributed Storage
In HDFS all the features are achieved via distributed storage and replication. In HDFS data is stored in distributed manner across the nodes in HDFS cluster. In HDFS data is divided into blocks and is stored on the nodes present in HDFS cluster. And then replicas of each and every block are created and stored on other nodes present in the cluster. So if a single machine in the cluster gets crashed we can easily access our data from the other nodes which contains its replica.