Hadoop Guide Chapter 3: The Hadoop Distributed Filesystem

Datetime:2016-08-23 01:47:41          Topic: Hadoop  HDFS           Share

— HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. —

Filesystems that manage the storage across a network of machines are called distributed filesystems . One of the biggest challenges is making the distributed filesystem tolerate node failure without suffering data loss . The hadoop distributed filesystem is called HDFS , which stands for Hadoop Distributed Filesystem.HDFS is a filesystem designed for storing very large files with streaming data access patterns , running on clusters of commodity hardware .

HDFS is built around the idea that the most efficient data processing parttern is write-once, read-many-times pattern . Because the namenode holds filesystem metadata in memory, the limit to the number of files in a filesystem is governed by the amount of memory on the namenode . As a rule of thumb, each file, directory, and block takes about 150 bytes. So, for example, if you had one million files, each taking one block, you would need at least 300 MB of memory.


A disk has a block size, which is the minimum amount of data that it can read or write. Filesystems for a single disk build on this by dealing with data in blocks, which are an integral multiple of the disk block size. Filesystem blocks are typically a few kilobytes in size, whereas disk blocks are normally 512 bytes.

Like in a filesystem for a single disk, files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block’s worth of underlying storage.

There’s nothing that requires the blocks from a file to be stored on the same disk, so they can take advantage of any of the disks in the cluster. In fact, it would be possible, if unusual, to store a single file on an HDFS cluster whose blocks filled all the disks in the cluster.

Having a block abstraction for a distributed filesystem brings several benefits.

  • A file can be larger than any single disk in the network.
  • Making the unit of abstraction a block rather than a file simplifies the storage subsystem. So the storage subsystem only deals with blocks, simplifying storage management: blocks are a fixed size
  • Furthermore, blocks fit well with replication for proiding fault tolerance and availability.

If a block becomes unavailable, a copy can be read from another location in a way that is transparent to the client. A block that is no longer available due to corruption or machine failure can be replicated from its alternative locations to other live machines to bring the replication factor back to the normal level.

Some applications can choose to set a high replication factor for the blocks in a popular file to spread the read load on the cluster.

To see list the blocks that make up each file in the filesystem.

hdfs fsck / -files -blocks

Namenodes and Datanodes

An HDFS cluster has two types of nodes: a namenode and a number of datanodes.The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories. This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log. The namenode also knows the datanodes on which all the blocks for a given file are located; however, it does not store block locations persistently, which will be reconstructed from datanodes when the system starts.

Datanodes are the workhorses of the filesystem. They will report back to the namenode periodically with lists of blocks that they are storing.

If the namenode failed, all the files on the filesystem would be lost since there would be no way of knowing how to reconstruct the files from the blocks on the datanodes.

Hadoop provides two mechanisms to make the namenode resilient to failure.:

When there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task. There can be many keys (and their associated values) in each partition, but the records for any given key are all in a single partition. The partitioning can be controlled by a user-defined partitioning function, but normally the default partitioner - which buckets keys using a hash function - works very well.

About List