Entire website open-sourced at: https://github.com/grokit/website_grokit_ca/ . Feel free to make pull requests if you find mistakes!
Apache Hadoop Explained: Kafka, ZooKeeper, HDFS and Cassandra.
Apache Hadoop is a suite of open-source components which serve as the building blocks of large distributed systems.
Some of the world's best apps and websites are powered by Hadoop. If you want to understand how to build world-scale web services, understanding the principal Hadoop components is the right place to start.
ZooKeeper is a distributed coordination service which is used when nodes in a distributed system need a single source of truth.
It is implemented as a single (movable) master with N coordinated nodes. A majority of nodes (n/2+1) must agree on a change before the change is accepted. The write requests have to be processed by the master (so the write rate is severely limited), but the read requests can be answered by any of the coordinated nodes.
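The majority rule above can be made concrete with a little arithmetic. This is a minimal sketch; the function names are made up for illustration and are not part of ZooKeeper.

```python
def quorum_size(n: int) -> int:
    """Smallest number of nodes that forms a strict majority of n (n/2+1)."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """How many nodes can be down while a majority can still agree."""
    return n - quorum_size(n)


for n in (3, 4, 5):
    print(n, quorum_size(n), tolerated_failures(n))
# prints:
# 3 2 1
# 4 3 1
# 5 3 2
```

Note that going from 3 to 4 nodes does not let the system survive more failures, which is one reason odd-sized ensembles are a common choice.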
The basic units of interaction are znodes, which are file-like entities addressable under a path.
ZooKeeper provides a low-level API that allows just a few fundamental operations on those znodes. For example (but not limited to):
setData(path, data, version)
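The `version` argument is what makes `setData` a conditional update: the write only goes through if the znode is still at the version the caller last read. The sketch below imitates that behaviour with an in-memory dictionary. `ToyZnodeStore` and `BadVersionError` are hypothetical names for illustration, not the real ZooKeeper client, and the "-1 matches any version" convention is modelled on how the real API is understood to behave.

```python
class BadVersionError(Exception):
    """Raised when the caller's expected version is stale."""


class ToyZnodeStore:
    """In-memory stand-in for a znode tree (illustration only)."""

    def __init__(self):
        self._nodes = {}  # path -> (data, version)

    def create(self, path: str, data: bytes) -> None:
        if path in self._nodes:
            raise KeyError(f"{path} already exists")
        self._nodes[path] = (data, 0)

    def get_data(self, path: str):
        """Return (data, version) for the znode at path."""
        return self._nodes[path]

    def set_data(self, path: str, data: bytes, version: int) -> int:
        """Replace the data only if version matches; return the new version."""
        _, current = self._nodes[path]
        if version != -1 and version != current:
            raise BadVersionError(f"expected {version}, found {current}")
        self._nodes[path] = (data, current + 1)
        return current + 1


store = ToyZnodeStore()
store.create("/config/db", b"host=a")
_, v = store.get_data("/config/db")

store.set_data("/config/db", b"host=b", v)      # succeeds, version becomes 1
try:
    store.set_data("/config/db", b"host=c", v)  # v is now stale
except BadVersionError as e:
    print("rejected:", e)
```

Two clients that both read version 0 cannot silently overwrite each other: the second write is rejected and that client has to re-read before retrying.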
Guarantees: The clients are not guaranteed to all have the same view at the same time (because of network delays and network partitions, some clients might have information from a previous state). But the clients are guaranteed to all see the changes in the same order.
Example Use-Cases: distributed lock, group membership and configuration management.
ZooKeeper provides a very low-level abstraction, and implementing scenarios like leader election can be really hard to do well. Usually people will use the Curator library, which implements higher-level concepts on top of the ZooKeeper API.
Similar system: Google's Chubby.
- ZooKeeper is very low-level, this library probably implements your use-case.
Uses Zab (a Paxos alternative) as its consistency algorithm; see the paper "Zab: High-performance broadcast for primary-backup systems".
Kafka is a publish-subscribe distributed messaging system implemented on a log abstraction. Publishers write to topics, and any number of clients can subscribe to a topic and get notified when a new event happens.
The base abstraction in Kafka is a distributed log. This log defines an authoritative order for the events in the system and acts as a buffer between producer and consumer (a producer can produce N messages, and the consumer does not have to consume them immediately). However, the log has a maximum size and retention period, so events that are not consumed within that period vanish.
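To make the log abstraction concrete, here is a toy single-process log with offsets and size-based retention. It is only a sketch: `ToyLog` is a made-up name, time-based retention is left out, and nothing here uses the Kafka client API.

```python
from collections import deque


class ToyLog:
    """Append-only log that retains at most max_size records."""

    def __init__(self, max_size: int):
        self._records = deque()
        self._max_size = max_size
        self._first_offset = 0  # offset of the oldest retained record
        self._next_offset = 0   # offset the next append will receive

    def append(self, message: str) -> int:
        """Append a message and return the offset it was stored at."""
        offset = self._next_offset
        self._records.append(message)
        self._next_offset += 1
        if len(self._records) > self._max_size:
            self._records.popleft()  # retention: the oldest record vanishes
            self._first_offset += 1
        return offset

    def read(self, offset: int):
        """Return (next_offset, messages) starting at offset.

        Records older than the retention window are silently skipped.
        """
        start = max(offset, self._first_offset)
        messages = list(self._records)[start - self._first_offset:]
        return self._next_offset, messages


log = ToyLog(max_size=3)
for i in range(5):
    log.append(f"event-{i}")

# A slow consumer that never read anything asks for offset 0.
next_offset, messages = log.read(0)
print(messages)  # ['event-2', 'event-3', 'event-4']
```

The producer never waits for the consumer, and the consumer tracks its own position by remembering `next_offset`. The price is that a consumer which falls too far behind misses events 0 and 1.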
Kafka is organized by topics. A topic is a name that producers and consumers agree to use for a specific event stream. For example, a topic could be "Picture7564_Comments". Producers write to that topic when a user comments on the picture, and clients subscribed to that topic get a notification whenever a new comment is added on that picture.
Topics are broken down into partitions. Partitions provide scaling, parallelism and ordering guarantees. Messages in the same partition are totally ordered on the server and delivered in order to the client. However, messages in different partitions of a topic have no defined order relative to each other. A partition is assigned to a server, so having only one partition per topic limits the throughput.
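One common way to get useful ordering out of partitions is to route every message with the same key to the same partition. The sketch below assumes that keyed routing; the CRC32 hash, the partition count, and the key names are arbitrary choices for illustration and do not reproduce Kafka's own partitioner.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # illustrative value


def partition_for(key: str) -> int:
    """Map a key to a partition with a stable hash (assumed scheme)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


partitions = defaultdict(list)  # partition index -> ordered list of messages
events = [
    ("picture-1", "first!"),
    ("picture-2", "nice shot"),
    ("picture-1", "second"),
    ("picture-1", "third"),
]
for key, comment in events:
    partitions[partition_for(key)].append((key, comment))

# All comments for picture-1 sit in one partition, in publish order.
p = partition_for("picture-1")
print([c for k, c in partitions[p] if k == "picture-1"])
# ['first!', 'second', 'third']
```

Comments on the same picture keep their order because they share a partition. Comments on different pictures may land in different partitions, where no relative order is defined.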
Kafka replicates the data on multiple servers. The replication factor is configurable: setting it to one yields the best performance but risks permanent message loss on a hard-disk failure. With a replication factor of N, up to N-1 nodes can fail at the same time without a message being lost, and it stays readable by the consumer.
Kafka is designed to be a real-time system; the delay between an event being published and consumed is expected to be less than 100ms.
Similar system: RabbitMQ
- Very good introduction, read first. Has a section that shows you how to install and use.
- This is not as good as the Apache documentation page.
HDFS is a distributed data store which is highly available and guarantees no data loss even in case of multiple server and hard-disk failures. Think of HDFS as a distributed file store that guarantees high throughput for reads and writes as well as durability. HDFS gives a file-like interface: you can read, write and delete files, which are organized under a hierarchical folder structure.
Internally HDFS uses a master / coordinated nodes design. The master node is called the `NameNode` and the coordinated nodes are called `DataNode`s. Files stored in HDFS are broken down into blocks (up to \~64MB), and the bytes of each block are stored in the DataNodes. The NameNode is responsible for managing the filesystem namespace and regulating access to files by clients. It stores the block-to-DataNode mapping. DataNodes are responsible for storing the data and directly handling read / write streams. So when a client wants to read a file, its request first has to go to the NameNode to find out which DataNodes hold the blocks for this file. Then the client connects directly to the DataNodes and reads the data.
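The read path described above (ask the NameNode where the blocks are, then fetch the bytes from the DataNodes) can be sketched in a few lines. Everything here is a toy: the class names, the round-robin placement, and the 4-byte block size are assumptions chosen to keep the example readable, and replication is left out.

```python
BLOCK_SIZE = 4  # bytes; tiny on purpose, the real block size is far larger


class ToyDataNode:
    """Stores raw block bytes, keyed by block id."""

    def __init__(self, name: str):
        self.name = name
        self.blocks = {}  # block_id -> bytes


class ToyNameNode:
    """Holds only metadata: which blocks make up a file and where they live."""

    def __init__(self, datanode_names):
        self._datanode_names = list(datanode_names)
        self._file_blocks = {}  # path -> [(block_id, datanode_name), ...]
        self._next_block_id = 0

    def allocate(self, path: str, num_blocks: int):
        """Pick a DataNode per block (round-robin) and record the mapping."""
        placements = []
        for i in range(num_blocks):
            name = self._datanode_names[i % len(self._datanode_names)]
            placements.append((self._next_block_id, name))
            self._next_block_id += 1
        self._file_blocks[path] = placements
        return placements

    def locate(self, path: str):
        """Metadata lookup only; no file bytes pass through the NameNode."""
        return self._file_blocks[path]


def write_file(namenode, datanodes, path: str, data: bytes) -> None:
    chunks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    placements = namenode.allocate(path, len(chunks))
    for (block_id, name), chunk in zip(placements, chunks):
        datanodes[name].blocks[block_id] = chunk  # client talks to the DataNode


def read_file(namenode, datanodes, path: str) -> bytes:
    return b"".join(datanodes[name].blocks[block_id]
                    for block_id, name in namenode.locate(path))


datanodes = {f"dn{i}": ToyDataNode(f"dn{i}") for i in range(3)}
namenode = ToyNameNode(datanodes.keys())
write_file(namenode, datanodes, "/logs/a.txt", b"hello hdfs!")
print(namenode.locate("/logs/a.txt"))  # [(0, 'dn0'), (1, 'dn1'), (2, 'dn2')]
print(read_file(namenode, datanodes, "/logs/a.txt"))  # b'hello hdfs!'
```

The NameNode only ever handles the small `(block_id, datanode_name)` list, while the block bytes move between the client and the DataNodes.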
It may seem odd that all reads / writes need to go through the single NameNode, since this introduces an obvious performance bottleneck. However, the operations that the NameNode has to do (looking up file -> block information from RAM, and filesystem metadata changes) are light, and the costly operations (actually reading / writing the data) are done by the DataNodes, so the bottleneck is not that severe. The NameNode uses a synchronously-replicated transaction log for metadata changes so that HDFS can recover from a permanent NameNode failure.
HDFS maintains N (usually 3) copies of every block at all times. It verifies that N copies are always healthy, and if some of the copies have vanished (e.g. hard-disk failure) it creates new copies of the data until the number of replicas is back to N.
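The repair loop can be pictured as follows. This is a simplified sketch, not HDFS's actual placement policy: `re_replicate` is a hypothetical helper, and new replicas are simply placed on the alphabetically first live nodes that do not already hold the block.

```python
REPLICATION = 3  # the N from the text


def re_replicate(block_locations, live_nodes):
    """Return a new block -> set-of-nodes map with lost replicas replaced."""
    repaired = {}
    for block, nodes in block_locations.items():
        healthy = {n for n in nodes if n in live_nodes}
        if not healthy:
            raise RuntimeError(f"{block}: every replica is gone")
        missing = max(0, REPLICATION - len(healthy))
        # Candidate targets: live nodes that do not yet hold this block.
        candidates = sorted(live_nodes - healthy)
        repaired[block] = healthy | set(candidates[:missing])
    return repaired


locations = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn2", "dn3", "dn4"},
}
live = {"dn1", "dn3", "dn4", "dn5"}  # dn2 has failed

repaired = re_replicate(locations, live)
for block in sorted(repaired):
    print(block, sorted(repaired[block]))
# blk_1 ['dn1', 'dn3', 'dn4']
# blk_2 ['dn1', 'dn3', 'dn4']
```

As long as at least one healthy copy of a block survives, it can be used as the source for the replacement copies.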
Similar systems: GFS, Amazon S3, Azure Blob storage
Cassandra is a data store for tabular data that supports mass-query mechanisms but, unlike a monolithic database (e.g. MySQL), scales incrementally by gradually adding nodes as more capacity is needed. Cassandra is a column-oriented database: it provides more structure than a simple key-value store and an SQL-like language (called CQL) for data operations, but less structure than a full-on relational database.
The base abstraction in Cassandra is the table, which is itself broken down into rows and columns. A (table, key, columnName) triple represents the address of a piece of data. Unlike a relational database, Cassandra does not force all keys in a table to have the same set of columns.
Writes use a quorum system where a majority of the replicas need to acknowledge success before a write completes. For reads, the client has the choice to wait for just one replica to return the data or for all replicas, depending on its consistency requirements.
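Why the number of replicas read matters can be checked by brute force. With N replicas, a write acknowledged by W of them and a read that waits for R of them are guaranteed to share at least one replica only when R + W > N. The sketch below enumerates every possible pair of replica sets to confirm this for N = 3; the function name is made up for illustration.

```python
from itertools import combinations


def always_overlap(n: int, w: int, r: int) -> bool:
    """True if every possible write set of size w intersects every read set of size r."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


N = 3
W = N // 2 + 1  # majority write quorum, as described above
for r in (1, 2, 3):
    print(r, always_overlap(N, W, r), r + W > N)
# 1 False False
# 2 True True
# 3 True True
```

Reading from a single replica is the fastest option but may return stale data, because that one replica might not have been part of the write quorum.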
The data is sharded across nodes using consistent hashing on the key. hash(key) yields the coordinator for this key.
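A minimal consistent-hashing ring looks like the sketch below. It is an illustration under simplifying assumptions: one position per node (no virtual nodes), MD5 as the hash, and made-up node names. It does not reproduce Cassandra's actual partitioner.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Position of a node name or key on the ring (illustrative hash choice)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class ToyRing:
    def __init__(self, nodes):
        self._ring = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    def node_for(self, key: str) -> str:
        """First node clockwise from hash(key), wrapping around the ring."""
        i = bisect.bisect_right(self._tokens, token(key)) % len(self._ring)
        return self._ring[i][1]


keys = [f"user-{i}" for i in range(1000)]
before = ToyRing(["node-a", "node-b", "node-c"])
after = ToyRing(["node-a", "node-b", "node-c", "node-d"])

moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
# Every key that changed owner moved to the new node; nothing was reshuffled
# between the existing nodes.
print(all(after.node_for(k) == "node-d" for k in moved))  # True
```

This is the property that makes incremental scaling practical: adding a node only takes over one slice of the ring instead of remapping every key, which is what a plain `hash(key) % num_nodes` scheme would do.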
Similar systems: Google Bigtable, Azure Table, HBase, Amazon Dynamo