Managing services using Apache ZooKeeper

Datetime: 2017-04-11 05:41:19          Topic: ZooKeeper

Modern software solutions are groups of software components that work with each other. Fault tolerance, load balancing and recovery are some of the key design considerations for such components. It is assumed that components will crash, become unresponsive or get disconnected. Multiple instances of a component are run, so that even if one instance becomes unavailable, the system continues to work.

It is challenging to deploy such solutions and ensure that they run consistently. It is equally difficult to diagnose unexpected results because components are deployed on different servers. Keeping the configuration on all the servers in sync needs careful manual steps. Cloud and containers are intended to solve such problems. To effectively use these technologies, services need to be designed in a way that they can be discovered, monitored and configured seamlessly.

This post proposes a solution for such issues using Apache ZooKeeper. Other options, such as Consul or etcd, are also available; the principles remain largely the same.

Service Registry

A service registry makes it possible to address the above concerns. As an example, let us consider a load-balanced RESTful service. As soon as the RESTful service is ready to serve clients, it declares its availability and endpoint details to a central registry. In case the service crashes or becomes unavailable due to network issues, the registry must declare it as unavailable.

When a service is registered, it becomes visible to various client processes like other client services, support tools etc. The registry lets the client processes know of changes like availability / unavailability of a service or a particular service instance. A client component subscribes for information on services that it is interested in. This subscription mechanism makes it possible for the registry to let the client know of updates on services or service instances.

Registry cluster

A registry needs to keep checking that all the registered services are running. Similarly, the registered services need to keep checking that the registry is running fine. This is achieved by sending and receiving heartbeat messages (often referred to as pings). This way, both the service and the registry report to each other that they are alive. This is done at predefined (often negotiated) intervals.

If the registry becomes unavailable (due to crash / system issue / network issue etc.), the client needs a fallback option. Therefore, the registry is run as a cluster of instances continuously syncing with each other. The clients connect to any one of these instances. When the registry instance becomes unavailable, a client can easily connect to another available registry instance and be up-to-date with the current status of running services. That particular registry instance becomes responsible for sending and receiving pings and managing subscriptions from the client.

Imagine a situation where a registry instance gets disconnected from the rest of the cluster. It will not be able to synchronize with the other instances and will therefore have an incorrect picture of the availability of various services. The clients which have a subscription with such an instance will, in turn, have an incorrect picture as well. This situation is different from the registry instance becoming unavailable, because the registry instance appears to be running fine to its clients, and the clients appear to be running fine to the registry instance.

ZooKeeper uses the fail fast technique. When a ZooKeeper instance discovers that it is not in a position to synchronize with other instances, it kills itself. The clients then connect to other instances of ZooKeeper in the cluster and thus remain up-to-date.

Imagine a registry instance becoming corrupt or inconsistent because of a bug in the code. Such failures are difficult to find and fix. There are a few interesting algorithms which address a large class of such problems, most of them based on consensus. Thankfully, we do not have to implement these algorithms ourselves.

The kinds of failures described above are often presented to computer scientists in the form of the Byzantine Generals Problem. A very interesting article, The Byzantine Generals Problem, was written in the early 80s by Leslie Lamport, Robert Shostak, and Marshall Pease. It analyzes the problem and proposes solutions for computer systems.


Consensus is arrived at by checking with multiple instances of the registry. There are a few very interesting implementations of consensus algorithms. To truly appreciate the problem that consensus algorithms solve, I suggest reading Paxos Made Simple by Leslie Lamport.

ZooKeeper uses the Zab (ZooKeeper Atomic Broadcast) protocol for consensus. There is a leader which proposes, and followers listen to it. When a sufficient number of followers acknowledge, the leader considers the proposal as accepted (learned). The minimum number of instances required to acknowledge a proposal for it to be considered accepted is called a quorum: a majority of the ensemble. For example, if 7 instances are running, 4 form a quorum; at least 4 instances need to acknowledge a proposal. Typically an odd number of ZooKeeper instances is run in a cluster.
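The majority rule above is easy to state in code. A minimal sketch (not part of any ZooKeeper API, just the arithmetic):

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest majority of an ensemble: floor(n / 2) + 1."""
    return ensemble_size // 2 + 1

# An ensemble of 7 needs 4 acknowledgements; 5 needs 3; 3 needs 2.
print(quorum_size(7))  # 4
print(quorum_size(5))  # 3
```

This also shows why odd ensemble sizes are preferred: 5 and 6 instances both have a quorum of at least 3 votes... wait, 6 needs 4, so the sixth instance adds no extra failure tolerance over 5.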

The proposals are merely values to the leader and the followers; they do not understand the meaning of the proposals. All they do is store them and synchronize them between themselves. A proposal can be any value, something as simple as a text string such as “there is a new service instance” or “a particular instance is no longer available”. Another interesting kind of value is of the form “a certain set of services must use the following config values”.

When the RESTful service registers with ZooKeeper, the registration happens via the leader. Whereas, the clients of the RESTful service might have subscribed with a follower instance of ZooKeeper. When the follower receives the update of the RESTful service’s availability, it relays the information to its subscribers (the clients). Thus the client becomes aware of the RESTful service.

It could happen that the RESTful service is already up and declared available by the registry. Subsequently, when the client starts up, it can “discover” the already available RESTful service via the Service Registry (ZooKeeper in this case).

There are two main patterns used in service discovery: the client-side discovery pattern and the server-side discovery pattern. Depending upon the application’s need, either can be used. Server-side discovery needs an extra hop in the network and hence is avoided where latency is critical. The following links describe these patterns in more detail: client-side-discovery , server-side-discovery

Service Registration

ZooKeeper maintains a hierarchy of nodes, each of which can store data. These nodes and their data are saved on disk and maintained even if the complete ZooKeeper cluster is restarted.

ZooKeeper allows creation of a special kind of node, which is non-permanent, called an Ephemeral Node. These special nodes enable usage of ZooKeeper as a registry. While creating such a node, an attribute is specifically set as Ephemeral. To use this, a service needs to open a session with ZooKeeper and negotiate a heartbeat (ping) interval. If heartbeats stop, or if the ZooKeeper session is closed by the service, the Ephemeral Node is removed by ZooKeeper.

Any process which has subscribed for information on that node or its parent will come to know that the service is no longer available. The ZooKeeper API provides an easy way to create Ephemeral Nodes and subscribe for information (referred to as a watch). Changes are relayed to the client only if a watch is set on a node or its parent. A watch is ZooKeeper’s subscription mechanism.

Note that non-ephemeral nodes are not removed when a session is closed.
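To make the ephemeral / persistent distinction concrete, here is a toy in-memory sketch (not the real ZooKeeper client; all names are made up) of a registry that drops a session's ephemeral nodes when the session ends:

```python
class ToyRegistry:
    """In-memory stand-in for the node store, keyed by path."""

    def __init__(self):
        self.nodes = {}  # path -> (owner_session, is_ephemeral)

    def create(self, path, session, ephemeral=False):
        self.nodes[path] = (session, ephemeral)

    def close_session(self, session):
        # Only ephemeral nodes owned by the closing session are removed;
        # persistent nodes survive, as in ZooKeeper.
        self.nodes = {p: (s, e) for p, (s, e) in self.nodes.items()
                      if not (e and s == session)}

reg = ToyRegistry()
reg.create("/App/Service1", session="s1")                        # persistent
reg.create("/App/Service1/inst-1", session="s1", ephemeral=True)
reg.close_session("s1")
print(sorted(reg.nodes))  # ['/App/Service1']
```

The session close (or missed heartbeats) is what makes the instance node disappear; the service node itself stays.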

The nodes for a service can be created as a hierarchy. For example:

  • Application Name or Product or Domain ( non-ephemeral )
    • Service1 ( non-ephemeral )
      • Instance1 ( ephemeral )
      • Instance2 ( ephemeral )
    • Service2 etc…

To identify an instance node, ZooKeeper allows specifying a path to the node, for example:

/App/Service1/Instance1
Note that there are rules for naming a node: The ZooKeeper data model .

It is advisable to represent Instance1 and Instance2 as UUIDs instead of having a configured name or number. This enables unique identification of a service instance, with no possibility of multiple instances running with the same name or number at any point in time. The parent node represents the service name.

/App/Service1 (parent node, represents the service name)
/App/Service1/20b755cb-2397-4ae4-b5a7-dec28bd1881e (particular instance)
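Generating such an instance path with a fresh UUID is a one-liner. A sketch (the path layout follows the example above; the helper name is made up):

```python
import uuid

def instance_path(service_node: str) -> str:
    """Build a unique instance path under the service's parent node."""
    return f"{service_node}/{uuid.uuid4()}"

p = instance_path("/App/Service1")
print(p)  # e.g. /App/Service1/20b755cb-2397-4ae4-b5a7-dec28bd1881e
```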

Vertical partitioning

Often, “vertical” partitioning (data partitioning) is needed. This is particularly useful for trying out a new feature with a few “less-important” customers / consumers. Sometimes it is also used for increasing scalability of the application. To support this need, the hierarchy can be evolved to:

  • Application or Product or Domain
    • Service1
      • Consumer_category1 ( Cat1 )
        • Instance1
        • Instance2 etc…
      • Consumer_category2 etc…
    • Service2 etc…

We will evolve this hierarchy further to address some other needs.

Load Balancing and serving capacity

Some services are heavy on processing or I/O. They hold up resources while performing tasks. These kind of services can be load balanced, so that multiple instances on different servers can serve requests from clients simultaneously.

Thundering herd

If, for some reason, all the instances of a load-balanced service are down and in the process of booting up, then one of them will come online first and declare that the service is available. Instantaneously, all the clients waiting for this service will start firing requests. If the service is unable to cope with this load, it might go down or become unresponsive. It will fail to send heartbeats at the negotiated interval, leading to removal of its node in ZooKeeper. The clients will then stop firing requests and wait for the service to become available again. Subsequently, the next instance comes up and goes through the same sequence of events, and so on, until at least two or more instances declare themselves available at approximately the same time.

To handle this issue, availability of the service is declared only if a pre-configured minimum number of instances are available. All clients start firing requests only after that minimum number of service instances become available. The configuration can be stored in ZooKeeper so that all the participants use the same value. Service discovery and monitoring are necessary for effective load balancing.
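The “wait for a minimum number of instances” rule is a simple gate on the client side. A sketch (in practice, min_instances and the live instance list would both come from ZooKeeper; the names here are illustrative):

```python
def service_available(live_instances, min_instances):
    """Declare the service usable only once enough instances are up."""
    return len(live_instances) >= min_instances

instances = ["inst-a"]
print(service_available(instances, min_instances=2))  # False: hold requests
instances.append("inst-b")
print(service_available(instances, min_instances=2))  # True: start firing
```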

There are many interesting techniques used for load balancing. For someone who is building microservices and does not mind an extra hop in the network, the AWS ELB is a good case study. It uses the server-side discovery pattern: elastic-load-balancing

Server-side load balancing

Load balancers which are deployed between a client and the service wait for a minimum number of available service instances. The load balancer becomes a client to the Service Registry and observes the number of service instances before routing requests.

Client-side load balancing

A client needs to observe the number of available service instances. If it is equal to or more than a pre-configured number, it starts firing requests at the individual instances in a round-robin fashion. Client-side load balancing is preferred when latency matters, as it avoids the extra hop of the request via a server-side load balancer.
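The round-robin part is straightforward to sketch (instance names are made up; a real client would rebuild the cycle whenever a watch reports an instance change):

```python
import itertools

class RoundRobinBalancer:
    """Cycle through the currently known instances of a service."""

    def __init__(self, instances):
        self._cycle = itertools.cycle(instances)

    def next_instance(self):
        return next(self._cycle)

lb = RoundRobinBalancer(["inst-1", "inst-2", "inst-3"])
print([lb.next_instance() for _ in range(5)])
# ['inst-1', 'inst-2', 'inst-3', 'inst-1', 'inst-2']
```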

A simple round-robin technique works for most of the cases. For complex processes, there are many load-balancing algorithms to choose from: load-balancing-algorithms-techniques , best-practices-evaluating-elastic-load-balancing

Primary & Secondary instances

Sometimes a service is designed in such a way that only one instance serves clients while another runs as a hot backup. If the primary instance crashes, the secondary takes over and becomes the primary. Such services are sometimes referred to as Master-Slave or Leader-Follower. Whenever an instance is declared primary, it is said to have gained mastership. In such scenarios, vertical partitioning helps in scaling the solution.

To classify such services separately, the node hierarchy can be evolved as:

  • Your Application or Product or Domain
    • Mastership
      • Service1
        • Consumer_category1
          • Instance1
          • Instance2
        • Consumer_category2
      • Service2
    • LoadBalanced
      • Service3
        • Consumer_category1
          • Instance1
          • Instance2 etc…

The path to identify an instance would look like:

/App/Mastership/Service1/Cat1/<instance-uuid>
Ephemeral Nodes help to identify if a service instance is available. However, gaining mastership requires more than that. The system needs to identify a leader / master.

Leader election protocol

ZooKeeper allows creation of Sequence Nodes. ZooKeeper appends an auto-incremented number to the node name.

For example, if ZooKeeper is asked to create a Sequence Node with the following path:

/App/Mastership/Service1/Cat1/Instance-

then ZooKeeper creates the following path:

/App/Mastership/Service1/Cat1/Instance-0000000000

The appended number ‘0’ is incremented by 1 every time a Sequential Node is created under the same parent. For example:

/App/Mastership/Service1/Cat1/Instance-0000000001
/App/Mastership/Service1/Cat1/Instance-0000000002
As the sequence number is guaranteed to increase atomically, the node with the smallest number can be declared the leader. ZooKeeper maintains, per parent node, the largest number it has provided so far, and increments it by one whenever it needs to provide a new number.

The only catch: the counter is a signed 32-bit integer, so beyond 2147483647 (INT_MAX) it overflows.

Creating Ephemeral and Sequential Nodes for service instances allows us to choose a leader / master. This is effective for mastership services.
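Given the children of the parent node, the election itself reduces to picking the smallest sequence number. A sketch over names like those above (parsing logic is illustrative; real recipes also watch the next-smaller node to avoid herd effects):

```python
def elect_leader(children):
    """Leader is the child with the smallest appended sequence number."""
    return min(children, key=lambda name: int(name.rsplit("-", 1)[1]))

children = ["Instance-0000000007", "Instance-0000000003", "Instance-0000000011"]
print(elect_leader(children))  # Instance-0000000003
```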

There are many interesting recipes on ZooKeeper: ZooKeeper Recipes .

Apache Curator, another project built on ZooKeeper, has some more interesting recipes: Curator Recipes .

Central Data Store and Config management

ZooKeeper allows each node in the hierarchy to store some data, typically of the order of a few kilobytes (1 MB max). Hence, many deployments use ZooKeeper as a cache. It supports “compare and swap” functionality, which makes ZooKeeper a versatile tool for interesting cache implementations. The ‘version’ field in the Stat structure of a ZooKeeper node stores the number of changes to the data at that node. ( ZooKeeper Stat Structure , JavaDoc for the Stat Structure )

Compare and swap

By using Compare And Swap (CAS), applications can achieve atomicity and synchronization, and implement optimistic writes. A component C1 reads data from the central data store (ZooKeeper) and gets the version from the Stat structure of that node. It then performs tasks with that version of the data. Meanwhile, another component C2 updates the data, which increments the version in the central data store. When C1 completes its processing and tries to update the data, it provides the version it had read. ZooKeeper fails the update and lets C1 know of the version mismatch. C1 can then either re-read the data and redo the processing, or fail the original processing. This technique is called an optimistic write.
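The C1 / C2 interaction above can be sketched with an in-memory stand-in for the versioned node (a real client would pass the expected version to ZooKeeper's setData call; the class here is made up):

```python
class VersionedNode:
    """Stand-in for a ZooKeeper node with a Stat 'version' counter."""

    def __init__(self, data):
        self.data, self.version = data, 0

    def read(self):
        return self.data, self.version

    def set(self, data, expected_version):
        # Compare-and-swap: fail if someone updated the node meanwhile.
        if expected_version != self.version:
            raise RuntimeError("version mismatch")
        self.data, self.version = data, self.version + 1

node = VersionedNode("limit=10")
data, v = node.read()        # C1 reads data at version 0
node.set("limit=20", v)      # C2 updates first -> version becomes 1
try:
    node.set("limit=15", v)  # C1's stale write is rejected
except RuntimeError as e:
    print(e)                 # version mismatch
```

On the mismatch, C1 would typically re-read and retry; that loop is the optimistic write.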

Authentication and Authorization

A central data store needs to provide authentication and authorization. Every node has a set of associated ID-permission pairs called an Access Control List or ACL ( ZooKeeper ACL ). This enables read-only or read-write access for different components and users.

There are different kinds of ACL schemes provided by ZooKeeper, such as world, auth, digest and ip.
Applications / components usually do not update config values; they read the config during startup. Typically, configs are stored in files on the servers where the component is deployed, and whenever a config is changed, the config file needs to be updated on all the servers. Instead, configs can be stored as data at specific nodes in ZooKeeper. ZooKeeper saves a transaction log and a snapshot of the nodes, with their data, on disk. This enables ZooKeeper to recover all the data even if the entire cluster goes down.

A config-update-tool can be provided to update configs stored in ZooKeeper.

Application level configs

Some configs are values used for processing a request: limits, flags, default values and error messages are a few examples. A component can subscribe to nodes which store configs as data, using ZooKeeper Watches . If a config is updated using the config-update-tool, the component receives a notification of the change and reacts immediately. This is useful for enabling / disabling a feature or functionality, or as an emergency control.
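The react-to-update loop can be sketched with an in-memory stand-in for the watched node (not the real client API; note that ZooKeeper watches are one-shot, hence the re-registration in the callback):

```python
class ToyConfigNode:
    """In-memory stand-in for a watched ZooKeeper config node."""

    def __init__(self, data):
        self.data, self._watchers = data, []

    def get(self, watch=None):
        if watch:
            self._watchers.append(watch)  # one-shot watch registration
        return self.data

    def set(self, data):
        self.data = data
        watchers, self._watchers = self._watchers, []  # fire once, then clear
        for w in watchers:
            w()

current = {}

def on_change():
    # Re-read the config and re-register the watch, as a real client would.
    current["cfg"] = node.get(watch=on_change)

node = ToyConfigNode("feature=off")
current["cfg"] = node.get(watch=on_change)
node.set("feature=on")   # the config-update-tool writes a new value
print(current["cfg"])    # feature=on
```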

New features may be enabled only for a few consumers. A few other consumers may have a smaller limit. Such flags and limits can be stored as configs in the consumer category nodes. This can be achieved as suggested below:

/App/Config/Service1 (store config as data at this node)
/App/Config/Service1/Cat1 (store config overrides here)

Configs are typically stored in the following file formats: .properties, .xml, .ini, .json. These files can be uploaded as data at specific nodes.

It is tempting to store only the overridden config values at the consumer category nodes, keeping the configs common to all consumers at the service node. This complication is best avoided, as it leads to support errors and incorrect expectations. A better, more manageable technique is to have all the needed configs at one node: a component instance with overridden configs can then find everything it needs at its consumer category node.
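A config-update tool can compute the fully merged config for each consumer category before writing it to the category node. A sketch (key names are made up):

```python
def merged_config(base: dict, overrides: dict) -> dict:
    """Full config for a category node: service defaults plus overrides."""
    merged = dict(base)
    merged.update(overrides)
    return merged

service_defaults = {"max_items": 100, "new_feature": False}
cat1_overrides = {"new_feature": True}
print(merged_config(service_defaults, cat1_overrides))
# {'max_items': 100, 'new_feature': True}
```

The merge happens once, in the tool; components simply read one complete config from their category node.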

Infrastructure level configs

Configs for database connections / middleware / the ZooKeeper cluster itself rarely change. I prefer calling these infrastructure level configs. They can be stored in files on the servers.

Service Discovery

Some software solutions make use of middleware like a message bus. Clients typically fire a request at a configured broker queue / topic. If the service is unavailable, the client waits until timeout, holding up resources on the system. Hence, service discovery and monitoring become important. If a service is unavailable, the client can take appropriate action, like logging a CRIT level message or proceeding with a default response from the service.

A RESTful service must also expose the service endpoint with details such as host, port and the representations it supports. These details can be stored at the instance node of a service. Note that an instance node is Ephemeral. Hence, as desired, the endpoint data and the node get removed in ZooKeeper when the service goes down.

For example, consider a service instance represented by the path:

/App/LoadBalanced/Service1/Cat1/51740b51-7c48-44e9-842d-78c77eb015cf
Endpoint data for the service instance node above:

{
  "Endpoints": [
    {
      "EndpointType": "REST",
      "Server": "",
      "Port": 9080,
      "Representations": [
        {
          "Representation": "/library/books",
          "AllowedMethods": "GET"
        },
        {
          "Representation": "/library/client",
          "AllowedMethods": "GET,PUT,DELETE,POST"
        }, etc ...
      ]
    },
    {
      "EndpointType": "MessageBus",
      "BrokerURL": "<broker_url>",
      "BrokerPort": 9876,
      "Topic": "Service1",
      "TopicMessageSelector": "Cat1",
      "InstanceSpecificTopic": "Service1/51740b51-7c48-44e9-842d-78c77eb015cf"
    }
  ]
}

Instance specific topic

Routing a request to a particular instance of a service is very helpful for supporting the component. Tools can be built to exploit this feature and extract server status or other diagnostics on a live environment. For solutions using message bus, this can be achieved by having a special topic specific to that service instance.

RESTful services expose endpoint data anyway, hence no special configuration is needed.


A strategy for managing configs and services through a service registry must be decided before starting development. Developers must incorporate the service registry and config store in every component. This makes the solution easier to support, maintain and deploy.