This is the second in a two-part blog post outlining our work at Moz on RogerOS, our next-generation infrastructure platform.
The first part gives the background and provides an overview of the system. This post goes deeper into the implementation and provides the juicy technical details. This blog post comes to you courtesy of Ankan Mukherjee and Arunabha Ghosh. We hope you enjoy reading about it as much as we’ve enjoyed working on it.
In the first part we talked at length about why we set out to build RogerOS, the background that set us on this path, and the lofty design goals we had in mind. We also provided a ClusterOS architecture diagram detailing the various subsystems that work together to form RogerOS. Let’s now get into more details on these fronts.
In the first part we summarized our overall design goals. Here are more details on some of the salient technical design goals –
Developer-oriented abstractions
Traditionally, the interfaces and abstractions offered by infrastructure systems have been heavily focused on the operational side. Virtualization systems, for example, provide a great deal of operational flexibility, yet they offer the same fundamental abstraction as a physical machine. Most developers, on the other hand, intuitively think in terms of different abstractions: processes, jobs, services, and so on. To a large degree, they care more about the fact that their code runs and is reachable than about the particulars of which machine it runs on. As software systems become larger, more distributed, and more decoupled, the implications of this abstraction gap become more and more apparent. The growing capabilities and complexity of orchestration systems like Puppet and Ansible are a visible manifestation of this widening gap and an attempt to bridge it.
One of our core design principles was to provide developers with the abstractions they needed and used daily. This undoubtedly increased the complexity, since it was now up to the system to make the necessary translations into the underlying machine abstractions, but in hindsight it was the right choice.
Completely automated lifecycle management
As an immediate consequence of the previous goal, we wanted to provide as much automation as possible: not just orchestration, but complete lifecycle management. This was also necessary for some of the other goals, like providing higher resiliency.
Highly resilient, no single point of failure
Given that a lot of developers would depend on the system and that failures are a regular feature at scale, we explicitly wanted ample redundancy and no single points of failure.
High degree of automation and self healing
As mentioned above, at scale failures are an expected feature rather than an anomaly. The expectation was that we would be able to handle most routine failures, like machine and process failures, with no manual intervention.
Ability to run multiple workloads
As mentioned before, any solution we picked had to cater not only to new development but also to existing applications. Specifically, we wanted to handle application workloads using containers, legacy applications using virtual machines, recurring ‘cron’ jobs, and large scale batch workloads (Spark et al.)
Efficiency
While not the primary concern, efficiency is still an important criterion for a system like RogerOS. Indeed, external evidence from systems like Borg indicates that significant cost reductions can be gained from such systems (Google estimates the savings from efficiency gains to be on the order of an entire data center!)
Ease of operation
Last but not least, we wanted the system to be easy to operate. Some of the design criteria for this were –
- Integrated monitoring and resource tracking
- Procurement friendly
- Ability to use existing resources (no need to buy new hardware to get the system going)
- Easy to add resources with minimal or no disruption
In the first part we showed an architecture diagram that provides an overview of the various systems that form the ClusterOS.
As stated earlier, the core of RogerOS is Apache Mesos. Mesos corresponds to the ‘ClusterOS Kernel’ in the architecture diagram mentioned above.
The Core Components
We now describe the core components that make up RogerOS.
Apache ZooKeeper is a centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services. Mesos uses ZooKeeper for master leader election, for coordinating activity across the cluster, and for letting slaves join the cluster. Some of our own internal apps running on RogerOS also use ZooKeeper for distributed locking and process coordination.
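As an illustration, here is a minimal sketch of that kind of distributed locking using the kazoo Python client; the ensemble address and lock path are placeholders, not our actual setup:

```python
from kazoo.client import KazooClient

# Connect to the ZooKeeper ensemble (placeholder address).
zk = KazooClient(hosts="zk1.example.com:2181")
zk.start()

# Take a cluster-wide lock so only one worker runs this job at a time.
lock = zk.Lock("/rogeros/locks/nightly-job", identifier="worker-1")
with lock:  # blocks until the lock is acquired
    print("doing work that must not run concurrently")

zk.stop()
```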
We run Mesos masters in high availability mode, with three or five nodes that use ZooKeeper to elect a single active leader. We run Mesos slaves with the docker containerizer enabled. Mesos slaves are the worker nodes; they continually report to the Mesos master the resources they have available for running tasks. Mesos frameworks (see below) talk to the master to get these resource details and decide which slaves to run tasks on.
Mesos frameworks provide a higher level of abstraction on top of Mesos and offer features that make it easy to manage what’s running on it. The ability to create your own frameworks also makes the system highly extensible.
We started with the Marathon framework, which runs quite a few of our tasks at present. Marathon provides the ability to start and manage long-running applications on Mesos and corresponds to the Job/Task Schedulers in the architecture diagram mentioned above. If you think of Mesos as the kernel of this ClusterOS, Marathon is its upstart/init daemon.
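To make the abstraction concrete, here is a hypothetical example of launching a long-running Docker app through Marathon’s REST API (the Marathon URL, app id, and image are placeholders):

```python
import requests

# A minimal Marathon app definition: two instances of a Docker image,
# each with a quarter of a CPU and 128 MB of memory.
app = {
    "id": "/demo/hello-service",  # placeholder app id
    "container": {
        "type": "DOCKER",
        "docker": {"image": "example/hello-service:1.0", "network": "BRIDGE"},
    },
    "cpus": 0.25,
    "mem": 128,
    "instances": 2,
}

resp = requests.post("http://marathon.example.com:8080/v2/apps", json=app)
resp.raise_for_status()  # Marathon now keeps two instances running on the cluster
```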
As we moved forward, we needed to schedule periodic tasks and run one-off jobs, which led us to add Chronos. Chronos also corresponds to the Job/Task Schedulers and is essentially distributed cron. We have a few backup jobs running on Chronos, and new jobs are being added.
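A Chronos job is declared similarly; here is a hypothetical daily backup job posted to Chronos’ REST API (the URL, command, and owner are placeholders):

```python
import requests

job = {
    "name": "nightly-backup",               # placeholder job name
    "command": "/usr/local/bin/backup.sh",  # placeholder command
    # ISO 8601 repeating interval: repeat forever (R), starting at the
    # given instant, with a period of one day (P1D).
    "schedule": "R/2015-11-01T02:00:00Z/P1D",
    "owner": "team@example.com",
}

resp = requests.post("http://chronos.example.com:4400/scheduler/iso8601", json=job)
resp.raise_for_status()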
We are also considering getting other frameworks up and running in the future, such as Apache Aurora (which supports both long-running services and distributed cron jobs) and Apache Spark (for large scale data processing).
Load Balancer & DNS Server
We use roger-bamboo, which, along with haproxy, acts as the service discovery and distributed load balancing system for RogerOS. It is based on bamboo and adds several features, including flexible TCP and HTTP port specification, HTTP web path specification, and more.
Each node on RogerOS runs roger-bamboo and haproxy. Any change on RogerOS triggers an update to bamboo, which in turn updates and reloads the haproxy configuration. This essentially provides configuration-free, automatic load balancing across all instances of an application running on the cluster.
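To illustrate the mechanism (this is a simplified sketch, not roger-bamboo’s actual code), the core loop of such a system reads the current tasks from Marathon, renders an haproxy configuration, and reloads haproxy only when something changed:

```python
import subprocess
import requests

MARATHON = "http://marathon.example.com:8080"  # placeholder URL
HAPROXY_CFG = "/etc/haproxy/haproxy.cfg"

def render_config(app_id="/demo/hello-service", listen_port=10000):
    """Render one haproxy 'listen' block with a server line per running task."""
    tasks = requests.get(MARATHON + "/v2/tasks",
                         headers={"Accept": "application/json"}).json()["tasks"]
    lines = ["listen demo", "    bind *:%d" % listen_port, "    mode tcp"]
    for i, task in enumerate(t for t in tasks if t["appId"] == app_id):
        lines.append("    server task-%d %s:%d" % (i, task["host"], task["ports"][0]))
    return "\n".join(lines) + "\n"

def update_haproxy():
    new_cfg = render_config()
    try:
        old_cfg = open(HAPROXY_CFG).read()
    except IOError:
        old_cfg = ""
    if new_cfg != old_cfg:  # rewrite and reload only on an actual change
        with open(HAPROXY_CFG, "w") as f:
            f.write(new_cfg)
        subprocess.check_call(["service", "haproxy", "reload"])
```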
We also use maradns as a DNS server to provide common, simple names on RogerOS.
The following diagram shows an overall layout of the core components in action.
The Monitoring Stack
We realized that integrated monitoring is essential for this system to succeed. It corresponds to the Monitoring and Log Aggregation components in the architecture diagram mentioned above. We need to monitor both the variety of components that constitute the system and the tasks and services that run on it.
Our monitoring system is based on the following principles –
- Monitoring data should be easy to collect, store and query.
- Creating dashboards and graphs from this data should take minimal effort.
- Alerts should be easy to create and take action on.
- The underlying query semantics should be the same so that it’s easy to manage and maintain.
- It should be possible to query and alert across metrics and components when required.
To achieve this end we use the following components –
- We collect system metrics using collectd. We also use collectd plugins to collect metrics from the core components described above (mesos, marathon, zookeeper, haproxy, bamboo, etc.). These run on every node of RogerOS.
- In some cases we also run dockerized collectd instances on RogerOS itself to collect metrics from applications running on RogerOS.
- For custom application metrics, we provide a statsd interface that applications can send metrics data to.
- We use InfluxDB, an open source distributed time series database, to store all our monitoring metrics (see the query sketch just after this list).
- Grafana, which has built-in support for InfluxDB, provides the ability to create dashboards.
- For alerting, we wrote our own daemon that runs InfluxDB queries, integrates with PagerDuty, and provides the ability to create alerts.
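For example, the same query that might back a Grafana panel can be run with the influxdb Python client (the host, database, and measurement names are placeholders):

```python
from influxdb import InfluxDBClient

client = InfluxDBClient(host="influxdb.example.com", port=8086,
                        database="rogeros_metrics")

# Average CPU usage per host over the last hour, in five-minute buckets.
result = client.query(
    "SELECT mean(value) FROM cpu_usage "
    "WHERE time > now() - 1h GROUP BY time(5m), host"
)
for point in result.get_points():
    print(point)
```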
We monitor all machines that are part of RogerOS for CPU usage, memory usage, network usage, disk space, etc., and have appropriate alerts set up. We also monitor our own internal applications, services, and daemons, with alerts set up as required.
- We automatically collect metrics like CPU usage, memory usage, and network usage for all docker containers running on the system.
- We also have dashboards that chart these metrics automatically, without the need for any configuration or manual effort.
- In addition, apps on RogerOS may send custom metrics using any statsd client (see the sketch just after this list). Our statsd interface is via telegraf (with the statsd service plugin for influxdb), which adds the ability to tag metrics and makes them easier to store and query with influxdb.
- Although we set up the monitoring stack for RogerOS, there is no requirement that the system being monitored run on, or be a part of, RogerOS. This essentially means that we can support the common monitoring needs of applications across the company.
- Needless to say, custom dashboards can be created using any of the metrics collected above.
- Application logs from docker can be sent to a logstash setup that is already available across the company.
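Sending a custom metric is a one-liner with any statsd client. Here is a hypothetical example using the `statsd` Python package; the telegraf host and metric names are placeholders:

```python
import statsd

# telegraf's statsd listener is a drop-in statsd endpoint (placeholder host/port).
client = statsd.StatsClient("telegraf.example.com", 8125, prefix="hello-service")

client.incr("requests")           # counter: one more request served
client.timing("db.query_ms", 42)  # timer: a query that took 42 ms
client.gauge("queue.depth", 7)    # gauge: current queue depth
```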
Setting up Alerts
- We have a custom alerter that integrates with PagerDuty and allows alerts to be set up on any metric (a minimal sketch follows this list).
- Alerting and charting use the same underlying query semantics, so if you can chart something, you can alert on it.
- Common alert conditions, like less than and greater than, are available on any influxdb aggregation query.
- Additional alert types, like checking for presence/absence and consistency, are also available.
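Our alerter daemon is internal, but its core idea can be sketched in a few lines: run an InfluxDB aggregation query, compare the result to a threshold, and trigger a PagerDuty incident through its generic events API. The query, threshold, and service key below are all placeholders:

```python
import requests
from influxdb import InfluxDBClient

client = InfluxDBClient(host="influxdb.example.com", database="rogeros_metrics")

def check_disk_free():
    # Placeholder aggregation query: lowest free-disk percentage in the last 5 minutes.
    result = client.query(
        "SELECT min(value) FROM disk_free_percent WHERE time > now() - 5m")
    points = list(result.get_points())
    if points and points[0]["min"] < 10:  # alert threshold: under 10% free
        # Trigger an incident via PagerDuty's generic events API.
        requests.post(
            "https://events.pagerduty.com/generic/2010-04-15/create_event.json",
            json={
                "service_key": "PLACEHOLDER_PAGERDUTY_KEY",
                "event_type": "trigger",
                "description": "Disk free below 10% on a RogerOS node",
            })
```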
Monitoring the Monitor
“Quis custodiet ipsos custodes?”, you ask. We use deadmanssnitch to ensure that our alerter is up, running, and healthy.
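Dead Man’s Snitch inverts the usual alerting model: the alerter checks in periodically, and if the check-ins stop, we get paged. The check-in itself is just an HTTP request from the alerter’s main loop (the snitch token is a placeholder):

```python
import requests

# Periodic check-in; if this stops arriving, Dead Man's Snitch alerts us.
requests.get("https://nosnch.in/PLACEHOLDER_TOKEN")
```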
The following diagram provides an overview of the monitoring system we have.
No matter how good a system is, it’s of little use if it isn’t easy to use. We recently started building a command line interface (CLI) with this in mind. We have the following goals for the CLI, which is targeted at developers as its primary users –
- Make it absolutely painless to get code from source control to deployment.
- Create a standardized way of declaring application configuration and create commands that are similar to tools that developers already use.
- Avoid abstracting out or hiding tools and features that are already available. Instead utilize what’s already available and add tools that enhance the overall experience.
- Rapidly develop and release features, and actively act on feedback from developers.
We use a combination of vagrant, docker, nodejs, python, jinja2 (and more), and currently have the following set of commands (a hypothetical session follows the list) –
- init – Creates basic configuration and templates to get started.
- deploy – This is a one-stop shop. It gets code from the github repository, builds it, pushes it to the docker registry, creates configuration files for the target environment, and then pushes it onto RogerOS – all in one go.
- git-pull – Pulls code from a github repo.
- build – Builds the application and optionally pushes it to the docker registry (if applicable).
- push – Pushes the application into RogerOS.
- shell – Sets up an interactive bash session to a task running on RogerOS.
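For a feel of the workflow, a session might look like the following (the `roger` command name and arguments here are hypothetical illustrations, not the actual interface):

```
$ roger init                 # scaffold configuration and templates for a new app
$ roger build my-app         # build the app and push the image to the docker registry
$ roger deploy my-app stage  # pull, build, configure, and push to RogerOS in one go
$ roger shell my-app         # open an interactive bash session into a running task
```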
This is fairly new work and still in progress. We plan to add more features as we move forward.
As can be imagined from the design details, RogerOS is a fairly complex system and has been under development for quite a while. Today, a small cluster of 8 (albeit beefy) machines runs a wide variety of applications for Moz –
- 8 nsqd instances along with nsqlookupd and nsqadmin instances
- 2 mongodb replica sets with 3 replicas each
- A few redis instances
- About 20 different apps with numerous (over 150) instances that support Moz Content
- Over 70 instances of various apps that support products that are either in beta or are launching in the future (hush hush!!)
- Numerous tasks that are used internally by RogerOS.
Developer response to the system has also been very positive. Moz Content was developed from scratch on RogerOS, and the development team has been extremely satisfied with the platform. In particular, by abstracting away the hardware and providing higher level services, the system freed the team to focus almost exclusively on the product. And not just new products: parts of existing legacy products have been successfully run on the system as well. We’re looking forward to building on this and adding more features and tools going forward.
Overall, RogerOS has been a very ambitious effort at Moz and we are excited about the progress that we have made so far. We are even more excited about the future.
Staying true to our TAGFEE code, we’re happy to announce that we plan to open source most of RogerOS. We’re starting by open sourcing the core installers and Roger’s internal load balancer, but this is just the beginning; we hope to build a vibrant community around the effort.
Last but not least, we would like to thank the following people for their contributions to the effort so far – Jord Sonneveld, Manish Ranjan, Amit Bose, Chris Whitten, Josh Gummersal, Evan Battaglia, Tyler Murray, and the Moz Content team. RogerOS would not have seen the light of day without them.
If you enjoyed these posts, please check out the following talks by us at MesosCon 2015!