At GitHub we use MySQL as our main datastore. While repository data lives in git, metadata is stored in MySQL. This includes Issues, Pull Requests, Comments, etc. We also authenticate against MySQL via a custom git proxy (babeld). To be able to serve under the high load GitHub operates at, we use MySQL replication to scale out read load.
We have different clusters providing different types of services, but the single-writer-multiple-readers design applies to them all. Depending on traffic growth, application demand, operational tasks, or other constraints, we take replicas in and out of our pools. Depending on workload, some replicas may lag more than others.
Displaying up-to-date data is important. We have tooling that helps us keep replication lag at a minimum, and typically it doesn’t exceed 1 second. However, lag does sometimes happen, and when it does, we want to put aside the lagging replicas, let them catch their breath, and avoid sending traffic their way until they have caught up.
We set out to create a self-managing topology that will exclude lagging replicas automatically, handle disasters gracefully, and yet allow for complete human control and visibility.
HAProxy for load balancing replicas
We use HAProxy for various tasks at GitHub. Among others, we use it to load balance our MySQL replicas. Our applications connect to HAProxy servers at :3306 and are routed to replicas that can serve read requests. Exactly what makes a replica able to “serve read requests” is the topic of this post.
MySQL load balancing via HAProxy is commonly used, but we wanted to tackle a few operational and availability concerns:
- Can we automate exclusion and inclusion of backend servers based on replication status?
- Can we automate exclusion and inclusion of backend servers based on server role?
- How can we react to a scenario where too many servers are excluded, and we are only left with one or two “good” replicas?
- Can we always serve?
- How easy would it be to override pool membership manually?
- Will our solution survive a service haproxy reload/restart?
With these criteria in mind, the standard mysql-check commonly used in HAProxy-MySQL load balancing will not suffice. This simple check merely tests whether a MySQL server is alive; it provides no insight into the server’s internal replication state (lagging/broken) or its operational state (maintenance/ETL/backup jobs, etc.).
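For comparison, a mysql-check based backend might look like the sketch below (haproxy_check is a hypothetical monitoring account that would have to exist on every replica). The check confirms only that MySQL accepts connections, nothing more:

```
backend mysql_ro_main
  balance roundrobin
  option mysql-check user haproxy_check
  server my-db-0001 my-db-0001.heliumcarbon.com:3306 check
  server my-db-0002 my-db-0002.heliumcarbon.com:3306 check
```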
Instead, we make our HAProxy pools context aware. We let the backend MySQL hosts make an informed decision: “should I be included in a pool or should I not?”
Context aware MySQL pools
In its simplest form, context awareness begins with asking the MySQL backend replica: “are you lagging?” We will reach far beyond that, but let’s begin by describing this commonly used setup.
In this situation, HAProxy no longer uses a mysql-check but rather an http-check. The MySQL backend server provides an HTTP interface, responding with HTTP 200 or HTTP 503 depending on replication lag. HAProxy will interpret these as “good” (UP) or “bad” (DOWN), respectively. On the HAProxy side, it looks like this:
```
backend mysql_ro_main
  option httpchk GET /
  balance roundrobin
  retries 1
  timeout connect 1000
  timeout check 300
  timeout server 86400000
  default-server port 9876 fall 2 inter 5000 rise 1 downinter 5000 on-marked-down shutdown-sessions weight 10
  server my-db-0001 my-db-0001.heliumcarbon.com:3306 check
  server my-db-0002 my-db-0002.heliumcarbon.com:3306 check
  server my-db-0003 my-db-0003.heliumcarbon.com:3306 check
  server my-db-0004 my-db-0004.heliumcarbon.com:3306 check
  server my-db-0005 my-db-0005.heliumcarbon.com:3306 check
  server my-db-0006 my-db-0006.heliumcarbon.com:3306 check
```
The backend servers need to provide an HTTP service on :9876. That service connects to MySQL, checks for replication lag, and responds with HTTP 200 (lag <= 5s) or HTTP 503 (lag > 5s or replication is broken).
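For illustration, probing such a service by hand might look like this (hostnames and responses shown here are illustrative):

```
$ curl -si http://my-db-0002.heliumcarbon.com:9876/
HTTP/1.0 200 OK

$ curl -si http://my-db-0005.heliumcarbon.com:9876/
HTTP/1.0 503 Service Unavailable
```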
This commonly used setup automatically excludes or includes backend servers based on replication status. If the server is lagging, the specialized HTTP service reports 503, which HAProxy interprets as DOWN, and the server will not serve traffic until it recovers.
But what happens when two, three, or four replicas are lagging? We are left with less and less serving capacity, and the remaining replicas receive two or three times more traffic than they are used to. If this happens, they might succumb to the load and start lagging as well, and the solution above might not be able to handle an entire fleet of lagging replicas.
What’s more, some of our replicas have special roles. Each cluster has a node running continuous logical or physical backups; other nodes might serve a purely analytical workload or be partially weighted to verify a newer MySQL version.
In the past, we would update the HAProxy config file with the list of servers as they came and went. As we grew in volume and in number of servers this became an operational overhead. We’d rather take a more dynamic approach that provides increased flexibility.
We may perform a MySQL master failover. This may be a planned operation (e.g. upgrading to the latest release) or an unplanned one (e.g. automated failover on hardware failure). The new master must be excluded from the read pool. The old master, if available, may now serve reads. Again, we wish to avoid having to update HAProxy’s configuration with these changes.
Static HAProxy configuration, dynamic decisions
In our current setup the HAProxy configuration does not regularly change. It may change when we introduce new hardware now and then, but otherwise it is static, and HAProxy reacts to ongoing instructions from the backend servers telling it:
- I’m good to participate in a pool (HTTP 200, UP)
- I’m in a bad state; don’t send traffic my way (HTTP 503, DOWN)
- I’m in maintenance mode; no error on my side, but don’t send traffic my way (HTTP 404, NOLB)
The HAProxy config file lists each and every known server. The list includes the backup server, the analytics server, and even the master. The backend servers themselves tell HAProxy whether they wish to participate in taking read traffic or not.
Before showing you how to implement this, let’s consider availability.
Graceful failover of pools
HAProxy supports multiple backend pools per frontend, and provides Access Control Lists (ACLs). ACLs often use incoming connection data (headers, cookies, etc.) but are also able to observe backend status.
The scheme is to define two (or more) backend pools:
- The first (“main”/”normal”) pool consists of replicas with acceptable lag that are able to serve traffic, as above
- The second (“backup”) pool consists of valid replicas which are allowed to be lagging
We use an acl that observes the number of available servers in our main backend. We then set a rule to use the backup pool if that number drops too low:

```
frontend mysql_ro
  ...
  acl mysql_not_enough_capacity nbsrv(mysql_ro_main) lt 3
  use_backend mysql_ro_backup if mysql_not_enough_capacity
  default_backend mysql_ro_main
```
In the example above we choose to switch to the mysql_ro_backup pool when left with fewer than three active hosts in our mysql_ro_main pool. We’d rather serve stale data than stop serving altogether. Of course, by this time our alerting system will have alerted us to the situation and we will already be looking into the source of the problem.
Remember that it’s not HAProxy that makes the decision “who’s in and who’s out” but the backend server itself. To that effect, HAProxy sends a check hint to the server. We choose to send the hint in the form of a URI, as this makes for readable, clear code:
```
backend mysql_ro_main
  option httpchk GET /check-lag
  http-check disable-on-404
  balance roundrobin
  retries 1
  timeout connect 1000
  timeout check 300
  timeout server 86400000
  default-server port 9876 fall 2 inter 5000 rise 1 downinter 5000 on-marked-down shutdown-sessions weight 10
  server my-db-0001 my-db-0001.heliumcarbon.com:3306 check
  server my-db-0002 my-db-0002.heliumcarbon.com:3306 check
  server my-db-0003 my-db-0003.heliumcarbon.com:3306 check
  server my-db-0004 my-db-0004.heliumcarbon.com:3306 check
  server my-db-0005 my-db-0005.heliumcarbon.com:3306 check
  server my-db-0006 my-db-0006.heliumcarbon.com:3306 check

backend mysql_ro_backup
  option httpchk GET /ignore-lag
  http-check disable-on-404
  balance roundrobin
  retries 1
  timeout connect 1000
  timeout check 300
  timeout server 86400000
  default-server port 9876 fall 2 inter 10000 rise 1 downinter 10000 on-marked-down shutdown-sessions weight 10
  server my-db-0001 my-db-0001.heliumcarbon.com:3306 check
  server my-db-0002 my-db-0002.heliumcarbon.com:3306 check
  server my-db-0003 my-db-0003.heliumcarbon.com:3306 check
  server my-db-0004 my-db-0004.heliumcarbon.com:3306 check
  server my-db-0005 my-db-0005.heliumcarbon.com:3306 check
  server my-db-0006 my-db-0006.heliumcarbon.com:3306 check
```
Both backend pools list the exact same servers. The major difference between the pools is the check URI: option httpchk GET /check-lag for the main pool versus option httpchk GET /ignore-lag for the backup pool.
As the URI suggests, the first (main) pool is looking for backend servers that do not lag (and we also wish to exclude the master, the backup server, etc.). The backup pool is happy to take servers that actually do lag, but it still wishes to exclude the master and other special servers.
HAProxy’s behavior is to use the main pool for as long as at least three replicas are happy to serve data. If two replicas or fewer are in good shape, HAProxy switches to the backup pool, where we re-introduce the lagging replicas: serving more stale data, but still serving.
Also noteworthy in the above is http-check disable-on-404, which puts a server responding with HTTP 404 into NOLB state. We will discuss this in more detail soon.
Implementing the HTTP service
Any HTTP service implementation will do. At GitHub, we commonly use Ruby scripts, which integrate well with our ChatOps. We also have many reliable shell building blocks, and our current solution is a shell-oriented service built around xinetd, which makes it easy to “speak HTTP” via shell. A simplified setup looks like this:
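(A sketch only: the service name and user below are illustrative assumptions; the port and handler path match the discussion that follows.)

```
service mysql-check
{
        type            = UNLISTED
        port            = 9876
        socket_type     = stream
        protocol        = tcp
        wait            = no
        user            = nobody
        server          = /path/to/scripts/xinetd-mysql
        disable         = no
}
```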
In the above, we’re particularly interested in the fact that xinetd serves on :9876 and calls upon /path/to/scripts/xinetd-mysql to respond to HAProxy’s check requests.
Implementing the check script
The xinetd-mysql script routes the request to an appropriate handler. Recall that we asked HAProxy to hint per check. The hint URI, such as /check-lag, is intercepted by xinetd-mysql, which then invokes a dedicated handler for that check. Thus, we have different handlers for /check-lag, /ignore-lag, and so on.
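To make the flow concrete, here is a minimal sketch of such a dispatcher (not the actual script; paths and handler locations are illustrative):

```
#!/bin/bash
# Sketch of an xinetd-mysql style dispatcher; paths are illustrative.

# xinetd (wait = no) hands us the client socket on stdin/stdout:
# read the HTTP request line, e.g. "GET /check-lag HTTP/1.0".
read -r method uri protocol

# Map the URI to a handler script: /check-lag -> check-lag, /ignore-lag -> ignore-lag.
handler="/path/to/scripts/check-handlers/$(basename "${uri}")"

if [ -x "${handler}" ]; then
  # The handler prints a complete HTTP response (status line, headers, body).
  exec "${handler}"
else
  printf 'HTTP/1.0 404 Not Found\r\nContent-Length: 0\r\n\r\n'
fi
```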
The real magic happens when running this handler script. This is where the server makes the decision: “Should I be included in the read-pool or not?” The script bases its decision on the following factors (a sketch of such a handler follows this list):
- Did a human explicitly ask for this server to be included/excluded? This is just a matter of touching a file.
- Is this the master server? A backup server? Something else? The server happens to know its own role via service discovery or even via puppet. We check for a hint file.
- Is MySQL lagging? Is it alive at all? This (finally) executes a self-check on MySQL. For lag we use a heartbeat mechanism, but your mileage may vary.
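Putting those factors together, a /check-lag handler might look roughly like this sketch (file locations, the 5-second threshold, and the heartbeat query are illustrative assumptions):

```
#!/bin/bash
# Sketch of a /check-lag handler; file locations, the threshold and the
# heartbeat query are illustrative assumptions.

respond() {   # respond "<status line>" "<body>" and exit
  printf 'HTTP/1.0 %s\r\nContent-Length: %d\r\n\r\n%s' "$1" "${#2}" "$2"
  exit 0
}

# 1. Human override: a touched file explicitly pulls this host out of the pools.
[ -f /etc/mysql-checks/force-disable ] && respond "404 Not Found" "disabled by operator"

# 2. Server role: the master, backup and analytics hosts opt out of read traffic.
role=$(cat /etc/mysql-checks/role 2>/dev/null)
case "${role}" in
  master|backup|analytics) respond "404 Not Found" "role=${role}" ;;
esac

# 3. Replication health: read lag from a heartbeat table; fail closed on any error.
lag=$(mysql --defaults-file=/etc/mysql-checks/check.cnf -NBe \
  "SELECT TIMESTAMPDIFF(SECOND, ts, NOW()) FROM meta.heartbeat" 2>/dev/null)

if ! [ "${lag}" -ge 0 ] 2>/dev/null; then
  respond "503 Service Unavailable" "cannot read heartbeat"
elif [ "${lag}" -gt 5 ]; then
  respond "503 Service Unavailable" "lag=${lag}s"
fi

respond "200 OK" "lag=${lag}s"
```

An /ignore-lag handler would look much the same minus the lag comparison: it still honors the override file and the server role, which is why the backup pool also excludes the master and other special servers.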
Our shell implementation means we do not use persistent MySQL connections; each check generates a new connection on the backend server. While this seems wasteful, the rate of incoming check requests is not high, and negligible at the scale of our busy servers. Furthermore, this better serves our trust in the system: a hogged server may be able to serve existing connections but refuse new ones; we’re happy to catch this scenario.
404 or error?
Servers that just don’t want to participate send a 404, causing them to go NOLB. Lagging, broken, or dead replicas send a 503. This makes it easier on our alerting system and makes it clearer when we have a problem.
One outstanding issue is that HAProxy never transitions a server from NOLB straight to DOWN; the automaton requires it to first go UP. This is not an integrity problem, but it causes more alerting. We work around it by cross checking servers and refreshing if need be. This is a rare situation for us and thus of no significant concern.
This small-building-blocks design permits simple unit testing. Control and visibility are easily gained: disabling and enabling servers is a matter of creating a file, whether touched by a human or implied by server role.
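For example, with the hypothetical file location used in the handler sketch above, an operator (or a ChatOps command) can pull a host out of the pools and put it back like so:

```
touch /etc/mysql-checks/force-disable   # next check returns 404; the server goes NOLB
rm /etc/mysql-checks/force-disable      # the server rejoins the pool on its next passing check
```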
These scripts integrate well within our chatops. We are able to see the exact response HAProxy sees via simple chatops commands:
.mysql xinetd my-db-0004 /check-lag
200 ; OK
Or we can interfere and force backends in/out the pools:
.mysqlproxy host force-disable my-db-0004
Host my-db-0004 disabled by @shlomi-noach at 2016-06-30 15:22:07
.mysqlproxy host restore my-db-0004
Host my-db-0004 restored to normal state
We have specialized monitoring for these HAProxy boxes, but we don’t wish to be notified if a single replica starts to lag. Rather, we’re interested in the bigger picture: a summary of the total errors found in the pools. This means there’s a difference between a half-empty main pool and a completely empty one. In the event of problems, we get a single alert that summarizes the status across the cluster’s pools. As always, we can also check from chatops:
.mysqlproxy sup myproxy-0001
OK
mysql_ro_main OK 3/10 servers are nolb in pool
mysql_ro_backup OK 3/10 servers are nolb in pool
We’ve stripped our script and config files to decouple them from GitHub’s specific setup and flow. We’ve also open sourced them in the hope that you’ll find them useful, and that they’ll help you implement your own solution with context-aware MySQL replica pools.