Tracker: Ingesting MySQL data at scale - Part 2

Datetime: 2016-08-23 02:22:47          Topic: MySQL

In Part 1 we discussed Tracker, our existing architecture for ingesting MySQL data, including its wins and challenges, and outlined the new architecture with a focus on the Hadoop side. Here we’ll focus on the implementation details on the MySQL side. The uploader of data to S3 has been open-sourced as part of the Pinterest MySQL Utils.

Tracker V-0

As a proof of concept, we wrote a hacky 96-line Bash script to unblock backups to Hive for a new data set. The script spawned a bunch of workers that each worked on one database at a time. For each table in the database, it ran SELECT INTO OUTFILE and then uploaded the data to S3. It worked, but Bash… and that just isn’t a long-term solution.

Tracker V-1

For our maintainable implementation, we rewrote the Bash script as a Python script called mysql_backup_csv.py. The only significant difference (other than making us not feel bad about ourselves) was that we added lzop compression in order to reduce the size of the data in S3. Why lzop? We thought it would be the lightest-weight compression tool with a command-line interface that we could install from apt-get.

We tested this against our large sharded MySQL fleet, and it was slow. Like, really slow -- 8 hours slow.

Speed it up

We now had a maintainable tool for uploading our MySQL data to S3. The problem was that it could not process all the data fast enough for our team to meet their SLAs. We needed to improve the overall throughput significantly, so we went to work on the following:

  • Implement locking so multiple slaves could cooperatively dump in parallel. The lock is maintained via a simple table on the master (a sketch of such a lock follows this list). This allowed us to get down to around 3.5 hours to dump all our data. Too slow!
  • Skip writing to disk. The Percona distribution of MySQL has a very interesting feature in that SELECT INTO OUTFILE can write to a FIFO. As is, we had to dump all of our data and then read it back from the filesystem. Using a FIFO, we could build a pipeline that did not need to write to the local filesystem at all (see the pipeline sketch after this list)! This got us to somewhere around one hour, which was well under our requirement.
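To make the cooperative locking concrete, here is a minimal sketch of how a slave might claim a database via a table on the master. Everything here is an assumption for illustration: the backup_locks table, its columns, the MySQLdb driver and the function names are ours, not the actual Pinterest schema or code.

    import MySQLdb  # assumes the MySQLdb (mysqlclient) driver

    # Hypothetical lock table on the master, one row per claimed database:
    #   CREATE TABLE backup_locks (
    #       db_name     VARCHAR(64) NOT NULL PRIMARY KEY,
    #       holder      VARCHAR(64) NOT NULL,
    #       acquired_at DATETIME    NOT NULL
    #   );

    def try_lock_db(master_conn, db_name, worker_id):
        """Claim a database for dumping; returns True if we won the lock."""
        cursor = master_conn.cursor()
        try:
            cursor.execute(
                "INSERT INTO backup_locks (db_name, holder, acquired_at) "
                "VALUES (%s, %s, NOW())",
                (db_name, worker_id))
            master_conn.commit()
            return True
        except MySQLdb.IntegrityError:
            # Another slave already claimed this database tonight.
            master_conn.rollback()
            return False

    def release_lock(master_conn, db_name, worker_id):
        """Give the database back once its dump has finished."""
        cursor = master_conn.cursor()
        cursor.execute(
            "DELETE FROM backup_locks WHERE db_name = %s AND holder = %s",
            (db_name, worker_id))
        master_conn.commit()

Each worker loops over the databases on its replica set, skips the ones it fails to lock, and releases its locks as it finishes, so a fleet of slaves can split the dump work without stepping on each other.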
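And here is a rough sketch of the FIFO-based pipeline for a single table: Percona's SELECT INTO OUTFILE writes into a named pipe while a consumer compresses and uploads the stream, so nothing touches local disk. The function name, the bucket/key arguments and the exact lzop and gof3r invocations are illustrative, and the locking above plus the safe_uploader described later are omitted for brevity.

    import os
    import subprocess
    import tempfile

    def dump_table_via_fifo(conn, db, table, bucket, key):
        """Stream one table to S3 without writing the dump to local disk:
        SELECT INTO OUTFILE -> FIFO -> lzop -> s3gof3r. Runs on the slave
        itself, so mysqld and this script share a filesystem."""
        fifo = os.path.join(tempfile.mkdtemp(), '%s.%s.fifo' % (db, table))
        os.mkfifo(fifo)
        os.chmod(fifo, 0o666)  # mysqld, not this script, writes into the FIFO

        try:
            # Start the consumer first; `cat` blocks inside the child until
            # mysqld opens the FIFO for writing, so this Popen returns at once.
            consumer = subprocess.Popen(
                'cat {fifo} | lzop -c | gof3r put -b {bucket} -k {key}'.format(
                    fifo=fifo, bucket=bucket, key=key),
                shell=True)

            # Producer: on Percona MySQL, SELECT INTO OUTFILE can target a FIFO.
            cursor = conn.cursor()
            cursor.execute(
                "SELECT * INTO OUTFILE '{fifo}' "
                "FIELDS TERMINATED BY '\\t' ESCAPED BY '\\\\' "
                "FROM `{db}`.`{table}`".format(fifo=fifo, db=db, table=table))

            if consumer.wait() != 0:
                raise RuntimeError('upload pipeline failed for %s.%s' % (db, table))
        finally:
            os.unlink(fifo)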

Slow it down

Per the fine manual (and this is in super old manuals):

“ASCII NUL is escaped to make it easier to view with some pagers.”

!@#()U@!#!!!

We had to write a C program called nullescape to unescape the data. &*(@!#!

Adding this to our pipeline resulted in our servers burning four cores just to unescape NUL bytes. This slowed us down to 1.5 hours to dump all our data, which was still within our requirements and left us a bit of breathing room.
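For illustration, here is roughly what that unescaping step has to do, as a Python sketch (the real nullescape is a small C program precisely because this inner loop is CPU-bound, and the exact rules depend on the FIELDS ESCAPED BY setting): with the default escape character, SELECT INTO OUTFILE writes an ASCII NUL as the two bytes backslash and '0', and the filter must turn that back into a real NUL while leaving every other escape sequence untouched.

    import sys

    def unescape_nul(data):
        """Turn the two-byte sequence '\\' '0' back into a real NUL byte,
        leaving escaped backslashes and all other escape sequences exactly
        as MySQL wrote them."""
        out = bytearray()
        i = 0
        while i < len(data):
            if data[i:i + 2] == b'\\0':
                out.append(0)        # the real NUL byte
                i += 2
            elif data[i:i + 2] == b'\\\\':
                out += b'\\\\'       # an escaped backslash: keep both bytes
                i += 2
            else:
                out.append(data[i])
                i += 1
        return bytes(out)

    if __name__ == '__main__':
        # Reads everything into memory for brevity; a real filter streams in
        # chunks and has to handle escape sequences split across chunk edges.
        sys.stdout.buffer.write(unescape_nul(sys.stdin.buffer.read()))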

Winning the race against an EOF

A requirement of the system was that partial uploads be prevented, and a partial upload could happen if anything in the pipeline failed. When a Linux program terminates (regardless of how or why), its open file handles are closed. If a file handle is the write end of a FIFO, the reader of the FIFO will receive an EOF with no indication of whether the process feeding data into the FIFO succeeded or failed.

So, why does this matter? Well, dump queries get killed from time to time, and early versions of nullescape would occasionally segfault. When either happened, the rest of the pipeline would think no more data was coming. It was possible to catch the non-zero return status and delete the uploaded data, but that’s kinda racy, and eventually the race would be lost.

We talked about it a bunch, and the best solution we came up with was a program that would sit just before s3gof3r in the pipeline. This program would repeat its input from stdin to stdout, but only transmit an EOF if all programs in the pipeline succeeded. This program is called safe_uploader and ended up being very lightweight.
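The real safe_uploader ships with the Pinterest MySQL Utils; the sketch below only illustrates the core idea, and its layout (a wrapper that spawns both the producer stages and the uploader so it can see everyone's exit status), the function names and the example commands are our own assumptions. The guarantee it demonstrates is the one described above: the uploader only ever sees EOF if every upstream stage exited cleanly, and on any failure the uploader is killed so a partial upload can never look complete.

    import subprocess

    CHUNK = 1 << 20  # copy 1 MiB at a time

    def safe_upload(producer_cmds, uploader_cmd):
        """Chain producer_cmds into a pipeline, feed its output to uploader_cmd,
        and only deliver EOF to the uploader if every producer exited cleanly."""
        stages = []
        prev_out = None
        for cmd in producer_cmds:
            proc = subprocess.Popen(cmd, stdin=prev_out, stdout=subprocess.PIPE)
            if prev_out is not None:
                prev_out.close()      # only the child keeps this pipe end open
            prev_out = proc.stdout
            stages.append(proc)

        uploader = subprocess.Popen(uploader_cmd, stdin=subprocess.PIPE)
        try:
            # Copy the bytes ourselves instead of wiring the pipes together,
            # so closing the uploader's stdin (its EOF) stays under our control.
            while True:
                chunk = prev_out.read(CHUNK)
                if not chunk:
                    break
                uploader.stdin.write(chunk)

            if any(proc.wait() != 0 for proc in stages):
                raise RuntimeError('a producer stage failed; aborting the upload')

            uploader.stdin.close()    # the only place EOF is ever delivered
            if uploader.wait() != 0:
                raise RuntimeError('uploader failed')
        except Exception:
            uploader.kill()           # never let a partial upload finish
            uploader.wait()           # and reap it so no zombies are left behind
            raise

    # Example invocation (commands are illustrative):
    # safe_upload([['cat', 'db.table.fifo'], ['nullescape'], ['lzop', '-c']],
    #             ['gof3r', 'put', '-b', 'my-bucket', '-k', 'db/table.lzo'])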

In the beginning there were subtle bugs in safe_uploader that resulted in zombie and orphan processes; however, once we fixed these, they quit appearing on our database servers.

Systemic improvements

Compared to the previous system, this project significantly improved the usability of the resulting data and reduced operational issues:

  1. We added support for MySQL binary types. During the backup, rather than using hex encoding for binary columns (which doubled the size of the backup file), we chose to use escaping for some special characters (e.g. \n, \t, \r):
    • Hadoop’s built-in TextInputFormat can’t read the backup with newline characters escaped, so we wrote our own EscapedTextInputFormat for Hadoop/Hive
    • We made a fix on the Hadoop Streaming side for this special TextInputFormat
    • We rewrote the CSV parser for our Python clients to read the new backup file (a simplified parsing sketch follows this list)
  2. We added a consistent data retention policy to all backup files, and we automatically adjust the Hive table to make sure its schema is always in sync with the MySQL schema.
  3. Since all the MySQL data lands in Hive without significant modification, we now have a secondary backup system. This is useful for recovering from small losses of data. Restoring an xtrabackup takes hours, but pulling a single row or a small table from Hive is really fast and, better yet, doesn’t require help from the DBAs!
  4. When a failover occurs, a small script run by cron kills running backups. In the past, this would require dropping the MySQL user for the dumper framework. Often, this would also result in the DBA team and Data-Eng paging each other in the wee hours of the morning.
  5. Our backups are now fully consistent on the schema level and generally consistent on a replica set within a few seconds. This is a big improvement for cross-shard data consistency checking.
  6. An unintended benefit of Tracker pushing the slave servers really, really hard is that we are effectively running a benchmark every night that significantly stresses our slaves. From time to time, we’ll remove slower servers from production.
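To make item 1 concrete, here is a simplified sketch of the kind of record parser the Python clients need. It assumes MySQL's SELECT INTO OUTFILE conventions (tab-separated fields, backslash as the escape character, \0 for NUL and \N for SQL NULL) and that the record terminator has already been stripped; the function name and the exact escape set are illustrative, not the production parser.

    def parse_backup_row(line):
        """Split one backup record (bytes, record terminator already stripped)
        into fields, undoing the backslash escaping described above."""
        fields = []
        current = bytearray()
        is_null = False
        i = 0
        while i < len(line):
            byte = line[i]
            if byte == 0x5C and i + 1 < len(line):   # backslash starts an escape
                nxt = line[i + 1]
                if nxt == ord('0'):
                    current.append(0)                # escaped NUL byte
                elif nxt == ord('N'):
                    is_null = True                   # the whole field is SQL NULL
                else:
                    current.append(nxt)              # \<tab>, \<newline>, \\ ...
                i += 2
            elif byte == 0x09:                       # an unescaped tab ends a field
                fields.append(None if is_null else bytes(current))
                current, is_null = bytearray(), False
                i += 1
            else:
                current.append(byte)
                i += 1
        fields.append(None if is_null else bytes(current))
        return fields

    # Example: an embedded newline, a SQL NULL and an embedded NUL byte.
    # parse_backup_row(b'42\tfoo\\\nbar\t\\N\t\\0end')
    #   -> [b'42', b'foo\nbar', None, b'\x00end']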

Tracker is now in full production with the capability of moving all our MySQL data into S3 within two hours.

Future work

We’re not stopping here. We realized that for some tables the daily change isn’t big enough to warrant a full snapshot pull, so we’re building an incremental pull pipeline that converts MySQL binary logs into a Kafka stream. This will be incrementally pushed into S3 and later compacted with the previous snapshot to produce continuously updated snapshots. Stay tuned!
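Since this pipeline is still being built, the following is purely an illustration of what the compaction step could look like in miniature (the record layout, operation names and last-write-wins rule are all our assumptions): apply the day's binlog-derived changes to the previous snapshot, keyed by primary key.

    def compact(previous_snapshot, changes):
        """Merge the previous snapshot with a stream of binlog-derived changes.
        previous_snapshot maps primary key -> row; changes is an ordered
        iterable of (pk, (op, row)) where op is 'upsert' or 'delete'."""
        rows = dict(previous_snapshot)       # start from yesterday's snapshot
        for pk, (op, row) in changes:
            if op == 'upsert':
                rows[pk] = row               # insert or update, last write wins
            elif op == 'delete':
                rows.pop(pk, None)           # drop the row if it exists
        return rows

    # Example:
    # old   = {1: {'name': 'a'}, 2: {'name': 'b'}}
    # delta = [(2, ('upsert', {'name': 'b2'})),
    #          (3, ('upsert', {'name': 'c'})),
    #          (1, ('delete', None))]
    # compact(old, delta) -> {2: {'name': 'b2'}, 3: {'name': 'c'}}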

Acknowledgements: Thanks to Henry Cai, Krishna Gade, Vamsi Ponnekanti, Mao Ye and Ernie Souhrada for their invaluable contributions to the Tracker Project.

For Pinterest engineering news and updates, follow our engineering Pinterest, Facebook and Twitter accounts. Interested in joining the team? Check out our Careers site.




