Managing Apt Repos in S3 Using Lambda

Datetime: 2016-08-23 01:51:17          Topic: Debian, Elasticsearch

I've been wanting to host an apt repo in S3 for a while now, but I've hit a few stumbling blocks that have prevented me from fully realizing a solution until just recently.

An APT repo in S3 must:

  • Be private
  • Be easy to manage
  • Be fast
  • Not require a local copy
  • Not use crazy tools
  • Integrate seamlessly with apt-get
  • Have the least amount of cognitive load possible

What I came up with is a simple Python script that can be attached to an S3 bucket as a lambda function. The lambda function watches the S3 bucket and quickly regenerates the Packages index whenever the bucket changes. Once you configure the lambda function, it keeps the Packages index up to date, and the only ongoing work is adding and removing .deb files in S3.

S3 Bucket Layout

Here's a look at the layout of my S3 bucket:

# High level layout
/
/dists/
/dists/webscale.plumbing/1.0/
/dists/webscale.plumbing/1.1/
/dists/webscale.plumbing/2.0/
/control-data-cache/

# Actual repo content
/dists/webscale.plumbing/1.0/Packages
/dists/webscale.plumbing/1.0/elasticsearch-1.0.deb

/dists/webscale.plumbing/1.1/Packages
/dists/webscale.plumbing/1.1/elasticsearch-1.2.deb
/dists/webscale.plumbing/1.1/monkeyfilter-1.7.deb

/dists/webscale.plumbing/2.0/Packages
/dists/webscale.plumbing/2.0/elasticsearch-2.0.deb
/dists/webscale.plumbing/2.0/monkeyfilter-3.0.deb

For this, I am going to assume you are already managing the contents of the S3 bucket yourself, using the AWS CLI, the console, or another tool you have built.

Lambda Functions

AWS Lambda offers a great way to run code in response to events, for example when an object is added to or removed from S3. We will use this to rebuild the Packages index whenever a .deb file is added or removed.

Get and Configure The Lambda Function

https://github.com/szinck/s3apt

You will need to configure the lambda function. It takes one configuration parameter, the name of the S3 bucket, which you add to config.py:

$ git clone https://github.com/szinck/s3apt.git
$ cd s3apt

$ virtualenv venv
$ . venv/bin/activate
$ pip install -r requirements.txt

$ cp config.py.example config.py
$ vim config.py
# Edit the APT_REPO_BUCKET_NAME to be the name of the bucket (with no s3:// prefix)
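
For reference, config.py ends up being a one-line Python module. The variable name below follows the comment above; double-check it against config.py.example in the repo:

# config.py -- the bucket name only, with no s3:// prefix
APT_REPO_BUCKET_NAME = "my-apt-repo-bucket"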

Create a zip file containing the lambda function and its dependencies:

$ zip code.zip s3apt.py config.py
$ (cd venv/lib/python2.7/site-packages/ ; zip -r ../../../../code.zip *)

Create the Lambda

Create a new lambda using the AWS Console. It will need read/write access to the S3 bucket you are going to use. Upload the zip file created in the previous step.

Set the lambda handler to s3apt.lambda_handler and configure the triggers as described below.

Lambda Triggers

Add the following Lambda triggers on the S3 bucket:

  • ObjectCreated (All): prefix=/dists/
  • ObjectRemoved (All): prefix=/dists/

This will cause the lambda to automatically update the Packages index in the correct subdirectory any time a package is added or removed, or even if the Packages index gets removed.
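
If you prefer to script the trigger setup instead of clicking through the console, a rough boto3 equivalent is sketched below. The bucket name and function ARN are placeholders; note that S3 key prefixes do not start with a slash.

import boto3

s3 = boto3.client("s3")

# Placeholders: substitute your bucket name and the ARN of the lambda you created.
# S3 must also be allowed to invoke the function (lambda add-permission); the
# console sets that permission up for you automatically.
s3.put_bucket_notification_configuration(
    Bucket="my-apt-repo-bucket",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:s3apt",
                "Events": ["s3:ObjectCreated:*", "s3:ObjectRemoved:*"],
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "prefix", "Value": "dists/"}]}
                },
            }
        ]
    },
)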

Here is what happens every time the lambda is triggered:

  1. If an S3 object ending in .deb was added:
    • For every package in the bucket:
      • See if its control data has already been generated, by looking in the control-data-cache folder.
      • If not, generate it and save it in the control-data-cache folder.
      • Add it to the Packages index.
  2. If an S3 object ending in .deb was removed:
    • Do the same as above.
  3. If the Packages index itself was added or removed:
    • Do the same as above.
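
To make that flow concrete, here is a heavily simplified sketch of what such a handler can look like. This is not the actual s3apt code: the cache key scheme is hypothetical, and the cache-miss path (downloading the .deb and extracting its control data) is left out entirely.

import boto3

s3 = boto3.client("s3")
CACHE_PREFIX = "control-data-cache/"   # assumed cache layout, see the bucket diagram above

def lambda_handler(event, context):
    # One S3 event can carry several records; handle each affected key.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        if key.endswith(".deb") or key.endswith("/Packages"):
            prefix = key.rsplit("/", 1)[0] + "/"
            rebuild_packages_index(bucket, prefix)

def rebuild_packages_index(bucket, prefix):
    # List every .deb under this prefix and collect its cached control stanza.
    # Cache misses (generating control data from the .deb itself) are omitted.
    stanzas = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            if not obj["Key"].endswith(".deb"):
                continue
            cache_key = CACHE_PREFIX + obj["ETag"].strip('"')   # hypothetical cache key
            cached = s3.get_object(Bucket=bucket, Key=cache_key)["Body"].read()
            stanzas.append(cached.decode("utf-8").strip())
    body = "\n\n".join(sorted(stanzas)) + "\n"
    s3.put_object(Bucket=bucket, Key=prefix + "Packages", Body=body.encode("utf-8"))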

Prevent Race Conditions With This One Weird Trick

Conceptually, since the lambda code runs in response to an update in S3, uploading multiple files at the same time can mean multiple lambdas running at once, and thus multiple processes trying to write the Packages index at the same time.

To prevent this, we add two things:

  • a separate checksum of the Packages index file (mainly for speed)
  • a trigger to double-check the Packages index file after it changes

The checksum of the Packages index is implemented as metadata on the Packages index S3 object. It is simply the md5sum of the sorted list of packages that are in the index. Whenever the Packages index changes, the lambda fires again and double-checks that all the packages in that directory are contained in the Packages index. If they are not, it regenerates the Packages index, and another lambda fires to double-check that the new Packages index is correct (yo dawg...). Double-checking the Packages index is really fast (about 150 msec in testing).
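
The idea can be sketched roughly like this (the metadata key name and the exact hashing scheme here are assumptions, not necessarily what s3apt does):

import hashlib
import boto3

s3 = boto3.client("s3")

def package_list_hash(deb_keys):
    # md5 of the sorted package list, as described above
    return hashlib.md5("\n".join(sorted(deb_keys)).encode("utf-8")).hexdigest()

def index_is_current(bucket, prefix, deb_keys):
    # Compare the hash stored as user metadata on the Packages object with one
    # computed from the live listing; a mismatch means the index needs a rebuild.
    head = s3.head_object(Bucket=bucket, Key=prefix + "Packages")
    stored = head["Metadata"].get("packages-hash", "")   # metadata key name is an assumption
    return stored == package_list_hash(deb_keys)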

The end result is that the Packages index converges to a correct state, even when many uploads land at the same time.

Configuring apt-get to Hit S3

Adding APT S3 Transport

Apt-get has pluggable transport methods, which we will use to add an S3 transport.

This is a fairly straightforward section. Fortunately there is a Debian package for this in Ubuntu 16.10 (Yakkety Yak). Unfortunately, I still use Ubuntu 12.04 quite frequently.

You can download the apt-transport-s3 source that the Debian package is based upon. If you need proxy support, you can merge in the proxy-support PR, or you can download my copy of apt-transport-s3 that already has proxy support merged in.

When you get this, copy the s3 Python script to /usr/lib/apt/methods/s3 on the target machine you want to manage. I've tested this with Ubuntu 12.04 using a proxy and it has worked.

Configuring S3 Access

See the readme in apt-transport-s3, but you will need to create the file /etc/s3auth.conf with the following contents:

AccessKeyId=AKIABLAHBLAHTESTING
SecretAccessKey=asdfasdfasfdasdfasdf
Token=''

You will also need to configure IAM permissions for this user to be able to read from the S3 bucket that you are using for your apt repository. If your instances use instance profiles, you can grant the instance profile access to the S3 bucket instead of configuring an access key as above.
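
A minimal read-only IAM policy for that user (or instance profile) looks roughly like this; the bucket name is a placeholder, and the apt-transport-s3 readme is the authority on exactly which actions it needs:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my-apt-repo-bucket"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::my-apt-repo-bucket/*"
    }
  ]
}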

Configuring Your Repo List

This is how I configure my S3 repo in /etc/apt/sources.list.d/s3repo.list:

deb s3://my-bucket-name/  dists/webscale.plumbing/2.0/
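
With a flat repo line like this, apt-get update should end up requesting dists/webscale.plumbing/2.0/Packages from the bucket through the S3 transport (possibly after trying compressed variants such as Packages.gz).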

Test it

Just run a regular apt-get update and you should see it attempt to download from S3.

Debian Packaging Deep Dive

To understand how the above lambda works, let's take a quick look at debian packaging. All of this information can be found in the Debian policy manuals, but I'll scoop it up here in one place.

Let's take the Elasticsearch debian package, for example. This is a fairly old version but will suffice for the example.

$ ls -lh
total 36120
-rw-r--r--@ 1 szinck  staff    18M Aug  1 07:35 elasticsearch-1.0.2.deb

Inside the deb file there are two main members, control.tar.gz and data.tar.gz (plus a tiny debian-binary version file). control.tar.gz contains the Debian control file that describes the contents of the package; data.tar.gz contains the Elasticsearch software that actually gets installed on the target machine. Here we are interested in control.tar.gz. Any .deb file can be unarchived as follows:

$ ar x elasticsearch-1.0.2.deb
$ ls -lh
total 72248
-rw-r--r--  1 szinck  staff   3.1K Aug  1 07:37 control.tar.gz
-rw-r--r--  1 szinck  staff    18M Aug  1 07:37 data.tar.gz
-rw-r--r--  1 szinck  staff     4B Aug  1 07:37 debian-binary
-rw-r--r--@ 1 szinck  staff    18M Aug  1 07:35 elasticsearch-1.0.2.deb

Inside the control.tar.gz file lies a text file called control. This is what we are interested in.

$ tar -zxOf control.tar.gz control
Package: elasticsearch
Version: 1.0.2
Section: web
Priority: optional
Architecture: all
Depends: libc6, adduser
Installed-Size: 20972
Maintainer: Elasticsearch Team <info@elasticsearch.com>
Description: Open Source, Distributed, RESTful Search Engine
 Elasticsearch is a distributed RESTful search engine built for the cloud.
 .
 Features include:
 .
 + Distributed and Highly Available Search Engine.
 [... snip ...]

As you can see, the control file contains a number of plain-text fields that describe the package. Some of the fields, such as Description, can span multiple lines. If we run a tool that generates a package index, such as dpkg-scanpackages, it will pull out these fields and add a few others:

Package: elasticsearch
Version: 1.0.2
Section: web
[... snip ...]
Filename: elasticsearch-1.0.2.deb
Size: 18237916
SHA256: c58e29a47eb869d895c5c5324748225de6397e1eaa88b218535e479658ca60c6
SHA1: 7aaf79b77c3b39ffc5591ab3eace7716160c28e8
MD5sum: a75766589b419487d35f5dc551d84215

These fields are:

  • Filename: the name of the .deb file. It can include a path relative to the repository root.
  • Size: the size of the .deb file in bytes.
  • SHA256, SHA1, MD5sum: various hashes of the file.

For a given .deb file, every field except Filename is fixed: if you know the file's hash, you know its control data, size, and checksums. Filename can change because the Packages index and the actual .deb packages can live in different directories. A repository maintainer might do this to deduplicate .deb files while still having multiple Packages files refer to the same .deb.

We can use this knowledge to speed up generating a Packages index by caching the control file data, size, and hashes, keyed by the hash of each .deb file.
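
As a rough illustration of what generating such a cache entry can look like (this mirrors the idea, not the exact s3apt implementation, and assumes the control member is stored as control.tar.gz):

import hashlib
import subprocess

def control_data_for_deb(path):
    # Pull the control file out of the .deb with ar and tar, then append the
    # size and hash fields that dpkg-scanpackages would add. Filename is left
    # out on purpose: it depends on where the .deb sits relative to the
    # Packages index, so it is the one field that cannot be cached.
    with open(path, "rb") as f:
        data = f.read()
    # Some packages store the member as ./control instead of control; adjust if needed.
    control = subprocess.check_output(
        "ar p %s control.tar.gz | tar -zxO control" % path, shell=True
    ).decode("utf-8").strip()
    return "\n".join([
        control,
        "Size: %d" % len(data),
        "MD5sum: %s" % hashlib.md5(data).hexdigest(),
        "SHA1: %s" % hashlib.sha1(data).hexdigest(),
        "SHA256: %s" % hashlib.sha256(data).hexdigest(),
    ])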

Misc Notes

If you look at how I structure my repos, it may seem wasteful to keep multiple copies of a .deb file in different directories of the S3 bucket. Unless you are hosting many copies of very large repos, S3 is cheap enough that this will not cost you much. The alternative would be to maintain a list of which .deb packages belong in each repo you manage, keep all the actual packages in one directory, and use the Filename field in the Packages index to point to the correct location. I chose not to do this because it is far simpler to just look in the correct folder of the bucket to see what packages are there. It also means you can use the standard AWS CLI or console tools to add and remove packages, and don't have to maintain anything separate.

This tool was built primarily to host proprietary Debian packages built by private build servers. I haven't tried uploading a complete copy of the Ubuntu repository to see how well it works.

There were some alternatives already out there, but none of them did exactly what I wanted. For example, some people use reprepro, but you have to maintain a copy of the repository on a server you manage. There are also aptly and deb-s3, but those looked more complicated to set up. This solution is set-it-and-forget-it: once you have the lambda set up, you don't have to worry about it any more.




