I've been wanting to host an apt repo in S3 for a while now, but I've hit a few stumbling blocks that have prevented me from fully realizing a solution until just recently.
An APT repo in S3 must:
- Be easy to manage
- Not require a local copy
- Not use crazy tools
- Integrate seamlessly with apt-get
- Impose the least cognitive load possible
What I came up with is a simple Python script that can be attached to an S3 bucket as a Lambda function. The Lambda function watches for changes in the S3 bucket and regenerates the Packages index whenever the bucket changes. Once you configure it, it keeps the Packages index up to date, and the only thing left for you to do is add and remove .deb files in S3.
S3 Bucket Layout
Here's a look at the layout of my S3 bucket:
    # High level layout
    /
    /dists/
    /dists/webscale.plumbing/1.0/
    /dists/webscale.plumbing/1.1/
    /dists/webscale.plumbing/2.0/
    /control-data-cache/

    # Actual repo content
    /dists/webscale.plumbing/1.0/Packages
    /dists/webscale.plumbing/1.0/elasticsearch-1.0.deb
    /dists/webscale.plumbing/1.1/Packages
    /dists/webscale.plumbing/1.1/elasticsearch-1.2.deb
    /dists/webscale.plumbing/1.1/monkeyfilter-1.7.deb
    /dists/webscale.plumbing/2.0/Packages
    /dists/webscale.plumbing/2.0/elasticsearch-2.0.deb
    /dists/webscale.plumbing/2.0/monkeyfilter-3.0.deb
For this, I am going to assume you are somehow managing the contents of the S3 bucket, using the AWS CLI, console, or other tool you have built.
AWS Lambda offers a great way to run code in response to events, for example if a new object is added to S3 or one is removed. We will use this to rebuild the Packages index when a .deb file is added or removed.
Get and Configure The Lambda Function
You will need to configure the lambda function. It takes one configuration parameter, the name of the S3 bucket. Set it in config.py:
    $ git clone https://github.com/szinck/s3apt.git
    $ cd s3apt
    $ virtualenv venv
    $ . venv/bin/activate
    $ pip install -r requirements.txt
    $ cp config.py.example config.py
    $ vim config.py # Edit the APT_REPO_BUCKET_NAME to be the name of the bucket (with no s3:// prefix)
Create a zip file containing the lambda function:
    $ zip code.zip s3apt.py config.py
    $ (cd venv/lib/python2.7/site-packages/ ; zip -r ../../../../code.zip *)
Create the Lambda
Create a new lambda using the AWS Console. It will need read/write access to the S3 bucket you are going to use. Upload the zip file created in the previous step.
Set the lambda handler as s3apt.lambda_handler and configure triggers as below.
Add the following Lambda triggers on the S3 bucket:
- ObjectCreated (All) : prefix=/dists/
- ObjectRemoved (All) : prefix=/dists/
This will cause the lambda to automatically update the Packages index in the correct subdirectory any time a package is added or removed, or even if the Packages index gets removed.
Here is what happens every time the lambda is triggered:
- If an S3 object ending in .deb was added:
  - For every package in the bucket:
    - See if the control data has already been generated, by looking in the control-data-cache folder.
    - If not, generate it and save it in the control-data-cache folder.
    - Add it to the Packages index.
- If an S3 object ending in .deb was removed:
  - Do the same as above.
- If the Packages index was added or removed:
  - Do the same as above.
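The first decision in that flow, working out which directories need their Packages index rebuilt, can be sketched as follows. This is a minimal illustration rather than the actual s3apt code: the function name `directories_to_rebuild` and the sample event are hypothetical, and the only thing taken from AWS is the standard layout of an S3 event notification, where object keys arrive URL-encoded under `Records[].s3.object.key`.

```python
import posixpath
from urllib.parse import unquote_plus


def directories_to_rebuild(event):
    """Return the bucket 'directories' whose Packages index should be rebuilt.

    Hypothetical helper: a directory qualifies when the changed object is a
    .deb file or the Packages index itself.
    """
    directories = set()
    for record in event.get("Records", []):
        # Keys in S3 event notifications are URL-encoded ('+' means space).
        key = unquote_plus(record["s3"]["object"]["key"])
        directory, name = posixpath.split(key)
        if name.endswith(".deb") or name == "Packages":
            directories.add(directory)
    return directories


# Illustrative event containing only the fields this sketch reads.
sample_event = {
    "Records": [
        {"s3": {"object": {"key": "dists/webscale.plumbing/2.0/elasticsearch-2.0.deb"}}},
        {"s3": {"object": {"key": "dists/webscale.plumbing/1.1/Packages"}}},
        {"s3": {"object": {"key": "control-data-cache/abc123"}}},
    ]
}
```

Writes to the control-data-cache folder are ignored here, so saving cached control data does not itself cause another rebuild.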
Prevent Race Conditions With This One Weird Trick
Conceptually, since the lambda code runs in response to an update in S3, if you upload multiple files at the same time, there could be multiple lambdas running, and thus multiple processes trying to write the Packages index at the same time. This can be demonstrated in the left column below.
To prevent this, we add two things:
- a separate checksum of the Packages index file (mainly for speed)
- a trigger to double-check the Packages index file after it changes
The checksum of the packages index file is implemented as metadata on the Packages index S3 object. It is simply the md5sum of the sorted list of packages that are in the index. Whenever the Packages index changes, the lambda will fire again and double-check that all the packages in that directory are contained in the Packages index. If they are not, it will regenerate the Packages index, and another lambda will fire to double-check that Packages index is correct (yo dawg...). Double checking the packages index is really fast (150 msec in testing).
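As a rough sketch of that double-check, the checksum can be computed from nothing more than the list of package names. The function names and the choice of joining the sorted names with newlines are assumptions made for illustration; the real lambda may serialize the list differently.

```python
import hashlib


def package_list_checksum(deb_names):
    """MD5 of the sorted package list, so the result is order-independent."""
    joined = "\n".join(sorted(deb_names))  # separator is an assumption
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def index_is_current(stored_checksum, deb_names_in_directory):
    """Compare the checksum stored on the Packages object with the directory listing."""
    return stored_checksum == package_list_checksum(deb_names_in_directory)
```

If two lambdas race and the later write misses a package, the stored checksum no longer matches the directory listing, `index_is_current` returns False, and the index gets regenerated. The check only needs a listing of the directory, not the contents of any .deb, which is why it is cheap.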
The right column above shows the fixed process.
Configuring apt-get to Hit S3
Adding APT S3 Transport
Apt-get has pluggable transport methods, which we will use to add an S3 transport.
This is a fairly straightforward section. Fortunately there is a Debian package for this that is in Ubuntu 16.10 (Yakkety Yak). Unfortunately, I still use Ubuntu 12.04 quite frequently.
You can download the apt-transport-s3 source that the Debian package is based upon. If you need proxy support, you can merge in this PR for proxy support. Or you can download my copy of apt-transport-s3 that has proxy support merged in.
When you get this, copy the s3 Python file to /usr/lib/apt/methods/s3 (the directory where apt looks for its transport methods) on the target machine that you want to manage. I've tested this with Ubuntu 12.04 using a proxy and it has worked.
Configuring S3 Access
See the readme in apt-transport-s3, but you will need to create the file /etc/s3auth.conf with the following contents:
    AccessKeyId=AKIABLAHBLAHTESTING
    SecretAccessKey=asdfasdfasfdasdfasdf
    Token=''
You will also need to configure IAM permissions for this user to be able to read from the S3 bucket that you are using for your apt repository. If the instances you are running use instance profiles, you can also configure your instance profile for access to the s3 bucket in lieu of configuring an access key above.
Configuring Your Repo List
This is how I configure my S3 repo in /etc/apt/sources.list.d/s3repo.list:
deb s3://my-bucket-name/ dists/webscale.plumbing/2.0/
Just run a regular apt-get update and you should see it attempt to download from S3.
Debian Packaging Deep Dive
To understand how the above lambda works, let's take a quick look at debian packaging. All of this information can be found in the Debian policy manuals, but I'll scoop it up here in one place.
Let's take the Elasticsearch debian package, for example. This is a fairly old version but will suffice for the example.
    $ ls -lh
    total 36120
    -rw-r--r--@ 1 szinck staff 18M Aug 1 07:35 elasticsearch-1.0.2.deb
Inside the deb file, there are two archives: control.tar.gz and data.tar.gz (plus a small debian-binary file that holds the package format version). control.tar.gz contains the Debian control file that describes the contents of the package. data.tar.gz contains the elasticsearch software that actually gets installed on the target machine. Here we are interested in control.tar.gz. Any .deb file can be unarchived as follows:
    $ ar x elasticsearch-1.0.2.deb
    $ ls -lh
    total 72248
    -rw-r--r-- 1 szinck staff 3.1K Aug 1 07:37 control.tar.gz
    -rw-r--r-- 1 szinck staff 18M Aug 1 07:37 data.tar.gz
    -rw-r--r-- 1 szinck staff 4B Aug 1 07:37 debian-binary
    -rw-r--r--@ 1 szinck staff 18M Aug 1 07:35 elasticsearch-1.0.2.deb
Inside the control.tar.gz file lies a text file called control. This is what we are interested in.
    $ tar -zxOf control.tar.gz control
    Package: elasticsearch
    Version: 1.0.2
    Section: web
    Priority: optional
    Architecture: all
    Depends: libc6, adduser
    Installed-Size: 20972
    Maintainer: Elasticsearch Team <firstname.lastname@example.org>
    Description: Open Source, Distributed, RESTful Search Engine
     Elasticsearch is a distributed RESTful search engine built for the cloud.
     .
     Features include:
     .
     + Distributed and Highly Available Search Engine.
    [... snip ...]
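A Lambda function has no ar binary to shell out to, so the same extraction has to happen in code. The sketch below shows one way to do it with only the Python standard library. It is not the code from s3apt: `ar_members` and `read_control` are hypothetical names, and the parser assumes the common ar layout of an 8-byte magic string followed by 60-byte member headers.

```python
import io
import posixpath
import tarfile

AR_MAGIC = b"!<arch>\n"
AR_HEADER_SIZE = 60


def ar_members(deb_bytes):
    """Yield (name, data) for each member of an ar archive held in memory."""
    if not deb_bytes.startswith(AR_MAGIC):
        raise ValueError("not an ar archive")
    offset = len(AR_MAGIC)
    while offset + AR_HEADER_SIZE <= len(deb_bytes):
        header = deb_bytes[offset:offset + AR_HEADER_SIZE]
        if header[58:60] != b"`\n":
            raise ValueError("corrupt ar member header")
        # Name is the first 16 bytes; some writers append a trailing '/'.
        name = header[0:16].decode("ascii").strip().rstrip("/")
        # Size is a decimal string in bytes 48..57.
        size = int(header[48:58].decode("ascii").strip())
        start = offset + AR_HEADER_SIZE
        yield name, deb_bytes[start:start + size]
        # Member data is padded to an even length.
        offset = start + size + (size % 2)


def read_control(deb_bytes):
    """Return the text of the control file inside a .deb."""
    for name, data in ar_members(deb_bytes):
        if name.startswith("control.tar"):
            # 'r:*' lets tarfile detect the compression (gz, xz, ...).
            with tarfile.open(fileobj=io.BytesIO(data), mode="r:*") as tar:
                for member in tar.getmembers():
                    if member.isfile() and posixpath.basename(member.name) == "control":
                        return tar.extractfile(member).read().decode("utf-8")
    raise ValueError("no control file found")
```

Calling `read_control` on the bytes of a .deb should return the same text that the tar command above prints.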
As you can see, the control file contains a number of plain text fields that describe the package. Some of the fields, such as Description, can be multi-line fields. If we run a tool to generate a package index, such as dpkg-scanpackages, it will pull out these fields and also add a few other fields:
    Package: elasticsearch
    Version: 1.0.2
    Section: web
    [... snip ...]
    Filename: elasticsearch-1.0.2.deb
    Size: 18237916
    SHA256: c58e29a47eb869d895c5c5324748225de6397e1eaa88b218535e479658ca60c6
    SHA1: 7aaf79b77c3b39ffc5591ab3eace7716160c28e8
    MD5sum: a75766589b419487d35f5dc551d84215
These fields are:
- Filename: name of the .deb file. It can include a path relative to the repository root.
- Size: size of the .deb file.
- SHA256, SHA1, MD5sum: various hashes of the file.
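These extra fields are cheap to compute once you have the bytes of the .deb. Here is a minimal sketch using hashlib; `index_stanza` is a hypothetical helper, and appending the fields at the end mirrors the listing above rather than the exact field order that dpkg-scanpackages emits.

```python
import hashlib


def index_stanza(control_text, deb_bytes, filename):
    """Build one Packages entry: the control fields plus the index-only fields."""
    extra_fields = [
        "Filename: " + filename,
        "Size: %d" % len(deb_bytes),
        "SHA256: " + hashlib.sha256(deb_bytes).hexdigest(),
        "SHA1: " + hashlib.sha1(deb_bytes).hexdigest(),
        "MD5sum: " + hashlib.md5(deb_bytes).hexdigest(),
    ]
    return control_text.rstrip("\n") + "\n" + "\n".join(extra_fields) + "\n"
```

A full Packages index is then just these entries joined together with a blank line between each one.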
If you know the hash of a .deb file, all of the fields except for Filename can never change. Filename can change because the Packages index and the actual deb packages can be in different directories. A repository maintainer might do this to deduplicate their .deb files but yet have multiple Packages files that refer to the same .deb.
We can use this knowledge to speed up how long it takes to generate a Packages index by caching all of the control file data, size, and hashes for any given hash of a deb file.
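The caching idea can be sketched with a dictionary standing in for the control-data-cache folder. Everything here is illustrative: the class name, the `get_or_compute` method and the shape of the cached value are assumptions, and the real lambda stores its cache as objects in S3 rather than in memory.

```python
class ControlDataCache:
    """Cache of per-package metadata, keyed by a hash of the .deb content."""

    def __init__(self):
        # Stand-in for the control-data-cache folder in the bucket.
        self._store = {}

    def get_or_compute(self, content_hash, compute):
        """Return cached metadata, calling compute() only on a cache miss.

        compute is a zero-argument callable that does the expensive work
        (download the .deb, extract the control file, hash the content).
        """
        if content_hash not in self._store:
            self._store[content_hash] = compute()
        return self._store[content_hash]
```

Because the entry is keyed by content hash and leaves out Filename, the same cached data can be reused when an identical .deb shows up under a different path. Only the Filename field has to be filled in per directory.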
If you look at how I structure my repos, it may seem wasteful to have multiple copies of a deb file in different folders of the S3 bucket. Unless you are hosting many copies of very large repos, S3 storage is cheap enough that this will not cost you a lot. The alternative here would be to maintain a list of what .deb packages belong in each repo you manage, keep all the actual packages in one directory, and use the Filename field in the Packages index to point them to the correct location. I chose not to do this because, if you want to know what packages are in a repo, it is far simpler to just look in the correct folder in the bucket and see what is there. This also means you can just use standard AWS CLI or console tools to add and remove packages, and don't have to maintain anything separate.
This tool was built primarily to host proprietary debian packages built by private build servers. I haven't tried uploading a complete copy of the ubuntu repository to see how well it works.
There were some alternatives already out there but none of them did exactly what I wanted. For example, some people use reprepro, but then you have to maintain a copy of the repository on a server you manage. There are also aptly and deb-s3, but those looked more complicated to set up. This solution is set-it-and-forget-it. Once you have the lambda set up, you don't have to worry about it any more.