For a while, I wanted to set up an Ubuntu archive mirror using Cloudflare. It felt like a natural idea: the archive is a set of static files that could be easily cached, and Cloudflare is very good at caching files close to users around the world.
What is an archive mirror?
If you have ever run apt update
on Ubuntu, you have used the archive. It is a big collection of files: packages (.deb
files) and index files (Packages.gz
, Release
, etc.) that tell apt
what is available.
The structure was designed more than 20 years ago, before CDNs and large-scale caching were common. It is very stable, but not optimized for today’s internet.
If you want to see how it looks, you can browse archive.ubuntu.com or read about the Debian repository format.
My first idea: the “big sync”
My first plan was to copy the entire archive into Cloudflare R2, their low-cost object store. I thought I would write workers to parse the index files, detect changes, and keep R2 in sync with the upstream archive.
But this was heavy: the archive is very large, the initial storage without being too high would not be null and the code to make sure that the indices would stay in sync with what was available in the mirror would not be trivial. Indeed, if a index reference a package which has not been synced, it would cause the client to fail. All of that for a mirror that maybe only a few people would use.
What I ended up doing: “lazy syncing”
Instead, I settled on something much simpler.
- When a client asks for a package file, my worker fetches it from the upstream archive, stores it in R2, and caches it in Cloudflare’s edge for 1 day.
- Index files are not stored in R2. They are just cached in Cloudflare for 30 minutes.
This way, I do not need to preload or parse the whole archive. Packages appear only if someone asks for them.
One of the drawback here is that every request hits the worker which add more cost but I figured that if it would become a problem later on, I could switch to my original idea..
Putting it in production
Mirror URL: https://ubuntu.gjolly.dev/ubuntu
As you might expect, getting this to production was not without issues.
At first, I made a mistake with a symlink that pointed /ubuntu
back to itself. Yes, a recursive path! You can try it by yourself: http://archive.ubuntu.com/ubuntu/ubuntu/ubuntu/ubuntu, you can add as many ubuntu
as you wish, it will always work because on the actual filesystem ubuntu/ubuntu
is a symlink to itself! On a static file server, it might work with no problem but I certainly didn’t want my object store to be filled with duplicated data.
My next “oops” moment was when I submitted the mirror to the official Ubuntu archive mirrors list, to my surprise, the probes all failed. Indeed, I had completely forgot to support HEAD requests
Running the numbers
Let’s take a random package of the archive: mysql
and try to download it from the upstream archive maintained by Canonical, from my regional mirror and from my service:
$ curl -o /dev/null https://archive.ubuntu.com/ubuntu/pool/main/m/mysql-8.4/mysql-client-core_8.4.6-0ubuntu0.25.04.1_amd64.deb
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 2083k 100 2083k 0 0 7569k 0 --:--:-- --:--:-- --:--:-- 7576k
$ curl -o /dev/null https://fr.archive.ubuntu.com/ubuntu/pool/main/m/mysql-8.4/mysql-client-core_8.4.6-0ubuntu0.25.04.1_amd64.deb
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 2083k 100 2083k 0 0 5895k 0 --:--:-- --:--:-- --:--:-- 5885k
$ curl -o /dev/null https://ubuntu.gjolly.dev/ubuntu/pool/main/m/mysql-8.4/mysql-client-core_8.4.6-0ubuntu0.25.04.1_amd64.deb
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 2083k 100 2083k 0 0 1510k 0 0:00:01 0:00:01 --:--:-- 1510k
As you can see (average Dload
) my service is the worse, but wait! Let’s try again now:
$ curl -o /dev/null https://ubuntu.gjolly.dev/ubuntu/pool/main/m/mysql-8.4/mysql-client-core_8.4.6-0ubuntu0.25.04.1_amd64.deb
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 2083k 100 2083k 0 0 14.0M 0 --:--:-- --:--:-- --:--:-- 14.1M
Yay! Now that it’s in the cache, we are twice as fast as the official archive.
Let’s do that but with a bigger package, for example linux-modules-extra-6.14.0-22-generic_6.14.0-22.22~24.04.1_amd64.deb
which is 114MB
big:
$ curl -o /dev/null https://archive.ubuntu.com/ubuntu/pool/main/l/linux-hwe-6.14/linux-modules-extra-6.14.0-22-generic_6.14.0-22.22~24.04.1_amd64.deb
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 113M 100 113M 0 0 29.2M 0 0:00:03 0:00:03 --:--:-- 29.2M
$ curl -o /dev/null https://fr.archive.ubuntu.com/ubuntu/pool/main/l/linux-hwe-6.14/linux-modules-extra-6.14.0-22-generic_6.14.0-22.22~24.04.1_amd64.deb
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 113M 100 113M 0 0 2294k 0 0:00:50 0:00:50 --:--:-- 5368k
$ curl -o /dev/null https://ubuntu.gjolly.dev/ubuntu/pool/main/l/linux-hwe-6.14/linux-modules-extra-6.14.0-22-generic_6.14.0-22.22~24.04.1_amd64.deb
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 113M 100 113M 0 0 84.1M 0 0:00:01 0:00:01 --:--:-- 84.1M
But once again, after caching we get 134 MB/s
$ curl -o /dev/null https://ubuntu.gjolly.dev/ubuntu/pool/main/l/linux-hwe-6.14/linux-modules-extra-6.14.0-22-generic_6.14.0-22.22~24.04.1_amd64.deb
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 113M 100 113M 0 0 134M 0 --:--:-- --:--:-- --:--:-- 134M
Why this design makes sense
This approach only works well if more than one person uses the mirror in each Cloudflare edge location. If I am the only one hitting it, then most requests will go all the way back to the upstream archive. But if many users share the same edge cache, the benefit grows quickly: new packages are quickly added to the R2 bucket and stay cached on the edge, downloads become much faster: only the first client has to pay a “high” latency
So, if you want to try it, please do! It’s available at https://ubuntu.gjolly.dev/ubuntu. The more people use it, the better it works.