FAI and Ubuntu

My project for the past three days has been to upgrade our etch-based Debian build system to add support for Ubuntu hardy. We're primarily a Debian shop, but we want to use Ubuntu for the public timeshare systems due to its faster stable update cycle. In that environment, the latest and greatest software and active security support are both important simultaneously, and server-level stability isn't as vital. Ubuntu wins over other options because it's basically Debian and therefore shares with Debian the huge package selection, which is important for university timeshare systems. (We also use the external R package repository; we have a lot of heavy R users.)

First, FAI has changed a lot since the etch release, and hardy has a far more recent version of fai-client. Just bootstrapping an Ubuntu NFS root on an etch system and trying to install it with the etch FAI therefore doesn't work: the server's expectations don't match the fai-client and fai-nfsroot packages in the NFS root. The etch version of FAI boots with a traditional read-only NFS root and works around the fact that its boot device is read-only. Current FAI does essentially the same trick that a live CD does, using unionfs with a backing tmpfs file system to allow local modifications to the NFS root, which is kind of scary but allows FAI to be much simpler.
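The copy-on-write idea behind the newer boot method can be modelled in a few lines of Python. This is only an analogy, not how unionfs is implemented: the two dictionaries stand in for the read-only NFS root and the tmpfs layer, and the file names and contents are made up for illustration.

```python
from collections import ChainMap

# Stand-in for the read-only NFS root (hypothetical contents).
nfs_root = {"/etc/hostname": "nfsroot", "/etc/fstab": "# empty"}

# Stand-in for the tmpfs layer that receives all writes.
tmpfs = {}

# Lookups check tmpfs first, then fall back to the NFS root;
# writes always land in the first mapping (tmpfs).
union = ChainMap(tmpfs, nfs_root)

union["/etc/hostname"] = "install-client"   # a "local modification"

print(union["/etc/hostname"])     # the modified copy held in tmpfs
print(union["/etc/fstab"])        # untouched file, read through to the NFS root
print(nfs_root["/etc/hostname"])  # the NFS root itself is unchanged
```

The installer sees a writable root, while the shared NFS export is never touched. The model ignores deletions, which a real union file system has to track separately.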

The first thing I tried was debootstrapping Ubuntu (using the debootstrap from backports.debian.org) and then downgrading its fai-client and fai-nfsroot packages to the etch versions, but changes in other parts of the system (particularly mount) make that not very viable.

One of the nice things about FAI is that it's basically a collection of shell scripts, which means that it's architecture-independent and one can easily install the unstable packages on an otherwise-etch system. So I upgraded fai-server to the current Debian sid version of FAI and installed the sid versions of fai-nfsroot and fai-client into the Ubuntu NFS root after generating it with make-fai-nfsroot as normal. That works much better. However, there are a few Ubuntu-specific problems to work around (which are patched in the Ubuntu version of FAI, but I don't want to have to track down an Ubuntu system to generate the NFS roots when the rest of our infrastructure is etch):

Other than that, most of the work was upgrading the whole system to use the new boot method (changing to boot=live on the PXE kernel options, including live-initramfs in the NFS root, copying the kernel and initrd out of the NFS root into the tftp area instead of using fai-kernels, and adding initrd to the PXE kernel options). I'm now successfully building etch and hardy systems using our FAI setup and will probably upgrade the production service next week.
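As a rough illustration of the PXE side of that change, here is a small Python sketch that renders a PXELINUX entry carrying the two additions named above, initrd= and boot=live. The label, the file names, and the extra ip=dhcp option are placeholders rather than values from this setup; a real FAI entry needs further site-specific options (NFS root location, FAI flags and so on) that are not shown.

```python
def pxe_entry(label, kernel, initrd, extra_options=()):
    """Render one PXELINUX menu entry as text."""
    # initrd= and boot=live are the two additions the new boot method needs;
    # everything in extra_options is site-specific and assumed here.
    append = ["initrd=" + initrd, "boot=live", *extra_options]
    return "\n".join([
        "DEFAULT " + label,
        "LABEL " + label,
        "  KERNEL " + kernel,
        "  APPEND " + " ".join(append),
        "",
    ])

# Hypothetical file names, relative to the tftp root.
entry = pxe_entry("fai-hardy", "hardy/vmlinuz", "hardy/initrd.img", ["ip=dhcp"])
print(entry)
```

Generating the entry from the same paths used when copying the kernel and initrd into the tftp area keeps the two from drifting apart.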

There is one other fairly serious Ubuntu design misfeature that is only tangentially related but that cost me a lot of time. Ubuntu has modified update-grub to make it use ucf, which breaks the traditional update-grub semantics in extremely strange and confusing ways that took hours for me to diagnose.

For those who aren't familiar with the details of how update-grub works, it uses comments in /boot/grub/menu.lst as a template to generate the kernel list at the end of menu.lst. It's not the best design, since it uses the same file as both a template and the active configuration file, but it works, and has worked the same way for many years. It's simple and reliable: you only change the defaults in the comments and then update-grub adds in the kernel-specific information and generates the active menu for you. If you want to edit the kernel list yourself for some reason, don't use update-grub; it's completely optional.
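A toy model of that template mechanism, in Python, may make the flow clearer. It is a deliberate simplification and not update-grub's actual code: the marker strings and the kopt and groot option names are modelled on a typical Debian menu.lst and should be treated as assumptions, and the kernel version and stanza layout are invented for the example.

```python
import re

END_DEFAULTS = "## ## End Default Options ##"
END_LIST = "### END DEBIAN AUTOMAGIC KERNELS LIST"

def regenerate(menu_lst, kernels):
    """Rebuild the kernel list from the defaults stored in comments."""
    # Keep everything up to and including the defaults block (the template).
    head = menu_lst.split(END_DEFAULTS)[0] + END_DEFAULTS + "\n"
    # Defaults look like "# kopt=root=/dev/sda1 ro": one '#', then key=value.
    defaults = dict(re.findall(r"^# (\w+)=(.*)$", head, flags=re.M))
    stanzas = []
    for version in kernels:
        stanzas.append(
            f"title  Linux {version}\n"
            f"root   {defaults['groot']}\n"
            f"kernel /boot/vmlinuz-{version} {defaults['kopt']}\n"
        )
    # Any old kernel list is discarded and replaced by the new stanzas.
    return head + "\n" + "\n".join(stanzas) + END_LIST + "\n"

template = (
    "timeout 5\n"
    "### BEGIN AUTOMAGIC KERNELS LIST\n"
    "# kopt=root=/dev/sda1 ro console=ttyS0,9600\n"
    "# groot=(hd0,0)\n"
    "## ## End Default Options ##\n"
)
generated = regenerate(template, ["2.6.24-19-server"])
print(generated)
```

Feeding the generated file back into regenerate gives the same result, which is the property the traditional workflow relies on: the commented defaults are the only input, and the kernel list is always derived from them.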

To see how using ucf breaks this, remember that ucf's goal is to preserve local changes to a configuration file, but its notion of local changes is fundamentally contrary to how menu.lst works. Suppose that you're a large organization with a lot of systems with a variety of kernel revisions, for one reason or another, but with a desire to have uniform kernel parameters like console speed and timeouts. With update-grub, you can just copy your menu.lst template to all of your systems and run update-grub afterwards, and update-grub will apply your new defaults and generate a new kernel list using them. Ubuntu's update-grub, however, generates the initial menu.lst and kernel list, and when you then replace that with a fresh template, ucf sees that you've removed the kernel list. It thinks that's a local modification, and when update-grub generates a new kernel list, it "preserves" your modification and leaves the kernel list out of the new menu.lst. Tada, instant unbootable system.
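The failure can be reproduced with a crude stand-in for ucf. The Python sketch below is an assumption-laden model, not ucf's real algorithm: it only remembers a hash of the last file it installed and keeps whatever is on disk if that hash no longer matches. The class name and file contents are invented for the example.

```python
import hashlib

def digest(text):
    return hashlib.md5(text.encode()).hexdigest()

class ToyUcf:
    """Crude stand-in for ucf: remembers the last file it installed."""

    def __init__(self):
        self.recorded = None  # hash of the last generated file we installed

    def install(self, generated, on_disk):
        """Return the contents that end up on disk."""
        if self.recorded is None or digest(on_disk) == self.recorded:
            # No local changes detected: install the newly generated file.
            self.recorded = digest(generated)
            return generated
        # File differs from what we installed: treat it as a local
        # change and keep it, discarding the generated version.
        return on_disk

ucf = ToyUcf()
template = "# kopt=root=/dev/sda1 ro\n"
kernels = "title Linux\nkernel /boot/vmlinuz\n"

disk = ucf.install(template + kernels, on_disk="")    # initial install
disk = template                                       # admin copies a fresh template
disk = ucf.install(template + kernels, on_disk=disk)  # update-grub runs again
print("kernel" in disk)   # False: the kernel list was "preserved" away
```

Because the recorded hash never changes in the failing branch, calling install again produces the same kernel-less file every time.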

Even worse, this bug is sticky: you can re-run update-grub all you want, it succeeds without errors, and it keeps generating an empty menu.lst. This is a completely mystifying error until you dig into the details of what's going on. To recover, you have to run ucf --purge /var/run/grub/menu.lst (and that path, or for that matter any of the ucf stuff, is completely undocumented in the update-grub man page) and then re-run update-grub, at which point ucf will display a debconf prompt (which doesn't support debconf preseeding, but that's another problem). If you then select three-way merge (not the default) from that prompt, you get a correct menu.lst.
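Since update-grub reports success either way, a post-install check has to look at the file itself. The following Python sketch shows one way to do that and to script the recovery commands described above. The helper names are made up, the "title" test is a heuristic for spotting boot entries, and the debconf prompt still has to be answered by hand.

```python
import subprocess

def has_kernel_entries(menu_lst_text):
    """True if any non-comment line starts a boot entry."""
    return any(line.split()[:1] == ["title"]
               for line in menu_lst_text.splitlines())

def recover(run=subprocess.run):
    """Forget ucf's recorded state, then regenerate menu.lst."""
    # Path and option as given in the text; update-grub will then
    # show a debconf prompt where three-way merge must be selected.
    run(["ucf", "--purge", "/var/run/grub/menu.lst"], check=True)
    run(["update-grub"], check=True)

# Dry run: record the commands instead of executing them.
calls = []
recover(run=lambda cmd, check: calls.append(cmd))
print(calls)
print(has_kernel_entries("# title example only\n"))
```

On a real system one would read /boot/grub/menu.lst, call has_kernel_entries on its contents, and call recover() with the default runner only when the check fails.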

I'm not sure why Ubuntu made this change, but as near as I can tell, it's fundamentally architecturally broken. The specific problem I had was that we were copying our template over and re-running update-grub in a post-install script, resulting in a system that always had an empty kernel menu and wouldn't boot. Once I finally figured out what was going on, I worked around it by being very careful to put our template in place before the initial system bootstrap and then never changing menu.lst afterwards, but this is a bad limitation for a large site that scales system administrators by automating configuration file updates. But maybe I'm missing some subtlety. I'll go file a Launchpad bug shortly after posting this and we'll see what happens.

Posted: 2008-06-12 23:47

Last modified and spun 2013-07-22