< Lintian and Policy updates | Russ Allbery > Eagle's Path > June 2008 | Latest haul > |
My project for the past three days has been to upgrade our etch-based Debian build system to add support for Ubuntu hardy. We're primarily a Debian shop, but we want to use Ubuntu for the public timeshare systems due to its faster stable update cycle. In that environment, the latest and greatest software and active security support are both important simultaneously, and server-level stability isn't as vital. Ubuntu wins over other options because it's basically Debian and therefore shares with Debian the huge package selection, which is important for university timeshare systems. (We also use the external R package repository; we have a lot of heavy R users.)
First, FAI has changed a lot since the etch release, and hardy has a far more recent version of fai-client. Just bootstrapping an Ubuntu NFS root on an etch system and trying to install it with the etch FAI therefore doesn't work, due to the mismatch between what the server expects and the fai-client and fai-nfsroot packages. The FAI shipped with etch boots with a traditional read-only NFS root and works around the fact that its boot device is read-only. Current FAI does essentially the same trick that a live CD does, using unionfs to allow local modifications to the NFS root with a backing tmpfs file system, which is kind of scary but allows FAI to be much simpler.
The first thing I tried was debootstrapping Ubuntu (using the debootstrap from backports.debian.org) and then downgrading its fai-client and fai-nfsroot packages to the etch versions, but changes in other parts of the system (particularly mount) make that not very viable.
One of the nice things about FAI is that it's basically a collection of shell scripts, which means that it's architecture-independent and one can easily install the unstable packages on an otherwise-etch system. So I upgraded fai-server to the current Debian sid version of FAI and installed the sid versions of fai-nfsroot and fai-client into the Ubuntu NFS root after generating it with make-fai-nfsroot as normal. That works much better. However, there are a few Ubuntu-specific problems to work around (which are patched in the Ubuntu version of FAI, but I don't want to have to track down an Ubuntu system to generate the NFS roots when the rest of our infrastructure is etch):
Ubuntu hardy ships with a unionfs that can't write to existing files, only create new files. I'm not sure whose bug this is, possibly upstream, but etch unionfs doesn't have this problem. The workaround is to delete the files that FAI wants to overwrite out of the NFS root after generating it with make-fai-nfsroot, which you can do in a hook or by wrapping make-fai-nfsroot and doing the deletions afterwards (we do the latter because we have other local tweaks we make). You want to delete /etc/resolv.conf and /etc/syslog.conf, plus auth.log, syslog, daemon.log, kern.log, lpr.log, mail.*, user.log, debug, and messages in /var/log.
Ubuntu hardy defaults to using upstart, which completely fails in an FAI NFS root environment. I'm not sure why, but first upstart claims the respawn line of every file in /etc/event.d/* is a syntax error (?!) and, even if that's fixed, doesn't actually spawn a getty (at least on a serial console). The implication is that you can't Ctrl-C at the end of the build and work on the system before rebooting, which makes diagnosing problems very difficult. I tried a bunch of different possibilities for an event.d file (and wow is that ever either underdocumented or obscurely documented; I've still never found a man page or other document clearly explaining the possible configuration options) before giving up. The Ubuntu FAI package forces installation of sysvinit in the NFS root, and that works fine (with the normal inittab tweaking to spawn a getty on the serial console). upstart works fine on the newly built system, including serial console, so I'm not sure why the NFS root is broken.
Other than that, most of the work was upgrading the whole system to use the new boot method (changing to boot=live on the PXE kernel options, including live-initramfs in the NFS root, copying the kernel and initrd out of the NFS root into the tftp area instead of using fai-kernels, and adding initrd to the PXE kernel options). I'm now successfully building etch and hardy systems using our FAI setup and will probably upgrade the production service next week.
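The two PXE kernel option changes mentioned above (switching to boot=live and adding an initrd) amount to a small edit of the kernel option string. A minimal sketch, assuming the options are available as one space-separated string; the function name and the example values in the usage note are hypothetical, and only boot=live and the initrd= option come from the text:

```python
def add_live_boot_options(kernel_options, initrd_name):
    """Return kernel_options with boot=live and initrd=<initrd_name> set.

    Any existing boot= or initrd= setting is dropped first so the result
    never carries two conflicting values.
    """
    options = [
        opt for opt in kernel_options.split()
        if not opt.startswith(("boot=", "initrd="))
    ]
    return " ".join([f"initrd={initrd_name}", *options, "boot=live"])
```

For example, add_live_boot_options("ip=dhcp FAI_ACTION=install", "initrd.img-example") keeps the existing options and adds the two new ones; the initrd name there is a placeholder for whatever file was copied out of the NFS root into the tftp area.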
There is one other fairly serious Ubuntu design misfeature that is only tangentially related but that cost me a lot of time. Ubuntu has modified update-grub to make it use ucf, which breaks the traditional update-grub semantics in extremely strange and confusing ways that took hours for me to diagnose.
For those who aren't familiar with the details of how update-grub works, it uses comments in /boot/grub/menu.lst as a template to generate the kernel list at the end of menu.lst. It's not the best design, since it uses the same file as both a template and the active configuration file, but it works, and has worked the same way for many years. It's simple and reliable: you only change the defaults in the comments and then update-grub adds in the kernel-specific information and generates the active menu for you. If you want to edit the kernel list yourself for some reason, don't use update-grub; it's completely optional.
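To make the template idea concrete, here is a deliberately simplified model of that behavior. It is not update-grub's code: the marker string, the stanza layout, and the function name are stand-ins, and the real script handles many more options than the single kopt comment modeled here.

```python
# Stand-in for the marker that separates the commented defaults from the
# generated kernel list (an assumption, not the real marker text).
MARKER = "### KERNEL LIST (generated) ###"


def regenerate(menu_lst, kernels):
    """Rebuild the kernel list at the end of menu_lst from its comments.

    Everything before MARKER is treated as the template and kept as-is;
    everything after it is thrown away and regenerated.
    """
    head = menu_lst.split(MARKER)[0].rstrip("\n")

    # Defaults live in comments; the last "# kopt=" line wins.
    kopt = ""
    for line in head.splitlines():
        if line.startswith("# kopt="):
            kopt = line[len("# kopt="):]

    stanzas = [
        f"title kernel {version}\nkernel /boot/vmlinuz-{version} {kopt}".rstrip()
        for version in kernels
    ]
    return "\n".join([head, MARKER, *stanzas]) + "\n"
```

In this model, editing the "# kopt=" comment and calling regenerate() again rewrites every kernel stanza with the new options, and running it twice in a row with no changes produces the same file. That is the property the traditional workflow relies on.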
To see how using ucf breaks this, remember that ucf's goal is to preserve local changes to a configuration file, but its notion of local changes is fundamentally contrary to how menu.lst works. Suppose that you're a large organization with a lot of systems with a variety of kernel revisions, for one reason or another, but with a desire to have uniform kernel parameters like console speed and timeouts. With update-grub, you can just copy your menu.lst template to all of your systems and run update-grub afterwards, and update-grub will apply your new defaults and generate a new kernel list using them. However, with Ubuntu's update-grub, it generates the initial menu.lst and kernel list, and then when you replace that with a fresh template, ucf sees that you've removed the kernel list. It thinks that's a local modification, and when update-grub generates a new kernel list, it "preserves" your modification and leaves the kernel list out of the new menu.lst. Tada, instant unbootable system.
Even worse, this bug is sticky: you can re-run update-grub all you want, it succeeds without errors, and it keeps generating an empty menu.lst. This is a completely mystifying error until you dig into the details of what's going on. To recover, you have to run ucf --purge /var/run/grub/menu.lst (and that path, or for that matter any of the ucf stuff, is completely undocumented in the update-grub man page) and then re-run update-grub, at which point ucf will display a debconf prompt (which doesn't support debconf preseeding, but that's another problem). If you then select three-way merge (not the default) from that prompt, you get a correct menu.lst.
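The failure and its stickiness can be reproduced in a toy model. This is emphatically not ucf's or update-grub's actual code. It assumes only that the ucf layer remembers a hash of the last version it installed and, when the file on disk no longer matches that hash, quietly keeps the file on disk. All names here are invented, and the toy skips the debconf prompt entirely.

```python
import hashlib


def digest(text):
    return hashlib.sha256(text.encode()).hexdigest()


class ToyUcf:
    """Keeps the on-disk file whenever it differs from the last install."""

    def __init__(self):
        self.recorded = None  # hash of the last generated file installed

    def install(self, generated, on_disk):
        """Return the content that ends up on disk."""
        unmodified = on_disk is None or digest(on_disk) == self.recorded
        if self.recorded is None or unmodified:
            self.recorded = digest(generated)
            return generated
        # The file changed behind our back: treat it as a local
        # modification and preserve it, discarding the generated version.
        return on_disk

    def purge(self):
        self.recorded = None


def generate(template, kernels):
    """Toy update-grub: the template followed by one line per kernel."""
    return template + "".join(f"kernel /boot/vmlinuz-{k}\n" for k in kernels)


ucf = ToyUcf()
kernels = ["2.6.24-19"]
template = "# defaults: console=ttyS0,9600\n"

first = ucf.install(generate(template, kernels), None)    # kernel list present
copied = template                                         # fresh template copied over menu.lst
second = ucf.install(generate(copied, kernels), copied)   # kernel list missing
third = ucf.install(generate(second, kernels), second)    # still missing on re-run
ucf.purge()
fixed = ucf.install(generate(third, kernels), third)      # kernel list is back
```

In the toy, second and third are just the bare template with no kernel lines, however many times the generation step is repeated, and only purging the recorded state lets the generated list through again.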
I'm not sure why Ubuntu made this change, but as near as I can tell, it's fundamentally architecturally broken. The specific problem I had was that we were copying our template over and re-running update-grub in a post-install script, resulting in a system that always had an empty kernel menu and wouldn't boot. Once I finally figured out what was going on, I worked around it by being very careful to put our template in place before the initial system bootstrap and then never changing menu.lst afterwards, but this is a bad limitation for a large site that scales system administrators by automating configuration file updates. But maybe I'm missing some subtlety. I'll go file a Launchpad bug shortly after posting this and we'll see what happens.
Posted: 2008-06-12 23:47