Today was, alas, not another day of getting life back in order and making productive forward progress. It was, instead, a day of babysitting a crashed AFS server. For some reason, one of our AFS servers started spewing the dreaded "CPS too many lockers" message and got very slow. While I was trying to move volumes off of it, our automated scripts decided that it had passed the point of no return in terms of connections waiting for a thread and shot it in the head.

It turned out that we were running a back rev of the server (1.4.10 instead of 1.4.11), which I'm hoping was most of the problem. That certainly explains why we were having volumes randomly go off-line, as that's a known bug that was fixed in 1.4.11. I need to accelerate my work on new Debian packages incorporating some critical fixes in OpenAFS stable and get new file server backports ready to deploy on our systems, and then we need to do a rolling upgrade starting with the 1.4.10 systems.

This was a really bad time to have more AFS problems.

On the good side, I confirmed that all the nastiest symptoms of an AFS file server restart can be fixed by using iptables to prevent the AFS file server from talking to any clients until it's finished attaching volumes. This should really be done upstream if possible, but in the meantime I have iptables rules that work and can write a remctl call that we can use to cut an AFS file server off from clients when it starts having problems and keep it cut off until it recovers.
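
A minimal sketch of what such a cutoff could look like, assuming the file server listens on the standard AFS fileserver port (UDP 7000) and that dropping inbound packets to that port is enough to isolate it from clients. The function names and the dry-run wrapper are hypothetical, and the real rules and remctl wiring may well differ.

```python
import subprocess

# Standard AFS fileserver port; assumed to be the only port that needs blocking.
FILESERVER_PORT = 7000


def fileserver_rule(action):
    """Build the iptables argv that blocks or unblocks client traffic.

    "block" inserts a DROP rule at the top of the INPUT chain;
    "unblock" deletes that same rule again.
    """
    flag = {"block": "-I", "unblock": "-D"}[action]
    return [
        "iptables", flag, "INPUT",
        "-p", "udp", "--dport", str(FILESERVER_PORT),
        "-j", "DROP",
    ]


def apply_rule(action, dry_run=True):
    """Print the command (dry run) or execute it; running it needs root."""
    argv = fileserver_rule(action)
    if dry_run:
        print(" ".join(argv))
    else:
        subprocess.run(argv, check=True)
    return argv


if __name__ == "__main__":
    apply_rule("block")    # before restarting the file server
    apply_rule("unblock")  # once it has finished attaching volumes
```

Something like this could sit behind a remctl command, with "block" run before the restart and "unblock" run once the volumes are attached.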

But I got absolutely nothing else done today, including responding to a critical HelpSU ticket that I need to get to first thing tomorrow. I need to send them a note tonight, in fact, to tell them that I'll do that. Tomorrow will likewise probably be devoured by AFS work; hopefully by Wednesday I can get back to the other things I was supposed to be doing.

But on the bright side, review writing and posting continues, which is making me feel less behind.

