AFS Server Balancing

"I mean, don't you think that's overkill?"
"No. I think it's just enough kill."

"What's My Line, Part 1", Buffy the Vampire Slayer

Table of Contents

Motivation
Available Balancing Input
The Algorithm
Implementing the Results
Problems
Future Work
Credits

Motivation

If you have more than one AFS server, you probably have had to worry about balancing. As volumes grow and shrink, and as you create volumes, delete volumes, and grant more quota, your perfect distribution of volumes gets out of balance. Some servers have lots of volumes but lots of free space. Other servers are constantly running out of space, but aren't handling many file accesses. New volumes were created on one server for too long, and now it's constantly running out of space as those volumes are used and grow. You need to figure out how to rearrange your AFS volumes so that each of your AFS servers is handling a proportional share of the load.

There are many different solutions and tools for this problem, as well as tools to help its onset. Volume creation tools that can choose partitions with the most free space (such as volcreate) can help, although they risk putting too many new volumes in the same place. Balancing can be done by feel, looking for partitions with too many volumes or not enough free space and shuffling volumes around. There are lots of tools to help apply various heuristics, sometimes on an automated basis. Haven't you wondered, though, what a true server balance would look like?

This paper explains how to phrase AFS server balancing as a linear programming problem and then use the tools available for solving linear programming problems to find, within a set of constraints, the best available balance of AFS volumes. This is probably overkill for small or medium sites. Among other things, it generates large problems that may require commercial optimizers to handle. It has the advantage, though, of taking most of the guesswork and uncertainty out of the problem and reducing it to a mechanical process with a high probability of success.

The software that we use to implement the algorithm and techniques described below is available from the afs-balance distribution page.

Available Balancing Input

The first thing to decide is what criteria to use to balance the servers. How do you know when they're balanced? In our experience, there are four useful quantities that one may wish to balance: total usage, total accesses, total quota, and volume count.

Total usage is obvious. A full AFS server partition is an emergency and causes a noticeable outage for all writes to volumes on that server. Total accesses matters somewhat less and may not matter at all for some uses of AFS, but AFS servers can become slow or unstable under heavy load and the best performance obviously comes from spreading out the access patterns. Total quota and volume count both measure the potential for future growth in usage, and hence the chance that a partition will become too full in the future. Volume count also has performance implications, particularly when restarting an AFS server after a crash.

While it would be ideal to make all those quantities exactly equal across all server partitions, that's more possible with some of them than others. Accesses in particular can vary widely, the access information available may not be particularly accurate, and an exact access balance may be meaningless in a week as access patterns shift. A closer balance of usage is somewhat more useful, but balancing usage to the byte is pointless and unnecessary. Therefore, another decision is how much fudge or wiggle room to allow in the balance and still consider the servers balanced.

Finally, not all servers are created equal. Some servers are only for particular uses, have smaller disks, or have slower I/O buses. While this can be partly addressed by only balancing between like servers, it's useful to be able to mix various different storage capacities together into a single pool of servers.

The solution that has worked well for us is to assign a weight to each server partition equal to its proportion of the total server space that is being balanced. While this isn't perfect (a fast server that can handle a lot of accesses may have a small disk), we've found that newer servers with faster processors usually also have more disk space, so disk space is a good approximation to the amount of load that a server can handle. We then balance volume count as exactly as possible, putting at least as many volumes on each server partition as the total number of volumes originally in the set of servers being balanced times the proportion that partition represents of the total disk space, rounded down. We do the same thing for usage and accesses, but there we allow the balance to vary by up to 2.5%. (In practice, accesses sometimes need more variation than that and usage could use less, but this seems to almost always work.)

Note that, somewhat counterintuitively, we balance volume count the most closely of any of the parameters. Partly this is because one can achieve a perfect volume count balance, whereas a perfect balance is less meaningful for the other parameters. We've also found that volume count is a better predictor of future usage growth than total quota for our cell and for the way that we assign quota. Volume count may also be a better indicator of accesses, given how variable accesses can be.

Other decisions and weightings are certainly possible, and once the balancing software has been set up, it's fairly easy to experiment. Different approaches work better for your cell, depending on how you allocate volumes and quota.

To gather all of this data, you can use vos listvol -long, which will give you usage and the access count for the current day. We pull the data out of our AFS reporting database. The more historical access data that you can average together, the more accurate representation of access patterns you can achieve (and therefore the better the balance will be).

The Algorithm

Once you've decided on your balancing criteria, the balancing problem has to be phrased as a linear programming problem. Briefly and simplistically, this means that the problem must be phrased as a set of constraints, a set of data that can be modified, and an optimization goal subject to those constraints.

The constraints are the definition of the balance. Let the weight of each server partition be the proportion of the total available space in the set of servers to be balanced represented by that partition. For the balancing discussed above, the constraints would then be:

The input data that will be changed as the result of the balancing is the current locations of all of the volumes. This can be written as a matrix of boolean variables (a table where each cell is either 0 or 1), with the server partitions as the columns and the volumes as the rows. Each row (volume) will then have a 1 in one and only one column, representing its current location. All of the other columns will be 0.

We have to tell the optimizer about that constraint as well so that it will be maintained in the solution, so we add:

The job of the optimizer will be to find a new value for that location matrix such that all of the constraints are satisfied.

There will probably be multiple solutions, though, and not all of them are equal. We'd rather not move every affected volume around if we can help it. We therefore also add an optimization goal:

This is now a linear programming problem suitable for handing to an appropriate optimizer, once the above is rephrased in mathematics.

This is a fairly straightforward problem and could be solved with several different algorithms, but balancing a large AFS cell tends to generate extremely large problems. It's likely that the problem will overwhelm any optimizer that has non-commercial or personal use limits or isn't well-optimized. We use the commercial ILOG CPLEX optimizer since it's fast, it can handle large problems, and Stanford University already had a license for it. ILOG CPLEX comes with AMPL as a front-end, which provides a simple language for describing these sorts of problems. The above constraints and optimization goal can be expressed in less than a dozen lines of AMPL model (excluding variable definitions).

afs-balance will generate the input data suitable for this problem. The AMPL models that we use are also available from that page. For a complete mathematical statement of the above algorithm, please see the accompanying paper in PDF or TeX.

Implementing the Results

After the optimizer churns away for a while, it will (hopefully) produce a result that implements the desired balance. The output will be a new location matrix, and the next step is to make the placement of the volumes match the new location matrix.

(Actually, it's more useful if the optimizer outputs a matrix that specifies what volume moves are necessary, rather than just a matrix of volume locations that has to be compared against the input matrix. This can be done easily by applying a mathematical trick and telling the optimizer to output, for each cell in the matrix, the product of the value of that cell in the final location matrix with 1 minus the value of that cell in the original matrix. This will put a 1 only in those cells that correspond to the final destination of a volume that had to move.)

We process the optimizer output with a Perl script (afs-balance run in a different mode) that generates an input file for mvto, a script for doing AFS volume moves that can take an input file consisting of a list of volumes and the locations to move those volumes to. You will need some sort of script like this to take care of the different steps that have to be done for different types of volume moves (for example, moving a replica may require vos addsite and a volume release, you may wish to recreate the backup volume for a read/write volume after moving it, etc.). mvto is useful because it already knows how to handle all of those operations.

Also be careful of the order of volume moves. It's not uncommon for the optimizer to move one large, low-access volume from one partition to another, while moving several smaller, high access volumes the other direction. Depending on the ordering of the moves, it's possible to overfill a partition in an intermediate step even if the end result would be balanced. mvto will check and be sure not to overfill a partition, warning you about this problem so that you can reorder the volume moves by hand or use a free partition as a temporary location.

We normally move volumes whenever is the most convenient, but depending on your cell usage patterns and your sensitivity for the short user-noticeable "busy volume" and slowness periods in a volume move, you may wish to wait for an idle or slow time for your cell to implement the balance.

Problems

There are two common ways in which the optimizer can fail: it can determine there is no feasible solution, or it can take too long.

It's possible that there is no feasible solution given the constraints. This generally happens when there are one or two volumes with significantly more accesses (or higher usage, although this is less common) than all of the other volumes, and there is therefore no arrangement of volumes that will balance out that quantity. There are a few different ways that one can deal with this:

A more common problem is that the optimizer detects there is a possible solution but just takes too long to find it. We found that on our relatively slow SPARC compute servers, ILOG CPLEX could handle up to about ten to twelve partitions (around 6,000 volumes) in a reasonable amount of time (10 hours, or overnight). Faster compute servers and newer versions of the software than we were using could probably handle larger problems.

The first solution is to tell the optimizer to not find the optimal solution to the problem. We normally instruct the optimizer to stop after finding two solutions and return the best of the two, since the second solution is often considerably better than the first. Even finding one result can still take too long, though.

Another way of addressing this is to just attack a smaller problem. Rather than balancing your entire cell, balance half of it at a time. If you pick a reasonable mix of heavily-loaded and lightly-loaded servers for each separate balance, the result won't be too far off from what you would have gotten by balancing the whole cell.

A more comprehensive solution, particularly if your servers have a fairly large number of partitions, is to observe that, in most cases (particularly if your storage devices have a significant RAM cache), it only makes sense to balance accesses across servers, not across partitions. The software we use for balancing can, in addition to doing a full balance across all the provided partitions, perform a balance by server and a balance by partition. The server balance still uses all of the constraints, but it balances only at the granularity of a server, reducing the number of locations that the optimizer has to worry about. We then implement that partial solution, using a simple algorithm like moving volumes onto the partitions of their destination servers with the most free space, and then do a second balance for each individual server that ignores access count and just evens out the usage and volume count across the partitions. Those follow-up individual server balances are extremely fast to calculate since there is one fewer constraint.

The obvious drawback of this two-phase approach is that it takes longer and many volumes will have to be moved twice, once to get them on the right server and then again to put them on the right partition. If you have the horsepower to calculate it, a full balance is more efficient.

Future Work

One obvious possibility for this sort of balancing is to make it automatic. Given sufficient safeguards and scheduling of the volume moves, there's no reason why this sort of process couldn't run unattended, regularly redistributing volumes for the best balance. The constraints might have to be chosen carefully so as to not make too many unnecessary volume moves, but then AFS administrators won't have to think about server balance at all.

Balancing read-only replicas well is somewhat more complex, since in addition to server-side issues one often also wants to maintain geographic distribution of replicas, put replicas next (network-wise) to the servers that use them, and avoid moving replicas of volumes with unreleased changes. An intelligent algorithm for balancing replica sites would require more inputs and either more constraints or a more carefully chosen set of affected server partitions than the simpler read/write case.

It would be useful to get this balancing method working entirely with free software. We used ILOG CPLEX because we already had it and it could handle the problem, but there may be free software available that can handle or could be made to handle this.

Credits

This approach to AFS server balancing, the AMPL model, and the original scripts to generate the AMPL input and parse the solution were all developed by Neil Crellin. I have since rewritten the scripts, taught the scripts to use mvto to implement the result, converted the database access over to Oracle and DBI, and implemented the by-server and by-partition two-phase balance so that we could apply this technique to larger problems.

Last modified and spun 2014-10-19