AFS Reporting Database

Table of Contents

Motivation
Database Schema
Populating the Database
Reporting Possibilities
Future Work
Credits

Motivation

In a large AFS cell, there is a great deal of meta-information about AFS volumes that AFS administrators need to be aware of on a regular basis, both for administration and for reporting: the locations of particular volumes, the sizes of volumes, the number of volumes on different servers and partitions, access patterns, replicated volumes that were released, and volumes that have unreleased changes, etc. All of this information is available either from the VLDB or from the volserver on the individual file servers, but the ability to query those services is limited. Only a few queries are possible, and some information (such as comprehensive information from the volserver on each file server, needed for finding stray copies of volumes and gathering access information) is expensive for the server to provide.

We also found that for most administrative and reporting tasks, current information isn't strictly necessary. As long as the information was regularly updated, most questions could be answered to a close enough approximation by looking at data that is a day or two old. This led to the idea of extracting all of the data that we might care about while the cell is mostly idle and then making it available for queries and reporting without having to query the production AFS servers.

This paper describes the Oracle database that we set up to hold this nightly dump of AFS meta-information, the method that we use to populate that database, and some of the reports that we run against it. Also discussed are some limitations and caveats, and possible future work.

Database Schema

We used Oracle as the SQL database, but any full-featured SQL database should be fine. We previously used Sybase in the past and changed to Oracle because of external reasons, not because Sybase was in any way insufficient. We did, however, find sub-selects to be extremely useful for many of the queries that we want to perform, so MySQL is only suitable if you're using a new enough version to have good sub-query support.

The core of the database is the volumes table, which stores the available meta-information for a particular volume. We started with the information available from vos listvol -long, since that contains all of the information that we needed for the reports we wanted at the time. We experimented some with looking at the additional information provided by vos listvol -extended, mostly because the file count seemed interesting, but it turned out that the file count (at least for the AFS servers that we were running, using the inode file server on Solaris) wasn't accurate. The access information in -extended is more complete, but the simple access count available from -long was sufficient for our purposes.

The DDL for the volumes table is:

    create table volumes
       (volname             varchar2(31),
        volid               number,
        type                varchar2(2),
        used                number,
        status              varchar2(8),
        server              varchar2(8),
        part                varchar2(8),
        rwid                number,
        roid                number,
        bkid                number,
        quota               number,
        created             date,
        updated             date,
        accesses            number);

Most of this is fairly straightforward. The type is the two-character volume type string (one of RO, RW, or BK). The status is either "On-line" or "Off-line" (taken from the first line of the vos listvol -long output). You may need to increase the size of the server field; we use a standard naming convention for all of our AFS servers that happens to fit into eight characters.

We used vos listvol information exclusively to populate this database, which means that the roid field is not reliably populated depending on where the volume and its replica are located. This turned out not to be a problem for reporting purposes.

Since many of the reports we wanted to run were on changes to the AFS cell, we also created another table with the identical schema named previous_volumes. Each night, before loading the new data, the load script truncates the contents of the previous_volumes table, moves all of the data from volumes to previous_volumes, truncates the volumes table, and then loads the new data dump into volumes. We can then run reports that compare volumes to previous_volumes, allowing us to report on volume moves, volume deletions, and volume renames.

We also loaded our mount point tracking database, which is maintained as a Berkeley DB file, into Oracle each night as well for easier reporting. The DDL for that table is:

    create table mountpoints
       (volname             varchar2(31),
        path                varchar2(255),
        volid               number);

Depending on the organization of your cell, you might need to increase the size of the path column. (Subsequently, in work not discussed here, we changed the mount point system to store data directly in the database without using a Berkeley DB file.)

Most of the queries that we did were by volume, so we also added the following indices:

    create index curr_volumes_index on volumes(volname);
    create index prev_volumes_index on previous_volumes(volname);
    create index mtpt_volumes_index on mountpoints(volname);

There are no useful referential integrity constraints for this schema.

Finally, there are a few views that proved extremely useful for reporting. The first finds all replicated volumes with unreleased changes:

    create view unreleased_changes
        as select distinct a.volname
                      from volumes a, volumes b
                     where a.type = 'RW'
                       and a.roid = b.volid
                       and b.type = 'RO'
                       and a.updated > b.updated;

We created a similar view named old_unreleased_changes that is identical except applied to previous_volumes instead. This finds all volumes with read-only replicas where the read/write volume has been updated more recently than at least one of the replicas.

This view:

    create view volume_count
        as select volname, count(*) as count
             from volumes
            where type != 'RO'
         group by volname;

will show volume names with the count of their occurrences for all read/write and backup volumes, and querying it for volumes with a count greater than one will show all volumes that appear to be on more than one server. Sometimes a vos move operation doesn't complete successfully and leaves an unreferenced copy of the volume behind; these old copies don't hurt anything, but they take up space and are worth cleaning up with vos zap.

Finally, this view:

    create view partition_count
        as select server, part, type, count(*) as count
             from volumes
         group by server, part, type;

will tell you how many volumes of any given type are on a particular server and partition and can be used to find servers or partitions that are overloaded or underutilized.

Here is the complete DDL as a text file, to save you the effort of cutting and pasting.

Populating the Database

To populate the database nightly, we created a file in AFS that lists all of our active file servers. Each night, a Perl script (afsdb-load) goes through all the servers in that file, uses vos partinfo to get a list of partitions on that server, and then runs vos listvol -long on that server and partition, parsing the results and writing them out as a semicolon-separated data file. We then used Oracle's sqlldr utility (and previously used Sybase's bcp utility) to bulk-load the data into the database.

Be aware that vos listvol -long is a fairly intensive operation for the volserver and prior to OpenAFS 1.3 the volserver is single-threaded. This means that vos move and vos release operations will delay or be delayed by vos listvol, sometimes to the point of timing out depending on the number of volumes on the server and how fast the volserver manages to be. Our script retries the vos listvol command up to ten times before giving up to deal with this timeout issue. You want to try to do this data dump during a fairly idle time for your cell.

We kicked off this job at 18:30 because backups started shortly after midnight and want to use the database load to provide information to our backup system. With that starting time, in a cell with 20 AFS servers, 119,000 volumes (including backup volumes), and over 3.5TB of used space, the database load process took between 3 and 3.5 hours.

We did all of the vos listvol commands sequentially since we've never needed faster load times. One obvious optimization would be to do each server in parallel at the cost of slowing down more of the cell at the same time, since the volserver accesses are independent. This requires more sophistication in the Perl script driver, but should be easily doable.

Note that the access count on volumes resets daily but at a different schedule than your nightly dump, so the access count should be taken only as an estimate. Also, since the data dump takes some time to complete, volumes on servers done towards the end may see higher access counts that volumes done near the beginning even for the same number of daily accesses. We didn't find this to be a significant problem, but it's something to be aware of. Similarly, if volumes are being moved or released while the data dump is being taken, sometimes odd things will show up in the data that will go away the next day when the volume is no longer changing.

Reporting Possibilities

Once you have all of the data available for SQL queries, it's pretty easy to come up with interesting reports that you want to run. Here are the ones that we found useful over the years:

For all of the SQL that implements the above, as well as the driver script that we used to do all of our reporting queries, see the AFS reporting database software page.

We also provided an interface for AFS administrators to do arbitrary ad hoc queries from the command line without needing to use sqlplus (afsdb-query), and those queries were often used for such purposes as getting the list of volumes on a particular server for a server evacuation or mass migration (followed up with vos listvol or vos listvldb to catch any stragglers added in the past day). We made this script available via a remctl interface. This data was also used to find all volumes modified in the past day to feed as input to our backup system so that it doesn't have to think about volumes that haven't been modified, and to gather information for use with AFS server balancing.

Future Work

As mentioned above, running vos listvol in parallel, one for each server, should be possible and would likely result in a significant decrease in the time it takes to do the nightly data load. It may also be interesting to augment the data with data from vos listvldb, primarily the read-only ID associated with a given replicated read/write volume, although we didn't find a use for that data.

Significantly more information is available about each volume, particularly low-level performance information, by using vos listvol -extended instead of vos listvol -long. We didn't analyze that data enough to know if it would be useful.

Keeping only two days of information isn't sufficient for some interesting types of reporting, particularly for the often highly variable access count. For applications such as server balancing, it would be more useful to keep a longer history and a moving average (or some other longer-term estimate) of accesses and possibly used space. Additionally, it would be useful to dump the results of some queries into an RRD database for long-term trend graphing using rrdtool or a similar application.

At the time this document was written, we only gathered information about volumes, but there may also be some useful information to gather about users and groups from the ptserver (at least for metrics and reporting purposes).

Credits

The original concept, the initial database schema, and all of the initial scripts for loading and querying the data in Sybase were developed by Neil Crellin. Further refinements of the scripts were done by Xueshan Feng. Jonathan Pilat and I converted the database to Oracle and rewrote and expanded the supporting scripts and the number of reports significantly in the process. I subsequently added new reports based on suggestions from Xueshan Feng, Jonathan Pilat, and Alice Ohgi.

You may do anything you wish with the contents of this page. Or, put formally: The authors hereby relinquish any claim to any copyright that they may have in this work, whether granted under contract or by operation of law or international treaty, and hereby commit to the public, at large, that they shall not, at any time in the future, seek to enforce any copyright in this work against any person or entity, or prevent any person or entity from copying, publishing, distributing or creating derivative works of this work.

Last modified and spun 2014-08-18