Log parsing and infinite streams

I have a problem I have to solve for work that involves correlating Apache access and error logs. Part of WebAuth logs successful authentications to the Apache error log, and I want to correlate User-Agent strings with users so that we can figure out what devices our users are using by percentage of users rather than percentage of hits. The problem, as those who have tried to do this prior to Apache 2.4 know, is that Apache doesn't provide any easy way to correlate access and error log entries (made even more complex because two separate components are involved).

I could have just hacked something together, but I've written way too many ad hoc log parsers, and, see, I was reading this book....

The book in question is Mark Jason Dominus's Higher Order Perl. I'm not quite done with it, and will post a full review when I am. I have some problems with it, mostly around the author's choice of example problems. But there is one chapter on infinite streams, and the moment I read that chapter on the train, it struck me as the perfect solution to log parsing problems.

I'm not much of a functional programming person (which is where Dominus is drawing most of the material for this book), so I don't know if this terminology is standard or somewhat unique to the book. An infinite stream in this context is basically a variation on an interator that lets you look at the next item without consuming it. The power comes from putting this modified iterator in front of a generator and use it to consume one element at a time, and then compose this with transformation and filtering functions. That gives you all the power of the iterator to store state and lets you process one logical element from the stream at a time, without inverting your flow of control.

Dominus provides code in the book for a very nice functional implementation of this that's about as close as you're probably going to get to writing Haskell in Perl. Unfortunately, his publisher decided to write their own free software license, so the license is kind of weird and doesn't explicitly permit redistribution of modified code. It's probably fine, but I didn't feel like dealing with it, and I'm more comfortable with writing object-oriented code at the moment (at least in Perl), so I decided to write an object-oriented version of the same code specific for log parsing.

That's what I've been doing since shortly after lunch, and I can't remember the last time I've had this much fun writing code. I have a reasonable first cut at a (fully tested and fully test-driven) log parsing framework built on top of a reasonably generic implementation of the core ideas behind infinite streams. I also used this as an opportunity to experiment with Module::Build, and have discovered that the things I most disliked about it have apparently all been fixed. And I'm also using Perl 5.10 features. (I was tempted to start using some things from 5.12, but I do actually need to run this on Debian stable.) It's rather satisfying to write a thoroughly modern Perl module.

There are some definite drawbacks to writing this in an object-oriented fashion. There's rather more machinery and glue that has to be set up, it's probably a bit slower, and it tends to accumulate layers of calls. One of the advantages of the method with standalone functions and a very simple, transparent data structure is that it's easier to eliminate unnecessary call nesting. But I suspect the object-oriented version will do what I want without any difficulties, and if I feel very inspired, I can always fiddle with it later.

Maybe I'll eventually use this as a project to experiment with Moose as well.

I'm surprised that no one else has done this, but I poked around on CPAN a fair bit and couldn't find anything. This will all show up on CPAN (as Log::Stream) as soon as I've finished enough of it to implement my sample application. And then I'll hopefully find some time to rewrite our metrics system using it, which should simplify it considerably....

Posted: 2013-01-22 23:54 — Why no comments?

Last spun 2013-07-01 from thread modified 2013-01-24