Occasional thoughts on diverse subjects

Free Applicative Functors in Haskell

By Dave Menendez
Monday, May 27, 2013, at 1:37 AM
Filed under: Haskell

Summary: Can we find a way to make an applicative functor for any type constructor? It turns out, methods are already known—but I didn’t know that when I started, so I found my own way. One that turns out to have much better performance.

A few days ago, I found myself thinking about free monads, specifically Prompt monads, and wondering if there were equivalents for applicative functors. I didn’t realize it at the time, but apparently there was a recent bit of discussion about this in the Haskell community, starting with a blog post by Tom Ellis, and leading to a roundup of implementations by Roman Cheplyaka.

The consensus implementation, at least the one provided by the free package, looks roughly like this:

> {-# LANGUAGE GADTs, RankNTypes #-}
> module FreeApp where
> import Control.Applicative
> data Free f a where
>     Pure :: a -> Free f a
>     App :: f a -> Free f (a -> b) -> Free f b
> liftFree :: f a -> Free f a
> liftFree x = App x (Pure id)
> lowerFree :: Applicative f => Free f a -> f a
> lowerFree (Pure x) = pure x
> lowerFree (App x xs) = x <**> lowerFree xs
> instance Functor (Free f) where
>     fmap f (Pure a) = Pure (f a)
>     fmap f (App x xs) = App x (fmap (f .) xs)
> instance Applicative (Free f) where
>     pure = Pure
>     Pure f <*> y = fmap f y
>     App x xs <*> y = App x (fmap flip xs <*> y)

Essentially, Free f a consists of a sequence of “effects” (values of type f a, for some a) and a function which combines the results of those effects to produce the final answer.

You can also see that this is going to have bad asymptotic performance. fmap f x, for example, will take time O(n) time, where n is the number of effects in x. Worse, x <*> y will take O(n2 + m) time, since it calls fmap n times on x and once on y.

My first implementation

Of course, I didn’t bother to look for other implementations until after I’d found my own. I think I considered one like Free above, but rejected it because I didn’t like that fmap was so expensive. Instead, I kept the function separate from the effect sequence. To make it easy to call fmap without needing to walk the tree, I made the function accept all the results from the effects in a single argument, represented as a nested tuple.

Here’s the type for the effect sequence:

> data CSeq f a where
>     CNil :: CSeq f ()
>     CCons :: f a -> CSeq f u -> CSeq f (a,u)
> reduce :: Applicative f => CSeq f u -> f u
> reduce CNil         = pure ()
> reduce (CCons x xs) = (,) <$> x <*> reduce xs

So if we have operations x :: f a, y :: f b, and z :: f c, we can put them in sequence and get CCons x (CCons y (CCons z CNil)) :: Seq f (a,(b,(c,()))).

And here’s the implementation itself:

> data C1 f a = forall u. C1 (u -> a) (CSeq f u)
> liftC1 :: f a -> C1 f a
> liftC1 a = C1 fst (CCons a CNil)
> lowerC1 :: Applicative f => C1 f a -> f a
> lowerC1 (C1 f x) = f <$> reduce x
> instance Functor (C1 f) where
>     fmap f (C1 u x) = C1 (f . u) x
> instance Applicative (C1 f) where
>     pure a = C1 (const a) CNil
>     C1 f x <*> C1 g y = rebaseC1 x C1 (\v u -> f u (g v)) y
> rebaseC1 :: CSeq f u -> (forall x. (x -> y) -> CSeq f x -> z) ->
>     (v -> u -> y) -> CSeq f v -> z
> rebaseC1 CNil         k f = k (\v -> f v ())
> rebaseC1 (CCons x xs) k f = 
>     rebaseC1 xs (\g s -> k (\(a,u) -> g u a) (CCons x s))
>         (\v u a -> f v (a,u))

(The prefix C, incidentally, stands for “Curry”, because when I was naming it, I got currying and uncurrying confused. C1 uses an uncurried function; Free uses a curried one.)

The only real tricky part is rebaseC1. It may not be obvious, but it essentially acts like ++, traversing the first sequence and creating a new one by appending the second sequence. The difference is that this also has to modify the return functions and that the return type depends on the input types. I could have used type families here, but for some reason I chose to just stick with existential types; the continuation passed in the second argument is polymorphic in the sequence type, allowing me to just pass C1 when I call it.

The real “trick” in rebaseC1 involves the y parameter, which changes in the recursive call to allow us to pass values through the later parts of the computation and then reconstruct them. So if the third parameter, f has type v -> (a,u) -> y, in the recursive call it will have type v -> u -> a -> y. The modified continuation will eventually convert the function g :: x -> a -> y into a one with type (a,x) -> y and pass it to the original continuation.

Asymptotically, this is more efficient than Free. All of the operations are O(1), aside from lowerC1 and <*>, which are O(n). It’s pretty clear that O(n) is the best something like lowerC1 could be, since it has to evaluate every effect, but can we get a faster <*>?

My second implementation

Consider that <*> is essentially appending one sequence to another. In lists, appending works by replacing the final [] of the first list with the second list. That is,

1:2:3:[] ++ 4:5:[] = 1:2:3:4:5:[]

We can speed up append by taking the end of the list as a parameter, allowing us to use [] or another list as we choose. Then appending simply becomes function composition.

(\n -> 1:2:3:n) . (\n -> 4:5:n) = (\n -> 1:2:3:4:5:n)

Finding an analogous trick for C1 is harder, obviously, but after a bunch of tweaking, this is what I came up with:

> newtype C2 f a = C2
>     { unC2 :: forall u y z. 
>         (forall x. (x -> y) -> CSeq f x -> z) -> 
>         (u -> a -> y) -> CSeq f u -> z }
> liftC2 :: f a -> C2 f a
> liftC2 a = C2 (\k f s -> k (\(a,s) -> f s a) (CCons a s))
> lowerC2 :: Applicative f => C2 f a -> f a
> lowerC2 x = unC2 x (\f s -> f <$> reduce s) (\() -> id) CNil
> instance Functor (C2 f) where
>     fmap g x = C2 (\k f -> unC2 x k (\s -> f s . g))
> instance Applicative (C2 f) where
>     pure a = C2 (\k f -> k (flip f a))
>     x <*> y = C2 (\k f -> unC2 y (unC2 x k) (\s a g -> f s (g a)))

Now, all operations are O(1) except lowerC2.

As a quick explanation, x :: C2 f a is a function taking three arguments. The first is a continuation (of the same type used in reduceC1) saying what to do with the final effect sequence and function. The third is the tail of the effect sequence, that x can add to using CCons. The second is a function which explains how to take the results of the tail effects and the value produced by x and combine them into the final return value. You can look at liftC2 and pure to see how they use them.

<*> uses the same trick as rebaseC1 of being sneaky with the y type parameter, allowing us to pass an extra value from x to y. That’s also why the function doesn’t have the seemingly more logical argument order a -> u -> v.

Notice in <*> that the first argument is put into the continuation of the second one. This is the secret to C2’s speed: it essentially traverses the expression tree from right to left, building the sequence of events from last to first. Unlike Free or C1, C2 never has to traverse and reconstruct the effect sequence.

Naturally, C1 and C2 are easily interconvertible:

> c2toC1 :: C2 f a -> C1 f a
> c2toC1 x = unC2 x C1 (\() a -> a) CNil
> c1toC2 :: C1 f a -> C2 f a
> c1toC2 (C1 g x) = C2 (\k f -> rebaseC1 x k (\v u -> f v (g u)))

Testing for speed

I’ve talked a bit about asymptotic performance, but that isn’t the end of the story with efficiency. C1 and C2 do a bit work converting functions between curried and uncurried forms, so we should do some experiments to see whether they’re actually any better.

Here’s one. I generated a balanced binary tree containing between 10 and 100 nodes, each of which is a trivial call to liftX, used sequenceA to traverse the tree, and then used lowerX to run the computation. I then used Criterion to tell me how long they took. The result was that C2 was easily faster than C1, which was much faster than Free.

Chart showing times to traverse a tree with n nodes

Other experiments are more stark. For example, expressions of the form

f <$> lift () <*> lift () <*> lift () …

are something of a worst case for Free and C1, particularly as the number of arguments rose. In my experiments, the time for Free went from 0.58 μs to 14 μs as the number of arguments rose from 3 to 11. C2, in contrast, rose only from 0.049 μs to 0.16 μs.

Chart showing times for applying a function to n arguments

Obviously, this experiment is artificial, but its results hold up in the tree traversal experiment.

The funny thing is, if I had started my exploration by looking for other implementations, I might have found the existing ones and stopped, my curiosity satisfied. Because I was (re)inventing the concept for myself, I pushed harder to find the best one I could.1 The results are a significant improvement over the current state of the art.

It may be that no one will ever need a free applicative functor. It certainly may be that no one will need the speed improvement of C2 over Free. We’ve still learned two lessons here: don’t append when you can compose, and don’t use curried functions if you’re going to compose them.

  1. I'm omitting, in fact, four implementations I came up with that use a curried function—in fact, C1 and C2 were renamed to contrast with them. When I finally pulled out Criterion, I saw that they could beat Free, but C2 was better than all of them. Since they’re also more complicated, it didn’t seem worth it to mention them.