Beauty and the Beast - Distributed functional programming, part 1

Update 2015-01-12

The algorithm as it exists in HEAD is somewhat different from the below, in ways that I'll describe (eventually) in an another post. In some ways, it's closer to Fork-Join, but with important differences to support reentrancy, share results of duplicate requests and adjust for the costs of distribution.

OSS and Commercial Grids

Grid computing has always suffered the reputation of a buzzword that one suspects might not actually mean anything, but it has become especially ill-defined with the rise of open-source distributed computation frameworks like Spark, Storm and Grampa Hadoop. These extensively documented systems don't need much explanation from me, other than to note that they all basically pump data through a pre-designed graph, whose topology is often defined by map-reduce keys extracted from the input data.

Traditional, commercial grid products like Tibco Cloud, né DataSynapse, and Symphony are adapting, at least in terms of marketing, but they still suffer from a general lack of understanding about what they're supposed to do.

The way I've observed DataSynapse and Symphony in action is as a kind of fancy rsh. For example, Behemoth Incorporated has a vast collection of servers, of wide-ranging vintage, running different operating systems, financed through complex amortization that makes ownership difficult to establish, and in wildly-fluctuating demand on an unsteady schedule that generally clusters around the end-of-day in various global business centers. Departments within BI have huge numbers of batch jobs, of hypothetically varying priority but in practice all deemed urgent in the extreme. These batch jobs generally involve running an executable that reads a bunch of stuff from a database (loosely defined), chews cpu for O(10-100) minutes and puts its output somewhere. The job of the grid scheduler is to find a places to where these jobs will be able to run and do some kind of tracking of whether they actually ran. Part of finding places to run is what's known as "cycle scavenging," which means about what you think it would but often meets resistance from the scavengees.

Sarcasm notwithstanding, these products are solving problems that really exist, even if that existence tends to reflect organizational choices, with which it is not always entirely impossible to find fault.

But what about beauty?

Consider another computing paradigm, where we write pure functions that call other pure functions. This blisteringly novel idea doesn't yet have a name, but for now let's call it Functional Programming. With this "FP" thing that I, myself, invented right now,¹ we may write beautiful programs that do complicated work but can still be understood easily, without having to consider a thousand possible ways in which shared state might be mucked up by some process we forgot about. As always, there's a "graph" beneath it all, but it emerges naturally from the stack of function calls, and we almost don't need to think about it.

Complications arise, however, when our computer is insufficiently speedy, and we find ourselves needing to distribute the load across multiple threads or even multiple machines, and the paradigms to deal with these complications tend to make our programs less beautiful. Leaving aside the huge cognitive dissonance that untrained individuals experience when trying to use the word manifesto unironically, the actor model, even in a truly elegant implementation, is not as beautiful as the functional model. It's just not. Similarly, but, for me, more painful to admit, CSP, especially as embodied on Clojure by core.async, is, while dazzling, not beautiful either.

Under an aesthetically caring god, we would not have to re-imagine simple nested function calls as a flurry of notes, passed among mechanical homunculi.

Girder

I have been playing with an approach for distributed computing that, while implemented using a variety of the techniques I have deemed unbeautiful, exposes an interface that comes a bit closer to Plain Old FP. With some limitations, you write code that looks a lot like POFP except that some of the functions you call happen to get executed remotely.

The main design goals and assumptions were these:

Calculation requests should be built with ordinary clojure functions and values.
Calculations are referentially transparent and idempotent, implying
we can cache results
repeating the evaluation of one will at worst result in unnecessary work
it is valid and safe to restart the system and make the initial requests again.
Support emergent graphs, i.e. computations running on the grid may request and use the results of other computations on the grid, but we don't claim to know the data-flow graph in advance or specify it any way other than through the stack of function calls.
We must therefore fully support re-entrant requests. When functions make their own requests, we must not risk grid "starvation" (where requests cannot be processed because all workers claim to be busy, when they are in fact in a wait state), but we must do so efficiently and with predictable load.
Support any emergent directed acyclic graph, i.e. different requests may make the same further request, and they should be able share a memoized result.
One would generally prefer that re-entrant requests do not get executed very remotely, but we want the grid to "help out" if we get overburdened.
The system should operate at high throughput. Aim for several ms overhead in a potentially remote call.
You're responsible for distributing jar files. We're not solving the code serialization problem.
It is safe for any member of the grid to die and/or be restarted (though the process of recovery is not something I've yet automated).

Implementation also embodies a few central principles.

Interface is generally via core.async. Requests for computations return channels, which will eventually deliver results.
We rely internally on async to coordinate data flow without directly playing with thread pools, thus references below to "blocking" are really "parking".
There will be a central statekeeper (Redis, at this point), easing reliable coordination and facilitating reporting on the state of the system.

Topology of compute nodes

While the topology of data flow is determined entirely by function execution, there is a defined topology of machinery to execute those functions. I'll define a node as a single threaded process that knows how to do something. Every node has a work queue, onto which requests can be placed. There is also a central back end, which knows what's on all the queues as well as the current status of any request.

Worker Nodes

unsurprisingly, do most of the work. They

Have a name, like "w1".
Have one designated distributor node, which may be so designated by many worker node.
Pop requests only from their queue.
Execute these requests, which may themselves make requests reentrantly.
Push reentrant grid requests back onto their queue.
After making re-entrant requests, work through own queue (which will generally contain the requests they just made).
Ignore work marked done or in progress but somehow acquire the results a synchronously.
When free, push themselves onto the volunteer queue of their distributor.

Distributor nodes

Act as a pool for worker nodes that designate them.
Have a name, like "pool".
Have, as just noted, a volunteer queue as well as a work queue.
Pop work requests from own work queue, discarding work that is already finished or in progress, and pop volunteer requests from their volunteer queue, blocking until they have at least one of each.
Push the work request down to the volunteer.
May have their own designated distributor node, at which they volunteer when not busy.
Periodically (eg. every 100ms) copy the the entirety of the work queues of all nodes for which they act as distributor and push the requests onto their own work queue.

Back end

Maintains all the queues.
Maintains the state of every request, e.g. new, running or completed.
Maintains the cached value of every completed request.
Handles distribution of results
Is, as I said, at this point Redis, but doesn't have to be.

Typical use would be to make top-level requests at a distributor pool, where they would be doled out to idle workers that had volunteered. If workers themselves make requests, their queues may elongate, in which case the requests will be hoisted up the distribution chain, to be pushed back down should anyone else volunteer.

Requests

A request in girder is a finite sequence, whose first element is a function and all of whose subsequent elements must be serializable. At its simplest, a request doesn't look much different from a function call, and, indeed,

(request "pool" (inc 5))

will duly return 6, though the execution could have occurred at any worker in the pool (or in one of the pool's pools, etc).

The request macro hides a lot of complexity, expanding the above into something like

  (let* [req [+ 1 2]
         res (<!! (grid/enqueue "pool" req))]
    (if-not (:error res)
       (:value res)
       (throw (ex-info "Error during grid execution" res))))

where enqueue is responsible for serializing the request into something remotely detangleable (in this case, "(\"clojure.core/+\" 1 2)"), pushing it onto the appropriate queue and subscribing to results matching this request.

One of Girder's main design goals is that a remotely executing function should be allowed to remotely execute other functions. We want this to happen without gorging on threads, and we want to deal gracefully with errors that may occur far down the call chain. To achieve this, the fundamental computational unit is a function that

Does its calculation in a go block.
Returns a channel to which results will eventually be delivered.
Wraps those results in a structure that allows us to differentiate success and failure, and preserve error information.

In the example above, + gets wrapped up automatically, but in general we'll want a macro to facilitate creation of such functions:

(defmacro cdefn [fun args & forms]
`(defn ~(vary-meta fun assoc :girded true) ~args 
   (let [c# (chan)]
     (go
       (>! c#
           (try {:value  (do ~@forms)}
                (catch Exception e# {:error (format-exception e#)})))
       (close! c#))
     c#)))

which I can use in this a contrived example, to define several financial instruments around based on a criterion (e.g. a ticker) and then calculate their prices in parallel:

(cdefn calc-price [deets] (calculate-something-expensive deets))
(cdefn calc-portfolio [spec]
   (let [reqs [[calc-price (munge spec)] [calc-price (diddle spec)]]]
      (reduce + (requests reqs))))
(request "pool" (calc-portfolio "xyz"))

The innocuous looking requests in calc-portfolio, after passing through a few more macros, eventually calls and waits on (with some simplifications for clarity):

(defn enqueue-reentrant [reqs nodeid reqchan]
  (let [out (chan)]
    (go 
      (let [results   (async/map vector (map #(enqueue nodeid %) reqs))]
        (async/go-loop []
          (let [[v c] (async/alts! [results reqchan])]
          (if (= c results)
            (do (>! out v) (close! c) (close! out)
            (do (<! (process-reqid nodeid reqchan v))
                  (recur))))))))
    out))

This does two main things:

Enqueue the requests at our local nodeid, waiting for on channel results for a vector of answers.
Use alts! to pluck from reqchan any other requests that might be on this worker's queue, and process them until we get back the results.

Here, we benefit from the fact that the original cdefn put us in a go block, so we can continue processing requests while waiting for an answer. These requests may well be the ones that we just enqueued, though, depending on how long all these takes, we may find that they've already been hoisted up by a distributor node and doled out to someone else.

process-reqid is where the calculation actually occurs, but explaining it will require going over some other internals, which will happen in part 2 of this post.

Here's an example of how errors get caught:

(cdefn divide [x y] (float (/ x y)))
(cdefn ratio [i] (request (divide i (dec i))))

While (request "pool" (ratio 5)) returns 1.25, (request "pool" (ratio 1)) throws (somewhat redacted) ex-info:

   {:error
    {:info
     {"(\"acyclic.girder.testscripts.boffo/divide\" 1 0)"
      {:req "(\"acyclic.girder.testscripts.boffo/divide\" 1 0)",
       :msg "Divide by zero",
       :stack
       ["java.lang.ArithmeticException: Divide by zero"
        "clojure.lang.Numbers.divide(Numbers.java:156)"
        ...
        "java.lang.Thread.run(Thread.java:722)"]}},
     :req "(\"acyclic.girder.testscripts.boffo/ratio\" 1)",
     :msg "Error from request",
     :stack [....]}}

so we know the error happened in divide.

Finally, here's an example where requests themselves make requests, some of which might be made by other requests, and the subrequests results just get thrown into a result string:

(cdefn bogosity [msec jobnum reclevel numrecjobs args]
  (let [reqs (map #(vector bogosity msec % (dec reclevel)  numrecjobs args) (range reclevel))
        vs   (requests reqs)]
    (Thread/sleep msec)
    (str "Bogosity:" *nodeid* ":" jobnum ":" msec ":" reclevel ":" args ":[" (clojure.string/join "," (map str vs)) "]")))

The reclevel argument indicates how many levels of sub-requests to spawn As you see, cdefn requests do have access to a locally bound variable *nodeid*, indicating where the execution is taking place. This should be used for only debugging, but here I've actually broken referential transparency by putting nondeterministic values into the results. That necessitated the use of the last argument, just to differentiate multiple tests and thwart the cache.

Launching a whole bunch of these:

(requests "pool" (map #(vector bogosity 1 % 3 5 222) (range 10)))

eventually gets me

[{:value "Bogosity:w2:0:1:3:222:[Bogosity:w2:0:1:2:222:[Bogosity:w1:0:1:1:222:[Bogosity:w2:0:1:0:222:[]],Bogosity:w2:1:1:1:222:[Bogosity:w2:0:1:0:222:[]]],Bogosity:w1:1:1:2:222:[Bogosity:w1:0:1:1:222:[Bogosity:w2:0:1:0:222:[]],Bogosity:w2:1:1:1:222:[Bogosity:w2:0:1:0:222:[]]],Bogosity:w2:2:1:2:222:[Bogosity:w1:0:1:1:222:[Bogosity:w2:0:1:0:222:[]],Bogosity:w2:1:1:1:222:[Bogosity:w2:0:1:0:222:[]]]]"} {:value "Bogosity:w1:1:1:3:222:[Bogosity:w2:0:1:2:222:[Bogosity:w1:0:1:1:222:[Bogosity:w2:0:1:0:222:[]],Bogosity:w2:1:1:1:222:[Bogosity:w2:0:1:0:222:[]]],Bogosity:w1:1:1:2:222:[Bogosity:w1:0:1:1:222:[Bogosity:w2:0:1:0:222:[]],Bogosity:w2:1:1:1:222:[Bogosity:w2:0:1:0:222:[]]],Bogosity:w2:2:1:2:222:[Bogosity:w1:0:1:1:222:[Bogosity:w2:0:1:0:222:[]],Bogosity:w2:1:1:1:222:[Bogosity:w2:0:1:0:222:[]]]]"} {:value "Bogosity:w2:2:1:3:222:[Bogosity:w1:0:1:2:222:[Bogosity:w1:0:1:1:222:[Bogosity:w2:0:1:0:222:[]],Bogosity:w2:1:1:1:222:[Bogosity:w2:0:1:0:222:[]]],Bogosity:w1:1:1:2:222:[Bogosity:w1:0:1:1:222:[Bogosity:w2:0:1:0:222:[]],Bogosity:w2:1:1:1:222:[Bogosity:w2:0:1:0:222:[]]],Bogosity:w2:2:1:2:222:[Bogosity:w1:0:1:1:222:[Bogosity:w2:0:1:0:222:[]],Bogosity:w2:1:1:1:222:[Bogosity:w2:0:1:0:222:[]]]]"} {:value "Bogosity:w1:3:1:3:222:[Bogosity:w1:0:1:2:222:[Bogosity:w1:0:1:1:222:[Bogosity:w2:0:1:0:222:[]],Bogosity:w2:1:1:1:222:[Bogosity:w2:0:1:0:222:[]]],Bogosity:w1:1:1:2:222:[Bogosity:w1:0:1:1:222:[Bogosity:w2:0:1:0:222:[]],Bogosity:w2:1:1:1:222:[Bogosity:w2:0:1:0:222:[]]],Bogosity:w2:2:1:2:222:[Bogosity:w1:0:1:1:222:[Bogosity:w2:0:1:0:222:[]],Bogosity:w2:1:1:1:222:[Bogosity:w2:0:1:0:222:[]]]]"} {:value "Bogosity:w2:4:1:3:222:[Bogosity:w1:0:1:2:222:[Bogosity:w1:0:1:1:222:[Bogosity:w2:0:1:0:222:[]],Bogosity:w2:1:1:1:222:[Bogosity:w2:0:1:0:222:[]]],Bogosity:w1:1:1:2:222:[Bogosity:w1:0:1:1:222:[Bogosity:w2:0:1:0:222:[]],Bogosity:w2:1:1:1:222:[Bogosity:w2:0:1:0:222:[]]],Bogosity:w2:2:1:2:222:[Bogosity:w1:0:1:1:222:[Bogosity:w2:0:1:0:222:[]],Bogosity:w2:1:1:1:222:[Bogosity:w2:0:1:0:222:[]]]]"} {:value "Bogosity:w1:5:1:3:222:[Bogosity:w1:0:1:2:222:[Bogosity:w1:0:1:1:222:[Bogosity:w2:0:1:0:222:[]],Bogosity:w2:1:1:1:222:[Bogosity:w2:0:1:0:222:[]]],Bogosity:w1:1:1:2:222:[Bogosity:w1:0:1:1:222:[Bogosity:w2:0:1:0:222:[]],Bogosity:w2:1:1:1:222:[Bogosity:w2:0:1:0:222:[]]],Bogosity:w2:2:1:2:222:[Bogosity:w1:0:1:1:222:[Bogosity:w2:0:1:0:222:[]],Bogosity:w2:1:1:1:222:[Bogosity:w2:0:1:0:222:[]]]]"} {:value "Bogosity:w2:6:1:3:222:[Bogosity:w1:0:1:2:222:[Bogosity:w1:0:1:1:222:[Bogosity:w2:0:1:0:222:[]],Bogosity:w2:1:1:1:222:[Bogosity:w2:0:1:0:222:[]]],Bogosity:w1:1:1:2:222:[Bogosity:w1:0:1:1:222:[Bogosity:w2:0:1:0:222:[]],Bogosity:w2:1:1:1:222:[Bogosity:w2:0:1:0:222:[]]],Bogosity:w2:2:1:2:222:[Bogosity:w1:0:1:1:222:[Bogosity:w2:0:1:0:222:[]],Bogosity:w2:1:1:1:222:[Bogosity:w2:0:1:0:222:[]]]]"} {:value "Bogosity:w1:7:1:3:222:[Bogosity:w1:0:1:2:222:[Bogosity:w1:0:1:1:222:[Bogosity:w2:0:1:0:222:[]],Bogosity:w2:1:1:1:222:[Bogosity:w2:0:1:0:222:[]]],Bogosity:w1:1:1:2:222:[Bogosity:w1:0:1:1:222:[Bogosity:w2:0:1:0:222:[]],Bogosity:w2:1:1:1:222:[Bogosity:w2:0:1:0:222:[]]],Bogosity:w2:2:1:2:222:[Bogosity:w1:0:1:1:222:[Bogosity:w2:0:1:0:222:[]],Bogosity:w2:1:1:1:222:[Bogosity:w2:0:1:0:222:[]]]]"} {:value "Bogosity:w1:8:1:3:222:[Bogosity:w1:0:1:2:222:[Bogosity:w1:0:1:1:222:[Bogosity:w2:0:1:0:222:[]],Bogosity:w2:1:1:1:222:[Bogosity:w2:0:1:0:222:[]]],Bogosity:w1:1:1:2:222:[Bogosity:w1:0:1:1:222:[Bogosity:w2:0:1:0:222:[]],Bogosity:w2:1:1:1:222:[Bogosity:w2:0:1:0:222:[]]],Bogosity:w2:2:1:2:222:[Bogosity:w1:0:1:1:222:[Bogosity:w2:0:1:0:222:[]],Bogosity:w2:1:1:1:222:[Bogosity:w2:0:1:0:222:[]]]]"} {:value "Bogosity:w2:9:1:3:222:[Bogosity:w1:0:1:2:222:[Bogosity:w1:0:1:1:222:[Bogosity:w2:0:1:0:222:[]],Bogosity:w2:1:1:1:222:[Bogosity:w2:0:1:0:222:[]]],Bogosity:w1:1:1:2:222:[Bogosity:w1:0:1:1:222:[Bogosity:w2:0:1:0:222:[]],Bogosity:w2:1:1:1:222:[Bogosity:w2:0:1:0:222:[]]],Bogosity:w2:2:1:2:222:[Bogosity:w1:0:1:1:222:[Bogosity:w2:0:1:0:222:[]],Bogosity:w2:1:1:1:222:[Bogosity:w2:0:1:0:222:[]]]]"}]

where you see that execution was spread across two workers, lots of recursion occurred, and somehow the two workers managed to stay busy calculating sub-requests, even while apparently blocking on them.

But how does it work?

That's a secret, until my next post, but it will revolve around the mysterious enqueue and process-reqid methods. We'll also go into the way we use core.async to interact with Redis, try it out on a hundred AWS machines and go into some of the other "tricks" involved in implementing all this. That sounds like a lot, so maybe there will be three posts altogether.

As with most of my original ideas, this one is bound to be stolen by others, who cunningly post-date their books and papers to make it look like they came first. ↩