Hollywood Typecasting - Adventures with typed clojure and IMDB

I am broadly sympathetic to view that scalable systems must be built with statically typed langages, for reasons outlined in this wonderful screed, and, until recently, that has made it difficult for me to recommend clojure for institutional use.

With the introduction of core.typed, that has changed. The author has says that core.typed is now production-ready, and I agree. It's not perfect, but it will find bugs in your code without breaking it or causing performance problems. It's also pretty cool, and in many ways more expressive than type declarations in "normal" statically typed languages.

That said, the documentation is not, uhm, exhaustive. I decided to teach myself what I could using this toy project, which started life as a coding challenge for a job interview. (It didn't work out, though probably not for reasons connected with the challenge. Thanks for asking.) The task is to compute movie degrees of separation using data scraped from IMDB: two actors are considered to be directly connected if they appear in the same film together; if they're connected via a third actor with whom each of the first two appeared in a film, then it's a 2nd degree connection, etc. (One of the reasons this is not a particularly good challenge question is that IMDB limits your scraping rate so much that it is literally impossible to complete the calculation in the allotted time for non-trivial examples.)

Many people would say that clojure is not an obvious choice for this task, since it would seem at first that Dijkstra is a naturally stateful algorithm, but persistent data structures are wonderful things. We can store the graph as a nested map of links, every "update" to which is effectively a brand new graph.

As you've gathered, this post is about three totally unrelated things, in descending order of emphasis:

Typed clojure
Graph algorithms with functional data structures
Scraping IMDB

So hang on tight.

Combining the first two topics, I now have the luxury of explaining the graph structure by showing you type signatures:

(t/def-alias Node "Node representing an actor or film"
  (HMap :mandatory {:id String 
                    :links (t/Set String)
                    :title String}
        :optional {:distance t/AnyInteger
                   :path (t/Seq String)}
        :complete? true))
(t/def-alias Graph "Graph of tokens to nodes"
  (Map String Node))

This is almost self-explanatory. Every node has a string id, along with a title and a set of ids of nodes to which it is linked. As the Dijkstra algorithm proceeds, we add fields containing the updated estimtated distance from the starting node and the path by which we got to it. So far, so obvious.

What's not immediately obvious is that this declaration contains more type information than we're used to seeing in standard OO languages. Remember that this isn't a structure or class - just an ordinary hash map. The field names are being represented here as types; that is, a type can specify not just a particular sort of entity but the values that the entity is allowed to take! Outide of map structures, you may not find yourself taking advantage of this ability at first, but when core.typed infers type information within your code, it will be as conservative as possible, down to the value level if it has that information. For example, if I had (if (> x 0.0) (Math/log x) "Bleh"), the type might be inferred as (U double "Bleh"). A more general implication, about which I'll say more a bit later, is that there are often multiple approaches to typing within a given piece of code.

Note that the enforcement of this typing is strictly at our discretion. To typecheck a namespace, run (check-ns); for a more automated experience, lein typed check (using a plugin you'll find in my project.clj); or examine individual expressions in the REPL using cf. Actual compilation can proceed irrespective of whether we did any of these things, or whether they succeeded, and the compiled code won't differ either.

Let's take a look at some more type declarations in the project. One possibly unnecessary complication I introduced was a Mongo caching layer over IMDB, so it wouldn't be necessary to scrape the same thing more than once. There's a very nice Mongo interface available, but it doesn't (yet) come typed out of the box. One of the great things about the separation of definition and type declaration is that I can provide remedial typing of external libraries:

(t/ann ^:no-check monger.core/connect!
        (Fn [(HMap :optional {:port t/AnyInteger :host String} :complete? true) -> (t/Option com.mongodb.MongoClient)]
            [-> (t/Option com.mongodb.MongoClient)]))
(t/ann ^:no-check monger.core/get-db [String -> com.mongodb.DBApiLayer])
(t/ann ^:no-check monger.core/set-db! [com.mongodb.DBApiLayer -> com.mongodb.gridfs.GridFS])

As you can see, this adds type information to functions in another namespace.

The :no-check metadata tells core.typed not to bother checking within the definitions of these functions (which would certainly fail).
Option is shorthand for (U Something nil) The monadic allusion may be overplayed here, but explictly noting that nil is a possible outcome allows a type error if my code doesn't handle it.
As you can see, qualified Java class names can be types.

I also needed external libraries for web queries and parsing. My httpkit annotation

(t/ann ^:no-check org.httpkit.client/request
    ['{:url String :method (Value :get)}, [Any -> Any] ->
    (t/Atom1 '{:status Number :body String})])

does not even attempt completeness. As long as I err on the overly restrictive side, this does nobody any harm. Worst case, I'll realize I need to use the function more generally and have to expand my annotation. (In truth, [Any->Any] is too broad for the callback function, so I've sort of broken my own rule. I think it should be (Value identity), but core.typed doesn't permit function constants at this point.)

I used enlive to convert the raw content stream from the imdb.com into a nested map. The declaration language is rich enough to describe this object recursively

(t/def-alias Resource "Parsed DOM resource"
  (Rec [x] (t/Map Keyword (U String x (t/Seq x))))

where x stands for the full declaration of Resource, and it's easy enough to declare the functions I use:

(t/ann ^:no-check net.cgrand.enlive-html/html-resource [java.io.Reader -> Resource])
(t/ann ^:no-check net.cgrand.enlive-html/select [Resource (t/Vec Keyword) -> (t/Seq Resource)])

The actual parsing, however, is a bit ugly. Either we rigorously validate all the assumptions we make about the layout of the document, or we punt on the type checking. Extraction of links to actors or films starts a bit like this:

  (-> resource
      (html/select [:div.title])
      (as-> doc (map #(-> % :content second :attrs :href) doc)))

That is, we expect to find a list of divs of class title, the second item in the content of each of which is supposed to be the link we're interested in. Expressing this with higher order functions and threading is relatively compact and readable, but the type inference turns into a complete mess. If we truly cared about Blythe Danner's relation to Meatloaf, it would be worth doing a host of extra validations, including but hardly limited to type checking, but there's no guarantee that IMDB would be using the same format by the time we were done. (We might hope that they don't: IMDB pages are a total rat's nest. I make life a little easier by scraping the mobile pages, but there are still nuances such as different URLs for actors and actresses, despite no explicit data for gender.) In any case, I wrap my link extraction routine in a big fat no-check.

(t/ann ^:no-check resource->links [String Resource -> (t/Set String)])

which nicely isolates the most typologically problematic portion of the code while allowing me to make progress on the rest.

The final external library I'll use is data.priority-map. At each iteration of the Dijkstra algorithm, we visit the node that currently has the smallest estimated distance, and it's nice to be able to find that node in something less than O(n). I could use java.util.PriorityQueue, but This lovely package gives me a more idiomatic clojure solution. The object created by (priority-map) behaves like a map, but you can use peek and pop to retrieve and discard the entry with the smallest value.

Once again, of course, the package doesn't come pre-typed, but core.typed lets me correct for that:

(t/def-alias Queue (t/Map String t/AnyInteger))
(t/ann ^:no-check clojure.data.priority-map/priority-map [ -> Queue])
(t/ann ^:no-check pmap-assoc [Queue String (U nil t/AnyInteger) -> Queue])
(def pmap-assoc assoc)
(t/ann ^:no-check pmap-peek [Queue -> (Vector* String t/AnyInteger)])
(def pmap-peek peek)
(t/ann ^:no-check pmap-pop [Queue -> Queue])
(def pmap-pop pop)

The full cache layering can be summarized with these declarations:

(t/def-alias Graph "Graph of tokens to nodes" (t/Map String Node))
(t/ann fetch-node-from-imdb [String -> Node])
(t/ann fetch-node-from-mongo-fallback-to-imdb [String -> Node])
(t/ann fetch-node-from-graph-fallback-to-mongo [Graph String -> (Vector* Graph Node)])

And now for the core of the algorithm. Since I have complete control over my own code, there will be none of this ^:no-check nonsense; every single line is type-checked. As the following function illustrates, core.typed is doing a lot of work on my behalf:

(t/ann update-queue [Queue Graph String (t/Seqable String) -> Queue])
(defn update-queue [queue graph id links] 
  (reduce (t/fn> :- Queue [q :- Queue id :- String]
                 (pmap-assoc q id (:distance (get graph id)))) queue links))

To verify that this code does indeed produce a Queue, type must be properly inferred through 4 layers of function calls, one of them higher-order. I find that impressive.

Had I fat-fingered the last line as

            ...   (pmap-assoc id q (:distance (get graph id)))) queue links)

I would have learned that immediately

 Type Error (imdb.core:219:18) Type mismatch:
 Expected:  Queue
 Actual:    String
 in: (imdb.core/pmap-assoc id412107 q (:distance (clojure.lang.RT/get graph id412107)))
 Type Error (imdb.core:219:18) Type mismatch:
 Expected:  String
 Actual:    Queue
 in: (imdb.core/pmap-assoc id412107 q (:distance (clojure.lang.RT/get graph id412107)))

rather than having to deduce it from some downstream catastrophe.

The rest of the algorithm contains few surprises. As promised, it's purely functional, with multiple iterations of the graph emerging from an recursion and reduction.

(defn update-distance
  "Update :distance field in graph node id to be one greater than di if appropriate
   and return the updated graph."
  [entry graph id]
  (let [di           (distance entry)
        [graph node] (fetch-node-from-graph graph id)
        path         (or (:path entry) '())]
        (if (<= (distance node) di)  graph
                (-> graph
                    (set-node id :distance (inc di))
                    (set-node id :path (conj path id))))))

(t/ann visit-node [Graph Queue String -> (Vector* Graph Queue)])
(defn visit-node
  "Visits node id, and update the :distance fields of its neighbors.
   queue is a priority queue of id=>distance, containing entries only for unvisited
   nodes.  We will use this to determine what to visit next, Dijkstra style."
  [graph queue id]
  (let [[graph entry] (fetch-node-from-graph graph id)
        links         (filter #(nil? (:visited (graph %))) (:links entry))
        graph         (reduce (partial update-distance entry) graph links)
        queue         (update-queue queue graph id links)
        graph         (set-node graph id :visited true)]
    [graph queue]))

(t/ann find-distance [String String -> (t/Option Any)])
(defn find-distance 
  "E.g. (find-distance \"/name/nm0000257/\" \"/name/nm0000295/\")
   Use Dijkstra algorithm to find shortest path from id to target."
  [id target]
    (let [ [graph node]  (fetch-node-from-graph {} id)
           graph         (set-node graph id :distance 0)
           queue         (priority-map)]
      (t/loop> [graph :- Graph graph
                queue :- Queue queue
                id    :- String id]
        (if (= id target) (get graph id)
          (let [[graph queue]   (visit-node graph queue id)
                closest         (first (pmap-peek queue))]
            (if (empty? queue)
                "Couldn't find a path!"
                (recur graph (pmap-pop queue) closest)))))))

I mentioned above that the core.typed's flexibility gives us some unexpected choices. To investigate this in the REPL, we'll use a couple of extra tools. First, (t/ann-form #{"a"} (t/Set String)) specifies inline that #{a} is to be treated as a set of String, rather than the more conservatively inferred (t/Set (Value "a")), and (t/cf someform) checks a form and displays its inferred type.

Here's how union might be annotated

(t/ann ^:no-check clojure.set/union (All [a] (Fn [(t/Set a) * -> (t/Set a)])))

and here's how that would be interpreted:

imdb.core> (t/cf (clojure.set/union (t/ann-form #{1} (t/Set Integer)) (t/ann-form #{"one"} (t/Set String))))
(t/Set (U Integer String))

That certainly isn't what I expected, given apparently analogous declarations in other languages. For example, scala's sets have this union method

def union(that: GenSet[A]): Set[A]

which of course bombs on heterogenous sets:

scala> HashSet(1).union(HashSet(2))
res0: scala.collection.immutable.HashSet[Int] = Set(1, 2)

scala> HashSet(1).union(HashSet("one"))
<console>:9: error: type mismatch;
 found   : String("one")
 required: Int

To get the familiar behavior from clojure, we'd have to specify an equality bound on the type parameter:

(t/ann clojure.set/union (All [x [x1 :< x :> x]]
                          (Fn [(t/Set x) (t/Set x1) * -> (t/Set x1)])))

Now:

imdb.core> (t/cf (clojure.set/union (t/ann-form #{1} (t/Set Integer))
                                    (t/ann-form #{"one"} (t/Set String))))
AssertionError Assert failed: 1: Inferred type String is not between bounds Integer and Integer
(and (subtype? inferred upper-bound) (subtype? lower-bound inferred))  clojure.core.typed.cs-gen/subst-gen/fn--10414 (cs_gen.clj:1333)

Neither of these alternate declarations is obviously more correct, and both will catch our coding errors, though in slightly different ways. Suppose we had the following solecism:

(t/cf (let [s (clojure.set/union #{1} #{"a"})]
         (map inc s)))

With the bounded annotation, the call to union is simply illegal:

AssertionError Assert failed: 1: Inferred type (Value "a") is not between bounds (Value 1) and (Value 1)
(and (subtype? inferred upper-bound) (subtype? lower-bound inferred))  clojure.core.typed.cs-gen/subst-gen/fn--10414 (cs_gen.clj:1333)

With the first, more general declaration, the call to union is fine, but inc will have problems:

Type Error (imdb.core:2:6) Polymorphic function clojure.core/map could not be applied to arguments:
Domains:
    (Fn [a b ... b -> c]) (t/NonEmptySeqable a) (t/NonEmptySeqable a) ... b
    (Fn [a b ... b -> c]) (U nil (clojure.lang.Seqable a)) (U nil (clojure.lang.Seqable b)) ... b
Arguments:
    (Fn [t/AnyInteger -> t/AnyInteger] [Number -> Number]) (t/Set (U (Value 1) (Value "a")))
Ranges:
    (t/NonEmptyLazySeq c)
    (clojure.lang.LazySeq c)
with expected type:
    Any
in: (clojure.core/map clojure.core/inc s)

In both cases, our error is caught, but at different stages of analysis.

This flexibility is immensely powerful, and it encourage more nuanced thinking about static types. In many situations, finding ourselves with a heterogeous set would be a sign that something had gone wrong, but not always. For instance, with our more general annotation, we could do this:

imdb.core> (t/cf (let [s (clojure.set/union #{1} #{1.0})]
           (map inc s)))
(clojure.lang.LazySeq Number)

I think I've demonstrated that core.typed is cool. I know that it has improved my own code. I hope that it will ultimately remove one of the greatest barriers to broad adoption of clojure.

Comments