I am broadly sympathetic to view that scalable systems must be built with statically typed langages, for reasons outlined in this wonderful screed, and, until recently, that has made it difficult for me to recommend clojure for institutional use.
With the introduction of core.typed,
that has changed. The author has
says
that core.typed
is now production-ready, and I agree. It's not perfect, but it
will find bugs in your code without breaking it or causing performance problems.
It's also pretty cool, and in many ways more expressive than type declarations
in "normal" statically typed languages.
That said, the documentation is not, uhm, exhaustive. I decided to teach myself what I could using this toy project, which started life as a coding challenge for a job interview. (It didn't work out, though probably not for reasons connected with the challenge. Thanks for asking.) The task is to compute movie degrees of separation using data scraped from IMDB: two actors are considered to be directly connected if they appear in the same film together; if they're connected via a third actor with whom each of the first two appeared in a film, then it's a 2nd degree connection, etc. (One of the reasons this is not a particularly good challenge question is that IMDB limits your scraping rate so much that it is literally impossible to complete the calculation in the allotted time for non-trivial examples.)
Many people would say that clojure is not an obvious choice for this task, since it would seem at first that Dijkstra is a naturally stateful algorithm, but persistent data structures are wonderful things. We can store the graph as a nested map of links, every "update" to which is effectively a brand new graph.
As you've gathered, this post is about three totally unrelated things, in descending order of emphasis:
- Typed clojure
- Graph algorithms with functional data structures
- Scraping IMDB
So hang on tight.
Combining the first two topics, I now have the luxury of explaining the graph structure by showing you type signatures:
(t/def-alias Node "Node representing an actor or film"
(HMap :mandatory {:id String
:links (t/Set String)
:title String}
:optional {:distance t/AnyInteger
:path (t/Seq String)}
:complete? true))
(t/def-alias Graph "Graph of tokens to nodes"
(Map String Node))
This is almost self-explanatory. Every node has a string id, along with a title and a set of ids of nodes to which it is linked. As the Dijkstra algorithm proceeds, we add fields containing the updated estimtated distance from the starting node and the path by which we got to it. So far, so obvious.
What's not immediately obvious is that this declaration contains more type
information than we're used to seeing in standard OO languages. Remember that
this isn't a structure or class - just an ordinary hash map. The field names
are being represented here as types; that is, a type can specify not just a
particular sort of entity but the values that the entity is allowed to take!
Outide of map structures, you may not find yourself taking advantage of this ability at first, but
when core.typed
infers type information within your code, it will be as
conservative as possible, down to the value level if it has that information.
For example, if I had (if (> x 0.0) (Math/log x) "Bleh")
, the type might
be inferred as (U double "Bleh")
. A more general implication, about which
I'll say more a bit later, is that there are often multiple approaches to typing
within a given piece of code.
Note that the enforcement of this typing is strictly at our discretion. To
typecheck a namespace, run (check-ns)
; for a more automated experience,
lein typed check
(using a plugin you'll find in my project.clj
);
or examine individual expressions in the REPL using cf
.
Actual compilation can proceed irrespective of whether we did any of these
things, or whether they succeeded, and the compiled code won't differ either.
Let's take a look at some more type declarations in the project. One possibly unnecessary complication I introduced was a Mongo caching layer over IMDB, so it wouldn't be necessary to scrape the same thing more than once. There's a very nice Mongo interface available, but it doesn't (yet) come typed out of the box. One of the great things about the separation of definition and type declaration is that I can provide remedial typing of external libraries:
(t/ann ^:no-check monger.core/connect!
(Fn [(HMap :optional {:port t/AnyInteger :host String} :complete? true) -> (t/Option com.mongodb.MongoClient)]
[-> (t/Option com.mongodb.MongoClient)]))
(t/ann ^:no-check monger.core/get-db [String -> com.mongodb.DBApiLayer])
(t/ann ^:no-check monger.core/set-db! [com.mongodb.DBApiLayer -> com.mongodb.gridfs.GridFS])
As you can see, this adds type information to functions in another namespace.
- The
:no-check
metadata tellscore.typed
not to bother checking within the definitions of these functions (which would certainly fail). Option
is shorthand for(U Something nil)
The monadic allusion may be overplayed here, but explictly noting thatnil
is a possible outcome allows a type error if my code doesn't handle it.- As you can see, qualified Java class names can be types.
I also needed external libraries for web queries and parsing. My
httpkit
annotation
(t/ann ^:no-check org.httpkit.client/request
['{:url String :method (Value :get)}, [Any -> Any] ->
(t/Atom1 '{:status Number :body String})])
does not even attempt completeness.
As long as I err on the overly restrictive side, this does nobody
any harm. Worst case, I'll realize I need to use the function more
generally and have to expand my annotation. (In truth, [Any->Any]
is
too broad for the callback function, so I've sort of broken my own rule.
I think it should be (Value identity)
, but core.typed
doesn't
permit function constants at this point.)
I used enlive
to convert the raw
content stream from the imdb.com into a nested map. The declaration
language is rich enough to describe this object recursively
(t/def-alias Resource "Parsed DOM resource"
(Rec [x] (t/Map Keyword (U String x (t/Seq x))))
where x
stands for the full declaration of Resource
, and it's easy
enough to declare the functions I use:
(t/ann ^:no-check net.cgrand.enlive-html/html-resource [java.io.Reader -> Resource])
(t/ann ^:no-check net.cgrand.enlive-html/select [Resource (t/Vec Keyword) -> (t/Seq Resource)])
The actual parsing, however, is a bit ugly. Either we rigorously validate all the assumptions we make about the layout of the document, or we punt on the type checking. Extraction of links to actors or films starts a bit like this:
(-> resource
(html/select [:div.title])
(as-> doc (map #(-> % :content second :attrs :href) doc)))
That is, we expect to find a list of div
s of class title
, the second
item in the content of each of which is supposed to be the link we're interested
in. Expressing this with higher order functions and threading is relatively
compact and readable, but the type inference turns into a complete mess.
If we truly cared about Blythe Danner's relation to Meatloaf,
it would be worth doing a host of
extra validations, including but hardly limited to type checking, but there's
no guarantee that IMDB would be using the same format by the time we were done.
(We might hope that they don't: IMDB pages are a total rat's nest. I make life
a little easier by scraping the mobile pages, but there are still nuances such
as different URLs for actors and actresses, despite no explicit data for gender.)
In any case, I wrap my link extraction routine in a big fat no-check
.
(t/ann ^:no-check resource->links [String Resource -> (t/Set String)])
which nicely isolates the most typologically problematic portion of the code while allowing me to make progress on the rest.
The final external library I'll use is data.priority-map
. At each
iteration of the Dijkstra algorithm, we visit the node that currently has the
smallest estimated distance, and it's nice to be able to find that node in
something less than O(n). I could use java.util.PriorityQueue
, but
This lovely package gives me a
more idiomatic clojure solution. The object created by (priority-map)
behaves like a map, but you can use peek
and pop
to retrieve and
discard the entry with the smallest value.
Once again, of course, the package doesn't come
pre-typed, but core.typed
lets me correct for that:
(t/def-alias Queue (t/Map String t/AnyInteger))
(t/ann ^:no-check clojure.data.priority-map/priority-map [ -> Queue])
(t/ann ^:no-check pmap-assoc [Queue String (U nil t/AnyInteger) -> Queue])
(def pmap-assoc assoc)
(t/ann ^:no-check pmap-peek [Queue -> (Vector* String t/AnyInteger)])
(def pmap-peek peek)
(t/ann ^:no-check pmap-pop [Queue -> Queue])
(def pmap-pop pop)
The full cache layering can be summarized with these declarations:
(t/def-alias Graph "Graph of tokens to nodes" (t/Map String Node))
(t/ann fetch-node-from-imdb [String -> Node])
(t/ann fetch-node-from-mongo-fallback-to-imdb [String -> Node])
(t/ann fetch-node-from-graph-fallback-to-mongo [Graph String -> (Vector* Graph Node)])
And now for the core of the algorithm. Since I have complete control over my
own code, there will be none of this ^:no-check
nonsense; every single
line is type-checked. As the following function illustrates, core.typed
is doing a lot of work on my behalf:
(t/ann update-queue [Queue Graph String (t/Seqable String) -> Queue])
(defn update-queue [queue graph id links]
(reduce (t/fn> :- Queue [q :- Queue id :- String]
(pmap-assoc q id (:distance (get graph id)))) queue links))
To verify that this code does indeed produce a Queue
, type must be
properly inferred through 4 layers of function calls, one of them higher-order.
I find that impressive.
Had I fat-fingered the last line as
... (pmap-assoc id q (:distance (get graph id)))) queue links)
I would have learned that immediately
Type Error (imdb.core:219:18) Type mismatch:
Expected: Queue
Actual: String
in: (imdb.core/pmap-assoc id412107 q (:distance (clojure.lang.RT/get graph id412107)))
Type Error (imdb.core:219:18) Type mismatch:
Expected: String
Actual: Queue
in: (imdb.core/pmap-assoc id412107 q (:distance (clojure.lang.RT/get graph id412107)))
rather than having to deduce it from some downstream catastrophe.
The rest of the algorithm contains few surprises. As promised, it's purely functional, with multiple iterations of the graph emerging from an recursion and reduction.
(defn update-distance
"Update :distance field in graph node id to be one greater than di if appropriate
and return the updated graph."
[entry graph id]
(let [di (distance entry)
[graph node] (fetch-node-from-graph graph id)
path (or (:path entry) '())]
(if (<= (distance node) di) graph
(-> graph
(set-node id :distance (inc di))
(set-node id :path (conj path id))))))
(t/ann visit-node [Graph Queue String -> (Vector* Graph Queue)])
(defn visit-node
"Visits node id, and update the :distance fields of its neighbors.
queue is a priority queue of id=>distance, containing entries only for unvisited
nodes. We will use this to determine what to visit next, Dijkstra style."
[graph queue id]
(let [[graph entry] (fetch-node-from-graph graph id)
links (filter #(nil? (:visited (graph %))) (:links entry))
graph (reduce (partial update-distance entry) graph links)
queue (update-queue queue graph id links)
graph (set-node graph id :visited true)]
[graph queue]))
(t/ann find-distance [String String -> (t/Option Any)])
(defn find-distance
"E.g. (find-distance \"/name/nm0000257/\" \"/name/nm0000295/\")
Use Dijkstra algorithm to find shortest path from id to target."
[id target]
(let [ [graph node] (fetch-node-from-graph {} id)
graph (set-node graph id :distance 0)
queue (priority-map)]
(t/loop> [graph :- Graph graph
queue :- Queue queue
id :- String id]
(if (= id target) (get graph id)
(let [[graph queue] (visit-node graph queue id)
closest (first (pmap-peek queue))]
(if (empty? queue)
"Couldn't find a path!"
(recur graph (pmap-pop queue) closest)))))))
I mentioned above that the core.typed
's flexibility gives us some
unexpected choices. To investigate this in the REPL, we'll use a couple of
extra tools.
First, (t/ann-form #{"a"} (t/Set String))
specifies inline that
#{a}
is to be treated as a set of String, rather than the more conservatively
inferred (t/Set (Value "a"))
, and (t/cf someform)
checks a form and
displays its inferred type.
Here's how union
might be annotated
(t/ann ^:no-check clojure.set/union (All [a] (Fn [(t/Set a) * -> (t/Set a)])))
and here's how that would be interpreted:
imdb.core> (t/cf (clojure.set/union (t/ann-form #{1} (t/Set Integer)) (t/ann-form #{"one"} (t/Set String))))
(t/Set (U Integer String))
That certainly isn't what I expected, given apparently analogous declarations in other languages.
For example, scala's sets have this union
method
def union(that: GenSet[A]): Set[A]
which of course bombs on heterogenous sets:
scala> HashSet(1).union(HashSet(2))
res0: scala.collection.immutable.HashSet[Int] = Set(1, 2)
scala> HashSet(1).union(HashSet("one"))
<console>:9: error: type mismatch;
found : String("one")
required: Int
To get the familiar behavior from clojure, we'd have to specify an equality bound on the type parameter:
(t/ann clojure.set/union (All [x [x1 :< x :> x]]
(Fn [(t/Set x) (t/Set x1) * -> (t/Set x1)])))
Now:
imdb.core> (t/cf (clojure.set/union (t/ann-form #{1} (t/Set Integer))
(t/ann-form #{"one"} (t/Set String))))
AssertionError Assert failed: 1: Inferred type String is not between bounds Integer and Integer
(and (subtype? inferred upper-bound) (subtype? lower-bound inferred)) clojure.core.typed.cs-gen/subst-gen/fn--10414 (cs_gen.clj:1333)
Neither of these alternate declarations is obviously more correct, and both will catch our coding errors, though in slightly different ways. Suppose we had the following solecism:
(t/cf (let [s (clojure.set/union #{1} #{"a"})]
(map inc s)))
With the bounded annotation, the call to union
is simply illegal:
AssertionError Assert failed: 1: Inferred type (Value "a") is not between bounds (Value 1) and (Value 1)
(and (subtype? inferred upper-bound) (subtype? lower-bound inferred)) clojure.core.typed.cs-gen/subst-gen/fn--10414 (cs_gen.clj:1333)
With the first, more general declaration, the call to union
is fine, but inc
will have problems:
Type Error (imdb.core:2:6) Polymorphic function clojure.core/map could not be applied to arguments:
Domains:
(Fn [a b ... b -> c]) (t/NonEmptySeqable a) (t/NonEmptySeqable a) ... b
(Fn [a b ... b -> c]) (U nil (clojure.lang.Seqable a)) (U nil (clojure.lang.Seqable b)) ... b
Arguments:
(Fn [t/AnyInteger -> t/AnyInteger] [Number -> Number]) (t/Set (U (Value 1) (Value "a")))
Ranges:
(t/NonEmptyLazySeq c)
(clojure.lang.LazySeq c)
with expected type:
Any
in: (clojure.core/map clojure.core/inc s)
In both cases, our error is caught, but at different stages of analysis.
This flexibility is immensely powerful, and it encourage more nuanced thinking about static types. In many situations, finding ourselves with a heterogeous set would be a sign that something had gone wrong, but not always. For instance, with our more general annotation, we could do this:
imdb.core> (t/cf (let [s (clojure.set/union #{1} #{1.0})]
(map inc s)))
(clojure.lang.LazySeq Number)
I think I've demonstrated that core.typed
is cool.
I know that it has improved my own code.
I hope that it will ultimately remove one of the greatest barriers to broad adoption of clojure.
Comments
comments powered by Disqus