Like any red-blooded American, I find regex golf fascinating. The idea, to paraphrase the comic strip, is to find a regular expression that matches all members of a group of related terms, but not any members of a different group. The hover text on the strip suggests a regex that matches all winning presidents.

Peter Norvig went to town on this, first clarifying the problem and then building an algorithm to search for the shortest solution that matches all mainstream presidential candidates who eventually won, but not such candidates who never won. (So Perot, Nader, Anderson et al don't figure.) Anyway, read his post before continuing to read this, if you haven't already.

Peter's algorithm is best described by him, but, briefly, it creates a pool of regex fragments, like /r.e$ and repeatedly pulls from it the most effective ones, in the sense that they match as many winners as possible, don't match any losers and contribute the least to the total length of the regex when they're all |'d together.

I started to wonder whether the problem could be approached as a more more general optimization, with the specifics of the problem abstracted out into some kind of objective function. Specifically, I was thinking of simulated annealing, which I've always liked because it has a nice analogy in the physical world and which has been getting a lot press recently as a possible application of quantum computing. As always, it also seemed interesting to try implementing this in Clojure, especially because one doesn't (or I didn't) think of iterative optimizations over complex state as natural candidates for a functional lisp.

All the code is on github, but some of it will be recapitulated inline below.

Simulated Annealing

In simulated annealing, you have some sort of state (e.g. the direction of magnetic spins in a crystal lattice) and model for the energy of that state (e.g. a positive contribution for neighboring spins that are aligned). However the state and its energy are defined, we assume the Boltzmann distribution for the likelihood of any given state at any given temperature: $$ p \sim e^{- E(S)/k_B T} $$ where $k_B$ is the Boltzmann constant. Sliding past the detailed physics for a moment, the basic idea is that, at high temperatures, many different fluctuating states are possible even if their energies are high, while, as $T$ gets small, there is greater punishment of higher energies. Starting a high $T$ and gradually lowering it, we "anneal" into the lowest possible energy state.

Metropolis-Hastings is a Markov-Chain Monte Carlo algorithm for generating random states that satisfy a complicated probability distribution. It is well suited for the Boltzmann distribution, which was its earliest application. The gist (explained in terms of Boltzmann) is that you modify the state $S$ in some (almost) arbitrary fashion to $S' = f_r(S)$ (where the $r$ subscript indicates that this transformation has some randomness to it) and then use the following criteria to decide whether to accept the transformed state or to try transforming again:

  • If $E(S') < E(S)$, definitely accept $S'$.
  • If $E(S') > E(S)$, accept $S'$ with probability $e^{-(E(S') - E(S))/k_B T}$.

If you repeat this long enough, the collection of $S$ values will satisfy the Boltzmann distribution. I said almost any fashion, because the $f_r$ must satisfy some criteria for the algorithm to work at all. For example the identity function would be terrible, as would any function that can't reach all possible states $S$. Similarly, if $f_r$ is too close to identity, you'll explore the space very slowly.

The trick of simulated annealing is to lower the $T$ slowly. As the $T$ in the denominator decreases the magnitude of the negative exponent gets large, and the probability of accepting an increase in energy approaches zero. At large $T$, the magnitude of the exponent is reduced, thus increasing the probability that we'll be able to explore even if we're currently in a local minimum.

Energy and state for the regex problem

For our purposes, we'll use a state that, echoing Peter's algorithm, is a set of regular expressions that will be or'd together to form our answer. Instead, however, of starting with fragments, we'll be starting with the least subtle solution possible:

xkcd> (def initial (set (map #(str "^" % "$") winners)))
xkcd> initial
#{"^washington$" "^adams$", ...}

The or-concatenation of this set obviously matches the winners and only the winners. Again, we define winners as people who eventually won:

(def winners
  "Anyone who ever won the presidency."
  (set (str/split "washington adams jefferson jefferson madison madison monroe monroe adams jackson jackson vanburen harrison polk taylor pierce buchanan lincoln lincoln grant grant hayes garfield cleveland harrison cleveland mckinley mckinley roosevelt taft wilson wilson harding coolidge hoover roosevelt roosevelt roosevelt roosevelt truman eisenhower eisenhower kennedy johnson nixon nixon carter reagan reagan bush clinton clinton bush bush obama obama" #" ")))

(def losers 
  "Anyone who ran as a major party candidate but never won"
  (sets/difference  (set (str/split "clinton jefferson adams pinckney pinckney clinton king adams jackson adams clay vanburen vanburen clay cass scott fremont breckinridge mcclellan seymour greeley tilden hancock blaine cleveland harrison bryan bryan parker bryan roosevelt hughes cox davis smith hoover landon wilkie dewey dewey stevenson stevenson nixon goldwater humphrey mcgovern ford carter mondale dukakis bush dole gore kerry mccain romney" #" ")) winners ))

We can easily verify whether a candidate set of regular expressions satisfies the basic criteria:

(defn check [res] 
  (let [re (re-pattern (str/join "|" res))]
    (and (every? #(re-find re %) winners)
         (not (some #(re-find re %) losers)))))

For lack of anything obviously cleverer, $E(S)$ will just be the length of the regex. Not actually matching the presidents correctly corresponds to infinite energy, which we'll represent with nil.

(defn energy [res]
  (when (check res) (dec  (+ (count res) (reduce + (map count res))))))

Now, what sorts of perturbations might be interesting? The first thing I thought of was chopping up a randomly chosen regex at a randomly chosen position:

(defn get-rand-gt1 "Fetch a re of length at least two, as long as we have at least one"
  [res]
  (let [cands (filter #(> (count %) 1) res)]
    (if (seq cands) (rand-nth cands) nil)))

(defn chop
  "Randomly chop one regex into two with |"
  [res]
  (let [cand (get-rand-gt1 res)]
    (if cand
      (sets/union (disj res cand)
                  (set (map str/join (split-at (inc (rand-int (dec (count cand)))) cand )))))))

This operation should never cause us not to miss a winner, but it might cause us to match a loser - in which case, the energy will be infinite and it will be rejected.

Frequently, we should also try to get rid of a regex entirely, which, if it succeeds, will definitely decrease the total length.

(defn yank [res] (disj res (rand-nth (seq res))))

I also wrote transformations to insert a dot into the middle of one and, not competely orthogonally, `decapitate the first or last character. These exciting functions are on github.

It's especially important that we have at least one transformation that will help us back out of dead ends. For this, I decided to randomly add back the full regexp for one of the winners, so we can have a go at chopping it up again.

(defn add-back [res]  (conj res (str "^" (rand-nth (seq  winners)) "$")))

This operation will always increase the energy of the state, so it will become less and less likely at low temperatures.

Finally, everything gets bundled into a single perturb function that picks one of the transformations randomly and applies it.

The optimization algorithm

The strategy is to perturb the state randomly, possibly accept the change, slightly reduce the temperature and repeat. The temperature reduction will be achieved by repeatedly multiplying by 1.0-dT. Without loss of generality, we take $k_B=1$.

This doesn't take very much code at all! The steps function produces an infinite lazy sequence of vectors [S E T] vectors:

(defn steps [S0 energy perturb T dT]
  (letfn [(step [[S E T]]
            (let [S2     (perturb S)
                  E2     (energy S2)
                  [S E]  (if (and E2 (> (Math/exp (/  (- E E2) T)) (rand)))
                               [S2 E2] [S E])]
              [S E (* T (- 1.0 dT))]))]
    (iterate step [S0 (energy S0) T])))

Remember that iterate basically does the cons/lazy-seq shuffle for the case where the function only needs to consume its own output.

I think that's pretty neat. The code is compact, general, and thanks once again to persistent data structures, it gets to be purely functional.

To make some sense of this, I wrote something to filter out uninteresting states, displaying every dnth step, or when the energy is at least dE less than the previous minimum.

(defn annotate  [steps dn dE]
  (letfn [(annotate* [steps n minE]
            (let [[S E T]           (first steps)
                  newMinE     (if minE (min E minE) E)
                  out1        (when (and dE minE (<= dE (- minE newMinE))) (str "*** " E " " (str/join "|" S)))
                  out2        (when (and dn (zero? (mod n dn))) (str n " " step))
                  tail        (lazy-seq (annotate* (next steps) (inc n) newMinE))]
              (if (or out1 out2) (cons  (str out1 out2) tail) tail)))]
    (annotate* steps 0 nil)))

When and only when there's something interesting to report, we cons an explanatory string to the sequence, so this sequence is far sparser than the input steps.

I'll invoke a simulation from the REPL with

(doseq [x (annotate2 (steps2 initial energy 5.0 0.0000001) nil 1)] (println x))

The doseq forces the lazy sequence to evaluate without holding on to the head. The starting temperature of 5.0 is not completely arbitrary: we want something such that the probability of of an add-back being accepted is high enough to occur occasionally.

Here's some typical output, with . . . indicating elisions.

*** 329 adams$|^reagan$|^obama$|^mckinley$|^vanburen$|^eisenhower$|^kennedy$|^lincoln$|^truman$|^taylor$|^garfield$|^harrison$|^coolidge$|^grant$|^nixon$|^roosevelt$|^polk$|^madison$|^monroe$|^taft$|^harding$|^hoover$|^cleveland$|^wilson$|^jackson$|^hayes$|^buchanan$|^washington$|^jefferson$|^carter$|^johnson$|^clinton$|^bush$|^pierce$
*** 328 adams$|^reagan$|^obama$|^mckinley$|^vanburen$|truman$|^eisenhower$|^kennedy$|^lincoln$|^taylor$|^garfield$|^harrison$|^coolidge$|^grant$|^nixon$|^roosevelt$|^polk$|^madison$|^monroe$|^taft$|^harding$|^hoover$|^cleveland$|^wilson$|^jackson$|^hayes$|^buchanan$|^washington$|^jefferson$|^carter$|^johnson$|^clinton$|^bush$|^pierce$
. . .
*** 107 rs|r.i.|ln|ru|o...$|i..n|o...e|a.t|ks|hn.|.n..y$|am|ye|..olidge$|.i.so|^c.eve.a.|ie.|bu..|t.n|r..g..|po|a.l
*** 106 rs|ln|ru|o...$|i..n|o...e|a.t|ks|hn.|.n..y$|am|ye|r.i|..olidge$|.i.so|^c.eve.a.|ie.|bu..|t.n|r..g..|po|a.l
*** 105 .olidge$|rs|ru|^taylor$|o...$|i..n|o...e|hn|i..o|a.t|ks|.n..y$|am|ye|r.i|^c.eve.a.|ie.|bu..|t.n|r..g..|po
*** 104 rs|ru|^taylor$|o...$|i..n|o...e|hn|i..o|a.t|ks|.n..y$|am|ye|r.i|olidge$|^c.eve.a.|ie.|bu..|t.n|r..g..|po
. . .

After a few minutes, one of my runs produced

*** 52 n.e|ho|ls|a.t|a..i|j|^n|bu|v.l|ma|a.a|ay.|r.e$|li|po

Not half bad.

So what?

Obviously the algorithm could be improved, perhaps significantly, but I decided not to try (yet). First, it's slightly under the par established by Peter Norvig and so is probably probably not terrible. Second, part of the appeal of this approach was that it wouldn't require too much fiddling. I don't mean to imply that this is a "better" approach; it burns considerable CPU for essentially the same quality answer, but it satisfied my goals of abstracting the domain specifics, allowing me finally to use simulated annealing for something and demonstrating another sort of program that can be handled gracefully in Clojure.



Comments

comments powered by Disqus