Better Word Grouping in the CryptoQuip

Clojure.jpg

I was talking with my friend today, and he pointed out that the size of the word sets were much smaller when you grouped them by their character pattern as opposed to just their length. The thought is that a significant portion of the execution time is in the filtering of the list of possible words - based solely on their length, and if you can group the possible words by their pattern, then they all have to pass that criteria, and you can skip the filtering.

The code to compute this character pattern isn't all that clear, but it's very effective:

  (defn pattern
    "Function to take a word (as a string) and return a vector that is the
    pattern of that word where the values are the index of the character.
 
      => (pattern \"see\")
        [0 1 1]
      => (pattern \"rabbit\")
        [0 1 2 2 3 4]
    "
    [word]
    (loop [p [] d {} n 0 f (first word) r (rest word)]
      (if f
        (if (contains? d f)
          (recur (conj p (d f)) d n (first r) (rest r))
          (recur (conj p n) (assoc d f n) (inc n) (first r) (rest r)))
        p)))

and then we simply change how the words are organized, from:

  (def words
    "This is the map of all known plaintext words - organized by the length
    of the word to speed up the matching process. The result is a map where
    the key is the length, and the value is a sequence of words."
    (->> (slurp "resources/words")
         (.split #"\n")
         (group-by count)))

to:

  (def words
    "This is the map of all known plaintext words - organized by the length
    of the word to speed up the matching process. The result is a map where
    the key is the length, and the value is a sequence of words."
    (->> (slurp "resources/words")
         (.split #"\n")
         (group-by pattern)))

at this point, we don't need to filter the word list by the pattern because we have already done this - so let's remove that as well.

At this point, the words are grouped by their pattern and not just the length. The difference in the times is really profound:

  [13:01:16.677:main] INFO  cq.block - Finished match-up [8 words] in 39ms.
  [13:01:16.687:main] INFO  cq.block - Finished solve in 49ms.

becomes:

  [13:45:46.967:main] INFO  cq.block - Finished match-up [8 words] in 0ms.
  [13:45:46.981:main] INFO  cq.block - Finished solve in 14ms.

We have simply removed the single most expensive part of the solution, and we get the same solution. Very nice. Very.

UPDATE: I came up with a simpler implementation of pattern that - at least to me - seems to be a lot easier to follow. You make a map of the characters and their position and then uses that as the function on the word itself:

  (defn pattern
    "Function to take a word (as a string) and return a vector that is the
    pattern of that word where the values are the index of the character.
 
      => (pattern \"see\")
        [0 1 1]
      => (pattern \"rabbit\")
        [0 1 2 2 3 4]
    "
    [word]
    (mapv (into {} (map vector (distinct word) (range))) word))

It turns out that this is also a little faster. We're down to 11 msec. Nice!