Archive for the ‘Clojure Coding’ Category

Another Boost in the CryptoQuip Performance (cont.)

Friday, December 19th, 2014

Clojure.jpg

I got another email from my friend on the updates I'd made to the solver, and he pointed out that the pattern function we were using:

  (defn pattern
    "Function to take a word (as a string) and return a vector that is the
    pattern of that word where the values are the index of the character.
 
      => (pattern \"see\")
        (0 1 1)
      => (pattern \"rabbit\")
        (0 1 2 2 3 4)
    "
    [word]
    (let [dw (distinct word)]
      (map #(.indexOf dw %) word)))

does something that's not strictly necessary to the proper functioning of the algorithm - the distinct call on the input word.

If we have a function that maps these words into some kind pattern, then it really doesn't matter what the pattern is - as long as it's just the pattern, and is not able to be confused with another pattern. So if we choose to change the code to look something like this:

  (defn pattern
    "Function to take a word (as a string) and return a vector that is the
    pattern of that word where the values are the index of the character.
 
      => (pattern \"see\")
        (0 1 1)
      => (pattern \"rabbit\")
        (0 1 2 2 4 5)
    "
    [word]
    (map #(.indexOf word (int %)) word))

where we are ignoring the cost of the repeated characters in the word in terms of the index they represent, then we have a similar pattern that's deterministic and comparable, but it lacks the call to distinct, and so it's going to be a little faster.

In the REPL, we can see the old way:

  cq.main=> (time (dotimes [_ 100000] (dpat "bivteclnbklzn")))
  "Elapsed time: 8.748 msecs"
  nil
  cq.main=> (time (dotimes [_ 100000] (dpat "bivteclnbklzn")))
  "Elapsed time: 8.158 msecs"
  nil

or about 87 nsec/call. That's fast. But when we test the new non-distinct mapper, we get:

  cq.main=> (time (dotimes [_ 100000] (pattern "bivteclnbklzn")))
  "Elapsed time: 4.859 msecs"
  nil
  cq.main=> (time (dotimes [_ 100000] (pattern "bivteclnbklzn")))
  "Elapsed time: 4.972 msecs"
  nil

or down to about 49 nsec/call. That's another significant percentage on that call, but I doubt it'll convert to something really significant in the overall solution. It's down to about 5 to 6 msec per benchmark solution. Not bad, considering it's all on a laptop, and all on the JVM. I do love this game!

Another Boost in the CryptoQuip Performance

Thursday, December 18th, 2014

Clojure.jpg

Today I got an email from my friend about the times we were getting on the pattern function in the code - that function that turns the word into a sequence of ints that corresponds to the character pattern of the word. So we might have:

  => (pattern "see")
  [0 1 1]
  => (pattern "rabbit")
  [0 1 2 2 3 4]

At the time, my best code looked like this:

  (defn pattern
    [word]
    (mapv (into {} (map vector (distinct word) (range))) word))

and the times we were getting just for this function were not as good as his version:

  (defn pattern
    [word]
    (loop [p [] d {} n 0 f (first word) r (rest word)]
      (if f
        (if (contains? d f)
          (recur (conj p (d f)) d n (first r) (rest r))
          (recur (conj p n) (assoc d f n) (inc n) (first r) (rest r)))
        p)))

On 1000 calls he was beating me by a couple of milliseconds - and it makes sense... he's building the map as he goes, and I took the time to build it up-front, and I'm paying the cost for it. The big cost is that I then have to run map again to use the map I just created. His code builds and uses the map at the same time.

So I was trying to think of a way to speed up the process, and I couldn't think of a simple way, and then I decided just to try a simple test - use the Java indexOf method to get the index on the distinct string and then return that. The code then becomes:

  (defn pattern
    [word]
    (let [dw (distinct word)]
      (map #(.indexOf dw %) word)))

The times were startling:

  cq.main=> (time (dotimes [_ 1000] (bob "bivteclnbklzn")))
  "Elapsed time: 0.342 msecs"
  nil
  cq.main=> (time (dotimes [_ 1000] (bret "bivteclnbklzn")))
  "Elapsed time: 3.605 msecs"
  nil

where bob is my new version with indexOf and bret is his version with the loop/recur.

A factor of 10x. Wow! Ten times faster. Make the string and use the inherent properly of the string and it's index to do the mapping. What a hoot! I have really enjoyed working on this with my friend.

When we put this into the web server, we get even better solve times:

[12:29:13.757:qtp1360409373-14] INFO  cq.block - Finished match-up [8 words] in 1ms.
[12:29:13.763:qtp1360409373-14] INFO  cq.block - Finished solve in 7ms.

...just amazingly fun.

Better Word Grouping in the CryptoQuip

Wednesday, December 17th, 2014

Clojure.jpg

I was talking with my friend today, and he pointed out that the size of the word sets were much smaller when you grouped them by their character pattern as opposed to just their length. The thought is that a significant portion of the execution time is in the filtering of the list of possible words - based solely on their length, and if you can group the possible words by their pattern, then they all have to pass that criteria, and you can skip the filtering.

The code to compute this character pattern isn't all that clear, but it's very effective:

  (defn pattern
    "Function to take a word (as a string) and return a vector that is the
    pattern of that word where the values are the index of the character.
 
      => (pattern \"see\")
        [0 1 1]
      => (pattern \"rabbit\")
        [0 1 2 2 3 4]
    "
    [word]
    (loop [p [] d {} n 0 f (first word) r (rest word)]
      (if f
        (if (contains? d f)
          (recur (conj p (d f)) d n (first r) (rest r))
          (recur (conj p n) (assoc d f n) (inc n) (first r) (rest r)))
        p)))

and then we simply change how the words are organized, from:

  (def words
    "This is the map of all known plaintext words - organized by the length
    of the word to speed up the matching process. The result is a map where
    the key is the length, and the value is a sequence of words."
    (->> (slurp "resources/words")
         (.split #"\n")
         (group-by count)))

to:

  (def words
    "This is the map of all known plaintext words - organized by the length
    of the word to speed up the matching process. The result is a map where
    the key is the length, and the value is a sequence of words."
    (->> (slurp "resources/words")
         (.split #"\n")
         (group-by pattern)))

at this point, we don't need to filter the word list by the pattern because we have already done this - so let's remove that as well.

At this point, the words are grouped by their pattern and not just the length. The difference in the times is really profound:

  [13:01:16.677:main] INFO  cq.block - Finished match-up [8 words] in 39ms.
  [13:01:16.687:main] INFO  cq.block - Finished solve in 49ms.

becomes:

  [13:45:46.967:main] INFO  cq.block - Finished match-up [8 words] in 0ms.
  [13:45:46.981:main] INFO  cq.block - Finished solve in 14ms.

We have simply removed the single most expensive part of the solution, and we get the same solution. Very nice. Very.

UPDATE: I came up with a simpler implementation of pattern that - at least to me - seems to be a lot easier to follow. You make a map of the characters and their position and then uses that as the function on the word itself:

  (defn pattern
    "Function to take a word (as a string) and return a vector that is the
    pattern of that word where the values are the index of the character.
 
      => (pattern \"see\")
        [0 1 1]
      => (pattern \"rabbit\")
        [0 1 2 2 3 4]
    "
    [word]
    (mapv (into {} (map vector (distinct word) (range))) word))

It turns out that this is also a little faster. We're down to 11 msec. Nice!

Deciding to Help Out a Friend

Wednesday, December 17th, 2014

Machine Learning

Carl and I have been talking about this idea for applied machine learning for analyzing Salesforce.com data, and now that I know what is going to happen to me with the new job, I told him in chat that I'd be glad to build him a proof-of-concept for the task. I know he has test data, and he has the algorithms, and what he needs is something to prove that this will work at scale. Sounds like fun.

I'm not at all sure what the algorithms look like, but it's all in Python code, and he's online so I can ask him as I build this - and I don't have to integrate it with Salesforce.com - I just need to hammer out the code into a decent clojure project that can perform the requested analytics on the data and come up with the "right" answers.

This isn't all that different from the CryptoQuip work I've been doing. I really like this stuff. Clojure makes it so easy to piece stuff together, and with the REPL and all that, it's just amazing how nice it is. So it'll be a nice diversion for a while and then maybe a few weekends and then who knows?

Bitbucket is Pretty Decent

Wednesday, December 17th, 2014

Bitbucket

While it's no GitHub, I've been working with Bitbucket for a few projects with a friend, and I have to say it's not bad at all. I happen to like the helpful hints GitHub offers on making a new repo, but the help Bitbucket has is very fair, and reasonable. It's got a little different UI, which is fine, and it works pretty much the same so that you don't have to come up to speed with a new workflow if you're switching back and forth from Bitbucket to GitHub.

I looked at Bitbucket a long time ago, and it wasn't nearly as similar to GitHub as it is now, and that makes a lot of sense. It's got Pull Requests, and it's got a very nice source view:

Bitbucket Source

Many will say It's just like GitHub's - which is a good thing. Face it - you copy what's a great design and works well. This is one of those cases.

The thing I like about Bitbucket is the free repos for small teams - up to 5 people. In short, they have a different approach than GitHub - repos are free, people are charged. It's a nice approach because it means that non-Open Source projects that you work on with a friend can be private without a cost. GitHub charges for the first private repo, and I don't begrudge them that - I'm a paying GitHub user. But I like the different approach because it gives small groups a leg up.

I'm not going to stop using GitHub, but it's nice to see that Bitbucket exists and seems every bit as good as GitHub for the task. It's something I'll remember.

SNAP! Got the Speed in Clojure

Tuesday, December 16th, 2014

Clojure.jpg

Well... I just pushed the commits that added the web server to the CryptoQuip repo with my friend. I have been able to get the speed down to that of the Obj-C code, and it was really interesting to me where the hold-up was. I needed to add some logging with timings so that I could get performance numbers for changes as I ran them over and over. But what I found was that the real bottleneck was in the possible? predicate function that was being used to filter the possible words to those that matched the pattern of each cypherword.

What I needed to do was to switch to pmap to make it efficient on getting the matches:

  (defn possible?
    "Function to see if the cyphertext and plaintext have the same pattern of
    characters such that they could possibly match - given the right legend."
    [ct pt]
    (if (and (string? ct) (string? pt) (= (count ct) (count pt)))
      (let [ctc (count (distinct ct))
            ptc (count (distinct pt))]
        (if (= ctc ptc)
          (= (count (distinct (map str ct pt))) ctc)
          false))
      false))

where I now look at the (distinct) length of the cypher and plaintext before making the map of pairs and counting them. This saves a good bit, but the biggie was in breaking out the creation of the possibles sequence:

  (defn match-up
    "Function to create a sequence of all the cypherwords and the possible
    matches to it in a way that we can easily time this for performance
    tuning."
    [quip]
    (let [qw (vec (distinct (.split quip " ")))
          go (fn [cw] (let [poss (filter #(possible? cw %) (get words (count cw)))]
                        { :cyphertext cw
                          :possibles poss
                          :hits (count poss) }))]
      (sort-by :hits < (pmap go qw))))
 
  (log-execution-time! match-up
    {:msg-fn (fn [ret q] (format "%s words filtered" (count ret)))})

Here, the biggie is splitting up the matching using pmap to make sure that we work on this matching problem as fast as possible. It's all still very functional, as the source lists are immutable, and it's a simple filtering problem.

With this, we're in the 65 msec range for the solution, and that's about where the ARC-based Obj-C code was, and that's pretty nice.

More Fun with the CryptoQuip Solver in Clojure

Monday, December 15th, 2014

Clojure.jpg

This morning I finished up a different implementation of the CryptoQuip solver that I've been migrating from Obj-C to ruby - and now to clojure. A friend has already done a port of the logic, but he made slightly different implementation scheme, and therefore a different execution profile. This is a lot more along the lines of the the Obj-C and ruby ports, just in clojure.

What really impressed me what the exceptionally tight implementation of the functions in clojure. For the most part, the code is about the size of the ruby code, except where it comes to the reentrant attack code. There, clojure is once again, just amazing. The ruby looks like this:

  # This is the recursive entry point for attempting the "Word Block" attack
  # on the cyphertext starting at the 'index'th word in the quip. The idea is
  # that we start with the provided legend, and then for each plaintext word
  # in the 'index'ed cypherword that matches the legend, we add those keys not
  # in the legend, but supplied by the plaintext to the legend, and then try
  # the next cypherword in the same manner.
  #
  # If this attack results in a successful decoding of the cyphertext, this
  # method will return true, otherwise, it will return false.
  def do_word_block_attack(idx, key)
    p = @pieces[idx]
    cw = p.cypherword
    p.possibles.each do |pt|
      have_solution = false;
      if cw.can_match?(pt, key)
        # good! Now let's see if we are all done with all the words
        if (idx == @pieces.count - 1)
          # make sure we can really decode the last word
          if key.incorporate_mapping(cw.cyphertext, pt)
            # if it's good, add the solution to the list
            if (dec = key.decode(@cyphertext))
              @solutions << dec
              have_solution = true
            end
          end
        else
          # OK, we had a match but we have more cypherwords
          # to check. So, copy the legend, add in the assumed
          # values from the plaintext, and move to the next
          next_key = key.clone
          if next_key.incorporate_mapping(cw.cyphertext, pt)
            have_solution = do_word_block_attack(idx + 1, next_key)
          end
        end
      end
      # if we have any solutions - stop
      break if have_solution
    end
    @solutions.count > 0
  end
end

and the clojure code looks like:

  (defn attack
    "Function to do the block attack on the quip, broken down into a sequence
    of the cyphertext words, and their lists of possible plaintext words, as
    well as the index in that sequence to attack, and the clue (legend) to use.
    This will return the first match to the attack and that's it."
    [quip pieces idx clue]
    (if (and (coll? pieces) (map? clue))
      (let [{cw :cyphertext poss :possibles} (nth pieces idx)
            last? (= idx (dec (count pieces)))]
        (some identity (for [pt poss
                             :when (matches? clue cw pt)
                             :let [nc (merge-clue clue cw pt)]
                             :when nc]
                         (if last?
                           (decode nc quip)
                           (attack quip pieces (inc idx) nc)))))))

which is just amazingly simple and easy to follow.

The execution times don't beat those that my friend's code, but maybe it's due to loading, or maybe he'll be able to find something in the performance I missed. But you just can't beat the simplicity of the clojure code.

Just amazing!

VisualVM is Bundled with the JVM

Wednesday, December 3rd, 2014

java-logo-thumb.png

I have been using VisualVM for quite a while now... it's got the monitoring tools that work pretty nicely if you have the network connectivity you need to get to the box. What I learned this morning on the #clojure channel on IRC is that it's now included in the JVM.

That's sweet! SO I had to check:

  $ which jvisualvm
  /usr/bin/jvisualvm

Really nice! It's the same tool, but now it's integrated, and that's even better.

Now I'll be able to use it on any platform and not have to hassle with downloading it as a separate install. Very nice!

Faster Math in Clojure 1.7

Wednesday, December 3rd, 2014

Clojure.jpg

Clojure is forced to use reflection in a lot of cases simply because there is no type system in place, and when you're iterating on a sequence of elements, you have to ask each one Hey... what are you? and then see if you can do something with that. It's understandable, and it's the point of adding type hinting in the function signatures, but it hurts when you have Java auto-boxing int to Integer, and then trying to do math on those values.

It slows simple math way down - and that's not typically something you want to slow down.

Thankfully, in Clojure 1.7, there's a new dynamic variable that will check all the math to make sure that it's not boxed:

  (set! *unchecked-math* :warn-on-boxed)

and then for the scope of that definition, it'll check to see that all the math is not boxed. This can help identify the source of a performance issue without the painful profiling.

Great to see... now if Storm could just be on 1.7...

Getting Prismatic Schema to Compile Under 1.4

Monday, November 24th, 2014

Clojure.jpg

I've been trying to get some data validation code going, and one of the best looking libraries is Prismatic's Schema. It's got both clojure and clojurescript, as well as a very well thought out scheme. It's got capabilities to run tests on code, in-line, or as data to filter like tests. It's very nice.

Problem is, it's meant to use clojure 1.5.1 or better.

But I wanted to see if we could get it going in clojure 1.4.0 for the Storm topologies we have. Turns out, it wasn't all that easy. But it appears possible.

First, let's change a few things in the project.clj file. First, make the version something that's clear it's not the normal version:

  (defproject prismatic/schema "0.3.4G"

and then, because clojure 1.4.0 doesn't have the EDN reader, we needed to add:

    :dependencies [[potemkin "0.3.2"]
                   [org.clojure/tools.reader "0.8.12"]]

and then changed the main version of clojure:

    :profiles [{:dev {:dependencies [[org.clojure/clojure "1.4.0"]

And then in src/cljx/schema/coerce.cljx we need to reference the correct EDN reader:

    #+clj [clojure.tools.reader.end :as end]

and then in test/cljx/schema/core_test.cljx we need to use a proper defprotocol for 1.4.0:

  (defprotocol ATestProtocol (test-fn [x]))

And at this point, we can run:

  $ lein do clean, reps, test

And it will work just fine - for the clojure part. The clojure script requires 1.5.1, but that's OK, as we don't need it.