Archive for the ‘Clojure Coding’ Category

What Would I Build?

Monday, November 25th, 2013

Storm

I've been playing around with Storm for a while now, and while I don't think there are all that many folks in the world who are expert at it, I'm certainly an advanced novice, and that's good enough for the amount of time I've put into it. I've learned a lot about how they have tried to build a high-performance computing platform in clojure and on the JVM, and I've come away with an affirmation of the feelings I had when I was interviewed for this job and we discussed functional languages: Garbage Collection is the death of all functional languages, and certainly of Storm.

I like the simplicity of functional languages with a good library of functions. Face it, Java took off over C++ because C++ was the base language and Java had the rich class library that everyone built on. It made a huge difference in how fast people could build things. So if you want a functional language to gain a lot of traction fast, you need to make sure you don't send people off to re-invent the wheel for the most basic tasks.

But the real killer is Garbage Collection. I'm not a fan, and the reason is simple - if I'm trying to do some performant coding, I want to control when collection happens, and under what conditions. It's nice that novices can forget about memory and still write stable code, but when you want to move 1,000,000 msgs/sec, you can't do it without pools, lockless data structures, mutability, and solid resource control - none of which I get in the JVM, or anything based on it.

So what's a coder to do? Answer: Write another.

There used to be Xgrid from Apple, but they dropped that. They didn't see that it was in their best interests to write something that targets their machines as nodes in a compute cluster, and they aren't about to write something where you can use cheap linux boxes and cut them out altogether. Sadly, this is a company, and they want to make money.

But what if we made a library that used something like ZeroMQ for messaging, with C++ for the linux side and Obj-C++ for the Mac side, and made all the tools work like they do for Storm - except that instead of using clojure, the JVM, and a ton of server-side tools to handle all the coordination and messaging, we'd use something far more tightly coupled to the toolset we're working with?

First, no Thrift. It's bulky, expensive, and it's being used as a simple remote procedure call. There are a lot of better alternatives out there when you're using a single language. Stick with a recent version of ZeroMQ and decent bindings - like their C++ ones. Start small and build it up. Make a decent console - Storm is nice here, but there's a lot more that could be done, and the data in the Storm UI is not easily discernible. Make it clearer.

Maybe I'll get into this... it would certainly keep me off the streets.

Building Clojure Libraries for use in Storm

Friday, September 20th, 2013

Storm

Here's a little gem that I figured out while trying to deliver a nice clojure library for some storm work that I've been doing. The problem is that when you build an uberjar for a topology (or library) with leiningen, you don't want to include the storm jars, as they will mess up the storm cluster when you go to deploy the topology. So how do you get everything to work locally, but still have the uberjar build go smoothly?

Sadly, this is not clearly documented anywhere that I could find. But there were bits and pieces here and there and I was able to figure out what I needed with a little trial-and-error.

Thankfully, it's all in the leiningen project.clj file:

  (defproject having-fun "1.0.0"
    ;; AOT-compile the entry point, at the top level (the leiningen 2 way)
    :aot [project.core]
    ;; the :provided profile keeps the storm jars on the classpath locally,
    ;; but out of the uberjar
    :profiles {:provided {:dependencies [[storm "0.9.0-wip16"]]}}
    ;; where `lein deploy` pushes the finished artifact
    :repositories [["releases" {:url "http://nexus/content/repositories/releases/"
                                :sign-releases false}]]
    :main project.core)

The key seems to be that with leiningen 2, the :aot tag needs to be at the top level, and not buried in a :profiles entry. This seems to be the direction things are going, so I wanted to adopt it now - and it works better this way.

Additionally, the :profiles line is all about excluding the storm jars from the uberjar, which is just what I needed, and the :repositories tag is all about where to deploy it with a simple:

  $ lein deploy

With this, I've been able to build clojure libraries with defbolt and defspout constructs - which is exactly what I wanted to do, and then put this up on our local nexus server so that it's very easy for others in the group to put this library in their project.clj file and make use of it.
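
For reference, a bolt in that style is short - here's a minimal sketch using storm's clojure DSL (the namespace and names are made up for illustration):

  (ns having-fun.bolts
    (:use [backtype.storm clojure]))

  ;; a trivial bolt: upper-case the incoming word and re-emit it, anchored
  (defbolt shout ["word"] [tuple collector]
    (emit-bolt! collector [(.toUpperCase (.getString tuple 0))] :anchor tuple)
    (ack! collector tuple))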

Sweet.

Installing Hadoop on OS X

Wednesday, July 17th, 2013

Hadoop

This morning I finally got Hadoop installed on my work laptop, and I wanted to write it all down so that I can repeat it when necessary. As I found out, it's not at all like installing CouchDB, which is about as simple as anything could be. No… Hadoop is a far more difficult beast, and I guess I can understand why, but still, it'd be nice to have a simple Homebrew install that set it up in single-node mode and started everything with Launch Control. That's a wish, though, not a necessity.

So let's get into it. First, make sure that you have the SSH daemon running on your box. This is controlled in System Preferences -> Sharing -> Remote Login - make sure it's checked, save this, and it should be running just fine. Make sure you can ssh into your box - if necessary, make the SSH keys and put them in your ~/.ssh directory.
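
If you don't have keys yet, the usual passphrase-less pair from every single-node Hadoop guide does the trick (assuming rsa keys and the default file locations):

  $ ssh-keygen -t rsa -P ""
  $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys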

Next, you certainly need to install Homebrew, and once that's all going, you need to install the basic Hadoop package:

  $ brew install hadoop

At this point, you will need to edit a few of the config files and make a few directories. Let's start with the directories. These will be the locations for the actual Hadoop data, the Map/Reduce data, and the NameNode data. I chose to place these next to the Homebrew install of Hadoop so that it's all in one place:

  $ cd /usr/local/Cellar/hadoop
  $ mkdir data
  $ cd data
  $ mkdir dfs
  $ mkdir mapred
  $ mkdir nn
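
(Equivalently, bash brace expansion gets you there in one line:)

  $ mkdir -p /usr/local/Cellar/hadoop/data/{dfs,mapred,nn}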

At this point we can go to the directory with the configuration files and update them:

  $ cd /usr/local/Cellar/hadoop/1.1.2/libexec/conf

The first update is to work around a known Kerberos bug in Hadoop on OS X. Do this by editing hadoop-env.sh to include:

  export HADOOP_OPTS="-Djava.security.krb5.realm= -Djava.security.krb5.kdc="

Next, edit the hdfs-site.xml file to include the following:

  <configuration>
    <property>
      <name>dfs.data.dir</name>
      <value>/usr/local/Cellar/hadoop/data/dfs</value>
    </property>
    <property>
      <name>dfs.name.dir</name>
      <value>/usr/local/Cellar/hadoop/data/nn</value>
    </property>
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
    <property>
      <name>dfs.webhdfs.enabled</name>
      <value>true</value>
    </property>
  </configuration>

Next, edit the core-site.xml file to include the following:

  <configuration>
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/tmp/hdfs-${user.name}</value>
    </property>
    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
      <description>The name of the default file system.  A URI whose
      scheme and authority determine the FileSystem implementation.  The
      uri's scheme determines the config property (fs.SCHEME.impl) naming
      the FileSystem implementation class.  The uri's authority is used to
      determine the host, port, etc. for a filesystem.</description>
    </property>
  </configuration>

Finally, edit the mapred-site.xml file to include the following:

  <configuration>
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:9001</value>
      <description>The host and port that the MapReduce job tracker runs
      at.  If "local", then jobs are run in-process as a single map
      and reduce task.</description>
    </property>
    <property>
      <name>mapred.local.dir</name>
      <value>/usr/local/Cellar/hadoop/data/mapred/</value>
    </property>
  </configuration>

We are finally all configured. At this point, you need to initialize the Name node:

  $ hadoop namenode -format

and then you can start all the necessary processes on the box:

  $ start-all.sh
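
To sanity-check that everything came up, the JDK's jps tool should list the five daemons - NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker:

  $ jps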

At this point, you will be able to hit the web endpoints - the Hadoop 1.x defaults are:

  NameNode status - http://localhost:50070/
  JobTracker - http://localhost:50030/

and using the WebHDFS REST endpoint, you can use any standard REST client to submit files, delete files, make directories, and generally manipulate the filesystem as needed.
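
For example, with dfs.webhdfs.enabled set as above, listing the root of the filesystem and making a directory are one-liners with curl (the /tmp/demo path is just an example):

  $ curl -i "http://localhost:50070/webhdfs/v1/?op=LISTSTATUS"
  $ curl -i -X PUT "http://localhost:50070/webhdfs/v1/tmp/demo?op=MKDIRS"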

This was interesting, and digging around for what was needed was non-trivial, but it was well worth it. I'll now be able to run my code against the PostgreSQL and Hadoop installs on my box.

Sweet!

I Love Magic – as Entertainment

Tuesday, March 5th, 2013

Clojure

I love a magic show. Even one that my friends might think of as lame. I love the well-done illusion. I know it's not real, but it's fun to believe that it is. After all - it's entertainment, and if you don't enjoy entertainment, then watch something else. It's your time, your life, your choice. But where I don't like magic is in languages and coding - there I absolutely hate it.

Take clojure - and for that matter, ruby falls into this category as well. The Ruby-ism of convention over configuration is a nice thought, and can be helpful for new coders starting out, but it obscures all the details, and in that obscurity it masks all the performance limitations - and that includes threading. Clojure is the same. What's really being done? You don't quite know with a lazy sequence, do you? What's loaded when? If it's a database result set, does the code load in all the rows and then construct them as a lazy sequence, or does it read a few rows at a time and leave the connection open? Big difference, right?
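
To make the result-set question concrete, here's the classic form of the trap with the old clojure.java.jdbc API (the sql alias and the events table are just for illustration):

  ;; assumes (:require [clojure.java.jdbc :as sql])

  ;; returns a lazy seq that may not be realized until after the
  ;; connection has been closed - a runtime surprise waiting to happen
  (defn all-events [db]
    (sql/with-connection db
      (sql/with-query-results rows ["select * from events"]
        rows)))

  ;; forcing realization while the connection is still open avoids it
  (defn all-events-eager [db]
    (sql/with-connection db
      (sql/with-query-results rows ["select * from events"]
        (doall rows))))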

So I'm not a fan of this kind of code - except for simple one-off scripts and manual processes. You just have no idea what's really happening, and without that knowledge, you have to dig into the code and learn what it's doing. Don't forget to stay abreast of all the updates to the libraries as well - things could change at any time simply because of the cloaking power of that abstraction.

Why does this mean so much to me? Because every project I've been involved in over the last 15 years has come down to performance. It's always about how fast this can all be done, how much can be run on a single box, and so on. All of these are performance issues, and without in-depth, continual knowledge of every library in use, I'm bound to have to make some assumptions - until I'm proven wrong by the code itself.

And what's worse is that I at least know to look, whereas plenty of the junior developers I have worked with simply assume that waiting is par for the course, and don't even think about the performance consequences of their code. They've always had enough memory and CPU speed, and if it takes 20 mins - so what? It takes 20 mins! I wonder if they would feel that way if it were charging a defibrillator for their parents? I'm guessing not.

Time is the one limited resource we all have. Waiting is not acceptable if you can figure out a way to reduce or eliminate the wait. That's what I've been doing for the better part of 15 years now: removing the wait.

Starting late yesterday, I realized that we have a real performance problem with the clojure code we are working with. I'm not sure the code is really all that bad, as it works fine when the database isn't loaded, but when it is - and it doesn't have to be loaded very much - things slow to a crawl, and that's not good. So bad, in fact, that several processes failed last night as it was cranking through a new data set.

So what's a guy to do? Well… I know what to do to make JDBC faster; I just need to know what to do in clojure to get those parameters into the Statement-creation code in the project. Unfortunately, there's no simple way to see how to do it. Clojure libraries, like ruby's, are - for the most part - not well documented. This bites because I can see what I need to set, but not how to set it.
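
From poking through the clojure.java.jdbc source, I suspect the incantation is something like this - a sketch against the 0.2.x API, using plain interop for the query, with the fetch size being the parameter I actually care about:

  ;; assumes (:require [clojure.java.jdbc :as sql]) and a db-spec in info
  (sql/with-connection info
    (let [stmt (sql/prepare-statement (sql/connection)
                                      "select * from demands"
                                      :fetch-size 1000)] ; stream, don't slurp
      (with-open [rs (.executeQuery stmt)]
        (doall (resultset-seq rs)))))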

So I'm going to have to wait for our clojure expert to show up this morning and tell him to dig into it until he can give me a list of examples that I can work from. I have no doubt he's capable of doing it, but it's not terribly nice to have to wait for him to walk in.

But that's my problem, not his.

But Boy! I hate "magic" code.

Cool Sub-Selects in korma

Thursday, January 31st, 2013

Clojure

I was doing some selects from a postgres database into clojure using korma, and they were pretty straightforward for a master/detail scheme:

  (defn locations-for-demand-set
    [demand-set-id]
    (select locations
            (with demands)
            (fields "locations.*")
            (where {:demands.demand_set_id demand-set-id})))

and it was working pretty well. The data was coming back from the database, and everything was OK. But as the database got bigger and bigger, we started to see a real performance penalty. Specifically, the pulling of the locations was taking on the order of 30 seconds for a given demand set. That's too long.

The problem, of course, is that this is implemented as a join, and that's not going to be very fast. What's faster is a sub-select, where we can get all the demand ids for a given demand-set and then use that with an IN clause in SQL. Thankfully, I noticed that korma had just that capability:

  (select locations
    (where {:demand_id [in (subselect demands
                                      (fields :id)
                                      (where {:demand_set_id demand-set-id}))]}))

Unfortunately, this didn't really give me the kind of speed boost I was hoping for. In fact, it only cut about a half-second off the 31 sec runtime. Kinda disappointing. But the cause had to be related to the size of the sub-select: it was likely 25,000 elements, and doing an IN on that is clearly an expensive operation.

I like that korma supports this feature, but I need a faster way.

Hitting Teradata from Clojure

Monday, January 28th, 2013

Clojure

Today I worked on hitting Teradata from within clojure using clojure.java.jdbc, and I have to say it wasn't that bad. There are plenty of places that a few paragraphs of documentation could have saved me 30 mins or so, but all told, the delays due to googling weren't all that bad, and in the end I was able to get the requests working, and that's the most important part. I wanted to write it down because it's hard enough that it's not something I'll keep in memory, but it's not horrible.

First, set up the config for the parameters of the Teradata JDBC connection. I have a resources/ directory with a config.clj file in it that's read on startup. Its contents are (at least in part):

  {:teradata {:classname "com.teradata.jdbc.TeraDriver"
              :subprotocol "teradata"
              :subname "//tdwa"
              :user "me"
              :password "secret"}}

Then, because we're using Leiningen, the jars are pulled in by adding the following to the :dependencies in the project.clj file:

    [com.teradata/terajdbc4 "14.00.00.13"]
    [com.teradata/tdgssconfig "14.00.00.13"]

so that the next time we run lein, we'll get the jars, and they will know how to connect to the datasource.

Then I can simply make a function that hits the source:

  ;; assumes (:require [clojure.java.jdbc :as sql]) and that cfg/config is
  ;; the config-reading function sketched above
  (defn hit-teradata
    "Run a simple parameterized query against Teradata."
    [arg1 arg2]
    (let [info (cfg/config :teradata)]
      (sql/with-connection info
        (sql/with-query-results rows
          ["select one, two from table where arg1 = ? and arg2 = ?" arg1 arg2]
          (doall rows)))))   ; realize the rows before the connection closes

Sure, the example is simplistic, but it works, and you get the idea. It's really in the config and jar referencing that I spent the most time. Once I had that working, the rest was simple JDBC within clojure.

Adding More Metadata to the Demand Service

Thursday, January 24th, 2013

Dark Magic Demand

This morning I was asked to add something to the demand service so that it'd be easier to debug what's happening within that service. It wasn't horrible, and given that we already had the provision for the metadata associated with the demand, it was simply a matter of collecting the data and then placing it in the map at the right time.

I was pleasantly surprised to see how easy this was in clojure. Since everything is a simple data structure (as we're using it), it's pretty easy to change a single integer to a list and add a value to the end. Then it's just important to remember what's what when you start to use this.
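
In code, the whole change is about this big - a hypothetical sketch, since the real names aren't shown here:

  ;; turn a scalar metadata entry into a vector and append the new value
  (defn append-meta
    [demand k v]
    (update-in demand [:meta k]
               (fn [old]
                 (conj (cond (nil? old)    []
                             (vector? old) old
                             :else         [old])
                       v))))

  ;; (append-meta {:meta {:source 42}} :source 43) => {:meta {:source [42 43]}}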

I placed it into the functions and deployed the code, and we should be collecting the data on the next update. Good news.

Added Investigative Tool to Metrics App

Wednesday, January 23rd, 2013

WebDevel

This morning, with the right help, I was able to put up a slick little javascript JSON viewer on the main 'metrics' web page for the project I'm on at The Shop. The goal of this is really to give folks - both developers and support - a quick way to look at the details of the Demand objects in the demand service. Since each demand has a UUID, it's possible to look at the series of demand objects as each one "flows" through the service and is adjusted, combined, and delivered to the client.

I have to say that I'm very impressed with the speed of the clojure service. Then again, it's Java underneath, and there's not all that much to it - there's the indexed postgres table, a select statement, and then formatting. Sure, that's a simplification, but it's not like we're delivering streaming video or something. But wow! It's fast.

And the regex-based JSON classifier in javascript is pretty impressive as well. It's fast and clean, and the CSS is a lot of fun to play with in the syntax highlighting. I can really see how much fun this could be - and what a big time sink, if I played with it too much.

But it's nice to punch it out and make it available on the much improved unicorn/nginx platform. Those timeouts are gone, and the code is much simpler, and that's a wonderful thing.

Running Tests, Fixing Issues, Moving Forward

Monday, January 21st, 2013

Dark Magic Demand

Today I spent a lot of time with the new calculation chain reaction in the demand service trying to make sure that everything was running as it should. There were a few issues with the updates when there wasn't an updating component - like a demand set without a seasonality set to adjust it. In these cases, we just did nothing but log an error, when the right thing was to realize that a seasonal adjustment without any factors is just a no-op, and to return the original data as the "adjusted" data. Easy.
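
The fix amounts to something like this - a sketch with stand-in names, since the real functions are internal to the project:

  ;; an adjustment with no factors is a no-op: hand back the original
  ;; demands as the "adjusted" data instead of logging an error.
  ;; adjust-one stands in for the real per-demand adjustment function.
  (defn seasonally-adjust
    [adjust-one demands factors]
    (if (empty? factors)
      demands
      (map #(adjust-one % factors) demands)))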

But the guy that wrote it doesn't think like that. So I had to put in functions to return the empty set if there is nothing in the database with which to adjust the data. It's hokey, but hey, this entire project is like that. I'm not saying it's not valuable in some sense, but I'm looking at this and thinking that we picked a language, and a team (including me), that has no real traction here at The Shop - for the sake of what? Exposure? Coolness? Are we really moving in this direction? Or is it just another fad to pacify the Code Monkeys that want to play with new toys? Will we be moving to the next new toy when they get bored with this one?

Anyway… I've been fixing things up, and realizing that I'm getting to be a decent clojure developer. I'm not really good - let alone an expert - but it's a start, and I don't have to hit the docs every 5 mins to figure something out. It's starting to make sense, even if it isn't the way my co-worker might like it.

Thankfully, things are really working out well. By the end of the day I had updated the code in the main pipeline app to use either form of the demand coming out of the service, so that we can have much improved demand forecasting in the app. Very nice to see.

I have to wait a day to put it in UAT, just to make sure things have settled out on other fronts, and then we can isolate which changes are due to this effect. But it's progress, and that's good to see.

Updated Leiningen to 2.0.0

Monday, January 21st, 2013

Homebrew

This morning I saw that Leiningen 2.0.0 was officially released, and I updated it on my work laptop with the simple command:

  $ lein upgrade

as I'd downloaded that directly from the site. But on my own MacBook Pro, I was using the version from Homebrew, and this morning they, too, updated their "brew" and I just needed to say:

  $ brew update
  $ brew upgrade leiningen

and I had 2.0.0 on my box as well. I don't think there's any difference between the installs, but it's nice to keep using the brew version on my laptop, since that install is different from the one on my work box, and some folks may ask about the difference as we get more people on the team.