Archive for the ‘Clojure Coding’ Category

Installing Hadoop on OS X

Wednesday, July 17th, 2013

This morning I finally got Hadoop installed on my work laptop, and I wanted to write it all down so that I can repeat it when necessary. As I found out, it's not at all like installing CouchDB, which is about as simple as anything could be. No… Hadoop is a far more difficult beast, and I guess I can understand why, but still, it'd be nice to have a simple Homebrew install that set it up in single-node mode and started everything with Launch Control. That's a wish, though, not a necessity.

So let's get into it. First, make sure that you have the SSH daemon running on your box. This is controlled in System Preferences -> Sharing -> Remote Login - make sure it's checked, save it, and the daemon should be running just fine. Then make sure you can ssh into your box - if necessary, generate SSH keys and put them in your ~/.ssh directory.

Next, you'll need Homebrew installed, and once that's all going, you can install the basic Hadoop package:

$ brew install hadoop

At this point, you will need to edit a few of the config files and make a few directories. Let's start by making the directories. These will be the locations for the actual Hadoop data, the Map/Reduce data, and the NameNode data. I chose to place these next to the Homebrew install of Hadoop so that it's all in one place:

  $ cd /usr/local/Cellar/hadoop
  $ mkdir data
  $ cd data
  $ mkdir dfs
  $ mkdir mapred
  $ mkdir nn

At this point we can go to the directory with the configuration files and update them:

  $ cd /usr/local/Cellar/hadoop/1.1.2/libexec/conf

The first update works around a known Kerberos bug in Hadoop on OS X. Do this by editing hadoop-env.sh to include:

  export HADOOP_OPTS="-Djava.security.krb5.realm= -Djava.security.krb5.kdc="

Next, edit the hdfs-site.xml file to include the following:

  <configuration>
    <property>
      <name>dfs.data.dir</name>
      <value>/usr/local/Cellar/hadoop/data/dfs</value>
    </property>
    <property>
      <name>dfs.name.dir</name>
      <value>/usr/local/Cellar/hadoop/data/nn</value>
    </property>
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
    <property>
      <name>dfs.webhdfs.enabled</name>
      <value>true</value>
    </property>
  </configuration>

Next, edit the core-site.xml file to include the following:

  <configuration>
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/tmp/hdfs-${user.name}</value>
    </property>
    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
      <description>The name of the default file system.  A URI whose
      scheme and authority determine the FileSystem implementation.  The
      uri's scheme determines the config property (fs.SCHEME.impl) naming
      the FileSystem implementation class.  The uri's authority is used to
      determine the host, port, etc. for a filesystem.</description>
    </property>
  </configuration>

Finally, edit the mapred-site.xml file to include the following:

  <configuration>
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:9001</value>
      <description>The host and port that the MapReduce job tracker runs
      at.  If "local", then jobs are run in-process as a single map
      and reduce task.</description>
    </property>
    <property>
      <name>mapred.local.dir</name>
      <value>/usr/local/Cellar/hadoop/data/mapred/</value>
    </property>
  </configuration>

We are finally all configured. At this point, you need to format the NameNode:

  $ hadoop namenode -format

and then you can start all the necessary processes on the box:

  $ start-all.sh

At this point, you will be able to hit the web endpoints - by default, the NameNode UI on port 50070 and the JobTracker UI on port 50030 - and using the WebHDFS REST endpoint, you can use any standard REST client to submit files, delete files, make directories, and generally manipulate the filesystem as needed.
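
For example, here's a quick way to poke at WebHDFS from the clojure REPL. This is just a sketch using plain Java interop - it assumes the default NameNode web port of 50070, and the user.name query parameter only works because security isn't enabled on this single-node setup:

  (defn webhdfs-mkdirs
    "Create a directory over WebHDFS; returns the HTTP status code."
    [path]
    (let [url  (java.net.URL. (str "http://localhost:50070/webhdfs/v1" path
                                   "?op=MKDIRS&user.name=" (System/getProperty "user.name")))
          conn (.openConnection url)]
      (.setRequestMethod conn "PUT")
      (try
        (.getResponseCode conn)        ; 200 means the directory was created
        (finally (.disconnect conn)))))

  (webhdfs-mkdirs "/user/me/testing")  ; => 200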

This was interesting, and digging around for what was needed was non-trivial, but it was well worth it. I'll now be able to run my code against the PostgreSQL and Hadoop installs on my box.

Sweet!

I Love Magic – as Entertainment

Tuesday, March 5th, 2013

I love a magic show. Even one that my friends might think of as lame. I love the well-done illusion. I know it's not real, but it's fun to believe that it is. After all - it's entertainment, and if you don't enjoy entertainment, then watch something else. It's your time, your life, your choice. But where I don't like magic is in languages and coding - there I absolutely hate it.

Take clojure - and for that matter, ruby falls into this category as well. The Ruby-ism of convention over configuration is a nice thought, and it can be helpful for new coders starting out, but it obscures all the details, and in that obscurity it masks all the performance limitations - including threading. Clojure is the same. What's really being done? You don't quite know with a lazy sequence, do you? What's loaded, and when? If it's a database result set, does the code load in all the rows and then wrap them in a lazy sequence, or does it read a few rows at a time and leave the connection open? Big difference, right?
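
To make that concrete, here's a contrived sketch of the kind of thing I mean (clojure.java.jdbc pulled in as sql, table name made up) - the rows come back as a lazy sequence, and nothing at the call site tells you when they actually get read:

  (require '[clojure.java.jdbc :as sql])

  (defn lazy-rows
    [db-spec]
    (sql/with-connection db-spec
      (sql/with-query-results rows
        ["select * from some_big_table"]
        rows)))                   ; hand back the lazy seq as-is

  (count (lazy-rows db-spec))     ; db-spec defined elsewhere; the rows only get
                                  ; realized here, after the connection has closed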

So I'm not a fan of this kind of code - except for simple one-off scripts and manual processes. You just have no idea what's really happening, and without that knowledge, you have to dig into the code and learn what it's doing. Don't forget to stay abreast of all the updates to the libraries as well - things could change at any time simply because of the cloaking power of that abstraction.

Why does this mean so much to me? Because there's never been a project I've been involved in over the last 15 years that didn't come down to performance. It's always coming down to how fast this can all be done, how much can be run on a single box, and so on. All these are performance issues, and without in-depth, continual knowledge of every library in use, I'm bound to have to make some assumptions - until I'm proven wrong by the code itself.

And what's worse is that I know to look, whereas plenty of the junior developers I have worked with simply assume that it's par for the course, and don't even think about the performance consequences of their code. They've always had enough memory and CPU speed, and if it takes 20 mins - so what? It takes 20 mins! I wonder if they would feel that way if it were charging a defibrillator for their parents? I'm guessing not.

Time is the one limited resource we all have. Waiting is not acceptable if you can figure out a way to reduce or eliminate the wait. That's what I've been doing for the better part of 15 years now: removing the wait.

Starting late yesterday, I realized that we have a real performance problem with the clojure code we are working with. I'm not sure that the code is really all that bad, as it works fine when the database isn't loaded, but when it is - and it doesn't have to be loaded very much - things slow to a crawl, and that's not good. So bad, in fact, that several processes failed last night as it was cranking through a new data set.

So what's a guy to do? Well… I know what to do to make JDBC faster; I just need to know how, in clojure, to get those parameters into the Statement creation code in the project. Unfortunately, there's no simple way to see how to do it. Clojure libraries, like ruby's, aren't well documented - for the most part. This bites because I can see what I need to set, but not how to set it.
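
For the record, the kind of thing I'm after looks roughly like this - a sketch, not the project's code, assuming the current (pre-0.3) clojure.java.jdbc API and using plain interop to set the fetch size so the PostgreSQL driver streams rows instead of loading them all at once:

  (require '[clojure.java.jdbc :as sql])

  (defn fetch-in-chunks
    "Run query and hand the (lazy) rows to f, streaming a few hundred rows
     at a time rather than pulling the whole result set into memory."
    [db-spec query f]
    (sql/with-connection db-spec
      (sql/transaction                                ; autocommit off - postgres needs this to stream
        (with-open [stmt (doto (.prepareStatement (sql/connection) query)
                           (.setFetchSize 500))       ; 500 is an arbitrary chunk size
                    rs   (.executeQuery stmt)]
          (f (resultset-seq rs))))))                  ; f has to consume the rows in this scope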

So I'm going to have to wait for our clojure expert to show up this morning and tell him to dig into it until he can give me a list of examples that I can work from. I have no doubt he's capable of doing it, but it's not terribly nice to have to wait for him to walk in.

But that's my problem, not his.

But Boy! I hate "magic" code.

Cool Sub-Selects in korma

Thursday, January 31st, 2013

I was doing some selects from a postgres database into clojure using korma, and they were pretty straightforward for a master/detail scheme:

  (defn locations-for-demand-set
    [demand-set-id]
    (select locations
            (with demands)
            (fields "locations.*")
            (where {:demands.demand_set_id demand-set-id})))

and it was working pretty well. The data was coming back from the database, and everything was OK. But as the database got bigger and bigger, we started to see a real performance penalty. Specifically, the pulling of the locations was taking on the order of 30 seconds for a given demand set. That's too long.

The problem, of course, is that this is implemented as a join, and that's not going to be very fast. What's faster is a sub-select where we can get all the demand ids for a given demand set, and then use that with an IN clause in SQL. Thankfully, I noticed that korma has just that capability:

  (select locations
    (where {:demand_id [in (subselect demands
                                      (fields :id)
                                      (where {:demand_set_id demand-set-id}))]}))

Unfortunately, this didn't really give me the kind of speed boost I was hoping for. In fact, it only cut off about a half-second of the 31 sec runtime. Kinda disappointing. But that has to be related to the size of the sub-select: it was likely 25,000 elements, and doing an IN on that is clearly an expensive operation.

I like that korma supports this feature, but I need a faster way.

Hitting Teradata from Clojure

Monday, January 28th, 2013

Today I worked on hitting Teradata from within clojure using clojure.java.jdbc, and I have to say it wasn't that bad. There are plenty of places where a few paragraphs of documentation could have saved me 30 mins or so, but all told, the delays due to googling weren't all that bad, and in the end I was able to get the requests working, and that's the most important part. I wanted to write it down because it's hard enough that it's not something I'll keep in memory, but it's not horrible.

First, set up the config for the parameters of the Teradata JDBC connection. I have a resources/ directory with a config.clj file in it that's read on startup. Its contents are, at least in part:

  {:teradata {:classname "com.teradata.jdbc.TeraDriver"
              :subprotocol "teradata"
              :subname "//tdwa"
              :user "me"
              :password "secret"}}

Then, because we're using Leiningen, the jars are loaded in with the following added to the project.clj file:

    [com.teradata/terajdbc4 "14.00.00.13"]
    [com.teradata/tdgssconfig "14.00.00.13"]

so that the next time we run lein, we'll get the jars, and they will know how to connect to the datasource.

Then I can simply make a function that hits the source:

  (defn hit-teradata
    ""
    [arg1 arg2]
    (let [info (cfg/config :teradata)]
      (sql/with-connection info
        (sql/with-query-results rows
          ["select one, two from table where arg1 = ? and arg2 = ?" arg1 arg2]
          (doall rows)))))   ; realize the lazy rows before the connection closes

Sure, the example is simplistic, but it works, and you get the idea. It's really in the config and jar referencing that I spent the most time. Once I had that working, the rest was simple JDBC within clojure.
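
For completeness, the snippet assumes requires roughly like these - the namespace names are made up, and cfg is just whatever reads resources/config.clj:

  (ns demand.teradata
    (:require [clojure.java.jdbc :as sql]    ; the sql/ alias used above
              [demand.config :as cfg]))      ; hypothetical namespace that reads config.clj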

Adding More Metadata to the Demand Service

Thursday, January 24th, 2013

This morning I was asked to add something to the demand service so that it'd be easier to debug what's happening within that service. It wasn't horrible, and given that we already had the provision for the metadata associated with the demand, it was simply a matter of collecting the data and then placing it in the map at the right time.

I was pleasantly surprised to see how easy this was in clojure. Since everything is a simple data structure (as we're using it), it's pretty easy to change a single integer into a list and add a value to the end. Then it's just important to remember what's what when you start to use this.
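
As a toy illustration (not the real data), promoting a single value to a list and appending to it is about this much work:

  (def demand {:id "a1b2c3" :qty 7})

  (-> demand
      (update-in [:qty] #(if (sequential? %) % [%]))  ; promote the single value to a vector
      (update-in [:qty] conj 9))                      ; append the new value
  ;; => {:id "a1b2c3", :qty [7 9]}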

Placed it into the functions, deployed the code and we should be collecting the data on the next update. Good news.

Added Investigative Tool to Metrics App

Wednesday, January 23rd, 2013

This morning, with the right help, I was able to put up a slick little javascript JSON viewer on the main 'metrics' web page for the project I'm on at The Shop. The goal of this is really to give folks - both developers and support - a quick way to look at the details of the Demand objects in the demand service. Since each demand has a UUID, it's possible to look at the series of demand objects as it "flows" through the service and is adjusted, combined, and delivered to the client.

I have to say that I'm very impressed with the speed of the clojure service. Then again, it's java, and there's not all that much to it - there's the indexed postgres table, a select statement, and then formatting. Sure, that's a simplification, but it's not like we're delivering streaming video or something. But wow! It's fast.

And the regex-based JSON classifier in javascript is pretty impressive as well. It's fast and clean and the CSS is a lot of fun to play with in the syntax highlighting. I can really see how much fun this can be. Big time sink, if I played with it too much.

But it's nice to punch it out and make it available on the much improved unicorn/nginx platform. Those timeouts are gone, and the code is much simpler, and that's a wonderful thing.

Running Tests, Fixing Issues, Moving Forward

Monday, January 21st, 2013

Today I spent a lot of time with the new calculation chain reaction in the demand service trying to make sure that everything was running as it should. There were a few issues with the updates when there wasn't an updating component - like a demand set without a seasonality set to adjust it. In these cases, we just did nothing but log an error, when the right thing was to realize that a seasonal adjustment without any factors is just a no-op, and to return the original data as the "adjusted" data. Easy.
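
In code, the fix amounts to something like this sketch - the names and data shapes are made up, not the project's:

  (defn seasonally-adjust
    "Apply seasonal factors to the demands; with no factors it's a no-op."
    [demands factors]
    (if (empty? factors)
      demands                                          ; nothing to adjust with - return as-is
      (map #(update-in % [:qty] * (get factors (:month %) 1.0)) demands)))

  (seasonally-adjust [{:qty 10 :month 1}] {})          ; => [{:qty 10, :month 1}]
  (seasonally-adjust [{:qty 10 :month 1}] {1 1.2})     ; => ({:qty 12.0, :month 1})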

But the guy that wrote it doesn't think like that. So I had to put in functions to return the empty set if there is nothing in the database with which to adjust the data. It's hokey, but hey, this entire project is like that. I'm not saying it's not valuable in some sense, but I'm looking at this and thinking that we picked a language, and a team (including me), that has no real traction here at The Shop, for the sake of what? Exposure? Coolness? Are we really moving in this direction? Or is it just another fad to pacify the Code Monkeys that want to play with new toys? Will we be moving to the next new toy when they get bored with this one?

Anyway… I've been fixing things up, and realizing that I'm getting to be a decent clojure developer. I'm not great - let alone really good - but it's a start, and I don't have to hit the docs every 5 mins to figure something out. It's starting to make sense, even if it isn't the way that my co-worker might like it.

Thankfully, things are really working out well. By the end of the day I had updated the code in the main pipeline app to use either form of the demand coming out of the service so that we can have a much improved demand forecasting impact in the app. Very nice to see.

I have to wait a day to put it in UAT just to make sure things have settled out on other fronts, and then we can isolate what the changes are due to this effect. But it's progress, and that's good to see.

Updated Leiningen to 2.0.0

Monday, January 21st, 2013

This morning I saw that Leiningen 2.0.0 was officially released, and I updated it on my work laptop with the simple command:

  $ lein upgrade

as I'd downloaded that directly from the site. But on my own MacBook Pro, I was using the version from Homebrew, and this morning they, too, updated their "brew" and I just needed to say:

  $ brew update
  $ brew upgrade leiningen

and I had 2.0.0 on my box as well. I don't think there's any difference in the installs, but it's nice to use the brew version on my laptop as it's different than the one on my work box, and some folks may ask about this as we get more folks on the team.

Calculations are Flowing!

Friday, January 18th, 2013

Well, it's been a while getting here, but we finally have the chain reaction of calculations working in the demand service all the way up to, and including, the closed deal adjustments. I wrote quite a few tests on the closed deal adjustments because I'm working in a new language, and in order to make sure that it's working as I expect, I needed to test everything. First, in the unit tests of the taxonomy and price checks, as well as how to decompose the location data, and then in the second-level functions like closed deal option decomposition, and finally, in the top-level functions.

This is a ton more testing than I normally write, and in about six months, I won't be writing these tests, either. It's just that I'm really not at all sure about this clojure code, and in order to make myself feel more comfortable about it, I needed the tests. And the REPL.

But to see the logs emit the messages I expected, and the ones I didn't, was a joy to behold. This was a long time in coming, and it's been a rough and bumpy road, but it's getting a little smoother, and with this milestone, next week should prove to be a great step forward for the project.

I'm looking forward to it.

Pulled Additional Fields from Salesforce for Demand Adjustment

Thursday, January 17th, 2013

This afternoon I realized that we really need to have a few additional fields about the closed deals for the demand adjustment. Specifically, we have no idea of the start date for the deal, and while we have the close_date, that's not much use to us if it's empty, as many are until they have an idea of when they really want to shut down the deal. Additionally, one of the sales reps pointed out that there are projected sales figures on each 'option' in a deal, and rather than look at the total projected sales and divide it up, as we have in the past, we should be looking for those individual projected sales figures and using them - if no sales have been made.

Seems reasonable, so I added those to the Salesforce Apex class, and ran it to make sure it was all OK. There were no code changes to the ruby code because we had (smartly) left it as a hash, and so additional fields aren't going to mess things up… but in our code we can now take advantage of them.

It was surprisingly time-consuming because I had to drop tables, add properties, and get things in line - but that's what happens when you add fields to a schema… you have to mess with it. Still, it's better than using a document database for this stuff. Relational with a simple structure beats document every time.