Archive for the ‘Coding’ Category

Getting Tools for Clojure Project Going

Thursday, December 6th, 2012

Clojure.jpg

I've been asked to start full-time on a new phase of the project I've been on, and the first thing I want to get working is the infrastructure - the machines, the databases, the deploy scripts, etc. These make the rest of the coding a lot nicer. Since Leiningen already handles the creation of a deployment jar, all I needed was something that made it simple to build and deploy those jar files. The original project had a similar scheme, so it made sense to carry that same theme through to this new project. The original used ruby and rake for the deployment scripts, but the syntax was easy to reproduce with make and bash scripts.

The makefile was pretty simple as it just called the bash script(s), and while the bash scripts aren't too bad, there were plenty of things to work out because most of the interesting work was done on the remote host. The most interesting part was building the /etc/init.d script to stop and start the application. Again, it's not too hard, but it took a little time to work out the details.

In the end, we have a nice init.d script for the project that goes out with each deployment of the application. We can then use this and the makefile to deploy, start, and stop the application on the two datacenter hosts. Not bad at all.

Tired of Waiting for People – Finishing Teradata Pull

Wednesday, December 5th, 2012

Building Great Code

After waiting on a few other folks in another group, I decided that there was no reason to wait any longer. A co-worker in Palo Alto has been waiting on some data for weeks now, and there's no reason for it. I had the ruby code to pull data from Teradata and put it into JSON structures for use in the main code base, and I had some time today, so there just wasn't a good reason to hold off any more.

I got the code out of storage and refreshed the SQL query with my co-worker and then started summarizing the data as per his requests. Thankfully, it was all pretty straightforward - I needed to collect all deals for a merchant, and take the median of a few values and count up the occurrences of a few others. Nothing horrible, and a few helper methods made pretty quick work of it.
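
Nothing fancy was needed - the helpers were along these lines (just a sketch; the names and shapes are illustrative, not the actual project code):

  # rough sketch of the summarizing helpers - names are illustrative only
  def median(values)
    sorted = values.compact.sort
    return nil if sorted.empty?
    mid = sorted.size / 2
    sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
  end

  def count_by(deals, field)
    deals.inject(Hash.new(0)) { |counts, deal| counts[deal[field]] += 1; counts }
  end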

After I got it all generated, it was time to work the data into the Merchant model in the existing code. The final destination for this data is the sales value calculation - the Merchant's quality score gets updated based on previous deals. I needed to put it in the ETL for the raw merchant data, merge the new data in with the existing data, and then it's ready to be used in the calculator.
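
The merge itself is about as dull as it sounds - something along these lines (the names here are hypothetical; the real ETL has more going on):

  # hypothetical sketch - fold the Teradata deal history into the raw merchant data
  def merge_deal_history(raw_merchant, deal_history)
    raw_merchant.merge('deal_history' => deal_history[raw_merchant['id']] || {})
  end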

Not bad. And it didn't take more than an hour or two. No need to wait for the other group any longer. Now they can write their code, and then we can make a simple REST client for it and fold in the data the same way. Easy to update and simple to retrofit. Nice.

Default Encodings Trashing Cronjobs

Wednesday, December 5th, 2012

bug.gif

This morning, once again, I had 500+ error messages from the production run last night. It all pointed to the JSON decoding - again - but this time I was ready: the script now fails fast, so it didn't try to do anything else with the bad runs, and I could simply retry them this morning. So I did.

Interestingly, just as with the tests yesterday, when I run it from the login shell, it all works just fine. So I fired off the complete nightly run and then set about trying to see what about the crontab setup on these new boxes was messed up and didn't allow the code to run properly. Thankfully, based on yesterday's runs, I knew I could get them all done before the start of the day.

So when I started digging, I noticed this in the logs:

  Input length = 1 (Encoding::UndefinedConversionError)
    org.jruby.RubyString:7508:in 'encode'
    json/ext/Parser.java:175:in 'initialize'
    json/ext/Parser.java:151:in 'new'
    ...

so I did a little googling, and it brought me back to encodings - what I expected. That reminded me of the issue I had with reading the seasonality data in the first place. Then I looked at our code, and we are using a standard reader method to get data for both CSV and JSON:

  def self.read_file(filename)
    contents = ''
    what = project_root + '/' + filename
    File.open(what) do |file|
      contents = file.read
    end
    contents
  end

which is all very standard stuff.

What the hits on google were saying was that I needed to think about the encodings, and so I changed the code to read in iso-8859-1 and then transcode it to utf-8:

  def self.read_file(filename)
    contents = ''
    what = project_root + '/' + filename
    File.open(what, 'r:iso-8859-1') do |file|
      contents = file.read
    end
    contents.encode('utf-8', 'iso-8859-1')
  end

Then I saw, in another post about encodings in ruby, that I could collapse this into one step:

  def self.read_file(filename)
    contents = ''
    what = project_root + '/' + filename
    File.open(what, 'r:iso-8859-1:utf-8') do |file|
      contents = file.read
    end
    contents
  end

which simplifies the code as well as the understanding: The file is iso-8859-1, but I want utf-8. Perfect! I put this in and I should be good to go.

But the question then is really: why does it work from the login shell? After all, if both failed, that would make sense - but they don't both fail. That got me looking at what's defined in the login shell that's not in the crontab pseudo-shell. As soon as I scanned the login shell's environment, it was clear:

  LANG=en_US.UTF-8

and that explained everything.

The crontab 'shell' doesn't define this, and you can't put it in the crontab file like you can the SHELL and MAILTO variables. So the solution was simple: put it in my main script right after the PATH specification:

  export LANG="en_US.UTF-8"

and all the problems should just go away! That would be nice. I'll have to check when the runs are finished this morning.

Updating Metrics App for Couch Changes

Tuesday, December 4th, 2012

WebDevel.jpg

Most of my day was spent struggling with the 'metrics' app - a simple web app that we use to present the metrics for all the runs we do. Now that we're running all of North America, the next most important task is adding a few columns to some of the CSV exports from this web app. But as I soon found out, this was far more involved than adding a column or two.

The reason the columns needed to be added was to give the users investigating the data more information for spotting problems. But what I soon found was that the change we had made to how we write data to Couch - four separate documents as opposed to one document plus three (server-side) updates to that document - had a far greater impact than we knew. Most clearly, a lot of the reports simply didn't work.

So I needed to go back and check every function on the page. Thankfully, most of the changes were in the javascript or the backing ruby service code, but it was still a lot of work as there wasn't a ton of documentation on it, and I had to bop back and forth to the Couch web viewer to see what I had available to build with.

But the real kicker was when we needed to relate documents to one another: the output of one process doesn't have any way to tie itself to the output of another. The best we've got is the loose relationship of time: one process starts pretty soon after the other.

So I had to add quite a few views, and complicate the logic, in order to get what we needed from what we were given and from the timing relationship between the phases. It's not ideal, but for all the crud I had to go through, it seems to work.
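
The pairing logic boils down to something like this (a sketch only - the field name and the helper are made up, but the idea is the same): sort both phases' documents by start time and match each document to the first one in the other phase that starts at or after it.

  # hypothetical sketch - relate two phases' documents by their start times,
  # since nothing else ties them together
  def pair_by_time(phase_a_docs, phase_b_docs)
    bs = phase_b_docs.sort_by { |d| d['started_at'] }
    phase_a_docs.sort_by { |d| d['started_at'] }.map do |a|
      # the matching phase-b doc is the first one starting at or after this one
      [a, bs.find { |b| b['started_at'] >= a['started_at'] }]
    end
  end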

I'm glad it's over.

Lots of Little Tasks Add Up to Lots of Progress

Monday, December 3rd, 2012

Building Great Code

Today I've spent a lot of time doing a lot of little things that have added up to some significant changes for the application. We're already running all of North America, except the account reassignment, so that's a major goal already reached, but there's still plenty that needs to be done to get us to the next level.

From this morning's runs, it was clear I needed to put in a little time making the code a lot more robust to bad data. We were getting some nil class exceptions, and that's just being careless with the code. You have to check whether something is nil before you assume it isn't.
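
Nothing clever about the fix - just guard before you use the data. Something like this (a made-up example, not the actual code):

  # hypothetical guard - skip records that don't have what we need
  deals.each do |deal|
    next if deal.nil? || deal['locations'].nil?
    deal['locations'].each { |loc| process_location(loc) }
  end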

I also fixed the encoding on the CSV by:

  CSV.foreach(manual, :headers => true, :encoding => 'iso-8859-1') do |rec|
    # ...process the record
  end

In a very similar manner, we got a new file from the users for the seasonality data, and this guy had plenty of non-UTF-8 characters. Rather than edit them out, I chose to use a different encoding to properly handle them.

Finally, I updated the logging on the reassignment phase so that we could really see what's happening on the unassignment and assignment phases - including a very easily extractable 'undo' text for those times that we may need to undo the changes we've made. This has been a problem for a while, and it really just needed to get punched out.

I had a few more, but they were even less exciting than these. All told, however, I cleared a lot of issues in the system, and that's what really counts.

Fixed for Canadian Postal Codes – Again

Monday, December 3rd, 2012

bug.gif

Once again, I had a report of a bug in the system and I started tracking it down. This particular report said that the number of closed deals used to adjust the demand was being seriously under-reported. Like major under-reporting. So I started looking at the fetching code, and then at how the closed deals were being matched up against the demand, and it literally popped off the screen at me.

Canadian postal codes.

I've seen this before.

Thankfully, I knew just what to do. The problem is that Canadian postal codes are six characters with a space in the middle, and only the first three are significant to the spatial location data we use. That means we needed to look at the country and then deal with the postal code accordingly.

The code I came up with was very similar to what I'd used in the past:

  all_zips = recent_close['locations'].map do |loc|
    # for Canada, only the first three characters of the postal code matter
    loc['country'] == "CA" ? loc['zip_code'][0,3] : loc['zip_code']
  end

and then we can use them just like we do with the Merchant to Demand pinning. It makes perfect sense why we weren't seeing a lot of matches with the previous code - the postal codes were far too specific.

That was a nice one to get out.

Slick Scheme to Efficiently Process a Queue in Bash

Friday, November 30th, 2012

Building Great Code

In the beginning of this project, we created a very simple bash script to run the jobs we needed run in a crontab. It's just a lot easier, I've found, to run things out of a simple bash script than to try and put them in the crontab itself. The crontab stays cleaner, and since cron isn't a real shell, it's just better all-around.

But as the project got more complex, it was clear that I was beginning to test the limits of what could be easily done in bash. The problem, then, was that a vocal contingent of the guys on this project don't really know bash - and have no desire to learn it. Interestingly enough, their argument for using complex and difficult things like meta-programming in ruby is that there's a "floor full of people that understand it". But when bash comes up, it's not even really checked against that same "floor full of people" to see if they know it as well.

It's Code Monkeys, what can you say?

Anyway, as things progressed, I needed a way to run many jobs at the same time, but ensure that all jobs of a single kind are done before moving on to the next phase of processing. The solution I came up with was pretty straightforward, but not exactly efficient: loop, start n background processes, and then wait for all of them to finish before continuing the loop and starting more.

This is pretty simple, but it means that the speed with which we can process things is determined by the slowest (or longest-running) job in each batch. Therefore, a relatively small number of well-placed slow jobs in the queue can really spread things out.

While this has been OK for a few weeks, we really needed something cleaner, so I came up with this far simpler plan:

  function worker {
    for i in $list; do
      # mkdir is atomic - whoever creates the lock directory owns the job
      # (the 'locks/' location and do_work are placeholders for the real script)
      if mkdir "locks/$i" 2>/dev/null; then
        do_work "$i"
      fi
    done
  }

The idea being that if I launch n of these with a simple:

  for (( i=0; i<n; i++ )); do
    ( worker ) &
  done
  wait

then we'll have these workers running through the list of things to do - each picking up the next available job, and doing it, but skipping those that have been locked by the other workers.

Really pretty slick. The trick was finding out that mkdir is atomic, so it's simple to use it to make a directory tagging the job: if we're able to create it, we do the work, and if we can't, then someone else is doing it - or already has.

This is super cool!

I was thinking I'd need a queue, or something, and all I really needed was a list and a filesystem. That's sweet. Really. That's one of the coolest things I've seen in a long time.

Interestingly enough, the code is now a lot simpler - just the worker function and the little launcher loop above (the full script went up as GitHub Gist 4178656).

I still need to test it all, and that will be Monday, but there's no reason to think it won't work. This way, we have n workers all doing the best they can until all the work that needs to be done is done, and then everything stops and we can move on to the next phase.

Very cool!

Big Step Forward – No Thanks to Me

Friday, November 30th, 2012

trophy.jpg

This morning it looks like we were able to run all of North America through the UAT system with a co-worker's changes to Salesforce, and that's a huge step forward - but it wouldn't have happened if it had been up to me. I was thinking it was too big a jump to try and take - to go from 27 divisions to 170+ in one night. I would have done it in a couple of nights. But it worked. In spite of me.

Good experience for me - to be vocal and wrong. I've already apologized to the guy running the test and he laughed… good for me. I'll mention it again in stand-up, and say I was wrong. I want to do it as it drives home the idea that I don't know everything.

But it's a big step. About 5 hours and all of North America. We should be able to get it down from there, but even if we can't, it's workable, and that's a huge win.

Fixed up Metrics Web Pages for CouchDB Changes

Thursday, November 29th, 2012

WebDevel.jpg

Recently, in order to get the kind of performance we needed from CouchDB, we had to drop all sense of updating the existing documents in Couch with the data from subsequent runs. This "insert only" mode turned out to be vastly superior to the update scheme - even after we had made the updates server-side and sent just what we needed, updating was simply too slow. So now we are going to have four documents where we previously had one.

Things had to change.

The biggest concern was the metrics web page and widgets that showed a lot of the different results of the runs - all hitting Couch for their data. In the previous version, we had the one document to look at for all the data, but now we had to be careful about what we were looking at, and gathering up for display.

Thankfully, the views in Couch could be adjusted to make very few code changes, and where there were code changes, we didn't have to change the views in Couch - so it was pretty easy to get things figured out. Not bad, really.

At least now, come Monday, we'll have good data in the metrics app, and that's very important.

Created New Tools for Mid-Day Prioritization Fixes

Thursday, November 29th, 2012

Building Great Code

I had long suspected that we needed to have tools for correcting problems associated with the reassignment and prioritization phases of the process, and today I finally decided to just make some. There are several interesting pieces in this story, but let's talk about the actual need and how I worked that into the process first.

It's not surprising that again today we had a slight issue with the prioritization of a single sales rep - not due to the code, but due to the incoming data. It would have been really nice to have simply re-run that one sales rep through the prioritizer, and fix them up. But we didn't have any tools to do that. So we had to say "Sorry, it'll be fixed tomorrow".

So after stand-up, I decided that we needed to have tools to:

  • Re-prioritize a single sales rep - this will pull all the accounts (merchants) for a single sales rep and then cleanly rank them for their daily call list. This is basically what we do nightly, except that the nightly run starts by getting all the sales reps in a division.
  • Clear all the priorities on a single sales rep - this is something that I think is going to become more important as things slip through the cracks and we need to clear out account call list priorities en masse. This will simply pull in all the accounts for a single sales rep and then clear out their call list priorities.
  • Clear all the priorities on a single sales rep within a division - this is like the last one, but in the case of the House Account, which is the same for all divisions, we might want to confine the clearing to a single division for safety's sake.

With these three tools, we should be able to do all the quick fixes that have come up since we started showing this to sales reps and their city managers. Thankfully, the code for all this is pretty simple - even battle-tested. If we look at the existing code that gets all the sales reps for a division and then prioritizes them one by one, we can simply make that inner block the prioritize_rep() method, and move the code such that it's simple to call either method - the division-level one, or the per-sales rep one, and get what we need.

Finally, it's simple to copy that method and create clear_rep(), where we don't prioritize the accounts for a rep, but simply get them and clear out the requisite fields. It's not bad at all. Pretty simple, really.
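
The shape of it ends up roughly like this (a sketch only - prioritize_rep() and clear_rep() are the names from above; sales_reps_for, accounts_for, rank_call_list and clear_priorities just stand in for the real code):

  # nightly path - prioritize every rep in a division
  def prioritize_division(division)
    sales_reps_for(division).each { |rep| prioritize_rep(rep) }
  end

  # mid-day fix-up path - re-rank the daily call list for a single rep
  def prioritize_rep(rep)
    rank_call_list(accounts_for(rep))
  end

  # clear a single rep's call-list priorities, optionally within one division
  def clear_rep(rep, division = nil)
    accounts = accounts_for(rep)
    accounts = accounts.select { |a| a['division'] == division } if division
    accounts.each { |a| clear_priorities(a) }
  end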

But that's where the fun ends. In order to do this, I had to change a lot of specs and other code simply because it wasn't written to be flexible. This is what I hate most about these unit tests: they really aren't written to be as reusable and flexible as the code they're testing, but they need to be. I spent probably 30 minutes changing the code, and about another hour fixing the tests. That's messed up.

But the real story of the day is that when I was talking about doing this, some of the other guys in the group didn't necessarily want to help do it, but they certainly wanted to make sure that their $0.02 was listened to, and done. It's like the unwritten evil side of Agile, or maybe it's just the Code Monkeys, but it's perfectly natural to have a design discussion - even if it's completely one-sided. It's considered "helpful", and "nice". But really, it's about wanting to control the decision without having to shoulder the responsibility of it being right.

I can work in Agile, and I see some of the benefits, but I think it's like any other evangelical movement - the downsides are completely dismissed by the "faithful" as fringe, and extreme - and certainly not representative of what they do. But it is.

I really long to be on a project where I don't have Code Monkeys. I like the people, just not how they act a lot of the time.