Archive for the ‘Cube Life’ Category

Slick Scheme to Efficiently Process a Queue in Bash

Friday, November 30th, 2012

In the beginning of this project, we created a very simple bash script to run the jobs we needed run from a crontab. It's just a lot easier, I've found, to run things out of a simple bash script than to try to put them in the crontab itself: the crontab stays cleaner, and since cron isn't a real shell, it's just better all-around.

But as the project got more complex, it was clear that I was beginning to test the limits of what could easily be done in bash. The problem, then, was that a vocal contingent of the guys on this project don't really know bash - and have no desire to learn it. Interestingly enough, their argument for using complex and difficult things like metaprogramming in ruby is that there's a "floor full of people that understand it". But when bash comes up, it's never checked against that same "floor full of people" to see if they know it as well.

It's Code Monkeys, what can you say?

Anyway, as things progressed, I needed a way to run many jobs at the same time, but ensure that all jobs of a single kind were done before moving on to the next phase of processing. The solution I came up with was pretty straightforward, but not exactly efficient: loop, start n background processes, and then wait for all of them to finish before continuing the loop and starting more.

This is pretty simple, but it means that the speed with which we can process things is determined by the slowest (or longest-running) job in each batch. A relatively small number of well-placed slow jobs in the queue can really spread things out.
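In rough terms, the old scheme was something like this (a sketch, borrowing the $list and do_work names that show up below):

  # Run the jobs n at a time, waiting for each whole batch to finish --
  # so the slowest job in every batch gates everything behind it.
  n=4
  count=0
  for job in $list; do
    ( do_work "$job" ) &
    (( ++count % n == 0 )) && wait   # batch full -- wait for all of it
  done
  wait   # catch the last, partial batch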

While this has been OK for a few weeks, we really needed something cleaner, so I came up with this far simpler plan:

  function worker {
    for i in $list; do
      if lock "$i"; then
        do_work "$i"
      fi
    done
  }

The idea being that if I launch n of these with a simple:

  for (( i=0; i<n; i++ )); do
    ( worker ) &        # each worker runs in its own subshell
  done
  wait                  # block until every worker has finished

then we'll have these workers running through the list of things to do - each picking up the next available job, and doing it, but skipping those that have been locked by the other workers.

Really pretty slick. The trick was finding out that mkdir is atomic, so it's simple to use it to create a directory tagging each job: if we're able to make the directory, the work is ours to do; if we can't, then someone else is doing it, or already has.

This is super cool!

I was thinking I'd need a queue, or something, and all I really needed was a list and a filesystem. That's sweet. Really. That's one of the coolest things I've seen in a long time.

Interestingly enough, the code is now a lot simpler. Putting the worker and the launcher together, the whole thing is roughly this (a sketch - the job list, the locks directory, and the body of do_work are placeholders, not the exact script):

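  #!/bin/bash
  # A sketch of the whole scheme -- jobs.txt, the locks directory, and
  # the body of do_work are placeholders, not the real script.
  list=$(cat jobs.txt)      # one job name per line
  n=4                       # number of workers

  mkdir -p locks

  function lock {
    # mkdir is atomic: exactly one caller can create the marker
    # directory, so whoever succeeds owns the job.
    mkdir "locks/$1" 2>/dev/null
  }

  function do_work {
    echo "processing $1"    # the real work goes here
  }

  function worker {
    for i in $list; do
      if lock "$i"; then
        do_work "$i"
      fi
    done
  }

  for (( i=0; i<n; i++ )); do
    ( worker ) &
  done
  wait                      # every job locked and done -- next phase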

I still need to test it all, and that will be Monday, but there's no reason to think it won't work. This way, we have n workers all doing the best they can until all the work that needs to be done is done, and then everything stops and we can move on to the next phase.

Very cool!

Big Step Forward – No Thanks to Me

Friday, November 30th, 2012

This morning it looks like we were able to run all of North America through the UAT system with a co-worker's changes to Salesforce, and that's a huge step forward - but it wouldn't have happened if it had been up to me. I was thinking it was too big a jump to try to take - to go from 27 divisions to 170+ in one night. I would have done it over a couple of nights. But it worked. In spite of me.

Good experience for me - to be vocal and wrong. I've already apologized to the guy running the test and he laughed… good for me. I'll mention it again in stand-up, and say I was wrong. I want to do it as it drives home the idea that I don't know everything.

But it's a big step. About 5 hours and all of North America. We should be able to get it down from there, but even if we can't, it's workable, and that's a huge win.

Fixed up Metrics Web Pages for CouchDB Changes

Thursday, November 29th, 2012

Recently, in order to get the kind of performance we needed from CouchDB, we had to drop any notion of updating the existing documents in Couch with the data from subsequent runs. This "insert-only" mode turned out to be vastly superior to the update scheme, which was just too slow even after we moved the updating server-side and sent only what was needed for each update. So now we are going to have four documents where we previously had one.
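Roughly, the difference looks like this against Couch's plain REST interface - a sketch, with a made-up database name and fields:

  require 'net/http'
  require 'json'

  http = Net::HTTP.new('localhost', 5984)
  json = { 'Content-Type' => 'application/json' }

  # The old update scheme: fetch the document to get its _rev, change
  # it, and PUT it back -- two round trips and a possible conflict.
  doc = JSON.parse(http.get('/metrics/run-42').body)
  doc['count'] = 17012
  http.put('/metrics/run-42', doc.to_json, json)

  # Insert-only: every run just POSTs a brand-new document -- no read,
  # no _rev, no conflict handling. Couch only ever appends.
  run = { 'run_id' => 42, 'phase' => 'prioritize', 'count' => 17012 }
  http.post('/metrics', run.to_json, json)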

Things had to change.

The biggest concern was the metrics web page and widgets that showed a lot of the different results of the runs - all hitting Couch for their data. In the previous version, we had the one document to look at for all the data, but now we had to be careful about which documents we were looking at and gathering up for display.

Thankfully, the views in Couch could be adjusted so that very few code changes were needed, and where there were code changes, we didn't have to touch the views - so it was pretty easy to get things figured out. Not bad, really.

At least now, come Monday, we'll have good data in the metrics app, and that's very important.

Created New Tools for Mid-Day Prioritization Fixes

Thursday, November 29th, 2012

I had long suspected that we needed to have tools for correcting problems associated with the reassignment and prioritization phases of the process, and today I finally decided to just make some. There are several interesting pieces in this story, but let's talk about the actual need and how I worked that into the process first.

It's not surprising that again today we had a slight issue with the prioritization of a single sales rep - not due to the code, but due to the incoming data. It would have been really nice to simply re-run that one sales rep through the prioritizer and fix them up. But we didn't have any tools to do that, so we had to say "Sorry, it'll be fixed tomorrow".

So after stand-up, I decided that we needed to have tools to:

  • Re-prioritize a single sales rep - this will pull all the accounts (merchants) for a single sales rep and then cleanly rank them for their daily call list. This is basically what we do nightly, except the nightly run starts by getting all the sales reps in a division.
  • Clear all the priorities on a single sales rep - this is something that I think is going to become more important as things slip through the cracks and we need to clear out account call list priorities en masse. This will simply pull in all the accounts for a single sales rep and then clear out their call list priorities.
  • Clear all the priorities on a single sales rep within a division - this is like the last, but in the case of the House Account, which is the same for all divisions, we might want to confine the clearing to a single division for safety's sake.

With these three tools, we should be able to do all the quick fixes that have come up since we started showing this to sales reps and their city managers. Thankfully, the code for all this is pretty simple - even battle-tested. The existing code gets all the sales reps for a division and then prioritizes them one by one, so we can simply make that inner block a prioritize_rep() method and arrange the code so that it's easy to call either one - the division-level method or the per-rep one - and get what we need.

Finally, it's simple to copy that method and create clear_rep(), where we don't prioritize the accounts for a rep but simply get them and clear out the requisite fields. It's not bad at all - pretty simple, really. Roughly, the shape of the refactor is this (a sketch - the helpers and the priority field here are made up, not the real code):
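  # A sketch of the refactor -- sales_reps(), accounts_for(), rank(),
  # and the call_priority field are stand-ins for the real code.
  def prioritize_division(division)
    sales_reps(division).each { |rep| prioritize_rep(rep) }
  end

  def prioritize_rep(rep)
    # what used to be the inner block of the nightly division run
    rank(accounts_for(rep)).each_with_index do |acct, i|
      acct.call_priority = i + 1
      acct.save
    end
  end

  def clear_rep(rep, division = nil)
    accts = accounts_for(rep)
    accts = accts.select { |a| a.division == division } if division
    accts.each do |acct|
      acct.call_priority = nil
      acct.save
    end
  end

But that's where the fun ends.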

In order to do this, I had to change a lot of specs and other code, simply because it wasn't written to be flexible. This is what I hate most about these unit tests: they really aren't written to be as reusable and flexible as the code they're testing, but they need to be. I spent probably 30 minutes changing the code, and about another hour fixing the tests. That's messed up.

But the real story of the day is that when I was talking about doing this, some of the other guys in the group didn't necessarily want to help do it, but they certainly wanted to make sure that their $0.02 was listened to, and acted on. It's like the unwritten evil side of Agile, or maybe it's just the Code Monkeys, but it's considered perfectly natural to have a design discussion - even if it's completely one-sided. It's considered "helpful" and "nice". But really, it's about wanting to control the decision without having to bear the responsibility of it being right.

I can work in Agile, and I see some of the benefits, but I think it's like any other evangelical movement - the downsides are completely dismissed by the "faithful" as fringe, and extreme - and certainly not representative of what they do. But it is.

I really long for being on a project where I don't have Code Monkeys. I like the people, just not how they act a lot of the time.

Rewriting Bash to Ruby (cont.)

Thursday, November 29th, 2012

This morning I was able to finish up the re-write of the summary script, and I was very pleased with the results: the processing of the pipeline log dropped from 4+ min to less than 5 sec - even with jruby - and the other two logs are in the sub-2 sec range. The latter are really dominated by the jruby startup time, and if we can move to an MRI ruby in deployment, that will help here too.

In short - fantastic success! Now I need to come up with a better queueing and processing scheme in bash - or re-write that in ruby as well…

Great File Encoding Tip for Ruby

Thursday, November 29th, 2012

This morning I ran into a problem with the ruby re-write of the summary script that I've been working on since late yesterday. The error was occurring in this relatively simple code:

  File.open(src) do |f|
    f.each_line do |line|
      if line =~ / BEGIN /
        # …
      end
    end
  end

right as the file was being read. The error was cryptic:

  summary:48:in 'block in process_pipeline': invalid byte sequence in UTF-8 (ArgumentError)
      from summary:47:in 'each'

I had to hit google. It was clear to me there were odd characters in the file, and while I might like to fix that, the key in the previous version was to include the '-a' option to grep so that it would process the files as text even though they looked binary. But what would do the trick here?

Turns out there's a StackOverflow answer for that:

  File.open(src, 'r:iso-8859-1') do |f|
    f.each_line do |line|
      if line =~ / BEGIN /
        # …
      end
    end
  end

which instructs the IO object to read the file with the ISO-8859-1 encoding and that did the trick. No other changes were necessary!

Sweet trick to know.

Rewriting Bash to Ruby

Wednesday, November 28th, 2012

With all the efficiency changes in the code recently, the next most inefficient piece was the bash script that analyzed the run logs and generated some summary statistics for us to view in the morning. When I first created this script, it wasn't all that complex, and the logs weren't nearly as big as they are now. I used the typical assortment of scripting tools: grep, sed, awk. But as I added things to the summary script, the time it took to execute got longer and longer - to the point that it took several minutes to run on the main pipeline process. That's no good.

So I wanted to rewrite it in something that was going to be fast and process the file only once. The problem with the current version isn't that it's using bash, or grep; it's that the files are hundreds of megabytes and we scan them a dozen or more times for the data. What we needed was a single-pass summary script, and that's not happening with bash and grep.

So what then?

Ruby popped to mind, but given that we're using jruby, there's a significant startup penalty. But maybe we can force it to use a compiled MRI ruby in the deployment environments, and that will speed up the loading.

C, C++ both seemed like ideal candidates, but then I know how the rest of the guys in the group would react, and it's just not worth it.

So Ruby it is.

This shouldn't take long, as most of this is pretty simple stuff for ruby. Let's get going...
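Something like this is the idea - a minimal sketch, with made-up patterns standing in for the real statistics:

  #!/usr/bin/env ruby
  # Single-pass summary (sketch): scan the log once, tallying every
  # statistic as we go, instead of grep-ing the file a dozen times.
  counts = Hash.new(0)
  File.open(ARGV[0]) do |f|
    f.each_line do |line|
      counts['divisions'] += 1 if line =~ / BEGIN /
      counts['errors']    += 1 if line =~ /ERROR/
      counts['warnings']  += 1 if line =~ /WARN/
    end
  end
  counts.each { |tag, n| puts "#{tag}: #{n}" }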

Moving Day has Arrived!

Wednesday, November 28th, 2012

Finally, moving day has arrived! This morning I've been getting things moved over to the new servers in our own datacenter, and this should provide a much-needed boost to the performance of the application. This includes a CouchDB server with 24 cores, 96 GB of RAM, and a 1.6 TB disk array, as well as a nice app server with 24 cores and 96 GB of RAM. There's a mirrored Couch pair for production, and a similar app server there.

It's been a lot of little things - lots of little code changes and pushes, even some reconfiguring of aliases in the firewalls - and that's where I'm starting to hit a snag. I used to be able to make those firewall changes myself; now they're meant to be handled by the production operations group. That's not too bad, but they won't push anything until 4:00 pm today, and if it's not 100% right, we're going to have a hard time getting it right for tomorrow.

I'm hoping to get a few more tests done today, but I doubt that I'll be able to simply because a co-worker is busy using UAT to test things there. It's a shared environment, and there's no way to run both tests at once, so since he was first, I have to wait.

I'm not the most patient of people.

When Deadlines aren’t Really Deadlines

Wednesday, November 28th, 2012

This morning I got an email from the project manager for the project I'm on about the Q4 goal of getting the code running for all of North America by the end of the quarter. In short, he wants to have coding done on 12/10, and then running for everyone on 12/17. I wanted to ask him in what calendar 12/10 is the "end of the quarter". I was a little more respectful than that, but the end result was that I asked him to explain why we needed to hit these artificial deadlines. His response was classic:

Because in order to have users using it by the end of Q4, we need to have training and development done long before that.

To which I just about replied: I quit!

But I didn't. I simply said that I was only comfortable with getting the code running by 12/31/2012 - not 12/10, or his artificial goals.

In the end, I know it was bad. It'll end badly, and I really don't care at this point. I've given him all I am going to, and now he's just a plain old jerk in my book. Nothing special about that.

And certainly not worth working on Thanksgiving.

Getting New Hardware Ready to Go

Tuesday, November 27th, 2012

This afternoon I've been working with a co-worker to get all the new hardware up and going in our own datacenter so that we can move our application from Amazon's EC2 to our own, more reliable, machines. It's a bit of a hassle in that there are now 22 new machines to rebuild, and the folks doing it aren't paying really close attention to the machine names and set-ups, so there have been a lot of re-dos, but it's getting there.

We should be able to get all the critical machines up and going before I have to leave today, and then I can get started on moving the apps in the morning.

Exciting times, to be getting out of EC2 and onto far, far better hardware. I'm just hoping that it's going to clear up the issues we've been having with Couch. Now that would be really nice!