Archive for November, 2012

Getting New Hardware Ready to Go

Tuesday, November 27th, 2012

This afternoon I've been working with a co-worker to get all the new hardware up and going in our own datacenter so that we can move our application from Amazon's EC2 to our own, more reliable machines. It's a bit of a hassle: there are 22 new machines to build, and the folks doing it aren't paying close attention to the machine names and set-ups, so there have been a lot of re-dos. But it's getting there.

We should be able to get all the critical machines up and going before I have to leave today, and then I can get started on moving the apps in the morning.

Exciting times to be getting out of EC2 and onto far, far better hardware. I'm just hoping it's going to clear up the issues we've been having with Couch. Now that would be really nice!

Code Cleanup

Tuesday, November 27th, 2012

Today has been a lot of little things to try to get the application's performance good enough that it can still run in EC2 for the few days it has left in that datacenter. I'm trying to put in simple, clean fixes that minimize the time spent in an overall run so that we can get more divisions out in the same period of time.

This brings up the point that's been bugging me for a few days, and that's expectations. I'm really getting tired of making extraordinary efforts for some management folks who don't seem to recognize the nature of the effort, or appreciate what it is that I'm really doing.

It's nothing I haven't seen before, but it's always a little sad the first time you see it at a new job. That realization that this guy is no better than that other guy at the previous place, and they are going to push and make artificial deadlines and then pretend to "tell Dad" if you don't meet them.

Working Wednesday, Thursday, and Friday last week to make a deadline I didn't think was possible, just so this guy could tell his superiors that his team "did it", was something I was willing to do - as long as it was appreciated. But it wasn't. So now this guy has marginalized himself. I won't break my back to get him out of his own jam any more.

But hey… what am I doing now, then? I'm trying to make this work as opposed to just letting it fail.

I'm a chump.

Tracking Down Problem with Salesforce Data

Tuesday, November 27th, 2012

This morning I was tracking down a bug that was reported by our project manager related to the prioritization phase. This particular sales rep wasn't getting a good call list, and I needed to dig into why.

After I added a bunch of logging, I was able to see that it was all a data problem. The fields in Salesforce are often just free-form strings, and that leads to sets of values that aren't easily enumerable. It's not necessarily Salesforce's fault - it's the way in which it's used - and we seem to be having a little problem with consistency here. But be that as it may, it's still our problem, and we need to figure out the proper way to get at these sales reps regardless of how they seem to be classified.
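
Just to make the point concrete, here's a minimal sketch of what I mean by folding those free-form strings into something we can actually enumerate - the field values and mappings here are made up, not the real ones, but this is the general shape of the fix:

  # Fold the free-form Salesforce role strings into a small, known set of
  # categories before prioritization. The patterns here are illustrative only.
  ROLE_MAPPINGS = {
    /inside\s*sales/i    => :inside_sales,
    /field|outside/i     => :field_sales,
    /account\s*manager/i => :account_manager
  }

  def normalize_role(raw)
    return :unknown if raw.nil? || raw.strip.empty?
    ROLE_MAPPINGS.each { |pattern, role| return role if raw =~ pattern }
    :unknown
  end

  normalize_role('Inside Sales - Chicago')   # => :inside_sales
  normalize_role(nil)                        # => :unknown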

Sigh… these pseudo-business decisions are always the worst. They are made for "today", and change "tomorrow", and we're always going to be correcting for problems in the mappings.

Writing Effective Log Messages – It’s a Lost Art

Tuesday, November 27th, 2012

I know this may seem like an old man complaining about these young kids and how they aren't doing it right, but I have to say, it seems that the art of writing good, concise, effective log messages is a lost art. I've been trying to debug a problem this morning, and it all cleared up when I introduced one decent log message and elaborated a little on a few others. I mean really - the problem is clearly solved with a few minutes of work on writing effective log messages.

OK, so here's my list of rules for log messages - not that anyone cares:

  • Each log message has to stand alone - you can't assume that log messages will arrive in any particular order - certainly not with multi-threaded code, and that's just about the standard these days.
  • Each log message has to be useful - putting out a message saying "sending 5 to output" is not really useful. You can say more - like what they are, or why they are going out. If not, you're really only doing the log file equivalent of a "busy indicator", and that's not useful.
  • Each log message is human-readable - when you dig into log files, you need to be able to read them. There is a school of thought where the log files should be designed for easy scraping. I think the scraping is something done after you have good logs, and it's not all that hard. But listing key/value pairs just doesn't cut it.
  • Each log message contains the class and method where it occurs - there's so much to be gained by always knowing where the code is that wrote the log. Just do it.
  • Put in enough logging to know what's happening - disk space is cheap, so write out good log messages every step along the way of the processing. This is going to pay off over and over when you're tracking down problems.
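
To make this concrete, here's a rough sketch of the difference, using ruby's stdlib Logger - the class, division, and counts are invented, but it's the shape of the message that matters:

  require 'logger'

  log = Logger.new(STDOUT)

  # The "busy indicator" style - tells you almost nothing on its own:
  log.info("sending 5 to output")

  # The kind of message I'm after - it stands alone, says where it came
  # from, and says what actually happened and why:
  log.info("Prioritizer.publish_calls: sending 5 call lists to the output " +
           "queue for division 'midwest' (3 reps skipped - no open leads)")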

This morning, I've been adding to and augmenting the log messages in our code to get things up to the point that I can effectively debug a problem we're having. Had this already been done, the debugging would have been trivial because there's no bug! It's all a data problem, and that would have been easily seen with a little better logging.

Oh well… I guess that's going to be part of what I have to do in this group.

Move to CouchDB Server-Side Updates

Monday, November 26th, 2012

CouchDB

In a continuing effort to make the code more efficient and, really, just plain faster, this afternoon I've been working with a teammate to update CouchRest, our ruby client to Couch, to handle server-side updates. Couch allows server-side updates - you basically write a javascript function that takes the document and the request, updates the document as you see fit, and returns something to the caller.

It's not bad, really. It should certainly make the updates a ton faster as right now we're reading, updating and writing back the complete document for a very small change - in one case just a single field. This is really where the document database falls down, and you long for a SQL statement where you can simply UPDATE and be done with it.

Still, it's nice to be able to write:

  // CouchDB update handler: merge the request body into the document
  function(doc, req) {
    var ans = false;
    var fld = 'lead_assignment';
    if (doc) {
      // overwrite the field with the JSON payload from the request
      doc[fld] = JSON.parse(req.body);
      ans = true;
    }
    // returning [doc, response] saves doc and sends response to the caller
    return [doc, JSON.stringify({'updated': ans})];
  }

and be able to make a change with:

  # Call the server-side update handler for the latest results document -
  # the new assignment goes up as the request body, no read/modify/write.
  def update_merchant_assignment(division, sf_id, stuff)
    return nil if (id = get_latest_results_docID(division, sf_id)).nil?
    Database.update('merchant_updater/add_assignment', :id => id, :body => stuff)
  end

It really simplifies the code, and it certainly cuts the bytes moved for an update way down. I'm hoping it's enough… we'll have to see how it goes.

Plenty of Production Problems – Argh!

Monday, November 26th, 2012

This morning has been a really tough one. It started with me checking on the overnight runs while I was still at home (4:00 am) and seeing that they failed due to problems I introduced over the latter part of the week. I really hate that. It was my fault, that's for sure, and it was brought on by a very inconsistent API in CouchRest. No excuse - it was me, and it really bugs the crud out of me when I do that.

No errors, just failed writes to Couch. Argh!

The next really nasty thing was that with the new divisions came new data, and in that data were some bad records, and the optimistic coding that is the hallmark of the ruby devs I know simply started erroring out on nil pointers. Argh!
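
The fix isn't complicated - it's mostly a matter of admitting the data can be junk. The field names below are invented, but the change is basically this, instead of blindly chaining calls:

  # Optimistic version - one nil field in the source data and this blows up:
  #   revenue = merchant['financials']['monthly_revenue'].to_f
  #
  # Defensive version - bad data gets a default and a log line instead of
  # killing the whole run:
  def monthly_revenue(merchant, log)
    financials = merchant['financials']
    revenue = financials && financials['monthly_revenue']
    if revenue.nil?
      log.warn("Loader.monthly_revenue: no monthly_revenue for merchant " +
               "#{merchant['_id']} - defaulting to 0.0")
      return 0.0
    end
    revenue.to_f
  end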

In the end, I was able to get things re-run and it was OK, but it was a very stressful morning, and there doesn't seem to be a decent payoff for all this stress and work.

It just doesn't seem worth it.

Activated the Write-Back for Production Pilot

Saturday, November 24th, 2012

Come Monday, we have a new pilot to start - even though we really haven't solved any of the scaling issues, we press for more features. It's getting kinda old. But hey, a promise is a promise, and I hope it doesn't bury us.

Thankfully, it's only two divisions, and I added them to the whitelist in the config files, and we should be good to go for Monday. I've got my fingers crossed.
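
For what it's worth, the whitelist itself is nothing fancy - something along these lines, where the division names and config keys are stand-ins rather than the real ones:

  # config/pilot.yml (stand-in names):
  #
  #   write_back:
  #     whitelist:
  #       - division_a
  #       - division_b
  #
  # and the check in the code is just:
  def write_back_enabled?(division)
    CONFIG['write_back']['whitelist'].include?(division)
  end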

Fixed up the Retry Code and Added Instrumentation

Friday, November 23rd, 2012

I've been having plenty of issues with one of the processes in the application, and I needed to really bolster up this optimistic code with some solid defensive coding - including handling timeouts and putting in some solid New Relic instrumentation to boot. These latter phases of the project have really been glossed over until recently - little to no logging, no instrumentation, no real careful, thoughtful coding.
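
The New Relic side of it is pretty painless with the ruby agent's method tracer - roughly like this, where the class and method names are examples rather than the actual code:

  require 'new_relic/agent/method_tracer'

  class AssignmentWriter
    include ::NewRelic::Agent::MethodTracer

    def write_assignments(division)
      # ... the (now more defensive) write logic, with timeouts and retries ...
    end

    # Time every call as a custom metric so it shows up in New Relic
    add_method_tracer :write_assignments, 'Custom/AssignmentWriter/write_assignments'
  end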

So I have to go back and do it now.

Ideally, it'd be great to see a change in the hearts and minds of my team-mates, but I'm not counting on that. I think it's just not in how they see themselves and their jobs. So it's up to me to do it.

It's not horribly hard, and it keeps me off the streets.

Coding on Thanksgiving – Trying to Get Performance Up

Thursday, November 22nd, 2012

Well… the addition of the new divisions (added just to meet the crazy deadline) didn't go as well as I'd hoped. Thankfully, I had good New Relic data to look at and see what was happening in the process(es). What it looks like is that there are large sections of code that aren't really taking advantage of the machine, and doing too many things serially. So I set about attacking them.

On Thanksgiving.

First, there was one process that was doing a lot of writing to Couch serially. That was easy enough to fix with a simple java Executor and a couple of threads. I also converted the single-document writes to Couch into bulk stores, so that we got much better performance when we had all the data to write up-front.
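
The bulk-store change is roughly the difference below - DB stands in for our CouchRest database handle, and the thread version is just the plain-ruby equivalent of the Executor approach:

  # Old way - one HTTP round-trip per document:
  #   docs.each { |doc| DB.save_doc(doc) }
  #
  # New way - batch them up and let Couch take them in one shot:
  DB.bulk_save(docs)

  # And for independent groups of writes, a couple of worker threads
  # instead of doing everything serially:
  threads = docs.each_slice(docs.size / 2 + 1).map do |batch|
    Thread.new { DB.bulk_save(batch) }
  end
  threads.each(&:join)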

The next thing was to try adding timeouts to the CouchRest API just to see how that would go. I'm hoping that these REST calls that simply don't return can be trapped in a simple "total timeout" and then retried. As it is now, some of them simply never return, and that's no good at all.
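
The timeout-and-retry wrapper is nothing exotic either - a sketch of the idea using ruby's stdlib Timeout, with the limits and the wrapped call as placeholders:

  require 'timeout'

  # Wrap a Couch call in a hard "total timeout" and retry a couple of times
  # before giving up - better than a call that simply never comes back.
  def with_total_timeout(seconds, retries = 2)
    attempts = 0
    begin
      Timeout.timeout(seconds) { yield }
    rescue Timeout::Error
      attempts += 1
      retry if attempts <= retries
      raise
    end
  end

  # e.g. with_total_timeout(30) { DB.save_doc(doc) }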

In the end, I had to get the speed up. I'll see how these changes work tonight and make any needed adjustments in the morning.

Working at Home – Just Won’t Miss a Deadline

Wednesday, November 21st, 2012

I'm working at home on something that I really shouldn't be working on - trying to meet a deadline that I told the project manager we weren't going to meet because we had been having scaling issues, and it just wasn't feasible to get it done by Monday. But here I am… a little bit of spare time, and I'm a sucker for not missing deadlines.

So I'm just going to add in the new divisions and see how it goes. If I have to make adjustments to the code to make it fit in the time allowed, so be it. It should work, and the only question in my mind is do we have the time?

I've got my fingers crossed.