Archive for the ‘Cube Life’ Category

Interesting RSpec Tips

Monday, November 19th, 2012

Unit Testing

This afternoon I found a set of tests in the code that weren't implemented, but should be. One of the guys on the Team had stubbed them out as a reminder that we needed them, but he didn't have time to actually implement them before moving on to other work. I noticed them, and since I didn't have a lot going on at the time, decided it made sense to take a crack at implementing them.

After all, I know there's a lot about rspec I don't know, and this would be a nice way to learn about it.

Some of the tests were really pretty clear: make sure the main routine returns something. So how do you do that simply? Well… we can stub out the collaborating methods with simple return values and then just verify that we get back what we expect.

  require 'lead_assignment/entry_point'
 
  describe LeadAssignment::EntryPoint do
    describe ".unassign_and_assign" do
     before(:each) do
       class FauxRepClient
         def get_reps(division)
           []
         end
       end
       LeadAssignment::EntryPoint.stub(:reps_client => FauxRepClient.new)
       LeadAssignment::EntryPoint.stub(:fetch_accounts => [])
       LeadAssignment::EntryPoint.stub(:add_accounts => [])
     end

On this first test, I noticed that I wanted to start all my tests with this little bit of configuration, so it was easy to put it into a before(:each) block, where it runs before each test within the scope of the enclosing describe. That's nice to remember.

Then I can do the end-to-end test:

     it "returns a result" do
       LeadAssignment::EntryPoint.stub(:sink => nil)
       results = LeadAssignment::EntryPoint.unassign_and_reassign('cleveland')
       results.should == { :unassignments => [], :assignments => [] }
     end

and my first test is done!

I learned a lot about testing writing that, and it paid off as I did the others. They all looked about the same - stub out certain methods, run the section under test, and then check the output. Not bad at all. You just need to be careful and methodical about what you're doing.

Then I came to a more challenging problem: I needed to know that a specific method was being called on an instance, and with a certain set of arguments. That's not too bad - you make the instance yourself, and then you can set expectations on it:

     it "writes the results to salesforce" do
       class FauxSFClient
         def bulk_store(accounts)
           nil
         end
       end
       my_store = FauxSFClient.new
       LeadAssignment::EntryPoint.stub(:send_assignments_to_sf? => true)
       LeadAssignment::EntryPoint.stub(:store => my_store)
       my_store.should_receive(:bulk_store).with([]).exactly(2).times
 
       LeadAssignment::EntryPoint.unassign_and_reassign('cleveland')
     end

and again, this works great! I like that I can specify the args to the method, and the instance doesn't need to be exactly what's in the code - my faux class is just as good for this as anything.

The final trick I learned has to do with methods that take blocks, and testing the contents of those blocks. You can't directly inspect what's in a block, but you can call it and then test the value it returns:

     it "writes log messages properly for summary script" do
       LeadAssignment::EntryPoint.stub(:send_assignments_to_sf? => false)
       LeadAssignment::EntryPoint.stub(:unassign => [])
       LeadAssignment::EntryPoint.stub(:assign => [])
       LeadAssignment::EntryPoint.stub(:sink => nil)
 
       QuantumLead::Application.logger.should_receive(:info) do |method, &block|
         method.should == "LeadAssignment::EntryPoint.unassign_and_reassign"
         block.call.should == "Starting LeadAssignment in cleveland"
       end
 
       LeadAssignment::EntryPoint.unassign_and_reassign('cleveland')
     end

and you can have as many of those logger-checking expectation blocks as you expect calls to the logger. It's pretty nice.
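
For instance (with a hypothetical second log message - the real code may log something different), if the run logged both a start and a finish, two expectations back to back would cover the two calls:

     QuantumLead::Application.logger.should_receive(:info) do |method, &block|
       block.call.should == "Starting LeadAssignment in cleveland"
     end
     QuantumLead::Application.logger.should_receive(:info) do |method, &block|
       # hypothetical second message, just to show the pattern
       block.call.should == "Finished LeadAssignment in cleveland"
     end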

With all this in place, I was able to whip up the necessary tests for the code in short order. Pretty nice tools.

Starting to See a Little Light – Maybe it’s the Train?

Wednesday, November 14th, 2012

Great News

Today I got in early and really started hammering the last few issues for the Pilot starting Monday. I have 10 hours. It's got to be deployed this afternoon and that's it. We have to run in test mode for two days and then turn it on. So it was time to get down to business.

I was able to fix up the last few issues pretty easily, and by 9:00 am I was looking pretty good. People started coming in and things were looking even better, and stand-up went smoothly.

Maybe a little light?

Then I started to polish the code a little: we needed to clean out some stale code, clear up a few things with Salesforce, and promote something else. All looking very good. I was hours away from tonight's deployment, and - I hate to say this for fear of jinxing myself - things were looking good.

CouchDB

I then had the time to look at one of the CouchDB problems I've been having: socket errors. What the code was doing was writing the documents to Couch one by one:

  payloads.each do |data|
    Database.store('Final results') { data[:merchant] }
  end

and while it works, it's making thousands of REST calls to Couch, and that's not efficient at all. There's a bulk store API to Couch in the client we're using, and if I just change the code to be:

  essentials = payloads.map do |data|
    data.select { |k,_| k != :demand_pool }
  end
  Database.store('Final results') { essentials }

then we're making one connection for every 2000 documents instead of one per document, and all of a sudden, things are under control with Couch!

This is GREAT news! I'm super happy about this. It means we may not have to ditch Couch, and that's nice - and if we do, we now have plenty of time to switch it out based on what we're doing and needing, and not because of some socket problem with the server.

Very nice news!

The day is looking up… maybe the light isn't the oncoming train, after all!

The Final Push for a Big Pilot – Crazy Plans

Tuesday, November 13th, 2012


Today was a stressful day - very. I spent it trying to get the CouchDB and Salesforce endpoints to scale, having essentially no luck, and then went into a meeting this afternoon about how we can't hold anything back for the new Pilot Launch on Monday.

Like the Orange Juice commercial, the fun just kept coming: in the weekly planning meeting I found out that we need to have even more features in the code before tomorrow evening.

13 hours.

We have 13 hours to stabilize this code, get some sense of its scalability, fix a few outstanding features, and then deal with these few more - all in 13 hours.

I think it's a mistake, and it's driven by nothing more than the project manager's desire to look good to his boss(es). It's all about face. His. He isn't asking us for dates, he's telling us dates, and then when things come up - as they often do - it's "impossible" to move the dates. All for face.

So I'm getting a little tired of this today. I need to just go home.

Thankfully, it's Angelina's Birthday, so I have that to look forward to tonight. Good.

Updated Configs to Work in Two Datacenters

Monday, November 12th, 2012


The original plan was to have Production running on the old EC2 boxes and have UAT run out of our own data center until we were sure things were OK, and then switch Production over as well. That seemed like a good plan, but there were issues with it, and management instead wanted to run the essential production and UAT data in EC2 and the non-essential production and UAT data in SNC1. This means multiple boxes running the same code, hitting the same data sources and sinks, but covering different regions.

Sounds reasonable, and even safe. So let's do that.

The issue with getting a different config for UAT in one datacenter is that all we really have to tell the machines apart is hostname versus hostname -f, and I had to use that every place I could. The wrinkle came in the configuration itself, which doesn't look at the machine name at all - just the environment setting, 'uat' or 'production'. Not so easy.
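
A minimal sketch of the hostname trick (the helper and the 'snc1' substring check are assumptions, not the real config code): derive the datacenter from the fully-qualified name, since the environment setting alone can't tell EC2 from SNC1.

  # hypothetical sketch - derive the datacenter from `hostname -f`
  def datacenter
    fqdn = `hostname -f`.strip
    fqdn.include?('snc1') ? 'snc1' : 'ec2'
  end

  # then the config lookup can key off the environment *and* the datacenter, e.g.
  #   config[environment][datacenter]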

But I worked on this all day. It was not easy. And then I was ready to test.

It wasn't pretty.

The problem is that we aren't synchronizing the work between data centers, and that makes the later processing steps (prioritization) fail because each side only has part of the picture. There's no easy way around it.

The next problem was continued Couch errors. Yes, using my caching endpoint helped, but we'd still get problems now and then. No easy solution in our code.

So in the end, almost a wasted day. Almost.

Not a great feeling.

Scaling Problems with Salesforce and CouchDB

Friday, November 9th, 2012

CouchDB

Today I've been having real trouble trying to get our system scaled up to new hardware in the datacenter. Moving from hosts in Amazon EC2 to machines in our own datacenter is a big step in the right direction: going from iffy bandwidth in EC2 to solid switches, and from 2 cores to 24, is something we simply have to do in order to handle the global data load that's coming.

The problems I've run into today are all about loading. I've already added a queueing system to the CouchDB interface, in order to minimize the connections to the Couch server so as not to overload it - there were times when the Couch server would simply shut down its socket listener, and therefore refuse all updates sent from the process. Not good.

Salesforce.com

There are also a lot of problems today with Salesforce. I don't think they expected the kind of loads we're delivering. This morning, at 3:00 am, Salesforce called the support guys at the shop and told them that a process was bringing one of their sandbox clusters to its knees - and that, it turns out, was from just one of the four boxes we need to bring online. They're having a hard time handling this much; I can't imagine what's going to happen when we try to really ramp it up.

I'm starting to have real concerns about both these endpoints of the project. I know there's a lot of data getting moved around, and while we're able to handle it, it's these endpoints that are having the hardest time. I've talked with the project manager and the technical manager about this, and I think we need to start thinking about potential bail-out scenarios.

It's certainly possible to read from Salesforce. We're planning on re-doing the complete demand system, so there shouldn't be an issue there. Persistence? Go back to MySQL or PostgreSQL and store it all in tables. The data is getting pretty nicely finalized, so a decent schema shouldn't be hard to design. Save it all in a SQL database, build a simple service that reads and caches this data and offers it up to the clients, and the pages already built on the existing data sources can be modified pretty easily.

Odd to think that I was looking at MySQL before Couch popped up. Funny thing is, I have a strong feeling that Salesforce can come up with hardware that makes the grade, but it's the bugs in Couch that worry me the most. You just gotta wonder if it's in the Erlang code, or on the boxes, or what.

So many unknowns, but it's clear that simply moving to one nice box isn't going to be enough - there's no way we're going to make this work globally without some serious changes.

Adding Queueing to CouchDB Interface

Friday, November 9th, 2012

CouchDB

I've been fighting a problem with our CouchDB installation using the CouchRest interface for ruby/jruby. What I'm seeing is that when I have a lot of updates to CouchDB from the REST API in CouchRest, we start to get socket connection errors to the CouchDB server. I've gone through a lot of different configurations of open file handles on both boxes, and nothing seems to really 'fix' the problem.

So what I wanted to do was to make a queueing database class. Currently, we have something that abstracts away the CouchDB-centric features and adds in the necessary metadata so that we can more easily track all the data held in CouchDB. This is a nice start, in that all I really needed to add to the existing code was a simple flush method:

  Database.flush

where I initially started with it implemented as a no-op:

  def flush
    nil
  end

At this point, I was free to do just about anything with the Database class - as long as I didn't change the public API. What I came up with is this:
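
In rough outline, a queueing Database class along these lines might look like the sketch below (the internals here - the queue, the :stop sentinel, the bulk_store helper - are assumptions; only the start, store, and flush calls match the API used in these posts):

  require 'thread'

  class Database
    class << self
      # spin up writer threads that drain a shared queue
      def start(writer_count)
        @queue   = Queue.new
        @writers = Array.new(writer_count) do
          Thread.new do
            while (doc = @queue.pop) != :stop
              bulk_store([doc])        # a real version would batch docs per write
            end
          end
        end
      end

      # public API stays the same: a label and a block that yields the document
      def store(label)
        doc = yield
        if @queue
          @queue << doc                # caching endpoint: defer the write
        else
          bulk_store([doc])            # pass-through: write immediately
        end
      end

      # no-op unless start was called; otherwise stop the writers and wait
      def flush
        return nil unless @writers
        @writers.size.times { @queue << :stop }
        @writers.each(&:join)
        @writers = nil
        @queue   = nil
      end

      private

      def bulk_store(docs)
        # the real code hands docs off to CouchRest's bulk API here
      end
    end
  end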

What I really like about this is that I can use it either way - as a caching/flushing endpoint, or as a straight pass-through with thread-local database connections. This means that I can start with a simple config file setting:


  data_to: 'database'
  writers: 3

which will give me three writer threads, passed in as the argument to the start method. Then, when I fix the CouchDB issues, I can switch back to the simpler, thread-local connection storage just by setting writers to zero, since the start call is guarded:

  Database.start(config.writers) if config.writers > 0

The call to flush is a no-op if we didn't start anything, so there's no harm in always calling it at the end of the process. Pretty nice. I've easily verified that this gets me what I need, and it's just a matter of how much throttling I want to do on the writes to CouchDB. But I tell you this… I'm closer to MySQL than ever before - just because of this. There's no reason in the world to put up with this kind of software if it can't do the job.

Trying to Tune CouchDB

Friday, November 9th, 2012

CouchDB

When faced with the idea that CouchDB might be running low on sockets (file handles), I decided to do a little digging into the CentOS 5 configuration, and see what I could do. Turns out it wasn't all that hard, and I got immediate results. The first thing is to up the hard and soft limits on the file handles in /etc/security/limits.conf for the couchdb user:

  couchdb  soft  nofile  32768
  couchdb  hard  nofile  60000

then simply restart CouchDB. You can check the process limits with:

  $ cat /proc/1234/limits

where 1234 is the pid of the Erlang 'beam' process. You're also going to need to tell Erlang about the number of file handles in the /usr/local/etc/default/couchdb file:

  export ERL_MAX_PORTS=32768
  export ERL_FLAGS="+A 8"

The second line there gives Erlang 8 async threads for processing disk updates. This multi-threaded filesystem access isn't on by default for Erlang, so if you're on a big box with lots of cores, it makes a lot of sense to turn it on.

With these, you have CouchDB running about as well as you can get. The rest is up to you.

Found Problem in JRuby 1.7.0 and Java Executors

Wednesday, November 7th, 2012

JRuby

As part of my work on scaling up the code to a global system, I ran into some very odd problems in jruby-1.7.0. I'm not exactly sure if they're in jruby or in the JDK 1.7.0_05 build for CentOS 5, but I'm willing to bet they're in jruby, as I think the JDK has been hammered on a lot harder for these kinds of issues. So here's what I found.

It's all about using Java Executors in jruby-1.7.0. I'm creating a pool of threads with the newFixedThreadPool() method, then running through an array of updates to send out to a REST API (Salesforce), and then waiting for it all to finish up.

  require 'java'
  java_import 'java.util.concurrent.Executors'
  java_import 'java.util.concurrent.TimeUnit'
 
  executor = Executors.new_fixed_thread_pool(5)
  updates.each do |u|
    executor.execute { send_update(u) }    # hand each update off to the pool
  end
  executor.shutdown
  executor.await_termination(60, TimeUnit::SECONDS)    # wait for everything to finish

What I'm seeing is that when the threads are done processing, they don't all seem to "clear" the executor. For some reason, they aren't seen as "done" by the executor, but when looking at the REST API service, I know they completed. And it's always in the last batch of tasks.

This doesn't ever seem to happen on the Amazon EC2 hardware - only the nice, new, fast boxes in the datacenter.

So what I decided to do was to add a special timeout to the shutdown of the executor. The idea is that if we know how long any single action should take, then once we get to the end of the processing queue and have waited long enough, it's OK to forcibly shut the executor down - ready or not, the work should have been done by then.
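
A rough sketch of that forced shutdown, assuming a hypothetical per_task_seconds estimate (this isn't the actual snippet from the code):

  java_import 'java.util.concurrent.TimeUnit'

  executor.shutdown
  # per_task_seconds is a hypothetical estimate of how long one update should take
  grace = per_task_seconds * updates.size
  unless executor.await_termination(grace, TimeUnit::SECONDS)
    executor.shutdown_now    # ready or not, the work should have finished by now
  end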

It's not ideal, and in the normal case the forced shutdown should never be needed. But I'm getting a lot of problems with Salesforce and CouchDB as part of this scaling, and I really have no idea what's going on inside either of those systems. Better to add this and be safe.

Moving a Class to Use Static Methods

Tuesday, November 6th, 2012

Building Great Code

This afternoon I finally got around to fixing one of the problems I noticed this morning in my battles with production but didn't want to attack at the time: converting the class I was having problems with to use class methods (static methods), more in keeping with the other, similar classes in the codebase. It's a reassigner, and in that role it's really quite stateless, so there's no reason to "create one" and then let it work on the data - we can simply have something that works on the data. But the original author didn't see it that way, and passed in things that the other, similar functional blocks had internalized.

This made it a little odd in the sense that one part of the system did it as a "factory", and the other as an "instance", but that's easy enough to change - now that I have the time. I simply went in and cleaned everything up, correcting all the references and making sure it all worked as it should.
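
Purely as an illustration of the shape of the change (hypothetical names, not the real classes):

  # before: an instance that carries state it doesn't really need
  class Reassigner
    def initialize(accounts)
      @accounts = accounts
    end

    def run
      @accounts.map { |a| a.merge(:owner => nil) }
    end
  end

  # after: stateless class methods, like the other worker classes
  class Reassigner
    def self.run(accounts)
      accounts.map { |a| a.merge(:owner => nil) }
    end
  end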

It wasn't too hard, and in the end it was a lot more like the other code. In fact, I wouldn't be surprised if we could abstract a lot of both of these into a superclass that has all the common stuff, and the only point of these individual worker classes is the specifics they deal with.

But that's for another day...

Another Busy Morning Fixing Production Problems

Tuesday, November 6th, 2012


When I got in this morning I saw that once again I was going to be fighting production, but this time it was really from the additional processes that we're running - not the main pipeline process that we've been working on for several months.

No… this was somewhat new stuff, so I'm not terribly upset, but it is annoying when you see code like this:

  hot_lead.downcase[0..1]

because I know there's going to come a time when hot_lead is nil, and that'll blow up. It's so easy to say:

  hot_lead.downcase[0..1] unless hot_lead.nil?

or something like that. So easy… but the Code Monkeys don't think it's necessary.

Sigh…

In addition to some code problems there were legitimate timeouts with Salesforce, and those really had to be dealt with. Again, the new code was far too optimistic, assuming everything would work, and in the real world we know that's just not the case. So I quickly pulled in the timeout retry handling code from another class and converted it to work with the code that was hitting the timeouts. This meant converting the pulled-in methods from class methods to instance methods, but I'll get back to this later in the day and clean it up.
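
The retry handling is roughly this pattern (a hypothetical sketch, not the actual pulled-in methods):

  require 'timeout'

  # hypothetical sketch of retry-on-timeout - the real method names differ
  def with_retries(attempts = 3)
    tries = 0
    begin
      yield
    rescue Timeout::Error
      tries += 1
      retry if tries < attempts
      raise
    end
  end

  # e.g. with_retries { salesforce_client.bulk_store(accounts) }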

With the timeouts under control, I needed to fix a few issues with the summary script that parses the log files and summarizes them nicely. They weren't as big as the problems I'd dealt with up to that point - annoying, yes, but not horribly so.

In the end, I got the production run done, but it was a touch frantic at times making sure it got done as quickly as possible.