Archive for the ‘Coding’ Category

Trying to Tune CouchDB

Friday, November 9th, 2012

CouchDB

When faced with the idea that CouchDB might be running low on sockets (file handles), I decided to do a little digging into the CentOS 5 configuration, and see what I could do. Turns out it wasn't all that hard, and I got immediate results. The first thing is to up the hard and soft limits on the file handles in /etc/security/limits.conf for the couchdb user:

  couchdb  soft  nofile  32768
  couchdb  hard  nofile  60000

then simply restart CouchDB. Look at the process limits by:

  $ cat /proc/1234/limits

where 1234 is the pid of the erlang 'beam' process, and see what it says. You are also going to need to tell Erlang about the number of file handles in the /usr/local/etc/default/couchdb file:

  export ERL_MAX_PORTS=32768
  export ERL_FLAGS="+A 8"

The second little bit there is to allow 8 threads in Erlang to process disk updates. This multi-threaded filesystem access isn't on be default for Erlang, so if you're on a big box with lots of cores, it makes a lot of sense to turn it on.

With these, you have CouchDB running about as well as you can get. The rest is up to you.

Fixing JRuby

Thursday, November 8th, 2012

JRuby

While trying to figure out a problem with the scaling, I was getting a bunch of BindExceptions and they seemed to point to this point in the jruby-1.7.0 code:

  1. try {
  2. // This is a bit convoluted because (1) SocketChannel.bind
  3. // is only in jdk 7 and (2) Socket.getChannel() seems to
  4. // return null in some cases
  5. channel = SocketChannel.open();
  6. Socket socket = channel.socket();
  7.  
  8. if (localHost != null) {
  9. socket.bind( new InetSocketAddress(
  10. InetAddress.getByName(localHost),
  11. localPort) );
  12. }
  13.  
  14. try {
  15. // Do this nonblocking so we can be interrupted
  16. channel.configureBlocking(false);
  17. channel.connect( new InetSocketAddress(
  18. InetAddress.getByName(remoteHost),
  19. remotePort) );
  20. context.getThread().select(channel, this, SelectionKey.OP_CONNECT);
  21. channel.finishConnect();
  22.  
  23. // only try to set blocking back if we succeeded to finish connecting
  24. channel.configureBlocking(true);

If I was getting bind exceptions, then it had to be coming from line 104 - after all, that's the call to bind(). The problem was, this really needed to be preceded with a call to set SO_REUSEADDR to true so that we didn't have these kinds of issues.

Something like this:

  1. try {
  2. // This is a bit convoluted because (1) SocketChannel.bind
  3. // is only in jdk 7 and (2) Socket.getChannel() seems to
  4. // return null in some cases
  5. channel = SocketChannel.open();
  6. Socket socket = channel.socket();
  7.  
  8. if (localHost != null) {
  9. socket.setReuseAddress(true);
  10. socket.bind( new InetSocketAddress(
  11. InetAddress.getByName(localHost),
  12. localPort) );
  13. }
  14.  
  15. try {
  16. // Do this nonblocking so we can be interrupted
  17. channel.configureBlocking(false);
  18. channel.connect( new InetSocketAddress(
  19. InetAddress.getByName(remoteHost),
  20. remotePort) );
  21. context.getThread().select(channel, this, SelectionKey.OP_CONNECT);
  22. channel.finishConnect();
  23.  
  24. // only try to set blocking back if we succeeded to finish connecting
  25. channel.configureBlocking(true);

Simple change, but it'd make a huge difference to the operation of the socket. So I submitted a request to the JRuby team.

After a little bit, I realized that I needed to try and build this all myself, and see if I couldn't just prove the problem for myself. The difficulty was - How to build JRuby?

Turns out, it's not all that hard. First, fork the code on GitHub so you can issue Pull Requests with the changed code. Then, follow the directions to build everything about JRuby:

  $ git clone git://github.com/drbobbeaty/jruby.git
  $ cd jruby
  $ ant
  $ ant jar-complete
  $ ant dist

It should all build just fine. Then check everything in, and push it up to GitHub. Next, you need to tell rvm to build a new version of ruby from that GitHub repo:

  $ rvm get head
  $ rvm reinstall jruby-head --url git://github.com/

And we're almost done… Update the Gemfile to have the paths to the jruby-jars file where you built it, and then update the .rvmrc. I was able to get things running and even packaging a jar file for our deployment.

Hacking on the language. That's a first for me. Quite a fun experience.

What I noticed was that it wasn't the bind on line 104 - it was the connect on line 113. Java returned the BindException as a general Can't get the endpoint, and I found this out by putting in the catch statements and logging the host and port of the exception. This was very illuminating, and I'm very glad I did it.

I'm going to even send a pull request to see if they'll take it. It's a very useful debugging tool.

Found Problem in JRuby 1.7.0 and Java Executors

Wednesday, November 7th, 2012

JRuby

As part of my working in scaling up the code to a global system, I ran into some very odd problems in jruby-1.7.0. I'm not exactly sure if they are in jruby or in the JDK 1.7.0_05 implementation for CentOS 5, but I'm willing to bet they are in jruby, as I think the JDK has been hammered on a lot on these issues. So here's what I found.

It's all about using Java Executors in jruby-1.7.0. I'm creating a series of threads with the newFixedThreadPool() method, and then running through an array of things to send out to a REST API (Salesforce), and then waiting for it all to finish up.

  require 'java'
  java_include 'java.util.concurrent.Executors'
 
  executors = Executors.new_fixed_thread_pool(5)
  updates.each do |u|
    send_update(u)
  end

What I'm seeing is that when the threads are done processing, they don't all seem to "clear" the executor. For some reason, they aren't seen as "done" by the executor, but when looking at the REST API service, I know they completed. And it's always in the last batch of tasks.

This doesn't ever seem to happen on the Amazon EC2 hardware - only the nice, new, fast boxes in the datacenter.

So what I decided to do was to add a special timeout to the shutdown of the executor (start at line 36). This says that if we know how long any action should take, then if we get to the end of the processing queue in the executor, and we have waited long enough, then it's OK to forcibly shut down the executors and know that ready-of-not, it should have been done.

It's not ideal, and in most good cases, it shouldn't happen. But I'm getting a lot of problems with Salesforce and CouchDB as a part of this scaling, and I really have no idea what's going on inside either of those systems. Better to add this and be safe.

Moving a Class to Use Static Methods

Tuesday, November 6th, 2012

Building Great Code

This afternoon I finally got around to fixing up one of the problems I noticed this morning in my battles with production, but didn't want to attack at the time - and that was converting the class I was having problems with to use class methods (static methods), more in keeping with the other, similar, classes in the codebase. It's a reassigned, and in that mode it's really quite stateless, so there's no reason to "create one", and then let it work on the data. We can simply just have something that works on the data. But the original author didn't see it that way, and passed in things that the other, similar, functional blocks had internalized.

This made it a little odd in the sense that one part of the system did it as a "factory", and the other as an "instance", but that's easy enough to change - now that I have the time. I simply went in and cleaned everything up, correcting all the references and making sure it all worked as it should.

It wasn't too hard, and in the end it was a lot more like the other code. In fact, I wouldn't be surprised if we could abstract a lot of both of these into a superclass that has all the common stuff, and the only point of these individual worker classes is the specifics they deal with.

But that's for another day...

Another Busy Morning Fixing Production Problems

Tuesday, November 6th, 2012

bug.gif

When I got in this morning I saw that once again, I was going to be fighting production again, but this time, it's really from the additional processes that we're running - not the main pipeline process that we've been working on for several months.

No… this was somewhat new stuff, and in that I'm not terribly upset, but it is annoying when you see code like this:

  hot_lead.downcase[0..1]

because I know there's going to come a time that hot_lead is nil, and that'll blow up. It's so easy to say:

  hot_lead.downcase[0..1] unless hot_lead.nil?

or something like that. So easy… but the Code Monkeys don't think it's necessary.

Sigh…

In addition to some code problems there were legitimate timeouts with Salesforce, and those had to really be felt with. Again, far too optimistic in this new code, assuming everything would work, and in the real world, we know that's just not the case. So I quickly pulled in the timeout retry handling code from another class and converted it to work with the code having the timeouts. This meant converting the pulled-in methods from class methods to instance methods, but I'll get back to this later in the day and clean this up.

With the timeouts under control, I needed to fix a few issues with the summary script that parses the log files and summarizes them nicely. I had a few issues there, but they weren't as big as the problems I'd felt with up to that point in time. Annoying, yes, but not horribly so.

In the end, I got production run, but it was a touch frantic at times to make sure it got done as quickly as possible.

Finally Figured Out New Relic and JRuby

Monday, November 5th, 2012

JRuby

This afternoon, I decided to dig into the New Relic Ruby API and see if I couldn't figure out the problem that's been bugging me for a while. Basically, it's all about putting New Relic in a jruby-1.7.0 jar and getting the instrumentation to load.

I knew that New Relic had their code on GitHub, so I went and forked the code in hopes of fixing the problem.

I started digging in the first place I've come to suspect in jruby-1.7.0 - Dir.glob.

Sure enough, they had code that was using this in much the same way that the Configulations gem was: it was scanning directories, in this case, their own code, and loading the scripts it found. These scripts, it turns out, are their "instrumentation" directory.

Makes perfect sense.

The call to Dir.glob returned the empty array, so there was no error. But there was also nothing loaded. Simple. I changed the code from:

  def load_instrumentation_files pattern
    Dir.glob(pattern) do |file|
      begin
        log.debug "Processing instrumentation file '#{file}'"
        require file.to_s

to:

  def load_instrumentation_files pattern
    Dir.glob(pattern.gsub(/^jar:/, '')) do |file|
      begin
        log.debug "Processing instrumentation file '#{file}'"
        require file.to_s

and, as expected, everything started to work just fine!

I sent in a pull request to the New Relic folks so that they could fix their code, and I also sent them an email, as I was back talking to their support guys.

Got a nice note back from one of the support guys thanking me for the fix. I know jruby will get it fixed eventually, but this is really an issue that needs fixing sooner than later.

Maybe I'll fork that and fix it? 🙂

Just a Housekeeping Morning

Monday, November 5th, 2012

Code Clean Up

This morning has been a leisurely morning of cleaning up a few nagging issues, making sure things are in place, and organizing things for when I have the next big push. I'm expecting machines in the new datacenter any day now, and there will be work there to get things tested and moved over. Then we're also expecting more data from another group that will allow us to boost the relative value of merchants based on their previous history with us, and that's going to be quite a bit of work. Still, there's the new demand system that needs to be started and built up from scratch.

There's a lot to do, but this morning it was all with someone else, so I spent a little time getting things cleaned up and in order so that when one of these hits, I don't have to scramble to clean things up and get the new work done.

Testing NewRelic with JRuby

Sunday, November 4th, 2012

I've had a few questions about what's happening at The Shop with New Relic and JRuby 1.7.0, so today I decided to take a little time and really prove to myself what's up and what's broken with the two. My initial tests were with jruby-1.7.0, but then I backed off and started from what I remember worked - jruby-1.6.7.2. Because I needed to see this interact with New Relic, I thought that it'd be easy to just sign up for a free New Relic account.

This is really kinda neat. I have liked what New Relic can do, and it seems there's a pretty decent free system going on there, and it makes it easy to try things out. So I signed up, got things going, and started getting some results:

So it looks like the jruby-1.6.7.2 both as a script and as a self-executable jar work, and then on the same graph, I was able to get jruby-1.7.0 as a script. So far, so good.

The last test was to run this under jruby-1.7.0 as a self-executable jar:

With these tests out of the way, I know that I can run New Relic with jruby-1.7.0 -- I just need to make sure it's getting started and that the Net::http instrumentation is being loaded at the right time. I have high hopes for the runs that will kick off in a few hours!

[11/5] UPDATE: OK… bad news. Yes, for my simple tests, New Relic worked fine from a jruby-1.7.0 jar. However, if there is a more complex project - with dozens of classes, then the instrumentation isn't loaded, and we have a problem. I've sent another request into New Relic, and we'll have to see what they say. If I could manually load the instrumentation, I'd be OK, but I don't know of a way to do that.

[11/5 3:00pm] UPDATE: Found the problem! It's the old JRuby Dir.glob issue and I fixed the New Relic Ruby API and sent them a pull request with the fix. I also sent them an email, so I'm hoping they take it to heart, but we have the fix we need, and that's the main issue.

Finally Solved a Nagging NewRelic Problem

Friday, November 2nd, 2012

JRuby

Ever since we've moved to JRuby 1.7.0, we've had a problem with our New Relic stats, and it's really been quite annoying. New Relic makes all these nice graphs of the instrumentation you place in your apps, and then makes it dead simple easy to show that and drill down to any performance issues you may be having.

In the last few weeks we've been battling quite a few Salesforce performance issues, and without the New Relic graphs, it was essentially impossible to know the details of the calls we were making to Salesforce. This was nasty. I knew it was all about JRuby 1.7.0, but why? Thankfully, the New Relic support guys gave me a good idea.

Basically, it boiled down to my app wasn't loading the net/http instrumentation at the right time. It needed to be loaded very early in the cycle, but it was supposed to be automatically loaded, and it clearly wasn't. I had made DEBUG loggings - nothing seemed to help. Then I looked at the code again:

  require 'ext/rpm/lib/new_relic/agent/error_collector'
  require 'net/http'
  require 'newrelic_rpm'

and I started to wonder: What if there's a problem with that agent? So I changed the code to read:

  require 'net/http'
  require 'ext/rpm/lib/new_relic/agent/error_collector'
  require 'newrelic_rpm'

BINGO! Works!

I'm so glad! This has been a real annoyance, and now I have a work-around. Don't really care if it's something I have to manually do, I'm just glad it's working.

Fixing More Code Monkey Bugs

Friday, November 2nd, 2012

bug.gif

This morning I came in and saw that we were having serious Production problems - and they were in the code, and it was effecting UAT as well. Now I'm not a prude… I get that code goes into Production with bugs - heck, I've done it, and I'll likely do it again in the next month. But I'm here to fix it, and to the best of my abilities, I test the code before I go home with a quick run. But last night was ugly.

I guess I wasn't really surprised that I had problems this morning… I was having problems last night getting a test to run. The Code Monkeys on the team have been fiddling with a lot of things - specifically the Salesforce endpoints, and while I think that's great, it's a little frustrating when there are folks trying to get things done with Salesforce, and others trying to refactor and hope it's not breaking anything.

But that was last night, and I thought I had it all behind me. But evidently not. In the refactoring, they have employed the Hashie gem, and while it's nice, it's also a little tricky. They wanted to base their conversion of our "container based objects" from a Hash to something based on the Hashie::Trash which is a transforming Hash. However, that, in turn, it turns out, is based on the Hashie::Dash - or defined Hash.

This means that if you haven't defined the possible entries for the Dash, then you can't add it. That's nice, to keep "junk" out, but if you miss one in the code, then all of a sudden, working code becomes very uncorking code all because of this restriction. I think they thought they planned for that, but they left off a critical component. A very necessary subclassed method:

  def assert_property_exists!(property)
    unless self.class.property?(property)
      Thrash.property(property.to_s)
    end
  end

This code simply makes sure that if the new property isn't there, it's created. This isn't what the Hashie::Dash is all about, but it's what we needed because we're constantly adding attributes to the Merchant, and we want to avail ourselves of the mapping in the Hashie::Trash, but not the restrictions of the Hashie::Dash.

What's annoying about all this is that I'm the one finding this out.

I'm not the one championing the Hashie gem. I'm not the one working it into the code. I'm not the one writing tests for any of this. But I'm the one that picks up the pieces and fixes it and re-runs production and UAT to make sure we have something for the day.

It's hard being this guy when I'm working with a group of Code Monkeys that are as careless as they are. If I'm finding my own bugs, then that's on me. My fault - totally. But when I'm finding things that I would have checked - and they should have checked, and it's causing the project to look bad, then it's annoying. Too much of this, and I'm really getting frosted.

Unfortunately, I'm not their manager. I'm not their boss. So I can do nothing about it. Not a single, bloody, thing.