This is Why I’m Not a Java Fan


Normally, I like working in Clojure - I really do. I don't mind that it ultimately runs on the JVM, because the authors have gone to great lengths to make Clojure a better experience than Java - and it is. But every now and then a Java-ism pops up, and it's nearly impossible to figure out what's going on. Case in point - one of my Storm workers today.

I started getting some odd Nagios alerts on one of my boxes, and I didn't think much of it at first. We've added a lot of traffic, and when we're in one of those pulses, a box is going to get saturated - there's no way around that short of buying far more hardware than we need, just so we can ride out the peak traffic without saturating a single box.

Sure... we could do that, but it's a lot of money to spend on something that should just be a normal, self-regulating part of the cluster's operation.

Then I looked at the graphs and saw this for the memory usage on one box:

[Graph: memory usage on the affected box]

Clearly, something happened around 12:00 UTC that started the memory climbing, but I have no idea what it was. There are no exceptions, and the Storm logs are clean, so I have very little visibility into the process. Without knowing what caused it, it's hard to know how to prevent it from happening again.

Even worse, this isn't a frequently repeating problem. In fact, it's the first time I've seen it in all the time I've been working with Storm. It could have been the communication buffers alone, or the JVM heap in combination with them, but at roughly 100GB of usage, at least half of it had to be communication buffers: the JVMs are configured to max out at 48GB, so everything above that had to be sitting in the buffers.
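
That's still a guess, though. One thing I can add going forward is a small probe inside the worker that separates heap from off-heap usage, so the next time this happens the numbers say which half is growing. A rough sketch of the idea - the standard JMX beans, with the direct-buffer pool needing Java 7+, and the names here being mine, not anything Storm provides:

```clojure
(ns worker.mem-probe
  (:import [java.lang.management ManagementFactory BufferPoolMXBean]))

(defn memory-snapshot
  "Returns heap, non-heap, and NIO direct-buffer usage in bytes.
   Direct buffers are where off-heap messaging memory would show up
   (exposed via BufferPoolMXBean on Java 7+)."
  []
  (let [mem    (ManagementFactory/getMemoryMXBean)
        pools  (ManagementFactory/getPlatformMXBeans BufferPoolMXBean)
        direct (first (filter #(= "direct" (.getName %)) pools))]
    {:heap-used     (.getUsed (.getHeapMemoryUsage mem))
     :non-heap-used (.getUsed (.getNonHeapMemoryUsage mem))
     :direct-used   (when direct (.getMemoryUsed direct))}))

;; e.g. kick off a background loop from a bolt's prepare:
;; (future (loop [] (println (memory-snapshot)) (Thread/sleep 60000) (recur)))
```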

So I'm not really thrilled about this, but then again, I've seen a lot of network issues in the last 24 hours, and if one box couldn't send to another, maybe its outbound buffers got backed up and that's what caused the problem. Hard to say, but it's not a fun thought.
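
If it really was a backed-up transfer to an unreachable peer, the usual lever is to bound how much can pile up in the first place. A rough sketch of what that might look like - the setting names come from Storm's stock Config, but the values are purely illustrative, not what we actually run:

```clojure
(ns example.topology-conf
  (:import [backtype.storm Config]))

(defn capped-conf
  "Illustrative only: bound in-flight tuples and queue sizes so a worker
   can't back up indefinitely when a downstream box stops responding."
  []
  (doto (Config.)
    ;; cap the worker JVM heap (our clusters run with -Xmx48g)
    (.put Config/TOPOLOGY_WORKER_CHILDOPTS "-Xmx48g")
    ;; bound un-acked tuples per spout task so slow consumers apply back-pressure
    (.setMaxSpoutPending 5000)
    ;; shrink the worker transfer / receiver queues (sizes are in messages)
    (.put Config/TOPOLOGY_TRANSFER_BUFFER_SIZE 32)
    (.put Config/TOPOLOGY_RECEIVER_BUFFER_SIZE 8)))
```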

I'll just have to monitor this and do my best to prevent this problem from hitting production.