More Topology Balancing

I've spent most of the day trying to get the topology working under load from the batch email send process. In truth, I may never get it perfectly balanced, because it's a batch process and not a real-time, on-demand kind of thing. I'm trying to count shotgun pellets after the gun goes off. It's kinda tough.

But still I try. And I'm learning a lot about the way this topology and Storm are responding to the load. For instance, if you don't want to buffer tuples in the system - and for the most part, I don't - then use the :local-or-shuffle grouping, and make sure that your data flow is balanced across all the bolts before that step. With :local-or-shuffle, Storm prefers a task of the next bolt in the same worker process when one exists, so it can hand a tuple straight to the next bolt without a hop to another worker, and that saves a lot of lag in the throughput.

What I've been playing with lately is significantly increasing the parallelism hints on the decorator, encoder, and transmitter bolts to see if this will make a difference, or if it's just going to shorten the time we're at capacity by moving more messages through the system - while still always being at capacity.

So I've had good luck, actually, and this is the message rate for a bulk email send (purple) and the corresponding output decorator (golden) and output messages (cyan):

[Figure: Message Rate for 500 PH]

There's a lot to like about this graph over the older ones - first, the decorate and xmit are virtually identical - i.e. no buffering. Excellent. Also, the drop-off on the output is nearly as good as the ramp-up, so that means that we're really doing a pretty decent job of moving the data. I'm not unhappy with this graph at all. But the capacity graph is a different story:

[Figure: capacity graph, captioned "Message Rate for 500 PH" in the original]

Here we see that we peaked after the email send block was done, and that's a bit odd. On the plus side, the encode and emit bolts also rose nicely, which says that the decoding is starting to share the load more, and that's a good thing.

My concern is still the capacity number. I suppose I'll run a few more tests with higher numbers still and see if that makes any difference, but I have a feeling it's not going to make any change to the height of the capacity surge - but it will likely lessen the duration.
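
As a back-of-the-envelope sketch of that hunch, the snippet below assumes the Storm UI definition of capacity (tuples executed times average execute latency, divided by the measurement window) and a burst that arrives faster than the bolts can drain it. The backlog size, latency, and parallelism values are made-up numbers for illustration, not measurements from this topology, and the window is taken to be just the time spent draining.

```python
import math


def capacity(executed: int, avg_execute_latency_ms: float, window_ms: float) -> float:
    """Fraction of the window an executor spent inside execute()."""
    return executed * avg_execute_latency_ms / window_ms


def drain_time_ms(backlog: int, avg_execute_latency_ms: float, parallelism: int) -> float:
    """Time to work through a backlog split evenly across the executors."""
    per_executor = math.ceil(backlog / parallelism)
    return per_executor * avg_execute_latency_ms


if __name__ == "__main__":
    BACKLOG = 1_200_000   # hypothetical tuples in one bulk send
    LATENCY_MS = 2.0      # hypothetical average execute latency per tuple

    for parallelism in (20, 40, 80):
        t = drain_time_ms(BACKLOG, LATENCY_MS, parallelism)
        executed = math.ceil(BACKLOG / parallelism)
        # Measured over the drain period, each executor is busy the whole time.
        c = capacity(executed, LATENCY_MS, t)
        print(f"parallelism={parallelism:3d}  drain={t / 1000:6.1f}s  capacity={c:.2f}")
```

Under these assumptions, doubling the parallelism halves the time spent saturated, but capacity during the surge stays pinned at the same level, which matches the guess above. A capacity figure averaged over a window longer than the surge would report a lower peak, so the real graphs may not follow this exactly.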