Overhead of Storm Communication
Tuning Storm topologies isn't easy. Sure... if you have a topology that just does simple calculations, like my Disruption Detector - then it's easy. Almost deceptively simple. But if you have a topology whose bolts actually have to do something, then it gets a lot more complex. And the more they do, the more complex it becomes.
And one of the insidious things about Storm is the communication buffers. These are using memory not in the heap, so they have the ability to crush a box even if you are careful with the -Xmx4G settings to make sure you don't overload the box. No... these are tricky little beasts, and the conflicts raised by improper use of Kafka is really challenging.
The tradeoff is really one of connectivity versus balance. If the Kafka data is well-balanced across all boxes and all partitions, then you can simply default to using the :local-or-shuffle connectivity directive in your Storm topology, and you should be in good shape. This scheme says: If I have a bolt in the same worker, use it. If not, then look to others and balance accordingly.
And it's that stickiness to the local worker that's the kicker.
If we have an unbalanced kafka cluster, then some of the readers will have more data to process than others, but with this affinity for the local bolts, the excess data will be queued up on the local bolts in the worker, and not get spread out in the cluster. This makes for an uneven distribution of work, and that's very bad.
But if we use the :shuffle directive then every message will be looking to balance out the load by checking all the similar bolts in the topology and messages will be moved from worker to worker, box to box, without regard for really the benefit of that movement. It's not like you can put in a Cost Function for the movement of n bytes - Oh... that would be sweet.
So what happens is that you end up having multi-GB communication buffers on all boxes, and that can crush a system. So you have to be careful - you can't remove all the :shuffle directives - especially if you have an unknown - and potentially unbalanced - kafka source. But you can't have too many of them, either.
Finding the right topology when you're doing real work, is not trivial.