Performance of Larger Storm Topologies

Storm Logo

I was looking at the loading on one of my topologies today, and noticing that it wasn't catching up to a backlog from a restart as fast as I'd hoped. Since I had the capacity, it made sense to expand the topology and give it more workers, and more JSON decoder bolts, and more data filtering bolts. So I doubled the workers to 20, added many more bolts, and restarted it.

And the result was that it was slower. The capacity numbers went way up, and the backlog continued. This made no sense, but then it did. There's a balance to all things in a topology-based system like Storm. You can't increase one thing and expect it not to impact the system in other places.

So I took the changes out, and the speed returned. You can't tweak topologies without getting the performance data, and you can't measure it without upsetting the running of the topology. It's not an easy system to use, but it can be made to be quite useful. You just have to be careful.