Re-Tuning Experiment Topology
Anytime you add a significant workload to a Storm cluster, you really need to re-balance it. This means looking at the work each bolt does, making sure the bolts are properly balanced at each phase of the processing, and then making sure there are enough workers to handle the cumulative throughput. It's not a trivial job: it takes a lot of experimentation, then looking for patterns and zeroing in on the solution.
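One way to get a first guess for that balance is a back-of-the-envelope sizing pass. Below is a minimal sketch that assumes each executor of a bolt is busy for its execute latency on every tuple, so the executors a phase needs is roughly its tuple rate times its latency, divided by a headroom target. The phase names, rates, latencies, the 0.7 headroom target and the 8 executors per worker are all hypothetical numbers for illustration, not measurements from this cluster.

```python
import math


def executors_needed(tuples_per_sec: float, execute_latency_ms: float,
                     target_capacity: float = 0.7) -> int:
    """Hypothetical sizing rule: busy fraction = rate * latency (in seconds),
    spread over enough executors that each stays under target_capacity."""
    busy = tuples_per_sec * execute_latency_ms / 1000.0
    return max(1, math.ceil(busy / target_capacity))


# Hypothetical per-phase numbers: (tuples/sec, execute latency in ms).
phases = {
    "parse": (2000.0, 0.5),
    "enrich": (2000.0, 4.0),
    "store": (2000.0, 12.0),
}

# Executors per bolt, balanced so each phase can keep up with the one before it.
plan = {name: executors_needed(rate, lat) for name, (rate, lat) in phases.items()}

# Assumption: how many executors one worker JVM can comfortably host.
EXECUTORS_PER_WORKER = 8
total_executors = sum(plan.values())
workers = math.ceil(total_executors / EXECUTORS_PER_WORKER)

if __name__ == "__main__":
    print(plan)      # {'parse': 2, 'enrich': 12, 'store': 35}
    print(workers)   # 7
```

This only gives a starting point; the experimentation described here is what corrects it, since real latencies shift once the bolts share workers and machines.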
That's what I've been doing for several hours, and I'm nowhere near done. It's getting closer, and I think I have the problem isolated, and it's very odd. Basically, a few bolt instances (say 4 out of 160) are above 1.0, while the rest are at least a factor of ten lower. This is my problem: something is causing these few bolts to take too long, and that skews the metric for all the instances of that bolt.
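If the 1.0 figure here is Storm's per-executor "capacity" (an assumption; the metric isn't named above), it is computed in the Storm UI as executed count times execute latency in milliseconds, divided by the window length in milliseconds, so a value near or above 1.0 means the executor is busy the whole window. The sketch below flags executors whose capacity is more than ten times the median, mirroring the "factor of ten" gap described. The executor ids, counts and latencies are hypothetical data, and the median-times-factor rule is just one simple outlier test.

```python
from statistics import median

WINDOW_SECS = 600  # assumption: the Storm UI's 10-minute stats window


def capacity(executed: int, execute_latency_ms: float,
             window_secs: int = WINDOW_SECS) -> float:
    """Fraction of the window this executor spent executing tuples."""
    return executed * execute_latency_ms / (window_secs * 1000.0)


def find_outliers(caps: dict, factor: float = 10.0) -> list:
    """Executor ids whose capacity exceeds `factor` times the median."""
    m = median(caps.values())
    return sorted(k for k, c in caps.items() if c > factor * m)


# Hypothetical data: 160 executors of one bolt, 4 of them slow.
slow_ids = {17, 42, 99, 130}
caps = {
    i: capacity(48000, 20.0 if i in slow_ids else 1.0)
    for i in range(160)
}

if __name__ == "__main__":
    print(round(median(caps.values()), 2))  # 0.08
    print(round(max(caps.values()), 2))     # 1.6
    print(find_outliers(caps))              # [17, 42, 99, 130]
```

Comparing the median with the maximum makes the skew visible: the typical executor is mostly idle, while a handful sit above 1.0, so the next step is finding what those particular executors have in common (host, worker, or the keys routed to them).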