Speeding up Experiment Processors
This morning has been a big push on speeding up the experiment reporting engines. The first part was speeding up the SQL they use, and the second was smartly caching the results of certain values that are calculated from the stability of the time-series data. These calculations are very expensive - even with the speed-ups in the SQL pulling the data from the database, we needed to make the workers far faster at what they do.
The first thing was to realize that these stability metrics are really slow-moving values, and we only need to recompute them every 6 hours or so. Any more often than that is a waste of time because they barely move in an hour, and caching on that schedule saves us another factor of 100 or more.
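To make the idea concrete, here is a minimal sketch of that kind of time-based cache - the function names, the per-experiment keying, and the placeholder computation are my own illustration, not the actual code:

```python
import time

TTL_SECONDS = 6 * 60 * 60  # recompute at most every 6 hours

_cache = {}  # experiment_id -> (computed_at, value)

def compute_stability(experiment_id):
    """Stand-in for the real, expensive time-series stability calculation."""
    return 0.0  # placeholder value

def stability_metric(experiment_id):
    """Return the cached stability metric, recomputing only once it goes stale."""
    now = time.time()
    hit = _cache.get(experiment_id)
    if hit and now - hit[0] < TTL_SECONDS:
        return hit[1]  # still fresh - skip the expensive calculation
    value = compute_stability(experiment_id)
    _cache[experiment_id] = (now, value)
    return value
```

The win comes from the mismatch between how often reports are generated and how fast the metric actually changes: every call inside the 6-hour window is a cheap dictionary lookup instead of a full recalculation.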
Then we needed to look at the monitoring on the queues that feed these workers to make sure good data keeps flowing into them - we base a lot of our monitoring metrics on these graphs, and an empty graph is just bad news.
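A rough sketch of the kind of check that implies - the queue names and thresholds here are made up for illustration, and in practice the depths would come from the real queue broker:

```python
def check_queue_depth(name, depth, min_depth=1, max_depth=10_000):
    """Flag a queue that is empty (nothing feeding the workers) or backed up."""
    if depth < min_depth:
        return f"ALERT: queue '{name}' is empty - no data feeding the workers"
    if depth > max_depth:
        return f"ALERT: queue '{name}' is backed up at {depth} items"
    return f"ok: queue '{name}' at {depth} items"

# Hypothetical example run with hard-coded depths:
for name, depth in {"stability-metrics": 0, "report-rollups": 412}.items():
    print(check_queue_depth(name, depth))
```

Alerting on an empty queue as well as a full one is the point: a stalled feeder looks healthy on a throughput graph right up until the graph itself goes blank.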
In the end, we have something that is exactly what we need - fast in the database, fast on the workers, and very responsive to the demands we are placing on it. We didn't need this kind of performance in the initial deployment, but it's nice to know it took me no more than a day or two to get it once we did.