Changing the Topology to Get Stability (cont.)
I think I've finally cracked the nut of the stability issues with the experiment analysis topology (for as long as it's going to last) - it's the work it's doing. And this isn't a simple topology - so the work it's doing is not at all obvious. But it's all there. In short - I think it's all about Garbage Collection and what we are doing differently in the Thumbtack version than in the Third-Rock version.
For example, we had a change to the data schema in redis, so we had the following in the code to make sure that we didn't read any bad data:
(defn get-trips
  "The experiment name and variant name visited for a specific browserID
  are held in redis in the following manner:

    finch|<browserID> => <expr-name>|<browser>|<t-src>|<variant>|<country> => 0

  and this method will return a sequence of the:

    <expr-name>|<browser>|<t-src>|<variant>|<country>

  tuples for a given browserID. This is just a convenience function to look
  at all the keys in the finch|<browserID> hash, and keep only the ones with
  five values in them. Pretty simple."
  [browserID]
  (if browserID
    (let [all (fhkeys (str *master* "|" browserID))]
      (filter #(= 4 (count (filter #{\|} %))) all))))
where we needed to filter out the bad tuples. This is no longer necessary, so we can save a lot of time - and GC - by simply using:
(defn get-trips
  "The experiment name and variant name visited for a specific browserID
  are held in redis in the following manner:

    finch|<browserID> => <expr-name>|<browser>|<t-src>|<variant>|<country> => 0

  and this method will return a sequence of the:

    <expr-name>|<browser>|<t-src>|<variant>|<country>

  tuples for a given browserID. This is just a convenience function to look
  at all the keys in the finch|<browserID> hash. Pretty simple."
  [browserID]
  (if browserID
    (fhkeys (str *master* "|" browserID))))
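For what it's worth, a call looks something like this - the browserID and the tuples in the result here are made up purely to show the shape of the data:

(comment
  ;; hypothetical call - the browserID and the returned tuples are
  ;; invented for illustration, but follow the schema described above
  (get-trips "a1b2c3d4")
  ;; => ("search-ranker|chrome|organic|control|US"
  ;;     "new-checkout|safari|email|variant-b|CA")
  )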
At the same time, it's clear that in the code I was including far too much data in the analysis. For example, I'm looking at a decay function that weights the most recent experiment experience the most, the next most recent a little less, and so on. This linear weighting is pretty simple and pretty fast to compute. It basically says to weight each event differently - with a simple linear decay.
I wish I had better tools for drawing this, but I don't.
This was just a little too computationally intensive, and it made sense to hard-code these values for different values of n. As I looked at the data, when n was 20, the most important event was about 10%, and the least was about 0.5%. That's small. So I decided to only consider the first 20 events - any more than that and we are really spreading out the effect too much.
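Just to make the weighting concrete, here's a little sketch of what I mean - this isn't lifted from the topology, and the function names are mine, but it's the same simple linear decay:

(defn decay-weights
  "Return n linearly decaying weights, most recent first, that sum to 1.
  The most recent event gets n parts, the next (n-1), down to 1 for the
  oldest, all normalized by 1+2+...+n."
  [n]
  (let [total (/ (* n (inc n)) 2)]
    (map #(/ (double %) total) (range n 0 -1))))

;; for n = 20 the biggest weight is 20/210 (about 10%) and the smallest
;; is 1/210 (about 0.5%) - the numbers mentioned above
(defn weighted-effect
  "Apply the linear decay to the numeric scores of the most recent 20
  events (most recent first) and sum them up."
  [scores]
  (reduce + (map * (decay-weights 20) (take 20 scores))))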
Then there were the redis issues...
Redis is the bread-n-butter of these storm topologies. Without it, we would have a lot more work in the topologies to maintain state - and then get it back out. Not fun. But it seems that redis has been having some issues this morning, and I needed to shut down all the topologies, and then restart all the redis servers (all 48 of them).
I'm hoping that the restart settles things down - from the initial observations, it's looking a lot better, and that's a really good sign.