Archive for the ‘Cube Life’ Category

Cleaning Up Parsing Errors

Friday, October 24th, 2014

Unified Click

From time to time, I've been asked by other groups to add features to my data stream, and one of these groups at The Shop deals with affiliates directing traffic - and hopefully sales - to the site. When a sale results, the affiliate gets paid, which is good for everyone. One of the important parts of this is accurately tracking who came from where, and to that end, I've had to implement some rather complex if-then-else logic in the code to match the existing attribution code.

Well... it seems I probably made a mistake in being a little too inclusive with some of the data. I was including the full URL from the data I'm being sent, when I really didn't want to include the domain in that URL - just the parameters.

When this was pointed out, I realized that the instructions I'd received from the group about this feature were relatively vague, and after I really dug into what they were asking and compared it to the code, it was clear that I wanted the parameters, but not the domain, from the URL.

Not at all clear, but hey... I can admit that I didn't get it right - easy fix.
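
The change itself is tiny. As a sketch of the idea - with made-up names, not the production code - it amounts to keeping just the query string of the referring URL and throwing away the scheme and domain:

  (ns example.attribution
    (:import [java.net URI]))

  (defn query-params-only
    "Return just the query string of a URL - the parameters, not the domain.
    Returns nil for a nil URL or one with no query string; a malformed URL
    will still throw a URISyntaxException, so handle that upstream if needed."
    [url]
    (some-> url URI. .getQuery))

  ;; (query-params-only "http://www.example.com/landing?aff_id=42&utm_source=x")
  ;; => "aff_id=42&utm_source=x"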

Moving Sharded Redis Servers

Friday, October 24th, 2014

Redis Database

This morning I found that I really needed to move some of my sharded redis servers. Due to the needed bandwidth, I've got eight redis servers on one 196GB RAM box, but this morning I saw that the total RAM in use was over 170GB, and it was causing issues when redis tried to fork and save the data.

This is what I love about redis and this sharding - I can simply shut things down, move the redis dump.rdb files, fire up the servers on the new machines, and everything will load back up and be ready to go. A simple change in the sharding logic to point to the new machine for those moved servers, and everything is back up and running. Very nice.
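
Just to show the scale of that change - with hypothetical names and hosts, not the real config - the sharding logic is little more than a fixed vector of host:port entries and a hash to pick one, so moving a server is just editing its entry:

  ;; eight redis shards - the first four were just moved to a new box. The
  ;; vector keeps its length and order, so no keys get remapped by the move.
  (def redis-shards
    ["redis-02:6379" "redis-02:6380" "redis-02:6381" "redis-02:6382"
     "redis-01:6383" "redis-01:6384" "redis-01:6385" "redis-01:6386"])

  (defn shard-for
    "Pick the redis host:port responsible for a given key."
    [k]
    (nth redis-shards (mod (hash k) (count redis-shards))))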

Yes, it's very manual as well, but sometimes, the costs of that manual configuration are really worth it. Today it paid off. Big.

How long is too long?

Thursday, October 23rd, 2014

cubeLifeView.gif

I'm in a group that I've tried to get moving for the last year. And when I say moving, I mean producing. Shipping. Delivering. I don't claim to know how hard it is to make really nice UI/UX systems, but it seems to me that it shouldn't take as long as it's been taking the current group I'm in. And I'm not alone in this thought.

I've talked to the two previous managers, and their managers, and their manager - and I'm wondering if I'm just being dense, or if I'm now being impatient? If I step back a bit, it's clear from those around me that the good talent from The Shop is leaving. Lots and lots of good talent is just walking out the door to join smaller companies, bigger companies, different markets - they are really just all over the place. But it's clear - there's a Brain Drain going on.

But maybe it's the natural progression of things. Maybe they really valued something about the earlier days in The Shop, and they see its passing as an irreversible trend. If that's the case, then I'm being dense, because all the evidence in front of me is saying that the quality of my work life, and therefore my life life, is going downhill - fast.

Yet I have to wonder... am I just looking for an excuse? Should I give this third manager more time? Maybe he really will turn it around like he says? Then again, what makes him special where the other two have tried and failed? Do I really think this guy is that much different? Does he have any more power or influence?

If not, then even if I give him the benefit of the doubt, and assume everything he says is 100% honest and he does really want to change things, what makes me think he's actually going to be able to do it?

I've talked to him about it, and said I've tried for a year, and come to realize it's just not changing, and I want to be on a team that Ships Stuff. That's what I enjoy doing. I'm just not sure I have the energy anymore to put up with this - and the changes he wants to make. He'll turn the team into a Hadoop/Spark query-builder team... I'm just not into that.

We'll have to see how things play out in the coming days.

Data Detective – All Day Long

Thursday, October 23rd, 2014

Detective.jpg

Today has been long... Very long... I've done precious little other than track down data issues for the back-end system I've built. It's not exciting because these are data errors, not logic errors, and to make things even worse, it's not a crashing bug - it's a problem that leads to experimental results that look very wrong, because the data is likely misrepresented rather than mis-handled.

I know it's all part of the job, and today was as good as any for doing this - because I didn't have a lot to do otherwise, but it's hard, nasty work if you aren't set up for the necessary data links, and in today's case - I wasn't. So I did a lot of visual inspection and narrowing down of data sets, but in the end, I got the results I needed.

I think the logic we're using for extracting the referring source of the traffic is not right. But how to fix it without breaking the other mappings is something I'm going to leave to the original author.

More CSS Fiddling with Textual 5

Thursday, October 23rd, 2014

Textual 5

I have to say that one of the coolest things about Textual 5 was already in Textual 4.1.8, but I didn't really avail myself of it a lot - and that's the CSS styles for the UI. I've been fine-tuning the CSS for the effects in the window - like the messages from the server, and the messages about why I might have disconnected, and then the line separating the old from the new messages... it's all looking so much better, but it's all just simple pixel moves.

I really love this app. Wonderful support on IRC - no less!

Struggling with Storm and Garbage Collection

Tuesday, October 21st, 2014

Finch Experiments

OK, this has to be one of the hardest topologies to balance that I've ever had. Yeah, it's only been a few days, but Holy Cow! this is a nasty one. The problem was that it was seemingly impossible to find a smoking gun for the jumps in the Capacity of the bolts in the topology. Nothing in the logs. Nothing to be found on any of the boxes.

It has been a pain for several days, and I was really starting to get frustrated with this guy. And then I started to think more about the data I had already collected, and where that data - measured over and over again - was really leading me. I have come to the conclusion that it's all about Garbage Collection.

The final tip-off has been the emitted counts from the bolts during these problems:

Bolt Counts

where the corresponding capacity graph looks like:

Bolt Counts

The tip-off was that there was a drop in the emitted tuples, then a quick spike up, and then a return to the pre-incident levels. This tells me that something caused the flow of tuples to nearly stop, and then the system caught back up again - the total over that interval was the same as if the average flow had continued, so nothing was lost, just delayed.

What led me to this discovery is that all the spikes in the Capacity graph were 10 min long. Always. That was too regular to be an event in itself, and as I dug into the code for Storm, it was clear it was using a 10 min average for the Capacity calculation, and that explains the duration - it took 10 mins for the event to be wiped from the memory of the calculation, and for things to return to normal.
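
For reference, my reading of that Capacity number - this is how I understand Storm's UI, not a quote from the docs - is that it's roughly the fraction of the last 10 minutes the bolt spent executing tuples:

  ;; capacity ~ (tuples executed * average execute latency) / measurement window
  ;; A brief stall followed by a catch-up spike stays in this 10 min average
  ;; until it ages out - which is why every incident showed up as a bump that
  ;; lasted exactly 10 minutes.
  (defn capacity
    [executed avg-execute-latency-ms window-ms]
    (/ (* executed avg-execute-latency-ms) (double window-ms)))

  ;; e.g. 150,000 tuples at 2 ms each over a 10 min (600,000 ms) window:
  ;; (capacity 150000 2.0 600000) => 0.5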

Given that, I wasn't looking for a long-term situation - I was looking for an event, and with that, I was able to start looking at other data sources for something that would be an impulse event that would have a 10 min duration effect on the capacity.

While I'm not 100% positive - yet - I am pretty sure that this is the culprit, so I've taken steps to spread out the load of the bolts in the topology to give the overall topology more memory, and less work per worker. This should have a two-fold effect on the Garbage Collection, and I'm hoping it'll stay under control.

Only time will tell...

UPDATE: HA! It's not the Garbage Collection - it's the redis box! It appears that the redis servers have hit a limit on forking and writing to disk, and even the redis start-up log says to add the system setting:

  vm.overcommit_memory = 1

to /etc/sysctl.conf and then reboot, or run the command:

  sysctl vm.overcommit_memory=1

for the change to take effect immediately. Without overcommit, the fork for the background save can fail because the kernel won't promise memory equal to the full size of the parent process, even though copy-on-write means the child rarely touches most of it. I did both on all the redis boxes, and I'm thinking this was the problem after all.

Best results I could have hoped for:

Bolt Counts

Everything is looking much better!

Changing the Topology to Get Stability (cont.)

Monday, October 20th, 2014

Storm Logo

I think I've finally cracked the nut of the stability issues with the experiment analysis topology (for as long as it's going to last) - it's the work it's doing. And this isn't a simple topology - so the work it's doing is not at all obvious. But it's all there. In short - I think it's all about Garbage Collection and what we are doing differently in the Thumbtack version than in the Third-Rock version.

For example, there was a change to the data schema in redis, and we had the following in the code to make sure that we didn't read any bad data:

  (defn get-trips
    "The experiment name and variant name visited for a specific browserID are held
    in redis in the following manner:
 
      finch|<browserID> => <expr-name>|<browser>|<t-src>|<variant>|<country> => 0
 
    and this method will return a sequence of the:
 
      <expr-name>|<browser>|<t-src>|<variant>|<country>
 
    tuples for a given browserID. This is just a convenience function to look at all
    the keys in the finch|<browserID> hash, and keep only the ones with five values
    in them. Pretty simple."
    [browserID]
    (if browserID
      (let [all (fhkeys (str *master* "|" browserID))]
        (filter #(= 4 (count (filter #{\|} %))) all))))

where we needed to filter out the bad tuples. This is no longer necessary, so we can save a lot of time - and a lot of GC - by simply using:

  (defn get-trips
    "The experiment name and variant name visited for a specific browserID are held
    in redis in the following manner:
 
      finch|<browserID> => <expr-name>|<browser>|<t-src>|<variant>|<country> => 0
 
    and this method will return a sequence of the:
 
      <expr-name>|<browser>|<t-src>|<variant>|<country>
 
    tuples for a given browserID. This is just a convenience function to look at all
    the keys in the finch|<browserID> hash. Pretty simple."
    [browserID]
    (if browserID
      (fhkeys (str *master* "|" browserID))))

At the same time, it's clear that in the code I was including far too much data in the analysis. For example, I'm looking at a decay function that weights the most recent experiment experience more than the next most distant, and so on. This linear weighting is pretty simple and pretty fast to compute - it just says to weight each event differently, with a simple linear decay.

I wish I had better tools for drawing this, but I don't.

Computing this on the fly was still a little too intensive, and it made sense to hard-code the weights for different values of n. As I looked at the data, when n was 20, the most important event was weighted about 10%, and the least about 0.5%. That's small. So I decided to only consider the first 20 events - any more than that and we are really spreading out the effect too much.
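
For what it's worth, the weighting works out to something like this - a sketch of the idea, not the hard-coded production values:

  (defn linear-decay-weights
    "Return n weights, most recent event first, falling off linearly and
    summing to 1. With n = 20 the most recent event gets 20/210 (~10%) and
    the oldest gets 1/210 (~0.5%) - the numbers quoted above."
    [n]
    (let [total (/ (* n (inc n)) 2.0)]
      (map #(/ (- n %) total) (range n))))

  ;; (take 3 (linear-decay-weights 20)) => (0.0952... 0.0904... 0.0857...)
  ;; (last (linear-decay-weights 20))   => 0.00476...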

Then there were the redis issues...

Redis is the bread-n-butter of these storm topologies. Without it, we would have a lot more work in the topologies to maintain state - and then get it back out. Not fun. But it seems that redis has been having some issues this morning, and I needed to shut down all the topologies, and then restart all the redis servers (all 48 of them).

I'm hoping that the restart settles things down - from the initial observations, it's looking a lot better, and that's a really good sign.

Changing the Topology to Get Stability (cont.)

Friday, October 17th, 2014

Storm Logo

I have been working on the re-configuration of the experimental analysis topology to try and get stability, and it's been a lot more difficult than I had expected. What I'm getting looks a lot like this:

Instability

and that's no good.

Sadly, the problem is not at all easy to solve. What I'm seeing when I dig into the Capacity numbers for the bolts is that only one of the 300+ bolts has a capacity number that exceeds 1.000 - and then by a lot. All the others are less than 0.200 - well under control. So why?

I've looked at the logs - again and again - nothing... I've looked at the tuples moving through the bolts, and interestingly, found that many of them just aren't moving any tuples. Why? No clue, but it's easy enough to scale those back to make a more efficient topology.

What I've come away with is the idea that it might be the different way we're dealing with the data. So this weekend, if I have time, I'll dig in and see what I can do to make this more even between the two. It's not obvious, but then - that's 0.9.0.1 software for you.

Changing the Topology to Get Stability

Thursday, October 16th, 2014

Storm Logo

OK... it turns out that I needed to change my topology in order to get stability, and it's only taken me a day (or so) to find this out. It's been a really draining day, but I've learned that there is a lot more about Storm that I don't know than that I do.

Switching from one bolt that calls two functions to two bolts, each calling one function, should certainly add communication overhead - and yet, interestingly, that's how I regained stability. The lesson I've learned: make bolts as small and purposeful as possible.
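
To make that concrete, here's a minimal sketch of the shape of the change using Storm's Clojure DSL - the spout and bolts here are hypothetical stand-ins, not the real topology:

  (ns example.topology
    (:use [backtype.storm clojure config]))

  (defspout event-spout ["browser-id"]
    [conf context collector]
    (spout
      (nextTuple []
        (Thread/sleep 100)
        (emit-spout! collector [(str (java.util.UUID/randomUUID))]))
      (ack [id])))

  ;; before: one bolt did both the enrich and the score work in its execute.
  ;; after: each bolt does exactly one thing, and they're chained in the wiring.
  (defbolt enrich-bolt ["browser-id" "enriched"] [tuple collector]
    (emit-bolt! collector [(.getString tuple 0) "enriched"] :anchor tuple)
    (ack! collector tuple))

  (defbolt score-bolt ["browser-id" "score"] [tuple collector]
    (emit-bolt! collector [(.getString tuple 0) 1.0] :anchor tuple)
    (ack! collector tuple))

  (defn mk-topology []
    (topology
      {"events" (spout-spec event-spout)}
      {"enrich" (bolt-spec {"events" :shuffle} enrich-bolt :p 16)
       "score"  (bolt-spec {"enrich" :shuffle} score-bolt :p 16)}))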

Amazing...

Re-Tuning Experiment Topology

Wednesday, October 15th, 2014

Finch Experiments

Anytime you add a significant workload to a storm cluster, you really need to re-balance it. This means looking at the work each bolt does, making sure there is the proper balance between the bolts at each phase of the processing, and then making sure there are enough workers to handle the cumulative throughput. It's not a trivial job - it's a lot of experimentation, looking for patterns, and zeroing in on the solution.

That's what I've been doing for several hours, and I'm nowhere near done. I'm getting closer, though - I think I have the problem isolated, and it's very odd. Basically, there are a few bolt instances - say 4 out of 160 - that are above 1.0, while the rest are at least a factor of ten less. This is my problem. Something is causing these few bolts to take too long, and that skews the metric for all the instances of that bolt.
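
Back-of-the-envelope, the sizing side of this balancing act looks something like the following - my own rule of thumb (really just Little's law), not anything Storm prescribes:

  (defn executors-needed
    "Rough executor count for a bolt: the tuple rate times the per-tuple
    execute latency is how many executors' worth of work is arriving,
    padded so each executor's capacity stays below the target."
    [tuples-per-sec execute-latency-ms target-capacity]
    (let [busy (* tuples-per-sec (/ execute-latency-ms 1000.0))]
      (long (Math/ceil (/ busy target-capacity)))))

  ;; e.g. 5,000 tuples/sec at 2 ms each, keeping capacity under 0.5:
  ;; (executors-needed 5000 2.0 0.5) => 20

The hard part today isn't that arithmetic, though - it's the handful of outlier instances that blow the average for everyone else.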