Archive for October, 2014

Moving Sharded Redis Servers

Friday, October 24th, 2014

Redis Database

This morning I found that I really needed to move some of my sharded redis servers. Due to the needed bandwidth, I've got eight redis servers on one 196GB RAM box, but this morning I saw that the total RAM in use was over 170GB, and it was causing issues when redis tried to fork and save the data.

This is what I love about redis and this sharding - I can simply shut things down, move the redis dump.rdb files, fire up the servers on the new machines, and everything will load back up and be ready to go. A simple change in the sharding logic to point to the new machine for those moved servers, and everything is back up and running. Very nice.
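The sharding logic itself is nothing fancy - a minimal sketch of the idea (the server list and hashing here are illustrative, not the production code) would be:

  (def redis-shards
    "The shard map - moving a server is just editing this list and
     restarting. These hosts are hypothetical."
    [{:host "redis-01" :port 6379}
     {:host "redis-01" :port 6380}
     {:host "redis-02" :port 6379}])   ; the moved server points here now

  (defn shard-for
    "Pick the redis server for a key by hashing it over the shard list.
     The list length stays fixed, so keys keep mapping to the same slot
     even when a slot's host changes."
    [k]
    (nth redis-shards (mod (hash k) (count redis-shards))))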

Yes, it's very manual as well, but sometimes, the costs of that manual configuration are really worth it. Today it paid off. Big.

How long is too long?

Thursday, October 23rd, 2014

cubeLifeView.gif

I'm in a group that I've tried to get moving for the last year. And when I say moving, I mean producing. Shipping. Delivering. I don't claim to know how hard it is to make really nice UI/UX systems, but it seems to me that it shouldn't take as long as it's been taking the current group I'm in. And I'm not alone in this thought.

I've talked to the two previous managers, and their managers, and their manager - and I'm wondering if I'm just being dense, or if I'm now being impatient? If I step back a bit, it's clear from those around me that the good talent from The Shop is leaving. Lots and lots of good talent is just walking out the door to join smaller companies, bigger companies, different markets - they are really just all over the place. But it's clear - there's a Brain Drain going on.

But maybe it's the natural progression of things. Maybe they really valued something about the earlier days in The Shop, and they see it as an irreversible trend. If that's the case, then I'm being dense because all the evidence in front of me is saying that the quality of my work life, and therefore my life life, is going downhill - fast.

Yet I have to wonder... am I just looking for an excuse? Should I give this third manager more time? Maybe he really will turn it around like he says? Then again, what makes him special where the other two have tried and failed? Do I really think this guy is that much different? Does he have any more power or influence?

If not, then even if I give him the benefit of the doubt, and assume everything he says is 100% honest and he does really want to change things, what makes me think he's actually going to be able to do it?

I've talked to him about it, and said I've tried for a year, and come to realize it's just not changing, and I want to be on a team that Ships Stuff. That's what I enjoy doing. I'm just not sure I have the energy anymore to put up with this - and the changes he wants to make. He'll turn the team into a Hadoop/Spark query-builder team... I'm just not into that.

We'll have to see how things play out in the coming days.

Data Detective – All Day Long

Thursday, October 23rd, 2014

Detective.jpg

Today has been long... Very long... I've done precious little other than track down data issues for the back-end system I've built. It's not exciting because they're data errors, not logic errors, and to make things even worse, it's not a crashing bug - it's a data problem that leads to experimental results that look very wrong - because they are likely misrepresented, not mishandled.

I know it's all part of the job, and today was as good as any for doing this - because I didn't have a lot to do otherwise, but it's hard, nasty work if you aren't set up for the necessary data links, and in today's case - I wasn't. So I did a lot of visual inspection and narrowing down of data sets, but in the end, I got the results I needed.

I think the logic we're using for extracting the referring source of the traffic is not right. But how to fix it without breaking the other mappings is something I'm going to leave to the original author.
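To be concrete about the kind of logic in question, bucketing a raw referrer into a traffic source looks something like this - a hypothetical sketch, not the actual production rules:

  (defn referring-source
    "Bucket a raw HTTP referrer into a traffic-source label. The
     patterns here are made up for illustration - the real mapping
     is exactly what's in question."
    [referrer]
    (cond
      (nil? referrer)                            "direct"
      (re-find #"google\.|bing\." referrer)      "search"
      (re-find #"facebook\.|twitter\." referrer) "social"
      :else                                      "other"))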

More CSS Fiddling with Textual 5

Thursday, October 23rd, 2014

Textual 5

I have to say that one of the coolest things about Textual 5 was already in Textual 4.1.8, but I didn't really avail myself of it a lot - and that's the CSS styles for the UI. I've been fine-tuning the CSS for the effects in the window - like the messages from the server, and the messages about why I might have disconnected, and then the line separating the old from the new messages... it's all looking so much better, but it's all just simple pixel moves.

I really love this app. Wonderful support on IRC - no less!

Upgraded to Textual 5

Wednesday, October 22nd, 2014

Textual 5

I just noticed that there was a paid update to Textual 4.1.8 - Textual 5 in the Mac App Store. I can't even remember what got me looking in the #textual chat room, but there it was, and that was really good news. I'm a big fan of Textual, and that they had an upgrade for OS X 10.10 is really nice. I do wish Adium had one.

The UI changes aren't major - which is nice - and it reads the configuration from 4.1.8, which is also really nice. They got rid of the hidden check box to not restrict the window size, but the developer's concession was to make the minimum "small enough" that it wasn't needed.

Leave it to me to point out that I wanted about 20% less.

So I let him know, and we'll see if there's a change in the offing. It's not horrible as-is, but I would like to have the freedom to make it a little smaller... but I can live with it for now.

UPDATE: the CSS configuration of Textual is just amazing. I'm able to customize this view to a level that's more than I ever imagined. I'm sure there are lots of others who think this is no big deal - but to me, screen real estate is critical, and the ability to customize a view like this is just fantastic. Very highly recommended.

Finding the Joy in Life Again

Wednesday, October 22nd, 2014

Great News

I honestly would have put money on the fact that this would not have happened today. Big money.

I'm sitting on the bus riding to work, and I realize that I'm pretty happy without a pain-causing personal relationship in my life. That was a wow! moment. I've been separated for about 2 years, and the divorce is in the works, but I would have bet real money I'd feel horrible for the rest of my natural life. But today... on the bus... for a few minutes... I didn't.

That was huge for me. Huge.

Then I'm at work, updating a few postings with the results of the tests I'd done overnight, and I'm back into the swing of posting like I used to. It's been a long two years, but I'm back to writing about what I'm doing, and it's really helping. I'm feeling like I'm enjoying myself again.

This, too, was huge for me.

I don't expect this to last all day... but the fact that I have felt this way tells me that I need to keep doing what I'm doing - keep moving forward, and then maybe this will come again. And maybe when it comes again, it'll last longer. Maybe.

Struggling with Storm and Garbage Collection

Tuesday, October 21st, 2014

Finch Experiments

OK, this has to be one of the hardest topologies to balance that I've ever had. Yeah, it's only been a few days, but Holy Cow! this is a nasty one. The problem was that it was seemingly impossible to find a smoking gun for the jumps in the Capacity of the bolts in the topology. Nothing in the logs. Nothing to be found on any of the boxes.

It has been a pain for several days, and I was really starting to get frustrated with this guy. And then I started to think more about the data I had already collected, and where that data - measured over and over again - was really leading me. I have come to the conclusion that it's all about Garbage Collection.

The final tip-off was the emitted counts from the bolts during these problems:

Bolt Counts

where the corresponding capacity graph looks like:

Capacity Graph

The tip-off was that there was a drop in the emitted tuples, then a quick spike up, and then a return to the pre-incident levels. This tells me that something caused the flow of tuples to nearly stop, and then the system caught back up, so the integral over that interval was the same as the average flow.

What led me to this discovery is that all the spikes in the Capacity graph were 10 min long. Always. That was too regular to be an event, and as I dug into the code for Storm, it was clear it was using a 10 min average for the Capacity calculation, and that explains the duration - it took 10 mins for the event to be wiped from the memory of the calculation, and for things to return to normal.
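For reference, the Capacity number in the Storm UI boils down to the fraction of the sampling window a bolt spent executing tuples - roughly this, as I read the Storm code (simplified here):

  (defn capacity
    "Fraction of the sampling window a bolt spent executing.
     executed    - tuples executed in the window
     avg-latency - average execute latency (ms)
     window-ms   - the sampling window; 600,000 ms (10 min)"
    [executed avg-latency window-ms]
    (/ (* executed avg-latency) (double window-ms)))

  ;; a single long GC pause inflates avg-latency, and it then takes
  ;; the full 10 min window for the spike to age out of the average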

Given that, I wasn't looking for a long-term situation - I was looking for an event, and with that, I was able to start looking at other data sources for something that would be an impulse event that would have a 10 min duration effect on the capacity.

While I'm not 100% positive - yet - I am pretty sure that this is the culprit, so I've taken steps to spread out the load of the bolts in the topology to give the overall topology more memory, and less work per worker. This should have a two-fold effect on the Garbage Collection, and I'm hoping it'll stay under control.
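The "spreading out" is just Storm configuration - more workers and a higher parallelism hint on the hot bolts. A sketch, with illustrative names and numbers (not our actual topology):

  (ns finch.topology
    (:use [backtype.storm clojure config]))

  (declare events-spout analyze-bolt)   ; placeholders for the real spout/bolt

  ;; more workers means more JVMs, each with a smaller heap to collect;
  ;; a higher :p spreads the per-bolt work thinner
  (defn mk-topology []
    (topology
     {"events"  (spout-spec events-spout :p 8)}
     {"analyze" (bolt-spec {"events" :shuffle} analyze-bolt :p 32)}))

  (def topo-config
    {TOPOLOGY-WORKERS 16})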

Only time will tell...

UPDATE: HA! it's not the Garbage Collection - it's the redis box! It appears that the redis servers have hit the limit on the forking and writing to disk, and even the redis start-up log says to add the system setting:

  vm.overcommit_memory = 1

to /etc/sysctl.conf and then reboot, or run the command:

  sysctl vm.overcommit_memory=1

for this to take effect immediately. I did both on all the redis boxes. I'm thinking this is the problem after all.

Best results I could have hoped for:

Bolt Counts

Everything is looking much better!

Glui Having Trouble with OS X 10.10

Monday, October 20th, 2014

Glui

Glui - a great tool for replacing Skitch - is having a little problem with OS X 10.10, and it's not that hard to see:

Glui Problem

So I emailed the developer, and within 15 mins I got a response that the fix was in the Mac App Store, and I should see the update in a few days. Sweet! I use this for a lot of my journal posts, so it's nice to get it all fixed up. I'm really liking the OS X 10.10 UI, and it's great to see them being so responsive.

Great tool - Great support!

Adium 1.5.10, Yahoo!, and OS X 10.10 aren’t Happy

Monday, October 20th, 2014

Adium.jpg

Turns out that Adium 1.5.10 and Yahoo! IM aren't happy with OS X 10.10 - it's saying that it can't connect to the Yahoo! IM server and erroring out. Normally, this wouldn't be an issue for me, but I've got a few friends that I stay in touch with all the time, and because of their employer, two of them have been kinda off-limits to me for a while - but one finally found his way back to me through Yahoo! IM.

While I'm really loving the changes in OS X 10.10, sadly, with the update, the Yahoo! IM connection has stopped working. And the error isn't at all helpful:

Adium Error

What I've read from the Adium blog is that there is a known fix - but there's a potential security concern, and I'm not at all sure why they are delaying releasing a patch for Yosemite... but they are.

So I'm out of touch with an old friend until I can get an update from Adium. This is one of the issues with Open Source - abandoned software. I'm sure the point is that I can pick it up and fix it, and I might be able to, but it's a pain, and it's something I've come to depend on. In that, I like paying for things.

Changing the Topology to Get Stability (cont.)

Monday, October 20th, 2014

Storm Logo

I think I've finally cracked the nut of the stability issues with the experiment analysis topology (for as long as it's going to last) - it's the work it's doing. And this isn't a simple topology - so the work it's doing is not at all obvious. But it's all there. In short - I think it's all about Garbage Collection and what we are doing differently in the Thumbtack version than in the Third-Rock version.

For example, the data schema in redis changed, so we had the following in the code to make sure that we didn't read any bad data:

  (defn get-trips
    "The experiment name and variant name visited for a specific browserID are held
    in redis in the following manner:
 
      finch|<browserID> => <expr-name>|<browser>|<t-src>|<variant>|<country> => 0
 
    and this method will return a sequence of the:
 
      <expr-name>|<browser>|<t-src>|<variant>|<country>
 
    tuples for a given browserID. This is just a convenience function to look at all
    the keys in the finch|<browserID> hash, and keep only the ones with five values
    in them. Pretty simple."
    [browserID]
    (if browserID
      (let [all (fhkeys (str *master* "|" browserID))]
        (filter #(= 4 (count (filter #{\|} %))) all))))

where we needed to filter out the bad tuples. This is no longer necessary, so we can save a lot of time - and GC - by simply using:

  (defn get-trips
    "The experiment name and variant name visited for a specific browserID are held
    in redis in the following manner:
 
      finch|<browserID> => <expr-name>|<browser>|<t-src>|<variant>|<country> => 0
 
    and this method will return a sequence of the:
 
      <expr-name>|<browser>|<t-src>|<variant>|<country>
 
    tuples for a given browserID. This is just a convenience function to look at all
    the keys in the finch|<browserID> hash. Pretty simple."
    [browserID]
    (if browserID
      (fhkeys (str *master* "|" browserID))))
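As an aside, fhkeys is just our little redis helper - a rough stand-in using Carmine (the connection spec here is made up) would be:

  (require '[taoensso.carmine :as car])

  (defn fhkeys
    "Rough stand-in for our helper - HKEYS against redis. The real
     version routes through the sharding logic to pick the server."
    [k]
    (car/wcar {:spec {:host "redis-01" :port 6379}}   ; hypothetical host
      (car/hkeys k)))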

At the same time, it's clear that in the code I was including far too much data in the analysis. For example, I'm looking at a decay function that weights the most recent experiment experience more than the next most distant, etc. This linear weighting is pretty simple and pretty fast to compute. It basically says to weight each event differently - with a simple linear decay.

I wish I had better tools for drawing this, but I don't.
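Still, the shape is easy enough to write down - a minimal sketch, assuming the most recent event is i = 1 of n, with the weights summing to 1:

  (defn decay-weights
    "Linear-decay weights for the n most recent events, most recent
     first: w(i) = (n + 1 - i) / (n (n + 1) / 2)."
    [n]
    (let [total (/ (* n (inc n)) 2.0)]
      (for [i (range 1 (inc n))]
        (/ (- (inc n) i) total))))

  ;; (first (decay-weights 20))  => 0.0952... (~10%)
  ;; (last  (decay-weights 20))  => 0.0047... (~0.5%)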

This was just a little too computationally intensive, and it made sense to hard-code these values for different values of n. As I looked at the data, when n was 20, the most important event was about 10%, and the least was about 0.5%. That's small. So I decided to only consider the first 20 events - any more than that and we are really spreading out the effect too much.

Then there were the redis issues...

Redis is the bread-n-butter of these storm topologies. Without it, we would have a lot more work in the topologies to maintain state - and then get it back out. Not fun. But it seems that redis has been having some issues this morning, and I needed to shut down all the topologies, and then restart all the redis servers (all 48 of them).

I'm hoping that the restart settles things down - from the initial observations, it's looking a lot better, and that's a really good sign.