Archive for the ‘Cube Life’ Category

HipChat 3.0 – Wow! Not a Good Update

Thursday, September 4th, 2014

This morning I saw that HipChat 3.0 for the Mac was out, and I updated. Why not? Well... the changes in this version are certainly worthy of the 3.0 designation. The UI is almost completely redone, and it's not a nice facelift. Interestingly, I'm not the only one who thinks this.

HipChat 3.0

The title bar is huge! What were they thinking? The whitespace around each line... and the whitespace to the left in the names column... it's just too much. In these days when people work primarily on laptops, I'm just plain shocked to see a design that's so wasteful of space.

Then there's the baseline for the text.

HipChat 3.0

In the larger picture, it looks like the baseline of the name sits lower than the baseline of the message text. But when you zoom in and draw a line, it looks as though it's just an artifact of the anti-aliased text - even on a Retina display!

HipChat, as a service, is pretty good. It's solid, reliable, searchable, and it just works. But the designers they've had on the Mac OS X products just don't understand that most people want to customize their experience. Why not allow CSS to style the display? Or at least offer a toolkit for making themes? Either of these would let teams personalize their view so that it works best for them.

Propane - the Campfire Mac OS X client - did this. I was able to completely customize the UI. Very cool. HipChat 3.0 just seems to be chasing the iOS 7 style guidelines... and missing... badly. So much wasted space.

Reloading Postgres Configuration

Tuesday, July 15th, 2014


When you add a new user to a postgres database, you typically have to add them to the pg_hba.conf file and then tell postgres to reload its configuration. There are a few ways to do this, and they are all pretty simple, but it bears writing them down for future reference.

First, adding a user is simple. Find the postgres data directory - typically, all the configuration files are there, and in that list will be a file called pg_hba.conf. The final few lines will need to look like:

  host all jim  0.0.0.0/0 trust

where the user jim is considered to be trusted from any host he could come in on. This might be too liberal for you, but in most data centers, it's not an unreasonable assumption.
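
If trust is too liberal for your environment, the same line can be tightened up with a subnet and a password-based method - the 10.0.0.0/8 network here is just a stand-in for your own:

  # require an md5-hashed password, and only from the internal network
  host all jim  10.0.0.0/8 md5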

Then you need to tell postgres to reload the config. This can be done within the psql client:

  SELECT pg_reload_conf();

or from the shell, by the postgres user:

  /usr/local/bin/pg_ctl reload -D /var/pgsql

assuming the main postgres directory is /var/pgsql.
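
Under the covers, pg_ctl reload just sends a SIGHUP to the postmaster, so if you'd rather do it by hand, the first line of postmaster.pid in the data directory is the postmaster's process id:

  kill -HUP `head -1 /var/pgsql/postmaster.pid`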

It was the best of times, it was the worst of times…

Monday, June 30th, 2014


It's been a heck of a month, and I've adopted Dickens' line from A Tale of Two Cities because it fits my life amazingly well these days. I'm trying to adapt to the current environment at The Shop, but it's proving increasingly difficult, and in the end I'm left with the feeling that it's an abusive relationship: the more I attempt to adapt, the more I'm being exploited.

The work can be fun, and it's one of the two things I'd classify in the ... best of times... category. But there's very little real work, and far more people posturing, pretending, and trying to get me to do their work for them. In most cases I accept the work because it is, after all, the best of times - but it becomes an abusive relationship when these people expect me to do their work for them and get angry when I can't - or won't. I've enabled them to be poor producers in this environment, and by enabling that behavior, I've done them no favors.

Management is not supporting me because they are the ultimate beneficiaries of all the work I do - the more I do, the more they like it. It's a very simple equation. I also have a manager who seems completely incapable of delivering any kind of bad news, so instead he convinces himself that the right thing is to adjust his expectations and have me do their work. If they then become happy, and all the work gets done (by me), then all the better. No more reason to deliver bad news.

But it's not healthy. It's not sustainable. And I'm getting to the point that I'm just done with this particular unhealthy relationship, and it's time to move on. I like the company, but not the people I'm working with. There are people here at The Shop that I'd like to work with, have in the past, and would enjoy doing so again. There are even people in this group that are decent workers. But the vast majority are not. And I've enabled their bad behavior for too long.

It's time to make them stand on their own two feet. That's the only way they will get better. And they need to know if they can get better, or if it's time to change jobs. That is what we all should have to do, but I've been keeping them from this by picking up all the group's work. Like I said, I wasn't doing them any real favors. I was doing myself a favor by keeping my mind active.

It's time to make some changes.

Bash Safety Tip: Check for Terminal Prompt

Wednesday, February 19th, 2014


I was having a pretty painful issue with SCP the other day and it took me a few hours to get to the bottom of it, but there's a lesson to be learned here. The problem was that I was able to SSH to a host, but was not able to SCP to that same host. The keys were good, the permissions on the key files were good - and SSH was just fine. It was just SCP.

Finally, I figured out it was my .bashrc file. I had added some functions in there, and they were fine in a terminal session, but they caused the "head-less" SCP session to hang. Horribly. And that's the Safety Tip for the day:

Add this after the alias commands at the top of your .bashrc:

  # skip the rest if we're not interactive
  if [ -z "$PS1" ]; then
      return
  fi

and then you'll still have the aliases if you need them, but you won't have the rest that could foul up the SCP session.
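
An equivalent check - arguably more robust, since $PS1 can occasionally be set even in non-interactive shells - is to look for the i flag in the shell's options:

  # skip the rest if the shell options don't include 'i' (interactive)
  case $- in
      *i*) ;;         # interactive - keep going
      *)   return ;;  # non-interactive (scp, sftp) - stop here
  esac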

Very useful tip. Do it.

Optimizing Redis Storage

Sunday, January 19th, 2014


[Note: I created this post for The Shop, and while I'm not sure if it'll ever see the light of day, it's useful, and I wanted to see it posted. So here it is.]

The Optimize Group

One of the tasks of the Optimize Team where I work is to build a real-time analytics engine for our A/B testing framework, which involves analyzing the consumers experiencing each of the variants of each experiment. Along with this, we need to look at each deal sold in the system and properly attribute each sale to the experiments the consumer visited on their way to that sale - the ones that might have influenced their buying decision. Based on this visiting and buying data, the different product teams can then determine which of the experiment variants they want to keep, and which they don't.

In order to improve any consumer-facing Groupon product, experiments are run where a random sample of consumers is placed into a testing group and shown one or more variants of the original (control) experience, and then their responses are tallied. This A/B testing data comes to our cluster in the form of several separate messages. Some indicate the consumer, browser, and device when an experiment variant is encountered; others indicate when a consumer purchased a deal. It is then the job of this cluster to correlate the actions taken by that consumer to see if the variant is better than the control. Did the larger image lead to more purchases? Did the location of the button cause more people to click on it? All these experiments need to be classified and the consumer actions attributed.

Recently, several production systems here started using Clojure, and given that Storm is written primarily in Clojure, it seemed like a very good fit for the problem of real-time message processing. There are several topologies in our cluster - one that unifies the format of the incoming data, another that enriches it with quasi-static data, and then a simple topology that counts these events based on the contents of the messages. Currently, we're processing more than 50,000 messages a second, but with Storm we have the ability to easily scale that up as the load increases. What proved to be a challenge was maintaining the shared state, as it could not be stored in any one of the bolts - there are 30 instances of them spread across five machines in the cluster. So we had to have an external shared state.

All of our boxes are located in our datacenter, and because we're processing real-time data streams, we're running on bare metal - not VMs. Our tests showed that with the traditional Redis persistence option of the time/update limits, a Redis box in our datacenter with 24 cores and 96 GB of RAM was more than capable of handling the load from these 30 bolts. In fact, the CPU usage hovered around a consistent 15% - of one of the 24 cores. Plenty of headroom.

Redis is primarily a key/value store, with the addition of primitive data types - HASH, LIST, and SET - that allow slightly nested structures and operations in the cache. And while its ability to recover after a crash with its data intact is a valuable step up over Memcached, it really makes you think about how to store data in a useful and efficient layout. The initial structure we chose for Redis was pretty simple. We needed a Redis SET of all the experiment names that were active. It turns out that there can be many experiments in the codebase, but only some are active - others may have completed and just haven't been removed from the code. To support this active list, we had a single key:

	finch|all-experiments => SET (names)

and then for each active experiment name, we had a series of counts: How many consumer interactions have there been with this experiment? How many errors were there on the page when dealing with an experiment? And even a count of the basic errors encountered in the stream itself - each updated with Redis' atomic INCR command:

	finch|<name>|counts|experiment => INT
	finch|<name>|counts|errors => INT
	finch|<name>|counts|null-b-cookies => INT
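
Concretely, registering an experiment and bumping its counters is just a handful of atomic commands - 'blue-button' here is a made-up experiment name:

	SADD finch|all-experiments blue-button
	INCR finch|blue-button|counts|experiment
	INCR finch|blue-button|counts|errors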

The next step was to keep track of all the experiments seen by all the consumers. As mentioned previously, this includes the browser they were using (Chrome 29.0, IE 9.0, etc.), the channel (a.k.a. line of business) the deal is from (Goods, Getaways, etc.), and the name of the variant they experienced. The consumer is represented by their browser ID:

	finch|<name>|tuples => SET of [<browser>|<channel>|<variant>]
	finch|<name>|variant|<browser>|<channel>|<variant> => SET of browserId

The Redis SET of tuples - the browser name and version, the channel, and the name of the variant they saw - was important so that we didn't have to scan the key space looking for the SETs of browser IDs. Redis is very efficient at looking up a value by key, but it is horribly inefficient if it has to scan all the keys. While that function (KEYS) exists in the Redis command set, it's also very clearly flagged as something not to use in a production system because of the performance implications.

Finally, we needed to attribute the sales and who bought them, again based on these tuples:

	finch|<name>|orders|<browser>|<channel>|<variant>|orders => INT
	finch|<name>|orders|<browser>|<channel>|<variant>|qty => INT
	finch|<name>|orders|<browser>|<channel>|<variant>|revenue => FLOAT
	finch|<name>|orders|<browser>|<channel>|<variant>|consumers => SET of uuid

As you can see, the lack of nested structures in Redis means a lot has to be accomplished by how you name your keys, which makes this all appear far more complicated than it really is. At the same time, we purposefully chose the atomic Redis operations for incrementing values to keep the performance up. This may seem like a lot of data to hold in Redis, but it led to very fast access to the shared state, and Redis' atomic operations meant that all 30 instances of the bolt could hit the same Redis instance and update the data concurrently. Performance was high - the analytics derived from this data could be generated in roughly 5 sec - so the solution seemed to be working perfectly.

Until we had been collecting data for a few days.

The memory usage on our Redis machine seemed to be constantly climbing. First it passed 20 GB, then 40 GB, and then it crashed the 96 GB machine. The problem stemmed from the fact that while an experiment was active, we were accumulating data for it. The integers weren't the problem; this one particular SET was:

	finch|<name>|variant|<browser>|<channel>|<variant> => SET of browserId

Over time there would be millions of unique visitors, more than a hundred active experiments at any one time, and even multiple browserIDs per consumer. Add it all up, and these Redis SETs would have hundreds of millions of entries - and they would continue to grow as more visitors came to the site and experienced the experiments. What we needed was a much more efficient way to store this data.

Wondering what Redis users do when they want to optimize storage, we did some research and found a blog post by the Engineering group at Instagram. We also found a post on the Redis site that reinforces the same point and gives tuning parameters for storing data efficiently in a HASH. Armed with this knowledge, we set about refactoring our data structures to see what gains we could get.
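
The heart of both posts is that Redis stores sufficiently small HASHes in a very compact encoding (a ziplist), and the size thresholds at which it falls back to the normal encoding are tunable in redis.conf - the numbers below are illustrative, not a recommendation:

	hash-max-ziplist-entries 512
	hash-max-ziplist-value 64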

Our first change was to pull the ‘counts’ into a HASH. Rather than using:

	INCR finch|<name>|counts|experiment
	INCR finch|<name>|counts|errors
	INCR finch|<name>|counts|null-b-cookies

we switched to:

	HINCRBY finch|<expr-name>|counts experiment 1
	HINCRBY finch|<expr-name>|counts errors 1
	HINCRBY finch|<expr-name>|counts null-b-cookies 1

Clearly, we were not the first to go this route, as Redis has the equivalent atomic increment command for HASH entries (HINCRBY). It was a very simple task of breaking up the original key and switching to the HASH form of the command.

Placing the sales in a HASH (except the SET of consumerIDs, as a SET can't live inside a HASH) was also just a matter of breaking up the key and using HINCRBY - and HINCRBYFLOAT for the revenue. Continuing along these lines, we saw we could do a similar refactor of the browserIDs: we switched from a SET of browserIDs to a HASH where the keys are the browserIDs - just as unique, and we can use the Redis command HKEYS to get the complete list. Going further, we figured the values of this new HASH could contain some of the data that had been in other structures:

	finch|<browserID> => app-chan => <browser>|<channel>
	finch|<browserID> => trips|<expr-name>|<name_of_variant> => 0

where that zero is just a dummy value for the HASH field.

With this new structure, we can count the unique browserIDs in an experiment by using the Redis EXISTS command to see if we have already seen this browserID (in the form of the above HASH), and if not, increment the number of unique entries in:

	finch|<expr-name>|tuples => <browser>|<channel>|<name_of_variant> => INT
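
Spelled out as commands, handling a first visit from a new browser looks roughly like this - the HSETs and the HINCRBY happen only when the EXISTS comes back 0 (and in redis-cli, values with spaces would need quoting):

	EXISTS finch|<browserID>
	HSET finch|<browserID> app-chan <browser>|<channel>
	HSET finch|<browserID> trips|<expr-name>|<name_of_variant> 0
	HINCRBY finch|<expr-name>|tuples <browser>|<channel>|<name_of_variant> 1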

At the same time, we got control over the ever-growing set of browserIDs that was filling up Redis in the first place, by not keeping the full history of browserIDs - just the count. We realized we could have the browserID expire after a period of inactivity and let it get added back in as consumers return to use Groupon. So we can use the Redis EXPIRE command on the:

	finch|<browserID>

HASH, and then after some pre-defined period of inactivity, the browserID data simply disappears from Redis. This last set of changes - moving from a SET to a HASH, counting the visits rather than the members of a SET, and EXPIRE-ing the data after a time - made the most significant dent in the storage requirements.
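
The expiration itself is a single command - the 30-day window (2592000 seconds) here is purely illustrative, not necessarily the inactivity period we settled on:

	EXPIRE finch|<browserID> 2592000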

So what have we really done? We had a workable solution to our shared-state problem using Redis, but the space required was very large, and the cost of keeping it working was going to be a lot more hardware. So we researched a bit, read a bit, and learned about the internals of Redis storage. Then we did a significant data refactoring of the information in Redis - careful to keep every feature we needed and, wherever possible, reduce the data retained.

The end effect? The Redis CPU usage doubled - still very reasonable, at about 33% of one core. The Redis storage dropped to 9 GB - less than 1/10th of the original. The latency in loading a complete experiment data set rose slightly - about 10% on average, depending on the size and duration of the experiment. Everything we liked about Redis - fast, simple, robust, and persistent - we were able to keep. Our new-found understanding of the internals of Redis enabled us to make it far more efficient. As with any tool, the more you know about it - including its internal workings - the more you will be able to do with it.

What Would I Build?

Monday, November 25th, 2013


I've been playing around with Storm for a while now, and while I don't think there are all that many folks in the world who are expert at it, I'm certainly an advanced novice, and that's good enough for the amount of time I've put into it. I've learned a lot about how they have tried to solve the high-performance computing problem in Clojure and on the JVM, and I've come away with an affirmation of the feelings I had when I was interviewed for this job and we discussed functional languages: Garbage Collection is the death of all functional languages, and certainly Storm.

I like the simplicity of functional languages with a good library of functions. Face it, Java took off over C++ because C++ gave you just the base language, while Java had the rich class library that everyone built on. It made a huge difference in how fast people could build things. So if you want a functional language to gain traction fast, you need to make sure people aren't sent off to re-invent the wheel for the most basic tasks.

But the real killer is Garbage Collection. I'm not a fan, and the reason is simple: if I'm trying to write performant code, I want to control when collection happens, and under what conditions. It's nice that novices can forget about memory and still write stable code, but when you want to move 1,000,000 msgs/sec, you can't do it without pools, lockless data structures, mutability, and solid resource control. None of which I get in the JVM - or anything based on it.

So what's a coder to do? Answer: Write another.

There used to be Xgrid from Apple, but they dropped that. They didn't see that it was in their best interests to write something that targets their machines as nodes in a compute cluster, and they aren't about to write something that lets you use cheap linux boxes and cut them out altogether. Sadly, they are a company, and they want to make money.

But what if we made a library that used something like ZeroMQ for messaging, with C++ on the linux side and Obj-C++ on the Mac side, and made all the tools work like they do for Storm - but instead of clojure, the JVM, and a ton of server-side tools handling all the coordination and messaging, used something far more coupled to the toolset we're working with?

First, no Thrift. It's bulky, expensive, and it's being used as a simple remote procedure call. There are a lot of better alternatives out there when you're using a single language. Stick with a recent version of ZeroMQ and decent bindings - like their C++ ones. Start small and build it up. Make a decent console - Storm is nice here, but there's a lot more that could be done, and the data in the Storm UI is not easily discernible. Make it clearer.
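
To make that a bit more concrete, here's a rough sketch of what one worker's receive loop might look like with the ZeroMQ C++ bindings - the endpoint is hypothetical, and a real version would need framing, batching, and error handling:

  #include <zmq.hpp>
  #include <string>

  int main() {
      zmq::context_t ctx(1);
      zmq::socket_t pull(ctx, ZMQ_PULL);      // the upstream stage PUSHes tuples to us
      pull.connect("tcp://spout-host:5555");  // hypothetical endpoint

      while (true) {
          zmq::message_t msg;
          pull.recv(&msg);                    // blocks until a tuple arrives
          std::string tuple(static_cast<char *>(msg.data()), msg.size());
          // ... process the tuple, then PUSH results to the next stage ...
      }
  }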

Maybe I'll get into this... it would certainly keep me off the streets.

Chasing the Magic Tool

Monday, October 28th, 2013


I'm in the midst of a new project here at The Shop, and I can understand that it's really new technology, and as such, very little is really known about it. Sure, if you listen to the conference talks, Storm is old news - but put it into production, and all of a sudden a lot of hands in the crowd go down, because it's just so bloody new. I'm trying to make it work.

But at the same time, I'm seeing emails about new distributed systems frameworks -- sounds a lot like what Storm is about, and management is asking for opinions. My initial opinion is pretty simple: Pick one and get good at it.

I'm here, but I'm worried that this place is the exact opposite of "Enterprise Tools" - they are the "Always Shiny". We have a tool for distributed, fault-tolerant, computing - so why are we looking at another? Should we assume that the selection we have is premature, and that based on what we have found, we need something better?

I'm not against competition, but then, you have to allow for the fact that you're going to have a hodgepodge of all kinds of systems in the end, as no one goes back and converts a working production system from one working tech to another, different, working tech. There's never time.

So why the search? Why not just get good at one of the leaders in the space, and then gain the critical experience to be able to really make it work?

I fear the answer is that too many people think the tool is the real power.

Nothing could be further from the truth. I've seen it done over and over again: what might be considered antique tech building some of the most amazing things, because the people using it knew it so well that they could overcome its problems and do amazing work where a newcomer to the tech would see only the impossible.

I hope I'm wrong. I fear I'm not.

Oh, I am SO Guilty of This…

Friday, September 20th, 2013

I just saw this on twitter:

A common fallacy is to assume authors of incomprehensible code will somehow be able to express themselves lucidly and clearly in comments.

— Kevlin Henney (@KevlinHenney) September 20, 2013

…and for the first time in weeks it made me want to post something.

I'm so horribly guilty of this that I don't even realize it. When I look at poorly documented code, I think the author was just lazy - because he's as smart as I am - right? Maybe not.

In fact, probably not.

To this day I don't see myself as any smarter than a lot of the professional developers I have worked with. Sure, there are some really junior folks, but I'm talking about the seasoned professionals - the folks who may have been working in the web space for a while, or on back-end systems, or building libraries… they are all just as smart as I am. The only difference, so I thought, between them and me is that I worked so much harder - that it was just a matter of effort.

This little gem of a tweet says in 140 characters what I keep missing over and over again: when you look at really bad code, it's often more likely that the author didn't know any better, or was leaning too hard on StackOverflow, and really had no idea what they were doing. So adding comments to this mess would only increase the line count, not add value to the work.

I need to remember this more often.

Building Clojure Libraries for use in Storm

Friday, September 20th, 2013

Storm

Here's a little gem I figured out while trying to deliver a nice clojure library for some storm work that I've been doing. The problem is that when you build an uberjar for a topology (or library) with leiningen, you don't want to include the storm jars, as they will mess up the storm cluster when you go to deploy the topology. So how do you get it all to work locally, and still have the uberjar build go smoothly?

Sadly, this is not clearly documented anywhere that I could find. But there were bits and pieces here and there and I was able to figure out what I needed with a little trial-and-error.

Thankfully, it's all in the leiningen project.clj file:

  (defproject having-fun "1.0.0"
    :aot [project.core]
    :profiles {:provided {:dependencies [[storm "0.9.0-wip16"]]}}
    :repositories [["releases" {:url "http://nexus/content/repositories/releases/"
                                :sign-releases false}]]
    :main project.core)

The keys to this: with leiningen 2, the :aot tag needs to be at the top level, and not buried in a :profiles entry. This seems to be the direction going forward, so I wanted to adopt it now, and it works better this way.

Additionally, the :profiles line is all about excluding the storm jars from the uberjar - which is just what I needed - and the :repositories tag is all about where to deploy this with a simple:

  $ lein deploy

With this, I've been able to build clojure libraries with defbolt and defspout constructs - which is exactly what I wanted to do, and then put this up on our local nexus server so that it's very easy for others in the group to put this library in their project.clj file and make use of it.
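
For reference, a bolt built this way needs nothing exotic - here's a minimal (hypothetical) example of the kind of thing that now packages cleanly:

  (ns project.core
    (:use [backtype.storm clojure]))

  ;; a trivial bolt that upper-cases the first field of each tuple
  (defbolt upcase-bolt ["word"] [tuple collector]
    (emit-bolt! collector [(.toUpperCase (.getString tuple 0))] :anchor tuple)
    (ack! collector tuple))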

Sweet.

Breaking Production – Not Good Leadership

Friday, August 9th, 2013


This week has been a little stressful for me - I've spent a few days off work getting the last of my things out of the house and into storage, and then signing some papers to sell the house. It's all a necessary part of life, I know, but it's stressful, and so I have to push through it.

What I didn't expect was having to deal with a broken production app that supports some of the capabilities of the main project I'm on at The Shop. It's not a lot - and it's not much to look at - but it's really useful to me in what I'm doing, and I depend on it every morning for writing up a status email that I send to the group about the overnight runs.

Anyway, for two days in a row, one of the senior developers in the group - a relative newcomer - broke production. The first day, I was pretty nice about it - just asking him if he had checked production once he deployed the changes, knowing full well he hadn't. The next day I was not as happy, and it started a significant email chain between him, the group manager, and myself about what we should be doing, and about the qualities of leadership in general.

The problem is that this guy was hired to be the Tech Lead of the group, but he's never really led in a way that I felt worth following. He could certainly command, but that's not how groups at The Shop are run - they're meant to be a consensus of smart guys arriving at a good decision for the good of the team and the business. There will certainly be differences of opinion, and our group has had many, but after a good talking session, we understand everyone's position, and consensus is reached. It might not leave everyone happy, but it works.

At least it used to.

Now it's not working, and I've given it several months to work itself out. But after the second day in a row where no testing was done after deploying changes to production, I felt it was time to point out that this casual approach to production has to stop. It's very simple to test something once it's in production, and the lack of even the simplest testing is really a sign of a much larger problem.

I could try to make light of the real problem, but it boils down to attitude. Always does, doesn't it? If you have the proper attitude about your work, then you care about how it's seen by others. You are careful about changes. You watch the details and make sure they are all covered.

Basically, you do a good job. Regardless of the job. Carpenter, Dentist, Doctor, Coder - all are the same. If you take care in what you are doing, the end result may not be perfect, but it's at least something you can defend as being the very best you can do.

In this case, he knew it was a mistake. And to do it two days in a row was - well… really inexcusable. So I pointed out that leadership is an isolated job - it's up to others whether they choose to follow you. Command is an entirely different thing, and I think we have a problem with the words and definitions we're using for this position. He may have been hired as the lead, but that presumed he was capable of doing the job. For me, at least, he can't.

I don't know what will happen. I doubt The Shop will re-arrange staff to suit me, but it's possible my project can be separated out so I don't have to face the daily friction of dealing - or in my case, not dealing - with him. I hope that's the case, but I don't know that they will do it. If not, it's clear there are other groups in The Shop that would be glad to have my help, so it's not all that bad - but it's uncomfortable now, and I've managed to keep it very professional and positive.

What gets me is that the original members of this group would have laughed a bit at the first day, and then roasted him alive on the second. That we have gotten to this point is very sad to me. I miss the old team.