Archive for the ‘Clojure Coding’ Category

A Completely Wasted Day

Thursday, November 13th, 2014

PHB.gif

Last night I had to explain some very bad clojure Storm code to a couple of guys in the group. The reason for this was the new assault on the latency - as opposed to trying the one thing I said to try, which was removing one of the Kafka publishers. But that's neither here nor there, because that will be the last thing tried, since it's my idea.

But because of this explanation, and the horrible code, something that should have taken an hour or two - at the outside - is now sitting at a day and a half... and one of those days - yesterday - was completely wasted. Nothing got done on this, because of the very bad code written by the developer asked to do it, and because the other guys either didn't get to it or wasted too much time before asking for my help (arrogance or marginalization - both are bad traits). And so a day is gone.

But this morning, I'm chatting with the author of the bad code, and he's thinking it's all good... that they refactored the code, and it's all happy skippy. But it's clearly not. And I'm chatting with him via IM, not in person - at the office.

And I realize that it's times like these that tell me I need to leave this team, and that the management is right. If they are completely satisfied with this kind of work, then I'm not right for this team - at all. There was plenty of playing around yesterday... everyone but me took the lunch hour off to go watch a talk... and no one was in before 9:00 am - except me... and a day was wasted on the solution.

Yeah, this is not the kind of work I want to do.

Explaining Very Bad Storm Clojure Code

Wednesday, November 12th, 2014

Clojure.jpg

This afternoon I'm forced to stay late because two guys in the group needed to have some very bad clojure code explained to them - the guy that wrote it was off for the day, and the code he wrote was bad, to say the least. First, to be fair, this is clojure code, so it can be tough for some to understand, but these two guys aren't bad developers, so I don't think that was the issue.

No, the issue is that you can learn to read clojure code a lot sooner than you can learn to write good clojure code. For example, this was in the most convoluted function:

  (defn build-a-bridge
    [start-point end-point & [lanes type width]]
    (let [lanes (or lanes (/ width 20))]
      ...))

In a few short lines, he'd done so many bad things - things that actually would work - that it made understanding this code very hard for these two guys.

Start with the rebinding:

    (let [lanes (or lanes (/ width 20))]

Yes, it's possible, but in the real code there was so much else going on that it was hard to see (the guys missed it), and so they thought the lanes argument was nil when, in fact, it was being rebound - shadowed - in the let. Yes, it's legal, but it's bad form, and the reason it was done is that the author really didn't understand optional arguments:

  (defn build-a-bridge
    [start-point end-point & [lanes type width]]

Here, the type of the bridge is non-optional - and so, really, is the number of lanes. Yet he was in so much of a hurry to get this done in a day that he slapped things wherever he needed to get them to work. Then, when the optional args came in nil, he used the rebinding to "make them right".
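For contrast, here's what a cleaner signature might look like. This is just my sketch - the map it returns is made up, since the real body isn't shown here - but it makes the required things required and leaves only width truly optional:

  (defn build-a-bridge
    "Lanes and type are required; only width is optional, and its default
    is computed right where it's bound - nothing is quietly shadowed."
    [start-point end-point lanes type & [width]]
    ;; assuming 20 units of width per lane - just the inverse of the
    ;; (/ width 20) guess in the original
    (let [width (or width (* 20 lanes))]
      {:start start-point
       :end   end-point
       :lanes lanes
       :type  type
       :width width}))

Defaulting a genuinely optional argument this way is fine - the sin in the original was doing it to arguments that should never have been optional in the first place.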

To call these hacks is being generous. It was a horrible way to write functional code, and it was all because he wanted to impress - impress management that he's capable of getting the feature out in a day. In truth, it should have been an hour, but a day for him was within reasonable limits. But the time slipped by, and he started to panic.

I think it's fair to say that a panicked rookie clojure developer is bound to write some pretty bad code. And so I had to stay late to explain this code to the others on the team who couldn't figure it out for themselves.

And I'm the one that has to leave... Insanity.

The Effects of Being Marginalized

Wednesday, November 12th, 2014

Bad Idea

Part of the New Direction we are embarking on is to re-examine the topology that runs the data feed. Now I had built this, and run months of experiments to get it to the point it was at. Yet one of the managers talked to a Storm developer in another division, and that developer convinced him - without any knowledge of the specifics of the topology - that what I was doing was "all wrong". So my manager told me to work with this guy to fix it.

So I followed his advice and made the changes.

It didn't help, but it didn't make things worse.

Now today I'm being asked to revisit the topology and "scale it back" because my manager is convinced there is something else in play here. OK, so I do as I'm told - because they are clearly not interested in my opinions, or we'd still be at the topology structure we had before all this.

So I start the experiments with a baseline, and then I start halving things: half the workers, half the bolts - all to get numbers that tell me whether I'm headed in the right direction. I don't fine-tune until I'm really close, and halving is easy to work with while we're still doing coarse tuning.
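To make the halving mechanical, it helps to have the parallelism passed in as plain numbers. This is only a sketch using the Clojure DSL that shipped with Storm at the time - the spout, topic, and bolt names are made up, not the real topology:

  (ns tuning.topology
    (:use [backtype.storm clojure config])
    (:import [storm.kafka KafkaSpout SpoutConfig ZkHosts]))

  (defbolt decode-bolt ["event"] [tuple collector]
    ;; pass-through stub - the real decoding work isn't the point here
    (emit-bolt! collector [(.getString tuple 0)] :anchor tuple)
    (ack! collector tuple))

  (defn feed-spout
    "Stock storm-kafka spout on an illustrative topic."
    []
    (KafkaSpout. (SpoutConfig. (ZkHosts. "zk-1:2181") "feed" "/kafka" "tuner")))

  (defn feed-topology
    "Parallelism comes in as arguments so each experiment is a one-number change."
    [spout-p decode-p]
    (topology
      {"feed" (spout-spec (feed-spout) :p spout-p)}
      {"decode" (bolt-spec {"feed" :shuffle} decode-bolt :p decode-p)}))

  ;; baseline:      {TOPOLOGY-WORKERS 20} and (feed-topology 20 40)
  ;; first halving: {TOPOLOGY-WORKERS 10} and (feed-topology 10 20)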

After about three experiments, we're far better, far faster, and have higher peak capacity during times of load. All is looking very good. Then I look at the Capacity numbers in the Storm UI, and I fine-tune a little. This bolt is a little high, so give it some more... this one is very low, so it doesn't need as much. Not a lot - just a little tweaking.

In the end we're looking a lot better across all the metrics. Good.

And the values?

Yup... just the ones I had before all this started.

The Joy of Using GitHub

Wednesday, November 12th, 2014

GitHub Source Hosting

I really am amazed at the real joy of using GitHub. It's hard to overstate the vision they had... Let's take git, and then build an entire culture and eco-system around it - all in the browser! That's some vision. Yet this morning I was able to put targeted comments on the lines of a pull-request, with syntax-highlighted examples, and see it all in preview mode. It made me smile. This kind of attention to detail is really inspirational to me.

It's more than source control - it's workflow... collaboration - it's a wonderful framework with which to do group development - or personal development, for that matter. I really do enjoy working with it - even on really bad code. And the style... it's not always been perfect, and they have made some changes I might not agree with, but they have done it up to the nines. There is certainly no way someone is going to say they haven't sweated the details.

So even when I have to spend 30 mins making comments on code that should not have been written, I'm happy that I'm doing it in GitHub. What a great tool!

Interesting Ideas with Carl

Friday, November 7th, 2014

Salesforce.com

This morning I was chatting with Carl (that's not his name) - the guy that used to be my manager, but went to the West Coast, and is now looking to move back to Chicago... We were chatting about an idea he had - using Salesforce.com as the source of data for sales projection algorithms. Then I remembered that Heroku had been acquired by Salesforce, and that there's a specialized platform - Heroku Connect - linking Heroku and Salesforce for just this kind of scalable application building.

Heroku also offers Postgres as a service, and they support clojure very nicely. All in all, it sounded like a really nice platform to build this on. I can't wait to see what Carl comes up with next.

Fixing Up a Database Mapping

Wednesday, November 5th, 2014

Clojure.jpg

Today I ran into a legacy bug - some little bit of code that used to work, but hadn't been used in so long that when someone really did try to use it... well, it kinda worked, but not exactly. It was really a simple database column mapping. What I had was:

  (defn hit-to-array
    "Function to format the 'hit' map of data into a simple sequence of the
    values for the map - in a specific order so that they can be easily
    understood."
    [arg]
    (if arg
      (map arg [:variant :country :browser :t-src :b-cookies])))

and I needed to change it to:

  (defn hit-to-array
    "Function to format the 'hit' map of data into a simple sequence of the
    values for the map - in a specific order so that they can be easily
    understood."
    [arg]
    (if arg
      (map arg [:variant :country :browser :traffic_source :users_count])))

because we had done a database restructure, and the rows coming back were now focused on the business names, and not the internal representation. The old code was returning nil, and the new code was properly finding the fields.
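This failure mode is worth spelling out, because Clojure maps are functions of their keys: look up a key that isn't there and you quietly get nil back, not an error. A quick REPL illustration with made-up data:

  ;; a 'hit' row after the restructure (made-up data):
  (def hit {:variant "A" :country "US" :browser "Chrome"
            :traffic_source "email" :users_count 42})

  ;; the old key names simply miss, and missing keys come back nil:
  (map hit [:variant :country :browser :t-src :b-cookies])
  ;; => ("A" "US" "Chrome" nil nil)

  ;; the new key names line up with the restructured columns:
  (map hit [:variant :country :browser :traffic_source :users_count])
  ;; => ("A" "US" "Chrome" "email" 42)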

Similarly, I had to change:

  (defn sale-to-array
    "Function to format the 'sale' map of data into a simple sequence of the
    values for the map - in a specific order so that they can be easily
    understood."
    [arg]
    (if arg
      (-> arg
        (util/update :billings util/to-2dp)
        (map [:variant :country :browser :t-src :d-chan :orders :qty
              :billings :consumers :est_monthly_billings
              :est_monthly_billings_pct]))))

to:

  (defn sale-to-array
    "Function to format the 'sale' map of data into a simple sequence of the
    values for the map - in a specific order so that they can be easily
    understood."
    [arg]
    (if arg
      (-> arg
        (util/update :billings util/to-2dp)
        (map [:variant :country :browser :traffic_source :demand_channel
              :orders_count :billings_total :buyers_count
              :est_monthly_billings :est_monthly_billings_pct]))))

because we dropped the qty field - no one wanted it - and we again changed the names to the more user-focused ones. It's not a big change, but it makes a big difference to the guys now implementing the feature.

I wrote this code quite a while ago, and it never got called - and since there wasn't a really horrible error in it, when someone finally did call it, it didn't crash... it just didn't return the right data. Now it does.

I love these simple fixes - and it's pretty much all about the functional style of coding.

Grabbing Metadata from the Email Opens

Wednesday, November 5th, 2014

Unified Click

I have to admit when I'm wrong - it's just the only decent thing to do, in my opinion. And today I was schooled by my boss's boss's boss at The Shop, who said that we did, in fact, have the user-agent in the nginx logs for the email open messages I was just finishing up on. I was sure we didn't, and when he showed me... well... I instantly apologized. I was looking right past it. Totally my mistake.

The reason this all came up is that the messages have an app_version field; typically, for the other user actions, this is the browser name and version for the web, or the version of the Android or iOS app - something that lets us know a little more about the platform the action is coming from. Sadly, without the user-agent, I was stuck looking at the URL in the nginx log, and that didn't have much of anything really useful.

With the user-agent, I was able to easily parse it - I already had the functions for it - and then drop that in just like all the other messages. It was a very simple fix, but it had a profound effect on the data quality. Much nicer to know this. Much.
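For what it's worth, pulling the user-agent out of an nginx line is a one-liner if the logs use the standard combined format, where the user-agent is the last quoted field. A sketch - the log line here is made up:

  (defn user-agent
    "Pull the last quoted field - the user-agent in nginx's combined format."
    [line]
    (second (re-find #"\"([^\"]*)\"\s*$" line)))

  (user-agent "1.2.3.4 - - [05/Nov/2014:14:02:11 -0600] \"GET /o.gif HTTP/1.1\" 200 43 \"-\" \"Mozilla/5.0 (iPhone)\"")
  ;; => "Mozilla/5.0 (iPhone)"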

Adding Email Opens – Data Can Surprise You

Tuesday, November 4th, 2014

Unified Click

This afternoon, before I leave to go vote, I wanted to add in the code to decode all the email opens that occur in a day. I have actually been working on decoding these messages for a while, but I've had to divert my attention to other, more pressing, needs of late. Finally, this afternoon, I was able to get back to the email opens, and it was nice to close it out.

It was a basic addition to the topology, and while I could have combined it with the other email-based data feed, I have chosen to keep it separate for now - just to be able to monitor the send traffic separately from the open traffic. I will say that I did have one logic error in the code: the email opens arrive as plain nginx log lines, not JSON, so I had to parse each line first and then process it - but I had the initial sanity checks running before the parsing. They always failed, and that was the issue.
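In miniature, the bug and the fix look like this - the helpers are made up for illustration, since the real checks aren't shown here:

  (defn parse-open
    "Pull the request path out of a raw nginx log line (illustrative only)."
    [line]
    (when-let [[_ path] (re-find #"\"GET ([^ ]+) HTTP" line)]
      {:uri path}))

  (defn valid-open? [msg]
    (and (map? msg) (:uri msg)))

  ;; wrong: (valid-open? raw-line) - checking the raw string always fails
  ;; right: parse first, then check
  (defn process-open [raw-line]
    (let [msg (parse-open raw-line)]
      (when (valid-open? msg)
        msg)))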

But a quick logical walk-through and I found the problem, and we were off to the races. What I was surprised about was the very moderate level of traffic at 2:00 pm. It's probably much heavier when the sends go out, so we'll have to watch it in the morning, but it's nice to see that the addition isn't a torrent that floods all processing - immediately.

I was expecting more load - maybe tomorrow will show it to me.

Storm Doesn’t Always Guarantee a Good Balance

Tuesday, November 4th, 2014

Storm Logo

I trust the developers of Storm way too much. Maybe the truth of the matter is that I have assumptions about how I'd have developed certain aspects of Storm, and they seem perfectly logical to me. When those don't turn out to be the way it's really done in Storm, then I feel like I trust the Storm developers too much. Case in point: Balance.

I had a spout with a parallelism hint of 20. I had a topology with 20 workers. Right away, I'm thinking that no matter how the workers are arranged on the physical hardware of the cluster, there should be one - and only one - spout instance per worker. That way the work load is balanced, and assuming there are sufficient resources on each physical box for the configured number of workers, we have the start of a nicely balanced topology.

Sadly, that assumption was wrong, and determining the truth took me several hours.

I was looking for problems in the topology related to its latency - lots of high Capacity numbers in the UI. So I started digging into the spouts as well as the bolts, and what I saw in three of my four topologies was that there were six - out of twenty - spout instances that had zero emitted tuples. Not a one.

I looked at the logs, and they were grabbing messages from Kafka... updating the zookeeper checkpoints... just what they should do. I was worried that Kafka was messed up, so I restarted two of the nodes that seemed to have problems. No difference.

Hours I spent trying to figure this out.

And then I decided to just write down the machine/partition data for each of the spouts and see if there was something like a missing partition somewhere. As I wrote them all down, I saw there were twenty of them - just like there should be... and yet the spouts reporting zeros had the same machine/partition configurations as other spouts that were returning data.

Storm had doubled up some of the spouts, and left others empty, and done so in a very repeatable way. In a sense, I was pretty upset - why log that you're reading messages if you're really "offline"? Why not distribute the spouts evenly? But in the end, I was just glad that things were working. But boy... it was a stressful few hours.

More Tuning of Storm Topology

Monday, November 3rd, 2014

Unified Click

I spent a good chunk of the day trying to get my data stream topology to run smoothly, but it's proving to be very elusive. I've tried all manner of things - including expanding to take up nearly all 20 machines for the one topology. Crazy.

When I got the first version of this up, I wrote an email to my VP saying this is possible, but it's going to be expensive. Very. The hardware isn't cheap, and we're going to need so much of it - due to the inefficiencies in the JVM, in the messaging, in clojure... it's just going to take a ton of metal. I'm not sure they understood the magnitude of this cost.

We're bearing down on the Holiday Season, and we have no excess capacity. None. Twenty worker nodes, all jammed up. Now if we need even 100% excess capacity (a 2x spike) - which is nothing in the Cyber Monday sense - then we need an additional 20 machines. Amazing. For a capacity of about 100k msgs/sec.

At my last finance job we did 3 million msgs/sec on three boxes. Add in greeks and what-if analysis and it's 5 boxes. The idea that we need 40 boxes for 100k msgs/sec is just crazy. We are building on a tech stack that is super inefficient.
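Run the back-of-the-envelope numbers: 3,000,000 / 3 is 1,000,000 msgs/sec per box on the trading stack, versus 100,000 / 40 = 2,500 msgs/sec per box here - about a 400x difference in per-box throughput.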

But we are... and it's not my decision - as hard as I've lobbied for a more efficient solution.