Archive for the ‘Coding’ Category

Working with CouchDB’s Map/Reduce Framework

Wednesday, August 22nd, 2012

CouchDB

This afternoon I've been doing a lot with CouchDB's map/reduce framework for querying data out of the database. The terminology is pretty simple: a design document can hold multiple Views, where each View has a Map function that looks at each document in the database and returns something based on the inspection of its data, and an optional Reduce function that takes all the results of the Map calls and reduces them to a smaller dataset.

It's pretty standard in a lot of languages: first you operate on the individual elements in a collection, and then you summarize those values. In CouchDB it's all done in Javascript. That's not bad; I've done a lot of Javascript in my day, so it's pretty easy to get back into the swing of things.

One interesting note is that CouchDB is written in Erlang, and while I don't see myself digging into the guts of this thing, it's good to know where it all comes from, as it makes it a lot easier to understand why they chose Javascript, for instance.

Anyway, let's say I want to see all the merchants that have no OTCs assigned to them. I'd create a Temporary View in the CouchDB web page, and then in the View Code I'd have something like this:

  function(doc) {
    // only look at our run-results documents that have no OTCs assigned
    if (doc.meta.label == "QuantumLead.results" &&
        doc.otcs.length == 0) {
      // multi-part key: division first, then the creation timestamp
      var key = [doc.division,
                 doc.meta.created];
      var blob = { name: doc.merchant.name,
                   sf_id: doc.merchant.sf_id };
      // one row in the view output per matching merchant
      emit(key, blob);
    }
  }

The interesting part here is that emit() is really the action item in this function. When we want to add something to the output of this Map function, we call emit() with the key as the first argument and the value as the second. The key, as shown here, can be a multi-part key, and the value can be any Javascript object.
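The Map above has no Reduce paired with it, but as a sketch of how one would bolt on (this is just an illustration, not something I've wired up yet), counting the matching merchants per key is about as small as a Reduce gets:

  function(keys, values, rereduce) {
    // on the first pass, just count the emitted rows;
    // on a re-reduce, sum up the partial counts
    if (rereduce) {
      return sum(values);
    }
    return values.length;
  }

(CouchDB also ships with a built-in _count reduce that does exactly this, so you rarely have to write it by hand.)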

The thing I like about the use of Javascript here is that the attributes look like "dotted methods" and not hash members. This makes it so much easier to reference the data within a doc by just using the key names and dots. Very nice use of Javascript.

So now that I have my first few Views and Documents in the system, I need to work on getting things out of these calls, and into some nicely formatted output for the important demo that's coming up.
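For reference, once a view like this is saved into a design document (the names here are just made up), pulling one division's merchants out is a single HTTP GET, something like:

  $ curl -g 'http://localhost:5984/mydb/_design/reports/_view/no_otcs?startkey=["West"]&endkey=["West",{}]'

and the multi-part key makes those startkey/endkey range queries feel very natural.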

Getting Ready for an Important Demo

Wednesday, August 22nd, 2012

I just got an email from our project manager about a demo he's set up for the COO. The email included a response from the COO about the significance and importance of this project, and how it'll play into the long-term plans for this place. It's pretty scary to think of.

So all of a sudden, I'm feeling that same pressure to perform that I felt for 16 years in Finance. It's the first real demo with this level of visibility since I joined The Shop, and while it might be ho-hum for a lot of the guys, for me it's the "first impression" this guy is going to have of me and my additions to the team. It's not life-or-death, but it's important, and I want it to go well.

So I'm a little nervous… So many things to get finished and in place for the demo… it's not like we'll have time to run through it beforehand, so it'll be wing-it all the way.

Yikes!

Problems Deploying CouchDB to EC2 Servers

Wednesday, August 22nd, 2012

Amazon EC2 Hosting

This morning Jeff is still having problems getting CouchDB deployed to our Amazon EC2 machines, and it's almost certainly due to the deployment system that's in place in The Shop. It's something I completely understand, but it's also based on the idea that you can't trust anyone. That, and it's an old RedHat-based distro that I know from experience is not as easy to deal with as something like a more recent Ubuntu.

Still, it's just the way it has to be, as that's the only way Prod Ops can deal with things, so there's no real way around it. The problem is that you need to be able to build the code on one box, package it up - similar to an RPM or a deb package - and then deploy it across a lot of machines. All well and good, but Jeff is having a horrible time getting CouchDB 1.2.0 compiled on his build box.

He's trying a few things, and even seeing if other folks around here have any ideas. But the latest attempts have left something that looks like CouchDB running on the server - yet when I go to add things to it, I get a nasty stack trace about 'Connection refused' after some kind of timeout. I can insert about 1500 of the 2500 documents I need to, and then it stops.

At the same time, I was able to use Homebrew to simply:

  $ brew install couchdb

and then follow a few instructions about getting it to run at login, and that's it. It Just Works.
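For the record, the 'run at login' part amounted to something like this - the exact paths and plist name depend on the Homebrew version, so take it as a sketch:

  $ ln -sfv /usr/local/opt/couchdb/*.plist ~/Library/LaunchAgents
  $ launchctl load ~/Library/LaunchAgents/homebrew.mxcl.couchdb.plist

and from then on CouchDB is just there whenever I log in.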

I would say that this would also be the case if we were looking at standard Ubuntu boxes in EC2 or Rackspace, and using yum or apt-get. The real question is why we need to build these custom packages for Open Source software when it's so easy to just install it.

Again… no way to know… no way to answer. It just is and that's it.

Getting Acquainted with CouchRest

Tuesday, August 21st, 2012

CouchDB

Jeff, the guy recommending CouchDB as a document database for our app, suggested that I look at CouchRest as a nice ruby client for CouchDB. And the docs look impressive - so far as they go. It's pretty easy to open up and use a database:

  require 'couchrest'
 
  @db = CouchRest.database('http://localhost:5984/megafun')

and then saving a document is pretty easy as well:

  @db.save_doc({ one: 1, two: 2 })

even doing a bulk store of multiple documents is easy:

  @db.bulk_save([{ one: 1, two: 2 },
                 { one: 1, three: 3 },
                 { one: 1, four: 4 }])

But the main docs don't really say anything about using a proxy, and in The Shop, with lots of hosts in Amazon's EC2, there's a lot of proxy work that we have to do.

Specifically, to get to Amazon's East datacenter we have to use a redirecting proxy on our laptops, and there was just nothing in the docs about using one. So I had to dig into the code for CouchRest, and thankfully, having learned a bit of ruby in the last few weeks, I found the support was already there!

Because the same code also runs on servers in EC2 east, I couldn't hard-code the proxy usage, but using the pattern we have used for other proxy-based access, I was able to very quickly set up the config files for the CouchDB databases, and then in the code say:

  class Database
    def self.database
      # route requests through the proxy only when the config says to
      CouchRest.proxy(AppConfig.database.proxy_uri) if AppConfig.database.use_proxy?
      # lazily create (and cache) the handle to the configured database
      @db ||= CouchRest.database(AppConfig.database.uri)
    end
  end

and then in the rest of the Database class I could just reference this database method.
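So elsewhere in the app it's as simple as something like this (the document here is just a made-up example):

  # grab the (possibly proxied) database handle and save a document
  db = Database.database
  db.save_doc({ merchant: { name: 'Example Merchant' }, otcs: [] })

and whether or not the proxy gets used is entirely a matter of the config.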

The things I'm doing have a lot more to do with how we want to organize the data, and not the logistics of CouchDB itself. We'll have to come up with some standards on the document format to enable the selection, aggregation, etc. that we're going to need in this project. Not a bad start, and it's looking pretty good right now.

I just need Jeff to get the server spun up on one of our EC2 boxes.

Placing my WordPress CodeHighlighterPlus on GitHub

Tuesday, August 21st, 2012

WordPress

This morning I thought I'd spend a few minutes getting my fixes to the existing WordPress plugin - CodeHighlighter - up and into the WordPress site so that I could easily update it, etc. After all, there might be several folks looking for something like what I wanted, and not finding it in the existing tools. I downloaded the version I'd hacked up on my site, placed it into a git repo, added a README.md, and then pushed it up to a new GitHub repo. I was then hoping to simply publish it on the WordPress site and be done with it.

Silly me… Why should it be that easy?

Turns out, the WordPress plugin site is built on SVN: you give them your code, and then they give you access to an SVN repo (SourceForge?) where you put it. A few years ago, I wouldn't have minded, but now… SVN… really?!? Nah… I think I can do just fine without that cruft.

I can simply use GitHub and clone the repo in any WordPress install that I have. There's no need for anything fancier than that. In any event, I'm guessing that in a little while the WordPress team will switch to GitHub anyway, as the number of SVN users is going to dwindle the way Perforce, Visual SourceSafe, PVCS, etc. all have. There's just no way to keep a project looking up to date with SVN.

So it's there, and it's easy to use; you just have to be a little smarter than the average WordPress blogger, but that's OK. I am, and that's all that really matters.

UPDATE: by simply getting into the wp-content/plugins/ directory and doing:

  $ git clone git@github.com:drbobbeaty/CodeHighlighterPlus.git

and then using the WordPress Plugins page, I can disable the old version, enable this new clone, and then delete the old one. After this, everything is OK, and it's all controlled by the GitHub repo.

To be sure, this isn't going to auto-update from the WordPress Plugins page, but I didn't have to mess with the SVN repo either - and that's a win for me.

Starting to Use CouchDB

Monday, August 20th, 2012

CouchDB

The decision was made late last week that we really should try to use some document database - like CouchDB - for saving our run state for metrics extraction and historical analysis. We had initially planned on using MySQL, as it's the dominant database in The Shop, but that was going to require that we flatten the ruby objects to some set of key/value pairs, or come up with some ORM that would properly store and load things. Neither was really super attractive, and so we had Jeff in the group take a look at CouchDB. I knew MongoDB wasn't the answer, because I'd used it previously, but there were a few nice-sounding things CouchDB had that could really tip the scales in its favor.

Most notable were the views. These are basically the same thing you'd expect in a SQL database, but in the document sense they can be arbitrary map/reduce schemes implemented in Javascript and stored in CouchDB. This means that we can make some interesting views for the metrics that gather data across different documents and make it presentable in a very simple and efficient way.
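As a rough sketch of what I have in mind (all the field names here are placeholders at this point), a view that totals up some per-run metric by division and day could be as small as:

  // Map: one row per run document (field names are placeholders)
  function(doc) {
    if (doc.division && doc.run_date) {
      emit([doc.division, doc.run_date], doc.deal_count);
    }
  }

  // Reduce: CouchDB's built-in summing reduce
  _sum

and CouchDB keeps that aggregation indexed and ready to query.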

I'm thinking that generating the majority of the metrics is possible in this way, and then stuffing these values into a visualization system shouldn't be too bad. We'll have to see, as I'm still in the very early stages of using this, but it certainly has some interesting potential for what we're trying to do.

The Power of Positive Attitude? I’d Like to Think So

Friday, August 17th, 2012


I don't know… maybe positive thinking really does work. This week I had a run-in with someone that was amazingly uninterested in being flexible. The next day, they changed the API on their system. I was starting to believe that this was just going to be status quo when dealing with them, but then today, I received an email and it amazed me.

Their leader totally reversed his position, and was asking me how I wanted the data. This was a real shock as I had never expected it, and was settling myself in for a series of constant changes to the API and fixes to the ETL to keep things working. Very nice to see.

I spent about 5 minutes thinking about it, and decided that what we originally had was a good plan, and we just needed to keep going. The original plan was to have an array of maps (this is all JSON) where each map represented a possible match and the array was a logical OR. This allows them to change the nature of the individual maps, and the logical OR can include a region and a zip code… or a series of zip codes… or regions… all of this was very well thought-out for the geo-tagging.

I wanted that back, and then asked that for the taxonomy of the demand, we do something very similar - an array of maps where each is a tuple of the classification of the demand based on the default taxonomy. This will also be logically OR-ed to get the possible classifications that this demand can fulfill.
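Roughly, the shape I'm after looks something like this - the field names are just for illustration:

  {
    "locations": [
      { "region": "Midwest" },
      { "zip": "60601" },
      { "zip": "60602" }
    ],
    "taxonomy": [
      { "category": "Food & Drink", "subcategory": "Pizza" },
      { "category": "Food & Drink", "subcategory": "Italian" }
    ]
  }

where each array reads as a logical OR of the maps inside it.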

In short, I think it's clean and clear, and makes a lot of sense. I hope they accept it.

In any case, I'm shocked that I might have had an effect on the change of heart, but who knows? Maybe I'll step it up with the people on the train at night and see if they get a little nicer!

Going to be Digging into MySQL

Friday, August 17th, 2012

MySQL

Today I'm going to be spinning up a MySQL instance in Amazon's EC2 for holding the metrics of our application. The goal is to have historical records for detecting trends and A/B testing to see the effects of the changes. It's all about "save it - report it", and I'm looking forward to using MySQL in this case. I've not done a ton with MySQL, and it's a huge deal here at The Shop, so it'll be nice to see it all done right, with good tools, and monitoring as well as good backups.

I've found an interesting OS X client for MySQL - and I'll be digging into it as soon as I get things set up. Or maybe I'll just stick to the command line, as I like that best on PostgreSQL. Who knows… it's exciting to be moving in this direction this morning. I just hope I can make some real progress today.

Protecting Yourself from API Changes

Thursday, August 16th, 2012


Today, while working on a bunch of other stuff, I got a chat from the data scientist in Palo Alto asking me why the service tag changed to primary_service, and what else changed. Turns out that the folks I'm getting the data from have decided to change the data format, without telling me about it. Nice.

The problem wasn't the fix itself - it only took me an hour or so to change the ETL to get the data back to where I needed it. The problem was that they decided to change things in ways that weren't required, without telling us anything about it! I know this is often the case, but we're supposed to be on the same team, and they just didn't bother telling us.

I know the point of the ETL code is to fix that up, and allow us to insulate ourselves from changes on their part, but it's also the timing. These guys could not possibly have been in my position yesterday or they'd never have done it. It's one of those cardinal rules of making an API - once you have it, you stick to it! If you have to change it, it's because it doesn't work, and then you open a discussion - even a simple "heads-up" email, to make sure that people are aware of the change that's coming, and why you had to make it.
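That insulation really is just a thin layer of normalization. As a tiny illustration (the only field names here are the ones from this change, and the method itself is purely hypothetical), the ETL now just has to tolerate either name:

  # accept either the old 'service' tag or the new 'primary_service' one,
  # so a rename on their side doesn't ripple through the rest of our code
  def normalize(record)
    record['service'] ||= record['primary_service']
    record
  end

Simple enough, but it's exactly the kind of churn a heads-up email would have avoided.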

It's not horrible, and I liked that I got new data that made the results better, but it was the way they did it that was annoying. I fear that this isn't the last such occurrence from these guys.

It’s Amazing to See Such Inflexibility

Wednesday, August 15th, 2012


I just got out of a meeting, and I have to say that I'm really quite shocked to see someone at The Shop as incredibly inflexible as Jim. This is the first time I've met Jim, and while I haven't been impressed by his emails, and specifically his responsiveness to my requests, I always wrote it off as him being exceptionally busy. Busy people have a lot going on - I get it. But today was the first face-to-face meeting, and during that, Jim really blew my socks off.

Jim is vending some demand data, and it makes perfect sense to everyone but Jim to have him include the current inventory in his service. After all, if there's some change in inventory, it's going to really affect what we need to get, because excess inventory will shrink demand, and a big run on something will push it up. But not in Jim's mind.

No, Jim was making a theoretical demand calculation, and in that, I can see a value. But he's also including the inventory on hand at the time of his run - which is once a week - so it's going to make it harder for us to know what current inventory-affecting events to include in the demand we get from Jim. We have to look at the time of his run, and then look at all the events to see how they might affect the demand.

It'd be far simpler to have Jim vend the raw demand and then we can always correct it by the current inventory. Much simpler. No problems in potentially lost transactions. Better.

But not to Jim. Holy Cow! It's been a long time since I've seen a professional programmer push back so hard on a feature. It was speculated that it was all about the level-of-service issue… after all, hitting Jim's service once a week and saving the results is far easier on Jim than having to make sure it's up all the time and vending the right data. I just can't get over the sheer laziness of this guy.

He's a developer. What's he got to do but develop? How hard is it to make it work 24x7? Not too hard, I've done it in a bunch of technologies over the years. Stop being so bloody lazy, Jim!

Holy Cow!