Really Nasty Data Archeological Dig

I know it needed to be done, and I know someone had to do it, but that doesn't make it any more fun than it already isn't. Digging in the data to find out why we aren't matching up merchants and demand in Philadelphia is no fun at all. It's a lot of data with very little pattern to it, and a whole lot of problems. But that's what I was doing for several hours today. The pain and suffering was really compounded by the complete lack of real thought put into this as we headed into the meeting.

Overall, I was very angry at myself for not pushing back. I should have. I know that now, but it's that blasted work ethic thing that causes me to say "Yes" when I should be saying "Hold on a sec…"

The problem is that we're getting demand and merchants to fulfill that demand, and the assumed "match" here should be very high. Why? Just because. They really have never looked at this and have no idea what it should be, but "instinctually" many think that it should be "very high" - like 90%. So when the first runs came out with it being more like 50%, they wanted to know why. I totally agree.

But where we diverge is in the How?

One program manager suggested I send a 5000+ line Excel file where the hierarchical JSON data was somehow magically "flattened" to make it easy for anyone to look at the data and determine why the merchants weren't matching. Thankfully, I had the strength of character to say "No" to that.

But that wasn't until after I heard another request to log all 5000+ merchants against all 1500+ demands - yielding more than 8 million log lines. Nope. That's just plain silly.

I wanted to get to the bottom of this to be sure, but I wanted to do it in a way that makes at least a little sense. And looking at 8 million log lines isn't it. So I started building a few CouchDB temporary views and started looking for what wasn't being matched and why.

Turns out there were two major issues: the demand wasn't supplying sufficient 'service' coverage to pin enough merchants, and the zip codes on the merchant data were really pretty horrible. Call me 'Indy' on this - it only took me about 90 minutes to find these reasons and document them for the group. Nice. Clean. Efficient.

Nothing like looking at 8 million log lines.
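
For anyone who wants the idea rather than the log lines, here is a minimal sketch in Python (not the CouchDB views actually used). It puts each merchant into the first bucket that explains why it fails to match, then counts the buckets. The field names `zip` and `services`, the five-digit zip check, and the matching rule itself are assumptions made up for illustration, not the real schema.

```python
from collections import Counter, defaultdict


def unmatched_reasons(merchants, demands):
    """Count merchants by the first reason they fail to match any demand.

    Assumed (hypothetical) record shape:
    {"zip": "19103", "services": ["plumbing", ...]}
    """
    # Index the demand once: zip code -> every service requested there.
    services_by_zip = defaultdict(set)
    for demand in demands:
        services_by_zip[demand["zip"]].update(demand["services"])

    reasons = Counter()
    for merchant in merchants:
        zip_code = str(merchant.get("zip") or "").strip()
        if not (len(zip_code) == 5 and zip_code.isdigit()):
            reasons["bad_zip"] += 1              # missing or malformed zip
        elif zip_code not in services_by_zip:
            reasons["no_demand_in_zip"] += 1     # valid zip, no demand there
        elif services_by_zip[zip_code].isdisjoint(merchant.get("services", [])):
            reasons["service_not_covered"] += 1  # demand there, wrong services
        else:
            reasons["matched"] += 1
    return reasons
```

One pass over the demand plus one pass over the merchants produces a handful of counts, which is a lot easier to read than a line for every merchant and demand pair.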