Scaling Problems with Salesforce and CouchDB

CouchDB

Today I've been having real trouble trying to get our system scaled up to new hardware in the datacenter. Moving from hosts in Amazon EC2 to machines in our own datacenter is a big step in the right direction, and going from iffy bandwidth in EC2 to solid switches in the datacenter, and from 2 cores to 24, is something we simply have to do in order to scale up to handle the global data load that we're going to need.

The problems I've run into today are all about loading. I've already added a queueing system to the CouchDB interface to minimize the connections to the Couch server so as not to overload it - there were times when the Couch server would simply shut down its socket listener, and therefore refuse all updates sent from the process. Not good.
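For anyone curious what that kind of queueing looks like, here is a minimal sketch in Python. It is not the code we run. The class name, `send_batch`, `batch_size`, and `max_pending` are all made up for illustration. The idea is that every caller drops documents onto a bounded queue and a single worker drains it in batches, so the Couch server sees one stream of writes instead of one connection per caller. In a real setup `send_batch` might POST the batch to CouchDB's `_bulk_docs` endpoint, but that part is an assumption and is left out.

```python
import queue
import threading


class BatchingWriter:
    """Funnel all document writes through one worker thread."""

    _STOP = object()  # sentinel that tells the worker to exit

    def __init__(self, send_batch, batch_size=10, max_pending=1000):
        # send_batch is a hypothetical callable taking a list of docs.
        self._send_batch = send_batch
        self._batch_size = batch_size
        # Bounded queue: producers block when it is full, which gives
        # back-pressure instead of piling up unbounded work.
        self._pending = queue.Queue(maxsize=max_pending)
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, doc):
        """Queue one document for writing; blocks if the queue is full."""
        self._pending.put(doc)

    def close(self):
        """Flush everything queued so far and stop the worker."""
        self._pending.put(self._STOP)
        self._worker.join()

    def _run(self):
        stopping = False
        while not stopping:
            batch = []
            item = self._pending.get()  # wait for the first item
            while True:
                if item is self._STOP:
                    stopping = True
                    break
                batch.append(item)
                if len(batch) >= self._batch_size:
                    break
                try:
                    item = self._pending.get_nowait()  # grab more if ready
                except queue.Empty:
                    break
            if batch:
                self._send_batch(batch)
```

Usage is just `writer.submit(doc)` from wherever updates are produced, then `writer.close()` on shutdown. The single worker is the design choice that matters here: it caps the load on the server no matter how many producers there are.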

Salesforce.com

There are also a lot of problems today with Salesforce. I don't think they expected the kind of loads we're delivering. This morning, at 3:00 am, Salesforce called the support guys at the shop and told them that a process was bringing one of their sandbox clusters to its knees, and that process, it turns out, is running on just one of the four boxes we need to bring online. They're having a hard time handling this much - I can't imagine what's going to happen when we try to really ramp it up.

I'm starting to have real concerns about both these endpoints of the project. I know there's a lot of data getting moved around, and while we're able to handle it, it's these endpoints that are having the hardest time. I've talked with the project manager and the technical manager about this, and I think we need to start thinking about potential bail-out scenarios.

It's certainly possible to read from Salesforce. We're planning on re-doing the complete demand system, so there shouldn't be an issue there. Persistence? Go back to MySQL or PostgreSQL and store it all in tables. The data is getting pretty nicely finalized, so it should be possible to design a clean schema. Save it all in a SQL database, make a simple service that reads and caches this data and offers it up to the clients, and the pages already built on the existing data sources can be modified pretty easily.
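To make the "simple service that reads and caches" idea concrete, here is a rough sketch, again in Python. It uses the standard library's `sqlite3` purely as a stand-in for MySQL or PostgreSQL. The `records` table, its `id` and `payload` columns, the class name, and the TTL value are all assumptions for illustration, not our actual schema.

```python
import sqlite3
import time


class CachedStore:
    """Read-through cache in front of a SQL table."""

    def __init__(self, conn, ttl_seconds=60.0, clock=time.monotonic):
        self._conn = conn      # any DB-API style connection; sqlite3 here
        self._ttl = ttl_seconds
        self._clock = clock    # injectable so expiry can be tested
        self._cache = {}       # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        hit = self._cache.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]      # still fresh: skip the database entirely
        # Miss or expired: read from the (hypothetical) records table.
        row = self._conn.execute(
            "SELECT payload FROM records WHERE id = ?", (key,)
        ).fetchone()
        value = row[0] if row else None
        self._cache[key] = (now + self._ttl, value)
        return value


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (id TEXT PRIMARY KEY, payload TEXT)")
    conn.execute("INSERT INTO records VALUES ('a', 'v1')")
    store = CachedStore(conn, ttl_seconds=30)
    print(store.get("a"))
```

The trade-off is the usual one: clients may see data up to one TTL old, in exchange for the database only being hit once per key per TTL window.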

Odd to think that I was looking at MySQL before Couch popped up. Funny thing is, I have a strong feeling that Salesforce can come up with hardware that makes the grade, but it's the bugs in Couch that worry me the most. You just gotta wonder if it's in the Erlang code, or on the boxes, or what.

So many unknowns, but it's clear that we can't scale to one nice box - there's no way we're going to make it work globally without some serious changes.