You Mean Hadoop Isn’t Perfect?


I read with a giggle this article off Slashdot about the problems companies are having making Hadoop actually work in their environments. This is not in the least surprising to me. Hadoop is a nice distributed storage environment, but so is a very nice SAN. Both can store a lot of files, and do it very quickly, with redundancy, but one is putting an entire computer on each "disk", while the other is allowing the disks to just be... well... disks.

The promise (hype) of Hadoop is that by distributing the computing power along with the storage, map/reduce jobs can be done quickly and easily, and you get old-style SQL performance with as much online storage as you can muster.
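For anyone who hasn't run into map/reduce before, here's a minimal single-process sketch of the idea in Python, counting words. The function names (map_phase, shuffle, reduce_phase, word_count) are my own illustration and not Hadoop's API. In a real cluster the map and reduce steps would run on many boxes at once, and the grouping step in the middle is handled by the framework.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key (hypothetical stand-in for the framework's shuffle)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all the values for one key into a single count."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["Hadoop is nice", "nice but not fast"]))
```

The point of the model is that map_phase only ever looks at one line and reduce_phase only ever looks at one key, so in principle both can be spread across as many machines as you have. Whether that turns out to be fast in practice is a separate question.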

But the truth of the matter is far different, as this article in the WSJ attests.

Hadoop is nice, but it's not fast. It's good for lots of storage space, but so is a SAN. It's nice for small map/reduce jobs, but so is CouchDB. Then again, you can't scale CouchDB to any size you want, either. There will always be limits.

I know folks who are looking at Cassandra, and really like it. It's not as general-purpose as Hadoop, but it's targeted at the problem of massive storage with SQL-like access. Advocates of Hadoop will say "Use Spark - get SQL that way!" and that's possible, but then that's not Hadoop, is it?

You can use many different caching schemes to make any storage scheme work faster, but the promise of Hadoop was that you wouldn't need them. That's where it fell over. I have no doubt that Hadoop can do a lot of really good things in a lot of very specialized environments, but it's not the silver bullet worth $2 billion. It's a nice open source map/reduce system, like CouchDB, but on as many boxes as you want. It's nice... but it's not what people hoped it would be.

Too many distractions in systems development these days. Just too many.