Finding Allocation Errors with TCMalloc – Ain’t Easy


Today I learned a very valuable lesson: TCMalloc really doesn't have bugs, but it sure looks like it does, and stack traces can be very deceptive at times. I had been getting a series of segmentation faults in some code, and the backtrace always looked about the same, something like this:

  #0  0x0002aac607b388a in tcmalloc::ThreadCache::ReleaseToCentralCache
        (tcmalloc::ThreadCache::FreeList*, unsigned long, int) ()
        from /usr/lib/libtcmalloc.so
  #1  0x0002aac607b3cf7 in tcmalloc::ThreadCache::Scavenge() ()
        from /usr/lib/libtcmalloc.so
  ...

The lesson learned, after googling this backtrace, is that TCMalloc doesn't have bugs; it's rock solid. However, it can't reliably trap double frees or illegal frees, so when it finds that its structures are corrupted, it bails out and appears to have a bug, when the problem was really in the 'hosting' code. Meaning: user error.
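
To see why the crash can land so far from the real mistake, here is a toy model of a free list. It is not TCMalloc's actual implementation: ToyFreeList and everything in it are made up for illustration, and the only assumption is that freed blocks are chained through a pointer stored inside the block itself, which is a common allocator design. Freeing the same block twice makes the list point back at itself, and the damage only shows up on later allocations:

```cpp
#include <cassert>
#include <new>

// Hypothetical link record, written into the freed block itself.
struct Node {
    Node* next;
};

// Toy free list for illustration only; not TCMalloc code.
class ToyFreeList {
public:
    // "free": store the link inside the block and make it the new head.
    void push(void* block) {
        head_ = new (block) Node{head_};
    }

    // "malloc": hand back the head block, or nullptr if the list is empty.
    void* pop() {
        Node* n = head_;
        if (n != nullptr) {
            head_ = n->next;
        }
        return n;
    }

private:
    Node* head_ = nullptr;
};

// Simulates a double free and reports whether two later "allocations"
// were handed the very same block.
bool double_free_hands_out_same_block_twice() {
    alignas(Node) unsigned char block[sizeof(Node)];
    ToyFreeList list;
    list.push(block);  // legitimate free
    list.push(block);  // double free: the block now links to itself
    void* first = list.pop();
    void* second = list.pop();
    assert(first == block);
    return first == second;
}
```

Nothing goes wrong at the moment of the second free. In a real program, the two callers that received the same block would stomp on each other's data, and the allocator would trip over the mess later, in whatever code happened to touch the list next.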

So I started looking at what was leading up to this in the backtrace. I worked on this for the better part of a day, and reformulated the code several times. In the end, I was totally unable to correct the problem. Very frustrating.

Then it hit me - maybe it wasn't in the calling stack? After all, this same code was working quite well for months in other apps. This was the 'Eureka moment' for this guy... it wasn't the call stack at all - it was somewhere else in the code. So I started grepping for all the 'new' and 'delete' instances in the code. Sure enough... I found a few problems.

It's so easy for junior guys to miss these things, and they did. I only look for them because I've been bitten so badly (like this) so many times - it's the first thing I do when building a class with heap support - make the allocations and deallocations match. No two ways about it.
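
As a rough sketch of what "matching" means, here is a hypothetical Buffer class. It is not from the code in question; the name and members are invented for the example. It avoids the two usual suspects: pairing new[] with plain delete, and letting a compiler-generated copy share one pointer between two objects so that both destructors free it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical heap-owning class; every new[] has exactly one delete[].
class Buffer {
public:
    explicit Buffer(std::size_t n) : size_(n), data_(new char[n]()) {}

    // Deep copy: the new object gets its own allocation.
    Buffer(const Buffer& other)
        : size_(other.size_), data_(new char[other.size_]) {
        std::copy(other.data_, other.data_ + other.size_, data_);
    }

    // Allocate the replacement first, then release the old storage once.
    Buffer& operator=(const Buffer& other) {
        if (this != &other) {
            char* fresh = new char[other.size_];
            std::copy(other.data_, other.data_ + other.size_, fresh);
            delete[] data_;
            data_ = fresh;
            size_ = other.size_;
        }
        return *this;
    }

    // Array form of delete, matching the array form of new.
    ~Buffer() { delete[] data_; }

    std::size_t size() const { return size_; }
    char* data() { return data_; }
    const char* data() const { return data_; }

private:
    std::size_t size_;
    char* data_;
};

// Usage check: a copy owns separate storage, so destroying both is safe.
void demo_independent_copies() {
    Buffer a(4);
    a.data()[0] = 'x';
    Buffer b(a);
    b.data()[0] = 'y';
    assert(a.data() != b.data());
    assert(a.data()[0] == 'x');
}
```

Without the copy constructor and assignment operator, copying a Buffer would duplicate the raw pointer, and the second destructor to run would free memory that was already freed. In new code, std::vector or std::unique_ptr would take care of this bookkeeping.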

I'm hoping that this fixes these problems, and it's looking good so far. Just awfully tricky when the bug is nowhere in the stack. Wild.