Finding Errors that Aren’t Really There – Harder than it Sounds

I spent several hours this morning trying to find bugs in my ticker plant that simply weren't there. What I mean to say is that I was changing the trie from three levels to four, and at the same time radically simplifying the node structure and tree walking and all of a sudden my unit tests were failing. What?!

OK, I made a bunch of changes, let's see what really changed, and start to figure out the problem.

The first step was to look at the failing tests and see what the problem was. The first indicator was that the size() methods were failing. What I'd done was to create a main node superclass so that I could have just two classes: leaf and branch. So I took:

  struct LeafNode {
    family_t                 *family;
    family_map_t             others;
    boost::detail::spinlock  othesMutex;
  };
  typedef struct LeafNode leaf_t;
 
  struct BranchNode {
    family_t         *family;
    leaf_t           kids[47];
  };
  typedef struct BranchNode branch_t;
 
  struct RootNode {
    family_t         *family;
    branch_t         kids[47];
  };
  typedef struct RootNode root_t;

where the branch and root are different because of the type of their children, to the more uniform:

  struct Node {
    family_t         *family;
  };
  typedef struct Node node_t;
 
  struct LeafNode : Node {
    family_map_t     others;
    boost::spinlock  othesMutex;
  };
  typedef struct LeafNode leaf_t;
 
  struct BranchNode : Node {
    node_t           kids[47];
  };
  typedef BranchNode branch_t;

so that I could have several branches from the root to the leaf. This was a good idea - but it wasn't exactly how I had it originally implemented. In the cut-n-paste refactoring, I ended up leaving in the family pointer in the branch. This had the added benefit of not generating any errors or warnings in GCC, but when I constructed the subclass, the super worked just fine, and then the subclass threw a junk non-NULL pointer in for family.

This took me far longer than I had hoped to find. I was tracking down the constructor calls, seeing the family set to NULL by the superclass, and then all of a sudden being set to some junk in the main class. Finally, I was looking at the header file and noticed that I hadn't removed the family reference in the branch class. Fixed that, and off we go.

Then things started working... but they seemed to be going very slowly. In fact, it seemed to be going horribly slowly. So I started to look at what I was doing and see what might be causing the hold up.

Was it the looping in the tree navigation? Was it the use of subscripts in the elements of the tree path? All these didn't seem right. But in case, I refactored the code again just to have a more direct path with direct variables. No difference.

Crud.

Then I realized that I was doing four things in each pass of the loop, and given that, the times were really better than I had in the past. I took out the "straight-through" code, and back to the looping, and even a little better.

So... two mistakes in one day for not paying attention. That's how a professional does it. Yeah... right.