Fantastic Speed Boost on My uint128_t

Late yesterday I realized that I had some lingering code still using the uint128_t I had created to uniquely map the instrument names into a number space for use as keys in the likes of std::map. The code I had originally written worked, but it wasn't nearly fast enough, so I stopped using it (or so I thought) and switched to the trie.

But it wasn't really gone. I still had a use for it in my client code, so I decided this morning to fix up the implementation into something a lot faster - hopefully in the same ballpark as a uint64_t for map usage.

The first thing I did was to add a timed test section to my testing code for the conflation key - that's what I called the 128-bit value generated from the name of the instrument. It was pretty simple:

  bool        error = false;
  int         cnt = 100000;
  log.info("starting the uint64_t tests...");
  boost::unordered_map<uint64_t, int> little;
  uint64_t    startTime = msg::TransferStats::usecSinceEpoch();
  // ...first the puts
  for (int i = 0; i < cnt; ++i) {
    little[i] = i;
  }
  // ...now the gets
  for (int i = 0; i < cnt; ++i) {
    if (little[i] != i) {
      error = true;
      log.error("uint64_t test failed for i=%ld", i);
    }
  }
  uint64_t    totalTime = msg::TransferStats::usecSinceEpoch() - startTime;
  log.info("%d uint64_t tests completed in %ld usec", cnt, totalTime);
 
  log.info("starting the uint128_t tests...");
  boost::unordered_map<uint128_t, int> big;
  startTime = msg::TransferStats::usecSinceEpoch();
  // ...first the puts
  for (int i = 0; i < cnt; ++i) {
    big[i] = i;
  }
  // ...now the gets
  for (int i = 0; i < cnt; ++i) {
    if (big[i] != i) {
      error = true;
      log.error("uint128_t test failed for i=%ld", i);
    }
  }
  totalTime = msg::TransferStats::usecSinceEpoch() - startTime;
  log.info("%d uint128_t tests completed in %ld usec", cnt, totalTime);  

What I saw in my initial tests was horrible. It wasn't even close - there was more than a 300x difference between the two. When I looked at the way I'd implemented the uint128_t, it made a lot of sense:

  private:
    uint8_t   mBytes[16];

I had 16 individual bytes as the data ivar for the object. That makes a certain amount of sense, as it never suffers from host/network byte ordering issues, and the code looked fast - but there were byte-by-byte loops and a lot of calls to memcpy(). So I needed to take a new approach, and I decided to go to the other extreme - two uint64_t values as opposed to sixteen uint8_t values.
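
So the private data became two 64-bit words - and as a nice side-effect, the equality test and the hash that boost::unordered_map leans on became a couple of word operations instead of byte-by-byte loops. The member names below are my sketch of the idea, not the actual code:

  private:
    uint64_t  mHigh;
    uint64_t  mLow;

  // equality is now two word compares instead of a 16-byte loop
  bool uint128_t::operator==( const uint128_t & anOther ) const
  {
    return ((mHigh == anOther.mHigh) && (mLow == anOther.mLow));
  }

  // boost::hash finds this free function via ADL, so the map Just Works
  // (assumes <boost/functional/hash.hpp> and friend access to the words)
  std::size_t hash_value( const uint128_t & aValue )
  {
    std::size_t   seed = 0;
    boost::hash_combine(seed, aValue.mHigh);
    boost::hash_combine(seed, aValue.mLow);
    return seed;
  }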

This changed a lot of the code. For one, it made a lot of sense to write my own hton() and ntoh() functions for the class so they'd look like the system calls ntohl() and the like. It really wasn't all that hard, either:

  uint128_t hton( const uint128_t & aValue )
  {
    uint128_t     retval;
 
    // get the byte pointers to the source and destination
    uint8_t *dest = (uint8_t *)&retval;
    const uint8_t *src = (const uint8_t *)&aValue;
    // now map the bytes one-by-one from source to destination
    dest[0]  = src[7];
    dest[1]  = src[6];
    dest[2]  = src[5];
    dest[3]  = src[4];
    dest[4]  = src[3];
    dest[5]  = src[2];
    dest[6]  = src[1];
    dest[7]  = src[0];
    dest[8]  = src[15];
    dest[9]  = src[14];
    dest[10] = src[13];
    dest[11] = src[12];
    dest[12] = src[11];
    dest[13] = src[10];
    dest[14] = src[9];
    dest[15] = src[8];
 
    return retval;
  }
 
 
  uint128_t ntoh( const uint128_t & aValue )
  {
    uint128_t     retval;
 
    // get the byte pointers to the source and destination
    uint8_t *dest = (uint8_t *)&retval;
    const uint8_t *src = (const uint8_t *)&aValue;
    // now map the bytes one-by-one from source to destination
    dest[7]  = src[0];
    dest[6]  = src[1];
    dest[5]  = src[2];
    dest[4]  = src[3];
    dest[3]  = src[4];
    dest[2]  = src[5];
    dest[1]  = src[6];
    dest[0]  = src[7];
    dest[15] = src[8];
    dest[14] = src[9];
    dest[13] = src[10];
    dest[12] = src[11];
    dest[11] = src[12];
    dest[10] = src[13];
    dest[9]  = src[14];
    dest[8]  = src[15];
 
    return retval;
  }

The old scheme allowed me to use memcpy() to put the data into a data stream - and to take it out. But now, with a real "host byte order", I needed to add methods on the uint128_t class to pack and unpack its data from the data streams. Not bad, and it made the code look a lot cleaner, but until then I'd had that crud scattered in a ton of places in the code.

Bad form on my part - really.
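
For what it's worth, the pack/unpack methods ended up being thin wrappers around hton() and ntoh(). Something along these lines - the method names and the raw byte-buffer "stream" are illustrative here, not the real API:

  // pack this value into the buffer in network byte order
  void uint128_t::pack( uint8_t * aBuffer ) const
  {
    uint128_t   net = hton(*this);
    memcpy(aBuffer, &net, sizeof(net));
  }

  // unpack a value from the buffer back into host byte order
  void uint128_t::unpack( const uint8_t * aBuffer )
  {
    uint128_t   net;
    memcpy(&net, aBuffer, sizeof(net));
    *this = ntoh(net);
  }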

I even had to create the prefix and postfix increment and decrement operators to make sure it could function in the loops I might have - I really wanted this to be complete. Thankfully, the code to do this didn't turn out to be that hard. In fact, I was able to do it all in a lot fewer lines of code because I could let the compiler do a lot of the up-casting work that I'd had to do with memcpy() before. Nice benefit.
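
With the two-word layout, the carry handling in those operators is about as simple as it gets. Roughly - again, with my sketched member names:

  // prefix increment - carry into the high word when the low word wraps
  uint128_t & uint128_t::operator++()
  {
    if (++mLow == 0) {
      ++mHigh;
    }
    return *this;
  }

  // postfix increment - save a copy, bump via the prefix form, return the copy
  uint128_t uint128_t::operator++(int)
  {
    uint128_t   retval(*this);
    ++(*this);
    return retval;
  }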

The upshot of all these changes is that the new uint128_t was only 30% slower than the uint64_t! That's amazing compared to where it started. It's not going to set any speed records, but given that it's not a built-in CPU data type, it's pretty good. Certainly good enough for all the things I need it to do.

Fantastic work!