Fantastic Speed Boost on My uint128_t
Late yesterday I realized that I still had some lingering code using the uint128_t I had created to uniquely map the instrument names into some kind of number space, for use as a key in the likes of std::map. The code I had originally written worked, but it wasn't nearly fast enough, so I stopped using it (or so I thought) and switched to the trie.
But it wasn't really gone. I had a lingering use for it in my client code, and I decided this morning to fix up the implementation so that I had something that was a lot faster - hopefully in the same ballpark as a uint64_t for map usage.
The first thing I did was to add a timed test section to my testing code for the conflation key - that's what I called the 128-bit value generated from the name of the instrument. It was pretty simple:
```cpp
int         cnt = 100000;

log.info("starting the uint64_t tests...");
boost::unordered_map<uint64_t, int>     little;
uint64_t    startTime = msg::TransferStats::usecSinceEpoch();
// ...first the puts
for (int i = 0; i < cnt; ++i) {
    little[i] = i;
}
// ...now the gets
for (int i = 0; i < cnt; ++i) {
    if (little[i] != i) {
        error = true;
        log.error("uint64_t test failed for i=%d", i);
    }
}
uint64_t    totalTime = msg::TransferStats::usecSinceEpoch() - startTime;
log.info("%d uint64_t tests completed in %ld usec", cnt, totalTime);

log.info("starting the uint128_t tests...");
boost::unordered_map<uint128_t, int>    big;
startTime = msg::TransferStats::usecSinceEpoch();
// ...first the puts
for (int i = 0; i < cnt; ++i) {
    big[i] = i;
}
// ...now the gets
for (int i = 0; i < cnt; ++i) {
    if (big[i] != i) {
        error = true;
        log.error("uint128_t test failed for i=%d", i);
    }
}
totalTime = msg::TransferStats::usecSinceEpoch() - startTime;
log.info("%d uint128_t tests completed in %ld usec", cnt, totalTime);
```
What I saw in my initial tests was horrible. It wasn't even close - the uint128_t was more than 300x slower than the uint64_t. When I looked at how I'd implemented the uint128_t, it made a lot of sense:
```cpp
private:
    uint8_t     mBytes[16];
```
I had 16 individual bytes as the data ivar for the object. That made a lot of sense in that it never suffers from host/network byte-ordering issues, and the code looked fast - but there were loops and a lot of calls to memcpy(). So I needed to take a new approach, and I decided to go to the other extreme - two uint64_t values as opposed to sixteen uint8_t values.
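The new layout might look something like this - a minimal sketch, with member names that are my guesses and not the original code. The point is that equality and hashing now work on two native machine words instead of looping over sixteen bytes:

```cpp
#include <cstdint>

// A sketch of the two-word layout - names are assumptions, not the
// author's actual class. Comparisons now touch just two uint64_t
// values instead of iterating over a 16-byte array.
class uint128_t {
public:
    uint128_t(uint64_t aLow = 0, uint64_t aHigh = 0) :
        mLow(aLow), mHigh(aHigh) { }
    bool operator==(const uint128_t & anOther) const {
        return (mLow == anOther.mLow) && (mHigh == anOther.mHigh);
    }
    bool operator!=(const uint128_t & anOther) const {
        return !operator==(anOther);
    }
private:
    uint64_t    mLow;     // least-significant 64 bits
    uint64_t    mHigh;    // most-significant 64 bits
};
```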
This changed a lot of the code. For one, it made a lot of sense to write my own hton() and ntoh() functions for the classes so they'd look like the system calls ntohl() and the like. It really wasn't all that hard, either:
```cpp
uint128_t hton( const uint128_t & aValue )
{
    uint128_t   retval;
    // get the byte pointers to the source and destination
    uint8_t         *dest = (uint8_t *) &retval;
    const uint8_t   *src = (const uint8_t *) &aValue;
    // now map the bytes one-by-one from source to destination
    dest[0] = src[7];
    dest[1] = src[6];
    dest[2] = src[5];
    dest[3] = src[4];
    dest[4] = src[3];
    dest[5] = src[2];
    dest[6] = src[1];
    dest[7] = src[0];
    dest[8] = src[15];
    dest[9] = src[14];
    dest[10] = src[13];
    dest[11] = src[12];
    dest[12] = src[11];
    dest[13] = src[10];
    dest[14] = src[9];
    dest[15] = src[8];
    return retval;
}

uint128_t ntoh( const uint128_t & aValue )
{
    uint128_t   retval;
    // get the byte pointers to the source and destination
    uint8_t         *dest = (uint8_t *) &retval;
    const uint8_t   *src = (const uint8_t *) &aValue;
    // now map the bytes one-by-one from source to destination
    dest[7] = src[0];
    dest[6] = src[1];
    dest[5] = src[2];
    dest[4] = src[3];
    dest[3] = src[4];
    dest[2] = src[5];
    dest[1] = src[6];
    dest[0] = src[7];
    dest[15] = src[8];
    dest[14] = src[9];
    dest[13] = src[10];
    dest[12] = src[11];
    dest[11] = src[12];
    dest[10] = src[13];
    dest[9] = src[14];
    dest[8] = src[15];
    return retval;
}
```
The old scheme allowed me to use memcpy() to put the data into a data stream - and to take it out. But now that there was a real "host byte order", I needed to add methods on the uint128_t class to pack and unpack its data from the data streams. Not bad, and it made the code look a lot cleaner, but I had had that crud scattered in the code in a ton of places.
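The pack/unpack idea can be sketched roughly like this - the function shapes, the buffer type, and the word-swapping helper are all my assumptions (the original methods aren't shown): convert to network byte order once, then move the raw 16 bytes in or out of the stream with memcpy().

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// stand-in for the two-word value; member names are assumptions
struct uint128_t {
    uint64_t    lo;
    uint64_t    hi;
};

// reverse the bytes of one 64-bit word (assumes a little-endian host);
// glibc's htobe64()/be64toh() would do the same job
static uint64_t swap64( uint64_t aValue )
{
    uint64_t    retval = 0;
    for (int i = 0; i < 8; ++i) {
        retval = (retval << 8) | ((aValue >> (8 * i)) & 0xff);
    }
    return retval;
}

// hypothetical pack: append the value to the stream in network byte order
void pack( std::vector<uint8_t> & aBuffer, const uint128_t & aValue )
{
    uint64_t        net[2] = { swap64(aValue.hi), swap64(aValue.lo) };
    const uint8_t   *raw = (const uint8_t *) net;
    aBuffer.insert(aBuffer.end(), raw, raw + 16);
}

// hypothetical unpack: read 16 network-ordered bytes back out
uint128_t unpack( const std::vector<uint8_t> & aBuffer, size_t aPos )
{
    uint64_t    net[2];
    memcpy(net, &aBuffer[aPos], 16);
    uint128_t   retval;
    retval.hi = swap64(net[0]);
    retval.lo = swap64(net[1]);
    return retval;
}
```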
Bad form on my part - really.
I even had to create the prefix/postfix increment and decrement operators to make sure it could function in the loops I might have. I really wanted this to be complete. Thankfully, the code to do this didn't turn out to be that hard. In fact, I was able to do it all in a lot fewer lines of code because I could use the compiler to do a lot of the up-casting work that I had to do with memcpy() before. Nice benefit.
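As a rough sketch of what those operators look like on the two-word layout (member names are my assumptions, and the decrement operators mirror these with a borrow instead of a carry) - the only interesting part is propagating the carry from the low word into the high word when the low word wraps:

```cpp
#include <cstdint>

// A hedged sketch, not the author's actual code: prefix and postfix
// increment on a two-uint64_t value, letting native integer arithmetic
// do the work instead of memcpy() and byte loops.
class uint128_t {
public:
    uint128_t(uint64_t aLow = 0, uint64_t aHigh = 0) :
        mLow(aLow), mHigh(aHigh) { }
    // prefix: bump the value and return *this
    uint128_t & operator++() {
        if (++mLow == 0) {
            ++mHigh;    // low word wrapped - carry into the high word
        }
        return *this;
    }
    // postfix: save a copy, bump via the prefix form, return the copy
    uint128_t operator++(int) {
        uint128_t   old(*this);
        ++(*this);
        return old;
    }
    uint64_t low() const { return mLow; }
    uint64_t high() const { return mHigh; }
private:
    uint64_t    mLow;
    uint64_t    mHigh;
};
```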
The upshot of all these changes is that the new uint128_t was only 30% slower than the uint64_t! That's amazing compared to where it started. It's not going to set any speed records, but given that it's not a built-in CPU data type, it's pretty good. Certainly good enough for all the things I need it to do.
Fantastic work!