Converting Doubles and Floats in C++ for Java

My C++ ticker plant needs to communicate with Java, and unfortunately the host-byte-order versus network-byte-order issue comes up in these situations as well. Thankfully, for virtually all of our integers (signed and unsigned) we're using Google's VarInt encoding, with ZigZag encoding for the signed values. It's very nice, as it completely removes any byte-ordering issues from the cross-language API.
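For reference, the ZigZag half of that scheme is tiny. This isn't the ticker plant's actual code - just a minimal sketch of the protobuf-style mapping, with illustrative names:

```cpp
#include <cstdint>

// ZigZag mapping (as used by protobuf varints): interleaves signed values
// into unsigned ones so small magnitudes - positive or negative - encode
// into few bytes. 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
inline uint32_t zigzagEncode(int32_t n) {
  return (static_cast<uint32_t>(n) << 1) ^ static_cast<uint32_t>(n >> 31);
}

inline int32_t zigzagDecode(uint32_t n) {
  return static_cast<int32_t>(n >> 1) ^ -static_cast<int32_t>(n & 1);
}
```

Since the result is then run through the varint byte-at-a-time encoding, no multi-byte value ever hits the wire whole - which is exactly why byte ordering drops out of the picture.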

But doubles and floats are another issue.

In order to make them fast on the Java decoding side (the majority of my clients will be Java-based) we really need to make the encoding in the messages Java-focused. Java reads multi-byte values in network (big-endian) order, so when the data arrives in a ByteBuffer, a simple getFloat() gets the value out correctly.
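The float case is the simpler of the two, since a single htonl() covers all four bytes. A sketch of how that packing might look - the helper names and the memcpy-based bit-punning are mine, not code from the ticker plant:

```cpp
#include <arpa/inet.h>   // htonl()/ntohl()
#include <cstdint>
#include <cstring>
#include <string>

// Pack a float into big-endian (Java ByteBuffer) order at the end of a buffer.
inline void appendFloat(std::string & buf, float v) {
  uint32_t  bits;
  memcpy(&bits, &v, 4);        // grab the raw IEEE-754 bits
  bits = htonl(bits);          // host -> network (big-endian) order
  buf.append((const char *)&bits, 4);
}

// Pull it back out again on the receiving side.
inline float extractFloat(const std::string & buf, std::size_t pos = 0) {
  uint32_t  bits;
  memcpy(&bits, buf.data() + pos, 4);
  bits = ntohl(bits);          // network -> host order
  float     v;
  memcpy(&v, &bits, 4);
  return v;
}
```

The round trip is bit-exact, since the value only ever moves as its raw 32-bit image.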

What this means is that I need to run htonl() on the float's bits before they're packed into the data stream, and for a double, I need to run htonl() on each of its two uint32_t halves and then swap those halves to complete the byte reversal. I decided to start with the simplest method: brute force. Assuming the double to convert is in the variable gold, this bit of code reverses the bytes and puts them in the std::string buffer for sending to the clients:

    // now flip its bytes
    uint8_t    in[8];
    memcpy(in, &gold, 8);
    uint8_t    out[8];
    for (uint8_t i = 0; i < 8; ++i) {
      out[i] = in[7 - i];
    }
    conduit.append((const char *)out, 8);

And getting it back out of the std::string as it's read in from the upstream system is simply:

    uint8_t    in[8];
    uint8_t    out[8];
    memcpy(in, conduit.data(), 8);
    for (uint8_t i = 0; i < 8; ++i) {
      out[i] = in[7 - i];
    }
    memcpy(&gold, out, 8);

Very deliberate, but not very efficient. But it worked. Once I had that in the code, I started working on coming up with better methods of doing this conversion, and timing them to see what the real difference was. To keep the playing field level, this is the test that I wrote for the first version of the conversion - out and back, 100,000 times:

  // start the timer
  uint64_t    start = getTime();
  // do the looping a bunch of times
  for (uint32_t l = 0; l < 100000; ++l) {
    // clear out the conduit
    conduit.clear();
 
    // now flip its bytes
    uint8_t    in[8];
    memcpy(in, &gold, 8);
    uint8_t    out[8];
    for (uint8_t i = 0; i < 8; ++i) {
      out[i] = in[7 - i];
    }
    conduit.append((const char *)out, 8);
 
    // now back again
    memcpy(in, conduit.data(), 8);
    for (uint8_t i = 0; i < 8; ++i) {
      out[i] = in[7 - i];
    }
    memcpy(&gold, out, 8);
  }
  uint64_t    loopTime = getTime() - start;
  std::cout << "100000 'loop' passes took " << loopTime
            << " usec or " << 100000.0/loopTime << " trips/usec"
            << std::endl;
  std::cout << "gold: " << gold << std::endl;

And the result of this running on my MacBook Pro was:

  peabody{drbob}465: a.out
  100000 'loop' passes took 11048 usec or 9.05141 trips/usec
  gold: 254.76

I then started fiddling with the code. The first thing I wanted to try was htonl(), to see if it was any faster than simply throwing down the bytes in the correct order. Turns out, it was. Nice. This next test was faster:

  // let's see if we can use the ntohl/htonl for better speed
  start = getTime();
  // do the looping a bunch of times
  for (uint32_t l = 0; l < 100000; ++l) {
    // clear out the conduit
    conduit.clear();
 
    uint32_t  in[3];
    // land the double in the upper two words so the swapped output can
    // be written into the lower two without clobbering the source
    memcpy(&in[1], &gold, 8);
    in[0] = htonl(in[2]);   // the high word leads in network order
    in[1] = htonl(in[1]);
    conduit.append((const char *)in, 8);
 
    uint32_t  out[3];
    // same overlap trick in reverse on the way back in
    memcpy(&out[1], conduit.data(), 8);
    out[0] = ntohl(out[2]);
    out[1] = ntohl(out[1]);
    memcpy(&gold, &out[0], 8);
  }
  loopTime = getTime() - start;
  std::cout << "100000 'ntoh/hton' passes took " << loopTime
            << " usec or " << 100000.0/loopTime << " trips/usec"
            << std::endl;
  std::cout << "gold: " << gold << std::endl;

and while I was still using memcpy(), I was now working with larger chunks of data, and the speed proved out:

  100000 'ntoh/hton' passes took 5992 usec or 16.6889 trips/usec
  gold: 254.76

But I wanted to get away from the memcpy() calls altogether. If I was a little careful with the way I did things, it worked out just fine:

  // let's see if we can use the ntohl/htonl for better speed
  start = getTime();
  // do the looping a bunch of times
  for (uint32_t l = 0; l < 100000; ++l) {
    // clear out the conduit
    conduit.clear();
 
    // type-punning through casts - it works here, but it does skirt the
    // strict-aliasing rules, so keep an eye on the optimizer settings
    uint32_t  *in = (uint32_t *)(void *)(&gold);
    uint32_t  buff[] = { htonl(in[1]), htonl(in[0]) };
    conduit.append((const char *)buff, 8);
 
    uint32_t  *out = (uint32_t *)conduit.data();
    uint32_t  *target = (uint32_t *)&gold;
    target[0] = ntohl(out[1]);
    target[1] = ntohl(out[0]);
  }
  loopTime = getTime() - start;
  std::cout << "100000 'string mix II' passes took " << loopTime
            << " usec or " << 100000.0/loopTime << " trips/usec"
            << std::endl;
  std::cout << "gold: " << gold << std::endl;

This guy clocked in nicely:

  100000 'string mix II' passes took 5291 usec or 18.9 trips/usec
  gold: 254.76

Lesson learned: Get it right, then get it fast. It always pays off. I've more than doubled the speed, and it's the same interface to the outside world. Looking at the final code, there's very little left to trim - even if I dropped the byte re-ordering entirely, I'd only be saving the time spent in the htonl() calls. And that's really nice.
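One more variant I haven't timed here, but that's worth a look: doing the whole double in a single 64-bit byte swap. This sketch assumes GCC/Clang's __builtin_bswap64 intrinsic and a little-endian host (like the x86 MacBook Pro above); the helper names are mine:

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Round-trip a double through a big-endian buffer with one 64-bit swap.
// __builtin_bswap64 is a GCC/Clang intrinsic - on other toolchains you'd
// substitute htobe64() or _byteswap_uint64(). Note the unconditional swap
// is only the right thing on a little-endian host.
inline void appendDouble(std::string & buf, double v) {
  uint64_t  bits;
  memcpy(&bits, &v, 8);
  bits = __builtin_bswap64(bits);
  buf.append((const char *)&bits, 8);
}

inline double extractDouble(const std::string & buf, std::size_t pos = 0) {
  uint64_t  bits;
  memcpy(&bits, buf.data() + pos, 8);
  bits = __builtin_bswap64(bits);
  double    v;
  memcpy(&v, &bits, 8);
  return v;
}
```

On a little-endian box this is exactly the same full-reversal the two htonl() calls produce, just in one instruction on most modern CPUs.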