Getting Pretty Clever with Serialization and Google Varints

Today I started with a working, but somewhat inefficient, serialization scheme and a desire to do a lot better. Late yesterday, I talked with my sensei at The Shop about the improvements I'd made, and we both agreed that the best thing to do was to look at how Google protobufs packs its data into a byte stream. It's certainly very fast and efficient, as Google uses it to move massive amounts of data around its enterprise, and that's exactly what I was looking for. Thankfully, I knew that all I needed to do was write a few helper methods, and the serialization scheme I'd been using in the in-house code would work just fine.

So what is Google protobufs doing, really? Well, they're 'packing' 32-bit and 64-bit unsigned integers into as small a space as they can fit - while still retaining very fast deserialization. How do they do it? They encode the integer in 7-bit groups and use the high-order bit of each byte to indicate whether it's the last byte in the number. So as long as a serialized byte has the high-order bit set, it's not the last. Google calls these varints - variable-length integers.
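
To make this concrete, take the value 300 - the same example Google uses in their protobuf documentation. In binary, 300 is 100101100, which splits into two 7-bit groups, written out lowest group first:

  300 = 0b10_0101100
  byte 1:  0101100  + high bit set (more to come)  =>  0xAC
  byte 2:  0000010  + high bit clear (last byte)   =>  0x02

So 300 serializes to the two bytes 0xAC 0x02, instead of the four bytes a raw uint32_t would take.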

This doesn't sound very fast, but when you look at the implementation, it's really pretty slick. This is the 32-bit unsigned integer serialization scheme:

  bool Message::packVarint( std::string & aBuffer, uint32_t aValue )
  {
    char      buff[5];      // a 32-bit varint never needs more than 5 bytes
    uint16_t  size;
 
    // set each byte's continuation bit optimistically - it gets cleared
    // below once we know which byte is actually the last
    buff[0] = static_cast<uint8_t>(aValue | 0x80);
    if (aValue >= (1 << 7)) {
      buff[1] = static_cast<uint8_t>((aValue >>  7) | 0x80);
      if (aValue >= (1 << 14)) {
        buff[2] = static_cast<uint8_t>((aValue >> 14) | 0x80);
        if (aValue >= (1 << 21)) {
          buff[3] = static_cast<uint8_t>((aValue >> 21) | 0x80);
          if (aValue >= (1 << 28)) {
            buff[4] = static_cast<uint8_t>(aValue >> 28);
            size = 5;
          } else {
            buff[3] &= 0x7F;
            size = 4;
          }
        } else {
          buff[2] &= 0x7F;
          size = 3;
        }
      } else {
        buff[1] &= 0x7F;
        size = 2;
      }
    } else {
      buff[0] &= 0x7F;
      size = 1;
    }
 
    // now just append this data to the buffer
    aBuffer.append(&buff[0], size);
    return true;
  }

The beauty of this scheme is that it only does the work it needs to. Sure, it's a mess of nested if statements, but so what? It's not like you need this to be exceptionally readable, and it's not that bad. It's there when you need it, and it's going to run through the data like a hot knife through butter. Amazingly simple and clean.
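
Just to show it in action, here's a quick sketch of a call - this assumes packVarint() is reachable from outside the Message class, which in my code it is:

  std::string  buffer;
  Message      msg;

  // pack 300 - appends exactly the two bytes 0xAC 0x02
  msg.packVarint(buffer, 300);
  // ...so buffer.size() is now 2, not 4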

And deserializing it is just as fast:

  bool Message::unpackVarint( const std::string & aBuffer,
                              uint16_t & aPos,
                              uint32_t & aValue ) const
  {
    /*
     * I'm going to go with Google's code here, for the most part, as they
     * have gone through the difficult part of optimization. I'm not a huge
     * fan of the 'goto's, but again, it's about speed here, and that's
     * more important than the use (or not) of a particular language construct.
     */
    const uint8_t  *ptr = reinterpret_cast<const uint8_t *>(aBuffer.data()) + aPos;
    uint32_t       b;
    uint32_t       result;
 
    b = *(ptr++); result  = (b & 0x7F)      ; if (!(b & 0x80)) { aPos += 1; goto done; }
    b = *(ptr++); result |= (b & 0x7F) <<  7; if (!(b & 0x80)) { aPos += 2; goto done; }
    b = *(ptr++); result |= (b & 0x7F) << 14; if (!(b & 0x80)) { aPos += 3; goto done; }
    b = *(ptr++); result |= (b & 0x7F) << 21; if (!(b & 0x80)) { aPos += 4; goto done; }
    b = *(ptr++); result |=  b         << 28; if (!(b & 0x80)) { aPos += 5; goto done; }
 
    /*
     * If we are here, then that means that the data is too big to fit into
     * a 32-bit int. Sadly, we need to keep reading until we find the end of
     * the varint, and update aPos based on the end of the actual data. The
     * value returned to the caller will simply be truncated.
     */
    for (int i = 0; i < 5; ++i) {
      b = *(ptr++); if (!(b & 0x80)) { aPos += (6 + i); goto done; }
    }
    // if we're here, then we overflowed the 64-bit varint, and it's bad.
    return false;
 
    // at this point, we're all done, save the value and leave
    done:
    aValue = result;
    return true;
  }

In the code, the main section of that method is really one line per byte, and it's pretty easy to read.

Again, the code only does what's needed, and it's pretty fast. This is some really good stuff from Google. I'm impressed. I'm also a little surprised that they made such heavy use of gotos in the code. Normally, that's a 'coding no-no', but they've clearly put the old-style 'jump' to very effective use here.
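
And the two methods round-trip nicely. Here's a quick sanity check - again, just a sketch built on the Message class above:

  std::string  buffer;
  uint16_t     pos = 0;
  uint32_t     value = 0;
  Message      msg;

  msg.packVarint(buffer, 1234567);
  if (msg.unpackVarint(buffer, pos, value) && (value == 1234567)) {
    // success - and pos now sits just past the end of the varint,
    // ready for the next field in the stream
  }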

They also created 64-bit versions of these methods, but here they didn't make the choice I would have made: they cleverly realized that on a 32-bit machine, it was faster to break the 64-bit unsigned int into three 32-bit blocks (remember, one bit per byte goes to the continuation flag), and then process those. However, most modern Linux systems are 64-bit, so they could have put a simple #ifdef on the word size - using the 'chunking' code when it's 32-bit, and expanding the above methods to work on 64-bit numbers when it's not.

There's a significant advantage to expanding the method rather than chunking: the code above does only what's necessary, and does it fast. Expanded to 64 bits, it's no slower until you hit a number that can't fit in 32 bits - at that point, the 'extra' code kicks in and things slow down a bit. Google's chunking, on the other hand, needs additional compares to see how large the number is, so they used a simple but effective hardcoded binary tree to get the general size of the number. While that yields the best performance on a 32-bit machine, it's going to be far from the best on a 64-bit system.

Therefore, in my code, I put in the appropriate #ifdefs so that the right version is used for the word size of the build, and we'll always get the best performance we can. I'm very excited about this - and a little surprised that they didn't include it in their code, but no matter, I've got it and that's great.
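
The guard itself is nothing fancy. Here's a sketch of the shape of it - the name packVarint64() is mine, the 64-bit path is written as a simple loop rather than the fully unrolled nested ifs just to keep it short, and packVarint64Chunked() is a stand-in for the 32-bit chunking code, not Google's actual method:

  bool Message::packVarint64( std::string & aBuffer, uint64_t aValue )
  {
  #if defined(__LP64__) || defined(_WIN64)
    // 64-bit word size - extend the 32-bit scheme to (at most) ten bytes
    char      buff[10];
    uint16_t  size = 0;

    while (aValue >= 0x80) {
      buff[size++] = static_cast<char>(aValue | 0x80);
      aValue >>= 7;
    }
    buff[size++] = static_cast<char>(aValue);
    aBuffer.append(&buff[0], size);
    return true;
  #else
    // 32-bit word size - fall back to the chunking scheme
    // (packVarint64Chunked() is a placeholder name here)
    return packVarint64Chunked(aBuffer, aValue);
  #endif
  }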

When it comes to signed integers, they created this 'ZigZag' scheme, where the signed integers are coded into unsigned integers by ranking them by absolute value - negative values winning ties - and then the same varint packing is applied to them as to the unsigned ones. It's pretty slick - and very fast. One line of code does it. Pretty amazing stuff.
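
For the curious, the transform for 32-bit values looks like this - a sketch with my own function names, where the encode side leans on arithmetic right-shift of signed ints, which every compiler I care about provides:

  // ZigZag: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, 2 -> 4, ...
  uint32_t zigzagEncode( int32_t aValue )
  {
    return (static_cast<uint32_t>(aValue) << 1) ^
           static_cast<uint32_t>(aValue >> 31);
  }

  int32_t zigzagDecode( uint32_t aValue )
  {
    return static_cast<int32_t>(aValue >> 1) ^ -static_cast<int32_t>(aValue & 1);
  }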

With a few helper methods, I was able to change the serialization scheme pretty easily and then run my tests on the same data. What I saw was that a 29-byte packet of two strings went to 26 bytes, because the strings were encoded with these varints for the length - that's about a 10% savings for just two strings! Not bad. When I start to add in real data, it's going to get even more impressive.

All in all, I have to say that once again, today was a good day. Pretty neat stuff.