Identifying, Sorting, Classifying a Ton of Messages

The Magic School Bus

Today I started trying to consolidate the 300+ messages in The Magic School Bus into a few reasonable categories: OPRA messages (tons of them, space is critical, data format very rigid), Price Messages (a little looser, but still important and small), and everything else. The remaining messages are well suited to self-describing message formats like JSON, or more likely BSON: they are very flexible, have a variable number of components, and don't need to get shot around the network all the time.

The Really Wasteful

Take, for instance, the Holiday Calendar. This is just like every other Holiday Calendar I've ever seen: give it a date (or default to today), and it'll give you all the trading holidays for the next 'n' months. Very simple data structure. Even simpler when all you're talking about are US Equities and their options: you don't even need to tell it which exchange you're asking about, as they are all the same.

But here's what The Magic School Bus does: every minute it publishes a list of all holidays for the next ten years, and every client registered for this data receives it. Over and over again. Every minute. The format is pretty simple as well. There's the basic header of the message (far too verbose and general), but the payload of the message looks like this:

  struct {
    uint16_t       modifiedBy;    // trader ID
    char           today[9];      // YYYYMMDD
    uint8_t        numHolidays;   // # of holidays
    Holidays_NEST  holidays_nest[];
  } HolidayCalendar;

where Holidays_NEST looks like:

  struct {
    char      holidayDate[9];   // YYYYMMDD
    uint8_t   holidayType;      // 1=no trading; 2=half day
  } Holidays_NEST;

There are problems with this content, like a date that takes 9 bytes when 2 would do (as a uint16_t). In fact, we could compress the entire message to look like this:

  struct {
    uint16_t    modifiedBy;    // trader ID
    uint16_t    today;         // YYYYMMDD
    uint8_t     numHolidays;   // # of holidays
    uint16_t    holidays[];    // tYYYYMMDD
  } HolidayCalendar;

where the 't' is the type of day and the date immediately follows, both packed into the same 16 bits. A simple mask gets us what we need, and the size comparison (counting packed bytes, with no padding) is:

  old size = 12 + n * 10
  new size = 5 + n * 2

and for a typical year we have, say 7 holidays, and ten years, so n = 70:

  old size = 12 + 70 * 10 = 712
  new size = 5 + 70 * 2 = 145
  savings: about 80%

It's just stunning how bad some of these messages are.
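
To make the mask idea concrete, here is a minimal sketch of one way the 16-bit holiday entry could be packed and unpacked. The bit layout is an assumption, not something taken from the existing system: the top 2 bits carry the day type, and the low 14 bits carry a day count from some fixed epoch (about 44 years of range). The function names are hypothetical too. The two size helpers just restate the arithmetic above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMED layout (not from the original design):
 *   bits 15..14  holiday type (1 = no trading, 2 = half day)
 *   bits 13..0   day count from a fixed epoch */
#define HOLIDAY_TYPE_SHIFT 14
#define HOLIDAY_TYPE_MASK  0x3u
#define HOLIDAY_DAY_MASK   0x3FFFu

/* Combine a type and a day count into one 16-bit entry. */
static uint16_t pack_holiday(uint8_t type, uint16_t day)
{
    assert(type <= HOLIDAY_TYPE_MASK && day <= HOLIDAY_DAY_MASK);
    return (uint16_t)((((unsigned)type & HOLIDAY_TYPE_MASK) << HOLIDAY_TYPE_SHIFT)
                      | (day & HOLIDAY_DAY_MASK));
}

/* Pull the type back out of the top two bits. */
static uint8_t holiday_type(uint16_t packed)
{
    return (uint8_t)(packed >> HOLIDAY_TYPE_SHIFT);
}

/* Pull the day count back out of the low 14 bits. */
static uint16_t holiday_day(uint16_t packed)
{
    return (uint16_t)(packed & HOLIDAY_DAY_MASK);
}

/* Packed payload sizes in bytes for n holidays, as computed above. */
static size_t old_payload_size(size_t n) { return 12 + n * 10; }
static size_t new_payload_size(size_t n) { return 5 + n * 2; }
```

With this layout, a half day on day 1234 packs into a single uint16_t, and holiday_type() and holiday_day() recover the two fields with one shift and one mask.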

The Horrible Congestion

Look again at the Holiday Calendar - it's sending this data out every minute. Why? Because the designers believed that this was the only way the data was going to get delivered to the client. What about a data cache/data service? They even have a cache server in the architecture - but it holds all the messages sent and as such, it's not nearly as efficient as a more customized data service.

So I need to do something here: basically, stop the insanity of sending all this data all the time. The client should get the data when it asks for it, and again only when the data fundamentally changes. This means something a lot more intelligent and flexible than read from the database, make a monster message, send it, repeat.
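
As a rough illustration of "send on request and on change", here is a minimal sketch of a versioned cache for the calendar. Everything in it is hypothetical (the CalendarCache type, the function names, the version counter); it is not how the existing cache server works, just one shape a smarter data service could take. The service bumps a version only when the content really differs, and a client that already holds the current version gets nothing back.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_HOLIDAYS 255

/* Hypothetical server-side cache; version 0 means "nothing loaded yet". */
typedef struct {
    uint32_t version;
    uint8_t  numHolidays;
    uint16_t holidays[MAX_HOLIDAYS];
} CalendarCache;

/* Load a fresh calendar (e.g. after a database read). Returns true only
 * if the content changed, which is the one time an update is worth
 * pushing to subscribers. */
static bool cache_update(CalendarCache *c, const uint16_t *h, uint8_t n)
{
    if (c->version != 0 && n == c->numHolidays &&
        memcmp(c->holidays, h, n * sizeof h[0]) == 0) {
        return false;               /* same data: send nothing */
    }
    memcpy(c->holidays, h, n * sizeof h[0]);
    c->numHolidays = n;
    c->version++;
    return true;
}

/* Client request carrying the version it already holds. Returns the
 * cache if the client is stale, or NULL if it is already up to date. */
static const CalendarCache *cache_request(const CalendarCache *c,
                                          uint32_t clientVersion)
{
    return (clientVersion == c->version) ? NULL : c;
}
```

The point of the version number is that the once-a-minute database read can stay if it has to, but it only turns into network traffic when cache_update() returns true.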

The Task

It's huge. I have to look at all the used messages and then try to see what can be combined into a nice, compact format for sending at high speed to a lot of clients, and what can be more free-form and possibly even skip the 29West sending in the first place.

It's a monster job. But it's gotta be done. This is in such a horrible state because no one has taken it upon themselves to do this until now. It's ugly, and it's painful, but it's got to be done.