Archive for the ‘Coding’ Category

More Drive Array Problems

Wednesday, February 22nd, 2012

bug.gif

Today my UDP feed recorders blew out a 3TB+ disk array, and when the admins unmounted and remounted it, I had 2.4TB free. Something was going on. I had changed the file writing from a buffer-and-write-once style to an append-incremental-updates scheme, and all of a sudden things blew up. So of course, I think it's me. So I decided to check.

The first thing was to get a simple test app and run it both on the drive array and off it, comparing the results. Thankfully, there were drives on my troubled box that weren't part of the array - my home directory, for one. So I just needed a test app to run in my home directory and then on the drive array.

My little test app was simple:

  #include <fstream>
  #include <string>
  #include <stdint.h>
 
  int main() {
    std::string  name("local.bin");
    std::string  buffer("Now is the time for all good men to "
                        "come to the aid of their party\n");
 
    // open for binary append, write one line, and close - 10,000 times
    for (uint16_t i = 0; i < 10000; ++i) {
      std::ofstream  file(name.c_str(), (std::ios::out |
                                         std::ios::binary |
                                         std::ios::app));
      file << buffer;
      file.close();
    }
 
    return 0;
  }

and then I compiled it and ran it. In my home directory I got:

  $ ls -lsa
  656 -rw-r--r--  1 rbeaty UnixUsers 670000 Feb 22 16:44 local.bin

and when I ran it on my suspect drive array I got:

  $ ls -lsa
  262144 -rw-r--r--  1 rbeaty UnixUsers 670000 Feb 22 16:44 local.bin

So the byte counts are right - 670000 in both cases - but while the blocks used are reasonable on my home directory drive, the drive array is totally wigged out. This explains the problem I've been seeing: when I append to a file, the drive array gets confused and adds all kinds of blocks to the file, but doesn't corrupt the byte count. Very odd.

So I sent this to the admins and let them use this as a test case for trying to fix this. I'm sure hoping they can do something to fix this guy. I need to have this running as soon as possible.

UPDATE: that's 256k blocks - exactly. This is interesting - it means it's not accidental. Something in the driver is allocating 256k blocks for every binary append, and doing it over and over again. Interesting, but it's all the more evidence that this is a drive array bug.

[2/23] UPDATE: turns out to be an XFS mount option - allocsize=262144k - and that was easily fixed by the admins. I'm guessing the home directory filesystem either wasn't XFS or had a saner default allocation size. But it's fixed. Good.
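
For reference, allocsize controls XFS's speculative preallocation, and it's set at mount time. A sketch of what the fix might look like in /etc/fstab - the device, mount point, and new value here are all made up for illustration:

  # illustrative only - device, mount point, and allocsize value are guesses
  /dev/sdb1   /array   xfs   defaults,allocsize=64k   0   0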

Wonderful Unix Number Counting Command

Wednesday, February 22nd, 2012

Ubuntu Tux

Today I was having a really bad day with my UDP feed recorders. They filled up a 3TB+ drive array and I could not figure out why. As I dug into it, I started seeing a pattern that was really bad: the sizes of the files in a directory didn't add up to the output of du. So I wanted to test the theory.

The trick was, I wanted to just run a simple command - not write a script or a program. Just a command, but it wasn't clear what to do. So I hit google, and there was the simple result:

  ls -lsa | awk '{ sum += $6 }END{ print sum }'

The ls is obvious - the 6th column is the size in bytes - but the awk is the real beauty here. I've used awk a lot before, but I didn't know variables simply spring into existence the first time you touch them, starting at zero. And then there's the END block to run the print after all the input is processed. Simply brilliant.
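
And the same one-liner, pointed at the first column, sums the allocated blocks instead - which is the number that should agree with du when the filesystem is behaving (with the usual caveat that the block size ls reports varies by platform):

  ls -lsa | awk '{ blocks += $1 } END { print blocks }'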

This is why I love linux/unix. You don't have to write groovy scripts if you understand the system. Love it!

Adding NBBO Exclusion Rules to Ticker Plants

Wednesday, February 22nd, 2012

High-Tech Greek Engine

Today I spent most of the day working into my NBBOEngine the idea that an exclusion rule for an exchange isn't limited to the global exclusion of a single exchange - it might be targeted at a single stock, or its options, or the entire family. These requests came in from the operations group, who said they needed these rules before they could go live with the Greek Engine. As you might recall, the engine uses the embedded ticker plants, so that's the connection.

So I needed to come up with a way to easily allow the global defaults as well as instrument-level overrides to those defaults, and scope them to cover just the instrument, its options, or both. In general, it wasn't horribly hard: a boost::unordered_map with a std::string key of the SecurityKey, and a simple value object holding the scope (a uint16_t bit-masked word) and an array of bool values for the individual exchanges.
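
As a sketch of the shape of this - the names, the scope bits, and the exchange count here are all illustrative, not the real classes - it's little more than:

  #include <string>
  #include <stddef.h>
  #include <stdint.h>
  #include <boost/unordered_map.hpp>
 
  // illustrative only - the name and the exchange count are assumptions
  static const size_t kNumExchanges = 26;
 
  // scope bits saying how far an instrument-level override reaches
  enum tScope {
    eInstrument = 0x0001,                   // just the instrument itself
    eOptions    = 0x0002,                   // all of its options
    eFamily     = (eInstrument | eOptions)  // the whole family
  };
 
  struct ExclusionRule {
    uint16_t  scope;                        // bit-masked tScope word
    bool      excluded[kNumExchanges];      // per-exchange exclusion flags
  };
 
  // instrument-level overrides to the global defaults, keyed by SecurityKey
  boost::unordered_map<std::string, ExclusionRule>  overrides;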

Then I needed to replicate the method calls - adding the SecurityKey and scope - work that into the framework, and then into the external API. Nothing terribly complex, but it's a lot of little pieces, and more than a little typing. In the end, it's all working pretty nicely, and the additional load on the NBBOEngine is zero. Actually, I improved a few things along the way, and that offset the additional map lookup.

I then did a little fixing up of the code for visualizing these exclusions, so that it's clear what's being excluded - at the global or instrument levels, and this makes the update really complete. Lots of little things, but in the end a far better system for managing bad data from the exchanges.

Factored Out Some Magic Numbers

Wednesday, February 22nd, 2012

GeneralDev.jpg

One of the things I really hate about "sloppy" code is the use of Magic Numbers in the code. Things that look like this:

  if (mAutoFlipSize < 400000) {
    if (((++emptyTrips > 75) && (sz > 50)) ||
        (sz > mAutoFlipSize)) {
      drainAllPendingMessages();
      flipSide();
      continue;
    }
  }

Unfortunately, this is all my code. I have no one to blame but myself. I started tuning this code, and needed to play with the buffer sizes and limits, and this is what I was left with.

But having it, and leaving it are two entirely different things. I spent a few minutes today to remove all these magic numbers and use a simple grouped enum to make them far more manageable:

  namespace msg {
  namespace kit {
  namespace udp {
  enum tConst {
    eAutoFlipEmptyTripSize = 50,   // min backlog on the other side to flip
    eAutoFlipEmptyTrips = 75,      // empty passes before giving up on a side
    eAutoFlipDefault = 50000,      // default trip level for auto-flipping
    eAutoFlipManual = 400001,      // at or above this, flipping is manual
  };
  }     // end of namespace udp
  }     // end of namespace kit
  }     // end of namespace msg
 
 
  if (mAutoFlipSize < udp::eAutoFlipManual) {
    if (((++emptyTrips > udp::eAutoFlipEmptyTrips) &&
         (sz > udp::eAutoFlipEmptyTripSize)) ||
        (sz > mAutoFlipSize)) {
      drainAllPendingMessages();
      flipSide();
      continue;
    }
  }

Much better! The constants are now named, grouped, and easy to keep in sync with the code. No more Magic Numbers.

Fixed Weekend and Holiday STALE Flag

Tuesday, February 21st, 2012

bug.gif

I got a note from one of the guys in another group about trying to hit one of my servers on the weekend, and not getting what he thought he should get. Instead of the hundreds of thousands of instruments, he was getting just a few hundred - clearly not right. But where was the problem?

The problem with finding this one was that the bug report was sketchy at best - made worse by the fact that the guy who reproduced it in my group failed to tell me any of the details about what he'd found. Consequently, I spent quite a while chasing possible changes in the commit logs instead of looking at the code, which is where the problem actually lay.

Finally, I was able to extract this information from him, and saw that the STALE flag - a flag we use in the system to indicate that there have been no quotes or trades on an instrument today - was improperly showing as 'true' on the weekends. While this makes perfect sense (nothing trades on a weekend), it has the unintended consequence of filtering all the STALE instruments out of the output to the client.

What I needed was to change the STALE logic to allow for the fact that if it's not a trading day, then any update at all (quote, print, summary) means we're not stale. On a trading day, the cutoff is midnight of that same day. It's pretty simple logic, but it's going to make a huge difference in how this code acts on weekends and holidays.
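
Boiled down, the check looks something like this - a sketch only, where isTradingDay() and midnightOf() stand in for the real calendar and clock plumbing:

  #include <time.h>
 
  bool   isTradingDay(time_t when);   // hypothetical calendar check
  time_t midnightOf(time_t when);     // hypothetical midnight-of-that-day
 
  bool isStale(time_t lastUpdate, time_t now)
  {
    if (!isTradingDay(now)) {
      // weekend or holiday: any update we've ever seen means not stale
      return (lastUpdate == 0);
    }
    // trading day: stale means nothing has arrived since midnight today
    return (lastUpdate < midnightOf(now));
  }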

Glad I was able to get the information about the problem. It was pretty easy after that.

Fixed Auto-Flipping on Exchange Feeds

Tuesday, February 21st, 2012

bug.gif

This morning we had an unusual situation on the exchange feeds due to some downed lines from one of our providers. Let's forget for a minute that I used to do this job with the telcos, and I know exactly how they respond - the idea that a line stays down for a few hours, let alone a day, is something I almost laugh at. OK… I really laughed at this.

But like I said, let's forget about these facts…

No, today was a unique opportunity to test my auto-flipping logic on the exchange feeds, because on some feeds we lost the A side and on others the B side. So I should have seen groups of feeds flip to A and others to B. What I saw was that nothing flipped, so I dug into why.

Well… it turns out there were a few mistakes on my part. I had originally been using:

  bool UDPExchangeFeed::drainAllPendingMessages()
  {
    bool                 error = false;
    msg::DecodedMessage  *pkg = NULL;
    while (mPackages.peek(pkg)) {
      if (pkg == NULL) {
        mPackages.pop(pkg);
        continue;
      }
      deliverMessages(*pkg);
    }
    return !error;
  }

the idea being that if I ran into a NULL in the queue, I'd skip it. Otherwise, I'd deliver the messages in the package and continue. Hold on a sec… there's my first mistake. I'm never popping off the messages!

Yes, friends, I had an infinite loop, and that was what was stopping my flipping from happening. I needed to have something like this:

  bool UDPExchangeFeed::drainAllPendingMessages()
  {
    bool                 error = false;
    msg::DecodedMessage  *pkg = NULL;
    while (mPackages.peek(pkg)) {
      if (pkg != NULL) {
        deliverMessages(*pkg);
      }
      mPackages.pop(pkg);
    }
    return !error;
  }

where it's clear that I only deliver on a non-NULL peek, but I always pop the top element so the loop makes progress.

The next problem I found wasn't so much a logic issue as a use-case issue. The trigger I was using for knowing when to flip sides was the size of the incoming datagram queue. The problem is that if the decoders are keeping up, that queue is almost always going to be very small - it's the decoded packages queue that holds the real backlog. So let's add the two together and use that as the trigger. Looking much better now.
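
The trigger itself is now nothing more than a sum - a sketch, with the names as assumptions:

  #include <stddef.h>
 
  // sketch only - the real backlog is both queues together, since the
  // datagram queue stays small whenever the decoders are keeping up
  bool shouldFlip(size_t datagramQueueSize, size_t packageQueueSize,
                  size_t flipSize)
  {
    return ((datagramQueueSize + packageQueueSize) > flipSize);
  }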

The final issue was really one of size. What happens when I have a trip level of 50,000 messages and a feed that doesn't produce that in 5 mins? I get stale data. That's no good. What I need to do is detect when there's a long period of inactivity on the preferred side while there's something on the other side to use. To figure this out, I put a little counter on the loop to count up how many "preferred side is empty - wait" passes I'd had. If it was enough - say 75 - and there was something on the other side, even if not 50,000 messages, flip over, because this side isn't producing anything now.

With this, I get the behavior I was originally looking for. We flip when we have data and it doesn't take a long time to do it. I don't miss a lot, and we have a nicely self-adjusting system. Good news that this came up today.

Wild Problem in Boost ASIO Async Reader

Friday, February 17th, 2012

Boost C++ Libraries

This afternoon, I ran into what appears to be a problem in boost ASIO. I'm reading a framed TCP message where the fixed-size header includes the number of bytes in the rest of the message, and then reading the rest of the message with the boost::asio::transfer_all() completion condition. That condition is supposed to either return the full buffer or an error. What I'm seeing from time to time is that I ask for n bytes and get m, where m is a good bit below n.
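
The read is shaped roughly like this - a sketch, where the 4-byte length header and the lack of byte-order handling are assumptions about the wire format:

  #include <stdint.h>
  #include <vector>
  #include <boost/asio.hpp>
 
  bool readFrame(boost::asio::ip::tcp::socket & sock,
                 std::vector<char> & body)
  {
    boost::system::error_code  err;
    uint32_t  len = 0;
    // read the fixed-size header to learn the payload size
    boost::asio::read(sock, boost::asio::buffer(&len, sizeof(len)),
                      boost::asio::transfer_all(), err);
    if (err) return false;
    body.resize(len);
    // transfer_all() is supposed to block until 'len' bytes arrive,
    // or hand back an error...
    size_t got = boost::asio::read(sock, boost::asio::buffer(body),
                                   boost::asio::transfer_all(), err);
    // ...but this is the check that's tripping: got comes back < len
    return (!err && (got == len));
  }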

I have been able to catch this in the reader, flag it as an error, and then notify the client to re-issue the request. These retries always seem to work (after I reset the socket connection), so it's not the server or the client - it's the communication between the two. Not a lot of fun, but at least I have a semi-reliable solution with the retries. This will hold until I get back to this in a few days.

But I'm just shocked that there's a problem in the boost ASIO code. I know it's possible to just drop the connections and not face the problem, but that seems to be excessive. What I want is to track down why this is happening.

Hopefully, I'll get to it.

Refactoring the Feed Recorders

Friday, February 17th, 2012

Building Great Code

This morning my manager stopped by to talk about the problems I've been having with the transfer of data from the feed recorders to the archive server. He'd talked to my ex-manager about the issue, and they came up with the idea that we could just increase the frequency at which the recorder writes to the filesystem, and then stick with the filesystem, where things look to be stable.

It's a pretty good idea. I needed to work out a few things - like the filenames and how to deal with them - but in general the idea is sound: if there's an existing file being "filled", add to it; if not, create a new one and it becomes the current file. Once a file exceeds a certain size, don't write to it again, and let the next block of data create a new file.
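
Picking the file to write to is then a one-stat decision - a sketch, with the size limit and the naming left as assumptions:

  #include <string>
  #include <sys/stat.h>
 
  // illustrative size limit - the real cutoff is a tuning decision
  static const off_t  kMaxFileSize = 512 * 1024 * 1024;
 
  // append to the current file if it exists and is still under the
  // limit; otherwise roll over to a fresh one
  std::string pickFile(const std::string & current,
                       const std::string & fresh)
  {
    struct stat  info;
    if ((stat(current.c_str(), &info) == 0) &&
        (info.st_size < kMaxFileSize)) {
      return current;   // keep filling the existing file
    }
    return fresh;       // this becomes the new current file
  }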

The last trick I added was to have each write rename the file when appending data - to include the ending timestamp. This makes it so the files are always consistent, and if the recorders crash we're not losing much, as they only hold 5 sec of data. The rest is on the filesystem, with an up-to-date filename, so it's easy to use at any point in time.

Empty blocks will naturally take care of themselves, and we're looking pretty good. It's a solid plan.

So I took out all the Broker-related stuff from the feed recorders, and then cleaned things up on the archive server, so that I got back to a nearly neutral state, checked that in, and then started updating the recorders to write out their data in this manner. The server was pretty much untouched as it now functions completely on the filesystem.

I started the tests, and sure enough, about every 5 sec, I get an update and the file gets a little bigger, and the name changes. The CPU load is a little bigger, but it's not bad, and the payoff should be significant -- the archive server should just work.

I need to let it run for a bit, and then hit it with my tests to see, but I'm very optimistic about the possibilities.

Struggling with Data Transfer Issues

Thursday, February 16th, 2012

bug.gif

I spent about 60% of my day today dealing with user issues related to the latest testing cycle, and the rest of my day trying to get a good handle on this data transfer issue that I'm having with my archive server and feed recorders. It appears that so long as the recorders write the files to the filesystem, the server can read them and decode them just fine. But when I go through The Broker for the in-memory buffers in the recorders, I get all kinds of junk.

It's not every time, though. It's easy enough to repeat at the scale I need the system to run at, but very tricky to repeat at the small scale that would make the problem easy to find. It could be a boost ASIO problem - I know I made some changes recently hoping that would clear things up - but maybe we have a problem with the io_service instances? Don't know.

I do know that when I only look at the filesystem, it's fine. Crud.

So today I spent as much time as I could gather to narrow down the problem. As of the end of the day, I'm thinking it's not in the serial transfer but in something after that. What? I can't say yet, but it looks like the data is getting into the process just fine.

We'll have to work on it more tomorrow.

Quick Checking of Listening Socket Using Boost ASIO

Wednesday, February 15th, 2012

Boost C++ Libraries

This morning a user wanted to be able to see if a Broker was alive and listening on the machine and port it was supposed to be on. This came up because the Unix Admin group decided to do a kernel update on all the staging machines last night, and we didn't have everything set to auto-restart. Therefore, we had no processes for people to connect to. Not good. Thankfully, we have backups, but how does a user know when to hit the backup? After a failed call, sure, but with retries built into the code, that can take upwards of 10 sec. What about something a little faster?

It seemed like a reasonable request, so this morning I added a little isAlive() method to the main Broker client. It's very simple: it just tries to connect to the specific host and port it's supposed to use, and if something is there listening, it returns 'true'; otherwise, it returns 'false'. Really easy.

Boost ASIO makes it a little less easy than it ought to be, but still, it's not too bad:

  bool MMDClient::isAlive()
  {
    bool       success = false;
 
    // only try if they have set the URL to something useful...
    if (!mHostname.empty() && (mPort > 0)) {
      using namespace boost::system;
      using namespace boost::asio;
      using namespace boost::asio::ip;
 
      // getting the connection in boost is a painful process…
      tcp::resolver   resolver(mIOService);
      std::ostringstream  port;
      port << mPort;
      error_code              err = error::host_not_found;
      tcp::resolver::query    query(mHostname, port.str());
      tcp::resolver::iterator it = resolver.resolve(query, err);
      tcp::socket             sock(mIOService);
      if (err != error::host_not_found) {
        err = error::host_not_found;
        tcp::resolver::iterator   end;
        while (err && (it != end)) {
          sock.close();
          sock.connect(*it++, err);
        }
      }
      // if we got a connection, then something is there…
      success = !err;
      // …and close the socket regardless of anything else
      sock.close();
    }
 
    return success;
  }

While using boost isn't trivial, I've found that the pros outweigh the cons, and it's better to have something like this that can handle multi-DNS entries and find what you're looking for, than to have to implement it all yourself.
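
Using it is about as simple as it gets - a sketch, since the host, port, and setter names here are stand-ins for the real client configuration:

  MMDClient  client;
  client.setHostname("broker-1");   // hypothetical host...
  client.setPort(8282);             // ...and port
  if (!client.isAlive()) {
    // nothing listening - time to fail over to the backup Broker
  }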

This guy tested out and works great. Another happy customer!