Today I've run into a problem where I'm really not at all pleased with the solution I have. It works, but that's a far cry from what I want to see. It's best described as brute force, and brute force, while sometimes necessary, is almost never how you'd describe an elegant solution. So here's the set-up:
I have volatility data for a series of instruments - OK, let's simplify it and make them options. We have the components of the option: underlying, expiration, strike, and then the "meat" of the data - the volatility. If I put this into a SQL table, it looks a lot like this:
CREATE TABLE Volatility (
acquired datetime NOT NULL,
product VARCHAR(42) NOT NULL,
expiration datetime NOT NULL,
strike FLOAT NOT NULL,
volatility DOUBLE,
generationTime datetime,
CONSTRAINT volatilityPK PRIMARY KEY (acquired, product, expiration, strike)
)
where I added the acquired field to hold the time I received the data, and generationTime to hold the timestamp the source sends me as the instant the data was generated on their end. It's pretty standard.
But I ran into a wrinkle. There were multiple sources sending me data for the same options, so I was getting a bunch of primary key violations. That's not good, so the fix is to make the key unique by adding the 'source', identified by the portfolio the data is coming from. So now the table looks like this:
CREATE TABLE Volatility (
acquired datetime NOT NULL,
portfolio VARCHAR(128) NOT NULL,
product VARCHAR(42) NOT NULL,
expiration datetime NOT NULL,
strike FLOAT NOT NULL,
volatility DOUBLE,
generationTime datetime,
CONSTRAINT volatilityPK PRIMARY KEY (acquired, portfolio, product, expiration, strike)
)
But now we're getting to see the real problem: size. Let's look at a single record. In order to hold the "meat" of the table - the volatility - we need to carry a primary key made up of:
- acquired - 8 bytes
- portfolio - average of 8 bytes
- product - average of 6 bytes
- expiration - 8 bytes
- strike - 4 bytes
for a total of approximately 34 bytes to hold an 8-byte volatility value. That's a ton of overhead.
The problem only gets worse when you look at the sheer volume of the data. I've got roughly 120,000 instruments, and if I'm sampling them every 10 sec, that's 6 times a minute, 360 times an hour, or about 3,600 times over a ten-hour trading day. That's 432,000,000 rows a day, or about 18,144,000,000 bytes for the data, of which a whopping 14,688,000,000 bytes is the key. That's almost 17GB of data a day - just to hold the volatility data.
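Spelling that arithmetic out - the ten-hour sampling day is my assumption, it's just what gets you from 360 samples an hour to 3,600 a day - a quick back-of-the-envelope in Java:

public class DailySizing {
    public static void main(String[] args) {
        long instruments   = 120_000L;                 // options being tracked
        long samplesPerDay = 6L * 60 * 10;             // every 10 sec, ~10-hour day = 3,600
        long rowsPerDay    = instruments * samplesPerDay;        // 432,000,000
        long keyBytes      = rowsPerDay * 34L;         // 14,688,000,000 bytes of key alone
        long totalBytes    = rowsPerDay * (34L + 8L);  // 18,144,000,000 bytes in all
        System.out.printf("rows/day: %,d  key: %,d B  total: %,d B (%.1f GB)%n",
                          rowsPerDay, keyBytes, totalBytes,
                          totalBytes / (1024.0 * 1024.0 * 1024.0));
    }
}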
If we held it in a 'map of maps' data structure, the only things we'd need to store per sample are the parts that actually change: the acquired time and the volatility. That could represent a huge savings in memory. But the cost is that I can't "mirror" this data to a back-end SQL database, and I can't use relational algebra to get data out.
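To be clear about what I mean by a 'map of maps', here's a rough sketch in Java - the names are mine, just to show the shape. The outer map is keyed by the instrument (in real code that key would be a little class holding portfolio, product, expiration, and strike; a String stands in for it here), and the inner sorted map is keyed by the acquired time and holds nothing but the volatility:

import java.util.NavigableMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ConcurrentSkipListMap;

public class VolCache {
    // outer map: one entry per instrument - the (portfolio, product,
    // expiration, strike) tuple is paid for exactly once, as the key
    private final ConcurrentMap<String, NavigableMap<Long, Double>> byInstrument =
        new ConcurrentHashMap<>();

    // record one observation: only the changing parts are stored -
    // the acquired time (millis) and the volatility itself
    public void put(String instrument, long acquired, double volatility) {
        byInstrument.computeIfAbsent(instrument, k -> new ConcurrentSkipListMap<>())
                    .put(acquired, volatility);
    }

    // latest volatility for an instrument, or null if we have nothing yet
    public Double latest(String instrument) {
        NavigableMap<Long, Double> series = byInstrument.get(instrument);
        return (series == null || series.isEmpty())
               ? null : series.lastEntry().getValue();
    }
}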
I've thought of trying to duplicate this in SQL by having two tables:
CREATE TABLE Instrument (
instrumentID IDENTITY NOT NULL,
portfolio VARCHAR(128) NOT NULL,
product VARCHAR(42) NOT NULL,
expiration datetime NOT NULL,
strike FLOAT NOT NULL,
CONSTRAINT instrumentPK PRIMARY KEY (instrumentID)
)
CREATE TABLE Volatility (
acquired datetime NOT NULL,
instrumentID INTEGER NOT NULL,
volatility DOUBLE,
generationTime datetime,
CONSTRAINT volatilityPK PRIMARY KEY (acquired, instrumentID)
)
where I have a table of instruments, and those all have unique IDs, and then I'm using that ID, a simple integer, to represent the corresponding fields in the old Volatility table. It's all possible, and it's closer to the "normalized database" you'd expect to see, but the problem comes in when you look at the other factor: access speed.
If I'm generating 432 million rows a day, then it's not going to be too long before it's virtually impossible to do a join on that table. Even a month of data - call it 21 trading days, and we're talking about holding years - would put nearly 9 billion rows in that table.
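Just to be concrete about the join in question, here's roughly what every read would have to run against that pair of tables - sketched in Java/JDBC since that's how I'd be calling it; the table and column names follow the DDL above, and the connection string, product, and strike are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class VolQuery {
    // the join needed to recover a de-normalized row from the
    // Instrument/Volatility pair - every read pays for this lookup
    private static final String SQL =
        "SELECT v.acquired, i.portfolio, i.product, i.expiration, " +
        "       i.strike, v.volatility, v.generationTime " +
        "  FROM Volatility v " +
        "  JOIN Instrument i ON i.instrumentID = v.instrumentID " +
        " WHERE i.product = ? AND i.strike = ? " +
        " ORDER BY v.acquired DESC";

    public static void main(String[] args) throws Exception {
        // placeholder connection string - not a real server
        try (Connection c = DriverManager.getConnection("jdbc:mydb://host/vols");
             PreparedStatement ps = c.prepareStatement(SQL)) {
            ps.setString(1, "SOME-PRODUCT");   // hypothetical product
            ps.setDouble(2, 105.0);            // hypothetical strike
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getTimestamp("acquired")
                        + "  " + rs.getDouble("volatility"));
                }
            }
        }
    }
}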
Nah... there's no way we can do a SQL JOIN on that guy. We have to use the de-normalized data, and then just live with it. Crud. Not great, but it's the best I can do right now.