Perl and Regular Expressions are Pretty Amazing

Late yesterday I was working on enhancing a feature of my fast-tick risk server where I wanted to be able to take research portfolios and load them into the system as if they were real positions - just tagged a little differently so they aren't confused with real positions. As I was doing this, I realized that I needed to parse the option symbol and remove a single component.

In my server, the IBM Dec 2008 85.00 Put is symbolized as IBM:IBM.U:20081220:85.0000:0, with the components separated by colons (:). The first is the underlying (the symbol for the underlying is often not the option symbol), the second is the option symbol, a dot, and the exchange the option trades on, the third is the expiration, the fourth is the strike, and the last is 0/1 for Put/Call. Pretty simple. But because of the file formats involved, I needed to have:

  IBM:20081220:85.0000:0

essentially stripping out the option symbol and exchange. I knew it was possible in Perl, but at the time I was on the train, trying to work this out on my way home. Thankfully, OS X has a complete Perl reference built in.
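To make the layout of the symbol concrete, here is a minimal sketch that breaks it into named fields with split. The helper name parse_symbol and the field names are my own for illustration, not anything from the server, and it assumes the option symbol itself never contains a dot.

```perl
use strict;
use warnings;

# Hypothetical helper: break a symbol of the form
#   underlying:option.exchange:expiration:strike:put_call
# into named fields. The field names are illustrative only.
sub parse_symbol {
    my ($symbol) = @_;
    my ($underlying, $opt_exch, $expiration, $strike, $put_call)
        = split /:/, $symbol;

    # The second component is the option symbol, a dot, then the exchange.
    # Assumption: the option symbol itself contains no dot.
    my ($option, $exchange) = split /\./, $opt_exch, 2;

    return {
        underlying => $underlying,
        option     => $option,
        exchange   => $exchange,
        expiration => $expiration,
        strike     => $strike,
        put_call   => $put_call,    # 0 = Put, 1 = Call
    };
}

my $fields = parse_symbol("IBM:IBM.U:20081220:85.0000:0");

# Rebuild the symbol without the option/exchange component.
print join(":", @{$fields}{qw(underlying expiration strike put_call)}), "\n";
```

Going through split like this is more code than a one-line substitution, but it spells out which piece is which.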

I started by assuming that the symbol was given to me. I knew I had it in the script; I just needed to mangle it into the proper form.

  my $symbol = "IBM:IBM.U:20081220:85.0000:0";

and if I did the simple regex on it, I almost got what I wanted:

  my $symbol = "IBM:IBM.U:20081220:85.0000:0";
  print $symbol . "\n";
  $symbol =~ s/(^.*)\:.*\:(.*$)/$1\:$2/;
  print $symbol . "\n";

I got:

  IBM:IBM.U:20081220:85.0000:0
  IBM:IBM.U:20081220:0

and as soon as I saw this, I knew it was because the first wildcard was being 'greedy' in its matching, and I was deleting the second-to-last component of the symbol, not the second. So I looked up the Perl docs on my Mac, and there in a wonderful example was the way to make it a stingy (non-greedy) match:

  my $symbol = "IBM:IBM.U:20081220:85.0000:0";
  print $symbol . "\n";
  $symbol =~ s/(^.*?)\:.*?\:(.*$)/$1\:$2/;
  print $symbol . "\n";

With this, I was able to match the first part properly and the results were what I wanted:

  IBM:IBM.U:20081220:85.0000:0
  IBM:20081220:85.0000:0
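To see why the two patterns behave differently, it helps to look at what each capture group actually grabs. The following is only a sketch: the captures helper is hypothetical, and I have dropped the backslashes in front of the colons, which Perl does not need.

```perl
use strict;
use warnings;

my $symbol = "IBM:IBM.U:20081220:85.0000:0";

# The same two patterns as above, without the substitution.
my $greedy = qr/(^.*):.*:(.*$)/;
my $lazy   = qr/(^.*?):.*?:(.*$)/;

# Hypothetical helper: return the two captured pieces (list context),
# or an empty list if the pattern does not match.
sub captures {
    my ($re, $text) = @_;
    return $text =~ $re;
}

for my $case (["greedy", $greedy], ["non-greedy", $lazy]) {
    my ($label, $re) = @$case;
    my ($head, $tail) = captures($re, $symbol);
    printf "%-10s head='%s' tail='%s'\n", $label, $head, $tail;
}
```

The greedy version keeps everything up to the second-to-last colon in the first group, so only the strike gets dropped. The non-greedy version stops the first group at the first colon, which is exactly the component boundary I was after.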

While I knew there was a key to regexes that would turn the normally greedy match into a stingy one, I'm still amazed at the power of a language like Perl, with its very powerful regex system built in. I put in the code this morning and it worked like a charm. It's really pretty neat that a half-dozen lines of a Perl script can add all this functionality. Sweet.
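For what it's worth, there is another way to sidestep the greedy problem that doesn't rely on the ? modifier at all: match "anything but a colon" instead of "anything". This is just a sketch of that alternative, not what I put in the server, and the name strip_option_component is made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: drop the second colon-separated component.
# [^:]* can never run past a colon, so greediness is not a concern.
sub strip_option_component {
    my ($symbol) = @_;
    (my $stripped = $symbol) =~ s/^([^:]*):[^:]*:/$1:/;
    return $stripped;    # unchanged if there are fewer than two colons
}

print strip_option_component("IBM:IBM.U:20081220:85.0000:0"), "\n";
```

Because the character class spells out where a component ends, the pattern reads a little closer to the intent: keep the first field, skip the second, leave the rest alone.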