Archive for the ‘Coding’ Category

Reading More on Clojure

Monday, November 19th, 2012


We are going to be writing a project in Clojure, and I've been told I'm going to be on this project, so I need to get up to speed on Clojure. Thankfully, I had the Pragmatic Programmers' book Programming Clojure, so I've been re-reading it these past few days.

Interestingly, it's a lot like Erlang, which I had to learn at a previous position for some work I was doing there. In fact, that's when I picked up the Clojure book, but I also had to pick up the Erlang book, and since there was real work to do in Erlang, it took top priority for my time, so I learned that language more completely.

Now I'm looking at Clojure, and it's really a lot like Erlang in a bunch of ways. The record data structure, the gen_server ideas, and the functional code all make for a pretty quick learn for me. There's still a lot I need to read - I'm only about one-third of the way through - but it's something I've been working on these days. Just to be ready.

Decided to Switch to Homebrew

Thursday, November 15th, 2012

Homebrew

I've got old installs of Erlang and Clojure, and I need to update them for work I'm about to do, but I don't feel like repeating the same old manual installs… I'm going to try Homebrew for package management because it's been working so well on my work laptop. So I cleared out the old installs of these packages, which was a chore - basically deleting complete directories in /usr/local/ or in ~/Library/ - and I also took the time to clean up my .bash_login and .bashrc, because they had additions to PATH, and even DYLD_LIBRARY_PATH, that needed to be removed as well.

Once I had the old stuff removed, I installed Homebrew with the simple command:

  $ ruby -e "$(curl -fsSkL raw.github.com/mxcl/homebrew/go)"

and it installed itself just fine. Having done this already, I knew what to expect, but the next steps were really nice:

  $ brew install erlang
  $ brew install leiningen

where Leiningen is the Clojure package manager and REPL tool. Once I had this installed, I noticed that /usr/local/bin wasn't early enough in my PATH to make sure I picked up the Homebrew commands rather than the native OS X ones.

Actually, Homebrew itself pointed this out to me. Nice installer. So I had to track down where this was happening. Interestingly enough, I wasn't adding /usr/local/bin to my PATH - the system was! In /etc/paths there's a list of paths to add:

  /usr/bin
  /bin
  /usr/sbin
  /sbin
  /usr/local/bin

and I needed to change it to:

  /usr/local/bin
  /usr/bin
  /bin
  /usr/sbin
  /sbin

to get things right. Now, I had the PATH right, and both Erlang (erl) and Clojure (lein repl) started up just fine. Sounds like a no-op, but I'm on more recent versions, and for the work I'm about to get into, switching to Leiningen is a must.

But I didn't stop there… Oh no… I kept on cleaning things up. I don't even have Qt on this box, but it was in my PATH, along with Groovy and a lot of other things I don't have and don't need. All cleaned up.

By now my .bash_login and .bashrc are looking almost spartan. But then I was wondering about PostgreSQL. Was that on Homebrew? Would it work with Apache2 on my OS X box? Since I had the time, I decided to try it. So once again, I followed the simple steps to migrate from one package to the other:

Step 1 - make a complete backup. I went into my home directory and backed up everything in my server:

  $ /usr/local/pgsql/bin/pg_dumpall -U _postgres -o > pgbackup

Step 2 - shut down the old version, and remove its startup script from the system-wide install location:

  $ sudo launchctl unload \
      /Library/LaunchDaemons/org.postgresql.postgres.plist
  $ sudo rm /Library/LaunchDaemons/org.postgresql.postgres.plist

Step 3 - remove the old install, and all the symlinks in the man pages and the /usr/local/bin directory that I made myself with this install:

  $ cd /usr/local
  $ sudo rm -rf pgsql-9.1

There was some shell magic in removing the links - an ls piped into a grep for 'pgsql', then deleting the matches. Nothing fancy, but it took a little time.

Now that the old PostgreSQL install was really gone - even from my .bash_login and .bashrc - I was ready to install PostgreSQL from Homebrew. One of the draws was that Homebrew's version was 9.2.1, while the previous install was 9.1.

Step 4 - install PostgreSQL:

  $ brew install postgresql

Step 5 - create initial database for Homebrew PostgreSQL install:

  $ initdb /usr/local/var/postgres -E utf8

Step 6 - set it to start on my login, and start it now:

  $ mkdir -p ~/Library/LaunchAgents
  $ cp /usr/local/Cellar/postgresql/9.2.1/homebrew.mxcl.postgresql.plist \
        ~/Library/LaunchAgents/
  $ launchctl load -w ~/Library/LaunchAgents/homebrew.mxcl.postgresql.plist

Step 7 - reload databases from initial dump:

  $ psql -d template1 -f ~/pgbackup

At this point, I could run psql and access the databases, so I was sure I was up and running. Next, I needed to see about the integration with Apache2 - I have to have that working for some projects I've done and am still working on.

Step 8 - activate PHP in the Apache2 config on my box. Edit /etc/apache2/httpd.conf and uncomment the line that looks like:

  LoadModule php5_module libexec/apache2/libphp5.so

and restart apache:

  $ sudo apachectl graceful

Step 9 - make Apache serve my ~/Sites directory again. Create the file /etc/apache2/users/drbob.conf:

  <Directory "/Users/drbob/Sites/">
    Options FollowSymLinks Indexes MultiViews
    AllowOverride None
    Order allow,deny
    Allow from all
  </Directory>

and at this point, I had the quite familiar PHP info screen up and my simple database accessing page worked like a charm. I'd successfully completed the migration!

But was I done? No!

I've been running boost 1.49.0 for a while, and I like that I figured out how to do universal binaries of the libraries. Very nice. But then I checked Homebrew:

  $ brew info boost
  boost: stable 1.52.0 (bottled), HEAD
  www.boost.org
  Not installed
  github.com/mxcl/homebrew/commits/master/Library/Formula/boost.rb
  ==> Options
  --with-icu
    Build regexp engine with icu support
  --without-python
    Build without Python
  --with-mpi
    Enable MPI support
  --universal
    Build a universal binary

so I could update to boost 1.52.0 and get the same universal binaries without missing a beat! This might be really nice. So I removed my own boost install:

  $ cd /usr/local/include
  $ sudo rm -rf boost
  $ cd /usr/local/lib
  $ sudo rm -rf libboost_*

and then I installed boost from Homebrew:

  $ brew install boost --universal

Odd… I got:

  ...failed updating 22 targets...
  ...skipped 12 targets...
  ...updated 10743 targets...
 
  READ THIS: github.com/mxcl/homebrew/wiki/troubleshooting
 
  These open issues may also help:
    github.com/mxcl/homebrew/issues/14749

The hint was to run brew doctor and correct all the errors. Well… I had a lot of them - all from my manual boost and gfortran installs. So I ditched my old gfortran install and cleaned up all the problems and then I re-ran the install:

  /usr/local/Cellar/boost/1.52.0: 9086 files, 362M, built in 6.1 minutes

When I looked in /usr/local/include and /usr/local/lib I saw all the boost code, and I even checked that I got the universal binaries:

  $ file /usr/local/lib/libboost_wave-mt.dylib 
  /usr/local/lib/libboost_wave-mt.dylib: Mach-O universal binary with 2 architectures
  /usr/local/lib/libboost_wave-mt.dylib (for architecture i386): Mach-O dynamically
    linked shared library i386
  /usr/local/lib/libboost_wave-mt.dylib (for architecture x86_64): Mach-O
    64-bit dynamically linked shared library x86_64

Excellent!

Now to put back gfortran from Homebrew:

  $ brew install gfortran

and after cleaning up more cruft from the old gfortran install, it installed and worked just fine!

I have now successfully replaced all the third-party builds I once did by hand with Homebrew packages. This is amazing stuff.

Starting to See a Little Light – Maybe it’s the Train?

Wednesday, November 14th, 2012

Great News

Today I got in early and really started hammering the last few issues for the Pilot starting Monday. I have 10 hours. It's got to be deployed this afternoon and that's it. We have to run in test mode for two days and then turn it on. So it was time to get down to business.

I was able to fix up the last few issues pretty easily, and by 9:00 am I was looking pretty good. People started coming in and things were looking even better, and stand-up went smoothly.

Maybe a little light?

Then I started to polish a little of the code as we needed to clean some stale code out, and clear out a few things with Salesforce - and promote something else. All looking very good. I'm hours away from the deployment for tonight, and things are looking good. I hate to say this - for fear of jinxing myself, but it was looking good.

CouchDB

I then had the time to look at one of the CouchDB problems I've been having: socket errors. What I was doing in the code was writing the documents one by one:

  payloads.each do |data|
    Database.store('Final results') { data[:merchant] }
  end

and while it works, it's making thousands of REST calls to Couch, and that's not efficient at all. There's a bulk store API to Couch in the client we're using, and if I just change the code to be:

  essentials = payloads.map do |data|
    data.select { |k,_| k != :demand_pool }
  end
  Database.store('Final results') { essentials }

then we're making one connection for every 2,000 documents instead of one per document, and all of a sudden, things are under control with Couch!
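For what it's worth, the chunking mechanics behind a bulk write like that are tiny. Here's a sketch of the idea - `store_in_batches` and the `save_batch` block are illustrative stand-ins for whatever bulk call the Couch client actually exposes; the 2,000-document batch size is the only number taken from above:

```ruby
# Sketch of batching documents for bulk writes. Each slice becomes
# one HTTP round-trip to the database, instead of one per document.
BATCH_SIZE = 2000

def store_in_batches(docs, batch_size = BATCH_SIZE, &save_batch)
  docs.each_slice(batch_size) do |batch|
    save_batch.call(batch)   # one bulk call per slice
  end
end
```

The same shape works for any store that offers a bulk endpoint, which is part of why the later "switch it out" option stays cheap.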

This is GREAT news! I'm super happy about this. It means we may not have to ditch Couch, and that's nice, but certainly we have plenty of time to switch it out based on what we're doing and needing, and not on some socket problem with the server.

Very nice news!

The day is looking up… maybe the light isn't the oncoming train, after all!

The Final Push for a Big Pilot – Crazy Plans

Tuesday, November 13th, 2012


Today was a stressful day - very. I'm trying to get the CouchDB and Salesforce endpoints to scale, having essentially no luck, and then going into a meeting this afternoon about how we can't hold back anything for the new Pilot Launch on Monday.

Like the Orange Juice commercial, the fun just kept coming as we were in a weekly planning meeting and I found out that we need to have even more features in the code before tomorrow evening.

13 hours.

We have 13 hours to stabilize this code, get some sense of scalability, fix a few outstanding features, and then deal with these few more - all in 13 hours.

I think it's a mistake, and it's driven by nothing more than the desire by the project manager to look good to his boss(es). It's all about face. His. He isn't asking us for dates, he's telling us dates, and then when things come up - as they often do, it's "impossible" to move the dates. All for face.

So I'm getting a little tired of this today. I need to just go home.

Thankfully, it's Angelina's Birthday, so I have that to look forward to tonight. Good.

Google Chrome dev 25.0.1323.1 is Out

Tuesday, November 13th, 2012

This morning I noticed that Google Chrome dev 25.0.1323.1 was out and the release notes have returned to a more spartan style. Yes, people can read SVN logs, but that's not the point - really, is it? If you take the time to make a blog post about the release, you should be able to make release notes that say more than "Read the SVN logs".

Still… progress is nice to see continuing.

Updated Configs to Work in Two Datacenters

Monday, November 12th, 2012


The current plan was to have Production running on the old EC2 boxes and have UAT run out of our data center until we were sure things were OK, and then switch Production over as well. This seemed like a good plan, but there were issues with this and management wanted to run the essential production data, and UAT data, in EC2 and then run the non-essential production data and UAT data in SNC1. This means that there would be multiple boxes running the same code hitting the same data sources and sinks, but covering different regions.

Sounds reasonable, and even safe. So let's do that.

The issue with getting a different config for UAT in one datacenter is that all we really have to distinguish the machines is hostname versus hostname -f, and I had to use that every place I could. The wrinkle came in the configuration, which isn't looking at the machine name at all - just the environment setting, 'uat' or 'production'. Not so easy.
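The kind of check this boils down to is simple enough. A hedged sketch of the idea - the domain suffix, datacenter names, and function name here are all made up for illustration:

```ruby
require 'socket'

# Sketch: derive the datacenter from the full hostname (what
# `hostname -f` reports), since the environment setting alone
# ('uat' vs. 'production') can't tell the two datacenters apart.
# The '.snc1.' suffix test is illustrative.
def datacenter(fqdn = Socket.gethostname)
  fqdn.include?('.snc1.') ? :snc1 : :ec2
end
```

With something like that, the same 'uat' environment can load different endpoints depending on which datacenter the box actually lives in.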

But I worked on this all day. It was not easy. And then I was ready to test.

It wasn't pretty.

The problem comes in that we aren't synchronizing the work between data centers, and this makes the later processing steps (prioritization) fail because you only have part of the picture. There's no easy way around it.

The next problem was continued Couch errors. Yes, using my caching endpoint helped, but we'd still get problems now and then. No easy solution in our code.

So in the end, almost a wasted day. Almost.

Not a great feeling.

Scaling Problems with Salesforce and CouchDB

Friday, November 9th, 2012

CouchDB

Today I've been having real trouble trying to get our system scaled up to new hardware in the datacenter. Moving from hosts in Amazon EC2 to machines in our own datacenter is a big step in the right direction, and going from iffy bandwidth in EC2 to solid switches in the datacenter - and from 2 cores to 24 - is something we simply have to do in order to scale up to the global data load we're going to need to handle.

The problems I've run into today are all about loading. I've already added a queueing system to the CouchDB interface, in order to minimize the connections to the Couch server so as not to overload it - there were times when the Couch server would simply shut down its socket listener, and therefore refuse all updates sent from the process. Not good.

Salesforce.com

There are also a lot of problems today with Salesforce. I don't think they expected the kind of loads we're delivering. This morning, at 3:00 am, Salesforce called the support guys at the shop and told them that a process was bringing one of their sandbox clusters to its knees - and that, it turns out, is but one of four boxes we need to bring online. They're having a hard time handling this much - I can't imagine what's going to happen when we try to really ramp it up.

I'm starting to have real concerns about both these endpoints of the project. I know there's a lot of data getting moved around, and while we're able to handle it, it's these endpoints that are having the hardest time. I've talked with the project manager and the technical manager about this, and I think we need to start thinking about potential bail-out scenarios.

It's certainly possible to keep reading from Salesforce. We're planning on re-doing the complete demand system, so there shouldn't be an issue there. Persistence? Go back to MySQL or PostgreSQL and store it all in tables. The data is getting pretty nicely finalized, so a clean schema could be made. Save it all in a SQL database, make a simple service that reads and caches this data and offers it up to the clients, and the pages already built on the existing data sources are pretty easily modified.

Odd to think that I was looking at MySQL before Couch popped up. Funny thing is, I have a strong feeling that Salesforce can come up with hardware that makes the grade, but it's the bugs in Couch that worry me the most. You just gotta wonder if it's in the Erlang code, or on the boxes, or what.

So many unknowns, but it's clear that we can't just scale up to one nice box - there's no way we're going to make it work globally without some serious changes.

Adding Queueing to CouchDB Interface

Friday, November 9th, 2012

CouchDB

I've been fighting a problem with our CouchDB installation using the CouchRest interface for ruby/jruby. What I'm seeing is that when I have a lot of updates to CouchDB from the REST API in CouchRest, we start to get socket connection errors to the CouchDB server. I've gone through a lot of different configurations of open file handles on both boxes, and nothing seems to really 'fix' the problem.

So what I wanted to do was make a queueing database class. Currently, we have something that abstracts away the CouchDB-centric features and adds in the necessary metadata so that we can more easily track all the data held in CouchDB. This is a nice start, in that all I really needed to add to the bulk of the code was a simple flush method:

  Database.flush

where I initially started with it being implemented as:

  def flush
    nil
  end

At this point, I was free to do just about anything with the Database class - as long as I didn't change the public API. What I came up with is this:
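A minimal sketch of the shape of it - every internal name below (write_doc, the nil sentinel, the written accessor) is an illustrative stand-in, since the real class wraps CouchRest and our metadata handling, but the public store/start/flush surface matches what the rest of the code already calls:

```ruby
# Sketch of a queueing Database facade. If start() has been called,
# store() enqueues the document and N writer threads drain the queue;
# otherwise store() writes straight through. write_doc is a stand-in
# for the real CouchDB write.
class Database
  @mutex   = Mutex.new
  @written = []   # stands in for the CouchDB store

  class << self
    # Spin up n writer threads draining a shared queue.
    def start(n)
      @queue   = Queue.new
      @writers = Array.new(n) do
        Thread.new do
          while (doc = @queue.pop)   # a nil sentinel ends the loop
            write_doc(doc)
          end
        end
      end
    end

    # Same public signature as before: a label plus a block yielding data.
    def store(label)
      doc = yield
      @queue ? @queue << doc : write_doc(doc)
    end

    # No-op unless start() was called; otherwise drain and join.
    def flush
      return nil unless @queue
      @writers.size.times { @queue << nil }
      @writers.each(&:join)
      @queue = nil
    end

    # Visible only for the sketch; the real thing writes to CouchDB.
    def written
      @written
    end

    private

    def write_doc(doc)
      @mutex.synchronize { @written << doc }
    end
  end
end
```

Which mode you get is just a question of whether start() ever ran - which is exactly the writers-count switch in the config.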

What I really like about this is that I can use it either way - as a caching/flushing endpoint, or as a straight pass-through with thread-local database connections. This means that I can start with a simple config file setting:


  data_to: 'database'
  writers: 3

which will give me three writer threads, as specified in the start method argument. Then, when I fix the CouchDB issues, I can switch to the simpler, thread-local connection storage just by setting writers to zero, since the queueing is only started with:

  Database.start(config.writers) if config.writers > 0

The call to flush is a no-op if we didn't start anything, so there's no harm in always calling it at the end of the process. Pretty nice. I've easily verified that this gets me what I need, and it's just a matter of how 'throttling' I want to be with the writing of the data to CouchDB. But I tell you this… I'm closer to MySQL than ever before - just because of this. There's no reason in the world to put up with this kind of software if it can't do the job.

Trying to Tune CouchDB

Friday, November 9th, 2012

CouchDB

When faced with the idea that CouchDB might be running low on sockets (file handles), I decided to do a little digging into the CentOS 5 configuration, and see what I could do. Turns out it wasn't all that hard, and I got immediate results. The first thing is to up the hard and soft limits on the file handles in /etc/security/limits.conf for the couchdb user:

  couchdb  soft  nofile  32768
  couchdb  hard  nofile  60000

then simply restart CouchDB. You can check the process limits with:

  $ cat /proc/1234/limits

where 1234 is the pid of the Erlang 'beam' process. You are also going to need to tell Erlang about the number of file handles in the /usr/local/etc/default/couchdb file:

  export ERL_MAX_PORTS=32768
  export ERL_FLAGS="+A 8"

The second line there allows 8 threads in Erlang to process disk updates. This multi-threaded filesystem access isn't on by default in Erlang, so if you're on a big box with lots of cores, it makes a lot of sense to turn it on.

With these, you have CouchDB running about as well as you can get. The rest is up to you.

Fixing JRuby

Thursday, November 8th, 2012

JRuby

While trying to figure out a problem with the scaling, I was getting a bunch of BindExceptions, and they seemed to point to this spot in the jruby-1.7.0 code:

  try {
      // This is a bit convoluted because (1) SocketChannel.bind
      // is only in jdk 7 and (2) Socket.getChannel() seems to
      // return null in some cases
      channel = SocketChannel.open();
      Socket socket = channel.socket();

      if (localHost != null) {
          socket.bind( new InetSocketAddress(
              InetAddress.getByName(localHost), localPort) );
      }

      try {
          // Do this nonblocking so we can be interrupted
          channel.configureBlocking(false);
          channel.connect( new InetSocketAddress(
              InetAddress.getByName(remoteHost), remotePort) );
          context.getThread().select(channel, this, SelectionKey.OP_CONNECT);
          channel.finishConnect();

          // only try to set blocking back if we succeeded to finish connecting
          channel.configureBlocking(true);

If I was getting bind exceptions, then they had to be coming from line 104 - after all, that's the call to bind(). The problem was, this really needed to be preceded by a call setting SO_REUSEADDR to true, so that we didn't have these kinds of issues.

Something like this:

  try {
      // This is a bit convoluted because (1) SocketChannel.bind
      // is only in jdk 7 and (2) Socket.getChannel() seems to
      // return null in some cases
      channel = SocketChannel.open();
      Socket socket = channel.socket();

      if (localHost != null) {
          socket.setReuseAddress(true);
          socket.bind( new InetSocketAddress(
              InetAddress.getByName(localHost), localPort) );
      }

      try {
          // Do this nonblocking so we can be interrupted
          channel.configureBlocking(false);
          channel.connect( new InetSocketAddress(
              InetAddress.getByName(remoteHost), remotePort) );
          context.getThread().select(channel, this, SelectionKey.OP_CONNECT);
          channel.finishConnect();

          // only try to set blocking back if we succeeded to finish connecting
          channel.configureBlocking(true);

Simple change, but it'd make a huge difference to the operation of the socket. So I submitted a request to the JRuby team.

After a little bit, I realized that I needed to try and build this all myself, and see if I couldn't just prove the problem for myself. The difficulty was - How to build JRuby?

Turns out, it's not all that hard. First, fork the code on GitHub so you can issue Pull Requests with the changed code. Then, follow the directions to build everything about JRuby:

  $ git clone git://github.com/drbobbeaty/jruby.git
  $ cd jruby
  $ ant
  $ ant jar-complete
  $ ant dist

It should all build just fine. Then check everything in, and push it up to GitHub. Next, you need to tell rvm to build a new version of ruby from that GitHub repo:

  $ rvm get head
  $ rvm reinstall jruby-head --url git://github.com/

And we're almost done… Update the Gemfile to have the path to the jruby-jars gem where you built it, and then update the .rvmrc. I was able to get things running and even package a jar file for our deployment.
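The Gemfile change is just a local path override. The path below is purely illustrative - point it at wherever your build actually left the jruby-jars gem:

```ruby
# Gemfile - use the locally built jruby-jars instead of the released gem.
# The :path value is an illustrative placeholder, not a real location.
gem 'jruby-jars', :path => '/Users/drbob/jruby/gem/jruby-jars'
```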

Hacking on the language. That's a first for me. Quite a fun experience.

What I noticed was that it wasn't the bind on line 104 - it was the connect on line 113. Java returned the BindException as a general "can't get the endpoint" error, and I found this out by putting in catch statements and logging the host and port from the exception. This was very illuminating, and I'm very glad I did it.

I'm even going to send a pull request to see if they'll take it. It's a very useful debugging tool.