Archive for October, 2014

Upgraded to Mac OS X 10.10 Yosemite

Monday, October 20th, 2014


This morning I got a little antsy and decided that, on my personal laptop, it was time to upgrade to Mac OS X 10.10 Yosemite. I am so glad I did. The flat look - the one all the review sites have been showing - is just what I like. Almost no chrome - no need for it. The visuals are smooth on my retina MacBook Pro, and I'm guessing that's the key to this upgrade - having a rMBP. Still... I've got one :), so I'm loving this.

I'm still on the Light treatment, as I don't find the transparency at all intrusive or distracting - I've been using it in MacVim for a while. MacVim was the only app I was worried about, and it works fine. What I didn't expect was that Adium 1.5.10 could no longer connect to Yahoo! IM. That's not good, because I still reach out to Jeremy on Yahoo!.

I looked on the web, and it turns out this is a known issue with OS X 10.10, and there is a fix for it, but it's not out yet - not even in beta. I didn't really dig into what the patch involves, but they have been so proactive in the past that I'm a little surprised there's no beta out there for this.

After all, this has been out in "general release" for quite a while now, and while they may not want to target every point update, it's another thing to know there's a fix and then let weeks pass with no beta or update. I guess I'll have to wait.

Changing the Topology to Get Stability (cont.)

Friday, October 17th, 2014


I have been working on the re-configuration of the experimental analysis topology to try and get stability, and it's been a lot more difficult than I had expected. What I'm getting looks a lot like this:

[graph of the instability]

and that's no good.

Sadly, the problem is not at all easy to solve. What I'm seeing when I dig into the Capacity numbers for the bolts is that only one of the 300+ bolts has a capacity that exceeds 1.000 - and then by a lot. All the others are less than 0.200 - well under control. So why?
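For reference, this is my reading of what that Capacity number means (it's how I understand the Storm UI metric, not anything pulled from our code): roughly the fraction of the measurement window a bolt spends executing tuples, so anything over 1.0 means the bolt can't keep up with its input.

public class CapacitySketch {
    // executed:         tuples executed in the window
    // executeLatencyMs: average execute latency, in milliseconds
    // windowMs:         length of the measurement window (the UI uses 10 minutes)
    static double capacity(long executed, double executeLatencyMs, long windowMs) {
        return (executed * executeLatencyMs) / windowMs;
    }

    public static void main(String[] args) {
        // made-up numbers: 1.2M tuples over a 10-minute window
        System.out.println(capacity(1_200_000L, 0.4, 600_000L)); // ~0.8 - keeping up
        System.out.println(capacity(1_200_000L, 0.6, 600_000L)); // ~1.2 - falling behind
    }
}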

I've looked at the logs, and looked at the logs again - nothing... I've looked at the tuples moving through the bolts and, interestingly, found that many of them just aren't moving any tuples. Why? No clue, but it's easy enough to scale those back to be a more efficient topology.

What I've come away with is the idea that it might be the different way we're dealing with the data. So this weekend, if I have time, I'll dig in and see what I can do to make this more even between the two. It's not obvious, but then - that's 0.9.0.1 software.

Changing the Topology to Get Stability

Thursday, October 16th, 2014


OK... it turns out that I needed to change my topology in order to get stability, and it's only taken me a day (or so) to find this out. It's been a really draining day, but I've learned that there's a lot more about Storm that I don't know than there is that I do.

Switching from one bolt that calls two functions to two bolts, each calling one function, should certainly add more communication overhead. Interestingly, that's how I regained stability. The lesson I've learned: make bolts as small and purposeful as possible.
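To make the shape of the change concrete, here's a little sketch - not the actual experiment code, and all the class, field, and stream names are invented:

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class SplitBoltSketch {
  // first small bolt: just the parsing step (placeholder logic)
  public static class ParseBolt extends BaseBasicBolt {
    public void execute(Tuple t, BasicOutputCollector out) {
      out.emit(new Values(t.getString(0).trim()));
    }
    public void declareOutputFields(OutputFieldsDeclarer d) {
      d.declare(new Fields("parsed"));
    }
  }

  // second small bolt: just the scoring step (placeholder logic)
  public static class ScoreBolt extends BaseBasicBolt {
    public void execute(Tuple t, BasicOutputCollector out) {
      out.emit(new Values(t.getString(0).length()));
    }
    public void declareOutputFields(OutputFieldsDeclarer d) {
      d.declare(new Fields("score"));
    }
  }

  public static TopologyBuilder build() {
    TopologyBuilder builder = new TopologyBuilder();
    // spout wiring omitted - assume a spout registered as "events"
    // before: one bolt doing both jobs
    //   builder.setBolt("parse-and-score", new ParseAndScoreBolt(), 8)
    //          .shuffleGrouping("events");
    // after: two single-purpose bolts chained together
    builder.setBolt("parse", new ParseBolt(), 8).shuffleGrouping("events");
    builder.setBolt("score", new ScoreBolt(), 8).shuffleGrouping("parse");
    return builder;
  }
}

The extra hop costs a queue and some serialization, but each bolt's execute() stays tiny - and that seems to be what Storm really wants.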

Amazing...

Fixed Log Configs for Single-File Logging

Thursday, October 16th, 2014


This morning I wanted to take some time to make sure that I got all the log messages into one file - and not by way of redirecting stdout or stderr. This is something I've done a few times before; it just took the time to set up the Logback config file. The reason we're using Logback is that it's what Storm uses, and since this is a Storm jar, we needed to use the same style of logging.

Interestingly, it wasn't all that hard to get a config for a nice, daily-rotating, compressed log file with everything I needed:

<configuration scan="true">
  <appender name="FILE"
            class="ch.qos.logback.core.rolling.RollingFileAppender">
    <file>/home/${USER}/log/experiments.log</file>
    <rollingPolicy
        class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
      <fileNamePattern>
        /home/${USER}/log/experiments_%d{yyyy-MM-dd}.log.gz
      </fileNamePattern>
      <maxHistory>30</maxHistory>
    </rollingPolicy>
    <encoder>
      <pattern>
        [%d{yyyy-MM-dd HH:mm:ss.SSS}:%thread] %-5level %logger{36} - %msg%n
      </pattern>
    </encoder>
  </appender>
 
  <!-- console appender: defined but not attached to <root> below,
       so everything still goes to the single file -->
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>
        [%d{yyyy-MM-dd HH:mm:ss.SSS}:%thread] %-5level %logger{36} - %msg%n
      </pattern>
    </encoder>
  </appender>
 
  <root level="INFO">
    <appender-ref ref="FILE" />
  </root>
</configuration>

I have to admit that this is a decent tool - if you know how to configure it properly. But I guess that's true for a lot of the tools in the Java ecosystem.

Re-Tuning Experiment Topology

Wednesday, October 15th, 2014


Anytime you add a significant workload to a Storm cluster, you really need to re-balance it. This means looking at the work each bolt does, making sure there's the proper balance between the bolts at each phase of the processing, and making sure there are enough workers to handle the cumulative throughput. It's not a trivial job - it's a lot of experimentation, then looking for patterns and zeroing in on the solution.
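Concretely, the knobs are the per-bolt parallelism hints passed to setBolt() when the topology is built, plus a couple of topology-level settings - something like this, where the numbers are placeholders and not our real values:

import backtype.storm.Config;

public class RebalanceSketch {
  public static Config tunedConfig() {
    Config conf = new Config();
    conf.setNumWorkers(16);        // enough worker JVMs for the cumulative throughput
    conf.setMaxSpoutPending(2000); // cap in-flight tuples so slow bolts don't get buried
    return conf;
  }
}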

That's what I've been doing for several hours, and I'm nowhere near done. It's getting closer, though - I think I have the problem isolated, and it's very odd. Basically, there are a few bolt instances - say 4 out of 160 - that are above 1.0 in capacity, while the rest are at least a factor of ten lower. This is my problem. Something is causing these few instances to take too long, and that skews the metric for every instance of that bolt.

Fixing Replication on Postgres

Wednesday, October 15th, 2014


This morning I noticed that my replicated database wasn't synced to the master, which meant something had happened - either the master was moving too much data, or there was too long a pause for the slave to stay in sync. Re-establishing the link isn't all that hard, but it takes time - so I turned off the process feeding data into the master database, and then, as the postgres user, copied the files from the master to the slave:

  $ cd /var/groupon
  $ rsync -av --exclude postgresql.conf --exclude postmaster.pid \
          pgsql/ db2:/var/groupon/pgsql/

and then I simply needed to restart the properly configured slave, and the log would report:

  LOG: streaming replication successfully connected to primary

Refactoring Analytics for Multi-Mode Design (cont.)

Wednesday, October 15th, 2014


This morning I finished up the deployment of the code to UAT and then set about updating the docs in the GitHub/E repo for all the changes. This wasn't all that hard, as most of it was already there, but I needed to make sure the docs matched the code in the server.clj namespace - and then there was the bigger job of adding the docs for the different attribution schemes.

I've got two schemes - the original one, which wasn't all that good, and the new one, which looks at the order of the experiment experiences in each session and then attributes the weight of the deal across them in a time-decaying fashion. It's not perfect, but it's a massive improvement over the old scheme.
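Just to illustrate the idea - the actual decay function isn't something I'm reproducing here, so the 0.5 factor and the variant names below are purely stand-ins - time-decayed attribution over a session looks something like this:

import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DecayAttributionSketch {
  // experiences: the experiment variants seen in a session, oldest first.
  // Returns each variant's share of the deal's weight (shares sum to 1).
  public static Map<String, Double> attribute(List<String> experiences, double decay) {
    Map<String, Double> shares = new LinkedHashMap<>();
    double total = 0.0;
    int n = experiences.size();
    for (int i = 0; i < n; i++) {
      // the most recent experience gets full weight; each step further back
      // in time is discounted by another factor of 'decay'
      double w = Math.pow(decay, (n - 1) - i);
      shares.merge(experiences.get(i), w, Double::sum);
      total += w;
    }
    for (Map.Entry<String, Double> e : shares.entrySet()) {
      e.setValue(e.getValue() / total);
    }
    return shares;
  }

  public static void main(String[] args) {
    // a session that saw variant A, then B, then A again, with a made-up decay of 0.5
    System.out.println(attribute(Arrays.asList("A", "B", "A"), 0.5));
    // -> roughly {A=0.714, B=0.286}
  }
}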

I wrote all this up, with examples, and it's checked in for the front-end guys to use.

Firefox 33.0 is Out!

Wednesday, October 15th, 2014


I haven't used Firefox a lot lately, but it's still a browser I keep updated, because I read that it's starting to turn things back around. The new OpenH264 support in Firefox 33.0 should be an interesting take on video streaming, and it's fast enough that maybe I'll look at it a little more.

I really did like the workspaces concept, but I never got the feeling it was as smoothly integrated as it could be. Maybe it's getting better?

How Useless a One-on-One Can Get

Tuesday, October 14th, 2014


I'm not an easy person to live with - nor an easy one to work with. I'm demanding of myself, and because of that, most people think that even if I'm not visibly demanding of them, I am, internally, very disappointed in them if they aren't achieving the same levels. That's not the case, and as much as I try to correct that misconception, it persists. Still, I try to be a Team Player - to do the jobs that need to be done regardless of how I feel about them. But I have to confess that a useless One-on-One is something I'm about to opt out of.

I have a manager who believes he's a good manager to everyone. He's not - at least not to me - but that doesn't factor into his thinking. He wants to get my feedback, which is always the same: this group is split along the lines of those that do and those that primarily sit around. If I were in the latter group, I'd be happy - but I'm not, and the group I'm in has a population of 1.

What happens is that Management pushes down on me to deliver, and I do my best to meet their expectations. It's hard work because I'm delivering things for many different groups, and so when I look around and see people playing cards at lunch, or arriving at 9:00 am, it's hard not to feel like I'm being taken advantage of.

So I have a one-on-one with a guy that doesn't understand the first thing about what I'm doing. He doesn't even know who I'm working for - what they are asking, when they are asking it, and what those deadlines are. He is my Manager in name only. And yet he wants a one-on-one.

Silly. And I've had enough of silly.

Refactoring Analytics for Multi-Mode Design

Tuesday, October 14th, 2014


Today has been a lot of coding on a change I was asked to make - and in truth, it's a nice feature to have in the experiment analytics. The request is basically that all the attribution schemes be active in the system at once, with a different URL for each version of the code and data.
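As a toy illustration of the shape of it - the real service is the Clojure code in server.clj, and the scheme names and paths here are invented - it's essentially a version-to-handler dispatch keyed off the URL:

import java.util.Map;
import java.util.function.Function;

public class VersionedRoutesSketch {
  // placeholder handlers standing in for the two attribution schemes
  static double originalScheme(String sessionId) { return 1.0; }
  static double decayedScheme(String sessionId)  { return 0.9; }

  static final Map<String, Function<String, Double>> ROUTES = Map.of(
      "/attribution/v1", VersionedRoutesSketch::originalScheme,
      "/attribution/v2", VersionedRoutesSketch::decayedScheme);

  public static double handle(String path, String sessionId) {
    // every version stays live; callers pick one by hitting its URL
    return ROUTES.get(path).apply(sessionId);
  }
}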

This is not unlike what I've done in the past, and it's a good way to keep things isolated and roll out new features without a long and involved testing process. But the problem here is that we have a finite amount of redis space, and there's only so much I can do at 50k msgs/sec - and while I would love to have all the versions running side by side, I've already had problems getting the first one working and fitting in redis.

I know this doesn't compute for the current Management, and it's sad that the guy doesn't really understand what's going on and won't just admit it. Sadly, that seems to be hard for a lot of managers - it would make them seem more human, but at the same time, it forces them to expose their weaknesses.

Anyway... I've been re-writing this code all day, and I think I have it all code complete. In the morning I'll try it out in UAT and see how it goes.