
Posts from — January 2010

This Code Doesn’t Work And Here’s Why

I am a fan of continuous integration and automated builds.  They’re a great tool for making sure that people are always checking in clean code that can be built successfully.  However, I have found that simply having an automated build system that sends out mail isn’t enough.  Some people are too lazy to set up proper e-mail filters, so they end up deleting all mails from the build server, including the “HEY, YOU BROKE THE BUILD” e-mails that are required for the proper functioning of any CI system.  As a result, I realized that simply having the system is not enough; there needs to be some other indication of the state of a build, something completely passive, yet visible to all.  At first, I set up a red light, using X10.  That alone was not enough, so it was soon joined by a screen of shame: a computer monitor that would display, in big text for everyone to see, who broke the build and how long they’d left it broken.  That system worked fairly well.

Over time, though, the number of builds that needed to be monitored grew unwieldy.  The original display program I’d written was no longer up to the task.  When you have over a hundred separate builds, split across multiple teams, something that can display the status of only seven builds at once doesn’t work all that well.  It was clear that I needed to start over and create a new display system for this new world.

At the core of this build monitor system is a timer.  Every 50 ms, it wakes up and sees if there are any tasks that need to be performed.  This timer function is used to control the build status updates.  Every 30 seconds, the system will contact each of the CCNet servers that are being monitored, and pull down their latest build information.  Here’s the essence of that code (Error checking, etc. removed):

protected void FireEvents(object sender, ElapsedEventArgs e)
{
    foreach (IntervalEventBase intervalEvent in _collectionList)
    {
        if (intervalEvent.NextEventTime < DateTime.Now)
        {
            ThreadPool.QueueUserWorkItem(
                delegate
                {
                    intervalEvent.UpdateNextEventTime();
                    intervalEvent.FireEvent(_state);
                });
        }
    }
}
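
The post doesn’t show how the timer itself is wired up, but the ElapsedEventArgs parameter implies a System.Timers.Timer.  A minimal sketch of what that setup probably looks like (the class and member names here are my assumptions, not the original code; only the 50 ms interval comes from the text):

```csharp
using System;
using System.Timers;

// Sketch of the 50 ms timer wiring implied by the FireEvents signature.
// Everything here except the 50 ms interval is an assumption.
public class EventPump
{
    private readonly Timer _timer = new Timer(50);  // wake up every 50 ms

    public void Start()
    {
        _timer.Elapsed += FireEvents;  // the handler shown above
        _timer.AutoReset = true;       // keep firing; don't stop after one tick
        _timer.Start();
    }

    protected void FireEvents(object sender, ElapsedEventArgs e)
    {
        // ...walk the event collection and queue anything that's due,
        // as in the FireEvents method above...
    }
}
```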

Those IntervalEvents can be any action I want to take on a timed interval.  One type of event checks a build server every thirty seconds, while another checks the time to determine if the application should enter a power saver mode which turns the screen off if it’s after hours and no one is around to look at it.  Not all events are on the same schedule, which is why it has to check whether or not it needs to fire a given event.
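
The IntervalEventBase class itself isn’t shown anywhere in the post; a rough sketch of the shape implied by the code might look like this (the constructor and the interval field are my guesses, and the IsQueued flag only appears later in the story):

```csharp
using System;

// Hypothetical sketch of the event base class the post describes.  The member
// names NextEventTime, UpdateNextEventTime, FireEvent, and IsQueued come from
// the post's code; the rest is assumed.
public abstract class IntervalEventBase
{
    private readonly TimeSpan _interval;

    protected IntervalEventBase(TimeSpan interval)
    {
        _interval = interval;
        NextEventTime = DateTime.Now;  // due immediately on startup
    }

    public DateTime NextEventTime { get; private set; }

    public bool IsQueued { get; set; }

    // Push the next due time one interval into the future.
    public void UpdateNextEventTime()
    {
        NextEventTime = DateTime.Now + _interval;
    }

    // The actual work: poll a CCNet server, check the power saver clock, etc.
    public abstract void FireEvent(object state);
}
```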

I set my program up to monitor a single build machine and, as expected, it updated the status every thirty seconds.  I failed a build and it showed up as failed on the next update.  Stopped a build and it displayed as stopped.  Everything was working, so I added a few other build servers to be monitored and deployed my program.  My new and improved build monitor application was now visible to all of the engineers in the company, telling them whether or not any of their builds were broken.

Almost immediately, I noticed that the screen was reporting that some of the builds were “stale”, that is, they’d missed their next expected build time by more than five hours.  Usually, this is a sign that a particular build on a server has gone sideways and is blocking all of the other builds as a result.  I didn’t think much of it at the time, especially when it cleared.  The server that was getting the stale warning was one I wasn’t very familiar with, so I figured that maybe there actually was a build that kicked off occasionally and that took six hours and blocked everything else while it ran.

I ignored the problem for a few days, until I noticed that a build box I was familiar with showed the same behavior.  I knew that nothing on that box took more than five minutes to run, so blocking for five hours was a major problem.  I went to the CCNet Dashboard and found that it was all clear.  Nothing was building and nothing was late.  So…  What the hell?  Why was the build monitor saying everything on that box was stale?  Obviously, my update timer was working right, because otherwise all of the build machines would be stale.  Network issues, perhaps?  Whatever.  I kicked the box, restarted my program, and everything showed up fine.

But still, at least once a day, individual build boxes would randomly report that everything was late to build when nothing was wrong.  I had to dig deeper and figure out what was going on.  The simplest way to begin was to put the last updated time for each server on the screen.  The servers should never be more than 30 seconds out of date, so by knowing when a particular server ran into trouble, I might be able to solve the problem.  Maybe the anti-virus software was kicking off at 11:30 every day and killing a connection to a build server somehow.  A broken connection would result in an error being displayed, but it would get displayed once for a total of ten seconds.  Very easy to miss.

I put the last updated time on the build status slides, then left the program to bake for a day so things could get into a weird state.  I checked it at 4:30 PM, and here are the times it reported:

Build Server 1: 12:23 PM
Build Server 2: 4:27 PM
Build Server 3: 6:12 AM
Build Server 4: 3:56 PM
Build Server 5: 2:03 PM
Build Server 6: 4:30 PM
Build Server 7: 10:24 AM
Build Server 8: 3:38 PM

Blink.  Blink.  WTF?

Okay, that makes NO sense at all.  WHAT THE HELL?

If the timer is working, all of them should report a last updated time of around 4:30 PM.

If the timer isn’t working, all of them should be stale and all reporting the time I launched the program.

But this?  Scattered randomly throughout the day?  How in the hell was that happening?  I know that the .Net Timer classes aren’t meant to provide 100% realtime reliability and that using the ThreadPool will cause execution to be unpredictable, but this is well beyond the expected limits of tolerance.  This just cannot be happening.  There’s no way.

I open up the code and start looking around.  The timer is set up with an interval of 50 ms.  Clearly it’s firing properly, because some of the builds are getting updated.  And if any of them are, they all should be, since they’re all being queued in the same function call from the same timer tick.  Then it hit me.

You see, the ThreadPool doesn’t really make many guarantees about execution.  You hand some work over to the ThreadPool and it’ll get done at some point in the future, when the pool gets around to it.  It’s great for when you have something you need to do asynchronously and don’t require too much control over when or where it’s executed.  If you go back to the code, you’ll see that the function I’m putting on the pool is what schedules the next execution of the event.  Trouble is, it’s the timer method that’s checking that value, and the timer is in its own asynchronous little world.  Obviously, what’s going on here is that the timer is firing and enqueuing a bunch of events, then firing again and enqueuing some of them AGAIN, before they’ve had a chance to run and update their scheduled times.  As this happens, the queue of pending events grows larger and larger, and who knows when any particular build update will run.  That will cause all sorts of problems.

Okay then, easy fix.  Wrap the body of the FireEvents method in a Monitor.TryEnter()/Monitor.Exit() block to ensure it can’t execute multiple times at once.  Then, put a flag on the events, so they won’t get re-queued if they’re already in the queue to be executed.  Here’s what the function looked like:

protected void FireEvents(object sender, ElapsedEventArgs e)
{
    if (Monitor.TryEnter(_collectionList))
    {
        foreach (IntervalEventBase intervalEvent in _collectionList)
        {
            if (intervalEvent.NextEventTime < DateTime.Now && !intervalEvent.IsQueued)
            {
                intervalEvent.IsQueued = true;
                ThreadPool.QueueUserWorkItem(
                    delegate
                    {
                        intervalEvent.UpdateNextEventTime();
                        intervalEvent.FireEvent(_state);
                        intervalEvent.IsQueued = false;
                    });
            }
        }

        Monitor.Exit(_collectionList);
    }
}

Okay, everything’s golden now.  The problem is solved.  I rebuild, fire it up and…

COMPLETE FAILURE

Now they’re not updating at all.  AT ALL.  I put in some breakpoints and watch it run.  There are 9 events in _collectionList.  The foreach iterates through all nine, sets the .IsQueued flag to true on them and enqueues them on the ThreadPool.  The next time the timer fires, the first eight events, all of which are build servers, still have their IsQueued flags set to true.  Only the last event, which is the stupid power saver check, has its flag set to false.

What the hell?  It can’t possibly be taking thirty seconds to ask the build servers for their status.  One of them had to have finished by now.  How can they all be enqueued, yet not be finished?  I set a breakpoint inside the FireEvent method on the build status updater class and start things running, then wait.

And wait.

And wait.

It’s obviously been more than thirty seconds and it’s never getting called.  Not once.

I’m watching eight build status update events getting enqueued, but they’re never running and never having their queued flags reset.  But the power saver event gets its flag reset, so it must be getting called.  I put a breakpoint inside it and start things over.

Breakpoint hit!  Well, at least something is working.

Press F5.

Breakpoint hit!  Did I wait 30 seconds?

Press F5.  Breakpoint Hit!  Press F5.  Breakpoint Hit!  It’s definitely not 30 seconds.

I keep going.  I hit the breakpoint inside the power saver event NINE TIMES.

Great.  I now have a build monitor that doesn’t monitor any builds, but at least it’ll be absolutely certain to shut off the screen at 7PM Monday through Friday and keep it off all day Saturday and Sunday.  WTF?  It’s clearly enqueuing work for build update events and somewhere along the line, they’re all magically becoming power saver events.  Did I accidentally set System.Environment.TreeHugger = true; ?

I go back and stare at the code.

Oh…  Yeah…  Now I get it…

I like anonymous delegates and lambda functions.  I think they’re a great addition to the language.  I know they utterly confuse some people, but I think they’re great.  They let you keep your logic all together in one block, instead of forcing you to write a separate function and then figure out how to get your data into that function when it gets called.  In this case, I want to set the IsQueued flag on the event, then schedule its next execution time, fire the event, and finally say that it’s not queued up anymore.  That’s the logical flow of operations and, thanks to an anonymous function, it’s all in one place.  Behind the scenes, it’s the compiler that ends up writing a separate function and figuring out how to get your data into that function.  In most cases, that process works fine.

This is not one of those cases.

See, when the compiler works its magic, it packages up the anonymous function and any variables your function references into what’s called a closure.  This closure has a reference to the variable, not the value of the variable.  Subtle difference.  In most cases, you’re using anonymous functions synchronously, so there’s really no practical difference.  The value of the variable cannot change by the time the anonymous function is called and the variable is referenced.  In an asynchronous context, however, this subtle difference will get all hopped up on angry juice and come back to bite you.
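
Here’s a minimal way to see that difference, with no loops or threads involved.  (This is my own illustrative snippet, not code from the build monitor.)

```csharp
using System;

class ClosureDemo
{
    static void Main()
    {
        int x = 1;
        // The anonymous delegate captures the variable x, not a snapshot of 1.
        Action print = delegate { Console.WriteLine(x); };
        x = 2;
        print();  // prints 2: the closure reads x when it runs, not when it's created
    }
}
```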

What went wrong here is that the foreach loop does not create a new variable with every iteration.  It’s reusing intervalEvent and simply giving it a new value every time through the loop. 1  When I call QueueUserWorkItem, I’m enqueuing a function that references that variable, not the value in it at the time I enqueue it.  Which means that by the time the ThreadPool actually executes my function, there’s no guarantee what intervalEvent will be pointing at.  Most of the time, it will be the last event in _collectionList.  Sometimes, the ThreadPool will fire off an event while the iteration is still happening, so it will execute with some value from the middle.  It’s even possible for the value of intervalEvent to change while my anonymous function is executing.

This is what was causing all of the wackiness with the updates.  Most of the time, none of the updates were getting through, but it would run the screen shutoff event nine times in a row every 30 seconds.  Every once in a while, the ThreadPool would have mercy on an intervalEvent and allow it to run while the foreach was still iterating, which was how many of the servers would appear to be up-to-date and also how stale servers got cured automatically.

Okay, so, that’s why everything’s all FUBAR.  Now, how do you fix it?  Well, the problem is that the iterator variable is being reused, so, what you need to do is create a local variable inside the scope of the foreach to capture the value, then use that variable inside the anonymous delegate.  Like so:

protected void FireEvents(object sender, ElapsedEventArgs e)
{
    if (Monitor.TryEnter(_collectionList))
    {
        foreach (IntervalEventBase intervalEvent in _collectionList)
        {
            IntervalEventBase localEvent = intervalEvent;
            if (localEvent.NextEventTime < DateTime.Now && !localEvent.IsQueued)
            {
                localEvent.IsQueued = true;
                ThreadPool.QueueUserWorkItem(
                    delegate
                    {
                        localEvent.UpdateNextEventTime();
                        localEvent.FireEvent(_state);
                        localEvent.IsQueued = false;
                    });
            }
        }

        Monitor.Exit(_collectionList);
    }
}

You have to love it when the fix for a big nasty bug is something that looks like a rookie coding mistake.  Which, of course, means that you have to clearly comment the fix, otherwise someone will think that it IS a rookie coding mistake and “fix” it.  The closure now has the proper value when it gets run and all of my builds are updating like clockwork every thirty seconds.

Well, at least I think they are.  I fixed it at 6:59 PM, so pretty much as soon as it started running, the power saver event fired and the screen turned off…

 

Anyway, if you want a compact repro case to play with, here you go.  “foreachString” will usually print out all “nine”s down the line (Or occasionally a “three” or something in the first spot or two), while “localString” will have a different value on every line (Although not necessarily in order).

string[] strings = new string[] { "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine" };

foreach (string foreachString in strings)
{
    string localString = foreachString;
    ThreadPool.QueueUserWorkItem(delegate { Console.WriteLine("foreachString: {0}  localString: {1}", foreachString, localString); });
}

Sample Output:

foreachString: zero  localString: zero
foreachString: nine  localString: two
foreachString: nine  localString: three
foreachString: nine  localString: four
foreachString: nine  localString: five
foreachString: nine  localString: six
foreachString: nine  localString: seven
foreachString: nine  localString: eight
foreachString: nine  localString: nine
foreachString: nine  localString: one
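
For comparison, here’s the same bug written with a regular for loop, where the single, reused counter variable is much more obvious (again, an illustrative snippet of mine, not code from the monitor):

```csharp
using System;
using System.Threading;

class ForLoopCapture
{
    static void Main()
    {
        for (int i = 0; i < 10; i++)
        {
            // Every delegate shares the one loop variable i.  By the time the
            // pool actually runs these, the loop has usually finished and i is 10.
            ThreadPool.QueueUserWorkItem(delegate { Console.WriteLine(i); });
        }
        Thread.Sleep(1000);  // crude wait so the queued items get a chance to run
    }
}
```

This usually prints ten as most or all of its lines, which is the foreachString behavior above in its plainest form.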

And for more information on the code generated by the compiler for anonymous functions and closures, including information on this very problem (called “an interesting but dangerous side effect” in section 6), check out this CodeProject article by P. Adityanand: http://www.codeproject.com/KB/cs/InsideAnonymousMethods.aspx

  1. With a foreach, this variable reuse isn’t really apparent.  If you’re using a regular for loop, it’s clear that the iterator variable is the same each time through.

January 23, 2010

This Code Doesn’t Work

The following is a snippet of code that reproduces a bug I came across today.

string[] strings = new string[] { "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine" };

foreach (string str in strings)
{
    ThreadPool.QueueUserWorkItem( delegate { Console.WriteLine(str); } );
}

It should print out the numbers “zero” through “nine”.  It doesn’t work.  Why not?

I’ll write up the answer tomorrow, but I want to see if anyone else can get it first.

January 22, 2010

Geological Fun Fact

Today’s 6.5 earthquake in Northern California occurred at the Mendocino Triple Junction.  That’s the southern terminus of the Cascadia fault.  The Cascadia fault runs offshore of Oregon and Washington and is capable of causing magnitude 9+ earthquakes, like the one in Indonesia five years ago.  Those massive quakes happen when the fault rips open from one end to the other, like a zipper.  Such an earthquake would devastate Seattle, Portland AND Vancouver, BC.  Yes, AND.  A full rupture of the Cascadia Subduction Zone would rip apart the ocean floor from Northern California to halfway up Vancouver Island, so I’m talking about a 700 mile long tear in the earth.  The word “epicenter” has no useful meaning here.  There would also almost certainly be a tsunami that would hit pretty much every low-lying coastal city in Oregon and Washington within minutes of the quake, then race across the ocean at 600 miles an hour and hit Alaska, Hawaii, and Japan.  Massive earthquakes along the Cascadia Subduction Zone happen about every 300-600 years, and the last one occurred in 1700.

Let me repeat:  Today’s quake in California was on the same fault that will someday destroy Seattle.

So…  Got batteries for your flashlight…?

 

In the meantime, here’s a picture of Cape Mendocino, on the coast of Northern California, just on shore from the Mendocino Triple Junction:
Mattole Beach, Petrolia, CA

January 9, 2010

James Cameron Plays Too Many Video Games

Let’s see…  He’s played…

Final Fantasy X:

Suteki Na De cutscene

Panzer Dragoon Orta:

Episode 2: Altered Genos

Episode 4: Gigantic Fleet

Secret of Mana:

Title Sequence1

Mana Tree Cutscene

Command & Conquer:

Mechanical Man

The Orca

The Legend of Zelda: Ocarina of Time

HEY!  Listen!

  1. The Japanese version because it shows more of the tree.

January 3, 2010