
Category — Programming

To The Cloud!

About two years ago, the higher ups at my mid-sized web company had an idea.  Pull together a small team of devs, testers, IT, and even a corporate apps refugee, stick them in a small, dark room, and have them move the entire infrastructure that powered a $300 million a year business into the cloud, within six months.

We were terrified.

We had no idea what we were doing.

But somehow, we managed to pull it off.

It wasn’t easy.  A lot of things didn’t work.  Many shortcuts were taken.  Mistakes were made.  We learned a lot of things along the way.

And so now, two years on, it seems like a good time to take a look at how we got to where we are and what we’ve discovered.

Where We Started

When we began the cloud migration process, we lived in two data centers.  One on the west coast, one on the east coast.  For the most part, they were clones of one another.  Each one had the same number of servers and the same code deployed.  This was great for disaster recovery or maintenance.  Release in the west?  Throw all our traffic into the east!  Hurricane in the east?  Bring all the traffic across the country.  Site stays up, user is happy.  It’s not so great for efficiency.  The vast majority of the time, our data centers were load balanced.  We’d route users to the data center that would give them the fastest response time.  Usually, that ended up being the DC closest to them.  Most of our users were on the east coast, or in Europe or South America.  As a result, our east coast servers would routinely see 4-5x the traffic load of the boxes on the west coast.  That meant that we might have 60 boxes in Virginia running at 80% CPU, while the twin cluster of 60 boxes in Seattle was chillin’ at less than 20%.  And when we needed to add capacity in the east, we’d have to add the same number of boxes to the west.  That’s a lot of wasted processing power.  And electrical power.  And software licenses.  Basically, that’s a lot of wasted money.

We had virtualized our data centers a few years prior, and while that was a huge step forward over having rack after rack of space heaters locked in a cage, it still wasn’t the freedom we were promised.  Provisioning a server still took a couple of days to get through the process.  We’d routinely have resource conflicts, where one runaway box would ruin performance for everything else on the VM host.  There was limited automation, so pretty much anything you wanted to do involved a manual step where you had to hop on the server and install something and configure something else by hand.  And if we ran out of capacity on our VM hosts, there’d be a frantic “We gotta shut something down” mail from the admin, followed by a painful meeting where several people gathered in front of a screen and decided which test boxes weren’t needed this week.  (The answer was usually most of them…)  And once in a while, we’d have to add another VM host, which meant a months long requisition and installation process.

All this meant that we were heavily invested in our servers.  We knew their names, we knew their quirks, and if one had a problem, we’d try to revive it for days on end.  Servers were explicitly listed in our load balancers, our monitoring tools, our deployment tools, our patch management tools, and a dozen other places.  There was serious time and money that went into a server, so they mattered to us.  It was a big deal to create a new box, so it was a big deal to decide to rebuild one.

Our servers were largely Windows (still are).  That meant that in the weeks following Patch Tuesday, we’d go through a careful and tedious process of patching several thousand servers to prevent some Ukrainian Script Kiddie from replacing our cartoon dog mascot with a pop-up bomb for his personal “Anti-virus” ransomware.  And, of course, that process wasn’t perfect.  Box 5 in cluster B wouldn’t reboot.  Box 7 ended up with a corrupted configuration.  And box 12 just flat out refused to be patched because it was having a rough day.  So hey, now there’s a day or two of cleaning up that mess.  Hope no one noticed that box 17 was serving a Yellow Screen of Death all day and didn’t get pulled from the VIP!

Speaking of VIPs, we had load balancers and firewalls and switches and storage and routers and all manner of other “invisible” hardware that loved to fail.  Check the “Fast” checkbox on the load balancer and it would throw out random packets and delay requests by 500 ms.   The firewall would occasionally decide to be a jerk and block the entire corporate network.  The storage server would get swamped by some forgotten runaway scheduled process, and that would lead to a cascade of troubles that would knock a handful of frontend servers out of commission.  Every day, between 9 AM and 1 PM, traffic levels would be so high that we’d overload a bottleneck switch if we had to swing traffic.  And don’t even get me started about Problem 157.  No one ever figured out Problem 157.

In short, life before was a nightmare.

Where We Are Now

Today, we’re living in three regions of AWS.  We’ve got auto-scaling in place, so that we’re only running the number of servers we need.  Our main applications are scripted, so there’s no manual intervention required to install or configure anything, and we can have new servers running within minutes.  The servers in all of our main clusters are disposable.  If something goes wrong, kill it and build a new one.  There’s no point in spending hours tracking down a problem that could’ve been caused by a cosmic ray, when it takes two clicks and a couple of minutes to get a brand new box that doesn’t have the problem.  (That’s not to say the code is magically bug free.  There’s still problems.  The cloud doesn’t cure bad code.)  We pre-build images for deployment, run those images through a test cycle, then build new boxes in production using the exact same images, with only a few minor config tweaks.  We even have large portions of our infrastructure scripted.

Not too long ago, I needed to do a release to virtually every cluster in all regions.  Two years ago, this would have been a panic-inducing nightmare.  It would have involved a week of planning across multiple teams.  We would’ve had to get director level sign off.  We would have needed a mile long backout plan.  And it would have been done in the middle of the night and would have taken a team of six people about eight hours to complete.  When I had to do it in our heavily automated cloud world, I sent out a courtesy e-mail to other members of my team, then clicked a few buttons.   The whole release took two hours (three if you count changing the scripts and building the images), and most of it was completely automatic.  It could have taken considerably less than two hours, but I was being exceptionally cautious and doing it in stages.  And did I mention that I did this on a Tuesday afternoon during peak traffic, without downtime or a maintenance window, all while working from home?

To recap, I rebuilt almost every box we have, in something like ten separate clusters, across three regions.  I didn’t have to log into a single box, I didn’t have to debug a single failure.  The ones that failed terminated themselves, and new boxes replaced them, automatically.  The boxes were automatically added to our load balancers and our monitoring system, and once the new boxes were in service, the old boxes removed themselves from the load balancers and monitoring systems.  While this was going on, I casually watched some graphs and worked on other things.

This is a pretty awesome place to be.

And we still have ideas to make it even better.

May 27, 2015

Earthworm, Winston, heli_ewII, and partial traces

I’m not a geologist.  I’m not a seismologist.  I say that, because I want to make it clear that I have no idea what I’m talking about before you continue reading this.  I want to write about it here because it frustrated me for a couple of days, because there was nothing out there about this issue.

Anyway, I’ve been putting together a personal seismograph station for my house, because I’m just that kind of nerd.  I built a TC1 slinky seismometer last year, and have been running it using Amaseis for a couple of months now.  That software is a bit limited, and I wanted to take it to the next level.  I got an Arduino, an accelerometer, and a Raspberry Pi, and decided that I’d try installing Earthworm.

Because, clearly, an amateur seismometer station hanging out in a garage needs to be running Earthworm.

(I’ll probably talk about my setup at some point in the future, especially if I get it tuned and happy and to the point where I feel comfortable that it’s working and stable.)

Anyway, after wasting several days trying to figure out how to get my data into Earthworm wave_serverV, I gave up and installed Winston, instead.

If it’s good enough for the USGS, it’s good enough for my garage.

I was easily able to write an adapter that fed data from my Arduino into Winston.  Problem is, I don’t really like Winston’s graphs.  They’re so blue.  I like the multicolored heli_ewII graphs more.

So, I pointed heli_ewII at Winston (Yay for interoperability) and didn’t get what I was expecting.  Instead of solid traces, I just got short specks of a trace every two minutes.  Two minutes was the interval to redraw the helicorder images in heli_ewII.d, and it looked like it was only getting a few seconds, then nothing.  I tweaked every setting, turned on debugging, but nothing worked.  I knew the data was in Winston, I could see it in the DB.  But, for some reason, heli_ewII was producing spotty, broken graphs.

Eventually, I sorted out what the problem was.  The data acquisition Python script was trying to be too smart.  You see, Winston wants the sample rate of the data when you insert tracebufs.  Since my Arduino & Pi & Python package was built to the exacting specifications of a project assembled for fun on a kitchen table, the actual sample rate could vary wildly.  However, it was very easy to calculate what the rate was:  Number of samples taken / time taken.  I was sending data to Winston every five seconds or so.  I’d get 280 – 290 samples from the Arduino accelerometer in that window, for a sample rate of about 56-58.  And so, I’d tell Winston the exact sample rate when I sent it data.

That turned out to be a mistake.

I think that variable sample rate ended up confusing heli_ewII and Swarm’s zoomed trace (and spectrograms).  Every time the sample rate changed, they’d stop drawing data.  Swarm’s main heli view worked just fine, but that was it.  Everything else was messed up.

So, I tried giving it a constant sample rate.  That made the helicorder displays happy and continuous, but then I started getting overlapped buffers, or something like that.

In the end, I set my acquisition script to automatically adjust the sample rate it’s sending about once an hour.  This should minimize the missing data in the traces, minimize overlapping data, and still have a sample rate that corresponds to the performance of the accelerometer.
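For the curious, here's a rough sketch of what that hourly adjustment could look like in Python.  The class name, the method names, and the one-hour default are all made up for illustration (this is not the actual acquisition script), and nothing in it talks to Winston.  It just counts samples and applies the "number of samples taken / time taken" arithmetic over a long window instead of every five seconds.

```python
import time


class SampleRateEstimator:
    """Report a fixed sample rate, re-estimated from real data once per interval.

    Hypothetical helper: the names and defaults are illustrative assumptions.
    """

    def __init__(self, initial_rate, interval_seconds=3600.0, clock=time.monotonic):
        self.rate = float(initial_rate)   # rate currently reported downstream
        self.interval = interval_seconds  # how often the rate is allowed to change
        self._clock = clock
        self._window_start = clock()
        self._samples = 0                 # samples counted since the last update

    def add_batch(self, sample_count):
        """Record one batch of samples and return the rate to report with it."""
        self._samples += sample_count
        elapsed = self._clock() - self._window_start
        if elapsed >= self.interval:
            # Number of samples taken / time taken, over the whole window.
            self.rate = self._samples / elapsed
            self._samples = 0
            self._window_start = self._clock()
        return self.rate
```

Every batch inside the window gets tagged with the same rate, so the displays see a steady value, and the rate only jumps at the window boundary.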

March 8, 2014

Overwriting Anomalous Data in Graphite

Recently, I’ve been setting up a home sensor array using Arduino modules, which feed data into a Graphite instance running on a Raspberry Pi.  Along the way, through faulty wiring, faulty coding, or the effects of sunspots, some bad data has ended up in Graphite.

And by bad data, I mean that my living room sensor once reported that it was 289,268 degrees C one day.  Since my house and the surrounding neighborhood show no signs of having been vaporized, I am forced to believe that reading is incorrect.

Since having a temperature reading of several hundred thousand degrees throws off the range of the graph somewhat, I wanted to get rid of that errant data point.  I looked around for a Whisper DB editor, but didn’t find anything.  Graphite’s all open source, so I probably could build one myself, but I really didn’t feel like going that far.  Then I remembered the “Feeding in Your Data” page in the Graphite docs, where it talks about using the command line to send data values.  Maybe if you can write datapoints using this method, you can overwrite datapoints using this method, too.  So, I figured I’d give it a shot.

The example they gave on the page is this:

echo "local.random.diceroll 4 `date +%s`" | nc ${SERVER} ${PORT};

The “local.random.diceroll” part is your counter name.

The “4” part is the data you want to write.

And the “date +%s” is just a fancy way of saying the Unix Epoch Timestamp of Now.

So, in my case, I’d overwrite the counter “stats.gauges.Temperature.LivingRoom” with a comfortable room temperature value of somewhere between 20 and 25.  But what about the timestamp?  I can’t just go in and hope I get the timestamp.  I can guess from that graph that it happened sometime around 00:32:30, but who knows if I’m right.  And what timezone is Whisper using?  Local time?  UTC?  And if it’s UTC, is that + 8 hours or +7 hours this time of year…?  I needed a better way to get the exact timestamp.

Fortunately, Graphite makes that relatively easy.  You see, you can get the data in json format if you want.  While it’s not as pretty as the graph, it does have the exact values of the timestamp you’re looking for.  Getting the json is fairly straightforward.  First, you go to the Graphite Browser and find the data you want to fix.  Once you have it in the window, right click and open the graph in a new tab (Or copy/paste the graph URL, either way works).  From there, edit the URL of the graph, by adding “&format=json” to the end.  Load that page and your screen will fill with json datapoints.  Like so:

In that sea of numbers, find the value you want to replace.  In my case, it’s easy to find, because it’s the one claiming that my living room jumped to roughly 50 times hotter than the surface of the sun for a moment.  The second value in that datapoint is the timestamp I need to wipe out the errant temperature spike.  Now that I have it, I can run the following command to put things right:

echo "stats.gauges.Temperature.LivingRoom 22.8 1373614500" | nc localhost 2003

And now my living room is back to a comfortable 22.8 degrees.
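If you'd rather not eyeball the sea of numbers, the search can be scripted.  The Python sketch below is only an illustration: the function names and the -40 to 60 "sane" range are my own assumptions, and the sample data is made up in the shape of Graphite's format=json output (a list of series, each with a "target" and a list of [value, timestamp] datapoints).  It only prints the replacement lines; you'd still pipe them to nc as above.

```python
import json


def find_outliers(render_json, low, high):
    """Return (target, value, timestamp) for datapoints outside [low, high]."""
    outliers = []
    for series in json.loads(render_json):
        for value, timestamp in series["datapoints"]:
            # Graphite reports intervals with no data as null (None in Python).
            if value is not None and not (low <= value <= high):
                outliers.append((series["target"], value, timestamp))
    return outliers


def plaintext_line(metric, value, timestamp):
    """Format one 'name value timestamp' line, as in the echo example."""
    return f"{metric} {value} {timestamp}\n"


# Made-up sample data; only the bad reading and its timestamp come from the post.
sample = """[{"target": "stats.gauges.Temperature.LivingRoom",
  "datapoints": [[22.7, 1373614440], [289268.0, 1373614500], [null, 1373614560]]}]"""

for target, value, timestamp in find_outliers(sample, low=-40, high=60):
    print(plaintext_line(target, 22.8, timestamp), end="")
```

One caveat: if the graph URL wraps the metric in a function, the "target" in the json will be that whole expression rather than the bare counter name, so check it before sending anything back.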

July 13, 2013

Continuous Integration and You

Continuous Integration is probably the most important development in team programming since the rise of source control systems.  It’s not a recent development, either, but while source control usage is universal, Continuous Integration tends to meet with calls of “It’s too hard” and “We don’t need it yet”.  Even in my company, where the benefits of Continuous Integration are obvious and clear, there always seems to be some resistance or avoidance when it’s time to put a project into the system.

Well, it’s not that hard1, and yes, you do need it right now.

Before I begin, I work at a web company.  Most of our software is fairly small (code-wise): sites, services, tools, that sort of thing.  We don’t really have any monolithic enterprise systems that take seven hours to compile with thousands of dependencies.  This posting is written from my experience in this world.  If that’s not your world, some of these comments might not apply to you, although the basic concepts should still be able to translate into what you do.

Continuous Integration, in its most basic form, is a system that watches source control check-ins, and when it detects a change, will build the code.

And deploy the code.

And run the automated tests against the code.

And build other projects that depend on the module that was just built, and deploy and test those, too.
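That last step, rebuilding whatever depends on the module that just changed, is really just a walk over a dependency graph.  Here's a minimal Python sketch of the idea, assuming you can describe your projects as a map from each project to the projects that consume it.  The function name and the project names in the usage note are hypothetical; this isn't how any particular CI server does it, just the general shape.

```python
from collections import deque


def rebuild_order(changed, dependents):
    """List the projects to build after a check-in, changed project first.

    `dependents` maps a project to the projects that consume it directly.
    Assumes the dependency graph has no cycles.
    """
    # Find everything downstream of the change.
    affected = {changed}
    queue = deque([changed])
    while queue:
        for consumer in dependents.get(queue.popleft(), []):
            if consumer not in affected:
                affected.add(consumer)
                queue.append(consumer)

    # Count, for each affected project, how many affected projects it waits on.
    waiting_on = {name: 0 for name in affected}
    for name in affected:
        for consumer in dependents.get(name, []):
            waiting_on[consumer] += 1

    # Kahn's algorithm: build a project only once all of its inputs are rebuilt.
    ready = deque(sorted(n for n, count in waiting_on.items() if count == 0))
    order = []
    while ready:
        name = ready.popleft()
        order.append(name)
        for consumer in dependents.get(name, []):
            waiting_on[consumer] -= 1
            if waiting_on[consumer] == 0:
                ready.append(consumer)
    return order
```

With a made-up graph like {"core": ["service", "site"], "service": ["site"]}, a check-in to core yields core, then service, then site, so the site is never built against a stale service.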

You may have heard of Continuous Integration before, but didn’t look into it because you think it’s an “Agile process” and that you have to be doing test-driven design and extreme programming and all of that other nonsense in order for it to work.  Well, you don’t have to be doing Agile to do this.  Even if you’re in a barrel and going over a hard-core ISO-9000 Waterfall, you’ll benefit from at least part of it.  Hell, I’ve even set it up for some of my own personal projects, just so I don’t have to deal with the hassles of manual deployment.

Here’s why Continuous Integration is Awesomeness:

  • Broken code detected automatically and immediately.  Accidentally check in a file with only half a function written?  Forget to check in a new file or library?  Within a few minutes, CI will fire off a build and choke, so you’ll know right away that there was a problem with what you just checked in.  You don’t have the case where two days later, someone else checks out your code, discovers that it’s broken, then wastes an hour trying to figure out what’s wrong before calling you over to help. 2  Plus, there’s the “public shaming” factor working in favor of quality software.  When someone breaks the build, everyone finds out about it, so people tend to be more careful about what they check in.
  • Automatic deployments.  If you’ve got web sites or services, you can’t use them unless they’re running somewhere.  Obviously, you can install the site by hand every couple of days, but then you have to go through the tedious steps of logging in, finding the latest build package, uninstalling the old code and installing the new code, repeating those steps for every machine you need the code on.  You could easily waste up to an hour or two a day doing this.  (On top of that, you’re running code that’s out of date most of the time.)  CI can do it for you.  Your machines are always up to date and you’re not wasting any of your time to make that happen.  Best of all, pesky testers won’t bug you asking for an updated build or begging you to push your latest fixes.
  • Automatic testing.  There’s test automation, then there’s automated test automation.  Automated tests that don’t run on their own don’t run often enough.  You’ve gone to all the trouble of making a computer do all the work of testing for you, so why do you still have it where someone has to push the button to make it go?  Think that’s job security for you or something?  If you have automated tests, make them run in a CI system.  That way, the developers check in their code and a few minutes later, they get an e-mail that tells them if anything broke, and you don’t have to do anything.  You don’t even have to be in the office!
  • Always up-to-date.  Testers are always looking at the latest code, so they’re not filing bugs about things you fixed last week.  The upper-ups are always looking at the latest code, so they’re not complaining about things that you fixed last week.  And the teams that rely on your component are using the component as it actually is, not as you designed it two months ago.
  • Code builds are clean.  Everything’s built automatically out on the build servers, not on your machine, where you’ve modified config files and replaced critical libraries with the debug versions.  The build servers are clean, and because what they’re building is constantly being deployed and tested, you know that the build packages they’re producing work.
  • Integration and deployment aren’t afterthoughts.    In ages past, I worked on a project which had a three month release cycle.  The first month and a half was strictly local development, on the developer’s machines.  Nothing was running anywhere other than the Cassini server in Visual Studio.  Like many systems, there was a front end and a back end.  Of course, the back end wasn’t running anywhere, so the front end was using a mock service that worked like the actual back end would, in theory, at least.  Then came the day of the first drop.  Developers spent hours building and installing everything, then turned it over to QA and went home.  The next day, QA was only able to log a single bug:  “[Drop 1] NOTHING WORKS. (Pri1/Sev1)” 3  You see, while everything worked according to the spec, in theory, none of the developers actually talked to one another, so neither side mentioned how they filled in the holes in the spec.  The back end was using a different port, because the original port conflicted with something, and the front end was using lowercase parameter names and expecting a different date format.   This sort of trainwreck deployment disaster doesn’t happen when you have continuous integration in place.  Not because there aren’t miscommunications, but because those miscommunications are discovered and resolved immediately, so they don’t pile up.  The front end developers point their code at the live work-in-progress back end servers.  If there’s a port change or a date format mismatch, it can be detected and corrected right away. 
  • Testers get something to test immediately.  In the story above, the testers were forced to try to write their tests against a phantom system for the first month and a half, based on partial technical specs and a handful of incomplete mockups.  After that, we were writing tests on code that was out of date.  It doesn’t have to be like that.  With Continuous Integration,  whatever is checked in is deployed and running somewhere, making it at least partially testable.  Even if there’s no functionality, at the very least, there is a test that the deployment package works.  The argument that testers can just pull the code out of the repository and run the service or website on their local machine and run all of their tests against that is ridiculous.  It’s a waste of time, because every tester has to pull down the code, configure their machines, build the code, then bug the developers for an hour or two because things aren’t working right.  Even after all that, there’s no telling whether some bugs are actual bugs, or just byproducts of the way it’s set up on a particular tester’s machine.  Worse still, the tests that are written aren’t going to be run regularly, because there’s no stable place to run them against.
  • Deployment packages and test runs are easily archived.  The system can automatically keep a record of everything that it does.  Need to compare the performance of builds before and after a change that was made two weeks ago?  You’ll be able to do that, because you still have the .MSIs archived for every build over the last two months.  Need to find out when a particular piece of functionality broke?  You’ll be able to do that, because the tests ran with almost every check-in and the results were saved.  You don’t have drops and test passes that are done two weeks and a hundred check-ins apart.
  • Progress is always visible.  Ever have a manager above you, sometimes far above you, want to see your progress?  Ever have to race to get something set up to demo?  CI takes care of that.  You’re always deploying with every check in.  A demo of the current state is always a link away.
  • It saves time and saving time saves money.  Use this point with skeptical management.  The first time you set up a project in a CI system, yes, you’ll spend several days getting everything running.  However, if you add up all the time that your team will waste without an automated build/deploy/test system, it will easily pay for itself.  You won’t spend any time on getting a build ready, deploying the build, firing off the automated tests, reporting on the automated tests, trying to get something you’re not working on set up on your box, explaining how to set up your code on someone else’s box, creating a demonstration environment to show off what you’ve done to the upper-ups after they asked about your progress in a status meeting, integrating with a service that’s been under development for months but that you’ve never seen, trying to figure out what check in over the last three weeks broke a feature, wasting a morning because a coworker checked in broken code, or being yelled at by coworkers for checking in broken code and causing them to waste their morning.  And, of course, the second time you set up a project in Continuous Integration, you can copy and paste what you did the first time, then only spend a fraction of the time working out the kinks.  It gets easier with every project you do. 

Like almost anything that’s good, there are some points where it’s painful.

  • When the build goes bad, people can be blocked.  Everyone’s relying on the code being deployed and functional.  Now that you have an automatic push of whatever you’ve checked in, there’s always the chance that what you’ve checked in is garbage, leading to garbage getting pushed out.  However, the flip side to this is that the problem is exposed immediately, and there’s a very real and pressing need to fix the problem.  Otherwise, the blocking issue may have gone undetected for days, perhaps weeks, where it would be far more difficult to track down.
  • It never works right the first time.  Never.  Whenever you set up a project in your Continuous Integration system or make major changes to an existing project, it won’t work.  You’ll spend an hour tracking down little problems here and there.  Even if you get the build scripts running on your local machine, they’ll fail when you put them out on the build server.  It’s just part of the territory.  However, tweaking the build to work on a different machine does mean that you’re more aware of the setup requirements, so when it comes time to ramp up the new guy or to deploy to production, you’ll have a better idea of what needs to be done.
  • Everyone knows when you screw up.  Everyone gets the build failure e-mail.  So don’t screw up.  Where I work, we’ve taken it a step further.  We used to have a “Build Breaker” high score screen, which would keep track of who had left builds broken the longest.  When the build broke, everyone involved would make sure they fixed it as soon as they could, so they’d avoid getting the top score.  However, we also have a watcher set up that will send an e-mail to the specific person responsible for the build break, letting them know that something went wrong.
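The “Build Breaker” high score idea is easy enough to approximate if your CI system gives you a history of build results.  The Python sketch below is an assumption-heavy illustration, not our actual scoreboard: the event format (timestamp in seconds, "broken" or "fixed", and a person) and the function name are invented for the example, and it charges the whole outage to whoever broke the build first.

```python
from collections import defaultdict


def build_breaker_scores(events):
    """Total the time each person left the build broken, worst offender first.

    `events` is a time-ordered list of (timestamp_seconds, status, person),
    where status is "broken" or "fixed".  The person on the first "broken"
    event is charged until the next "fixed" event.
    """
    scores = defaultdict(float)
    broken_since = None
    breaker = None
    for timestamp, status, person in events:
        if status == "broken" and breaker is None:
            broken_since, breaker = timestamp, person
        elif status == "fixed" and breaker is not None:
            scores[breaker] += timestamp - broken_since
            broken_since, breaker = None, None
    # Longest total break time first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Further failed builds while things are already broken don't reset the clock, which matches the spirit of the scoreboard: what counts is how long the build stayed red.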

Now you know why it’s good to set up Continuous Integration, as well as some of the things that will go wrong when you do.  So, all that’s left is for me to give you some advice based on what I’ve learned in my own experience.  I’m not going to go so far as to say that these are Continuous Integration Best Practices or anything like that, just some tips and tricks you might find helpful. 4

  • Do it.  Just do it.  Stop complaining about how hard it is or how it takes so much time to set up or how we don’t really need it at this point in the project.  Do it.  DO IT NOW.  It’s easier to do it earlier in the project, and the overall benefit is greater.
  • Add new projects to CI immediately.  Create the project, check it in, and add it to CI.  Don’t wait, don’t say “I’ll do it when I’ve added functionality X”.  Get it in the system as soon as you can.  That way, it’s out of the way and you won’t try to find some excuse for putting it off later when the testers are begging you to finally do it.  I view CI as a fundamental part of collaborative software development and the responsibility of any developer who sets up a new project.  If you’re a tester and the devs won’t do it themselves, do it for them.  You need it.  Trust me, you do.
  • Use what makes sense.  Where I work, we use CCNet and nant.  Not necessarily because they’re the best, but because they’re what we’ve used for years and they make sense for us.  make is ancient and confusing, we don’t use Ruby for anything, so Rake would be a bit outside of our world.  If you can do what you need to do with a batch file and a scheduled task, then go for it.  Although, trust me, you’re going to want something that can handle all of the automatic source control checkout stuff and allow wide flexibility in how it fires off builds.  And don’t pay for anything.  Seriously, there’s free stuff out there, you don’t need to be paying thousands of dollars a year just because something has a great sales pitch.
  • Consistency is the key.  On a major project about a year and a half ago, we standardized the set up of our projects and build scripts.  Ever since then, we’ve reused that model on a number of other projects, and it has been amazingly effective.  As long as we give parts of our projects standard names (project.Setup, project.Tests, etc.), we can copy and paste a small configuration file, change a few values in it, and in just a few minutes, we have that project being built, deployed, and tested in our CI system.  An additional benefit is that our projects now all have a similar organization, so any part of the code you look at will have a sense of familiarity.
  • Flexibility is also the key.  While we’ve gotten enormous gains from standardizing what we can standardize, there’s always going to be something that has to be different.  Don’t paint yourself into a corner by forcing everything you put in the system to act in exactly the same way.  However, don’t be too flexible.  If someone’s trying to take advantage of the flexibility in the system because they’re too lazy to be consistent, then, by all means, break out the Cane of Conformity and make them play by the rules.
  • Your build system is software, too.  Don’t think of it as just a throwaway collection of files to get the job done.  It isn’t.  It’s an integral part of your software.  It might not be deployed to production or shipped to the customer, but everything that is deployed to production or shipped to the customer will run through it.  Do things that make sense.  If you can share functionality, share it.  If you can parameterize something and reuse it, do that.  Check in your build scripts and CI configuration.  Remember, you’re going to have to update, extend and maintain your build system as you go, so it’s in your best interest to spend the time to make a system that’s simple to update, extend and maintain.
  • Be mindful of oversharing build scripts.  You want to try to make your build scripts and CI configs so that they’re modular and reusable, but be careful that you don’t share too many things or share the wrong things between unrelated projects.  At my company, we have a handful of teams, and most of them have one or more build servers.  At one point, one of the teams was reorganized into several smaller teams, and sent to work on wildly divergent projects.  However, they continued to share a single library of build scripts.  Some time later, someone made a change to one of the scripts in the library that he needed for his project.  His project on his server worked just fine.  Then, two weeks later, every other server that used these shared scripts began to fail randomly.  No one had any idea what was going on, and it took several hours to trace the problem back to the original change.  This illustrates the danger of sharing too widely.  You want to try to structure the build script libraries in such a way that changes are isolated.  Perhaps you can version them, like any other library.  Or, like we’ve done for most of our projects, copy the build script libraries locally into the project’s directory, so everyone’s referencing their own copy. 5
  • Check in your build scripts and CI configuration.  I know I just said that in a point above, but it’s important enough that it deserves its own bullet.  Your build scripts are an important piece of your software, so when they change, you need to rebuild your code, in order to make sure that the process still works.  You want them checked in for the same reasons the rest of your source code is checked in.  We even have our CCNet.config checked in, which means that our Continuous Integration systems themselves are in CI.  In other words, CCNet will detect a change to the CCNet.config that’s checked in, pull down the latest version, and restart itself to pick up the changes.  Under normal circumstances, we never have to touch the build server.  We just check in our changes and everything is automatically picked up.
  • Don’t touch the build server.  Obviously, you’ll have to set up the box, and once in a while, there’s some maintenance to be done, but for day to day operations, don’t log on to the box.  In fact, don’t even manually check anything out other than the CI configuration file.  Everything that needs to be checked out should be checked out automatically by the CI system.  This even extends to tools, where possible.  We’ve got versions of nant and other command line tools checked into SVN, and any build that uses them is responsible for checking them out.  One of the benefits of this is that it makes it easy to move the builds to a different server in an emergency if something goes wrong.  If any of our build servers dies, then we can probably get things back up and running on an alternate server in about half an hour.
  • PSExec is your friend.  If you’re doing stuff on Windows systems, there’s a tool from Sysinternals called “PSExec”, which makes it fairly straightforward to run a command on a remote machine.  Get it.  Use it.  Love it.
  • Every build should check out everything it needs.  Every build should be able to run from scratch. 6  Every build should check out all the code it needs, all the dependencies, all the tools, all the data files.  It should be possible for you to go on to the build box, wipe out the source tree, and have all of your builds succeed the next time they run.  In fact, I’d recommend intentionally deleting the source tree from time to time, just to make sure that everything is self-sufficient.  The reason for this is that every build should be using the latest code.  If Build A depends on a library that only Build B checks out, then there’s a chance that Build A will run before Build B updates, leaving Build A potentially out of date and in an unknown state.  Yes, this requirement often means that there are certain libraries or files that are included by pretty much everything, so even a simple change to those libraries will cause a build storm where everything gets rebuilt.  People hate these, but think about it:  You’ve changed a core dependency, therefore everything has to rebuild because everything was changed.
  • Only check out what you need.  Target the checkouts to be only the directories that you need for that particular build.  Don’t have every project watching the root of your repository, because then you’ll get builds for every single check in, no matter how unrelated the check in was.
  • Don’t set up your CI system with a normal employee account.  You want your CI system to run under an account that won’t go on vacation or get fired.
  • Fail builds when unit tests fail.  If you’re doing unit tests, you want those tests to be running as part of the CI build.  You also want them to fail the build.  The philosophy of unit testing is that when a unit test breaks, your code is broken and you need to fix the issue immediately.  This will very strongly encourage developers to make sure that the unit tests are maintained and in good working order and that their new code doesn’t break anything.
  • Don’t fail builds when functional/regression/integration tests fail.  It’s generally expected that at least some of your functional or regression tests will fail with every build.  If you don’t have at least a handful of regression tests failing due to open bugs in the software, then you need more tests.  Where I work, the functional test builds only fail when there is a problem building or executing the tests, not when one of those tests fails.
  • Don’t deploy if a build fails.  Deployment should be one of the last steps in a build and should only be performed after the build and unit tests (if you have them) are successful.  Don’t wipe out a previous build that was successfully deployed with something that you know is broken.
  • Don’t archive deployment packages if a build fails.  If the build breaks or the unit tests die or the deployment doesn’t work, don’t archive the deployment package.  This will ensure that any MSI that’s saved is of a build that passed all the unit tests and was installed successfully.
  • Split your builds into logical components.  If possible, avoid having a monolithic build that will build everything in your company.  The bigger a build is, the more stuff that’s included, the more chances for the build to go wrong and the longer a build will take.  You want to aim for quick and small builds for faster feedback.  It’s fine if you have cascading builds, where one build triggers the next, as long as the project that was actually changed is built first in order to give immediate feedback.
  • Don’t filter build failure mails.  EVER.  Build failure notices are vital to the health of a CI system.  When a build breaks, you need to know about it.  However, I’ve seen a lot of people who simply set up a mail rule that forwards all mail from the build server into a folder that they never look at.  DON’T DO THAT.  It’s fine if you filter the successes away, but the failures should be front and center.  If anything, you need to set up a rule that flags failures as important.  I have a mail rule that specifically filters mails from the build server only when the subject line contains “build successful” and does not contain “RE:” or “FW:”, etc.
  • Fix your broken builds.  NOW.  Don’t wait.  Fix it.  Now.  NOW!  When a build breaks because of you, it should leap to the front of your priority queue.  Do not go to lunch, do not go for coffee, do not pass go, do not collect $200.  FIX. IT. NOW.
  • Don’t go home without making sure your builds are successful.  And especially don’t go on vacation.
  • Sometimes, broken builds aren’t the end of the world.  Sometimes it’s okay to check in something you know will break the build.  If you’re doing a major refactor of part of the code, it’s fine to do it in stages.  Go ahead and check in bits and pieces as you go.  In fact, in some source control systems, it would be suicidal to attempt to pile up all of the renames, moves, and deletes into a single check in at the end.  In some cases, you might even want to disable a particular build while you make changes.  (Just make sure to turn it back on again.)

Finally, and most importantly, stop your whining, stop your objections, stop reading this post, and go set up your Continuous Integration system immediately.

  1. You people know seven programming languages, yet you’re afraid of a build script?  It’s not hard.  You’re just lazy.
  2. Except they can’t, because you’re in Cancun for the next two weeks.
  3. And, of course, QA got blamed for slipping the release date because they didn’t get their testing done according to schedule.
  4. Plus, I just wanted to say “Continuous Integration Best Practices” a couple of times to get more hits.  Continuous Integration Best Practices.  Okay, that oughta be enough.
  5. However you choose to organize it, sticking project specific e-mail addresses in a shared file that everyone references is a dumb idea.  There’s nothing shared about it.  Don’t force me to rebuild everything just because you’ve hired a new intern.
  6. Well, almost every build, at least.  Obviously a test run is going to need the project it’s testing to have been built first.

February 6, 2011   No Comments

dynamic Has a Use! Partial Verification of Complex Classes in Test Automation

I don’t like dynamic in C#.  It’s one of those features that seems to serve only to confuse the language.  And yeah, I get that Ruby and Python are doing it, but if Ruby and Python jumped off a bridge, would you?

When I first picked up C#1, I liked it for three reasons: 

  1. Garbage collection.
  2. It had a built-in string class.
  3. It actively prevented you from doing stupid things that C++ was fine with.

When they added “var” to the language, reason #3 died a little.  People started using “var” to declare ordinary variables where the type is known.  Sure, the compiler can figure out what the type is, but a human can’t, at least not by just reading the code without a knowledge of the return type of the function you just called. 

And then along came “dynamic” and walloped reason #3 over the head with a large stick.  Now, not even the compiler knows what you’re doing.  You have to wait until runtime to tell if your code is going to work.  It’s a bit like programming in Basic on a Commodore 64, except you’ll get a RuntimeBinderException instead of “?SYNTAX ERROR IN 150”2.  The end result is the same, a stupid typo broke your code when someone else tried to run it. 

It doesn't make any sense here, either.

The idea behind dynamic is that it lets you write code against certain types of COM objects and dynamic objects from languages like Python and Ruby.  Of course, COM should be dead by now, and maybe MS Office’s Automation should look at this new-fangled .Net thing that everyone else in Redmond has been using FOR TEN YEARS, and if you want to use features from Python or Ruby, maybe you should actually be using Python or Ruby, but whatever…  The way dynamic works is that you can declare a variable as type “dynamic”, just like you’d declare it a “string” or an “int”, and this tells the compiler that you have no idea what the object is.  Since you don’t know what it is and since the compiler doesn’t know what it is, you can go ahead and call any method you’d like on it or use any property on it and theoretically, it’ll work.  It’s called “Duck Typing”, as in “quacks-like-a”.  In other words, I don’t care what object X really is and I don’t want to force it to be a subclass of Y or implement interface Z.  All I care about is whether or not it can “Quack()”.  If it can “Quack()”, then it’s enough of a Duck for me to use in my function.

In most cases, dynamic in C# doesn’t actually mean that you can dynamically change the structure of the underlying object.  Whatever that object is, it’s still that type, even though you’re hiding it behind the dynamic keyword.  Think of dynamic as if it’s “var” with amnesia.  “var” knows what type it is at compile time, so you can ask it “Do you know how to quack?” and it can tell you.  With dynamic, it won’t know whether or not it can quack until the code actually runs, at which point its memory comes back and it remembers how to quack.  Either way, whether it knows how to quack at run-time or compile-time, the thing quacking is still a duck. 

Now, I don’t know about you, but I don’t trust amnesiac ducks. 

You see, since you can call any method or property on the dynamic object, you have to simply hope that whatever the object turns out to be when you run it actually implements that method or property, or else you’re totally screwed.  Also, since there’s no Intellisense to guide you, you’d better hope that you spelled the property or method correctly, or else you’re totally screwed.  Oh, yeah, and since there’s no type checking, you’d better hope that the type of that property or the types of the parameters or return value of the method are what you think they are, or else you’re totally screwed. 

It’s this whole “…or else you’re totally screwed” bit that turns me off about dynamic.  There’s a lot of ways this can go wrong, and when it goes wrong, you don’t know it’s gone wrong until the program is running and it dies.  And, of course, it won’t do that until it’s out live in production and a customer comes across the bug, and the whole site goes down and the closest tester gets blamed for missing the bug.  You can see how that would be a problem for me, so you can understand why I don’t like dynamic.

But then I found a way to actually make use of dynamic to solve a testing problem I’ve been having. 

Before I go into details, let me first say that I’m a bit reluctant to share this idea, for several reasons: 

  1. I don’t like dynamic, so it feels dirty to use it in this way.
  2. I haven’t really fully explored this idea, so it could be a complete load of nonsense.
  3. It’s voodoo magic code and I don’t like using voodoo magic code because no one can understand voodoo magic code.  I know that I’m going to spend more time explaining how it works to other people than they’re going to spend using it, because they’re going to look at it and be scared away.
  4. “…or else you’re totally screwed”.  It would not be good to have test cases that fail at runtime because of typos.
  5. It feels like this is half of a good idea, and that once I figure out what the missing half of the puzzle is, I’ll have something that’s useful and won’t scare people away and won’t feel dirty.  I don’t like writing about half an idea.
  6. I don’t like dynamic.

However, I’m sharing it because I want it to be a good idea and hope that writing about it will get me thinking about how to solve those remaining problems. 

Anyway, here’s the scenario: 

Where I work, we have services that return complex objects.  Sometimes, these objects can have upwards of 20 properties, and some of those properties are classes that have more properties on them.  In other words, there’s more properties floating around in these objects than in most Florida land scams.  Testing these objects is relatively straightforward, though.  Simply create a parallel object with the expected values, and walk through the expected and actual objects property by property, comparing the values.  If something doesn’t match, fail the test. 

The problem I have is that for most of my tests, I don’t really care about most of those properties.  Each test only focuses on a handful of values and the rest don’t matter.  The one-size-fits-all approach of walking the properties requires an exact match.  If I only care about four or five properties, why should I have to specify values for the other 15?  I might not even know what they are.  One of the properties could be something like response time, which I have no way of knowing at the time I write the test.  But I don’t want to have to write custom validation for each test to check only the values it cares about, because that’s inefficient and difficult to maintain.

Let’s give a more concrete example class to work off of here. 

 
public class SearchResult
{
        public string Title { get; set; }
        public string Excerpt { get; set; }
        public string URL { get; set; }
        public double Score { get; set; }
        public int Size { get; set; }
        public int Position { get; set; }
        public DateTime IndexedTime { get; set; }
        public string CacheURL { get; set; }
} 

 

This class is a simplification of a result returned from a search engine.  Each result has a title, an excerpt of text from the page, a URL for the page and internal information, like the confidence score of the result, the size of the page, the date the page was last indexed, the position of the result, and, if available, a URL for a cached copy of the page. 

Here’s the property by property checker: 

 
public static void CompareResults(SearchResult expected, SearchResult actual)
{
    Assert.AreEqual(expected.CacheURL, actual.CacheURL, "CacheURL mismatch");
    Assert.AreEqual(expected.Excerpt, actual.Excerpt, "Excerpt mismatch");
    Assert.AreEqual(expected.IndexedTime, actual.IndexedTime, "IndexedTime mismatch");
    Assert.AreEqual(expected.Position, actual.Position, "Position mismatch");
    Assert.AreEqual(expected.Score, actual.Score, "Score mismatch");
    Assert.AreEqual(expected.Size, actual.Size, "Size mismatch");
    Assert.AreEqual(expected.Title, actual.Title, "Title mismatch");
    Assert.AreEqual(expected.URL, actual.URL, "URL mismatch");
}

 

Now, let’s write a test:  If I search for “dogs”, then I expect to get the Wikipedia entry for “Dogs” as my top result. 

 
public static void CheckThatWikipediaIsInResultsForDogs()
{
    SearchResult expected = new SearchResult();
    expected.URL = "http://en.wikipedia.org/wiki/Dogs";
    expected.Title = "Dog - Wikipedia, the free encyclopedia"; 
    expected.Position = 0;

    SearchEngine engine = new SearchEngine();
    SearchResult actual = engine.Search("dogs"); 

    CompareResults(expected, actual);
} 

 

Now we run it and…  Aw crap.

Assert Failed! != http://www.mathpirate.net/cache?=http%3A%2F%2Fen.wikipedia.org%2Fwiki%2FDogs : CacheURL mismatch

The test failed, not because the Wikipedia entry wasn’t the result, which is what I cared about, but because the CacheURL was wrong.  I don’t care about the CacheURL.  There are other tests somewhere else that will deal with CacheURL.  Now, I can specify it if I want to, but then I also have to specify the excerpt and the page size and the last indexed time, etc.  But I’m not going to.  I don’t give a flying monkey dance about those values in this test, so it’s stupid to specify them.  The “CompareResults” method has to change.

One fairly simple way to take care of that problem is to compare a property only when the expected value is non-null or non-default, and to skip the comparison otherwise.  Here’s the new and improved CompareResults:

public static void CompareResults(SearchResult expected, SearchResult actual)
{
    if (expected.CacheURL != null) { Assert.AreEqual(expected.CacheURL, actual.CacheURL, "CacheURL mismatch"); }
    if (expected.Excerpt != null) { Assert.AreEqual(expected.Excerpt, actual.Excerpt, "Excerpt mismatch"); }
    if (expected.IndexedTime != default(DateTime)) { Assert.AreEqual(expected.IndexedTime, actual.IndexedTime, "IndexedTime mismatch"); }
    if (expected.Position != default(int)) { Assert.AreEqual(expected.Position, actual.Position, "Position mismatch"); }
    if (expected.Score != default(double)) { Assert.AreEqual(expected.Score, actual.Score, "Score mismatch"); }
    if (expected.Size != default(int)) { Assert.AreEqual(expected.Size, actual.Size, "Size mismatch"); }
    if (expected.Title != null) { Assert.AreEqual(expected.Title, actual.Title, "Title mismatch"); }
    if (expected.URL != null) { Assert.AreEqual(expected.URL, actual.URL, "URL mismatch"); }
}

Run it now and it passes.  It didn’t care about the missing CacheURL.  Hey, problem solved!

Except…   The new and improved CompareResults is new and improved and still broken.

Here’s what was returned from the Search method:

Notice that Position is set to 5.  In my test, I explicitly said that Position should be 0.  CompareResults missed it because of the check that I added: if(expected.Position != default(int)).  default(int) isn’t some magic value, like “undefined”.  It’s “0”.  Plain old zero.  So when that line executed, it became if(0 != 0)3.  And so, the check was skipped and it missed the fact that the test should have failed.  The same problem will occur if you explicitly expect a property to be null.

So…  What to do?

There are multiple options to handle this case, none of them very pleasant:

  • Live with it.  One-size-fits-all never actually fits you.  Deal.
  • Write custom validation for every test that needs to check a default or null value explicitly.  Hope that everyone remembers this needs to be done.
  • Write a parallel set of classes that mimic the actual class, but have boolean properties like “PositionSet” for every single property on the object. 
  • Go back to the original method that checks everything exactly, but write custom pre-treater methods that can go in and clear out values that you don’t care about before you do the comparison. 4

Or…

Use dynamic.

Here’s how it works:

Instead of creating an instance of the type of object that you want to create, you create a dynamic object, and only set the properties you care about.  Then, in the comparison method, it can check if you’ve set the value, and then do the comparison if you have.  If not, it gets ignored.  This solves the problem of forcing a check on values you don’t care about as well as the problem of default values.  If you didn’t set the value up front, it’s not there, so the verification method won’t look at it.

Let’s see it in action:

public static void CheckThatWikipediaIsInResultsForDogsDynamic()
{
    dynamic expected = new ExpandoObject();
    expected.URL = "http://en.wikipedia.org/wiki/Dogs";
    expected.Title = "Dog - Wikipedia, the free encyclopedia";
    expected.Position = 0;

    SearchEngine engine = new SearchEngine();
    SearchResult actual = engine.Search("dogs");

    CompareResultsDynamic(expected, actual);
}

The test method looks almost identical.  The only notable difference is the first line.  I changed “SearchResult expected = new SearchResult();” to “dynamic expected = new ExpandoObject();”5.  The properties are set on the object the same way, the function is called the same way.

public static void CompareResultsDynamic(dynamic expected, SearchResult actual)
{
    if (PropertyExists(expected, "CacheURL")) { Assert.AreEqual(expected.CacheURL, actual.CacheURL, "CacheURL mismatch"); }
    if (PropertyExists(expected, "Excerpt")) { Assert.AreEqual(expected.Excerpt, actual.Excerpt, "Excerpt mismatch"); }
    if (PropertyExists(expected, "IndexedTime")) { Assert.AreEqual(expected.IndexedTime, actual.IndexedTime, "IndexedTime mismatch"); }
    if (PropertyExists(expected, "Position")) { Assert.AreEqual(expected.Position, actual.Position, "Position mismatch"); }
    if (PropertyExists(expected, "Score")) { Assert.AreEqual(expected.Score, actual.Score, "Score mismatch"); }
    if (PropertyExists(expected, "Size")) { Assert.AreEqual(expected.Size, actual.Size, "Size mismatch"); }
    if (PropertyExists(expected, "Title")) { Assert.AreEqual(expected.Title, actual.Title, "Title mismatch"); }
    if (PropertyExists(expected, "URL")) { Assert.AreEqual(expected.URL, actual.URL, "URL mismatch"); }
}

This is the new CompareResults function.  First, the “expected” parameter has changed.  It’s now “dynamic” instead of “SearchResult”.  The Asserts themselves are identical, and they’re still wrapped inside of if statements, but the if conditions have changed.  Instead of looking for nulls or default values, they’re checking that the property exists on the dynamic object.

This is one part where the whole dynamic thing is sorely lacking.  There’s no built-in way to tell if an object can do what you want it to do without trying to do it.  It’s like you’re commanding the duck to “Quack, damn you!”, without first asking “Can you actually quack?”.  Ruby’s got “respond_to?(‘Quack’)”, while C# has “Do it yourself and don’t forget to account for all of the different possibilities of how the quacking can be done, otherwise just try to  ‘Quack’ and catch the exception”.  Here’s a very abbreviated and probably error-prone example of how to ask if an object supports the property you want to call:

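// Requires: using System.Collections.Generic; using System.Dynamic;
public static bool PropertyExists(object dynamicObject, string propertyName)
{
    if (dynamicObject is ExpandoObject)
    {
        // ExpandoObject keeps its members in a dictionary, so ask the dictionary directly.
        IDictionary<string, object> expando = (IDictionary<string, object>)dynamicObject;
        return expando.ContainsKey(propertyName);
    }
    else
    {
        // Anything else: fall back to reflection on the runtime type (public properties only).
        return dynamicObject.GetType().GetProperty(propertyName) != null;
    }
}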

First, it checks if the object is an ExpandoObject.  If it is, then it can use the explicit IDictionary<string, object> implementation on ExpandoObject in order to find out if the property exists or not.  Otherwise, it uses basic Reflection to ask if the object has the property you’re looking for.  This PropertyExists method will probably need a bit of work for general consumption, but it’s a start and works well enough for this example.
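
To make the difference from the default-value check concrete, here is a small sketch of what PropertyExists reports for members that were explicitly set to 0 or null versus members that were never set.  It is an illustration only, not code from the sample project; the class name PropertyExistsDemo is made up, and the method body is repeated so the snippet compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Dynamic;

public static class PropertyExistsDemo
{
    // Same logic as the PropertyExists method above, repeated so this compiles on its own.
    public static bool PropertyExists(object dynamicObject, string propertyName)
    {
        if (dynamicObject is ExpandoObject)
        {
            var expando = (IDictionary<string, object>)dynamicObject;
            return expando.ContainsKey(propertyName);
        }
        return dynamicObject.GetType().GetProperty(propertyName) != null;
    }

    public static void Main()
    {
        dynamic expected = new ExpandoObject();
        expected.Position = 0;   // explicitly set to the default value
        expected.Title = null;   // explicitly set to null

        // Explicitly set members exist, even when they hold 0 or null.
        Console.WriteLine(PropertyExists(expected, "Position"));  // True
        Console.WriteLine(PropertyExists(expected, "Title"));     // True

        // Members that were never set simply are not there.
        Console.WriteLine(PropertyExists(expected, "CacheURL"));  // False

        // Non-Expando objects take the reflection path.
        Console.WriteLine(PropertyExists("some string", "Length"));  // True
        Console.WriteLine(PropertyExists("some string", "Quack"));   // False
    }
}
```

"Was it set?" and "what was it set to?" become two separate questions, which is exactly what the null/default check could not express.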

But…  Does it work?

Assert Failed! 0 != 5 : Position mismatch

It didn’t complain about the CacheURL, like the first solution did, but it did catch the Position mismatch, which the second solution missed.  I’d call that a success!

There are, of course, problems with this solution…

  • There’s no Intellisense, so you have to know what the object looks like.  Of course, what you can do there is write the initialization with the actual class, then change it to dynamic once you have all of the properties in place.
  • There’s no compile-time checking, so if you make a typo or if a property disappears or changes names, you don’t know about it until everything breaks at runtime.  Or worse, nothing breaks at all, but you end up not checking the values you think you’re checking. 6
  • It’s voodoo magic code.  You’re going to be the only one who understands how it works, so you’re on the hook for making sure it actually does work and you’re going to be the one that gets blamed or cursed every time something goes wrong.
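
One possible way to soften the second problem in that list (the silently skipped check) is to drive the comparison from the expected object instead of from a hand-written list of ifs.  The sketch below is only an assumption about how that could look, not the approach used in the sample project: ExpandoComparer, FindMismatches, RunExample and SampleResult are all hypothetical names, and it collects messages instead of calling Assert so that it stays independent of any test framework.

```csharp
using System;
using System.Collections.Generic;
using System.Dynamic;
using System.Reflection;

// Hypothetical stand-in for SearchResult, trimmed to three properties.
public class SampleResult
{
    public string URL { get; set; }
    public int Position { get; set; }
    public string CacheURL { get; set; }
}

public static class ExpandoComparer
{
    // Returns one message per problem; an empty list means everything that was set matched.
    public static List<string> FindMismatches(ExpandoObject expected, object actual)
    {
        var problems = new List<string>();
        Type actualType = actual.GetType();

        // Only the members that were set on the expando are visited.
        foreach (KeyValuePair<string, object> pair in (IDictionary<string, object>)expected)
        {
            PropertyInfo property = actualType.GetProperty(pair.Key);
            if (property == null)
            {
                // A typo or a renamed property is reported instead of being skipped.
                problems.Add(pair.Key + ": no such property on " + actualType.Name);
                continue;
            }

            object actualValue = property.GetValue(actual, null);
            if (!object.Equals(pair.Value, actualValue))
            {
                problems.Add(pair.Key + ": expected <" + pair.Value + "> but was <" + actualValue + ">");
            }
        }
        return problems;
    }

    public static List<string> RunExample()
    {
        dynamic expected = new ExpandoObject();
        expected.URL = "http://en.wikipedia.org/wiki/Dogs";
        expected.Position = 0;
        expected.Titel = "Dog";   // deliberate typo: SampleResult has no such property

        var actual = new SampleResult
        {
            URL = "http://en.wikipedia.org/wiki/Dogs",
            Position = 5,
            CacheURL = "not checked, because it was never set on expected"
        };
        return FindMismatches((ExpandoObject)expected, actual);
    }

    public static void Main()
    {
        // Reports the Position mismatch and the misspelled Titel; URL matches and CacheURL is ignored.
        foreach (string problem in RunExample())
        {
            Console.WriteLine(problem);
        }
    }
}
```

The trade-off is that object.Equals compares boxed values strictly, so an expected int will not equal an actual long with the same number.  A real version would need to decide how forgiving to be about that, and about nested objects.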

Like I said, it’s half of a good idea and I haven’t really explored all of its benefits and consequences yet.  The core is here and I know this is a good place to start and build something awesome on top of.  I already have some ideas of where this could go.  Like toward a generic expandable data-driven object comparison framework.

But that’s for another time…

Here’s the code from this post:  http://www.mathpirate.net/svn/Projects/AutomationUsingDynamic/

  1. I came through C++ and hadn’t been exposed to Java at that time.
  2. Okay, it’s more like writing JavaScript, but whatever.
  3. if(0!=0){ throw new UniverseDestroyingParadoxException("Critical Mathematical Foundation Error.  Please Reboot."); }
  4. I admit that I’ve done this once.  It was not one of my better moments.
  5. ExpandoObject is a type that does some magic with the C# dynamic runtime binder.  I’m not going to explain how it works; after all, there are search engines for that.  However, using ExpandoObject lets you dynamically add members to your dynamic object.  If you tried to do this with an ordinary object, you’d get a RuntimeBinderException because the property you’re setting isn’t there.
  6. Why, oh why didn’t they do something like “dynamic<T>” for at least some compile-time checking?

January 1, 2011   No Comments

Just throw;ing this out there.

In C#, what’s the difference between “throw;” and “throw e;”?

This is one of those “Interview Question” tidbits of language trivia.  It’s the sort of thing that you can probably remember the exact circumstances where you learned about it, because it was probably preceded by three hours of frustration and confusion.  It’s the sort of thing that I’d always assumed separated the serious C# programmers from the casual passerby.

Until today…

Today, this issue came up and I was honestly surprised by the number of people around me who didn’t know the difference, and even more surprised about how the difference seems to be missing from official sources, like the C# Language Specification1, so I felt compelled to write about it.

To clarify, I’m asking about the difference between “throw;” and “throw e;”, where you’re rethrowing an exception that was caught inside a catch statement.

In other words, this:

public static void DoSomething()
{
    throw new Exception("Thrown on line 12.");
}

public static void ExceptionRethrower()
{
    try
    {
        DoSomething();
    }
    catch (Exception e)
    {
        LogException(e);
        throw e;
    }
}

So, DoSomething(); does something and throws an exception2.  You want to catch it and log it or whatever, but you want the exception to bubble up to the callers of ExceptionRethrower() for some reason.  I’m sure you’ve written code like this somewhere.

And there it is, “throw e;” at the end of the catch.  Let’s write a bit of code to call this function and catch the exception, and see what turns up.

static void Main(string[] args)
{
    try
    {
        ExceptionRethrower();
    }
    catch (Exception e)
    {
        Console.WriteLine(e);
    }
}

Okay…  So, the function DoSomething is throwing an exception, which is caught and logged in ExceptionRethrower, then rethrown up to Main, where the exception and its stack trace are printed out.  That means the stack trace will have DoSomething on top, then ExceptionRethrower, followed by Main on the bottom, right?

System.Exception: Thrown on line 12.
   at ThrowExample.Program.ExceptionRethrower() in E:\svn\Projects\ThrowExample\ThrowExample\Program.cs:line 26
   at ThrowExample.Program.Main(String[] args) in E:\svn\Projects\ThrowExample\ThrowExample\Program.cs:line 34

Nope.

It’s got Main and ExceptionRethrower, but what happened to DoSomething?  And the exception message says that I threw it on line 12, but according to the stack trace, I was nowhere near line 12.

So…  WTF?

Let’s play with the debugger, instead.  I’m going to comment out the try/catch in Main and let the exception fall out and kill my app and see what VS has to say about it.

Line 26 is my “throw e;”.  It’s eating my stack trace!

Okay, so, let’s see what happens if I just “throw;”, instead.

System.Exception: Thrown on line 12.
   at ThrowExample.Program.DoSomething() in E:\svn\Projects\ThrowExample\ThrowExample\Program.cs:line 12
   at ThrowExample.Program.ExceptionRethrower() in E:\svn\Projects\ThrowExample\ThrowExample\Program.cs:line 26
   at ThrowExample.Program.Main(String[] args) in E:\svn\Projects\ThrowExample\ThrowExample\Program.cs:line 34

Hey, look!  DoSomething() is there in the trace now, with the exception originating on line 12 like it should.

So, there’s your answer to the original question:  Rethrowing the instance, as in “throw e;” will discard the existing stack trace and make it look like the exception originated on the “throw e;” line, while a parameterless “throw;” will retain the original stack trace.

Sort of…

Look at that second stack trace again.  Look closely at the line numbers.  It still has line 26, where the “throw;” is located, but the call to DoSomething() is on line 21.  It should, like the call stack below shows, have DoSomething() line 12 (Where the exception was thrown), then ExceptionRethrower() line 21 (Where DoSomething() was called), then Main() line 34 (Where ExceptionRethrower() was called).

How did line 21 get tossed aside for line 26?  Well, think about the stack for a moment.  When you’re generating a stack trace, either on the call stack side or on the exception side, each function has a single frame, pointing at the last line that was executed.  When you’re at line 12 in DoSomething(), the last line in ExceptionRethrower() was line 21.  However, once the exception is thrown, you hit the catch in ExceptionRethrower() and get rerouted, so by the time you end up in Main(), the last line called in ExceptionRethrower() was line 26 at the throw;.  This also means that if you throw, catch, and rethrow, all in a single function, the stack trace will end up pointing at the rethrow, regardless of whether you’re using “throw;” or “throw e;”.

Of course, if that whole “call stack” explanation I just gave actually held any water, then adding in a finally block would cause the stack trace to point at the last line of the finally, because that was the last line of the function executed.  However, that’s not the case.  Even when you add in the finally, the stack trace still points at the throw line.  But hey, the explanation still sounds plausible, so I’m leaving it in this post.  I’m sure it’s something close to that, anyway.  It’s a good thing finallys don’t affect the trace, because if they did, I can’t imagine how wildly FUBARed some stack traces would end up. 3

In general, if you have to catch and rethrow an exception, you should use the plain throw;.  If you rethrow the instance using throw e;, you’ll have a hard time debugging and testing, because your stack traces will all dead end at a throw that’s nowhere near the source of the problem.  However, if you’re writing a library or service for third-party consumption, then throw e; provides a quick way to let exceptions get out to the callers while hiding some of the implementation details of your code by killing off the call stack. 4

Another thing to be aware of is that whichever rethrowing scheme you use, the original instance is used.  That means that if you save off the exception instance for later, the stack trace will have been wiped out or modified by the time you want to use it.  Even more subtle and insidious is that this means that if you’re using a threaded logger of some sort, there’s a chance that the stack trace will be mangled by the time the logger gets around to writing out the exception.  In other words, sometimes the log will say the exception’s on line 12, other times it’ll say the exception’s on line 26. 5
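
A small sketch of that last point, under a few assumptions: the names RethrowSnapshotDemo and CaptureBeforeAndAfter are made up for illustration, the NoInlining attribute is there only to keep the DoSomething frame from being optimized away, and "take a string snapshot inside the catch" is just one possible workaround, not something the sample project does.

```csharp
using System;
using System.Runtime.CompilerServices;

public static class RethrowSnapshotDemo
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void DoSomething()
    {
        throw new Exception("Original failure.");
    }

    // Returns two strings: a snapshot taken inside the catch, and the text
    // produced by the very same exception instance after "throw e;" has run.
    public static string[] CaptureBeforeAndAfter()
    {
        Exception saved = null;
        string snapshot = null;
        try
        {
            try
            {
                DoSomething();
            }
            catch (Exception e)
            {
                saved = e;                // what a deferred or threaded logger would hold on to
                snapshot = e.ToString();  // an immutable copy, taken right away
                throw e;                  // resets the stack trace stored on the instance
            }
        }
        catch (Exception)
        {
            // Swallowed on purpose; only the two traces matter here.
        }
        return new[] { snapshot, saved.ToString() };
    }

    public static void Main()
    {
        string[] traces = CaptureBeforeAndAfter();
        Console.WriteLine(traces[0].Contains("DoSomething"));  // True: the snapshot kept the original frame
        Console.WriteLine(traces[1].Contains("DoSomething"));  // False: the saved instance lost it
    }
}
```

If a logger works on another thread, handing it the snapshot string rather than the live exception object avoids the race described above.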

The sample code is here: http://www.mathpirate.net/svn/Projects/ThrowExample/

To reiterate, for those who skipped to the end:

  • “throw e;” wipes out the stack trace, restarting it at the point of the throw.
  • “throw;” maintains the existing stack trace, but the current function’s frame still ends up pointing at the line of the rethrow instead of the line that originally failed.
  1. I couldn’t find it in there, anyway…
  2. From line 12, obviously…
  3. By the way, speaking of finally blocks, did you know that you can’t return from one?  There’s another bit of obscure language trivia for you.
  4. Of course, if you actually think that’s a good idea, then you probably will want to rethink your exception handling strategy in general…  That’s just the only reason I could think of for deliberately using throw e; to rethrow a caught exception.
  5. Have fun debugging that one…

December 13, 2010   No Comments

Maybe they need to read their own best practices.

I got this error today.  It is a magnificent demonstration of several layers of FAIL.

August 19, 2010   No Comments

‘COMPATIBILITY’ is undefined

At work, I’m currently spending most of my time developing an internal webapp.  It’s the first significant work I’ve done in ASP.Net MVC and JavaScript, so it has been a great learning experiment.

Yesterday, I added a feature that involved JSON serialization to carry data on a round trip from the server to the client and back.  The feature is fairly straightforward:  The server writes one of its data objects to the page as a JSON string, which lets the JavaScript on the page interact with the object.  In response to a user action, the page will package up the object in JavaScript, turning it back into a JSON string, and pass it back to the server.  It feels like a bit of a hack, but hey, it works.

It works in some browsers, that is.  To get the object back into JSON format for the trip back to the server, I was using the method JSON.stringify(object).  The JSON object is apparently natively supported by IE8 and FF3.5 (And maybe Chrome, although I didn’t check), and since the target audience of this internal tool is already using one of those browsers, I don’t have to worry about compatibility.
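For the curious, the round trip looks roughly like this.  It’s a minimal sketch, not the actual code from the app: the function name, the sample object and its fields are all made up for illustration.  The only real pieces are JSON.parse and JSON.stringify, which are the native calls in browsers that have the JSON object.

```javascript
// Hypothetical stand-in for the JSON string the server writes into the page.
var serverJson = '{"id": 42, "name": "widget", "tags": ["a", "b"]}';

// Sketch of the round trip: string -> object -> (user edits) -> string.
function roundTrip(jsonFromServer) {
    // Server -> client: turn the JSON string into a live object.
    var item = JSON.parse(jsonFromServer);

    // The page's script changes the object in response to a user action.
    item.name = "renamed widget";
    item.tags.push("c");

    // Client -> server: package it back up as a JSON string.
    return JSON.stringify(item);
}

var payload = roundTrip(serverJson);
```

The resulting payload string is what gets passed back to the server.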

Before checking into our CI system, which will deploy the code to a testing server, I run it locally.  Works fine in IE8.  Works fine in FF 3.6.  Everything’s happy.  SVN Commit and continue working.  If it’s working HERE, it’ll work THERE.

Or not…

About an hour after I check in this change, I go take a look at the testing server to verify a couple of unrelated changes.  I click on a link and I get a JavaScript error.

'JSON' is undefined

Um…  Excuse me?  I’m using IE8, which I know has a native JSON object.  It’s one of the things they brag about adding in IE8.  It’s there.  I know it’s there.  This worked just fine when I was running it on my box, and it’s not like there’s some file I forgot to deploy that could cause a native JS object to disappear.  It has to be there…

So I try it in Firefox.  If something went wrong with the deployment or the testing machine, it would be broken in Firefox, too.  Of course, it works fine there.

I scour teh Intarwebz for help related to this error and find nothing useful.  Most of the advice is “Well, did you include ‘JSON2.js’?”, which is a file I’m not using and don’t need because JSON is a native object now.  It has to be there…

After wasting about an hour trying to find a solution, I performed my favourite problem solving maneuver:  Throw away everything you know about the problem and start from scratch.  The branch of thought I was on before didn’t get anywhere, so it was time to explore a new direction.  In the initial search, I was looking for a solution to the “JSON is undefined” problem.  But maybe that’s not the problem.  Maybe that’s just the symptom of a different issue.  So now, I have to find what the problem really is.

  • Did the code not get checked in right?  The code is fine.  Nothing’s left in pending changes.
  • Did the JavaScript not get deployed correctly?  The deployment is fine.  All script files I have locally are on the testing server.
  • Have I ever come across anything like this before?  Yes.  IE’s security settings will treat intranet sites differently than Internet sites.

IE security settings…?  Well, that’s easy enough to check.

I go back to the page and change the server in its URL to be the fully qualified internal domain name for the server, not just the name of the server itself.  Hit the link…  Hey, it works!

So…  IE security settings, perhaps?  I consider simply ignoring the problem and telling everyone who uses this app to access the server with the full domain name, but quickly realize what a stupid solution that would be.

Why would this be a security setting, though?  Whenever I’ve run afoul of IE’s security model, it’s because I’ve been doing something that IE thinks could potentially be a risk to security.  It rarely is an actual risk (Unless trying to expand and collapse nodes in an XML file is risky in ways I cannot fathom), but there’s usually some semi-rational reason for blocking an action.  Killing the JSON object isn’t rational.  You can’t do anything with it that you couldn’t already do in JavaScript.  So why disable it?

And that’s when I noticed something.  On my localhost page, there was that little “broken page” button next to the address bar, but that button wasn’t there on the testing server.  WTF?  I didn’t force compatibility settings on the testing server, so why is that button missing?

I open up the Compatibility View Settings dialog and I find this inside:

“Display intranet sites in Compatibility View” is checked BY DEFAULT.

So, Compatibility View is what makes your browser act like IE 7 because so many websites out there were hacked together to work in IE6 and 7 because that’s what everyone used.  As you may know, IE 6 and 7 were a bit wild when it came to rendering things, preferring the cowboy way of going it alone and doing what feels right, rather than, you know, attempting to follow standards.  When IE8 came along and fixed a lot of those issues, it meant that some poorly designed older pages looked like cat vomit.  So, Microsoft put in the Compatibility View button to make the pages render like they did in IE 7.

That’s all well and nice, except that A: They turned it on by default for intranet sites, and B: It’s not just rendering.  This is important, because I had no idea about this until yesterday.  Compatibility View also rolls back the JavaScript engine in IE to the previous version.  So…  No more JSON object!  Poof!  Gone!

DAMN YOU, IE6.  I’M NOT USING YOU AND I’M NOT EVEN DEVELOPING TO SUPPORT YOU AND YOU’RE STILL MAKING MY LIFE HELL.

So, what did I do?

<script src="Microsoft.Ajax.js" type="text/javascript"></script>

That has its own JSON object in it and made my problem go away.  It’s a cop-out, true, but I don’t really care.  I’m trying to write an application that has absolutely nothing to do with stupid default compatibility settings and JavaScript engines and all that other nonsense.
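If you’d rather not depend on the native object being there at all, the general shape of the fix is a feature check: use the native JSON object when the browser has one, and fall back to a script-supplied serializer when it doesn’t.  Here’s a minimal sketch of that idea.  The names pickStringify, globalObject and fallbackStringify are hypothetical, and the fallback below is only a placeholder, not a real serializer and not what any particular library provides.

```javascript
// Returns a stringify function: the native one when the given global object
// exposes JSON.stringify, otherwise the fallback supplied by the caller.
function pickStringify(globalObject, fallbackStringify) {
    var nativeJson = globalObject.JSON;
    if (nativeJson && typeof nativeJson.stringify === "function") {
        return function (value) {
            return nativeJson.stringify(value);
        };
    }
    return fallbackStringify;
}

// Two fake "browsers" so the check can be exercised anywhere:
var standardsMode = { JSON: JSON };   // native JSON object is present
var compatibilityView = {};           // 'JSON' is undefined

// Placeholder fallback, for illustration only.
function placeholderFallback(value) {
    return String(value);
}

var nativeChoice = pickStringify(standardsMode, placeholderFallback);
var fallbackChoice = pickStringify(compatibilityView, placeholderFallback);
```

In a real page you’d pass window as the global object and a proper serializer as the fallback.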

August 18, 2010   4 Comments

I think this can be optimized.

I was playing around with some Linked Data SPARQL query engine stuff today and it generated the following query against its SQL data store.  Now, I’m not a database expert or anything, but something tells me that you can probably optimize this query somewhat.

SELECT id,  value FROM rdf_entities WHERE id

July 29, 2010   No Comments

Windows Workflow Quick Tip

Given that I spent all day looking for this and had trouble finding it (Okay, I’ll be honest, I skipped a page in the book because it didn’t look relevant, but whatever), if you ever need to get the name of the currently executing state on a Windows Workflow (WF) State Machine Workflow, there’s a class you can use called StateMachineWorkflowInstance.  StateMachineWorkflowInstance has a property called CurrentStateName to give you the name as a string and another called CurrentState, which gives you an instance of the StateActivity object that’s currently running.

For some reason unknown to me, StateMachineWorkflowInstance isn’t derived from WorkflowInstance 1, so the WorkflowInstance you get back from WorkflowRuntime.CreateWorkflow can’t be cast to a StateMachineWorkflowInstance instance.  Instead, you have to new up the object yourself.  The constructor takes the WorkflowRuntime where the workflow is executing, and the Guid returned by the InstanceId property on the WorkflowInstance that was returned by CreateWorkflow for the workflow you want to know the state for.

  1. Cause, you know, .Net is all object orienty and such, so that’s what I would’ve done…

June 15, 2010   No Comments