To The Cloud!

About two years ago, the higher-ups at my mid-sized web company had an idea.  Pull together a small team of devs, testers, IT, and even a corporate apps refugee, stick them in a small, dark room, and have them move the entire infrastructure that powered a $300 million a year business into the cloud, within six months.

We were terrified.

We had no idea what we were doing.

But somehow, we managed to pull it off.

It wasn’t easy.  A lot of things didn’t work.  Many shortcuts were taken.  Mistakes were made.  We learned a lot of things along the way.

And so now, two years on, it seems like a good time to take a look at how we got to where we are and what we’ve discovered.

Where We Started

When we began the cloud migration process, we lived in two data centers.  One on the west coast, one on the east coast.  For the most part, they were clones of one another.  Each one had the same number of servers and the same code deployed.  This was great for disaster recovery or maintenance.  Release in the west?  Throw all our traffic into the east!  Hurricane in the east?  Bring all the traffic across the country.  Site stays up, user is happy.  It wasn’t so great for efficiency.  The vast majority of the time, our data centers were load balanced.  We’d route users to the data center that would give them the fastest response time.  Usually, that ended up being the DC closest to them.  Most of our users were on the east coast, or in Europe or South America.  As a result, our east coast servers would routinely see 4-5x the traffic load of the boxes on the west coast.  That meant that we might have 60 boxes in Virginia running at 80% CPU, while the twin cluster of 60 boxes in Seattle was chillin’ at less than 20%.  And when we needed to add capacity in the east, we’d have to add the same number of boxes to the west.  That’s a lot of wasted processing power.  And electrical power.  And software licenses.  Basically, that’s a lot of wasted money.

We had virtualized our data centers a few years prior, and while that was a huge step forward over having rack after rack of space heaters locked in a cage, it still wasn’t the freedom we were promised.  Provisioning a server still took a couple of days to get through the process.  We’d routinely have resource conflicts, where one runaway box would ruin performance for everything else on the VM host.  There was limited automation, so pretty much anything you wanted to do involved a manual step where you had to hop on the server, install something, and configure something else by hand.  And if we ran out of capacity on our VM hosts, there’d be a frantic “We gotta shut something down” mail from the admin, followed by a painful meeting where several people gathered in front of a screen and decided which test boxes weren’t needed this week.  (The answer was usually most of them…)  And once in a while, we’d have to add another VM host, which meant a months-long requisition and installation process.

All this meant that we were heavily invested in our servers.  We knew their names, we knew their quirks, and if one had a problem, we’d try to revive it for days on end.  Servers were explicitly listed in our load balancers, our monitoring tools, our deployment tools, our patch management tools, and a dozen other places.  Serious time and money went into each server, so they mattered to us.  It was a big deal to create a new box, so it was a big deal to decide to rebuild one.

Our servers were largely Windows (still are).  That meant that in the weeks following Patch Tuesday, we’d go through a careful and tedious process of patching several thousand servers to prevent some Ukrainian Script Kiddie from replacing our cartoon dog mascot with a pop-up bomb for his personal “Anti-virus” ransomware.  And, of course, that process wasn’t perfect.  Box 5 in cluster B wouldn’t reboot.  Box 7 ended up with a corrupted configuration.  And box 12 just flat out refused to be patched because it was having a rough day.  So hey, now there’s a day or two of cleaning up that mess.  Hope no one noticed that box 17 was serving a Yellow Screen of Death all day and didn’t get pulled from the VIP!

Speaking of VIPs, we had load balancers and firewalls and switches and storage and routers and all manner of other “invisible” hardware that loved to fail.  Check the “Fast” checkbox on the load balancer and it would throw out random packets and delay requests by 500 ms.   The firewall would occasionally decide to be a jerk and block the entire corporate network.  The storage server would get swamped by some forgotten runaway scheduled process, and that would lead to a cascade of troubles that would knock a handful of frontend servers out of commission.  Every day, between 9 AM and 1 PM, traffic levels would be so high that we’d overload a bottleneck switch if we had to swing traffic.  And don’t even get me started about Problem 157.  No one ever figured out Problem 157.

In short, life before was a nightmare.

Where We Are Now

Today, we’re living in three regions of AWS.  We’ve got auto-scaling in place, so that we’re only running the number of servers we need.  Our main applications are scripted, so there’s no manual intervention required to install or configure anything, and we can have new servers running within minutes.  The servers in all of our main clusters are disposable.  If something goes wrong, kill it and build a new one.  There’s no point in spending hours tracking down a problem that could’ve been caused by a cosmic ray, when it takes two clicks and a couple of minutes to get a brand new box that doesn’t have the problem.  (That’s not to say the code is magically bug free.  There are still problems.  The cloud doesn’t cure bad code.)  We pre-build images for deployment, run those images through a test cycle, then build new boxes in production using the exact same images, with only a few minor config tweaks.  We even have large portions of our infrastructure scripted.
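If you’re curious what that bake-then-replace pattern boils down to, here’s a rough sketch in Python using boto3.  It’s just an illustration, not our actual tooling, and all the instance, image, and group names are made up: bake an image from a tested box, point the cluster’s Auto Scaling group at it, and let the group swap the old boxes out.

# A rough sketch of the bake-an-image-then-replace pattern described above.
# The boto3 calls are real AWS APIs, but every name here is hypothetical.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# 1. Bake an AMI from a box that has already been through the test cycle.
image_id = ec2.create_image(
    InstanceId="i-0123456789abcdef0",   # the tested staging instance
    Name="frontend-2015-05-27",
)["ImageId"]

# 2. Point the cluster's Auto Scaling group at the new image.
autoscaling.create_launch_configuration(
    LaunchConfigurationName="frontend-2015-05-27",
    ImageId=image_id,
    InstanceType="m3.large",
)
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="frontend",
    LaunchConfigurationName="frontend-2015-05-27",
)

# 3. Retire the old boxes (in stages, if you're being cautious); the group
#    launches replacements from the new image and keeps capacity steady.
group = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=["frontend"]
)["AutoScalingGroups"][0]
for instance in group["Instances"]:
    autoscaling.terminate_instance_in_auto_scaling_group(
        InstanceId=instance["InstanceId"],
        ShouldDecrementDesiredCapacity=False,
    )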

Not too long ago, I needed to do a release to virtually every cluster in all regions.  Two years ago, this would have been a panic-inducing nightmare.  It would have involved a week of planning across multiple teams.  We would’ve had to get director level sign off.  We would have needed a mile long backout plan.  And it would have been done in the middle of the night and would have taken a team of six people about eight hours to complete.  When I had to do it in our heavily automated cloud world, I sent out a courtesy e-mail to other members of my team, then clicked a few buttons.   The whole release took two hours (three if you count changing the scripts and building the images), and most of it was completely automatic.  It could have taken considerably less than two hours, but I was being exceptionally cautious and doing it in stages.  And did I mention that I did this on a Tuesday afternoon during peak traffic, without downtime or a maintenance window, all while working from home?

To recap, I rebuilt almost every box we have, in something like ten separate clusters, across three regions.  I didn’t have to log into a single box, I didn’t have to debug a single failure.  The ones that failed terminated themselves, and new boxes replaced them, automatically.  The boxes were automatically added to our load balancers and our monitoring system, and once the new boxes were in service, the old boxes removed themselves from the load balancers and monitoring systems.  While this was going on, I casually watched some graphs and worked on other things.

This is a pretty awesome place to be.

And we still have ideas to make it even better.

May 27, 2015 20:17:28

Winston High CPU Usage

As part of my home seismometer network, I set up an installation of Winston Wave Server. It was far easier than Earthworm to set up and to get my data (from homemade seismometers) into. However, I noticed that it was running on the hot side. Even when idle, it was using 99% CPU. I thought that was strange, but it wasn’t causing any problems, so I didn’t feel like trying to fix it.

Until today, when I expanded my network to a second node and found that it couldn’t keep up with the data.

Now, I know that people run Winston with a large number of stations and channels. It seemed very odd that two stations and seven total channels were making Winston chug along for me. Sure, I have it on a box that’s not super high-spec, but it should be enough to handle more than seven channels. The docs even say something about running 200-300 stations before needing to optimize.

So clearly, something was wrong.

I reviewed all the config settings. Nothing.

Then I found a “no input” flag, -i. Since I was running Winston as a service on Linux, it didn’t need direct input, so I decided to give that flag a try.

Instantly, CPU went from maxed out to barely doing anything.

Not quite sure what the interactive input mode was trying to do when there wasn’t any input for it, but clearly, it wasn’t anything good. Now my server is happily handling seven channels without complaint, and it no longer feels like it’s going to burst into flames.

March 23, 2014 00:28:57

heli_ewII “No Menu! Just Quit.”

I’m in the process of setting up a seismometer and Winston and Earthworm for a personal seismometer station.  I’m trying to get everything set up to run on startup, so that the system can reboot and continue, without manual intervention.

I’ve got Winston working.

I’ve got my data acquisition script working.

I even have Earthworm working.

The problem is that heli_ewII (which is the only reason I have Earthworm running right now) refuses to play along.  It’ll say it’s starting, then throw a temper tantrum, scream “No Menu!  Just quit.”, and exit.  Whatever it’s doing, statmgr isn’t happy about it, and refuses to retry starting it.  It’ll say that the module needs human intervention.  Later, when I manually restart heli_ewII, everything works fine.  It looks like what’s happening is that the heli_ewII module starts too quickly and is impatient.  I don’t think Winston has had a chance to start up and begin serving the menu correctly.  So, heli_ewII thinks it’s talking to a bad server, and instead of gracefully trying again or terminating in a way that statmgr will restart it 30 seconds later, it just explodes.

Clearly not what I want.

Fortunately, Earthworm is open source!  So, why not just fix the code?  This is what I did in Earthworm 7.7.  It’s untested, it should probably make the retry options configurable, and it might, in fact, do very, very bad things.  But hey, it’s free!

Between the lines “Build_Menu (&BStruct);” and “if(!BStruct.got_a_menu) {“, just after the “/* See what our wave servers have to offer */” comment, insert this code.  It should make heli_ewII retry several times before giving up.  Hopefully, this is enough time for Winston to get its act together.

/* Retry fetching the wave server menus a few times before giving up,
   so heli_ewII survives starting before Winston is ready. */
int noMenuCount = 0;
while(noMenuCount < 10 && !BStruct.got_a_menu){
  logit("e", "%s No Menu!  Try again.\n", whoami);
  /* Throw out the failed menu queues before asking again. */
  for(j=0; j<BStruct.nServer; j++)
  {
    wsKillMenu(&(BStruct.menu_queue[j]));
  }
  sleep_ew(5000);        /* wait 5 seconds between attempts */
  noMenuCount++;
  Build_Menu (&BStruct); /* ask the wave servers for their menus again */
}

March 9, 2014 16:51:36

Earthworm, Winston, heli_ewII, and partial traces

I’m not a geologist.  I’m not a seismologist.  I say that because I want to make it clear, before you continue reading, that I have no idea what I’m talking about.  I want to write about this here because it frustrated me for a couple of days, and because there was nothing out there about the issue.

Anyway, I’ve been putting together a personal seismograph station for my house, because I’m just that kind of nerd.  I built a TC1 slinky seismometer last year, and have been running it using Amaseis for a couple of months now.  That software is a bit limited, and I wanted to take it to the next level.  I got an Arduino, an accelerometer, and a Raspberry Pi, and decided that I’d try installing Earthworm.

Because, clearly, an amateur seismometer station hanging out in a garage needs to be running Earthworm.

(I’ll probably talk about my setup at some point in the future, especially if I get it tuned and happy and to the point where I feel comfortable that it’s working and stable.)

Anyway, after wasting several days trying to figure out how to get my data into Earthworm wave_serverV, I gave up and installed Winston, instead.

If it’s good enough for the USGS, it’s good enough for my garage.

I was easily able to write an adapter that fed data from my Arduino into Winston.  Problem is, I don’t really like Winston’s graphs.  They’re so blue.  I like the multicolored heli_ewII graphs more.

So, I pointed heli_ewII at Winston (yay for interoperability) and didn’t get what I was expecting.  Instead of solid traces, I just got short specks of a trace every two minutes.  Two minutes was the redraw interval for the helicorder images set in heli_ewII.d, and it looked like heli_ewII was only getting a few seconds of data each time, then nothing.  I tweaked every setting and turned on debugging, but nothing worked.  I knew the data was in Winston; I could see it in the DB.  But, for some reason, heli_ewII was producing spotty, broken graphs.

Eventually, I sorted out what the problem was.  The data acquisition Python script was trying to be too smart.  You see, Winston wants the sample rate of the data when you insert tracebufs.  Since my Arduino & Pi & Python package was built to the exacting specifications of a project assembled for fun on a kitchen table, the actual sample rate could vary wildly.  However, it was very easy to calculate what the rate was:  number of samples taken / time taken.  I was sending data to Winston every five seconds or so.  I’d get 280 – 290 samples from the Arduino accelerometer in that window, for a sample rate of about 56-58 samples per second.  And so, I’d tell Winston the exact sample rate when I sent it data.

That turned out to be a mistake.

I think that variable sample rate ended up confusing heli_ewII and Swarm’s zoomed trace (and spectrograms).  Every time the sample rate changed, they’d stop drawing data.  Swarm’s main heli view worked just fine, but that was it.  Everything else was messed up.

So, I tried giving it a constant sample rate.  That made the helicorder displays happy and continuous, but then I started getting overlapped buffers, or something like that.

In the end, I set my acquisition script to automatically adjust the sample rate it sends about once an hour.  This should minimize the missing data in the traces, minimize overlapping data, and still report a sample rate that corresponds to the actual performance of the accelerometer.
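For anyone trying something similar, here’s roughly what that compromise looks like.  This is a simplified sketch, not my actual script; read_batch and send_tracebuf are stand-ins for the real Arduino-reading and Winston-feeding code.

# Rough sketch: report a fixed sample rate to Winston, but re-measure it
# about once an hour so it tracks how the accelerometer actually performs.
# read_batch() and send_tracebuf() are hypothetical stand-ins.
import time

RATE_UPDATE_INTERVAL = 3600.0   # seconds between sample rate adjustments

def run(read_batch, send_tracebuf):
    reported_rate = None
    last_update = 0.0
    while True:
        batch_start = time.time()
        samples = read_batch()                  # ~280-290 samples in ~5 seconds
        elapsed = time.time() - batch_start
        measured_rate = len(samples) / elapsed  # usually lands around 56-58 sps
        if reported_rate is None or time.time() - last_update >= RATE_UPDATE_INTERVAL:
            reported_rate = measured_rate       # adjust only about once an hour
            last_update = time.time()
        # Between adjustments the reported rate stays constant, which keeps
        # heli_ewII and Swarm's zoomed views drawing continuous traces.
        send_tracebuf(samples, reported_rate)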

March 8, 2014 23:24:30

I guess I never mentioned…

I probably should have said something here, but I promptly abandoned CWP #5 after I discovered the motors I was using wouldn’t propel the ball.  Instead, the ball just kinda made the motors jam.  I tried adding mass to the motors to give them more inertia, but that just threw them off-balance; they started shaking like crazy, and there was a burning electronics smell.

Since I clearly had no idea what I was doing and no equipment to fix things, I stopped working on the project.

March 8, 2014 22:54:39

Firing Assembly

image

Set up the firing tube and the motors and…
It doesn’t work.
The ball barely moves out of the tube.
Since it’s probably a lack of motor power, there’s not much I can do about it.
Oh well.

February 15, 2014 21:07:21

Firing Tube Print Complete

image

Looks like it turned out pretty good.

February 15, 2014 18:34:54

Firing Tube

Here’s the model of the firing tube.  The large tube is the right diameter for a Ping-Pong ball.  The holes in the sides are where the pitching wheels will go.  The bit off the side is where the motor mounting will attach.

This is printing right now, and it says it’ll take about an hour and a half to finish.  I tend to have bad luck with prints like this.  They always seem to peel away from the baseplate or split at one of the layers.  If it doesn’t work, I’ll try using PLA, which is less likely to do that.  Unfortunately, PLA is more likely to gum up my printer…

February 15, 2014 17:08:13

Better Fit

image

The second attempt at the slider turned out better. 
The second revision of the motor holder didn’t work out so well, though.  I messed up the measurement, so the holes didn’t align.  Have to try that one again.

February 15, 2014 17:01:52

Slider, Round 1

image

The motor fits fine, but it’ll slide right out if I don’t secure it using the screw holes.  So I’ll need to stick them in the model.
The slide rail, however…  The slider is way too loose.  I’ll need to shrink the hole just a bit.

February 15, 2014 15:31:53