About two years ago, the higher-ups at my mid-sized web company had an idea: pull together a small team of devs, testers, IT, and even a corporate apps refugee, stick them in a small, dark room, and have them move the entire infrastructure that powered a $300 million a year business into the cloud, within six months.
We were terrified.
We had no idea what we were doing.
But somehow, we managed to pull it off.
It wasn’t easy. A lot of things didn’t work. Many shortcuts were taken. Mistakes were made. We learned a lot of things along the way.
And so now, two years on, it seems like a good time to take a look at how we got to where we are and what we’ve discovered.
Where We Started
When we began the cloud migration process, we lived in two data centers, one on the west coast and one on the east coast. For the most part, they were clones of one another. Each one had the same number of servers and the same code deployed. This was great for disaster recovery or maintenance. Release in the west? Throw all our traffic into the east! Hurricane in the east? Bring all the traffic across the country. Site stays up, user is happy.
It wasn’t so great for efficiency. The vast majority of the time, our data centers were load balanced. We’d route users to the data center that would give them the fastest response time. Usually, that ended up being the DC closest to them. Most of our users were on the east coast, or in Europe or South America. As a result, our east coast servers would routinely see 4-5x the traffic load of the boxes on the west coast. That meant that we might have 60 boxes in Virginia running at 80% CPU, while the twin cluster of 60 boxes in Seattle was chillin’ at less than 20%. And when we needed to add capacity in the east, we’d have to add the same number of boxes to the west. That’s a lot of wasted processing power. And electrical power. And software licenses. Basically, that’s a lot of wasted money.
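To put a rough number on that waste, here is a small back-of-the-envelope sketch in Python. The box counts and CPU figures are the ones from the example above (with 20% as an upper bound for the west). The function names and the 80% target are my own assumptions for illustration, and the math ignores the disaster recovery headroom that the mirrored setup was buying us.

```python
def fleet_utilization(clusters):
    """Overall CPU utilization (0.0 to 1.0) for (box_count, cpu_percent) pairs."""
    total_boxes = sum(count for count, _ in clusters)
    busy = sum(count * cpu for count, cpu in clusters)  # box-percent of real work
    return busy / (total_boxes * 100)


def boxes_needed(clusters, target_cpu_percent):
    """Boxes required to carry the same load if every box ran at the target CPU."""
    busy = sum(count * cpu for count, cpu in clusters)
    return -(-busy // target_cpu_percent)  # integer ceiling division


# Figures from the example: 60 boxes in Virginia at 80%, 60 in Seattle at 20%.
mirrored = [(60, 80), (60, 20)]

# Hypothetical target: run every box as hot as the east coast already ran.
print(fleet_utilization(mirrored))
print(boxes_needed(mirrored, 80))
```

Under those assumptions the combined fleet is only half busy, and the same load would fit on 75 boxes instead of 120.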
We had virtualized our data centers a few years prior, and while that was a huge step forward over having rack after rack of space heaters locked in a cage, it still wasn’t the freedom we were promised. Provisioning a server still took a couple of days to get through the process. We’d routinely have resource conflicts, where one runaway box would ruin performance for everything else on the VM host. There was limited automation, so pretty much anything you wanted to do involved a manual step where you had to hop on the server and install something and configure something else by hand. And if we ran out of capacity on our VM hosts, there’d be a frantic “We gotta shut something down” mail from the admin, followed by a painful meeting where several people gathered in front of a screen and decided which test boxes weren’t needed this week. (The answer was usually most of them…) And once in a while, we’d have to add another VM host, which meant a months-long requisition and installation process.
All this meant that we were heavily invested in our servers. We knew their names, we knew their quirks, and if one had a problem, we’d try to revive it for days on end. Servers were explicitly listed in our load balancers, our monitoring tools, our deployment tools, our patch management tools, and a dozen other places. There was serious time and money that went into a server, so they mattered to us. It was a big deal to create a new box, so it was a big deal to decide to rebuild one.
Our servers were largely Windows (still are). That meant that in the weeks following Patch Tuesday, we’d go through a careful and tedious process of patching several thousand servers to prevent some Ukrainian Script Kiddie from replacing our cartoon dog mascot with a pop-up bomb for his personal “Anti-virus” ransomware. And, of course, that process wasn’t perfect. Box 5 in cluster B wouldn’t reboot. Box 7 ended up with a corrupted configuration. And box 12 just flat out refused to be patched because it was having a rough day. So hey, now there’s a day or two of cleaning up that mess. Hope no one noticed that box 17 was serving a Yellow Screen of Death all day and didn’t get pulled from the VIP!
Speaking of VIPs, we had load balancers and firewalls and switches and storage and routers and all manner of other “invisible” hardware that loved to fail. Check the “Fast” checkbox on the load balancer and it would throw out random packets and delay requests by 500 ms. The firewall would occasionally decide to be a jerk and block the entire corporate network. The storage server would get swamped by some forgotten runaway scheduled process, and that would lead to a cascade of troubles that would knock a handful of frontend servers out of commission. Every day, between 9 AM and 1 PM, traffic levels would be so high that we’d overload a bottleneck switch if we had to swing traffic. And don’t even get me started about Problem 157. No one ever figured out Problem 157.
In short, life before was a nightmare.
Where We Are Now
Today, we’re living in three regions of AWS. We’ve got auto-scaling in place, so that we’re only running the number of servers we need. Our main applications are scripted, so there’s no manual intervention required to install or configure anything, and we can have new servers running within minutes. The servers in all of our main clusters are disposable. If something goes wrong, kill it and build a new one. There’s no point in spending hours tracking down a problem that could’ve been caused by a cosmic ray, when it takes two clicks and a couple of minutes to get a brand new box that doesn’t have the problem. (That’s not to say the code is magically bug free. There are still problems. The cloud doesn’t cure bad code.) We pre-build images for deployment, run those images through a test cycle, then build new boxes in production using the exact same images, with only a few minor config tweaks. We even have large portions of our infrastructure scripted.
Not too long ago, I needed to do a release to virtually every cluster in all regions. Two years ago, this would have been a panic-inducing nightmare. It would have involved a week of planning across multiple teams. We would’ve had to get director-level sign-off. We would have needed a mile-long backout plan. And it would have been done in the middle of the night and would have taken a team of six people about eight hours to complete. When I had to do it in our heavily automated cloud world, I sent out a courtesy e-mail to other members of my team, then clicked a few buttons. The whole release took two hours (three if you count changing the scripts and building the images), and most of it was completely automatic. It could have taken considerably less than two hours, but I was being exceptionally cautious and doing it in stages. And did I mention that I did this on a Tuesday afternoon during peak traffic, without downtime or a maintenance window, all while working from home?
To recap, I rebuilt almost every box we have, in something like ten separate clusters, across three regions. I didn’t have to log into a single box, I didn’t have to debug a single failure. The ones that failed terminated themselves, and new boxes replaced them, automatically. The boxes were automatically added to our load balancers and our monitoring system, and once the new boxes were in service, the old boxes removed themselves from the load balancers and monitoring systems. While this was going on, I casually watched some graphs and worked on other things.
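For the curious, the replace-don't-repair loop described above looks roughly like this when boiled down to a toy Python simulation. To be clear, this is not our actual tooling and it doesn't call any AWS API. The `Cluster` class and the `launch` and `is_healthy` callables are hypothetical stand-ins for whatever launches a box from a pre-built image and health checks it.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy stand-in for one cluster behind a load balancer (hypothetical)."""
    in_service: list = field(default_factory=list)
    terminated: list = field(default_factory=list)


def rolling_replace(cluster, launch, is_healthy, max_attempts=3):
    """Swap out every box one at a time without ever dropping capacity."""
    for old_box in list(cluster.in_service):
        for _ in range(max_attempts):
            new_box = launch()
            if is_healthy(new_box):
                break
            # A box that fails its check is thrown away, not debugged.
            cluster.terminated.append(new_box)
        else:
            raise RuntimeError(f"no healthy replacement for {old_box}")
        # The new box goes into service before the old one is pulled.
        cluster.in_service.append(new_box)
        cluster.in_service.remove(old_box)
        cluster.terminated.append(old_box)
```

Nothing in the loop tries to fix a bad box. It just asks for another one, which is the whole point of treating servers as disposable.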
This is a pretty awesome place to be.
And we still have ideas to make it even better.
May 27, 2015