Why We Were Down.
Last weekend was pretty interesting for us. Our production webserver died while we were deploying our new Facebook application (around 3:20am). After some frantic live debugging and desperate attempts to revive the server, we had to “call it”. Time of death, May 17th, 6:15am.
After a few short hours of restless sleep, we tried again to resuscitate the server. Those attempts were equally futile. OK, so let’s look at the evidence in more detail:
- CPU spinning
- Memory thrashing
- MySQL DB connections maxed out
- SSH shell intermittently responsive
- HTTP requests fail
- No Apache access being logged
- etc…
We patched up the culprit code, brought up a backed-up version of our deceased server, and pushed out our new code. Hmmm, so far so good… but wait, why doesn’t our site work yet? Unfortunately for us, we had accidentally released the IP of our production webserver when we brought up our new server. In other words, www.mylisto.com didn’t know about the new server we had just brought up… so our site was still down. This should be a simple fix: just remap the mylisto.com DNS record to point to our new IP and we should be good to go, right? Right, but our previous DNS entry was set with a TTL (time to live) of 1 week, so resolvers were allowed to keep the old IP in cache for that long. That means it might take up to a week for some computers/users to realize that mylisto.com is hosted on a new server. So, our site continued to be down.
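To put a number on that “up to a week”, here’s a rough sketch of the arithmetic in Python. The function name, the example timestamp, and the 5-minute comparison TTL are all made up for illustration; only the 1-week TTL comes from what actually happened to us. It also assumes resolvers honor the TTL exactly, which is not always true in practice.

```python
from datetime import datetime, timedelta

ONE_WEEK_SECONDS = 7 * 24 * 60 * 60   # the TTL our old DNS record carried
FIVE_MINUTES_SECONDS = 5 * 60         # hypothetical shorter TTL, for comparison


def latest_stale_time(changed_at: datetime, old_ttl_seconds: int) -> datetime:
    """Return the latest moment a resolver may still serve the OLD record.

    Worst case: a resolver cached the old record just before we changed it,
    so it is allowed to keep answering with the old IP for one full TTL.
    """
    return changed_at + timedelta(seconds=old_ttl_seconds)


# Arbitrary example timestamp, not the real time we changed the record.
changed_at = datetime(2000, 1, 1, 12, 0)

print("1-week TTL, stale until:  ", latest_stale_time(changed_at, ONE_WEEK_SECONDS))
print("5-minute TTL, stale until:", latest_stale_time(changed_at, FIVE_MINUTES_SECONDS))
```

The catch is that the TTL that matters is the one on the record *before* the change. Lowering it after the outage has started doesn’t help anyone who already has the old answer cached.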
Alright alright, so we can’t just leave our users hanging for a week… we gotta come up with a solution! Our solution was rather ingenious: instead of using mylisto.com (which currently doesn’t map to our new IP), we just use the new IP. So a page that used to look like www.mylisto.com/nintendo_wii would now be http://75.101.157.86/nintendo_wii. Yay! Now our site is clickable!
We were celebrating our great solution until we found out that our CAPTCHA solution (provided by reCAPTCHA) authenticates image requests based on the referrer URL. So when a user went to http://75.101.157.86/rgs/register to register for a new account or went to http://75.101.157.86/itm/add to add a new item, the CAPTCHA image wouldn’t load. (reCAPTCHA probably thought a hacker was trying to spoof our site…) So, even though our existing users could click around the site, no new users could join and no new items could be added to the site…
Needless to say, we reverted the site back to using mylisto.com… :P
Thankfully, this happened to us while our site is still young and most of our users are our friends. We definitely learned a couple of valuable lessons. For one, don’t set your DNS TTL (time to live) to 1 week when your production environment might be in flux. And also, don’t use cron to run a job every minute when the job itself may take longer than a minute to execute: the runs pile up on top of each other until the server runs out of CPU, memory, and DB connections. :P
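For the cron lesson, one common guard is to have the job take a non-blocking file lock and bail out if a previous run still holds it. Below is a minimal sketch of that idea in Python, not the fix we actually shipped. The lock file name and `run_job` are hypothetical placeholders, and `fcntl` is only available on Unix-like systems.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager

# Hypothetical lock file location; any path the job can write to works.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "example_cron_job.lock")


@contextmanager
def single_instance(lock_path):
    """Yield True if we got the lock, False if another run still holds it."""
    lock_file = open(lock_path, "w")
    try:
        try:
            # LOCK_NB makes flock fail immediately instead of waiting.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            yield False
            return
        try:
            yield True
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    finally:
        lock_file.close()


def run_job():
    """Placeholder for the real work, which may take longer than a minute."""
    print("doing the slow work...")


def main():
    with single_instance(LOCK_PATH) as acquired:
        if not acquired:
            print("previous run still in progress, skipping this one")
            return
        run_job()


if __name__ == "__main__":
    main()
```

With this in place, cron can still fire every minute, but a slow run just causes the next few invocations to exit right away instead of stacking up.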