Ironically: Lightning Strikes Out Amazon’s Cloud – And Us!

When your cloud supplier is out, you’re out too. For Amazon the disaster was a “Performance issue” according to their health check dashboard. We have a different story to tell: for us this meant a 100% outage, a “Service disruption”. Yesterday we booted up the primary functions and today we’ve recovered most of the Mobile Documents system, wrecked by Amazon’s backup system running amok. This is what happened.

19:41 CET, Sunday August 7, 2011, lightning strikes a transformer in Dublin, Ireland. The strike sparks a fire and an explosion, and the failure propagates into Amazon’s data center, where the backup generators that normally kick in automatically fail. An Amazon cloud power outage is a fact. And out goes the Mobile Documents service as well. You would expect a top-notch modern data center to be able to withstand a thunderstorm. No problem, you may think, the center will soon get power back and Amazon can restart their instances.

02:06 CET, Monday August 8, 2011, Amazon recommends re-launching instances in different availability zones, that is, outside the affected facility in the EU-WEST-1 region.

On Monday morning we wake up, check our system in the Amazon Web Services console, and discover that the live instance that was running is locked, meaning we cannot re-launch it. Ok, so we turn to the backups (snapshots) that we regularly take of all instances and volumes of the entire Mobile Documents system. We’re stunned to find that some critical backups are just gone: 18 days of backups for some of the volumes have vanished. The latest backup for these volumes dates from July 20, which makes it a bad idea to revert to, a terrible idea. Perhaps these backups are just offline and will be made available soon. So we wait for more information from Amazon as they get their data center powered up and alive again.

08:04 CET, Tuesday August 9, 2011, we check the Amazon Health Dashboard and find this (under the Europe tab, for Amazon Elastic Compute Cloud in Ireland):

In some cases EC2 instances or EBS servers lost power before writes to their volumes were completely consistent. Because of this, in some cases we will provide customers with a recovery snapshot instead of restoring their volume so they can validate the health of their volumes before returning them to service. We will contact those customers with information about their recovery snapshot.

Ok, so this explains why our live system instance is locked: it’s wrecked. That is not surprising. A push email service with hundreds of thousands of subscribers in every corner of the world is of course busy 24/7 and is more or less constantly writing to disk. Ok, so let’s fall back on the backups, we think, and let’s wait until we get them back. Then things get worse.

11:12 CET, Tuesday August 9, 2011, we receive an email. In detail Amazon explains the “impossible”:

Separately, and independent from the power issue in the affected availability zone, we’ve discovered an error in the EBS software that cleans up unused snapshots. During a recent run of this EBS software in the EU-West Region, one or more blocks in a number of EBS snapshots were incorrectly deleted. The root cause was a software error that caused the snapshot references to a subset of blocks to be missed during the reference counting process. This process compares the blocks scheduled for deletion to the blocks referenced in customer snapshots. As a result of the software error, the EBS snapshot management system in the EU-West Region incorrectly thought some of the blocks were no longer being used and deleted them. We’ve addressed the error in the EBS snapshot system to prevent it from recurring.

We have now disabled all of your snapshots that contain these missing blocks. You can determine which of your snapshots were affected via the AWS Management Console or the DescribeSnapshots API call. The status for any affected snapshots will be shown as “error”.
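The cleanup error Amazon describes can be illustrated with a small sketch. This is not Amazon's code; the function names, block IDs, and the `missed` parameter are hypothetical and only model the idea in the email: blocks scheduled for deletion are compared against the blocks that snapshots still reference, and a reference that the counting pass misses makes a live block look unused.

```python
from typing import Dict, FrozenSet, Set


def referenced_blocks(snapshots: Dict[str, Set[str]],
                      missed: FrozenSet[str] = frozenset()) -> Set[str]:
    """Collect every block ID that some snapshot still references.

    `missed` is a hypothetical knob that simulates the bug: references
    to these blocks are silently dropped while counting.
    """
    live: Set[str] = set()
    for blocks in snapshots.values():
        live |= blocks - missed
    return live


def blocks_to_delete(candidates: Set[str], live: Set[str]) -> Set[str]:
    """Only candidates that no snapshot references may be deleted."""
    return candidates - live


# Made-up example: two snapshots that share block "b2".
snapshots = {"snap-a": {"b1", "b2"}, "snap-b": {"b2", "b3"}}
candidates = {"b1", "b3", "b4"}  # blocks scheduled for deletion

# A correct pass keeps b1 and b3 because snapshots still use them.
correct = blocks_to_delete(candidates, referenced_blocks(snapshots))

# A buggy pass that misses the reference to b3 deletes it as well.
buggy = blocks_to_delete(
    candidates, referenced_blocks(snapshots, missed=frozenset({"b3"}))
)

# Any snapshot that contained a wrongly deleted block is now damaged.
damaged = sorted(s for s, blocks in snapshots.items() if blocks & buggy)
```

In this toy run the correct pass deletes only the unreferenced block, while the buggy pass also removes a block that "snap-b" depends on, which is the kind of snapshot Amazon then marks as “error”.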

It then dawns upon us. Not only are 18 days of backups for some volumes gone, they have been deleted, and as if that were not enough, the majority of the remaining snapshots have been corrupted by Amazon’s backup system. Most of our backups glare back at us in red, screaming “error”.
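For readers who want to triage their own snapshots after an incident like this, here is a minimal sketch that finds the newest usable snapshot per volume. It is not what we ran at the time. It assumes each snapshot is a dictionary shaped like the entries in a boto3 `describe_snapshots` response (keys `SnapshotId`, `VolumeId`, `State`, `StartTime`); the function name and the sample data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, List, Optional


def latest_usable_per_volume(snapshots: List[dict]) -> Dict[str, Optional[dict]]:
    """Map each volume ID to its newest snapshot in state 'completed'.

    Volumes whose snapshots are all in another state (for example
    'error') map to None, meaning there is nothing safe to restore from.
    """
    by_volume: Dict[str, List[dict]] = defaultdict(list)
    for snap in snapshots:
        by_volume[snap["VolumeId"]].append(snap)

    result: Dict[str, Optional[dict]] = {}
    for volume_id, snaps in by_volume.items():
        usable = [s for s in snaps if s["State"] == "completed"]
        result[volume_id] = (
            max(usable, key=lambda s: s["StartTime"]) if usable else None
        )
    return result


# Made-up sample data: one volume has an old good snapshot, one has none.
sample = [
    {"SnapshotId": "snap-1", "VolumeId": "vol-a",
     "State": "completed", "StartTime": datetime(2011, 7, 20)},
    {"SnapshotId": "snap-2", "VolumeId": "vol-a",
     "State": "error", "StartTime": datetime(2011, 8, 6)},
    {"SnapshotId": "snap-3", "VolumeId": "vol-b",
     "State": "error", "StartTime": datetime(2011, 8, 6)},
]
report = latest_usable_per_volume(sample)
```

With boto3, the input list could come from something like `boto3.client("ec2").describe_snapshots(OwnerIds=["self"])["Snapshots"]`. A volume that maps to `None`, or to a snapshot that is weeks old, is a sign that the backups cannot be relied on for a quick restore.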

We now have one last resort: try to recover the instance that was live when the disaster struck, the one that is wrecked and locked. So we ask for access to it to see what can be done. Tuesday evening we get it. Indeed it’s inconsistent: we can’t boot it, and attempts to read and dump the data fail. It’s a wreck.

01:26 CET, Wednesday August 10, 2011, Amazon has created copies of the affected snapshots and recommends running recovery tools on them. We do, and it’s not very helpful; our data is still trashed.

Normally, a lightning strike doesn’t lead to a power outage in a data center. Normally, we would have booted up the system from backups (snapshots) first thing Monday morning. Instead, by the time Thursday morning arrives we have spent all of Wednesday night doing data forensics. We were able to recover all critical business data, and we could also validate that we got it all back. However, the system is a mess of bits and pieces, and now the work starts to bring all subsystems back to life. By Thursday evening we have all primary functions started, and even more today, Friday, as I’m writing this blog post.

This is the story of what happened and, need I say, of one of the worst weeks in the history of Mobile Documents. Throughout the week we’ve been trying to keep everyone updated on Twitter. Thanks for all the support. To all of you who use and like Mobile Documents, we’re just very, very sorry. Several days of service outage is unacceptable. If you have read this far, I hope you understand that we’ve done what we could, and that two separate things that both should have been impossible created the disastrous situation that we’ve spent the entire week recovering from.

There are still accounts that are affected. These will be fully recovered and functional early next week. If you have an affected account you will get an email. No email means your account is ok.

Is there a lesson to be learned? – There are two.

Lesson one: Regarding clouds, no data center is safe. You need to keep your live data in at least two geographically separate locations so that you can fail over.

Lesson two: Don’t trust any backup system or the sugar-coated words of suppliers. (We regularly used snapshots and recovered data from them without problems during the beta testing period. Yet when the situation became critical, the backups were simply not there. Bad luck, yes indeed.)

We’ve learnt, and we’re motivated to create an even more resilient system. We’ll be really paranoid when creating the scenarios it must withstand.
