Vishwak Solutions has been one of the early adopters of the cloud for the hosting needs of its customers around the world. For one of our clients, we have hosted their Video Portal on Amazon AWS infrastructure in the Amazon East Coast data center (US-East-1) for nearly a year now.
Amazon AWS is designed around multiple Availability Zones, so that a single failure is extremely unlikely to take out several zones at once. Even so, whatever happened on 21st April started causing disruptions on the EC2 instances we use for this Video Portal. There were 5 instances running for this project, and initially one of them was impacted. That instance hosted the core end-user website (the presentation layer). Because of this the Video Portal was down, and there was no way for consumers to reach it. Due to budget constraints there was no fall-back (load-balanced) instance for this project.
Since this was the first time something like this had happened in AWS, we didn’t have a process to follow. We had to think on our feet, and below are the steps we tried to get the instance running again.
1. We were able to access all the other instances, but not the affected one. We could ping it but could not connect to it via RDP. We tried to reboot it through the EC2 API, but had no luck.
2. We then tried to stop and start the instance, again with no luck. The instance remained in the “Stopping” status for more than 10 hours.
3. As the stop did not take effect, we decided to create a new instance and configure the websites again, which would take a couple of hours to get the site back.
4. If we could get a new instance created, we could restore it from backup. We back up every EC2 instance once a week as an AMI. We decided to go this route and launch the new instance from the AMI backup.
5. In parallel, our testers kept checking the AWS status page and subscribed to its RSS feed to get the latest information about the outage. The console still continued to be non-responsive.
6. After several hours we managed to get a new instance created from the AMI backup. We disassociated the Elastic IP from the affected EC2 instance and then associated the same Elastic IP with the new instance.
7. Finally we got the site up and running. The time taken was 16 hours. Unfortunately, the Video Portal was down during that entire period.
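Steps 6 and 7 can be scripted ahead of time so that a restore does not depend on the web console. The following is a minimal sketch under stated assumptions, not a record of the exact commands used during the outage: it assumes a boto3-style EC2 client (for example one created with `boto3.client("ec2", region_name="us-east-1")`) and a VPC-style Elastic IP that has an allocation ID. The function name and all arguments are hypothetical placeholders.

```python
def restore_from_ami(ec2, ami_id, instance_type, elastic_ip):
    """Launch a replacement instance from a backup AMI and move the Elastic IP to it.

    `ec2` is assumed to be a boto3-style EC2 client.
    """
    # Launch exactly one new instance from the backup AMI.
    reservation = ec2.run_instances(
        ImageId=ami_id, InstanceType=instance_type, MinCount=1, MaxCount=1
    )
    new_id = reservation["Instances"][0]["InstanceId"]

    # Block until the new instance reports the "running" state.
    ec2.get_waiter("instance_running").wait(InstanceIds=[new_id])

    # Look up the Elastic IP and detach it from the affected instance, if attached.
    address = ec2.describe_addresses(PublicIps=[elastic_ip])["Addresses"][0]
    if "AssociationId" in address:
        ec2.disassociate_address(AssociationId=address["AssociationId"])

    # Point the same Elastic IP at the replacement instance.
    ec2.associate_address(InstanceId=new_id, AllocationId=address["AllocationId"])
    return new_id
```

Because the Elastic IP is moved rather than replaced, DNS records that point at it do not need to change once the new instance is serving traffic.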
This unfortunate event forced our team to test, in a live situation, restoring AWS instances from backups. It was a rare opportunity to validate the backup procedures we follow at Vishwak.
After this incident we convinced the customer to go for another (load-balanced) instance, so that there will be no single point of failure.
We have also deployed, configured and tested the entire production setup in the Amazon Singapore data center. AMI backups of all the instances have been taken in the Singapore data center as well. So the next time anything like this (god forbid) happens, we will be able to restore access to a running instance much sooner (in minutes and not hours).
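Taking the AMI backups in the second region can be automated as well. The sketch below is illustrative only: it assumes a boto3-style EC2 client bound to the Singapore region (for example `boto3.client("ec2", region_name="ap-southeast-1")`), and the function name, the AMI naming scheme, and the choice of `NoReboot=True` are assumptions rather than our documented procedure.

```python
from datetime import date


def backup_instances(ec2, instance_ids, day):
    """Create one AMI per instance and return a mapping of instance ID to AMI ID.

    `ec2` is assumed to be a boto3-style EC2 client for the region
    in which the instances run.
    """
    images = {}
    for instance_id in instance_ids:
        response = ec2.create_image(
            InstanceId=instance_id,
            # Hypothetical naming scheme: instance ID plus the backup date.
            Name=f"backup-{instance_id}-{day.isoformat()}",
            # Assumption: skip the reboot to avoid downtime; file-system
            # consistency of the image is then not guaranteed.
            NoReboot=True,
        )
        images[instance_id] = response["ImageId"]
    return images
```

A weekly scheduled job could call `backup_instances(ec2, ids, date.today())` and store the returned mapping, so that the most recent AMI ID for each instance is at hand when a restore is needed.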