Tuesday's major Amazon Web Services outage was caused by human error, the retailer has confirmed. The downtime, which affected a number of online services including Apple's, has been traced back to a single incorrectly entered command run during debugging.
The note to customers about the S3 (Simple Storage Service) disruption in the US-East-1 region says the team was debugging an issue that caused the S3 billing system to run slower than expected. One team member executed a command from an "established playbook" to take down a small number of servers used for a subsystem in the billing process, but mistakenly took down more than required.
"Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended," the Amazon note states.
To prevent such a mistake from affecting services as severely again, the tool has been modified to remove capacity more slowly, with added safeguards that will maintain the minimum required capacity level for each subsystem. Other operational tools will also be audited to ensure they have similar checks in place.
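The two safeguards described above, a per-subsystem capacity floor and slower removal, can be illustrated with a minimal Python sketch. This is not Amazon's tool: the `Subsystem` class, the `plan_removal` function, the `max_per_step` parameter, and all numbers are hypothetical, chosen only to show the idea.

```python
from dataclasses import dataclass


class CapacityError(Exception):
    """Raised when a removal request would breach the capacity floor."""


@dataclass
class Subsystem:
    # All fields are hypothetical; the article gives no real figures.
    name: str
    active_servers: int
    min_required: int  # minimum capacity the subsystem must keep


def plan_removal(subsystem: Subsystem, requested: int, max_per_step: int = 2) -> list[int]:
    """Split a removal request into small batches, or refuse it outright.

    The request is rejected if it would leave the subsystem below its
    minimum required capacity. Otherwise it is broken into batches of at
    most `max_per_step` servers so capacity is removed gradually.
    """
    if requested < 0 or max_per_step < 1:
        raise ValueError("requested must be >= 0 and max_per_step >= 1")

    removable = subsystem.active_servers - subsystem.min_required
    if requested > removable:
        raise CapacityError(
            f"{subsystem.name}: asked to remove {requested} servers, "
            f"but only {removable} can go without breaching the floor"
        )

    batches = []
    left = requested
    while left > 0:
        step = min(max_per_step, left)
        batches.append(step)
        left -= step
    return batches
```

With a hypothetical subsystem of 10 servers and a floor of 6, a request to remove 3 servers is split into batches of 2 and 1, while a mistyped request for 5 is refused before anything is taken down.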
"We will do everything we can to learn from this event and use it to improve our availability even further," the note adds.
Apple's existing Reno data center, which handles Siri, FaceTime, and iMessage among other tasks, may grow in the future. Apple is reportedly planning to expand the data center by over 375,000 square feet, at a cost of around $50.7 million.
Source: AppleInsider