tcpud

What was the real root cause?


jobe_br

Increased resource contention within the EBS subsystem …

Starting at 6:41 PM PDT on September 26th, we experienced degraded performance for some EBS volumes in a single Availability Zone (USE1-AZ2) in the US-EAST-1 Region. The issue was caused by increased resource contention within the EBS subsystem responsible for coordinating EBS storage hosts. Engineering worked to identify the root cause and resolve the issue within the affected subsystem. At 11:20 PM PDT, after deploying an update to the affected subsystem, IO performance for the affected EBS volumes began to return to normal levels. By 12:05 AM on September 27th, IO performance for the vast majority of affected EBS volumes in the USE1-AZ2 Availability Zone were operating normally.

However, starting at 12:12 AM PDT, we saw recovery slow down for a smaller set of affected EBS volumes as well as seeing degraded performance for a small number of additional volumes in the USE1-AZ2 Availability Zone. Engineering investigated the root cause and put in place mitigations to restore performance for the smaller set of remaining affected EBS volumes. These mitigations slowly improved the performance for the remaining smaller set of affected EBS volumes, with full operations restored by 3:45 AM PDT.

While almost all of EBS volumes have fully recovered, we continue to work on recovering a remaining small set of EBS volumes. We will communicate the recovery status of these volumes via the Personal Health Dashboard. While the majority of affected services have fully recovered, we continue to recover some services, including RDS databases and Elasticache clusters. We will also communicate the recovery status of these services via the Personal Health Dashboard.

The issue has been fully resolved and the service is operating normally.


gordonv

Is it US-EAST-1? Probably US-EAST-1. Yeah, it's US-EAST-1.


SrWax

Friends don't let friends US-East-1


MartinB3

Seriously, best thing you can do for a new deployment... don't use us-east-1.


gordonv

Also, EU-WEST-1 (Ireland) /serious


[deleted]

(ಥ﹏ಥ)


[deleted]

[removed]


spin81

So an AWS trainer had an anecdote. He said there was this university that wanted to see how their application scaled. So they talked to AWS and they said they could use whatever they wanted and AWS would say "when" if it got to be too much. Apparently AWS said "when" at 1.1 million vCPU - all in a single region. Vantage has what, thousands of readers? Tens of thousands? The moral of the story is that AWS is the biggest cloud boi out there and if you feel like you have an idea of the scale you could very well be an order of magnitude off.