Major outage of gateway webservers

Incident Report for Shopware

Postmortem

Incident summary

Due to a change to our logging library and its configuration, our gateway proxy webservers failed to send their log files to our central logging and monitoring infrastructure. As a result, the webserver instances ran out of disk space and the service eventually crashed.

Incident response analysis

Our monitoring did not detect the failure of our gateway servers, and we only caught the issue by manual observation. Monitoring and alerting for this specific case would have informed us much earlier and has since been implemented.
Once the failure was detected, the fix was implemented quickly because we were able to identify the cause immediately.

We confirmed that the issue was mitigated right after deploying the fix: all shop instances were accepting requests again.
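
As an illustration of the kind of check that covers this failure mode, the following is a minimal sketch of a disk-usage monitor. It is not the monitor we deployed. The function names and the 80% / 90% thresholds are assumptions chosen for the example.

```python
import shutil
from typing import Optional


def disk_usage_percent(path: str = ".") -> float:
    """Return the used share of the filesystem containing `path`, in percent."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100.0


def check_disk(
    path: str = ".",
    warn_at: float = 80.0,   # hypothetical warning threshold
    crit_at: float = 90.0,   # hypothetical critical threshold
    usage_percent: Optional[float] = None,
) -> str:
    """Classify disk usage as 'ok', 'warning' or 'critical'.

    `usage_percent` can be passed in directly (e.g. from a metrics agent);
    if omitted, the value is read from the local filesystem.
    """
    pct = disk_usage_percent(path) if usage_percent is None else usage_percent
    if pct >= crit_at:
        return "critical"
    if pct >= warn_at:
        return "warning"
    return "ok"


if __name__ == "__main__":
    print(check_disk("."))
```

A check like this would typically run on every gateway instance at a short interval and page on "critical", so that a slowly filling disk is noticed well before the service stops.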

Post incident analysis

The issue did not occur in our staging environment because it was a slow-growing problem. Request volume and traffic are much lower in staging, so it could have taken weeks before the problem surfaced there.

A code review of the changes was performed but did not prevent this event, because it was not obvious that the change to the logging library could bring down the whole service the way it did.

The major issue in this event was the lack of monitoring and alerting, which was addressed immediately afterwards. Better monitoring and an alarm raised as soon as network traffic dropped would have minimized the impact of this issue.
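
A traffic-drop alarm of the kind described above can be sketched as follows. This is an illustrative example only, not our production alerting rule. The function name, the use of a median baseline and the 50% drop ratio are assumptions made for the sketch.

```python
from statistics import median
from typing import Sequence


def traffic_dropped(
    recent_counts: Sequence[float],
    current_count: float,
    drop_ratio: float = 0.5,     # hypothetical: alarm below 50% of baseline
    min_baseline: float = 1.0,   # ignore periods with no meaningful traffic
) -> bool:
    """Return True if current traffic is well below the recent baseline.

    `recent_counts` holds request counts for previous intervals
    (for example, the last few 5-minute windows); `current_count`
    is the count for the interval being evaluated.
    """
    if not recent_counts:
        return False  # no history yet, nothing to compare against
    baseline = median(recent_counts)
    if baseline < min_baseline:
        return False  # baseline too low for a drop to be meaningful
    return current_count < baseline * drop_ratio


if __name__ == "__main__":
    history = [1000, 1100, 950, 1050]
    print(traffic_dropped(history, 200))  # well below baseline
    print(traffic_dropped(history, 900))  # within normal range
```

Using the median of recent intervals keeps a single unusual window from shifting the baseline. In practice the comparison window and ratio would need tuning against normal daily traffic patterns to avoid false alarms at night.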

Timeline of events

19.11.2020 - 18:35 CET
Network traffic to our gateway servers dropped below normal values (monitoring did not catch the issue)

19.11.2020 - 20:51 CET
We noticed that shop instances were offline / not reachable

19.11.2020 - 21:00 CET
The issue with the logging configuration was identified as the root cause of the outage

19.11.2020 - 21:10 CET
A new binary including a fix for the logging configuration was deployed to production and restored the service

Lessons learned

Our monitoring was not sufficient to catch this failure.
We have already implemented additional monitors to improve our insight and to help mitigate such outages in the future.

Posted Nov 20, 2020 - 15:53 CET

Resolved

There was an issue with our gateway webservers which caused a major outage of all shop instances starting at around 18:35 CET.

The root cause was a misconfiguration in our logging service which led to our webservers filling up their storage. As a result, the instances died.

We identified the misconfiguration at around 21:10 CET and provided a fix for this issue immediately.

Posted Nov 19, 2020 - 21:35 CET