Due to a change in our logging library and its configuration, our gateway proxy webservers failed to send their log files to our central logging and monitoring infrastructure. As a result, the webserver instances ran out of disk space and the service eventually crashed.
Our monitoring did not detect the failure of our gateway servers, and we were lucky to catch the issue manually. Monitoring and alerting for this specific case would have informed us much earlier and has since been implemented.
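As an illustration of the kind of check that can catch a disk filling up, here is a minimal sketch in Python. It is not the monitor we actually deployed: the function names, the monitored path, and the 80 % / 90 % thresholds are assumptions chosen for the example.

```python
import shutil


def disk_usage_percent(path: str) -> float:
    """Return the used share of the filesystem containing `path`, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def classify(percent: float, warn_at: float = 80.0, crit_at: float = 90.0) -> str:
    """Map a usage percentage to an alert level (thresholds are hypothetical)."""
    if percent >= crit_at:
        return "critical"
    if percent >= warn_at:
        return "warning"
    return "ok"


def check_disk(path: str) -> str:
    """Combine measurement and classification for a single path."""
    return classify(disk_usage_percent(path))


if __name__ == "__main__":
    # "." is a placeholder; a real check would point at the log volume.
    print(check_disk("."))
```

A check like this would typically run on a schedule on every gateway instance and page someone on "critical", well before the disk is completely full.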
Once the failure was detected, the fix was implemented quickly because we were able to identify the cause immediately.
We confirmed that the issue was mitigated right after deploying the fix: all shop instances were accepting requests again.
The issue did not occur in our staging environment because it was a slow-growing problem. The number of requests and the traffic are much lower in staging, so it could have taken weeks before the problem affected that environment.
A code review of the changes was performed, but it did not prevent this event because it was not obvious that the change to the logging library could take down the whole service in the way it did.
The major issue in this event was the lack of monitoring and alerting, which was addressed immediately afterwards. Better monitoring and an earlier alert right after the network traffic dropped would have reduced the impact of this issue.
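To make the idea of an alert on dropping network traffic concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not our production alert: the function name `traffic_dropped`, the use of a simple mean over recent samples as the baseline, and the 50 % threshold are all hypothetical.

```python
from statistics import mean
from typing import Sequence


def traffic_dropped(
    baseline: Sequence[float],
    current: float,
    min_ratio: float = 0.5,
) -> bool:
    """Return True if `current` is below `min_ratio` of the baseline average.

    `baseline` holds recent request rates from a period considered normal;
    `min_ratio` is a hypothetical threshold and needs tuning per service.
    """
    if not baseline:
        return False  # no history yet, so nothing to compare against
    expected = mean(baseline)
    if expected <= 0:
        return False  # a baseline of zero cannot drop any further
    return current < expected * min_ratio


if __name__ == "__main__":
    normal_rates = [1000.0, 1100.0, 900.0]  # made-up requests per minute
    print(traffic_dropped(normal_rates, 200.0))
    print(traffic_dropped(normal_rates, 950.0))
```

A fixed ratio against a short baseline is the simplest possible rule. Services with strong daily traffic patterns would likely need a baseline taken from the same time of day to avoid false alarms at night.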
19.11.2020 - 18:35 CET
Network traffic to our gateway servers dropped below normal values (monitoring did not catch the issue)
19.11.2020 - 20:51 CET
We noticed that shop instances were offline / not reachable
19.11.2020 - 21:00 CET
The issue with the logging configuration was identified as the root cause of the outage
19.11.2020 - 21:10 CET
A new binary including a fix for the logging configuration was deployed to production and restored the service
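The durations implied by the timeline above can be worked out with a short Python sketch. The timestamps are taken from the timeline; the variable names are only illustrative, and all times are treated as being in the same time zone (CET).

```python
from datetime import datetime

FMT = "%d.%m.%Y %H:%M"

# Timestamps as listed in the timeline (all CET).
traffic_drop = datetime.strptime("19.11.2020 18:35", FMT)
noticed = datetime.strptime("19.11.2020 20:51", FMT)
root_cause_found = datetime.strptime("19.11.2020 21:00", FMT)
fix_deployed = datetime.strptime("19.11.2020 21:10", FMT)

time_to_detect = noticed - traffic_drop        # drop in traffic until we noticed
time_to_diagnose = root_cause_found - noticed  # noticed until root cause known
time_to_fix = fix_deployed - noticed           # noticed until service restored
total_impact = fix_deployed - traffic_drop     # drop in traffic until restored

if __name__ == "__main__":
    print("time to detect:  ", time_to_detect)
    print("time to diagnose:", time_to_diagnose)
    print("time to fix:     ", time_to_fix)
    print("total impact:    ", total_impact)
```

Detection accounts for 2 hours 16 minutes of the 2 hours 35 minutes between the traffic drop and the restored service, while diagnosis and the fix together took 19 minutes.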
Our monitoring was not sufficient to catch this failure.
We have already implemented additional monitors to improve our insight and to mitigate such outages in the future.