Starting on September 7th, users began to complain about overall system performance. Support checked the typical performance indicators but did not see any anomalies. However, a manual check of the app's performance confirmed the users' observations, so 1st level decided to involve 2nd level.
Further analysis showed that most analysis tasks were not terminating correctly and that their duration was far beyond typical values. Since an update of the analysis function had been deployed 48 hours earlier, the investigation initially went in that direction. Unfortunately, this did not lead to any results.
Two hours later the team met in an escalation meeting and evaluated all parameters once more. During this evaluation it turned out that MongoDB had run out of read tickets. The infrastructure was upgraded and the system returned to its normal state within minutes.
The overall responsiveness of the system was strongly reduced, in some cases to the point of unacceptable waiting times. Some analyses were not completed and therefore not stored. We advise users who run only weekly builds or scans to rescan any solutions that were last scanned on the 6th or 7th of September. In standard CI/CD systems with daily or even commit-based builds/scans, the next scan is typically just a few minutes away, so there should not be any impact.
DeepScan scans were not impacted by this event.
A sudden increase in analysis requests, with more than 50 analyses per minute over a longer period, exhausted the available read tickets within Mongo. This is a known limitation in Mongo that had already caused trouble a few months ago. To catch it early, an alarm had been developed and deployed. This alarm did in fact trigger, but it was routed to a recently abandoned communication channel. As a result, the alarm never reached support, and support did not suspect read tickets as the cause. Otherwise the issue could have been resolved within minutes.
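For illustration, a check of this kind can be sketched as follows. This is not the alarm we actually run; it is a minimal sketch that assumes the monitored MongoDB version reports read tickets under wiredTiger.concurrentTransactions in its serverStatus output (newer releases may report them elsewhere). The function name, the threshold of 16 and the sample document are hypothetical.

```python
from typing import Any, Mapping


def read_tickets_low(server_status: Mapping[str, Any], min_available: int = 16) -> bool:
    """Return True when the available read tickets drop below min_available.

    server_status is expected to be shaped like the output of the
    serverStatus command. The threshold is an assumed value, not a
    recommendation.
    """
    read = server_status["wiredTiger"]["concurrentTransactions"]["read"]
    return read["available"] < min_available


# Hand-written sample shaped like the relevant part of serverStatus,
# showing a server whose read tickets are all in use.
sample = {
    "wiredTiger": {
        "concurrentTransactions": {
            "read": {"out": 128, "available": 0, "totalTickets": 128}
        }
    }
}

if read_tickets_low(sample):
    print("ALERT: MongoDB read tickets nearly exhausted")
```

A check like this only helps if its notification reaches a channel that is actually monitored, which is exactly what failed in this incident.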
However, it is unusual for Mongo to reach this state. This should not happen with only a few analyses running in parallel. So what led to this situation?
As usual when something goes wrong, there are lessons to be learned:
A) We will add more chaos-testing elements to our regular testing cycles to verify and improve the behaviour under failure conditions. It is not enough to have a safety net if it is not properly spread out and solidly anchored.
B) We will intensify our efforts to remove the known limitations in our architecture to prevent such scenarios. In the past we already distributed requests by removing Meteor from many parts and functions of the application. Meteor is still not capable of supporting readPreferenceSecondary. Therefore a scale-out of the application immediately increases the pressure on the primary server. During the last 24 hours a new version of the analysis function was deployed that removes all dependencies of this kind.
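To illustrate what reading from secondaries means for components that no longer depend on Meteor, the sketch below adds the standard readPreference connection string option to a MongoDB URI. It is only an illustration: the host names, database name and replica set name are hypothetical, and the choice of secondaryPreferred (rather than secondary) is an assumption, not a description of our deployment.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_secondary_reads(uri: str) -> str:
    """Return the MongoDB URI with readPreference set to secondaryPreferred.

    Existing connection string options are kept; only readPreference is
    added or overwritten.
    """
    parts = urlsplit(uri)
    options = dict(parse_qsl(parts.query))
    options["readPreference"] = "secondaryPreferred"
    return urlunsplit(parts._replace(query=urlencode(options)))


# Hypothetical replica set with two members.
uri = with_secondary_reads(
    "mongodb://db1.example.internal,db2.example.internal/analysis?replicaSet=rs0"
)
print(uri)
```

With such a setting, read queries from additional application instances can be served by secondary members instead of adding load to the primary. Reads from secondaries may return slightly stale data, so this only suits queries that tolerate it.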