Analysis failing

Incident Report for TrustSource

Postmortem

WHAT HAS HAPPENED?

Beginning on September 7th, users started to complain about overall system performance. Support checked the typical performance indicators but did not see any anomalies. However, checking the app performance manually confirmed the users' observations, so 1st level support decided to involve 2nd level.

Further analysis revealed that most analysis tasks were not terminating correctly and that their duration was far beyond typical values. Since the analysis function had been updated 48 hours earlier, the investigation initially went in that direction. Unfortunately, this did not lead to any results.

Two hours later the team met in an escalation meeting and evaluated all parameters once more. This evaluation showed that MongoDB had run out of read tickets. The infrastructure was upgraded and the system returned to its normal state within minutes.
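
As an illustration of the read-ticket check mentioned above, here is a minimal sketch. It assumes a MongoDB version whose `serverStatus` output exposes the `wiredTiger.concurrentTransactions.read` counters (`out`, `available`, `totalTickets`); other releases may report these values elsewhere. The function names, the 90% threshold, and the sample document are hypothetical and not taken from the TrustSource code base.

```python
from typing import Any, Mapping


def read_ticket_usage(server_status: Mapping[str, Any]) -> float:
    """Return the fraction of read tickets currently in use (0.0 to 1.0)."""
    read = server_status["wiredTiger"]["concurrentTransactions"]["read"]
    total = read["totalTickets"]
    if total <= 0:
        raise ValueError("totalTickets must be positive")
    return read["out"] / total


def tickets_exhausted(server_status: Mapping[str, Any], threshold: float = 0.9) -> bool:
    """True when ticket usage reaches the (hypothetical) alert threshold."""
    return read_ticket_usage(server_status) >= threshold


# Illustrative documents shaped like the relevant part of serverStatus output.
# The numbers are made up for the example, not measured during the incident.
exhausted_sample = {
    "wiredTiger": {
        "concurrentTransactions": {
            "read": {"out": 128, "available": 0, "totalTickets": 128}
        }
    }
}
healthy_sample = {
    "wiredTiger": {
        "concurrentTransactions": {
            "read": {"out": 4, "available": 124, "totalTickets": 128}
        }
    }
}
```

In a live setup the document would come from the database's `serverStatus` command rather than from a literal; the check itself stays the same.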

WHAT WAS THE IMPACT?

The overall responsiveness of the system was strongly reduced, in some cases to the point of unacceptable waiting times. Some analyses were not completed and therefore not stored. We advise users who run only weekly builds or scans to rescan any solutions that were last scanned on the 6th or 7th of September. In standard CI/CD systems with daily or even commit-based builds/scans, the next scan is typically just a few minutes away, so there should not be any impact.
DeepScan scans were not impacted by this event.

WHAT HAS CAUSED THE ISSUE?

A sudden increase in analysis requests, with more than 50 analyses per minute over a longer period, exhausted the number of available read tickets within MongoDB. This is a known limitation in MongoDB that already caused some trouble a few months ago. To prevent a recurrence, an alarm had been developed and deployed. This alarm did in fact trigger, but it was routed to a recently abandoned communication channel. As a result, the alarm did not reach support, and support did not suspect this to be the issue. Otherwise it could have been resolved within minutes.
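
One way to guard against an alarm ending up in an unread channel is to require an acknowledgement and fall back to the next channel when none arrives. The sketch below is purely illustrative: the `Channel` type, the channel names, and the convention that `send` returns `True` only after a recipient has acknowledged are assumptions, not a description of the alerting setup actually in use.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Channel:
    name: str
    # Hypothetical contract: returns True only when a recipient acknowledged.
    send: Callable[[str], bool]


def deliver_alert(message: str, channels: List[Channel]) -> str:
    """Try channels in order and return the name of the first that acknowledges."""
    for channel in channels:
        try:
            if channel.send(message):
                return channel.name
        except Exception:
            # A broken channel must not stop the escalation chain.
            continue
    raise RuntimeError("alert could not be delivered on any channel")


# Illustrative channels: the first one is no longer read by anyone.
abandoned = Channel("old-chat-room", lambda msg: False)
pager = Channel("on-call-pager", lambda msg: True)

delivered_via = deliver_alert("MongoDB read tickets exhausted", [abandoned, pager])
```

Because a silent channel counts as a failed delivery, the alarm escalates instead of disappearing.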

However, it is unusual for MongoDB to reach this state. This should not happen with only a few analyses running in parallel. So what led to this strange situation?

WHAT ARE THE LEARNINGS?

As usual, when something goes wrong, there are learnings:

A) We will add more chaos elements to our regular testing cycles to verify and improve the behaviour under failure conditions. It is not enough to have a second safety net if it is not properly spanned and solidly anchored.

B) We will reinforce our efforts to remove the known limitations in our architecture to prevent such scenarios. In the past we already distributed requests by removing Meteor from many parts and functions of the application. Meteor is still not capable of supporting readPreferenceSecondary. Therefore, a scale-out of the application immediately increases the pressure on the primary server. During the last 24 hours a new version of the analysis function has been deployed that removes all such dependencies.
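
For components that do not depend on Meteor, reads can be directed to secondaries through the standard `readPreference` connection string option. The following is a minimal sketch of adding that option to a connection string; the helper name, the host names, and the choice of `secondaryPreferred` are assumptions for illustration. Note that reads from secondaries may return slightly stale data.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_read_preference(uri: str, mode: str = "secondaryPreferred") -> str:
    """Return the connection string with its readPreference option set to `mode`."""
    parts = urlsplit(uri)
    # Drop any existing readPreference so the option appears exactly once.
    options = [
        (key, value)
        for key, value in parse_qsl(parts.query)
        if key.lower() != "readpreference"
    ]
    options.append(("readPreference", mode))
    return urlunsplit(parts._replace(query=urlencode(options)))


# Hypothetical replica set; the host names are placeholders.
base_uri = (
    "mongodb://db1.example.internal:27017,db2.example.internal:27017"
    "/trustsource?replicaSet=rs0"
)
analysis_uri = with_read_preference(base_uri)
```

A client created from `analysis_uri` would prefer secondaries for reads and thereby relieve the primary when the application is scaled out.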

Posted Sep 09, 2022 - 13:13 CEST

Resolved

We finally decided to provision additional compute capacity, which resolved the issue. The system is returning to normal.
The KPIs are interesting and will be subject to further investigation. The average CPU load did not rise beyond 25%. Despite this low value, database reads were queuing up, leading to long-running requests that blocked further requests from being processed.
Providing additional capacity broke this vicious circle.
However, further investigation is needed to understand what leads to overload behaviour at low CPU load.
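
One possible way to reason about overload at low CPU load, offered as a sketch and not as a confirmed root cause, is Little's law: the average number of tickets in use equals the read arrival rate multiplied by the time each read holds its ticket. Only the rate of 50 analyses per minute comes from this report. The number of reads per analysis, both latencies, and the pool size of 128 read tickets (a commonly cited default in older MongoDB versions) are assumptions.

```python
def tickets_in_use(reads_per_second: float, seconds_per_read: float) -> float:
    """Little's law: average concurrency = arrival rate * time in system."""
    return reads_per_second * seconds_per_read


ANALYSES_PER_MINUTE = 50   # from the report
READS_PER_ANALYSIS = 200   # assumption, for illustration only
TICKET_POOL = 128          # assumption: commonly cited default read ticket count

reads_per_second = ANALYSES_PER_MINUTE * READS_PER_ANALYSIS / 60

# Assumed latencies: a fast read versus one that spends most of its time waiting.
healthy = tickets_in_use(reads_per_second, seconds_per_read=0.05)
degraded = tickets_in_use(reads_per_second, seconds_per_read=1.0)

pool_exhausted_when_degraded = degraded > TICKET_POOL
```

Under these assumed numbers the pool is comfortable while reads are fast, but it is exhausted once each read holds its ticket for about a second. A read that is waiting rather than computing consumes a ticket without consuming CPU, which would be consistent with queuing at low CPU load.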

Posted Sep 07, 2022 - 18:14 CEST

Update

We can confirm that the performance impact observed on some instances is caused by the analysis issue mentioned before. It is still not clear why some analyses fail to be written to the DB while others do not. Extended logging has been implemented and is about to be deployed in a canary.
We will keep you updated.

Posted Sep 07, 2022 - 15:06 CEST

Update

We are continuing to investigate this issue.

Posted Sep 07, 2022 - 13:32 CEST

Investigating

We are experiencing issues with our analysis function. The analysis results are currently not being written to the database. We are investigating the issue and will keep you updated.

Posted Sep 07, 2022 - 13:31 CEST

This incident affected: TrustSource Services (Core Service).