Analysis function with reduced availability
Incident Report for TrustSource
Postmortem

Please find below our analysis of what happened:

WHAT HAPPENED?

During a regular deployment, a Lambda function was to be updated. While the function was being updated via the automated deployment procedure, the operation took unexpectedly long. As a result, the removal of the old resources was not completed, so the new version could not be deployed.

Our DevOps team had never experienced such long deployment times before. Durations of 5-10 minutes were known, but since the service did not seem to return from the “deleting” state even after 20 minutes, the team grew concerned. A second deployment attempt failed because the old resources had still not been removed by then.

WHAT WAS DONE TO RETURN TO NORMAL?

The team therefore decided to initiate another deployment under new naming conventions. For this to take effect, some routings and access policies also had to be modified accordingly. The required actions were identified, compiled into a new deployment plan, and executed via a new deployment script. In parallel, our provider was contacted for further support.

While these actions were being planned and executed, the original deployment finally returned the state “delete completed”.
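For illustration: assuming the deployment is managed through AWS CloudFormation stacks (as is the case with the Serverless Framework referenced in [1] below), a deployment script can verify that the old stack is really gone before retrying under the same name. The following is a minimal sketch only; the boto3 calls and the stack name are illustrative assumptions, not our actual deployment code.

    # Minimal sketch: verify that the previous stack has finished deleting
    # before starting a new deployment. Assumes AWS CloudFormation (as used
    # by the Serverless Framework) and boto3; the stack name is a placeholder.
    import boto3
    from botocore.exceptions import ClientError

    def old_stack_gone(stack_name: str) -> bool:
        cf = boto3.client("cloudformation")
        try:
            stack = cf.describe_stacks(StackName=stack_name)["Stacks"][0]
        except ClientError as err:
            # A "does not exist" error means the stack has been fully removed.
            if "does not exist" in str(err):
                return True
            raise
        # DELETE_IN_PROGRESS means the removal is still running; a second
        # deployment under the same name would fail, as it did here.
        return stack["StackStatus"] == "DELETE_COMPLETE"

    # Example: only redeploy once the old stack has really disappeared.
    # if old_stack_gone("analysis-service-prod"):  # placeholder name
    #     redeploy()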

WHAT WAS THE IMPACT AND HOW TO RESOLVE?

Scans uploaded between 2021-09-21T19:09 UTC and 2021-09-21T20:07 UTC (CEST = UTC+2) will not have received an analysis. Our team will review the API logs and contact customers who uploaded scans during that period. To re-scan:

  1. Go to INBOUND / SCANS
  2. In the list of scans, open those that entered the platform in the above-mentioned time window
  3. In the scan view, press the “Re-Process” button.

This will restart the processing of the scan as if it had just been transferred.

WHAT WAS DONE TO PREVENT THIS IN THE FUTURE?

Our discussions with our provider’s support team concluded that we have to accept that the removal of resources may take up to 40 minutes in specific constellations that are outside our scope or even our awareness. This has been confirmed as a known issue and is expected to be improved in the future [1]. Meanwhile, we will have to review our deployment procedures so that they can handle such cases.

[1] https://github.com/serverless/serverless/issues/5008
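To illustrate what handling such cases could mean in a deployment script: instead of assuming the usual 5-10 minutes, the script can wait for the removal with a much more generous timeout. The sketch below again assumes AWS CloudFormation via boto3; the stack name and the timings are placeholders, not our actual configuration.

    # Minimal sketch: wait for a stack removal with a timeout large enough
    # to cover removals of up to ~40 minutes. Assumes AWS CloudFormation via
    # boto3; stack name and timings are illustrative placeholders.
    import boto3

    def wait_for_stack_deletion(stack_name: str, timeout_minutes: int = 45) -> None:
        cf = boto3.client("cloudformation")
        waiter = cf.get_waiter("stack_delete_complete")
        waiter.wait(
            StackName=stack_name,
            WaiterConfig={
                "Delay": 30,                        # poll every 30 seconds
                "MaxAttempts": timeout_minutes * 2  # 2 polls per minute
            },
        )

    # Example: block the pipeline until the old stack is gone, then deploy.
    # wait_for_stack_deletion("analysis-service-prod")  # placeholder name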

Posted Sep 22, 2021 - 11:08 CEST

Resolved
We identified a workaround and re-deployed the services with new routings. This allowed us to restore complete functionality. However, this is not what is expected from infrastructure as code, and we will review the matter together with our service provider to prevent such surprises in the future.
Posted Sep 21, 2021 - 22:09 CEST
Identified
During a regular update, the infrastructure management mechanism ran into an undefined state, preventing further changes to the affected artefact.
Posted Sep 21, 2021 - 21:48 CEST
This incident affected: TrustSource Services (Core Service).