Please find below our analysis of what happened:
During a regular deployment, a Lambda function was to be updated. The automated deployment procedure took unexpectedly long: the removal of the old resources did not complete, so the new version could not be deployed.
Our DevOps team had never experienced such long deployment times before. Durations of 5-10 minutes were known, but when the service did not return from the "deleting" state after 20 minutes, the team grew concerned. A second deployment attempt failed because the old resources had still not been removed by then.
The team therefore decided to initiate another deployment under new naming conventions, so that the new resources would not collide with the ones stuck in deletion. To make this take effect, some routings and access policies also had to be modified accordingly. The required actions were identified, compiled into a new deployment plan, and executed via a new deployment script. In parallel, our provider was contacted for further support.
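To illustrate the workaround: redeploying under collision-free names can be as simple as suffixing every resource name, after which routings and policies must reference the new names. The helper and naming scheme below are hypothetical, not our actual conventions:

```python
# Hypothetical sketch: derive collision-free resource names for an
# emergency redeployment while the old resources are stuck in deletion.
def emergency_name(base_name: str, suffix: str = "v2") -> str:
    """Append a suffix so the new resources do not collide with the
    old ones that are still being deleted."""
    return f"{base_name}-{suffix}"

# Routing rules and access policies then have to point at the new name:
function_name = emergency_name("scan-analyzer")  # "scan-analyzer-v2"
```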
While these actions were being planned and executed, the original deployment finally returned to the state "delete completed".
Scans uploaded between 2021-09-21T19:09 UTC and 2021-09-21T20:07 UTC (CET = UTC+2) will not have received an analysis. Our team will review the API logs and contact customers who uploaded Scans during that period. To re-scan:
This will restart processing of the scan as if it had just been transferred.
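For completeness, identifying the affected uploads in the API logs amounts to filtering timestamps into the window above. A minimal sketch, assuming the log entries carry timezone-aware UTC timestamps (the function name is ours, not part of our actual tooling):

```python
from datetime import datetime, timezone

# Affected window (UTC), taken from the incident timeline above.
WINDOW_START = datetime(2021, 9, 21, 19, 9, tzinfo=timezone.utc)
WINDOW_END = datetime(2021, 9, 21, 20, 7, tzinfo=timezone.utc)

def affected(upload_time: datetime) -> bool:
    """Return True if an upload falls into the window without analysis."""
    return WINDOW_START <= upload_time <= WINDOW_END
```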
Our discussions with our provider's support team concluded that we have to accept that the removal of resources may take up to 40 minutes in specific configurations that are outside our control or even our visibility. This has been confirmed as a known issue and is expected to be improved in the future [1]. Meanwhile, we will have to adapt our deployment procedures to handle such cases.
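One concrete consequence is that our deployment tooling must be prepared to poll the deletion state for up to 40 minutes before retrying, instead of assuming the previous 5-10 minute norm. A provider-agnostic sketch of such a wait loop, where `get_state` stands in for whatever status call the deployment tooling actually uses:

```python
import time

def wait_for_deletion(get_state, timeout_s: float = 40 * 60,
                      interval_s: float = 30.0,
                      sleep=time.sleep, clock=time.monotonic) -> bool:
    """Poll get_state() until it reports "delete completed".

    Returns True on completion, False if the timeout elapsed, so the
    caller can escalate instead of blindly retrying the deployment.
    The sleep and clock hooks are injectable for testing.
    """
    deadline = clock() + timeout_s
    while True:
        if get_state() == "delete completed":
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Only after this returns True should a retry of the original deployment be attempted; on False, the run should abort and alert a human rather than pile a second failed deployment on top of the first.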