IDM service down
Incident Report for TrustSource
Postmortem

WHAT HAS HAPPENED?

On Feb 2nd we received a vulnerability alert concerning the infrastructure we use to connect 3rd party authentication and authorisation providers. This critical vulnerability has the potential to allow unauthorised 3rd parties to remotely inject users into previously known accounts. Although a successful attack still requires a lot of internal knowledge, e.g. internal IDs that should not be exposed, we accepted this as a serious threat to our overall system integrity.

Over the following two days an upgrade of the component was provided and tested on our DEV services. After the tests completed successfully, the hotfix was scheduled for deployment on PRD, together with another long-awaited update concerning internal certificates.

The update was started and the certificate base renewed, but then the deployment of the patch failed. At that point it was unclear why the setup procedure had failed. Unfortunately, a rollback to the previous version was not possible: the former version expected the old certificates, and the provider-managed infrastructure did not allow us to revert to them.

Our DEV team jumped in and built the old version lifted onto the new certificate stack to restore DB connectivity. After testing this successfully on DEV, the new composition went to PRD… and failed. As it turned out, the upgrade had changed the database schema and got stuck in a state where the new tables were not yet in place, but some had already been migrated, leaving gaps for the older version.

Finally, a backup of the state before the first change was restored and the rebuilt old image was successfully applied, returning the service to an operational state.

WHAT WAS THE IMPACT?

Users logging in directly to the application were not impacted. Until the updated old version was activated, users logging into TrustSource via the IDM service were locked out.

WHAT HAS CAUSED THIS?

The root cause of this outage was the combination of the two repair actions. Although the two were fairly independent of each other, the change of the certificate store limited the rollback options. This was not apparent when analysing the risks of the two changes separately.

The analysis of the failing setup procedure showed that the service failed while applying Liquibase updates to the internal table and index structures of the IDM. Unfortunately these changes were not transactional, so half of the migration had already gone through when the failure occurred, and parts of the already completed restructuring prevented restoring the old service.
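To make this concrete, the sketch below shows the kind of all-or-nothing behaviour we were missing. It is purely illustrative: the table names, the psycopg2 driver and a PostgreSQL-style database with transactional DDL are assumptions, not our actual IDM migration.

```python
# Minimal sketch, assuming PostgreSQL (transactional DDL) and psycopg2.
# Table and column names are illustrative, not the real IDM schema.
import psycopg2

MIGRATION_STEPS = [
    "ALTER TABLE idm_user ADD COLUMN external_id VARCHAR(64)",
    "CREATE INDEX idx_idm_user_external_id ON idm_user (external_id)",
    "CREATE TABLE idm_provider_link (user_id BIGINT, provider VARCHAR(64))",
]

conn = psycopg2.connect("dbname=idm user=idm_service")
try:
    with conn:                      # one transaction for the whole batch
        with conn.cursor() as cur:
            for stmt in MIGRATION_STEPS:
                cur.execute(stmt)   # a failure here rolls back all prior steps
finally:
    conn.close()
```

With a database or migration setup that does not support transactional DDL, the same batch can stop halfway, which is exactly the state we ended up in.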

These failures were caused by an empty DB schema present in the PRD database - most likely dating from the time before the service became operational - which did not exist in the more frequently updated test systems. Although Liquibase clearly directed its requests to the correct schema, something in the chain between the Liquibase executor and the database appeared to ignore case sensitivity and thus produced the confusion that led to the failure.
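A pre-rollout sanity check along these lines might have surfaced the stale schema earlier. Again a hypothetical sketch: the information_schema queries are standard, but the connection string and the assumption that the leftover schema differs only in case are ours.

```python
# Minimal sketch: flag empty schemas and schema names that collide
# case-insensitively (e.g. "IDM" vs "idm"). Connection string is a placeholder.
from collections import defaultdict
import psycopg2

conn = psycopg2.connect("dbname=idm user=readonly")
with conn, conn.cursor() as cur:
    cur.execute("SELECT schema_name FROM information_schema.schemata")
    schemas = [row[0] for row in cur.fetchall()]

    by_folded = defaultdict(list)
    for name in schemas:
        by_folded[name.lower()].append(name)
    for folded, variants in by_folded.items():
        if len(variants) > 1:
            print(f"case-insensitive collision: {variants}")

    for name in schemas:
        cur.execute(
            "SELECT count(*) FROM information_schema.tables WHERE table_schema = %s",
            (name,),
        )
        if cur.fetchone()[0] == 0:
            print(f"empty schema: {name}")
conn.close()
```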

WHAT ARE THE LEARNINGS?

Reflecting on the events, we found three take-away messages:

A) DO NOT COMBINE DIFFERENT ROLLOUT TOPICS

Combining unrelated activities bears the risk of unexpected interactions and kicks up unnecessary dust. The dust veils clarity. Due to the number of changes, the OPS team had difficulties identifying the real cause of the failure and ended up barking up the wrong trees. That caused unnecessary outage time.

B) DON’T TAKE MANAGED SERVICES AS A SURE BET

It is cool to get many things managed. We would not be able to provide what we provide without the whole range of managed services we use. However, this comfort still bears risks. Technical debt may accumulate here as well. The fact that the change of the certificate stack could not be reverted only became clear when we needed to revert it! More thorough testing and risk management, even for managed services, would do good.

C) IT IS GOOD TO RUN OPEN SOURCE

We would never have been able to get a fix like this from any software vendor in such a short time. Thanks to the ability to look into the sources, it was possible to identify the failing spot quite fast. Talking to uninvolved people and getting their buy-in would have taken much longer. And maybe it would not even have solved the problem, since it was caused by another issue created ages ago on our side.

Posted Feb 17, 2022 - 14:03 CET

Resolved
We are happy to confirm the final resolution of the issue!
A sound postmortem will follow.
Sorry for any inconvenience caused.
Posted Feb 05, 2022 - 17:09 CET
Update
To keep you updated:
We have provided a secured version of the service, but during the upgrade we ran into difficulties with the provided upgrade routine on the data side. We are currently analysing the issues. Most likely we will revert to the last known good state. A detailed postmortem will be provided afterwards.
PLEASE NOTE: This only impacts users logging into the service through a third-party authentication mechanism! Users with direct login to the TrustSource service or API usage are not impacted.
Posted Feb 04, 2022 - 08:12 CET
Monitoring
We had to shut down the IDM service to protect system integrity due to a recently announced vulnerability in the infrastructure we are using. There is a patch available which we are currently applying. We expect the service to be available again later today. We will notify you accordingly.
If you are using TrustSource with TrustSource logins or are using API keys for communication with the service, this issue will not impact you. It only impacts users of the Identity Management Service.
Posted Feb 03, 2022 - 15:06 CET
This incident affected: TrustSource Services (Core Service).