On Feb 2nd we received a vulnerability alert concerning the infrastructure we use for connecting 3rd-party authentication and authorisation providers. This critical vulnerability could allow unauthorised 3rd parties to remotely inject users into previously known accounts. Although a successful attack still requires considerable internal knowledge, e.g. internal IDs that should not be exposed, we accepted this as a serious threat to our overall system integrity.
Within the following two days an upgrade of the component was provided and tested on our DEV services. After the tests completed successfully, the hotfix was scheduled for deployment on PRD, together with another long-awaited update concerning internal certificates.
The update was started and the certificate base renewed, but then the deployment of the patch failed. At that point it was unclear why the setup procedure had failed. Unfortunately, a rollback to the previous version was not possible: the former version expected the old certificates, and the provider-managed infrastructure did not allow reverting to them.
Our DEV team jumped in and built the old version on top of the new certificate stack to restore DB connectivity. After testing this successfully on DEV, the new composition went to PRD… and failed. Apparently the upgrade had changed the database schema and got stuck in a state where the new tables were not all there yet, but some had already been migrated, leaving gaps for the older version.
Finally a backup of the state before the first change was restored and the new old image was successfully applied, returning the system to an operational state.
Users logging in directly to the application were not impacted. Users logging into TrustSource via the IDM service, however, were locked out until the updated old version was activated.
The root cause of this outage was the combination of the two repair actions. Although the two were fairly independent of each other, the change of the certificate store limited the rollback options. This had not been seen when the risks of the two changes were analysed separately.
The analysis of the failed setup procedure showed that the service failed during the update while performing maintenance work, applying Liquibase updates to the internal table and index structures of the IDM. Unfortunately these changes were not transactional, so half of the migration had already gone through when the failure occurred. Parts of the already completed restructuring then prevented restoring the old service.
These failures were caused by an empty DB schema present in the PRD database - most likely from the time before the service became operational - which was not present in the more frequently updated test systems. Although Liquibase clearly directed its requests to the correct schema, something in the chain between the Liquibase executor and the database appeared to ignore case sensitivity, producing the confusion that led to the failure.
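A cheap pre-flight check could have surfaced the stale duplicate: list the schema names of the target database (e.g. from `information_schema.schemata`) and flag any that collide when compared case-insensitively. The function below is our own illustration of that check, not part of any tool we used.

```python
from collections import defaultdict

def find_schema_case_collisions(schema_names):
    """Group schema names that are identical when compared case-insensitively."""
    groups = defaultdict(list)
    for name in schema_names:
        groups[name.lower()].append(name)
    # Only groups with more than one spelling are potential trouble
    return {key: names for key, names in groups.items() if len(names) > 1}

# Hypothetical PRD schema listing containing a stale empty duplicate:
print(find_schema_case_collisions(["IDM", "idm", "public"]))
# → {'idm': ['IDM', 'idm']}
```

Running such a check against PRD before the rollout would have flagged the leftover schema that the test systems never had.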
Reflecting on the events, we found three take-away messages:
A) DO NOT COMBINE DIFFERENT ROLLOUT-TOPICS
Combining unrelated activities bears the risk of unexpected interactions, and it raises unnecessary dust that veils clarity. Due to the number of changes, the OPS team had difficulty identifying the real cause of the failure and ended up barking up the wrong trees. That caused unnecessary outage time.
B) DON’T TAKE MANAGED SERVICES AS A SURE BET
It is convenient to have many things managed. We would not be able to provide what we provide without the whole set of managed services. However, this comfort still bears risks, and technical debt may accumulate here as well. The fact that the change of the certificate stack could not be reverted did not become clear until the need arose! More thorough testing and risk management, even for managed services, would do good.
C) IT IS GOOD TO RUN OPEN SOURCE
We would never have been able to get a fix from any software vendor in such a short time. The ability to look into the sources made it possible to identify the failing position quite fast. Talking to uninvolved people and getting their buy-in would have taken much longer. And it might not even have solved the problem, since it was caused by another issue created on our side ages ago.