This summer I was initiated into the arcane art of zero downtime deployment. A vendor, for better or worse, mandated that application releases be deployed on their platform without sacrificing the service’s availability. Achieving no downtime for a stateless application is more or less straightforward: take down the application server, upgrade the application, and restart the server; repeat per replica. Bringing a database into the picture complicates the process because alterations to the database schema or data must not break the older application, which remains live until the upgrade is over (assuming the upgrade isn’t rolled back). As if matters weren’t complicated enough, a subtle correctness requirement to keep in mind is that data must be consistent relative to the version of the application reading or writing it. For example, if app V1 writes x to column A and app V2 starts writing x′ to column B, superseding column A, we expect app V1’s user to see x and app V2’s user to see x′ (rather than x) while both versions of the application run side by side, given that, logically, column A and column B represent the same field.
I was on uncharted terrain here and, after researching various strategies, I came across Marcin’s zero downtime deployment for databases article. Marcin explains how to achieve zero downtime for a database schema upgrade by stepping through a sequence of n application deployments, where the first n - 1 deployments migrate the database with Flyway in a backwards-compatible way. In the final deployment, assuming the previous deployments went well, the database is changed to its final state such that (1) any unmigrated data that slipped in during the previous deployment (i.e., n - 1) is migrated and (2) the schema is no longer backwards compatible.
What struck me about the above approach is the relatively long deployment procedure ops have to follow. Minimally, ops are required to make three deployments multiplied by the number of replicas. Moreover, each deployment requires releasing a different version of the application, which means bumping the application version number, tagging the released source code, building it, and so on. Logistics aside, this strategy necessitates temporarily relaxing NOT NULL constraints during the intermediate deployments, since app V1 doesn’t know about the new columns, opening the door to tainted data (naturally, one can implement a less safe form of null check in the application itself).
Given the mentioned drawbacks, I decided to radically alter the approach. Instead of going through a long sequence of deployments, I shifted the strategy to database insert/update triggers. All the popular databases support triggers, and the vendor’s database, PostgreSQL, was no exception. Consider the following table, taken from the article’s example and adapted to PostgreSQL, to which the surname column will be added:
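Something along these lines — the article’s example names first_name and last_name columns; the exact column types and the id definition here are my assumptions:

```sql
-- The PERSON table before the migration (types are assumptions;
-- only the column names come from the article's example).
CREATE TABLE person (
    id         BIGSERIAL PRIMARY KEY,
    first_name VARCHAR(255),
    last_name  VARCHAR(255)
);
```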
The migration script adding the surname column to the PERSON table is like the article’s but with one crucial difference:
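A sketch of what that migration could look like — the UPDATE backfill is an assumption of mine, needed so pre-existing rows satisfy the constraint before it is enabled:

```sql
-- Add the new column, backfill it from last_name for existing rows,
-- then lock it down with NOT NULL.
ALTER TABLE person ADD COLUMN surname VARCHAR(255);
UPDATE person SET surname = last_name;
ALTER TABLE person ALTER COLUMN surname SET NOT NULL;
```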
Can you spot it? I’ve made surname NOT NULL, guaranteeing a value is always present. The subsequent BEFORE INSERT OR UPDATE trigger copies last_name to surname, preventing the NOT NULL constraint from being violated when app V1 inserts a record into PERSON:
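Roughly like this (the function and trigger names are my own; the mechanics are standard PL/pgSQL):

```sql
-- Trigger function: if app V1 didn't supply a surname,
-- copy it from last_name before the row is written.
CREATE FUNCTION sync_surname() RETURNS trigger AS $$
BEGIN
    IF NEW.surname IS NULL THEN
        NEW.surname := NEW.last_name;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER person_sync_surname
    BEFORE INSERT OR UPDATE ON person
    FOR EACH ROW
    EXECUTE PROCEDURE sync_surname();
```

The trigger must be a row-level BEFORE trigger: PostgreSQL evaluates NOT NULL constraints after BEFORE triggers have run, so the copied value is in place by the time the constraint is checked.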
With the trigger in place, I only have to deploy the application once per replica because the heavy lifting is performed by the system-wide trigger rather than by the application. In the next release, the trigger and the obsolete column will be dropped from the database.
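Assuming the trigger function were named sync_surname, that cleanup migration for the next release might amount to:

```sql
-- Once every replica runs app V2, the compatibility shim can go away.
-- (Trigger and function names are hypothetical.)
DROP TRIGGER IF EXISTS person_sync_surname ON person;
DROP FUNCTION IF EXISTS sync_surname();
ALTER TABLE person DROP COLUMN last_name;
```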
Note that if the database we were on wasn’t PostgreSQL, it might be possible for data to be inserted or updated without the trigger firing. Indeed, the NOT NULL constraint on the surname column could be violated by app V1 if an insert were attempted between the time surname is added and the time the trigger is created, because many databases don’t support transactional DDL. This isn’t the case for PostgreSQL, where the DML and DDL run in a single transaction.
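Concretely, the whole migration — column, backfill, and trigger — can be wrapped in one transaction, so app V1 never observes the NOT NULL column without the trigger in place (Flyway, for its part, runs each migration in a transaction on databases that support transactional DDL; the names below are from my earlier sketch):

```sql
BEGIN;

-- Schema change and DML together...
ALTER TABLE person ADD COLUMN surname VARCHAR(255);
UPDATE person SET surname = last_name;
ALTER TABLE person ALTER COLUMN surname SET NOT NULL;

-- ...and the trigger, all committing atomically.
CREATE FUNCTION sync_surname() RETURNS trigger AS $$
BEGIN
    IF NEW.surname IS NULL THEN
        NEW.surname := NEW.last_name;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER person_sync_surname
    BEFORE INSERT OR UPDATE ON person
    FOR EACH ROW
    EXECUTE PROCEDURE sync_surname();

COMMIT;
```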