June 10, 2017

Incident Report

Issue Summary

CEI released an update to ProWrite Cloud on Saturday afternoon, June 10, 2017. The update included a new application tier release as well as updates to the database tier. No incidents were reported until Monday, June 12. As customers logged in that morning, it became clear that some metadata was inaccessible to some users. Affected data included location, person (non-welder), and test facility records, as well as associations between elements, such as the company information associated with individual documents. Some ProWrite Cloud users were impacted from June 10 through June 12.

Root Cause

A database script intended to migrate data into the newly released schema failed, and its failure was not detected by CEI staff or by the automated testing protocols in place at the time. The script completed the schema modifications, but it did not transfer certain data into the newly created tables and fields.
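A failure of this kind (schema changed, data not moved) can be caught by a check that runs immediately after the migration and compares source and target row counts. The following is a minimal sketch only, not the script or tooling CEI used: the table names location and location_v2, the helper find_incomplete_copies, and the use of an in-memory SQLite database are all assumptions made for illustration.

```python
import sqlite3


def find_incomplete_copies(conn, table_pairs):
    """Return (source, target, source_rows, target_rows) for each pair whose row counts differ."""
    problems = []
    for source, target in table_pairs:
        # Table names come from a trusted, hard-coded list, never from user input.
        src = conn.execute(f"SELECT COUNT(*) FROM {source}").fetchone()[0]
        dst = conn.execute(f"SELECT COUNT(*) FROM {target}").fetchone()[0]
        if src != dst:
            problems.append((source, target, src, dst))
    return problems


def demo(copy_data):
    """Simulate a migration (hypothetical tables) and report any incomplete copies."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE location (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO location (name) VALUES (?)", [("Shop A",), ("Shop B",)]
    )
    # The schema change succeeds...
    conn.execute("CREATE TABLE location_v2 (id INTEGER PRIMARY KEY, name TEXT)")
    # ...but the data transfer step may be skipped without raising an error.
    if copy_data:
        conn.execute(
            "INSERT INTO location_v2 (id, name) SELECT id, name FROM location"
        )
    return find_incomplete_copies(conn, [("location", "location_v2")])
```

A row-count comparison is a coarse check: it would flag the case described above, where the new tables exist but are empty, but it would not detect rows copied with wrong values. A fuller check might also compare checksums or sample records.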

Resolution and Recovery

We identified the problem at 10:17 a.m. on June 12, after customers reported data inaccessibility. Through mid-afternoon we worked to reintegrate the inaccessible data. As we continued working and understood the depth of the impact, we concluded that it was more responsible to roll back to the previous release and restore all customer databases to their previous states. We began restoring customer databases mid-afternoon and rolled the application tier back once the restoration was complete. Recovery completed at about 1 a.m. Central time on June 13.

Corrective and Preventative Actions

We will implement multiple measures to improve reliability and earn back customer trust, including:

  • Enhanced testing of data migration during schema updates
  • A transition to side-by-side deployments of release candidates. The current proposal is that any release modifying data structures will include a duplicate deployment of the production environment, populated with copies of the production databases, with preview access for all partner customers, who may then assist us in final validation of the release candidate.
    • A secondary benefit of side-by-side deployment will be the ability to "flip a switch" and roll a release back nearly instantaneously.
  • Review of the recovery plan to identify ways to decrease the time required to roll back a release
  • Refactoring of automated tests to remove deficiencies, which is already underway
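
The "flip a switch" rollback in the side-by-side proposal can be pictured as a single pointer that selects which of two fully deployed environments serves production traffic. The sketch below is a simplified illustration under that assumption, not a description of CEI's infrastructure: the Environment and ReleaseSwitch names and the release labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Environment:
    """One complete deployment: application tier plus its own database copies."""
    name: str
    release: str


class ReleaseSwitch:
    """Tracks which of two side-by-side environments serves production traffic."""

    def __init__(self, current: Environment, candidate: Environment):
        self.active = current
        self.standby = candidate

    def flip(self) -> Environment:
        """Swap active and standby. Calling flip() again undoes the swap."""
        self.active, self.standby = self.standby, self.active
        return self.active


# Promote the validated candidate, then roll back with the same operation.
switch = ReleaseSwitch(
    Environment("env-a", "previous-release"),
    Environment("env-b", "release-candidate"),
)
promoted = switch.flip()
rolled_back = switch.flip()
```

In this model a rollback is fast only because the previous environment and its databases are left intact after promotion. Data written to the new environment after the flip would still need to be reconciled, which this sketch does not address.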