Feb 02 2017
 

“After a second notices he ran it on db1 instead of db2”… This sentence (somewhat shortened, to make a fitting title) describes the beginning of a colossally effed up night at GitLab.com.

In response to a spike in system load, which resulted in lag on a replication server, the operator thought that restarting the replication server with a clean slate might be a good idea. So he decided to wipe the replication server’s data directory.

Unfortunately, he entered the command in the wrong window.

I feel his pain. I have made similar mistakes before, albeit on a much smaller scale, and the memories still hurt, years later.

I have to commend GitLab for their exceptional openness about this incident, offering us all a valuable lesson. I note that others also responded positively, offering sympathy, assistance, and useful advice.

I read their post-mortem with great interest. In reaction, I have already implemented something that I should have done years ago: changing the background color of some of the xterm windows that I regularly open to my Linux servers, to distinguish them visually. (Compare their own to-do item: “Create issue to change terminal PS1 format/colours to make it clear whether you’re using production or staging”.)
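
For the curious, here is a rough sketch of how this kind of colour-coding could be scripted. It is an illustration, not necessarily the exact setup I use. It assumes a terminal that honours xterm's OSC 11 ("set background colour") control sequence. The host names are borrowed from the incident quote, while the colour table and the `background_escape` helper are made up for this example.

```python
import socket

# Hypothetical mapping from short host name to background colour.
# The names db1/db2 come from the incident quote; the colours and the
# roles noted in the comments are assumptions for illustration only.
HOST_COLOURS = {
    "db1": "#400000",  # dark red: assumed to be the production primary
    "db2": "#002040",  # dark blue: assumed to be the replica
}
DEFAULT_COLOUR = "#000000"  # plain black for any host not listed above


def background_escape(hostname: str) -> str:
    """Return the xterm OSC 11 sequence that sets the window background."""
    # Accept fully qualified names by looking only at the first label.
    short_name = hostname.split(".")[0]
    colour = HOST_COLOURS.get(short_name, DEFAULT_COLOUR)
    # ESC ] 11 ; <colour> BEL is xterm's control for the background colour.
    return f"\033]11;{colour}\007"


if __name__ == "__main__":
    # Emit the sequence for the machine this script runs on, for example
    # from a shell start-up file on each server.
    print(background_escape(socket.gethostname()), end="")
```

Running something like this at login on each server would tint the window according to the machine on the other end, so a red background is a standing reminder that the next command lands on production.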

Of course similar incidents and near misses also changed my habits over the years. I rarely delete anything these days without making a backup first. I always pause before hitting Enter on a command that is not (easily) reversible. I have multiple backups, and tested procedures for recovery.

Even so… as Forrest Gump says, shit happens. And every little bit helps, especially when we can learn from the valuable lessons of others without having to go through their pain.

Posted at 10:13 am