Congrats, you broke production!

Feb 04, 2023

People rarely break applications on purpose. But when it does happen, it’s often the best learning opportunity, so why not make the most of it?

Not all fuckups are born equal

Some fuckups you want to prevent. If you ask your engineers for a list of “software engineering fallacies”, the items on it are good candidates. For example, you might want to prevent anything that causes permanent damage to user data or your company's reputation.

For some things, it is enough to discourage them via linters or guidelines.

But then, there is a big bucket of fuckups that can happen, and as long as they can be resolved quickly, they should happen on a regular basis.

The alternative is an overly restrictive system and/or process.

Effective learning

Learning by doing is well known to be effective. Combine that with a very specific goal to “fix it ASAP” and you have the perfect opportunity to understand how something works within a few hours.

Here is the scenario:

I change something
it gets released
something is very broken, seems to be related to what I did
I start checking, debug the hell out of it, trying to understand how it works
I find a solution and release it
3 months later … someone has a similar issue
git blame / someone points them to me
they learn what I know while we fix it together
now we have 2 people who have some confidence when working in this area

This happens organically, at least sometimes. The issue isn’t that knowledge isn’t shared. It’s that risky actions are avoided in the first place.

Proficiency vs Tenure vs Productivity

In my experience, companies are focused on proficiency and productivity, i.e. do you know what to do, and do you know how to do it within our systems?

While building your onboarding strategy around these goals is effective in the short run, there is a key element missing: tenure. Think of the CTO who happens to know the solution to all the most painful problems because, guess what, he’s dealt with them before.

Knowing how something works and why it was built this way, in the context of the team’s and product’s evolution, is incredibly valuable. This is true when solving urgent issues but also when planning some of those fun mega refactors.

In a growing team, how do you speed up “spreading tenure”?

Encouraging Failure

The more complicated and brittle the system, the more people are scared to change it. Multiply that by how much of a pain making releases is and you’ve got a whole team that is overly conservative and doesn’t really know how anything works in practice.

On one side, of course you should make it easy to do releases. There are countless books and talks on CD / DevOps / Agile / whatever that attempt to help you do that, or at least sell their consulting services on the matter.

But beyond that, failures need to be normalized. If people see that fuckups are normal and can be resolved, together, quickly, then why would they be too frightened to make a change?

It’s like climbing. Once you have experienced that the rope will hold and you know how not to smash into the wall, there shouldn’t be a crippling fear in you about climbing back up, right?

Engineering teams need badges for fuckups! For example:

I broke the staging DB
I broke the production DB
I broke an integration
I blocked “master”
I fixed a blocker bug
I reverted my change
I fixed something before anyone noticed
I messed up deployments somehow
I deployed a hotfix
I messed up the lockfile
I messed up the release workflow
I …… insert something that’s painful to do in your system or process
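
If you want to actually keep track of these, it doesn't take much. Here is a minimal sketch in Python, assuming badges are just recorded by hand after an incident. The badge ids, the BadgeBoard class, and its method names are all made up for illustration; only the badge texts come from the list above.

```python
from collections import defaultdict
from datetime import date

# Hypothetical badge ids mapped to a few of the badge texts listed above.
BADGES = {
    "broke-staging-db": "I broke the staging DB",
    "broke-production-db": "I broke the production DB",
    "blocked-master": "I blocked “master”",
    "reverted-change": "I reverted my change",
    "deployed-hotfix": "I deployed a hotfix",
    "messed-up-lockfile": "I messed up the lockfile",
}


class BadgeBoard:
    """Records which engineer earned which badge, and when (illustrative only)."""

    def __init__(self):
        # engineer name -> {badge id: date the badge was first earned}
        self.earned = defaultdict(dict)

    def award(self, engineer, badge_id, when):
        if badge_id not in BADGES:
            raise KeyError(f"unknown badge: {badge_id}")
        # Keep the first date; breaking the same thing twice earns no new badge.
        self.earned[engineer].setdefault(badge_id, when)

    def missing(self, engineer):
        """Badges this engineer has not earned yet: the long term checklist."""
        have = self.earned.get(engineer, {})
        return [badge_id for badge_id in BADGES if badge_id not in have]

    def holders(self, badge_id):
        """Who has been through this before, i.e. who to ask for help."""
        return sorted(name for name, got in self.earned.items() if badge_id in got)


board = BadgeBoard()
board.award("alice", "broke-staging-db", date(2023, 2, 4))
board.award("bob", "messed-up-lockfile", date(2023, 2, 10))

print(board.holders("broke-staging-db"))
print(board.missing("alice"))
```

The holders lookup is the useful part: it answers the same question as git blame in the scenario above, namely who has already dealt with this kind of mess.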

For companies, having this as part of an engineering culture should be a no-brainer:

super useful for performance reviews
a long term onboarding checklist
a team that's not afraid to break things
...
profit

In summary, the real MVPs on your team are the ones who keep breaking things, because they know the system better than anyone else. Why not encourage exactly that mindset with a hint of gamification and pageantry?

Winston’s Substack
