This is how not doing DevOps is affecting your everyday work.

You put in long hours and work on weekends, and it affects not just you but everyone who depends on you, including your family and friends.

Jing Xuan Ang
12 min read · May 14, 2022

DevOps is a combination of the Development and Operations departments. Historically, the two have been kept separate. Recently, however, some organizations have found it beneficial either to have a dedicated team oversee both or to have both departments work together.

Author’s note: Last year, I read ‘The Phoenix Project’, ‘The Unicorn Project’ and ‘The DevOps Handbook’. I wrote my reflection in ClickUp Doc here. I consolidated the learning points that I found relatable as a data consultant/engineer. I have rewritten the reflection and posted it here on Medium. 😊

DISCLAIMER: I am just an ordinary user of ClickUp; my intention is to share how I use ClickUp as an individual user. Every ClickUp user has an affiliate link. If you want to save a day every week, sign up here. Cheers!☺

The Enemy is Ourselves

You know, it’s odd. So many of these problems we’ve been facing are caused by decisions we made. We have met the enemy. And he is us.

You mean everything that’s wrong with the Phoenix Project we did to ourselves?

And here it goes, the enemy is ourselves!

The Common Problems

Well, here are the common problems.

Project teams often end up with tight deadlines because of external commitments:

First, you take an urgent date-driven project, where the shipment date cannot be delayed because of external commitments made to Wall Street or customers. Then you add a bunch of developers who use up all the time in the schedule, leaving no time for testing or operations deployment. And because no one is willing to slip the deployment date, everyone after Development has to take outrageous and unacceptable shortcuts to hit the date.

No one has an overview of what is going on:

How can we manage production if we don’t know what the demand, priorities, status of work in process, and resource availability are?

We also have all the calls going into the service desk, whether it’s requests for something new or asking to fix something. But that list will be incomplete, too, because so many people in the business just go to their favorite IT person. All that work is completely off the books.

We bump up the priorities of things all the time, but we never really know what just got bumped down. That is, until someone screams at us, demanding to know why we haven’t delivered something.

Also, the Development and QA environments do not match the Production environment. This causes unforeseen trouble during deployment, and it adds fuel to the problem below, described in The Unicorn Project.

… developers need a system where they can get fast and continual feedback on the quality of their work. If you don’t find problems quickly, you end up finding them months later. By then, the problem is lost in all the other changes that every other development made, so the link between cause and effect disappears without a trace. That’s no way to run any project.

There are too many approval processes that can make the developer less productive. Instead, we need to give developers the trust and rights to push changes to production!

Requests to change requirements happen often. This leaves less time for testing, resulting in poorer-quality work.

As soon as the project is “completed”, most employees are reassigned to another project. As a result, developers are unable to see the long-term consequences of the decisions they make. So, will the developers ever learn the impact of the one shortcut that they took in the early stage?

Due to tight deadlines and not having an overview, standards are being sacrificed.

The problem is that engineers who are building new applications or environments often don’t know that these documents exist, or they don’t have time to implement the document standards. The result is they create their own tools and processes, with all the disappointing outcomes we’d expect: fragile, insecure, and unmaintainable applications and environments that are expensive to run, maintain, and evolve.

Four Categories of Work

Essentially, they are:

  1. Business projects
  2. Internal IT projects
  3. Changes
  4. Unplanned work

Unplanned work is the most destructive work. It causes the IT capacity death spiral. Most likely, it is caused by technical debt (Refer to Technical Debt). Below is a quote from the book on how unplanned work can affect the entire workstream.

When you spend all your time firefighting, there’s little time or energy left for planning. When all you do is react, there’s not enough time to do the hard mental work of figuring out whether you can accept new work. So, more projects are crammed onto the plate, with fewer cycles available to each one, which means more bad multitasking, more escalations from poor code, which means more shortcuts.

Technical Debt

This is the cause of unplanned work. Here are some quotes from the three books.

I’m pretty sure we don’t do any sort of analysis of capacity and demand before we accept work. Which means we’re always scrambling, having to take shortcuts, which means more fragile applications in production. Which means more unplanned work and firefighting in the future. So, around and around we go.

He said, ‘technical debt is what you feel the next time you want to make change’. There are many things that people call technical debt, but it usually refers to things we need to clean up, or where we need to create or restore simplicity, so that we can quickly, confidently, and safely make changes to the system.

Technical debt is a fact of life, like deadlines. Business people understand deadlines, but often are completely oblivious that technical debt even exists. Technical debt is inherently neither good nor bad — it happens because in our daily work, we are always making trade-off decisions.

… technical debt describes how decisions we make lead to problems that get increasingly more difficult to fix over time, continually reducing our available options in the future; even when taken on judiciously, we still incur interest.

This is the technical debt and daily workarounds that we live with constantly, always promising that we’ll fix the mess when we have a little more time. But that time never comes.
… somebody has to compensate for the latest broken promise.
… everybody gets a little busier, work takes a little more time, communications become a little slow, and work queues get a little longer.

TL;DR: We have to really consider eradicating technical debt as part of daily work. Period.

Work Center

Firstly, a work center is made up of four things: the machine, the man, the method, and the measures.

Suppose for the machine, we select the heat treat oven. The men are the two people required to execute the predefined steps, and we obviously will need measures based on the outcomes of executing the steps in the method.

If every work center (or project) needs ONLY person X, then EVERY work center will be stuck. So, it is best to document the steps, or at least keep a list of the resources and time each step needs. Then the work center no longer relies on person X alone, and the bottleneck is removed.

“What I’ve done,” she continues, “is take some of our most frequent service requests, documented exactly what the steps are and what resources can execute them, and timed how long each operation takes. … “

Wait Time

What the graph says is that everyone needs idle time or slack time. If no one has slack time, WIP gets stuck in the system. Or more specifically, stuck in queues, just waiting.

Wait Time graph from ‘The Phoenix Project’.
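The curve in that graph follows the book's rule of thumb that wait time is the percentage of time a resource is busy divided by the percentage of time it is idle. Here is a minimal sketch of that ratio; the function name `relative_wait_time` is my own, chosen only for illustration.

```python
def relative_wait_time(busy_fraction: float) -> float:
    """Wait time in relative units: fraction busy divided by fraction idle."""
    if not 0 <= busy_fraction < 1:
        raise ValueError("busy_fraction must be in [0, 1)")
    idle_fraction = 1 - busy_fraction
    return busy_fraction / idle_fraction


# Show how quickly the wait grows as slack time disappears.
for busy in (0.50, 0.80, 0.90, 0.95):
    print(f"{busy:.0%} busy -> {relative_wait_time(busy):.1f} units of wait")
```

A resource that is busy half the time gives a ratio of 1, while one that is busy 90% of the time gives a ratio of 9. That is why removing the last bit of slack hurts so much more than removing the first.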

Five Ideals of Work

The First Ideal — Locality and Simplicity

… design things so that we have locality in our systems and the organizations that build them. And we need simplicity in everything we do.

The Second Ideal — Focus, Flow, and Joy

It’s all about how our daily work feels. Is our work marked by boredom and waiting for other people to get things done on our behalf? Do we blindly work on small pieces of the whole, only seeing the outcomes of our work during a deployment when everything blows up, leading to firefighting, punishment, and burnout? Or do we work in small batches, ideally single-piece flow, getting fast and continual feedback on our work? These are the conditions that allow for focus and flow, challenge, learning, discovery, mastering our domain, and even joy.

The Third Ideal — Improvement of Daily Work

… Toyota Andon cord teaches us about how we must elevate improvement of daily work over daily work itself.

At Toyota, if a problem cannot be resolved within a certain time limit, the whole production line is stopped and everyone swarms to help with the problem. This is unlike most companies and project teams, which would find ways to work around the problem or schedule the fix for a later date “when the team has more time”.

Encouraging everyone to swarm in to help creates knowledge, rigorous standardization of work procedures, and documentation of the results within the entire project team, and eventually, the whole company. We would then be able to hold blameless post-mortems (Refer to The Fourth Ideal).

Teams are often not able or not willing to improve the processes they operate within. The result is not only that they continue to suffer from their current problems, but their suffering also grows worse over time.
… processes actually degrade over time.

We prioritize the team goals over individual goals — whenever we help someone move their work forward, we help the entire team.

The Fourth Ideal — Psychological Safety

… we make it safe to talk about problems, because solving problems requires prevention, which requires honesty, and honesty requires the absence of fear.

Keeping psychological safety in mind, we can then hold a blameless post-mortem that looks for ways to redesign the system so the same accident does not happen again, instead of looking for human errors.

The Fifth Ideal — Customer Focus

… we ruthlessly question whether something actually matters to our customers, as in, are they willing to pay us for it or is it only of value to our functional silo?

Conclusion on Five Ideals of Work

Without these five ideals, I feel that work leads to feelings of powerlessness.

… we deprive other people of their ability to control their own outcomes and even create a culture where people are afraid to do the right thing because of fear of punishment, failure, or jeopardizing their livelihood. This can create the condition of learned helplessness, where people become unwilling or unable to act in a way that avoids the same problem in the future.
For our employees, it means long hours, working on weekends, and a decreased quality of life, not just for the employee but for everyone who depends on them, including family and friends. It is not surprising that when this occurs we lose our best people (except for those that feel like they can’t leave because of a sense of duty or obligation).

Instead of deploying outside office hours, perhaps we can let developers deploy during the business day without clients noticing, except that they will suddenly see new features or data that amaze them.

For example, I get really excited when I see new features on ClickUp during the day. 😁

In addition,

… by creating fast feedback loops at every step of the process, everyone can immediately see the effects of their actions.
… giving continual assurance that the code and environments operate as designed and are always in a secure and deployable state.

Of course, in the data consulting industry, both the vendor and client need to be willing to allow developers to reload the data and dashboards throughout the day despite the risk of a system crash. But really, how do we do office-hours deployment and ensure that the end-users are still able to access the dashboard (Please refer to First Way, Release Patterns)?

Three Ways of Work

First Way

… enables fast left to right flow of work from Development to Operations to the customer.

Adopt Lean methodology

First, limit work in progress (WIP). However, the problem is that engineers and other employees are usually assigned to multiple projects and have to switch between tasks. This adds time and effort to the value stream and forces us to make many prioritization decisions.

Second, deploy each part of the work the moment it is ready, in small batches. This helps to reduce WIP, shorten lead times, detect errors faster, and reduce rework.
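To make the idea of limiting WIP concrete, here is a minimal sketch of a task board that refuses new work once a limit is reached. The class `WipLimitedBoard` and its methods are hypothetical names of my own, not something taken from the books.

```python
class WipLimitedBoard:
    """A toy task board that caps how many tasks can be in progress."""

    def __init__(self, wip_limit: int):
        self.wip_limit = wip_limit
        self.in_progress = []
        self.done = []

    def start(self, task: str) -> bool:
        """Start a task only if doing so keeps us within the WIP limit."""
        if len(self.in_progress) >= self.wip_limit:
            return False  # finish something first
        self.in_progress.append(task)
        return True

    def finish(self, task: str) -> None:
        """Move a task from in progress to done, freeing one WIP slot."""
        self.in_progress.remove(task)
        self.done.append(task)


board = WipLimitedBoard(wip_limit=2)
board.start("load sales data")
board.start("build dashboard")
print(board.start("fix report"))  # rejected while two tasks are open
board.finish("load sales data")
print(board.start("fix report"))  # accepted once a slot is free
```

The point of the sketch is that the limit forces a choice: a new task can only start after an existing one is finished, so work is pulled through in small batches instead of piling up.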

Release Patterns

Environment Based: Blue-Green Deployment Pattern

Create two identical environments, where green is live and blue is staging. Once the changes have been tested in the blue environment and we are confident in them, blue becomes live and green becomes staging. If we need to roll back, we switch back so that green is live again and blue returns to staging.

Personally, I think that this can potentially resolve the problem of production and development environments being different.
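The switch between the two environments can be sketched as below. This is a toy model under my own assumptions: the class `BlueGreen` and its methods are hypothetical, and a real setup would switch a load balancer or DNS entry rather than a Python attribute.

```python
class BlueGreen:
    """Toy model of two identical environments, one live and one staging."""

    def __init__(self, live_version: str):
        # Green starts as live; blue starts as an empty staging environment.
        self.versions = {"green": live_version, "blue": None}
        self.live = "green"

    @property
    def staging(self) -> str:
        return "blue" if self.live == "green" else "green"

    def deploy_to_staging(self, version: str) -> None:
        """Install a new version on staging without touching live traffic."""
        self.versions[self.staging] = version

    def switch(self) -> None:
        """Promote staging to live; the old live environment becomes staging."""
        self.live = self.staging

    def serve(self) -> str:
        """Return the version that users currently see."""
        return self.versions[self.live]


env = BlueGreen(live_version="v1")
env.deploy_to_staging("v2")  # users still see v1 while v2 is tested
env.switch()                 # blue is now live and serves v2
env.switch()                 # rollback: green is live again with v1
```

Because the old version stays installed on the other environment, rolling back is just a second switch.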

Environment Based: Canary Release Pattern

A1 — Production servers for internal employees

A2 — Production servers for a small percentage of customers after certain acceptance criteria

A3 — Production servers for all customers
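One possible way to read the three groups above is as a routing rule: employees go to A1, a fixed percentage of customers go to A2, and everyone else is served by A3. The sketch below assumes that reading; the function names and the hashing scheme are my own illustration, not the books' implementation.

```python
import hashlib


def canary_bucket(user_id: str) -> int:
    """Map a user to a stable bucket from 0 to 99 using a hash of their ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def pick_ring(user_id: str, is_employee: bool, canary_percent: int) -> str:
    """Choose which group of production servers handles this user."""
    if is_employee:
        return "A1"  # internal employees see the release first
    if canary_bucket(user_id) < canary_percent:
        return "A2"  # small percentage of customers
    return "A3"      # all remaining customers


print(pick_ring("staff-001", is_employee=True, canary_percent=5))
print(pick_ring("customer-42", is_employee=False, canary_percent=5))
```

Hashing the user ID keeps each customer in the same group between requests, and raising `canary_percent` step by step widens the release once the acceptance criteria are met.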

Application Based: Feature Toggles

Feature toggles control which features are visible and available to specific user segments.
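A feature toggle can be as simple as a lookup from feature name to the user segments allowed to see it. This is a minimal sketch; the feature names, segment names, and the `is_enabled` function are made up for illustration.

```python
# Hypothetical toggle table: feature name -> segments that can see it.
FEATURE_SEGMENTS = {
    "new_dashboard": {"internal", "beta"},
    "export_to_pdf": {"internal"},
}


def is_enabled(feature: str, segment: str, toggles=FEATURE_SEGMENTS) -> bool:
    """Return True if the feature is switched on for this user segment."""
    return segment in toggles.get(feature, set())


if is_enabled("new_dashboard", "beta"):
    print("show the new dashboard")
else:
    print("show the old dashboard")
```

The code for the feature is already deployed either way; changing the table is what makes it visible, so a release can be turned on or off without a new deployment.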

Application Based: Dark Launches

Deploy all the functionality into production and then perform testing of that functionality while it remains invisible to all the customers.
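One common way to test functionality while it stays invisible is to run the new code path alongside the current one and only record the differences. The sketch below assumes that approach; `handle_request` and its parameters are hypothetical names of my own.

```python
def handle_request(payload, current_impl, dark_impl, mismatches):
    """Serve the current result while exercising the hidden implementation."""
    visible = current_impl(payload)
    try:
        shadow = dark_impl(payload)
        if shadow != visible:
            mismatches.append((payload, visible, shadow))
    except Exception as exc:
        # A failure in the dark path must never reach the customer.
        mismatches.append((payload, visible, repr(exc)))
    return visible


def old_total(items):
    return sum(items)


def new_total(items):
    return sum(round(x, 2) for x in items)


log = []
print(handle_request([1.5, 2.25], old_total, new_total, log))
print(log)  # stays empty while both implementations agree
```

Customers only ever receive the result of the current implementation, while the mismatch log tells the team whether the new one is ready to be made visible.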

Second Way

… enables the fast and constant flow of feedback from right to left at all stages of our value stream

This ties in with The Third Ideal of having improvements in daily work. Other avenues of feedback include:

Using telemetry

This is like a dashboard showing the health of the system. It is not quite relatable to me as a data consultant because the environments are mostly managed by the clients after we go live. Still, it is an interesting aspect.

Testing

If we leave all the testing to the end,

… we detect the problems later, the problems become more difficult to fix and our customers have worse outcomes, which in turn creates stress across the value stream.

With the waterfall method, we end up seeing the majority of our problems only in production.

Pair programming/ Extreme Programming and Agile

  • One fills the role of the driver (writes the code), and another acts as the observer.
  • One writes the automated test case, another implements the code.

This helps to spread knowledge and catch design defects early.

Third Way

… enables the creation of a generative, high-trust culture that supports a dynamic, disciplined, and scientific approach to experimentation and risk-taking, facilitating the creation of organizational learning, both from our successes and failures.

… leader’s role is to create the conditions so their team can discover greatness in their daily work.

Example: Teaching Thursdays

For two hours, everyone is expected to teach something or learn something. The topics are whatever you want to learn: cross-train in another silo or business unit, take part in our famous in-store training program, spend time in our stores or manufacturing plants, sit with your customer or our helpdesk, learn about Lean principles or practices, learn a new technology or tool, or even how to better manage your career. The most valuable thing you can do is mentor or learn from your peers. And you can expect to see me there too. Learning is for everyone, and it is from there that we will create competitive advantage.

I worked in a bank previously, where I both gave and attended sharing sessions. Personally, I enjoyed them as I got to learn from everyone. ☺️

Order of Priority

All in all, the recommended order of priority is Employee Engagement, Customer Satisfaction, and Cash Flow.

Personal Improvements

Keeping a daily work diary

In it, she tracks everything she’s worked on, how much time she spent on it, any interesting lessons she learned from it, and a list of things to never do again.

Image from ‘The Unicorn Project’

Write, Think, Speak

In order to speak clearly, you need to be able to think clearly. And to think clearly, you usually need to be able to write it clearly.

References:

  1. The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win by Gene Kim, Kevin Behr, George Spafford
  2. The Unicorn Project by Gene Kim
  3. The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations by Gene Kim, Jez Humble, Patrick Debois, John Willis

Thank you for making it this far! Does this sound like something you will use, or something that someone you know needs? What are you waiting for? Share it with them! Follow me on Medium and my YouTube channel to see more content like this. Alternatively, sign up here so that you never miss an update when I release new articles.
