
Dealing with Inevitable But Unpredictable Events

Published 11/29/2021

Bugs are inevitable, but individually unpredictable. Relying on a single strategy to deal with these events will miss a huge number of them while demanding massive effort. Instead, we should focus on building a balanced, composite approach to dealing with problems like this.

๐Ÿ™ Today's Episode is Brought To you by: LaunchDarkly

This episode is brought to you by LaunchDarkly. LaunchDarkly enables development and operations teams to deploy code at any time, even if a feature isn't ready to be released to users. Innovate faster, deploy fearlessly, and make each release a masterpiece. Get started at LaunchDarkly.com today!

📮 Ask a Question

If you enjoyed this episode and would like me to discuss a question that you have on the show, drop it over at: developertea.com.

📮 Join the Discord

If you want to be a part of a supportive community of engineers (non-engineers welcome!) working to improve their lives and careers, join us on the Developer Tea Discord community by visiting https://developertea.com/discord today!

🧡 Leave a Review

If you're enjoying the show and want to support the content, head over to iTunes and leave a review! It helps other developers discover the show and keeps us focused on what matters to you.

Transcript (Generated by OpenAI Whisper)
What would you do if you knew for a fact that your code was going to produce a certain number of defects? My name is Jonathan Cutrell, and you're listening to Developer Tea. My goal on this show is to help driven developers like you find clarity, perspective, and purpose in their careers.

What would you do if you knew that your code was going to have a certain number of defects, and that there was nothing you could do to avoid those problems occurring? This is a thought experiment, because there is no way for us to know that a certain number of bugs will show up; there's no way to guarantee that. But it is very likely that, given enough usage, your code will have a bug. We know this intuitively: most of the code we write ends up having bugs at some point, whether because of a simple mistake we made, an environmental issue in where the code runs, or a change somewhere in the surrounding ecosystem, like someone else pushing a change to an API that our system consumes. Ultimately, we will have bugs that appear seemingly from nowhere. So if we take this idea that bugs might appear and treat it as a fact, rather than a possibility we try to hedge around, what is the best route we could take?

The inspiration for this line of thought comes from a talk I watched on YouTube called A Crash Course in Building Reliable Software, and the pun is definitely intended there, by Ming Bai. It was shared by a channel called RT Thread; I'm not familiar with the channel beyond this particular talk. In the talk, Ming discusses the reality that there is far more usage by actual users than there is coverage in tests. In other words, in the particular project scenario he had set up, bugs were on the order of 100,000 times more likely to surface with actual users than they were to occur in tests. Of course, those numbers are specific to his project; the principle is what matters: at some scale, your testing and all the other work you do to prevent bugs cannot prevent all of them. So what can you do?

Abstracting for a moment, we should talk about what kind of event a bug is. A bug is an occurrence that is generally undesirable, or was not foreseen for one reason or another, and that causes a disruption. We experience these kinds of events in our regular lives. You may hit traffic on your way to drop your kids off at school, or a team member catches a cold the night before a big release. And just like the bugs in our software, we know that these things will occur. This is something of a paradox, because we can't say for sure when they will happen or what they will be. But if we zoom out and look at the occurrence rate, of colds, say, or of bugs in our software, it's probably relatively reliable when averaged over time. From this broad standpoint, unpredictable individual events become predictable based on some occurrence rate.

Now, this doesn't mean you're going to use the occurrence rate to predict when a given bug will happen. That's not the point. We're not trying to do a prediction exercise; we're trying to do a preparation exercise. Knowing that bugs will occur, and knowing that teammates will get sick: what can we do? How can we prepare?
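To make that zoomed-out view concrete, here is a tiny simulation sketch; the session volume and the 2% per-session defect chance are invented for illustration.

```python
import random

random.seed(42)  # reproducible illustration

DAYS = 365              # a year of production usage
SESSIONS_PER_DAY = 50   # invented usage volume
BUG_PROBABILITY = 0.02  # invented chance any one session surfaces a bug

# Each session independently may surface a bug; count bugs per day.
daily_counts = [
    sum(1 for _ in range(SESSIONS_PER_DAY) if random.random() < BUG_PROBABILITY)
    for _ in range(DAYS)
]

# Individual days are noisy and unpredictable...
print("first ten days:", daily_counts[:10])

# ...but the long-run average hovers near the expected rate:
# 50 sessions/day * 0.02 = about 1 bug per day.
print("average bugs per day:", round(sum(daily_counts) / DAYS, 2))
```

Any individual day might see zero bugs or four, but the average over the year lands close to the expected one bug per day: unpredictable individually, predictable in aggregate.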
We're going to take a quick sponsor break, and then we'll come back and talk about different strategies for dealing with these unpredictable and yet predictable events.

Developer Tea is supported by LaunchDarkly: feature management for the modern enterprise, fundamentally changing how you deliver software. Here's how it works. LaunchDarkly enables development and operations teams to deploy code at any time, even if a feature isn't ready to be released to users. Wrapping code with feature flags gives you the safety to test new features and infrastructure in your production environments without impacting the wrong end users. This happens to be closely related to what we're talking about today. When you're ready to release more widely, update the flag status and the changes are made instantaneously by their real-time streaming architecture. With LaunchDarkly, you innovate faster, deploy fearlessly, and make each release a masterpiece. You can get started for free today at launchdarkly.com. Thanks again to LaunchDarkly for sponsoring today's episode of Developer Tea.

It's quite appropriate that today's sponsor is LaunchDarkly, because this tool, and many other tools we have at our disposal as engineers, are part of the answer to the question: what do we do about the fact that bugs will occur in our software? We should say up front that this is not an argument that tests are bad. In fact, tests very often take you the majority of the way towards where you need to be, assuming you're writing them responsibly. But we're not going to hyper-focus on testing today. Testing is an important tool; it's not the only one.

The approach for dealing with these unexpected events is multi-tiered. As we mentioned in a recent episode about thinking in matrices, this is a place where you can look at bugs, at defects in your code, through the lens of a matrix. Your strategy for dealing with these unpredictable events, whether bugs or getting sick, is going to have to be a composition of approaches. Testing alone is not going to handle it. Neither is reporting in your production environment. All of these things together, a composition approach, is truly the only practical way to deal with this.

If you were to focus only on testing, there is a clear diminishing return. After you write your first ten tests, you're getting a lot of value out of each additional test, for the average code base. But by the time you write your 500th test, that value is diminishing; past some coverage level, each additional test produces less and less value. And if you were to analyze the time necessary to maintain a test suite for a code base like this, it almost becomes its own project. That can be wasteful and inefficient, tests can go stale, and eventually they might give you a false sense of security.

Similarly, imagine you had no tests at all and relied only on some kind of production reporting, or on partial releases where you release to only 10% of your user base as a kind of canary, waiting for people or for tooling to report back the negative events that occur: the bugs, the crashes, or whatever else.
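To illustrate that canary idea, here is a minimal hand-rolled sketch of a percentage rollout behind a feature flag. To be clear, this is not the LaunchDarkly SDK; the flag name, the render functions, and the 10% figure are all hypothetical.

```python
import hashlib

def rollout_enabled(flag_name: str, user_id: str, percentage: int) -> bool:
    # Hash (flag, user) into a stable bucket from 0 to 99 so a given
    # user always sees the same variation for a given flag.
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percentage

# Hypothetical old and new code paths.
def render_old_dashboard(user_id: str) -> str:
    return f"old dashboard for {user_id}"

def render_new_dashboard(user_id: str) -> str:
    return f"new dashboard for {user_id}"

def render_dashboard(user_id: str) -> str:
    # Ship the new path to roughly 10% of users; everyone else keeps
    # the known-good path while production reporting watches for crashes.
    if rollout_enabled("new-dashboard", user_id, percentage=10):
        return render_new_dashboard(user_id)
    return render_old_dashboard(user_id)

print(render_dashboard("user-123"))
```

Because the bucketing is deterministic, a given user sees a consistent experience, and moving the percentage to 100 (or back to 0 during an incident) changes behavior without a redeploy.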
Relying only on that kind of production reporting, though, you may miss something that would have been very simple to catch, something like a typo. Having tests that you run, particularly in some kind of continuous integration setup where you can't release your software to production unless those tests pass, would have caught that. It never would have gone out.

On the other hand, let's imagine that your interface is very complicated, that you have a lot of different ways of displaying many different data fields, and that the tests you have are a kind of UI test suite: you're testing for visibility, making sure that click targets can be clicked or that contrast is correct. In this case, the number of tests you would have to run to simulate all of those environments becomes enormous. The number of different devices in use on the market today, combined with a virtually infinite number of possible viewport sizes (assuming this product runs in a browser), makes trying to test all of it next to impossible. But this is exactly the scenario where some kind of reporting from the production side, whether that's user reports or a session recording system, provides insight that would be much harder to gain in a test suite.

Up until this point, we've talked about techniques and tools for capturing these issues in different ways. But what if you could catch issues before you ever tried to release to production, before the CI pipeline is even hit? This is where you can employ various measures of human quality control. A simple example is code review, which is a type of quality control or quality assurance. Having code review for new code gives you the chance to have somebody who hasn't been looking at that code, a fresh set of eyes, look at it for the first time. It's also possible to bring in someone who has worked in that area of the code base and has some institutional knowledge. But often it is as valuable, or perhaps even more valuable, to have someone with no institutional knowledge at all look at the code. Why? Because fresh eyes have no predetermined assumptions about how that code should look. If something seems confusing to them that wouldn't seem confusing to someone who had seen the pattern before, they might be the first person to bring it up in code review. These are decisions that might otherwise sit in our code bases for years before anyone thinks to question them. In these scenarios, not only are we able to catch bugs a little earlier, we're also able to think about the systems that produce bugs: systems that produce complexity, and code that is hard to reason about for newcomers or even for people who have worked with it for a long time.
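Circling back to that production-side reporting for a moment, here is a minimal sketch of the idea, assuming a made-up request shape and a hypothetical handler; a real system would ship this payload to an error-tracking service rather than a logger.

```python
import json
import logging
import traceback
from functools import wraps

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("crash-reports")

def report_crashes(handler):
    """Report unexpected exceptions with reproduction context, then re-raise."""
    @wraps(handler)
    def wrapped(request):
        try:
            return handler(request)
        except Exception:
            # A real system would send this payload to an error-tracking
            # service; logging stands in for that here.
            logger.error(json.dumps({
                "handler": handler.__name__,
                "user_agent": request.get("user_agent"),
                "viewport": request.get("viewport"),
                "trace": traceback.format_exc(),
            }))
            raise
    return wrapped

@report_crashes
def show_profile(request):
    # Hypothetical handler: breaks on a request shape no test anticipated.
    return request["user"]["name"]

try:
    show_profile({"user_agent": "ObscureBrowser/0.1", "viewport": "412x915"})
except KeyError:
    pass  # the crash was reported above along with its device context
```

The value is in the context: the user agent and viewport arrive alongside the stack trace, which is precisely the kind of combination a test suite was unlikely to have simulated.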
Now, we want to talk about one more area of behavior, one that helps explain the hidden bugs that end up going out to production. We'll use a blanket term here, and you'll probably know what it means when I say it; it can be classified in a lot of different ways, but we'll call it risky coding. Often, engineers are willing to take risks, right? We're willing to code in a way that we wouldn't necessarily consider safe, but we do it so that we can get something done; we're likely in a rush. Another possible reason is that we learned to do it this particular way and haven't yet learned a better, safer way. And still another reason is that we're copying existing patterns: we take an existing dangerous pattern and essentially carbon-copy it into a new area of the code base. Other terms for this are dirty code, or a kind of chosen tech debt. Essentially, when you engage in risky coding, you've made it more likely that bugs will occur later on down the road. Think about that for a second: you make it more likely that bugs are going to occur later on down the road. And the further down the road a bug shows up, the more expensive it is to do something about it.

Once again, I'd like to give Ming Bai some credit for this particular point: the idea that if we code in a risky way (he uses somewhat different language for this), we are moving our bug occurrence later. We know that we're going to have bugs; the question is when they occur. If we write code in a clean, dependable, conventional manner, in a way that allows it to be tested, in a way that is ergonomic for the language or framework we're using, all of these good hygiene habits make it much more likely that a bug shows up earlier in the process. It's much more likely that we'll catch the bug before a user ever sees anything, or catch it in code review, or write a test that actually exercises that particular use case. So it's almost universally true that risky behaviors, uncommon workarounds and hacky things, produce latent bugs: things you wouldn't have expected, that are hard to test for, and that are hard to review because the code is obscure. Those bugs show up down the road, which makes them more expensive to fix.
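As a small, contrived contrast (the CSV format and field names are made up), here is what moving bug occurrence earlier versus later can look like in code.

```python
from dataclasses import dataclass

# Risky version: magic index, bare except, silent fallback. It "works"
# today, but any upstream format change becomes a latent bug that
# surfaces far from this code, where it's expensive to diagnose.
def parse_price_risky(line):
    try:
        return float(line.split(",")[2])
    except Exception:
        return 0.0

# Safer version: explicit structure, loud failure at the boundary,
# and trivially easy to unit test.
@dataclass
class OrderLine:
    sku: str
    quantity: int
    price: float

def parse_order_line(line: str) -> OrderLine:
    parts = line.split(",")
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {line!r}")
    sku, quantity, price = parts
    return OrderLine(sku=sku.strip(), quantity=int(quantity), price=float(price))

print(parse_price_risky("ABC-1,2"))         # quietly returns 0.0
print(parse_order_line("ABC-1, 2, 19.99"))  # parses, or fails loudly
```

When the upstream format drifts, the risky version quietly returns 0.0 and the defect surfaces far downstream; the safer version fails loudly at the boundary, where it's cheap to catch in a test or a code review.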
We've talked about a handful of specific strategies in this episode, and certainly there are many more things you can do; I encourage you to go watch Ming's talk on YouTube. There are more ways to improve the reliability of your software, but the underlying point here is simple: there is not any one thing that you should be doing. This is a composition approach.

Returning to our other example, or an abstracted version of it: an engineer on your team unfortunately has to be out sick. What are the different things, let's say as a manager, that you can do to prepare for this? One is as simple as making sure that your employees are not overworking themselves; people who are overworked and stressed out are more likely to get physically ill. Another part of the composition might be making sure that that employee has access to healthcare. This may not be your direct concern, but you can ensure that they know all of their avenues for seeking the care they need. Another important strategy is reducing the impact of any given person being absent. This takes much more work and thought. It might mean asking the team where there are areas of your code base that are confusing to them. If there's someone who is responsible for a particular area because they know the most about it, then maybe it's time for that person to mentor someone else so they can also know something about it.

We need to understand that most of these kinds of problems, the unexpected events, the things that we can't plan for in a specific scenario but can plan for on a broader scale, require a composition, a balance of elements. I hope you'll take some time to think about the different scenarios in your work and on your teams that require this kind of compositional thinking, and fill in the gaps. Preparing for bugs is a perfect opportunity to do these exercises: to ask, how do we prepare for bugs, rather than simply hoping that they won't happen?

Thanks so much for listening to today's episode of Developer Tea. Thank you again to today's sponsor, LaunchDarkly. You can get started for free at launchdarkly.com. This could be one of the many elements in your composition for avoiding bugs in your production applications.

I also wanted to take a quick moment to say thank you to those of you who have taken the time to leave reviews on iTunes. We've gotten some really good feedback, and when I say good feedback, I mean some good critical feedback. One of the points of feedback we received was that the show has drifted away from being a practically applicable show for software engineers. First of all, I want to say that I think that feedback is entirely valid. There are a lot of ideas that I'm very interested in and willing to talk about on this show, and sometimes they just don't land as the most actionable things that you need as an engineer. I need to hear that from you. I've opened up a lot of channels to get this kind of feedback, and I'd love to hear more so we can continue to refine and improve the show. You can give me that feedback in our Discord community: head over to developertea.com/discord, which you can join for free. You can send me a tweet @developertea, you can send me an email at developertea@gmail.com, and of course you can leave a review on iTunes. As a result of that particular feedback, where we've been leaning more abstract in our episodes over the last year or so, we're going to try to shift the show back towards more concrete, applicable advice for software engineers. Hopefully that will resonate with more of you, and I'm excited for the coming year. That's all for today.