🎞️ Videos → Engineering Without a Safety Net: Where It Works and Where It Hurts
Description
Ever wondered when it's okay to bypass those safety nets in engineering? This talk explores the delicate balance between shipping fast and building reliable software. Faris Aziz, a Staff Software Engineer at Smallpdf with experience across various industries, shares insights from his journey working with different companies and their unique approaches to safety nets, from manual QA to rigorous test coverage. He delves into the real-world costs of cutting corners, highlighting the importance of metrics, observability, and resilience. Join us to learn how to build confidence in your deployments, identify areas where risks are acceptable, and understand the long-term implications of your engineering choices. Discover how to navigate the trade-offs and deliver value sustainably without sacrificing stability.
Chapters
- Introduction and Icebreaker: The Rubber Ducky Method 0:00
- Engineering Without a Safety Net: Finding the Balance 0:51
- Speaker's Journey: Safety Nets in Different Companies 2:09
- When Bypassing Safety Nets is Acceptable (and When It's Not) 7:54
- Speaker Introduction: Faris Aziz, Staff Software Engineer 8:47
- Defining Safety Nets: Testing, Metrics, Observability, Resilience 10:12
- Manual Testing and QA Terminology 12:30
- The Importance of Metrics and Trend Analysis 13:35
- Resilience Patterns: Beyond Basic Retries 15:54
- Building Complete Confidence: Combining Testing, Metrics, and Resilience 17:30
- Introducing DORA Metrics: Focusing on Outcomes 18:00
- The High Cost of Flying Blind: Downtime and Failures 19:36
- The Language of Confidence: Identifying Smells of Low Confidence 20:36
- Where to Cut Corners (and Where Not To) 22:00
- Deployment Strategies for Reducing Blast Radius 22:29
- Feature Flags: Duplicate Code for Guaranteed Rollbacks 23:00
- Finding the Right Balance: Context and Business Needs 24:16
- Q&A: Feature Flags, Front-End Observability, Balancing Cost and Safety 25:17
- Giving Away JetBrains Licenses and Closing Remarks 33:09
Transcript
These community-maintained transcripts may contain inaccuracies. Please submit any corrections on GitHub.
Introduction and Icebreaker: The Rubber Ducky Method0:00
That's going to hurt people's ears, right? Every time I go to a meetup or conference talk, I bring a rubber ducky with me because I know people are distracted sometimes, so it brings the attention back into the room. I know for this one there's going to be a little bit of a language barrier, so I'm going to adapt the talk a little, and if you feel like it needs to be slowed down or reworded, don't hesitate to pause me in the middle of the talk.
Really happy to be here. First time talking in Asia, so I'm really excited about that. I've absolutely been loving the vibes in Bangkok, so it's really cool that this BKK.JS was organized. Really big thank you to Riffy for making this happen, because I sent him a random email 2 months ago and said, "Hey, I'm coming to Bangkok. Can we organize something?" And it was really cool we could do that.
Engineering Without a Safety Net: Finding the Balance0:51
Today, my talk is going to be about engineering without a safety net: where it works and where it hurts. This is essentially going to be a talk about finding resonating points, because I'm sure we've done a lot of these things in our engineering careers, and even if you haven't started your engineering career yet, you'll probably resonate with a couple of the other things I'm going to say next. The first thing I like to start with is something called the fast follow fallacy. There's an array of excuses defined over here, and I want everyone to reflect: if you have never used any of these excuses in your entire career, please lift your hand up right now. You've never said any of these things. No? Okay, we're all on the same page. Has anyone here used git commit --no-verify, skipping the hooks? Yeah? Okay. You and me, we'll have a drink afterwards. But essentially, these are safety nets. These are also things we tell ourselves when we want to bypass those safety nets and move forward, because sometimes there are delivery pressures. And so we can resonate with a lot of this stuff. So the real question we're trying to answer today is: when is bypassing safety measures the right call, because sometimes it really is, and what are the long-term costs?
Speaker's Journey: Safety Nets in Different Companies2:09
I start this talk by going through my own journey of safety nets, because I've worked at a lot of companies, and each company has its own take on how it handles safely deploying things into a production system. I've worked at companies with large enterprise products serving millions of users that only have a formal, manual QA process. So everything's manually tested, and we'll explore what that means. Then there are startups that are obsessed with test coverage. Has anyone here been at a company where your manager tells you, "SonarQube has to be at 90% coverage, and if it's anything below that, it's not going to production"? Has anyone here used static analysis with SonarQube, where you have to watch the test coverage change, right? These are things that sometimes get gamified, but some companies are obsessed with these metrics. And then there are products that, very much from a behavioral perspective, only want to run end-to-end tests. So no unit tests, integration tests, anything of the sort. We just want to test the visual stuff, the stuff the end user sees. That's another philosophy around testing and the safety net. And then sometimes I've even worked on high-traffic platforms with tens of millions of users, and they don't have any automated tests, or any type of tests,
whatsoever. And they can even be generating tens of millions of dollars per year, and there are no safety nets in place. This is also an interesting perspective on how many of the traditional engineering practices you can get away without before it actually becomes critical to your product's survival. So, a couple of the companies I've worked at and a couple of the experiences I've had: first of all, in the connected TV application space. Has anyone here heard of applications like Eurosport
or GCN, Discovery Plus?
Anybody know the Discovery Channel from the US, right? So Discovery Channel has their own TV applications, similar to Netflix. When I was working a lot with those TV applications, they were streaming to a lot of users. Actually, for the Tokyo 2020 Olympics, we had 1.3 billion minutes streamed. Does anyone consider that a small number? That's a big number, right? 1.3 billion minutes streamed. And when you're streaming the Olympics to everyone live, your product can't fail. So clearly, in that scenario, we have to consider what's being put in place to make sure that thing functions. There were several hundred engineers, I think 200-300 engineers, working on that project, and it was predominantly manual testing that I was working with. So smoke testing, sanity testing, regression testing: those are manual quality assurance terminologies being used. If you've only worked with Jest or automated unit tests, you probably won't be familiar with that terminology, but it's really interesting to learn, and we'll explore it in a bit.
Then I've worked in early stage startups where there are no users. We just get pre-seed money. You're under 30 engineers. You're just getting to know each other, just getting to build the first couple of things, and the quality gates are set by management. So yeah, sometimes you have to pass this unit testing benchmark and check the coverage every single time, like, "Okay, am I passing 90%?
Should I add a bunch of useless tests just so that they pass and go green? Don't care about it." And then that just gets the test coverage going. There's actually a really cool tool, if you look it up, called Stryker Mutator. Stryker Mutator goes in and mutates your code in a kind of chaos-engineering style and tries to break your tests, to see if they're actually testing something valuable, not something where you can change the core code and the test still passes as green. So it actually checks whether you've tested something that's valid. And then, yeah, static analysis is used as the gate to make sure something ends up in a production system. Then, there are a lot of user-experience-focused platforms. This is a big thing. I see a lot of companies who are very UI or frontend focused. They like to test what the user experiences. So they use a lot of Storybook, a lot of snapshot testing to make sure the HTML stays consistent across multiple iterations of a component. Then end-to-end testing, because why bother testing the units of your functions and so on, if you're just going to test all the clicking and the different login and sign-up flows that users are actually going to go through, because sometimes that is what matters most. And so some companies just stick with that.
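To make the mutation-testing point above a bit more concrete, here is a minimal Jest-style sketch (my own illustration, not Stryker's actual output). A coverage-only test survives a mutated implementation, while a test with a real assertion catches it, which is exactly the signal mutation testing gives you about test value.

```ts
import { test, expect } from '@jest/globals';

// Hypothetical function under test.
function orderTotal(price: number, quantity: number): number {
  return price * quantity; // a mutation tool might flip this to price / quantity
}

// Coverage-padding test: it runs the code but asserts almost nothing,
// so it would still pass after the mutation (the mutant "survives").
test('orderTotal runs', () => {
  expect(orderTotal(10, 2)).toBeDefined();
});

// Meaningful test: it pins the actual behavior, so the mutation makes it fail
// (the mutant is "killed"), which tells you the test is worth something.
test('orderTotal multiplies price by quantity', () => {
  expect(orderTotal(10, 2)).toBe(20);
});
```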
Then we've got some platforms that I've worked on that are really high scale and really low safety, because the priority is making sure you can deliver as fast as possible, keep on iterating, do a lot of A/B tests, and deliver to customers sooner rather than later. And sometimes tests and safety nets can bog you down a little bit and slow things down a lot. I've heard of CI/CD pipelines that can take 3 hours to pass.
And if a flaky test pops up two and a half hours into your CI/CD pipeline, you're going to rip your hair out. So sometimes people don't want to bother with those things. But yeah, I've worked on platforms where they're generating tens of millions of dollars every year, with tens of millions of users every single month. Again, not small numbers. No automated testing suites, nothing, nada. Products that have been working for over 10 years. You go there and you're shocked. You have no idea how this thing is functioning. And there's no dedicated QA resource. So when you're actually shipping something to the production system, what you're doing is going on staging, clicking around, checking if it functions. Oh no, I make a new commit, it builds again, I click it a couple of times. Okay, it works. I get tired after testing this for the 15th time. Out to production. And if I don't get any alerts, I don't get any alerts. If you don't use PagerDuty, you can't really get alerted, and you get tired of those notifications anyway.
So then you actually get really anxious when you deploy.
Has anyone ever been anxious when they deploy something to production? Like they close their eyes, they click that green button and they're done. Is anyone here not allowed to deploy after 5:00 on a Friday? I'm not allowed to deploy after 5:00 on a Friday. We were scared. I'm scared to deploy after 5:00 on a Friday. Cool.
When Bypassing Safety Nets is Acceptable (and When It's Not)7:54
So, hard truth: sometimes it's the right call to make. There's time to market: to get something out really fast, you need to do just the bare minimum to get something functional and out to users. You have limited resources in the startup space, so sometimes frontend engineers are going to develop things with a couple of mocks because the backend APIs aren't ready yet. If you're in the validation phase of certain ideas,
you do a lot of A/B tests. You have the control and the variant of an A/B test, and sometimes the variant fails. So why bother testing something that might fail? Gather the analytics, and maybe you stick with the original. Does anyone here work at a company that does a lot of A/B testing? No? Okay. One person resonates with me. I'm happy. Cool. And then in the startup world, you sometimes have to survive. You've got a limited runway, so you've got to work with it. So really, the quote I like to say here is "perfect is the enemy of shipped."
Speaker Introduction: Faris Aziz, Staff Software Engineer8:47
So, a little bit of an introduction about myself. I know I've already been talking for like 8 minutes and 59 seconds. But my name is Faris. I've worked at a plethora of companies before all across Europe and the US. I'm a staff software engineer at Smallpdf. I'm responsible for all payments infrastructure and monetization. So if you can't pay on our platform, it's my fault.
And then I've also got a background in connected TV, growth teams, monetization, fintech, fitness tech, and anything you can name. I've worked on performance engineering for a very long time. I love making things faster. I've also seen experiences where somebody's tried to make something 10 times faster, but there are zero users. So I like to take a pragmatic approach to making things faster. And I also work a lot in the engineering leadership space. I previously used to be an engineering manager, so a lot of it is around career growth and stuff like that. I also love open source contributions. So, anybody here use Raycast?
Cool. Yeah, so I develop a couple of the Raycast extensions, like the official Stripe one and a couple of others. So yeah, that's something I have a little bit of fun with and I really recommend it as a project. And just like this is a meetup, I'm also the founder of ZurichJS, which is a meetup in the Swiss German space. And so I really love building communities and coming to them and supporting them because I think it's really cool stuff. And it's not easy to do. It's very stressful to put a meetup together.
So big round of applause at the end for those that do that.
Defining Safety Nets: Testing, Metrics, Observability, Resilience10:12
So what is a safety net? We already started talking a little bit about that. Safety nets are the things that give you confidence when you ship code. That can be the testing we talked about, your manual or automated testing. It verifies expected behavior: does this work the way I think it's going to work, and the way it's defined in the specifications of my Jira ticket? Metrics: is my system healthy? Is it working? How long has it been working for? I'm only working 8 hours a day. I also want to know if the system was still working while I was sleeping. I'm not monitoring and refreshing the thing 24 hours a day.
Observability: I can't just jump in and put console.log everywhere. I need to know whether something is working in a system that isn't my local environment. Then there's resilience. Something is bound to fail at some point, but have you put the structures in place in your platform so that when something does fail, it has mechanisms to get back to a working condition? A very baseline version of this is retries: if a fetch request fails, I'm going to retry it. But that's just the tip of the iceberg when it comes to resilience. So, digging a little more into the safety nets, there's the automated testing we already talked about, and you've got the whole test pyramid. I'm not going to dig into every single one, but this is just to bring awareness, because maybe you know about unit tests, testing pure functions, testing things in isolation. And maybe integration tests are something new for you that you haven't played around with. Or maybe you didn't realize that there's a separation between integration tests and component tests,
which are just for maybe UI in isolation.
Then a big one, that's a big differentiator, is end-to-end tests versus system tests. And sometimes they're used a little bit interchangeably, but end-to-end tests are maybe using something like Playwright or Cypress, and you're testing all the way through from the frontend to the actual backend and back. So you're testing end-to-end if a system's working in a production or staging environment. And yes, sometimes end-to-end tests are run on a production environment against the real database. Then you have your system tests, which are then maybe just testing the front end and not touching the back end. So maybe just running it just with mocks, which are exact replicas of maybe your OpenAPI specs or your GraphQL schemas, just to test things even if the back end hasn't finished developing an endpoint.
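As a rough sketch of that distinction (assuming a Playwright setup, a hypothetical staging URL, and a made-up /api/login endpoint), the two styles might look like this:

```ts
import { test, expect } from '@playwright/test';

// End-to-end test: drives the real frontend against the real (staging) backend.
test('user can log in end to end', async ({ page }) => {
  await page.goto('https://staging.example.com/login'); // hypothetical URL
  await page.getByLabel('Email').fill('user@example.com');
  await page.getByLabel('Password').fill('correct-horse');
  await page.getByRole('button', { name: 'Log in' }).click();
  await expect(page).toHaveURL(/dashboard/);
});

// "System" test in the sense used above: the same flow, but the backend is mocked
// from the API contract, so the frontend can be tested before the endpoint exists.
test('login flow works against a mocked API', async ({ page }) => {
  await page.route('**/api/login', route =>
    route.fulfill({
      status: 200,
      contentType: 'application/json',
      body: JSON.stringify({ token: 'fake-token' }), // shape mirrors the OpenAPI spec
    })
  );
  await page.goto('https://staging.example.com/login');
  await page.getByLabel('Email').fill('user@example.com');
  await page.getByLabel('Password').fill('correct-horse');
  await page.getByRole('button', { name: 'Log in' }).click();
  await expect(page).toHaveURL(/dashboard/);
});
```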
Manual Testing and QA Terminology12:30
Then you've got manual testing. Does anyone here work at a company where they work with QAs? So, like, dedicated QA engineers as a resource,
or do we all just click around things ourselves? Hands up. Thank you very much. Two, three people, four people made me happy today. Cool. So, when it comes to manual QA, they use a lot of terminology around smoke testing, sanity testing, regression testing, exploratory testing. These are terms we're not exposed to as much in the software engineering space if you're just a front-end developer or back-end developer. They're terminologies that are actually really interesting to start investigating, because they give you an idea of the structure, or the way one can go about breaking down or segmenting types of manual testing
that you may actually want to do yourself. Because let's say you have no tests and you still want to deploy something, is there a more methodical way to go about it than just clicking things and hoping for the best? And so it's actually interesting to be able to share this language with QA engineers. And this is used a lot for like higher risk features, complex user flows. You may still end up doing manual tests even though you've covered things with automated testing.
The Importance of Metrics and Trend Analysis13:35
Then we have metrics. Metrics are really important around how much traffic you have. Maybe you've got 10 endpoints on your back end, but only one of them surges a lot and is being used the most; maybe it's your login endpoint. Then errors: the rate of errors, the distribution of errors. I'm sure some of you may use New Relic or Sentry to figure out how often something is going wrong. And if you've got 10 million users and an error is happening a hundred times a day, you don't care, it's not a big deal. But if an error starts mounting and a million of those errors are happening, then you're going to get a little bit worried. And then also detecting things like latency,
and then business conversions.
There are also business analytics and metrics that are really important. Why this is really cool is also to understand trend analysis, because not everything is static when it comes to users using your platform. If you have an e-commerce platform and Black Friday is happening and there's a massive sale, you're going to see a natural surge in the amount of traffic going to your website. If you have a static threshold, like "the second there are 5,000 errors per day, I'm going to consider it an issue," then you're going to break that threshold on days where historically it's actually normal for seasonality to increase your traffic. So understanding the trends of the flows your business goes through helps you understand how to triage particular issues. And then we get into observability. If you think about local observability, it's like "I'm going to console.log in 15 different places and figure out what works and what doesn't work."
This is essentially console logging in production, or figuring out: is something an error, should I be warning at this stage? If somebody's going through a payment flow, is the error I'm getting because the user typed their credit card's expiration date incorrectly, or did I fail to hit the Stripe API and produce a payment intent so I can execute the payment? So it's understanding the different types of behaviors and logs you expect to come out of your product. And then it's very much also for unknown-unknown detection, in the sense that there are just some things you cannot predict. A lot of the flow around observability is: oh, something goes wrong, I can't figure out what's going wrong, I add a bunch of logs, deploy to production, and then read back the logs and try to narrow down what the problem is. So a lot of the time, observability can even help you figure out things you didn't even know were an issue, because your localhost development environment, your super powerful and fast M4 Mac,
is not the same environment that your users are using your platform on.
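Coming back to the trend-analysis point: as a tiny sketch (my own illustration, not any particular vendor's feature), a dynamic threshold could compare today's error count with the recent history for the same weekday instead of using one fixed number that Black Friday would break.

```ts
// Alert only when today's error count is unusually high compared with the same
// weekday's recent history, so seasonal surges don't trip the alarm.
function isAnomalous(todayErrors: number, sameWeekdayHistory: number[], tolerance = 2): boolean {
  const mean = sameWeekdayHistory.reduce((a, b) => a + b, 0) / sameWeekdayHistory.length;
  const variance =
    sameWeekdayHistory.reduce((sum, x) => sum + (x - mean) ** 2, 0) / sameWeekdayHistory.length;
  const stdDev = Math.sqrt(variance);
  return todayErrors > mean + tolerance * stdDev;
}

// Example: the last four Fridays saw roughly 3,800-4,600 errors.
// 4,800 today stays within the seasonal range; 40,000 would trip the alert.
console.log(isAnomalous(4_800, [3_800, 4_100, 4_400, 4_600]));  // false
console.log(isAnomalous(40_000, [3_800, 4_100, 4_400, 4_600])); // true
```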
Resilience Patterns: Beyond Basic Retries15:54
And then, as we also touched on earlier, there are resilience patterns, where retries are really the tip of the iceberg. But then you have circuit breakers, retries with backoff, jitter to fix thundering herd problems, rate limiting. There's an entire plethora of ways to go about making your platform resilient so that it actually has self-healing properties. Even with a pattern like retries, if you just put in a basic retry, you can actually break your API because it hammers it too many times. If you have a retry of 10 for every single API endpoint and your API is just returning a 500,
and your system decides that every time it gets a non-200 code it's going to keep hitting the API, you're going to break the API. And so there are more intelligent ways to go about developing resilience in your systems beyond just basic retries.
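A minimal sketch of what a smarter retry can look like, assuming a plain fetch-based client (a simplified illustration, not a production-ready library): exponential backoff with jitter, so clients back off instead of hammering an API that is already struggling.

```ts
async function fetchWithRetry(url: string, maxAttempts = 5): Promise<Response> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      const res = await fetch(url);
      if (res.ok) return res;
      // Treat any non-2xx as retryable here; a real client would only retry 5xx/429.
    } catch {
      // Network error: fall through to the backoff below and try again.
    }
    if (attempt === maxAttempts - 1) break;
    const baseDelayMs = 250 * 2 ** attempt;         // 250, 500, 1000, ... exponential backoff
    const jitteredMs = Math.random() * baseDelayMs; // full jitter avoids thundering herds
    await new Promise(resolve => setTimeout(resolve, jitteredMs));
  }
  throw new Error(`Request to ${url} failed after ${maxAttempts} attempts`);
}
```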
And so I like to call these the pillars of confidence. What are the things that keep my product up and functioning, and what are the trade-offs of one versus the other that I want to make to be sure I'm happy with what's going out? So there's testing, observability, resilience, and all of that loops into a feedback loop: a feedback loop telling me what's going wrong, how it's going wrong, how often it's going wrong. All of these work together to create a comprehensive safety system. The absence of one weakens your overall confidence,
but you can seldom have all of them in place. Does anyone have all of these in place in the best way possible, in the most efficient way: you have all the observability, your platform is working all the time? Nobody? Cool, we all have jobs. Fantastic.
Building Complete Confidence: Combining Testing, Metrics, and Resilience17:30
So how do we build complete confidence? Look at the individual components: with tests alone, you know it works, but not in production. With metrics alone, you know something's wrong, but not why. With resilience alone, the system can survive failures, but you can't learn from those failures and go into your retrospectives with information and data. And so all of these have a combined effect: detect issues early, recover gracefully, learn, and improve, because everything's iterative. You have a lot of progressive enhancement when it comes to product development.
Introducing DORA Metrics: Focusing on Outcomes18:00
But what's really interesting is I want to talk about something called DORA. Does anyone know Dora the Explorer?
Yeah, only three people know Dora. No? More? There we go. Wakey, wakey. Do I have to come back? Thank you very much. So we're not going to talk about Dora the Explorer too much, but I like to say it because there's something called DORA metrics. Has anyone here heard of DORA metrics?
Okay, one person. And two people. Awesome. So, DORA metrics: I find them really interesting, because in the metrics game things usually get gamified, as we talked about with SonarQube, you know, trying to get the test coverage as high as possible. What's cool with DORA metrics is you have these four core ones. The first is deployment frequency: how often are you deploying to your production environment? Lead time for changes: how long from commit to production?
Then you have mean time to recovery: when something goes wrong, how long does it take to get fixed? And then change failure rate: how many of my deployments, out of the total deployments, cause a failure? And there are performance levels for what counts as low, medium, high, and elite on these metrics. What I like a lot about DORA metrics is that they don't care what your unit test coverage is. They don't care what testing you use, they don't care what you do. They care about end results and outcomes. Your user does not care how many unit tests you have. Does not care how many resilience patterns you have. Your user cares: how many times is my app broken? And so these are interesting ways to see that maybe you've spent two weeks on a technical debt sprint to fix a bunch of your tests, but it hasn't moved any of these metrics, which means the end user hasn't actually seen an outcome from that.
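As a rough illustration of how those four numbers fall out of plain deployment records (my own sketch, not something from the talk or an official DORA tool), you could compute them like this:

```ts
interface Deployment {
  committedAt: Date;  // first commit that went into the release
  deployedAt: Date;   // when it reached production
  failed: boolean;    // did it cause an incident?
  restoredAt?: Date;  // when service was restored, if it failed
}

const HOUR_MS = 60 * 60 * 1000;

// Assumes at least one deployment in the period; a real report would handle empty data.
function doraMetrics(deploys: Deployment[], periodDays: number) {
  const failures = deploys.filter(d => d.failed);
  const leadTimes = deploys.map(d => d.deployedAt.getTime() - d.committedAt.getTime());
  const recoveries = failures.map(
    d => (d.restoredAt?.getTime() ?? d.deployedAt.getTime()) - d.deployedAt.getTime()
  );
  const avg = (xs: number[]) => (xs.length ? xs.reduce((a, b) => a + b, 0) / xs.length : 0);

  return {
    deploymentFrequencyPerDay: deploys.length / periodDays,
    leadTimeForChangesHours: avg(leadTimes) / HOUR_MS,
    meanTimeToRecoveryHours: avg(recoveries) / HOUR_MS,
    changeFailureRate: failures.length / deploys.length,
  };
}
```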
The High Cost of Flying Blind: Downtime and Failures19:36
And so there's a large cost when it comes to flying blind. If you have a $10 million ARR business, you can assume that your revenue is around $1.14k per hour. If you have 99.9% uptime, which sounds like a lot, that's actually up to 8.76 hours of downtime per year, so you could have over $10,000 per year of direct cost. Take that into change failure rate: if you have a change failure rate of 5%, which sounds low, and it takes 3.5 hours, or half a working day, to figure out how to resolve something (it generally takes a lot longer), and you do 100 deploys per year, which is not a lot (100 deploys per year means maybe two deployments a week, whereas normally you want to be doing 5, 10, or more per day), you could have $20,000-plus of cost as a result of that.
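The arithmetic behind those figures, using the talk's assumed numbers, is roughly this:

```ts
const annualRevenue = 10_000_000;                  // $10M ARR
const revenuePerHour = annualRevenue / (365 * 24); // ≈ $1,141 per hour ("$1.14k")

const downtimeHours = (1 - 0.999) * 365 * 24;      // 99.9% uptime ≈ 8.76 hours of downtime/year
const directDowntimeCost = revenuePerHour * downtimeHours; // ≈ $10,000 per year

const deploysPerYear = 100;                        // roughly two deploys a week
const changeFailureRate = 0.05;                    // 5% of deploys cause a failure
const hoursToResolve = 3.5;                        // about half a working day each
const changeFailureCost =
  deploysPerYear * changeFailureRate * hoursToResolve * revenuePerHour; // ≈ $20,000 per year
```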
And so there are also hidden costs that come with that, such as customer trust, engineering time, team morale, and more. I'm 30 seconds over. Am I allowed to go for 2 minutes more? Yeah, okay. Everybody cool with 2 minutes more? I'll try not to speed through.
The Language of Confidence: Identifying Smells of Low Confidence20:36
So, the language of confidence is what I'd like to talk about next. Because how do we know that there's something wrong in our development process? We can actually take a bit of a psychological look at this, because it's in the way that we communicate things. Has anyone ever used these starter phrases?
Somebody asks you, "Oh, how does this thing work?" "It should work this way, yeah?" Because I don't want anybody to blame me if I'm wrong. I want to absolve myself of any lack of confidence. Because I don't have time to read the documentation, I just want to tell you, "Maybe it works that way." "I have a feeling that this is an issue." "Let's just try it and see what happens." I've said all of these things. I've heard all of these things. But how can we go about having the kind of language that is affirmative, in the sense of "this is how it works"? "Our metrics show x, y, and z." "The logs indicate this, and I can see from the traces that this is what's happening." And so if you find yourself using this language, question why you lack confidence in what you're doing. Same as "We don't deploy on Fridays." I used to have a CTO that I worked with that I absolutely loved, and he said, "We're not allowed to say the word Friday at work. We are not allowed to say it. You're only allowed to say Thursday plus one." Because if you say Friday, everything breaks. It's done. So: no deployments on Fridays, you freeze before the holidays, you do code freezes around Christmas. Again, this is another smell that something is lacking confidence in your process.
Where to Cut Corners (and Where Not To)22:00
But there are places where you can cut corners safely: internal admin tools, static content, things that aren't used that much, infrequently used features. But to know that something is not frequently used, you have to have the analytics and data to back up that that place doesn't deserve as much love from a safety net perspective. And there are places where it really hurts: payment processing, authentication, data migrations, things that are very hard to recover from and that directly impact cost from a business perspective.
Deployment Strategies for Reducing Blast Radius22:29
There are things you can do from a deployment strategy perspective to reduce the blast radius. So even if you don't have safety nets in place, something like trunk-based development, which is built around the ideology of having really short-lived branches, ones that only last a day or two, and frequently merging to main, means that every time something goes wrong, you're not rolling back 50,000 lines of code, you're rolling back 50 lines of code. And so this is really helpful. I'm not going to take 2 more minutes, because my biggest issue is that I talk too much.
Feature Flags: Duplicate Code for Guaranteed Rollbacks23:00
I'll touch a little bit on feature flags, and I won't go too much into them because there's a really incredible article by Martin Fowler around feature flags, but I'll say one thing about them: feature flags are where you should be duplicating the most code that you can. Take the example where you're doing an A/B test or a feature flag across button one and button two in your React components, or whatever you're using. Sometimes what you'll do is go inside your button component, use your feature flag system, LaunchDarkly or whatever it is, and check if something's on or off. And let's say the functionality of the button is changing: when I click something and the feature flag is on, it does thing A; if the feature flag is off, it does thing B.
And so you're writing this code inline within your component, which means you're changing the original code, which adds risk and cyclomatic complexity to it.
And if you don't know the term cyclomatic complexity, it's really worth looking up; SonarQube uses it a lot.
And so when you're actually developing things with feature flags, to have a 100% guarantee that your rollback actually works the way the code previously did, you should be doing things like duplicating the component entirely and changing only the one line that's actually going to make the end difference. But I'll let you read a little bit more about that.
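A minimal sketch of that duplication idea, with hypothetical component names and a plain flags object standing in for a real LaunchDarkly lookup, might look like this: the old component stays untouched, so flipping the flag off is a guaranteed rollback to exactly the code that ran before.

```tsx
import React from 'react';

// Hypothetical checkout handlers, stubbed so the sketch is self-contained.
function startCheckout() { console.log('classic checkout'); }
function startExpressCheckout() { console.log('express checkout'); }

// Existing component: left completely untouched.
function CheckoutButtonV1() {
  return <button onClick={startCheckout}>Buy now</button>;
}

// Full copy of V1 with only the one line that matters changed.
function CheckoutButtonV2() {
  return <button onClick={startExpressCheckout}>Buy now</button>;
}

// The flag only chooses which whole component renders; no extra `if`s are woven
// into the original component, so its cyclomatic complexity stays the same.
export function CheckoutButton({ flags }: { flags: { expressCheckout: boolean } }) {
  return flags.expressCheckout ? <CheckoutButtonV2 /> : <CheckoutButtonV1 />;
}
```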
Finding the Right Balance: Context and Business Needs24:16
The conclusion of it all is that it's about finding the right balance, and that's very hard to get right. So there's no single right approach. There's a lot of context from a business perspective that comes into this, but knowing why you have these butterflies in your stomach, this feeling that something's going to break, is a really important starting point. So being targeted and strategic about your safety, and knowing how to talk about confidence and use the language that's going to help you navigate these conversations, is really important, because the goal isn't perfection; the goal is delivering value sustainably. Thank you very much. You can connect with me on LinkedIn here. If you have feedback on the session, it's over there, and I want to take a group photo because I love doing that.
And if you ever want to follow any of the other talks I do, I'm speaking at React Summit New York this year. You can just search my name, Faris Aziz, I'm the first result on Google, and you can follow all the stuff that I do. If you have any questions, you're more than welcome to ask them. And if you want stickers, I brought them from Switzerland. Cool.
Q&A: Feature Flags, Front-End Observability, Balancing Cost and Safety25:17
Okay, that's it. That's the talk, right? Oh, yeah. We'll do the photo. Oh, that's a photo. Yeah, one more photo. Okay. Okay. We all have to choose:
Either we all say "safety net" together, or we say "rubber ducky." "Rubber ducky." There we go. Okay: 3, 2, 1, "Rubber ducky!" Awesome. Thank you. Cool. Questions? It's Q&A, question time. Are there any questions? Don't forget, you can ask in Thai and I'll translate. We're also giving away JetBrains licenses. Any questions about QA, safety, or spreading engineering practices versus speed? Okay.
Any questions? If there are no questions, I assume I said everything perfectly. So please tell me I didn't. A question? Yeah, I think this one is more of a small question. Yeah. So I just saw on your slide that you mentioned things about feature flags, that if we implement feature flags
properly, it will make rollback operations faster. Is there any sign that can tell us, like, okay, we are on the right track? I would say, if you are able to get an understanding of what the cyclomatic complexity of your code is. And that's a complicated term, more of a computer science, mathematical type of term. Cyclomatic complexity, if you distill it, essentially means how many if conditions are in your code. Because a feature flag, if you simplify it, is an if condition. If something's on, do this; if something's off, do that. The more if conditions you have in your code, the more paths there are for something to go wrong, and the more paths you need to test in terms of the edge cases that exist. So the lower your cyclomatic complexity, and the fewer different paths there are to get to a different result in the output of your component, the closer you are to stability. And so if adding feature flags increases that by a significant amount, you're adding testing complexity and reducing guarantees. Because you need to guarantee that even if the previous version is broken, when you roll back, you know it works exactly the way it did before. And copying and pasting is the easiest way to guarantee that.
Yeah, thank you.
All right, we do have another question right here.
Hello, James from LINE. So I have a question about front-end observability.
So just now you shared about it: what tools are you using for front-end observability? I did use Datadog.
When it comes to front-end observability, it highly depends on the amount of users you have. Because if you're using Datadog, it's going to get very expensive very quickly. So if you're talking about cutting down costs, Datadog is not going to be the way to go. And if you're going to be recording every single session that a user's having on the front-end, the expenses are going to fly through the roof. I've had the best experience personally with Sentry.
Sentry is really incredible, but if your back-end is using Datadog heavily, having a way to harmonize your full-stack observability becomes a lot more interesting. And so I'll say the question here is not "what's the best tool to use," because these tools are always going to evolve and they're meant to solve similar problems, but "how can I best affect my mean time to resolution?" How fast is it for me to resolve an issue, and where is my problem? Do I have enough information on the front-end? What browser is being used? What feature flags are on and off? Am I able to trace everything all the way to the back-end if maybe something failed from a database perspective and something wasn't put into the DB? So it's more of a developer experience question.
Do you hate the tool? Do you love the tool? Are you excited to work with the tool? Is it hard to change the tool? Observability distilled is I'm asking my product or my code questions. Why are you doing this? How are you doing this? Why can't you do that? And so if you can answer those questions easily, you're using the right tool. If you can't, you're either not using the right tool or you're not using the tool correctly. Thank you. Cool. Now we have one last question here.
So first of all, amazing session. I have a question, because you say that we are trying to balance safety with speed, right? Yeah. But in my opinion, I think there's another factor, and I want to ask how you would balance it. It's more like the raw cost, because the more safety you have, as you say, Datadog can get very expensive. The advanced ones like Dynatrace are extremely expensive. Or even if you just use open source like Prometheus to ingest logs, the infrastructure can get extremely expensive. Yeah. Yeah. Yeah, how do you not only balance the speed, but balance the dollar cost with the safety?
So, again, a developer's favorite answer to everything is "it depends," and I use "it depends" as the initial answer to every single question. When you're balancing all of that, initially the
conversation is around what features you're trying to observe. So, in my opinion, anything around payments and monetization is non-negotiable. Because if somebody on screen is clicking and paying 10 baht for something, but they get charged 10,000 baht, it's a nightmare. Today, I got a message because somebody wasn't refunded money fast enough on one of the products that I work on. And so it has a very direct impact on customer trust. So, again, it depends on what you're working on. From an A/B test perspective, if you're doing A/B tests, the analytics you care about the most are the business ones. So maybe forget about the analytics from an observability perspective there.
But overall, it's a negotiation, and you're
negotiating cost. And I don't like the idea that all the pressure of deciding what tech debt to work on, or what tool to work with, is on the developer alone. I think it's a negotiation with your PM. And so if you're able to have the negotiation around cost: DORA metrics, again, were an interesting one, because you saw, on a $10 million ARR business, "Oh, how much could I lose if my system works this way or has this rate of failure or whatever?" If you're able to say, "Hey, we make 10 million per year, and it takes me 4 hours to resolve this
with Datadog, but it takes me 3 hours to resolve it with Dynatrace," you can actually calculate the uplift you get from a developer experience and infrastructure perspective versus what you're saving on the invoice that you get every single month. Those are the hidden costs. So if you're able to put the question to your PM, it's like, "Hey, would you rather it takes me a long time to fix this, or it's expensive to maintain this? You pick." But as an engineer, you're a consultant. Even if you work full-time at a place, you're a consultant. You're consulting on the pros and cons of direction one versus direction two. You can't just say you're going to save the company money this way; you've got to ask the business questions. So being able to speak the language of business and cost, and using metrics to back that up, that will
make the biggest difference at the end of the day. And the last thing I'll say around observability is that it's a continuously evolving process in the
sense that when I'm deploying something that I don't have a guarantee will work, I'll put 10 times more logs than something that I know works 99% of the time. And so I'll add 10 logs, but then after 2 weeks, if it's working the way I expected to, I'm going to remove the logs and save the cost. So temporary analysis for a sprint, retrospectively see if the thing's working the way you wanted to, remove the logs, and now you have almost a guarantee that it's stable. So sort of like minimum viable product, what's the minimum viable observability that you want to put in place? Okay. Thank you. Very good.
Okay. Any more questions?
Giving Away JetBrains Licenses and Closing Remarks33:09
We got three competing for the JetBrains. Okay. Now this is a problem. We have three quality questions, but we only have one code to give.
I'll do you one better. I have three JetBrains licenses, too. I'll give all three of them JetBrains licenses. Wow!
So you give me the names, you send me an email, and I'll get you two licenses on top of that. All three of you get them, because you took the time to think of a question and absorb the content, and I don't think the "best" one or whatever always deserves it the most. I think just taking a chance on yourself and asking the question should give you the confidence to do it again, and that's more important than anything. That's very cool. So cool. If you guys want stickers, I'm leaving them here. If I leave and I see the stickers are empty, my heart will be broken, but I'll understand. Cool. Thank you for your time. Let's enjoy the pizza. Thank you so much.