Takeaways from our attempt on scaling a small system in the Gojek Universe

Description

Talking about "Scale": Takeaways from our attempt on scaling a small system in the Gojek Universe. The year is 2019 and every engineer must have been asked at least once to build a "scalable" system. I will be telling the story of our team's journey in building a financial system that grew to serve 20X the traffic in less than a year. Engineering practices, wrong (and right!) decisions, process improvements, and more!

Chapters

  • Introduction: Scaling Challenges at Gojek 0:00
  • Story 1: The Three-Week Feature and Codebase Issues 1:32
  • Hoarded Code vs. Over-Abstraction: Finding a Livable Codebase 4:00
  • The KonMari Method for Code: Finding the Right Questions 7:50
  • Story 2: The BonChon Incident and the Importance of Monitoring 11:53
  • Scaling Infrastructure: Monitoring and Alerting 13:37
  • Why Monitor? Analyzing Trends, Alerts, Dashboards, and Postmortems 15:25
  • Black Box vs. White Box Monitoring: Symptoms and Causes 18:22
  • Google's Four Golden Signals: Latency, Traffic, Errors, and Saturation 21:52
  • Story 3: Urgent Requests and the Importance of People and Process 24:01
  • Prioritization Sessions: Scaling Product and Process Through Collaboration 26:54
  • Conclusion: Scaling is More Than Just Infrastructure 27:57

Transcript

These community-maintained transcripts may contain inaccuracies. Please submit any corrections on GitHub.

Introduction: Scaling Challenges at Gojek 0:00

So the next topic is talking about scale... Takeaways from our attempt on scaling a small system in the Gojek Universe

Hello. Welcome to the talk.

Let me fix this real quick.

All right.

Hi. Are you guys sleepy yet? You just got back from lunch. But hang on.

I have some interesting stories to tell. It's going to be helpful, so please listen and let's get started.

Story 1: The Three-Week Feature and Codebase Issues 1:32

Story time. My squad has been working on this product for a little over a year. It is a customer-facing product that was already live in production when I joined. Our main task is to maintain, develop, and scale this system. And to be honest, scaling the system is actually quite tricky. Imagine you need to scale a system which, one, has been inherited from people who have already moved on. So when you have any kind of question about the code, like how it works, you cannot ask them. Second, it uses different tools from the other teams inside the same chapter. We call our departments "chapters" and our teams "squads" for some reason. And we only have a handful of engineers working on it, and one of them is actually the team lead, who is always drowning in meetings all day. So a lot of weird things happen.

So one day, we received a task to add a pretty straightforward new feature to our system. I asked my teammate back then, since he was more familiar with the code and had more context, "How much time do you think this is going to take?" And he replied, "It's probably going to take only a day or two. It's quite straightforward." And I was like, "Okay. Sounds reasonable. What do we need to do? Let's get started." And we started adding code, writing SQL queries,

adding tests. Yes, we write tests after we write code. No TDD here. And then fixing tests, adding more code. And this turned out to be three weeks' worth of effort to add that small feature. So it was a pretty painful experience.

What happened? What happened with the code? And why did this trivial new feature take such a long time?

Hoarded Code vs. Over-Abstraction: Finding a Livable Codebase 4:00

Our main codebase is actually a backend service which serves as a backend for a mobile app and as a transactional system for the wider distributed system in the Gojek Universe. It has several sections that serve different purposes, written by multiple people. So if you look at this picture, right? You can compare one section of the code to this house. Some sections of our code are actually pretty messy. I would call it hoarded code. It's mostly overrun by clutter and stuff that renders it practically unusable. We cannot easily find the things that we need, and it is quite hard to find the right place to put new things in.

And this is what we have to live with, part of it. We fight with it daily just to find our way around it, try to ship things, and get things done. And we kind of hope that we'll get to rewrite the whole thing one day.

We wouldn't make the same mistake, right? Because we know better. We can rewrite jQuery with React. We can tear down a monolith application into a bunch of different microservices.

So for us, it might work well for a little while. But if the team keeps this habit of not cleaning up after themselves, it will get back to this ugly mess in no time.

And on the other hand, some parts of our code are actually quite clean.

I'm not sure if I can actually call it clean code, but when you see the code, it has a lot of patterns, patterns on top of patterns. The classes are small. If one gets bigger, they split it into multiple classes. And it is super DRY. DRY means "don't repeat yourself," so there are no duplicates, and it looks pretty nice. But then you work with that piece of code and you try to add something to it.

Where should this piece of code be? Maybe I should put it in this class. Oh wait, this class extends some base classes, maybe I should read there. So you need to dig through all these layers and layers of abstraction until you find the right place that is responsible for that piece of new functionality you're trying to add. And you might not even be able to find one in the existing code.
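To make the "layers of abstraction" problem concrete, here is a tiny hypothetical sketch in Python. This is not code from our actual service, just the shape of the problem: the behaviour you want to change is hidden behind a base class and an intermediate class, so even a trivial addition starts with an archaeology session.

```python
# Hypothetical sketch, not from the real codebase: three layers of
# indirection around what is essentially one small piece of logic.

class BaseProcessor:
    """Generic template: subclasses fill in validation and application."""
    def process(self, payload: dict) -> dict:
        return self._apply(self._validate(payload))

    def _validate(self, payload: dict) -> dict:
        raise NotImplementedError

    def _apply(self, payload: dict) -> dict:
        raise NotImplementedError


class AbstractTransactionProcessor(BaseProcessor):
    """Intermediate layer that only forwards the payload."""
    def _validate(self, payload: dict) -> dict:
        return payload  # the "real" rules live somewhere else entirely


class RefundProcessor(AbstractTransactionProcessor):
    def _apply(self, payload: dict) -> dict:
        # New requirement: support partial refunds. Does that belong here,
        # in the parent, or in a brand-new sibling class? You have to read
        # every layer above before you can answer.
        return {"status": "refunded", **payload}
```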

This piece of code, I want to compare to this showroom, this kind of staged house, which looks nice.

Like, we want to have a house that kind of looks like this.

But if you look closely at the house, it's actually pretty hard to live in. You can see the couch. If you want to sit on the couch and watch TV, it's going to be a bit weird because you'd be sitting sideways. And where can I charge my phone when I'm sitting on the couch? There's no plug or anything nearby. And why does that table sit on the floor? If I want to put something on the table, I need to do something like this. It's actually not possible to comfortably live

in this kind of house.

The KonMari Method for Code: Finding the Right Questions 7:50

So we need something more in between.

And we need a house that is livable. So I took this concept from Sarah Mei,

who is a really good developer with really well-thought-out ideas that we borrow from. So what does it mean to be livable?

So livable code is cluttered enough to be comfortable. We know where to find things and we have enough space for flexibility.

So the squad decided it was time to actually discuss this, because our code is both of the extremes. Some parts are hoarded code. The other parts are pretty much layers over layers of abstraction, and it's pretty hard to work with. So we decided to discuss the way our codebase

should be, and why, and where we are going to put things.

Essentially, we are discussing how to live with this codebase together as a team.

So I would like to ask: does anyone of you know this woman? She is the author of the technique called KonMari. And the KonMari technique boils down every decision

to keep, get rid of, or donate things into one simple question: does this item spark joy? It reframes the problem from "should I keep this item?" to "how does this item make me feel?" And coming back to our code, what's the right question to ask? During the discussion we asked ourselves a lot of questions, and there is one question that I found really useful when you are trying to make your code more livable. The question is: when you look at the code, do we understand the purpose of the code?

Like, why is this here? What's the purpose of that piece of code? And if the answer is no, we don't know why it's here, then it is most likely because either it was hoarded code, a piece of code put together in a hurry and so messy that you cannot find its purpose, or it was overly abstracted to the point where the purpose is actually hidden under the layers of abstraction.

And we kept doing this for a couple of months, and the results were quite useful.

So we make one change at a time. We follow this one piece of advice from our leadership: "you touch it, you improve it." We actually do that, and now the codebase has transformed into a place, a piece of code, that the whole team can live in. We know where things are and why they are there. When adding new features, we have less cognitive load trying to find the place for those new things. Refactoring and adding new things has never been easier.

Story 2: The BonChon Incident and the Importance of Monitoring 11:53

That was our first story. Here's the other one. So one peaceful afternoon, right, when I was having a serious discussion with my squad about which size of BonChon wings they wanted (we always have this snack time every afternoon), our product owner came in and asked us to gather some data for him from the production database. So I said, "Okay, sure.

I'll just run a simple query and probably get that result for you." Then I walked back to my desk, opened up my laptop, thought about the query for a bit, and started typing. Then I ran a test: I ran the query on the staging environment and pressed enter. Okay, looks like the result is what I expected. Maybe I should just try and get the data from our production database. I copied the exact same query, pasted it in the terminal, and pressed enter. It took a bit long, but it looked like everything was fine.

Little did I know, on our self-managed Postgres instances the load had skyrocketed without triggering any alert, and it was pure luck that one of our teammates was actually looking at the dashboard for our infrastructure and saw the spike. Then we just cancelled the query, and no BonChon

was actually ordered that day.
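As an aside, and not something the talk itself prescribes, here is a minimal sketch of one guardrail against this kind of incident: setting a Postgres statement timeout before running an ad-hoc read, so a heavy query gets cancelled instead of quietly piling load onto a production instance. It assumes Python with psycopg2; the connection string, table, and timeout value are hypothetical.

```python
# Minimal sketch of an ad-hoc read with a safety net. All names here
# (DSN, table, timeout value) are made up for illustration.
import psycopg2

conn = psycopg2.connect("postgresql://readonly@db-replica:5432/payments")
try:
    with conn.cursor() as cur:
        # Cancel anything that runs longer than a few seconds rather than
        # letting it eat a production instance unnoticed.
        cur.execute("SET statement_timeout = '5s'")
        cur.execute(
            "SELECT count(*) FROM transactions WHERE created_at >= %s",
            ("2019-01-01",),
        )
        print(cur.fetchone())
finally:
    conn.close()
```

Pointing this kind of one-off query at a read replica instead of the primary is another cheap layer of protection.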

Scaling Infrastructure: Monitoring and Alerting 13:37

So that leads to the next topic that I want to talk about: the infrastructure. I guess generally when people are asking about scale, they think about infrastructure, right? More specifically, SRE, or site reliability engineering.

And software systems are pretty dynamic and unstable. The only perfectly stable system is actually a dead system that has no development anymore. Our job as engineers, or SREs, or the people who maintain that piece of infrastructure, is to maintain the balance between shipping features and the stability of the system. This is a trade-off that we need to make. Back then we had no dedicated SRE, and our team needed to hold our own, maintain our own infrastructure, and make sure our system was stable enough to serve our rapidly growing user base.

We actually did and tried a lot of things.

Based on the team's experience, the most important aspect of scaling any kind of system is actually monitoring. This sounds boring, right? What does monitoring have to do with scaling? And it sounds so boring, just looking at graphs.

Why Monitor? Analyzing Trends, Alerts, Dashboards, and Postmortems 15:25

But hear me out so you won't get into a bunch of incidents like me.

You might wonder why we need to monitor our infrastructure. There are multiple reasons why it is crucial to monitor your own system. The first one is to analyze long-term trends, to answer questions like: how quickly is our user base growing? How long until we need to increase our instance size to accommodate the growing user base?
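As a rough illustration of that kind of trend question (the numbers below are invented, not Gojek data), a back-of-the-envelope forecast from weekly traffic data might look like this:

```python
# Back-of-the-envelope capacity forecast from a weekly traffic trend.
# The figures below are made up purely for illustration.

weekly_peak_rps = [120, 150, 185, 230, 290]  # observed peak requests/sec, week by week
capacity_rps = 1000                          # what the current instances can comfortably serve

# Average week-over-week growth factor.
pairs = zip(weekly_peak_rps, weekly_peak_rps[1:])
growth = sum(later / earlier for earlier, later in pairs) / (len(weekly_peak_rps) - 1)

weeks_left = 0
projected = weekly_peak_rps[-1]
while projected < capacity_rps:
    projected *= growth
    weeks_left += 1

print(f"~{growth:.2f}x growth per week; capacity reached in about {weeks_left} weeks")
```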

The second important reason why we need to monitor is to have some kind of alerting.

Something that could tell you, "Hey, something is broken on production. Someone please come and fix it right now." Or something that is imminent, like,

"okay, your SSL certificates will be expired in 15 days. you better renew it." this kind of alerts is a crucial part of maintaining the stability and scaling your system.

The third one is building a dashboard. You should be able to at least answer some basic questions about your service; I'll actually talk about this in a bit. And last, we need to do postmortems, or post hoc analysis. A.k.a. we need to debug stuff on production to fix issues such as: why did our response time spike? What else happened at the same time? If we have visibility and monitoring set up properly, you might find out that, "Oh, the DB query is actually taking

a pretty long time, let's investigate that." So you can narrow down the problem easily. Monitoring and alerting enable your system to tell us what is currently broken,

or what is going to break. So when the system isn't able to automatically fix itself, it can actually ping a human, and you or your teammates can investigate the problem and fix it in time, before it starts cascading into other systems.

Black Box vs. White Box Monitoring: Symptoms and Causes 18:22

So our monitoring system, or any kind of monitoring system, would have to answer these two questions. One: what's broken? And two: why is that thing broken?

We combine what we call black box monitoring, which for us is the Datadog dashboards, with white box monitoring, where we use Kibana for logs and more details on the machines. We use these two monitoring tools combined to

identify and recover from problems. So black box monitoring is actually used to identify only the symptoms of an issue. I'll give some examples on the next slide. And when we know there is a problem, we actually need a way to dig deeper into the issue, right? So that's when white box monitoring tools provide

more details for us to debug those problems. So this is a real scenario that we have

experienced. One day, our monitoring tools detected that the system's response time had suddenly skyrocketed. After the deployment of some config change, the response time started increasing for unknown reasons.

Luckily we have this health dashboard, right? We set up proper black box monitoring that tells us, hey, there is something wrong with your system and you might want to look at it.

And when we looked at the dashboard at the same time, we also saw the alerts about the machines' disk space. The disk space was running low, or already full. These are all examples of detecting the symptom via black box monitoring. And after a thorough investigation, we found out that

our log rotation was not working properly, by going into the machines, looking at the disk size, looking at the files, and so on, using the tools we have to look inside the system. The log rotation was not working properly, causing the machine disk to be bloated with failed log backups; the system ran out of disk space, and that was impacting the performance of our system. So this is an example of the cause.

So we have black box monitoring that helps you identify what's wrong with your system, and we have white box monitoring tools that help you see the inner details of your system.
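As a toy illustration of that split (this is not our actual Datadog/Kibana setup, just the same idea in miniature): a black box probe hits the service from the outside and sees the symptom, while a white box check looks inside the machine for a likely cause, such as a disk filled up by broken log rotation. The URL below is a placeholder.

```python
# Toy illustration of black box vs white box monitoring.
import shutil
import time
import urllib.request

def black_box_probe(url: str) -> tuple[int, float]:
    """Hit the service like a user would and record the symptom."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=5) as resp:
        status = resp.status
    return status, time.monotonic() - start

def disk_usage_percent(path: str = "/") -> float:
    """White box view: how full is the disk on this machine?"""
    usage = shutil.disk_usage(path)
    return 100 * usage.used / usage.total

status, latency = black_box_probe("http://localhost:8080/health")  # placeholder URL
if status >= 500 or latency > 1.0:
    print(f"symptom: status={status}, latency={latency:.2f}s")
    print(f"possible cause: disk is {disk_usage_percent():.0f}% full")
```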

Google's Four Golden Signals: Latency, Traffic, Errors, and Saturation 21:52

So there are four golden signals from Google SRE that we follow. These are the four main metrics that you might want to track in your black box monitoring. First is latency. You should be able to identify any performance impact on the system: high load, a long wait for a certain request, or a network issue. The key is to track the latency of successful and failed responses separately so that

the data is not distorted. The second one is traffic. This one is quite straightforward. We want to be able to tell the usage of the system by the current user base, to see how fast the

traffic is growing. How long until we need to consider scaling up?

The third one is the error rate. This is here to alert us to some kind of imminent catastrophe. This metric will warn you that something bad is currently happening in your system. The last one is saturation.

Saturation means how full your system is. This metric will help you answer the question: is it time to actually scale your system up? Maybe do some horizontal scaling, or just upgrade your machines. The three most common things that we measure for this are memory, I/O, and CPU.

So we can be sure that our resources are enough to serve the current load on the system.
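To make the latency point concrete, here is a small hypothetical sketch of tracking successful and failed responses separately, so that fast-failing errors don't drag down and distort the latency you report for healthy requests. The sample data is invented.

```python
# Sketch: keep latency for successes and errors in separate buckets so a
# burst of fast 5xx responses doesn't distort the "healthy" latency numbers.
# The request samples are invented for illustration.
from collections import defaultdict
from statistics import quantiles

requests = [(200, 0.12), (200, 0.34), (500, 0.01), (200, 0.29), (503, 0.02)]

buckets = defaultdict(list)
for status, latency in requests:
    bucket = "success" if status < 500 else "error"
    buckets[bucket].append(latency)

for name, samples in buckets.items():
    p95 = quantiles(samples, n=20, method="inclusive")[-1] if len(samples) > 1 else samples[0]
    print(f"{name}: count={len(samples)}, p95 latency={p95 * 1000:.0f} ms")
```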

And after we tracked all these signals, we had a lot more visibility into the system and the health of our service, and it actually helped us mitigate some really serious events before they happened.

Story 3: Urgent Requests and the Importance of People and Process 24:01

This is the last one. It happened at the end of the year, when everyone was going on vacation and there was just one other guy and me left. Our product owner came in and said, "Hey, I have a problem for you to help fix. It's pretty urgent. Can you do it by today?" I was on another task and said, "Okay, let me prioritize that after my current one is finished."

And then he came back an hour later and said, "Oh, wait, we actually need another one, which is also urgent." And then he came back again. This happened quite a number of times, and it reduced the team's productivity

quite a bit.

This leads to the last point that I want to talk about when you want to scale your actual product and system: the people and the process.

Our main communication happens via three mediums: face-to-face, instant messaging like Slack, and video calls. We had fewer issues with the local team because we are face-to-face. We can discuss. We can actually talk in person. But nowadays companies are open to having remote teams, which can increase their talent pool, but it also creates new challenges.

Because face-to-face conversation is not always possible, what we need to do is actually try and collaborate remotely. The working environment forces us to consistently collaborate remotely. It took us a while to get used to it. We have experimented with a lot of process changes,

and here are some tips. I think we have almost run out of time, so I am just going to talk about the first one, which is quite important. The key aspect of scaling your product and making your process work is not tagging everything as urgent, and the team needs to be serious about their priorities.

The product team might want to have a portal shipped tomorrow and a new feature shipped next week, but your engineering team also needs to fix that technical debt before they do that, right? So how would you communicate this in a remote environment?

Prioritization Sessions: Scaling Product and Process Through Collaboration 26:54

Our team is actually utilizing this concept of prioritization sessions, where we sit together and prioritize as a team even though we are remote.

We have a write-up in a long-form document where we discuss how we are going to do things and what the goal of the wider team is right now.

So the process has not only helped improve the team's feature delivery, but also improved our tech debt and our code quality, because we can spend some time and start pushing for more priority on the tech debt while still focusing on the main goals of the product.

Conclusion: Scaling is More Than Just Infrastructure 27:57

Let me say this: scaling a product is actually quite hard. It is not only about how much traffic your system can handle or how you would increase your system's capability to handle the load. Scaling a product requires collaboration from all functions to make sure the business can operate and grow according to its targets.

As an engineer, I strongly believe that scaling does not mean adding more VM instances, adding more pods to your Kubernetes cluster, or optimizing SQL queries.

For me, it means solving business needs by optimizing the available resources in a timely manner, and consistently doing so. As a team, we are still learning to continuously scale our system, and I'm proud to say that we are now more content than ever with our code, our infrastructure, and our process. We are shipping things more quickly and handling production issues in a more uniform way.

So I'm Tino, a product engineer from Gojek. I would love to chat more about this type of thing. Thank you for listening.

Thank you, Khun Tino. Thank you so much to Khun Tino from Gojek Thailand for sharing your experience with us.