Rollout changes for large-scale applications serving billions of users
What is a rollout?
After a software engineer finishes writing code, the next step is to create a build (the binary) and deploy it to the servers so that users can use it.
Typical rollout workflow
Typically, most teams adopt the following workflow to roll out changes to production servers.
- The application has dedicated fleets of Dev, QA, Preprod, and Prod servers, i.e., they don't share any infra (separate compute layer, separate database, separate auth layer, separate storage, etc.).
- Developers write code and submit their changes to the source code repository.
- A CI (Continuous Integration) tool listens for changes and creates builds after running automated test cases.
- A CD (Continuous Deployment) tool picks up the new version (GitOps) and rolls out the new binary to the Dev server.
- Generally, only Dev servers deploy automatically on code changes. The rest of the servers require a manual trigger.
- Once the developer gives the go-ahead, the release manager deploys the changes to the QA server for testing.
- Once the QA team verifies the changes, the release manager rolls them out to the Preprod server, where the QA team tests again.
- Once the QA team verifies the changes on Preprod, the release manager rolls them out to the Prod server, where the QA team tests again to make sure everything works.
- If everything is good, the QA team declares a successful rollout. Otherwise, the team decides whether to roll back the build or roll forward.
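At a high level, this workflow amounts to a gated pipeline. Here is a minimal Python sketch of that flow; the names (`STAGES`, `deploy`, `run_automated_tests`, `manual_approval`) are hypothetical placeholders, not a real CI/CD API:

```python
# Minimal sketch of the gated rollout workflow described above.
# All names are illustrative placeholders, not a real CI/CD API.

STAGES = ["dev", "qa", "preprod", "prod"]

def deploy(build_id: str, stage: str) -> None:
    print(f"deploying build {build_id} to the {stage} fleet")

def run_automated_tests(stage: str) -> bool:
    """Placeholder: run the automated suite against a stage's fleet."""
    return True

def manual_approval(stage: str) -> bool:
    """Placeholder: release-manager / QA sign-off before promoting."""
    return input(f"promote build to {stage}? [y/N] ").lower() == "y"

def rollout(build_id: str) -> None:
    for stage in STAGES:
        # Only the Dev deploy is automated; later stages need a manual trigger.
        if stage != "dev" and not manual_approval(stage):
            print("rollout stopped; decide to roll back or roll forward")
            return
        deploy(build_id, stage)
        if not run_automated_tests(stage):
            print(f"tests failed on {stage}; rollout failed")
            return
    print("successful rollout")
```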
Problems with the above approach
- A lot of pressure on the QA team to certify the build.
- A lot of manual testing and decision-making is involved.
- If there is no plan for a forward fix, the entire build is abandoned, even if the problem was caused by a small change. There is typically no way to roll back just one commit.
- Lower release frequency due to manual testing and the high risk of breaking prod.
- Because the long release cycle bundles many changes, it becomes very hard to debug an issue on prod.
- There is always a fear of launching new changes on prod, which leads the team into constant firefighting to deal with new issues.
- Any bug impacts all users, leading to a bad user experience.
- This hurts badly and becomes unmanageable if your application has billions of users.
Goals for a scalable rollout
- Roll out changes to a small percentage of users; if all looks good, roll out to the next higher percentage.
- Serve a subset of users (say, 50% of employees or 10% of free users) from Preprod. This minimizes the risk of bugs spreading to Prod and also reduces the load on the QA team.
- Allow employees/testers to always use the QA (or another) server for their daily usage of the application. If a subset of user traffic is served from the QA server, there is a much higher chance of catching a broken critical user journey early.
- Support canary rollouts, i.e., before rolling out to Prod, compare the impact of the new version against the previous version. Direct 1% of total traffic to the new version and another 1% to the old version, then run automated tests comparing CPU, memory, error rate, etc., for ~2 hours. Only if all looks good do we roll out to the rest; otherwise, reject the rollout and let the team debug the issue.
- Reduce the blast radius. When changes are rolled out to Prod, don't roll out to 100% of the infrastructure at once. Instead, start with 1% of the infra and run a suite of automated tests. If all looks good, move to the next 5%, then 10%, then 30%, and so on. The goal is incremental rollout (see the sketch after this list).
- Leverage automated test cases to reduce manual effort.
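A minimal sketch of this incremental rollout with a canary gate, assuming hypothetical helpers `compare_canary_metrics`, `deploy_to_fraction`, and `run_test_suite`:

```python
# Sketch of incremental ("blast radius") rollout guarded by a canary.
# The 1% -> 5% -> 10% -> 30% -> 100% progression follows the text above.

ROLLOUT_STEPS = [0.01, 0.05, 0.10, 0.30, 1.00]

def compare_canary_metrics() -> bool:
    """Placeholder: compare CPU/memory/error rate between 1% of traffic
    on the new version and 1% on the old version for ~2 hours."""
    return True

def deploy_to_fraction(build_id: str, fraction: float) -> None:
    print(f"deploying {build_id} to {fraction:.0%} of the fleet")

def run_test_suite() -> bool:
    """Placeholder: automated test suite run after each step."""
    return True

def incremental_rollout(build_id: str) -> bool:
    if not compare_canary_metrics():
        print("canary failed; rejecting rollout")
        return False
    for fraction in ROLLOUT_STEPS:
        deploy_to_fraction(build_id, fraction)
        if not run_test_suite():
            print(f"tests failed at {fraction:.0%}; stopping rollout")
            return False
    return True
```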
Design rollout for billions of users
If the application serves a billion-plus users, we first need to redesign both the application and the rollout strategy.
Shard data into Dev and Live
Instead of dividing the infrastructure vertically into Dev, QA, Preprod, and Prod, we should do smart sharding of the data itself. We can divide all data into two categories:
- Development
- Live data
The entire infrastructure is divided into just these two sets, i.e., www.app-dev.com for dev data and www.app.com for everything else. This means only two dedicated sets of infrastructure are needed to store dev and live data.
Define risk classes
Every live user can be categorized into one of the following risk classes:
- Integration
- QA
- Preprod
- Prod
We can keep just one risk class for Dev.
Define a user entity and map it to a risk class
We define each user as an entity, then maintain a separate EntityMap database to map the user to the corresponding risk class. The traffic manager is updated to read the entity ID from the request and, based on the mapping, redirect it to the corresponding group. Note that the traffic manager routes your traffic; it is not a load balancer. The load balancer sits closer to your compute layer to manage load.
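A minimal sketch of that routing decision, with the EntityMap as an in-memory dict and hypothetical cluster endpoints (a real system would back this with a database or a replicated cache):

```python
# Sketch of the traffic manager: look up an entity's risk class in the
# EntityMap and route to the matching cluster. Endpoints are illustrative.

ENTITY_MAP = {              # entity_id -> risk class
    "user-123": "integration",
    "user-456": "preprod",
    # anyone not listed defaults to "prod"
}

CLUSTERS = {                # risk class -> compute cluster endpoint
    "integration": "https://integration.app.internal",
    "qa": "https://qa.app.internal",
    "preprod": "https://preprod.app.internal",
    "prod": "https://prod.app.internal",
}

def route(entity_id: str) -> str:
    """Routing decision only; load balancing happens closer to compute."""
    risk_class = ENTITY_MAP.get(entity_id, "prod")
    return CLUSTERS[risk_class]
```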
Define groups
We can define groups of users based on their geo-location and type. For example:
- Integration: We can map data for the integration server to `integration-us-group-1`.
- QA: We can map test data created for QA testing to `qa-us-group-1`.
- Preprod: We can map 50% of employees (or 10% of free users) to `preprod-us-group-1` or `preprod-eu-group-1`, etc.
- Prod: We can divide the world into 100 groups (or more, based on scale) and map these groups to the prod category, e.g., `prod-us-group-1`, `prod-us-group-2`, `prod-eu-group-1`, etc.
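One way to assign a user to such a group deterministically is to hash a stable user ID; the group count and naming scheme below are assumptions following the examples above:

```python
# Sketch: derive a user's group from risk class + region. Hashing a
# stable ID keeps the assignment deterministic across requests.
import hashlib

PROD_GROUPS_PER_REGION = 50   # assumption; ~100 groups worldwide

def assign_group(entity_id: str, risk_class: str, region: str) -> str:
    if risk_class == "prod":
        digest = hashlib.sha256(entity_id.encode()).hexdigest()
        n = int(digest, 16) % PROD_GROUPS_PER_REGION + 1
        return f"prod-{region}-group-{n}"
    # Non-prod classes use a single group per region in this sketch.
    return f"{risk_class}-{region}-group-1"

# assign_group("user-123", "prod", "us") -> e.g. "prod-us-group-17"
```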
Map a set of layers to each group
Based on the group, we shard all the layers:
- Let's say `prod-us-group-1` maps to a set of virtual machines (the compute shard).
- Similarly, all data for `prod-us-group-1` is stored in the corresponding database shard.
- Similarly, all files for `prod-us-group-1` are stored in the corresponding file system shard.
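Resolving a group to its per-layer shards could look like the following sketch (the shard identifiers are hypothetical; the point is that every layer is keyed by the same group name):

```python
# Sketch: every layer (compute, database, file system) is sharded by
# the same group key, so a group resolves to one shard per layer.

def shards_for_group(group: str) -> dict[str, str]:
    return {
        "compute": f"vm-pool/{group}",      # set of virtual machines
        "database": f"db-shard/{group}",    # database shard
        "filesystem": f"fs-shard/{group}",  # file system shard
    }

# shards_for_group("prod-us-group-1")
# -> {"compute": "vm-pool/prod-us-group-1", ...}
```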
Update the traffic manager to redirect traffic based on entity
Now live data is internally sharded based on entity type, i.e., entities are categorized as follows:
- Integration
- QA
- Preprod
- Prod
The same URL, i.e., www.app.com, is used to access all of these entity types. Internally, at the network level, the traffic manager identifies the entity category and redirects traffic to the corresponding compute layer, and the matching database shard is used for storage.
Architecture of the application
The following is the final high-level architecture of the application.
- Code committed by a developer is now auto-deployed to both the Dev server and the Integration server, which is part of the Live data setup. A small percentage of users use the Integration server, so if any critical user journey breaks, it is caught quickly by live users.
- You might be thinking: wait, isn't the Dev/Integration server okay to break? The answer is no. But that is no excuse to commit code without automated test coverage. Developers must commit code with full automated test coverage, which runs before the build is deployed. The goal is to have enough automated testing, and then catch the bugs that were somehow missed by the automated tests or that are not easy to automate.
- The build stays on the Integration server for a day to give users enough time to report any issues they discover.
- On day 2, the build is rolled out to the QA server, where the QA team runs manual tests for the use cases that can't easily be automated.
- On day 3, the build is rolled out to Preprod.
- On day 4, the build is rolled out to Prod.
- A canary test, i.e., comparing 1% of traffic on the new version with the old version, is executed at every stage before the rollout proceeds (see the sketch after this list). Again, it needs enough test cases to catch bugs.
- These release cycles run in parallel on different builds, so a new build can enter the pipeline every day and the overall release frequency stays high.
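A sketch of the canary comparison gate used at each stage, with assumed metric names and regression thresholds (both are illustrative, not prescribed by the design):

```python
# Sketch: compare metrics from 1% of traffic on the new version against
# 1% on the old version. Thresholds and metric sources are assumptions.

THRESHOLDS = {
    "cpu": 0.05,         # allow at most +5% relative CPU regression
    "memory": 0.05,      # allow at most +5% relative memory regression
    "error_rate": 0.01,  # allow at most +1 point absolute error-rate rise
}

def fetch_metrics(version: str) -> dict[str, float]:
    """Placeholder: read aggregated metrics for the 1% slice running
    this version from the monitoring system."""
    return {"cpu": 0.42, "memory": 0.63, "error_rate": 0.001}

def canary_passes(old_version: str, new_version: str) -> bool:
    old, new = fetch_metrics(old_version), fetch_metrics(new_version)
    if new["error_rate"] - old["error_rate"] > THRESHOLDS["error_rate"]:
        return False
    for metric in ("cpu", "memory"):
        if (new[metric] - old[metric]) / old[metric] > THRESHOLDS[metric]:
            return False
    return True
```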
Rollback strategy
With the above approach, if an issue is caught by the automated test suites, by employees or testers, or reported by users, we can roll back as follows.
- If the issue is on Prod or Preprod, simply roll back the build to the last safe version.
- If the issue is reported while a rollout is in progress, the rollout tool fails the release. Since the release cycle is short, the list of changes is limited, which helps the developer analyze the root cause. We can either abandon the release, or remove the offending commits from the build, create a new build, and roll it out again.
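A minimal sketch of the first case, assuming a per-group history of deployed versions (all names are hypothetical):

```python
# Sketch: keep a per-group history of deployed versions and roll back
# to the last safe one. DEPLOY_HISTORY is an illustrative stand-in for
# the release tool's version records.

DEPLOY_HISTORY = {
    "prod-us-group-1": ["v41", "v42", "v43"],  # oldest -> newest
}

def rollback(group: str) -> str:
    history = DEPLOY_HISTORY[group]
    bad = history.pop()     # the version that caused the issue
    safe = history[-1]      # the last safe version
    print(f"rolling {group} back from {bad} to {safe}")
    return safe
```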
Use an experiment tool to execute code in a controlled way
- Often, a developer wants to submit code but make sure it is not executed until the entire flow is ready.
- Another goal is to test a new feature with specific users or a percentage of traffic, and to monitor reports before directing more traffic to the new code.
- In case of an error, simply turn the traffic off instead of rolling back the build.
These goals are achieved with experiments. Experiment tools provide ways to declare flags in the code that developers use to guard their new features.
Once the build is rolled out, the developer can enable the experiment in specific environments, for certain users, or for a certain percentage of traffic.
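A minimal sketch of such a flag guard, assuming a hypothetical `Experiment` class with per-user and percentage-of-traffic targeting (real experiment tools expose similar controls):

```python
# Sketch: guard a new feature behind an experiment flag. The Experiment
# class is hypothetical, not a specific tool's API.
import hashlib

class Experiment:
    def __init__(self, name: str, allowed_users=(), traffic_pct: float = 0.0):
        self.name = name
        self.allowed_users = set(allowed_users)
        self.traffic_pct = traffic_pct  # 0.01 == 1% of traffic

    def is_enabled(self, user_id: str) -> bool:
        if user_id in self.allowed_users:
            return True
        # Stable hash so a given user always sees the same variant.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        return (int(digest, 16) % 10_000) < self.traffic_pct * 10_000

new_checkout = Experiment("new_checkout",
                          allowed_users={"tester-1"},
                          traffic_pct=0.01)

def checkout(user_id: str) -> None:
    if new_checkout.is_enabled(user_id):
        pass  # new feature path
    else:
        pass  # existing path
```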
Conclusion
With the help of a sharded architecture and experiments, it becomes easy to roll out changes while serving billions of users' traffic. Note that I have only covered the high-level design; perhaps a separate blog post is needed to discuss which cloud providers or open-source tools we can use to implement it.