Keeping master green at scale (eng.uber.com)
https://blog.acolyer.org/2019/04/18/keeping-master-green-at-...
His analysis indicates that, as part of its build pipeline, Uber breaks the monorepo up into "targets", creates something like a Merkle tree for each target (which is basically what git uses to represent commits), and uses that information to detect potential conflicts (multiple commits that would change the same target).
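A rough sketch of that idea, to make it concrete. Everything here is an assumption for illustration (the `Target` class, the function names, and the toy file map are mine, not Uber's implementation): each target's hash covers its own files plus the hashes of its dependencies, so a change "affects" every target whose hash moves, and two changes are potential conflicts when their affected sets overlap.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Target:
    """A hypothetical build target: source files plus names of dependencies."""
    srcs: list
    deps: list = field(default_factory=list)


def target_hash(name, targets, files, cache=None):
    """Merkle-style hash: own file contents plus the hashes of all deps."""
    cache = {} if cache is None else cache
    if name not in cache:
        h = hashlib.sha256()
        for src in sorted(targets[name].srcs):
            h.update(src.encode())
            h.update(files[src].encode())
        for dep in sorted(targets[name].deps):
            h.update(target_hash(dep, targets, files, cache).encode())
        cache[name] = h.hexdigest()
    return cache[name]


def affected_targets(targets, base_files, changed_files):
    """Targets whose hash differs once a change's file edits are applied."""
    new_files = {**base_files, **changed_files}
    before, after = {}, {}
    return {
        name for name in targets
        if target_hash(name, targets, base_files, before)
        != target_hash(name, targets, new_files, after)
    }


def may_conflict(targets, base_files, change_a, change_b):
    """Two changes are potential conflicts if they affect a common target."""
    return bool(
        affected_targets(targets, base_files, change_a)
        & affected_targets(targets, base_files, change_b)
    )


# Toy repo: "app" depends on "lib"; "docs" is unrelated.
targets = {
    "lib": Target(["lib.py"]),
    "app": Target(["app.py"], ["lib"]),
    "docs": Target(["README"]),
}
files = {"lib.py": "v1", "app.py": "v1", "README": "v1"}
```

With this toy repo, a change to `lib.py` and a change to `app.py` touch different files but both affect the `app` target, so they are flagged; a change to `README` is independent of either.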
What it sounds like to me is that they end up simulating multirepo so that their build system can run tests on a batch of most likely independent commits. For multirepo users this is explicit, in that this comes for free :-)
Which is super interesting to me, as it seems to indicate that optimizing a CI/CD system requires dealing with all the same issues whether it's mono- or multi-repo, and that the problems solved by your layout result in a different set of problems that need to be resolved in your build system.
Only if you spend the time to build tools to detect commits in your dependencies, as well as your dependent repositories, and figure out how to update and check them out on the appropriate builds.
So, no, it doesn't come for free.
You are totally correct that to achieve the same performance, correctness, and overall master level "green"ness in a multirepo system you would have to either define or detect dependencies and dependent repos, build the entire affected chain, and test the result. That part is much easier in monorepo.
What I was referring to with "this" is Uber's method of detecting potential conflicts. In multirepo land it would be a "conflict" if two people commit to the same repo. In multirepo, therefore, detecting potential conflicts is trivial.
If Bob commits to repo A and Sally commits to repo B, their commits can't result in a merge conflict. Well, unless the repos are circularly dependent - which would be bad :-) don't do that. Of course, monorepo makes that situation impossible so there's an advantage for monorepo.
It seems like whether you have mono- or multi- the problems solved by one choice will leave other problems the build system has to solve that it wouldn't have to solve if the other option were chosen.
Different work would be required in multirepo but it would be work to solve the problems that monorepo solves just by virtue of it being a monorepo.
In fact, at Uber we have seen that behaviour with one of our popular apps when we did not have a monorepo. The construct of probabilistic speculation explained in the paper applies even in this scenario to guarantee a green master.
Or do you mean that multirepo could also benefit from the construct of probabilistic speculation by ordering commits across multiple repos such that you are maximizing the number of repos that have changed before you build and minimising the number of commits applied to single repos?
Or both :-)
We began working with the idea of consensus based CI/CD. If you pushed a change, you published that to the network. It gave other systems the opportunity to run their full suite of tests against the deployment of your code. Some number of confirmations from dependent systems was required to consider your code "stable". This progressed nearly sequentially assembling something like a block chain.
Ultimately the client was unable to pull this off for the same reason they were unable to decouple the systems: lack of software engineering capability.
This is either brilliant or just something built for a promotion packet
Merge Requests now combine the source and target branches before building, as an optimization: https://docs.gitlab.com/ee/ci/merge_request_pipelines/#combi...
Next step is to add queueing (https://gitlab.com/gitlab-org/gitlab-ee/issues/9186), then we're going to optimistically (and in parallel) run the subsequent pipelines in the queue: https://gitlab.com/gitlab-org/gitlab-ee/issues/11222. At this point it may make sense to look at dependency analysis and more intelligent ordering, though we're seeing nice improvements based on tests so far, and there's something to be said for simplicity if it works.
One useful metric is the ratio between test time and the number of commits per day. If your tests run in a minute, you can test submissions one at a time and still have a thousand successful commits each day. If your tests take an hour, you can have at most 24 changes per day under a one-at-a-time scheme.
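As a back-of-the-envelope check of that bound (the function name is mine, and it assumes tests run back to back with no idle time and no failures):

```python
def max_serial_merges_per_day(test_minutes: float) -> int:
    """Upper bound on merges per day when changes are tested one at a time."""
    minutes_per_day = 24 * 60
    return int(minutes_per_day // test_minutes)


one_minute_tests = max_serial_merges_per_day(1)    # 1440
one_hour_tests = max_serial_merges_per_day(60)     # 24
```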
I worked on Kubernetes, where test runs can take more than an hour-- spinning up VMs to test things is expensive! The submit queue tests both the top of the queue and a batch of a few (up to 5) changes that can be merged without a git merge conflict. If either one passes, the changes are merged. Batch tests aren't cancelled if the top of the queue passes, so sometimes you'll merge both the top of the queue AND the batch, since they're compatible.
Here are some recent batches: https://prow.k8s.io/?repo=kubernetes%2Fkubernetes&type=batch
And the code to pick batches: https://github.com/kubernetes/test-infra/blob/0d66b18ea7e8d3...
Merges to the main repo peak at about 45 per day, largely depending on the volume of changes. The important thing is that the queue size remains small: http://velodrome.k8s.io/dashboard/db/monitoring?orgId=1&pane...
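For readers who don't want to dig through the linked code, the batching rule described above can be sketched roughly like this. This is not the test-infra implementation: `pick_batch`, `run_round`, and the "two changes conflict if they touch a common file" rule are simplifying assumptions (the real queue attempts an actual git merge), and here the batch is drawn from the whole queue, so it may include the head.

```python
def pick_batch(queue, touched_files, max_size=5):
    """Greedily pick up to max_size changes that do not textually conflict.

    Assumption: two changes conflict if they touch a common file.
    """
    batch, seen = [], set()
    for change in queue:
        files = touched_files[change]
        if seen.isdisjoint(files):
            batch.append(change)
            seen |= files
            if len(batch) == max_size:
                break
    return batch


def run_round(queue, touched_files, tests_pass):
    """Test the head alone and a batch; merge everything that passed.

    tests_pass is a caller-supplied function taking a list of changes.
    """
    if not queue:
        return []
    head = [queue[0]]
    batch = pick_batch(queue, touched_files)
    merged = []
    for candidate in (head, batch):
        if tests_pass(candidate):
            merged += [c for c in candidate if c not in merged]
    return merged


queue = ["a", "b", "c"]
touched = {"a": {"x.go"}, "b": {"x.go"}, "c": {"y.go"}}
batch = pick_batch(queue, touched)  # ["a", "c"]: "b" collides with "a"
```

If the batch fails but the head passes, only the head merges, so one bad change in the batch costs a round for the others but never blocks the head of the queue.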
At Amazon, for example, they have a multi-repo setup. A single repo represents one package, which has a major version. Amazon's build system builds packages and pulls dependencies from the artifact repository when needed. The build system is responsible for "what" to build vs "how" to build, which is left to the package setup (e.g. maven/ant).
I am currently trying to find a similar setup. I have looked at nix, bazel, buck and pants. Nix seems to offer something close. I am still trying to figure out how to vendor npm packages and which artifact store is appropriate, and also whether it is possible to have the nix builder pull artifacts from a remote store.
Any pointers from the HN community are appreciated.
Here is what I would like to achieve:
1. Vendor all dependencies (npm packages, pip packages, etc) with ease.
2. Be able to pull artifacts from a remote store (e.g. artifactory).
3. Be able to override a package locally for my build purposes. For example, if I am working on a package A which depends on B, I should be able to build A from source and, if needed, build B, which A can later use for its own build.
4. Support multiple languages (TypeScript, JavaScript, Java, C, Rust, and Go).
5. Have each package in its own repository.
And didn't you find that this created massive headaches trying to build many disparate and inconsistent dependencies across repos? I think the benefits touted from mono-repos are exactly illustrated by the pain points working with Amazon's multi repo setup, in my opinion.
"Refactoring an API that's used across tens of active internal projects will probably a good chunk of a day."
This was my experience.
I’m just curious, but in fairness both of these schemes have obvious issues that will become headaches or positive design depending on your outlook. Clearly you can engineer effectively in either scheme.
Obviously there are many others who do not use a monorepo (Amazon comes to mind), but it's reasonable to claim that monorepos are actually widely used, and fundamental where they are used.
Bors builds one change at a time. On the other hand, Submit Queue speculatively builds several changes at a time based on the outcomes of other pending changes in the system. Apart from that, Submit Queue uses a conflict analyzer to find independent changes in order to commit changes in parallel as well as trim the speculation graph.
We have also evaluated the performance of Single-Queue (the idea behind Bors) on our workloads. In fact, as described in the paper, the turnaround time of this technique at scale was so high (~132x slower) that we omitted its results. Submit Queue, on the other hand, operates in the 1-3x range compared to an optimal solution.
I recommend reading the paper for further details: https://dl.acm.org/citation.cfm?id=3303970
Bors builds multiple changes at once (it creates a merge commit of all available changes and then runs the tests on all of them), and merges if all of them are good.
Possibly you are thinking of the older bors, as opposed to modern bors-ng?
It relies on understanding the inputs and outputs for all CI build steps to work out how changes to particular files might conflict.
Also, it has a much more sophisticated understanding of how likely a change is to be the source of failure, which it updates in response to repeated test runs. It can then prioritise the changes which are most likely to succeed.
Disclaimer: I am one of the Datree.io founders. We provide a visibility and governance solution to R&D organizations on top of GitHub.
Here are some rules and enforcement around Security and Compliance which most of our companies use for multi-repo GitHub orgs:
1. Prevent users from adding outside collaborators to GitHub repos.
2. Enforce branch protection on all current repos and future created ones - prevent master branch deletion and force push.
3. Enforce pull request flow on default branch for all repos (including future created) - prevent direct commits to master without pull-request and checks.
4. Enforce Jira ticket integration - mention ticket number in pull request name / commit message.
5. Enforce proper Git user configuration.
6. Detect and prevent merging of secrets.
Flaptastic will make your CI/CD pipelines reliable by identifying which tests fail due to flaps (aka flakes) and then giving you a "Disable" button that instantly skips any such test, taking effect immediately across all feature branches, pull requests, and deploy pipelines.
An on-premise version is in the works to allow you to run it onsite for the enterprise.
Whenever our team has a significant number of flakey tests (more than 1-2) we usually schedule a bug squash session to fix them and amortize the cost over the whole team.
> monolithic source-code repositories
A monorepo is a monolithic repository
>When an engineer attempts to land their commit, it gets enqueued on the Submit Queue. This system takes one commit at a time, rebases it against master, builds the code and runs the unit tests. If nothing breaks, it then gets merged into master. With Submit Queue in place, our master success rate jumped to 99%.
(I'm one of the authors as well as the tech-lead of the system.)
It still depends on well written tests, lest your confidence be dashed when a human starts pushing buttons and pulling levers.
Also, don't break up tightly coupled code/modules into separate repos for the sake of microservices. Hard working developers will have to do two or more builds, PRs, possibly update semvers, etc... Find the right seams. If two repos tend to always change in lockstep, think about merging.
They have designed this as a result of a need, not just a fancy project.
> Optimistic execution of changes is another technique being used by production systems (e.g., Zuul [12]). Similar to optimistic concurrency control mechanisms in transactional systems, this approach assumes that every pending change in the system can succeed. Therefore, a pending change starts performing its build steps assuming that all the pending changes that were submitted before it will succeed. If a change fails, then the builds that speculated on the success of the failed change needs to be aborted, and start again with new optimistic speculation. Similar to the previous solutions, this approach does not scale and results in high turnaround time since failure of a change can abort many optimistically executing builds. Moreover, abort rate increases as the probability of conflicting changes increase (Figure 1).
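To see why aborts hurt in that scheme, here is a toy model of it. It is my own simplification, not Zuul's or the paper's code: every pending change gets a build that assumes all changes ahead of it pass, a change's result is taken as fixed regardless of what it is built on, and when a change fails everything queued behind it is aborted and rebuilt.

```python
def optimistic_builds(outcomes):
    """Count builds started under fully optimistic speculation.

    outcomes: list of bools in queue order (True = the change is good).
    """
    builds = 0
    pending = list(outcomes)
    while pending:
        # Start one speculative build per pending change.
        builds += len(pending)
        if False in pending:
            first_bad = pending.index(False)
            # Changes ahead of the failure commit, the failed change is
            # rejected, and every build behind it is aborted and restarted.
            pending = pending[first_bad + 1:]
        else:
            pending = []
    return builds


all_good = optimistic_builds([True, True, True, True])     # 4 builds
bad_head = optimistic_builds([False, True, True, True])    # 4 + 3 = 7 builds
```

A single failure at the head of a four-change queue nearly doubles the work, and the waste grows with queue depth and failure rate.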
Most of the complexity and suffering of a submit queue evolves from the interactions between your VCS and CI systems. Keeping things simple is great! Kubernetes' CI system is Prow, which runs the tests as pods in a Kubernetes cluster. Dogfooding like this is great, since the team you're providing CI for can also help fix bugs that arise.
It sounds like Uber's thing has a lot more smarts regarding deciding what gets tested. For the scale I work at (<200k lines of code) that isn't necessary.
Bazel has target caching including remote caching which can be shared across multiple engineers/execution environments. The tricky part would be ensuring your builds are hermetic and reproducible (which is also easier to achieve in monorepo setup).
https://medium.com/netflix-techblog/towards-true-continuous-...
They do have some benefits, but they also come with an immense cost
> This paper introduces a change management system called SubmitQueue that is responsible for continuous integration of changes into the mainline at scale while always keeping the mainline green. Based on all possible outcomes of pending changes, SubmitQueue constructs, and continuously updates a speculation graph that uses a probabilistic model, powered by logistic regression. The speculation graph allows SubmitQueue to select builds that are most likely to succeed, and speculatively execute them in parallel. Our system also uses a scalable conflict analyzer that constructs a conflict graph among pending changes. The conflict graph is then used to (1) trim the speculation space to further improve the likelihood of using remaining speculations, and (2) determine independent changes that can commit in parallel
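A small illustration of the "select builds that are most likely to succeed" part. This is a sketch under loud assumptions: the per-change success probabilities are simply given (the paper estimates them with logistic regression), outcomes are treated as independent, and every outcome is enumerated, which is exponential and only workable for a handful of changes. The function names are mine.

```python
from itertools import product


def speculation_probabilities(success_prob):
    """Probability that each speculative build is the one actually needed.

    success_prob maps pending changes (in queue order) to an estimated
    probability of passing. Assumption: outcomes are independent.
    Returns {(change, predecessors_assumed_merged): probability}.
    """
    changes = list(success_prob)
    builds = {}
    for i, change in enumerate(changes):
        earlier = changes[:i]
        # Enumerate every pass/fail outcome of the changes ahead of this one.
        for outcome in product([True, False], repeat=len(earlier)):
            prob = 1.0
            for prev, passed in zip(earlier, outcome):
                prob *= success_prob[prev] if passed else 1 - success_prob[prev]
            merged = tuple(p for p, ok in zip(earlier, outcome) if ok)
            builds[(change, merged)] = prob
    return builds


def pick_builds(success_prob, budget):
    """Choose the `budget` speculative builds most likely to be needed."""
    builds = speculation_probabilities(success_prob)
    return sorted(builds, key=builds.get, reverse=True)[:budget]


probs = {"C1": 0.9, "C2": 0.8, "C3": 0.5}
chosen = pick_builds(probs, budget=3)
# [("C1", ()), ("C2", ("C1",)), ("C3", ("C1", "C2"))]
```

With a budget of three builds and fairly reliable changes, the picks are the fully optimistic ones; lower the probability of `C1` and builds that assume `C1` fails start to win a slot.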
Two code-conflict-free changes may pass a pre-merge build+test cycle independently but may logically break one another if both changes are merged into master. Using a submit/merge queue guarantees that each change has passed tests with the exact ordering of commits it would be merged onto. The example described here is a better explanation: https://github.com/bors-ng/bors-ng#but-dont-githubs-protecte...
The fancy bits in this implementation from the paper are interesting but the model itself is not that unusual.
I guess I just have a hard time imagining how many busy developers really commit important work all at once on large projects...
The next step is to serialize all proposed changes, so they are rebased one on top of the other before running tests. This eliminates breakage due to merging, but does not scale:
> The simplest solution to keep the mainline green is to enqueue every change that gets submitted to the system. A change at the head of the queue gets committed into the mainline if its build steps succeed. > > This approach does not scale as the number of changes grows. For instance, with a thousand changes per day, where each change takes 30 minutes to pass all build steps, the turnaround time of the last enqueued change will be over 20 days.
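The arithmetic behind that figure, spelled out (the function name is mine, and it assumes one build runs at a time with no failures or retries):

```python
def last_change_turnaround_days(changes_per_day: int, build_minutes: float) -> float:
    """Days until the last of one day's changes lands in a strictly serial queue."""
    total_minutes = changes_per_day * build_minutes
    return total_minutes / (24 * 60)


days = last_change_turnaround_days(1000, 30)  # 30000 min / 1440 min per day, about 20.8
```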
This paper is about scaling a variant of such queue.
But sure enough, we definitely weren't the first to go down this path. Facebook was using (or developing the tech for) server-side rebasing in 2015.[1] Gitlab provides native server-side rebase functionality, likely inspired by various parties already having developed tools to do the same.
These aren't new ideas. But handling them at the scale where you land hundreds or even thousands of commits a day to a repo and require the ability to deploy at will, that's where engineering comes into play.
0: https://smarketshq.com/marge-bot-for-gitlab-keeps-master-alw...
1: https://softwareengineering.stackexchange.com/questions/2787...
If you have 1 app that's the bread and butter of your company, and 60% of your 2000+ engineers are working on various features of that one app, then even in a multi-repo world you are going to have that 1 repo receiving a ton of commits, and the problem of keeping it green remains. Probabilistic speculation helps there.
If something is consistently failing I would assume this tool does not disable it.
You also would need to do that as an atomic operation (in the mono-repo + especially with a commit queue you're building on the atomicity of git).
Having to unwind that transaction if you aren't atomic can get you into a big mess at larger scales.
Here's a good related talk: C++ as a "Live at Head" language: https://www.youtube.com/watch?v=tISy7EJQPzI by Titus Winters (from Google).
Currently, I'm not convinced that you need to track and apply commits across dependencies and dependent systems atomically/transactionally to have a sane build environment even at scale, but you definitely get that part free with monorepo.
Any links to docs or presentations that address that specific issue would be very welcome :-)
This holds true when A and B are leaf repos, but gets tricky with repos inside a dependency graph. More concretely, if C depends on both A and B, and it turns out that C depends on A and B in such a way that A_bob and B_sally are mutually incompatible, you need some kind of mechanism for reconciling that.
Of course, exactly as you point out, mono and multi are two tradeoffs for the problems that large codebases intrinsically are.
The concept of an evergreen master with testing done in branches, followed by automated merges/rebases is not special. Quite a few companies have been doing it for years, it's the off-the-shelf tooling and subsequent publicity that haven't necessarily been around as long.
As for OP's material? The automated conflict resolution via reordering to optimise parallelism - that certainly feels novel.
Imagine I have three changes, C1 modifies F1, C2 modifies F2, and C3 modifies F1. There's no relation between F1 and F2.
At a low-ish rate of submission, you test and commit C1, then test and commit C2, then when you try to test and commit C3, you rebase, re-test, and commit (the merge doesn't conflict, so it can be fixed automatically).
Now assume all three changes are submitted by 3 different engineers in the span of a minute, and the engineers don't want to rebase manually. The time between changes is now less than the rebase/build/submit time!
So you have a tool that queues up the changes, and for each change you:
1. Rebase onto current head
2. Build with the new changes
3. Submit
But that's still really slow, since everything is sequential. If my change takes ~30m to test, it blocks everyone else who depends on my change.
So OK, do things in parallel: build and test C1, C1+C2, and C1+C2+C3. Then, as soon as C1 is finished testing, you can submit all 3. There are still 2 problems though: C2 is unreasonably delayed, and "what if C1 is broken".
So, if C2 and C1 don't conflict, you can actually just submit C2 before C1 even though the request to submit was made after. But when there really is a dependency, like C3 and C1, the question is, do I build and test {C1, C1+C3}, {C3, C3+C1}, or something else. SubmitQueue appears to try and address that question. "Given potentially conflicting changes (not at a source level but at a transitive closure level), how do I order them so that the most changes succeed the fastest, assuming some changes can fail, and I have enough processing power to run some, but not all, permutations of changes in parallel"?
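The C1/C2/C3 example can be written down as a tiny conflict graph. This is only a sketch: file overlap stands in for the transitive-closure analysis mentioned above, and `conflict_graph` and `commit_waves` are names I made up. It does not attempt the probabilistic ordering, only the "which changes are independent" part.

```python
def conflict_graph(touched):
    """Map each change to the set of changes that share a file with it.

    Assumption: sharing a file stands in for a build-target-level conflict.
    """
    return {
        a: {b for b in touched if b != a and touched[a] & touched[b]}
        for a in touched
    }


def commit_waves(order, touched):
    """Group changes into waves that can be built and committed in parallel.

    Changes in the same wave never conflict with each other, and a change
    always lands in a later wave than any earlier change it conflicts with.
    """
    graph = conflict_graph(touched)
    wave_of = {}
    for i, change in enumerate(order):
        earlier = [wave_of[c] for c in order[:i] if c in graph[change]]
        wave_of[change] = 1 + max(earlier, default=-1)
    waves = [[] for _ in range(1 + max(wave_of.values(), default=-1))]
    for change in order:
        waves[wave_of[change]].append(change)
    return waves


touched = {"C1": {"F1"}, "C2": {"F2"}, "C3": {"F1"}}
waves = commit_waves(["C1", "C2", "C3"], touched)  # [["C1", "C2"], ["C3"]]
```

C2 no longer waits behind C1, and only C3 has to be ordered against C1.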
- changeset A is submitted, an integration branch is cut from latest master, and CI begins
- changeset B is submitted, an integration branch is cut from latest master, and CI begins
- changeset A's integration branch passes CI build/test, so A is merged into master
- changeset B's integration branch passes CI build/test, so B is merged into master
- however, changeset A + B interact in such a way that causes build and/or tests to fail
- build is now broken
You're probably thinking "that sounds like it wouldn't happen very often. Both changes would need to be submitted within some window such that changeset B's integration branch does not include changeset A, and vice-versa". Which is correct, but that's where the scale comes in. With enough engineers this starts happening more, and the more engineers you have the more unacceptable it is to have the build broken for any amount of time. And the more engineers the more code you have so the longer any individual build starts taking which lengthens the window during which the two conflicting changes could be submitted.
You need to do it in a way that serializes the changes because that's the only way to prevent this, but that takes too long. So the paper is about how to solve this problem.
Uber has navigation, route optimization, queuing, etc. Facebook has to propagate activity out to massive and complex social network graphs.
I'm not discounting the toughness of operating at Airbnb's scale, but from my limited understanding it seems like they are not solving a new problem.
Every problem you could have with bad dependencies is entirely self-inflicted. The Right Thing™ is to choose a known-good version, and update when you have the bandwidth to pay down the tech debt.
What they are describing here is to detect if items do not conflict beyond a simple merge conflict and build & commit them simultaneously, increasing the throughput of the submit queue system.
Just to clarify, the ML models are used to predict the prob. that a given change will succeed against master as well as the prob. of conflict between changes.
With this in mind, what Bazel does when a test is marked flaky is run it several times. This is a simple way of minimizing the effect of flakiness while still getting confidence from green tests.
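The retry idea itself fits in a few lines. This is a generic sketch of "run it several times", not Bazel's code; `run_flaky` and its default of three attempts are parameters of the sketch.

```python
def run_flaky(test, attempts=3):
    """Run `test` up to `attempts` times; pass if any attempt passes.

    Returns (passed, attempts_used) so that a flaky pass stays visible
    instead of being indistinguishable from a clean pass.
    """
    for attempt in range(1, attempts + 1):
        try:
            test()
            return True, attempt
        except AssertionError:
            continue
    return False, attempts
```

Reporting the attempt count matters: a test that only passes on the third try is green, but it is still worth fixing.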
Our solution lets someone know that a test failed because it's flaking, as soon as it flakes, and provides a 1-click option to instantly disable that test across all feature branches so that everybody else can continue working undisturbed.
Without something like this, you have to:
1. Create a new feature branch
2. Comment out the broken test
3. Wait for it to pass CI
4. Gain approvals as needed
5. Merge the PR back to the master line
6. Message everybody to let them know the test was removed and they should rebase
The process above is sort of the industry standard and this means a giant loss in productivity for everybody on your team and is especially painful for monolith codebases.
Companies where I've worked easily hemorrhage $1m per year on this problem in terms of developer productivity losses if you consider the number of hours wasted per year.
Integration tests are nice, but best if run separately...
There's always a window where both will be in use, because we can't synchronously replace every running process everywhere (not that it's even a good idea without a canary). The shorter you try to make that window, the more needless pain is created and plans disrupted. While we could use prod to beta test every single version of everything, that shouldn't be our priority.
In this case you'll have a very short transition, between all consumers updating their client code (possibly in a backwards compatible way) and the change in the implicated system being deployed, not the other way around.
Introduce a breaking change into a common library and now you have to update every other dependency to support it.
Not so bad in a monorepo. But when your codebases are distributed?