Monorepo: please do (medium.com)
I'd say that open-source best practices for shared libraries are appropriate if you're making an open-source shared library. However, these practices are inappropriate for internal libraries, proprietary libraries, and other use cases. In my experience, it's also far from "problem solved". You can point your finger at semantic versioning but in the meantime we go through hell and back with package managers trying to manage transitive library dependencies and it SUCKS. Why, for example, do you think people are fed up with NPM and created Yarn? Or why people constantly complain about Pip / Pipenv and the like? Why was the module system in Go 1.11 such a big deal? The answer is that it's hard to follow best practices for shared libraries, and even when you do follow best practices, you end up with mistakes or problems. These take engineering effort to solve. One of the solutions available is to use a monorepo, which doesn't magically solve all of your problems, it just solves certain problems while creating new problems. You have to weigh the pros and cons of the approaches.
In my experience, the many problems with polyrepos are mostly replaced with the relatively minor problems of VCS scalability and a poor branching story (mostly for long-running branches).
Why do you say so?
I agree share library style makes more sense in most cases though. The main problem with it is forcing everyone to use the latest library versions but that isn't insurmountable by any means.
Personally, I think there’s a place for monorepos and there’s a place for smaller independent repos. If a project is independent and decoupled from the rest of the tightly coupled code base (for instance, things which get open-sourced), it makes no sense to shove it into a huge monorepo.
SVN was first released in 2000, Git in 2005. Back then, branching, tagging and diffing were nowhere near what is possible now.
That goes back to desktops with a disk smaller than a GB, a CPU in the tens of MHz, and a network so slow and unreliable, if you had one at all.
This effort makes a lot of sense if your consumers are complete strangers who work for other organizations. If your consumers are in the same organization, then there are easier ways to achieve similar benefits. See Conway’s Law. It’s not an accident that code structure reflects the structure of the organization that created it; I would claim that organizational boundaries should be reflected in code. Introducing additional boundaries between members of the same organization should not be done lightly.
One of the main benefits of version numbers is that it tells your consumers where the breaking changes are, but if you have direct access to your consumers’ code and can commit changes, review them, and run their CI tests, then you have something much better than version numbers. If you are running different versions of various dependencies you can potentially have a combinatoric explosion of configurations. Then there’s the specter of unknown breaking changes being introduced into libraries. It happens, you can’t avoid it without spending an unreasonable amount of engineering effort, but the monorepo does make the changes easier to detect (because you can more easily run tests on downstream dependencies before committing).
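To put a rough number on that "combinatoric explosion": if each internal library is allowed to exist in several versions at once, the upper bound on distinct configurations is the product of the version counts. A minimal sketch, where the library names and version lists are made up for illustration:

```python
from math import prod


def configuration_count(versions_in_use):
    """Upper bound on distinct dependency configurations.

    versions_in_use maps a library name to the versions still in use
    somewhere in the organization.
    """
    return prod(len(versions) for versions in versions_in_use.values())


# Hypothetical libraries and versions, purely illustrative.
in_use = {
    "libauth": ["1.2.0", "1.3.1", "2.0.0"],
    "libstorage": ["0.9.0", "1.0.0"],
    "librpc": ["3.1.0", "3.2.0", "4.0.0", "4.1.0"],
}

print(configuration_count(in_use))  # 3 * 2 * 4 = 24
```

In practice not every combination is actually deployed, so this is a ceiling rather than a measurement, but it shows how quickly the test matrix can grow with only three libraries.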
Cross-cutting changes are also much more likely for certain types of projects. These are difficult with polyrepos for obvious reasons (most notably, the fact that you can’t do atomic commits across repos).
Packaging systems also have administrative overhead. If you shove everything in a monorepo you can ditch the packaging system and spend the overhead elsewhere. These days it’s simple enough to shove everything in the same build system.
Various companies that I’ve worked for have experimented with treating internal libraries the same way that public libraries are treated—with releases and version numbers. Most of them abandoned the approach and reallocated the effort elsewhere. The only company that I worked for that continued to use internal versioning and packaging was severely dysfunctional. One startup I worked for went all in on the polyrepo approach and it was a goddamn nightmare of additional effort, even though there were only like three engineers.
> One of the main benefits of version numbers is that it tells your consumers where the breaking changes are, but if you have direct access to your consumers’ code and can commit changes, review them, and run their CI tests, then you have something much better than version numbers.
A small peeve of mine: Semver and version numbers generally are lossy compression. They try to squeeze a wide range of information into a very narrow space, for no other reason than tradition.
I’m also completely baffled by your statement that “simply avoiding them doesn’t make it better.” Reading that statement, I can only feel that I have somehow failed to communicate something and I’m not really sure what, because it seems obvious to me why the premise of this statement is wrong. When you avoid performing a certain task, like releasing software, which costs some number of work hours, you can reallocate those work hours to other tasks. It’s not just that the tasks of releasing and versioning stop happening; you also get additional hours to accomplish other things which may be more valuable. So it’s never an issue of “simply avoiding” some task. At least on functional teams, the issue is choosing between alternatives.
And it should also be obvious that cutting discrete releases for internal dependencies is not an absolute requirement, but a choice that individual organizations make depending on how they see the tradeoffs or their particular culture.
There really are many different ways to develop software, and I’ve seen plenty of engineers get hired and completely fail to adapt to some methodology or culture that they’re not used to. The polyrepo approach with discrete releases cut with version numbers and changelogs is a very high visibility way of developing software and it works very well in the open source world, but for very good reasons many software companies choose not to adopt these practices internally. It’s very sad when I see otherwise talented engineers leave the team for reasons like this.
It's true that monorepos without proper discipline can tend towards coupling, and when discussing mono vs. poly we should keep this in mind.
I don't know how you maintain that arm's length separation if you don't have compilation units in your language of choice, and that may contribute to some of the muddiness in this kind of discussion. "It depends."
- [x] Namespaces and the like without much security benefit
- [x] Giant Java dependency
- [x] Strange syntax and glyphs
I feel like if you are working completely in the open-source world, and you are contributing one open-source project to a larger array of available projects, then the decision to use a polyrepo makes a lot of sense. You can submit libraries to a package repository like Yarn/NPM/PyPI or you can use Git references for e.g. Go's package manager.
But what I experienced with polyrepos outside this world is that we ended up with a weird DAG of repos. It was always unclear whether a specific piece of code that was duplicated between projects should be moved into one dependency or another, or whether it should have its own repo. Transitive dependencies were no fun at all: if you used Git submodules you might end up with two copies of the same dependency. You might have to make a sequence of commits to different repos, remembering to update cross-repo references as you go, and if you got stuck somewhere you had to work backwards. This feels like a step backwards, like the step backwards from CVS to RCS.
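The "two copies of the same dependency" situation is easy to detect mechanically, even if it is painful to fix. Here is a minimal sketch, assuming each repo records its direct dependencies as a name-to-pinned-revision mapping. The repo names and revisions are invented, and the walk is simplified: it follows dependencies by name only and ignores the fact that different revisions of a repo may pin different things.

```python
from collections import defaultdict


def duplicated_dependencies(repos, root):
    """Return dependencies reachable from root that are pinned at more
    than one revision.

    repos maps a repo name to {dependency_name: pinned_revision}.
    """
    seen_revisions = defaultdict(set)
    visited = set()
    stack = [root]
    while stack:
        name = stack.pop()
        if name in visited:
            continue
        visited.add(name)
        for dep, revision in repos.get(name, {}).items():
            seen_revisions[dep].add(revision)
            stack.append(dep)
    return {
        dep: sorted(revisions)
        for dep, revisions in seen_revisions.items()
        if len(revisions) > 1
    }


# Hypothetical repos: app pulls libutil in twice, via libui and libnet.
repos = {
    "app": {"libui": "a1b2c3", "libnet": "d4e5f6"},
    "libui": {"libutil": "111aaa"},
    "libnet": {"libutil": "222bbb"},
    "libutil": {},
}

print(duplicated_dependencies(repos, "app"))
```

Here `libutil` shows up at two revisions, which is the diamond that then has to be resolved by hand with a sequence of commits across repos.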
Again, in the open-source world you might have some of this taken care of by using a package manager like Yarn. But if your transitive dependencies aren't suitable for being published that way, it can be tough. Monorepo + Bazel right now is a bit rough around the edges but overall it's reduced the amount of engineering time spent on build systems.
On the other hand, it's not like Bazel can't handle polyrepos. In fact, they work quite nicely, and Bazel can automatically do partial checkouts of sets of related polyrepos, if that's your thing.
As for VCS scalability problems, I expect that Git is really just the popular VCS du jour and some white horse will show up any day now with a good story for large, centralized repos with a VFS layer. In the meantime any company large enough to experience VCS performance problems but not large enough to have their own VCS team (like Google and Facebook) will suffer, or possibly pay for Perforce.
If your project is mostly something like C++ (which has support built-in to Bazel) then the WORKSPACE rules will be much more manageable and partial checkouts become a lot easier.
I'd be more interested to read about a project or company that failed due to making one choice or the other. And then by switching things to the other way, things were fixed.
Otherwise, as someone who has worked with both, I imagine there are a host of other decisions that will be much more determinant of your success.
Let's not get too wrapped up in what color to paint the shed.
Please don't do this.
Of course the Monorepo is not free of downsides, those mentioned in the original article are real, although a bit exaggerated in my opinion. VCS operations can be slow and scaling a VCS system is challenging, but possible. And the risk of high coupling and a tangled architecture is also very real if you don't use a dependency management system like Bazel/Buck/Pants.
But in my opinion the downsides of the Polyrepo are much worse and much much harder to fix. The main problem is that you need a parallel version control system like SemVer on top of your VCS. SemVer is fine for open source projects but for a dynamic organization is a nightmare because it is a manual process prone to failure. SemVer dependency hell is really hard to deal with and creates a lot of technical debt.
Additionally, once you go Polyrepo you lose true CI/CD. Yes, you still have CI/CD pipelines but those apply only to a fraction of the code. Once you get used to running `bazel test` and knowing that you will run every single test of any piece of code that could depend on the code you just changed, you never want to go back. Yes, you could have true CI/CD with Polyrepos, but it requires a lot of work and writing a lot of tooling that does not exist in the wild. It is cheaper to invest in scaling your VCS in a monorepo.
If we had the tooling to do multirepo atomic commits and reviews then maybe we would have stuck with polyrepos, but it doesn't really exist out in the wild, so monorepo it was.
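The idea behind "run every test of anything that could depend on what I changed" boils down to a walk over the inverted dependency graph. The sketch below is not how Bazel implements it; it is just the underlying traversal, with made-up target labels and a hand-written graph standing in for what a real build tool would compute.

```python
from collections import defaultdict, deque


def affected_targets(deps, changed):
    """Return the changed targets plus everything that transitively
    depends on them.

    deps maps a target to the list of targets it depends on directly.
    """
    # Invert the graph: dependency -> targets that depend on it.
    dependents = defaultdict(set)
    for target, direct_deps in deps.items():
        for dep in direct_deps:
            dependents[dep].add(target)

    seen = set(changed)
    queue = deque(changed)
    while queue:
        node = queue.popleft()
        for dependent in dependents[node]:
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


# Hypothetical targets, purely illustrative.
deps = {
    "//lib/auth": [],
    "//lib/storage": ["//lib/auth"],
    "//app/web": ["//lib/storage"],
    "//app/cli": ["//lib/auth"],
    "//app/docs": [],
}

print(sorted(affected_targets(deps, ["//lib/auth"])))
# ['//app/cli', '//app/web', '//lib/auth', '//lib/storage']
```

A change to `//lib/auth` pulls in both apps, while `//app/docs` is left alone. In a polyrepo the same walk needs the dependency edges of every repo collected in one place first, which is the tooling that tends not to exist.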
Both monorepos and polyrepos have advantages and disadvantages. Many factors — scale, overall team quality and experience, level of integration between projects are a few that come to mind — will affect how much those advantages and disadvantages matter to any given company at any given point in time. The right choice for you isn't necessarily the right choice for me.
Much more important than which approach you choose is understanding, and accepting, the consequences of your choice. You'll want to extract value out of the advantages, you'll need to mitigate the disadvantages. You won't be able to adopt tools and processes meant for the other approach without some degree of friction.
All this forcing people to do things the Right Way (my way) is surely part of the pushback against monorepos.
But set that aside for the moment. Let's suppose defaults should force people to do things the Right Way, and that we also know what the Right Way is.
Instead of letting anyone sloppily depend on any code checked into the monorepo, shouldn't we force people to think long and hard about contracts between components -- the default concern in a polyrepo architecture? When and how to make contracts, when and how to break contracts? Isn't this how Amazon moved past their monorepo woes, adopted SOA, built AWS, and became one of the largest companies on earth? Heck, isn't this how the Internet itself was built?
Fighting back against monorepo design is dangerous - embrace experimentation.
Git has had a sparse checkout feature for a long time, but it only affected the checkout itself; all the blobs would still be synced.
Now, Git is gaining good monorepo capabilities with the partial clone feature [1]. The idea is that with it you can clone only the parts of a repository that are interesting to you. This has been brewing for a while already but I'm not sure how ready it is. There doesn't seem to be user-level documentation for it yet, to my knowledge, so I am linking to the technical docs.
[1]: https://github.com/git/git/blob/master/Documentation/technic...
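For what it's worth, in more recent Git releases the two features can be combined from the command line. The sketch below only builds the command lines (it does not run them), and it assumes a Git version that supports `clone --filter`, `clone --sparse` and the `sparse-checkout` subcommand. The URL and directory names are placeholders.

```python
import shlex


def partial_clone_commands(url, dest, directories):
    """Build git invocations for a blob-less, sparse clone of a few
    directories. Returns a list of argv lists; nothing is executed."""
    return [
        # Skip downloading file contents up front and start with a
        # minimal working tree.
        ["git", "clone", "--filter=blob:none", "--sparse", url, dest],
        # Then materialize only the directories of interest.
        ["git", "-C", dest, "sparse-checkout", "set", *directories],
    ]


# Placeholder URL and paths, purely illustrative.
commands = partial_clone_commands(
    "https://example.com/big/monorepo.git",
    "monorepo",
    ["services/api", "libs/common"],
)
for argv in commands:
    print(shlex.join(argv))
```

Whether this is fast enough at a given repo size still depends on the server and on how much history and how many trees are involved, so treat it as a starting point rather than a solved problem.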
You can certainly achieve this with Perforce, SVN, HG, any repo system there too.
Linux: FUSE + ?
Windows: Dokan? CBFS? Or the newfangled https://docs.microsoft.com/en-us/windows/desktop/projfs/proj... which VFSForGit uses
Let me give a concrete example. The Android open source project (AOSP) which builds the system of Android devices has the code size close to the scale of tens of GB (let alone all the histories!). It is already a massive monorepo in itself. And typically you would have many of them from different OEM/SoC vendors of different major releases. In such a scenario, it would turn into 'a monorepo of monorepos,' which is quite unpleasant to imagine.
With 100 engineers a monorepo might seem a good idea. With 500 it becomes nearly impossible to do anything involving a build. Some isolation is needed.
Also from my experience many engineers just don't give a shit about architecture. They create entangled mess, that kind of works for the customer, and go home. Without some enforced isolation it is impossible to maintain it.
That being said I am more inclined to polyrepos.
Today, not quite. I work for a multi billion dollar tech company and we have several thousand repos (and it's awesome)
Both FB and Google have more than 500 devs and are using a monorepo.
This would help people working on smaller apps, since they don't need to look at other apps unless they're working on shared library code.
Of course, once you are working on library code, you have to build and test all the apps that use it. But even at Google, the people working on the lowest levels of the system can't use the standard tools anyway.
I don't see why you'd need semver. The apps could sync to a particular commit in the library repo.
More to the point, as the author of TFA allows, once a system reaches a certain size, nobody can understand it all. At some point you have to engage division of labor /specialization, and once you do that, it doesn't make sense to have just anybody randomly making changes in parts of the code-base they don't normally work in.
I'd rather see a poly-repo approach, with a designated owner for discrete modules, but where anybody can clone any repo, make a proposed fix, and submit a PR. Basically "internal open source" or "inner source"[1].
In my experience, this is about as close as you can get to a "best of both worlds" situation. But, as the author of TFA also says, you absolutely can make either approach work.
Unfortunately some of the most popular CI/CD services out there (Travis, Circle, etc.) don't even support cross-repo pipelines, much less monorepo builds.
Those both look way more in the weeds than what I would have imagined. I guess for Bazel at least it makes sense, given Google's scale, how fine-grained they would get into caching and incremental builds.
For my needs a simple tool that would allow discovering "WORKSPACES" and constructing a build graph based on what's changed, while handing off the actual building to some entry point in the workspace, would be good enough. I have a weird collection of Gradle projects, node projects, test suites, docs, etc., with their own build processes already in place.
Some things are also on a "critical" path while others can run async given the context(branch, tag, etc)...
I'm rambling though.
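The first half of that "simple tool" (figure out which workspaces a change touches) is small enough to sketch. This assumes the workspace roots have already been discovered, for example by looking for some marker file, and that changed paths come from the VCS; both inputs here are invented. Ordering the resulting workspaces and handing off to each one's own entry point is left out.

```python
from pathlib import PurePosixPath


def changed_workspaces(workspace_roots, changed_files):
    """Map each changed file to the deepest workspace root containing it.

    Files outside every workspace are ignored.
    """
    # Check deeper roots first so nested workspaces win over their parents.
    roots = sorted(
        (PurePosixPath(root) for root in workspace_roots),
        key=lambda path: len(path.parts),
        reverse=True,
    )
    touched = set()
    for name in changed_files:
        path = PurePosixPath(name)
        for root in roots:
            if root in path.parents:
                touched.add(str(root))
                break
    return touched


# Hypothetical layout: a Gradle service with nested docs, plus a node app.
roots = ["services/api", "services/api/docs", "web"]
changed = ["services/api/docs/index.md", "web/package.json", "README.md"]

print(sorted(changed_workspaces(roots, changed)))
# ['services/api/docs', 'web']
```

The docs change is attributed to the nested docs workspace rather than the whole service, and the top-level README triggers nothing. The "critical path vs. async" distinction would be a property attached to each workspace on top of this.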
I find it enjoyable how plenty of comments both here and in the other discussion are of people saying "We had a mono/polyrepo and things improved tremendously when we migrated towards a poly/monorepo". The issue might be one of growth and complacency: a drastic change like that forces the team to face the technical debt that was being ignored and do a better implementation using what was learned from past mistakes.
Perhaps the fact that since their level was now higher, they wouldn't have to deal with the nitty gritty details and pain of working with a monorepo as a developer?
E.g. I wasn't for it when I was a dev, but now that I can just impose it on others, I love it. Same with how various 'development process' rituals are adopted...
How does the library team know which consumers a commit may break? What tools are recommended?
With a monorepo, the basic effort you have to put in to start scaling is quite high. To properly do a local build, you need bazel or something. But bazel doesn't stop at just building, but it manages dependencies all the way down to libraries and stuff. Let's say you're using certain maven plugins, like code coverage, shading, etc. Would bazel have all the build plugins your project needs? Most likely not. You have to backport a bunch of plugins from maven to bazel and so on. Guess how many IDEs support bazel? Not a lot.
Then you need to run a different kind of build farm. When you check-in stuff to a monorepo, you need to split and distribute one single build. Compared to a polyrepo where one build == one job, a monorepo is like one build == a distributed pool of jobs, which again needs very deep integration with the build tool (bazel again here), to fan out, fan in across multiple machines, aggregate artifacts, and so on.
Then the deployment. Same again. There is no "just works" hosted CI or hosted git or anything for monorepos. People still dabble with concourse or so on.
And guess what, for a component in its own repo, you don't need to do anything. Existing industry and OSS tooling is built from ground up for that. Just go and use them.
To provide a developer a "basic experience" to go from working on, building and deploying a single component – the upfront investment you need to provide with a monorepo is very high. Most companies cannot spend time on that, because scale means different things to different companies. There is a vast gap in the amount of ops/dev tooling you have for independent hosted components vs monorepo tools. Just search for "monorepo tools" or DAG and see how many you can come up with. So what really happens with a monorepo is, most companies go with multi-module maven and jenkins multi-job. The results are easy to predict. I'm not saying that maven/jenkins are bad, but they are _not_ sophisticated, and are not anywhere close to what Twitter/Facebook/Google or any modern company uses to deal with a monorepo (for a good reason). They are just not good at DAG. If you're relying on maven+jenkins as your monorepo solution, all I can say is "good luck".
Instead, if you start by putting one component in one repo, you keep scaling for _much longer_ before you hit a barrier.
In principle, monorepos are better. In practice, they don't have the basic "table stakes" tooling that you need to get going. Maybe monorepo devops tooling is a next developer productivity startup space. But until then, it's not mainstream for very good reasons.
How do the "global build tools" play with language specific build tools?
My primary stack is Rust and Scala. Both have excellent build capabilities in their native tools. How well do pants/bazel integrate with them? I wouldn't want to rewrite complex builds nor would I expect these tools to have 100% functionality of native ones.
I know the Scala rules are used in production by multiple companies. Rust support is improving quickly, but it's not perfect. See the dedicated GitHub repositories for more information.
(I work on Bazel)
Maybe you can clear my confusion. If Module B is dependent on Module A, then every version of B should refer to a specific version of A, correct? What is there to break? Development can continue on A without interfering with B, and then you can uptick B once it points to a later A.
I'm not sure what this has to do with the mono/poly discussion.
To avoid that, you do 10 migration commits so everyone is on the latest version. If you're going to do that as standard operating procedure anyway might as well make it far easier and have a monorepo.
Adding additional PRs across different repos is functionally no different than the same PR with scattered dependencies in a monorepo, except that separating the PRs makes each isolated set of changes more atomic and focused, which has led to fewer bugs and better quality code review and, the hugest win, each repo is free to use whatever CI & deployment tooling it needs, with absolutely no constraints based on whatever CI or deployment tool another chunk of code in some other repo uses.
The last point is not trivial. Lots of people glibly assume you can create monorepo solutions where arbitrary new projects inside the monorepo can be free to use whatever resource provisioning strategy or language or tooling or whatever, but in reality this is not true, both because there is implicit bias to rely on the existing tooling (even if it’s not right for the job) and because monorepos beget monopolicies where experimentation that violates some monorepo decision can be wholly prevented due to political blockers in the name of the monorepo.
One example that has frustrated me personally is when working on machine learning projects that require complex runtime environments with custom compiled dependencies, GPU settings, etc.
The clear choice for us was to use Docker containers to deliver the built artifacts to the necessary runtime machines, but the whole project was killed when someone from our central IT monorepo tooling team said no. His reasoning was that all the existing model training jobs in our monorepo worked as luigi tasks executed in hadoop.
We tried explaining that our model training was not amenable to a map reduce style calculation, and our plan was for a luigi task to invoke the entrypoint command of the container to initiate a single, non-distributed training process (I have specific expertise in this type of model training, so I know from experience this is an effective solution and that map reduce would not be appropriate).
But it didn’t matter. The monorepo was set up to assume model training compute jobs had to work one way and only one way, and so it set us back months from training a simple model directly relevant to urgent customer product requests.
Had we been able to set this up as a separate repo where there were no global rules over how all compute jobs must be organized, and used our own choice of deployment (containers) with no concern over whatever other projects were using / doing, we could have solved it in a matter of a few days.
In my experience, this type of policy blocker is uniquely common to monorepos, and easily avoided in polyrepo situations. It’s just a whole class of problem that rarely applies in a polyrepo setting, but almost always causes huge issues with monorepo policies and fixed tooling choices that end up being a poor fit for necessary experiments or innovative projects that happen later.
Hear, hear. Let teams choose the processes and tools that work best for them. In previous release engineering positions, I resisted the many attempts to introduce a single standard workflow for all projects. The support burden of letting a thousand flowers bloom was not great, but the benefit was that devs understood their project and were empowered to make changes when the business requirements changed faster than standardized tooling could.
We had a few contracts for standard behaviours, but they were low-overhead: must respond to 'make/make test', have a /status endpoint that 500'd when it was unhealthy, register a port in the service conf repo, etc.
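As an illustration of how cheap such a contract can be, here is a minimal sketch of a `/status` endpoint that returns 500 when the service is unhealthy, using only the Python standard library. The `healthy` flag, the port and the response bodies are assumptions made for the example; the original contract only specified the endpoint and the 500.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def status_code(healthy):
    """The contract: 200 when healthy, 500 when not."""
    return 200 if healthy else 500


class StatusHandler(BaseHTTPRequestHandler):
    # Hypothetical health flag; a real service would compute this from
    # its own checks (database reachable, queue draining, and so on).
    healthy = True

    def do_GET(self):
        if self.path != "/status":
            self.send_error(404)
            return
        code = status_code(self.healthy)
        body = b"ok\n" if code == 200 else b"unhealthy\n"
        self.send_response(code)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def main(port=8080):
    """Serve the status endpoint on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), StatusHandler).serve_forever()
```

Calling `main()` starts the server. Anything that can answer this one request fits the contract, regardless of language, build tool or repo layout, which is the point of keeping the contract this small.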
It makes it less atomic if you need simultaneous changes in multiple repositories.
> Had we been able to set this up as a separate repo where there were no global rules over how all compute jobs must be organized, and used our own choice of deployment (containers) with no concern over whatever other projects were using / doing, we could have solved it in a matter of a few days.
I think this was an organisational problem, but I accept the argument that monorepos will provide a seed around which such pathologies can crystallise. But I don't believe it's the only such seed and I don't think it's an inevitable outcome from monorepos.
Unless you mean your presubmit test would push to production machines, that's bad and shouldn't be allowed, but again has nothing to do with a monorepo.
A company could just as easily have draconian policy about testing and deployment and multiple repos. Maybe you could break the rules (hell you could have broken the rules in monorepo land), but again, that's just a rules issue, not an issue of the repository.
It's not that there's a single right way to do it. There isn't, and anyone who tells you there is has something to sell you, or is inexperienced enough to not have seen enough of the problem domain.
What is for certain: teams need to have tooling that causes the conversations and behavior that lead to the outcomes we want. As systems and teams scale large enough, this tooling becomes essential - without it, teams go their own way, and in so doing, may or may not create the culture needed for the outcomes you want.
I have never once in my career, so far, had to tell a team to communicate less. When we're talking about engineering organizations that are large enough to diverge, you must solve these problems somehow, and it needs to be systemic and intentional.
Your post puts a lot of the onus on A for breaking B, C, and D, but I think equal care and consideration needs to come from the other side of the contract. Eg, What are you depending on? Is it a dependency you want to take on, or are you and the shared code likely to diverge in life? These are top of mind decisions in a polyrepo architecture, but from my experience they're often not even considered in a monorepo. Anything checked in is fair game for reuse. This is why I suspect you may be "forcing" the wrong thing.
For reference I've worked in companies large and small, both monorepo and polyrepo. When I worked on Windows back in the 00's the monorepo tooling (SourceDepot) was quite amazing for the time, but the costs of that sort of coordination were also painfully apparent to everyone.
The place I currently work has a monorepo for desktop software and polyrepos for everything else. It isn't a straight up A/B experiment, but anecdotally the pain is higher and shipping velocity lower in the monorepo half of the world. Most of the monorepo pain is related to CI or other costs of global coordination, the kind of things Matt touches on midway (albeit probably too subtly). I'd be interested to see your counterarguments to those points as well. Do you need fancy dependency management tooling to make your global CI builds fast and reproducible? Matt argues those end up being equivalent to the kind of dependency tooling that's intrinsic to polyrepo architectures anyway.
What's dangerous about it? Monorepos have a lot of benefits, and should absolutely be considered. Maybe even by most. But right now in the community it's almost pushed as the "only true way with all benefits and no drawbacks", and that's absolutely not true. To the point the knowledge of why and how to poly repo is already starting to get lost.
That's dangerous.
What do you even mean by "dangerous"? To a business? To your health?
What is the deal with people trying to make these sorts of global assertions in a vacuum about what's "good" and "bad"? This doesn't make any engineering sense in any way to me. You have a problem and you figure out the best way for your business to solve that problem given some bounded resources. Nothing in the basic problem solving process (scientific method?) necessitates all the arbitrary "should" axioms. Why don't people just analyze their specific situation and figure out a solution?
It's like people arguing vehemently about the optimal design that every company "should" be using for all windshields for all personal vehicles on the road, without even remotely discussing various vehicle body shapes and sizes.
Have you (or anyone reading this thread) encountered similar issues? How do you solve them in a monorepo?
My feelings here are apart from your tool of choice (Pypi) so read them with that in mind.
Why are you dependent on 3rd party code that isn't in your repo? I am a huge advocate of the monorepo and vendoring. Depending on your tooling of choice and your workflow, checks for updates on this third-party code should be frequent (security) and done by someone qualified (not a job for the "new guy").
The question is where should this start and end? The answer (for me) is everything and I have elected to use less (and reduce complexity) to avoid bloat. Really though this is an artifact of my use of Git: https://unix.stackexchange.com/questions/233327/is-it-possib... --
It was a gigantic pain trying to find owners for half-dead repos for services still running and in use, where the original authors had left years ago & from teams 4 or 5 restructures ago. The one thing I learned was: never make a user the owner of a repo (unless it is in their personal space), always find a team to accept responsibility for it.
This is how it works at my company. The issue we run into is that PRs coming from non-core maintainers tend to either get over-scrutinized (e.g. "this diff may work for you but it's not generic enough for X/Y/Z") or flat out ignored at the code review stage and sometimes don't land in a timely enough manner.
Another challenge with this approach is when you have deeply nested dependencies and need to "propagate" an upgrade in some deep dep up the tree. In the JS/Node world, this usually means fixing an issue involves hacking on transpiled files in the node_modules folder of a project to figure out what change needs to be made, and then mirroring said change into the actual repo and then tweaking things until type checking/linting/CI pass. Not really conducive to collaboration.
One other problem is that security/bug fix rollouts are a bit more challenging. We had a case a while back where a crash-inducing bug was fixed and published but people still experienced crashes due to not having upgraded the one out of dozens of packages required by their projects.
Here's my rule: You break it, you fix it.
> I'd rather see a poly-repo approach, with a designated owner for discrete modules, but where anybody can clone any repo, make a proposed fix, and submit a PR.
I'd rather see pairing, extensive tests and fast CI. I see PRs as a necessary evil, rather than a good thing in themselves. If I make a change that breaks other teams, I should fix it. If I can make a change to fix code anywhere in the codebase, I should write the test, write the fix and submit it.
Small, frequent commits with extensive testing creates a virtuous cycle. You pull frequently because there are small commits. You are less likely to get out of sync because of frequent pulls. You make small commits frequently because you want to avoid conflicts. Everyone moves a lot faster. I have had this exact experience and it is frankly glorious.
I’ve seen this invoked so many times to shirk responsibility though. Someone piles up all kinds of crap in a tight little closet, complete with a bowling ball on top, and the next unsuspecting dev who comes by and opens it gets an avalanche of crap falling on them while the original author can be heard somewhere in the background saying “it’s not my problem.”
This winds up leading to more crap-stacking just to get the work done ASAP and you wind up with a mountain of tech debt.
I like the zero flaw principle where new feature work stops until all currently known flaws are fixed. Then everyone is forced to pitch in and responsibility is shared whether you want it or not.
I'm accustomed to collective ownership where, ideally, this never happens and in practice happens rarely (followed by the little closet being torn out and replaced).
> I like the zero flaw principle where new feature work stops until all currently known flaws are fixed.
I agree: stop the line. But I think it's orthogonal to the sins or virtues of n-repology.
Isn't it reasonable to assume that FB/Google will do a cost analysis of mono/poly repo approaches and pick the one that is the most cost effective? At that scale they have absolutely no room for dogma; it's all about costs.
In the post yesterday one of the arguments was that if nobody checks out all of the code then what's the value of having the code all in one place?
Last monorepo I worked on, individual contributors checked out just the tree they were working on (we had a suite of applications with several shared modules). We made it simple and straightforward for them to get what they wanted and ignore people whose work didn't impact them.
But the senior people, who were better with architecture and version control trivia, checked out the entire thing. They would steward any cross-cutting changes that needed to be done, and make sure any callers to shared libraries were updated in the face of breaking changes. They were also backstopped by the build plans, (some of) which also checked out the entire thing.
Git cannot checkout sub directories and it slows down exponentially with the number of branches. It's the opposite of what is needed to run a mono repo in a large company.
The big companies that predate git and such used monorepos because that was the norm at the time, and it was easier to do with the tools at the time, and as they scaled, they just scaled their process instead of changing everything. But several large tech companies, especially newer ones, do the multi repo approach.
If the "no one-size-fits-all" claim happens to be genuinely and axiomatically true for a particular engineering trade-off, then fine. There's no one correct displacement of an internal combustion engine. There's no one correct resolution of an LCD screen. Fine. It's demonstrably true that a trade space exists.
But a lot of times people seem to just throw up their hands and call it a trade space when really they just haven't reached a conclusion yet. "There's nothing inherently better or worse between Ubuntu and Windows, they're basically just ice cream flavors!" No! Maybe we haven't fully realized a more perfect operating system yet to settle the debate, but that doesn't just make it a meaningless question. It's perfectly possible for a system to be architected poorly given both the real world it has to interact in and the future world it makes possible. To say that this question is an unanswerable matter of taste is to be completely unimaginative about how good an operating system _COULD_ be. (See the death of operating system research and all that).
CVS is _worse_ than git. It just is. I don't want to hear this "well maybe if it fits your use case" mumbo jumbo. If you think that you have a unique snowflake reason that CVS is more appropriate than git, then you are almost certainly lying to yourself or misinformed.
And it's strict hierarchies like that that inspire these articles. There are a lot of technologies out there, and a lot of ideas, and most people don't know most of the things you need to know to come up with a good answer to what suits "their specific situation". So people like myself are looking for lessons learned and certain invariants that help them narrow the solution space. I have no idea whether a monorepo would work well for my organization, and if the only thing that your article has to contribute is "monorepos sometimes work for some people, but YMMV! Good luck!" then I have learned nothing. But if somebody thinks that they've learned a fundamental truth about the universe, then that could be useful to me. What's more, most people like me have a situation that _isn't_ that specific. We have to write some code, there's some ML shit in there, and some real-time critical stuff in there. Nothing mindblowing. _Most_ software shops shouldn't need something that is particularly bespoke. So coming in with the prior that everybody will have to do something unique to their organization is bizarre. There is so much commonality between what each software company does, in fact, that if a commonly used technology can be used by shop A but legitimately can't be used by shop B, there's a decent chance that this is a problem or limitation with the tech.
So who knows, maybe saying monorepos are _always_ better or _always_ worse really is too ambitious. But I don't think the concept that they _could_ be is a priori ridiculous. End this software relativism! Things can be made better! Yes, strictly better!
> most people don't know most of the things you need to know to come up with a good answer to what suits "their specific situation".
And most people aren't competent software architects capable of adeptly steering an engineering team in the right choices to make. I'm not sure I understand the point here, or why you want to make a technical field like software engineering dumbed down to the point where "most people" can intuit the right decisions to make simply by asking HN what "the best thing" is.
It could be several orders of magnitude larger and with a larger organization could be a lot of unnecessary code that any given Dev may never touch.
No, each individual set of changes is more atomic (smaller in scope, mutating a system from one state of functionality to a new state of functionality).
The problem is that it’s a linguistic fallacy to act like in the monorepo case “the system” is the sum of a bunch of separate systems (it isn’t, because they are not logically required to depend on simultaneously transitioning). So in that monorepo case, to move subcomponent A from some state of functionality to a new state of functionality, you unfortunately have to also make sure you include totally unrelated (from subcomponent A’s point of view) changes that also correctly transition subcomponent B to a new state of functionality, and subcomponent C, etc., which is exactly less atomic (to transition states, you are required to have simultaneous other transitions that are not logically required for any reason other than the superficial sake of the monorepo).
I don't see what's superficial about "everything everywhere is in sync", myself.
And I have absolutely seen PR race conditions. Assuming that everyone perfectly sliced up the polyrepo on the first go is optimistic.
Well it is superficial by definition, because two unrelated things are “in sync” only because you say so. The very meaning of “in sync” in your sentence is some particular superficial standard you chose that has nothing to do with the logical requirements of the isolated subcomponents (i.e. “in sync” meaning two independent subcomponents were adjusted in the same large commit or PR is, by definition, superficial... it’s just a cosmetic notion of “in sync” you chose for reasons unrelated to any type of requirement).
What you’re saying amounts to something of a No True Scotsman fallacy... “no _real_ monorepo would limit different projects from using individualized tooling if needed...” Yet that limitation suspiciously coexists with monorepo tooling frequently, and does not frequently coexist with polyrepo tooling.
This is the (wrong) assumption. Like I said, there's nothing about a monorepo that "begets" draconian policy. Your anecdotal experience is not a rule. The monorepo I work with doesn't have draconian policies about how tooling must work. There are apis and recommended tools, and if those don't fit your needs (which is unusual), the teams that maintain those tools are willing to support your uses, but if not, you're also free to hack yourself something that works. Writing additional pre-commit hooks is encouraged.
> What you’re saying amounts to something of a No True Scotsman fallacy
Again, no. Certainly monorepos can do this. They're still real monorepos. But polyrepos can too. They're still polyrepos. It's orthogonal.
I’d flip it around and say instead that you are assuming the properties by which to compare the two approaches ought to be properties that are roughly like “first principles” and that no first principles difference really exists between them in terms of limiting what you can do.
But this is the wrong way to look at it because, pragmatically, it’s simply just not the sociological phenomenon that actually happens as a side effect in terms of the practical result. Who cares if there’s a first principles reason for them to be different in terms of effectiveness? I certainly don’t— they just are different in terms of effectiveness.
Not the parent, but for us, the 3rd party code is in a private package manager (artifactory, private npm, whatever). Having thousands of libraries we didn't write in our repo doesn't sound like fun.
The more serious answer is that when you have hundreds/thousands of applications with as many use cases, countless products and teams, and generally just ship a lot of stuff, it adds up.
Again: I wish it was a smooth experience. Because I like the ideas very much. But it wasn't when I tried and I don't know anyone -- outside of Google -- for whom it was a smooth experience.
I find Bazel’s syntax much easier to deal with than other build languages that use JSON (essentially the same Python syntax but with lots of extra quotes everywhere and extra fussiness about where commas are allowed).
    bazel build //main:hello-world
I'm sure the double slashes and colon have important differences. It is not obvious what they are.
    cc_binary(
        name = "hello-world",
        srcs = ["hello-world.cc"],
        deps = [
            ":hello-greet",
            "//lib:hello-time",
        ],
    )
It's not instantly obvious why one is :hello-greet and the other is //lib:hello-time. I could swear I've seen @ floating around as well.
As I said above, I am sure these are all very sensible. But I am just tired of memorising minilanguages embedded in strings. I don't want to any more.
The Bazel rules for languages are also not perfect imo. Like I dislike hooking Bazel up to tools like NPM and Webpack. I'd rather have a system that could sync NPM modules into third_party automatically and set up Bazel files for them, then have a bundling system that is native to Bazel that allows taking full advantage of its caching and pure building.
Bazel is imperfect on Windows as well. I have tried to help but admittedly it is hard work and it'll take time. I wanted to get Bazel Watcher working on Windows, but my PR is stalled because the Windows API is truly maddening at times. (Feel free to find the PR, it's almost hilarious how convoluted it is to effectively kill a tree of processes. Linux of course is imperfect here but it lets you get 95% of the way much easier.)
However, here's what I will say: if you are in an organization, I think Bazel really shines. If you can take time to write some custom tools and rules and really integrate your software into Bazel, it can be an awesome experience. Sadly the publicly available rules try pretty hard to match existing semantics and fall short of showing off how nice Bazel can be in some cases, but I think C and C++ is a great area where Bazel shines above the pack.
Another plus: it is amazing having a build system that crosses languages. Does your Python script depend on a C module and connect over TCP to a Go program? No problem, all of that is easy to express. Do you want to have a Go script that writes a TypeScript file that gets compiled and bundled into your app's JS bundle? Once again this is all fairly natural and you can easily accomplish it with a simple combination of normal build rules and a genrule.
And Starlark is a reasonably complete almost-subset of Python, so it's easy to compose, extend and refactor your rules. If you want to generate a matrix of targets for say, testing across browsers and platforms, you can do that, and make it reusable too.
Basically my advice with Bazel:
- Check out how well it works with C and C++, and I think Java also works quite well. This should give you an idea of how it looks when done right.
- Don't constrain yourself to what Bazel offers in terms of rules. Starlark is hugely powerful and you can easily make your own rules for things.
P.S.: the weird path syntax is probably many parts legacy, but it's not actually super hard to understand. When you see a colon, the left side of the colon is a path to a folder, and the right side is a target name. When you see double slashes, it means absolute path relative to root of workspace. If the colon is omitted the target name is assumed to be the same as the folder name.
//:base -> the base target in the BUILD file in the root of the workspace
//base -> //base:base -> the base target in the BUILD file in the base folder relative to the root of the workspace
//app/ui:tests -> the tests target in the BUILD file in the app/ui folder relative to the workspace root
:genfile -> the genfile target in the BUILD file in the current directory
There is some context sensitivity about how to refer to files versus targets and whether you're referring to runfiles, output files, or build files, but most of the time it's surprisingly obvious actually. When it comes to files versus targets, it largely works a bit like Make except there's namespacing for input files vs output files (and runfiles, but that's another topic.)
There is also an @ syntax used to refer to paths outside the current workspace. It mainly comes into play when importing rules.
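To make the rules above concrete, here is a rough Python sketch that expands the label forms listed in this comment into their parts. It is only an illustration of the description above, not Bazel's real parser: the function name `resolve_label`, the `(repository, package, target)` return shape, and the repository name `some_repo` in the usage note are all made up for the example, and it ignores the file-versus-target subtleties mentioned earlier.

```python
def resolve_label(label, current_package=""):
    """Expand a Bazel-style label into (repository, package, target).

    Hypothetical helper covering only the forms described above:
    //pkg:target, //pkg, :target, and an optional @repo prefix.
    """
    repo = ""
    if label.startswith("@"):
        # "@repo//pkg:target" refers to something outside this workspace.
        repo, sep, rest = label[1:].partition("//")
        if not sep:
            raise ValueError("sketch only handles @repo//... labels: " + label)
        label = "//" + rest

    if label.startswith("//"):
        # Double slash: the package path is relative to the workspace root.
        package, colon, target = label[2:].partition(":")
        if not colon:
            # No colon: the target is named after the last folder.
            target = package.rsplit("/", 1)[-1]
    elif label.startswith(":"):
        # Leading colon: a target in the current package's BUILD file.
        package, target = current_package, label[1:]
    else:
        raise ValueError("unsupported label form in this sketch: " + label)

    if not target:
        raise ValueError("label has no target name: " + label)
    return repo, package, target
```

For instance, `resolve_label("//base")` and `resolve_label("//base:base")` both give `("", "base", "base")`, and `resolve_label(":genfile", current_package="app/ui")` gives `("", "app/ui", "genfile")`.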
This is pretty much what I think of when I want to like Bazel. I wish we had it on Cloud Foundry. Or, rather, I wish it had existed 5 years ago and had been used on Cloud Foundry from the beginning, because CF and its associated projects have hundreds of repositories and these have mostly been kept in sync through mountains of tests and oceans of automation. It works, but I know that in another universe it works better.
- I was replying to a comment, not the article.
- The article spoke about points that were largely independent of the current or future state of tooling. Instead, it focused on fundamental issues with mono- vs poly-repo systems. Most directly, being forced to fix migrations and incompatibilities immediately rather than letting versions skew.
If you want to batter someone for not arguing for or against the points in the article, you can do it with the comment I was replying to, or with your own comment just now.
B: A developer working on a daily basis in component A finds a bug in component B. They just have to change the code and commit it for review, instead of understanding the specifics of working with component B's repository.
We keep pruning and gc'ing with different flags, but pulls just seem far slower than other smaller repos.
If it's a small company where every developer touches every part of the application, sure. Taking the FAANG approach if you're not part of that acronym sounds like introducing inefficiency.
I'd expect to see problems with this approach once you get into the 100s or 1000s of developers. The tooling for this scale of repository isn't as mature.
Equal care does need to come from the other side of the contract. Most frequently, I see teams B, C, and D in a polyrepo world do the worst of all worlds: take dependencies liberally, pin them in place, and try to forget about them. Of course, high functioning engineering teams (and cultures) will try and avoid this: they will be thoughtful about dependencies, and they will keep them up to date. In practice, they most frequently do not. This is especially true in the enterprise broadly. When we get it wrong, and take a dependency we wish we hadn't, how do we know? When do we know? What is our recourse? If I depend on code in the monorepo that diverges, I'm more likely to know near to the point of divergence (because of the nature of the system). That means the conversation about how to fix it happens sooner. I'm not interested in avoiding error - that's going to happen. I'm interested in how close to the introduction of the error do we understand it, and how do we communicate about its remediation.
As far as CI and global coordination goes, the cost is high in either direction if the system is distributed, and the solutions are similar in my experience. I think the worst case is the mixed one (which is a world I inhabit) - you wind up splitting your investment in both style and effort across both approaches. With the monorepo style, one big advantage is where the complex CI interactions can be encoded, since you have access to more of the code itself. Granted, at scale, you likely are testing against artifacts rather than point in time commits outside of the component in question (this is very similar to what you're going to do in a polyrepo, too.)
I think solid testing design requires real effort and understanding of the system under test, regardless of repository layout. Which brings us back to communications again. The more you can see, and the more clearly experienced the pain is across the teams, the more likely you are to have the critical conversations needed to improve the system - rather than making local fixes ("my team's tests are fast", "their component sucks").
This has been my observation as well, minus the value judgment. Why is pinning dependencies and moving on with life the worst thing in the world? As you point out in your article, a security fix in A does suddenly force B, C, and D’s hand. Another scenario I’ll add to that: if A provides communication between B, C and D, a synchronized update to all dependents might be required.
Thing is, I’d argue these scenarios are the exception to the rule. If you’re drawing boundaries in the right places (again this may come back to contract design) you’re largely free to change implementation details when you need to, on your own terms, and not because some distant transitive dependency has decided it’s time for your build to break.
With monorepos I see lots of the latter. Lots of breakage for no other reason than “everyone needs to be on the same page.” Lots of conversations — O(N^2) conversations, times some constant factor — that might not need to take place, ever, but it’s critical the entire company have them right now because the global build is broken.
Here’s another way of looking at it. Until a few years ago, it was standard practice to frequently update npm dependencies against fuzzy semver ranges. Now most people pin their dependencies, and their dependencies’ dependencies, with a lockfile. And in other ecosystems like Go’s you also have tooling to support much more controlled, infrequent and minimal dependency upgrades (see MVS).
Why the change? Because people got tired of things breaking all the time. They wanted off the treadmill so they could Get Things Done again. I don’t see how monorepos provide this stability, and frankly it seems like the monorepo idea is where npm was about 5 years ago. Perhaps even farther behind than that, since C, C++ and others haven’t even evolved viable language package managers yet.
You’re a Rust fan, so maybe Cargo + a monorepo is a sweet spot I haven’t encountered yet? Anyway, I do really appreciate you taking the time to share your perspective on these things. It’s been great having a reasonable discussion about them.
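Since MVS (minimal version selection, the approach used by Go modules) came up above, here is a small Python sketch of its core idea: every module version states the minimum versions it needs, and the build uses, for each module, the highest of those stated minimums reachable from the root, never anything newer. The function name, the tuple encoding of versions, and the module names in the test data are assumptions made up for this illustration. It is not Go's implementation.

```python
def minimal_version_selection(root, requirements):
    """Return {module: version} chosen by a toy minimal version selection.

    root is a (name, version) pair; requirements maps a (name, version)
    pair to the list of (name, minimum_version) pairs it declares.
    Versions are tuples such as (1, 4, 0) so they compare correctly.
    """
    selected = {}
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        name, version = node
        # Keep the highest minimum anyone asked for, and nothing newer.
        if name not in selected or version > selected[name]:
            selected[name] = version
        # Follow this exact version's own declared requirements.
        stack.extend(requirements.get(node, []))
    return selected
```

The point for this discussion is that publishing a newer release of a dependency changes nothing for a consumer until somebody in its dependency graph raises a stated minimum.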
By doing this you only ever "step" one dependency at a time and one minor version at a time, so you only get very few and very small breakages each time. The alternative is locking your depfile and then, 6 months down the road, realizing you need a security fix in component foo but now have 1000 other backwards incompatible changes to fix, because transitive dependencies also need to be upgraded in order to satisfy foo 1.2's dependencies.
I think it's important to separate internal dependencies from external ones. My personal advice is to treat external dependencies in whatever way the language prefers, and upgrade on a cadence. This is because you can't have any real impact on your external dependencies - even if they are critical, you can essentially treat them as a black box for the purposes of this conversation. For the rest of my response, let's assume we're talking internal dependencies.
The thing about breakage "for no reason" is that you are still broken, you just don't know it yet. One assumes the team that broke you had a reason. It might be a good or bad reason, from your point of view, but it wasn't no reason. When I talk about forcing the conversation, this is why. It's not better to hide from the changes, or pretend that you are safe. You aren't. All that happens is you lengthen the time between when the breakage was introduced and when you discover it. Most frequently, that discovery happens when the upgrade becomes critical (security) - and the time to apply the change has gotten longer, and the team who made the breaking changes no longer clearly remembers the drift. This makes teams even less likely to move.
By ensuring these types of changes hurt, and are understood to be a shared responsibility (the consumer has a responsibility to move, the producer has a responsibility to understand and protect the stability of their consumers), teams have the impetus to design and build systems that ensure their stability. It's one thing to ask for things like circuit breakers, backwards compatible interfaces, etc. It's all theoretical from a single engineer's, or single team's, point of view. It's not a panacea, but when the contract is structured this way, everyone adapts to the issue: producers get more defensive, consumers get less debt.
Like I say in the original, I think this comes down to perspective. When my concern was primarily the efficiency of a single team, one that was small enough to stay connected through conversation and shared understanding, it mattered way less.
A lot of your reply comes from the perspective of wanting, as an engineer, to just Get Things Done again. I get it, and I'm sympathetic. It is harder to work this way, because you can't take the easy shortcuts (pinning, delaying the upgrade, ignoring your consumers, etc.) - but that's precisely the point. Those things are bad in the long term.
Source Depot was great (modulo availability issues), but I don’t think they got anywhere near the scale of Piper.
This is a bit misleading to outsiders. Each of these repos was huge for the time, corresponded to a major subsystem with many disparate components, and the default tooling on the ground was the cross-repo tooling. One got the impression that if they could have pulled off one giant monorepo to rule them all, they would have, but they fell just short due to some technical details (cough spinning magnetic disks). In the meantime `sdx` was a convenient abstraction that allowed people to work in a monorepo way.
All in all it wasn't so different from present-day monorepos broken into git submodules for performance reasons.
So, what's stopping you from depending on (using) anything else? Or how do you stop people from doing this? Bazel (Blaze) has visibility rules, which by default are private: the rules in your packages are hidden unless explicitly made public, or alternatively you can whitelist, one by one, which other packages (//java/com/google/blah/myapp) can depend on you.
Let's say there is a cool new service and your team wants to try it out... but it's not out there for everyone to use; it's in alpha, beta, whatever stage. So you ask for permission from the owning team, or simply create a CL that adds your package target, name, or "..." folder pattern to the whitelist. Eventually you will be whitelisted (if it's a good idea, and approved). As a counterexample, suppose some library got deprecated and is slowly being replaced with another, so instead of being "//visibility:public" it now just whitelists its last remaining users... Well, it's probably not a good idea to get added to that list, as the whole thing is going away soon (yes, Google tends to deprecate internally even faster than externally... which is good!). But such mechanisms are helpful in getting this to work correctly.
There are dependencies everywhere. Monorepos are one of the tools which can be used to make dealing with them easier in some cases. They’re not an absolute solution, nor appropriate for all circumstances, but no tool is!
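As a rough illustration of the visibility mechanism described above, here is a Python sketch of the check: a target is private to its own package unless its visibility list says otherwise. The function name `is_visible` is made up, and the sketch only understands `//visibility:public`, `//visibility:private`, `//pkg:__pkg__` and `//pkg:__subpackages__` entries. Package groups and package-level defaults are left out, so treat it as a model of the idea rather than of Bazel's actual behaviour.

```python
PUBLIC = "//visibility:public"
PRIVATE = "//visibility:private"


def is_visible(target_package, visibility, consumer_package):
    """Return True if consumer_package may depend on a target in target_package.

    visibility is the target's list of visibility entries; an empty list is
    treated as private, matching the default described above.
    """
    # A package can always use its own targets.
    if consumer_package == target_package:
        return True
    for spec in visibility:
        if spec == PUBLIC:
            return True
        if spec == PRIVATE:
            continue
        package, _, kind = spec.partition(":")
        package = package[2:] if package.startswith("//") else package
        if kind == "__pkg__" and consumer_package == package:
            # Whitelists exactly one package.
            return True
        if kind == "__subpackages__" and (
            package == ""
            or consumer_package == package
            or consumer_package.startswith(package + "/")
        ):
            # Whitelists a package and everything beneath it.
            return True
    return False
```

So with a visibility list of `["//java/com/google/blah/myapp:__pkg__"]`, only that one package (plus the owning package itself) gets through, which is the one-by-one whitelisting case.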
Sorry not copying it here to avoid repost.
In a monorepo that's already done when I finish working on the modules in B.
That I am unable to release from A until it has been synced with the module in B is not "a cosmetic notion". It's being unable to release. I consider releasability at all times to be the most important invariant to be sought by the combination of tests, CI and version control.
This is not usually true in monorepos or polyrepos, and is quite a dangerous practice that nobody should use and hasn’t got much at all to do with what type of repo you use.
I worked in a monorepo for a long time where you still had to deploy versioned artifacts. So when you make changes to B, you still have to bump version IDs, pass deployment requirements and upload the new version of B to internal pypi or internal maven or internal artifactory, etc.
Then consumer app A needs to update its version of B, test out that it works and that, from app A’s point of view, it is ready and satisfied to opt-in to B’s new changes, and do build + deploy of its own to deploy A with upgraded B.
Doing this in a way where a successful merge to master (or equivalent notion) of a change for B is suddenly a de facto upgrade for all the consumers of B is insanely bad for so many reasons that I’m not even going to try to list them all. Monorepo or not, nobody should be doing that, that is bonkers, crazy town bad. It’s a similar order of magnitude of bad as naively checking credentials into version control.
I guess compilers must work differently for you.
Correlation (and a weak one at that) is not causation.
I can just as easily suggest that monopolicies beget monorepos, and that indeed makes a lot more sense. It's easier to enforce global standards when there's a single repo. So companies who wish to enforce draconian standards may move in that direction. That says nothing about companies that don't wish to enforce draconian standards though.
In “The Beginning of Infinity,” physicist David Deutsch makes a point like this about styles of government. Deutsch suggests the defining characteristic of a good governmental system should not be whether it consistently produces good policies, but instead that there is an extremely low-cost barrier to removing bad policies once it becomes clear they are bad.
Thinking this way, if monorepos permit a situation where there are monopolicies about allowed languages, allowed deployment tooling, etc., and those policies cannot be quickly discarded when it becomes clear they are bad for a certain business goal, then this is perfectly good reason to disfavor monorepos regardless of whether they cause the bad policies.
I think your responses continue to miss the point because you’re talking about correlation and causation as if it matters in a situation like this: but it precisely doesn’t matter.
If a tool doesn’t actively prevent certain policy failure modes (even if it does not cause anyone to choose a bad policy), that is a relative failure of the tool.
Contrast this with polyrepos, where it is much harder to enforce failed monopolicy ideas. This is one area where polyrepos are a better tool: to misuse polyrepos policy-wise you have to go way out of your way and add a lot of draconian policy enforcement tooling that often can still be circumvented. Those inherent barriers are a good thing that monorepos don’t have.
Separately, I’d also say that the political failure mode where central IT wants to enforce draconian policies is extremely common, and those types of organizations specifically see a monorepo as a tool of (their desired) oppression and control.
Since the base rate of occurrence of horrible companies is super high among all companies, it probably does mean that P(bad | monorepo) is pretty high, so a monorepo is reasonable conditional evidence of a bad workplace culture.
In this case, the double slashes are absolute "paths" relative to the top of the workspace, and the part after the colon is a relative "path" to another Bazel target.
I put "paths" in quotes because these are meaningfully different from the true filesystem equivalents; avoiding confusion with real absolute and relative filesystem paths is probably why they made their own syntactic mini-language.
[The sibling reply to mine, referencing Piper and Perforce, goes into a bit more detail on the specifics and the origin of the // prefix.]
What would the better way have been for them to do this?
I don't know, off the top of my head (having been on the other side of this conversation, I am aware how frustrating that answer is). But I know I couldn't keep it straight when I was fighting Bazel and that I gave up. And anecdotally I am not alone: I have seen Bazel torn out of multiple projects, sometimes quite painfully.
This clearly shows in Bazel's Python support: its internal version (Blaze) gets used quite often with Python inside Google's monorepo, and it works very nicely in that role, but that's a very different way of using Python than approximately the entire rest of the world. It's still Python, to be clear, just everything else is pretty different. ;)
Still, Bazel's model is pretty great if you adjust your brain, tooling, and patterns to it. I accept that most people don't. And some of its preferred usage patterns are more trouble than they're worth in a typical small shop anyway, at least with the usual other tooling one has to integrate with.
Tradeoffs...
The ":" is a bit different, e.g. just "//lib" means "//lib:lib" - e.g. points to the "lib" target in /lib/BUILD file, while "//lib:hello-time" points to "hello-time" target in /lib/BUILD file. So not having the ":name" in "//dir:name" means name="dir" - e.g. "//dir:dir" - at first this is strange, but then you get used to it. Your default target is named after the folder it's sitting in.
For example, in the most recent monorepo I worked in, most everything was written in Java and Scala. But when you compile consumer app A that depends on submodule B, it does not just naively use the code of submodule B already sitting at the same commit of the monorepo. That would be terrible, because it would mean if anyone changed some code in submodule B, then app A has been silently upgraded just by default.
Instead, the necessary shared object / jar / whatever is compiled only for submodule B, which is then uploaded with its new version identifier to internal artifactory that stores the compiled jars, shared objects, whatever (and stores Python packages, containers, and many other types of artifacts too).
Now when you compile app A, it retrieves the right artifacts it needs from artifactory, to treat submodule B like a totally separate third party library, and app A is free to specify whatever version of B that it needs, no different than specifying open source third party dependencies.
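A toy Python model of the workflow just described may help: submodule B is published as immutable, versioned artifacts, and app A resolves whatever version it has pinned, so a new release of B changes nothing for A until A opts in. The class `ArtifactStore`, its methods, and the names and version strings used with it are all hypothetical. This is not the API of Artifactory or of any real package index.

```python
class ArtifactStore:
    """In-memory stand-in for an internal artifact repository."""

    def __init__(self):
        self._artifacts = {}

    def publish(self, name, version, payload):
        # Published versions are immutable: re-publishing is an error.
        key = (name, version)
        if key in self._artifacts:
            raise ValueError(f"{name} {version} is already published")
        self._artifacts[key] = payload

    def fetch(self, name, version):
        # Consumers always ask for an exact, pinned version.
        return self._artifacts[(name, version)]


def resolve_dependencies(store, pins):
    """Fetch exactly the pinned versions for one consumer application."""
    return {name: store.fetch(name, version) for name, version in pins.items()}
```

If B publishes 1.0.0 and later 1.1.0 while app A pins `{"B": "1.0.0"}`, then `resolve_dependencies` keeps returning the 1.0.0 payload until A changes its pin.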
It really seems like you are willfully trying to act like you don’t understand what I’m saying. This approach works perfectly for compiled languages and artifacts, that’s one of the primary use cases it is designed from the ground up to solve.
There’s no reason why CI in a monorepo can’t create versioned code artifacts like Python packages, Java libraries or special jars, Docker containers, whatever. This is a very common workflow, e.g. combining a monorepo with in-house artifactory.
Definitely not talking wire format changes. Talking about publishing versioned libraries, jars, etc., from subsets of monorepo code.
>There’s no reason why CI in a monorepo can’t create versioned code artifacts like Python packages, Java libraries or special jars, Docker containers, whatever. This is a very common workflow, e.g. combining a monorepo with in-house artifactory.
Correct, and this is necessary. But there's no reason for a to depend on b from the artifactory instead of a just depending on b at a source level, and building a and b simultaneously and linking them together. Now you have fully hermetic, reproducible builds and tests.
Why is not doing versioning so insanely bad that you can't list all the reasons? (This would be a much more interesting discussion if you did.)