Monorepo: please do

234 points by Soliah 7 years ago | 161 comments

In my experience, this discussion gets convoluted by confusing modularity with monorepo. They are orthogonal to each other; you can have a very modular codebase in a monorepo but also a very coupled (non-modular) codebase with polyrepo.

Though it's true that monorepos without proper discipline can tend towards coupling. Yet, when discussing mono vs poly, we should keep this in mind.

hinkley 7 years ago | |

First monorepo I worked on, we used separate compilation units for each 'module'. We paid a tax on build time but it added a bit of friction to adding new cross-module dependencies willy nilly.

I don't know how you maintain that arm's length separation if you don't have compilation units in your language of choice, and that may contribute to some of the muddiness in this kind of discussion. "It depends."

lstamour 7 years ago | | |

I think the private visibility and shared build chain that Bazel offers could step in here, in that it makes it harder to build a project without specifying every dependency, when combined with linting tools and clearly assigning code ownership...?

benmarten 7 years ago | |

It's not. Why do I have to check out a terabyte of code that I don't need, even if the code is modularized?

jchw 7 years ago | | |

No need to checkout a terabyte of code. If your repo is scaling that high, you're going to want a VFS layer. Microsoft made a VFS layer for Git. As you might imagine, you simply grab files as needed, and your version control just deals with diffs for the most part. Google's own monorepo is proprietary but the Bazel build system is open source and would work great with a VCS hooked up with a VFS layer.

erulabs 7 years ago | | |

If a mono-repo has a terabyte of code, or if 10 small repos have 1/10th a terabyte each, what have you really gained? In any case, git LFS solves large file storage effectively, as do a number of other artifact storage solutions, and a repo with a terabyte of code is _not_ going to be trivially split apart, since it would be, by a factor of thousands, the biggest codebase ever created by humankind.

skj 7 years ago | | |

Sounds like a tooling problem. We shouldn't use the current state of tooling as an excuse.

rhacker 7 years ago | | |

It's not for everyone, but damn, why is there a TERABYTE of code? Just curious - assets? checking in binaries?

alexnewman 7 years ago | | |

Signs your build system is never going to be adopted outside of people cargo culting you?

  - [x] Namespaces and the like without much security benefit
  - [x] Giant Java dependency
  - [x] Strange syntax and glyphs

mwkaufma 7 years ago | | |

We have a perforce monorepo with ~80gb total payload for the whole thing, but everyone uses streams to filter it, so that's not a problem.

slobotron 7 years ago | | |

Chances are you will end up downloading a lot of dependencies anyways, why not have git deliver it all?

jonthepirate 7 years ago |

The reason I am upvoting this is that it is written in a positive tone. Too many people, especially in the world of DevOps, trash everything (X is the worst, don't do that, etc.) and more often than not do not offer better guidance following their whiny tone. We need more "please do's" in this industry. Thank you Adam.

klodolph 7 years ago |

I have personally migrated a medium size polyrepo code base (something like ~20 repos?) into a monorepo and I agonized over the decision. But it lifted a huge weight off my shoulders.

I feel like if you are working completely in the open-source world, and you are contributing one open-source project to a larger array of available projects, then the decision to use a polyrepo makes a lot of sense. You can submit libraries to a package repository like Yarn/NPM/PyPI or you can use Git references for e.g. Go's package manager.

But what I experienced with polyrepos outside this world is that we ended up with a weird DAG of repos. It was always unclear whether a specific piece of code that was duplicated between projects should be moved into one dependency or another, or whether it should have its own repo. Transitive dependencies were no fun at all: if you used Git submodules you might end up with two copies of the same dependency. You might have to make a sequence of commits to different repos, remembering to update cross-repo references as you go, and if you got stuck somewhere you had to work backwards. This feels like a step backwards, like the step backwards from CVS to RCS.

Again, in the open-source world you might have some of this taken care of by using a package manager like Yarn. But if your transitive dependencies aren't suitable for being published that way, it can be tough. Monorepo + Bazel right now is a bit rough around the edges but overall it's reduced the amount of engineering time spent on build systems.

On the other hand, it's not like Bazel can't handle polyrepos. In fact, they work quite nicely, and Bazel can automatically do partial checkouts of sets of related polyrepos, if that's your thing.

As for VCS scalability problems, I expect that Git is really just the popular VCS du jour and some white horse will show up any day now with a good story for large, centralized repos with a VFS layer. In the meantime any company large enough to experience VCS performance problems but not large enough to have their own VCS team (like Google and Facebook) will suffer, or possibly pay for Perforce.

malkia 7 years ago | |

Bazel has a WORKSPACE file that can work with multi-repos, but AFAIK things are still rough there, though they should get better eventually (I'm a bit hand-wavy on the details).

klodolph 7 years ago | | |

Yes, exactly. Unfortunately, sometimes the partial checkouts can be somewhat limited by the fact that your WORKSPACE code will import Starlark defined in other repos. This can get a bit ridiculous if your repo uses a bunch of different languages; if you browse through e.g. the TypeScript support instructions for Bazel you’ll see some of what you’re in for.

If your project is mostly something like C++ (which has support built-in to Bazel) then the WORKSPACE rules will be much more manageable and partial checkouts become a lot easier.

kokokokoko 7 years ago |

Its almost as if both approaches have positives and negatives. Some of which are more important depending on your project and organization.

I'd be more interested to read about a project or company that failed due to making one choice or the other. And then by switching things to the other way, things were fixed.

Otherwise, as someone who has worked with both, I imagine there are a host of other decisions that will be much more determinant of your success.

Let's not get too wrapped up in what color to paint the shed.

natalyarostova 7 years ago | |

>Its almost as if

Please don't do this.

ceronman 7 years ago |

I work at a large organization (2000+ devs). We have used both a Monorepo and Polyrepo. After some extensive experience with both models my conclusion is that the Monorepo is by far a superior model, especially for a large organization.

Of course the Monorepo is not free of downsides, those mentioned in the original article are real, although a bit exaggerated in my opinion. VCS operations can be slow and scaling a VCS system is challenging, but possible. And the risk of high coupling and a tangled architecture is also very real if you don't use a dependency management system like Bazel/Buck/Pants.

But in my opinion the downsides of the Polyrepo are much worse and much much harder to fix. The main problem is that you need a parallel version control system like SemVer on top of your VCS. SemVer is fine for open source projects but for a dynamic organization is a nightmare because it is a manual process prone to failure. SemVer dependency hell is really hard to deal with and creates a lot of technical debt.

Additionally, once you go Polyrepo you lose true CI/CD. Yes, you still have CI/CD pipelines but those apply only to a fraction of the code. Once you get used to running `bazel test` and you know you will run every single test of any piece of code that could depend on the code you just changed, you never want to go back. Yes, you could have true CI/CD with Polyrepos, but it requires a lot of work and writing a lot of tooling that does not exist in the wild. It is cheaper to invest in scaling your VCS for a monorepo.
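To make the "every test that could depend on the code you just changed" idea concrete, here is a minimal sketch of the underlying graph walk. It is not Bazel's implementation (Bazel exposes this kind of question through its query language, e.g. `rdeps`); the `DEPS` map and target names are hypothetical and only stand in for a real dependency graph.

```python
from collections import defaultdict, deque

# Hypothetical dependency map: target -> targets it depends on directly.
DEPS = {
    "//lib/core": [],
    "//lib/auth": ["//lib/core"],
    "//app/web": ["//lib/auth"],
    "//app/batch": ["//lib/core"],
    "//app/docs": [],
}


def affected_targets(deps, changed):
    """Return the changed targets plus everything that transitively depends on them."""
    # Invert the edges so we can walk from a dependency to its dependents.
    dependents = defaultdict(set)
    for target, target_deps in deps.items():
        for dep in target_deps:
            dependents[dep].add(target)

    # Breadth-first walk over the inverted graph.
    seen = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependent in dependents[current]:
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


print(sorted(affected_targets(DEPS, ["//lib/core"])))
# prints ['//app/batch', '//app/web', '//lib/auth', '//lib/core']
```

In a polyrepo the equivalent of `DEPS` is spread across many repositories and package manifests, which is roughly why this walk needs custom tooling there.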

woolvalley 7 years ago |

My org went from polyrepo 10 commit semver dependency hell when updating an internal API to monorepo and it saves a lot of time. Unmigrated semver breaking changes are a form of technical debt, and it takes a lot more total man hours to do the 'proper' one by one many commit poly repo migration than the other way around.

If we had the tooling to do multirepo atomic commits and reviews then maybe we would have stuck with polyrepos, but it doesn't really exist out in the wild, so monorepo it was.

pdpi 7 years ago |

Can we just move along and get to "Monorepo: Maybe do it, maybe don't. Just think it through and own your decision"?

Both monorepos and polyrepos have advantages and disadvantages. Many factors — scale, overall team quality and experience, level of integration between projects are a few that come to mind — will affect how much those advantages and disadvantages matter to any given company at any given point in time. The right choice for you isn't necessarily the right choice for me.

Much more important than which approach you choose is understanding, and accepting, the consequences of your choice. You'll want to extract value out of the advantages, you'll need to mitigate the disadvantages. You won't be able to adopt tools and processes meant for the other approach without some degree of friction.

0xFACEFEED 7 years ago | |

That's what most people do. They just don't blog about it.

sigil 7 years ago |

Observe how the verb "force" gets used 6 times. Monorepos "force the conversation." You the individual contributor are "forced to deal with the situation" and "forced to see the upfront cost" of breaking contracts. Your team is forced to "look up from their component, and see the perspectives of other teams and consumers."

All this forcing people to do things the Right Way (my way) is surely part of the pushback against monorepos.

But set that aside for the moment. Let's suppose defaults should force people to do things the Right Way, and that we also know what the Right Way is.

Instead of letting anyone sloppily depend on any code checked into the monorepo, shouldn't we force people to think long and hard about contracts between components -- the default concern in a polyrepo architecture? When and how to make contracts, when and how to break contracts? Isn't this how Amazon moved past their monorepo woes, adopted SOA, built AWS, and became one of the largest companies on earth? Heck, isn't this how the Internet itself was built?

mmmeff 7 years ago |

Thank you so much for writing this. As someone who’s worked in the best and worst of these two worlds, the productivity gains are absolutely insane and the limitations, as stated by the author, are no more painful than limitations of federated/polyrepo code.

Fighting back against monorepo design is dangerous - embrace experimentation.

est31 7 years ago |

There aren't good monorepo solutions out there (yet). Git LFS is great for few large files, but it doesn't help with tons of smaller files. Git submodules are crap when it comes to usability, and have been for a long time, it's even mentioned in the famous Torvalds Git Talk.

Git has had a sparse checkout feature for a long time, but it only affected the checkout itself; all the blobs would still be synced.

Now, Git is gaining good monorepo capabilities with the git partial clone feature [1]. The idea is that with it you can clone only the parts of a repository that are interesting to you. This has been brewing for a while already but I'm not sure how ready it is. There doesn't seem to be user-level documentation for it yet, to my knowledge, so I am linking to the technical docs.

[1]: https://github.com/git/git/blob/master/Documentation/technic...

dangoor 7 years ago | |

From earlier discussions around monorepos, I saw references that Google, Facebook, and other large monorepo orgs have been making use of Mercurial.

est31 7 years ago | | |

Yes, Facebook is Mercurial-based to my knowledge. Google is using its custom solution called Piper, I think: https://cacm.acm.org/magazines/2016/7/204032-why-google-stor...

itsdrewmiller 7 years ago | |

https://vfsforgit.org/ is another option here - MS-originated and Github is adopting it - https://venturebeat.com/2017/11/15/github-adopts-microsofts-...

malkia 7 years ago |

Monorepo is a total win, if you have something like https://github.com/Microsoft/VFSForGit (ex GVFS) - e.g. any monorepo that overlays changes, and the rest are simply file names with no actual contents is a win.

You can certainly achieve this with Perforce, SVN, HG, any repo system there too.

Linux: FUSE + ?

Windows: Dokan? CBFS? Or the new fangled https://docs.microsoft.com/en-us/windows/desktop/projfs/proj... which VFSForGit uses

e3b0c 7 years ago |

Monorepo could be a decent choice if your software stack does not require too many external dependencies. Or more precisely, if the ratio of your own code to third-party code is reasonably high.

Let me give a concrete example. The Android open source project (AOSP) which builds the system of Android devices has the code size close to the scale of tens of GB (let alone all the histories!). It is already a massive monorepo in itself. And typically you would have many of them from different OEM/SoC vendors of different major releases. In such a scenario, it would turn into 'a monorepo of monorepos,' which is quite unpleasant to imagine.

totallysnowman 7 years ago |

I think that the reason for the argument is that both authors understand the definition of "large repository" very differently.

With 100 engineers a monorepo might seem a good idea. With 500 it becomes nearly impossible to do anything involving a build. Some isolation is needed.

Also from my experience many engineers just don't give a shit about architecture. They create an entangled mess that kind of works for the customer, and go home. Without some enforced isolation it is impossible to maintain it.

That being said I am more inclined to polyrepos.

thurn 7 years ago | |

the fact that essentially 100% of big tech companies use monorepos seems like evidence that it is at least possible to do it in a scalable way...

shados 7 years ago | | |

Definitely not 100%. It also has a lot less to do with company size, and more to do with when the company was created. Before the git and similar tools of the world came to be, managing a single repo was a pain, never mind hundreds or thousands of them. So (almost) everyone did it the way these big companies did.

Today, not quite. I work for a multi billion dollar tech company and we have several thousand repos (and it's awesome)

influx 7 years ago | | |

Amazon does not use a monorepo, so you might want to rethink your "statistic".

senderista 7 years ago | | |

AMZN doesn’t, unless things have changed drastically in the last 3 years.

denimnerd42 7 years ago | | |

Yeah, by writing custom version control software. Am I going to convince my company (which has like 50k software engineers) to do that? Probably not.

pbalau 7 years ago | |

> With 500 it becomes nearly impossible to do anything involving a build.

Both FB and Google have more than 500 devs and are using a monorepo.

hocuspocus 7 years ago | | |

At what cost? Both FB and Google employ hundreds of devs to work on internal tooling only. For most companies this isn't feasible.

skybrian 7 years ago |

I wonder if a star pattern would work, where you have a single, shared repo for all your libraries and a repo for each app.

This would help people working on smaller apps, since they don't need to look at other apps unless they're working on shared library code.

Of course, once you are working on library code, you have to build and test all the apps that use it. But even at Google, the people working on the lowest levels of the system can't use the standard tools anyway.

ceronman 7 years ago | |

A star pattern still has most of the downsides of the multirepo approach. Specifically, it has the problem of needing a parallel version control (e.g. SemVer) on top of your individual repositories. This creates fragmentation, where different applications have dependencies on different versions of the libraries which ends up in dependency hell, technical debt, and CI hell.
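The fragmentation described here is easy to picture with a small sketch. The `PINS` table below is made up for illustration (app names, library names and versions are all hypothetical); the function just reports every library that different apps pin at different versions.

```python
from collections import defaultdict

# Hypothetical pins: app -> {library: pinned version}.
PINS = {
    "checkout": {"http-client": "2.3.0", "logging": "1.4.1"},
    "search": {"http-client": "3.0.0", "logging": "1.4.1"},
    "admin": {"http-client": "2.3.0"},
}


def version_skew(pins):
    """Map each library pinned at more than one version to {version: [apps]}."""
    by_lib = defaultdict(lambda: defaultdict(list))
    for app, libs in pins.items():
        for lib, version in libs.items():
            by_lib[lib][version].append(app)
    # Keep only the libraries where apps disagree on the version.
    return {
        lib: {version: sorted(apps) for version, apps in versions.items()}
        for lib, versions in by_lib.items()
        if len(versions) > 1
    }


print(version_skew(PINS))
```

With a single repo built at head there is only one version of each internal library, so a table like this has nothing to report for in-house code.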

skybrian 7 years ago | | |

An alternative would be to have a policy where all the app repos must use the same version (nobody can upgrade until they all upgrade). This makes things harder for the library maintainers, but no more than a monorepo.

I don't see why you'd need semver. The apps could sync to a particular commit in the library repo.

mindcrime 7 years ago |

"Shared responsibility" is one of those ideas that sounds good on paper, but doesn't really scale terribly well in the real world. As the old saying goes, "when everybody is responsible, nobody is responsible".

More to the point, as the author of TFA allows, once a system reaches a certain size, nobody can understand it all. At some point you have to engage in division of labor/specialization, and once you do that, it doesn't make sense to have just anybody randomly making changes in parts of the code-base they don't normally work in.

I'd rather see a poly-repo approach, with a designated owner for discrete modules, but where anybody can clone any repo, make a proposed fix, and submit a PR. Basically "internal open source" or "inner source"[1].

In my experience, this is about as close as you can get to a "best of both worlds" situation. But, as the author of TFA also says, you absolutely can make either approach work.

[1]:https://en.wikipedia.org/wiki/Inner_source

Rapzid 7 years ago |

Any good mono repo build tools out there? I've been thinking about this for the past few weeks. Considering creating a general purpose monorepo tool chain and potentially a mono repo first CI system.

Unfortunately some of the most popular CI/CD services out there (Travis, Circle, etc.) don't even support cross-repo pipelines, much less mono repo builds.

fxfan 7 years ago | |

Pants and bazel sound like favorites

Rapzid 7 years ago | | |

Interesting, thanks! Didn't realize Bazel was open sourced..

Those both look way more in the weeds than what I would have imagined.. I guess for Bazel at least it makes sense given Google's scale how fine-grained they would get with caching and incremental builds..

For my needs a simple tool that would allow discovering "WORKSPACES" and constructing a build graph based on what's changed, while handing off the actual building to some entry point in the workspace, would be good enough. I have a weird collection of Gradle projects, Node projects, test suites, docs, etc. with their own build processes already in place.
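A rough sketch of what such a tool's core could look like, assuming Python 3.9+ (for `graphlib` and `PurePosixPath.is_relative_to`). The `WORKSPACES` layout and its dependency edges are hypothetical, and the actual build step is left out: each workspace in the returned plan would be handed to its own entry point (Gradle, npm, and so on).

```python
from graphlib import TopologicalSorter
from pathlib import PurePosixPath

# Hypothetical layout: workspace root -> workspace roots it depends on.
WORKSPACES = {
    "libs/common": set(),
    "services/api": {"libs/common"},
    "web/frontend": {"services/api"},
    "docs": set(),
}


def owning_workspace(path, roots):
    """Return the deepest workspace root that contains path, or None."""
    matches = [root for root in roots if PurePosixPath(path).is_relative_to(root)]
    return max(matches, key=len, default=None)


def build_plan(changed_files, workspaces):
    """List the workspaces that need rebuilding, dependencies first."""
    dirty = {owning_workspace(f, workspaces) for f in changed_files} - {None}
    # Propagate: a workspace is dirty if any of its dependencies is dirty.
    grew = True
    while grew:
        grew = False
        for workspace, deps in workspaces.items():
            if workspace not in dirty and deps & dirty:
                dirty.add(workspace)
                grew = True
    # static_order() yields dependencies before the workspaces that use them.
    order = TopologicalSorter(workspaces).static_order()
    return [workspace for workspace in order if workspace in dirty]


print(build_plan(["libs/common/src/util.py"], WORKSPACES))
# prints ['libs/common', 'services/api', 'web/frontend']
```

The list of changed files would typically come from the VCS (for example a diff against the main branch); that part is deliberately not shown here.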

Some things are also on a "critical" path while others can run async given the context(branch, tag, etc)...

I'm rambling though.

jpeeler 7 years ago | | |

Does anyone know how please (https://please.build) compares?

cryptonector 7 years ago |

I agree, use a monorepo. I anxiously await MSFT's git megamonorepo functionality. Until then there's things like git meta[0].

[0] http://twosigma.github.io/git-meta/

luord 7 years ago |

Yet another chapter in one of the big flamewars. Seeing as I fall in the monorepo camp, I must say I mostly agree; also, I much prefer this tone for an article.

I find it enjoyable how plenty of comments both here and in the other discussion are of people saying "We had a mono/polyrepo and things improved tremendously when we migrated towards a poly/monorepo". The issue might be one of growth and complacency: a drastic change like that forces the team to face the technical debt that was being ignored and do a better implementation using what was learned from past mistakes.

coldtea 7 years ago |

>But I think Matt’s argument misses the #1 reason I’ve flipped quite hard to a monorepo perspective as my own level in the organization has gotten higher

Perhaps the fact that since their level was now higher, they wouldn't have to deal with the nitty gritty details and pain of working with a monorepo as a developer?

E.g. I wasn't for it when I was a dev, but now that I can just impose it on others, I love it. Same with how various 'development process' rituals are adopted...

Tempest1981 7 years ago |

For those using monorepos, what is your branch strategy? Say that 3 projects share a library, and release on different schedules. How does each project freeze shared library changes? Do you keep N version branches?

How does the library team know which consumers a commit may break? What tools are recommended?

AzzieElbab 7 years ago |

As engineers we spend vast amounts of time in constant search for a rival to the "tabs vs spaces" debate

randyrand 7 years ago |

The more complicated answer is sometimes you should use a mono repo and other times you shouldn't.

rdsubhas 7 years ago |

This is starting to become a debate of "principles", like forcing A and B to talk, or forcing A and B to have more explicit boundaries, and so on. Guess where that ends (hint: it doesn't).

With a monorepo, the basic effort you have to put in to start scaling is quite high. To properly do a local build, you need bazel or something. But bazel doesn't stop at just building, but it manages dependencies all the way down to libraries and stuff. Let's say you're using certain maven plugins, like code coverage, shading, etc. Would bazel have all the build plugins your project needs? Most likely not. You have to backport a bunch of plugins from maven to bazel and so on. Guess how many IDEs support bazel? Not a lot.

Then you need to run a different kind of build farm. When you check-in stuff to a monorepo, you need to split and distribute one single build. Compared to a polyrepo where one build == one job, a monorepo is like one build == a distributed pool of jobs, which again needs very deep integration with the build tool (bazel again here), to fan out, fan in across multiple machines, aggregate artifacts, and so on.
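The fan-out/fan-in shape described above can be sketched on a single machine with a thread pool standing in for the build farm. This is only an illustration under stated assumptions: `build_one` is a hypothetical callback that builds one target, the graph maps each target to the targets it depends on, and Python 3.9+ is assumed for `graphlib`. A real farm would add remote workers, artifact transfer and failure handling.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_build(graph, build_one, max_workers=4):
    """Build every target in graph, running independent targets in parallel.

    graph maps target -> set of targets it depends on.
    Returns the targets in the order they finished.
    """
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    finished = []
    pending = {}  # future -> target currently being built
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Fan out: submit every target whose dependencies are all done.
            for target in sorter.get_ready():
                pending[pool.submit(build_one, target)] = target
            # Fan in: wait for at least one job, then unlock its dependents.
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                target = pending.pop(future)
                future.result()  # re-raise any build failure
                sorter.done(target)
                finished.append(target)
    return finished


# Hypothetical three-target chain: app depends on lib, lib depends on base.
print(run_build({"app": {"lib"}, "lib": {"base"}, "base": set()}, lambda t: t))
# prints ['base', 'lib', 'app']
```

With one repo per component, each CI job is effectively a single `build_one` call, which is why off-the-shelf CI handles it without any of this scheduling.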

Then the deployment. Same again. There is no "just works" hosted CI or hosted git or anything for monorepos. People still dabble with concourse or so on.

And guess what, for a component in its own repo, you don't need to do anything. Existing industry and OSS tooling is built from ground up for that. Just go and use them.

To provide a developer a "basic experience" to go from working on, building and deploying a single component – the upfront investment you need to provide with a monorepo is very high. Most companies cannot spend time on that, because scale means different things to different companies. There is a vast gap in the amount of ops/dev tooling you have for independent hosted components vs monorepo tools. Just search for "monorepo tools" or DAG and see how many you can come up with. So what really happens with a monorepo is, most companies go with multi-module maven and jenkins multi-job. The results are easy to predict. I'm not saying that maven/jenkins are bad, but they are _not_ sophisticated, and are not anywhere close to what Twitter/Facebook/Google or any modern company uses to deal with a monorepo (for a good reason). They are just not good at DAG. If you're relying on maven+jenkins as your monorepo solution, all I can say is "good luck".

Instead, if you start by putting one component in one repo, you keep scaling for _much longer_ before you hit a barrier.

In principle, monorepos are better. In practice, they don't have the basic "table stakes" tooling that you need to get going. Maybe monorepo devops tooling is a next developer productivity startup space. But until then, it's not mainstream for very good reasons.

marcosdumay 7 years ago |

So... An article based on equating change recording medium with integration testing procedures.

fxfan 7 years ago |

There's a lot of discussion of bazel and co inside sub-comments but I have a question that isn't addressed:

How do the "global build tools" play with language specific build tools?

My primary stack is Rust and Scala. Both have excellent build capabilities in their native tools. How well do pants/bazel integrate with them? I wouldn't want to rewrite complex builds nor would I expect these tools to have 100% functionality of native ones.

laurentlb 7 years ago | |

Bazel has some level of support for many languages: https://docs.bazel.build/versions/master/be/overview.html#ad...

I know the Scala rules are used in production by multiple companies. Rust support is improving quickly, but it's not perfect. See the dedicated GitHub repositories for more information.

(I work on Bazel)

benmarten 7 years ago |

Please don't. It's just too slow and not efficient. Instead use common open source best practices of shared library architecture. Problem solved! Putting everything into one repo is just lack of organization and creates a huge mess.

klodolph 7 years ago | |

I feel like you've really done no work supporting your argument there. "Slow and inefficient"... what, exactly, is slow and inefficient? Because there are plenty of things slow and inefficient about polyrepos.

I'd say that open-source best practices for shared libraries are appropriate if you're making an open-source shared library. However, these practices are inappropriate for internal libraries, proprietary libraries, and other use cases. In my experience, it's also far from "problem solved". You can point your finger at semantic versioning but in the meantime we go through hell and back with package managers trying to manage transitive library dependencies and it SUCKS. Why, for example, do you think people are fed up with NPM and created Yarn? Or why people constantly complain about Pip / Pipenv and the like? Why was the module system in Go 1.11 such a big deal? The answer is that it's hard to follow best practices for shared libraries, and even when you do follow best practices, you end up with mistakes or problems. These take engineering effort to solve. One of the solutions available is to use a monorepo, which doesn't magically solve all of your problems, it just solves certain problems while creating new problems. You have to weigh the pros and cons of the approaches.

In my experience, the many problems with polyrepos are mostly replaced with the relatively minor problems of VCS scalability and a poor branching story (mostly for long-running branches).

mindcrime 7 years ago | | |

> However, these practices are inappropriate for internal libraries, proprietary libraries, and other use cases.

Why do you say so?

zamadatix 7 years ago | |

Too slow as in "to do it" or too slow as in "to use it"? In either case I think if that were true there wouldn't be monorepos at Google, Facebook, and Microsoft. I will say it's true that didn't come for free, e.g. Microsoft had to make GVFS due to the sheer enormity of their codebase but that's already done and works pretty well.

I agree shared library style makes more sense in most cases though. The main problem with it is forcing everyone to use the latest library versions but that isn't insurmountable by any means.

mlthoughts2018 7 years ago | | |

My old boss was an engineering manager at Google in the 90s and early 2000s. He used to tell us that _everyone_ he interacted with at Google _hated_ the monorepo, and that Google’s in-house tooling did not actually produce anything approaching a sane developer experience. He used to laugh so cynically at stories or that big ACM article touting Google’s use of a monorepo (which was a historical unplanned accident based on toppling a poorly planned Perforce repository way back when), because in his mind, his experience with monorepos at Google was exactly why his engineering department (several hundred engineers) in my old company did not use a monorepo.

woadwarrior01 7 years ago | | |

I work at one of the monorepo companies that you mention and there’s some truth to the “too slow” part. Although it’s been a lot better lately (largely due to the efforts of the internal version control dev teams), I’ve noticed at times in the past that you could do a ‘<insert vcs> pull’, go on a 15 minute break and it wouldn’t be done by the time you’re back.

Personally, I think there’s a place for mono repos and there’s a place for smaller independent repos. If a project is independent and decoupled from the rest of the tightly coupled code base (for instance things which get open-sourced), it makes no sense to shove it into a huge monorepo.

scaleout1 7 years ago | | |

Last time I worked at a massive monorepo, half of my team was running git fetch as a cron job. It was an extremely painful experience

greenshackle2 7 years ago | |

It would be quite remarkable if in-house corporate software, which faces different constraints and challenges than open source software, turned out to nonetheless have exactly the same best practices.

mindcrime 7 years ago | | |

The idea of using open source styled practices for internal development is not exactly new or remarkable. It's something people have been doing for a long time.

https://en.wikipedia.org/wiki/Inner_source