Google Is 2B Lines of Code, All in One Place

Google Is 2B Lines of Code, All in One Place (wired.com)

473 points by sk2code 10 years ago | 325 comments

antics 10 years ago |

Just because people are talking about it: I work at MSFT, and the numbers Wired quotes for the lines of code in Windows are not even close to being correct. Not even in the same order of magnitude.

Their source claims that Windows XP has ~45 million lines of code. But that was 14 years ago. The last time Windows was even in the same order of magnitude as 50 million LOC was in the Windows Vista timeframe.

EDIT: And, remember: that's for _one_ product, not multiple products. So an all-flavors build of Windows churns through a _lot_ of data to get something working.

(Parenthetical: the Windows build system is correspondingly complex, too. I'll save the story for another day, but to give you an idea of how intense it is, in a typical _day_, the amount of data that gets sent over the network in the Windows build system is a single-digit _multiple_ of the entire Netflix movie catalog. Hats off to those engineers, Windows is really hard work.)

rakoo 10 years ago | |

> the numbers Wired quotes for the lines of code in Windows are not even close to being correct

OTOH you can't blame them for being incorrect if you (as in, Microsoft, not you personally) are being so secretive about the figures. I'm pretty sure everyone would love to see how Microsoft works internally, especially now that you teased us with that Windows build system.

dsjoerg 10 years ago | | |

Yes you can blame them for being incorrect. If they don't have a correct and relatively up-to-date figure, they should be clear about that. "Microsoft declined to comment on how many lines of code Windows has now" or "Windows XP used 45 million lines of code, but that was 14 years ago so it's not a very good comparison to anything".

hanwenn 10 years ago | |

The reason Google developed Piper was basically that Perforce couldn't scale beyond a single machine, and our repo was getting too large to even be usable on the biggest, beefiest x86 server that money could buy (single machines with terabytes of ram and zillions of cores).

If Microsoft has close to the same amount of code in a single repository, then they must have also written their own version control service that runs on more than one machine.

The last rumor I heard is that Microsoft bought a license to the Perforce source code, and created their own flavor to host internal code ("Source Depot"?), which presumably still runs on a single machine.

ohitsdom 10 years ago | |

> single-digit _multiple_ of the entire Netflix movie catalog

Strange unit of comparison, although I may start using it.

RyJones 10 years ago | | |

Facebook gets a Flickr worth of photos every few days.

protomyth 10 years ago | | |

Well, I remember when Library of Congress and encyclopedias were used as units of measure. I would guess Netflix is the new stack of media.

euske 10 years ago | | |

Uh, how much is it actually? (Simple searching didn't seem to get me the answer.)

We should have a list of these things.

ghgr 10 years ago | | |

That's the same as saying "the same order of magnitude"

Pxtl 10 years ago | |

I assume they don't use msbuild, because they don't completely hate themselves.

4ad 10 years ago | |

So, how many lines of code does it have?

antics 10 years ago | | |

I can't say. I work here, but I don't speak for the company.

RandallBrown 10 years ago | | |

It's a pretty hard number to come up with. Most employees only have access to a small fraction of the codebase. Even if you had access to all of it, it's hard to say what actually counts as Windows and what doesn't.

dx211 10 years ago | |

And they probably didn't count all of the test crap and IDW tools that nobody's used since Bill Gates was there, but still get built every time.

Datsundere 10 years ago | |

Would you rather work on the linux kernel instead of windows?

azth 10 years ago | | |

What's the relation to the post to which you're replying?

est 10 years ago | |

I once torrented a WindowsXP+Office2003, ripped, together for about 120MB. Has basic functionalities working great.

dekhn 10 years ago |

I'm a google software engineer and it's nice to see this public article about our software control system. I think it has plusses and minuses, but one thing I'll say is that when you're in the coding flow, working on a single code base with thousands of engineers can be an intensely awesome experience.

Part of my job- although it's not listed as a responsibility- is updating a few key scientific python packages. When I do this, I get immediate feedback on which tests get broken and I fix those problems for other teams along side my upgrades. This sort of continuous integration has completely changed how I view modern software development and testing.

sytse 10 years ago |

So a monolithic codebase makes it easier to make an organization-wide change. Microservices make it easier to have people work and ship in independent teams. The interesting thing is that you can have microservices with a monolithic codebase (as Google and Facebook are comprised of many services). But you can also have a monolithic service with many codebases (like our GitLab that uses 800+ gems that live in separate codebases). And of course you can have a monolithic codebase with a monolithic service (a simple PHP app). And you can have microservices with diverse codebases (like all the hipsters are doing).

I'm wondering if microservices force you to coordinate via the codebase, just like using many codebases forces you to coordinate via the monolithic service. Does the coordination have to happen somewhere? I wonder if early adopters of microservices in many codebases (SoundCloud) are experiencing coordination problems trying to change services.

ChuckMcM 10 years ago |

I will say that I saw and experienced many things that changed my definition of 'large' at Google, but the most amazing was the source code control / code review / build system that kept it all together.

The bad news was that it allowed people to say "I've just changed the API to <x> to support the <y> initiative, code released after this commit will need to be updated." and have that affect hundreds of projects, but at the same time, the project teams could adapt very quickly, with the orb on their desk telling them that their integration and unit tests were passing.

I thought to myself, if there is ever a distributed world wide operating system / environment, it is going to look something like that.

sshumaker 10 years ago |

Xoogler here. There were tons of benefits to Google's approach, but they were only viable with crazy amounts of tooling (code search, our own version control system, the aforementioned CitC, distributed builds that reused intermediate build objects, our own BUILD language, specialized code review tools, etc).

I'd say the major downside was that this approach basically required a 'work only in HEAD' model, since the tooling around branches was pretty subpar (more like the Perforce model, where branches are second-class citizens). You could deploy from a branch but they were basically just cut from HEAD immediately prior to a release.

This approach works pretty well for backend services that can be pushed frequently and often, but is a bit of a mismatch for mobile apps, where you want to have more carefully controlled, manually tested releases given the turnaround time if you screw something up (especially since UI is really inefficient to write useful automated tests around). It's also hard to collaborate on long-term features within a shipping codebase, which hurts exploration and prototyping.

nulltype 10 years ago | |

Could you elaborate how the single repo model causes that thing you said in the last sentence?

ksk 10 years ago |

It's interesting that they compare LoC with Windows. I suppose that this article wants us to be amazed at those numbers. However, my experience with Google's products indicates a gradual decline in performance and a simultaneous gradual increase in memory bloat (Maps, Gmail, Chrome, Android). Which ironically, FWIW, hasn't been the case with Windows. I have noticed zero difference in performance going from Windows 7 to 8 to 10.

branchless 10 years ago | |

I'd have to disagree with this. First the baseline: windows is very slow. Second I found later versions slower. Third (and most maddening) every version of windows I've ever used has gotten slower over time (including not installing new s/w and defragmenting).

sz4kerto 10 years ago | | |

Windows is slow? Compared to what? In what task? Running a game? Boot time? Opening Firefox?

I have problems with Windows, but it's the fastest desktop OS I think, mostly because its graphics stack is by far the best of all. Running number-crunching C code is exactly the same on Windows or Linux. (See all the benchmarks on the Internet.)

ksk 10 years ago | | |

I know that people have had experiences similar to yours. It's fine to disagree, but AFAIK pretty much all benchmarks show that there is no noticeable difference in performance from 7 to 8 to 10 and this matches with my own experience. I refuse to upgrade unless I get similar or better performance. But then again, I'm not really interested in researching every single benchmark. Windows is fast, stays fast, and that's pretty much all I care about.

ocdtrekkie 10 years ago | | |

8.1 and 10 run incredibly well even on very old hardware. I will agree a given Windows install may feel slower over time, and it makes sense to rebuild the PC occasionally, though that may, again, be less so with 8.1 and 10.

ZanyProgrammer 10 years ago | | |

People defrag modern SSDs?

lighthawk 10 years ago |

"The two internet giants (Google and Facebook) are working on an open source version control system that anyone can use to juggle code on a massive scale. It’s based on an existing system called Mercurial. “We’re attempting to see if we can scale Mercurial to the size of the Google repository,” Potvin says, indicating that Google is working hand-in-hand with programming guru Bryan O’Sullivan and others who help oversee coding work at Facebook."

Why Mercurial instead of Git?

k33n 10 years ago |

Comparing "Google" to Windows isn't really a fair comparison. I'm sure all of the code that represents products that Microsoft has in the wild far exceeds 2B lines.

DannyBee 10 years ago | |

Note that this is just the monolithic repository. Google also has other non-piper repositories containing hundreds of millions of lines too :P

For example, android and chrome are git based.

Note also that when codesearch used to crawl and index the world's code, it was not actually that large. It used to download and index tarballs, svn and cvs repositories, etc.

All told, the amount of code in the world that it could find on the internet a few years ago was < 10b lines, after deduplication/etc.

So while you may be right or wrong, i don't think it's as obvious you are right as you do.

ocdtrekkie 10 years ago | | |

I'm still trying to figure out why having everything dumped in one big pile is something worth bragging about. I'd far rather have code sorted well into proper repositories.

guelo 10 years ago | |

Agree especially since Google's repo contains their version of almost the entire Microsoft Office suite.

nickpsecurity 10 years ago | |

I also agree especially as this includes the whole Google software ecosystem and Microsoft has their own ecosystem. Microsoft's whole ecosystem of products that work together and run on Windows is much larger than Windows itself. Might not be as monolithically developed, though.

bluedino 10 years ago | |

Windows, Office, things like .NET, other sites like MSN and Bing...

utexaspunk 10 years ago | | |

SQL Server has got to be a few lines...

yongjik 10 years ago |

One humorous side-effect of having all that code viewable (and searchable!) by everyone was that the codebase will contain whatever typo, error, or mistake you can think of (and convert into a regular expression).

I remember seeing an internal page with dozens of links for humorous searches like "interger", "funciton", or "([A-Z][a-z]+){7,} lang:java"...
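
For readers puzzling over the last pattern, here is a minimal Python sketch. It assumes the code search regex dialect behaves like Python's `re` for this pattern (an assumption), and it leaves out `lang:java`, which is a search filter rather than part of the regex. The pattern matches seven or more capitalized word segments in a row, i.e. very long CamelCase identifiers. The sample names are made up for illustration.

```python
import re

# Seven or more consecutive "capital letter + lowercase letters" segments,
# i.e. a CamelCase name built from at least seven words.
LONG_CAMEL_CASE = re.compile(r"([A-Z][a-z]+){7,}")

def is_long_camel_case(name):
    """Return True if the name contains a run of 7+ capitalized segments."""
    return LONG_CAMEL_CASE.search(name) is not None

# Hypothetical identifiers, invented for this example.
samples = [
    "AbstractProxyFactoryBean",                            # 4 segments
    "SimpleBeanFactoryAwareAspectInstanceFactoryBuilder",  # 8 segments
    "interger",                                            # lowercase typo
]

for name in samples:
    print(name, is_long_camel_case(name))
```

Only the second sample matches: the first has too few segments, and the third contains no capital letter at all.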

wetmore 10 years ago | |

> "([A-Z][a-z]+){7,} lang:java

Yeah this one was my favorite of the code search examples, there are some really good ones in there.

cag_ii 10 years ago | | |

Can you explain this? It looks to me like a regexp that searches Java source for words 7+ characters that start with a capital letter?

bubersson 10 years ago | |

I encourage everyone to search in their codebase for "1204", "521" and "265" :)

nandhp 10 years ago | |

And then you killed off the public version and you keep that fun (and useful) toy to yourself.

(But as great as Google Code Search was, my grudge is because of Reader.)

low_battery 10 years ago |

Direct link to talk (The Motivation for a Monolithic Codebase ):

https://www.youtube.com/watch?v=W71BTkUbdqE

Walkman 10 years ago | |

This is crazy :D I have never heard of tools and workflows like this.

kazinator 10 years ago |

I am unable to believe that Google has 2B lines of original code written from scratch at Google.

Maybe they are counting everything they use. Somewhere among those 2B lines is all the source code for Emacs, Bash, the Linux kernel, every single third-party lib used for any purpose, whether patched with Google modifications or not, every utility, and so on.

Maybe this is a "Google Search two billion" rather than a conventional, arithmetic two billion. You know, like when the Google engine tells you "there are about 10,500,000 results (0.135 seconds)", but when you go through the entire list, it's confirmed to be just a few hundred.

roxmon 10 years ago | |

Google has been around for 17 years and employs roughly 10,000+ software developers. I think it's reasonable to assume that the 2B LOC metric is accurate...

hk__2 10 years ago | | |

Windows has been around for 35 years and Microsoft had 61,000+ employees (ok, that’s not only software developers and they don’t work only on Windows) in 2005; and it’s only ~50M LOC. I don’t think the number of years + developers really shows anything; you don’t write new code every day.

sp332 10 years ago | |

Yes, that is counting everything. It's "the software needed to run all of Google’s Internet services" (so probably not Emacs, but the other stuff). But it's all in the repo, and it all has to be maintained.

hokkos 10 years ago | |

> Google engineers modify 15 million lines of code across 250,000 files each week > Google’s 25,000 engineers

So employees modify 120 lines / day. If we imagine linear growth in employees over 17 years to 25K coders, with 250 work days a year, they add around 6 more coders each work day, so about 55M man-days, so around 6.3G LOC modified. But modified != added, so I wouldn't believe this is all their own lines.
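
The back-of-the-envelope estimate above can be reproduced with a few lines of Python. This is only a sketch of the commenter's reasoning: the input figures are the ones quoted from the article and the comment, and the linear-growth model (average headcount equal to half of the final headcount) is the commenter's assumption, not a known fact about Google.

```python
# Figures quoted from the article and the comment above.
LINES_CHANGED_PER_WEEK = 15_000_000
ENGINEERS_NOW = 25_000
YEARS = 17
WORK_DAYS_PER_YEAR = 250
WORK_DAYS_PER_WEEK = 5

# Lines modified per engineer per work day.
lines_per_engineer_day = LINES_CHANGED_PER_WEEK / ENGINEERS_NOW / WORK_DAYS_PER_WEEK

# Linear growth from 0 to ENGINEERS_NOW over the whole period.
total_work_days = YEARS * WORK_DAYS_PER_YEAR
hires_per_work_day = ENGINEERS_NOW / total_work_days

# Under linear growth, the average headcount is half of the final headcount.
engineer_days = (ENGINEERS_NOW / 2) * total_work_days

# Cumulative lines modified if today's per-engineer rate held throughout.
lines_modified = engineer_days * lines_per_engineer_day

print(f"{lines_per_engineer_day:.0f} lines per engineer per day")
print(f"{hires_per_work_day:.1f} new engineers per work day")
print(f"{engineer_days / 1e6:.1f}M engineer-days")
print(f"{lines_modified / 1e9:.2f}G lines modified")
```

With these inputs the script gives roughly 53M engineer-days and 6.4G modified lines, close to the rounded figures in the comment.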

hanwenn 10 years ago | |

most of the non-google code is stored in a special subdirectory, and AFAICT, it's less than 10% of the total.

hellbanner 10 years ago |

"LGTM is google speak for Looks good to me" - actually common outside of Google.

malkia 10 years ago | |

SGTM

a3n 10 years ago |

In the spirit of "You didn't build that," I wonder how many lines of code comprise the binaries that Google binaries run on? Windows, Linux, network stacks, Mercurial, etc, etc.

I also wonder if there's a circular relationship anywhere in there.

Splines 10 years ago | |

It's turtles all the way down, and also includes all the hardware and people.

sytse 10 years ago |

The CitC filesystem is very interesting. This is local changes overlaid on top of the full Piper repository. Commits are similar to snapshots of the filesystem. Sounds similar to https://github.com/presslabs/gitfs
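
A toy model of the overlay idea, for illustration only. This is not how CitC is implemented (the thread does not give those details); it just shows local edits living in a small writable layer on top of a read-only snapshot, with lookups falling through to the snapshot. The file paths and contents are invented.

```python
from collections import ChainMap
from types import MappingProxyType

# Read-only snapshot of the repository at some revision (paths are made up).
snapshot = MappingProxyType({
    "search/ranker.cc": "v1 of ranker",
    "maps/tiles.cc": "v1 of tiles",
})

# Workspace: an empty writable layer placed in front of the snapshot.
local_changes = {}
workspace = ChainMap(local_changes, snapshot)

# ChainMap sends writes to the first mapping only, so the snapshot is untouched.
workspace["search/ranker.cc"] = "my edited ranker"

print(workspace["search/ranker.cc"])  # the local edit wins
print(workspace["maps/tiles.cc"])     # falls through to the snapshot
print(sorted(local_changes))          # only the touched file is stored locally
```

The point of the design is that a workspace costs space proportional to what was changed, not to the size of the repository underneath it.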

makecheck 10 years ago |

I really wish there was a tendency to track all change/activity and not just total size; maybe like the graphs on GitHub. Removing things is key for maintenance and frankly if they haven't removed a few million lines in the process of adding millions more, they have a problem.

Having a massive code base isn't a badge of honor. Unfortunately in many organizations, people are so sidetracked on the next thing that they almost never receive license to trim some fat from the repository (and this applies to all things: code, tests, documentation and more).

It also means almost nothing as a measurement. Even if you believe for a moment that a "line" is reasonably accurate (and it's tricky to come up with other measures), we have no way of knowing if they're measuring lots of copy/pasted duplicate code, massive comments, poorly-designed algorithms or other bloat.

nhaehnle 10 years ago | |

The article claims 2 billion lines of code across 25000 engineers, which boils down to 80k lines of code per engineer. I'm not sure what to think about that.

It seems to be in a reasonable order of magnitude for C++/Java-type languages compared to projects that I have seen, but it does imply a significant chunk of code that is not actively being worked on for a long time (which is not necessarily a bad thing - don't change a running system and all that).

dekhn 10 years ago | |

Although I agree that line counting is a silly exercise much of the time, the talk did cover change activity as well as total size.

With regard to copy/pasted duplicate code and massive comments, we do have ways of knowing that as both of those are easily computable. Duplicate code can be matched using hashes and comments are delimited, making their measurement easy.
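
A minimal sketch of both measurements, under stated assumptions: whole files are compared after whitespace normalization and hashed with SHA-256, and only `//` line comments are counted. Nothing here is claimed to be Google's actual tooling; the file names and contents are hypothetical.

```python
import hashlib
from collections import defaultdict

def content_hash(source):
    """Hash a file after stripping per-line whitespace and blank lines."""
    normalized = "\n".join(
        line.strip() for line in source.splitlines() if line.strip()
    )
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def find_duplicates(files):
    """Group paths whose normalized contents hash to the same value."""
    groups = defaultdict(list)
    for path, source in files.items():
        groups[content_hash(source)].append(path)
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]

def comment_line_count(source):
    """Count lines that are entirely a `//` comment (block comments ignored)."""
    return sum(1 for line in source.splitlines() if line.strip().startswith("//"))

# Hypothetical files, invented for this example.
files = {
    "a/Util.java": "// helper\nint add(int x, int y) {\n  return x + y;\n}\n",
    "b/Util.java": "// helper\nint add(int x, int y) {\n    return x + y;\n}\n\n",
    "c/Other.java": "int sub(int x, int y) {\n  return x - y;\n}\n",
}

print(find_duplicates(files))
print(comment_line_count(files["a/Util.java"]))
```

Here the two `Util.java` files differ only in indentation and trailing blank lines, so they are reported as one duplicate group. Catching partial copy/paste would need hashing of smaller windows rather than whole files.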

brozak 10 years ago |

The comparison of Windows to all of Google's services is pointless and misleading.

It's like comparing the weight of a monster truck and the total weight of all the cars at a dealership...

temuze 10 years ago |

Assuming these numbers are right...

(15 million lines of code changed a week) / (25,000 engineers) = 600 LOC per engineer per week

Is ~120 LOC per engineer per workday normal at other companies?

_delirium 10 years ago | |

Elsewhere in this thread it's mentioned that Google makes use of large-scale, automated refactoring tools: http://research.google.com/pubs/pub41342.html

Would be interesting to know what percentage of the total LoC touched are typically from that kind of automated refactor. Depending on the codebase, you can touch a ton of lines of code in a very small amount of time with those tools.

ajg360 10 years ago | |

I write between 400 and 600 lines of code a day where I work... I feel that 120 LOC a day is on the smaller side (of what I'm used to anyway).

xur17 10 years ago | |

It really depends on what you're writing. Lower level c / c++, doubtful. Python, javascript, java, etc, yeah, it's believable.

melling 10 years ago |

I imagine that there's a lot of Java and C++. I do like Go but it makes you wonder if a more expressive language that requires a fraction of the code would be helpful. Maybe Steve Yegge will see Lisp at Google after all.

astrange 10 years ago | |

He claims to have stopped using it (#5):

https://sites.google.com/site/steveyegge2/ten-predictions

jakub_g 10 years ago |

Some questions that immediately come to my mind:

- What is the disk size of a shallow clone of a repo (without history)?

- Can each developer actually clone the whole thing, or do you do a partial checkout?

- Does the VCS support a checkout of a subfolder (AFAIK mercurial, same as git, does not support it)?

- How long does it take to clone the repo / update the repo in the morning?

Since people are talking about huge across-repo refactorings, I guess it must be possible to clone the whole thing.

Facebook faces similar issues as Google with scaling so they wrote some mercurial extensions, e.g. for cloning only metadata instead of whole contents of each commit [1]. Would be interesting to know what Google exactly modified in hg.

[1] https://code.facebook.com/posts/218678814984400/scaling-merc...

thrownaway2424 10 years ago | |

Most of your questions don't apply to the system described in this article. You do not clone the repository, you merely chdir into a vfs that is backed by a consistent view of the repository at a point in time, which view is served from a large distributed service that lives in Google datacenters alongside other Google services like Search, Maps, and Gmail. Because it is enormous and nobody clones it, it is also true that nobody partially clones it. You do not "checkout a subfolder" either.

Your last point is the only one that applies. If you want your view to advance from revision 123 to revision 125 it takes about a second to do so. If you have pending (not yet submitted) changes in your client, they might have to be merged with other changes, which can take a bit longer. If you have a really huge pending change, and your client is way behind HEAD, it might take a few tens of seconds to merge everything.

bruckie 10 years ago | |

Most of these questions are answered in the talk. The tl;dr is that you don't clone or check out anything at all: instead, you use CitC to create a workspace, and the entire repository is magically available to you to view or edit.

This model precludes offline work, of course. But that's not much of a problem in practice.

jakub_g 10 years ago | | |

I did not follow the links in wired article, and didn't realize there was a link to a youtube talk. Thanks for tl;dr, need to watch the video!

lrem 10 years ago | |

In practice: none of these operations take long enough to tempt you into alt-tabbing to cat videos.

therealmarv 10 years ago |

What? This is surpassing the mouse genome complexity. See this chart for comparison: http://www.informationisbeautiful.net/visualizations/million...

Strikingwolf 10 years ago |

Really interesting article. Sounds like a great solution to the problem in git of submodules. Definitely worth looking at. Thanks for posting OP.

IMO this system would best be suited for large companies, but I could see the VCS that they are developing being used by anyone if it gets a github-esque website.

ilurkedhere 10 years ago |

Yeah, but it's only like ~200 lines rewritten in Lisp.

juhq 10 years ago | |

A serious question about Lisp and Google, is Lisp used within Google, and if so, in what projects and why?

Apocryphon 10 years ago |

Looks like someone's going to have to update this: http://www.informationisbeautiful.net/visualizations/million...

buro9 10 years ago |

This hurts just thinking about what the build, test and deploy systems must look like.

jsolson 10 years ago | |

Well, for build take a look at bazel, although attach it to a cluster of machines that can all read from Piper.

michaelwww 10 years ago |

For those interested, the source analyzer Steve Yegge was working on called GROK has been renamed Kythe. I don't know how useful it turned out to be for those 2B LOC. http://www.kythe.io/docs/kythe-overview.html

Steve Yegge, from Google, talks about the GROK Project - Large-Scale, Cross-Language source analysis. [2012] https://www.youtube.com/watch?v=KTJs-0EInW8

Locke1689 10 years ago |

What I'd like to know and no one seems to mention:

What's the experience like for teams not running a Google service and instead interacting with external users and contributors, e.g. the Go compiler or Chrome?

bruckie 10 years ago | |

Many larger external projects are hosted in other repositories (Chrome and Android are well-known examples).

Smaller stuff (like, say, tcmalloc or protocol buffers) is usually hosted in Piper and then mirrored (sometimes bidirectionally) to an external repository (usually GitHub these days).

Locke1689 10 years ago | | |

Thanks, but I guess I was asking more about how this affects the other development characteristics described. You still have to deal with the massive repository and infrastructure, but if you're Go, for example, and you want to change an API 1) you can't see the consumers because many or most won't be Google-internal, and 2) even if you could see them, you can't change them. Even the build/test/deploy systems are somewhat compromised because you can't rely on all builders of your components being Google employees and having access to those resources.

So in these scenarios, what does Google's infrastructure buy you, if anything? And if it doesn't buy you anything, how does that influence Google culture? Are teams less willing to do real open development due to infrastructure blockage?

727374 10 years ago |

Really? This article sounds very oversimplified, but I haven't worked at Google so I wouldn't know. I'm assuming if you want to change some much-depended-on library, there's a way to up the version number so you don't hose all your downstream users. That's the way it worked at Amazon at least. Also, I wonder why the people in the story think Google's codebase is larger than that of other tech giants, not that it really matters.

jsolson 10 years ago | |

Google mostly works at HEAD. Very little is versioned, and branches are almost unheard of.

In general you change the much depended on library and all of its consumers (probably over time in multiple changes, but you can do it in one go if it really needs to be a single giant change).

rictic 10 years ago | |

It's incumbent upon the person updating the library to get all users migrated to the new one. There are a few strategies for doing this though, including temporarily having two versions of the library.

There are also tools for making large scale changes safely and quickly.

devinj 10 years ago | |

The whole point of one big repository is being able to avoid versioning and always work at head.

zBard 10 years ago | |

Last I heard Google is still on Java 7 precisely because of this, although that might have changed. It's fun seeing the different theologies at Amazon and Google - I remember Yegge's famous platform rant, and he highlighted the Amazon versioned-library system as something which it did better than Google.

sandGorgon 10 years ago |

What are the best practices to follow in a single-repo-multiple-projects world? Some people recommend git submodule, others recommend subtree.

How do you guys manage alerts and messages - does every developer get a commit notification, or is there a way to filter out messages based upon submodule?

How does branching and merging work?

I'm wondering what processes are used by non-Google/FB teams to help them be more productive in a monolithic repo world.

cmrdporcupine 10 years ago | |

Generally branching isn't really a thing at Google. Work is done at the code review level per change list ("CL"). Most changes happen through incremental submission of reviewed CLs, not by merging in feature branches. Every CL must run the gauntlet of code review, as well as can not usually be submitted without passing tests. There are rare cases where branching is used, but not commonly.

As for notifications, the CL has a list of reviewers and subscribers. If you want to see code changing, you watch those CLs. Most projects have a list where all submitted CLs go.

sandGorgon 10 years ago | | |

Can you explain this a little more - what is a CL vs a changeset...and what do you mean by watching changelists. It sounds like you're subscribing to specific commits...but I'm talking about more at a project/directory level within the monolithic repo.

ajross 10 years ago | |

FWIW: git submodules are not a single repo by definition. It's just a way to automate the checkout of specifically-versioned external projects without requiring hackery like packing tarballs into the project source. It has its uses, but it's definitely not what they're talking about here.

luckydude 10 years ago | | |

Agree 100%. Git submodules are for tracking other stuff, not for doing dev on that other stuff.

If you would like to see how things would work with submodules that behaved just like files behave (full distributed workflow) we've got a (unfortunately commercial) solution here:

http://www.bitkeeper.com/nested

nemesisrobot 10 years ago |

The comparison between the total LOC across all of Google's products against just one of Microsoft's is a bit unfair.

h1fra 10 years ago |

The comparison with Windows is really just there to give the casual reader something to compare against; it's not really that good. An OS is a huge project. But Google has hundreds of different projects, APIs, libraries, frameworks... Even Unix, with an "unlimited" source of developers, does not reach that point.

dchichkov 10 years ago |

I remember somebody wise had said once: "Every line of code is a constraint working against you."

jfkw 10 years ago |

How do the monolithic repository companies handle dependencies on external source code?

Are libraries and large projects e.g. RDBMS generally vendored/forked into the monolithic repositories, regardless of whether the initial intent is to make significant changes?

jpollock 10 years ago | |

There's typically a subdirectory called third_party, with subdirectories for each vendor, product and version. If the team is smart, they will also enact a rule saying "only one version". If you're really, really smart, local changes are kept as a set of patches, keeping them separate from the imported tar file.

So, for source deliveries:

  third_party/apache/httpd/2.4/release.tgz
                              /patch.tgz
                              /Makefile (or other config)
  third_party/apache/httpd/2.2/release.tgz
                              /patch.tgz
  ...
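
The "only one version" rule is easy to check mechanically. Below is a minimal sketch, assuming the `third_party/<vendor>/<product>/<version>/<file>` layout shown above; the function names are hypothetical and the path list simply mirrors the example tree.

```python
from collections import defaultdict

def versions_by_package(paths):
    """Map 'vendor/product' to the set of version directories found."""
    versions = defaultdict(set)
    for path in paths:
        parts = path.split("/")
        # Expected shape: third_party/<vendor>/<product>/<version>/<file>
        if len(parts) >= 5 and parts[0] == "third_party":
            vendor, product, version = parts[1:4]
            versions[f"{vendor}/{product}"].add(version)
    return versions

def one_version_violations(paths):
    """Return the packages that have more than one version checked in."""
    return {
        package: sorted(found)
        for package, found in versions_by_package(paths).items()
        if len(found) > 1
    }

paths = [
    "third_party/apache/httpd/2.4/release.tgz",
    "third_party/apache/httpd/2.4/patch.tgz",
    "third_party/apache/httpd/2.4/Makefile",
    "third_party/apache/httpd/2.2/release.tgz",
    "third_party/apache/httpd/2.2/patch.tgz",
]

print(one_version_violations(paths))  # {'apache/httpd': ['2.2', '2.4']}
```

A check like this could run as part of review or pre-submit, so that a second version of a package has to be justified rather than slipping in unnoticed.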

cpeterso 10 years ago | | |

For example, here is Chromium's third_party directory:

https://chromium.googlesource.com/chromium/src.git/+/master/...

breatheoften 10 years ago |

Are the source of piper and the build tools also in the mono repo and also developed/deployed off the head branch? Seems like a random engineer could royally fubar things if they broke a service which the build system depends on ...

thrownaway2424 10 years ago | |

You said "developed/deployed" as if it were the same thing. Even if you somehow checked in the giant flaw, bypassing all code review and automated testing, it's not like that would suddenly appear in production. Google isn't some PHP hack where you just copy a tarball to The Server. Binaries of even slightly important systems typically go through many stages of deployment, first into unimportant test systems, then usually very, very slowly into production with lots of instrumentation and of course, quick and easy methods of rolling back to the previous release.

breatheoften 10 years ago | | |

I see - it was something of a half-baked thought, but in my defense I wasn't trying to suggest that I thought the head was automatically deployed to production ... Deployed to testing round 1 ... N is still a "deployment" isn't it ...? The shared boilerplate for how that magic works in a scalable way for so many different projects must be quite complex and itself hard to test ...

QuercusMax 10 years ago | |

Everything has to go through pre-submit checks before it makes it to HEAD. And if you get it past those and it starts breaking stuff, there are robots that will automatically roll back your change if it breaks enough stuff.

dblotsky 10 years ago |

Even if the numbers are off, the assumption that 40M lines of code take less effort to write than 2B lines of code commits the fallacy that effort is proportional to number of lines of code. Come on, Wired, you can do better.

amelius 10 years ago |

Is this article saying that all developer employees have access to the "holy" search algorithm internals? I can hardly believe that to be true, given the fact that SEO is a complete industry.

enf 10 years ago | |

Once upon a time it was all in one repository. Shortly after I started there in late 2005, the "HIP" source code (high-value intellectual property, I think it stood for) was moved to its own source tree, with only precompiled binaries available to the rest of the company.

Looks like there is a Quora question that mentions this too: https://www.quora.com/How-many-Google-employees-can-read-acc...

jsolson 10 years ago | |

It is not saying that.

FTA:

> There are limitations this system. Potvin says certain highly sensitive code—stuff akin to the Google’s PageRank search algorithm—resides in separate repositories only available to specific employees.

The vast majority of code is visible to everyone, though.

shampine 10 years ago | |

No, it specifically says the opposite:

"Potvin says certain highly sensitive code—stuff akin to the Google’s PageRank search algorithm—resides in separate repositories only available to specific employees."

known 10 years ago |

How frequently does Google do code refactoring? https://en.wikipedia.org/wiki/Code_refactoring

rbinv 10 years ago |

Those are mind-boggling numbers.

Although I kind of doubt that "almost every" engineer has access to the entire repo, especially when it comes to the search ranking stuff.

dblock 10 years ago |

A giant repo works for Google, and works for Facebook, and Microsoft, but it's bad for the development community at large.

If you start centralizing your development you’re killing any type of collaboration with the outside world and discouraging such collaboration between your own teams.

http://code.dblock.org/2014/04/28/why-one-giant-source-contr...

wedesoft 10 years ago |

With 2 billion lines of code I would consider the problem of developers stepping on each other's toes essentially solved.

rbanffy 10 years ago |

What I find most distressing is that their Python code indents with two spaces... This is so wrong, Google.

izzydata 10 years ago |

If they were to recompile all of it on a standard desktop PC how long would it take? A week?

sa2015 10 years ago |

I wonder how close the "piper" system is to the code.google.com project.

DannyBee 10 years ago | |

I worked on code.google.com, and I can tell you they are 100% unrelated.

Piper grew out of a need to scale the source control system the initial internal repositories were using.

code.google.com was a completely separate thing supporting completely different version control models, and a very different scale (a very large number of small repositories vs. a very small number of very large repositories).

a1k0n 10 years ago | |

IIRC, Piper is a reimplementation of the Perforce backend, in order to handle the code size and the sheer number of "changelists" submitted per second. Nothing to do with code.google.com.

spectral321 10 years ago | |

They are unrelated. :)

MrBra 10 years ago |

Am I the only one who initially read 28 instead of 2B? :)

MrBra 10 years ago | |

downvoter: laughter is good for your health.

wellsjohnston 10 years ago |

What is a "line of code"? Out of the 2B lines of code Google has, how much was auto-generated? How many of those lines are config files? This is a very silly article that has little to no value.

therealmarv 10 years ago |

So they do not suffer from git submodules, I guess.

wgpshashank 10 years ago |

Cool. How much of it is front end and how much is back end?

rosege 10 years ago |

How many lines is duckduckgo? :-)

creshal 10 years ago | |

Can't be that many, given they outsource the actual search engine to third parties.

nootropicdesign 10 years ago |

OMG it's all in one file? OMG OMG it's all on ONE LINE????!!!

Sven7 10 years ago |

Now I know why my google plus page takes half a day to load.

aikah 10 years ago |

lol git clone http://urlto.google.codebase.git ...

I wonder how much time it takes to clone the repo, provided they use git.

robertk 10 years ago | |

It's 80TB. You don't clone, just ask for views.

kuschku 10 years ago |

This explains quite some things.

Still, this is not a very forward-thinking solution. Building and combining microservices – effectively UNIX philosophy applied to the web – is the most effective way to make progress.

EDIT: Seems like I misunderstood the article – from the way I read it, it sounded like Google has a monolithic codebase, with heavily dependent products, deployed monolithically. As zaphar mentioned, it turns out this is just bad phrasing in the article and me misunderstanding that phrasing.

I take everything back I said and claim the opposite.

thomashabets2 10 years ago | |

That's why Google is so unsuccessful at scaling technical solutions, unlike you they're not forward-thinking.

kuschku 10 years ago | | |

No, it’s not that they are unsuccessful, it’s that they are unable to maintain it properly. Already today they have tons of open security issues.

Or think about April 1st, when they set an Access-Control-Location: * header on google.com because someone wrote the com.google easteregg.

Read the post from the SoundCloud dude from yesterday to find out how to do software management properly (hint: modularization is everything)

zaphar 10 years ago | |

Google runs practically everything internally as services. Nothing about the code repository makes it impossible to run microservices. Where did you get the idea that google runs a single monolithic app for everything?

kuschku 10 years ago | | |

The article claimed the code, and the way it's run, is a monolithically developed and deployed product.

If that’s not the case, I apologize for misunderstanding it.

But if it was the case, I wanted to state that it might not be wise, for the same reasons as this thread mentioned https://news.ycombinator.com/item?id=10195423

EDIT: Thanks for telling me, though! Always nice to be proven wrong, as at least I learnt something today :D