Git is a purely functional data structure (2013) (blog.jayway.com)

227 points by eisokant 8 years ago | 97 comments

I sometimes like to explain things the other way around: immutability is version control for program state.

I rarely use functional programming but I certainly see its appeal for certain things.

I think the concept of immutability confuses people. It really clicked for me when I stopped thinking of it in terms of things not being able to change and started instead to think of it in terms of each version of things having different names, somewhat like commits in version control.

Functional programming makes explicit not only which variable you are accessing, but which version of it.

It may seem like you are copying variables every time you want to modify them, but really you are just giving different mutations different names. This doesn't mean things are actually copied in memory. The compiler doesn't need to keep every version. If it sees that you are not going to reference a particular mutation, it might just physically overwrite it with the next mutation. In the background "var a=i, var b=a+j", might compile as something like "var b = i; b+=j";

runeks 8 years ago | |

> I think the concept of immutability confuses people.

I think it confuses people because it’s framed oddly. Immutability isn’t about being unable to mutate state, it’s about no longer using containers (registers) as variables, such that the equal operator actually means “equals” as opposed to “store”.

In most programming languages, the equal operator works as a “store” operation, which stores a value in a named container/register. In “immutable by default” languages like Haskell, the equal operator actually means “equals”, as in “is synonymous with”.

The essence of immutability is referencing values directly, through synonyms, as opposed to storing them in named registers for later retrieval. When it’s done this way, immutability no longer makes sense: is the number 3 mutable? Can the number 3 be mutated into 4, or are they just two distinct numbers?

canes123456 8 years ago | | |

I think I understand the point you are making, but this is way more confusing for me. Partly this can be blamed on imperative languages and = vs ==.

sobellian 8 years ago | |

> In the background "var a=i, var b=a+j", might compile as something like "var b = i; b+=j";

Believe it or not, many compilers internally represent mutable variables as a sequence of immutable variables: https://en.wikipedia.org/wiki/Static_single_assignment_form

Edit: I should clarify that I mean "immutable" in the context of primitive values like integers.
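To make the SSA idea concrete, here is a minimal sketch in Python. It handles straight-line code only (no branches, so no phi functions), and the function name `to_ssa` and the (target, operands) statement format are made up for this illustration; real compilers do considerably more.

```python
def to_ssa(statements):
    """Rename each assignment target to a fresh version (b -> b1, b2, ...).

    `statements` is a list of (target, operands) pairs for straight-line
    code. This is an illustrative toy, not how any real compiler does it.
    """
    version = {}
    out = []
    for target, operands in statements:
        # Operands refer to the latest version assigned so far; names that
        # were never assigned here (inputs, literals) pass through unchanged.
        renamed = [f"{op}{version[op]}" if op in version else op
                   for op in operands]
        version[target] = version.get(target, 0) + 1
        out.append((f"{target}{version[target]}", renamed))
    return out


# "b = i; b += j" written as (target, operands) pairs.
program = [("b", ["i"]), ("b", ["b", "j"])]
ssa = to_ssa(program)
```

Each assignment to the "mutable" variable `b` becomes a definition of a new name that is assigned exactly once, which is the same "different versions, different names" view described above.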

macintux 8 years ago | |

That's an interesting way to describe it. I've talked a lot about immutability at conferences, but I've never thought about it in those terms. Thanks.

masklinn 8 years ago | | |

That realisation has been used by environments like Elm's Reactor[0] or its (sadly defunct) Time Traveling debugger. I think Om also had something like that.

Basically, if you use persistent data structures and a unified application state, you can keep a list of all previous application states, and you can browse it or ship it for debugging; it's not that expensive.

[0] in debug mode, you get a list of all events having occurred in your application and can instantly move back to that state, and you can import/export that state history: http://elm-lang.org/blog/the-perfect-bug-report
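A rough Python sketch of the "keep every previous state" idea follows. It is not how Elm or Om implement it: the `update` helper and the state keys are hypothetical, and a shallow dict copy stands in for a real persistent data structure (the values are shared between versions, but the top-level mapping is copied).

```python
from types import MappingProxyType


def update(state, **changes):
    """Return a new read-only state; the previous one is left untouched."""
    return MappingProxyType({**state, **changes})


# One unified application state, and a list of every version it has had.
history = [MappingProxyType({"count": 0, "user": None})]
history.append(update(history[-1], count=1))
history.append(update(history[-1], user="alice"))

# "Time travel": any earlier state is still intact and can be inspected.
earlier = history[1]
```

Because no state is ever modified in place, the history list can be browsed or serialized for a bug report without worrying that later events changed it.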

runT1ME 8 years ago | |

Nice, good explanation. And one way to look at the State Monad is when you always want the 'latest version' of an immutable 'value' that keeps getting new versions. :)

antoaravinth 8 years ago | |

That's a good explanation. I guess Immutable.js uses the same concept behind the scenes to retrieve each reference, i.e. like commits. It looks like Immutable.js uses trie data structures for such operations. Maybe I'm wrong here as well.

Edmond 8 years ago |

Perhaps for those familiar with "functional data structures" such an analogy is helpful but I find it easier to simply explain git for what it is without adding more exotic nomenclature to it.

Git lets you do version control via full snapshots as opposed to just tracking diffs (even though it does actually do that too behind the scenes).

You can think of a full snapshot as saving a copy of your project structure every time you do a commit. The key trick is that git doesn't actually create new copies of the content for each commit but simply maintains a tree structure whose nodes are pointers (via hashing) to the content they represent.
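The "pointers via hashing" trick can be sketched in a few lines of Python. This is only an illustration of content addressing: the `store`, `put`, and `snapshot` names are made up, and the JSON encoding is an assumption that does not match git's real object format.

```python
import hashlib
import json

store = {}  # hash -> serialized object; entries are only ever added


def put(obj) -> str:
    """Store a JSON-serializable object under the hash of its content."""
    data = json.dumps(obj, sort_keys=True).encode()
    key = hashlib.sha1(data).hexdigest()
    store[key] = data
    return key


def snapshot(files: dict) -> str:
    """Store each file's content, then a tree mapping names to content hashes."""
    return put({name: put(content) for name, content in files.items()})


tree1 = snapshot({"README": "hello", "main.py": "print(1)"})
tree2 = snapshot({"README": "hello", "main.py": "print(2)"})
```

After two snapshots the store holds five objects rather than six, because the unchanged README hashes to the same key and is stored only once.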

The complication from git is not in understanding the core concepts but in knowing how best to apply them. There are all sorts of crazy workflows you could implement by manipulating git pointers and their associated patches. As with anything that is flexible, difficulty comes in knowing how to constrain yourself when using it.

kazinator 8 years ago |

Git is a purely functional data structure, except for the mutating head pointers, rewriting of tags, various state in the repo related to things like on-going rebases, cherry picks, bisects, ... oh and the index which is one object changing in-place (not to mention working tree, of course).

vickychijwani 8 years ago | |

While what you say is technically accurate, I think you're missing the bigger picture here. The author's point still stands: git's commit history (arguably the most important part of a version-control system) can be viewed as a purely-functional data structure, and that view has practical benefits too. I tried to explain more here: https://news.ycombinator.com/item?id=15892013

rubenbe 8 years ago |

I often recommend that people read "Git Internals". If you know how git works internally, it's much easier to understand how it works and the reasons behind it.

https://git-scm.com/book/en/v1/Git-Internals

nine_k 8 years ago | |

Even more enlightening is The git Parable.

http://tom.preston-werner.com/2009/05/19/the-git-parable.htm...

randomsearch 8 years ago | |

This suggests a poor abstraction.

goialoq 8 years ago | | |

"Internals" is a poor choice of term. "Data structure" is a better term. Git is "plumbing and porcelain". The plumbing is the core of git. Porcelain are shortcuts. In general, Torvalds projects (Linux, Git) aren't big on abstractions that maximize simplicity-of-use, they focus on doing complex things correctly and quickly. Adding abstraction makes it hard to get details correct and run quickly.

mannykannot 8 years ago |

I think the author has a point in saying that learning Git by trying to map it to Subversion is not the best way to do it, but I don't think analyzing it as a functional data structure adds much insight. To me, it is easier to understand when you look at its purpose, and how it solves the problems of that domain - and the biggest difficulties of version control are on account of the problem being essentially one of distributed, lockless concurrency, something not mentioned in this article.

icc97 8 years ago |

I found the explanation from the Immutable JS presentation easier to understand when talking about Immutable data structures [0]

[0]: https://youtu.be/I7IdS-PbEgI?t=5m7s

shurcooL 8 years ago |

Even after 4 years, this remains my favorite, most influential article that helped me understand and feel comfortable with git. It's just a very good analogy.

dustingetz 8 years ago |

for a real database that works like git, see http://www.datomic.com

if git killed svn, datomic kills postgres

xj9 8 years ago | |

if it were open source maybe, but i'm definitely not going to switch (even though i might want to) for licensing reasons. in fact, i have more motivation to write a libre datomic clone than to pay cognitect anything for their proprietary db.

dustingetz 8 years ago | | |

do it!

macintux 8 years ago | |

As much as I appreciate datomic, that's a poor conclusion to draw. git is objectively better than svn. Postgres is not objectively worse than datomic; there are things that datomic simply can't do efficiently.

dustingetz 8 years ago | | |

if you go back and read the git vs svn flame wars back when it first came out, people said the same thing. it happens every time there is a paradigm shift technology. the reactjs flame wars of 2013/4 were particularly brutal as everyone and their mom felt qualified to comment. the key idea here is that, at scale, immutability is just better, at nearly everything

Scarbutt 8 years ago | |

datomic is too slow on writes, even slower than sqlite, because it serializes all writes.

ioquatix 8 years ago |

.... and one day I had the crazy idea to make a database on top of it: https://github.com/ioquatix/relaxo because the underlying immutable data structure makes this quite feasible.

snissn 8 years ago |

Is git a blockchain?

icebraining 8 years ago | |

They both use Merkle/Hash trees:

Hash trees are used in the IPFS, Btrfs and ZFS file systems, BitTorrent protocol, Dat protocol, Apache Wave protocol, Git and Mercurial distributed revision control systems, the Tahoe-LAFS backup system, the Bitcoin and Ethereum peer-to-peer networks, the Certificate Transparency framework, and a number of NoSQL systems like Apache Cassandra, Riak and Dynamo.

https://en.wikipedia.org/wiki/Merkle_tree

masklinn 8 years ago | |

Mostly the reverse: if you need to relate them, a blockchain is a degenerate Git history (which is itself a subset of Git):

A Git history is a DAG[0] (each commit can have multiple parents) and beyond that a polytree[1] (it can have multiple roots), while a blockchain is an arborescence[2] (there's a single root, the "genesis block", and each block can only have a single parent).

Further, beyond the technicalities blockchains are generally very linear (the side-chains tend to be pretty short, forks aside) while Git repositories can be extremely broad (have lots of concurrent branches).

[0] https://en.wikipedia.org/wiki/Directed_acyclic_graph

[1] https://en.wikipedia.org/wiki/Polytree

[2] https://en.wikipedia.org/wiki/Arborescence_(graph_theory)

finnthehuman 8 years ago | |

Is a hot dog a sandwich?

The answer depends entirely on definition. What properties of bitcoin are essential to a blockchain vs which properties are simply how bitcoin happens to use a blockchain?

If blockchain just means the Merkle tree, then yes.

If it means Merkle tree + a computational consensus system for adding nodes, then no.

linschn 8 years ago | |

Short answer, yes. The current commit includes the hash of its parent(s), so its own hash reflects the whole history, and one cannot change the history without also changing the current hash. Just like a block contains the hash of the previous block.
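A small Python sketch of why a commit's hash reflects its whole history. The `commit_id` function and its payload layout are invented for illustration; git's actual commit objects also hash the tree, author, and timestamps, and use SHA-1 by default rather than SHA-256.

```python
import hashlib


def commit_id(message: str, parents: tuple) -> str:
    """Hash a commit's message together with the ids of its parents."""
    payload = "\n".join(parents) + "\n\n" + message
    return hashlib.sha256(payload.encode()).hexdigest()


a = commit_id("initial", ())
b = commit_id("add feature", (a,))
c = commit_id("fix bug", (b,))

# Rewrite the first commit: every descendant ends up with a different id,
# even though the later messages are unchanged.
a2 = commit_id("initial (edited)", ())
b2 = commit_id("add feature", (a2,))
c2 = commit_id("fix bug", (b2,))
```

The same property holds for a block that embeds the hash of the previous block, which is the similarity being pointed out here.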

sparkie 8 years ago | | |

That's a Merkle Tree. A blockchain is an application of a Merkle tree in which each node contains transaction data, and a majority of clients agree that the longest chain of blocks is the correct one.

Git also uses a Merkle-DAG, but it is not a blockchain.

doug1001 8 years ago |

well the top-level git data structure is pretty close to e.g. Scala's Vector, which is an immutable container implemented as a tree with a high branching factor of 32. Modification to such a vector, rebound to a new variable, relies on structural sharing of the original (http://www.codecommit.com/blog/scala/implementing-persistent...)
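Structural sharing via path copying can be sketched in Python with nested tuples. This is a simplified illustration, not Scala's implementation: the branching factor is 4 instead of 32, the tree has a fixed depth, and the `build`, `get`, and `updated` names are made up.

```python
BITS = 2             # branching factor 4 here; Scala's Vector uses 32
WIDTH = 1 << BITS
MASK = WIDTH - 1


def build(items, depth):
    """Build a tree of nested tuples; leaves hold WIDTH items each."""
    if depth == 0:
        return tuple(items)
    step = WIDTH ** depth
    return tuple(build(items[i:i + step], depth - 1)
                 for i in range(0, len(items), step))


def get(node, index, depth):
    """Walk from the root to the leaf, using BITS bits of the index per level."""
    for level in range(depth, 0, -1):
        node = node[(index >> (level * BITS)) & MASK]
    return node[index & MASK]


def updated(node, index, value, depth):
    """Return a new tree with one slot changed, copying only the path to it."""
    slot = (index >> (depth * BITS)) & MASK
    child = value if depth == 0 else updated(node[slot], index, value, depth - 1)
    return node[:slot] + (child,) + node[slot + 1:]


v1 = build(list(range(16)), 1)   # 4 leaves of 4 items each
v2 = updated(v1, 5, "x", 1)      # v1 is unchanged; v2 shares 3 of its 4 leaves
```

Only the root and the one leaf on the path to index 5 are rebuilt; the other leaves are the very same objects in both versions, much like unchanged files shared between two commits.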

erikb 8 years ago |

A data structure cannot be functional. I understand what he's trying to say, and agree with most of it, but the word "functional" is purely wrong. What he wants to say is "good". But not all "functional programming" is good, nor is all good programming functional by necessity, despite what your local Lambda The Ultimate nerd tries to tell you.

The Best, when it comes to data structures, is a Directed Acyclic Graph. For instance, your typical Linux filesystem is a DAG. But there's one problem with DAGs: when they reach a certain complexity, human brains are not fit enough to parse them anymore (programs still can, though).

So in many circumstances at least a human programmer needs to take a look at the state of your program and make assumptions about its correctness, which is called debugging. And that's why in Good programs we often use Good data structures instead of The Best.

Good data structures are key->value stores (which you may know as "hash tables" or "dictionaries"), trees, and trees in a simplified special form: lists, each of them being somewhat able to represent the other two, if one can accept a performance hit and/or increased complexity in source code. Dictionaries, trees, lists. That's it. And you do that in every programming language that is at least a little bit interested in being Good.

So there's nothing special or functional about git's data structures; it's just normal Good programming, and a few programmers who are so good at programming that they don't even need to mention it anymore. They breathe good programs.

Then of course to the normal bread-earning coder good programs are a rare sight. But the reason is not that they are really rare, the reason is that successful business doesn't really require Good programs to succeed. Mediocre programs are good enough to earn their rent, and most of us spend most of our coding hours to earn our rent.

All that being said, if you don't just want to make money, go and spend some time studying git internals. It will teach you a lot more than most of your teachers/professors taught you combined. Sadly the source code is written by Linux gurus, who like to encrypt their source code with a very special key that only people from their tribe can understand. But the Git Book is actually good enough that you can study quite a lot of the internals from that book. I also suggest writing your own git in your favorite programming language once, to really understand it.

_0w8t 8 years ago |

Git is not a purely functional data structure [1]

[1] man git-rebase

19870213 8 years ago | |

But git-rebase does not alter existing commits in the commit tree, it simply creates a new branch (meaning new commits) on the tree.

Simon_says 8 years ago | | |

Alright wiseguy, git gc.

klodolph 8 years ago | |

If Git were purely functional, you would expect rebase not to modify the existing data in any way, and indeed that is exactly what happens. You can create, delete, or modify only the top-level pointers: branch names, reflog, etc. Instead, rebase creates a completely new set of commits, and points the current branch at a new one. This is exactly how functional data structures work in e.g. Haskell, where "inserting" an element into a dictionary means that you get a new copy of the dictionary, and all existing references to the original are unmodified.
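A toy Python model of that behaviour, under heavy simplification: the `Commit` class, the `rebase` function, and the `branches` dict are hypothetical stand-ins, and a "change" is just a label rather than a real patch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Commit:
    change: str
    parent: Optional["Commit"]


def rebase(tip: Commit, base: Commit, onto: Commit) -> Commit:
    """Replay the commits after `base` up to `tip` on top of `onto`."""
    pending = []
    node = tip
    while node is not base:          # assumes `tip` descends from `base`
        pending.append(node.change)
        node = node.parent
    for change in reversed(pending):
        onto = Commit(change, onto)  # new commit objects; old ones untouched
    return onto


root = Commit("root", None)
main = Commit("main work", root)
feature_old = Commit("feature 2", Commit("feature 1", root))

branches = {"main": main, "feature": feature_old}  # the only mutable part
branches["feature"] = rebase(feature_old, root, main)
```

The old commits stay reachable through `feature_old` and are never modified; the only thing that changes in place is the branch-name table, which is the part of git this subthread is arguing about.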

_0w8t 8 years ago | | |

Git rebase alters the structures that are relevant for me, like the heads of named branches. In Haskell, let bindings are immutable. To refer to the results, one has to put them into new bindings. I.e. if Git were purely functional, the rebase would create new names for branches.

fiatjaf 8 years ago | |

I don't see that anywhere. A rebase is just a fork.

ianamartin 8 years ago |

I find this conversation fascinating because there is so much disagreement on the meaning of "functional" and "immutable".

What I've gathered so far from reading the article and the comments is that some people who are in the know about a very specific paper agree that Git is a purely functional data structure. And that others look at the ways you can use Git and point a finger and say, "Look! It can be mutated! Therefore it cannot be functional!" And the response to that is, "Don't be so technical about how you define functional. Or immutable. You know it when you see it."

Is this some kind of Obi-Wan Kenobi "from a certain point of view" stuff? Why is this so difficult to get a handle on?

If a thing says immutable on the tin, and it's mutable, how is that purely functional? I know, read the paper. I know. But still, it's a legit question.

It seems to me that a data structure so amazing as being purely functional shouldn't be so easy to misunderstand as what we're seeing here. And it's clearly being misunderstood. And not only by me.