Git is a purely functional data structure (2013)(blog.jayway.com) |
Git is a purely functional data structure (2013)(blog.jayway.com) |
I rarely use functional programming but I certainly see its appeal for certain things.
I think the concept of immutability confuses people. It really clicked for me when I stopped thinking of it in terms of things not being able to change and started instead to think of it in terms of each version of things having different names, somewhat like commits in version control.
Functional programming makes explicit, not only which variables you are accessing, but which version of it.
It may seem like you are copying variables every time you want to modify them but really you are just giving different mutations, different names. This doesn't mean things are actually copied in memory. The compiler doesn't need to keep every versions. If it sees that you are not going to reference a particular mutation, it might just physically overwrite it with the next mutation. In the background "var a=i, var b=a+j", might compile as something like "var b = i; b+=j";
I think it confuses people because it’s framed oddly. Immutability isn’t about being unable to mutate state, it’s about no longer using containers (registers) as variables, such that the equal operator actually means “equals” as opposed to “store”.
In most programming languages, the equal operator works as a “store” operation, which stores a value in a named container/register. In “immutable by default” languages like Haskell, the equal operator actually means “equals”, as in “is synonymous with”.
The essence of immutability is referencing values directly, through synonyms, as opposed to storing them in named registers for later retrieval. When it’s done this way, immutability no longer makes sense: is the number 3 mutable? Can the number 3 be mutated into 4, or are they just two distinct numbers?
Believe it or not, many compilers internally represent mutable variables as a sequence of immutable variables: https://en.wikipedia.org/wiki/Static_single_assignment_form
Edit: I should clarify that I mean "immutable" in the context of primitive values like integers.
Basically, if you use persistent data structures and a unified application state, you can keep a list of all previous application states and you can browse it or ship it for debugging, it's not that expensive.
[0] in debug mode, you get a list of all events having occurred in your application and can instantly move back to that state, and you can import/export that state history: http://elm-lang.org/blog/the-perfect-bug-report
Git lets you do version control via full snapshots as opposed to just tracking diffs (even though it does actually do this too behind the scene).
You can think of a full snapshot as saving a copy of your project structure every time you do a commit. The key trick is that git doesn't actually create new copies of the content for each commit but simply maintains a tree structure whose nodes are pointers (via hashing) to the content they represent.
The complication from git is not in understanding the core concept but knowing how best to apply them. There are all sorts of crazy workflows you could implement by manipulating git pointers and their associated patches. As with anything that is flexible, difficulty comes in knowing how to constraint yourself when using it.
http://tom.preston-werner.com/2009/05/19/the-git-parable.htm...
if git killed svn, datomic kills postgres
Hash trees are used in the IPFS, Btrfs and ZFS file systems, BitTorrent protocol, Dat protocol, Apache Wave protocol, Git and Mercurial distributed revision control systems, the Tahoe-LAFS backup system, the Bitcoin and Ethereum peer-to-peer networks, the Certificate Transparency framework, and a number of NoSQL systems like Apache Cassandra, Riak and Dynamo.
A Git history is a DAG[0] (each commit can have multiple parents) and beyond that a polytree (it can have multiple roots); while a blockchain is an arborescence[2] (there's a single root — the "genesis block"; and each block can only have a single parent).
Further, beyond the technicalities blockchains are generally very linear (the side-chains tend to be pretty short, forks aside) while Git repositories can be extremely broad (have lots of concurrent branches).
[0] https://en.wikipedia.org/wiki/Directed_acyclic_graph
[1] https://en.wikipedia.org/wiki/Polytree
[2] https://en.wikipedia.org/wiki/Arborescence_(graph_theory)
The answer depends entirely on definition. What properties of bitcoin are essential to a blockchan vs which properties are simply how bitcoin happens to use a blockchain?
If blockchain just means the Merkle tree, then yes.
If it means Merkle tree + a computational consensus system for adding nodes, then no.
Git also uses a Merkle-DAG, but it is not a blockchain.
The Best, when it comes to data structures is a Directed, Acyclic Graph. For instance your typical linux filesystem is a DAG. But there's one problem with DAGs: When they reach a certain complexity human brains are not fit enough to parse them anymore. (programs still can though)
So in many circumstances at least a human programmer needs to take a look at the state of your program and make assumptions about its correctness, which is called debugging. And that's why in Good programs we often use Good data structures instead of The Best.
Good data structures are key->value stores (which you may know as "hash tables" or "dictionaries"), trees, and trees in a simplified special form: lists, each of them being somewhat able to represent the other two, if one can accept a performance hit and/or increased complexity in source code. Dictionaries, trees, lists. That's it. And you do that in every programming language that is at least a little bit interested in being Good.
So there's nothing special or functional about git's data structures, it's just normal Good programming, and a few programmers who are so good at programming that they don't even need to mention it anymore, they breath good programms.
Then of course to the normal bread-earning coder good programs are a rare sight. But the reason is not that they are really rare, the reason is that successful business doesn't really require Good programs to succeed. Mediocre programs are good enough to earn their rent, and most of us spend most of our coding hours to earn our rent.
All that being said, if you don't just want to make money, go and spend some time studying git internals. It will teach you a lot more than most of your teachers/professors taught you combined. Sadly the source code is written by Linux gurus, who like to encrypt their source code with a very special key that only people from their tribe can understand. But the Git Book is actually good enough that you can study quite a lot of the internals from that book. I also suggest writing your own git in your favorite programming language once, to really understand it.
[1] man git-rebase
What I've gathered so far from reading the article and the comments is that some people who are in the know about a very specific paper agree that Git is a purely functional data structure. And that others look at the ways you can use Git and point a finger and say, "Look! It can be mutated! Therefore it cannot be functional!" And the response to that is, "Don't be so technical about how you define functional. Or immutable. You know it when you see it."
Is this some kind of Obi-wan Kenobi from a certain point of view stuff? Why is this so difficult to get a handle on?
If a thing says immutable on the tin, and it's mutable, how is that purely functional? I know, read the paper. I know. But still, it's a legit question.
It seems to me that a data structure so amazing as being purely functional shouldn't be so easy to misunderstand as what we're seeing here. And it's clearly being misunderstood. And not only by me.
More precisely, it doesn't create new copies of content that you didn't change. For example, if you have 100 files in your repo and you change one of them and then commit, git creates a new copy of the content of the file you changed--a new blob storing the new file content--and a new tree object that references the new blob instead of the old one, plus the other 99 blobs that store the contents of files you didn't change; the new commit object then references the new tree object (plus the message and metadata). But git never stores diffs between old and new content; it just creates a new blob every time the content of a file changes.
Git pack files compress objects by storing them as diff files going backwards. That is, it stores the most recent state in full, then uses patches to go backwards. Because you're more likely to need a recent version in full than an older one.
One more step is to point that futures / promises, and even lists, are monads that a e.g. JS programmer uses every day, too. It reminds me of the old literary character who did not know that he's been speaking prose all his life.
That's completely orthogonal to whether the version history is immutable with each branch being like singly linked list where we "cons" new things onto the front.
> the version history is immutable with each branch being like a singly linked list where we “cons” new things onto the front
Just contributes to the general confusion around git where people decide it’s too complicated to learn.
(Apologies if my sarcasm detector is just broken today)
Why is it that whenever functional programming comes up, "real programmers" come out of the woodwork to bloviate, only to expose just how little they actually know about the topic?
"Purely functional" in this case is a jargon term that relates to Okasaki's 1996 PhD thesis, available online (https://www.cs.cmu.edu/~rwh/theses/okasaki.pdf). It's a classification for data structures much in the way "regular" or "context-free" or "turing-complete" serve as classifications for grammars.
For a data structure to be "purely functional" (and no, article author does not mean "good" here) simply means that it's implementable in a pure functional language without mutation. Examples of non-functional data structures would be ones where mutation is intrinsic to how its algorithms work: traditional linear probing hash tables, union-find for graphs, etc.
By this technical criterion, the git object graph is clearly a purely functional data structure, sorry.
Beyond that, you seem to have shown misunderstanding of what that means (despite your claims of "I understand but I disagree"), as the other examples you give DAGs, dictionaries, trees, lists, and this comment thread, do not have the interesting properties of purely functional data structures that make them worth discussing in the context of Git. Namely, the immutability of existing entries.
The thesis of the article isn't that purely functional data structures are Good, or that functional programming is Good. Those are quality judgments that are not present in the article. The thesis is: Explaining Git in terms of the purely functional data structure it uses might be helpful to learning Git.
I believe mutability is a trade off. You can make objects more immutable by being less efficient in terms of storage size, search time, source code simplicity. In exchange for these negative attributes you get source code you can proof more easily. It's good when that attribute is needed, but not most of the time.
Git is only immutable on that meta level where we talk about the object key-value store. The implementation uses mutable files though, and even steps away from that immutable key-value store when the size gets too big and needs to store more efficiently.
And on the user level you can amend, edit, squash and delete commits easily, which is actually a strength of git not a weakness. People love it for being more mutable than SVN.
I think the author is using the term "purely functional data structure" to mean a data structure that lends itself to an implementation in a purely functional language [1,2].
[1] https://en.wikipedia.org/wiki/Purely_functional_data_structu...
[2] https://www.amazon.com/Purely-Functional-Structures-Chris-Ok...
A round wheel is more useful for a car than a triangular wheel. That doesn't mean it's a "car wheel". It's just as good on a horse wagon or a bike.
This stuff isn't obscure just because you don't happen to know about it already. Chris Okasaki's publications have been cited over 1000 times, mostly for his PhD work -- those papers, especially on functional data structures and amortization analysis in lazy languages, are considered foundational for a whole research area in Computer Science.
> Is this some kind of Obi-wan Kenobi from a certain point of view stuff? Why is this so difficult to get a handle on?
Did you learn calculus overnight, or expect to understand a technical conversation about differential equations and Cauchy sequences without taking two years of classes or reading a couple of really thick books first? Why should this be any different?
> If a thing says immutable on the tin, and it's mutable, how is that purely functional? .... It seems to me that a data structure so amazing as being purely functional shouldn't be so easy to misunderstand as what we're seeing here. And it's clearly being misunderstood. And not only by me.
Sigh. Explaining it properly would involve a tour though the untyped lambda calculus, simply-typed lambda calculus and the Curry-Howard isomorphism, a discussion on denotational vs. operational semantics (Hoare logic, functional interpreters, type-preserving compilation, small-step vs large-step operational semantics, etc).
TL;DR: "purely functional" is a description of the program's meaning (in a technical sense), not its implementation.
Got it. Thanks for the attitude.
I think I can offer something to the discussion here, as I'm straddling the 2 camps - having a fairly intricate understanding of git (I've held training seminars at my company about it), and basic familiarity with functional (technically "persistent") data structures thanks to a long-standing interest in Clojure.
What the functional camp is talking about, when they refer to git as a "purely functional data structure", is the core of git's implementation - commits are immutable snapshots of the repository arranged in a directed acyclic graph (or a tree, if there are no merge commits), and branches are "just" movable pointers to some commit. In this view, everything other than this core model of history as a graph is extra details layered on top.
The other camp (let's call them the "git camp") looks at git in its entirety, including branches, tags, the index, the working tree, stashes, and so on, and they can't help but come to the conclusion you mentioned - "Branches are mutable references! The index is mutable! The working tree is mutable! How can git possibly be called a functional data structure?"
While this 2nd perspective is understandable and technically correct, I think it misses the point the other camp is trying to make: that the immutable commit is the "fundamental unit of git" so to speak, that we interact with everyday, and that most of git's history-manipulating commands (git commit, git commit --amend, git rebase, git merge, git cherry-pick) can be described in terms of purely-functional operations on the commit graph. Once you understand this, many complex git scenarios become easier to understand (rebase, and filter-branch, for example), so this view does have practical value. Then branches, tags, the index, stashes, etc can indeed be understood separately.
In fact, this perspective of git as a data structure is so useful, that I rarely use "git log" as it is, preferring these additional flags to view the entire graph instead:
git log --graph --all --decorate --oneline
I hope that helps, cheers!
If you squint, that's kind of true for Git repositories, too.
The version with the most "proof of work" on it is likely the master branch.
Of course the incentives are very different... but still, the similarity is somewhat illuminating, I think.
I feel like I already said "I understand but I disagree". A round wheel is more useful for a car than a triangular wheel. That doesn't mean it's a "car wheel". It's just as good on a horse wagon or a bike.
Sorry, but yes :)
The commit is immutable because when you "amend" a commit, git actually creates a new commit object, assigning it a new SHA1 hash. This means that the original contents from before the amend (i.e., the "old" commit) are still accessible via the old SHA1 hash - proving that an "amended" commit is actually a copy of the old one, not an in-place mutation.
Think of it like this - if I copy a text file and edit the copy, would you say that I've "mutated" the original file? No, right? That's what git does - it never actually mutates a commit :)
EDIT: to clarify further, git identifies each commit with a unique SHA1 hash which is assigned when the commit is created, and that hash depends on the contents of the files + the current timestamp + the SHA1 of its parent commit + some other things. Which means, if you try to edit the contents of the commit in any way at all (or even if you create another commit with the same exact files at a different time, or as a child of a different commit), its SHA1 hash will be different, i.e., it is a new commit by definition.
You may not find the definition useful, and that's alright. I won't try to convince you.
x == 3 means: Take the value out of the box x. Is it 3?
x = 3 means: x is a box, put 3 in it. The 3 lasts forever, or until someone changes it.
In a functional language:
x = 3 means: Take the value out of box x. Is it 3?
let x = 3 in [...] means: x is a box, put a 3 in it. The 3 lasts for only the [...], but cannot be undone.
x = 3 means: x is defined as 3.
x == 3 means: Is x equal to 3?
let x = 3 in [...] means: within the scope of [...], x is defined as 3.
There is no specific point where something is "put into the box" or "taken out of the box". In fact, there is no "box". The functional programming concept of "bindings" does not correspond to the imperative concept of "variables", not even the oxymoron known was "constant variables". (If you want variables in Haskell you need IORef or STRef, where the load and store operations are explicit monadic actions.)Functional languages are only immutable on that meta level where we talk about objects in memory. The implementation uses mutable RAM though, and even garbage collects when the size gets too big and needs to store more efficiently.
Unless you want to narrow the term 'functional' so much it describes nothing at all, Git is a functional data structure.
(a) an argument helping my thesis by showing that you can program functional languages quite well by using mutable recourses
(b) the biggest users of what you write there are also the biggest enemies of the status quo you describe. Functional programmers really hate that computers are so unlike what they love.
You can implement a functional language with mutable resources, and you can implement a functional data structure like git with mutable resources. How does it help your thesis that git isn't a functional data structure?
> (b) the biggest users of what you write there are also the biggest enemies of the status quo you describe. Functional programmers really hate that computers are so unlike what they love.
Whatever, something that is purely functional all the way down to the wires is impossible.
Any definition of "functional" the rejects every single functional language is an objectively failed definition. A usable definition will go ahead and permit the git data structure to be called functional.
Let me preface this by revealing that I'm neither a functional programmer, nor am I convinced that functional programming or immutability is any sort of silver bullet.
Your original thesis was that "a data structure cannot be functional," which you later defended by saying "git is only immutable on the meta level, its implementation uses mutable files", which, while being 100% true, is also true of the category of programming languages we call "functional" as they rely on mutable RAM. That is what the GP has been trying to explain to you, but your obvious hate for anything labelled "functional" makes you blind to all this.
Sorry mate, you just have to concede the point here. This is not a discussion about whether functional programming is any good, which is what you seem to be treating it as.
For example,
let x = 5 in let x = 6 in putStrLn (show x)
would print 6. Unless you turn on compiler warnings, there is never a need to choose new names for variables if the old version is no longer needed.
let x = 4 ; f y = x + y in let x = 6 in f 10Haskell has mutable refs. That's what Git branches are.
This is both false and irrelevant.
It's false because after the rebase, the branch you just rebased won't point to the old commits. Unless you had other branches there, they would be orphaned. Also, according to the "Notes" section of the git-gc documentation [0]:
> git gc tries very hard not to delete objects that are referenced anywhere in your repository. In particular, it will keep not only objects referenced by your current set of branches and tags, but also objects referenced by the index, remote-tracking branches, refs saved by git filter-branch in refs/original/, or reflogs (which may reference commits in branches that were later amended or rewound).
Since the commit you were on before making the rebase is in the reflog, it will actually not be GCed (yet) even though there are no branches pointing to them.
It's irrelevant because, even if it was true, as long as there are no more objects referencing that commit, it's perfectly eligible for gc. I don't understand your argument that "you can't have it both ways".
That would be a strictly broken GC if it removes commits that are reachable from branch pointers. Please file a bug report with the git project if you've observed this happening, because it would be inconsistent with the git-gc(1) manpage, which says it only removes unreachable objects.
(see: structure sharing, an essential optimization that lets functional data structures reuse bits of each other internally for low-cost copy-and-modify, at the cost of making a data structure no longer responsible for freeing its own memory since it overlaps)
It's still useful and more accurate conceptually to consider every commit as a complete snapshot of the state of code that point.
There are people who distinguish changeset oriented and snapshot oriented and will hotly debate that one or the other is better.
But as you say, restoration of state is a necessary and defining feature.
Library: https://github.com/arximboldi/lager
Ncurses example (full text-editor): https://github.com/arximboldi/ewig
SDL example: https://twitter.com/sinusoidalen/status/939539173584916480
Immutable data: https://github.com/arximboldi/immer
(There was a talk about Lager at MeetingCpp this year, still waiting for it to be published online...)
At work, we are pondering on a new way to split our game levels editing code, and persistent data structures came to me as one idea, in order to handle UNDO in generic way, and they by diff, or other means visualize the changes. I've read Okasaki's book long time ago, but seeing this implemented in a imperative language is really helpful.
Also pretty much enjoyed the post-modern talk too :) - hehehe
Thank you!
Editing code for game levels seems to me like a quite good use-case for this. "Editing" software with rich hierarchical documents is the itch I originally wanted to scratch with this work, even though there are many other use-cases too. Feel free to send me an email if you move further with that idea or if you want to discuss any of these topics with me :-)
I'm guessing you mean good in the sense of well-implemented (performant), but I think good in terms of interface is also very important
For instance in my experience & opinion Immutable.JS is not very fun to use regardless of its implementation, because the interface does not feel natural for the language (not that the designers can really be faulted, the user-accessible part of the language simply limits what they can do). I think a similar library for Python would have similar issues (due to Python being very statements-oriented which is antithetical to persistent data structures).
It also has the ability to really take any object and attach states to it at whim, including immutability.
I think that’s a better trade off than JS
I actually have a half-started emacs-like editor built using Qt that is based on ewig. Not anywhere near usable yet, but my plan is to make a fast editor with fast highlighting and such, and have everything accessible to be scripted in guile. I doubt it will actually materialise, but if I ever get a proof of concept it will be on github.
I love immer! It is friggin excellent.
> Python however does provide some immutable constructs as part of it’s core language (Tuples)
That doesn't really help when you're looking for persistent datastructures.
> It also has the ability to really take any object and attach states to it at whim, including immutability.
No, and if it did that would not be an advantage in this context.