On undoing, fixing, or removing commits in git(sethrobertson.github.io) |
Accidental mutations can be undone either by `--abort`ing (if the command supports it) or by checking out an earlier revision from the reflog.
The GC in git is pretty conservative and, while it can be triggered manually, still makes you jump through some hoops to actually get rid of something. Steve Klabnik wrote about it[1] a little while back.
In certain cases, you don't have access to the reflog because a change wasn't made locally. Perhaps someone screwed up a remote you pull from and it destroyed your history. You can, even still, find, view, and re-associate orphaned objects. Yeah, it's not terribly intuitive and, again, not a workflow anyone has probably committed to memory, but the fact that you can recover from a disaster of that magnitude is pretty amazing.
git provides us developers with a set of tools, powerful tools, and that comes with a level of responsibility. I'd rather have the ability to responsibly clean my history than the alternative.
[1] - http://words.steveklabnik.com/git-history-modification-and-l...
WTF?
What is the point of a version control system if you have to take backups of it to avoid losing data when performing certain operations?
I use git, I like git, but certain aspects of it are fundamentally broken.
Git IS safe, and ANYTHING involving changes to history can be undone without resorting to backups. Data loss can occur when you're mucking about with uncommitted changes, but that's a risk in most other version control systems as well.
[0]: http://jscal.es/2013/08/05/seriously-the-reflog-isnt-that-sc...
It's also definitely not true with uncommitted changes, including gitignored files.
Obviously, if you choose not to edit the history, then you never need to back up in this sense, and you're free to do that. But then you can't ever go back and change things (like remove accidentally committed passwords, etc.)
But if you choose to rewrite the history, and mess up, then you'll be glad you had a backup. And (in response to other comments), even if there are ways of still retrieving/fixing data, it's often easier to just restore from your backup, especially when you're trying out git commands for the first time, and you're not entirely sure if they'll work exactly how you expect. None of us are git experts from the beginning, and I've resorted to git backups numerous times when trying out a command for the first time, and then discovering it wasn't the right way.
Also, sometimes it's easier for a user to roll back to an older backup than to untangle the mess they have created.
Third, git itself is not a backup. When your repository gets corrupted, you're out-of-luck when you don't have backups for those files. So it's still good to take backups of your repositories.
Second, why would a version control system make it so difficult to roll back to an old version that it's easier to restore from backup? This is insane.
Third, I'm well aware of this, and of course you should be making backups of your git repositories (and everything else). But those backups should be there to protect against hardware failure and other external data-loss events, not protect against git itself.
2) All version control systems are vulnerable to data loss if you mess around with them in unusual ways. Would you say svn was fundamentally broken if somebody told you to take a backup before you screwed around with the repo?
2) The difference is that svn does not build this functionality into the main command line tool, and there is no culture of doing terrible things with svnadmin to edit svn repositories the way there is of doing terrible things with git to rewrite git history.
The advice to take a backup doesn't hurt, and might be helpful if restoring the original state is more effort than doing it with git operations.
I'm thinking that the repository could be moved out of the working directory and placed in its own file that's not invisible. If the repo was reified into a visible file, then repos would be portable and you could ftp them. The backup functionality could be separated from the history-tracking functionality, so you could make backups freely without adding noise to the commit history. A backup would basically be a tarball that you could append to a repo file, taking advantage of previous entries for compression. Commits, however they were implemented, could reference snapshots, but they needn't be 1:1.
Symlinks are your friends.
> then repos would be portable and you could ftp them.
tar might come in handy.
> The backup functionality could be separated from the history-tracking functionality, so you could make backups freely without adding noise to the commit history. A backup would basically be a tarball that you could append to a repo file, taking advantage of previous entries for compression.
You can already do this. You can have commits without ancestors or descendants in your repository, and they will still benefit from delta compression.
If something has already gone wrong, and you didn't do it in a separate branch, you can still go back to a previous situation.
Rewriting history in any serious sense (beyond a local reset or rebase for stuff that hasn't been pushed to anyone else yet) is always a bad idea. History is history for a good reason.
Of course any existing commit can always be reverted; that's not rewriting history. A revert is simply a new commit.
I am sad about some of these other comments, which I might paraphrase as "This doesn't help me, and it might help people who are less skilled than me who don't deserve to be helped, therefore it's worthless". It's apparently a common sentiment on this site, but it shouldn't be.
And you can always make a branch out of a previous situation. Gitk/gitx make this particularly easy.
The other way is to have a number of simple concepts which can be combined in various powerful ways. Once you understand these simple concepts, you can compose them to do whatever you need. Git is simple the same way that RISC is simple, and having a manual transmission is simple. You can do a lot more with a manual transmission car than you can with an automatic --- but if you're not careful you can strip the gears. Yet a manual transmission is simpler to maintain, and more efficient (in the hands of someone who knows how to use it) than an automatic transmission. If you take a look at the post, you'll see that the various recipes only use a handful of git commands. Once you've mastered those commands, things are indeed quite simple.
http://mercurial.selenic.com/wiki/ChangesetEvolution
It's been brewing for some time. Basically, the idea is to be able to make it easy to safely edit history collaboratively, with a consistent UI. Facebook is pumping a lot of money into hg right now, and seems particularly interested in getting this feature off the ground.
A number of pieces have been falling into place for this to occur. The first was to have phases, indicators of which commits are safe to edit collaboratively or not, a feature that some git users have wanted:
https://github.com/peff/git/wiki/SoC-2012-Ideas#published-an...
Mercurial now has this feature and uses it as part of the logic for the evolve extension. With this in place, hg is able to transmit metadata that indicates automatically which commits need to be fixed up if you want to edit a commit that someone else has also edited, or if someone edited a commit on top of which you've based other commits.
The idea is to make something like "git push --force" obsolete. History is safe to edit, and commits can't get lost, not even by accident:
http://www.infoq.com/news/2013/11/use-the-force
By the way, an epilogue to that Jenkins story is that it wasn't completely trivial to recover all lost history, and at least for some of the smaller repos, they never managed to figure out exactly which version was the canonical one.
Unfortunately, it's usually the problem domain that is complex and starting over just means you have to rediscover all of that complexity all over again. HG has more than its fair share of complicated tasks.
Now that I have familiarity with local branching, remote branches, how the 3-way merge works (conceptually) and rebasing, this article comes off as a guide on how to do things that you wouldn't have to do too often anyway.
If you want to know if some operation can be done, you don't reason about git-reset, git-checkout, git-branch, etc.; you reason about the DAG. After you have a solid mental image of what you are attempting to do to the DAG, it is a simple matter to decompose that action into a few weird but ultimately simple incantations with the porcelain. If you are interested in optimizing how many steps you decompose operations into, then you can learn the esoterica of a few git operations, but all of the hard thinking, the real problem-solving, was done in the context of a different abstraction.
git branch branchname-backup
# do dangerous stuff ...
# whoops, I just broke the branch really badly, I'll start over
# (--hard also restores the index and working tree; a plain reset only moves the branch and index)
git reset --hard branchname-backup
And like I already said above, data loss can occur when you're working with uncommitted changes, just like in most other version control systems. If the content is not under version control (in this case, not in a git commit), it's not safe.
Honestly, you guys should go watch Linus Torvalds' presentation at Google about Git. The entire point, the massive problem he was trying to solve, was preservation and verification of data integrity.
Regarding uncommitted changes: This is in the same category as forgetting to do your backup before starting to mess around, IMO. I would encourage anyone to simply get used to committing extremely often and just using a quick interactive rebase before pushing.
The simple options are:
- Remove the hard-coded password, and create a new repository with the current state of the code as a starting point.
- Start a new repository with the current code state, but keep the old repository around under lock-and-key, then perform 'complex' patch operations to move changes between the two repositories (e.g. roll back to a previous version of a file before the cut-off).
- Go back through your history, and manually create a new repository from each patch, but removing the password when you get to that commit.
If git always preserves all history, no matter what, then these are your only options.
While operations like `git-filter-branch` sound scary, they don't delete the commit objects from your .git folder. If you created a new branch called (e.g.) master-old before running `git-filter-branch` on your repository, then you can always 'rollback' to master-old if the rewrite goes wrong. Or, slightly more complex, you could use the reference listing in the reflog to 'rollback' the changes.
Next time, rather than just assume that the poster isn't smart enough to realize that a compromised password should be changed, maybe you could take in the fact that it's probably just an example of data that you might want to extract from your history if it's automatically there. I can think of numerous scenarios where someone might want to remove a password from the history even if it's not compromised (e.g. want to publish a private repo).
IMO the correct option is to create a new repository that has the same history as the old repository minus the offending commit (or possibly with an edited version of that commit that leaves out the offending string).
Because it creates a new repository, there's no risk of data loss in your old repository. Once you're confident that the operation succeeded, you can swap them.
I haven't had to do this for a long time, but as I recall, this is basically how svn does it. It works fine.
The problem with git is that it makes this far too easy and it works by editing existing repositories rather than creating new ones. So instead of once-in-a-blue-moon repository hacking to get rid of that password you accidentally committed, you get people rewriting history because they think the real history isn't "clean". I know a lot of people who routinely edit their local history before pushing changes to a shared repository because they don't want other people to see their true "dirty" history. This is insane.
Finally, I'm confused about something, so maybe you could clear this up for me. I keep seeing assurances that 1) git does not actually destroy any data, and you can always recover if you screw up and 2) editing history is sometimes a vital necessity for cases like when you commit passwords. You yourself made these assurances in this comment. However, 1 and 2 are obviously mutually exclusive. If you can always recover then you can't actually scrub the repository of accidentally committed passwords and the like. Which one is actually true?
1) This is almost true. Anything that is committed to Git is recoverable. When you "re-write" history, Git is creating a new set of commits in the history, an "alternate history path." It does not destroy the original commits, but there is no named reference to them (unless you created a branch/tag pointing to this line of commits).
2) In this case, if you want to actually destroy these unreferenced commits, you must run "git gc". This IS a destructive command. It will remove any unreferenced commits from the repository. (gc = garbage collect). If you never garbage collect, you will always have access to anything that was ever committed. It just might be hard to find since the only reference is the ref-log (if it was recent) or the commit hash.
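To make the "unreferenced commits" idea concrete, here is a rough Python sketch of the reachability check behind garbage collection. It is a toy model, not git's implementation: the commit ids, the `commits` and `refs` dicts, and the `collect_garbage` helper are all made up for illustration, and each commit has at most one parent. Note that real `git gc` is more conservative than this, since reflog entries keep objects alive and unreachable objects are only pruned after a grace period.

```python
# Toy model: each commit id maps to its parent id (None for a root commit).
# The ids and both dicts are illustrative, not real git data.
commits = {
    "A": None, "B": "A", "C": "B",      # original history
    "B2": "A", "C2": "B2",              # rewritten copies of B and C
}
refs = {"master": "C2"}                 # after the rewrite, no branch names C


def reachable(commits, roots):
    """Return every commit id reachable from the given starting ids."""
    seen = set()
    for commit_id in roots:
        while commit_id is not None and commit_id not in seen:
            seen.add(commit_id)
            commit_id = commits[commit_id]
    return seen


def collect_garbage(commits, refs, reflog=()):
    """Drop commits that no ref and no reflog entry can reach."""
    keep = reachable(commits, list(refs.values()) + list(reflog))
    return {cid: parent for cid, parent in commits.items() if cid in keep}


# While a reflog entry still remembers C, the old commits survive.
print(sorted(collect_garbage(commits, refs, reflog=["C"])))
# Once that entry is gone, B and C are swept away.
print(sorted(collect_garbage(commits, refs)))
```

The point of the sketch is only that "destroyed" means "no longer reachable from anything that names it", which is why creating a branch or tag before rewriting keeps the old commits safe.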
This is no more insane than editing a source code file before you save it to the file system. Git is used as a development tool as well as version control, and developers are therefore encouraged to commit often, even if the code does not actually compile yet. There is no more need to fill the published history with all of these WIP commits than there is for me to know about every goddamn keystroke you made while you were dicking around with that config file.
There is no such thing as "an edited version" of a commit. A commit is identified by a SHA1 hash of its index of contents. If you change one bit you get a new commit.
You're a C programmer, right? If someone gave you a specification for writing a program to implement git, without telling you what it was, you'd tell them it would take 2 weeks. And that's because you'd reckon it would take 2 hours to knock out a rough version and a couple of days to clean it up.
Seriously, it's that simple. Just go learn how it works.
Thanks for clarifying that.
It's obvious what the answer is if you know how it works, so what was your point exactly?
There's a difference between "has no idea how it works" and "understands the overall structure but doesn't know every single detail".
- One of those meta-data items is "Parent Commit," so if you change one item in history, it changes the SHA-1 sum of all subsequent items (because at the very least they all need to be re-parented).
- All of the commit objects are stored under .git/objects.
- Branches are just files under .git/refs/ that contain the SHA-1 sum of the most recent commit on that branch. This is why they are called 'branch pointers.' That's basically all they are.
- If you have a history of 5 commits, and make a change to the initial commit, you now have 10 commits in your .git/ directory. Your (e.g.) 'master' branch will point to the most recent 'tree' of 5 commits. The other commits will still exist in .git/objects, but there will be no branches pointing to them. You can use 'git reflog' to find them, or access them by their SHA-1 sum.
- Eventually 'git gc' (gc = garbage collect) will clean out the unreferenced commits, but this happens rarely if you don't explicitly run the command.
- When you 'git push,' you are only pushing branches to the remote repo, so commits that are stored locally, which are not referenced by one of those branches you are pushing, will not be pushed out. If you have commits that you don't want to end up in limbo like this, you should 'git tag' them or create a branch (e.g. 'archive/master-2013-12' that points to them).
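The points above can be sketched in a few lines of Python. This is a toy object store under loud assumptions: the `commit_id` payload format is invented (it is not git's real commit encoding), `store` stands in for .git/objects, and `refs` stands in for the branch files. It only illustrates why editing the first of 5 commits leaves 10 commits behind.

```python
import hashlib


def commit_id(message, parent):
    """Hash the commit's own data together with its parent's id (made-up format)."""
    payload = f"parent {parent}\n\n{message}".encode()
    return hashlib.sha1(payload).hexdigest()


def build_history(messages, store):
    """Create a chain of commits, record them in store, and return the tip id."""
    parent = None
    for message in messages:
        cid = commit_id(message, parent)
        store[cid] = (message, parent)
        parent = cid
    return parent


store = {}                                   # stands in for .git/objects
messages = ["one", "two", "three", "four", "five"]
refs = {"master": build_history(messages, store)}
old_tip = refs["master"]

# "Edit" the first commit: every descendant gets a new parent, hence a new id.
messages[0] = "one (password removed)"
refs["master"] = build_history(messages, store)

print(len(store))            # old and new commits both live in the store
print(old_tip in store)      # the old tip still exists, it is just unnamed
```

Nothing in the rewrite touched the old entries; the only thing that changed about them is that `refs` no longer points at their tip.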
Thanks for the help.
>One of those meta-data items is "Parent Commit," so if you change one item in history, it changes the SHA-1 sum of all subsequent items (because at the very least they all need to be re-parented).
What sequence of operations would change a history item in that way?
I've never looked at .git/logs, but it looks like that is used by the `git reflog` command. It's basically a history (or log) of every commit that a particular reference has pointed to[1]. For example, I cloned the git source code:
user@host ~/src/git % cat .git/logs/HEAD
0000000000000000000000000000000000000000 d7aced95cd681b761468635f8d2a8b82d7ed26fd First Last <user@example.com> 1387237920 -0500 clone: from https://github.com/git/git.git
user@host ~/src/git % git reflog
d7aced9 HEAD@{0}: clone: from https://github.com/git/git.git
Note: `HEAD` is a reference to the current branch. E.g.:
~/src/git $ cat .git/HEAD
ref: refs/heads/master
~/src/git $ cat .git/refs/heads/master
d7aced95cd681b761468635f8d2a8b82d7ed26fd
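Following that chain by hand is easy to mimic in Python. A minimal sketch, assuming refs are stored as loose files exactly as in the listing above: `resolve_ref` is a hypothetical helper, and the directory it reads is a throwaway copy of the layout rather than a real repository. A real repository may also keep refs in `.git/packed-refs`, which this sketch ignores.

```python
import os
import tempfile


def resolve_ref(git_dir, name="HEAD"):
    """Follow 'ref: ...' indirections until a commit id is found."""
    with open(os.path.join(git_dir, name)) as f:
        value = f.read().strip()
    if value.startswith("ref: "):
        return resolve_ref(git_dir, value[len("ref: "):])
    return value


# Build a throwaway directory that mimics the layout shown above.
git_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(git_dir, "refs", "heads"))
with open(os.path.join(git_dir, "HEAD"), "w") as f:
    f.write("ref: refs/heads/master\n")
with open(os.path.join(git_dir, "refs", "heads", "master"), "w") as f:
    f.write("d7aced95cd681b761468635f8d2a8b82d7ed26fd\n")

print(resolve_ref(git_dir))
```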
It's also of note that branches are referred to as 'references' too, hence storing them under `.git/refs/`.
> What are the SHA-1 sums of? Are they of the entire snapshot, or the delta? I went into objects/ and ran `sha1sum $objfile`, and the sum did not match the file name. So that remains obscure.
See: http://stackoverflow.com/questions/5290444/why-does-git-hash...
[1]: Since the local repository was created. This information does not sync between local and remote.
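On the `sha1sum` mismatch: the short version is that the name is the hash of a small header plus the uncompressed content, while the file on disk is zlib-compressed. Here is a Python sketch for blobs, assuming a repository that uses SHA-1 object ids; `git_blob_id` and `loose_object_bytes` are names I made up, not git functions.

```python
import hashlib
import zlib


def git_blob_id(content: bytes) -> str:
    """Object id of a blob: SHA-1 over 'blob <size>', a NUL byte, then the content."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def loose_object_bytes(content: bytes) -> bytes:
    """Bytes as a loose object file holds them: header plus content, zlib-compressed."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return zlib.compress(header + content)


content = b"hello world\n"
on_disk = loose_object_bytes(content)

print(git_blob_id(content))
# Hashing the compressed file, which is what sha1sum does, gives something else.
print(hashlib.sha1(on_disk).hexdigest() == git_blob_id(content))
# Decompressing first and then hashing recovers the object id.
print(hashlib.sha1(zlib.decompress(on_disk)).hexdigest() == git_blob_id(content))
```

So the id covers the object's own content, not a delta; trees and commits are hashed the same way with a different type word in the header.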
I think it's more or less the DAG represented as an adjacency list. I'd have to think a bit about why there is a separate log file for each branch. It seems that there's some redundancy in doing that, and I'm wondering what the advantages and disadvantages are of splitting the history up in that way.
>It's also of note that branches are referred to as 'references' too, hence storing them under `.git/refs/`.
I've developed a loathing of excessive hierarchies/trees, so I'd rather see them flattened in a single directory. But that makes sense.
>See: http://stackoverflow.com/questions/5290444/why-does-git-hash....
That's a good link. What's in an object? If an object corresponds to a commit, then it must aggregate data about changes to multiple files.
Think of each branch as a pointer. Then realize that you can make that pointer point anywhere on the DAG, even to parts of the DAG that have no connection to each other. The `reflog` is a (local, non-comprehensive) history of where that pointer has pointed. That's why there is a separate log for each branch. I guess that technically they could have a single log file and add another field to specify the branch, but using the same directory tree structure as under .git/refs/ makes the mental model simpler (and probably a performance improvement not to have to parse the reflog for every branch just to see the reflog for one branch).
> I've developed a loathing of excessive hierarchies/trees, so I'd rather see them flattened in a single directory. But that makes sense.
I'm not sure what branches living under .git/refs has to do with excessive hierarchies/trees. There are enough things stored in the .git directory, that if you mashed them all together it wouldn't make any sense.
> What's in an object?
If you really care to dive deeper, you can check objects here: https://github.com/git/git/blob/master/object.h
You can get a shorter version towards the bottom of the git manpage (e.g. `man git`):
IDENTIFIER TERMINOLOGY
<object>
Indicates the object name for any
type of object.
<blob>
Indicates a blob object name.
<tree>
Indicates a tree object name.
<commit>
Indicates a commit object name.
<tree-ish>
Indicates a tree, commit or tag
object name. A command that takes a
<tree-ish> argument ultimately wants
to operate on a <tree> object but
automatically dereferences <commit>
and <tag> objects that point at a
<tree>.
<commit-ish>
Indicates a commit or tag object
name. A command that takes a
<commit-ish> argument ultimately
wants to operate on a <commit> object
but automatically dereferences <tag>
objects that point at a <commit>.
<type>
Indicates that an object type is
required. Currently one of: blob,
tree, commit, or tag.
<file>
Indicates a filename - almost always
relative to the root of the tree
structure GIT_INDEX_FILE describes.
>Think of each branch as a pointer. Then realize that you can make that pointer point anywhere on the DAG, even to parts of the DAG that have no connection to each other. The `reflog` is a (local, non-comprehensive) history of where that pointer has pointed.
I got that branches were pointers. Now that I'm aware that the DAG is fully represented inside objects, I can see that what's inside logs/ is actually just logs. Each log corresponds to a subgraph of the full DAG. Getting history from a log would be more efficient than from the objects themselves, because to get it from objects, you'd have to dereference a lot of object references.
>I'm not sure what branches living under .git/refs has to do with excessive hierarchies/trees. There are enough things stored in the .git directory, that if you mashed them all together it wouldn't make any sense.
Having to descend through layers of subdirectories makes things harder. I'd reduce the depth of the directory tree to the absolute minimum. It's hard to tell if this is the minimum without knowing exactly what all the implementation constraints might have been.
I can see that the real meat of this system is the object store. It's useful to know about `git cat-file` for inspecting it.
Here's an example:
$ git clone blah
DAG:
A - B - C - D - E
\
Z - X - Y
Branches:
master => E
topic/new-feature => Y
reflog:
master
E - clone from blah
topic/new-feature
Y - clone from blah
Notice how cloning a repository with an existing DAG doesn't populate the reflog. It just gives it a single entry saying that the branch was updated from 'nothing' to whatever commit it was pointing to remotely.
Now let's change where 'master' is pointing:
$ git reset --hard C   # run with master checked out
DAG:
A - B - C - D - E
\
Z - X - Y
Branches:
master => C
topic/new-feature => Y
reflog:
master
E - clone from blah
C - reset to C
topic/new-feature
Y - clone from blah
Notice how the reflog is a history of the values that the branch was referencing, but is not the same history you get when you run 'git log'. After the reset, 'git log master' would show you commits A, B and C, but A and B are nowhere in the reflog.
So the DAG is actually stored inside objects. The contents of the objects directory could be described by a relational schema, and I think that would make it easier for a lot of people to understand (myself included):
Blob
- sha1hash (primary key)
- contents (blob)
Tree
- sha1hash (primary key)
TreeEntry
- treeid (foreign key into Tree)
- mode (mode of blob/subtree)
- type ("blob" or "tree")
- objectid (foreign key into Tree or Blob)
- name
Commit
- sha1hash (primary key)
- tree (foreign key into Tree)
- parent (foreign key into Commit)
- author
- committer
- comment
The tree entries are actually denormalized and stored as a list inside the tree. You could represent this more accurately with XML. But who likes XML?
I don't have the time to keep up this conversation, but this assertion is wrong. It is not a subgraph. It is a history of the values that the pointer was pointing to (e.g. "Pointer <branch_name> changed from pointing to value AAA to value BBB due to action XXX"). That is basically what all of those entries are. 'AAA' and 'BBB' may be in completely unconnected sections of the DAG.
If you create a new repository and add a couple of commits, then yes the reflog files will look like a history, but only because the branch pointer has traversed the DAG from start to end with no deviations.
For example you can have a DAG like this:
A - B - C - D - E
X - Y - Z
If you change the branch pointer to move from B to Z, this is not a subgraph. Well, I guess technically you could call it a graph of the history of the branch pointer, but it in no way corresponds to the DAG other than that all of the pointer values exist within the DAG. For example the following operations:
git clone
git reset --hard Z
git reset --hard X
Would create a graph like this (assuming that master pointed to E when you cloned):
E - Z - X
Notice that this really doesn't correspond to the DAG other than the fact that those objects exist in the DAG.
Note:
- All of this information is only contained within the .git/logs files. None of it is stored in the objects themselves.
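The clone-then-reset example above can be modelled in Python to show the two histories side by side. This is a sketch with assumptions: the `parents` dict encodes the two unconnected chains with A and X as root commits, and the `Branch` class and `log` function are illustrative stand-ins, not git internals.

```python
# Parent links for the two unconnected chains in the example above.
parents = {"A": None, "B": "A", "C": "B", "D": "C", "E": "D",
           "X": None, "Y": "X", "Z": "Y"}


class Branch:
    """A branch pointer plus a log of every value it has held."""

    def __init__(self, commit, reason):
        self.commit = commit
        self.reflog = [(commit, reason)]

    def reset(self, commit):
        """Move the pointer and record the move, like an entry in .git/logs."""
        self.commit = commit
        self.reflog.append((commit, f"reset: moving to {commit}"))


def log(commit):
    """Walk parent links from a commit, the way 'git log' follows the DAG."""
    history = []
    while commit is not None:
        history.append(commit)
        commit = parents[commit]
    return history


master = Branch("E", "clone")
master.reset("Z")
master.reset("X")

print([commit for commit, _ in master.reflog])   # where the pointer has been
print(log(master.commit))                        # what the pointer can reach now
```

The reflog list is ordered by when the pointer moved, while `log` is ordered by parent links, which is why the two only line up in a repository whose branch has never been moved sideways.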