By Scott Smith
I've been using Git for over 4 years now and I've come to appreciate it as much as any other programming tool. I've stated this before, but it's worth mentioning again: I've always hated SVN! So I was eager to embrace something better when I learned about distributed version control. I didn't understand it at first. It took about 2 weeks before the idea clicked. When it did, it was nothing short of a revelation. After that, thinking in terms of distribution was never very hard for me. Since then I've done several migrations to Git for various companies. I've written about it. I've explained it as best I can to many people. I've used it in all my programming projects and even for some non-programming tasks. Git is wonderful for LaTeX documents and for versioning backup directories, among other things.
Despite using Git daily, and despite my experience and fairly extensive study of the subject, I'm still learning new things about it. In a sense, I have trouble *not* thinking in terms of distributed version control! Thinking this way has become second nature. And that should be good, right?
Well, most of the time...
I recently ran into an issue with another programmer who was ostensibly "losing" history. I was certain this was utter nonsense. I have never lost work or had any corruption with Git... and I've used it as much as anyone. But wait! There is a switch on the "git log" command called --full-history.
git log --full-history
Ah-ha! There are the missing change sets! I knew that guy was wrong. Those disappearing commits were still there. So, what's going on? Why wouldn't those change sets show up with a normal "git log" command? What was this guy doing to make the default behavior of "git log" hide history that was relevant to him? I've never had this problem and I was determined to get to the bottom of it. This is where my thinking in terms of Git got me in trouble. I couldn't wrap my head around the idea that Git's default behavior wasn't working. The truth is, the default behavior works great unless someone has done something that doesn't make sense. Here are the definitions from the man pages for "git-log" or "git-rev-list":
Default mode
    Simplifies the history to the simplest history explaining the final state of the tree. Simplest because it prunes some side branches if the end result is the same (i.e. merging branches with the same content)
--full-history
    Same as the default mode, but does not prune some history.
What does this mean? Git prunes history by default? Well, yes. When you ask for the history of a file or set of files, Git presents the simplest possible log of change sets, i.e. commits, that explains the current state of that content. Commits which are "uninteresting" will not be shown.
Let's step back for a moment and geek out on how Git works. Git uses a content-addressable file system. Each piece of content is hashed. When that content changes, so does the hash. Git uses this principle to track change and guarantee content. Git also reconstructs history incrementally, starting from the current state and walking backward for each piece of content that makes up that state. This is hard to do quickly. It is also hard to keep sane. The longer the history, the more work Git has to do to go all the way back "to the beginning", when things were originally created. Invariably, changes come and go that don't matter to the current state of the file. If a piece of history is "interesting", or relevant to the current state of the file, it will be included in the log. Otherwise it is pruned, because it results in the same state that currently exists.
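To make the content-addressing idea concrete, here is a small sketch using "git hash-object", which prints the ID Git would assign to a piece of content. The sample strings are made up for illustration; the only point is that the ID depends on nothing but the content.

```shell
# Identical content always gets the same object ID...
id_a=$(printf 'hello\n' | git hash-object --stdin)
id_b=$(printf 'hello\n' | git hash-object --stdin)

# ...and any change to the content produces a different ID.
id_c=$(printf 'hello!\n' | git hash-object --stdin)

echo "same content:    $id_a $id_b"
echo "changed content: $id_c"
```

This is what lets Git decide cheaply whether a file is "the same" at two points in history: it only has to compare IDs.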
To put it another way, a series of changes can cancel each other out, which means that history is "simplified" by default. Without the "--full-history" switch, changes that cancel each other out will appear to have never existed. This is very common. It is also good, because the content that exists now was ostensibly put there for good reason by a developer who understood what they were doing. Git is doing everyone a favor by removing noise that is no longer important to what is there now.
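Here is a rough sketch of that cancelling-out effect in a throwaway repository. Everything in it (the file names, the "side" branch, the commit messages) is invented for illustration, and it assumes the cancelling changes live on a side branch that is later merged, which is the case the man page describes. Note that the log is limited to a path, since that is when simplification comes into play.

```shell
# Build a throwaway repository (all names here are illustrative).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

# A: the file starts out as "v1".
echo "v1" > file.txt
git add file.txt
git commit -q -m "A: add file.txt"
base=$(git symbolic-ref --short HEAD)   # default branch name varies by setup

# Side branch: change the file, then change it back.
git checkout -q -b side
echo "v2" > file.txt
git commit -q -am "B: change file.txt to v2"
echo "v1" > file.txt
git commit -q -am "C: change file.txt back to v1"

# Original branch: an unrelated commit, so the merge is a real merge.
git checkout -q "$base"
echo "unrelated" > other.txt
git add other.txt
git commit -q -m "D: add other.txt"
git merge -q --no-edit side

# Default mode: the side branch ended where it started, so B and C are pruned.
git log --oneline -- file.txt
default_count=$(git rev-list --count HEAD -- file.txt)

# Full history: B and C are listed again.
git log --oneline --full-history -- file.txt
full_count=$(git rev-list --count --full-history HEAD -- file.txt)

echo "default: $default_count commits, full history: $full_count commits"
```

Nothing was deleted between the two log commands. The commits on the side branch are in the repository the whole time; the default log just doesn't consider them necessary to explain the current file.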
But what if someone makes a mistake? This is where "--full-history" comes to the rescue: all changes, whether "interesting" to the current state or not, are included in the log or rev-list. So what happened? The developer was trying to merge 2 different states of a particular file. This is where my knowledge and experience with Git burned me. To do that, I would:
- check out the old file
- diff the content with the current HEAD
- bring the changes I wanted over from the file that was in HEAD to the file that I just checked out
- and finally make a new commit with the relevant content from both versions
As far as I'm concerned, this is normal stuff and I don't think anything of it.
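The steps above can be sketched roughly as follows. This is only an illustration in a throwaway repository: the file name "notes.txt" and its contents are invented, and step 3 uses a crude append as a stand-in for the hand editing you would really do in an editor or merge tool.

```shell
# Throwaway repository; file name and contents are made up for illustration.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

printf 'old line\n' > notes.txt
git add notes.txt
git commit -q -m "old version"
old=$(git rev-parse HEAD)

printf 'new line\n' > notes.txt
git commit -q -am "current version"

# 1. Check out the old version of the file (HEAD itself does not move).
git checkout "$old" -- notes.txt

# 2. Diff it against the current HEAD to see what differs.
git diff HEAD -- notes.txt

# 3. Bring over what you want from HEAD. A real edit happens in an editor or
#    merge tool; appending HEAD's copy of the file is only a stand-in here.
git show HEAD:notes.txt >> notes.txt

# 4. Commit the result, which now carries content from both versions.
git commit -q -am "combine old and current notes.txt"
cat notes.txt
```

The important difference from simply recommitting the old file is step 3: the new commit contains something that neither version had on its own.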
So what really happened to our poor developer who was frustrated with Git losing content? He was being uber cautious because what was happening confused him. He would check out an old version of the file and then commit it to save it again, because he had been told to "commit early and often". When in doubt, just commit the change and keep going. This seems like a good thing. And most of the time it is. But in this case it doesn't make sense, because that version of the file is already in the history. After all, you just checked it out!
And what does it mean to Git when you check out an old version and then commit it? Well, from what we've just discussed, it means that changes that are not relevant to the current state of the file will no longer be shown in the simplified history. Changes between the checked-out version and the current version will not be shown unless they contribute to history outside the one file we're focusing on here. So checking something out and then committing it again is the wrong strategy, and it doesn't make sense in Git terms if you are still "interested" in the changes that have happened in the meantime.
Git didn't lose history. It was all still there. A simple blunder and a misunderstanding of Git caused Git to show only the history that was "interesting" to the current state, and the rest was pruned. How did I resolve the issue with our Git-confused developer? I followed the steps listed above and then triumphantly predicted that the missing change sets would suddenly become "interesting" again to the current state of the file. We issued a simple "git log" and, behold, there are your missing commits!