Why you should use a distributed version control system

If you’ve ever:

  • made a commit and then realized you forgot “one little change”.
  • made a commit and regretted it.
  • wished you could combine the past couple of days’ worth of commits into one nice commit in the main branch.
  • wished you could commit just part of a file.
  • needed to drop work on one task and switch tracks to another one without having to make commits with unfinished changes, or commits with changes for one issue and a little of another.
  • wanted to make a test spike with version control and without polluting the public repo.
  • managed an open source project.
  • wanted the security of knowing that there was a valid backup of your revisions on many other people’s boxes, or even just your own.
  • been frustrated with branch namespacing issues.
  • been frustrated with how difficult branching and merging are in most centralized version control systems.
  • wished you could just create branches to work on a feature or a bug without worrying about the consequences to the main repo.
  • wondered which branch a bug applied to.
  • wanted to use version control when you were offline.
  • wished you could quickly compare versions of entire trees.
  • wished you could easily release everything in the current branch “except that”.
  • been concerned about how to scale a system to support hundreds, or thousands, of users.
  • been concerned about what would happen if your main repo box died.

…then distributed version control is worth your consideration.

Now, there are a variety of DVCSs to choose from. Git and Mercurial (Hg) are the main contenders, and both are really good. My experience is primarily with Git, so I’ll be speaking as a Git user, but Mercurial can do most everything Git can, and a lot of this is general to all distributed version control systems.

Quite possibly the most powerful, world-altering thing about DVCSs is how they handle branches. Because of their distributed nature, branching and merging absolutely must be simple and effective, and must avoid conflicts whenever possible. As a result, people who’ve gotten the DVCS religion tend to make a branch for practically every bug or feature. We call these “topic branches”. If you’ve been working on one task for a while and a bug comes in that has to be addressed NOW, it’s not an issue. Commit your changes in the current branch (or “stash” them away) and make a new branch for the bug that just came in. It doesn’t matter if you commit unfinished work in one of these branches, or if the work totally breaks the build, because it’s your branch. When you’ve finished working on a feature or bug you can “rebase” all your interim commits on that task into one nice clean one, or maybe generate a patch to apply to the main branch.
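Here’s a minimal sketch of that “drop everything for an urgent bug” flow in Git. It builds a throwaway repo in a temp directory so it’s safe to run as a script. The branch names, file names, and the demo identity are all made up for illustration.

```shell
set -e

# Scratch repository so the sketch can't touch real work.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main     # call the first branch "main"
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"

echo "v1" > app.txt
git add app.txt
git commit -q -m "Initial commit"

# Start a topic branch for a feature and leave it half finished.
git checkout -q -b feature-search
echo "half-finished search" >> app.txt

# An urgent bug arrives: shelve the unfinished work...
git stash -q

# ...and fix the bug on its own topic branch cut from main.
git checkout -q main
git checkout -q -b bugfix-login
echo "login fix" > login.txt
git add login.txt
git commit -q -m "Fix login bug"

# Ship the fix, then pick the feature back up where it was left.
git checkout -q main
git merge -q bugfix-login
git checkout -q feature-search
git stash pop -q
```

At the end the fix is on main, and the unfinished feature edit is back in the working tree of feature-search, never having been committed anywhere.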

The fact that you can make as many branches as you want solves a huge number of problems. Combine that with the ability to “rebase” your changes, by bringing up a list of past commits to merge, reorder, or exclude as you see fit, and… well, it’s like heaven. The only caveat is that you shouldn’t rebase commits that others have already pulled, because rewriting them changes the history their work is based on. But that’s not a big deal if you simply do your work in personal branches that others aren’t going to be pulling from.
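Normally you’d run `git rebase -i main` and edit the list of commits by hand. To keep things runnable without an editor, the sketch below takes a shortcut: the follow-up commits are marked with `--fixup`, and `--autosquash` arranges the list so they fold into the first commit. Setting `GIT_SEQUENCE_EDITOR=true` just accepts that list unchanged. Repo, branch, and file names are hypothetical.

```shell
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main     # call the first branch "main"
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"

echo "v1" > app.txt
git add app.txt
git commit -q -m "Initial commit"

# First commit on a personal topic branch.
git checkout -q -b feature-search
echo "search: first pass" > search.txt
git add search.txt
git commit -q -m "Add search"
first=$(git rev-parse HEAD)

# Two interim commits, marked as fixups of the first one.
echo "search: handle empty query" >> search.txt
git commit -q -a --fixup="$first"
echo "search: fix typo" >> search.txt
git commit -q -a --fixup="$first"

# Fold the fixups into "Add search"; the to-do list is accepted as generated.
GIT_SEQUENCE_EDITOR=true git rebase -q -i --autosquash main

# The topic branch is now a single commit on top of main.
git log --oneline main..feature-search
```

Because this rewrites the branch’s history, it belongs on a personal branch that nobody else has pulled from.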

Because most DVCSs allow you to cherry-pick exactly which commits you want in a branch, you can easily build a release branch with just the parts that you want. There’s one caveat to cherry-picking in most distributed and centralized systems: to do it well you must have commits that are as self-contained and atomic as possible. If you make a commit with multiple features or fixes in it and later decide not to include one of them, there’s no simple way to exclude just the bit you don’t want, because it’s mixed in with all the others. And this brings us back to the idea of “topic branches.” If you’re already working with topic branches then you can easily merge the changes in one topic branch into the main branch as one atomic commit that can be included or excluded individually. However, as Aristotle pointed out in the comments:

“the power of [Git’s] rebasing allows you to go back, split the commit sensibly into several, and then transplant the rest of your history on top of that. I guess whether this is simple depends on your definition of simplicity, but git has enough tooling to support this procedure directly. In other DVCSs you have to painstakingly monkey around with far less powerful tools, and in centralised VCSs it is for all intents and purposes impossible to do it at all.”

Yes, there will always be some changes that are dependent upon some prior commit, but by having nice little atomic commits you can limit the tree of dependencies to a bare minimum.
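As a rough illustration of the cherry-picking described above, this sketch makes three atomic commits on main and then builds a release branch that contains everything “except that” one. The feature names and the release branch name are invented for the example.

```shell
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main     # call the first branch "main"
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"

echo "v1" > app.txt
git add app.txt
git commit -q -m "Initial commit"

# Three atomic commits on main, each touching only its own file.
for name in search export reports; do
    echo "$name feature" > "$name.txt"
    git add "$name.txt"
    git commit -q -m "Add $name"
done

# Start the release branch from the commit before the three features...
git checkout -q -b release-1.1 main~3

# ...and pick "Add search" (main~2) and "Add reports" (main),
# leaving "Add export" (main~1) out of the release.
git cherry-pick main~2 main

ls
```

This only works cleanly because each commit is self-contained. If the export change had been mixed into the same commit as the reports change, there would be nothing to pick individually.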

One of the best features of DVCSs is speed, but unfortunately it’s a lot like the Matrix. No one can be told how much of a difference it makes. You have to experience it for yourself. I never complained about the speed of the centralized systems I worked with until I switched to distributed version control. I found it hard to believe that “instant” commits would make a difference, or that the time it took to diff against past versions was an issue. But I swear to you that the effect on how you work is almost as radical as the improved branching, and it flows over into the branching too. Now branching and merging are not only painless, they’re instant. I create, merge, and destroy branches throughout the day without hesitation because no matter what operation I do it comes back so fast that I frequently am incapable of telling that any time passed. That means that many (most?) of my operations are happening in the sub-100ms range. Imagine being able to diff versions of entire trees in tenths of seconds.

Scaling to thousands of users is not an issue with distributed version control systems because one box doesn’t have to support all the simultaneous operations that would normally be going on. Commits, checkouts, reverts, branches, diffs… all of it happens on people’s local machines. They rarely need to interact with the canonical / central repo. This means that large companies like Google wouldn’t need to invest in massive boxes with huge processors and piles of RAM just to keep their version control system usable.

Speaking of the canonical repo… workflow doesn’t have to change at all. If you’ve got an established workflow with a centralized system there’s absolutely no reason it can’t continue on after a switch to Distributed Version Control. But, it does open up some other ways of working that you may want to explore.

In a centralized system, if the box your main repo lives on goes down you’re pretty much screwed. If your IT people are really on the ball you’ve got it syncing to another box every few minutes (or more often) and everything will fail over. Bringing the primary box back up and resyncing it may cause them to pull out their hair, but at least no one would be affected. Unfortunately it is a very rare company that’s that well prepared. And if you’re hosting an open source project you just have to hope that the people you’re hosting with have good backups and don’t go down.

Most software companies, and open source projects, simply grind to a halt when the version control system goes down. Changes can’t be shared, branches can’t be made, bugs can’t be patched, releases can’t be made, developers start building up a backlog of changes that will end up mooshed together into one commit when things do come back up, etc., etc., etc. But teams using DVCSs simply don’t care. It’s a non-issue. “The heads crashed on the drives of the main repo, and a fuel truck plowed into our offsite backups”. It simply doesn’t matter. Sure, your IT guys are going to have to deal with fixing that, but the impact to development is essentially zero. Someone stands up, or sends out an email, and says “the main repo is down, just use my box instead”, and people do, and that’s the end of it. When the main box is repaired, IT simply pulls from the box everyone’s been using in the interim and once again stands up, or sends out an email, saying “the main box is back up”. Automatically syncing a remote box for fail-over becomes trivially simple with a DVCS. Just set up a cron job to pull from the main box every minute and you’re done.
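A sketch of that fail-over setup, assuming Git: the standby is a mirror clone, and the cron job only has to run one fetch. Everything lives in a temp directory here, so the “main box” and “standby box” are just two directories standing in for two hosts. The crontab line at the end uses a made-up path.

```shell
set -e

work=$(mktemp -d)
cd "$work"

# Stand-in for a developer machine with some history.
git init -q dev
cd dev
git symbolic-ref HEAD refs/heads/main     # call the first branch "main"
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"
echo "v1" > app.txt
git add app.txt
git commit -q -m "Initial commit"
cd ..

# Stand-in for the main repo box (normally a different host).
git clone -q --bare dev main.git

# Standby box: a mirror clone copies every branch and tag.
git clone -q --mirror main.git standby.git

# New work reaches the main box...
cd dev
git remote add origin "$work/main.git"
echo "v2" >> app.txt
git commit -q -a -m "Second commit"
git push -q origin main
cd ..

# ...and this is the one command the cron job would run on the standby.
git --git-dir=standby.git fetch -q --prune origin

# Example crontab entry, once a minute (the path is hypothetical):
# * * * * * git --git-dir=/srv/git/standby.git fetch -q --prune origin
```

If the main box dies, people can point their remotes at the standby (or at any colleague’s clone) and carry on.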

When it comes to open source projects there is one factor that absolutely sucks, and that’s giving out write permissions. You don’t want to give them out to just anyone, but if you don’t, then people you don’t trust yet either have to work on your code without any revision control or, as is frequently the case, they just fork it into a version control repo they do have control over and lose the ability to easily merge their changes back into the main repo. You could just give write access to anyone, but that’s not a choice that many project managers are comfortable with. With DVCSs it simply isn’t an issue. Everyone’s got their own repo(s), and everyone can commit, branch, etc. And when they’ve got something worthwhile they can ask you to pull from them, or send you a patch that they know will work with your repo, because that’s where their repo originated and they’ve been pulling down your changes and merging them into their work.
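Here’s roughly what the patch route looks like in Git, with both sides simulated in one temp directory. The contributor never gets write access: they clone, commit on their own branch, and export the work with `git format-patch`, which the maintainer applies with `git am`. Names, emails, and paths are placeholders.

```shell
set -e

work=$(mktemp -d)
cd "$work"

# Maintainer's repo: the project everyone clones from.
git init -q project
cd project
git symbolic-ref HEAD refs/heads/main     # call the first branch "main"
git config user.name "Maintainer"         # placeholder identity
git config user.email "maintainer@example.com"
echo "v1" > app.txt
git add app.txt
git commit -q -m "Initial commit"
cd ..

# Contributor: a full repo of their own, no permissions required.
git clone -q project contributor
cd contributor
git config user.name "Contributor"        # placeholder identity
git config user.email "contributor@example.com"
git checkout -q -b fix-typo
echo "typo fixed" >> app.txt
git commit -q -a -m "Fix typo in app.txt"

# Export everything not yet in the maintainer's main as patch files.
git format-patch -q -o "$work/patches" origin/main
cd ..

# Maintainer: review the patch, then apply it with authorship intact.
cd project
git am -q "$work"/patches/*.patch
git log -1 --format='%s (author: %an, committer: %cn)'
```

The other route mentioned above, asking the maintainer to pull, would replace the last step with the maintainer running `git pull` against the contributor’s repo and branch.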

One problem that’s been an annoyance for me for years is that most bug trackers have no concept of branches. When you go into your bug tracker it shows you all your bugs, but there’s no way to tell what branches they exist in. And what happens when you fix a bug in one branch, and get QA to sign off on it, but it hasn’t been patched in another one? Is the bug fixed or isn’t it? And what about the people who don’t have an intimate knowledge of that bug? How are they supposed to know where it does and doesn’t exist?

Distributed issue trackers solve this problem. You file a ticket in the branch the bug exists in. Then that ticket follows along with any branches of that branch until someone closes the ticket. As they merge the fixes back into parent branches the closed ticket merges in too. As a result you can always tell if a bug is or isn’t fixed in the current branch. If you commit your closed ticket with your patch then even more possibilities open up. If, for example, it was decided that your cure was worse than the disease and that commit was excluded from a release, then the ticket would re-open itself in the release too (because you got rid of the fix).

To date the best distributed issue tracker, without question, is Ditz, but the field is still very young. I’ve personally been working on a fork of Ditz that makes some dramatic changes and tightly integrates itself with Git, and I’ll update this page and make an announcement as soon as it’s released (a week or two). Personally I wouldn’t recommend anything other than Ditz right now. The other projects are either too immature or abandoned and buggy. Ditz doesn’t have a lot of features yet but it’s reliable and gets the job done.

Another cool thing about Distributed Version Control Systems is that you don’t need anyone’s permission, or cooperation, to start using them. I work at a company that uses Perforce, and I think that Perforce is the devil, but it doesn’t matter, because I have Git. I do all my work in Git; constantly branching and merging and mooshing commits, and when I’m ready I just submit the completed changes back to Perforce. And this is actually better for everyone. Sysadmins don’t have to deal with the consequences of me making topic branches all over the place, the commits I make for coworkers are generally cleaner, and I can commit as often as I want, even break the build, without having to worry about the impact on others. Now, it happens that Git has a number of tools for working with the major centralized version control systems (it’s even got a CVS proxy so that CVS people too resistant to change can keep working the way they’re used to). But, if your DVCS of choice doesn’t have a tool to bridge the gap to whatever centralized system you’re forced to deal with, you can always use Tailor.

It sounds too good…

Some of you are probably thinking this sounds too good to be true, or maybe that I’m a DVCS zealot. But really, it isn’t, and I’m not. I’m a huge fan of DVCSs, it’s true, but I’m not about to claim that they will solve all your problems. To really get the full benefits of them you’ve got to start making atomic commits. This is actually true of centralized systems too, but since centralized systems tend to be such a pain to use, and frequently can’t do cool things like cherry-picking, people don’t even bother to do anything advanced with them. With a distributed system the advanced stuff becomes trivial everyday stuff, and it becomes more annoying to have to deal with those developers who won’t stop making ginormous commits with changes to multiple bugs as well as a new feature or twelve. Fortunately, you don’t have to accept their patches if you don’t want to. ;)

GUIs have come a long way since this was first posted, especially for Git and Mercurial, with quality that rivals the best tools for centralized systems.

On the command line some DVCSs like Darcs shine with totally intuitive commands, others have a bit more of a learning curve. Git definitely does because of how it stages things before commits. It appears similar enough to the way centralized systems work that many newbs expect it to work the same, and get frustrated when their expectations collide with its significantly different paradigm. Not only does Git stage things before commits but it doesn’t even think in terms of files, although, unless you’re paying attention, it may seem that it does.
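A tiny demonstration of the staging behavior that trips people up: `git add` snapshots the content as it is at that moment, so a later edit to the same file is not part of the commit. The file name and identity are placeholders.

```shell
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"

echo "line one" > notes.txt
git add notes.txt                 # stages the content as it is right now
echo "line two" >> notes.txt      # a later edit, NOT staged

git commit -q -m "Add notes"

git show HEAD:notes.txt           # prints: line one
git status --short                # prints:  M notes.txt
```

The commit holds only “line one” even though the file on disk has two lines. The same mechanism is what makes committing just part of a file possible, for example by choosing hunks interactively with `git add -p`.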

Scaling to handle extremely large repos with years of history is something that Git handles very well (Mercurial probably does too), but some systems, like Darcs, have serious problems with it.

In short, Distributed Version Control systems are totally bad-ass, and can really help with a lot of common development problems, but they’re still fairly new, and while the big ones are good, and reliable, they may not have all the polish and GUI widgets that you’re used to with your centralized system. My advice is to go get Git, unless the majority of your developers are on Windows, in which case Mercurial may be a better choice… for the moment.