Why you should be using a distributed source control system

I was reading some articles yesterday that finally made the light bulb go off about distributed source control management (scm) systems and why we should be using them. First off, a distributed scm, unlike CVS or Subversion, has no central repository that all others must pull from. It’s possible to set one up, call it the master, and tell people to pull from and push to it, but that’s a matter of convention rather than something the tool requires. What’s truly unique about these systems is that each checkout is its own self-contained ecosystem, complete with its own history. And there are many reasons this is a good thing:

  • You can make as many changes as you want, check in unfinished code, and explore new functionality without ever affecting other developers. Your work only impacts other users when you sync your repository with theirs.
  • You don’t have to have net access, or be connected to any other box, to get work done while still taking advantage of version control. Working on your laptop? Stuck in the boonies with just dial-up? Commit changes, roll back to previous versions, make branches: none of it is dependent upon some central server. When you do get back online you can sync your changes with others.
  • “You don’t need to set up and manage a central SCM host with sufficient disk space, compute power, bandwidth, and backup to support the concurrent SCM operations of your entire development community.” *
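
To make the "self-contained" idea concrete, here is a toy sketch in Python. It is not how darcs, Mercurial, or Git actually store anything; the Repo class and its clone, commit, and pull methods are names I made up purely for illustration. The only point is that every copy carries its full history, commits are purely local, and other repos see nothing until someone syncs.

```python
class Repo:
    """A toy repository: just an ordered list of (commit_id, message) pairs."""

    def __init__(self, name):
        self.name = name
        self.commits = []  # the full history lives in every copy

    def clone(self, name):
        # A clone takes the whole history with it, not just the latest files.
        twin = Repo(name)
        twin.commits = list(self.commits)
        return twin

    def commit(self, message):
        # Purely local: no other repo is touched or even contacted.
        commit_id = f"{self.name}-{len(self.commits) + 1}"
        self.commits.append((commit_id, message))
        return commit_id

    def pull(self, other):
        # Sync: copy over any commits we do not already have.
        known = {commit_id for commit_id, _ in self.commits}
        new = [c for c in other.commits if c[0] not in known]
        self.commits.extend(new)
        return len(new)


live = Repo("live")
live.commit("initial site")

laptop = live.clone("laptop")             # from here on, no network needed
laptop.commit("half-finished experiment")
laptop.commit("experiment now works")

print(len(live.commits))   # 1: live is untouched by the laptop's commits
print(live.pull(laptop))   # 2: both commits arrive only when live syncs
```

A real distributed scm also has to merge diverging histories, which this sketch happily ignores.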

Real world uses:

  • The open source project:

    If Mozilla has taught us anything it’s that you’re NOT going to get thousands of developers working on your project. So, while huge numbers of developers are something that distributed scms handle exceedingly well, I think the point is moot. But, in my experience, a developer interested in some OSS (Open Source Software) project will download the source from a traditional scm, poke around to understand it, and, if they’re smitten with the idea, start customizing it for their needs. At that point they are forced to either work without the benefits of version control or check the code into their own personal scm. If they check it into their own, you can pretty much forget about ever getting patches from them: their copy is no longer able to sync with your tree, and getting back in sync to produce a useful patch is way more work than they generally want to do. If they work without an scm (because they don’t have commit rights to yours) you may get a patch from them, but for them it’s like climbing a rock face without lines and harnesses.

    If you were to use a distributed scm, each developer would be working off their own personal copy of the repository, which could be synced at any point in time no matter how many changes, revisions, or commits they have made. They’re generally not going to check it into a separate scm of their own because that’s work that doesn’t get them any real benefit: their checkout already gives them full version control. End result: they don’t need to work without the safety and security of an scm, and they will generally always be working in a repository they can easily send you changes from when they’re ready.

  • Web developers with sites others depend on:

    If you’re really working like a pro, you’re constantly branching and merging your code. You’ve got a live site branch. You’ve got a branch for each new functionality exploration. You’ve got a trunk branch that everyone merges into their other branches regularly. In the real world you probably have a live site branch and the trunk, and you cross your fingers that each developer’s functional exploration doesn’t screw over any other developer’s pokings. Or, in far too many cases, you just have the trunk. The problem comes (for those of us who hate dealing with the pain in the ass that is the current state of merging on most traditional SCMs, and so keep branches to a minimum) when you decide you want to integrate one developer’s functional exploration into the live branch. How do you extract that code from the other code in the trunk (or any other common branch)?

    If you were using a distributed scm it would go something like this: Everything starts with the live site repo (wherever you happen to keep that). Everyone’s local repo(s) started as a copy of that. It’s not uncommon to check out another copy to go explore some new feature set in. Maybe you sync it with the live repo. Maybe you trash it. It doesn’t matter, because no other repo was affected by your fiddling. You don’t have to make a new branch in some central repo that gets stored forever even if your fiddlings proved insignificant or were just abandoned. If everyone decides that the features you worked out in a particular repo are worth keeping, they just sync with it. If you decide to wait and put it into a later release, that’s fine too, because there’s no need to untangle it from other work you, or others, have been doing. This is because to “branch” you just do a simple checkout from any other repo. When it comes to merging, you benefit from the fact that any decent distributed scm is designed to sync sweeping changes to entire code-bases, not just individual files. They all handle this a bit differently of course, and some better than others, but the point is that while you’re almost always going to have conflicts, you’re working with tools that have to be better equipped to handle them.

  • Projects with reeeeally large code-bases or many many developers:

    I can’t really speak to those because I just haven’t worked on them, but using a distributed scm does mean that you don’t have huge demands on the server hosting your scm. Plus all the aforementioned benefits.

So what are your options? Well, the main contenders appear to be darcs (written in Haskell), Mercurial (written in Python), and Git (written in C plus hooks into a bunch of other things, I think). darcs seems to be the easiest to use, with some decidedly funky features, but isn’t great for projects with huge numbers of files (it can be RAM intensive at times). Mercurial is going to be used by the newly open-sourced JDK, and Git is being used to manage the Linux kernel.

* My light bulb moment, and some of the quotes here, are thanks in no small part to this blog post about the JDK moving to Mercurial by Mark Reinhold. Thanks Mark.