feel the Source, Luke!
Considering all the software development projects I’ve been a part of, and reflecting on the myriad issues that typically arise, I can safely say that choosing a source control system was one of the lesser challenges.
At its core, a source control system (source code control system, to use its full name), also called a version control or revision control system, is a tool used in software development to store the source code for one or more projects and to provide a chronological archive of the changes made to that source code by the developers on the team. The chronological archive takes the form of snapshots of the state of the source code at the points in time where changes are made.
Unless you’re the type of person who represents themselves in court, the merits of using source control systems need no explanation. I was, however, surprised when I started to read more and more about a different type of source control system: the distributed source control system. Up to this point, and unbeknownst to me, my experience had been solely with centralized source control systems. My interest was piqued to understand the need for a different type of source control system and what problems it purported to solve.
Choosing a source control system on a project, or at a company starting a new software project, is not a significant obstacle to navigate. For consulting engagements, the choice most frequently comes down to whatever source control system the client already has in place and the processes that have already been established. There is always room for improvement in the processes and best practices around the usage of these tools, but rarely have I seen a decision to replace one source control system with another once it is established in an organization. When I have seen it happen, it has taken the form of migrating away from what people considered a product on its last legs, as when SVN invariably replaced CVS.
For startups building new software products, the choice usually comes down to what the founders have the most experience with and are most comfortable using, or, more likely trumping that, whatever hot new trend is influencing their choice of tools. As a distinct trend began to emerge, I began to read more and more about distributed source control systems.
For my part, I had had experience with numerous source control systems; the following list is ordered by how frequently I encountered them on projects:
- Subversion (SVN): Open source source control system, considered by many to be superior to and the replacement for CVS.
- Concurrent Versions System (CVS): Original open source source control system, being replaced by SVN in many situations.
- Perforce: Proprietary source control system developed by Perforce.
- Visual SourceSafe: Microsoft’s source control system.
- StarTeam: Source control system from Micro Focus, formerly Borland.
- Rational ClearCase: IBM’s behemoth source control product.
Whereas distributed source control systems seem to be the current trend, I learned that all of the ones I had previously used were considered centralized source control systems. Centralized systems use a client-server model, in which the users (clients) of the system all work against a single central instance (server) of the source control repository.
In this model, developers work locally on the version of the source code they last retrieved from the repository. When their changes are ready, they commit them to the central repository. Conflicts and merges are dealt with while connected to the central repository, and the repository keeps track of all changes made by all developers once their changes have been committed successfully.
A distributed source control system differs in that every developer has their own copy of the entire repository, including the full history of changes, and no single repository is inherently the master repository. Not having spent any time reflecting on the architecture of centralized source control systems, this concept took me a while to fully embrace, as did understanding how it differed from my past experience. It wasn’t until I started using GitHub, backed by the Git distributed source control system, that it became clear what this architectural change meant.
Having an independent local copy of the repository means that a developer’s full set of changes can be tracked (assuming commits are made frequently enough) and that operations against the repository are very fast compared to the centralized architecture. All repository operations happen locally, so they are not impacted by network speed or the size of the data being transmitted across the network. For example, forking a new branch, merging two existing branches, or creating a patch for release and integration become very fast operations, making development more agile and efficient.
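To make this concrete, here is a minimal sketch of that all-local workflow using Git on the command line. Every file name, branch name and commit message below is purely illustrative, and no network connection is involved at any point:

```shell
set -e
repo=$(mktemp -d)                          # throwaway directory for the sketch
cd "$repo"
git init -q
git config user.email "dev@example.com"    # local identity, just for the example
git config user.name  "Example Dev"

echo "v1" > app.txt
git add app.txt
git commit -q -m "initial commit"          # snapshot recorded entirely locally

git checkout -q -b feature                 # forking a branch: a local operation
echo "v2" > app.txt
git commit -q -am "feature work"

git checkout -q -                          # back to the original branch
git merge -q --no-ff -m "merge feature" feature   # merging is also local
git log --oneline                          # full history, no server consulted
```

Because each of these commands touches only the local repository, they complete near-instantly regardless of network conditions, which is the agility argument in a nutshell.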
It’s arguable that branching and merging are slower using the centralized model and that this directly influences how frequently branches are forked, and in turn how cumbersome merges can sometimes be. Whether or not this alone is a significant obstacle to development is not clear to me. It is certainly true, however, that the centralized model depends exclusively on the availability of the central repository for these operations. Without access to the repository it is not possible to track changes, branch, merge, or diff changes, and this alone may be a significant advantage of the distributed architecture.
Pedantically speaking, there is no central repository in the distributed model. In my experience, however, there is still typically one repository that is treated as the master repository, although this distinction is purely a matter of convention rather than anything distinct about that repository. As such, at certain intervals on a team with multiple developers, changes will need to be integrated into one “master” repository, and the team will face the same challenges encountered with the centralized model, albeit perhaps less frequently.
The offsetting argument, however, is that an individual developer can fork, branch and merge many more times locally than would be required centrally, and can therefore iterate on changes more rapidly. Secondly, supporters of the distributed model argue that merge support and conflict resolution are much easier and more fully featured than in centralized models, the argument being that an architecture that encourages much more branching and merging has, in turn, better built-in support for conflict resolution and patching.
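A small sketch can show that the “master” repository really is just a convention: below, a bare Git repository stands in for the agreed integration point, and a second clone receives the full history after a push. The paths and names are illustrative, with local directories standing in for what would normally be a networked server:

```shell
set -e
work=$(mktemp -d)
git init -q --bare "$work/shared.git"      # the agreed-upon "master" repository
git --git-dir="$work/shared.git" symbolic-ref HEAD refs/heads/main

git clone -q "$work/shared.git" "$work/alice"
cd "$work/alice"
git config user.email "alice@example.com"  # local identity for the example
git config user.name  "Alice"
git checkout -q -b main
echo "hello" > file.txt
git add file.txt
git commit -q -m "first commit"
git push -q origin main                    # integration point, by convention only

git clone -q "$work/shared.git" "$work/bob"
cd "$work/bob"
git log --oneline                          # Bob's clone holds the full history too
```

Nothing about `shared.git` is technically special; the team could equally designate either clone as the repository everyone integrates into.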
From my own experience, I have certainly encountered some of the challenges addressed by the distributed architecture, such as the performance of some repository operations, network latency, and dependency on the availability of the central server, but they have rarely been significant enough to warrant considering a whole new architecture to address them. On the flip side, now having had some experience with a distributed source control system (Git), I have grown to like its agility a lot and have used the branching and merging features more than I would have in the traditional model.
To me, the distributed source control architecture seems to have been a solution to a problem I didn’t really know I had until I started to use it. I guess that’s what true innovation is all about. I’d like to say the concept is relatively new, but while it has only gained traction in the last few years, it has been around since the year 2000. The first product with the distributed architecture was BitKeeper, which was created to address some of the issues experienced during the development of the Linux kernel.
Since BitKeeper, a number of alternatives have emerged and gained popularity, not least because source code control is now offered as software as a service by a large number of companies. A short list of the more popular ones I have come across:
- BitKeeper: The original distributed source control tool. Offered as a proprietary tool for license.
- Git: A popular open source tool backing numerous software-as-a-service offerings, e.g. GitHub, Gitorious, BitBucket.
- Mercurial: A marginally less well-known open source tool, also offered as software as a service, e.g. CodeBaseHQ, BitBucket.
in summary
Although I have not really experienced any major negatives with centralized systems, I do now see the benefits of using a distributed source control system. Everything else being equal, I would opt to use Git over SVN on a new project, if for no other reason than the speed and local agility it offers developers. Combined with the trend of more projects adopting it, this bodes well for its continued penetration into projects and its continued evolution as a source control system. There are some pretty serious open source projects using this model (e.g. the Linux kernel), and you can be guaranteed that the tools will only get better in such circumstances.
Of course, there are a lot of other things for a team to consider when selecting a source code control tool, such as integration with a continuous build system, issue tracking integration, and integration with code coverage and code review tools. In my opinion, when considering the build and release system as a whole, it is more important that the source control system fits well into the existing systems and processes than how the tradeoffs between a centralized and a distributed architecture play out.
It will always hold true that even the best tool will never adequately address shortcomings in the processes and practices adopted by a development team as a whole. A poor choice of source control system can always be offset by good processes and team discipline; the same cannot be said of the reverse.
The argument against the distributed architecture is that it might encourage developers to fork and replicate the code base too easily and to undertake too large a feature set without ongoing peer review. I regard this, however, as a shortcoming in the developer’s approach rather than a problem introduced by the tool, and one best mitigated through the team’s development methodology.
A word of caution in conclusion. Having a local repository, and the flexibility it brings, should further reinforce the need for developers to back up their development machines. With potentially more offshoot work being done locally and not committed centrally, there is a greater need to have continuous backups of developers’ machines in place. A developer who fails to see the risk in this regard may soon be representing himself in court.
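One simple way to mitigate this with Git itself is to mirror the local repository to a second repository acting as a backup target. This is only a sketch: the paths below are illustrative, and in practice the backup repository would live on another machine rather than in a sibling directory:

```shell
set -e
work=$(mktemp -d)
git init -q "$work/project"
cd "$work/project"
git checkout -q -b main                    # explicit branch name for the example
git config user.email "dev@example.com"    # local identity for the example
git config user.name  "Example Dev"
echo "draft" > notes.txt
git add notes.txt
git commit -q -m "local work not yet shared"

git init -q --bare "$work/backup.git"      # stand-in for a remote backup location
git remote add backup "$work/backup.git"
git push -q --mirror backup                # copies every local branch and tag
git --git-dir="$work/backup.git" log --oneline main
```

Re-running `git push --mirror backup` periodically (or from a scheduled job) keeps the backup in sync, including branches that have never been pushed to the team’s shared repository.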