Version Control with Subversion - Repository Maintenance

Despite numerous advances in technology since the birth of the modern computer, one thing unfortunately rings true with crystalline clarity—sometimes, things go very, very awry. Power outages, network connectivity dropouts, corrupt RAM and crashed hard drives are but a taste of the evil that Fate is poised to unleash on even the most conscientious administrator. And so we arrive at a very important topic—how to make backup copies of your repository data.

There are generally two types of backup methods available for Subversion repository administrators—incremental and full. We discussed in an earlier section of this chapter how to use svnadmin dump --incremental to perform an incremental backup (see the section called “Migrating a Repository”). Essentially, the idea is to only backup at a given time the changes to the repository since the last time you made a backup.

A full backup of the repository is quite literally a duplication of the entire repository directory (which includes either Berkeley database or FSFS environment). Now, unless you temporarily disable all other access to your repository, simply doing a recursive directory copy runs the risk of generating a faulty backup, since someone might be currently writing to the database.

In the case of Berkeley DB, Sleepycat documents describe a certain order in which database files can be copied that will guarantee a valid backup copy. And a similar ordering exists for FSFS data. Better still, you don't have to implement these algorithms yourself, because the Subversion development team has already done so. The hot-backup.py script is found in the tools/backup/ directory of the Subversion source distribution. Given a repository path and a backup location, hot-backup.py —which is really just a more intelligent wrapper around the svnadmin hotcopy command—will perform the necessary steps for backing up your live repository—without requiring that you bar public repository access at all—and then will clean out the dead Berkeley log files from your live repository.

Even if you also have an incremental backup, you might want to run this program on a regular basis. For example, you might consider adding hot-backup.py to a program scheduler (such as cron on Unix systems). Or, if you prefer fine-grained backup solutions, you could have your post-commit hook script call hot-backup.py (see the section called “Hook Scripts”), which will then cause a new backup of your repository to occur with every new revision created. Simply add the following to the hooks/post-commit script in your live repository directory:

The resulting backup is a fully functional Subversion repository, able to be dropped in as a replacement for your live repository should something go horribly wrong.

There are benefits to both types of backup methods. The easiest is by far the full backup, which will always result in a perfect working replica of your repository. This again means that should something bad happen to your live repository, you can restore from the backup with a simple recursive directory copy. Unfortunately, if you are maintaining multiple backups of your repository, these full copies will each eat up just as much disk space as your live repository.

Incremental backups using the repository dump format are excellent to have on hand if the database schema changes between successive versions of Subversion itself. Since a complete repository dump and load are generally required to upgrade your repository to the new schema, it's very convenient to already have half of that process (the dump part) finished. Unfortunately, the creation of—and restoration from—incremental backups takes longer, as each commit is effectively replayed into either the dump file or the repository.

In either backup scenario, repository administrators need to be aware of how modifications to unversioned revision properties affect their backups. Since these changes do not themselves generate new revisions, they will not trigger post-commit hooks, and may not even trigger the pre-revprop-change and post-revprop-change hooks. ^[19] And since you can change revision properties without respect to chronological order—you can change any revision's properties at any time—an incremental backup of the latest few revisions might not catch a property modification to a revision that was included as part of a previous backup.

Generally speaking, only the truly paranoid would need to backup their entire repository, say, every time a commit occurred. However, assuming that a given repository has some other redundancy mechanism in place with relatively fine granularity (like per-commit emails), a hot backup of the database might be something that a repository administrator would want to include as part of a system-wide nightly backup. For most repositories, archived commit emails alone provide sufficient redundancy as restoration sources, at least for the most recent few commits. But it's your data—protect it as much as you'd like.

Often, the best approach to repository backups is a diversified one. You can leverage combinations of full and incremental backups, plus archives of commit emails. The Subversion developers, for example, back up the Subversion source code repository after every new revision is created, and keep an archive of all the commit and property change notification emails. Your solution might be similar, but should be catered to your needs and that delicate balance of convenience with paranoia. And while all of this might not save your hardware from the iron fist of Fate, ^[20] it should certainly help you recover from those trying times.

^[15]That, by the way, is a feature , not a bug.

^[16]While svnadmin dump has a consistent leading slash policy—to not include them—other programs which generate dump data might not be so consistent.

^[17]E.g.: hard drive + huge electromagnet = disaster.

^[18]The Subversion repository dump format resembles an RFC-822 format, the same type of format used for most email.

^[19] svnadmin setlog can be called in a way that bypasses the hook interface altogether.

^[20]You know—the collective term for all of her “fickle fingers”.

Repository Backup