[tahoe-dev] darcs vs. mercurial performance

Brian Warner warner at allmydata.com
Sun Nov 4 00:39:53 PDT 2007


On Sat, 3 Nov 2007 21:43:53 -0600
zooko <zooko at zooko.com> wrote:

> You showed me some nice performance improvements using our tahoe  
> repository with hg compared to darcs the other day.  Is the faster  
> operation a motivator for you to switch to hg?  (It is for me, but  
> I'm so attached to darcs...)
> 
> Could you post something like that to this list so that I can send a  
> comment about it to bugs at darcs.net?

Sure thing! First, though, I do want to make it clear that I'm not trying to
provoke a darcs-vs-mercurial flamewar... I really like darcs for a lot of
things, and part of my interest in trying something else is just interest in
trying something else. I'm a bit of a version-control-system junkie, in case
you haven't noticed :).

I used 'hg convert --source darcs tahoe-trunk' to convert a copy of the Tahoe
source repository (obtainable from http://allmydata.org/repos/tahoe) into a
mercurial repository. The process took about 70 minutes, if I'm remembering
correctly, and had to churn through 1500 patches. The Tahoe source tree HEAD
currently has 177 files with a total of 3.16MB of data.

(btw, after looking more closely at the hg output, I see it is incorrect: it
got confused by a couple of directory moves that took place several months
ago, and has directories like Crypto/ and zfec/ that should not be present.
It also seems to have botched two patches, resulting in differences between
the resulting HEAD and the darcs one, so we'd need to clean up and verify the
hg repo before actually using it. [note to self: I ran 'hg convert' a second
time to pick up some recently added changes, and one of the botched patches
was the second one in that second batch]).

A direct darcs checkout of our tree consumes (according to 'du', on a
filesystem that probably uses 4KiB blocks) 20MB, and has 1893 files. An hg
checkout consumes 13MB under the same conditions. (That's including the bogus
directories; it consumes 12MB and 824 files once I clean those out to make
the hg tree look like the darcs one.)
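As a back-of-the-envelope check on those 'du' numbers, the file counts alone
explain part of the gap: with 4KiB allocation units, each file's tail wastes
half a block on average. A quick sketch (the 4KiB block size is the assumption
stated above):

```python
# Expected block-rounding slack for a tree, assuming 4 KiB filesystem
# blocks (each file's tail wastes half a block on average).
# The file counts are the ones measured above.
BLOCK = 4096

def expected_slack(n_files):
    """Average bytes lost to rounding file sizes up to whole blocks."""
    return n_files * BLOCK // 2

print(expected_slack(1893) / 1e6)  # darcs checkout (1893 files): ~3.9 MB
print(expected_slack(824) / 1e6)   # cleaned hg checkout (824 files): ~1.7 MB
```

So roughly 2MB of the on-disk difference is plain per-file rounding overhead;
the rest is actual content.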

A 'darcs get' over HTTP (using apache on allmydata.org as the web server) via
my home DSL line took 47 seconds. A 'hg clone' over HTTP (using 'hg serve' as
the web server) on that same connection takes 13 seconds. I can't currently
test the 'hgweb.cgi' form of access (which would use apache as the webserver
instead of hg directly), but tests with the Foolscap repository suggest that
the results would be within 10% of the 'hg serve' form.

A 'darcs get' over SSH takes 167 seconds, almost all of which is fetching
patches (one at a time: network utilization is very low during this period);
perhaps the last 5 or 10 seconds is actually applying them to build up the
new tree. A 'hg clone' over SSH takes 18 seconds, out of which probably 16
seconds is fetching revisions (which appears to be bandwidth limited: network
utilization is very high during this period), and the last second or two is
building the local tree.

A 'darcs push -a' of a 3-line change over SSH takes 2.67 seconds. An 'hg
push' of the same change over SSH takes 1.16 seconds.

A 'darcs pull -a' of a new 3-line change over SSH takes 4.82 seconds. An 'hg
pull -u' of the same change takes 1.15 seconds. (the '-u' tells hg to update
the local tree to reflect the new changes, to match the darcs behavior).
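Summarizing the measurements as ratios (just arithmetic over the timings
reported above):

```python
# Speedup ratios implied by the timings above (darcs time / hg time).
timings = {
    "checkout over HTTP": (47, 13),
    "checkout over SSH": (167, 18),
    "push over SSH": (2.67, 1.16),
    "pull over SSH": (4.82, 1.15),
}
for op, (darcs_s, hg_s) in timings.items():
    print(f"{op}: {darcs_s / hg_s:.1f}x faster with hg")
# checkout over HTTP: 3.6x, checkout over SSH: 9.3x,
# push over SSH: 2.3x, pull over SSH: 4.2x
```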


MOTIVATIONS:

My motivation for considering a switch to hg has three components: speed,
accessibility, and release management.

SPEED: The 3x speed increase for HTTP (i.e. read-only) checkouts is not an
immediate factor, since we do full checkouts so rarely, but I'm concerned
with how long those checkouts will take in another 12 months when we've got
twice or three times as many patches in the repository. This mode mostly
affects the developer community that doesn't have commit access, and also the
buildslaves.

The 10x SSH checkout difference is more worrying, and really makes me
concerned about the future. I suppose it doesn't really affect my daily life
right now: I have exactly one full from-remote checkout per machine, and then
use a dozen local trees which all point at that local checkout. (pushing
changes from a working tree to the canonical repo is a two-step process). But
waiting several minutes (during which it feels like nothing is happening,
since the network utilization is so low) for a checkout that I know could be
done in 20 seconds is frustrating.

My real concern here again is with the future: allmydata's other codebase
(which has closer to 5000 patches) takes so long to perform a full checkout
(I just measured over 12 minutes; it's worse on our slow office DSL) that we
strenuously avoid doing it remotely: when something is so screwed up on our
local end that we need a new checkout, we do a 'darcs get' on the same
machine as the canonical repo, tar up the result, transfer it with scp, then
unpack the result.

The speed differences of push and pull are my biggest concerns, because I
think they will get worse as the repository grows.

Our coworker Rob is constantly frustrated by inexplicable minutes-long delays
with 'darcs push'. I haven't seen nearly the sorts of problems he has
(perhaps because I never commit from Windows or OS X), but I'm worried that
this is an inevitable consequence of managing large trees with darcs, and so
I worry that as the Tahoe repository gets bigger, it (and we) will start to
suffer the same problems. I assume this delay comes from two things: the
algorithmic complexity of doing the patch algebra to determine what must be
sent, and the one-at-a-time nature of the patch transfer steps (I believe it
is using scp or sftp to copy things one at a time, whereas hg seems to be
using a custom protocol over ssh like the way cvs and svn do it).

ACCESSIBILITY: mercurial is an order of magnitude easier to compile than
darcs. This has never affected me personally, since I'm running debian
everywhere and tend to only use i386 platforms, but I seem to recall some
folks using more outlying operating systems or hardware (amd64? OS-X/ppc?
opensolaris?) who did not yet have a functioning ghc and thus couldn't get a
darcs binary running. Mercurial only needs python and gcc, so I do not think
we'll have any community members who are unable to get the latest HEAD
because they lack the tools to do a checkout.

Mind you, I can't think of specific examples of platforms or people for whom
the lack of a darcs binary was preventing them from playing with tahoe.

RELEASE MANAGEMENT: as a former Build Guy, I really really like the
cryptographic hash identifiers that Mercurial provides for each revision.
Being able to mark a build as being "version 039a34be720c" and know that this
completely nails down the contents of that source tree is wonderfully
reassuring. I don't get the same feeling of confidence from a darcs tag.

Partly this is a reflection of my nervousness about the internal darcs patch
layout. We've tried to repair darcs repositories before, in incorrect ways,
by assuming that the hash-like names of .gz files in _darcs/patches/ actually
indicate a hash of their contents, and thus assuming that two such files with
the same names in different trees are identical. They are not (as best I can
tell the files are modified as the patches they represent are commuted with
each other), and we managed to build a couple of corrupted trees in the
process of trying to fix something else, but all of those trees appeared from
the outside to have the same revision. Doing such a thing in hg would
invalidate all the hashes. (In addition, you can't 'unpull' a revision in hg,
so I suspect there are slightly fewer ways to corrupt a repository. But OTOH
we're pretty talented at breaking things :-).

Also, I can't compactly refer to any intermediate darcs revisions: there is a
'context file' which is supposed to fulfill this purpose (which I've tried to
use for this purpose in Buildbot, without a lot of success), but it's huge,
and unusable as a point of discussion (as in "I built current HEAD, by which
I mean 039a34be720c, and encountered the following bug:..").

ETC: of course, the speed and release-management issues are entirely a result
of the fundamentally different approaches between darcs and hg (really
between darcs and everything else in the world). Being able to pull specific
darcs patches from one branch and push them into another (along with
everything the patch depends upon) is a wonderful tool for branch management,
and has probably reduced the amount of effort we put into making release
branches on the older allmydata codebase by 50%. On the other hand, we
probably spend an extra 10% on trying to figure out why a given patch seems
to depend upon a bunch of seemingly-unrelated ones, and how to work around
that, and how to record patches differently to avoid the spurious
dependencies. And Rob probably spends an extra 10% all by himself just
waiting for 'darcs push' to complete.

I'd anticipate that a mercurial-based Tahoe repository would represent a
significant slowdown for certain branch-management policies, and would
therefore push us towards different policies. (my general rule is that each
branch you have outstanding doubles the amount of developer effort required,
so I'm not a big fan of having branches live for any longer than really
necessary). Without darcs' patch-dependency management, I'd strongly
encourage us to stick to trunk as much as possible, and to keep release
branches short (i.e. think long and hard before merging a fix from trunk to
the branch). We're mostly doing this with the old allmydata.com codebase now,
but it'd probably need to be even stricter if we didn't have darcs to help us
out.

In addition, there are some Mercurial quirks that would slow us down. The
fact that hg makes a (monotone-like?) distinction between the set of
revisions in your database and the actual revision of your working tree has
tripped me up at least once a week: doing an 'hg pull' *without* the --update
option gives you a repository that reports it has all the changes you just
pulled, but a working tree that hasn't been updated to show them. I've gotten really used
to blindly typing 'darcs record', 'darcs push', and 'darcs revert', knowing
that the interactive phase will keep me from causing any unwanted changes,
and hg has no such safety nets (except for the recently-added incremental 'hg
record'). Mercurial does not have the roll-back-time 'darcs unpull' feature,
and is instead limited to a one-revision undo capability, so I'm slightly
more fearful of making bad changes that I can't erase from history.

But, hg has these other spiffy features, like the 'hg view' graphical
revision history browser, and the built-in quilt-like patch management
thingy, and gpg signatures of revisions (although I do not yet understand
what exactly this provides or what value it offers). Really this is a
reflection of the ease with which hg plugins can be written, which I think is
a strong argument in hg's favor.


Anyways, those are my thoughts. I'm not really pushing to make changes any
time soon, but I'm always looking to learn more about these tools. I really
need to spend some time with bzr or monotone to learn those perspectives too.

cheers,
 -Brian

