[tahoe-dev] Keeping local file system and Tahoe store in sync

Brian Warner warner-tahoe at allmydata.com
Tue Feb 3 17:40:13 PST 2009


On Tue, 3 Feb 2009 18:11:24 -0700
Shawn Willden <shawn-tahoe at willden.org> wrote:

> Take a look in the rdiff-backup source code. It handles extended attributes
> and ACLs.

Cool, thanks.

> I plan to handle resource forks as a separate file, associated with the
> data fork by referencing both of them from the single backuplog entry.

Yeah, if we had node-metadata then I'd be tempted to store the resource forks
in separate files and bind them via the metadata. Since we don't, when I get
around to handling those forks, I'll probably bind them via the
edge-metadata, which is safe because this tool creates read-only directories
exclusively (so the file+fork pair is still effectively immutable).
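Something like this, purely as a sketch (the "resource_fork_cap" metadata key
is made up here for illustration, not an existing convention):

    # hypothetical edge-metadata binding a data fork to its resource fork
    data_fork_cap = "URI:CHK:..."   # placeholder readcap for the data fork
    rsrc_fork_cap = "URI:CHK:..."   # placeholder readcap for the resource fork

    children = {
        "report.doc": (data_fork_cap,
                       {"mtime": 1233700000.0,
                        "resource_fork_cap": rsrc_fork_cap}),
    }
    # 'children' then goes into a read-only (immutable) directory, so the
    # file+fork pair stays frozen together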

> I'm probably being excessively picky about this area, but it really bugs me 
> that a backup may span days or weeks (or perhaps never finish!).

Good point, it's nice to have a coherent short-exposure snapshot of the
filesystem. Oh, how I wish ZFS were easier for me to use..

> My solution is to "scan fast, upload slow". The snapshot then spans the
> time required to scan the file system, including hashing the files that
> need to be hashed, which isn't fast but is unavoidable.

It's not fast, no: in my experiments, hashing the whole disk takes at least
several hours, and sometimes most of the day. But I think we're both planning
to use a cheap path+timestamp+size(+inode?) lookup table and give the user the
option of skipping the hash when those values haven't changed.
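Something along these lines (just a sketch; the function and table names are
made up):

    import os

    def probably_unchanged(path, backupdb):
        # 'backupdb' maps path -> (mtime, size, inode) recorded at the
        # previous backup run
        st = os.stat(path)
        return backupdb.get(path) == (st.st_mtime, st.st_size, st.st_ino)

If that returns True, the tool can (at the user's option) reuse the readcap it
recorded last time instead of re-reading and re-hashing the file.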

> If I understand immutable read caps correctly (I need to go read that
> code), I should be able to upload the log which contains all of the read
> caps before uploading the files those read caps reference.

Not really, unfortunately. The immutable file read-cap contains two
crypto-length strings. The first is an encryption key, which can be generated
at random but is usually generated by hashing the contents of the file (along
with a per-user "convergence secret"). The advantage of hashing the file is
that uploading the same file twice results in the same readcap, so you don't
use extra storage. You can also think of this as a limited form of the
"backupdb" which remembers what's been uploaded before, indexed by a hash of
the file. On the other hand, using this technique means that you have to make
an extra read pass over the file. Also note that we don't yet have an API for
providing a random or pre-generated encryption key: there's a ticket #320
which touches on this, but we haven't implemented it yet.
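To make the first string concrete, here's a sketch of the *idea* of deriving
the key from the file contents plus the convergence secret (the real tahoe
derivation tags its hashes and differs in detail, so treat this as
illustration only):

    import hashlib

    def convergent_key(path, convergence_secret):
        h = hashlib.sha256()
        h.update(convergence_secret)          # per-user secret, as bytes
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(2**16), b""):
                h.update(chunk)               # this is the extra read pass
        return h.digest()[:16]                # used as the AES key

Uploading the same file twice with the same convergence secret gives the same
key (and therefore the same readcap), which is what makes it act like a
content-indexed backupdb.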

The second string is a sha256 hash of the "URI Extension Block", which itself
contains hashes of the generated shares, which of course depend upon the
encrypted file, which of course depends upon both the original file and the
encryption key. This is the fundamental value which provides integrity
checking, and to compute it you have to do the entire encrypt+encode process
(everything but actually upload the generated shares to some set of storage
servers). It is normally computed at the very end of the upload process,
using values that were stashed during the encode+upload phases.
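The shape of the dependency chain, as a toy sketch (toy_encrypt and toy_encode
are trivial stand-ins, not tahoe's real AES and zfec code):

    import hashlib

    def toy_encrypt(key, plaintext):
        # stand-in for AES: XOR against the repeated key (not real crypto)
        return bytes(p ^ key[i % len(key)] for i, p in enumerate(plaintext))

    def toy_encode(ciphertext, total_shares=10):
        # stand-in for zfec erasure coding: just tag some copies
        return [bytes([i]) + ciphertext for i in range(total_shares)]

    def toy_readcap(plaintext, key):
        ciphertext = toy_encrypt(key, plaintext)
        shares = toy_encode(ciphertext)
        share_hashes = [hashlib.sha256(s).digest() for s in shares]
        ueb = b"".join(share_hashes)     # the real UEB also carries sizes,
                                         # encoding parameters, etc.
        ueb_hash = hashlib.sha256(ueb).digest()   # the second string
        return (key, ueb_hash)           # the two pieces of the readcap

The point is that ueb_hash only exists after every share has been produced and
hashed, which is why it normally falls out at the very end of an upload.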

So, given a file on disk, you have to do almost the entire Tahoe upload
process to find out what the eventual Tahoe readcap is going to be. This
sounds like it's at odds with your plan to upload the "backuplog" before you
finish uploading some of the actual data files. I'm not sure how to reconcile
the two.


cheers,
 -Brian

