[tahoe-dev] prodnet GC pass coming up: send us your manifests!

Brian Warner warner-tahoe at allmydata.com
Thu Nov 20 12:51:49 PST 2008


We're planning to do a manual garbage-collection (GC) pass on the
allmydata.com production network (prodnet) next week. This should let us
reclaim roughly 9TB of disk space. Because we don't have real accounting
mechanisms in place yet, this is going to be somewhat haphazard. Here's
how it's going to work:

 1: I have a script running now that is collecting "manifests" from all
    allmydata.com user accounts. This script should finish in the next few
    days.
 2: This email is soliciting manifests from all users who have data stored in
    the production network which is *not* reachable from the allmydata.com
    user account database. These are files which are owned by you but not
    readable by us.
 3: Next week, I'll compute the union of these manifests, and then run a tool
    on all prodnet storage servers. For any storage index that is not in the
    combined manifest, the tool will delete half the shares (any share
    numbered 5-9), skipping shares that are less than a month old.
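
The selection rule in step 3 can be sketched roughly as follows. This is only
an illustration, not the actual prodnet tool: the function names, the
(storage_index, sharenum, created) tuple layout, and the 30-day reading of
"a month" are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def load_manifest(lines):
    """Return the set of non-empty storage index strings from some lines."""
    return {line.strip() for line in lines if line.strip()}


def combine_manifests(manifests):
    """Union of several manifests (each one a set of storage indexes)."""
    combined = set()
    for manifest in manifests:
        combined |= manifest
    return combined


def shares_to_delete(shares, keep, now, min_age=timedelta(days=30)):
    """Yield (storage_index, sharenum) pairs that the pass would delete.

    shares: iterable of (storage_index, sharenum, created) tuples
            (a hypothetical layout chosen for this sketch).
    keep:   the combined manifest, i.e. storage indexes still in use.
    """
    for storage_index, sharenum, created in shares:
        if storage_index in keep:
            continue  # still reachable from some manifest: never touched
        if now - created <= min_age:
            continue  # too young: might postdate the manifest run
        if 5 <= sharenum <= 9:
            yield (storage_index, sharenum)  # shares 0-4 are always kept
```

Note that a storage index listed in any one manifest protects all of its
shares, which is why the manifests are combined before the pass runs.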

The "manifest" is a set of storage index strings, like
"7ffkwt55wytdoiahk5dljh5vzi". Each storage index refers to a set of
(generally 10) shares on the storage servers. Note that the storage index is
*not* a file-cap or directory-cap: it cannot be used to recover plaintext (in
fact the storage index is effectively an irreversible cryptographic hash of
the encryption key). The storage index can get you the unvalidated
ciphertext, but not the plaintext.

I know there are a handful of folks who are storing data in the prodnet grid
that is not attached to an allmydata.com-held rootcap: I have a small test
directory in there, Rob has some data backed up through the Mac client, and
Peter has helped a couple of users get Linux clients running. If you have a
Tahoe node that's connected to the prodnet Introducer
(with an introducer.furl that has a tubid prefix of 5qdmoc), and you've used
the CLI 'tahoe create-alias' command, or the webapi's front-page "Create an
[unnamed] Directory" button, then you probably fall into this category. If
you've only added data through the JavaScript-based webdrive (reachable from
the "login" links on www.allmydata.com), or the native Windows client, then
you probably do not. Also note that this is prodnet, not testgrid. We aren't
planning any GC work on the testgrid this year.

To avoid having your private files' shares deleted, you need to compute the
storage-index manifest for your files and send it to me before I do the GC
pass next week. To build this manifest, do the following:

 1: install the latest Tahoe trunk code (the 'tahoe manifest' CLI command
    was only added a few days ago)

 2: make sure your CLI works: see docs/CLI.txt for details. You'll need to
    know the URI of your root directory. If you use the CLI a lot, you
    probably already have this in an "alias": 'tahoe list-aliases' will show
    it to you. 'tahoe ls ALIAS:' should show you the files you want to
    preserve.

 3: run 'tahoe manifest --storage-index ALIAS: >manifest.out', or run it on
    an explicit dircap, like:

     tahoe manifest --storage-index URI:DIR2:4unpz45jffsbk2crhkf3guyhvq:7rcyju55rqdspdvegipypmq4k4gactjsxzjhaqyhcctmll26it6q >manifest.out

 4: examine the manifest.out to make sure it only contains storage index
    strings (which are 26 characters of base32), and not full-power filecaps
    or filenames. The '--storage-index' option is very important: the 'tahoe
    manifest' command without --storage-index will return a table of
    filenames and filecaps, which is private information that we neither want
    nor need. If you see filenames or the string "URI:CHK:" or "URI:DIR2:" in
    manifest.out, delete it and re-run the command with --storage-index .

 5: email manifest.out to me, at warner-tahoe at allmydata.com .

 6: if you have multiple private root directories (i.e. multiple starting
    points), run this command multiple times, and send me the concatenated
    results. Note that if you have rootdir/ and rootdir/subdir/ , you don't
    need to build a manifest of subdir/ : everything in subdir/ will be
    included in the rootdir/ manifest. But if you have rootdir1/ and
    rootdir2/ (and neither one is reachable from the other), then you'll need
    to build a manifest for both.
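
The check in step 4 can also be done mechanically. Below is a minimal sketch,
assuming one storage index per line and the lowercase base32 alphabet (a-z,
2-7); check_manifest is a hypothetical helper written for this example, not
part of Tahoe.

```python
import re

# 26 characters of lowercase base32 (assumed alphabet: a-z and 2-7).
STORAGE_INDEX_RE = re.compile(r"[a-z2-7]{26}")


def check_manifest(text):
    """Split manifest text into (storage_indexes, suspicious_lines).

    storage_indexes is a set, so concatenated manifests from several
    root directories are de-duplicated for free. suspicious_lines holds
    anything else, such as filenames or "URI:CHK:" / "URI:DIR2:" caps.
    """
    good, bad = set(), []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if STORAGE_INDEX_RE.fullmatch(line):
            good.add(line)
        else:
            bad.append(line)
    return good, bad
```

To use it, read manifest.out into a string and pass it to check_manifest. If
the second return value is non-empty, do not send the file: delete it and
re-run 'tahoe manifest' with --storage-index .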


The shares in our prodnet grid can be put into three categories:

 * reachable by active accounts
 * reachable by privately-held rootcaps
 * no longer reachable (deleted, or only reachable by inactive/cancelled
   accounts)

This GC pass is intended to remove shares in the third "no-longer-reachable"
category. Since we're paranoid, we're only planning to delete half the shares
(i.e. we'll keep shares 0,1,2,3,4 and delete shares 5,6,7,8,9). This will
reclaim only half the space that a full GC run would, but it shouldn't
actually make the GC'ed files unretrievable (except for a small handful of
files that were unlucky enough to have three or more of the 0,1,2,3,4 shares
on the four nodes of the late prodtahoe7). To avoid a race condition, the GC
pass will not touch shares that are less than a month old: these could be
non-garbage shares that were added after the start of the 'tahoe manifest'
run, and so would be missing from the manifests.
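
The "three or more" arithmetic above can be checked with a small sketch. It
assumes that any 3 of a file's 10 shares are enough to recover it, which is
what the paragraph implies but does not state outright; the function names
are made up for this example.

```python
def after_gc(shares_present):
    """Share numbers that remain once the pass deletes shares 5-9."""
    return {n for n in shares_present if n < 5}


def still_retrievable(shares_present, needed=3):
    """True if enough distinct shares survive (needed=3 is an assumption)."""
    return len(set(shares_present)) >= needed


# A file with all ten shares keeps 0-4 after the pass: still retrievable.
healthy = set(range(10))
print(still_retrievable(after_gc(healthy)))

# A file that already lost shares 0, 1 and 2 keeps only 3 and 4: too few.
unlucky = healthy - {0, 1, 2}
print(still_retrievable(after_gc(unlucky)))
```

Losing only two of shares 0-4 beforehand still leaves three after the pass,
which is why the exception starts at three lost shares.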

So, to keep your private files alive, send me your storage-index manifests by
Monday! If you need help building this list, send me email, or contact us on
IRC (#tahoe on irc.freenode.net).

cheers,
 -Brian

