#150 closed defect (fixed)

don't overfill your filesystem's directories -- and make intermediate dirs be leading prefixes of storage indexes?

Reported by: zooko Owned by: zooko
Priority: major Milestone: 0.8.0 (Allmydata 3.0 Beta)
Component: code-storage Version: 0.7.0
Keywords: scaling scalability Cc:
Launchpad Bug:


As mentioned on the Performance page, ext3 can store no more than 32,000 entries in a single directory. This ticket is to work-around such limitations by using a few bits of the storage index to choose a subdirectory. Something like this:

p = os.path.join(idlib.b2a_l(s[:2], 14), idlib.b2a(s))

Would allow us to store 524,288,000 shares.

(Yay -- there's a use for z-base-32's bit-length!)

Attachments (1)

convertshares.py (860 bytes) - added by zooko at 2008-01-31T22:12:00Z.

Download all attachments as: .zip

Change History (22)

comment:1 Changed at 2007-09-26T21:56:50Z by warner

For small-to-medium size stores, it also adds an extra 4kiB per share (for the extra directory), but I'm ok with that. This brings the per-share overhead (assuming one share per bucket, i.e. one share per server) from 5486B to 9582B. (1390 bytes of hashes and lease info, 4096 bytes for the bucket directory, and an extra 4096 bytes for the three-character-long intermediate directory that this change would add)

comment:2 Changed at 2007-09-28T02:40:50Z by warner

oh, and we should make sure that a call to allocate_buckets() that would fail because of one of these limits should fail gracefully, by telling the caller that we won't accept their lease request. We might already do this, I'm not sure. Basically it's just a matter of getting the 'except EnvironmentError?' catches right.

comment:3 Changed at 2007-10-18T23:29:27Z by zooko

  • Milestone changed from 0.7.0 to 0.6.2
  • Status changed from new to assigned
  • Version changed from 0.6.0 to 0.6.1

comment:4 Changed at 2007-11-01T17:25:02Z by zooko

  • Milestone changed from 0.6.2 to 0.7.1

We're focussing on an imminent v0.7.0 (see the roadmap) which hopefully has Small Distributed Mutable Files and also a fix for bad SHA-256. So I'm bumping less urgent tickets to v0.7.1.

comment:5 Changed at 2007-11-13T18:26:29Z by zooko

  • Milestone changed from 0.7.1 to 0.7.2
  • Version changed from 0.6.1 to 0.7.0

We need to choose a manageable subset of desired improvements for [ http://allmydata.org/trac/tahoe/milestone/0.7.1 v0.7.1], scheduled for two week hence, so I'm bumping this one into v0.7.2, scheduled for mid-December.

comment:6 Changed at 2007-12-18T00:07:08Z by zooko

  • Keywords scaling added

comment:7 Changed at 2007-12-18T00:09:03Z by zooko

  • Keywords scalability added

comment:8 Changed at 2008-01-04T18:45:56Z by zooko

  • Milestone changed from 0.7.2 to 0.7.1

This is important for scaling up, and it is easy, and I'll do it. Bringing it forward to Milestone 0.7.1.

comment:9 Changed at 2008-01-05T05:30:50Z by warner

cool. If it goes into 0.7.1, how about we make it check the old location too, so that 0.7.0-generated shares are readable by an 0.7.1 node? We could get rid of this extra check when 0.8.0 is released, or maybe give people a little tool to migrate their shares to the new locations.

comment:10 Changed at 2008-01-23T02:31:33Z by zooko

  • Milestone changed from 0.7.1 to 0.10.0

comment:11 Changed at 2008-01-31T19:05:03Z by zooko

  • Milestone changed from 0.10.0 to 0.8.0 (Allmydata 3.0 Beta)

comment:12 Changed at 2008-01-31T19:15:02Z by warner

oh, better yet, if we put the shares in a slightly different place, then we could test for the existence of the old directory at startup, and if it's there, set a flag, which will cause subsequent lookups to check both the old (flat) place and the new (nested) place.

Or do automatic migration at startup. I'm not super-fond of this, both since for a very large store it could take quite a while (although we don't have any very large stores yet), and because for some reason I'm uneasy about automatically moving things around like this.

Or put the nested directories in the same place as the flat shares went (BASEDIR/storage/shares/), but do a quick walk at startup to see if there are any actual shares there (bucket directories with the full-storage-index-sized name). If so, set the flag to look for the flat place in addition to the nested place. With this approach, the "migration tool" would just exist to speed up share lookup slightly (one os.stat instead of two), but otherwise everything would Just Work.

OTOH, the code would be simpler with just a single approach, and if we write up a migration tool at the same time, we can fix all of the 12 storage servers we've currently got in half an hour, and be done with it.

Changed at 2008-01-31T22:12:00Z by zooko

comment:13 Changed at 2008-01-31T22:12:08Z by zooko

Yes, let's manage the 0.8.0 storage code (src/allmydata/storage.py) and the upgrader tool separately. Here's the upgrader tool attached to this ticket. It assumes that the current working directory is the storage/shares directory. It should work even if there is an incomplete upgrade (such as if an earlier run of the upgrade tool died), and even if there is a tahoe node currently running and using this directory.

comment:14 Changed at 2008-01-31T22:37:58Z by zooko

Fixed by b80cfeb186860ab6.

Don't forget to snarf the convertshares.py script attached to this ticket if you want to upgrade shares from the old storage layout.

comment:15 Changed at 2008-01-31T22:38:20Z by zooko

  • Resolution set to fixed
  • Status changed from assigned to closed

comment:16 Changed at 2008-02-12T18:37:07Z by zooko

There is one small problem with the current solution -- the names of the intermediate directories (which are the z-base-32 encodings of the first 14 bits of the storage index) are not prefixes of the z-base-32 encodings of the storage indexes themselves. This is because the z-base-32 encoding of the storage index encodes the first 15 bits into the first 3 chars. So, for example:

MAIN wonwin-mcbrootles-computer:~/.tahoe/storage/shares$ l ueq/*
drwx------    3 wonwinmc wonwinmc      102 Feb  1 16:30 ueq/ueqdjcok1pjs8fwitrcyoc73aw

One potential improvement would be to use base-62 encoding http://allmydata.org/source/z-base-62/? instead. Two chars of base-62 encodings offer 3844 possibilities, and the names of the intermediate directories could be just the first two chars of the base-62 encoding of the whole storage index.

comment:17 Changed at 2008-02-12T18:37:45Z by zooko

oops, in that example the leading 3 chars happened to be the same. (There's a 50% chance per storage index.)

comment:18 Changed at 2008-02-12T18:38:16Z by zooko

MAIN wonwin-mcbrootles-computer:~/.tahoe/storage/shares$ l aho/*
drwx------    3 wonwinmc wonwinmc      102 Feb  1 10:56 aho/ahtm1iurtjzqdgrtobeat5dpje

comment:19 Changed at 2008-02-12T18:38:42Z by zooko

  • Resolution fixed deleted
  • Status changed from closed to reopened
  • Summary changed from don't overfill your filesystem's directories to don't overfill your filesystem's directories -- and make intermediate dirs be leading prefixes of storage indexes?

comment:20 Changed at 2008-02-12T20:39:01Z by warner

It would be nice if we could predict the location of a given share. Either changing the encoding to optimally fill an ext3 32000-entry-limited directory with an integral number of baseN characters, or using multiple levels of directories. The latter would expose this unfortunate ext3 wart to less code, but would consume more overhead (the 4kB block per directory, even for mostly empty ones).

Could we do some math to make a guess as to how many storage indices get created and placed before we expect to reject a lease request because of the 32000-entry limit? With the current scheme that will occur when we hit 32000 entries in one of the ABC/ top-level directories. Using multiple directory levels would increase this number, of course, but it would be nice to have a rough idea of how many levels we'd need to store, say, 1M buckets, 1G buckets, 1T buckets.

I believe that Rob mentioned that the Yahoo storage folks determined empirically that 7 bits per directory level gave good performance with common filesystems (in terms of caching vs lookup speed). Maybe he could chime in with some more details.

comment:21 Changed at 2008-02-14T14:44:57Z by zooko

  • Resolution set to fixed
  • Status changed from reopened to closed

fixed by e400c5a8fb30ca57

Note: See TracTickets for help on using tickets.