[tahoe-dev] 1,000,000+ files?

Chris Goffinet cg at chrisgoffinet.com
Tue May 6 19:34:08 PDT 2008

>> Right now the mapping is done in a database, then we have a high
>> performance caching layer that uses a distributed hash table for fast
>> lookups based on URL (pretty url -> Tahoe URI).
> That sounds ideal. Tahoe dirnodes are a convenience, so that people can use
> it like a traditional filesystem without being forced to maintain their own
> tables like your database. If some alternative table suits your needs better,
> then by all means bypass tahoe's native dirnodes in favor of that table.
> (Out of curiosity, what counts as a "pretty URL" in your scheme? We've been
> thinking about ways to make tahoe URIs smaller, but we've gone back and
> forth about how small we should try to make them (and on how many bits of
> unguessability we need to maintain). I'd be interested to know about where
> in the design space you've landed.)

Right now there is a 1024-character limit, purely for storage reasons.
The naming mimics how you would name your files on, say, Amazon S3 or
any other storage system (example: /<client>/<dir>/<filename>). We
haven't seen any issues with how you guys named the files. I personally
like the fact that it's a bit long, if only because of the
unguessability factor. I wouldn't even try anything above 2,083
characters, though, because of Internet Explorer's URL length
restriction.
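
To make the naming rule concrete, here is a minimal sketch in Python. The
function name and the error handling are made up for illustration; only
the 1024-character limit and the /<client>/<dir>/<filename> layout come
from the description above.

```python
MAX_PRETTY_PATH = 1024  # storage limit described above


def build_pretty_path(client: str, directory: str, filename: str) -> str:
    """Join the parts into /<client>/<dir>/<filename> and enforce the limit.

    Hypothetical helper: the name and the exceptions are illustrative only.
    """
    parts = [client, directory.strip("/"), filename]
    if any(not part for part in parts):
        raise ValueError("client, dir and filename must all be non-empty")
    path = "/" + "/".join(parts)
    if len(path) > MAX_PRETTY_PATH:
        raise ValueError(
            f"path is {len(path)} characters, limit is {MAX_PRETTY_PATH}"
        )
    return path
```

For example, build_pretty_path("acme", "photos/2008", "cat.jpg") gives
"/acme/photos/2008/cat.jpg", and anything longer than the limit is
rejected before it reaches the mapping table.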

I've been going back and forth about building a translation plugin
directly into Tahoe, because I can see the need for a client to store a
mapping from a nice URL to a mutable file that could change in the FS.
I was thinking about using memcachedb: it has a persistent store and is
very fast (I saw a benchmark of a Dell 2950 doing 66,000+ reads/s and
22,000 writes/s). That seems efficient enough for Tahoe for now, since
you can add more nodes to scale out. The plugin could link directly
into the TwistedWeb work you have for /uri/. But again, it's just a
thought. I still haven't decided whether to implement this in the DFS
or to abstract it outside, directly in the reverse proxy layer that
would be handling a high volume of reads.
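
Roughly, the translation layer I have in mind looks like the sketch
below. Everything in it is an assumption for illustration: the class and
method names are invented, and two plain dicts stand in for the
persistent store (memcachedb or a database) and the fast read cache. It
does not use any Tahoe or memcachedb API.

```python
from typing import Dict, Optional


class PrettyUrlMap:
    """Maps pretty URLs to Tahoe URIs: authoritative store plus read cache."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}  # stand-in for the persistent store
        self._cache: Dict[str, str] = {}  # stand-in for the fast lookup cache

    def put(self, pretty_url: str, tahoe_uri: str) -> Optional[str]:
        """Point pretty_url at tahoe_uri; return the URI it replaced, if any."""
        previous = self._store.get(pretty_url)
        self._store[pretty_url] = tahoe_uri
        # Invalidate the cached entry so readers see the new URI.
        self._cache.pop(pretty_url, None)
        return previous if previous != tahoe_uri else None

    def resolve(self, pretty_url: str) -> Optional[str]:
        """Return the Tahoe URI for pretty_url, or None if it is unmapped."""
        uri = self._cache.get(pretty_url)
        if uri is None:
            uri = self._store.get(pretty_url)
            if uri is not None:
                self._cache[pretty_url] = uri  # fill the cache on a miss
        return uri
```

The same shape works whether it lives inside the DFS or in the reverse
proxy: reads hit the cache first, and a write returns the URI it
displaced so the caller knows which old file is no longer referenced.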

> The only thing you'll need to be aware of is accounting / garbage collection
> (ticket #119). We haven't implemented it yet, but eventually each share will
> need to have a lease on it that identifies an account of some sort, and
> clients will be responsible for cancelling leases on shares that they are no
> longer using. There are a couple of different ways we might approach this,
> but most of them require the client to be able to enumerate all of the files
> that they wish to keep alive (and either periodically renew a lease on each
> one, or keep track of the deltas and cancel a lease on everything that's been
> deleted).

Does that basically mean we need to keep track of the URIs that get
created for the mutable files over time, so that we can periodically
go through the old ones and have Tahoe remove them from disk?
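
In other words, something like the bookkeeping below on our side. This
is only a sketch of how I read the "keep track of the deltas" option:
the class and method names are invented, and since leases aren't
implemented yet (ticket #119) it makes no Tahoe calls at all. It just
produces the set of URIs to keep alive and the set that could be
cancelled.

```python
from typing import Dict, Set


class UriLedger:
    """Tracks which URIs are still referenced and which have been dropped."""

    def __init__(self) -> None:
        self._live: Dict[str, str] = {}  # pretty URL -> current URI
        self._retired: Set[str] = set()  # URIs nothing points at any more

    def record(self, pretty_url: str, uri: str) -> None:
        """Note that pretty_url now points at uri."""
        old = self._live.get(pretty_url)
        self._live[pretty_url] = uri
        self._retired.discard(uri)  # a URI in use again is not retired
        if old is not None and old != uri and old not in self._live.values():
            self._retired.add(old)

    def delete(self, pretty_url: str) -> None:
        """Note that pretty_url was removed from the mapping."""
        old = self._live.pop(pretty_url, None)
        if old is not None and old not in self._live.values():
            self._retired.add(old)

    def live_uris(self) -> Set[str]:
        """Everything that would need its lease renewed."""
        return set(self._live.values())

    def drain_retired(self) -> Set[str]:
        """Return the URIs whose leases could be cancelled, then forget them."""
        retired, self._retired = self._retired, set()
        return retired
```

A periodic job would either renew whatever live_uris() returns, or take
drain_retired() and cancel those, depending on which approach the lease
design ends up with.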
