[tahoe-dev] Tahoe as glue filesystem

Duncan McGreggor oubiwann at divmod.com
Mon Jul 7 01:55:52 PDT 2008


On Jul 3, 2008, at 2:23 PM, zooko wrote:

> Dear Valentino Volonghi:
>
> I'm sorry that it has taken so long for me to reply to your inquiry.
> The problem was that I am not confident that I understand what you
> want to do, nor whether Tahoe would be a good tool for you to do it.
>
> Here I will do my best to answer your questions.
>
>
> On May 19, 2008, at 3:40 PM, Valentino Volonghi wrote:
>
>> Hi all, I'm using Twisted Matrix to write a process pool (Ampoule on
>> launchpad). One of the things I'd like to achieve is the possibility
>> to run processes on remote machines and of course here comes the
>> problem of letting them talk to each other.
>
>> a) does tahoe make any sense for this kind of usage? Seeing that it
>> has very strong read performance, my impression is that it does,
>> though it might not be suited to this if subprocesses write too much
>> to it.
>
> The best way to find out if this is a good fit is to try some
> experiments.  Here are some general measurements of latency and
> throughput for Tahoe reads and writes on LAN and on DSL:
>
> http://allmydata.org/trac/tahoe/wiki/Performance
>
> However, instead of extrapolating from these measurements to what the
> performance impact would be in your system, you should try some reads
> and writes of the kind that your system will need and measure those.
> Fortunately, it is easy to build Tahoe and easy to script reads and
> writes (either in Python or with RESTful HTTP calls), and Tahoe
> automatically produces detailed performance measurements for you on
> each read or write.
>
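
A minimal sketch of such a scripted measurement over HTTP, in Python.
The URL paths are assumptions (an unlinked upload via PUT <base>/uri
that returns the file's cap, and a download via GET <base>/uri/<cap>),
as are the helper names; check the web API documentation for the Tahoe
version in use before relying on them.

```python
import time
import urllib.parse
import urllib.request


def timed(fn, nbytes):
    """Run fn() once and return (elapsed_seconds, bytes_per_second)."""
    start = time.perf_counter()
    fn()
    elapsed = time.perf_counter() - start
    rate = nbytes / elapsed if elapsed > 0 else float("inf")
    return elapsed, rate


def put_file(base_url, data):
    # Assumption: the node's web gateway accepts PUT <base_url>/uri and
    # answers with the new file's cap in the response body.
    req = urllib.request.Request(base_url + "/uri", data=data, method="PUT")
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("ascii").strip()


def get_file(base_url, cap):
    # Assumption: GET <base_url>/uri/<cap> returns the file contents.
    url = base_url + "/uri/" + urllib.parse.quote(cap, safe="")
    with urllib.request.urlopen(url) as resp:
        return resp.read()


if __name__ == "__main__":
    base = "http://127.0.0.1:3456"  # hypothetical local gateway address
    payload = b"x" * (1024 * 1024)  # use data shaped like your real workload
    caps = []
    secs, bps = timed(lambda: caps.append(put_file(base, payload)), len(payload))
    print(f"write: {secs:.2f}s ({bps / 1000:.1f} kBps)")
    secs, bps = timed(lambda: get_file(base, caps[0]), len(payload))
    print(f"read:  {secs:.2f}s ({bps / 1000:.1f} kBps)")
```

Repeating the run with payload sizes and read/write ratios that match
the real workload gives numbers that are more useful than the generic
ones on the wiki page.
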
> For example, I just uploaded a 41 MB mp3 file (econtalk.org podcast --
> Robin Hanson on signalling [1]) to the Tahoe Test Grid [2] from my
> Macbook Pro over my home DSL line, and got these performance
> measurements:
>
>  * File Size: 40980688 bytes
>  * Total: 1456.18s (28.1kBps)
>     o Storage Index: 684ms (59.86MBps)
>     o Peer Selection: 1.42s
>     o Encode And Push: 1454.08s (28.3kBps)
>        + Cumulative Encoding: 5.19s (7.90MBps)
>        + Cumulative Pushing: 1441.24s (28.4kBps)
>        + Send Hashes And Close: 7.38s
>
> So it was "encoding" (which includes both encryption and erasure
> coding) at a rate of almost 8 MBps and "pushing" (uploading) at a rate
> of about 28 kBps.
>
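
The rates in the report above can be re-derived from the file size and
the elapsed times. The sketch below does that arithmetic; it assumes
decimal units (1 kBps = 1000 bytes/s, 1 MBps = 10**6 bytes/s), which is
what reproduces the reported figures. The function and variable names
are illustrative only.

```python
FILE_SIZE = 40980688  # bytes, copied from the report above

# Elapsed seconds per phase, copied from the report above.
PHASES = {
    "total": 1456.18,
    "cumulative_encoding": 5.19,
    "cumulative_pushing": 1441.24,
}


def rate(seconds, size=FILE_SIZE):
    """Average throughput in bytes per second."""
    return size / seconds


def share_of_total(phase):
    """Fraction of the total elapsed time spent in the given phase."""
    return PHASES[phase] / PHASES["total"]


if __name__ == "__main__":
    for name, secs in PHASES.items():
        print(f"{name}: {rate(secs) / 1000:.1f} kBps, "
              f"{share_of_total(name):.1%} of total time")
```

By this arithmetic, pushing accounts for roughly 99% of the elapsed
time and encoding for well under 1%, so in this run the upload link was
the bottleneck.
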
>
>> b) is it possible to disable parts of the protocol for speed
>> reasons?  I'm sure that all or most of the features inside tahoe are
>> extremely useful when storing sensitive data or third party data. But
>> this is not true for subprocesses.  Essentially something very
>> transparent and that could be inspected would be great (so basically
>> limiting capabilities and encryption a bit or any other feature for
>> different cases).
>
> There is no option to disable encryption. You could always hack the
> software to add one, but if you did I'll bet you couldn't measure any
> improvement in performance, because the encryption is already quite
> efficient and imposes a very low overhead.
>
> If your measurements show that the encryption layer's overhead *is*
> significant for your intended use cases, then I would like to hear
> about it, because I plan to further optimize that layer at some
> point...
>
>
>> c) would it be a bad idea to add a 'temporary storage' in tahoe that
>> would simply keep data sent in memory (and distribute it between all
>> introducer's clients) in order to be able to access it from inside
>> this cluster of servers?
>
> That's an interesting idea, but you probably have 1/1000th as much RAM
> as hard drive.  (Unless you are thinking of letting the operating
> system swap the storage to disk in virtual memory, which is another
> interesting idea.)


Valentino,

Man, it was great to hear you ask this question. The possibility of
this for use in small, distributed, diskless devices is what most
recently prompted me to take another look at tahoe. Of course I'm
staying for the party anyway, but I am *deeply* interested in a
memory-only option for tahoe.

As a follow-up question for Zooko (and I've done no background reading
on this): is there any data replication in tahoe? If a host goes down
and doesn't come back up, are file parts redistributed to the
surviving nodes?

d

