#782 new defect

connection lost during "tahoe backup"

Reported by: zooko Owned by: andrej
Priority: major Milestone: undecided
Component: code-network Version: 1.5.0
Keywords: availability Cc: andrej@…
Launchpad Bug:

Description

Andrej Falout reported this to tahoe-dev.

Andrej: could you please look for "incident report files" that were created around the time of the problem, in your $TAHOEBASEDIR/logs/incidents directory? If there is an incident report file created at about the same time as the (first) failure you encountered, please attach it to this ticket. Thanks!
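One way to narrow down the candidates is to list the files in the incidents directory whose modification time falls near the failure. The following is only a sketch: the helper name `incidents_near` and the one-hour default window are assumptions made for illustration, and it relies on file modification times rather than on anything Tahoe itself records.

```python
import os


def incidents_near(base_dir, when, window_seconds=3600):
    """Return paths under base_dir/logs/incidents modified near `when`.

    `when` is a Unix timestamp for the (first) failure. The helper name
    and the default window are illustrative assumptions, not part of Tahoe.
    """
    incidents_dir = os.path.join(base_dir, "logs", "incidents")
    matches = []
    for name in sorted(os.listdir(incidents_dir)):
        path = os.path.join(incidents_dir, name)
        # Keep regular files whose mtime is within the window around `when`.
        if os.path.isfile(path) and abs(os.path.getmtime(path) - when) <= window_seconds:
            matches.append(path)
    return matches
```

Calling `incidents_near(os.environ["TAHOEBASEDIR"], failure_time)` would return the files worth attaching, assuming that environment variable is set to the node's base directory.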

Change History (5)

comment:1 Changed at 2009-10-25T18:19:02Z by zooko

andrej: the allmydata.com servers have occasionally been full and rejecting new uploads. This may have caused your problem. Did you look for incident report files? Does this problem still occur? Thanks.

comment:2 Changed at 2009-11-01T16:29:28Z by zooko

andrej sent me this note in private email:

"The issue is caused, in the large majority of cases, by Tahoe's poor resistance to concurrent traffic. Put simply, if I have a p2p client running with more than a few hundred open connections, Tahoe starts losing connections. When I stop p2p, Tahoe immediately starts working again.

Please note that this is not a bad-router kind of issue; I tested that extensively while debugging another issue. Nor is it a saturated connection: there is plenty of headroom left, and no other network app I use exhibits this kind of sensitivity. It simply looks like Tahoe wants a response NOW, and if it does not get it NOW, it just gives up.

I'd suspect that more tolerant timeouts plus connection retry handling would go a long way toward fixing this."
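For illustration, the kind of retry handling suggested here could look roughly like the sketch below. It is not Tahoe's code and does not reflect how Tahoe's network layer is structured; `retry_with_backoff` and all of its parameters are hypothetical names chosen for this example.

```python
import time


def retry_with_backoff(operation, attempts=5, first_delay=1.0, factor=2.0,
                       retry_on=(OSError,), sleep=time.sleep):
    """Call operation() until it succeeds or `attempts` tries are used up.

    Hypothetical helper, not part of Tahoe. The wait between tries starts
    at `first_delay` seconds and is multiplied by `factor` after each
    failure. ConnectionError is a subclass of OSError, so dropped
    connections are retried by default.
    """
    delay = first_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                # Out of tries: let the last error reach the caller.
                raise
            sleep(delay)
            delay *= factor
```

Growing the delay between tries gives a briefly congested link time to recover instead of failing on the first missed response.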

comment:3 Changed at 2009-11-04T00:11:24Z by afalout

In response to Zooko's comments:

"I don't see how your theory can fit with my mental model of the Tahoe-LAFS network code. Maybe if you turn on some extra logging and then stimulate it to fail and then post the logs then I can figure it out."

I can confirm without any uncertainty that running a P2P app with a large number of connections kills Tahoe. I even scripted this into my backup scripts, so that all P2P traffic is stopped while Tahoe is running. Light P2P (5 files/500 connections or so) is OK, but anything significantly over this is a killer.

Now, whether this means something can or even should be changed in Tahoe is another matter entirely.

I would argue that for an application that is supposed to transfer a large amount of data over a long period of time, the ability to recover from any sort of network interruption is paramount.

I would even go so far as not to allow Tahoe to quit for this reason at all, instead preferring it to retry the action indefinitely, until it either completes the requested operation or the user interrupts it.
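Until something like this exists inside Tahoe, the same behaviour can be approximated from the outside by a wrapper that reruns the command until it succeeds. This is a minimal sketch under two assumptions: that the command signals failure with a non-zero exit status, and that rerunning it after a failure is safe. `run_until_success` and its parameters are hypothetical names.

```python
import subprocess
import time


def run_until_success(command, pause_seconds=30.0,
                      run=subprocess.call, sleep=time.sleep):
    """Rerun `command` until it exits with status 0 or the user hits Ctrl-C.

    Hypothetical wrapper, not part of Tahoe. Returns the number of tries
    it took, or None if the user interrupted it.
    """
    tries = 0
    while True:
        tries += 1
        try:
            if run(command) == 0:
                return tries
            # Assumed: a non-zero status means the run failed and can be retried.
            sleep(pause_seconds)
        except KeyboardInterrupt:
            return None
```

A backup script might call `run_until_success(["tahoe", "backup", "/path/to/source", "tahoe:backups"])`, where the source path and the target are placeholders.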

comment:4 Changed at 2009-12-01T00:13:57Z by davidsarah

  • Component changed from unknown to code-network
  • Keywords reliability added

comment:5 Changed at 2009-12-04T04:29:49Z by davidsarah

  • Keywords availability added; reliability removed