#463 closed defect (fixed)

directory isn't rendered at all sometimes

Reported by: zooko Owned by: warner
Priority: major Milestone: 1.2.0
Component: code-mutable Version: 1.1.0
Keywords: Cc:
Launchpad Bug:

Description

Justin wasn't connected to the introducer or to any servers. When he looked at a directory, the boilerplate at the top rendered, but the directory contents never did; the page just waited indefinitely. Brian said he thinks that if there are no storage servers at all, then instead of giving an error about failing to download the SSK, it hangs.

Just now I saw the same thing. It looked like I *did* have many servers connected (on the Test Grid), but I wasn't sure whether the welcome page with the stats was stale: it might have been loaded earlier, when I was connected to a different wireless network. I reloaded the status page and, as far as I noticed, it showed the same status; then I reloaded the directory and it loaded normally.

Change History (7)

comment:1 Changed at 2008-06-20T17:02:57Z by zooko

This just happened to me again. Reloading the directory, even after the storage servers are connected, doesn't help: it still fails to render the directory contents in the same way. Restarting the Tahoe node and waiting until the servers are connected before loading the directory makes it load normally.

comment:2 Changed at 2008-07-02T23:08:12Z by zooko

This just happened again. Even though the node had been running for a long time and had many storage servers connected, the fact that I attempted to load the directory earlier, when too few servers were connected, appears to prevent it from ever loading until I restart my node. I guess this could have to do with our caching of the DirNode object.

comment:3 Changed at 2008-07-07T06:56:02Z by warner

Hm, we keep the dirnode object around, but we don't really cache the results of the read (each time you do dirnode.read(), it will contact all the servers again).

Is it fairly reproducible? I'll see if I can trigger it under closer observation, maybe by starting a node on my laptop with the network disconnected, trying (and failing) to read the directory, then connecting the network, letting the servers connect, and trying to read the directory again.

comment:4 Changed at 2008-07-07T07:09:16Z by warner

OK, so I am able to reproduce this locally. The second read fails because of our serialization strategy: the second read is not allowed to proceed until the first has finished, and the first one never finishes. Interrupting the GET doesn't cause the read to stop (it probably should, but the API doesn't lend itself to that).

I'll look more closely at what happens when there are no servers to ask; that case is probably not handled correctly.
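A minimal sketch of the hang described in this comment, assuming a per-object lock that admits one read at a time. `Serializer`, `read_with_no_servers`, `normal_read`, and `demo` are hypothetical names, not Tahoe code, and asyncio stands in for Twisted (which Tahoe is actually built on) only so the sketch runs without extra dependencies. The first read holds the lock while waiting on a result that never arrives, so a later read stalls in the queue even though it would succeed on its own.

```python
import asyncio


class Serializer:
    """Hypothetical helper: runs operations on one object strictly one at a time."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()

    async def run(self, operation):
        # Only one operation may hold the lock; later callers queue up behind it.
        async with self._lock:
            return await operation()


async def read_with_no_servers():
    # Models the buggy read: with no servers to ask, nothing ever completes
    # this future, so the read waits forever.
    await asyncio.get_running_loop().create_future()


async def normal_read():
    return "directory contents"


async def demo() -> str:
    serializer = Serializer()
    # First read starts while disconnected and never finishes.
    first = asyncio.create_task(serializer.run(read_with_no_servers))
    await asyncio.sleep(0)  # let the first read acquire the lock
    try:
        # Second read (servers now connected) is stuck behind the first one.
        await asyncio.wait_for(serializer.run(normal_read), timeout=0.1)
        result = "second read finished"
    except asyncio.TimeoutError:
        result = "second read blocked"
    first.cancel()  # clean up the stuck read so the event loop can exit
    return result


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Running it prints `second read blocked`; the 0.1-second timeout is arbitrary and exists only so the demo terminates. In this model, discarding the stuck first read is the only way to unblock later reads, which matches restarting the node being the only workaround.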

comment:5 Changed at 2008-07-07T07:20:37Z by warner

Yup, it was never entering the state machine, so the operation would just hang forever. Fixed (by 91c7e0f6897827fe). The new behavior is to emit a "no recoverable versions" error message, although if we aren't connected to *any* servers it might be more useful to say something like "I'm not connected to any servers".

comment:6 Changed at 2008-07-07T07:21:06Z by warner

  • Owner set to warner
  • Status changed from new to assigned

Leaving this open for a while longer, because it needs a unit test.

comment:7 Changed at 2008-07-07T19:23:17Z by warner

  • Milestone changed from undecided to 1.1.1
  • Resolution set to fixed
  • Status changed from assigned to closed

2074c92dd13abb23 adds the unit tests. They don't exercise exactly the situation Justin saw (a webapi GET of a dirnode while no servers are connected), but they should cover the same underlying problems.

I think it's safe to close this one now.
