#535 closed defect (fixed)

"recursion depth exceeded" failure during t=manifest of large directory tree

Reported by: warner
Owned by: warner
Priority: major
Milestone: 1.3.0
Component: unknown
Version: 1.2.0
Keywords:
Cc:
Launchpad Bug:

Description

While working on the new "tahoe manifest" CLI command, I noticed some stack-exhaustion exceptions that occurred as the code was traversing a large (but not huge) directory structure.

There were about 700 files, 60 directories, and the largest directory had about 150 children.

#237 has some other notes about how Deferreds can wind up using unusually deep stacks (and thus run into Python's default 1000-frame "recursion limit"). But it turns out that this problem was slightly different.

The key to tracking down these sorts of problems is to start by adding the following code to Deferred._runCallbacks:

--- twisted/internet/defer.py	(revision 24958)
+++ twisted/internet/defer.py	(working copy)
@@ -325,6 +325,12 @@
                 try:
                     self._runningCallbacks = True
                     try:
+                        if len(traceback.extract_stack()) > 900:
+                            print "running", len(traceback.extract_stack())
+                            traceback.print_stack()
+                            print "running", len(traceback.extract_stack())
+                            import os
+                            os.abort()
                         self.result = callback(self.result, *args, **kw)
                     finally:
                         self._runningCallbacks = False
@@ -337,6 +343,12 @@
                         # self.callbacks until it is empty, then return here,
                         # where there is no more work to be done, so this call
                         # will return as well.
+                        if len(traceback.extract_stack()) > 900:
+                            print "chaining", len(traceback.extract_stack())
+                            traceback.print_stack()
+                            print "chaining", len(traceback.extract_stack())
+                            import os
+                            os.abort()
                         self.pause()
                         self.result.addBoth(self._continue)
                         break

This will log the stack that leads up to the failure. In the two cases I wrote about in the #237 comment, the problem occurs entirely in Twisted code (even though the root cause was set up in application code).
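The same probe can be pulled out of the patch and used on its own. This is a hedged sketch in modern Python 3 syntax (the ticket's patch targets 2008-era Twisted, which was Python 2); `probe` is a name invented here, not part of Twisted:

```python
import os
import traceback

def probe(threshold=900):
    # Count the frames currently on the stack; Python's default recursion
    # limit is 1000, so anything near 900 is about to blow up.
    depth = len(traceback.extract_stack())
    if depth > threshold:
        # Dump the offending stack, then abort so a debugger/core dump
        # captures the state exactly where the depth was reached.
        print("depth", depth)
        traceback.print_stack()
        os.abort()
    return depth
```

Calling `probe()` inside a suspect callback (as the patch does inside `Deferred._runCallbacks`) prints the full stack leading up to the near-exhaustion, which is usually enough to spot the repeating frame cycle.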

In this case, the 5-frame recursion cycle ran through allmydata.util.limiter:43 Limiter.maybe_start_task() and limiter.py:48 _done. It turns out that a ready-to-fire Deferred (the result of a limiter.add() of an immediate function, in this case ManifestWalker.add_node) causes maybe_start_task to synchronously invoke self._done, which synchronously invokes maybe_start_task again. This cycle repeated for every pending task, adding 5 stack frames each time, until the stack was exhausted.
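The shape of the bug can be reproduced without Twisted at all. The following is a toy model (the names mirror allmydata.util.limiter but this is not the real class): when every queued task completes synchronously, `maybe_start_task` and `_done` call each other without ever returning, so the stack deepens by a couple of frames per pending task:

```python
import sys

def stack_depth():
    # Walk the frame chain to count how deep we currently are.
    depth, frame = 0, sys._getframe()
    while frame is not None:
        depth += 1
        frame = frame.f_back
    return depth

class ToyLimiter:
    def __init__(self, limit=10):
        self.limit = limit
        self.active = 0
        self.pending = []     # queued zero-argument tasks
        self.max_depth = 0    # deepest stack observed, for illustration

    def add(self, task):
        self.pending.append(task)
        self.maybe_start_task()

    def maybe_start_task(self):
        self.max_depth = max(self.max_depth, stack_depth())
        if self.active >= self.limit or not self.pending:
            return
        task = self.pending.pop(0)
        self.active += 1
        task()        # an "immediate" task finishes right here...
        self._done()  # ...so its done-callback fires synchronously

    def _done(self):
        self.active -= 1
        self.maybe_start_task()  # re-enters before any frame unwinds
```

Flooding the queue with immediate tasks makes the stack grow linearly with the queue length; with enough tasks the real code hit Python's 1000-frame limit:

```python
lim = ToyLimiter()
lim.pending = [lambda: None for _ in range(200)]
base = stack_depth()
lim.maybe_start_task()
# All tasks ran, but the stack got hundreds of frames deep along the way.
```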

The solution was to interrupt the cycle with an eventual-send:

--- src/allmydata/util/limiter.py
+++ src/allmydata/util/limiter.py
@@ -37,4 +37,4 @@
     def _done(self, res, done_d):
         self.active -= 1
         eventually(done_d.callback, res)
-        self.maybe_start_task()
+        eventually(self.maybe_start_task)
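The reason this works: `eventually` (foolscap's eventual-send) queues the call to run on a later reactor turn instead of invoking it on the current stack, so the mutual recursion becomes a flat loop. A minimal, self-contained model of that mechanism (toy names; the real `eventually` lives in foolscap and drives the Twisted reactor):

```python
import traceback
from collections import deque

_turns = deque()

def eventually(fn, *args):
    # Don't call fn now; queue it for a future "turn".
    _turns.append((fn, args))

def run_turns():
    # Drain the queue iteratively: every queued call starts from this
    # same shallow frame, so chains never deepen the stack.
    while _turns:
        fn, args = _turns.popleft()
        fn(*args)

depths = []
def chained(n):
    depths.append(len(traceback.extract_stack()))
    if n:
        eventually(chained, n - 1)  # instead of calling chained(n - 1) directly

eventually(chained, 500)
run_turns()
```

Every one of the 501 calls runs at the same stack depth, which is exactly the property the one-line limiter.py fix buys: each `maybe_start_task` now runs from a fresh turn rather than from inside the previous task's completion.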

Change History (4)

comment:1 Changed at 2008-11-15T05:06:37Z by warner

fc7cd23bd1906351 fixes this, but I'm trying to write a unit test to prove it. I'm having problems setting up a test environment that triggers the issue: my approach is to create a top-level directory that contains 20 subdirectories, each of which contains 100 LIT files, and then do a t=manifest of the whole thing. The 20 subdirs ought to fill the Limiter queue, and then the 2000 objects should hit ManifestWalker.add_node and finish synchronously. That ought to flood the Limiter's queue with ready-to-fire Deferreds.

But something's screwy in the admittedly weird "fake directory node" test harness I wrote, and it isn't working yet.

comment:2 Changed at 2009-02-04T00:10:22Z by warner

  • Owner changed from nobody to warner

comment:3 Changed at 2009-02-07T19:49:40Z by zooko

I guess this is fixed but Brian couldn't figure out how to write a unit test of it. Shall we close it as fixed?

comment:4 Changed at 2009-02-08T22:05:18Z by warner

  • Resolution set to fixed
  • Status changed from new to closed

yeah, I guess. I haven't seen the bug occur since I changed that code.
