#737 closed defect (fixed)

python2.5 setup.py test runs CPU to 100% on 32-bit single-core NetBSD "4"

Reported by: midnightmagic Owned by: warner
Priority: major Milestone: 1.5.0
Component: code Version: 1.4.1
Keywords: Cc: blackdew@…
Launchpad Bug:

Description

Running as in summary:

python2.5 setup.py test

... inside allmydata-tahoe-1.4.1

Eventually stops and runs the CPU up to 100% on a specific test. I am attaching a log of the session.

Details of the machine:

NetBSD quickie 4.99.7 NetBSD 4.99.7 (quickie) #0: Tue Jan 2 14:47:23 PST 2007 root@quickie:/v/src/sys/arch/i386/compile/quickie i386

AMD Athlon(tm) XP 2500+ (single-core, 32-bit) 2.5GB RAM

Python 2.5.2

This is a transitional pthread machine, partway between the M:N -> 1:1 threading model transition. The M:N threads *should* be functional and for all system and most application purposes they are. (KDE, etc.) However, on occasion some software makes assumptions or is built without threading support because configure detected anomalous behaviour.

Attachments (4)

file (6.3 KB) - added by midnightmagic at 2009-06-17T18:20:16Z.
bdew-test-random-unpached.txt (5.6 KB) - added by bdew at 2009-06-18T16:57:58Z.
trial with --random and --until-failure (got stuck too)
bdew-patched-twisted.txt (82.7 KB) - added by bdew at 2009-06-18T16:59:17Z.
A log of tests with twisted patched to show when it happens
twisted.patch (3.2 KB) - added by bdew at 2009-06-24T22:52:23Z.

Download all attachments as: .zip

Change History (28)

Changed at 2009-06-17T18:20:16Z by midnightmagic

comment:2 Changed at 2009-06-18T12:01:19Z by bdew

  • Cc blackdew@… added

I think i (patrially) figured this out...

Looks like there is a bug either somwhere in tahoe or foolscap that causes foolscap.eventual._SimpleCallQueue._turn to be scheduled with a delay of NaN.

From what i can see the methods in _SimpleCallQueue allways install it with reactor.callLater(0, self._turn) - so it is somwhere else.

Anyway, this triggers a bug in twisted that makes the delayed call stick in reactor._pendingTimedCalls forever and no later timed calls get processed. Also in this state reactor.timeout() returns 0 and the reactor starts iterating without delay causing 100% CPU load.

comment:3 Changed at 2009-06-18T15:29:23Z by zooko

  • Milestone changed from undecided to 1.5.0
  • Owner changed from somebody to warner

Nice work, bdew! It's interesting that this happens only in allmydata.test.test_cli.Check.test_check, and only sometimes. Examining that particular test case doesn't suggest anything special to me. bdew: could you run the tests with the --random and --until-failure options of trial?

Assigning this ticket to Brian -- it sounds like the foolscap eventual-send mechanism or Twisted might have a bug. Oh, that's a good question -- what versions of Twisted have we seen this on? bdew's buildbot has Twisted 8.2.0: http://allmydata.org/buildbot/builders/BlackDew%20debian-unstable-i386/builds/1/steps/tahoe-version/logs/stdio . midnightmagic's Twisted version isn't known. There were two other people who reported this bug to me, but they didn't mention Twisted versions.

Changed at 2009-06-18T16:57:58Z by bdew

trial with --random and --until-failure (got stuck too)

Changed at 2009-06-18T16:59:17Z by bdew

A log of tests with twisted patched to show when it happens

comment:4 Changed at 2009-06-18T17:02:14Z by bdew

I've attached the requested log. (this is before trying to catch it in twisted)

I've also patched my local twisted installation to detect the situation and fix it (bu reseting time to 0) and scream about it (look for NAN TIME DETECTED). As you can see from the second log it happens in other tests too.

comment:5 Changed at 2009-06-18T18:29:18Z by warner

yowza. Yeah, since foolscap's eventual() stuff always uses reactor.callLater(0), it must be in Twisted somewhere. Twisted will basically do (time.time()+delay), and we're passing in delay=0, so maybe time.time() is returning NaN ?

First step, confirm that you can reproduce this in Foolscap (without Tahoe). Download the latest foolscap-0.4.2 (from pypi.python.org ... I'm having Trac problems on http://foolscap.lothar.com/trac and I can't edit the front page to reflect the 0.4.2 release). Then do trial --until-failure foolscap.test.test_eventual, which will run just the eventual-send -using tests.

Assuming that fails in the same way, try this:

#!/usr/bin/python
import time
while True:
  t = time.time()
  assert t != float("NaN")

and see if that explodes. It's also likely that Twisted is subtracting a timestamp at some point (i.e. delay = next_event - time.time()) .. maybe subtraction of numbers that are too close to each other is creating a NaN?

This reminds me of a problem I once had on a MIPS box running a not-quite-ready version of linux (one of the old Cobalt Qubes). I rebooted it once and Mailman (of all things) threw an exception saying "Eek! PI is greater than four!". I tracked it down to a bug in the kernel: it wasn't saving the floating-point registers during a context switch, so from a single process's point of view, its fp registers were corrupted at random times. Mailman happens to import the "math" module (for its exponential backoff algorithm), which gets its value of Pi from a C extension module (for improved accuracy or something). To make sure that the extension module was compiled and loaded properly, there was an import-time assertion like 3 <= PI <= 4. And the floating-point register which held that value managed to be clobbered by a task-switch in the tiny interval between being loaded from memory and being tested by that assertion.

Sometimes the fringe is a dangerous place :)

comment:6 Changed at 2009-06-18T22:08:06Z by bdew

I've run the test for a while and it doesn't seem to trigger the bug.

Also if anyone else runs that, note that t != float("NaN") won't work because

>>> float("NaN")==float("NaN")
False

use repr(t) != 'nan'

PS: I've reported the bug on twisted tracker too http://twistedmatrix.com/trac/ticket/3884

comment:7 Changed at 2009-06-19T19:39:06Z by midnightmagic

Testing running for quite some time.. no error.

I'm afraid I'm not good enough with python to know how to run the foolscap test though. Can I request just a couple more pointers? Assume I'm very facile with command-line, and system-level stuff. (And C for that matter.) But just not the magic incantations to get Python to know where zope.interface is for importing foolscap, and how to actually run "trial --until-failure" and so on.

comment:8 follow-up: Changed at 2009-06-20T08:37:26Z by warner

righto. You had the Tahoe unit tests running, which means it was able to import or build Foolscap and everything that Foolscap depends upon, so things should probably be pretty easy.

First step, download the foolscap-0.4.2 tarball from http://foolscap.lothar.com/trac (the front page has a download link), or from http://pypi.python.org (search for "foolscap" and you'll get a page with download links).

Then unpack it and try running "make test". If that works, jump to the "trial" step below. (/usr/bin/trial is the program that runs the unit tests. it comes from Twisted. "make test" is a simple wrapper for invoking trial.)

If that can't find a "trial" executable, or if it complains about ImportErrors, then you may need to install one or more of the Foolscap dependencies. The dependency graph looks like:

  • Foolscap requires Twisted
    • Twisted requires zope.interface (note: zope.interface is a tiny tiny portion of the Zope package, less than 1% by weight. We don't need all of Zope. Don't let the enormous size of Zope scare you)
  • Foolscap requires pyopenssl
    • pyopenssl links against the OpenSSL libraries (.a/.so/.h)

Tahoe goes to great lengths to hide the dependencies from you, by downloading and installing everything that isn't already available on the system (and sometimes even things which *are* on the system, but that's a bug). It can only install python things, though.. it doesn't know how to build libopenssl.a .

If you were to install Tahoe, you'd get everything that Foolscap needs too (Twisted, zope.interface, pyopenssl). That might not include executable scripts like /usr/bin/trial though.

Foolscap does not behave this way: you have to satisfy its dependencies yourself.

So, if there's no "trial" on your $PATH, then download the Twisted tarball from http://twistedmatrix.com , unpack it, and do "sudo python setup.py install" (if you're willing to install stuff to /usr/lib and /usr/bin .. I prefer packages or GNU stow, so I usually do "sudo python setup.py install --prefix /usr/local/stow/XYZ" and then "cd /usr/local/stow && stow -v XYZ" to put it into place in /usr/local/bin and /usr/local/lib). I believe that Twisted includes a copy of zope.interface and will install it if necessary.. no need to download something extra.

After that, "trial --version" should work, so try foolscap's "make test" again.

Foolscap will skip the crypto tests if pyopenssl is not installed, but for this bug, we don't care about those.

Then running "trial --until-failure foolscap.test.test_eventual" in the foolscap source directory will tell it to run that one specific test_eventual over and over again until something fails.

hope that helps.. grab me on IRC if you want to walk through this together.

comment:9 Changed at 2009-06-21T15:00:51Z by bdew

Ha! it looks like time() is returning NaN on my system once in a while after all.

I went and added a bunch of assert's to make find where those NaN's are comming from and narrowed it to reactor.callLater - the input vas valid (there is actually a check there already), but the resulting DelayedCall? was created with a NaN timeout.

So with another assert there

assert repr(self.seconds())!='nan', 'self.seconds() [really: %r] returned NaN! %r,%r,%r'%(self.seconds,self.seconds(),self.seconds(),self.seconds())

I got:

Traceback (most recent call last):
Failure: twisted.internet.defer.FirstError: FirstError[#8, [Failure instance: Traceback: <type 'exceptions.AssertionError'>: self.seconds() [really: <built-in function time>] returned NaN! 1245566718.5629139,1245566718.562958,1245566718.563
/usr/src/tahoe/build2/virt/lib/python2.5/site-packages/Twisted-8.2.0-py2.5-linux-i686.egg/twisted/internet/base.py:779:runUntilCurrent
/usr/src/tahoe/build2/virt/lib/python2.5/site-packages/foolscap-0.4.2-py2.5.egg/foolscap/eventual.py:26:_turn
/usr/src/tahoe/build2/Twisted-8.2.0-py2.5-linux-i686.egg/twisted/internet/defer.py:243:callback
/usr/src/tahoe/build2/Twisted-8.2.0-py2.5-linux-i686.egg/twisted/internet/defer.py:312:_startRunCallbacks
--- <exception caught here> ---
/usr/src/tahoe/build2/Twisted-8.2.0-py2.5-linux-i686.egg/twisted/internet/defer.py:328:_runCallbacks
/usr/src/tahoe/build2/virt/lib/python2.5/site-packages/allmydata_tahoe-1.4.1_r3918-py2.5.egg/allmydata/mutable/checker.py:98:_got_answer
/usr/src/tahoe/build2/virt/lib/python2.5/site-packages/allmydata_tahoe-1.4.1_r3918-py2.5.egg/allmydata/mutable/checker.py:146:notify_server_corruption
/usr/src/tahoe/build2/virt/lib/python2.5/site-packages/allmydata_tahoe-1.4.1_r3918-py2.5.egg/allmydata/test/no_network.py:44:callRemoteOnly
/usr/src/tahoe/build2/virt/lib/python2.5/site-packages/allmydata_tahoe-1.4.1_r3918-py2.5.egg/allmydata/test/no_network.py:63:callRemote
/usr/src/tahoe/build2/virt/lib/python2.5/site-packages/foolscap-0.4.2-py2.5.egg/foolscap/eventual.py:70:fireEventually
/usr/src/tahoe/build2/virt/lib/python2.5/site-packages/foolscap-0.4.2-py2.5.egg/foolscap/eventual.py:61:eventually
/usr/src/tahoe/build2/virt/lib/python2.5/site-packages/foolscap-0.4.2-py2.5.egg/foolscap/eventual.py:16:append
/usr/src/tahoe/build2/virt/lib/python2.5/site-packages/Twisted-8.2.0-py2.5-linux-i686.egg/twisted/internet/base.py:669:callLater
]]

Note that i added 3 more calls to reactor.seconds() (which is really time.time) and they returned valid results.

So it looks that on this system something (happening either before this code, or in parallel in another thread) is breaking exactly one call to time()

comment:10 Changed at 2009-06-21T20:55:55Z by warner

Wow, fun! A quick look at the python-2.6 source (Modules/timemodule.c:floattime) doesn't suggest any obvious way to get a NaN.. it calls the C gettimeofday/ftime/time (depending upon what your platform has), adds the pieces together, and returns the result.

You said that a simple test case that just calls time.time() repeatedly didn't ever fail, right? That's unfortunate.. if we didn't think Tahoe was involved then I'd suggest instrumenting timemodule.c to remember the pieces it got, build the PyFloat, then if it's NaN immediately print out the pieces, so we could figure out what gettimeofday() returned that provoked a NaN.

If there were a low-level threading bug that was clobbering memory, I'd expect to see exceptions or deeper errors than just a NaN. If time() couldn't allocate the memory for the PyFloat object, it would raise an exception instead of returning NaN.

Hm, it's worth noting that floats are formatted to strings (in Objects/floatobject.c:format_float) by doing snprintf(), so if your platform's libc does something funky with snprintf(), that might cause problems. Also, Python doesn't appear to do anything to define or test NaN directly: it just tells C to do a+b or a>=b or whatever. So something weird in your C compiler's implementation of floating-point math (or your CPU) could get involved too.

If you get into this, you might try:

  • modify python's timemodule.c to store the values retrieved from gettimeofday() in a file-global variable, just before it adds them together to create the return value for floatseconds()/time()
  • add a function to timemodule.c which retrieves these stored values with as little interpretation as possible (maybe memcopy them into a string in addition to interpreting them as floats)
  • in your catch-NaN-in-reactor.callLater assertion, retrieve and print these values

If we catch gettimeofday() returning something insane, it's either the kernel or some weird memory corruption that's just not causing anything else to catch fire. If gettimeofday() is behaving, then we should suspect the floatseconds() math or the floating point operations done afterward.

Another idea is to add code to floatseconds() that stringifies the float and compares it against NaN right away. Then run everything under gdb and put a breakpoint on the 'if' side of that comparison, then start using the tahoe node until it fails in this way. Then look at the local variables in the debugger and see if anything looks suspicious.

boy, you know how to find the fun bugs, don't you? :-)

comment:11 Changed at 2009-06-21T20:58:01Z by warner

oh, also, in the python assertion, use tmp=self.seconds() and do your checks on tmp. If things are this weird, we should suspect the stringification routines too, so print tmp several times and make sure it holds still. (i.e. if there were a bug in float-to-string, could that manifest as the same symptoms?)

comment:12 in reply to: ↑ 8 Changed at 2009-06-22T23:31:33Z by midnightmagic

Okie doke. Had a chance to run things on one of the test-failing machines:

=============================================================================== [SKIPPED]: foolscap.test.test_banana.ThereAndBackAgain?.test_boundMethod_newstyle

new-style classes still broken =============================================================================== [SKIPPED]: foolscap.test.test_banana.ThereAndBackAgain?.test_classMethod_newstyle

new-style classes still broken =============================================================================== [SKIPPED]: foolscap.test.test_banana.ThereAndBackAgain?.test_class_newstyle

new-style classes still broken =============================================================================== [SKIPPED]: foolscap.test.test_banana.ThereAndBackAgain?.test_instance_newstyle

new-style classes still broken


Ran 460 tests in 168.897s

PASSED (skips=4, successes=456)

After that, "trial --version" should work, so try foolscap's "make test" again.

So, done this.

Foolscap will skip the crypto tests if pyopenssl is not installed, but for this bug, we don't care about those.

I had pyopenssl installed and it succeeded.

Then running "trial --until-failure foolscap.test.test_eventual" in the foolscap source directory will tell it to run that one specific test_eventual over and over again until something fails.

Running this now and can run it overnight. :-) I've nice'd it right up to 20 so I can keep using the machine otherwise. I will report on the results.

comment:13 Changed at 2009-06-23T17:21:32Z by midnightmagic

I ran the tests overnight and they were still running when I arrived in the morning. The final output was:


Ran 3 tests in 0.024s

PASSED (successes=3) Test Pass 4390063 foolscap.test.test_eventual C TestEventual?

testFire ... [OK] testFlush ... [OK] testSend ... [OK]

C------------------------------------------------------------------------------ Ran 3 tests in 0.024s

PASSED (successes=3) Test Pass 4390064 foolscap.test.test_eventual

TestEventual?

testFire ... [OK]

C testFlush ... [ERROR ]

testSend ... C[OK]

=============================================================================== [ERROR]: foolscap.test.test_eventual.TestEventual?.testFlush

Traceback (most recent call last):

File "usr/pkg/lib/python2.5/site-packages/twisted/internet/base.py", line 677, in runUntilCurrent

File "usr/pkg/lib/python2.5/site-packages/twisted/internet/base.py", line 495, in stop

twisted.internet.error.ReactorNotRunning?: Can't stop reactor that isn't running.


Ran 3 tests in 0.048s

CFAILED (errors=1, successes=2) C

comment:14 follow-up: Changed at 2009-06-24T22:19:01Z by zooko

Hm, it looks like the buildbot that mm set up just locked up in allmydata.test.test_cli.Check.test_check ...

http://allmydata.org/buildbot/builders/MM%20netbsd4%20i386%20warp/builds/6/steps/test/logs/stdio

There's no indication from the build logs whether it was at max cpu usage, of course.

It's strange that mm running it in a shell should behave differently than the buildslave running it.

mm: could you try the same command-line and environment that the buildslave did? It is all listed at the top of that web page. The command was python2.5 setup.py test --reporter=timing and among the environment variables was PYTHONPATH=/home/tahoe/lib/python.

comment:15 Changed at 2009-06-24T22:22:02Z by zooko

By the way now that we have both Black Dew's Debian builder and mm's NetBSD builder hooked up, I see that it was also allmydata.test.test_cli.Check.test_check that locked up on Black Dew's:

http://allmydata.org/buildbot/builders/BlackDew%20debian-unstable-i386/builds/1/steps/test/logs/stdio

That's the one that Black Dew tracked down to the Python time.time() call returning NaN instead of seconds-since-epoch. Black Dew tracked that down by adding debugging instrumentation in Python and re-running. Brian suggested (in this ticket) further debuggery. Perhaps you two guys could supply mm with a patch that he could apply to determine if his lock-up is of the same source?

Changed at 2009-06-24T22:52:23Z by bdew

comment:16 Changed at 2009-06-24T22:55:47Z by bdew

I've attached the patch (for twisted.internet.base) - there are a bunch of asserts that will try to catch where the NaN comes from, and if that fails will yell about it when it reaches reactors loop.

On my buildbot only the last piece is used (not the asserts).

comment:17 in reply to: ↑ 14 Changed at 2009-06-25T00:17:57Z by midnightmagic

Replying to zooko:

It's strange that mm running it in a shell should behave differently than the buildslave running it. mm: could you try the same command-line and environment that the buildslave did? It is all listed at the top of that web page. The command was

{{{python2.5 setup.py test -- reporter=timing }}} and among the environment variables was PYTHONPATH=/home/tahoe/lib/python . One difference could be the system-installed twisted python package I had to install to get the buildbot working. Note the: exceptions.IOError: [Errno 13] Permission denied: '/usr/pkg/lib/python2.5/site-packages/twisted/plugins/dropin.cache.new' ... that's kind of weird. I'd remove twisted from the system location and let tahoe/buildslave do the work themselves if there was a sneaky way of installing twisted so it was buildslave-specific, and build jobs didn't inherit it.

comment:18 Changed at 2009-06-25T00:37:37Z by warner

eek! the text is getting smaller and smaller!

don't worry about the dropin.cache.new problem, it's a know Twisted issue that's harmless. If you really want to, running "trial --help" as root should make it go away, by giving trial the ability to write to the cache file that it so desperately wants to see.

comment:19 Changed at 2009-06-27T07:36:05Z by bdew

Hmm now it segfaulted on the first problematic test:

http://allmydata.org/buildbot/builders/BlackDew%20debian-unstable-i386/builds/21/steps/test/logs/stdio

Sadly i didn't get a dump and can't reproduce the crash.

comment:20 Changed at 2009-07-02T20:48:34Z by zooko

See also #738, which may be due to the same cause as this issue.

comment:21 Changed at 2009-07-03T23:06:40Z by midnightmagic

Update: since the fix mentioned in #738 (disabling crypto++ assembly optimizations) was implemented in my local copy here, python2.5 setup.py test NOW COMPLETES THROUGH THE SPOT WHERE IT FROZE BEFORE!!

So this specific issue I think, if those other ones can be closed, can be closed also once we verify that the same fix might work for bdew.

comment:22 Changed at 2009-07-06T23:15:39Z by midnightmagic

Using trunk from pycryptopp and trunk tahoe (via PYTHONPATH and deleting tahoe/support/lib/site-packages/pycrypto*) the test no longer fails and pushes through the spot where it failed before.

.. with full crypto optimizations.

Other things fail, but.. looks like it's just because it's trunk and certainly not the same problem this ticket references.

comment:23 Changed at 2009-07-06T23:42:59Z by bdew

Same here, with fixed pycryptopp all the tests pass without hangs/rashes:

http://allmydata.org/buildbot/builders/BlackDew%20debian-unstable-i386/builds/40/steps/test/logs/stdio

comment:24 Changed at 2009-07-07T19:09:19Z by zooko

  • Resolution set to fixed
  • Status changed from new to closed

Fixed by 9578e70161009035 which increases the requirement on pycryptopp to >= 0.5.15. Note that if you are building pycryptopp against an external libcryptopp, however, then you may still have this bug if your libcryptopp has it.

Note: See TracTickets for help on using tickets.