#31 closed defect (fixed)

crash on cygwin: doWrite on a Port, failure during test_system

Reported by: zooko Owned by: zooko
Priority: minor Milestone:
Component: code Version:
Keywords: Cc:
Launchpad Bug:

Description

allmydata.test.test_system.SystemTest.test_upload_and_download will run for kilosecs, and sometimes segfault. Also my client won't connect to the introducer.

So something is deeply wrong in cygwin land...

Attachments (1)

cygwin-poll.diff (1.0 KB) - added by warner at 2007-07-25T01:16:38Z.
patch for python-2.5.1's Modules/selectmodule.c to work around cygwin bug


Change History (10)

comment:1 Changed at 2007-05-11T18:38:02Z by zooko

  • Priority changed from major to minor
  • Status changed from new to assigned

Ah, I also have Windows native python 2.4.3 installed on this machine which cannot be uninstalled, nor can a new Windows native python package be installed. (In both cases I get installer error 2003.)

So this bug is probably specific to my machine...

comment:2 Changed at 2007-05-11T18:39:40Z by zooko

Actually the error code is 2203, at least on uninstall.

comment:3 Changed at 2007-05-11T22:35:47Z by zooko

Fixed my install of Windows. (learned about cacls.exe.)

comment:4 Changed at 2007-05-18T16:48:44Z by zooko

So, I just need to boot up my vmware machine that has cygwin and try the unit tests on it now that I've fixed it and then I can close this ticket as invalid...

comment:5 Changed at 2007-06-06T05:59:06Z by zooko

Hm, I finally rebooted (so that I could run vmware again) and tried again, but again there is badness. The first time I ran the test it passed in 10s. The second time I ran it it hung. :-(

comment:6 Changed at 2007-06-06T16:37:54Z by warner

I've seen an intermittent failure on the cygwin buildslave. The investigation I've done so far suggests that one of the Services (the webserver) is being shut down too early. No idea why... maybe a socket error as it tries to accept() a connection?

comment:7 Changed at 2007-07-24T23:05:34Z by warner

Just an update here: we've tracked this down to a bug in the poll() support in Python's select module, which appears to be triggered only under cygwin. Modules/selectmodule.c:poll_poll() has code to skip over file descriptors that have not fired (those with a zero .revents field), but if this code is used, the list that poll() returns will be filled with random garbage.

When Twisted sees the junk file descriptors, it usually ignores them, because it knows that a fileno of two kabillion is not valid. But every once in a while that random garbage looks like a real descriptor. In the case of the test_system failure, it looks like the fileno of one of the listening sockets, and the random junk in the revents field looks like select.POLLOUT, so the reactor tries to do a doWrite to the listening socket, which throws an exception because listening sockets are never writable.
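To make the failure mode above concrete, here is a minimal sketch of a poll-style dispatch loop. It is not Twisted's actual pollreactor code: `dispatch_events`, `FakeListeningPort`, the fd numbers, and the locally defined `POLLIN`/`POLLOUT` values are all assumptions chosen for illustration.

```python
# Illustrative model only: not Twisted's actual pollreactor code.
POLLIN = 0x001   # bit values as defined on Linux; assumed for illustration
POLLOUT = 0x004


class FakeListeningPort:
    """Hypothetical stand-in for the handler of a listening socket."""

    def doRead(self):
        return "accepted a connection"

    def doWrite(self):
        # A listening socket is never writable, so this should be unreachable.
        raise RuntimeError("doWrite called on a listening Port")


def dispatch_events(selectables, events):
    """Dispatch (fd, revents) pairs; return the fds that were ignored."""
    ignored = []
    for fd, revents in events:
        handler = selectables.get(fd)
        if handler is None:
            ignored.append(fd)   # junk such as fd 2000000000 ends up here
            continue
        if revents & POLLIN:
            handler.doRead()
        if revents & POLLOUT:
            handler.doWrite()
    return ignored


selectables = {7: FakeListeningPort()}

# Garbage that names no registered fd is dropped harmlessly.
print(dispatch_events(selectables, [(2000000000, 0x1234)]))

# Garbage that happens to match a real fd with the POLLOUT bit set is not.
try:
    dispatch_events(selectables, [(7, POLLOUT)])
except RuntimeError as exc:
    print("error:", exc)
```

The lookup by fd is what makes most garbage harmless; it offers no protection once the garbage happens to equal a registered fileno.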

I'm working on a patch now, trying to make it pass the unit tests.

Changed at 2007-07-25T01:16:38Z by warner

patch for python-2.5.1's Modules/selectmodule.c to work around cygwin bug

comment:8 Changed at 2007-07-25T01:25:50Z by warner

  • Resolution set to fixed
  • Status changed from assigned to closed
  • Summary changed from crash on cygwin to crash on cygwin: doWrite on a Port, failure during test_system

More data: cygwin's poll() function sometimes violates POSIX and returns an overly-large fd count. The return value is supposed to be equal to the number of pollfd structures that have non-zero .revents fields, but sometimes cygwin's count is too high.

This causes python's selectmodule.c/poll_poll() to overrun the pollfd array, and copy random data into the python list that it returns to the python-side caller of select.poll().poll(). When Twisted's pollreactor sees this, most of the bogus fds are invalid and ignored, but sometimes one of them looks like a real fd and causes the corresponding doWrite or doRead method to be invoked. Usually these are harmless too, but in our case the random fds finally overlapped with a real listening socket (probably because that fd's fileno was sitting in nearby memory), and we hit the exception.

attachment:ticket:31:cygwin-poll.diff is a patch for Python-2.5.1 to work around the issue, by ignoring the return value from poll() and counting the active fds manually. I've applied this patch and rebuilt the select.dll module on our cygwin buildslave, and now the buildbot is green.
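The overrun and the workaround can be modelled in a few lines of Python. This is only a sketch of the idea: the real code is C in Modules/selectmodule.c, the attached patch is not reproduced here, and the function names, fd numbers, and "garbage" values are made up for illustration.

```python
# Simplified Python model of the result-collecting loop; the real code is C.

def collect_trusting_count(memory, reported_count):
    """Model of the unpatched loop, which trusts poll()'s return value.

    `memory` stands for the pollfd array followed by whatever happens to
    sit after it in memory, as (fd, revents) pairs. C would keep reading
    past the array; this model can only read what is in the list.
    """
    results = []
    i = 0
    for _ in range(reported_count):
        while memory[i][1] == 0:   # skip descriptors that did not fire
            i += 1
        results.append(memory[i])
        i += 1
    return results


def collect_counting_manually(memory, nfds):
    """Model of the workaround: ignore the count, scan only the real array."""
    return [(fd, revents) for fd, revents in memory[:nfds] if revents != 0]


# Three registered descriptors; only fd 5 fired.
pollfds = [(4, 0), (5, 0x001), (6, 0)]
# Hypothetical neighbouring memory that the C loop reads on overrun.
garbage = [(2000000000, 0x7F3A), (7, 0x004)]
memory = pollfds + garbage

print(collect_trusting_count(memory, 1))   # correct count: only fd 5
print(collect_trusting_count(memory, 3))   # inflated count: garbage leaks in
print(collect_counting_manually(memory, len(pollfds)))
```

With an inflated count of 3, the model returns the two garbage entries alongside the real event, one of which looks like fd 7 with a writable bit set. Scanning exactly the registered entries returns only fd 5, whatever count poll() reported.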

This problem has been reported to the python folks at http://sourceforge.net/tracker/index.php?func=detail&aid=1759997&group_id=5470&atid=105470 along with the patch. Since it's cygwin's buggy poll() that's the root cause, I'm not sure if they'll accept the patch or not.

Zooko said he was tempted to work on a cygwin patch, so I'm going to hold off notifying the cygwin mailing list until we have that patch in hand (or decide that we aren't going to bother). Either way, I think we understand this issue well enough to close out this bug.

comment:9 Changed at 2008-01-19T14:17:04Z by zooko

This bug was fixed in cygwin 1.5.25-7, released 2007-12-17.
