#611 closed defect (fixed)

test_repairer test_repair_from_corruption_of_1 sometimes errors

Reported by: warner Owned by: zooko
Priority: major Milestone: 1.3.0
Component: code-encoding Version: 1.2.0
Keywords: Cc: tahoe-dev@…
Launchpad Bug:

Description

On some builders, this test_repairer unit test sometimes appears to hang, for long enough to trigger the timeout error. It might just be that the test is taking a long time, but I doubt it, because its neighboring tests (which should take about the same amount of time) never seem to fail. Also, I've seen it fail on an otherwise fast build slave.

http://allmydata.org/buildbot/builders/hardy2.6/builds/308 is the most recent failure I've seen.

We know that many of the repairer tests are asserting more functionality than the repairer has right now. Also, some of them wind up depending upon random values (probably in the corrupt-share routine), so they pass or fail at random. However, neither of these explains a hang.

One hunch is that certain failure modes (e.g. corrupting a version byte) cause some state-machine-shaped code to forget to fire a callback. Another is that we're missing an errback handler somewhere.

The ideal fix will be to understand why this is causing a hang and fix the bug. The second-best fix will be to remove or disable these tests, because of the other problems described above.
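To illustrate the missing-callback hunch, here is a minimal toy sketch (hypothetical names, not Tahoe's actual classes) of a state-machine-style reader that stores a completion callback but returns early on an unexpected version byte without ever firing it, which is exactly the shape of bug that makes a test wait until the trial timeout:

```python
class ShareReader:
    """Toy state machine; `on_done` must fire exactly once per read."""
    def read(self, share, on_done):
        version = share[0]
        if version != 1:
            # Buggy branch: an unexpected version byte returns without
            # firing the callback, so the caller waits forever.
            return "bad version"
        on_done(share[1:])
        return "ok"

class FixedShareReader(ShareReader):
    def read(self, share, on_done):
        version = share[0]
        if version != 1:
            # Fixed branch: report the failure through the callback so
            # the caller always makes progress.
            on_done(None)
            return "bad version"
        on_done(share[1:])
        return "ok"

results = []
ShareReader().read(bytes([9, 42]), results.append)       # corrupt version byte
FixedShareReader().read(bytes([9, 42]), results.append)  # same corruption
# results == [None]: the buggy reader never called back; the fixed one did
```

The same pattern applies to Deferreds: any early `return` on an error path that skips both the callback and the errback leaves the chain unfired and the test hung.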

Change History (7)

comment:1 Changed at 2009-02-09T17:12:33Z by zooko

  • Owner set to zooko
  • Status changed from new to assigned

comment:2 Changed at 2009-02-11T23:36:03Z by warner

  • Milestone changed from 1.3.0 to 1.3.1

I've added notes to test_repairer and NEWS to explain the current state of these tests: the repair tests never pass (because they are very strenuous and the repairer is incomplete), and sometimes they appear to hang.

I've disabled the repair-from-corruption tests.

I've seen one hang in the repair-from-deletion test, but it took an hour to trigger, and I haven't been able to capture it (with enough logging to reproduce it directly) despite another hour or two of trying. I'll continue to hammer on it, but I don't think this is going to be a 1.3.0 fix.

So I'm going to push this ticket out to 1.3.1.

Here's a copy of my notes from test_repairer.py:

As recently documented in NEWS for the 1.3.0 release, the current immutable repairer suffers from several limitations:

  • minimalistic verifier: it's just download without decryption, so we don't look for corruption in N-k shares, and for many fields (those which are the same in all shares) we only look for corruption in a single share
  • some kinds of corruption cause download to fail (when it ought to just switch to a different share), so repair will fail on these too
  • RIStorageServer doesn't offer a way to delete old corrupt immutable shares (the authority model is not at all clear), so the best the repairer can do is to put replacement shares on new servers, unfortunately leaving the corrupt shares in place

This test is pretty strenuous: it asserts that the repairer does the ideal thing in 8 distinct situations, with randomized corruption in each. Because of the aforementioned limitations, it is highly unlikely to pass any of these. We're also concerned that the download-fails case can provoke a lost-progress bug (one was fixed, but there might be more lurking), which will cause the test to fail despite a ".todo" marker, and will probably cause subsequent unrelated tests to fail too (due to "unclean reactor" problems).

So we're turning this test off until we've done one or more of the following:

  • remove some of these limitations
  • break the test up into smaller, more functionally-oriented pieces
  • simplify the repairer enough to let us be confident that it is free of lost-progress bugs

comment:3 Changed at 2009-02-12T03:57:09Z by zooko

There was a typo in this. Brian wrote:

{{{I've added notes to test_repairer and NEWS to explain the current state of these tests: the repair tests never pass (because they are very strenuous and the repairer is incomplete), and sometimes they appear to hang.

I've disabled the repair-from-corruption tests.}}}

He meant:

{{{I've added notes to test_repairer and NEWS to explain the current state of these tests: the repair-from-corruption tests never pass (because they are very strenuous and the repairer is incomplete), and sometimes they appear to hang.

I've disabled the repair-from-corruption tests.}}}

comment:4 Changed at 2009-02-12T16:00:02Z by zooko

I wasn't able to reproduce Brian's hang in allmydata.test.test_repairer.Repairer.test_repair_from_deletion_of_1. I ran it overnight (700 minutes); it ran the test 9443 times (taking around 3.4 s per run, by the way) and didn't hang, fail, or otherwise do anything funny. I think I'll relaunch it after removing the debug printouts that I added, but currently I'm satisfied that the repairer is Good Enough for the tahoe 1.3.0 release.

comment:5 Changed at 2009-02-12T23:25:58Z by zooko

  • Cc tahoe-dev@… added

Okay, I found the bug. If the uploader requests a read from the DownUpConnector after the downloader has finished writing and before the downloader closes it, then that request will never be satisfied. d7dbd6675efa2f25 fixes this and tests it. Please review! This is the last code change scheduled to go into tahoe-1.3.0, and I would hate to have inserted a bug into tahoe-1.3.0 at the last moment, so please review this patch while I work on CREDITS and relnotes.txt. :-)
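The race can be sketched with a toy buffer (hypothetical names, not Tahoe's actual DownUpConnector implementation): pending reads are only re-checked when data arrives, so a read that asks for more bytes than were written, issued after the final write, sits in the queue forever unless close() also flushes it:

```python
from collections import deque

class ToyDownUpConnector:
    """Toy buffer a downloader writes into and an uploader reads from."""
    def __init__(self):
        self.buf = b""
        self.closed = False
        self.pending = deque()  # queued (nbytes, callback) read requests

    def write(self, data):
        self.buf += data
        self._satisfy()

    def read(self, nbytes, callback):
        self.pending.append((nbytes, callback))
        self._satisfy()

    def close(self):
        self.closed = True
        # The fix: flush reads that arrived after the last write().
        # Without this call, a read requested between the final write()
        # and close() is never answered and the caller hangs.
        self._satisfy()

    def _satisfy(self):
        while self.pending:
            nbytes, cb = self.pending[0]
            if len(self.buf) >= nbytes:
                chunk, self.buf = self.buf[:nbytes], self.buf[nbytes:]
            elif self.closed:
                chunk, self.buf = self.buf, b""  # short read at EOF
            else:
                return  # wait for more data to arrive
            self.pending.popleft()
            cb(chunk)

got = []
c = ToyDownUpConnector()
c.write(b"abcd")
c.read(10, got.append)  # asks for more than was written...
c.close()               # ...and is only satisfied at close()
# got == [b"abcd"]
```

This matches the symptom: the hang only appears when the read request lands in the narrow window between the last write and the close, which is why it triggered so rarely.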

By the way, I also ran another 4876 iterations of test_repair_from_deletion_of_1 (this time without the debug printouts) with no deviation in behavior, and I am wondering why there is any variation at all in the behavior of this test on Brian's machine. Could it be that the system test is using loopback, and that the loopback device drops or reorders packets very rarely on Brian's machine but never, or at least even more rarely, on mine?

comment:6 Changed at 2009-02-12T23:34:44Z by zooko

  • Resolution set to fixed
  • Status changed from assigned to closed

comment:7 Changed at 2009-02-12T23:34:48Z by zooko

  • Milestone changed from 1.3.1 to 1.3.0