#616 closed defect (fixed)

bug in repairer causes sporadic hangs in unit tests

Reported by: zooko Owned by:
Priority: major Milestone: 1.3.0
Component: code-encoding Version: 1.2.0
Keywords: Cc: tahoe-dev@…
Launchpad Bug:

Description

There is a bug in DownUpConnector._satisfy_reads_if_possible():

src/allmydata/immutable/repairer.py@20090112214120-e01fd-7d241072d30b14d3e243829e952e8c8440e6c461#L127

It should be putting leftover bytes back into the self.bufs and the rest into the result, not putting all-but-leftover bytes back and the rest into the result! In cases where the input chunks have come in different sizes than the read requests, this bug could lead to a read request getting more or fewer bytes than it requested. This could lead to data corruption (although not irreversibly so -- it would then upload the same sequence of bytes but in different-sized blocks, which would screw up the integrity checking code but not the ciphertext).

Fortunately, in our current code, the writes and the read requests are always of the same sizes (the block size), so this doesn't happen in practice. I've added an assertion in c59940852b94ba45 just to make it fail safely if this were to happen in practice. I have started writing unit tests for DownUpConnector._satisfy_reads_if_possible() -- it turns out that we need unit tests in addition to the functional tests that I already wrote: src/allmydata/test/test_repairer.py.

This explains the sporadic "lost progress" failure in the functional tests. Hm... Could it also explain the "lost progress" behavior that Brian and I witnessed on the testgrid when this code was newly committed to trunk? I hope not, because that would mean that I am wrong about the writes and reads always having the same sizes. But I'm pretty sure I am right about that.

Change History (2)

comment:1 Changed at 2009-02-12T00:00:32Z by warner

  • Milestone changed from 1.3.0 to 1.3.1

as mentioned in #611, we disabled the repair-from-corruption tests, and have only rarely seen lost-progress in the remaining repair-from-deletion test.

Zooko fixed one bug in the repairer which would have caused lost-progress, but didn't see any other obvious ones.

I've seen lost-progress in repair-from-deletion twice now (after zooko's fix), but it's pretty rare (and therefore hard to analyze). Since repair-from-deletion is supposed to be deterministic, the only entropy source remaining is the order in which download reads and upload writes are interleaved, which means it's going to be a long hard struggle to capture enough information for analysis.

So we're going to push this one out to 1.3.1 . We'd like to have a perfect repairer in 1.3.0, but we also want to have a 1.3.0 soon, and a repairer which hangs once out of every thousand uses might be good enough for that.

comment:2 Changed at 2009-02-20T21:41:06Z by zooko

  • Milestone changed from 1.3.1 to 1.3.0
  • Resolution set to fixed
  • Status changed from new to closed

fixed in 1.3.0 by d7dbd6675efa2f25

Note: See TracTickets for help on using tickets.