#207 closed enhancement (fixed)

unit tests for failure modes of small mutable files

Reported by: zooko Owned by: zooko
Priority: major Milestone: 1.1.0
Component: code-encoding Version: 0.7.0
Keywords: mutable Cc:
Launchpad Bug:

Description

This ticket is the successor to #197. Here are all the pieces of the Small Distributed Mutable Files work that we need to finish:

  • recovery in case a colliding write is detected -- clients should probably take steps to minimize the chance that the colliding write results in all versions of the file being lost permanently. We've designed and written down a recovery mechanism (in docs/mutable.txt) that seems Good Enough. There is a place for it to be implemented in allmydata.mutable.Publish._maybe_recover.
  • the client should create a new-style dirnode upon first boot instead of an old-style one
  • the old dirnode code should be removed, along with the vdrive client-side code and the vdrive server (and the vdrive.furl config file)
  • dirnode2.py should replace dirnode.py
  • URIs for the new mutable filenodes and dirnodes are a bit goofy-looking. See #102.
  • rollback-attacks: we chose a policy of "first retrievable version wins" on download, but for small grids and large expansion factors (i.e. small values of k) this makes it awfully easy for a single out-of-date server to effectively perform a rollback attack against you. I think we should define some parameter epsilon and use the highest-seqnum'ed retrievable version from k+epsilon servers.
  • analyze control flow to count the round trips. I was hoping we could get an update done in just one RTT but at the moment it's more like 3 or 4. It's much more challenging than I originally thought.
  • try to always push a share (perhaps an extra N+1'th share) to ourselves, so we'll have the private key around. It would be sad to have a directory whose contents were recoverable but which we could no longer modify because we couldn't get the privkey anymore.
  • choose one likely server (specifically ourselves) during publish to use to fetch our encprivkey. This means doing an extra readv (or perhaps just an extra-large readv) for that one server in _query_peers: the rest can use pretty small reads, like 1000 bytes. This ought to save us a round-trip.
  • error-handling. peers throwing random remote exceptions should not cause our publish to fail unless it's for NotEnoughPeersError.
  • the notion of "container size" in the mutable-slot storage API is pretty fuzzy. One idea was to allow read vectors to refer to the end of the segment (like python string slices using negative index values), for which we'd need a well-defined container size. I'm not sure this is actually useful for anything, though. (maybe grabbing the encrypted privkey, since it's always at the end?). Probably not useful until MDMF where you'd want to grab the encprivkey without having to grab the whole share too.
  • tests, tests, tests. There are LOTS of corner cases that I want coverage on. The easy ones are what download does in the face of out-of-date servers. The hard ones are what upload does in the face of simultaneous writers.
  • Publish peer selection: rebalance shares on each publish, by noticing when there are multiple shares on a single peer and also unused peers in the permuted list. The idea is that shares created on a small grid should automatically spread out when updated after the grid has grown.
  • RSA key generation takes an unfortunately long time (between 0.8 and 3.2 seconds in my casual tests). This will make a RW deepcopy of a large directory structure pretty slow. We should do some benchmarking of this thing to determine key size / speed tradeoffs, and maybe at some point consider ECC if it could be faster.
  • code terminology: share vs slot vs container, "SSK" vs mutable file vs slot. We need to nail down the meanings of some of these and clean up the code to match. Zooko thinks that the name "SSK" -- "Sub-Space Key" -- is not directly applicable to our mutable file URIs.
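The rollback-attack item above proposes replacing "first retrievable version wins" with "highest-seqnum'ed retrievable version from k+epsilon servers". A minimal sketch of that selection rule follows, assuming each queried server reports one share tagged with a sequence number and that a version is retrievable once k shares of it have been seen. The function name `choose_version` and the shape of `responses` are hypothetical and are not taken from allmydata.mutable.

```python
from collections import Counter


def choose_version(responses, k, epsilon):
    """Return the highest retrievable seqnum among the first k+epsilon responses.

    responses: list of (server_id, seqnum) pairs in the order the answers
               arrived, one share per server (an assumption of this sketch).
    k:         number of shares needed to reconstruct a version.
    epsilon:   how many extra servers to consult beyond k.
    Returns None when no version has at least k shares.
    """
    considered = responses[:k + epsilon]
    # Count how many shares of each version we have seen so far.
    shares_per_version = Counter(seqnum for _server, seqnum in considered)
    retrievable = [seqnum for seqnum, count in shares_per_version.items()
                   if count >= k]
    if not retrievable:
        return None
    return max(retrievable)
```

With k=1 and epsilon=0 this degenerates to the current policy, so a stale server that answers first decides the result. With epsilon=2 the same stale answer is outvoted by any newer version seen among the three servers consulted.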

Change History (19)

comment:1 Changed at 2007-11-13T19:45:20Z by warner

Don't we need to switch-over-to-new-style-dirnodes to close #115 for 0.7.0? In that case, we need to move the following items from this ticket into #115:

  • create new-style dirnode upon first boot instead of old-style one
  • remove old dirnode code, replace with dirnode2

comment:2 Changed at 2007-11-14T14:57:37Z by zooko

  • Owner changed from nobody to zooko
  • Status changed from new to assigned

comment:3 Changed at 2007-11-14T14:57:55Z by zooko

  • Owner changed from zooko to nobody
  • Status changed from assigned to new

comment:4 Changed at 2007-11-14T21:31:04Z by warner

I will add the following items to this:

  • UI for mutable files: PUT-which-means-replace
  • improve browser-oriented web UI for POST (moving the "replace" form to a subpage)
  • unit tests for specific failure conditions:
    • corrupted shares must be identified as such
    • storageserver read failures must be handled
    • short reads must be handled
    • check consistency algorithms

comment:5 Changed at 2008-01-23T02:37:02Z by zooko

  • Summary changed from small mutable files cleanup and polish to unit tests for failure modes of small mutable files

We have a UI for mutable files in the wui -- "overwrite" button

To improve the wui for this -- moving the "overwrite" form to a subpage -- is #277

These items remain:

unit tests for specific failure conditions:

  • corrupted shares must be identified as such
  • storageserver read failures must be handled
  • short reads must be handled
  • check consistency algorithms

comment:6 Changed at 2008-01-23T02:37:38Z by zooko

unit tests for specific failure conditions:

  • corrupted shares must be identified as such
  • storageserver read failures must be handled
  • short reads must be handled
  • check consistency algorithms

comment:7 Changed at 2008-01-23T02:38:21Z by zooko

  • Milestone changed from 0.7.1 to 0.8.0 (Allmydata 3.0 Beta)

comment:8 Changed at 2008-02-08T03:31:09Z by warner

I've created other tickets for all the non-test items described here. This ticket is now solely about writing mutable-file unit tests.

comment:9 Changed at 2008-02-14T00:03:12Z by warner

  • Component changed from unknown to code-encoding
  • Keywords mutable added
  • Owner nobody deleted

comment:10 Changed at 2008-03-08T02:13:31Z by zooko

  • Milestone changed from 0.8.0 (Allmydata 3.0 Beta) to 0.9.0 (Allmydata 3.0 final)

This is something that I would like for 0.9.0, especially considering the change planned: #332 (K=1 for mutable files).

comment:11 Changed at 2008-03-10T19:40:35Z by zooko

  • Owner set to zooko
  • Status changed from new to assigned

comment:12 Changed at 2008-03-10T20:16:26Z by warner

Since #312 and #332 are claiming to depend upon this one, we need to add "test what happens when we see multiple encodings for the same SI" to the list of tests that must be implemented.

Let's brainstorm about what sorts of tests we want to see, since this ticket has been hanging around for so long (and accreted and shed so many items):

  • corrupted shares must be identified as such
    • note that test_system has a subset of these tests, however they are not deterministic (we corrupt most of the shares, but since we don't have control over what order we read them in, we may read good shares early and thus short-circuit the rest). If we had a way to make the system test hit servers in a specific order (perhaps by delaying the responses), then we could make this more consistent.
    • a better test would be to create a share, mangle it in some specific way, then pass it directly to the validation function in mutable.Retrieve, and assert that it raises the correct exception. Corrupting the signature should not cause a hash failure, for example.
  • storage server read failures must be handled
  • short reads must be handled
  • see what happens when we see multiple versions of a file, make sure we get the same consistency behavior that we expect
  • see what happens when we see multiple encodings of a file, best case behavior is that we can accept any version with enough shares, worst case is that we at least don't try to mix alternate encodings and get garbage
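The "create a share, mangle it in some specific way, pass it to the validation function, assert the correct exception" idea above could look roughly like the following. This is only a sketch of the test shape, not the mutable.Retrieve code: `make_share`, `validate_share`, `BadSignatureError` and `BadHashError` are hypothetical names, the share is a plain dict, and an HMAC stands in for the real public-key signature.

```python
import hashlib
import hmac
import unittest

# Stand-in for the file's signing key (the real code uses a public-key signature).
KEY = b"stand-in signing key"


class BadSignatureError(Exception):
    pass


class BadHashError(Exception):
    pass


def make_share(data):
    """Build a toy share: the data, a hash of the data, and a signature over that hash."""
    root_hash = hashlib.sha256(data).digest()
    signature = hmac.new(KEY, root_hash, hashlib.sha256).digest()
    return {"data": data, "root_hash": root_hash, "signature": signature}


def validate_share(share):
    """Check the signature first, then the hash, so each failure is reported distinctly."""
    expected = hmac.new(KEY, share["root_hash"], hashlib.sha256).digest()
    if not hmac.compare_digest(expected, share["signature"]):
        raise BadSignatureError("signature does not match root hash")
    if hashlib.sha256(share["data"]).digest() != share["root_hash"]:
        raise BadHashError("data does not match root hash")
    return share["data"]


def flip_first_byte(value):
    """Mangle a bytes value deterministically by inverting its first byte."""
    return bytes([value[0] ^ 0xFF]) + value[1:]


class CorruptShareTests(unittest.TestCase):
    def test_good_share(self):
        self.assertEqual(validate_share(make_share(b"contents")), b"contents")

    def test_corrupt_signature(self):
        share = make_share(b"contents")
        share["signature"] = flip_first_byte(share["signature"])
        # A bad signature must not be reported as a hash failure.
        with self.assertRaises(BadSignatureError):
            validate_share(share)

    def test_corrupt_data(self):
        share = make_share(b"contents")
        share["data"] = flip_first_byte(share["data"])
        with self.assertRaises(BadHashError):
            validate_share(share)
```

Because each test mangles exactly one field and calls the validator directly, the outcome does not depend on the order in which servers answer, which is the nondeterminism noted for test_system above.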

comment:13 Changed at 2008-03-10T20:33:11Z by warner

Oh, and of course, an excellent way to develop these tests is to use the code-coverage data ('make test-figleaf TEST=allmydata.test.test_something.TestClass.test_method; make figleaf-output; firefox coverage-html/index.html') and iterate until all of the error-checking code is actually exercised.

I've found that doing this one test-case at a time is a great way to figure out what my test is actually doing. Gathering coverage data for the whole test run at once gives me confidence in the tests as a whole, but usually loses too much data about individual tests.

comment:14 Changed at 2008-03-11T08:54:35Z by warner

I've added tests for the multiple-encodings of a file. The same framework (in test_mutable.Roundtrip) can probably be used to test multiple-versions.

comment:15 Changed at 2008-03-12T18:54:06Z by zooko

  • Milestone changed from 0.9.0 (Allmydata 3.0 final) to 0.10.0

comment:16 Changed at 2008-03-24T00:49:06Z by zooko

  • Milestone changed from 1.1.0 to 1.0.0

Hopefully someone will add some unit tests before we release 1.0.0.

comment:17 Changed at 2008-03-25T19:27:13Z by zooko

  • Milestone changed from 1.0.0 to 1.0.1

comment:18 Changed at 2008-04-24T23:21:15Z by warner

  • Resolution set to fixed
  • Status changed from assigned to closed

Most of these items are now done: the big mutable-file refactoring handled a lot of them. The ones that remain:

  • recovery after a colliding write is detected. This belongs at the end of Publish, if a collision was detected, and is responsible for leaving the file in a healthy state (although not in any particular version). #272

Minor issues that we can leave undone:

  • URIs for mutable files look goofy: we'll fix this when we move to EC-DSA based mutable files (#217)
  • more round-trip-count analysis, starting with better visualization tools. The goal is to reduce write to two RTT. Read is probably down to one or two RTT already. #394
  • Servers can probably prevent clients from updating a file by provoking uncoordinated write errors forever.
  • container size is fuzzy. When we examine MDMF we need to look at this more closely. Tracked in #393.
  • rebalancing. The current publish algorithm will put homeless shares on new peers, but it won't move shares from doubled-up peers. #232
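For the rebalancing item, the missing step (moving shares off doubled-up peers onto unused peers in the permuted list) might be sketched as below. This is an illustration under stated assumptions, not the current publish algorithm: `rebalance` is a hypothetical helper, `placement` maps share numbers to peer ids, and `permuted_peers` is the permuted peer list for the file.

```python
def rebalance(placement, permuted_peers):
    """Move extra shares from doubled-up peers to unused peers.

    placement:      dict mapping share number -> peer id currently holding it.
    permuted_peers: list of peer ids in permuted order for this file.
    Returns a new placement dict; the input is not modified.
    """
    new_placement = dict(placement)
    used = set(placement.values())
    # Peers that hold no share yet, kept in permuted order.
    unused = [peer for peer in permuted_peers if peer not in used]
    seen = set()
    for shnum in sorted(placement):
        peer = placement[shnum]
        if peer in seen and unused:
            # This peer already keeps a lower-numbered share: relocate this one.
            new_placement[shnum] = unused.pop(0)
        else:
            seen.add(peer)
    return new_placement
```

For example, with shares 0 and 1 both on peer "A", share 2 on "B", and a permuted list of A, B, C, D, the sketch moves share 1 to "C" and leaves the others alone. When there are no unused peers, the placement is returned unchanged.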

Since we have tickets for all the important ones, I'm closing this one out.

comment:19 Changed at 2008-05-05T21:08:36Z by zooko

  • Milestone changed from 1.0.1 to 1.1.0

Milestone 1.0.1 deleted
