#546 new defect

mutable-file surprise shares raise inappropriate UCWE

Reported by: warner Owned by:
Priority: normal Milestone: soon
Component: code-mutable Version: 1.2.0
Keywords: availability upload ucwe Cc:
Launchpad Bug:

Description (last modified by warner)

We added a "Thumper" to the allmydata.com prodnet today: 47 new nodes, bringing the grid to a total of 111 nodes. Shortly afterwards, our "trunk-prodnet" automated grid-tester failed: http://allmydata.org/buildbot/builders/gutsy/builds/1195 . There were a couple of different problems here, at least three tickets worth. This ticket is about a problem that is revealed by the addition of a large number of nodes.

First, a quick summary of the problems:

  • problem 1: mapupdate(MODE_WRITE) triggers on 1000, when it should use 1000$ (to avoid triggering on 10001) (ticket #547)
  • problem 2: the mapupdate can hit a false boundary because the inserted gap was too large. It might be a good idea to increase epsilon for MODE_WRITE to reduce the chance of this. (ticket #549)
  • problem 3: when mapupdate hits a false boundary (because of either of the above problems), the subsequent publish may put shares to servers which actually already have them. The code looks for these "surprise" shares and raises UCWE if it sees them. The UCWE is probably inappropriate, instead a new writev should be sent to update the old share. (this ticket, #546)
  • problem 4: delete() which hits UCWE (because of problem 3, or other issues) will retry, but will fail the second time because must_exist=True is the default (ticket #550)

Now, some background. When a directory or mutable file is created, gets a public key, and from that it gets a storage index. The storage index defines a permutation of storage servers: a sequence of serverids. A permutation function is used which accepts as inputs a set of serverids and the storage index, and it produces an ordered list of serverids (the same size as the input set). The partial ordering of the serverids is always the same for any given storage index. This means that adding new servers (i.e. re-running the function with a larger set) will insert new serverids into the output list at various places, but it will not change the relative ordering.

A new mutable file's shares are placed on the first N (=10) writeable servers in permuted order. Later, when the file is subsequently read or written, a "share map update" operation is performed, to take the then-current list of servers and build up a partial map of which shares are on which servers. This mapupdate operation has several different modes, depending upon what needs to be done with the servermap: MODE_READ is for download-only, MODE_WRITE is for read-modify-write, and MODE_CHECK is for file-checking and repair operations.

The mapupdate operation is designed to balance efficiency, speed, and safety. Efficiency is served by querying as few servers as possible. Speed is served mainly by minimizing round trips. Safety is a question of improving consistency: there may be shares of various versions present on the servers, and we desire to have a pretty good chance of seeing the "current" version, and we want to update "most" shares when we modify the file.

In MODE_READ, the mapupdate operation queries servers, in permuted order, until it has received evidence of k (=3) plus "epsilon" (=k) shares of the highest version, and has not seen evidence of any competing versions (those with a higher sequence number). This is enough to retrieve the file and also to reduce the possibility of retrieving an old version (which, although annoying, would not really threaten data-loss, because MODE_READ is not used for directory modifications). Without the epsilon, a small number of old shares would be enough to trick the operation into not looking for a newer version (equivalently, a small group of colluding servers could accomplish a rollback attack).

In MODE_WRITE, the mapupdate operation needs to be more thorough, for two reasons. The first is that MODE_WRITE is for modification, which means somebody is going to take the version that is returned and build a new version on top of it (by modifying the contents). When that new version is published, it will (ideally) completely replace any old versions. So if the version that gets retrieved is not the current one, we'll lose the contents of the current version, which will be a regression in the state of the file. The second reason is that the eventual publish needs to update all shares, to reduce the chance of leaving around shares of old versions (and thus improving the safety of later operations).

So MODE_WRITE must try harder to locate all shares: it tries to strike a different balance between efficiency, speed, and safety. In this mode, the code is supposed to query servers in permuted order until:

  1. it can recover some version of the file
  2. it sees no evidence of older versions of the file
  3. it has seen enough contiguous "I have no share for you" answers that we can conclude we've searched beyond the active set of servers. Specifically, the code looks for a range of servers in which the first server has a share, and the following epsilon (=k) servers do not have a share, then it declares the search complete.
  4. or, it runs out of servers to query

There is a bug in the current implementation of MODE_WRITE, in which the last condition (a "boundary" between the active set and the remaining unused servers) is incorrectly triggered (the pattern-match looks for "1000" but it should really look for "1000$", to avoid matching on "10001"). But that is in a different ticket (#547).

This ticket is about the fact that the publish process doesn't keep track of which shares it's sent to whom, so if it needs to make a second pass (to place shares that were not accepted by their first destination), it will get confused and signal an inappropriate UCWE.

When a single new server is added, it is effectively inserted at a random location into the permuted peer list. The list that used to be "A B C D E" becomes "A B new C D E". The mapupdate operation should tolerate this. Ideally, some kind of rebalancing or repair operation should move the shares up in the server list, continually moving the shares to their "ideal" location (the one that improves the efficiency, speed, and safety of publish/retrieve operations).

The publish operation will leave shares in place whenever possible. (it currently has no mechanism to remove extra shares, and is obligated to modify all shares in place, so it won't intentionally duplicate a share). So if we pretend that N=5 and the file was first created on "A1 B2 C3 D4 E5" (i.e. server A has share 1, server B has share 2, etc), then when "new" is inserted, the map will look like "A1 B2 new(none) C3 D4 E5", and no share will be placed on "new".

If lots of new servers are inserted, such that the map looks like "A B C new new new D E", then the MODE_WRITE mapupdate may conclude that the active set ends with server C: unless it spends the time+bandwidth to query more servers, it won't see the shares on D and E. (if a gap of three servers isn't convincing, imagine that the gap had 100 servers in it). In this case, the mapupdate will stop after the sixth query, and report shares at "A1 B2 C3 new(none) new(none) new(none)". When the subsequent publish takes place, it needs to put shares 4 and 5 somewhere, so it will put them on the first two new servers, and the visible map will wind up as "A1 B2 C3 new4 new5 new(none)". In fact, the full map will be "A1 B2 C3 new4 new5 new(none) D(old4) E(old5)", since D and E would not have been updated.

In effect, the shares have been migrated closer to the start of the permuted list, and a few old shares will get leftover towards the end of the list. As long as the new servers stick around, the old shares will never be seen. Some day, we'll have a repairer or a share-migrator service that will spot these and delete them, but for now they're fairly harmless. (if enough new servers disappear, then the old shares could be dangerous, since they might cause rollback). This process may happen multiple times, as new batches of servers are added.

Now, if a medium number of new servers are added, the "unhoused shares" (4 and 5 in the above example) might be assigned to servers which have not been queried. If we use N=7 and start with "A1 B2 C3 new new new D4 E5 F6 G7", then the mapupdate code would stop at "A1 B2 C3 new new new", and the publish code would need to find homes for shares 4-7. The three new servers would get shares 4 5 and 6, and share 7 would be assigned to the pre-existing (but unqueried) server D. The publish code makes the dubious assumption that if server D wasn't queried, then server D does not have any shares.

As a result, a readv-and-testv-and-writev request is sent to server D that asks it to accept the new version of share 7 (but only if it did not already have a copy of share 7). The readv portion of the request reveals that server D has a copy (now old) of share 4. The publish code responds to this "surprise share" by raising UncoordinatedWriteError, since normally (i.e. when a server was queried and reported having no share, then a tv-a-rv-a-wv reports yes having a share) this indicates that someone else has written a share in the interim.

This UCWE is "problem 3" above. If the mapupdate had queried this server, it would have learned about share 4, and it would have done two things differently: it would update D with the new version of share 4, and it would not have sent share 4 to one of the new servers.

The trick is how to give it the opportunity to learn about this share while still achieving the desired speed and efficiency. Maybe the rule should be that no tv-a-rv-a-wv requests shall ever be sent to a server that hasn't already been queried (i.e. require two round trips, the readv of which might either be done in the mapupdate or in the publish). Another possibility is to do it reactively: instead of throwing UCWE when we see the surprise share, just send out another query to replace it.

Reducing the likelihood of this situation can help, but won't remove the issue entirely. Even if the MODE_WRITE mapupdate used an epsilon equal to N (so that we were sure we'd have answers from N servers, in the hopes of not having to walk beyond the end of the set we've already queried), some of those servers might not be able to accept the write, forcing us off into uncharted territory.

Change History (12)

comment:1 Changed at 2008-12-09T18:45:53Z by warner

  • Description modified (diff)

update bug numbers, clarify the specific focus of this bug (surprise shares cause UCWE)

comment:2 Changed at 2008-12-09T18:49:15Z by warner

Francois Deppierraz experienced this same problem in a different form: one of his storage servers was full, so the writev() failed with a "Disk Quota Exceeded" IOError. This caused the loop to place the unhoused share on a different server, one which had already been given a share. It hit the same surprise-share failure that we hit due to the not-looking-far-enough problems described above.

comment:3 Changed at 2008-12-09T19:01:17Z by warner

I can think of two kinds of fix right now:

  • tolerate "surprise" shares if their roothash matches the version we were trying to publish
  • remember which shares we've sent to whom, and tolerate "surprise" shares if they look like something we remember sending

The first would also have the effect of tolerating UCWE between two uncoordinated writers who happened to be writing the same thing, which seems unlikely but might happen if both clients are trying to recover or repair the file at the same time.

comment:4 Changed at 2008-12-10T05:05:34Z by warner

278c47b9bd859343 adds one of these fixes: tolerate "surprise" shares if their roothash matches the version we were trying to publish. This should fix the problem that Francois say, in which we re-use a server and send it multiple shares in separate requests.

There's a comment in publish.py (around line 716, in _got_write_answer) about what needs to be done for the other half of the fix (to deal with servers that we failed to query during the mapupdate step). There's a way to proceed correctly without needing to signal UCWE (see the comment for details, basically if the surprise share is old enough we can replace it without doing another mapupdate-read-modify-write cycle). But the shorter/more-complete fix is to raise UCWE but mark the servermap to:

  1. ask that server next time we do a mapupdate
  2. make sure to include that server in the boundary calculation, so that it forces us to extend the search further

comment:5 Changed at 2010-03-24T22:51:32Z by davidsarah

  • Keywords preservation upload ucwe added

comment:6 Changed at 2010-03-24T23:08:14Z by davidsarah

  • Priority changed from major to critical

comment:7 Changed at 2010-03-24T23:14:02Z by davidsarah

  • Keywords availability added; preservation removed

I think this only affects whether we can successfully write an update, rather than whether we preserve updates that were claimed to be successfully written.

comment:8 Changed at 2010-03-24T23:14:15Z by davidsarah

  • Milestone changed from undecided to 1.7.0

comment:9 Changed at 2010-05-26T14:43:02Z by zooko

  • Milestone changed from 1.7.0 to 1.8.0

It's really bothering me that mutable file upload and download behavior is so finicky, buggy, inefficient, hard to understand, different from immutable file upload and download behavior, etc. So I'm putting a bunch of tickets into the "1.8" Milestone. I am not, however, at this time, volunteering to work on these tickets, so it might be a mistake to put them into the 1.8 Milestone, but I really hope that someone else will volunteer or that I will decide to do it myself. :-)

comment:10 Changed at 2010-08-10T03:40:59Z by davidsarah

  • Milestone changed from 1.8.0 to soon

comment:11 Changed at 2010-08-10T04:24:46Z by zooko

If you like this ticket, you might like #540 (inappropriate "uncoordinated write error" after handling a server failure).

comment:12 Changed at 2012-11-13T23:27:36Z by zooko

  • Priority changed from critical to normal
Note: See TracTickets for help on using tickets.