#893 new defect

UCWE when mapupdate gives up too early, then server errors require replacement servers

Reported by: warner Owned by:
Priority: critical Milestone: soon
Component: code-mutable Version: 1.5.0
Keywords: availability preservation upload repair ucwe Cc:
Launchpad Bug:

Description

The Incidents that were reported in #877 (by Zooko, referring to mutable write errors experienced by nejucomo) indicate a thorny problem that is distinct from both #877 (caused by a reentrancy error) and #540 (caused by a logic bug that affects small grids where a publish wraps around the peer circle). Here's the setup:

  • mapupdate(MODE_WRITE) wants to find all shares, so they can all be updated. Ideally, the shares are concentrated at the beginning of the permuted peerlist, on the first N servers.
  • to avoid traversing the whole grid, mapupdate(MODE_WRITE) has a heuristic: if we've seen enough shares, and we've also seen a span of contiguous servers (in permuted order) that tell us they do not have a share, then we stop the search. The size of this span was intended to be N+epsilon (where 'epsilon' is a tradeoff between performance and safety, and is set to k). Unfortunately 1.5.0 had a bug, and the span size was set to k+epsilon instead.

In the Incident:

  • mapupdate sent out 10 queries, but decided to finish before all the responses had returned, because it found a boundary early. I'm not entirely sure how the sharemap was shaped, but it looks like it stopped with a span of 3 servers ("found our boundary, 11000"), where k=3 and N=10, and returned a sharemap with 8 shares in it, some doubled up (I think there were 5 servers involved)
  • about 150ms after mapupdate finished, one more response came back (from w6o6, with two shares, but was ignored
  • Publish starts, and sends out updates to 7 servers. Unfortunately, two of these (both owned by secorp) experienced "Permission Denied" errors when attempting to write out the new shares, suggesting a configuration error (maybe the tahoe node process is owned by the wrong user)
  • so Publish falls over to new servers. Some of the new servers it picks suffer from the same error, so the fallover process repeats a few times.
  • finally, it falls over to the server w6o6, and sends it a new share, thinking that w6o6 has no shares (because the servermap was not updated to include w6o6's late response).
  • w6o6 then responds with a writev failure, because it contains shares that the test vector did not expect, causing a UCWE error
  • most of the shares were updated, so the write may have actually happened, even though UCWE was raised.

The biggest problem with this failure is that it is persistent. We don't record any information that would tell a subsequent operation to look further for existing shares, so exactly the same thing will happen the next time we try to modify or repair the directory.

If secorp's servers weren't throwing errors, then I think the condition would eventually fix itself: new shares would be placed on his servers, bridging the span of servers without shares, and then later mapupdate calls would keep going until they'd really seen all of the shares.

Recently (eb1868628465a243, 08-Dec-2009) I fixed mapupdate to use N+epsilon instead of k+epsilon. But the incident report suggests that it stopped with a span of only 3 no-share servers. Looking more closely at the code, I think it only waits for a span of epsilon (not k+epsilon or N+epsilon), and that eb1868628465a243 changed something different. I don't know if the thing that was changed would have prevented this issue or not. It's possible that this is a manifestation of #547 (mapupdate triggers on a false boundary), or #549, or one of the other problems described in #546.

In general, we need to query more servers. But even if we increase the span size or epsilon or whatever, there will always be a weird situtation that could be handled better if we queried more servers. We'd like to have something more adaptive: if the code hits UCWE because it didn't try hard enough, then it should try harder next time.

How should we deal with this? We need something to persist from one operation on a given mutable filenode to the next, some sort of hint that says "Hey, last time we were surprised, so next time you should look further". Or something that tells us that we learned about shares on servers X+Y+Z, and so the next time we do a mapupdate, we shouldn't consider it complete until we've gotten responses from those servers (in addition to any others that we might decide to query).

The most natural place to keep this state would be on the mutable filenode instance. This would help with UCWE that occurs inside a modify() call, because the same filenode is used for each retry, but in general filenodes are pretty short-lived. We don't want to keep the mutable filenode around in RAM forever. Maybe a LRU cache that keeps filenodes around for a few minutes, so that users who experience UCWE and retry the operation can benefit from recent history.

A storage protocol that included "where-are-the-other-shares" hints (#599) would help: this would improve the reliability of mapupdate, since the persistent information would be kept on the storage servers, next to the shares.

A publish process which rebalanced the shares (#232, or #661/#543/#864) might help, by filling in the gaps, except that here the gap was caused by a batch of servers all suffering from the same configuration problem.

The right answer probably lies in having UCWE triggering an immediate repair, and having repair fill in the gaps. But it'd be nice if there were a way to stash some information on the shares before the gaps that let later operations know that they should look past the gap.

Change History (18)

comment:1 Changed at 2010-01-10T07:37:03Z by warner

looking at the eb1868628465a243 change more closely, I think it could be improved. The change increases self.num_peers_to_query, but that's only actually used by MODE_READ. The finish-criteria code for MODE_WRITE (ServermapUpdater._check_for_done) should not let the process finish unless we've received answers from at least self.num_peers_to_query, just like MODE_READ (or if we've run out of servers to query).

The gap size was always defined to be epsilon, but the minimum num_peers_to_query value was intended to make sure we'd looked far enough.

comment:2 Changed at 2010-01-10T07:43:21Z by warner

  • Milestone changed from undecided to 1.6.0
  • Priority changed from major to critical

I don't know if this is really a 1.6.0 or critical issue, but since #877 from whence it spawned was both, I guess I should set it to the same. Feel free to downgrade or defer it. I don't think we can find a good long-term solution in the 1.6.0 timeframe, but maybe the num_peers_to_query change could fit in.

comment:3 Changed at 2010-01-10T17:58:31Z by zooko

Maybe the uploader should have been unsurprised to find shares on w6o6. "Surprises" are when a server told you something during the mapupdate and then told you something different during the next phase. If you didn't ask a server during the mapupdate, then you shouldn't be surprised what he says in the next phase.

I guess this is because I think of surprises as indicating UncoordinatedWriteError, rather than indicating "the world of storage servers are in a stranger state than you assumed".

What do you think of that, Brian? If you already decided to upload version V+1 of your file, and then during the process you find some previously unused storage servers which are holding shares <= V then you should just unsurprisedly resend your write request updated to say "Yes I know you have version U, and please replace it with this version V+1.". Now if you find previously unused servers that have a version > V+1 or a version ==V and with a different root hash then you should stop (but not with UncoordinatedWriteError).

comment:4 Changed at 2010-01-10T18:01:13Z by zooko

I think extending the finish criteria for MODE_WRITE (ServermapUpdater._check_for_done) would be a good improvement (it would have solved nejucomo's problem) and is probably feasible for Tahoe-LAFS v1.6.

comment:5 Changed at 2010-01-10T21:07:41Z by warner

hm. Yeah, I suppose that if we were writing out version V+1, then it would be safe to overwrite any version (<V) or (=V,=oldroothash) that we saw. This would require either a clever test vector or some code to receive the "test failed, so no write" response and send out a new request with a new test vector. Not trivial, but not too hard.

I think that discovering a version (=V+1,!=newroothash) or (>V+1) should fire UCWE.. sounds to me like the exact definition of UCWE. Why do you think this should use a different error?

I'm not sure what it should do if it sees (=V,!=oldroothash). This indicates that a UCWE occurred in the past, and now there are two equal-version contenders for the file. Ideally the earlier write should have noticed this and performed an immediate repair. For the current publish, it should stop, but I'll agree that UCWE might not be the most accurate way to report it. Dunno.

comment:6 Changed at 2010-01-10T23:22:29Z by zooko

hm. Yeah, I suppose that if we were writing out version V+1, then it would be safe to overwrite any version (<V) or (=V,=oldroothash) that we saw. This would require either a clever test vector or some code to receive the "test failed, so no write" response and send out a new request with a new test vector.

Hm, the server doesn't currently support an operation that will apply a test-and-set if the share is present or else just apply the set if the share is absent, does it? src/allmydata/storage/server.py?rev=4118#L475 Oh look! It does! "# compare the vectors against an empty share, in which all reads return empty strings" So the current test vector should already work for this case, right? The client just needs to send a test vector saying if the server has a version less than the client's version then overwrite.

I think that discovering a version (=V+1,!=newroothash) or (>V+1) should fire UCWE.. sounds to me like the exact definition of UCWE. Why do you think this should use a different error?

Nejucomo expostulated: "UncoordinatedWriteError! I'm the only person in the universe who knows this write cap!"

And he was right -- there was never an uncoordinated write event. The client can't tell for sure, but if you never saw a server claim that it had version V and then claim that it had version V+u, then you should perhaps blame your problem on random network problems or bugs sooner than on UncoordinatedWriteError. Could we have a related exception named something like MultipleVersionsExist of which UncoordinatedWriteError is the subclass that we use when we know that either some other client just now wrote when we were writing or else some server is doing a "roll-forward" in order to make it look like some other client has done so?

I'm not sure what it should do if it sees (=V,!=oldroothash). This indicates that a UCWE occurred in the past, and now there are two equal-version contenders for the file. Ideally the earlier write should have noticed this and performed an immediate repair. For the current publish, it should stop, but I'll agree that UCWE might not be the most accurate way to report it.

This is the same case as server's V > client's V, right? So having the client send a testv which was < V should handle both cases.

comment:7 Changed at 2010-01-13T19:40:07Z by davidsarah

  • Keywords availability upload repair added

comment:8 Changed at 2010-01-13T23:53:39Z by zooko

So there are three proposals for code changes in this ticket.

  1. The finish-criteria code for MODE_WRITE (ServermapUpdater._check_for_done) should not let the process finish unless we've received answers from at least self.num_peers_to_query, just like MODE_READ (or if we've run out of servers to query).
  2. When sending a share to a server that you do not have any information about in your servermap, then go ahead and include a test vector saying "This share is intended to overwrite any share with version less than V.". Brian: does that sound like it could cause some weird problem? It sounds pretty safe to me.
  3. Invent a new exception named MultipleVersionsExisted and raise that instead of UncoordinatedWriteError if you didn't see actual evidence of a server changing its store between your mapupdate and your write. I earlier proposed that UncoordinatedWriteError could be a subtype of MultipleVersionsExisted, but I'm not sure whether it should be a subtype or just a separate type. Either way is fine with me.

comment:9 Changed at 2010-01-14T00:02:41Z by zooko

This might be related to #899, newly reported by Kyle Markley and Andrej Falout.

comment:10 Changed at 2010-01-28T15:04:36Z by zooko

  • Milestone changed from 1.6.0 to 1.7.0

comment:11 Changed at 2010-02-15T19:35:24Z by davidsarah

  • Milestone changed from 1.7.0 to 1.6.1

Not sure if we can fix this for 1.6.1, but it's definitely a candidate.

comment:12 Changed at 2010-02-16T05:32:09Z by zooko

We're bumping this out of v1.6.1 just because we're not 100% sure that proposals 2 and 3 from comment:8 are actually easy and safe.

comment:13 Changed at 2010-02-16T05:32:16Z by zooko

  • Milestone changed from 1.6.1 to 1.7.0

comment:14 Changed at 2010-03-24T22:54:03Z by davidsarah

  • Keywords preservation added

comment:15 Changed at 2010-03-24T22:54:52Z by davidsarah

  • Keywords ucwe added

comment:16 Changed at 2010-05-26T14:43:29Z by zooko

  • Milestone changed from 1.7.0 to 1.8.0

It's really bothering me that mutable file upload and download behavior is so finicky, buggy, inefficient, hard to understand, different from immutable file upload and download behavior, etc. So I'm putting a bunch of tickets into the "1.8" Milestone. I am not, however, at this time, volunteering to work on these tickets, so it might be a mistake to put them into the 1.8 Milestone, but I really hope that someone else will volunteer or that I will decide to do it myself. :-)

comment:17 Changed at 2010-08-10T03:41:46Z by davidsarah

  • Milestone changed from 1.8.0 to 1.9.0

comment:18 Changed at 2011-07-16T20:51:45Z by davidsarah

  • Milestone changed from 1.9.0 to soon
Note: See TracTickets for help on using tickets.