#778 closed defect (fixed)

"shares of happiness" is the wrong measure; "servers of happiness" is better

Reported by: zooko Owned by: zooko
Priority: critical Milestone: 1.7.0
Component: code-peerselection Version: 1.4.1
Keywords: preservation availability performance upload servers-of-happiness Cc: kevan
Launchpad Bug:

Description

metcarob posted a nice clear bug report to the list:

http://allmydata.org/pipermail/tahoe-dev/2009-August/002494.html

On Sunday, 2009-08-02, at 10:57, <gc20090728@metcarob.com> wrote:

I have set up a grid with one introducer node and one storage node. I took my browser to the 3456 port as the instructions say and I was surprised when I was able to successfully upload a file.
 
As I understand it tahoe will split the file into 10 parts and save each part on a different server. This would mean that if a server crashes you still can get the file. I was expecting to have an error message saying that the grid wasn't big enough to reliably save the file. Instead all the parts of the file have been saved on the same server. Why isn't the error message there?

Zooko writing: I've mentioned this a few times before, but I don't think there is a ticket about this exact issue. Basically, the concept of "shares of happiness" (the number of shares, the successful upload of which means that your file is safely uploaded) is almost never what people want. A more useful concept is "servers of happiness" (the number of servers, the survival and correct function of which will guarantee that your file is available).

"Servers of happiness" isn't going to be right for everyone, though. Eventually some people are going to need #573 (Allow client to control which storage servers receive shares), where they get to specify complex policies like "This upload was successful if at least three different geographic sites have K+1 shares each.".

I'm marking this as "priority: critical" instead of the standard priority (which is called "major"), because it could be a reliability problem for users such as metcarob who aren't already familiar enough with the details of Tahoe-LAFS architecture to avoid the problem. I'm also marking it with the keyword "reliability".

Attachments (17)

fecparams.py (3.0 KB) - added by swillden at 2009-08-19T16:30:03Z.
Implementation of proposed k_e, m_e selection algorithm, per comment 32
behavior.2.txt (19.6 KB) - added by kevan at 2009-10-10T23:52:55Z.
The behavior discussed in this ticket
tests.2.txt (33.3 KB) - added by kevan at 2009-10-18T02:23:15Z.
tests.3.txt (41.4 KB) - added by kevan at 2009-10-30T09:47:06Z.
tests.4.txt (134.1 KB) - added by kevan at 2010-01-18T21:48:51Z.
behavior.txt (135.8 KB) - added by kevan at 2010-02-15T20:48:24Z.
adding #834 behavior to the #778 patches
tests.txt (165.3 KB) - added by kevan at 2010-03-19T05:24:17Z.
tests updated to be current
docs.txt (6.9 KB) - added by kevan at 2010-04-28T00:49:27Z.
update documentation patches per comment:181 and comment:173
behavior2.txt (95.2 KB) - added by kevan at 2010-05-07T22:34:07Z.
tests2.txt (113.7 KB) - added by kevan at 2010-05-07T22:34:20Z.
behavior3.txt (97.7 KB) - added by kevan at 2010-05-14T01:48:09Z.
tests3.txt (125.6 KB) - added by kevan at 2010-05-14T01:48:24Z.
docs3.txt (10.6 KB) - added by kevan at 2010-05-14T01:49:02Z.
778codecoverage.darcspatch.txt (17.7 KB) - added by kevan at 2010-05-15T03:52:56Z.
778running.darcspatch.txt (17.7 KB) - added by kevan at 2010-05-19T00:34:59Z.
mutabledocs.dpatch (18.1 KB) - added by kevan at 2010-05-24T00:42:26Z.
make a distinction between immutable file uploads and mutable file uploads wrt servers of happiness
mutabledocsv2.dpatch (18.7 KB) - added by kevan at 2010-05-24T04:41:14Z.


Change History (244)

comment:1 Changed at 2009-08-10T15:46:12Z by zooko

The following clump of tickets might be of interest to people who are interested in this ticket: #711 (repair to different levels of M), #699 (optionally rebalance during repair or upload), #543 ('rebalancing manager'), #232 (Peer selection doesn't rebalance shares on overwrite of mutable file.), #678 (converge same file, same K, different M), #610 (upload should take better advantage of existing shares), #573 (Allow client to control which storage servers receive shares).

comment:2 Changed at 2009-08-12T05:57:42Z by kevan

I'd be interested in trying to fix this.

From what I can tell, share creation and uploading happens in two places:

  1. upload.py, for immutable files.
  2. publish.py, for mutable files (and dirnodes, since they're stored as mutable files)

Am I missing any?

I'm also a bit confused at the logic in publish.py. It doesn't seem to refer anywhere to the idea of a happiness value for shares -- is there a reason for this?

In any case, this ticket (unless I'm misunderstanding something) breaks down something like this:

  1. Update the documentation to reflect the change to servers of happiness.
  2. Alter/add tests
  3. Figure out a sane default value for "servers of happiness".
  4. Determine + implement new share placement algorithms for publish.py and upload.py.

Thoughts?

comment:3 Changed at 2009-08-12T06:35:20Z by kevan

  • Cc kevan added

comment:4 Changed at 2009-08-12T14:03:41Z by zooko

Hooray! Thank you for volunteering, Kevan!

Uploading an immutable file is a safe thing to do -- each share that you upload will either end up being properly served by a server or it won't. Success means enough shares are hosted by enough servers. If the upload fails, no harm was done. Publishing a new version of a mutable file is less safe, because each share that gets accepted by a server overwrites any share of the same file and the same sharenumber that was previously held by that server. Once you've started overwriting older shares then you really ought to finish -- you can't undo, and if you don't finish writing all of your shares then your new version will be unhealthy.

So this ticket is easier than you thought. First of all, you can ignore mutable files entirely, and second, the only change to be made to the share placement algorithm for publish.py is "when to give up and report failure".

Here is a recent motivating example, in case you missed it:

http://allmydata.org/pipermail/tahoe-dev/2009-August/002494.html # Why no error message when I have only one storage node?

So the basic test is, with 3-of-10 encoding but only 1 storage server, upload the file, and then give up and return an error message to the user because 1 < servers_of_happiness. (By the way, a sane default value for "servers of happiness" might be ceil((k + m) / 2).)

comment:5 Changed at 2009-08-12T15:22:00Z by swillden

I have an alternative idea which I described on the mailing list:

http://allmydata.org/pipermail/tahoe-dev/2009-August/002605.html

In my view shares and servers of happiness are both wrong, though servers is less wrong than shares.

I think the right view is "probability of survival of happiness". Given some assumptions about server reliability, we can calculate for any given distribution of shares what the probability of file survival is over a given interval. That means that we can also calculate (perhaps directly, or perhaps through a quick numerical search) the number of shares we need to distribute in order to achieve a given level of reliability.

This would fix the original problem of this thread: If it is impossible to distribute enough shares to enough servers to achieve the required reliability, then the upload would fail. Ultimately, I think it's a far more useful and flexible approach to the issue. A future Tahoe incarnation that tracks statistics on the availability of peers could estimate their reliability individually, and generate and distribute additional shares to attain the required file reliability. Heck, given aggregate data from a large enough number of grids, we might be able to estimate reliabilities for given hardware/OS configurations to feed into the mix. A share on a Debian stable machine with RAID-6 storage and 100 days of uptime is worth more than a share on a laptop running a copy of Vista which was re-installed last week. Actually implementing the code required to gather all of that sort of data and usefully synthesize it would be a significant project all on its own, but the point is that it would fit within the reliability-based framework, as would anything else we could dream up to make reliability estimates more accurate.

The main reason this is better, though, is that even if the file reliability is computed from estimates that are of somewhat questionable value, it still gives the end user a better way to know/specify what reliability they want to obtain than simply specifying "servers/shares of happiness" and fixed FEC parameters.

If this is of interest, I will be happy to write the code to compute M given K, r (the desired reliability threshold) and estimates of server reliability.
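
(To make the idea concrete, here is a rough sketch of that survival computation; the function name, the brute-force enumeration over surviving-server subsets, and the example reliability numbers are illustrative assumptions, not code from any attached patch.)

from itertools import combinations

def prob_file_survival(share_counts, survival_probs, k):
    # share_counts[i] = shares held by server i
    # survival_probs[i] = estimated probability that server i survives the interval
    # Returns the probability that the surviving servers collectively hold >= k shares.
    # Brute force over every subset of surviving servers; fine only for small grids.
    n = len(share_counts)
    total = 0.0
    for size in range(n + 1):
        for alive in combinations(range(n), size):
            if sum(share_counts[i] for i in alive) >= k:
                p = 1.0
                for i in range(n):
                    p *= survival_probs[i] if i in alive else 1.0 - survival_probs[i]
                total += p
    return total

print(prob_file_survival([10], [0.9], 3))          # all 10 shares on one 90%-reliable server: 0.9
print(prob_file_survival([1] * 10, [0.9] * 10, 3)) # one share on each of ten such servers: ~0.9999996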

comment:6 Changed at 2009-08-15T02:16:49Z by kevan

I was working on the documentation updates for this ticket earlier today, and:

A more useful concept is "servers of happiness" (the number of servers, the survival and correct function of which will guarantee that your file is available).

started bugging me.

As I understand it, the purpose of this ticket is to fix the issue that metacrob mentioned on the mailing list -- that Tahoe should not consider an upload successful if it cannot place shares on at least servers of happiness servers. The naive and, to me, obvious way to do this is to keep track of how many distinct servers we use when uploading a file, and then to fail if, when there are no more outstanding shares, we haven't used servers of happiness servers.

However, this technique doesn't seem to achieve the goal of the wording above.

Let's suppose that I have a file f, default encoding parameters k=3, M=10, happy=2, and that there are enough servers in my grid so that I can find a distinct server to accept each share. If I'm using tahoe patched with my naive solution above, the upload will succeed, but since every server only has one of 10 shares, and k=3, the correct functioning of 2 servers is not enough to guarantee the availability of my file.

An obvious enough solution to this is to ignore values of happy that are less than k. But there are use cases for values of happy that are less than k -- e.g., if I just want an offsite backup somewhere and don't care very much about reliability beyond that. Maybe we should just change the description of happy to remove any guarantees about reliability, leaving that instead for some of the tickets referenced in the first comment.

Hopefully this doesn't come off as me being pedantic about wording -- I'm just trying to make sure that I'm on the same page as you on what needs to be done with this ticket.

comment:7 Changed at 2009-08-15T02:52:45Z by zooko

Hm, good points, Kevan.

So, for the case of off-site backup where you don't care how many servers need to stay up in order to guarantee the availability of your data, then you should set servers_of_happiness=1, right?

And for the case that you have K=3 and M=10, then we could extend the upload peer selection algorithm so that if you have servers_of_happiness=2 then it has to put more than one share on each server, and in such a way that there are no two servers which have the same two shares. But instead we could make it so that your upload fails with the error "Didn't upload it in such a way that the survival of any 2 servers was sufficient for the survival of the file.", then you realize that if that's what you want you ought to set K=2 and re-upload.

How does that sound?

comment:8 Changed at 2009-08-15T22:38:51Z by kevan

I like the idea of extending the peer selection algorithm -- it'd be cool to be able to support choices of servers_of_happiness that are less than k, but I'm not sure how to do that. An algorithm for that probably wouldn't be too difficult, but maybe beyond the scope of this ticket? I'm fine with failing with an error, too.

As you say, what we're effectively saying with the backup scenario is that we don't care how many servers need to stay up to guarantee availability of data -- i.e.: we don't have a servers_of_happiness. Saying servers_of_happiness=1 is certainly a way of supporting that, but it seems confusing to adopt that as the standard -- we're special casing a configuration value to mean something when set to a particular value that it doesn't mean with other values (not to mention the fact that there might be users who would want the correct functioning of any one of the servers that they've uploaded shares to for a file to guarantee availability of that file). I think it'd be better if we supported the backup scenario by not performing the servers_of_happiness behavior unless the user has explicitly set a servers_of_happiness value in the config file (we could support this in code by using 0 as a default or something). This seems more consistent with what the user is saying with that value -- if they don't set it, they don't care, and if they do, then we can interpret it in a way that doesn't involve too many special cases.

comment:9 Changed at 2009-08-15T23:02:52Z by zooko

I'm confused -- I think that servers_of_happiness=1 is not a special case -- that it is the normal meaning of servers_of_happiness. I definitely think changing the upload algorithm is out of scope of this ticket.

Now I've gotta run to a friend's birthday so I'll re-read this when I return...

comment:10 Changed at 2009-08-15T23:03:43Z by zooko

Oh, I guess my idea leads us to conclude that if you are doing the "I'm doing a backup and I don't care how many servers" scenario then you have to set k=1, too.

Okay I'll think about this more later.

comment:11 Changed at 2009-08-16T03:40:34Z by kevan

Based on the definition given above, (i.e.: servers_of_happiness=n implies that the survival of n servers is sufficient for my file to be available) it doesn't seem unreasonable to interpret servers_of_happiness=1 to mean that if any one of the servers that my file is initially uploaded to is online, then my file is still available. This is not the same as saying that I do not care about servers_of_happiness -- solving the first interpretation would require placing k shares on each server that received any shares at all, while the second case is solved if the upload of the file is successful at all, regardless of how many shares end up on how many servers. Or at least that's how I interpret it. Hopefully that's clearer.

comment:12 Changed at 2009-08-16T20:09:42Z by zooko

Okay, I propose that for now we just check whether servers_of_happiness < k and if so error out with an error message explaining that we don't currently support that option.

That means that the "I don't care" use case that you describe is not currently supported.

How does that sound?

comment:13 Changed at 2009-08-16T23:37:15Z by kevan

Okay. This ticket is kind of a specific case of some of the tickets you mentioned up there, so we can always revisit the issue later, when solving them.

comment:14 Changed at 2009-08-17T00:30:02Z by kevan

hm. How would that work for users like metacrob, who have only one storage node? If they set servers_of_happiness=1, they get an error, and if they set servers_of_happiness=3 (assuming that k stays as a default), then they get another error (because there aren't enough servers to honor their preference). Maybe I'm misunderstanding something.

To fix this, I propose that if there is no happy config value set, we ignore it (by, e.g.,

DEP["happy"] = 0 # or some other constant that the rest of the program can interpret
...
DEP["happy"] = int(self.get_config("client", "shares.happy", DEP["happy"]))

or something similar), and do not enforce servers_of_happiness. If the value is set, and is less than k, we error out as you suggest.

comment:15 Changed at 2009-08-17T00:48:38Z by zooko

Wouldn't metacrob getting an error because there aren't enough servers to honor their preference be exactly what we want?

I think that's what metacrob is asking for.

What do you think?

I suppose we should probably set k=3, m=10, servers_of_happiness=7 as the default.

comment:16 Changed at 2009-08-17T02:24:31Z by kevan

Yes -- sorry, I didn't think that comment through.

What I meant to say is that in fixing the problem for metacrob (in that tahoe will report failure for his grid), we end up making tahoe unusable for people who happen to have small grids (i.e.: one or two or whatever storage nodes, where whatever < k) and don't especially care about the behavior that we're implementing. That's why I think it is a good idea to support the "don't care" behavior somehow -- it doesn't seem particularly hard to support, and not doing it potentially makes tahoe a lot harder to use for some users. Hopefully that's clearer.

comment:17 follow-up: Changed at 2009-08-17T02:45:35Z by zooko

Okay, I understand what you mean. This hypothetical use case is in a sense the exact opposite of metacrob's. He wants Tahoe-LAFS to give an error when there are only one or two storage servers; this other user would want Tahoe-LAFS to silently succeed in the same case. If we implement servers_of_happiness as described so far in this ticket -- with no special case and with no change to the upload peer selection algorithm -- then metacrob's use case will be satisfied by the default settings (k=3, servers_of_happiness=7, m=10), and the other user would have to change their k and their servers_of_happiness to both be <= the number of servers that they actually have. Is that right?

We could implement metacrob's use case right now -- without the "servers_of_happiness = dont_care" feature -- and then wait until we have more information about how it works, for example a bug report from someone who expected upload to work with the default parameters when they had only one or two servers.

What do you think of that?

comment:18 in reply to: ↑ 17 Changed at 2009-08-17T03:46:04Z by kevan

Replying to zooko:

Okay, I understand what you mean. This hypothetical use case is in a sense the exact opposite of metacrob's. He wants Tahoe-LAFS to give an error when there are only one or two storage servers; this other user would want Tahoe-LAFS to silently succeed in the same case. If we implement servers_of_happiness as described so far in this ticket -- with no special case and with no change to the upload peer selection algorithm -- then metacrob's use case will be satisfied by the default settings (k=3, servers_of_happiness=7, m=10), and the other user would have to change their k and their servers_of_happiness to both be <= the number of servers that they actually have. Is that right?

Yes, exactly.

We could implement metacrob's use case right now -- without the "servers_of_happiness = dont_care" feature -- and then wait until we have more information about how it works, for example a bug report from someone who expected upload to work with the default parameters when they had only one or two servers.

What do you think of that?

That sounds fine.

I'm also fine with setting the defaults to be k = 3, m = 10, happy = 7 -- it's pretty easy to change them if they end up being poor choices.

comment:19 Changed at 2009-08-17T06:43:39Z by kevan

A summary of the discussion, for those of you following along at home (zooko, feel free to add to this if you think I've missed something):

The Problem

Tahoe-LAFS will store a file on a grid in an unreliable way (specifically, at least for this bug report, uploading everything associated with a file f to only one storage node) without reporting anything to the user.

The solution

We will change shares.happy in tahoe.cfg to mean servers_of_happiness.

servers_of_happiness means two things:

  1. If a file upload is successful, then shares for that file have gone to at least servers_of_happiness distinct storage nodes.
  2. If a file upload is successful, then the uploaded file can be recovered so long as at least servers_of_happiness of the storage nodes that received shares in the initial upload survive and function correctly.

Both of these conditions are necessary to solve metacrob's use case.

He should be able to tell Tahoe-LAFS that he does not consider an upload successful unless shares from that upload were distributed across at least n servers (> 1 for the bug report, but in general): the first condition addresses this.

This is not enough to solve metacrob's use case, though -- if he has uploaded shares from a file to 5 servers, but cannot recover that file unless one particular server is online and working, then he is no better off when that server fails than he would be if that server held every share of his file. The second condition addresses this.

If we remove the first condition, then servers_of_happiness is satisfied if the file is uploaded entirely to only one server (since 1 < "servers of happiness"): clearly, this is undesirable -- indeed, it is the exact behavior mentioned in the bug report.

Implementation issues

Supporting this in Tahoe-LAFS is fairly trivial if servers_of_happiness is greater than or equal to k, the number of distinct shares generated from a file f necessary to recover f: the first condition (servers_of_happiness distinct servers having a distinct share of a file f) implies the second (servers_of_happiness distinct servers being enough to reconstruct f), because no more than servers_of_happiness distinct pieces of f are necessary to reconstruct f.

Supporting servers_of_happiness values less than k is harder -- the first condition no longer implies the second. To see why this is, consider uploading a file f onto a grid of 10 well-behaved storage nodes with encoding parameters (happy=2, k=3, m=10). Suppose that each storage node accepts one share. Then each pair of server nodes have only two distinct shares between them -- not enough to reconstruct f.

We could support these values by placing more than one share on some servers, in such a way that the union of the shares held by any servers_of_happiness servers has cardinality at least k, but this is tricky, and, for the moment, beyond the scope of this ticket. As a stop-gap, Tahoe-LAFS will fail with an error if asked to upload a file when a user has servers_of_happiness set to a value less than k.

The proposed default encoding parameters for servers_of_happiness are (k=3, happy=7, m=10). One consequence of these defaults and the stop-gap described above is that users of small grids (where there are one or two storage nodes) will by default not be able to upload files unless they change their ks to 2 or 1. If bug reports surface about this decision, we'll revisit it.

(I'm hoping that there aren't any more points to discuss in there -- my goal in writing it was to summarize the discussion so that I know what I need to do in fixing this ticket)

comment:20 follow-up: Changed at 2009-08-17T12:46:31Z by terrell

What is the problem/shortcoming with forcing k<=happy<=m?

And make the error message simply state that this has to be true for Tahoe to succeed?

Document some easily-expected example settings and ramifications:

k=3, happy=7, m=10 -- default

k=1, happy=1, m=X -- not recommended, no redundancy

k=2, happy=2, m=X -- friendnet, backup guaranteed (minimal protection)

k=3, happy=3, m=X -- friendnet, backup guaranteed (better protection)

etc...

comment:21 in reply to: ↑ 20 ; follow-ups: Changed at 2009-08-17T14:16:13Z by swillden

Replying to terrell:

k=1, happy=1, m=X -- not recommended, no redundancy
k=2, happy=2, m=X -- friendnet, backup guaranteed (minimal protection)
k=3, happy=3, m=X -- friendnet, backup guaranteed (better protection)

Without specifying m, you can't really be sure that the "protection level" statements are true. For example, with k=2, happy=2, m=3, only one of the two guaranteed servers has enough shares to reconstruct the file, so if that server fails, you've lost the file, which I would call "single point of failure", not "minimal protection".

If happy == k, you need m > k + 1 or you have a single point of failure.

Perhaps it would be better to measure happiness by "number of servers that can fail without losing the file"? I think that makes the implications of setting the parameter much easier for users to understand, and it's not complicated for Tahoe to compute: Just generate the peer list, assign share counts to the peers, sort by share count (ascending), and sum up the list until total >= k. The number of remaining, unsummed, servers is the maximum that can be lost without losing the file.
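
(A minimal sketch of that procedure; the function name and the -1 return for an unrecoverable file are my own conventions, not anything from Tahoe.)

def max_losable_servers(share_counts, k):
    # Sort ascending and sum the smallest counts until we reach k shares; every
    # server beyond that prefix could be lost and the prefix alone still recovers
    # the file. Returns -1 if even all servers together hold fewer than k shares.
    counts = sorted(share_counts)
    total = 0
    for summed, count in enumerate(counts, start=1):
        total += count
        if total >= k:
            return len(counts) - summed
    return -1

print(max_losable_servers([10], 3))      # 0 -- the single-server case from this ticket
print(max_losable_servers([1] * 10, 3))  # 7 -- one share on each of ten servers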

Obviously in the case that spawned this ticket, metacarob's max-losable would be zero, a very bad situation for reliability. Setting max-losable to one would provide a guarantee of minimal redundancy, setting it to two or three would provide good reliability.

comment:22 follow-up: Changed at 2009-08-17T14:33:20Z by zooko

""" servers_of_happiness means two things:

  1. If a file upload is successful, then shares for that file have gone to at least servers_of_happiness distinct storage nodes.
  2. If a file upload is successful, then the uploaded file can be recovered so long as at least servers_of_happiness of the storage nodes that received shares in the initial upload survive and function correctly.

"""

Hm.... I see that I have not understood what I want all this time.

I guess what I've really been wanting all this time is "The Erasure Coding Property" as applied by counting servers instead of by counting shares. That is:

k=3, m=10 means that after this upload there will ideally be 10 distinct servers, the survival of any 3 of which guarantees my file. The addition of servers of happiness h=7 means that if, after the upload, there are 7 distinct servers, the survival of any 3 of which guarantees my file, I will consider the upload a success, but if there are only 6 I will consider the upload a failure.

So I think what I really want is to make servers_of_happiness mean "The number of servers in set X" and k mean "Any subset of X of size k is sufficient to recover the file".

I guess that means servers_of_happiness is "The minimum number of servers (such that any k of those servers is sufficient) to consider this upload a success.". Is that right?

Kevan: I really like your method of working in which you thoroughly document it first, then write the tests, then the code (then someone reviews it for you). It reminds me of Brian's method of working.

Trel: It makes sense to me that k <= h <= m, but at the moment I'm not sure my intuitions about these things are right at all!

comment:23 in reply to: ↑ 22 Changed at 2009-08-17T20:39:37Z by swillden

Replying to zooko:

So I think what I really want is to make servers_of_happiness mean "The number of servers in set X" and k mean "Any subset of X of size k is sufficient to recover the file".

I guess that means servers_of_happiness is "The minimum number of servers (such that any k of those servers is sufficient) to consider this upload a success.". Is that right?

I'm not sure how this k interacts with the number-of-shares k. Are you assuming implicitly that there is only one share per server? I think there are good reasons to have more than one share per server (to maximize download performance while still assuring reliability), in which case the number-of-shares k must be distinct from, and larger than, the number-of-servers k. So it appears that this formulation of the approach requires two parameters: minimum-recoverable-server-set-size (k_server) and servers-of-happiness (h). These are in addition to the FEC parameters, k_share and m.

The more I think about this, the more confusing servers-of-happiness becomes as a measure or selector of reliability. But then, I'm easily confused :-)

comment:24 in reply to: ↑ 21 Changed at 2009-08-17T22:16:05Z by davidsarah

Replying to swillden:

Perhaps it would be better to measure happiness by "number of servers that can fail without losing the file"? I think that makes the implications of setting the parameter much easier for users to understand, and it's not complicated for Tahoe to compute: Just generate the peer list, assign share counts to the peers, sort by share count (ascending), and sum up the list until total >= k. The number of remaining, unsummed, servers is the maximum that can be lost without losing the file.

This sounds like the right thing to me.

comment:25 follow-up: Changed at 2009-08-18T15:00:21Z by zooko

I'm not sure how this k interacts with the number-of-shares k. Are you assuming implicitly that there is only one share per server?

Treat that as an "implementation detail" that Tahoe-LAFS can optimize as it likes, as long as the guarantee that it offers to the user still holds: that there are h servers, the survival of any k of which will guarantee the survival of the file. (Where there will be m total servers if possible, but at least h servers, or else the upload fails.)

That seems to be what users like metacrob think that we are already offering, and it is also the property that I would like from my backups. It is "the erasure-coding property" as applied to servers, not to shares. (Under the hood, of course, Tahoe-LAFS has to produce shares and then decide how to upload them in order to achieve this property and also to optimize other things like up- and down- transfer performance.)

(Also, I think that it can be implemented as a small patch to the current upload algorithm.)

Shawn: what do you think?

Kevan: you're now my test to see whether I'm making sense about this topic. Is the above coherent? :-)

comment:26 in reply to: ↑ 25 Changed at 2009-08-18T18:03:27Z by swillden

Replying to zooko:

I'm not sure how this k interacts with the number-of-shares k. Are you assuming implicitly that there is only one share per server?

Treat that as an "implementation detail" that Tahoe-LAFS can optimize as it likes, as long as the guarantee that it offers to the user still holds

I like it. Not as much as a statistical reliability threshold, but I like it.

However, I think that it's worth working out how to implement the "implementation detail".

So, to be clear, this is my understanding of your proposal:

We still have three configured parameters, k, h, m which mean, respectively, "Minimum number of servers required to recover a file", "Servers of Happiness (aka the repair threshold)" and "Maximum number of servers to use".

FEC encoding parameters k_e, m_e must be chosen by Tahoe such that it's possible to satisfy k, h, m.

My suggested algorithm for selecting k_e is to set it to the number of servers chosen to receive shares, n, which must satisfy k <= h <= n == k_e <= m. Setting k_e to the number of servers chosen will maximize download performance when all servers are available (assuming all servers have similar performance -- and it could do a reasonable job even if they don't).

m_e must then be chosen such that when those m_e shares are equally distributed across the n servers, any k-server subset of them has sufficient shares to recover the file. That is, each of the n servers must have at least k_e // k shares. If k_e / k is an integer, that means that m_e = n * (k_e / k).

If k_e / k is not an integer, the minimal m_e is a little more complicated. In that case, most of the servers must have 1 + k_e // k shares, but k-1 of them only need k_e // k shares. So m_e = (n-k-1)*(1 + k_e // k) + (k-1)*(1 + k_e // k). Rearranging gives m_e = n - k + 1 + n * (k_e // k).

In Python code:

def compute_num_shares(k, n, k_e):
    # Smallest m_e such that, with m_e shares spread as evenly as possible over
    # n servers, any k of those servers hold at least k_e shares between them.
    if k_e % k == 0:
        return n * k_e // k
    else:
        return n - k + 1 + n * (k_e // k)

So, for k=3, n=10, and assuming we set k_e = n to maximize download performance, k_e = 10, m_e = 38. Eight of the 10 servers will get four shares and two will get three shares. The fewest shares that any three servers can have is 10, but three-server sets may have 11 or (most commonly) 12 shares.
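
(A small sketch reproducing that example -- spread the 38 shares as evenly as possible over 10 servers and confirm that every 3-server subset holds at least k_e = 10 shares; the helper names are mine.)

from itertools import combinations

def distribute_evenly(m_e, n):
    # give each of n servers m_e // n shares, plus one extra share each for the remainder
    base, extra = divmod(m_e, n)
    return [base + 1] * extra + [base] * (n - extra)

counts = distribute_evenly(38, 10)
print(counts)                                              # [4, 4, 4, 4, 4, 4, 4, 4, 3, 3]
print(all(sum(c) >= 10 for c in combinations(counts, 3)))  # True: any 3 servers hold >= k_e shares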

This multiplication of shares may even make it easy to address performance mismatches between servers. The client can request one share from each of 10 servers, but if one of the servers is very fast and completes first, while another is very slow, the client could request another share from the fast server.

One downside of this approach is that it makes the expansion factor somewhat unpredictable, since it's dependent on the number of servers available. More servers means more expansion -- also more reliability.

For completeness, we should also define what this approach means for repair. An algorithm for deciding if a file needs to be repaired is:

  1. Sort the available servers by number of shares, ascending.
  2. Look at the first k servers in the list. If they collectively have fewer than k_e shares, remove the first one and repeat this step.
  3. If len(list) >= h, the file is happy.
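
(A literal transcription of those three steps, as a sketch; the function name and the assumption that k <= h are mine.)

def file_is_happy(share_counts, k, k_e, h):
    # share_counts: number of shares held by each currently-available server
    servers = sorted(share_counts)                    # step 1: ascending by share count
    while len(servers) >= k and sum(servers[:k]) < k_e:
        servers.pop(0)                                # step 2: drop the poorest server, retry
    return len(servers) >= h                          # step 3

print(file_is_happy([4] * 8 + [3] * 2, k=3, k_e=10, h=7))  # True  -- the k=3, k_e=10, m_e=38 example
print(file_is_happy([38], k=3, k_e=10, h=7))               # False -- everything on a single server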

comment:27 Changed at 2009-08-18T19:20:29Z by davidsarah

If k_e / k is not an integer, the minimal m_e is a little more complicated. In that case, most of the servers must have 1 + k_e // k shares, but k-1 of them only need k_e // k shares. So m_e = (n-k-1)*(1 + k_e // k) + (k-1)*(1 + k_e // k). Rearranging gives m_e = n - k + 1 + n * (k_e // k).

The first expression for m_e is wrong, even though the rest of the argument and the rearranged version is correct. What you mean is:

... So m_e = (n-(k-1))*(1 + k_e // k) + (k-1)*(k_e // k). Rearranging gives m_e = n - k + 1 + n * (k_e // k).

comment:28 follow-up: Changed at 2009-08-18T19:24:55Z by swillden

That's right. I meant to type n-k+1, but typed n-k-1.

comment:29 in reply to: ↑ 28 Changed at 2009-08-18T19:26:25Z by swillden

Replying to swillden:

That's right. I meant to type n-k+1, but typed n-k-1.

And added an extra 1 + in the second term. Typos galore :-)

comment:30 in reply to: ↑ 21 ; follow-up: Changed at 2009-08-19T06:32:29Z by kevan

zooko: I think you're being clear + coherent.

To summarize the recent discussion.

Share uploading and encoding will be controlled by five parameters.

The three that the user can directly specify are:

  • k: The number of servers that must survive for the file to survive (this is an upper bound on how many servers are needed, so another way of thinking of it is "I should be able to retrieve my file as long as any k of the original servers that received parts of it continue to function").
  • m: The ideal number of distinct servers (or: an upper bound on distinct servers)
  • h: A lower bound on distinct servers: at least h servers must receive shares in order for an upload to be considered successful.

Two others, k_e and m_e (wouldn't it be cool if you could use LaTeX in trac tickets?), are calculated at upload time by tahoe based on user-defined parameters and grid data (namely, n, the number of servers receiving shares). These are defined as:

  • k_e = n
  • m_e = n * (k_e / k) if k divides k_e, otherwise
  • m_e = n - k + 1 + n * (k_e // k)

We impose the constraint that k <= m, that h <= m, and that h, k, m > 0. We do not say that h >= k because a user who simply wants a backup and doesn't care about the specific dispersal or replication of their shares would say h=1 (i.e., the upload is successful if the shares are somewhere). Additionally, we will say that the upload is a failure if n < h.

Does anyone disagree with my summary?

I like this.

It elegantly supports the use case of someone who doesn't care much about their files beyond the fact that they're on the grid somewhere by decoupling h from most of the logic (since it remains, unless I'm paraphrasing badly, only in checks at the end of file upload and in the repairer) -- they'd set (k=10, h=1, m=10), and be on their way. It also gives metacrob the tools he needs to support his use case.

What sort of defaults seem reasonable for this new behavior?

I feel a lot better about this ticket, now -- it will be pretty cool once we get it working, thanks to all the new feedback. :-)

comment:31 follow-ups: Changed at 2009-08-19T15:43:23Z by zooko

Kevan:

I think Shawn and you and I (at least) agree on the desired result -- "k-out-of-m" servers is the desired goal of upload and at least "k-out-of-h" servers is the least reliable upload that still counts as a success.

I think I like your and Shawn's proposed algorithm for how to choose k_e and m_e and to map shares onto servers. However, it should be a separate ticket from this one. The reason is that there is (I think) a very easy way to implement this ticket without implementing that improved algorithm. That is: let k_e = k, m_e = m (the same as it is now), then run the current algorithm for mapping-shares-to-servers, then check if the result satisfies the new criteria for success.

This will be much easier to implement because the algorithm for mapping shares to servers can be complex. If I recall correctly, it currently adaptively deals with servers which fail to respond to the initial "Will you hold these shares?" requests, and with servers that respond with "Yes, but actually I already have share numbers 1 and 4.", and with servers that respond with "Yes, I'll hold that one, Yes, I'll hold that one, No, I'm full, go away.". (I could be wrong about some of those details.)

So if this ticket is just about changing the criteria of upload success and not changing the share-to-server-mapping algorithm then you can finish this ticket sooner and we can have greater confidence that it didn't cause regressions. Note that in the "normal" cases that I can think of, the current share-to-server-mapping algorithm will meet the new criteria, except of course in the case of fewer than h servers which can accept shares.

By the way, I'd like to have Brian Warner review this ticket once we've finalized the design.

comment:32 in reply to: ↑ 30 Changed at 2009-08-19T16:24:05Z by swillden

That's a good summary, kevan, but my method for calculating k_e and m_e assumes that n >= k and k_e >= k. If h < k, then presumably there are situations where n = k_e < k.

It's kind of odd to specify that k is the number of servers needed for the file to survive, but then to say that the we're happy even if there are less than k servers available. It seems like if you want to allow that, then what you really want is the "shares of happiness" algorithm, not the "servers of happiness" algorithm.

Also, if there's only one server available, is there any sense in doing FEC coding at all? Or, at least, is there any sense in setting k_e > 1 or m_e > 1? I suppose if there were a mechanism to redistribute the shares later it would. And Zooko's observation (in another ticket) that m_e can be cheaply increased would imply that it makes sense to set k_e = m_e = k, and then let a future repair cycle increase m_e when more servers are available. So, perhaps the k_e, m_e selection algorithm should have as a special case that k_e = m_e = k when n = 1. That would result in no FEC expansion when only a single server is being used, which is nice.

But what about when n > 1, n < k? k_e = m_e = k would result in lower reliability than only using one server, which seems like a bad idea. For n=2, any allocation that doesn't allow either server to reconstruct the file is worse than using only one server. For that case, k_e = k, m_e = k_e * n would make sense, which covers the n=1 case as well. For larger values of n (but still n < k), what to do is less clear. Suppose the user specified k = 10, h = 1, m = 10, as you suggested, and there were nine servers available. k_e = k is fine in that case, but m_e = k_e * n would result in FEC expansion of 9!

In that case, what the user appears to be asking for, with k = m, is distribution with no redundancy at all, meaning that if any of the servers fails, the file is lost. If that's what they really want, then k_e = m_e = k is fine. But what if they specified k = 5, h = 1, m = 10? That would indicate a desire for some redundancy, but a willingness to accept no redundancy rather than failing the upload.

Implicit in the choices of k and m is an acceptable FEC expansion amount and a desired redundancy level. Perhaps what makes the most sense is to try to honor the indicated FEC expansion limit, capping m_e at k_e * m / k.

So the rule would be, if n < k, then k_e = k, m_e = min(k_e * n, k_e * m / k). Of course, if n < h, then the upload should fail.

I'll attach a file with some code that implements this, and computes the reliability for various options. It seems to behave reasonably for a wide variety of inputs.
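
(A rough standalone sketch of the selection rule just described, restating the comment:26 formula so it runs on its own; the names, and the assumption that the n < h failure check happens elsewhere, are illustrative.)

def compute_num_shares(k, n, k_e):
    # the comment:26 formula, restated here so the sketch runs standalone
    if k_e % k == 0:
        return n * k_e // k
    return n - k + 1 + n * (k_e // k)

def choose_fec_params(k, m, n):
    # n = number of servers willing to accept shares; the n < h failure
    # check is assumed to have happened before this point
    if n >= k:
        k_e = n                             # one share per server maximizes parallel download
        m_e = compute_num_shares(k, n, k_e)
    else:
        k_e = k                             # too few servers for the full any-k-servers guarantee
        m_e = min(k_e * n, k_e * m // k)    # cap expansion at the user's implied m/k
    return k_e, m_e

print(choose_fec_params(3, 10, 10))  # (10, 38)
print(choose_fec_params(3, 10, 2))   # (3, 6)   -- each of the two servers gets a full k shares
print(choose_fec_params(10, 10, 9))  # (10, 10) -- k == m: no redundancy was asked for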

Changed at 2009-08-19T16:30:03Z by swillden

Implementation of proposed k_e, m_e selection algorithm, per comment 32

comment:33 in reply to: ↑ 31 Changed at 2009-08-19T17:11:58Z by swillden

Replying to zooko:

I think I like your and Shawn's proposed algorithm for how to choose k_e and m_e and to map shares onto servers. However, it should be a separate ticket from this one. The reason is that there is (I think) a very easy way to implement this ticket without implementing that improved algorithm. That is: let k_e = k, m_e = m (the same as it is now), then run the current algorithm for mapping-shares-to-servers, then check if the result satisfies the new criteria for success.

That seems like it works fine. The only major advantage of the "new" algorithm is increasing parallelism for downloads, which is a separate issue, and should be a separate ticket (I'll open one). In fact, I think there's a much simpler way to increase parallelism.

To be clear, the "new criteria for success" are, I believe:

  1. Any k-server subset of the n successful servers has sufficient shares to construct the file. If k = k_e, n >= k, this is trivially guaranteed to be satisfied. If n < k, then we don't have the FEC survivability guarantee, but survivability degrades fairly gracefully.
  2. n >= h.

Where n is the number of servers that receive at least one share, of course.

comment:34 Changed at 2009-08-22T19:41:22Z by warner

I just wanted to point out that metacarob's original problem could be solved much more simply:

 proposed_servers = set([server.id for server in proposed_sharemap.values()])
 if proposed_servers == set([myid]):
   raise TalkingToJustMyselfError("this isn't very useful for backup purposes")

I'd be cautious about a constraint-oriented algorithm: the implementation can get very tricky. The approach described above seems to avoid that rabbit hole, by merely testing an assertion *after* peer-selection has taken place (and presumably again after the shares have finished uploading, in case you lost servers during the upload).

The terminology is confusing to me, because tahoe uses "k-of-N" encoding everywhere internally. I know that this ticket got started with the alternate notation, but when it comes time to implement this, could you find a different name for the new "n" (number of servers which end up holding a share) value? I'd suggest leaving "k" and "N" referring to shares, and introduce a new "k_s" and "N_s" to refer to servers. In the code (as opposed to Trac discussions), verboser is better (up to a point), so you could use "k_servers" and "N_servers" to avoid the ambiguity of the "s" prefix.

I'm not sure that changing the encoding parameters on the fly is a great idea: it may surprise people who thought they had already made their decision about the expansion-vs-reliability tradeoff. OTOH, I suppose my attitude reflects my bias towards stable grids, where the number of servers is supposedly known in advance, and doesn't change very much.. which may not be the case. I imagine that Shawn's efforts will result in a web page on the client node which has knobs for encoding parameters and displays probable reliability properties as an output, to help users make those decisions. Or the other way around, where the user dials in the reliability requirements, and is informed of the encoding strategy (and resulting expansion factor).

comment:35 in reply to: ↑ 31 Changed at 2009-08-23T07:19:20Z by kevan

Replying to zooko:

Kevan:

I think Shawn and you and I (at least) agree on the desired result -- "k-out-of-m" servers is the desired goal of upload and at least "k-out-of-h" servers is the least reliable upload that still counts as a success.

I think I like your and Shawn's proposed algorithm for how to choose k_e and m_e and to map shares onto servers. However, it should be a separate ticket from this one. The reason is that there is (I think) a very easy way to implement this ticket without implementing that improved algorithm. That is: let k_e = k, m_e = m (the same as it is now), then run the current algorithm for mapping-shares-to-servers, then check if the result satisfies the new criteria for success.

This will be much easier to implement because the algorithm for mapping shares to servers can be complex. If I recall correctly, it currently adaptively deals with servers which fail to respond to the initial "Will you hold these shares?" requests, and with servers that respond with "Yes, but actually I already have share numbers 1 and 4.", and with servers that respond with "Yes, I'll hold that one, Yes, I'll hold that one, No, I'm full, go away.". (I could be wrong about some of those details.)

So if this ticket is just about changing the criteria of upload success and not changing the share-to-server-mapping algorithm then you can finish this ticket sooner and we can have greater confidence that it didn't cause regressions. Note that in the "normal" cases that I can think of, the current share-to-server-mapping algorithm will meet the new criteria, except of course in the case of fewer than h servers which can accept shares.

By the way, I'd like to have Brian Warner review this ticket once we've finalized the design.

This sounds fine. Does anyone have any objections to me starting along this path?

comment:36 Changed at 2009-08-25T19:20:48Z by zooko

  • Owner set to kevan

Kevan: go for it!

Everyone: please move discussion of improving FEC parameters and/or server selection to #791 (Optimize FEC parameters), and leave this ticket to track Kevan's progress on documenting, testing, and implementing the simpler goal. That goal is:

  • Use the user-configured FEC parameters k, n, and h ("happiness"), just as Tahoe-LAFS v1.5 does.
  • Use the current server-selection algorithm which is already implemented in Tahoe-LAFS v1.5.
  • After the server-selection has chosen the initial set of servers to use, test whether that selection would satisfy the criterion of "Reliability level is at least k-out-of-h servers.". If it wouldn't, abort the upload.
  • Whenever a server fails to accept shares which the client had intended to upload to that server, such as by disconnecting during an upload or returning an error message instead of storing a block, then recalculate whether the new set of servers would satisfy the criterion. If not, abort the upload.
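
(A sketch of what that post-selection check might look like, assuming h >= k as agreed above; the sharemap structure, function name, and brute-force subset test are illustrative and not the algorithm that was eventually implemented.)

from itertools import combinations

def selection_is_happy(sharemap, k, h):
    # sharemap: server -> set of share numbers assigned to (or already held by) it.
    # Criterion: shares land on at least h servers, and any k of those servers
    # together hold at least k distinct shares.
    servers = [shnums for shnums in sharemap.values() if shnums]
    if len(servers) < h:
        return False
    return all(len(set().union(*subset)) >= k
               for subset in combinations(servers, k))

print(selection_is_happy({"s1": set(range(10))}, k=3, h=7))               # False -- only one server
print(selection_is_happy({"s%d" % i: {i} for i in range(10)}, k=3, h=7))  # True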

comment:37 Changed at 2009-08-25T22:41:59Z by kevan

Cool.

I've written some tests for the new behavior, and implemented the behavior itself in the places that I think it needs to be implemented (some checks in upload.py, and one in encode.py, as well as some jiggling to handle the dual meanings of k and m) -- the roadblock now is improved documentation, and changing some unit tests that were written with the old happy behavior in mind (I think I have three of those left to fix).

comment:38 Changed at 2009-08-25T22:42:18Z by kevan

  • Status changed from new to assigned

comment:39 Changed at 2009-09-01T14:54:10Z by zooko

Kevan: how is it going? Do you need help improving the documentation?

comment:40 Changed at 2009-09-01T17:38:22Z by kevan

I'm pretty much done, aside from two issues.

  1. The CiphertextDownloader class in download.py is responsible for setting the encoding parameters in the object that it saves data to -- this includes the value for happy, which (from what I can tell) isn't stored with the file on the grid. Right now, this is set to the number of shares retrieved from the grid -- this works fine for the current happy behavior, but breaks with the new happy behavior. The seemingly obvious way to fix this would be to re-read the configuration file and use that, but that seems to make CiphertextDownloader a murkier abstraction than necessary. Any ideas for a better way to do this?
  2. I'm running into some writer's block with the documentation. Maybe I should just post what I have.

Other than that, everything seems fine -- the new behavior is there and working, there are tests for it, and all of the unit tests that relied on the old behavior have been fixed to pass with the new behavior.

comment:41 Changed at 2009-09-02T03:12:23Z by kevan

Okay, there are three files, containing a few patches.

  • behavior.txt implements the behavior that we were discussing (at least that part of it that isn't relegated to #791). The bulk of this consists of changes in upload.py and encode.py, with a minor change to download.py (which is list item 1 above).
  • docs.txt, which updates configuration.txt and interfaces.py to reflect servers_of_happiness.
  • tests.txt, which updates a bunch of existing tests to work with the new semantics, and also adds a couple of new ones.

(apologies for the amount of time that it has taken for a simple task; I've been on vacation for the past week or so, and haven't had as much time to work on this as I would have wanted)

comment:42 Changed at 2009-09-02T15:31:09Z by zooko

  • Milestone changed from undecided to 1.5.1

Putting this in 1.5.1 Milestone (it looks like it is going to be done by 1.5.1 release and I don't want to hold it out of trunk).

comment:43 Changed at 2009-09-02T17:04:12Z by zooko

Hm, actually maybe we want to release a Tahoe-LAFS named "v1.5.1" which doesn't have any user-visible behavior changes/feature changes aside from bugfixes, in which case maybe we should hold this patch (safely stored in this ticket), until after that release. Please don't put off reviewing this patch, though, if you were planning to review this patch, do it now!

comment:44 Changed at 2009-09-10T06:34:02Z by kevan

I'm happy with where the documentation is for this. Perhaps someone can proofread?

I think the only stumbling block left here is the servers_of_happiness in CiphertextDownloader -- I think it's probably best to set this to whatever the user defined servers_of_happiness is, but I can't think of a clean way to do that. Perhaps someone with a fresh perspective can look and see if there's an obvious way that I've missed?

comment:45 Changed at 2009-09-10T10:20:09Z by kevan

(maybe writing up the problem in detail will help me to think of a solution)

If I understand the code, Tahoe-LAFS (in checker.py) defines an immutable file node as healthy if all of the shares originally placed onto the grid (m) are still available (for some definition of available, depending on the verify flag), unhealthy if fewer than m but at least k shares are still available, and unrecoverable if fewer than k shares are still available.
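
(Paraphrased as a sketch -- these names are mine, not checker.py's actual code.)

def classify(available_shares, k, m):
    # m = total shares originally placed, k = shares needed to recover the file
    if available_shares >= m:
        return "healthy"
    elif available_shares >= k:
        return "unhealthy but recoverable"
    return "unrecoverable"

print(classify(10, 3, 10), classify(5, 3, 10), classify(2, 3, 10))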

In src/allmydata/interfaces.py@4045#L1628, ICheckable defines the method check_and_repair, which tells the receiving object to check and attempt to repair (if necessary) itself: this interface and method are implemented by FileNode, which represents an immutable file node on the Tahoe-LAFS grid.

The check and repair process proceeds something like this (again, if I understand the logic):

  1. A Checker is instantiated and started on the verifycap of the FileNode.
  2. If the results of the Checker indicate that the FileNode is in need of repair, and that the FileNode can be repaired, a Repairer is instantiated and started.
  3. The results of the repair operation are reported back to the caller.

(I know there's a bit of hand waving in there, but hopefully I got the gist of it)

The repairer (src/allmydata/immutable/repairer.py@4045#L14) is pretty simple: it downloads the content associated with the FileNode in the normal way using a DownUpConnector as a target, and then uploads the DownUpConnector in the normal way. Since DownUpConnector implements IEncryptedUploadable, it is responsible for providing the encoding and uploading operations with encoding parameters, including servers_of_happiness.

The problem that this long-winded comment is getting to is here: src/allmydata/immutable/download.py@4048#L869. The CiphertextDownloader sets the happy encoding parameter of its target to be k. Since k can be bigger than servers_of_happiness, this isn't good. In most cases, the accompanying comment is right; in the case of a file repair, it isn't, because the encoding parameters stored in the DownUpConnector are used by the encoding + upload process.

I think that the CiphertextDownloader should ideally be following the user's configured happy value instead of setting it to something else. Where I'm stuck is in figuring out a way to tell it what happy should be. Some ideas:

  • Parse the configuration file: this is straightforward, but ugly, because it duplicates the configuration file parsing code, and duplicates it in a part of the program that doesn't really have anything to do with parsing the configuration file.
  • Pass it as a parameter to the Repairer, which then passes it as a parameter to CiphertextDownloader, which then uses it -- but I don't see where in FileNode I'd get happy.

In any case, that's basically the one stumbling block that I'm aware of in this ticket.

comment:46 Changed at 2009-09-14T13:14:31Z by zooko

Hm. I don't know the answer off the top of my head. There seem to be too many ways to configure the encoding parameters in upload.py. Perhaps some of them are vestigial? I'm going to assign this ticket to Brian. Brian: please just tell Kevan your opinion about how the repairer should learn the user's configuration of happy and then assign this ticket back to Kevan. Thanks!

comment:47 Changed at 2009-09-14T13:20:02Z by zooko

  • Owner changed from kevan to warner
  • Status changed from assigned to new

comment:48 Changed at 2009-09-22T17:47:07Z by zooko

  • Owner changed from warner to kevan

Brian doesn't know the answer to this off the top of his head, either. Help! Assigning this ticket to "nobody".

Oh wait, maybe repairer *doesn't* need to know servers of happiness. Maybe we should set it to 0. Repair is currently intended to finish regardless of how well it is doing and then report to the user how well it did, so then it should just report the resulting servers of happiness when it is done. That would, I think, be analogous to the way the current repairer works with regard to shares of happiness (which by the way I wrote). Kevan: does that make any sense?

Assigning this ticket to Kevan.

comment:49 Changed at 2009-09-22T18:38:56Z by kevan

It seems consistent enough with the current repairer, which (at first glance) does what you say.

So, if I understand your point: the repairer's job is to attempt to repair an unhealthy file, and report its results -- presumably other code, if it desires, can then determine whether the repair was successful or not. I think that this makes sense, but the repairer (or at least the code I've skimmed before leaving for work) has an internal definition of what is healthy -- a file with all of its shares intact -- and what isn't. This is necessary, because otherwise the repairer wouldn't know when to attempt to repair a file, but I'd feel better if the repairer tried to use the same definition of healthy as the user, I guess.

I realize that the same argument applies to shares_of_happiness, too (though perhaps less so -- though the repairer doesn't respect shares_of_happiness directly, it is likely to be indirectly respected in the sense that the repairer's threshold of health is probably stricter than shares_of_happiness, while the repairer's threshold of health does not care about the distribution of shares): it is really more a gripe with the way the repairer works in general than an issue specific to this ticket. Given that, I'm fine with that solution to this ticket.

(apologies if this is less than clear: I wrote this on my way out the door to work :)

comment:50 Changed at 2009-09-24T04:28:55Z by kevan

I'm updating the behavior.txt patch to implement Zooko's suggestion.

comment:51 Changed at 2009-09-24T05:12:01Z by zooko

Great! Looking over your patch, this part catches my eye:

-            if placed_shares < self.shares_of_happiness:
+            servers_with_shares = self._servers_with_shares()
+            if len(servers_with_shares) < self.servers_of_happiness:

I guess this is the critical part of this patch.

If the number of landlords falls below H, then we definitely won't get at least K-out-of-H reliability at the end, so it should indeed treat this as a failure here. But, what if len(servers_with_shares) >= self.servers_of_happiness, and yet we still don't have K-out-of-H reliability because some of the servers have the same shares as each other (instead of different share numbers). Can that happen? I think so from reading the patch -- maybe we should have a test of this case.

comment:52 Changed at 2009-09-25T01:38:21Z by kevan

The case you describe, if I understand it correctly, would be fit by this example:

Suppose I have four servers, s_1 through s_4. Suppose I have k = 3, h = 3, N=10. Suppose that the shares generated from a file f are numbered f_n (so f_1 is share 1, and so on). Then the situation you describe is a problem if shares are distributed as follows:

  • s_1: f_1
  • s_2: f_1
  • s_3: f_1
  • s_4: f_2, f_3, f_4, f_5, f_6, f_7, f_8, f_9, f_10

because the set {s_1, s_2, s_3} is not sufficient to recover the file, though it is the same size as k.

Do you think we're on the same page?

From what I understand of the placement algorithm, Tahoe-LAFS wouldn't normally distribute one share to more than one server.

_loop in Tahoe2PeerSelector contains the guts of the peer selection algorithm. As I understand it, it only has one request for a particular share out to peers at a time -- either by using pop() on the list containing the homeless shares, or by setting the slice of the homeless shares list out on the wire at the moment to the empty list. If this is the case, then it shouldn't allocate one share to more than one server.

A more interesting case is if, for some reason, we attempt to upload f to a grid that has the following layout:

  • s_1: f_1
  • s_2: f_1
  • s_3: f_1
  • s_4: \emptyset

I'll also stipulate that s_1, s_2 and s_3 will not accept any more shares, while s_4 will accept every share it is asked to accept. Let's walk through what happens when I try to upload f.

I'll assume for simplicity that _loop starts with

self.homeless_shares = [f_1, f_2, f_3, f_4, f_5, f_6, f_7, f_8, f_9, f_10]
self.uncontacted_peers = [s_1, s_2, s_3, s_4]

_loop will start by checking to see if there are any homeless shares. There are, of course. It then checks to see if there are any uncontacted peers. At this point in the execution, there are, so it pops the first uncontacted peer (s_1) off of the list of uncontacted peers, and pops the first homeless share (f_1) off of the list of homeless shares. It then asks s_1 to store f_1. Recall that s_1 already has f_1. Since the StorageServer (src/allmydata/storage/server.py@3841#L35) tells remote callers about all of the shares it has for a given storage index, s_1 tells the Tahoe2PeerSelector instance that it has f_1, and that it has not allocated any shares. This is handled by the _got_response method of Tahoe2PeerSelector, which sees that s_1 has not allocated anything, but that it already has f_1: that means that f_1 is not added to the homeless list again, and that s_1 is set to be the preexisting peer with f_1.

_loop will now ask s_2 if it can store f_2 in the same way. s_2 will reply (in the same way as s_1) that it can't (i.e.: that it hasn't allocated f_2), but that it happens to have f_1. This causes _got_response to set s_2 as the preexisting peer with f_1, overwriting the previously set value (s_1).

The same process occurs with s_3: the end result of the _loop/_got_response combination in that case is that s_3 is now the preexisting peer with f_1.

Finally, _loop asks the last uncontacted peer, s_4, to store f_2. s_4 replies that it doesn't have any preexisting shares for the storage index, and that it has allocated f_2. s_4 is recorded in self.use_peers as having done this.

There are now no more uncontacted peers, so the Tahoe2PeerSelector instance will attempt to store a proportional amount of the homeless shares (see src/allmydata/immutable/upload.py@4045#L263) on each of the remaining servers. s_1, s_2, and s_3 will refuse to accept any of these, and will not be used in any further queries as a result. s_4 will accept all of the shares that Tahoe2PeerSelector wants to put on it, so it will be re-appended to the list of peers to consider. On the next pass, _loop will attempt to store all currently homeless shares on s_4, and will succeed.

Tahoe2PeerSelector now has no more shares to allocate, and executes the comparison that Zooko mentions. To get servers_with_shares, self._servers_with_shares takes the union of the set of peers that we're actually uploading shares to and the set of peers that we've recorded as already having shares. The latter set is built using the preexisting shares structure that is updated each time a response is received from s_1, s_2, or s_3: this will never have more than one server for f_1 (or any f_n). The only other server in this list will be s_4, since it has all of the other shares. _servers_with_shares, then, returns a list of two servers, and Tahoe2PeerSelector correctly fails, because 2 is less than 3.
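
(For reference, the counting step described above amounts to something like the following sketch; this is illustrative only, with names taken from the discussion rather than from the actual patch.)

def _servers_with_shares(self):
    # peers that this upload is actively placing new shares on
    uploading = set([peer.peerid for peer in self.use_peers])
    # peers recorded as already holding a share; preexisting_shares maps
    # shnum -> peerid and holds at most one peerid per share number
    preexisting = set(self.preexisting_shares.values())
    return list(uploading | preexisting)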

One example is obviously not proof that the peer selection algorithm guards against this in all cases, though, so maybe an additional test (or revising that one) would be a good idea. Did you have anything in particular in mind that would be a better indicator?

This is perhaps off-topic, but if I understand the execution of _got_response, it seems like we don't say that a query has made progress (or that it is a good query) unless we know that the remote server has either allocated shares for us, or we learn that it already has shares that we think are homeless. This means that if I ask a server to store one share, and it happens to have that share (but no others), its response isn't considered successful, because the share I asked it to store isn't in homeless shares (since we popped it/destructively modified homeless shares when we removed it), and nothing is allocated. Is this intentional?

(whew, that was a lot longer than I was planning on it being :-/)

comment:53 Changed at 2009-10-10T15:49:31Z by zooko

Kevan:

It's great to see that you study the code so carefully. This gives me a nice warm fuzzy feeling that more eyeballs have looked at it. Anyway, I'm very sorry it has been two weeks since it was my turn to reply on this ticket and I haven't done so. I've been really busy.

I guess the key fact that you've shown that I didn't appreciate is that the variable _servers_with_shares holds only servers that have a new share that isn't already held by one of the servers in that set. Perhaps it should be renamed to something like _servers_with_unique_shares. (If you do that, please use the darcs replace command to rename it.)

Now I can think of one more issue. You've pretty much convinced me that this way of counting _servers_with_shares can't overcount unique shares which are available on separate servers, but it could undercount. For example, suppose s_1: f_1, s_2: f_2, s_3: f_3, s_4: f_1, f_2, f_3. Then if s_4 is counted first it will prevent s_1, s_2, and s_3 from being counted because they don't have any new shares, so the final value of _servers_with_shares will be 1. On the other hand if s_1, then s_2, then s_3 are counted first the final value will be 3. If this is right, then it means that sometimes an upload could be reported as failing (because the uploader happened to talk to s_4 first) when it should have been reported as succeeding.
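
(A toy illustration of that order dependence, outside of any Tahoe code, just to make the counting concrete:)

layout = {"s_1": set([1]), "s_2": set([2]), "s_3": set([3]),
          "s_4": set([1, 2, 3])}

def greedy_count(order):
    seen_shares = set()
    counted = 0
    for server in order:
        new_shares = layout[server] - seen_shares
        if new_shares:            # only count servers contributing a new share
            counted += 1
            seen_shares |= layout[server]
    return counted

greedy_count(["s_4", "s_1", "s_2", "s_3"])   # 1 -- s_4 is seen first
greedy_count(["s_1", "s_2", "s_3", "s_4"])   # 3 -- the small servers are seen first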

What do you think? It might be worth committing your patch as is, meaning that trunk would then potentially suffer from uploads spuriously failing when they shouldn't have (but they never suffer from uploads spuriously succeeding when they shouldn't have), and then starting on a separate patch to avoid that problem. Or, perhaps we should keep this patch out of trunk even longer and think about that issue.

Regards,

Zooko

comment:54 Changed at 2009-10-10T17:32:11Z by zooko

I interpret Shawn Willden's message http://allmydata.org/pipermail/tahoe-dev/2009-October/002972.html as being a problem that would be solved by this ticket. Shawn -- care to comment?

comment:55 Changed at 2009-10-10T22:23:54Z by kevan

Hm. That scenario would be a problem, and I don't really see an obvious solution to it.

We could alter the logic at src/allmydata/immutable/upload.py@4045#L225 so that it doesn't just give up when it finds that there are no homeless shares left but there still aren't enough distinct servers with shares to consider the upload a success.

We could, for example, figure out how many more servers need to have shares on them for the upload to work ( n = servers_of_happiness - servers_with_shares). We could then unallocate n shares from servers that have more than one share allocated, stick them back in self.homeless_shares, and then let the selection process continue as normal. We'd need a way to prevent it from looping, though -- maybe it should only do this if there are uncontacted peers. Would we want to remove shares from servers that happen to already have them if we're not counting them in the upload? If so, is there a way to do that?
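
(Very roughly, the shape of that idea -- all names here are made up for illustration, and this glosses over the bookkeeping needed to actually unallocate a share:)

# after the selector runs out of homeless shares but happiness is unmet:
needed = self.servers_of_happiness - len(servers_with_shares)
if needed > 0 and self.uncontacted_peers:
    for peer in list(self.use_peers):
        while len(peer.buckets) > 1 and needed > 0:
            shnum = peer.buckets.pop()        # hypothetical "unallocate" step
            self.homeless_shares.append(shnum)
            needed -= 1
    # ...then fall through and let the normal selection loop try to place
    # the re-homed shares on the uncontacted peers.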

Does that idea make sense?

Regarding holding up this patch versus committing now and making it a separate issue:

  • We'd probably want to write tests for this behavior. Do the test tools in Tahoe include a way to configure a grid so that it looks like the one in your example (I spent a while looking for such tools last weekend when I was trying to implement a test for your first example, but couldn't find them)? If not, we'd probably need to write them.
  • We'd probably want to make a better-defined algorithm for what I said in the paragraph up there (assuming that it is agreeable to everyone).

I have school and work to keep me busy, so I'd be able to dedicate maybe an afternoon or two a week to keep working on this issue. I'm happy to do that -- I'd like to finish it -- but it would probably be a little while before we ended up committing a fix if we waited for that to be done (if someone with more time on their hands wanted to take over, that issue would be solved, I guess). So I guess that's one argument for making it a separate issue. On the other hand, it'd be nice to eliminate edge cases before committing. So there's that. I'm not sure which way I lean.

comment:56 Changed at 2009-10-10T23:52:12Z by kevan

Oops -- I forgot that I made the naming changes that you suggested while writing that, and tried to upload a patch with them, but then I realized that there was a typo in the patch name, and that I missed some of the names. Here's a revised patch without those issues.

Changed at 2009-10-10T23:52:55Z by kevan

The behavior discussed in this ticket

comment:57 Changed at 2009-10-11T00:11:41Z by zooko

Yes, we definitely want tests. Check out GridTestMixin and the way that it is used by test_repairer.py. I guess we could use that pattern to set up a test grid with the shares stored on servers in a specific pattern such as discussed in comment:52 and comment:53.

comment:58 Changed at 2009-10-11T00:13:47Z by zooko

Kevan and I chatted on IRC and we both independently thought that his patch shouldn't go in when it is susceptible to that "spurious failure to upload" problem, because if that problem were to occur the user would have no good way to work-around it and would be stuck.

comment:59 Changed at 2009-10-12T09:21:52Z by kevan

Okay, I'm updating two patches.

I updated my tests patch to include a test for the scenario Zooko proposed in comment:53. It's not _quite_ ideal (I need to figure out a way to make the Tahoe2PeerSelector pop server 0 off the peers list first for it to be perfect), but it fails with the current code.

I also noticed that my _servers_with_unique_shares method in upload.py was comparing peerids with things that weren't peerids, so I made a minor change to the behavior.txt patch to address that.

My todo list is basically:

  • Add a test for the scenario I propose in comment:52
  • Design + implement changes to the peer selection algorithm to address the scenario in comment:53.

I welcome any comments.

comment:60 Changed at 2009-10-13T15:02:56Z by zooko

In http://allmydata.org/pipermail/tahoe-dev/2009-October/002998.html I suggested this measure of "servers of happiness":

First of all, let's call a set of servers "sufficient" if you can download the file from that set (i.e. if at least K distinct shares are hosted in that set of servers).

Now consider the largest set of servers such that every K-sized subset of it is sufficient.

Let's call the size of that largest set S. Now my intuition about "Happiness" is that I configure a Happiness number H, and if an upload results in S >= H then I'm happy.

I think this is also Robert Metcalf's intuition. It may also be Shawn Willden's intuition, but on the other hand perhaps Shawn Willden's intuition is something more sophisticated. ;-)

A neat thing about this way of thinking is that the number S could represent the "health" or "robustness" of the file. An upload or a file-check operation could report S to the user.
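
(Stated operationally, the measure could be computed -- very naively, by brute force over subsets -- like this. This is only meant to pin down the definition, not to be used in the uploader:)

from itertools import combinations

def is_sufficient(servers, shares_on, k):
    """True if this set of servers collectively holds at least k distinct shares."""
    shares = set()
    for server in servers:
        shares |= shares_on[server]
    return len(shares) >= k

def happiness_S(shares_on, k):
    """
    shares_on maps a server to the set of share numbers it holds.  Return
    the size of the largest set of servers whose every k-sized subset is
    sufficient.  (Sets smaller than k satisfy the condition vacuously; how
    to report that case is a separate question.)
    """
    servers = list(shares_on)
    for size in range(len(servers), k - 1, -1):
        for candidate in combinations(servers, size):
            if all(is_sufficient(subset, shares_on, k)
                   for subset in combinations(candidate, k)):
                return size
    return 0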

comment:61 Changed at 2009-10-14T03:29:47Z by kevan

I replied to this comment in http://allmydata.org/pipermail/tahoe-dev/2009-October/003000.html, but I want to reformat it for Trac before posting it here. In the meantime, there's that.

Changed at 2009-10-18T02:23:15Z by kevan

comment:62 Changed at 2009-10-18T02:25:43Z by kevan

I've updated tests.txt to have a test for the layout I discuss in comment:52. This required a minor change to a couple of the methods in NoNetworkGrid, which that patch also contains.

(is there a way of deleting files from a ticket? Maybe I should just work harder at remembering to check the replace checkbox when I update tickets :-).

comment:63 Changed at 2009-10-24T05:18:48Z by kevan

I'm updating tests.txt to fix some bugs where I was mixing callbacks with synchronous code where I shouldn't have been. I also tweaked the share distribution for the comment:53 test so that the Tahoe2PeerSelector sees the servers in the right order.

The simplest fix I can think of for the comment:53 issue changes the way that the _got_response method in a Tahoe2PeerSelector instance handles existing shares in a response.

Right now, if a peer tells Tahoe2PeerSelector that it has a share that it wasn't asked to allocate (in the alreadygot part of the response), then the logic in _got_response will alter the entry for that share in preexisting_shares to point to the peer, regardless of whether or not that entry was already pointing to something else.

It's kind of rough, but this check fixes the issue for me (in that it makes the test pass). Thoughts?
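
(Roughly, the check being described is of this shape; illustrative only, not the patch itself:)

# inside _got_response, when handling the "alreadygot" share numbers:
for shnum in alreadygot:
    if shnum in self.preexisting_shares:
        # another server is already credited with this share number;
        # don't let a later response steal the credit and shrink the
        # count of distinct servers
        continue
    self.preexisting_shares[shnum] = peer.peerid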

comment:64 Changed at 2009-10-25T21:29:09Z by kevan

I'm updating behavior.txt to have the rough fix that I mention in comment:63, and tests.txt to add a test for the logic that calculates servers_of_happiness in Tahoe2PeerSelector. I think this fixes comment:53. Thoughts/feedback?

comment:65 Changed at 2009-10-27T22:05:45Z by zooko

  • Milestone changed from 1.5.1 to 1.6.0

comment:66 Changed at 2009-10-28T01:54:12Z by kevan

I was thinking about this the other day, and got to wondering about how the Encoder handles preexisting shares in the event of some servers failing during an upload.

(note that the following example is in terms of the existing shares_of_happiness behavior -- it is easier to link to that code than to my patches)

As an example, we first look at start_encrypted in CHKUploader. This method creates and runs a Tahoe2PeerSelector to distribute shares of an IEncryptedUploadable across the grid. The results of this are handled in set_shareholders. Note that the PeerTracker instances in use_peers are sent to the Encoder instance, while the peerids in already_peers are only used in the upload results. In any case, after invoking set_shareholders on the Encoder, the CHKUploader starts the upload.

The part of the Encoding process that concerns me is _remove_shareholder. This method is called when there is an error sending data to one of the shareholders. If a shareholder is lost, the Encoder will check to make sure that shares_of_happiness is still met even with the lost server -- if not, it will abort the upload. The problem with this check is that the Encoder, from what I can tell, has no way of knowing about the shares that already exist on the grid, and thus can't take them into account when making this check. So, if I have (say) 8 shares for my storage index already on the grid, shares_of_happiness = 7, and only two shares for the Encoder to actually send, and one (or both) of those transfers fail, my upload will fail when it shouldn't.
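
(The arithmetic of that example, spelled out -- nothing Tahoe-specific here:)

shares_already_on_grid = 8
shares_handed_to_encoder = 2     # the only shares the Encoder knows about
shares_of_happiness = 7

surviving_landlords = shares_handed_to_encoder - 1   # one transfer fails

# the check as it stands, blind to the 8 pre-existing shares:
aborts_now = surviving_landlords < shares_of_happiness                        # True
# the check with the pre-existing shares taken into account:
should_abort = (surviving_landlords + shares_already_on_grid) < shares_of_happiness  # False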

Does it seem like I'm off-base there? If not, then it certainly seems like my implementation of servers_of_happiness would fall victim to pretty much the same issue. Is there an obvious way to fix that?

(this comment should have a unit test or two written for it, so that what I'm saying is demonstrated/more easily understood, but I need to leave for class now)

comment:67 Changed at 2009-10-28T02:43:10Z by zooko

Okay I read your explanation and followed along using the links that you embedded and I think you are right. I hope you write that unit test or two when you get back from class!

To fix this in terms of shares-of-happiness I guess CHKUploader.set_shareholders() could also give an integer "already uploaded shares" to the encoder and the encoder could add that integer to the len(self.landlords) that it uses to decide if there is still a chance of success.

To fix it in terms of servers-of-happiness, I suspect that CHKUploader.set_shareholders() would need to pass more information to the encoder, perhaps telling the encoder everything it knows about which servers already have which shares. I'm not entirely sure how your current patch works, but I'll write more about that in a separate comment.

comment:68 Changed at 2009-10-30T09:46:32Z by kevan

I'm uploading the new tests. I'll need to modify them again when we define a way to give the Encoder more information about which servers have which shares.

It occurred to me when writing that test that my patch to Encoder doesn't do what I thought it did when I wrote it. Specifically, since self.landlords in an Encoder is simply a list of IStorageBucketWriters (I'm not sure what I thought they were, but it wasn't that), doing set(self.landlords) doesn't really do anything (let alone give a good value for the number of servers that are left). So I need to think of a better way of doing that, too.

Changed at 2009-10-30T09:47:06Z by kevan

comment:69 Changed at 2009-11-04T04:21:20Z by kevan

I altered the set_shareholders method in IEncoder to require a servermap argument. servermap is a mapping from shnum to a string (ideally the peerid) identifying the server that will be storing that share (whether as a result of this upload or because it already has it). This gives the Encoder enough information to make an accurate check for servers_of_happiness when it loses a peer and (combined with modifications to make code that used the Encoder use the new form of set_shareholders) also makes the tests pass.
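
(A self-contained sketch of the kind of check the servermap makes possible when a peer is lost mid-encode; like the check discussed above, it counts distinct servers rather than anything fancier, and the names are illustrative:)

def still_happy_after_losing(servermap, lost_peerid, servers_of_happiness):
    """
    servermap maps shnum -> peerid, as described above.  Drop every share
    assigned to the lost peer and see whether the remaining shares still
    span enough distinct servers.
    """
    remaining_peers = set(peerid for shnum, peerid in servermap.items()
                          if peerid != lost_peerid)
    return len(remaining_peers) >= servers_of_happiness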

Comments? Is there anything else in the way of this issue? Is there a cleaner way of altering the Encoder to do what I want it to do?

comment:70 Changed at 2009-11-04T04:43:52Z by zooko

I didn't understand your proposed solution in comment:63 to my proposed problem in comment:53. Doesn't your solution mean that if the uploader encounters the servers in the opposite order, ending with s_4, the server that has one copy of each share number, that the upload will then abort thinking that the "sufficiency number" S is a mere 1?

If that's the case, then a test just like the second layout in your nice new test_problem_layouts but with the order reversed should fail.

comment:71 Changed at 2009-11-04T12:08:51Z by kevan

Yes, that's right -- I misread your comment, and the test you suggest does indeed fail (I'll update tests.txt with the new test).

I think that the general solution I propose in comment:55 (but didn't use because, given my view of comment:53 at the time, the one I ended up implementing seemed easier and just as effective) would still work for that issue -- upon seeing that there are no homeless shares, but a too-small S, the Tahoe2PeerSelector would check to see if there are more uncontacted peers than servers_of_happiness - S, and, if so, stick servers_of_happiness - S shares back into homeless shares for the algorithm to try to distribute.

This is (at best) likely to be inefficient, though. If the Tahoe2PeerSelector has seen (as an example) s_4 with all the shares, and there are s_1, s_2, s_3 with f_1, f_2, f_3 respectively, we'd want to take advantage of the existing shares on s_1, s_2, s_3 and not allocate different shares to them. In this scenario, though, Tahoe2PeerSelector doesn't know anything about these servers other than the fact that it hasn't contacted them, so it can't do much more than guess. If the servers accept the shares, though, the upload will work as it should.

In the worst case -- that of full s_1, s_2, s_3 -- this will fail, too, because unless we happen to choose the right shares from those already allocated to s_4 to put back into homeless shares, we'll end up with homeless shares. Only I'm forgetting that the peer selection process has multiple phases. What I think will happen in that case is that when the Tahoe2PeerSelector attempts to store shares on those servers, it'll fail, but it will also discover that those servers have f_1, f_2, f_3, and record those in its mapping of already allocated shares. It will then ask s_4 to store the shares that it wasn't able to store on s_1, s_2, s_3 -- shares which s_4 already has. So these shares will be removed from the list of homeless shares, the list of homeless shares will be empty, and the servers_of_happiness check should succeed.

Does that seem right?

I think comment:53 generalizes to any instance where the first server (or really the first n servers, where n is less than servers_of_happiness) happen to store all of the shares associated with a storage index. To that end, I'm also adding a test where s_1, s_2, s_3 are empty. I'm also adding a test for the "worst case" described above.

comment:72 Changed at 2009-11-07T02:21:24Z by kevan

I'm attaching an updated behavior.txt with my solution from comment:55. It passes most of the new tests, with the exception of the one for the worst case in comment:71. This fails because of a check in Tahoe2PeerSelector that filters the list of peers given to the selector instance to only those peers capable of accepting a share of this file, based on the largest sharesize they can accept, according to the results of remote_get_version() in the storage server. If peers are full, they will be weeded out by this check, and their existing shares will not be accounted for in the peer selection process. However, removing the check entirely isn't a good option, either; according to the comment, the check is necessary to ensure that #439 is handled.

Ideas? It'd be nice to handle this case, though it is kind of an edge case, so maybe it's okay to not handle it? That feels kind of dirty, though. Is there another criterion that we could use to address #439? Maybe we could modify Tahoe2PeerSelector to view these peers as "read-only" peers; that is, it will ask them if they have any existing shares, but not attempt to allocate any shares to them?

comment:73 Changed at 2009-11-09T01:08:33Z by kevan

So I was playing with this yesterday and thought of a potential fix:

Instead of throwing away the peers that aren't able to accept a share of a potential upload, the Tahoe2PeerSelector should keep them (temporarily) around in a list of read-only peers; peers which may already have shares of a particular file, but which won't accept any new shares. We can then ask these peers (using remote_get_buckets()) which buckets (if any) they have for our storage index, using the results to populate self.preexisting_shares. The peer selection process then runs as it normally does, knowing about the shares on the peers that it would have thrown away before. I'm attaching updated behavior.txt and tests.txt with these fixes.
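
(Sketched as synchronous pseudocode -- the real code would go through callRemote() and Deferreds, and peer_can_accept_share() is a made-up stand-in for the existing #439 size check:)

writable_peers, readonly_peers = [], []
for peer in all_peers:
    if peer_can_accept_share(peer):       # hypothetical helper
        writable_peers.append(peer)
    else:
        readonly_peers.append(peer)

# ask the read-only peers what they already hold, and credit those shares
for peer in readonly_peers:
    buckets = peer.remote_get_buckets(storage_index)
    for shnum in buckets:
        self.preexisting_shares.setdefault(shnum, peer.peerid)

# then run the normal selection loop over writable_peers only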

Does this seem reasonable? Does anyone have other suggestions?

comment:74 Changed at 2009-11-09T18:06:17Z by zooko

I haven't reviewed your code yet, but what you wrote in comment:73 sounds right to me.

comment:75 Changed at 2009-11-09T18:08:01Z by zooko

By the way I spoke with Brian Warner in person last night and we agreed that he will review this ticket once Kevan and I are satisfied with it just to make extra double sure that we aren't screwing something up. Brian is a bit skeptical about our whole approach (e.g. see his cautionary comments in comment:34) but I'm pretty sure we're on the right track.

comment:76 Changed at 2009-11-16T06:00:03Z by zooko

Hey, check it out: I just happened to notice #608 (premature abort of upload if some shares were already present and some servers fail). I think Kevan's patch might also fix #608.

comment:77 Changed at 2009-11-16T21:29:43Z by kevan

I made a tweak to one of the utility functions in upload.py to make sure that servers_of_happiness is not overcounted -- if the peer selection process yielded both a bucket in self.use_peers and an existing share in self.preexisting_shares for the same share number, and those happened to be on different peers, both of those peers would be counted. I added a test to demonstrate this behavior. My fix was to alter a utility function to check for this situation, and only count the peer associated with the bucket if it is detected. If anyone has a more elegant fix for this, please post it.
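
(In sketch form, counting with that correction applied -- illustrative names, not the actual utility function:)

def count_distinct_servers(preexisting_shares, use_peers_buckets):
    """
    preexisting_shares: shnum -> peerid, for shares found on the grid.
    use_peers_buckets:  peerid -> set of shnums being uploaded to that peer.
    If the same share number appears in both, credit only the peer we are
    uploading to, so that one share number never counts two servers.
    """
    existing = dict(preexisting_shares)
    for peerid, shnums in use_peers_buckets.items():
        for shnum in shnums:
            if shnum in existing and existing[shnum] != peerid:
                del existing[shnum]
    servers = set(existing.values())
    servers.update(use_peers_buckets.keys())
    return len(servers)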

I also did a bit of cleanup and reorganization. In particular:

  • docs/configuration.txt now uses Zooko's description of this behavior.
  • Various occurrences of 'shares_of_happiness' in comments and docs have been updated.
  • test_upload is reorganized; in particular, I updated comments, and broke up the problem layouts test a bit.
  • I added a couple of new tests to more thoroughly test things.

Thoughts?

comment:78 Changed at 2009-11-16T23:48:16Z by kevan

Also, I think that my patch fixes #608 -- I added a test called test_dropped_servers_in_encoder that does the same thing (but with different numbers) as Zooko's suggested test in #608, and it passes now.

comment:79 Changed at 2009-11-18T17:38:57Z by zooko

  • Keywords review added
  • Owner changed from kevan to zooko
  • Status changed from new to assigned

Marking this ticket for review (by me). Once I approve it then let's make sure the docs and unit tests are so thorough that Brian can review it without having to read this entire thread. :-)

comment:80 Changed at 2009-11-18T17:54:08Z by kevan

Cool! The patches aren't necessarily the most intuitive to read -- if you want me to do anything to them to make your review process easier, just let me know. :)

comment:81 Changed at 2009-11-19T04:19:13Z by zooko

We need to raise a different exception than NotEnoughSharesError in the case that upload fails due to not enough servers being available. In case you think I am being overly picky, consider the fact that I am currently answering support mail in another window from an allmydata.com customer who is asking about this "NotEnoughSharesError" that he sees. :-)

comment:82 Changed at 2009-11-19T04:31:34Z by zooko

I opened ticket #834 (clarify error message when upload fails due to insufficient servers of happiness) and assigned it to Kevan.

comment:83 Changed at 2009-11-23T01:38:22Z by kevan

See http://allmydata.org/trac/tahoe/ticket/834#comment:3 for an explanation of these updates.

comment:84 Changed at 2009-11-29T23:43:49Z by zooko

I'm reviewing doc.txt.

  • It says "we'll be happy if we can place 7". It should be changed to say something like "we'll be happy if we can place shares on enough servers that there are 7 different servers, the correct function of any 3 of which guarantee the availability of the file".
  • It says "the correct functioning of at most k of which is sufficient", but (unless I'm confused again), that's wrong, and it should say "the correct functioning of at least k of which is sufficient".

Other than that it looks right!

comment:85 Changed at 2009-11-30T00:37:34Z by kevan

I think you're right on both points, and I've updated docs.txt to make those changes.

comment:86 follow-ups: Changed at 2009-11-30T00:43:51Z by zooko

I'm reviewing tests.txt.

  • At "def _remove_share_0():", I got confused for awhile because the comment numbers the servers as 1, 2, 3, 4 but the code indexes them as 1, 2, 3, 0. Maybe change the docs to use the code's indexes, or maybe add a comment in the docs explaining which server number corresponds to which index in self.shares? Oh, even easier would be rename _remove_share_0() to _remove_share_0_from_server_4().
  • #834:comment:3 says that UploadHappinessError is raised instead of NoSharesError during the upload process, but the test.txt attachment (the one that I am looking at right now -- unfortunately I don't know how to indicate which version of test.txt I mean) is testing for NotEnoughSharesError, like this:
              self.shouldFail(NotEnoughSharesError, "test_happy_semantics",
                              "shares could only be placed on 1 servers "
                              "(4 were requested)",
                              client.upload, upload.Data("data" * 10000,
                                                         convergence="")))
    
  • Also the error message that test.txt is testing for -- "shares could only be placed on 1 servers (4 were requested)" isn't really accurate. It is hard to make it more accurate while keeping it short. The truth is something like "We were able to locate or upload shares to $X servers such that any $K of them are sufficient to download the file, but we were asked to make it so that there would be at least $H such servers."
  • In test_dropped_servers_in_encoder() I wasn't sure I remembered what was being tested -- please add some in-line comments at the beginning saying what the uploader ought to do in what case. Otherwise test_dropped_servers_in_encoder() looks fine.

more to come -- I'm not done yet reviewing tests.txt. I've reviewed it from the top down to test_dropped_servers_in_encoder().

comment:87 in reply to: ↑ 86 Changed at 2009-11-30T01:31:02Z by davidsarah

Replying to zooko:

  • Also the error message that test.txt is testing for -- "shares could only be placed on 1 servers (4 were requested)" isn't really accurate. It is hard to make it more accurate while keeping it short. The truth is something like "We were able to locate or upload shares to $X servers such that any $K of them are sufficient to download the file, but we were asked to make it so that there would be at least $H such servers."

I think the message needs to be split into two cases. Technically, there are always at least $K-1 servers such that any $K of them will be able to reconstruct the file (vacuously, since there are no $K-element subsets of a ($K-1)-element set). However, it is confusing to say that. So, if $X < $K, we should use a different message, something like "We were only able to locate or upload shares to $S servers. We were asked to make it so that there would be at least $H servers, any $K of which are sufficient to download the file."
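
(In other words, the message selection would branch roughly like this; illustrative only:)

if effective_servers < k:
    msg = ("We were only able to locate or upload shares to %d servers. "
           "We were asked to make it so that there would be at least %d "
           "servers, any %d of which are sufficient to download the file."
           % (effective_servers, happy, k))
else:
    msg = ("We were able to locate or upload shares to %d servers such "
           "that any %d of them are sufficient to download the file, but "
           "we were asked to make it so that there would be at least %d "
           "such servers." % (effective_servers, k, happy))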

comment:88 Changed at 2009-11-30T01:34:40Z by davidsarah

Oh, and the special case "We were only able to locate or upload shares to 0 servers." would be better expressed as "We were not able to locate or upload shares to any servers."

comment:89 follow-up: Changed at 2009-11-30T01:42:47Z by davidsarah

From docs.txt:

shares.happy = (int, optional) 1 <= happy <= # of servers on your grid

[...]

shares.happy allows you control over the distribution of your file. An upload is only considered successful if shares are placed on at least 'shares.happy' distinct servers, the correct functioning of at least k of which is sufficient to guarantee the availability of the uploaded file. This value should not be larger than the number of servers on your grid.

I don't think it makes sense to allow shares.happy to be less than k, since (if I understand correctly) that is equivalent to letting all uploads succeed even if they place no shares.

comment:90 follow-up: Changed at 2009-11-30T01:45:16Z by davidsarah

Shouldn't UploadHappinessError be named UploadUnhappinessError?

comment:91 in reply to: ↑ 86 Changed at 2009-12-03T01:58:54Z by kevan

(apologies for the delay in replying to these comments -- I've been a bit slammed with schoolwork lately, so I'm getting to them as I have time.)

Replying to zooko:

I'm reviewing tests.txt.

  • At "def _remove_share_0():", I got confused for awhile because the comment numbers the servers as 1, 2, 3, 4 but the code indexes them as 1, 2, 3, 0. Maybe change the docs to use the code's indexes, or maybe add a comment in the docs explaining which server number corresponds to which index in self.shares? Oh, even easier would be rename _remove_share_0() to _remove_share_0_from_server_4().

Good catch -- I thought I'd caught all those. I changed the test to do that.

  • #834:comment:3 says that UploadHappinessError is raised instead of NoSharesError during the upload process, but the test.txt attachment (the one that I am looking at right now -- unfortunately I don't know how to indicate which version of test.txt I mean) is testing for NoSharesError, like this:
              self.shouldFail(NotEnoughSharesError, "test_happy_semantics",
                              "shares could only be placed on 1 servers "
                              "(4 were requested)",
                              client.upload, upload.Data("data" * 10000,
                                                         convergence="")))
    

This is actually changed in a later patch in tests.txt. Sorry -- making a lot of small patches made it easier to keep track of things as I was working on this, but it probably makes it harder to review. If you like, I could give a flattened, diff-style "patch" for you to review instead?

  • In test_dropped_servers_in_encoder() I wasn't sure I remembered what was being tested -- please add some in-line comments at the beginning saying what the uploader ought to do in what case. Otherwise test_dropped_servers_in_encoder() looks fine.

Okay, that's done too.

more to come -- I'm not done yet reviewing tests.txt. I've reviewed it from the top down to test_dropped_servers_in_encoder().

Thanks for reviewing.

comment:92 Changed at 2009-12-04T04:07:30Z by kevan

I agree with David-Sarah on the idea of having two error messages. For the error message where there are fewer servers than k, it now prints something like:

shares could only be placed or found on 2 server(s). We were asked to place shares on at least 4 servers such that any 3 of them have enough shares to recover the file.

and, in the other case, it prints something like:

shares could only be placed on 5 server(s) such that any 3 of them have enough shares to recover the file, but we were asked to use at least 7 such servers.

Note that this particular error message is only output if the peer selection process managed to place all of the shares -- this means that there will always be at least one server, so we don't need to worry about special-casing it for 0 servers. In cases where the selection process wasn't able to place all of the shares, the error message looks like:

peer selection failed for <Tahoe2PeerSelector for upload dglev>: placed 0 shares out of 10 total (10 homeless), want to place on 4 servers, sent 5 queries to 5 peers, 0 queries placed some shares, 5 placed none (of which 5 placed none due to the server being full and 0 placed none due to an error)

Here, servers_of_happiness is handled by "want to place on 4 servers"; this is a necessary condition for happy=4 to be satisfied, but I think it could probably be improved, so I'll look into doing that if no one disagrees.

comment:93 Changed at 2009-12-05T03:39:36Z by kevan

The message above is very similar to a message logged when an upload succeeds: the success message can be seen here: http://allmydata.org/trac/tahoe/browser/src/allmydata/immutable/upload.py?rev=c4d38ad4c56233e3#L228, while the failure message above can be seen here: http://allmydata.org/trac/tahoe/browser/src/allmydata/immutable/upload.py?rev=c4d38ad4c56233e3#L288

I abstracted these into one message, which reads like this:

placed 3 shares out of 10 total (7 homeless), 
want to place shares on at least 7 servers such that
any 3 of them have enough shares to recover the file,
sent 3 queries to 3 peers,
3 queries placed some shares, 0 placed none 
(of which 0 placed none due to the server being
full and 0 placed none due to an error) 

This is now logged when an upload is successful, and stored in the UploadHappinessError when an upload fails for not having placed all shares. The messages that Zooko and David-Sarah commented on earlier do not include this message (they didn't before, either), but could easily be made to do so (maybe as a footnote to the more informative error text), if we wanted to.

comment:94 in reply to: ↑ 89 Changed at 2009-12-05T03:49:10Z by kevan

Replying to davidsarah:

From docs.txt:

shares.happy = (int, optional) 1 <= happy <= # of servers on your grid

[...]

shares.happy allows you control over the distribution of your file. An upload is only considered successful if shares are placed on at least 'shares.happy' distinct servers, the correct functioning of at least k of which is sufficient to guarantee the availability of the uploaded file. This value should not be larger than the number of servers on your grid.

I don't think it makes sense to allow shares.happy to be less than k, since (if I understand correctly) that is equivalent to letting all uploads succeed even if they place no shares.

I agree. I'm changing that. I'm also changing the upper bound to "N"; the peer selector doesn't know how to place more than N shares, so anything greater than that would not be satisfiable, either.

We should probably make the client do a bounds check on that value before using it, in case it is set to something that we don't support (especially if there were different acceptable defaults before). Is there a preferred way to do that?
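
(A minimal example of such a check -- where it should live in the client and how it should report problems are exactly the open questions here:)

def check_happy_bounds(needed, happy, total):
    """Raise a helpful error unless needed <= happy <= total."""
    if not (needed <= happy <= total):
        raise ValueError(
            "shares.happy must satisfy shares.needed <= shares.happy "
            "<= shares.total (got needed=%d, happy=%d, total=%d)"
            % (needed, happy, total))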

comment:95 in reply to: ↑ 90 Changed at 2009-12-05T04:54:02Z by kevan

Replying to davidsarah:

Shouldn't UploadHappinessError be named UploadUnhappinessError?

That is more consistent with the naming of the other exceptions. I'll go ahead and change it.

Thanks for the feedback so far.

comment:96 Changed at 2009-12-21T02:52:51Z by davidsarah

  • Keywords review-needed added; review removed

comment:97 Changed at 2009-12-25T21:05:47Z by zooko

Dear Kevan:

I'm sorry it has taken me so long to do more review.

From comment:91:

"This is actually changed in a later patch in tests.txt. Sorry -- making a lot of small patches made it easier to keep track of things as I was working on this, but it probably makes it harder to review. If you like, I could give a flattened, diff-style "patch" for you to review instead?"

Either way will work fine for me. I didn't realize there were subsequent patches in tests.txt, but now that I know that I can handle it.

comment:98 Changed at 2009-12-29T19:35:59Z by zooko

I just realized that there is another ticket that this ticket will help with -- #614 (redefine "Healthy" to be "Happy" for checker/verifier/repairer). http://allmydata.org/pipermail/tahoe-dev/2009-December/003435.html

comment:99 Changed at 2009-12-30T06:58:31Z by zooko

Okay I've consolidated the tests.txt patches into one diff (with the appended darcs command line) and I'm reading through it. There's an error in the comments in test_upload.py:test_problem_layout_comment_52(). It says:

+        # server 0: shares 0 - 9
+        # server 1: share 0
+        # server 2: share 0 
+        # server 3: share 0 
+        # To get access to the shares, we will first upload to one 
+        # server, which will then have shares 1 - 10. We'll then 
+        # add three new servers, configure them to not accept any new
+        # shares, then write share 1 directly into the serverdir of each.
+        # Then each of servers 1 - 3 will report that they have share 1, 
+        # and will not accept any new share, while server 4 will report that
+        # it has shares 2 - 10 and will accept new shares.

I think it ought to say:

+        # server 0: shares 1 - 9
+        # server 1: share 0
+        # server 2: share 0 
+        # server 3: share 0 
+        # To get access to the shares, we will first upload to one 
+        # server, which will then have shares 0 - 9. We'll then 
+        # add three new servers, configure them to not accept any new
+        # shares, then write share 0 directly into the serverdir of each.
+        # Then each of servers 1 - 3 will report that they have share 0, 
+        # and will not accept any new share, while server 0 will report that
+        # it has shares 1 - 9 and will accept new shares.

comment:100 Changed at 2009-12-30T06:58:57Z by zooko

I think that the error message is still not quite accurate enough. (Sorry, I know explaining this in an error message is tricky.)

Currently in test_problem_layout_comment_52() when there is a comment:52 -style situation (per the comments quoted in comment:99), then the error message is "shares could only be placed or found on 2 server(s). We were asked to place shares on at least 4 servers such that any 3 of them have enough shares to recover the file". This is wrong because shares were actually placed or found on 4 servers, right? How about "shares could be placed or found on 4 server(s), but the shares are not spread out evenly enough to ensure that any 3 of those servers have enough shares to recover the file. We were asked to place shares on at least 4 servers such that any 3 of them have enough shares to recover the file."?

comment:101 Changed at 2009-12-30T06:59:03Z by zooko

Here's the darcs command-line to show the combined diff of the eleven different test patches in tests.txt:

darcs diff -u --from-patch="Alter various unit tests to work with the new happy behavior" --to-patch="Replace \"UploadHappinessError\" with \"UploadUnhappinessError\" in tests." 

comment:102 Changed at 2009-12-30T21:21:40Z by kevan

Good catch on the comment -- I'll update the tests patch to fix that.

I like your error message better for the case where shares are on (or could be on) more than k servers. It is nonsensical, though, when we have placed shares on fewer than that -- you'd get something like "shares could be placed or found on 2 server(s), but the shares are not spread out evenly enough to ensure that any 3 of those...". So I left the existing message in for that condition.

I'll update behavior.txt and tests.txt with these changes now.

comment:103 Changed at 2010-01-07T18:30:17Z by kevan

I just updated docs.txt and tests.txt to fix some conflicts that they had with the current state of the project. If you're working with them, you should download the new versions.

comment:104 Changed at 2010-01-09T17:44:25Z by zooko

Brian: I'm not ready for you to review tests.txt or behavior.txt yet, but I am ready for you to review docs.txt, which I've already reviewed. See if you can read docs.txt and yet refrain from going on to read tests.txt and behavior.txt in order to understand the patch better. :-)

comment:105 Changed at 2010-01-09T17:45:39Z by zooko

I'm reviewing tests.txt now.

comment:106 Changed at 2010-01-16T01:12:50Z by zooko

  • Owner changed from zooko to kevan
  • Status changed from assigned to new

Kevan: I got confused about the server layouts in the comments in the tests. Here is a diff that shows the notes I made in the file as I went. Please consider these as questions or suggestions rather than as an actual diff for you to automatically apply. Thanks!

--- old-kevan-778-2010-01-12/src/allmydata/test/test_upload.py	2010-01-15 18:06:26.000000000 -0700
+++ new-kevan-778-2010-01-12/src/allmydata/test/test_upload.py	2010-01-15 18:06:27.000000000 -0700
@@ -820,8 +820,7 @@
     def test_happy_semantics(self):
         self._setUp(2)
         DATA = upload.Data("kittens" * 10000, convergence="")
-        # These parameters are unsatisfiable with the client that we've made
-        # -- we'll use them to test that the semantics work correctly.
+        # These parameters are unsatisfiable because we have only two servers.
         self.set_encoding_parameters(k=3, happy=5, n=10)
         d = self.shouldFail(UploadUnhappinessError, "test_happy_semantics",
                             "shares could only be placed or found on 2 "
@@ -832,7 +831,7 @@
         # Let's reset the client to have 10 servers
         d.addCallback(lambda ign:
             self._setUp(10))
-        # These parameters are satisfiable with the client we've made.
+        # These parameters are satisfiable with these servers.
         d.addCallback(lambda ign:
             self.set_encoding_parameters(k=3, happy=5, n=10))
         # this should work
@@ -843,7 +842,7 @@
         d.addCallback(lambda ign:
             self._setUp(7))
         # These encoding parameters should still be satisfiable with our 
-        # client setup
+        # setup.
         d.addCallback(lambda ign:
             self.set_encoding_parameters(k=3, happy=5, n=10))
         # This, then, should work.
@@ -868,7 +867,8 @@
         # To get access to the shares, we will first upload to one 
         # server, which will then have shares 0 - 9. We'll then 
         # add three new servers, configure them to not accept any new
-        # shares, then write share 0 directly into the serverdir of each.
+        # shares, then write share 0 directly into the serverdir of each,
+        # and then remove share 0 from server 0.
         # Then each of servers 1 - 3 will report that they have share 0, 
         # and will not accept any new share, while server 4 will report that
         # it has shares 1 - 9 and will accept new shares.
@@ -952,15 +952,14 @@
         def _change_basedir(ign):
             self.basedir = self.mktemp()
         _change_basedir(None)
-        d = self._setup_and_upload()
-        # We start by uploading all of the shares to one server (which has 
-        # already been done above).
+        # We start by uploading all of the shares to one server.
         # Next, we'll add three new servers to our NoNetworkGrid. We'll add
         # one share from our initial upload to each of these.
         # The counterintuitive ordering of the share numbers is to deal with 
         # the permuting of these servers -- distributing the shares this 
         # way ensures that the Tahoe2PeerSelector sees them in the order 
-        # described above.
+        # described below.
+        d = self._setup_and_upload()
         d.addCallback(lambda ign:
             self._add_server_with_share(server_number=1, share_number=2))
         d.addCallback(lambda ign:
@@ -972,7 +971,7 @@
         # server 1: share 2
         # server 2: share 0
         # server 3: share 1
-        # We want to change the 'happy' parameter in the client to 4. 
+        # We change the 'happy' parameter in the client to 4. 
         # The Tahoe2PeerSelector will see the peers permuted as:
         # 2, 3, 1, 0
         # Ideally, a reupload of our original data should work.
@@ -986,11 +985,6 @@
 
 
         # This scenario is basically comment:53, but with the order reversed;
-        # this means that the Tahoe2PeerSelector sees
-        # server 2: shares 1-10
-        # server 3: share 1
-        # server 1: share 2
-        # server 4: share 3
         d.addCallback(_change_basedir)
         d.addCallback(lambda ign:
             self._setup_and_upload())
@@ -1009,14 +1003,22 @@
         d.addCallback(lambda ign:
             self.g.remove_server(self.g.servers_by_number[0].my_nodeid))
         d.addCallback(lambda ign:
-            self._add_server_with_share(server_number=4, share_number=0))
+            self._add_server_with_share(server_number=4, share_number=0)) # XXX Wait, server number *4*? What's that? Isn't it supposed to be server number 0?
+        # So, we now have the following layout:
+        # again the Tahoe2PeerSelector will see them in order 2, 3, 1, 0.
+        # server 0: share 0
+        # server 1: share 2
+        # server 2: shares 0-9
+        # server 3: share 1
         # Now try uploading. 
         d.addCallback(_reset_encoding_parameters)
         d.addCallback(lambda client:
             client.upload(upload.Data("data" * 10000, convergence="")))
+
         # Try the same thing, but with empty servers after the first one
         # We want to make sure that Tahoe2PeerSelector will redistribute
         # shares as necessary, not simply discover an existing layout.
+        # XXX table showing what the servers look like in the same way as the previous tables
         d.addCallback(_change_basedir)
         d.addCallback(lambda ign:
             self._setup_and_upload())
@@ -1025,7 +1027,7 @@
         d.addCallback(lambda ign:
             self._add_server(server_number=3))
         d.addCallback(lambda ign:
-            self._add_server(server_number=1))
+            self._add_server(server_number=1)) # XXX why adding in a strange order?
         d.addCallback(_copy_shares)
         d.addCallback(lambda ign:
             self.g.remove_server(self.g.servers_by_number[0].my_nodeid))

comment:107 Changed at 2010-01-16T01:14:48Z by zooko

And, on line 897:

        # Uploading data should fail
        d.addCallback(lambda client:
            self.shouldFail(UploadUnhappinessError, "test_happy_semantics",
                            "shares could be placed or found on 4 server(s), "
                            "but they are not spread out evenly enough to "
                            "ensure that any 3 of these servers would have "
                            "enough shares to recover the file. "
                            "We were asked to place shares on at "
                            "least 4 servers such that any 3 of them have "
                            "enough shares to recover the file",
                            client.upload, upload.Data("data" * 10000,
                                                       convergence="")))

I was confused for a while because I didn't realize the servers were read-only.

comment:108 Changed at 2010-01-16T01:16:49Z by zooko

From behavior.txt:

Reading the following function signature and docstring:

def servers_with_unique_shares(existing_shares, used_peers=None):
    """
    I accept a dict of shareid -> peerid mappings (and optionally a list
    of PeerTracker instances) and return a list of servers that have shares.
    """

What I don't understand is the meaning of "unique" and whether the return value will just be every server whose peerid appeared in the values of existing_shares. Also, what is the meaning of the list of PeerTracker instances?

comment:109 Changed at 2010-01-16T02:53:24Z by zooko

Reading the body of servers_with_unique_shares() I still don't understand what it is calculating. Even though it has a few comments like:

            if existing_shares.has_key(k):
                # Prevent overcounting; favor the bucket, and not the 
                # prexisting share.
                del(existing_shares[k])

It isn't clear to me why favoring the pre-existing share would lead to overcounting, or even what the two different things peer.buckets and existing_shares are.

comment:110 Changed at 2010-01-16T02:55:40Z by zooko

By the way I am sleep-deprived, I'm tired from the effort of hobbling around on crutches (now that I have a job and I have to actually leave the house every day) and I have a cold. This makes me the perfect patch reviewer! If you can fix up the code and doc so that even I, in my foggy state, can understand it then anyone can understand it.

comment:111 Changed at 2010-01-16T03:09:48Z by zooko

Okay, looking at the way that servers_with_unique_shares() is invoked in upload.py, it kind of seems like its name could be changed to servers_of_happiness()! And maybe a docstring could be written for it explaining its semantics in more detail.

comment:112 Changed at 2010-01-16T04:50:59Z by zooko

servers_with_unique_shares() (which per the previous comment should perhaps be renamed servers_of_happiness()) returns a set, but the various callers of it never do anything but take the length of that set. Refactor it so that this function returns just the length. (This might contribute to optimization of the implementation of the function, but more importantly it will make it easier for the confused reader, like me, to figure out what it does and what it is used for.)

comment:113 Changed at 2010-01-16T05:12:55Z by zooko

Okay, having read the test function from test_upload.py named test_servers_with_unique_shares(), I am left with some questions about the intended semantics of servers_with_unique_shares(). The test function asserts that an existing_shares of {1:"server1",2:"server2",3:"server3",4:"server4"} results in 4 and that an existing_shares of {1:"server1",2:"server2",3:"server3",4:"server1"} results in 3.

Then it adds in the used_peers option and asserts that an existing_shares of {1:"server1",2:"server2",3:"server3",4:"server1"} plus a set of peers {"server5":[5],"server6":[6],"server7":[7],"server8":[8]} results in 7. That makes sense.

Then it asserts that existing_shares of {1:"server1",2:"server2",3:"server3",4:"server1"} plus a set of peers {"server5":[5],"server6":[6],"server7":[7],"server8":[8],"server1":[1]} also results in 7. That makes sense too.

Finally it asserts that when the inputs are empty the result is 0.

Okay, but now what about something like existing_shares of {1:"server1",2:"server2",3:"server3",4:"server4"} plus a set of peers {"server5":[4,5],"server6":[3,5]}. Should that result in 5 or 6? Or what about just a set of peers {"server0":[0,1,3],"server1":[1,2,4],"server2":[0,2,5]}. How many "servers with unique shares" is that? I guess it is 3 because each of the servers has at least one unique share.

Okay how about {"server0":[0,1],"server1":[1,2],"server2":[0,2]}. None of the servers have a share that isn't offered by any other server. So is the answer 0? I guess not!

Is this the function that actually implements the core "servers of happiness" logic? If so perhaps it should enumerate all relevant edge cases (that we've discovered so far in the very long discussion on this ticket!) or else have a comment directing the reader to the relevant other tests (such as test_problem_layout_comment_53()). And the test code -- or more likely the function itself, servers_with_unique_shares() -- should have doc describing these edge cases or directing the reader to the relevant doc (perhaps docs/configuration.txt).

But maybe the edge cases discussed in this ticket don't apply to this function, which is a pure function mapping the input sets to the output integer (and without regard to the order of the input sets), when many of the edge cases such as test_problem_layout_comment_53() have to do with what order the peers are found in.

So, maybe documentation could be added specifying what behavior this pure function has to provide in order to allow the code that uses it to satisfy those edge cases.

As you can see I'm in the perfect state to review docs and tests patches to ask for more clarification. I'm perhaps not in the perfect state to review code and look for bugs or improvements. If I get through all the docs and tests I'll probably have some sleep. :-)

comment:114 Changed at 2010-01-16T05:36:52Z by zooko

A more idiomatic, and I think faster, and I think correct implementation of shares_by_server() would be:

def shares_by_server(existing_shares):
    """
    I accept a dict of shareid -> peerid, and return a dict
    of peerid -> set of shareid
    """
    servers = {}
    for shareid, peerid in existing_shares.iteritems():
        servers[peerid].setdefault(set()).add(shareid)
    return servers

(In fact, it is such a natural thing to want to do that I wonder if there is an even more idiomatic way to do it in Python or some builtin function that just does exactly this transformation. :-))

comment:115 Changed at 2010-01-16T06:19:08Z by zooko

def should_add_server(existing_shares, server, bucket):
    """
    I tell my caller whether the servers_of_happiness number will be
    increased or decreased if a particular server is added as the peer
    already holding a particular share. I take a dictionary, a peerid,
    and a bucket as arguments, and return a boolean.
    """

I guess the servers_of_happiness number can never be decreased by adding a server?

comment:116 Changed at 2010-01-16T06:54:13Z by zooko

Regarding should_add_server(), I don't think that the name bucket is right. That parameter appears to hold a share number. A "bucket" refers to a service running on a storage server which holds a single share. src/allmydata/interfaces.py@4147#L35

comment:117 Changed at 2010-01-16T07:01:36Z by zooko

Re: test_should_add_server: again, like with test_servers_with_unique_shares(), when I'm looking at this test I wonder if it shouldn't exercise more interesting cases or edge cases of the underlying should_add_server(). But again, maybe those edge cases are better tested in the "higher level" tests earlier in test_upload.py. And again, maybe doc about the underlying function or the testing strategy would allay my concern and tell me where to look to make sure that the edge cases are covered.

comment:118 Changed at 2010-01-16T07:07:57Z by zooko

HOORAY! I'm finally finished reviewing all of Kevan's tests!

Now there are two things that can happen in parallel:

  • Kevan could read these comments and update the tests. As far as I can tell the only changes that need to be made are in the in-line comments in the tests, but perhaps I've also accidentally uncovered some bugs in the tests. As soon as Kevan does that then Brian can review the tests.
  • I can review the behavior.txt. After some sleep.

comment:119 Changed at 2010-01-16T20:10:25Z by kevan

I think it might be a good idea to make docs/specifications/servers-of-happiness.txt at some point, so that all of the edge cases and proofs and whatnot in this ticket are summarized in one place, and explained more clearly than in this ticket.

comment:120 Changed at 2010-01-16T22:09:22Z by zooko

That specification document might be a good idea, yes.

comment:121 Changed at 2010-01-16T23:04:23Z by kevan

Thanks for the comments on the test annotations. I've made some changes that will hopefully make the issues you addressed clearer. I've also annotated the functions in upload.py that you mentioned; I'd try to answer your questions in this comment, but it would probably be a better test for my modifications if you instead tried to see if they answer your questions. I like your revised implementation of shares_by_server, though I think you probably meant to type:

def shares_by_server(existing_shares):
    """
    I accept a dict of shareid -> peerid, and return a dict
    of peerid -> set of shareid
    """
    servers = {}
    for shareid, peerid in existing_shares.iteritems():
        servers.setdefault(peerid, set()).add(shareid)
    return servers

because yours didn't work as written.

I'll upload the changes I've made now.

comment:122 Changed at 2010-01-16T23:38:54Z by zooko

In behavior.txt a line of src/allmydata/immutable/download.py gets changed to say:

+                0, # see ticket #778 for why this is

If you are going to create a servers-of-happiness spec doc then I guess that line should refer to it instead of to the ticket.

comment:123 Changed at 2010-01-17T04:26:12Z by zooko

  • Owner changed from kevan to warner

Reviewing the latest tests.txt:

Around line 986 it looks like the server numbers are 1-indexed instead of 0-indexed. Is that right or am I confused? I think they should all be consistent, and I guess that means all 0-indexed to match the Python code e.g. self._add_server_with_share(server_number=2, share_number=0), which is adding the server whose index in a Python list is "2", not the second server (the second server's index in a Python list is "1").

I read the comment at def test_servers_of_happiness_utility_function() and was confused about what servers_of_happiness() does, but then I discovered the docstring of servers_of_happiness(), which explains it better. So I suggest reducing the comment at def test_servers_of_happiness_utility_function() to something like:

# This function is concerned with the one-to-one (a.k.a. "injective")
# function servers_of_happiness(); many issues that deal with other
# aspects of the servers of happiness behavior that happen to use this
# function are tested elsewhere. These tests exist to ensure that
# upload.servers_of_happiness doesn't under- or overcount the happiness
# value for given inputs.

The tests and the comments thereof are now looking very good! It is time for Brian and/or David-Sarah to review docs.txt and tests.txt!

comment:124 Changed at 2010-01-17T05:16:00Z by kevan

The apparent 1-indexing around 986 is intentional. I delete server 0 from the grid in that part of the test because it has all of the shares associated with the file, which isn't what I want it to have for that particular scenario. There is a comment to that effect:

        # Note that server 0 has been replaced by server 4; this makes it 
        # easier to ensure that the last server seen by Tahoe2PeerSelector
        # has only one share. 

What can I write there to make what I'm doing clearer? Or would that part of the test be clearer if I just manually deleted shares from the storage directory associated with server 0? (To me, adding server 4 made the test easier to read than doing that, but maybe I'm wrong.)

I'm glad that the new docstring for upload.servers_of_happiness does the job, and I've removed the redundant comment at the top of its test.

comment:125 Changed at 2010-01-17T05:21:15Z by kevan

I made #911 for the task of making a specification file for the servers of happiness behavior.

comment:126 Changed at 2010-01-17T05:22:52Z by zooko

Okay I think you've done enough to make it clear and I just wasn't reading your comments carefully enough.

comment:127 follow-up: Changed at 2010-01-17T07:53:31Z by davidsarah

I've reviewed docs.txt and cannot find any problems. However there is some text in source:docs/frontends/webapi.txt that might need updating:

POST $URL?t=check

  This triggers the FileChecker to determine the current "health" of the
  given file or directory, by counting how many shares are available.

Is this still an accurate description of what ?t=check does?

comment:128 in reply to: ↑ 127 Changed at 2010-01-17T14:39:26Z by davidsarah

Replying to davidsarah:

I've reviewed docs.txt and cannot find any problems. However there is some text in source:docs/frontends/webapi.txt that might need updating: ... POST $URL?t=check

Actually, changing that would be part of #614 (redefine "Healthy" to be "Happy" for checker/verifier/repairer). So LGTM for docs.txt.

comment:129 follow-up: Changed at 2010-01-17T22:14:18Z by zooko

A few notes about behavior.txt:

  • please spell del(self.servermap[shareid]) as del self.servermap[shareid]
  • setting servers_left = list(set(self.servermap.values())) and then taking len(servers_left) looks like it could be much simplified and optimized to num_servers_left = len(self.servermap). :-)

comment:130 Changed at 2010-01-17T22:17:42Z by zooko

    def query_allocated(self):
        d = self._storageserver.callRemote("get_buckets",
                                           self.storage_index)
        return d

could be simplified to

    def query_allocated(self):
        return self._storageserver.callRemote("get_buckets",
                                           self.storage_index)

comment:131 Changed at 2010-01-17T22:23:31Z by zooko

About servers_of_happiness():

  • used_peers doesn't need to be copied because servers_of_happiness() never mutates it.
  • Regarding this comment:
    # The Tahoe2PeerSelector takes care of making sure that it doesn't
    # assign the same share to different servers, so we don't have
    # to worry about it here.
    

Agreed, but if there were an easy way to assert inside servers_of_happiness() that no share is assigned to multiple servers, that would be cool too.

  • at the declaration of bucketcountdict please put a comment saying what the keys and values are and, if it seems helpful, what bucketcountdict is for
  • same for sharecountdict

comment:132 Changed at 2010-01-17T22:50:29Z by zooko

Oh I see that sharedict already has a comment, which starts with # We build a dict of shareid => peerid mappings.... Maybe move that comment up to the declaration of sharedict.

comment:133 Changed at 2010-01-17T22:52:40Z by zooko

Tiny style detail:

                # Otherwise, we need to see whether our eventual result
                # is larger with the peer in existing_shares, or with the 
                # peer in used_peers.
                else:

is written as

                else:
                    # Otherwise, we need to see whether our eventual result
                    # is larger with the peer in existing_shares, or with the 
                    # peer in used_peers.

in the rest of our code, so it would be good to do so here for consistency.

comment:134 follow-up: Changed at 2010-01-17T23:01:38Z by zooko

test = existing_shares.values()
test.extend([x.peerid for x in used_peers])
test1 = test[:]
test2 = test[:]
test1.remove(sharedict[k])
test2.remove(existing_shares[k])
val1 = len(list(set(test1)))
val2 = len(list(set(test2)))
if val1 < val2:

could be:

test = set(existing_shares.values()) | set([x.peerid for x in used_peers])
test1 = test - set([sharedict[k]])
test2 = test - set([existing_shares[k]])
if len(test1) < len(test2):

Hm, hey doesn't this boil down to checking whether sharedict[k] is in there or not?...

comment:135 in reply to: ↑ 129 Changed at 2010-01-18T21:25:13Z by kevan

Replying to zooko:

A few notes about behavior.txt:

  • setting servers_left = list(set(self.servermap.values())) and then taking len(servers_left) looks like it could be much simplified and optimized to num_servers_left = len(self.servermap). :-)

If I'm not mistaken, this will simply take the number of entries (shares) in the sharemap -- the set call on the values is necessary so that we count the number of distinct servers. It can be optimized to

    num_servers_left = len(set(self.servermap.values()))

though (I don't know why I thought that you couldn't take the length of a set), and I've done that.

comment:136 in reply to: ↑ 134 Changed at 2010-01-18T21:48:00Z by kevan

Replying to zooko:

Hm, hey doesn't this boil down to checking whether sharedict[k] is in there or not?...

Something like that, actually. I shied away from that when I was writing that snippet because I thought that it would end up being really fragile, but it turned out not to be. Thanks.

I liked your other suggestions, and implemented them, including the assertion in upload.servers_with_shares and a test for it. I've also fixed a few other issues that I found while I was fixing the ones you pointed out. I'm attaching new patches with all of these changes.

Changed at 2010-01-18T21:48:51Z by kevan

comment:137 follow-up: Changed at 2010-01-19T05:08:29Z by zooko

I realized as I was driving home just now that I don't know what the code will do, after Kevan's behavior.txt patch is applied, when "servers of happiness" can be achieved only by uploading redundant shares. For example, tests.txt adds a test in "test_upload.py" named test_problem_layout_comment_52 which creates a server layout like this:

        # server 0: shares 1 - 9
        # server 1: share 0
        # server 2: share 0
        # server 3: share 0

Where server 0 is read-write and servers 1, 2 and 3 are read-only. (And by the way Kevan, please make comments state that servers 1, 2 and 3 are read-only.)

In this scenario (with K == 3) the uploader can't achieve "servers of happiness" == 4 even though it can immediately see that all M == 10 of the shares are hosted on the grid.

But what about the case that servers 1, 2 and 3 were still able to accept new shares? Then our uploader could either abort and say "servers of happiness couldn't be satisfied", due to the fact that it can't achieve "servers of happiness" without uploading redundant copies of shares that are already on the grid, or it could succeed by uploading a new copy of shares 2 and 3.

We should have a test for this case. If our uploader gives up in this case then we should assert that the uploader gives up with a reasonable error message and without wasting bandwidth by uploading shares. If it proceeds in this case then we should assert that it succeeds and that it doesn't upload more shares than it has to (which is two in this case).

comment:138 Changed at 2010-01-20T04:13:31Z by zooko

From behavior.txt:

                assert(bucket not in sharedict.keys())

should be

                assert(bucket not in sharedict)

(The former constructs a new array (Python list) of the keys, tests for bucket in it and then immediately discards it. Also it is wordier.)

comment:139 follow-up: Changed at 2010-01-20T06:30:08Z by zooko

from upload.py:

                if should_add_server(self.preexisting_shares, peer, bucket):
                    # It is possible that more than one server returned
                    # to us has the same share for a given SI. If so, we
                    # need to make sure that we don't unintentionally
                    # report a lower happiness value than actually
                    # exists -- blindly clobbering entries in
                    # self.preexisting_shares would not do that.
                    self.preexisting_shares[bucket] = peer

I don't understand this comment. It appears that the code is blindly clobbering entries in self.preexisting_shares. Does that mean that the code is not making sure that we don't unintentionally report a lower happiness value than actually exists? I think it would make sense to turn self.preexisting_shares into a map from sharenum to a set of serverids instead of from sharenum to serverid.

If the code above is reporting a lower happiness value than it ought then we should be able to write a unit test that fails in which the file is actually happy but the upload reports it as unhappy.
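
A sketch of what the suggested sharenum -> set-of-serverids bookkeeping might look like (illustrative names only, not the patch text):

def record_existing_share(preexisting_shares, sharenum, peerid):
    # sharenum -> set of peerids: record every server that reports
    # holding this share, instead of clobbering the previous entry.
    preexisting_shares.setdefault(sharenum, set()).add(peerid)

preexisting_shares = {}
record_existing_share(preexisting_shares, 0, "server1")
record_existing_share(preexisting_shares, 0, "server2")
# preexisting_shares == {0: set(["server1", "server2"])}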

comment:140 Changed at 2010-01-20T06:34:32Z by zooko

There is a similar issue in set_shareholders(). Here is the code snippet with a giant comment that I typed into it while doing code review:

        for peer in used_peers:
            buckets.update(peer.buckets)
            for shnum in peer.buckets:
                self._peer_trackers[shnum] = peer
                # XXX what if shnum is already in already_peers but also
                # appears in used_peers.  This could happen if shares are
                # already on servers but not "spread out" enough to achieve
                # servers_of_happiness, and the uploader decides to upload a
                # share that is already on one server to a different server.
                # Could that happen? So it is not legit to say: "assert not
                # shnum in servermap"? This code appears to silently
                # overwrite the record of such a share in already_peers with
                # the newly chosen recipient from used_peers. Is that the
                # right thing to do? I guess so, because the encoder is
                # going to use the servermap *only* for deciding if it
                # should give up when a server fails -- it uses the
                # servermap on that occasion to decide if there are enough
                # servers left that it could possibly achieve
                # servers_of_happiness.  Oh wait, so that shows how it is
                # not necessarily the right thing to do -- could it be that
                # a server that we are trying to upload a share to fails,
                # and that this makes it seem like our servers of happiness
                # cannot be reached, but that actually the "already
                # existing" server that has the same-numbered share could
                # help us satisfy servers of happiness?  I guess not,
                # because then we shouldn't have been wasting our bandwidth
                # uploading that share!  Right?!
                servermap[shnum] = peer.peerid

In this case I think that it is okay to blindly clobber the entry in servermap, but only because we will do so only in the case that that entry wasn't doing any good in terms of servers of happiness. However, I am really not sure, and I wonder if redefining servermap to be a map from sharenum to a set of serverids instead of a map from sharenum to a serverid wouldn't make things clearer and possibly also less buggy.

comment:141 follow-up: Changed at 2010-01-20T06:43:21Z by zooko

Kevan:

Here is a patch which attempts to make construction of the set of read-only peers easier to understand (for me). Please treat this patch merely as suggestions, not as an actual patch that I want you to apply. (If, for example, you think that it is better to have the self.readonly_peers member variable declared at the top of the function with the other member variables, that's okay with me, although then it should not be initialized to an empty list and then later redirected to point to a different object -- either the empty list should later be filled or else it should be initialized to None.)

--- old-kevan-778-2010-01-18-behavior/src/allmydata/immutable/upload.py 2010-01-19 23:33:47.000000000 -0700
+++ new-kevan-778-2010-01-18-behavior/src/allmydata/immutable/upload.py 2010-01-19 23:33:48.000000000 -0700
@@ -302,21 +302,11 @@
         self._started_second_pass = False
         self.use_peers = set() # PeerTrackers that have shares assigned to them
         self.preexisting_shares = {} # sharenum -> peerid holding the share
-        # We don't try to allocate shares to these servers, since they've 
-        # said that they're incapable of storing shares of the size that 
-        # we'd want to store. We keep them around because they may have
-        # existing shares for this storage index, which we want to know
-        # about for accurate servers_of_happiness accounting
-        self.readonly_peers = []
         # These peers have shares -- any shares -- for our SI. We keep
         # track
         # of these to write an error message with them later.
         self.peers_with_shares = []
 
-        peers = storage_broker.get_servers_for_index(storage_index)
-        if not peers:
-            raise NoServersError("client gave us zero peers")
-
         # this needed_hashes computation should mirror
         # Encoder.send_all_share_hash_trees. We use an IncompleteHashTree
         # (instead of a HashTree) because we don't require actual hashing
@@ -330,6 +320,10 @@
                                              None)
         allocated_size = wbp.get_allocated_size()
 
+        all_peers = storage_broker.get_servers_for_index(storage_index)
+        if not all_peers:
+            raise NoServersError("client gave us zero peers")
+
         # filter the list of peers according to which ones can accomodate
         # this request. This excludes older peers (which used a 4-byte size
         # field) from getting large shares (for files larger than about
@@ -338,10 +332,9 @@
             (peerid, conn) = peer
             v1 = conn.version["http://allmydata.org/tahoe/protocols/storage/v1"]
             return v1["maximum-immutable-share-size"]
-        new_peers = [peer for peer in peers
+        writeable_peers = [peer for peer in all_peers
                      if _get_maxsize(peer) >= allocated_size]
-        old_peers = list(set(peers).difference(set(new_peers)))
-        peers = new_peers
+        readonly_peers = set(all_peers) - set(writeable_peers)
 
         # decide upon the renewal/cancel secrets, to include them in the
         # allocate_buckets query.
@@ -362,8 +355,13 @@
                                bucket_cancel_secret_hash(file_cancel_secret,
                                                          peerid))
                     for (peerid, conn) in peers]
-        self.uncontacted_peers = _make_trackers(peers)
-        self.readonly_peers = _make_trackers(old_peers)
+        self.uncontacted_peers = _make_trackers(writeable_peers)
+        # We don't try to allocate shares to these servers, since they've 
+        # said that they're incapable of storing shares of the size that 
+        # we'd want to store. We keep them around because they may have
+        # existing shares for this storage index, which we want to know
+        # about for accurate servers_of_happiness accounting
+        self.readonly_peers = _make_trackers(readonly_peers)
         # We now ask peers that can't hold any new shares about existing
         # shares that they might have for our SI. Once this is done, we
         # start placing the shares that we haven't already accounted

comment:142 follow-up: Changed at 2010-01-20T06:44:44Z by zooko

servers_of_happiness() ends with:

    return len(list(set(servers)))

It seems like the servers local variable should be a set instead of a list from the start, and then the final line of the function can be:

    return len(servers)

comment:143 Changed at 2010-01-20T16:22:55Z by zooko

Kevan:

I've been struggling and struggling to understand the servers_of_happiness() function. The documentation -- that it attempts to find a 1-to-1 (a.k.a. "injective") function from servers to shares -- sounds great! But, despite many attempts, I have yet to understand if the code is actually doing the right thing. (Note: this may well be in part my fault for being thick-headed. Especially these days, when I am very sleep-deprived and stressed and busy. But if we can make a function that even I can understand then we'll be golden.)

So, one thing that occurs to me as I look at this function today is that it might help if existing_shares and used_peers had more consistent data types and names. If I understand correctly what they do (which is a big 'if' at this point), they could each be a map from shareid to serverid, or possibly a map from shareid to a set of serverid's, and their names could be existing_shares and planned_shares, and the doc could explain that existing_shares describes shares that are already alleged to be hosted by servers, and planned_shares describes shares that we are currently planning to upload to servers.

Would that be correct? It raises the question in my mind as to why servers_of_happiness() distinguishes between those two inputs instead of just generating its injective function from the union of those two inputs. I suspect that this is because we want to prefer existing shares instead of new shares when the two collide (i.e. when uploading a new share would be redundant) in the interests of upload efficiency. Is that true? Perhaps a note to that effect could be added to the servers_of_happiness() doc.

I realize that I have asked so many times for further explanation of servers_of_happiness() that it has become comment-heavy. Oh well! If we see ways to make the comments more concise and just as explanatory that would be cool, but better too many comments than too few, for this particular twisty little core function. :-)

Thanks!

comment:144 in reply to: ↑ 137 Changed at 2010-01-21T04:03:36Z by kevan

(I'm working on these as I have time -- I usually have a lot to do during the week)

Replying to zooko:

I realized as I was driving home just now that I don't know what the code will do, after Kevan's behavior.txt patch is applied, when "servers of happiness" can be achieved only by uploading redundant shares. For example, tests.txt adds a test in "test_upload.py" named test_problem_layout_comment_52 which creates a server layout like this:

        # server 0: shares 1 - 9
        # server 1: share 0
        # server 2: share 0
        # server 3: share 0

Where server 0 is read-write and servers 1, 2 and 3 are read-only. (And by the way Kevan, please make comments state that servers 1, 2 and 3 are read-only.)

In this scenario (with K == 3) the uploader can't achieve "servers of happiness" == 4 even though it can immediately see that all M == 10 of the shares are hosted on the grid.

But what about the case that servers 1, 2 and 3 were still able to accept new shares? Then our uploader could either abort and say "servers of happiness couldn't be satisfied", due to the fact that it can't achieve "servers of happiness" without uploading redundant copies of shares that are already on the grid, or it could succeed by uploading a new copy of shares 2 and 3.

We should have a test for this case. If our uploader gives up in this case then we should assert that the uploader gives up with a reasonable error message and without wasting bandwidth by uploading shares. If it proceeds in this case then we should assert that it succeeds and that it doesn't upload more shares than it has to (which is two in this case).

There is a test for this (or something very like this) in test_problem_layout_comment_53:

        # Try the same thing, but with empty servers after the first one
        # We want to make sure that Tahoe2PeerSelector will redistribute
        # shares as necessary, not simply discover an existing layout.
        # The layout is:
        # server 2: shares 0 - 9
        # server 3: empty
        # server 1: empty
        # server 4: empty
        d.addCallback(_change_basedir)
        d.addCallback(lambda ign:
            self._setup_and_upload())
        d.addCallback(lambda ign:
            self._add_server(server_number=2))
        d.addCallback(lambda ign:
            self._add_server(server_number=3))
        d.addCallback(lambda ign:
            self._add_server(server_number=1))
        d.addCallback(_copy_shares)
        d.addCallback(lambda ign:
            self.g.remove_server(self.g.servers_by_number[0].my_nodeid))
        d.addCallback(lambda ign:
            self._add_server(server_number=4))
        d.addCallback(_reset_encoding_parameters)
        d.addCallback(lambda client:
            client.upload(upload.Data("data" * 10000, convergence="")))
        return d

Note that this is slightly different than your case, in that the other servers have no shares at all. So the correct number of shares for the encoder to push is 3, not 2. I didn't have the assertion in there, though, so I'll go ahead and attach a patch where the assertion is there. This also uncovered a bug in should_add_server, in which should_add_server would not approve of adding unknown shares to the existing_shares dict if they were on a server that was already in existing_shares. I've fixed this, and added a test for it.

comment:145 in reply to: ↑ 139 ; follow-up: Changed at 2010-01-21T04:36:32Z by kevan

Replying to zooko:

I don't understand this comment. It appears that the code is blindly clobbering entries in self.preexisting_shares. Does that mean that the code is not making sure that we don't unintentionally report a lower happiness value than actually exists? I think it would make sense to turn self.preexisting_shares into a map from sharenum to a set of serverids instead of from sharenum to serverid.

If the code above is reporting a lower happiness value than it ought then we should be able to write a unit test that fails in which the file is actually happy but the upload reports it as unhappy.

The point of should_add_server is to make sure that that assignment only happens if it would make self.preexisting_shares happier than it already is -- that's what the if-statement is checking. Maybe I don't understand what you mean when you say blindly clobbering shares?

comment:146 follow-up: Changed at 2010-01-21T04:38:47Z by kevan

I'm not sure what I think about making preexisting_shares a map to a set of peerids. It would remove the need for should_add_server, but would also make the end calculations more complicated. I'll think about that when I've worked through the rest of your comments.

(of course, if I'm wrong in my last comment, maybe it makes sense to do that regardless of complexity)

comment:147 in reply to: ↑ 145 Changed at 2010-01-21T14:09:55Z by zooko

Replying to kevan:

The point of should_add_server is to make sure that that assignment only happens if it would make self.preexisting_shares happier than it already is -- that's what the if-statement is checking. Maybe I don't understand what you mean when you say blindly clobbering shares?

Shouldn't the point be to make the union of self.preexisting_shares and used_peers (a.k.a. planned_shares) happier than it already is?

By "blindly clobbering" I mean what your comment said originally: "blindly clobbering entries in self.preexisting_shares", i.e. doing a self.preexisting_shares[shareid] = something without first checking whether there is already an entry in that dict under the key shareid.

comment:148 in reply to: ↑ 146 ; follow-up: Changed at 2010-01-21T14:14:01Z by zooko

Replying to kevan:

I'm not sure what I think about making preexisting_shares a map to a set of peerids. It would remove the need for should_add_server, but would also make the end calculations more complicated. I'll think about that when I've worked through the rest of your comments.

Okay, think about it!

(of course, if I'm wrong in my last comment, maybe it makes sense to do that regardless of complexity)

Yeah, there are two notions of "complexity". One is computational -- basically "the size of the code for the shortest implementation of this". The other is cognitive -- basically "the difficulty for the Tahoe-LAFS hackers to learn this and then to hold it correctly in their heads".

It might be the case that making both data structures be maps from shareid to set of serverid would make the algorithm "simpler" in the latter way -- easier to comprehend. I'm not sure.

comment:149 in reply to: ↑ 148 Changed at 2010-01-22T03:56:09Z by kevan

Replying to zooko:

Yeah, there are two notions of "complexity". One is computational -- basically "the size of the code for the shortest implementation of this". The other is cognitive -- basically "the difficulty for the Tahoe-LAFS hackers to learn this and then to hold it correctly in their heads".

It might be the case that making both data structures be maps from shareid to set of serverid would make the algorithm "simpler" in the latter way -- easier to comprehend. I'm not sure.

I thought about it more, and found a lot to like about this idea. Not only did it eliminate should_add_server, but it made servers_of_happiness (to me, at least) much clearer. I implemented that earlier today, and I'll attach updated patches with that shortly. Please read servers_of_happiness (now at src/allmydata/util/happinessutil.py, since there was no good reason not to calculate happiness the same way in the encoder as we do in the peer selection process) again, and if it is still confusing, hopefully we can fix it so that it isn't.

comment:150 follow-up: Changed at 2010-01-22T04:33:15Z by davidsarah

 class NotEnoughSharesError(Exception):
     """Download was unable to get enough shares, or upload was unable to
-    place 'shares_of_happiness' shares."""
+    place 'servers_of_happiness' shares."""

The description needs to be something like "or upload was unable to meet the 'servers_of_happiness' threshold." The name of the exception is also not quite right, but I don't have a good suggestion if it still needs to be the same exception for upload and download. Perhaps it doesn't, in which case how about just FailedDownloadError and FailedUploadError? Why it failed is probably not something that client code is likely to switch based on.

comment:151 in reply to: ↑ 142 Changed at 2010-01-23T03:37:35Z by kevan

Replying to zooko:

servers_of_happiness() ends with:

    return len(list(set(servers)))

It seems like the servers local variable should be a set instead of a list from the start, and then the final line of the function can be:

    return len(servers)

In the current formulation of servers_of_happiness, I think this needs to be a list until it is returned, because we don't want the properties of a set until we've dealt with multiply-hosted shares. Perhaps you can read the revised function and let me know if you agree?

comment:152 in reply to: ↑ 141 Changed at 2010-01-23T03:55:26Z by kevan

Replying to zooko:

Kevan:

Here is a patch which attempts to make construction of the set of read-only peers easier to understand (for me). Please treat this patch merely as suggestions, not as an actual patch that I want you to apply. (If, for example, you think that it is better to have the self.readonly_peers member variable declared at the top of the function with the other member variables, that's okay with me, although then it should not be initialized to an empty list and then later redirected to point to a different object -- either the empty list should later be filled or else it should be initialized to None.)

--- old-kevan-778-2010-01-18-behavior/src/allmydata/immutable/upload.py 2010-01-19 23:33:47.000000000 -0700
+++ new-kevan-778-2010-01-18-behavior/src/allmydata/immutable/upload.py 2010-01-19 23:33:48.000000000 -0700
@@ -302,21 +302,11 @@
         self._started_second_pass = False
         self.use_peers = set() # PeerTrackers that have shares assigned to them
         self.preexisting_shares = {} # sharenum -> peerid holding the share
-        # We don't try to allocate shares to these servers, since they've 
-        # said that they're incapable of storing shares of the size that 
-        # we'd want to store. We keep them around because they may have
-        # existing shares for this storage index, which we want to know
-        # about for accurate servers_of_happiness accounting
-        self.readonly_peers = []
         # These peers have shares -- any shares -- for our SI. We keep
         # track
         # of these to write an error message with them later.
         self.peers_with_shares = []
 
-        peers = storage_broker.get_servers_for_index(storage_index)
-        if not peers:
-            raise NoServersError("client gave us zero peers")
-
         # this needed_hashes computation should mirror
         # Encoder.send_all_share_hash_trees. We use an IncompleteHashTree
         # (instead of a HashTree) because we don't require actual hashing
@@ -330,6 +320,10 @@
                                              None)
         allocated_size = wbp.get_allocated_size()
 
+        all_peers = storage_broker.get_servers_for_index(storage_index)
+        if not all_peers:
+            raise NoServersError("client gave us zero peers")
+
         # filter the list of peers according to which ones can accomodate
         # this request. This excludes older peers (which used a 4-byte size
         # field) from getting large shares (for files larger than about
@@ -338,10 +332,9 @@
             (peerid, conn) = peer
             v1 = conn.version["http://allmydata.org/tahoe/protocols/storage/v1"]
             return v1["maximum-immutable-share-size"]
-        new_peers = [peer for peer in peers
+        writeable_peers = [peer for peer in all_peers
                      if _get_maxsize(peer) >= allocated_size]
-        old_peers = list(set(peers).difference(set(new_peers)))
-        peers = new_peers
+        readonly_peers = set(all_peers) - set(writeable_peers)
 
         # decide upon the renewal/cancel secrets, to include them in the
         # allocate_buckets query.
@@ -362,8 +355,13 @@
                                bucket_cancel_secret_hash(file_cancel_secret,
                                                          peerid))
                     for (peerid, conn) in peers]
-        self.uncontacted_peers = _make_trackers(peers)
-        self.readonly_peers = _make_trackers(old_peers)
+        self.uncontacted_peers = _make_trackers(writeable_peers)
+        # We don't try to allocate shares to these servers, since they've 
+        # said that they're incapable of storing shares of the size that 
+        # we'd want to store. We keep them around because they may have
+        # existing shares for this storage index, which we want to know
+        # about for accurate servers_of_happiness accounting
+        self.readonly_peers = _make_trackers(readonly_peers)
         # We now ask peers that can't hold any new shares about existing
         # shares that they might have for our SI. Once this is done, we
         # start placing the shares that we haven't already accounted

I like having self.readonly_peers mentioned at the top of the function, if for no other reason than consistency, so if it's all the same to you, I'd like to leave that. I'll change it to initialize to None, though, and I'll amend that comment to mention that it is eventually a list, so the self-documenting aspects of that are preserved for the most part.

I think your other changes are a good idea, though, so I'll update my patches to incorporate them.

comment:153 in reply to: ↑ 150 Changed at 2010-01-23T04:04:48Z by kevan

Replying to davidsarah:

The description needs to be something like "or upload was unable to meet the 'servers_of_happiness' threshold." The name of the exception is also not quite right, but I don't have a good suggestion if it still needs to be the same exception for upload and download. Perhaps it doesn't, in which case how about just FailedDownloadError and FailedUploadError? Why it failed is probably not something that client code is likely to switch based on.

Actually, I think that the upload process was switched in an earlier patch to use UploadUnhappinessError instead of that, so NotEnoughSharesError should only be raised in the context of a download (from what I can tell, outside of the immutable file code it isn't used in any other context either, but I might have missed something). Then it should be fine to just get rid of that part of the comment entirely. If it is not raised in the context of an upload, do you still think the name is inappropriate?

comment:154 Changed at 2010-01-23T04:09:58Z by kevan

Aha, I apparently already made that change -- it just made it into a later part of behavior.txt. While researching that, though, I noticed that I had neglected to rename something that referenced NotEnoughSharesError in the context of an upload, so I'll go ahead and incorporate that fix into the others that I'm uploading.

comment:155 Changed at 2010-01-25T05:02:24Z by zooko

  • Owner changed from warner to zooko
  • Status changed from new to assigned

Tomorrow morning we'll see whether behavior.txt passes the "bus to work" test. I have about 20 minutes on the bus to work. Can I read and understand the patch in 20 minutes? While sleep-deprived and distracted? If so, then it is a great patch in terms of ease of comprehension! Whether I can comprehend it or not, I'll try to post some comment about it when I get to work. :-)

(Brian, David-Sarah, others, you are still more than welcome to review docs.txt and tests.txt. Also you are welcome to review behavior.txt, but I would suggest that you post your feedback about docs.txt and tests.txt first, if you haven't already.)

comment:156 follow-up: Changed at 2010-01-25T16:51:03Z by zooko

Okay a few quick notes:

I'm getting closer! For one thing, there is less code now, and it is starting to make more sense to me. But I didn't manage to really understand the whole thing on this bus ride.

I got stuck on this line for a while:

    # This is a list of every peer that has a share in our layout.
    servers = reduce(lambda x, y: x + y, [list(x) for x in
                                          existing_shares.values()], [])

In general when reading Python I find it easier to understand "imperative style", in which variables get initialized and mutated, than "functional style", in which values get computed as compound functions of their inputs. After I figured out what the line above does, I saw that the later code uses count() on servers, and so I think the intended semantics of servers is a list where each serverid appears one time for each unique share that the server holds.

I would find it easier to understand if servers were a map from serverid to number of unique shares that that server holds (or will hold according to the current plan as represented by used_peers). Oh, in fact, how about calling servers = shares_by_server(existing_shares) (after existing_shares has been updated to include the information from used_peers), then use len(servers[serverid]) to get the number of shares?

Oh, this existing_shares has the wrong name now that it includes information about planned shares from the used_peers argument. Maybe just shares. :-)

Okay, then I ran out of time and didn't read the new changes to upload.py.

I think it would help if the algorithm for calculating servers-of-happiness were written out in docs. I mean the algorithm as described imperatively as a sequence of calculations and mutations. Let me try to explain the algorithm myself as a way to get you started on documenting it:

The algorithm for calculating servers-of-happiness starts with a map from servers to shares that the server holds. More than one server can map to a single share and more than one share can be mapped to from a single server.

The algorithm looks for a share that has more than one server mapping to it and then excludes all but one of those links (from server to share). Then it iterates this until there are no more cases of a share with more than one server mapping to it, and then it is done and the happiness value is the number of servers mapping to a share.

Now to finish fully defining the algorithm we have to explain how it chooses which link to retain when it finds multiple links pointing to a share.

By the way last night a servers-of-happiness puzzle occurred to me. Suppose server A maps to shares 0 and 1, and server B maps to shares 1 and 2, and server C maps to share 2. Now if the servers-of-happiness algorithm described above first looks at share 1, it has to decide whether to exclude the link from server A to share 1 or from server B to share 1. If it chooses to exclude the link from server B to share 1 then it is going to end up returning a servers-of-happiness value of 2. But if it chooses to exclude the link from server A to share 1 then it is going to end up returning a servers-of-happiness value of 3. But I don't think the algorithm can figure that out just by examining share 2. We should write a unit test of this case and (a) see whether the current algorithm passes that test and (b) if it does, figure out why it does and see if we can write another unit test that it won't pass. :-)
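
Here is a tiny sketch of the greedy procedure described above, just to make the choice-dependence concrete. It works on the reverse map (shareid -> set of serverids), resolves each contested share with a pluggable keep() choice, and is a simplification of the iterative description above; it is illustrative only, not code from any attached patch:

def greedy_happiness(share_to_servers, keep):
    """
    share_to_servers: dict of shareid -> set of serverids offering it.
    keep: given a contested share and its set of servers, choose the one
          server that keeps its link. Everything hinges on this choice.
    """
    resolved = {}
    for shareid, servers in share_to_servers.items():
        if len(servers) > 1:
            resolved[shareid] = set([keep(shareid, servers)])
        else:
            resolved[shareid] = set(servers)
    # happiness = number of distinct servers still linked to some share
    happy_servers = set()
    for servers in resolved.values():
        happy_servers.update(servers)
    return len(happy_servers)

# The puzzle above, as share -> servers: 0 -> {A}, 1 -> {A,B}, 2 -> {B,C}
puzzle = {0: set(["A"]), 1: set(["A", "B"]), 2: set(["B", "C"])}
greedy_happiness(puzzle, keep=lambda share, servers: min(servers))  # == 2
greedy_happiness(puzzle, keep=lambda share, servers: max(servers))  # == 3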

comment:157 in reply to: ↑ 156 Changed at 2010-01-25T16:54:31Z by zooko

replying to myself:

Replying to zooko:

The algorithm for calculating servers-of-happiness starts with a map from servers to shares that the server holds. More than one server can map to a single share and more than one share can be mapped to from a single server.

The algorithm looks for a share that has more than one server mapping to it and then excludes all but one of those links (from server to share). Then it iterates this until there are no more cases of a share with more than one server mapping to it, and then it is done and the happiness value is the number of servers mapping to a share.

Now to finish fully defining the algorithm we have to explain how it chooses which link to retain when it finds multiple links pointing to a share.

Oh, and we have to specify how it chooses which share to examine next. Presumably a good way to choose the next share plus a good way to choose which links to exclude can solve the puzzle I posted in comment:156 and other similar puzzles.

comment:158 Changed at 2010-01-25T22:10:00Z by zooko

A simpler puzzle with the same issue is: server A maps to share 0 and share 1, server B maps to share 1.

comment:159 Changed at 2010-01-25T22:11:39Z by zooko

I think we should not include #778 in Tahoe-LAFS v1.6. Even after I finish reviewing the patch and am willing to commit it to trunk, we should probably give it some time to be manually tested before putting it into a stable release. So let's plan to put it into the next stable release.

Tahoe-LAFS v1.6 needs to be released by (I think, probably) Thursday to have a reasonable chance of inclusion in Ubuntu Lucid.

comment:160 Changed at 2010-01-26T02:55:03Z by kevan

Okay. Then I'm going to focus on finishing my review of #833 instead of this.

comment:161 Changed at 2010-01-26T15:05:16Z by zooko

  • Milestone changed from 1.6.0 to eventually

comment:162 Changed at 2010-01-27T01:35:12Z by davidsarah

I was sure that computing servers-of-happiness must correspond to some well-understood mathematical problem, but I've now discovered which one -- it's called "maximum bipartite matching".

Since servers are distinct from shares, the relation between servers and shares corresponds to a bipartite graph.

Any one-to-one (injective) partial function from servers to shares that is included in this relation is called a matching of the bipartite graph.

The servers-of-happiness number is the maximum size of any such matching. Intuitively, that's because a matching gives a set of H servers that have at least H distinct shares between them.

So, apply a maximum bipartite matching algorithm to the sharemap (to find some maximum matching), take its size, and you're done.

(Actually, the servers-of-happiness value as we originally defined it is max(k-1, max_bipartite_matching_size). But the max_bipartite_matching_size works as a measure of the quality of the server->share mapping even when it is less than k-1. In that case there won't be enough shares to recover the file, of course.)

I doubt that the algorithm outlined by Zooko always computes max_bipartite_matching_size correctly. That's because it is a greedy algorithm -- it works by removing edges from the bipartite graph that are not in some maximum matching, and assumes that it's possible to make a correct local decision of which edge to remove. There are definitely greedy algorithms that result in a near-maximum matching, but it won't always be maximum.

Algorithms to find maximum bipartite matchings are really simple, just a few lines of code, and have reasonable algorithmic complexity. I will try to find one and translate it to Python.
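
As a concrete illustration of how little code that takes, here is a minimal sketch of the standard augmenting-path approach (Kuhn's algorithm), assuming a dict of serverid -> set of shareids; the names are illustrative and this is not the implementation promised above:

def maximum_matching_size(sharemap):
    """
    sharemap: dict of serverid -> set of shareids held (or planned to be
    held) on that server. Return the size of a maximum bipartite
    matching, i.e. the largest number of servers that can each be paired
    with a distinct share.
    """
    matched = {}  # shareid -> serverid currently matched to it

    def augment(server, visited):
        # Depth-first search for an augmenting path starting at server.
        for share in sharemap.get(server, ()):
            if share in visited:
                continue
            visited.add(share)
            if share not in matched or augment(matched[share], visited):
                matched[share] = server
                return True
        return False

    return sum(1 for server in sharemap if augment(server, set()))

# Zooko's puzzle from comment:156: a greedy approach can return 2, but
# the maximum matching has size 3 (A -> 0, B -> 1, C -> 2).
assert maximum_matching_size({"A": set([0, 1]),
                              "B": set([1, 2]),
                              "C": set([2])}) == 3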

comment:163 Changed at 2010-01-27T02:39:39Z by terrell

this is brilliant. well done sir.

comment:164 Changed at 2010-01-27T06:08:47Z by zooko

  • Milestone changed from eventually to 1.7.0

comment:165 Changed at 2010-02-15T19:31:20Z by davidsarah

  • Milestone changed from 1.7.0 to 1.6.1

comment:166 Changed at 2010-02-15T20:15:31Z by davidsarah

  • Milestone changed from 1.6.1 to 1.7.0

Oops, I missed Zooko's message disagreeing with doing this for 1.6.1.

comment:167 Changed at 2010-02-15T20:47:54Z by kevan

I've been playing with this over the weekend after (very slowly) brushing up on graph theory enough to understand bipartite matching algorithms. I'm attaching updated behavior.txt and tests.txt patches with an implementation of the Edmonds-Karp algorithm for computing a maximum bipartite matching. My implementation is fairly general, so there is probably room for optimization. Also, David-Sarah were working on an implementation which may be clearer/more efficient than mine. I have not added any of Zooko's puzzles to the tests yet, but my changes pass all of the existing tests.

Changed at 2010-02-15T20:48:24Z by kevan

adding #834 behavior to the #778 patches

comment:168 Changed at 2010-02-22T23:43:43Z by zooko

I closed #834 as a duplicate of this ticket.

comment:169 Changed at 2010-03-15T17:52:44Z by davidsarah

  • Keywords preservation added; reliability removed

comment:170 Changed at 2010-03-19T04:33:26Z by zooko

I'm sorry, I guess I was waiting for you to add my puzzles to the tests, but I guess you are waiting for me to review the current patches! I will do so when I have the opportunity--probably this coming weekend.

comment:171 Changed at 2010-03-19T04:53:01Z by kevan

oops, I'd forgotten about those completely. Sorry! I'll add them now.

Changed at 2010-03-19T05:24:17Z by kevan

tests updated to be current

comment:172 follow-up: Changed at 2010-04-25T04:33:25Z by zooko

I was describing this ticket to Kris Nuttycombe and he wants to know why we have a separately configurable h parameter instead of using n for the "servers of happiness". That is: if your erasure coding parameter n is 10, then your servers of happiness ought to be 10 too. This sounds like a really good idea to me because it is what people assume when, like Kris, they know only the basic architectural fact that Tahoe-LAFS uses erasure coding.

The fact that the docs in the current patch on this ticket warn the user not to set h higher than the number of servers they have seems to play into this.

(I have the vague feeling that we've already considered this option and rejected it for some reason or another in this giant nine month long, one hundred and seventy message long thread, but I thought it would be worth writing it down, so here's message one hundred and seventy two!)

comment:173 Changed at 2010-04-25T04:39:18Z by zooko

The docs from the doc patch advise you what you ought to set N equal to, but don't explicitly advise you what to set h to:

""" we'll try to place 10 shares, we'll be happy if we can place shares on enough servers that there are 7 different servers, the correct functioning of any 3 of which guarantee the availability of the file, and we need to get back any 3 to recover the file. This results in a 3.3x expansion factor. In general, you should set N about equal to the number of peers in your grid, then set N/k to achieve your desired availability goals. """

Kris points out that having each upload touch every one of the servers in your distributed filesystem seems like a strange thing to do. That's a good point! It probably makes good sense to set N to the number of servers when the number of servers that you have is between 6 and 16. It probably makes less sense when you have 50 or more.

comment:174 Changed at 2010-04-25T06:29:02Z by zooko

Okay I've read through the latest test.txt and I didn't see anything wrong with it. Nice work, Kevan!

comment:175 in reply to: ↑ 172 Changed at 2010-04-25T15:38:17Z by davidsarah

Replying to zooko:

I was describing this ticket to Kris Nuttycombe and he wants to know why we have a separately configurable h parameter instead of using n for the "servers of happiness". That is: if your erasure coding parameter n is 10, then your servers of happiness ought to be 10 too. This sounds like a really good idea to me because it is what people assume who (like Kris) know only the basic architectural fact of Tahoe-LAFS's use of erasure coding.

The desired behaviour is that an upload should try to place n shares, but settle for placing h shares mapping to unique servers. If they are the same, that would mean that we always only try to obtain the minimum preservation and availability that the user has requested. It seems to me to be a good idea to attempt to obtain a higher safety margin whenever the requests to the additional n-h servers succeed immediately.

(If #946 were implemented in a way that did not increase memory usage, then the only additional cost would be bandwidth, not upload latency.)

comment:176 Changed at 2010-04-26T23:04:23Z by kevan

Decoupling the n and h parameters might also be useful in the context of share redistribution and friendnets -- i.e.: if you have 10 mostly-reliable storage servers on your friendnet, then you'd ideally want one share on each of them, but you expect 2 or 3 of them to be offline sometimes, so you settle for h = 7, then let a redistribution process move the doubly-placed shares onto other storage servers as they become available. Using multiply-placed shares would allow a redistribution process to accomplish this without you giving it anything beyond the verify-cap.

It would, AFAIK, be possible to get as far as getting the unencoded ciphertext of a mutable or immutable file with only a verify cap, which also means that a redistribution process, given the verify cap, could re-encode the file if necessary, but it could not go farther than that. In the case of an immutable file, the block hash tree, share hash tree, and root hashes would all change, which means that the hash of the UEB would change, which means that the redistributed version of the file would be a new file, since a capability designating it would be distinct from a capability designating another encoding of the same ciphertext. In the case of a mutable file, the redistribution process would need a writecap to get the RSA private key necessary to sign the new encoding of the file, since the hash trees also change in the mutable file.

(in writing the above, I'm imagining a "redistributor" helper; you would give your verify caps to the redistributor, and it would, given the n encoded in the cap, try to make sure that h = n for you automatically)

OTOH, we don't have a redistribution helper right now, so maybe that's not a good reason to keep them separate.

(I don't think we've discussed this yet, so it is good that it was brought up.)

comment:177 Changed at 2010-04-27T00:05:20Z by kevan

when I say "doubly-placed shares", I mean "multiply-placed shares", and when I say "multiply-placed shares", I mean "shares that have been placed on storage servers that are already storing shares for that upload", which is another way of saying "shares that won't make servers-of-happiness go up".

comment:178 Changed at 2010-04-27T05:03:32Z by zooko

David-Sarah and Kevan: thanks for your explanations (comment:175 and comment:176) of why it is useful to have h < n. That makes sense. (It's roughly similar to what I sleepily told Kris when he asked the other night.) I liked Kevan's friendnet motivating example.

Kevan: I don't entirely follow what you are talking about in (comment:176) about repairing using a verify cap and the immutable file cap changing. There are at least two tickets that sound relevant: #568 (converge same file, same K, different M), #678 (converge same file, same K, different M).

comment:179 Changed at 2010-04-27T14:09:43Z by zooko

I intend to start reviewing Kevan's latest behavior.txt patch on the way to work this morning.

comment:180 Changed at 2010-04-27T16:36:19Z by kevan

If I'm not mistaken, the issue is essentially the one described in ticket 678, comment 3: a new encoding of the same file yields a different UEB hash than the old encoding; this causes clients with the old filecap to reject the shares from the new encoding because, from their perspective, the UEB is wrong.

If I'm not mistaken, the current repairer regenerates shares that were generated during the initial upload but are no longer available. In the first case in comment:176 (where we decouple h and n to tolerate multiply placed shares so we can move them around later), we're redistributing shares to achieve optimal reliability, even though all of the shares may still be available. In the second case (where h and n are the same thing, so we have to generate a new encoding of the same file to take advantage of opportunities to improve reliability), we're re-encoding the file so we can take advantage of an opportunity to improve reliability. The file is not necessarily damaged or unhealthy in either of these cases, so while a redistribution operation could conceptually be called a repair, it is important to distinguish that from file repair as currently implemented in Tahoe-LAFS. I mention this in case conflating this operation and the repair facilities currently implemented in Tahoe-LAFS led to your confusion.

Does that make more sense?

comment:181 Changed at 2010-04-27T16:38:00Z by kevan

This also suggests a value for h -- maybe it can be the smallest number of servers that you reasonably expect to be on your grid when it is functioning correctly.

Changed at 2010-04-28T00:49:27Z by kevan

update documentation patches per comment:181 and comment:173

comment:182 follow-up: Changed at 2010-05-06T05:18:07Z by zooko

Here is a partial review of the latest behavior.txt. I did this review on the bus after work, when I was tired, and now it is past bed-time and I'm even more tired, so take it with a grain of salt. The line numbers in the following are from viewing this URL: http://allmydata.org/trac/tahoe-lafs/attachment/ticket/778/behavior.txt . By the way Kevan, please start versioning your patch attachments to this ticket, e.g. calling the next one "behavior2.txt", so that I can refer back to the one that I just partially reviewed, and so that these line numbers will continue to be correct for that URL.

---

Here is a suggested edit to an error message:

                       msg = ("shares could only be placed on %d server(s) "
                              "such that any %d of them have enough shares "
                              "to recover the file, but we were asked to use "
                              "at least %d such servers." %

change to

                       msg = ("shares could only be placed on %d server(s) "
                              "such that any %d of them have enough shares "
                              "to recover the file, but we were asked to place "
                              "shares on at least %d such servers." %

--- small improvement in Python usage:

       for k in servermap:
           assert isinstance(servermap[k], set)

replace with

       for v in servermap.itervalues():
           assert isinstance(v, set)

---

1088        +            peerid = self.landlords[shareid].get_peerid()
...
1091        +            if peerid:
1092	+                self.servermap[shareid].remove(peerid)

Can peerid be false? Well, perhaps this question is out of scope of this ticket, since I see that get_peerid() thinks that it can return None, although I can't see how it could ever do that. Maybe add precondition(nodeid) in WriteBucketProxy.__init__().

---

1108        +            msg = ("lost too many servers during upload "
1109	+                   "(happiness is now %d, but we wanted %d): %s" %
1110	+                   (happiness,
1111	+                    self.servers_of_happiness, why))
1112	             raise UploadUnhappinessError(msg)

Does this mean we'll get a less detailed error message if we become unhappy due to server loss during upload than if we become unhappy due to not being able to find sufficient servers to begin the upload? E.g. the way in the latter case we distinguish between peer_count < self.needed_shares vs. happiness < self.needed_shares vs. len(x-happy-subset) < servers_of_happiness. Maybe we should extract the reporting code that produces a message for each of those three cases into a function and call that function from the two places where we decide to give up.

---

1427                             peer_count = len(list(set(self.peers_with_shares)))

Make self.peers_with_shares be a set in the first place. --- Okay I got down to line 1427 of the representation of behavior.txt at http://allmydata.org/trac/tahoe-lafs/attachment/ticket/778/behavior.txt . Hopefully I'll pick it up from there tomorrow!

comment:183 in reply to: ↑ 182 Changed at 2010-05-07T22:33:03Z by kevan

Replying to zooko:

Here is a partial review of the latest behavior.txt. I did this review on the bus after work, when I was tired, and now it is past bed-time and I'm even more tired so take it with a grain of salt. The line numbers in the following are from viewing this URL: http://allmydata.org/trac/tahoe-lafs/attachment/ticket/778/behavior.txt . By the way Kevan, please start versioning your patch attachments to this ticket, e.g. calling the next one "behavior2.txt" so that I can refer back to the one that I just partially reviewed, and so that these lines numbers will continue to be correct for that URL. --- Here is a suggested edit to an error message:

Done.

--- small improvement in Python usage:

Done.

Can peerid be false? Well, perhaps this question is out of scope of this ticket, since I see that get_peerid() thinks that it can return None, although I can't see how it could ever do that. Maybe add precondition(nodeid) in WriteBucketProxy.__init__().

In the context of that statement, it can't be false. In general, it can, though -- see, for example,

wbp = layout.make_write_bucket_proxy(None, share_size, 0, num_segments,
                                             num_share_hashes, EXTENSION_SIZE,
                                             None)

in Tahoe2PeerSelector.get_shareholders, where the wbp is used without a peerid to get a size value that can then be used to filter out servers that don't accept a share as large as those associated with the file it is trying to place. So I don't think adding that precondition to WriteBucketProxy.__init__ is necessarily right -- we could still support the use case in get_shareholders with a dummy peerid if we did that, though. I'll remove that if-statement, and replace it with an assert peerid to make things clearer.

Does this mean we'll get a less detailed error message if we become unhappy due to server loss during upload than if we become unhappy due to not being able to find sufficient servers to begin the upload? E.g. the way in the latter case we distinguish between peer_count < self.needed_shares vs. happiness < self.needed_shares vs. len(x-happy-subset) < servers_of_happiness. Maybe we should extract that reporting code that produces a message for each of those three cases into a function and call that function from those two places that we decide to give up.

Good idea; done.
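
(For concreteness, a hypothetical sketch of what such an extracted helper might look like -- the names here are illustrative, not necessarily the ones in the patch:)

    def failure_message(peer_count, k, happy, effective_happy):
        # peer_count: servers that hold or have accepted at least one share
        # k: shares needed to recover the file (self.needed_shares)
        # happy: the servers-of-happiness threshold we were asked to meet
        # effective_happy: the happiness actually achieved by this placement
        if peer_count < k:
            return ("shares could be placed or found on only %d server(s); "
                    "we need at least %d servers" % (peer_count, k))
        elif effective_happy < k:
            return ("shares could be placed or found on %d server(s), but "
                    "they are not spread out evenly enough to ensure that "
                    "any %d of those servers have enough shares to recover "
                    "the file" % (peer_count, k))
        else:
            return ("shares could be placed on only %d server(s) such that "
                    "any %d of them have enough shares to recover the file, "
                    "but we were asked to place shares on at least %d such "
                    "servers" % (effective_happy, k, happy))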

Make self.peers_with_shares be a set in the first place.

Done.

Thanks for the review.

Changed at 2010-05-07T22:34:07Z by kevan

Changed at 2010-05-07T22:34:20Z by kevan

comment:184 Changed at 2010-05-12T04:23:28Z by zooko

For your information, I decided to put Kevan's patches (from behavior2.txt) into trac so that I could explore them with a nice web UI:

http://tahoe-lafs.org/trac/tahoe-lafs-ticket778/changeset/4293

Don't get confused -- URLs that begin with http://tahoe-lafs.org/trac/tahoe-lafs-ticket778 point at a trac instance that has Kevan's patches and is therefore divergent from trunk, which has the URL http://tahoe-lafs.org/trac/tahoe-lafs. The patches shown in the trac of Kevan's branch may mutate before they actually get applied to trunk.

comment:185 Changed at 2010-05-12T04:25:51Z by zooko

Okay and I also just applied the 12 tests patches from tests2.txt:

http://tahoe-lafs.org/trac/tahoe-lafs-ticket778/log/

comment:186 Changed at 2010-05-12T04:35:07Z by zooko

Currently error messages say something like "shares could only be placed on 2 servers". If you shift the word "only" to the right as far as possible then it becomes less ambiguous, e.g.:

  • "shares could only be placed on 2 servers"... but they could be found on two other servers where they couldn't be placed!

versus:

  • "shares could be placed on only 2 servers"... Oh, okay the word "only" modifies "2" here, not "placed".

Moral: always try to shift the word "only" as far to the right as possible.

comment:187 follow-ups: Changed at 2010-05-12T06:57:34Z by zooko

I get confused about this logic in _loop() on line 315 of upload.py on Kevan's branch.

Questions:

  • It is using only uncontacted_peers in if delta <= len(self.uncontacted_peers) and not also considering peers that have already been contacted but could accept another share, right? So what if we had a situation like {server0: (share0, share1, share2), server1: (share0), server2: (share0)} and the only way to achieve happiness is to upload share1 or share2 to either server1 or server2? Would the if evaluate to False since len(self.uncontacted_peers) is zero and the upload would fail when it shouldn't? This would be similar to the situation tested in line 1026 of test_upload.py on Kevan's branch, but in that test the servers have no shares instead of having one share each. It would also be similar to the situation tested in line 853 of test_upload.py on Kevan's branch but in that test the servers which have one share each are read-only so the upload must fail.
  • Is _loop() guaranteed to terminate? Could it endlessly cycle, reassigning some shares to some servers and then recursing in line 356 (Kevan's branch) until it hits the Python maximum recursion limit? If it is guaranteed to terminate, how can we tell?
  • For that matter, isn't _loop() which calls itself really sort of like the greedy algorithm that I proposed back in comment:157 and which is not guaranteed to find a solution even if one exists? Now that we have a maximum bipartite matching algorithm can't we do away with _loop() entirely?

Hm, I see that in addition to the self-call on line 356, _loop() is also called from line 411 (Kevan's branch) and from line 504 (Kevan's branch). Line 411 is when we've finished assigning shares to servers which already had shares, and line 504 is when we get a response from a remote call to a server's allocate_buckets().

It kind of seems to me that only one of these three calls to _loop() should still be made in the new world of maximum bipartite matching, and that is the one where we just got new information from a server in response to our attempt to allocate_buckets(). (That new information could be information to the effect that the server already has some shares, information to the effect that the server hereby agrees to hold some shares that we will upload, or information to the effect that the server has failed.) The other two calls to _loop() seem to be part of an iterative/imperative/greedy algorithm which is hopefully superseded by maximum bipartite matching.

Well, one way to tell if what I just said is true is to add a unit test of a case like the one I mentioned above, where there are servers that are already holding shares (so they will not be in self.uncontacted_peers), but need to receive more shares in order to achieve happiness. Oh wait, what does it take for a server to be absent from self.uncontacted_peers? Well, per line 237 you get added to self.uncontacted_peers if you are writable, and then in line 369 you get popped off of self.uncontacted_peers whenever we decide to ask you to hold a share. So it seems like we need a test of a case where to succeed you have to first upload one share to every writable server and then you do not have any homeless shares (so that the if on line 316 will be True) and then you have not already achieved servers-of-happiness (so that the if on line 319 will be False) and then if you upload more shares to one or more of the servers that you've already uploaded a share to then you can achieve servers-of-happiness.

Is it even possible to have such a situation? I can't come up with an example in my head right now. Perhaps this means the current implementation is correct? :-)

Taking a step back, we have now a nice algorithm for determining how many servers-of-happiness we have for a given static mapping from servers to shares. Do we also have a nice algorithm for dynamically choosing which servers should receive which shares? Or do we have an ad-hoc and underspecified state machine encoded into _loop()? :-) (Mind you, if it is the latter, but it behaves at least as well as v1.6.1 does for all the tests we can throw at it, then it is probably still worth shipping in v1.7.0 beta.)

This dynamic algorithm has to do the following:

  1. At start-up time you tell it a set of read-only servers and a set of writable servers.
  2. Then it issues get_buckets() requests to every read-only server.
  3. Then it can issue allocate_buckets() requests to any servers that it wants.
  4. Each allocate_buckets() will eventually return a tuple of "already got these shares" and "I hereby agree to hold those shares" or else fail (per RIStorageServer). The interface doesn't say this, but the current storage servers will tell you that they already have shares from this file even if you didn't request that they hold those shares from this file--see StorageServer.remote_allocate_buckets(). If you asked a server to hold a share and it doesn't tell you that it already has the share nor does it tell you that it hereby agrees to hold the share then one of two things happened: it turned out to be full and couldn't hold that share, or someone else is simultaneously uploading that share right now but hasn't finished.
  5. The algorithm has to remove any failing servers from the set of known servers.
  6. The algorithm can issue more allocate_buckets() requests in response to earlier requests resolving or failing.
  7. Eventually the algorithm must terminate.
  8. Once it has terminated: if the set of shares that non-failed servers have agreed to hold, plus the set of shares that non-failed servers have announced that they already have, satisfies the servers-of-happiness criterion, then proceed to upload shares; else fail with a useful failure message. (A rough sketch of this check follows the list.)
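
A rough sketch of that final (step 8) check, with illustrative names rather than the actual attributes in upload.py: already_got and agreed_to_hold each map a non-failed server to a set of share numbers, and happiness_of stands in for the maximum-bipartite-matching computation.

    def step8_is_happy(already_got, agreed_to_hold, happiness_of, happy_threshold):
        merged = {}
        for placements in (already_got, agreed_to_hold):
            for server, shares in placements.items():
                merged.setdefault(server, set()).update(shares)
        # Upload only if the combined placement meets the servers-of-happiness
        # criterion; otherwise the upload should fail with a useful message.
        return happiness_of(merged) >= happy_threshold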

So: can we describe what the current implementation on Kevan's branch does in terms of how it chooses its step 3 and step 6 moves? I remember that when we originally started on this ticket nine months ago, I boldly stated that we could leave the current upload logic from Tahoe-LAFS v1.6.1 (i.e. the step 3 and step 6 choices) in place and just change the evaluation at the end (i.e. the step 8 decision). This might still be true, but at this point I'm not sure if we've introduced a regression in the step 3 and step 6 behavior. Can we convince ourselves that we haven't, by analysis or testing? Or could we define newer, simpler step 3 and step 6 behavior that would be easier to convince ourselves is correct behavior?

Considering that we really want to release Tahoe-LAFS v1.7 beta about 48 hours from now it would probably be better to go down the first route of analyzing the current implementation to convince ourselves that it is not worse-behaved than the v1.6.1 implementation instead of changing the upload behavior from the current implementation. :-)

Okay, here is a more specific question: what is the difference, if any, between Tahoe-LAFS v1.6.1 and Kevan's #778 branch with regard to steps 3 and 6? Can you answer that question, Kevan?

Hm. Wow, there is a difference between 1.6.1 and Kevan's #778 branch with regard to step 3. In Kevan's branch it issues get_buckets() queries to the read-only servers (step 2) serially. It waits to hear back from the first read-only server before it queries the second read-only server, and so on. That could be a performance problem when there are a lot of read-only servers. On the other hand in v1.6.1 it doesn't appear to query read-only servers at all, which means that if you've uploaded a file and then the servers to which you uploaded it fill up, then next time you upload that file you will upload all of its shares to new not-yet-full servers instead of being happy with the current servers. Can that be right? Kevan has added a test of that sort of situation, so if it is true that v1.6.1 behaves so badly then v1.6.1 should fail Kevan's new test. :-)

comment:188 follow-up: Changed at 2010-05-12T07:01:56Z by zooko

Okay folks, I could use some help reviewing this branch. :-) Nobody can be expected to read through the history of this ticket, but perhaps you could start by just reading the diff of all of Kevan's patches put together. Kevan's patches include very good docs and tests, which will hopefully serve as your guide.

(I got that combined diff showing all of Kevan's patches by looking at the log and selecting two versions to diff.)

comment:189 in reply to: ↑ 188 ; follow-up: Changed at 2010-05-12T08:10:33Z by taral

This change removes progress when a server already has a share. Intentional?

The algorithm as written appears to make assumptions about happiness. Specifically, happiness is the number of distinct servers with at least one share. If that's not the case, then this algorithm will probably break badly, for the reasons you specified. Otherwise, it appears to be correct.

My only recommendation would be to sort the servers in order of number of shares held (descending) before picking shares to redistribute.

I did not review the testcases.

comment:190 in reply to: ↑ 189 ; follow-up: Changed at 2010-05-12T13:50:11Z by zooko

Dear taral: thank you for the code review!

Replying to taral:

This change removes progress when a server already has a share. Intentional?

Hm... Good catch. That looks unintentional to me. At least it is inconsistent with the comment left at line 477. I think removing progress = True from there means that when you get a response from a server which did not accept any of your requests to hold a share but which did inform you that it already had some shares then you'll treat that as "not progress". Oh, but it looks like the only effect of "progress" is to change good_query_count, bad_query_count, and full_count, which are used only for reporting progress, so maybe it is intentional or at least harmless.

We'll leave it to Kevan to answer definitively.

The algorithm as written appears to make assumptions about happiness. Specifically, happiness is the number of distinct servers with at least one share.

That's not exactly the definition of happiness. I forgot to push Kevan's doc patches into the branch. Here they are. They state what happiness means.

My only recommendation would be to sort the servers in order of number of shares held (descending) before picking shares to redistribute.

Hm, I'm not sure about that -- I'll leave that to Kevan to think about.

comment:191 in reply to: ↑ 187 ; follow-ups: Changed at 2010-05-12T21:32:15Z by kevan

Replying to zooko:

  • It is using only uncontacted_peers in if delta <= len(self.uncontacted_peers) and not also considering peers that have already been contacted but could accept another share, right? So what if we had a situation like {server0: (share0, share1, share2), server1: (share0), server2: (share0)} and the only way to achieve happiness is to upload share1 or share2 to either server1 or server2? Would the if evaluate to False since len(self.uncontacted_peers) is zero and the upload would fail when it shouldn't? This would be similar to the situation tested in line 1026 of test_upload.py on Kevan's branch, but in that test the servers have no shares instead of having one share each. It would also be similar to the situation tested in line 853 of test_upload.py on Kevan's branch but in that test the servers which have one share each are read-only so the upload must fail.

Hm.

First note that, if we're in this situation, we have 3 shares to place -- that condition, I think, is only hit when there are no more homeless shares, and the only shares that we've accounted for so far are 0, 1, and 2. Then the happiest upload we can make has happiness of 3, since happiness is bounded from above by the size of the smaller of the two vertex sets in the bipartite graph, and we know that both of these have size of either exactly (in the case of the shares) or at least (in the case of the servers) 3. The happiness that we see above is 2.

If this is fixable by redistribution, at least one of server 1 and server 2 must accept new shares; otherwise, the algorithm is right to declare the upload unhappy and give up. Also, at least one of server 1 and server 2 must be uncontacted (_loop has not yet asked it to hold a share), since _loop will not in its first pass (corresponding to step 3 in your terminology below) ask a peer to hold a share that it has already placed. So our possible causes for a layout like this are something like:

We see server 0 first; we either attempt to send share 0 to server 0, or server 0 is readonly but already has all of them. We see [server 1 | server 2] first; we either store share 0 on [server 1|server 2], or it is readonly and we detect it during step 2. We then see server 0, and stop placing shares because we have already accounted for them. Then, since [server 1 | server 2] is both writable and not yet contacted, we can't know that it holds share 0.

So we'll remove shares from server 0 (because it's the only one with more than one share), and put them in homeless_shares for distribution to server 2 (for the sake of argument, without loss of generality, etc). If we opt to redistribute share 1 or share 2, we're okay; happiness is 3. If we opt to redistribute share 0, we're not, because (though we wouldn't know it) server 2 already has that one.

So I guess that's not the right way of going about share redistribution. Makes sense; I spent a while looking for an example that would break that logic, but I couldn't find one.

  • Is _loop() guaranteed to terminate? Could it endlessly cycle, reassigning some shares to some servers and then recursing in line 356 (Kevan's branch) until it hits the Python maximum recursion limit? If it is guaranteed to terminate, how can we tell?

The reassignment is contingent upon there being uncontacted servers. Regardless of how we reassign shares, the loop will contact at least one uncontacted server on each iteration if there are any -- see line 368 of _loop. Since the number of uncontacted servers must either decrease or already be zero on each iteration of the loop, it will eventually terminate (or, if it does not terminate, it is for reasons other than that bit of logic).

  • For that matter, isn't _loop() which calls itself really sort of like the greedy algorithm that I proposed back in comment:157 and which is not guaranteed to find a solution even if one exists? Now that we have a maximum bipartite matching algorithm can't we do away with _loop() entirely?

Yes, I think so, after thinking about it, but I'll detail that in another comment.

Hm, I see that in addition to the self-call on line 356, _loop() is also called from line 411 (Kevan's branch) and from line 504 (Kevan's branch). Line 411 is when we've finished assigning shares to servers which already had shares, and line 504 is when we get a response from a remote call to a server's allocate_buckets().

It kind of seems to me that only one of these three calls to _loop() should still be made in the new world of maximum bipartite matching, and that is the one where we just got new information from a server in response to our attempt to allocate_buckets(). (That new information could be information to the effect that the server already has some shares, information to the effect that the server hereby agrees to hold some shares that we will upload, or information to the effect that the server has failed.) The other two calls to _loop() seem to be part of an iterative/imperative/greedy algorithm which is hopefully superseded by maximum bipartite matching.

Well, one way to tell if what I just said is true is to add a unit test of a case like the one I mentioned above, where there are servers that are already holding shares (so they will not be in self.uncontacted_peers), but need to receive more shares in order to achieve happiness. Oh wait, what does it take for a server to be absent from self.uncontacted_peers? Well, per line 237 you get added to self.uncontacted_peers if you are writable, and then in line 369 you get popped off of self.uncontacted_peers whenever we decide to ask you to hold a share. So it seems like we need a test of a case where to succeed you have to first upload one share to every writable server and then you do not have any homeless shares (so that the if on line 316 will be True) and then you have not already achieved servers-of-happiness (so that the if on line 319 will be False) and then if you upload more shares to one or more of the servers that you've already uploaded a share to then you can achieve servers-of-happiness.

Is it even possible to have such a situation? I can't come up with an example in my head right now. Perhaps this means the current implementation is correct? :-)

Unless I'm misunderstanding your example, I think my case above would do the trick.

Taking a step back, we have now a nice algorithm for determining how many servers-of-happiness we have for a given static mapping from servers to shares. Do we also have a nice algorithm for dynamically choosing which servers should receive which shares? Or do we have an ad-hoc and underspecified state machine encoded into _loop()? :-) (Mind you, if it is the latter, but it behaves at least as well as v1.6.1 does for all the tests we can throw at it, then it is probably still worth shipping in v1.7.0 beta.)

We have one nice, well-defined algorithm for computing the happiness of a static share assignment and one poorly defined, ad-hoc, and suboptimal algorithm for redistributing shares grafted onto an existing state machine. This is how I learn, I guess. :-/

It wouldn't be hard to extend the bipartite matching algorithm to do this, I think, but I will outline the extension in another comment, since this comment is concerned with the ad-hoc state machine algorithm. :-)

  3. Then it can issue allocate_buckets() requests to any servers that it wants.
  6. The algorithm can issue more allocate_buckets() requests in response to earlier requests resolving or failing.

Okay, here is a more specific question: what is the difference, if any, between Tahoe-LAFS v1.6.1 and Kevan's #778 branch with regard to steps 3 and 6? Can you answer that question, Kevan?

If I understand your steps correctly, there aren't any significant differences between 1.6.1 step 3 and #778 step 3.

Step 3, if I understand right, corresponds to the initial attempts to place shares on the grid. Both #778 and 1.6.1 do this in the same way; they loop through the list of uncontacted peers (peers which have not yet been asked to store a share for this upload), and ask each of them to store a share. This proceeds until either all shares are placed or all peers are contacted. The peers that they choose to ask are also the same -- specifically, the peers that aren't readonly.

The difference between #778 and 1.6.1 in step 6 is in the halfhearted attempt to redistribute shares if the servers-of-happiness check fails while there are still uncontacted peers. This affects how _loop opts to send more allocate_buckets() requests.

In 1.6.1, after the initial series of requests, if there were failures, or there are still homeless shares, _loop will repeatedly ask each of the peers that accepted shares on the last pass to accept a proportional (shares left / peers that accepted shares on the last pass) number of shares, and continue to do this until there are no more homeless shares or until there are no more accepting peers.

This logic is not sufficient to guarantee that a completed upload implies acceptable file health when our health metric is servers of happiness instead of shares of happiness, so #778 adds more logic to step 6. Now, in addition to checking for homeless shares, we check for the case where all shares were placed, but not all peers were contacted, and attempt to remedy the case where that results in an unhealthy upload by issuing allocate_buckets() requests to peers that we have not yet contacted.

So, in summary, step 6 differs in #778 in that the upload process will send allocate_buckets() queries to peers in more cases than in 1.6.1.

Hm. Wow, there is a difference between 1.6.1 and Kevan's #778 branch with regard to step 3. In Kevan's branch it issues get_buckets() queries to the read-only servers (step 2) serially. It waits to hear back from the first read-only server before it queries the second read-only server, and so on. That could be a performance problem when there are a lot of read-only servers. On the other hand in v1.6.1 it doesn't appear to query read-only servers at all, which means that if you've uploaded a file and then the servers to which you uploaded it fill up, then next time you upload that file you will upload all of its shares to new not-yet-full servers instead of being happy with the current servers. Can that be right? Kevan has added a test of that sort of situation, so if it is true that v1.6.1 behaves so badly then v1.6.1 should fail Kevan's new test. :-)

If I'm understanding your steps right, wouldn't it be more correct here to say that #778 has a step 2, and 1.6.1 doesn't? It seems like this behavior is a part of step 2, not step 3. I think that that is the case, though -- 1.6.1 simply filters the peers that cannot accept shares for an upload out of the list of peers that it will check.

You are right that there is no reason for that to be done serially. I must have been trying to make it look like _loop, but it doesn't have the same need to coordinate as _loop.

comment:192 in reply to: ↑ 190 Changed at 2010-05-12T22:53:56Z by kevan

Replying to zooko:

Dear taral: thank you for the code review!

Replying to taral:

This change removes progress when a server already has a share. Intentional?

Hm... Good catch. That looks unintentional to me. At least it is inconsistent with the comment left at line 477. I think removing progress = True from there means that when you get a response from a server which did not accept any of your requests to hold a share but which did inform you that it already had some shares then you'll treat that as "not progress". Oh, but it looks like the only effect of "progress" is to change good_query_count, bad_query_count, and full_count, which are used only for reporting progress, so maybe it is intentional or at least harmless.

We'll leave it to Kevan to answer definitively.

That's correct -- progress = True implies that the query placed shares, while progress = False implies that it did not. This is used to know how to increment self.good_query_count and self.bad_query_count. I will fix the comment so that it is less misleading.

Though, I guess to be consistent with the error message that these figures are collected for, which is

 "want to place shares on at least %d servers such that any %d of them have enough shares to recover the file, sent %d queries to %d peers, %d queries placed some shares, %d placed none (of which %d placed none due to the server being full and %d placed none due to an error)"

I should say that the request makes progress if it finds shares that were not already placed, and does not make progress in other cases. This is consistent with the comment.

It is a bit confusing to say that a request that only finds shares that have already been placed is a bad query. However, the distinction between a good query and a bad query isn't as clear-cut in the world of servers of happiness as it was in the world of shares of happiness. A share discovered on a server incidentally as a result of share placement is not necessarily contributing to the health of the upload -- its discovery may not make the file any healthier. However, if a share is allocated as a result of a request to allocate that share, it is much more likely (perhaps even true in general, but I'm not quite sure of that) that its placement will help contribute to the health of the upload.

My only recommendation would be to sort the servers in order of number of shares held (descending) before picking shares to redistribute.

Hm, I'm not sure about that -- I'll leave that to Kevan to think about.

I don't think it would hurt -- it would help to spread out big concentrations of shares amongst more peers, which would reduce the number of shares lost if the peer that formerly had a big concentration of shares went down.
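
(A tiny sketch of that suggestion, with an illustrative servermap rather than the actual variable names in upload.py:)

    # servermap maps each peer to the set of shares assigned to it so far.
    servermap = {"s0": set([0, 1, 2]), "s1": set([3]), "s2": set([4])}
    peers_by_load = sorted(servermap, key=lambda p: len(servermap[p]), reverse=True)
    # s0 comes first, so its surplus shares would be picked for redistribution first.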

comment:193 Changed at 2010-05-12T22:57:34Z by kevan

The treatment of progress is also wrong because it does not account for the case where the share that we wanted to allocate space for is not in allocated because it is in alreadygot; that case is falsely recorded as the server being full, which is not necessarily true.

comment:194 in reply to: ↑ 187 ; follow-up: Changed at 2010-05-12T23:12:05Z by kevan

Replying to zooko:

Do we also have a nice algorithm for dynamically choosing which servers should receive which shares?

Here is what I had in mind for a nice algorithm for dynamically choosing which servers should receive which shares.

Input: A list of peers. Output: A share -> peer assignment.

  1. Query all peers for their existing shares.
  2. Create a bipartite graph, with one set of vertices corresponding to shares and one to peers.
  3. Create an edge between each peer and all of the shares that it already holds.
  4. Determine which peers are capable of accepting new shares.
  5. Draw an edge between each peer that can accept new shares and every share in the vertex set of shares. These represent possible share placements.
  6. Compute a maximal bipartite matching in the resulting graph. Determine if the matching is happy; fail if it isn't.
  7. When necessary (i.e.: when an edge in the matching does not correspond to a "peer has a share" relationship), do allocate_buckets(). If allocate_buckets() fails due to the peer being full, remove all of the edges added for the peer in step 5 and resume from step 6. If allocate_buckets() fails due to peer failure, remove the peer from the graph, and resume from step 6.
  8. Attempt to distribute shares not covered by the matching evenly across the peers. It is not necessary for this step to be successful, since the file is already healthy, and its health is independent of what we do with the excluded shares, but distributing them evenly will avoid concentrations of shares, reducing the impact of the failure of any particular node.
  9. Finish.

I think the algorithm halts; if it does not find a solution in step 7, it removes either a vertex or some edges (together with one peer to which shares can be assigned) from the graph. Eventually the graph will either run out of vertices in the peer set altogether, or will not have any bucket allocations that can fail, and will finish. It should be optimal for the same reason that the check now is optimal; only now we allow it to find an optimal matching instead of verifying that a matching of a certain size exists.
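
For concreteness, here is a minimal sketch of steps 2-5 only, assuming hypothetical inputs: existing maps each peer to the set of share numbers it reported in step 1, writable is the set of peers believed to be able to accept new shares (step 4), and num_shares is N. The result is the edge set of the bipartite graph; step 6 then computes a maximum matching over these edges.

    def placement_edges(existing, writable, num_shares):
        edges = set()
        for peer, shares in existing.items():
            for shnum in shares:
                edges.add((peer, shnum))    # step 3: peer already holds this share
        for peer in writable:
            for shnum in range(num_shares):
                edges.add((peer, shnum))    # step 5: peer could accept this share
        return edges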

Thoughts?

comment:195 Changed at 2010-05-12T23:15:04Z by taral

Looks good to me. Can step 8 be partially folded into step 6 as some kind of heuristic?

comment:196 Changed at 2010-05-12T23:20:17Z by taral

It occurs to me that bipartite matching isn't quite right... let me think about this.

comment:197 follow-up: Changed at 2010-05-12T23:28:21Z by taral

Okay, bipartite matching is sufficient but not necessary. I'm trying to see if something better exists.

comment:198 in reply to: ↑ 197 Changed at 2010-05-13T01:10:42Z by davidsarah

Replying to taral:

Okay, bipartite matching is sufficient but not necessary. I'm trying to see if something better exists.

It is true that maximum_matching_size >= servers_of_happiness_threshold is not necessary for the original definition of happiness that we started with. Here's a counterexample to it being necessary:

Let N = 4, k = 2, H = 3. I.e. we have 4 shares, which need to be placed on at least 3 servers such that the shares from any two of those servers are sufficient to reconstruct the file.

If we place copies of the same two shares on three of the servers, then the maximum matching size is only 2, not 3 as required. However, the original happiness criterion is met, because any two of the three servers (actually only one is needed) will have enough shares to reconstruct the file.

OTOH, this is not a good share placement!

  • It is inefficient in storage usage, because we needed 6 share copies, when at most 4 copies should have been needed (actually the minimum is 3).
  • Two shares were not stored at all. So, we cannot redistribute shares later so that they are more evenly spread (unless we recompute them).

So, I believe that maximum matching size is a good measure of happiness even though it's not necessary to meet the original happiness criterion, because it also ensures placements that don't use excess storage. (If there are share copies that aren't in the maximum matching, they can be deleted.)
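
(For concreteness, a minimal sketch of the measure under discussion -- not the actual happinessutil.py code -- using Kuhn's augmenting-path algorithm to compute the maximum matching; servermap maps a server to the set of share numbers it holds:)

    def servers_of_happiness(servermap):
        share_to_servers = {}
        for server, shares in servermap.items():
            for shnum in shares:
                share_to_servers.setdefault(shnum, []).append(server)
        matched = {}  # server -> the share it is matched to

        def augment(shnum, visited):
            # Try to assign shnum to some server, possibly displacing a share
            # that was matched earlier onto another of its servers.
            for server in share_to_servers.get(shnum, []):
                if server in visited:
                    continue
                visited.add(server)
                if server not in matched or augment(matched[server], visited):
                    matched[server] = shnum
                    return True
            return False

        for shnum in share_to_servers:
            augment(shnum, set())
        return len(matched)

    # The degenerate placement above: copies of the same two shares on three
    # servers. The original criterion is met, but the matching size is only 2.
    print(servers_of_happiness({"s1": set([0, 1]),
                                "s2": set([0, 1]),
                                "s3": set([0, 1])}))  # prints 2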

comment:199 in reply to: ↑ 194 Changed at 2010-05-13T01:25:29Z by davidsarah

Replying to kevan:

  6. Compute a maximal bipartite matching in the resulting graph. Determine if the matching is happy; fail if it isn't.

To be precise, a "maximal" bipartite matching is not the same thing as a maximum bipartite matching. We want the latter.

comment:200 in reply to: ↑ 191 ; follow-up: Changed at 2010-05-13T04:41:51Z by zooko

My major concern at this point is: Does the ticket778 branch as it currently stands contain regressions vs. Tahoe-LAFS v1.6.1?

Replying to kevan:

We see [server 1 | server 2] first; we either store share 0 on [server 1|server 2], or it is readonly and we detect it during step 2. We then see server 0, and stop placing shares because we have already accounted for them. Then, since [server 1 | server 2] is both writable and not yet contacted, we can't know that it holds share 0.

So we'll remove shares from server 0 (because it's the only one with more than one share), and put them in homeless_shares for distribution to server 2 (for the sake of argument, without loss of generality, etc). If we opt to redistribute share 1 or share 2, we're okay; happiness is 3. If we opt to redistribute share 0, we're not, because (though we wouldn't know it) server 2 already has that one.

So I guess that's not the right way of going about share redistribution. Makes sense; I spent a while looking for an example that would break that logic, but I couldn't find one.

Can you write a unit test that causes this behavior?

How does v1.6.1 behave in a case like this? I guess it ignores read-only servers entirely during upload. But if it were a case like this one involving only writable servers, then v1.6.1 would go ahead and stop happily with "shares of happiness" even though there isn't actually "servers of happiness". So as far as I currently understand (which isn't that far), the current ticket778 branch does not have any regression vs. v1.6.1 with regard to this case.

  • Is _loop() guaranteed to terminate?

The reassignment is contingent upon there being uncontacted servers. Regardless of how we reassign shares, the loop will contact at least one uncontacted server on each iteration if there are any -- see line 368 of _loop.

Okay I agree that _loop() will terminate.

comment:201 in reply to: ↑ 191 ; follow-up: Changed at 2010-05-13T05:26:27Z by zooko

Replying to kevan:

Is it even possible to have such a situation? I can't come up with an example in my head right now. Perhaps this means the current implementation is correct? :-)

Unless I'm misunderstanding your example, I think my case above would do the trick.

As mentioned, if possible, I would like to see a unit test of this case, even if that test is going to be marked as "TODO" for the v1.7 release.

It wouldn't be hard to extend the bipartite matching algorithm to do this, I think, but I will outline the extension in another comment, since this comment is concerned with the ad-hoc state machine algorithm. :-)

Yeah it seems like your thoughts along those lines are destined to grow into a ticket for after v1.7 release which is something like "rewrite uploader to make it awesome". :-)

So, in summary, step 6 differs in #778 in that the upload process will send allocate_buckets() queries to peers in more cases than in 1.6.1.

Okay I think that I'm convinced that the ticket778 branch doesn't lead to any regressions with regard to step 6. I withhold my final opinion on that until I re-re-re-re-re-read your patches with this conversation in mind, though.

If I'm understanding your steps right, wouldn't it be more correct here to say that #778 has a step 2, and 1.6.1 doesn't? It seems like this behavior is a part of step 2, not step 3.

You are right—my mistake. I meant step 2.

You are right that there is no reason for that to be done serially. I must have been trying to make it look like _loop, but it doesn't have the same need to coordinate as _loop.

Doing it serially could be a significant performance regression vs. 1.6.1 if there are many read-only servers. In fact, waiting for all the answers to your step 2 queries—even if done in parallel instead of in serial—could be a performance regression (if there is a hung server which is marked read-only and which never replies to the query). However, while I think it might be worth changing your branch to issue the step 2 queries in parallel instead of serial, I do not think it would be worth changing your branch (before v1.7) to proceed with the later steps before all of the step 2 queries have been answered.

comment:202 Changed at 2010-05-13T05:45:40Z by zooko

Okay I intend to review the branch one last time tomorrow hunting for significant regressions from v1.6.1.

comment:203 in reply to: ↑ 201 Changed at 2010-05-13T17:43:16Z by davidsarah

Replying to zooko:

... In fact, waiting for all the answers to your step 2 queries—even if done in parallel instead of in serial—could be a performance regression (if there is a hung server which is marked read-only and which never replies to the query). However, while I think it might be worth changing your branch to issue the step 2 queries in parallel instead of serial, I do not think it would be worth changing your branch (before v1.7) to proceed with the later steps before all of the step 2 queries have been answered.

Agreed. Improving the availability of upload in the presence of hung or very slow servers is ticket #873 (also ticket #946 might help). If this affected download, I'd be reluctant to accept any regression, but for upload we know there is a lot more work to do to improve availability.

comment:204 Changed at 2010-05-13T17:43:54Z by davidsarah

  • Keywords availability performance upload added

comment:205 Changed at 2010-05-13T19:55:58Z by zooko

[Amber and Zooko are pair-programming on this review.]

query_allocated should be named ask_about_existing_shares.

self.peers_with_shares = set([]) should be self.peers_with_shares = set()

When querying readonly servers, we need to limit the number of readonly servers queried so that we don't query every readonly server on the grid. How about if we just query those that are among the first 2*n in all_peers? This would do it: replace:

        readonly_peers = set(all_peers) - set(writable_peers)

with:

        readonly_peers = set(all_peers[:2*n]) - set(writable_peers)

[Zooko adds: I still think we should parallelize that initial query to readonly servers.]

comment:206 Changed at 2010-05-13T21:29:23Z by taral

Greedy algorithms are known to construct maximal (but not necessarily maximum) bipartite matchings.

I think step 6 can be improved by using a weighted algorithm where existing shares are "heavier" than new shares.

comment:207 Changed at 2010-05-13T22:33:31Z by zooko

Oh the docs in architecture.txt should be updated to describe the new step 2 in the upload algorithm.

comment:208 Changed at 2010-05-14T01:47:25Z by kevan

I'm attaching docs3.txt, behavior3.txt, and tests3.txt. These are the canonical versions of the #778 patches. Things that have changed:

  • The location of "only" has been changed in error messages.
  • There is a test for the failing redistribution scenario (it is marked as todo, because we can't pass it yet).
  • The readonly peer share discovery step now occurs in parallel, not sequentially.
  • The notion of progress when attempting to place shares is now more correctly defined and tested.
  • Only the readonly peers in the first 2*n peers are now asked about their existing shares.
  • The naming and Python usage tweaks suggested by Zooko and Amber have been implemented.
  • docs/architecture.txt now includes the readonly share discovery behavior.
  • Assorted cleanup of some things I didn't like about how some of my earlier tests were written.

Given the current focus on things that might be regressions against 1.6.1, I did not change the redistribution algorithm to take the number of shares on a peer into account when reassigning them any more than it already does.

Thanks for the feedback so far.

Changed at 2010-05-14T01:48:09Z by kevan

Changed at 2010-05-14T01:48:24Z by kevan

Changed at 2010-05-14T01:49:02Z by kevan

comment:209 in reply to: ↑ 200 Changed at 2010-05-14T02:00:46Z by kevan

Replying to zooko:

How does v1.6.1 behave in a case like this? I guess it ignores read-only servers entirely during upload. But if it were a case like this one involving only writable servers, then v1.6.1 would go ahead and stop happily with "shares of happiness" even though there isn't actually "servers of happiness". So as far as I currently understand (which isn't that far), the current ticket778 branch does not have any regression vs. v1.6.1 with regard to this case.

If we're comparing 1.6.1 and 1.7.0's ability to determine servers of happiness, you're right; 1.6.1 would call the upload successful even though the shares are not distributed in the required way, while these patches will call it unsuccessful because the share layout is unhappy, and the peer selection logic doesn't know how to distribute shares in such a way as to make it happy.

comment:211 Changed at 2010-05-14T05:07:46Z by zooko

Okay, here I am reviewing the #778 patches for what seems like the one millionth time. This time through, the core "put them back so we can try to redistribute them" code in _loop() finally makes sense to me!

Let's start a new ticket named something like "make upload awesome" so that we don't forget details such as Kevan's idea in comment:194 and Taral's suggestion in comment:206.

Now reviewing the rest of the branch... :-)

Last edited at 2013-06-26T23:34:42Z by daira

comment:212 Changed at 2010-05-14T05:19:07Z by zooko

Okay! I'm done reviewing it! Way to go, Kevan! I'm running unit tests with code coverage reporting and then I'll push your branch into trunk!

comment:213 follow-up: Changed at 2010-05-14T05:51:15Z by zooko

Here's a code coverage report about immutable/upload.py from running the entire unit test suite:

http://tahoe-lafs.org/~zooko/src_allmydata_immutable_upload.html

Two interesting bits: it claims that line 286 and line 344 are only partially covered -- i.e. that for each one, of the two possible lines that could be executed next after that line, only one of those possible next-lines was actually executed during the tests. It gives a line number (on the right-hand side) for which line of code it could have jumped to but didn't, but that number is bogus. However, its claim that line 344 is only partially covered is substantiated by its claim that lines 355-357 are never executed.

I'm going to proceed with applying your branch to trunk, but you might want to go ahead and investigate this.

comment:214 follow-up: Changed at 2010-05-14T06:05:40Z by zooko

  • Resolution set to fixed
  • Status changed from assigned to closed

Fixes applied in 7c4c6f393ec2ad2a through 77aabe7066e539c0. Kevan: please review my 77aabe7066e539c0. Thanks a lot for all your hard work on this Kevan! Thanks also to Shawn Willden, David-Sarah, Brian, Taral, and to Terrell for some drive-by encouragement. :-)

comment:215 Changed at 2010-05-14T17:10:32Z by zooko

In the shower this morning I suddenly realized I hadn't actually reviewed src/allmydata/util/happinessutil.py! Fortunately, a quick glance at the code coverage results from running just the allmydata.test.test_upload tests shows that happinessutil.py has 100% line- and branch- coverage. :-)

http://tahoe-lafs.org/~zooko/htmlcov-just-allmydata.test.test_upload/src_allmydata_util_happinessutil.html

I still plan to review happinessutil.py at some point.

comment:216 in reply to: ↑ 214 Changed at 2010-05-14T20:40:50Z by kevan

Replying to zooko:

Kevan: please review my 77aabe7066e539c0.

Change "servers nodes" to "server nodes" in the first paragraph.

"Each server has announced its available space when it connected to the introducer" sounds clunky. Maybe "Each server announced its available space when it connected to the introducer", or "Servers announce their available space when they connect to the introducer".

(The field that actually contains the space announcement is called "maximum-immutable-share-size", though it is set to the amount of space the storage server thinks it has available, so what we have is still technically correct.)

The rest of 77aabe7066e539c0 looks good to me.

comment:217 in reply to: ↑ 213 Changed at 2010-05-14T21:00:40Z by kevan

Replying to zooko:

Here's a code coverage report about immutable/upload.py from running the entire unit test suite:

http://tahoe-lafs.org/~zooko/src_allmydata_immutable_upload.html

Two interesting bits: it claims that line 286 and line 344 are only partially covered -- i.e. that for each one, of the two possible lines that could be executed next after that line, only one of those possible next-lines was actually executed during the tests. It gives a line number (on the right-hand side) for which line of code it could have jumped to but didn't, but that number is bogus. However, its claim that line 344 is only partially covered is substantiated by its claim that lines 355-357 are never executed.

I think the reported line number is right in the case of line 343: execution would jump back to the start of the while loop if it didn't take the branch leading into the body of the if-statement. I'll add a test for that.

Also, I need to add a test for the situation where it can't distribute all of the shares, but still has a happy layout.

I think the other instances of partial coverage are benign -- do you agree?

comment:218 Changed at 2010-05-15T03:28:52Z by zooko

Yes—good enough.

comment:219 Changed at 2010-05-15T03:52:10Z by kevan

Patch attached -- I also removed a comment that didn't make much sense. This fixes both of the code coverage issues that I saw in your report.

comment:220 Changed at 2010-05-16T01:33:06Z by zooko

See also #911 (Create a specification for the servers of happiness behavior).

comment:221 Changed at 2010-05-18T17:45:45Z by zooko

A user with handle "bjp" joined the IRC channel and asked how to set up a simple grid in order to test SFTP. Reading the transcript the next day, I realized that this ticket (#778) has broken the setup instructions in http://tahoe-lafs.org/source/tahoe-lafs/trunk/docs/running.html because the default configuration of K, H, and N can't work without at least H storage servers, but the running.html instructions don't say that you need multiple storage servers.

Kevan: do you feel like updating running.html?

comment:222 Changed at 2010-05-19T00:33:55Z by kevan

Patch attached. I made a separate patch (within the single patch file, but distinct from the one that fixes the issue above) that changes uses of tahoe and Tahoe (where appropriate) to Tahoe-LAFS, to be consistent with the title and how Tahoe-LAFS is referred to elsewhere; feel free to apply or not apply that one as you see fit.

Changed at 2010-05-19T00:34:59Z by kevan

comment:223 Changed at 2010-05-19T07:17:25Z by zooko

To test this feature on your system, build current Tahoe-LAFS trunk (e225f573b9c3fb0e or newer) and see how uploads perform and whether they fail with a clean error message if there aren't enough storage servers available to spread your file out over H different servers such that any K of those servers can reconstruct your file.

Changed at 2010-05-24T00:42:26Z by kevan

make a distinction between immutable file uploads and mutable file uploads wrt servers of happiness

comment:224 Changed at 2010-05-24T00:47:41Z by kevan

David-Sarah pointed out in IRC that the fact that this only applies to immutable files at the moment is not well-documented. The last attachment tries to rectify this.

comment:225 Changed at 2010-05-24T03:36:47Z by davidsarah

  • Keywords review-needed removed

From attachment:mutabledocs.dpatch:

+If we are uploading an immutable file and are unable to place every share that
+we want, but we still managed to place enough shares on enough servers to
+achieve a condition called "servers of happiness" then we'll do the upload
+anyways. If we cannot achieve "servers of happiness", the immutable file upload
+is declared a failure. If we are uploading a mutable file or a directory and
+are unable to place all of the shares that we want to place, the upload will
+fail regardless of where the shares were placed.

This can be read as implying that the mutable placement algorithm still gives at least the same guarantees as the immutable one. But in fact it succeeds even when there are fewer than "servers of happiness" connected servers.

+(mutable files use a different share placement algorithm that does not
+ consider this parameter)

'mutable' -> 'Mutable' and ')' -> '.)'

comment:226 Changed at 2010-05-24T04:39:58Z by kevan

I reworded the blurb in architecture.txt, and made the changes you suggested to configuration.txt -- see attachment:mutabledocsv2.dpatch.

Thanks for the feedback.

Changed at 2010-05-24T04:41:14Z by kevan

comment:227 Changed at 2010-12-29T09:12:28Z by zooko

  • Keywords servers-of-happiness added