#699 assigned defect

rebalance during repair or upload

Reported by: zooko Owned by: davidsarah
Priority: major Milestone: soon
Component: code-peerselection Version: 1.4.1
Keywords: upload repair preservation test anti-censorship Cc: tahoe-lafs.org@…
Launchpad Bug:

Description (last modified by lpirl)

In this mailing list message, Humberto Ortiz-Zuazaga asks how to rebalance the shares of a file. To close this ticket, ensure (either as an option or unconditionally) that repairing or uploading a file attempts to "rebalance" its shares, making it healthy as defined by #614.

See also related tickets #481 (build some share-migration tools), #231 (good handling of small numbers of servers, or strange choice of servers), #232 (Peer selection doesn't rebalance shares on overwrite of mutable file.), and #543 ('rebalancing manager').

Change History (19)

comment:1 Changed at 2009-08-10T15:28:29Z by zooko

The following clump of tickets might be of interest to people who are interested in this ticket: #711 (repair to different levels of M), #699 (optionally rebalance during repair or upload), #543 ('rebalancing manager'), #232 (Peer selection doesn't rebalance shares on overwrite of mutable file.), #678 (converge same file, same K, different M), #610 (upload should take better advantage of existing shares), #573 (Allow client to control which storage servers receive shares).

comment:2 Changed at 2009-08-10T15:45:34Z by zooko

Also related: #778 ("shares of happiness" is the wrong measure; "servers of happiness" is better).

comment:3 Changed at 2009-12-22T18:45:56Z by davidsarah

  • Keywords repair preservation added

comment:4 follow-up: Changed at 2010-03-24T23:26:36Z by davidsarah

  • Description modified
  • Keywords upload added
  • Milestone changed from eventually to 1.7.0
  • Summary changed from optionally rebalance during repair or upload to rebalance during repair or upload
  • Type changed from enhancement to defect

It seems more usable for this behaviour to be unconditional rather than an option: why would you not want to attempt to ensure that a file is healthy (per #614) on a repair or upload?

comment:5 in reply to: ↑ 4 Changed at 2010-03-25T04:56:28Z by zooko

Replying to davidsarah:

why would you not want to attempt to ensure that a file is healthy (per #614) on a repair or upload?

I guess I was thinking about bandwidth and storage space. There could be some case where you want to repair but you don't want -- at that particular time -- to rebalance because rebalancing might cost more bandwidth or something.

However, I strongly prefer to avoid offering options if we don't have to, and I can't think of a really good answer to this question, so I agree that this should be unconditional.

comment:6 follow-up: Changed at 2010-05-16T05:15:40Z by zooko

#778 ("shares of happiness" is the wrong measure; "servers of happiness" is better) is fixed! I think that this fixes the parts of this ticket that have to do with immutable files. Once we likewise have rebalancing on upload of mutable files then we can close this ticket.

comment:7 in reply to: ↑ 6 Changed at 2010-05-16T16:18:57Z by davidsarah

  • Milestone changed from 1.7.0 to 1.8.0

Replying to zooko:

#778 ("shares of happiness" is the wrong measure; "servers of happiness" is better) is fixed! I think that this fixes the parts of this ticket that have to do with immutable files.

Doesn't #778 only fix this for upload of immutable files, not repair?

comment:8 follow-up: Changed at 2010-05-16T16:58:12Z by zooko

Hm, let's see... the current repairer uses the current immutable uploader so I think it inherits the improvements from #778. However, I guess this means we need a unit test that would be red if the repairer fails to rebalance (at least up to servers-of-happiness and at least when not in one of the "tricky cases" for which our current uploader fails to achieve servers-of-happiness).

comment:9 Changed at 2010-05-16T17:50:29Z by davidsarah

  • Keywords test added

comment:10 Changed at 2010-08-12T20:55:04Z by zooko

  • Milestone changed from 1.8.0 to eventually

comment:11 in reply to: ↑ 8 Changed at 2010-08-12T23:44:12Z by davidsarah

Replying to zooko:

Hm, let's see... the current repairer uses the current immutable uploader so I think it inherits the improvements from #778. However, I guess this means we need a unit test that would be red if the repairer fails to rebalance (at least up to servers-of-happiness and at least when not in one of the "tricky cases" for which our current uploader fails to achieve servers-of-happiness).

Those cases are described in #1124 and #1130. I don't think we should consider this ticket resolved for immutable uploads until those have been fixed.

Also, the uploader/repairer should attempt to achieve "full happiness", i.e. a happiness value of N, even though it only reports failure when it fails to meet the happiness threshold, which may be lower than N. See ticket:778#comment:175.
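To make "happiness" concrete: the #778 measure can be read as the size of a maximum matching between servers and the distinct shares they hold, i.e. the largest number of servers that could each be credited with a different share. The following is a minimal Python sketch under that reading, not Tahoe-LAFS's own implementation; the function name servers_of_happiness and the example layout (modelled on the 3-7-10 scenario in comment:12, with server s9 offline during upload) are illustrative assumptions.

```python
from typing import Dict, Set


def servers_of_happiness(placement: Dict[str, Set[int]]) -> int:
    """Size of a maximum matching between servers and the share numbers they hold.

    `placement` maps a (hypothetical) server id to the set of share numbers on it.
    """
    share_to_server: Dict[int, str] = {}

    def try_assign(server: str, seen: Set[int]) -> bool:
        # Augmenting-path step (Kuhn's algorithm): find a share for `server`,
        # re-routing a previously matched server to another share if possible.
        for share in placement[server]:
            if share in seen:
                continue
            seen.add(share)
            if share not in share_to_server or try_assign(share_to_server[share], seen):
                share_to_server[share] = server
                return True
        return False

    return sum(1 for server in placement if try_assign(server, set()))


# Layout from comment:12: 1 share on each of 8 servers, 2 shares on s8,
# nothing on s9 (which was shut down during the upload).
placement = {f"s{i}": {i} for i in range(8)}
placement["s8"] = {8, 9}
placement["s9"] = set()

print(servers_of_happiness(placement))  # 9
```

With happy=7 and N=10, a happiness of 9 clears the threshold, so the upload reports success, yet it falls short of the "full happiness" of 10 that this comment argues the uploader/repairer should aim for.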

comment:12 follow-up: Changed at 2010-10-14T07:05:25Z by sickness

I've found the same behaviour experimenting with 1.8.0 on 10 nodes with 3-7-10 encoding (needed=3, happy=7, total=10). I uploaded a file with one of the servers shut down; it said OK and put 1 share on 8 servers, 2 shares on 1 server, and 0 shares on the shut-down server (obviously). Then, after powering the shut-down server back on, I tried to repair that file: it said that it needed rebalancing but didn't do it no matter what. The solution was following zooko's advice in this mail: http://tahoe-lafs.org/pipermail/tahoe-dev/2009-May/001735.html and http://tahoe-lafs.org/pipermail/tahoe-dev/2009-May/001739.html So basically I deleted 1 of the 2 shares on the server with 2 shares and then repaired the file. Now all the servers had 1 share. I'd like to have an option to tell the uploader to never put more than N shares per server "no matter what" (I'd set it to 1 because I don't want one server to hold more "responsibility" than I've planned...) tnx! :)
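The manual workaround in this comment (delete the surplus share from the doubled-up server, then repair) can be planned mechanically from a known share layout. A minimal Python sketch, assuming the layout has already been obtained somehow (for example from a check report); plan_prune and its cap parameter are hypothetical names, and actually deleting shares and running the repair are left out.

```python
from typing import Dict, List, Set, Tuple


def plan_prune(layout: Dict[str, Set[int]], k: int, cap: int = 1) -> List[Tuple[str, int]]:
    """List (server, share) pairs to delete so that no server keeps more than `cap` shares.

    Raises ValueError if fewer than k distinct shares would survive, because the
    file would then be unrecoverable and a repair could not regenerate it.
    """
    to_delete: List[Tuple[str, int]] = []
    surviving: Set[int] = set()
    for server in sorted(layout):
        shares = sorted(layout[server])
        surviving.update(shares[:cap])                            # shares this server keeps
        to_delete.extend((server, share) for share in shares[cap:])  # surplus shares
    if len(surviving) < k:
        raise ValueError(f"only {len(surviving)} distinct shares would survive; need {k}")
    return to_delete


# Layout described above: s8 holds two shares, s9 (offline at upload time) holds none.
layout = {f"s{i}": {i} for i in range(8)}
layout["s8"] = {8, 9}
layout["s9"] = set()

print(plan_prune(layout, k=3))  # [('s8', 9)]
```

The k check matters: pruning must never drop the number of distinct surviving shares below k, or the file becomes unrecoverable and no repair can regenerate the deleted shares.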

comment:13 Changed at 2010-12-16T01:20:22Z by davidsarah

  • Keywords anti-censorship added

comment:14 in reply to: ↑ 12 ; follow-up: Changed at 2012-03-03T18:46:33Z by amontero

Replying to sickness:

I've found the same behaviour experimenting with 1.8.0 on 10 nodes with 3-7-10 encoding (needed=3, happy=7, total=10). I uploaded a file with one of the servers shut down; it said OK and put 1 share on 8 servers, 2 shares on 1 server, and 0 shares on the shut-down server (obviously). Then, after powering the shut-down server back on, I tried to repair that file: it said that it needed rebalancing but didn't do it no matter what. The solution was following zooko's advice in this mail: http://tahoe-lafs.org/pipermail/tahoe-dev/2009-May/001735.html and http://tahoe-lafs.org/pipermail/tahoe-dev/2009-May/001739.html So basically I deleted 1 of the 2 shares on the server with 2 shares and then repaired the file. Now all the servers had 1 share. I'd like to have an option to tell the uploader to never put more than N shares per server "no matter what" (I'd set it to 1 because I don't want one server to hold more "responsibility" than I've planned...) tnx! :)

Hi sickness.

I'm following zooko's advice, since I have the same problem. My use case is described in #1657 if you're interested. I've created a bash script that may help you. It's not the most efficient way of doing it, but it saves me a lot of time. If a maintainer can add it to a misc tools folder in the repo, maybe others will be able to use it as a starting point and hopefully improve it, since my bash skills are somewhat limited.

Hope you find it useful.

Last edited at 2012-03-03T18:47:16Z by amontero

comment:15 in reply to: ↑ 14 Changed at 2012-10-27T18:39:45Z by sickness

Replying to amontero:

Hi sickness.

I'm following zooko's advice, since I have the same problem. My use case is described in #1657 if you're interested. I've created a bash script that may help you. It's not the most efficient way of doing it, but it saves me a lot of time. If a maintainer can add it to a misc tools folder in the repo, maybe others will be able to use it as a starting point and hopefully improve it, since my bash skills are somewhat limited.

Hope you find it useful.

yeah! tnx for this script! It would be really useful, but I've only just found it :/ and it prints this error (on Debian with bash):

tahoe-prune.sh: line 49: read: -i: invalid option
read: usage: read [-ers] [-u fd] [-t timeout] [-p prompt] [-a array] [-n nchars] [-d delim] [name ...]
tahoe-prune.sh: line 55: read: -i: invalid option
read: usage: read [-ers] [-u fd] [-t timeout] [-p prompt] [-a array] [-n nchars] [-d delim] [name ...]

Then I've also found this tool: http://killyourtv.i2p.to/tahoe-lafs/rebalance-shares.py/ which seems to do basically the same thing, but in Python.

comment:16 Changed at 2012-10-28T01:08:38Z by amontero

I can't help with why -i doesn't work for you. Anyway, you can just omit the

-i "n"

part, since it's only meant to make the prompt default to "no".

Thanks for the Python script. I'll keep it at hand for when it's my time to learn Python :) For now, either of those scripts can do a "dirty rebalancing".

As a workaround, I think a "hold no more than Z shares" setting on each server could make this easier. Just firing repairs at regular intervals would eventually create shares on all servers: any server already holding Z shares of a file would simply refuse further shares for that file, so there would be no need for pruning. This way, we could easily tune how much "responsibility" each node accepts for each file, and newly arriving servers would eventually catch up as repairs run.
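A toy simulation of the proposed "hold no more than Z shares" setting may help in evaluating it. CappedServer, repair, and cap are hypothetical names, and this is not how Tahoe-LAFS's storage server or repairer is actually structured; the sketch only shows the intended effect. With Z = 1, the initial upload (modelled here as a repair of an empty layout) leaves one share unplaced instead of doubling up, and a later repair puts it on the returning server with no pruning needed.

```python
from typing import List, Set


class CappedServer:
    """Toy storage server that refuses more than `cap` shares of a single file."""

    def __init__(self, name: str, cap: int) -> None:
        self.name = name
        self.cap = cap
        self.shares: Set[int] = set()
        self.online = True

    def accept(self, share: int) -> bool:
        # Refuse if offline or already holding `cap` shares of this file.
        if not self.online or len(self.shares) >= self.cap:
            return False
        self.shares.add(share)
        return True


def repair(servers: List[CappedServer], total_shares: int) -> Set[int]:
    """Place each share number not held by any online server; return those left unplaced."""
    present: Set[int] = set().union(*(s.shares for s in servers if s.online))
    unplaced: Set[int] = set()
    for share in range(total_shares):
        if share in present:
            continue
        # any() stops at the first server that accepts the share.
        if not any(s.accept(share) for s in servers):
            unplaced.add(share)
    return unplaced


# 10 servers with Z = 1; s9 is offline during the initial upload.
servers = [CappedServer(f"s{i}", cap=1) for i in range(10)]
servers[9].online = False
print(sorted(repair(servers, total_shares=10)))  # [9]: no server doubles up
servers[9].online = True
print(sorted(repair(servers, total_shares=10)))  # []: the returning server takes share 9
print([len(s.shares) for s in servers])          # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```

The trade-off is also visible: until that later repair runs, the file has only 9 of its 10 shares.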

I'm interested in the developers' feedback on this approach. If it's easy enough for a Python novice, I could take a stab at it. I tried months ago but couldn't find my way through the code, so implementation advice is very welcome.

A future rebalancer would be way smarter, but meanwhile I think this approach will solve some use cases.

comment:17 Changed at 2013-02-15T03:31:02Z by davidsarah

  • Milestone changed from eventually to 1.11.0
  • Owner set to davidsarah
  • Status changed from new to assigned

comment:18 Changed at 2013-02-15T03:51:31Z by davidsarah

This ticket is likely to be fixed by implementing the repair algorithm in #1130.

comment:19 Changed at 2015-10-29T01:58:58Z by lpirl

  • Cc tahoe-lafs.org@… added
  • Description modified