[tahoe-dev] [tahoe-lafs] #302: stop permuting peerlist, use SI as offset into ring instead?

tahoe-lafs trac at allmydata.org
Fri Dec 25 21:57:31 PST 2009


#302: stop permuting peerlist, use SI as offset into ring instead?
------------------------------------+---------------------------------------
 Reporter:  warner                  |           Owner:           
     Type:  task                    |          Status:  new      
 Priority:  major                   |       Milestone:  undecided
Component:  code-peerselection      |         Version:  0.7.0    
 Keywords:  repair newcaps newurls  |   Launchpad_bug:           
------------------------------------+---------------------------------------

Comment(by warner):

 ok, so the "ringsim.py" simulator that I just attached to this ticket
 demonstrates one of the concerns I described above: non-permuted
 peer-selection will result in higher bytes-per-second upload rates to some
 servers than to others. (I haven't yet built a simulator to investigate
 the
 effect of full servers shedding load onto their clockwise neighbors).

 Run the code like {{{python ./ringsim.py --seed=abc --permute=1}}}. It will
 create a ring of 100 servers, using "abc" as a seed to decide their nodeids
 (any given seed will always produce the same distribution of nodeids).
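
 For reference, the seeded ring could be built with something like the
 following sketch (the names and hashing details here are illustrative, not
 necessarily what ringsim.py actually does):

 {{{
import hashlib

def make_ring(seed, num_servers=100):
    # derive a deterministic 128-bit nodeid for each server from the seed,
    # so the same seed always yields the same ring
    nodeids = [hashlib.sha256(("%s-%d" % (seed, i)).encode()).hexdigest()[:32]
               for i in range(num_servers)]
    return sorted(nodeids)   # the ring is just the sorted list of nodeids
 }}}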

 Then it will upload files (each with a size in {{{randrange(2GiB)}}}, mean
 size 1GiB) one at a time. Every few thousand uploads it will analyze the
 space used per-server and emit a report line like:

 {{{
 uploaded 16000
 min/max/(exp) usage-pf-ps 33.86 MB/38.66 MB/(35.72 MB): spread-pf: 4.80 MB (13.45%) stddev: 1.00 MB (2.81%)
 }}}

 The first block of numbers is "usage-per-file-per-server", meaning how
 much
 storage space was used on each server, divided by the total number of
 files
 that had been uploaded so far. If we pretend that we're uploading files at
 a
 rate of one per second, this is actually measuring bytes-per-second. The
 "min" value of 33.86 MB means that the least-used server had received
 33.86MB
 per file (i.e. per second). The most-used (fullest) server had received
 38.66MB per second. Our average filesize of 1GiB and 3-of-10 encoding
 parameters means that we'd expect to place 35.72MB per-server per-file.

 The "spread-pf" is the difference between the least-used and most-used
 servers: 4.80MB = 38.66-33.86. 13.45% is that spread expressed as a
 percentage of the expected usage value.

 The "stddev" is the standard deviation of all 100 servers' usage values.
 If
 usage were perfectly uniform, this would be zero. 2.81% is the standard
 deviation expressed as a percentage of the expected usage value.
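
 As a rough illustration, the numbers in such a report line could be computed
 along these lines (a sketch with made-up names, not the actual ringsim.py
 code):

 {{{
def report(usage, num_files, mean_size=2**30, k=3, N=10):
    # usage: dict mapping serverid -> total bytes stored on that server
    num_servers = len(usage)
    # bytes stored per server, per uploaded file ("per second" in the analogy)
    per_file = [u / num_files for u in usage.values()]
    lo, hi = min(per_file), max(per_file)
    # each file expands to N/k times its size and is spread over num_servers,
    # so the expected per-file per-server usage is about 35.8 MB here
    expected = mean_size * N / k / num_servers
    spread = hi - lo
    mean = sum(per_file) / num_servers
    stddev = (sum((x - mean) ** 2 for x in per_file) / num_servers) ** 0.5
    print("min/max/(exp) usage-pf-ps %.2f MB/%.2f MB/(%.2f MB): "
          "spread-pf: %.2f MB (%.2f%%) stddev: %.2f MB (%.2f%%)"
          % (lo / 1e6, hi / 1e6, expected / 1e6, spread / 1e6,
             100 * spread / expected, stddev / 1e6, 100 * stddev / expected))
 }}}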

 The simulator will run nearly forever. Run it with {{{--permute=1}}} and
 notice how the min/max values converge on the expected value over time,
 and
 how the spread and stddev drop towards zero. In my test run, after 200000
 files, the spread was down to 1.61MB (4.5%) and the stddev down to 265kB
 (0.74%). This is the law of large numbers in action.

 Now, re-run the simulator with {{{--permute=0}}} and {{{--seed=abc}}}. It
 runs much faster (because linear ring-offset selection is a lot easier
 than
 hash-based permutation). Look at the usage report for 16000 files:

 {{{
 uploaded 16000
 min/max/(exp) usage-pf-ps 7.10 MB/55.78 MB/(35.81 MB): spread-pf: 48.69 MB (135.96%) stddev: 12.12 MB (33.84%)
 }}}

 The spread is enormous, as is the standard deviation. The least-used
 server
 is using roughly an eighth as much space as the most-full server, whereas
 in
 the permuted case they were using within 15% of each other.
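
 For reference, the two selection strategies being compared here boil down to
 something like the following sketch (the real Tahoe and ringsim.py code
 differ in details such as hash functions and identifier widths; si and the
 serverids are assumed to be hex strings of equal length):

 {{{
import hashlib
from bisect import bisect_left

def permuted_selection(si, serverids, n=10):
    # permuted list: order the servers by hash(serverid + storage index),
    # so every file sees its own pseudo-random ordering of the whole ring
    return sorted(serverids,
                  key=lambda sid: hashlib.sha256((sid + si).encode()).digest())[:n]

def ring_offset_selection(si, serverids, n=10):
    # linear ring-offset: treat the storage index as a point on the ring and
    # take the next n servers clockwise from that point
    ring = sorted(serverids)
    start = bisect_left(ring, si)
    return [ring[(start + i) % len(ring)] for i in range(n)]
 }}}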

 And if you let it run for a while and look at the 200000 file report, it
 doesn't get better over time:

 {{{
 uploaded 200000
 min/max/(exp) usage-pf-ps 6.90 MB/56.05 MB/(35.69 MB): spread-pf: 49.15 MB (137.73%) stddev: 12.17 MB (34.10%)
 }}}

 Even after 200k files, the least-to-most-used ratio is 8x. And the stddev
 is
 basically constant.

 A bit of extra code reveals why. The least-used server has a nodeid that
 starts with dff96a, and the neighboring portion of the sorted list of
 serverids (with the separation between each node and its CCW neighbor)
 shows:

 {{{
 <Server dee63f41a151782ed22d012814dbce4e> 0112bfce339ff4781d6b7a6f9705aae2
 <Server df510eb7034c80b737719d76ca1d19d6> 006acf7561fb088865449c4eb5414b88
 <Server df6848a848d3a59966e5835a1883fa77> 001739f1458724e22f73e5e34e66e0a1
 <Server dfaff86b8c4d56ca38e673186fb11f56> 0047afc34379b130d200efbe572d24df
 <Server dfba0da90d512319e6447e6464a57bb3> 000a153d8103cc4fad5e0b4bf4f45c5d
 <Server dff96a791f1d835d6be65724b91e1225> 003f5cd011cc604385a1d8c054789672
 <--
 <Server e10e706a39fd09fe19db22fc498ed1ed> 011505f11adf86a0adf4cbd79070bfc8
 <Server e1bfb715d3cec4799ad3bb2e50f523fe> 00b146ab99d1ba7b80f8983207665211
 <Server e2f95cb1e2313cf26e738b3e8a0b8934> 0139a59c0e627878d39fd01039166536
 }}}

 100 uniformly-distributed servers would have a separation of 028F5C28F6...
 but the randomly-chosen nodeids in our {{{--seed=abc}}} ring are not
 uniformly distributed. In this case, lucky node dff96a happened to land
 unusually close after node dfba0, with a separation of just 003f5d...,
 about
 one tenth the ideal (uniform) separation. In fact it sits at the end of an
 unusually dense cluster of nodeids.

 (what we actually care about is the separation between node dff96a and its
 10th CCW neighbor, since we're encoding each file into 10 shares. The
 separation between dfba0 and dff96a is a big contributor to this, but not
 the whole thing).
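
 The "bit of extra code" is essentially a walk around the sorted ring,
 printing each node's gap back to its CCW neighbor; roughly (a sketch, not
 the exact code used here):

 {{{
def print_separations(serverids):
    # print each server together with the gap back to its CCW neighbor,
    # i.e. the slice of the ring that feeds shares to that server
    ring = sorted(int(sid, 16) for sid in serverids)
    for i, sid in enumerate(ring):
        gap = (sid - ring[i - 1]) % 2**128   # wraps around at index 0
        print("<Server %032x> %032x" % (sid, gap))
 }}}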


 And similarly, the most-used server was 4f5ab8, and that portion of the
 ring
 looks like:

 {{{
 <Server 3be5ef429769acbb7b9bb73443ea9fee> 0183079e758e2d8ac094ba3269f908fb
 <Server 3f0b2004a830fc609b61621ee3b77b1f> 032530c210c74fa51fc5aaea9fccdb31
 <Server 4681adee12a4e943c07fed69b644c640> 07768de96a73ece3251e8b4ad28d4b21
 <Server 4f5ab87f270f850a05c443a14fa2042e> 08d90a91146a9bc645445637995d3dee
 <--
 <Server 50719e5b06af03bbea0f362bed7e4dd3> 0116e5dbdf9f7eb1e44af28a9ddc49a5
 <Server 52b741b1eb3e4d31ef44b78845b13a5f> 0245a356e48f49760535815c5832ec8c
 <Server 54b497bd7905c60256a6a8735b6c2581> 01fd560b8dc778d06761f0eb15baeb22
 }}}

 The 4f5ab8 node is sitting just clockwise of an unusually large gap, from
 4681ad, with a separation of 08d90b, about 3.5 times the ideal (uniform)
 separation.

 This is the "lumpy distribution" problem that I was worried about. The
 effect
 is reduced when shares are spread over more servers. If I re-run the
 simulation with {{{--N=40}}} (3-of-40 encoding), I see a spread of about
 50%
 the expected value, and a stddev of about 15%. There is a corresponding
 increase in the effect when shares are spread over fewer servers:
 {{{--N=5}}}
 gives me a spread of 195% and a stddev of 46%.

 The effect is easiest to understand when k=N=1. In that case, the inlet
 rate
 for any given server is strictly equal to the total upload rate times the
 fraction of the ring that lies between that server and its nearest CCW
 neighbor. For our "abc" seed, the smallest separation is between node
 b80ea
 and b8159, with a gap of 0006ea (which is 1/95 of the uniform gap), and
 the
 largest is between 6ef56 and 7b41d, with a gap of 0c4c6 (about 4.8 times
 the
 uniform gap). So we'd expect lucky node b8159 to get only about 1/95 of the
 average per-server traffic (roughly 0.0001 of the total), unlucky 7b41d to
 get about .048 of the traffic (4.8 times the average), and a ratio between
 the two of about 455x.
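
 As a quick back-of-the-envelope check of those numbers (using the gap sizes
 quoted above, expressed as fractions of the ring):

 {{{
uniform = 1.0 / 100        # each server's ideal share of the ring
lucky   = uniform / 95     # b8159 sits only 1/95 of a uniform gap CW of b80ea
unlucky = uniform * 4.8    # 7b41d sits 4.8 uniform gaps CW of 6ef56
print(lucky, unlucky, unlucky / lucky)
# -> ~0.000105, 0.048, ~456: the unlucky server sees roughly 455x the traffic
 }}}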

 And indeed, although b8159 and 5a82c are in competition for least-used
 server, after about 300000 files, we get this report:

 {{{
 uploaded 292000
 min/max/(exp) usage-pf-ps 153.15 kB/52.44 MB/(10.68 MB): spread-pf: 52.29 MB (489.40%) stddev: 10.69 MB (100.02%)
 least: b8159546f332951c52367c4ad92fd9f7
 most: 7b41d2d31e6180d42eec56221d02cf4d
 }}}

 And 10.68MB/153kB is 70x, and 52.44/10.68 is 4.9x, matching our
 expectations
 of 95x and 4.8x pretty closely.

 If you re-run the program with e.g. {{{--seed=def --permute=0}}}, you'll
 get
 a different distribution of nodeids, which happens to get a 4x ratio
 between
 most-full and least-full, and a stddev of about 30%. Better, but still
 pretty
 bad. {{{--seed=def --permute=1}}} behaves just as well as the "abc" seed:
 stddev is again down to 0.74% after 200k files.

 If you're lucky and find a seed that gives you a uniform distribution,
 then
 {{{--permute=0}}} should give you the same statistics as
 {{{--permute=1}}}.
 But for most seeds (i.e. most grids), you'll get a very lumpy
 distribution.
 {{{--permute=1}}} tolerates arbitrarily lumpy server distributions.

 So, based upon this simulation, I'm fairly convinced that permuted-list is
 necessary to avoid long-term uneven upload rates to different servers. A
 simple linear ring-offset algorithm will subject servers to vastly
 different
 loads unless the nodeids can be controlled to maintain a uniform
 distribution
 (which means changing them every time a server is added or removed).

 Now, do we actually need uniform upload rates? What we really want, to
 attain
 maximum reliability, is to never double-up shares. That means we want all
 servers to become full at the same time, so instead of equal bytes-per-
 second
 for all servers, we actually want equal percentage-of-space-per-second for
 all servers. That's much trickier. If we could completely (and
 continuously)
 control nodeids (by decoupling peer-selection index values from
 cryptographically-backed server pubkeys), we could adjust them to achieve
 inter-server gaps that compensate for how much space they have remaining:
 small servers would be clustered closer together, large servers would be
 placed CW from large gaps. The math necessary to do this strikes me as
 pretty
 complicated, and I think that changing nodeids over time would damage
 efficient retrievability, since shares will no longer be in the ideal
 places
 when the downloader tries to perform the same peer-selection routine as
 the
 uploader did.

 We could also have servers refuse some percentage of incoming shares even
 if
 they had space for them, to get their percentage-full-per-second rates
 down
 to match the grid-wide average. This would induce the same problems that
 the
 ring-offset and lumpy-distribution scheme has: servers which happen to sit
 CW
 of a self-throttling node will get more traffic than usual.

 OTOH, it's a little bit easier than that: we don't need to engage in this
 load-shaping work until we start to run out of servers. If we have at
 least
 "N" servers with space available, then reliability is unaffected by the
 rate
 at which they're filling up. So we could have servers accept shares at
 full
 speed until it looked like the grid was starting to fill up, then have
 them
 switch into a mode where they defer requests to other servers more and
 more
 (to obtain uniform fill rates) as the remaining space dwindles. The
 shaping
 effect would be negligible in a grid with lots of free space. A managed
 grid,
 for which new servers are added before the grid gets full, would never
 need
 to engage in load shaping. But any amount of load shaping that *was* being
 performed would put off the day at which the first server gets full.
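
 To make that concrete, one purely illustrative shape for such a policy (not
 anything Tahoe currently implements) might be:

 {{{
import random

def accept_share(my_free_fraction, grid_avg_free_fraction, threshold=0.2):
    # hypothetical load-shaping sketch: no throttling while the grid still
    # has plenty of room, then servers that are fuller than average start to
    # decline a growing fraction of requests, so that all servers fill at
    # roughly the same percentage-of-space-per-second rate
    if grid_avg_free_fraction > threshold:
        return True        # lots of free space grid-wide: accept everything
    if my_free_fraction >= grid_avg_free_fraction:
        return True        # emptier than average: keep accepting
    return random.random() < my_free_fraction / grid_avg_free_fraction
 }}}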


 So, in summary, I am re-convinced that linear ring-offset has real
 problems,
 and that permuted-list provides a more uniform bytes-per-second inlet
 rate,
 which is easier to deal with and gives better system-wide properties.

-- 
Ticket URL: <http://allmydata.org/trac/tahoe/ticket/302#comment:12>
tahoe-lafs <http://allmydata.org>
secure decentralized file storage grid

