#599 new enhancement

maybe add share-metadata: "where-are-the-other-shares" hints

Reported by: warner Owned by:
Priority: major Milestone: undecided
Component: code-storage Version: 1.2.0
Keywords: download Cc:
Launchpad Bug:

Description

An idea that we've kicked around before came back to me today, in a different form. What if each share (living on some server somewhere), in addition to the data necessary to recover the file (blocks and hashes and signatures), also contained hints about the locations of other shares? These hints could take the form of publically-visible FURLs of the storage servers that are holding the other shares.

These "hints" would not be authoritative, because shares may have been deleted, copied, or regenerated without necessarily updating all the other shares. But, assuming that shares tend to stick around for a while, these hints could provide a high-probability way to locate all the other shares.

The basic addition would be an extra method on the WriteBucket object, something like set_share_location_hints, which accepts a list of FURLs, and stores it next to the share. The retrieve side would be an extra return argument to the get_buckets call, to return the list of hints.

The download algorithm (the new state-machine oriented one that we're thinking of building, to help with #287 and #193) would be changed. The first phase of the download process is responsible for acquiring ReadBucket objects, each of which connects to some remote server. This phase would be changed to use the regular "Tahoe-2" peer-selection algorithm to find a batch of servers (perhaps 5 servers) to whom get_buckets messages will be sent in parallel. Each time a response comes back, a new message will be sent to the next unasked peer in the permuted list. But if a response includes hints about the locations of other shares, then all of those hinted servers will be asked. The download process can begin once at least "k" shares have been located.

For mutable files, this increases the chance that we'll locate all shares of the file, which helps reduce the danger of accidental-rollback.

The hints can contain FURLs, but in general only the tubid portion will be used. The client will look in its list of connections to see if it has any which connect to the same tubid as in the hint. This minimizes the importance of servers maintaining their same IP address and port number for long periods of time. If the client hasn't heard about the tubid from the Introducer, it could conceivably try to connect to the given FURL anyways. This would cause servers which have left the grid (or merely been unable to connect to the introducer recently) to still be used, however it would also make it slightly easier to cause mischief by publishing FURLs that point to port 25, etc.

For maximum benefit, hints should be updated when the file is repaired, but we must be careful to define the necessary authority correctly. For mutable files, the authority to modify the share is sufficient: anyone who can clobber the file is also allowed to clobber the hints. For immutable files, the question is more difficult. One possibility is that the repairer submits potential hints to the servers that hold old shares, and the server validates the hint itself before committing them to the share's metadata. The server would do this by connecting to the FURL in question and performing the same get_buckets that a downloading client would do. The server could then randomly check a few segments against their block hash trees, and make sure the remote share hash fits into its local share hash tree. This verification process is made marginally easier by the fact that the verifying server already has a copy of the share hash tree.

This last feature (servers checking up on each other) is reminiscent of the original "Tahoe 1" design, in which each file had a cabal of servers holding its shares, and the members of this cabal kept track of each other (validating each others shares, repairing the file when necessary, recruiting new members if too many servers dropped out). We abandoned that design because it required each server to keep track of too much information, but if the location-of-other-shares list is treated merely as a hint, then perhaps it wouldn't be too bad.

Change History (1)

comment:1 Changed at 2010-01-06T20:59:30Z by warner

I was thinking more about this last night, in the context of #872, #447, #573, and other enhanced peer-selection goals. Many of these tickets express a desire to improve certain properties of the distributed files, typically reliability (by improving geographical diversity), reduced repair cost (getting k+1 shares into each datacenter), increased download speed (placing several shares near each downloader), and increased fairness of reliability over time (by trying to fill all servers at the same time instead of at the same rate, flattening out the entropy(peer-selection-process)-vs-time curve as described in comment:ticket:872:4).

All of these goals are hard to accomplish right now, because we want small fixed readcaps for each file, and that doesn't leave the downloader with a lot of information available to do the matching peer-selection algorithm. The fact that readcaps are immutable means that the repairer/rebalancer (which need to add shares to new places, and sometimes remove them from old ones) are changing the serverlist, and yet the downloader must be able to find the new shares efficiently using the old (fixed) readcap. Even if the file isn't repaired and all shares exist in the same places, the system needs to be tolerant of server churn.

The Tahoe2 selection algorithm provides small fixed readcaps, handles share movement due to repair (assuming the repairer uses Tahoe2 too), and has reasonable tolerance of server churn (adding 10% new servers will increase the downloader's search distance by an average of 10%). And it provides fill-at-same-rate load-balancing behavior, which is easy to understand and happens to give you fill-at-same-time behavior if you have homogeneous servers. But it is hard to accomodate any of the other goals contained in the tickets above, or to trade off some goals against others.

"All problems in computer science can be solved by introducing another layer of abstraction". In the context of this ticket, we could accomplish more of these goals simultaneously by introducing an intermediate where-are-the-shares table. This table would contain a mutable mapping from storage index to:

  • verifycap
  • share locations: list of (serverid, confidence level)
  • other serverlist locations: list of serverids

The serverid strings would merely need to be short prefixes of the full serverids, perhaps as little as 4 bytes, since they would be expanded by looking for matches in the full serverlist, and infrequent collisions would merely increase the search distance slightly. (this approach, rather than including a full FURL, trades off space against continued dependence on an Introducer and/or a notion of enumerable grid membership).

Each storage server would offer access to "location slots" in this table, just like they currently offer mutable-share slots. However, the slots contain plaintext, and would use have writecaps and readcaps. Instead, the remote operations would look like:

  • allocate location slot
    • later, set storage index and verifycap (once only)
  • propose locations (serverid1, serverid2, ...)
  • propose removal (serverid1, serverid2, ...)
  • fetch share location list
  • fetch other serverlist location list

The storage servers would be responsible for confirming the presence or absence of shares themselves. To this end, each table entry has a "confidence level", which the server uses to keep track of how much validation it has performed. When a "propose location" message arrives, the serverids are added with the lowest confidence level (0). The server sends do-you-have-share messages to those servers, and if they claim "YES", the level is raised to 1. Later, when bandwidth allows, the server can fetch a couple of blocks at random, and the hash chains, and make sure everything verifies up to the locally-held verifycap, and then raises the level to 2. Eventually, the server can fetch and verify the entire share, and raise the level to 3.

Upon receipt of a "propose removal" message, the server marks the entry as "possibly removed", and queries the indicated server directly. If and only if the server response to the DYHB with a "NO", then the entry is actually removed.

The server would need to run a service to manage this verification process, and to allocate a certain amount of bandwidth to it. This manager should also query the other serverlist holders, to update each other on the current share locations.

The idea is that shares would be uploaded to completely arbitrary locations, to accomplish whatever placement goals were desired (load balancing, download speed, datacenter diversity, etc). Then the location slots would be allocated on well-known servers (using the normal Tahoe2 algorithm), and filled with the locations of the shares. The location slot holders would tell each other about the shares to flood this information out to all slots.

A downloader would use Tahoe2 to find at least one location slot, read the serverlist out of it, and then query those servers for shares. They could read from multiple location slots if they somehow got out of sync. The likely implementation would be to send "fetch share location list" messages in parallel to the first N servers in the permuted list, and upon receipt of the first positive response, send DYHB messages in parallel to everything it mentions.

After the repairer uploaded new shares, it would locate the location slots and inform them about the new share. Through flooding, eventually all location slots would learn about the new share. When a share is removed from a server (via lease expiration, or mutable rebalancing), the "propose removal" message would be sent, and the location slot would confirm.

If the repairer notices that there are too few location slots, it can allocate and fill new ones. It would be appropriate to have "N" location slots. Since we're using 1-of-N encoding here (pure replication), the reliability of the location slots should be much higher than the reliability of the target file, and the extra layer of indirection should not significantly reduce reliability.

Having the location slot servers perform their own verification removes the need for an extra layer of writecaps (which would be needed by the repairer to update the location list, making it harder to have storage servers drive the repair process themselves). However it does increase their workload significantly, making this look more like the old Tahoe1 scheme as mentioned above. On the other hand, this could provide the basis for server-driven repair, where their periodic check on each other could provoke repair of unhealthy files without client involvement.

What will prevent us from wanting to manipulate the choice of location-slot holders in the same way that those tickets want to manipulate the choice of share holders? Location slots are very small, so we'd be unlikely to want to balance the storage load as we do with (large) shares. They'd use 1-of-N encoding, so certain performance and reliability concerns are reduced. But for the others, we'd just have to hope that providing arbitrary location of shares is enough to get folks to tolerate these fixed location of location-slot holders.

There would be a nontrivial performance hit to use location-slots. Uploaders would require one extra RTT, because the "propose location" messages could not be sent until all shares had been closed. Servers would spend considerable bandwidth and CPU checking up on each others shares. And downloaders would incur an extra RTT to query the location-slot holders in addition to the existing DYHB queries, however the DYHB queries would be highly likely to succeed, making the combination potentially faster in certain not-ideally-Tahoe2-distributed server selection schemes.

Other ideas:

  • use a limited subset of servers for the location slots (i.e. nodes might offer share-storage, or location-slot-storage, but not both)
  • alternatively, try to put the location-slots on the same servers as the shares, so they share fate and we avoid the reliability hit

Anyways, this feels like a lot of moving parts (and the server load worries me), but a layer like this would let us use arbitrary server-selection policies. It would also help us scale to millions of storage nodes, if the Tahoe2 permutation and frequent queries only needed to be done on a few hundred location-slot servers instead of the entire set of storage servers.

Note: See TracTickets for help on using tickets.