Opened at 2007-08-20T19:16:59Z

Closed at 2008-02-08T02:15:12Z

#115 closed enhancement (fixed)

update webapi docs for distributed dirnodes

Reported by:	warner	Owned by:	warner
Priority:	major	Milestone:	eventually
Component:	code	Version:	0.4.0
Keywords:		Cc:
Launchpad Bug:

Description

Our current (temporary) situation is to put all vdrive "directory node" information into an encrypted data structure that lives on a specific server. This was fairly easy to implement, but it lacks properties that we want: in particular, that one server is a single point of failure.

We want to improve the availability of dirnodes. There are a number of ways to accomplish this, some cooler than others. One approach is to leave the vdrive-server scheme in place but have multiple servers (each providing the same TubID, using separate connection hints, or the usual sort of IP-based load-balancer frontend box). This requires no change in code on the client side, but puts a significant burden on the operators of the network: they must run multiple machines.

A niftier approach would be to distribute the dirnode data in the same way we distribute file data. This requires distributed mutable files (i.e. SSK files), which will require a bunch of new code. It also opens up difficult questions about synchronized updates when race conditions result in different storage servers recording different versions of the directory.

The source:docs/dirnodes.txt file describes some of our goals and proposals.

Change History (26)

comment:1 Changed at 2007-09-19T03:59:01Z by warner

I'm starting to think that a reasonable solution is to distribute the data with SSK files, but have an optional central-coordinator node.

Small grids that don't want any centralization just don't use the coordinator. They run the risk of two people changing the same dirnode in incompatible ways, in which case they have to revert to an earlier version or something. We'll need some tools to display the situation to the user, but not tools to automatically resolve it.

Large grids who are willing to accept some centralization *do* use the coordinator. Dirnode reads are still fully-distributed and reliable, however the ability to modify a dirnode is contingent upon the coordinator being available. In addition, dirnode-modification may be vulnerable to an attacker who just claims the lock all day long (however we can probably rig this so that only people with the dirnode's write-key can perform this attack, making it a non-issue).

Each SSK could have the FURL of a coordinator in it, and clients who want to change the SSK shares are supposed to first contact the coordinator and obtain a temporary lock on the storage index. Then they're only supposed to send the "SSK_UPDATE" message to the shareholders while they hold that lock. The full sequence of events would look like:

1. user provides desired change (add/rename/delete)
2. see if change is applicable (can't delete non-existent file)
3. do peer selection, compute list of likely SSK shareholders
4. contact first shareholder, discover coordinator FURL
5. contact coordinator, attempt to claim the lock
- if unsuccessful, wait a random number of seconds, then repeat at step 2
6. if successful, send SSK_UPDATE messages to all shareholders
7. when all responses come back (or timeout?), release the lock

Clients who are moving a file from one dirnode to another are allowed to claim multiple locks at once, as long as they drop all locks while they wait to retry.

If the coordinator is unavailable, the clients can proceed to update anyways, and just run the risk of conflicts.
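The lock-then-update sequence can be sketched roughly as follows. This is only an illustration under assumptions: `InMemoryCoordinator`, `LockUnavailable`, and `update_dirnode` are hypothetical names, shareholders are modelled as plain dicts, and a dict assignment stands in for the "SSK_UPDATE" message. It also retries only the lock claim, rather than going back to the applicability check.

```python
import random
import time


class LockUnavailable(Exception):
    """Raised when the coordinator cannot be reached at all (hypothetical)."""


class InMemoryCoordinator:
    """Hypothetical stand-in for the coordinator: one lock per storage index."""

    def __init__(self):
        self.held = set()

    def claim(self, storage_index):
        if storage_index in self.held:
            return False
        self.held.add(storage_index)
        return True

    def release(self, storage_index):
        self.held.discard(storage_index)


def _push(shareholders, storage_index, new_data):
    # Stands in for sending SSK_UPDATE to every shareholder.
    for holder in shareholders:
        holder[storage_index] = new_data


def update_dirnode(coordinator, storage_index, shareholders, new_data,
                   max_attempts=5, sleep=time.sleep):
    """Return True if the update was sent, False if the lock was never obtained."""
    for _ in range(max_attempts):
        try:
            locked = coordinator.claim(storage_index)
        except LockUnavailable:
            # Coordinator is down: proceed anyway and accept the conflict risk.
            _push(shareholders, storage_index, new_data)
            return True
        if locked:
            try:
                _push(shareholders, storage_index, new_data)
            finally:
                # Always drop the lock, even if an update fails.
                coordinator.release(storage_index)
            return True
        # Someone else holds the lock: wait a random interval, then retry.
        sleep(random.uniform(0.1, 1.0))
    return False
```

A client moving a file between two dirnodes would claim both locks before pushing either update, and release both before sleeping.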

We have two current ideas about implementing SSKs. In the simplest form, we store the same data on all shareholders (1-of-N encoding), and each degenerate share has a sequence number. Downloaders look for the highest sequence number they can find, and pick one of those shares at random. Conflicts are expressed as two different shares with the same sequence number.
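For the 1-of-N form, the downloader's choice might look like the sketch below. The `(seqnum, data)` tuple layout and the function name `choose_share` are assumptions made for illustration, not part of any existing code.

```python
import random


def choose_share(shares, rng=random):
    """Pick a share with the highest sequence number.

    shares: list of (seqnum, data) tuples, one per shareholder that answered.
    Returns (data, conflict), where conflict is True when two different
    payloads carry that same highest sequence number.
    """
    if not shares:
        raise ValueError("no shares found")
    highest = max(seq for seq, _ in shares)
    candidates = [data for seq, data in shares if seq == highest]
    conflict = len(set(candidates)) > 1
    return rng.choice(candidates), conflict
```

When `conflict` is true, the caller would surface the situation to the user rather than try to resolve it automatically.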

In the more complex form, we continue to use k-of-N encoding, thus reducing the amount of data stored on each host. In this form, it is important to add a hash of the data (a hash of the crypttext is fine) to the version number, because if there *are* conflicts, the client needs to make sure the k shares they just pulled down are all for the same version (otherwise FEC will produce complete garbage).
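One way to picture the k-of-N requirement: tag every share with a version identifier made of the sequence number plus a hash of the crypttext, then only hand FEC a set of k shares that agree on that identifier. The sketch below assumes SHA-256 as the hash and a `(version, share_number, share_data)` tuple layout; both are illustrative choices.

```python
import hashlib
from collections import defaultdict


def version_id(seqnum, crypttext):
    """Version identifier: sequence number plus a hash of the crypttext."""
    return (seqnum, hashlib.sha256(crypttext).hexdigest())


def select_decodable_shares(shares, k):
    """Return (version, k shares) for the newest version with k distinct shares.

    shares: iterable of (version, share_number, share_data).
    Returns None when no single version has k distinct shares.
    """
    by_version = defaultdict(dict)
    for version, sharenum, data in shares:
        by_version[version][sharenum] = data
    # Try versions from the highest sequence number downwards.
    for version in sorted(by_version, reverse=True):
        found = by_version[version]
        if len(found) >= k:
            return version, sorted(found.items())[:k]
    return None
```

Mixing shares from two versions that happen to share a sequence number is exactly the case the hash guards against.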

Personally, I'm not convinced k-of-N SSK is a good idea, but we should explore it fully before dismissing it.

comment:2 Changed at 2007-10-11T10:18:40Z by warner

Milestone changed from undecided to 1.0

I'm working on a design for a large, mutable, versioned, distributed SSK-style data structure. This could be used for either mutable files or for mutable dirnodes. It allows fairly efficient access (both read and write) of arbitrary bytes, even inserts/deletes of byteranges, and lets you refer to older versions of the file. The design is inspired by Mercurial's "revlog" format.

In working on it, I realized that you want your dirnodes to have higher reliability and availability than the files they contain. Specifically, you don't want the availability of a file to be significantly impacted by the unavailability of one of its parent directories. This implies that the root dirnode should be the most reliable thing of all, followed by the intermediate directories, followed by the file itself. For example, we might require that the dirnodes be 20dBA better than whatever we pick for the CHK files. One way to think about this: pretend we have a directory hierarchy that is 10 deep, and a file at the bottom, like /1/2/3/4/5/6/7/8/9/10/file.txt . Now if the file has 40dBA availability (99.99%), that means that out of one million attempts to retrieve it, we'd expect to see 100 failures. If each dirnode has 60dBA, then we'd expect to see 110 failures: 10 failures because an intermediate dirnode was unavailable, 100 because the CHK shares were unavailable.
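The arithmetic in that example can be checked with a few lines of Python. The sketch reads "dBA" as -10*log10(failure probability), which matches the 40dBA = 99.99% figure, and assumes that dirnode and file failures are independent; the function names are made up for illustration.

```python
import math


def dba_to_failure(dba):
    """Convert an availability in dBA to a failure probability."""
    return 10 ** (-dba / 10)


def failure_to_dba(p):
    """Convert a failure probability back to dBA."""
    return -10 * math.log10(p)


def path_failure(file_dba, dirnode_dba, depth):
    """Probability that a retrieval fails somewhere along the path.

    Retrieval succeeds only if all `depth` dirnodes and the file itself
    are available.
    """
    ok = (1 - dba_to_failure(dirnode_dba)) ** depth * (1 - dba_to_failure(file_dba))
    return 1 - ok


# Expected failures per million attempts for /1/2/.../10/file.txt
failures_per_million = path_failure(40, 60, 10) * 1_000_000
```

With these inputs `failures_per_million` comes out at very nearly 110, about 100 from the file and 10 from the ten dirnodes.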

Given the same expansion factor and servers that are mostly available, FEC gets you much much much better availability than simple replication. For example, 1-of-3 encoding (i.e. 3x replication) for 99% available servers gets you 60dBA (i.e. 99.9999%), but 3-of-9 encoding for 99% servers gets you about 125dBA. The reason is easy to visualize: start killing off servers one at a time; how many can you kill before the file is dead? 1-of-3 is a loss once you've killed off 3 servers, whereas 3-of-9 is ok until you've lost 7 servers. If we use 1-of-6 encoding (6x replication), we get about 120dBA, comparable to 3-of-9.
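These figures follow from a binomial model, sketched below under the assumption that servers fail independently, each with the same availability. The function names are illustrative.

```python
import math


def failure_probability(k, n, server_availability):
    """Probability that fewer than k of n servers are up.

    The file is lost once more than n - k servers are down.
    """
    q = 1 - server_availability
    return sum(
        math.comb(n, down) * q ** down * server_availability ** (n - down)
        for down in range(n - k + 1, n + 1)
    )


def dba(failure):
    """Express a failure probability in dBA."""
    return -10 * math.log10(failure)


for k, n in [(1, 3), (3, 9), (1, 6)]:
    print(f"{k}-of-{n}: {dba(failure_probability(k, n, 0.99)):.1f} dBA")
```

For 99% servers this gives roughly 60, 124.5 and 120 dBA respectively, in line with the numbers quoted above.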

Anyways, the design I'm working on is complicated by FEC, and much simpler to implement with straight replication. To get comparable availability, we need to use more replication. So maybe dirnodes using this design should be encoded with 1-of-5 or so.

comment:3 Changed at 2007-10-29T22:39:54Z by warner

These will be implemented on top of Small Mutable Files (#197), which are mutable but replace-only.

comment:4 Changed at 2007-11-13T18:34:16Z by zooko

Milestone changed from 1.0 to 0.7.0

comment:5 Changed at 2007-11-14T02:19:18Z by zooko

Owner changed from somebody to zooko
Status changed from new to assigned

As mentioned in #207:

create new-style dirnode upon first boot instead of old-style one
remove old dirnode code, replace with dirnode2

comment:6 Changed at 2007-12-05T03:58:07Z by zooko

These last two tasks were completed in 3605354a952d8efd, but there are a few more things to do:

extend the POST command to enable upload of a file without linking it into a directory
put a form to do that on the front page, next to the form to download a file given only its URI ("cap")
more better test coverage -- Brian has been rocking on this

comment:7 Changed at 2007-12-06T22:33:04Z by zooko

Also to do for v0.7.0:

update the docs to describe the new kind of directories. I have "XXX change this" marked in a few places in the docs in my sandbox, but I haven't started writing replacement text yet.

comment:8 Changed at 2007-12-12T00:41:41Z by warner

Things left to do for 0.7.0:

document POST /uri in webapi.txt (upload a file without attaching it to a directory)
add form to the welcome page to use POST /uri
document+test+implement POST /uri?t=mkdir (create a new unattached directory)
- return new URI in response body
add form to the welcome page to use POST /uri?t=mkdir
- adds a special kind of when_done flag that means "please redirect me to the directory page for the dirnode that I just created"

maybe for the future (post-0.7.0):

rename PUT into POST for certain things like t=mkdir
- (for "functions that aren't methods", so to speak)

comment:9 Changed at 2007-12-12T19:26:41Z by zooko

First priority is #231.

Then:

document+test+implement POST /uri?t=mkdir (create a new unattached directory)
- return new URI in response body
add form to the welcome page to use POST /uri?t=mkdir
- adds a special kind of when_done flag that means "please redirect me to the directory page for the dirnode that I just created"

comment:10 Changed at 2007-12-12T19:31:38Z by zooko

Oh, insert #232 as top-priority, even above #231.

comment:11 Changed at 2007-12-13T00:17:45Z by zooko

add:

if the client is configured to create no private directory, then do not put a link from the welcome page to the start.html page
if the client is configured to create a private directory, then put a note on the welcome page which says "private directory will be created once we are connected to X servers...", which note is replaced by a link to start.html after the private directory is created.

comment:12 Changed at 2007-12-18T00:55:24Z by zooko

Finished the part about "If the client is configured to create no private directory, then do not put a link from the welcome page to the start.html page", in 9848d2043df42bc3.

comment:13 Changed at 2007-12-18T01:03:16Z by zooko

I bumped the part about showing the pending creation of the private directory into #234 -- "Nice UI for creation of private directory.".

comment:14 Changed at 2007-12-18T01:04:59Z by zooko

#232 -- "peer selection doesn't rebalance shares on overwrite of mutable file" has been bumped out of Milestone 0.7.0 in favor of #233 -- "work-around the poor handling of weird server sets in v0.7.0".

comment:15 Changed at 2007-12-18T01:05:46Z by zooko

Still to do in this ticket:

document+test+implement POST /uri?t=mkdir (create a new unattached directory)
- return new URI in response body
add form to the welcome page to use POST /uri?t=mkdir
- adds a special kind of when_done flag that means "please redirect me to the directory page for the dirnode that I just created"

comment:16 Changed at 2007-12-19T22:43:28Z by zooko

50bc0d2fb34d2018 finishes test+implement POST /uri?t=mkdir, returning new URI (soon to be called "cap") in the response body

Still to do in this ticket:

document POST /uri?t=mkdir
add ?redirect_to_result=true flag to request an HTTP 303 See Other redirect to the resulting newly created directory
add a form to the welcome page to create a new directory and redirect to it

comment:17 Changed at 2007-12-22T21:26:01Z by zooko

So currently there is a POST /uri/?t=mkdir which works and has unit tests, but it is using the technique of encoding the arguments into the URL, and it needs to switch to the technique of encoding the arguments into the request body, which is the standard for POSTs. There is also a button (a form) in my local sandbox, but that form produces POST queries with the arguments encoded into the body, so it doesn't work with the current implementation.
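The difference between the two encodings can be seen by building both requests with the standard library. This is a client-side sketch only; the endpoint URL is passed in by the caller, and the host and port in the usage line are placeholders rather than a documented default.

```python
from urllib.parse import urlencode
from urllib.request import Request


def mkdir_request_query(endpoint):
    """POST with the arguments encoded into the URL (query string)."""
    return Request(endpoint + "?" + urlencode({"t": "mkdir"}),
                   data=b"", method="POST")


def mkdir_request_body(endpoint):
    """POST with the arguments encoded into the request body, as an HTML form does."""
    body = urlencode({"t": "mkdir"}).encode("ascii")
    return Request(endpoint, data=body, method="POST",
                   headers={"Content-Type": "application/x-www-form-urlencoded"})


# Placeholder address, for illustration only.
query_style = mkdir_request_query("http://127.0.0.1:8123/uri")
body_style = mkdir_request_body("http://127.0.0.1:8123/uri")
```

`query_style.full_url` carries `t=mkdir` and an empty body, while `body_style.data` carries `t=mkdir` and the URL has no query string. A handler that only inspects one of the two places will miss the other.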

comment:18 Changed at 2007-12-24T23:46:36Z by warner

I just pushed a change to make /uri look for the 't' argument in either the queryargs or the form fields, using a utility function named get_arg() that we could use to refactor other places that need args out of a request.
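The idea behind such a helper can be sketched as below. This is not the actual get_arg() from the codebase: the signature, the use of plain dicts of lists (as produced by `urllib.parse.parse_qs`), and the choice to check the query arguments first are all assumptions made for illustration.

```python
from urllib.parse import parse_qs


def get_arg(query_args, form_fields, name, default=None):
    """Return the first value of `name` from the query args or the form fields.

    Both inputs are dicts mapping names to lists of values.
    The query string is consulted first (an assumption of this sketch).
    """
    for source in (query_args, form_fields):
        values = source.get(name)
        if values:
            return values[0]
    return default


# Arguments arriving via the URL and via the form body, respectively.
from_query = get_arg(parse_qs("t=mkdir"), {}, "t")
from_body = get_arg({}, parse_qs("t=mkdir"), "t")
```

Either call yields the string "mkdir", so the handler behind /uri no longer needs to care which encoding the client used.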

I think that "/uri" is the correct target of these commands. Note that "/uri/" is a different place. Our current docs/webish.txt (section 1.g) says that /uri?t=mkdir is the right place to do this, and the welcome page's form (as rendered by Root.render_mkdir_form) winds up pointing at /uri, so I'm going with "/uri" instead of "/uri/" .

To that end, I've changed the redirection URL that /uri?t=mkdir creates to match: this redirection is emitted by the /uri page, and therefore needs to be to "uri/$URI" instead of just "$URI". (The latter works if we were hitting /uri/?t=mkdir, but not when we hit /uri?t=mkdir).
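The reason is ordinary relative-URL resolution, which `urllib.parse.urljoin` reproduces. In this sketch the host, port, and the string EXAMPLECAP are placeholders; a real URI would need to be percent-encoded.

```python
from urllib.parse import urljoin


def resolve_redirect(request_url, location):
    """Resolve a relative redirect target against the URL that was requested."""
    return urljoin(request_url, location)


base = "http://127.0.0.1:8123"  # placeholder address

# Redirecting to just "$URI" from /uri lands outside /uri/ ...
wrong = resolve_redirect(base + "/uri?t=mkdir", "EXAMPLECAP")
# ... whereas "uri/$URI" from /uri lands in the right place,
right = resolve_redirect(base + "/uri?t=mkdir", "uri/EXAMPLECAP")
# and plain "$URI" only works when the request went to /uri/ .
also_right = resolve_redirect(base + "/uri/?t=mkdir", "EXAMPLECAP")
```

Here `wrong` resolves to `/EXAMPLECAP` at the server root, while `right` and `also_right` both resolve to `/uri/EXAMPLECAP`.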

I've also changed the unit test to exercise "/uri?t=mkdir" instead of "/uri/?t=mkdir", and to examine the redirection that comes back to make sure it is correct.

comment:19 Changed at 2007-12-25T22:44:12Z by zooko

See #233 -- "creation and management of "root" directories -- directories without parents".

comment:20 Changed at 2007-12-25T22:47:46Z by zooko

Still to do:

Document POST /uri?t=mkdir in webapi.txt [2].
Lots of other documentation updates, many of which Josh and I have in local sandboxes here at my mom's farm in New Mexico.