[tahoe-dev] proposal: new immutable file upload protocol: streaming, fewer round-trips, quota-respecting

Brian Warner warner-tahoe at allmydata.com
Fri May 30 17:36:07 PDT 2008


I just got off the phone with Zooko, where we were sketching out a new
protocol for uploading immutable-file shares; here are my notes.

The initial motivation for this was to handle out-of-space conditions on
storage servers a bit better. The allmydata prodnet, which has been in use
for several months now (and has several years' worth of customer data migrated
over from our old "MV" product), has about 40 days left before it fills up,
measured using the misc/munin/tahoe_spacetime.py plugin. So we're getting
ready to add capacity by deploying the next round of storage servers. While
talking about how existing clients will react when we mark the nearly-full
servers as read-only, we came up with some new ideas on managing
storage-server disk quotas.

Disk quotas exist to protect disk space on a storage node so that it can be
used by other applications. Unix programs are traditionally pretty bad at
handling out-of-disk conditions: a lot of them will write log messages in
response to any error, including "my log write failed because the disk is
full" messages, leading to badness. So most unix sysadmins are careful not to
allow the disks to ever actually hit 100% usage (indeed most filesystems
reserve 5% of their space for processes run by root, so that the admin will
be able to log in and free up space).

Our original approach was to have the Tahoe node measure how much space it is
using, by doing a recursive walk of NODEDIR/storage/ at startup, and refusing
lease requests when this size grows above the configured threshold
(NODEDIR/sizelimit). Unfortunately this is very expensive to compute, because
a large storage server holding nearly 1TB of shares can easily take 15
minutes to crawl through all those shares to measure their size. We changed
the Tahoe startup code to skip this "du -s" -style check if no sizelimit is
defined, because all the disk IO was taking too long.
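
For reference, the du-style measurement looks roughly like the following
sketch (illustrative only; the real startup code differs in detail):

  import os

  def used_space(storedir):
      """du-style measurement: walk every share file under storage/ and
      sum their sizes.  Cheap on a small node, but roughly 15 minutes of
      disk IO on a server holding close to 1TB of shares."""
      total = 0
      for dirpath, dirnames, filenames in os.walk(storedir):
          for name in filenames:
              total += os.path.getsize(os.path.join(dirpath, name))
      return total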

Our new idea is to flip this around, and define a reserved-space size. There
are cheap interfaces to ask the filesystem how much space is left (/bin/df,
os.statvfs), and we can tell the tahoe node to stop accepting shares when the
remaining space goes *below* some threshold. This is most appropriate for a
dedicated storage server (such as those run by allmydata.com): you reserve
enough space to protect the rest of the OS (and logfiles, etc), perhaps a
couple of GB, but the rest of the disk is available for shares. This is not
the most convenient mode for friendnet scenarios (where you might want an
upper bound on the tahoe space instead of a lower bound on the remaining
space), but the disk IO cost really forces us to implement the df-style code.
We call this the "server-side hard reservation".
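
A minimal sketch of the df-style check (the reserved_space name is
illustrative, not an actual config knob):

  import os

  def remaining_space(storedir, reserved_space):
      """df-style measurement: ask the filesystem how much space is free
      and subtract the configured reservation.  Cheap: no directory walk."""
      s = os.statvfs(storedir)
      free = s.f_bavail * s.f_frsize   # space available to non-root users
      return free - reserved_space

  # e.g. with a 2GB reservation, stop accepting shares once this hits zero:
  #   remaining_space("NODEDIR/storage", 2*1000*1000*1000)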

In addition, the admin might want to use OS-level filesystem quotas (using
the 'quota' or 'quotatool' debian package). When the filesystem is properly
prepared, these could provide du-style limits on a per-user basis, enforced
much more strictly: you could run the Tahoe node as a separate
user and it would receive exceptions from the write() call that takes it over
the limit. (If run in this mode, it is important to put NODEDIR/storage/ on a
separate partition from NODEDIR/logs/ , so that logfiles and twistd.pid can
still be written even though share writes are rejected). We call this the
"os-level hard reservation".

Finally, there are situations where we'd like to kick a storage server into
"read-only" mode: the disk might be failing and we want to keep its shares
available while we move them to new hardware, or the server is being
decommissioned, or the admin has decided that the disk is full and wants to
manually prevent new space from being allocated, or the disk got mounted
read-only for some reason.

So we're designing a quota-limiting system that gives the storage server
admin the ability to use either of these modes, and also gives the uploading
client high availability. The client-side issue is that the client would
really like to know early that a given server won't be able to hold its
data, so it can efficiently switch to a different server that has more
space available. The current protocol uses reservations to implement this,
but we'd like to switch to a scheme that requires fewer round trips, and
which can also be used in a streaming mode where the size of the share is not
known ahead of time.



So our plan is:

 * the storage server's call to f.write() might raise IOError, either because
   the disk has been mounted read-only, or the NODEDIR/storage/ directory has
   been chmod'ed -w, or because an OS-level quota limit has been hit. Storage
   servers should catch this and report it in a sensible error message, and
   clients should be prepared to see this exception.
 * storage servers which have been manually placed in read-only mode will
   implement this mode by raising IOError just before they call f.write(), to
   use the same error-handling paths.
 * storage servers check their df levels before each write, and raise that
   same IOError if they are beyond their server-side hard reservation.
 * storage servers which believe they will be in read-only mode for a while
   should reannounce themselves to the introducer as a read-only server.
   Clients will consider this server for read/download operations, but not
   for upload operations.
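
A sketch of how the server-side write path might funnel all three cases
through the same IOError (remaining is the df-minus-reservation value from
the sketch above; the readonly flag stands in for the manual read-only mode):

  def write_share_data(f, offset, data, readonly, remaining):
      """All three rejection cases surface as the same IOError that a
      read-only mount, chmod'ed -w directory, or OS-level quota would
      produce."""
      if readonly:
          # manually-configured read-only mode: refuse before touching disk
          raise IOError("storage server is in read-only mode")
      if remaining <= 0:
          # server-side hard reservation exceeded
          raise IOError("out of space (below the configured reservation)")
      f.seek(offset)
      f.write(data)   # may itself raise IOError (ro mount, chmod -w, quota)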

Then we plan to modify the immutable-share storage server protocol (which
currently consists of allocate_buckets() and get_buckets()) to get rid of the
RIBucketWriter objects and instead use a single method as follows:

 def upload_immutable_share(upload_index, storage_index, sharenum,
                            writev, close=bool, expected_size=int_or_None):
     return (accepted=bool, remaining_space=int)

The "upload_index" is an as-yet-unfinished token that allows a server to
upload a share in pieces (one segment per message) without holding a foolscap
Referenceable the whole time. This should improve resumed uploads. "writev="
is your usual write vector, a list of (offset, data) pairs. The "close=" flag
indicates whether this is the last segment or not, serving the same purpose
as the IPv4 "no more fragments" bit: when the server sees close=True, it
should terminate the upload_index and make the finished share visible to
other clients. If the client doesn't close the upload_index in a timely
fashion, the server can delete the partial share.
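
Roughly, the server side of that call might look like the following sketch
(all names and the incoming/ path are illustrative; a real implementation
would also compute remaining space and handle leases):

  import os

  partial_shares = {}   # upload_index -> open file for the partial share

  def upload_immutable_share(upload_index, storage_index, sharenum,
                             writev, close, expected_size=None):
      f = partial_shares.get(upload_index)
      if f is None:
          # first message for this upload_index: start a new partial share
          incoming = os.path.join("storage", "incoming")
          if not os.path.isdir(incoming):
              os.makedirs(incoming)
          f = open(os.path.join(incoming,
                                "%s.%d" % (storage_index, sharenum)), "w+b")
          partial_shares[upload_index] = f
      for (offset, data) in writev:        # apply the write vector
          f.seek(offset)
          f.write(data)
      if close:
          # "no more fragments": retire the upload_index and make the
          # finished share visible to other clients (move out of incoming/)
          f.close()
          del partial_shares[upload_index]
      return (True, 0)   # (accepted, remaining_space): space logic elided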

expected_size= is advisory, and tells the storage server how large the client
expects this share to become. It is optional: if the client is streaming a
file, it may not know how large the file will be, and cannot provide an
expected size. The server uses this advice to make a guess about how much
free space is left.

If the server accepts the write (i.e. it did not run out of space while
writing the share to disk, and it wasn't in a read-only mode), it returns
accepted=True. It also returns an indication of how much free space it thinks
it has left: this will be the 'df' space, minus the reserved space, minus the
sum of all other expected_size= values (TODO: maybe it should include this
one too; either way we must be clear about which approach we take).
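
As arithmetic, the reported value would be something like this sketch
(assuming the server keeps a table of expected_size= values for in-progress
uploads; whether to subtract the current upload is the open question above):

  def estimate_remaining_space(df_free, reserved_space, expected_sizes,
                               this_upload_index):
      """df_free: free bytes reported by statvfs.  expected_sizes: dict
      mapping upload_index -> expected_size (or None) for uploads still in
      progress.  Subtract the reservation and what the *other* uploads are
      expected to consume."""
      other = sum(size for (idx, size) in expected_sizes.items()
                  if idx != this_upload_index and size is not None)
      return df_free - reserved_space - other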

The client will use the remaining_space= response to decide whether it should
continue sending segments to this server, or if it thinks that the server is
likely to run out of space before it finishes sending the share (and
therefore might want to switch to a different server before wasting too much
work on the full one).


For single-segment files, the client will generate all shares, then send them
speculatively to N candidate servers (i.e. peer selection will just return
the first N servers in the introducer's list of non-readonly storage
servers). Each share will have just one block, and just one upload call, in
which the close= flag is True. These servers will either accept the share or
reject it (because of insufficient space). Any share which is rejected will
be submitted to the next candidate server on the permuted list. This approach
gets us a single roundtrip for small files when all servers have free space.
When some servers are full, we lose one block of network bandwidth for each
full server, and add at least one roundtrip. If clients think that servers
are likely to be full and want to avoid the wasted bandwidth, they could
spend an extra roundtrip by doing a small write and checking the accepted=
response before committing to sending the full block.
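
A client-side sketch of that single-block flow, where upload() stands in for
the remote upload_immutable_share call (all names here are illustrative):

  def place_single_block_share(sharenum, block, candidate_servers, upload):
      """Walk the permuted server list, moving on whenever a server rejects
      the share for lack of space.  One round trip per share in the common
      case where the first candidate has room."""
      for server in candidate_servers:
          accepted, remaining = upload(server, sharenum,
                                       writev=[(0, block)], close=True,
                                       expected_size=len(block))
          if accepted:
              return server
          # rejected: one block of bandwidth wasted, try the next server
      return None   # no candidate had room for this share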

For multi-segment files, the client will generate the first segment's blocks,
and send it speculatively to N candidate servers, along with its
expected_size= (if available). These blocks will be retained in memory until
a server accepts them. The client has a choice about how much pipelining it
will do: it may encode additional segments and send them to the same servers,
or it might wait until the responses to the first segment come back. When
those responses come back, the client will drop any servers which reject the
first block, or whose remaining_space= indicates that the share won't fit.
Dropped servers will be replaced by the next candidate in the permuted list,
and the same blocks are sent again. The client will pipeline some number of
blocks (allowing multiple upload messages to be outstanding at once, each
being retired by a successful ack response) that depends upon how much memory
it wants to spend vs how much of the bandwidth-delay product it wants to
utilize.

The client has a "client soft threshold", which is the minimum
remaining_space= value that it is willing to tolerate. This implements a
tradeoff between storage utilization and chance of uploading the file
successfully on the first try. If this margin is too small, the client might
send the whole share to the server only to have the very last block be
rejected due to lack of space. But if the margin is too high, the client may
forgo using mostly-full-but-still-usable servers.
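
The per-response decision might look like this small sketch (soft_threshold
being the client-side margin; names are illustrative):

  def keep_sending_to(accepted, remaining_space, bytes_left_in_share,
                      soft_threshold):
      """Keep a server only if it accepted the last block and its reported
      free space still covers the rest of the share plus the margin."""
      if not accepted:
          return False
      return remaining_space >= bytes_left_in_share + soft_threshold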

The server cannot provide a guarantee of space. But the probability that a
non-initial block will be rejected can be made very small by:

 * all clients providing accurate expected_size= information
 * the server maintaining accurate df measurements
 * clients paying close attention to the remaining_space= responses

If a client loses this gamble (i.e. the server rejects one of their
non-initial blocks), they must either abandon that share (and wind up with a
less-than-100%-health file, in which fewer than N shares were placed), or
they must find a new home for that share and restart the encoder (which means
more round-trips and possibly more memory consumption; one approach would be
to stall all other shares while we re-encode the earlier segments for the new
server and catch them up, then proceed forwards with the remaining segments
for all servers in parallel).

Since the chance of being rejected is highest for the first block (since the
client does not yet have any information about the server; indeed it cannot
even be sure the server is still online), it makes sense to hold on to the
first segment's blocks until that response has been received. An optimistic
client which was desperate to reduce memory footprint and improve throughput
could conceivably stream the whole file to candidate servers without waiting
for an ack, then look for responses and restart encoding if there were any
failures.

For streaming/resumability, the storage protocol could also use a way to
abort an upload (to accelerate the share-unfinished-for-too-long timeout)
when the client decides to move to some other server (because there is not
enough space left).
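
One possible shape for that call (purely hypothetical, in the same spirit as
the upload method above):

  def abort_upload(upload_index):
      # discard the partial share right away, rather than waiting for the
      # share-unfinished-for-too-long timeout to expire
      return None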


Finally, mutable files are in a completely separate world. The first thing we
need for them is to reject the creation of new mutable shares when the server
is full. The current mutable file code will deal with this pretty well (it
will log a level=UNUSUAL message, which is a bit noisy, but will otherwise
fail over to new servers). But I'm worried about what will happen when you
try to modify existing mutable shares and the server throws an exception.
Ideally, if the share has grown too large, the client should be able to move
the share to a new server. It might be a good idea for us to rig the storage
servers to allow modification of mutable shares even after the disk is full
enough to warrant rejecting the creation of new ones.


Also note that this is only somewhat related to the "Accounting" project,
which will establish per-user storage quotas (as opposed to the server-wide
quotas described here). The clients will need to respond to "quota exceeded"
responses in the same way, but those responses will be triggered by some sort
of database summation (the total size of all leases on all servers has grown
beyond some administratively/contractually specified limit) rather than a
'df' value. Since this sort of quota is likely to be expensive to compute
across the whole set of storage servers, we expect to make the clients police
themselves, and do offline analysis of the storage server's lease tables to
catch violators, rather than reject shares in real-time.


ok, that was a bit confused, but I think I got most of the ideas down.

 -Brian

