#102 closed enhancement (fixed)

smaller and prettier directory URIs

Reported by: warner Owned by: zooko
Priority: minor Milestone: undecided
Component: code-frontend-web Version: 0.7.0
Keywords: web newcaps newurls Cc:
Launchpad Bug:

Description

Our webapi.txt document currently contains the following admonition:

Note that since tahoe URIs may contain slashes (in particular, dirnode URIs
contain a FURL, which resembles a regular HTTP URL and starts with pb://),
when URIs are used in this form, they must be specially quoted. All slashes
in the URI must be replaced by '!' characters.  XXX consider changing the
allmydata.org uri format to relieve the user of this requirement.

This is an unfortunate wart. Can we find a way to remove this requirement?

Change History (23)

comment:1 Changed at 2007-08-14T00:30:33Z by warner

My current thought is to define the dirnode URI syntax as: URI:DIR:(vdrive-server):(storage-index)

as it is currently, but then declare that the (vdrive-server) part (which contains a FURL) shall always be base32-encoded. This would turn the typical 116-character long URI into one that is 165 characters long, but it would keep the FURL as an opaque string.
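The length figures can be checked with a short sketch. The FURL below is reconstructed from the example URI given later in this comment, and the unpadded lowercase base32 produced by Python's `base64` module is only an assumed stand-in for whatever encoder Tahoe would actually use.

```python
import base64

# FURL reconstructed from the example URI later in this comment (illustrative only).
furl = "pb://t7p44biq3u6i5r5zjpb6cdqxid7v7vpx@192.168.69.247:58845,127.0.0.1:58845/vdrive"
storage_index = "w57ncp9cmzyb6kwrjaebq7d8co"

# Current form: the FURL is embedded as-is, slashes and colons included.
plain_uri = "URI:DIR:%s:%s" % (furl, storage_index)

# Proposed form: the FURL is base32-encoded (padding stripped) and stays opaque.
encoded_furl = base64.b32encode(furl.encode("ascii")).decode("ascii").rstrip("=").lower()
encoded_uri = "URI:DIR:%s:%s" % (encoded_furl, storage_index)

print(len(plain_uri), len(encoded_uri))  # 116 165
```

The growth comes entirely from base32 spending 8 characters on every 5 bytes of the FURL.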

An alternative which I'm not really fond of would be to extract the FURL components (relying upon their format and encoding) and re-pack them in the dirnode in a way that avoids the slash problem. For example, URI:DIR:tubid:ipaddr+port,ipaddr+port:swissnum:storage-index. I really don't want to break the abstraction boundary of a FURL this way.

If we knew that FURLs never used some character "X" which was safe to use in a URL, then we could declare that our vdrive-server spec is a FURL with the slashes replaced by X. This would be a part of the dirnode specification, so dirnode URIs everywhere would look like this, not just in the web API, which would be a big improvement.

The problem is, what would be a suitable value of X? FURLs use characters from the set a-z0-9:/@,, plus whatever characters the programmer decides to use inside a name passed to registerReference() (which is currently unbounded but I think it's fair to impose some restrictions on them). I think that slashes were the only real problem (basically it seemed that the apache reverse proxy that was causing us problems was url-decoding the URL, splitting on slashes, re-encoding the remaining pieces, then sending the results to the backend server, so url-encoding the slashes didn't help, but perhaps other characters remain encoded safely). If that's the case, it would be safe (although kind of ugly) to replace the slashes with anything outside the FURL character set (say ~ or & or % or or !), although if it's something that has special meaning then it will require that we always url-encode the dirnode URI before passing it to the web server, which is an easy thing to screw up.

The colons are another problem, since we use them to delimit components of the URI itself. Our current parser works because we only have one field that could contain colons, so we pull the prefix from the left and the storage-index from the right and then whatever's left must be the furl. But that's kind of a wart too.
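A minimal sketch of that "prefix from the left, storage-index from the right" parsing, assuming the current `URI:DIR:(furl):(storage-index)` layout. The function name is hypothetical and this is not the actual Tahoe parser.

```python
def parse_current_dirnode_uri(uri):
    """Split URI:DIR:<furl>:<storage-index> even though the FURL contains colons."""
    prefix = "URI:DIR:"
    if not uri.startswith(prefix):
        raise ValueError("not a dirnode URI: %r" % (uri,))
    rest = uri[len(prefix):]
    # The storage index is the only field after the last colon,
    # so whatever remains on the left must be the FURL.
    furl, storage_index = rest.rsplit(":", 1)
    return furl, storage_index

example = ("URI:DIR:pb://t7p44biq3u6i5r5zjpb6cdqxid7v7vpx@192.168.69.247:58845,"
           "127.0.0.1:58845/vdrive:w57ncp9cmzyb6kwrjaebq7d8co")
furl, storage_index = parse_current_dirnode_uri(example)
```

This only works while exactly one field may contain colons; a second such field would make the split ambiguous.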

So, I dunno. Establishing a rule that the dirnode-server portion is always packed according to the following seems like my current favorite approach, although I'm not yet that happy about it:

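from allmydata.util import idlib  # base32 helpers b2a/a2b (import path assumed)

def pack_dirnode_uri(furl, storage_index):
    # "^" and "$" stand in for "/" and ":", so neither may already occur in the FURL
    assert "^" not in furl
    assert "$" not in furl
    return "URI:DIR:%s:%s" % (furl.replace("/","^").replace(":","$"), idlib.b2a(storage_index))

def unpack_dirnode_uri(uri):
    # after packing, the only colons left are the three field separators
    u, d, f, si = uri.split(":")
    furl = f.replace("$",":").replace("^","/")
    storage_index = idlib.a2b(si)
    return furl, storage_index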

This approach would give us dirnode URIs that look like URI:DIR:pb$^^t7p44biq3u6i5r5zjpb6cdqxid7v7vpx@192.168.69.247$58845,127.0.0.1$58845^vdrive:w57ncp9cmzyb6kwrjaebq7d8co and are still 116 characters long.
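A self-contained variant of the packing scheme can be exercised as a round trip. In this sketch `b2a` and `a2b` are stand-ins built on Python's `base64` module (an assumption, they are not Tahoe's `idlib`), and the storage index is dummy data.

```python
import base64

def b2a(data):
    # Stand-in for idlib.b2a: lowercase base32 with the padding stripped (assumption).
    return base64.b32encode(data).decode("ascii").rstrip("=").lower()

def a2b(text):
    # Restore the padding that b2a stripped before decoding.
    padded = text.upper() + "=" * (-len(text) % 8)
    return base64.b32decode(padded)

def pack_dirnode_uri(furl, storage_index):
    if "^" in furl or "$" in furl:
        raise ValueError("FURL already contains a substitute character")
    packed = furl.replace("/", "^").replace(":", "$")
    return "URI:DIR:%s:%s" % (packed, b2a(storage_index))

def unpack_dirnode_uri(uri):
    scheme, kind, packed, si = uri.split(":")
    if (scheme, kind) != ("URI", "DIR"):
        raise ValueError("not a dirnode URI: %r" % (uri,))
    return packed.replace("$", ":").replace("^", "/"), a2b(si)

furl = "pb://t7p44biq3u6i5r5zjpb6cdqxid7v7vpx@192.168.69.247:58845,127.0.0.1:58845/vdrive"
storage_index = bytes(range(16))  # dummy 128-bit value
uri = pack_dirnode_uri(furl, storage_index)
```

The packed URI contains no slashes and exactly three colons, so a plain `split(":")` is enough to take it apart again.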

comment:2 Changed at 2007-08-14T18:58:35Z by warner

  • Component changed from code to code-frontend-web
  • Owner somebody deleted

comment:4 Changed at 2007-08-20T18:02:40Z by zooko

  • Milestone changed from undecided to 0.6.0

This is part of the "improved web API" task. I would like to see it done for v0.6.

comment:5 Changed at 2007-08-23T19:50:02Z by zooko

  • Owner set to zooko
  • Status changed from new to assigned
  • Type changed from defect to enhancement

comment:6 Changed at 2007-09-19T23:02:42Z by zooko

  • Milestone changed from 0.6.0 to 0.7.0

I think the next step is for me to propose "compressed furls", possibly also with an implementation, for foolscap. See foolscap trac ticket 24:

http://foolscap.lothar.com/trac/ticket/24

comment:7 Changed at 2007-10-01T18:32:53Z by zooko

  • Summary changed from web POST action requires munged dirnode URI to smaller and prettier directory URIs

See also ticket #120 and #105, where it is shown that dirnode URIs might need to be pasted into shells, multiplying the number of characters that will cause trouble (e.g. "$"), and emphasizing the usability cost of dirnodes being large.

comment:8 Changed at 2007-10-29T22:14:38Z by warner

This will be mostly fixed by the distributed-dirnodes fix (#115), as dirnode URIs become just like mutable-file URIs. We just need to decide upon a reasonable length for the crypto pieces. The dirnodes will need to have two hash values: one will be used as an AES key, the other is a validation hash.

comment:9 Changed at 2007-11-01T19:35:33Z by zooko

The first version of #115 is #197.

docs/mutable.txt says:

URI:SSK-RW:b2a(writekey):b2a(verification_key_hash)
URI:SSK-RO:b2a(readkey):b2a(verification_key_hash)
URI:SSK-Verify:b2a(storage_index):b2a(verification_key_hash)

If we make writekey and verification_key_hash each be 256-bit values, then a RW URI would look like this URI:SSK-RW:j13ax9dtuxzxim5yg9a7e8xupjqq4t56tdprwi9ryqupid59xa6y:bux13ehzebbokwng7w6wzswyfppog6nqt3ndu3jxoz8kbbkihz4o.

If we made writekey and verification_key_hash each be 128-bit, then it would look like this: URI:SSK-RW:j13ax9dtuxzxim5yg9a7e8xupe:bux13ehzebbokwng7w6wzswyfc.

I would be comfortable with reducing the writekey size and the verification_key_hash size to something in the range of 100 bits each: URI:SSK-RW:j13ax9dtuxzxim5yg9a7:bux13ehzebbokwng7w6w.
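The field lengths in these examples follow from base32 carrying 5 bits per character. A quick sketch of the arithmetic (the helper name is hypothetical):

```python
def base32_chars(bits):
    """Number of base32 characters needed to hold `bits` bits (5 bits per character)."""
    return -(-bits // 5)  # ceiling division

for bits in (256, 128, 100):
    print(bits, base32_chars(bits))
# 256 52
# 128 26
# 100 20
```

So two 256-bit fields cost 104 characters, two 128-bit fields cost 52, and two 100-bit fields cost 40, before counting the prefix and separators.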

Most users of these strings won't care about which part is the verification hash and which part is the key (and those users that do care can use slicing), so we could leave out the separator between those two: URI:SSK-RW:j13ax9dtuxzxim5yg9a7bux13ehzebbokwng7w6w.

The ":" stops my double-click from selecting the whole word (which suggests that users might cut-and-paste only the end part, thinking that the "URI:SSK-RW:" is not necessary), so how about: MUTRWj13ax9dtuxzxim5yg9a7bux13ehzebbokwng7w6w?

I looked for a special character to put between the "W" and the "j", but I guess special characters have the problem that they get treated specially by text editors -- also possibly by users.

What do you think?

Alternately, we can treat the leading parts as meant for user clarification and not actually a necessary part of the URI, so it could be spelled something like MUT-RW:j13ax9dtuxzxim5yg9a7bux13ehzebbokwng7w6w, and the app would accept input from the user of the form j13ax9dtuxzxim5yg9a7bux13ehzebbokwng7w6w and do the right thing with it.

I prefer this last form. I vote for mutable file URIs to look like: MUT-RW:j13ax9dtuxzxim5yg9a7bux13ehzebbokwng7w6w.
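As a rough sketch of what "accept it with or without the prefix" could mean in code, assuming the 100-bit fields from above (20 base32 characters each). The function names and the prefix table are hypothetical.

```python
ADVISORY_PREFIXES = ("MUT-RW:",)  # hypothetical list of human-readable prefixes
KEY_CHARS = 20                    # 100-bit writekey = 20 base32 characters

def strip_advisory_prefix(text):
    """Return the crypto part of a cap, whether or not the user pasted the prefix."""
    text = text.strip()
    for prefix in ADVISORY_PREFIXES:
        if text.startswith(prefix):
            return text[len(prefix):]
    return text

def split_crypto_values(body):
    """Slice the unseparated body into (writekey, verification_key_hash)."""
    return body[:KEY_CHARS], body[KEY_CHARS:]

with_prefix = strip_advisory_prefix("MUT-RW:j13ax9dtuxzxim5yg9a7bux13ehzebbokwng7w6w")
without_prefix = strip_advisory_prefix("j13ax9dtuxzxim5yg9a7bux13ehzebbokwng7w6w")
writekey, vk_hash = split_crypto_values(with_prefix)
```

Both inputs normalize to the same 40-character body, which is what lets the prefix be purely advisory.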

Now how does the code distinguish mutable files from mutable directories? We've previously discussed putting that type bit into the URI, but now I think this is a bad idea. Not only because it adds to the size of the URI, but also because if the user accidentally twiddles that bit then they get a file of binary garbage when they were supposed to get a directory. I guess URIs are a little too fragile to hold type bits.

What do you think?

comment:10 Changed at 2007-11-01T20:55:37Z by zooko

Hm. Actually, I feel unease. This demonstrates that I'm not really perfectly comfortable with 100-bit crypto values. Not, of course, that I'm worried about attackers brute-force computing something on the order of 2^100 computations, but I'm worried about bugs and novel attacks which reduce the effective strength, or leak partial information.

So how about 128-bit write keys and 127-bit verification hashes?

MUT-RW:j13ax9dtuxzxim5yg9a7e8xupegp6mfd17yrgbkoe5su4164oyi

If we had 128-bit verification hashes, then it would look like this if the last bit was 1: MUT-RW:j13ax9dtuxzxim5yg9a7e8xupegp6mfd17yrgbkoe5su4164oyio and this if the last bit was 0: MUT-RW:j13ax9dtuxzxim5yg9a7e8xupegp6mfd17yrgbkoe5su4164oyiy. It doesn't seem worth it to use a whole character (which is "o" or "y") to represent one bit.

comment:11 Changed at 2007-11-01T20:57:00Z by zooko

Likewise, you could enlarge the key from 128 to 130 bits and the verification hash from 127 to 130 bits, at the cost of adding one character to the URI.
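The trade-off in the last two comments is about how many bits of the final character go unused. A small sketch, assuming the strings use the z-base-32 alphabet with bits packed most-significant first (neither is stated in this ticket, but it matches the "o" or "y" observation):

```python
# z-base-32 alphabet (assumed to be the encoding behind the example strings).
ZBASE32 = "ybndrfg8ejkmcpqxot1uwisza345h769"

def encoded_size(bits):
    """Return (characters needed, padding bits wasted in the last character)."""
    chars = -(-bits // 5)
    return chars, chars * 5 - bits

print(encoded_size(128 + 127))  # (51, 0): fills 51 characters exactly
print(encoded_size(128 + 128))  # (52, 4): one data bit occupies a whole character
print(encoded_size(130 + 130))  # (52, 0): same length, no wasted bits

# With a single data bit followed by four zero padding bits, the last
# character can take only two values.
last_if_one = ZBASE32[1 << 4]
last_if_zero = ZBASE32[0 << 4]
```

Under those assumptions the two possible trailing characters come out as "o" and "y", as described in comment:10.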

comment:12 Changed at 2007-11-08T19:09:14Z by warner

My current ideas for URI format (which we should probably rename "printable representations of filenode/dirnode access capabilities" or something more accurate):

  • CHK ("I" for Immutable):
    • "IR_readkey_uebhash" : the usual CHK read-capability
    • "IV_storageindex_uebhash" : CHK verifier capability
  • SSK/SDMF ("M" for Mutable):
    • "MW_writekey_pubkeyhash" : the read-write capability
    • "MR_readkey_pubkeyhash" : the read-only capability
    • "MV_storageindex_pubkeyhash" : the verifier (note: cannot repair)
    • ??: repair capability, needs write-enabler, but not readkey nor writekey
  • LDMF ("L" for Large) : support for versions, branches, insert/delete
    • "LW_stuff" : read-write
    • "LA_stuff" : append-only, wouldn't that be cool?
    • "LR_stuff" : read-only
    • "LV_stuff" : verifier

Directories which use these files as a backing store then use a short prefix to indicate how the file contents should be interpreted:

  • DIR_readkey_uebhash : immutable directory tree, i.e. "Virtual CD" #204
  • DIV_storageindex_uebhash : verifier for virtual CD
  • DMW_writekey_pubkeyhash : normal read-write dirnode
  • DMR_readkey_pubkeyhash : read-only dirnode (still mutable by others)
  • DMV_storageindex_pubkeyhash : dirnode verifier
  • DLW_stuff : large dirnodes
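To illustrate how the prefixes in the two lists above compose, here is a hedged sketch of a classifier. The function name, the returned dictionary, and the label strings are all hypothetical, and it does not check whether a given family/access combination was actually proposed.

```python
FAMILIES = {"I": "immutable (CHK)", "M": "mutable (SSK/SDMF)", "L": "large mutable (LDMF)"}
ACCESS = {"R": "read-only", "W": "read-write", "V": "verifier", "A": "append-only"}

def describe_cap(cap):
    """Classify a proposed printable cap such as 'DMW_writekey_pubkeyhash'."""
    prefix, _, rest = cap.partition("_")
    # A leading "D" in front of a two-letter file prefix marks a directory.
    is_dir = len(prefix) == 3 and prefix[0] == "D"
    code = prefix[1:] if is_dir else prefix
    if len(code) != 2 or code[0] not in FAMILIES or code[1] not in ACCESS:
        raise ValueError("unrecognized cap prefix: %r" % (prefix,))
    return {
        "directory": is_dir,
        "family": FAMILIES[code[0]],
        "access": ACCESS[code[1]],
        "fields": rest.split("_") if rest else [],
    }

info = describe_cap("DMW_writekey_pubkeyhash")
```

Here `info` describes a read-write directory backed by a mutable file, with the two crypto fields split out.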

Each of these formats should have an internal binary representation, which is the "non-printable serialized filenode/dirnode access capability", and that is the form that should be stored in dirnodes. The printable forms should just be used by external APIs and UI tools like the web interface. The binary representation should probably start with a single non-printable byte so we can have code that accepts both printable and non-printable forms.

comment:13 Changed at 2007-11-08T20:57:10Z by warner

Oh, and of course the use of underscores in those URIs is to allow double-click to select the whole URI (versus the current colons, which most systems treat as word breaks). That will make it easier to cut-and-paste URIs into and out of tahoe UIs like the web page. It might also make URIs less vulnerable to wrapping and corruption by things like MUAs and mailing list software.

We should check to see if that actually works on all our platforms of interest.

Oh, and it might be a good idea to declare that all places you can paste in a URI (like on a web page) will remove all whitespace (both inside and out), to allow the pieces of a wrapped URI to be reassembled. I'm not sure how reliable that would be, though.
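The whitespace rule is cheap to implement. A minimal sketch (the function name is hypothetical), relying on none of the proposed URI characters being whitespace:

```python
def clean_pasted_uri(text):
    """Drop every whitespace character so a wrapped URI is reassembled."""
    return "".join(text.split())

wrapped = "  MW_j13ax9dtuxzxim5yg9a7\n   _bux13ehzebbokwng7w6w \n"
print(clean_pasted_uri(wrapped))  # MW_j13ax9dtuxzxim5yg9a7_bux13ehzebbokwng7w6w
```

It cannot recover from quoting markers such as "> " that a mail client inserts at the start of wrapped lines, which is part of why reliability is uncertain.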

comment:14 Changed at 2007-11-09T14:57:26Z by zooko

Firefox on Macintosh breaks word-selection on underscore.

I think that separating the different crypto pieces from each other is more useful for tahoe hackers than for tahoe users.

I still don't know if we intend for the "is this a file or a directory" typing information to be present *only* in the URI, or also elsewhere, i.e. what is called a "URI extension block" in the context of CHKs.

I think we ought to do the latter (store that information in a place where it is quite inconvenient for a user to change it) and make the typing information in the URI be optional/advisory.

If it would cause real problems for the user to mangle or omit the typing information in the URI, then I think it ought to be glommed onto the crypto information with no intervening special characters. (Although it is okay for the typing information to be capitalized and the crypto information to be lowercase.)

comment:15 Changed at 2007-11-13T18:17:51Z by zooko

  • Milestone changed from 0.7.0 to 0.7.1
  • Version changed from 0.4.0 to 0.7.0

comment:16 Changed at 2007-11-28T23:29:26Z by zooko

I've been reading about key lengths (http://keylength.com and Ferguson & Schneier's Practical Cryptography among other sources), and worrying about the long-term security of smaller crypto values.

After all, if tahoe is relied upon as a storage system, then it may well be used for long-term storage. Ferguson & Schneier write that any cryptosystem deployed today might be in use for 30 years, and that once it is decommissioned, it ought to continue to provide backwards confidentiality for at least 20 more years.

Symmetric encryption keys of size 128 or so bits seem likely to last for 50 years, but secure hash values of 128 or so bits might not last for 30 years, in part because secure hashes and SHA-256 have not been really studied and optimized by cryptographers the way that symmetric ciphers and AES have. (Ferguson & Schneier wrote in Practical Cryptography -- 2003 -- that they generally regard the public crypto community as knowing as much about secure hashes as they knew about symmetric ciphers in the 1980's.)

Then I had a bit of a brainstorm -- tahoe capabilities can be canonically defined as containing full 256-bit SHA-256 outputs, like this: MUT-RW:upyf5nwrpccqw4f53hiidug96663eo5qq4hna4prbragh9e554eou7tqn1ife4tiiuw5eu73ihiia, but can be truncated for human convenience, e.g. to 128-bit hash values, like this: MUT-RW:upyf5nwrpccqw4f53hiidug96663eo5qq4hna4prbragh9e554eo.

The neat thing about this is that you can store the full hash in long term storage (for example, in tahoe directories pointing at other tahoe directories or files), but use the truncated form for short-term exchange through user-friendly tools like IM and e-mail.

Obviously there is a risk that someone stores the short form and wants to use it many years hence and therefore incurs more risk that the resulting file has been substituted by an attacker, but people who are conscious of the fact that they are storing a tahoe cap for the long-term can easily use the full form.
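In the two example strings above, the short form is literally a character prefix of the full one. A sketch of that relationship, assuming truncation at character granularity (a real scheme would have to decide how to handle a hash that ends partway through a 5-bit character):

```python
FULL_CAP = "MUT-RW:upyf5nwrpccqw4f53hiidug96663eo5qq4hna4prbragh9e554eou7tqn1ife4tiiuw5eu73ihiia"
SHORT_CAP = "MUT-RW:upyf5nwrpccqw4f53hiidug96663eo5qq4hna4prbragh9e554eo"

def is_truncation_of(short, full):
    """True if `short` could have been produced by cutting characters off `full`."""
    return full.startswith(short)

def truncate_cap(full, body_chars):
    """Keep the prefix label plus the first `body_chars` characters of the body."""
    label, _, body = full.partition(":")
    return "%s:%s" % (label, body[:body_chars])

print(is_truncation_of(SHORT_CAP, FULL_CAP))   # True
print(truncate_cap(FULL_CAP, 52) == SHORT_CAP)  # True
```

Long-term storage would keep `FULL_CAP`; the short form is what gets pasted into IM or e-mail.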

comment:17 Changed at 2007-11-29T20:19:41Z by warner

Neat idea. I've been pondering doing something like this with foolscap tubids to allow people to get shorter FURLs.

The implementation details would include:

  • the keys are always full-length, of course
  • however many bits you put into the hash that's in the URI, that's how many bits get checked. If you want to play fast and loose, leave the hash blank.
  • storage index values are always derived by hashing a fixed-length string.
    • For CHK we just keep using the hash of the read-key as usual.
    • For mutable slots, we've talked about making SI=hash(pubkey), and putting SI in the URI: this would allow storage servers to verify their own shares up to the signature, and gives us more options to protect against people uploading bogus data in the future. We'd need to declare some minimum length for the SI in this case (enough to provide adequate collision-resistance for billions of files), but that can still give some flexibility of how many bits of the hash(pubkey) you need to paste into an email
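The second bullet ("however many bits you put into the hash, that's how many bits get checked") might look roughly like this. It is a sketch only: it works at byte rather than bit granularity and uses plain SHA-256 over the raw data, whereas the real UEB and pubkey hashes are computed differently.

```python
import hashlib

def check_truncated_hash(data, claimed):
    """Compare only as many hash bytes as the cap actually carried.

    An empty `claimed` value means the user left the hash blank,
    so nothing is verified ("fast and loose").
    """
    full = hashlib.sha256(data).digest()
    return full[:len(claimed)] == claimed

data = b"contents retrieved from the grid"
full_hash = hashlib.sha256(data).digest()

ok_full = check_truncated_hash(data, full_hash)        # all 256 bits checked
ok_short = check_truncated_hash(data, full_hash[:16])  # only 128 bits checked
ok_blank = check_truncated_hash(data, b"")             # nothing checked
bad = check_truncated_hash(b"substituted contents", full_hash[:16])
```

The first three calls succeed and the last one fails, but a shorter hash makes it proportionally easier for substituted contents to pass.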

The main concern that I'd have would be the usual consequences of hash collisions:

  • I create two contracts, one good, one bad, carefully constructed to have the first N bits of their UEB hashes be equal
  • I upload the good one into Tahoe, and truncate the URI to only include N bits of the UEB hash
  • I send you the URI and you sign a statement committing yourself to the contract as referenced by the URI
  • then I cancel all the leases on the good contract, allowing it to expire from the grid, then upload the bad contract
  • I go to a judge and point to the valid signature pointing to the bad contract
  • ...
  • step 3: profit

The obvious answer is to tell people not to bind themselves to anything with an insufficiently long hash: there are "secure" URIs and "insecure" ones.
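How long is "sufficiently long" is governed by the birthday bound: finding two documents that agree on the first N bits of a hash takes on the order of 2^(N/2) attempts, not 2^N. The toy search below makes this concrete for a 16-bit prefix. It uses plain SHA-256 over made-up document text as a stand-in for the UEB hash.

```python
import hashlib
from itertools import count

def find_prefix_collision(n_bytes):
    """Search for two distinct documents whose SHA-256 digests share n_bytes leading bytes."""
    seen = {}
    for i in count():
        doc = b"contract variant %d" % i
        prefix = hashlib.sha256(doc).digest()[:n_bytes]
        if prefix in seen:
            return seen[prefix], doc
        seen[prefix] = doc

# 16 bits: a collision typically shows up after a few hundred attempts.
good, bad = find_prefix_collision(2)
print(good, bad)
```

Each extra byte of hash kept in the URI multiplies the attacker's expected work by about 16, so a 128-bit truncated hash leaves roughly 2^64 work for this attack.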

Not a major concern, but we'd want to make sure to document safe handling procedures for URIs w.r.t. the strength of their identification properties.

comment:18 Changed at 2007-12-18T00:15:52Z by zooko

Our current plan is to use the new crypto scheme described in #217 -- "better crypto for mutable files -- small URLs, fast file creation" so that we can have only one crypto value in a capability, make crypto values 256 bits long, and use base-62 encoding so that the resulting strings are still double-clickable and googlable.
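For a sense of scale, a 256-bit value fits in 43 base-62 characters, all of them alphanumeric. The sketch below assumes a digits/uppercase/lowercase alphabet ordering and fixed-width output. Both are illustrative choices, not something decided in this ticket or in #217.

```python
import string

# Alphabet ordering is an assumption made for this sketch.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase
WIDTH = 43  # smallest width with 62**WIDTH >= 2**256

def base62_encode(value):
    """Encode a 32-byte value as a fixed-width base-62 string."""
    n = int.from_bytes(value, "big")
    chars = []
    while n:
        n, remainder = divmod(n, 62)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars)).rjust(WIDTH, ALPHABET[0])

def base62_decode(text, size=32):
    """Inverse of base62_encode."""
    n = 0
    for ch in text:
        n = n * 62 + ALPHABET.index(ch)
    return n.to_bytes(size, "big")

secret = bytes(range(32))
cap_body = base62_encode(secret)
```

With no punctuation in the alphabet, a double-click selects the whole string, at the cost of encoding via big-integer division instead of simple bit slicing.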

A related change is to stop calling them URIs! They are "caps". caps! caps! caps! Yay, caps!

comment:19 Changed at 2008-01-23T02:47:27Z by zooko

  • Milestone changed from 0.7.1 to undecided

comment:20 Changed at 2009-10-28T03:36:02Z by davidsarah

  • Keywords newcaps added

Tagging issues relevant to new cap protocol design.

comment:21 Changed at 2009-10-28T07:28:21Z by davidsarah

  • Keywords newurls added

comment:22 Changed at 2010-02-23T03:11:50Z by zooko

  • Milestone changed from eventually to 2.0.0

comment:23 Changed at 2010-08-03T09:22:16Z by davidsarah

  • Milestone changed from 2.0.0 to undecided
  • Resolution set to fixed
  • Status changed from assigned to closed

Er, isn't the description of this ticket about something that was fixed long ago?

I don't think there's anything remaining here that isn't covered by #882 and #432.
