#453 new enhancement

safely add plaintext_hash to immutable UEB

Reported by: warner Owned by:
Priority: major Milestone: undecided
Component: code-encoding Version: 1.0.0
Keywords: integrity newcaps performance Cc:
Launchpad Bug:

Description

We used to put a hash of the plaintext in the UEB, both a flat hash of the whole file, and a merkle tree over the plaintext segments. This helps guard against bugs in the decryption process, for example the bug that pycryptopp had when compiled on certain compilers (I can't find a reference for it right now).

But we removed it, because exposing it to the whole world enables the "partial-information guessing attack", in exactly the same way that universally-convergent encryption keys would do.

Robk pointed out a long while back that we could solve both problems (i.e. get plaintext validation without exposure to the guessing attack) by encrypting the plaintext hash with (a derivative of) the read-key. This way, only the actual reader would get to see the plaintext hash. We could do this by adding a field named 'protected_plaintext_hash' (and also 'protected_plaintext_roothash', for the merkle tree) to the UEB, both of which are encrypted. Similarly we'd need to encrypt the plaintext merkle tree (which normally lives in the share but outside the UEB).

Zooko pointed out that we could generalize this to think of the share as containing several parts: one is unencrypted and visible to the whole world (i.e. access is protected by only the verify-cap, which is generally unguarded), but another is encrypted and visible only to the holder of the read-cap. (For mutable files, we could have a third section, visible only to the holder of the write-cap). The plaintext hash information would then go into the read-cap section.

We also considered that, because the UEB is encoded as a dictionary, we can add new keys to it without affecting earlier versions of the code: they will ignore the keys that they don't know about.

Change History (4)

comment:1 Changed at 2009-10-28T04:09:45Z by davidsarah

  • Keywords newcaps added

Tagging issues relevant to new cap protocol design.

comment:2 Changed at 2009-12-07T03:13:45Z by davidsarah

  • Keywords performance added

Se #658 for a way to use this to improve performance, by optimizing uploads/downloads where the target file already exists with the same contents.

comment:3 Changed at 2010-02-23T04:42:09Z by zooko

If we require a "flat hash" of anything (i.e. a Merkle–Damgård construction such as SHA-256 or one of the Merkle–Damgård, HAIFA, or other "chain-like" or "sponge-like" constructions that are SHA-3 candidates) then compatible implementations will be unable to parallelize the computation of the secure hash.

Most of our data structures are parallelizable. It would be kind of a shame if some future implementation of Tahoe-LAFS on a 144-core ColorForth chip had to stop processing a Tahoe-LAFS data structure in 144 parallel streams and wait for one of its cores to process the whole input by itself. :-)

In contrast, if we require a Merkle Tree hash of the thing, then future implementations will have the option of parallelizing the computation of it. A properly specified Merkle Tree has the same security properties as a flat hash has, although it doesn't have the property that it will detect bugs in the Merkle Tree implementation.

Another property that it doesn't have is that it computes a widely understood, standardized hash value of the input. (Unless, of course, we can convince the SHA-3 standards body to standardize on a tree hash instead of a linear hash.) So anyway, in the short term I think we should either not have a plaintext hash at all (which is the current situation) or have one which is optional so future implementations can skip it or define a Merkle Tree hash of the plaintext instead of a linear hash.

comment:4 Changed at 2010-02-23T09:43:57Z by warner

I like to support parallelism and performance, so if there really is a tradeoff between having a flat hash and being able to do high-speed super-parallel uploads, I'll prefer the choice that gives us performance. This may mean having the flat hash be optional: if the uploader chooses to provide it, and if the downloader chooses to download the entire file, then the downloader will check it. This also gives enough information for a very simple downloader to validate the whole file (or for a little shell script that's comparing hashes of files on disk against data in the UEB).

So I think there should be four hash-like items in the shares:

  • flat plaintext hash: optional
  • tree plaintext hash: optional
  • flat ciphertext hash: optional
  • tree ciphertext hash: mandatory

and the flat hashes are only checked by a downloader who is fetching the entire file. (note that my new downloader code is very segment-at-a-time random-access oriented, so even in the near term the downloader might start ignoring the flat hashes).

Then, on the day that we write a super-parallelized upload hasher (one day after all tahoe users install an 18 exabyte-per-second DSL line, and two days after we reduce the protocol to a single roundtrip, otherwise it wouldn't make any significant difference), we also add a tahoe.cfg option that enables or disables the generation of the flat hashes. Enabled would result in slower upload hashing (one core would have to linearly see every byte of the file). Disabled would result in faster uploads but would lose those simple flat hashes that some downloaders might want to use.

Oh, and I forgot to mention this in the original description: we can put the encrypted plaintext merkle tree hashes in the old plaintext_hash_tree section, which will be ignored by old clients as long as they don't see a plaintext_root_hash key in the UEB. This will let us quietly add an encrypted plaintext hash tree without impacting compatibility with older clients.

Note: See TracTickets for help on using tickets.