#5 closed defect (fixed)

verifierid as storage index: not the whole story

Reported by: warner
Owned by:
Priority: minor
Milestone:
Component: code-encoding
Version: 0.6.0
Keywords:
Cc:
Launchpad Bug:

Description (last modified by warner)

We've talked on and off about what key we should be using when looking up shares (the index sent to RIStorageServer.get_buckets). We're currently using the VerifierId. I'm wondering if we should be using some combination of the verifierid and the encoding parameters to make sure that this index consistently maps to the same set of shares, rather than merely shares generated from the same data.

The peer selection algorithm forces us to pick exactly one index value.
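
To make the problem concrete, here's a rough sketch (toy helpers only, not the real upload code): the index we use today depends only on the crypttext, while the shares depend on the encoding parameters too, so two different encodings of the same crypttext collide on the same index.

    # Toy sketch, not the real Tahoe code: toy_encode() stands in for zfec,
    # and the "verifierid" here is just a plain hash over the crypttext.
    from hashlib import sha256

    def toy_encode(crypttext, k, n):
        # stand-in for erasure coding: share data varies with (k, n)
        return [sha256(b"%d/%d/%d:" % (i, k, n) + crypttext).digest()
                for i in range(n)]

    crypttext = b"example crypttext"
    verifierid = sha256(crypttext).digest()   # today's get_buckets index

    shares_3_of_10 = toy_encode(crypttext, 3, 10)
    shares_25_of_100 = toy_encode(crypttext, 25, 100)

    # The two share sets differ, but both would be stored and looked up
    # under the same verifierid.
    assert shares_3_of_10 != shares_25_of_100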

Pros and cons of different index values:

FileId:

  • good: quick to compute, allows potential uploaders to make the barest minimum of passes over the source data (two passes total: one for fileid+encryption_key determination, a second for encryption and encoding).
  • bad: a single plaintext file might be encrypted in different ways, and it might also be encoded in different ways. Either change alters the share data thus generated, and those shares should not be intermingled.
  • bad: privacy leak, reveals more data about what files people are using (such that even custom encryption keys fail to hide the identity of the file)

VerifierId:

  • good: allows custom keys to protect the identity of the file
  • good: changes in encryption key result in different share identity
  • bad: requires an extra pass. The minimum memory/disk footprint approach means we don't want to store the crypttext, so we need a (fileid+key) pass and an (encrypt+discard+verifierid) pass; then we know the verifierid and can ask peers about shares, and if we need to upload the file for real we need a third (encrypt+encode) pass (see the sketch after this list).
  • bad: variations in encoding parameters (total number of shares, number of required shares, segment size) result in different shares, but these variations are not captured in the index
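
Here's the pass structure from the VerifierId item above, as a runnable toy sketch. Every helper is a stand-in (plain SHA-256 instead of tagged hashes, a fake cipher, no real erasure coding); only the number of passes over the source data is meant to be accurate.

    from hashlib import sha256

    def fake_encrypt(plaintext, key):          # stand-in for the real cipher
        return bytes(b ^ key[0] for b in plaintext)

    def upload(plaintext):
        # Pass 1: fileid and encryption key derived from the plaintext
        fileid = sha256(b"fileid:" + plaintext).digest()
        key = sha256(b"key:" + plaintext).digest()[:16]

        # Pass 2: encrypt, hash the crypttext to get the verifierid,
        # then discard the crypttext (we don't want to store it)
        verifierid = sha256(fake_encrypt(plaintext, key)).digest()

        # ...here we would ask peers about `verifierid`...

        # Pass 3 (only if the file isn't already present): encrypt again
        # and encode the crypttext into shares for upload
        crypttext = fake_encrypt(plaintext, key)
        return fileid, key, verifierid, crypttext

    fileid, key, verifierid, crypttext = upload(b"example plaintext")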

So I'm thinking that the share index needs to be the verifierid plus a serialized representation of the encoding parameters. The serialized parameters can be compressed by just saying "v1" and having that imply a certain algorithm applied to the filesize, but that should still give us the ability to change encoding parameters in the future and not wind up with incompatible shares that appear identical from the perspective of get_buckets().
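
One way that index could be derived (a sketch, not a committed format; the serialization and tag strings here are made up):

    from hashlib import sha256

    def serialize_params(version, k, n, segment_size):
        # "v1" would stand for "the standard parameter-choosing algorithm,
        # applied to the filesize"; explicit values shown for clarity
        return b"%s:%d:%d:%d" % (version, k, n, segment_size)

    def storage_index(verifierid, params):
        return sha256(b"storage_index:" + verifierid + b":" + params).digest()

    verifierid = sha256(b"example crypttext").digest()
    si = storage_index(verifierid, serialize_params(b"v1", 3, 10, 2**20))

    # Different encoding parameters now yield a different index for the
    # same crypttext, so incompatible shares no longer look identical:
    other = storage_index(verifierid, serialize_params(b"v1", 25, 100, 2**20))
    assert si != other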

There is certain information that needs to go into peer selection (depending upon the algorithm). The verifierid is one such piece, and the number of shares that were uploaded is another (at least for PeerSelection/TahoeThree and PeerSelection/DenverAirport; PeerSelection/TahoeTwo does not need it). Other information can affect the shares being generated without influencing peer selection (like segment size): this data could be stored on the peers and retrieved at download time.

Peers could store shares from multiple encoded forms of the same crypttext. The download process would then have two steps. First, the downloader asks a set of likely peers about a verifierid and learns of a set of encoded forms, where each peer may have buckets for some forms and not others; the response listing the encoded forms includes their encoding parameters, so the downloader learns how many buckets it needs for each form to recover the file. Second, the downloader picks one form and retrieves references to sufficient buckets for that form, and finally the data can be fetched and decoded.
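
A sketch of that two-step exchange, using plain dicts as stand-ins for the remote interfaces (no Foolscap, and all field names are made up):

    def list_forms(peer, verifierid):
        # step 1: each form carries its own encoding parameters, so the
        # downloader learns how many buckets it needs for that form
        return peer["forms"].get(verifierid, [])

    def choose_form(peers, verifierid):
        available = {}
        for peer in peers:
            for form in list_forms(peer, verifierid):
                key = (form["k"], form["n"], form["segsize"])
                available.setdefault(key, set()).update(form["buckets"])
        # step 2: pick any form for which at least k distinct shares are
        # reachable, then fetch and decode those buckets
        for (k, n, segsize), shares in available.items():
            if len(shares) >= k:
                return (k, n, segsize), sorted(shares)[:k]
        return None

    peers = [
        {"forms": {b"VID": [{"k": 3, "n": 10, "segsize": 2**20,
                             "buckets": [0, 4]}]}},
        {"forms": {b"VID": [{"k": 3, "n": 10, "segsize": 2**20,
                             "buckets": [7]}]}},
    ]
    print(choose_form(peers, b"VID"))   # -> ((3, 10, 1048576), [0, 4, 7])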

Change History (7)

comment:1 Changed at 2007-04-27T03:24:21Z by warner

  • Description modified (diff)

fix some wikinames

comment:2 Changed at 2007-04-28T19:17:41Z by warner

  • Component changed from component1 to code

comment:3 Changed at 2007-06-29T23:27:33Z by warner

  • Version set to 0.3.0

we can probably put this one off for a little while. If the storage index is randomly generated (or derived from something randomly generated, like the readkey), then this isn't a problem. We could also say that the storage index should be the hash of (readkey, encoding parameters).
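
That alternative might look something like this (SHA-256 and the tag string are stand-ins for whatever tagged hash we'd actually use):

    from hashlib import sha256

    def storage_index_from_readkey(readkey, k, n, segment_size):
        params = b"%d:%d:%d" % (k, n, segment_size)
        return sha256(b"si:" + readkey + b":" + params).digest()

    readkey = sha256(b"example readkey seed").digest()[:16]
    assert (storage_index_from_readkey(readkey, 3, 10, 2**20)
            != storage_index_from_readkey(readkey, 25, 100, 2**20))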

comment:4 Changed at 2007-07-25T02:59:44Z by warner

currently (in, say, source:src/allmydata/upload.py@1000) the Uploadable is responsible for generating the readkey, and it is suggested that convergent uploads use a hash of the file's contents and the desired encoding parameters. We don't do that quite yet, but if we did, then the readkey would be different for different encodings of the same file, and we'd have the properties that we want.
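
A sketch of that convergence idea (again with plain SHA-256 and made-up tag strings standing in for the real tagged hashes): if the readkey itself folds in the desired encoding parameters, then anything derived from the readkey automatically differs between encodings.

    from hashlib import sha256

    def convergent_readkey(file_contents, k, n, segment_size):
        params = b"%d:%d:%d" % (k, n, segment_size)
        return sha256(b"convergence:" + params + b":" + file_contents).digest()[:16]

    assert (convergent_readkey(b"data", 3, 10, 2**20)
            != convergent_readkey(b"data", 25, 100, 2**20))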

comment:5 Changed at 2007-08-14T18:55:46Z by warner

  • Component changed from code to code-encoding
  • Owner somebody deleted

comment:6 Changed at 2007-09-25T04:25:40Z by zooko

  • Resolution set to fixed
  • Status changed from new to closed

Nowadays the storage index is the secure hash of the encryption key. Closing as fixed.
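
For reference, the resolved scheme amounts to something like the following sketch (truncated SHA-256 with a made-up tag string, as an approximation of the tagged hash Tahoe actually uses, not its exact output):

    from hashlib import sha256

    def storage_index_from_key(encryption_key):
        # the storage index depends only on the encryption key, which in
        # the convergent case already reflects contents and parameters
        return sha256(b"storage_index:" + encryption_key).digest()[:16]

    si = storage_index_from_key(b"\x00" * 16)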

comment:7 Changed at 2007-09-25T04:25:51Z by zooko

  • Version changed from 0.3.0 to 0.6.0