[tahoe-dev] #534: "tahoe cp" command encoding issue

Brian Warner warner at lothar.com
Fri Feb 27 09:45:24 PST 2009


[must be brief, typing on an iphone, I'll write more on Monday when  
I've got a real keyboard]

One limitation to keep in mind is that JSON cannot represent arbitrary  
binary data without application-visible encoding, and that both the  
webapi GET $dircap?t=json and the dirnode-format metadata dict use  
JSON. So any "store the original bytes and let the reader sort it out"  
approach must e.g. base32-encode those bytes on the way in and base32- 
decode them on the way out, in the CLI tool on the user side of the  
HTTP connection.
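
A minimal sketch of that round trip, assuming Python 3 and a hypothetical metadata key called "original_filename_b32" (neither the key nor the helper names are part of Tahoe). It only illustrates that the raw bytes have to become ASCII text before they can go into JSON:

```python
import base64
import json


def filename_bytes_to_json_value(raw_name: bytes) -> str:
    """Base32-encode raw filename bytes into an ASCII string that JSON can carry."""
    return base64.b32encode(raw_name).decode("ascii")


def json_value_to_filename_bytes(value: str) -> bytes:
    """Reverse the encoding on the way out, recovering the exact original bytes."""
    return base64.b32decode(value)


# A filename whose bytes are not valid UTF-8 (Latin-1 encoded e-acute).
raw = b"caf\xe9.txt"

# "original_filename_b32" is a made-up key name used only for illustration.
metadata = {"original_filename_b32": filename_bytes_to_json_value(raw)}
wire = json.dumps(metadata)

restored = json_value_to_filename_bytes(json.loads(wire)["original_filename_b32"])
assert restored == raw
```

Both helpers would live in the CLI tool, so everything that crosses the HTTP connection is plain JSON text.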

How about this: we treat the child name (which has more users right  
now, in terms of lines of code which think they know how to interpret  
it) as being the "share with others" name: always unicode, but not  
always a faithful roundtrippable representation of the original. Then,  
for files which were copied from a local disk (like with "tahoe cp" or  
"tahoe backup", as opposed to a WUI operation), let's add a metadata  
field that is defined to hold the base32-encoded representation of the  
original uninterpreted filename bytestring, and treat this metadata  
field as the "note to myself" value, used to restore from a backup but  
not meant for other users.

On the inbound side, if we can't decode the filename with the user's  
preferred encoding (which can default to utf-8, or utf-16 on Windows,  
or something configured into python, etc), then we pretend to decode  
it with Latin-1, so that a human looking at the mangled unicode name  
can hopefully guess what the proper name should have been. We use the  
unicode result as the childname. In all cases, we store the original  
bytestring in the metadata.
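
The inbound rule could be sketched roughly like this, assuming Python 3. The function name, the "original_filename_b32" key, and the utf-8 default are placeholders for illustration, not existing Tahoe code:

```python
import base64


def inbound_names(raw_name: bytes, preferred_encoding: str = "utf-8"):
    """Return (childname, metadata) for a filename read from local disk.

    The childname is always unicode. If the preferred encoding cannot
    decode the bytes, fall back to Latin-1, which accepts every byte
    value and so never fails.
    """
    try:
        childname = raw_name.decode(preferred_encoding)
    except UnicodeDecodeError:
        childname = raw_name.decode("latin-1")

    # In all cases, keep the original bytes, base32-encoded so that
    # they survive a trip through JSON. The key name is hypothetical.
    metadata = {
        "original_filename_b32": base64.b32encode(raw_name).decode("ascii"),
    }
    return childname, metadata


# Valid UTF-8 decodes normally.
name_ok, meta_ok = inbound_names(b"caf\xc3\xa9.txt")

# Bytes that are not valid UTF-8 take the Latin-1 fallback.
name_fallback, meta_fallback = inbound_names(b"caf\xe9.txt")
```

Because Latin-1 maps each byte to exactly one code point, the fallback name can look mangled when the real encoding was something else, but the metadata still holds the exact bytes.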

Then, on the outbound side, we add a --use-original-binary-filename  
option, which tells "tahoe cp" to ignore the unicode name and just use  
the bytestring from the metadata. Normally, we have it encode the  
unicode childname into the preferred charset (again with some  
defaults) and ignore the metadata.
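
A rough sketch of the outbound choice, again assuming Python 3. The boolean parameter stands in for the proposed --use-original-binary-filename option, the "original_filename_b32" key is hypothetical, and falling back to the unicode name when the metadata is missing is an assumption of this sketch rather than something decided above:

```python
import base64


def outbound_name(
    childname: str,
    metadata: dict,
    use_original_binary_filename: bool = False,
    preferred_encoding: str = "utf-8",
) -> bytes:
    """Pick the bytes to use as the local filename when copying out."""
    if use_original_binary_filename:
        encoded = metadata.get("original_filename_b32")
        if encoded is not None:
            # Ignore the unicode name and restore the exact original bytes.
            return base64.b32decode(encoded)
        # Assumption: no stored bytes (e.g. a WUI upload), so fall through.

    # Normal case: encode the unicode childname into the preferred charset.
    return childname.encode(preferred_encoding)


# Metadata as it might have been stored for the Latin-1 file b"caf\xe9.txt".
stored = {"original_filename_b32": base64.b32encode(b"caf\xe9.txt").decode("ascii")}

default_bytes = outbound_name("caf\u00e9.txt", stored)
original_bytes = outbound_name("caf\u00e9.txt", stored, use_original_binary_filename=True)
```

In the default case the file comes back under a UTF-8 name; with the option set, it comes back under the same bytes it had on the source disk.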

Thoughts?
  -Brian


