[tahoe-dev] #534: "tahoe cp" command encoding issue

zooko zooko at zooko.com
Mon Mar 2 21:29:34 PST 2009


> from a "meaning" perspective, in the case of 2.d it is wrong to  
> publish in the child name some characters that have an unknown  
> meaning and that are therefore wrongly mapped to unicode entities,  
> hoping that the client will know how to handle this situation.

Okay.  I'm starting to understand this better.  I've re-read various  
posts to this thread, and read a few web pages on the topic.

  *** my realization: Unicode is not enough

Until last weekend, I thought that the canonical internal  
representation of strings in tahoe could be unicode (and the  
canonical serialization be utf-8).  Now I realize that unicode is not  
enough.  We need to be able to accept, store, and emit strings which  
we cannot translate into unicode.  This means that the real canonical  
definition of a string has to be a 2-tuple: string of bytes, and a  
suggested encoding.  We don't always have to explicitly store that 2- 
tuple, but we have to think of it as being the canonical information.
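To make the 2-tuple idea concrete, here is a small sketch in modern  
Python (hypothetical helper names, not actual tahoe code): the  
canonical value is (original bytes, suggested encoding), and any  
unicode form is derived from it on demand rather than stored as the  
source of truth.

```python
def make_name(original_bytes, suggested_encoding):
    """Canonical representation: the raw bytes plus a hint about
    how they were probably encoded."""
    return (original_bytes, suggested_encoding)

def display_name(name):
    """Best-effort unicode for display.  This may be lossy, which is
    fine because the canonical 2-tuple is preserved separately."""
    original_bytes, suggested_encoding = name
    return original_bytes.decode(suggested_encoding, errors="replace")

# Example: latin-1 bytes with a correct encoding hint decode cleanly.
name = make_name(b"r\xe9sum\xe9.txt", "latin-1")
print(display_name(name))  # -> résumé.txt
```

The point of the sketch is that display_name() can be recomputed, and  
even recomputed differently by future clients, without losing the  
original bytes.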

Now I think I understand why Brian's proposal [1] and Alberto's [2]  
included a complete separate copy of the name.  (My earlier idea of  
storing either the unicode string *or* the "original bytes" in the  
same slot using "decode-as-latin-1" is a clever hack to save space,  
but prevents us from using a lossy decoding, the better to indicate  
decode errors to the user.)

So here is my attempt to synthesize Brian's, Alberto's, François's  
[3], and Shawn's [4] proposals, plus my own discoveries.

When reading in a filename, what we really want is to get the unicode  
of that filename without risk of corruption due to false decoding.   
If we can do this, then we can optimize out the storage of the 2- 
tuple of (original bytes, suggested encoding) and just store the  
unicode.  Unfortunately this isn't possible except on Windows and  
Macintosh [footnote 1].

If not, then we get the original bytes of the filename, get the  
suggested encoding, and store that 2-tuple for optimal fidelity.   
Then, we attempt to decode those bytes using the suggested encoding  
and the mode which replaces unrecognized bytes with U+FFFD (as  
suggested by Alberto).  We put the result of that (which is a unicode  
object) into the child name field.
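For illustration, Python's built-in "replace" error handler performs  
exactly this substitution, emitting U+FFFD for each byte sequence  
that is invalid in the suggested encoding (the byte values below are  
made up for the example):

```python
# Raw filename bytes that are actually latin-1, but suppose the
# suggested encoding we recorded is utf-8: the trailing \xe9 is not
# valid utf-8, so it becomes U+FFFD (the replacement character).
raw = b"caf\xe9"
child_name = raw.decode("utf-8", errors="replace")
print(repr(child_name))  # -> 'caf\ufffd'
```

Because we also stored the (original bytes, suggested encoding)  
2-tuple, this lossy result costs us nothing except display fidelity.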

One remaining wrinkle is that there could be multiple entries with  
the same unicode child name but different (string-of-bytes,  
suggested-encoding) 2-tuples.  Newer tahoe clients could be required  
to understand that the unique key is the (string-of-bytes, suggested- 
encoding) 2-tuple *if* it is present, else the unique key is the  
unicode string.  However, what will older tahoe clients do if they  
get multiple children in the same directory with the same unicode  
child name?  Another solution would be to detect these collisions and  
further mangle the already-mangled unicode names by appending "-1",  
"-2", etc.
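A minimal sketch of that collision workaround (hypothetical function,  
not actual tahoe code) might look like this:

```python
def deduplicate(names):
    """Append "-1", "-2", ... to repeated unicode child names so
    that every entry in the directory listing is unique.  Note this
    simple version does not guard against a mangled name colliding
    with a pre-existing name."""
    seen = {}
    result = []
    for name in names:
        if name not in seen:
            seen[name] = 0
            result.append(name)
        else:
            seen[name] += 1
            result.append("%s-%d" % (name, seen[name]))
    return result

print(deduplicate(["caf\ufffd", "caf\ufffd", "caf\ufffd"]))
# -> ['caf\ufffd', 'caf\ufffd-1', 'caf\ufffd-2']
```

Older clients would then at least see distinct (if ugly) names, while  
newer clients could ignore the suffixes and key on the 2-tuple.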

Regards,

Zooko

footnote 1: how to get unicode filenames with Python on different  
platforms

On Python on Windows/NTFS, if you invoke os.getcwdu() or you pass a  
unicode object to os.listdir(), such as "os.listdir(u'.')", then  
you'll get back a list of unicode objects which are guaranteed to  
contain the correct values.  Be happy!  Unless you forget and  
accidentally invoke os.getcwd() or pass a string to os.listdir(),  
such as "os.listdir('.')".  Then you'll get something that is  
probably corrupted.  Don't do that; use the unicode Python APIs.

On Python on MacOSX/HFS+, if you invoke os.getcwdu() or you pass a  
unicode object to os.listdir(), then you'll get back a list of  
unicode objects which are guaranteed to contain the correct values.   
Be happy!  If you forget and invoke os.getcwd() or os.listdir('.')  
then you'll get a set of strings which are utf-8 encodings of the  
unicode objects.  You could then recover by utf-8-decoding them all,  
but why?  Just use the unicode APIs in the first place.

On Python on other Unix, if you invoke os.getcwdu(), then it will  
attempt to decode the cwd using the current locale.  If it fails it  
will raise a UnicodeDecodeError.  If it succeeds then you'll get a  
unicode object.  If the current locale doesn't indicate the right  
encoding for all the elements of the cwd, but they do accidentally  
decode, then you'll get a unicode object with a corrupted value in  
it.  If you pass a unicode object to os.listdir(), then you'll get back  
the result of attempting to decode the items using the current  
locale.  If it didn't get a decode error, then the resulting item  
will be type unicode.  If it did get a decode error, then the  
resulting item will be the original bytes in a string.  As before, if  
the current locale doesn't indicate the right encoding for all the  
items in this directory, then some of them may be corrupted.   
Conclusion: never use the unicode APIs on Linux.  Use the flat  
bytestring APIs to get something which is at least guaranteed not to  
be corrupted, and use sys.getfilesystemencoding() to get Python's  
best guess about the suggested encoding and proceed from there.
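In today's Python 3 terms (the calls above are from Python 2, but the  
behavior is analogous), the bytestring-API approach looks roughly  
like this:

```python
import os
import sys
import tempfile

# Listing a directory with a bytes path returns the raw filename
# bytes, uncorrupted by any guessed decoding; the locale-derived
# suggested encoding comes from sys.getfilesystemencoding().
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "hello.txt"), "w").close()
    raw_names = os.listdir(os.fsencode(d))   # bytes in, bytes out
    suggested = sys.getfilesystemencoding()
    # Decode for display only, replacing undecodable bytes with
    # U+FFFD; keep raw_names as the canonical values.
    decoded = [b.decode(suggested, errors="replace") for b in raw_names]
    print(raw_names, decoded)
```

This is the (original bytes, suggested encoding) 2-tuple again: the  
bytes are authoritative, the decoding is advisory.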

I don't know about other filesystems under Windows or Mac such as  
VFAT (isn't that the filesystem typically used on thumb drives?) or  
CDROMs, etc.

[1] http://allmydata.org/pipermail/tahoe-dev/2009-February/001339.html
[2] http://allmydata.org/pipermail/tahoe-dev/2009-February/001348.html
[3] http://allmydata.org/pipermail/tahoe-dev/2009-February/001346.html
[4] http://allmydata.org/pipermail/tahoe-dev/2009-February/001322.html

