[tahoe-dev] #534: "tahoe cp" command encoding issue

Shawn Willden shawn-tahoe at willden.org
Thu Feb 26 10:54:53 PST 2009


On Thursday 26 February 2009 10:56:02 am zooko wrote:
> Strategy 2.c.  If it fails, encode the bytes in some magical way that
> a later utf-8 decoding of them will get the same bytes back.

I don't think there can be any such magical encoding, and this isn't what the 
KDE folks do.

For it to work, if U represents the Unicode space, B represents the byte 
space, D:B->U is the UTF-8 decoding function, and g:U->B is the destination 
file system encoding function, then you need a function f:B->B such that:

		g(D(f(name_bytes))) = name_bytes

But given that g is unknown, what can you choose for f that will always work?  
In many cases there simply isn't a Unicode string which g will encode into 
the byte string you want.
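
A rough way to see this, assuming Python 3 and taking g to be one of three 
well-behaved codecs (the helper name "reachable" is just made up for 
illustration): a byte string can only come out of g if it decodes under that 
codec and encodes back to itself.

```python
def reachable(name_bytes, fs_encoding):
    """True if some Unicode string encodes to name_bytes under fs_encoding.

    This check is only claimed to be adequate for utf-8, ascii and latin-1.
    """
    try:
        return name_bytes.decode(fs_encoding).encode(fs_encoding) == name_bytes
    except UnicodeDecodeError:
        # Nothing decodes to these bytes, so nothing encodes to them either.
        return False


name_bytes = b"\xff"
for fs_encoding in ("utf-8", "ascii", "latin-1"):
    print(fs_encoding, reachable(name_bytes, fs_encoding))
```

Whether the target bytes are reachable depends entirely on which g you get, 
which is the problem when g is unknown.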

What Brian said the KDE guys do is encode such names into an unused region of 
the Unicode space.  Then they provide a special 'g' that recognizes 
characters from that unused region and acts appropriately.  Essentially, 
they're encoding the "this is invalid even though it looks valid" flag into 
the Unicode.
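
Here is a minimal sketch of that idea in Python 3. I don't know which region 
KDE actually uses, so PUA_BASE (the start of the Private Use Area) and the 
names "pua_escape", decode_name and encode_name are all assumptions of mine, 
not their implementation. Each undecodable byte is mapped to one character in 
the reserved region, and the special 'g' (encode_name here) maps those 
characters back to raw bytes.

```python
import codecs

PUA_BASE = 0xE000  # assumed region; not necessarily what KDE uses


def _escape_to_pua(err):
    # Error handler: turn each undecodable byte into one reserved character.
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join(chr(PUA_BASE + b) for b in bad), err.end


codecs.register_error("pua_escape", _escape_to_pua)


def decode_name(name_bytes):
    """UTF-8 decode, hiding invalid bytes in the reserved region."""
    return name_bytes.decode("utf-8", errors="pua_escape")


def encode_name(name):
    """The special 'g': reserved characters become raw bytes again."""
    out = bytearray()
    for ch in name:
        cp = ord(ch)
        if PUA_BASE <= cp <= PUA_BASE + 0xFF:
            out.append(cp - PUA_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


raw = b"caf\xe9-\xff.txt"  # not valid UTF-8
assert encode_name(decode_name(raw)) == raw
```

Note the limitation: a name that legitimately contains a character from the 
reserved region will not round-trip, which is why the region has to be one 
that is really unused. Python 3's built-in "surrogateescape" error handler 
does the same kind of thing using lone surrogates instead.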

My code, BTW, uses something essentially equivalent to 2.a -- though I don't 
have the interoperability constraints, since no one is using my code, not 
even me :-)

Oh, a nice way to handle raw strings and still pass them through code that 
expects Unicode is (as suggested by Kevin Reid) to take the undecodable bytes 
and decode them with the latin1 codec, since any byte string is valid latin1.
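
Roughly like this, assuming Python 3 (to_text is a made-up helper name, not 
anything from Tahoe):

```python
def to_text(name_bytes):
    """Decode as UTF-8 if possible, otherwise fall back to latin-1."""
    try:
        return name_bytes.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps every byte 0x00-0xFF to a code point, so this
        # can never fail.
        return name_bytes.decode("latin-1")


raw = b"caf\xe9.txt"  # valid latin-1, invalid UTF-8
text = to_text(raw)

# Encoding with latin-1 gives the original bytes back.
assert text.encode("latin-1") == raw
```

The catch is that the resulting string doesn't record which codec produced 
it, so the caller has to remember that separately in order to get the bytes 
back.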

	Shawn

