[tahoe-dev] String encoding in tahoe

Brian Warner warner-tahoe at allmydata.com
Fri Jan 2 16:36:12 PST 2009


On Mon, 22 Dec 2008 17:03:34 +0100
Francois Deppierraz <francois at ctrlaltdel.ch> wrote:

> We usually have UTF-8 bytestrings as input (sys.argv, filenames,
> aliases, etc.) and need UTF-8 bytestrings as output (urls, filenames,
> etc.). However, it is usually simpler and safer to use unicode strings
> internally.

> Should we (1) automatically convert sys.argv[] from bytestring to
> unicode in runner.runner(), or (2) do it selectively for each command
> (put, cp, etc.).
> 
> I gave a try to (1), see patch [3], which indeed fixed the test failure
> on slave3 (dapper box). However, it broke many tests at the same time,
> mostly assertions in util/base32.py which seems to require bytestrings
> instead of unicode strings.

> +++ new-tahoe/src/allmydata/scripts/runner.py   2008-12-22
> +    # Convert arguments to unicode
> +    new_argv = []
> +    for arg in argv:
> +      new_argv.append(arg.decode('utf-8'))
> +    argv = new_argv
> +


I'm not convinced that this is the best approach: there are several things in
sys.argv which are *not* filenames (like subcommand names, "--foo" argument
indicators, and non-filename argument values), and those should be left as
bytestrings. Those base32 test failures probably resulted from arguments
which are expected to be bytestrings that contain base32-encoded binary
values (like storage index strings, or human-readable filecaps), and I think
these should be left as bytestrings.

Basically, sys.argv is not just a list of unicode objects: its type depends
upon how you parse it, and a detailed specification of its type (say, if you
were trying to declare main() in OCaml or Haskell or some language with a
sophisticated type system) would include a parse tree with integers, base32
strings, keyword tokens, and unicode strings, depending upon which part of
the parse tree you were looking at.
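
As a rough illustration of that "parse tree" view, here is a minimal sketch
in Python 3 terms (bytes for bytestrings, str for unicode). The command
layout, the option names "--copies" and "--cap", and the ParsedArgs type are
all hypothetical and are not taken from tahoe's actual command line; the
point is only that each position in argv ends up with its own type, and only
the filename positions are decoded.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ParsedArgs:
    # Hypothetical result type: one field per kind of argument.
    subcommand: bytes     # keyword token, left as a bytestring
    copies: int           # integer option value
    cap: bytes            # base32-style string, left as a bytestring
    filenames: List[str]  # the only fields that are really unicode text


def parse_argv(argv: List[bytes]) -> ParsedArgs:
    """Parse raw bytestring arguments, decoding only the filenames."""
    subcommand = argv[0]
    copies = 1
    cap = b""
    filenames = []
    i = 1
    while i < len(argv):
        arg = argv[i]
        if arg == b"--copies":
            # Integer value: parse it as a number, not as text.
            copies = int(argv[i + 1].decode("ascii"))
            i += 2
        elif arg == b"--cap":
            # Base32-style value: keep the bytestring untouched.
            cap = argv[i + 1]
            i += 2
        else:
            # Positional arguments are assumed to be UTF-8 filenames.
            filenames.append(arg.decode("utf-8"))
            i += 1
    return ParsedArgs(subcommand, copies, cap, filenames)
```

With this shape, a blanket decode of every argv entry is never needed: the
parser knows which positions hold text and which do not.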

This tells me that we have to do the sys.argv conversion later, probably in
the options.Usage classes. I still like the overall "decode early, unicode
everywhere, encode late" advice, but we should only apply it to things that
actually *are* unicode. I think internal APIs should be as tightly defined as
possible: I get nervous about functions that will accept either unicode or
bytestring, since that implies that the conversion is being done by the
callee, who has less information than the caller.
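
A minimal sketch of that division of labour, again in Python 3 terms (bytes
for bytestrings, str for unicode). Both function names are hypothetical and
do not exist in tahoe; the assumption is that the caller knows which encoding
the argument arrived in, so it decodes, while the internal function accepts
exactly one type and refuses the other.

```python
def decode_filename_arg(raw, encoding="utf-8"):
    """Caller-side step: turn one raw argv bytestring into unicode text.

    The caller is the one who knows the encoding, so it is passed in here.
    """
    if not isinstance(raw, bytes):
        raise TypeError("expected a bytestring from argv")
    return raw.decode(encoding)


def to_output_name(name):
    """Internal API: accepts only unicode text, returns UTF-8 bytes.

    It deliberately does not try to guess what a bytestring might mean.
    """
    if not isinstance(name, str):
        raise TypeError("expected unicode text, got %r" % type(name))
    return name.encode("utf-8")


# Decode early (at the edge), encode late (on the way out).
name = decode_filename_arg(b"\xe9t\xe9.txt", encoding="latin-1")
encoded = to_output_name(name)
```

Passing a bytestring straight to to_output_name() raises TypeError instead of
being converted silently, which keeps the conversion decision with the
caller.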

cheers,
 -Brian

