#565 closed defect (fixed)

unicode arguments on the command-line

Reported by: zooko
Owned by: davidsarah
Priority: major
Milestone: 1.8β
Component: code-frontend-cli
Version: 1.2.0
Keywords: unicode windows
Cc:
Launchpad Bug:

Description

How do we know what encoding was used to encode the filenames or other arguments that are passed in via Python 2's sys.argv? If we don't know, do we assume that it is utf-8, thus making it incompatible with platforms that don't encode arguments with utf-8? Or do we leave it undecoded, thus making it impossible to correctly inspect the string for the presence of '/' chars?
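
To make the trade-off concrete, here is a minimal sketch of what goes wrong when the assumed encoding does not match the one the caller actually used. It is written for Python 3, where bytes and text are distinct types, rather than the Python 2 this ticket targets. The helper name `try_decode` and the sample name (borrowed from the example in comment:1) are illustrative assumptions, not Tahoe-LAFS code.

```python
# Illustrative only: one file name, encoded two different ways.
NAME = "\u00e4rtonwall"
LATIN1_BYTES = NAME.encode("latin-1")  # what a Latin-1 terminal would pass
UTF8_BYTES = NAME.encode("utf-8")      # what a UTF-8 terminal would pass


def try_decode(raw, encoding):
    """Return raw decoded with encoding, or None if the bytes are invalid."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    # Right guess: the original name comes back.
    print(try_decode(UTF8_BYTES, "utf-8") == NAME)
    # Wrong guess (assume UTF-8, got Latin-1): decoding fails outright.
    print(try_decode(LATIN1_BYTES, "utf-8"))
    # Wrong guess (assume Latin-1, got UTF-8): decoding "succeeds" but yields
    # a different string, which is harder to detect.
    print(try_decode(UTF8_BYTES, "latin-1") == NAME)
```

Assuming UTF-8 at least fails loudly on Latin-1 input, whereas assuming a single-byte encoding silently produces the wrong text.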

Attachments (1)

back-out-windows-specific-unicode-argv.dpatch (46.7 KB) - added by davidsarah at 2010-06-09T00:20:35Z.
Back out Windows-specific Unicode argument support for v1.7.

Change History (18)

comment:1 Changed at 2008-12-28T00:43:18Z by francois

As a data point, here's how it is handled in Python 3.0.


Some system APIs like os.environ and sys.argv can also present problems when the bytes made available by the system is not interpretable using the default encoding. Setting the LANG variable and rerunning the program is probably the best approach.


Source: What's new in Python 3.0

$ LANG=en_US.UTF-8 python3.0 -c "import sys; print(sys.argv[1])" ärtonwall
ärtonwall
$ LANG=C python3.0 -c "import sys; print(sys.argv[1])" ärtonwall
Could not convert argument 3 to string

We should probably implement something that works in a similar way for Python 2.

comment:2 Changed at 2009-12-07T04:57:53Z by davidsarah

  • Keywords windows added

[Windows-only]

http://bugs.python.org/issue2128 suggests that on Python 2.6.x for Windows, any non-ASCII characters will have been irretrievably mangled to question-marks in sys.argv. Unfortunately win32api.GetCommandLine seems to call GetCommandLineA, not GetCommandLineW. The bzr project solved this problem by using ctypes to call GetCommandLineW: https://bugs.launchpad.net/bzr/+bug/375934 . (bzr is GPL'd, so we can use that code.)

Note that this would require passing the correct unicode argv into twisted.python.usage.Options.parseOptions from source:src/allmydata/scripts/runner.py , i.e. change source:windows/tahoe.py to do

argv = get_cmdline_unicode()  # from bzr patch
rc = runner(argv[1:], install_node_control=False)
sys.exit(rc)

(assuming that twisted.python.usage.Options handles Unicode correctly, which I haven't tested).

comment:3 Changed at 2010-02-02T00:15:29Z by davidsarah

  • Milestone changed from undecided to 1.7.0

Needed for #534 which has milestone 1.7.0.

comment:4 Changed at 2010-04-05T23:42:25Z by francois

  • Owner set to francois

comment:5 Changed at 2010-04-30T18:24:04Z by davidsarah

Here's some code to get Unicode argv that should work on both Windows (including cygwin) and Unix. On Unix, it assumes that arguments are encoded according to the current locale encoding (or UTF-8 if that could not be determined by Python).

import sys, locale

if sys.platform == "win32":
    from ctypes import WINFUNCTYPE, POINTER, byref, c_wchar_p, c_int, windll
    def get_unicode_argv():
        # Fetch the wide-character command line from Windows instead of using sys.argv.
        GetCommandLineW = WINFUNCTYPE(c_wchar_p)(("GetCommandLineW", windll.kernel32))
        CommandLineToArgvW = WINFUNCTYPE(POINTER(c_wchar_p), c_wchar_p, POINTER(c_int)) \
          (("CommandLineToArgvW", windll.shell32))
        argc = c_int(0)
        argv = CommandLineToArgvW(GetCommandLineW(), byref(argc))
        # Skip index 0; this assumes the process was started as "python script.py args...".
        return [argv[i] for i in xrange(1, argc.value)]
else:
    def get_unicode_argv():
        encoding = locale.getpreferredencoding()
        if not encoding:
            encoding = "utf-8"
        # This throws UnicodeError if any argument cannot be decoded.
        return [arg.decode(encoding, 'strict') for arg in sys.argv]

print get_unicode_argv()
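
The Unix branch above decodes strictly, so a single undecodable argument aborts the whole call with a bare UnicodeError. A minimal sketch of a variant that reports which argument failed, similar in spirit to the Python 3.0 message quoted in comment:1. It uses Python 3 syntax with bytes standing in for Python 2 `str`, and `decode_argv` is a hypothetical helper, not something from the patch.

```python
def decode_argv(raw_args, encoding):
    """Decode every byte-string argument strictly, naming the first failure."""
    decoded = []
    for index, raw in enumerate(raw_args):
        try:
            decoded.append(raw.decode(encoding, "strict"))
        except UnicodeDecodeError as exc:
            # Re-raise with the argument position so the user can see
            # which argument did not match the expected encoding.
            raise ValueError(
                "could not decode argument %d as %s: %s" % (index, encoding, exc)
            )
    return decoded


if __name__ == "__main__":
    print(decode_argv([b"tahoe", b"ls"], "utf-8"))
```

Taking the encoding as a parameter, instead of calling `locale.getpreferredencoding()` inside the function, keeps the sketch testable without changing the process locale.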

comment:6 Changed at 2010-05-11T15:53:36Z by zooko

I really want to see this patch in trunk in the next 48 hours for Tahoe-LAFS v1.7, but I can't contribute to it myself right now.

comment:7 Changed at 2010-05-11T19:28:57Z by davidsarah

  • Owner changed from francois to davidsarah
  • Status changed from new to assigned

comment:8 Changed at 2010-05-11T19:29:35Z by davidsarah

  • Keywords review-needed added

comment:9 Changed at 2010-06-08T08:57:35Z by davidsarah

  • Keywords review-needed removed

Getting this working on Windows is more difficult than I thought. I have successfully got it to work by hacking the setuptools-generated entry script like this:

# EASY-INSTALL-ENTRY-SCRIPT: 'allmydata-tahoe==1.6.1-r4452','console_scripts','tahoe'
__requires__ = 'allmydata-tahoe==1.6.1-r4452'
import sys
from pkg_resources import load_entry_point

### start extra code
from ctypes import WINFUNCTYPE, POINTER, byref, c_wchar_p, c_int, windll

GetCommandLineW = WINFUNCTYPE(c_wchar_p)(("GetCommandLineW", windll.kernel32))
CommandLineToArgvW = WINFUNCTYPE(POINTER(c_wchar_p), c_wchar_p, POINTER(c_int)) \
                         (("CommandLineToArgvW", windll.shell32))

argc = c_int(0)
argv = CommandLineToArgvW(GetCommandLineW(), byref(argc))
sys.argv = [argv[i].encode('utf-8') for i in xrange(1, argc.value)]
### end extra code

sys.exit(
   load_entry_point('allmydata-tahoe==1.6.1-r4452', 'console_scripts', 'tahoe')()
)

but only by invoking this script directly from the command line, not via the tahoe.exe wrapper. The latter mangles the arguments beyond hope of recovery.
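
The extra code re-encodes the wide-character arguments as UTF-8 so that the rest of the program can keep treating sys.argv as a list of byte strings. That only works if whatever reads sys.argv later decodes it as UTF-8 rather than with the locale encoding, which is an assumption here and not something the comment states. A minimal sketch of that round trip, in Python 3 syntax and with hypothetical helper names (neither function exists in Tahoe-LAFS):

```python
def encode_argv_utf8(unicode_args):
    """Turn text arguments into UTF-8 byte strings, as the entry-script hack does."""
    return [arg.encode("utf-8") for arg in unicode_args]


def decode_argv_utf8(byte_args):
    """Recover the text arguments; must use UTF-8, not the locale encoding."""
    return [arg.decode("utf-8") for arg in byte_args]


if __name__ == "__main__":
    original = ["ls", "\u00e4rtonwall"]
    print(decode_argv_utf8(encode_argv_utf8(original)) == original)
```

UTF-8 can represent every Unicode code point that a well-formed command line contains, so the round trip is lossless as long as both ends agree on it.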

comment:10 Changed at 2010-06-08T18:25:38Z by davidsarah

It isn't necessary for the extra code to be in the entry script; it could be in source:allmydata/scripts/runner.py . However, Zooko and I decided that changing how the CLI entry works on Windows would be too disruptive for 1.7, so we're dropping support for Unicode args on Windows until the next release.

This ticket is fixed for other platforms in 1.7.

Changed at 2010-06-09T00:20:35Z by davidsarah

Back out Windows-specific Unicode argument support for v1.7.

comment:11 Changed at 2010-06-09T00:21:06Z by davidsarah

  • Keywords review-needed added
  • Owner changed from davidsarah to zooko
  • Status changed from assigned to new

comment:12 Changed at 2010-06-09T02:28:57Z by zooko

  • Keywords reviewed added; review-needed removed
  • Owner changed from zooko to davidsarah

The patch looks correct.

comment:13 Changed at 2010-06-12T20:48:23Z by davidsarah

  • Keywords reviewed removed
  • Milestone changed from 1.7.0 to 1.7.1
  • Status changed from new to assigned

back-out-windows-specific-unicode-argv.dpatch was applied in 32d9deace3d82637.

See #1074 for a patch that reenables Unicode argument support on Windows (but requires further discussion and refinement).

comment:14 Changed at 2010-07-14T02:44:22Z by davidsarah

The #1074 patch is now finished.

comment:15 Changed at 2010-07-17T03:50:28Z by davidsarah

  • Milestone changed from 1.7.1 to 1.8β

comment:16 Changed at 2010-08-02T07:23:26Z by david-sarah@…

In [4627/ticket798]:

Bundle setuptools-0.6c16dev (with Windows script changes, and the change to only warn if site.py wasn't generated by setuptools) instead of 0.6c15dev. addresses #565, #1073, #1074

comment:17 Changed at 2010-08-08T00:37:52Z by davidsarah

  • Resolution set to fixed
  • Status changed from assigned to closed

Fixed; see ticket:1074#comment:29 for changesets.
