New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 787531 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Dec 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux , Windows
Pri: 2
Type: Bug



Sign in to add a comment

Swarming tasks crash when input filenames contain non-ASCII characters

Project Member Reported by robertma@chromium.org, Nov 21 2017

Issue description

This bug report is mostly summarized from bsittler@ 's emails, along with some of my own investigation.

In order to test the behavior of uploading files with non-ASCII names, bsittler@ tried to add a resource file for layout tests, named as:
third_party/WebKit/LayoutTests/http/tests/local/resources/file-for-drag-to-send3-ABC~¤•★星🌟星★•¤~XYZ.txt
which caused the swarming tasks for layout tests to crash on both Linux and Windows.

Both failures are in the early stage (fetch_isolated) manipulating the input filenames, but in slightly different places.

# Linux failure

Task: https://chromium-swarm.appspot.com/task?id=39f872de03bf3d10
Crash point: https://cs.chromium.org/chromium/infra/luci/client/isolateserver.py?rcl=9761e1747f2a09c67176ffdafb11303ce0632404&l=155

The root cause seems to be that the bots are using ASCII (or 'C') locale, in which case `sys.getfilesystemencoding()` returns things like "ANSI_X3.4-1968", which cannot decode UTF-8 encoded filenames of course. The right fix is probably to use some UTF-8 locale (e.g. "en_US.UTF-8"), because other OS-related stuff might also break if we have non-ASCII filenames in ASCII locale.

# Window failure

Task: https://chromium-swarm.appspot.com/task?id=39bd29d3e5d08710
Crash point: https://cs.chromium.org/chromium/infra/luci/client/isolateserver.py?q=isolateserver&sq=package:chromium&l=1707

This might be related to tarball extraction. bsittler@ believes "it's dying here in os.path.join on Windows because non-ASCII PAX entries are returned with Unicode string names on Windows rather than e.g. UTF-8 (byte)strings as seen on Linux and Mac OS X - is the answer to just wrap basedir in unicode()". But I'm less sure about the right solution in this case.
 

Comment 1 by mar...@chromium.org, Nov 21 2017

Labels: -Pri-3 OS-Linux OS-Windows Pri-2
Thanks for the report. This should definitely be fixed.

Comment 2 by mar...@chromium.org, Nov 29 2017

Cc: -tansell@chromium.org
Owner: mar...@chromium.org
Status: Started (was: Available)
Started working on it. Taking the occasion to spread piles of poo around.
Thank you! I'm looking forward to dramatically simplifying some tests once this lands, and happy to help with testing, code reviews, etc.

(Also, we seriously need Unicode emoji reaction icons for crbug comments! 👍 💩)
> (Also, we seriously need Unicode emoji reaction icons for crbug comments! 👍 💩)

As a side note, if we put more emojis in all sorts of places, we will find more Unicode bugs :)

And thanks!

Comment 5 by mar...@chromium.org, Nov 29 2017

In fairness, emoji* support in monorail is fairly recent. :)

* namely non-BMP unicode characters.
Project Member

Comment 6 by bugdroid1@chromium.org, Nov 30 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/dcf287d892186f4adcc1d0e5787cc451f881c1b3

commit dcf287d892186f4adcc1d0e5787cc451f881c1b3
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Thu Nov 30 00:03:17 2017

subprocess42: be more tolerant to unicode.

Make it automatically encode unicode strings to str as UTF-8. This simplifies
everyone's life. Asserting that all call sites are using str would be painful
and labour intensive so it's not worth.

R=vadimsh@chromium.org
Bug:  787531 
Change-Id: Ib683dbccad98d4d57f1ebd5bd627d4b1874062b2
Reviewed-on: https://chromium-review.googlesource.com/798150
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Vadim Shtayura <vadimsh@chromium.org>

[modify] https://crrev.com/dcf287d892186f4adcc1d0e5787cc451f881c1b3/client/utils/subprocess42.py

Project Member

Comment 7 by bugdroid1@chromium.org, Nov 30 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/1e5342e46f500e24ccd14f286a9032695502cf96

commit 1e5342e46f500e24ccd14f286a9032695502cf96
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Thu Nov 30 23:01:19 2017

swarming: add emojis to the smoke test.

- Add emojis in isolated file name, in the file content, in environment
  variable, in stdout, in the command line.
- Rename many 'hello_world' variables to 'content' and other minor refactoring.

No functional change.

Don't get too excited, there are still problems, but there used to be more.
In particular, this required to use the 'fs' module instead of open(), os, os.path
and shutil.

R=vadimsh@chromium.org
Bug:  787531 
Change-Id: Iac60f82e1b4736abc20abfa390960dc9dd252cc7
Reviewed-on: https://chromium-review.googlesource.com/798892
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Vadim Shtayura <vadimsh@chromium.org>

[modify] https://crrev.com/1e5342e46f500e24ccd14f286a9032695502cf96/appengine/swarming/local_smoke_test.py

Thanks for working on this! FYI, I have an in-progress change https://crrev.com/c/802554 which should speed up some slow Windows layout tests once this is fixed
FWIW Mac tryjobs seem to actually work already - so the remaining parts of this bug may be OS-specific. See http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/602360 for a working Mac run; the corresponding Linux http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/599827 and Windows http://build.chromium.org/p/tryserver.chromium.win/builders/win7_chromium_rel_ng/builds/56570 runs failed, apparently both on filename encoding problems.
In case it's helpful, here are some test characters my in-progress change attempts to exercise (in the sequence ABC~‾¥≈¤・・•∙·☼★星🌟星★☼·∙•・・¤≈¥‾~XYZ.)

Rationale for this particular test character sequence, which is
used in filenames and also in file contents:

- ABC~ ensures the string starts with something we can read to
  ensure it is from the correct source; ~ is used because even
  some 1-byte otherwise-ASCII-like parts of ISO-2022-JP
  interpret it differently.
- ‾¥ are inside a single-byte range of ISO-2022-JP and help
  diagnose problems due to filesystem encoding or locale
- ≈ is inside IBM437 and helps diagnose problems due to filesystem
  encoding or locale
- ¤ is inside Latin-1 and helps diagnose problems due to
  filesystem encoding or locale; it is also the "simplest" case
  needing substitution in ISO-2022-JP
- ・ is inside a single-byte range of ISO-2022-JP in some variants
  and helps diagnose problems due to filesystem encoding or locale;
  on the web it is distinct when decoding but unified when encoding
- ・ is inside a double-byte range of ISO-2022-JP and helps
  diagnose problems due to filesystem encoding or locale
- • is inside Windows-1252 and helps diagnose problems due to
  filesystem encoding or locale and also ensures these aren't
  accidentally turned into e.g. control codes
- ∙ is inside IBM437 and helps diagnose problems due to filesystem
  encoding or locale
- · is inside Latin-1 and helps diagnose problems due to
  filesystem encoding or locale and also ensures HTML named
  character references (e.g. &middot;) are not used
- ☼ is inside IBM437 shadowing C0 and helps diagnose problems due to
  filesystem encoding or locale and also ensures these aren't
  accidentally turned into e.g. control codes
- ★ is inside ISO-2022-JP on a non-Kanji page and makes correct
  output easier to spot
- 星 is inside ISO-2022-JP on a Kanji page and makes correct
  output easier to spot
- 🌟 is outside the BMP and makes incorrect surrogate pair
  substitution detectable and ensures substitutions work
  correctly immediately after Kanji 2-byte ISO-2022-JP
- 星 repeated here ensures the correct codec state is used
  after a non-BMP substitution
- ★ repeated here also makes correct output easier to spot
- ☼ is inside IBM437 shadowing C0 and helps diagnose problems due to
  filesystem encoding or locale and also ensures these aren't
  accidentally turned into e.g. control codes and also ensures
  substitutions work correctly immediately after non-Kanji
  2-byte ISO-2022-JP
- · is inside Latin-1 and helps diagnose problems due to
  filesystem encoding or locale and also ensures HTML named
  character references (e.g. &middot;) are not used
- ∙ is inside IBM437 and helps diagnose problems due to filesystem
  encoding or locale
- • is inside Windows-1252 and again helps diagnose problems
  due to filesystem encoding or locale
- ・ is inside a double-byte range of ISO-2022-JP and helps
  diagnose problems due to filesystem encoding or locale
- ・ is inside a single-byte range of ISO-2022-JP in some variants
  and helps diagnose problems due to filesystem encoding or locale;
  on the web it is distinct when decoding but unified when encoding
- ¤ is inside Latin-1 and helps diagnose problems due to
  filesystem encoding or locale; again it is a "simple"
  substitution case and also ensures
- ≈ is inside IBM437 and helps diagnose problems due to filesystem
  encoding or locale
- ¥‾ are inside a single-byte range of ISO-2022-JP and help
  diagnose problems due to filesystem encoding or locale
- ~XYZ ensures earlier errors don't lead to misencoding of
  simple ASCII

Overall the near-symmetry makes common I18N mistakes like
off-by-1-after-non-BMP easier to spot. All the characters
are also allowed in Windows Unicode filenames.
Project Member

Comment 11 by bugdroid1@chromium.org, Dec 1 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/94b228dbdf2455238416378e3a238fa5c96b3986

commit 94b228dbdf2455238416378e3a238fa5c96b3986
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Fri Dec 01 23:46:19 2017

example: put tons of emojis in the examples

- This helps testing the non-BMP unicode support live manually on broader
  platforms, e.g. Windows, OSX.
- Add non-emoji example too, as the emoji one fails on Windows and I need to
  test more.

R=iannucci@chromium.org
Bug:  787531 
Change-Id: Ice22ec3746a32ddbf5a41a1cdeec01ef053a2eb2
Reviewed-on: https://chromium-review.googlesource.com/803952
Reviewed-by: Robbie Iannucci <iannucci@chromium.org>
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>

[add] https://crrev.com/94b228dbdf2455238416378e3a238fa5c96b3986/client/example/1_isolate.py
[delete] https://crrev.com/1e5342e46f500e24ccd14f286a9032695502cf96/client/example/1_isolate_server.py
[add] https://crrev.com/94b228dbdf2455238416378e3a238fa5c96b3986/client/example/2_swarming.py
[delete] https://crrev.com/1e5342e46f500e24ccd14f286a9032695502cf96/client/example/2_swarming_run.py
[delete] https://crrev.com/1e5342e46f500e24ccd14f286a9032695502cf96/client/example/3_swarming_trigger_collect.py
[delete] https://crrev.com/1e5342e46f500e24ccd14f286a9032695502cf96/client/example/4_swarming_run_manual_upload.py
[modify] https://crrev.com/94b228dbdf2455238416378e3a238fa5c96b3986/client/example/README.md
[modify] https://crrev.com/94b228dbdf2455238416378e3a238fa5c96b3986/client/example/common.py
[modify] https://crrev.com/94b228dbdf2455238416378e3a238fa5c96b3986/client/example/payload/hello_world.isolate
[modify] https://crrev.com/94b228dbdf2455238416378e3a238fa5c96b3986/client/example/payload/hello_world.py
[add] https://crrev.com/94b228dbdf2455238416378e3a238fa5c96b3986/client/example/payload/hello_🌐.isolate
[add] https://crrev.com/94b228dbdf2455238416378e3a238fa5c96b3986/client/example/payload/hello_🌐.py

Project Member

Comment 12 by bugdroid1@chromium.org, Dec 1 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/7546c137553c5d26bc003a4d35cef2f0d16d9acd

commit 7546c137553c5d26bc003a4d35cef2f0d16d9acd
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Fri Dec 01 23:48:09 2017

client: Fix emoji handling at various places

Fixes:
- Fix remaining issues in emoji handling that were found in isolated handling on
  the Swarming bots.
- Remove downgrades of --env and --env-prefix from unicode to str, now
  unnecessary since dcf287d892186f4.
- Implement fs.readlink().

Missing:
- named cache and cipd will have to be handled in a follow up (haven't confirmed
  yet if there's actually issues there).
- I'm still observing issues when executing a python script that has a utf-8
  file name on Windows, other platforms are fine. I think it's related to
  CreateProcess and how python.exe processes the arguments. Will be handled
  separately.

R=iannucci@chromium.org
Bug:  787531 
Change-Id: I6c397dc99bda2b38ee25c70647c49f04b82903d3
Reviewed-on: https://chromium-review.googlesource.com/797393
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Robbie Iannucci <iannucci@chromium.org>

[modify] https://crrev.com/7546c137553c5d26bc003a4d35cef2f0d16d9acd/client/isolated_format.py
[modify] https://crrev.com/7546c137553c5d26bc003a4d35cef2f0d16d9acd/client/isolateserver.py
[modify] https://crrev.com/7546c137553c5d26bc003a4d35cef2f0d16d9acd/client/run_isolated.py
[modify] https://crrev.com/7546c137553c5d26bc003a4d35cef2f0d16d9acd/client/utils/fs.py

I have a fix pending that will solve it for Ubuntu. The remaining one is Windows, but it seem that the tarfile code is broken, will investigate more the Go code.
I've been using https://chromium-swarm.appspot.com/task?id=3a3b60d5bee41110 for investigation.

In the isolated:
https://isolateserver.appspot.com/browse?namespace=default-gzip&hash=48dca71999d8ca0341d376220bb2f6b1936bb90e
the broken tar is ".89ad7bc16ca61e6b07cd5c50c3e551940a3d4ce9.tar". I manually downloaded it and printed the non-ascii files in it and it is not valid UTF-8.
I just realized I never specified encoding='utf-8' to the tarfile.Tarfile constructor. Testing this.
Unsurprisingly, the official doc https://docs.python.org/2/library/tarfile.html#tarfile.ENCODING was effectively wrong;
https://chromium-swarm-dev.appspot.com/task?id=3a3c19ab68e0f510
Sending second CL.
Project Member

Comment 17 by bugdroid1@chromium.org, Dec 5 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/18c44b9f39caef0fdadaccba80b0d6fa998518b5

commit 18c44b9f39caef0fdadaccba80b0d6fa998518b5
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Tue Dec 05 01:17:56 2017

client: isolated unicode fix on Ubuntu

The fix is in particular for the following case:
- Running on linux
- Python process is started with LANG unset

In this case, fix_encoding.fix_encoding() fails to fix
sys.getfilesystemencoding() to be utf-8, and is instead set to 'ascii', which
means that non-ascii file paths fails to be mapped in. Swarming bots with LANG
correctly set do not exhibit this issue.

R=vadimsh@chromium.org
Bug:  787531 
Change-Id: I7a754e1042ed0af7715e249546a7ceafaa269088
Reviewed-on: https://chromium-review.googlesource.com/807124
Reviewed-by: Vadim Shtayura <vadimsh@chromium.org>
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/18c44b9f39caef0fdadaccba80b0d6fa998518b5/client/isolateserver.py

Project Member

Comment 18 by bugdroid1@chromium.org, Dec 5 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/d863d1f29180bd8602aae83be75ed80b26e14ae7

commit d863d1f29180bd8602aae83be75ed80b26e14ae7
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Tue Dec 05 22:31:24 2017

client: isolated unicode fix for Windows

While tarfile documentation says that encoding defaults to 'utf-8' on Windows,
this is a lie. So force it and it seems to have magically fixed the problem.

R=vadimsh@chromium.org
Bug:  787531 
Change-Id: I444fbe4dd9324440290f9baf04858788fa430dad
Reviewed-on: https://chromium-review.googlesource.com/806955
Reviewed-by: Vadim Shtayura <vadimsh@chromium.org>
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/d863d1f29180bd8602aae83be75ed80b26e14ae7/client/isolateserver.py

Status: Fixed (was: Started)
Fixes all live on prod.

Sign in to add a comment