Swarming tasks crash when input filenames contain non-ASCII characters |
|||
Issue descriptionThis bug report is mostly summarized from bsittler@ 's emails, along with some of my own investigation. In order to test the behavior of uploading files with non-ASCII names, bsittler@ tried to add a resource file for layout tests, named as: third_party/WebKit/LayoutTests/http/tests/local/resources/file-for-drag-to-send3-ABC~¤•★星🌟星★•¤~XYZ.txt which caused the swarming tasks for layout tests to crash on both Linux and Windows. Both failures are in the early stage (fetch_isolated) manipulating the input filenames, but in slightly different places. # Linux failure Task: https://chromium-swarm.appspot.com/task?id=39f872de03bf3d10 Crash point: https://cs.chromium.org/chromium/infra/luci/client/isolateserver.py?rcl=9761e1747f2a09c67176ffdafb11303ce0632404&l=155 The root cause seems to be that the bots are using ASCII (or 'C') locale, in which case `sys.getfilesystemencoding()` returns things like "ANSI_X3.4-1968", which cannot decode UTF-8 encoded filenames of course. The right fix is probably to use some UTF-8 locale (e.g. "en_US.UTF-8"), because other OS-related stuff might also break if we have non-ASCII filenames in ASCII locale. # Window failure Task: https://chromium-swarm.appspot.com/task?id=39bd29d3e5d08710 Crash point: https://cs.chromium.org/chromium/infra/luci/client/isolateserver.py?q=isolateserver&sq=package:chromium&l=1707 This might be related to tarball extraction. bsittler@ believes "it's dying here in os.path.join on Windows because non-ASCII PAX entries are returned with Unicode string names on Windows rather than e.g. UTF-8 (byte)strings as seen on Linux and Mac OS X - is the answer to just wrap basedir in unicode()". But I'm less sure about the right solution in this case.
,
Nov 29 2017
Started working on it. Taking the occasion to spread piles of poo around.
,
Nov 29 2017
Thank you! I'm looking forward to dramatically simplifying some tests once this lands, and happy to help with testing, code reviews, etc. (Also, we seriously need Unicode emoji reaction icons for crbug comments! 👍 💩)
,
Nov 29 2017
> (Also, we seriously need Unicode emoji reaction icons for crbug comments! 👍 💩) As a side note, if we put more emojis in all sorts of places, we will find more Unicode bugs :) And thanks!
,
Nov 29 2017
In fairness, emoji* support in monorail is fairly recent. :) * namely non-BMP unicode characters.
,
Nov 30 2017
The following revision refers to this bug: https://chromium.googlesource.com/infra/luci/luci-py.git/+/dcf287d892186f4adcc1d0e5787cc451f881c1b3 commit dcf287d892186f4adcc1d0e5787cc451f881c1b3 Author: Marc-Antoine Ruel <maruel@chromium.org> Date: Thu Nov 30 00:03:17 2017 subprocess42: be more tolerant to unicode. Make it automatically encode unicode strings to str as UTF-8. This simplifies everyone's life. Asserting that all call sites are using str would be painful and labour intensive so it's not worth. R=vadimsh@chromium.org Bug: 787531 Change-Id: Ib683dbccad98d4d57f1ebd5bd627d4b1874062b2 Reviewed-on: https://chromium-review.googlesource.com/798150 Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org> Reviewed-by: Vadim Shtayura <vadimsh@chromium.org> [modify] https://crrev.com/dcf287d892186f4adcc1d0e5787cc451f881c1b3/client/utils/subprocess42.py
,
Nov 30 2017
The following revision refers to this bug: https://chromium.googlesource.com/infra/luci/luci-py.git/+/1e5342e46f500e24ccd14f286a9032695502cf96 commit 1e5342e46f500e24ccd14f286a9032695502cf96 Author: Marc-Antoine Ruel <maruel@chromium.org> Date: Thu Nov 30 23:01:19 2017 swarming: add emojis to the smoke test. - Add emojis in isolated file name, in the file content, in environment variable, in stdout, in the command line. - Rename many 'hello_world' variables to 'content' and other minor refactoring. No functional change. Don't get too excited, there are still problems, but there used to be more. In particular, this required to use the 'fs' module instead of open(), os, os.path and shutil. R=vadimsh@chromium.org Bug: 787531 Change-Id: Iac60f82e1b4736abc20abfa390960dc9dd252cc7 Reviewed-on: https://chromium-review.googlesource.com/798892 Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org> Reviewed-by: Vadim Shtayura <vadimsh@chromium.org> [modify] https://crrev.com/1e5342e46f500e24ccd14f286a9032695502cf96/appengine/swarming/local_smoke_test.py
,
Dec 1 2017
Thanks for working on this! FYI, I have an in-progress change https://crrev.com/c/802554 which should speed up some slow Windows layout tests once this is fixed
,
Dec 1 2017
FWIW Mac tryjobs seem to actually work already - so the remaining parts of this bug may be OS-specific. See http://build.chromium.org/p/tryserver.chromium.mac/builders/mac_chromium_rel_ng/builds/602360 for a working Mac run; the corresponding Linux http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_rel_ng/builds/599827 and Windows http://build.chromium.org/p/tryserver.chromium.win/builders/win7_chromium_rel_ng/builds/56570 runs failed, apparently both on filename encoding problems.
,
Dec 1 2017
In case it's helpful, here are some test characters my in-progress change attempts to exercise (in the sequence ABC~‾¥≈¤・・•∙·☼★星🌟星★☼·∙•・・¤≈¥‾~XYZ.) Rationale for this particular test character sequence, which is used in filenames and also in file contents: - ABC~ ensures the string starts with something we can read to ensure it is from the correct source; ~ is used because even some 1-byte otherwise-ASCII-like parts of ISO-2022-JP interpret it differently. - ‾¥ are inside a single-byte range of ISO-2022-JP and help diagnose problems due to filesystem encoding or locale - ≈ is inside IBM437 and helps diagnose problems due to filesystem encoding or locale - ¤ is inside Latin-1 and helps diagnose problems due to filesystem encoding or locale; it is also the "simplest" case needing substitution in ISO-2022-JP - ・ is inside a single-byte range of ISO-2022-JP in some variants and helps diagnose problems due to filesystem encoding or locale; on the web it is distinct when decoding but unified when encoding - ・ is inside a double-byte range of ISO-2022-JP and helps diagnose problems due to filesystem encoding or locale - • is inside Windows-1252 and helps diagnose problems due to filesystem encoding or locale and also ensures these aren't accidentally turned into e.g. control codes - ∙ is inside IBM437 and helps diagnose problems due to filesystem encoding or locale - · is inside Latin-1 and helps diagnose problems due to filesystem encoding or locale and also ensures HTML named character references (e.g. ·) are not used - ☼ is inside IBM437 shadowing C0 and helps diagnose problems due to filesystem encoding or locale and also ensures these aren't accidentally turned into e.g. control codes - ★ is inside ISO-2022-JP on a non-Kanji page and makes correct output easier to spot - 星 is inside ISO-2022-JP on a Kanji page and makes correct output easier to spot - 🌟 is outside the BMP and makes incorrect surrogate pair substitution detectable and ensures substitutions work correctly immediately after Kanji 2-byte ISO-2022-JP - 星 repeated here ensures the correct codec state is used after a non-BMP substitution - ★ repeated here also makes correct output easier to spot - ☼ is inside IBM437 shadowing C0 and helps diagnose problems due to filesystem encoding or locale and also ensures these aren't accidentally turned into e.g. control codes and also ensures substitutions work correctly immediately after non-Kanji 2-byte ISO-2022-JP - · is inside Latin-1 and helps diagnose problems due to filesystem encoding or locale and also ensures HTML named character references (e.g. ·) are not used - ∙ is inside IBM437 and helps diagnose problems due to filesystem encoding or locale - • is inside Windows-1252 and again helps diagnose problems due to filesystem encoding or locale - ・ is inside a double-byte range of ISO-2022-JP and helps diagnose problems due to filesystem encoding or locale - ・ is inside a single-byte range of ISO-2022-JP in some variants and helps diagnose problems due to filesystem encoding or locale; on the web it is distinct when decoding but unified when encoding - ¤ is inside Latin-1 and helps diagnose problems due to filesystem encoding or locale; again it is a "simple" substitution case and also ensures - ≈ is inside IBM437 and helps diagnose problems due to filesystem encoding or locale - ¥‾ are inside a single-byte range of ISO-2022-JP and help diagnose problems due to filesystem encoding or locale - ~XYZ ensures earlier errors don't lead to misencoding of simple ASCII Overall the near-symmetry makes common I18N mistakes like off-by-1-after-non-BMP easier to spot. All the characters are also allowed in Windows Unicode filenames.
,
Dec 1 2017
The following revision refers to this bug: https://chromium.googlesource.com/infra/luci/luci-py.git/+/94b228dbdf2455238416378e3a238fa5c96b3986 commit 94b228dbdf2455238416378e3a238fa5c96b3986 Author: Marc-Antoine Ruel <maruel@chromium.org> Date: Fri Dec 01 23:46:19 2017 example: put tons of emojis in the examples - This helps testing the non-BMP unicode support live manually on broader platforms, e.g. Windows, OSX. - Add non-emoji example too, as the emoji one fails on Windows and I need to test more. R=iannucci@chromium.org Bug: 787531 Change-Id: Ice22ec3746a32ddbf5a41a1cdeec01ef053a2eb2 Reviewed-on: https://chromium-review.googlesource.com/803952 Reviewed-by: Robbie Iannucci <iannucci@chromium.org> Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org> [add] https://crrev.com/94b228dbdf2455238416378e3a238fa5c96b3986/client/example/1_isolate.py [delete] https://crrev.com/1e5342e46f500e24ccd14f286a9032695502cf96/client/example/1_isolate_server.py [add] https://crrev.com/94b228dbdf2455238416378e3a238fa5c96b3986/client/example/2_swarming.py [delete] https://crrev.com/1e5342e46f500e24ccd14f286a9032695502cf96/client/example/2_swarming_run.py [delete] https://crrev.com/1e5342e46f500e24ccd14f286a9032695502cf96/client/example/3_swarming_trigger_collect.py [delete] https://crrev.com/1e5342e46f500e24ccd14f286a9032695502cf96/client/example/4_swarming_run_manual_upload.py [modify] https://crrev.com/94b228dbdf2455238416378e3a238fa5c96b3986/client/example/README.md [modify] https://crrev.com/94b228dbdf2455238416378e3a238fa5c96b3986/client/example/common.py [modify] https://crrev.com/94b228dbdf2455238416378e3a238fa5c96b3986/client/example/payload/hello_world.isolate [modify] https://crrev.com/94b228dbdf2455238416378e3a238fa5c96b3986/client/example/payload/hello_world.py [add] https://crrev.com/94b228dbdf2455238416378e3a238fa5c96b3986/client/example/payload/hello_🌐.isolate [add] https://crrev.com/94b228dbdf2455238416378e3a238fa5c96b3986/client/example/payload/hello_🌐.py
,
Dec 1 2017
The following revision refers to this bug: https://chromium.googlesource.com/infra/luci/luci-py.git/+/7546c137553c5d26bc003a4d35cef2f0d16d9acd commit 7546c137553c5d26bc003a4d35cef2f0d16d9acd Author: Marc-Antoine Ruel <maruel@chromium.org> Date: Fri Dec 01 23:48:09 2017 client: Fix emoji handling at various places Fixes: - Fix remaining issues in emoji handling that were found in isolated handling on the Swarming bots. - Remove downgrades of --env and --env-prefix from unicode to str, now unnecessary since dcf287d892186f4. - Implement fs.readlink(). Missing: - named cache and cipd will have to be handled in a follow up (haven't confirmed yet if there's actually issues there). - I'm still observing issues when executing a python script that has a utf-8 file name on Windows, other platforms are fine. I think it's related to CreateProcess and how python.exe processes the arguments. Will be handled separately. R=iannucci@chromium.org Bug: 787531 Change-Id: I6c397dc99bda2b38ee25c70647c49f04b82903d3 Reviewed-on: https://chromium-review.googlesource.com/797393 Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org> Reviewed-by: Robbie Iannucci <iannucci@chromium.org> [modify] https://crrev.com/7546c137553c5d26bc003a4d35cef2f0d16d9acd/client/isolated_format.py [modify] https://crrev.com/7546c137553c5d26bc003a4d35cef2f0d16d9acd/client/isolateserver.py [modify] https://crrev.com/7546c137553c5d26bc003a4d35cef2f0d16d9acd/client/run_isolated.py [modify] https://crrev.com/7546c137553c5d26bc003a4d35cef2f0d16d9acd/client/utils/fs.py
,
Dec 4 2017
I have a fix pending that will solve it for Ubuntu. The remaining one is Windows, but it seem that the tarfile code is broken, will investigate more the Go code.
,
Dec 4 2017
I've been using https://chromium-swarm.appspot.com/task?id=3a3b60d5bee41110 for investigation. In the isolated: https://isolateserver.appspot.com/browse?namespace=default-gzip&hash=48dca71999d8ca0341d376220bb2f6b1936bb90e the broken tar is ".89ad7bc16ca61e6b07cd5c50c3e551940a3d4ce9.tar". I manually downloaded it and printed the non-ascii files in it and it is not valid UTF-8.
,
Dec 4 2017
I just realized I never specified encoding='utf-8' to the tarfile.Tarfile constructor. Testing this.
,
Dec 4 2017
Unsurprisingly, the official doc https://docs.python.org/2/library/tarfile.html#tarfile.ENCODING was effectively wrong; https://chromium-swarm-dev.appspot.com/task?id=3a3c19ab68e0f510 Sending second CL.
,
Dec 5 2017
The following revision refers to this bug: https://chromium.googlesource.com/infra/luci/luci-py.git/+/18c44b9f39caef0fdadaccba80b0d6fa998518b5 commit 18c44b9f39caef0fdadaccba80b0d6fa998518b5 Author: Marc-Antoine Ruel <maruel@chromium.org> Date: Tue Dec 05 01:17:56 2017 client: isolated unicode fix on Ubuntu The fix is in particular for the following case: - Running on linux - Python process is started with LANG unset In this case, fix_encoding.fix_encoding() fails to fix sys.getfilesystemencoding() to be utf-8, and is instead set to 'ascii', which means that non-ascii file paths fails to be mapped in. Swarming bots with LANG correctly set do not exhibit this issue. R=vadimsh@chromium.org Bug: 787531 Change-Id: I7a754e1042ed0af7715e249546a7ceafaa269088 Reviewed-on: https://chromium-review.googlesource.com/807124 Reviewed-by: Vadim Shtayura <vadimsh@chromium.org> Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org> [modify] https://crrev.com/18c44b9f39caef0fdadaccba80b0d6fa998518b5/client/isolateserver.py
,
Dec 5 2017
The following revision refers to this bug: https://chromium.googlesource.com/infra/luci/luci-py.git/+/d863d1f29180bd8602aae83be75ed80b26e14ae7 commit d863d1f29180bd8602aae83be75ed80b26e14ae7 Author: Marc-Antoine Ruel <maruel@chromium.org> Date: Tue Dec 05 22:31:24 2017 client: isolated unicode fix for Windows While tarfile documentation says that encoding defaults to 'utf-8' on Windows, this is a lie. So force it and it seems to have magically fixed the problem. R=vadimsh@chromium.org Bug: 787531 Change-Id: I444fbe4dd9324440290f9baf04858788fa430dad Reviewed-on: https://chromium-review.googlesource.com/806955 Reviewed-by: Vadim Shtayura <vadimsh@chromium.org> Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org> [modify] https://crrev.com/d863d1f29180bd8602aae83be75ed80b26e14ae7/client/isolateserver.py
,
Dec 5 2017
Fixes all live on prod. |
|||
►
Sign in to add a comment |
|||
Comment 1 by mar...@chromium.org
, Nov 21 2017