
Issue 809196


Issue metadata

Status: Fixed
Closed: Feb 2018
Pri: 1
Type: Bug




A bunch of quarantined bots with "Failed to call hook get_state(): 'ascii' codec can't encode characters"

Project Member Reported by vadimsh@chromium.org, Feb 5 2018

Issue description

There's a bunch of Linux bots in the luci.chromium.try pool stuck with:

Failed to call hook get_state(): 'ascii' codec can't encode characters in position 108-132: ordinal not in range(128)
/../../.cipd/pkgs/infra_python_linux-amd64-ubuntu14_04_tap59FFGuW/_current/ENV/lib/python2.7/os.py", line 294, in walk
for x in walk(new_path, topdown, onerror, followlinks):
File "/opt/infra-bot-setup/infra-python/ENV/bin/../../.cipd/pkgs/infra_python_linux-amd64-ubuntu14_04_tap59FFGuW/_current/ENV/lib/python2.7/os.py", line 294, in walk
for x in walk(new_path, topdown, onerror, followlinks):
File "/opt/infra-bot-setup/infra-python/ENV/bin/../../.cipd/pkgs/infra_python_linux-amd64-ubuntu14_04_tap59FFGuW/_current/ENV/lib/python2.7/os.py", line 294, in walk
for x in walk(new_path, topdown, onerror, followlinks):
File "/opt/infra-bot-setup/infra-python/ENV/bin/../../.cipd/pkgs/infra_python_linux-amd64-ubuntu14_04_tap59FFGuW/_current/ENV/lib/python2.7/os.py", line 294, in walk
for x in walk(new_path, topdown, onerror, followlinks):
File "/opt/infra-bot-setup/infra-python/ENV/bin/../../.cipd/pkgs/infra_python_linux-amd64-ubuntu14_04_tap59FFGuW/_current/ENV/lib/python2.7/os.py", line 294, in walk
for x in walk(new_path, topdown, onerror, followlinks):
File "/opt/infra-bot-setup/infra-python/ENV/bin/../../.cipd/pkgs/infra_python_linux-amd64-ubuntu14_04_tap59FFGuW/_current/ENV/lib/python2.7/os.py", line 294, in walk
for x in walk(new_path, topdown, onerror, followlinks):
File "/opt/infra-bot-setup/infra-python/ENV/bin/../../.cipd/pkgs/infra_python_linux-amd64-ubuntu14_04_tap59FFGuW/_current/ENV/lib/python2.7/os.py", line 294, in walk
for x in walk(new_path, topdown, onerror, followlinks):
File "/opt/infra-bot-setup/infra-python/ENV/bin/../../.cipd/pkgs/infra_python_linux-amd64-ubuntu14_04_tap59FFGuW/_current/ENV/lib/python2.7/os.py", line 284, in walk
if isdir(join(top, name)):
File "/opt/infra-bot-setup/infra-python/ENV/bin/../../.cipd/pkgs/infra_python_linux-amd64-ubuntu14_04_tap59FFGuW/_current/ENV/lib/python2.7/genericpath.py", line 41, in isdir
st = os.stat(s)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 108-132: ordinal not in range(128)
Calling stack:
0 /b/swarming/swarming_bot.2.zip/api/bot.py:207:post_error()
1 /b/swarming/swarming_bot.2.zip/bot_code/bot_main.py:295:_call_hook_safe()
2 /b/swarming/swarming_bot.2.zip/bot_code/bot_main.py:350:_get_state()
3 /b/swarming/swarming_bot.2.zip/bot_code/bot_main.py:1041:_run_bot_inner()
4 /b/swarming/swarming_bot.2.zip/bot_code/bot_main.py:944:_run_bot()
5 /b/swarming/swarming_bot.2.zip/bot_code/bot_main.py:1326:main()
6 /b/swarming/swarming_bot.2.zip/__main__.py:166:CMDstart_bot()
7 /b/swarming/swarming_bot.2.zip/__main__.py:252:main()
8 /b/swarming/swarming_bot.2.zip/__main__.py:264:<module>()
9 /usr/lib/python2.7/runpy.py:72:_run_code()
10 /usr/lib/python2.7/runpy.py:162:_run_module_as_main()


These ones:
https://chromium-swarm.appspot.com/bot?id=swarm691-c4&sort_stats=total%3Adesc
https://chromium-swarm.appspot.com/bot?id=swarm909-c4&sort_stats=total%3Adesc
https://chromium-swarm.appspot.com/bot?id=swarm982-c4&sort_stats=total%3Adesc
https://chromium-swarm.appspot.com/bot?id=swarm983-c4&sort_stats=total%3Adesc

I'm curious to see what's causing this, since these bots run only recipes and recipes aren't expected to do any crazy stuff to bots.
 
I can't reproduce it manually, but I noticed a few other issues:

1) get_recursive_size follows symlinks, which is incorrect for its purpose. It also exits with -1 if symlinks are broken (and we do have broken symlinks in vpython named caches).
2) get_recursive_size is actually rather slow when evaluating the size of a named cache that holds a chromium checkout: there are thousands of files. Doing it on each poll may not be the best approach.
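Both points above can be illustrated with a rough sketch. This is not the actual os_utilities.py code: size_without_symlinks, CachedSize and ttl_seconds are hypothetical names made up for illustration. It assumes that skipping symlinks entirely is acceptable when measuring a cache directory, and that a slightly stale size is fine between polls.

```python
import os
import time


def size_without_symlinks(root):
    """Sums the sizes of regular files under root, ignoring all symlinks."""
    total = 0
    # followlinks=False keeps os.walk from descending into symlinked dirs.
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Symlinks to files (including broken ones) show up in filenames,
            # so they are filtered out explicitly.
            if not os.path.islink(path):
                total += os.lstat(path).st_size
    return total


class CachedSize(object):
    """Recomputes the size at most once per ttl_seconds (hypothetical helper)."""

    def __init__(self, root, ttl_seconds=600, clock=time.time):
        self._root = root
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._computed_at = None

    def get(self):
        now = self._clock()
        if self._computed_at is None or now - self._computed_at >= self._ttl:
            self._value = size_without_symlinks(self._root)
            self._computed_at = now
        return self._value
```

With something like this, each poll would call get() and only pay for the full directory walk once per TTL window. The 600-second default is an arbitrary placeholder, not a value taken from the bot code.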
get_recursive_size fails on this path:
/b/swarming/c/W8/cast_shell_linux/src/third_party/WebKit/LayoutTests/http/tests/local/fileapi/resources/file-for-drag-to-send3-ABC~‾¥≈¤・・•∙·☼★星🌟星★☼·∙•・・¤≈¥‾~XYZ.txt

But I'm not sure why it is using ASCII for file system paths :( The default encoding is supposed to be utf-8 (and it works just fine in this case).
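One possible explanation, not confirmed anywhere in this thread: in Python 2.7 on Linux, a unicode path passed to os.stat is encoded with sys.getfilesystemencoding(), which is derived from the process locale rather than from sys.getdefaultencoding(). A bot process started with a C/POSIX locale could therefore end up with an ASCII filesystem encoding even if an interactive shell on the same machine uses UTF-8. A small diagnostic like the one below (describe_path_encoding is a hypothetical helper name) could be run in both environments to compare:

```python
import locale
import sys


def describe_path_encoding():
    """Returns the encodings that may affect how unicode paths are handled."""
    return {
        # Used to encode unicode paths before they reach the OS.
        'filesystem_encoding': sys.getfilesystemencoding(),
        # Used for implicit str/unicode conversions in Python 2.
        'default_encoding': sys.getdefaultencoding(),
        # What the current locale settings (LANG, LC_ALL, ...) ask for.
        'preferred_locale_encoding': locale.getpreferredencoding(False),
    }


if __name__ == '__main__':
    for key, value in sorted(describe_path_encoding().items()):
        print('%s: %s' % (key, value))
```

If the filesystem encoding reported under the bot's environment differs from the one reported in a manual session, that would be consistent with the failure not reproducing by hand.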

A bandaid would be to catch UnicodeEncodeError and give up.
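A minimal sketch of that bandaid, assuming the caller treats -1 as "size unknown". size_or_minus_one and compute_size are hypothetical names; this is not the code that eventually landed.

```python
import logging


def size_or_minus_one(compute_size, path):
    """Calls compute_size(path), returning -1 instead of raising on bad paths.

    compute_size is any callable that walks `path` and returns a byte count.
    """
    try:
        return compute_size(path)
    except UnicodeEncodeError:
        # Give up on this directory rather than letting the exception
        # propagate out of the hook and quarantine the bot.
        logging.exception('Failed to compute the size of %r', path)
        return -1
```

The trade-off is that a single unencodable file name hides the size of the whole directory tree, but the bot keeps polling instead of quarantining itself.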

Comment 3 by bugdroid1@chromium.org, Feb 6 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/ca8e4ec9caf1c31e1b2f944fa600eb40a7d6ef75

commit ca8e4ec9caf1c31e1b2f944fa600eb40a7d6ef75
Author: Vadim Shtayura <vadimsh@chromium.org>
Date: Tue Feb 06 13:49:08 2018

Make get_recursive_size skip symlinks, don't crash on unicode paths.

It is not clear why it doesn't handle unicode paths, but it is better to
return -1 rather than crash. Crashing in get_recursive_size causes the bot to
quarantine itself, which is bad.

Also add dumb typo-catching test.

R=maruel@chromium.org, iannucci@chromium.org
BUG= 809196 

Change-Id: I6a26bcc6a2fc4651596473da4c8377a41fe25896
Reviewed-on: https://chromium-review.googlesource.com/902814
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Robbie Iannucci <iannucci@chromium.org>

[modify] https://crrev.com/ca8e4ec9caf1c31e1b2f944fa600eb40a7d6ef75/appengine/swarming/swarming_bot/api/os_utilities.py
[modify] https://crrev.com/ca8e4ec9caf1c31e1b2f944fa600eb40a7d6ef75/appengine/swarming/swarming_bot/api/os_utilities_test.py

Comment 4 by athom@google.com, Feb 6 2018

Labels: -Pri-3 Pri-1
Almost all Dart Linux bots have quarantined themselves because of that error:
https://chromium-swarm.appspot.com/botlist?c=id&c=os&c=task&c=status&c=pool&f=pool%3Aluci.dart.try&l=100&s=id%3Aasc

I'm raising the priority because our CQ will soon be out of capacity if this continues.
Woah, I didn't notice it was so severe for Dart bots. We have a potential fix; I'll deploy it to staging now, and a bit later to chromium-swarm.

Comment 6 by athom@google.com, Feb 6 2018

Thanks!
Status: Fixed (was: Assigned)
Looks like the fix worked.
