system_health.common_Desktop timing out on mac air 10.11
Issue description
,
Jul 28 2017
The swarming data on that task showed a duration of "1h 9m 10s". The IO timeout for it was 1h, so that's why swarming gave up. According to the task logs, though, the task itself took only about 9 minutes.

The first log line in the task is:
(WARNING) 2017-07-28 02:51:08,525 desktop_browser_finder.FindAllAvailableBrowsers:171 Chrome build location for mac_x86_64 not found. Browser will be run without Flash.

The last log line with a timestamp is:
(INFO) 2017-07-28 03:00:14,050 cloud_storage.Insert:377 Uploading /b/s/w/it46CDtp/tmp1l6XaB.png to gs://chrome-telemetry-output/profiler-file-id_3-2017-07-28_03-00-147786.png

So that's about 9 minutes, if I'm reading the timestamps correctly. Not sure what's happening here, then. maruel@, is it possible swarming is hanging after the task finishes?
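As a quick check on the "about 9 minutes" reading, the two timestamps quoted above can be subtracted directly. This is only a sketch: the helper name `elapsed` and the format string are assumptions made for illustration, inferred from how the log lines look, and are not part of Telemetry or swarming.

```python
from datetime import datetime

# Assumed layout of the timestamps in the task log, e.g.
# "2017-07-28 02:51:08,525" (comma before the milliseconds).
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed(first, last):
    """Return the time between two log timestamps as a timedelta."""
    start = datetime.strptime(first, LOG_TIME_FORMAT)
    end = datetime.strptime(last, LOG_TIME_FORMAT)
    return end - start


# First and last timestamped log lines quoted in the comment above.
delta = elapsed("2017-07-28 02:51:08,525", "2017-07-28 03:00:14,050")
print(delta)
```

The difference comes out to a little over nine minutes, which leaves roughly an hour of the "1h 9m 10s" reported by swarming unaccounted for by the timestamped log output.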
,
Jul 28 2017
,
Jul 28 2017
Zombie process?
,
Jul 28 2017
How does swarming decide the task is finished? Does it care about processes spawned by the task it starts at the beginning of the task, or does it just wait for the task it starts to finish?
,
Jul 28 2017
Never mind, the task is hung. Compare https://chromium-swarm.appspot.com/task?id=37a3a4dc1a632c10&refresh=10&show_raw=1 (currently running) with https://chromium-swarm.appspot.com/task?id=379ff08e11853610&refresh=10&show_raw=1 (exited). There are some lines at the bottom of the exited one that are missing from the currently running one.
,
Jul 28 2017
I ssh-ed onto the bot and did `kill -2` on the process, which sent a keyboard interrupt on https://chromium-swarm.appspot.com/task?id=37a3a4dc1a632c10&refresh=10&show_raw=1, while the task was hung. This is what was in the log:

Exception KeyboardInterrupt in <module 'threading' from '/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/threading.pyc'> ignored
(CRITICAL) 2017-07-28 14:51:08,289 battor_wrapper.KillBattOrShell:191 BattOr shell was not properly closed. Killing now.
(WARNING) 2017-07-28 14:51:08,295 ps_util._ListAllSubprocesses:86 psutil.AccessDenied (pid=405, name='battor_agent')
(WARNING) 2017-07-28 14:51:08,295 ps_util._ListAllSubprocesses:89 Telemetry leaks these processes: battor_agent (405)
Running ['/usr/bin/python', '../../tools/perf/run_benchmark', 'battor.steady_state', '-v', '--upload-results', '--output-format=chartjson', '--browser=reference', '--output-trace-tag=_ref', '--output-dir', '/b/s/w/itDlc7im/tmpUvOu6htelemetry', '--output-format=json'] in None (env: {'VERSIONER_PYTHON_PREFER_32_BIT': 'no', 'LOGNAME': 'chrome-bot', 'USER': 'chrome-bot', 'PATH': '/opt/local/bin:/opt/local/sbin:/usr/local/sbin:/usr/local/git/bin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin', 'BOTO_CONFIG': '/Users/chrome-bot/.boto', 'HOME': '/Users/chrome-bot', 'SWARMING_BOT_ID': 'build127-b1', 'LANG': 'en_US.UTF-8', 'Apple_PubSub_Socket_Render': '/private/tmp/com.apple.launchd.D3Ob2W0dnw/Render', 'SWARMING_SERVER': 'https://chromium-swarm.appspot.com', 'VERSIONER_PYTHON_VERSION': '2.7', 'CHROME_DEVEL_SANDBOX': '/opt/chromium/chrome_sandbox', 'XPC_FLAGS': '0x0', 'SWARMING_HEADLESS': '1', 'LUCI_CONTEXT': '/b/s/w/luci_ctx.4pZhxx.json', 'XPC_SERVICE_NAME': 'org.swarm.bot.plist', 'SSH_AUTH_SOCK': '/private/tmp/com.apple.launchd.r5FFeYhO4X/Listeners', 'SWARMING_TASK_ID': '37a3a4dc1a632c11', 'SHELL': '/bin/bash', 'NO_GCE_CHECK': 'False', 'TMPDIR': '/b/s/w/itDlc7im', 'GIT_USER_AGENT': 'git/2.7.4 darwin build127-b1.labs.chromium.org', '__CF_USER_TEXT_ENCODING': '0x1F4:0x0:0x0'})
Command ['/usr/bin/python', '../../tools/perf/run_benchmark', 'battor.steady_state', '-v', '--upload-results', '--output-format=chartjson', '--browser=reference', '--output-trace-tag=_ref', '--output-dir', '/b/s/w/itDlc7im/tmpUvOu6htelemetry', '--output-format=json'] returned exit code 255

Doesn't look very illuminating. I was looking for a traceback to figure out where we're hung. My only guess is that some battor shell stuff is broken, only because there are battor issues in the bot log itself. cc-ing battor people.
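For what it's worth, one general way to get a traceback out of a hung Python process is to register a signal handler that prints the stack of every thread. The sketch below is not something Telemetry or swarming is known to do; the function names and the choice of SIGUSR1 are assumptions made for illustration, and it uses only standard-library calls (`sys._current_frames`, `traceback.format_stack`, `signal.signal`).

```python
import signal
import sys
import threading
import traceback


def format_all_thread_stacks():
    """Return the current stack of every live thread as one string."""
    names = {t.ident: t.name for t in threading.enumerate()}
    lines = []
    for ident, frame in sys._current_frames().items():
        lines.append("Thread %s (%s):" % (ident, names.get(ident, "unknown")))
        lines.extend(entry.rstrip("\n") for entry in traceback.format_stack(frame))
    return "\n".join(lines)


def dump_stacks_handler(signum, frame):
    """Signal handler: write all thread stacks to stderr."""
    sys.stderr.write(format_all_thread_stacks() + "\n")


def install_stack_dump_handler(signum=signal.SIGUSR1):
    """Register the handler; must be called from the main thread."""
    signal.signal(signum, dump_stacks_handler)
```

With this installed early in the process, `kill -USR1 <pid>` from the ssh session would print the stacks without interrupting the task the way `kill -2` does. One caveat: Python only runs signal handlers in the main thread once it gets back to executing bytecode, so if the main thread is blocked in an uninterruptible wait the dump may not appear until that wait returns.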
,
Jul 28 2017
The task does reach this line (https://cs.chromium.org/chromium/src/third_party/catapult/telemetry/telemetry/internal/story_runner.py?q=internal/story_runn&sq=package:chromium&l=373). So it gets there, and then returns, and from my understanding of the code, there shouldn't be anything to stop it from exiting, except for one thing. That is https://cs.chromium.org/chromium/src/third_party/catapult/telemetry/telemetry/internal/results/page_test_results.py?sq=package:chromium&l=295, the cleanup function. I'm not exactly sure what this all does; a cursory inspection made it look like it's trying to delete files.
,
Jul 28 2017
Also, the bot rebooted before the task mentioned in #7 happened. So doesn't look like a reboot will solve this :/
,
Jul 30
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue. Sorry for the inconvenience if the bug really should have been left as Available. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Jul 30
Reopen if relevant.
Comment 1 by rnep...@chromium.org
, Jul 28 2017