Steps timing out mysteriously on new Linux Intel bots running Xenial
Issue description

These two bots were just upgraded to Linux Xenial to pick up newer graphics drivers:

https://build.chromium.org/p/chromium.gpu.fyi/builders/Linux%20Release%20(New%20Intel)
https://build.chromium.org/p/chromium.gpu.fyi/builders/Linux%20Debug%20(New%20Intel)

(build72-b1 and build73-b1)

Steps are timing out mysteriously. Here are two builds that show the strange behavior:

https://build.chromium.org/p/chromium.gpu.fyi/builders/Linux%20Release%20%28New%20Intel%29/builds/1208
https://build.chromium.org/p/chromium.gpu.fyi/builders/Linux%20Debug%20%28New%20Intel%29/builds/1251

The end of the test output shows that the tests completed, but something in the harness script failed:

[ OK ] GpuProcess.readback_webgl_gpu_process (422 ms)
[ PASSED ] 14 tests.
(INFO) 2016-04-27 09:23:12,651 atexit_with_log._wrapped_function:10 Try running <function _ListAllSubprocesses at 0x7fe8648c9488>
(INFO) 2016-04-27 09:23:12,666 atexit_with_log._wrapped_function:12 Did run <function _ListAllSubprocesses at 0x7fe8648c9488>
Running ['/usr/bin/python', '../../content/test/gpu/run_gpu_test.py', 'gpu_process', '--show-stdout', '--browser=release', '-v', '--extra-browser-args=--enable-logging=stderr --js-flags=--expose-gc', '--output-dir', '/tmp/isolated_tmpyYceAr/tmpa2ISn5telemetry', '--output-format=json']
Command ['/usr/bin/python', '../../content/test/gpu/run_gpu_test.py', 'gpu_process', '--show-stdout', '--browser=release', '-v', '--extra-browser-args=--enable-logging=stderr --js-flags=--expose-gc', '--output-dir', '/tmp/isolated_tmpyYceAr/tmpa2ISn5telemetry', '--output-format=json'] returned exit code 0
Additional test environment:
CHROME_DEVEL_SANDBOX=/opt/chromium/chrome_sandbox
LANG=en_US.UTF-8
Command: /usr/bin/python ../../testing/scripts/run_telemetry_benchmark_as_googletest.py ../../content/test/gpu/run_gpu_test.py gpu_process --show-stdout --browser=release -v --extra-browser-args=--enable-logging=stderr --js-flags=--expose-gc --isolated-script-test-output /tmp/tmpdXO4qL.json
10800 2016-04-27 16:23:12.704 I: Waiting for proces exit
10800 2016-04-27 16:23:12.704 I: Profiling: Section RunTest took 59.469 seconds
10800 2016-04-27 16:23:12.704 I: Command finished with exit code 0 (0x0)
10800 2016-04-27 16:23:12.704 I: rmtree(/tmp/isolated_rundjwsMa)
10800 2016-04-27 16:23:12.970 I: rmtree(/tmp/isolated_tmpyYceAr)
10800 2016-04-27 16:23:12.970 I: rmtree(/tmp/isolated_out5ZTBEC)
10800 2016-04-27 16:23:12.981 I: Result: {"exit_code":0,"had_hard_timeout":false,"internal_failure":null,"outputs_ref":null,"version":2}
10800 2016-04-27 16:23:12.981 I: Waiting for all threads to die...
10800 2016-04-27 16:23:12.981 I: Done.
--------------------------------------------------------------------------------
started: Wed Apr 27 09:22:12 2016
ended: Wed Apr 27 10:03:13 2016
duration: 41 mins, 0 secs
status: FAILURE
status reason: return code was -1.

I think the problem is in src/tools/swarming_client/run_isolated.py. M-A, do you think you could please take a look? These bots were reliable running Wily, but are flaky running Xenial. It's urgent to get them back to stability. Thanks.
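For context on the failure shape: the wrapped command exits 0 and the wrapper prints "Done.", yet the step keeps running for roughly 40 minutes and is finally reported as "return code was -1", which typically means the wrapper process itself was killed rather than exiting on its own. A minimal sketch of one way that can happen in a Python harness, assuming a stray non-daemon thread in the wrapper (the names below are illustrative, not the actual run_isolated.py or recipe engine code):

    import subprocess
    import threading

    def forgotten_worker():
        # Stands in for a helper thread the harness started and never
        # joined or marked as a daemon; it just blocks forever.
        threading.Event().wait()

    def run_step(cmd):
        # The wrapped command itself succeeds.
        proc = subprocess.Popen(cmd)
        rc = proc.wait()
        print('Command finished with exit code %d' % rc)
        return rc

    if __name__ == '__main__':
        t = threading.Thread(target=forgotten_worker)  # daemon=False by default
        t.start()
        run_step(['true'])   # exits 0 almost immediately on Linux
        print('Done.')       # the last line the step log ever shows
        # The interpreter now waits for the non-daemon thread to finish, so
        # the wrapper never exits; an outer timeout eventually kills it and
        # the step is reported as a failure despite the exit code 0 above.

In this toy version, joining the thread (or setting t.daemon = True before t.start()) lets the process exit as soon as run_step returns; the real question is which thread in the harness stack is doing the equivalent of forgotten_worker.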
Apr 28 2016
Note: this is still happening. Please investigate the behavior on build73-b1 first:

https://build.chromium.org/p/chromium.gpu.fyi/builders/Linux%20Debug%20(New%20Intel)

build72-b1 appears to be misconfigured per Issue 607541.
Apr 28 2016
Note: Issue 607541 was a red herring. Debugging these problems either on build72-b1 or build73-b1 should be fine.
May 5 2016
After more thought -- these bots aren't using Swarming. There must be something going wrong in the internals of the recipe engine. Here is one example of a step that failed:

https://build.chromium.org/p/chromium.gpu.fyi/builders/Linux%20Release%20%28New%20Intel%29/builds/1438

[ OK ] ScreenshotSync.WithCanvas (14158 ms)
[ RUN ] ScreenshotSync.WithDivs
[ OK ] ScreenshotSync.WithDivs (12999 ms)
(WARNING) 2016-05-05 10:19:51,640 desktop_browser_backend.Close:527 Failed to gracefully shutdown.
(WARNING) 2016-05-05 10:19:51,640 desktop_browser_backend.Close:531 Proceed to kill the browser.
[ PASSED ] 2 tests.
(INFO) 2016-05-05 10:19:51,644 atexit_with_log._wrapped_function:10 Try running <function _ListAllSubprocesses at 0x7f015f3d1758>
(ERROR) 2016-05-05 10:19:51,667 ps_util._ListAllSubprocesses:89 Telemetry leaks these processes: chrome (21785) - [''], python (21887) - ['']
(INFO) 2016-05-05 10:19:51,667 atexit_with_log._wrapped_function:12 Did run <function _ListAllSubprocesses at 0x7f015f3d1758>
Running ['/usr/bin/python', '../../content/test/gpu/run_gpu_test.py', 'screenshot_sync', '--show-stdout', '--browser=release', '-v', '--extra-browser-args=--enable-logging=stderr --js-flags=--expose-gc', '--output-dir', '/tmp/isolated_tmpCTprv_/tmpCDCsBwtelemetry', '--output-format=json']
Command ['/usr/bin/python', '../../content/test/gpu/run_gpu_test.py', 'screenshot_sync', '--show-stdout', '--browser=release', '-v', '--extra-browser-args=--enable-logging=stderr --js-flags=--expose-gc', '--output-dir', '/tmp/isolated_tmpCTprv_/tmpCDCsBwtelemetry', '--output-format=json'] returned exit code 0
Additional test environment:
CHROME_DEVEL_SANDBOX=/opt/chromium/chrome_sandbox
LANG=en_US.UTF-8
Command: /usr/bin/python ../../testing/scripts/run_telemetry_benchmark_as_googletest.py ../../content/test/gpu/run_gpu_test.py screenshot_sync --show-stdout --browser=release -v --extra-browser-args=--enable-logging=stderr --js-flags=--expose-gc --isolated-script-test-output /tmp/tmp1Lv7eB.json
21599 2016-05-05 17:19:51.709 I: Waiting for proces exit
21599 2016-05-05 17:19:51.709 I: Profiling: Section RunTest took 40.477 seconds
21599 2016-05-05 17:19:51.709 I: Command finished with exit code 0 (0x0)
21599 2016-05-05 17:19:51.709 I: rmtree(/tmp/isolated_run51Y9J1)
21599 2016-05-05 17:19:51.974 I: rmtree(/tmp/isolated_tmpCTprv_)
21599 2016-05-05 17:19:51.975 I: rmtree(/tmp/isolated_outxJxVtZ)
21599 2016-05-05 17:19:51.984 I: Result: {"exit_code":0,"had_hard_timeout":false,"internal_failure":null,"outputs_ref":null,"version":2}
21599 2016-05-05 17:19:51.984 I: Waiting for all threads to die...
21599 2016-05-05 17:19:51.984 I: Done.
--------------------------------------------------------------------------------
started: Thu May 5 10:19:09 2016
ended: Thu May 5 10:59:52 2016
duration: 40 mins, 42 secs
status: FAILURE
status reason: return code was -1.

Stephen, could you please help? These bots were stable when they were running Wily and are now unstable running Xenial. The upgrade was needed in order to run the latest version of Intel's graphics driver.
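Incidentally, the (ERROR) line above is Telemetry's atexit hook flagging a chrome process and a python process that were still alive after the tests passed. As a rough sketch of what that kind of check does (this is not the actual catapult ps_util code; the helper name and formatting below are made up for illustration), a harness can enumerate its surviving children with psutil at shutdown:

    import atexit
    import psutil

    def _list_leaked_subprocesses():
        # Enumerate every child of the harness that is still running when
        # the interpreter shuts down and log it as a potential leak.
        for child in psutil.Process().children(recursive=True):
            try:
                print('Leaked process: %s (%d)' % (child.name(), child.pid))
            except psutil.NoSuchProcess:
                # The child exited between enumeration and inspection.
                pass

    atexit.register(_list_leaked_subprocesses)

A leak like the python (21887) process in that log is worth noting because a leftover child that keeps the step's stdout/stderr pipes open can also make a build step appear to hang after the main command has exited.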
May 5 2016
I'll take a look. cc-ing friedman, who's been doing some work on Xenial lately.
May 5 2016
friedman says he doesn't see anything obvious.
May 5 2016
Just to confirm: this error is probably not related, right?

[1:1:0505/092241:ERROR:resource_bundle.cc(754)] Failed to load /tmp/isolated_runl3R7ze/out/Release/chrome_material_100_percent.pak
May 5 2016
It's a good question, and ideally the warning would be fixed by fixing Chromium's isolates, but it's not related. This job has the same warning but an exit code of 0:

https://build.chromium.org/p/chromium.gpu.fyi/builders/Linux%20Release%20%28NVIDIA%29/builds/39665/steps/screenshot_sync_tests%20on%20NVIDIA%20GPU%20on%20Linux%20on%20Linux/logs/stdio

(That job is also swarmed, but I assume that isn't related.)
May 5 2016
I caught it in a timeout about an hour ago. I wasn't able to pinpoint exactly what was wrong, but I did capture some data about it. Here's the output of strace while it was hung:

sudo strace -p 3180
strace: Process 3180 attached
futex(0x1ce7460, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff^Cstrace: Process 3180 detached
 <detached ...>

I've also attached the output of 'ps auxf' while the process was running. We've run into something similar to this on Windows layout_tests ( crbug.com/522396 ), which we recently discovered to be a problem with the recipe engine and how it launches some threads. cc-ing relevant people.
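The futex call above only shows that the process is blocked on a lock at the libc level; it doesn't say which Python thread is holding or waiting on it. One option for getting Python-level stacks out of the wrapper the next time it wedges (just a suggestion, not something that was done here) is to register faulthandler on a spare signal before the run and then signal the stuck process. faulthandler is in the standard library on Python 3 and available as a backport package for the Python 2.7 these scripts run under.

    import faulthandler
    import signal

    # Early in the harness's startup (assumption: we can patch the script
    # under investigation). SIGUSR1 is an arbitrary choice of spare signal.
    faulthandler.register(signal.SIGUSR1, all_threads=True)

    # Later, from a shell on the bot, dump every thread's Python stack of
    # the hung wrapper (PID 3180 in the strace output above):
    #
    #   kill -USR1 3180
    #
    # The tracebacks are written to the process's stderr, which the build
    # step log captures.

Attaching gdb with the CPython gdb helpers (the py-bt command) can give similar information without patching the script, provided the python debug symbols are installed on the bot.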
May 5 2016
cc-ing iannucci as well. I'm going to wait until the bot hangs again and try diagnosing more then.
May 6 2016
I've confirmed that this is the same thing as bug 522396. Merging into that bug.
May 10 2016
I'm not willing to say that this bug and 522396 are the same issue, so I'm unmerging this bug. The test behaviour of https://codereview.chromium.org/1959563002 does not seem to be the same behaviour as seen with the Windows issue. The fix, however, might end up being the same.
May 10 2016
Comments 184 through 232 in https://bugs.chromium.org/p/chromium/issues/detail?id=522396 actually refer to this bug, not the Windows issue.
May 10 2016
Why don't you think these are the same issue? It looked like the same thing to me: I was able to produce stack traces on the Linux machines similar to what you produced in bug 522396, so I figured it was the same issue.
May 10 2016
Ah, OK, I just read the update on the other bug.
May 16 2016
These bots are running reliably now. Closing as fixed.