
Issue 607229


Issue metadata

Status: Duplicate
Merged: issue 522396
Owner:
Closed: May 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Linux
Pri: 1
Type: Bug

Blocking:
issue 606337




Steps timing out mysteriously on new Linux Intel bots running Xenial

Project Member Reported by kbr@chromium.org, Apr 27 2016

Issue description

These two bots were just upgraded to Linux Xenial to pick up newer graphics drivers:

https://build.chromium.org/p/chromium.gpu.fyi/builders/Linux%20Release%20(New%20Intel)
https://build.chromium.org/p/chromium.gpu.fyi/builders/Linux%20Debug%20(New%20Intel)

(build72-b1 and build73-b1)

Steps are timing out mysteriously. Here are two builds that show strange behavior:

https://build.chromium.org/p/chromium.gpu.fyi/builders/Linux%20Release%20%28New%20Intel%29/builds/1208
https://build.chromium.org/p/chromium.gpu.fyi/builders/Linux%20Debug%20%28New%20Intel%29/builds/1251

The end of the test output shows that the tests completed, but something in the harness script then failed:

[       OK ] GpuProcess.readback_webgl_gpu_process (422 ms)
[  PASSED  ] 14 tests.

(INFO) 2016-04-27 09:23:12,651 atexit_with_log._wrapped_function:10  Try running <function _ListAllSubprocesses at 0x7fe8648c9488>
(INFO) 2016-04-27 09:23:12,666 atexit_with_log._wrapped_function:12  Did run <function _ListAllSubprocesses at 0x7fe8648c9488>
Running ['/usr/bin/python', '../../content/test/gpu/run_gpu_test.py', 'gpu_process', '--show-stdout', '--browser=release', '-v', '--extra-browser-args=--enable-logging=stderr --js-flags=--expose-gc', '--output-dir', '/tmp/isolated_tmpyYceAr/tmpa2ISn5telemetry', '--output-format=json']
Command ['/usr/bin/python', '../../content/test/gpu/run_gpu_test.py', 'gpu_process', '--show-stdout', '--browser=release', '-v', '--extra-browser-args=--enable-logging=stderr --js-flags=--expose-gc', '--output-dir', '/tmp/isolated_tmpyYceAr/tmpa2ISn5telemetry', '--output-format=json'] returned exit code 0
Additional test environment:
    CHROME_DEVEL_SANDBOX=/opt/chromium/chrome_sandbox
    LANG=en_US.UTF-8
Command: /usr/bin/python ../../testing/scripts/run_telemetry_benchmark_as_googletest.py ../../content/test/gpu/run_gpu_test.py gpu_process --show-stdout --browser=release -v --extra-browser-args=--enable-logging=stderr --js-flags=--expose-gc --isolated-script-test-output /tmp/tmpdXO4qL.json

10800 2016-04-27 16:23:12.704 I: Waiting for proces exit
10800 2016-04-27 16:23:12.704 I: Profiling: Section RunTest took 59.469 seconds
10800 2016-04-27 16:23:12.704 I: Command finished with exit code 0 (0x0)
10800 2016-04-27 16:23:12.704 I: rmtree(/tmp/isolated_rundjwsMa)
10800 2016-04-27 16:23:12.970 I: rmtree(/tmp/isolated_tmpyYceAr)
10800 2016-04-27 16:23:12.970 I: rmtree(/tmp/isolated_out5ZTBEC)
10800 2016-04-27 16:23:12.981 I: Result:
{"exit_code":0,"had_hard_timeout":false,"internal_failure":null,"outputs_ref":null,"version":2}
10800 2016-04-27 16:23:12.981 I: Waiting for all threads to die...
10800 2016-04-27 16:23:12.981 I: Done.


--------------------------------------------------------------------------------
started: Wed Apr 27 09:22:12 2016
ended: Wed Apr 27 10:03:13 2016
duration: 41 mins, 0 secs
status: FAILURE
status reason: return code was -1.
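
To make the gap concrete, the timestamps above can be compared directly. The following is a minimal sketch, not part of the bot tooling: the helper name `idle_after_command` is hypothetical, and it assumes the "RunTest took 59.469 seconds" figure covers essentially all of the useful work in the step, so the result is approximate.

```python
from datetime import datetime, timedelta

# Format of the "started:" / "ended:" lines in the buildbot step footer.
FMT = "%a %b %d %H:%M:%S %Y"


def idle_after_command(started, ended, run_seconds):
    """Return roughly how long the step stayed open after the command finished."""
    step = datetime.strptime(ended, FMT) - datetime.strptime(started, FMT)
    return step - timedelta(seconds=run_seconds)


# Values copied from the log above.
idle = idle_after_command(
    "Wed Apr 27 09:22:12 2016",
    "Wed Apr 27 10:03:13 2016",
    59.469,
)
print(idle)
```

By this estimate the step sat idle for roughly 40 minutes after run_isolated.py logged "Done." with exit code 0, and only then was it reported with return code -1.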

I think the problem's in src/tools/swarming_client/run_isolated.py. M-A, do you think you could please take a look? These bots were reliable running Wily, but are flaky running Xenial. It's urgent to get them back to stability. Thanks.

 
Attachments:
stdout-1.txt (136 KB)
stdout-2.txt (73.2 KB)

Comment 1 by benhenry@google.com, Apr 27 2016

Labels: -Infra-Swarming

Comment 2 by kbr@chromium.org, Apr 28 2016

Note: this is still happening. Please investigate the behavior on build73-b1 first:
https://build.chromium.org/p/chromium.gpu.fyi/builders/Linux%20Debug%20(New%20Intel)

build72-b1 appears to be misconfigured per Issue 607541.

Comment 3 by kbr@chromium.org, Apr 28 2016

Note: Issue 607541 was a red herring. Debugging these problems either on build72-b1 or build73-b1 should be fine.

Comment 4 by kbr@chromium.org, May 5 2016

Cc: -vadimsh@chromium.org iannucci@chromium.org pschmidt@chromium.org
Components: -Infra>Platform>Swarming Infra>Platform>Recipes
Owner: martiniss@chromium.org
After more thought -- these bots aren't using Swarming.

There must be something going wrong in the internals of the recipe engine. Here is one example of a step that failed:

https://build.chromium.org/p/chromium.gpu.fyi/builders/Linux%20Release%20%28New%20Intel%29/builds/1438

[       OK ] ScreenshotSync.WithCanvas (14158 ms)
[ RUN      ] ScreenshotSync.WithDivs
[       OK ] ScreenshotSync.WithDivs (12999 ms)
(WARNING) 2016-05-05 10:19:51,640 desktop_browser_backend.Close:527  Failed to gracefully shutdown.
(WARNING) 2016-05-05 10:19:51,640 desktop_browser_backend.Close:531  Proceed to kill the browser.
[  PASSED  ] 2 tests.

(INFO) 2016-05-05 10:19:51,644 atexit_with_log._wrapped_function:10  Try running <function _ListAllSubprocesses at 0x7f015f3d1758>
(ERROR) 2016-05-05 10:19:51,667 ps_util._ListAllSubprocesses:89  Telemetry leaks these processes: chrome (21785) - [''], python (21887) - ['']
(INFO) 2016-05-05 10:19:51,667 atexit_with_log._wrapped_function:12  Did run <function _ListAllSubprocesses at 0x7f015f3d1758>
Running ['/usr/bin/python', '../../content/test/gpu/run_gpu_test.py', 'screenshot_sync', '--show-stdout', '--browser=release', '-v', '--extra-browser-args=--enable-logging=stderr --js-flags=--expose-gc', '--output-dir', '/tmp/isolated_tmpCTprv_/tmpCDCsBwtelemetry', '--output-format=json']
Command ['/usr/bin/python', '../../content/test/gpu/run_gpu_test.py', 'screenshot_sync', '--show-stdout', '--browser=release', '-v', '--extra-browser-args=--enable-logging=stderr --js-flags=--expose-gc', '--output-dir', '/tmp/isolated_tmpCTprv_/tmpCDCsBwtelemetry', '--output-format=json'] returned exit code 0
Additional test environment:
    CHROME_DEVEL_SANDBOX=/opt/chromium/chrome_sandbox
    LANG=en_US.UTF-8
Command: /usr/bin/python ../../testing/scripts/run_telemetry_benchmark_as_googletest.py ../../content/test/gpu/run_gpu_test.py screenshot_sync --show-stdout --browser=release -v --extra-browser-args=--enable-logging=stderr --js-flags=--expose-gc --isolated-script-test-output /tmp/tmp1Lv7eB.json

21599 2016-05-05 17:19:51.709 I: Waiting for proces exit
21599 2016-05-05 17:19:51.709 I: Profiling: Section RunTest took 40.477 seconds
21599 2016-05-05 17:19:51.709 I: Command finished with exit code 0 (0x0)
21599 2016-05-05 17:19:51.709 I: rmtree(/tmp/isolated_run51Y9J1)
21599 2016-05-05 17:19:51.974 I: rmtree(/tmp/isolated_tmpCTprv_)
21599 2016-05-05 17:19:51.975 I: rmtree(/tmp/isolated_outxJxVtZ)
21599 2016-05-05 17:19:51.984 I: Result:
{"exit_code":0,"had_hard_timeout":false,"internal_failure":null,"outputs_ref":null,"version":2}
21599 2016-05-05 17:19:51.984 I: Waiting for all threads to die...
21599 2016-05-05 17:19:51.984 I: Done.


--------------------------------------------------------------------------------
started: Thu May  5 10:19:09 2016
ended: Thu May  5 10:59:52 2016
duration: 40 mins, 42 secs
status: FAILURE
status reason: return code was -1.


Stephen, could you please help? These bots were stable when they were running Wily and are now unstable running Xenial. The upgrade was needed in order to run the latest version of Intel's graphics driver.

Cc: friedman@chromium.org
I'll take a look.

cc-ing friedman who's been doing some work on xenial lately
friedman says he doesn't see anything obvious. 
Status: Started (was: Assigned)
Just to confirm, the error

[1:1:0505/092241:ERROR:resource_bundle.cc(754)] Failed to load /tmp/isolated_runl3R7ze/out/Release/chrome_material_100_percent.pak


is probably not related, right?

Comment 8 by kbr@chromium.org, May 5 2016

It's a good question, and ideally the warning would be fixed by fixing Chromium's isolates, but it's not related. This job:

https://build.chromium.org/p/chromium.gpu.fyi/builders/Linux%20Release%20%28NVIDIA%29/builds/39665/steps/screenshot_sync_tests%20on%20NVIDIA%20GPU%20on%20Linux%20on%20Linux/logs/stdio

has the same warning but the exit code is 0. (That job's also swarmed, but this shouldn't be related, I assume.)

Cc: dpranke@chromium.org estaab@chromium.org tansell@chromium.org
I caught it in a timeout about an hour ago. I wasn't able to pinpoint exactly what was wrong, but I did capture some data about it.

Here's the output of strace while it's running:
sudo strace -p 3180
strace: Process 3180 attached
futex(0x1ce7460, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, ffffffff^Cstrace: Process 3180 detached
 <detached ...>

I've also attached the output of 'ps auxf' while the process was running.
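
A futex wait like the one strace shows only says that a thread is blocked on a lock or condition; it does not say which Python code is waiting. If the hung process is a Python program that can be modified, a sketch along these lines dumps the stack of every live thread. This is an illustration only, not the method used in this investigation: the function name `format_thread_stacks` and the "stuck-worker" thread are hypothetical, and it relies on the standard `sys._current_frames()` call.

```python
import sys
import threading
import traceback


def format_thread_stacks():
    """Return a text dump of the current stack of every live thread."""
    names = {t.ident: t.name for t in threading.enumerate()}
    chunks = []
    for ident, frame in sys._current_frames().items():
        chunks.append("Thread %s (%s):\n" % (names.get(ident, "?"), ident))
        chunks.append("".join(traceback.format_stack(frame)))
    return "".join(chunks)


if __name__ == "__main__":
    # Hypothetical stand-in for a stuck worker: it blocks on an Event
    # that is never set, which on Linux shows up as a futex wait.
    never_set = threading.Event()
    worker = threading.Thread(target=never_set.wait, name="stuck-worker")
    worker.daemon = True
    worker.start()
    print(format_thread_stacks())
```

In a real process the dump would typically be triggered from a signal handler or a debug hook, so that it can be requested while the process is hung.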

We've run into something similar to this on the Windows layout_tests (crbug.com/522396), which we recently discovered to be a problem with the recipe engine and how it launches some threads. cc-ing relevant people.
Attachment: martiniss_psauxf_4 (24.7 KB)
cc-ing iannucci as well

I'm going to wait until the bot hangs again, and try diagnosing more then.
Mergedinto: 522396
Status: Duplicate (was: Started)
I've confirmed that this is the same thing as bug 522396. Merging into that bug.
Owner: iannucci@chromium.org
Status: Started (was: Duplicate)
I'm not willing to say that this bug and 522396 are the same issue, so I'm unmerging this bug.

The test behaviour of https://codereview.chromium.org/1959563002 does not seem to be the same behaviour as seen with the Windows issue.

The fix however might end up being the same.
Comments 184 through 232 in https://bugs.chromium.org/p/chromium/issues/detail?id=522396 actually refer to this bug and not the Windows issue.
Why don't you think these are the same issue? It looked like the same thing to me: I was able to produce stack traces on the Linux machines similar to the ones you produced in bug 522396, so I figured it was the same problem.

Ah ok, just read the update on the other bug.

Comment 16 by kbr@chromium.org, May 16 2016

These bots are running reliably now. Closing as fixed.

Comment 17 by kbr@chromium.org, May 16 2016

Status: Fixed (was: Started)

Comment 18 by kbr@chromium.org, May 16 2016

Status: Duplicate (was: Fixed)
Rather, dup'ing into Issue 522396.
