Issue 664295

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: All
Pri: 3
Type: Bug



We should make failures due to swarming timeouts more obvious

Project Member Reported by dpranke@chromium.org, Nov 10 2016

Issue description

(I'm filing this as a result of some of the failures I've seen while sheriffing today ...)

If a step times out because of a timeout set via swarming, it's not at all obvious that that's what happened.

Example:

https://build.chromium.org/p/chromium.linux/builders/Android%20Tests%20%28dbg%29/builds/37148

The actual red step simply says that 30 tests failed, but not why.

If you click through to the log, you can see a bunch of messages like:

C  119.806s Main  [TIMEOUT] CoreTest.Basic:
C  119.806s Main    Suite execution terminated, probably due to swarming timeout.
C  119.806s Main    Your test may not have run.

but that's not terribly definitive.

In other cases, e.g.:

https://build.chromium.org/p/chromium.android/builders/Android%20N5X%20Swarm%20Builder/builds/6192/steps/telemetry_perf_unittests%20on%20Android/logs/stdio

you might just get obscure errors from logcat.

If you click back up to the "trigger" step and click over to the task summary, however, you can see that we get "State: Execution timed out".

We should make that very clear as part of the step that is actually marked as red. There are too many hoops to jump through.
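As a rough sketch of the idea (not actual recipe code): whatever collects the task result could check the task state first and lead the red step's text with the timeout. The function name `summarize_step`, the `state` key, and the state strings below are all assumptions made for illustration, not a confirmed swarming schema.

```python
# Hypothetical sketch: the "state" key and the state strings are assumptions
# for illustration, not a confirmed swarming result schema.
TIMEOUT_STATES = {"TIMED_OUT", "Execution timed out"}


def summarize_step(task_summary, failed_tests):
    """Return the text to show on the step for a finished task."""
    state = task_summary.get("state", "")
    if state in TIMEOUT_STATES:
        # Lead with the timeout so nobody has to click through to the
        # trigger step and the task summary to find it.
        return ("SWARMING TIMEOUT (task state: %s); %d tests did not complete"
                % (state, len(failed_tests)))
    if failed_tests:
        return "%d tests failed" % len(failed_tests)
    return "passed"
```

With something like this, the red step for the first example would start with the timeout rather than only reporting a count of failed tests.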

In particular, since the timeout is set in the build configuration (the //testing/buildbot files) and not passed in to the test runner (and hence does not appear in the command lines), it may not be obvious at all to most sheriffs what is going on.

Thoughts on how to make this better?
 
See also bug 664211 (which is where the second log comes from).
Note that the "Suite execution terminated..." language in the log was my attempt to make this better from the client side. The language is wishy-washy because all the test runner knows is that it received a SIGTERM, not necessarily that swarming sent it due to a timeout.
Right. It would be much better if the swarming logic in the recipe did the right thing, since it should know what actually happened. We shouldn't have client-side logic for this at all ...
Cc: -andyb...@chromium.org

Comment 6 by stip@chromium.org, Feb 10 2017

Cc: -stip@chromium.org
Components: Infra>Client>Chrome

Comment 8 by mar...@chromium.org, Jul 11 2017

Status: Available (was: Untriaged)
Related: none of the Chromium Python scripts handle SIGTERM properly. We need a complete overhaul there. I've done it in run_isolated, so it's doable. We need to make it as easy (ideally trivial) as possible.
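For illustration only, here is a minimal sketch of the kind of helper that could make this close to trivial for a script. It is not what run_isolated does; the names `Terminated` and `run_with_sigterm_handling` are hypothetical. It turns SIGTERM into an exception in the main thread, so the script can clean up and report that it was terminated instead of dying silently.

```python
import signal


class Terminated(Exception):
    """Raised in the main thread when the process receives SIGTERM."""


def _on_sigterm(signum, frame):
    # Python runs signal handlers in the main thread, so raising here
    # unwinds whatever the main thread was doing.
    raise Terminated("received signal %d" % signum)


def run_with_sigterm_handling(work):
    """Run work() and return (result, terminated).

    Hypothetical helper: if SIGTERM arrives while work() is running,
    returns (None, True) so the caller can report it explicitly.
    Must be called from the main thread.
    """
    previous = signal.signal(signal.SIGTERM, _on_sigterm)
    try:
        return work(), False
    except Terminated:
        return None, True
    finally:
        # Restore the earlier handler; it is None if it was not
        # installed from Python, in which case leave ours in place.
        if previous is not None:
            signal.signal(signal.SIGTERM, previous)
```

Note that, as mentioned earlier in the thread, the script still only knows that it received a SIGTERM, not why it was sent.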

Comment 9 by mar...@chromium.org, Jun 21 2018

Cc: -phajdan.jr@chromium.org bpastene@chromium.org
Components: -Infra>Platform>Swarming Infra>Platform>Swarming>Admin
Ben worked on improving the SIGTERM handling as part of issue 733612.
Cc: -iannucci@chromium.org iannu...@google.com
