We should make failures due to swarming timeouts more obvious
Issue description

(I'm filing this as a result of some of the failures I've seen while sheriffing today ...)

If a step times out because of a timeout set via swarming, it's not at all obvious that that's what happened.

Example: https://build.chromium.org/p/chromium.linux/builders/Android%20Tests%20%28dbg%29/builds/37148

The actual red step simply says that 30 tests failed, but not why. If you click through to the log, you can see a bunch of messages like:

  C 119.806s Main [TIMEOUT] CoreTest.Basic:
  C 119.806s Main Suite execution terminated, probably due to swarming timeout.
  C 119.806s Main Your test may not have run.

but that's not terribly definitive. In other cases, e.g.:

https://build.chromium.org/p/chromium.android/builders/Android%20N5X%20Swarm%20Builder/builds/6192/steps/telemetry_perf_unittests%20on%20Android/logs/stdio

you might just get obscure errors from logcat.

If you click back up to the "trigger" step and click over to the task summary, however, you can see that we get "State: Execution timed out". We should make that very clear as part of the step that is actually marked as red. There are too many hoops to jump through.

In particular, since the timeout is set in the build configuration (the //testing/buildbot files) and not passed in to the test runner (and hence does not appear in the command lines), it may not be obvious at all to most sheriffs what is going on.

Thoughts on how to make this better?
Nov 10 2016
Note that the "Suite execution terminated..." language in the log was my attempt to make this better from the client side. The language is wishy-washy because all the test runner knows is that it received a SIGTERM, not necessarily that swarming sent it due to a timeout.
Nov 10 2016
Right. It would be much better if the swarming logic in the recipe did the right thing, since it should know what actually happened. We shouldn't have client-side logic for this at all ...
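To illustrate the idea of having the recipe side report the timeout directly, here is a minimal sketch of turning a task result into step text. It is not the actual recipe code: the function name `summarize_task`, the dict keys (`state`, `exit_code`, `timeout_secs`), and the state strings are all assumptions made for the example.

```python
# Hypothetical sketch: the keys and state strings below are assumptions,
# not the real recipe or swarming result format.
def summarize_task(result):
    """Return (is_failure, step_text) for one task result dict."""
    state = result.get("state", "UNKNOWN")
    if state == "TIMED_OUT":
        timeout = result.get("timeout_secs")
        limit = " after %ss" % timeout if timeout is not None else ""
        # Say explicitly that the task was killed by the timeout, so the
        # red step does not just report "N tests failed".
        return True, "Swarming task timed out%s; tests may not have run" % limit
    if state != "COMPLETED":
        return True, "Swarming task did not complete (state: %s)" % state
    exit_code = result.get("exit_code", 0)
    if exit_code != 0:
        return True, "Task exited with code %d" % exit_code
    return False, "Task completed"
```

The point of the sketch is only that the timeout message comes from whoever knows the task state, rather than from the test runner guessing based on a SIGTERM.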
Nov 10 2016
Another, non-Android example: https://uberchromegw.corp.google.com/i/chromium.win/builders/Win%207%20Tests%20x64%20%281%29/builds/18338
Jan 18 2017
Feb 10 2017
Mar 10 2017
Jul 11 2017
Related: none of the Chromium Python scripts handle SIGTERM properly. We need a complete overhaul there. I've done it in run_isolated, so it's doable. We need to make handling it as easy as possible, ideally trivial.
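As a rough illustration of what "handling SIGTERM properly" could look like in a Python script, here is a minimal sketch using only the standard `signal` module. It is not the run_isolated implementation; the names `Terminated` and `run_with_sigterm_handling` are hypothetical, and it assumes it is called from the main thread on a POSIX system.

```python
import signal
import sys


class Terminated(Exception):
    """Raised in the main thread when SIGTERM arrives."""


def _on_sigterm(signum, frame):
    # Convert the signal into an exception so normal try/finally
    # cleanup runs instead of the process dying immediately.
    raise Terminated()


def run_with_sigterm_handling(work, cleanup):
    """Run work(); on SIGTERM, log it, run cleanup(), return 128 + SIGTERM."""
    previous = signal.signal(signal.SIGTERM, _on_sigterm)
    try:
        work()
        return 0
    except Terminated:
        sys.stderr.write("Received SIGTERM; cleaning up before exit\n")
        return 128 + signal.SIGTERM
    finally:
        cleanup()
        # Restore whatever handler was installed before.
        signal.signal(signal.SIGTERM, previous)
```

A script would wrap its main body as `sys.exit(run_with_sigterm_handling(main, kill_children))`, where `main` and `kill_children` stand in for the script's own functions. Note that the handler still cannot tell why the SIGTERM was sent, which is the limitation described earlier in this thread.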
Jun 21 2018
Ben worked on improving the SIGTERM handling as part of issue 733612.
Oct 19
Comment 1 by dpranke@chromium.org, Nov 10 2016