Figure out what we can do to mark WindowsServer crashes as test failures in swarming and recipes
Issue description
See bug 653353 for context - it looks like there's some issue on at least Mac 10.9 where running the browser_tests occasionally takes down the Mac window server and hence the whole machine. This gets propagated back to the swarming task as a SIGTERM (-15), and currently we treat this as an infra failure. However, given that we don't see SIGTERMs for anything else, it seems reasonable to reclassify this as a test failure, at least for now. Can we figure out how to make that happen?
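A minimal sketch of the reclassification being asked for, assuming a hypothetical `classify_task_exit` helper and hypothetical category labels; none of these names come from swarming or the recipes. How other negative (signal) exit codes should be treated is also an assumption here.

```python
import signal


def classify_task_exit(exit_code):
    """Map a task exit code to a hypothetical result category."""
    if exit_code is None:
        # No exit code at all: the task never reported back (assumed infra issue).
        return "infra_failure"
    if exit_code == 0:
        return "success"
    if exit_code == -signal.SIGTERM:
        # Proposed change: SIGTERM (-15) is what the window server crash
        # looks like, so count it against the test rather than the infra.
        return "test_failure"
    if exit_code < 0:
        # Assumption: any other signal stays an infra failure.
        return "infra_failure"
    # Ordinary non-zero exit from the test binary.
    return "test_failure"
```

With this mapping, `classify_task_exit(-15)` is a test failure while a missing exit code is still reported as an infra failure.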
,
Jan 18 2017
In the spirit of "if we don't prioritize infra failure bugs they'll never get done" I'm going to bump this up to P1.
,
Jan 18 2017
If this failure always manifests itself as:
2017-01-16 18:55:54.815 browser_tests[48622:303] An uncaught exception was raised
2017-01-16 18:55:54.815 browser_tests[48622:303] Error (268435459) creating CGSWindow on line 263
couldn't we catch that exception in browser_tests and force-fail the test? I'm worried that handling it at the Swarming level could cause more problems. Do we know for sure that we don't see SIGTERMs for anything else? How do we know that?
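One way to check the "do we see SIGTERMs for anything else" question is to scan task output for the log signature quoted above. This is a log-based sketch, not the in-process catch suggested in the comment; the function name and the regular expression are illustrative assumptions.

```python
import re

# Matches the log line quoted above; the numeric error code is left generic.
WINDOW_SERVER_CRASH = re.compile(r"Error \(\d+\) creating CGSWindow")


def looks_like_window_server_crash(log_text):
    """Return True if the task output contains the CGSWindow error signature."""
    return bool(WINDOW_SERVER_CRASH.search(log_text))
```

A task that exits with -15 but does not match this signature would be evidence that SIGTERM has some other cause.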
,
Jan 25 2017
,
Jan 26 2017
FYI, seeing additional failures on 2017-01-24 in 8 builds.
,
Jan 27 2017
I forget, are the failures reproducible, e.g. retrying the task results in the same hard failure mode? IIRC last time I had looked they weren't.
,
Jan 27 2017
AFAIK, they are not.
,
Mar 16 2017
Updating with some information from a private email thread:
* An example Swarming task page for a run where this error occurred: https://chromium-swarm.appspot.com/task?id=31a5cfd9354d1910&refresh=10&show_raw=1 <-- The task fails (desired behavior) and the exit code is -15.
* Swarming collect should return 0 unless there is an issue when *collecting* results. The current behavior (return exit_code if exit_code is not None else 1) is a bug (https://github.com/luci/luci-py/blob/master/client/swarming.py#L848). This may be why we are getting missing shards when we shouldn't be. There is a bug on file (Bug 693214) to fix this. This bug is blocked on that bug.
* When collect returns zero, the task result should still include the -15 exit code.
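The difference between the current and the desired collect behavior described above can be sketched as follows. Both functions and the result dictionary layout are hypothetical illustrations, not the actual code in swarming.py.

```python
def current_collect_return(exit_code):
    """Current (buggy) behavior as quoted above: the task's exit code leaks out."""
    return exit_code if exit_code is not None else 1


def desired_collect(shard_results):
    """Return (collect_exit_code, summary).

    shard_results is a list with one entry per shard: a dict holding that
    shard's result, or None if the result could not be collected.
    """
    missing = [i for i, result in enumerate(shard_results) if result is None]
    # Non-zero only when collecting itself went wrong, not when the task failed.
    collect_exit_code = 1 if missing else 0
    summary = {"shards": shard_results, "missing": missing}
    return collect_exit_code, summary
```

Under the desired behavior, a shard that died with SIGTERM yields a collect return of 0 while its result still carries `exit_code: -15`, so a caller can tell a failed task apart from a failed collection.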
,
Apr 26 2017
,
Oct 30 2017
See also issue 645280 for context.
Comment 1 by katthomas@google.com
, Jan 16 2017