Issue 681684

Starred by 1 user

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Mac
Pri: 1
Type: Bug

Blocked on:
issue 693214




Figure out what we can do to mark WindowServer crashes as test failures in swarming and recipes

Project Member Reported by dpranke@chromium.org, Jan 16 2017

Issue description

See bug 653353 for context. There appears to be an issue, on at least Mac 10.9, where running browser_tests occasionally takes down the Mac window server and with it the whole machine. This gets propagated back to the swarming task as a SIGTERM (exit code -15), and currently we treat this as an infra failure.

However, given that we don't see SIGTERMs for anything else, it seems reasonable to reclassify this as a test failure, at least for now.

Can we figure out how to make that happen?
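
To make the proposal concrete, here is a minimal sketch of the reclassification, not the actual recipe code. The names `classify_exit_code`, `SUCCESS`, `TEST_FAILURE` and `INFRA_FAILURE` are hypothetical, and how exit codes other than -15 are handled is an assumption made only so the example is complete.

```python
import signal

# Hypothetical labels; the real recipe/swarming code may use different names.
SUCCESS = "success"
TEST_FAILURE = "test_failure"
INFRA_FAILURE = "infra_failure"


def classify_exit_code(exit_code):
    """Classify a swarming task exit code (illustrative only)."""
    if exit_code is None:
        # Assumption: no exit code means the task never reported a result.
        return INFRA_FAILURE
    if exit_code == 0:
        return SUCCESS
    if exit_code == -signal.SIGTERM:
        # The proposal in this issue: treat -15 as a test failure for now.
        return TEST_FAILURE
    if exit_code > 0:
        # Assumption: an ordinary non-zero exit is a test failure.
        return TEST_FAILURE
    # Assumption: any other signal is still treated as an infra failure.
    return INFRA_FAILURE
```

Special-casing only -15 keeps the change narrow, so other signals continue to surface as infra failures.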
 
Yup, I can take a look at this. Thanks for moving this one forward. 
Labels: -Pri-2 Pri-1
Status: Started (was: Assigned)
In the spirit of "if we don't prioritize infra failure bugs they'll never get done" I'm going to bump this up to P1.
Cc: dpranke@chromium.org
If this failure always manifests itself as:

2017-01-16 18:55:54.815 browser_tests[48622:303] An uncaught exception was raised
2017-01-16 18:55:54.815 browser_tests[48622:303] Error (268435459) creating CGSWindow on line 263

couldn't we catch that exception in browser_tests and force-fail the test? I'm worried that handling it at the Swarming level could cause more problems. Do we know for sure that we don't see SIGTERMs for anything else? How do we know that?
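
As a rough illustration of keying off that signature, here is a sketch that scans captured test output for the CGSWindow error instead of catching the exception inside browser_tests itself, which is a different technique from the one suggested above. The helper name `has_window_server_crash` is hypothetical, and the pattern assumes the message always has the form shown in the two log lines.

```python
import re

# Assumption: the failure always logs "Error (<number>) creating CGSWindow".
CGS_WINDOW_ERROR = re.compile(r"Error \(\d+\) creating CGSWindow")


def has_window_server_crash(log_text):
    """Return True if the captured output contains the CGSWindow error."""
    return CGS_WINDOW_ERROR.search(log_text) is not None


sample_log = (
    "2017-01-16 18:55:54.815 browser_tests[48622:303] "
    "An uncaught exception was raised\n"
    "2017-01-16 18:55:54.815 browser_tests[48622:303] "
    "Error (268435459) creating CGSWindow on line 263\n"
)
```

With the sample above, `has_window_server_crash(sample_log)` is true, and output without the message yields false.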



Labels: Hotlist-Infra-Failures

Comment 5 by efoo@chromium.org, Jan 26 2017

Cc: efoo@chromium.org
FYI, seeing additional failures on 2017-01-24 in 8 builds. 

Comment 6 by mar...@chromium.org, Jan 27 2017

I forget, are the failures reproducible, i.e. does retrying the task result in the same hard failure mode? IIRC last time I looked they weren't.
AFAIK, they are not.
Blockedon: 693214
Owner: mar...@chromium.org
Status: Assigned (was: Started)
Updating with some information from a private email thread:

* An example of a Swarming task page for a run where this error occurred: https://chromium-swarm.appspot.com/task?id=31a5cfd9354d1910&refresh=10&show_raw=1 <-- The task fails (desired behavior) and the exit code is -15

* Swarming collect should return 0 unless there is an issue when *collecting* results. The current behavior (return exit_code if exit_code is not None else 1) is a bug (https://github.com/luci/luci-py/blob/master/client/swarming.py#L848). This may be why we are getting missing shards when we shouldn't be. Bug 693214 is on file to fix this, and this bug is blocked on it.

* When collect returns zero, the task result should still include the -15 exit code.
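
The contract described in the bullets above could look roughly like the sketch below. It is not the code in swarming.py: `CollectError`, `collect` and `fetch_result` are hypothetical names, and representing the task result as a dict with an `exit_code` key is an assumption. Only `collect_return_code_current` mirrors the expression quoted above.

```python
class CollectError(Exception):
    """Hypothetical error raised when results cannot be collected."""


def collect_return_code_current(exit_code):
    # The behavior quoted above: the task's exit code leaks into collect's.
    return exit_code if exit_code is not None else 1


def collect(fetch_result):
    """Return (collect_return_code, task_result).

    fetch_result is a hypothetical callable that returns the task result
    dict, or raises CollectError if collecting itself fails.
    """
    try:
        result = fetch_result()
    except CollectError:
        # Only a problem with collecting produces a non-zero return code.
        return 1, None
    # Collecting succeeded: return 0 and keep the task's own exit code
    # (for example -15) inside the result.
    return 0, result


def failing_fetch():
    # Stand-in for a collection problem, used to exercise the error path.
    raise CollectError("could not fetch results")
```

For a task that died with SIGTERM, `collect(lambda: {"exit_code": -15})` returns `(0, {"exit_code": -15})`, whereas the current expression would return -15 for the same task.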
Components: Infra>Client>Chrome
See also issue 645280 for context.
