CQ reports about preemptively killed bots are confusing
Issue description

When the CQ master kills bots after it knows the run will fail (because some other bot failed), it ends up reporting all those killed bots with messages like:

  The HWTest [bvt-inline] stage failed: ** HWTest did not complete due to infrastructure issues (code 3) ** in https://luci-milo.appspot.com/buildbot/chromeos/xxx
  The VMTest (attempt 2) stage failed: ** VMTests failed with code 1 **
  The HWTest [arc-bvt-cq] stage failed: (15, 'Received signal 15; shutting down') in https://luci-milo.appspot.com/buildbot/chromeos/xxx

These messages are indistinguishable from actual infra flakes/problems, or from people actually resetting the bot (which sometimes happens, e.g. during a waterfall restart). This leads people (like myself) to dismiss those runs as "CQ flaked again" and just resubmit the CL, which is bad when the CL is actually the problem.

For the bots the CQ master kills, it should simply list them all in one section like (feel free to wordsmith this):

  The following build(s) were halted prematurely due to other bots already failing:
    veyron_mighty-paladin
    cyan-paladin
    xxxxx

It shouldn't show any other summary messages or links to their logs.

Similarly, the CQ master cbuildbot summary (CommitQueueCompletion) shouldn't list these as failures. For people triaging the tree, they just get in the way:

  CommitQueueCompletion ( 2 hrs 58 mins )
  CommitQueueCompletion
  * stdio [stdout]
  * cyan-paladin: The VMTest (attempt 2) stage failed: ** VMTests failed with code 1 ** The HWTest [arc-bvt-cq] stage failed: (15, 'Received signal 15; shutting down')
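A rough sketch of the report format proposed above, for illustration only. The `SlaveResult` record, its `killed_by_master` flag, and `format_cq_report` are hypothetical names, not chromite APIs; the sketch assumes the master records which slaves it aborted itself.

```python
from typing import List, NamedTuple


class SlaveResult(NamedTuple):
    """Hypothetical record of one slave build's outcome (not a chromite type)."""
    name: str
    failure_messages: List[str]
    killed_by_master: bool  # assumption: the master tracks which slaves it aborted


HALTED_HEADER = ("The following build(s) were halted prematurely "
                 "due to other bots already failing:")


def format_cq_report(results: List[SlaveResult]) -> str:
    """Report real failures in full; list master-killed builds by name only."""
    lines = []
    for result in results:
        if result.killed_by_master:
            continue  # suppress the misleading per-stage messages and log links
        for message in result.failure_messages:
            lines.append(f"{result.name}: {message}")

    halted = sorted(r.name for r in results if r.killed_by_master)
    if halted:
        lines.append(HALTED_HEADER)
        lines.extend(f"  {name}" for name in halted)
    return "\n".join(lines)
```

With this shape, a slave that received signal 15 from the master contributes only its name to the halted section, so its SIGTERM message cannot be mistaken for an infra flake.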
Jun 2 2017
The master build's CommitQueueCompletion stage does show links to the builds that are ignored because of the CQ master. For the failure summary, can you please link an example? In theory, the ignored slaves will be aborted by the next master, so failures caused by self-destruction shouldn't show up in the master build.
Jun 2 2017
There's no self-destruction in master-paladin/14898. The failures could be caused by bad CLs. If a CQ master destructs itself, it shows a message in CompletionStage like: https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/14904
Jun 2 2017
Then this failure is mis-summarized: https://luci-milo.appspot.com/buildbot/chromeos/cyan-paladin/2746

  cyan-paladin: The VMTest (attempt 2) stage failed: ** VMTests failed with code 1 ** The HWTest [arc-bvt-cq] stage failed: (15, 'Received signal 15; shutting down')
Jun 2 2017
I checked the failureTable: VMTests failed with code 1 and HWTest was aborted with a signal, which is what the page showed. The summary looks right to me?
Jun 7 2017
Those "signal 15; shutting down" messages are not distinguishable. For example, it's the same as this one, where the bot was killed:

  samus-no-vmtest-pre-cq: The BuildPackages stage failed: (15, 'Received signal 15; shutting down') in https://luci-milo.appspot.com/buildbot/chromiumos.tryserver/no_vmtest_pre_cq/55870

The cyan-paladin was killed because the test appears to have hung. That's what the user cares about, not that the chromite code resorted to sending SIGTERM to tear it down. The samus-pre-cq, on the other hand, had the infra reset underneath it, which means the user's CL shouldn't have been blamed.
Jun 7 2017
As long as the master build's completion stage doesn't show "The master destructed itself and stopped waiting for the following slaves", no self-destruction was triggered. Signal 15 can have many causes: 1) someone manually aborted the build; 2) a test or stage hung, so buildbot aborted it; 3) network issues. The master won't know what's really causing the issue.
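The check described above could be sketched roughly as follows. The marker string is the one quoted in this comment; the function names and the idea of passing the completion-stage log as a plain string are assumptions for illustration, not how chromite actually exposes it.

```python
# Message quoted in the comment above; shown by the master's completion stage.
SELF_DESTRUCT_MARKER = (
    "The master destructed itself and stopped waiting for the following slaves"
)


def master_self_destructed(completion_stage_log: str) -> bool:
    """Return True if the completion-stage log reports a self-destruction."""
    return SELF_DESTRUCT_MARKER in completion_stage_log


def describe_sigterm(completion_stage_log: str) -> str:
    """Say what can be concluded about a slave's signal 15 failure."""
    if master_self_destructed(completion_stage_log):
        return "halted by master self-destruction"
    # Without the marker, the master cannot tell these causes apart.
    return "cause unknown (manual abort, hung stage, or network issue)"
```

Only the first branch gives a definite answer; every other signal 15 stays ambiguous, which is the limitation this comment points out.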
Mar 30 2018
Mar 30 2018
Jun 1 2018
Comment 1 by pprabhu@chromium.org
, Jun 2 2017