Issue 729168

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug



CQ reports about preemptively killed bots are confusing

Project Member Reported by vapier@chromium.org, Jun 2 2017

Issue description

when the CQ master kills bots after it knows the run will fail (because some other bot failed), it ends up reporting all those killed bots with messages like:

The HWTest [bvt-inline] stage failed: ** HWTest did not complete due to infrastructure issues (code 3) ** in https://luci-milo.appspot.com/buildbot/chromeos/xxx
The VMTest (attempt 2) stage failed: ** VMTests failed with code 1 ** The HWTest [arc-bvt-cq] stage failed: (15, 'Received signal 15; shutting down') in https://luci-milo.appspot.com/buildbot/chromeos/xxx

these messages are indistinguishable from actual infra flakes/problems, or from people actually resetting the bot (which sometimes happens, e.g. during a waterfall restart).  this leads people (like myself) to dismiss those runs as "CQ flaked again" and just resubmit the CL, which is bad when the CL is actually the problem.

for every bot the CQ master kills, it should simply list them in one section like (feel free to wordsmith this):

The following build(s) were halted prematurely due to other bots already failing:

  veyron_mighty-paladin cyan-paladin xxxxx

it shouldn't show/list any other summary messages or links to their logs.
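
A minimal sketch of the grouping proposed above, assuming each slave result carries a flag saying the CQ master killed it. `SlaveResult`, `killed_by_master`, and `format_report` are hypothetical names for illustration, not chromite APIs:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SlaveResult:
    """Hypothetical record for one slave build; not a chromite type."""
    name: str
    message: str
    killed_by_master: bool  # assumed to be set when the CQ master aborts the build


def format_report(results: List[SlaveResult]) -> str:
    """Report real failures in full and master-killed builds by name only."""
    failures = [r for r in results if not r.killed_by_master]
    halted = sorted(r.name for r in results if r.killed_by_master)

    # Real failures keep their full summary message.
    lines = [f"{r.name}: {r.message}" for r in failures]

    # Killed builds are named in one section, with no messages or log links.
    if halted:
        if lines:
            lines.append("")
        lines.append(
            "The following build(s) were halted prematurely due to other "
            "bots already failing:"
        )
        lines.append("  " + " ".join(halted))
    return "\n".join(lines)
```

With this shape, the "Received signal 15" text of a killed build never reaches the report, so it cannot be mistaken for an infra flake.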

similarly, the CQ master cbuildbot summary (CommitQueueCompletion) shouldn't list these as failures.  for people triaging the tree, these just get in the way.

CommitQueueCompletion ( 2 hrs 58 mins )
 * stdio [stdout]
 * cyan-paladin: The VMTest (attempt 2) stage failed: ** VMTests failed with code 1 ** The HWTest [arc-bvt-cq] stage failed: (15, 'Received signal 15; shutting down')
 
Owner: nxia@chromium.org

Comment 2 by nxia@chromium.org, Jun 2 2017

The master build's CommitQueueCompletion stage does show links to the builds that are ignored because of the CQ master.

For the failure summary, can you please link an example? In theory, the ignored slaves will be aborted by the next master, so failures caused by self-destruction shouldn't show up in the master build.

Comment 4 by nxia@chromium.org, Jun 2 2017

There's no self-destruction in master-paladin/14898. The failures could be caused by bad CLs. 

If a CQ master destructs itself, it shows a message in CompletionStage like:
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/14904
this failure summary is mis-summarized then:
https://luci-milo.appspot.com/buildbot/chromeos/cyan-paladin/2746
cyan-paladin: The VMTest (attempt 2) stage failed: ** VMTests failed with code 1 ** The HWTest [arc-bvt-cq] stage failed: (15, 'Received signal 15; shutting down')

Comment 6 by nxia@chromium.org, Jun 2 2017

I checked the failureTable: VMTests failed with code 1 and HWTest was aborted with a signal, and that is what the page shows. It looks to me like the summary is right?
those signal 15 shutting down messages are not distinguishable.  for example, it's the same as this one where the bot was killed:

samus-no-vmtest-pre-cq: The BuildPackages stage failed: (15, 'Received signal 15; shutting down') in https://luci-milo.appspot.com/buildbot/chromiumos.tryserver/no_vmtest_pre_cq/55870

the cyan-paladin was killed because the test appears to have hung.  that's what the user cares about, not that the chromite code resorted to sending SIGTERM to tear it down.  the samus-pre-cq, on the other hand, had the infra reset on it, which means the user's CL shouldn't have been blamed.

Comment 8 by nxia@chromium.org, Jun 7 2017

As long as the master build's completion stage doesn't show "The master destructed itself and stopped waiting for the following slaves", no self-destruction was triggered.

Signal 15 can have many causes: 1) someone manually aborted the build; 2) tests/stages hung, so buildbot aborted them; 3) network issues.

The master won't know what's really causing the issue.
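
Since the message text alone cannot identify the cause, one possible approach is for whichever side sends the signal to record why. The sketch below only illustrates that idea; `AbortReason` and `describe_failure` are hypothetical names, and nothing here reflects how chromite actually records aborts:

```python
import enum
import re
from typing import Optional


class AbortReason(enum.Enum):
    """Hypothetical reasons a build might receive SIGTERM."""
    MASTER_SELF_DESTRUCT = "halted by the CQ master"
    MANUAL = "manually aborted"
    TIMEOUT = "stage hung and was aborted"
    INFRA = "infrastructure or network problem"
    UNKNOWN = "unknown"


# Matches the text seen in the reports quoted in this issue.
_SIGNAL_RE = re.compile(r"Received signal (\d+); shutting down")


def describe_failure(message: str, reason: Optional[AbortReason] = None) -> str:
    """Rewrite a signal-abort message to include the recorded reason, if any."""
    match = _SIGNAL_RE.search(message)
    if match is None:
        # Not a signal abort: leave the original summary untouched.
        return message
    reason = reason or AbortReason.UNKNOWN
    return f"aborted by signal {match.group(1)} ({reason.value})"
```

Without a recorded reason the result stays "unknown", which matches the point above: the master cannot infer the cause from the signal number.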
Components: Infra>Client>ChromeOS>CI
Components: -Infra>Client>ChromeOS

Comment 11 by nxia@chromium.org, Jun 1 2018

Cc: -dgarr...@chromium.org -nxia@chromium.org -pprabhu@chromium.org
Owner: ----
