
Issue 703309


Issue metadata

Status: WontFix
Owner:
Closed: Mar 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug
som




Self-destructing master builds cause Sheriff-o-Matic not to dispatch

Project Member Reported by davidri...@chromium.org, Mar 20 2017

Issue description

Two separate issues with self-destructing master and Sheriff-o-Matic:
1. som_alerts_dispatcher isn't called, so no new alerts are created.
2. som_alerts_dispatcher queries for the most recent completed build (i.e. final=1) and won't find builds that were aborted.

An example build:
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/14016

Issue 682763 touched on #2. akeshet@ and I discussed it, and what we thought might work is this: when the next build starts, it closes out the previous one by marking it as final and changing any statuses from inflight to aborted.
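
A minimal sketch of that close-out idea, for illustration only. It uses an in-memory SQLite table with a small subset of the buildTable columns shown in this issue (build_config, build_number, status, final). The function names and the demo rows for builds 14015 and 14017 are hypothetical, and this is not how CIDB or chromite actually implement anything:

```python
import sqlite3


def close_out_previous_builds(conn, build_config, new_build_number):
    """Mark older, non-final builds as final; flip 'inflight' to 'aborted'.

    Hypothetical helper: meant to run when build `new_build_number` starts.
    Returns the number of rows that were closed out.
    """
    cur = conn.execute(
        "UPDATE buildTable "
        "SET final = 1, "
        "    status = CASE WHEN status = 'inflight' THEN 'aborted' "
        "                  ELSE status END "
        "WHERE build_config = ? AND build_number < ? AND final = 0",
        (build_config, new_build_number),
    )
    conn.commit()
    return cur.rowcount


def latest_final_build(conn, build_config):
    """Return (build_number, status) of the newest build with final = 1."""
    return conn.execute(
        "SELECT build_number, status FROM buildTable "
        "WHERE build_config = ? AND final = 1 "
        "ORDER BY build_number DESC LIMIT 1",
        (build_config,),
    ).fetchone()


def make_demo_db():
    """Build a toy table; only the 14016 row reflects data from this issue."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE buildTable ("
        "build_config TEXT, build_number INTEGER, status TEXT, final INTEGER)"
    )
    conn.executemany(
        "INSERT INTO buildTable VALUES (?, ?, ?, ?)",
        [
            ("master-paladin", 14015, "pass", 1),      # made-up earlier build
            ("master-paladin", 14016, "inflight", 0),  # interrupted build
            ("master-paladin", 14017, "inflight", 0),  # made-up next build
        ],
    )
    conn.commit()
    return conn
```

Before the close-out, a final=1 query skips build 14016 and returns the older build. After calling close_out_previous_builds(conn, "master-paladin", 14017), the same query returns 14016 with status aborted.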
 
Cc: chingcodes@chromium.org

Comment 2 by nxia@chromium.org, Mar 20 2017

For #1:

Can you explain in more detail what the issue is?

For #2:

https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/14016

This build was cancelled by someone on purpose. The self-destruction will mark the build as 'fail' and 'final'=True in CIDB.

#1. The som_alerts_dispatcher script isn't called from the Report stage, so no new Sheriff-o-Matic alerts are generated for the failures.

#2. I don't see that:
mysql> select * from buildTable where build_config = 'master-paladin' and build_number = 14016\G
*************************** 1. row ***************************
                 id: 1395655
       last_updated: 2017-03-20 17:56:47
    master_build_id: NULL
buildbot_generation: 1
       builder_name: master-paladin
          waterfall: chromeos
       build_number: 14016
       build_config: master-paladin
       bot_hostname: cros-wimpy0-c2.c.chromeos-bot.internal
         start_time: 2017-03-20 17:41:48
        finish_time: 0000-00-00 00:00:00
             status: inflight
      status_pickle: NULL
         build_type: paladin
     chrome_version: NULL
  milestone_version: 59
   platform_version: 9385.0.0-rc1
       full_version: R59-9385.0.0-rc1
        sdk_version: NULL
      toolchain_url: 2017/03/%(target)s-2017.03.19.180736.tar.xz
              final: 0
       metadata_url: NULL
            summary: NULL
           deadline: 2017-03-20 22:12:41
          important: 1
     buildbucket_id: NULL
           unibuild: 0
   suite_scheduling: 0
1 row in set (0.10 sec)

Comment 4 by nxia@chromium.org, Mar 20 2017

I mean that build 14016 was killed by someone clicking the 'stop' button on the waterfall page, not by the self-destructing CQ.
A self-destructing CQ will run ReportStage and mark the build as 'fail' and 'final' in CIDB. You can look at other self-destructed CQ runs.

Ah... the confusion might be my fault then.
Cc: d...@chromium.org pbe...@chromium.org
I'm not convinced.

From the logs of the paladin:
will_not_submit set contains 12 changes: [CL:*338625 CL:454920 CL:455267 CL:456044 CL:456501 CL:456847 CL:457124 CL:457364 CL:457365 CL:457366 CL:457367 CL:457368]

12:12:24: WARNING: No need to wait for the remaining running slaves given the results of relevant change triages.

It detected that daisy_spring-paladin died at 11:52:
11:52:01: INFO: Build config daisy_spring-paladin completed with status "FAILURE".

daisy_spring-paladin is showing an abort during unit test, but logs differ between buildbot and logdog:
https://uberchromegw.corp.google.com/i/chromeos/builders/daisy_spring-paladin/builds/14720/steps/UnitTest/logs/stdio
https://luci-logdog.appspot.com/v/?s=chromeos%2Fbb%2Fchromeos%2Fdaisy_spring-paladin%2F14720%2F%2B%2Frecipes%2Fsteps%2FUnitTest%2F0%2Fstdout

In particular, logdog shows the stage as completing at 11:41:
** Finished Stage UnitTest - Mon, 20 Mar 2017 11:41:57 -0700 (PDT)

CIDB agrees with that:
| 40607082 |  1395695 | UnitTest                    | daisy_spring | pass     | 2017-03-20 18:41:57 | 2017-03-20 18:30:58 | 2017-03-20 18:41:57 |     1 |
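
As a quick check that the two timestamps line up, the logdog finish time (11:41:57 at UTC-7) can be converted to UTC and compared with the CIDB row. This sketch assumes the CIDB timestamps are stored in UTC, which is not stated in this issue. The helper name is made up for illustration:

```python
from datetime import datetime, timedelta, timezone

# Logdog: "Finished Stage UnitTest - Mon, 20 Mar 2017 11:41:57 -0700 (PDT)"
PDT = timezone(timedelta(hours=-7))
logdog_finish = datetime(2017, 3, 20, 11, 41, 57, tzinfo=PDT)


def to_utc_string(ts):
    """Format an aware datetime the way the CIDB row prints it (assumed UTC)."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


cidb_finish = "2017-03-20 18:41:57"  # taken from the CIDB row above
timestamps_agree = to_utc_string(logdog_finish) == cidb_finish
```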

I would expect the active stage to have all of its logs dumped with no buffering effects, so I don't understand why buildbot and logdog do not agree on that step.

So if someone clicked "stop", it would have been on both daisy_spring-paladin and leon-paladin, or there's some other infrastructure failure causing:
remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]


Comment 7 by nxia@chromium.org, Mar 20 2017

To clarify, "self-destruction" doesn't mean killing the CQ itself. It means the CQ master doesn't wait for all slaves to complete once it thinks there's nothing left to test, and it doesn't trigger any "kill" operations on the CQ master. dnj@, can we find any logs about why CQ master 14016 was canceled?
Status: WontFix (was: Untriaged)
It appears that network issues (crbug.com/702658) caused the slave builds to fail (daisy_spring at 11:51, leon at 12:12). The master then stopped waiting and was in the middle of PublishUprevChanges when it was caught by the same network issues and was also aborted at 12:17.

It looked like a self-destructing master (because the master was in the process of self-destructing), but it never actually completed.

The slave builds continued to run after their network connection to the master was lost, and additional things were logged to logdog and the local cbuildbot.log file, and stage information was logged to CIDB up to 11:57 (for daisy_spring).

Sheriff-o-Matic dispatching should be made more robust to handle interrupted builds like this.
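
One possible shape for that, sketched under stated assumptions: besides builds marked final, the dispatcher could also consider a build that is still 'inflight' after its deadline has passed and treat it as aborted. The Build class, the function names, and the use of the deadline column as a staleness signal are all hypothetical and are not taken from som_alerts_dispatcher:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass
class Build:
    """Subset of the buildTable columns shown in this issue."""
    build_number: int
    status: str
    final: bool
    deadline: datetime


def effective_status(build: Build, now: datetime) -> str:
    """Treat a non-final 'inflight' build past its deadline as aborted."""
    if not build.final and build.status == "inflight" and now > build.deadline:
        return "aborted"
    return build.status


def latest_dispatchable_build(
    builds: Iterable[Build], now: datetime
) -> Optional[Build]:
    """Pick the newest build that is final or effectively aborted."""
    for build in sorted(builds, key=lambda b: b.build_number, reverse=True):
        if build.final or effective_status(build, now) == "aborted":
            return build
    return None
```

With build 14016 (status inflight, final 0, deadline 2017-03-20 22:12:41), a lookup made after the deadline would return 14016 as aborted, while a lookup made before the deadline would still fall back to the previous final build.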
