Self-destructing master builds cause Sheriff-o-Matic not to dispatch
Issue description
Two separate issues with self-destructing master builds and Sheriff-o-Matic:
1. som_alerts_dispatcher isn't called, so no new alerts are created.
2. som_alerts_dispatcher queries for the most recent completed build (i.e. final=1) and won't find builds that were aborted.
An example build: https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/14016
Issue 682763 touched on #2. akeshet@ and I discussed it, and what we thought might work is this: when the next build starts, it closes out the previous one by marking it as final and changing any statuses from inflight to aborted.
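A minimal sketch of the close-out idea for #2, not the actual chromite/CIDB code. It assumes a trimmed-down buildTable that uses the column names quoted later in this thread, and it runs against an in-memory SQLite database purely so the example is self-contained (CIDB itself is MySQL). The function name `close_out_stale_builds`, the build number 14017, and the sample row for build 14015 are hypothetical. Stage statuses would need the same treatment and are not covered here.

```python
import sqlite3


def close_out_stale_builds(conn, build_config, new_build_number):
    """Mark earlier builds still recorded as inflight as aborted and final.

    Hypothetical helper: the name and signature are illustrative only.
    Returns the number of rows that were closed out.
    """
    cur = conn.execute(
        "UPDATE buildTable SET status = 'aborted', final = 1 "
        "WHERE build_config = ? AND build_number < ? "
        "AND final = 0 AND status = 'inflight'",
        (build_config, new_build_number),
    )
    conn.commit()
    return cur.rowcount


# Trimmed, assumed subset of the buildTable schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE buildTable (build_config TEXT, build_number INTEGER, "
    "status TEXT, final INTEGER)"
)
conn.executemany(
    "INSERT INTO buildTable VALUES (?, ?, ?, ?)",
    [
        ("master-paladin", 14015, "pass", 1),      # invented sample row
        ("master-paladin", 14016, "inflight", 0),  # state reported in this thread
    ],
)

# Pretend build 14017 (hypothetical) is starting and closes out its predecessors.
closed = close_out_stale_builds(conn, "master-paladin", 14017)
rows = conn.execute(
    "SELECT build_number, status, final FROM buildTable ORDER BY build_number"
).fetchall()
print(closed, rows)
```

Only builds that are both non-final and inflight are touched, so a build that already reported its own result is left alone.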
,
Mar 20 2017
For #1: Can you explain in more detail what the issue is?
For #2: https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/14016
This build was cancelled by someone on purpose. The self-destruction will mark the build as 'fail' and 'final'=True in CIDB.
,
Mar 20 2017
#1. The som_alerts_dispatcher script isn't called from the Report stage, so no new Sheriff-o-Matic alerts are generated for the failures.
#2. I don't see that:
mysql> select * from buildTable where build_config = 'master-paladin' and build_number = 14016\G
*************************** 1. row ***************************
id: 1395655
last_updated: 2017-03-20 17:56:47
master_build_id: NULL
buildbot_generation: 1
builder_name: master-paladin
waterfall: chromeos
build_number: 14016
build_config: master-paladin
bot_hostname: cros-wimpy0-c2.c.chromeos-bot.internal
start_time: 2017-03-20 17:41:48
finish_time: 0000-00-00 00:00:00
status: inflight
status_pickle: NULL
build_type: paladin
chrome_version: NULL
milestone_version: 59
platform_version: 9385.0.0-rc1
full_version: R59-9385.0.0-rc1
sdk_version: NULL
toolchain_url: 2017/03/%(target)s-2017.03.19.180736.tar.xz
final: 0
metadata_url: NULL
summary: NULL
deadline: 2017-03-20 22:12:41
important: 1
buildbucket_id: NULL
unibuild: 0
suite_scheduling: 0
1 row in set (0.10 sec)
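To illustrate why a final=1 filter skips a row like the one above, here is a small sketch. It is not the real som_alerts_dispatcher query; `latest_build` is a hypothetical helper, the table is an assumed subset of buildTable, the row for build 14015 is invented, and SQLite stands in for MySQL so the example runs on its own.

```python
import sqlite3


def latest_build(conn, build_config, require_final=False):
    """Return (build_number, status, final) for the newest matching build.

    Hypothetical helper for illustration only.
    """
    query = (
        "SELECT build_number, status, final FROM buildTable "
        "WHERE build_config = ?"
    )
    if require_final:
        query += " AND final = 1"
    query += " ORDER BY build_number DESC LIMIT 1"
    return conn.execute(query, (build_config,)).fetchone()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE buildTable (build_config TEXT, build_number INTEGER, "
    "status TEXT, final INTEGER)"
)
conn.executemany(
    "INSERT INTO buildTable VALUES (?, ?, ?, ?)",
    [
        ("master-paladin", 14015, "pass", 1),      # invented sample row
        ("master-paladin", 14016, "inflight", 0),  # values from the row above
    ],
)

finalized_only = latest_build(conn, "master-paladin", require_final=True)
any_state = latest_build(conn, "master-paladin")
print(finalized_only)  # the older, finalized build
print(any_state)       # the interrupted build that the final=1 filter misses
```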
,
Mar 20 2017
I mean that build 14016 was killed by someone clicking the 'stop' button on the waterfall page, not by the self-destructing CQ. A self-destructing CQ will run ReportStage and mark the build as 'fail' and 'final' in CIDB. You can look at other self-destructed CQ builds.
,
Mar 20 2017
Ah... the confusion might be my fault then.
,
Mar 20 2017
I'm not convinced. From the logs of the paladin:
will_not_submit set contains 12 changes: [CL:*338625 CL:454920 CL:455267 CL:456044 CL:456501 CL:456847 CL:457124 CL:457364 CL:457365 CL:457366 CL:457367 CL:457368]
12:12:24: WARNING: No need to wait for the remaining running slaves given the results of relevant change triages.
It detected that daisy_spring-paladin died at 11:52:
11:52:01: INFO: Build config daisy_spring-paladin completed with status "FAILURE".
daisy_spring-paladin is showing an abort during unit test, but the logs differ between buildbot and logdog:
https://uberchromegw.corp.google.com/i/chromeos/builders/daisy_spring-paladin/builds/14720/steps/UnitTest/logs/stdio
https://luci-logdog.appspot.com/v/?s=chromeos%2Fbb%2Fchromeos%2Fdaisy_spring-paladin%2F14720%2F%2B%2Frecipes%2Fsteps%2FUnitTest%2F0%2Fstdout
In particular, logdog shows the stage as completing at 11:41:
** Finished Stage UnitTest - Mon, 20 Mar 2017 11:41:57 -0700 (PDT)
CIDB agrees with that:
| 40607082 | 1395695 | UnitTest | daisy_spring | pass | 2017-03-20 18:41:57 | 2017-03-20 18:30:58 | 2017-03-20 18:41:57 | 1 |
I would expect that the active stage would have all of its logs dumped with no buffering effects, so I don't understand why buildbot and logdog do not agree on that step.
So if someone clicked "stop", it would have been on both daisy_spring-paladin and leon-paladin, or there's some other infrastructure failure causing:
remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion. ]
,
Mar 20 2017
To clarify, "self-destruction" doesn't mean killing the CQ itself; it means the CQ master doesn't wait for all slaves to complete once it thinks there's nothing left to test. It doesn't trigger any "kill" operations on the CQ master. dnj@, can we find any logs about why CQ-master 14016 was canceled?
,
Mar 20 2017
So it appears that there were network issues (crbug.com/702658) which caused the slave builds to fail (daisy_spring at 11:51, leon at 12:12). The master then stopped waiting and was in the middle of PublishUprevChanges when it was caught by the same network issues and was also aborted at 12:17. It looked like a self-destructing master (because the master was in the process of self-destructing), but it didn't actually complete. The slave builds continued to run after their network connection to the master was lost; additional output was logged to logdog and the local cbuildbot.log file, and stage information was logged to CIDB up to 11:57 (for daisy_spring).
Sheriff-o-Matic dispatching should be made more robust to handle interrupted builds like this.
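One way to recognise an interrupted build from its CIDB row is sketched below. This is an assumption about how such a check could look, not existing dispatcher behaviour: `is_interrupted` is a hypothetical helper, and the rule (not final, and either a newer build has started or the recorded deadline has passed) is only one possible heuristic. The dict keys mirror the buildTable columns quoted earlier in this thread.

```python
from datetime import datetime

CIDB_TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # format used in the row quoted earlier


def is_interrupted(build, now, newer_build_started):
    """Guess whether a build died without reporting its final status.

    Hypothetical heuristic: a non-final build counts as interrupted once a
    newer build of the same config has started or its deadline has passed.
    """
    if build["final"]:
        return False
    deadline = datetime.strptime(build["deadline"], CIDB_TIME_FORMAT)
    return newer_build_started or now > deadline


# Values taken from the master-paladin 14016 row quoted earlier in the thread.
build_14016 = {
    "status": "inflight",
    "final": 0,
    "deadline": "2017-03-20 22:12:41",
}

before_deadline = is_interrupted(
    build_14016, datetime(2017, 3, 20, 18, 0, 0), newer_build_started=False
)
after_deadline = is_interrupted(
    build_14016, datetime(2017, 3, 20, 23, 0, 0), newer_build_started=False
)
superseded = is_interrupted(
    build_14016, datetime(2017, 3, 20, 18, 0, 0), newer_build_started=True
)
print(before_deadline, after_deadline, superseded)
```

A dispatcher that applied a check like this could still raise alerts for builds that never reached the Report stage.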
Comment 1 by chingcodes@chromium.org
, Mar 20 2017