Inconsistent build status in CIDB and waterfall
Issue description:

Case 1) CIDB status == fail
https://uberchromegw.corp.google.com/i/chromeos/builders/lumpy-incremental-paladin/builds/11670
https://viceroy.corp.google.com/chromeos/build_details?waterfall=chromeos&build_config=lumpy-incremental-paladin&build_number=11670&milestone_version=&experimental=False&duration=1d&refresh=-1
master: https://viceroy.corp.google.com/chromeos/build_details?build_id=1414289
Why is this marked fail? Shouldn't it be marked as aborted?

Case 2) CIDB status == aborted
https://uberchromegw.corp.google.com/i/chromeos/builders/lumpy-incremental-paladin/builds/11673
https://viceroy.corp.google.com/chromeos/build_details?waterfall=chromeos&build_config=lumpy-incremental-paladin&build_number=11673&milestone_version=&experimental=False&duration=1d&refresh=-1
Multiple stages have a start time after the build finish time. Is this expected? crbug.com/707894 will have information indicating that each of the slaves failed for this reason. It has been causing issues with graphing: if a stage has no finish time, we fall back to the finish time of the build, which happens to be before the start time of the stage, and that breaks things. Should I just be correcting that timestamp? (There's a lot of this data massaging resulting from the data being unclean, and I'm concerned because it will need to be replicated in all usages of the data and could potentially result in inconsistencies.)

Case 3) CIDB status == inflight
https://uberchromegw.corp.google.com/i/chromeos/builders/lumpy-incremental-paladin/builds/11674
https://viceroy.corp.google.com/chromeos/build_details?waterfall=chromeos&build_config=lumpy-incremental-paladin&build_number=11674&milestone_version=&experimental=False&duration=1d&refresh=-1
This should be addressed by crbug.com/707897 and become expired, right? This was from the golo losing network, right?
https://uberchromegw.corp.google.com/i/chromeos/builders/lumpy-incremental-paladin/builds/11674/steps/steps/logs/interrupt reports:
[Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
Is it safe to key off of this in SoM as a sign of that sort of failure?

Case 4) CIDB status == inflight
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/14147
https://viceroy.corp.google.com/chromeos/build_details?waterfall=chromeos&build_config=master-paladin&build_number=14147&milestone_version=&experimental=False&duration=1d&refresh=-1
This should be addressed by crbug.com/707897 and become expired, right? This was from someone clicking cancel? If we do this via a cronjob and parse the reason, can we detect why it was aborted? In that case, does it become CIDB status aborted or expired? If it's aborted, we can then use buildMessageTable to differentiate between it and a self-destructing CQ as well. Can we transitively apply this aborted state to all the slaves that end up being restarted when the master restarts?
Apr 5 2017
Case 1) CIDB status == fail
https://uberchromegw.corp.google.com/i/chromeos/builders/lumpy-incremental-paladin/builds/11670
https://viceroy.corp.google.com/chromeos/build_details?waterfall=chromeos&build_config=lumpy-incremental-paladin&build_number=11670&milestone_version=&experimental=False&duration=1d&refresh=-1
master: https://viceroy.corp.google.com/chromeos/build_details?build_id=1414289
Why is this marked fail? Shouldn't it be marked as aborted?

The next master's CleanUp stage canceled lumpy-incremental-paladin/11670 and inserted an 'aborted' build status at 13:34:*. The build then got the kill signal, a _ShutDownException was inserted into cidb.failureTable for BuildPackagesStage, a 'fail' stage status was reported for BuildPackagesStage, and a 'fail' build status was reported at 13:38:*. I guess it's because BuildPackagesStage was running the command line to build packages when the _ShutDownException was caught and reported, so the Report stage still ran and marked the build as 'fail'. http://shortn/_RRi305xFXN

Proposed fix: after the CleanUp stage of the next master marks the build as 'aborted' with 'final=True', the build can no longer update its own build status. There's another potential inconsistency, though: after the master sends the cancel request and inserts the 'aborted' entry, there is a delay (3~5 minutes to my knowledge) before the Buildbucket server really cancels the build, during which time the build can either fail or pass. It should be fine to still treat such a build as aborted, since the master sent out the abort request while the build was still running.
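For illustration, a minimal sketch of the kind of guard the proposed fix implies. The helper names (get_build_row, finalize_build_status) are hypothetical stand-ins, not the actual chromite/CIDB API:

# Sketch only: before the Report stage writes its terminal status, re-read the
# build row and skip the write if another actor (the next master's CleanUp
# stage) already marked it final.
def report_final_status(db, build_id, computed_status):
    """Write computed_status unless the row was already finalized.

    db.get_build_row / db.finalize_build_status are hypothetical helpers
    standing in for whatever CIDB accessor the builder actually uses.
    """
    row = db.get_build_row(build_id)
    if row.get('final'):
        # The next master already marked this build 'aborted' with final=True;
        # do not overwrite it with 'fail' or 'pass'.
        return row['status']
    db.finalize_build_status(build_id, status=computed_status, final=True)
    return computed_status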
Apr 6 2017
Case 2) CIDB status == aborted
https://uberchromegw.corp.google.com/i/chromeos/builders/lumpy-incremental-paladin/builds/11673
https://viceroy.corp.google.com/chromeos/build_details?waterfall=chromeos&build_config=lumpy-incremental-paladin&build_number=11673&milestone_version=&experimental=False&duration=1d&refresh=-1
Multiple stages have a start time after the build finish time. Is this expected? crbug.com/707894 will have information indicating that each of the slaves failed for this reason. It has been causing issues with graphing: if a stage has no finish time, we fall back to the build's finish time, which happens to be before the stage's start time, and that breaks things. Should I just be correcting that timestamp? (There's a lot of this data massaging resulting from the data being unclean, and I'm concerned because it will need to be replicated in all usages of the data and could potentially result in inconsistencies.)

The CleanUp stage of the next master found that lumpy-incremental-paladin/11673 was still running, so it sent the cancel request and inserted an 'aborted' build status (with its current_timestamp) for this build. The build took time to receive the cancellation signal, which is why some stages even started after the build was marked as 'aborted' (~3 minutes in this case). The master has no knowledge of when the build will really be killed; I think the code that generates the graph could use MAX(stage_start_time, build_finish_time) for this special case?
Apr 6 2017
Case 3) CIDB status == inflight
https://uberchromegw.corp.google.com/i/chromeos/builders/lumpy-incremental-paladin/builds/11674
https://viceroy.corp.google.com/chromeos/build_details?waterfall=chromeos&build_config=lumpy-incremental-paladin&build_number=11674&milestone_version=&experimental=False&duration=1d&refresh=-1
This should be addressed by crbug.com/707897 and become expired, right? This was from the golo losing network, right?
https://uberchromegw.corp.google.com/i/chromeos/builders/lumpy-incremental-paladin/builds/11674/steps/steps/logs/interrupt reports:
[Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
Is it safe to key off of this in SoM as a sign of that sort of failure?

This wasn't aborted by Buildbucket. As you pointed out, it was caused by network connection issues on that buildslave. I think it's safe to categorize this as an external failure. The Buildbucket response shows:
"build": {
  "status": "COMPLETED",
  "failure_reason": "INFRA_FAILURE",
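If SoM does want to key off this, a rough sketch of the classification might look like the following. Only the status/failure_reason fields come from the response quoted above; the function name and the category labels are illustrative, not an existing chromite or SoM API:

# Hedged sketch: bucket a Buildbucket build dict like the one quoted above
# into a coarse category for SoM.
def classify_buildbucket_result(build):
    status = build.get('status')
    if status != 'COMPLETED':
        return 'still_running_or_scheduled'
    reason = build.get('failure_reason')
    if reason == 'INFRA_FAILURE':
        # e.g. network loss on the buildslave, twisted ConnectionLost.
        return 'external_failure'
    if reason:
        return 'build_failure'
    return 'success'

# With the quoted response:
# classify_buildbucket_result({'status': 'COMPLETED',
#                              'failure_reason': 'INFRA_FAILURE'})
# -> 'external_failure'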
Apr 6 2017
Case 4) CIDB status == inflight
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/14147
https://viceroy.corp.google.com/chromeos/build_details?waterfall=chromeos&build_config=master-paladin&build_number=14147&milestone_version=&experimental=False&duration=1d&refresh=-1
This should be addressed by crbug.com/707897 and become expired, right? This was from someone clicking cancel? If we do this via a cronjob and parse the reason, can we detect why it was aborted? In that case, does it become CIDB status aborted or expired? If it's aborted, we can then use buildMessageTable to differentiate between it and a self-destructing CQ as well. Can we transitively apply this aborted state to all the slaves that end up being restarted when the master restarts?

This is master-paladin, which wasn't scheduled by Buildbucket. This should be someone clicking the 'cancel' button. crbug.com/707897 should mark this build as expired, and its slaves would be marked as 'aborted' by the next master-paladin build.
Apr 6 2017
Re c#3: Case 1) It feels as if there is this priority of statuses: pass > aborted > fail. If a build is aborted, it should not be marked as fail, because that could be incorrect and could introduce noise to the data. But marking it as pass if it completes later is fine.
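A tiny sketch of what that precedence could look like in code; the resolve helper and the numeric ranks are purely illustrative, just echoing the pass > aborted > fail ordering suggested above:

# Illustrative only: a later status overwrites an earlier one only if it
# outranks it under pass > aborted > fail.
_STATUS_RANK = {'fail': 0, 'aborted': 1, 'pass': 2}

def resolve_status(current, incoming):
    """Return the status that should be recorded given the precedence above."""
    if current is None:
        return incoming
    if _STATUS_RANK.get(incoming, -1) > _STATUS_RANK.get(current, -1):
        return incoming
    return current

# e.g. resolve_status('aborted', 'fail') -> 'aborted'
#      resolve_status('aborted', 'pass') -> 'pass'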
Apr 6 2017
Re c#4: Case 2) It is possible to change the graphing code, but as I pointed out, every place that potentially deals with this data now needs to handle these special cases. In particular, this case becomes: stage_end = max(stage_start, min(stage_end, build_end, now)). Also, the incomplete data makes it harder to get useful information from CIDB about how long stages and builds took. This data is used by GoldenEye to indicate how a stage's duration compares to its average, and is mined to answer queries about slow stages.
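For concreteness, here is a hedged sketch of that special-case handling; only the clamping expression itself comes from the formula above, and the argument names are illustrative rather than actual CIDB column names:

from datetime import datetime

def effective_stage_times(stage_start, stage_end, build_end, now=None):
    """Sanitize stage timestamps per
    stage_end = max(stage_start, min(stage_end, build_end, now)).

    stage_end and build_end may be None for inflight/aborted rows.
    """
    now = now or datetime.utcnow()
    candidates = [t for t in (stage_end, build_end, now) if t is not None]
    clamped_end = max(stage_start, min(candidates))
    return stage_start, clamped_end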
Apr 6 2017
Re #7: The build status is reported in the Report stage of a build. A build may insert a 'pass' build status at the beginning of the Report stage but get aborted in the middle of the Report stage or in the following buildbot stages. The build will then be shown as a failed build on the waterfall page. Marking these builds as aborted doesn't hurt: all their stages are either passed or inflight, so viceroy won't show this build as an alert.
Apr 6 2017
Re #8: As I pointed out, when the master sends out the cancel request and marks the build as 'aborted' in CIDB, it has no control over when the build is really killed. If the data really matters, one way I can think of is to adjust the time in the crbug.com/707897 job: when the job checks an already-finished build, it could mark the right status and correct the finish time.
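A rough sketch of the kind of reconciliation pass that job could run; the db/waterfall helpers and field names here are hypothetical placeholders, not the actual crbug.com/707897 implementation:

# Hypothetical reconciliation pass: for builds the master already marked
# 'aborted', if the builder actually finished, record the real finish time
# (and, if desired, the real terminal status) instead of the master's guess.
def reconcile_aborted_build(db, waterfall, build_row):
    """db wraps CIDB, waterfall can fetch the buildbot JSON for a build;
    build_row is a dict-like CIDB build record."""
    if build_row['status'] != 'aborted':
        return
    actual = waterfall.fetch_build(build_row['builder_name'],
                                   build_row['build_number'])
    if actual is not None and actual.get('finished'):
        db.update_build(build_row['id'], finish_time=actual['finish_time'])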
Apr 6 2017
Re: #10 Case 2) Could we have a view of buildStageTable? The view would essentially replace the status/finish_time of inflight build stages based on the build table and the current time. It would enforce:
- replace an inflight status with aborted, timed out or expired based on the build status
- for non-inflight build stages, enforce start time <= finish time
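Purely as an illustration of what such a view might enforce (the view name and all column names other than the buildStageTable/buildTable table names mentioned in this bug are assumptions about the schema, not verified):

# Hypothetical sketch of the proposed view, expressed as the SQL a migration
# might apply. Not the actual CIDB schema.
CREATE_CLEANED_STAGE_VIEW = """
CREATE VIEW cleanedBuildStageView AS
SELECT
  s.id,
  s.build_id,
  s.name,
  CASE
    WHEN s.status = 'inflight' AND b.status IN ('aborted', 'expired')
      THEN b.status              -- inherit the terminal build status
    ELSE s.status
  END AS status,
  s.start_time,
  CASE
    WHEN s.status = 'inflight'
      THEN LEAST(COALESCE(b.finish_time, NOW()), NOW())
    ELSE GREATEST(s.start_time, s.finish_time)  -- enforce start <= finish
  END AS finish_time
FROM buildStageTable s
JOIN buildTable b ON b.id = s.build_id;
"""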
Apr 6 2017
Re c#6: Case 4) I'd still like to transitively apply a build abort from the master to its slaves if there's no better information for the slave. This might be best done in Sheriff-o-Matic, however.
Apr 6 2017
Actually, if we're aborting a slave build, is it reasonable to look up the reason a master was aborted and add it to CIDB as a build message?
Apr 6 2017
Re #12, #13: for now, the build doesn't have information about why a master build was aborted; it could be network issues, waterfall restarts, or someone clicking the cancel button.
Apr 7 2017
Re c#14: As I mentioned, if we scan the interrupt data via buildbot we have information on the master build being aborted:

https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/14147/steps/steps/logs/interrupt
The web-page 'stop build' button was pressed by '<unknown>': cl/464366 chumped. Will cause merge issue.

https://uberchromegw.corp.google.com/i/chromeos/builders/lumpy-incremental-paladin/builds/11674/steps/steps/logs/interrupt
[Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.

I'm not sure about waterfall restarts, but it's better than nothing.
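A hedged sketch of how a scanner might classify those interrupt logs; the substring checks come from the two logs quoted above, while the function and the category labels are illustrative, not an existing chromite component:

# Illustrative only: classify a buildbot 'interrupt' log like the two quoted
# above into a coarse abort reason.
import urllib.request

def classify_interrupt(interrupt_log_url):
    try:
        text = urllib.request.urlopen(interrupt_log_url).read().decode(
            'utf-8', errors='replace')
    except Exception:
        return 'unknown'
    if "'stop build' button was pressed" in text:
        return 'manual_cancel'       # someone clicked cancel on the waterfall
    if 'twisted.internet.error.ConnectionLost' in text:
        return 'lost_connection'     # buildslave network issue
    return 'unknown'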
Apr 7 2017
Re #15: I didn't know we had the interruption information, and the second link you pasted is 404 now. This issue wasn't introduced by any recent changes. I'm not sure which component would scan the web page, parse the info and report to CIDB; akeshet@ can comment on #15.
Apr 10 2017
One thing I just realized is that som_alerts_dispatcher isn't picking up changes to CIDB state, because they happen after the alerts are generated (at the time of master completion). I need to think about how to address that.
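One possible shape for that, purely as a sketch (the refresh helper, the db accessor and the alert fields are hypothetical, not the existing som_alerts_dispatcher code): re-query CIDB for each build right before emitting alerts, so late status updates are still reflected.

# Hypothetical sketch only.
def refresh_statuses(db, alerts):
    """alerts is a list of alert dicts carrying a CIDB build_id; db is a
    placeholder CIDB accessor."""
    for alert in alerts:
        row = db.get_build_row(alert['build_id'])
        if row and row['status'] != alert.get('cidb_status'):
            alert['cidb_status'] = row['status']  # pick up the late update
    return alerts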
May 9 2017
Observed the issue again today. https://uberchromegw.corp.google.com/i/chromeos/builders/butterfly-paladin/builds/22280/steps/steps/logs/interrupt
May 9 2017
This happened again with: https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/14525
Specifically: https://luci-milo.appspot.com/buildbot/chromeos/butterfly-paladin/22280 (build id 1503350)
Missing in CIDB:

mysql> select * from buildMessageTable where build_id = 1503293 order by message_value;
+---------+----------+----------------+------------------+---------------+---------------------+-------+
| id      | build_id | message_type   | message_subtype  | message_value | timestamp           | board |
+---------+----------+----------------+------------------+---------------+---------------------+-------+
| 1116571 | 1503293  | ignored_reason | self_destruction | 1503351       | 2017-05-08 23:28:04 | NULL  |
| 1116577 | 1503293  | ignored_reason | self_destruction | 1503353       | 2017-05-08 23:28:04 | NULL  |
| 1116573 | 1503293  | ignored_reason | self_destruction | 1503354       | 2017-05-08 23:28:04 | NULL  |
| 1116584 | 1503293  | ignored_reason | self_destruction | 1503355       | 2017-05-08 23:28:05 | NULL  |
| 1116566 | 1503293  | ignored_reason | self_destruction | 1503356       | 2017-05-08 23:28:03 | NULL  |
| 1116585 | 1503293  | ignored_reason | self_destruction | 1503357       | 2017-05-08 23:28:05 | NULL  |
| 1116599 | 1503293  | ignored_reason | self_destruction | 1503358       | 2017-05-08 23:28:07 | NULL  |
| 1116572 | 1503293  | ignored_reason | self_destruction | 1503361       | 2017-05-08 23:28:04 | NULL  |
| 1116586 | 1503293  | ignored_reason | self_destruction | 1503363       | 2017-05-08 23:28:06 | NULL  |
| 1116590 | 1503293  | ignored_reason | self_destruction | 1503364       | 2017-05-08 23:28:06 | NULL  |
| 1116601 | 1503293  | ignored_reason | self_destruction | 1503365       | 2017-05-08 23:28:07 | NULL  |
| 1116587 | 1503293  | ignored_reason | self_destruction | 1503366       | 2017-05-08 23:28:06 | NULL  |
| 1116582 | 1503293  | ignored_reason | self_destruction | 1503367       | 2017-05-08 23:28:05 | NULL  |
| 1116602 | 1503293  | ignored_reason | self_destruction | 1503369       | 2017-05-08 23:28:07 | NULL  |
| 1116578 | 1503293  | ignored_reason | self_destruction | 1503370       | 2017-05-08 23:28:05 | NULL  |
| 1116579 | 1503293  | ignored_reason | self_destruction | 1503371       | 2017-05-08 23:28:05 | NULL  |
| 1116592 | 1503293  | ignored_reason | self_destruction | 1503373       | 2017-05-08 23:28:06 | NULL  |
| 1116588 | 1503293  | ignored_reason | self_destruction | 1503375       | 2017-05-08 23:28:06 | NULL  |
| 1116570 | 1503293  | ignored_reason | self_destruction | 1503376       | 2017-05-08 23:28:04 | NULL  |
| 1116597 | 1503293  | ignored_reason | self_destruction | 1503377       | 2017-05-08 23:28:07 | NULL  |
| 1116580 | 1503293  | ignored_reason | self_destruction | 1503378       | 2017-05-08 23:28:05 | NULL  |
| 1116568 | 1503293  | ignored_reason | self_destruction | 1503381       | 2017-05-08 23:28:03 | NULL  |
| 1116589 | 1503293  | ignored_reason | self_destruction | 1503382       | 2017-05-08 23:28:06 | NULL  |
| 1116591 | 1503293  | ignored_reason | self_destruction | 1503383       | 2017-05-08 23:28:06 | NULL  |
| 1116594 | 1503293  | ignored_reason | self_destruction | 1503384       | 2017-05-08 23:28:06 | NULL  |
| 1116576 | 1503293  | ignored_reason | self_destruction | 1503385       | 2017-05-08 23:28:04 | NULL  |
| 1116596 | 1503293  | ignored_reason | self_destruction | 1503386       | 2017-05-08 23:28:07 | NULL  |
| 1116581 | 1503293  | ignored_reason | self_destruction | 1503388       | 2017-05-08 23:28:05 | NULL  |
| 1116564 | 1503293  | ignored_reason | self_destruction | 1503392       | 2017-05-08 23:28:03 | NULL  |
| 1116593 | 1503293  | ignored_reason | self_destruction | 1503394       | 2017-05-08 23:28:06 | NULL  |
| 1116574 | 1503293  | ignored_reason | self_destruction | 1503395       | 2017-05-08 23:28:04 | NULL  |
| 1116595 | 1503293  | ignored_reason | self_destruction | 1503398       | 2017-05-08 23:28:07 | NULL  |
| 1116567 | 1503293  | ignored_reason | self_destruction | 1503399       | 2017-05-08 23:28:03 | NULL  |
| 1116569 | 1503293  | ignored_reason | self_destruction | 1503403       | 2017-05-08 23:28:03 | NULL  |
| 1116583 | 1503293  | ignored_reason | self_destruction | 1503405       | 2017-05-08 23:28:05 | NULL  |
| 1116575 | 1503293  | ignored_reason | self_destruction | 1503406       | 2017-05-08 23:28:04 | NULL  |
| 1116565 | 1503293  | ignored_reason | self_destruction | 1503407       | 2017-05-08 23:28:03 | NULL  |
| 1116598 | 1503293  | ignored_reason | self_destruction | 1503408       | 2017-05-08 23:28:07 | NULL  |
| 1116600 | 1503293  | ignored_reason | self_destruction | 1503412       | 2017-05-08 23:28:07 | NULL  |
+---------+----------+----------------+------------------+---------------+---------------------+-------+
39 rows in set (0.12 sec)
May 9 2017
Looks like butterfly-paladin failed because of the buildbot network issue; it's not ignored by the master.
6:27:49: INFO: Build config butterfly-paladin completed with status "FAILURE".
May 11 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/9f9230f8b69c99693b617ba9696c18831a9a2850

commit 9f9230f8b69c99693b617ba9696c18831a9a2850
Author: David Riley <davidriley@chromium.org>
Date: Thu May 11 03:58:10 2017

som_alerts_dispatcher: Add link to YAQS for indeterminate CIDB statuses.

BUG=chromium:708392
TEST=som_alerts_dispatcher CREDS 1503293,1000

Change-Id: Ibc1c48645473c5e3cd81dc68a18219e06cf234cb
Reviewed-on: https://chromium-review.googlesource.com/501130
Commit-Ready: David Riley <davidriley@chromium.org>
Tested-by: David Riley <davidriley@chromium.org>
Reviewed-by: Ningning Xia <nxia@chromium.org>

[modify] https://crrev.com/9f9230f8b69c99693b617ba9696c18831a9a2850/scripts/som_alerts_dispatcher.py