
Issue 708392

Issue metadata

Status: Archived
Owner:
Closed: Mar 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




Inconsistent build status in CIDB and waterfall

Project Member Reported by nxia@chromium.org, Apr 5 2017

Issue description



CIDB status == fail
https://uberchromegw.corp.google.com/i/chromeos/builders/lumpy-incremental-paladin/builds/11670
https://viceroy.corp.google.com/chromeos/build_details?waterfall=chromeos&build_config=lumpy-incremental-paladin&build_number=11670&milestone_version=&experimental=False&duration=1d&refresh=-1
master: https://viceroy.corp.google.com/chromeos/build_details?build_id=1414289
Why is this marked fail?  Shouldn't this be marked as aborted?

CIDB status == aborted
https://uberchromegw.corp.google.com/i/chromeos/builders/lumpy-incremental-paladin/builds/11673
https://viceroy.corp.google.com/chromeos/build_details?waterfall=chromeos&build_config=lumpy-incremental-paladin&build_number=11673&milestone_version=&experimental=False&duration=1d&refresh=-1 
Multiple stages have a start time after the build finish time.  Is this expected?  crbug.com/707894 will have information indicating that each of the slaves was failed for this reason.  It's been causing issues with graphing: if a stage has no finish time, we use the finish time of the build, which happens to be before the start time of the stage, and that breaks things.  Should I just be correcting that timestamp?  (There's a lot of this data massaging resulting from the data being unclean, and I'm concerned because it will need to be replicated in all usages of the data, potentially resulting in inconsistencies.)

CIDB status == inflight
https://uberchromegw.corp.google.com/i/chromeos/builders/lumpy-incremental-paladin/builds/11674
https://viceroy.corp.google.com/chromeos/build_details?waterfall=chromeos&build_config=lumpy-incremental-paladin&build_number=11674&milestone_version=&experimental=False&duration=1d&refresh=-1
This should be addressed by crbug.com/707897 and become expired, right?  This was from the golo losing network right?
https://uberchromegw.corp.google.com/i/chromeos/builders/lumpy-incremental-paladin/builds/11674/steps/steps/logs/interrupt reports:
[Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
Is it safe to key off of this in SoM as a sign of that sort of failure?

CIDB status == inflight
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/14147
https://viceroy.corp.google.com/chromeos/build_details?waterfall=chromeos&build_config=master-paladin&build_number=14147&milestone_version=&experimental=False&duration=1d&refresh=-1
This should be addressed by crbug.com/707897 and become expired, right?  This was from someone clicking cancel?
If we do this via a cronjob and parse the reason, we can detect why it was aborted?  In that case, does it become CIDB status aborted or expired?  If it's aborted, we can then use buildMessageTable to differentiate between it and the self-destructing CQ as well.  Can we transitively apply this aborted state to all the slaves that end up being restarted when the master restarts?
 
Cc: akes...@chromium.org dgarr...@chromium.org davidri...@chromium.org
Cc: jinjingl@chromium.org

Comment 3 by nxia@chromium.org, Apr 5 2017

Case 1)

CIDB status == fail
https://uberchromegw.corp.google.com/i/chromeos/builders/lumpy-incremental-paladin/builds/11670
https://viceroy.corp.google.com/chromeos/build_details?waterfall=chromeos&build_config=lumpy-incremental-paladin&build_number=11670&milestone_version=&experimental=False&duration=1d&refresh=-1
master: https://viceroy.corp.google.com/chromeos/build_details?build_id=1414289
Why is this marked fail?  Shouldn't this be marked as aborted?

The CleanUp stage of the next master canceled lumpy-incremental-paladin/11670 and inserted an 'aborted' build status at 13:34:*.  The build got the kill signal: a _ShutDownException was inserted in cidb.failureTable for BuildPackagesStage, a 'fail' stage status was reported for BuildPackagesStage, and a 'fail' build status was reported at 13:38:*.

I guess it's because BuildPackagesStage was running the command line to build packages when the _ShutDownException was caught and reported, so the Report stage got run and marked the build as 'fail'.

http://shortn/_RRi305xFXN

Proposed fix: after the CleanUp stage of the next master marks the build as 'aborted' with 'final=True', the build can no longer update its own build status.  But there's another potential inconsistency: after the master sends the cancel request and inserts the 'aborted' entry, there's a delay (3~5 mins, to my knowledge) before the Buildbucket server really cancels the build, during which time the build can either fail or pass.  It should be fine to still treat this as an aborted build, since the master sent out the abort request while the build was still running.
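
A minimal sketch of that guard, assuming a hypothetical helper around the CIDB connection (the real cidb.py API and column names may differ):

def UpdateBuildStatusIfNotFinal(db, build_id, status):
  """Update a build's status only if the build has not been finalized.

  Returns the number of rows changed; 0 means the next master already
  marked the build final (e.g. 'aborted'), so the build's own Report
  stage cannot overwrite that status with 'fail' or 'pass'.
  """
  # Hypothetical query; column names follow the discussion in this bug.
  return db.execute(
      'UPDATE buildTable '
      'SET status = %s, finish_time = NOW() '
      'WHERE id = %s AND final != 1',
      (status, build_id))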

Comment 4 by nxia@chromium.org, Apr 6 2017

Case 2):

CIDB status == aborted
https://uberchromegw.corp.google.com/i/chromeos/builders/lumpy-incremental-paladin/builds/11673
https://viceroy.corp.google.com/chromeos/build_details?waterfall=chromeos&build_config=lumpy-incremental-paladin&build_number=11673&milestone_version=&experimental=False&duration=1d&refresh=-1 
Multiple stages have a start time after the build finish time.  Is this expected?  crbug.com/707894 will have information indicating that each of the slaves was failed for this reason.  It's been causing issues with graphing: if a stage has no finish time, we use the finish time of the build, which happens to be before the start time of the stage, and that breaks things.  Should I just be correcting that timestamp?  (There's a lot of this data massaging resulting from the data being unclean, and I'm concerned because it will need to be replicated in all usages of the data, potentially resulting in inconsistencies.)


The CleanUp stage of the next master found that lumpy-incremental-paladin/11673 was still running, so it sent the cancel request and inserted an 'aborted' build status (with its current timestamp) for this build.  The build took time to receive the cancellation signal, which is why some stages started even after the build was marked as 'aborted' (~3 mins in this case).  The master doesn't know when the build is really going to be killed; I think the code that generates the graph could use MAX(stage_start_time, build_finish_time) for this special case?

Comment 5 by nxia@chromium.org, Apr 6 2017

CIDB status == inflight
https://uberchromegw.corp.google.com/i/chromeos/builders/lumpy-incremental-paladin/builds/11674
https://viceroy.corp.google.com/chromeos/build_details?waterfall=chromeos&build_config=lumpy-incremental-paladin&build_number=11674&milestone_version=&experimental=False&duration=1d&refresh=-1
This should be addressed by crbug.com/707897 and become expired, right?  This was from the golo losing network right?
https://uberchromegw.corp.google.com/i/chromeos/builders/lumpy-incremental-paladin/builds/11674/steps/steps/logs/interrupt reports:
[Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
Is it safe to key off of this in SoM as a sign of that sort of failure?


This wasn't aborted by Buildbucket.  As you pointed out, it was caused by some network connection issues on that buildslave.  I think it's safe to categorize this as an external failure.
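
If SoM does key off this, a rough sketch of the check (the field names come from the Buildbucket excerpt quoted just below; the returned category string is only illustrative):

def ClassifyBuildbucketResult(build):
  """Map the quoted Buildbucket fields to a rough SoM failure category (sketch)."""
  if (build.get('status') == 'COMPLETED' and
      build.get('failure_reason') == 'INFRA_FAILURE'):
    # Matches this case: the buildslave lost its connection uncleanly.
    return 'external-infra-failure'
  return 'unclassified'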

 "build": {
  "status": "COMPLETED",
  "failure_reason": "INFRA_FAILURE",

Comment 6 by nxia@chromium.org, Apr 6 2017

Case 4)

CIDB status == inflight
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/14147
https://viceroy.corp.google.com/chromeos/build_details?waterfall=chromeos&build_config=master-paladin&build_number=14147&milestone_version=&experimental=False&duration=1d&refresh=-1
This should be addressed by crbug.com/707897 and become expired, right?  This was from someone clicking cancel?
If we do this via a cronjob and parse the reason, we can detect why it was aborted?  In that case, does it become CIDB status aborted or expired?  If it's aborted, we can then use buildMessageTable to differentiate between it and the self-destructing CQ as well.  Can we transitively apply this aborted state to all the slaves that end up being restarted when the master restarts?
 

This is master-paladin, which wasn't scheduled by Buildbucket.  This should be someone clicking the 'cancel' button.  crbug.com/707897 should mark this build as expired, and its slaves would be marked as 'aborted' by the next master-paladin build.

Re c#3:
Case 1)

It feels as if there is a priority of statuses: pass > aborted > fail.  If a build is aborted, it should not be marked as fail, because that could be incorrect and could introduce noise into the data.  But marking it as pass if it completes later is fine.
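
A tiny sketch of that priority rule, assuming a status may only ever be replaced by a higher-priority one (the helper and its name are illustrative, not existing code):

# Higher value wins when two writers disagree on the final status.
_STATUS_PRIORITY = {'fail': 0, 'aborted': 1, 'pass': 2}

def ResolveStatus(current, proposed):
  """Keep whichever status has the higher priority (pass > aborted > fail)."""
  if _STATUS_PRIORITY.get(proposed, -1) > _STATUS_PRIORITY.get(current, -1):
    return proposed
  return current

# ResolveStatus('aborted', 'fail') -> 'aborted'; ResolveStatus('aborted', 'pass') -> 'pass'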


Re c#4:
Case 2)
It is possible to change the graphing code, but as I pointed out, every place that potentially deals with this data now needs to handle these special cases.  In particular, this case becomes:
stage_end = max(stage_start, min(stage_end, build_end, now))

Also, the incomplete data makes it harder to get useful information from CIDB about how long stages and builds took.  This data is used by GoldenEye to indicate how a stage's duration compares to the average, and is mined to answer queries about slow stages.
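
As a sketch, the correction above could live in one helper, assuming all values are datetimes and that a missing stage finish time has already been defaulted to the build finish time (names are illustrative):

import datetime

def CorrectedStageEnd(stage_start, stage_end, build_end):
  """Clamp a stage's finish time so it is never before its start time.

  Implements stage_end = max(stage_start, min(stage_end, build_end, now)).
  """
  now = datetime.datetime.utcnow()
  return max(stage_start, min(stage_end, build_end, now))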

Comment 9 by nxia@chromium.org, Apr 6 2017

Re #7:

The build status is reported in the Report stage of a build.  A build may insert a 'pass' build status at the beginning of the Report stage, but get aborted in the middle of the Report stage or in the following buildbot stages.  The build will be shown as a failed build on the waterfall page.  Marking these builds as aborted doesn't hurt, as all their stages are either passed or inflight, so viceroy won't show this build as an alert.

Comment 10 by nxia@chromium.org, Apr 6 2017

Re #8:

As I pointed out, when the master sends out the cancel request and marks the build as 'aborted' in CIDB, it doesn't have control over when the build is really killed.  If the data really matters, one way I can think of is to adjust the time in the crbug.com/707897 job: when the job checks an already finished build, it could mark the right status and correct the finish time.
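
A hedged sketch of what that adjustment could look like (the helper and its inputs are hypothetical; column names follow the queries quoted in this bug):

def RepairAbortedBuild(db, build_id, actual_status, actual_finish_time):
  """When the periodic job finds a build that was marked 'aborted' but
  actually ran to completion, record its real finish time and, per the
  pass > aborted > fail priority discussed above, let a 'pass' replace
  the 'aborted' status.
  """
  new_status = 'pass' if actual_status == 'pass' else 'aborted'
  db.execute(
      'UPDATE buildTable SET status = %s, finish_time = %s WHERE id = %s',
      (new_status, actual_finish_time, build_id))
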
Re: #10
Case 2)
Could we have a view of buildStageTable?  The view would essentially replace the inflight status / finish_time of build stages based on the build table and the current time (a sketch follows after the list below).

It would enforce:
- replace an inflight status with aborted, timed out, or expired based on the build status
- for non-inflight build stages, enforce start time <= finish time
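
Something along these lines, as a sketch only: the column names are guesses based on the queries quoted in this bug, not the actual CIDB schema, and the SQL is shown as a plain string for readability:

# Hypothetical view definition enforcing the two rules above.
BUILD_STAGE_VIEW_SQL = """
CREATE VIEW buildStageView AS
SELECT
  s.id,
  s.build_id,
  -- Rule 1: an inflight stage inherits a terminal status from its build
  -- (a timed-out build status would follow the same pattern).
  CASE
    WHEN s.status = 'inflight' AND b.status IN ('aborted', 'expired')
      THEN b.status
    ELSE s.status
  END AS status,
  s.start_time,
  -- Rule 2: never report a finish time before the start time.
  GREATEST(s.start_time,
           COALESCE(s.finish_time, LEAST(b.finish_time, NOW()))) AS finish_time
FROM buildStageTable s
JOIN buildTable b ON b.id = s.build_id
"""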

Re c#6:
Case 4)

I'd still like to transitively apply a build abort from the master to its slaves if there's no better information for the slave.  This might be best done in Sheriff-o-Matic, however.
Actually, if we're aborting a slave build, is it reasonable to look up the reason a master was aborted and add it to CIDB as a build message?
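
A sketch of that propagation, with heavy caveats: the helper is hypothetical, the master_build_id column and the 'aborted_reason' message type are guesses modeled on the buildMessageTable rows shown elsewhere in this bug, and a real implementation would presumably go through the existing CIDB layer rather than raw SQL:

def PropagateMasterAbort(db, master_build_id, reason):
  """Mark a master's still-inflight slaves aborted and record why (sketch)."""
  slaves = db.execute(
      "SELECT id FROM buildTable "
      "WHERE master_build_id = %s AND status = 'inflight'",
      (master_build_id,))
  for (slave_id,) in slaves:
    db.execute(
        "UPDATE buildTable SET status = 'aborted', final = 1 WHERE id = %s",
        (slave_id,))
    # Build message so SoM can tell this apart from a self-destructing CQ.
    db.execute(
        'INSERT INTO buildMessageTable '
        '(build_id, message_type, message_subtype, message_value) '
        'VALUES (%s, %s, %s, %s)',
        (slave_id, 'aborted_reason', 'master_aborted', reason))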

Comment 14 by nxia@chromium.org, Apr 6 2017

Re #12, #13: for now, the build doesn't have information about why a master build is aborted; it could be network issues, waterfall restarts, or someone clicking the cancel button.
Re c#14:

As I mentioned, if we scan the interrupt data via buildbot, we have information on the master build being aborted:
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/14147/steps/steps/logs/interrupt
The web-page 'stop build' button was pressed by '<unknown>': cl/464366 chumped. Will cause merge issue.

https://uberchromegw.corp.google.com/i/chromeos/builders/lumpy-incremental-paladin/builds/11674/steps/steps/logs/interrupt 
[Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.

I'm not sure about waterfall restarts, but it's better than nothing.
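
A rough sketch of keying off those two interrupt messages (both patterns are quoted verbatim above; anything else stays unclassified):

def ClassifyInterruptLog(text):
  """Best-effort classification of a buildbot 'interrupt' log (sketch)."""
  if "The web-page 'stop build' button was pressed" in text:
    return 'cancelled-by-user'
  if 'twisted.internet.error.ConnectionLost' in text:
    return 'slave-connection-lost'
  return 'unknown'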

Comment 16 by nxia@chromium.org, Apr 7 2017

Re #15:

Didn't know we had the interrupt information, and the second link you pasted is 404 now.  This issue wasn't introduced by any recent changes.  Not sure which component would scan the web page, parse the info, and report it to CIDB.

akeshet@ can comment on #15.
One thing I just realized is that som_alerts_dispatcher isn't picking up changes to CIDB state, because they happen after the alerts are generated (at the time of master completion).  I need to think about how to address that.
This happened again with:
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/14525
Specifically:
https://luci-milo.appspot.com/buildbot/chromeos/butterfly-paladin/22280 (build id 1503350)

Missing in CIDB:
mysql> select * from buildMessageTable where build_id = 1503293 order by message_value;
+---------+----------+----------------+------------------+---------------+---------------------+-------+
| id      | build_id | message_type   | message_subtype  | message_value | timestamp           | board |
+---------+----------+----------------+------------------+---------------+---------------------+-------+
| 1116571 |  1503293 | ignored_reason | self_destruction | 1503351       | 2017-05-08 23:28:04 | NULL  |
| 1116577 |  1503293 | ignored_reason | self_destruction | 1503353       | 2017-05-08 23:28:04 | NULL  |
| 1116573 |  1503293 | ignored_reason | self_destruction | 1503354       | 2017-05-08 23:28:04 | NULL  |
| 1116584 |  1503293 | ignored_reason | self_destruction | 1503355       | 2017-05-08 23:28:05 | NULL  |
| 1116566 |  1503293 | ignored_reason | self_destruction | 1503356       | 2017-05-08 23:28:03 | NULL  |
| 1116585 |  1503293 | ignored_reason | self_destruction | 1503357       | 2017-05-08 23:28:05 | NULL  |
| 1116599 |  1503293 | ignored_reason | self_destruction | 1503358       | 2017-05-08 23:28:07 | NULL  |
| 1116572 |  1503293 | ignored_reason | self_destruction | 1503361       | 2017-05-08 23:28:04 | NULL  |
| 1116586 |  1503293 | ignored_reason | self_destruction | 1503363       | 2017-05-08 23:28:06 | NULL  |
| 1116590 |  1503293 | ignored_reason | self_destruction | 1503364       | 2017-05-08 23:28:06 | NULL  |
| 1116601 |  1503293 | ignored_reason | self_destruction | 1503365       | 2017-05-08 23:28:07 | NULL  |
| 1116587 |  1503293 | ignored_reason | self_destruction | 1503366       | 2017-05-08 23:28:06 | NULL  |
| 1116582 |  1503293 | ignored_reason | self_destruction | 1503367       | 2017-05-08 23:28:05 | NULL  |
| 1116602 |  1503293 | ignored_reason | self_destruction | 1503369       | 2017-05-08 23:28:07 | NULL  |
| 1116578 |  1503293 | ignored_reason | self_destruction | 1503370       | 2017-05-08 23:28:05 | NULL  |
| 1116579 |  1503293 | ignored_reason | self_destruction | 1503371       | 2017-05-08 23:28:05 | NULL  |
| 1116592 |  1503293 | ignored_reason | self_destruction | 1503373       | 2017-05-08 23:28:06 | NULL  |
| 1116588 |  1503293 | ignored_reason | self_destruction | 1503375       | 2017-05-08 23:28:06 | NULL  |
| 1116570 |  1503293 | ignored_reason | self_destruction | 1503376       | 2017-05-08 23:28:04 | NULL  |
| 1116597 |  1503293 | ignored_reason | self_destruction | 1503377       | 2017-05-08 23:28:07 | NULL  |
| 1116580 |  1503293 | ignored_reason | self_destruction | 1503378       | 2017-05-08 23:28:05 | NULL  |
| 1116568 |  1503293 | ignored_reason | self_destruction | 1503381       | 2017-05-08 23:28:03 | NULL  |
| 1116589 |  1503293 | ignored_reason | self_destruction | 1503382       | 2017-05-08 23:28:06 | NULL  |
| 1116591 |  1503293 | ignored_reason | self_destruction | 1503383       | 2017-05-08 23:28:06 | NULL  |
| 1116594 |  1503293 | ignored_reason | self_destruction | 1503384       | 2017-05-08 23:28:06 | NULL  |
| 1116576 |  1503293 | ignored_reason | self_destruction | 1503385       | 2017-05-08 23:28:04 | NULL  |
| 1116596 |  1503293 | ignored_reason | self_destruction | 1503386       | 2017-05-08 23:28:07 | NULL  |
| 1116581 |  1503293 | ignored_reason | self_destruction | 1503388       | 2017-05-08 23:28:05 | NULL  |
| 1116564 |  1503293 | ignored_reason | self_destruction | 1503392       | 2017-05-08 23:28:03 | NULL  |
| 1116593 |  1503293 | ignored_reason | self_destruction | 1503394       | 2017-05-08 23:28:06 | NULL  |
| 1116574 |  1503293 | ignored_reason | self_destruction | 1503395       | 2017-05-08 23:28:04 | NULL  |
| 1116595 |  1503293 | ignored_reason | self_destruction | 1503398       | 2017-05-08 23:28:07 | NULL  |
| 1116567 |  1503293 | ignored_reason | self_destruction | 1503399       | 2017-05-08 23:28:03 | NULL  |
| 1116569 |  1503293 | ignored_reason | self_destruction | 1503403       | 2017-05-08 23:28:03 | NULL  |
| 1116583 |  1503293 | ignored_reason | self_destruction | 1503405       | 2017-05-08 23:28:05 | NULL  |
| 1116575 |  1503293 | ignored_reason | self_destruction | 1503406       | 2017-05-08 23:28:04 | NULL  |
| 1116565 |  1503293 | ignored_reason | self_destruction | 1503407       | 2017-05-08 23:28:03 | NULL  |
| 1116598 |  1503293 | ignored_reason | self_destruction | 1503408       | 2017-05-08 23:28:07 | NULL  |
| 1116600 |  1503293 | ignored_reason | self_destruction | 1503412       | 2017-05-08 23:28:07 | NULL  |
+---------+----------+----------------+------------------+---------------+---------------------+-------+
39 rows in set (0.12 sec)

Comment 21 by nxia@chromium.org, May 9 2017

Looks like butterfly-paladin failed because of the buildbot network issue; it's not ignored by the master.

6:27:49: INFO: Build config butterfly-paladin completed with status "FAILURE".

Project Member Comment 22 by bugdroid1@chromium.org, May 11 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/9f9230f8b69c99693b617ba9696c18831a9a2850

commit 9f9230f8b69c99693b617ba9696c18831a9a2850
Author: David Riley <davidriley@chromium.org>
Date: Thu May 11 03:58:10 2017

som_alerts_dispatcher: Add link to YAQS for indeterminate CIDB statuses.

BUG= chromium:708392 
TEST=som_alerts_dispatcher CREDS 1503293,1000

Change-Id: Ibc1c48645473c5e3cd81dc68a18219e06cf234cb
Reviewed-on: https://chromium-review.googlesource.com/501130
Commit-Ready: David Riley <davidriley@chromium.org>
Tested-by: David Riley <davidriley@chromium.org>
Reviewed-by: Ningning Xia <nxia@chromium.org>

[modify] https://crrev.com/9f9230f8b69c99693b617ba9696c18831a9a2850/scripts/som_alerts_dispatcher.py

Comment 23 by nxia@chromium.org, Mar 19 2018

Status: Fixed (was: Untriaged)

Comment 24 by nxia@chromium.org, Mar 19 2018

Status: Archived (was: Fixed)
