CommitQueueCompletion failed in UprevStage with KeyError for a slave |
||||||||
Issue descriptionhttps://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/15821 ************************************************************ ** PASS PublishUprevChanges (0:04:25) ************************************************************ Failed in stage CommitQueueCompletion: Traceback (most recent call last): File "/b/c/cbuild/repository/chromite/cbuildbot/stages/generic_stages.py", line 693, in Run self.PerformStage() File "/b/c/cbuild/repository/chromite/cbuildbot/stages/completion_stages.py", line 919, in PerformStage super(CommitQueueCompletionStage, self).PerformStage() File "/b/c/cbuild/repository/chromite/cbuildbot/stages/completion_stages.py", line 246, in PerformStage statuses = self._FetchSlaveStatuses() File "/b/c/cbuild/repository/chromite/cbuildbot/stages/completion_stages.py", line 169, in _FetchSlaveStatuses manager, build_id, db, builder_names, timeout)) File "/b/c/cbuild/repository/chromite/cbuildbot/stages/completion_stages.py", line 915, in _GetSlaveBuildStatus timeout=timeout) File "/b/c/cbuild/repository/chromite/cbuildbot/manifest_version.py", line 648, in GetBuildersStatus side_effect_func=_PrintRemainingTime) File "/b/c/cbuild/repository/chromite/lib/timeout_util.py", line 316, in WaitForSuccess return retry() File "/b/c/cbuild/repository/chromite/lib/timeout_util.py", line 290, in retry if not retry_check(value): File "/b/c/cbuild/repository/chromite/cbuildbot/manifest_version.py", line 644, in <lambda> lambda x: slave_status.ShouldWait(), File "/b/c/cbuild/repository/chromite/cbuildbot/build_status.py", line 474, in ShouldWait retval, slaves_remain, long_pole = self._ShouldWait() File "/b/c/cbuild/repository/chromite/cbuildbot/build_status.py", line 521, in _ShouldWait triage_relevant_changes.ShouldSelfDestruct()) File "/b/c/cbuild/repository/chromite/cbuildbot/relevant_changes.py", line 757, in ShouldSelfDestruct self._AllCompletedSlavesPassedUploadPrebuiltsStage()) File "/b/c/cbuild/repository/chromite/cbuildbot/relevant_changes.py", line 709, in _AllCompletedSlavesPassedUploadPrebuiltsStage not self.PassedAnyOfStages(self.slave_stages_dict[build_config], KeyError: 'lakitu-gpu-paladin'
,
Aug 23 2017
,
Aug 23 2017
Happened again: https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/15825 File "/b/c/cbuild/repository/chromite/cbuildbot/stages/generic_stages.py", line 693, in Run self.PerformStage() File "/b/c/cbuild/repository/chromite/cbuildbot/stages/completion_stages.py", line 919, in PerformStage super(CommitQueueCompletionStage, self).PerformStage() File "/b/c/cbuild/repository/chromite/cbuildbot/stages/completion_stages.py", line 246, in PerformStage statuses = self._FetchSlaveStatuses() File "/b/c/cbuild/repository/chromite/cbuildbot/stages/completion_stages.py", line 169, in _FetchSlaveStatuses manager, build_id, db, builder_names, timeout)) File "/b/c/cbuild/repository/chromite/cbuildbot/stages/completion_stages.py", line 915, in _GetSlaveBuildStatus timeout=timeout) File "/b/c/cbuild/repository/chromite/cbuildbot/manifest_version.py", line 648, in GetBuildersStatus side_effect_func=_PrintRemainingTime) File "/b/c/cbuild/repository/chromite/lib/timeout_util.py", line 316, in WaitForSuccess return retry() File "/b/c/cbuild/repository/chromite/lib/timeout_util.py", line 290, in retry if not retry_check(value): File "/b/c/cbuild/repository/chromite/cbuildbot/manifest_version.py", line 644, in <lambda> lambda x: slave_status.ShouldWait(), File "/b/c/cbuild/repository/chromite/cbuildbot/build_status.py", line 474, in ShouldWait retval, slaves_remain, long_pole = self._ShouldWait() File "/b/c/cbuild/repository/chromite/cbuildbot/build_status.py", line 521, in _ShouldWait triage_relevant_changes.ShouldSelfDestruct()) File "/b/c/cbuild/repository/chromite/cbuildbot/relevant_changes.py", line 740, in ShouldSelfDestruct self._ProcessCompletedBuilds() File "/b/c/cbuild/repository/chromite/cbuildbot/relevant_changes.py", line 580, in _ProcessCompletedBuilds build_config) File "/b/c/cbuild/repository/chromite/lib/builder_status_lib.py", line 357, in GetBuilderStatusForBuild build_config, self.cidb_info_dict, self.buildbucket_info_dict) File "/b/c/cbuild/repository/chromite/lib/builder_status_lib.py", line 321, in _GetDashboardUrl return buildbucket_info_dict[build_config].url KeyError: u'coral-paladin' [0m 10:08:29: INFO: Translating result u'coral-paladin' to fail. This time, I had added coral-paladin while the run was midway. --------------------------- This is now impacting my confidence in using the tree status for marking paladins experimental, as the failed CQ runs end up not submitting _anything_, so it's counter-productive.
,
Aug 23 2017
,
Aug 23 2017
[1;31m03:48:11: ERROR: Timeout getting experimental builders from the tree status.[0m Requests to tree_status timed out in one loop. At the beginning of the loop, experimental_builders from metadata were ["lakitu-gpu-paladin", "lakitu-paladin", "lakitu-st-paladin"], so buildbucket_info_dict and cidb_status_dict didn't query statuses for these three builders; then http://shortn/_doJcpU3gBw request timed out and experimental_builders were updated to an empty list; so _GetExpectedBuilders contained the experimental builders; then relevant_changes.TriageRelevantChanges threw the KeyError because statuses for theses three experimental builders were not fetched at beginning of the method.
,
Aug 24 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/893f101b5ab4b5ed408577a98b3fd35aceb79768 commit 893f101b5ab4b5ed408577a98b3fd35aceb79768 Author: Ningning Xia <nxia@chromium.org> Date: Thu Aug 24 04:17:27 2017 Update experimental builders at beginning of UpdateSlaveStatus Updating experimental builders in the middle of UpdateSlaveStatus will lead different process getting status from different expected builders. BUG= chromium:758247 TEST=run_tests Change-Id: I8029f07b15c6337b9b8edb77686cda8aaff9821e Reviewed-on: https://chromium-review.googlesource.com/629496 Commit-Ready: Ningning Xia <nxia@chromium.org> Tested-by: Ningning Xia <nxia@chromium.org> Reviewed-by: Ningning Xia <nxia@chromium.org> [modify] https://crrev.com/893f101b5ab4b5ed408577a98b3fd35aceb79768/lib/tree_status.py [modify] https://crrev.com/893f101b5ab4b5ed408577a98b3fd35aceb79768/cbuildbot/build_status.py
,
Aug 24 2017
Is this still causing problems?
@@@STEP_FAILURE@@@
07:41:07: ERROR: Traceback (most recent call last):
File "/b/c/cbuild/repository/chromite/cbuildbot/stages/generic_stages.py", line 693, in Run
self.PerformStage()
File "/b/c/cbuild/repository/chromite/cbuildbot/stages/completion_stages.py", line 919, in PerformStage
super(CommitQueueCompletionStage, self).PerformStage()
File "/b/c/cbuild/repository/chromite/cbuildbot/stages/completion_stages.py", line 246, in PerformStage
statuses = self._FetchSlaveStatuses()
File "/b/c/cbuild/repository/chromite/cbuildbot/stages/completion_stages.py", line 169, in _FetchSlaveStatuses
manager, build_id, db, builder_names, timeout))
File "/b/c/cbuild/repository/chromite/cbuildbot/stages/completion_stages.py", line 915, in _GetSlaveBuildStatus
timeout=timeout)
File "/b/c/cbuild/repository/chromite/cbuildbot/manifest_version.py", line 648, in GetBuildersStatus
side_effect_func=_PrintRemainingTime)
File "/b/c/cbuild/repository/chromite/lib/timeout_util.py", line 316, in WaitForSuccess
return retry()
File "/b/c/cbuild/repository/chromite/lib/timeout_util.py", line 290, in retry
if not retry_check(value):
File "/b/c/cbuild/repository/chromite/cbuildbot/manifest_version.py", line 644, in <lambda>
lambda x: slave_status.ShouldWait(),
File "/b/c/cbuild/repository/chromite/cbuildbot/build_status.py", line 479, in ShouldWait
retval, slaves_remain, long_pole = self._ShouldWait()
File "/b/c/cbuild/repository/chromite/cbuildbot/build_status.py", line 526, in _ShouldWait
triage_relevant_changes.ShouldSelfDestruct())
File "/b/c/cbuild/repository/chromite/cbuildbot/relevant_changes.py", line 757, in ShouldSelfDestruct
self._AllCompletedSlavesPassedUploadPrebuiltsStage())
File "/b/c/cbuild/repository/chromite/cbuildbot/relevant_changes.py", line 709, in _AllCompletedSlavesPassedUploadPrebuiltsStage
not self.PassedAnyOfStages(self.slave_stages_dict[build_config],
KeyError: 'lakitu-gpu-paladin'
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/15836/steps/CommitQueueCompletion/logs/stdio
This blocked the last 2 CQ runs.
Maybe the short-term answer is to stop using the tree status for setting EXPERIMENTAL? Does that help?
,
Aug 24 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/d309fc1d81014cf7c28ab7c001fd2e9a425f76bf commit d309fc1d81014cf7c28ab7c001fd2e9a425f76bf Author: Prathmesh Prabhu <pprabhu@chromium.org> Date: Thu Aug 24 17:18:30 2017 chromeos_config: Mark lakitu paladins experimental. These paladins are currently failing VMTest flakily due to GCE quota issues. A real fix in the CQ, until then, ignore failures. BUG=b/64934461 BUG= chromium:758247 TEST=unittests Change-Id: I1c8e7c8ad66e9b73fd42316af21d8774ace30abd Reviewed-on: https://chromium-review.googlesource.com/633723 Tested-by: Prathmesh Prabhu <pprabhu@chromium.org> Reviewed-by: Brian Norris <briannorris@chromium.org> Reviewed-by: Dan Shi <dshi@google.com> [modify] https://crrev.com/d309fc1d81014cf7c28ab7c001fd2e9a425f76bf/cbuildbot/config_dump.json [modify] https://crrev.com/d309fc1d81014cf7c28ab7c001fd2e9a425f76bf/cbuildbot/chromeos_config.py
,
Aug 25 2017
Trying to understand the status of this. Did the change in #6 fix the issue? #7 implies that it did not. I'm trying to understand why this feature is suddenly causing problems after having been launched and used somewhat regularly since then. It looks like there is some problem about builders being removed from the list of experimental builders during the polling of UpdateStatus and ShouldWait. My guess is that up until now we've been lucky enough to not make such a change at a problematic time during a CQ run. The last few problematic runs showed that the query to request the tree status had timed out sporadically (default timeout of 1 second), which would have the same effect of someone clearing the list of EXPERIMENTAL builders and putting it back again. This may have uncovered the problem. Ningning, is there anything else that you discovered when looking into this?
,
Aug 25 2017
Note that the first failure happened overnight when noone was toggling the tree status (afaict). This was triggered just by calls to tree status app timing out. The second failure was uncovered by my adding an experimental builder to the status during the run when that slave had already finished successfully.
,
Aug 25 2017
,
Aug 25 2017
> Note that the first failure happened overnight when noone was toggling the tree status (afaict). > This was triggered just by calls to tree status app timing out. Noted. If the call to query the tree status times out, it returns an empty list, so it's the same as if someone removed all EXPERIMENTAL builders. > The second failure was uncovered by my adding an experimental builder to the status during the run when that slave had already finished successfully. Thanks for the additional info. I wonder if this is just a case that got overlooked. I'll see if I can write a test for this.
,
Aug 25 2017
If the tree times out, it returns an empty list, and it is the same as someone removed the builders. but previously the case (or both cases) wasn't handled properly. The two CLs were to fix the timeout problem (or builders are toggled manually) during the waiting loop.
,
Aug 26 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/9b4fe414a6947999640d9d734bb782a9f8113b69 commit 9b4fe414a6947999640d9d734bb782a9f8113b69 Author: Prathmesh Prabhu <pprabhu@chromium.org> Date: Sat Aug 26 02:54:15 2017 Revert "chromeos_config: Mark lakitu paladins experimental." This reverts commit d309fc1d81014cf7c28ab7c001fd2e9a425f76bf. Reason for revert: lakitu paladins are stable after CL:630018 Original change's description: > chromeos_config: Mark lakitu paladins experimental. > > These paladins are currently failing VMTest flakily due to GCE quota > issues. A real fix in the CQ, until then, ignore failures. > > BUG=b/64934461 > BUG= chromium:758247 > > TEST=unittests > > Change-Id: I1c8e7c8ad66e9b73fd42316af21d8774ace30abd > Reviewed-on: https://chromium-review.googlesource.com/633723 > Tested-by: Prathmesh Prabhu <pprabhu@chromium.org> > Reviewed-by: Brian Norris <briannorris@chromium.org> > Reviewed-by: Dan Shi <dshi@google.com> BUG=b/64934461 BUG= chromium:758247 Change-Id: I5c2722f03bcd0e12a46e366ee3c75cc2dd118c5d Reviewed-on: https://chromium-review.googlesource.com/636443 Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org> Tested-by: Prathmesh Prabhu <pprabhu@chromium.org> Reviewed-by: Daniel Wang <wonderfly@google.com> Reviewed-by: Brian Norris <briannorris@chromium.org> [modify] https://crrev.com/9b4fe414a6947999640d9d734bb782a9f8113b69/cbuildbot/config_dump.json [modify] https://crrev.com/9b4fe414a6947999640d9d734bb782a9f8113b69/cbuildbot/chromeos_config.py
,
Aug 28 2017
the second fix is at https://chromium-review.googlesource.com/c/chromiumos/chromite/+/633888
,
Aug 28 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/30cf9b2722c321ed693b79a6de6435037a4056bc commit 30cf9b2722c321ed693b79a6de6435037a4056bc Author: Ningning Xia <nxia@chromium.org> Date: Mon Aug 28 22:24:13 2017 Remove tree_status experimental builders from completed_builds set SlaveStatus.completed_builds set keeps track of completed and importat builds. If a build was important and got added to the completed_builds before, but is marked as experimental in tree status in current run, should take it out from the completed_builds set. BUG= chromium:758247 TEST=unit_tests Change-Id: I9dacfd9d0c1bd4e023868453333c0aa0a756fb0d Reviewed-on: https://chromium-review.googlesource.com/633888 Commit-Ready: Ningning Xia <nxia@chromium.org> Tested-by: Ningning Xia <nxia@chromium.org> Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org> Reviewed-by: Ningning Xia <nxia@chromium.org> [modify] https://crrev.com/30cf9b2722c321ed693b79a6de6435037a4056bc/cbuildbot/build_status_unittest.py [modify] https://crrev.com/30cf9b2722c321ed693b79a6de6435037a4056bc/cbuildbot/build_status.py
,
Aug 29 2017
,
Jan 22 2018
|
||||||||
►
Sign in to add a comment |
||||||||
Comment 1 by pprabhu@chromium.org
, Aug 23 2017