Project: chromium Issues People Development process History Sign in
New issue
Advanced search Search tips
Starred by 1 user
Status: Fixed
Owner:
Closed: Aug 29
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment
CommitQueueCompletion failed in UprevStage with KeyError for a slave
Project Member Reported by pprabhu@chromium.org, Aug 23 Back to list
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/15821

************************************************************
** PASS PublishUprevChanges (0:04:25)
************************************************************
Failed in stage CommitQueueCompletion:
Traceback (most recent call last):
  File "/b/c/cbuild/repository/chromite/cbuildbot/stages/generic_stages.py", line 693, in Run
    self.PerformStage()
  File "/b/c/cbuild/repository/chromite/cbuildbot/stages/completion_stages.py", line 919, in PerformStage
    super(CommitQueueCompletionStage, self).PerformStage()
  File "/b/c/cbuild/repository/chromite/cbuildbot/stages/completion_stages.py", line 246, in PerformStage
    statuses = self._FetchSlaveStatuses()
  File "/b/c/cbuild/repository/chromite/cbuildbot/stages/completion_stages.py", line 169, in _FetchSlaveStatuses
    manager, build_id, db, builder_names, timeout))
  File "/b/c/cbuild/repository/chromite/cbuildbot/stages/completion_stages.py", line 915, in _GetSlaveBuildStatus
    timeout=timeout)
  File "/b/c/cbuild/repository/chromite/cbuildbot/manifest_version.py", line 648, in GetBuildersStatus
    side_effect_func=_PrintRemainingTime)
  File "/b/c/cbuild/repository/chromite/lib/timeout_util.py", line 316, in WaitForSuccess
    return retry()
  File "/b/c/cbuild/repository/chromite/lib/timeout_util.py", line 290, in retry
    if not retry_check(value):
  File "/b/c/cbuild/repository/chromite/cbuildbot/manifest_version.py", line 644, in <lambda>
    lambda x: slave_status.ShouldWait(),
  File "/b/c/cbuild/repository/chromite/cbuildbot/build_status.py", line 474, in ShouldWait
    retval, slaves_remain, long_pole = self._ShouldWait()
  File "/b/c/cbuild/repository/chromite/cbuildbot/build_status.py", line 521, in _ShouldWait
    triage_relevant_changes.ShouldSelfDestruct())
  File "/b/c/cbuild/repository/chromite/cbuildbot/relevant_changes.py", line 757, in ShouldSelfDestruct
    self._AllCompletedSlavesPassedUploadPrebuiltsStage())
  File "/b/c/cbuild/repository/chromite/cbuildbot/relevant_changes.py", line 709, in _AllCompletedSlavesPassedUploadPrebuiltsStage
    not self.PassedAnyOfStages(self.slave_stages_dict[build_config],
KeyError: 'lakitu-gpu-paladin'
 
Here's the relevant lakitu-gpu-paladin build: https://uberchromegw.corp.google.com/i/chromeos/builders/lakitu-gpu-paladin/builds/1104

Note that at the time of this failure, the tree status line was:

Tree is open (EXPERIMENTAL=lakitu-gpu-paladin,lakitu-paladin,lakitu-st-paladin b/64934461)

in order to avoid some known VMTest flakiness in lakitu.
But this status has been so for the last day, and this is the only time I've seen the master choke on it.
Owner: nxia@chromium.org
Cc: -nxia@chromium.org apronin@chromium.org briannorris@chromium.org
Labels: -Pri-2 Pri-1
Happened again:
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/15825

  File "/b/c/cbuild/repository/chromite/cbuildbot/stages/generic_stages.py", line 693, in Run
    self.PerformStage()
  File "/b/c/cbuild/repository/chromite/cbuildbot/stages/completion_stages.py", line 919, in PerformStage
    super(CommitQueueCompletionStage, self).PerformStage()
  File "/b/c/cbuild/repository/chromite/cbuildbot/stages/completion_stages.py", line 246, in PerformStage
    statuses = self._FetchSlaveStatuses()
  File "/b/c/cbuild/repository/chromite/cbuildbot/stages/completion_stages.py", line 169, in _FetchSlaveStatuses
    manager, build_id, db, builder_names, timeout))
  File "/b/c/cbuild/repository/chromite/cbuildbot/stages/completion_stages.py", line 915, in _GetSlaveBuildStatus
    timeout=timeout)
  File "/b/c/cbuild/repository/chromite/cbuildbot/manifest_version.py", line 648, in GetBuildersStatus
    side_effect_func=_PrintRemainingTime)
  File "/b/c/cbuild/repository/chromite/lib/timeout_util.py", line 316, in WaitForSuccess
    return retry()
  File "/b/c/cbuild/repository/chromite/lib/timeout_util.py", line 290, in retry
    if not retry_check(value):
  File "/b/c/cbuild/repository/chromite/cbuildbot/manifest_version.py", line 644, in <lambda>
    lambda x: slave_status.ShouldWait(),
  File "/b/c/cbuild/repository/chromite/cbuildbot/build_status.py", line 474, in ShouldWait
    retval, slaves_remain, long_pole = self._ShouldWait()
  File "/b/c/cbuild/repository/chromite/cbuildbot/build_status.py", line 521, in _ShouldWait
    triage_relevant_changes.ShouldSelfDestruct())
  File "/b/c/cbuild/repository/chromite/cbuildbot/relevant_changes.py", line 740, in ShouldSelfDestruct
    self._ProcessCompletedBuilds()
  File "/b/c/cbuild/repository/chromite/cbuildbot/relevant_changes.py", line 580, in _ProcessCompletedBuilds
    build_config)
  File "/b/c/cbuild/repository/chromite/lib/builder_status_lib.py", line 357, in GetBuilderStatusForBuild
    build_config, self.cidb_info_dict, self.buildbucket_info_dict)
  File "/b/c/cbuild/repository/chromite/lib/builder_status_lib.py", line 321, in _GetDashboardUrl
    return buildbucket_info_dict[build_config].url
KeyError: u'coral-paladin'

10:08:29: INFO: Translating result u'coral-paladin' to fail.

This time, I had added coral-paladin while the run was midway.
---------------------------

This is now impacting my confidence in using the tree status for marking paladins experimental, as the failed CQ runs end up not submitting _anything_, so it's counter-productive.
Cc: ddavenp...@chromium.org


03:48:11: ERROR: Timeout getting experimental builders from the tree status.

Requests to tree_status timed out in one loop.

At the beginning of the loop, experimental_builders from metadata were ["lakitu-gpu-paladin", "lakitu-paladin", "lakitu-st-paladin"], so buildbucket_info_dict and cidb_status_dict didn't query statuses for these three builders; then http://shortn/_doJcpU3gBw request timed out and experimental_builders were updated to an empty list; so _GetExpectedBuilders contained the experimental builders; then relevant_changes.TriageRelevantChanges threw the KeyError because statuses for theses three experimental builders were not fetched at beginning of the method.


Project Member Comment 6 by bugdroid1@chromium.org, Aug 24
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/893f101b5ab4b5ed408577a98b3fd35aceb79768

commit 893f101b5ab4b5ed408577a98b3fd35aceb79768
Author: Ningning Xia <nxia@chromium.org>
Date: Thu Aug 24 04:17:27 2017

Update experimental builders at beginning of UpdateSlaveStatus

Updating experimental builders in the middle of UpdateSlaveStatus will
lead different process getting status from different expected builders.

BUG= chromium:758247 
TEST=run_tests

Change-Id: I8029f07b15c6337b9b8edb77686cda8aaff9821e
Reviewed-on: https://chromium-review.googlesource.com/629496
Commit-Ready: Ningning Xia <nxia@chromium.org>
Tested-by: Ningning Xia <nxia@chromium.org>
Reviewed-by: Ningning Xia <nxia@chromium.org>

[modify] https://crrev.com/893f101b5ab4b5ed408577a98b3fd35aceb79768/lib/tree_status.py
[modify] https://crrev.com/893f101b5ab4b5ed408577a98b3fd35aceb79768/cbuildbot/build_status.py

Labels: OS-Chrome
Is this still causing problems?

@@@STEP_FAILURE@@@
07:41:07: ERROR: Traceback (most recent call last):
  File "/b/c/cbuild/repository/chromite/cbuildbot/stages/generic_stages.py", line 693, in Run
    self.PerformStage()
  File "/b/c/cbuild/repository/chromite/cbuildbot/stages/completion_stages.py", line 919, in PerformStage
    super(CommitQueueCompletionStage, self).PerformStage()
  File "/b/c/cbuild/repository/chromite/cbuildbot/stages/completion_stages.py", line 246, in PerformStage
    statuses = self._FetchSlaveStatuses()
  File "/b/c/cbuild/repository/chromite/cbuildbot/stages/completion_stages.py", line 169, in _FetchSlaveStatuses
    manager, build_id, db, builder_names, timeout))
  File "/b/c/cbuild/repository/chromite/cbuildbot/stages/completion_stages.py", line 915, in _GetSlaveBuildStatus
    timeout=timeout)
  File "/b/c/cbuild/repository/chromite/cbuildbot/manifest_version.py", line 648, in GetBuildersStatus
    side_effect_func=_PrintRemainingTime)
  File "/b/c/cbuild/repository/chromite/lib/timeout_util.py", line 316, in WaitForSuccess
    return retry()
  File "/b/c/cbuild/repository/chromite/lib/timeout_util.py", line 290, in retry
    if not retry_check(value):
  File "/b/c/cbuild/repository/chromite/cbuildbot/manifest_version.py", line 644, in <lambda>
    lambda x: slave_status.ShouldWait(),
  File "/b/c/cbuild/repository/chromite/cbuildbot/build_status.py", line 479, in ShouldWait
    retval, slaves_remain, long_pole = self._ShouldWait()
  File "/b/c/cbuild/repository/chromite/cbuildbot/build_status.py", line 526, in _ShouldWait
    triage_relevant_changes.ShouldSelfDestruct())
  File "/b/c/cbuild/repository/chromite/cbuildbot/relevant_changes.py", line 757, in ShouldSelfDestruct
    self._AllCompletedSlavesPassedUploadPrebuiltsStage())
  File "/b/c/cbuild/repository/chromite/cbuildbot/relevant_changes.py", line 709, in _AllCompletedSlavesPassedUploadPrebuiltsStage
    not self.PassedAnyOfStages(self.slave_stages_dict[build_config],
KeyError: 'lakitu-gpu-paladin'

https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/15836/steps/CommitQueueCompletion/logs/stdio

This blocked the last 2 CQ runs.

Maybe the short-term answer is to stop using the tree status for setting EXPERIMENTAL? Does that help?
Project Member Comment 8 by bugdroid1@chromium.org, Aug 24
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/d309fc1d81014cf7c28ab7c001fd2e9a425f76bf

commit d309fc1d81014cf7c28ab7c001fd2e9a425f76bf
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Aug 24 17:18:30 2017

chromeos_config: Mark lakitu paladins experimental.

These paladins are currently failing VMTest flakily due to GCE quota
issues. A real fix in the CQ, until then, ignore failures.

BUG=b/64934461
BUG= chromium:758247 

TEST=unittests

Change-Id: I1c8e7c8ad66e9b73fd42316af21d8774ace30abd
Reviewed-on: https://chromium-review.googlesource.com/633723
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Brian Norris <briannorris@chromium.org>
Reviewed-by: Dan Shi <dshi@google.com>

[modify] https://crrev.com/d309fc1d81014cf7c28ab7c001fd2e9a425f76bf/cbuildbot/config_dump.json
[modify] https://crrev.com/d309fc1d81014cf7c28ab7c001fd2e9a425f76bf/cbuildbot/chromeos_config.py

Trying to understand the status of this. Did the change in #6 fix the issue? #7 implies that it did not.

I'm trying to understand why this feature is suddenly causing problems after having been launched and used somewhat regularly since then.

It looks like there is some problem about builders being removed from the list of experimental builders during the polling of UpdateStatus and ShouldWait. My guess is that up until now we've been lucky enough to not make such a change at a problematic time during a CQ run. The last few problematic runs showed that the query to request the tree status had timed out sporadically (default timeout of 1 second), which would have the same effect of someone clearing the list of EXPERIMENTAL builders and putting it back again. This may have uncovered the problem.

Ningning, is there anything else that you discovered when looking into this?
Note that the first failure happened overnight when noone was toggling the tree status (afaict).
This was triggered just by calls to tree status app timing out.

The second failure was uncovered by my adding an experimental builder to the status during the run when that slave had already finished successfully.
Status: Started
> Note that the first failure happened overnight when noone was toggling the tree status (afaict).
> This was triggered just by calls to tree status app timing out. 
Noted. If the call to query the tree status times out, it returns an empty list, so it's the same as if someone removed all EXPERIMENTAL builders.

> The second failure was uncovered by my adding an experimental builder to the status during the run when that slave had already finished successfully.
Thanks for the additional info. I wonder if this is just a case that got overlooked. I'll see if I can write a test for this.
If the tree times out, it returns an empty list, and it is the same as someone removed the builders. but previously the case (or both cases) wasn't handled properly. The two CLs were to fix the timeout problem (or builders are toggled manually) during the waiting loop. 
Project Member Comment 14 by bugdroid1@chromium.org, Aug 26
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/9b4fe414a6947999640d9d734bb782a9f8113b69

commit 9b4fe414a6947999640d9d734bb782a9f8113b69
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Sat Aug 26 02:54:15 2017

Revert "chromeos_config: Mark lakitu paladins experimental."

This reverts commit d309fc1d81014cf7c28ab7c001fd2e9a425f76bf.

Reason for revert: lakitu paladins are stable after CL:630018

Original change's description:
> chromeos_config: Mark lakitu paladins experimental.
>
> These paladins are currently failing VMTest flakily due to GCE quota
> issues. A real fix in the CQ, until then, ignore failures.
>
> BUG=b/64934461
> BUG= chromium:758247 
>
> TEST=unittests
>
> Change-Id: I1c8e7c8ad66e9b73fd42316af21d8774ace30abd
> Reviewed-on: https://chromium-review.googlesource.com/633723
> Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
> Reviewed-by: Brian Norris <briannorris@chromium.org>
> Reviewed-by: Dan Shi <dshi@google.com>

BUG=b/64934461
BUG= chromium:758247 

Change-Id: I5c2722f03bcd0e12a46e366ee3c75cc2dd118c5d
Reviewed-on: https://chromium-review.googlesource.com/636443
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Daniel Wang <wonderfly@google.com>
Reviewed-by: Brian Norris <briannorris@chromium.org>

[modify] https://crrev.com/9b4fe414a6947999640d9d734bb782a9f8113b69/cbuildbot/config_dump.json
[modify] https://crrev.com/9b4fe414a6947999640d9d734bb782a9f8113b69/cbuildbot/chromeos_config.py

Project Member Comment 16 by bugdroid1@chromium.org, Aug 28
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/30cf9b2722c321ed693b79a6de6435037a4056bc

commit 30cf9b2722c321ed693b79a6de6435037a4056bc
Author: Ningning Xia <nxia@chromium.org>
Date: Mon Aug 28 22:24:13 2017

Remove tree_status experimental builders from completed_builds set

SlaveStatus.completed_builds set keeps track of completed and importat
builds. If a build was important and got added to the completed_builds
before, but is marked as experimental in tree status in current run,
should take it out from the completed_builds set.

BUG= chromium:758247 
TEST=unit_tests

Change-Id: I9dacfd9d0c1bd4e023868453333c0aa0a756fb0d
Reviewed-on: https://chromium-review.googlesource.com/633888
Commit-Ready: Ningning Xia <nxia@chromium.org>
Tested-by: Ningning Xia <nxia@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Ningning Xia <nxia@chromium.org>

[modify] https://crrev.com/30cf9b2722c321ed693b79a6de6435037a4056bc/cbuildbot/build_status_unittest.py
[modify] https://crrev.com/30cf9b2722c321ed693b79a6de6435037a4056bc/cbuildbot/build_status.py

Status: Fixed
Sign in to add a comment