New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 710542 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Last visit > 30 days ago
Closed: Apr 2017
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocked on:
issue 711034



Sign in to add a comment

Chrome OS CQ gets stuck in CommitQueueCompletion

Reported by jrbarnette@chromium.org, Apr 11 2017

Issue description

The Chrome OS CQ is unable to commit code changes, as it
can't detect slave paladin failures.  The logs from the
CommitQueueCompletion stage look like this:

09:23:40: INFO: Running cidb query on pid 14637, repr(query) starts with <sqlalchemy.sql.expression.Update object at 0x7fe025911450>
09:23:40: INFO: Recording status pass for [u'master-paladin']
09:23:40: INFO: RunCommand: /b/cbuild/internal_master/.cache/common/gsutil_4.19.tar.gz/gsutil/gsutil -o 'Boto:num_retries=10' cp -v -- /tmp/cbuildbot-tmpczG3xQ/tmpmjFgdC gs://chromeos-manifest-versions/builder-status/9452.0.0-rc2/master-paladin
09:23:41: INFO: Running cidb query on pid 14637, repr(query) starts with 'SELECT deadline >= NOW(), TIMEDIFF(deadline, NOW()) from buildTable where id = 1445082'

The logs stop after the lines above, and eventually the builder reports failure.

 
Logging in to the builder for the CQ master, we found a
number of cbuildbot processes from previous runs still in
progress on the system.

We don't know why that happened, or exactly why we're seeing
the underlying symptom.  However, the builder is supposed to
reboot in between every run.

We're rebooting the builder, and restarting CQ runs, to see what
impact that has.

Comment 2 by nxia@chromium.org, Apr 11 2017

The CQ master completion stage hung at getting slave statuses and was interrupted because no logs for a long time. 
The master was expected to reboot after every run.  I've filed
bug 710552 to fix the builder config.

We shouldn't be blocked on that change, but not having a full
root cause, it could turn out we _are_ blocked.

Project Member

Comment 4 by bugdroid1@chromium.org, Apr 11 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/62bfee10020a54816b83760fa0c45f4eb577cda9

commit 62bfee10020a54816b83760fa0c45f4eb577cda9
Author: Ningning Xia <nxia@chromium.org>
Date: Tue Apr 11 20:12:22 2017

Add info logs in the completion stage.

This is for debugging stuck CQ master completion stage and will be reverted
once the cause is found.

BUG= chromium:710542 
TEST=unit_tests

Change-Id: I7467d15e956f19307b43b7b2cc87b5c8d7a4a869
Reviewed-on: https://chromium-review.googlesource.com/475024
Reviewed-by: Richard Barnette <jrbarnette@google.com>
Tested-by: Richard Barnette <jrbarnette@google.com>

[modify] https://crrev.com/62bfee10020a54816b83760fa0c45f4eb577cda9/lib/auth.py
[modify] https://crrev.com/62bfee10020a54816b83760fa0c45f4eb577cda9/lib/buildbucket_lib.py
[modify] https://crrev.com/62bfee10020a54816b83760fa0c45f4eb577cda9/cbuildbot/build_status.py

Project Member

Comment 5 by bugdroid1@chromium.org, Apr 11 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/62bfee10020a54816b83760fa0c45f4eb577cda9

commit 62bfee10020a54816b83760fa0c45f4eb577cda9
Author: Ningning Xia <nxia@chromium.org>
Date: Tue Apr 11 20:12:22 2017

Add info logs in the completion stage.

This is for debugging stuck CQ master completion stage and will be reverted
once the cause is found.

BUG= chromium:710542 
TEST=unit_tests

Change-Id: I7467d15e956f19307b43b7b2cc87b5c8d7a4a869
Reviewed-on: https://chromium-review.googlesource.com/475024
Reviewed-by: Richard Barnette <jrbarnette@google.com>
Tested-by: Richard Barnette <jrbarnette@google.com>

[modify] https://crrev.com/62bfee10020a54816b83760fa0c45f4eb577cda9/lib/auth.py
[modify] https://crrev.com/62bfee10020a54816b83760fa0c45f4eb577cda9/lib/buildbucket_lib.py
[modify] https://crrev.com/62bfee10020a54816b83760fa0c45f4eb577cda9/cbuildbot/build_status.py

After rebooting/restarting the master-paladin, we're still seeing
the symptom.  We've added additional logging to try and guess at
where the system is getting held up.

We're going to kill/restart the builder again.


From my quick look it looks like there is a call path of:
CommitQueueCompletionStage.PerformStage
MasterSlaveSyncCompletionStage.PerformStage
 _FetchSlaveStatuses
  db.GetTimeToDeadline - finished based on logs
  _GetSlaveBuildStatus
   BuildSpecsManager.GetBuildersStatus
    SlaveStatus.__init__
     ValidationPool.GetDependMapForChanges
     
CPU usage seems to be going up and up as processes get locked consuming all CPU:
https://viceroy.corp.google.com/chrome_infra/Machines/per_machine?duration=1d&hostname=cros-wimpy0-c2&refresh=-1&utc_end=1491944620.54

Memory usage remains constant:
 7224 chrome-+  20   0  242164  76668   6004 R 100.0  0.2 136:19.58 python2
13781 chrome-+  20   0  244096  78596   6004 R  97.9  0.3  30:38.58 python2

Likely some of the graph iteration code is in an endless loop.

Potential next actions:
- disable the dependency map generation and just logs the data that would be used to compute it
- try and manually run the dependency generation based on the set of changes that is failing from the build logs
- try and attach a debugger to see where it's stuck

In the future, if we're concerned about this, we could add a timeout or something to ensure that we don't end in an endless loop here.
Project Member

Comment 8 by bugdroid1@chromium.org, Apr 11 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/45b1330dfca9209714acdf38951bfe1cfb47c92e

commit 45b1330dfca9209714acdf38951bfe1cfb47c92e
Author: Ningning Xia <nxia@chromium.org>
Date: Tue Apr 11 21:32:32 2017

Add more info logs in completion stage.

Add more logs around the manifest_version and dependency pre-computing
logic.

BUG= chromium:710542 
TEST=unit_tests

Change-Id: Ie8936bc44331f8123f07cd637e3bbf082db9ebd9
Reviewed-on: https://chromium-review.googlesource.com/475025
Reviewed-by: Richard Barnette <jrbarnette@google.com>
Tested-by: Ningning Xia <nxia@chromium.org>

[modify] https://crrev.com/45b1330dfca9209714acdf38951bfe1cfb47c92e/cbuildbot/validation_pool.py
[modify] https://crrev.com/45b1330dfca9209714acdf38951bfe1cfb47c92e/cbuildbot/stages/completion_stages.py
[modify] https://crrev.com/45b1330dfca9209714acdf38951bfe1cfb47c92e/cbuildbot/build_status.py
[modify] https://crrev.com/45b1330dfca9209714acdf38951bfe1cfb47c92e/cbuildbot/manifest_version.py

Quick and dirty test code to test the set of changes in CQ run:
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/14257

Uploaded as:
https://chromium-review.googlesource.com/475106 WIP: test program to test ValidationPool.GetDependMapForChanges
OK.  The short summary of the problem is that there's a bug in
chromite such that some dependency trees can cause chromite to
loop forever.  Some of the CLs in the recent run were tickling
that bug.

We throttled the tree, which caused the CQ to omit a large
stack of changes, and now the CQ is moving forward again.  That
means we know which changes were tickling the bug.  We don't have
a full root cause, so at any moment some new set of CLs could find
the problem again.  We're hoping that won't happen before we have
the complete root cause and a fix.

Comment 12 by nxia@chromium.org, Apr 11 2017

We already know the patch set containing CL:459166 triggered the issue, the owner/reviewer have been notified to chump in their changes after confirming all the slaves passed. 

I'm going to disable the depend-map computing logic and enable it when the graph logic is fixed. 
 
Project Member

Comment 13 by bugdroid1@chromium.org, Apr 11 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/0826166852d4b150414d634868a8f9ba2d4a5280

commit 0826166852d4b150414d634868a8f9ba2d4a5280
Author: Ningning Xia <nxia@chromium.org>
Date: Tue Apr 11 23:47:57 2017

Disable depedency map computing in build_status

Disabling it may only delay CQ self-destruction, it won't affect CL
syncing or merging. Will enable it once the issue is resolved.

BUG= chromium:710542 
TEST=unit_tests

Change-Id: I0b0326c7018e2dd0e592838073f4bf6aa7697377
Reviewed-on: https://chromium-review.googlesource.com/474970
Reviewed-by: Richard Barnette <jrbarnette@google.com>
Tested-by: Ningning Xia <nxia@chromium.org>
Trybot-Ready: Ningning Xia <nxia@chromium.org>
Commit-Queue: Ningning Xia <nxia@chromium.org>

[modify] https://crrev.com/0826166852d4b150414d634868a8f9ba2d4a5280/cbuildbot/build_status_unittest.py
[modify] https://crrev.com/0826166852d4b150414d634868a8f9ba2d4a5280/cbuildbot/build_status.py

Blockedon: 711034

Comment 15 by nxia@chromium.org, Apr 12 2017

Labels: -Pri-0 Pri-1
not blocking the CQ now. 
Project Member

Comment 16 by bugdroid1@chromium.org, Apr 13 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/4d91f8fce8d02e75419de7452c27a3c67cbe505d

commit 4d91f8fce8d02e75419de7452c27a3c67cbe505d
Author: Paul Hobbs <phobbs@google.com>
Date: Thu Apr 13 22:19:06 2017

validation_pool: _UpdateDependencyMap is now DFS

The function does try to memoize, using a "visited" set across top-level calls
to _UpdateDependencyMap.

However, the "visited" call map doesn't do anything to prevent the exponential
blowup that Allen compares to naive fibonacci. This is because within each call
tree we can visit a given node an exponential number of times.

Even if we solved this problem it's still pretty inefficient, because it uses
Set.intersection a LOT, for which each operation is O(n). If each node has d
children, this makes the approximate cost O(n*d) per node.

Replace it with simple n calls to DFS per node, which is O(e) runtime.

BUG= chromium:710542 
TEST=Tests pass.

Change-Id: I5235fcb2c8323ee84ce98832551b4c0a49b782e3
Reviewed-on: https://chromium-review.googlesource.com/475750
Commit-Ready: Paul Hobbs <phobbs@google.com>
Tested-by: Paul Hobbs <phobbs@google.com>
Reviewed-by: David Riley <davidriley@chromium.org>
Reviewed-by: Ningning Xia <nxia@chromium.org>

[modify] https://crrev.com/4d91f8fce8d02e75419de7452c27a3c67cbe505d/cbuildbot/validation_pool.py
[modify] https://crrev.com/4d91f8fce8d02e75419de7452c27a3c67cbe505d/cbuildbot/validation_pool_unittest.py

Project Member

Comment 17 by bugdroid1@chromium.org, Apr 18 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/8dbd2c59b33e12575b4d8e808d440863060c2fc7

commit 8dbd2c59b33e12575b4d8e808d440863060c2fc7
Author: Ningning Xia <nxia@chromium.org>
Date: Tue Apr 18 05:50:26 2017

Enable change dependency computing in CompletionStage.

1) enable GetDependMapForChanges
2) add TimeoutDecorator on GetTransitiveDependMap
3) remove debugging logs

BUG= chromium:710542 
TEST=unit_tests

Change-Id: Ie17fe9ac1ed21b194d8f705673b897f055ac9f1d
Reviewed-on: https://chromium-review.googlesource.com/478477
Commit-Ready: Ningning Xia <nxia@chromium.org>
Tested-by: Ningning Xia <nxia@chromium.org>
Reviewed-by: Ningning Xia <nxia@chromium.org>

[modify] https://crrev.com/8dbd2c59b33e12575b4d8e808d440863060c2fc7/cbuildbot/build_status_unittest.py
[modify] https://crrev.com/8dbd2c59b33e12575b4d8e808d440863060c2fc7/cbuildbot/validation_pool.py
[modify] https://crrev.com/8dbd2c59b33e12575b4d8e808d440863060c2fc7/cbuildbot/build_status.py
[modify] https://crrev.com/8dbd2c59b33e12575b4d8e808d440863060c2fc7/cbuildbot/validation_pool_unittest.py

Comment 18 by nxia@chromium.org, Apr 18 2017

Status: Fixed (was: Assigned)

Sign in to add a comment