Chrome OS CQ gets stuck in CommitQueueCompletion
Reported by
jrbarnette@chromium.org,
Apr 11 2017
|
||||
Issue descriptionThe Chrome OS CQ is unable to commit code changes, as it can't detect slave paladin failures. The logs from the CommitQueueCompletion stage look like this: 09:23:40: INFO: Running cidb query on pid 14637, repr(query) starts with <sqlalchemy.sql.expression.Update object at 0x7fe025911450> 09:23:40: INFO: Recording status pass for [u'master-paladin'] 09:23:40: INFO: RunCommand: /b/cbuild/internal_master/.cache/common/gsutil_4.19.tar.gz/gsutil/gsutil -o 'Boto:num_retries=10' cp -v -- /tmp/cbuildbot-tmpczG3xQ/tmpmjFgdC gs://chromeos-manifest-versions/builder-status/9452.0.0-rc2/master-paladin 09:23:41: INFO: Running cidb query on pid 14637, repr(query) starts with 'SELECT deadline >= NOW(), TIMEDIFF(deadline, NOW()) from buildTable where id = 1445082' The logs stop after the lines above, and eventually the builder reports failure.
,
Apr 11 2017
The CQ master completion stage hung at getting slave statuses and was interrupted because no logs for a long time.
,
Apr 11 2017
The master was expected to reboot after every run. I've filed bug 710552 to fix the builder config. We shouldn't be blocked on that change, but not having a full root cause, it could turn out we _are_ blocked.
,
Apr 11 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/62bfee10020a54816b83760fa0c45f4eb577cda9 commit 62bfee10020a54816b83760fa0c45f4eb577cda9 Author: Ningning Xia <nxia@chromium.org> Date: Tue Apr 11 20:12:22 2017 Add info logs in the completion stage. This is for debugging stuck CQ master completion stage and will be reverted once the cause is found. BUG= chromium:710542 TEST=unit_tests Change-Id: I7467d15e956f19307b43b7b2cc87b5c8d7a4a869 Reviewed-on: https://chromium-review.googlesource.com/475024 Reviewed-by: Richard Barnette <jrbarnette@google.com> Tested-by: Richard Barnette <jrbarnette@google.com> [modify] https://crrev.com/62bfee10020a54816b83760fa0c45f4eb577cda9/lib/auth.py [modify] https://crrev.com/62bfee10020a54816b83760fa0c45f4eb577cda9/lib/buildbucket_lib.py [modify] https://crrev.com/62bfee10020a54816b83760fa0c45f4eb577cda9/cbuildbot/build_status.py
,
Apr 11 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/62bfee10020a54816b83760fa0c45f4eb577cda9 commit 62bfee10020a54816b83760fa0c45f4eb577cda9 Author: Ningning Xia <nxia@chromium.org> Date: Tue Apr 11 20:12:22 2017 Add info logs in the completion stage. This is for debugging stuck CQ master completion stage and will be reverted once the cause is found. BUG= chromium:710542 TEST=unit_tests Change-Id: I7467d15e956f19307b43b7b2cc87b5c8d7a4a869 Reviewed-on: https://chromium-review.googlesource.com/475024 Reviewed-by: Richard Barnette <jrbarnette@google.com> Tested-by: Richard Barnette <jrbarnette@google.com> [modify] https://crrev.com/62bfee10020a54816b83760fa0c45f4eb577cda9/lib/auth.py [modify] https://crrev.com/62bfee10020a54816b83760fa0c45f4eb577cda9/lib/buildbucket_lib.py [modify] https://crrev.com/62bfee10020a54816b83760fa0c45f4eb577cda9/cbuildbot/build_status.py
,
Apr 11 2017
After rebooting/restarting the master-paladin, we're still seeing the symptom. We've added additional logging to try and guess at where the system is getting held up. We're going to kill/restart the builder again.
,
Apr 11 2017
From my quick look it looks like there is a call path of:
CommitQueueCompletionStage.PerformStage
MasterSlaveSyncCompletionStage.PerformStage
_FetchSlaveStatuses
db.GetTimeToDeadline - finished based on logs
_GetSlaveBuildStatus
BuildSpecsManager.GetBuildersStatus
SlaveStatus.__init__
ValidationPool.GetDependMapForChanges
CPU usage seems to be going up and up as processes get locked consuming all CPU:
https://viceroy.corp.google.com/chrome_infra/Machines/per_machine?duration=1d&hostname=cros-wimpy0-c2&refresh=-1&utc_end=1491944620.54
Memory usage remains constant:
7224 chrome-+ 20 0 242164 76668 6004 R 100.0 0.2 136:19.58 python2
13781 chrome-+ 20 0 244096 78596 6004 R 97.9 0.3 30:38.58 python2
Likely some of the graph iteration code is in an endless loop.
Potential next actions:
- disable the dependency map generation and just logs the data that would be used to compute it
- try and manually run the dependency generation based on the set of changes that is failing from the build logs
- try and attach a debugger to see where it's stuck
In the future, if we're concerned about this, we could add a timeout or something to ensure that we don't end in an endless loop here.
,
Apr 11 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/45b1330dfca9209714acdf38951bfe1cfb47c92e commit 45b1330dfca9209714acdf38951bfe1cfb47c92e Author: Ningning Xia <nxia@chromium.org> Date: Tue Apr 11 21:32:32 2017 Add more info logs in completion stage. Add more logs around the manifest_version and dependency pre-computing logic. BUG= chromium:710542 TEST=unit_tests Change-Id: Ie8936bc44331f8123f07cd637e3bbf082db9ebd9 Reviewed-on: https://chromium-review.googlesource.com/475025 Reviewed-by: Richard Barnette <jrbarnette@google.com> Tested-by: Ningning Xia <nxia@chromium.org> [modify] https://crrev.com/45b1330dfca9209714acdf38951bfe1cfb47c92e/cbuildbot/validation_pool.py [modify] https://crrev.com/45b1330dfca9209714acdf38951bfe1cfb47c92e/cbuildbot/stages/completion_stages.py [modify] https://crrev.com/45b1330dfca9209714acdf38951bfe1cfb47c92e/cbuildbot/build_status.py [modify] https://crrev.com/45b1330dfca9209714acdf38951bfe1cfb47c92e/cbuildbot/manifest_version.py
,
Apr 11 2017
Quick and dirty test code to test the set of changes in CQ run: https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/14257 Uploaded as: https://chromium-review.googlesource.com/475106 WIP: test program to test ValidationPool.GetDependMapForChanges
,
Apr 11 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/4e97c88c1330a422687d5a156b653cd5c0044316 commit 4e97c88c1330a422687d5a156b653cd5c0044316 Author: Ningning Xia <nxia@chromium.org> Date: Tue Apr 11 22:23:08 2017 Revert debug logging added in Completion Stage to reduce noise. BUG= chromium:710542 TEST=unit_tests Change-Id: I1733244ca7f218858c7616786c46f2f80e083a78 Reviewed-on: https://chromium-review.googlesource.com/475047 Reviewed-by: Richard Barnette <jrbarnette@google.com> Tested-by: Ningning Xia <nxia@chromium.org> [modify] https://crrev.com/4e97c88c1330a422687d5a156b653cd5c0044316/cbuildbot/validation_pool.py [modify] https://crrev.com/4e97c88c1330a422687d5a156b653cd5c0044316/lib/buildbucket_lib.py [modify] https://crrev.com/4e97c88c1330a422687d5a156b653cd5c0044316/cbuildbot/build_status.py [modify] https://crrev.com/4e97c88c1330a422687d5a156b653cd5c0044316/lib/auth.py
,
Apr 11 2017
OK. The short summary of the problem is that there's a bug in chromite such that some dependency trees can cause chromite to loop forever. Some of the CLs in the recent run were tickling that bug. We throttled the tree, which caused the CQ to omit a large stack of changes, and now the CQ is moving forward again. That means we know which changes were tickling the bug. We don't have a full root cause, so at any moment some new set of CLs could find the problem again. We're hoping that won't happen before we have the complete root cause and a fix.
,
Apr 11 2017
We already know the patch set containing CL:459166 triggered the issue, the owner/reviewer have been notified to chump in their changes after confirming all the slaves passed. I'm going to disable the depend-map computing logic and enable it when the graph logic is fixed.
,
Apr 11 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/0826166852d4b150414d634868a8f9ba2d4a5280 commit 0826166852d4b150414d634868a8f9ba2d4a5280 Author: Ningning Xia <nxia@chromium.org> Date: Tue Apr 11 23:47:57 2017 Disable depedency map computing in build_status Disabling it may only delay CQ self-destruction, it won't affect CL syncing or merging. Will enable it once the issue is resolved. BUG= chromium:710542 TEST=unit_tests Change-Id: I0b0326c7018e2dd0e592838073f4bf6aa7697377 Reviewed-on: https://chromium-review.googlesource.com/474970 Reviewed-by: Richard Barnette <jrbarnette@google.com> Tested-by: Ningning Xia <nxia@chromium.org> Trybot-Ready: Ningning Xia <nxia@chromium.org> Commit-Queue: Ningning Xia <nxia@chromium.org> [modify] https://crrev.com/0826166852d4b150414d634868a8f9ba2d4a5280/cbuildbot/build_status_unittest.py [modify] https://crrev.com/0826166852d4b150414d634868a8f9ba2d4a5280/cbuildbot/build_status.py
,
Apr 12 2017
,
Apr 12 2017
not blocking the CQ now.
,
Apr 13 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/4d91f8fce8d02e75419de7452c27a3c67cbe505d commit 4d91f8fce8d02e75419de7452c27a3c67cbe505d Author: Paul Hobbs <phobbs@google.com> Date: Thu Apr 13 22:19:06 2017 validation_pool: _UpdateDependencyMap is now DFS The function does try to memoize, using a "visited" set across top-level calls to _UpdateDependencyMap. However, the "visited" call map doesn't do anything to prevent the exponential blowup that Allen compares to naive fibonacci. This is because within each call tree we can visit a given node an exponential number of times. Even if we solved this problem it's still pretty inefficient, because it uses Set.intersection a LOT, for which each operation is O(n). If each node has d children, this makes the approximate cost O(n*d) per node. Replace it with simple n calls to DFS per node, which is O(e) runtime. BUG= chromium:710542 TEST=Tests pass. Change-Id: I5235fcb2c8323ee84ce98832551b4c0a49b782e3 Reviewed-on: https://chromium-review.googlesource.com/475750 Commit-Ready: Paul Hobbs <phobbs@google.com> Tested-by: Paul Hobbs <phobbs@google.com> Reviewed-by: David Riley <davidriley@chromium.org> Reviewed-by: Ningning Xia <nxia@chromium.org> [modify] https://crrev.com/4d91f8fce8d02e75419de7452c27a3c67cbe505d/cbuildbot/validation_pool.py [modify] https://crrev.com/4d91f8fce8d02e75419de7452c27a3c67cbe505d/cbuildbot/validation_pool_unittest.py
,
Apr 18 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/8dbd2c59b33e12575b4d8e808d440863060c2fc7 commit 8dbd2c59b33e12575b4d8e808d440863060c2fc7 Author: Ningning Xia <nxia@chromium.org> Date: Tue Apr 18 05:50:26 2017 Enable change dependency computing in CompletionStage. 1) enable GetDependMapForChanges 2) add TimeoutDecorator on GetTransitiveDependMap 3) remove debugging logs BUG= chromium:710542 TEST=unit_tests Change-Id: Ie17fe9ac1ed21b194d8f705673b897f055ac9f1d Reviewed-on: https://chromium-review.googlesource.com/478477 Commit-Ready: Ningning Xia <nxia@chromium.org> Tested-by: Ningning Xia <nxia@chromium.org> Reviewed-by: Ningning Xia <nxia@chromium.org> [modify] https://crrev.com/8dbd2c59b33e12575b4d8e808d440863060c2fc7/cbuildbot/build_status_unittest.py [modify] https://crrev.com/8dbd2c59b33e12575b4d8e808d440863060c2fc7/cbuildbot/validation_pool.py [modify] https://crrev.com/8dbd2c59b33e12575b4d8e808d440863060c2fc7/cbuildbot/build_status.py [modify] https://crrev.com/8dbd2c59b33e12575b4d8e808d440863060c2fc7/cbuildbot/validation_pool_unittest.py
,
Apr 18 2017
|
||||
►
Sign in to add a comment |
||||
Comment 1 by jrbarnette@chromium.org
, Apr 11 2017