master-paladin hangs during run in CommitQueueCompletion step and has to be killed
Issue description

This is the second time in two weeks that the master-paladin run hung during the night (approximately just after 2 AM Pacific time) and had to be killed.

Today's build that hung: https://luci-milo.appspot.com/buildbot/chromeos/master-paladin/19601
This also happened last week: https://luci-milo.appspot.com/buildbot/chromeos/master-paladin/19563#

Log info (not very helpful):

02:03:55: INFO: Running cidb query on pid 27659, repr(query) starts with 'SELECT build_id, build_config, waterfall, builder_name, build_number, message_type, message_subtype
02:03:55: INFO: Running cidb query on pid 27659, repr(query) starts with 'SELECT build_id, build_config, waterfall, builder_name, build_number, message_type, message_subtype
02:26:51: INFO: OAuth token TTL expired, auto-refreshing (attempt 1/2)
02:26:51: INFO: Refreshing access_token
02:31:54: INFO: OAuth token TTL expired, auto-refreshing (attempt 1/2)
02:31:54: INFO: Refreshing access_token

The build continues to refresh the OAuth access token hourly, but shows no other CIDB query activity. An initial ssh into the master didn't turn up anything interesting: cbuildbot had 5 ESTABLISHED connections to the CIDB MySQL server. By the time I got a MySQL connection on that CIDB server and started trying to debug the connections/queries, the connections had moved to CLOSE_WAIT and MySQL did not show any status for them. I suspect a deadlock of some sort on the client side.
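For reference, a minimal sketch of how the client-side connection states could be snapshotted next time before killing the process (this is not part of cbuildbot; psutil, the helper name, and the port are assumptions):

import psutil

def dump_cbuildbot_connections(cidb_port=3306):
  """Print the TCP connections each cbuildbot process holds to the CIDB MySQL port."""
  for proc in psutil.process_iter(['pid', 'cmdline']):
    cmdline = ' '.join(proc.info['cmdline'] or [])
    if 'cbuildbot' not in cmdline:
      continue
    try:
      conns = proc.connections(kind='tcp')
    except (psutil.AccessDenied, psutil.NoSuchProcess):
      continue
    for conn in conns:
      # Only report connections toward the MySQL port used by CIDB, and show
      # whether they are ESTABLISHED or have drifted to CLOSE_WAIT.
      if conn.raddr and conn.raddr[1] == cidb_port:
        print('pid=%d %s -> %s status=%s' % (
            proc.info['pid'], conn.laddr, conn.raddr, conn.status))

if __name__ == '__main__':
  dump_cbuildbot_connections()

Capturing this alongside SHOW PROCESSLIST on the CIDB server would make it easier to tell whether the hang is on the client side or the server side.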
Sep 25
There appears to have been a new example of this here: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8934428819204578352
Sep 25
Looking at the slave builds which never completed, they appear to have not run correctly. Example: https://ci.chromium.org/p/chromiumos-chromite/builders/master.chromeos/betty-arc64-paladin/b8934427683943960848

Looking up that same build in buildbucket, it did run: https://apis-explorer.appspot.com/apis-explorer/?base=https://cr-buildbucket.appspot.com/_ah/api#p/buildbucket/v1/buildbucket.get?id=8934427683597589280&_h=1&

Looking at it in the buildbot UI: https://uberchromegw.corp.google.com/i/chromeos/builders/betty-arcnext-paladin/builds/1442

It did run (and failed), but doesn't appear to have reported correctly to the buildinfo API used by Legoland/Milo.

Checking CIDB for the same build:

mysql> SELECT * FROM buildTable WHERE id = 2976161\G
*************************** 1. row ***************************
                 id: 2976161
       last_updated: 2018-09-25 08:23:38
    master_build_id: 2976145
buildbot_generation: 1
       builder_name: betty-arcnext-paladin
          waterfall: chromeos
       build_number: 1442
       build_config: betty-arcnext-paladin
       bot_hostname: build119-m2.golo.chromium.org
         start_time: 2018-09-25 08:12:18
        finish_time: 0000-00-00 00:00:00
             status: inflight
      status_pickle: NULL
         build_type: paladin
     chrome_version: NULL
  milestone_version: 71
   platform_version: 11097.0.0-rc3
       full_version: R71-11097.0.0-rc3
        sdk_version: NULL
      toolchain_url: 2018/09/%(target)s-2018.09.24.022525.tar.xz
              final: 0
       metadata_url: NULL
            summary: NULL
           deadline: 2018-09-25 12:28:49
          important: 1
     buildbucket_id: 8934427683597589280
           unibuild: 0
   suite_scheduling: 0
             branch: master
1 row in set (0.06 sec)

That shows the build as still in progress, which would confuse the CQ master builder.
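For debugging this class of hang, here is a rough sketch (table and column names are taken from the query output above; the engine URL and helper name are assumptions, not existing chromite code) of how to list slave builds that CIDB still reports as inflight past their deadline:

import sqlalchemy

def find_stuck_slaves(engine, master_build_id):
  """Return slave builds of the given master that CIDB thinks are still running."""
  query = sqlalchemy.text(
      "SELECT id, build_config, buildbucket_id, status, deadline "
      "FROM buildTable "
      "WHERE master_build_id = :master_id "
      "AND status = 'inflight' "
      "AND deadline < NOW()")
  with engine.connect() as conn:
    return conn.execute(query, {'master_id': master_build_id}).fetchall()

# Example (credentials are placeholders):
#   engine = sqlalchemy.create_engine('mysql://user:password@cidb-host/cidb')
#   for row in find_stuck_slaves(engine, 2976145):
#     print(row)

Any row this returns is a build the CQ master will keep waiting on, even if buildbucket already knows the build finished.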
Sep 25
The CommitQueueCompletion stage appears to have failed with this error:
05:28:49: INFO: Running cidb query on pid 17012, repr(query) starts with <sqlalchemy.sql.expression.Update object at 0x7fb034e26a90>
Preconditions for the stage successfully met. Beginning to execute stage...
05:28:49: INFO: Running cidb query on pid 17012, repr(query) starts with <sqlalchemy.sql.expression.Update object at 0x7fb034ee9b50>
05:28:49: INFO: Running cidb query on pid 17012, repr(query) starts with 'SELECT id, build_stage_id, outer_failure_id, exception_type, exception_message, exception_category,
05:28:49: ERROR: Traceback (most recent call last):
File "/b/c/cbuild/repository/chromite/cbuildbot/stages/generic_stages.py", line 702, in Run
self.PerformStage()
File "/b/c/cbuild/repository/chromite/cbuildbot/stages/completion_stages.py", line 689, in PerformStage
super(CommitQueueCompletionStage, self).PerformStage()
File "/b/c/cbuild/repository/chromite/cbuildbot/stages/completion_stages.py", line 267, in PerformStage
super(MasterSlaveSyncCompletionStage, self).PerformStage()
File "/b/c/cbuild/repository/chromite/cbuildbot/stages/completion_stages.py", line 77, in PerformStage
self.message = self.GetBuildFailureMessage()
File "/b/c/cbuild/repository/chromite/cbuildbot/stages/generic_stages.py", line 399, in GetBuildFailureMessage
return self.GetBuildFailureMessageFromCIDB(build_id, db)
File "/b/c/cbuild/repository/chromite/cbuildbot/stages/generic_stages.py", line 371, in GetBuildFailureMessageFromCIDB
failure in stage_failures)
StopIteration
Sep 25
That error happened in the stage error handling code, which means the "run build steps" part of cbuildbot crashed. When that happens we do NOT correctly report build results to CIDB, metrics, etc. Since the CQ master looks at CIDB (not buildbucket, which has valid data in these cases), the master has no way to know that the slaves crashed without finishing, and it hangs.
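For illustration (this is not the actual chromite code, just the failure mode in miniature): a next() over a generator expression raises StopIteration when the CIDB query for stage failures comes back empty, and that exception escapes the completion stage's own error handling.

# What CIDB returned for the crashed slave: no stage failure rows at all.
stage_failures = []

# A lookup like this crashes with StopIteration on an empty result set:
#   first = next(failure for failure in stage_failures)

# The tolerant form supplies a default instead of raising:
first = next((failure for failure in stage_failures), None)
if first is None:
  print('No stage failures recorded in CIDB; fall back to a generic message.')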
Sep 25
The short-term solution is to fix the crash in cbuildbot. The long-term fix is to have the master builder use buildbucket instead of CIDB (either directly, or indirectly via the CI API).
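As a rough sketch of the long-term direction (the endpoint shape and auth handling are assumptions based on the buildbucket.get link above, not existing chromite code), the master could ask buildbucket whether a slave actually completed instead of trusting CIDB:

import json
import urllib2  # cbuildbot is Python 2; use urllib.request on Python 3

BUILDBUCKET_GET = ('https://cr-buildbucket.appspot.com/_ah/api/'
                   'buildbucket/v1/builds/%s')

def get_buildbucket_status(buildbucket_id, access_token):
  """Return (status, result) for a build as buildbucket sees it."""
  request = urllib2.Request(BUILDBUCKET_GET % buildbucket_id)
  request.add_header('Authorization', 'Bearer %s' % access_token)
  response = json.load(urllib2.urlopen(request))
  build = response.get('build', {})
  # 'status' moves SCHEDULED -> STARTED -> COMPLETED; 'result' is only
  # populated once the build is COMPLETED.
  return build.get('status'), build.get('result')

# e.g. get_buildbucket_status('8934427683597589280', token) would have shown
# the betty-arcnext-paladin build above as completed even though CIDB still
# said inflight.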
Sep 26
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/74114507769ea3979bc0612ea7920a3c0e942029

commit 74114507769ea3979bc0612ea7920a3c0e942029
Author: Don Garrett <dgarrett@google.com>
Date: Wed Sep 26 17:32:40 2018

    generic_stage: Stop crashing on poor CIDB slave status.

    If a build failed to correctly populate CIDB with stage failures, it
    could cause crashes in stage failure UIs of both the slave itself, and
    master builders.

    This fixes the crash on bad data, but does not explain why CIDB wasn't
    correctly populated in the first place.

    BUG=chromium:883421
    TEST=run_tests

    Change-Id: I5f2c2acbdd4610b64d912175caac49ddf63cb5d1
    Reviewed-on: https://chromium-review.googlesource.com/1243145
    Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
    Tested-by: Don Garrett <dgarrett@chromium.org>
    Reviewed-by: Gregory Meinke <gmeinke@chromium.org>

[modify] https://crrev.com/74114507769ea3979bc0612ea7920a3c0e942029/cbuildbot/stages/generic_stages.py
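For context, this is roughly the shape of guard that kind of change needs (not the actual patch; the function and CIDB method names here are hypothetical): tolerate a build whose CIDB rows are missing stage failures instead of letting the lookup crash the completion stage.

def get_build_failure_message_from_cidb(build_id, db):
  """Build a failure message, tolerating missing stage-failure rows in CIDB."""
  stage_failures = db.GetBuildsFailures([build_id])  # may legitimately be empty
  if not stage_failures:
    # Bad or missing CIDB data: fall back to a generic message rather than
    # raising and taking down the stage's own error handling.
    return ('Build %s failed, but no stage failures were recorded in CIDB.'
            % build_id)
  return '\n'.join(str(failure) for failure in stage_failures)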
Sep 26
Not confirmed in the PreCQ, but confirmed in a PreCQ build that also reproduced the problem. http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8934298334731039440
Comment 1 by jclinton@chromium.org, Sep 12