New issue
Advanced search Search tips

Issue 883421 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Sep 26
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug



Sign in to add a comment

master-paladin hangs during run in CommitQueueCompletion step and has to be killed

Project Member Reported by gmeinke@chromium.org, Sep 12

Issue description

This is the second time in two weeks that the master-paladin run hung during the night (approximately just after 2 AM pacific time) and had to be killed.

Today's build that hung:
https://luci-milo.appspot.com/buildbot/chromeos/master-paladin/19601

This also happened last week:

https://luci-milo.appspot.com/buildbot/chromeos/master-paladin/19563#


Log info (not very helpful):

02:03:55: INFO: Running cidb query on pid 27659, repr(query) starts with 'SELECT build_id, build_config, waterfall, builder_name, build_number, message_type, message_subtype
02:03:55: INFO: Running cidb query on pid 27659, repr(query) starts with 'SELECT build_id, build_config, waterfall, builder_name, build_number, message_type, message_subtype
02:26:51: INFO: OAuth token TTL expired, auto-refreshing (attempt 1/2)
02:26:51: INFO: Refreshing access_token
02:31:54: INFO: OAuth token TTL expired, auto-refreshing (attempt 1/2)
02:31:54: INFO: Refreshing access_token

Continues to refresh oauth access token hourly, but no other CIDB query activity.

Initial ssh into master didn't turn up anything interesting. cbuildbot had 5 ESTABLISHED connections to the cidb mysql server.

By the time I got a mysql connection on this cidb server and started trying to debug the connections/queries the connections had moved to CLOSE_WAIT and mysql did not show any status of these connections.

Suspect a deadlock of some sort on client side.
 
Status: Available (was: Untriaged)
Looking at the slave builds which never completed, they appear to have not run correctly.

Example:

https://ci.chromium.org/p/chromiumos-chromite/builders/master.chromeos/betty-arc64-paladin/b8934427683943960848

Looking up that same build in buildbucket, it did run:

https://apis-explorer.appspot.com/apis-explorer/?base=https://cr-buildbucket.appspot.com/_ah/api#p/buildbucket/v1/buildbucket.get?id=8934427683597589280&_h=1&

Looking at it in buildbot UI:

https://uberchromegw.corp.google.com/i/chromeos/builders/betty-arcnext-paladin/builds/1442

It did run (and failed), but doesn't appear to have reported correctly to the buildinfo API used by Legoland/Milo.

Checking CIDB for the same build:


mysql> SELECT * FROM buildTable WHERE id = 2976161\G 
*************************** 1. row ***************************
                 id: 2976161
       last_updated: 2018-09-25 08:23:38
    master_build_id: 2976145
buildbot_generation: 1
       builder_name: betty-arcnext-paladin
          waterfall: chromeos
       build_number: 1442
       build_config: betty-arcnext-paladin
       bot_hostname: build119-m2.golo.chromium.org
         start_time: 2018-09-25 08:12:18
        finish_time: 0000-00-00 00:00:00
             status: inflight
      status_pickle: NULL
         build_type: paladin
     chrome_version: NULL
  milestone_version: 71
   platform_version: 11097.0.0-rc3
       full_version: R71-11097.0.0-rc3
        sdk_version: NULL
      toolchain_url: 2018/09/%(target)s-2018.09.24.022525.tar.xz
              final: 0
       metadata_url: NULL
            summary: NULL
           deadline: 2018-09-25 12:28:49
          important: 1
     buildbucket_id: 8934427683597589280
           unibuild: 0
   suite_scheduling: 0
             branch: master
1 row in set (0.06 sec)


That shows the build as still in progress, which would confuse the CQ master builder.

The CommitQueueCompletion stage appears to have failed this this error:


05:28:49: INFO: Running cidb query on pid 17012, repr(query) starts with <sqlalchemy.sql.expression.Update object at 0x7fb034e26a90>
Preconditions for the stage successfully met. Beginning to execute stage...
05:28:49: INFO: Running cidb query on pid 17012, repr(query) starts with <sqlalchemy.sql.expression.Update object at 0x7fb034ee9b50>
05:28:49: INFO: Running cidb query on pid 17012, repr(query) starts with 'SELECT id, build_stage_id, outer_failure_id, exception_type, exception_message, exception_category,
05:28:49: ERROR: Traceback (most recent call last):
  File "/b/c/cbuild/repository/chromite/cbuildbot/stages/generic_stages.py", line 702, in Run
    self.PerformStage()
  File "/b/c/cbuild/repository/chromite/cbuildbot/stages/completion_stages.py", line 689, in PerformStage
    super(CommitQueueCompletionStage, self).PerformStage()
  File "/b/c/cbuild/repository/chromite/cbuildbot/stages/completion_stages.py", line 267, in PerformStage
    super(MasterSlaveSyncCompletionStage, self).PerformStage()
  File "/b/c/cbuild/repository/chromite/cbuildbot/stages/completion_stages.py", line 77, in PerformStage
    self.message = self.GetBuildFailureMessage()
  File "/b/c/cbuild/repository/chromite/cbuildbot/stages/generic_stages.py", line 399, in GetBuildFailureMessage
    return self.GetBuildFailureMessageFromCIDB(build_id, db)
  File "/b/c/cbuild/repository/chromite/cbuildbot/stages/generic_stages.py", line 371, in GetBuildFailureMessageFromCIDB
    failure in stage_failures)
StopIteration

That error happened in the stage error handling code. This means the "run build steps" part of cbuildbot crashed.

If that happens we do NOT correctly report build results to CIDB, metrics, etc. Since the CQ master looks at CIDB (not buildbucket which has valid data in these cases), the master has no way to know that the slaves crashed without finishing and hangs.
The short term solution is to fix the crash in cbuildbot.

The long term fix is to have the master builder use buildbucket instead of CIDB (either directly, or indirectly via the CI API).
Owner: dgarr...@chromium.org
Status: Started (was: Available)
Project Member

Comment 8 by bugdroid1@chromium.org, Sep 26

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/74114507769ea3979bc0612ea7920a3c0e942029

commit 74114507769ea3979bc0612ea7920a3c0e942029
Author: Don Garrett <dgarrett@google.com>
Date: Wed Sep 26 17:32:40 2018

generic_stage: Stop crashing on poor CIDB slave status.

If a build failed to correctly populate CIDB with stage failures, it
could cause crashes in stage failure UIs of both the slave itself, and
master builders.

This fixes the crash on bad data, but does not explain why CIDB wasn't
correctly populated in the first place.

BUG= chromium:883421 
TEST=run_tests

Change-Id: I5f2c2acbdd4610b64d912175caac49ddf63cb5d1
Reviewed-on: https://chromium-review.googlesource.com/1243145
Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
Tested-by: Don Garrett <dgarrett@chromium.org>
Reviewed-by: Gregory Meinke <gmeinke@chromium.org>

[modify] https://crrev.com/74114507769ea3979bc0612ea7920a3c0e942029/cbuildbot/stages/generic_stages.py

Status: Fixed (was: Started)
Not confirmed in the PreCQ, but confirmed in a PreCQ build that also reproduced the problem.

http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8934298334731039440

Sign in to add a comment