New issue
Advanced search Search tips
Starred by 2 users
Status: Unconfirmed
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug



Sign in to add a comment
x86-zgb-paladin was retried with the old buildbucket_id because of lease expiration
Project Member Reported by pprabhu@chromium.org, Jan 27 2017 Back to list
https://luci-logdog.appspot.com/v/?s=chromeos%2Fbb%2Fchromeos%2Fx86-zgb-paladin%2F8811%2F%2B%2Frecipes%2Fsteps%2FBuildStart%2F0%2Fstdout

Traceback (most recent call last):
  File "/b/build/slave/x86-zgb-paladin-master/build/chromite/lib/failures_lib.py", line 172, in wrapped_functor
    return functor(*args, **kwargs)
  File "/b/build/slave/x86-zgb-paladin-master/build/chromite/cbuildbot/stages/report_stages.py", line 250, in PerformStage
    buildbucket_id=self._run.options.buildbucket_id)
  File "/b/build/slave/x86-zgb-paladin-master/build/chromite/lib/cidb.py", line 117, in wrapper
    return f(self, *args, **kwargs)
  File "/b/build/slave/x86-zgb-paladin-master/build/chromite/lib/cidb.py", line 663, in InsertBuild
    return self._Insert('buildTable', values)
  File "/b/build/slave/x86-zgb-paladin-master/build/chromite/lib/cidb.py", line 367, in _Insert
    r = self._Execute(ins)
  File "/b/build/slave/x86-zgb-paladin-master/build/chromite/lib/cidb.py", line 522, in _Execute
    *args, **kwargs)
  File "/b/build/slave/x86-zgb-paladin-master/build/chromite/lib/cidb.py", line 545, in _ExecuteWithEngine
    return self._RunFunctorWithRetries(f)
  File "/b/build/slave/x86-zgb-paladin-master/build/chromite/lib/cidb.py", line 557, in _RunFunctorWithRetries
    functor=functor)
  File "/b/build/slave/x86-zgb-paladin-master/build/chromite/lib/retry_stats.py", line 180, in RetryWithStats
    *args, **kwargs)
  File "/b/build/slave/x86-zgb-paladin-master/build/chromite/lib/retry_util.py", line 122, in GenericRetry
    ret = functor(*args, **kwargs)
  File "/b/build/slave/x86-zgb-paladin-master/build/chromite/lib/retry_stats.py", line 167, in wrapper
    result = functor(*args, **kwargs)
  File "/b/build/slave/x86-zgb-paladin-master/build/chromite/lib/cidb.py", line 542, in <lambda>
    f = lambda: engine.execute(query, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1614, in execute
    return connection.execute(statement, *multiparams, **params)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 662, in execute
    params)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 761, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 874, in _execute_context
    context)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1024, in _handle_dbapi_exception
    exc_info
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/util/compat.py", line 196, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 867, in _execute_context
    context)
  File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py", line 324, in do_execute
    cursor.execute(statement, parameters)
  File "/usr/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 174, in execute
    self.errorhandler(self, exc, value)
  File "/usr/lib/python2.7/dist-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
    raise errorclass, errorvalue
IntegrityError: (IntegrityError) (1062, "Duplicate entry '8989365383602477376' for key 'buildbucket_id_index'") 'INSERT INTO `buildTable` (master_build_id, buildbot_generation, builder_name, waterfall, build_number, build_config, bot_hostname, start_time, deadline, important, buildbucket_id) VALUES (%s, %s, %s, %s, %s, %s, %s, CURRENT_TIMESTAMP, %s, %s, %s)' (1286680, 1, 'x86-zgb-paladin', 'chromeos', 8811, 'x86-zgb-paladin', 'cros-beefy417-c2.c.chromeos-bot.internal', datetime.datetime(2017, 1, 27, 3, 2, 10), True, '8989365383602477376')

 
Components: Infra>Client>ChromeOS
Labels: current-issue
Comment 2 by nxia@chromium.org, Jan 30 2017
Cc: no...@chromium.org
Labels: -current-issue
build #8810 and build #8811 were triggered with the same buildbucket_id 8989365383602477376.


https://luci-milo.appspot.com/buildbot/chromeos/x86-zgb-paladin/8810
https://luci-milo.appspot.com/buildbot/chromeos/x86-zgb-paladin/8811


duplicated buildbucket_id may happen when the waterfall restarts and the previously running build was killed, the retried build will be assigned with the same buildbucket_id. I'm not sure what happens here. I don't think there was a waterfall restart, as the master-paladin kept running and checking the build status around that time. The master-paladin only got the 'FAILURE' status which should be related to the second build, although the master-build kept querying the build status from 15:00 to 17:06. 

https://luci-logdog.appspot.com/v/?s=chromeos%2Fbb%2Fchromeos%2Fmaster-paladin%2F13464%2F%2B%2Frecipes%2Fsteps%2FCommitQueueCompletion%2F0%2Fstdout


nodir@, any thoughts? any logs from buildbucket servers can add more information here?

Comment 3 by no...@chromium.org, Jan 31 2017
here are log entries related to the build

BigQuery query https://bigquery.cloud.google.com/results/cr-buildbucket:bquijob_5710d133_159f1e82e78?pli=1

SELECT
  metadata.timestamp,
  protoPayload.line.logMessage
FROM
  TABLE_DATE_RANGE([logs.appengine_googleapis_com_request_log_], TIMESTAMP('2017-01-01'), TIMESTAMP('2017-04-01'))
WHERE
  protoPayload.line.logMessage LIKE '%8989365383602477376%'
ORDER BY
  metadata.timestamp;

Results:
1	2017-01-26 22:35:35 UTC	Build 8989365383602477376 was created by user:446450136466-mko2u1g65l7iqsos5c09tni364ejqg75@developer.gserviceaccount.com	 
2	2017-01-26 22:35:38 UTC	Build 8989365383602477376 was leased by user:446450136466-mko2u1g65l7iqsos5c09tni364ejqg75@developer.gserviceaccount.com	 
3	2017-01-26 22:35:39 UTC	Build 8989365383602477376 was started. URL: https://uberchromegw.corp.google.com/i/chromeos/builders/x86-zgb-paladin/builds/8810	 
4	2017-01-27 01:01:20 UTC	Request URL: https://cr-buildbucket.appspot.com/_ah/api/buildbucket/v1/builds/8989365383602477376 Method: buildbucket.get Error Code: 401 Reason: required Message:	 
5	2017-01-27 01:05:00 UTC	Expired build 8989365383602477376 was reset	 
6	2017-01-27 01:05:08 UTC	Build 8989365383602477376 was leased by user:446450136466-mko2u1g65l7iqsos5c09tni364ejqg75@developer.gserviceaccount.com	 
7	2017-01-27 01:05:09 UTC	Build 8989365383602477376 was started. URL: https://uberchromegw.corp.google.com/i/chromeos/builders/x86-zgb-paladin/builds/8811	 
8	2017-01-27 01:05:45 UTC	Build 8989365383602477376 was completed. Status: COMPLETED. Result: FAILURE
Comment 4 by no...@chromium.org, Jan 31 2017
looks like the build lease expired. This build took 1.5h. Does cbuildbot extend leases (heartbeat API)?
Comment 5 by nxia@chromium.org, Jan 31 2017
Cc: d...@chromium.org
Cbuildbot doesn't send any heartbeat requests. I recall dnj@ mentioned lease/heartbeat would be taken care by buildbot-buildbucket integration, so Cbuildbot only needed to send the Put requests. dnj@ can comment?
Comment 6 by d...@chromium.org, Jan 31 2017
The thing I said is still correct. BuildBot owns the lease and will send heartbeats.

nodir@, cbuildbot is not taking out the lease. cbuildbot is running on behalf of a BuildBot build scheduled by another cbuildbot instance, similar to triggers.
Comment 7 by nxia@chromium.org, Jan 31 2017
Owner: no...@chromium.org
Summary: x86-zgb-paladin was retried with the old buildbucket_id because of lease expiration (was: CQ slave failed BuildStart because cidb rejected duplicate buildbucket_id)
Passing nodir@ to investigate the lease expiration issue first.
Comment 8 by no...@chromium.org, Jan 31 2017
you're right, I was misguided by a milo bug  crbug.com/687291 


from the buildbot log:

2017-01-26 16:04:57-0800 [-] PubSub: Status update (buildFinished): ('x86-zgb-paladin', <BuildStatus #8810>, 0)
2017-01-26 16:04:57-0800 [-] [buildbucket] Build 8989365383602477376 finished with status "SUCCESS"
2017-01-26 16:04:57-0800 [-] pruning '/home/chrome-bot/buildbot/build_internal/masters/master.chromeos/x86-zgb-paladin-master/7810'
2017-01-26 16:04:57-0800 [-]  setting expectations for next time
2017-01-26 16:04:57-0800 [-] new expectations: 5049.6119523 seconds
2017-01-26 16:04:57-0800 [-] releaseLocks(<AutoRebootBuildSlave 'cros-beefy417-c2', current builders: x86-zgb-paladin>): []
2017-01-26 16:04:57-0800 [-] releaseLocks(<Build x86-zgb-paladin>): [(<MasterLock(x86-zgb-paladin, 1)>, <buildbot.locks.LockAccess instance at 0x2ad4a4117908>)]
2017-01-26 16:04:57-0800 [Broker,2624,192.168.120.8] BuildSlave.detached(cros-beefy417-c2)
2017-01-26 16:04:57-0800 [Broker,2624,192.168.120.8] Buildslave cros-beefy417-c2 detached from x86-zgb-paladin
2017-01-26 16:04:59-0800 [Broker,2367,192.168.120.8] <RemoteShellCommand '['python', '-u', '../../../scripts/slave/annotated_run.py', '--use-factory-properties-from-disk', '--build-properties-gz=eNqtUkuPmzAQ/iuVzyEGB4zjU3vpqadKPW1WyJhZ4o3BxDbZjVb57zUY9iGlt56w53vMx3jeUK1FB1o5j/jD4wbVVvTyiDhC4Twq3dTG//n9KxSO3g+OYzzWYOXRmg7al600dti2xrQawrnDCkfIOPxuMMoTBHf0doiFA+Lf5vNUny4H1AnnwW5X7QFtQlFaEB6aqr5G0ujA8jyneZFmO5pTmnQnQ8aspYUu1dkZV8h073sVQHg+t2XxvYELaDME6zaIL0qCkNKMvZ/Cfu3iXeyS5azIC8ZolqV0t0sjSzURZXu235UlSQtCKCU0ZQtBg3BQnWDJmu1pVqasIHmEvWhn/4dlBpW/DsCl6IW9RsZcduC5XMeO41SWT2Jh7oEzwihh5ZK+ritp+ifV8ldGE6HhtTrCyv3gRI8q9lYN/2ISQf5T6EnyeLutTwe2D8sRnu6O98rpxy4sBOJZSYrN3Ot9haIv2nxK+U+vOyHRmjKiF2GV6KdFimMLoihYMq7LE+qDNc8gfdxiC4NxypugWO7nEYKu+REI62uX6QcQfgb9xzWbW16UU6aPAZwWF1hDW+OSGuDpyopEkoC+GHtq1JQB13geBZ4F+M7ckmXCt7/lxT1L']'> rc=0
2017-01-26 16:04:59-0800 [-] [buildbucket] RPC "('peek',)" failed: Traceback (most recent call last):
	  File "/home/chrome-bot/buildbot/build/scripts/master/deferred_resource.py", line 297, in _retry
	    res = yield threads.deferToThreadPool(reactor, self._pool, call)
	  File "/home/chrome-bot/buildbot/build/third_party/twisted_10_2/twisted/python/threadpool.py", line 207, in _worker
	    result = context.call(ctx, function, *args, **kwargs)
	  File "/home/chrome-bot/buildbot/build/third_party/twisted_10_2/twisted/python/context.py", line 59, in callWithContext
	    return self.currentContext().callWithContext(ctx, func, *args, **kw)
	  File "/home/chrome-bot/buildbot/build/third_party/twisted_10_2/twisted/python/context.py", line 37, in callWithContext
	    return func(*args,**kw)
	  File "/home/chrome-bot/buildbot/build/scripts/master/deferred_resource.py", line 368, in single_call
	    response = method(*args, **kwargs).execute(http=self._th_local.http)
	  File "/home/chrome-bot/buildbot/build/third_party/oauth2client/oauth2client/util.py", line 129, in positional_wrapper
	    return wrapped(*args, **kwargs)
	  File "/home/chrome-bot/buildbot/build/third_party/google_api_python_client/googleapiclient/http.py", line 716, in execute
	    body=self.body, headers=self.headers)
	  File "/home/chrome-bot/buildbot/build/third_party/oauth2client/oauth2client/util.py", line 129, in positional_wrapper
	    return wrapped(*args, **kwargs)
	  File "/home/chrome-bot/buildbot/build/third_party/oauth2client/oauth2client/client.py", line 515, in new_request
	    self._refresh(request_orig)
	  File "/home/chrome-bot/buildbot/build/third_party/oauth2client/oauth2client/client.py", line 727, in _refresh
	    self._do_refresh_request(http_request)
	  File "/home/chrome-bot/buildbot/build/third_party/oauth2client/oauth2client/client.py", line 756, in _do_refresh_request
	    self.token_uri, method='POST', body=body, headers=headers)
	  File "/home/chrome-bot/buildbot/build/third_party/infra_libs/infra_libs/httplib2_utils.py", line 252, in request
	    uri, method, body, *args, **kwargs)
	  File "/home/chrome-bot/buildbot/build/third_party/httplib2/python2/httplib2/__init__.py", line 1593, in request
	    (response, content) = self._request(conn, authority, uri, request_uri, method, body, headers, redirections, cachekey)
	  File "/home/chrome-bot/buildbot/build/third_party/httplib2/python2/httplib2/__init__.py", line 1335, in _request
	    (response, content) = self._conn_request(conn, request_uri, method, body, headers)
	  File "/home/chrome-bot/buildbot/build/third_party/httplib2/python2/httplib2/__init__.py", line 1263, in _conn_request
	    raise ServerNotFoundError("Unable to find the server at %s" % conn.host)
	ServerNotFoundError: Unable to find the server at accounts.google.com

2017-01-26 16:04:59-0800 [-] while polling for changes
	Traceback (most recent call last):
	  File "/home/chrome-bot/buildbot/build/third_party/twisted_10_2/twisted/internet/defer.py", line 388, in errback
	    self._startRunCallbacks(fail)
	  File "/home/chrome-bot/buildbot/build/third_party/twisted_10_2/twisted/internet/defer.py", line 455, in _startRunCallbacks
	    self._runCallbacks()
	  File "/home/chrome-bot/buildbot/build/third_party/twisted_10_2/twisted/internet/defer.py", line 542, in _runCallbacks
	    current.result = callback(current.result, *args, **kw)
	  File "/home/chrome-bot/buildbot/build/third_party/twisted_10_2/twisted/internet/defer.py", line 1076, in gotResult
	    _inlineCallbacks(r, g, deferred)
	--- <exception caught here> ---
	  File "/home/chrome-bot/buildbot/build/third_party/twisted_10_2/twisted/internet/defer.py", line 1018, in _inlineCallbacks
	    result = result.throwExceptionIntoGenerator(g)
	  File "/home/chrome-bot/buildbot/build/third_party/twisted_10_2/twisted/python/failure.py", line 349, in throwExceptionIntoGenerator
	    return g.throw(self.type, self.value, self.tb)
	  File "/home/chrome-bot/buildbot/build/scripts/master/buildbucket/poller.py", line 47, in poll
	    yield self.integrator.poll_builds()
	  File "/home/chrome-bot/buildbot/build/third_party/twisted_10_2/twisted/internet/defer.py", line 1018, in _inlineCallbacks
	    result = result.throwExceptionIntoGenerator(g)
	  File "/home/chrome-bot/buildbot/build/third_party/twisted_10_2/twisted/python/failure.py", line 349, in throwExceptionIntoGenerator
	    return g.throw(self.type, self.value, self.tb)
	  File "/home/chrome-bot/buildbot/build/scripts/master/buildbucket/integration.py", line 422, in poll_builds
	    start_cursor=start_cursor,
	  File "/home/chrome-bot/buildbot/build/third_party/twisted_10_2/twisted/internet/defer.py", line 1018, in _inlineCallbacks
	    result = result.throwExceptionIntoGenerator(g)
	  File "/home/chrome-bot/buildbot/build/third_party/twisted_10_2/twisted/python/failure.py", line 349, in throwExceptionIntoGenerator
	    return g.throw(self.type, self.value, self.tb)
	  File "/home/chrome-bot/buildbot/build/scripts/master/deferred_resource.py", line 312, in _retry
	    raise ex
	httplib2.ServerNotFoundError: Unable to find the server at accounts.google.com


buildbot-buildbucket's attempt to mark the build as completed failed permanently after several retries.
I see 14 occurrencies of "Unable to find the server at accounts.google.com" between 14:25 PDT and 16:33 PDT.

Buildbucket's behavior to mark incomplete-on-time builds as expired is intended to deal with cases like this. Eventually the buildbucket build was completed from the perspective of buildbucket API users.

Can cbuildbot do something to deal with this behavior? Note that only one buildbot build is considered "final" one, and it is the one referenced in build's url and result_details_json attributes.
Comment 9 by d...@chromium.org, Jan 31 2017
Owner: nxia@chromium.org
To be clear, nodir@ is suggesting that the master cbuildbot process manage this case, not the triggered build.
Comment 10 by a...@chromium.org, Feb 3 2017
Labels: OS-Chrome
Sign in to add a comment