x86-zgb-paladin was retried with the old buildbucket_id because of lease expiration |
|||||||
Issue descriptionhttps://luci-logdog.appspot.com/v/?s=chromeos%2Fbb%2Fchromeos%2Fx86-zgb-paladin%2F8811%2F%2B%2Frecipes%2Fsteps%2FBuildStart%2F0%2Fstdout Traceback (most recent call last): File "/b/build/slave/x86-zgb-paladin-master/build/chromite/lib/failures_lib.py", line 172, in wrapped_functor return functor(*args, **kwargs) File "/b/build/slave/x86-zgb-paladin-master/build/chromite/cbuildbot/stages/report_stages.py", line 250, in PerformStage buildbucket_id=self._run.options.buildbucket_id) File "/b/build/slave/x86-zgb-paladin-master/build/chromite/lib/cidb.py", line 117, in wrapper return f(self, *args, **kwargs) File "/b/build/slave/x86-zgb-paladin-master/build/chromite/lib/cidb.py", line 663, in InsertBuild return self._Insert('buildTable', values) File "/b/build/slave/x86-zgb-paladin-master/build/chromite/lib/cidb.py", line 367, in _Insert r = self._Execute(ins) File "/b/build/slave/x86-zgb-paladin-master/build/chromite/lib/cidb.py", line 522, in _Execute *args, **kwargs) File "/b/build/slave/x86-zgb-paladin-master/build/chromite/lib/cidb.py", line 545, in _ExecuteWithEngine return self._RunFunctorWithRetries(f) File "/b/build/slave/x86-zgb-paladin-master/build/chromite/lib/cidb.py", line 557, in _RunFunctorWithRetries functor=functor) File "/b/build/slave/x86-zgb-paladin-master/build/chromite/lib/retry_stats.py", line 180, in RetryWithStats *args, **kwargs) File "/b/build/slave/x86-zgb-paladin-master/build/chromite/lib/retry_util.py", line 122, in GenericRetry ret = functor(*args, **kwargs) File "/b/build/slave/x86-zgb-paladin-master/build/chromite/lib/retry_stats.py", line 167, in wrapper result = functor(*args, **kwargs) File "/b/build/slave/x86-zgb-paladin-master/build/chromite/lib/cidb.py", line 542, in <lambda> f = lambda: engine.execute(query, *args, **kwargs) File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1614, in execute return connection.execute(statement, *multiparams, **params) File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 662, in execute params) File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 761, in _execute_clauseelement compiled_sql, distilled_params File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 874, in _execute_context context) File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1024, in _handle_dbapi_exception exc_info File "/usr/lib/python2.7/dist-packages/sqlalchemy/util/compat.py", line 196, in raise_from_cause reraise(type(exception), exception, tb=exc_tb) File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 867, in _execute_context context) File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py", line 324, in do_execute cursor.execute(statement, parameters) File "/usr/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 174, in execute self.errorhandler(self, exc, value) File "/usr/lib/python2.7/dist-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler raise errorclass, errorvalue IntegrityError: (IntegrityError) (1062, "Duplicate entry '8989365383602477376' for key 'buildbucket_id_index'") 'INSERT INTO `buildTable` (master_build_id, buildbot_generation, builder_name, waterfall, build_number, build_config, bot_hostname, start_time, deadline, important, buildbucket_id) VALUES (%s, %s, %s, %s, %s, %s, %s, CURRENT_TIMESTAMP, %s, %s, %s)' (1286680, 1, 'x86-zgb-paladin', 'chromeos', 8811, 'x86-zgb-paladin', 'cros-beefy417-c2.c.chromeos-bot.internal', datetime.datetime(2017, 1, 27, 3, 2, 10), True, '8989365383602477376') [0m
,
Jan 30 2017
build #8810 and build #8811 were triggered with the same buildbucket_id 8989365383602477376. https://luci-milo.appspot.com/buildbot/chromeos/x86-zgb-paladin/8810 https://luci-milo.appspot.com/buildbot/chromeos/x86-zgb-paladin/8811 duplicated buildbucket_id may happen when the waterfall restarts and the previously running build was killed, the retried build will be assigned with the same buildbucket_id. I'm not sure what happens here. I don't think there was a waterfall restart, as the master-paladin kept running and checking the build status around that time. The master-paladin only got the 'FAILURE' status which should be related to the second build, although the master-build kept querying the build status from 15:00 to 17:06. https://luci-logdog.appspot.com/v/?s=chromeos%2Fbb%2Fchromeos%2Fmaster-paladin%2F13464%2F%2B%2Frecipes%2Fsteps%2FCommitQueueCompletion%2F0%2Fstdout nodir@, any thoughts? any logs from buildbucket servers can add more information here?
,
Jan 31 2017
here are log entries related to the build BigQuery query https://bigquery.cloud.google.com/results/cr-buildbucket:bquijob_5710d133_159f1e82e78?pli=1 SELECT metadata.timestamp, protoPayload.line.logMessage FROM TABLE_DATE_RANGE([logs.appengine_googleapis_com_request_log_], TIMESTAMP('2017-01-01'), TIMESTAMP('2017-04-01')) WHERE protoPayload.line.logMessage LIKE '%8989365383602477376%' ORDER BY metadata.timestamp; Results: 1 2017-01-26 22:35:35 UTC Build 8989365383602477376 was created by user:446450136466-mko2u1g65l7iqsos5c09tni364ejqg75@developer.gserviceaccount.com 2 2017-01-26 22:35:38 UTC Build 8989365383602477376 was leased by user:446450136466-mko2u1g65l7iqsos5c09tni364ejqg75@developer.gserviceaccount.com 3 2017-01-26 22:35:39 UTC Build 8989365383602477376 was started. URL: https://uberchromegw.corp.google.com/i/chromeos/builders/x86-zgb-paladin/builds/8810 4 2017-01-27 01:01:20 UTC Request URL: https://cr-buildbucket.appspot.com/_ah/api/buildbucket/v1/builds/8989365383602477376 Method: buildbucket.get Error Code: 401 Reason: required Message: 5 2017-01-27 01:05:00 UTC Expired build 8989365383602477376 was reset 6 2017-01-27 01:05:08 UTC Build 8989365383602477376 was leased by user:446450136466-mko2u1g65l7iqsos5c09tni364ejqg75@developer.gserviceaccount.com 7 2017-01-27 01:05:09 UTC Build 8989365383602477376 was started. URL: https://uberchromegw.corp.google.com/i/chromeos/builders/x86-zgb-paladin/builds/8811 8 2017-01-27 01:05:45 UTC Build 8989365383602477376 was completed. Status: COMPLETED. Result: FAILURE
,
Jan 31 2017
looks like the build lease expired. This build took 1.5h. Does cbuildbot extend leases (heartbeat API)?
,
Jan 31 2017
Cbuildbot doesn't send any heartbeat requests. I recall dnj@ mentioned lease/heartbeat would be taken care by buildbot-buildbucket integration, so Cbuildbot only needed to send the Put requests. dnj@ can comment?
,
Jan 31 2017
The thing I said is still correct. BuildBot owns the lease and will send heartbeats. nodir@, cbuildbot is not taking out the lease. cbuildbot is running on behalf of a BuildBot build scheduled by another cbuildbot instance, similar to triggers.
,
Jan 31 2017
Passing nodir@ to investigate the lease expiration issue first.
,
Jan 31 2017
you're right, I was misguided by a milo bug crbug.com/687291 from the buildbot log: 2017-01-26 16:04:57-0800 [-] PubSub: Status update (buildFinished): ('x86-zgb-paladin', <BuildStatus #8810>, 0) 2017-01-26 16:04:57-0800 [-] [buildbucket] Build 8989365383602477376 finished with status "SUCCESS" 2017-01-26 16:04:57-0800 [-] pruning '/home/chrome-bot/buildbot/build_internal/masters/master.chromeos/x86-zgb-paladin-master/7810' 2017-01-26 16:04:57-0800 [-] setting expectations for next time 2017-01-26 16:04:57-0800 [-] new expectations: 5049.6119523 seconds 2017-01-26 16:04:57-0800 [-] releaseLocks(<AutoRebootBuildSlave 'cros-beefy417-c2', current builders: x86-zgb-paladin>): [] 2017-01-26 16:04:57-0800 [-] releaseLocks(<Build x86-zgb-paladin>): [(<MasterLock(x86-zgb-paladin, 1)>, <buildbot.locks.LockAccess instance at 0x2ad4a4117908>)] 2017-01-26 16:04:57-0800 [Broker,2624,192.168.120.8] BuildSlave.detached(cros-beefy417-c2) 2017-01-26 16:04:57-0800 [Broker,2624,192.168.120.8] Buildslave cros-beefy417-c2 detached from x86-zgb-paladin 2017-01-26 16:04:59-0800 [Broker,2367,192.168.120.8] <RemoteShellCommand '['python', '-u', '../../../scripts/slave/annotated_run.py', '--use-factory-properties-from-disk', '--build-properties-gz=eNqtUkuPmzAQ/iuVzyEGB4zjU3vpqadKPW1WyJhZ4o3BxDbZjVb57zUY9iGlt56w53vMx3jeUK1FB1o5j/jD4wbVVvTyiDhC4Twq3dTG//n9KxSO3g+OYzzWYOXRmg7al600dti2xrQawrnDCkfIOPxuMMoTBHf0doiFA+Lf5vNUny4H1AnnwW5X7QFtQlFaEB6aqr5G0ujA8jyneZFmO5pTmnQnQ8aspYUu1dkZV8h073sVQHg+t2XxvYELaDME6zaIL0qCkNKMvZ/Cfu3iXeyS5azIC8ZolqV0t0sjSzURZXu235UlSQtCKCU0ZQtBg3BQnWDJmu1pVqasIHmEvWhn/4dlBpW/DsCl6IW9RsZcduC5XMeO41SWT2Jh7oEzwihh5ZK+ritp+ifV8ldGE6HhtTrCyv3gRI8q9lYN/2ISQf5T6EnyeLutTwe2D8sRnu6O98rpxy4sBOJZSYrN3Ot9haIv2nxK+U+vOyHRmjKiF2GV6KdFimMLoihYMq7LE+qDNc8gfdxiC4NxypugWO7nEYKu+REI62uX6QcQfgb9xzWbW16UU6aPAZwWF1hDW+OSGuDpyopEkoC+GHtq1JQB13geBZ4F+M7ckmXCt7/lxT1L']'> rc=0 2017-01-26 16:04:59-0800 [-] [buildbucket] RPC "('peek',)" failed: Traceback (most recent call last): File "/home/chrome-bot/buildbot/build/scripts/master/deferred_resource.py", line 297, in _retry res = yield threads.deferToThreadPool(reactor, self._pool, call) File "/home/chrome-bot/buildbot/build/third_party/twisted_10_2/twisted/python/threadpool.py", line 207, in _worker result = context.call(ctx, function, *args, **kwargs) File "/home/chrome-bot/buildbot/build/third_party/twisted_10_2/twisted/python/context.py", line 59, in callWithContext return self.currentContext().callWithContext(ctx, func, *args, **kw) File "/home/chrome-bot/buildbot/build/third_party/twisted_10_2/twisted/python/context.py", line 37, in callWithContext return func(*args,**kw) File "/home/chrome-bot/buildbot/build/scripts/master/deferred_resource.py", line 368, in single_call response = method(*args, **kwargs).execute(http=self._th_local.http) File "/home/chrome-bot/buildbot/build/third_party/oauth2client/oauth2client/util.py", line 129, in positional_wrapper return wrapped(*args, **kwargs) File "/home/chrome-bot/buildbot/build/third_party/google_api_python_client/googleapiclient/http.py", line 716, in execute body=self.body, headers=self.headers) File "/home/chrome-bot/buildbot/build/third_party/oauth2client/oauth2client/util.py", line 129, in positional_wrapper return wrapped(*args, **kwargs) File "/home/chrome-bot/buildbot/build/third_party/oauth2client/oauth2client/client.py", line 515, in new_request self._refresh(request_orig) File "/home/chrome-bot/buildbot/build/third_party/oauth2client/oauth2client/client.py", line 727, in _refresh self._do_refresh_request(http_request) File "/home/chrome-bot/buildbot/build/third_party/oauth2client/oauth2client/client.py", line 756, in _do_refresh_request self.token_uri, method='POST', body=body, headers=headers) File "/home/chrome-bot/buildbot/build/third_party/infra_libs/infra_libs/httplib2_utils.py", line 252, in request uri, method, body, *args, **kwargs) File "/home/chrome-bot/buildbot/build/third_party/httplib2/python2/httplib2/__init__.py", line 1593, in request (response, content) = self._request(conn, authority, uri, request_uri, method, body, headers, redirections, cachekey) File "/home/chrome-bot/buildbot/build/third_party/httplib2/python2/httplib2/__init__.py", line 1335, in _request (response, content) = self._conn_request(conn, request_uri, method, body, headers) File "/home/chrome-bot/buildbot/build/third_party/httplib2/python2/httplib2/__init__.py", line 1263, in _conn_request raise ServerNotFoundError("Unable to find the server at %s" % conn.host) ServerNotFoundError: Unable to find the server at accounts.google.com 2017-01-26 16:04:59-0800 [-] while polling for changes Traceback (most recent call last): File "/home/chrome-bot/buildbot/build/third_party/twisted_10_2/twisted/internet/defer.py", line 388, in errback self._startRunCallbacks(fail) File "/home/chrome-bot/buildbot/build/third_party/twisted_10_2/twisted/internet/defer.py", line 455, in _startRunCallbacks self._runCallbacks() File "/home/chrome-bot/buildbot/build/third_party/twisted_10_2/twisted/internet/defer.py", line 542, in _runCallbacks current.result = callback(current.result, *args, **kw) File "/home/chrome-bot/buildbot/build/third_party/twisted_10_2/twisted/internet/defer.py", line 1076, in gotResult _inlineCallbacks(r, g, deferred) --- <exception caught here> --- File "/home/chrome-bot/buildbot/build/third_party/twisted_10_2/twisted/internet/defer.py", line 1018, in _inlineCallbacks result = result.throwExceptionIntoGenerator(g) File "/home/chrome-bot/buildbot/build/third_party/twisted_10_2/twisted/python/failure.py", line 349, in throwExceptionIntoGenerator return g.throw(self.type, self.value, self.tb) File "/home/chrome-bot/buildbot/build/scripts/master/buildbucket/poller.py", line 47, in poll yield self.integrator.poll_builds() File "/home/chrome-bot/buildbot/build/third_party/twisted_10_2/twisted/internet/defer.py", line 1018, in _inlineCallbacks result = result.throwExceptionIntoGenerator(g) File "/home/chrome-bot/buildbot/build/third_party/twisted_10_2/twisted/python/failure.py", line 349, in throwExceptionIntoGenerator return g.throw(self.type, self.value, self.tb) File "/home/chrome-bot/buildbot/build/scripts/master/buildbucket/integration.py", line 422, in poll_builds start_cursor=start_cursor, File "/home/chrome-bot/buildbot/build/third_party/twisted_10_2/twisted/internet/defer.py", line 1018, in _inlineCallbacks result = result.throwExceptionIntoGenerator(g) File "/home/chrome-bot/buildbot/build/third_party/twisted_10_2/twisted/python/failure.py", line 349, in throwExceptionIntoGenerator return g.throw(self.type, self.value, self.tb) File "/home/chrome-bot/buildbot/build/scripts/master/deferred_resource.py", line 312, in _retry raise ex httplib2.ServerNotFoundError: Unable to find the server at accounts.google.com buildbot-buildbucket's attempt to mark the build as completed failed permanently after several retries. I see 14 occurrencies of "Unable to find the server at accounts.google.com" between 14:25 PDT and 16:33 PDT. Buildbucket's behavior to mark incomplete-on-time builds as expired is intended to deal with cases like this. Eventually the buildbucket build was completed from the perspective of buildbucket API users. Can cbuildbot do something to deal with this behavior? Note that only one buildbot build is considered "final" one, and it is the one referenced in build's url and result_details_json attributes.
,
Jan 31 2017
To be clear, nodir@ is suggesting that the master cbuildbot process manage this case, not the triggered build.
,
Feb 3 2017
,
Apr 18 2017
Looks like the same issue: https://uberchromegw.corp.google.com/i/chromeos/builders/veyron_mighty-paladin/builds/5035
,
Mar 30 2018
|
|||||||
►
Sign in to add a comment |
|||||||
Comment 1 by pprabhu@chromium.org
, Jan 27 2017Labels: current-issue