shard_client rpc failures cause test failures/timeouts
Issue description
Noticed in a recent shard_client log:
02/08 17:54:16.626 INFO | shard_client:0331| Performing heartbeat.
02/08 17:54:16.642 INFO | shard_client:0297| Known jobs: [175284317L, 175284326L, 175284340L, 175284355L, 175284367L, 175284379L, 175284389L, 175284400L, 175284409L, 175284418L, 175284425L, 175285900L, 175285906L, 175285911L, 175285917L, 175285922L, 175285926L, 175285931L, 175285937L, 175285942L, 175285947L, 175285950L, 175286077L, 175286090L, 175286102L, 175286115L, 175286934L, 175286936L, 175286938L, 175286940L, 175286942L, 175287282L, 175287283L, 175287284L, 175287287L, 175287291L, 175422898L, 175422918L, 175422938L, 175422957L, 175422980L, 175422994L, 175423011L, 175423025L, 175423038L, 175423054L, 175423063L, 175423067L, 175423846L, 175423892L, 175423938L, 175424009L, 175424053L, 175424096L, 175424139L, 175424178L, 175424221L, 175424290L, 175424324L, 175424348L]
02/08 17:54:38.185 INFO | shard_client:0303| Uploading jobs [175422866L, 175423801L]
02/08 17:56:04.299 DEBUG| retry_util:0202| ending retries with error: <class 'autotest_lib.frontend.afe.json_rpc.proxy.JSONRPCException'>(DatabaseError: (1205, 'Lock wait timeout exceeded; try restarting transaction')
Traceback (most recent call last):
File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 109, in dispatchRequest
results['result'] = self.invokeServiceEndpoint(meth, args)
File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 147, in invokeServiceEndpoint
return meth(*args)
File "/usr/local/autotest/frontend/afe/rpc_handler.py", line 270, in new_fn
return f(*args, **keyword_args)
File "/usr/local/autotest/frontend/afe/rpc_interface.py", line 2131, in shard_heartbeat
known_host_ids=known_host_ids)
File "/usr/local/autotest/frontend/afe/rpc_utils.py", line 956, in find_records_for_shard
jobs = models.Job.assign_to_shard(shard, known_job_ids)
File "/usr/local/autotest/frontend/afe/models.py", line 1619, in assign_to_shard
Job.objects.filter(pk__in=job_ids).update(shard=shard)
File "/usr/local/autotest/site-packages/django/db/models/query.py", line 567, in update
rows = query.get_compiler(self.db).execute_sql(None)
File "/usr/local/autotest/site-packages/django/db/models/sql/compiler.py", line 1014, in execute_sql
cursor = super(SQLUpdateCompiler, self).execute_sql(result_type)
File "/usr/local/autotest/site-packages/django/db/models/sql/compiler.py", line 840, in execute_sql
cursor.execute(sql, params)
File "/usr/local/autotest/site-packages/django/db/backends/mysql/base.py", line 128, in execute
six.reraise(utils.DatabaseError, utils.DatabaseError(*tuple(e.args)), sys.exc_info()[2])
File "/usr/local/autotest/site-packages/django/db/backends/mysql/base.py", line 120, in execute
return self.cursor.execute(query, args)
File "/usr/local/autotest/site-packages/MySQLdb/cursors.py", line 174, in execute
self.errorhandler(self, exc, value)
File "/usr/local/autotest/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
raise errorclass, errorvalue
DatabaseError: (1205, 'Lock wait timeout exceeded; try restarting transaction')
)
02/08 17:56:04.299 ERROR| shard_client:0313| Heartbeat failed. JSONRPCException: DatabaseError: (1205, 'Lock wait timeout exceeded; try restarting transaction')
Traceback (most recent call last):
File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 109, in dispatchRequest
results['result'] = self.invokeServiceEndpoint(meth, args)
File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 147, in invokeServiceEndpoint
return meth(*args)
File "/usr/local/autotest/frontend/afe/rpc_handler.py", line 270, in new_fn
return f(*args, **keyword_args)
File "/usr/local/autotest/frontend/afe/rpc_interface.py", line 2131, in shard_heartbeat
known_host_ids=known_host_ids)
File "/usr/local/autotest/frontend/afe/rpc_utils.py", line 956, in find_records_for_shard
jobs = models.Job.assign_to_shard(shard, known_job_ids)
File "/usr/local/autotest/frontend/afe/models.py", line 1619, in assign_to_shard
Job.objects.filter(pk__in=job_ids).update(shard=shard)
File "/usr/local/autotest/site-packages/django/db/models/query.py", line 567, in update
rows = query.get_compiler(self.db).execute_sql(None)
File "/usr/local/autotest/site-packages/django/db/models/sql/compiler.py", line 1014, in execute_sql
cursor = super(SQLUpdateCompiler, self).execute_sql(result_type)
File "/usr/local/autotest/site-packages/django/db/models/sql/compiler.py", line 840, in execute_sql
cursor.execute(sql, params)
File "/usr/local/autotest/site-packages/django/db/backends/mysql/base.py", line 128, in execute
six.reraise(utils.DatabaseError, utils.DatabaseError(*tuple(e.args)), sys.exc_info()[2])
File "/usr/local/autotest/site-packages/django/db/backends/mysql/base.py", line 120, in execute
return self.cursor.execute(query, args)
File "/usr/local/autotest/site-packages/MySQLdb/cursors.py", line 174, in execute
self.errorhandler(self, exc, value)
File "/usr/local/autotest/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
raise errorclass, errorvalue
DatabaseError: (1205, 'Lock wait timeout exceeded; try restarting transaction')
-------
But then later in the same log:
02/08 18:28:37.868 INFO | shard_client:0331| Performing heartbeat.
02/08 18:28:37.900 INFO | shard_client:0297| Known jobs: [175284317L, 175284326L, 175284340L, 175284355L, 175284367L, 175284379L, 175284389L, 175284400L, 175284409L, 175284418L, 175284425L, 175285900L, 175285906L, 175285911L, 175285917L, 175285922L, 175285926L, 175285931L, 175285937L, 175285942L, 175285947L, 175285950L, 175286077L, 175286090L, 175286102L, 175286115L, 175286934L, 175286936L, 175286938L, 175286940L, 175286942L, 175287282L, 175287283L, 175287284L, 175287287L, 175287291L, 175422898L, 175422918L, 175422938L, 175422957L, 175422980L, 175422994L, 175423011L, 175423025L, 175423038L, 175423054L, 175423063L, 175423067L, 175423846L, 175423892L, 175423938L, 175424009L, 175424053L, 175424096L, 175424139L, 175424178L, 175424221L, 175424290L, 175424324L, 175424348L]
02/08 18:28:58.050 INFO | shard_client:0303| Uploading jobs [175422866L, 175423801L]
02/08 18:28:58.193 DEBUG| retry_util:0204| <class 'urllib2.HTTPError'>(HTTP Error 500: INTERNAL SERVER ERROR)
02/08 18:28:58.193 DEBUG| retry_util:0066| Retrying in 5.000000 (5.000000 + jitter 0.000000) seconds ...
02/08 18:29:03.339 DEBUG| retry_util:0204| <class 'urllib2.HTTPError'>(HTTP Error 500: INTERNAL SERVER ERROR)
02/08 18:29:03.339 DEBUG| retry_util:0066| Retrying in 10.000000 (10.000000 + jitter 0.000000) seconds ...
02/08 18:29:13.492 DEBUG| retry_util:0204| <class 'urllib2.HTTPError'>(HTTP Error 500: INTERNAL SERVER ERROR)
02/08 18:29:13.493 DEBUG| retry_util:0066| Retrying in 20.000000 (20.000000 + jitter 0.000000) seconds ...
02/08 18:29:33.647 DEBUG| retry_util:0204| <class 'urllib2.HTTPError'>(HTTP Error 500: INTERNAL SERVER ERROR)
02/08 18:29:33.648 DEBUG| retry_util:0066| Retrying in 40.000000 (40.000000 + jitter 0.000000) seconds ...
02/08 18:30:13.829 DEBUG| retry_util:0204| <class 'urllib2.HTTPError'>(HTTP Error 500: INTERNAL SERVER ERROR)
02/08 18:30:13.829 DEBUG| retry_util:0066| Retrying in 80.000000 (80.000000 + jitter 0.000000) seconds ...
02/08 18:31:33.994 DEBUG| retry_util:0204| <class 'urllib2.HTTPError'>(HTTP Error 500: INTERNAL SERVER ERROR)
02/08 18:31:33.994 DEBUG| retry_util:0066| Retrying in 160.000000 (160.000000 + jitter 0.000000) seconds ...
02/08 18:33:58.053 ERROR| shard_client:0313| Heartbeat failed. TimeoutError: Timeout occurred- waited 300.0 seconds.
The latter behavior is probably desired.
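For reference, the backoff visible above (5, 10, 20, 40, 80, 160 seconds plus jitter, then giving up at a hard 300-second deadline) is a standard exponential-backoff-with-jitter pattern. A minimal sketch of that pattern — hypothetical, not the actual retry_util implementation; the function name, parameters, and the use of Python 3's built-in TimeoutError are assumptions:

    import random
    import time

    def retry_with_backoff(func, initial_delay=5.0, backoff=2.0,
                           timeout=300.0, max_jitter=1.0,
                           retriable=(Exception,)):
        """Call func(), retrying failures with exponential backoff.

        Mirrors the pattern in the log above: delays of 5, 10, 20,
        40, 80, 160 seconds (plus jitter), giving up once the overall
        deadline (300 seconds here) would be exceeded.
        """
        deadline = time.time() + timeout
        delay = initial_delay
        while True:
            try:
                return func()
            except retriable:
                # Stop retrying if the next sleep would blow the deadline.
                if time.time() + delay >= deadline:
                    raise TimeoutError(
                        'Timeout occurred- waited %.1f seconds.' % timeout)
                jitter = random.uniform(0, max_jitter)
                print('Retrying in %f (%f + jitter %f) seconds ...'
                      % (delay + jitter, delay, jitter))
                time.sleep(delay + jitter)
                delay *= backoff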
Feb 9 2018
All the failures were due to temporary db overload, so the same real root cause.
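The failing statement in both tracebacks is a single bulk UPDATE (Job.objects.filter(pk__in=job_ids).update(shard=shard)), which holds row locks on every matched job for the duration of the statement, so an overloaded database can easily push it past the 1205 lock-wait timeout. One mitigation worth sketching is to chunk the update so each statement locks fewer rows for less time. A hedged sketch, not the actual fix — CHUNK_SIZE and the helper name are made up:

    # Illustrative only; assumes autotest's AFE models are importable
    # via this path (consistent with the traceback's module names).
    from autotest_lib.frontend.afe import models

    CHUNK_SIZE = 100  # assumed batch size, would need tuning

    def assign_jobs_to_shard_chunked(shard, job_ids):
        """Equivalent to Job.objects.filter(pk__in=job_ids).update(shard=shard),
        but issued in smaller batches to reduce lock-wait contention."""
        for i in range(0, len(job_ids), CHUNK_SIZE):
            chunk = job_ids[i:i + CHUNK_SIZE]
            # Each .update() runs as a separate UPDATE statement, so a
            # concurrent transaction holding row locks blocks at most
            # one chunk at a time instead of the whole assignment.
            models.Job.objects.filter(pk__in=chunk).update(shard=shard)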
Mar 26 2018
Another build failure: https://luci-milo.appspot.com/buildbot/chromeos/edgar-paladin/2531
Mar 26 2018
Jacob, this is the shard database problem we discussed in the meeting this morning.
Mar 26 2018
The RPC change (crrev.com/c/969994) applies to cros-full-0006.mtv.corp.google.com, cros-full-0007.mtv.corp.google.com, and cros-full-0008.mtv.corp.google.com. They don't show up in the obvious logs, but I'm not seeing anything like the logs mentioned above either, so perhaps I'm not looking in the right place.
Mar 26 2018
Looking at a relevant graph, the failures very obviously started when the change landed behind a flag. Reverting.
Graph: pcon/chromeos-infra-alert-owners/queryplayground?yAxisMin=0&yAxisMin2=0&oldHeatmap=false&query=CAEY9QOKAXcKZQgPemEKXzJdChltb25hcmNoLmFjcXVpc2l0aW9ucy5UYXNrEkAKPi9jaHJvbWUvaW5mcmEvY2hyb21lb3MvYXV0b3Rlc3Qvc2hhcmRfY2xpZW50L2hlYXJ0YmVhdF9mYWlsdXJlIICYmrwEKgIgATCAyM60DZIBFwoGCAEYoI0GEgAaCWhvc3RfbmFtZWAA&names=Query%201&duration=1w
Mar 26 2018
Scratch that: when I looked in more detail, I found that the incident started at 3:55 PM Friday, and my change didn't leave the CQ until 5:31 PM. This isn't my change; I have no further knowledge of it, so I'm handing it back to the deputy.
Mar 26 2018
I also don't think this is the same problem as was seen in February. The failure messages don't look similar.
Mar 26 2018
Log: https://logs.chromium.org/v/?s=chromeos%2Fbb%2Fchromeos%2Fedgar-paladin%2F2531%2F%2B%2Frecipes%2Fsteps%2FHWTest__bvt-cq_%2F0%2Fstdout
Reason: Tests failed.
03-25-2018 [07:58:03] Output below this line is for buildbot consumption:
@@@STEP_LINK@[Test-Logs]: logging_UserCrash: ERROR: Command <make > failed, rc=127, Command returned non-zero exit status@http://cautotest-prod/tko/retrieve_logs.cgi?job=/results/186496193-chromeos-test/@@@
@@@STEP_LINK@[Flake-Dashboard]: logging_UserCrash@https://wmatrix.googleplex.com/retry_teststats/?days_back=30&tests=logging_UserCrash@@@
@@@STEP_LINK@[Test-History]: logging_UserCrash@https://wmatrix.googleplex.com/unfiltered?hide_missing=True&tests=logging_UserCrash@@@
@@@STEP_LINK@[Test-Logs]: platform_DebugDaemonCupsAddPrinters: retry_count: 1, FAIL: Unhandled IOError: [Errno 2] No such file or directory: '/usr/local/autotest/site_tests/platform_DebugDaemonCupsAddPrinters/src/GenericPostScript.ppd.gz'@http://cautotest-prod/tko/retrieve_logs.cgi?job=/results/186496669-chromeos-test/@@@
@@@STEP_LINK@[Flake-Dashboard]: platform_DebugDaemonCupsAddPrinters@https://wmatrix.googleplex.com/retry_teststats/?days_back=30&tests=platform_DebugDaemonCupsAddPrinters@@@
@@@STEP_LINK@[Test-History]: platform_DebugDaemonCupsAddPrinters@https://wmatrix.googleplex.com/unfiltered?hide_missing=True&tests=platform_DebugDaemonCupsAddPrinters@@@
@@@STEP_LINK@[Test-Logs]: hardware_PerfCallgraphVerification: retry_count: 1, GOOD: completed successfully@http://cautotest-prod/tko/retrieve_logs.cgi?job=/results/186496671-chromeos-test/@@@
@@@STEP_LINK@[Flake-Dashboard]: hardware_PerfCallgraphVerification@https://wmatrix.googleplex.com/retry_teststats/?days_back=30&tests=hardware_PerfCallgraphVerification@@@
@@@STEP_LINK@[Test-History]: hardware_PerfCallgraphVerification@https://wmatrix.googleplex.com/unfiltered?hide_missing=True&tests=hardware_PerfCallgraphVerification@@@
@@@STEP_LINK@[Test-Logs]: p2p_ServeFiles: retry_count: 1, ERROR: Unhandled PackageInstallError: Installation of lansim(type:dep) failed : dep-lansim.tar.bz2 could not be fetched from any of the repos ['autoserv://']@http://cautotest-prod/tko/retrieve_logs.cgi?job=/results/186496672-chromeos-test/@@@
@@@STEP_LINK@[Flake-Dashboard]: p2p_ServeFiles@https://wmatrix.googleplex.com/retry_teststats/?days_back=30&tests=p2p_ServeFiles@@@
@@@STEP_LINK@[Test-History]: p2p_ServeFiles@https://wmatrix.googleplex.com/unfiltered?hide_missing=True&tests=p2p_ServeFiles@@@
@@@STEP_LINK@[Test-Logs]: p2p_ConsumeFiles: retry_count: 1, ERROR: Unhandled PackageInstallError: Installation of lansim(type:dep) failed : dep-lansim.tar.bz2 could not be fetched from any of the repos ['autoserv://']@http://cautotest-prod/tko/retrieve_logs.cgi?job=/results/186496673-chromeos-test/@@@
@@@STEP_LINK@[Flake-Dashboard]: p2p_ConsumeFiles@https://wmatrix.googleplex.com/retry_teststats/?days_back=30&tests=p2p_ConsumeFiles@@@
@@@STEP_LINK@[Test-History]: p2p_ConsumeFiles@https://wmatrix.googleplex.com/unfiltered?hide_missing=True&tests=p2p_ConsumeFiles@@@
@@@STEP_LINK@[Test-Logs]: p2p_ShareFiles: retry_count: 1, ERROR: Unhandled PackageInstallError: Installation of lansim(type:dep) failed : dep-lansim.tar.bz2 could not be fetched from any of the repos ['autoserv://']@http://cautotest-prod/tko/retrieve_logs.cgi?job=/results/186496674-chromeos-test/@@@
@@@STEP_LINK@[Flake-Dashboard]: p2p_ShareFiles@https://wmatrix.googleplex.com/retry_teststats/?days_back=30&tests=p2p_ShareFiles@@@
@@@STEP_LINK@[Test-History]: p2p_ShareFiles@https://wmatrix.googleplex.com/unfiltered?hide_missing=True&tests=p2p_ShareFiles@@@
Will return from run_suite with status: ERROR
cmd=['/b/c/cbuild/repository/chromite/third_party/swarming.client/swarming.py', 'run', '--swarming', 'chromeos-proxy.appspot.com', '--task-summary-json', '/tmp/cbuildbot-tmpj_qIwZ/tmp87SerG/temp_summary.json', '--raw-cmd', '--task-name', u'edgar-paladin/R67-10516.0.0-rc5-bvt-cq', '--dimension', 'os', 'Ubuntu-14.04', '--dimension', 'pool', 'default', '--print-status-updates', '--timeout', '9000', '--io-timeout', '9000', '--hard-timeout', '9000', '--expiration', '1200', u'--tags=priority:CQ', u'--tags=suite:bvt-cq', u'--tags=build:edgar-paladin/R67-10516.0.0-rc5', u'--tags=task_name:edgar-paladin/R67-10516.0.0-rc5-bvt-cq', u'--tags=board:edgar', '--', '/usr/local/autotest/site_utils/run_suite.py', '--build', u'edgar-paladin/R67-10516.0.0-rc5', '--board', u'edgar', '--suite_name', u'bvt-cq', '--pool', u'cq', '--file_bugs', 'False', '--priority', 'CQ', '--timeout_mins', '90', '--retry', 'True', '--max_retries', '5', '--minimum_duts', '4', '--offload_failures_only', 'False', '--job_keyvals', "{'cidb_build_stage_id': 74445514L, 'cidb_build_id': 2413390, 'datastore_parent_key': ('Build', 2413390, 'BuildStage', 74445514L)}", '--test_args', "{'fast': 'True'}", '-m', '186496039']
07:58:05: ERROR: wait_cmd has lab failures: cmd=[...same swarming.py invocation as above...]. Exception will be raised in the next json_dump run.
Mar 29 2018
Relevant logs are here: http://pantheon/storage/browser/chromeos-autotest-results/186496193-chromeos-test/chromeos2-row5-rack2-host5/debug/
I've tagged this bug on the three autotest CLs in the CQ run most likely to have been the cause. I'm also going to crosoncall to poke the Sheriff about it.
Mar 29 2018
This bug conflates an old Feb outage with a recent one. Let's make this bug about the recent one.
Mar 29 2018
Actually, I don't see any relationship between the failure linked in #10 and the original shard_client errors. What is this bug about? Is there a recent occurrence of it? If not, I'm tempted to call this bug a WontFix and attribute #10 to test flake of undetermined cause.
Mar 29 2018
There are ongoing RPC problems: http://shortn/_tCVYRVBvRZ
Mar 29 2018
^ I suggest using the chrome-infra role instead; it will eliminate that scary quota warning: http://shortn/_tEBHMwavFz These look like small, occasional failure spikes, and I'm not convinced of their severity. They seem unrelated to #10, though.
Mar 29 2018
Examining shard_client logs on one of the affected shards (cros-full-18), I see:
chromeos-test@cros-full-0018:/usr/local/autotest/logs$ less shard_client.latest
...
...
03/29 08:28:58.910 ERROR| shard_client:0344| Heartbeat failed. JSONRPCException: DatabaseError: (1205, 'Lock wait timeout exceeded; try restarting transaction')
Traceback (most recent call last):
File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 109, in dispatchRequest
results['result'] = self.invokeServiceEndpoint(meth, args)
File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 147, in invokeServiceEndpoint
return meth(*args)
File "/usr/local/autotest/frontend/afe/rpc_handler.py", line 270, in new_fn
return f(*args, **keyword_args)
File "/usr/local/autotest/frontend/afe/rpc_interface.py", line 2202, in shard_heartbeat
host_model.save()
File "/usr/local/autotest/frontend/afe/models.py", line 740, in save
super(Host, self).save(*args, **kwargs)
File "/usr/local/autotest/frontend/afe/model_logic.py", line 1249, in save
super(ModelWithInvalid, self).save(*args, **kwargs)
File "/usr/local/autotest/site-packages/django/db/models/base.py", line 546, in save
force_update=force_update, update_fields=update_fields)
File "/usr/local/autotest/site-packages/django/db/models/base.py", line 626, in save_base
rows = manager.using(using).filter(pk=pk_val)._update(values)
File "/usr/local/autotest/site-packages/django/db/models/query.py", line 591, in _update
return query.get_compiler(self.db).execute_sql(None)
File "/usr/local/autotest/site-packages/django/db/models/sql/compiler.py", line 1014, in execute_sql
cursor = super(SQLUpdateCompiler, self).execute_sql(result_type)
File "/usr/local/autotest/site-packages/django/db/models/sql/compiler.py", line 840, in execute_sql
cursor.execute(sql, params)
File "/usr/local/autotest/site-packages/django/db/backends/mysql/base.py", line 128, in execute
six.reraise(utils.DatabaseError, utils.DatabaseError(*tuple(e.args)), sys.exc_info()[2])
File "/usr/local/autotest/site-packages/django/db/backends/mysql/base.py", line 120, in execute
return self.cursor.execute(query, args)
File "/usr/local/autotest/site-packages/MySQLdb/cursors.py", line 174, in execute
self.errorhandler(self, exc, value)
File "/usr/local/autotest/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
raise errorclass, errorvalue
DatabaseError: (1205, 'Lock wait timeout exceeded; try restarting transaction')
which does sound similar to #10, but it is not happening all that often.
Mar 29 2018
The correlated spike on all shards at once at 09:18:38 is more concerning.
Mar 29 2018
They appear to have started abruptly last week, but maybe that's not an issue.
Mar 29 2018
My guess would be slow queries piling up on the master db at that time (which is what the slave-replica RPC is intended to mitigate or resolve).
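For illustration, the standard Django mechanism behind a slave-replica read path is a database router. A minimal sketch, assuming a 'replica' alias were configured in DATABASES — hypothetical, and not necessarily how the actual slave-replica RPC change is implemented:

    # Assumed settings.py entries: DATABASES defines both 'default'
    # (master) and 'replica' (read-only slave), and
    # DATABASE_ROUTERS = ['path.to.ReadReplicaRouter'].

    class ReadReplicaRouter(object):
        """Route ORM reads to the replica and writes to the master."""

        def db_for_read(self, model, **hints):
            return 'replica'

        def db_for_write(self, model, **hints):
            return 'default'

        def allow_relation(self, obj1, obj2, **hints):
            # Master and replica hold the same data, so relations
            # between objects from either alias are fine.
            return True

A read-heavy RPC can also opt in per query with models.Job.objects.using('replica'), keeping slow reads off the master at the cost of tolerating replication lag.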
Mar 29 2018
Hmm, that makes sense and would explain why it has only shown up recently. I'll unblock 810965 and push that fix; possibly it will deal with this as well.
Mar 29 2018
#19: yes, there's probably a real bug to be found. At this point, though, this particular crbug has a severe identity crisis. I suggest WontFix'ing this and starting a fresh one about the onset of this background rate of heartbeat failures.
Mar 29 2018
Agreed.