
Issue 810625

Starred by 3 users

Issue metadata

Status: WontFix
Owner:
Last visit > 30 days ago
Closed: Mar 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug




shard_client rpc failures cause test failures/timeouts

Project Member Reported by akes...@chromium.org, Feb 9 2018

Issue description

Noticed in a recent shard_client log:

02/08 17:54:16.626 INFO |      shard_client:0331| Performing heartbeat.
02/08 17:54:16.642 INFO |      shard_client:0297| Known jobs: [175284317L, 175284326L, 175284340L, 175284355L, 175284367L, 175284379L, 175284389L, 175284400L, 175284409L, 175284418L, 175284425L, 175285900L, 175285906L, 175285911L, 175285917L, 175285922L, 175285926L, 175285931L, 175285937L, 175285942L, 175285947L, 175285950L, 175286077L, 175286090L, 175286102L, 175286115L, 175286934L, 175286936L, 175286938L, 175286940L, 175286942L, 175287282L, 175287283L, 175287284L, 175287287L, 175287291L, 175422898L, 175422918L, 175422938L, 175422957L, 175422980L, 175422994L, 175423011L, 175423025L, 175423038L, 175423054L, 175423063L, 175423067L, 175423846L, 175423892L, 175423938L, 175424009L, 175424053L, 175424096L, 175424139L, 175424178L, 175424221L, 175424290L, 175424324L, 175424348L]
02/08 17:54:38.185 INFO |      shard_client:0303| Uploading jobs [175422866L, 175423801L]
02/08 17:56:04.299 DEBUG|        retry_util:0202| ending retries with error: <class 'autotest_lib.frontend.afe.json_rpc.proxy.JSONRPCException'>(DatabaseError: (1205, 'Lock wait timeout exceeded; try restarting transaction')
Traceback (most recent call last):
  File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 109, in dispatchRequest
    results['result'] = self.invokeServiceEndpoint(meth, args)
  File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 147, in invokeServiceEndpoint
    return meth(*args)
  File "/usr/local/autotest/frontend/afe/rpc_handler.py", line 270, in new_fn
    return f(*args, **keyword_args)
  File "/usr/local/autotest/frontend/afe/rpc_interface.py", line 2131, in shard_heartbeat
    known_host_ids=known_host_ids)
  File "/usr/local/autotest/frontend/afe/rpc_utils.py", line 956, in find_records_for_shard
    jobs = models.Job.assign_to_shard(shard, known_job_ids)
  File "/usr/local/autotest/frontend/afe/models.py", line 1619, in assign_to_shard
    Job.objects.filter(pk__in=job_ids).update(shard=shard)
  File "/usr/local/autotest/site-packages/django/db/models/query.py", line 567, in update
    rows = query.get_compiler(self.db).execute_sql(None)
  File "/usr/local/autotest/site-packages/django/db/models/sql/compiler.py", line 1014, in execute_sql
    cursor = super(SQLUpdateCompiler, self).execute_sql(result_type)
  File "/usr/local/autotest/site-packages/django/db/models/sql/compiler.py", line 840, in execute_sql
    cursor.execute(sql, params)
  File "/usr/local/autotest/site-packages/django/db/backends/mysql/base.py", line 128, in execute
    six.reraise(utils.DatabaseError, utils.DatabaseError(*tuple(e.args)), sys.exc_info()[2])
  File "/usr/local/autotest/site-packages/django/db/backends/mysql/base.py", line 120, in execute
    return self.cursor.execute(query, args)
  File "/usr/local/autotest/site-packages/MySQLdb/cursors.py", line 174, in execute
    self.errorhandler(self, exc, value)
  File "/usr/local/autotest/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
    raise errorclass, errorvalue
DatabaseError: (1205, 'Lock wait timeout exceeded; try restarting transaction')
)
02/08 17:56:04.299 ERROR|      shard_client:0313| Heartbeat failed. JSONRPCException: DatabaseError: (1205, 'Lock wait timeout exceeded; try restarting transaction')
Traceback (most recent call last):
  File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 109, in dispatchRequest
    results['result'] = self.invokeServiceEndpoint(meth, args)
  File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 147, in invokeServiceEndpoint
    return meth(*args)
  File "/usr/local/autotest/frontend/afe/rpc_handler.py", line 270, in new_fn
    return f(*args, **keyword_args)
  File "/usr/local/autotest/frontend/afe/rpc_interface.py", line 2131, in shard_heartbeat
    known_host_ids=known_host_ids)
  File "/usr/local/autotest/frontend/afe/rpc_utils.py", line 956, in find_records_for_shard
    jobs = models.Job.assign_to_shard(shard, known_job_ids)
  File "/usr/local/autotest/frontend/afe/models.py", line 1619, in assign_to_shard
    Job.objects.filter(pk__in=job_ids).update(shard=shard)
  File "/usr/local/autotest/site-packages/django/db/models/query.py", line 567, in update
    rows = query.get_compiler(self.db).execute_sql(None)
  File "/usr/local/autotest/site-packages/django/db/models/sql/compiler.py", line 1014, in execute_sql
    cursor = super(SQLUpdateCompiler, self).execute_sql(result_type)
  File "/usr/local/autotest/site-packages/django/db/models/sql/compiler.py", line 840, in execute_sql
    cursor.execute(sql, params)
  File "/usr/local/autotest/site-packages/django/db/backends/mysql/base.py", line 128, in execute
    six.reraise(utils.DatabaseError, utils.DatabaseError(*tuple(e.args)), sys.exc_info()[2])
  File "/usr/local/autotest/site-packages/django/db/backends/mysql/base.py", line 120, in execute
    return self.cursor.execute(query, args)
  File "/usr/local/autotest/site-packages/MySQLdb/cursors.py", line 174, in execute
    self.errorhandler(self, exc, value)
  File "/usr/local/autotest/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
    raise errorclass, errorvalue
DatabaseError: (1205, 'Lock wait timeout exceeded; try restarting transaction')


-------


But then, later in the same log:

02/08 18:28:37.868 INFO |      shard_client:0331| Performing heartbeat.
02/08 18:28:37.900 INFO |      shard_client:0297| Known jobs: [175284317L, 175284326L, 175284340L, 175284355L, 175284367L, 175284379L, 175284389L, 175284400L, 175284409L, 175284418L, 175284425L, 175285900L, 175285906L, 175285911L, 175285917L, 175285922L, 175285926L, 175285931L, 175285937L, 175285942L, 175285947L, 175285950L, 175286077L, 175286090L, 175286102L, 175286115L, 175286934L, 175286936L, 175286938L, 175286940L, 175286942L, 175287282L, 175287283L, 175287284L, 175287287L, 175287291L, 175422898L, 175422918L, 175422938L, 175422957L, 175422980L, 175422994L, 175423011L, 175423025L, 175423038L, 175423054L, 175423063L, 175423067L, 175423846L, 175423892L, 175423938L, 175424009L, 175424053L, 175424096L, 175424139L, 175424178L, 175424221L, 175424290L, 175424324L, 175424348L]
02/08 18:28:58.050 INFO |      shard_client:0303| Uploading jobs [175422866L, 175423801L]
02/08 18:28:58.193 DEBUG|        retry_util:0204| <class 'urllib2.HTTPError'>(HTTP Error 500: INTERNAL SERVER ERROR)
02/08 18:28:58.193 DEBUG|        retry_util:0066| Retrying in 5.000000 (5.000000 + jitter 0.000000) seconds ...
02/08 18:29:03.339 DEBUG|        retry_util:0204| <class 'urllib2.HTTPError'>(HTTP Error 500: INTERNAL SERVER ERROR)
02/08 18:29:03.339 DEBUG|        retry_util:0066| Retrying in 10.000000 (10.000000 + jitter 0.000000) seconds ...
02/08 18:29:13.492 DEBUG|        retry_util:0204| <class 'urllib2.HTTPError'>(HTTP Error 500: INTERNAL SERVER ERROR)
02/08 18:29:13.493 DEBUG|        retry_util:0066| Retrying in 20.000000 (20.000000 + jitter 0.000000) seconds ...
02/08 18:29:33.647 DEBUG|        retry_util:0204| <class 'urllib2.HTTPError'>(HTTP Error 500: INTERNAL SERVER ERROR)
02/08 18:29:33.648 DEBUG|        retry_util:0066| Retrying in 40.000000 (40.000000 + jitter 0.000000) seconds ...
02/08 18:30:13.829 DEBUG|        retry_util:0204| <class 'urllib2.HTTPError'>(HTTP Error 500: INTERNAL SERVER ERROR)
02/08 18:30:13.829 DEBUG|        retry_util:0066| Retrying in 80.000000 (80.000000 + jitter 0.000000) seconds ...
02/08 18:31:33.994 DEBUG|        retry_util:0204| <class 'urllib2.HTTPError'>(HTTP Error 500: INTERNAL SERVER ERROR)
02/08 18:31:33.994 DEBUG|        retry_util:0066| Retrying in 160.000000 (160.000000 + jitter 0.000000) seconds ...
02/08 18:33:58.053 ERROR|      shard_client:0313| Heartbeat failed. TimeoutError: Timeout occurred- waited 300.0 seconds.



The latter behavior is probably desired.
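For reference, here is a minimal sketch of the backoff pattern visible above (delays doubling from 5s up to 160s with jitter, bounded by an overall deadline). This is illustrative only, not the actual chromite retry_util implementation; the function name, parameters, and defaults are assumptions.

import random
import time

def call_with_backoff(func, max_wait=300.0, base_delay=5.0, max_jitter=1.0,
                      retryable=(Exception,)):
    # Retry func() with doubling delays until the overall deadline passes.
    deadline = time.time() + max_wait
    delay = base_delay
    while True:
        try:
            return func()
        except retryable:
            # Out of budget: surface the failure (cf. the 300s TimeoutError above).
            if time.time() + delay > deadline:
                raise
            time.sleep(delay + random.uniform(0, max_jitter))
            delay *= 2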
 
Cc: pho...@chromium.org
Presumably JSONRPCException is being treated differently from HTTPError by the retry logic.
All the failures were due to temporary DB overload, so the underlying root cause is the same.
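As a hypothetical illustration of how that asymmetry could arise: if the retry helper's whitelist of retryable exceptions covers urllib2.HTTPError but not JSONRPCException, the HTTP 500s get the full exponential backoff while the wrapped DatabaseError escapes on the first attempt. The classes and function below are self-contained stand-ins, not the actual autotest/chromite code.

import time

class JSONRPCException(Exception):
    # Stand-in for autotest_lib.frontend.afe.json_rpc.proxy.JSONRPCException.
    pass

class HTTPError(Exception):
    # Stand-in for urllib2.HTTPError.
    pass

def retry_on(exc_types, func, attempts=6, delay=5.0):
    # Retry func() only for exceptions listed in exc_types; anything else
    # (e.g. JSONRPCException when only HTTPError is whitelisted) propagates
    # out of the first call with no retries.
    for attempt in range(attempts):
        try:
            return func()
        except exc_types:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)
            delay *= 2

# Example: retry_on((HTTPError,), do_heartbeat) retries HTTP 500s but lets
# a JSONRPCException-wrapped DB lock timeout fail immediately.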
Owner: jkop@chromium.org
Jacob, this is the shard database problem we discussed in the meeting this morning

Comment 6 by jkop@chromium.org, Mar 26 2018

The RPC change (crrev.com/c/969994) applies to cros-full-0006.mtv.corp.google.com, cros-full-0007.mtv.corp.google.com, and cros-full-0008.mtv.corp.google.com. They don't show up in the obvious logs, but I'm also not seeing anything like the logs mentioned above, so I'm probably not looking in the right place.

Comment 7 by jkop@chromium.org, Mar 26 2018

Status: Started (was: Available)
Looking at a relevant graph, the failures very obviously started when the change landed behind a flag. Reverting.

pcon/chromeos-infra-alert-owners/queryplayground?yAxisMin=0&yAxisMin2=0&oldHeatmap=false&query=CAEY9QOKAXcKZQgPemEKXzJdChltb25hcmNoLmFjcXVpc2l0aW9ucy5UYXNrEkAKPi9jaHJvbWUvaW5mcmEvY2hyb21lb3MvYXV0b3Rlc3Qvc2hhcmRfY2xpZW50L2hlYXJ0YmVhdF9mYWlsdXJlIICYmrwEKgIgATCAyM60DZIBFwoGCAEYoI0GEgAaCWhvc3RfbmFtZWAA&names=Query%201&duration=1w

Comment 8 by jkop@chromium.org, Mar 26 2018

Cc: jkop@chromium.org
Owner: shuqianz@chromium.org
Status: Assigned (was: Started)
Scratch that: when I looked in more detail, I found that the incident started at 3:55 PM Friday, and my change didn't leave the CQ until 5:31 PM. This isn't mine and I have no knowledge of it, so I'm handing it back to the deputy.

Comment 9 by jkop@chromium.org, Mar 26 2018

I also don't think this is the same problem as was seen in February. The failure messages don't look similar.

Comment 10 by jkop@chromium.org, Mar 26 2018

Log: https://logs.chromium.org/v/?s=chromeos%2Fbb%2Fchromeos%2Fedgar-paladin%2F2531%2F%2B%2Frecipes%2Fsteps%2FHWTest__bvt-cq_%2F0%2Fstdout

  Reason: Tests failed.
  
   03-25-2018 [07:58:03] Output below this line is for buildbot consumption:
  @@@STEP_LINK@[Test-Logs]: logging_UserCrash: ERROR: Command <make  > failed, rc=127, Command returned non-zero exit status@http://cautotest-prod/tko/retrieve_logs.cgi?job=/results/186496193-chromeos-test/@@@
  @@@STEP_LINK@[Flake-Dashboard]: logging_UserCrash@https://wmatrix.googleplex.com/retry_teststats/?days_back=30&tests=logging_UserCrash@@@
  @@@STEP_LINK@[Test-History]: logging_UserCrash@https://wmatrix.googleplex.com/unfiltered?hide_missing=True&tests=logging_UserCrash@@@
  @@@STEP_LINK@[Test-Logs]: platform_DebugDaemonCupsAddPrinters: retry_count: 1, FAIL: Unhandled IOError: [Errno 2] No such file or directory: '/usr/local/autotest/site_tests/platform_DebugDaemonCupsAddPrinters/src/GenericPostScript.ppd.gz'@http://cautotest-prod/tko/retrieve_logs.cgi?job=/results/186496669-chromeos-test/@@@
  @@@STEP_LINK@[Flake-Dashboard]: platform_DebugDaemonCupsAddPrinters@https://wmatrix.googleplex.com/retry_teststats/?days_back=30&tests=platform_DebugDaemonCupsAddPrinters@@@
  @@@STEP_LINK@[Test-History]: platform_DebugDaemonCupsAddPrinters@https://wmatrix.googleplex.com/unfiltered?hide_missing=True&tests=platform_DebugDaemonCupsAddPrinters@@@
  @@@STEP_LINK@[Test-Logs]: hardware_PerfCallgraphVerification: retry_count: 1, GOOD: completed successfully@http://cautotest-prod/tko/retrieve_logs.cgi?job=/results/186496671-chromeos-test/@@@
  @@@STEP_LINK@[Flake-Dashboard]: hardware_PerfCallgraphVerification@https://wmatrix.googleplex.com/retry_teststats/?days_back=30&tests=hardware_PerfCallgraphVerification@@@
  @@@STEP_LINK@[Test-History]: hardware_PerfCallgraphVerification@https://wmatrix.googleplex.com/unfiltered?hide_missing=True&tests=hardware_PerfCallgraphVerification@@@
  @@@STEP_LINK@[Test-Logs]: p2p_ServeFiles: retry_count: 1, ERROR: Unhandled PackageInstallError: Installation of lansim(type:dep) failed : dep-lansim.tar.bz2 could not be fetched from any of the repos ['autoserv://']@http://cautotest-prod/tko/retrieve_logs.cgi?job=/results/186496672-chromeos-test/@@@
  @@@STEP_LINK@[Flake-Dashboard]: p2p_ServeFiles@https://wmatrix.googleplex.com/retry_teststats/?days_back=30&tests=p2p_ServeFiles@@@
  @@@STEP_LINK@[Test-History]: p2p_ServeFiles@https://wmatrix.googleplex.com/unfiltered?hide_missing=True&tests=p2p_ServeFiles@@@
  @@@STEP_LINK@[Test-Logs]: p2p_ConsumeFiles: retry_count: 1, ERROR: Unhandled PackageInstallError: Installation of lansim(type:dep) failed : dep-lansim.tar.bz2 could not be fetched from any of the repos ['autoserv://']@http://cautotest-prod/tko/retrieve_logs.cgi?job=/results/186496673-chromeos-test/@@@
  @@@STEP_LINK@[Flake-Dashboard]: p2p_ConsumeFiles@https://wmatrix.googleplex.com/retry_teststats/?days_back=30&tests=p2p_ConsumeFiles@@@
  @@@STEP_LINK@[Test-History]: p2p_ConsumeFiles@https://wmatrix.googleplex.com/unfiltered?hide_missing=True&tests=p2p_ConsumeFiles@@@
  @@@STEP_LINK@[Test-Logs]: p2p_ShareFiles: retry_count: 1, ERROR: Unhandled PackageInstallError: Installation of lansim(type:dep) failed : dep-lansim.tar.bz2 could not be fetched from any of the repos ['autoserv://']@http://cautotest-prod/tko/retrieve_logs.cgi?job=/results/186496674-chromeos-test/@@@
  @@@STEP_LINK@[Flake-Dashboard]: p2p_ShareFiles@https://wmatrix.googleplex.com/retry_teststats/?days_back=30&tests=p2p_ShareFiles@@@
  @@@STEP_LINK@[Test-History]: p2p_ShareFiles@https://wmatrix.googleplex.com/unfiltered?hide_missing=True&tests=p2p_ShareFiles@@@
  Will return from run_suite with status: ERROR
cmd=['/b/c/cbuild/repository/chromite/third_party/swarming.client/swarming.py', 'run', '--swarming', 'chromeos-proxy.appspot.com', '--task-summary-json', '/tmp/cbuildbot-tmpj_qIwZ/tmp87SerG/temp_summary.json', '--raw-cmd', '--task-name', u'edgar-paladin/R67-10516.0.0-rc5-bvt-cq', '--dimension', 'os', 'Ubuntu-14.04', '--dimension', 'pool', 'default', '--print-status-updates', '--timeout', '9000', '--io-timeout', '9000', '--hard-timeout', '9000', '--expiration', '1200', u'--tags=priority:CQ', u'--tags=suite:bvt-cq', u'--tags=build:edgar-paladin/R67-10516.0.0-rc5', u'--tags=task_name:edgar-paladin/R67-10516.0.0-rc5-bvt-cq', u'--tags=board:edgar', '--', '/usr/local/autotest/site_utils/run_suite.py', '--build', u'edgar-paladin/R67-10516.0.0-rc5', '--board', u'edgar', '--suite_name', u'bvt-cq', '--pool', u'cq', '--file_bugs', 'False', '--priority', 'CQ', '--timeout_mins', '90', '--retry', 'True', '--max_retries', '5', '--minimum_duts', '4', '--offload_failures_only', 'False', '--job_keyvals', "{'cidb_build_stage_id': 74445514L, 'cidb_build_id': 2413390, 'datastore_parent_key': ('Build', 2413390, 'BuildStage', 74445514L)}", '--test_args', "{'fast': 'True'}", '-m', '186496039']
07:58:05: ERROR: wait_cmd has lab failures: cmd=['/b/c/cbuild/repository/chromite/third_party/swarming.client/swarming.py', 'run', '--swarming', 'chromeos-proxy.appspot.com', '--task-summary-json', '/tmp/cbuildbot-tmpj_qIwZ/tmp87SerG/temp_summary.json', '--raw-cmd', '--task-name', u'edgar-paladin/R67-10516.0.0-rc5-bvt-cq', '--dimension', 'os', 'Ubuntu-14.04', '--dimension', 'pool', 'default', '--print-status-updates', '--timeout', '9000', '--io-timeout', '9000', '--hard-timeout', '9000', '--expiration', '1200', u'--tags=priority:CQ', u'--tags=suite:bvt-cq', u'--tags=build:edgar-paladin/R67-10516.0.0-rc5', u'--tags=task_name:edgar-paladin/R67-10516.0.0-rc5-bvt-cq', u'--tags=board:edgar', '--', '/usr/local/autotest/site_utils/run_suite.py', '--build', u'edgar-paladin/R67-10516.0.0-rc5', '--board', u'edgar', '--suite_name', u'bvt-cq', '--pool', u'cq', '--file_bugs', 'False', '--priority', 'CQ', '--timeout_mins', '90', '--retry', 'True', '--max_retries', '5', '--minimum_duts', '4', '--offload_failures_only', 'False', '--job_keyvals', "{'cidb_build_stage_id': 74445514L, 'cidb_build_id': 2413390, 'datastore_parent_key': ('Build', 2413390, 'BuildStage', 74445514L)}", '--test_args', "{'fast': 'True'}", '-m', '186496039'].
Exception will be raised in the next json_dump run.

Comment 11 by jkop@chromium.org, Mar 29 2018

Relevant logs are here: http://pantheon/storage/browser/chromeos-autotest-results/186496193-chromeos-test/chromeos2-row5-rack2-host5/debug/

I've tagged this bug on the three autotest CLs in the CQ run most likely to have been the cause. I'm also going to crosoncall to poke the Sheriff about it.

Comment 12 by jkop@chromium.org, Mar 29 2018

Blocking: 810965
Cc: shuqianz@chromium.org
Owner: bingxue@chromium.org
Owner: jkop@chromium.org
Summary: shard_client rpc failures cause test failures/timeouts (was: exponential backoff in shard heartbeat is inconsistently applied)
This bug conflates an old February outage with a recent one; let's make this bug about the recent one.
Actually, I don't see any relationship between the failure linked in #10 and the original shard_client errors.

What is this bug about? Is there a recent occurrence of it?

If not, I'm tempted to call this bug a WontFix and attribute #10 to test flake of undetermined cause.

Comment 15 by jkop@chromium.org, Mar 29 2018

There are ongoing RPC problems: http://shortn/_tCVYRVBvRZ
^ I suggest using the chrome-infra role instead; it will eliminate that scary quota warning: http://shortn/_tEBHMwavFz

These look like small, occasional failure spikes; I'm not convinced of their severity. They seem unrelated to #10, though.
Examining shard_client logs on one of the affected shards (cros-full-18), I see:

chromeos-test@cros-full-0018:/usr/local/autotest/logs$ less shard_client.latest 
...
...
03/29 08:28:58.910 ERROR|      shard_client:0344| Heartbeat failed. JSONRPCException: DatabaseError: (1205, 'Lock wait timeout exceeded; try restarting transaction')
Traceback (most recent call last):
  File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 109, in dispatchRequest
    results['result'] = self.invokeServiceEndpoint(meth, args)
  File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 147, in invokeServiceEndpoint
    return meth(*args)
  File "/usr/local/autotest/frontend/afe/rpc_handler.py", line 270, in new_fn
    return f(*args, **keyword_args)
  File "/usr/local/autotest/frontend/afe/rpc_interface.py", line 2202, in shard_heartbeat
    host_model.save()
  File "/usr/local/autotest/frontend/afe/models.py", line 740, in save
    super(Host, self).save(*args, **kwargs)
  File "/usr/local/autotest/frontend/afe/model_logic.py", line 1249, in save
    super(ModelWithInvalid, self).save(*args, **kwargs)
  File "/usr/local/autotest/site-packages/django/db/models/base.py", line 546, in save
    force_update=force_update, update_fields=update_fields)
  File "/usr/local/autotest/site-packages/django/db/models/base.py", line 626, in save_base
    rows = manager.using(using).filter(pk=pk_val)._update(values)
  File "/usr/local/autotest/site-packages/django/db/models/query.py", line 591, in _update
    return query.get_compiler(self.db).execute_sql(None)
  File "/usr/local/autotest/site-packages/django/db/models/sql/compiler.py", line 1014, in execute_sql
    cursor = super(SQLUpdateCompiler, self).execute_sql(result_type)
  File "/usr/local/autotest/site-packages/django/db/models/sql/compiler.py", line 840, in execute_sql
    cursor.execute(sql, params)
  File "/usr/local/autotest/site-packages/django/db/backends/mysql/base.py", line 128, in execute
    six.reraise(utils.DatabaseError, utils.DatabaseError(*tuple(e.args)), sys.exc_info()[2])
  File "/usr/local/autotest/site-packages/django/db/backends/mysql/base.py", line 120, in execute
    return self.cursor.execute(query, args)
  File "/usr/local/autotest/site-packages/MySQLdb/cursors.py", line 174, in execute
    self.errorhandler(self, exc, value)
  File "/usr/local/autotest/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
    raise errorclass, errorvalue
DatabaseError: (1205, 'Lock wait timeout exceeded; try restarting transaction')


which does sound similar to #10. But it is not happening all that often.
The correlated spike on all shards at once at 09:18:38 is more concerning.

Comment 19 by jkop@chromium.org, Mar 29 2018

They appear to have started abruptly last week, but maybe that's not an issue.
My guess would be slow queries piling up on the master DB at that time (which is what the slave-replica RPC is intended to mitigate or resolve).
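For context, here is a rough sketch of what routing the heavy heartbeat read to the slave could look like with the Django ORM used by the Autotest frontend (the code path in the tracebacks above). The 'readonly' database alias, the function name, and the exact query are assumptions for illustration, not the actual CL.

from autotest_lib.frontend.afe import models

def find_job_ids_for_shard(shard, known_job_ids):
    # Run the expensive SELECT against the replica via a Django DB alias;
    # assumes settings.DATABASES defines a 'readonly' alias pointing at the
    # slave. Only the narrow UPDATE in Job.assign_to_shard still has to run
    # on the master, shrinking the window for 'Lock wait timeout exceeded'.
    return list(
        models.Job.objects.using('readonly')
        .filter(shard__isnull=True)
        .exclude(pk__in=known_job_ids)
        .values_list('pk', flat=True))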

Comment 21 by jkop@chromium.org, Mar 29 2018

Blocking: -810965
Hmm, that makes sense, and it would explain why this has only shown up recently. I'll unblock 810965 and push that fix; possibly it will deal with this as well.
#19: yes, there's probably a real bug to be found. At this point, though, this particular crbug has a severe identity crisis. I suggest WontFix'ing this and starting a fresh one about the onset of this background rate of heartbeat failures.

Comment 23 by jkop@chromium.org, Mar 29 2018

Status: WontFix (was: Assigned)
Agreed.
