
Issue 818271

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Mar 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 0
Type: Bug




shard-client in the lab went down

Project Member Reported by nxia@chromium.org, Mar 2 2018

Issue description

11:58:29 ERROR| Heartbeat failed. JSONRPCException: DatabaseError: (1146, "Table 'chromeos_autotest_db.afe_jobs' doesn't exist")
Traceback (most recent call last):
  File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 109, in dispatchRequest
    results['result'] = self.invokeServiceEndpoint(meth, args)
  File "/usr/local/autotest/frontend/afe/json_rpc/serviceHandler.py", line 147, in invokeServiceEndpoint
    return meth(*args)
  File "/usr/local/autotest/frontend/afe/rpc_handler.py", line 270, in new_fn
    return f(*args, **keyword_args)
  File "/usr/local/autotest/frontend/afe/rpc_interface.py", line 2196, in shard_heartbeat
    rpc_utils.persist_records_sent_from_shard(shard_obj, jobs, hqes)
  File "/usr/local/autotest/frontend/afe/rpc_utils.py", line 1029, in persist_records_sent_from_shard
    shard, jobs, models.Job)
  File "/usr/local/autotest/frontend/afe/rpc_utils.py", line 989, in _persist_records_with_type_sent_from_shard
    current_record = record_type.objects.get(pk=pk)
  File "/usr/local/autotest/site-packages/django/db/models/manager.py", line 143, in get
    return self.get_query_set().get(*args, **kwargs)
  File "/usr/local/autotest/site-packages/django/db/models/query.py", line 382, in get
    num = len(clone)
  File "/usr/local/autotest/site-packages/django/db/models/query.py", line 90, in __len__
    self._result_cache = list(self.iterator())
  File "/usr/local/autotest/site-packages/django/db/models/query.py", line 301, in iterator
    for row in compiler.results_iter():
  File "/usr/local/autotest/site-packages/django/db/models/sql/compiler.py", line 775, in results_iter
    for rows in self.execute_sql(MULTI):
  File "/usr/local/autotest/site-packages/django/db/models/sql/compiler.py", line 840, in execute_sql
    cursor.execute(sql, params)
  File "/usr/local/autotest/site-packages/django/db/backends/mysql/base.py", line 130, in execute
    six.reraise(utils.DatabaseError, utils.DatabaseError(*tuple(e.args)), sys.exc_info()[2])
  File "/usr/local/autotest/site-packages/django/db/backends/mysql/base.py", line 120, in execute
    return self.cursor.execute(query, args)
  File "/usr/local/autotest/site-packages/MySQLdb/cursors.py", line 174, in execute
    self.errorhandler(self, exc, value)
  File "/usr/local/autotest/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
    raise errorclass, errorvalue
DatabaseError: (1146, "Table 'chromeos_autotest_db.afe_jobs' doesn't exist")

 

Comment 1 by nxia@chromium.org, Mar 2 2018

Components: Infra>Client>ChromeOS
Labels: -Pri-3 Pri-1
This shard affected sentry and lumpy

Comment 2 by nxia@chromium.org, Mar 2 2018

Labels: -Pri-1 Pri-0
Summary: shard-client in the lab went down (was: chromeos-skunk-5 shard-client went down )
Project Member

Comment 3 by bugdroid1@chromium.org, Mar 2 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/2e315cf2389fd1d9642a17694054f76e30f0240a

commit 2e315cf2389fd1d9642a17694054f76e30f0240a
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Fri Mar 02 20:33:06 2018

Revert "rpc: Fetch from readonly db during shard heartbeat"

This reverts commit 3c75d9252ec132f724e72d025939a82fa64c7d14.

Reason for revert: Shard hearbeat failure in prod.

Original change's description:
> rpc: Fetch from readonly db during shard heartbeat
> 
> Fall back to master if readonly isn't available (mostly during tests)
> 
> BUG=chromium:810965
> TEST=unit tests
> 
> Change-Id: I5442d7b31a79908e12d09a60bed3f42645422ebc
> Reviewed-on: https://chromium-review.googlesource.com/938384
> Commit-Ready: Jacob Kopczynski <jkop@chromium.org>
> Tested-by: Jacob Kopczynski <jkop@chromium.org>
> Reviewed-by: Xixuan Wu <xixuan@chromium.org>

BUG=chromium:810965
BUG= chromium:818271 

Change-Id: I2c428cf91a6a57c2a96c98c89938f683120ed77b
Reviewed-on: https://chromium-review.googlesource.com/946750
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/2e315cf2389fd1d9642a17694054f76e30f0240a/frontend/afe/models.py

Emergency push for #3 done.
shard clients are returning to life: https://viceroy.corp.google.com/chromeos/deputy-view?duration=6h#_VG_lnuPnWCa
Cc: jkop@chromium.org
Owner: pprabhu@chromium.org
Status: Fixed (was: Untriaged)
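The problem was that the change reverted in #3 made the heartbeat RPC use cautotest::readonly_host for some of its DB queries. That setting actually points to the CloudSQL TKO database, so the lookup of afe_jobs failed there with the 1146 error shown in the issue description.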

I had a CL to try to fix that: https://chrome-internal-review.googlesource.com/c/chromeos/chromeos-admin/+/581567 but I'm not sure what else relies on that setting to refer to the TKO instance. The correct fix would be to separate the reference to the TKO DB from the reference to the readonly AFE DB.
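
For illustration only: a misrouted "readonly" setting like the one described above could in principle be caught by checking that the target database actually contains the expected table before sending queries to it. The sketch below is not the autotest code. It uses sqlite3 as a stand-in for MySQL, and has_table and pick_connection are hypothetical names.

```python
import sqlite3


def has_table(conn, table_name):
    """Return True if table_name exists in the database behind conn."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table_name,),
    ).fetchone()
    return row is not None


def pick_connection(readonly_conn, master_conn, required_table):
    """Use the readonly connection only if it holds the required table."""
    if has_table(readonly_conn, required_table):
        return readonly_conn
    return master_conn


# Stand-ins: "master" holds the AFE table, "tko" is a different database
# that a misconfigured readonly setting happens to point at.
master = sqlite3.connect(":memory:")
master.execute("CREATE TABLE afe_jobs (id INTEGER PRIMARY KEY)")
tko = sqlite3.connect(":memory:")
tko.execute("CREATE TABLE tko_jobs (id INTEGER PRIMARY KEY)")

# The check rejects the misrouted connection and selects master instead.
chosen = pick_connection(tko, master, "afe_jobs")
```

A check like this only detects the wrong target. It does not replace separating the two settings.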

Comment 6 by jkop@chromium.org, Mar 2 2018

I already have a CL in progress that makes the heartbeat fall back to the master DB, along with a test that checks it doesn't leave the connection in a bad state. I don't think it guarantees this won't recur, but it will fail more gracefully.
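
A rough sketch of the fallback idea in this comment, under stated assumptions: it is not the CL itself, sqlite3 stands in for MySQL, and fetch_job_ids is a hypothetical helper. A database error on the readonly connection causes the same query to be retried against master.

```python
import sqlite3


def fetch_job_ids(readonly_conn, master_conn):
    """Query the readonly connection first; fall back to master on DB errors."""
    query = "SELECT id FROM afe_jobs ORDER BY id"
    try:
        rows = readonly_conn.execute(query).fetchall()
    except sqlite3.Error:
        # For example, the table is missing on the readonly side.
        rows = master_conn.execute(query).fetchall()
    return [row[0] for row in rows]


# Master has the table and two jobs; the readonly stand-in has no afe_jobs.
master_db = sqlite3.connect(":memory:")
master_db.execute("CREATE TABLE afe_jobs (id INTEGER PRIMARY KEY)")
master_db.executemany("INSERT INTO afe_jobs (id) VALUES (?)", [(1,), (2,)])
broken_readonly_db = sqlite3.connect(":memory:")

job_ids = fetch_job_ids(broken_readonly_db, master_db)
```

As the comment notes, a fallback of this kind hides the misconfiguration rather than fixing it, so the error is probably worth logging where it is caught.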
Project Member

Comment 7 by bugdroid1@chromium.org, Mar 9 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/d6a5e919d346e17834f3730a899ba221f4d4ce17

commit d6a5e919d346e17834f3730a899ba221f4d4ce17
Author: Jacob Kopczynski <jkop@google.com>
Date: Fri Mar 09 03:29:04 2018

rpc: Guarantee reset to good state after readonly

The kludge needed to query readonly backup DB could leave the connection
to master broken in case of an error. Remedy this.
Behind a feature flag, fetch_readonly_jobs, defaults to False.

BUG=chromium:810965
BUG= chromium:818271 
TEST=Old and new unit tests, feature flag for canarying rollout

Change-Id: Idc5e3793f5dc5a2bd1022e468456b88b2f347ed3
Reviewed-on: https://chromium-review.googlesource.com/944041
Commit-Ready: Jacob Kopczynski <jkop@chromium.org>
Tested-by: Jacob Kopczynski <jkop@chromium.org>
Reviewed-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/d6a5e919d346e17834f3730a899ba221f4d4ce17/frontend/afe/models.py
[modify] https://crrev.com/d6a5e919d346e17834f3730a899ba221f4d4ce17/frontend/afe/rpc_interface_unittest.py
[modify] https://crrev.com/d6a5e919d346e17834f3730a899ba221f4d4ce17/global_config.ini
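
The commit message above describes restoring a good state even when the readonly query fails, gated by a feature flag. One common way to get that guarantee is a context manager with try/finally. The sketch below is an assumption about the general shape, not the code in models.py. FakeManager, db_alias, and use_readonly are hypothetical names, and the enabled argument stands in for a flag such as fetch_readonly_jobs.

```python
import contextlib


class FakeManager:
    """Stand-in for an object that remembers which DB alias it queries."""

    def __init__(self):
        self.db_alias = "default"


@contextlib.contextmanager
def use_readonly(manager, enabled):
    """Temporarily point manager at the readonly DB, restoring it on exit."""
    if not enabled:
        # Flag off: leave the manager untouched.
        yield manager
        return
    previous = manager.db_alias
    manager.db_alias = "readonly"
    try:
        yield manager
    finally:
        # Runs on success and on error, so master is always restored.
        manager.db_alias = previous


manager = FakeManager()
try:
    with use_readonly(manager, enabled=True):
        raise RuntimeError("simulated query failure")
except RuntimeError:
    pass
alias_after_failure = manager.db_alias
```

Keeping the switch and the restore in one place means callers cannot forget the reset step on an error path.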

Project Member

Comment 8 by bugdroid1@chromium.org, Mar 9 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/c47ff078b5cd97050c345c5206d07e561f233175

commit c47ff078b5cd97050c345c5206d07e561f233175
Author: Jacob Kopczynski <jkop@google.com>
Date: Fri Mar 09 19:31:44 2018

Project Member

Comment 9 by bugdroid1@chromium.org, Mar 12 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/01489ded899f10e420085db62e3ea5a3b71ede1d

commit 01489ded899f10e420085db62e3ea5a3b71ede1d
Author: Jacob Kopczynski <jkop@google.com>
Date: Mon Mar 12 23:11:57 2018

Project Member

Comment 10 by bugdroid1@chromium.org, Mar 13 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/b9766dea3c8cfdc07359edc729a8c47ba688b846

commit b9766dea3c8cfdc07359edc729a8c47ba688b846
Author: Jacob Kopczynski <jkop@google.com>
Date: Tue Mar 13 18:55:58 2018

Project Member

Comment 11 by bugdroid1@chromium.org, Mar 13 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/c21ff807077ffc27a0d39468ff609020e1ebc08a

commit c21ff807077ffc27a0d39468ff609020e1ebc08a
Author: Jacob Kopczynski <jkop@google.com>
Date: Tue Mar 13 19:00:14 2018

Project Member

Comment 12 by bugdroid1@chromium.org, Mar 24 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/348f0c77d5d57c63f7a1f88e24d0535cdbf19316

commit 348f0c77d5d57c63f7a1f88e24d0535cdbf19316
Author: Jacob Kopczynski <jkop@google.com>
Date: Sat Mar 24 00:29:49 2018

autotest: rpc: rollout shard heartbeat whitelist

BUG=chromium:810965
BUG= chromium:818271 
TEST=unit tests

Change-Id: I818b1ec237dc09caba68ca79fac05705f9b94b17
Reviewed-on: https://chromium-review.googlesource.com/969994
Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
Tested-by: Jacob Kopczynski <jkop@chromium.org>
Reviewed-by: Allen Li <ayatane@chromium.org>

[modify] https://crrev.com/348f0c77d5d57c63f7a1f88e24d0535cdbf19316/frontend/afe/models.py
[modify] https://crrev.com/348f0c77d5d57c63f7a1f88e24d0535cdbf19316/global_config.ini

Project Member

Comment 13 by bugdroid1@chromium.org, Mar 30 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/77ecb5da72f5bfc68f4d0ef0b4d27e8a6e4e4429

commit 77ecb5da72f5bfc68f4d0ef0b4d27e8a6e4e4429
Author: Jacob Kopczynski <jkop@google.com>
Date: Fri Mar 30 18:53:32 2018

Project Member

Comment 14 by bugdroid1@chromium.org, Mar 30 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/d94fced6c17b8266b604d95c740453e7716cc8f4

commit d94fced6c17b8266b604d95c740453e7716cc8f4
Author: Jacob Kopczynski <jkop@google.com>
Date: Fri Mar 30 22:50:39 2018

Project Member

Comment 15 by bugdroid1@chromium.org, Apr 18 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/02b064bfd46c9125849dafcbb7c584a9c316e8d7

commit 02b064bfd46c9125849dafcbb7c584a9c316e8d7
Author: Jacob Kopczynski <jkop@google.com>
Date: Wed Apr 18 17:22:28 2018

Project Member

Comment 16 by bugdroid1@chromium.org, Apr 19 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/a3cd7591c19a7dd66bb4486cfd8abe9b96d9fc5b

commit a3cd7591c19a7dd66bb4486cfd8abe9b96d9fc5b
Author: Jacob Kopczynski <jkop@google.com>
Date: Thu Apr 19 03:07:42 2018

Project Member

Comment 17 by bugdroid1@chromium.org, Apr 26 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/62489c8a040b4e9e7ba1230f030cd97c5d55b42c

commit 62489c8a040b4e9e7ba1230f030cd97c5d55b42c
Author: Jacob Kopczynski <jkop@google.com>
Date: Thu Apr 26 19:30:41 2018

Project Member

Comment 18 by bugdroid1@chromium.org, Apr 27 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/36ff72fa159ee5e6589b5ad443c2e2b1db314f49

commit 36ff72fa159ee5e6589b5ad443c2e2b1db314f49
Author: Jacob Kopczynski <jkop@google.com>
Date: Fri Apr 27 23:31:49 2018

Project Member

Comment 19 by bugdroid1@chromium.org, Apr 28 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/aa312346fec2b56405cc466c3b42d8260d13cddf

commit aa312346fec2b56405cc466c3b42d8260d13cddf
Author: Jacob Kopczynski <jkop@google.com>
Date: Sat Apr 28 03:18:51 2018

Project Member

Comment 20 by bugdroid1@chromium.org, May 2 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/a2f738d3ac6c289cbe66bacb96d12a72787a1f71

commit a2f738d3ac6c289cbe66bacb96d12a72787a1f71
Author: Jacob Kopczynski <jkop@google.com>
Date: Wed May 02 20:21:34 2018

Project Member

Comment 21 by bugdroid1@chromium.org, May 2 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/f244072bbcbb45f4879f844ae173ee29de75e2fa

commit f244072bbcbb45f4879f844ae173ee29de75e2fa
Author: Jacob Kopczynski <jkop@google.com>
Date: Wed May 02 22:34:06 2018

Project Member

Comment 22 by bugdroid1@chromium.org, May 2 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/4a904e6961fe9c6f889464fea3cf40b9ce766a88

commit 4a904e6961fe9c6f889464fea3cf40b9ce766a88
Author: Jacob Kopczynski <jkop@google.com>
Date: Wed May 02 23:23:43 2018

Project Member

Comment 23 by bugdroid1@chromium.org, May 7 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/4525d2453cefdfe0ebb36ad20a25a4faa2510c6c

commit 4525d2453cefdfe0ebb36ad20a25a4faa2510c6c
Author: Jacob Kopczynski <jkop@google.com>
Date: Mon May 07 18:04:33 2018

autotest: Readonly heartbeat on all shards.

BUG=chromium:810965
BUG= chromium:818271 
TEST=No problems in the partial rollout.

Change-Id: Ib778a5d2492f24c88878e773359ede965d8e39df
Reviewed-on: https://chromium-review.googlesource.com/1038619
Commit-Ready: Jacob Kopczynski <jkop@chromium.org>
Tested-by: Jacob Kopczynski <jkop@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/4525d2453cefdfe0ebb36ad20a25a4faa2510c6c/frontend/afe/models.py
