New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 845537 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Last visit > 30 days ago
Closed: Jun 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug



Sign in to add a comment

in shard heartbeat rpc, failure to query slave should failover to master

Project Member Reported by akes...@chromium.org, May 22 2018

Issue description

Forked from  Issue 845172 

Relevant code:

        job_ids = set([])
        <snip>

        if cls.FETCH_READONLY_JOBS:
            #TODO(jkop): Get rid of this kludge when we update Django to >=1.7
            #correct usage would be .raw(..., using='readonly')
            old_db = Job.objects._db
            try:
                Job.objects._db = 'readonly'
                job_ids |= set([j.id for j in Job.objects.raw(raw_sql)])
            finally:
                Job.objects._db = old_db
        if not job_ids:
            #If the replica is down or we're in a test, fetch from master.
            job_ids |= set([j.id for j in Job.objects.raw(raw_sql)])

The master is consulted if the first query returns no results. However, it is not consulted if the first queries times out or fails and throws an exception, which is what I believe was happening here.

I don't think this is the intended before for two reasons:
 1) It doesn't failover on the common case of the slave being unavailable or very slow.
 2) It might failover incorrectly in the case where the jobs_ids set is truly empty.


 
alternate strategy: failover to other slave db, then to master.

This might be difficult to implement in our django models nightmare though, which currently knows about the existence of 2 databases.
Labels: -Chase-Pending Chase
Owner: jkop@chromium.org
Status: Assigned (was: Available)
CL in flight
Project Member

Comment 3 by bugdroid1@chromium.org, May 30 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/592cbd67bb20cfbb0e9a9d247b0db3c01cd1dc67

commit 592cbd67bb20cfbb0e9a9d247b0db3c01cd1dc67
Author: Jacob Kopczynski <jkop@google.com>
Date: Wed May 30 08:02:39 2018

autotest: Switch heartbeat query retry to on error

Retrying on no new jobs is overeager, so hide it behind a config flag.
Retrying when there is an error handles the cases we care about (timeouts and
slave DB outages) much more reasonably.

BUG= chromium:845537 
TEST=unit tests pass

Change-Id: Ia60541be22d6c5389d3697770c93a10691783a44
Reviewed-on: https://chromium-review.googlesource.com/1069587
Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
Tested-by: Jacob Kopczynski <jkop@chromium.org>
Reviewed-by: Aviv Keshet <akeshet@chromium.org>

[modify] https://crrev.com/592cbd67bb20cfbb0e9a9d247b0db3c01cd1dc67/frontend/afe/models.py
[modify] https://crrev.com/592cbd67bb20cfbb0e9a9d247b0db3c01cd1dc67/frontend/afe/rpc_interface_unittest.py
[modify] https://crrev.com/592cbd67bb20cfbb0e9a9d247b0db3c01cd1dc67/global_config.ini

Comment 4 by cra...@chromium.org, May 30 2018

Labels: cros-infra-pm-2018-05-21
Status: Fixed (was: Assigned)

Sign in to add a comment