
Issue 845172

Starred by 4 users

Issue metadata

Status: Fixed
Owner: shuqianz@chromium.org (last visit > 30 days ago)
Closed: May 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug

Blocking:
issue 810965




Testing outage in the Chrome OS test lab

Reported by jrbarnette@chromium.org, May 21 2018

Issue description

Tests are currently unable to run in the Autotest lab.  Most suites
simply time out.

We've identified some suspicious symptoms in the database servers:
  * "show processlist;" on the master server shows a large number
    of sleeping queries.
  * ssh to the slave server cros-bighd-0002.mtv times out, but the
    server is up.
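
For reference, a minimal sketch of the kind of "show processlist" check behind the first symptom. The hostname and credentials below are placeholders, not the real lab configuration:

    # Rough sketch: count idle ("Sleep") connections on a MySQL server.
    # Hostname and credentials are placeholders, not the real lab config.
    import pymysql

    conn = pymysql.connect(host='db-master.example', user='monitor',
                           password='secret')
    try:
        with conn.cursor() as cur:
            cur.execute('SHOW PROCESSLIST')
            rows = cur.fetchall()
        # SHOW PROCESSLIST columns: Id, User, Host, db, Command, Time,
        # State, Info -- a Command of 'Sleep' means the connection is idle.
        sleeping = [r for r in rows if r[4] == 'Sleep']
        print('%d of %d connections are sleeping' % (len(sleeping), len(rows)))
    finally:
        conn.close()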

 

Comment 1 by cra...@chromium.org, May 21 2018

Attached is the output from "SHOW PROCESSLIST" on cros-bighd-0002.mtv from a few minutes ago.

Attachment: process_list (62.6 KB)
Cc: jkop@chromium.org
Those look like the usual "slave heartbeat" queries to me at a glance. This is a recent change due to jkop's project to redirect those queries to the slave.

The idea was to reduce lock contention and improve performance, but maybe that's not working out so well.

I'm in favor of restarting the slave.
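
For anyone unfamiliar with the redirect: it amounts to routing the read-only heartbeat queries at a second database alias (the 'readonly' alias also shows up in the code quoted further down in this bug). A minimal sketch of the pattern in Django; the model and function here are illustrative stand-ins, not the real autotest code:

    # Illustrative sketch only.  Assumes settings.DATABASES defines both
    # 'default' (the master) and 'readonly' (the slave); HeartbeatJob is
    # a stand-in model, not the real autotest Job model.
    from django.db import models

    class HeartbeatJob(models.Model):
        shard = models.CharField(max_length=64)
        created_on = models.DateTimeField(auto_now_add=True)

        class Meta:
            app_label = 'heartbeat_sketch'

    def pending_job_ids():
        # .using('readonly') sends this SELECT to the slave instead of the
        # master, which is what reduces lock contention on the master.
        return set(HeartbeatJob.objects.using('readonly')
                   .values_list('id', flat=True))
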
Cc: pprabhu@chromium.org ayatane@chromium.org dshi@chromium.org akes...@chromium.org jclinton@chromium.org petershih@google.com mikenichols@chromium.org
Issue 845101 has been merged into this issue.
Slave has not returned from reboot after 20+ minutes. Investigating other options, such as reverting the change to use the slave for heartbeats, or switching to our other slave (bighd-0003).
For context, the change to use slave for heartbeats is described at Issue 810965
Owner: shuqianz@chromium.org
Update on the failure mode:  The secondary (bighd-0002) is responsible
for certain slave heartbeat queries; if it goes down, the shard heartbeats
hang, and work grinds to a halt.

As for the status of bighd-0002:  It was definitely sick enough to qualify
as "down", and restarting hasn't made things better.  Root cause of that
server outage isn't known.

The next step is to get the heartbeat queries redirected to some new server.
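
The redirect itself comes down to which host the read-only alias points at. A sketch at the level of a Django DATABASES setting (the real config may live elsewhere, and every hostname, database name, and credential below is a hypothetical placeholder):

    # Sketch of the settings-level redirect.  All hostnames, database
    # names, and credentials here are hypothetical placeholders.
    DATABASES = {
        'default': {
            'ENGINE': 'django.db.backends.mysql',
            'HOST': 'afe-master.example',         # master, unchanged
            'NAME': 'afe_db',
            'USER': 'autotest',
            'PASSWORD': 'secret',
        },
        'readonly': {
            'ENGINE': 'django.db.backends.mysql',
            'HOST': 'replacement-slave.example',  # heartbeat reads go here
            'NAME': 'afe_db',
            'USER': 'autotest',
            'PASSWORD': 'secret',
        },
    }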

Blocking: 810965

Comment 8 by bugdroid1@chromium.org, May 21 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/860b187bceea419834a6f09f4d187e9f79bd4f26

commit 860b187bceea419834a6f09f4d187e9f79bd4f26
Author: Allen Li <ayatane@chromium.org>
Date: Mon May 21 17:22:04 2018

We believe that the root cause is a hardware failure on the underlying
node hosting three of our servers:
    cros-bighd-0002 - slave DB server
    cros-full-0025 - shard for buddy and glimmer
    cros-full-0028 - drone

We're working to find alternative hardware.

This email just showed up in my inbox:

to: mdb.chromeos-build-deputy@google.com, ganeti-ftr@google.com

Hello Ganeti user!

This message is regarding the following machine(s):
  * cros-bighd-0002.mtv.corp.google.com
  * cros-full-0025.mtv.corp.google.com
  * cros-full-0028.mtv.corp.google.com

Due to unforeseen circumstances, there was an unplanned outage in the cluster where your instance resides. This has resulted in downtime for your instance from 2018-05-21 17:30:00 UTC to 2018-05-21 18:00:00 UTC, but we believe it is running again now. We're very sorry about the unexpected downtime.

Questions?
If you have any questions please open a GUTS ticket for 'sre-ganeti': go/ganeti-ticket

Thanks,
Ganeti Virtualization SRE

Why did I get this email?
We send notifications to the individuals/groups listed in the service_contacts Rigging property associated with the affected machines, or the MDB group who owns the machines.
Please see: http://google3/ops/sysops/cmt/scripts/get_machine_contacts.py

Labels: -Pri-0 Pri-1
That mail also means our VMs have failed over, or otherwise been
repaired.  And it looks like services on all three affected servers
are back.

I believe the worst of the outage is over; downgrading to P1.

Labels: OS-Chrome
I'm going to be that guy and ask for a postmortem. Here are a few things that went wrong:

* Monitoring on AFE DB did not fire; had to be told there was an outage by users
* Secondary was not familiar with the DB config or the software associated with it; we could not determine which hosts were primary/secondary
* Secondary could not determine the current primary oncall deputy: two calendars said two different things
* We could not get serial console access to cros-bighd-0002
* Outage lasted >12 hours
* We don't have an explanation for the large increase in network traffic to the master at the same time that the secondary went down

Comment 13 by jkop@chromium.org, May 21 2018

I may have an explanation for the last point. If the slave went down, IIRC the heartbeat queries to it would fall back to pinging the master DB.

Also, AFAIK the outage lasted ~2 hours from the first person arriving in the office to resolution. We got copious alert emails, but no one was responsible for responding to them until business hours MTV time, because that's all we guarantee support for.
> I may have an explanation for the last point. If the slave went
> down, IIRC the heartbeat queries to it would fall back to pinging
> the master DB.

That would put some holes in our overall theory of the outage, though.
The evidence is that some time after the slave DB failed, heartbeats
on all of the shards dropped to 0.  Do we think that the problem was
that the fail-over drove up load on the master enough to cause some
other problem?

I'll note that I'm kind of disposed to believe that the real problem
wasn't merely the slave failure:  Data on hand shows a timeline a bit
like this:
    5/20 10:15 - slave DB fails
    5/20 18:00 - shard heartbeats stop

So, there's an 8 hour gap between DB failure and lab outage that I don't
have a good answer for.

> jkop@
> I may have an explanation for the last point. If the slave went
> down, IIRC the heartbeat queries to it would fall back to pinging
> the master DB.

I'm not sure about this. Examining the relevant code inside the rpc (frontend/afe/models.py):

        job_ids = set([])
        <snip>

        if cls.FETCH_READONLY_JOBS:
            #TODO(jkop): Get rid of this kludge when we update Django to >=1.7
            #correct usage would be .raw(..., using='readonly')
            old_db = Job.objects._db
            try:
                Job.objects._db = 'readonly'
                job_ids |= set([j.id for j in Job.objects.raw(raw_sql)])
            finally:
                Job.objects._db = old_db
        if not job_ids:
            #If the replica is down or we're in a test, fetch from master.
            job_ids |= set([j.id for j in Job.objects.raw(raw_sql)])

The master is consulted if the first query returns no results. However, it is not consulted if the first query times out or fails and throws an exception, which is what I believe was happening here.
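
For illustration only, here is a sketch of what an exception-tolerant fallback would look like. This is not the code that was landed; it reuses the cls/Job/raw_sql names from the snippet above and additionally catches django.db errors so that a dead or unreachable replica also falls through to the master:

    from django.db import DatabaseError

    job_ids = set()
    if cls.FETCH_READONLY_JOBS:
        old_db = Job.objects._db
        try:
            Job.objects._db = 'readonly'
            job_ids |= set(j.id for j in Job.objects.raw(raw_sql))
        except DatabaseError:
            # Replica unreachable or the query timed out; fall through
            # and let the master query below pick up the work.
            pass
        finally:
            Job.objects._db = old_db
    if not job_ids:
        # Replica down, empty result, or test environment: fetch from master.
        job_ids |= set(j.id for j in Job.objects.raw(raw_sql))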



> jrbarnette@
> So, there's an 8 hour gap between DB failure and lab outage that I don't
> have a good answer for.

I don't understand this yet either. But keep in mind that the shard_client is the only thing on the shard that depends on the slave db, and only for 1 part of the handling of the heartbeat query. I'd have to read the rpc code in detail, but it's possible that if that one part fails, the rpc can still upload write results for any already running or enqueued jobs.

Comment 16 Deleted

Forked retry question to Issue 845537.
Blocking: 845619
Blocking: -845619
This is unrelated to the ci_results_archiver issue.

There's a case to be made that we should close this as "fixed", since
the outage is over.  My vote is to hold this open until we document the
complete explanation of the chain of events leading to the failure.

The open question is "why did it take nearly 8 hours after the failure
of the slave DB for the shard heartbeats to stop?"

Status: Fixed (was: Assigned)
