Testing outage in the Chrome OS test lab
Reported by jrbarnette@chromium.org, May 21 2018
Issue description
Tests are currently unable to run in the Autotest lab. Most suites
simply time out.
We've identified some suspicious symptoms in the database servers:
* "show processlist;" on the master server shows a large number
of sleeping queries.
* ssh to the slave server cros-bighd-0002.mtv times out, but the
server is up.
May 21 2018
At a glance, those look like the usual "slave heartbeat" queries to me. They're a recent change from jkop's project to redirect those queries to the slave; the idea was to reduce lock contention and improve performance, but maybe that's not working out so well. I'm in favor of restarting the slave.
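For anyone not familiar with that change, the general shape of it is a second Django database alias pointing at the slave, with the heavy read-only queries routed there. A minimal sketch follows; the hostnames and settings are placeholders, not our actual config.

# Sketch of the general approach only; hostnames and database names are
# placeholders, not the actual lab configuration.
# settings.py: define a second alias that points at the replica.
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'chromeos_autotest_db',
        'HOST': 'master-db.example.com',
    },
    'readonly': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'chromeos_autotest_db',
        'HOST': 'replica-db.example.com',
    },
}

# Query-side usage (with the AFE Job model from frontend/afe/models.py):
#   Job.objects.using('readonly').all()   # reads hit the replica
# .raw() only gained a `using=` argument in Django 1.7, which is why the
# code quoted later in this bug swaps Job.objects._db by hand instead.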
May 21 2018
Issue 845101 has been merged into this issue.
May 21 2018
The slave has not returned from its reboot after 20+ minutes. Investigating other options, such as reverting the change to use the slave for heartbeats, or switching to our other slave (bighd-0003).
May 21 2018
For context, the change to use the slave for heartbeats is described in Issue 810965.
May 21 2018
Update on the failure mode: the secondary (bighd-0002) handles certain slave heartbeat queries; if it goes down, the shard heartbeats hang and work grinds to a halt. As for the status of bighd-0002: it was definitely sick enough to qualify as "down", and restarting hasn't made things better. The root cause of that server outage isn't known. The next step is to get the heartbeat queries redirected to some other server.
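To make the failure mode concrete: the slave is "up" but not answering (ssh times out too), so queries sent to it just hang rather than failing fast. Below is a minimal sketch (hypothetical tooling, not anything running in the lab; port and timeout are assumptions) of checking the replica with a bounded timeout before relying on it:

# Hypothetical probe, not part of the lab tooling: check whether the
# replica's MySQL port accepts a connection within a bounded timeout,
# instead of letting a query block indefinitely on a half-dead host.
import socket

def replica_is_reachable(host, port=3306, timeout_secs=5):
    try:
        sock = socket.create_connection((host, port), timeout=timeout_secs)
        sock.close()
        return True
    except (socket.timeout, socket.error):
        return False

# The hostname is the slave from this bug; the probe itself is a sketch.
print(replica_is_reachable('cros-bighd-0002.mtv.corp.google.com'))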
May 21 2018
The following revision refers to this bug:
https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/860b187bceea419834a6f09f4d187e9f79bd4f26

commit 860b187bceea419834a6f09f4d187e9f79bd4f26
Author: Allen Li <ayatane@chromium.org>
Date: Mon May 21 17:22:04 2018
May 21 2018
We believe that the root cause is a hardware failure on the underlying
node hosting three of our servers:
cros-bighd-0002 - slave DB server
cros-full-0025 - shard for buddy and glimmer
cros-full-0028 - drone
We're working to find alternative hardware.
May 21 2018
This email just showed up in my inbox:

to: mdb.chromeos-build-deputy@google.com, ganeti-ftr@google.com

Hello Ganeti user!

This message is regarding the following machine(s):
* cros-bighd-0002.mtv.corp.google.com
* cros-full-0025.mtv.corp.google.com
* cros-full-0028.mtv.corp.google.com

Due to unforeseen circumstances, there was an unplanned outage in the cluster where your instance resides. This has resulted in downtime for your instance from 2018-05-21 17:30:00 UTC to 2018-05-21 18:00:00 UTC, but we believe it is running again now. We're very sorry about the unexpected downtime.

Questions? If you have any questions, please open a GUTS ticket for 'sre-ganeti': go/ganeti-ticket

Thanks,
Ganeti Virtualization SRE

Why did I get this email? We send notifications to the individuals/groups listed in the service_contacts Rigging property associated with the affected machines, or the MDB group who owns the machines. Please see: http://google3/ops/sysops/cmt/scripts/get_machine_contacts.py
May 21 2018
That mail also means our VMs have failed over, or otherwise been repaired. And it looks like services on all three affected servers are back. I believe the worst of the outage is over; downgrading to P1.
May 21 2018
I'm going to be that guy and ask for a postmortem. Here are a few things that went wrong:
* Monitoring on the AFE DB did not fire; we had to be told there was an outage by users (see the sketch after this list).
* Secondary was not familiar with the DB config or the software associated with it; we could not determine which hosts were primary/secondary.
* Secondary could not determine the current primary oncall deputy: two calendars said two different things.
* We could not get serial console access to cros-bighd-0002.
* The outage lasted >12 hours.
* We don't have an explanation for the large increase in network traffic to the master at the same time that the secondary went down.
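On the first bullet, here's a sketch of the kind of check I'd want in place (hypothetical monitoring code with placeholder credentials, not our actual setup): alert when the AFE replica is unreachable or replication has stopped.

# Hypothetical monitoring sketch, not the lab's actual config: page if the
# AFE replica can't be reached quickly or replication has stopped.
import MySQLdb
import MySQLdb.cursors

def check_afe_replica(host, user='monitor', passwd='********'):
    try:
        conn = MySQLdb.connect(host=host, user=user, passwd=passwd,
                               connect_timeout=10)
    except MySQLdb.OperationalError as e:
        return 'CRITICAL: cannot connect to %s: %s' % (host, e)
    try:
        cursor = conn.cursor(MySQLdb.cursors.DictCursor)
        cursor.execute('SHOW SLAVE STATUS')
        status = cursor.fetchone()
        if not status:
            return 'CRITICAL: %s is not configured as a slave' % host
        if (status['Slave_IO_Running'] != 'Yes' or
                status['Slave_SQL_Running'] != 'Yes'):
            return 'CRITICAL: replication stopped on %s' % host
        return 'OK'
    finally:
        conn.close()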
May 21 2018
I may have an explanation for the last point. If the slave went down, IIRC the heartbeat queries to it would fall back to pinging the master DB. Also, AFAIK the outage lasted ~2 hours from first person in the office to resolution. We got copious alert emails, but no one was responsible for responding to them until business hours MTV time, since business hours are all we guarantee support for.
May 22 2018
> I may have an explanation for the last point. If the slave went
> down, IIRC the heartbeat queries to it would fall back to pinging
> the master DB.
That would put some holes in our overall theory of the outage, though.
The evidence is that some time after the slave DB failed, heartbeats
on all of the shards dropped to 0. Do we think that the problem was
that the fail-over drove up load on the master enough to cause some
other problem?
I'll note that I'm kind of disposed to believe that the real problem
wasn't merely the slave failure: Data on hand shows a timeline a bit
like this:
5/20 10:15 - slave DB fails
5/20 18:00 - shard heartbeats stop
So, there's an 8 hour gap between DB failure and lab outage that I don't
have a good answer for.
May 22 2018
> jkop@
> I may have an explanation for the last point. If the slave went
> down, IIRC the heartbeat queries to it would fall back to pinging
> the master DB.
I'm not sure about this. Examining the relevant code inside the rpc (frontend/afe/models.py):
job_ids = set([])
<snip>
if cls.FETCH_READONLY_JOBS:
    # TODO(jkop): Get rid of this kludge when we update Django to >=1.7
    # Correct usage would be .raw(..., using='readonly')
    old_db = Job.objects._db
    try:
        Job.objects._db = 'readonly'
        job_ids |= set([j.id for j in Job.objects.raw(raw_sql)])
    finally:
        Job.objects._db = old_db
if not job_ids:
    # If the replica is down or we're in a test, fetch from master.
    job_ids |= set([j.id for j in Job.objects.raw(raw_sql)])
The master is consulted if the first query returns no results. However, it is not consulted if the first query times out or otherwise fails with an exception, which is what I believe was happening here.
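If that reading is right, one possible shape for a fix (a sketch only, untested; which exception to catch is my assumption) is to treat a failing readonly query the same as an empty one, so the master fallback still runs. A hard hang with no exception would additionally need a connect/read timeout on the readonly connection.

# Sketch of a possible fix, not a reviewed change: fall back to the master
# when the readonly query raises, not only when it returns no rows.
from django.db import DatabaseError

job_ids = set([])
if cls.FETCH_READONLY_JOBS:
    old_db = Job.objects._db
    try:
        Job.objects._db = 'readonly'
        job_ids |= set([j.id for j in Job.objects.raw(raw_sql)])
    except DatabaseError:
        # Replica unreachable, timed out, or otherwise broken: leave
        # job_ids empty so the master query below runs.
        job_ids = set([])
    finally:
        Job.objects._db = old_db
if not job_ids:
    # If the replica is down, returned nothing, or we're in a test,
    # fetch from the master.
    job_ids |= set([j.id for j in Job.objects.raw(raw_sql)])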
> jrbarnette@
> So, there's an 8 hour gap between DB failure and lab outage that I don't
> have a good answer for.
I don't understand this yet either. But keep in mind that the shard_client is the only thing on the shard that depends on the slave db, and only for one part of the handling of the heartbeat query. I'd have to read the rpc code in detail, but it's possible that if that one part fails, the rpc can still upload write results for any already running or enqueued jobs.
May 22 2018
Forked retry question to Issue 845537
May 22 2018
This is unrelated to the ci_results_archiver issue. There's a case to be made that we should close this as "fixed", since the outage is over. My vote is to hold this open until we document the complete explanation of the chain of events leading to the failure. The open question is "why did it take nearly 8 hours after the failure of the slave DB for the shard heartbeats to stop?"
Comment 1 by cra...@chromium.org, May 21 2018: 62.6 KB attachment.