
Issue 800132

Starred by 2 users

Issue metadata

Status: WontFix
Owner:
Closed: Jan 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




Infrastructure issues are causing build failures (HWTest [provision] timing out)

Project Member Reported by diand...@chromium.org, Jan 8 2018

Issue description

If you look at:
  https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/17435

You can see a whole pile of boards failed with:
  The HWTest [provision] stage failed: ** HWTest did not complete due to infrastructure issues (code 3) **

Digging into bob-paladin, AKA:
  https://luci-milo.appspot.com/buildbot/chromeos/bob-paladin/2004

...and then following "HWTest [provision] -> stdout", you see:
  https://logs.chromium.org/v/?s=chromeos%2Fbb%2Fchromeos%2Fbob-paladin%2F2004%2F%2B%2Frecipes%2Fsteps%2FHWTest__provision_%2F0%2Fstdout

The failure appears to be:

  01-08-2018 [13:17:16] Suite job is finished.
  Suite timed out. Started on 01-08-2018 [11:41:03], timed out on 01-08-2018 [13:17:16]


Possibly this is related to these messages?

11:49:18: INFO: Refreshing due to a 401 (attempt 1/2)
11:49:18: INFO: Refreshing access_token
11:52:41: INFO: Refreshing due to a 401 (attempt 1/2)
11:52:41: INFO: Refreshing access_token
12:45:27: INFO: Refreshing due to a 401 (attempt 1/2)
12:45:27: INFO: Refreshing access_token
12:49:21: INFO: Refreshing due to a 401 (attempt 1/2)
12:49:21: INFO: Refreshing access_token
12:52:48: INFO: Refreshing due to a 401 (attempt 1/2)
12:52:49: INFO: Refreshing access_token
 
Cc: pprabhu@chromium.org ayatane@chromium.org
> Possibly this is related to these messages?
>
> 11:49:18: INFO: Refreshing due to a 401 (attempt 1/2)
> 11:49:18: INFO: Refreshing access_token

I believe that these messages are unrelated.

Digging in, we start at the suite job:
    http://cautotest-prod.corp.google.com/afe/#tab_id=view_job&object_id=168600820

That shows what the builder says:  The job timed out because none of
the individual provision tasks completed.  According to that page, none
of the tests even _started_.

Digging into one selected test, you find this on the master:
    http://cautotest-prod.corp.google.com/afe/#tab_id=view_job&object_id=168600829
That also shows "Aborted (Queued)", meaning the job was never assigned
to a DUT.

However, looking at the test on the shard, you see a different
story:
    http://chromeos-server97.mtv.corp.google.com/afe/#tab_id=view_job&object_id=168600829
That shows "Aborted (Parsing)", meaning that the job wasn't aborted
until it had already completed writing its results.

Following that job's logs, you can get here:
    http://chromeos-server97.mtv.corp.google.com/results/168600829-chromeos-test/chromeos6-row3-rack13-host9/
That shows that the DUT successfully installed the new build,
and successfully ran the dummy_Pass test.

The DUT in question is chromeos6-row3-rack13-host9; checking its history:
$ dut-status -f -d 3 -u '2018-01-08 13:40:00' chromeos6-row3-rack13-host9
chromeos6-row3-rack13-host9
    2018-01-08 11:43:39  OK https://ubercautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row3-rack13-host9/2524751-provision/

That says that the DUT completed a provision task shortly after
creation of the suite job.

Finally, checking the DUT:
$ ssh chromeos6-row3-rack13-host9 grep CHROMEOS_RELEASE_BUILDER_PATH /etc/lsb-release
CHROMEOS_RELEASE_BUILDER_PATH=bob-paladin/R65-10291.0.0-rc1

So, the DUT successfully installed the build, just like the shard
logs say.

Going back to the master, the provision suite logs finish up like this:
01/08 11:41:33.661 DEBUG|             suite:1192| Scheduled 8 tests, writing the total to keyval.
01/08 11:41:33.661 DEBUG|     dynamic_suite:0622| Waiting on suite.

After those lines, the logs are supposed to report periodically
that they're still waiting, until they see updates saying that the
jobs are done.  So, it _looks_ like the process either hung or
died after it started waiting.
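
For reference, the waiting behavior is roughly this shape (a minimal
sketch, not the actual dynamic_suite code; wait_for_jobs,
job_is_finished, and the interval are made-up names/values):

    import logging
    import time

    POLL_INTERVAL_SECONDS = 60  # made-up interval; the real suite code has its own

    def wait_for_jobs(job_ids, job_is_finished):
        """Poll until every job reports finished, logging progress on each pass."""
        pending = set(job_ids)
        while pending:
            pending = {job_id for job_id in pending if not job_is_finished(job_id)}
            if pending:
                logging.debug('Waiting on %d jobs: %s', len(pending), sorted(pending))
                time.sleep(POLL_INTERVAL_SECONDS)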

You see the same thing on all of the failed boards.  The problem happened
on multiple shards, with multiple different drones.  I can't find any
logs to explain the failures, and the problem isn't isolated to one specific
place.  So, I don't really understand.

Digging into the shard logs, you see these events in the scheduler
log for the job:

Event 1, autoserv starts provisioning:
01/08 11:55:12.645 INFO |     drone_manager:0786| command = ['nice', '-n', '10', '/usr/local/autotest/server/autoserv', '-p', '-r', u'/usr/local/autotest/results/168600829-chromeos-test/chromeos6-row3-rack13-host9', '-m', u'chromeos6-row3-rack13-host9', '-u', u'chromeos-test', '-l', u'bob-paladin/R65-10291.0.0-rc1/provision/dummy_Pass', '-c', '--lab', 'True', '-P', u'168600829-chromeos-test/chromeos6-row3-rack13-host9', '-n', '/usr/local/autotest/results/drone_tmp/attach.2248', '--verify_job_repo_url']

Event 2, (I have no freakin' clue):
01/08 11:55:17.449 ERROR|     drone_manager:0856| Process localhost/84373 found, but not an autoserv process. Is {'pgid': '20866', 'pid': '84373', 'ppid': '1', 'comm': 'autoserv', 'args': '/usr/bin/python -u /usr/local/autotest/server/autoserv -p -r /usr/local/autotest/results/168600829-chromeos-test/chromeos6-row3-rack13-host9 -m chromeos6-row3-rack13-host9 -u chromeos-test -l bob-paladin/R65-10291.0.0-rc1/provision/dummy_Pass -c --lab True -P 168600829-chromeos-test/chromeos6-row3-rack13-host9 -n /usr/local/autotest/results/drone_tmp/attach.2248 --verify_job_repo_url'}

Event 3, start monitoring autoserv (this happens after autoserv has run to completion and exited):
01/08 11:56:23.271 INFO |        luciferlib:0235| Spawning '/usr/local/autotest/bin/job_reporter', ['/usr/local/autotest/bin/job_reporter', '--run-job-path', '/opt/infra-tools/usr/bin/lucifer_run_job', '--jobdir', '/usr/local/autotest/leases', '--job-id', '168600829', '--autoserv-exit', '0', '--', '-resultsdir', u'/usr/local/autotest/results/168600829-chromeos-test/chromeos6-row3-rack13-host9', '-autotestdir', '/usr/local/autotest', '-watcherpath', '/opt/infra-tools/usr/bin/lucifer_watcher'], u'/usr/local/autotest/results/168600829-chromeos-test/chromeos6-row3-rack13-host9/job_reporter_output.log'

Event 4, the shard times out waiting for a process that's already finished:
01/08 12:17:57.621 WARNI|monitor_db_cleanup:0114| Aborting entry chromeos6-row3-rack13-host9/168600829 (168997561) due to max runtime

Digging a bit further:

> Event 1, autoserv starts provisioning:
This is actually the start of the test after provisioning passed.


> Event 2, (I have no freakin' clue):
This event seems to be unique to one job, and unrelated to the failure.

Owner: ayatane@chromium.org
Status: Assigned (was: Untriaged)
I don't know what's going on here; ayatane@ knows more about how to
debug this...

Good news is that this probably is not hitting this CQ run :)
From what I can tell, tko/parse is hanging before it can even get so far as to drop a pidfile.  It is not correlated with any code changes.
Hmmm...  There was an alert for our TKO instance at almost exactly the
time of these failures:

Stackdriver has detected that one of your resources has entered an alert state. 
Summary: Queries for google.com:chromeos-lab tko is below the threshold of 0.1 with a value of 0. 
Violation Began: 2018-01-08 19:56:13 UTC (3 mins 40 secs) 
Condition Name: Metric Threshold on Google Cloud SQL tko 

View violation details: https://app.google.stackdriver.com/incidents/0.kn21hl6yo6c0?project=google.com%3Achromeos-lab 
Document:

TKO db: Query rate has been zero for 5 minutes. Is the instance down?

Instance: https://pantheon.corp.google.com/sql/instances/tko/overview?authuser=0&project=google.com:chromeos-lab&duration=PT6H

Comment 8 by hiroh@chromium.org, Jan 9 2018

This failure isn't appearing on CQ now.
Should we mark this as fixed?
Note for fixing jammed database rows caused by an intersecting bug:

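 # For each shard, list handed-off jobs that are marked aborted but neither active nor complete (the jammed rows):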
 atest shard list | sed 1d | awk -F' +' '{print $2}' | while read -r x; do autotest-db -s -h "$x" -- -e "select job.id, created_on, active, complete, aborted, started_on, h.created, h.completed from afe_jobs as job join afe_host_queue_entries as hqe on job.id = job_id join afe_job_handoffs as h on job.id = h.job_id where aborted = 1 and active = 0 and complete = 0"; done

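 # Same selection, but clear the afe_job_handoffs.completed flag on those rows: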
 atest shard list | sed 1d | awk -F' +' '{print $2}' | while read -r x; do autotest-db -s -h "$x" -- -e "update afe_jobs as job join afe_host_queue_entries as hqe on job.id = job_id join afe_job_handoffs as h on job.id = h.job_id set h.completed = 0 where aborted = 1 and active = 0 and complete = 0"; done
Cc: jdufault@chromium.org xixuan@chromium.org jrbarnette@chromium.org
Issue 800104 has been merged into this issue.
Owner: jrbarnette@chromium.org
> This failure isn't appearing on CQ now.
> Should we mark this as fixed?

Given bug 800104, I'm satisfied that the reported outage in
our Cloud SQL instance (see crbug.com/800132 #7) is the cause
of these failures.  Since we don't control such outages, in
that sense this problem is fixed.  However, if time permits,
I'd like to spend some time understanding whether there's any
mitigation we can and should apply.

So, holding this bug open for a bit.

It'd also be super if we could somehow make this more obvious for next time.  Ideas:

* Somehow surface alerts of outages to a place where everyone will see them, like SOM, the waterfall, a group chat, crosoncall, or ...?

* Make it more obvious what we were waiting on when we timed out (a rough sketch follows below).
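
As a rough illustration of that second idea (hypothetical names, not
existing Autotest code), the abort-on-timeout path could say exactly
what it was still waiting for before giving up:

    import logging

    def abort_expired_entries(pending_entries, now, max_runtime):
        """Abort entries past max runtime, logging exactly what was still pending."""
        # 'entry' is a stand-in object here (job_id, hostname, status, started_on,
        # abort()); the real host-queue-entry model is different.
        for entry in pending_entries:
            if now - entry.started_on > max_runtime:
                logging.warning(
                    'Aborting job %s after %s; still waiting on host %s in state %s',
                    entry.job_id, max_runtime, entry.hostname, entry.status)
                entry.abort()
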
OK.  I dug a bit to confirm the diagnosis.  The Cloud SQL logs are
pretty clear about there being a failure starting at 11:48:48:
    2018-01-08 19:48:48 126822 [ERROR] InnoDB: The OS said file flush did not succeed

Some subsequent error messages said that errno was EIO.

The service automatically restarted, and the outage was over
at 12:09:36.

I'm uncertain as to whether there's much we could have done
about any of our failures.  Presumably, tko/parse was in some
sort of retry loop trying to talk to the down TKO instance, so
that part of the failure might be improved on.  For the rest,
I don't know.
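
If we did want to harden the tko/parse path, the usual shape is a
bounded retry with a hard deadline instead of retrying indefinitely
(a sketch only, assuming the DB call can be wrapped; run_query and
the timings are made up):

    import logging
    import time

    def query_with_deadline(run_query, deadline_secs=600, retry_delay_secs=30):
        """Retry a flaky database call, but give up once the deadline passes."""
        deadline = time.time() + deadline_secs
        while True:
            try:
                return run_query()
            except Exception:
                if time.time() + retry_delay_secs > deadline:
                    logging.exception('TKO still unreachable after %ss; giving up',
                                      deadline_secs)
                    raise
                logging.warning('TKO query failed; retrying in %ss', retry_delay_secs)
                time.sleep(retry_delay_secs)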

Status: WontFix (was: Assigned)
I haven't had any further inspiration regarding possible
corrective actions.  The root cause of the failures is
pretty well established:  there were the alerts we received
in e-mail, and also an automatically filed Cloud SQL ticket
at b/71709521.

So, in conclusion:
  * Something happened to the TKO service, causing an outage
    of about 20 minutes.
  * The service outage caused various errors and failures in
    Autotest.
  * We don't know what caused the TKO service outage.
  * There's no obvious way to improve Autotest handling of this
    kind of problem.

So, let's throw up our hands and move on.
