Infrastructure issues are causing build failures (HWTest [provision] timing out)
Issue description

If you look at:
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/17435
you can see a whole pile of boards failed with:

The HWTest [provision] stage failed: ** HWTest did not complete due to infrastructure issues (code 3) **

Digging into bob-paladin, AKA:
https://luci-milo.appspot.com/buildbot/chromeos/bob-paladin/2004
...and then following "HWTest [provision] -> stdout", you see:
https://logs.chromium.org/v/?s=chromeos%2Fbb%2Fchromeos%2Fbob-paladin%2F2004%2F%2B%2Frecipes%2Fsteps%2FHWTest__provision_%2F0%2Fstdout

The failure appears to be:

01-08-2018 [13:17:16] Suite job is finished.
Suite timed out. Started on 01-08-2018 [11:41:03], timed out on 01-08-2018 [13:17:16]

Possibly this is related to these messages?

11:49:18: INFO: Refreshing due to a 401 (attempt 1/2)
11:49:18: INFO: Refreshing access_token
11:52:41: INFO: Refreshing due to a 401 (attempt 1/2)
11:52:41: INFO: Refreshing access_token
12:45:27: INFO: Refreshing due to a 401 (attempt 1/2)
12:45:27: INFO: Refreshing access_token
12:49:21: INFO: Refreshing due to a 401 (attempt 1/2)
12:49:21: INFO: Refreshing access_token
12:52:48: INFO: Refreshing due to a 401 (attempt 1/2)
12:52:49: INFO: Refreshing access_token
Jan 9 2018
Digging into the shard logs, specifically the scheduler logs for this job, you see these events:
Event 1, autoserv starts provisioning:
01/08 11:55:12.645 INFO | drone_manager:0786| command = ['nice', '-n', '10', '/usr/local/autotest/server/autoserv', '-p', '-r', u'/usr/local/autotest/results/168600829-chromeos-test/chromeos6-row3-rack13-host9', '-m', u'chromeos6-row3-rack13-host9', '-u', u'chromeos-test', '-l', u'bob-paladin/R65-10291.0.0-rc1/provision/dummy_Pass', '-c', '--lab', 'True', '-P', u'168600829-chromeos-test/chromeos6-row3-rack13-host9', '-n', '/usr/local/autotest/results/drone_tmp/attach.2248', '--verify_job_repo_url']
Event 2, (I have no freakin' clue):
01/08 11:55:17.449 ERROR| drone_manager:0856| Process localhost/84373 found, but not an autoserv process. Is {'pgid': '20866', 'pid': '84373', 'ppid': '1', 'comm': 'autoserv', 'args': '/usr/bin/python -u /usr/local/autotest/server/autoserv -p -r /usr/local/autotest/results/168600829-chromeos-test/chromeos6-row3-rack13-host9 -m chromeos6-row3-rack13-host9 -u chromeos-test -l bob-paladin/R65-10291.0.0-rc1/provision/dummy_Pass -c --lab True -P 168600829-chromeos-test/chromeos6-row3-rack13-host9 -n /usr/local/autotest/results/drone_tmp/attach.2248 --verify_job_repo_url'}
Event 3, start monitoring autoserv (this happens after autoserv has run to completion and exited):
01/08 11:56:23.271 INFO | luciferlib:0235| Spawning '/usr/local/autotest/bin/job_reporter', ['/usr/local/autotest/bin/job_reporter', '--run-job-path', '/opt/infra-tools/usr/bin/lucifer_run_job', '--jobdir', '/usr/local/autotest/leases', '--job-id', '168600829', '--autoserv-exit', '0', '--', '-resultsdir', u'/usr/local/autotest/results/168600829-chromeos-test/chromeos6-row3-rack13-host9', '-autotestdir', '/usr/local/autotest', '-watcherpath', '/opt/infra-tools/usr/bin/lucifer_watcher'], u'/usr/local/autotest/results/168600829-chromeos-test/chromeos6-row3-rack13-host9/job_reporter_output.log'
Event 4, the shard times out waiting for a process that's already finished:
01/08 12:17:57.621 WARNI|monitor_db_cleanup:0114| Aborting entry chromeos6-row3-rack13-host9/168600829 (168997561) due to max runtime
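For a sense of how widespread these max-runtime aborts were, something like the sketch below could be run against every shard. This is only a sketch: it assumes the shard hostnames from atest shard list are reachable over ssh, and the scheduler log path is an assumption about where each shard keeps its logs.

# Hedged sketch: count recent "due to max runtime" aborts on each shard.
# /usr/local/autotest/logs/scheduler.latest is an assumed location; adjust to the real layout.
atest shard list | sed 1d | awk -F' +' '{print $2}' | while read -r shard; do
  count=$(ssh "$shard" "grep -c 'due to max runtime' /usr/local/autotest/logs/scheduler.latest" 2>/dev/null)
  echo "${shard}: ${count:-unknown} max-runtime aborts"
done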
Jan 9 2018
Digging a bit further:

> Event 1, autoserv starts provisioning:

This is actually the start of the test after provisioning passed.

> Event 2, (I have no freakin' clue):

This event seems to be unique to one job, and unrelated to the failure.
Jan 9 2018
I don't know what's going on here; ayatane@ knows more about how to debug this...
Jan 9 2018
Good news is that this probably is not hitting this CQ run :)
Jan 9 2018
From what I can tell, tko/parse is hanging before it can even get so far as to drop a pidfile. It is not correlated with any code changes.
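One rough way to spot that state by hand would be to look for parse processes that have been alive suspiciously long. The process-name pattern and the one-hour threshold below are illustrative guesses, not what our monitoring actually checks:

# Hedged sketch: list tko/parse processes older than an hour (3600s).
# The /tko\/parse/ match and the threshold are assumptions for illustration.
ps -eo pid,etimes,args --no-headers \
  | awk '/tko\/parse/ && $2 > 3600 {print "possibly hung parse:", $0}'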
Jan 9 2018
Hmmm... There was an alert for our TKO instance at almost exactly the time of these failures:

Stackdriver has detected that one of your resources has entered an alert state.
Summary: Queries for google.com:chromeos-lab tko is below the threshold of 0.1 with a value of 0.
Violation Began: 2018-01-08 19:56:13 UTC (3 mins 40 secs)
Condition Name: Metric Threshold on Google Cloud SQL tko
View violation details: https://app.google.stackdriver.com/incidents/0.kn21hl6yo6c0?project=google.com%3Achromeos-lab
Document: TKO db: Query rate has been zero for 5 minutes. Is the instance down?
Instance: https://pantheon.corp.google.com/sql/instances/tko/overview?authuser=0&project=google.com:chromeos-lab&duration=PT6H
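For reference, the metric behind that alert is just the instance's query rate. A hand-rolled equivalent check (a sketch only: TKO_HOST/TKO_USER/TKO_PASS are placeholders, and the Stackdriver alert remains the authoritative signal) would be:

# Hedged sketch: sample MySQL's global Questions counter twice; if it doesn't
# move, the instance isn't serving queries. Connection details are placeholders.
q1=$(mysql -h "$TKO_HOST" -u "$TKO_USER" -p"$TKO_PASS" -N -B \
       -e "SHOW GLOBAL STATUS LIKE 'Questions'" | awk '{print $2}')
sleep 60
q2=$(mysql -h "$TKO_HOST" -u "$TKO_USER" -p"$TKO_PASS" -N -B \
       -e "SHOW GLOBAL STATUS LIKE 'Questions'" | awk '{print $2}')
echo "queries in the last minute: $((q2 - q1))"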
Jan 9 2018
This failure isn't appearing on the CQ now. Should we mark it fixed?
Jan 9 2018
Note for fixing jammed database rows caused by an intersecting bug:
# List the jammed rows on every shard (entries that were aborted but never went active or complete), along with their handoff state:
atest shard list | sed 1d | awk -F' +' '{print $2}' | while read -r x; do autotest-db -s -h "$x" -- -e "select job.id, created_on, active, complete, aborted, started_on, h.created, h.completed from afe_jobs as job join afe_host_queue_entries as hqe on job.id = job_id join afe_job_handoffs as h on job.id = h.job_id where aborted = 1 and active = 0 and complete = 0"; done

# Clear the completed flag on the matching job handoffs:
atest shard list | sed 1d | awk -F' +' '{print $2}' | while read -r x; do autotest-db -s -h "$x" -- -e "update afe_jobs as job join afe_host_queue_entries as hqe on job.id = job_id join afe_job_handoffs as h on job.id = h.job_id set h.completed = 0 where aborted = 1 and active = 0 and complete = 0"; done
Jan 9 2018
Issue 800104 has been merged into this issue.
Jan 9 2018
> This failure isn't appearing on the CQ now.
> Should we mark it fixed?

Given bug 800104, I'm satisfied that the reported outage in our Cloud SQL instance (see crbug.com/800132 #7) is the cause of these failures. Since we didn't/don't control such outages, in that sense this problem is fixed.

However, if time permits, I'd like to spend some time understanding whether there's mitigation we can and should apply. So, holding this bug open for a bit.
Jan 9 2018
It'd also be super if we could somehow make this more obvious for next time. Ideas (a sketch for the first one follows this list):
* Somehow surface alerts of outages to a place where everyone will see them, like SOM, the waterfall, a group chat, crosoncall, or ...?
* Make it so that it's more obvious what we were waiting on when we timed out.
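For the first idea, one low-effort shape it could take is simply forwarding the alert text to a chat room via an incoming webhook. A sketch only: ALERT_WEBHOOK_URL is a placeholder, and hooking this up to the actual Stackdriver notification channel is left open.

# Hedged sketch: post an alert summary to a chat room. ALERT_WEBHOOK_URL is a
# placeholder, not an endpoint we have today.
alert_text="TKO Cloud SQL query rate dropped to zero at $(date -u '+%Y-%m-%d %H:%M:%S') UTC"
curl -sS -X POST \
  -H 'Content-Type: application/json' \
  -d "{\"text\": \"${alert_text}\"}" \
  "${ALERT_WEBHOOK_URL}"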
Jan 10 2018
OK. I dug a bit to confirm the diagnosis. The Cloud SQL logs are pretty clear about there being a failure starting at 11:48:48:

2018-01-08 19:48:48 126822 [ERROR] InnoDB: The OS said file flush did not succeed

Some subsequent error messages said that errno was EIO. The service automatically restarted, and the outage was over at 12:09:36.

I'm uncertain whether there's much we could have done about any of our failures. Presumably, tko/parse was in some sort of retry loop trying to talk to the down TKO instance, so that part of the failure might be improved on. For the rest, I don't know.
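If we wanted to harden the tko/parse side, the simplest shape is probably a hard deadline plus a small, bounded retry budget around the parse step, so a TKO outage surfaces as a clear failure instead of an open-ended wait. A rough sketch; the deadline, retry count, and the parse command line here are all illustrative assumptions, not how the scheduler actually invokes it:

# Hedged sketch: bound the parse step with a deadline and a few retries.
deadline_secs=600
parse_cmd=(/usr/local/autotest/tko/parse -o "$RESULTS_DIR")  # illustrative invocation only
for attempt in 1 2 3; do
  if timeout "${deadline_secs}s" "${parse_cmd[@]}"; then
    exit 0
  fi
  echo "tko/parse attempt ${attempt} failed or timed out; retrying in 60s" >&2
  sleep 60
done
echo "tko/parse did not finish within the retry budget; giving up" >&2
exit 1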
Jan 17 2018
I haven't had any further inspirations regarding possible corrective actions. The root cause of the failures is pretty well established: there were the alerts we received in e-mail, and also an automatically filed Cloud SQL ticket at b/71709521.

So, in conclusion:
* Something happened to the TKO service, causing an outage of about 20 minutes.
* The service outage caused various errors and failures in Autotest.
* We don't know what caused the TKO service outage.
* There's no obvious way to improve Autotest's handling of this kind of problem.

So, let's throw up our hands and move on.
Comment 1 by jrbarnette@chromium.org, Jan 8 2018

> Possibly this is related to these messages?
>
> 11:49:18: INFO: Refreshing due to a 401 (attempt 1/2)
> 11:49:18: INFO: Refreshing access_token

I believe that these messages are unrelated.

Digging in, we start at the suite job:
http://cautotest-prod.corp.google.com/afe/#tab_id=view_job&object_id=168600820

That shows what the builder says: The job timed out because none of the individual provision tasks completed. According to that page, none of the tests even _started_.

Digging into one selected test, you find this on the master:
http://cautotest-prod.corp.google.com/afe/#tab_id=view_job&object_id=168600829

That also shows "Aborted (Queued)", meaning the job was never assigned to a DUT. However, looking at the test on the shard, you see a different story:
http://chromeos-server97.mtv.corp.google.com/afe/#tab_id=view_job&object_id=168600829

That shows "Aborted (Parsing)", meaning that the job wasn't aborted until it had already completed writing its results. Following that job's logs, you can get here:
http://chromeos-server97.mtv.corp.google.com/results/168600829-chromeos-test/chromeos6-row3-rack13-host9/

That shows that the DUT successfully installed the new build, and successfully ran the dummy_Pass test.

The DUT in question is chromeos6-row3-rack13-host9; checking its history:

$ dut-status -f -d 3 -u '2018-01-08 13:40:00' chromeos6-row3-rack13-host9
chromeos6-row3-rack13-host9
    2018-01-08 11:43:39  OK https://ubercautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row3-rack13-host9/2524751-provision/

That says that the DUT completed a provision task shortly after creation of the suite job.

Finally, checking the DUT:

$ ssh chromeos6-row3-rack13-host19 grep CHROMEOS_RELEASE_BUILDER_PATH /etc/lsb-release
CHROMEOS_RELEASE_BUILDER_PATH=bob-paladin/R65-10291.0.0-rc1

So, the DUT successfully installed the build, just like the shard logs say.

Going back to the master, the provision suite logs finish up like this:

01/08 11:41:33.661 DEBUG| suite:1192| Scheduled 8 tests, writing the total to keyval.
01/08 11:41:33.661 DEBUG| dynamic_suite:0622| Waiting on suite.

After those lines, the logs are supposed to report periodically that they're waiting, until they see the updates that the jobs are done. So, it _looks_ like the process either hung or died after starting to wait. You see the same thing on all of the failed boards.

The problem happened on multiple shards, with multiple different drones. I can't find any logs to explain the failures, and the problem isn't isolated to one specific place. So, I don't really understand.
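A quick way to tell "hung" from "died" the next time this happens is to check whether the waiting process still exists and what it is blocked in. A sketch only: the run_suite/dynamic_suite pattern is a guess at how the waiting process shows up in ps, so adjust it to the real command line.

# Hedged sketch: show elapsed time and kernel wait channel for candidate suite
# processes. No output suggests the process died; a long-lived process parked
# in a socket/poll wait suggests it hung after "Waiting on suite."
ps -eo pid,etimes,wchan:32,args --no-headers \
  | grep -E 'run_suite|dynamic_suite' \
  | grep -v grep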