autotest should not start provision when DUT is not responsive |
||||||
Issue descriptionhttps://luci-milo.appspot.com/buildbot/chromeos/glimmer-release/1163 has two failed tests: platform_DMVerityBitCorruption [ FAILED ] platform_DMVerityBitCorruption ABORT: Timed out, did not run. security_ModuleLocking [ FAILED ] security_ModuleLocking ABORT: Timed out, did not run. Logs for the second test are in this bucket: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/103098987-chromeos-test/chromeos4-row7-rack3-host13/debug/ autoserv.DEBUG shows that the device is not responding, but then proceeds to start provisioning. It probably should not. 02/23 23:26:32.844 ERROR| traceback:0013| Duration: 63.1573028564 02/23 23:26:32.845 ERROR| traceback:0013| 02/23 23:26:32.845 ERROR| traceback:0013| stderr: 02/23 23:26:32.845 ERROR| traceback:0013| ssh: connect to host chromeos4-row7-rack3-host13 port 22: Connection timed out) 02/23 23:26:32.877 DEBUG|provision_AutoUpda:0059| Start provisioning <remote host: chromeos4-row7-rack3-host13> to glimmer-release/R58-9312.0.0 02/23 23:26:32.881 DEBUG| dev_server:0694| The host chromeos4-row7-rack3-host13 (100.115.198.29) is in a restricted subnet. Try to locate a devserver inside subnet 100.115.192.0:19. 02/23 23:26:32.883 DEBUG| base_utils:0185| Running 'ssh 100.115.219.136 'curl "http://100.115.219.136:8082/check_health?"'' 02/23 23:26:35.085 INFO | dev_server:1075| Staging artifacts on devserver http://100.115.219.136:8082: build=glimmer-release/R58-9312.0.0, artifacts=['full_payload', 'stateful', 'autotest_packages'], files=, archive_url=gs://chromeos-image-archive/glimmer-release/R58-9312.0.0
,
Feb 24 2017
Kevin maybe you can help explain this? Here's the sequence of jobs on this DUT: 2017-02-23 23:54:35 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row7-rack3-host13/782681-provision/ 2017-02-23 23:31:41 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row7-rack3-host13/782616-repair/ 2017-02-23 23:23:54 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row7-rack3-host13/782583-provision/ 2017-02-23 16:17:00 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/103052397-chromeos-test/ 2017-02-23 16:16:40 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row7-rack3-host13/781907-reset/ 2017-02-23 15:54:58 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row7-rack3-host13/781765-repair/ 2017-02-23 15:52:37 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row7-rack3-host13/781758-reset/ 2017-02-23 15:50:08 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/103049700-chromeos-test/ 2017-02-23 15:40:21 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row7-rack3-host13/781739-provision/ I don't understand this output. I see either "OK" or "--" as the result (? I assume) of a provision. So "--" would mean the provision was not OK, i.e. it failed? But for ###-chromeos-test I only see "--" results (even on other DUTs). Is the meaning different? In any case, here we have (order reversed): - successful provision at 15:40 - test at 15:50. Failed? That would explain the immediately following: -- reset (failed?) -- repair OK -- reset OK -- another test (OK?) and then nothing happens until 2323 - provision (fails) - repair (works) - provision (works) Unfortunately /var/log/messages from the last repair doesn't seem to include the time of the previous failed provision. It starts like this: 2017-02-24T00:24:08.856660+00:00 INFO kernel: [ 0.000000] Initializing cgroup subsys cpuset and stays up until autotest reboots it: 2017-02-24T00:28:34.804452+00:00 NOTICE ag[5206]: autotest server[stack::run_op|reboot|run_background] -> ssh_run(( sleep 1; reboot & sleep 10; reboot -f ) </dev/null >/dev/null 2>&1 & echo -n $!) then the clock goes back 1 hour (I've never seen that) 2017-02-23T23:54:00.434981-08:00 INFO kernel: [ 0.000000] Initializing cgroup subsys cpuset and that's already too late.
,
Mar 1 2017
May be part of Richard's provision re-write
,
Mar 2 2017
Re c#2, the questions seem not entirely relevant to this bug.
For more details on how to interpret the output from dut-status,
try this:
vi $(which dut-status)
The docstring has answers to the questions.
,
Mar 2 2017
Re c#3: I'm strongly considering implementing the suggestion from c#1. That is, schedule a verify/reset task prior to provisioning. That would prevent provisioning from running on an offline DUT in most cases. Done right, it could also alleviate problems with "Provision failures are recorded as test failures".
,
Jun 21 2017
,
Mar 31 2018
Status?
,
Mar 31 2018
,
Mar 31 2018
The requested behavior would be at least partly addressed as a
side-effect of this design proposal:
https://docs.google.com/document/d/1Nr8qbqJlVm7eMeKaA6bgecRGsgfJfC1TaybpWTmDZvg/edit
See, specifically, the provisioning step entitled "Prepare the host".
|
||||||
►
Sign in to add a comment |
||||||
Comment 1 by jrbarnette@chromium.org
, Feb 24 2017Owner: ----
Status: Available (was: Untriaged)