
Issue 695899

Starred by 3 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Feature




autotest should not start provision when DUT is not responsive

Project Member Reported by semenzato@chromium.org, Feb 24 2017

Issue description

https://luci-milo.appspot.com/buildbot/chromeos/glimmer-release/1163

has two failed tests:

  platform_DMVerityBitCorruption      [ FAILED ]
  platform_DMVerityBitCorruption        ABORT: Timed out, did not run.
  security_ModuleLocking              [ FAILED ]
  security_ModuleLocking                ABORT: Timed out, did not run.

Logs for the second test are in this bucket:

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/103098987-chromeos-test/chromeos4-row7-rack3-host13/debug/

autoserv.DEBUG shows that the device is not responding, but autoserv then proceeds to start provisioning anyway.  It probably should not.

02/23 23:26:32.844 ERROR|         traceback:0013| Duration: 63.1573028564
02/23 23:26:32.845 ERROR|         traceback:0013| 
02/23 23:26:32.845 ERROR|         traceback:0013| stderr:
02/23 23:26:32.845 ERROR|         traceback:0013| ssh: connect to host chromeos4-row7-rack3-host13 port 22: Connection timed out)
02/23 23:26:32.877 DEBUG|provision_AutoUpda:0059| Start provisioning <remote host: chromeos4-row7-rack3-host13> to glimmer-release/R58-9312.0.0
02/23 23:26:32.881 DEBUG|        dev_server:0694| The host chromeos4-row7-rack3-host13 (100.115.198.29) is in a restricted subnet. Try to locate a devserver inside subnet 100.115.192.0:19.
02/23 23:26:32.883 DEBUG|        base_utils:0185| Running 'ssh 100.115.219.136 'curl "http://100.115.219.136:8082/check_health?"''
02/23 23:26:35.085 INFO |        dev_server:1075| Staging artifacts on devserver http://100.115.219.136:8082: build=glimmer-release/R58-9312.0.0, artifacts=['full_payload', 'stateful', 'autotest_packages'], files=, archive_url=gs://chromeos-image-archive/glimmer-release/R58-9312.0.0
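The missing guard amounts to a reachability check before the provision step begins.  A minimal sketch, with hypothetical helper names (the real autoserv host abstraction wraps SSH differently):

```python
import subprocess

def dut_is_responsive(host, timeout_secs=10):
    """Return True if the DUT accepts an SSH connection.

    Hypothetical helper: the real autoserv code has its own SSH host
    layer, but the check it needs is equivalent to this.
    """
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes",
         "-o", "ConnectTimeout=%d" % timeout_secs, host, "true"],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0

def start_provision(host, build, is_responsive=dut_is_responsive):
    """Refuse to start provisioning an unreachable DUT.

    The check is injectable so callers (and tests) can substitute
    their own reachability probe.
    """
    if not is_responsive(host):
        raise RuntimeError(
            "%s did not answer SSH; not provisioning to %s" % (host, build))
    # ... stage artifacts on a devserver and run the update here;
    # for the sketch, just report what would happen.
    return "provisioning %s to %s" % (host, build)
```

With a guard like this, the SSH timeout seen above would abort the task (and presumably trigger a repair) instead of staging artifacts for a DUT that cannot receive them.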

 
Labels: -Type-Bug Type-Feature
Owner: ----
Status: Available (was: Untriaged)
A plausible answer would be to run a reset/verify job prior
to running provisioning.  I think it's subtler than that,
in that some verify checks could maybe be skipped if the
next action will be provision.

That's a non-trivial change to the scheduler, and out of scope
for me right now.

Cc: kevcheng@chromium.org
Kevin maybe you can help explain this?


Here's the sequence of jobs on this DUT:

2017-02-23 23:54:35 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row7-rack3-host13/782681-provision/
2017-02-23 23:31:41 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row7-rack3-host13/782616-repair/
2017-02-23 23:23:54 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row7-rack3-host13/782583-provision/
2017-02-23 16:17:00 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/103052397-chromeos-test/
2017-02-23 16:16:40 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row7-rack3-host13/781907-reset/
2017-02-23 15:54:58 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row7-rack3-host13/781765-repair/
2017-02-23 15:52:37 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row7-rack3-host13/781758-reset/
2017-02-23 15:50:08 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/103049700-chromeos-test/
2017-02-23 15:40:21 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row7-rack3-host13/781739-provision/

I don't understand this output.  I see either "OK" or "--" as (I assume) the result of each task.  So "--" would mean the task was not OK, i.e. it failed?  But for the ###-chromeos-test jobs I only ever see "--" (even on other DUTs).  Does it mean something different there?

In any case, here we have (order reversed):

- successful provision at 15:40
- test at 15:50.  Failed?  That would explain the immediately following:
-- reset (failed?)
-- repair OK
-- reset OK
-- another test (OK?)

and then nothing happens until 23:23:

- provision (fails)
- repair (works)
- provision (works)

Unfortunately /var/log/messages from the last repair doesn't seem to include the time of the previous failed provision.  It starts like this:

2017-02-24T00:24:08.856660+00:00 INFO kernel: [    0.000000] Initializing cgroup subsys cpuset

and stays up until autotest reboots it:

2017-02-24T00:28:34.804452+00:00 NOTICE ag[5206]: autotest server[stack::run_op|reboot|run_background] -> ssh_run(( sleep 1; reboot & sleep 10; reboot -f ) </dev/null >/dev/null 2>&1 & echo -n $!)

then the clock goes back 1 hour (I've never seen that)

2017-02-23T23:54:00.434981-08:00 INFO kernel: [    0.000000] Initializing cgroup subsys cpuset

and that's already too late.


Comment 3 by aut...@google.com, Mar 1 2017

Owner: jrbarnette@chromium.org
May be part of Richard's provision re-write
Re c#2, the questions seem not entirely relevant to this bug.
For more details on how to interpret the output from dut-status,
try this:
    vi $(which dut-status)

The docstring has answers to the questions.

Re c#3:  I'm strongly considering implementing the suggestion
from c#1.  That is, schedule a verify/reset task prior to
provisioning.  That would prevent provisioning from running
on an offline DUT in most cases.  Done right, it could also
alleviate problems with "Provision failures are recorded as
test failures".
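The verify-before-provision idea from c#1 can be sketched as a task-ordering rule: run a trimmed verify first, skipping only the checks that provisioning itself makes moot (e.g. a build-version check).  All names below are illustrative, not the real scheduler's API:

```python
# Illustrative check names; the real autotest scheduler models
# special tasks and verify checks differently.
ALL_VERIFY_CHECKS = ["ssh", "power", "fs_writable", "cros_version_label"]

# Checks whose result is moot once provisioning rewrites the image.
SKIPPABLE_BEFORE_PROVISION = {"cros_version_label"}

def special_tasks_for(job):
    """Return the ordered (task, checks) list to run for a job."""
    if job == "provision":
        checks = [c for c in ALL_VERIFY_CHECKS
                  if c not in SKIPPABLE_BEFORE_PROVISION]
        # Verify first, so an offline DUT is sent to repair instead of
        # failing mid-provision and recording a test failure.
        return [("verify", checks), ("provision", None)]
    return [("verify", ALL_VERIFY_CHECKS), (job, None)]
```

The design choice here is to keep provisioning itself unchanged and let the scheduler's ordering provide the guarantee, which is also what would let provision failures stop being recorded as test failures.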

Labels: -Pri-2 Pri-3
Status: ass (was: Available)
Status?
Components: -Infra>Client>ChromeOS Infra>Client>ChromeOS>Test
Status: Assigned (was: ass)
The requested behavior would be at least partly addressed as a
side-effect of this design proposal:
    https://docs.google.com/document/d/1Nr8qbqJlVm7eMeKaA6bgecRGsgfJfC1TaybpWTmDZvg/edit

See, specifically, the provisioning step entitled "Prepare the host".
