Project: chromium
Starred by 5 users
Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug

Blocked on:
issue 729099



guado_moblab-paladin: host did not return from reboot
Project Member Reported by nxia@chromium.org, Apr 21
https://luci-milo.appspot.com/buildbot/chromeos/guado_moblab-paladin/5647

https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/112820478-chromeos-test/chromeos2-row1-rack8-host1/

 04-15-2017 [02:23:17] Output below this line is for buildbot consumption:
@@@STEP_LINK@[Test-Logs]: provision: ABORT: Host did not return from reboot@http://localhost/tko/retrieve_logs.cgi?job=/results/3-moblab/@@@
Will return from run_suite with status: INFRA_FAILURE
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/test.py", line 817, in _call_test_function
    return func(*args, **dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 470, in execute
    dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 347, in _call_run_once_with_retry
    postprocess_profiled_run, args, dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 380, in _call_run_once
    self.run_once(*args, **dargs)
  File "/usr/local/autotest/server/site_tests/moblab_RunSuite/moblab_RunSuite.py", line 62, in run_once
    raise e
AutoservRunError: command execution error
* Command: 
    /usr/bin/ssh -a -x     -o StrictHostKeyChecking=no -o
    UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o
    ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4
    -o Protocol=2 -l root -p 22 chromeos2-row1-rack8-host1 "export
    LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger
    -tag \"autotest\" \"server[stack::_call_run_once|run_once|run_as_moblab]
    -> ssh_run(su - moblab -c '/usr/local/autotest/site_utils/run_suite.py
    --pool='' --board=cyan --build=cyan-release/R55-8872.67.0
    --suite_name=dummy_server')\";fi; su - moblab -c
    '/usr/local/autotest/site_utils/run_suite.py --pool='' --board=cyan
    --build=cyan-release
 
Status: Unconfirmed
Cc: -shuqianz@chromium.org -nxia@chromium.org pho...@chromium.org xixuan@chromium.org
Cc: akes...@chromium.org aaboagye@chromium.org grundler@chromium.org haddowk@chromium.org
Labels: -Pri-2 Pri-1
This error suddenly started happening a lot this week:

https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/14730
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/14731
https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/14740

Filed b/62086388 for lab to check the host. 

Did any recent changes make it hard for the DUT to come back from reboot?
+sheriff to check
Cc: -aaboagye@chromium.org jrbarnette@chromium.org
The logs collected from the test aren't too helpful. In the provision job they seem to indicate a reboot, but then the device doesn't come back.

Can we try and pull the logs from the DUT? Grabbing the firmware event log would be helpful as well.
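A rough sketch of how we might pull those from the moblab, assuming root ssh access to the sub-DUT and that mosys is present on the test image (the IP and destination paths below are illustrative, not from this failure):

-------------8<-------------
#!/usr/bin/env python3
# Sketch only: pull the firmware event log and /var/log from a sub-DUT,
# run from the moblab. Assumes root ssh access; IP and paths are placeholders.
import subprocess

DUT = 'root@192.168.231.102'
SSH_OPTS = ['-o', 'StrictHostKeyChecking=no', '-o', 'UserKnownHostsFile=/dev/null']

def ssh(cmd):
    """Run a command on the DUT and return its output."""
    return subprocess.check_output(['ssh'] + SSH_OPTS + [DUT, cmd], text=True)

# Firmware event log (mosys ships on Chrome OS test images).
print(ssh('mosys eventlog list'))

# Copy the system logs back for offline inspection.
subprocess.check_call(['scp'] + SSH_OPTS + ['-r', DUT + ':/var/log', '/tmp/dut_logs'])
-------------8<-------------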


-------------8<-------------
05/25 09:40:21.016 DEBUG|          ssh_host:0284| Running (ssh) '/tmp/stateful_update http://192.168.231.1:8080/static/cyan-release/R57-9202.66.0 --stateful_change=clean 2>&1'
05/25 09:40:21.122 DEBUG|             utils:0298| [stdout] Downloading stateful payload from http://192.168.231.1:8080/static/cyan-release/R57-9202.66.0/stateful.tgz
05/25 09:40:21.162 DEBUG|             utils:0298| [stdout]   HTTP/1.1 200 OK
05/25 09:40:21.162 DEBUG|             utils:0298| [stdout]   Content-Length: 296178920
05/25 09:40:21.162 DEBUG|             utils:0298| [stdout]   Accept-Ranges: bytes
05/25 09:40:21.162 DEBUG|             utils:0298| [stdout]   Server: CherryPy/3.2.2
05/25 09:40:21.162 DEBUG|             utils:0298| [stdout]   Last-Modified: Thu, 25 May 2017 16:38:35 GMT
05/25 09:40:21.162 DEBUG|             utils:0298| [stdout]   Date: Thu, 25 May 2017 16:40:21 GMT
05/25 09:40:21.163 DEBUG|             utils:0298| [stdout]   Content-Type: application/x-gtar-compressed
05/25 09:40:36.254 DEBUG|             utils:0298| [stdout] Downloading command returns code 0.
05/25 09:40:36.255 DEBUG|             utils:0298| [stdout] Successfully downloaded update.
05/25 09:40:36.256 DEBUG|             utils:0298| [stdout] Restoring state to factory_install with dev_image.
05/25 09:40:36.257 INFO |       autoupdater:0582| Update complete.
05/25 09:40:36.258 INFO |       autoupdater:0593| Update engine log has downloaded in sysinfo/update_engine dir. Check the lastest.
05/25 09:40:36.258 DEBUG|          ssh_host:0284| Running (ssh) 'cat /etc/lsb-release'
05/25 09:40:36.334 INFO |        server_job:0184| 		START	----	reboot	timestamp=1495730436	localtime=May 25 09:40:36	
05/25 09:40:36.336 INFO |        server_job:0184| 			GOOD	----	reboot.start	timestamp=1495730436	localtime=May 25 09:40:36	
05/25 09:40:36.338 DEBUG|          ssh_host:0284| Running (ssh) 'if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi'
05/25 09:40:36.387 DEBUG|             utils:0298| [stdout] 7e1f709d-3630-492d-ba58-0862ceca1e0d
05/25 09:40:36.428 DEBUG|          ssh_host:0284| Running (ssh) '( sleep 1; reboot & sleep 10; reboot -f ) </dev/null >/dev/null 2>&1 & echo -n $!'
05/25 09:40:36.520 DEBUG|             utils:0317| [stdout] 370
05/25 09:40:36.521 DEBUG|      abstract_ssh:0653| Host 192.168.231.102 pre-shutdown boot_id is 7e1f709d-3630-492d-ba58-0862ceca1e0d
05/25 09:40:36.521 DEBUG|          ssh_host:0284| Running (ssh) 'if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi'
05/25 09:40:36.579 DEBUG|             utils:0298| [stdout] 7e1f709d-3630-492d-ba58-0862ceca1e0d
05/25 09:40:37.626 DEBUG|          ssh_host:0284| Running (ssh) 'if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi'
05/25 09:40:37.663 DEBUG|             utils:0298| [stdout] 7e1f709d-3630-492d-ba58-0862ceca1e0d
05/25 09:40:38.710 DEBUG|          ssh_host:0284| Running (ssh) 'if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi'
05/25 09:42:37.215 WARNI|             utils:0931| run process timeout (118) fired on: /usr/bin/ssh -a -x    -o ControlPath=/tmp/_autotmp_oFgJCZssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 192.168.231.102 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::wait_for_restart|wait_down|get_boot_id] -> ssh_run(if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi)\";fi; if [ -f '/proc/sys/kernel/random/boot_id' ]; then cat '/proc/sys/kernel/random/boot_id'; else echo 'no boot_id available'; fi"
05/25 09:42:38.220 DEBUG|      abstract_ssh:0674| Host 192.168.231.102 is now unreachable over ssh, is down
05/25 09:42:38.221 DEBUG|          ssh_host:0284| Running (ssh) 'true'
05/25 09:50:38.359 WARNI|             utils:0931| run process timeout (480) fired on: /usr/bin/ssh -a -x    -o ControlPath=/tmp/_autotmp_oFgJCZssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 192.168.231.102 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::wait_up|is_up|ssh_ping] -> ssh_run(true)\";fi; true"
05/25 09:50:39.367 DEBUG|          ssh_host:0212| retrying ssh command after timeout
05/25 09:56:16.562 ERROR|             utils:0298| [stderr] mux_client_request_session: read from master failed: Broken pipe
05/25 09:56:27.843 ERROR|             utils:0298| [stderr] ssh: connect to host 192.168.231.102 port 22: No route to host
05/25 09:56:27.844 DEBUG|          ssh_host:0218| retry 2: restarting master connection
05/25 09:56:27.845 DEBUG|      abstract_ssh:0744| Restarting master ssh connection
05/25 09:56:27.845 DEBUG|      abstract_ssh:0756| Nuking master_ssh_job.
05/25 09:56:27.845 DEBUG|      abstract_ssh:0762| Cleaning master_ssh_tempdir.
05/25 09:56:27.845 INFO |      abstract_ssh:0809| Starting master ssh connection '/usr/bin/ssh -a -x   -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_WbWgbAssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 192.168.231.102'
05/25 09:56:27.846 DEBUG|             utils:0203| Running '/usr/bin/ssh -a -x   -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_WbWgbAssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 192.168.231.102'
05/25 09:56:58.003 INFO |      abstract_ssh:0824| Timed out waiting for master-ssh connection to be established.
05/25 09:57:09.927 ERROR|             utils:0298| [stderr] ssh: connect to host 192.168.231.102 port 22: No route to host
05/25 09:57:10.939 DEBUG|      abstract_ssh:0599| Host 192.168.231.102 is still down after waiting 872 seconds
05/25 09:57:10.940 INFO |        server_job:0184| 			ABORT	----	reboot.verify	timestamp=1495731430	localtime=May 25 09:57:10	Host did not return from reboot
05/25 09:57:10.941 INFO |        server_job:0184| 		END FAIL	----	reboot	timestamp=1495731430	localtime=May 25 09:57:10	Host did not return from reboot
  Traceback (most recent call last):
    File "/usr/local/autotest/server/server_job.py", line 938, in run_op
      op_func()
    File "/usr/local/autotest/server/hosts/remote.py", line 160, in reboot
      **dargs)
    File "/usr/local/autotest/server/hosts/remote.py", line 229, in wait_for_restart
      self.log_op(self.OP_REBOOT, op_func)
    File "/usr/local/autotest/client/common_lib/hosts/base_classes.py", line 548, in log_op
      op_func()
    File "/usr/local/autotest/server/hosts/remote.py", line 228, in op_func
      super(RemoteHost, self).wait_for_restart(timeout=timeout, **dargs)
    File "/usr/local/autotest/client/common_lib/hosts/base_classes.py", line 309, in wait_for_restart
      raise error.AutoservRebootError("Host did not return from reboot")
  AutoservRebootError: Host did not return from reboot
-------------8<-------------
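For reference, here is a condensed sketch of the check the provision log above is performing: record the boot_id, trigger the reboot, wait for the boot_id to change (or ssh to stop answering), then wait for ssh to answer again. This mirrors the wait_for_restart flow in the traceback rather than reproducing the actual autotest code; the ssh helper and timeouts are placeholders.

-------------8<-------------
# Sketch of the reboot verification the log above walks through.
# Not the real autotest implementation; run_on_dut() stands in for host.run().
import subprocess
import time

BOOT_ID_CMD = "cat /proc/sys/kernel/random/boot_id"

def run_on_dut(dut, cmd, timeout=30):
    return subprocess.check_output(
        ['ssh', '-o', 'ConnectTimeout=30', 'root@%s' % dut, cmd],
        timeout=timeout).strip()

def wait_for_restart(dut, old_boot_id, down_timeout=120, up_timeout=480):
    # 1. Wait for the host to go down: boot_id changes or ssh stops answering.
    deadline = time.time() + down_timeout
    while time.time() < deadline:
        try:
            if run_on_dut(dut, BOOT_ID_CMD) != old_boot_id:
                break                # already rebooted into a new boot_id
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
            break                    # ssh unreachable == host is down
        time.sleep(1)
    # 2. Wait for the host to come back and answer a trivial command.
    deadline = time.time() + up_timeout
    while time.time() < deadline:
        try:
            run_on_dut(dut, 'true')
            return                   # host returned from reboot
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
            time.sleep(5)
    raise RuntimeError('Host did not return from reboot')
-------------8<-------------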

Cc: aaboagye@chromium.org
And by DUT, I mean the cyan connected to the moblab.
If we can't identify a believable suspect CL, then we should mark this paladin as experimental temporarily while we investigate.

I don't love doing that for guado_moblab, but test_push will help protect us from bad code reaching the lab.
re: c#8 

You should be able to just ssh into the machine by IP or chromeosX-rowX-rackX-hostX.cros
I guess I could hop on the moblab and then try SSH'ing to the local IP from there. Is there a servo attached to the moblab? How do you power-cycle the DUT?
Will hold off for a moment on marking guado_moblab as experimental.

My partial observation so far:

For all the failures I checked, the moblab DUTs use 192.168.231.101 & .102, and .102 cannot return from reboot.
I checked one successful moblab; it uses 192.168.231.100 & .101.

I think it's related to the failure.
Cc: dshi@chromium.org sbasi@chromium.org
+moblab-knowledgeable people:

Longer-term question: how can we make guado_moblab-paladin more robust against failures in its sub-DUTs? I feel it should be possible to automatically classify some of these failures (within moblab_quick, say) as "not related to moblab code" and thus still allow the paladin to pass.
It's a tough issue.

You could:
* Raise the number of DUTs for more reliability.
* Support servos on this setup so that moblab can do a servo repair in this case.
* Better tooling to monitor the state of moblab and its DUTs in the lab.

I don't know if you want to just classify it as "not related to moblab code", because then the DUTs could go down and you'd start green-lighting all testing that goes through the moblab when, say, a CL could actually have broken provision.
I am currently working on the second point.

I am trying to get more moblab devices installed in the lab so the CQ/bvt pool can be bigger; however, this is an uphill battle.

How about more retries on the jobs in the moblab test suite? If one DUT goes repair-failed, the other DUT should be able to re-run the test.
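A minimal sketch of what that could look like, assuming the standard autotest control-file JOB_RETRIES attribute; the test and suite names are only illustrative, and whether a retry actually lands on the other DUT depends on which hosts the scheduler still considers usable:

-------------8<-------------
# Hypothetical server-side control file for a test in the moblab-run suite.
# JOB_RETRIES asks the scheduler to re-queue the job if it fails; this is a
# sketch, not the actual moblab_quick / dummy_server suite configuration.
AUTHOR = "chromeos-infra"
NAME = "dummy_PassServer"
TIME = "SHORT"
TEST_CATEGORY = "General"
TEST_CLASS = "dummy"
TEST_TYPE = "server"
JOB_RETRIES = 2  # assumption: allow up to two re-queues on failure
ATTRIBUTES = "suite:dummy_server"

DOC = """
Trivial server-side test, shown here only to illustrate JOB_RETRIES.
"""

def run(machine):
    host = hosts.create_host(machine)
    job.run_test('dummy_PassServer', host=host)

parallel_simple(run, machines)
-------------8<-------------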
What I see wrong here is that we have to "run" moblab to prove that Chrome OS support for moblab still works. Why can't we just test the components provided by Chrome OS that moblab uses, without running moblab and trying to parse its output?

I guess a "separate CQ" is sort of what I'm advocating for the moblab application, similar to the way Chrome has its own CQ. Moblab could then run through that CQ to make sure it continues to work correctly (more like unit tests).
It's failing again. I'm marking it as experimental.
Project Member Comment 20 by bugdroid1@chromium.org, May 25
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/3dcf6413c97b4862e3157cc3e96e8492cd25fa92

commit 3dcf6413c97b4862e3157cc3e96e8492cd25fa92
Author: Aviv Keshet <akeshet@chromium.org>
Date: Thu May 25 22:18:57 2017

chromeos_config: mark guado_moblab-paladin experimental

BUG=chromium:714330
TEST=None

Change-Id: I7954ee27972fa516f7911294878040c48401f80e
Reviewed-on: https://chromium-review.googlesource.com/516544
Reviewed-by: Aseda Aboagye <aaboagye@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>

[modify] https://crrev.com/3dcf6413c97b4862e3157cc3e96e8492cd25fa92/cbuildbot/config_dump.json
[modify] https://crrev.com/3dcf6413c97b4862e3157cc3e96e8492cd25fa92/cbuildbot/chromeos_config.py

A revert of that is prepared for when the bot goes green again: https://chromium-review.googlesource.com/516603

The last failure is that 192.168.231.100 did not return from reboot. I logged into the host to check; it currently has 101 & 101, two working cyan DUTs, just no 100.

My theory for the moblab failure is that cyan DUTs sometimes change their internal IPs after they get updated and reboot, e.g. 192.168.231.100 -> 192.168.231.102, so the host is not reachable.

I don't know if the above case is possible. Is there any reasoning behind giving 192.168.231.100-120 (21 IPs) to only two DUTs?


I was wondering if the DUT IPs are static or dynamic too. I'm assuming that moblab runs a DHCP server but I don't know if the DUTs will get the same IPs across moblab_RunSuite runs.

Also, if the DHCP server is up and running, shouldn't it assign the same IP address to the same MAC?
Correcting my #22: the host currently has 192.168.231.101 & 192.168.231.102, no 192.168.231.100.
If that is the case, there is a bug in the DHCP server config; the same DUT should get the same IP address when it boots back up, based on its MAC address.

The only other way I can think of that what you are seeing could happen is if the DUT had two USB dongles plugged in, or the MAC address of the USB dongle is changing.

I am going to see if I can have the moblab retry the failing test on the other good DUT; at the cost of some time, it would conceivably have a better success rate.

In general we have issues with the USB dongles, but I am not seeing the problem you describe being reported by partners.
It looks like the guado_moblab-paladin failed again. It seems that .102 didn't come back up. I ssh'd into the DUT and was checking for DHCP assigned addresses.

-----------8<--------------
localhost ~ # cat /var/log/messages | grep -i dhcp | grep -e 'DHCPACK\|OFFER'                                                                                        
2017-05-25T23:43:11.074830+00:00 INFO dhcpd[2275]: DHCPOFFER on 192.168.231.100 to 80:3f:5d:08:8b:b9 via lxcbr0
2017-05-25T23:43:11.081729+00:00 INFO dhcpd[2275]: DHCPACK on 192.168.231.100 to 80:3f:5d:08:8b:b9 via lxcbr0
2017-05-25T23:43:11.410906+00:00 INFO dhcpd[2275]: DHCPOFFER on 192.168.231.101 to 80:3f:5d:08:0f:73 via lxcbr0
2017-05-25T23:43:11.415833+00:00 INFO dhcpd[2275]: DHCPACK on 192.168.231.101 to 80:3f:5d:08:0f:73 via lxcbr0
2017-05-25T23:46:50.418732+00:00 INFO dhcpd[2275]: DHCPOFFER on 192.168.231.176 to 00:16:3e:9f:d1:1e (test_192_168_231_101) via lxcbr0
2017-05-25T23:46:50.424348+00:00 INFO dhcpd[2275]: DHCPACK on 192.168.231.176 to 00:16:3e:9f:d1:1e (test_192_168_231_101) via lxcbr0
-----------8<--------------

Hmm, I don't see .102 there. In fact the only mention is here:

-----------8<--------------
localhost ~ # cat /var/log/messages | grep '192.168.231.102'                                                                                                         
2017-05-25T23:37:41.893311+00:00 NOTICE ag[5351]: autotest server[stack::find_and_add_duts|add_dut|run_as_moblab] -> ssh_run(su - moblab -c '/usr/local/autotest/cli/atest host create 192.168.231.102')
-----------8<--------------
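One way to check the same-MAC-different-IP theory against these logs is to fold the DHCPACK lines into a per-MAC IP history. A small sketch, using the /var/log/messages format shown above (everything else is illustrative):

-----------8<--------------
# Sketch: map each MAC address to the set of IPs dhcpd has ACKed to it,
# using the log format shown above. If a DUT dongle's MAC shows up with
# more than one IP, the "IP migrated after reboot" theory gains weight.
import re
from collections import defaultdict

ACK_RE = re.compile(r'DHCPACK on (\S+) to ([0-9a-f:]{17})')

def mac_to_ips(log_path='/var/log/messages'):
    history = defaultdict(set)
    with open(log_path) as log:
        for line in log:
            match = ACK_RE.search(line)
            if match:
                ip, mac = match.groups()
                history[mac].add(ip)
    return history

if __name__ == '__main__':
    for mac, ips in sorted(mac_to_ips().items()):
        flag = '  <-- migrated' if len(ips) > 1 else ''
        print('%s: %s%s' % (mac, ', '.join(sorted(ips)), flag))
-----------8<--------------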

When it searches for the devices it finds 101 and 102. This is very unusual; in all my tests with moblab I have never seen it not use 100 for the first DUT.

05/25 16:40:45.040 DEBUG|          ssh_host:0284| Running (ssh) 'fping -g 192.168.231.100 192.168.231.120'
05/25 16:40:45.708 DEBUG|             utils:0298| [stdout] 192.168.231.101 is alive
05/25 16:40:45.733 DEBUG|             utils:0298| [stdout] 192.168.231.102 is alive
Last night I updated the firmware on the cyan DUT connected to the moblab chromeos2-row2-rack8-host1 - I will monitor that moblab and see if it improves the suite run success rate.
Project Member Comment 29 by bugdroid1@chromium.org, May 26
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/35a5ca380aa854bef5044bde2aadc3ca27aad28c

commit 35a5ca380aa854bef5044bde2aadc3ca27aad28c
Author: Aviv Keshet <akeshet@chromium.org>
Date: Fri May 26 21:35:34 2017

Revert "chromeos_config: mark guado_moblab-paladin experimental"

This reverts commit 3dcf6413c97b4862e3157cc3e96e8492cd25fa92.

BUG=chromium:714330
TEST=None

Change-Id: I85b93583bc52cd4daf64dc6c01dda51c66dbf966
Reviewed-on: https://chromium-review.googlesource.com/516603
Commit-Ready: Aviv Keshet <akeshet@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>
Reviewed-by: Aseda Aboagye <aaboagye@chromium.org>

[modify] https://crrev.com/35a5ca380aa854bef5044bde2aadc3ca27aad28c/cbuildbot/config_dump.json
[modify] https://crrev.com/35a5ca380aa854bef5044bde2aadc3ca27aad28c/cbuildbot/chromeos_config.py

Owner: haddowk@chromium.org
This is happening again. https://uberchromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin/builds/6106

and several other occurrences.

OK, I will look again in the morning. Clearly the test code that detects and sets up the DUTs is faulty; look at this screenshot, which is based on information from the DHCP server.

https://screenshot.googleplex.com/mXAwoKp7Wv8

102 should never be configured as a device.

ssh_host:0284| Running (ssh) 'fping -g 192.168.231.100 192.168.231.120'
05/31 17:17:56.988 DEBUG|             utils:0298| [stdout] 192.168.231.101 is alive
05/31 17:17:57.013 DEBUG|             utils:0298| [stdout] 192.168.231.102 is alive
05/31 17:17:59.401 DEBUG|             utils:0298| [stderr] ICMP Host Unreachable from 192.168.231.1 for ICMP Echo sent to 192.168.231.100
05/31 17:17:59.401 DEBUG|             utils:0298| [stderr] ICMP Host Unreachable from 192.168.231.1 for ICMP Echo sent to 192.168.231.100
05/31 17:17:59.401 DEBUG|             utils:0298| [stderr] ICMP Host Unreachable from 192.168.231.1 for ICMP Echo sent to 192.168.231.100

I am not familiar with fping but it seems to be returning the wrong information.
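If a single ICMP echo per address is unreliable here, one mitigation is to require a few attempts before trusting the result. Below is a sketch using fping's -a (print alive targets) and -r (retry) flags over the same range as in the logs; the parsing and defaults are assumptions, not the actual DUT-detection code:

-----------8<--------------
# Sketch: discover live sub-DUT addresses with more than one ping attempt
# instead of a single sweep. The address range matches the one in the logs.
import subprocess

def alive_duts(first='192.168.231.100', last='192.168.231.120', retries=3):
    # fping exits non-zero when some targets in the range are unreachable,
    # so don't treat a non-zero exit status as a failure; just read stdout.
    proc = subprocess.run(
        ['fping', '-a', '-r', str(retries), '-g', first, last],
        capture_output=True, text=True)
    return proc.stdout.split()

if __name__ == '__main__':
    print(alive_duts())
-----------8<--------------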
It's time to temporarily mark this as experimental until it's fixed.
Project Member Comment 35 by bugdroid1@chromium.org, Jun 2
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/3d3abe53436377423f5b596207cbe68b651bf533

commit 3d3abe53436377423f5b596207cbe68b651bf533
Author: Aviv Keshet <akeshet@chromium.org>
Date: Fri Jun 02 02:55:14 2017

chromeos_config: mark guado_moblab-paladin experimental

BUG=chromium:714330
TEST=None

Change-Id: I1efae66bafc266ae3b4ea3d56312b22f4e6e9ff0
Reviewed-on: https://chromium-review.googlesource.com/522229
Tested-by: Aviv Keshet <akeshet@chromium.org>
Reviewed-by: Paul Hobbs <phobbs@google.com>

[modify] https://crrev.com/3d3abe53436377423f5b596207cbe68b651bf533/cbuildbot/config_dump.json
[modify] https://crrev.com/3d3abe53436377423f5b596207cbe68b651bf533/cbuildbot/chromeos_config.py

I looked at the failed provision job logs (from inside moblab) from builds 6104, 6105, 6110.
All of them fail on chromeos2-row2-rack8-host1


- The IP of the sub-DUT that fails provision keeps migrating: 192.168.231.100, .101, and .102 have each, at some point, failed to come back up (no ping) after provision for a while.

Come next build, both DUTs are back in action.

My guess is that one of the cyan sub-DUTs has some issue (most likely disk corruption) that makes it take forever to come back from reboot.
Meanwhile, the IP migrates because moblab's DHCP server assigns the IPs, and they can/will migrate between DUTs arbitrarily, especially when one of them may not come up immediately.

I've locked the bad moblab DUT. I'm waiting for the cyan sub-DUT to come back alive (I trust it will) so that I can poke around.

Things that make this analysis hard:
- The logs from inside moblab are gzipped up, so looking at them involves download + gunzip + poking around.
- The sub-DUTs' IPs can migrate since moblab assigns the IPs.
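For the download + gunzip step, a throwaway helper along these lines can save some time. The bucket and job-path pattern come from the links earlier in this bug; gsutil must be installed and authenticated, and the exact layout under the results directory is an assumption:

-----------8<--------------
# Sketch: mirror a moblab test-results directory from the
# chromeos-autotest-results bucket and decompress any gzipped logs in it.
# The job path below is illustrative; substitute the real results path.
import subprocess

def fetch_results(job_path, dest='/tmp/moblab_results'):
    src = 'gs://chromeos-autotest-results/%s' % job_path
    subprocess.check_call(['gsutil', '-m', 'cp', '-r', src, dest])
    # Decompress every .gz in place so the logs are grep-able.
    subprocess.check_call(
        'find %s -name "*.gz" -exec gunzip {} +' % dest, shell=True)

if __name__ == '__main__':
    fetch_results('112820478-chromeos-test/chromeos2-row1-rack8-host1')
-----------8<--------------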
I am going to the lab tomorrow for another reason - if you need me to inspect devices then let me know and I can pull logs.

I might make a change to the setup of one moblab in the lab. To save the cost of a network switch, the lab assigns 4 ports of the rack switch to a mini network and connects the DUTs and the moblab via that. I might connect one up with a Netgear switch like I do in my own lab, since I do not see the problem we see here. If I did that for the two devices in the CQ, I assume that would be the best test?
That cyan sub-DUT never came back. I'm leaving chromeos2-row2-rack8-host1 locked for the night so that it doesn't cause more havoc.
I will take two cyan DUTs with me and swap out the DUTs connected to that device.
Keith,
putting in an unmanaged switch to replace the "minilan" sounds like a good idea. The managed (rack) switches sometimes have rules about bouncing links that result in the port getting shut down for a while. In this case, I suspect it's something else, but eliminating this source of issues would be good.
FYI: I've unlocked chromeos2-row2-rack8-host1.
Its sub-DUT (cyan) never came back overnight. I think it'll no longer kill the CQ (since now it's simply down to 1 DUT).

In the meantime, it looks like something decided not to respect my lock overnight (a separate bug?) and this very DUT failed provision twice.
I've filed a bug around that: https://bugs.chromium.org/p/chromium/issues/detail?id=729083

haddowk@ is going to swap out the cyan DUT when he goes to lab next.
FYI: I had found another moblab setup (this time bvt pool) in the lab in this condition. b/62260569 is where that was recovered.
I'll pore over the attached logs there.
Labels: -Pri-0 Pri-1
This is not currently killing the CQ because the DUT that was at fault (chromeos2-row2-rack8-host1) has lost its bad sub-DUT entirely.
Re-up to Pri-0 if this kills even a single CQ run.

Re #41: The bad sub-DUT in that case was a samus. The point remains that we didn't realize a sub-DUT had been dead for > 10 days, and only found out accidentally (because I happened to look at it).
https://b.corp.google.com/issues/62260569#comment4
Blockedon: 729099
Besides #43 (I have a CL for that), here's my take on the AIs here:

short-term:
[1] swap the bad cyan DUT on that moblab (haddowk@)

medium-term:
[2] audit our (small number of) moblabs in the lab to make sure that the sub-DUTs have a recent enough firmware on them. (chromeos-infra@ (I'll find an owner))
[3] use a separate network switch between the moblab and its sub-DUTs (haddowk@ to work with the englab-sys-cros. Please keep chromeos-infra@ in the loop / ask for help where needed. It'd be great if you can set one up as an experiment similar to what the peng does. We can then get it deployed elsewhere with chromeos-infra@ input)

longer-term
[4] moblab really, really needs servo support -- both the moblab DUT needs to be recoverable via servo from the lab (is this the case today?) and it should be able to recover its sub-DUTs via servo. I have a feeling there is an open bug about this somewhere, but I don't expect to be involved in that myself. (It's closer to an OKR-level bug and I'm not looking to pick it up atm.)
Update:

I swapped out the DUTs on chromeos2-row2-rack8-host1.

A Netgear unmanaged hub has been placed on chromeos2-row1-rack8-host1; we should see if this makes things better or worse.

There is still a lot of lab work to do - the labels on the shelves do not match the hostnames.

There are 2 devices that are working but somehow not connected to the lab network.
Comment 47 Deleted
A similar failure was found at 
https://chromegw.corp.google.com/i/chromeos/builders/guado_moblab-paladin/builds/6162
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/121606795-chromeos-test/chromeos2-row2-rack8-host11/.


@@@STEP_LINK@[Test-Logs]: provision: ABORT: Host did not return from reboot@http://localhost/tko/retrieve_logs.cgi?job=/results/2-moblab/@@@
Will return from run_suite with status: INFRA_FAILURE
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/test.py", line 818, in _call_test_function
    return func(*args, **dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 471, in execute
    dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 348, in _call_run_once_with_retry
    postprocess_profiled_run, args, dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 381, in _call_run_once
    self.run_once(*args, **dargs)
  File "/usr/local/autotest/server/site_tests/moblab_RunSuite/moblab_RunSuite.py", line 62, in run_once
    raise e
AutoservRunError: command execution error
* Command: 
    /usr/bin/ssh -a -x     -o StrictHostKeyChecking=no -o
    UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o
    ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4
    -o Protocol=2 -l root -p 22 chromeos2-row2-rack8-host11 "export
    LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger
    -tag \"autotest\" \"server[stack::_call_run_once|run_once|run_as_moblab]
    -> ssh_run(su - moblab -c '/usr/local/autotest/site_utils/run_suite.py
    --pool='' --board=cyan --build=cyan-release/R57-9202.66.0
    --suite_name=dummy_server')\";fi; su - moblab -c
    '/usr/local/autotest/site_utils/run_suite.py --pool='' --board=cyan
    --build=cyan-release/R57-9202.66.0 --suite_name=dummy_server'"
Exit status: 3
Duration: 2257.44684005
Re #48: This was a different moblab DUT than the one we've had trouble with so far: chromeos2-row2-rack8-host11.
My comment on Aviv's CL that was trying to mark guado_moblab-paladin important again (hence the failure in #48 killed the CQ):
----------------

I don't think guado_moblab is ready for prime time.

The latest failure: http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=121606788 is again one of the sub-DUTs failing provision for no good reason.

Yes, retrying the test on the other DUT internally would have helped in this case (that CL is in flight). But the sub-DUTs are failing provision too often for guado_moblab to deserve CQ status atm.


-----------------
That said, it definitely looks like the sub-DUTs are dying far too often for even retries to help.
Did we end up with a bad cyan stable build? Why are these sub-DUTs failing provision so often?
In the particular case of #48, the sub-DUT has gone to repair failed state. So it really did die.
Filed b/62347855 to get some information about the failed sub-DUT.
Labels: Chase-Pending
+Chase-Pending
Justification:
- This was responsible for nearly half of the failures last Wednesday/Thursday (two days of utter CQ redness). Besides that, it's been failing frequently enough to kill CQ runs arbitrarily.
- We have very little redundancy in moblab, and very little automated notification / recovery.
- The intention of adding this to the list is to 
  - measure the current rate of failure of guado_moblab due to flakiness
  - find out / implement the short-term mitigation required
  - demonstrate improved failure rate.

Caveat: This bug is a bit open-ended; perhaps we should file a sub-bug to add to Chase that is specifically about improving the moblab failure rate within the next few weeks.
Labels: -Chase-Pending
I have a setup at my desk that emulates a lab install (moblab, 2 DUTs, etc.).

I have some changes in flight (test retries, fping with more than one packet, and replacing arp with ip n).

Any suggestions on the best way to simulate the typical CQ load from my desktop?
Not sure what you mean by typical CQ load, but the suite we run from the CQ is "moblab_quick".
Yeah, I am running that, but since we flash a new build to the moblab each time in the lab, just running the test over and over is not the same as running in the lab.

Currently I am using test_that 100.107.3.16 suite:moblab_quick but really that is not a good test since it does not set up the moblab each time.

Ideally I would find a way to provision the moblab and run the suite like the lab does.
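One way to approximate that from a desk setup might be to re-flash the moblab before each suite run and then invoke test_that, roughly as sketched below. The xbuddy image spec is a placeholder and the IP is the one from the comment above; this is not how the lab's builder actually drives it.

-----------8<--------------
# Sketch: roughly mimic what the lab does for each run -- put a fresh build
# on the moblab, then run the moblab_quick suite against it.
import subprocess

MOBLAB_IP = '100.107.3.16'
IMAGE = 'xbuddy://remote/guado_moblab/latest'  # assumption: adjust to the build under test

def flash_and_run():
    # Provision the moblab itself, analogous to the lab's provision step
    # before moblab_RunSuite is kicked off.
    subprocess.check_call(['cros', 'flash', 'ssh://%s' % MOBLAB_IP, IMAGE])
    # Then run the same suite the CQ runs.
    subprocess.check_call(['test_that', MOBLAB_IP, 'suite:moblab_quick'])

if __name__ == '__main__':
    flash_and_run()
-----------8<--------------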
Project Member Comment 56 by bugdroid1@chromium.org, Jun 8
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/84563fb82cc0fcd37fec6b9dc0ab1963cc15835a

commit 84563fb82cc0fcd37fec6b9dc0ab1963cc15835a
Author: Keith Haddow <haddowk@chromium.org>
Date: Thu Jun 08 02:23:59 2017

[autotest] Mark moblab paladin as not important.

The lab is not yet stable, moblab should not break the CQ.

TEST=None
BUG=chromium:714330

Change-Id: I1f8e8b75db9f375b285345f748e8198d2c5c9bc3
Reviewed-on: https://chromium-review.googlesource.com/527525
Reviewed-by: Keith Haddow <haddowk@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
Commit-Queue: Keith Haddow <haddowk@chromium.org>
Tested-by: Keith Haddow <haddowk@chromium.org>

[modify] https://crrev.com/84563fb82cc0fcd37fec6b9dc0ab1963cc15835a/cbuildbot/config_dump.json
[modify] https://crrev.com/84563fb82cc0fcd37fec6b9dc0ab1963cc15835a/cbuildbot/chromeos_config.py

Labels: Proj-Moblab
Cc: gwendal@chromium.org
Issue 598503 has been merged into this issue.
b/62347855 has logs from another sub-DUT that had died in this fashion and then magically recovered.
To give an update.

I have fixed all moblab DUTs that were incorrectly configured in cautotest, which should make the pool of devices available for testing larger.

I am testing the cyan sub-DUTs regularly and going over to the lab and rebooting when required. So far 5 DUTs have failed; in 3 of those cases the DUT is working but no network connection is indicated on the display. I filed https://bugs.chromium.org/p/chromium/issues/detail?id=733425

The other 2 failures were the DUT simply being non-responsive and requiring a long press of the power button to recover. There is work in progress to get a budget for the DUTs we use for moblab testing (partners should not pay for testing our product), and the hope is we can switch the cyans out for what seem to be much more reliable electro devices.

I have managed to get a labstation working inside the moblab subnet and connected to servos. There is still a lot of coding to get the detection and setup working correctly in moblab_host.py, but work is ongoing. Most likely I will be able to just run a servod process on the moblab rather than a separate labstation, but that is not yet tested and will be subject to a 3-DUT limit (so not suitable for partners).

I have been pulled off to work on CTS. If there is anything very urgent I need to attend to on this issue, let me know; the paladin failure rate is way down.
Cc: dgarr...@chromium.org fdeng@chromium.org ayatane@chromium.org chingcodes@chromium.org
Issue 577888 has been merged into this issue.
Another update.

I added 3 new moblabs, each with 2 cyan DUTs, to the lab.
I have factory-reset the cyans in the lab.
I monitor and regularly fix the issues with the cyans when they lose network.

The next step is to replace the cyans with electros, add servos to the electros, and get servod running on moblab by default; that is not likely to happen for a month or two due to availability of electro devices.
Project Member Comment 63 by bugdroid1@chromium.org, Jul 18
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/434b6fc9fcc4c5ddab573fcd5939fecadf362c50

commit 434b6fc9fcc4c5ddab573fcd5939fecadf362c50
Author: Aviv Keshet <akeshet@chromium.org>
Date: Tue Jul 18 01:32:40 2017

chromeos_config: mark guado_moblab as important

BUG=chromium:714330,  chromium:743100 
TEST=None

Change-Id: Icb270cd21d19471cee266fbb40ba4725834ade3f
Reviewed-on: https://chromium-review.googlesource.com/572084
Commit-Ready: Aviv Keshet <akeshet@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>
Reviewed-by: David Riley <davidriley@chromium.org>

[modify] https://crrev.com/434b6fc9fcc4c5ddab573fcd5939fecadf362c50/cbuildbot/config_dump.json
[modify] https://crrev.com/434b6fc9fcc4c5ddab573fcd5939fecadf362c50/cbuildbot/chromeos_config.py
