improve qemu/VMTest reboot in case of bad failures
Issue description
[Notice: Normally rebooting from the inside of the qemu instance brings it back after a few seconds as expected. This is about a hang on shutdown or an instance completely gone bad.]
When a CTS test goes bad inside a VM (say Chrome logs in, but Android doesn't come up fully) and the test tries to recover the device by rebooting, the ssh connection often does not come back.
19:11:51 INFO | autoserv| Will raise error TestFail('Error: Failed to set up adb connection',) due to unexpected return: False
19:11:51 INFO | autoserv| Skipping reboot, restarting browser.
19:13:51 INFO | autoserv| run process timeout (120) fired on: /usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_JPgcizssh-master/socket -o Protocol=2 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -l root -p 9222 localhost "export LIBC_FATAL_STDERR_=1; stop ui&& find /tmp/ -mindepth 1 -delete && start ui"
19:13:52 INFO | autoserv| Restarting browser has failed.
19:13:52 INFO | autoserv| Will reboot DUT when Chrome stops.
19:13:52 INFO | autoserv| Rebooting...
[Nothing]
Running a new test afterwards leads to:
19:21:06 INFO | autoserv| Timed out waiting for master-ssh connection to be established.
19:21:16 INFO | autoserv| run process timeout (10) fired on: /usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_otWafTssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/tmp/tmpf349Km -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=300 -o ServerAliveCountMax=3 -o ConnectionAttempts=1 -l root -p 9222 localhost " if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::_detect_host|check_host|run] -> ssh_run(test -f /mnt/stateful_partition/.android_tester)\";fi; test -f /mnt/stateful_partition/.android_tester"
One solution would be to nuke qemu and restart it:
sudo pkill -f qemu
./bin/cros_start_vm --no_graphics --image_path=../build/images/${BOARD}/latest/chromiumos_qemu_image.bin
I think that is outside of scope for tradefed_test.py though. The reboot code should detect the situation and handle this transparently.
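A rough sketch of what "detect and handle transparently" could look like, written as standalone Python rather than as actual autotest code. It issues the normal reboot, polls the forwarded ssh port for an SSH identification string, and only falls back to a hard restart of the VM if the guest never answers. The names `ssh_banner_seen`, `wait_for_ssh`, and `reboot_with_fallback`, the `reboot`/`hard_restart` callbacks, and the timeout values are all hypothetical and do not come from autotest.

```python
import socket
import time


def ssh_banner_seen(host, port, timeout=5.0):
    """Return True if host:port sends an SSH identification string.

    A plain TCP connect is not enough when the port is forwarded by the
    emulator on the host side, so we wait for the "SSH-" banner that an
    SSH server sends after the connection is accepted.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            return sock.recv(64).startswith(b"SSH-")
    except OSError:  # refused, reset, or timed out
        return False


def wait_for_ssh(host, port, deadline_s, poll_s=2.0, probe=ssh_banner_seen):
    """Poll probe(host, port) until it succeeds or deadline_s has elapsed."""
    end = time.monotonic() + deadline_s
    while True:
        if probe(host, port):
            return True
        if time.monotonic() >= end:
            return False
        time.sleep(poll_s)


def reboot_with_fallback(reboot, hard_restart, host="localhost", port=9222,
                         deadline_s=120.0, poll_s=2.0, probe=ssh_banner_seen):
    """Try a normal reboot first; hard-restart the VM if ssh stays dead.

    reboot and hard_restart are caller-supplied callables (hypothetical
    hooks, e.g. "reboot over ssh" and "kill qemu and start it again").
    """
    reboot()
    if wait_for_ssh(host, port, deadline_s, poll_s, probe):
        return "rebooted"
    hard_restart()
    if wait_for_ssh(host, port, deadline_s, poll_s, probe):
        return "restarted"
    raise RuntimeError("no ssh on %s:%d even after hard restart" % (host, port))
```

This sketch does not wait for the guest to go down before polling, so a guest that ignores the reboot request but still answers ssh would be reported as "rebooted". A real implementation would need a down-then-up check as well.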
Jan 3 2018
> When a CTS test goes bad (say Chrome logs in, but Android doesn't come up fully) inside of a VM and the test tries to recover the device by rebooting, the ssh connection often does not come back.
This doesn't sound normal. Is there a way to pull /var/log/net.log and messages from an affected system, or reproduce the issue on demand?
Mar 30 2018
Comment 1 by pwang@chromium.org, Jan 3 2018
One problem with nuking qemu and restarting it is that the builder often puts the VM image in an arbitrary folder, say /tmp/cbuildbot-tmplRubb7/chromiumos_qemu_disk.bin.0E997a, rather than build/images/${BOARD}/latest. cc cernekee, as he may know more about the QEMU network issue.
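One possible way around the arbitrary image location, sketched under assumptions: before killing qemu, read the image path back from the running process's own command line and reuse it for the restart. The sketch assumes the VM was started with the image passed as `-drive file=...` or `-hda ...` (both are standard QEMU options, but whether the builder uses them is not confirmed here) and that `/proc/<pid>/cmdline` is available (Linux only). The function names are hypothetical.

```python
def image_path_from_qemu_argv(argv):
    """Return the disk image path found in a qemu argv list, or None.

    Only handles "-hda PATH" and "-drive file=PATH,...". Paths containing
    commas (escaped as ",," in -drive) are not handled by this sketch.
    """
    for flag, value in zip(argv, argv[1:]):
        if flag == "-hda":
            return value
        if flag == "-drive":
            for field in value.split(","):
                key, _, val = field.partition("=")
                if key == "file" and val:
                    return val
    return None


def read_cmdline(pid):
    """Read the argv of a running process from /proc (Linux only)."""
    with open("/proc/%d/cmdline" % pid, "rb") as f:
        raw = f.read()
    # Arguments are NUL-separated; drop the empty trailing entry.
    return [arg.decode(errors="replace") for arg in raw.split(b"\0") if arg]
```

With the pid of the hung qemu process in hand, `image_path_from_qemu_argv(read_cmdline(pid))` would give the path to pass back as `--image_path` when restarting, provided the assumptions above hold.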