
Issue 797164

Starred by 1 user

Issue metadata

Status: Untriaged
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Bug




improve qemu/VMTest reboot in case of bad failures

Project Member Reported by ihf@chromium.org, Dec 22 2017

Issue description

[Notice: Normally rebooting from the inside of the qemu instance brings it back after a few seconds as expected. This is about a hang on shutdown or an instance completely gone bad.]

When a CTS test goes bad (say Chrome logs in, but Android doesn't come up fully) inside of a VM and the test tries to recover the device by rebooting, the ssh connection often does not come back.

19:11:51 INFO | autoserv| Will raise error TestFail('Error: Failed to set up adb connection',) due to unexpected return: False
19:11:51 INFO | autoserv| Skipping reboot, restarting browser.
19:13:51 INFO | autoserv| run process timeout (120) fired on: /usr/bin/ssh -a -x  -o ControlPath=/tmp/_autotmp_JPgcizssh-master/socket -o Protocol=2 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -l root -p 9222 localhost "export LIBC_FATAL_STDERR_=1; stop ui&& find /tmp/ -mindepth 1 -delete && start ui"
19:13:52 INFO | autoserv| Restarting browser has failed.
19:13:52 INFO | autoserv| Will reboot DUT when Chrome stops.
19:13:52 INFO | autoserv| Rebooting...

[Nothing]

Running a new test then leads to:

19:21:06 INFO | autoserv| Timed out waiting for master-ssh connection to be established.
19:21:16 INFO | autoserv| run process timeout (10) fired on: /usr/bin/ssh -a -x  -o ControlPath=/tmp/_autotmp_otWafTssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/tmp/tmpf349Km -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=300 -o ServerAliveCountMax=3 -o ConnectionAttempts=1 -l root -p 9222 localhost " if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::_detect_host|check_host|run] -> ssh_run(test -f /mnt/stateful_partition/.android_tester)\";fi; test -f /mnt/stateful_partition/.android_tester"

One solution would be to nuke qemu and restart it:

sudo pkill -f qemu
./bin/cros_start_vm --no_graphics --image_path=../build/images/${BOARD}/latest/chromiumos_qemu_image.bin

I think that is outside of scope for tradefed_test.py though. The reboot code should detect the situation and handle this transparently.
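
A rough sketch of what "detect and handle" could look like, in Python. This is not autotest code: all function names here are hypothetical, and the port 9222 is simply taken from the logs above. The probe waits for the sshd identification string ("SSH-...") rather than just a TCP connect, on the assumption that qemu's host-side port forward may accept connections even when the guest is hung.

```python
import socket
import time


def ssh_banner_seen(host, port, timeout):
    """Return True if the peer sends an SSH identification string in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            return sock.recv(64).startswith(b"SSH-")
    except OSError:  # refused, unreachable, or timed out
        return False


def wait_for_ssh(host="localhost", port=9222, deadline_s=120.0, poll_s=5.0,
                 probe=ssh_banner_seen, sleep=time.sleep,
                 clock=time.monotonic):
    """Poll the probe until it succeeds or deadline_s has elapsed."""
    end = clock() + deadline_s
    while True:
        if probe(host, port, poll_s):
            return True
        if clock() >= end:
            return False
        sleep(poll_s)


def reboot_with_fallback(reboot, hard_restart, wait=wait_for_ssh):
    """Try a normal reboot; if ssh never returns, fall back to hard_restart.

    reboot and hard_restart are caller-supplied callables (hypothetical).
    hard_restart would be whatever kills and relaunches the VM.
    """
    reboot()
    if wait():
        return "rebooted"
    hard_restart()
    if wait():
        return "hard-restarted"
    raise RuntimeError("VM did not come back after hard restart")
```

The point of passing hard_restart in as a callable is that the test itself does not need to know how the VM was launched; whoever started qemu supplies the way to restart it.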
 

Comment 1 by pwang@chromium.org, Jan 3 2018

Cc: cernekee@chromium.org
One problem with nuking qemu and restarting is that the builder often puts the VM image in an arbitrary folder, say /tmp/cbuildbot-tmplRubb7/chromiumos_qemu_disk.bin.0E997a, rather than build/images/${BOARD}/latest.

Cc'ing cernekee, as he may know more about QEMU network issues.
> When a CTS test goes bad (say Chrome logs in, but Android doesn't come up fully) inside of a VM and the test tries to recover the device by rebooting, the ssh connection often does not come back.

This doesn't sound normal.  Is there a way to pull /var/log/net.log and messages from an affected system, or reproduce the issue on demand?
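
For pulling the logs, something along these lines might work while the VM is still reachable over ssh (for example right after the adb failure, before the reboot is attempted). This is only a sketch: the helper names are made up, the host, port and ssh options are copied from the log lines above, and "messages" is assumed to mean /var/log/messages.

```python
import subprocess


def build_scp_command(remote_paths, dest_dir, host="localhost", port=9222,
                      user="root"):
    """Build an scp argv that copies remote_paths from the VM into dest_dir."""
    cmd = ["scp", "-P", str(port),
           "-o", "StrictHostKeyChecking=no",
           "-o", "UserKnownHostsFile=/dev/null",
           "-o", "BatchMode=yes",
           "-o", "ConnectTimeout=30"]
    cmd += ["%s@%s:%s" % (user, host, path) for path in remote_paths]
    cmd.append(dest_dir)
    return cmd


def pull_logs(dest_dir, timeout_s=120):
    """Copy net.log and messages out of the VM; return True on success."""
    # Assumption: "messages" lives at /var/log/messages on the image.
    cmd = build_scp_command(["/var/log/net.log", "/var/log/messages"],
                            dest_dir)
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode == 0
    except subprocess.TimeoutExpired:
        return False
```

If ssh is already hung this will simply time out, so it would have to run before the reboot attempt to be useful.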
Components: Infra>Client>ChromeOS>CI
Components: -Infra>Client>ChromeOS
