New issue
Advanced search Search tips

Issue 900804 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Dec 10
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug



Sign in to add a comment

TastVMTest fails on second attempt due to SSH port 9222 being in use

Project Member Reported by derat@chromium.org, Nov 1

Issue description

I think that whenever TastVMTest (attempt 1) fails on a VM builder due to a test failure, TastVMTest (attempt 2) fails to run entirely due to the VM SSH port still being in use.

See e.g. the betty-paladin run at http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8931120407743696336 :

************************************************************
** Start Stage TastVMTest (attempt 2) - Wed, 31 Oct 2018 16:26:49 -0700 (PDT)
** 
** Runs Tast integration tests in a virtual machine.
************************************************************
16:26:49: INFO: Running cidb query on pid 27681, repr(query) starts with <sqlalchemy.sql.expression.Update object at 0x7f45508105d0>
Preconditions for the stage successfully met. Beginning to execute stage...
16:26:49: INFO: Running cidb query on pid 27681, repr(query) starts with <sqlalchemy.sql.expression.Update object at 0x7f45508106d0>
16:26:49: INFO: Running Tast VM test suite tast_vm_paladin ((!disabled && !"group:*" && !informational))
16:26:49: INFO: RunCommand: /b/swarming/w/ir/cache/cbuild/repository/chromite/bin/cros_sdk -- cros_run_vm_test --no-display '--board=betty' '--image-path=/mnt/host/source/src/build/images/betty/R72-11215.0.0-rc1/chromiumos_test_image.bin' --copy-on-write '--results-dir=/tmp/cbuildbot1ZxdAD/tast_vm_paladin' --host-cmd -- tast -verbose run '-build=false' '-resultsdir=/tmp/cbuildbot1ZxdAD/tast_vm_paladin' '-extrauseflags=tast_vm' 127.0.0.1:9222 '(!disabled && !"group:*" && !informational)'
16:26:49: NOTICE: /b/swarming/w/ir/cache/cbuild/repository/chroot.img is using 38 GiB more than needed.  Running fstrim.
16:26:50: INFO: /tmp/cros_vm_9222/kvm.pid does not exist.
16:26:50: DEBUG: Stop VM
16:26:50: INFO: /tmp/cros_vm_9222/kvm.pid does not exist.
16:26:50: INFO: SSH port 9222 in use...
16:26:50: DEBUG: <class 'chromite.scripts.cros_vm._SSHPortInUseError'>()
16:26:50: DEBUG: Retrying in 1.000000 (1.000000 + jitter 0.000000) seconds ...
16:26:51: INFO: SSH port 9222 in use...
16:26:51: DEBUG: <class 'chromite.scripts.cros_vm._SSHPortInUseError'>()
16:26:51: DEBUG: Retrying in 2.000000 (2.000000 + jitter 0.000000) seconds ...
16:26:53: INFO: SSH port 9222 in use...
16:26:53: DEBUG: <class 'chromite.scripts.cros_vm._SSHPortInUseError'>()
16:26:53: DEBUG: Retrying in 3.000000 (3.000000 + jitter 0.000000) seconds ...
16:26:56: INFO: SSH port 9222 in use...
16:26:56: DEBUG: <class 'chromite.scripts.cros_vm._SSHPortInUseError'>()
16:26:56: DEBUG: Retrying in 4.000000 (4.000000 + jitter 0.000000) seconds ...
16:27:00: INFO: SSH port 9222 in use...
16:27:00: DEBUG: <class 'chromite.scripts.cros_vm._SSHPortInUseError'>()
16:27:00: DEBUG: Retrying in 5.000000 (5.000000 + jitter 0.000000) seconds ...
16:27:05: INFO: SSH port 9222 in use...
16:27:05: DEBUG: <class 'chromite.scripts.cros_vm._SSHPortInUseError'>()
16:27:05: DEBUG: Retrying in 6.000000 (6.000000 + jitter 0.000000) seconds ...
16:27:11: INFO: SSH port 9222 in use...
16:27:11: DEBUG: <class 'chromite.scripts.cros_vm._SSHPortInUseError'>()
16:27:11: DEBUG: Retrying in 7.000000 (7.000000 + jitter 0.000000) seconds ...
16:27:18: INFO: SSH port 9222 in use...
16:27:18: DEBUG: <class 'chromite.scripts.cros_vm._SSHPortInUseError'>()
16:27:18: ERROR: SSH port 9222 in use
cros_run_vm_test: Unhandled exception:
...

I suspect that this started when we started using cros_vm.py to start the VM ( issue 891928 ).

Achuith, any ideas about what's happening here? I'm guessing that the first attempt isn't stopping the VM. I see this near the end of its log:

...
16:26:45: INFO: Tests succeeded.
16:26:45: DEBUG: Stop VM
16:26:45: INFO: Killing 266.
16:26:45: INFO: RunCommand: kill -9 266
16:26:45: DEBUG: Stop VM
16:26:45: DEBUG: /tmp/cros_vm_9222 not present.
16:26:45: INFO: Time elapsed 0:03:51.950403.
...
 
Cc: cychiang@chromium.org
It's possible the kill -9 isn't succeeding for some reason? The end of the log looks pretty normal, but we don't really check if the process exited.

We have a couple of options:
1. If the port is not explicitly set, we can pick another port when we detect the port is busy. This is probably a nice feature on its own, but in this particular case, there's a danger we're leak VM instances and eventually bring the bot to its knees.
2. We could also ssh into the VM and halt it prior to the kill -9. And perhaps wait for the process to exit after the kill -9. 

I think we should probably go with #2, but I'd like to add #1 as a feature down the road. Thoughts?
I think we probably don't want to switch over to another port. Being able to reliably stop earlier VMs seems important. :-)

Making Stop() wait a bit in hopes that the old process goes away sounds reasonable. I'm less sure about SSH-ing into the old VM -- is there any reason why that should be faster than sending SIGKILL?

I wonder if there's some way to collect more info about what the problem is here. Does qemu log anything anywhere? Can we tell if the old process is e.g. in uninterruptible sleep a.k.a. the 'D' state?
Owner: achuith@chromium.org
Status: Started (was: Untriaged)
SSH + exit is definitely not faster, but it may be more reliable. I'm not sure if there's some permissions issue that's preventing the SIGKILL from doing its job?

I don't know where QEMU logs anything if it does. I can add some code that'll throw if the old VM doesn't relinquish the port.
(There's a change at https://crrev.com/c/1321353, but I think it has a bug.)
Owner: derat@chromium.org
I'm concerned about power.Reboot - perhaps it's possible it leaves the VM unreachable and unkillable somehow.

Some speculative CLs that may help us along:
https://chromium-review.googlesource.com/c/chromiumos/chromite/+/1345545
https://chromium-review.googlesource.com/c/chromiumos/chromite/+/1345539
Re power.Reboot, interesting theory. See e.g. http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8929329919858702736.

2018/11/20 08:08:09 Started test power.Reboot
2018/11/20 08:08:09 [08:08:09.185] Rebooting DUT
2018/11/20 08:08:09 [08:08:09.193] Waiting for DUT to become unreachable
2018/11/20 08:08:13 [08:08:13.198] DUT became unreachable (as expected)
2018/11/20 08:08:13 [08:08:13.198] Reconnecting to DUT
2018/11/20 08:08:45 [08:08:45.091] Reconnected to DUT
2018/11/20 08:09:14 Completed test power.Reboot in 35.909s with 0 error(s)
2018/11/20 08:09:14 [08:08:45.092] Disconnecting from DUT
2018/11/20 08:09:14 Ran 3 remote test(s) in 1m7.077s
2018/11/20 08:09:14 Collecting system information
2018/11/20 08:09:14 Connecting to 127.0.0.1:9222
2018/11/20 08:09:14 /var/spool/crash/cros_camera_service.20181120.080549.1647.dmp: skipping; too many files
2018/11/20 08:09:14 /var/spool/crash/cros_camera_service.20181120.100701.3600.dmp: skipping; too many files
2018/11/20 08:09:14 Copying /tmp/tast_logs.879199309 from host to /tmp/cbuildbotIoXCWt/tast_vm_paladin/system_logs
2018/11/20 08:09:15 Cleaning /tmp/tast_logs.879199309 on host
2018/11/20 08:09:15 Copying /tmp/tast_crashes.475682120 from host to /tmp/cbuildbotIoXCWt/tast_vm_paladin/crashes
2018/11/20 08:09:15 Cleaning /tmp/tast_crashes.475682120 on host

So the DUT was definitely still reachable over SSH after the test rebooted it: the test was able to re-establish an SSH connection to it, and then the main "tast" process was also able to connect to it at 127.0.0.1:9222. After that, things also seem okay (at least from the logs):

08:09:15: INFO: tast exited with status code 0.
08:09:15: DEBUG: Stop VM
08:09:15: INFO: RunCommand: kill -9 266
08:09:17: DEBUG: Stop VM
08:09:17: DEBUG: /tmp/cros_vm_9222 not present.
08:09:17: INFO: Time elapsed: 0:05:36.019299

But then the retry stage (executed since a Tast test failed) sees the port in use:

08:09:31: INFO: RunCommand: /b/swarming/w/ir/cache/cbuild/repository/chromite/bin/cros_sdk -- cros_run_vm_test --no-display '--board=amd64-generic' '--image-path=/mnt/host/source/src/build/images/amd64-generic/R72-11280.0.0-rc1/chromiumos_test_image.bin' --copy-on-write '--results-dir=/tmp/cbuildbotS2E_Ui/tast_vm_paladin' --host-cmd -- tast -verbose run '-build=false' '-resultsdir=/tmp/cbuildbotS2E_Ui/tast_vm_paladin' '-extrauseflags=tast_vm' 127.0.0.1:9222 '(!disabled && !"group:*" && !informational)'
08:09:34: INFO: /tmp/cros_vm_9222/kvm.pid does not exist.
08:09:34: DEBUG: Stop VM
08:09:34: INFO: /tmp/cros_vm_9222/kvm.pid does not exist.
08:09:34: INFO: SSH port 9222 in use...
08:09:34: DEBUG: <class 'chromite.scripts.cros_vm._SSHPortInUseError'>()
08:09:34: DEBUG: Retrying in 1.000000 (1.000000 + jitter 0.000000) seconds ...
08:09:35: INFO: SSH port 9222 in use...
08:09:35: DEBUG: <class 'chromite.scripts.cros_vm._SSHPortInUseError'>()
08:09:35: DEBUG: Retrying in 2.000000 (2.000000 + jitter 0.000000) seconds ...
08:09:37: INFO: SSH port 9222 in use...
08:09:37: DEBUG: <class 'chromite.scripts.cros_vm._SSHPortInUseError'>()
08:09:37: DEBUG: Retrying in 3.000000 (3.000000 + jitter 0.000000) seconds ...
08:09:40: INFO: SSH port 9222 in use...
08:09:40: DEBUG: <class 'chromite.scripts.cros_vm._SSHPortInUseError'>()
08:09:40: DEBUG: Retrying in 4.000000 (4.000000 + jitter 0.000000) seconds ...
08:09:44: INFO: SSH port 9222 in use...
08:09:44: DEBUG: <class 'chromite.scripts.cros_vm._SSHPortInUseError'>()
08:09:44: DEBUG: Retrying in 5.000000 (5.000000 + jitter 0.000000) seconds ...
08:09:49: INFO: SSH port 9222 in use...
08:09:49: DEBUG: <class 'chromite.scripts.cros_vm._SSHPortInUseError'>()
08:09:49: DEBUG: Retrying in 6.000000 (6.000000 + jitter 0.000000) seconds ...
08:09:55: INFO: SSH port 9222 in use...
08:09:55: DEBUG: <class 'chromite.scripts.cros_vm._SSHPortInUseError'>()
08:09:55: DEBUG: Retrying in 7.000000 (7.000000 + jitter 0.000000) seconds ...
08:10:02: INFO: SSH port 9222 in use...
08:10:02: DEBUG: <class 'chromite.scripts.cros_vm._SSHPortInUseError'>()
08:10:02: ERROR: SSH port 9222 in use

It'd be possible to skip power.Reboot on VMs, but we'd lose some coverage.

I wonder if qemu's PID might be changing as a result of the reboot, with that causing the kill command to not work. That might be an argument for trying to SSH in to the DUT to shut it down...
I just ran a cros_run_vm_test command matching the one that we use on the builders against a betty image locally, and I didn't see the qemu process sticking around (or anything else listening on 127.0.0.1:9222) after testing finished:

...
2018/11/20 22:11:03 Completed test power.Reboot in 22.354s with 0 error(s)
...
2018/11/20 22:11:04 Results saved to /tmp/tast_vm
22:11:04: INFO: tast exited with status code 0.
22:11:04: DEBUG: Stop VM
22:11:04: INFO: RunCommand: kill -9 15219
22:11:04: DEBUG: Stop VM
22:11:04: DEBUG: /tmp/cros_vm_9222 not present.
22:11:04: INFO: Time elapsed: 0:02:23.760778

I also tried running a test that always fails (example.Fail) to see if that makes a difference, but it doesn't -- the qemu process is still gone afterward.
Ok, I think the picture has become clearer.

The VM is actually going away. I tried sudo kill -9 in a retry-loop, but that's unnecessary.

You can look at the log here:
https://luci-logdog.appspot.com/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8928663624210416960/+/steps/TastVMTest__attempt_1_/0/stdout

There's another process that's connnected and holding on to the port. nestat -ap reveals:
17:14:14: INFO: tcp        0      0 localhost:9222          0.0.0.0:*               LISTEN      268/qemu-system-x86 
17:14:14: INFO: tcp        0      0 localhost:9222          localhost:54215         CLOSE_WAIT  268/qemu-system-x86 
17:14:14: INFO: tcp        0      0 localhost:9222          localhost:53949         TIME_WAIT   -                   

After we kill pid 268, which is the VM pid, we're left with:
17:14:16: INFO: tcp        0      0 localhost:9222          localhost:53949         TIME_WAIT   -                   

The socket is eventually freed in ~45 sec, so if I increased the time that we wait for the port to become free, we should stop failing. Going to try that next. 

Is Tast properly cleaning up at the end of the tests?
Thanks for adding the extra logging. I think that http://www.serverframework.com/asynchronousevents/2011/01/time-wait-and-its-design-implications-for-protocols-and-scalable-servers.html is probably relevant. I can look into this more later (gardener this week, so time is scarce).
I haven't landed the netstat logging, and not entirely sure if we want to land it.

We want the first 2, and not sure about #3:
1. Clean up, properly closing all sockets at the end of the test run.
2. Wait longer - this will add a minute in the interim.
3. Since we aren't leaking VMs but the port is temporarily occupied by the last test run, we can use random ports to avoid this specific problem.

It's a shot in the dark, but the Close method in the ssh.Host class that I wrote closes its socket asynchronously. That can probably skip getting run when the main goroutine exits.

I'm still not completely sure that this is what's going on, but I can try making tast synchronously close the connection before exiting. The downside is that we'll probably hang when the DUT is unreachable, which I was trying to avoid.

(IUIC, TIME_WAIT here means that the kernel is trying to deliver tast's final ACK to qemu. I'm surprised that this would still be happening because I thought that the tast process is by definition completely gone by the time that we kill qemu.)
Cc: derat@chromium.org
Owner: achuith@chromium.org
Sounds like https://crrev.com/c/1352930 is expected to fix this. Thanks again!
Project Member

Comment 15 by bugdroid1@chromium.org, Nov 29

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/49d2238170a6bcfff367def2a542de233f76ae78

commit 49d2238170a6bcfff367def2a542de233f76ae78
Author: Achuith Bhandarkar <achuith@chromium.org>
Date: Thu Nov 29 20:11:04 2018

cros_vm: Wait for SSH port to free upon exit.

* Move _WaitForSSHPort to Stop()
* Wait longer (275 sec) in _WaitForSSHPort.
* Always close socket in _CheckSSHPortBusy
* _KillVM function

BUG= chromium:900804 
TEST=manual

Change-Id: Icec46ebcb40aa3b4358be1d0735b8dd6103f997f
Reviewed-on: https://chromium-review.googlesource.com/1352930
Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
Tested-by: Achuith Bhandarkar <achuith@chromium.org>
Reviewed-by: Dan Erat <derat@chromium.org>

[modify] https://crrev.com/49d2238170a6bcfff367def2a542de233f76ae78/lib/vm.py

Project Member

Comment 16 by bugdroid1@chromium.org, Dec 3

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/89808a631e39923b1ba8d4027449d47f7c8b3f5b

commit 89808a631e39923b1ba8d4027449d47f7c8b3f5b
Author: Achuith Bhandarkar <achuith@chromium.org>
Date: Mon Dec 03 20:43:40 2018

tast_test_stages: Launch VM outside cros_sdk shell

Also --results-dir doesn't do anything.

BUG= chromium:900804 
TEST=bots

Change-Id: I29afd2227fafcade0a5821ea39673e36761f53b1
Reviewed-on: https://chromium-review.googlesource.com/1345545
Commit-Ready: Achuith Bhandarkar <achuith@chromium.org>
Tested-by: Achuith Bhandarkar <achuith@chromium.org>
Reviewed-by: Dan Erat <derat@chromium.org>

[modify] https://crrev.com/89808a631e39923b1ba8d4027449d47f7c8b3f5b/cbuildbot/stages/tast_test_stages.py
[modify] https://crrev.com/89808a631e39923b1ba8d4027449d47f7c8b3f5b/cbuildbot/stages/tast_test_stages_unittest.py

Owner: derat@chromium.org
Passing to Dan - I don't know if you'd like to investigate this further based on #c11. If not, please feel free to close. 
Owner: achuith@chromium.org
Status: Fixed (was: Started)
Nope, I think that that was probably a red herring. Thanks for fixing this!
Project Member

Comment 19 by bugdroid1@chromium.org, Dec 19

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/b2c827c6440e0c07397673af44403c63423d32ed

commit b2c827c6440e0c07397673af44403c63423d32ed
Author: Achuith Bhandarkar <achuith@chromium.org>
Date: Wed Dec 19 05:33:29 2018

cros_test: Run Tast tests.

Similar to --autotest.

BUG= chromium:900804 
TEST=manual

Change-Id: I5750c18768ee9bbc9eca784ffc544b453888a65d
Reviewed-on: https://chromium-review.googlesource.com/1381064
Commit-Ready: Achuith Bhandarkar <achuith@chromium.org>
Tested-by: Achuith Bhandarkar <achuith@chromium.org>
Reviewed-by: Dan Erat <derat@chromium.org>

[modify] https://crrev.com/b2c827c6440e0c07397673af44403c63423d32ed/lib/cros_test.py

Project Member

Comment 20 by bugdroid1@chromium.org, Dec 28

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/18fd9f3fb15d29974e4d676c8e4a25bf945a295e

commit 18fd9f3fb15d29974e4d676c8e4a25bf945a295e
Author: Achuith Bhandarkar <achuith@chromium.org>
Date: Fri Dec 28 08:43:09 2018

tast_test_stages: Use cros_run_vm_test --tast.

BUG= chromium:900804 
TEST=bots

Change-Id: Ic0f5c2f004fe971a7bf60043d7f617d4d7d77018
Reviewed-on: https://chromium-review.googlesource.com/1382947
Commit-Ready: Achuith Bhandarkar <achuith@chromium.org>
Tested-by: Achuith Bhandarkar <achuith@chromium.org>
Reviewed-by: Dan Erat <derat@chromium.org>

[modify] https://crrev.com/18fd9f3fb15d29974e4d676c8e4a25bf945a295e/cbuildbot/stages/tast_test_stages.py
[modify] https://crrev.com/18fd9f3fb15d29974e4d676c8e4a25bf945a295e/cbuildbot/stages/tast_test_stages_unittest.py

Project Member

Comment 21 by bugdroid1@chromium.org, Jan 9

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/284871915cd6db175bdb079d1c533d736f7e5515

commit 284871915cd6db175bdb079d1c533d736f7e5515
Author: Achuith Bhandarkar <achuith@chromium.org>
Date: Wed Jan 09 19:55:44 2019

tast_test_stages: Use VM image for Tast VM tests.

BUG= chromium:900804 
TEST=amd64-generic-paladin-tryjob betty-paladin-tryjob

Change-Id: I676d64209a2c19d9d4a8690ce1124865e85798db
Reviewed-on: https://chromium-review.googlesource.com/1395598
Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
Tested-by: Achuith Bhandarkar <achuith@chromium.org>
Reviewed-by: Dan Erat <derat@chromium.org>

[modify] https://crrev.com/284871915cd6db175bdb079d1c533d736f7e5515/cbuildbot/stages/tast_test_stages.py
[modify] https://crrev.com/284871915cd6db175bdb079d1c533d736f7e5515/cbuildbot/stages/build_stages.py
[modify] https://crrev.com/284871915cd6db175bdb079d1c533d736f7e5515/cbuildbot/stages/build_stages_unittest.py

Sign in to add a comment