TastVMTest fails on second attempt due to SSH port 9222 being in use
Issue description

I think that whenever TastVMTest (attempt 1) fails on a VM builder due to a test failure, TastVMTest (attempt 2) fails to run entirely because the VM SSH port is still in use. See e.g. the betty-paladin run at http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8931120407743696336 :

************************************************************
** Start Stage TastVMTest (attempt 2) - Wed, 31 Oct 2018 16:26:49 -0700 (PDT)
**
** Runs Tast integration tests in a virtual machine.
************************************************************
16:26:49: INFO: Running cidb query on pid 27681, repr(query) starts with <sqlalchemy.sql.expression.Update object at 0x7f45508105d0>
Preconditions for the stage successfully met. Beginning to execute stage...
16:26:49: INFO: Running cidb query on pid 27681, repr(query) starts with <sqlalchemy.sql.expression.Update object at 0x7f45508106d0>
16:26:49: INFO: Running Tast VM test suite tast_vm_paladin ((!disabled && !"group:*" && !informational))
16:26:49: INFO: RunCommand: /b/swarming/w/ir/cache/cbuild/repository/chromite/bin/cros_sdk -- cros_run_vm_test --no-display '--board=betty' '--image-path=/mnt/host/source/src/build/images/betty/R72-11215.0.0-rc1/chromiumos_test_image.bin' --copy-on-write '--results-dir=/tmp/cbuildbot1ZxdAD/tast_vm_paladin' --host-cmd -- tast -verbose run '-build=false' '-resultsdir=/tmp/cbuildbot1ZxdAD/tast_vm_paladin' '-extrauseflags=tast_vm' 127.0.0.1:9222 '(!disabled && !"group:*" && !informational)'
16:26:49: NOTICE: /b/swarming/w/ir/cache/cbuild/repository/chroot.img is using 38 GiB more than needed. Running fstrim.
16:26:50: INFO: /tmp/cros_vm_9222/kvm.pid does not exist.
16:26:50: DEBUG: Stop VM
16:26:50: INFO: /tmp/cros_vm_9222/kvm.pid does not exist.
16:26:50: INFO: SSH port 9222 in use...
16:26:50: DEBUG: <class 'chromite.scripts.cros_vm._SSHPortInUseError'>()
16:26:50: DEBUG: Retrying in 1.000000 (1.000000 + jitter 0.000000) seconds ...
16:26:51: INFO: SSH port 9222 in use...
16:26:51: DEBUG: <class 'chromite.scripts.cros_vm._SSHPortInUseError'>()
16:26:51: DEBUG: Retrying in 2.000000 (2.000000 + jitter 0.000000) seconds ...
16:26:53: INFO: SSH port 9222 in use...
16:26:53: DEBUG: <class 'chromite.scripts.cros_vm._SSHPortInUseError'>()
16:26:53: DEBUG: Retrying in 3.000000 (3.000000 + jitter 0.000000) seconds ...
16:26:56: INFO: SSH port 9222 in use...
16:26:56: DEBUG: <class 'chromite.scripts.cros_vm._SSHPortInUseError'>()
16:26:56: DEBUG: Retrying in 4.000000 (4.000000 + jitter 0.000000) seconds ...
16:27:00: INFO: SSH port 9222 in use...
16:27:00: DEBUG: <class 'chromite.scripts.cros_vm._SSHPortInUseError'>()
16:27:00: DEBUG: Retrying in 5.000000 (5.000000 + jitter 0.000000) seconds ...
16:27:05: INFO: SSH port 9222 in use...
16:27:05: DEBUG: <class 'chromite.scripts.cros_vm._SSHPortInUseError'>()
16:27:05: DEBUG: Retrying in 6.000000 (6.000000 + jitter 0.000000) seconds ...
16:27:11: INFO: SSH port 9222 in use...
16:27:11: DEBUG: <class 'chromite.scripts.cros_vm._SSHPortInUseError'>()
16:27:11: DEBUG: Retrying in 7.000000 (7.000000 + jitter 0.000000) seconds ...
16:27:18: INFO: SSH port 9222 in use...
16:27:18: DEBUG: <class 'chromite.scripts.cros_vm._SSHPortInUseError'>()
16:27:18: ERROR: SSH port 9222 in use
cros_run_vm_test: Unhandled exception:
...

I suspect this started when we switched to using cros_vm.py to start the VM (issue 891928).

Achuith, any ideas about what's happening here? I'm guessing that the first attempt isn't stopping the VM.
I see this near the end of its log:

...
16:26:45: INFO: Tests succeeded.
16:26:45: DEBUG: Stop VM
16:26:45: INFO: Killing 266.
16:26:45: INFO: RunCommand: kill -9 266
16:26:45: DEBUG: Stop VM
16:26:45: DEBUG: /tmp/cros_vm_9222 not present.
16:26:45: INFO: Time elapsed 0:03:51.950403.
...
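(For reference while reading the log above: a minimal sketch of the retry-and-back-off pattern visible in the failing stage, under the assumption that the port check works by trying to bind 127.0.0.1:9222. The helper names wait_for_ssh_port_free and PortInUseError are hypothetical; this is not chromite's actual cros_vm code.)

# Minimal re-creation of the retry pattern in the failing log above. Names are
# illustrative, not the actual chromite.scripts.cros_vm implementation.
import socket
import time


class PortInUseError(Exception):
    """Raised when the SSH forwarding port never frees up."""


def _port_in_use(port, host='127.0.0.1'):
    # Try to bind the port: EADDRINUSE means something still owns it -- either
    # a live listener or a lingering connection that has not fully closed.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return True
        return False


def wait_for_ssh_port_free(port=9222, deadline_secs=28):
    """Poll with a growing sleep (1s, 2s, 3s, ...), like the log above."""
    end = time.time() + deadline_secs
    delay = 1
    while time.time() < end:
        if not _port_in_use(port):
            return
        print('SSH port %d in use...' % port)
        time.sleep(delay)
        delay += 1
    raise PortInUseError('SSH port %d in use' % port)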
Nov 5
Nov 5
It's possible the kill -9 isn't succeeding for some reason. The end of the log looks pretty normal, but we don't really check if the process exited. We have a couple of options:

1. If the port is not explicitly set, we can pick another port when we detect that the port is busy. This is probably a nice feature on its own, but in this particular case there's a danger we'll leak VM instances and eventually bring the bot to its knees.
2. We could also SSH into the VM and halt it prior to the kill -9, and perhaps wait for the process to exit after the kill -9.

I think we should probably go with #2 (sketched below), but I'd like to add #1 as a feature down the road. Thoughts?
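(A minimal sketch of the "wait for the process to exit after the kill -9" half of option #2, assuming we already have the qemu PID from kvm.pid. kill_and_wait is a hypothetical helper, not the existing cros_vm code.)

# Sketch of option #2's second half: after SIGKILL, confirm the process is
# really gone instead of assuming the kill worked. Hypothetical helper.
import errno
import os
import signal
import time


def kill_and_wait(pid, timeout=30):
    """Send SIGKILL to pid and poll until the process no longer exists."""
    try:
        os.kill(pid, signal.SIGKILL)
    except OSError as e:
        if e.errno != errno.ESRCH:  # "No such process" just means it's gone.
            raise
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            os.kill(pid, 0)  # Signal 0: existence check only.
        except OSError as e:
            if e.errno == errno.ESRCH:
                return True
            raise
        # Note: if pid is our own un-reaped child it will linger as a zombie
        # and still "exist" here; the parent must also waitpid() it.
        time.sleep(1)
    return False  # Still around after the timeout; caller can raise or log.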
Nov 5
I think we probably don't want to switch over to another port. Being able to reliably stop earlier VMs seems important. :-) Making Stop() wait a bit in hopes that the old process goes away sounds reasonable. I'm less sure about SSH-ing into the old VM -- is there any reason why that should be faster than sending SIGKILL? I wonder if there's some way to collect more info about what the problem is here. Does qemu log anything anywhere? Can we tell if the old process is e.g. in uninterruptible sleep a.k.a. the 'D' state?
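(One cheap way to answer the 'D' state question would be to log the process state letter from /proc/<pid>/stat before and after the kill. Linux-only diagnostic sketch; this helper does not exist in chromite today.)

# Diagnostic sketch: report a process's scheduler state (R, S, D, Z, T, ...)
# by reading /proc/<pid>/stat.
def process_state(pid):
    try:
        with open('/proc/%d/stat' % pid) as f:
            stat = f.read()
    except IOError:
        return None  # Process is gone.
    # Field 2 is "(comm)", which may itself contain spaces, so split after the
    # *last* ')': the next token is the state letter.
    return stat.rpartition(')')[2].split()[0]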
Nov 7
SSH + exit is definitely not faster, but it may be more reliable; I'm not sure whether there's some permissions issue that's preventing the SIGKILL from doing its job. I don't know where QEMU logs anything, if it does. I can add some code that'll throw if the old VM doesn't relinquish the port.
Nov 16
(There's a change at https://crrev.com/c/1321353, but I think it has a bug.)
Nov 21
I'm concerned about power.Reboot - it's possible it leaves the VM unreachable and unkillable somehow. Some speculative CLs that may help us along:
https://chromium-review.googlesource.com/c/chromiumos/chromite/+/1345545
https://chromium-review.googlesource.com/c/chromiumos/chromite/+/1345539
Nov 21
Re power.Reboot, interesting theory. See e.g. http://cros-goldeneye/chromeos/healthmonitoring/buildDetails?buildbucketId=8929329919858702736 :

2018/11/20 08:08:09 Started test power.Reboot
2018/11/20 08:08:09 [08:08:09.185] Rebooting DUT
2018/11/20 08:08:09 [08:08:09.193] Waiting for DUT to become unreachable
2018/11/20 08:08:13 [08:08:13.198] DUT became unreachable (as expected)
2018/11/20 08:08:13 [08:08:13.198] Reconnecting to DUT
2018/11/20 08:08:45 [08:08:45.091] Reconnected to DUT
2018/11/20 08:09:14 Completed test power.Reboot in 35.909s with 0 error(s)
2018/11/20 08:09:14 [08:08:45.092] Disconnecting from DUT
2018/11/20 08:09:14 Ran 3 remote test(s) in 1m7.077s
2018/11/20 08:09:14 Collecting system information
2018/11/20 08:09:14 Connecting to 127.0.0.1:9222
2018/11/20 08:09:14 /var/spool/crash/cros_camera_service.20181120.080549.1647.dmp: skipping; too many files
2018/11/20 08:09:14 /var/spool/crash/cros_camera_service.20181120.100701.3600.dmp: skipping; too many files
2018/11/20 08:09:14 Copying /tmp/tast_logs.879199309 from host to /tmp/cbuildbotIoXCWt/tast_vm_paladin/system_logs
2018/11/20 08:09:15 Cleaning /tmp/tast_logs.879199309 on host
2018/11/20 08:09:15 Copying /tmp/tast_crashes.475682120 from host to /tmp/cbuildbotIoXCWt/tast_vm_paladin/crashes
2018/11/20 08:09:15 Cleaning /tmp/tast_crashes.475682120 on host

So the DUT was definitely still reachable over SSH after the test rebooted it: the test was able to re-establish an SSH connection to it, and then the main "tast" process was also able to connect to it at 127.0.0.1:9222.

After that, things also seem okay (at least from the logs):

08:09:15: INFO: tast exited with status code 0.
08:09:15: DEBUG: Stop VM
08:09:15: INFO: RunCommand: kill -9 266
08:09:17: DEBUG: Stop VM
08:09:17: DEBUG: /tmp/cros_vm_9222 not present.
08:09:17: INFO: Time elapsed: 0:05:36.019299

But then the retry stage (executed since a Tast test failed) sees the port in use:

08:09:31: INFO: RunCommand: /b/swarming/w/ir/cache/cbuild/repository/chromite/bin/cros_sdk -- cros_run_vm_test --no-display '--board=amd64-generic' '--image-path=/mnt/host/source/src/build/images/amd64-generic/R72-11280.0.0-rc1/chromiumos_test_image.bin' --copy-on-write '--results-dir=/tmp/cbuildbotS2E_Ui/tast_vm_paladin' --host-cmd -- tast -verbose run '-build=false' '-resultsdir=/tmp/cbuildbotS2E_Ui/tast_vm_paladin' '-extrauseflags=tast_vm' 127.0.0.1:9222 '(!disabled && !"group:*" && !informational)'
08:09:34: INFO: /tmp/cros_vm_9222/kvm.pid does not exist.
08:09:34: DEBUG: Stop VM
08:09:34: INFO: /tmp/cros_vm_9222/kvm.pid does not exist.
08:09:34: INFO: SSH port 9222 in use...
08:09:34: DEBUG: <class 'chromite.scripts.cros_vm._SSHPortInUseError'>()
08:09:34: DEBUG: Retrying in 1.000000 (1.000000 + jitter 0.000000) seconds ...
08:09:35: INFO: SSH port 9222 in use...
08:09:35: DEBUG: <class 'chromite.scripts.cros_vm._SSHPortInUseError'>()
08:09:35: DEBUG: Retrying in 2.000000 (2.000000 + jitter 0.000000) seconds ...
08:09:37: INFO: SSH port 9222 in use...
08:09:37: DEBUG: <class 'chromite.scripts.cros_vm._SSHPortInUseError'>()
08:09:37: DEBUG: Retrying in 3.000000 (3.000000 + jitter 0.000000) seconds ...
08:09:40: INFO: SSH port 9222 in use...
08:09:40: DEBUG: <class 'chromite.scripts.cros_vm._SSHPortInUseError'>()
08:09:40: DEBUG: Retrying in 4.000000 (4.000000 + jitter 0.000000) seconds ...
08:09:44: INFO: SSH port 9222 in use...
08:09:44: DEBUG: <class 'chromite.scripts.cros_vm._SSHPortInUseError'>()
08:09:44: DEBUG: Retrying in 5.000000 (5.000000 + jitter 0.000000) seconds ...
08:09:49: INFO: SSH port 9222 in use...
08:09:49: DEBUG: <class 'chromite.scripts.cros_vm._SSHPortInUseError'>()
08:09:49: DEBUG: Retrying in 6.000000 (6.000000 + jitter 0.000000) seconds ...
08:09:55: INFO: SSH port 9222 in use...
08:09:55: DEBUG: <class 'chromite.scripts.cros_vm._SSHPortInUseError'>()
08:09:55: DEBUG: Retrying in 7.000000 (7.000000 + jitter 0.000000) seconds ...
08:10:02: INFO: SSH port 9222 in use...
08:10:02: DEBUG: <class 'chromite.scripts.cros_vm._SSHPortInUseError'>()
08:10:02: ERROR: SSH port 9222 in use

It'd be possible to skip power.Reboot on VMs, but we'd lose some coverage. I wonder if qemu's PID might be changing as a result of the reboot, with that causing the kill command to not work. That might be an argument for trying to SSH in to the DUT to shut it down...
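(To test the PID-change theory, one could log whether the PID recorded in kvm.pid still points at a qemu process right before the kill; a reboot inside the guest normally does not restart the qemu process, so the PID should be stable, but logging it would confirm that. Sketch only: the helpers are hypothetical, and the kvm.pid path just follows the /tmp/cros_vm_9222 layout seen in the logs.)

# Diagnostic sketch for the "did qemu's PID change across the reboot?" theory:
# cross-check the PID recorded in kvm.pid against the live process's cmdline.
import os


def read_kvm_pid(vm_dir='/tmp/cros_vm_9222'):
    pid_file = os.path.join(vm_dir, 'kvm.pid')
    try:
        with open(pid_file) as f:
            return int(f.read().strip())
    except (IOError, ValueError):
        return None


def pid_is_qemu(pid):
    """Best-effort check that pid still belongs to a qemu process."""
    try:
        with open('/proc/%d/cmdline' % pid, 'rb') as f:
            argv = f.read().split(b'\0')
    except IOError:
        return False
    return bool(argv) and b'qemu' in os.path.basename(argv[0])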
Nov 21
I just ran a cros_run_vm_test command matching the one that we use on the builders against a betty image locally, and I didn't see the qemu process sticking around (or anything else listening on 127.0.0.1:9222) after testing finished:

...
2018/11/20 22:11:03 Completed test power.Reboot in 22.354s with 0 error(s)
...
2018/11/20 22:11:04 Results saved to /tmp/tast_vm
22:11:04: INFO: tast exited with status code 0.
22:11:04: DEBUG: Stop VM
22:11:04: INFO: RunCommand: kill -9 15219
22:11:04: DEBUG: Stop VM
22:11:04: DEBUG: /tmp/cros_vm_9222 not present.
22:11:04: INFO: Time elapsed: 0:02:23.760778

I also tried running a test that always fails (example.Fail) to see if that makes a difference, but it doesn't -- the qemu process is still gone afterward.
Nov 28
OK, I think the picture has become clearer. The VM is actually going away. I tried sudo kill -9 in a retry loop, but that's unnecessary. You can look at the log here:
https://luci-logdog.appspot.com/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8928663624210416960/+/steps/TastVMTest__attempt_1_/0/stdout

There's another process that's connected and holding on to the port. netstat -ap reveals:

17:14:14: INFO: tcp 0 0 localhost:9222 0.0.0.0:* LISTEN 268/qemu-system-x86
17:14:14: INFO: tcp 0 0 localhost:9222 localhost:54215 CLOSE_WAIT 268/qemu-system-x86
17:14:14: INFO: tcp 0 0 localhost:9222 localhost:53949 TIME_WAIT -

After we kill pid 268, which is the VM pid, we're left with:

17:14:16: INFO: tcp 0 0 localhost:9222 localhost:53949 TIME_WAIT -

The socket is eventually freed in ~45 sec, so if I increase the time we wait for the port to become free, we should stop failing. Going to try that next. Is Tast properly cleaning up at the end of the tests?
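(The same diagnosis could be automated without landing raw netstat output: list every TCP socket touching the port and log its state, so a leftover LISTEN (leaked VM) is distinguishable from a harmless lingering TIME_WAIT pair. This sketch assumes psutil is available, which may not hold inside the chroot, and resolving owning PIDs may require root.)

# Sketch of the diagnosis above: enumerate TCP sockets on the SSH forward port
# and report their states.
import psutil


def sockets_on_port(port=9222):
    hits = []
    for conn in psutil.net_connections(kind='tcp'):
        if conn.laddr and conn.laddr[1] == port:
            hits.append((conn.status, conn.pid))
    return hits


# While the VM is up: something like [('LISTEN', 268), ('CLOSE_WAIT', 268)].
# Just after killing qemu: [('TIME_WAIT', None)] until the kernel frees the
# pair (~45 s in the run above).
print(sockets_on_port())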
Nov 28
Thanks for adding the extra logging. I think that http://www.serverframework.com/asynchronousevents/2011/01/time-wait-and-its-design-implications-for-protocols-and-scalable-servers.html is probably relevant. I can look into this more later (gardener this week, so time is scarce).
Nov 28
I haven't landed the netstat logging, and I'm not entirely sure we want to land it. Of the options below, we want the first two; I'm not sure about #3:

1. Clean up, properly closing all sockets at the end of the test run.
2. Wait longer - this will add a minute in the interim.
3. Since we aren't leaking VMs but the port is temporarily occupied by the last test run, we can use random ports to avoid this specific problem (see the sketch below).
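(For completeness, a sketch of what #3 could look like: let the kernel hand out a currently-free ephemeral port by binding to port 0. There is an inherent race between releasing the probe socket and qemu binding the port, which is part of why #1 and #2 are preferred.)

# Sketch of option #3: ask the kernel for a currently-free port. Illustrative
# only; note the race between closing this probe socket and the VM binding it.
import socket


def pick_free_port(host='127.0.0.1'):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # Port 0 = "any free ephemeral port".
        return sock.getsockname()[1]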
Nov 28
It's a shot in the dark, but the Close method in the ssh.Host class that I wrote closes its socket asynchronously. That close can probably get skipped when the main goroutine exits. I'm still not completely sure that this is what's going on, but I can try making tast synchronously close the connection before exiting. The downside is that we'll probably hang when the DUT is unreachable, which I was trying to avoid. (IIUC, TIME_WAIT here means that the kernel is trying to deliver tast's final ACK to qemu. I'm surprised that this would still be happening, because I thought that the tast process is by definition completely gone by the time that we kill qemu.)
Nov 28
Sounds like https://crrev.com/c/1352930 is expected to fix this. Thanks again!
Nov 29
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/49d2238170a6bcfff367def2a542de233f76ae78

commit 49d2238170a6bcfff367def2a542de233f76ae78
Author: Achuith Bhandarkar <achuith@chromium.org>
Date: Thu Nov 29 20:11:04 2018

cros_vm: Wait for SSH port to free upon exit.

* Move _WaitForSSHPort to Stop()
* Wait longer (275 sec) in _WaitForSSHPort.
* Always close socket in _CheckSSHPortBusy
* _KillVM function

BUG=chromium:900804
TEST=manual

Change-Id: Icec46ebcb40aa3b4358be1d0735b8dd6103f997f
Reviewed-on: https://chromium-review.googlesource.com/1352930
Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
Tested-by: Achuith Bhandarkar <achuith@chromium.org>
Reviewed-by: Dan Erat <derat@chromium.org>

[modify] https://crrev.com/49d2238170a6bcfff367def2a542de233f76ae78/lib/vm.py
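(Roughly, the flow the commit above describes -- kill the VM, then wait generously for the SSH port to be released before Stop() returns -- looks like the sketch below, reusing the hypothetical helpers from earlier comments. It is not the actual lib/vm.py code.)

# Rough sketch of the fixed Stop() flow described in the commit message above.
# kill_and_wait() and wait_for_ssh_port_free() are the hypothetical helpers
# sketched in earlier comments, not real chromite functions.
SSH_PORT = 9222
PORT_WAIT_SECS = 275  # Generous enough to outlast a lingering TIME_WAIT pair.


def stop_vm(pid):
    if pid is not None:
        kill_and_wait(pid)
    # Waiting here, at shutdown, means the next launch never races the port.
    wait_for_ssh_port_free(SSH_PORT, deadline_secs=PORT_WAIT_SECS)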
Dec 3
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/89808a631e39923b1ba8d4027449d47f7c8b3f5b

commit 89808a631e39923b1ba8d4027449d47f7c8b3f5b
Author: Achuith Bhandarkar <achuith@chromium.org>
Date: Mon Dec 03 20:43:40 2018

tast_test_stages: Launch VM outside cros_sdk shell

Also --results-dir doesn't do anything.

BUG=chromium:900804
TEST=bots

Change-Id: I29afd2227fafcade0a5821ea39673e36761f53b1
Reviewed-on: https://chromium-review.googlesource.com/1345545
Commit-Ready: Achuith Bhandarkar <achuith@chromium.org>
Tested-by: Achuith Bhandarkar <achuith@chromium.org>
Reviewed-by: Dan Erat <derat@chromium.org>

[modify] https://crrev.com/89808a631e39923b1ba8d4027449d47f7c8b3f5b/cbuildbot/stages/tast_test_stages.py
[modify] https://crrev.com/89808a631e39923b1ba8d4027449d47f7c8b3f5b/cbuildbot/stages/tast_test_stages_unittest.py
Dec 10
Passing to Dan - I don't know if you'd like to investigate this further based on #c11. If not, please feel free to close.
Dec 10
Nope, I think that that was probably a red herring. Thanks for fixing this!
Dec 19
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/b2c827c6440e0c07397673af44403c63423d32ed

commit b2c827c6440e0c07397673af44403c63423d32ed
Author: Achuith Bhandarkar <achuith@chromium.org>
Date: Wed Dec 19 05:33:29 2018

cros_test: Run Tast tests.

Similar to --autotest.

BUG=chromium:900804
TEST=manual

Change-Id: I5750c18768ee9bbc9eca784ffc544b453888a65d
Reviewed-on: https://chromium-review.googlesource.com/1381064
Commit-Ready: Achuith Bhandarkar <achuith@chromium.org>
Tested-by: Achuith Bhandarkar <achuith@chromium.org>
Reviewed-by: Dan Erat <derat@chromium.org>

[modify] https://crrev.com/b2c827c6440e0c07397673af44403c63423d32ed/lib/cros_test.py
Dec 28
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/18fd9f3fb15d29974e4d676c8e4a25bf945a295e

commit 18fd9f3fb15d29974e4d676c8e4a25bf945a295e
Author: Achuith Bhandarkar <achuith@chromium.org>
Date: Fri Dec 28 08:43:09 2018

tast_test_stages: Use cros_run_vm_test --tast.

BUG=chromium:900804
TEST=bots

Change-Id: Ic0f5c2f004fe971a7bf60043d7f617d4d7d77018
Reviewed-on: https://chromium-review.googlesource.com/1382947
Commit-Ready: Achuith Bhandarkar <achuith@chromium.org>
Tested-by: Achuith Bhandarkar <achuith@chromium.org>
Reviewed-by: Dan Erat <derat@chromium.org>

[modify] https://crrev.com/18fd9f3fb15d29974e4d676c8e4a25bf945a295e/cbuildbot/stages/tast_test_stages.py
[modify] https://crrev.com/18fd9f3fb15d29974e4d676c8e4a25bf945a295e/cbuildbot/stages/tast_test_stages_unittest.py
Jan 9
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/chromite/+/284871915cd6db175bdb079d1c533d736f7e5515

commit 284871915cd6db175bdb079d1c533d736f7e5515
Author: Achuith Bhandarkar <achuith@chromium.org>
Date: Wed Jan 09 19:55:44 2019

tast_test_stages: Use VM image for Tast VM tests.

BUG=chromium:900804
TEST=amd64-generic-paladin-tryjob betty-paladin-tryjob

Change-Id: I676d64209a2c19d9d4a8690ce1124865e85798db
Reviewed-on: https://chromium-review.googlesource.com/1395598
Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
Tested-by: Achuith Bhandarkar <achuith@chromium.org>
Reviewed-by: Dan Erat <derat@chromium.org>

[modify] https://crrev.com/284871915cd6db175bdb079d1c533d736f7e5515/cbuildbot/stages/tast_test_stages.py
[modify] https://crrev.com/284871915cd6db175bdb079d1c533d736f7e5515/cbuildbot/stages/build_stages.py
[modify] https://crrev.com/284871915cd6db175bdb079d1c533d736f7e5515/cbuildbot/stages/build_stages_unittest.py