TastVMTest fails due to missing local_test_runner on VM after reboot
Issue description

There was a strange failure in the TastVMTest stage in the amd64-generic-paladin build at https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/CQ/b8928579551775583840 that I don't think I've seen before. Basically, all Tast tests passed, but when the tast process tried to SSH to the VM to collect logs and crashes after testing, the local_test_runner executable was missing:

...
2018/11/28 15:20:33 Started test power.Reboot
2018/11/28 15:20:33 [15:20:33.253] Rebooting DUT
2018/11/28 15:20:33 [15:20:33.311] Waiting for DUT to become unreachable
2018/11/28 15:20:38 [15:20:38.433] DUT became unreachable (as expected)
2018/11/28 15:20:38 [15:20:38.433] Reconnecting to DUT
2018/11/28 15:22:52 [15:22:52.158] Reconnected to DUT
2018/11/28 15:22:52 Completed test power.Reboot in 2m18.928s with 0 error(s)
2018/11/28 15:22:52 [15:22:52.158] Disconnecting from DUT
2018/11/28 15:22:52 Ran 3 remote test(s) in 2m25.543s
2018/11/28 15:22:52 Collecting system information
2018/11/28 15:22:52 Connecting to 127.0.0.1:9222
2018/11/28 15:22:52 --------------------------------------------------------------------------------
2018/11/28 15:22:52 arc.Boot [ SKIP ] missing deps: android
...
2018/11/28 15:22:52 power.Reboot [ PASS ]
2018/11/28 15:22:52 --------------------------------------------------------------------------------
2018/11/28 15:22:52 Results saved to /tmp/cbuildbotHfWMWM/tast_vm_paladin
2018/11/28 15:22:52 Failed to write results: Process exited with status 127: bash: /usr/local/bin/local_test_runner: No such file or directory
15:22:52: INFO: tast exited with status code 1.

Both TastVMTest attempts failed in the same way. local_test_runner was clearly there earlier, because it's the thing that runs local tests. power.Reboot had rebooted the DUT just before sys info collection. Maybe there's a race where the SSH daemon starts listening for connections before /usr/local/bin is mounted?
That seems pretty unlikely, but it's the best that I can come up with right now. I can try to add some additional debug logging if local_test_runner is missing to figure out what happened, I guess.
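A minimal sketch of the kind of debug logging mentioned above, not the actual Tast code: when an expected executable cannot be found, report the deepest ancestor directory that does exist. If /usr/local itself were absent, that would point toward the mount-race theory. If only the binary were gone, it would point elsewhere. The function names firstExistingAncestor and describeMissing are hypothetical, and in practice the check would have to run on the DUT over SSH rather than locally.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// firstExistingAncestor walks up from path and returns the deepest
// component that os.Stat succeeds on, or "" if none is found.
// (Hypothetical helper; not part of Tast.)
func firstExistingAncestor(path string) string {
	for p := filepath.Clean(path); ; p = filepath.Dir(p) {
		if _, err := os.Stat(p); err == nil {
			return p
		}
		// filepath.Dir returns its input once the root (or ".") is reached.
		if p == filepath.Dir(p) {
			return ""
		}
	}
}

// describeMissing returns a one-line diagnostic for path.
// A Stat failure is treated as "missing", which also covers
// permission errors; a real check might distinguish the two.
func describeMissing(path string) string {
	if _, err := os.Stat(path); err == nil {
		return fmt.Sprintf("%s exists", path)
	}
	return fmt.Sprintf("%s is missing; deepest existing ancestor is %q",
		path, firstExistingAncestor(path))
}

func main() {
	fmt.Println(describeMissing("/usr/local/bin/local_test_runner"))
}
```

This would only help if the failure is a missing mount or directory. As the next comment points out, it would say little about file corruption.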
,
Nov 29
If it is a corruption problem (for example, because an improper shutdown left files unflushed), then it's unlikely that logging will help :/
,
Nov 29
power.Reboot just runs "reboot", so the VM should be going through the normal Chrome OS shutdown path. Re the parallel access theory, I don't see any VMTest stages at https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/CQ/b8928579551775583840, so I'm not sure if that could've happened here. It would be pretty easy to make this test not run on VMs by excluding it from the tast_vm_paladin_tests expression in config/chromeos_config.py in chromite. If we think that rebooting is generally unsafe in VMs, it might make more sense to add a "reboot" feature so that this test (and any others that need to reboot) can be skipped automatically.
,
Nov 29
If the DUT somehow came back up after the reboot using a non-test image (e.g. dev), that could explain this. But I don't know if that's remotely possible.
,
Nov 29
Since I don't think we have any leads here, I'm adding a new "reboot" dependency so that power.Reboot can be skipped on VMs.
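To illustrate how a dependency like this leads to a skip, here is a rough sketch assuming that each test declares a list of required software features and the DUT advertises the features it supports. The types and the missingDeps function are hypothetical and do not reflect the real Tast API. The feature sets in main are also made up for the example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// testCase is a hypothetical stand-in for a test and its declared
// software dependencies.
type testCase struct {
	name string
	deps []string
}

// missingDeps returns the sorted subset of deps that the DUT's
// feature set does not provide.
func missingDeps(deps []string, features map[string]bool) []string {
	var missing []string
	for _, d := range deps {
		if !features[d] {
			missing = append(missing, d)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Assumed feature set for a VM: no "android" and no "reboot".
	vmFeatures := map[string]bool{"chrome": true}

	tests := []testCase{
		{name: "arc.Boot", deps: []string{"android", "chrome"}},
		{name: "power.Reboot", deps: []string{"reboot"}},
	}
	for _, t := range tests {
		if m := missingDeps(t.deps, vmFeatures); len(m) > 0 {
			fmt.Printf("%s [ SKIP ] missing deps: %s\n", t.name, strings.Join(m, ", "))
			continue
		}
		fmt.Printf("%s would run\n", t.name)
	}
}
```

With a scheme like this, any other test that needs to reboot could declare the same dependency and be skipped on VMs without changes to the builder configuration.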
,
Nov 29
This failed again here: https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/CQ/b8928562757459451056 Fixing it seems high-priority.
,
Nov 29
Actually, per http://cros-goldeneye/chromeos/legoland/builderHistory?buildConfig=amd64-generic-paladin&buildBranch=master, there were those two failed runs yesterday and then four successful runs since then. So maybe I'll hold off on landing those changes until we see this again.
,
Nov 30
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform/tast/+/e758f64dbc3ee76c44bc54615975c7ad2749cbdf

commit e758f64dbc3ee76c44bc54615975c7ad2749cbdf
Author: Daniel Erat <derat@chromium.org>
Date: Fri Nov 30 00:21:13 2018

tast: Add "reboot" software dependency.

Add a new "reboot" software feature that remote tests that reboot the DUT can depend on. This is really just a hack to let us skip power.Reboot from running on VM builders, where we're apparently sometimes seeing the local_tast_runner executable go missing after a reboot. The cause of this is currently unknown.

BUG=chromium:909955
TEST=none

Change-Id: Ib266ac964075a106547e909f232a22156fb080bb
Reviewed-on: https://chromium-review.googlesource.com/c/1355899
Tested-by: Dan Erat <derat@chromium.org>
Trybot-Ready: Dan Erat <derat@chromium.org>
Reviewed-by: Eric Caruso <ejcaruso@chromium.org>

[modify] https://crrev.com/e758f64dbc3ee76c44bc54615975c7ad2749cbdf/src/chromiumos/cmd/local_test_runner/main.go
[modify] https://crrev.com/e758f64dbc3ee76c44bc54615975c7ad2749cbdf/docs/test_dependencies.md
,
Nov 30
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/471867c92088121b49087b5176650cfb8bdf0a8d

commit 471867c92088121b49087b5176650cfb8bdf0a8d
Author: Daniel Erat <derat@chromium.org>
Date: Fri Nov 30 00:21:25 2018

tast-tests: Make power.Reboot depend on "reboot" feature.

Make the power.Reboot remote test get skipped on devices that aren't able to reboot reliably.

BUG=chromium:909955
TEST=test still compiles
CQ-DEPEND=Ib266ac964075a106547e909f232a22156fb080bb

Change-Id: Iab1118778d35fbacad946695212e634ae07fd8b8
Reviewed-on: https://chromium-review.googlesource.com/c/1355548
Tested-by: Dan Erat <derat@chromium.org>
Trybot-Ready: Dan Erat <derat@chromium.org>
Reviewed-by: Eric Caruso <ejcaruso@chromium.org>

[modify] https://crrev.com/471867c92088121b49087b5176650cfb8bdf0a8d/src/chromiumos/tast/remote/bundles/cros/power/reboot.go
,
Dec 3
Now that the test isn't running on VMs, I'm unlikely to spend time trying to figure out the cause of the problem. The problem happened rarely.
Comment 1 by achuith@chromium.org
, Nov 29