
Issue 909955

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug

TastVMTest fails due to missing local_test_runner on VM after reboot

Project Member Reported by derat@chromium.org, Nov 29

Issue description

There was a strange failure in the TastVMTest stage in the amd64-generic-paladin build at https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/CQ/b8928579551775583840 that I don't think I've seen before.

Basically, all Tast tests passed, but when the tast process tried to SSH to the VM to collect logs and crashes after testing, the local_test_runner executable was missing:

...
2018/11/28 15:20:33 Started test power.Reboot
2018/11/28 15:20:33 [15:20:33.253] Rebooting DUT
2018/11/28 15:20:33 [15:20:33.311] Waiting for DUT to become unreachable
2018/11/28 15:20:38 [15:20:38.433] DUT became unreachable (as expected)
2018/11/28 15:20:38 [15:20:38.433] Reconnecting to DUT
2018/11/28 15:22:52 [15:22:52.158] Reconnected to DUT
2018/11/28 15:22:52 Completed test power.Reboot in 2m18.928s with 0 error(s)
2018/11/28 15:22:52 [15:22:52.158] Disconnecting from DUT
2018/11/28 15:22:52 Ran 3 remote test(s) in 2m25.543s
2018/11/28 15:22:52 Collecting system information
2018/11/28 15:22:52 Connecting to 127.0.0.1:9222
2018/11/28 15:22:52 --------------------------------------------------------------------------------
2018/11/28 15:22:52 arc.Boot                        [ SKIP ] missing deps: android
...
2018/11/28 15:22:52 power.Reboot                    [ PASS ]
2018/11/28 15:22:52 --------------------------------------------------------------------------------
2018/11/28 15:22:52 Results saved to /tmp/cbuildbotHfWMWM/tast_vm_paladin
2018/11/28 15:22:52 Failed to write results: Process exited with status 127: bash: /usr/local/bin/local_test_runner: No such file or directory
15:22:52: INFO: tast exited with status code 1.

Both TastVMTest attempts failed in the same way.

local_test_runner was clearly there earlier, because it's the thing that runs local tests. power.Reboot had rebooted the DUT just before sys info collection. Maybe there's a race where the SSH daemon starts listening for connections before /usr/local/bin is mounted? That seems pretty unlikely, but it's the best that I can come up with right now.

I can try to add some additional debug logging if local_test_runner is missing to figure out what happened, I guess.

Here's the CL:
https://chromium-review.googlesource.com/c/chromiumos/chromite/+/1352930

I *think* we saw something similar prior to copy-on-write support in chrome infra where VM images would get corrupted by parallel access and expected files/directories would go missing. It's possible the VM copy is getting corrupted here too.
If it is a corruption problem (for example, because an improper shutdown left files unflushed), then it's unlikely that logging will help :/
power.Reboot just runs "reboot", so the VM should be going through the normal Chrome OS shutdown path.

Re the parallel access theory, I don't see any VMTest stages at https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/CQ/b8928579551775583840, so I'm not sure if that could've happened here.

It would be pretty easy to make this test not run on VMs by excluding it from the tast_vm_paladin_tests expression in config/chromeos_config.py in chromite.

If we think that rebooting is generally unsafe in VMs, it might make more sense to add a "reboot" feature so that this test (and any others that need to reboot) can be skipped automatically.
If the DUT somehow came back up after the reboot using a non-test image (e.g. dev), that could explain this. But I don't know if that's remotely possible.
Status: Started (was: Assigned)
Since I don't think we have any leads here, I'm adding a new "reboot" dependency so that power.Reboot can be skipped on VMs.
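For anyone unfamiliar with how dependency-based skipping works, the idea can be sketched in a few lines of Go: a test lists the software features it needs, and it is skipped if the DUT doesn't advertise all of them. This is a simplified illustration and not Tast's implementation. missingDeps and the VM feature set below are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// missingDeps returns, in sorted order, the entries of deps that are not
// marked as available. Hypothetical helper for illustration.
func missingDeps(deps []string, available map[string]bool) []string {
	var missing []string
	for _, d := range deps {
		if !available[d] {
			missing = append(missing, d)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Assumed feature set for a VM: it does not advertise "reboot".
	vmFeatures := map[string]bool{"chrome": true}

	// power.Reboot would declare a dependency on "reboot".
	if m := missingDeps([]string{"reboot"}, vmFeatures); len(m) > 0 {
		fmt.Printf("power.Reboot skipped, missing deps: %v\n", m)
		return
	}
	fmt.Println("power.Reboot would run")
}
```

This is the same mechanism that produced the "[ SKIP ] missing deps: android" line for arc.Boot in the log above.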
This failed again here: https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/CQ/b8928562757459451056

Fixing it seems high-priority.
Actually, per http://cros-goldeneye/chromeos/legoland/builderHistory?buildConfig=amd64-generic-paladin&buildBranch=master, there were those two failed runs yesterday and then four successful runs since then. So maybe I'll hold off on landing those changes until we see this again.
Project Member

Comment 8 by bugdroid1@chromium.org, Nov 30

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/tast/+/e758f64dbc3ee76c44bc54615975c7ad2749cbdf

commit e758f64dbc3ee76c44bc54615975c7ad2749cbdf
Author: Daniel Erat <derat@chromium.org>
Date: Fri Nov 30 00:21:13 2018

tast: Add "reboot" software dependency.

Add a new "reboot" software feature that remote tests that
reboot the DUT can depend on. This is really just a hack to
let us skip power.Reboot from running on VM builders, where
we're apparently sometimes seeing the local_tast_runner
executable go missing after a reboot. The cause of this is
currently unknown.

BUG=chromium:909955
TEST=none

Change-Id: Ib266ac964075a106547e909f232a22156fb080bb
Reviewed-on: https://chromium-review.googlesource.com/c/1355899
Tested-by: Dan Erat <derat@chromium.org>
Trybot-Ready: Dan Erat <derat@chromium.org>
Reviewed-by: Eric Caruso <ejcaruso@chromium.org>

[modify] https://crrev.com/e758f64dbc3ee76c44bc54615975c7ad2749cbdf/src/chromiumos/cmd/local_test_runner/main.go
[modify] https://crrev.com/e758f64dbc3ee76c44bc54615975c7ad2749cbdf/docs/test_dependencies.md

Project Member

Comment 9 by bugdroid1@chromium.org, Nov 30

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/471867c92088121b49087b5176650cfb8bdf0a8d

commit 471867c92088121b49087b5176650cfb8bdf0a8d
Author: Daniel Erat <derat@chromium.org>
Date: Fri Nov 30 00:21:25 2018

tast-tests: Make power.Reboot depend on "reboot" feature.

Make the power.Reboot remote test get skipped on devices
that aren't able to reboot reliably.

BUG=chromium:909955
TEST=test still compiles
CQ-DEPEND=Ib266ac964075a106547e909f232a22156fb080bb

Change-Id: Iab1118778d35fbacad946695212e634ae07fd8b8
Reviewed-on: https://chromium-review.googlesource.com/c/1355548
Tested-by: Dan Erat <derat@chromium.org>
Trybot-Ready: Dan Erat <derat@chromium.org>
Reviewed-by: Eric Caruso <ejcaruso@chromium.org>

[modify] https://crrev.com/471867c92088121b49087b5176650cfb8bdf0a8d/src/chromiumos/tast/remote/bundles/cros/power/reboot.go

Cc: derat@chromium.org
Labels: -Pri-1 Pri-2
Owner: ----
Status: Available (was: Started)
Now that the test isn't running on VMs, I'm unlikely to spend time trying to figure out the cause of the problem. The problem happened rarely.