
Issue 907222


Issue metadata

Status: Duplicate
Merged: issue 906289
Owner:
Closed: Nov 27
Cc:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug




VMTest fails on lakitu-gpu-paladin

Project Member Reported by gmeinke@chromium.org, Nov 20

Issue description

We're seeing an error on lakitu-gpu-paladin in the GCETest and VMTest suites:
11/20 11:36:45.178 ERROR|             utils:0287| [stderr] mm_send_fd: file descriptor passing not supported
11/20 11:36:45.178 ERROR|             utils:0287| [stderr] mux_client_request_session: send fds failed
11/20 11:36:45.179 INFO |            remote:0076| Failed to copy /var/log/messages at startup: command execution error

Last three CQ runs for lakitu-gpu-paladin have failed.

Here is the link to the last run: https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/CQ/b8929315805791908768

 
Owner: wonderfly@chromium.org
This needs a lakitu owner.
Marking lakitu-gpu experimental with crrev.com/c/1344635 until we see a few passing CQ runs.
Comment 3 by bugdroid1@chromium.org (Project Member), Nov 20

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/bde30632b4521a8cb728fe2ea3f51ce312bec2b2

commit bde30632b4521a8cb728fe2ea3f51ce312bec2b2
Author: Gregory Meinke <gmeinke@chromium.org>
Date: Tue Nov 20 22:01:49 2018

lakitu-gpu: mark as experimental

BUG= chromium:907222 
TEST=local unittests

Change-Id: Iab7725050e6db49e56fa89b6149552c086082df8
Reviewed-on: https://chromium-review.googlesource.com/1344635
Commit-Ready: Gregory Meinke <gmeinke@chromium.org>
Tested-by: Gregory Meinke <gmeinke@chromium.org>
Reviewed-by: Shelley Chen <shchen@chromium.org>

[modify] https://crrev.com/bde30632b4521a8cb728fe2ea3f51ce312bec2b2/config/chromeos_config.py
[modify] https://crrev.com/bde30632b4521a8cb728fe2ea3f51ce312bec2b2/config/config_dump.json

Cc: lakitu-dev@google.com wonderfly@google.com
Owner: rkolchmeyer@google.com
The error message mentioned in comment #1 shows up in all passing tests as well, so it's probably not fatal. The real error is:

11/20 11:36:16.475 ERROR|        server_job:0825| Exception escaped control file, job aborting:
Traceback (most recent call last):
  File "/build/lakitu-gpu/usr/local/build/autotest/server/server_job.py", line 817, in run
    self._execute_code(server_control_file, namespace)
  File "/build/lakitu-gpu/usr/local/build/autotest/server/server_job.py", line 1340, in _execute_code
    execfile(code_file, namespace, namespace)
  File "/tmp/cbuildbot7FGGjo/smoke/test_harness/all/SimpleTestVerify/1_autotest_tests/results-17-platform_Locale/control.srv", line 10, in <module>
    job.parallel_simple(run_client, machines)
  File "/build/lakitu-gpu/usr/local/build/autotest/server/server_job.py", line 619, in parallel_simple
    log=log, timeout=timeout, return_results=return_results)
  File "/build/lakitu-gpu/usr/local/build/autotest/server/subcommand.py", line 98, in parallel_simple
    function(arg)
  File "/tmp/cbuildbot7FGGjo/smoke/test_harness/all/SimpleTestVerify/1_autotest_tests/results-17-platform_Locale/control.srv", line 6, in run_client
    host.log_kernel()
  File "/build/lakitu-gpu/usr/local/build/autotest/client/common_lib/hosts/base_classes.py", line 540, in log_kernel
    kernel = self.get_kernel_ver()
  File "/build/lakitu-gpu/usr/local/build/autotest/client/common_lib/hosts/base_classes.py", line 494, in get_kernel_ver
    cmd_uname = path_utils.must_be_installed('/bin/uname', host=self)
  File "/build/lakitu-gpu/usr/local/build/autotest/client/common_lib/cros/path_utils.py", line 66, in must_be_installed
    raise error.TestError(error_msg)
TestError: Unable to find /bin/uname on 127.0.0.1

I don't think uname is installed by any Lakitu-specific package, so either some CL in that CQ run was really bad, or some catastrophic breakage has happened...

Anyway, I'll have our oncall take a look.
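
For context, the must_be_installed frame at the bottom of that traceback runs its existence check on the device under test over SSH, so a broken SSH connection can make a binary that is actually present look "missing". Below is a rough standalone sketch of that kind of check; it is hypothetical illustration code, not the real autotest helper in client/common_lib/cros/path_utils.py, and the host/port values are placeholders taken from the log.

# Hypothetical illustration of the check behind the TestError above; this is
# not the actual autotest implementation, just the shape of it. The check
# itself goes over SSH, so an SSH failure surfaces as "Unable to find ...".
import subprocess

class TestError(Exception):
    """Stand-in for autotest's error.TestError."""

def must_be_installed(path, host='127.0.0.1', port=9228):
    """Return |path| if it is executable on the remote host, else raise."""
    result = subprocess.run(
        ['ssh', '-p', str(port), 'root@' + host, 'test', '-x', path])
    if result.returncode == 0:
        return path
    raise TestError('Unable to find %s on %s' % (path, host))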
/bin/uname belongs to sys-apps/coreutils, and the logs indicate that sys-apps/coreutils was successfully installed on the image-under-test. I have no idea what's going on, but I suspect that the issue is in the test framework somehow. Continuing to look.
The root cause is not that /bin/uname is missing (it isn't); SSH to the VM under test is failing:

11/20 07:37:03.244 DEBUG|          ssh_host:0310| Running (ssh) 'test :' from 'get_network_stats|create_target_machine|create_host|_verify_connectivity|run|run_very_slowly'
11/20 07:37:03.248 INFO |     ssh_multiplex:0096| Starting master ssh connection '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_kZQi_3ssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 9228 127.0.0.1'
11/20 07:37:03.248 DEBUG|             utils:0219| Running '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_kZQi_3ssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 9228 127.0.0.1'
11/20 07:37:03.465 ERROR|             utils:0287| [stderr] mm_send_fd: file descriptor passing not supported
11/20 07:37:03.465 ERROR|             utils:0287| [stderr] mux_client_request_session: send fds failed
11/20 07:37:03.467 WARNI|           factory:0219| Failed to verify connectivity to host. Skipping host auto detection logic.

SSH multiplexing doesn't appear to be working properly on the host. I have a feeling that something is misconfigured on swarm-cros-417. I'll check to see if any tests are succeeding at all on that machine.
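
For anyone reproducing this by hand, the pattern in the log is: start one long-lived master ssh process that owns the ControlPath socket, then run every actual command (like the 'test :' above) as a mux client through that socket. The two stderr lines come from the mux-client side, which fails when it cannot pass its stdio file descriptors to the master over that Unix socket. A minimal standalone sketch of the pattern follows; the host, port, and options are copied from the log and are specific to that run, so treat them as placeholders.

# Standalone sketch of the master + mux-client SSH pattern from the log above.
import os
import subprocess
import time

sock = '/tmp/ssh-mux-sketch/socket'
os.makedirs(os.path.dirname(sock), exist_ok=True)
common = ['-o', 'ControlPath=' + sock,
          '-o', 'StrictHostKeyChecking=no',
          '-o', 'UserKnownHostsFile=/dev/null',
          '-o', 'BatchMode=yes',
          '-l', 'root', '-p', '9228', '127.0.0.1']

# Step 1: a background master connection that owns the control socket.
master = subprocess.Popen(['/usr/bin/ssh', '-a', '-x', '-N',
                           '-o', 'ControlMaster=yes'] + common)
for _ in range(100):            # wait up to ~10s for the socket to appear
    if os.path.exists(sock):
        break
    time.sleep(0.1)

# Step 2: a mux client that reuses the master via the socket. This is the step
# that prints "mm_send_fd: file descriptor passing not supported" and
# "mux_client_request_session: send fds failed" when the client cannot hand
# its stdio fds to the master.
session = subprocess.run(['/usr/bin/ssh'] + common + ['test', ':'])
print('session exit status:', session.returncode)
master.terminate()

If the same two messages show up when running this by hand on the builder, that would point at fd passing being broken on the builder itself rather than anything in the image under test.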
If ToT is passing, then either the error is intermittent (which hints at infra flakiness) or some CL in the previous CQ run was bad. In either case, someone from cros infra would be a better owner for this imo, and we can make lakitu-gpu-paladin important again.
If it passes another run or two, it seems reasonable to make it important again.
Cc: zamorzaev@chromium.org
The hosts swarm-cros-378 and swarm-cros-417 seem broken: https://chrome-swarming.appspot.com/bot?id=swarm-cros-378&sort_stats=total%3Adesc, https://chrome-swarming.appspot.com/bot?id=swarm-cros-417&sort_stats=total%3Adesc

Builds on both of these hosts are failing consistently for lakitu-gpu-paladin and other configs. Meanwhile, lakitu-gpu-paladin succeeded on swarm-cros-375 and swarm-cros-414. Would it make sense to make these hosts unschedulable for now and see if the problem continues to occur? The nature of the lakitu-gpu-paladin error messages suggests an issue with the host as well.
On the surface, all bots are identical in their configuration and base image. That is not to say they are totally "clean", since they will reuse a previous checkout. Those bots could be rebuilt, but that would raise the question: what is causing these VM tests to fail only for lakitu builds?

There really hasn't been a good baseline to determine that a bot is healthy; the CQ has not been healthy for weeks, and these bots are only associated with CQ runs. Refreshing bots just because we assume something is wrong is a costly endeavor, as the next execution will take an additional 40 minutes for the initial checkout.

-- Mike
Cc: mikenichols@chromium.org
Cc: vapier@chromium.org
"ssh multiplexing failed" is tracked at  issue 906289 . Is this a dupe of that bug?
Mergedinto: 906289
Status: Duplicate (was: Untriaged)
Yeah, this looks like a dupe of issue 906289 to me. Thanks for pointing that out!
