VMTest fails on lakitu-gpu-paladin
Issue description

We're seeing an error on lakitu-gpu-paladin for the GCETest and VMTest suites:

11/20 11:36:45.178 ERROR| utils:0287| [stderr] mm_send_fd: file descriptor passing not supported
11/20 11:36:45.178 ERROR| utils:0287| [stderr] mux_client_request_session: send fds failed
11/20 11:36:45.179 INFO | remote:0076| Failed to copy /var/log/messages at startup: command execution error

The last three CQ runs for lakitu-gpu-paladin have failed. Here is the link to the last run: https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/CQ/b8929315805791908768
Nov 20
Marking lakitu-gpu experimental with crrev.com/c/1344635 until we get a few passing CQ runs.
Nov 20
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/bde30632b4521a8cb728fe2ea3f51ce312bec2b2

commit bde30632b4521a8cb728fe2ea3f51ce312bec2b2
Author: Gregory Meinke <gmeinke@chromium.org>
Date: Tue Nov 20 22:01:49 2018

lakitu-gpu: mark as experimental

BUG= chromium:907222
TEST=local unittests

Change-Id: Iab7725050e6db49e56fa89b6149552c086082df8
Reviewed-on: https://chromium-review.googlesource.com/1344635
Commit-Ready: Gregory Meinke <gmeinke@chromium.org>
Tested-by: Gregory Meinke <gmeinke@chromium.org>
Reviewed-by: Shelley Chen <shchen@chromium.org>

[modify] https://crrev.com/bde30632b4521a8cb728fe2ea3f51ce312bec2b2/config/chromeos_config.py
[modify] https://crrev.com/bde30632b4521a8cb728fe2ea3f51ce312bec2b2/config/config_dump.json
Nov 20
The error message mentioned in comment #1 shows up in all passing tests as well, so it's probably not fatal. The real error is:
11/20 11:36:16.475 ERROR| server_job:0825| Exception escaped control file, job aborting:
Traceback (most recent call last):
File "/build/lakitu-gpu/usr/local/build/autotest/server/server_job.py", line 817, in run
self._execute_code(server_control_file, namespace)
File "/build/lakitu-gpu/usr/local/build/autotest/server/server_job.py", line 1340, in _execute_code
execfile(code_file, namespace, namespace)
File "/tmp/cbuildbot7FGGjo/smoke/test_harness/all/SimpleTestVerify/1_autotest_tests/results-17-platform_Locale/control.srv", line 10, in <module>
job.parallel_simple(run_client, machines)
File "/build/lakitu-gpu/usr/local/build/autotest/server/server_job.py", line 619, in parallel_simple
log=log, timeout=timeout, return_results=return_results)
File "/build/lakitu-gpu/usr/local/build/autotest/server/subcommand.py", line 98, in parallel_simple
function(arg)
File "/tmp/cbuildbot7FGGjo/smoke/test_harness/all/SimpleTestVerify/1_autotest_tests/results-17-platform_Locale/control.srv", line 6, in run_client
host.log_kernel()
File "/build/lakitu-gpu/usr/local/build/autotest/client/common_lib/hosts/base_classes.py", line 540, in log_kernel
kernel = self.get_kernel_ver()
File "/build/lakitu-gpu/usr/local/build/autotest/client/common_lib/hosts/base_classes.py", line 494, in get_kernel_ver
cmd_uname = path_utils.must_be_installed('/bin/uname', host=self)
File "/build/lakitu-gpu/usr/local/build/autotest/client/common_lib/cros/path_utils.py", line 66, in must_be_installed
raise error.TestError(error_msg)
TestError: Unable to find /bin/uname on 127.0.0.1
I don't think uname is installed by any Lakitu-specific package, so either some CL in that CQ run was really bad, or some catastrophic breakage has happened...
Anyways, I'll have our oncall take a look.
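For context, must_be_installed doesn't stat a local file; it asks the host under test to run a check. A simplified sketch of the idea (not the exact autotest source; BrokenSshHost below is a hypothetical stand-in to show how the error message is produced):

# Simplified sketch, not the exact autotest implementation: the "is this
# installed?" check is a shell command run *on the host*, so any failure to
# execute commands on the host looks exactly like a missing binary.
class TestError(Exception):
    pass

def must_be_installed(path, host):
    # host.run() executes the command on the host under test; if that layer is
    # broken, exit_status is non-zero even though the file exists on the VM.
    result = host.run('test -x %s' % path, ignore_status=True)
    if result.exit_status != 0:
        raise TestError('Unable to find %s on %s' % (path, host.hostname))
    return path

class BrokenSshHost(object):
    """Hypothetical stand-in whose command execution always fails."""
    hostname = '127.0.0.1'
    def run(self, cmd, ignore_status=False):
        class Result(object):
            exit_status = 255  # ssh's exit code when the connection fails
        return Result()

try:
    must_be_installed('/bin/uname', BrokenSshHost())
except TestError as e:
    print(e)  # -> Unable to find /bin/uname on 127.0.0.1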
Nov 20
/bin/uname belongs to sys-apps/coreutils, and the logs indicate that sys-apps/coreutils was successfully installed on the image-under-test. I have no idea what's going on, but I suspect that the issue is in the test framework somehow. Continuing to look.
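For anyone re-checking on the builder, a quick one-off way to confirm the binary is at least staged in the board sysroot (a hypothetical check using the /build/lakitu-gpu path from the traceback above; not autotest code, and the sysroot is not literally the image under test):

# One-off sanity check: is /bin/uname present and executable in the
# lakitu-gpu sysroot that this build used?
import os

sysroot = '/build/lakitu-gpu'              # board sysroot, per the traceback paths
uname = os.path.join(sysroot, 'bin/uname')
print('%s exists: %s, executable: %s' % (
    uname, os.path.exists(uname), os.access(uname, os.X_OK)))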
Nov 21
The root cause is not that /bin/uname is missing (it is not missing). SSH to the VM under test is failing:

11/20 07:37:03.244 DEBUG| ssh_host:0310| Running (ssh) 'test :' from 'get_network_stats|create_target_machine|create_host|_verify_connectivity|run|run_very_slowly'
11/20 07:37:03.248 INFO | ssh_multiplex:0096| Starting master ssh connection '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_kZQi_3ssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 9228 127.0.0.1'
11/20 07:37:03.248 DEBUG| utils:0219| Running '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_kZQi_3ssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 9228 127.0.0.1'
11/20 07:37:03.465 ERROR| utils:0287| [stderr] mm_send_fd: file descriptor passing not supported
11/20 07:37:03.465 ERROR| utils:0287| [stderr] mux_client_request_session: send fds failed
11/20 07:37:03.467 WARNI| factory:0219| Failed to verify connectivity to host. Skipping host auto detection logic.

SSH multiplexing doesn't appear to be working properly on the host. I have a feeling that something is misconfigured on swarm-cros-417. I'll check to see if any tests are succeeding at all on that machine.
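If someone wants to probe a swarming host directly, here is a rough standalone reproduction of the master/mux pattern from the log above (a sketch, not autotest code; port 9228 and the 127.0.0.1 target are copied from the log and assume a forwarded VM ssh port is actually listening on the machine being tested):

# Hypothetical standalone check of ssh ControlMaster multiplexing on a host.
import os
import subprocess
import tempfile
import time

control_dir = tempfile.mkdtemp(prefix='ssh-master-check')
control_path = os.path.join(control_dir, 'socket')

# Start a master connection, mirroring the options from the log above.
master = subprocess.Popen([
    '/usr/bin/ssh', '-a', '-x', '-N',
    '-o', 'ControlMaster=yes', '-o', 'ControlPath=%s' % control_path,
    '-o', 'StrictHostKeyChecking=no', '-o', 'UserKnownHostsFile=/dev/null',
    '-o', 'BatchMode=yes',
    '-l', 'root', '-p', '9228', '127.0.0.1'])
time.sleep(5)  # give the master connection time to establish

# Run a trivial command through the mux socket; "mm_send_fd: file descriptor
# passing not supported" / "mux_client_request_session: send fds failed" on
# stderr here would confirm the multiplexing problem on this host.
status = subprocess.call([
    '/usr/bin/ssh', '-o', 'ControlPath=%s' % control_path,
    '-l', 'root', '-p', '9228', '127.0.0.1', 'test', ':'])
print('mux session exit status: %d' % status)
master.terminate()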
Nov 21
For the record, lakitu-gpu-paladin-tryjob succeeded: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8929297081355191792
Nov 21
If ToT is passing, then either the error is intermittent (which hints at infra flakiness) or some CL in the previous CQ run was bad. In either case, someone from cros infra would be a better owner for this imo, and we can make lakitu-gpu-paladin important again.
Nov 21
If it passes another run or two, it seems like making it important again makes sense.
Nov 26
The hosts swarm-cros-378 and swarm-cros-417 seem broken:
https://chrome-swarming.appspot.com/bot?id=swarm-cros-378&sort_stats=total%3Adesc
https://chrome-swarming.appspot.com/bot?id=swarm-cros-417&sort_stats=total%3Adesc
Builds on both of these hosts are failing consistently for lakitu-gpu-paladin and other configs. Meanwhile, lakitu-gpu-paladin succeeded on swarm-cros-375 and swarm-cros-414. Would it make sense to make these hosts unschedulable for now and see if the problem continues to occur? The nature of the lakitu-gpu-paladin error messages suggests an issue with the host as well.
Nov 26
On the surface, all bots are identical in their configuration and base image. That is not to say they are totally "clean", since they reuse a previous checkout. Those bots could be rebuilt, but that would raise the question: what is causing these VM tests to fail only for lakitu builds? There really hasn't been a good baseline for determining that a bot is good; the CQ has not been healthy for weeks, and these bots are only associated with CQ runs. Refreshing bots just because we assume something is wrong with them is a costly endeavor, as it means the next execution will take an additional 40 minutes for the initial checkout. -- Mike
Nov 27
"ssh multiplexing failed" is tracked at issue 906289 . Is this a dupe of that bug?
Nov 27
Yeah, this looks like a dupe of issue 906289 to me. Thanks for pointing that out!