
Issue 865511

Starred by 4 users

Issue metadata

Status: Fixed
Owner:
Closed: Oct 16
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




VMTests are failing because /bin/uname does not exist on image

Project Member Reported by mukai@chromium.org, Jul 19

Issue description

I've seen lots of error messages like:

23:14:36 INFO | autoserv| Running (ssh) 'python -c 'import cPickle, glob, sys;cPickle.dump(glob.glob(sys.argv[1]), sys.stdout, 0)'' from 'crashinfo|report_crashdumps|_find_orphaned_crashdumps|list_files_glob|run|run_very_slowly'
23:14:36 INFO | autoserv| [stderr] mm_send_fd: file descriptor passing not supported
23:14:36 INFO | autoserv| [stderr] mux_client_request_session: send fds failed
23:14:36 INFO | autoserv| Nuking ssh master_job
23:14:37 INFO | autoserv| Cleaning ssh master_tempdir
23:14:37 INFO | autoserv| Traceback (most recent call last):
23:14:37 INFO | autoserv| File "/build/amd64-generic/usr/local/build/autotest/server/autoserv", line 603, in run_autoserv
23:14:37 INFO | autoserv| use_packaging=(not no_use_packaging))
23:14:37 INFO | autoserv| File "/build/amd64-generic/usr/local/build/autotest/server/server_job.py", line 840, in run
23:14:37 INFO | autoserv| self._collect_crashes(namespace, collect_crashinfo)
23:14:37 INFO | autoserv| File "/build/amd64-generic/usr/local/build/autotest/server/server_job.py", line 704, in _collect_crashes
23:14:37 INFO | autoserv| self._execute_code(crash_control_file, namespace)
23:14:37 INFO | autoserv| File "/build/amd64-generic/usr/local/build/autotest/server/server_job.py", line 1326, in _execute_code
23:14:37 INFO | autoserv| execfile(code_file, namespace, namespace)
23:14:37 INFO | autoserv| File "/build/amd64-generic/usr/local/build/autotest/server/control_segments/crashinfo", line 14, in <module>
23:14:37 INFO | autoserv| job.parallel_simple(crashinfo, machines, log=False)
23:14:37 INFO | autoserv| File "/build/amd64-generic/usr/local/build/autotest/server/server_job.py", line 611, in parallel_simple
23:14:37 INFO | autoserv| log=log, timeout=timeout, return_results=return_results)
23:14:37 INFO | autoserv| File "/build/amd64-generic/usr/local/build/autotest/server/subcommand.py", line 98, in parallel_simple
23:14:37 INFO | autoserv| function(arg)
23:14:37 INFO | autoserv| File "/build/amd64-generic/usr/local/build/autotest/server/control_segments/crashinfo", line 10, in crashinfo
23:14:37 INFO | autoserv| crashcollect.report_crashdumps(host)
23:14:37 INFO | autoserv| File "/build/amd64-generic/usr/local/build/autotest/server/site_crashcollect.py", line 182, in report_crashdumps
23:14:37 INFO | autoserv| for crashfile in _find_orphaned_crashdumps(host):
23:14:37 INFO | autoserv| File "/build/amd64-generic/usr/local/build/autotest/server/site_crashcollect.py", line 171, in _find_orphaned_crashdumps
23:14:37 INFO | autoserv| return host.list_files_glob(os.path.join(constants.CRASH_DIR, '*'))
23:14:37 INFO | autoserv| File "/build/amd64-generic/usr/local/build/autotest/client/common_lib/hosts/base_classes.py", line 579, in list_files_glob
23:14:37 INFO | autoserv| timeout=60).stdout
23:14:37 INFO | autoserv| File "/build/amd64-generic/usr/local/build/autotest/server/hosts/ssh_host.py", line 323, in run
23:14:37 INFO | autoserv| return self.run_very_slowly(*args, **kwargs)
23:14:37 INFO | autoserv| File "/build/amd64-generic/usr/local/build/autotest/server/hosts/ssh_host.py", line 312, in run_very_slowly
23:14:37 INFO | autoserv| ssh_failure_retry_ok)
23:14:37 INFO | autoserv| File "/build/amd64-generic/usr/local/build/autotest/server/hosts/ssh_host.py", line 262, in _run
23:14:37 INFO | autoserv| raise error.AutoservRunError("command execution error", result)
23:14:37 INFO | autoserv| AutoservRunError: command execution error
23:14:37 INFO | autoserv| * Command:
23:14:37 INFO | autoserv| /usr/bin/ssh -a -x  -o ControlPath=/tmp/_autotmp_IWIq5essh-master/socket
23:14:37 INFO | autoserv| -o Protocol=2 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
23:14:37 INFO | autoserv| -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o
23:14:37 INFO | autoserv| ServerAliveCountMax=3 -o ConnectionAttempts=4 -l root -p 9227 127.0.0.1
23:14:37 INFO | autoserv| "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then
23:14:37 INFO | autoserv| logger -tag \"autotest\"
23:14:37 INFO | autoserv| \"server[stack::_find_orphaned_crashdumps|list_files_glob|run] ->
23:14:37 INFO | autoserv| ssh_run(python -c 'import cPickle, glob,
23:14:37 INFO | autoserv| sys;cPickle.dump(glob.glob(sys.argv[1]), sys.stdout, 0)')\";fi; python -c
23:14:37 INFO | autoserv| 'import cPickle, glob, sys;cPickle.dump(glob.glob(sys.argv[1]),
23:14:37 INFO | autoserv| sys.stdout, 0)' \"/var/spool/crash/*\""
23:14:37 INFO | autoserv| Exit status: 255


Still not sure why this happens.
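For anyone replaying this by hand, the failing remote call can be re-run against the forwarded VM port without the control-master socket, which separates a multiplexing/transport failure from a problem with the command itself (a rough sketch; the port 9227 and loopback address come from the log above and will differ per run):

$ ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null \
    -l root -p 9227 127.0.0.1 \
    "python -c 'import cPickle, glob, sys; cPickle.dump(glob.glob(sys.argv[1]), sys.stdout, 0)' '/var/spool/crash/*'"

If that succeeds while autoserv's multiplexed run fails with the mm_send_fd errors, the problem is on the ssh client side rather than on the DUT.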
Summary: VMTests are failing because /bin/uname does not exist on image (was: amd64-generic-chromium-pfq failures 7/19)
07/18 23:41:23.673 ERROR|        server_job:0811| Exception escaped control file, job aborting:
Traceback (most recent call last):
  File "/build/amd64-generic/usr/local/build/autotest/server/server_job.py", line 803, in run
    self._execute_code(server_control_file, namespace)
  File "/build/amd64-generic/usr/local/build/autotest/server/server_job.py", line 1326, in _execute_code
    execfile(code_file, namespace, namespace)
  File "/tmp/cbuildbotZG0hw9/pfq_suite/test_harness/all/SimpleTestUpdateAndVerify/2_autotest_tests/results-01-security_NetworkListeners/control.srv", line 10, in <module>
    job.parallel_simple(run_client, machines)
  File "/build/amd64-generic/usr/local/build/autotest/server/server_job.py", line 611, in parallel_simple
    log=log, timeout=timeout, return_results=return_results)
  File "/build/amd64-generic/usr/local/build/autotest/server/subcommand.py", line 98, in parallel_simple
    function(arg)
  File "/tmp/cbuildbotZG0hw9/pfq_suite/test_harness/all/SimpleTestUpdateAndVerify/2_autotest_tests/results-01-security_NetworkListeners/control.srv", line 6, in run_client
    host.log_kernel()
  File "/build/amd64-generic/usr/local/build/autotest/client/common_lib/hosts/base_classes.py", line 546, in log_kernel
    kernel = self.get_kernel_ver()
  File "/build/amd64-generic/usr/local/build/autotest/client/common_lib/hosts/base_classes.py", line 500, in get_kernel_ver
    cmd_uname = path_utils.must_be_installed('/bin/uname', host=self)
  File "/build/amd64-generic/usr/local/build/autotest/client/common_lib/cros/path_utils.py", line 66, in must_be_installed
    raise error.TestError(error_msg)
TestError: Unable to find /bin/uname on 127.0.0.1
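For what it's worth, the check that trips here presumably boils down to asking the DUT whether /bin/uname exists and is executable; something along these lines (a sketch, not the exact autotest helper) shows what the host-side code would see:

$ ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null \
    -l root -p 9227 127.0.0.1 'test -x /bin/uname && echo present || echo missing'

Note that if the underlying ssh run itself fails (exit 255, as with the mux errors above), the helper can report the binary as missing even though it exists on the image, which would be consistent with the later finding that /bin/uname is present on the downloaded image.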
Cc: mikenichols@chromium.org achuith@chromium.org
Labels: -Pri-3 Pri-1
Owner: pmalani@chromium.org
That check is pretty old. Someone needs to go figure out which CL removed /bin/uname from the image, and why it wasn't caught in the CQ.

--> sheriffs@

Along the same lines, I'd think this will break on the CQ as well...

Cc: mukai@chromium.org
Labels: -Pri-1 Pri-0
The broken Pre-CQ is blocking CLs, e.g. https://chrome-internal-review.googlesource.com/c/chromeos/overlays/overlay-novato-private/+/653354

I think betty-pre-cq runs on almost all CLs, so that will probably block just about everyone.

Raising the priority.
I downloaded the chromiumos_qemu_image.bin of the failed build; /bin/uname exists there.
Cc: dgarr...@chromium.org
betty-paladin is running VMTests just fine.

This looks like an environment quirk of pre-cq/pfq vs paladins.

+dgarrett
Cc: -mikenichols@chromium.org pmalani@chromium.org
Owner: mikenichols@chromium.org
If my guess in #8 is correct, the CI bobby is the most knowledgeable party here.
Recent autotest CLs:
https://chromium-review.googlesource.com/q/project:chromiumos/third_party/autotest+status:merged

Nothing stands out to me as being suspicious.
The paladins run on Golo builders (physical hardware), the PreCQ builds run on GCE instances.

That seems like the most relevant difference in environments.
test -f /mnt/stateful_partition/.android_tester' from 'create_target_machine|create_host|_detect_host|check_host|run|run_very_slowly
grep CHROMEOS_RELEASE_BOARD /etc/lsb-release' from 'create_target_machine|create_host|_detect_host|check_host|run|run_very_slowly
Unable to apply conventional host detection methods, defaulting to chromeos host.

What does /etc/lsb-release look like in the image you downloaded?
lsb-release also looks normal:

localhost ~ # cat /etc/lsb-release
CHROMEOS_RELEASE_BUILDER_PATH=amd64-generic-chromium-pfq/R69-10890.0.0-rc2
GOOGLE_RELEASE=10890.0.0-rc2
CHROMEOS_DEVSERVER=http://swarm-cros-526.c.chromeos-bot.internal:8080
CHROMEOS_RELEASE_BOARD=amd64-generic
CHROMEOS_RELEASE_BUILD_NUMBER=10890
CHROMEOS_RELEASE_BRANCH_NUMBER=0
CHROMEOS_RELEASE_CHROME_MILESTONE=69
CHROMEOS_RELEASE_PATCH_NUMBER=0-rc2
CHROMEOS_RELEASE_TRACK=testimage-channel
CHROMEOS_RELEASE_DESCRIPTION=10890.0.0-rc2 (Continuous Builder - Builder: N/A) amd64-generic
CHROMEOS_RELEASE_NAME=Chromium OS
CHROMEOS_RELEASE_BUILD_TYPE=Continuous Builder - Builder: N/A
CHROMEOS_RELEASE_VERSION=10890.0.0-rc2
CHROMEOS_AUSERVER=http://swarm-cros-526.c.chromeos-bot.internal:8080/update


grep works as expected

localhost ~ # grep CHROMEOS_RELEASE_BOARD /etc/lsb-release
CHROMEOS_RELEASE_BOARD=amd64-generic


I obtained the chromiumos_qemu_image.tar.xz from the artifacts of https://stainless.corp.google.com/browse/chromeos-image-archive/amd64-generic-chromium-pfq/R69-10890.0.0-rc2

I have a suspicion that there's some kind of networking issue between the host and the VM.
I see lots of
03:37:19 INFO | autoserv| [stderr] mm_send_fd: file descriptor passing not supported
03:37:19 INFO | autoserv| [stderr] mux_client_request_session: send fds failed

These are OpenSSH error messages. Agreed with #14.
I'm a little confused. I do agree those appear to be SSH issues and relate to a network problem. The confusing part, not knowing the autotest suite, is that the connection seems to be local:

04:41:54 INFO | autoserv| /usr/bin/ssh -a -x  -o ControlPath=/tmp/_autotmp_vqwDxDssh-master/socket
04:41:54 INFO | autoserv| -o Protocol=2 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
04:41:54 INFO | autoserv| -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o
04:41:54 INFO | autoserv| ServerAliveCountMax=3 -o ConnectionAttempts=4 -l root -p 9227 127.0.0.1
04:41:54 INFO | autoserv| "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then
04:41:54 INFO | autoserv| logger -tag \"autotest\"
04:41:54 INFO | autoserv| \"server[stack::_find_orphaned_crashdumps|list_files_glob|run] ->
04:41:54 INFO | autoserv| ssh_run(python -c 'import cPickle, glob,
04:41:54 INFO | autoserv| sys;cPickle.dump(glob.glob(sys.argv[1]), sys.stdout, 0)')\";fi; python -c
04:41:54 INFO | autoserv| 'import cPickle, glob, sys;cPickle.dump(glob.glob(sys.argv[1]),
04:41:54 INFO | autoserv| sys.stdout, 0)' \"/var/spool/crash/*\""
04:41:54 INFO | autoserv| Exit status: 255

I assume there is a major component involved that I'm missing. Does anyone understand the autotest process well enough to explain where we think the network issue resides?

-- Mike
No idea if it's a similar issue or not, but FYI we've had VM networking issues on specific hosts because of IP "conflicts": https://bugs.chromium.org/p/chromium/issues/detail?id=808045
I'm assuming it's local because it's qemu doing the port forwarding. So the question is: is qemu loading the correct image, or is the port we are connecting to the correct one?
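For context, the VM's sshd is reached via qemu user-mode port forwarding on loopback; the setup is roughly of this shape (a sketch with assumed flags and filenames, not the exact invocation the test harness uses):

$ qemu-system-x86_64 -enable-kvm \
    -drive file=chromiumos_qemu_image.bin,format=raw \
    -netdev user,id=eth0,hostfwd=tcp:127.0.0.1:9227-:22 \
    -device virtio-net-pci,netdev=eth0

If the hostfwd rule pointed at the wrong guest or port, every ssh to 127.0.0.1:9227 would fail regardless of what is on the image.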
Pre-CQ doesn't always fail, https://chrome-internal-review.googlesource.com/c/chromeos/overlays/overlay-novato-private/+/653354 just passed the Pre-CQ and it was failing before.
I'm starting to think that the ssh errors are legit:
autoserv| [stderr] mm_send_fd: file descriptor passing not supported
autoserv| [stderr] mux_client_request_session: send fds failed

It looks like the control master connection is opened correctly, but the subsequent ssh commands that use the control connection are failing because they can't get an fd from the master.

Looking at the successful test run for novato-pre-cq (https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8940558702404076640), it looks like it doesn't have the fd-passing problems.

Cc: rrangel@chromium.org
The GCE instances may be running a different kernel.
Looks like maybe we have a bad version of ssh?

https://cs.corp.google.com/piper///depot/google3/third_party/openssh/openssh7_6p1/monitor_fdpass.c?type=cs&q=%22file+descriptor+passing+not+supported%22&g=0&l=107

That is a compile-time configuration. Is it possible to ssh into the box that was running the test?
swarm-cros-0 is our staging builder; this seems like a reasonable use.

You can launch builds against it using "cros tryjob --staging"

I just reformatted swarm-cros-0 back to the standard config. As soon as puppet finishes, it should be identical to all other swarm bots.

Looks like I don't have permissions to see that page. I tried both @google.com and @chromiumos.org. 
Kernel versions could differ, but the kernel versions on the GCE instances are all the same; 282 was successful while 587 failed. It does appear to be network related, as the failing runs fail to read the release board:

Successful:

11:10:21 INFO | autoserv| Get master ssh connection for root@127.0.0.1:9228
11:10:21 INFO | autoserv| Running (ssh) 'grep -q CHROMEOS /etc/lsb-release && ! test -f /mnt/stateful_partition/.android_tester && ! grep -q moblab /etc/lsb-release' from 'create_target_machine|create_host|_detect_host|check_host|run|run_very_slowly'
11:10:21 INFO | autoserv| Starting master ssh connection '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_MnicfHssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 9228 127.0.0.1'
11:10:21 INFO | autoserv| Running '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_MnicfHssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 9228 127.0.0.1'
11:10:22 INFO | autoserv| Running (ssh) 'grep CHROMEOS_RELEASE_BOARD /etc/lsb-release' from 'create_target_machine|create_host|_detect_host|check_host|run|run_very_slowly'
11:10:23 INFO | autoserv| [stdout] CHROMEOS_RELEASE_BOARD=novato

Failure:
03:49:45 INFO | autoserv| Get master ssh connection for root@127.0.0.1:9227
03:49:45 INFO | autoserv| Running (ssh) 'grep -q CHROMEOS /etc/lsb-release && ! test -f /mnt/stateful_partition/.android_tester && ! grep -q moblab /etc/lsb-release' from 'create_target_machine|create_host|_detect_host|check_host|run|run_very_slowly'
03:49:45 INFO | autoserv| Starting master ssh connection '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_tIJn8Yssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 9227 127.0.0.1'
03:49:45 INFO | autoserv| Running '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_tIJn8Yssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 9227 127.0.0.1'
03:49:46 INFO | autoserv| [stderr] mm_send_fd: file descriptor passing not supported

The only thing that appears obvious is the port difference for the master; 9228 versus 9227.

-- Mike
Cc: zamorzaev@chromium.org
Cc: ihf@chromium.org
Adding ihf@, as he has had some experience with VMTests in the past.

-- Mike
Labels: Restrict-View-Google
Note that the swarming builders are not uniform.

Focusing on just the novato-pre-cq failures for the CL in #19:

Failed on cros-swarm-72

pprabhu@swarm-cros-72:~$ ssh -V
OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.10, OpenSSL 1.0.1f 6 Jan 2014
pprabhu@swarm-cros-72:~$ uname -a
Linux swarm-cros-72 3.13.0-147-generic #196-Ubuntu SMP Wed May 2 15:51:34 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Passed on cros-swarm-282
pprabhu@swarm-cros-282:~$ ssh -V
OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.10, OpenSSL 1.0.1f 6 Jan 2014
pprabhu@swarm-cros-282:~$ uname -a
Linux swarm-cros-282 3.13.0-129-generic #178-Ubuntu SMP Fri Aug 11 12:48:20 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

So the kernel versions are different.


Moreover, the openssh version on both of these is different from the staging builder that dgarrett@ posted above:
pprabhu@swarm-cros-0:~$ ssh -V
OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8, OpenSSL 1.0.1f 6 Jan 2014
pprabhu@swarm-cros-0:~$ uname -a
Linux swarm-cros-0 3.13.0-129-generic #178-Ubuntu SMP Fri Aug 11 12:48:20 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

That openssl version is ancient.

What gives?
The versions should be aligned but staging is a bit of an outlier:

mikenichols@swarm-cros-282:~$ ssh -V
OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.10, OpenSSL 1.0.1f 6 Jan 2014

mikenichols@swarm-cros-282:~$ uname -a
Linux swarm-cros-282 3.13.0-129-generic #178-Ubuntu SMP Fri Aug 11 12:48:20 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

mikenichols@swarm-cros-587:~$ ssh -V
OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.10, OpenSSL 1.0.1f 6 Jan 2014

mikenichols@swarm-cros-587:~$ uname -a
Linux swarm-cros-587 3.13.0-129-generic #178-Ubuntu SMP Fri Aug 11 12:48:20 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Both run VM tests, different boards, and one passes (cros-swarm-282) while the other fails (cros-swarm-587).

The builders are all on Trusty (14.04), which probably explains the openssl version.

-- Mike
Hmm, between #30 and #31, we've eliminated both the kernel version (#31 shows pass / fail with the same kernel) and target board (#30 shows pass / fail with the same CL on the same board) as distinguishing factors.

We're back to square 1. :(
Mike,
Do you have log links for the builds that ran on 282 and 587?
Related to #33: same versions but different boards. I'll try to find a pass on a different host for the same board. It is not quite an apples-to-apples comparison right now.

-- Mike 
Also note that the 9227 vs 9228 port difference is a red herring. It depends on whether SimpleTestUpdateAndVerify (9227) or SimpleTestVerify (9228) is being run.
Same board (betty-arcnext-chrome-pfq):

cros-swarm-405:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8940627171154314096

cros-swarm-587:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8940589955327151968

Kernel/SSH info for 405:
mikenichols@swarm-cros-405:~$ uname -a
Linux swarm-cros-405 3.13.0-129-generic #178-Ubuntu SMP Fri Aug 11 12:48:20 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
mikenichols@swarm-cros-405:~$ ssh -V
OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.10, OpenSSL 1.0.1f 6 Jan 2014

-- Mike
Can one of you with SSH access try setting up a control master connection:

ssh -vv -o ControlPersist=1m -o ControlMaster=yes -o ControlPath=/tmp/control-socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -p 22 127.0.0.1
It should ask you to log in. Once logged in, you can exit.

Then within a minute run: ssh -vv -o ControlPath=/tmp/control-socket -p 22 127.0.0.1

It should let you connect without authenticating.
btw, there is some reason to believe that this was a problem that magically corrected itself ~7:00 this morning.

If you look at all the tasks on betty-pre-cq: 
https://chrome-swarming.appspot.com/tasklist?c=name&c=state&c=created_ts&c=duration&c=pending_time&c=pool&c=bot&et=1532036100000&f=cbb_config-tag%3Abetty-pre-cq&l=50&n=true&q=cbb_config%3Abetty-pre-cq&s=created_ts%3Adesc&st=1531949700000

All builds were failing from 4:00 PM yesterday to ~7:00 AM this morning (PDT)
The failures are intermittent since then, and none have this symptom.
Re #38, the GCE instance doesn't want to allow me to SSH to loopback.
Autotest runs from the chroot, not the host, so we should check the ssh version in the chroot. Maybe we have a bad binary package floating around?
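A quick way to check what the chroot is actually using (a sketch; cros_sdk runs the given command inside the chroot):

$ cros_sdk ssh -V
$ cros_sdk which ssh

Comparing that version string against the chroot of a passing builder would confirm or rule out a bad binary package.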
Status: WontFix (was: Assigned)
betty-arcnext-chrome-pfq and amd64-generic-chromium-pfq are now green. Since we don't have access to the chroot of the failing builds, I don't think there is much else we can do on this bug.
Those builds are now randomly assigned to different swarm bots each time they run.

If we have uneven kernel versions deployed, and the problem is associated with the kernel version, it could reappear randomly based on the swarm bot used for a given build.
Labels: -Pri-0 Pri-2
Status: Assigned (was: WontFix)
I'm going to keep this open, lower the priority, and continue to poke at a few different theories. The concern is that this is flake and may present again in the near future.

-- Mike
FYI, I encountered this ssh issue locally.

How to reproduce (repro rate 100%):
0. On my workstation (z840).
1. repo sync'ed to the 10891.0.0 manifest.
2. Run build_packages.
3. Build an ssh control connection to any Chrome OS DUT (the DUT version doesn't matter):
ssh -vv -o ControlPersist=1m -o ControlMaster=yes -o ControlPath=/tmp/control-socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -p 22 $DUT

Then I got the following error messages:
mm_send_fd: file descriptor passing not supported
mux_client_request_session: send fds failed


I will try a different version later.

Oops, after 'sudo emerge net-misc/openssh' inside the chroot, it works now.
So something must have been wrong such that my ssh inside the chroot was broken.

TL;DR: the ssh inside the Chrome OS SDK has been broken since 10890.0.0.

Details:
1. Via bisect, I found that 10889.0.0 and earlier are good, and 10890.0.0 is broken.
2. Because it's a problem with the SDK, the easiest way to reproduce is to delete the chroot.
Reproduction steps:
 a. cros_sdk --delete
 b. cros_sdk ssh -o ControlPersist=1m -o ControlMaster=yes -o ControlPath=/tmp/control-socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes $DUT_IP true
This line should exit 0.
Since 10890.0.0, this line prints the error messages from comment 45 and exits 255.


Follow-up to comment 47: this issue only exists in 10890.0.0 and 10891.0.0.
The SDK has been good since 10892.0.0.

No idea why ssh in the SDK was broken, though.

Status: Fixed (was: Assigned)
Late closing this out, but it appears it was an issue with the SDK, as stated in comment #48.

-- Mike
Components: -Infra>Client>ChromeOS Infra>Client>ChromeOS>Build
Labels: -Restrict-View-Google
I confirmed that cros-sdk-2018.07.17.223228.tar.xz had a bad ssh build, but cros-sdk-2018.07.16.091630.tar.xz and cros-sdk-2018.07.18.170014.tar.xz were ok.

$ strings cros-sdk-2018.07.16*/usr/bin/ssh | grep file.*passing
$ strings cros-sdk-2018.07.17*/usr/bin/ssh | grep file.*passing
%s: file descriptor passing not supported                       <-- bad
$ strings cros-sdk-2018.07.18*/usr/bin/ssh | grep file.*passing
FYI, I created a separate bug (issue 899490) for the ssh issue because I encountered the same issue again one month ago.
