New issue
Advanced search Search tips

Issue 865171 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Jul 31
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: ----



Sign in to add a comment

Skylab: ssh timeout causes test lasts too long and admin task timeout.

Project Member Reported by xixuan@chromium.org, Jul 18

Issue description

07/18 10:38:02.368 DEBUG|             utils:0219| Running '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/usr/local/google/home/chromeos-test/skylab_bots/d6409/w/itVGiCYs/_autotmp_Z_6Dt7ssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos6-row3-rack12-host15'
07/18 10:38:07.230 ERROR|             utils:2723| Timed out waiting for condition: Wait for a socket file to exist
07/18 10:38:07.230 INFO |     ssh_multiplex:0112| Timed out waiting for master-ssh connection to be established.
07/18 10:38:10.806 INFO |        base_label:0199| checking label ModelLabel
07/18 10:38:10.816 DEBUG|          ssh_host:0301| Running (ssh) 'cat /etc/lsb-release' from 'get|generate_labels|_parse_lsb_output|parse_cmd_output|run|run_very_slowly'
07/18 10:38:10.822 INFO |     ssh_multiplex:0079| Master ssh connection to chromeos6-row3-rack12-host15 is down.
07/18 10:38:10.822 DEBUG|     ssh_multiplex:0124| Nuking ssh master_job
07/18 10:38:10.822 DEBUG|     ssh_multiplex:0129| Cleaning ssh master_tempdir
07/18 10:38:10.823 INFO |     ssh_multiplex:0095| Starting master ssh connection '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/usr/local/google/home/chromeos-test/skylab_bots/d6409/w/itVGiCYs/_autotmp_rmtCEzssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos6-row3-rack12-host15'
07/18 10:38:10.823 DEBUG|             utils:0219| Running '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/usr/local/google/home/chromeos-test/skylab_bots/d6409/w/itVGiCYs/_autotmp_rmtCEzssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos6-row3-rack12-host15'
07/18 10:38:15.682 ERROR|             utils:2723| Timed out waiting for condition: Wait for a socket file to exist
07/18 10:38:15.682 INFO |     ssh_multiplex:0112| Timed out waiting for master-ssh connection to be established.
07/18 10:38:19.243 DEBUG|          ssh_host:0301| Running (ssh) 'cros_config / test-label' from 'update_labels|get_labels|get|generate_labels|run|run_very_slowly'
07/18 10:38:19.248 INFO |     ssh_multiplex:0079| Master ssh connection to chromeos6-row3-rack12-host15 is down.
07/18 10:38:19.248 DEBUG|     ssh_multiplex:0124| Nuking ssh master_job
07/18 10:38:19.249 DEBUG|     ssh_multiplex:0129| Cleaning ssh master_tempdir
07/18 10:38:19.249 INFO |     ssh_multiplex:0095| Starting master ssh connection '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/usr/local/google/home/chromeos-test/skylab_bots/d6409/w/itVGiCYs/_autotmp_JTN2Nfssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos6-row3-rack12-host15'

Summary: Skylab: ssh timeout causes test lasts too long and admin task timeout. (was: Skylab admin task execution timeout is too short?)
Single test lasts particular longer than normal due to ssh timeout: "Timed out waiting for condition: Wait for a socket file to exist"

https://sponge-qa.corp.googleusercontent.com/file?invocationId=987dd4b4-2d6b-4054-a22e-26ee410e14f4&targetResultIndex=0&largeText=logs%2Fdebug%2Fautoserv.DEBUG&filename=autoserv.DEBUG&contentType=text%2Fplain%3B%20charset%3Dutf-8&format=ansi&setTZID=America%2FLos_Angeles

07/18 14:13:24.715 INFO |     ssh_multiplex:0095| Starting master ssh connection '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/usr/local/google/home/chromeos-test/skylab_bots/d2489/w/itrd33WC/_autotmp_eQxdK9ssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row7-rack7-host7'
07/18 14:13:24.716 DEBUG|             utils:0219| Running '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/usr/local/google/home/chromeos-test/skylab_bots/d2489/w/itrd33WC/_autotmp_eQxdK9ssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row7-rack7-host7'
07/18 14:13:29.579 ERROR|             utils:2723| Timed out waiting for condition: Wait for a socket file to exist
07/18 14:13:29.580 INFO |     ssh_multiplex:0112| Timed out waiting for master-ssh connection to be established.
07/18 14:13:31.137 DEBUG|             utils:0287| [stdout] 3.10.18
07/18 14:13:31.141 INFO |        server_job:0216| INFO	----	----	kernel=3.10.18	localtime=Jul 18 14:13:31	timestamp=1531948411	
07/18 14:13:31.153 DEBUG|          ssh_host:0301| Running (ssh) 'true' from '_install|wait_up|is_up|ssh_ping|run|run_very_slowly'
07/18 14:13:31.159 INFO |     ssh_multiplex:0079| Master ssh connection to chromeos4-row7-rack7-host7 is down.
07/18 14:13:31.160 DEBUG|     ssh_multiplex:0124| Nuking ssh master_job
07/18 14:13:31.160 DEBUG|     ssh_multiplex:0129| Cleaning ssh master_tempdir
07/18 14:13:31.160 INFO |     ssh_multiplex:0095| Starting master ssh connection '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/usr/local/google/home/chromeos-test/skylab_bots/d2489/w/itrd33WC/_autotmp_9zykVWssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row7-rack7-host7'
07/18 14:13:31.160 DEBUG|             utils:0219| Running '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/usr/local/google/home/chromeos-test/skylab_bots/d2489/w/itrd33WC/_autotmp_9zykVWssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row7-rack7-host7'
07/18 14:13:36.026 ERROR|             utils:2723| Timed out waiting for condition: Wait for a socket file to exist
07/18 14:13:36.026 INFO |     ssh_multiplex:0112| Timed out waiting for master-ssh connection to be established.
07/18 14:13:37.630 DEBUG|      abstract_ssh:0670| Host chromeos4-row7-rack7-host7 is now up
07/18 14:13:37.630 INFO |          autotest:0334| Installing autotest on chromeos4-row7-rack7-host7
07/18 14:13:37.641 DEBUG|          ssh_host:0301| Running (ssh) 'test -x /usr/local/autotest/bin/autotest' from 'install|_install|get_install_dir|get_installed_autodir|run|run_very_slowly'
07/18 14:13:37.647 INFO |     ssh_multiplex:0079| Master ssh connection to chromeos4-row7-rack7-host7 is down.
07/18 14:13:37.647 DEBUG|     ssh_multiplex:0124| Nuking ssh master_job
07/18 14:13:37.647 DEBUG|     ssh_multiplex:0129| Cleaning ssh master_tempdir
07/18 14:13:37.648 INFO |     ssh_multiplex:0095| Starting master ssh connection '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/usr/local/google/home/chromeos-test/skylab_bots/d2489/w/itrd33WC/_autotmp_H9qWSsssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row7-rack7-host7'
07/18 14:13:37.648 DEBUG|             utils:0219| Running '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/usr/local/google/home/chromeos-test/skylab_bots/d2489/w/itrd33WC/_autotmp_H9qWSsssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row7-rack7-host7'
07/18 14:13:42.511 ERROR|             utils:2723| Timed out waiting for condition: Wait for a socket file to exist


It looks like we're trying to create the ssh socket in an nonexistent directory:

/usr/local/google/home/chromeos-test/skylab_bots/d6409/w/itVGiCYs/_autotmp_Z_6Dt7ssh-master/socket

This is supposed to be a freshly created tempdir, so I would expect a different exception here, from creating the tempdir.
Re Comment 3, that's probably wrong.  The directory probably does exist, but there's probably something wrong with SSH, and the code makes debugging that a pain since all output is thrown away by design.
Owner: ayatane@chromium.org
Re #4: No logs should be thrown away. Swarming workdirs are currently not cleaned up at all. This hasn't been a huge problem because there isn't that much log data in them.

--> ayatane@ who is working this bug now.
Looks like setting up the master takes more time than the timeout due to

debug1: ssh: Could not resolve hostname chromeos4-row7-rack8-host7.corp.google.com.: Name or service not known
debug1: ssh: Could not resolve hostname chromeos4-row7-rack8-host7.prod.google.com.: Name or service not known
debug1: ssh: Could not resolve hostname chromeos4-row7-rack8-host7.prodz.google.com.: Name or service not known
debug1: ssh: Could not resolve hostname chromeos4-row7-rack8-host7.net.google.com.: Name or service not known
debug1: ssh: Could not resolve hostname chromeos4-row7-rack8-host7.google.com.: Name or service not known
debug1: Re-reading configuration after hostname canonicalisation

Going to monkeypatch drone 1 so I can get ssh logs, I suspect SSH is just taking longer than 5 seconds so we keep killing it before it can get far enough along
Oh boy

unix_listener: "/usr/local/google/home/chromeos-test/skylab_bots/d493/w/itA0ohCW/_autotmp_QhD4zZssh-master/socket.qB5PIblT24Hl2mT5" too long for Unix domain socket

$ cat /usr/include/linux/un.h | grep "define UNIX_PATH_MAX"
#define UNIX_PATH_MAX	108
Status: Started (was: Assigned)
Project Member

Comment 11 by bugdroid1@chromium.org, Jul 20

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/cbe02d9bffa9894291a42613b260690df68cfc4e

commit cbe02d9bffa9894291a42613b260690df68cfc4e
Author: Allen Li <ayatane@chromium.org>
Date: Fri Jul 20 12:31:08 2018

autotest: Use /tmp for SSH sockets

Previously, we use the "default temp directory".  This is platform
dependent, but crucially can vary based on TMPDIR, TEMP, or TMP
environment variables.  I'm pretty sure Swarming sets this to a
different directory for each task.

The problem is that paths for Unix sockets are limited to 108 bytes,
not the standard 4096 for regular paths.

The Python documentations says:

The default directory is chosen from a platform-dependent list, but
the user of the application can control the directory location by
setting the TMPDIR, TEMP or TMP environment variables. There is thus
no guarantee that the generated filename will have any nice
properties, such as not requiring quoting when passed to external
commands via os.popen().

The key is that there is no guarantee that the path can fit
comfortably in 108 bytes either.

So we should hard code to use /tmp instead of relying on the default.

Note that this does make us dependent on POSIX platforms, but Autotest
isn't really portable anyway.

BUG= chromium:865171 
TEST=None

Change-Id: I79af4285e055846a87b37e35bdb0ce6957c07417
Reviewed-on: https://chromium-review.googlesource.com/1144404
Commit-Ready: Allen Li <ayatane@chromium.org>
Tested-by: Allen Li <ayatane@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/cbe02d9bffa9894291a42613b260690df68cfc4e/server/hosts/ssh_multiplex.py

Status: Fixed (was: Started)

Sign in to add a comment