Skylab: ssh timeout causes test lasts too long and admin task timeout. |
|||||
Issue descriptionSome admin tasks are timedout: https://chrome-swarming.appspot.com/task?id=3ec7396c7c350c10&refresh=10 https://chrome-swarming.appspot.com/task?id=3ec770598840f410&refresh=10 https://chrome-swarming.appspot.com/task?id=3ec7396c4de79c10&refresh=10 they're aborted finally.
,
Jul 18
Single test lasts particular longer than normal due to ssh timeout: "Timed out waiting for condition: Wait for a socket file to exist" https://sponge-qa.corp.googleusercontent.com/file?invocationId=987dd4b4-2d6b-4054-a22e-26ee410e14f4&targetResultIndex=0&largeText=logs%2Fdebug%2Fautoserv.DEBUG&filename=autoserv.DEBUG&contentType=text%2Fplain%3B%20charset%3Dutf-8&format=ansi&setTZID=America%2FLos_Angeles 07/18 14:13:24.715 INFO | ssh_multiplex:0095| Starting master ssh connection '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/usr/local/google/home/chromeos-test/skylab_bots/d2489/w/itrd33WC/_autotmp_eQxdK9ssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row7-rack7-host7' 07/18 14:13:24.716 DEBUG| utils:0219| Running '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/usr/local/google/home/chromeos-test/skylab_bots/d2489/w/itrd33WC/_autotmp_eQxdK9ssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row7-rack7-host7' 07/18 14:13:29.579 ERROR| utils:2723| Timed out waiting for condition: Wait for a socket file to exist 07/18 14:13:29.580 INFO | ssh_multiplex:0112| Timed out waiting for master-ssh connection to be established. 07/18 14:13:31.137 DEBUG| utils:0287| [stdout] 3.10.18 07/18 14:13:31.141 INFO | server_job:0216| INFO ---- ---- kernel=3.10.18 localtime=Jul 18 14:13:31 timestamp=1531948411 07/18 14:13:31.153 DEBUG| ssh_host:0301| Running (ssh) 'true' from '_install|wait_up|is_up|ssh_ping|run|run_very_slowly' 07/18 14:13:31.159 INFO | ssh_multiplex:0079| Master ssh connection to chromeos4-row7-rack7-host7 is down. 07/18 14:13:31.160 DEBUG| ssh_multiplex:0124| Nuking ssh master_job 07/18 14:13:31.160 DEBUG| ssh_multiplex:0129| Cleaning ssh master_tempdir 07/18 14:13:31.160 INFO | ssh_multiplex:0095| Starting master ssh connection '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/usr/local/google/home/chromeos-test/skylab_bots/d2489/w/itrd33WC/_autotmp_9zykVWssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row7-rack7-host7' 07/18 14:13:31.160 DEBUG| utils:0219| Running '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/usr/local/google/home/chromeos-test/skylab_bots/d2489/w/itrd33WC/_autotmp_9zykVWssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row7-rack7-host7' 07/18 14:13:36.026 ERROR| utils:2723| Timed out waiting for condition: Wait for a socket file to exist 07/18 14:13:36.026 INFO | ssh_multiplex:0112| Timed out waiting for master-ssh connection to be established. 07/18 14:13:37.630 DEBUG| abstract_ssh:0670| Host chromeos4-row7-rack7-host7 is now up 07/18 14:13:37.630 INFO | autotest:0334| Installing autotest on chromeos4-row7-rack7-host7 07/18 14:13:37.641 DEBUG| ssh_host:0301| Running (ssh) 'test -x /usr/local/autotest/bin/autotest' from 'install|_install|get_install_dir|get_installed_autodir|run|run_very_slowly' 07/18 14:13:37.647 INFO | ssh_multiplex:0079| Master ssh connection to chromeos4-row7-rack7-host7 is down. 07/18 14:13:37.647 DEBUG| ssh_multiplex:0124| Nuking ssh master_job 07/18 14:13:37.647 DEBUG| ssh_multiplex:0129| Cleaning ssh master_tempdir 07/18 14:13:37.648 INFO | ssh_multiplex:0095| Starting master ssh connection '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/usr/local/google/home/chromeos-test/skylab_bots/d2489/w/itrd33WC/_autotmp_H9qWSsssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row7-rack7-host7' 07/18 14:13:37.648 DEBUG| utils:0219| Running '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/usr/local/google/home/chromeos-test/skylab_bots/d2489/w/itrd33WC/_autotmp_H9qWSsssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos4-row7-rack7-host7' 07/18 14:13:42.511 ERROR| utils:2723| Timed out waiting for condition: Wait for a socket file to exist
,
Jul 19
It looks like we're trying to create the ssh socket in an nonexistent directory: /usr/local/google/home/chromeos-test/skylab_bots/d6409/w/itVGiCYs/_autotmp_Z_6Dt7ssh-master/socket This is supposed to be a freshly created tempdir, so I would expect a different exception here, from creating the tempdir.
,
Jul 19
Re Comment 3, that's probably wrong. The directory probably does exist, but there's probably something wrong with SSH, and the code makes debugging that a pain since all output is thrown away by design.
,
Jul 19
Re #4: No logs should be thrown away. Swarming workdirs are currently not cleaned up at all. This hasn't been a huge problem because there isn't that much log data in them. --> ayatane@ who is working this bug now.
,
Jul 19
Looks like setting up the master takes more time than the timeout due to debug1: ssh: Could not resolve hostname chromeos4-row7-rack8-host7.corp.google.com.: Name or service not known debug1: ssh: Could not resolve hostname chromeos4-row7-rack8-host7.prod.google.com.: Name or service not known debug1: ssh: Could not resolve hostname chromeos4-row7-rack8-host7.prodz.google.com.: Name or service not known debug1: ssh: Could not resolve hostname chromeos4-row7-rack8-host7.net.google.com.: Name or service not known debug1: ssh: Could not resolve hostname chromeos4-row7-rack8-host7.google.com.: Name or service not known debug1: Re-reading configuration after hostname canonicalisation
,
Jul 19
Going to monkeypatch drone 1 so I can get ssh logs, I suspect SSH is just taking longer than 5 seconds so we keep killing it before it can get far enough along
,
Jul 19
Oh boy unix_listener: "/usr/local/google/home/chromeos-test/skylab_bots/d493/w/itA0ohCW/_autotmp_QhD4zZssh-master/socket.qB5PIblT24Hl2mT5" too long for Unix domain socket
,
Jul 19
$ cat /usr/include/linux/un.h | grep "define UNIX_PATH_MAX" #define UNIX_PATH_MAX 108
,
Jul 19
,
Jul 20
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/cbe02d9bffa9894291a42613b260690df68cfc4e commit cbe02d9bffa9894291a42613b260690df68cfc4e Author: Allen Li <ayatane@chromium.org> Date: Fri Jul 20 12:31:08 2018 autotest: Use /tmp for SSH sockets Previously, we use the "default temp directory". This is platform dependent, but crucially can vary based on TMPDIR, TEMP, or TMP environment variables. I'm pretty sure Swarming sets this to a different directory for each task. The problem is that paths for Unix sockets are limited to 108 bytes, not the standard 4096 for regular paths. The Python documentations says: The default directory is chosen from a platform-dependent list, but the user of the application can control the directory location by setting the TMPDIR, TEMP or TMP environment variables. There is thus no guarantee that the generated filename will have any nice properties, such as not requiring quoting when passed to external commands via os.popen(). The key is that there is no guarantee that the path can fit comfortably in 108 bytes either. So we should hard code to use /tmp instead of relying on the default. Note that this does make us dependent on POSIX platforms, but Autotest isn't really portable anyway. BUG= chromium:865171 TEST=None Change-Id: I79af4285e055846a87b37e35bdb0ce6957c07417 Reviewed-on: https://chromium-review.googlesource.com/1144404 Commit-Ready: Allen Li <ayatane@chromium.org> Tested-by: Allen Li <ayatane@chromium.org> Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org> Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org> [modify] https://crrev.com/cbe02d9bffa9894291a42613b260690df68cfc4e/server/hosts/ssh_multiplex.py
,
Jul 26
I'm not sure whether this is the same case, 3 bots are still experiencing TIMED_OUT issue in admin tasks: https://chrome-swarming.appspot.com/bot?id=chromeos-skylab-bot-0d4bd243-de01-4d33-8d42-c8a795cf7e0e&sort_stats=total%3Adesc https://chrome-swarming.appspot.com/bot?id=chromeos-skylab-bot-95770639-a879-473d-8509-22e453bc5596&sort_stats=total%3Adesc https://chrome-swarming.appspot.com/bot?id=chromeos-skylab-bot-82709c0b-67a9-4d15-9e9b-581c5ef7510a&sort_stats=total%3Adesc
,
Jul 31
|
|||||
►
Sign in to add a comment |
|||||
Comment 1 by ayatane@chromium.org
, Jul 18