cros_host:firmware_install incorrectly accesses devservers directly -- fails for devserver in different subnet
Issue description
Hi Deputy, The following devservers have been rebuilt using the same hostname. Not sure if any modification needs to be made to the config file or just a confirmation.
172.27.215.249 - chromeos1-infra-devserver2
172.27.215.248 - chromeos1-infra-devserver3
Thanks, Joe
,
Nov 11 2016
It might be a configuration problem. Not sure how it works, but these devservers have the same hostnames as the previous devservers. Since they have been rebuilt, I'm not sure if they have to be removed and re-added to the config file to get these into production.
,
Nov 11 2016
The two devservers were not removed from the config file. Not sure if any other configurations are needed. pprabhu@, any advice here?
,
Nov 11 2016
Does not look like a configuration problem. It looks like we tried to stage some artifacts, which succeeded, but then when trying to get the artifacts, we lost connection to the devserver. My next step would be to SSH into the server and look in /usr/local/autotest/logs/ for devserver logs for any more hints. But this devserver is not a ganeti server, is not owned by chromeos-infra, and I do not know how to SSH into it.
Relevant bits of the log:
11/11 09:14:29.429 INFO | base_packages:0183| Fetching client-autotest.tar.bz2 from http://172.27.215.249:8082/static/veyron_jerry-release/R56-8977.0.0/autotest/packages to /tmp/sysinfo/autoserv-BQBbQ3/packages/client-autotest.tar.bz2
...
11/11 09:14:30.847 INFO | base_packages:0200| Successfully fetched client-autotest.tar.bz2 from http://172.27.215.249:8082/static/veyron_jerry-release/R56-8977.0.0/autotest/packages/client-autotest.tar.bz2
...
11/11 09:15:28.293 INFO | dev_server:0971| Staging artifacts on devserver http://172.27.215.249:8082: build=veyron_jerry-release/R56-8977.0.0, artifacts=['firmware'], files=, archive_url=gs://chromeos-image-archive/veyron_jerry-release/R56-8977.0.0
11/11 09:15:28.294 DEBUG| base_utils:0185| Running 'ssh 172.27.215.249 'curl "http://172.27.215.249:8082/stage?artifacts=firmware&files=&async=True&archive_url=gs://chromeos-image-archive/veyron_jerry-release/R56-8977.0.0"''
11/11 09:15:34.074 DEBUG| dev_server:0918| response for RPC: 'Success'
11/11 09:15:34.074 DEBUG| base_utils:0185| Running 'ssh 172.27.215.249 'curl "http://172.27.215.249:8082/is_staged?artifacts=firmware&files=&archive_url=gs://chromeos-image-archive/veyron_jerry-release/R56-8977.0.0"''
11/11 09:15:39.845 DEBUG| dev_server:0874| whether artifact is staged: 'True'
11/11 09:15:39.846 INFO | dev_server:0993| Finished staging artifacts: build=veyron_jerry-release/R56-8977.0.0, artifacts=['firmware'], files=, archive_url=gs://chromeos-image-archive/veyron_jerry-release/R56-8977.0.0
11/11 09:15:39.847 DEBUG| base_utils:0185| Running 'wget -O /tmp/_autotmp_KamvxQfwimage/firmware_from_source.tar.bz2 http://172.27.215.249:8082/static/veyron_jerry-release/R56-8977.0.0/firmware_from_source.tar.bz2'
11/11 09:15:39.894 ERROR| base_utils:0280| [stderr] --2016-11-11 09:15:39-- http://172.27.215.249:8082/static/veyron_jerry-release/R56-8977.0.0/firmware_from_source.tar.bz2
11/11 09:15:39.931 ERROR| base_utils:0280| [stderr] Connecting to 172.27.215.249:8082... failed: No route to host.
So, both fetch and stage succeeded a few times before failing consistently for the rest of the run.
,
Nov 11 2016
I was able to ssh into the devserver with the standard chromeos-test/test_me login/password. Also, there's no /usr/local/autotest directory on it as of now.
,
Nov 11 2016
,
Nov 11 2016
shchen@, are you under the right user space to check for the logs? pprabhu@, any specific logs you're looking for? shchen@ may be able to help here?
,
Nov 11 2016
Not sure? How do I check? Also, here's what I see when I log in:
shchen-macbookair:debug shchen$ ssh chromeos-test@chromeos1-infra-devserver2.cros
chromeos-test@chromeos1-infra-devserver2.cros.corp.google.com's password:
Welcome to Ubuntu 14.04.4 LTS (GNU/Linux 4.2.0-27-generic x86_64)
* Documentation: https://help.ubuntu.com/
System information as of Fri Nov 11 13:34:39 PST 2016
System load: 0.23 Processes: 158
Usage of /: 11.2% of 1.44TB Users logged in: 1
Memory usage: 5% IP address for eth0: 172.27.215.249
Swap usage: 0%
Graph this data and manage this system at: https://landscape.canonical.com/
New release '16.04.1 LTS' available.
Run 'do-release-upgrade' to upgrade to it.
*** System restart required ***
Last login: Fri Nov 11 13:08:17 2016 from 172.19.54.139
chromeos-test@chromeos1-infra-devserver2:~$ ls /usr/local/autotest
ls: cannot access /usr/local/autotest: No such file or directory
chromeos-test@chromeos1-infra-devserver2:~$ ls /usr/local/
bin etc games google include lib man sbin share src
chromeos-test@chromeos1-infra-devserver2:~$
,
Nov 11 2016
pprabhu@, which user should we check? chromeos-test@, or do we need to switch to root?
,
Nov 11 2016
This is a port access problem on the devserver. In particular, the following passes (with password):
ssh chromeos-test@172.27.215.249 'curl "http://172.27.215.249:8082/is_staged?artifacts=firmware&files=&archive_url=gs://chromeos-image-archive/veyron_jerry-release/R56-8977.0.0"'
but the following fails:
pprabhu@pprabhu:/work/chromiumos/chromeos-admin$ curl "http://172.27.215.249:8082/is_staged?artifacts=firmware&files=&archive_url=gs://chromeos-image-archive/veyron_jerry-release/R56-8977.0.0"
curl: (7) Failed to connect to 172.27.215.249 port 8082: No route to host
So port 8082 is not reachable on that server. Curiously, if I first SSH into the other devserver in the same subnet:
pprabhu@pprabhu:/work/chromiumos/chromeos-admin/puppet/modules/lab$ ssh chromeos-test@172.27.215.248
...
chromeos-test@chromeos1-infra-devserver3:~$ curl "http://172.27.215.249:8082/is_staged?artifacts=firmware&files=&archive_url=gs://chromeos-image-archive/veyron_jerry-release/R56-8977.0.0"
True
So it seems like port 8082 is not reachable across subnets, maybe because of Google's network restrictions, and the drone that runs autoserv is not in the same subnet as the devserver. OTOH, the DUT that runs the client test may be in the same subnet (that's why this devserver was chosen).
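For anyone who wants to repeat the two probes against another devserver, here is a minimal Python sketch that only builds the command strings. The helper names `is_staged_url` and `probe_commands` are hypothetical (they are not part of autotest); the URL layout and the ssh-wrapped form are copied from the log in this thread.

```python
from urllib.parse import urlencode


def is_staged_url(devserver, archive_url, artifacts='firmware'):
    """Build the is_staged URL seen in the log (hypothetical helper)."""
    # Keep ':' and '/' unescaped so the gs:// URL looks like it does in the log.
    query = urlencode(
        [('artifacts', artifacts), ('files', ''), ('archive_url', archive_url)],
        safe=':/')
    return '%s/is_staged?%s' % (devserver, query)


def probe_commands(devserver_ip, archive_url, port=8082):
    """Return (direct, via_ssh) shell commands for the same is_staged check."""
    url = is_staged_url('http://%s:%d' % (devserver_ip, port), archive_url)
    # Direct probe: needs a route from the caller to the devserver's HTTP port.
    direct = 'curl "%s"' % url
    # ssh-wrapped probe: only needs port 22; curl runs on the devserver itself.
    via_ssh = "ssh %s '%s'" % (devserver_ip, direct)
    return direct, via_ssh


direct, via_ssh = probe_commands(
    '172.27.215.249',
    'gs://chromeos-image-archive/veyron_jerry-release/R56-8977.0.0')
print(direct)
print(via_ssh)
```

Running the first command from the drone and the second from anywhere with ssh access should show the same split as above if port 8082 is blocked across subnets.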
,
Nov 12 2016
Who will be able to confirm the network settings on comment#11, jashur@? The emails sent by Nagios said that the two devservers were recovered and UP.
,
Nov 12 2016
The real question is: has this devserver ever been demonstrated to work in the current setup (i.e. before it went down and we set it up again)? IIUC, this seems like a flaw in the system: autoserv uses the restricted_subnet property to choose a devserver for a given DUT. Then it checks that the devserver works via the is_staged curl call (over ssh). Now, the devserver _is correctly set up_ to be accessed from the DUT, but when autoserv tries to get the staged file (in this case for a firmware update on the DUT via test_that, would be my guess), it fails because port 8082 on the arbitrary subnet is not accessible from the drone in .corp. Nagios will not complain about these any more because, I think, it merely runs health_check (likely also over ssh).
,
Nov 14 2016
pass this to the current deputy
,
Nov 14 2016
This seems to me an infra software issue (autotest is picking a devserver in a subnet that the server job can not reach later) -- in this case, this devserver would never have worked. Did this ever work? Handing over to shchen@ to answer that ^^^ and for follow up. Does not look like an operations problem (which is what deputy is for).
,
Nov 15 2016
According to the log of c#1, it uses "wget" directly to contact the devserver, without ssh.
11/11 09:15:39.932 ERROR|provision_Firmware:0061| Command <wget -O /tmp/_autotmp_KamvxQfwimage/firmware_from_source.tar.bz2 http://172.27.215.249:8082/static/veyron_jerry-release/R56-8977.0.0/firmware_from_source.tar.bz2> failed, rc=4, Command returned non-zero exit status
* Command:
wget -O /tmp/_autotmp_KamvxQfwimage/firmware_from_source.tar.bz2
http://172.27.215.249:8082/static/veyron_jerry-
release/R56-8977.0.0/firmware_from_source.tar.bz2
Exit status: 4
Duration: 0.046719789505
It implies that the "enable_ssh_connection_for_devserver" config in the Shard's shadow_config.ini file is "False". In the lab environment, this value should be "True". It is controlled by Puppet, check the file chromeos-admin/puppet/modules/lab/templates/shadow_config/shard.ini.erb:
# Flags to enable/disable SSH connection for devserver.
enable_ssh_connection_for_devserver: True
It looks like shadow_config.ini has not yet been deployed to the Shard (or to whichever other lab server executed the above code). pprabhu@, please help check the Shard to ensure shadow_config.ini has the correct value. Shelly and I have no access to the Shard (it would be nice if someone could grant us permission and guide us through the steps).
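A quick way to check the flag on a host, as a sketch only: the snippet below parses ini text with Python's standard `configparser`. The helper name `ssh_enabled_for_devserver` is made up for illustration, and the `[CROS]` section name is an assumption (the template excerpt above does not show which section the flag lives in); autotest's own config loader is not used here.

```python
import configparser

FLAG = 'enable_ssh_connection_for_devserver'


def ssh_enabled_for_devserver(ini_text, section='CROS'):
    """Return the flag value from ini text, or False if it is absent.

    Hypothetical helper; the section name is an assumption, not taken
    from the thread.
    """
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    # getboolean accepts True/False, yes/no, on/off, 1/0 (case-insensitive).
    return parser.getboolean(section, FLAG, fallback=False)


sample = """
[CROS]
# Flags to enable/disable SSH connection for devserver.
enable_ssh_connection_for_devserver: True
"""
print(ssh_enabled_for_devserver(sample))
```

On a real host one would pass the contents of the deployed shadow_config.ini instead of the inline sample.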
,
Nov 15 2016
The autoserv process in #2 ran on chromeos-server41.cbf. I verified that enable_ssh_connection_for_devserver is set to True on that drone/shard.
---------------
Here's the bug: https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/server/hosts/cros_host.py?q=wget+file:%5Esrc/third_party/autotest/files/+package:%5Echromeos_public$&dr=C&l=1008
cros_host uses the devserver module to stage the artifact, but then goes and 'wget's the staged package itself. In this call, it does not respect the enable_ssh_connection_for_devserver flag.
I don't think this ever worked for restricted subnets. Maybe we've just never tried to do a firmware update in a restricted subnet before.
,
Nov 15 2016
Assigned to dshi@. Please fix the firmware_install() method. It probably should not call 'wget' directly, since that doesn't respect the enable_ssh_connection_for_devserver flag.
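To make the request concrete, here is a rough Python sketch of a download command that honors such a flag. This is not the autotest implementation and not necessarily how the eventual fix works; `build_fetch_command` and its `use_ssh` parameter are hypothetical names, and the ssh branch assumes the caller can ssh to the devserver and that `curl` is installed there.

```python
import shlex
from urllib.parse import urlsplit


def build_fetch_command(url, dest, use_ssh):
    """Return a shell command that downloads `url` to `dest`.

    Hypothetical helper. With use_ssh=True the HTTP request is made on the
    devserver itself and the bytes come back over the ssh channel, so only
    port 22 has to be reachable from the drone.
    """
    if not use_ssh:
        # Direct access: requires a route to the devserver's HTTP port.
        return 'wget -O %s %s' % (shlex.quote(dest), shlex.quote(url))
    host = urlsplit(url).hostname
    remote = 'curl -s %s' % shlex.quote(url)
    # The redirect is outside the quoted remote command, so it runs locally.
    return 'ssh %s %s > %s' % (host, shlex.quote(remote), shlex.quote(dest))


url = ('http://172.27.215.249:8082/static/veyron_jerry-release/'
       'R56-8977.0.0/firmware_from_source.tar.bz2')
print(build_fetch_command(url, '/tmp/firmware_from_source.tar.bz2', False))
print(build_fetch_command(url, '/tmp/firmware_from_source.tar.bz2', True))
```

The point of routing through one helper is that every caller picks up the flag automatically, instead of each call site hard-coding its own 'wget'.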
,
Nov 15 2016
,
Nov 15 2016
+akeshet Aviv, can you find an owner for this bug?
,
Nov 16 2016
xixuan@ can you take a look?
,
Nov 17 2016
The powerwash tests for testing push seem to have the same issue. Here is the log of one of the failed powerwash tests:
11/17 12:08:28.654 INFO | autoserv:0705| Results placed in /usr/local/autotest/results/1832-autotest_system/chromeos4-row10-rack9-host15
11/17 12:08:28.654 DEBUG| autoserv:0713| autoserv is running in drone chromeos-shard2-staging.hot.corp.google.com.
11/17 12:08:28.655 DEBUG| autoserv:0714| autoserv command was: /usr/local/autotest/server/autoserv -p -r /usr/local/autotest/results/1832-autotest_system/chromeos4-row10-rack9-host15 -m chromeos4-row10-rack9-host15 -u autotest_system -l powerwash -s --lab True -P 1832-autotest_system/chromeos4-row10-rack9-host15 -n /usr/local/autotest/results/drone_tmp/attach.16 --require-ssp --verify_job_repo_url
11/17 12:08:28.655 INFO | pidfile:0016| Logged pid 26829 to /usr/local/autotest/results/1832-autotest_system/chromeos4-row10-rack9-host15/.autoserv_execute
11/17 12:08:28.660 DEBUG| autoserv:0426| faulthandler registered on SIGTERM.
11/17 12:08:28.661 DEBUG| base_job:0350| Persistent state global_properties.test_retry now set to 0
11/17 12:08:28.661 DEBUG| base_job:0350| Persistent state global_properties.tag now set to '1832-autotest_system/chromeos4-row10-rack9-host15'
11/17 12:08:28.915 DEBUG| base_utils:0185| Running 'cp /usr/local/autotest/results/drone_tmp/attach.16 /usr/local/autotest/results/1832-autotest_system/chromeos4-row10-rack9-host15/attach.16'
11/17 12:08:28.951 DEBUG| retry:0155| Converted retries value: 0 -> Retry(total=0, connect=None, read=None, redirect=0)
11/17 12:08:28.951 INFO | connectionpool:0188| Starting new HTTP connection (1): metadata.google.internal
11/17 12:08:28.953 DEBUG| base_utils:0185| Running 'sudo lxc-ls -P /usr/local/autotest/containers -f -F name,state'
11/17 12:08:28.955 NOTIC| cros_logging:0037| ts_mon was set up.
11/17 12:08:29.175 DEBUG| base_utils:0185| Running 'sudo test -e "/usr/local/autotest/containers/test_1832_1479413308_26829"'
11/17 12:08:29.392 DEBUG| base_utils:0185| Running 'sudo -n virt-what'
11/17 12:08:29.658 DEBUG| site_utils:1141| virt-what output: xen
11/17 12:08:29.659 DEBUG| base_utils:0185| Running 'sudo lxc-clone -p /usr/local/autotest/containers -P /usr/local/autotest/containers base_01 test_1832_1479413308_26829 -s -B aufs'
11/17 12:08:30.096 DEBUG| base_utils:0185| Running 'sudo lxc-ls -P /usr/local/autotest/containers -f -F name,state'
11/17 12:08:30.328 DEBUG| base_utils:0185| Running 'echo 'lxc.utsname = test_chromeos4-row10-rack9-host15' | sudo tee --append /usr/local/autotest/containers/test_1832_1479413308_26829/config> /dev/null'
11/17 12:08:30.575 DEBUG| base_utils:0185| Running 'sudo lxc-info -P /usr/local/autotest/containers -n test_1832_1479413308_26829 -c lxc.rootfs'
11/17 12:08:30.797 DEBUG| base_utils:0185| Running 'sudo mkdir -p /usr/local/autotest/containers/test_1832_1479413308_26829/delta0/usr/local'
11/17 12:08:31.010 DEBUG| base_utils:0185| Running 'sudo wget --timeout=300 -nv http://100.107.227.251:8082/static/quawks-release/R54-8743.44.0/autotest_server_package.tar.bz2 -O /usr/local/autotest/containers/test_1832_1479413308_26829/delta0/usr/local/autotest_server_package.tar.bz2'
11/17 12:08:31.290 WARNI| retry:0218| <class 'autotest_lib.client.common_lib.error.CmdError'>(Command <sudo wget --timeout=300 -nv http://100.107.227.251:8082/static/quawks-release/R54-8743.44.0/autotest_server_package.tar.bz2 -O /usr/local/autotest/containers/test_1832_1479413308_26829/delta0/usr/local/autotest_server_package.tar.bz2> failed, rc=4, Command returned non-zero exit status
* Command:
sudo wget --timeout=300 -nv http://100.107.227.251:8082/static/quawks-
release/R54-8743.44.0/autotest_server_package.tar.bz2 -O /usr/local/autote
st/containers/test_1832_1479413308_26829/delta0/usr/local/autotest_server_
package.tar.bz2
Exit status: 4
Duration: 0.243041992188
)
11/17 12:08:31.293 WARNI| retry:0173| Retrying in 3.777172 seconds...
11/17 12:08:35.087 DEBUG| base_utils:0185| Running 'sudo wget --timeout=300 -nv http://100.107.227.251:8082/static/quawks-release/R54-8743.44.0/autotest_server_package.tar.bz2 -O /usr/local/autotest/containers/test_1832_1479413308_26829/delta0/usr/local/autotest_server_package.tar.bz2'
11/17 12:10:44.656 WARNI| retry:0218| <class 'autotest_lib.client.common_lib.error.CmdError'>(Command <sudo wget --timeout=300 -nv http://100.107.227.251:8082/static/quawks-release/R54-8743.44.0/autotest_server_package.tar.bz2 -O /usr/local/autotest/containers/test_1832_1479413308_26829/delta0/usr/local/autotest_server_package.tar.bz2> failed, rc=4, Command returned non-zero exit status
* Command:
sudo wget --timeout=300 -nv http://100.107.227.251:8082/static/quawks-
release/R54-8743.44.0/autotest_server_package.tar.bz2 -O /usr/local/autote
st/containers/test_1832_1479413308_26829/delta0/usr/local/autotest_server_
package.tar.bz2
Exit status: 4
Duration: 0.243041992188
)
11/17 12:08:31.293 WARNI| retry:0173| Retrying in 3.777172 seconds...
11/17 12:08:35.087 DEBUG| base_utils:0185| Running 'sudo wget --timeout=300 -nv http://100.107.227.251:8082/static/quawks-release/R54-8743.44.0/autotest_server_package.tar.bz2 -O /usr/local/autotest/containers/test_1832_1479413308_26829/delta0/usr/local/autotest_server_package.tar.bz2'
11/17 12:10:44.656 WARNI| retry:0218| <class 'autotest_lib.client.common_lib.error.CmdError'>(Command <sudo wget --timeout=300 -nv http://100.107.227.251:8082/static/quawks-release/R54-8743.44.0/autotest_server_package.tar.bz2 -O /usr/local/autotest/containers/test_1832_1479413308_26829/delta0/usr/local/autotest_server_package.tar.bz2> failed, rc=4, Command returned non-zero exit status
* Command:
sudo wget --timeout=300 -nv http://100.107.227.251:8082/static/quawks-
release/R54-8743.44.0/autotest_server_package.tar.bz2 -O /usr/local/autote
st/containers/test_1832_1479413308_26829/delta0/usr/local/autotest_server_
package.tar.bz2
Exit status: 4
Duration: 129.53272295
.....
It will keep retrying to download autotest_server_package.tar.bz2 until it times out. Can you take a look too?
,
Nov 17 2016
have a CL for that, under review.
,
Nov 17 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/a04b42d8f105b91600c67468793009d1d318d1f2
commit a04b42d8f105b91600c67468793009d1d318d1f2
Author: xixuan <xixuan@chromium.org>
Date: Wed Nov 16 22:30:43 2016
autotest: temporary fix for fetching firmware package
Current firmware_install (running on drone/shard) use wget to fetch firmware_from_source.tar.bz2 for firware installation from devserver. However, only port 22 is allowed for devservers under ACL.
This CL is a temporary fix for using port 22 to get the firmware package.
A long-term solution will be 'choose a devserver by host, build, and package' so that we can use ganeti devserver for such use cases and reduce lab devserver's load.
BUG= chromium:664333
TEST=Run '/usr/local/autotest/server/autoserv -p -r /tmp/firmware-test5 -m chromeos1-row1-rack9-host6 --verbose --lab True --provision --job-labels cros-version:veyron_jerry-release/R56-8977.0.0,fwro-version:veyron_jerry-release/R56-8977.0.0' on chromeos-server41.cbf, verify that successfully downloading the packages.
Change-Id: Ie08a8ef47aa92044d1d1152889067ca37cc7a2b6
Reviewed-on: https://chromium-review.googlesource.com/412028
Reviewed-by: Xixuan Wu <xixuan@chromium.org>
Commit-Queue: Xixuan Wu <xixuan@chromium.org>
Tested-by: Xixuan Wu <xixuan@chromium.org>
[modify] https://crrev.com/a04b42d8f105b91600c67468793009d1d318d1f2/server/hosts/cros_host.py
[modify] https://crrev.com/a04b42d8f105b91600c67468793009d1d318d1f2/client/common_lib/cros/dev_server.py
,
Nov 17 2016
Issue 666498 has been merged into this issue.
,
Nov 17 2016
Re #22: That is issue 666414. Workaround has been merged, being tested now.
,
Nov 17 2016
Re c#22, they are different issues, but similar cause, i.e. hard-coding the "wget" command. The failure in c#22 happened in download_extract() in site_utils/lxc.py. The fix in c#24 fixed the firmware_install() in server/hosts/cros_host.py.
,
Nov 17 2016
,
Nov 23 2016
Aviv's link http://shortn/_RrJ2HerbHE from https://bugs.chromium.org/p/chromium/issues/detail?id=666380#c3 still shows the FAFT-related hosts failing provisioning very frequently. Has the temporary fix from c#24 not been rolled out yet?
,
Nov 23 2016
Re #29: Fix from #24 was rolled out last week. And I can confirm that at least a similar fix was effective in getting us over the SSP issues we hit last week: issue 666414.
,
Dec 8 2016
No report. Assuming it's fixed. Feel free to reopen it if we find another case.
Comment 1 by shchen@chromium.org, Nov 11 2016
Thanks for rebuilding the devservers. However, I am now seeing the following error from a DUT to chromeos1-infra-devserver2:
11/11 09:15:39.894 ERROR| base_utils:0280| [stderr] --2016-11-11 09:15:39-- http://172.27.215.249:8082/static/veyron_jerry-release/R56-8977.0.0/firmware_from_source.tar.bz2
11/11 09:15:39.931 ERROR| base_utils:0280| [stderr] Connecting to 172.27.215.249:8082... failed: No route to host.
11/11 09:15:39.932 ERROR|provision_Firmware:0061| Command <wget -O /tmp/_autotmp_KamvxQfwimage/firmware_from_source.tar.bz2 http://172.27.215.249:8082/static/veyron_jerry-release/R56-8977.0.0/firmware_from_source.tar.bz2> failed, rc=4, Command returned non-zero exit status
* Command:
wget -O /tmp/_autotmp_KamvxQfwimage/firmware_from_source.tar.bz2
http://172.27.215.249:8082/static/veyron_jerry-
release/R56-8977.0.0/firmware_from_source.tar.bz2
Exit status: 4
Duration: 0.046719789505
stderr:
--2016-11-11 09:15:39-- http://172.27.215.249:8082/static/veyron_jerry-release/R56-8977.0.0/firmware_from_source.tar.bz2
Connecting to 172.27.215.249:8082... failed: No route to host.
Is this a configuration problem?