
Issue 664333 link

Starred by 4 users

Issue metadata

Status: Fixed
Owner:
Closed: Dec 2016
Cc:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




cros_host:firmware_install incorrectly accesses devservers directly -- fails for devserver in different subnet

Project Member Reported by jashur@chromium.org, Nov 10 2016

Issue description

Hi Deputy,

The following devservers have been rebuilt using the same hostnames. I'm not sure whether the config file needs any modification or whether this just needs a confirmation.

172.27.215.249 - chromeos1-infra-devserver2
172.27.215.248 - chromeos1-infra-devserver3

Thanks,

Joe

 

Comment 1 by shchen@chromium.org, Nov 11 2016

Thanks for rebuilding the devservers.

However, I am now seeing the following error from a DUT to chromeos1-infra-devserver2:

11/11 09:15:39.894 ERROR|        base_utils:0280| [stderr] --2016-11-11 09:15:39--  http://172.27.215.249:8082/static/veyron_jerry-release/R56-8977.0.0/firmware_from_source.tar.bz2
11/11 09:15:39.931 ERROR|        base_utils:0280| [stderr] Connecting to 172.27.215.249:8082... failed: No route to host.
11/11 09:15:39.932 ERROR|provision_Firmware:0061| Command <wget -O /tmp/_autotmp_KamvxQfwimage/firmware_from_source.tar.bz2 http://172.27.215.249:8082/static/veyron_jerry-release/R56-8977.0.0/firmware_from_source.tar.bz2> failed, rc=4, Command returned non-zero exit status
* Command: 
    wget -O /tmp/_autotmp_KamvxQfwimage/firmware_from_source.tar.bz2
    http://172.27.215.249:8082/static/veyron_jerry-
    release/R56-8977.0.0/firmware_from_source.tar.bz2
Exit status: 4
Duration: 0.046719789505

stderr:
--2016-11-11 09:15:39--  http://172.27.215.249:8082/static/veyron_jerry-release/R56-8977.0.0/firmware_from_source.tar.bz2
Connecting to 172.27.215.249:8082... failed: No route to host.

Is this a configuration problem?
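
For reading the `rc=4` in the log above: the table below is a small sketch of the exit statuses documented in the GNU Wget manual, where 4 means a network failure (consistent with "No route to host"). The helper name `describe_wget_exit` is hypothetical and only for illustration.

```python
# Exit statuses as documented in the GNU Wget manual ("Exit Status" section).
WGET_EXIT_STATUS = {
    0: "No problems occurred",
    1: "Generic error code",
    2: "Parse error",
    3: "File I/O error",
    4: "Network failure",
    5: "SSL verification failure",
    6: "Username/password authentication failure",
    7: "Protocol errors",
    8: "Server issued an error response",
}


def describe_wget_exit(rc):
    """Return a human-readable description for a wget exit status."""
    return WGET_EXIT_STATUS.get(rc, "Unknown exit status")


# The failing command in the log returned rc=4.
print(describe_wget_exit(4))
```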

Comment 3 by jashur@chromium.org, Nov 11 2016

It might be a configuration problem. Not sure how it works, but these devservers have the same hostnames as the previous devservers. Since they have been rebuilt, I'm not sure if they have to be removed and re-added to the config file to get these into production.

Comment 4 by nxia@chromium.org, Nov 11 2016

The two devservers were not removed from the config file. Not sure if any other configurations are needed. pprabhu@, any advice here?
Does not look like a configuration problem. Looks like we tried to stage some artifacts, which succeeded, but then when trying to get the artifacts, we lost connection to the devserver.

My next step would be to SSH into the server and look in /usr/local/autotest/logs/ for devserver logs with more hints. But this devserver is not a ganeti server and is not owned by chromeos-infra, and I do not know how to SSH into it.

Relevant bits of the log:

11/11 09:14:29.429 INFO |     base_packages:0183| Fetching client-autotest.tar.bz2 from http://172.27.215.249:8082/static/veyron_jerry-release/R56-8977.0.0/autotest/packages to /tmp/sysinfo/autoserv-BQBbQ3/packages/client-autotest.tar.bz2

...

11/11 09:14:30.847 INFO |     base_packages:0200| Successfully fetched client-autotest.tar.bz2 from http://172.27.215.249:8082/static/veyron_jerry-release/R56-8977.0.0/autotest/packages/client-autotest.tar.bz2

...

11/11 09:15:28.293 INFO |        dev_server:0971| Staging artifacts on devserver http://172.27.215.249:8082: build=veyron_jerry-release/R56-8977.0.0, artifacts=['firmware'], files=, archive_url=gs://chromeos-image-archive/veyron_jerry-release/R56-8977.0.0
11/11 09:15:28.294 DEBUG|        base_utils:0185| Running 'ssh 172.27.215.249 'curl "http://172.27.215.249:8082/stage?artifacts=firmware&files=&async=True&archive_url=gs://chromeos-image-archive/veyron_jerry-release/R56-8977.0.0"''
11/11 09:15:34.074 DEBUG|        dev_server:0918| response for RPC: 'Success'
11/11 09:15:34.074 DEBUG|        base_utils:0185| Running 'ssh 172.27.215.249 'curl "http://172.27.215.249:8082/is_staged?artifacts=firmware&files=&archive_url=gs://chromeos-image-archive/veyron_jerry-release/R56-8977.0.0"''
11/11 09:15:39.845 DEBUG|        dev_server:0874| whether artifact is staged: 'True'
11/11 09:15:39.846 INFO |        dev_server:0993| Finished staging artifacts: build=veyron_jerry-release/R56-8977.0.0, artifacts=['firmware'], files=, archive_url=gs://chromeos-image-archive/veyron_jerry-release/R56-8977.0.0
11/11 09:15:39.847 DEBUG|        base_utils:0185| Running 'wget -O /tmp/_autotmp_KamvxQfwimage/firmware_from_source.tar.bz2 http://172.27.215.249:8082/static/veyron_jerry-release/R56-8977.0.0/firmware_from_source.tar.bz2'
11/11 09:15:39.894 ERROR|        base_utils:0280| [stderr] --2016-11-11 09:15:39--  http://172.27.215.249:8082/static/veyron_jerry-release/R56-8977.0.0/firmware_from_source.tar.bz2
11/11 09:15:39.931 ERROR|        base_utils:0280| [stderr] Connecting to 172.27.215.249:8082... failed: No route to host.

So, both fetch and stage succeeded a few times before failing consistently for the rest of the run.

Comment 6 by shchen@google.com, Nov 11 2016

I was able to ssh into the devserver with the standard chromeos-test/test_me login/password.

Also, there's no /usr/local/autotest directory on it as of now.

Comment 7 by shchen@chromium.org, Nov 11 2016

Cc: shchen@chromium.org

Comment 8 by nxia@chromium.org, Nov 11 2016

shchen@, are you under the right user space to check for the logs?

pprabhu@, any specific logs you're looking for? shchen@ may be able to help here?

Comment 9 by shchen@chromium.org, Nov 11 2016

Not sure?  How do I check?

Also, here's what I see when I log in:

shchen-macbookair:debug shchen$ ssh chromeos-test@chromeos1-infra-devserver2.cros
chromeos-test@chromeos1-infra-devserver2.cros.corp.google.com's password: 
Welcome to Ubuntu 14.04.4 LTS (GNU/Linux 4.2.0-27-generic x86_64)

 * Documentation:  https://help.ubuntu.com/

  System information as of Fri Nov 11 13:34:39 PST 2016

  System load:  0.23              Processes:           158
  Usage of /:   11.2% of 1.44TB   Users logged in:     1
  Memory usage: 5%                IP address for eth0: 172.27.215.249
  Swap usage:   0%

  Graph this data and manage this system at:
    https://landscape.canonical.com/

New release '16.04.1 LTS' available.
Run 'do-release-upgrade' to upgrade to it.

*** System restart required ***
Last login: Fri Nov 11 13:08:17 2016 from 172.19.54.139
chromeos-test@chromeos1-infra-devserver2:~$ ls /usr/local/autotest
ls: cannot access /usr/local/autotest: No such file or directory
chromeos-test@chromeos1-infra-devserver2:~$ ls /usr/local/
bin  etc  games  google  include  lib  man  sbin  share  src
chromeos-test@chromeos1-infra-devserver2:~$ 

Comment 10 by nxia@chromium.org, Nov 11 2016

pprabhu@, which user should we check? chromeos-test@, or do we need to switch to root?
This is a port access problem on the devserver.
In particular the following passes:
ssh chromeos-test@172.27.215.249 'curl "http://172.27.215.249:8082/is_staged?artifacts=firmware&files=&archive_url=gs://chromeos-image-archive/veyron_jerry-release/R56-8977.0.0"'

(with password),
but the following fails:
pprabhu@pprabhu:/work/chromiumos/chromeos-admin$ curl "http://172.27.215.249:8082/is_staged?artifacts=firmware&files=&archive_url=gs://chromeos-image-archive/veyron_jerry-release/R56-8977.0.0"
curl: (7) Failed to connect to 172.27.215.249 port 8082: No route to host

So port 8082 is not reachable on that server.

Curiously, if I first SSH into the other devserver in the same subnet:
pprabhu@pprabhu:/work/chromiumos/chromeos-admin/puppet/modules/lab$ ssh chromeos-test@172.27.215.248
...
chromeos-test@chromeos1-infra-devserver3:~$ curl "http://172.27.215.249:8082/is_staged?artifacts=firmware&files=&archive_url=gs://chromeos-image-archive/veyron_jerry-release/R56-8977.0.0"
True


So, it seems like port 8082 is not reachable across subnets, maybe because of Google's network restrictions. And the drone that runs autoserv is not in the same subnet as the devserver. OTOH, the DUT that runs the client test may be in the same subnet (that's why this devserver was chosen).
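
The comparison above (SSH works, direct HTTP on 8082 does not) can be repeated from any machine with a plain TCP connect test. This is a minimal sketch using only the Python standard library; the function names are hypothetical, and a successful connect only shows that the port is reachable, not that the devserver is healthy.

```python
import socket


def port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "No route to host", connection refused, and timeouts.
        return False


def compare_ports(host, ports=(22, 8082), timeout=3.0):
    """Map each port to whether it is reachable from this machine."""
    return {port: port_reachable(host, port, timeout) for port in ports}


# Example (run from the drone, then from a host in the devserver's subnet):
#   compare_ports("172.27.215.249")
```

If port 22 is reachable but 8082 is not, the result matches the behaviour seen from the drone in this comment.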

Comment 12 by nxia@chromium.org, Nov 12 2016

Who can confirm the network settings described in comment #11? jashur@?

The emails sent by Nagios said that the two devservers were recovered and UP.
The real question is: has this devserver ever been demonstrated to work in the current setup (i.e., before it went down and we set it up again)?

iiuc, this seems like a flaw in the system: autoserv uses the restricted_subnet property to choose a devserver for a given DUT. Then, it checks that the devserver works via the is_staged curl call (over ssh). Now, the devserver _is correctly set up_ to be accessed from the DUT, but when autoserv tries to get the staged file (in this case for a firmware update on the DUT via test_that, would be my guess), it fails because port 8082 on the arbitrary subnet is not accessible from the drone in .corp.

nagios will not complain about these any more because it merely runs health_check? (likely over ssh also).
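
The selection flaw described above can be illustrated with the standard `ipaddress` module. This is only a sketch of the idea, not autotest's actual selection code: the `/24` subnet, the DUT address, and the drone address are all hypothetical values chosen for illustration; only the devserver address comes from the logs.

```python
import ipaddress

# Hypothetical values for illustration only.
RESTRICTED_SUBNETS = ["172.27.215.0/24"]
DUT_IP = "172.27.215.10"
DRONE_IP = "10.0.0.5"
# Devserver address taken from the logs in this issue.
DEVSERVERS = ["172.27.215.249"]


def in_subnet(ip, subnet):
    """Return True if ip belongs to the given CIDR subnet."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(subnet)


def pick_devserver(dut_ip, devservers, restricted_subnets):
    """Pick a devserver in the same restricted subnet as the DUT, if any."""
    for subnet in restricted_subnets:
        if in_subnet(dut_ip, subnet):
            for devserver in devservers:
                if in_subnet(devserver, subnet):
                    return devserver
    return None


chosen = pick_devserver(DUT_IP, DEVSERVERS, RESTRICTED_SUBNETS)
# The devserver suits the DUT, but the drone sits outside the subnet,
# so a direct HTTP fetch from the drone may be blocked.
drone_is_local = any(in_subnet(DRONE_IP, s) for s in RESTRICTED_SUBNETS)
print(chosen, drone_is_local)
```

The point of the sketch is that the devserver is chosen from the DUT's point of view, while the later download runs from the drone, which the selection never considers.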

Comment 14 by nxia@chromium.org, Nov 14 2016

Owner: pprabhu@chromium.org
pass this to the current deputy
Owner: shchen@chromium.org
Status: Assigned (was: Untriaged)
This seems to me an infra software issue (autotest is picking a devserver in a subnet that the server job can not reach later) -- in this case, this devserver would never have worked. Did this ever work?

Handing over to shchen@ to answer that ^^^ and for follow up. Does not look like an operations problem (which is what deputy is for).

Cc: waihong@chromium.org
Owner: pprabhu@chromium.org
According to the log of c#1, it uses "wget" directly to contact the devserver, without ssh.

11/11 09:15:39.932 ERROR|provision_Firmware:0061| Command <wget -O /tmp/_autotmp_KamvxQfwimage/firmware_from_source.tar.bz2 http://172.27.215.249:8082/static/veyron_jerry-release/R56-8977.0.0/firmware_from_source.tar.bz2> failed, rc=4, Command returned non-zero exit status
* Command: 
    wget -O /tmp/_autotmp_KamvxQfwimage/firmware_from_source.tar.bz2
    http://172.27.215.249:8082/static/veyron_jerry-
    release/R56-8977.0.0/firmware_from_source.tar.bz2
Exit status: 4
Duration: 0.046719789505

It implies that the "enable_ssh_connection_for_devserver" config in the Shard's shadow_config.ini file is "False". In the lab environment, this value should be "True". It is controlled by Puppet; check the file chromeos-admin/puppet/modules/lab/templates/shadow_config/shard.ini.erb:
  # Flags to enable/disable SSH connection for devserver.
  enable_ssh_connection_for_devserver: True

It looks like the shadow_config.ini has not yet been deployed to the Shard (or to whichever other lab server executed the above code). pprabhu@, please help check the Shard to ensure the shadow_config.ini has the correct value. Shelly and I have no access to the Shard (it would be nice if someone could grant us permission and guide us through the steps).
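
One way to check the deployed value is to parse the ini text directly. This is a minimal sketch using Python's `configparser`, not autotest's own global_config loader; the `[CROS]` section name and the helper name are assumptions, since the excerpt above shows only the option itself.

```python
import configparser

# Sample text; the [CROS] section name is an assumption.
SAMPLE_SHADOW_CONFIG = """\
[CROS]
# Flags to enable/disable SSH connection for devserver.
enable_ssh_connection_for_devserver: True
"""


def ssh_for_devserver_enabled(ini_text, section="CROS"):
    """Return the flag's boolean value, or False if it is not set."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return parser.getboolean(
        section, "enable_ssh_connection_for_devserver", fallback=False)


print(ssh_for_devserver_enabled(SAMPLE_SHADOW_CONFIG))
```

On a real shard, the text would come from the deployed shadow_config.ini rather than from the sample string.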
Owner: ----
Status: Available (was: Assigned)
Summary: cros_host:firmware_install incorrectly accesses devservers directly -- fails for devserver in different subnet (was: Add Devservers to Config File)
The autoserv process in #2 ran on chromeos-server41.cbf 
I verified that enable_ssh_connection_for_devserver is set to True on that drone/shard.
---------------

Here's the bug: https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/server/hosts/cros_host.py?q=wget+file:%5Esrc/third_party/autotest/files/+package:%5Echromeos_public$&dr=C&l=1008

cros_host uses the devserver module to stage the artifact, but then goes and 'wget's the staged package itself. In this call, it does not respect the enable_ssh_connection_for_devserver flag.

I don't think this ever worked for restricted subnets. Maybe we've just never tried to do a firmware update in a restricted subnet before.
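
To make the bug concrete, here is a sketch of a download command builder that honours the flag. It is not the code in cros_host.py and not the fix that eventually landed; `build_fetch_command` is a hypothetical helper that mirrors the `ssh <host> 'curl ...'` pattern visible in the staging log lines above.

```python
import shlex
from urllib.parse import urlparse


def build_fetch_command(url, dest, use_ssh):
    """Return a shell command that downloads url to dest.

    With use_ssh=False this is the direct wget seen in the failing log.
    With use_ssh=True, curl runs on the devserver over SSH (port 22) and
    its output is redirected into the local destination file.
    """
    if not use_ssh:
        return "wget -O %s %s" % (shlex.quote(dest), shlex.quote(url))
    host = urlparse(url).hostname
    remote = "curl -s %s" % shlex.quote(url)
    return "ssh %s %s > %s" % (
        shlex.quote(host), shlex.quote(remote), shlex.quote(dest))


url = ("http://172.27.215.249:8082/static/veyron_jerry-release/"
       "R56-8977.0.0/firmware_from_source.tar.bz2")
print(build_fetch_command(url, "/tmp/firmware_from_source.tar.bz2", True))
```

Passing the value of enable_ssh_connection_for_devserver as `use_ssh` would keep all drone-to-devserver traffic on port 22, which is the only path shown to work across subnets in this issue.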
Cc: -dshi@chromium.org
Owner: dshi@chromium.org
Status: Assigned (was: Available)
Assigned to dshi@. Please fix the firmware_install() method. It should probably not call 'wget' directly, since that doesn't respect the enable_ssh_connection_for_devserver flag.
Labels: current-issue

Comment 20 by dshi@chromium.org, Nov 15 2016

Owner: akes...@chromium.org
+akeshet

Aviv, can you find an owner for this bug?
Owner: xixuan@chromium.org
xixuan@ can you take a look?
The powerwash tests for testing push seem to have the same issue. Here is the log of one of the failed powerwash tests:

11/17 12:08:28.654 INFO |          autoserv:0705| Results placed in /usr/local/autotest/results/1832-autotest_sy
stem/chromeos4-row10-rack9-host15
11/17 12:08:28.654 DEBUG|          autoserv:0713| autoserv is running in drone chromeos-shard2-staging.hot.corp.
google.com.
11/17 12:08:28.655 DEBUG|          autoserv:0714| autoserv command was: /usr/local/autotest/server/autoserv -p -
r /usr/local/autotest/results/1832-autotest_system/chromeos4-row10-rack9-host15 -m chromeos4-row10-rack9-host15 
-u autotest_system -l powerwash -s --lab True -P 1832-autotest_system/chromeos4-row10-rack9-host15 -n /usr/local
/autotest/results/drone_tmp/attach.16 --require-ssp --verify_job_repo_url
11/17 12:08:28.655 INFO |           pidfile:0016| Logged pid 26829 to /usr/local/autotest/results/1832-autotest_
system/chromeos4-row10-rack9-host15/.autoserv_execute
11/17 12:08:28.660 DEBUG|          autoserv:0426| faulthandler registered on SIGTERM.
11/17 12:08:28.661 DEBUG|          base_job:0350| Persistent state global_properties.test_retry now set to 0
11/17 12:08:28.661 DEBUG|          base_job:0350| Persistent state global_properties.tag now set to '1832-autote
st_system/chromeos4-row10-rack9-host15'
11/17 12:08:28.915 DEBUG|        base_utils:0185| Running 'cp /usr/local/autotest/results/drone_tmp/attach.16 /u
sr/local/autotest/results/1832-autotest_system/chromeos4-row10-rack9-host15/attach.16'
11/17 12:08:28.951 DEBUG|             retry:0155| Converted retries value: 0 -> Retry(total=0, connect=None, rea
d=None, redirect=0)
11/17 12:08:28.951 INFO |    connectionpool:0188| Starting new HTTP connection (1): metadata.google.internal
11/17 12:08:28.953 DEBUG|        base_utils:0185| Running 'sudo lxc-ls -P /usr/local/autotest/containers -f -F n
ame,state'
11/17 12:08:28.955 NOTIC|      cros_logging:0037| ts_mon was set up.
11/17 12:08:29.175 DEBUG|        base_utils:0185| Running 'sudo test -e "/usr/local/autotest/containers/test_183
2_1479413308_26829"'
11/17 12:08:29.392 DEBUG|        base_utils:0185| Running 'sudo -n virt-what'
11/17 12:08:29.658 DEBUG|        site_utils:1141| virt-what output: xen
11/17 12:08:29.659 DEBUG|        base_utils:0185| Running 'sudo lxc-clone -p /usr/local/autotest/containers -P /
usr/local/autotest/containers base_01 test_1832_1479413308_26829 -s -B aufs'
11/17 12:08:30.096 DEBUG|        base_utils:0185| Running 'sudo lxc-ls -P /usr/local/autotest/containers -f -F n
ame,state'
11/17 12:08:30.328 DEBUG|        base_utils:0185| Running 'echo 'lxc.utsname = test_chromeos4-row10-rack9-host15
' | sudo tee --append /usr/local/autotest/containers/test_1832_1479413308_26829/config> /dev/null'
11/17 12:08:30.575 DEBUG|        base_utils:0185| Running 'sudo lxc-info -P /usr/local/autotest/containers -n te
st_1832_1479413308_26829 -c lxc.rootfs'
11/17 12:08:30.797 DEBUG|        base_utils:0185| Running 'sudo mkdir -p /usr/local/autotest/containers/test_183
2_1479413308_26829/delta0/usr/local'
11/17 12:08:31.010 DEBUG|        base_utils:0185| Running 'sudo wget --timeout=300 -nv http://100.107.227.251:80
82/static/quawks-release/R54-8743.44.0/autotest_server_package.tar.bz2 -O /usr/local/autotest/containers/test_18
32_1479413308_26829/delta0/usr/local/autotest_server_package.tar.bz2'
11/17 12:08:31.290 WARNI|             retry:0218| <class 'autotest_lib.client.common_lib.error.CmdError'>(Comman
d <sudo wget --timeout=300 -nv http://100.107.227.251:8082/static/quawks-release/R54-8743.44.0/autotest_server_p
ackage.tar.bz2 -O /usr/local/autotest/containers/test_1832_1479413308_26829/delta0/usr/local/autotest_server_pac
kage.tar.bz2> failed, rc=4, Command returned non-zero exit status
* Command: 
    sudo wget --timeout=300 -nv http://100.107.227.251:8082/static/quawks-
    release/R54-8743.44.0/autotest_server_package.tar.bz2 -O /usr/local/autote
    st/containers/test_1832_1479413308_26829/delta0/usr/local/autotest_server_
    package.tar.bz2
Exit status: 4
Duration: 0.243041992188
)
11/17 12:08:31.293 WARNI|             retry:0173| Retrying in 3.777172 seconds...
11/17 12:08:35.087 DEBUG|        base_utils:0185| Running 'sudo wget --timeout=300 -nv http://100.107.227.251:80
82/static/quawks-release/R54-8743.44.0/autotest_server_package.tar.bz2 -O /usr/local/autotest/containers/test_18
32_1479413308_26829/delta0/usr/local/autotest_server_package.tar.bz2'
11/17 12:10:44.656 WARNI|             retry:0218| <class 'autotest_lib.client.common_lib.error.CmdError'>(Comman
d <sudo wget --timeout=300 -nv http://100.107.227.251:8082/static/quawks-release/R54-8743.44.0/autotest_server_p
ackage.tar.bz2 -O /usr/local/autotest/containers/test_1832_1479413308_26829/delta0/usr/local/autotest_server_pac
kage.tar.bz2> failed, rc=4, Command returned non-zero exit status
* Command: 
    sudo wget --timeout=300 -nv http://100.107.227.251:8082/static/quawks-
    release/R54-8743.44.0/autotest_server_package.tar.bz2 -O /usr/local/autote
    st/containers/test_1832_1479413308_26829/delta0/usr/local/autotest_server_
    package.tar.bz2
Exit status: 4
Duration: 129.53272295

..... 

It will keep retrying the download of autotest_server_package.tar.bz2 until it times out.

Can you take a look too?
have a CL for that, under review.

Comment 24 by bugdroid1@chromium.org, Nov 17 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/a04b42d8f105b91600c67468793009d1d318d1f2

commit a04b42d8f105b91600c67468793009d1d318d1f2
Author: xixuan <xixuan@chromium.org>
Date: Wed Nov 16 22:30:43 2016

autotest: temporary fix for fetching firmware package

Current firmware_install (running on drone/shard) use wget to fetch
firmware_from_source.tar.bz2 for firware installation from devserver.
However, only port 22 is allowed for devservers under ACL.

This CL is a temporary fix for using port 22 to get the firmware package. A
long-term solution will be 'choose a devserver by host, build, and package' so
that we can use ganeti devserver for such use cases and reduce lab devserver's
load.

BUG= chromium:664333 
TEST=Run '/usr/local/autotest/server/autoserv -p -r /tmp/firmware-test5 -m
chromeos1-row1-rack9-host6 --verbose --lab True --provision --job-labels
cros-version:veyron_jerry-release/R56-8977.0.0,fwro-version:veyron_jerry-release/R56-8977.0.0'
on chromeos-server41.cbf, verify that successfully downloading the packages.

Change-Id: Ie08a8ef47aa92044d1d1152889067ca37cc7a2b6
Reviewed-on: https://chromium-review.googlesource.com/412028
Reviewed-by: Xixuan Wu <xixuan@chromium.org>
Commit-Queue: Xixuan Wu <xixuan@chromium.org>
Tested-by: Xixuan Wu <xixuan@chromium.org>

[modify] https://crrev.com/a04b42d8f105b91600c67468793009d1d318d1f2/server/hosts/cros_host.py
[modify] https://crrev.com/a04b42d8f105b91600c67468793009d1d318d1f2/client/common_lib/cros/dev_server.py

Cc: rohi...@chromium.org dhadd...@chromium.org josa...@chromium.org
 Issue 666498  has been merged into this issue.
Re #22: That is  issue 666414 . Workaround has been merged, being tested now.
Re c#22, they are different issues, but similar cause, i.e. hard-coding the "wget" command.

The failure in c#22 happened in download_extract() in site_utils/lxc.py. The fix in c#24 fixed the firmware_install() in server/hosts/cros_host.py.
Cc: -rohi...@chromium.org
Aviv's link http://shortn/_RrJ2HerbHE from https://bugs.chromium.org/p/chromium/issues/detail?id=666380#c3 still shows the FAFT-related hosts failing provisioning very frequently. Has the temporary fix from c#24 not been rolled out yet?
Re #29: Fix from #24 was rolled out last week. And I can confirm that at least a similar fix was effective in getting us over the SSP issues we hit last week:  issue 666414 
Status: Fixed (was: Assigned)
No report. Assuming it's fixed. Feel free to reopen it if we find another case.
