Project: chromium
Status: WontFix
Owner:
Closed: Jan 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug



incorrectly blaming *all devservers* when a single devserver call flakes
Project Member Reported by pprabhu@chromium.org, Jan 23 2017
Example: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/97555145-chromeos-test/chromeos6-row2-rack11-host16/debug/

- We had already picked a devserver
- We had successfully staged artifacts on it, and listed 'em.
- Then, god knows why, we decided to do a health check.
- When that failed (network flake? devserver load?) we threw an exception saying "all devservers are currently down"

That is doomsaying. And it's a lie:

01/23 10:33:07.485 INFO |        dev_server:1055| Staging artifacts on devserver http://100.115.185.226:8082: build=peach_pit-chrome-pfq/R58-9211.0.0-rc1, artifacts=['full_payload', 'stateful', 'autotest_packages'], files=, archive_url=gs://chromeos-image-archive/peach_pit-chrome-pfq/R58-9211.0.0-rc1
01/23 10:33:10.512 INFO |        dev_server:1073| Finished staging artifacts: build=peach_pit-chrome-pfq/R58-9211.0.0-rc1, artifacts=['full_payload', 'stateful', 'autotest_packages'], files=, archive_url=gs://chromeos-image-archive/peach_pit-chrome-pfq/R58-9211.0.0-rc1
01/23 10:33:10.514 INFO |        dev_server:1391| Requesting contents from devserver http://100.115.185.226:8082 for image peach_pit-chrome-pfq/R58-9211.0.0-rc1
01/23 10:33:11.950 INFO |        dev_server:1396| Listing contents of :/home/chromeos-test/images/peach_pit-chrome-pfq/R58-9211.0.0-rc1
01/23 10:33:11.952 INFO |        dev_server:1396| Name: /home/chromeos-test/images/peach_pit-chrome-pfq/R58-9211.0.0-rc1/control_files.tar Accessed: 2017-

[... SNIP ...]

01/23 10:34:15.033 ERROR|        dev_server:0404| Devserver call failed: "http://100.115.185.226:8082/check_health?", timeout: 60 seconds, Error: retry exception (label="make_call"), timeout = 60s
01/23 10:34:15.037 ERROR|        dev_server:0720| All devservers are currently down: set(['http://100.115.185.226:8082']). dut hostname: chromeos6-row2-rack11-host16
01/23 10:34:15.042 WARNI|              test:0606| Autotest caught exception when running test:
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/test.py", line 600, in _exec
    _call_test_function(self.execute, *p_args, **p_dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 810, in _call_test_function
    raise error.UnhandledTestFail(e)
UnhandledTestFail: Unhandled DevServerException: All devservers are currently down: set(['http://100.115.185.226:8082']). dut hostname: chromeos6-row2-rack11-host16
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/test.py", line 804, in _call_test_function
    return func(*args, **dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 461, in execute
    dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 347, in _call_run_once_with_retry
    postprocess_profiled_run, args, dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 376, in _call_run_once
    self.run_once(*args, **dargs)
  File "/usr/local/autotest/server/site_tests/provision_AutoUpdate/provision_AutoUpdate.py", line 111, in run_once
    force_full_update=force)
  File "/usr/local/autotest/server/afe_utils.py", line 254, in machine_install_and_update_labels
    *args, **dargs)
  File "/usr/local/autotest/server/hosts/cros_host.py", line 678, in machine_install_by_devserver
    devserver = dev_server.resolve(build, self.hostname)
  File "/usr/local/autotest/client/common_lib/cros/dev_server.py", line 2350, in resolve
    return ImageServer.resolve(build, hostname)
  File "/usr/local/autotest/client/common_lib/cros/dev_server.py", line 721, in resolve
    raise DevServerException(error_msg)
DevServerException: All devservers are currently down: set(['http://100.115.185.226:8082']). dut hostname: chromeos6-row2-rack11-host16
 
Cc: jamescook@chromium.org
Also, it may be that we are failing in this particular mode too often. See the autofiled issue 675564.

We should consider increasing this timeout or figuring out what's going on here. The impact of this failure is huge (it fails the PFQ, for example).
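
To make the "increase this timeout" idea concrete, here is a minimal sketch (Python 3, not the actual autotest code) of retrying a flaky call until an overall deadline expires. `call_with_deadline` and all of its parameters are hypothetical names; the 60-second default only mirrors the timeout shown in the log above.

```python
import time


def call_with_deadline(call, timeout_secs=60.0, delay_secs=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Retry call() on OSError until it succeeds or the deadline passes.

    Hypothetical helper for illustration; clock and sleep are injectable
    so the behaviour can be exercised without real waiting.
    """
    deadline = clock() + timeout_secs
    while True:
        try:
            return call()
        except OSError as e:
            # Give up if the next retry would start after the deadline.
            if clock() + delay_secs > deadline:
                raise TimeoutError(
                    'call kept failing for %.0fs: %s' % (timeout_secs, e))
            sleep(delay_secs)
```

A caller would wrap the health check, e.g. `call_with_deadline(lambda: check_health(url), timeout_secs=120)`, where `check_health` stands in for whatever actually issues the request. A longer deadline only helps if the failure is a transient flake rather than a devserver that is really down.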
Comment 3 by dshi@chromium.org, Jan 23 2017
Labels: -current-issue
The devserver allocation is done by looking up devservers in the restricted subnet.

UnhandledTestFail: Unhandled DevServerException: All devservers are currently down: set(['http://100.115.185.226:8082']). dut hostname: chromeos6-row2-rack11-host16

That message seems to indicate that there is only one devserver in the same restricted subnet as chromeos6-row2-rack11-host16. Please check the shadow config to see if there are other devservers in the same subnet. If not, more devservers should be added in that subnet.
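
A rough sketch of why one unhealthy devserver can produce an "all devservers" message when the candidate pool is restricted by subnet. This is Python 3 for illustration only, not the real `ImageServer.resolve`; the function signature, the example subnet in the usage note, and the fallback for DUTs outside any restricted subnet are all assumptions.

```python
import ipaddress


class DevServerException(Exception):
    """Stand-in for the exception named in the traceback."""


def resolve(dut_ip, devserver_ips, restricted_subnets, is_healthy):
    """Return a healthy devserver from the DUT's restricted subnet.

    Hypothetical signature. If the DUT is in no restricted subnet, this
    sketch simply considers every devserver (an assumption).
    """
    dut = ipaddress.ip_address(dut_ip)
    networks = [ipaddress.ip_network(s) for s in restricted_subnets]
    subnet = next((n for n in networks if dut in n), None)
    if subnet is None:
        candidates = list(devserver_ips)
    else:
        candidates = [d for d in devserver_ips
                      if ipaddress.ip_address(d) in subnet]
    healthy = [d for d in candidates if is_healthy(d)]
    if not healthy:
        # "All" here means all candidates, which may be a single host.
        raise DevServerException(
            'All devservers are currently down: %s' % sorted(candidates))
    return healthy[0]
```

With a made-up subnet such as `100.115.185.224/28` containing only one devserver, a single failed health check empties the candidate list, so the message names just that one host even if devservers elsewhere are fine.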
Status: WontFix
my understanding of the failure was wrong.
Did we ever figure out the root cause of failed Chrome PFQ runs with "all devservers are down" like:
https://luci-milo.appspot.com/buildbot/chromeos/peach_pit-chrome-pfq/3034

Comment 7 by x...@chromium.org, Jan 30 2017
This failure still happens on the peach_pit informational builder: https://uberchromegw.corp.google.com/i/chromeos.chrome/builders/peach_pit-tot-chrome-pfq-informational/builds/4659

provision FAIL: Unhandled DevServerException: All devservers are currently down: set(['http://100.115.185.226:8082']). dut hostname: chromeos6-row2-rack11-host1