Incorrectly blaming *all devservers* when a single devserver call flakes
Issue description

Example: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/97555145-chromeos-test/chromeos6-row2-rack11-host16/debug/

- We had already picked a devserver.
- We had successfully staged artifacts on it, and listed 'em.
- Then, god knows why, we decided to do a health check.
- When that failed (network flake? devserver load?) we threw an exception saying "all devservers are currently down".

That is doomsaying. And it's a lie:

01/23 10:33:07.485 INFO | dev_server:1055| Staging artifacts on devserver http://100.115.185.226:8082: build=peach_pit-chrome-pfq/R58-9211.0.0-rc1, artifacts=['full_payload', 'stateful', 'autotest_packages'], files=, archive_url=gs://chromeos-image-archive/peach_pit-chrome-pfq/R58-9211.0.0-rc1
01/23 10:33:10.512 INFO | dev_server:1073| Finished staging artifacts: build=peach_pit-chrome-pfq/R58-9211.0.0-rc1, artifacts=['full_payload', 'stateful', 'autotest_packages'], files=, archive_url=gs://chromeos-image-archive/peach_pit-chrome-pfq/R58-9211.0.0-rc1
01/23 10:33:10.514 INFO | dev_server:1391| Requesting contents from devserver http://100.115.185.226:8082 for image peach_pit-chrome-pfq/R58-9211.0.0-rc1
01/23 10:33:11.950 INFO | dev_server:1396| Listing contents of :/home/chromeos-test/images/peach_pit-chrome-pfq/R58-9211.0.0-rc1
01/23 10:33:11.952 INFO | dev_server:1396| Name: /home/chromeos-test/images/peach_pit-chrome-pfq/R58-9211.0.0-rc1/control_files.tar Accessed: 2017-
[... SNIP ...]
01/23 10:34:15.033 ERROR| dev_server:0404| Devserver call failed: "http://100.115.185.226:8082/check_health?", timeout: 60 seconds, Error: retry exception (label="make_call"), timeout = 60s
01/23 10:34:15.037 ERROR| dev_server:0720| All devservers are currently down: set(['http://100.115.185.226:8082']).
dut hostname: chromeos6-row2-rack11-host16
01/23 10:34:15.042 WARNI| test:0606| Autotest caught exception when running test:
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/test.py", line 600, in _exec
    _call_test_function(self.execute, *p_args, **p_dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 810, in _call_test_function
    raise error.UnhandledTestFail(e)
UnhandledTestFail: Unhandled DevServerException: All devservers are currently down: set(['http://100.115.185.226:8082']).
dut hostname: chromeos6-row2-rack11-host16
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/test.py", line 804, in _call_test_function
    return func(*args, **dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 461, in execute
    dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 347, in _call_run_once_with_retry
    postprocess_profiled_run, args, dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 376, in _call_run_once
    self.run_once(*args, **dargs)
  File "/usr/local/autotest/server/site_tests/provision_AutoUpdate/provision_AutoUpdate.py", line 111, in run_once
    force_full_update=force)
  File "/usr/local/autotest/server/afe_utils.py", line 254, in machine_install_and_update_labels
    *args, **dargs)
  File "/usr/local/autotest/server/hosts/cros_host.py", line 678, in machine_install_by_devserver
    devserver = dev_server.resolve(build, self.hostname)
  File "/usr/local/autotest/client/common_lib/cros/dev_server.py", line 2350, in resolve
    return ImageServer.resolve(build, hostname)
  File "/usr/local/autotest/client/common_lib/cros/dev_server.py", line 721, in resolve
    raise DevServerException(error_msg)
DevServerException: All devservers are currently down: set(['http://100.115.185.226:8082']).
dut hostname: chromeos6-row2-rack11-host16
,
Jan 23 2017
Also, it may be that we are failing in this particular mode too often; see the autofiled issue 675564. We should consider increasing this timeout and figuring out what's up here. The impact of this failure is huge (it fails the PFQ, for example).
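To make the suggestion concrete, here is a minimal sketch of retrying the health check before giving up, and of reporting only the candidates that were actually checked. It is not the autotest code: `DevServerError`, `is_healthy`, `resolve`, and the `check_for` callback are hypothetical names, and the attempt count and delay are arbitrary assumptions.

```python
import time


class DevServerError(Exception):
    """Hypothetical stand-in for autotest's DevServerException."""


def is_healthy(check, attempts=3, delay=0.0):
    """Return True if any of `attempts` calls to `check()` succeeds."""
    for attempt in range(attempts):
        try:
            if check():
                return True
        except Exception:
            pass  # count an exception (e.g. a timeout) as one failed attempt
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


def resolve(candidates, check_for, attempts=3, delay=0.0):
    """Return the first healthy devserver URL from `candidates`.

    `check_for(url)` must return a zero-argument health-check callable.
    """
    unhealthy = []
    for url in candidates:
        if is_healthy(check_for(url), attempts, delay):
            return url
        unhealthy.append(url)
    # Say exactly what was checked instead of claiming everything is down.
    raise DevServerError(
        'No healthy devserver among %d candidate(s) after %d attempt(s) '
        'each: %s' % (len(candidates), attempts, unhealthy))
```

With this shape, a single flaked call costs one retry rather than the whole provision job, and when it does fail the message says how many devservers were actually tried.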
,
Jan 23 2017
Devservers are allocated by looking up the devservers in the restricted subnet.

UnhandledTestFail: Unhandled DevServerException: All devservers are currently down: set(['http://100.115.185.226:8082']). dut hostname: chromeos6-row2-rack11-host16

That message seems to indicate that there is only one devserver in the same restricted subnet as chromeos6-row2-rack11-host16. Please check the shadow config to see if there are other devservers in that subnet. If not, more devservers should be added to it.
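A rough sketch of what a subnet-restricted lookup could look like, to show why the set in the error message can contain a single entry. This is an illustration only, not the autotest implementation: `devservers_in_subnet` is a hypothetical helper, and the DUT address, the second devserver, and the `/22` subnet in the usage lines are made-up values.

```python
import ipaddress
from urllib.parse import urlparse


def devservers_in_subnet(dut_ip, devserver_urls, restricted_subnets):
    """Return the devservers that share a restricted subnet with the DUT.

    Returns an empty list when the DUT is in none of the restricted subnets.
    """
    dut = ipaddress.ip_address(dut_ip)
    for cidr in restricted_subnets:
        net = ipaddress.ip_network(cidr)
        if dut in net:
            return [url for url in devserver_urls
                    if ipaddress.ip_address(urlparse(url).hostname) in net]
    return []


# Hypothetical values: only the first devserver URL comes from the log above.
candidates = devservers_in_subnet(
    '100.115.186.10',
    ['http://100.115.185.226:8082', 'http://100.115.245.200:8082'],
    ['100.115.184.0/22'])
print(candidates)
```

If the lookup yields one candidate, a single failed health check on it is enough to empty the pool, which is when the "all devservers" wording appears.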
,
Jan 25 2017
My understanding of the failure was wrong.
,
Jan 27 2017
Did we ever figure out the root cause of the failed Chrome PFQ runs with "all devservers are down", like this one? https://luci-milo.appspot.com/buildbot/chromeos/peach_pit-chrome-pfq/3034
,
Jan 30 2017
This failure still happens on the peach_pit informational builder: https://uberchromegw.corp.google.com/i/chromeos.chrome/builders/peach_pit-tot-chrome-pfq-informational/builds/4659

provision FAIL: Unhandled DevServerException: All devservers are currently down: set(['http://100.115.185.226:8082']). dut hostname: chromeos6-row2-rack11-host1
Comment 1 by jamescook@chromium.org
, Jan 23 2017