chromeos-server10 shard is down [Affects nyan_kitty, peach_pit, butterfly]
Issue description

nyan kitty paladin 3970 failed in swarming.py, which in my experience is not very good at reporting errors. This time there are various "refreshes" before the timeout.

04:40:50: INFO: RunCommand: /b/c/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /tmp/cbuildbot-tmpf2ZzVv/tmp9xUA9D/temp_summary.json --raw-cmd --task-name nyan_kitty-paladin/R65-10257.0.0-rc1-provision --dimension os Ubuntu-14.04 --dimension pool default --print-status-updates --timeout 9000 --io-timeout 9000 --hard-timeout 9000 --expiration 1200 '--tags=priority:CQ' '--tags=suite:provision' '--tags=build:nyan_kitty-paladin/R65-10257.0.0-rc1' '--tags=task_name:nyan_kitty-paladin/R65-10257.0.0-rc1-provision' '--tags=board:nyan_kitty' -- /usr/local/autotest/site_utils/run_suite.py --build nyan_kitty-paladin/R65-10257.0.0-rc1 --board nyan_kitty --suite_name provision --pool cq --file_bugs False --priority CQ --timeout_mins 90 --retry True --max_retries 5 --minimum_duts 4 --suite_args "{u'num_required': 1}" --offload_failures_only False --job_keyvals "{'cidb_build_stage_id': 66384923L, 'cidb_build_id': 2166993, 'datastore_parent_key': ('Build', 2166993, 'BuildStage', 66384923L)}" --test_args "{'fast': 'True'}" -m 165912475
04:53:26: INFO: Refreshing due to a 401 (attempt 1/2)
04:53:26: INFO: Refreshing access_token
04:57:31: INFO: Refreshing due to a 401 (attempt 1/2)
04:57:31: INFO: Refreshing access_token
05:47:50: INFO: Refreshing due to a 401 (attempt 1/2)
05:47:50: INFO: Refreshing access_token
05:53:29: INFO: Refreshing due to a 401 (attempt 1/2)
05:53:29: INFO: Refreshing access_token
05:57:37: INFO: Refreshing due to a 401 (attempt 1/2)
05:57:37: INFO: Refreshing access_token
06:10:51: INFO: Re-run swarming_cmd to avoid buildbot salency check.
06:10:51: INFO: RunCommand: /b/c/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /tmp/cbuildbot-tmpf2ZzVv/tmp9xUA9D/temp_summary.json --raw-cmd --task-name nyan_kitty-paladin/R65-10257.0.0-rc1-provision --dimension os Ubuntu-14.04 --dimension pool default --print-status-updates --timeout 9000 --io-timeout 9000 --hard-timeout 9000 --expiration 1200 '--tags=priority:CQ' '--tags=suite:provision' '--tags=build:nyan_kitty-paladin/R65-10257.0.0-rc1' '--tags=task_name:nyan_kitty-paladin/R65-10257.0.0-rc1-provision' '--tags=board:nyan_kitty' -- /usr/local/autotest/site_utils/run_suite.py --build nyan_kitty-paladin/R65-10257.0.0-rc1 --board nyan_kitty --suite_name provision --pool cq --file_bugs False --priority CQ --timeout_mins 90 --retry True --max_retries 5 --minimum_duts 4 --suite_args "{u'num_required': 1}" --offload_failures_only False --job_keyvals "{'cidb_build_stage_id': 66384923L, 'cidb_build_id': 2166993, 'datastore_parent_key': ('Build', 2166993, 'BuildStage', 66384923L)}" --test_args "{'fast': 'True'}" -m 165912475
06:16:36: WARNING: Exception is not retriable return code: 3; command: /b/c/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /tmp/cbuildbot-tmpf2ZzVv/tmp9xUA9D/temp_summary.json --raw-cmd --task-name nyan_kitty-paladin/R65-10257.0.0-rc1-provision --dimension os Ubuntu-14.04 --dimension pool default --print-status-updates --timeout 9000 --io-timeout 9000 --hard-timeout 9000 --expiration 1200 '--tags=priority:CQ' '--tags=suite:provision' '--tags=build:nyan_kitty-paladin/R65-10257.0.0-rc1' '--tags=task_name:nyan_kitty-paladin/R65-10257.0.0-rc1-provision' '--tags=board:nyan_kitty' -- /usr/local/autotest/site_utils/run_suite.py --build nyan_kitty-paladin/R65-10257.0.0-rc1 --board nyan_kitty --suite_name provision --pool cq --file_bugs False --priority CQ --timeout_mins 90 --retry True --max_retries 5 --minimum_duts 4 --suite_args "{u'num_required': 1}" --offload_failures_only False --job_keyvals "{'cidb_build_stage_id': 66384923L, 'cidb_build_id': 2166993, 'datastore_parent_key': ('Build', 2166993, 'BuildStage', 66384923L)}" --test_args "{'fast': 'True'}" -m 165912475

Priority was reset to 100 and then:

Suite timed out. Started on 12-28-2017 [04:40:49], timed out on 12-28-2017 [06:16:03]
12-28-2017 [06:16:03] Start collecting test results and dump them to json.
Suite job [ FAILED ]
Suite job ABORT:
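
To make the log above easier to read at a glance, here is a minimal Python sketch that counts the 401 refreshes and computes how long the suite ran before it timed out. The excerpt in `LOG` is copied from the log lines above with the long commands left out; the function name `summarize` and both regular expressions are illustrative assumptions, not part of chromite or swarming.

```python
import re
from datetime import datetime

# Excerpt copied from the log above; RunCommand lines are omitted for brevity.
LOG = """\
04:53:26: INFO: Refreshing due to a 401 (attempt 1/2)
04:57:31: INFO: Refreshing due to a 401 (attempt 1/2)
05:47:50: INFO: Refreshing due to a 401 (attempt 1/2)
05:53:29: INFO: Refreshing due to a 401 (attempt 1/2)
05:57:37: INFO: Refreshing due to a 401 (attempt 1/2)
Suite timed out. Started on 12-28-2017 [04:40:49], timed out on 12-28-2017 [06:16:03]
"""

# Hypothetical patterns written for this excerpt only.
REFRESH_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}): INFO: Refreshing due to a 401", re.MULTILINE
)
TIMEOUT_RE = re.compile(
    r"Started on (\S+) \[(\d{2}:\d{2}:\d{2})\], "
    r"timed out on (\S+) \[(\d{2}:\d{2}:\d{2})\]"
)


def summarize(log):
    """Return (list of 401 refresh timestamps, suite run time or None)."""
    refreshes = REFRESH_RE.findall(log)
    elapsed = None
    match = TIMEOUT_RE.search(log)
    if match:
        fmt = "%m-%d-%Y %H:%M:%S"
        start = datetime.strptime(f"{match.group(1)} {match.group(2)}", fmt)
        end = datetime.strptime(f"{match.group(3)} {match.group(4)}", fmt)
        elapsed = end - start
    return refreshes, elapsed


refreshes, elapsed = summarize(LOG)
print(f"401 refreshes: {len(refreshes)}, suite ran for: {elapsed}")
```

For this excerpt the run time works out to about 95 minutes, slightly more than the 90 minutes passed via --timeout_mins. That is consistent with the suite simply waiting until its timeout, but the log alone does not say why it was waiting.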
Dec 28 2017
The server isn't responding to ping. This is the second ganeti server that has died completely in the last two days. See issue 797791, where chromeos-server118.mtv died yesterday.
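
The ping check mentioned in this comment can be sketched in Python as follows. This is only an illustration under stated assumptions: it assumes a Linux `ping` that accepts `-c` (packet count) and `-W` (per-reply timeout in seconds), the function name `is_reachable` is made up for this example, and the fully qualified hostname of chromeos-server10 is not given in this thread, so the host is passed in as a parameter.

```python
import subprocess


def is_reachable(host, count=1, timeout_s=5, run=subprocess.run):
    """Return True if `host` answers an ICMP echo request.

    Assumes a Linux-style ping (-c count, -W timeout in seconds).
    `run` is injectable so the check can be exercised without a network.
    """
    cmd = ["ping", "-c", str(count), "-W", str(timeout_s), host]
    try:
        result = run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s * count + 5,  # hard cap on the whole call
        )
    except (subprocess.TimeoutExpired, OSError):
        # ping hung, is missing, or could not be started: treat as unreachable.
        return False
    return result.returncode == 0


# Example usage (hostname is a placeholder; substitute the real FQDN):
# print(is_reachable("chromeos-server10"))
```

A failed ping only shows that the host did not answer; it does not distinguish a dead machine from a network or firewall problem, so it complements the alerts and dashboards rather than replacing them.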
Dec 28 2017
For my benefit, and if it's not too hard to explain, what should I have done next to figure out that the shard is down? Thanks.
Dec 28 2017
Power cycle:
off: https://portal.corp.google.com/request/6aad7e5e-6a18-4a3e-b1cd-c288280d7c1c
on: https://portal.corp.google.com/request/945ad209-41b0-416e-9728-073cfbb9d7e4

For detecting this:
- infra gets alerts (that's how I knew) on this mailing list: https://groups.google.com/a/google.com/forum/#!forum/chromeos-build-alerts
  We don't expect non-infra folks to watch this.
- This is also immediately visible on the deputy dashboard (where I also noticed it): https://viceroy.corp.google.com/chromeos/deputy-view
  (See that shard heartbeat line going south)
- This _should_ have shown up on omens, but doesn't yet. https://viceroy.corp.google.com/chromeos/omens
  There is a bug open for that, I think.

+phobbs: All alerts should be omens.
Dec 28 2017
Shard is back up. Metrics are improving: https://viceroy.corp.google.com/chromeos/deputy-view?duration=1h
Dec 28 2017
Issue 797978 has been merged into this issue.
Dec 28 2017
Methinks.
Jul 30
Comment 1 by semenzato@chromium.org, Dec 28 2017