Issue 797965

Starred by 2 users

Issue metadata

Status: Archived
Owner:
Closed: Dec 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 0
Type: Bug




chromeos-server10 shard is down [Affects nyan_kitty, peach_pit, butterfly]

Project Member Reported by semenzato@chromium.org, Dec 28 2017

Issue description

nyan kitty paladin 3970 failed in swarming.py, which in my experience is not very good at reporting errors. This time the log shows several access_token "refreshes" (each triggered by a 401) before the timeout.

04:40:50: INFO: RunCommand: /b/c/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /tmp/cbuildbot-tmpf2ZzVv/tmp9xUA9D/temp_summary.json --raw-cmd --task-name nyan_kitty-paladin/R65-10257.0.0-rc1-provision --dimension os Ubuntu-14.04 --dimension pool default --print-status-updates --timeout 9000 --io-timeout 9000 --hard-timeout 9000 --expiration 1200 '--tags=priority:CQ' '--tags=suite:provision' '--tags=build:nyan_kitty-paladin/R65-10257.0.0-rc1' '--tags=task_name:nyan_kitty-paladin/R65-10257.0.0-rc1-provision' '--tags=board:nyan_kitty' -- /usr/local/autotest/site_utils/run_suite.py --build nyan_kitty-paladin/R65-10257.0.0-rc1 --board nyan_kitty --suite_name provision --pool cq --file_bugs False --priority CQ --timeout_mins 90 --retry True --max_retries 5 --minimum_duts 4 --suite_args "{u'num_required': 1}" --offload_failures_only False --job_keyvals "{'cidb_build_stage_id': 66384923L, 'cidb_build_id': 2166993, 'datastore_parent_key': ('Build', 2166993, 'BuildStage', 66384923L)}" --test_args "{'fast': 'True'}" -m 165912475
04:53:26: INFO: Refreshing due to a 401 (attempt 1/2)
04:53:26: INFO: Refreshing access_token
04:57:31: INFO: Refreshing due to a 401 (attempt 1/2)
04:57:31: INFO: Refreshing access_token
05:47:50: INFO: Refreshing due to a 401 (attempt 1/2)
05:47:50: INFO: Refreshing access_token
05:53:29: INFO: Refreshing due to a 401 (attempt 1/2)
05:53:29: INFO: Refreshing access_token
05:57:37: INFO: Refreshing due to a 401 (attempt 1/2)
05:57:37: INFO: Refreshing access_token
06:10:51: INFO: Re-run swarming_cmd to avoid buildbot salency check.
06:10:51: INFO: RunCommand: /b/c/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /tmp/cbuildbot-tmpf2ZzVv/tmp9xUA9D/temp_summary.json --raw-cmd --task-name nyan_kitty-paladin/R65-10257.0.0-rc1-provision --dimension os Ubuntu-14.04 --dimension pool default --print-status-updates --timeout 9000 --io-timeout 9000 --hard-timeout 9000 --expiration 1200 '--tags=priority:CQ' '--tags=suite:provision' '--tags=build:nyan_kitty-paladin/R65-10257.0.0-rc1' '--tags=task_name:nyan_kitty-paladin/R65-10257.0.0-rc1-provision' '--tags=board:nyan_kitty' -- /usr/local/autotest/site_utils/run_suite.py --build nyan_kitty-paladin/R65-10257.0.0-rc1 --board nyan_kitty --suite_name provision --pool cq --file_bugs False --priority CQ --timeout_mins 90 --retry True --max_retries 5 --minimum_duts 4 --suite_args "{u'num_required': 1}" --offload_failures_only False --job_keyvals "{'cidb_build_stage_id': 66384923L, 'cidb_build_id': 2166993, 'datastore_parent_key': ('Build', 2166993, 'BuildStage', 66384923L)}" --test_args "{'fast': 'True'}" -m 165912475
06:16:36: WARNING: Exception is not retriable return code: 3; command: /b/c/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /tmp/cbuildbot-tmpf2ZzVv/tmp9xUA9D/temp_summary.json --raw-cmd --task-name nyan_kitty-paladin/R65-10257.0.0-rc1-provision --dimension os Ubuntu-14.04 --dimension pool default --print-status-updates --timeout 9000 --io-timeout 9000 --hard-timeout 9000 --expiration 1200 '--tags=priority:CQ' '--tags=suite:provision' '--tags=build:nyan_kitty-paladin/R65-10257.0.0-rc1' '--tags=task_name:nyan_kitty-paladin/R65-10257.0.0-rc1-provision' '--tags=board:nyan_kitty' -- /usr/local/autotest/site_utils/run_suite.py --build nyan_kitty-paladin/R65-10257.0.0-rc1 --board nyan_kitty --suite_name provision --pool cq --file_bugs False --priority CQ --timeout_mins 90 --retry True --max_retries 5 --minimum_duts 4 --suite_args "{u'num_required': 1}" --offload_failures_only False --job_keyvals "{'cidb_build_stage_id': 66384923L, 'cidb_build_id': 2166993, 'datastore_parent_key': ('Build', 2166993, 'BuildStage', 66384923L)}" --test_args "{'fast': 'True'}" -m 165912475
Priority was reset to 100

and then:

Suite timed out. Started on 12-28-2017 [04:40:49], timed out on 12-28-2017 [06:16:03]
  12-28-2017 [06:16:03] Start collecting test results and dump them to json.
  Suite job   [ FAILED ]
  Suite job     ABORT: 
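
For reference, the two timestamps in the "Suite timed out" line can be compared against the `--timeout_mins 90` argument passed to run_suite.py. The sketch below only does that arithmetic; whether the abort was actually triggered by that particular timeout is an assumption, not something the log states.

```python
from datetime import datetime

# Timestamp format used in the "Suite timed out" line above.
FMT = "%m-%d-%Y [%H:%M:%S]"

started = datetime.strptime("12-28-2017 [04:40:49]", FMT)
timed_out = datetime.strptime("12-28-2017 [06:16:03]", FMT)

elapsed = timed_out - started
elapsed_min = elapsed.total_seconds() / 60

# Value of --timeout_mins from the run_suite.py command line in the log.
timeout_mins = 90

print(elapsed)                     # 1:35:14
print(elapsed_min > timeout_mins)  # True
```

So the suite ran for about 95 minutes, roughly 5 minutes past the 90-minute suite timeout, before results were collected.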
  
 
Exactly the same thing happened on peach-pit and nyan-kitty. There was also a separate ARC test failure on auron-yuma (issue 783832), FYI.
Labels: -Pri-1 Pri-0
Status: Started (was: Untriaged)
Summary: chromeos-server10 shard is down [Affects nyan_kitty, peach_pit, butterfly] (was: swarming produces 401 errors, then times out)
The server isn't responding to ping.
This is the second Ganeti server that has died completely in the last two days. See issue 797791, where chromeos-server118.mtv died yesterday.
For my benefit, and if it's not too hard to explain, what should I have done next to figure out that the shard is down?  Thanks.
Cc: pho...@chromium.org
Power cycle:

off: https://portal.corp.google.com/request/6aad7e5e-6a18-4a3e-b1cd-c288280d7c1c
on: https://portal.corp.google.com/request/945ad209-41b0-416e-9728-073cfbb9d7e4

For detecting this:
- Infra gets alerts (that's how I knew) on this mailing list: https://groups.google.com/a/google.com/forum/#!forum/chromeos-build-alerts
We don't expect non-infra folks to watch this list.

- This is also immediately visible on the deputy dashboard (where I also noticed it): https://viceroy.corp.google.com/chromeos/deputy-view (See that shard heartbeat line going south)

- This _should_ have shown up on omens, but doesn't yet: https://viceroy.corp.google.com/chromeos/omens
I think there is a bug open for that. +phobbs: All alerts should be omens.
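
Besides the dashboards listed above, the "not responding to ping" check from earlier in this thread can be scripted. This is a minimal sketch, not an infra tool: the function name is made up for illustration, the `-c` and `-W` flags assume the Linux iputils `ping`, and the full hostname of the shard is not given in this issue, so any hostname passed in is a placeholder.

```python
import subprocess


def is_host_reachable(host, count=1, timeout_s=2, runner=subprocess.run):
    """Return True if `host` answers ping, False otherwise.

    Hypothetical helper. `runner` is injectable so the function can be
    exercised without touching the network.
    """
    # -c: number of echo requests, -W: per-reply timeout in seconds
    # (flags as in Linux iputils ping; other platforms differ).
    cmd = ["ping", "-c", str(count), "-W", str(timeout_s), host]
    try:
        result = runner(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=count * timeout_s + 5,  # hard cap on the whole call
        )
    except (subprocess.TimeoutExpired, OSError):
        # ping hung, or the ping binary is missing: treat as unreachable.
        return False
    return result.returncode == 0
```

Calling `is_host_reachable("<shard hostname>")` returns False for a host that does not answer, which is the symptom reported here. A failed ping alone does not prove the shard is down (ICMP can be filtered), so it is a first hint to confirm against the shard heartbeat graph.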
Shard is back up. Metrics are improving: https://viceroy.corp.google.com/chromeos/deputy-view?duration=1h
Issue 797978 has been merged into this issue.
Status: Fixed (was: Started)
Methinks.
Status: Archived (was: Fixed)
