Issue 738139

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Jul 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




swarming failure, HWTest results missing

Project Member Reported by ayatane@chromium.org, Jun 29 2017

Issue description

build: https://uberchromegw.corp.google.com/i/chromeos/builders/master-paladin/builds/15209

Note that all of the suites passed successfully, but results were not retrieved.

Looks like a swarming failure?

Relevant HWTest logs:

04:14:47: INFO: RunCommand: /b/c/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /tmp/cbuildbot-tmpDQ_U8z/tmpTFmydB/temp_summary.json --raw-cmd --task-name peach_pit-paladin/R61-9696.0.0-rc1-bvt-inline --dimension os Ubuntu-14.04 --dimension pool default --print-status-updates --timeout 9000 --io-timeout 9000 --hard-timeout 9000 --expiration 1200 '--tags=priority:CQ' '--tags=suite:bvt-inline' '--tags=build:peach_pit-paladin/R61-9696.0.0-rc1' '--tags=task_name:peach_pit-paladin/R61-9696.0.0-rc1-bvt-inline' '--tags=board:peach_pit' -- /usr/local/autotest/site_utils/run_suite.py --build peach_pit-paladin/R61-9696.0.0-rc1 --board peach_pit --suite_name bvt-inline --pool cq --num 6 --file_bugs False --priority CQ --timeout_mins 90 --retry True --max_retries 5 --minimum_duts 4 --offload_failures_only True --job_keyvals "{'cidb_build_stage_id': 48609511L, 'cidb_build_id': 1629418, 'datastore_parent_key': ('Build', 1629418, 'BuildStage', 48609511L)}" -c
DEBUG:root:Bug filing disabled. No module named tlslite.utils
Autotest instance: cautotest
06-29-2017 [04:25:19] Submitted create_suite_job rpc
06-29-2017 [04:25:31] Created suite job: http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=125735676
@@@STEP_LINK@Link to suite@http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=125735676@@@
--create_and_return was specified, terminating now.
Will return from run_suite with status: OK
04:25:44: INFO: RunCommand: /b/c/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /tmp/cbuildbot-tmpDQ_U8z/tmpEy7Bb4/temp_summary.json --raw-cmd --task-name peach_pit-paladin/R61-9696.0.0-rc1-bvt-inline --dimension os Ubuntu-14.04 --dimension pool default --print-status-updates --timeout 9000 --io-timeout 9000 --hard-timeout 9000 --expiration 1200 '--tags=priority:CQ' '--tags=suite:bvt-inline' '--tags=build:peach_pit-paladin/R61-9696.0.0-rc1' '--tags=task_name:peach_pit-paladin/R61-9696.0.0-rc1-bvt-inline' '--tags=board:peach_pit' -- /usr/local/autotest/site_utils/run_suite.py --build peach_pit-paladin/R61-9696.0.0-rc1 --board peach_pit --suite_name bvt-inline --pool cq --num 6 --file_bugs False --priority CQ --timeout_mins 90 --retry True --max_retries 5 --minimum_duts 4 --offload_failures_only True --job_keyvals "{'cidb_build_stage_id': 48609511L, 'cidb_build_id': 1629418, 'datastore_parent_key': ('Build', 1629418, 'BuildStage', 48609511L)}" -m 125735676
04:35:53: INFO: Refreshing due to a 401 (attempt 1/2)
04:35:53: INFO: Refreshing access_token
04:46:04: WARNING: Exception is not retriable return code: 1; command: /b/c/cbuild/repository/chromite/third_party/swarming.client/swarming.py run --swarming chromeos-proxy.appspot.com --task-summary-json /tmp/cbuildbot-tmpDQ_U8z/tmpEy7Bb4/temp_summary.json --raw-cmd --task-name peach_pit-paladin/R61-9696.0.0-rc1-bvt-inline --dimension os Ubuntu-14.04 --dimension pool default --print-status-updates --timeout 9000 --io-timeout 9000 --hard-timeout 9000 --expiration 1200 '--tags=priority:CQ' '--tags=suite:bvt-inline' '--tags=build:peach_pit-paladin/R61-9696.0.0-rc1' '--tags=task_name:peach_pit-paladin/R61-9696.0.0-rc1-bvt-inline' '--tags=board:peach_pit' -- /usr/local/autotest/site_utils/run_suite.py --build peach_pit-paladin/R61-9696.0.0-rc1 --board peach_pit --suite_name bvt-inline --pool cq --num 6 --file_bugs False --priority CQ --timeout_mins 90 --retry True --max_retries 5 --minimum_duts 4 --offload_failures_only True --job_keyvals "{'cidb_build_stage_id': 48609511L, 'cidb_build_id': 1629418, 'datastore_parent_key': ('Build', 1629418, 'BuildStage', 48609511L)}" -m 125735676
Priority was reset to 100
Triggered task: peach_pit-paladin/R61-9696.0.0-rc1-bvt-inline
Waiting for results from the following shards: 0
N/A: 370c6b0fc966eb10 None

cmd=['/b/c/cbuild/repository/chromite/third_party/swarming.client/swarming.py', 'run', '--swarming', 'chromeos-proxy.appspot.com', '--task-summary-json', '/tmp/cbuildbot-tmpDQ_U8z/tmpEy7Bb4/temp_summary.json', '--raw-cmd', '--task-name', u'peach_pit-paladin/R61-9696.0.0-rc1-bvt-inline', '--dimension', 'os', 'Ubuntu-14.04', '--dimension', 'pool', 'default', '--print-status-updates', '--timeout', '9000', '--io-timeout', '9000', '--hard-timeout', '9000', '--expiration', '1200', u'--tags=priority:CQ', u'--tags=suite:bvt-inline', u'--tags=build:peach_pit-paladin/R61-9696.0.0-rc1', u'--tags=task_name:peach_pit-paladin/R61-9696.0.0-rc1-bvt-inline', u'--tags=board:peach_pit', '--', '/usr/local/autotest/site_utils/run_suite.py', '--build', u'peach_pit-paladin/R61-9696.0.0-rc1', '--board', u'peach_pit', '--suite_name', u'bvt-inline', '--pool', u'cq', '--num', '6', '--file_bugs', 'False', '--priority', u'CQ', '--timeout_mins', '90', '--retry', 'True', '--max_retries', '5', '--minimum_duts', '4', '--offload_failures_only', 'True', '--job_keyvals', "{'cidb_build_stage_id': 48609511L, 'cidb_build_id': 1629418, 'datastore_parent_key': ('Build', 1629418, 'BuildStage', 48609511L)}", '-m', '125735676']
04:46:04: INFO: No json dump found, no HWTest results to report
04:46:04: INFO: Running cidb query on pid 16060, repr(query) starts with <sqlalchemy.sql.expression.Insert object at 0x7f869710a750>

@@@STEP_FAILURE@@@
04:46:04: ERROR: ** HWTest failed (code 1) **
04:46:04: INFO: Translating result ** HWTest failed (code 1) ** to fail.

 

Comment 1 by xixuan@chromium.org, Jun 29 2017

I think your guess is right.

Lots of suites around that time expired, like the one you pasted: https://chromeos-proxy.appspot.com/task?id=370c6b0fc966eb10&refresh=10&show_raw=1.

Metrics also show a non-operational period between 4 and 6 am today: https://viceroy.corp.google.com/chromeos/swarming_proxy.
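
For reference, a minimal sketch of confirming from the --task-summary-json file that the shard expired unrun rather than failed. This is not cbuildbot's actual check; the 'shards'/'state' field names are my assumption about the schema swarming.py writes.

# Rough sketch, not part of cbuildbot: inspect the file written by
# `swarming.py run --task-summary-json ...` and report shards that never ran.
# The 'shards'/'state' field names are assumptions about the summary schema.
import json
import sys

def expired_shards(summary_path):
    """Return indices of shards that expired before a bot picked them up."""
    with open(summary_path) as f:
        summary = json.load(f)
    expired = []
    for i, shard in enumerate(summary.get('shards', [])):
        # An expired shard either has no result at all (None) or carries an
        # EXPIRED state instead of an exit code.
        if shard is None or shard.get('state') == 'EXPIRED':
            expired.append(i)
    return expired

if __name__ == '__main__':
    print('expired shards: %r' % expired_shards(sys.argv[1]))

Run against the temp_summary.json path from the log above, this would be expected to flag shard 0, matching the "N/A: 370c6b0fc966eb10 None" line.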

Comment 2 by xixuan@chromium.org, Jun 29 2017

Maybe it's caused by a large peak in requests:

https://pantheon.corp.google.com/appengine?project=chromeos-proxy&organizationId=433637338589&duration=P30D&graph=AE_REQUESTS_BY_TYPE

There's a clear peak in the number of requests when this expiration happens, which may indicate that we need more swarming bots.
Labels: Chase-Pending
Status: Available (was: Untriaged)
Given that this caused failures, someone should definitively answer the question of whether we need more swarming bots.

Cc: cernekee@chromium.org jrbarnette@chromium.org
Issue 738033 has been merged into this issue.
Labels: -Pri-2 -Chase-Pending Pri-1
Owner: ayatane@chromium.org
Status: Assigned (was: Available)
P1 - This appears to be causing regular canary failures on at least one board.

Alas, I know very little about swarming, other than the fact that swarming failures produce HWTest failures with this signature.
Owner: xixuan@chromium.org
xixuan is best suited for this, and she's the current deputy.
The swarming proxy dashboard (https://viceroy.corp.google.com/chromeos/swarming_proxy) shows something is wrong.

The current swarming server (chromeos-server22.cbf) is not healthy: its load averages (~157-176) are far above the 6 available CPUs, free memory is down to ~205 MB, and swap is exhausted:

CPL | avg1  157.57 |  avg5  170.76 |              | avg15 176.57  |              | csw   262645 |  intr  218541 |              |               | numcpu     6 |
MEM | tot    31.4G |  free  205.3M | cache  76.8M | dirty   0.3M  | buff   11.4M | slab  150.5M |               |              |               |              |
SWP | tot     3.8G |  free    0.0M |              |               |              |              |               |              | vmcom  36.8G  | vmlim  19.5G |
PAG | scan   77059 |               | stall      0 |               |              |              |               | swin       0 |               | swout      0 |
DSK |         xvda |  busy     97% | read   11257 | write    740  | KiB/r      6 | KiB/w     32 |  MBr/s   7.12 | MBw/s   2.34 | avq    13.49  | avio 0.75 ms |
Labels: -Pri-1 Pri-0
Cc: uekawa@chromium.org victorhsieh@chromium.org
TO-DO list:

1. Migrate an existing swarming proxy server (server31) into the current pool for basic use. (in progress)
2. Provision more servers for swarming proxy.
3. Figure out a way to monitor each bot's health individually (a rough sketch of one possible approach is below).
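
For item 3, one possible approach would be to poll the Swarming server for bots it considers dead or quarantined. This is a sketch only, not an existing tool: the endpoint and field names are assumed from the public Swarming v1 REST API and may not match this proxy exactly, and corp authentication is omitted.

# Sketch only: poll the Swarming server for bots it reports as dead or
# quarantined. Endpoint/field names assume the public Swarming v1 API;
# authentication (required for the corp proxy) is omitted.
import json
import urllib2  # the builders in the log above run Python 2

SWARMING_URL = 'https://chromeos-proxy.appspot.com'

def unhealthy_bots():
    """Return the ids of bots the server reports as dead or quarantined."""
    resp = urllib2.urlopen(SWARMING_URL + '/_ah/api/swarming/v1/bots/list')
    bots = json.load(resp).get('items', [])
    return [b['bot_id'] for b in bots
            if b.get('is_dead') or b.get('quarantined')]

if __name__ == '__main__':
    for bot_id in unhealthy_bots():
        print('UNHEALTHY: %s' % bot_id)

Wired into a cron job or a monarch-style metric, something like this would surface individual bad bots instead of only the aggregate proxy graphs.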

Comment 13 by bugdroid1@chromium.org, Jul 5 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/1859d7c7932bbee64ef2724543e0e8160b5f11c6

commit 1859d7c7932bbee64ef2724543e0e8160b5f11c6
Author: xixuan <xixuan@chromium.org>
Date: Wed Jul 05 21:29:56 2017

Labels: -Pri-0 Pri-1

Comment 15 by bugdroid1@chromium.org, Jul 6 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/d52600fcfbf703a26b6100b24744ef617d7da120

commit d52600fcfbf703a26b6100b24744ef617d7da120
Author: xixuan <xixuan@chromium.org>
Date: Thu Jul 06 22:15:27 2017


Comment 16 by bugdroid1@chromium.org, Jul 7 2017

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/infradata/config/+/524fae9dca6d122fdeea9d31301a9975b1102a2a

commit 524fae9dca6d122fdeea9d31301a9975b1102a2a
Author: xixuan <xixuan@chromium.org>
Date: Fri Jul 07 00:03:59 2017

Comment 18 by oka@chromium.org, Jul 10 2017

Some HWTests are still failing in swarming. Are they related to this bug?

Comment 20 by oka@chromium.org, Jul 10 2017

Cc: oka@chromium.org
After checking, swarming is working now.
