
Issue 864000

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Jul 18
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




reef-chrome-pfq failures 7/14

Project Member Reported by achuith@chromium.org, Jul 16

Issue description

Unlikely to be swarming related if it's specific to a board, or specific to a test.
Cc: -dgarr...@chromium.org
Owner: dgarr...@chromium.org
Status: Assigned (was: Untriaged)
One of the runs failed due to infra issues, and the others timed out with the following error:

03:15:51: ERROR: wait_cmd has lab failures: cmd=['/b/swarming/w/ir/cache/cbuild/repository/chromite/third_party/swarming.client/swarming.py', 'run', '--swarming', 'chromeos-proxy.appspot.com', '--task-summary-json', '/b/swarming/w/ir/tmp/t/cbuildbot-tmp9Z4vAS/tmptTqYw4/temp_summary.json', '--print-status-updates', '--timeout', '18600', '--raw-cmd', '--task-name', u'reef-chrome-pfq/R69-10878.0.0-rc1-bvt-arc', '--dimension', 'os', 'Ubuntu-14.04', '--dimension', 'pool', 'default', '--io-timeout', '18600', '--hard-timeout', '18600', '--expiration', '1200', u'--tags=priority:PFQ', u'--tags=suite:bvt-arc', u'--tags=build:reef-chrome-pfq/R69-10878.0.0-rc1', u'--tags=task_name:reef-chrome-pfq/R69-10878.0.0-rc1-bvt-arc', u'--tags=board:reef', '--', '/usr/local/autotest/site_utils/run_suite.py', '--build', u'reef-chrome-pfq/R69-10878.0.0-rc1', '--board', u'reef', '--suite_name', u'bvt-arc', '--pool', u'bvt', '--file_bugs', 'True', '--priority', 'PFQ', '--timeout_mins', '250', '--retry', 'True', '--max_retries', '5', '--minimum_duts', '3', '--suite_min_duts', '3', '--offload_failures_only', 'False', '--job_keyvals', "{'cidb_build_stage_id': 85747786L, 'cidb_build_id': 2749804, 'datastore_parent_key': ('Build', 2749804, 'BuildStage', 85747786L)}", '-m', '217371576'].
Exception will be raised in the next json_dump run.
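
For what it's worth, from the command above: swarming's --timeout/--io-timeout/--hard-timeout are all 18600 s (310 min), while run_suite itself gets --timeout_mins 250 (4 h 10 min), so the suite-level timeout should trip first, with about 60 minutes of headroom before swarming kills the task.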

Cc: -pprabhu@chromium.org dgarr...@chromium.org
Owner: pprabhu@chromium.org
Not ARC specific, so reassigning to infra deputy.
tl;dr these are not one issue. They need to be handled separately, and not all by infra-deputy.
Unfortunately, a lot of things can lead to test/suite timeouts.
----------------------------------------------------

The first build: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8940963728568387856

had a suite timeout because one of the DUTs took a long time to fail provisioning.
The suite timeline is your friend in these cases: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/suiteDetails?suiteId=217371576
The failure shows a large number of crashes: https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row4-rack10-host13/1088438-provision/


I think an ARC constable should follow up on this.

----------------------------------------------------
The second build: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8940889496862660080
was aborted without running any tests. This is definitely a test-infra issue.
The reef bvt pool looks OK:
pprabhu@pprabhu:skylab_inventory$ dut-status -b reef -p bvt
hostname                       S   last checked         URL
chromeos6-row4-rack9-host9     OK  2018-07-16 16:49:20  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack9-host9/1097554-reset/
chromeos6-row4-rack9-host8     NO  2018-07-16 14:22:43  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack9-host8/1096721-repair/
chromeos6-row4-rack10-host13   OK  2018-07-16 16:01:52  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack10-host13/1097271-repair/
chromeos6-row3-rack10-host1    OK  2018-07-16 16:44:33  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row3-rack10-host1/1097531-reset/
chromeos6-row4-rack10-host14   OK  2018-07-16 16:50:32  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack10-host14/1097556-reset/
chromeos6-row4-rack10-host16   OK  2018-07-16 16:25:18  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack10-host16/1097422-reset/
chromeos6-row4-rack9-host18    OK  2018-07-16 16:44:45  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack9-host18/1097534-reset/
chromeos6-row4-rack9-host21    OK  2018-07-16 16:49:53  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack9-host21/1097555-reset/
chromeos6-row3-rack12-host1    OK  2018-07-16 16:48:46  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row3-rack12-host1/1097551-reset/
chromeos6-row3-rack12-host3    OK  2018-07-16 16:49:17  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row3-rack12-host3/1097553-reset/

and I don't see any obvious problems with the shard: http://shortn/_WXxi9k0qeH

----------------------------------------------------
The third build also had a very long provision that then failed: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/suiteDetails?suiteId=217745693

From the failed provision logs: https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row4-rack9-host8/1094500-provision/

In this case it seems like either there was a network issue in the lab (though I don't have any independent indication of this) or the image to install was not properly created or staged on the devserver.

2018-07-16 02:43:11-07:00 INFO: Update rootfs /dev/mmcblk1p5
2018-07-16 02:43:12-07:00 INFO: Updated status: DUT: Updating rootfs /dev/mmcblk1p5
2018-07-16 02:43:12-07:00 INFO: Updating /dev/mmcblk1p5 with http://100.115.219.139:8082/static/reef-chrome-pfq/R69-10881.0.0-rc1/full_dev_part_ROOT.bin.gz
--2018-07-16 02:43:12--  http://100.115.219.139:8082/static/reef-chrome-pfq/R69-10881.0.0-rc1/full_dev_part_ROOT.bin.gz
Connecting to 100.115.219.139:8082... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 OK
  Date: Mon, 16 Jul 2018 09:43:12 GMT
  Server: Apache/2.4.7 (Ubuntu)
  Last-Modified: Mon, 16 Jul 2018 08:53:52 GMT
  ETag: "43404068-57119f800cd2b"
  Accept-Ranges: bytes
  Content-Length: 1128284264
  Keep-Alive: timeout=60, max=1000
  Connection: Keep-Alive
  Content-Type: application/x-gzip
Length: 1128284264 (1.1G) [application/x-gzip]
Saving to: 'STDOUT'

     0K ........ ........ ........ ........  2% 4.30M 4m3s
 32768K ........ ........ ........ ........  5% 28.4M 2m16s
 65536K ........ ........ ........ ........  8% 25.8M 1m40s
 98304K ........ ........ ........ ........ 11% 4.31M 2m8s
131072K ........ ........ ........ ........ 14% 29.5M 1m45s
163840K ........ ........ ........ ........ 17% 4.32M 1m58s
196608K ........ ........ ........ ........ 20% 32.2M 1m42s
229376K ........ ........ ........ ........ 23% 32.2M 89s
262144K ........ ........ ........ ........ 26% 32.2M 79s
294912K ..                                  26% 20.5M=29s

2018-07-16 02:58:40 (10.1 MB/s) - Read error at byte 304488029/1128284264 (Connection timed out). Retrying.

--2018-07-16 02:58:41--  (try: 2)  http://100.115.219.139:8082/static/reef-chrome-pfq/R69-10881.0.0-rc1/full_dev_part_ROOT.bin.gz
Connecting to 100.115.219.139:8082... failed: Network is unreachable.
Retrying.

--2018-07-16 02:58:43--  (try: 3)  http://100.115.219.139:8082/static/reef-chrome-pfq/R69-10881.0.0-rc1/full_dev_part_ROOT.bin.gz
Connecting to 100.115.219.139:8082... failed: Network is unreachable.
Retrying.

--2018-07-16 02:58:46--  (try: 4)  http://100.115.219.139:8082/static/reef-chrome-pfq/R69-10881.0.0-rc1/full_dev_part_ROOT.bin.gz
Connecting to 100.115.219.139:8082... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 OK
  Date: Mon, 16 Jul 2018 09:58:47 GMT
  Server: Apache/2.4.7 (Ubuntu)
  Last-Modified: Mon, 16 Jul 2018 08:53:52 GMT
  ETag: "43404068-57119f800cd2b"
  Accept-Ranges: bytes
  Content-Length: 1128284264
  Keep-Alive: timeout=60, max=1000
  Connection: Keep-Alive
  Content-Type: application/x-gzip
Length: 1128284264 (1.1G) [application/x-gzip]
Saving to: 'STDOUT'

gzip: stdin: invalid compressed data--format violated

     0K                                      0% 33.5M=0.004s


Cannot write to '-' (Broken pipe).


I didn't see any systemic issues in the lab at this point (regarding devserver), so I'm at a dead end there.
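
One quick way to rule out a corrupt or truncated staged payload would be to re-fetch it and test the gzip stream end to end. A minimal sketch (assumes the build is still staged on that devserver and that the machine running it has lab network access):

# Stream the staged payload and verify the gzip format without writing it to disk;
# exit status 0 means the stream decompresses cleanly end to end.
wget -qO- http://100.115.219.139:8082/static/reef-chrome-pfq/R69-10881.0.0-rc1/full_dev_part_ROOT.bin.gz \
  | gzip -t && echo "payload looks intact" || echo "payload corrupt or download interrupted"

If that fails the same way from a different machine, it would point at image creation/staging rather than the lab network.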
Still digging into build #2: 

The test that timed out was http://cros-full-0015.mtv.corp.google.com/afe/#tab_id=view_job&object_id=217642350

That provision basically took ~2 hours to run and fail: https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row4-rack9-host18/1093337-provision/20181507195528/

This is not unreasonable behaviour when a DUT has intermittent SSH connectivity. Poked issue 730067 with the link here.
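
If someone wants to confirm the flaky connectivity directly, a hypothetical probe from a machine with lab access (assumes root SSH with the standard testing key works against the DUT):

# Probe the DUT every 30 s; intermittent connectivity shows up as a mix of
# successes and connect timeouts.
for i in $(seq 1 20); do
  ssh -o ConnectTimeout=10 -o BatchMode=yes root@chromeos6-row4-rack9-host18 true \
    && echo "attempt $i: ok" || echo "attempt $i: failed"
  sleep 30
done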
Handing the bug back to the ARC constable to investigate build failure #1 and to observe future builds.

I'll continue to dig into build failure #3.

I don't have any reason to suspect a test lab regression.

Owner: domlasko...@chromium.org
Really handing off.
Owner: achuith@chromium.org
Actually, the Chrome gardener is likely to keep an eye on future builds.
Cc: pprabhu@chromium.org domlasko...@chromium.org glevin@chromium.org
Owner: domlasko...@chromium.org
The builds are now getting terminated because they are taking too long:
https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/Prod/b8940765927402876096
https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/Prod/b8940782741112278912
https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/Prod/b8940817590236956272
https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/Prod/b8940800421565554864

HWTestArc took 6.5, 10.5, 20, and 14.5 hours.

It looks like there are 36 tests and they passed:
http://cautotest-prod/afe/#tab_id=view_job&object_id=218063823

Not sure what's going on.

Status: Fixed (was: Assigned)
Builder is now green with 3 successful runs:
https://cros-goldeneye.corp.google.com/chromeos/legoland/builderHistory?buildConfig=reef-chrome-pfq&buildBranch=master
