reef-chrome-pfq failures 7/14
Issue description

3 failing builds so far starting 7/14:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8940963728568387856
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8940889496862660080
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8940867126904681264

The HWTest stage seems to be timing out:
https://luci-logdog.appspot.com/v/?s=chromeos/buildbucket/cr-buildbucket.appspot.com/8940867126904681264/+/steps/HWTest__bvt-arc_/0/stdout

The only failing test is cheets_CTS_N.7.1_r18.x86.CtsAccountManagerTestCases.

ARC constable - can you take a look? Also cc-ing the lab deputy and Don in case this is a swarming issue. I'm planning to mark reef-chrome-pfq as experimental.
Jul 16
One of the runs failed due to infra issues, and the others timed out with the following error:

03:15:51: ERROR: wait_cmd has lab failures: cmd=['/b/swarming/w/ir/cache/cbuild/repository/chromite/third_party/swarming.client/swarming.py', 'run', '--swarming', 'chromeos-proxy.appspot.com', '--task-summary-json', '/b/swarming/w/ir/tmp/t/cbuildbot-tmp9Z4vAS/tmptTqYw4/temp_summary.json', '--print-status-updates', '--timeout', '18600', '--raw-cmd', '--task-name', u'reef-chrome-pfq/R69-10878.0.0-rc1-bvt-arc', '--dimension', 'os', 'Ubuntu-14.04', '--dimension', 'pool', 'default', '--io-timeout', '18600', '--hard-timeout', '18600', '--expiration', '1200', u'--tags=priority:PFQ', u'--tags=suite:bvt-arc', u'--tags=build:reef-chrome-pfq/R69-10878.0.0-rc1', u'--tags=task_name:reef-chrome-pfq/R69-10878.0.0-rc1-bvt-arc', u'--tags=board:reef', '--', '/usr/local/autotest/site_utils/run_suite.py', '--build', u'reef-chrome-pfq/R69-10878.0.0-rc1', '--board', u'reef', '--suite_name', u'bvt-arc', '--pool', u'bvt', '--file_bugs', 'True', '--priority', 'PFQ', '--timeout_mins', '250', '--retry', 'True', '--max_retries', '5', '--minimum_duts', '3', '--suite_min_duts', '3', '--offload_failures_only', 'False', '--job_keyvals', "{'cidb_build_stage_id': 85747786L, 'cidb_build_id': 2749804, 'datastore_parent_key': ('Build', 2749804, 'BuildStage', 85747786L)}", '-m', '217371576']. Exception will be raised in the next json_dump run.
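For reference, the timeout knobs in that command: swarming gets a 310-minute hard/io timeout while run_suite itself is given a 250-minute budget, so a swarming-level timeout suggests run_suite overran its own budget plus the slack. A minimal sketch (not chromite code, just the numbers pulled out of the logged command) to make the comparison explicit:

# Not part of chromite; just extracts the timeout flags from the logged
# swarming command above and compares the two layers' budgets.
swarming_cmd = [
    '--timeout', '18600', '--io-timeout', '18600', '--hard-timeout', '18600',
    '--expiration', '1200',
    '--', '/usr/local/autotest/site_utils/run_suite.py',
    '--timeout_mins', '250', '--max_retries', '5',
]

def flag_value(cmd, flag):
    """Return the value following a flag, or None if the flag is absent."""
    try:
        return cmd[cmd.index(flag) + 1]
    except (ValueError, IndexError):
        return None

swarming_budget_min = int(flag_value(swarming_cmd, '--hard-timeout')) / 60.0
suite_budget_min = int(flag_value(swarming_cmd, '--timeout_mins'))
print('swarming hard timeout: %.0f min' % swarming_budget_min)  # 310 min
print('run_suite timeout:     %d min' % suite_budget_min)       # 250 min
# A swarming-level timeout therefore means run_suite ran past its 250-minute
# budget and the extra slack, i.e. the suite was stuck rather than being
# ended cleanly by run_suite's own timeout handling.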
Jul 16
Not ARC specific, so reassigning to infra deputy.
Jul 16
tl;dr: these are not one issue. They need to be handled separately, and not all by the infra deputy. A lot of things can unfortunately lead to test/suite timeouts.

----------------------------------------------------

The first build:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8940963728568387856
had a suite timeout because one of the DUTs took a long time failing provision. The suite timeline is your friend in these cases:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/suiteDetails?suiteId=217371576
and the failure shows a large number of crashes:
https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row4-rack10-host13/1088438-provision/
I think an ARC constable should follow this up.

----------------------------------------------------

The second build:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8940889496862660080
was aborted without running any tests. This is definitely a test-infra issue. The reef bvt pool looks OK (a small scripted version of this check is sketched at the end of this comment):

pprabhu@pprabhu:skylab_inventory$ dut-status -b reef -p bvt
hostname                      S   last checked         URL
chromeos6-row4-rack9-host9    OK  2018-07-16 16:49:20  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack9-host9/1097554-reset/
chromeos6-row4-rack9-host8    NO  2018-07-16 14:22:43  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack9-host8/1096721-repair/
chromeos6-row4-rack10-host13  OK  2018-07-16 16:01:52  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack10-host13/1097271-repair/
chromeos6-row3-rack10-host1   OK  2018-07-16 16:44:33  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row3-rack10-host1/1097531-reset/
chromeos6-row4-rack10-host14  OK  2018-07-16 16:50:32  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack10-host14/1097556-reset/
chromeos6-row4-rack10-host16  OK  2018-07-16 16:25:18  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack10-host16/1097422-reset/
chromeos6-row4-rack9-host18   OK  2018-07-16 16:44:45  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack9-host18/1097534-reset/
chromeos6-row4-rack9-host21   OK  2018-07-16 16:49:53  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row4-rack9-host21/1097555-reset/
chromeos6-row3-rack12-host1   OK  2018-07-16 16:48:46  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row3-rack12-host1/1097551-reset/
chromeos6-row3-rack12-host3   OK  2018-07-16 16:49:17  http://cautotest.corp.google.com/tko/retrieve_logs.cgi?job=/results/hosts/chromeos6-row3-rack12-host3/1097553-reset/

and I don't see any obvious problems with the shard: http://shortn/_WXxi9k0qeH

----------------------------------------------------

The third build also had a super long provision that then failed:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/suiteDetails?suiteId=217745693
From the failed provision logs:
https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row4-rack9-host8/1094500-provision/
In this case it seems like either there was a network issue in the lab (I don't have any other independent indication of this, though) or the image to install was somehow not properly created / staged on the devserver.
2018-07-16 02:43:11-07:00 INFO: Update rootfs /dev/mmcblk1p5
2018-07-16 02:43:12-07:00 INFO: Updated status: DUT: Updating rootfs /dev/mmcblk1p5
2018-07-16 02:43:12-07:00 INFO: Updating /dev/mmcblk1p5 with http://100.115.219.139:8082/static/reef-chrome-pfq/R69-10881.0.0-rc1/full_dev_part_ROOT.bin.gz
--2018-07-16 02:43:12--  http://100.115.219.139:8082/static/reef-chrome-pfq/R69-10881.0.0-rc1/full_dev_part_ROOT.bin.gz
Connecting to 100.115.219.139:8082... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 200 OK
  Date: Mon, 16 Jul 2018 09:43:12 GMT
  Server: Apache/2.4.7 (Ubuntu)
  Last-Modified: Mon, 16 Jul 2018 08:53:52 GMT
  ETag: "43404068-57119f800cd2b"
  Accept-Ranges: bytes
  Content-Length: 1128284264
  Keep-Alive: timeout=60, max=1000
  Connection: Keep-Alive
  Content-Type: application/x-gzip
Length: 1128284264 (1.1G) [application/x-gzip]
Saving to: 'STDOUT'

     0K ........ ........ ........ ........  2% 4.30M 4m3s
 32768K ........ ........ ........ ........  5% 28.4M 2m16s
 65536K ........ ........ ........ ........  8% 25.8M 1m40s
 98304K ........ ........ ........ ........ 11% 4.31M 2m8s
131072K ........ ........ ........ ........ 14% 29.5M 1m45s
163840K ........ ........ ........ ........ 17% 4.32M 1m58s
196608K ........ ........ ........ ........ 20% 32.2M 1m42s
229376K ........ ........ ........ ........ 23% 32.2M 89s
262144K ........ ........ ........ ........ 26% 32.2M 79s
294912K ..                                  26% 20.5M=29s

2018-07-16 02:58:40 (10.1 MB/s) - Read error at byte 304488029/1128284264 (Connection timed out). Retrying.

--2018-07-16 02:58:41--  (try: 2)  http://100.115.219.139:8082/static/reef-chrome-pfq/R69-10881.0.0-rc1/full_dev_part_ROOT.bin.gz
Connecting to 100.115.219.139:8082... failed: Network is unreachable. Retrying.

--2018-07-16 02:58:43--  (try: 3)  http://100.115.219.139:8082/static/reef-chrome-pfq/R69-10881.0.0-rc1/full_dev_part_ROOT.bin.gz
Connecting to 100.115.219.139:8082... failed: Network is unreachable. Retrying.

--2018-07-16 02:58:46--  (try: 4)  http://100.115.219.139:8082/static/reef-chrome-pfq/R69-10881.0.0-rc1/full_dev_part_ROOT.bin.gz
Connecting to 100.115.219.139:8082... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 200 OK
  Date: Mon, 16 Jul 2018 09:58:47 GMT
  Server: Apache/2.4.7 (Ubuntu)
  Last-Modified: Mon, 16 Jul 2018 08:53:52 GMT
  ETag: "43404068-57119f800cd2b"
  Accept-Ranges: bytes
  Content-Length: 1128284264
  Keep-Alive: timeout=60, max=1000
  Connection: Keep-Alive
  Content-Type: application/x-gzip
Length: 1128284264 (1.1G) [application/x-gzip]
Saving to: 'STDOUT'

gzip: stdin: invalid compressed data--format violated

     0K                                      0% 33.5M=0.004s

Cannot write to '-' (Broken pipe).

I didn't see any systemic issues in the lab at this point (regarding devserver), so I'm at a dead end there.
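The failure pattern above is consistent with wget restarting the whole payload after the dropped connection while the gzip decompressor on the pipe had already consumed the first ~300 MB, so the restarted stream is no longer valid gzip at that point. A rough illustrative sketch of the usual mitigation (resume with an HTTP Range request into a file, decompress afterwards) - this is not the actual provision / update code, just a sketch using the URL from the failing run:

# Illustrative only; not the real provisioning code path.
import os
import requests

URL = ('http://100.115.219.139:8082/static/reef-chrome-pfq/'
       'R69-10881.0.0-rc1/full_dev_part_ROOT.bin.gz')

def resumable_fetch(url, dest, tries=4):
    """Download to a file, resuming with a Range header after a dropped
    connection instead of restarting the stream from byte 0."""
    for _ in range(tries):
        have = os.path.getsize(dest) if os.path.exists(dest) else 0
        # Simplification: assumes the server honours Range (the log above
        # does advertise Accept-Ranges: bytes).
        headers = {'Range': 'bytes=%d-' % have} if have else {}
        try:
            with requests.get(url, headers=headers, stream=True,
                              timeout=60) as resp:
                resp.raise_for_status()
                with open(dest, 'ab') as f:
                    for chunk in resp.iter_content(chunk_size=1 << 20):
                        f.write(chunk)
            return dest
        except requests.RequestException:
            continue  # dropped connection; next attempt resumes from `have`
    raise IOError('gave up after %d tries' % tries)

Separately, here is the small scripted version of the dut-status pool check mentioned in the second-build section above, in case we want to watch the reef bvt pool over the next few builds. It assumes dut-status is on $PATH, as in the shell snippet, and parses the column layout shown there:

# Hypothetical helper, not an existing tool.
import subprocess

def bad_duts(board='reef', pool='bvt'):
    """Return (hostname, status) for every DUT whose status is not OK."""
    out = subprocess.check_output(
        ['dut-status', '-b', board, '-p', pool]).decode()
    bad = []
    for line in out.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 2 and fields[1] != 'OK':
            bad.append((fields[0], fields[1]))
    return bad

if __name__ == '__main__':
    for host, status in bad_duts():
        print('%s is %s' % (host, status))
    # Against the snapshot above this would flag only
    # chromeos6-row4-rack9-host8 (NO).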
Jul 17
Still digging into build #2. The test that timed out was http://cros-full-0015.mtv.corp.google.com/afe/#tab_id=view_job&object_id=217642350 and that provision took ~2 hours to run and fail: https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row4-rack9-host18/1093337-provision/20181507195528/ That is not unreasonable behaviour when the DUT has intermittent SSH connectivity. Poked issue 730067 with the link here.
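To make the "hours from flaky SSH" point concrete, a back-of-the-envelope sketch with made-up numbers (the real autotest retry and timeout values may differ):

# All three numbers below are hypothetical, for illustration only.
ssh_calls_per_provision = 60   # ssh/scp invocations during one provision
connect_timeout_s = 30         # per-attempt connect timeout
retries_per_call = 3           # retries before each call gives up

worst_case_s = ssh_calls_per_provision * connect_timeout_s * retries_per_call
print('worst case spent waiting on ssh alone: %.1f hours'
      % (worst_case_s / 3600.0))
# -> 1.5 hours, before any actual payload transfer or update work.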
Jul 17
Handing the bug back to the ARC constable to investigate build failure #1 and to observe future builds. I'll continue to dig into build failure #3. I don't have any reason to suspect a test lab regression.
Jul 17
Really handing off.
Jul 17
Actually, the Chrome gardener is likely to keep an eye on future builds.
Jul 17
The builds are now getting terminated because they are taking too long:
https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/Prod/b8940765927402876096
https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/Prod/b8940782741112278912
https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/Prod/b8940817590236956272
https://ci.chromium.org/p/chromeos/builders/luci.chromeos.general/Prod/b8940800421565554864
HWTestArc took 6.5, 10.5, 20, and 14.5 hours. It looks like there are 36 tests and they passed: http://cautotest-prod/afe/#tab_id=view_job&object_id=218063823
Not sure what's going on.
Jul 18
Builder is now green with 3 successful runs: https://cros-goldeneye.corp.google.com/chromeos/legoland/builderHistory?buildConfig=reef-chrome-pfq&buildBranch=master