caroline chrome informational PFQ failed because of TestLab failure |
||
Issue description
It has been failing since https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8926984325900441984

It failed because one of the hosts seems to have stopped working; see the selected error message from https://luci-logdog.appspot.com/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8926984325900441984/+/steps/HWTest__bvt-arc_/0/stdout:

host: chromeos6-row4-rack23-host17, status: Repairing, locked: False
diagnosis: Failed repair
labels: ['board:caroline', '4k_video_h264', '4k_video_vp8', 'storage:mmc', 'hw_video_acc_enc_h264', 'hw_jpeg_acc_enc', 'cts_abi_x86', 'cts_abi_arm', 'os:cros', 'hw_jpeg_acc_dec', 'hw_video_acc_vp8', 'hw_video_acc_h264', 'webcam', 'bluetooth', 'accel:cros-ec', 'arc', 'power:battery', 'model:caroline', 'caroline', 'internal_display', 'pool:cts', 'pool:continuous', 'audio_loopback_dongle', 'touchpad', 'touchscreen', 'stylus', 'phase:PVT', 'servo', 'variant:caroline', 'hw_video_acc_enc_vp8', 'ec:cros', 'sparse_coverage_5', 'sku:caroline_intel_skylake_core_m3_4Gb', 'cros-version:caroline-release/R69-10895.21.0']
Last 10 jobs within 3:18:00:
2677249 Repair started on: 2018-12-16 09:24:44 status FAIL
2677210 Verify started on: 2018-12-16 09:19:20 status FAIL
Reason: Suite job failed.
12-16-2018 [09:44:03] Output below this line is for buildbot consumption:
Will return from run_suite with status: INFRA_FAILURE

Assign to current infra deputy. xixuan@, dgarrett@, is this issue also caused by the lab outage? Do we need to worry about the problematic host chromeos6-row4-rack23-host17? The error persists in later runs.
,
Dec 17
I think it's because there are only 3 working DUTs, and they cannot finish all the tests in the given suite in time.

xixuan@xixuan0:~/chromiumos/infra/suite_scheduler$ dut-status -b veyron_minnie -p continuous
hostname S last checked URL
chromeos4-row9-rack9-host17 OK 2018-12-17 13:21:02 https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row9-rack9-host17/3007699-cleanup/
chromeos4-row9-rack10-host7 OK 2018-12-17 13:21:57 https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row9-rack10-host7/3007718-cleanup/
chromeos4-row9-rack10-host9 OK 2018-12-17 13:21:57 https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row9-rack10-host9/3007717-cleanup/

xixuan@xixuan0:~/chromiumos/infra/suite_scheduler$ dut-status -b caroline -p continuous
hostname S last checked URL
chromeos6-row2-rack23-host16 OK 2018-12-17 15:46:54 https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row2-rack23-host16/2685911-reset/
chromeos6-row2-rack21-host20 OK 2018-12-17 15:47:12 https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row2-rack21-host20/2685913-reset/
chromeos6-row2-rack23-host14 OK 2018-12-17 15:44:59 https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row2-rack23-host14/2685896-reset/
chromeos6-row4-rack23-host17 NO 2018-12-17 14:33:54 https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row4-rack23-host17/2685630-repair/
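
For anyone who wants to check the working-DUT count without reading the table by eye, here is a minimal sketch that tallies the caroline rows quoted above. It is not part of dut-status or any lab tooling: the names `DutRow` and `parse_dut_status` are hypothetical, and it assumes every data row has exactly five whitespace-separated fields with a status of either OK or NO, which are the only two values visible in this output.

```python
from collections import Counter
from typing import List, NamedTuple

# Rows copied from the `dut-status -b caroline -p continuous` output above.
CAROLINE_OUTPUT = """\
hostname S last checked URL
chromeos6-row2-rack23-host16 OK 2018-12-17 15:46:54 https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row2-rack23-host16/2685911-reset/
chromeos6-row2-rack21-host20 OK 2018-12-17 15:47:12 https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row2-rack21-host20/2685913-reset/
chromeos6-row2-rack23-host14 OK 2018-12-17 15:44:59 https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row2-rack23-host14/2685896-reset/
chromeos6-row4-rack23-host17 NO 2018-12-17 14:33:54 https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row4-rack23-host17/2685630-repair/
"""


class DutRow(NamedTuple):
    """One data row of the table (hypothetical helper type)."""
    hostname: str
    status: str        # "OK" or "NO", as seen in the output above
    last_checked: str  # date and time joined back together
    url: str


def parse_dut_status(text: str) -> List[DutRow]:
    """Parse data rows, skipping the header and any unrecognized line."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Assumption: data rows have 5 fields and the second is OK or NO.
        # The header row also has 5 tokens, but its second token is "S".
        if len(fields) != 5 or fields[1] not in ("OK", "NO"):
            continue
        host, status, date, time, url = fields
        rows.append(DutRow(host, status, f"{date} {time}", url))
    return rows


rows = parse_dut_status(CAROLINE_OUTPUT)
counts = Counter(row.status for row in rows)
working = [row.hostname for row in rows if row.status == "OK"]
broken = [row.hostname for row in rows if row.status == "NO"]
print(f"working: {counts['OK']}, not working: {counts['NO']}")
print("needs repair:", ", ".join(broken))
```

Run against the caroline rows, this reports the three working hosts and flags chromeos6-row4-rack23-host17 as the one needing repair, matching the reading above.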
,
Dec 18
There are no available DUTs to balance for veyron_minnie, so I filed b/121159914 to eng-lab. I added one more DUT to pool:continuous and filed b/121159736 to eng-lab for the fix.
,
Dec 18
Re xixuan@, thanks for the reply! Let's see if it helps. However, I just found a mistake in my bug report: I checked the last green run and found that the host was already broken at that time, so it looks like the host did not cause this failure. Sorry for the misleading information.

The latest caroline run, https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8926871626225005072, failed at the TestPlan, HWTest [bvt-inline] and HWTest [bvt-cq] steps, all with TestLabFailure. The same is true for the latest veyron_minnie run: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?id=3240313. So I assume it's caused by the lab failure?

Both also failed at the HWTest [chrome-informational] stage with a different error: tast.cryptohome.Login, tast.security.OpenFDs, tast.informat_SERVER_JOB failed. That might have a different cause; I'll take a look at those.
,
Dec 18
Re #4: yes, I tend to think the caroline failure https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8926871626225005072 is caused by a lack of DUTs. These DUTs work very hard to run the tests but still can't finish them all. See the timelines in the suite details:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/suiteDetails?suiteId=268059633
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/suiteDetails?suiteId=268059636
,
Dec 18
I see, that makes sense. It's hard to understand why the tests now need so much more time to finish than in previous runs. Take the caroline informational PFQ as an example: it usually takes about 300 minutes to finish, but recently that has risen to more than 350 minutes, even though the number of working hosts is the same (3 hosts).
,
Dec 19
The TestLab failure can no longer be observed on the caroline informational PFQ. It seems the lab has recovered. The replacement of the broken host is tracked in b/121159736.
||
Comment 1 by x...@chromium.org
, Dec 17