
Issue 915850


Issue metadata

Status: Fixed
Owner:
Closed: Dec 19
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




caroline chrome informational PFQ failed because of TestLab failure

Project Member Reported by x...@chromium.org, Dec 17

Issue description

It started failing at this build: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8926984325900441984

It failed because one of the hosts seems to have stopped working; see the selected error message from https://luci-logdog.appspot.com/logs/chromeos/buildbucket/cr-buildbucket.appspot.com/8926984325900441984/+/steps/HWTest__bvt-arc_/0/stdout:
host: chromeos6-row4-rack23-host17, status: Repairing, locked: False diagnosis: Failed repair
labels: ['board:caroline', '4k_video_h264', '4k_video_vp8', 'storage:mmc', 'hw_video_acc_enc_h264', 'hw_jpeg_acc_enc', 'cts_abi_x86', 'cts_abi_arm', 'os:cros', 'hw_jpeg_acc_dec', 'hw_video_acc_vp8', 'hw_video_acc_h264', 'webcam', 'bluetooth', 'accel:cros-ec', 'arc', 'power:battery', 'model:caroline', 'caroline', 'internal_display', 'pool:cts', 'pool:continuous', 'audio_loopback_dongle', 'touchpad', 'touchscreen', 'stylus', 'phase:PVT', 'servo', 'variant:caroline', 'hw_video_acc_enc_vp8', 'ec:cros', 'sparse_coverage_5', 'sku:caroline_intel_skylake_core_m3_4Gb', 'cros-version:caroline-release/R69-10895.21.0']
Last 10 jobs within 3:18:00:
2677249 Repair started on: 2018-12-16 09:24:44 status FAIL
2677210 Verify started on: 2018-12-16 09:19:20 status FAIL

Reason: Suite job failed.

 12-16-2018 [09:44:03] Output below this line is for buildbot consumption:
Will return from run_suite with status: INFRA_FAILURE

Assign to current infra deputy. 

xixuan@, dgarrett@, is this issue also caused by the lab outage? Do we need to worry about the problematic host chromeos6-row4-rack23-host17? The error persists in later runs.
 
The failures on veyron_minnie informational builds are presumably caused by the same TestLab failure; see the latest run https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8926880441334377840, though all hosts are working fine on veyron_minnie.
I think it's because there are only 3 working DUTs, and they cannot finish all the tests in the given suite in time.
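To make that concrete, here's a rough back-of-the-envelope capacity model (a minimal sketch; the numbers are hypothetical, chosen only to illustrate the reasoning, not taken from the lab):

# Rough capacity model: can `duts` devices finish `num_tests` tests of
# about `avg_test_minutes` each before the suite deadline? Ignores
# scheduling/provisioning overhead and assumes tests parallelize evenly.
def suite_fits(num_tests, avg_test_minutes, duts, deadline_minutes):
    needed = num_tests * avg_test_minutes / duts
    return needed <= deadline_minutes

# Hypothetical numbers: a suite that fits with 4 DUTs misses the same
# deadline once the pool drops to 3.
print(suite_fits(480, 3, 4, 360))  # True  (~360 min of work per DUT)
print(suite_fits(480, 3, 3, 360))  # False (~480 min of work per DUT)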

xixuan@xixuan0:~/chromiumos/infra/suite_scheduler$ dut-status -b veyron_minnie -p continuous
hostname                       S   last checked         URL
chromeos4-row9-rack9-host17    OK  2018-12-17 13:21:02  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row9-rack9-host17/3007699-cleanup/
chromeos4-row9-rack10-host7    OK  2018-12-17 13:21:57  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row9-rack10-host7/3007718-cleanup/
chromeos4-row9-rack10-host9    OK  2018-12-17 13:21:57  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos4-row9-rack10-host9/3007717-cleanup/


xixuan@xixuan0:~/chromiumos/infra/suite_scheduler$ dut-status -b caroline -p continuous
hostname                       S   last checked         URL
chromeos6-row2-rack23-host16   OK  2018-12-17 15:46:54  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row2-rack23-host16/2685911-reset/
chromeos6-row2-rack21-host20   OK  2018-12-17 15:47:12  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row2-rack21-host20/2685913-reset/
chromeos6-row2-rack23-host14   OK  2018-12-17 15:44:59  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row2-rack23-host14/2685896-reset/
chromeos6-row4-rack23-host17   NO  2018-12-17 14:33:54  https://stainless.corp.google.com/browse/chromeos-autotest-results/hosts/chromeos6-row4-rack23-host17/2685630-repair/
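For reference, a small script along these lines could summarize pool health from dut-status output (a sketch that assumes the tabular format shown above; the column layout is inferred from this report, not from the tool's documentation):

import subprocess

# Count the DUTs that `dut-status` reports as OK for a board/pool.
# Assumes the output shown above: a header row, then one row per host
# with the status flag in the second column.
def count_ok_duts(board, pool="continuous"):
    out = subprocess.run(
        ["dut-status", "-b", board, "-p", pool],
        capture_output=True, text=True, check=True,
    ).stdout
    rows = out.splitlines()[1:]  # skip the "hostname S ..." header
    return sum(1 for row in rows if row.split()[1:2] == ["OK"])

print(count_ok_duts("caroline"))       # 3 in the snapshot above
print(count_ok_duts("veyron_minnie"))  # 3 in the snapshot above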

There are no available DUTs to balance for veyron_minnie; filed b/121159914 to eng-lab.

I added one more DUT to pool:continuous; filed b/121159736 to eng-lab to fix.
Re xixuan@, thanks for the reply! Let's see if it helps. 
But I just found that I made a mistake in the bug report: I checked the last green run and found the host was already broken at that time, so it looks like the host was not what caused this failure. Sorry for the misleading information.

In the latest caroline run, https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8926871626225005072, it failed at the TestPlan, HWTest [bvt-inline], and HWTest [bvt-cq] steps, all with TestLabFailure. The same holds for the latest veyron_minnie run: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?id=3240313. So I assume it's caused by the lab failure?

They also both failed at the HWTest [chrome-informational] stage with a different error (tast.cryptohome.Login, tast.security.OpenFDs, tast.informat_SERVER_JOB failed), which might have a different root cause; I'll take a look at those.
Re #4: yes, I tend to think the caroline failure https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8926871626225005072 is caused by a lack of DUTs. I can see these DUTs working very hard to run the tests, but they still can't finish them all:

See the timeline in suite details:
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/suiteDetails?suiteId=268059633
https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/suiteDetails?suiteId=268059636


I see, that makes sense. Still, it's hard to understand why the tests now require much more time to finish than in previous runs. Take the caroline informational PFQ as an example: it usually takes about 300 minutes to finish, but recently that number has jumped to more than 350 minutes, even though the number of working hosts is the same (3 hosts).
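As a quick sanity check on those numbers (using only the approximate figures quoted above, not measured data):

# With the DUT count fixed at 3, going from ~300 to ~350 wall-clock
# minutes means roughly 17% more DUT-minutes of work per run, so either
# the test load grew or per-test runtime regressed.
hosts = 3
usual, recent = 300, 350  # minutes, approximate
growth = (recent - usual) / usual
print(f"{hosts * usual} -> {hosts * recent} DUT-minutes ({growth:.0%} increase)")
# prints: 900 -> 1050 DUT-minutes (17% increase)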

Status: Fixed (was: Assigned)
The TestLab failure can no longer be observed on the caroline informational PFQ. It seems the lab has recovered.
The replacement of the broken host is tracked in b/121159736.
