coral-release fails HWTests - DUTs look really unstable
Project Member Reported by email@example.com, Feb 23 2018
https://luci-milo.appspot.com/buildbot/chromeos/coral-release/

The failure is not always the same (sometimes provisioning, sometimes instability in the tests), but failures are so frequent that they are probably symptoms of a generic instability.

Various failures to provision (device not coming back up from reboot, not rebooting, ...): https://luci-milo.appspot.com/buildbot/chromeos/coral-release/779

Some provisioning errors, mixed with very flaky tests: https://luci-milo.appspot.com/buildbot/chromeos/coral-release/778

Assigning to vineeths@, who has agreed to help find an owner.
Feb 23 2018,
Assigning to Charles. Looks like coral builds have been red for some time now, possibly because coral needs many devices to pass for a green run, as opposed to other boards which require only one. For example, reks here looks much better: https://luci-milo.appspot.com/buildbot/chromeos/reks-release/#
Feb 26 2018,
I've never found anything that wasn't infra related. Since coral runs 13 models, its probability of getting hit by transient failures goes up by roughly 13x.
coral seems to fail provisioning at a pretty high rate (6-7%). Compounded with the fact that we run so many models, this means a lot of runs are going to be knocked out by this alone. I'll do some more digging here.
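A quick back-of-the-envelope sketch of that compounding effect (assuming, purely for illustration, that provisioning failures are independent across models and that any single model failing provisioning is enough to knock out the run):

def run_failure_probability(per_model_failure_rate, num_models):
    """Probability that at least one of num_models fails provisioning."""
    return 1.0 - (1.0 - per_model_failure_rate) ** num_models

for rate in (0.02, 0.05, 0.07):
    print('per-model rate {:.0%} -> run-level rate {:.0%}'.format(
        rate, run_failure_probability(rate, 13)))

# Roughly:
#   per-model rate 2% -> run-level rate 23%
#   per-model rate 5% -> run-level rate 49%
#   per-model rate 7% -> run-level rate 61%

So even the "good" ~2% generic rate already puts a coral run at risk about one time in four, and at 6-7% the majority of runs will be hit.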
Provision failure percentage has been hovering around 5% recently: https://viceroy.corp.google.com/chromeos/provision?board=coral&breakdowns=build_type&build_type=&delta_window=60m&devserver=&duration=8d&dut=&groups=Provision&percentile=90&prior_alpha=0.1&prior_beta=1.9&refresh=-1&repository_behavior=DO_NOT_SKIP&success=&topstreams=5&type=Provision&use_precomputation=1

Contrast with the generic failure rate, often <2%: https://viceroy.corp.google.com/chromeos/provision?groups=Provision&breakdowns=build_type&board=&build_type=&success=&devserver=&dut=&topstreams=5&delta_window=60m&duration=8d&percentile=90&prior_alpha=0.1&prior_beta=1.9&refresh=-1&repository_behavior=DO_NOT_SKIP&type=Provision&use_precomputation=1

-> ejcaruso: did you get anywhere with your investigation?
Not yet, I've been sidetracked by last-minute modemfwd changes. I'll do more investigation tomorrow.
Issue 826903 has been merged into this issue.
Coral is still failing pretty widely. Is there a way forward that isn't blocked on the logging feature request at Issue 819882?
Are there any logs from the specific machines that failed? I see mention of a Cr50 failure on the other bug. A Cr50 update will be blocked by rollback protection if it is asked to downrev (for example, if you test a build from master and then try to test an M65 build). I'd expect this to fail gracefully, but maybe there's a reporting bug there. Coral is one of the first devices to see regular Cr50 updates, so it's plausible that this could be a factor.
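To make the rollback-protection scenario concrete, here is a minimal sketch of the behavior described above (illustrative only; the version tuples, function name, and exception are hypothetical, not the actual cr50 updater code):

class Cr50RollbackError(Exception):
    """Raised when an update would move Cr50 to an older version."""

def check_cr50_update(running_version, target_version):
    # Rollback protection: refuse any target older than what is running,
    # e.g. a DUT that took a newer Cr50 from a master build and is then
    # asked to provision an M65 image.
    if target_version < running_version:
        raise Cr50RollbackError(
            'refusing Cr50 downrev from {} to {}'.format(
                running_version, target_version))

try:
    check_cr50_update(running_version=(0, 3, 14), target_version=(0, 0, 22))
except Cr50RollbackError as e:
    print(e)  # the graceful-failure (or mis-reported) path would surface here

If provisioning treats that rejection as a DUT failure instead of a graceful skip, it would show up as exactly the kind of instability this bug describes.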
cros-cts-te update, M66 GoldenEye: https://cros-goldeneye.corp.google.com/chromeos/console/listBuild?boards=coral&milestone=66&chromeOsVersion=&chromeVersion=&startTimeFrom=&startTimeTo=&token=ALeBcqHQClKOIJ0kZLo_KHX3A2iB%3A1524476144609#/

https://luci-milo.appspot.com/buildbot/chromeos_release/coral-release%20release-R66-10452.B/

BuildPackages issue: https://logs.chromium.org/v/?s=chromeos%2Fbb%2Fchromeos_release%2Fcoral-release_release-R66-10452.B%2F40%2F%2B%2Frecipes%2Fsteps%2FBuildPackages__afdo_use_%2F0%2Fstdout

SyncChrome issue: https://logs.chromium.org/v/?s=chromeos%2Fbb%2Fchromeos_release%2Fcoral-release_release-R66-10452.B%2F41%2F%2B%2Frecipes%2Fsteps%2FSyncChrome%2F0%2Fstdout
coral-release/R68-10621.0.0/bvt-inline/provision_AutoUpdate.double (195246198-chromeos-test)

https://ubercautotest.corp.google.com/afe/#tab_id=view_job&object_id=195246198

debug log: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/195246198-chromeos-test/chromeos6-row3-rack23-host15/debug