nyan_kitty: DUT rebooted during graphics_dEQP HWTest
Issue description
https://luci-milo.appspot.com/buildbot/chromeos/nyan_kitty-paladin/1595
INFO ---- ---- kernel=3.10.18 localtime=May 11 19:16:58 timestamp=1494555418
START ---- ---- timestamp=1494555435 localtime=May 11 19:17:15
START graphics_dEQP graphics_dEQP timestamp=1494555435 localtime=May 11 19:17:15
FAIL ---- ---- timestamp=1494556353 localtime=May 11 19:32:33 Autotest client terminated unexpectedly: DUT rebooted during the test run.
END FAIL ---- ---- timestamp=1494556353 localtime=May 11 19:32:33
END GOOD ---- ---- timestamp=1494556353 localtime=May 11 19:32:33
INFO ---- ---- timestamp=1494556358 localtime=May 11 19:32:38 Start crashcollection record
INFO ---- Orphaned Crash Dump timestamp=1494556358 localtime=May 11 19:32:38 /var/spool/crash/os-release
INFO ---- Orphaned Crash Dump timestamp=1494556358 localtime=May 11 19:32:38 /var/spool/crash/lsb-release
INFO ---- ---- timestamp=1494556358 localtime=May 11 19:32:38 End crashcollection record
The entire test log, looks like it cuts off?
05/11 19:17:16.223 DEBUG| utils:0202| Running 'wflinfo -p null -a gles2'
05/11 19:17:16.363 INFO | graphics_utils:1146| Found gles3.2.
05/11 19:17:16.405 DEBUG| test:0362| starting before_iteration_hooks
05/11 19:17:16.482 INFO | base_sysinfo:0380| ChromeOS BOARD = nyan_kitty_2.1GHz_4GB
05/11 19:17:16.484 DEBUG| utils:0202| Running 'logger "autotest starting iteration /usr/local/autotest/results/default/graphics_dEQP/sysinfo/iteration.1 on nyan_kitty_2.1GHz_4GB"'
05/11 19:17:16.502 DEBUG| test:0365| before_iteration_hooks completed
05/11 19:17:16.503 DEBUG| test:0379| starting test(run_once()), test details follow ()
05/11 19:17:16.505 INFO | graphics_dEQP:0499| Test Options: {'test_names': '', 'hasty': 'False', 'shard_number': '0', 'subset_to_run': 'Pass', 'filter': 'dEQP-GLES3.accuracy', 'test_names_file': '', 'timeout': 70, 'shard_count': '1', 'debug': 'False'}
05/11 19:17:16.507 INFO | graphics_dEQP:0514| ChromeOS BOARD = nyan_kitty
05/11 19:17:16.508 INFO | graphics_dEQP:0515| ChromeOS CPU family = tegra
05/11 19:17:16.509 INFO | graphics_dEQP:0516| ChromeOS GPU family = tegra
05/11 19:17:16.510 INFO | graphics_dEQP:0520| dEQP test filter = dEQP-GLES3.accuracy
05/11 19:17:16.515 DEBUG| utils:0202| Running 'status ui'
05/11 19:17:16.542 DEBUG| utils:0202| Running 'stop ui'
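For reference, the `timestamp=` fields in the status log in the issue description give the time between the test starting and the reboot being reported. A minimal sketch in Python follows; the two entries are copied from that log, and the helper names `extract_timestamp` and `elapsed_seconds` are made up for illustration, not part of autotest.

```python
import re

# Two entries copied from the status log in the issue description.
START_ENTRY = ("START graphics_dEQP graphics_dEQP "
               "timestamp=1494555435 localtime=May 11 19:17:15")
FAIL_ENTRY = ("FAIL ---- ---- timestamp=1494556353 "
              "localtime=May 11 19:32:33")


def extract_timestamp(entry):
    """Return the integer value of the timestamp= field of a status entry."""
    match = re.search(r"\btimestamp=(\d+)", entry)
    if match is None:
        raise ValueError("no timestamp= field in: %r" % entry)
    return int(match.group(1))


def elapsed_seconds(start_entry, end_entry):
    """Seconds between two status entries (hypothetical helper)."""
    return extract_timestamp(end_entry) - extract_timestamp(start_entry)


elapsed = elapsed_seconds(START_ENTRY, FAIL_ENTRY)
print(elapsed)  # 918 seconds, i.e. 15 min 18 s
```

So the failure was recorded about 15 minutes after the test started, while the debug log stops within the first second, at 'stop ui'.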
,
May 12 2017
What exactly do you want me to look at? This seems to be an HWTest failure, which isn't something that I have anything to do with.
,
May 23 2017
,
May 25 2017
,
May 25 2017
Passing to deputy. If I remember correctly, jrbarnette@ said this looks like a bad CL causing ChromeOS to crash.
,
May 25 2017
a bug in ToT or a CL in CQ?
,
May 25 2017
A CL in CQ, unless this test has been regularly flaking, which I don't think it has.
,
May 25 2017
You can see the recent history of this test on ToT here:
https://wmatrix.googleplex.com/unfiltered?hide_missing=True&releases=tot&tests=graphics_dEQP
The short summary is that the test is reasonably if not
perfectly stable.
Looking at the specific failure, there's no CL that seems to be
a plausible suspect; the failure suggests a kernel crash, but
there's no kernel CL in the blamelist.
,
Jun 6 2017
Issue 700536 has been merged into this issue.
,
Jun 6 2017
There are a few recent reports (from the last few weeks, 2 months at most) of nyan_kitty rebooting on very simple/sanity dEQP tests. We should investigate. Unfortunately, the logs from the lab are not very meaningful because they are truncated. Joe, could you set up a nyan board (there are many different ones, probably nyan_blaze; we have to check what is in the library) and try to reproduce? Just to get an idea of what is happening.
Lab logs (pretty useless, though):
https://luci-milo.appspot.com/buildbot/chromeos/nyan_kitty-paladin/1876
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/121622784-chromeos-test/chromeos4-row13-rack8-host10/
,
Jun 7 2017
Issue 730263 has been merged into this issue.
,
Jun 9 2017
I have tested graphics_dEQP.bvt and graphics_dEQP.gles3.accuracy on nyan platforms big, blaze, and kitty with numerous 100 iteration runs and have been unable to repro the crash EXCEPT for the specific chromeos4-row13-host10.cros kitty hardware that previously reported failure. Even on this machine, I did not experience any test failures, but did detect several reboots per 100 iterations which must have occurred between test runs. I should also note that all of the luci-milo crash reports above involve the same bot cros-beefy377-c2. I think we might just be seeing flaky hardware rather than a regression.
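The comment above does not say how the between-run reboots were detected. One possible approach, sketched here under assumptions, is to sample the kernel boot ID (the standard Linux file `/proc/sys/kernel/random/boot_id`) once per iteration and count how often it changes. The function names are hypothetical, and how the samples are collected from a DUT (for example over SSH from the test host) is left out.

```python
from pathlib import Path

# Standard Linux procfs file; its value changes on every boot.
BOOT_ID_PATH = Path("/proc/sys/kernel/random/boot_id")


def read_boot_id(path=BOOT_ID_PATH):
    """Read the current boot ID (must run on the machine being checked)."""
    return path.read_text().strip()


def count_reboots(boot_ids):
    """Count boot ID changes in a list sampled once per test iteration.

    Each change between consecutive samples means at least one reboot
    happened between those two iterations.
    """
    return sum(1 for prev, cur in zip(boot_ids, boot_ids[1:]) if prev != cur)


# Illustrative samples only, not data from this bug.
samples = ["id-a", "id-a", "id-b", "id-b", "id-c"]
print(count_reboots(samples))  # 2
```

Because a change is only visible between samples, this counts reboots that occur between runs but cannot by itself tell whether a reboot interrupted a test.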
,
Jun 9 2017
> I should also note that all of the luci-milo crash reports
> above involve the same bot cros-beefy377-c2
"cros-beefy377-c2" is the builder, which is irrelevant to the
hardware that runs the test.
If we have reproducible failures on a specific DUT, those need
to be explained.
The name "chromeos4-row13-host10" isn't a valid DUT hostname;
I can see at least four different hosts that that might refer to:
chromeos4-row13-rack1-host10
chromeos4-row13-rack2-host10
chromeos4-row13-rack8-host10
chromeos4-row13-rack9-host10
Which DUT(s), specifically, have shown failures?
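The ambiguity above can be caught mechanically. The sketch below checks a name against a row/rack/host pattern that is inferred only from the hostnames quoted in this thread; it is an assumption, not an official lab naming specification, and `is_full_dut_name` is a hypothetical helper.

```python
import re

# Pattern inferred from names such as chromeos4-row13-rack8-host10(.cros);
# an assumption based on this thread, not an official naming spec.
DUT_NAME = re.compile(r"^chromeos\d+-row\d+-rack\d+-host\d+(\.cros)?$")


def is_full_dut_name(name):
    """Return True if the name includes row, rack and host components."""
    return DUT_NAME.match(name) is not None


print(is_full_dut_name("chromeos4-row13-host10"))             # False
print(is_full_dut_name("chromeos4-row13-rack8-host10.cros"))  # True
```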
,
Jun 9 2017
Indeed, I mistyped the host name. It is chromeos4-row13-rack8-host10.cros, and this is the only one I observed reboots on.
,
Jun 9 2017
I think the right move is to lock the dut and ask for a replacement. Does anyone disagree?
,
Jun 9 2017
> I think the right move is to lock the dut and ask for a replacement.
It's very likely that there are no replacements easily available. We need to do some basic due diligence to prove that the problem is hardware, not software. If there's positive evidence to support "bad hardware", then yes, we ask to decommission the unit.
,
Jun 9 2017
Agreed. I acquired my own kitty hardware and will continue testing. I also dug deeper into some of the crash reports above and did verify that it isn't necessarily just one host reporting crashes; for example chromeos4-row13-rack9-host7.cros also reported a crash.
,
Jun 9 2017
One thing I should add to this: every build I have tried on any Nyan hardware has resulted in lots of garbled/pixelated imagery when you interact with the UI. I am also looking into this and suspect the two bugs may be linked.
,
Jun 9 2017
Here are the same symptoms on caroline and nyan_kitty from the same CQ run for two different graphics tests, and I see no CLs that are likely candidates. Is it reasonable to assume this bug is the cause of both?
graphics_dEQP https://luci-milo.appspot.com/buildbot/chromeos/caroline-paladin/206
graphics_Gbm https://luci-milo.appspot.com/buildbot/chromeos/nyan_kitty-paladin/1920
,
Jun 9 2017
These crashes do have some similarity, but consensus around here is that it is unlikely that they are the same bug given how different the platforms are. We are looking into the caroline crash, but probably won't merge the issues now.
,
Jun 9 2017
Please ignore the caroline graphics_dEQP link here; I filed issue 731923 for it. The kitty graphics_Gbm failure looks real.
,
Jul 20 2017
Quick update: I haven't actively worked on this bug recently, but I do keep checking for crashes, and I have not seen any in quite a while. I spent over a week in June running the test in the background, many thousands of times, on both lab machines and my own. I never had a reboot during the test, but I did observe a few reboots between runs. The logs weren't helpful in determining the cause. I'll continue to monitor for these crashes. The test may not be causing the crashes; rather, we may be seeing crashes during this test because it takes so long to run.
,
Jul 23
Issue has not been modified or commented on in the last 365 days, please re-open or file a new bug if this is still an issue. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Comment 1 by ayatane@chromium.org
, May 12 2017