Issue 721867

Starred by 2 users

Issue metadata

Status: Archived
Owner:
Closed: Jul 23
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




nyan_kitty: DUT rebooted during graphics_dEQP HWTest

Project Member Reported by ayatane@chromium.org, May 12 2017

Issue description

https://luci-milo.appspot.com/buildbot/chromeos/nyan_kitty-paladin/1595



INFO	----	----	kernel=3.10.18	localtime=May 11 19:16:58	timestamp=1494555418	
START	----	----	timestamp=1494555435	localtime=May 11 19:17:15	
	START	graphics_dEQP	graphics_dEQP	timestamp=1494555435	localtime=May 11 19:17:15	
		FAIL	----	----	timestamp=1494556353	localtime=May 11 19:32:33	Autotest client terminated unexpectedly: DUT rebooted during the test run.
	END FAIL	----	----	timestamp=1494556353	localtime=May 11 19:32:33	
END GOOD	----	----	timestamp=1494556353	localtime=May 11 19:32:33	
INFO	----	----	timestamp=1494556358	localtime=May 11 19:32:38	Start crashcollection record
INFO	----	Orphaned Crash Dump	timestamp=1494556358	localtime=May 11 19:32:38	/var/spool/crash/os-release
INFO	----	Orphaned Crash Dump	timestamp=1494556358	localtime=May 11 19:32:38	/var/spool/crash/lsb-release
INFO	----	----	timestamp=1494556358	localtime=May 11 19:32:38	End crashcollection record
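
For reference, the time between the START and FAIL records above can be worked out from their timestamp fields. This is only a rough sketch: the helper name parse_fields is made up for illustration, and it assumes the status lines are tab-separated key=value records exactly as pasted here.

```python
def parse_fields(line):
    """Return the known key=value fields found on one tab-separated status line."""
    fields = {}
    for token in line.strip().split("\t"):
        key, sep, value = token.partition("=")
        # Only keep the keys seen in the log above; free-text messages are ignored.
        if sep and key in ("timestamp", "localtime", "kernel"):
            fields[key] = value
    return fields


# Two records copied from the status log above.
start_line = "\tSTART\tgraphics_dEQP\tgraphics_dEQP\ttimestamp=1494555435\tlocaltime=May 11 19:17:15\t"
fail_line = (
    "\t\tFAIL\t----\t----\ttimestamp=1494556353\tlocaltime=May 11 19:32:33\t"
    "Autotest client terminated unexpectedly: DUT rebooted during the test run."
)

start = int(parse_fields(start_line)["timestamp"])
fail = int(parse_fields(fail_line)["timestamp"])
elapsed = fail - start
print(f"{elapsed} s ({elapsed // 60} min {elapsed % 60} s)")  # 918 s (15 min 18 s)
```

So the reboot was recorded roughly 15 minutes after the test started, which matches the localtime values (19:17:15 to 19:32:33).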



This is the entire test log; it looks like it cuts off:

05/11 19:17:16.223 DEBUG|             utils:0202| Running 'wflinfo -p null -a gles2'
05/11 19:17:16.363 INFO |    graphics_utils:1146| Found gles3.2.
05/11 19:17:16.405 DEBUG|              test:0362| starting before_iteration_hooks
05/11 19:17:16.482 INFO |      base_sysinfo:0380| ChromeOS BOARD = nyan_kitty_2.1GHz_4GB
05/11 19:17:16.484 DEBUG|             utils:0202| Running 'logger "autotest starting iteration /usr/local/autotest/results/default/graphics_dEQP/sysinfo/iteration.1 on nyan_kitty_2.1GHz_4GB"'
05/11 19:17:16.502 DEBUG|              test:0365| before_iteration_hooks completed
05/11 19:17:16.503 DEBUG|              test:0379| starting test(run_once()), test details follow
()
05/11 19:17:16.505 INFO |     graphics_dEQP:0499| Test Options: {'test_names': '', 'hasty': 'False', 'shard_number': '0', 'subset_to_run': 'Pass', 'filter': 'dEQP-GLES3.accuracy', 'test_names_file': '', 'timeout': 70, 'shard_count': '1', 'debug': 'False'}
05/11 19:17:16.507 INFO |     graphics_dEQP:0514| ChromeOS BOARD = nyan_kitty
05/11 19:17:16.508 INFO |     graphics_dEQP:0515| ChromeOS CPU family = tegra
05/11 19:17:16.509 INFO |     graphics_dEQP:0516| ChromeOS GPU family = tegra
05/11 19:17:16.510 INFO |     graphics_dEQP:0520| dEQP test filter = dEQP-GLES3.accuracy
05/11 19:17:16.515 DEBUG|             utils:0202| Running 'status ui'
05/11 19:17:16.542 DEBUG|             utils:0202| Running 'stop ui'
 
Owner: d...@chromium.org
+dnj to take a look at test failure

Comment 3 by d...@chromium.org, May 12 2017

Owner: ayatane@chromium.org
What exactly do you want me to look at? This seems to be an HWTest failure, which isn't something that I have anything to do with.

Comment 4 by aut...@google.com, May 23 2017

Status: Unconfirmed (was: Untriaged)
Owner: xixuan@chromium.org
Passing to deputy.  If I remember correctly, jrbarnette@ said this looks like a bad CL causing ChromeOS to crash.

Comment 7 by xixuan@chromium.org, May 25 2017

Cc: jrbarnette@chromium.org
Owner: ----
a bug in ToT or a CL in CQ?
A CL in CQ, unless this test has been regularly flaking which I don't think it is.
You can see the recent history of this test on ToT here:
    https://wmatrix.googleplex.com/unfiltered?hide_missing=True&releases=tot&tests=graphics_dEQP

The short summary is that the test is reasonably, if not
perfectly, stable.

Looking at the specific failure, there's no CL that seems to be
a plausible suspect; the failure suggests a kernel crash, but
there's no kernel CL in the blamelist.

Comment 10 by ihf@chromium.org, Jun 6 2017

 Issue 700536  has been merged into this issue.

Comment 11 by ihf@chromium.org, Jun 6 2017

Cc: marc...@chromium.org
Components: -Infra>Client>ChromeOS OS>Kernel>Graphics
Labels: M-61 OS-Chrome
Owner: djmk@chromium.org
There are a few recent reports (last few weeks, 2 months max) of nyan_kitty rebooting on very simple/sanity dEQP tests. We should investigate.

Unfortunately, the logs from the lab are not very meaningful because they are truncated.
Joe, could you set up a nyan board (there are many different ones, probably nyan_blaze; we have to check what is in the library) and try to reproduce? Just to get an idea of what is happening.

Lab logs (pretty useless):
https://luci-milo.appspot.com/buildbot/chromeos/nyan_kitty-paladin/1876
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/121622784-chromeos-test/chromeos4-row13-rack8-host10/

Comment 12 by ihf@chromium.org, Jun 7 2017

Cc: ihf@chromium.org mojahsu@chromium.org dgarr...@chromium.org mcchou@chromium.org bleung@chromium.org
 Issue 730263  has been merged into this issue.

Comment 13 by djmk@chromium.org, Jun 9 2017

I have tested graphics_dEQP.bvt and graphics_dEQP.gles3.accuracy on the nyan platforms big, blaze, and kitty with numerous 100-iteration runs and have been unable to reproduce the crash EXCEPT on the specific chromeos4-row13-host10.cros kitty hardware that previously reported the failure.  Even on this machine, I did not experience any test failures, but I did detect several reboots per 100 iterations, which must have occurred between test runs. I should also note that all of the luci-milo crash reports above involve the same bot cros-beefy377-c2.  I think we might just be seeing flaky hardware rather than a regression.
> I should also note that all of the luci-milo crash reports
> above involve the same bot cros-beefy377-c2

"cros-beefy377-c2" is the builder, which is irrelevant to the
hardware that runs the test.

If we have reproducible failures on a specific DUT, those need
to be explained.

The name "chromeos4-row13-host10" isn't a valid DUT hostname;
I can see at least four different hosts that that might refer to:
    chromeos4-row13-rack1-host10
    chromeos4-row13-rack2-host10
    chromeos4-row13-rack8-host10
    chromeos4-row13-rack9-host10

Which DUT(s), specifically, have shown failures?
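
To illustrate why the shortened name is ambiguous, here is a small sketch that checks a name against the row/rack/host shape of the four hostnames listed above. The pattern is an assumption inferred only from those examples (plus the ".cros" suffix used elsewhere in this thread); it is not an official lab naming rule, and is_full_dut_hostname is a made-up helper.

```python
import re

# Assumed shape, inferred from the hostnames in this thread; not an official spec.
DUT_HOSTNAME = re.compile(r"chromeos\d+-row\d+-rack\d+-host\d+(\.cros)?")


def is_full_dut_hostname(name):
    """True if the whole name matches the assumed row/rack/host pattern."""
    return DUT_HOSTNAME.fullmatch(name) is not None


candidates = [
    "chromeos4-row13-host10",             # rack component missing
    "chromeos4-row13-rack8-host10",
    "chromeos4-row13-rack8-host10.cros",
]
for name in candidates:
    print(name, is_full_dut_hostname(name))
```

The first name is rejected because it has no rack component, which is exactly the piece needed to tell the four candidate hosts apart.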

Comment 15 by djmk@chromium.org, Jun 9 2017

Labels: OS-Fuchsia
Indeed, I mistyped the host name. It is chromeos4-row13-rack8-host10.cros, and this is the only one I observed reboots on.
I think the right move is to lock the DUT and ask for a replacement.

Does anyone disagree?
> I think the right move is to lock the dut and ask for a replacement.

It's very likely that there are no replacements easily available.

We need to do some basic due diligence to prove that the problem is
hardware, not software.  If there's positive evidence to support
"bad hardware", then yes, we ask to decommission the unit.

Comment 18 Deleted

Comment 19 by djmk@chromium.org, Jun 9 2017

Agreed.  I acquired my own kitty hardware and will continue testing.  I also dug deeper into some of the crash reports above and did verify that it isn't necessarily just one host reporting crashes; for example chromeos4-row13-rack9-host7.cros also reported a crash.  

Comment 20 by djmk@chromium.org, Jun 9 2017

One thing I should add: every build I have tried on any Nyan hardware has resulted in lots of garbled/pixelated imagery when you interact with the UI.  I am also looking into this and suspect the two bugs may be linked.
Here are the same symptoms on caroline and nyan_kitty from the same CQ run for two different graphics tests, and I see no CLs that are likely candidates. Is it reasonable to assume this bug is the cause of both?

graphics_dEQP
https://luci-milo.appspot.com/buildbot/chromeos/caroline-paladin/206

graphics_Gbm
https://luci-milo.appspot.com/buildbot/chromeos/nyan_kitty-paladin/1920

Comment 22 by djmk@chromium.org, Jun 9 2017

These crashes do have some similarity, but consensus around here is that it is unlikely that they are the same bug given how different the platforms are.  We are looking into the caroline crash, but probably won't merge the issues now.

Comment 23 by ihf@chromium.org, Jun 9 2017

Please ignore the caroline graphics_dEQP link here. I filed issue 731923.

The kitty graphics_Gbm failure looks real.

Comment 24 by djmk@chromium.org, Jul 20 2017

Quick update:  I haven't actively worked on this bug recently, but I do keep checking for crashes, and I have not seen any in quite a while.  I spent over a week in June running the test in the background, many thousands of times, on both lab machines and my own.  I never had a reboot during the test, but I did observe a few reboots between runs.  The logs weren't helpful in determining the cause.  I'll continue to monitor for these crashes.  The test may not be causing the crashes; rather, we may be seeing crashes during this test because it takes so long to run.
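
To make the "it just runs for a long time" argument concrete, here is a rough sketch. It assumes reboots happen at a constant rate that is unrelated to which test is running; under that assumption, each test's expected share of observed reboots is simply its share of total run time. The test names and durations below are hypothetical, for illustration only, and are not lab measurements.

```python
def expected_share_of_reboots(durations_s):
    """Expected fraction of reboots landing in each test, assuming a constant,
    test-independent reboot rate: proportional to each test's run time."""
    total = sum(durations_s.values())
    return {name: duration / total for name, duration in durations_s.items()}


# Hypothetical durations in seconds (illustrative assumptions, not measured).
durations = {"long_test": 900, "short_test_a": 60, "short_test_b": 40}
shares = expected_share_of_reboots(durations)
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

With numbers like these, the long test would collect most of the reboots even if it does nothing to trigger them, so a cluster of reboots in one slow test is not by itself evidence against that test.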

Comment 25 by sheriffbot@chromium.org, Jul 23

Status: Archived (was: Unconfirmed)
Issue has not been modified or commented on in the last 365 days, please re-open or file a new bug if this is still an issue.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
