
Issue metadata

Status: Available
Owner: ----
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 3
Type: Bug


graphics_Gbm: DUT rebooted unexpectedly on nyan_kitty.

Project Member Reported by, Mar 21 2017

Issue description

This build failure was caused by an unexpected reboot on nyan_kitty running the graphics_Gbm test.

I'm inclined to call this a rare system failure exposed by the test, but I'm not sure how common it is.

INFO	----	----	kernel=3.10.18	localtime=Mar 18 03:24:43	timestamp=1489832683	
START	----	----	timestamp=1489832699	localtime=Mar 18 03:24:59	
	START	graphics_Gbm	graphics_Gbm	timestamp=1489832699	localtime=Mar 18 03:24:59	
		FAIL	----	----	timestamp=1489833602	localtime=Mar 18 03:40:02	Autotest client terminated unexpectedly: DUT rebooted during the test run.
	END FAIL	----	----	timestamp=1489833602	localtime=Mar 18 03:40:02	
END GOOD	----	----	timestamp=1489833602	localtime=Mar 18 03:40:02	

Comment 3 by, Mar 23 2017

It's rare but not unprecedented:
On multiple boards as well.

Does this sound like the test is triggering a crash?

Can we find an owner on the graphics team to triage?

Comment 5 by, Mar 23 2017


It could be. I'm not sure what the canary failures tell us; they're pretty far apart, so it may be just noise. But twice in a week, yes, that might indicate that a bug is being triggered.

Gurchetan, I think you've been working on minigbm lately. Can you take a look?
Status: Started
The worst I would expect from graphics_Gbm/minigbm is a segfault or test failure, not a reboot.  Looking at /dev/pstore/console-ramoops may be insightful.  How can I ssh into the paladin?

ssh root@chromeos-nyan_kitty-paladin-1114.cros doesn't seem to work.
The DUT in question is probably running a different OS version, has been rebooted many times, and has run several thousand tests since then. Would logging in still be useful?
In that case, the logging wouldn't be very helpful.

Looks very clean here:

Are the machines that run in the lab the same machines that run the canary builds?
Yes, the DUTs for CQ and Canary are in different pools, but otherwise the same, and can be moved between pools from time to time.
Might I suggest getting a DUT from somewhere and scheduling the test against it 1000 times in a row, set up to exit on the first failure?

PS: I'm not certain of the best way to do this, but it can make the flake reproducible and make it easier to gather the needed data.
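The repeat-until-failure idea above could be scripted as a small helper. This is a hedged sketch, not something from the bug: the function name, the run count, and the `test_that` invocation in the comment are all assumptions about how one might drive the test from the ChromeOS chroot.

```shell
# Hypothetical helper: run a command repeatedly and stop on the first
# failure, so the DUT is left in the failed state for log collection.
run_until_failure() {
  cmd=$1
  n=$2
  i=1
  while [ "$i" -le "$n" ]; do
    if ! $cmd; then
      echo "failure on run $i"
      return 1
    fi
    i=$((i + 1))
  done
  echo "all $n runs passed"
  return 0
}

# In the ChromeOS chroot this might be invoked as (assumed invocation):
#   run_until_failure "test_that ${DUT_IP} graphics_Gbm" 1000
```

Exiting on the first failure matters here because a later provision reboots the DUT and destroys the evidence (as the console-ramoops discussion below shows).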
Hmmm, actually we have some data:

which links to the autotest results:


Console ramoops says:

[  471.125904] host1x 50000000.host1x: start latency exceeded, new value 100584 ns
[  472.053765] init: debugd main process (3855) killed by TERM signal
[  472.302552] smsc95xx 1-1.1:1.0 eth1: hardware isn't capable of remote wakeup
[  474.139735] init: crash-sender main process (1445) killed by TERM signal
[  474.140027] init: cros-machine-id-regen-periodic main process (1452) killed by TERM signal
[  474.140343] init: mtpd main process (1470) killed by TERM signal
[  474.140623] init: log-rotate main process (1478) killed by TERM signal
[  474.140875] init: cras main process (1515) killed by TERM signal
[  474.141159] init: bluetoothd main process (1524) killed by TERM signal
[  474.149073] init: powerd main process (843) killed by TERM signal
[  474.154430] EXT4-fs (dm-0): re-mounted. Opts: (null)
[  474.156011] init: warn-collector main process (2220) killed by TERM signal
[  474.156387] init: activate_date main process (2230) killed by TERM signal
[  474.156692] init: tlsdated main process (2253) killed by TERM signal
[  474.156969] init: netfilter-queue main process (2311) killed by TERM signal
[  474.157236] init: chapsd main process (3527) killed by TERM signal
[  474.181522] init: cryptohomed main process (3541) killed by TERM signal
[  474.213027] tpm_i2c_infineon 2-0020: command 0x98 (size 10) returned code 0x800
[  474.253279] init: recover_duts main process (2217) killed by TERM signal
[  474.306958] Unsafe core_pattern used with suid_dumpable=2. Pipe handler or fully qualified core dump path required.
[  474.591547] Restarting system.

I don't know what to make of these warnings, though. They don't look graphics related.
To copy the ramoops, run this in the chroot:

gsutil cp gs://chromeos-autotest-results/109549676-chromeos-test/chromeos4-row13-rack8-host10/sysinfo/console-ramoops .

Any ideas?
Status: Available
I ran graphics_Gbm 500 times with 9404.0.0 on nyan_kitty locally (it took 5+ hours, mostly due to autotest setup/teardown).  There were no reboots or failures.  I'll note that the three recent CQ failures all happened on cros-beefy377-c2.  If this keeps happening, I suggest replacing that machine, since the failure logs don't really offer any hints.
Owner: ----

Comment 16 by, Apr 13 2017

Sounds like a kernel/graphics issue, not a problem with the builder.

+ snanda can you help us find an owner for this investigation? 

Comment 17 by, Apr 13 2017

Components: -Infra>Client>ChromeOS OS>Kernel
The console-ramoops is from the previous boot and therefore doesn't have useful information for the reboot that happened during the course of this particular test :(

Couple of things:
1. Luigi, could you please help dig up the console-ramoops from the subsequent test on those DUTs?
2. Should we file a feature request to capture the logs post reboot for tests where we run into unexpected reboots?
So I looked at the latest instance:

semenzato@luigi:~/test-lab$ dut-status -f -s '2017-03-29 06:00:00' --duration 5 chromeos4-row13-rack8-host10
    2017-03-29 10:36:28  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/109572053-chromeos-test/
    2017-03-29 10:35:55  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host10/2913832-reset/
    2017-03-29 10:34:51  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/109572086-chromeos-test/
    2017-03-29 10:34:21  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host10/2913821-reset/
    2017-03-29 10:33:23  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/109572067-chromeos-test/
    2017-03-29 10:32:50  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host10/2913812-reset/
    2017-03-29 10:31:49  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/109572047-chromeos-test/
    2017-03-29 10:31:10  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host10/2913796-reset/
    2017-03-29 10:30:08  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/109572031-chromeos-test/
    2017-03-29 10:29:05  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host10/2913772-reset/
    2017-03-29 10:27:52  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/109572024-chromeos-test/
    2017-03-29 10:18:30  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host10/2913730-provision/
    2017-03-29 07:06:05  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/109549676-chromeos-test/
    2017-03-29 07:05:25  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host10/2913178-reset/
    2017-03-29 06:59:42  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/109549549-chromeos-test/
    2017-03-29 06:58:56  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host10/2913059-reset/
    2017-03-29 06:57:31  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/109549523-chromeos-test/
    2017-03-29 06:47:44  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host10/2912888-provision/

The failed test is the one that started at 07:06.  The next operation on that DUT is the provision at 10:18.  Unfortunately there is no console-ramoops in those logs.

I will check the earlier failure as well.

Labels: OS-Chrome
Oh, I forgot to mention: the job following the failure is a provision, which involves reboot, so there is no point in looking for console-ramoops in later jobs.
I'd suggest getting a dedicated DUT locally and running the test repeatedly until it fails.
Owner: ----
Sadly exactly the same thing happened with the first failure.

It may be a good idea to collect console-ramoops immediately when we detect that the device has rebooted unexpectedly.
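That collection step could look something like the sketch below. Everything here is an assumption about how such a collector would be wired in (the source only proposes the idea): the function name is invented, and on a real DUT the source path would be /dev/pstore/console-ramoops, presumably read over ssh the same way sysinfo is already gathered.

```shell
# Hypothetical sketch: preserve the pstore console log under a timestamped
# name as soon as an unexpected reboot is detected, before any provision
# step reboots the DUT again and the evidence is lost.
collect_ramoops() {
  src=$1    # on a real DUT: /dev/pstore/console-ramoops
  dest=$2   # results directory for the failed job
  [ -r "$src" ] || { echo "no ramoops at $src"; return 1; }
  cp "$src" "$dest/console-ramoops.$(date +%s)"
}
```

The timestamped destination name is deliberate: ramoops from successive boots would otherwise overwrite each other in the results directory.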

    2017-03-23 11:14:50  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/108479249-chromeos-test/
    2017-03-23 11:14:15  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host10/2858918-reset/
    2017-03-23 11:12:54  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/108479241-chromeos-test/
    2017-03-23 11:03:29  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host10/2858901-provision/
    2017-03-23 09:24:01  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/108472526-chromeos-test/
    2017-03-23 09:23:26  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host10/2858667-reset/
    2017-03-23 09:22:18  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/108472497-chromeos-test/
    2017-03-23 09:21:44  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host10/2858627-reset/
    2017-03-23 09:20:36  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/108472461-chromeos-test/
    2017-03-23 09:19:59  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host10/2858600-reset/
    2017-03-23 09:18:34  -- http://cautotest/tko/retrieve_logs.cgi?job=/results/108472439-chromeos-test/
    2017-03-23 09:07:55  OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host10/2858535-provision/

This caused additional instability in the CQ last week.

Luigi, can we get a bug filed for collecting the console-ramoops? jrbarnette is probably a good first owner. And point out that it's related to tracking down the cause of this bug.
