graphics_Gbm: DUT rebooted unexpectedly on nyan_kitty
Issue description
This build failure was caused by an unexpected reboot on nyan_kitty running the graphics_Gbm test. I'm inclined to call this a rare system failure exposed by the test, but I'm not sure how common it is. From the status log:
INFO ---- ---- kernel=3.10.18 localtime=Mar 18 03:24:43 timestamp=1489832683
START ---- ---- timestamp=1489832699 localtime=Mar 18 03:24:59
START graphics_Gbm graphics_Gbm timestamp=1489832699 localtime=Mar 18 03:24:59
FAIL ---- ---- timestamp=1489833602 localtime=Mar 18 03:40:02 Autotest client terminated unexpectedly: DUT rebooted during the test run.
END FAIL ---- ---- timestamp=1489833602 localtime=Mar 18 03:40:02
END GOOD ---- ---- timestamp=1489833602 localtime=Mar 18 03:40:02
,
Mar 23 2017
It happened again: https://luci-milo.appspot.com/buildbot/chromeos/nyan_kitty-paladin/1114
,
Mar 23 2017
It's rare but not unprecedented: https://bugs.chromium.org/p/chromium/issues/detail?id=647268
,
Mar 23 2017
On multiple boards as well. Does this sound like the test is triggering a crash? Can we find an owner on the graphics team to triage?
,
Mar 23 2017
It could be. I'm not sure what the canary failures tell us; they're pretty far apart, so it may be just noise. But twice in a week, yes, that might indicate a bug is being triggered. Gurchetan, I think you've been working on minigbm lately. Can you take a look?
,
Mar 23 2017
The worst I would expect from graphics_Gbm/minigbm is a segfault or a test failure, not a reboot. Looking at /dev/pstore/console-ramoops may be insightful. How can I ssh into the paladin? ssh root@chromeos-nyan_kitty-paladin-1114.cros doesn't seem to work.
,
Mar 23 2017
The DUT in question is probably running a different OS version, has been rebooted many times, and has run several thousand tests since then. Would logging in still be useful?
,
Mar 23 2017
In that case, the logging wouldn't be very helpful. Looks very clean here: https://wmatrix.googleplex.com/unfiltered?hide_missing=True&releases=tot&tests=graphics_Gbm&days_back=10 Are the machines that run tests in the lab the same machines that the canary builders use?
,
Mar 23 2017
Yes, the DUTs for CQ and Canary are in different pools, but otherwise the same, and can be moved between pools from time to time.
,
Mar 29 2017
,
Mar 29 2017
Might I suggest getting a DUT from somewhere and scheduling the test against it 1000 times in a row, set up to exit on the first failure? PS: I'm not certain of the best way to do this, but it can make the flake reproducible and make it easier to gather the needed data.
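One rough way to do this from the chroot (just a sketch; the DUT hostname, iteration count, and the grep-for-FAIL check are placeholders, and it assumes test_that reports failures in its output):
#!/bin/bash
# Sketch: run graphics_Gbm against one DUT repeatedly, stopping at the first
# failure so the device can be inspected in its failed state.
DUT=chromeos4-row13-rack8-host10   # or a locally reachable DUT's IP
for i in $(seq 1 1000); do
  echo "=== iteration $i ==="
  test_that --board=nyan_kitty "${DUT}" graphics_Gbm |& tee "gbm_run_${i}.log"
  # Rough heuristic: stop as soon as a run reports a failure (or the DUT
  # drops off the network and the run errors out).
  if grep -q 'FAIL' "gbm_run_${i}.log"; then
    echo "Failure on iteration ${i}; leaving the DUT alone for inspection."
    break
  fi
done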
,
Mar 29 2017
Hmmm, actually we have some data:
https://uberchromegw.corp.google.com/i/chromeos/builders/nyan_kitty-paladin/builds/1179/steps/HWTest%20%5Bbvt-cq%5D/logs/stdio
which links to the autotest results:
http://cautotest/tko/retrieve_logs.cgi?job=/results/109549676-chromeos-test/
Console ramoops says:
[ 471.125904] host1x 50000000.host1x: start latency exceeded, new value 100584 ns
[ 472.053765] init: debugd main process (3855) killed by TERM signal
[ 472.302552] smsc95xx 1-1.1:1.0 eth1: hardware isn't capable of remote wakeup
[ 474.139735] init: crash-sender main process (1445) killed by TERM signal
[ 474.140027] init: cros-machine-id-regen-periodic main process (1452) killed by TERM signal
[ 474.140343] init: mtpd main process (1470) killed by TERM signal
[ 474.140623] init: log-rotate main process (1478) killed by TERM signal
[ 474.140875] init: cras main process (1515) killed by TERM signal
[ 474.141159] init: bluetoothd main process (1524) killed by TERM signal
[ 474.149073] init: powerd main process (843) killed by TERM signal
[ 474.154430] EXT4-fs (dm-0): re-mounted. Opts: (null)
[ 474.156011] init: warn-collector main process (2220) killed by TERM signal
[ 474.156387] init: activate_date main process (2230) killed by TERM signal
[ 474.156692] init: tlsdated main process (2253) killed by TERM signal
[ 474.156969] init: netfilter-queue main process (2311) killed by TERM signal
[ 474.157236] init: chapsd main process (3527) killed by TERM signal
[ 474.181522] init: cryptohomed main process (3541) killed by TERM signal
[ 474.213027] tpm_i2c_infineon 2-0020: command 0x98 (size 10) returned code 0x800
[ 474.253279] init: recover_duts main process (2217) killed by TERM signal
[ 474.306958] Unsafe core_pattern used with suid_dumpable=2. Pipe handler or fully qualified core dump path required.
[ 474.591547] Restarting system.
I don't know what to make of these warnings, though. It doesn't look graphics related.
,
Mar 29 2017
To copy the ramoops, run this in the chroot:
gsutil cp gs://chromeos-autotest-results/109549676-chromeos-test/chromeos4-row13-rack8-host10/sysinfo/console-ramoops .
Any ideas??
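In case it helps with triage, once the file is copied you could do a rough scan for the usual crash markers (these grep patterns are generic kernel strings, not anything confirmed to be in this particular log):
grep -n -E 'Kernel panic|Oops|BUG:|watchdog|Restarting system' console-ramoops
tail -n 30 console-ramoops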
,
Mar 30 2017
I ran graphics_Gbm 500 times locally with 9404.0.0 on nyan_kitty (it took 5+ hours, mostly due to autotest setup/teardown). There were no reboots or failures. I'll note that the three recent CQ failures all happened on cros-beefy377-c2. If this keeps happening, I suggest replacing that machine, since the failure logs don't really offer any hints.
,
Mar 30 2017
,
Apr 13 2017
Sounds like an issue with kernel / graphics, not a problem with the builder. +snanda, can you help us find an owner for this investigation?
,
Apr 13 2017
,
Apr 13 2017
The console-ramoops is from the previous boot and therefore doesn't have useful information for the reboot that happened during the course of this particular test :( Couple of things:
1. Luigi, could you please help dig up the console-ramoops from the subsequent test on those DUTs?
2. Should we file a feature request to capture the logs post reboot for tests where we run into unexpected reboots?
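For item 1, a rough sketch of where to look (the bucket layout below just mirrors the paths already linked in this bug; the recursive listing may be slow and the per-job directory names are a guess):
# List the jobs that ran on the host, then look for any console-ramoops that
# a later job's sysinfo collection happened to capture.
gsutil ls gs://chromeos-autotest-results/hosts/chromeos4-row13-rack8-host10/
gsutil ls -r gs://chromeos-autotest-results/hosts/chromeos4-row13-rack8-host10/ | grep console-ramoops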
,
Apr 13 2017
So I looked at the latest instance:
semenzato@luigi:~/test-lab$ dut-status -f -s '2017-03-29 06:00:00' --duration 5 chromeos4-row13-rack8-host10
chromeos4-row13-rack8-host10
2017-03-29 10:36:28 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/109572053-chromeos-test/
2017-03-29 10:35:55 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host10/2913832-reset/
2017-03-29 10:34:51 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/109572086-chromeos-test/
2017-03-29 10:34:21 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host10/2913821-reset/
2017-03-29 10:33:23 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/109572067-chromeos-test/
2017-03-29 10:32:50 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host10/2913812-reset/
2017-03-29 10:31:49 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/109572047-chromeos-test/
2017-03-29 10:31:10 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host10/2913796-reset/
2017-03-29 10:30:08 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/109572031-chromeos-test/
2017-03-29 10:29:05 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host10/2913772-reset/
2017-03-29 10:27:52 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/109572024-chromeos-test/
2017-03-29 10:18:30 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host10/2913730-provision/
2017-03-29 07:06:05 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/109549676-chromeos-test/
2017-03-29 07:05:25 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host10/2913178-reset/
2017-03-29 06:59:42 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/109549549-chromeos-test/
2017-03-29 06:58:56 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host10/2913059-reset/
2017-03-29 06:57:31 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/109549523-chromeos-test/
2017-03-29 06:47:44 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host10/2912888-provision/
semenzato@luigi:~/test-lab$
The failed test is the one that started at 0706. The next operation on that DUT is the provision at 1018. Unfortunately there is no console-ramoops in those logs.
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row13-rack8-host10/2913730-provision/20172903101825/sysinfo/
I will check the earlier failure as well.
,
Apr 13 2017
Oh, I forgot to mention: the job following the failure is a provision, which involves a reboot, so there is no point in looking for console-ramoops in later jobs.
,
Apr 13 2017
,
Apr 13 2017
I'd suggest getting a dedicated DUT locally and running the test repeatedly until it fails.
,
Apr 13 2017
Sadly exactly the same thing happened with the first failure.
It may be a good idea to collect console-ramoops immediately when we detect that the device has rebooted unexpectedly; a rough sketch of what that could look like follows the listing below.
2017-03-23 11:14:50 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/108479249-chromeos-test/
2017-03-23 11:14:15 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host10/2858918-reset/
2017-03-23 11:12:54 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/108479241-chromeos-test/
2017-03-23 11:03:29 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host10/2858901-provision/
2017-03-23 09:24:01 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/108472526-chromeos-test/
2017-03-23 09:23:26 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host10/2858667-reset/
2017-03-23 09:22:18 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/108472497-chromeos-test/
2017-03-23 09:21:44 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host10/2858627-reset/
2017-03-23 09:20:36 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/108472461-chromeos-test/
2017-03-23 09:19:59 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host10/2858600-reset/
2017-03-23 09:18:34 -- http://cautotest/tko/retrieve_logs.cgi?job=/results/108472439-chromeos-test/
2017-03-23 09:07:55 OK http://cautotest/tko/retrieve_logs.cgi?job=/results/hosts/chromeos4-row13-rack8-host10/2858535-provision/
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/hosts/chromeos4-row13-rack8-host10/2858901-provision/20172303110326/sysinfo/
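As mentioned above, a sketch of the collection step (this is not an existing autotest hook; the ssh access, hostname, and pstore paths are assumptions about the lab setup):
#!/bin/bash
# Grab the pstore console log from the DUT as soon as the harness notices an
# unexpected reboot, before anything reprovisions or reboots the device again.
DUT=chromeos4-row13-rack8-host10   # whichever host just came back unexpectedly
OUT=ramoops-$(date +%Y%m%d-%H%M%S)
mkdir -p "${OUT}"
# Try both places the pstore console log tends to live; which one applies
# depends on the image, so this is a guess.
scp "root@${DUT}:/dev/pstore/console-ramoops*" "${OUT}/" 2>/dev/null || \
  scp "root@${DUT}:/sys/fs/pstore/console-ramoops*" "${OUT}/"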
,
May 1 2017
,
May 1 2017
nyan boards experience random reboots on desktopui_MashLogin as well:
https://wmatrix.googleplex.com/platform/unfiltered?hide_missing=True&tests=desktopui_MashLogin&days_back=30&releases=tot&platforms=nyan_*
,
May 30 2017
This caused additional instability in the CQ last week. Luigi, can we get a bug filed for collecting the console-ramoops? jrbarnette is probably a good first owner. And point out that it's related to tracking down the cause of this bug.
,
Jun 9 2017
,
Jun 11 2018
This issue has been Available for over a year. If it's no longer important or seems unlikely to be fixed, please consider closing it out. If it is important, please re-triage the issue. Sorry for the inconvenience if the bug really should have been left as Available. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Jun 11 2018