Issue metadata
DUT crash in cheets_StartAndroid.stress on caroline
Reported by jrbarnette@chromium.org, Jun 12 2017
Issue description
The caroline-paladin failed cheets_StartAndroid.stress here:
https://luci-milo.appspot.com/buildbot/chromeos/caroline-paladin/235
Logs of the test failure:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/122946801-chromeos-test/chromeos2-row8-rack1-host20/
The blamelist doesn't appear to include any CLs that actually
run code on caroline, so it seems unlikely that there was a bad
CL.
It seems most likely that there's a bug in ToT. Note, however,
that the test appears to be quite reliable on the canaries, so
if there's a bug, it either arrived very recently (possibly not
through the CQ), or it's very rare.
Jun 13 2017
Adding ARC++ folks and non-PST sheriffs to investigate since it's EOD.
Jun 13 2017
The log above indicates that caroline unexpectedly rebooted after the fifth Android start.
====
06/12 08:09:45.396 INFO | arc_common:0037| Waiting for Android to boot completely.
06/12 08:09:45.396 DEBUG| utils:0203| Running 'android-sh -c "getprop sys.boot_completed"'
06/12 08:09:47.440 DEBUG| utils:0203| Running 'android-sh -c "getprop sys.boot_completed"'
06/12 08:09:49.571 DEBUG| utils:0203| Running 'android-sh -c "getprop sys.boot_completed"'
06/12 08:09:51.594 DEBUG| utils:0203| Running 'android-sh -c "getprop sys.boot_completed"'
06/12 08:09:53.617 DEBUG| utils:0203| Running 'android-sh -c "getprop sys.boot_completed"'
06/12 08:09:55.644 DEBUG| utils:0203| Running 'android-sh -c "getprop sys.boot_completed"'
06/12 08:09:55.666 INFO | arc_common:0043| Android has booted completely.
06/12 08:09:57.668 DEBUG| arc_util:0041| ARC is enabled in mode enabled
06/12 08:09:57.669 INFO | arc_util:0105| Saving Android dumpstate.
06/12 08:10:09.319 INFO | arc_util:0125| Android dumpstate successfully saved.
06/12 08:10:09.366 DEBUG| cros_interface:0363| ListProcesses(<predicate>)->[273 processes]
06/12 08:10:09.368 INFO | cros_interface:0546| (Re)starting the ui (logs the user out)
06/12 08:10:09.382 DEBUG| cros_interface:0439| IsServiceRunning(ui)->True
06/12 08:10:09.382 DEBUG| cros_interface:0058| sh -c restart ui �����������
====
wmatrix is still fairly clean on this, but I see a problem starting Chrome in some logs.
https://wmatrix.googleplex.com/platform/unfiltered?hide_missing=True&tests=cheets_StartAndroid.stress&days_back=20&releases=tot&platforms=caroline
Notice the next paladin run with the Intel DRM changes also caused a caroline reboot. The test might just be doing its good work.
https://luci-milo.appspot.com/buildbot/chromeos/caroline-paladin/239
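For reference, the boot wait seen in that log boils down to something like the sketch below. This is only an illustration in plain Python; the real test goes through Autotest's arc_common/utils helpers rather than subprocess, and the timeout value here is a guess.
====
# Hedged sketch of polling Android boot completion on the DUT.
# Assumption: runs as root on the DUT, where android-sh is available.
import subprocess
import time

def wait_for_android_boot(timeout_s=120, interval_s=2):
    """Poll `getprop sys.boot_completed` inside the container until it returns 1."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        result = subprocess.run(
            ['android-sh', '-c', 'getprop sys.boot_completed'],
            capture_output=True, text=True)
        if result.stdout.strip() == '1':
            return True
        time.sleep(interval_s)
    return False
====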
Jun 13 2017
By the way, the test just logs into Chrome, starts Android, and logs out, 10 times in a row.
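To illustrate, the core of that kind of stress test is roughly the loop below. This is a sketch assuming Autotest's chrome wrapper and arc_mode='enabled'; it is not the literal cheets_StartAndroid.stress source.
====
# Rough sketch of a login/start-Android/logout stress loop (assumed structure,
# not the actual cheets_StartAndroid.stress code).
from autotest_lib.client.common_lib.cros import chrome

ITERATIONS = 10

for i in range(ITERATIONS):
    # Entering the context logs into Chrome and boots the Android container;
    # exiting it logs out again, which does "restart ui" under the hood.
    with chrome.Chrome(arc_mode='enabled'):
        pass
====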
Jun 13 2017
ihf, each "log out" restarts graphics and the login screen, unmounts $HOME, and tears down just about anything else associated with the user, right?
Jun 13 2017
Seems to be an infra issue. The test passed in builds #236, #237, and #238 (https://luci-milo.appspot.com/buildbot/chromeos/caroline-paladin/) and failed again in #239.
Jun 13 2017
A paladin run tests a different set of not-yet-landed changes each time. If the reboot is caused by one of those unlanded changes, the failure to start Android reliably would only show up in runs that include the offending change. There certainly is no infra issue here. The only question here is, has the problematic change landed already, and I think wmatrix says no.
Jun 13 2017
> The only question here is, has the problematic change landed already, and I think wmatrix says no.
Note this comment from the description:
"The blamelist doesn't appear to include any CLs that actually
run code on caroline, so it seems unlikely that there was a bad
CL."
I looked through the blame list, and as best I can tell, none of the
changes in the first reported run included any code that actually runs
on caroline. So the bug is almost certainly in ToT. That it's not
showing up in wmatrix probably just means that the problem is too new
and too infrequent to have hit any canary.
Jun 13 2017
I saw Skylake changes that would run on caroline, but they went into a separate branch, chromeos-2016.05.
Jun 13 2017
Did you see the client.0.DEBUG log for the failing test in build 239? It looks like the DUT ate dirt and died after running "sh -c restart ui". Log attached here. To reach this log from https://luci-milo.appspot.com/buildbot/chromeos/caroline-paladin/239, click "link to suite" under the HWTest section, then select the specific job that failed in the AFE, then click "debug logs" to open the Pantheon repository. The autoserv.DEBUG log has clues that direct you to client.0.DEBUG.
Jun 13 2017
I agree with comment 3; the Intel DRM stuff is suspicious, but I'm a little confused as to how those garbage characters got into the logs.
Jun 13 2017
I've seen similar barf exactly once before (very recently) in my Autotest testbed, but I'm hesitant to declare the two issues are related. See http://jsautotest1.cbf.corp.google.com:9083/afe/#tab_id=view_job&object_id=378716
Jun 13 2017
The garbage characters are the machine going down without flushing files. I found something else suspicious going back in time: the test failed 6 times in the past week, and each failure was on chromeos2-row8-rack1-host20. Might be a very lucky machine stuck in the cq pool.
https://luci-milo.appspot.com/buildbot/chromeos/caroline-paladin/?limit=200
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/122619215-chromeos-test/chromeos2-row8-rack1-host20/cheets_StartAndroid.stress/
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/122500047-chromeos-test/chromeos2-row8-rack1-host20/debug/
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/122991162-chromeos-test/chromeos2-row8-rack1-host20/cheets_StartAndroid.stress/
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/122946801-chromeos-test/chromeos2-row8-rack1-host20/cheets_StartAndroid.stress/
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/122453098-chromeos-test/chromeos2-row8-rack1-host20/cheets_StartAndroid.stress/
Jun 13 2017
7 builds after the last failure have been successful.
Jun 13 2017
TL;DR: assign this bug to radhakrishna.sripada@intel.com?
Richard, the blame list in build 239 (https://luci-milo.appspot.com/buildbot/chromeos/caroline-paladin/239) includes a pile of DRM/i915 changes. caroline is Skylake-Y, and I thought it used the i915 driver for graphics; at least bleung thinks it does in his comments here: https://chromium-review.googlesource.com/c/421984
This DRM series caused problems on June 8 (cave and caroline) and caused problems again when it was resubmitted on June 12 (with only updates to comments). I think ihf is right: the test is doing what it was supposed to do.
So I'll argue that, despite coming from UPSTREAM, the DRM patch series is the primary cause of the paladin (CQ) failures, and we should reassign the bug to radhakrishna.sripada@intel.com: https://chromium-review.googlesource.com/q/owner:radhakrishna.sripada%40intel.com
It's possible the "UPSTREAM" tag was misapplied, and perhaps reviewers could check for BACKPORT-type changes to the patch series; maybe some of the changes should be BACKPORT (modified from UPSTREAM) so they work correctly.
Jun 13 2017
Sorry, I assumed the author would own this bug. Instead, consider assigning the bug to brian.j.lovin@intel.com, who did CQ+1 for CL 421984.
Jun 13 2017
> Richard, the blame list in build 239:
Unfortunately, the blame list in build 239 can't explain the failure in build 235. It is, of course, possible that the 239 run had additional bugs...
Jun 13 2017
Spoke to Benson Leung (who did +2 on the CL indicated in comment #15). Looks like this is a known issue; it has been investigated in crbug.com/731253. Seems like there are other CLs that depend on this CL (like https://chromium-review.googlesource.com/c/421985), which will fix this. The issue is seen only on paladin; when the CLs go into the CQ, they should land together and not cause this problem. On Benson's suggestion, I am duping this bug to crbug.com/731253.
Jun 13 2017
Wait. This needs more explanation. The failure in bug 731253 is a crash at boot time, not a crash starting the Android container. Moreover, IIUC, the original failure was attributed to a CL that wasn't in the 235 run. In fact, AFAIK, bug 731253 isn't a bug in the tree at all, whereas this failure is emphatically not caused by a bad CL.
Jun 13 2017
Adding benson, as he understands this better.
Jun 13 2017
I locked chromeos2-row8-rack1-host20 and ran the cheets_StartAndroid.stress test twice. The DUT rebooted on the second run without anything spewing into dmesg. Let's see if other caroline DUTs show the same failure.
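For anyone else trying to reproduce this, a loop like the sketch below should work against the locked DUT. It assumes a ChromeOS chroot with test_that on the PATH and ssh access to the host; the attempt count is arbitrary.
====
# Hedged repro sketch: rerun the test against the locked DUT until it fails.
# Assumes test_that is available (ChromeOS chroot) and the hostname resolves.
import subprocess

DUT = 'chromeos2-row8-rack1-host20.cros'

for attempt in range(1, 6):
    print('Attempt %d' % attempt)
    rc = subprocess.call(['test_that', DUT, 'cheets_StartAndroid.stress'])
    if rc != 0:
        print('Run failed (exit code %d); grab eventlog/ramoops before rebooting.' % rc)
        break
====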
Jun 13 2017
I've balanced pools to get the DUT out of the CQ pool. Given that the problem showed up so easily, it seems likely that the problem is specific to this DUT. We need a caroline expert to look at this and evaluate what's going on.
Jun 13 2017
Did we get a ramoops collection? eventlog output? I don't see many specifics here.
Jun 13 2017
> Did we get a ramoops collection? eventlog output? I don't see many specifics here.
According to c#22, the problem is reproducible. Although in principle the logs might have the data, it sounds like it will be easier to reproduce the problem and then debug.
Jun 13 2017
Re #24: Aaron, the DUT is locked; take a look at it if you want: ssh root@chromeos2-row8-rack1-host20.cros. Strangely, it started running jobs again though; I'm not sure how this happened. It lost all state while I was at the dentist.
Jun 13 2017
Unfortunately chromeos2-row8-rack1-host20 started running jobs even though it is locked by me. I filed issue 732999 for this.
Jun 13 2017
DUT chromeos2-row8-rack1-host20 is now locked again, and can be used for testing/reproducing this problem.
Jun 14 2017
I don't have access to get onto the corp network. If someone can repro and update the bug with the eventlog output and any ramoops, that would be helpful.
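For whoever does have access, something like the sketch below should pull the requested data off the DUT. The mosys invocation and the pstore path are my assumptions about a standard ChromeOS test image; adjust as needed.
====
# Hedged sketch for collecting the firmware eventlog and any console ramoops
# from the DUT over ssh; assumes root ssh access and a standard test image.
import subprocess

DUT = 'root@chromeos2-row8-rack1-host20.cros'

def ssh(cmd):
    return subprocess.run(['ssh', DUT, cmd],
                          capture_output=True, text=True).stdout

# Firmware event log (resets, power events, panics recorded by coreboot).
print(ssh('mosys eventlog list'))

# Console ramoops saved by the kernel across the last unclean reboot, if any.
print(ssh('cat /sys/fs/pstore/console-ramoops* 2>/dev/null'))
====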
Jun 15 2017
I've just seen a different but possibly related failure:
https://luci-milo.appspot.com/buildbot/chromeos/cave-paladin/589
Failure logs are here:
https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/123265055-chromeos-test/chromeos2-row8-rack6-host11/
The symptom seems potentially similar in that there's a failure during
"restart ui". The last events in the client log show this:
====
06/14 04:44:34.078 INFO | arc_common:0043| Android has booted completely.
06/14 04:44:36.081 DEBUG| arc_util:0041| ARC is enabled in mode enabled
06/14 04:44:36.081 INFO | arc_util:0105| Saving Android dumpstate.
06/14 04:44:56.094 INFO | arc_util:0125| Android dumpstate successfully saved.
06/14 04:44:56.123 DEBUG| cros_interface:0363| ListProcesses(<predicate>)->[276 processes]
06/14 04:44:56.125 INFO | cros_interface:0546| (Re)starting the ui (logs the user out)
06/14 04:44:56.140 DEBUG| cros_interface:0439| IsServiceRunning(ui)->True
06/14 04:44:56.141 DEBUG| cros_interface:0058| sh -c restart ui
====
The symptoms are different in that
1) The failure is on cave, not caroline.
2) The symptom shows a process hang, not a system crash.
Jun 15 2017
I've filed bug 733738 to allow tracking the two failures separately, if needed.
Jun 15 2017
I poked on the cave logs (simple failure to start Android) and they look different from the caroline hang/reboot.
Jan 26 2018
Sep 28
Triage nag: This Chrome OS bug has an owner but no component. Please add a component so that this can be tracked by the relevant team.
Nov 8
<UI triage> Bug owners, please add the appropriate component to your bug. Thanks!
Comment 1 by jrbarnette@chromium.org, Jun 12 2017
Labels: -Pri-3 Pri-1
Owner: sureshraj@chromium.org
Status: Assigned (was: Untriaged)