desktopui_MashLogin | FAIL: Autotest client terminated unexpectedly: DUT rebooted during the test run. |
||||||||||||||||||||
Issue descriptionBuild: https://luci-milo.appspot.com/buildbot/chromeos/nyan_kitty-paladin/837 Test logs: http://cautotest/tko/retrieve_logs.cgi?job=/results/103384742-chromeos-test/ In particular, I see some chrome core dumps in https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/103384742-chromeos-test/chromeos4-row13-rack9-host7/crashinfo.chromeos4-row13-rack9-host7/
,
Feb 27 2017
There's a single, small (7 MB) chrome core there. The Chrome crash is a CHECK in some zygote code, see issue 692227. It's bad that the CHECK is happening, and bad that the test doesn't fail, but I don't think Chrome is why the device is rebooting. I see this in /var/log/messages, maybe that's normal? 2017-02-25T06:05:11.216648-08:00 ALERT kernel: [ 1093.863585] Unhandled prefetch abort: breakpoint debug exception (0x222) at 0x00000000 Also, /var/log/messages is full of null bytes around 2017-02-25T06:05:21.696654-08:00 - maybe that's due to unclean reboot? akeshet, maybe someone more familiar with the OS-level logs could take a look?
,
Feb 27 2017
any ideas?
,
Feb 28 2017
If the zygote is checking, how do you know anything is working at all? I looked at the logs, but nothing obvious to me. There are suspicious events/log files, but I don't know how mash looks normally.
,
Feb 28 2017
@2: it seems like twice chrome exited very quickly, and then once it crashes. If this keeps happening the session_manager decides to reboot. The reboots in the logs are perfectly normal, this is all a problem on the Chrome side.
,
Feb 28 2017
,
Feb 28 2017
+derat, do you know off the top of your head how many times chrome has to abnormally exit before session_manager reboots the device? And whether it logs when it does that? @5 - The test is supposed to start chrome and exit quickly, twice. Once it starts it with --mash and verifies it loads OOBE. Then it starts it with --mash and verifies login works. Do you see evidence that chrome is crash-looping often enough that session_manager would reboot? @4, zygote is crashing during chrome shutdown, it's some sort of race where a process is trying to start while things are being torn down. zygote is working normally during startup. Regardless, I will look some more.
,
Feb 28 2017
er, really +derat, see @7
,
Feb 28 2017
src/platform2/login_manager/init/scripts/ui-respawn: # The ui job requires special respawning logic, as the system needs to respond # differently to different kinds of exits. # # 0: Normal exit; respawn forever. # 2: (Defined in session_manager_service.h) The browser is exiting too much too # quickly. In some cases, such as a hung GPU, rebooting the device will allow # the system to recover. Respawn up to TOO_CRASHY_LIMIT times in # TOO_CRASHY_WINDOW_SECONDS before trying a reboot. Try this up to # REBOOT_LIMIT times in REBOOT_WINDOW_SECONDS before giving up and waiting # for an AU. # 3: (Defined in session_manager_service.h) The session_manager has detected # a condition that requires a powerwash to recover. It should have requested # that the device reboot before exiting, so stop respawning the UI. # Other nonzero: respawn up to RESPAWN_LIMIT times in RESPAWN_WINDOW_SECONDS, # then stop. ... RESPAWN_LIMIT=6 RESPAWN_WINDOW_SECONDS=60 ... TOO_CRASHY_LIMIT=1 TOO_CRASHY_WINDOW_SECONDS=180 ... REBOOT_LIMIT=1 REBOOT_WINDOW_SECONDS=$(( 3 * TOO_CRASHY_WINDOW_SECONDS ))
,
Feb 28 2017
session_manager likely won't log anything (other than the Chrome exits) since this is all done via Upstart, but it looks like ui-respawn logs what it's doing (probably to /var/log/messages).
,
Feb 28 2017
> session_manager likely won't log anything (other than > the Chrome exits) since this is all done via Upstart, > but it looks like ui-respawn logs what it's doing > (probably to /var/log/messages). Yes, it logs to /var/log/messages, using the job name as the tag, so `grep ui-respawn /var/log/messages` will give you the relevant history.
,
Feb 28 2017
derat, do you have time to chat about this? I don't think the reboot is due to chrome crashing. However, chrome is definitely exiting abnormally. See attached annotated /var/log/messages file. "ui-respawn" does not appear in the log. session_manager should print the line "Child stopped, shutting down" if chrome is crashing too quickly, but it does not. https://cs.corp.google.com/chromeos_internal/src/platform2/login_manager/session_manager_service.cc?type=cs&q=CHILD_EXITING_TOO_FAST&sq=package:%5Echromeos_internal$&l=300 The zygote crash occurs in a non-mash chrome start during the test. I am uncertain why chrome starts so many times during this test. I cannot reproduce the problem locally on link.
,
Feb 28 2017
Annotated /var/log/messages
,
Feb 28 2017
Is this potentially the same? https://luci-milo.appspot.com/buildbot/chromeos/nyan_kitty-paladin/874
,
Feb 28 2017
akeshet, what do you mean by "the same"? It's another device reboot mid-test. I don't see anything I recognize in the logs, other than the fact that many log files are 0 size and /var/log/messages is full of nulls shortly after the test starts. I don't think there is a chrome problem in that test. The MashLogin chrome crash is probably related to the fact that chrome --mash is not correctly handling sigterm, see issue 697146 . I'm working on that now. I still think the DUT reboot is a different problem than the chrome cores. Do you have a way to see how many tests are failing with "DUT rebooted" and see if those are much more likely on nyan_kitty?
,
Mar 1 2017
Something like this happened again on https://uberchromegw.corp.google.com/i/chromeos/builders/nyan_kitty-paladin/builds/888 Is this catching a real problem? Or is it time to remove this test from the CQ?
,
Mar 1 2017
> Is this catching a real problem? Or is it time to remove this test from the CQ? The failure on winky included kernel CLs that touched DRM (i.e. graphics driver stuff). So, it's entirely plausible that we blocked a bad CL. I note also that a kernel bug in DRM code could plausibly cause _both_ Chrome crashes and kernel crashes.
,
Mar 1 2017
It's possible this test is tickling some other bad behavior. It is sometimes slow to run (minutes). I have a chrome CL out for review that will make chrome shutdown more cleanly and I think fix the zygote crash. https://codereview.chromium.org/2729633002/ Disabling the test is fine with me. I'll send a CL for that and I'll put the test back after my chrome CL lands.
,
Mar 1 2017
> Disabling the test is fine with me. I'll send a CL for that > and I'll put the test back after my chrome CL lands. Wait, no, let's not do that. Looking at the test wmatrix, it's consistently green. It's far more likely that the current failures are finding real bugs. If we disable the test or even change it to paper over problems, we could be borrowing trouble.
,
Mar 1 2017
I don't think I'm the right owner for this. I don't think this is a chrome problem. Looking at the failure in @16, the test reaches its last line according to debug/client.0.INFO: 03/01 09:01:58.340 INFO |desktopui_MashLogi:0046| Chrome login with --mash succeeded. https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/client/site_tests/desktopui_MashLogin/desktopui_MashLogin.py?type=cs&q=%22chrome+--mash%22&sq=package:%5Echromeos_public$&l=46 Per /var/log/messages the reboot happens about a minute later: 2017-03-01T09:02:53.667018-08:00 INFO kernel: [ 0.000000] Booting Linux on physical CPU 0x0 The failure does not have any chrome crashes / core files. It's possible something during test cleanup is causing the reboot but I'm not sure what. I'm OK with disabling the test if it's causing CQ runs to fail (since it sounds like the CQ is unhappy right now), or keeping it if someone else can help investigate the underlying issue.
,
Mar 1 2017
> I'm OK with disabling the test if it's causing CQ runs to fail > (since it sounds like the CQ is unhappy right now), or keeping > it if someone else can help investigate the underlying issue. I repeat: disabling the test looks like the wrong thing. The CQ is most likely failing this test because one or more CLs are triggering the failure. We don't want to let those CLs into the tree. In a case like this, standard procedure is to rope in a sheriff to help figure out the CL causing the problem, and then slap that CL with -1 Verifed. I've flipped a coin, which nominated marcheu@.
,
Mar 1 2017
chrome --mash has a slightly different ozone/GPU initialization code path than regular chrome. I don't know the details. It's conceivable that this tickles DRM code differently. However, the first part of MashLogin starts up and tears down the browser successfully -- I'm not sure why the second teardown would be any different.
,
Mar 1 2017
Since I have been nominated and since I think it's a Chrome issue, I will move the test to bvt per build until the bug is resolved on the Chrome side.
,
Mar 1 2017
Are there stack-traces from the zygote crash? Is it possible that session-manager looks at the various chrome processes shutting down (e.g. ash, ui, browser etc.), and that trips its counter to reboot? When session-manager asks 'chrome --mash' to shut down, does it ask the chrome process that acts as the service-manager? Or the chrome-browser process? Does it make a difference? (If it's asking chrome-browser, and chrome-browser is shutting down, maybe that causes the rest of the chrome process (including the service-manager) to actually crash?)
,
Mar 1 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/d4eaef1997cf57bff00418f1024ac82cfc31fda6 commit d4eaef1997cf57bff00418f1024ac82cfc31fda6 Author: Stéphane Marchesin <marcheu@chromium.org> Date: Wed Mar 01 20:31:20 2017 desktopui_MashLogin: move to bvt-perbuild There is an issue with this test preventing the CQ from working. Until this is resolved, let's move this test to bvt-perbuild. BUG= chromium:696696 TEST=none Change-Id: If74cdbe78bd5d33166550ace4ec6f55e79c04944 Reviewed-on: https://chromium-review.googlesource.com/447870 Reviewed-by: Aviv Keshet <akeshet@chromium.org> Commit-Queue: Stéphane Marchesin <marcheu@chromium.org> Tested-by: Stéphane Marchesin <marcheu@chromium.org> [modify] https://crrev.com/d4eaef1997cf57bff00418f1024ac82cfc31fda6/client/site_tests/desktopui_MashLogin/control
,
Mar 1 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/d4eaef1997cf57bff00418f1024ac82cfc31fda6 commit d4eaef1997cf57bff00418f1024ac82cfc31fda6 Author: Stéphane Marchesin <marcheu@chromium.org> Date: Wed Mar 01 20:31:20 2017 desktopui_MashLogin: move to bvt-perbuild There is an issue with this test preventing the CQ from working. Until this is resolved, let's move this test to bvt-perbuild. BUG= chromium:696696 TEST=none Change-Id: If74cdbe78bd5d33166550ace4ec6f55e79c04944 Reviewed-on: https://chromium-review.googlesource.com/447870 Reviewed-by: Aviv Keshet <akeshet@chromium.org> Commit-Queue: Stéphane Marchesin <marcheu@chromium.org> Tested-by: Stéphane Marchesin <marcheu@chromium.org> [modify] https://crrev.com/d4eaef1997cf57bff00418f1024ac82cfc31fda6/client/site_tests/desktopui_MashLogin/control
,
Mar 1 2017
Re c#26: That shouldn't have been done. There's no evidence that the test is doing any more than what it was supposed to do, which is block bugs from entering the tree through the CQ. I'm reverting the CL.
,
Mar 1 2017
#27 - This test is for a configuration of chrome that is not shipping today and is unlikely to ship for at least 6 months. It's in bvt-cq because that's the only way I can get it to run on the Chrome ToT informational builders. Its purpose is to test chrome's startup code with Ozone DRM, which has to be done on real hardware. That's why I don't feel strongly about it being disabled. sadrul, see #12 and #13. Neither derat@ nor I see evidence that session_manager is triggering the reboot. session_manager asks the chrome --mash root process to exit via SIGTERM. It doesn't handle SIGTERM properly today (that's what my Chrome CL fixes, https://codereview.chromium.org/2729633002/) but it hasn't handled SIGTERM properly for months. I suspect that killing the root process could make the window server / GPU code go through some weird shutdown / cleanup code, but that's pure conjecture on my part. If the GPU code didn't clean itself up properly could that trigger a reboot? The zygote issue was a CHECK with the same backtrace as issue 692227.
,
Mar 1 2017
Pausing for more consideration, I note the following:
* There's apparently no HWTest support for nyan on
on the PFQ (certainly not nyan_kitty), so a bug in
chrome could make it into ToT.
* This failure has affected five CQ runs since roughly
last Sunday.
* Although the test is green on the canary, the flake
dashboard indicates 7 retries for the test on nyan_kitty
and 3 more on nyan_big. All failures were for the same
symptom reported here.
* Looking at the blamelists for the failing CQ runs, there's
no overlap, which suggests the bug isn't a (single) CL.
I also note the nyan_kitty CQ failure in bug 693597 , which
seemed to indicate a chrome crash during boot.
The canary failures first showed up in R58-9301.0.0, using
Chrome 58.0.3015.0. Bug 693597 was seen paladin build
R58-9292.0.0-rc1, which should have used Chrome 58.0.3011.0.
I think the best evidence is for a Chrome bug starting in
Chrome 58.0.3015.0, or a Chrome OS bug starting in R58-9301.0.0.
There's some possibility of an earlier start given bug 693597 .
,
Mar 1 2017
> The zygote issue was a CHECK with the same backtrace as issue 692227. Hmmm... That bug pre-dates bug 693597 . Which could mean that we have a crash triggered in various ways, including sometimes at boot, sometimes by desktopui_MashLogin, poassibly sometimes in other ways. I'm checking the history of nyan_kitty, to see if we have more instances of the crash at boot, and whether we can get logs.
,
Mar 2 2017
OK. I've looked through logs on nyan_kitty, and I can't find anything to tie bug 693597 with this failure, other than timing, and the vague "chrome crashed in both cases" similarity.
,
Mar 2 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/f154aac6ce85da803ff8263ab287564c619359be commit f154aac6ce85da803ff8263ab287564c619359be Author: jamescook <jamescook@chromium.org> Date: Thu Mar 02 03:30:03 2017 mash: Cleanly exit on SIGTERM, SIGINT, SIGHUP This is needed for the Chrome OS session_manager to cleanly shutdown chrome on device shutdown and in tests. It's also useful on Linux desktop to cleanly exit chrome in response to control-C. * Refactor ShutdownDetector and signal handlers out of ChromeBrowserMainPartsPosix. Make it run a shutdown callback. * Use shutdown callback to exit the chrome --mash root process run loop. BUG= 697146 ,692227, 696696 TEST=stop ui closes chrome quickly, session_manager doesn't send signal 9 anymore, control-C on Linux desktop cleanly exists chrome Review-Url: https://codereview.chromium.org/2729633002 Cr-Commit-Position: refs/heads/master@{#454163} [modify] https://crrev.com/f154aac6ce85da803ff8263ab287564c619359be/chrome/app/BUILD.gn [modify] https://crrev.com/f154aac6ce85da803ff8263ab287564c619359be/chrome/app/mash/BUILD.gn [modify] https://crrev.com/f154aac6ce85da803ff8263ab287564c619359be/chrome/app/mash/DEPS [modify] https://crrev.com/f154aac6ce85da803ff8263ab287564c619359be/chrome/app/mash/mash_runner.cc [add] https://crrev.com/f154aac6ce85da803ff8263ab287564c619359be/chrome/app/shutdown_signal_handlers_posix.cc [add] https://crrev.com/f154aac6ce85da803ff8263ab287564c619359be/chrome/app/shutdown_signal_handlers_posix.h [modify] https://crrev.com/f154aac6ce85da803ff8263ab287564c619359be/chrome/browser/BUILD.gn [modify] https://crrev.com/f154aac6ce85da803ff8263ab287564c619359be/chrome/browser/chrome_browser_main_posix.cc
,
Mar 2 2017
The chrome CL landed at 7:30 PM, so it should be in for the 8 PM Chrome PFQ run. If that PFQ run passes then the fix will be in CQ runs tomorrow.
,
Mar 2 2017
Per go/goldeneye we picked up Chrome 58.0.3027.0 last night. However, that version does not contain my CL. (Not sure why. It might have been a manual uprev. Chrome PFQ has failed every run for > 1 week. Last night's failure was an auto-update failure on falco, not related to this issue.) I think we should disable the test until we pick up a version of Chrome with my fix.
,
Mar 2 2017
I manually upreved to 3027.
,
Mar 3 2017
,
Mar 7 2017
,
Mar 21 2017
,
Mar 22 2017
James this should be fixed now, right?
,
Mar 22 2017
Chrome is currently at 3041.
,
May 30 2017
,
Aug 1 2017
,
Jan 22 2018
|
||||||||||||||||||||
►
Sign in to add a comment |
||||||||||||||||||||
Comment 1 by jamescook@chromium.org
, Feb 27 2017