Starred by 2 users
Status: Fixed
Closed: Mar 2017
OS: Chrome
Pri: 1
Type: Bug



desktopui_MashLogin | FAIL: Autotest client terminated unexpectedly: DUT rebooted during the test run.
Project Member. Reported by akes...@chromium.org, Feb 27 2017
Status: Started
I'll take a look.
Cc: jamescook@chromium.org
Owner: akes...@chromium.org
Status: Assigned
There's a single, small (7 MB) chrome core there.

The Chrome crash is a CHECK in some zygote code, see issue 692227. It's bad that the CHECK is happening, and bad that the test doesn't fail, but I don't think Chrome is why the device is rebooting.

I see this in /var/log/messages, maybe that's normal?
2017-02-25T06:05:11.216648-08:00 ALERT kernel: [ 1093.863585] Unhandled prefetch abort: breakpoint debug exception (0x222) at 0x00000000

Also, /var/log/messages is full of null bytes around 2017-02-25T06:05:21.696654-08:00 - maybe that's due to unclean reboot?
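For anyone retracing this on the DUT, checks along these lines should surface the same observations (a rough sketch; the crash and log paths are the standard Chrome OS locations, not something quoted in this report):

ls -l /var/spool/crash /home/chronos/crash          # chrome cores / minidumps, if any
tr -cd '\0' < /var/log/messages | wc -c             # count the NUL bytes from the unclean shutdown
grep 'Unhandled prefetch abort' /var/log/messages   # the kernel ALERT quoted above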

akeshet, maybe someone more familiar with the OS-level logs could take a look?



Cc: akes...@chromium.org
Owner: ihf@chromium.org
Any ideas?
Comment 4 by ihf@chromium.org, Feb 28 2017
Owner: ----
Status: Available
If the zygote is hitting a CHECK, how do you know anything is working at all?

I looked at the logs, but nothing obvious stood out to me. There are suspicious events/log files, but I don't know what mash looks like normally.
Owner: jamescook@chromium.org
@2: It looks like chrome exited very quickly twice and then crashed once. If this keeps happening, session_manager decides to reboot. The reboots in the logs are perfectly normal; this is all a problem on the Chrome side.
Status: Assigned
Status: Started
+derat, do you know off the top of your head how many times chrome has to abnormally exit before session_manager reboots the device? And whether it logs when it does that?

@5 - The test is supposed to start chrome and exit quickly, twice: first it starts chrome with --mash and verifies it loads OOBE, then it starts it with --mash again and verifies login works. Do you see evidence that chrome is crash-looping often enough that session_manager would reboot?
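A rough way to approximate that by hand on a test device, for whoever picks this up (this is not how the autotest harness launches chrome, just a manual sketch; /etc/chrome_dev.conf and the upstart job names are standard test-image facilities, assumed here rather than taken from the test itself):

echo "--mash" >> /etc/chrome_dev.conf   # have session_manager pass --mash to chrome
restart ui                              # first start: should reach OOBE
restart ui                              # second start: log in and confirm the session comes up
grep ui-respawn /var/log/messages       # afterwards, check whether the respawn logic fired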

@4, the zygote is crashing during chrome shutdown; it's some sort of race where a process is trying to start while things are being torn down. The zygote works normally during startup.

Regardless, I will look some more.

Cc: derat@chromium.org
er, really +derat, see @7
Comment 9 by derat@chromium.org, Feb 28 2017
src/platform2/login_manager/init/scripts/ui-respawn:

# The ui job requires special respawning logic, as the system needs to respond
# differently to different kinds of exits.
#
# 0: Normal exit; respawn forever.
# 2: (Defined in session_manager_service.h) The browser is exiting too much too
#    quickly. In some cases, such as a hung GPU, rebooting the device will allow
#    the system to recover. Respawn up to TOO_CRASHY_LIMIT times in
#    TOO_CRASHY_WINDOW_SECONDS before trying a reboot. Try this up to
#    REBOOT_LIMIT times in REBOOT_WINDOW_SECONDS before giving up and waiting
#    for an AU.
# 3: (Defined in session_manager_service.h) The session_manager has detected
#    a condition that requires a powerwash to recover. It should have requested
#    that the device reboot before exiting, so stop respawning the UI.
# Other nonzero: respawn up to RESPAWN_LIMIT times in RESPAWN_WINDOW_SECONDS,
#                then stop.
...
RESPAWN_LIMIT=6
RESPAWN_WINDOW_SECONDS=60
...
TOO_CRASHY_LIMIT=1
TOO_CRASHY_WINDOW_SECONDS=180
...
REBOOT_LIMIT=1
REBOOT_WINDOW_SECONDS=$(( 3 * TOO_CRASHY_WINDOW_SECONDS ))
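
To make the exit-code-2 path concrete, here is a small standalone simulation of the policy those comments describe (illustrative only, not the real ui-respawn logic, and it ignores the time windows):

#!/bin/sh
# Usage: sh respawn-policy-sim.sh 0 2 2
# Prints what ui-respawn would decide for each browser exit code in sequence.
TOO_CRASHY_LIMIT=1
crashy_seen=0
for code in "$@"; do
  case "$code" in
    0) echo "exit 0: normal exit, respawn" ;;
    2) crashy_seen=$((crashy_seen + 1))
       if [ "$crashy_seen" -le "$TOO_CRASHY_LIMIT" ]; then
         echo "exit 2 (occurrence $crashy_seen): respawn"
       else
         echo "exit 2 (occurrence $crashy_seen): over TOO_CRASHY_LIMIT within 180 s, reboot the device"
       fi ;;
    3) echo "exit 3: powerwash requested, stop respawning the UI" ;;
    *) echo "exit $code: respawn up to 6 times in 60 s, then stop" ;;
  esac
done

So a second exit-code-2 within 180 seconds is what would trigger a reboot, and with REBOOT_LIMIT=1 a second such reboot within 540 seconds makes the script give up and wait for an update.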

Comment 10 by derat@chromium.org, Feb 28 2017
session_manager likely won't log anything (other than the Chrome exits) since this is all done via Upstart, but it looks like ui-respawn logs what it's doing (probably to /var/log/messages).
> session_manager likely won't log anything (other than
> the Chrome exits) since this is all done via Upstart,
> but it looks like ui-respawn logs what it's doing
> (probably to /var/log/messages).

Yes, it logs to /var/log/messages, using the job name as
the tag, so `grep ui-respawn /var/log/messages` will give
you the relevant history.

derat, do you have time to chat about this?

I don't think the reboot is due to chrome crashing. However, chrome is definitely exiting abnormally. See attached annotated /var/log/messages file.

"ui-respawn" does not appear in the log.

session_manager should print the line "Child stopped, shutting down" if chrome is crashing too quickly, but it does not.

https://cs.corp.google.com/chromeos_internal/src/platform2/login_manager/session_manager_service.cc?type=cs&q=CHILD_EXITING_TOO_FAST&sq=package:%5Echromeos_internal$&l=300
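
A sketch of how to check both of the above on the DUT (the /var/log paths are the standard locations, and ui.LATEST as session_manager's log destination is my assumption):

grep ui-respawn /var/log/messages          # empty here, so the respawn logic never fired
grep -i 'child stopped, shutting down' /var/log/messages /var/log/ui/ui.LATEST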

The zygote crash occurs in a non-mash chrome start during the test. I am uncertain why chrome starts so many times during this test.

I cannot reproduce the problem locally on link.

Attachment: messages (135 KB), annotated /var/log/messages
Status: Assigned
akeshet, what do you mean by "the same"?

It's another device reboot mid-test. I don't see anything I recognize in the logs, other than the fact that many log files are 0 size and /var/log/messages is full of nulls shortly after the test starts. I don't think there is a chrome problem in that test.

The MashLogin chrome crash is probably related to the fact that chrome --mash is not correctly handling SIGTERM, see issue 697146. I'm working on that now.

I still think the DUT reboot is a different problem than the chrome cores.

Do you have a way to see how many tests are failing with "DUT rebooted" and see if those are much more likely on nyan_kitty?

Something like this happened again on https://uberchromegw.corp.google.com/i/chromeos/builders/nyan_kitty-paladin/builds/888

Is this catching a real problem? Or is it time to remove this test from the CQ?
> Is this catching a real problem? Or is it time to remove this test from the CQ?

The failure on winky included kernel CLs that touched DRM
(i.e. graphics driver stuff).  So, it's entirely plausible
that we blocked a bad CL.

I note also that a kernel bug in DRM code could plausibly cause
_both_ Chrome crashes and kernel crashes.

It's possible this test is tickling some other bad behavior. It is sometimes slow to run (minutes). I have a chrome CL out for review that will make chrome shutdown more cleanly and I think fix the zygote crash. https://codereview.chromium.org/2729633002/

Disabling the test is fine with me. I'll send a CL for that and I'll put the test back after my chrome CL lands.

> Disabling the test is fine with me. I'll send a CL for that
> and I'll put the test back after my chrome CL lands.

Wait, no, let's not do that.  Looking at the test wmatrix,
it's consistently green.  It's far more likely that the current
failures are finding real bugs.  If we disable the test or
even change it to paper over problems, we could be borrowing
trouble.


Owner: ----
Status: Untriaged
I don't think I'm the right owner for this. I don't think this is a chrome problem.

Looking at the failure in @16, the test reaches its last line according to debug/client.0.INFO:

03/01 09:01:58.340 INFO |desktopui_MashLogi:0046| Chrome login with --mash succeeded.

https://cs.corp.google.com/chromeos_public/src/third_party/autotest/files/client/site_tests/desktopui_MashLogin/desktopui_MashLogin.py?type=cs&q=%22chrome+--mash%22&sq=package:%5Echromeos_public$&l=46

Per /var/log/messages the reboot happens about a minute later:

2017-03-01T09:02:53.667018-08:00 INFO kernel: [    0.000000] Booting Linux on physical CPU 0x0

The failure does not have any chrome crashes / core files.

It's possible something during test cleanup is causing the reboot but I'm not sure what.

I'm OK with disabling the test if it's causing CQ runs to fail (since it sounds like the CQ is unhappy right now), or keeping it if someone else can help investigate the underlying issue.

Cc: oka@chromium.org moch@chromium.org marc...@chromium.org
Labels: -Pri-2 Pri-1
Owner: marc...@chromium.org
Status: Assigned
> I'm OK with disabling the test if it's causing CQ runs to fail
> (since it sounds like the CQ is unhappy right now), or keeping
> it if someone else can help investigate the underlying issue.

I repeat: disabling the test looks like the wrong thing.  The
CQ is most likely failing this test because one or more CLs are
triggering the failure.  We don't want to let those CLs into the
tree.

In a case like this, standard procedure is to rope in a sheriff to
help figure out the CL causing the problem, and then slap that CL
with -1 Verified.  I've flipped a coin, which nominated marcheu@.

Cc: kylec...@chromium.org sadrul@chromium.org
chrome --mash has a slightly different ozone/GPU initialization code path than regular chrome. I don't know the details. It's conceivable that this tickles DRM code differently. However, the first part of MashLogin starts up and tears down the browser successfully -- I'm not sure why the second teardown would be any different.

Since I have been nominated and since I think it's a Chrome issue, I will move the test to bvt-perbuild until the bug is resolved on the Chrome side.
Are there stack-traces from the zygote crash?

Is it possible that session-manager looks at the various chrome processes shutting down (e.g. ash, ui, browser etc.), and that trips its counter to reboot?

When session-manager asks 'chrome --mash' to shut down, does it ask the chrome process that acts as the service-manager? Or the chrome-browser process? Does it make a difference? (If it's asking chrome-browser, and chrome-browser is shutting down, maybe that causes the rest of the chrome process (including the service-manager) to actually crash?)
Project Member Comment 26 by bugdroid1@chromium.org, Mar 1 2017
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/d4eaef1997cf57bff00418f1024ac82cfc31fda6

commit d4eaef1997cf57bff00418f1024ac82cfc31fda6
Author: Stéphane Marchesin <marcheu@chromium.org>
Date: Wed Mar 01 20:31:20 2017

desktopui_MashLogin: move to bvt-perbuild

There is an issue with this test preventing the CQ from working.
Until this is resolved, let's move this test to bvt-perbuild.

BUG= chromium:696696 
TEST=none

Change-Id: If74cdbe78bd5d33166550ace4ec6f55e79c04944
Reviewed-on: https://chromium-review.googlesource.com/447870
Reviewed-by: Aviv Keshet <akeshet@chromium.org>
Commit-Queue: Stéphane Marchesin <marcheu@chromium.org>
Tested-by: Stéphane Marchesin <marcheu@chromium.org>

[modify] https://crrev.com/d4eaef1997cf57bff00418f1024ac82cfc31fda6/client/site_tests/desktopui_MashLogin/control

Re c#26:  That shouldn't have been done.  There's no evidence
that the test is doing any more than what it was supposed to do,
which is block bugs from entering the tree through the CQ.

I'm reverting the CL.

#27 - This test is for a configuration of chrome that is not shipping today and is unlikely to ship for at least 6 months. It's in bvt-cq because that's the only way I can get it to run on the Chrome ToT informational builders. Its purpose is to test chrome's startup code with Ozone DRM, which has to be done on real hardware. That's why I don't feel strongly about it being disabled.

sadrul, see #12 and #13. Neither derat@ nor I see evidence that session_manager is triggering the reboot. session_manager asks the chrome --mash root process to exit via SIGTERM. It doesn't handle SIGTERM properly today (that's what my Chrome CL fixes, https://codereview.chromium.org/2729633002/) but it hasn't handled SIGTERM properly for months. I suspect that killing the root process could make the window server / GPU code go through some weird shutdown / cleanup code, but that's pure conjecture on my part. If the GPU code didn't clean itself up properly could that trigger a reboot?
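
If someone wants to poke at the SIGTERM handling by hand on a device, a rough sketch (treating the oldest chrome process as the --mash root is my assumption about the process tree, not something verified here):

root_pid=$(pgrep -o chrome)     # oldest chrome process, assumed to be the --mash root
kill -TERM "$root_pid"
sleep 5
if pgrep chrome > /dev/null; then
  echo "chrome still running: SIGTERM was not handled cleanly"
else
  echo "all chrome processes exited"
fi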

The zygote issue was a CHECK with the same backtrace as issue 692227.

Pausing for more consideration, I note the following:
  * There's apparently no HWTest support for nyan on
    the PFQ (certainly not nyan_kitty), so a bug in
    chrome could make it into ToT.
  * This failure has affected five CQ runs since roughly
    last Sunday.
  * Although the test is green on the canary, the flake
    dashboard indicates 7 retries for the test on nyan_kitty
    and 3 more on nyan_big.  All failures were for the same
    symptom reported here.
  * Looking at the blamelists for the failing CQ runs, there's
    no overlap, which suggests the bug isn't a (single) CL.

I also note the nyan_kitty CQ failure in  bug 693597 , which
seemed to indicate a chrome crash during boot.

The canary failures first showed up in R58-9301.0.0, using
Chrome 58.0.3015.0. Bug 693597 was seen in paladin build
R58-9292.0.0-rc1, which should have used Chrome 58.0.3011.0.

I think the best evidence is for a Chrome bug starting in
Chrome 58.0.3015.0, or a Chrome OS bug starting in R58-9301.0.0.
There's some possibility of an earlier start given  bug 693597 .

> The zygote issue was a CHECK with the same backtrace as issue 692227.

Hmmm...  That bug pre-dates  bug 693597 .  Which could mean that we
have a crash triggered in various ways, including sometimes at boot,
sometimes by desktopui_MashLogin, possibly sometimes in other ways.

I'm checking the history of nyan_kitty, to see if we have more
instances of the crash at boot, and whether we can get logs.

OK.  I've looked through logs on nyan_kitty, and I can't find
anything to tie bug 693597 to this failure, other than timing,
and the vague "chrome crashed in both cases" similarity.

Project Member Comment 32 by bugdroid1@chromium.org, Mar 2 2017
The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/f154aac6ce85da803ff8263ab287564c619359be

commit f154aac6ce85da803ff8263ab287564c619359be
Author: jamescook <jamescook@chromium.org>
Date: Thu Mar 02 03:30:03 2017

mash: Cleanly exit on SIGTERM, SIGINT, SIGHUP

This is needed for the Chrome OS session_manager to cleanly shutdown chrome
on device shutdown and in tests. It's also useful on Linux desktop to cleanly
exit chrome in response to control-C.

* Refactor ShutdownDetector and signal handlers out of
ChromeBrowserMainPartsPosix. Make it run a shutdown callback.
* Use shutdown callback to exit the chrome --mash root process run loop.

BUG= 697146 ,692227, 696696 
TEST=stop ui closes chrome quickly, session_manager doesn't send signal 9
anymore, control-C on Linux desktop cleanly exits chrome

Review-Url: https://codereview.chromium.org/2729633002
Cr-Commit-Position: refs/heads/master@{#454163}

[modify] https://crrev.com/f154aac6ce85da803ff8263ab287564c619359be/chrome/app/BUILD.gn
[modify] https://crrev.com/f154aac6ce85da803ff8263ab287564c619359be/chrome/app/mash/BUILD.gn
[modify] https://crrev.com/f154aac6ce85da803ff8263ab287564c619359be/chrome/app/mash/DEPS
[modify] https://crrev.com/f154aac6ce85da803ff8263ab287564c619359be/chrome/app/mash/mash_runner.cc
[add] https://crrev.com/f154aac6ce85da803ff8263ab287564c619359be/chrome/app/shutdown_signal_handlers_posix.cc
[add] https://crrev.com/f154aac6ce85da803ff8263ab287564c619359be/chrome/app/shutdown_signal_handlers_posix.h
[modify] https://crrev.com/f154aac6ce85da803ff8263ab287564c619359be/chrome/browser/BUILD.gn
[modify] https://crrev.com/f154aac6ce85da803ff8263ab287564c619359be/chrome/browser/chrome_browser_main_posix.cc

The chrome CL landed at 7:30 PM, so it should be in for the 8 PM Chrome PFQ run. If that PFQ run passes then the fix will be in CQ runs tomorrow.

Per go/goldeneye we picked up Chrome 58.0.3027.0 last night. However, that version does not contain my CL. (Not sure why. It might have been a manual uprev. Chrome PFQ has failed every run for > 1 week. Last night's failure was an auto-update failure on falco, not related to this issue.)

I think we should disable the test until we pick up a version of Chrome with my fix.

Comment 35 by ihf@chromium.org, Mar 2 2017
I manually upreved to 3027.
Comment 36 by oka@chromium.org, Mar 3 2017
Cc: -oka@chromium.org shunhsingou@chromium.org
Chrome is at 58.0.3028.0 now (up to r454140) but still does not contain my commit above (r454163). The next uprev should get it.

(Aside: It's unfortunate that a CL I landed ~40 hours ago still hasn't made it across a Chrome roll.)

Labels: -current-issue
Comment 39 by moch@chromium.org, Mar 21 2017
Cc: -moch@chromium.org
Owner: jamescook@chromium.org
James, this should be fixed now, right?
Comment 41 by ihf@chromium.org, Mar 22 2017
Status: Fixed
Chrome is currently at 3041.
Labels: VerifyIn-60
Labels: VerifyIn-61