Issue metadata
Sign in to add a comment
|
[All boards] Regression: Device hangs at black screen on signout, stateful wiped (verified mode only) |
||||||||||||||||||||||
Issue descriptionChrome Version: 60.0.3105.0/9574.0.0 dev-channel Daisy,Candy and minnie OS: Chrome What steps will reproduce the problem? (1)Sign into user -> click on Uber Tray and now Click on 'Sign out' (Please refer Video) Expected: 'Sign out' screen should be seen on clicking 'Sign out' in Uber Tray Actual: Instead Device gets turned off on clicking 'Sign out' in Uber Tary This is Regression issue as same is working fine in 60.0.3102.0/9565.0.0 dev-channel Candy @tbuckley : Please confirm the Issue Note: 1.No Crash Id's are generated 2.Again need to press Reload + Power button to turn on the Device 3.Clicking on power button also does not turn on the device.
,
May 22 2017
Note: 3.Issue is also seen on clicking 'Shut down' Button and on creating Supervised User
,
May 22 2017
+ Josafat [M60 TPM]
,
May 22 2017
@rkc any thoughts who can look at this? I don't think there should be any functional changes around the sign-out button recently. +tdanderson to confirm
,
May 22 2017
The shutdown looks unclean, and if it is a kernel crash the device should come up. But it does not. Missing crash logs, and having this observed cros-platform, it looks like the screen backlight goes to 0. Prasanthi reproduced the bug and she confirmed the Brightness is 0, but user can manually bring it back. She is uploading logs in next update.
,
May 22 2017
Tested on Minnie & Daisy with version 60.0.3105.0/9575.0.0 dev-channel. Logs here: https://pantheon.corp.google.com/storage/browser/chromiumos-test-logs/bugfiles/cr/724937/
,
May 22 2017
,
May 22 2017
I don't understand what's happening in the video. After you click "Sign out" and the screen goes black, are you pressing Reload+Power offscreen? I don't see the boot splash screen when the screen turns back on, just the login screen (as expected). I'm confused about what's actually being discussed here. Is the device shutting down or hanging, or (per #5) is the backlight brightness just being set to 0? If the understanding of the problem has changed, please update the "Summary" field to be accurate. Here's what I see in powerd.LATEST from #6: [0522/160314:INFO:daemon.cc(1484)] Session state changed to stopped ... [0522/160318:INFO:display_power_setter.cc(82)] Asking Chrome to turn all displays on [0522/160318:INFO:daemon.cc(1394)] Chrome is using normal display mode [0522/160319:INFO:internal_backlight_controller.cc(437)] Got user-triggered request to set brightness to 81.25% [0522/160319:INFO:internal_backlight_controller.cc(682)] Setting brightness to 163 (81.25%) over 200 ms [0522/160319:INFO:internal_backlight_controller.cc(437)] Got user-triggered request to set brightness to 87.5% [0522/160319:INFO:internal_backlight_controller.cc(682)] Setting brightness to 191 (87.5%) over 200 ms [0522/160320:INFO:internal_backlight_controller.cc(437)] Got user-triggered request to set brightness to 93.75% [0522/160320:INFO:internal_backlight_controller.cc(682)] Setting brightness to 221 (93.75%) over 200 ms [0522/160320:INFO:daemon.cc(1394)] Chrome is using normal display mode [0522/160320:INFO:internal_backlight_controller.cc(437)] Got user-triggered request to set brightness to 100% [0522/160320:INFO:internal_backlight_controller.cc(682)] Setting brightness to 253 (100%) over 200 ms [0522/160320:INFO:internal_backlight_controller.cc(437)] Got user-triggered request to set brightness to 100% [0522/160320:INFO:internal_backlight_controller.cc(682)] Setting brightness to 253 (100%) over 200 ms ... [0522/160626:INFO:daemon.cc(1484)] Session state changed to started The user logs out, the brightness is manually increased, and then three minutes later the user logs back in.
,
May 22 2017
Retested with more devices and had observations with 60.0.3105.0/9575.0.0. Taking back the backlight point. The sign-out causes about 9 sec delay which was assumed to be backlight issue - while pressing the brightness keys the devices were comming up on sign-in screen. The delay varies from 4-to-10 seconds.
,
May 22 2017
So it sounds like this regressed between 60.0.3102.0 and 60.0.3105.0. Can you narrow the regression range further?
,
May 22 2017
Or it could be an OS issue between 9565.0.0 and 9574.0.0.
,
May 22 2017
Some more checks on different devices over RC builds from last week. Boards on Dev channel show delay up to 16 seconds, and shorter if reboot before user session. Devices that have been tested and suspended from last week, have slower sign-out. Boards at Stable channel show 5-8 sec transition. mmanchala@ and jbanavatu@, can you re-test again and see if it is unusual delay, as we see it? If the board does not come back (lets say for a minute), use Alt+VolUp+x_key_three times. After the device reboots, generate log and upload to the linked at #6 location. Are there any peripherals attached to the device?
,
May 22 2017
I just repro'd this on my Samus running latest canary 9575.0.0 I was not able to get system back after logging out unless I long pressed the power button Feedback report: https://feedback.corp.google.com/#/Report/61346803386
,
May 22 2017
Thanks josafat@, We were able to reproduce it too. The condition is - device in normal mode. Once the board is in normal mode this is consistent failure. Short-key Ctrl+Shift+qq has same effect.
,
May 22 2017
Kalin, can you confirm when the regression started? rkc/derat@, would you be the right owner for this?
,
May 22 2017
I still don't understand from this bug report whether the device is actually shutting down, hanging, or just taking a long time to sign out and/or restart Chrome. This should be part of the bug report; it's not possible to route it to the right team without knowing this. #12 suggests that logout is just slow, but #13 changed the bug's title to suggest that the system needs to be power-cycled. Please figure out the answers to the following: a) Does the system eventually come back, or do you need to hold the power button for eight seconds? I see both things in this report. b) Is the system actually shutting down? Look at the LED (location varies depending on the device). From the various comments that have been written so far, including the statement that the power button needs to be held, I would guess that the system isn't shutting down and is instead hanging. That would suggest either a Chrome hang during exit, a Chrome hang during restart, a session_manager bug, or a hang at the kernel level.
,
May 23 2017
Here is what I see on my Samus, - the device seems to be hung state since I can see the lightbar all lighted after waiting for an hour after log out - I can also repro if I exit guest session (e.g. not related to an specific login) - Closing/Open the lid does not change the lightbar (e.g. stays on) - Long press seems to be the only way to bring device back
,
May 23 2017
So either a Chrome hang during exit or a kernel hang. Dan would know more about this than me, including where to route it.
,
May 23 2017
[cc ihf@, the current gardener] If this really affects all boards, it's odd that it doesn't appear to be causing a zillion PFQ failures. I'll see if I'm able to repro it on a lumpy device tomorrow, I guess.
,
May 23 2017
I can't repro on veyron_minnie/R60-9574.0.0. Signout works fine from my test account as well as guest.
,
May 23 2017
Issue is seen when device is in Verified Mode. Waited upto 6-8 mins time but still the device did not come back to 'Sign-out' screen And Unable to repro the issue if device is in Dev-Mode i.e. it takes 5-8 sec to come back to 'Sign-out' screen after clicking on 'Sign out'.
,
May 23 2017
,
May 23 2017
I bet the lab doesn't run tests in verified mode, so no problem for gardening.
,
May 23 2017
I'll still try to repro this, but I'm probably not going to be able to do anything to help if it only shows up in verified mode. Albert, do you have any ideas about what we can try here?
,
May 23 2017
,
May 23 2017
I'm able to repro this consistently on a verified-mode peppy running 9574.0.0 / 60.0.3105.0. I logged in with my test account, waited a few minutes, and signed out using the system menu. The device froze with a black screen. After a few minutes, I rebooted by hitting Alt+F10+x three times. The stateful partition appeared to be trashed after the system came back up -- my profile picture was missing on the login screen, a bunch of extension icons are missing in the shelf, there didn't appear to be any log files from before the reboot, and even the client ID (as displayed at chrome://system) was empty. file:///home/chronos/user/log contains previous log files, but I don't see anything useful at the end of them. I re-enabled crash reporting to generate a new client ID, signed out, and saw the issue again. This time, I synced the disk with Alt+F10+s after each Alt+F10+x. At the login screen, my profile picture was present this time, and chrome://system shows the same client ID as after I re-opted-in to crash reporting after the first crash. The old logs are still missing from file:///var/log. I only see one crash for the device's client ID (2ebf287fd617490e967987157b97f8a8), at http://crash/browse?q=reportid=%270108342768000000%27, and the kernel log doesn't appear to have any useful data. I generated another crash at http://crash/browse?q=reportid=%27757d522768000000%27 but don't see any commonality between the kernel traces in the two crashes. I suspect that the problem is at a higher layer than the kernel since the sysrq actions are still working. It looks like we don't permit other sysrq keys that would be useful here, like 't', 'w', and 'l'. I'm stumped and don't know what else to try. Why does this only show up in verified mode? Why is the stateful partition getting wiped? Does anyone more familiar with verified mode have any ideas of other things to try?
,
May 23 2017
Can someone on the test team help by bisecting this? Per the initial report, 9565.0.0 was good (I haven't verified this) and 9574.0.0 was bad. There are a decent number of OS changes in that range (see http://crosland/log/9565.0.0..9574.0.0) and it'd be very helpful to narrow it down in case we're reduced to looking for suspicious changes (which I'll start doing now...).
,
May 23 2017
Found it! https://chromium-review.googlesource.com/c/507991/ (intentionally) disabled sudo in verified mode by bind-mounting /bin/false over it. Unfortunately, the ui-post-stop script loops indefinitely until a sudo command returns true: # Make sure everything is going down. No exceptions. # The loop is so that clever daemons can't evade the kill by # racing us and killing us first; we'll just try over and over # until we win the race, and kill with pid -1 is atomic with # respect to process creation. while ! sudo -u chronos kill -9 -- -1 ; do sleep .1 done Revert at https://chromium-review.googlesource.com/c/512705.
,
May 23 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform2/+/9e6d52acf77744f78cef5029f761df5088f07453 commit 9e6d52acf77744f78cef5029f761df5088f07453 Author: Dan Erat <derat@chromium.org> Date: Tue May 23 16:51:49 2017 Revert "init: bind mount /bin/false over /usr/bin/sudo in verified mode" This reverts commit f65f0ab17ed3d3887cd0eea105beac67f65f7635. Reason for revert: This makes the ui-post-stop script, which depends on sudo, hang indefinitely: while ! sudo -u chronos kill -9 -- -1 ; do sleep .1 done BUG= chromium:724937 Change-Id: I2220d8afc02fff5a0e98aa8dc9fc45b37bd1d038 Reviewed-on: https://chromium-review.googlesource.com/512705 Reviewed-by: Mike Frysinger <vapier@chromium.org> Tested-by: Dan Erat <derat@chromium.org> [modify] https://crrev.com/9e6d52acf77744f78cef5029f761df5088f07453/init/chromeos_startup
,
May 23 2017
Are we only using sudo there to change to the chronos user and kill all chronos processes? Surely we can replace that.
,
May 23 2017
Yes, I'm sure we can (and I encourage filing a new bug to track that change, along with the reland of the sudo bind-mount change :-P). I think that this is the cause of the hang, but it still needs to be verified by the test team. Josafat said that we'll have a new image soon.
,
May 23 2017
Thank you Dan for finding this! @c30: This is a P0 breaking the user experience completely. The first course of action is to revert the offending CL, not to experiment with ways to make the system work with the CL. @TPMs: Do we not have any bots that run all the tests on verified images? If not, this sounds like a serious issue - we should re-investigate why this is and see if there are ways to fix this. Finding issues that completely destroy the user experience only on verified mode devices because everyone's devices just break is not an ideal way of finding these issues.
,
May 23 2017
The lab DUTs do run in verified boot (the systems boot off of firmware signed by the generic dev key), but they specifically bypass some settings like this sudo change as a necessity for running tests (e.g. they run test images). It is not clear that we can both run automated tests and be in a fully locked down system image at the same time. For this sort of case, we rely on manual testing.
,
May 23 2017
I'd still like to know why all of the logs under /var/log seemed to be getting wiped when I triggered this and rebooted via the sysrq key. This happened even when I gave the device multiple minutes to flush to disk and used the "emergency sync" key. Is this expected behavior? Does anyone have any guesses about why it would happen? I haven't been able to repro it in verified mode with other images.
,
May 23 2017
there's a few things we need to sort out and test independently before we can reland disabling of sudo. there's a few weird uses of sudo in platform2/ that seems like we should kill off. they aren't changes we'd want to do under the gun vs reverting my CL. in this scenario, revert was the right call. i'll file a new bug to track the various bits.
,
May 23 2017
,
May 26 2017
Verified on M60 build 9591.0.0 Signout is working fine.
,
Jun 15 2017
|
|||||||||||||||||||||||
►
Sign in to add a comment |
|||||||||||||||||||||||
Comment 1 by jbanavatu@chromium.org
, May 22 2017