Issue metadata
Sign in to add a comment
|
Unable to lock screen due to session_manager's start-user-session call to Upstart timing out |
||||||||||||||||||||||||
Issue descriptionNo specific steps. Feedback: https://feedback.corp.google.com/#/Report/52659411013 Chrome failed to call SessionManager's StartSession via DBus. [751:751:0203/125228.324988:ERROR:object_proxy.cc(583)] Failed to call method: org.chromium.SessionManagerInterface.StartSession: object_path= /org/chromium/SessionManager: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken. [751:751:0203/125228.325178:ERROR:session_manager_client.cc(497)] Failed to call StartSession As a result, screen could not be locked because session manager daemon thinks user session is not started: [751:751:0203/125922.144600:ERROR:object_proxy.cc(583)] Failed to call method: org.chromium.SessionManagerInterface.LockScreen: object_path= /org/chromium/SessionManager: org.chromium.SessionManagerInterface.SessionDoesNotExist: Attempt to lock screen outside of user session. 2017-02-03T12:41:43.763923-08:00 WARNING session_manager[716]: [WARNING:session_manager_impl.cc(616)] Attempt to lock screen outside of user session. Dan, you have look at some dbus failures before. So I'll let you triage this.
,
Feb 3 2017
Xiyuan and I discussed this. We really need to unravel all this code and design and find a 'good' fix to this. Dan, for now, if we just change the timeout within Chrome for the StartSession call to infinite, do you think that would be an acceptable short-term fix?
,
Feb 3 2017
,
Feb 3 2017
That's probably not a good fix. session_manager is single-threaded. If it's blocking on its call to Upstart, it won't be able to handle any other calls that Chrome makes to it, either. Figuring out why Upstart is blocking would be best, but making session_manager's call to Upstart asynchronous is an easy way to approximate the same behavior (assuming that it's not dependent on the start-user-session-triggered jobs finishing synchronously).
,
Feb 3 2017
Oh, and just to point it out: changing the timeout on Chrome's call to session_manager will have no effect (other than making us not log an error). SessionManagerClient::OnStartSession doesn't do anything with the reply, so Chrome already doesn't care at all about what session_manager does here. The problem just arises because session_manager internally abandons starting the session after its call to Upstart times out.
,
Feb 3 2017
Uploaded https://chromium-review.googlesource.com/437655 as a prospective workaround. I'll test it in a bit to make sure it doesn't break anything in an obvious way. It's too exciting to merge, I think, but it'd probably be reasonable to check it in now so it has some time to bake before the next release branch is cut.
,
Feb 3 2017
It sounds like that making the call to upstart is risky. There may be good reasons why upstart blocks there - particularly if it is blocking on shill_login_user. The other option was to retry the start session call till it succeeded :) Again, that's a possible workaround, not a fix. I do completely agree that we absolutely have to figure out why upstart is blocking in the first place. Once we do, I would prefer, if it is at all possible, we redesign this mechanism to not be a blocking call.
,
Feb 3 2017
*making the call to upstart async
,
Feb 3 2017
To be clear, this is expected behavior from Upstart. Try creating a file named /etc/init/handle-foo.conf that contains the following: start on foo task exec sleep 30 Then run "initctl emit foo" and watch it block for 30 seconds waiting for the task to complete. It's unclear whether shill_login_user actually needs to finish before session_manager can go on or whether this was just an accidental side effect of how someone wrote the script years ago (maybe Paul can comment). If it was intentional, it's probably not working as intended anyway, since Chrome isn't blocking here (or even paying attention to the response from session_manager) and will continue to interact with shill regardless of whether or not shill_login_user has finished. This message, from right after session_manager's call to Upstart timing out, is probably relevant, and points the finger at shill_login_user being responsible: 2017-02-03T12:52:28.483943-08:00 NOTICE shill_login_user[2040]: Failed to push ~chronos/shill profile It looks like this is logged in response to a org.chromium.flimflam.Manager.InsertUserProfile method call failing. See also this from powerd.LATEST: [0203/125257:INFO:suspend_delay_controller.cc(63)] Registering suspend delay 46006274 (shill) of 19500 ms on behalf of :1.10 [0203/125257:INFO:suspend_delay_controller.cc(63)] Registering dark suspend delay 46039042 (shill) of 19500 ms on behalf of :1.10 shill usually registers its suspend delays with powerd when it starts up, so the fact that it didn't happen until 12:52:57 suggests that shill was slow to start. That's going to cause all sorts of other problems, so someone on the networking side should investigate. Insulating session_manager from this kind of stuff is probably still worthwhile.
,
Feb 3 2017
"Insulating session_manager from this kind of stuff is probably still worthwhile." Agreed. I can understand the reasons why upstart blocks. Session manager depending on that, in fact, failing to start a session, if it doesn't complete fast enough is, less than ideal :)
,
Feb 4 2017
,
Feb 5 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform2/+/8f239af37cf10dc6d10ec72a99bbc14538c584ef commit 8f239af37cf10dc6d10ec72a99bbc14538c584ef Author: Daniel Erat <derat@chromium.org> Date: Sun Feb 05 20:33:35 2017 login: Make start-user-session Upstart signal async. Make session_manager's EmitEvent D-Bus method call to Upstart to emit the start-user-session signal asynchronous. Much like the call to emit login-prompt-visible, this call appears to time out in some cases (due to slow jobs that are triggered by the Upstart signal?), resulting in session_manager being blocked. In the case of start-user-session, a timeout also results in session_manager abandoning Chrome's StartSession request, which means that later attempts to lock the screen fail. BUG= chromium:688560 TEST=manual: add script section with "sleep 40" to /etc/init/shill-start-user-session.conf. check that session_manager logs an error about com.ubuntu.Upstart0_6.EmitEvent timing out but that it's still possible to lock the screen. Change-Id: If49046cf30df40b27c18bb1528439f722db692d4 Reviewed-on: https://chromium-review.googlesource.com/437655 Commit-Ready: Dan Erat <derat@chromium.org> Tested-by: Dan Erat <derat@chromium.org> Reviewed-by: Xiyuan Xia <xiyuan@chromium.org> [modify] https://crrev.com/8f239af37cf10dc6d10ec72a99bbc14538c584ef/login_manager/session_manager_impl.cc [modify] https://crrev.com/8f239af37cf10dc6d10ec72a99bbc14538c584ef/login_manager/dbus_bindings/org.chromium.SessionManagerInterface.xml
,
Feb 6 2017
Is this issue affecting M-56 too?
,
Feb 6 2017
I am seeing crashes like this one that make be believe so: https://crash.corp.google.com/browse?q=product.name%3D%27Chrome_ChromeOS%27%20AND%20product.version%3D%2756.0.2924.87%27%20AND%20custom_data.ChromeCrashProto.ptype%3D%27browser%27%20AND%20custom_data.ChromeCrashProto.magic_signature_1.name%3D%27%5BAssert%5D%20ash%3A%3ALockStateController%3A%3AOnLockFailTimeout%27&ignore_case=false&enable_rewrite=true&omit_field_name=&omit_field_value=&omit_field_opt=%3D&stbtiq=&reportid=&index=0#0
,
Feb 6 2017
...that one is in top 10 crashes for M-56
,
Feb 6 2017
I suspect the above crash happened because the lock screen webui took too long to show (fix in issue 669638 ). I don't see any dbus/session manager errors in the logs.
,
Feb 6 2017
What milestone has fixes for issue 669638 ?
,
Feb 6 2017
Any build after 57.0.2981.0
,
Feb 6 2017
Agreed re the crash from #14 being unrelated to this. The logs show that session_manager wasn't blocked and asked Chrome to lock the screen as requested: [12741:12741:0206/071009.355980:VERBOSE1:screen_locker.cc(399)] Received LockScreen request from session manager and 2017-02-06T07:10:09.314212-05:00 INFO session_manager[12715]: [INFO:session_manager_impl.cc(577)] LockScreen() method called. --- Someone from the networking team still needs to figure out why shill-start-user-session (and also shill?) were hanging in the feedback report here, though. Who should that be?
,
Feb 7 2017
Seems like no need to port fixes to m56 then (please add m56 if that is not the case) +snanda, kirtika to evaluate networking issues listed in c#20
,
Feb 10 2017
So, this is a panther Chromebox (Haswell on 3.8), similar to the zako Chromebox that has been experiencing ~1min delays connecting to PoE switches. This may be related to Abhishek's CfM bug. In the log it looks like shill starts up at 12:51:57, but it gets stuck for 60 seconds: 2017-02-03T12:51:57.133895-08:00 INFO shill[1005]: [INFO:manager.cc(281)] Manager started. 2017-02-03T12:51:57.135396-08:00 INFO shill[1005]: [INFO:chromeos_power_manager_proxy.cc(352)] OnSignalConnected interface: org.chromium.PowerManager signal: SuspendImminentsuccess: 1 2017-02-03T12:51:57.135738-08:00 INFO shill[1005]: [INFO:chromeos_power_manager_proxy.cc(352)] OnSignalConnected interface: org.chromium.PowerManager signal: SuspendDonesuccess: 1 2017-02-03T12:51:57.136127-08:00 INFO shill[1005]: [INFO:chromeos_power_manager_proxy.cc(352)] OnSignalConnected interface: org.chromium.PowerManager signal: DarkSuspendImminentsuccess: 1 2017-02-03T12:51:57.161724-08:00 INFO shill[1005]: [INFO:manager.cc(532)] PushProfileInternal finished; 1 profile(s) now present. 2017-02-03T12:51:57.162440-08:00 INFO shill[1005]: [INFO:modem_manager.cc(107)] Start watching modem manager service: org.chromium.ModemManager 2017-02-03T12:51:57.163707-08:00 INFO shill[1005]: [INFO:modem_manager_1.cc(50)] Start watching modem manager service: org.freedesktop.ModemManager1 2017-02-03T12:51:57.164977-08:00 INFO shill[1005]: [INFO:service.cc(296)] etherneteap service 0 constructed. 2017-02-03T12:51:57.166266-08:00 INFO shill[1005]: [INFO:device.cc(240)] Device created: eth0 index 3 2017-02-03T12:51:57.170324-08:00 INFO shill[1005]: [INFO:service.cc(296)] ethernet service 1 constructed. 2017-02-03T12:51:57.171016-08:00 INFO shill[1005]: [INFO:manager.cc(1245)] Device eth0 registered. 2017-02-03T12:51:57.171103-08:00 INFO shill[1005]: [INFO:manager.cc(1514)] Device eth0 updated: enabled 2017-02-03T12:51:59.037058-08:00 INFO ModemManager[1182]: <info> ModemManager (version 1.5.0) starting in system bus... 2017-02-03T12:52:01.611445-08:00 INFO ModemManager[1182]: <info> Couldn't find support for device at '/sys/devices/pci0000:00/0000:00:1c.0/0000:01:00.0': not supported by any plugin 2017-02-03T12:52:01.611532-08:00 INFO ModemManager[1182]: <info> Couldn't find support for device at '/sys/devices/pci0000:00/0000:00:1c.1/0000:02:00.0': not supported by any plugin 2017-02-03T12:52:57.204052-08:00 INFO shill[1005]: [INFO:ethernet.cc(121)] Registering eth0 with manager. 2017-02-03T12:52:57.204879-08:00 INFO shill[1005]: [INFO:device_info.cc(1278)] Creating WiFi device for station mode interface wlan0 at interface index 2 2017-02-03T12:52:57.205161-08:00 INFO shill[1005]: [INFO:device.cc(240)] Device created: wlan0 index 2 2017-02-03T12:52:57.211078-08:00 INFO shill[1005]: [INFO:manager.cc(1245)] Device wlan0 registered. 2017-02-03T12:52:57.211156-08:00 INFO shill[1005]: [INFO:manager.cc(1514)] Device wlan0 updated: disabled The kernel messages involving r8169 might provide a hint as to what is blocking. It looks like both shill and avahi-daemon were blocked: 2017-02-03T12:51:55.791291-08:00 INFO kernel: [ 1.789737] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded 2017-02-03T12:51:55.791293-08:00 NOTICE kernel: [ 1.790817] frecon(289): Frecon using drm driver i915, version 1.6, date(20080730), desc(Intel Graphics) 2017-02-03T12:51:55.791299-08:00 NOTICE kernel: [ 1.790963] DEBUG: tsm_screen: new screen 2017-02-03T12:51:55.791301-08:00 NOTICE kernel: [ 1.790993] DEBUG: tsm_vte: new vte object 2017-02-03T12:51:55.791302-08:00 DEBUG kernel: [ 1.794395] r8169 0000:01:00.0: irq 62 for MSI/MSI-X 2017-02-03T12:51:55.791304-08:00 INFO kernel: [ 1.794685] r8169 0000:01:00.0 eth0: RTL8168g/8111g at 0xffffc90000074000, 08:9e:01:00:00:01, XID 0c000800 IRQ 62 2017-02-03T12:51:55.791306-08:00 INFO kernel: [ 1.794704] r8169 0000:01:00.0 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko] [~60 seconds pass, then everything comes back to life] 2017-02-03T12:52:54.393723-08:00 ERR laptop-mode[2352]: Couldn't acquire lock. Retrying.... PID is 2261\n 2017-02-03T12:52:55.397712-08:00 ERR laptop-mode[2354]: Couldn't acquire lock. Retrying.... PID is 2261\n 2017-02-03T12:52:56.401634-08:00 ERR laptop-mode[2356]: Couldn't acquire lock. Retrying.... PID is 2261\n 2017-02-03T12:52:57.201051-08:00 WARNING kernel: [ 63.132655] r8169 0000:01:00.0 eth0: unable to load firmware patch rtl_nic/rtl8168g-1.fw (-2) 2017-02-03T12:52:57.204139-08:00 INFO avahi-daemon[1197]: Network interface enumeration completed. 2017-02-03T12:52:57.204374-08:00 INFO avahi-daemon[1197]: Server startup complete. Host name is 23fcfbac04c7454f1ce0ec7a5894ed6b.local. Local service cookie is 3001996824. 2017-02-03T12:52:57.204432-08:00 ERR avahi-daemon[1197]: write() failed while writing return value to pipe: Broken pipe 2017-02-03T12:52:57.206183-08:00 INFO kernel: [ 63.135956] r8169 0000:01:00.0 eth0: link down 2017-02-03T12:52:57.206202-08:00 INFO kernel: [ 63.135999] r8169 0000:01:00.0 eth0: link down 2017-02-03T12:52:57.206207-08:00 INFO kernel: [ 63.136038] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready 2017-02-03T12:52:57.238229-08:00 INFO p2p-http-server[2378]: p2p-http-server starting [../../../../../../../../../mnt/host/source/src/platform2/p2p/http_server/main.cc:49] 2017-02-03T12:52:57.238274-08:00 INFO p2p-http-server[2378]: Maximum download rate per connection set to 125000 bytes/sec [../../../../../../../../../mnt/host/source/src/platform2/p2p/http_server/main.cc:75] 2017-02-03T12:52:57.251837-08:00 INFO p2p-http-server[2378]: Sending message {PortNumber: 16725} [../../../../../../../../../mnt/host/source/src/platform2/p2p/http_server/server.cc:267] 2017-02-03T12:52:57.263450-08:00 ERR shill[1005]: [ERROR:wifi.cc(2509)] Scan time unreliable 2017-02-03T12:52:57.280810-08:00 INFO laptop-mode[2473]: Laptop mode 2017-02-03T12:52:57.282359-08:00 INFO laptop-mode[2474]: enabled, not active 2017-02-03T12:52:57.308632-08:00 INFO avahi-daemon[1197]: Server startup complete. Host name is 9922517a7a0f0bf9688813e05894eda9.local. Local service cookie is 3001996824. r8169 could be blocking in request_firmware(), because the default loading timeout is 60 seconds. There have not been any recent changes to r8169 or drivers/base on our Linux 3.8. But it looks like udev was upgraded from 216->225 between M56->M57, so that's something to investigate. Maybe related, maybe not: https://forums.gentoo.org/viewtopic-t-1003240.html
,
Feb 10 2017
There's another open bug regarding that FW blob: https://chromium-review.googlesource.com/c/430272/ https://bugs.chromium.org/p/chromium/issues/detail?id=682445 Is it possible that udev 216 noticed that the (optional?) FW was missing and told the kernel to fail immediately, while udev 225 makes the kernel wait 60s and time out? The CL is for panther but I'm not sure if that covers all Haswell 3.8 based Chromeboxes. It's also worth checking whether this firmware blob was ever included in Chrome OS builds. Maybe the Realtek chip is software-upgradable with a fallback to ROM, and we're stuck running buggy old ROM code instead of the latest downloadable FW image.
,
Feb 10 2017
This is indeed related with the udev upgrade, userspace firmware loading support has been removed in newer udev versions. This has been also observed on lumpy ( Issue 682578 ). A fix is underway: https://chromium-review.googlesource.com/c/438681 https://chromium-review.googlesource.com/c/438682 https://chromium-review.googlesource.com/c/438830
,
Feb 14 2017
Requesting a merge of my speculative workaround at https://chromium-review.googlesource.com/c/437655/ to M57. The firmware-loading fixes from #24 have landed. Can those also be merged to M57 (and M56, if the issue is also present there)?
,
Feb 14 2017
#25: The merge of the changes for Issue 682578 to M57 has been requested.
,
Feb 14 2017
Ah, thanks. I'll just dupe this change into that one and drop the merge request for my workaround, then.
,
Mar 5 2017
BTW, I'm getting this repeatedly on my Samus running Version 58.0.3027.0 dev - just wanted to make sure this isn't unexpected, because I *thought* the various firmware changes were already landed by 58.0.3011.0
,
Mar 5 2017
#28: Does it look like shill taking a long time to start is still the root cause in what you're seeing? Do you see the same "shill_login_user[2040]: Failed to push ~chronos/shill profile" message and see shill take a long time to register itself in /var/log/power_manager/powerd.LATEST?
,
Mar 7 2017
I also just hit this on samus running 58.0.3027.0. Nothing about "Failed to push" in that log file.
,
Mar 7 2017
What exactly are you seeing now? Presumably the screen is failing to lock, but do you see the same: Failed to call method: org.chromium.SessionManagerInterface.LockScreen: object_path= /org/chromium/SessionManager: org.chromium.SessionManagerInterface.SessionDoesNotExist: Attempt to lock screen outside of user session. error in Chrome's user log and: [WARNING:session_manager_impl.cc(616)] Attempt to lock screen outside of user session. error in /var/log/messages? If so, what (if anything) does session_manager log earlier, when you're logging in? |
|||||||||||||||||||||||||
►
Sign in to add a comment |
|||||||||||||||||||||||||
Comment 1 by derat@chromium.org
, Feb 3 2017Summary: Unable to lock screen due to session_manager's start-user-session call to Upstart timing out (was: Dbus failure caused lock screen failure)