New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 688560 link

Starred by 4 users

Issue metadata

Status: Duplicate
Merged: issue 682578
Owner:
Closed: Feb 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

Unable to lock screen due to session_manager's start-user-session call to Upstart timing out

Project Member Reported by xiy...@chromium.org, Feb 3 2017

Issue description

No specific steps.

Feedback:
https://feedback.corp.google.com/#/Report/52659411013

Chrome failed to call SessionManager's StartSession via DBus.

[751:751:0203/125228.324988:ERROR:object_proxy.cc(583)] Failed to call method: org.chromium.SessionManagerInterface.StartSession: object_path= /org/chromium/SessionManager: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
[751:751:0203/125228.325178:ERROR:session_manager_client.cc(497)] Failed to call StartSession

As a result, screen could not be locked because session manager daemon thinks user session is not started:

[751:751:0203/125922.144600:ERROR:object_proxy.cc(583)] Failed to call method: org.chromium.SessionManagerInterface.LockScreen: object_path= /org/chromium/SessionManager: org.chromium.SessionManagerInterface.SessionDoesNotExist: Attempt to lock screen outside of user session.

2017-02-03T12:41:43.763923-08:00 WARNING session_manager[716]: [WARNING:session_manager_impl.cc(616)] Attempt to lock screen outside of user session.

Dan, you have look at some dbus failures before. So I'll let you triage this.
 

Comment 1 by derat@chromium.org, Feb 3 2017

Cc: jrbarnette@chromium.org mnissler@chromium.org pstew@chromium.org
Summary: Unable to lock screen due to session_manager's start-user-session call to Upstart timing out (was: Dbus failure caused lock screen failure)
session_manager seems to have received the StartSession call from Chrome, but its D-Bus call to Upstart timed out:

2017-02-03T12:52:28.462129-08:00 ERR session_manager[723]: [ERROR:object_proxy.cc(582)] Failed to call method: com.ubuntu.Upstart0_6.EmitEvent: object_path= /com/ubuntu/Upstart: org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2017-02-03T12:52:28.462157-08:00 ERR session_manager[723]: [ERROR:session_manager_impl.cc(434)] Emitting start-user-session upstart signal failed.
2017-02-03T12:52:28.462478-08:00 ERR session_manager[723]: [ERROR:session_manager_impl.cc(556)] Cannot retrieve user policy before session is started

This is probably why Chrome's call to session_manager timed out.

We've seen similar timeouts in the past when session_manager asks Upstart to emit login-prompt-visible. The working theory was that one or more slow-running jobs were started by this signal, and that Upstart deferred replying to session_manager's call until they finished. I didn't have any luck proving this theory by finding any jobs that I could delay to repro, this. In any case, I made session_manager's login-prompt-visible call to Upstart asynchronous in  issue 671662  as an optimistic workaround.

SessionManagerImpl::StartSession() currently makes the start-user-session call to Upstart synchronously, but we could also change this to be async (quite easily, now). I again have fears that this will cause subtle issues, but... maybe not? On a recent lumpy build, I only see the following references to start-user-session in /etc/init:

- login.conf: start on start-user-session, touches /run/state/logged-in
- machine-info.conf: stop on start-user-session or stopping ui, removes /tmp/machine-info
- shill-start-user-session.conf: start on start-user-session, runs shill_login_user

I suspect that it may be okay for all of those to be run asynchronously. I remain dismayed that we have trouble getting reliable, timely IPC to our init daemon, though. :-(

Comment 2 by st...@chromium.org, Feb 3 2017

Xiyuan and I discussed this. We really need to unravel all this code and design and find a 'good' fix to this.

Dan, for now, if we just change the timeout within Chrome for the StartSession call to infinite, do you think that would be an acceptable short-term fix?

Comment 3 by st...@chromium.org, Feb 3 2017

Cc: alemate@chromium.org jdufault@chromium.org

Comment 4 by derat@chromium.org, Feb 3 2017

Status: Started (was: Assigned)
That's probably not a good fix. session_manager is single-threaded. If it's blocking on its call to Upstart, it won't be able to handle any other calls that Chrome makes to it, either.

Figuring out why Upstart is blocking would be best, but making session_manager's call to Upstart asynchronous is an easy way to approximate the same behavior (assuming that it's not dependent on the start-user-session-triggered jobs finishing synchronously).

Comment 5 by derat@chromium.org, Feb 3 2017

Oh, and just to point it out: changing the timeout on Chrome's call to session_manager will have no effect (other than making us not log an error). SessionManagerClient::OnStartSession doesn't do anything with the reply, so Chrome already doesn't care at all about what session_manager does here. The problem just arises because session_manager internally abandons starting the session after its call to Upstart times out.

Comment 6 by derat@chromium.org, Feb 3 2017

Uploaded https://chromium-review.googlesource.com/437655 as a prospective workaround. I'll test it in a bit to make sure it doesn't break anything in an obvious way.

It's too exciting to merge, I think, but it'd probably be reasonable to check it in now so it has some time to bake before the next release branch is cut.

Comment 7 by st...@chromium.org, Feb 3 2017

It sounds like that making the call to upstart is risky. There may be good reasons why upstart blocks there - particularly if it is blocking on shill_login_user.


The other option was to retry the start session call till it succeeded :) Again, that's a possible workaround, not a fix.

I do completely agree that we absolutely have to figure out why upstart is blocking in the first place. Once we do, I would prefer, if it is at all possible, we redesign this mechanism to not be a blocking call.

Comment 8 by st...@chromium.org, Feb 3 2017

*making the call to upstart async

Comment 9 by derat@chromium.org, Feb 3 2017

To be clear, this is expected behavior from Upstart. Try creating a file named /etc/init/handle-foo.conf that contains the following:

start on foo
task
exec sleep 30

Then run "initctl emit foo" and watch it block for 30 seconds waiting for the task to complete.

It's unclear whether shill_login_user actually needs to finish before session_manager can go on or whether this was just an accidental side effect of how someone wrote the script years ago (maybe Paul can comment).

If it was intentional, it's probably not working as intended anyway, since Chrome isn't blocking here (or even paying attention to the response from session_manager) and will continue to interact with shill regardless of whether or not shill_login_user has finished.

This message, from right after session_manager's call to Upstart timing out, is probably relevant, and points the finger at shill_login_user being responsible:

2017-02-03T12:52:28.483943-08:00 NOTICE shill_login_user[2040]: Failed to push ~chronos/shill profile

It looks like this is logged in response to a org.chromium.flimflam.Manager.InsertUserProfile method call failing.

See also this from powerd.LATEST:

[0203/125257:INFO:suspend_delay_controller.cc(63)] Registering suspend delay 46006274 (shill) of 19500 ms on behalf of :1.10
[0203/125257:INFO:suspend_delay_controller.cc(63)] Registering dark suspend delay 46039042 (shill) of 19500 ms on behalf of :1.10

shill usually registers its suspend delays with powerd when it starts up, so the fact that it didn't happen until 12:52:57 suggests that shill was slow to start. That's going to cause all sorts of other problems, so someone on the networking side should investigate.

Insulating session_manager from this kind of stuff is probably still worthwhile.
"Insulating session_manager from this kind of stuff is probably still worthwhile."

Agreed. I can understand the reasons why upstart blocks. Session manager depending on that, in fact, failing to start a session, if it doesn't complete fast enough is, less than ideal :)

Components: OS>Systems>Network
Project Member

Comment 12 by bugdroid1@chromium.org, Feb 5 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/8f239af37cf10dc6d10ec72a99bbc14538c584ef

commit 8f239af37cf10dc6d10ec72a99bbc14538c584ef
Author: Daniel Erat <derat@chromium.org>
Date: Sun Feb 05 20:33:35 2017

login: Make start-user-session Upstart signal async.

Make session_manager's EmitEvent D-Bus method call to
Upstart to emit the start-user-session signal asynchronous.

Much like the call to emit login-prompt-visible, this call
appears to time out in some cases (due to slow jobs that are
triggered by the Upstart signal?), resulting in
session_manager being blocked.

In the case of start-user-session, a timeout also results in
session_manager abandoning Chrome's StartSession request,
which means that later attempts to lock the screen fail.

BUG= chromium:688560 
TEST=manual: add script section with "sleep 40" to
     /etc/init/shill-start-user-session.conf. check that
     session_manager logs an error about
     com.ubuntu.Upstart0_6.EmitEvent timing out but that
     it's still possible to lock the screen.

Change-Id: If49046cf30df40b27c18bb1528439f722db692d4
Reviewed-on: https://chromium-review.googlesource.com/437655
Commit-Ready: Dan Erat <derat@chromium.org>
Tested-by: Dan Erat <derat@chromium.org>
Reviewed-by: Xiyuan Xia <xiyuan@chromium.org>

[modify] https://crrev.com/8f239af37cf10dc6d10ec72a99bbc14538c584ef/login_manager/session_manager_impl.cc
[modify] https://crrev.com/8f239af37cf10dc6d10ec72a99bbc14538c584ef/login_manager/dbus_bindings/org.chromium.SessionManagerInterface.xml

Is this issue affecting M-56 too? 
...that one is in top 10 crashes for M-56
I suspect the above crash happened because the lock screen webui took too long to show (fix in  issue 669638 ). I don't see any dbus/session manager errors in the logs.

Comment 17 Deleted

What milestone has fixes for  issue 669638 ?
Any build after 57.0.2981.0
Agreed re the crash from #14 being unrelated to this. The logs show that session_manager wasn't blocked and asked Chrome to lock the screen as requested:

[12741:12741:0206/071009.355980:VERBOSE1:screen_locker.cc(399)] Received LockScreen request from session manager

and

2017-02-06T07:10:09.314212-05:00 INFO session_manager[12715]: [INFO:session_manager_impl.cc(577)] LockScreen() method called.

---

Someone from the networking team still needs to figure out why shill-start-user-session (and also shill?) were hanging in the feedback report here, though. Who should that be?
Cc: kirtika@chromium.org snanda@chromium.org
Seems like no need to port fixes to m56 then (please add m56 if that is not the case)

+snanda, kirtika to evaluate networking issues listed in c#20


Cc: abhishekbh@chromium.org mka@chromium.org cernekee@chromium.org
So, this is a panther Chromebox (Haswell on 3.8), similar to the zako Chromebox that has been experiencing ~1min delays connecting to PoE switches.  This may be related to Abhishek's CfM bug.

In the log it looks like shill starts up at 12:51:57, but it gets stuck for 60 seconds:

2017-02-03T12:51:57.133895-08:00 INFO shill[1005]: [INFO:manager.cc(281)] Manager started.
2017-02-03T12:51:57.135396-08:00 INFO shill[1005]: [INFO:chromeos_power_manager_proxy.cc(352)] OnSignalConnected interface: org.chromium.PowerManager signal: SuspendImminentsuccess: 1
2017-02-03T12:51:57.135738-08:00 INFO shill[1005]: [INFO:chromeos_power_manager_proxy.cc(352)] OnSignalConnected interface: org.chromium.PowerManager signal: SuspendDonesuccess: 1
2017-02-03T12:51:57.136127-08:00 INFO shill[1005]: [INFO:chromeos_power_manager_proxy.cc(352)] OnSignalConnected interface: org.chromium.PowerManager signal: DarkSuspendImminentsuccess: 1
2017-02-03T12:51:57.161724-08:00 INFO shill[1005]: [INFO:manager.cc(532)] PushProfileInternal finished; 1 profile(s) now present.
2017-02-03T12:51:57.162440-08:00 INFO shill[1005]: [INFO:modem_manager.cc(107)] Start watching modem manager service: org.chromium.ModemManager
2017-02-03T12:51:57.163707-08:00 INFO shill[1005]: [INFO:modem_manager_1.cc(50)] Start watching modem manager service: org.freedesktop.ModemManager1
2017-02-03T12:51:57.164977-08:00 INFO shill[1005]: [INFO:service.cc(296)] etherneteap service 0 constructed.
2017-02-03T12:51:57.166266-08:00 INFO shill[1005]: [INFO:device.cc(240)] Device created: eth0 index 3
2017-02-03T12:51:57.170324-08:00 INFO shill[1005]: [INFO:service.cc(296)] ethernet service 1 constructed.
2017-02-03T12:51:57.171016-08:00 INFO shill[1005]: [INFO:manager.cc(1245)] Device eth0 registered.
2017-02-03T12:51:57.171103-08:00 INFO shill[1005]: [INFO:manager.cc(1514)] Device eth0 updated: enabled
2017-02-03T12:51:59.037058-08:00 INFO ModemManager[1182]: <info>  ModemManager (version 1.5.0) starting in system bus...
2017-02-03T12:52:01.611445-08:00 INFO ModemManager[1182]: <info>  Couldn't find support for device at '/sys/devices/pci0000:00/0000:00:1c.0/0000:01:00.0': not supported by any plugin
2017-02-03T12:52:01.611532-08:00 INFO ModemManager[1182]: <info>  Couldn't find support for device at '/sys/devices/pci0000:00/0000:00:1c.1/0000:02:00.0': not supported by any plugin
2017-02-03T12:52:57.204052-08:00 INFO shill[1005]: [INFO:ethernet.cc(121)] Registering eth0 with manager.
2017-02-03T12:52:57.204879-08:00 INFO shill[1005]: [INFO:device_info.cc(1278)] Creating WiFi device for station mode interface wlan0 at interface index 2
2017-02-03T12:52:57.205161-08:00 INFO shill[1005]: [INFO:device.cc(240)] Device created: wlan0 index 2
2017-02-03T12:52:57.211078-08:00 INFO shill[1005]: [INFO:manager.cc(1245)] Device wlan0 registered.
2017-02-03T12:52:57.211156-08:00 INFO shill[1005]: [INFO:manager.cc(1514)] Device wlan0 updated: disabled


The kernel messages involving r8169 might provide a hint as to what is blocking.  It looks like both shill and avahi-daemon were blocked:


2017-02-03T12:51:55.791291-08:00 INFO kernel: [    1.789737] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
2017-02-03T12:51:55.791293-08:00 NOTICE kernel: [    1.790817] frecon(289): Frecon using drm driver i915, version 1.6, date(20080730), desc(Intel Graphics)
2017-02-03T12:51:55.791299-08:00 NOTICE kernel: [    1.790963] DEBUG: tsm_screen: new screen
2017-02-03T12:51:55.791301-08:00 NOTICE kernel: [    1.790993] DEBUG: tsm_vte: new vte object
2017-02-03T12:51:55.791302-08:00 DEBUG kernel: [    1.794395] r8169 0000:01:00.0: irq 62 for MSI/MSI-X
2017-02-03T12:51:55.791304-08:00 INFO kernel: [    1.794685] r8169 0000:01:00.0 eth0: RTL8168g/8111g at 0xffffc90000074000, 08:9e:01:00:00:01, XID 0c000800 IRQ 62
2017-02-03T12:51:55.791306-08:00 INFO kernel: [    1.794704] r8169 0000:01:00.0 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]

[~60 seconds pass, then everything comes back to life]

2017-02-03T12:52:54.393723-08:00 ERR laptop-mode[2352]: Couldn't acquire lock. Retrying.... PID is 2261\n
2017-02-03T12:52:55.397712-08:00 ERR laptop-mode[2354]: Couldn't acquire lock. Retrying.... PID is 2261\n
2017-02-03T12:52:56.401634-08:00 ERR laptop-mode[2356]: Couldn't acquire lock. Retrying.... PID is 2261\n
2017-02-03T12:52:57.201051-08:00 WARNING kernel: [   63.132655] r8169 0000:01:00.0 eth0: unable to load firmware patch rtl_nic/rtl8168g-1.fw (-2)
2017-02-03T12:52:57.204139-08:00 INFO avahi-daemon[1197]: Network interface enumeration completed.
2017-02-03T12:52:57.204374-08:00 INFO avahi-daemon[1197]: Server startup complete. Host name is 23fcfbac04c7454f1ce0ec7a5894ed6b.local. Local service cookie is 3001996824.
2017-02-03T12:52:57.204432-08:00 ERR avahi-daemon[1197]: write() failed while writing return value to pipe: Broken pipe
2017-02-03T12:52:57.206183-08:00 INFO kernel: [   63.135956] r8169 0000:01:00.0 eth0: link down
2017-02-03T12:52:57.206202-08:00 INFO kernel: [   63.135999] r8169 0000:01:00.0 eth0: link down
2017-02-03T12:52:57.206207-08:00 INFO kernel: [   63.136038] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
2017-02-03T12:52:57.238229-08:00 INFO p2p-http-server[2378]: p2p-http-server starting [../../../../../../../../../mnt/host/source/src/platform2/p2p/http_server/main.cc:49]
2017-02-03T12:52:57.238274-08:00 INFO p2p-http-server[2378]: Maximum download rate per connection set to 125000 bytes/sec [../../../../../../../../../mnt/host/source/src/platform2/p2p/http_server/main.cc:75]
2017-02-03T12:52:57.251837-08:00 INFO p2p-http-server[2378]: Sending message {PortNumber: 16725} [../../../../../../../../../mnt/host/source/src/platform2/p2p/http_server/server.cc:267]
2017-02-03T12:52:57.263450-08:00 ERR shill[1005]: [ERROR:wifi.cc(2509)] Scan time unreliable
2017-02-03T12:52:57.280810-08:00 INFO laptop-mode[2473]: Laptop mode
2017-02-03T12:52:57.282359-08:00 INFO laptop-mode[2474]: enabled, not active
2017-02-03T12:52:57.308632-08:00 INFO avahi-daemon[1197]: Server startup complete. Host name is 9922517a7a0f0bf9688813e05894eda9.local. Local service cookie is 3001996824.


r8169 could be blocking in request_firmware(), because the default loading timeout is 60 seconds.  There have not been any recent changes to r8169 or drivers/base on our Linux 3.8.  But it looks like udev was upgraded from 216->225 between M56->M57, so that's something to investigate.

Maybe related, maybe not: https://forums.gentoo.org/viewtopic-t-1003240.html
There's another open bug regarding that FW blob:

https://chromium-review.googlesource.com/c/430272/
https://bugs.chromium.org/p/chromium/issues/detail?id=682445

Is it possible that udev 216 noticed that the (optional?) FW was missing and told the kernel to fail immediately, while udev 225 makes the kernel wait 60s and time out?

The CL is for panther but I'm not sure if that covers all Haswell 3.8 based Chromeboxes.

It's also worth checking whether this firmware blob was ever included in Chrome OS builds.  Maybe the Realtek chip is software-upgradable with a fallback to ROM, and we're stuck running buggy old ROM code instead of the latest downloadable FW image.

Comment 24 by mka@google.com, Feb 10 2017

This is indeed related with the udev upgrade, userspace firmware loading support has been removed in newer udev versions. This has been also observed on lumpy ( Issue 682578 ).

A fix is underway:

https://chromium-review.googlesource.com/c/438681
https://chromium-review.googlesource.com/c/438682
https://chromium-review.googlesource.com/c/438830

Comment 25 by derat@chromium.org, Feb 14 2017

Labels: Merge-Request-57
Requesting a merge of my speculative workaround at https://chromium-review.googlesource.com/c/437655/ to M57.

The firmware-loading fixes from #24 have landed. Can those also be merged to M57 (and M56, if the issue is also present there)?

Comment 26 by mka@chromium.org, Feb 14 2017

#25: The merge of the changes for  Issue 682578  to M57 has been requested.

Comment 27 by derat@chromium.org, Feb 14 2017

Labels: -Merge-Request-57
Mergedinto: 682578
Status: Duplicate (was: Started)
Ah, thanks. I'll just dupe this change into that one and drop the merge request for my workaround, then.
BTW, I'm getting this repeatedly on my Samus running Version 58.0.3027.0 dev - just wanted to make sure this isn't unexpected, because I *thought* the various firmware changes were already landed by 58.0.3011.0
#28: Does it look like shill taking a long time to start is still the root cause in what you're seeing? Do you see the same "shill_login_user[2040]: Failed to push ~chronos/shill profile" message and see shill take a long time to register itself in /var/log/power_manager/powerd.LATEST?
I also just hit this on samus running 58.0.3027.0. Nothing about "Failed to push" in that log file.
What exactly are you seeing now? Presumably the screen is failing to lock, but do you see the same:

  Failed to call method: org.chromium.SessionManagerInterface.LockScreen: object_path= /org/chromium/SessionManager: org.chromium.SessionManagerInterface.SessionDoesNotExist: Attempt to lock screen outside of user session.

error in Chrome's user log and:

  [WARNING:session_manager_impl.cc(616)] Attempt to lock screen outside of user session.

error in /var/log/messages? If so, what (if anything) does session_manager log earlier, when you're logging in?

Sign in to add a comment