New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 803272 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
User never visited
Closed: Feb 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

kevin / bob: hang / battery drain instead of suspend if Bluetooth is in the right state

Project Member Reported by derat@chromium.org, Jan 18 2018

Issue description

Doug's been seeing a bunch of suspend issues on an M64 kevin device:

http://listnr/product/208/report/84825719182
http://listnr/product/208/report/84858080816
http://listnr/product/208/report/84911833450
http://listnr/product/208/report/84907483529
http://listnr/product/208/report/84941856834

(Some of those feedback reports may be tracking slightly different wifi suspend issues; Doug can probably confirm.)

In the initially-reported issue, powerd turns the display off etc. and executes powerd_suspend, but the kernel doesn't actually suspend the system:

...
2017-12-20T16:57:59.467487-08:00 INFO session_manager[1637]: [INFO:session_manager_impl.cc(861)] LockScreen() method called.
2017-12-20T16:57:59.578267-08:00 INFO session_manager[1637]: [INFO:session_manager_impl.cc(866)] HandleLockScreenShown() method called.
2017-12-20T16:58:19.237514-08:00 NOTICE powerd_suspend[18350]: Going to suspend-to-RAM state: args=--suspend_duration=-1 --nosuspend_to_idle --wakeup_count=24384
2017-12-20T16:58:21.439320-08:00 ERR kernel: [ 9113.043856] Bluetooth: hci0 command 0x0c01 tx timeout
2017-12-20T16:59:48.586310-08:00 INFO kernel: [ 9200.190716] mwifiex_pcie 0000:01:00.0: mlan0: already connected
2017-12-20T16:59:48.588299-08:00 INFO kernel: [ 9200.192997] mwifiex_pcie 0000:01:00.0: info: successfully disconnected from 00:1a:dd:00:00:17: reason code 3
2017-12-20T16:59:48.591297-08:00 INFO kernel: [ 9200.195756] mwifiex_pcie 0000:01:00.0: info: trying to associate to 'GBUS' bssid 00:1a:dd:00:00:18
2017-12-20T16:59:48.597446-08:00 INFO kernel: [ 9200.202067] mwifiex_pcie 0000:01:00.0: info: associated to bssid 00:1a:dd:00:00:18 successfully
2017-12-20T16:59:48.600515-08:00 ERR wpa_supplicant[841]: WMM AC: Missing IEs
2017-12-20T16:59:48.643475-08:00 INFO avahi-daemon[2155]: Withdrawing address record for <IPv4: 51>9 on mlan0.
2017-12-20T16:59:48.643683-08:00 INFO avahi-daemon[2155]: Leaving mDNS multicast group on interface mlan0.IPv4 with address <IPv4: 51>9.
2017-12-20T16:59:48.645076-08:00 INFO avahi-daemon[2155]: Interface mlan0.IPv4 no longer relevant for mDNS.
2017-12-20T16:59:48.719650-08:00 INFO avahi-daemon[2155]: Withdrawing address record for <IPv4: 18> on arcbr0.
2017-12-20T16:59:48.719751-08:00 INFO avahi-daemon[2155]: Registering new address record for <IPv4: 18> on arcbr0.IPv4.
2017-12-20T16:59:48.719993-08:00 INFO avahi-daemon[2155]: Changing host name to '804eb5b3a8bbc62d31fa4d5f5a3b0784'.
2017-12-20T16:59:48.732547-08:00 NOTICE cros-machine-id-regen[18521]: Regenerated /var/lib/dbus/machine-id (reason: network).
2017-12-20T16:59:49.647099-08:00 INFO avahi-daemon[2155]: Server startup complete. Host name is 804eb5b3a8bbc62d31fa4d5f5a3b0784.local. Local service cookie is 2031535982.
2017-12-20T16:59:49.657208-08:00 ERR chrome[1693]: [1693:1693:1220/165949.657070:INFO:remote_commands_invalidator.cc(45)] Shutdown RemoteCommandsInvalidator.
2017-12-20T16:59:49.657348-08:00 ERR chrome[1693]: [1693:1693:1220/165949.657303:INFO:remote_commands_invalidator.cc(68)] Stopping RemoteCommandsInvalidator.
2017-12-20T16:59:49.657401-08:00 ERR chrome[1693]: [1693:1693:1220/165949.657371:INFO:remote_commands_invalidator.cc(167)] Unregister RemoteCommandsInvalidator.
2017-12-20T16:59:55.448165-08:00 INFO avahi-daemon[2155]: Joining mDNS multicast group on interface mlan0.IPv4 with address <IPv4: 55>2.
2017-12-20T16:59:55.448516-08:00 INFO avahi-daemon[2155]: New relevant interface mlan0.IPv4 for mDNS.
2017-12-20T16:59:55.448813-08:00 INFO avahi-daemon[2155]: Registering new address record for <IPv4: 55>2 on mlan0.IPv4.
2017-12-20T17:00:19.556369-08:00 INFO kernel: [ 9231.161427] mwifiex_pcie 0000:01:00.0: info: successfully disconnected from 00:1a:dd:00:00:18: reason code 0

As a result, the system stays awake and runs its battery down.

Here's what I wrote in email:

----

powerd_suspend should be logging "Cancel suspend at kernel" or "Finalizing suspend" soon after the "Going to suspend-to-RAM state" message. I suspect there's no way to get a process list at the time of the issue (since you needed to reboot to turn the screen back on), but I'd guess that one of the following commands was hanging:

- metrics_client
- cp /sys/class/rtc/rtc0/since_epoch
- power_supply_info
- writing to /sys/module/printk/parameters/console_suspend or /sys/power/pm_print_times
- writing to /sys/power/wakeup_count
 

Comment 1 by derat@chromium.org, Jan 18 2018

I haven't been able to repro this yet on a ToT kevin dev build (10317.0.0). I added some logging to powerd_suspend but everything's working as expected:

2018-01-17T17:13:42.256115-08:00 NOTICE powerd_suspend[5101]: Going to suspend-to-RAM state: args=--suspend_duration=-1 --nosuspend_to_idle --wakeup_count=29
2018-01-17T17:13:42.330536-08:00 NOTICE powerd_suspend[5143]: XXX running metrics_client
2018-01-17T17:13:42.332849-08:00 NOTICE powerd_suspend[5145]: XXX copying since_epoch
2018-01-17T17:13:42.343854-08:00 NOTICE powerd_suspend[5148]: XXX running power_supply_info
2018-01-17T17:13:42.426428-08:00 NOTICE powerd_suspend[5152]: XXX doing hardware hacks
2018-01-17T17:13:42.436489-08:00 NOTICE powerd_suspend[5159]: XXX checking for wakeup event
2018-01-17T17:13:42.439104-08:00 NOTICE powerd_suspend[5160]: XXX disabling bluetooth mode change events
2018-01-17T17:13:42.510780-08:00 NOTICE powerd_suspend[5163]: Finalizing suspend
2018-01-17T17:13:42.513884-08:00 NOTICE powerd_suspend[5164]: XXX writing state
2018-01-17T17:13:42.521888-08:00 INFO kernel: [   82.258086] PM: suspend entry 2018-01-18 01:13:42.521018629 UTC
2018-01-17T17:13:42.568885-08:00 INFO kernel: [   82.258115] PM: Syncing filesystems ... done.
2018-01-17T17:13:42.568957-08:00 DEBUG kernel: [   82.305449] PM: Preparing system for sleep (mem)
2018-01-17T17:13:49.027032-08:00 INFO kernel: [   82.306224] Freezing user space processes ... (elapsed 0.002 seconds) done.
2018-01-17T17:13:49.027114-08:00 INFO kernel: [   82.309251] Freezing remaining freezable tasks ... (elapsed 0.002 seconds) done.
2018-01-17T17:13:49.027134-08:00 DEBUG kernel: [   82.311421] PM: Suspending system (mem)
...

I'm also not seeing the same Chrome suspend delay timeouts that Doug (and Todd?) saw, but I don't have Smart Lock enabled (the "set up" button doesn't do anything when I click it, maybe because I have an unofficial build).
Of all the feedback reports I've sent recently, only a small handful of them have had WiFi issues (there's an issue that if WiFi crashes once while the system was running then it will cause weird hangs at suspend time).  So I would agree that we're wedging on one of the steps derat@ points at.

I haven't really tried to reproduce this on anything other than my dogfood machine yet.

My best guess is that this started somewhere around my update to M64.

Comment 3 by derat@chromium.org, Jan 18 2018

I'm definitely interested in hearing about it if you find a way to trigger this on a dev device. I just did ~20 suspend/resume cycles on my ToT dev kevin without seeing any slow suspends (I think; I tested by running "ping -O <ip>" on my workstation and verifying that I always stop getting replies soon after closing the lid).

I uploaded my dumb powerd_suspend debug logging change at https://crrev.com/c/872092 in case you want to patch it in.
I'm in an all day meeting today, but I just tried on my kevin and when I
close the lid I can count slowly to 19 before the blue light on the side
turns off (indicating that it actually suspends).  In that case it worked,
though.  Definitely something is wrong for me.  I have meetings most of
tomorrow too...

Comment 5 by derat@chromium.org, Jan 18 2018

That 20-second delay is almost certainly some process (read: chrome) that's registered a suspend delay with powerd failing to acknowledge that it's ready to suspend. Here's where that timeout lives: https://chromium.googlesource.com/chromiumos/platform2/+/master/power_manager/powerd/policy/suspend_delay_controller.cc#28

You can look at file:///var/log/power_manager/powerd.LATEST to see which client is responsible (but it's probably Chrome). I don't know if there's an easy way to figure out what within Chrome is responsible, but file:///home/chronos/user/log/chrome might have some hints.

But.

All of this happens before powerd starts powerd_suspend. It's probably not responsible for the hang that you were seeing within powerd_suspend.
The 20 second delay looks like a serious and easy to reproduce problem.  See  bug #803727 
BTW: I tried switching to a different kevin and the problem seemed to go away.  I've still seen a few s2r issues on that kevin but I _think_ those are all the WiFi/s2r problem.  I just searched through my old messages on that device and I don't see any of the pattern of:

  Going to suspend-to-RAM state: args=--suspend_duration=-1 --nosuspend_to_idle --wakeup_count=N

...without a "Finalizing suspend" or "Cancel suspend at kernel".  ...so I'm going to say that device, for whatever odd reason, is fine.

---

Somewhat concerning is that I found a feedback report today from someone that's _not me_:

  https://listnr.corp.google.com/product/208/report/85019798622

  Close lid to put system to sleep, it does not wake up when the lid is reopened. 
  Must hard boot (long press of Power button) to bring system back to life.

Looking at that report, I see:

  2018-02-03T08:43:39.269931-08:00 NOTICE powerd_suspend[18538]: Going to suspend-to-RAM state: args=--suspend_duration=-1 --nosuspend_to_idle --wakeup_count=2793
  2018-02-03T08:43:41.462011-08:00 ERR kernel: [ 9667.964437] Bluetooth: hci_core.c:hci_cmd_timeout() hci0 command 0x0c01 tx timeout

...followed by the kernel continuing to run for hours afterwards.  

Weird that there's the same "Bluetooth: hci_core.c:hci_cmd_timeout" soon after the message from powerd_suspend that I saw in my reports.  Coincidence?

I also notice that while the kernel keeps running it appears to be constantly trying to send Chrome crash reports...

2018-02-03T09:46:32.629311-08:00 NOTICE crash_sender[19903]: Current send rate: 32sends/24hrs
2018-02-03T09:46:32.633735-08:00 NOTICE crash_sender[19904]: Cannot send more crashes:
2018-02-03T09:46:32.636546-08:00 NOTICE crash_sender[19905]:   current 32send/24hrs >=  max 32send/24hrs

...so something is unhappy!

---

I've now taken my first Chromebook and put it in dev mode to see if I can do some stress testing on it.  So far no luck, but I'll let you know.  Maybe I need to do more with Bluetooth on it?
Cc: mcchou@chromium.org
+Miao for any comment on "Bluetooth: hci_core.c:hci_cmd_timeout" as a precursor to failing to properly enter suspend and blocking powerd from eventually forcing shutdown.
Closing the loop of whether this is somehow tied to hardware: it doesn't seem to be.  I ran over 1200 s2r tests on the kevin that used to reproduce this and I couldn't get a single failure.  Something about the powerwash that happened during the switch to dev mode fixed it.

I guess it either has something to do with apps / extensions installed on the device or the WiFi/BT settings that were there...
Hi guys,

i can't access the logs.

if it hang at:

- metrics_client
- cp /sys/class/rtc/rtc0/since_epoch
- power_supply_info
- writing to /sys/module/printk/parameters/console_suspend or /sys/power/pm_print_times
- writing to /sys/power/wakeup_count

as #1 mentioned, i think:

metrics_client -e Power.SuspendAttempt 0 2 & <--- run in background, should not block

# Log the time before going to suspend (no-op if no RTC).
cp /sys/class/rtc/rtc0/since_epoch "${SUSPEND_TO_RAM_TIME_FILE}" \
  2> /dev/null || true  <---- we can check the time stamp and content of 
 "${ROOT_SPOOL_DIR}"/suspend-to-ram-time


# Remove last_resume_timings to ensure the file is fresh on resume.
rm -f "${LAST_RESUME_TIMINGS_FILE}" <--- we can check the present of "${ROOT_RUN_DIR}"/last_resume_timings

# Store the current power status.
power_supply_info 2> /dev/null \
  | grep -Eq '^[[:space:]]+online:[[:space:]]+no$'
if [ $? -eq 0 ]; then
  echo OnBattery > "${POWER_STATUS_ON_SUSPEND_FILE}"
else
  echo OnAC > "${POWER_STATUS_ON_SUSPEND_FILE}"
fi <---- we can check the time stamp of "${ROOT_RUN_DIR}"/power-status-on-suspend


enable_debug <--- unlikely to hung here



  # Disable Bluetooth mode change events.
  bt_hci_mode_change_event_ctrl "disable" <--- maybe this? since the hci_cmd_timeout

  log_msg "Finalizing suspend"






# Enables or disables Bluetooth HCI mode change events based on an
# "enable" or "disable" argument. Some Bluetooth devices enter Sniff
# Mode to conserve power shortly after activity stops. This results
# in a mode change event being sent. Disable these events before
# suspending to prevent them from waking the system.
bt_hci_mode_change_event_ctrl() {
  local file
  for file in /sys/class/bluetooth/hci[0-9]; do
    local hcidev="$(basename "${file}")"
    if [ "$1" = "disable" ]; then
      echo "Disabling Bluetooth HCI mode change event before suspend"
      hcitool -i "${hcidev}" cmd 03 01 ff ff f7 ff ff ff ff ff
    elif [ "$1" = "enable" ]; then
      echo "Enabling Bluetooth HCI mode change event after resume"
      hcitool -i "${hcidev}" cmd 03 01 ff ff ff ff ff ff ff ff
    fi
  done
}


2017-12-20T16:58:21.439320-08:00 ERR kernel: [ 9113.043856] Bluetooth: hci0 command 0x0c01 tx timeout


maybe we can add some logs in hci driver, to see if it try to send 0x0c01 cmd at that time, and maybe even fake a tx timeout there
so it's added recently in:
commit 71d2b3030e92f744611457fafb16e03af7be515a
Author: Sukumar Ghorai <sukumar.ghorai@intel.com>
Date:   Thu Oct 5 20:56:22 2017 -0700

    Bluetooth: disable HCI Mode change event in platform suspend

      HCI Mode Change Event is used by BT controller to notify host that
      connected Device (e.g. HID keyboard) is enter or exit low power mode.
      This event cause the platform to wake-up from S3/S0ix. It suggested
      to keep disable the notification when platform in sleep state, and
      unmask the event in resume path when platform coming back to active
      state. This is required to save the platform power.

    BUG=b:37256363
    TEST=suspend/resume test (powerd_dbus_suspend) with HID device connected

    Change-Id: Ia01c80d84de37cd6a8869596bc2cc284b3a92495
    Signed-off-by: Sukumar Ghorai <sukumar.ghorai@intel.com>
    Tested-by: Amit K Bag <amit.k.bag@intel.com>
    Reviewed-on: https://chromium-review.googlesource.com/704458


and the cmd we send here is exactly 0x0c01:

localhost / # hcitool -i hci0 cmd 03 01 ff ff f7 ff ff ff ff ff
< HCI Command: ogf 0x03, ocf 0x0001, plen 8
  FF FF F7 FF FF FF FF FF
> HCI Event: 0x0e plen 4
  01 01 0C 00

#define hci_opcode_pack(ogf, ocf)   ((__u16) ((ocf & 0x03ff)|(ogf << 10)))

>>> hex(1 | 3 << 10)
'0xc01'


so somehow our bt is not happy with that, and cause tx timeout.

maybe we can:
1/ only do that for some special boards? (does btusb cause unexpected wake when sniff mode changed?)
or
2/ is it safe to ignore the timeout? maybe use timeout tool to make sure it would not block the suspend?
# timeout --help
Usage: timeout [OPTION] DURATION COMMAND [ARG]...


or
3/ ask mrvl team to check why tx timeout at that time?
so it's added recently in:
commit 71d2b3030e92f744611457fafb16e03af7be515a
Author: Sukumar Ghorai <sukumar.ghorai@intel.com>
Date:   Thu Oct 5 20:56:22 2017 -0700

    Bluetooth: disable HCI Mode change event in platform suspend

      HCI Mode Change Event is used by BT controller to notify host that
      connected Device (e.g. HID keyboard) is enter or exit low power mode.
      This event cause the platform to wake-up from S3/S0ix. It suggested
      to keep disable the notification when platform in sleep state, and
      unmask the event in resume path when platform coming back to active
      state. This is required to save the platform power.

    BUG=b:37256363
    TEST=suspend/resume test (powerd_dbus_suspend) with HID device connected

    Change-Id: Ia01c80d84de37cd6a8869596bc2cc284b3a92495
    Signed-off-by: Sukumar Ghorai <sukumar.ghorai@intel.com>
    Tested-by: Amit K Bag <amit.k.bag@intel.com>
    Reviewed-on: https://chromium-review.googlesource.com/704458


and the cmd we send here is exactly 0x0c01:

localhost / # hcitool -i hci0 cmd 03 01 ff ff f7 ff ff ff ff ff
< HCI Command: ogf 0x03, ocf 0x0001, plen 8
  FF FF F7 FF FF FF FF FF
> HCI Event: 0x0e plen 4
  01 01 0C 00

#define hci_opcode_pack(ogf, ocf)   ((__u16) ((ocf & 0x03ff)|(ogf << 10)))

>>> hex(1 | 3 << 10)
'0xc01'


so somehow our bt is not happy with that, and cause tx timeout.

maybe we can:
1/ only do that for some special boards? (does btusb cause unexpected wake when sniff mode changed?)
or
2/ is it safe to ignore the timeout? maybe use timeout tool to make sure it would not block the suspend?
# timeout --help
Usage: timeout [OPTION] DURATION COMMAND [ARG]...


or
3/ ask mrvl team to check why tx timeout at that time?
Cc: jeffy.c...@rockchip.corp-partner.google.com
Owner: rajatja@google.com
Jeffy: great digging.  It seems like quite a plausible explanation.  Rajat: would you be the person to help digging?
...I'd also note that I can no longer personally reproduce this since I power washed my machine.  :(  ...so our best hope is to use what Jeffy found to try to come up with a reproduction.
I see Jeffy posted <https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/880543> which maybe is related?
Cc: sukumar....@intel.com
The HCI command was added because we had found spurious wakes on multiple platforms. Intel bluetooth team told us that this was a generic problem:

https://b.corp.google.com/issues/37256363#comment53

"This is generic problem and applicable to all BT controller. And this is BT-SIG standard hci command (not vendor command)."

We should ask Marvell team about why they don't support this command, We may see some spurious wakes depending on the devices that are connected.

Meanwhile, I can add a timeout to this command as Jeffy suggested.
That right.
Mode change HCI event is an indication that the Device connection enter to sniff mode , this cause the Host to wakeup from S3.
Also Device re-negotiate the sniff interval for better power saving when idle for longer time (after initial sniff). Ie. Device can randomly exit from Sniff mode, change sniff parameter and enter to sniff mode. and this cause the Mode change HCI event to Host, hence host to wakeup from S3 and  randomly.
The Mode Change event is used to indicate when the device associated with the Connection_Handle changes between Active mode, Hold mode, and Sniff mode. 
BR/EDR Controller shall send the Mode Change event when the Sniff Mode has started and the Mode Change event shall occur again when exit from Sniff Mode.
 
Refer . BT Specification 5.0 
(page 901) : 7.3.1 Set Event Mask Command  and (page 1130) 7.7.20 Mode Change Event

After debugging, we found that this issue using multiple HID device form different vendor, its recommend to disable this Mode Change event during S3. Bluetooth low power mode is internal between Master and Device and disable this Mode Change event will not impact any functionality.

hmm, so that should be a bug that the mrvl bt can't do event mask sometimes, and we may need to ask them to fix it.

but firstly we may need to repro it again...maybe we can check the log to see is there bt/wifi logs may help?
there might be some pre-condition before sending this command for mrvl-bt, such as supported only one at boot time?
i kept my kevin running this all night' it still running now(10 hours)

anyway i think we may still need to add a timeout here, and ignore it or abprt suspend when it failed
Mode change HCI event mask, we need only when we connected to HID/Headset BT Devices.
Owner: ----
I have uploaded a CL (https://chromium-review.googlesource.com/c/chromiumos/platform2/+/908249) to add the timeout, based on the theory that powerd was waiting on hcitool which was blocked on the command response from kernel, and the command was timing out and repeatedly retried in the kernel, by looking at code.

Note that some one with more cycles should either (1) engage with Marvell to understand why they don't support this command and / or (2) based on a vendor id / device id, do not issue the command to Marvell bluetooth devices. Of course this is only if some one can try to reproduce the issue and look deeper to see if this timeout will not help? 
Hi guys,

i tried to fake it by dropping the cmd pkg(so we would stuck at waiting for the reply as what we saw in this issue):

+++ b/net/bluetooth/hci_sock.c
@@ -1758,6 +1758,9 @@ static int hci_sock_sendmsg(struct socket *sock, struct msghdr *msg,
                u16 opcode = get_unaligned_le16(skb->data);
                u16 ogf = hci_opcode_ogf(opcode);
                u16 ocf = hci_opcode_ocf(opcode);
+               printk("jeffy, %s, %d, opcode:%x\n", __func__, __LINE__, opcode);
+               if (opcode == 0xc01)
+                       goto drop;



and #22's CL works.
Cc: bhthompson@chromium.org kbleicher@chromium.org
Labels: ReleaseBlock-Stable Proj-kevin Proj-bob
Owner: rajatja@chromium.org
Status: Assigned (was: Untriaged)
Summary: kevin / bob: hang / battery drain instead of suspend if Bluetooth is in the right state (was: kevin doesn't suspend when requested to do so)
Rajat: I'm going to assign this to you since (as per Jeffy's great digging) it's pretty clear that the Bluetooth change that landed in M-64 is responsible.

I'm going to set this to M-64 RBS.  It's probably too late to fix this in M-64, but IMHO this should prevent the rollout of M-64 to Kevin and Bob (which almost certainly has the same issue).  This was a really serious issue until I powerwashed.  As a user, it caused me frustration every day as I continually opened my Chromebook to find a black screen and a drained matter.  Since I found a feedback report that shows that this wasn't hitting just me then it seems clear that this will hit a bunch of users once M-65 rolls out.  Sticking on M-63 seems better than causing this frustration even on 1% of kevin/bob users, but TPMs can decide for sure.

IMHO landing a mitigation (like the timeout) on M65 is OK.  We can always file a separate bug for figuring out what's wrong with Marvell BT.  Hopefully the BT team can own this.
We haven't pushed stable for bob and kevin yet since they're ARC++ enabled.  We were hoping to push ARC++ enabled boards soon, however.   It'll be easy to omit bob and kevin. 

We can't omit from M-64 indefinitely, however.  We'll need a fix for a M64 merge.  We'll have additional stable releases to pull the merge in, assuming the merge is low-risk and fully tested.

Are we sure this is limited to these two boards?

Please call these out as blockers as soon as possible.  We can exclude the boards from the release, but we nee dot make sure this item remains escalated so we can re-add in the next RC.
> Are we sure this is limited to these two boards?

I suppose any other boards that drop this Bluetooth packet would be in a similar state?  I don't personally know of any, but I guess it's plausible that they exist.
Do we need to add anyone to the bug re: #26?
@27: Well, Todd and Dan Erat are already on the bug and they would be the people most likely to get pinged about s2r problems if they started getting reported, so that's a start.

The question is whether we want to be proactive here and search out boards where this might be happening.  If we aren't somehow proactive then basically what will happen is that we'll roll this out to stable and then start getting feedback reports about this.  Then we'll know, but it will be a bit late.  :(

Unfortunately it's going to be _very_ hard to be proactive here.  This problem happens only when you get your BT in a particular state.  Once your in this state then your device will start _sometimes_ failing to suspend.  I saw it happen ~once a day but couldn't reproduce it on demand.  Presumably if other devices reproduce this then it will be similar for them, though presumably the magic state needed to reproduce this could be different for each device.

The only way we found this as early as we did was that I happened to be running on kevin and I was hitting it and I cared to dig a little (and others were nice enough to help dig).  If I hadn't happened to be running on Kevin or hadn't cared to dig, all we'd have would be a few feedback reports from the beta channel.  That doesn't mean that people on the beta channel aren't hitting this--they just might be hitting it but those reports might be lost in the noise.  Someone could go through feedback reports looking for an uptick of s2r problems in a particular platform, but it wasn't a high enough incidence of problems that it caused a massive spike in reports or something.  Imagine if 1% of beta channel users are hitting this and if 1/3 of those users actually bother reporting feedback.  That's not too many reports.  ...but if 1% of people are hitting this it still could be something we want to add a blocker for.  Having 1/100 of our users frustrated with their device for 6 weeks doesn't sound fun.

Can we revert per #24 to unblock M64?
@29: I think the revert would be <https://chromium-review.googlesource.com/c/chromiumos/platform2/+/704458>

...we also believe this fix <https://chromium-review.googlesource.com/c/chromiumos/platform2/+/908249> would work it just needs to be spun to fix derat's nits.  Looks like Rajat is actually OOTO today, so presumably he'll do that on Monday.

Comment 31 by rajatja@google.com, Feb 10 2018

Hi Folks, Sorry, I was out sick on Friday and half of thursday. I'll take a look at this as soon as our systems are unpacked on Monday.
Thanks Rajat; hope you're feeling better.  We need a plan *today* since this is blocking stable.  Blocking /non-blocking, boards impacted, etc.    Per #28 this may be edge and difficult to predict a fix, timing, so that could imply a punt to M65.

Comment 33 by rajatja@google.com, Feb 12 2018

I think my immediate plan is to fix the powerd_suspend script to:

* Put a timeout (2 seconds) on hcitool (and log the status)
* Issue the command only for Intel BT controllers

Please let me know if some one has concerns. I'll get a kevin from dianders to test this.


Agreed the current fixes will mitigate?  Should we also consider timeout on call to 'powerd_suspend' as a catch all for the next time this happens?

Comment 35 by rajatja@google.com, Feb 12 2018

But powerd_suspend *is* the script that would take us to suspend. Is there a better behavior that the system could follow, if that script takes "too long"? Shut down? May not be a change we want to put in R64/R65 though.
Thats the behavior that would occur if we wrapped the script in a timeout.  powerd after seeing X number of failures to suspend (run powerd_dbus_suspend -> powerd_suspend) would instead shutdown the machine.

Comment 37 by derat@chromium.org, Feb 12 2018

Status: Started (was: Assigned)
See http://doc/1nBvSqi12HaHizqO4S6d9Z9c7VNEpzKnu6EeSMbKVoPA for thoughts about detecting cases where powerd_suspend misbehaves. Feel free to add additional thoughts there.

The plan in #33 sounds good to me.

Comment 39 by rajatja@google.com, Feb 12 2018

Filed separate b/73254577 for Marvell BT not responding to the hcitool command.

Comment 40 by rajatja@google.com, Feb 13 2018

I have it cherry-picked to R65. Do we need it cherry-picked to R64 also? I'm waiting for master CL to be submitted before requesting merge to R65.

https://chromium-review.googlesource.com/c/chromiumos/platform2/+/908249 
https://chromium-review.googlesource.com/c/chromiumos/platform2/+/915161
Labels: M-65
Hi Rajat,

Yes, you add the Merge request label, below, if you'd like a merge to be considered.  The TPMs will review and if approved, you're clear to merge.

Did Bernie sign off on M65?

I'll have to take a look at the CL, but yes, you should tag M64 as well since this is a stable blocker

Comment 42 by rajatja@google.com, Feb 13 2018

Sorry, incorrect use of words:

"I have it cherry-picked to R65"
 => I actually meant "I have a CL for cherry-picking it to R65".

I now also have a CL for R64:
https://chromium-review.googlesource.com/c/chromiumos/platform2/+/915341

Both Cls are waiting for the master one to be submitted, before I add "Merge-Requested". 
Do we have the CLs ready to make the merge requests?  Reminder that this is a stable blocker for M64
If you're ready for merge consideration can you (or me) add the merge request labels?

Comment 46 by rajatja@google.com, Feb 13 2018

Labels: Merge-Request-65 Merge-Request-64
Project Member

Comment 47 by sheriffbot@chromium.org, Feb 13 2018

Labels: -Merge-Request-65 Merge-Review-65 Hotlist-Merge-Review
This bug requires manual review: M65 has already been promoted to the beta branch, so this requires manual review
Please contact the milestone owner if you have questions.
Owners: cmasso@(Android), cmasso@(iOS), bhthompson@(ChromeOS), govind@(Desktop)

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot

Comment 48 by derat@chromium.org, Feb 13 2018

#45: The change to trunk is currently in the CQ (https://luci-milo.appspot.com/buildbot/chromeos/master-paladin/17744). Anyone's guess as to whether the run will pass, though.
Labels: -Merge-Request-64 Merge-Approved-64
approving merge for M64

Comment 50 by rajatja@google.com, Feb 13 2018

OK, I'll submit it to M64 now
Project Member

Comment 51 by bugdroid1@chromium.org, Feb 13 2018

Labels: merge-merged-release-R64-10176.B
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/fd648a8c90a8397db6b7d48c4088ea745e23db1a

commit fd648a8c90a8397db6b7d48c4088ea745e23db1a
Author: Rajat Jain <rajatja@google.com>
Date: Tue Feb 13 22:24:26 2018

power: powerd_suspend: Issue hcitool command only on Intel BT platforms

Apparently, the Marvell bluetooth controller *sometimes* fails to
respond to the command used to enable / disable mode change HCI event.
The issue is rarely reproduced at random. So:

* Issue the command only on platforms that have Intel BT controllers.
* Timeout hcitool command at 2 seconds in order to ensure that a
  bad controller cannot take forever to respond. The timeout of 2 secs
  is taken from HCI_CMD_TIMEOUT in kernel.

BUG= chromium:803272 
TEST=Suspend / resume on Kevin does not issue the hcitool command.
     Suspend resume on a Soraka issues the hcitool command.
     Faking a hang (using sleep command instead of hcitool) kills the
     command and logs a message)

Change-Id: I399f2ca794e135b703cbc0d46d15039649e30ab4
Reviewed-on: https://chromium-review.googlesource.com/915341
Reviewed-by: Dan Erat <derat@chromium.org>
Tested-by: Rajat Jain <rajatja@chromium.org>
Commit-Queue: Rajat Jain <rajatja@chromium.org>
Trybot-Ready: Rajat Jain <rajatja@chromium.org>

[modify] https://crrev.com/fd648a8c90a8397db6b7d48c4088ea745e23db1a/power_manager/powerd/powerd_suspend

Labels: -Hotlist-Merge-Review -Merge-Review-65 Merge-Approved-65
Project Member

Comment 53 by bugdroid1@chromium.org, Feb 14 2018

Labels: merge-merged-release-R65-10323.B
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/2ed8604fd5e10e54b8de314abf05195f0fdd29b9

commit 2ed8604fd5e10e54b8de314abf05195f0fdd29b9
Author: Rajat Jain <rajatja@google.com>
Date: Wed Feb 14 00:57:54 2018

power: powerd_suspend: Issue hcitool command only on Intel BT platforms

Apparently, the Marvell bluetooth controller *sometimes* fails to
respond to the command used to enable / disable mode change HCI event.
The issue is rarely reproduced at random. So:

* Issue the command only on platforms that have Intel BT controllers.
* Timeout hcitool command at 2 seconds in order to ensure that a
  bad controller cannot take forever to respond. The timeout of 2 secs
  is taken from HCI_CMD_TIMEOUT in kernel.

BUG= chromium:803272 
TEST=Suspend / resume on Kevin does not issue the hcitool command.
     Suspend resume on a Soraka issues the hcitool command.
     Faking a hang (using sleep command instead of hcitool) kills the
     command and logs a message)

Change-Id: I399f2ca794e135b703cbc0d46d15039649e30ab4
Reviewed-on: https://chromium-review.googlesource.com/915161
Reviewed-by: Dan Erat <derat@chromium.org>
Tested-by: Rajat Jain <rajatja@chromium.org>
Commit-Queue: Rajat Jain <rajatja@chromium.org>
Trybot-Ready: Rajat Jain <rajatja@chromium.org>

[modify] https://crrev.com/2ed8604fd5e10e54b8de314abf05195f0fdd29b9/power_manager/powerd/powerd_suspend

Project Member

Comment 54 by bugdroid1@chromium.org, Feb 14 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/cf7fa56a802b8f71e201f4db8f2675a134c62e39

commit cf7fa56a802b8f71e201f4db8f2675a134c62e39
Author: Rajat Jain <rajatja@google.com>
Date: Wed Feb 14 05:16:39 2018

power: powerd_suspend: Issue hcitool command only on Intel BT platforms

Apparently, the Marvell bluetooth controller *sometimes* fails to
respond to the command used to enable / disable mode change HCI event.
The issue is rarely reproduced at random. So:

* Issue the command only on platforms that have Intel BT controllers.
* Timeout hcitool command at 2 seconds in order to ensure that a
  bad controller cannot take forever to respond. The timeout of 2 secs
  is taken from HCI_CMD_TIMEOUT in kernel.

BUG= chromium:803272 
TEST=Suspend / resume on Kevin does not issue the hcitool command.
     Suspend resume on a Soraka issues the hcitool command.
     Faking a hang (using sleep command instead of hcitool) kills the
     command and logs a message)

Change-Id: I399f2ca794e135b703cbc0d46d15039649e30ab4
Reviewed-on: https://chromium-review.googlesource.com/908249
Commit-Ready: Rajat Jain <rajatja@chromium.org>
Tested-by: Rajat Jain <rajatja@chromium.org>
Reviewed-by: Dan Erat <derat@chromium.org>

[modify] https://crrev.com/cf7fa56a802b8f71e201f4db8f2675a134c62e39/power_manager/powerd/powerd_suspend

Comment 55 by rajatja@google.com, Feb 14 2018

Status: Fixed (was: Started)
Project Member

Comment 56 by sheriffbot@chromium.org, Feb 19 2018

Cc: rajatja@google.com bhthompson@google.com
This issue has been approved for a merge. Please merge the fix to any appropriate branches as soon as possible!

If all merges have been completed, please remove any remaining Merge-Approved labels from this issue.

Thanks for your time! To disable nags, add the Disable-Nags label.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot

Comment 57 by rajatja@google.com, Feb 20 2018

Labels: -Merge-Approved-64 -Merge-Approved-65

Sign in to add a comment