
Issue 842332

Starred by 3 users

Issue metadata

Status: Started
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug-Regression




Unrecoverable error stopping recovery

Project Member Reported by khorimoto@chromium.org, May 11 2018

Issue description

Long-running tether scan test run by rjahagir (Ruchi) on Minnie v10655.0.0.

At first, everything works as expected. Eventually, we hit the following error:
[1388:1388:0508/195310.213870:ERROR:device_event_log_impl.cc(159)] [19:53:10.213] Bluetooth: bluetooth_adapter_bluez.cc:1641 /org/bluez/hci0: Failed to stop discovery: org.bluez.Error.InProgress: Operation already in progress
[1388:1388:0508/195310.214332:ERROR:ble_scanner_impl.cc(298)] Error stopping discovery session.

At this point, nothing else succeeds. Chrome logs are attached.

Sonny says that this looks like a regression caused by his code and is investigating.
 
chrome.zip
2.4 MB Download
It would be nice to have various kinds of logs attached. Sonny, thanks for investigating!
Owner: josephsih@chromium.org
Here is the log from Ruchi's repro: https://drive.google.com/open?id=1YdnYzfmOkcjVoT1PAcGIPkmw4xIjU4Qf

Joseph, I investigated the log and here are my findings:
* It's unlikely to be caused by the regression of retries in bluetoothd's Start/StopDiscovery (http://crosreview.com/1039024), since I don't find anything wrong with it in the log.
* It's also unlikely that this is a bad state in the firmware, because an experiment restarting bluetoothd returned the stack to a good state.
* It seems that there is a MGMT_OP_START_SERVICE_DISCOVERY command that never comes back from the kernel, so bluetoothd is still waiting for it to return. This can be seen in file "messages.3" at time "2018-05-11T15:38:47.871228-07:00"; there is no start_discovery_complete past that command (a scanning sketch for this pattern follows at the end of this comment).

Could you help take a look at the kernel part where the mgmt command never returns? I suspect it is related to the kind of skipped conditional commands that you've previously worked on for advertising. Thanks!
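
As a minimal sketch of that scan (the marker strings are assumptions taken from the log excerpts in this thread, not necessarily bluetoothd's exact output; this is not code from the BlueZ or Chromium trees), a small standalone C++ program can flag a MGMT_OP_START_SERVICE_DISCOVERY line that is never followed by a start_discovery_complete() line:

#include <fstream>
#include <iostream>
#include <string>

int main(int argc, char** argv) {
  if (argc != 2) {
    std::cerr << "usage: " << argv[0] << " <messages-file>\n";
    return 1;
  }
  std::ifstream in(argv[1]);
  std::string line;
  std::string pending;  // Last command line still awaiting completion.
  while (std::getline(in, line)) {
    if (line.find("MGMT_OP_START_SERVICE_DISCOVERY") != std::string::npos)
      pending = line;   // A new command is now outstanding.
    else if (line.find("start_discovery_complete") != std::string::npos)
      pending.clear();  // The kernel replied; the command completed.
  }
  if (!pending.empty())
    std::cout << "Command never completed:\n" << pending << '\n';
  return 0;
}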
This issue is interesting. At an initial look, it seems unrelated to skipped conditional commands, since there were no skip_conditional_cmd() messages after the last MGMT_OP_START_SERVICE_DISCOVERY message. In the btsnoop log, there was no "HCI Command: LE Set Scan Enable" after the last MGMT_OP_START_SERVICE_DISCOVERY message either. The MGMT_OP_START_SERVICE_DISCOVERY message just got lost somewhere.

I would need kernel debug logs for further investigation. I will see if I can reproduce this issue.
Cc: rjahagir@chromium.org
Hi Joseph,

Ruchi did another repro with kernel debug enabled; please take a look at the logs:
https://drive.google.com/file/d/1hbfmZvfijozcxafWPKocw7DzcbnsO411/view

Unfortunately there is no btmon log for that repro. Ruchi is going to try another repro with a btmon log; in the meantime, you can take a look at this one.
Hi Joseph, I was able to repro again, with a btmon log this time. PTAL!
https://drive.google.com/open?id=1DKj4jSye4nQJgjI4FqC6gXyYlKyi1uEF
Thanks, Ruchi, for reproducing the issue and providing the logs. Could you run "hcitool cmd 3f 05" to check the firmware version? Thanks!
>hcitool cmd 3f 05

< HCI Command: ogf 0x3f, ocf 0x0005, plen 0
> HCI Event: 0x0e plen 4
  01 05 FC 01
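For reference, the event payload above follows the standard HCI Command Complete layout: byte 0 is Num_HCI_Command_Packets, bytes 1-2 are the command opcode in little-endian order (here 0xFC05, i.e., ogf 0x3f / ocf 0x0005, echoing the command sent), and the remaining byte is a vendor-specific return parameter. A minimal decoding sketch (illustrative only):

#include <cstdint>
#include <cstdio>

int main() {
  // Payload of the Command Complete event above: "01 05 FC 01".
  const uint8_t payload[] = {0x01, 0x05, 0xFC, 0x01};
  const unsigned opcode = payload[1] | (payload[2] << 8);  // Little-endian.
  std::printf("num_packets=%u opcode=0x%04X ogf=0x%02X ocf=0x%04X return=0x%02X\n",
              payload[0] + 0u, opcode, opcode >> 10, opcode & 0x3FFu,
              payload[3] + 0u);
  return 0;
}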
Hi Ruchi, thanks! I just realized that Minnie uses a Broadcom BCM-43540 chip. I originally assumed that it used an Intel wireless chip.

I am looking into the logs that you provided. Thanks!
Thanks Joseph!
Summary: Minnie: Unrecoverable error in starting or stopping discovery (was: Minnie: Unrecoverable error stopping recovery)
Re C#5: The logs show a similar symptom, but not exactly the same one.

The symptom in C#5 was "Failed to start discovery" instead of "Failed to stop discovery" as in C#0.

Let me temporarily assume that both kinds of failures have the same root cause, and update the subject a bit to include both failures.
Cc: snanda@chromium.org dwmclary@chromium.org
Summary: Minnie: Unrecoverable error stopping recovery (was: Minnie: Unrecoverable error in starting or stopping discovery)
On second thought, I think the "Failed to start discovery" issue might not have exactly the same root cause as the "Failed to stop discovery" issue. Hence, I created a new issue 843943: Minnie: Unrecoverable error starting recovery.

Let's separate the two symptoms in two issues.
Hi Ruchi, could you help run the test again on Minnie and see if it is possible to reproduce the symptom described in C#0? Thanks!

(The logs in C#5 were considered a separate issue and are now tracked in issue 843943: Minnie: Unrecoverable error starting recovery.)
Thanks Joseph, I'll try to repro today and will update bugs with logs. 
Labels: -Type-Bug Type-Bug-Regression
Summary: Unrecoverable error stopping recovery (was: Minnie: Unrecoverable error stopping recovery)
We've inspected our metrics, and the issue is *not* specific to Minnie.

These metrics are Eve-only and indicate a similarly drastic drop in stability:
https://uma.googleplex.com/timeline_v2?sid=dc86135fb6dd4da1db6583bfa1a1008e

According to the metrics, the beta channel is now seeing only an ~8% success rate when stopping discovery sessions. Normally, this rate hovers around 96-99%.

M67 hits stable in 8 days, so should this bug be moved to P0 instead of P1?
Kyle, looking at that link, it seems that even stable has begun to drop off significantly in the last couple of days. Looking at the 1-day aggregate makes this clearer:

https://uma.googleplex.com/timeline_v2?sid=54e27afbe7dde1d595776973749d4269

Did a new stable push cause this drop-off?
Status: Started (was: Assigned)
Hi Kyle, thanks for the metrics! Yes, the issue seems to lie in the kernel and is generic to all platforms. I also filed a related issue 845107: Failed to stop discovery due to __hci_req_sync() err -110.

It seems to me that there are a few distinct symptoms that lead to failures to start/stop discovery.

I am currently working on it. There are two more days until the stable cut. It looks to me like there are some race conditions between the discovery operations and other operations. It will take time to investigate the root cause, fix it, and test. These issues are my top priority for now.
Thanks, Joseph. Do you see a reason why this would have caused a drop on the stable channel in the past week as well?
I think this issue (or these issues) has existed for quite a long time. We did not hit it as frequently before, but for some unknown reason, probably related to how operations/threads are scheduled to execute, the bug has become active now.
BTW, sometimes "Failed to stop discovery" may not be a real error. It might be a transient error caused by "Start discovery" still being in progress; trying "Stop discovery" again at a later timestamp may succeed.

This transient error is shown in the following logs (a retry sketch follows after them).

# The "Error stopping discovery" seen in ui.LATEST.
[1224:1224:0515/125440.714539:ERROR:device_event_log_impl.cc(156)] [12:54:40.714] Bluetooth: bluetooth_adapter_bluez.cc:1619 /org/bluez/hci0: Failed to stop discovery: org.bluez.Error.InProgress: Operation already in progress
[1224:1224:0515/125440.715476:ERROR:ble_scanner_impl.cc(279)] Error stopping discovery session.


### The corresponding logs observed in /var/log/messages.

# "Start discovery" by chrome
2018-05-15T12:54:40.163489+08:00 DEBUG bluetoothd[2613]: src/adapter.c:start_discovery() sender :1.7

# "Stop discovery" by chrome before "Start discovery" completed.
2018-05-15T12:54:40.711300+08:00 DEBUG bluetoothd[2613]: src/adapter.c:stop_discovery() sender :1.7

# "Stop discovery" failed since "Start discovery" was still in progress.
2018-05-15T12:54:40.711322+08:00 DEBUG bluetoothd[2613]: src/adapter.c:stop_discovery() error: start discovery in progress

# "Start discovery" completed at this time.
2018-05-15T12:54:40.712136+08:00 DEBUG bluetoothd[2613]: src/adapter.c:start_discovery_complete() status 0x00

# "Stop discovery" by chrome again.
2018-05-15T12:54:47.472306+08:00 DEBUG bluetoothd[2613]: src/adapter.c:stop_discovery() sender :1.7

# "Stop discovery" completed successfully this time.
2018-05-15T12:54:47.488875+08:00 DEBUG bluetoothd[2613]: src/adapter.c:stop_discovery_complete() status 0x00
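
As a minimal sketch of the retry behavior described above (this is not Chrome's actual implementation; BluezAdapter is a hypothetical stand-in for the real D-Bus proxy), a stop-discovery call can be retried when bluetoothd reports org.bluez.Error.InProgress:

#include <chrono>
#include <iostream>
#include <string>
#include <thread>

// Hypothetical stand-in for the real D-Bus proxy to the BlueZ adapter.
struct BluezAdapter {
  virtual ~BluezAdapter() = default;
  // Returns an empty string on success, or a D-Bus error name on failure.
  virtual std::string StopDiscovery() = 0;
};

bool StopDiscoveryWithRetry(BluezAdapter& adapter, int max_attempts) {
  for (int attempt = 1; attempt <= max_attempts; ++attempt) {
    const std::string error = adapter.StopDiscovery();
    if (error.empty())
      return true;  // Discovery stopped cleanly.
    if (error != "org.bluez.Error.InProgress")
      return false;  // A different failure; retrying will not help.
    // Transient case: "Start discovery" has not completed yet, so back off
    // briefly and try again, as the logs above show eventually succeeds.
    std::this_thread::sleep_for(std::chrono::milliseconds(200 * attempt));
  }
  return false;  // Still InProgress after all retries: the unrecoverable case.
}

// Fake adapter simulating the transient case from the logs above:
// InProgress on the first call, success on the second.
struct FakeAdapter : BluezAdapter {
  int calls = 0;
  std::string StopDiscovery() override {
    return ++calls < 2 ? "org.bluez.Error.InProgress" : "";
  }
};

int main() {
  FakeAdapter adapter;
  std::cout << (StopDiscoveryWithRetry(adapter, 3) ? "stopped\n" : "failed\n");
  return 0;
}

In the unrecoverable case from C#0, every retry keeps returning InProgress because the underlying start never completes in the kernel, which is why retries alone cannot recover.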


Naive question from the peanut gallery: Is there not a way for the 'start discovery' requester to know that 'stop discovery' is ongoing?
Re #20: Not from within Chrome. The logs Joseph is referencing are from bluetoothd (a layer lower).
Joseph: See #20 with respect to any particular layer. Is there a way to prevent the mismatched starting/stopping?
The recoverable error is okay/normal/expected; at this moment, magic tether is still doing retries, so these cases recover eventually. What we're interested in is debugging the cases that never recover.
I was able to repro this overnight with Eve (M67 - 10575.40.0 - active beta build) and Minnie (M68 - 10654.0.0). Logs below:

Minnie: https://drive.google.com/open?id=13oPKks6eb6YyKggEpFAKemy8gszWQi3Z

Eve: https://drive.google.com/open?id=1uu8ZPfMqoKx-HxXJAprTPITlpTFJeroI
Thank you, Ruchi, for providing the repro logs.
I looked at them and found the following:

1. The Eve bug (M67-10575.40.0) looks different from the kernel bug that we are seeing here. I moved it to bug 837656.
2. The Minnie bug (M68-10654.0.0) looks like the same kernel bug we're investigating here. I will investigate further and sync with Joseph.
After investigating more of the Eve log that Ruchi provided in #24, I am almost certain about the root cause of the dipping success rate of InstantTethering.BluetoothDiscoverySessionStopped (https://uma.googleplex.com/timeline_v2?sid=54e27afbe7dde1d595776973749d4269), even on stable.

So my theory is this:
1. There is an increase in bluetoothd crashes.
2. These crashes cause the discovery-session state in Chrome to go out of sync with bluetoothd (bug 837656).
3. The out-of-sync state causes the "active discovery session not found" error, which is supported by other metrics: https://uma.googleplex.com/p/chrome/timeline_v2?sid=80fd1b75610ded00e3c3c7c0de23cb6c

So, while Joseph is fighting the bluetoothd crashes, I am also working on a fix for the out-of-sync state at the Chrome level, to make our stack more resilient to bluetoothd crashes (a rough sketch of the idea follows).
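
As a rough sketch of that Chrome-level idea (all names here are hypothetical; this is not the actual patch), the discovery-session bookkeeping can be reset whenever bluetoothd's D-Bus name owner changes, i.e., when the daemon crashes and restarts:

#include <iostream>
#include <string>

// Hypothetical tracker for Chrome's view of active discovery sessions.
class DiscoverySessionTracker {
 public:
  void OnSessionStarted() { ++active_sessions_; }
  void OnSessionStopped() {
    if (active_sessions_ > 0)
      --active_sessions_;
  }

  // Hypothetical hook, called from a D-Bus NameOwnerChanged watcher on
  // org.bluez: an empty owner means bluetoothd died, a new owner means it
  // restarted. Either way, sessions Chrome holds are stale in the new
  // daemon instance, so drop them instead of erroring out later.
  void OnBluetoothdOwnerChanged(const std::string& new_owner) {
    std::cerr << "bluetoothd owner is now '" << new_owner << "'; dropping "
              << active_sessions_ << " stale session(s)\n";
    active_sessions_ = 0;
  }

  int active_sessions() const { return active_sessions_; }

 private:
  int active_sessions_ = 0;
};

int main() {
  DiscoverySessionTracker tracker;
  tracker.OnSessionStarted();                 // Chrome starts discovery.
  tracker.OnBluetoothdOwnerChanged(":1.99");  // bluetoothd crashed/restarted.
  // Without the reset, Chrome would still report an active session and later
  // hit "active discovery session not found" errors.
  return 0;
}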

Hi Sonny, thanks for looking into it. Did you find bluetoothd crashes in the logs attached in C#24? (I have not had time to look at them yet.) My patches for two bluetoothd crash issues landed in R68-10669.0.0. Since then, I have not found any bluetoothd crashes in my own tests on Eve.

While you are working on the out-of-sync issue at the Chrome level, I will keep working on this issue here. I have found a few distinct symptoms when "Failed to stop discovery" occurs.
Hi Joseph,

Please find the crash dumps from Ruchi's repro in #24 in this attachment. There were several bluetoothd crashes, and the last one (maybe coincidentally) caused the out-of-sync state (bluetoothd.20180521.210600.12051.core):
2018-05-21T20:02:33.631437-07:00 INFO crash_reporter[4916]: Leaving core file at /var/spool/crash/bluetoothd.20180521.200233.2352.core due to developer image
2018-05-21T20:11:49.686624-07:00 INFO crash_reporter[6020]: Leaving core file at /var/spool/crash/bluetoothd.20180521.201149.4928.core due to developer image
2018-05-21T20:16:02.018085-07:00 INFO crash_reporter[6468]: Leaving core file at /var/spool/crash/bluetoothd.20180521.201602.6031.core due to developer image
2018-05-21T20:39:19.794153-07:00 INFO crash_reporter[9265]: Leaving core file at /var/spool/crash/bluetoothd.20180521.203919.6478.core due to developer image
2018-05-21T20:41:41.694503-07:00 INFO crash_reporter[9598]: Leaving core file at /var/spool/crash/bluetoothd.20180521.204141.9276.core due to developer image
2018-05-21T20:42:58.689281-07:00 INFO crash_reporter[9761]: Leaving core file at /var/spool/crash/bluetoothd.20180521.204258.9610.core due to developer image
2018-05-21T20:49:01.728911-07:00 INFO crash_reporter[10470]: Leaving core file at /var/spool/crash/bluetoothd.20180521.204901.9772.core due to developer image
2018-05-21T21:02:00.673363-07:00 INFO crash_reporter[12040]: Leaving core file at /var/spool/crash/bluetoothd.20180521.210200.10481.core due to developer image
2018-05-21T21:06:00.710507-07:00 INFO crash_reporter[12496]: Leaving core file at /var/spool/crash/bluetoothd.20180521.210600.12051.core due to developer image

It's possible that the crashes were due to the same root cause, since they happened very close together. The symbols are available in GoldenEye for Eve version 10575.40.0 (https://console.developers.google.com/storage/chromeos-image-archive/eve-release/R67-10575.40.0/), in case you want to debug them.
crash.tar.gz
3.4 MB Download
Hi Sonny, thanks for checking it out! Looking it up in GoldenEye, the patches landed in R67-10575.45.0 and R67-10575.46.0. So the build being tested, i.e., R67-10575.40.0, does not have the patches yet, and hence the crashes were expected. Thanks.
Ah, that's good to hear; it's probably fixed by your patches in R67-10575.46.0 and later. Maybe Ruchi can help test with R67-10575.47.0 tomorrow.
Anyway, I will also try to merge the out-of-sync patch for Chrome to M67 so that it's resilient to other unexpected bluetoothd crashes.
