Unrecoverable error stopping recovery
Issue description
Long-running tether scan test run by rjahagir (Ruchi) on Minnie v10655.0.0. At first, everything works as expected. Eventually, we hit the following error:

[1388:1388:0508/195310.213870:ERROR:device_event_log_impl.cc(159)] [19:53:10.213] Bluetooth: bluetooth_adapter_bluez.cc:1641 /org/bluez/hci0: Failed to stop discovery: org.bluez.Error.InProgress: Operation already in progress
[1388:1388:0508/195310.214332:ERROR:ble_scanner_impl.cc(298)] Error stopping discovery session.

At this point, nothing else succeeds. Chrome logs are attached. Sonny says that this looks like a regression caused by his code and is investigating.
May 14 2018
Here is the log from Ruchi's repro: https://drive.google.com/open?id=1YdnYzfmOkcjVoT1PAcGIPkmw4xIjU4Qf

Joseph, I investigated the log and here are my findings:
* It's unlikely to be caused by the regression of retries in bluetoothd's Start/StopDiscovery (http://crosreview.com/1039024), since I don't find anything wrong with it in the log.
* It's also unlikely that this is a bad state in firmware, because an experiment restarting bluetoothd returned the stack to a good state.
* It seems that there is a MGMT_OP_START_SERVICE_DISCOVERY that never comes back from the kernel, so bluetoothd is still waiting for it to return. This is seen in file "messages.3" at time "2018-05-11T15:38:47.871228-07:00". There is no start_discovery_complete after that command.

Could you help take a look at the kernel side, where the mgmt command never returns? I suspect it is related to the kind of skipped conditional command that you previously worked on for advertising. Thanks!
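To make the failure mode concrete, here is a minimal toy model (not the real bluetoothd code; ToyAdapter and its method names are made up): once the start command is outstanding and its completion callback never arrives from the kernel, every later start/stop request is rejected as in-progress, indefinitely.

// Toy model (not bluetoothd source) of the stuck state described above.
#include <iostream>
#include <optional>
#include <string>

enum class Discovery { kIdle, kStarting, kActive };

class ToyAdapter {
 public:
  std::optional<std::string> StartDiscovery() {
    if (state_ != Discovery::kIdle)
      return "org.bluez.Error.InProgress";
    state_ = Discovery::kStarting;  // MGMT_OP_START_SERVICE_DISCOVERY sent.
    return std::nullopt;            // Waiting for start_discovery_complete.
  }
  std::optional<std::string> StopDiscovery() {
    if (state_ == Discovery::kStarting)
      return "org.bluez.Error.InProgress";  // Start still pending.
    state_ = Discovery::kIdle;
    return std::nullopt;
  }
  // Never called in the buggy scenario: the completion from the kernel is lost.
  void OnStartDiscoveryComplete() { state_ = Discovery::kActive; }

 private:
  Discovery state_ = Discovery::kIdle;
};

int main() {
  ToyAdapter adapter;
  adapter.StartDiscovery();  // The kernel's reply never arrives...
  // ...so every subsequent request fails the same way, forever.
  for (int i = 0; i < 3; ++i)
    std::cout << adapter.StopDiscovery().value_or("ok") << "\n";
}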
May 15 2018
This issue is interesting. At first look, it does not seem related to skipped conditional commands, since there were no skip_conditional_cmd() messages after the last MGMT_OP_START_SERVICE_DISCOVERY message. In the btsnoop log, there was also no "HCI Command: LE Set Scan Enable" after the last MGMT_OP_START_SERVICE_DISCOVERY message. The MGMT_OP_START_SERVICE_DISCOVERY message just got lost somewhere. I would need kernel debug logs for further investigation. I will see if I can reproduce this issue.
May 15 2018
Hi Joseph, Ruchi did another repro with kernel debugging enabled; please take a look at the logs: https://drive.google.com/file/d/1hbfmZvfijozcxafWPKocw7DzcbnsO411/view Unfortunately, there is no btmon log for that repro. Ruchi is going to try another repro with a btmon log; in the meantime, you can take a look at this one.
May 16 2018
Hi Joseph, I was able to repro again with btmon log this time. PTAL! https://drive.google.com/open?id=1DKj4jSye4nQJgjI4FqC6gXyYlKyi1uEF
May 16 2018
Thanks, Ruchi, for reproducing the issue and providing the logs. Could you run "hcitool cmd 3f 05" to check the FW version? Thanks!
May 16 2018
> hcitool cmd 3f 05
< HCI Command: ogf 0x3f, ocf 0x0005, plen 0
> HCI Event: 0x0e plen 4
  01 05 FC 01
May 17 2018
Hi Ruchi, thanks! Just realized that Minnie uses a Broadcom BCM-43540 chip. I originally assumed that it used an Intel wireless chip. I am looking into the logs that you provided. Thanks!
May 17 2018
Thanks Joseph!
May 17 2018
Re C#5: In those logs, the symptom is similar but not exactly the same. The symptom in C#5 was "Failed to start discovery" instead of "Failed to stop discovery" as in C#0. Let me assume temporarily that both kinds of failures have the same root cause, and update the subject a bit to include both failures.
May 17 2018
On second thought, I think the "Failed to start discovery" issue might not have exactly the same root cause as the "Failed to stop discovery" issue. Hence, I created a new issue, 843943: Minnie: Unrecoverable error starting recovery. Let's track the two symptoms in two separate issues.
May 17 2018
Hi Ruchi, could you run the test again on Minnie and see if it is possible to reproduce the symptom described in C#0? Thanks! (The logs in C#5 are considered a separate issue and are now tracked in issue 843943: Minnie: Unrecoverable error starting recovery.)
May 17 2018
Thanks Joseph, I'll try to repro today and will update bugs with logs.
May 22 2018
We've inspected our metrics, and the issue is *not* specific to Minnie. These metrics are Eve-only and indicate a similarly drastic drop in stability: https://uma.googleplex.com/timeline_v2?sid=dc86135fb6dd4da1db6583bfa1a1008e According to the metrics, the beta channel is now seeing an ~8% success rate when stopping discovery sessions; normally, this rate hovers around 96-99%. M-67 hits stable in 8 days, so should this bug be moved to P0 instead of P1?
May 22 2018
Kyle, looking at that link, it seems that even stable has begun to drop off significantly in the last couple of days. The 1-day aggregate makes it clearer: https://uma.googleplex.com/timeline_v2?sid=54e27afbe7dde1d595776973749d4269 Did a new stable push cause this drop-off?
May 22 2018
Hi Kyle, thanks for the metrics! Yes, the issue seems to lie in the kernel and is generic to all platforms. I also filed a related issue, 845107: Failed to stop discovery due to __hci_req_sync() err -110. It seems to me that a few distinct symptoms led to the failures to start/stop discovery. I am currently working on it. There are 2 more days until the stable cut. It looks to me like there are race conditions between the discovery operations and other operations. It will take time to investigate the root cause, fix it, and test. These issues are my top priority for now.
May 22 2018
Thanks, Joseph. Do you see a reason why this would also have caused a drop on the Stable channel in the past week?
May 22 2018
I think this issue (or these issues) has existed for quite a long time. We did not hit it as frequently before, but for some unknown reason, probably related to how operations/threads are scheduled to execute, the bug has become active now.
May 22 2018
BTW, sometimes "Failed to stop discovery" may not be a real error. It might be a transient error caused by "Start discovery" still being in progress; when "Stop discovery" is tried again at a later timestamp, it may succeed. The transient error is shown in the following logs.

# The "Error stopping discovery" seen in ui.LATEST.
[1224:1224:0515/125440.714539:ERROR:device_event_log_impl.cc(156)] [12:54:40.714] Bluetooth: bluetooth_adapter_bluez.cc:1619 /org/bluez/hci0: Failed to stop discovery: org.bluez.Error.InProgress: Operation already in progress
[1224:1224:0515/125440.715476:ERROR:ble_scanner_impl.cc(279)] Error stopping discovery session.

### The corresponding logs observed in /var/log/messages.
# "Start discovery" by chrome
2018-05-15T12:54:40.163489+08:00 DEBUG bluetoothd[2613]: src/adapter.c:start_discovery() sender :1.7
# "Stop discovery" by chrome before "Start discovery" completed.
2018-05-15T12:54:40.711300+08:00 DEBUG bluetoothd[2613]: src/adapter.c:stop_discovery() sender :1.7
# "Stop discovery" failed since "Start discovery" was still in progress.
2018-05-15T12:54:40.711322+08:00 DEBUG bluetoothd[2613]: src/adapter.c:stop_discovery() error: start discovery in progress
# "Start discovery" completed at this time.
2018-05-15T12:54:40.712136+08:00 DEBUG bluetoothd[2613]: src/adapter.c:start_discovery_complete() status 0x00
# "Stop discovery" by chrome again.
2018-05-15T12:54:47.472306+08:00 DEBUG bluetoothd[2613]: src/adapter.c:stop_discovery() sender :1.7
# "Stop discovery" completed successfully this time.
2018-05-15T12:54:47.488875+08:00 DEBUG bluetoothd[2613]: src/adapter.c:stop_discovery_complete() status 0x00
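To make the transient case concrete, here is a minimal sketch (hypothetical names only, not the actual Chrome or BlueZ code) of treating org.bluez.Error.InProgress as retryable when stopping discovery. The unrecoverable case tracked in this bug is when the error never clears, no matter how long the caller retries.

// Toy model: retry StopDiscovery while bluetoothd reports InProgress.
#include <chrono>
#include <iostream>
#include <string>
#include <thread>

struct DBusResult {
  bool ok;
  std::string error_name;  // e.g. "org.bluez.Error.InProgress"
};

// Fake bluetoothd: "Start discovery" is still in progress for the first two
// Stop calls, then the stop succeeds (the recoverable case shown in the logs
// above). The unrecoverable bug is when this condition never clears.
DBusResult FakeStopDiscovery() {
  static int calls = 0;
  if (++calls < 3)
    return {false, "org.bluez.Error.InProgress"};
  return {true, ""};
}

bool StopDiscoveryWithRetry(int max_attempts) {
  for (int attempt = 1; attempt <= max_attempts; ++attempt) {
    DBusResult result = FakeStopDiscovery();
    if (result.ok)
      return true;                       // Discovery stopped.
    if (result.error_name != "org.bluez.Error.InProgress")
      return false;                      // A real error; do not retry.
    std::cerr << "attempt " << attempt << ": InProgress, retrying\n";
    std::this_thread::sleep_for(std::chrono::milliseconds(100 * attempt));
  }
  return false;                          // Never recovered: this bug.
}

int main() {
  std::cout << (StopDiscoveryWithRetry(5) ? "stopped\n" : "stuck\n");
}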
May 22 2018
Naive question from the peanut gallery: Is there not a way for the 'start discovery' requester to know that 'stop discovery' is ongoing?
May 22 2018
Re #20: Not from within Chrome. The logs Joseph is referencing are from bluetoothd (a layer lower).
May 22 2018
Joseph: see #20, regardless of the particular layer. Is there a way to prevent the mismatched starting/stopping?
May 22 2018
The recoverable error is okay/normal/expected; at this moment magic tether is still doing retries, so those cases recover eventually. What we're interested in is debugging the cases that never recover.
May 22 2018
I was able to repro this overnight with Eve (M67 - 10575.40.0 - active beta build) and Minnie (M68 - 10654.0.0). Logs below:
Minnie: https://drive.google.com/open?id=13oPKks6eb6YyKggEpFAKemy8gszWQi3Z
Eve: https://drive.google.com/open?id=1uu8ZPfMqoKx-HxXJAprTPITlpTFJeroI
May 22 2018
Thank you, Ruchi, for providing the repro logs. I looked at them and found the following:
1. The Eve bug (M67-10575.40.0) looks different from the kernel bug that we are tracking here. I moved it to bug 837656.
2. The Minnie bug (M68-10654.0.0) looks like the same kernel bug we're investigating here. I will do further investigation and sync with Joseph.
May 22 2018
After investigating more of the Eve log that Ruchi provided in #24, I am almost certain about the root cause of the dipping success rate of InstantTethering.BluetoothDiscoverySessionStopped (https://uma.googleplex.com/timeline_v2?sid=54e27afbe7dde1d595776973749d4269), even in stable. My theory is this:
1. There is an increase in bluetoothd crashes.
2. The crashes cause the discovery session state in Chrome to go out of sync with bluetoothd (bug 837656).
3. The out-of-sync state causes the "active discovery session not found" error, which is supported by other metrics: https://uma.googleplex.com/p/chrome/timeline_v2?sid=80fd1b75610ded00e3c3c7c0de23cb6c
So, while Joseph is fighting the bluetoothd crashes, I am also working on a fix for the out-of-sync state at the Chrome level to make our stack more resilient to bluetoothd crashes.
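To sketch the general idea (hypothetical class and method names, not the actual patch): if Chrome drops its locally tracked discovery sessions whenever bluetoothd goes away, a later stop call no longer trips over "active discovery session not found".

// Hypothetical sketch of Chrome-level resilience to a bluetoothd restart:
// when the daemon disappears, invalidate all locally tracked discovery
// sessions so local state matches the freshly restarted daemon.
#include <functional>
#include <iostream>
#include <vector>

class DiscoverySessionTracker {
 public:
  using InvalidatedCallback = std::function<void()>;

  void AddSession(InvalidatedCallback on_invalidated) {
    sessions_.push_back(std::move(on_invalidated));
  }

  // In a real implementation this would be wired to the D-Bus
  // NameOwnerChanged signal for org.bluez; here it is called directly.
  void OnBluetoothDaemonRestarted() {
    for (auto& notify : sessions_)
      notify();          // Tell owners their session is gone; they may retry.
    sessions_.clear();   // Local state now matches the restarted daemon.
  }

 private:
  std::vector<InvalidatedCallback> sessions_;
};

int main() {
  DiscoverySessionTracker tracker;
  tracker.AddSession([] { std::cout << "session invalidated; will re-scan\n"; });
  tracker.OnBluetoothDaemonRestarted();  // Simulated bluetoothd crash/restart.
}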
May 22 2018
Hi Sonny, thanks for looking into it. Did you find bluetoothd crashes in the logs attached in C#24? (I have not had time to look at them yet.) My patches for 2 bluetoothd crash issues landed in R68-10669.0.0; since then, I have not found any bluetoothd crashes in my own tests on Eve. While you are working on the out-of-sync issue at the Chrome level, I will keep working on this issue here. I have found a few distinct symptoms when "Failed to stop discovery" occurred.
May 23 2018
Hi Joseph, please find the crash dumps from Ruchi's repro in #24 in this attachment. There were several bluetoothd crashes, and the last one (maybe accidentally) caused the out-of-sync state (bluetoothd.20180521.210600.12051.core):

2018-05-21T20:02:33.631437-07:00 INFO crash_reporter[4916]: Leaving core file at /var/spool/crash/bluetoothd.20180521.200233.2352.core due to developer image
2018-05-21T20:11:49.686624-07:00 INFO crash_reporter[6020]: Leaving core file at /var/spool/crash/bluetoothd.20180521.201149.4928.core due to developer image
2018-05-21T20:16:02.018085-07:00 INFO crash_reporter[6468]: Leaving core file at /var/spool/crash/bluetoothd.20180521.201602.6031.core due to developer image
2018-05-21T20:39:19.794153-07:00 INFO crash_reporter[9265]: Leaving core file at /var/spool/crash/bluetoothd.20180521.203919.6478.core due to developer image
2018-05-21T20:41:41.694503-07:00 INFO crash_reporter[9598]: Leaving core file at /var/spool/crash/bluetoothd.20180521.204141.9276.core due to developer image
2018-05-21T20:42:58.689281-07:00 INFO crash_reporter[9761]: Leaving core file at /var/spool/crash/bluetoothd.20180521.204258.9610.core due to developer image
2018-05-21T20:49:01.728911-07:00 INFO crash_reporter[10470]: Leaving core file at /var/spool/crash/bluetoothd.20180521.204901.9772.core due to developer image
2018-05-21T21:02:00.673363-07:00 INFO crash_reporter[12040]: Leaving core file at /var/spool/crash/bluetoothd.20180521.210200.10481.core due to developer image
2018-05-21T21:06:00.710507-07:00 INFO crash_reporter[12496]: Leaving core file at /var/spool/crash/bluetoothd.20180521.210600.12051.core due to developer image

It's possible that the crashes were due to the same root cause, since they happened very close together. The symbols are available in GoldenEye for Eve version 10575.40.0 (https://console.developers.google.com/storage/chromeos-image-archive/eve-release/R67-10575.40.0/), in case you want to debug it.
May 24 2018
Hi Sonny, thanks for checking it out! Looking it up in GoldenEye, the patches landed in R67-10575.45.0 and R67-10575.46.0. So the build being tested, i.e., R67-10575.40.0, does not have the patches yet, and hence the crashes were expected. Thanks.
May 24 2018
Ah, that's good to hear; it's probably fixed by your patches as of R67-10575.46.0. Maybe Ruchi can help test with R67-10575.47.0 tomorrow. Anyway, I will also try to merge the out-of-sync patch for Chrome to M67 so that it's resilient to other unexpected Bluetooth crashes.