Error Creating GATT Connection |
||||||||||||||
Issue descriptionDuring host scan... Error creating GATT connection to CAESR...O2mE=. Error code: 4 Error log: https://cs.chromium.org/chromium/src/components/cryptauth/ble/bluetooth_low_energy_weave_client_connection.cc?q=OnCreateGattConnectionError Error code: ConnectErrorCode::ERROR_FAILED, https://cs.chromium.org/chromium/src/device/bluetooth/bluetooth_device.h?q=ConnectErrorCode
,
Aug 25 2017
,
Sep 12 2017
Hi Kyle, could you capture btmon log and /var/log/messages debug log when this happens?
,
Sep 12 2017
Kyle, could you also capture all the standard debug log as described in go/bt-debug?
,
Sep 13 2017
,
Sep 13 2017
Kyle, does this put the controller in a state that it no longer accepts other commands? If not, does retrying the connection work?
,
Sep 13 2017
rkc@: No, the controller seems to recover from this situation. Retrying the connection sometimes works, but sometimes it repeats this error.
,
Sep 13 2017
Happens 1 out of 3-4 connection attempts. Happens across multiple devices. Half the time it does recover. Kyle to get us logs hopefully today.
,
Sep 13 2017
Attached are logs for this bug. The files are too big to upload in one batch, so I'll upload as part of two separate comments.
,
Sep 13 2017
Second batch...
,
Sep 13 2017
Also, it seems sometimes the CreateGattConnection() does not return in a timely manner. We currently have a 15-second timeout for creating a GATT connection, and if we hit the timeout, we give up on the connection. It really shouldn't be taking that long.
,
Sep 15 2017
what is the user impact for this Kyle?
,
Sep 15 2017
User impact is that Bluetooth connections don't work, and tethering appears to be broken.
,
Sep 19 2017
Issue 766475 has been merged into this issue.
,
Sep 20 2017
,
Sep 22 2017
Attached are more logs at Sonny's request.
,
Sep 22 2017
,
Oct 9 2017
Hi Kyle, an issue could potentially cause the disconnection observed here. The supervision timeout is set to 420 ms by Magic Tethering which is a bit too short. This short value would lead to disconnection when controllers on either side of the connection does not respond within 420 ms for whatever reasons. This may happen once in a while in our experience before. In our current default kernel settings, the supervision timeout is set to 2000 ms, while iOS devices and android devices are set to even longer values for better connection stability. If my memory serves me right, they are 4 secs and 20+ secs respectively in both devices. This is a snippet in the btsnoop.log found in "https://drive.google.com/corp/drive/folders/0B6tDJkhYQT1FdVVkUUljdjdEQ00" which shows 420 ms set by Magic Tethering. < HCI Command: LE Create Connection (0x08|0x000d) plen 25 [hci0] 51.097112 Peer address: 55:94:17:B2:E5:E4 (Resolvable) Own address type: Public (0x00) Min connection interval: 7.50 msec (0x0006) Max connection interval: 7.50 msec (0x0006) Connection latency: 0x0000 Supervision timeout: 420 msec (0x002a) ... And this short supervision timeout caused the disconnection below > HCI Event: LE Meta Event (0x3e) plen 12 [hci0] 51.667998 LE Read Remote Used Features (0x04) Status: Connection Timeout (0x08) Handle: 3587 Features: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 > HCI Event: Disconnect Complete (0x05) plen 4 [hci0] 51.668963 Status: Success (0x00) Handle: 3587 Reason: Connection Timeout (0x08) @ MGMT Event: Device Disconnected (0x000c) plen 8 {0x0004} [hci0] 51.674954 LE Address: 55:94:17:B2:E5:E4 (Resolvable) Reason: Connection timeout (0x01) And this in turn caused the following failure in the corresponding chrome log. [3883:3883:0918/191941.670218:ERROR:device_event_log_impl.cc(156)] [19:19:41.669] Bluetooth: bluetooth_device_bluez.cc:915 /org/bluez/hci0/dev_55_94_17_B2_E5_E4: Failed to connect device: org.bluez.Error.Failed: Software caused connection abort [3883:3883:0918/191941.672186:WARNING:bluetooth_low_energy_weave_client_connection.cc(417)] Error creating GATT connection to {id: "CAESR...VEw==", addr: "55:94:17:B2:E5:E4"}. Error code: 4 So please lengthen the value to say 2000 ms and see if you can still reproduce this issue. Thanks!
,
Oct 9 2017
Thanks for the observation, Joseph! In our Bluetooth sync today, Rahul said that it seemed likely that this was a side effect of the EasyUnlock change which attempts to shorten connection latency. Rahul, assigning to you.
,
Oct 11 2017
Sonny has created a patch which seems to fix the GATT connection timeout bug ( issue 771016 ): https://chromium-review.googlesource.com/c/chromiumos/third_party/bluez/+/711374/ With this patch in place, the "Error creating GATT connection" bug becomes far more common. The error code is still 4, which corresponds to ConnectErrorCode::ERROR_FAILED.
,
Oct 11 2017
Attached are logs I took after Sonny's patch.
,
Oct 12 2017
Hi Tim, Can you help with confirming whether Easy Unlock resets the supervision timeout back to the default vallue (2000ms) after login? Since this may reduce the connection stability of Magic tethering feature. Thanks.
,
Oct 12 2017
We reset the timeout for each BLE connection, so after EasyUnlocking, a subsequent Magic Tether connection should use the default timeout.
,
Oct 12 2017
It is a bit strange that I used to see Magic Tethering logs with supervision timeout as 420 ms. However, The logs these two days showed supervision timeout as 2,000 ms again. Do not know why.
,
Oct 12 2017
In the logs attached by Kyle in C#21, the supervision timeout was 2000 ms which is the expected default value.
< HCI Command: LE Create Connection (0x08|0x000d) plen 25 [hci0] 2017-10-12 06:09:41.851475
Peer address: 6D:50:C2:45:57:45 (Resolvable)
Own address type: Public (0x00)
Min connection interval: 50.00 msec (0x0028)
Max connection interval: 70.00 msec (0x0038)
Connection latency: 0x0000
Supervision timeout: 2000 msec (0x00c8)
...
The value looks good now.
However, supervision timeout still occurred frequently. It occurred about once per minute in Kyle's log. This was way too high based on what we observed before. The reason could be either the phone did not respond in time, or the local controller of the DUT did not send out the command correctly, or the environment was too noisy. This requires further investigation to clarify which caused the supervision timeout.
,
Oct 12 2017
Sonny's patch in C#20 fixed the dbus timeout and handled the device clean up gracefully.
However, the "Error creating GATT connection" errors were still observed rather common in C#21. I think the remaining issue probably lies in the controller. In the logs attached in C#21 by Kyle, there were 18 "Error creating GATT connection" errors in total by grepping the chrome log. When matching the timestamps of these errors to those in btsnoop.log, it is found that
- (a) 16 of the errors were caused by "Connection Failed to be Established".
This is considered to be a local controller issue and Intel is investigating.
- (b) 2 of the errors were caused by "Connection Timeout", i.e., supervision timeout.
Although supervision timeouts could be caused by either the phones, the DUT, or the air noise, it might still be an issue in local controller. Should we ask Intel to examine it as well?
------------- example logs below ----------------
Example logs of Type (a):
btmon:
> HCI Event: Disconnect Complete (0x05) plen 4 [hci0] 15:12:03.208911
Reason: Connection Failed to be Established (0x3e)
chrome:
[3486:3486:1011/151203.214800:WARNING:bluetooth_low_energy_weave_client_connection.cc(438)] Error creating GATT connection to {id: "CAESR...ZjFVQ", addr: "46:01:B6:44:F3:49"}. Error code: 4
Example logs of Type (b):
btmon:
> HCI Event: LE Meta Event (0x3e) plen 12 [hci0] 15:14:20.269285
LE Read Remote Used Features (0x04)
Status: Connection Timeout (0x08)
chrome:
[3486:3486:1011/151420.273799:WARNING:bluetooth_low_energy_weave_client_connection.cc(438)] Error creating GATT connection to {id: "CAESR...vNdeK", addr: "4C:B6:2F:57:93:6F"}. Error code: 4
,
Oct 13 2017
I reproduced the "Error creating GATT connection" error with M62 9901.46.0 image from Goldeneye beta channel. Please see the logs below.
,
Oct 13 2017
I was able to reproduce "command disallowed" on LE Create Connection with only "hcitool lecc".
,
Oct 13 2017
I was also reproduce "Connection failed to be established" with only "hcitool lecc". Attached is the HCI log.
,
Oct 13 2017
b/67783914 is created for Intel to fix.
,
Oct 27 2017
Any updates?
,
Nov 10 2017
Kyle, can we mark this as fixed? Or are we still seeing this often?
,
Nov 10 2017
Nope - it's still a regular occurrence.
,
Nov 14 2017
,
Nov 15 2017
According to metrics, we are seeing GATT connection creation errors over 50% of the time. We see two main errors: (1) device::BluetoothDevice::ConnectErrorCode::ERROR_FAILED (value = 4) (2) device::BluetoothDevice::ConnectErrorCode::ERROR_UNKNOWN (value = 6) See go/chrome-os-tether-metrics for details. Joseph/Rahul, can we prioritize this issue?
,
Nov 22 2017
Any updates? M63 goes stable in < 2 weeks and this is marked as a blocker.
,
Nov 28 2017
I conducted a test on eve-release/R64-10128.0.0 today, and the same error 4 could still be observed as
Error creating GATT connection to {id: ...}. Error code: 4
As pointed out in C#27, this was mostly caused by
Reason: Connection Failed to be Established
This issue is being traced in b/67783914. There are two underlying issues.
1. Hardware issue: I suspect that there might exist some hardware issue such that the connection failure rate looks higher than usual.
2. Software issue: we should modify the kernel so that it waits for up to 6 connection intervals before declaring a connection is actually established. Currently, it prematurely assumes that a connection is established, and then it realizes that it is NOT. By fixing it, the kernel should wait a little longer before declaring that the connection is NOT established.
IMHO, the actual issue is a hardware issue. However, Intel would not like to look further into this issue until we fix the issue on the software side. Fixing software issue is not a high-priority task at this time for two reasons
(a) Fixing the software issue does not help improve connection success rate. What it does would be to emit the connection error message a bit later. After all, while the hardware does not establish the connection successfully, software cannot do anything but to emit an error message sooner or later.
(b) This connection failure is transient. It would connect sometime later after retrying a few more times.
Intel is currently focusing on the controller lost issue which is non-recoverable and quite many users complain.
(And I am busy with fixing bluetoothd crash issue which affects all users and am also supporting Intel on reproducing issues.)
May I suggest to lower the priority of this issue so that Intel could focus on the more severe issue? Thanks.
,
Nov 28 2017
Thanks for the update, Joseph. I'm re-targeting this fix to M-64 as a P2.
,
Jan 4 2018
Removing as a RBS for M64 (time, priority).
,
Jan 17 2018
,
Feb 2 2018
,
Feb 28 2018
I've been able to reproduce GATT connection errors between an Eve and one particular Android device pretty easily for some reason. The error seems to be much more common on this device. Joseph, do you mind taking a look at some of these logs?
,
Feb 28 2018
Joseph, do you have time to look at this? If you have other higher priority issues, let us know - I'll re-assign.
,
Mar 19 2018
The connection failure occurred between the chromebook bluetooth controller and the remote peripheral. The failure could be incurred by a number of reasons:
- hardware errors on the chromebook bt controller,
- hardware errors on the peripheral, or
- environmental noise interference.
The errors are recoverable by retry. However, there is not much on software side to mitigate the errors. Maybe we should bring this error to Intel’s attention?
----- more details below -----
### By looking at the logs in C#43, the ui.LATEST log showed that
[3590:3590:0228/110941.051163:ERROR:device_event_log_impl.cc(156)] … Failed to connect device: org.bluez.Error.Failed: Software caused connection abort
### Checking btsnoop_chromebook.log to see what happened on the controller.
### The controller tried to connect but failed with timeout. This was why it showed connection abort in ui.LATEST.
< HCI Command: LE Create Connection [hci0] 11:09:39.040074
> HCI Event: Command Status [hci0] 11:09:39.041940
LE Create Connection Status: Success (0x00)
< HCI Command: LE Create Connection Cancel [hci0] 11:09:41.045723
> HCI Event: Command Complete [hci0] 11:09:41.046920
LE Create Connection Cancel Status: Success (0x00)
> HCI Event: LE Meta Event [hci0] 11:09:41.047935
LE Connection Complete
Status: Unknown Connection Identifier (0x02)
### Controller tried again 20 seconds later. It succeeded this time.
< HCI Command: LE Create Connection [hci0] 11:10:01.171046
> HCI Event: Command Status [hci0] 11:10:01.172870
LE Create Connection Status: Success (0x00)
> HCI Event: LE Meta Event [hci0] 11:10:02.256888
LE Connection Complete
Status: Success (0x00)
Next two LE connection attempts failed again with timeouts which were followed by another successful connection. The connection errors happened quite frequently.
,
Mar 20 2018
Hi Sukumar, could you help take a look at the logs in C#43 and my comments in C#45, and clarify whether the connection timeout was a hardware/firmware issue? Thanks!
,
Apr 1 2018
I checked the log that captured for 10 min test: 20 -success 14 -Failure => I dont see any issue in Chrome (StP D1). I will check the Android log and will update. => Also please capture the Air log, when you see the issue next time.
,
Apr 3 2018
,
Apr 3 2018
About C#48: the same "Unknown Connection Identifier" caused the fitbit arc++ app failed to connect the fitbit zip device.
,
Apr 3 2018
Re C#47: Hi Sukumar, when you ran magic tethering, could you check the btsnoop logs on the chromebook? The "Unknown Connection Identifier" issue occurs frequently and is supposed to reproduce pretty easily.
,
Apr 4 2018
This occurs when due to some reason LE connection is not established within 2sec, hence no connection handle available in Fw for this new connection. Followed by Linux-kernel BT-stack continues and sends the disconnect request after 2sec (Linux host cmd timeout). As no connection handle available in Fw, hence Fw throws status as "Unknown Connection Identifier", for this case.
Also, we need to handle when the connection is not established in 2sec, and separately.
Example:
HCI_CMD Sent LE Create Connection Cancel
HCI_EVT Rcvd Command Complete (LE Create Connection Cancel)
HCI_EVT Rcvd LE Meta (LE Connection Complete)
Status: Unknown Connection Identifier (0x02)
Connection Handle: 0x0000
,
Apr 4 2018
I had this chat with Miao also yesterday, but should we consider increasing the 2sec kernel timeout? I am wondering whether there are substantial number of cases where it takes 2+ seconds to successfully create the connection. Do we have data that tells us if this is the case (i.e., distribution of time it takes to create connections)?
,
Apr 4 2018
Create connection has two-phase -
Step 1: Wait for Adv sync form the remote device. no timeout ie. wait forever in this stage. Followed by its enter in 2nd step;
Step 2: where it makes 6 attempts to create a connection.
i.e. total time as 6 * connection_interval
Bluetooth HCI Command - LE Create Connection
Scan Interval: 96 (60 msec)
Scan Window: 96 (60 msec)
Connection Interval Min: 40 (50 msec)
Connection Interval Max: 56 (70 msec)
,
Apr 4 2018
cont.. Observation: No Connection complete event received in 2sec (Success/ Failure). i.e. Controller is waiting for Adv sync. hence Linux-kernel BT-stack sends the disconnect request after 2sec as clean up and to allow the next activity.
,
Apr 4 2018
For step 1, is the firmware implementation to wait forever or is there an artificial timeout that is used in the firmware?
,
May 29 2018
Issue 847627 has been merged into this issue. |
||||||||||||||
►
Sign in to add a comment |
||||||||||||||
Comment 1 by mcchou@chromium.org
, Aug 14 2017Labels: -Pri-3 M-61 OS-Chrome Pri-1
Owner: mcchou@chromium.org
Status: Assigned (was: Untriaged)