New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 755326 link

Starred by 7 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug

Blocking:
issue 759091



Sign in to add a comment

Error Creating GATT Connection

Project Member Reported by khorimoto@chromium.org, Aug 14 2017

Issue description

Comment 1 by mcchou@chromium.org, Aug 14 2017

Components: UI>Shell>Networking>Tethering OS>Systems>Bluetooth
Labels: -Pri-3 M-61 OS-Chrome Pri-1
Owner: mcchou@chromium.org
Status: Assigned (was: Untriaged)
Blocking: 759091
Hi Kyle, could you capture btmon log and /var/log/messages debug log when this happens?
Kyle, could you also capture all the standard debug log as described in go/bt-debug?
Labels: ReleaseBlock-Stable

Comment 6 by r...@chromium.org, Sep 13 2017

Kyle, does this put the controller in a state that it no longer accepts other commands? If not, does retrying the connection work?

rkc@: No, the controller seems to recover from this situation. Retrying the connection sometimes works, but sometimes it repeats this error.

Comment 8 by snanda@chromium.org, Sep 13 2017

Happens 1 out of 3-4 connection attempts. Happens across multiple devices. Half the time it does recover.

Kyle to get us logs hopefully today.
Attached are logs for this bug. The files are too big to upload in one batch, so I'll upload as part of two separate comments.
prox_tether_logs.txt
118 KB View Download
bt_messages
17.7 MB Download
ui.LATEST
6.5 KB Download
Second batch...
messages
20.1 MB Download
dmesg.log
66.1 KB View Download
chrome
555 KB View Download
btsnoop.log
637 KB View Download
Also, it seems sometimes the CreateGattConnection() does not return in a timely manner. We currently have a 15-second timeout for creating a GATT connection, and if we hit the timeout, we give up on the connection.

It really shouldn't be taking that long.
what is the user impact for this Kyle?
User impact is that Bluetooth connections don't work, and tethering appears to be broken.
Issue 766475 has been merged into this issue.
Cc: sacomoto@chromium.org
Attached are more logs at Sonny's request.
prox_tether_logs.txt
84.7 KB View Download
bt_messages
5.7 MB View Download
ui.LATEST
5.2 KB Download
messages
6.3 MB View Download
dmesg.log
122 KB View Download
chrome
454 KB View Download
btsnoop.log
1.1 MB View Download
Labels: -M-61 M-63
Owner: khorimoto@chromium.org
Hi Kyle, an issue could potentially cause the disconnection observed here. The supervision timeout is set to 420 ms by Magic Tethering which is a bit too short. This short value would lead to disconnection when controllers on either side of the connection does not respond within 420 ms for whatever reasons. This may happen once in a while in our experience before.

In our current default kernel settings, the supervision timeout is set to 2000 ms, while iOS devices and android devices are set to even longer values for better connection stability. If my memory serves me right, they are 4 secs and 20+ secs respectively in both devices.


This is a snippet in the btsnoop.log found in "https://drive.google.com/corp/drive/folders/0B6tDJkhYQT1FdVVkUUljdjdEQ00" which shows 420 ms set by Magic Tethering. 

< HCI Command: LE Create Connection (0x08|0x000d) plen 25                                            [hci0] 51.097112
        Peer address: 55:94:17:B2:E5:E4 (Resolvable)
        Own address type: Public (0x00)
        Min connection interval: 7.50 msec (0x0006)
        Max connection interval: 7.50 msec (0x0006)
        Connection latency: 0x0000
        Supervision timeout: 420 msec (0x002a)
        ...


And this short supervision timeout caused the disconnection below


> HCI Event: LE Meta Event (0x3e) plen 12                                                            [hci0] 51.667998
      LE Read Remote Used Features (0x04)
        Status: Connection Timeout (0x08)
        Handle: 3587
        Features: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
> HCI Event: Disconnect Complete (0x05) plen 4                                                       [hci0] 51.668963
        Status: Success (0x00)
        Handle: 3587
        Reason: Connection Timeout (0x08)
@ MGMT Event: Device Disconnected (0x000c) plen 8                                           {0x0004} [hci0] 51.674954
        LE Address: 55:94:17:B2:E5:E4 (Resolvable)
        Reason: Connection timeout (0x01)


And this in turn caused the following failure in the corresponding chrome log.


[3883:3883:0918/191941.670218:ERROR:device_event_log_impl.cc(156)] [19:19:41.669] Bluetooth: bluetooth_device_bluez.cc:915 /org/bluez/hci0/dev_55_94_17_B2_E5_E4: Failed to connect device: org.bluez.Error.Failed: Software caused connection abort

[3883:3883:0918/191941.672186:WARNING:bluetooth_low_energy_weave_client_connection.cc(417)] Error creating GATT connection to {id: "CAESR...VEw==", addr: "55:94:17:B2:E5:E4"}. Error code: 4


So please lengthen the value to say 2000 ms and see if you can still reproduce this issue. Thanks!

Cc: josephsih@chromium.org tengs@chromium.org
Owner: r...@chromium.org
Thanks for the observation, Joseph! In our Bluetooth sync today, Rahul said that it seemed likely that this was a side effect of the EasyUnlock change which attempts to shorten connection latency. Rahul, assigning to you.
Sonny has created a patch which seems to fix the GATT connection timeout bug ( issue 771016 ):

https://chromium-review.googlesource.com/c/chromiumos/third_party/bluez/+/711374/

With this patch in place, the "Error creating GATT connection" bug becomes far more common. The error code is still 4, which corresponds to ConnectErrorCode::ERROR_FAILED.
Attached are logs I took after Sonny's patch.
error_gatt_connection.zip
4.1 MB Download
Hi Tim,

Can you help with confirming whether Easy Unlock resets the supervision timeout back to the default vallue (2000ms) after login? Since this may reduce the connection stability of Magic tethering feature. Thanks.

Comment 23 by tengs@chromium.org, Oct 12 2017

We reset the timeout for each BLE connection, so after EasyUnlocking, a subsequent Magic Tether connection should use the default timeout.
It is a bit strange that I used to see Magic Tethering logs with supervision timeout as 420 ms. However, The logs these two days showed supervision timeout as 2,000 ms again. Do not know why.
In the logs attached by Kyle in C#21, the supervision timeout was 2000 ms which is the expected default value.

< HCI Command: LE Create Connection (0x08|0x000d) plen 25                                                                                   [hci0] 2017-10-12 06:09:41.851475
        Peer address: 6D:50:C2:45:57:45 (Resolvable)
        Own address type: Public (0x00)
        Min connection interval: 50.00 msec (0x0028)
        Max connection interval: 70.00 msec (0x0038)
        Connection latency: 0x0000
        Supervision timeout: 2000 msec (0x00c8)
        ...

The value looks good now. 

However, supervision timeout still occurred frequently. It occurred about once per minute in Kyle's log. This was way too high based on what we observed before. The reason could be either the phone did not respond in time, or the local controller of the DUT did not send out the command correctly, or the environment was too noisy. This requires further investigation to clarify which caused the supervision timeout.

Comment 26 Deleted

Sonny's patch in C#20 fixed the dbus timeout and handled the device clean up gracefully. 

However, the "Error creating GATT connection" errors were still observed rather common in C#21. I think the remaining issue probably lies in the controller. In the logs attached in C#21 by Kyle, there were 18 "Error creating GATT connection" errors in total by grepping the chrome log. When matching the timestamps of these errors to those in btsnoop.log, it is found that

  - (a) 16 of the errors were caused by "Connection Failed to be Established".
        This is considered to be a local controller issue and Intel is investigating.

  - (b) 2 of the errors were caused by "Connection Timeout", i.e., supervision timeout.
        Although supervision timeouts could be caused by either the phones, the DUT, or the air noise, it might still be an issue in local controller. Should we ask Intel to examine it as well?



------------- example logs below ----------------

Example logs of Type (a):
  btmon:
    > HCI Event: Disconnect Complete (0x05) plen 4  [hci0] 15:12:03.208911
        Reason: Connection Failed to be Established (0x3e)

  chrome:   
[3486:3486:1011/151203.214800:WARNING:bluetooth_low_energy_weave_client_connection.cc(438)] Error creating GATT connection to {id: "CAESR...ZjFVQ", addr: "46:01:B6:44:F3:49"}. Error code: 4



Example logs of Type (b):
  btmon:
    > HCI Event: LE Meta Event (0x3e) plen 12   [hci0] 15:14:20.269285
        LE Read Remote Used Features (0x04)
        Status: Connection Timeout (0x08)

  chrome:
[3486:3486:1011/151420.273799:WARNING:bluetooth_low_energy_weave_client_connection.cc(438)] Error creating GATT connection to {id: "CAESR...vNdeK", addr: "4C:B6:2F:57:93:6F"}. Error code: 4
I reproduced the "Error creating GATT connection" error with M62 9901.46.0 image from Goldeneye beta channel. Please see the logs below.
create_gatt_connection_error.zip
12.2 MB Download
I was able to reproduce "command disallowed" on LE Create Connection with only "hcitool lecc".
hci.log
705 bytes View Download
I was also reproduce "Connection failed to be established" with only "hcitool lecc". Attached is the HCI log.
hci.log
2.5 KB View Download
b/67783914 is created for Intel to fix.
Any updates?

Comment 33 by r...@chromium.org, Nov 10 2017

Kyle, can we mark this as fixed? Or are we still seeing this often?


Nope - it's still a regular occurrence.

Comment 35 by r...@chromium.org, Nov 14 2017

Owner: josephsih@chromium.org
According to metrics, we are seeing GATT connection creation errors over 50% of the time. We see two main errors:
(1) device::BluetoothDevice::ConnectErrorCode::ERROR_FAILED (value = 4)
(2) device::BluetoothDevice::ConnectErrorCode::ERROR_UNKNOWN (value = 6)

See go/chrome-os-tether-metrics for details.

Joseph/Rahul, can we prioritize this issue?
Any updates? M63 goes stable in < 2 weeks and this is marked as a blocker.
I conducted a test on eve-release/R64-10128.0.0 today, and the same error 4 could still be observed as

    Error creating GATT connection to {id: ...}. Error code: 4

As pointed out in C#27, this was mostly caused by 

    Reason: Connection Failed to be Established

This issue is being traced in b/67783914. There are two underlying issues.

1. Hardware issue: I suspect that there might exist some hardware issue such that the connection failure rate looks higher than usual.

2. Software issue: we should modify the kernel so that it waits for up to 6 connection intervals before declaring a connection is actually established. Currently, it prematurely assumes that a connection is established, and then it realizes that it is NOT. By fixing it, the kernel should wait a little longer before declaring that the connection is NOT established.

IMHO, the actual issue is a hardware issue. However, Intel would not like to look further into this issue until we fix the issue on the software side. Fixing software issue is not a high-priority task at this time for two reasons

(a) Fixing the software issue does not help improve connection success rate. What it does would be to emit the connection error message a bit later. After all, while the hardware does not establish the connection successfully, software cannot do anything but to emit an error message sooner or later.

(b) This connection failure is transient. It would connect sometime later after retrying a few more times.

Intel is currently focusing on the controller lost issue which is non-recoverable and quite many users complain. 

(And I am busy with fixing bluetoothd crash issue which affects all users and am also supporting Intel on reproducing issues.)

May I suggest to lower the priority of this issue so that Intel could focus on the more severe issue? Thanks.
Labels: -Pri-1 -M-63 M-64 Pri-2
Thanks for the update, Joseph. I'm re-targeting this fix to M-64 as a P2.
Labels: -ReleaseBlock-Stable
Removing as a RBS for M64 (time, priority).
Labels: -M-64 M-65
Labels: -Pri-2 -M-65 Pri-1
I've been able to reproduce GATT connection errors between an Eve and one particular Android device pretty easily for some reason. The error seems to be much more common on this device. Joseph, do you mind taking a look at some of these logs?
error_creating_gatt_connection.zip
22.1 MB Download

Comment 44 by r...@chromium.org, Feb 28 2018

Joseph, do you have time to look at this? If you have other higher priority issues, let us know - I'll re-assign. 
The connection failure occurred between the chromebook bluetooth controller and the remote peripheral. The failure could be incurred by a number of reasons:
- hardware errors on the chromebook bt controller,
- hardware errors on the peripheral, or
- environmental noise interference. 
The errors are recoverable by retry. However, there is not much on software side to mitigate the errors. Maybe we should bring this error to Intel’s attention?

----- more details below -----

### By looking at the logs in C#43, the ui.LATEST log showed that
[3590:3590:0228/110941.051163:ERROR:device_event_log_impl.cc(156)] … Failed to connect device: org.bluez.Error.Failed: Software caused connection abort

### Checking btsnoop_chromebook.log to see what happened on the controller.
### The controller tried to connect but failed with timeout. This was why it showed connection abort in ui.LATEST.
< HCI Command: LE Create Connection   [hci0] 11:09:39.040074
> HCI Event: Command Status   [hci0] 11:09:39.041940
    LE Create Connection  Status: Success (0x00)
< HCI Command: LE Create Connection Cancel   [hci0] 11:09:41.045723
> HCI Event: Command Complete   [hci0] 11:09:41.046920
    LE Create Connection Cancel  Status: Success (0x00)
> HCI Event: LE Meta Event   [hci0] 11:09:41.047935
    LE Connection Complete
        Status: Unknown Connection Identifier (0x02)

### Controller tried again 20 seconds later. It succeeded this time.
< HCI Command: LE Create Connection   [hci0] 11:10:01.171046
> HCI Event: Command Status   [hci0] 11:10:01.172870
    LE Create Connection  Status: Success (0x00)
> HCI Event: LE Meta Event  [hci0] 11:10:02.256888
    LE Connection Complete
        Status: Success (0x00)

Next two LE connection attempts failed again with timeouts which were followed by another successful connection. The connection errors happened quite frequently.

Cc: sukumar....@intel.com
Hi Sukumar, could you help take a look at the logs in C#43 and my comments in C#45, and clarify whether the connection timeout was a hardware/firmware issue? Thanks!

I checked the log that captured for 10 min test:
20 -success
14 -Failure
=> I dont see any issue in Chrome (StP D1). I will check the Android log and will update.
=> Also please capture the Air log, when you see the issue next time.
Cc: rjahagir@chromium.org
 Issue 828158  has been merged into this issue.
About C#48: the same "Unknown Connection Identifier" caused the fitbit arc++ app failed to connect the fitbit zip device.
Re C#47: Hi Sukumar, when you ran magic tethering, could you check the btsnoop logs on the chromebook? The "Unknown Connection Identifier" issue occurs frequently and is supposed to reproduce pretty easily.
This occurs when due to some reason LE connection is not established within 2sec, hence no connection handle available in Fw for this new connection. Followed by Linux-kernel BT-stack continues and sends the disconnect request after 2sec (Linux host cmd timeout). As no connection handle available in Fw, hence Fw throws status as "Unknown Connection Identifier", for this case. 
Also, we need to handle when the connection is not established in 2sec, and separately.

Example:
HCI_CMD	Sent LE Create Connection Cancel
HCI_EVT	Rcvd Command Complete (LE Create Connection Cancel)
HCI_EVT	Rcvd LE Meta (LE Connection Complete)
    Status: Unknown Connection Identifier (0x02)
    Connection Handle: 0x0000

Cc: dmitrygr@google.com
I had this chat with Miao also yesterday, but should we consider increasing the 2sec kernel timeout?

I am wondering whether there are substantial number of cases where it takes 2+ seconds to successfully create the connection. Do we have data that tells us if this is the case (i.e., distribution of time it takes to create connections)?
Create connection has two-phase -
Step 1: Wait for Adv sync form the remote device. no timeout ie. wait forever in this stage.  Followed by its enter in 2nd step;

Step 2: where it makes 6 attempts to create a connection.
i.e. total time as 6 * connection_interval

Bluetooth HCI Command - LE Create Connection
    Scan Interval: 96 (60 msec)
    Scan Window: 96 (60 msec)
    Connection Interval Min: 40 (50 msec)
    Connection Interval Max: 56 (70 msec)

cont.. 
Observation: No Connection complete event received in 2sec (Success/ Failure).
i.e. Controller is waiting for Adv sync. hence Linux-kernel BT-stack sends the disconnect request after 2sec as clean up and to allow the next activity.
For step 1, is the firmware implementation to wait forever or is there an artificial timeout that is used in the firmware?
 Issue 847627  has been merged into this issue.

Sign in to add a comment