WPA2 Wifi radomly disconnected and network being blacklisted |
|||||||||||
Issue descriptionChromeOS version: 65, 67 68.0.3440.118 ChromeOS device model: kefka Case#: 16863575 Description: Customer is using wifi "SSID=KIPPNYC Chromies" and chromebooks randomly disconnect from network with 'Out of range" in UI with always good netowrk range in logs. Logfile and datettime when issue happened below: Drive link to logs: https://drive.google.com/corp/drive/u/0/folders/1d8sY0wF8BOH7eBo2BzqRzWCNx_Bzs70H In logfiles I see suspicious 2018-09-13T14:49:28.541462-04:00 DEBUG wpa_supplicant[547]: RSN: Ignored PMKID candidate without preauth flag 2018-09-13T14:49:28.541570-04:00 DEBUG wpa_supplicant[547]: message repeated 10 times: [ RSN: Ignored PMKID candidate without preauth flag] followed by 2018-09-13T14:49:27.269978-04:00 NOTICE wpa_supplicant[547]: wlan0: CTRL-EVENT-DISCONNECTED bssid=5c:50:15:37:b3:38 reason=4 locally_generated=1 2018-09-13T14:49:27.270045-04:00 NOTICE wpa_supplicant[547]: BSSID 5c:50:15:37:b3:38 blacklist count incremented to 2 .. 2018-09-13T14:49:27.275973-04:00 INFO shill[1063]: [INFO:wifi.cc(862)] WiFi wlan0 supplicant updated DisconnectReason to -4
,
Sep 18
Fresh debug logs from the customer: https://drive.google.com/open?id=1Mze5XAYMs9oBt3vrY2w6HbqyCXpypPXf
,
Sep 19
They confirmed they are not using fast roaming and in fresh logs, for [SSID=KIPPNYC Chromies], bssid: 5c:50:15:37:b3:38 which they are connecting to, I see 2018-09-17T14:55:34.773129-04:00 DEBUG wpa_supplicant[541]: nl80211: Drv Event 20 (NL80211_CMD_DEL_STATION) received for wlan0 2018-09-17T14:55:34.773147-04:00 DEBUG wpa_supplicant[541]: nl80211: Delete station 5c:50:15:37:b3:38 2018-09-17T14:55:34.776052-04:00 DEBUG wpa_supplicant[541]: nl80211: Drv Event 39 (NL80211_CMD_DEAUTHENTICATE) received for wlan0 2018-09-17T14:55:34.776069-04:00 DEBUG wpa_supplicant[541]: nl80211: Deauthenticate event 2018-09-17T14:55:34.776086-04:00 DEBUG wpa_supplicant[541]: wlan0: Event DEAUTH (12) received 2018-09-17T14:55:34.776132-04:00 DEBUG wpa_supplicant[541]: wlan0: Deauthentication notification 2018-09-17T14:55:34.776149-04:00 DEBUG wpa_supplicant[541]: wlan0: * reason 3 (locally generated) 2018-09-17T14:55:34.776168-04:00 DEBUG wpa_supplicant[541]: wlan0: * address 5c:50:15:37:b3:38 2018-09-17T14:55:34.776183-04:00 DEBUG wpa_supplicant[541]: Deauthentication frame IE(s) - hexdump(len=0): [NULL] 2018-09-17T14:55:34.776199-04:00 NOTICE wpa_supplicant[541]: wlan0: CTRL-EVENT-DISCONNECTED bssid=5c:50:15:37:b3:38 reason=3 locally_generated=1 Followed by 2018-09-17T14:55:34.776251-04:00 DEBUG wpa_supplicant[541]: Added BSSID 5c:50:15:37:b3:38 into blacklist 2018-09-17T14:55:34.776272-04:00 DEBUG wpa_supplicant[541]: wlan0: Another BSS in this ESS has been seen; try it next 2018-09-17T14:55:34.776288-04:00 DEBUG wpa_supplicant[541]: BSSID 5c:50:15:37:b3:38 blacklist count incremented to 2
,
Sep 21
Josafat: Can you help find an owner for this please ? Not sure why its not triaged yet.
,
Sep 21
+ a few people related to previous issues. Agree with Royans, it's very frustrating to file these P1 and see no action after 6 days. Is the component being monitored? Something else we should be doing?
,
Sep 21
+some people from Chrome OS WiFi team
,
Sep 24
Andrew, it's troubling that a a P1 customer bug hasn't seen any comments by eng yet. Can you find an owner who will engage?
,
Sep 24
+Kirtika to investigate
,
Sep 25
++case 16967071 Model: relm Chrome OS: 10718.88.2/68.0.3440.118 Logs: https://drive.google.com/drive/folders/1RbLKf7fEEB_NUnv66ZJN-Lt27jLsrazM?usp=sharing it's connecting to 00:0c:e6:aa:89:1c with better est_throughput, which is expected, and then, having good sig: -51 , suddenly blacklists it with: "Failure failure out-of-range". 2018-09-18T08:22:13.756518+02:00 NOTICE wpa_supplicant[558]: wlan0: Current BSS: 00:0c:e6:aa:37:07 level=-37 snr=52 est_throughput=65000 2018-09-18T08:22:13.756537+02:00 NOTICE wpa_supplicant[558]: wlan0: Selected BSS: 00:0c:e6:aa:89:1c level=-51 snr=41 est_throughput=135001 2018-09-18T08:22:13.756554+02:00 NOTICE wpa_supplicant[558]: wlan0: Allow reassociation - selected BSS has better estimated throughput (current: 65000, selected: 135001) 2018-09-18T08:22:13.756703+02:00 NOTICE wpa_supplicant[558]: wlan0: SME: Trying to authenticate with 00:0c:e6:aa:89:1c (SSID='bk-elev' freq=5220 MHz) 2018-09-18T08:22:13.757081+02:00 INFO shill[1160]: [INFO:wifi.cc(392)] ScanDone 2018-09-18T08:22:13.758589+02:00 INFO shill[1160]: [INFO:service.cc(312)] Suppressed autoconnect to service 0 (no endpoints) 2018-09-18T08:22:13.781736+02:00 NOTICE wpa_supplicant[558]: wlan0: Trying to associate with 00:0c:e6:aa:89:1c (SSID='bk-elev' freq=5220 MHz) 2018-09-18T08:22:13.782666+02:00 INFO shill[1160]: [INFO:wifi_service.cc(798)] Rep ep updated for 5. [SSID=bk-elev], bssid: 00:0c:e6:aa:89:1c, sig: -51, sec: 802_1x, freq: 5220 2018-09-18T08:22:13.782952+02:00 INFO shill[1160]: [INFO:wifi.cc(1656)] WiFi wlan0 StateChanged completed -> authenticating 2018-09-18T08:22:13.787379+02:00 NOTICE wpa_supplicant[558]: wlan0: Associated with 00:0c:e6:aa:89:1c 2018-09-18T08:22:13.787672+02:00 NOTICE wpa_supplicant[558]: wlan0: CTRL-EVENT-SUBNET-STATUS-UPDATE status=0 2018-09-18T08:22:13.791384+02:00 INFO shill[1160]: [INFO:wifi.cc(1656)] WiFi wlan0 StateChanged authenticating -> associated 2018-09-18T08:22:13.791415+02:00 INFO shill[1160]: [INFO:service.cc(398)] Service 5: state Online -> Associating 2018-09-18T08:22:13.791442+02:00 INFO shill[1160]: [INFO:manager.cc(1496)] Service 5 updated; state: Associating failure Unknown 2018-09-18T08:22:13.792333+02:00 INFO shill[1160]: [INFO:service.cc(312)] Suppressed autoconnect to service 5 (connecting) 2018-09-18T08:22:13.792360+02:00 INFO shill[1160]: [INFO:service.cc(312)] Suppressed autoconnect to service 0 (no endpoints) 2018-09-18T08:22:13.808861+02:00 INFO arc-networkd[5481]: [INFO:manager.cc(81)] Unbinding services 2018-09-18T08:22:13.814079+02:00 NOTICE wpa_supplicant[558]: wlan0: CTRL-EVENT-REGDOM-CHANGE init=CORE type=WORLD 2018-09-18T08:22:13.844912+02:00 NOTICE wpa_supplicant[558]: wlan0: CTRL-EVENT-REGDOM-CHANGE init=COUNTRY_IE type=COUNTRY alpha2=NO 2018-09-18T08:22:21.880457+02:00 NOTICE wpa_supplicant[558]: wlan0: CTRL-EVENT-DISCONNECTED bssid=00:0c:e6:aa:89:1c reason=15 2018-09-18T08:22:21.880524+02:00 NOTICE wpa_supplicant[558]: BSSID 00:0c:e6:aa:89:1c blacklist count incremented to 2 2018-09-18T08:22:21.880757+02:00 ERR wpa_supplicant[558]: nl80211: Failed to open /proc/sys/net/ipv4/conf/wlan0/drop_unicast_in_l2_multicast: No such file or directory 2018-09-18T08:22:21.880789+02:00 ERR wpa_supplicant[558]: nl80211: Failed to set IPv4 unicast in multicast filter 2018-09-18T08:22:21.886616+02:00 ERR shill[1160]: [ERROR:wifi.cc(975)] Disconnected due to inability to connect to the service. 2018-09-18T08:22:21.886719+02:00 INFO shill[1160]: [INFO:service.cc(398)] Service 5: state Associating -> Failure 2018-09-18T08:22:21.886740+02:00 INFO shill[1160]: [INFO:service.cc(984)] Noting an unexpected failure to connect. 2018-09-18T08:22:21.886793+02:00 INFO shill[1160]: [INFO:manager.cc(1496)] Service 5 updated; state: Failure failure out-of-range 2018-09-18T08:22:21.887381+02:00 INFO shill[1160]: [INFO:service.cc(398)] Service 5: state Failure -> Idle 2018-09-18T08:22:21.887448+02:00 INFO shill[1160]: [INFO:manager.cc(1496)] Service 5 updated; state: Idle failure Unknown I will ask to update to 69 and advise the if behavior is the same.
,
Sep 26
,
Sep 27
We are in the process of updating test groups to v69 and scheduling observations.
,
Sep 28
Haven't looked at the logs but responding to original description/first comment:
These two log lines are benign:
2018-09-13T14:49:28.541462-04:00 DEBUG wpa_supplicant[547]: RSN: Ignored PMKID candidate without preauth flag
2018-09-13T14:49:28.541570-04:00 DEBUG wpa_supplicant[547]: message repeated 10 times: [ RSN: Ignored PMKID candidate without preauth flag]
In the log lines below, the -4 is the interesting part. We (chromebook/client/device) are choosing to disconnect from the AP
because we cannot hear it ("inactivity").
2018-09-13T14:49:27.269978-04:00 NOTICE wpa_supplicant[547]: wlan0: CTRL-EVENT-DISCONNECTED bssid=5c:50:15:37:b3:38 reason=4 locally_generated=1
2018-09-13T14:49:27.270045-04:00 NOTICE wpa_supplicant[547]: BSSID 5c:50:15:37:b3:38 blacklist count incremented to 2
..
2018-09-13T14:49:27.275973-04:00 INFO shill[1063]: [INFO:wifi.cc(862)] WiFi wlan0 supplicant updated DisconnectReason to -4
+Matthew Wang -who might be the right owner for this. He made some roaming improvements that should have gone in ~68.
,
Sep 28
We have logs from a device on version 68.0.3440.118 that still experiences the same issue. 758 2018-09-27T15:07:25.365573-04:00 NOTICE wpa_supplicant[547]: wlan0: CTRL-EVENT-DISCONNECTED bssid=5c:50:15:37:b3:38 reason=3 locally_generated=1 I believe reason 3 is the following: Deauthenticated because sending STA is leaving (or has left) IBSS or ESS
,
Sep 28
FWIW - additional logs were added yesterday afternoon here: https://drive.google.com/drive/folders/1G1xDwCkoLpAEWzl8SGRmVB_xA6Yz6tkv Issues occurred between 2 PM and 4 PM for Chromebooks in at least 3 physical locations (all separate Internet lines and infrastructure). We were able to grab logs at 1 location due to distributed of limited IT support resources.
,
Sep 28
The last time this issue occurred, did it happen generally at the same time across the schools that were affected? How many times does the issue present itself throughout the day? Also just to confirm the details of the network, the authentication that is being used is WPA-PSK? Or WPA2-PSK pushed through the admin console? I see that the SSID is KIPPNYC Chromies @12 I found this referencing the errors that wpa_supplicant gives. It seems to default to an "out of range" error based on the code, even though the reason might be different. https://chromium-review.googlesource.com/c/aosp/platform/system/connectivity/shill/+/1105397 Still digging through to find root cause.
,
Oct 1
,
Oct 1
Chatted with kirtika@ offline about getting more information on this. @akhouderchah - is the progress you're working on on changing "out of range" errors related to: https://chromium-review.googlesource.com/c/chromiumos/platform2/+/1226113 ? If so it would be great to have the updates on that synced with this bug.
,
Oct 1
It would also be great to have some guidance on here on what information to work with the customer on providing that would provide better insight as to what the root cause of this may be.
,
Oct 1
Alex is driving this, just replying to #13 to clear up some confusion: Reason code 3 is a very generic reason code and in most cases, benign/not indicavtive of an issue. "reason=3 locally_generated=1" mostly happens when we wake up from suspend, realize we are not connected to the AP because we were in suspend, and issue a locally generated disconnect. This is only to clear our state, because the AP should have long disconnected us while we were in suspend.
,
Oct 1
(Again, Alex is driving this, just sharing my observations here). I took a quick (10-15mins) look at the logs from debug-logs_20180927-162703.tgz Some observations, decreasing order of priority: 1. "SSID=KIPPNYC Chromies" is on 5 GHz at 5500 MHz which is a DFS channel i.e. radars can operate on this channel and we are required by regulatory laws to lower our tx power on this channel and vacate it if we sense radar. My guess is that you will see more stability if the admins can move their network to a non-DFS channel (say 5180 MHz / 5220 Mhz / 5745 MHz / 5765 MHz). We do not have enough logging/visibility today to know if we left a channel because of DFS - we might be open to working with this school to improve our DFS handling if they test another channel and find it works better. 2. Confirmed from device logs that our wifi driver and network manager (shill) are otherwise fine, no issues being reported. All disconnects are either generic reason code 3 (which could be hiding/masking other problems like DFS, but usually 3 is because we disconnect after wake from suspend or while shutting down) or reason code 4 (inactivity aka we dont hear the AP after multiple attempts) which is indicative of a noisy environment. There's a sporadic reason code 7 which is "we think we are associated with the AP but the AP doesnt think so", or reason code 15 "4-way handshake timed out". All of these are indirect noise indicators. I am assuming this is environmental noise since the device is Kefka and we have no reports/prior evidence of board noise on kefka. tl;dr - we believe its an environment or configuration issue, not a device-side issue.
,
Oct 1
Thank you for the info. Was it a recent change to a specific ChromeOS to lower the tx power and vacate on 5500MHz at 5GHz? I only ask because the connectivity issues are new (August onward) whereas our wifi configurations have not changed with respect to channelization for quite some time. We have been running 5GHz-capable clients for quite some time as well (Chromebooks) and have not seen the same issues. I'm from the district; we are also validating with our network folks to verify the channelization as it currently stands.
,
Oct 1
August is when R68 was released, so we could look into R67->R68 differences, but your original bug description also mentions R65 and R67 which were released in May and July respectively, so sounds like the transition wasnt in the OS?
,
Oct 1
These are the logs from the reported V65 issue, not sure if they show the same information or not as what drove the initial analysis: https://drive.google.com/drive/folders/1Hty19gib8hUyvmw90ULL3RLsux_P9o7i?usp=sharing. From our observations, it was not until we updated using the USB sticks to v65 that we saw the disconnects broadly at multiple campuses. Prior to that, would need to go back to old exports to look at the distribution range of CrOS. Prob most were on 56 or 57.
,
Oct 1
I'd recommend not debugging R65 at all. It was a special release where we tried using Intel's cfg80211 (layer in the wireless stack). See here for details: https://bugs.chromium.org/p/chromium/issues/detail?id=782368 Its not surprising that DFS handling in R65 would be different from that in our regular releases (which use upstream cfg80211 for 3.18 kernel for Kefka). So step 1 for us on this bug should be to separate out R67/R68 logs and focus on those, discarding R65. I am a bit confused because your spreadsheet here lists only R67/R68: https://docs.google.com/spreadsheets/d/1_sivPzLCTSwsA7bFs9SVij_RGAapQHzx_TD2gLAncrs/edit#gid=0 I can confirm that the logs I looked at in #20 are R68 (10718.88.2): pdate_engine.20180927-155212: <app appid="{2EDF7C6D-6586-D292-0B61-4A33B62E3BF6}" cohort="1:4:" cohortname="kefka_stable" version="10718.88.2" track="stable-channel" lang="en-US" board="kefka-signed-mpkeys" hardware_class="KEFKA D25-A6L-O4P-A2I-B3Q" delta_okay="true" fw_version="" ec_version="" installdate="4270" >
,
Oct 2
Hi All, We current have a Support Case open for this and it would be imperative to continue working off of that while we investigate the root cause of this issue. @ ltaiclet - If you are available it would be helpful to set up a support call with you and Mick. Please let us know in the Case when it would be best to set this up.
,
Oct 2
Just an update:
The connection flow generally looks like one of the two:
1) * Initially connect to KIPPNYC Chromies @ 5500
* Disconnect (usually w/locally-generated code 4)
* Connect to KIPPNYC Chromies @ 2437
2) * Initially connect to KIPPNYC Chromies @ 2437
* Immediately decide to switch to 5GHz based on expected throughput increase
* Connect to KIPPNYC Chromies @ 5500
* Disconnect (usually w/locally-generated code 4)
* Connect to KIPPNYC Chromies @ 2437
One thing to note is that the 5500 and 2437 KIPPNYC Chromies BSSs are from two different APs.
In particular:
* 5c:50:15:37:b3:37 broadcasts KIPPNYC Chromies @ 2437 AND KIPPNYC Staff Mobile @ 5500
* 5c:50:15:37:b3:38 broadcasts KIPPNYC Chromies @ 5500 AND KIPPNYC Staff Mobile @ 2437
So if this is an issue with DFS, a chromebook connected to KIPPNYC Staff Mobile @ 5500 should be having the same issue at that point in time (not sure if we could find logs for that though).
@sylcat, that CL causes shill to report more reasonable error messages to the UI. In this case, the disconnects from KIPPNYC Chromies @ 5500 would generate "unknown disconnect" reasons rather than "out of range", which is less confusing for the user but doesn't help us narrow down this issue.
,
Oct 3
Thanks for the update akhouderchah@! If it would be helpful from a time perspective to have additional devices/networks to test on please feel free to reach out to me and I can assist with additional testing.
,
Oct 5
Is there any update on changing the KIPPNYC Chromies channels? I've seen a few different frequencies used by that SSID in the logs, but the issue always seems happen with DFS frequencies (INFM-CB-1435ZM2 showed this issue at 5745 MHz--which is not a DFS channel--but that machine was running on R65, which as Kirtika mentioned in #24 should be disregarded). In one log, we have a similar connection flow as in #26, but the successful connection is with 5c:50:15:72:a4:08 on frequency 5180, which we switch away from because 5c:50:15:b3:38 (freq 5500) is closer. That is, we see no issue on the non-DFS channel 36, but then have issues with the DFS channel 100). In many of these logs, not only do we see a DFS channel being used, but we see that same channel used by many other networks (e.g. KNYCRT-N, KNYCSTU-N, KNYGuest, KIPPNYC Staff Mobile, and KIPPNYC Chromies all on 5500 and all relatively close to the chromebook in question). In general, it is better in terms of network performance to have different networks on different channels, so moving KIPPNYC Chromies to an unused channel could be beneficial even if this issue was not occurring.
,
Oct 5
Thanks for the update! The Support Case has been updated and we will continue to try to gather more information on the network infrastructure that's currently being used. ltaiclet@ chatted with us earlier today and informed us that an upgrade that they're pushing to their infrastructure has greatly helped mediate the issue but that we should continue to look towards a root cause. Based on the information in #28 it seems for now that the cause seems to be frequency over-saturation and DFS channel interference so we're hoping seeing something from Cisco's side will confirm this, especially if we're able to see any drastic differences between what has been used so far and what the new infrastructure is changing. If there's any other information that will be helpful in this investigation please feel free to pass it along.
,
Oct 9
"ltaiclet@ chatted with us earlier today and informed us that an upgrade that they're pushing to their infrastructure has greatly helped mediate the issue but that we should continue to look towards a root cause." What is this upgrade? Can we get technical details? I think we mentioned earlier that 5500 MHz was sub-optimal and it would be helpful to have data from non-DFS channels. Have they tested non-DFS channels?
,
Oct 9
@kirtika - Unfortunately the only information we've been provided is that customer is pushing new hardware and firmware upgrades. We are waiting to hear back with any networking information that they can provide us and we will pass them along to this bug as soon as we receive them.
,
Oct 10
Cisco TAC is in possession of WLC configs for review/assessment in light of the log information around the DFS channels. Once we hear back, we will provide an update.
,
Nov 13
Hi, Could you please confirm if this EDU/Enterprise customer is also affected by this WIFI stability issue? Thank you. Unify # 17492722 - Chrome OS version. 70.0.3538.76 - Devices make/model: LG Chromebase 22CB25S - Codename: monroe |======Issue Definition======| Chromebase keeps dropping connections - Steps to reproduce 1. Boot Chromebase 2. Check WiFi status 3. Connect will go from ""Strong"" > ""Weak"" then autodisconnect 4. Also happen after the users logs in - What's the expected behavior? The chrome device should hold to the WiFi - What's the actual result? Devices will drop the connection randomly - Timeframe when issue started: 2 weeks ago - Does it affect all devices? 26 devices |====== Errors in the logs======| Online failure Unknown Configuring failure Unknown Associating failure Unknown blacklist count incremented to 2 Failure failure out-of-range Link to files: https://drive.google.com/drive/folders/11U3s3nbfoFygBKAc6r4-wRnpf3yj5RpU?usp=sharing
,
Jan 3
Any updates on this? Were the configuration changes made by Cisco effective in eliminating this issue? |
|||||||||||
►
Sign in to add a comment |
|||||||||||
Comment 1 by roy...@google.com
, Sep 18