Issue metadata
Sign in to add a comment
|
[wifi_matfunc] [Daisy] "Significant difference in rtt due to bgscan" failures on a few testcases. |
||||||||||||||||||||||
Issue descriptionFollowing wifi_matfunc testcases failing for M70 and M71 with "Significant difference in rtt due to bgscan" / "Significant difference in rtt due to OBSS" error. 1)network_WiFi_BgscanBackoff.wifi_bgscan_backoff 2)network_WiFi_BgscanBackoff.5760_noise_check 3)network_WiFi_OverlappingBSSScan Logs@ https://stainless.corp.google.com/search?view=matrix&row=build&col=board_model&first_date=2018-09-12&last_date=2018-10-09&suite=wifi_matfunc&test=network_WiFi_BgscanBackoff.wifi_bgscan_backoff&board=daisy%7Cveyron_mickey&status=GOOD&status=WARN&status=FAIL&status=ERROR&status=ABORT&status=ALERT&status=RUNNING&status=TEST_NA&exclude_cts=true&exclude_not_run=false&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=true https://stainless.corp.google.com/search?view=matrix&row=build&col=board_model&first_date=2018-09-12&last_date=2018-10-09&suite=wifi_matfunc&test=network_WiFi_BgscanBackoff.5760_noise_check&board=daisy%7Cveyron_mickey&status=GOOD&status=WARN&status=FAIL&status=ERROR&status=ABORT&status=ALERT&status=RUNNING&status=TEST_NA&exclude_cts=true&exclude_not_run=false&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=true https://stainless.corp.google.com/search?view=matrix&row=build&col=board_model&first_date=2018-09-12&last_date=2018-10-09&suite=wifi_matfunc&test=network_WiFi_OverlappingBSSScan&board=daisy%7Cveyron_mickey&status=GOOD&status=WARN&status=FAIL&status=ERROR&status=ABORT&status=ALERT&status=RUNNING&status=TEST_NA&exclude_cts=true&exclude_not_run=false&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=true sample failure : Significant difference in rtt due to bgscan: 3826.0 > 1.4 + 200
,
Oct 11
,
Oct 19
I'm sort of looking at this. Looking at packet captures for this (hint: easy to enable by running test_that with --args=packet_capture=true), it looks like something on the DUT side is *really* slowing down on packet processing. Like, if I'm reading it right, I don't actually see retransmissions for some of these ICMP replies, but the DUT is only Ack'ing (link layer) them ~30 seconds later (!!). That doesn't make a whole lot of sense to me (can the Wifi card really buffer 30 seconds of traffic?), but it does sound like device-side problem. I tried rerunning with older versions of these tests, and the problem remains. It seems like either something about the test environment, or about the OS itself has changed. But haven't really been messing with these drivers, so I'm not sure what would affect something like this. For reference, I'm attaching one pcap and .DEBUG log from the failing portion of a test run I initiated. I looked at samples like this, which yielded ~4 second RTT: 10/18 17:59:45.560 DEBUG| utils:0287| [stdout] 64 bytes from 192.168.0.254: icmp_seq=71 ttl=64 time=1.46 ms 10/18 17:59:46.062 DEBUG| utils:0287| [stdout] 64 bytes from 192.168.0.254: icmp_seq=72 ttl=64 time=1.57 ms 10/18 17:59:50.391 DEBUG| utils:0287| [stdout] 64 bytes from 192.168.0.254: icmp_seq=73 ttl=64 time=3828 ms 10/18 17:59:50.391 DEBUG| utils:0287| [stdout] 64 bytes from 192.168.0.254: icmp_seq=74 ttl=64 time=3323 ms 10/18 17:59:50.391 DEBUG| utils:0287| [stdout] 64 bytes from 192.168.0.254: icmp_seq=75 ttl=64 time=2813 ms 10/18 17:59:50.391 DEBUG| utils:0287| [stdout] 64 bytes from 192.168.0.254: icmp_seq=76 ttl=64 time=2313 ms 10/18 17:59:50.392 DEBUG| utils:0287| [stdout] 64 bytes from 192.168.0.254: icmp_seq=77 ttl=64 time=1814 ms 10/18 17:59:50.392 DEBUG| utils:0287| [stdout] 64 bytes from 192.168.0.254: icmp_seq=78 ttl=64 time=1314 ms 10/18 17:59:50.591 DEBUG| utils:0287| [stdout] 64 bytes from 192.168.0.254: icmp_seq=79 ttl=64 time=1.61 ms 10/18 17:59:51.092 DEBUG| utils:0287| [stdout] 64 bytes from 192.168.0.254: icmp_seq=80 ttl=64 time=1.64 ms 10/18 17:59:51.593 DEBUG| utils:0287| [stdout] 64 bytes from 192.168.0.254: icmp_seq=81 ttl=64 time=1.66 ms 10/18 17:59:52.093 DEBUG| utils:0287| [stdout] 64 bytes from 192.168.0.254: icmp_seq=82 ttl=64 time=1.39 ms
,
Oct 19
,
Oct 24
This definitely follows the OS, not the tests. Running the exact same test checkout, I pass on the latest R69 images and fail on M70. This also doesn't follow the kernel, since that same R69 image still passes with ToT kernels. I'm wondering if something else in the system is just causing a performance oddity.
,
Oct 25
Oh, wow, it's this change: https://chromium-review.googlesource.com/1180885 commit 3552549f4a9ce929f45bbeef6f75cb1d77879015 Author: Brian Norris <briannorris@chromium.org> Date: Wed Aug 22 10:09:20 2018 -0700 Revert "CHROMIUMOS: Ignore driver's feature flag for low priority scan" This reverts commit 7a86ee6754261723ab71c2c8c54cf79d2f481402. This causes us to try low priority scans on drivers that never advertised this feature. Contrary to the original commit message, this does *not* help -- it is actively harmful, because in the end we just get back -EOPNOTSUPP and the scan is skipped entirely. This causes brcmfmac (e.g., minnie) and ath10k to fail network_WiFi_BgscanBackoff.wifi_bgscan_backoff and other tests, and it basically neuters any attempt at background scans. Instead, we should: (a) retain the original supplicant logic, that we only ask the driver for a low priority scan when it advertises support for it (if it doesn't support it, we just initiate "normal" background scans) and (b) fix any drivers that need this support to actually advertise NL80211_FEATURE_LOW_PRIORITY_SCAN Note: some firmwares implement scanning behavior such that they back off automatically if there is data traffic, and so NL80211_FEATURE_LOW_PRIORITY_SCAN is not necessarily a driver requirement. Also note: I've grepped through all our Wifi drivers for NL80211_SCAN_FLAG_LOW_PRIORITY and NL80211_FEATURE_LOW_PRIORITY_SCAN, and I have found no such driver that honors NL80211_SCAN_FLAG_LOW_PRIORITY but doesn't also advertise NL80211_FEATURE_LOW_PRIORITY_SCAN. So again, I don't think the original commit is doing anything useful for us. On the other hand, it *is* inspiring silly changes like this Linux commit: commit d9f5725fb00b87f99d6fc876d27f7d54ab351669 Author: Amitkumar Karwar <akarwar@marvell.com> Date: Tue Mar 22 12:09:56 2016 +0800 mwifiex: advertise low priority scan feature ... This patch fixes a problem in which OBSS scan request from wpa_supplicant was being rejected by cfg80211. I'm pretty sure that this "wpa_supplicant" problem is just a "Chrome OS wpa_supplicant" problem. Similar needless changes appear here: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/a8d0bd3bb2442f24c937ce930776016a01866935/drivers/net/wireless/ar10k/ath/ath10k/mac.c#7503 /* ChromiumOS wpa_supplicant sends the background scan req. with * NL80211_SCAN_FLAG_LOW_PRIORITY enabled. ath10k scans using hw_scan * (firmware scan implementation. Advertise the firmware can handle the * scan requests and perform the scan w/o influencing the latencies. */ ar->hw->wiphy->features |= NL80211_FEATURE_LOW_PRIORITY_SCAN; All in all, there's no need for drivers to advertise this feature if their existing hw_scan() is performant enough. BUG=b:74572201 TEST=`network_WiFi_BgscanBackoff.wifi_bgscan_backoff` and `network_WiFi_OverlappingBSSScan` on ath10k, on older mwifiex/brcmfac, etc. Change-Id: Ibafb98ce4a7fd4f8c896e80b15df4c02e5593159 Signed-off-by: Brian Norris <briannorris@chromium.org> Reviewed-on: https://chromium-review.googlesource.com/1185523 Reviewed-by: Grant Grundler <grundler@chromium.org> I guess I was wrong in that CL, because this whole enum in wireless-3.4 is a totally bogus, never-upstreamed thing: enum cfg80211_scan_flags { CFG80211_SCAN_FLAG_TX_ABORT = NL80211_SCAN_FLAG_TX_ABORT, }; That happened to get upstream under a different feature name, with the (coincidentally?) same value: NL80211_SCAN_FLAG_LOW_PRIORITY. Except that there's no corresponding feature flag in wireless-3.4. So I guess the first question is: does this pattern actually matter to the user experience? It wasn't actually clear to me that this "background scan" feature actually gets used by shill, except in this test. If not, then it doesn't really matter. But if this *is* really broken: I could either revert the revert, or else add the feature flag properly to wireless-3.4. Or maybe both, with the less risky approach (revert) cherry-picked first.
,
Oct 25
Sorry, I've been observing this thread (including the CL pointed to above) from the sidelines, not putting much diligence into it. Just want to comment on one thing for now: "It wasn't actually clear to me that this "background scan" feature actually gets used by shill, except in this test. If not, then it doesn't really matter." We definitely use background scans in shill. At shill startup (or when it finds wpa-supplicant), shill sends wpa-supplicant "bgscan" parameters. Our current scheme is: - use algorithm "simple" for background scans (wpa-s offers choices...) - do a background scan once every X seconds if signal > -72 dBm, I believe X is 600 or so. - do a background scan once every Y seconds if signal < -72 dBm, I believe Y=30.
,
Oct 25
This was a big change on a roaming bug: https://chromium-review.googlesource.com/c/aosp/platform/system/connectivity/shill/+/1126928
,
Oct 25
This fixes mwifiex on kernel v3.8 / wireless-3.4 (i.e., daisy/snow): https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1298371
,
Oct 25
(Side note: while this bug sort of links in veyron_mickey in the stainless links, that seems to be a different problem, that doesn't correlate with that CL -- it was failing before that. And its error is different: "Background scans should detect new BSSeswithin an associated ESS.") Adding R-B-S, since this is probably not something we want to roll out for these devices.
,
Oct 25
Preemptive merge request for CL from #9, given the timeline. This only affects Marvell-Wifi devices on kernel 3.8, which would be Peach (Pit, Pi), Daisy, Snow.
,
Oct 25
This bug requires manual review: Request affecting a post-stable build Please contact the milestone owner if you have questions. Owners: benmason@(Android), kariahda@(iOS), geohsu@(ChromeOS), abdulsyed@(Desktop) For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Oct 25
Sorry, but I don't have a feel for risk or test results.. Possible to provide context? On a few boards, but still production boards.
,
Oct 25
Risk: very little. We're just exposing the "low priority scan" feature flag properly within this driver, which we used to just ignore (and use the feature anyway). So this essentially puts this particular driver back on pre-R70 behavior. Tests: I tested daisy_spring with all the tests shown in the original report. This clearly fixes the problem. For more testing, I guess we can wait and give the lab another run at running the rest of our Wifi suites on canary. The problem: we normally try to de-prioritize background scanning, so that it doesn't interfere with TX traffic. This particular driver didn't advertise that feature properly, so the CL in #6 broke that behavior -- now background scans are causing additional (~4 second) latency at times. AIUI, these background scans can be fairly frequent if you're on marginal signal (since we're looking for better networks). That said, I don't have a device on hand to get "real world" feel from natural browsing behaviors. But this test is showing pretty clear failures in the lab, and I've looked at those pretty closely. What's the reality of M70 schedule? The schedule *says* it should be out already, but that doesn't appear to be the case. I can try to check in on more tests tomorrow. I just wanted to get the request out there, in case it's not too late.
,
Oct 25
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/63339c90a50443b7c2473c8a67ae26bccead58e3 commit 63339c90a50443b7c2473c8a67ae26bccead58e3 Author: Brian Norris <briannorris@chromium.org> Date: Thu Oct 25 14:54:19 2018 CHROMIUM: mwifiex: advertise NL80211_FEATURE_LOW_PRIORITY_SCAN It turns out this version of the driver *does* support LOW_PRIORITY_SCAN, but it's just hidden by the fact that it uses a downstream version called CFG80211_SCAN_FLAG_TX_ABORT / NL80211_SCAN_FLAG_TX_ABORT, with the same values. But it's missing the feature flag, so it got missed in CL:1185523 ("Revert "CHROMIUMOS: Ignore driver's feature flag for low priority scan""). Introduce the feature flag as it appears upstream, so wpa_supplicant knows how to look for it again. BUG= chromium:893857 TEST=`network_WiFi_BgscanBackoff` tests Change-Id: I697838eb02646029481e7487ce2a3b86e6a62361 Signed-off-by: Brian Norris <briannorris@chromium.org> Reviewed-on: https://chromium-review.googlesource.com/1298371 Reviewed-by: Kirtika Ruchandani <kirtika@chromium.org> [modify] https://crrev.com/63339c90a50443b7c2473c8a67ae26bccead58e3/drivers/net/wireless-3.4/mwifiex/cfg80211.c [modify] https://crrev.com/63339c90a50443b7c2473c8a67ae26bccead58e3/include/wireless-3.4/linux/nl80211.h
,
Oct 25
This bug requires manual review: M71 has already been promoted to the beta branch, so this requires manual review Please contact the milestone owner if you have questions. Owners: benmason@(Android), kariahda@(iOS), kbleicher@(ChromeOS), govind@(Desktop) For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Oct 25
Per offline chat with briannorris@, approved for M70 ChromeOS.
,
Oct 25
Approving merge to M71 Chrome OS.
,
Oct 25
Thanks! I'm doing a sanity check of a branch M70 build, to make extra sure things are really definitely in shape: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8931675950836339008 https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1299475 I'll run that through the lab manually. And kicking off M71 build too: https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8931673186258513312 https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1299476
,
Oct 25
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/e8312be6bd565a987c3ac551a0383809fd205557 commit e8312be6bd565a987c3ac551a0383809fd205557 Author: Brian Norris <briannorris@chromium.org> Date: Thu Oct 25 21:29:14 2018 CHROMIUM: mwifiex: advertise NL80211_FEATURE_LOW_PRIORITY_SCAN It turns out this version of the driver *does* support LOW_PRIORITY_SCAN, but it's just hidden by the fact that it uses a downstream version called CFG80211_SCAN_FLAG_TX_ABORT / NL80211_SCAN_FLAG_TX_ABORT, with the same values. But it's missing the feature flag, so it got missed in CL:1185523 ("Revert "CHROMIUMOS: Ignore driver's feature flag for low priority scan""). Introduce the feature flag as it appears upstream, so wpa_supplicant knows how to look for it again. BUG= chromium:893857 TEST=`network_WiFi_BgscanBackoff` tests Change-Id: I697838eb02646029481e7487ce2a3b86e6a62361 Signed-off-by: Brian Norris <briannorris@chromium.org> Reviewed-on: https://chromium-review.googlesource.com/1298371 Reviewed-by: Kirtika Ruchandani <kirtika@chromium.org> (cherry picked from commit 63339c90a50443b7c2473c8a67ae26bccead58e3) Reviewed-on: https://chromium-review.googlesource.com/c/1299476 [modify] https://crrev.com/e8312be6bd565a987c3ac551a0383809fd205557/drivers/net/wireless-3.4/mwifiex/cfg80211.c [modify] https://crrev.com/e8312be6bd565a987c3ac551a0383809fd205557/include/wireless-3.4/linux/nl80211.h
,
Oct 25
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/kernel/+/827e2ed267c2e82dbd39a1b7f2db8f38d267e87d commit 827e2ed267c2e82dbd39a1b7f2db8f38d267e87d Author: Brian Norris <briannorris@chromium.org> Date: Thu Oct 25 21:29:18 2018 CHROMIUM: mwifiex: advertise NL80211_FEATURE_LOW_PRIORITY_SCAN It turns out this version of the driver *does* support LOW_PRIORITY_SCAN, but it's just hidden by the fact that it uses a downstream version called CFG80211_SCAN_FLAG_TX_ABORT / NL80211_SCAN_FLAG_TX_ABORT, with the same values. But it's missing the feature flag, so it got missed in CL:1185523 ("Revert "CHROMIUMOS: Ignore driver's feature flag for low priority scan""). Introduce the feature flag as it appears upstream, so wpa_supplicant knows how to look for it again. BUG= chromium:893857 TEST=`network_WiFi_BgscanBackoff` tests Change-Id: I697838eb02646029481e7487ce2a3b86e6a62361 Signed-off-by: Brian Norris <briannorris@chromium.org> Reviewed-on: https://chromium-review.googlesource.com/1298371 Reviewed-by: Kirtika Ruchandani <kirtika@chromium.org> (cherry picked from commit 63339c90a50443b7c2473c8a67ae26bccead58e3) Reviewed-on: https://chromium-review.googlesource.com/c/1299475 [modify] https://crrev.com/827e2ed267c2e82dbd39a1b7f2db8f38d267e87d/drivers/net/wireless-3.4/mwifiex/cfg80211.c [modify] https://crrev.com/827e2ed267c2e82dbd39a1b7f2db8f38d267e87d/include/wireless-3.4/linux/nl80211.h
,
Oct 25
It seems I can't find a builder that will actually do a proper trybot on these branches (they all fail for totally unrelated reasons). So I did the next best: take the latest image for the respective branch and manually update the kernel. suite:wifi_matfunc is passing fine.
,
Oct 25
Are you using "cros tryjob --branch" correctly and using the right board configs?
,
Oct 25
I believe so. I've tried all of these: cros tryjob -g 1299475 --branch release-R70-11021.B daisy_skate-no-vmtest-pre-cq cros tryjob -g 1299475 --branch release-R70-11021.B daisy_skate-release-tryjob cros tryjob -g 1299476 --branch release-R71-11151.B daisy_skate-no-vmtest-pre-cq cros tryjob -g 1299476 --branch release-R71-11151.B daisy_skate-release-tryjob
,
Dec 21
Mostly passing on latest runs... https://stainless.corp.google.com/search?view=matrix&row=build&col=board_model&first_date=2018-11-24&last_date=2018-12-21&suite=wifi_matfunc&test=network_WiFi_BgscanBackoff.wifi_bgscan_backoff&board=daisy%7Cveyron_mickey&status=GOOD&status=WARN&status=FAIL&status=ERROR&status=ABORT&status=ALERT&status=RUNNING&status=TEST_NA&exclude_cts=true&exclude_not_run=false&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=true https://stainless.corp.google.com/search?view=matrix&row=build&col=board_model&first_date=2018-11-24&last_date=2018-12-21&suite=wifi_matfunc&test=network_WiFi_BgscanBackoff.5760_noise_check&board=daisy%7Cveyron_mickey&status=GOOD&status=WARN&status=FAIL&status=ERROR&status=ABORT&status=ALERT&status=RUNNING&status=TEST_NA&exclude_cts=true&exclude_not_run=false&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=true https://stainless.corp.google.com/search?view=matrix&row=build&col=board_model&first_date=2018-11-24&last_date=2018-12-21&suite=wifi_matfunc&test=network_WiFi_OverlappingBSSScan&board=daisy%7Cveyron_mickey&status=GOOD&status=WARN&status=FAIL&status=ERROR&status=ABORT&status=ALERT&status=RUNNING&status=TEST_NA&exclude_cts=true&exclude_not_run=false&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=true |
|||||||||||||||||||||||
►
Sign in to add a comment |
|||||||||||||||||||||||
Comment 1 by harpreet@chromium.org
, Oct 11