New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 893857 link

Starred by 1 user

Issue metadata

Status: Verified
Owner:
Closed: Oct 25
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug-Regression



Sign in to add a comment

[wifi_matfunc] [Daisy] "Significant difference in rtt due to bgscan" failures on a few testcases.

Project Member Reported by jmuppala@chromium.org, Oct 9

Issue description

Following wifi_matfunc testcases failing for M70 and M71 with "Significant difference in rtt due to bgscan" / "Significant difference in rtt due to OBSS" error.

1)network_WiFi_BgscanBackoff.wifi_bgscan_backoff

2)network_WiFi_BgscanBackoff.5760_noise_check

3)network_WiFi_OverlappingBSSScan

Logs@
https://stainless.corp.google.com/search?view=matrix&row=build&col=board_model&first_date=2018-09-12&last_date=2018-10-09&suite=wifi_matfunc&test=network_WiFi_BgscanBackoff.wifi_bgscan_backoff&board=daisy%7Cveyron_mickey&status=GOOD&status=WARN&status=FAIL&status=ERROR&status=ABORT&status=ALERT&status=RUNNING&status=TEST_NA&exclude_cts=true&exclude_not_run=false&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=true


https://stainless.corp.google.com/search?view=matrix&row=build&col=board_model&first_date=2018-09-12&last_date=2018-10-09&suite=wifi_matfunc&test=network_WiFi_BgscanBackoff.5760_noise_check&board=daisy%7Cveyron_mickey&status=GOOD&status=WARN&status=FAIL&status=ERROR&status=ABORT&status=ALERT&status=RUNNING&status=TEST_NA&exclude_cts=true&exclude_not_run=false&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=true


https://stainless.corp.google.com/search?view=matrix&row=build&col=board_model&first_date=2018-09-12&last_date=2018-10-09&suite=wifi_matfunc&test=network_WiFi_OverlappingBSSScan&board=daisy%7Cveyron_mickey&status=GOOD&status=WARN&status=FAIL&status=ERROR&status=ABORT&status=ALERT&status=RUNNING&status=TEST_NA&exclude_cts=true&exclude_not_run=false&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=true


sample failure :
Significant difference in rtt due to bgscan: 3826.0 > 1.4 + 200



 
Cc: -harpreet...@samsung.com -briannorris@chromium.org harpreet@chromium.org
Cc: briannorris@chromium.org
Owner: briannorris@chromium.org
Status: Assigned (was: Untriaged)
I'm sort of looking at this.

Looking at packet captures for this (hint: easy to enable by running test_that with --args=packet_capture=true), it looks like something on the DUT side is *really* slowing down on packet processing. Like, if I'm reading it right, I don't actually see retransmissions for some of these ICMP replies, but the DUT is only Ack'ing (link layer) them ~30 seconds later (!!).

That doesn't make a whole lot of sense to me (can the Wifi card really buffer 30 seconds of traffic?), but it does sound like device-side problem.

I tried rerunning with older versions of these tests, and the problem remains. It seems like either something about the test environment, or about the OS itself has changed. But haven't really been messing with these drivers, so I'm not sure what would affect something like this.

For reference, I'm attaching one pcap and .DEBUG log from the failing portion of a test run I initiated. I looked at samples like this, which yielded ~4 second RTT:

10/18 17:59:45.560 DEBUG|             utils:0287| [stdout] 64 bytes from 192.168.0.254: icmp_seq=71 ttl=64 time=1.46 ms
10/18 17:59:46.062 DEBUG|             utils:0287| [stdout] 64 bytes from 192.168.0.254: icmp_seq=72 ttl=64 time=1.57 ms
10/18 17:59:50.391 DEBUG|             utils:0287| [stdout] 64 bytes from 192.168.0.254: icmp_seq=73 ttl=64 time=3828 ms
10/18 17:59:50.391 DEBUG|             utils:0287| [stdout] 64 bytes from 192.168.0.254: icmp_seq=74 ttl=64 time=3323 ms
10/18 17:59:50.391 DEBUG|             utils:0287| [stdout] 64 bytes from 192.168.0.254: icmp_seq=75 ttl=64 time=2813 ms
10/18 17:59:50.391 DEBUG|             utils:0287| [stdout] 64 bytes from 192.168.0.254: icmp_seq=76 ttl=64 time=2313 ms
10/18 17:59:50.392 DEBUG|             utils:0287| [stdout] 64 bytes from 192.168.0.254: icmp_seq=77 ttl=64 time=1814 ms
10/18 17:59:50.392 DEBUG|             utils:0287| [stdout] 64 bytes from 192.168.0.254: icmp_seq=78 ttl=64 time=1314 ms
10/18 17:59:50.591 DEBUG|             utils:0287| [stdout] 64 bytes from 192.168.0.254: icmp_seq=79 ttl=64 time=1.61 ms
10/18 17:59:51.092 DEBUG|             utils:0287| [stdout] 64 bytes from 192.168.0.254: icmp_seq=80 ttl=64 time=1.64 ms
10/18 17:59:51.593 DEBUG|             utils:0287| [stdout] 64 bytes from 192.168.0.254: icmp_seq=81 ttl=64 time=1.66 ms
10/18 17:59:52.093 DEBUG|             utils:0287| [stdout] 64 bytes from 192.168.0.254: icmp_seq=82 ttl=64 time=1.39 ms
router.0.pcap
163 KB Download
network_WiFi_BgscanBackoff.DEBUG
189 KB Download
Labels: wifi-test-failures
This definitely follows the OS, not the tests. Running the exact same test checkout, I pass on the latest R69 images and fail on M70.

This also doesn't follow the kernel, since that same R69 image still passes with ToT kernels.

I'm wondering if something else in the system is just causing a performance oddity.
Cc: grundler@chromium.org kirtika@chromium.org
Labels: -Pri-2 Pri-1
Status: Started (was: Assigned)
Oh, wow, it's this change:

https://chromium-review.googlesource.com/1180885

commit 3552549f4a9ce929f45bbeef6f75cb1d77879015
Author: Brian Norris <briannorris@chromium.org>
Date:   Wed Aug 22 10:09:20 2018 -0700

    Revert "CHROMIUMOS: Ignore driver's feature flag for low priority scan"
    
    This reverts commit 7a86ee6754261723ab71c2c8c54cf79d2f481402.
    
    This causes us to try low priority scans on drivers that never
    advertised this feature. Contrary to the original commit message, this
    does *not* help -- it is actively harmful, because in the end we just
    get back -EOPNOTSUPP and the scan is skipped entirely.
    
    This causes brcmfmac (e.g., minnie) and ath10k to fail
    network_WiFi_BgscanBackoff.wifi_bgscan_backoff and other tests, and it
    basically neuters any attempt at background scans.
    
    Instead, we should:
    (a) retain the original supplicant logic, that we only ask the driver
        for a low priority scan when it advertises support for it (if it
        doesn't support it, we just initiate "normal" background scans) and
    (b) fix any drivers that need this support to actually advertise
        NL80211_FEATURE_LOW_PRIORITY_SCAN
    
    Note: some firmwares implement scanning behavior such that they back off
    automatically if there is data traffic, and so
    NL80211_FEATURE_LOW_PRIORITY_SCAN is not necessarily a driver
    requirement.
    
    Also note: I've grepped through all our Wifi drivers for
    NL80211_SCAN_FLAG_LOW_PRIORITY and NL80211_FEATURE_LOW_PRIORITY_SCAN,
    and I have found no such driver that honors
    NL80211_SCAN_FLAG_LOW_PRIORITY but doesn't also advertise
    NL80211_FEATURE_LOW_PRIORITY_SCAN. So again, I don't think the original
    commit is doing anything useful for us.
    
    On the other hand, it *is* inspiring silly changes like this Linux
    commit:
    
        commit d9f5725fb00b87f99d6fc876d27f7d54ab351669
        Author: Amitkumar Karwar <akarwar@marvell.com>
        Date:   Tue Mar 22 12:09:56 2016 +0800
    
            mwifiex: advertise low priority scan feature
    ...
            This patch fixes a problem in which OBSS scan request from
            wpa_supplicant was being rejected by cfg80211.
    
    I'm pretty sure that this "wpa_supplicant" problem is just a "Chrome OS
    wpa_supplicant" problem. Similar needless changes appear here:
    
    https://chromium.googlesource.com/chromiumos/third_party/kernel/+/a8d0bd3bb2442f24c937ce930776016a01866935/drivers/net/wireless/ar10k/ath/ath10k/mac.c#7503
    
            /* ChromiumOS wpa_supplicant sends the background scan req. with
             * NL80211_SCAN_FLAG_LOW_PRIORITY enabled. ath10k scans using hw_scan
             * (firmware scan implementation. Advertise the firmware can handle the
             * scan requests and perform the scan w/o influencing the latencies.
             */
            ar->hw->wiphy->features |= NL80211_FEATURE_LOW_PRIORITY_SCAN;
    
    All in all, there's no need for drivers to advertise this feature if
    their existing hw_scan() is performant enough.
    
    BUG=b:74572201
    TEST=`network_WiFi_BgscanBackoff.wifi_bgscan_backoff` and
         `network_WiFi_OverlappingBSSScan` on ath10k, on older
         mwifiex/brcmfac, etc.
    
    Change-Id: Ibafb98ce4a7fd4f8c896e80b15df4c02e5593159
    Signed-off-by: Brian Norris <briannorris@chromium.org>
    Reviewed-on: https://chromium-review.googlesource.com/1185523
    Reviewed-by: Grant Grundler <grundler@chromium.org>



I guess I was wrong in that CL, because this whole enum in wireless-3.4 is a totally bogus, never-upstreamed thing:

enum cfg80211_scan_flags {
        CFG80211_SCAN_FLAG_TX_ABORT     = NL80211_SCAN_FLAG_TX_ABORT,
};

That happened to get upstream under a different feature name, with the (coincidentally?) same value: NL80211_SCAN_FLAG_LOW_PRIORITY. Except that there's no corresponding feature flag in wireless-3.4.

So I guess the first question is: does this pattern actually matter to the user experience? It wasn't actually clear to me that this "background scan" feature actually gets used by shill, except in this test. If not, then it doesn't really matter.

But if this *is* really broken: I could either revert the revert, or else add the feature flag properly to wireless-3.4. Or maybe both, with the less risky approach (revert) cherry-picked first.
Sorry, I've been observing this thread (including the CL pointed to above) from the sidelines, not putting much diligence into it. 
Just want to comment on one thing for now: "It wasn't actually clear to me that this "background scan" feature actually gets used by shill, except in this test. If not, then it doesn't really matter."

We definitely use background scans in shill. At shill startup (or when it finds wpa-supplicant), shill sends wpa-supplicant "bgscan" parameters. 
Our current scheme is:
- use algorithm "simple" for background scans (wpa-s offers choices...)
- do a background scan once every X seconds if signal > -72 dBm, I believe X is 600 or so. 
- do a background scan once every Y seconds if signal < -72 dBm, I believe Y=30. 

This fixes mwifiex on kernel v3.8 / wireless-3.4 (i.e., daisy/snow):

https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/1298371
Labels: ReleaseBlock-Stable Kernel-3.8
(Side note: while this bug sort of links in veyron_mickey in the stainless links, that seems to be a different problem, that doesn't correlate with that CL -- it was failing before that. And its error is different:

"Background scans should detect new BSSeswithin an associated ESS.")

Adding R-B-S, since this is probably not something we want to roll out for these devices.
Labels: Merge-Request-70 Merge-Request-71
Preemptive merge request for CL from #9, given the timeline.

This only affects Marvell-Wifi devices on kernel 3.8, which would be Peach (Pit, Pi), Daisy, Snow.
Project Member

Comment 12 by sheriffbot@chromium.org, Oct 25

Labels: -Merge-Request-70 Merge-Review-70 Hotlist-Merge-Review
This bug requires manual review: Request affecting a post-stable build
Please contact the milestone owner if you have questions.
Owners: benmason@(Android), kariahda@(iOS), geohsu@(ChromeOS), abdulsyed@(Desktop)

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Sorry, but I don't have a feel for risk or test results.. Possible to provide context?  On a few boards, but still production boards.

Risk: very little. We're just exposing the "low priority scan" feature flag properly within this driver, which we used to just ignore (and use the feature anyway). So this essentially puts this particular driver back on pre-R70 behavior.

Tests: I tested daisy_spring with all the tests shown in the original report. This clearly fixes the problem. For more testing, I guess we can wait and give the lab another run at running the rest of our Wifi suites on canary.

The problem: we normally try to de-prioritize background scanning, so that it doesn't interfere with TX traffic. This particular driver didn't advertise that feature properly, so the CL in #6 broke that behavior -- now background scans are causing additional (~4 second) latency at times. AIUI, these background scans can be fairly frequent if you're on marginal signal (since we're looking for better networks).

That said, I don't have a device on hand to get "real world" feel from natural browsing behaviors. But this test is showing pretty clear failures in the lab, and I've looked at those pretty closely.

What's the reality of M70 schedule? The schedule *says* it should be out already, but that doesn't appear to be the case. I can try to check in on more tests tomorrow. I just wanted to get the request out there, in case it's not too late.
Project Member

Comment 15 by bugdroid1@chromium.org, Oct 25

Labels: merge-merged-chromeos-3.8
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/63339c90a50443b7c2473c8a67ae26bccead58e3

commit 63339c90a50443b7c2473c8a67ae26bccead58e3
Author: Brian Norris <briannorris@chromium.org>
Date: Thu Oct 25 14:54:19 2018

CHROMIUM: mwifiex: advertise NL80211_FEATURE_LOW_PRIORITY_SCAN

It turns out this version of the driver *does* support
LOW_PRIORITY_SCAN, but it's just hidden by the fact that it uses a
downstream version called CFG80211_SCAN_FLAG_TX_ABORT /
NL80211_SCAN_FLAG_TX_ABORT, with the same values. But it's missing the
feature flag, so it got missed in CL:1185523 ("Revert "CHROMIUMOS:
Ignore driver's feature flag for low priority scan"").

Introduce the feature flag as it appears upstream, so wpa_supplicant
knows how to look for it again.

BUG= chromium:893857 
TEST=`network_WiFi_BgscanBackoff` tests

Change-Id: I697838eb02646029481e7487ce2a3b86e6a62361
Signed-off-by: Brian Norris <briannorris@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/1298371
Reviewed-by: Kirtika Ruchandani <kirtika@chromium.org>

[modify] https://crrev.com/63339c90a50443b7c2473c8a67ae26bccead58e3/drivers/net/wireless-3.4/mwifiex/cfg80211.c
[modify] https://crrev.com/63339c90a50443b7c2473c8a67ae26bccead58e3/include/wireless-3.4/linux/nl80211.h

Project Member

Comment 16 by sheriffbot@chromium.org, Oct 25

Labels: -Merge-Request-71 Merge-Review-71
This bug requires manual review: M71 has already been promoted to the beta branch, so this requires manual review
Please contact the milestone owner if you have questions.
Owners: benmason@(Android), kariahda@(iOS), kbleicher@(ChromeOS), govind@(Desktop)

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Labels: -Merge-Review-70 Merge-Approved-70
Per offline chat with briannorris@, approved for M70 ChromeOS.
Labels: -Merge-Review-71 Merge-Approved-71
Approving merge to M71 Chrome OS.

Project Member

Comment 20 by bugdroid1@chromium.org, Oct 25

Labels: merge-merged-release-R71-11151.B-chromeos-3.8
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/e8312be6bd565a987c3ac551a0383809fd205557

commit e8312be6bd565a987c3ac551a0383809fd205557
Author: Brian Norris <briannorris@chromium.org>
Date: Thu Oct 25 21:29:14 2018

CHROMIUM: mwifiex: advertise NL80211_FEATURE_LOW_PRIORITY_SCAN

It turns out this version of the driver *does* support
LOW_PRIORITY_SCAN, but it's just hidden by the fact that it uses a
downstream version called CFG80211_SCAN_FLAG_TX_ABORT /
NL80211_SCAN_FLAG_TX_ABORT, with the same values. But it's missing the
feature flag, so it got missed in CL:1185523 ("Revert "CHROMIUMOS:
Ignore driver's feature flag for low priority scan"").

Introduce the feature flag as it appears upstream, so wpa_supplicant
knows how to look for it again.

BUG= chromium:893857 
TEST=`network_WiFi_BgscanBackoff` tests

Change-Id: I697838eb02646029481e7487ce2a3b86e6a62361
Signed-off-by: Brian Norris <briannorris@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/1298371
Reviewed-by: Kirtika Ruchandani <kirtika@chromium.org>
(cherry picked from commit 63339c90a50443b7c2473c8a67ae26bccead58e3)
Reviewed-on: https://chromium-review.googlesource.com/c/1299476

[modify] https://crrev.com/e8312be6bd565a987c3ac551a0383809fd205557/drivers/net/wireless-3.4/mwifiex/cfg80211.c
[modify] https://crrev.com/e8312be6bd565a987c3ac551a0383809fd205557/include/wireless-3.4/linux/nl80211.h

Project Member

Comment 21 by bugdroid1@chromium.org, Oct 25

Labels: merge-merged-release-R70-11021.B-chromeos-3.8
The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/kernel/+/827e2ed267c2e82dbd39a1b7f2db8f38d267e87d

commit 827e2ed267c2e82dbd39a1b7f2db8f38d267e87d
Author: Brian Norris <briannorris@chromium.org>
Date: Thu Oct 25 21:29:18 2018

CHROMIUM: mwifiex: advertise NL80211_FEATURE_LOW_PRIORITY_SCAN

It turns out this version of the driver *does* support
LOW_PRIORITY_SCAN, but it's just hidden by the fact that it uses a
downstream version called CFG80211_SCAN_FLAG_TX_ABORT /
NL80211_SCAN_FLAG_TX_ABORT, with the same values. But it's missing the
feature flag, so it got missed in CL:1185523 ("Revert "CHROMIUMOS:
Ignore driver's feature flag for low priority scan"").

Introduce the feature flag as it appears upstream, so wpa_supplicant
knows how to look for it again.

BUG= chromium:893857 
TEST=`network_WiFi_BgscanBackoff` tests

Change-Id: I697838eb02646029481e7487ce2a3b86e6a62361
Signed-off-by: Brian Norris <briannorris@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/1298371
Reviewed-by: Kirtika Ruchandani <kirtika@chromium.org>
(cherry picked from commit 63339c90a50443b7c2473c8a67ae26bccead58e3)
Reviewed-on: https://chromium-review.googlesource.com/c/1299475

[modify] https://crrev.com/827e2ed267c2e82dbd39a1b7f2db8f38d267e87d/drivers/net/wireless-3.4/mwifiex/cfg80211.c
[modify] https://crrev.com/827e2ed267c2e82dbd39a1b7f2db8f38d267e87d/include/wireless-3.4/linux/nl80211.h

Labels: -Merge-Approved-70 -Merge-Approved-71 Merge-Merged-70 Merge-Merged-71
Status: Fixed (was: Started)
It seems I can't find a builder that will actually do a proper trybot on these branches (they all fail for totally unrelated reasons). So I did the next best: take the latest image for the respective branch and manually update the kernel. suite:wifi_matfunc is passing fine.
Are you using "cros tryjob --branch" correctly and using the right board configs? 

I believe so. I've tried all of these:

cros tryjob -g 1299475 --branch release-R70-11021.B daisy_skate-no-vmtest-pre-cq
cros tryjob -g 1299475 --branch release-R70-11021.B daisy_skate-release-tryjob
cros tryjob -g 1299476 --branch release-R71-11151.B daisy_skate-no-vmtest-pre-cq
cros tryjob -g 1299476 --branch release-R71-11151.B daisy_skate-release-tryjob

Sign in to add a comment