New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 868503 link

Starred by 1 user

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug



Sign in to add a comment

network_WiFi_MalformedProbeResp failing with "Client failed to receive probe responses"

Project Member Reported by aashuto...@chromium.org, Jul 27

Issue description

The following command is failing

06/28 04:11:19.704 DEBUG|          ssh_host:0301| Running (ssh) '(time -p /usr/sbin/iw dev mlan0 scan freq 2412) 2>&1' from '_call_run_once|run_once|scan|timed_scan|run|run_very_slowly'
06/28 04:11:20.081 DEBUG|             utils:0286| [stdout] command failed: Device or resource busy (-16)
   
>> I can try if adding sleep before scanning fixes the issue. 

So, bss_list is None and iterating through a None object throws an error. Also we need to raise a test fail exception if bss list empty. 
 
Cc: briannorris@chromium.org kirtika@chromium.org
Labels: -Pri-3 Pri-2
For posterity here is the trace...

08/09 11:38:55.253 WARNI|              test:0637| The test failed with the following exception
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/test.py", line 631, in _exec
    _call_test_function(self.execute, *p_args, **p_dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 837, in _call_test_function
    raise error.UnhandledTestFail(e)
UnhandledTestFail: Unhandled TypeError: 'NoneType' object is not iterable
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/test.py", line 831, in _call_test_function
    return func(*args, **dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 495, in execute
    dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 362, in _call_run_once_with_retry
    postprocess_profiled_run, args, dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 400, in _call_run_once
    self.run_once(*args, **dargs)
  File "/usr/local/autotest/server/site_tests/network_WiFi_MalformedProbeResp/network_WiFi_MalformedProbeResp.py", line 63, in run_once
    for bss in bss_list:
TypeError: 'NoneType' object is not iterable




There is already a 10sec sleep in between scans.

    SCAN_LOOP_SLEEP_SEC = 10
.
.
.
                while time.time() - start_time < self.SCAN_LOOP_SEC:
                    bss_list = self.context.client.iw_runner.scan(
                            self.context.client.wifi_if, [2412])
                    for bss in bss_list:
                        logging.debug('found bss: %s', bss.ssid)
                        if bss.ssid == 'TestingProbes00000000':
                            rx_probe_resp_count += 1
                    time.sleep(self.SCAN_LOOP_SLEEP_SEC)



Looks like this is mainly an issue with veyron devices.

https://stainless.corp.google.com/search?view=matrix&row=model&col=build&test=%5Enetwork%5C_WiFi%5C_MalformedProbeResp%24&status=WARN&status=FAIL&status=ERROR&status=ABORT&status=ALERT&reason=Unhandled+TypeError&exclude_cts=true&exclude_not_run=false&exclude_non_release=true&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=false&days=30





That failure message isn't very helpful. We should at least fix that so that a better error message like "Client failed to receive probe responses" can be reported.

https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/1357250
Status: Available (was: Untriaged)
Owner: benchan@chromium.org
Status: Started (was: Available)
I think you've got this one :)

As noted on your CL, the problem is that we're using 'iw' for scans, while supplicant owns the device. That's a problem:

11/30 12:07:30.944 DEBUG|             utils:0287| [stdout] command failed: Device or resource busy (-16)


But the way the test is written, it will still keep trying over some time interval, which should hopefully be enough to "eventually" get a valid scan result.

We might want to improve the test to ask shill/supplicant for scans instead, but I think we have that problem elsewhere too.
Cc: benchan@chromium.org
Owner: kirtika@chromium.org
Project Member

Comment 6 by bugdroid1@chromium.org, Dec 1

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/900aaff6d54d19f35e7439e46335cdf2308a8aa9

commit 900aaff6d54d19f35e7439e46335cdf2308a8aa9
Author: Ben Chan <benchan@chromium.org>
Date: Sat Dec 01 13:19:29 2018

network_WiFi_MalformedProbeResp: fix an unhandled exception

The following failure message, triggered by an unhandled exception,
isn't very helpful:

  File "/usr/local/autotest/server/site_tests/network_WiFi_MalformedProbeResp/network_WiFi_MalformedProbeResp.py", line 63, in run_once
      for bss in bss_list:
  TypeError: 'NoneType' object is not iterable

This CL updates network_WiFi_MalformedProbeResp to handle the absence of
scan result more gracefully, so that a better error message like "Client
failed to receive probe responses" can be reported instead.

BUG=chromium:868503
TEST=Run network_WiFi_MalformedProbeResp.

Change-Id: If2364d3d5417fcb5900abfd85593777a61b594e7
Reviewed-on: https://chromium-review.googlesource.com/1357250
Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
Tested-by: Ben Chan <benchan@chromium.org>
Reviewed-by: Brian Norris <briannorris@chromium.org>

[modify] https://crrev.com/900aaff6d54d19f35e7439e46335cdf2308a8aa9/server/site_tests/network_WiFi_MalformedProbeResp/network_WiFi_MalformedProbeResp.py

From the first comment:

> So, bss_list is None and iterating through a None object throws an error. Also we need to raise a test fail exception if bss list empty. 

We don't necessarily need an exception for this case -- we retry under a timeout, so it's still possible we pick up good scan results later. Ben fixed that, so we can actually keep retrying. But so far, that isn't really helping. With the fix, we now have a few results that show:

TEST_FAIL: Client failed to receive probe responses

So it remains to be seen if fixing the supplicant vs. iw scanning issues will fix the problem (e.g., perhaps we're never completing *any* iw scans), or else maybe there's a true bug on these devices, such that they're not picking up these probes at all.
Looking a little closer, this does indeed seem to simply be a test bug. On one of the failures I mentioned in comment #7, it looks like we were unlucky enough to see errno -16 (busy) on all 6 scan attempts, but in the meantime, net.log shows that we *did* pick up the probe frames:

  DEBUG wpa_supplicant[769]: mlan0: BSS: Add new id 15 BSSID xx:xx:xx:xx:xx:xx SSID 'TestingProbes00000000' freq 2412


So if we just fix up the -EBUSY race behaviors, we should be OK.
Owner: ----
Status: Available (was: Started)
Summary: network_WiFi_MalformedProbeResp failing with "Client failed to receive probe responses" (was: network_WiFi_MalformedProbeResp failing with "TypeError: 'NoneType' object is not iterable Error)
Still happening plenty, but now we just get a proper error ("Client failed to receive probe responses"). I looked a little bit, and

1. It's not very reproducible on demand, even when I lock these same devices down
2. This test is always scheduled after network_WiFi_SuspendStress.Hidden, at least on these devices
3. #2 very often ABORTs after an hour

So maybe it's influenced by a lack of cleanup from previous tests too?

Anyway, I think we have more or less the idea on how to solve this: either get shill to stop scanning, or else direct our scans through shill/supplicant instead.

I don't think Kirtika was clearly signed up for doing the above -- her first attempt wasn't quite right, anyway. So I'll put this back as Available.

On a side note: I'll probably file a bug for #3 if there isn't one already. We're wasting a lot of lab time if those flaky tests are really taking over an hour just to do ~15 suspend/resume cycles. It seems like some quirky SSH behavior I think, where we hang for several minutes (sometimes 5-15 minutes) in the "going to suspend" phase, even if the device still managed to go through sleep/wake on its normal time.
Labels: Enterprise-Triaged

Sign in to add a comment