
Issue 837969

Starred by 3 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug




wifi: autotest: Fail any wifi test which shows kernel crashes (driver, cfg80211, mac80211) or firmware stackdumps

Project Member Reported by kirtika@google.com, Apr 29 2018

Issue description

network_WiFi_CSADisconnect is a good illustration for this bug.
The driver complains about stuck tx queues after a CSA is issued, followed by a disconnect, yet the test still passes.

21:31:49 INFO | autoserv| 2018-04-28T21:31:26.455496-07:00 ERR kernel: [ 1756.996598] iwlwifi 0000:01:00.0: fail to flush all tx fifo queues Q 5
21:31:49 INFO | autoserv| 2018-04-28T21:31:26.455523-07:00 ERR kernel: [ 1756.996812] iwlwifi 0000:01:00.0: Queue 5 is active on fifo 3 and stuck for 10000 ms. SW [46, 47] HW [46, 47] FH TRB=0x08030502e
21:31:49 INFO | autoserv| 2018-04-28T21:31:38.705365-07:00 ERR kernel: [ 1769.247049] iwlwifi 0000:01:00.0: fail to flush all tx fifo queues Q 5
21:31:49 INFO | autoserv| 2018-04-28T21:31:38.705456-07:00 ERR kernel: [ 1769.247266] iwlwifi 0000:01:00.0: Queue 5 is active on fifo 3 and stuck for 10000 ms. SW [51, 52] HW [51, 52] FH TRB=0x080305033

The goal is to detect any kernel crash in the wireless stack (driver, cfg80211, mac80211) and any wifi firmware crash that happens during a test, and to fail the test if one is found.
While network_WiFi_CSADisconnect is an easy example to point to, a second goal is to catch kernel warnings that are reported in large numbers on the crash server but are hard to reproduce at will, e.g. the iwl_trans_pcie_grab_nic_access / iwl_trans_pcie_reclaim warnings seen when wifi drops off the PCI bus.
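
As a rough illustration of the kind of check described above, the sketch below filters log lines against a list of regular expressions. The two iwlwifi patterns are taken from the log excerpt in this report; the WARNING pattern and the names BAD_PATTERNS and find_wifi_errors are assumptions made for this example, not existing autotest code.

```python
import re

# Patterns for "unwanted spew". The first two come from the log excerpt
# in this report; the third is an assumed, generic form of a kernel
# WARNING that mentions a wireless-stack component.
BAD_PATTERNS = [
    re.compile(r'iwlwifi .*fail to flush all tx fifo queues'),
    re.compile(r'iwlwifi .*Queue \d+ is active on fifo \d+ and stuck'),
    re.compile(r'WARNING: .*(iwlwifi|cfg80211|mac80211)'),
]


def find_wifi_errors(log_lines):
    """Return the lines that match at least one known-bad pattern."""
    return [line for line in log_lines
            if any(p.search(line) for p in BAD_PATTERNS)]
```

A test would fail if find_wifi_errors() returns a non-empty list for the lines logged during its run. The pattern list would need tuning per driver before it could be used in the lab.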

The closest example I could find was the use of client/cros/cros_logging.py, which reads dmesg on the client into a buffer and greps for a desired pattern.     

[snip]
    def verify_lvds_downclock(self):
        """On systems which support LVDS downclock, checks the kernel log for
        a message that an LVDS downclock mode has been added."""
        logging.info('Running verify_lvds_downclock')
        board = utils.get_board()
        if not (board == 'alex' or board == 'lumpy' or board == 'stout'):
            return ''
        # Get the downclock message from the logs.
        reader = cros_logging.LogReader()
        reader.set_start_by_reboot(-1)
        if not reader.can_find('Adding LVDS downclock mode'):
            return self.handle_error('LVDS downclock quirk not applied. ')
        return ''

[snip]

However, cros_logging.py is client-side only; it needs to be modified so that it can be used from a server-side test.

Another approach is in CL:577064 ("Add a wifi stress test"), which calls Unix utilities (cat) from within the test. I've chosen to improve cros_logging.py rather than follow CL:577064, since cros_logging.py can then be used by all tests.


Proposed solution:
Step 1: Extend cros_logging.py to work with remote hosts (i.e. clients) as well as locally. Follow the example of iw_runner.py, which runs commands in one of two ways depending on whether self._host is None (local run) or not (called from the server, to run on the client).

Step 2: Move cros_logging.py to client/common_lib (since anyone can run it now).

Step 3: cros_logging.py allows setting a "start_line" in dmesg, so that you only scan dmesg from the point where the test starts. Add a hook to wifi_client to set the start_line.

Step 4: Add run_before_once and run_after_once hooks to wifi_cell_test_base. run_before_once lets wifi_client set the start_line; run_after_once checks for unwanted spew and fails the test if any is found.
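
The steps above could fit together roughly as follows. This is only a sketch under stated assumptions: LogScanner, WiFiLogCheck and the local TestFail class are hypothetical names invented here (they are not the cros_logging.py API), the default log path is an assumption, and the remote branch assumes a host object whose run(cmd) returns a result with a .stdout string.

```python
import re


class TestFail(Exception):
    """Stand-in for autotest's test-failure error so the sketch is standalone."""


class LogScanner(object):
    """Reads a log locally, or through a host object when one is given."""

    def __init__(self, host=None, path='/var/log/messages'):
        self._host = host          # None means "run locally" (Step 1)
        self._path = path          # assumed default log location
        self._start_line = 0

    def _read_lines(self):
        if self._host is None:
            with open(self._path) as f:
                return f.read().splitlines()
        # Assumption: host.run() returns an object with a .stdout string.
        return self._host.run('cat %s' % self._path).stdout.splitlines()

    def set_start_by_current(self):
        """Remember the current end of the log (Step 3)."""
        self._start_line = len(self._read_lines())

    def find_all(self, patterns):
        """Return lines logged after the start line that match any pattern."""
        lines = self._read_lines()[self._start_line:]
        return [l for l in lines if any(re.search(p, l) for p in patterns)]


class WiFiLogCheck(object):
    """Hypothetical pair of hooks a test base class could call (Step 4)."""

    def __init__(self, scanner, patterns):
        self._scanner = scanner
        self._patterns = patterns

    def run_before_once(self):
        self._scanner.set_start_by_current()

    def run_after_once(self):
        hits = self._scanner.find_all(self._patterns)
        if hits:
            raise TestFail('Unexpected wifi log spew:\n' + '\n'.join(hits))
```

Counting lines to mark the start is deliberately naive: it would miss messages if the log is rotated in the middle of a test, so a real implementation would need to handle rotation.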



 

Comment 1 by kirtika@google.com, Apr 29 2018

Cc: rajatja@chromium.org cernekee@chromium.org briannorris@chromium.org zmarcus@google.com
Added RFC here:
https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/1034277/

Would be great to get this in the lab this week so we can start getting AER/wifi-falls-off-NIC repro instances from the lab. 

Status: Assigned (was: Untriaged)
This bug has an owner, thus, it's been triaged. Changing status to "assigned".
