New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 811975 link

Starred by 1 user

Issue metadata

Status: Verified
Owner:
Closed: Dec 7
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug



Sign in to add a comment

Wifi tests fail due to "Timed out while waiting for hostapd"

Project Member Reported by harpreet@chromium.org, Feb 13 2018

Issue description

yuna - chromeos15-row3-rack2-host1

Sample log link - https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/174917026-chromeos-test/chromeos15-row3-rack2-host1/network_WiFi_SimpleConnect.wifi_checkMixedWPA/debug/


Trace...

02/07 01:12:38.650 WARNI|              test:0637| The test failed with the following exception
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/test.py", line 631, in _exec
    _call_test_function(self.execute, *p_args, **p_dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 831, in _call_test_function
    return func(*args, **dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 495, in execute
    dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 362, in _call_run_once_with_retry
    postprocess_profiled_run, args, dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 400, in _call_run_once
    self.run_once(*args, **dargs)
  File "/usr/local/autotest/server/site_tests/network_WiFi_SimpleConnect/network_WiFi_SimpleConnect.py", line 35, in run_once
    self.context.configure(router_conf)
  File "/usr/local/autotest/server/cros/network/wifi_test_context_manager.py", line 178, in configure
    multi_interface=multi_interface)
  File "/usr/local/autotest/server/site_linux_router.py", line 449, in hostap_configure
    self.start_hostapd(configuration)
  File "/usr/local/autotest/server/site_linux_router.py", line 325, in start_hostapd
    raise error.TestFail('Timed out while waiting for hostapd '
TestFail: Timed out while waiting for hostapd to start.

-------------------------------------------------------------------

Stainless link - https://stainless.corp.google.com/search?view=matrix&row=build&col=board&first_date=2018-01-17&last_date=2018-02-13&hostname=chromeos15-row3-rack2-host1&status=GOOD&status=WARN&status=FAIL&status=ERROR&status=ABORT&reason=Timed+out+while+waiting+for+hostapd+to+start&exclude_cts=true&exclude_not_run=true&exclude_non_release=false&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=true
 
Labels: wifi-test-failures
Owner: briannorris@chromium.org
Status: Started (was: Untriaged)
Summary: Wifi tests fail due to "Timed out while waiting for hostapd" (was: Wifi tests failing on yuna due to "Timed out while waiting for hostapd")
The original report was happening a lot, and that was likely an instance of a bad router. But this does still seems to happen occasionally.

I glanced at one failure:

https://stainless.corp.google.com/browse/chromeos-autotest-results/262637399-chromeos-test/

It looks to me like the timeout condition is not actually working correctly. According to router_host_messages, we only grepped this hostapd log for "Setup of interface done" once


2018-12-01T17:02:01.777216+00:00 NOTICE ag[29880]: autotest server[stack::<lambda>|_has_hostapd_started|run] -> ssh_run(grep "Setup of interface done" /tmp/hostapd-test-managed0.log)

but the hostapd log shows that it came less than a second later:

1543683722.524832: managed0: Setup of interface done.

The test called this a timeout several seconds later, but it didn't recheck whether hostapd was up (it was):

12/01 09:02:09.655 ERROR|             utils:2762| Will raise error TestFail('Timed out while waiting for hostapd to start.',) due to unexpected return: False

This actually seems to be a fundamental bug in poll_for_condition(). poll_for_condition() basically does:

  while True:
    if condition():
       return True
    if past deadline:
      raise TimeoutError()
    sleep


This is wrong, and is especially bad if condition() takes a long time to complete [1]. In that case, condition() may cause us to exceed the deadline, but we don't re-check that condition before declaring failure.

This is a subtle issue in timeout loops that I see all over the place. It's rather unfortunate that we have that same error in our basic test helper -- poll_for_condition().


[1] See how these 3 commands in the _has_hostapd_started() poll condition span over 8 seconds:

2018-12-01T17:02:01.777216+00:00 NOTICE ag[29880]: autotest server[stack::<lambda>|_has_hostapd_started|run] -> ssh_run(grep "Setup of interface done" /tmp/hostapd-test-managed0.log)

2018-12-01T17:02:07.361877+00:00 NOTICE ag[29888]: autotest server[stack::<lambda>|_has_hostapd_started|run] -> ssh_run(grep "Interface initialization failed" /tmp/hostapd-test-managed0.log)

2018-12-01T17:02:09.226752+00:00 NOTICE ag[29896]: autotest server[stack::<lambda>|_has_hostapd_started|run] -> ssh_run(kill -0 29872)

If this really can be that slow, we are highly likely to "time out" if we are too early for hostapd startup the first time around.
For some data: a test like this at my desk:

for j in $(seq 1 20) ; do (time (for i in $(seq 1 3); do ssh root@${ROUTER_HOSTNAME} :; done) ) 2>&1 | grep real | cut -f 2; done

shows that 3 SSH commands reliably takes right around 4 to 4.5 seconds to complete. So with some slop for lab server and router delays, that can easily add up, such that 10 seconds is really not enough time to figure out whether hostapd has started, even if hostapd never takes longer than ~5 seconds to start.
Project Member

Comment 6 by bugdroid1@chromium.org, Dec 7

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/7345690cdcd82e8357416199b237b1e52305484f

commit 7345690cdcd82e8357416199b237b1e52305484f
Author: Brian Norris <briannorris@chromium.org>
Date: Fri Dec 07 06:06:30 2018

[autotest] site_linux_router: increase hostapd startup timeout

It doesn't actually take hostapd very long to start up (I see 3-5
seconds in the longest cases), but it can take a little while to
check whether or not it has started up. This is partly because we use 3
different methods in _has_hostapd_started(), and each is run over a new
SSH session. Additionally, poll_for_condition() is not very forgiving --
when we use a 10 second timeout, it decides to cut us off even if (due
to, e.g., network delays, autoserv load, etc.) the last time we started
a "check for liveness" was much earlier in the 10 second window. This is
apparently how the function was designed, despite its flaws. Perhaps
we'll change that in CL:1364090.

But regardless, it's fair to increase this timeout significantly.
Looking through recent failures in startup, there are exactly zero
legitimate instances of hitting this timeout condition. In all cases,
hostapd started relatively quickly, but the last time we checked in was
1 to 2 seconds before it finished starting up. If we triple the timeout,
we are likely to get at least one or two more chances to check in with
hostapd, and we can hopefully avoid the spurious timeout.

Another reason a large timeout is no problem here: we also have negative
checks in _has_hostapd_started() which can detect whether
(a) hostapd terminated early (i.e., PID is not running) or
(b) hostapd logged some known errors already and is preparing to shut
    down.
With those two cases, we will short-circuit and raise TestFail() anyway,
so we won't hit the timeout. So the only legitamite use of this timeout
is for very exceptional cases, which we aren't hitting in the lab today.

Side note: I spot-checked every recent "timeout" we've seen in the lab,
and all of them are spurious cases of
poll_for_condition()/_has_hostapd_started() declaring premature timeout.

BUG= chromium:811975 
TEST=a few network_WiFi_... tests, even with 10 seconds of delay added
     to _has_hostapd_started()

Change-Id: If814ae8b17435998fdcb4a74073ecab04b202feb
Signed-off-by: Brian Norris <briannorris@chromium.org>
Reviewed-on: https://chromium-review.googlesource.com/1366335
Reviewed-by: Kirtika Ruchandani <kirtika@chromium.org>

[modify] https://crrev.com/7345690cdcd82e8357416199b237b1e52305484f/server/site_linux_router.py

Status: Fixed (was: Started)
I suspect the increased timeout should cover everything I've seen in the lab, so I'll declare this Fixed.

Sign in to add a comment