Wifi tests fail due to "Timed out while waiting for hostapd" |
||||
Issue descriptionyuna - chromeos15-row3-rack2-host1 Sample log link - https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/174917026-chromeos-test/chromeos15-row3-rack2-host1/network_WiFi_SimpleConnect.wifi_checkMixedWPA/debug/ Trace... 02/07 01:12:38.650 WARNI| test:0637| The test failed with the following exception Traceback (most recent call last): File "/usr/local/autotest/client/common_lib/test.py", line 631, in _exec _call_test_function(self.execute, *p_args, **p_dargs) File "/usr/local/autotest/client/common_lib/test.py", line 831, in _call_test_function return func(*args, **dargs) File "/usr/local/autotest/client/common_lib/test.py", line 495, in execute dargs) File "/usr/local/autotest/client/common_lib/test.py", line 362, in _call_run_once_with_retry postprocess_profiled_run, args, dargs) File "/usr/local/autotest/client/common_lib/test.py", line 400, in _call_run_once self.run_once(*args, **dargs) File "/usr/local/autotest/server/site_tests/network_WiFi_SimpleConnect/network_WiFi_SimpleConnect.py", line 35, in run_once self.context.configure(router_conf) File "/usr/local/autotest/server/cros/network/wifi_test_context_manager.py", line 178, in configure multi_interface=multi_interface) File "/usr/local/autotest/server/site_linux_router.py", line 449, in hostap_configure self.start_hostapd(configuration) File "/usr/local/autotest/server/site_linux_router.py", line 325, in start_hostapd raise error.TestFail('Timed out while waiting for hostapd ' TestFail: Timed out while waiting for hostapd to start. ------------------------------------------------------------------- Stainless link - https://stainless.corp.google.com/search?view=matrix&row=build&col=board&first_date=2018-01-17&last_date=2018-02-13&hostname=chromeos15-row3-rack2-host1&status=GOOD&status=WARN&status=FAIL&status=ERROR&status=ABORT&reason=Timed+out+while+waiting+for+hostapd+to+start&exclude_cts=true&exclude_not_run=true&exclude_non_release=false&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=true
,
Dec 1
,
Dec 5
The original report was happening a lot, and that was likely an instance of a bad router. But this does still seems to happen occasionally. I glanced at one failure: https://stainless.corp.google.com/browse/chromeos-autotest-results/262637399-chromeos-test/ It looks to me like the timeout condition is not actually working correctly. According to router_host_messages, we only grepped this hostapd log for "Setup of interface done" once 2018-12-01T17:02:01.777216+00:00 NOTICE ag[29880]: autotest server[stack::<lambda>|_has_hostapd_started|run] -> ssh_run(grep "Setup of interface done" /tmp/hostapd-test-managed0.log) but the hostapd log shows that it came less than a second later: 1543683722.524832: managed0: Setup of interface done. The test called this a timeout several seconds later, but it didn't recheck whether hostapd was up (it was): 12/01 09:02:09.655 ERROR| utils:2762| Will raise error TestFail('Timed out while waiting for hostapd to start.',) due to unexpected return: False This actually seems to be a fundamental bug in poll_for_condition(). poll_for_condition() basically does: while True: if condition(): return True if past deadline: raise TimeoutError() sleep This is wrong, and is especially bad if condition() takes a long time to complete [1]. In that case, condition() may cause us to exceed the deadline, but we don't re-check that condition before declaring failure. This is a subtle issue in timeout loops that I see all over the place. It's rather unfortunate that we have that same error in our basic test helper -- poll_for_condition(). [1] See how these 3 commands in the _has_hostapd_started() poll condition span over 8 seconds: 2018-12-01T17:02:01.777216+00:00 NOTICE ag[29880]: autotest server[stack::<lambda>|_has_hostapd_started|run] -> ssh_run(grep "Setup of interface done" /tmp/hostapd-test-managed0.log) 2018-12-01T17:02:07.361877+00:00 NOTICE ag[29888]: autotest server[stack::<lambda>|_has_hostapd_started|run] -> ssh_run(grep "Interface initialization failed" /tmp/hostapd-test-managed0.log) 2018-12-01T17:02:09.226752+00:00 NOTICE ag[29896]: autotest server[stack::<lambda>|_has_hostapd_started|run] -> ssh_run(kill -0 29872) If this really can be that slow, we are highly likely to "time out" if we are too early for hostapd startup the first time around.
,
Dec 5
,
Dec 6
For some data: a test like this at my desk:
for j in $(seq 1 20) ; do (time (for i in $(seq 1 3); do ssh root@${ROUTER_HOSTNAME} :; done) ) 2>&1 | grep real | cut -f 2; done
shows that 3 SSH commands reliably takes right around 4 to 4.5 seconds to complete. So with some slop for lab server and router delays, that can easily add up, such that 10 seconds is really not enough time to figure out whether hostapd has started, even if hostapd never takes longer than ~5 seconds to start.
,
Dec 7
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/7345690cdcd82e8357416199b237b1e52305484f commit 7345690cdcd82e8357416199b237b1e52305484f Author: Brian Norris <briannorris@chromium.org> Date: Fri Dec 07 06:06:30 2018 [autotest] site_linux_router: increase hostapd startup timeout It doesn't actually take hostapd very long to start up (I see 3-5 seconds in the longest cases), but it can take a little while to check whether or not it has started up. This is partly because we use 3 different methods in _has_hostapd_started(), and each is run over a new SSH session. Additionally, poll_for_condition() is not very forgiving -- when we use a 10 second timeout, it decides to cut us off even if (due to, e.g., network delays, autoserv load, etc.) the last time we started a "check for liveness" was much earlier in the 10 second window. This is apparently how the function was designed, despite its flaws. Perhaps we'll change that in CL:1364090. But regardless, it's fair to increase this timeout significantly. Looking through recent failures in startup, there are exactly zero legitimate instances of hitting this timeout condition. In all cases, hostapd started relatively quickly, but the last time we checked in was 1 to 2 seconds before it finished starting up. If we triple the timeout, we are likely to get at least one or two more chances to check in with hostapd, and we can hopefully avoid the spurious timeout. Another reason a large timeout is no problem here: we also have negative checks in _has_hostapd_started() which can detect whether (a) hostapd terminated early (i.e., PID is not running) or (b) hostapd logged some known errors already and is preparing to shut down. With those two cases, we will short-circuit and raise TestFail() anyway, so we won't hit the timeout. So the only legitamite use of this timeout is for very exceptional cases, which we aren't hitting in the lab today. Side note: I spot-checked every recent "timeout" we've seen in the lab, and all of them are spurious cases of poll_for_condition()/_has_hostapd_started() declaring premature timeout. BUG= chromium:811975 TEST=a few network_WiFi_... tests, even with 10 seconds of delay added to _has_hostapd_started() Change-Id: If814ae8b17435998fdcb4a74073ecab04b202feb Signed-off-by: Brian Norris <briannorris@chromium.org> Reviewed-on: https://chromium-review.googlesource.com/1366335 Reviewed-by: Kirtika Ruchandani <kirtika@chromium.org> [modify] https://crrev.com/7345690cdcd82e8357416199b237b1e52305484f/server/site_linux_router.py
,
Dec 7
I suspect the increased timeout should cover everything I've seen in the lab, so I'll declare this Fixed.
,
Dec 22
Similar test failure has happened only once (yesterday on orco) since the fix landed. Declaring this as fixed/verified. https://stainless.corp.google.com/search?view=matrix&row=queued_date&col=board&first_date=2018-10-25&last_date=2018-12-22&status=GOOD&status=WARN&status=FAIL&status=ERROR&status=ABORT&reason=Timed+out+while+waiting+for+hostapd+to+start&exclude_cts=true&exclude_not_run=true&exclude_non_release=false&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=true |
||||
►
Sign in to add a comment |
||||
Comment 1 by harpreet@chromium.org
, Feb 13 2018