
Issue 912288

Starred by 4 users

Issue metadata

Status: Untriaged
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug




network_WiFi_SuspendStress: 10-second suspend actually takes 15+ minutes

Project Member Reported by briannorris@chromium.org, Dec 5

Issue description

On most (all?) veyron and many daisy devices, our "10-second" suspend often takes about 15 minutes. Tests that suspend only ~15 times should take on the order of 10 minutes (allowing for WiFi connectivity and testing delays), but instead they frequently run well over an hour, so the test gets ABORTed due to timeout.

For example here:

https://stainless.corp.google.com/browse/chromeos-autotest-results/263505757-chromeos-test/

12/04 13:33:45.005 INFO |       wifi_client:0665| Suspending DUT for 10 seconds...
12/04 13:48:45.122 INFO |       wifi_client:0667| ...done suspending
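
To make the size of the gap concrete, here is a minimal Python sketch that computes the elapsed time between the two autotest log lines above. The helper name `elapsed_seconds` is hypothetical, and the year 2018 is an assumption (the autotest timestamps omit it; the syslog excerpt for this bug is dated 2018-12-04).

```python
from datetime import datetime

# Autotest log timestamps look like "12/04 13:33:45.005" (no year).
LOG_FORMAT = "%Y %m/%d %H:%M:%S.%f"

def elapsed_seconds(start, end, year=2018):
    """Return seconds between two autotest-style timestamps.

    `year` is an assumption: the log lines do not carry one.
    """
    t0 = datetime.strptime("%d %s" % (year, start), LOG_FORMAT)
    t1 = datetime.strptime("%d %s" % (year, end), LOG_FORMAT)
    return (t1 - t0).total_seconds()

# "Suspending DUT for 10 seconds..." vs. "...done suspending"
suspend_elapsed = elapsed_seconds("12/04 13:33:45.005", "12/04 13:48:45.122")
print(suspend_elapsed)  # about 900.1 s, i.e. roughly 15 minutes, not 10 seconds
```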

I managed to pull the syslog from the following test, which shows:

2018-12-04T13:16:39.250424-08:00 NOTICE powerd_suspend[30340]: Finalizing suspend
...
2018-12-04T13:16:48.379738-08:00 NOTICE powerd_suspend[30374]: Resume finished
...
2018-12-04T13:16:51.716836-08:00 INFO kernel: [ 3219.635063] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
2018-12-04T13:16:51.727569-08:00 WARNING kernel: [ 3219.639365] smsc75xx 2-1:1.0 eth0: device_set_wakeup_enable error -22
2018-12-04T13:16:51.728039-08:00 INFO kernel: [ 3219.642824] smsc75xx 2-1:1.0 eth0: link up, 1000Mbps, full-duplex, lpa 0xC5E1


The DUT then appears to sit there waiting, until it finally sees another SSH login about 15 minutes later.

I can't tell exactly what's at fault here. This particular suspend operation is being run over our shill_proxy (and XML RPC server), so it's possible it doesn't handle the suspend/resume network interruption quite right.
 
Summary: network_WiFi_Suspend: 10-second suspend actually takes 15+ minutes (was: network_WiFi_Suspend: [veyron, daisy] 10-second suspend actually takes 15+ minutes)
I'm pretty sure this isn't just on veyron and daisy. It happens at my desk too, with other systems. I managed to watch a little of the post-suspend/resume traffic between autoserv and the xmlrpc server running on the DUT. Each side appeared to be waiting on the other, yet both still thought the connection was OK -- they were blocked in either select() or recvfrom(). If one side managed to time out (and I was able to force this by interrupting one side of the connection with gdb), the result was 'Connection refused' errors, like in bug 864273 and bug 845732.

I think these are all symptoms of the same problem: somehow, the xmlrpc interface can get wedged across suspend/resume -- sometimes this recovers after ~15 minutes, but sometimes it just results in connection failures that kill the test.

This is not exactly my area of expertise, but it seems like a deep root cause as to why all our network_WiFi_SuspendStress tests are inherently flaky.
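
For illustration only, here is a rough sketch of one way a client could avoid blocking indefinitely in recvfrom() on a wedged connection: give the XML-RPC transport a socket timeout so a stalled call fails quickly and can be retried. This is not how autotest's shill_proxy or its XML-RPC plumbing is actually written; it uses Python 3's standard `xmlrpc.client`, and the names `TimeoutTransport` and `make_proxy`, as well as the 30-second default, are hypothetical.

```python
import xmlrpc.client

class TimeoutTransport(xmlrpc.client.Transport):
    """XML-RPC transport whose HTTP connection has a socket timeout."""

    def __init__(self, timeout, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._timeout = timeout

    def make_connection(self, host):
        # The base class returns an http.client.HTTPConnection; its
        # `timeout` attribute is applied when the socket is connected.
        conn = super().make_connection(host)
        conn.timeout = self._timeout
        return conn

def make_proxy(url, timeout=30.0):
    """Build a ServerProxy whose calls raise instead of hanging.

    A blocked read raises socket.timeout (an OSError subclass) after
    `timeout` seconds, so the caller can reconnect and retry.
    """
    return xmlrpc.client.ServerProxy(url, transport=TimeoutTransport(timeout))
```

A caller would wrap each RPC in a try/except for OSError and rebuild the proxy on failure. Whether a timeout alone fixes the stall described here is an open question; it would only turn a silent 15-minute hang into a prompt, visible error.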
Summary: network_WiFi_SuspendStress: 10-second suspend actually takes 15+ minutes (was: network_WiFi_Suspend: 10-second suspend actually takes 15+ minutes)
Cc: briannorris@chromium.org
 Issue 845732  has been merged into this issue.
Cc: aashuto...@chromium.org dsunk...@chromium.org
 Issue 864273  has been merged into this issue.
Labels: Enterprise-Triaged
