make recover_duts.py more aggressive
Issue description
recover_duts.py runs its hooks (currently only one, check_ethernet.hook) every 10 minutes (SLEEP_DELAY = 600). Since we would like to detect and recover from connectivity problems in DUTs more quickly, and the overhead of hook execution seems low, we'd like to push this down to, say, 30s. Also note that recover_duts.py sleeps for a long time before running the hooks the first time (LONG_REBOOT_DELAY = 600). This also seems too long: maybe 2-3 minutes instead? That would help if the ethernet NIC doesn't come up properly at boot.
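For illustration only, here is a minimal sketch of the timing being proposed. It is not the actual recover_duts.py code: `run_hook_loop` and its parameters are hypothetical, and the value 150 is just an assumed midpoint of the "2-3 minutes" suggested above. Only the constant names SLEEP_DELAY and LONG_REBOOT_DELAY come from the description.

```python
import time

# Proposed values from the description; both are currently 600.
SLEEP_DELAY = 30          # seconds between hook runs ("say 30s")
LONG_REBOOT_DELAY = 150   # assumed midpoint of "2-3 minutes" after boot


def run_hook_loop(hooks, iterations, sleep=time.sleep):
    """Hypothetical loop: wait once after boot, then run every hook periodically.

    `hooks` is a list of zero-argument callables; `sleep` is injectable so the
    timing can be checked without actually waiting.
    """
    sleep(LONG_REBOOT_DELAY)
    results = []
    for _ in range(iterations):
        for hook in hooks:
            results.append(hook())
        sleep(SLEEP_DELAY)
    return results
```

With these values a DUT whose NIC failed at boot would get its first check after 2.5 minutes instead of 10, and later failures would be noticed within about 30 seconds.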
,
Aug 25 2017
Sure!
,
Aug 25 2017
One of the CLs I have posted should help with this as well: https://chromium-review.googlesource.com/c/chromiumos/platform/crostestutils/+/611391/
,
Aug 28 2017
Richard dug up the bug which documented the timing requirements for power_SuspendStress (and how they affect recover_duts): http://crbug.com/334951
The only feasible solution I see is for tests which take down the network to also be responsible for bringing the network back up. In other words, the test can "stop recover_duts" and run check_ethernet.hook at the right times (when it thinks the network should be up again). Everything else proposed so far just looks racy to me. The only (quite reasonable) complaint about that approach is that it puts a burden on the test writer. My response to that is: then don't take down part of the lab infrastructure in that test. The counterargument is that there really aren't that many tests which bounce the ethernet ports. We need to sort this out.
,
Aug 28 2017
Agreed on putting the onus on the test to be more cooperative. There should only be a handful of tests that do such things. One option could be to have such tests drop a file in /tmp that recover_duts.py / check_ethernet.hook look for, and to give a longer grace period in such cases.
,
Aug 28 2017
> One option could be to have such tests drop a file in /tmp that
> recover_duts.py / check_ethernet.hook look for and give longer
> grace period in such cases.
I've mentally rejected this approach since it will always need "tuning" and become probabilistic. The test knows when the lab network should be up or down, and the tests can directly use check_ethernet.hook (or whatever we end up calling this) to verify (and help revive) link state. The key thing is that check_ethernet.hook must NOT reboot the system. I think it's important to let the caller decide that.
,
Aug 28 2017
#7 why would it need any special tuning? The test will create the file before it starts playing with connectivity. Recover_duts will check for the file, then will go to sleep for the length of time specified in the file. Right now the test works, so using the current recover_duts interval will preserve the behavior.
,
Aug 28 2017
How does the test know that check_ethernet.hook isn't already running? How frequently does recover_duts need to check for the file? Anyway, I was thinking of recover_duts as part of the test infrastructure and not as an independent watchdog. Richard and Sameer convinced me that recover_duts needs to be independent of the test and just take information from the test as a "hint". Anyway, let's put any high-level changes to recover_duts or check_ethernet.hook on hold until we can point at some measurable failure rate that they can (or did) recover from.
,
Aug 29 2017
#9 the test only needs to wait a few seconds between creating the file and turning connectivity off. Recover_duts runs normally. When it starts, it will look for the file (not there yet), then check that connectivity is present. If present, nothing happens. If not present, something is already wrong and it's OK to mess things up. The check will take very little time. So there is a race, but it's one that we can make the test win reliably.
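The hint-file handshake discussed in the comments above could be sketched roughly as follows. This is only an illustration under assumptions, not code from recover_duts.py or check_ethernet.hook: the function names, the file format (a single integer number of seconds), and the `link_is_up` callback are all hypothetical. The decision is returned to the caller rather than acted on, in line with the earlier point that the hook itself should not reboot the system.

```python
import os


def write_hint(path, grace_seconds):
    """Test side: publish a grace period before taking the network down."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(int(grace_seconds)))
    # Rename atomically so the watchdog never reads a half-written file.
    os.replace(tmp, path)


def read_hint(path):
    """Watchdog side: return the grace period in seconds, or None if absent."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def next_action(hint_path, link_is_up):
    """Decide what one watchdog pass should do; the caller acts on the result."""
    grace = read_hint(hint_path)
    if grace is not None:
        return ("sleep", grace)   # test asked for a grace period
    if link_is_up():
        return ("ok", 0)          # connectivity present, nothing to do
    return ("recover", 0)         # no hint and no link: something is wrong
```

A test would call `write_hint(...)`, wait a few seconds so that any pass already in progress finishes, and only then take connectivity down. That short wait is what lets the test win the race described above.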
,
Sep 10
Grant, am I wrong, or have you already worked on this in a separate context?
,
Sep 10
I used the "wrong" bug to implement the changes described by this issue. Sorry. :( So closing as "dup" in order to link the two bugs.
Comment 1 by snanda@chromium.org
, Aug 25 2017