moblab: moblab_RunSuite should retry the tests launched on moblab
Issue description
See issue 714330 for context.
- One of the sub-DUTs on moblab fails provision.
- The corresponding test fails.
- moblab_RunSuite fails.
But if we had retried the test on the other, good DUT, moblab_RunSuite would have succeeded. So, why not?
,
Jun 8 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/49420390d1488c1ed56fa7dc03a202e39cb2fe87

commit 49420390d1488c1ed56fa7dc03a202e39cb2fe87
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Jun 08 06:45:38 2017

moblab_RunSuite: Retry tests launched on moblab

moblab's sub-DUTs are often unreliable, or at least, are as likely to fail provision as any other DUT. We retry tests in suites launched from the builders in order to avoid the suite failing due to a small number of provision failures. By the same logic, we should retry the tests inside the suite launched on moblab: to avoid failing the test when a small number of provisions fail.

At the same time, keep the retry limit small: We don't have that many DUTs attached to each moblab instance, so retrying tests directly translates to longer end-to-end moblab_RunSuite time.

BUG= chromium:729099
TEST=moblab_RunSuite

Change-Id: If528c95241ca422f7f0ed8d17984ac37eff0f430
Reviewed-on: https://chromium-review.googlesource.com/522926
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Simran Basi <sbasi@chromium.org>

[modify] https://crrev.com/49420390d1488c1ed56fa7dc03a202e39cb2fe87/server/site_tests/moblab_RunSuite/control.dummyServer
[modify] https://crrev.com/49420390d1488c1ed56fa7dc03a202e39cb2fe87/server/site_tests/moblab_RunSuite/moblab_RunSuite.py
[modify] https://crrev.com/49420390d1488c1ed56fa7dc03a202e39cb2fe87/server/site_tests/moblab_RunSuite/control.smoke
,
Jun 8 2017
Not yet done. The tests used by moblab_RunSuite's internal suites do not necessarily request job-level retries. E.g., moblab_RunSuite/control.dummyServer eventually uses the test dummy_PassServer, and none of its control files has JOB_RETRIES set. We can either add JOB_RETRIES to these control files (this will affect the test everywhere), or actually teach suites to retry all tests at least once. That proposal is here: https://chromium-review.googlesource.com/c/528313/
,
Jun 9 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/7295bf332df608b1d4e9cdc9f0c769b71ffbae46

commit 7295bf332df608b1d4e9cdc9f0c769b71ffbae46
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Fri Jun 09 14:10:42 2017

[autotest] Bump all tests to retry at least once in a suite.

When a suite requests job retries, we retry a test only if the test itself also request retries. For important suites (running on CQ / BVT), we would like tests that fail as a result of their provision job failing to get at least one more chance to run.

This CL is a short-term fix. It bumps up the individual test retry limit to at least 1, so that each test is protected from its DUT failing provision.

BUG= chromium:730885
BUG= chromium:729099
TEST=
- run test_that with a test that doesn't request retries
- inject a bug in the provision code so that the DUT fails provision.
- watch the DUT fail provision, and the test get retried (of course that retry will again due to the same injected bug).
TEST=(updated) unittests.

Change-Id: I59b3ae36bb78c94fce234976d81297245cedd661
Reviewed-on: https://chromium-review.googlesource.com/528313
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Ilja H. Friedel <ihf@chromium.org>
Reviewed-by: Aviv Keshet <akeshet@chromium.org>

[modify] https://crrev.com/7295bf332df608b1d4e9cdc9f0c769b71ffbae46/server/cros/dynamic_suite/suite.py
[modify] https://crrev.com/7295bf332df608b1d4e9cdc9f0c769b71ffbae46/server/cros/dynamic_suite/suite_unittest.py
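For readers skimming the thread, the rule this CL describes might be sketched roughly as below. This is only an illustration of the "at least one retry inside a retrying suite" idea, not the code that landed in suite.py; the function name, its parameters, and the default floor of 1 are hypothetical.

```python
def effective_test_retries(test_retries, suite_retries_enabled, minimum=1):
    """Return how many times a test may be retried inside a suite.

    Hypothetical helper for illustration; not the actual suite.py logic.

    test_retries: retries the test's own control file asks for (0 if unset).
    suite_retries_enabled: whether the suite itself requested job retries.
    minimum: floor applied to every test when the suite retries at all.
    """
    if not suite_retries_enabled:
        # The suite did not request retries, so no test is retried.
        return 0
    # Tests that request fewer retries than the floor are bumped up to it.
    return max(test_retries, minimum)


if __name__ == "__main__":
    # A test without any retry setting still gets one more chance.
    print(effective_test_retries(0, True))
    # A test that already asks for more keeps its own limit.
    print(effective_test_retries(3, True))
```

Under this sketch, a test such as dummy_PassServer, which sets no retry count, would still be rerun once if its DUT fails provision.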
,
Jun 9 2017
This should be done pending a push-to-prod. Will mark verified once I see this behaviour in the field. haddowk@ Can you also keep an eye out? If a test (internal to moblab) is retried on moblab in moblab_RunSuite, we win.
,
Jul 26 2017
Validation that this is done: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/130637080-chromeos-test/chromeos2-row1-rack8-host1/ Although it didn't save this build :(
,
Jul 27 2017
So I see the retries https://screenshot.googleplex.com/wcqUGUWvESq
However the suite still gets retried in total because the run suite command exits with a warning

07-26-2017 [21:51:30] Output below this line is for buildbot consumption:
@@@STEP_LINK@[Test-Logs]: dummy_PassServer: retry_count: 1, GOOD: completed successfully@http://localhost/tko/retrieve_logs.cgi?job=/results/5-moblab/@@@
@@@STEP_LINK@[Flake-Dashboard]: dummy_PassServer@https://wmatrix.googleplex.com/retry_teststats/?days_back=30&tests=dummy_PassServer@@@
@@@STEP_LINK@[Test-History]: dummy_PassServer@https://wmatrix.googleplex.com/unfiltered?hide_missing=True&tests=dummy_PassServer@@@
Will return from run_suite with status: WARNING
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/test.py", line 818, in _call_test_function
    return func(*args, **dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 471, in execute
    dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 348, in _call_run_once_with_retry
    postprocess_profiled_run, args, dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 381, in _call_run_once
    self.run_once(*args, **dargs)
  File "/usr/local/autotest/server/site_tests/moblab_RunSuite/moblab_RunSuite.py", line 65, in run_once
    raise e
AutoservRunError: command execution error
* Command:
    /usr/bin/ssh -a -x -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22 chromeos2-row1-rack8-host1 "export LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger -tag \"autotest\" \"server[stack::run_as_moblab|run|wrapper] -> ssh_run(su - moblab -c '/usr/local/autotest/site_utils/run_suite.py --pool='' --board=cyan --build=cyan-release/R59-9460.60.0 --suite_name=dummy_server --retry=True --max_retries=1')\";fi; su - moblab -c '/usr/local/autotest/site_utils/run_suite.py --pool='' --board=cyan --build=cyan-release/R59-9460.60.0 --suite_name=dummy_server --retry=True --max_retries=1'"
Exit status: 2
Duration: 1842.33667278

So I am sorry at the moment I am not sure this helps much.
,
Jul 27 2017
Link showing the suite being retried http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=130697823
,
Jul 27 2017
Still a step in the right direction. This sounds like we just need moblab_RunSuite to succeed in case of test retries internally. In the link you pointed to: this is the instance of moblab_RunSuite that was retried: http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=130698332 (considered bad) Looking at the logs, you're right: we retried moblab_RunSuite because run_suite raised an exception because of a warning in the suite. This is easy to fix. CL incoming.
,
Aug 10 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/60dc56a33fc027fb484f00da051c79429c7f1fa1

commit 60dc56a33fc027fb484f00da051c79429c7f1fa1
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Aug 10 04:46:12 2017

moblab_RunSuite: Ignore non-critical errors from run_suite

This allows retries of tests on the moblab to be considered success for the moblab_RunSuite test. Other warnings from the test / suite should also be ignored because they indicate non-critical problems with the infrastructure.

BUG= chromium:729099
TEST=Run moblab_RunSuite and force an internal test retry by killing an autoserv process at the opportune moment.

Change-Id: Ia0f5e0fb5c2e58361935cc6f85651ca62797aec5
Reviewed-on: https://chromium-review.googlesource.com/590092
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Keith Haddow <haddowk@chromium.org>

[modify] https://crrev.com/60dc56a33fc027fb484f00da051c79429c7f1fa1/server/site_tests/moblab_RunSuite/moblab_RunSuite.py
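A minimal sketch of what "ignore non-critical errors" could look like, assuming the caller only sees run_suite's exit status. The value 2 for WARNING is taken from the log earlier in this thread ("status: WARNING" followed by "Exit status: 2"), and 0 for success is the usual convention; the constant names, the exception class, and the function are hypothetical and not taken from moblab_RunSuite.py.

```python
# Exit statuses assumed for illustration only.
RUN_SUITE_OK = 0
RUN_SUITE_WARNING = 2  # seen in this thread when a test passed after a retry

NON_CRITICAL_STATUSES = frozenset([RUN_SUITE_OK, RUN_SUITE_WARNING])


class RunSuiteFailed(Exception):
    """Hypothetical error raised for exit statuses we do not tolerate."""


def check_run_suite_exit(exit_status):
    """Accept OK and WARNING; raise for anything else.

    Returns True when the suite finished with a warning, so the caller
    can still log it, and False when it finished cleanly.
    """
    if exit_status not in NON_CRITICAL_STATUSES:
        raise RunSuiteFailed(
            'run_suite exited with status %d' % exit_status)
    return exit_status == RUN_SUITE_WARNING


if __name__ == "__main__":
    # A retried-then-passed suite is reported as a warning, not a failure.
    print(check_run_suite_exit(2))
```

Returning a flag rather than silently swallowing the warning keeps the retry visible in the logs while letting the outer test pass.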
,
Aug 10 2017
Back to watching the moblab paladin to verify whether this is fixed. If you find a CQ run where we retried a moblab test internally and succeeded, please add the link here and mark this Verified.
Comment 1 by pprabhu@chromium.org
, Jun 2 2017