
Issue 729099


Issue metadata

Status: Fixed
Owner:
Closed: Aug 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug

Blocking:
issue 714330




moblab: moblab_RunSuite should retry the tests launched on moblab

Project Member Reported by pprabhu@chromium.org, Jun 2 2017

Issue description

See  issue 714330  for context.

- One of the sub-DUTs on moblab fails provision.
- The corresponding test fails.
- moblab_RunSuite fails.

But if we had retried the test on the other, good DUT, moblab_RunSuite would have succeeded.

So, why not?
 
Project Member

Comment 2 by bugdroid1@chromium.org, Jun 8 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/49420390d1488c1ed56fa7dc03a202e39cb2fe87

commit 49420390d1488c1ed56fa7dc03a202e39cb2fe87
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Jun 08 06:45:38 2017

moblab_RunSuite: Retry tests launched on moblab

moblab's sub-DUTs are often unreliable, or at least, are as likely to
fail provision as any other DUT. We retry tests in suites launched from
the builders in order to avoid the suite failing due to a small number
of provision failures. By the same logic, we should retry the tests
inside the suite launched on moblab: to avoid failing the test when a
small number of provisions fail.

At the same time, keep the retry limit small: We don't have that many
DUTs attached to each moblab instance, so retrying tests directly
translates to longer end-to-end moblab_RunSuite time.

BUG= chromium:729099 
TEST=moblab_RunSuite

Change-Id: If528c95241ca422f7f0ed8d17984ac37eff0f430
Reviewed-on: https://chromium-review.googlesource.com/522926
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Simran Basi <sbasi@chromium.org>

[modify] https://crrev.com/49420390d1488c1ed56fa7dc03a202e39cb2fe87/server/site_tests/moblab_RunSuite/control.dummyServer
[modify] https://crrev.com/49420390d1488c1ed56fa7dc03a202e39cb2fe87/server/site_tests/moblab_RunSuite/moblab_RunSuite.py
[modify] https://crrev.com/49420390d1488c1ed56fa7dc03a202e39cb2fe87/server/site_tests/moblab_RunSuite/control.smoke
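
A minimal sketch of how a small retry limit could be passed to the suite launched on moblab. The `--retry` and `--max_retries` flags and the `run_suite.py` path are taken from the command line logged later in this issue; the helper name `build_run_suite_args` and its parameters are hypothetical and not part of autotest.

```python
RUN_SUITE = '/usr/local/autotest/site_utils/run_suite.py'


def build_run_suite_args(board, build, suite_name, max_retries=1):
    """Hypothetical helper: build the argument list for run_suite.py.

    max_retries is kept small on purpose: a moblab has few DUTs attached,
    so every retry adds directly to the end-to-end moblab_RunSuite time.
    """
    return [
        RUN_SUITE,
        '--pool=',
        '--board=%s' % board,
        '--build=%s' % build,
        '--suite_name=%s' % suite_name,
        '--retry=True',
        '--max_retries=%d' % max_retries,
    ]


args = build_run_suite_args('cyan', 'cyan-release/R59-9460.60.0',
                            'dummy_server')
print(' '.join(args))
```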

Not yet done.
The tests used by moblab_RunSuite's internal suites do not necessarily request job-level retries.

For example, moblab_RunSuite/control.dummyServer eventually uses the test dummy_PassServer. None of its control files has JOB_RETRIES set.

We can either add JOB_RETRIES to these control files (this will affect the test everywhere), or actually teach suites to retry all tests at least once.
That proposal is here: https://chromium-review.googlesource.com/c/528313/

Project Member

Comment 5 by bugdroid1@chromium.org, Jun 9 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/7295bf332df608b1d4e9cdc9f0c769b71ffbae46

commit 7295bf332df608b1d4e9cdc9f0c769b71ffbae46
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Fri Jun 09 14:10:42 2017

[autotest] Bump all tests to retry at least once in a suite.

When a suite requests job retries, we retry a test only if the test
itself also requests retries. For important suites (running on CQ / BVT),
we would like tests that fail as a result of their provision job failing
to get at least one more chance to run.

This CL is a short-term fix. It bumps up the individual test retry limit
to at least 1, so that each test is protected from its DUT failing
provision.

BUG= chromium:730885 
BUG= chromium:729099 
TEST=- run test_that with a test that doesn't request retries
     - inject a bug in the provision code so that the DUT fails
       provision.
     - watch the DUT fail provision, and the test get retried (of course
       that retry will fail again due to the same injected bug).
TEST=(updated) unittests.

Change-Id: I59b3ae36bb78c94fce234976d81297245cedd661
Reviewed-on: https://chromium-review.googlesource.com/528313
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Ilja H. Friedel <ihf@chromium.org>
Reviewed-by: Aviv Keshet <akeshet@chromium.org>

[modify] https://crrev.com/7295bf332df608b1d4e9cdc9f0c769b71ffbae46/server/cros/dynamic_suite/suite.py
[modify] https://crrev.com/7295bf332df608b1d4e9cdc9f0c769b71ffbae46/server/cros/dynamic_suite/suite_unittest.py
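
The rule described in this commit message can be sketched as follows. This is an illustration only, not the code in suite.py: the function name `effective_job_retries` and its parameters are hypothetical, and the only behaviour assumed is what the message states (a test is retried only when the suite requests retries, and its own retry limit is raised to at least 1).

```python
def effective_job_retries(test_job_retries, suite_requests_retries,
                          min_retries=1):
    """Hypothetical helper: retry limit applied to one test in a suite.

    test_job_retries: the JOB_RETRIES value from the test's control file
        (0 when the control file does not set it).
    suite_requests_retries: whether the suite was launched with retries.
    min_retries: the floor introduced by the short-term fix.
    """
    if not suite_requests_retries:
        # Without suite-level retries, no test is retried at all.
        return 0
    # Bump the per-test limit so that one provision failure is survivable.
    return max(test_job_retries, min_retries)


# dummy_PassServer sets no JOB_RETRIES, so it would get the floor of 1.
print(effective_job_retries(0, True))
```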

Cc: haddowk@chromium.org
Status: Fixed (was: Started)
This should be done pending a push-to-prod.
Will mark verified once I see this behaviour in the field.

haddowk@ Can you also keep an eye out?
If a test (internal to moblab) is retried on moblab in moblab_RunSuite, we win.
Status: Verified (was: Fixed)
Validation that this is done: https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/130637080-chromeos-test/chromeos2-row1-rack8-host1/

Although it didn't save this build :(

So I see the retries

https://screenshot.googleplex.com/wcqUGUWvESq

However, the whole suite still gets retried, because the run_suite command exits with a warning:

   07-26-2017 [21:51:30] Output below this line is for buildbot consumption:
  @@@STEP_LINK@[Test-Logs]: dummy_PassServer: retry_count: 1, GOOD: completed successfully@http://localhost/tko/retrieve_logs.cgi?job=/results/5-moblab/@@@
  @@@STEP_LINK@[Flake-Dashboard]: dummy_PassServer@https://wmatrix.googleplex.com/retry_teststats/?days_back=30&tests=dummy_PassServer@@@
  @@@STEP_LINK@[Test-History]: dummy_PassServer@https://wmatrix.googleplex.com/unfiltered?hide_missing=True&tests=dummy_PassServer@@@
  Will return from run_suite with status: WARNING
  Traceback (most recent call last):
    File "/usr/local/autotest/client/common_lib/test.py", line 818, in _call_test_function
      return func(*args, **dargs)
    File "/usr/local/autotest/client/common_lib/test.py", line 471, in execute
      dargs)
    File "/usr/local/autotest/client/common_lib/test.py", line 348, in _call_run_once_with_retry
      postprocess_profiled_run, args, dargs)
    File "/usr/local/autotest/client/common_lib/test.py", line 381, in _call_run_once
      self.run_once(*args, **dargs)
    File "/usr/local/autotest/server/site_tests/moblab_RunSuite/moblab_RunSuite.py", line 65, in run_once
      raise e
  AutoservRunError: command execution error
  * Command: 
      /usr/bin/ssh -a -x     -o StrictHostKeyChecking=no -o
      UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o
      ServerAliveInterval=900 -o ServerAliveCountMax=3 -o ConnectionAttempts=4
      -o Protocol=2 -l root -p 22 chromeos2-row1-rack8-host1 "export
      LIBC_FATAL_STDERR_=1; if type \"logger\" > /dev/null 2>&1; then logger
      -tag \"autotest\" \"server[stack::run_as_moblab|run|wrapper] -> ssh_run(su
      - moblab -c '/usr/local/autotest/site_utils/run_suite.py --pool=''
      --board=cyan --build=cyan-release/R59-9460.60.0 --suite_name=dummy_server
      --retry=True --max_retries=1')\";fi; su - moblab -c
      '/usr/local/autotest/site_utils/run_suite.py --pool='' --board=cyan
      --build=cyan-release/R59-9460.60.0 --suite_name=dummy_server --retry=True
      --max_retries=1'"
  Exit status: 2
  Duration: 1842.33667278

So, I am sorry, but at the moment I am not sure this helps much.
Link showing the suite being retried:
http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=130697823
Labels: -Pri-3 Pri-2
Status: Started (was: Verified)
Still a step in the right direction. It sounds like we just need moblab_RunSuite to succeed when tests are retried internally.

In the link you pointed to: this is the instance of moblab_RunSuite that was retried: http://cautotest.corp.google.com/afe/#tab_id=view_job&object_id=130698332 (considered bad)

Looking at the logs, you're right: we retried moblab_RunSuite because run_suite raised an exception because of a warning in the suite.

This is easy to fix. CL incoming.
Project Member

Comment 11 by bugdroid1@chromium.org, Aug 10 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/60dc56a33fc027fb484f00da051c79429c7f1fa1

commit 60dc56a33fc027fb484f00da051c79429c7f1fa1
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Aug 10 04:46:12 2017

moblab_RunSuite: Ignore non-critical errors from run_suite

This allows retries of tests on the moblab to be considered success for
the moblab_RunSuite test. Other warnings from the test / suite should
also be ignored because they indicate non-critical problems with the
infrastructure.

BUG= chromium:729099 
TEST=Run moblab_RunSuite and force an internal test retry by killing an
     autoserv process at the opportune moment.

Change-Id: Ia0f5e0fb5c2e58361935cc6f85651ca62797aec5
Reviewed-on: https://chromium-review.googlesource.com/590092
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Keith Haddow <haddowk@chromium.org>

[modify] https://crrev.com/60dc56a33fc027fb484f00da051c79429c7f1fa1/server/site_tests/moblab_RunSuite/moblab_RunSuite.py
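
A rough sketch of the idea behind this change, not the actual moblab_RunSuite.py code. It assumes that exit status 2 corresponds to WARNING, which is inferred only from the log earlier in this issue ("Will return from run_suite with status: WARNING" followed by "Exit status: 2"). The names `NON_CRITICAL_EXIT_CODES` and `run_ignoring_warnings` are hypothetical.

```python
import subprocess
import sys

# Assumption: 0 means OK and 2 means WARNING (inferred from the log in this
# issue). Any other exit status is treated as a real failure.
NON_CRITICAL_EXIT_CODES = frozenset([0, 2])


def run_ignoring_warnings(argv):
    """Run a command; raise only if it exits with a critical status."""
    returncode = subprocess.call(argv)
    if returncode not in NON_CRITICAL_EXIT_CODES:
        raise RuntimeError('command failed with exit status %d' % returncode)
    return returncode


# Stand-in for run_suite.py: a child process that exits with status 2.
status = run_ignoring_warnings(
    [sys.executable, '-c', 'import sys; sys.exit(2)'])
print('exit status %d treated as success' % status)
```

With this shape, a suite whose tests passed only after an internal retry no longer causes the outer test to raise, while other non-zero exit statuses still do.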

Status: Fixed (was: Started)
Back to watching the moblab paladin to verify whether this is fixed.

If you find a CQ run where we retried a moblab test internally and succeeded, please add a link here and mark this Verified.
