Issue 730885

Issue metadata

Status: Fixed
Owner:
Closed: Jun 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug



run_suite --retry does not retry tests with no JOB_RETRIES set in the control file

Project Member Reported by mcchou@chromium.org, Jun 8 2017

Issue description

Builder:
veyron_speedy-paladin

Build #:
https://chromegw.corp.google.com/i/chromeos/builders/veyron_speedy-paladin/builds/5550

Error messages from https://pantheon.corp.google.com/storage/browser/chromeos-autotest-results/122066343-chromeos-test/chromeos4-row4-rack11-host16/debug/:
06/07 16:27:59.119 DEBUG|              test:0390| Test failed due to No answer to ping from chromeos4-row4-rack11-host16. Exception log follows the after_iteration_hooks.
06/07 16:27:59.119 DEBUG|              test:0393| starting after_iteration_hooks
06/07 16:27:59.119 DEBUG|              test:0396| after_iteration_hooks completed
06/07 16:27:59.121 WARNI|              test:0616| The test failed with the following exception
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/test.py", line 610, in _exec
    _call_test_function(self.execute, *p_args, **p_dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 824, in _call_test_function
    raise error.UnhandledTestFail(e)
UnhandledTestFail: Unhandled AutoservError: No answer to ping from chromeos4-row4-rack11-host16
Traceback (most recent call last):
  File "/usr/local/autotest/client/common_lib/test.py", line 818, in _call_test_function
    return func(*args, **dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 471, in execute
    dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 348, in _call_run_once_with_retry
    postprocess_profiled_run, args, dargs)
  File "/usr/local/autotest/client/common_lib/test.py", line 381, in _call_run_once
    self.run_once(*args, **dargs)
  File "/usr/local/autotest/server/site_tests/provision_AutoUpdate/provision_AutoUpdate.py", line 113, in run_once
    force_full_update=force)
  File "/usr/local/autotest/server/afe_utils.py", line 208, in machine_install_and_update_labels
    *args, **dargs)
  File "/usr/local/autotest/server/hosts/cros_host.py", line 809, in machine_install_by_devserver
    'No answer to ping from %s' % self.hostname)
AutoservError: No answer to ping from chromeos4-row4-rack11-host16
 
Owner: pprabhu@chromium.org
Summary: veyron_speedy-paladin: Test not retried after provision failure. (was: veyron_speedy-paladin: Unhandled AutoservError: No answer to ping from chromeos4-row4-rack11-host16)
This is a fairly normal provision flake. The real bug is: why wasn't the test retried?
https://viceroy.corp.google.com/chromeos/suite_details?job_id=122066331

The provision job on chromeos4-row4-rack11-host16 failed after 36 minutes. That is a pain in its own right, but the worse problem is that the affected test, generic_RebootTest, wasn't retried.

I've been seeing this misbehaviour on moblab while testing my CL to turn on retries there: https://chromium-review.googlesource.com/c/522926/


Either I'm missing something or test retries in a suite are simply broken right now.
Status: started (was: Untriaged)
OK, I finally understand job retries. The test's control file needs to say that it wants to be retried (by setting JOB_RETRIES); the suite asking for retries is not enough on its own. This calls for either an audit of all tests used in important suites or a relaxation of this requirement. As things stand, we may be retrying some tests but not others.

https://chromium-review.googlesource.com/c/527935/
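
For readers following along, the rule described in this comment can be modelled roughly as below. This is only a sketch under assumptions, not the actual autotest code: `ControlData` and `should_retry` are hypothetical names made up for illustration, and the only identifier taken from this thread is JOB_RETRIES.

```python
from dataclasses import dataclass


@dataclass
class ControlData:
    """Hypothetical stand-in for a parsed test control file."""
    name: str
    # Value of JOB_RETRIES; assumed to be 0 when the control file omits it.
    job_retries: int = 0


def should_retry(suite_wants_retries: bool, test: ControlData,
                 retries_used: int) -> bool:
    """Model of the behaviour reported in this bug.

    A failed test job is retried only if the suite asked for retries
    AND the test's own control file still has retry budget left.
    """
    return suite_wants_retries and retries_used < test.job_retries


# A test whose control file does not set JOB_RETRIES is never retried,
# even when the suite is run with retries enabled.
no_opt_in = ControlData(name="hypothetical_TestWithoutRetries")
opted_in = ControlData(name="hypothetical_TestWithRetries", job_retries=2)

print(should_retry(True, no_opt_in, retries_used=0))
print(should_retry(True, opted_in, retries_used=0))
```

Under this model, a provision failure on the DUT fails the first test without any second attempt, which matches what was observed in the build above.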

Project Member

Comment 4 by bugdroid1@chromium.org, Jun 9 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/7295bf332df608b1d4e9cdc9f0c769b71ffbae46

commit 7295bf332df608b1d4e9cdc9f0c769b71ffbae46
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Fri Jun 09 14:10:42 2017

[autotest] Bump all tests to retry at least once in a suite.

When a suite requests job retries, we retry a test only if the test
itself also requests retries. For important suites (running on CQ / BVT),
we would like tests that fail as a result of their provision job failing
to get at least one more chance to run.

This CL is a short-term fix. It bumps up the individual test retry limit
to at least 1, so that each test is protected from its DUT failing
provision.

BUG= chromium:730885 
BUG= chromium:729099 
TEST=- run test_that with a test that doesn't request retries
     - inject a bug in the provision code so that the DUT fails
       provision.
     - watch the DUT fail provision, and the test get retried (of course
       that retry will fail again due to the same injected bug).
TEST=(updated) unittests.

Change-Id: I59b3ae36bb78c94fce234976d81297245cedd661
Reviewed-on: https://chromium-review.googlesource.com/528313
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Ilja H. Friedel <ihf@chromium.org>
Reviewed-by: Aviv Keshet <akeshet@chromium.org>

[modify] https://crrev.com/7295bf332df608b1d4e9cdc9f0c769b71ffbae46/server/cros/dynamic_suite/suite.py
[modify] https://crrev.com/7295bf332df608b1d4e9cdc9f0c769b71ffbae46/server/cros/dynamic_suite/suite_unittest.py
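
As a rough illustration of what "bumps up the individual test retry limit to at least 1" means, here is a minimal sketch. It is not the code from suite.py: `effective_job_retries` is a hypothetical helper, and returning 0 when the suite has not requested retries is an assumption made for this example.

```python
def effective_job_retries(suite_wants_retries: bool,
                          control_file_retries: int) -> int:
    """Hypothetical helper: retry budget for one test in a suite.

    control_file_retries is the test's JOB_RETRIES value, taken as 0
    when the control file does not set it.
    """
    if not suite_wants_retries:
        # Assumption: without suite-level retries, nothing is retried.
        return 0
    # Every test gets at least one retry, so a single provision failure
    # on its DUT does not fail the test outright. Tests that ask for
    # more retries keep their larger limit.
    return max(control_file_retries, 1)


print(effective_job_retries(True, 0))
print(effective_job_retries(True, 3))
```

The `max` keeps the change one-directional: it can raise a test's limit to 1 but never lowers a limit the control file already set.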

Summary: run_suite --retry does not retry tests with no JOB_RETRIES set in the control file (was: veyron_speedy-paladin: Test not retried after provision failure.)
Status: Fixed (was: Started)
Should be done, pending push-to-prod.
