New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 750918 link

Starred by 2 users

Issue metadata

Status: Verified
Owner:
Last visit > 30 days ago
Closed: Mar 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug
okr



Sign in to add a comment

enable test-level retries for VMTest

Project Member Reported by akes...@chromium.org, Jul 31 2017

Issue description

A single failing test within vmtest suite causes the entire suite to be retried.

Enable test-level retries instead, to save time in the CQ.
 
Cc: jkop@chromium.org
+jkop. This is a candidate start bug. Most of the work would probably be in the autotest repo (VMTest uses the test_that tool to run tests locally against a virtual device; test_that source code lives in autotest repo).
Here's an example of a VMTest failure that led to the entire suite/stage being retried https://luci-milo.appspot.com/buildbot/chromeos/betty-paladin/282 . The VMTest stage logs should help get started.

There are other example out there of retrys where things succeeded on retry, I just didn't have one conveiently available.
Cc: pwang@chromium.org derat@chromium.org ihf@chromium.org
 Issue 749268  has been merged into this issue.

Comment 4 by pwang@chromium.org, Aug 18 2017

Is it possible for us to move the VMTest to run on the lab (or treat them the same as current HWTest jobs), in that case, we can benefits from setting test-level retries as well as other benefits of the current system.
^ The VM instance that we are testing against runs on the builder. Moving vmtest off the builder is a much bigger project than what is needed here.
Owner: jkop@chromium.org
Status: Assigned (was: Available)
Provisionally assigning to jkop@

Comment 7 by jkop@chromium.org, Aug 28 2017

Status: Started (was: Assigned)

Comment 8 by jkop@chromium.org, Sep 12 2017

Status: Assigned (was: Started)

Comment 9 by jkop@chromium.org, Sep 21 2017

Status: Started (was: Assigned)
Checking my understanding:

The error in the run with the retried suite was a problem with the test harness. My intuition would be that if the test harness is broken, everything within it would fail, and so rerunning the suite would be the desired result.

Betty is our only VM suite at present, so I should only look within betty failures for similar results.

https://luci-milo.appspot.com/buildbot/chromeos/betty-paladin/817 is a VM failure but did not retry at all, and so it is not an instance of this problem and is unlikely to be useful for comparison.
Ok, we'll need to fish around for an example of a failure that was of an individual test and not the test harness.

I'm pretty sure I've seen this.
I think this fits the bill (modulo that the tests failed upon retry too, nevertheless the point stands and benefit from test-level retry still applies)

https://luci-milo.appspot.com/buildbot/chromeos/betty-arc64-paladin/589

Comment 12 by jkop@chromium.org, Oct 12 2017

Labels: OKR

Comment 13 by jkop@chromium.org, Oct 18 2017

Ran some SQL to find the potential impact of the change. TL;DR: In the unrealistic perfect case, we could save 0.5% of VM test time.

I ran each of these on the last month of data, starting from the last full day before querying.

#VMTest retries where the second try passed
SELECT SUM(TIMESTAMPDIFF(SECOND, start_time, finish_time)) FROM buildStageTable
WHERE
	name="VMTest (attempt 2)" AND status = "pass" AND
	build_id IN (
		select build_id from buildStageTable where name="VMTest (attempt 1)" AND
			start_time BETWEEN "2017-09-18 00:00:01" AND "2017-10-18 00:00:01"
	);
# => 1249895 seconds = 347 hours = 0.19% of VMTest runtime

#VMTest retries which had completed at query time, regardless of success.
SELECT SUM(TIMESTAMPDIFF(SECOND, start_time, finish_time)) FROM buildStageTable
WHERE
	name="VMTest (attempt 2)" AND final = 1 AND
	build_id IN (
		select build_id from buildStageTable where name="VMTest (attempt 1)" AND
			start_time BETWEEN "2017-09-18 00:00:01" AND "2017-10-18 00:00:01"
	);
# => 3542447 seconds = 984 hours = 0.53% of VMTest runtime

#VMTest first runs, total time.
SELECT SUM(TIMESTAMPDIFF(SECOND, start_time, finish_time)) FROM buildStageTable
WHERE name="VMTest (attempt 1)" AND final = 1;
# => 665067843 seconds = 184741 hours = 99.3% of VMTest runtime

With that in mind I'd judge this a very low priority.
mysql> select COUNT(*) from buildStageTable where name='VMTest (attempt 1)' and status != 'skipped' and board = 'betty' and last_updated > "2017-09-18 00:00:01";
+----------+
| COUNT(*) |
+----------+
|     3287 |
+----------+
1 row in set (4.32 sec)

mysql> select COUNT(*) from buildStageTable where name='VMTest (attempt 2)' and status != 'skipped' and board = 'betty' and last_updated > "2017-09-18 00:00:01";
+----------+
| COUNT(*) |
+----------+
|      416 |
+----------+
1 row in set (4.39 sec)

Comment 15 by jkop@chromium.org, Oct 18 2017

Missed the time range on that last query. Should be

SELECT SUM(TIMESTAMPDIFF(SECOND, start_time, finish_time)) FROM buildStageTable
WHERE name="VMTest (attempt 1)" AND final = 1 AND start_time BETWEEN "2017-09-18 00:00:01" AND "2017-10-18 00:00:01";
# => 21496308 seconds = 5791 hours

Revised breakdown:

85.9% of VMTest runtime: first tries
14.2% of VMTest runtime: second tries
 5.0% of VMTest runtime: second tries which pass

This would amount to 2-6% of total CQ runtime being taken up by retrying VMTest and so looks far more promising. I take back my earlier judgment.

Comment 16 by jkop@chromium.org, Oct 19 2017

On akeshet's recommendation, limited it to Pre-CQ only for a better assessment of the problem at hand for the OKR. 

Overall results:
Of Pre-CQ builds where any VM Test is run, VM test retries are 4.8% of the total build time.
Of all Pre-CQ builds, retries are 4.3% of total build time.

In runs with a retry, the retry takes up 25% of the total length of the build. This implies that about 5% of Pre-CQ runs currently include a retry.

SQL follows for inspection:

#time spent on retries:
SELECT SUM(TIMESTAMPDIFF(SECOND, stages.start_time,  stages.finish_time))
FROM buildTable builds
JOIN buildStageTable stages ON stages.build_id = builds.id
WHERE builds.final = 1 AND builds.start_time BETWEEN "2017-09-18 00:00:01" AND "2017-10-18 00:00:01"
AND stages.name ="VMTest (attempt 2)" AND builds.builder_name="pre_cq";
# => 1062437 seconds = 295 hours = 4.8% of VM-including precq build time, 4.3% of all precq build time
#time spent on precq builds including a VM Test
SELECT SUM(TIMESTAMPDIFF(SECOND, builds.start_time,  builds.finish_time))
FROM buildTable builds
JOIN buildStageTable stages ON stages.build_id = builds.id
WHERE builds.final = 1 AND builds.start_time BETWEEN "2017-09-18 00:00:01" AND "2017-10-18 00:00:01"
AND stages.name ="VMTest (attempt 1)" AND builds.builder_name="pre_cq";
# => 21971622 seconds = 6103 hours
#including precqs without any VM tests
SELECT SUM(TIMESTAMPDIFF(SECOND, builds.start_time,  builds.finish_time))
FROM buildTable builds
WHERE builds.final = 1 AND builds.start_time BETWEEN "2017-09-18 00:00:01" AND "2017-10-18 00:00:01"
AND builds.builder_name="pre_cq";
# => 24600233 seconds = 6833 hours
Is there any update on this? Still seems like a strong handling-time-saving candidate.

Comment 18 by jkop@google.com, Nov 16 2017

I'd set it aside until earlier this week, so no update yet.
Cc: jrbarnette@chromium.org
 Issue 794707  has been merged into this issue.

Comment 20 by jkop@chromium.org, Dec 15 2017

Labels: -Pri-2 Pri-1
This is looking more urgent, so moved to the top of my queue.

Comment 21 by ihf@chromium.org, Dec 21 2017

People interested in this issue may also be interested in  issue 797109 . Long story short, the GCE instance can become unresponsive for several minutes even when fairly idle. This can trigger various timeouts.

Comment 22 by ihf@chromium.org, Dec 22 2017

I would like to step back and pose a question. How hard would it be to make VMTest/betty instances look more like a real DUT. With shard and servo (issue 797164) and all that comes with being a DUT. Obviously it would require some stubbing. This bug here would not exist though if VMTest/betty behaved like real hardware. Also we would get other things for free, like sharding over multiple VMs (which PoHsien has been working on, but there are limits). Also we could separate cloud build instances from test instances and maybe get better utilization (but then ACL and artifact transfer becomes an issue).

I think this is a long winded way of saying that maybe this shouldn't be a starter bug (because it has a long tail and will keep giving).
RE 21&22 I agree those are useful investigations. Let's fork those and keep this  bug about the test-level retries work (which requires some moderate feature work in test_that and is a good size starter bug).
Project Member

Comment 24 by bugdroid1@chromium.org, Jan 19 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/64c039007a346ee80a263b045c8b074c17727fb8

commit 64c039007a346ee80a263b045c8b074c17727fb8
Author: Jacob Kopczynski <jkop@google.com>
Date: Fri Jan 19 01:40:45 2018

vm_tests: Fetch jobs from afe added midstream

Currently perform_local_run fetches jobs from AFE once and then executes
them all serially.
This change makes it poll AFE for further jobs once it has finished its queue,
and returns only if no new jobs found.

BUG= chromium:750918 
TEST=Unittests for test_runner_utils and test_that

Change-Id: Ie461396a984e3614322b3cd595216dbc44ef3437
Reviewed-on: https://chromium-review.googlesource.com/851259
Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com>
Tested-by: Jacob Kopczynski <jkop@chromium.org>
Reviewed-by: Aviv Keshet <akeshet@chromium.org>

[modify] https://crrev.com/64c039007a346ee80a263b045c8b074c17727fb8/site_utils/test_runner_utils.py

Project Member

Comment 25 by bugdroid1@chromium.org, Feb 8 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/aa78bdf75a43ce39fe0b2c675eb5013ec6539c97

commit aa78bdf75a43ce39fe0b2c675eb5013ec6539c97
Author: Jacob Kopczynski <jkop@google.com>
Date: Thu Feb 08 09:24:39 2018

chromite: add retries to VM test config

Add retry and max_retries fields to the VM test config class. Use these
in the vm test suite; set to disabled by default for other cases.

BUG= chromium:750918 
TEST=tryjob

Change-Id: I3163bbd63c3f5cb4316195f5e56428677c15488d
Reviewed-on: https://chromium-review.googlesource.com/902948
Commit-Ready: Jacob Kopczynski <jkop@chromium.org>
Tested-by: Jacob Kopczynski <jkop@chromium.org>
Reviewed-by: Don Garrett <dgarrett@chromium.org>

[modify] https://crrev.com/aa78bdf75a43ce39fe0b2c675eb5013ec6539c97/cbuildbot/config_dump.json
[modify] https://crrev.com/aa78bdf75a43ce39fe0b2c675eb5013ec6539c97/lib/constants.py
[modify] https://crrev.com/aa78bdf75a43ce39fe0b2c675eb5013ec6539c97/lib/config_lib.py
[modify] https://crrev.com/aa78bdf75a43ce39fe0b2c675eb5013ec6539c97/cbuildbot/chromeos_config_unittest.py
[modify] https://crrev.com/aa78bdf75a43ce39fe0b2c675eb5013ec6539c97/cbuildbot/chromeos_config.py

Labels: -OKR okr
Is this believed to be working?

Comment 27 by jkop@chromium.org, Mar 5 2018

No, one last CL needed. Once crrev.com/c/860163 is in I believe this will work.
Project Member

Comment 28 by bugdroid1@chromium.org, Mar 6 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/2cefa1f107097a3f2a920567a2a39a011f88220e

commit 2cefa1f107097a3f2a920567a2a39a011f88220e
Author: Jacob Kopczynski <jkop@google.com>
Date: Tue Mar 06 02:34:45 2018

autotest: LocalSuite subclass for VM test retries

New subclass of Suite with methods mimicking but not duplicating the existing
schedule and result-retry flow, to handle retries for a local suite.

handle_local_result and _retry_local_result take an AFE job ID rather than a
Result object, and do not take a Waiter instance.

In RetryHandler, _should_retry_local_job also takes an ID rather than a Result,
and checks fewer considerations; it takes for granted that the job was not
aborted, and does not filter by _retry_level.

In other respects they closely mirror the behavior of _handle_result,
_retry_result, and _should_retry.

BUG= chromium:750918 
TEST=suite:dummyflake retried, unit tests, and tryjob.

Change-Id: I68bf950b389bc247a9c1c1d6f9f7711023a13c10
Reviewed-on: https://chromium-review.googlesource.com/860163
Commit-Ready: Jacob Kopczynski <jkop@chromium.org>
Tested-by: Jacob Kopczynski <jkop@chromium.org>
Reviewed-by: Aviv Keshet <akeshet@chromium.org>

[modify] https://crrev.com/2cefa1f107097a3f2a920567a2a39a011f88220e/server/cros/dynamic_suite/suite.py
[modify] https://crrev.com/2cefa1f107097a3f2a920567a2a39a011f88220e/site_utils/generate_test_report
[modify] https://crrev.com/2cefa1f107097a3f2a920567a2a39a011f88220e/site_utils/test_runner_utils.py
[modify] https://crrev.com/2cefa1f107097a3f2a920567a2a39a011f88220e/site_utils/test_runner_utils_unittest.py

Comment 29 by jkop@chromium.org, Mar 7 2018

Blockedon: 818859

Comment 30 by jkop@chromium.org, Mar 7 2018

818859 doesn't block turning test-level retries up, but probably does block turning suite-level retries down.
Status update?

Comment 32 by jkop@chromium.org, Mar 19 2018

Retrying on the test level has been pushed since the 8th with no obvious problems. No builds have failed at the VM Test stage since. There are a slightly suspicious number of betty runs failing with an exception, but this doesn't appear to be related.

I haven't declared victory yet because I don't have traction on 818859 and expect a lot of gains may only come when stage-level retries are turned down.
If test level retries are working, let's close this and track other benefits that may accrue some suite retries turndown separately.

Comment 34 by jkop@chromium.org, Mar 19 2018

Status: Fixed (was: Started)

Comment 35 by jkop@chromium.org, Mar 20 2018

Status: Verified (was: Fixed)

Comment 36 by jkop@chromium.org, Mar 20 2018

Blockedon: -818859

Sign in to add a comment