enable test-level retries for VMTest |
|||||||||||||
Issue descriptionA single failing test within vmtest suite causes the entire suite to be retried. Enable test-level retries instead, to save time in the CQ.
,
Aug 15 2017
Here's an example of a VMTest failure that led to the entire suite/stage being retried https://luci-milo.appspot.com/buildbot/chromeos/betty-paladin/282 . The VMTest stage logs should help get started. There are other example out there of retrys where things succeeded on retry, I just didn't have one conveiently available.
,
Aug 18 2017
Issue 749268 has been merged into this issue.
,
Aug 18 2017
Is it possible for us to move the VMTest to run on the lab (or treat them the same as current HWTest jobs), in that case, we can benefits from setting test-level retries as well as other benefits of the current system.
,
Aug 18 2017
^ The VM instance that we are testing against runs on the builder. Moving vmtest off the builder is a much bigger project than what is needed here.
,
Aug 21 2017
Provisionally assigning to jkop@
,
Aug 28 2017
,
Sep 12 2017
,
Sep 21 2017
Checking my understanding: The error in the run with the retried suite was a problem with the test harness. My intuition would be that if the test harness is broken, everything within it would fail, and so rerunning the suite would be the desired result. Betty is our only VM suite at present, so I should only look within betty failures for similar results. https://luci-milo.appspot.com/buildbot/chromeos/betty-paladin/817 is a VM failure but did not retry at all, and so it is not an instance of this problem and is unlikely to be useful for comparison.
,
Oct 4 2017
Ok, we'll need to fish around for an example of a failure that was of an individual test and not the test harness. I'm pretty sure I've seen this.
,
Oct 5 2017
I think this fits the bill (modulo that the tests failed upon retry too, nevertheless the point stands and benefit from test-level retry still applies) https://luci-milo.appspot.com/buildbot/chromeos/betty-arc64-paladin/589
,
Oct 12 2017
,
Oct 18 2017
Ran some SQL to find the potential impact of the change. TL;DR: In the unrealistic perfect case, we could save 0.5% of VM test time. I ran each of these on the last month of data, starting from the last full day before querying. #VMTest retries where the second try passed SELECT SUM(TIMESTAMPDIFF(SECOND, start_time, finish_time)) FROM buildStageTable WHERE name="VMTest (attempt 2)" AND status = "pass" AND build_id IN ( select build_id from buildStageTable where name="VMTest (attempt 1)" AND start_time BETWEEN "2017-09-18 00:00:01" AND "2017-10-18 00:00:01" ); # => 1249895 seconds = 347 hours = 0.19% of VMTest runtime #VMTest retries which had completed at query time, regardless of success. SELECT SUM(TIMESTAMPDIFF(SECOND, start_time, finish_time)) FROM buildStageTable WHERE name="VMTest (attempt 2)" AND final = 1 AND build_id IN ( select build_id from buildStageTable where name="VMTest (attempt 1)" AND start_time BETWEEN "2017-09-18 00:00:01" AND "2017-10-18 00:00:01" ); # => 3542447 seconds = 984 hours = 0.53% of VMTest runtime #VMTest first runs, total time. SELECT SUM(TIMESTAMPDIFF(SECOND, start_time, finish_time)) FROM buildStageTable WHERE name="VMTest (attempt 1)" AND final = 1; # => 665067843 seconds = 184741 hours = 99.3% of VMTest runtime With that in mind I'd judge this a very low priority.
,
Oct 18 2017
mysql> select COUNT(*) from buildStageTable where name='VMTest (attempt 1)' and status != 'skipped' and board = 'betty' and last_updated > "2017-09-18 00:00:01"; +----------+ | COUNT(*) | +----------+ | 3287 | +----------+ 1 row in set (4.32 sec) mysql> select COUNT(*) from buildStageTable where name='VMTest (attempt 2)' and status != 'skipped' and board = 'betty' and last_updated > "2017-09-18 00:00:01"; +----------+ | COUNT(*) | +----------+ | 416 | +----------+ 1 row in set (4.39 sec)
,
Oct 18 2017
Missed the time range on that last query. Should be SELECT SUM(TIMESTAMPDIFF(SECOND, start_time, finish_time)) FROM buildStageTable WHERE name="VMTest (attempt 1)" AND final = 1 AND start_time BETWEEN "2017-09-18 00:00:01" AND "2017-10-18 00:00:01"; # => 21496308 seconds = 5791 hours Revised breakdown: 85.9% of VMTest runtime: first tries 14.2% of VMTest runtime: second tries 5.0% of VMTest runtime: second tries which pass This would amount to 2-6% of total CQ runtime being taken up by retrying VMTest and so looks far more promising. I take back my earlier judgment.
,
Oct 19 2017
On akeshet's recommendation, limited it to Pre-CQ only for a better assessment of the problem at hand for the OKR. Overall results: Of Pre-CQ builds where any VM Test is run, VM test retries are 4.8% of the total build time. Of all Pre-CQ builds, retries are 4.3% of total build time. In runs with a retry, the retry takes up 25% of the total length of the build. This implies that about 5% of Pre-CQ runs currently include a retry. SQL follows for inspection: #time spent on retries: SELECT SUM(TIMESTAMPDIFF(SECOND, stages.start_time, stages.finish_time)) FROM buildTable builds JOIN buildStageTable stages ON stages.build_id = builds.id WHERE builds.final = 1 AND builds.start_time BETWEEN "2017-09-18 00:00:01" AND "2017-10-18 00:00:01" AND stages.name ="VMTest (attempt 2)" AND builds.builder_name="pre_cq"; # => 1062437 seconds = 295 hours = 4.8% of VM-including precq build time, 4.3% of all precq build time #time spent on precq builds including a VM Test SELECT SUM(TIMESTAMPDIFF(SECOND, builds.start_time, builds.finish_time)) FROM buildTable builds JOIN buildStageTable stages ON stages.build_id = builds.id WHERE builds.final = 1 AND builds.start_time BETWEEN "2017-09-18 00:00:01" AND "2017-10-18 00:00:01" AND stages.name ="VMTest (attempt 1)" AND builds.builder_name="pre_cq"; # => 21971622 seconds = 6103 hours #including precqs without any VM tests SELECT SUM(TIMESTAMPDIFF(SECOND, builds.start_time, builds.finish_time)) FROM buildTable builds WHERE builds.final = 1 AND builds.start_time BETWEEN "2017-09-18 00:00:01" AND "2017-10-18 00:00:01" AND builds.builder_name="pre_cq"; # => 24600233 seconds = 6833 hours
,
Nov 16 2017
Is there any update on this? Still seems like a strong handling-time-saving candidate.
,
Nov 16 2017
I'd set it aside until earlier this week, so no update yet.
,
Dec 13 2017
,
Dec 15 2017
This is looking more urgent, so moved to the top of my queue.
,
Dec 21 2017
People interested in this issue may also be interested in issue 797109 . Long story short, the GCE instance can become unresponsive for several minutes even when fairly idle. This can trigger various timeouts.
,
Dec 22 2017
I would like to step back and pose a question. How hard would it be to make VMTest/betty instances look more like a real DUT. With shard and servo (issue 797164) and all that comes with being a DUT. Obviously it would require some stubbing. This bug here would not exist though if VMTest/betty behaved like real hardware. Also we would get other things for free, like sharding over multiple VMs (which PoHsien has been working on, but there are limits). Also we could separate cloud build instances from test instances and maybe get better utilization (but then ACL and artifact transfer becomes an issue). I think this is a long winded way of saying that maybe this shouldn't be a starter bug (because it has a long tail and will keep giving).
,
Jan 2 2018
RE 21&22 I agree those are useful investigations. Let's fork those and keep this bug about the test-level retries work (which requires some moderate feature work in test_that and is a good size starter bug).
,
Jan 19 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/64c039007a346ee80a263b045c8b074c17727fb8 commit 64c039007a346ee80a263b045c8b074c17727fb8 Author: Jacob Kopczynski <jkop@google.com> Date: Fri Jan 19 01:40:45 2018 vm_tests: Fetch jobs from afe added midstream Currently perform_local_run fetches jobs from AFE once and then executes them all serially. This change makes it poll AFE for further jobs once it has finished its queue, and returns only if no new jobs found. BUG= chromium:750918 TEST=Unittests for test_runner_utils and test_that Change-Id: Ie461396a984e3614322b3cd595216dbc44ef3437 Reviewed-on: https://chromium-review.googlesource.com/851259 Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com> Tested-by: Jacob Kopczynski <jkop@chromium.org> Reviewed-by: Aviv Keshet <akeshet@chromium.org> [modify] https://crrev.com/64c039007a346ee80a263b045c8b074c17727fb8/site_utils/test_runner_utils.py
,
Feb 8 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/chromite/+/aa78bdf75a43ce39fe0b2c675eb5013ec6539c97 commit aa78bdf75a43ce39fe0b2c675eb5013ec6539c97 Author: Jacob Kopczynski <jkop@google.com> Date: Thu Feb 08 09:24:39 2018 chromite: add retries to VM test config Add retry and max_retries fields to the VM test config class. Use these in the vm test suite; set to disabled by default for other cases. BUG= chromium:750918 TEST=tryjob Change-Id: I3163bbd63c3f5cb4316195f5e56428677c15488d Reviewed-on: https://chromium-review.googlesource.com/902948 Commit-Ready: Jacob Kopczynski <jkop@chromium.org> Tested-by: Jacob Kopczynski <jkop@chromium.org> Reviewed-by: Don Garrett <dgarrett@chromium.org> [modify] https://crrev.com/aa78bdf75a43ce39fe0b2c675eb5013ec6539c97/cbuildbot/config_dump.json [modify] https://crrev.com/aa78bdf75a43ce39fe0b2c675eb5013ec6539c97/lib/constants.py [modify] https://crrev.com/aa78bdf75a43ce39fe0b2c675eb5013ec6539c97/lib/config_lib.py [modify] https://crrev.com/aa78bdf75a43ce39fe0b2c675eb5013ec6539c97/cbuildbot/chromeos_config_unittest.py [modify] https://crrev.com/aa78bdf75a43ce39fe0b2c675eb5013ec6539c97/cbuildbot/chromeos_config.py
,
Mar 4 2018
Is this believed to be working?
,
Mar 5 2018
No, one last CL needed. Once crrev.com/c/860163 is in I believe this will work.
,
Mar 6 2018
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/2cefa1f107097a3f2a920567a2a39a011f88220e commit 2cefa1f107097a3f2a920567a2a39a011f88220e Author: Jacob Kopczynski <jkop@google.com> Date: Tue Mar 06 02:34:45 2018 autotest: LocalSuite subclass for VM test retries New subclass of Suite with methods mimicking but not duplicating the existing schedule and result-retry flow, to handle retries for a local suite. handle_local_result and _retry_local_result take an AFE job ID rather than a Result object, and do not take a Waiter instance. In RetryHandler, _should_retry_local_job also takes an ID rather than a Result, and checks fewer considerations; it takes for granted that the job was not aborted, and does not filter by _retry_level. In other respects they closely mirror the behavior of _handle_result, _retry_result, and _should_retry. BUG= chromium:750918 TEST=suite:dummyflake retried, unit tests, and tryjob. Change-Id: I68bf950b389bc247a9c1c1d6f9f7711023a13c10 Reviewed-on: https://chromium-review.googlesource.com/860163 Commit-Ready: Jacob Kopczynski <jkop@chromium.org> Tested-by: Jacob Kopczynski <jkop@chromium.org> Reviewed-by: Aviv Keshet <akeshet@chromium.org> [modify] https://crrev.com/2cefa1f107097a3f2a920567a2a39a011f88220e/server/cros/dynamic_suite/suite.py [modify] https://crrev.com/2cefa1f107097a3f2a920567a2a39a011f88220e/site_utils/generate_test_report [modify] https://crrev.com/2cefa1f107097a3f2a920567a2a39a011f88220e/site_utils/test_runner_utils.py [modify] https://crrev.com/2cefa1f107097a3f2a920567a2a39a011f88220e/site_utils/test_runner_utils_unittest.py
,
Mar 7 2018
,
Mar 7 2018
818859 doesn't block turning test-level retries up, but probably does block turning suite-level retries down.
,
Mar 19 2018
Status update?
,
Mar 19 2018
Retrying on the test level has been pushed since the 8th with no obvious problems. No builds have failed at the VM Test stage since. There are a slightly suspicious number of betty runs failing with an exception, but this doesn't appear to be related. I haven't declared victory yet because I don't have traction on 818859 and expect a lot of gains may only come when stage-level retries are turned down.
,
Mar 19 2018
If test level retries are working, let's close this and track other benefits that may accrue some suite retries turndown separately.
,
Mar 19 2018
,
Mar 20 2018
,
Mar 20 2018
|
|||||||||||||
►
Sign in to add a comment |
|||||||||||||
Comment 1 by akes...@chromium.org
, Aug 15 2017