Do not provision a single DUT more than twice in a single suite
Issue description: Provisioning a failing DUT more than twice is rarely successful and often just delays the inevitable. Suggest capping provisions at 2 per DUT per suite. Example where this would help: https://viceroy.corp.google.com/chromeos/suite_details?build_id=1562882 Recommended by davidriley@. In particular, this helps mitigate the effects of a bad DUT.
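For illustration only, the proposed cap could be sketched roughly as below. This is not the scheduler's actual code: `ProvisionBudget`, `pick_dut`, and the DUT names are hypothetical, and the sketch assumes the suite already knows which DUTs are candidates for a job.

```python
from collections import Counter

MAX_PROVISIONS_PER_DUT = 2  # the limit proposed in this issue


class ProvisionBudget:
    """Counts provision attempts per DUT within one suite run (hypothetical helper)."""

    def __init__(self, limit=MAX_PROVISIONS_PER_DUT):
        self.limit = limit
        self.attempts = Counter()

    def may_provision(self, dut):
        # A DUT stays eligible until it has used up its attempts for this suite.
        return self.attempts[dut] < self.limit

    def record_attempt(self, dut):
        self.attempts[dut] += 1


def pick_dut(candidates, budget):
    """Return the first candidate still under its provision limit, else None."""
    for dut in candidates:
        if budget.may_provision(dut):
            return dut
    return None


budget = ProvisionBudget()
budget.record_attempt("dut-a")  # first provision of dut-a
budget.record_attempt("dut-a")  # second provision of dut-a
next_dut = pick_dut(["dut-a", "dut-b"], budget)  # dut-a is skipped
```

Keeping the count in memory for the lifetime of the suite would avoid extra database lookups, but whether the suite has a convenient place to hold that state is an open question.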
,
Jun 14 2017
When you say there's no easy way, what level of effort are we talking about? We can quantify effort vs reward.
,
Jun 14 2017
IIUC, this request becomes obsolete after we split out provisioning of DUTs for a suite into its own pre-suite (ayatane@'s well-on-the-way project). If my understanding is correct, I'd say just focus on that. ayatane@: do I understand this right?
,
Jun 14 2017
Re #2: 1-2 weeks, plus a non-negligible risk of adding expensive AFE queries unless done carefully.
,
Jun 14 2017
,
Jun 14 2017
Re #3, probably. All of the different failure conditions and timeouts make it hard to say for sure. There is at least one case where having a provision suite wouldn't help without additional follow-up work (but I think such follow-up work would be desirable anyway). What is the desired outcome of this feature request? If it is to fail a test suite as soon as possible in the event of a bad CL causing provision failures, the provision suite will provide that outcome very well. If it is to prevent a bad DUT from timing out a test suite, more work would be needed.
,
Jun 22 2017
Is there documentation on the provisioning suite? I'm curious what happens if one DUT takes 30 minutes to provision, fails, and then takes another 30 minutes. Will that hold up the entire build? The desired outcome is to improve CQ time. Data shows that suites which have a third provision of a given DUT don't improve success rate of the build enough to justify the amount of time that they slow down the entire CQ run when they occur.
,
Jun 22 2017
Yes, the long provision job will hold up the test it's associated with and thus hold up the entire suite run. The builder has a timeout on HWTest, though, so HWTest will be aborted and the build will be claimed as an infra failure.
,
Jun 23 2017
So is this the right solution? We're going to be slowing down CQ runs so that builds can be claimed as infra failures -- but I don't see why we need the provision suite to actually enforce that.
,
Jun 23 2017
I think we're going to need some graphical timelines to reach clarity on this issue... Forging ahead anyway, the only time a provision failure is going to hold up the entire suite is if there is only one more test job to run, and that test is waiting on the "bad" DUT to provision. In the middle of a suite, we aren't holding back the suite by assigning a test to a "bad" DUT pending provision, since if we decided to stop using that DUT, said test job would be waiting in the queue for an available DUT anyway. Either way, we're going to be down one DUT that won't be available to run tests. Furthermore, in the case that there is only one test job left waiting on a "bad" DUT that is provisioning for the third time, I believe that our DUT allocation for suites is lean enough that if we're down a DUT the suite is dangerously close to timing out anyway, so we would not actually lose much time to the last job.
,
Jun 23 2017
Maybe I'm misunderstanding the provision suite (hence me asking for documentation on it). Is the provision suite a blocking suite that runs before all other suites? Does it include a provision job for each and every DUT that will be used? The following is an example build to look at and ensure is not made worse: https://uberchromegw.corp.google.com/i/chromeos/builders/elm-release/builds/1238 Notice how many provision jobs there are: https://viceroy.corp.google.com/chromeos/suite_details?build_id=1612092
,
Jun 23 2017
>Is the provision suite a blocking suite that runs before all other suites?
Yes, before HWTest.
>Does it include a provision job for each and every DUT that will be used?
Basically, yes, but it doesn't have to wait for all of them to finish. The naive provision suite has been added; adding useful features like "wait for N DUTs out of M to complete successfully" is next. Design for the provision suite has been done somewhat informally since it arose as an approach to solving various other provisioning problems; I owe an updated design doc.
Provision suite aside, #10 was addressing this bug itself. Blacklisting a DUT for a suite run is strictly worse than continuing to try to provision the DUT, except for the very last test job as I noted above.
,
Jun 23 2017
So what happens if the DUTs are re-provisioned by another build before testing is complete like in the example I posted in c#11? Re c#10: Why do you say that? If a DUT is going to be re-provisioned and fail, and blocks a test from running on another device, isn't it worse?
,
Jun 28 2017
#13
>So what happens if the DUTs are re-provisioned by another build before testing is complete like in the example I posted in c#11?
That's odd, I will need to investigate. Generally speaking, a DUT won't need to get provisioned again once it has been successfully provisioned for a suite run.
>If a DUT is going to be re-provisioned and fail, and blocks a test from running on another device, isn't it worse?
Unless all the other tests are finished, the other devices will be busy running those other tests. The test will either need to sit in the queue behind the other tests that will run on those other DUTs, or get assigned to a DUT that at least has a chance to finish provisioning successfully.
,
Jun 28 2017
c#14: I think it can happen if there are insufficient DUTs to satisfy requests from different builds (e.g. PFQ and release builders). From the cases that I've looked at, what ends up happening is that a particular DUT gets tied up with provisioning and the provision/test finally fails. You start another provision, and during that time (since the provision failure might take 20-50 minutes), all the other tests finish and you're blocked. So if there is a provision suite that is waiting for one slow provision that takes 40 minutes before timing out and failing, and everything else takes 10 minutes, then if it's all in one suite, won't they all block?
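To put rough numbers on that question, here is a small sketch. It assumes all provisions start at the same time and run in parallel; the 4-DUT suite and the `minutes_blocked` helper are hypothetical, and only the 40-minute and 10-minute figures come from the scenario above.

```python
def minutes_blocked(provisions, needed):
    """Return how long a blocking provision step waits, in minutes.

    provisions: list of (minutes, succeeded) pairs, all assumed to start at t=0.
    needed: how many successful provisions are required before moving on.
    If enough provisions succeed, the wait ends when the `needed`-th success
    lands; otherwise it ends only when the last provision has finished.
    """
    if needed < 1 or not provisions:
        raise ValueError("need at least one provision and needed >= 1")
    successes = sorted(minutes for minutes, ok in provisions if ok)
    if len(successes) >= needed:
        return successes[needed - 1]
    return max(minutes for minutes, _ in provisions)


# Hypothetical suite: three DUTs provision in 10 minutes, one fails after 40.
suite = [(10, True), (10, True), (10, True), (40, False)]
wait_for_all = minutes_blocked(suite, needed=4)    # 40: blocked on the slow DUT
wait_for_three = minutes_blocked(suite, needed=3)  # 10: slow DUT is not waited on
```

Under these assumptions, waiting for every DUT blocks for the full 40 minutes, while a "wait for N out of M" rule with N=3 moves on after 10.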
,
Dec 26 2017
,
Oct 25
Comment 1 by akes...@chromium.org
, Jun 14 2017