cbuildbot: Improve swarming run_suite call failure mode
Issue description

Forked off from issue 730700.

Currently, our run_suite flow is:
(1) Call run_suite with --create_and_return to create a suite and return immediately (takes no longer than ~10 minutes).
(2) Call run_suite with -m <job_id> (but not --no-wait) to wait for the suite to finish. This can take arbitrarily long.

We used to pass --no-wait to run_suite once upon a time. This was changed in:
https://chromium-review.googlesource.com/302100
https://chromium-review.googlesource.com/487825

One problem this causes is that cbuildbot logs go silent while we wait for the suite to return, so buildbot kills the cbuildbot run due to silence (issue 730700).

Relatedly, whenever cbuildbot is interrupted in step (2), we spew a very user-unfriendly dump of the buildbot-cbuildbot process. This is a generic failure mode that doesn't tell us anything about why the build failed.

Proposal:
- Change (2) to use --no-wait again, and retry for the required amount of time (without exponential backoff, since this is expected to take long).
- Add a step (3) to actually generate the JSON results needed for hwtest results / suite details.

The reason to keep (3) separate is that suite detail generation may take a while, and we don't want the timeouts for (2) to be guided by how long (3) takes. (We are willing to try (2) many times, but we don't expect each try to take any time at all.)
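A minimal sketch of the fixed-interval retry proposed for step (2), assuming each poll is one short run_suite call with --no-wait. The names poll_until_done and check_once are hypothetical and are not part of chromite or run_suite; check_once stands in for whatever wraps the real invocation.

```python
import time


def poll_until_done(check_once, timeout_s, interval_s=600,
                    sleep=time.sleep, clock=time.monotonic):
    """Call check_once() at a fixed interval until it reports completion.

    check_once is a hypothetical callable standing in for one short
    `run_suite -m <job_id> --no-wait` invocation; it returns True once
    the suite has finished. Returns True on completion, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if check_once():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False
        # Fixed interval, deliberately no exponential backoff. Printing on
        # every iteration also keeps the builder log from going silent.
        print('Suite still running; %d seconds until timeout.' % remaining)
        sleep(min(interval_s, remaining))
```

The sleep and clock parameters are injectable only so the loop can be unit tested without real waiting.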
,
Jun 7 2017
+akeshet, +shuqianz: who reviewed and authored the CL that removed --no-wait. I can't find enough breadcrumbs to say definitively why we wanted to remove --no-wait. Do you want to defend the decision to not use that flag?
,
Jun 7 2017
Correction, we already have (3) _HWTestDumpJson. So, I'm simply proposing adding --no-wait to (2) and looping on it for the expected amount of time, instead of a long-call on run_suite.
,
Jun 7 2017
Really +people I said I'll +
,
Jun 7 2017
If we want to use no-wait, we will lose all of the progress output from run_suite. The only output we will have is a string containing the JSON dump. Another way to fix this is to add a timeout on the swarming call.
,
Jun 7 2017
Issue 730700 has been merged into this issue.
,
Jun 7 2017
Adding a dynamic timeout that fires before buildbot's timeout, and printing whatever results run_suite.py has obtained (if possible) at that point, is good enough. Currently the timeout is longer than buildbot's timeout, so buildbot times out first, which 1) gives the swarming task no time to print the results it already has, and 2) generates useless logs caused by the buildbot termination.
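The dynamic timeout suggested here could be computed roughly as follows. This is only a sketch: suite_wait_timeout, its parameters and the 300-second default margin are all hypothetical, and how the builder would learn buildbot's remaining time is left open.

```python
def suite_wait_timeout(suite_timeout_s, buildbot_seconds_left, margin_s=300):
    """Return how long to wait for the suite, in seconds.

    The wait is capped so that it ends at least margin_s seconds before
    buildbot's own timeout, leaving time to print partial results.
    All names and the default margin are illustrative assumptions.
    """
    budget = buildbot_seconds_left - margin_s
    # Never return a negative wait, even if buildbot is almost out of time.
    return max(0, min(suite_timeout_s, budget))
```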
,
Jun 13 2017
#7: Nack. Playing timeout-chicken with a different system, whose timeout value may change unexpectedly as far as the HWTest stage is concerned, will create problems in the future.

I strongly feel that the problem here is the long run_suite call. We should not depend on logging messages being generated in the chromeos-proxy server, uploaded to the swarming appengine app and then returned to the builder in order to get progress updates on the build. We should instead poll for the progress of the run_suite command with the --no-wait flag, like we used to do. What's the reasoning against doing that?

shuqianz@: What progress output do we lose, exactly? afaict, run_suite just says "run_suite still has XXXX minutes to timeout" until the very end. I'd expect to get the same output when I poll (every 10 minutes, say) with --no-wait. In the final call to run_suite, when the suite has already finished, I should be able to get the "result output" that we get currently. If that's not supported today (i.e., you can't call run_suite -m YYY if suite YYY has already finished), we can add the feature to run_suite to return results for finished suites correctly. Am I missing anything?
,
Jun 13 2017
,
Jun 13 2017
To be clear about what we did before I made the change https://chromium-review.googlesource.com/302100: Previously, we had two run_suite calls in the HWTest stage. The first, with the create_and_return flag, creates a suite and exits; the second, with the --mock_job_id flag, waits for the result and streams the output. We did pass --no_wait to both run_suite calls, but its default value was None and we never changed it during the HWTest stage, which means --no_wait = False according to run_suite.py. Therefore, we never actually used --no_wait before.

As for the output that will be missing if we use --no_wait, I mean all of the suite results, suite timing, DUT diagnosis, links to the suite and so on. However, we can refactor the run_suite code to move the result parsing code out of _handle_job_wait. So the steps to switch to --no_wait will be:
1. Refactor run_suite.py to move result parsing out of _handle_job_wait into a separate method: _gathering_suite_result()
2. Make __handle_job_nowait return a tuple (BOOL_INDICATE_FINISH_OR_NOT, RESULT_OBJECT)
3. Teach the HWTest stage command to use no_wait instead of wait. Add polling to check the run_suite process and return the result when it finishes.

I expect this to be at least 1 week of work, including the testing (tryjob & unittest) for this.
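Steps 1 and 2 above might look roughly like this. It is a sketch only: the job dict and its keys ('id', 'finished', 'return_code') are invented for illustration and do not reflect the real run_suite.py data structures, and the function names merely echo the ones proposed above.

```python
def gather_suite_result(job):
    """Hypothetical stand-in for the proposed _gathering_suite_result().

    The real method would parse suite results, timing and diagnosis;
    here it just packages two assumed fields.
    """
    return {'job_id': job['id'], 'return_code': job['return_code']}


def handle_job_nowait(job):
    """Return (finished, result); result is None while the suite runs."""
    if not job['finished']:
        return (False, None)
    return (True, gather_suite_result(job))
```

With this shape, the HWTest stage in step 3 would only need to unpack the tuple on each poll and stop once the first element is True.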
,
Jun 13 2017
,
Jun 13 2017
Re #10: Thanks for the useful dump there. As for Chase-Pending: This bug neither results in / nor extends any outages. So I doubt this will make the cut. Please see: https://sites.google.com/a/google.com/chromeos/for-team-members/infrastructure/bug_workflow?pli=1 for the requirements to go on the Chase list. OTOH, I think this is a fixit-sized bug.
,
Jun 13 2017
,
Jun 13 2017
I'll add that this likely also causes the swarming bot to be under-utilized because one swarming bot is occupied in executing the long wait for a single run_suite call for the whole duration of the suite. (See #13).
,
Jun 14 2017
,
Jun 16 2017
,
Aug 31 2017
Issue 760842 has been merged into this issue.
,
Jan 25 2018
,
Jan 25 2018
FYI, "One problem this is causing is that cbuildbot logs go silent while we wait for the suite to return. This causes buildbot to kill the cbuildbot run due to silence." This is fixed in Issue 793499 .
,
Feb 14 2018
,
Mar 30 2018
,
Mar 30 2018
,
Apr 25 2018
,
Apr 30 2018
,
Jun 8 2018
,
Jul 27
run_suite is HWTest
,
Jul 27
> run_suite is HWTest
If I've understood this issue properly, the request is for changes
to the run_suite invocation in the chromite code. Specifically,
I think the conversation is around changes to this code:
https://cs.corp.google.com/chromeos_public/chromite/cbuildbot/commands.py?l=801
Comment 1 by pprabhu@chromium.org
, Jun 7 2017