
Issue 730729

Starred by 5 users

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug

Blocked on:
issue 732868

Blocking:
issue 805770
issue 715012
issue 730700




cbuildbot: Improve swarming run_suite call failure mode

Project Member Reported by pprabhu@chromium.org, Jun 7 2017

Issue description

Forked off from issue 730700.

Currently, our run_suite flow is:
(1) call run_suite with --create_and_return to create a suite and return immediately (takes at most ~10 minutes)
(2) call run_suite with -m <job_id> (but not --no-wait) to wait for the suite to finish. This can take arbitrarily long.

We used to pass --no-wait to run_suite once upon a time.
This was changed in:
https://chromium-review.googlesource.com/302100
https://chromium-review.googlesource.com/487825

One problem this is causing is that cbuildbot logs go silent while we wait for the suite to return. This causes buildbot to kill the cbuildbot run due to silence. ( issue 730700 )

Related, whenever cbuildbot is interrupted in step (2), we spew a very user-unfriendly dump of the buildbot-cbuildbot process. This is a generic failure mode that doesn't tell us anything about why the build failed.


Proposal:
Change (2) to use --no-wait again, and retry for the required amount of time (without exponential backoff, since the overall wait is expected to be long).
Add a step (3) to actually generate the json results needed for hwtest results / suite details.
The reason to keep (3) separate is that suite detail generation may take a while, and we don't want the timeouts for (2) to be guided by how long (3) takes. (We are willing to try (2) many times, but we expect each attempt to return almost immediately.)
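
A minimal sketch of the fixed-interval polling proposed for step (2). The name `poll_until_done` and the `check` callback are hypothetical: `check` stands in for one short `run_suite -m <job_id> --no-wait` invocation and is assumed to return a `(finished, result)` pair. Nothing here is taken from the chromite code.

```python
import time
from typing import Callable, Optional, Tuple


def poll_until_done(
    check: Callable[[], Tuple[bool, Optional[dict]]],
    timeout_s: float,
    interval_s: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[dict]:
    """Call `check` at a fixed interval until it reports completion.

    `check` is a hypothetical stand-in for a single non-blocking status
    query. The interval is constant (no exponential backoff) because the
    suite is expected to run for a long time.
    """
    deadline = clock() + timeout_s
    while True:
        finished, result = check()
        if finished:
            return result
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError("suite did not finish within %.0f s" % timeout_s)
        # Emitting a line on every poll keeps the caller's log from going silent.
        print("suite still running; %.0f s left before timeout" % remaining)
        sleep(min(interval_s, remaining))
```

Because each poll prints a progress line from the builder itself, log output would no longer depend on a single long-lived remote call.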


 
Blocking: 730700
Status: Available (was: Untriaged)
+akeshet, +shuqianz: who reviewed and authored the CL that removed --no-wait.
I can't find enough breadcrumbs to definitively say why we wanted to remove --no-wait.

Do you want to defend the decision to not use that flag?
Correction, we already have (3) _HWTestDumpJson.
So, I'm simply proposing adding --no-wait to (2) and looping on it for the expected amount of time, instead of a long-call on run_suite.
Cc: akes...@chromium.org dgarr...@chromium.org
Owner: shuqianz@chromium.org
Really +people I said I'll +
If we use no-wait, we will lose all of the progress output from run_suite. The only output we will have is a JSON dump string. Another way to fix this is to add a timeout on the swarming call.
Cc: xixuan@chromium.org shuqianz@chromium.org mcchou@chromium.org bleung@chromium.org
 Issue 730700  has been merged into this issue.
Adding a dynamic timeout that fires before buildbot's timeout, and printing whatever results run_suite.py has obtained by then (if possible), is good enough.

Currently the timeout is longer than buildbot's timeout, so buildbot times out first, which 1) gives the swarming task no time to print the results it already has, and 2) generates useless logs caused by the buildbot termination.
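
As a rough illustration of the "dynamic timeout" idea in this comment, the sketch below derives the swarming-call timeout from the time left before buildbot would kill the build, minus a safety margin for printing partial results. The function name, parameters, and the 300-second margin are all assumptions made for illustration; they do not come from cbuildbot or swarming.

```python
def swarming_timeout_s(buildbot_deadline_s: float,
                       now_s: float,
                       margin_s: float = 300.0) -> float:
    """Return a timeout that expires `margin_s` before buildbot's deadline.

    All arguments are in seconds on the same clock. The margin (an assumed
    value) leaves time to print results already collected. The result is
    clamped at zero when the deadline is closer than the margin.
    """
    remaining = buildbot_deadline_s - now_s
    return max(0.0, remaining - margin_s)
```

This only works while the caller knows buildbot's actual deadline, which is the dependency on another system's timeout that the approach carries.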
#7: Nack. Playing timeout-chicken with a different system, whose timeout value may change unexpectedly as far as the HWTest stage is concerned, is asking for problems in the future.

I strongly feel that the problem here is the long run_suite call. We should not depend on logging messages being generated in the chromeos-proxy server, being uploaded to swarming appengine app and then returned to the builder to get progress updates on the build.

We should instead poll for the progress of the run_suite command with the --no-wait flag, like we used to do.
What's the reasoning against doing that?

shuqianz@: What progress output do we lose, exactly?
afaict, run_suite just says "run_suite still has XXXX minutes to timeout" until the very end.
I'd expect to get the same output when I poll (every 10 minutes say) with --no-wait.

In the final call to run_suite, when the suite has already finished, I should be able to get the "result output" that we get currently. If that's not supported today (i.e., you can't call run_suite -m YYY if YYY suite has already finished), we can add the feature to run_suite to return results for finished suites correctly.

Am I missing anything?
Blockedon: 732868
To be clear about what we did before I made the change https://chromium-review.googlesource.com/302100:
Previously, we had two run_suite calls in the HWTest stage: the first, with the flag create_and_return, creates a suite and exits; the second, with the flag --mock_job_id, waits for the result and streams the output. We did pass --no_wait to both run_suite calls, but the default value was None, and we didn't change it during the HWTest stage, which means --no_wait = False according to run_suite.py.

Therefore, we never actually used --no_wait before.

As for the output that will be missing if we use --no_wait, I mean all of the suite results, suite timing, DUT diagnosis, links to the suite, and so on.

However, we can refactor the run_suite code to move the result-parsing code out of _handle_job_wait. So the steps to switch to --no_wait will be:
1. Refactor run_suite.py to move result parsing out of _handle_job_wait into a separate method: _gathering_suite_result()
2. Make __handle_job_nowait return a tuple (BOOL_INDICATE_FINISH_OR_NOT, RESULT_OBJECT)
3. Teach the HWTest stage command to use no_wait instead of wait. Add polling to check the run_suite process and return the result when it finishes.
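
The three steps above can be sketched roughly as follows. This is not the real run_suite.py: the job is modelled as a plain dict, the status string "GOOD" and the `SuiteResult` fields are assumptions, and the function names only mirror the ones mentioned in the steps.

```python
import time
from typing import Callable, NamedTuple, Optional, Tuple


class SuiteResult(NamedTuple):
    """Hypothetical result object; the real one would carry more detail."""
    return_code: int
    failed_tests: list


def gather_suite_result(job: dict) -> SuiteResult:
    # Step 1: result parsing lives in its own function, so both the waiting
    # and the non-waiting paths can share it.
    failed = sorted(name for name, status in job["tests"].items()
                    if status != "GOOD")
    return SuiteResult(return_code=1 if failed else 0, failed_tests=failed)


def handle_job_nowait(job: dict) -> Tuple[bool, Optional[SuiteResult]]:
    # Step 2: return (finished?, result). The result is None while running.
    if not job["finished"]:
        return (False, None)
    return (True, gather_suite_result(job))


def handle_job_wait(fetch_job: Callable[[], dict],
                    interval_s: float = 60.0,
                    sleep: Callable[[float], None] = time.sleep) -> SuiteResult:
    # Step 3 (caller side): poll the non-blocking path until it reports done.
    while True:
        finished, result = handle_job_nowait(fetch_job())
        if finished:
            return result
        sleep(interval_s)
```

With this split, the final poll returns the same result object that a blocking wait would have produced, so a poll-based design need not lose the suite result.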

I expect this to be at least 1 week of work, including testing (tryjob & unittest).
Labels: Chase-Pending
Re #10: Thanks for the useful dump there.

As for Chase-Pending: This bug neither causes nor extends any outages. So I doubt this will make the cut.

Please see: https://sites.google.com/a/google.com/chromeos/for-team-members/infrastructure/bug_workflow?pli=1 for the requirements to go on the Chase list.

OTOH, I think this is a fixit-sized bug.
Blocking: 715012
I'll add that this likely also causes the swarming bot to be under-utilized because one swarming bot is occupied in executing the long wait for a single run_suite call for the whole duration of the suite. (See #13).
Cc: nxia@chromium.org mojahsu@chromium.org
 Issue 730997  has been merged into this issue.
Labels: -Chase-Pending Hotlist-Fixit
Cc: ayatane@chromium.org jrbarnette@chromium.org kitching@chromium.org mka@chromium.org
 Issue 760842  has been merged into this issue.
Blocking: 805770
FYI, "One problem this is causing is that cbuildbot logs go silent while we wait for the suite to return. This causes buildbot to kill the cbuildbot run due to silence."

This is fixed in  Issue 793499 .
Status: Assigned (was: Available)
Components: Infra>Client>ChromeOS>CI
Components: -Infra>Client>ChromeOS
Labels: -current-issue
Owner: ----
Status: Available (was: Assigned)

Comment 25 by nxia@chromium.org, Jun 8 2018

Cc: -nxia@chromium.org
Components: -Infra>Client>ChromeOS>CI Infra>Client>ChromeOS>Test
run_suite is HWTest
Components: Infra>Client>ChromeOS>CI
> run_suite is HWTest

If I've understood this issue properly, the request is for changes
to the run_suite invocation in the chromite code.  Specifically,
I think the conversation is around changes to this code:
    https://cs.corp.google.com/chromeos_public/chromite/cbuildbot/commands.py?l=801
