scheduler crashloops on moblab in the cq fail in a bad way
Issue description

Example build: https://luci-milo.appspot.com/buildbot/chromeos/guado_moblab-paladin/5824

This build failed due to a scheduler crashloop (Issue 718615) introduced by a bad CL (we think https://chromium-review.googlesource.com/#/c/438670/).

Problem: when the scheduler crashloops, none of the moblab sub-jobs make any progress, and the suite eventually times out. This timeout is categorized as an "infra failure" and also wasted 2 full hours.

Proposal:
1) Change the moblab_quick suite so that, before it runs that suite against its sub-DUTs, it does some basic sanity checks against the services that are supposed to be running stably on it.
2) (1) might not be good enough, because the crashlooping might only happen once there are jobs to run, so it might only start once we kick off the suite. Is there somewhere we can add logic that will detect that the scheduler has started crashlooping and force the job to end with a descriptive failure?
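One way to make "the scheduler has started crashlooping" concrete is to count restarts inside a sliding time window. The sketch below is only an illustration of that idea, not existing autotest code: the function name `is_crashlooping`, the thresholds, and the assumption that restart timestamps can be collected somehow (for example from service logs) are all hypothetical.

```python
def is_crashlooping(restart_times, window_secs=300, max_restarts=3):
    """Return True if more than max_restarts restarts fall within any
    window of window_secs seconds.

    restart_times: iterable of restart timestamps in seconds (hypothetical
    input; how they are gathered is outside this sketch).
    """
    times = sorted(restart_times)
    start = 0
    for end, current in enumerate(times):
        # Shrink the window from the left until it spans <= window_secs.
        while current - times[start] > window_secs:
            start += 1
        # Number of restarts inside the current window.
        if end - start + 1 > max_restarts:
            return True
    return False
```

A caller could poll this while the suite runs and abort with a descriptive error as soon as it returns True, instead of waiting for the suite timeout.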
May 4 2017
May 4 2017
What about splitting the single run_suite call into 3 steps:
1. Call run_suite with -c and get the suite job id.
2. Wait for x minutes, then search for any child job of the suite job. x can be around 5, so moblab has enough time to stage the artifacts and create child jobs. If no job is found, fail the test.
3. Call run_suite with -m to wait for all tests to finish.
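The three steps could be sketched roughly as follows. This is a minimal illustration, not the code that later landed: the three callables (`create_suite`, `count_child_jobs`, `wait_for_suite`) are hypothetical stand-ins for the run_suite -c call, the child-job lookup, and the run_suite -m call. It also polls during the grace period rather than sleeping for the full x minutes, which is a small deviation from the proposal.

```python
import time


class NoChildJobsError(Exception):
    """The suite job created no child jobs within the grace period."""


def run_suite_in_steps(create_suite, count_child_jobs, wait_for_suite,
                       grace_secs=5 * 60, poll_secs=30,
                       sleep=time.sleep, clock=time.monotonic):
    """Run a suite in three steps and fail fast if no child job appears.

    All three callables are hypothetical hooks supplied by the caller:
      create_suite()            -> suite job id (the "-c" step)
      count_child_jobs(job_id)  -> number of child jobs found so far
      wait_for_suite(job_id)    -> final result (the "-m" step)
    """
    # Step 1: create the suite job without waiting for it.
    suite_job_id = create_suite()

    # Step 2: allow time to stage artifacts and create child jobs.
    deadline = clock() + grace_secs
    while count_child_jobs(suite_job_id) == 0:
        if clock() >= deadline:
            raise NoChildJobsError(
                'Suite job %s has no child jobs after %d seconds; '
                'the scheduler may be crashlooping.'
                % (suite_job_id, grace_secs))
        sleep(poll_secs)

    # Step 3: wait for all tests to finish.
    return wait_for_suite(suite_job_id)
```

Passing `sleep` and `clock` as parameters keeps the function testable without real waiting.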
May 4 2017
That sounds excellent.
May 5 2017
+ Richard - we should discuss this @ next CQ meeting
May 9 2017
discussed with phobbs@
Aug 15 2017
Aug 18 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/37db4eab116ef35ca6ac2a7108539702502a637d

commit 37db4eab116ef35ca6ac2a7108539702502a637d
Author: Paul Hobbs <phobbs@google.com>
Date: Fri Aug 18 03:41:59 2017

[autotest] Fix moblab test when scheduler crashloops

BUG= chromium:718618
TEST=None

Change-Id: I1b0aaf8c5be1951def5915642e9db1803ab5bc77
Reviewed-on: https://chromium-review.googlesource.com/616117
Commit-Ready: Paul Hobbs <phobbs@google.com>
Tested-by: Paul Hobbs <phobbs@google.com>
Reviewed-by: Dan Shi <dshi@google.com>

[modify] https://crrev.com/37db4eab116ef35ca6ac2a7108539702502a637d/site_utils/run_suite.py
[modify] https://crrev.com/37db4eab116ef35ca6ac2a7108539702502a637d/server/site_tests/moblab_RunSuite/moblab_RunSuite.py
Aug 22 2017
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/d9d238063f42d0f43228ca13713d01e83da7116c

commit d9d238063f42d0f43228ca13713d01e83da7116c
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Tue Aug 22 00:46:16 2017

Revert "[autotest] Fix moblab test when scheduler crashloops"

This reverts commit 37db4eab116ef35ca6ac2a7108539702502a637d.

Reason for revert: Needs thorough verification with moblab-trybot

Original change's description:
> [autotest] Fix moblab test when scheduler crashloops
>
> BUG= chromium:718618
> TEST=None
>
> Change-Id: I1b0aaf8c5be1951def5915642e9db1803ab5bc77
> Reviewed-on: https://chromium-review.googlesource.com/616117
> Commit-Ready: Paul Hobbs <phobbs@google.com>
> Tested-by: Paul Hobbs <phobbs@google.com>
> Reviewed-by: Dan Shi <dshi@google.com>

BUG= chromium:718618
BUG= chromium:757658

Change-Id: Ib6f99d19e28d7d128f6acf90883594b5679c8d56
Reviewed-on: https://chromium-review.googlesource.com/624568
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/d9d238063f42d0f43228ca13713d01e83da7116c/site_utils/run_suite.py
[modify] https://crrev.com/d9d238063f42d0f43228ca13713d01e83da7116c/server/site_tests/moblab_RunSuite/moblab_RunSuite.py
Mar 29 2018
Comment 1 by sbasi@chromium.org, May 4 2017