Issue 718618

Starred by 1 user

Issue metadata

Status: Archived
Owner:
Closed: Mar 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug



Scheduler crashloops on moblab in the CQ fail in a bad way

Project Member Reported by akes...@chromium.org, May 4 2017

Issue description

Example build: https://luci-milo.appspot.com/buildbot/chromeos/guado_moblab-paladin/5824

This failed due to a scheduler crashloop (Issue 718615) introduced by a bad CL (we think https://chromium-review.googlesource.com/#/c/438670/).

Problem: when the scheduler crashloops, none of the moblab sub-jobs make any progress, and the suite eventually times out. This timeout is categorized as an "infra failure", and it also wasted 2 full hours.


Proposal: 

1) Change the moblab_quick suite so that, before it runs against its sub-DUTs, it does some basic sanity checks on the services that are supposed to be running stably on the moblab.

2) (1) might not be good enough, because the crashlooping might only happen once there are jobs to run, so it might only start once we kick off the suite. Is there somewhere we can add logic that detects that the scheduler has started crashlooping and forces the job to end with a descriptive failure?
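
One way to picture the detection asked for in (2) is a small restart counter: sample the scheduler's PID periodically and flag a crashloop when the PID changes too many times within a time window. The sketch below is only an illustration under that assumption. `CrashloopDetector`, its thresholds, and the idea of sampling PIDs are hypothetical and not taken from autotest or moblab.

```python
import collections


class CrashloopDetector(object):
    """Hypothetical helper: flags a service that restarts too often."""

    def __init__(self, max_restarts=3, window_seconds=300):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._last_pid = None
        self._restart_times = collections.deque()

    def observe(self, pid, now):
        """Record the PID seen at time `now` (None if not running).

        Returns True once `max_restarts` restarts have been seen
        within the last `window_seconds`.
        """
        # A restart is a change from one known PID to a different one.
        if pid is not None and self._last_pid is not None and pid != self._last_pid:
            self._restart_times.append(now)
        if pid is not None:
            self._last_pid = pid
        # Forget restarts that have fallen out of the window.
        while self._restart_times and now - self._restart_times[0] > self.window_seconds:
            self._restart_times.popleft()
        return len(self._restart_times) >= self.max_restarts
```

A caller would feed it one sample per poll and end the job with a descriptive error the first time `observe` returns True, instead of waiting for the suite timeout.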
 

Comment 1 by sbasi@chromium.org, May 4 2017

At least it prevented bad code from going in and breaking the lab ;)

So right now the test kicks off run_suite and is waiting on that.

https://chromium.googlesource.com/chromiumos/third_party/autotest/+/master/server/site_tests/moblab_RunSuite/moblab_RunSuite.py#48

You can make it more intelligent: kick off the suite and return immediately, then have a while loop that checks the state of moblab and the state of the suite (via moblab's RPCs) and returns when the suite exits.
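
A minimal sketch of such a polling loop, written as standalone Python rather than as a change to moblab_RunSuite.py. The two callables `suite_is_done` and `scheduler_is_healthy` are hypothetical stand-ins for whatever moblab RPCs would report suite and scheduler state; the exception names and default timings are also assumptions.

```python
import time


class SchedulerUnhealthyError(Exception):
    """Raised when the health check fails while the suite is running."""


class SuiteTimeoutError(Exception):
    """Raised when the suite does not finish before the deadline."""


def wait_for_suite(suite_is_done, scheduler_is_healthy,
                   timeout_seconds=7200, poll_seconds=30,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until the suite finishes, failing fast on a sick scheduler.

    `suite_is_done` and `scheduler_is_healthy` are caller-supplied
    callables (hypothetical wrappers around moblab RPCs).
    """
    deadline = clock() + timeout_seconds
    while clock() < deadline:
        # Check health first so a crashloop is reported as such,
        # not as a generic timeout two hours later.
        if not scheduler_is_healthy():
            raise SchedulerUnhealthyError('scheduler is not healthy')
        if suite_is_done():
            return
        sleep(poll_seconds)
    raise SuiteTimeoutError('suite did not finish in %d s' % timeout_seconds)
```

Passing `sleep` and `clock` as parameters is only there to make the loop easy to exercise without real waiting.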
Cc: davidri...@chromium.org

Comment 3 by dshi@chromium.org, May 4 2017

What about splitting the single run_suite call into 3 steps:
1. Call run_suite with -c and get the suite job id.
2. Wait for x mins, then search for any child job of the suite job. x can be around 5, so moblab has enough time to stage the artifacts and create child jobs. If no job is found, fail the test.
3. Call run_suite with -m to wait for all tests to finish.
That sounds excellent.
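
The three-step split from comment 3 could be sketched roughly as follows. This is not the change that later landed; the callables `create_suite`, `find_child_jobs`, and `wait_for_suite` are hypothetical placeholders for the `run_suite -c` call, a child-job lookup, and the `run_suite -m` call. As a small variation on step 2, the sketch polls during the grace period instead of sleeping for the full x minutes, so a healthy moblab proceeds sooner.

```python
import time


class NoChildJobsError(Exception):
    """Raised when the suite job never spawns a child job."""


def run_suite_in_steps(create_suite, find_child_jobs, wait_for_suite,
                       grace_seconds=300, poll_seconds=15,
                       sleep=time.sleep, clock=time.monotonic):
    """Create the suite, confirm it is making progress, then wait for it.

    All three callables are assumptions supplied by the caller:
      create_suite()           -> suite job id      (step 1)
      find_child_jobs(job_id)  -> list of child ids (step 2)
      wait_for_suite(job_id)   -> suite result      (step 3)
    """
    # Step 1: create the suite job and return immediately.
    suite_job_id = create_suite()

    # Step 2: give moblab time to stage artifacts and create child jobs.
    deadline = clock() + grace_seconds
    while not find_child_jobs(suite_job_id):
        if clock() >= deadline:
            raise NoChildJobsError(
                'suite job %s created no child jobs within %d s'
                % (suite_job_id, grace_seconds))
        sleep(poll_seconds)

    # Step 3: block until every test in the suite has finished.
    return wait_for_suite(suite_job_id)
```

With a grace period of about 5 minutes, a scheduler that never creates child jobs would fail the test early with a specific message rather than at the 2-hour suite timeout.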

Comment 5 by aut...@google.com, May 5 2017

Labels: -current-issue
Owner: jrbarnette@chromium.org
+ Richard - we should discuss this @ next CQ meeting
Owner: pho...@chromium.org
Status: Assigned (was: Untriaged)
discussed with phobbs@

Comment 7 by pho...@chromium.org, Aug 15 2017

Status: Started (was: Assigned)

Comment 8 by bugdroid1@chromium.org, Aug 18 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/37db4eab116ef35ca6ac2a7108539702502a637d

commit 37db4eab116ef35ca6ac2a7108539702502a637d
Author: Paul Hobbs <phobbs@google.com>
Date: Fri Aug 18 03:41:59 2017

[autotest] Fix moblab test  when scheduler crashloops

BUG= chromium:718618 
TEST=None

Change-Id: I1b0aaf8c5be1951def5915642e9db1803ab5bc77
Reviewed-on: https://chromium-review.googlesource.com/616117
Commit-Ready: Paul Hobbs <phobbs@google.com>
Tested-by: Paul Hobbs <phobbs@google.com>
Reviewed-by: Dan Shi <dshi@google.com>

[modify] https://crrev.com/37db4eab116ef35ca6ac2a7108539702502a637d/site_utils/run_suite.py
[modify] https://crrev.com/37db4eab116ef35ca6ac2a7108539702502a637d/server/site_tests/moblab_RunSuite/moblab_RunSuite.py

Comment 9 by bugdroid1@chromium.org, Aug 22 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/d9d238063f42d0f43228ca13713d01e83da7116c

commit d9d238063f42d0f43228ca13713d01e83da7116c
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Tue Aug 22 00:46:16 2017

Revert "[autotest] Fix moblab test  when scheduler crashloops"

This reverts commit 37db4eab116ef35ca6ac2a7108539702502a637d.

Reason for revert: Needs thorough verification with moblab-trybot

Original change's description:
> [autotest] Fix moblab test  when scheduler crashloops
> 
> BUG= chromium:718618 
> TEST=None
> 
> Change-Id: I1b0aaf8c5be1951def5915642e9db1803ab5bc77
> Reviewed-on: https://chromium-review.googlesource.com/616117
> Commit-Ready: Paul Hobbs <phobbs@google.com>
> Tested-by: Paul Hobbs <phobbs@google.com>
> Reviewed-by: Dan Shi <dshi@google.com>

BUG= chromium:718618 
BUG= chromium:757658 

Change-Id: Ib6f99d19e28d7d128f6acf90883594b5679c8d56
Reviewed-on: https://chromium-review.googlesource.com/624568
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>

[modify] https://crrev.com/d9d238063f42d0f43228ca13713d01e83da7116c/site_utils/run_suite.py
[modify] https://crrev.com/d9d238063f42d0f43228ca13713d01e83da7116c/server/site_tests/moblab_RunSuite/moblab_RunSuite.py

Status: Archived (was: Started)
