New issue
Advanced search Search tips

Issue 840933 link

Starred by 3 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug



Sign in to add a comment

pre-cq-launcher should use buildbucket builder statuses.

Project Member Reported by dgarr...@chromium.org, May 8 2018

Issue description

I thought this was filed before, but I can't find it.

Currently, the pre-cq-launcher fires off PreCQ builders, and reads the results from CIDB.

This is problematic in a number of ways.

1) CLs are not annotated with links to the builds until after all builders get far enough to create CIDB entries. This makes it look like the launcher has not picked up a CL it has.

2) Builds that fail before creating CIDB entries are invisible to the pre-cq-launcher. It doesn't know the builds failed, and instead issues a generic "timeout" error confusing to users.

3) Because we need to use timeouts to know that builds didn't start, we can't distinguish between bad CLs and build infrastructure issues.

4) Because we need to use timeouts, we need to keep them reasonably short. Our build infrastructure could handle a much higher CL load if CLs could be left in the build queue for longer.

 
Components: -Infra>Client>ChromeOS>Test Infra>Client>ChromeOS>CI
I propose that the pre-cq-launcher (or it's replacement) instead use buildbucket to track the state of builds it launches, then follows up with CIDB queries for more detailed information about a build if needed.

1) This allows it to link to launcher builders immediately.
2) This allows it to remove the "failed to launch" timeout and instead trust buildbucket timeouts for the same purpose.
3) This allows it to correctly detect build failures that happen if there are not CIDB entries for a build.

Yep, sounds great!
Owner: dgarr...@chromium.org
Status: Assigned (was: Untriaged)
What is the timeline for replacing pre-cq-launcher with a service?
I didn't have an item in my design for a PreCQ launcher replacement. If anything, I figured that the Swarming Scheduler could fire off something that looks like the launcher. Then, once it is running, we could do all the bookkeeping (CL annotations, CIDB udpates) inside of the Recipe instead of inside CBuildBot thereby eliminating the delay.

What do you think?

In other words, leave it as a builder-as-a-cron-job. Okay.

Though, I don't see how recipe vs cbuildbot will make much difference in performance. The only heavy cbuildbot startup cost is the initial sync.

But, since the launcher currently does a full repo sync each time it iterates through it's loop, we have to pay that cost no matter what. Well, unless we can figure out why it does that and avoid the cost.

Project Member

Comment 7 by bugdroid1@chromium.org, May 18 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/08be0a3a846983e9e38bce8605d77a93853a87d6

commit 08be0a3a846983e9e38bce8605d77a93853a87d6
Author: Shuhei Takahashi <nya@chromium.org>
Date: Fri May 18 05:21:14 2018

validation_pool: Avoid spamming code reviews.

In pre-CQ, individual pre-CQ bots send failure notifications,
in contrast to CQ where a failure notification is sent only by
master-paladin. This leads to more spams as we add more
builders to pre-CQ.

This change modifies the notification logic so that only the
first failing pre-CQ bot sends a notification. Also updates
the notification message to refer CL status viewer to see
the full results.

BUG= chromium:842074 
BUG=chromium:840933
TEST=unit tests

Change-Id: I2cc282a8965a57cb7a583b3198d394e5db988c15
Reviewed-on: https://chromium-review.googlesource.com/1059975
Commit-Ready: Don Garrett <dgarrett@chromium.org>
Tested-by: Shuhei Takahashi <nya@chromium.org>
Reviewed-by: Gregory Meinke <gmeinke@chromium.org>
Reviewed-by: Don Garrett <dgarrett@chromium.org>

[modify] https://crrev.com/08be0a3a846983e9e38bce8605d77a93853a87d6/lib/cl_messages.py
[modify] https://crrev.com/08be0a3a846983e9e38bce8605d77a93853a87d6/cbuildbot/validation_pool.py
[modify] https://crrev.com/08be0a3a846983e9e38bce8605d77a93853a87d6/lib/cl_messages_unittest.py
[modify] https://crrev.com/08be0a3a846983e9e38bce8605d77a93853a87d6/cbuildbot/validation_pool_unittest.py

Project Member

Comment 8 by bugdroid1@chromium.org, May 18 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/e1b14496039c843107300cefb22e8891c5c95aa9

commit e1b14496039c843107300cefb22e8891c5c95aa9
Author: chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Date: Fri May 18 07:29:40 2018

Roll src/third_party/chromite/ 87247739f..08be0a3a8 (1 commit)

https://chromium.googlesource.com/chromiumos/chromite.git/+log/87247739f384..08be0a3a8469

$ git log 87247739f..08be0a3a8 --date=short --no-merges --format='%ad %ae %s'
2018-05-16 nya validation_pool: Avoid spamming code reviews.

Created with:
  roll-dep src/third_party/chromite
BUG= chromium:842074 ,chromium:840933


The AutoRoll server is located here: https://chromite-chromium-roll.skia.org

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, please contact the current sheriff, who should
be CC'd on the roll, and stop the roller if necessary.


TBR=chrome-os-gardeners@chromium.org

Change-Id: I2a4cf0ee81922f6c47f0c0c148531e16b1b9236e
Reviewed-on: https://chromium-review.googlesource.com/1065498
Commit-Queue: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Reviewed-by: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Cr-Commit-Position: refs/heads/master@{#559826}
[modify] https://crrev.com/e1b14496039c843107300cefb22e8891c5c95aa9/DEPS

Cc: jclinton@chromium.org
 Issue 898977  has been merged into this issue.
The bug merged in is an example of the kinds of failures we are seeing today, that fixing this would avoid.

"PreCQ launcher did not notify Gerrit of failure for ~18 hours"

Sign in to add a comment