
Issue 819419


Issue metadata

Status: Fixed
Owner:
Closed: Apr 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




Alert if important PFQ and release builders don't run for X number of hours

Project Member Reported by pprabhu@chromium.org, Mar 7 2018

Issue description

On my deputy shifts, I've handled issues from our client teams pointing out that some release builder hasn't run in days.
Example: issue 819357
The root cause is that the buildslave dies for some reason, and because we have a 1:1 mapping between these builders and the buildslaves, the builder is left with no slave to run on.

We should add Viceroy alerts that fire when an important PFQ, Android PFQ, or release builder has not run at all for, say, 1 day.
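
A minimal sketch of the check such an alert would perform, assuming we can get the last run time of each important builder from somewhere. The function name `find_stale_builders`, the builder names, and the input format are hypothetical and are not taken from Viceroy or chromite.

```python
from datetime import datetime, timedelta


def find_stale_builders(last_run, now, max_age=timedelta(days=1)):
    """Return the builders that have not run within max_age.

    last_run maps a builder name to the datetime of its most recent run,
    or to None if no run is known at all (hypothetical input format).
    """
    stale = []
    for builder in sorted(last_run):
        ts = last_run[builder]
        # A builder with no known run is treated as stale too.
        if ts is None or now - ts > max_age:
            stale.append(builder)
    return stale


if __name__ == "__main__":
    now = datetime(2018, 3, 7, 12, 0)
    last_run = {
        "example-release": now - timedelta(hours=2),
        "example-pfq": now - timedelta(days=3),
        "example-android-pfq": None,
    }
    for name in find_stale_builders(last_run, now):
        print("ALERT: %s has not run in the last day" % name)
```

The 1 day threshold is only the default here. A real alert would likely tune it per builder type.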


This situation doesn't arise for paladins, where we'd notice immediately.


 
Labels: Chase-Pending
This is borderline Chase. We usually reserve Chase for CQ / outage-causing lab failures.
But not being able to even run important builds on the other waterfalls is kinda bad.
Owner: akes...@chromium.org
Status: Assigned (was: Untriaged)
1) Master could own this metric, and increment an "important board didn't run" counter.

2) Master could export an "important build existence metric", which we could join against the build run metric (this is analogous to the prod role metric).
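
Option 2 can be sketched as a join between the set of builders the master declares important and the runs actually observed. This is only an illustration of the idea: `runs_per_important_builder`, `builders_that_did_not_run`, and the builder names are hypothetical, and the real metrics would live in the monitoring backend rather than in Python dicts.

```python
from collections import Counter


def runs_per_important_builder(important_builders, completed_runs):
    """Join the existence set against observed runs.

    important_builders: iterable of builder names the master considers
        important (the "existence" side of the join).
    completed_runs: iterable of builder names, one entry per observed run
        in the time window of interest.

    Every important builder appears in the result, with a count of 0 if
    it never ran. Without the existence side, such builders would simply
    be absent from the run data and could not be alerted on.
    """
    important = set(important_builders)
    counts = Counter(b for b in completed_runs if b in important)
    return {b: counts.get(b, 0) for b in sorted(important)}


def builders_that_did_not_run(important_builders, completed_runs):
    """Return the important builders with zero runs in the window."""
    joined = runs_per_important_builder(important_builders, completed_runs)
    return [b for b, n in joined.items() if n == 0]


if __name__ == "__main__":
    important = ["example-release", "example-pfq"]
    runs = ["example-release", "example-release", "unimportant-builder"]
    print(runs_per_important_builder(important, runs))
    print(builders_that_did_not_run(important, runs))
```

Compared with option 1, the master only has to state which builders should exist, and the "didn't run" condition falls out of the join.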


Labels: -Chase-Pending Chase
Project Member

Comment 4 by bugdroid1@chromium.org, Mar 14 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/chromite/+/0ac72f67b0730bf4255f555b13d7ee34960e6f68

commit 0ac72f67b0730bf4255f555b13d7ee34960e6f68
Author: Aviv Keshet <akeshet@chromium.org>
Date: Wed Mar 14 21:21:49 2018

completion_stages: add a has_important_slave metric to master completion

BUG= chromium:819419 
TEST=None

Change-Id: I5d9ecda99040ba134ab5d013d7997a99038bc327
Reviewed-on: https://chromium-review.googlesource.com/961279
Commit-Ready: Aviv Keshet <akeshet@chromium.org>
Tested-by: Aviv Keshet <akeshet@chromium.org>
Reviewed-by: Paul Hobbs <phobbs@google.com>

[modify] https://crrev.com/0ac72f67b0730bf4255f555b13d7ee34960e6f68/cbuildbot/stages/completion_stages.py

Project Member

Comment 5 by bugdroid1@chromium.org, Mar 16 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/b8e7c2d9a192a0e7c441c6f34284fea6ef68dcd5

commit b8e7c2d9a192a0e7c441c6f34284fea6ef68dcd5
Author: chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Date: Fri Mar 16 22:50:01 2018

Roll src/third_party/chromite/ 3b75c9d82..3ad8f333d (31 commits)

https://chromium.googlesource.com/chromiumos/chromite.git/+log/3b75c9d82ebf..3ad8f333d567

$ git log 3b75c9d82..3ad8f333d --date=short --no-merges --format='%ad %ae %s'
2018-03-16 dgarrett Revert "Reland "pre_cq_launcher: Swarming for chromeos-infra-puppet-pre-cq.""
2018-03-16 dgarrett Reland "pre_cq_launcher: Swarming for chromeos-infra-puppet-pre-cq."
2018-03-14 ayatane autotest-pre-cq: Remove builder and stage [2/2]
2018-03-16 dgarrett Revert "pre_cq_launcher: Swarming for chromeos-infra-puppet-pre-cq."
2018-03-15 dgarrett chromeos_config: Move fuzzer builds into new bucket.
2018-03-16 dgarrett Revert "commands: RunBranchUtilTest -> RunLocalTryjob"
2018-03-13 dgarrett pre_cq_launcher: Swarming for chromeos-infra-puppet-pre-cq.
2018-02-07 dgarrett commands: RunBranchUtilTest -> RunLocalTryjob
2018-03-14 dgarrett cbuildbot_run: Switch more build links to Legoland.
2018-03-13 dgarrett swarming_lib: Remove SWARMING_TASK_ID from cmds.
2018-03-08 dgarrett moblab_vm_unitest: Fix lint issues.
2018-03-14 ihf chromeos_config: add more arcnext experimental coverage.
2018-03-14 ayatane autotest-pre-cq: Remove this [1/2]
2018-03-14 norvez chromeos_config: remove dead code
2018-03-09 dgarrett summarize_build_stats: Add blank line at beginning.
2018-01-09 dgarrett cros tryjob: Remove buildbot URL generation.
2017-09-14 craigb image_test: Remove check that kernel is not ELF.
2018-03-15 ihf Revert "chromeos_config: temporarily mark eve-arcnext-paladin experimental"
2018-03-15 ihf Revert "chromeos_config: temporarily experimental eve-arcnext-mst-android-pfq"
2018-03-13 lhchavez chromeos_config: Add betty-arcnext builder config
2018-03-13 achuith cbuildbot: Add missing files to index.
2018-03-13 akeshet completion_stages: add a has_important_slave metric to master completion
2018-03-13 dgarrett precq-launcher: Start using Legoland build details page.
2018-03-08 dgarrett chromite-pre-cq: Disable CidbIntegrationTest.
2018-03-14 akeshet chromeos_config: temporarily experimental eve-arcnext-mst-android-pfq
2018-03-13 akeshet chromeos_config: temporarily mark eve-arcnext-paladin experimental
2018-03-12 haddowk [chromite] Make guado_moblab important again
2018-03-13 chrome-bot Update config settings by config-updater.
2018-03-12 gmeinke chromium-config: replace cros_config_host_py in chromite
2018-03-12 yunlian Enable ThinLTO on all AMD64 boards.
2018-03-12 achuith cbuildbot: Log timing of GenerateUploadJSON.

Created with:
  roll-dep src/third_party/chromite
BUG=821930, 822517 , 821615 ,None,821618,821227,None,821664,821930,None,815377,747385,461595,821664,821664,811989,819419,821618,820305,821664,821664,819017,813442,707803,811989


The AutoRoll server is located here: https://chromite-chromium-roll.skia.org

Documentation for the AutoRoller is here:
https://skia.googlesource.com/buildbot/+/master/autoroll/README.md

If the roll is causing failures, please contact the current sheriff, who should
be CC'd on the roll, and stop the roller if necessary.


TBR=chrome-os-gardeners@chromium.org

Change-Id: Ib6aaddf338307e994865a092ecb322a432148692
Reviewed-on: https://chromium-review.googlesource.com/967273
Commit-Queue: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Reviewed-by: Chromite Chromium Autoroll <chromite-chromium-autoroll@skia-buildbots.google.com.iam.gserviceaccount.com>
Cr-Commit-Position: refs/heads/master@{#543855}
[modify] https://crrev.com/b8e7c2d9a192a0e7c441c6f34284fea6ef68dcd5/DEPS

1 more CL to go
Alert expected this week.
This looks like a good "how many times has an important slave run in 1 day" counter. It appears to also work for 0-values.

http://shortn/_zFmy2bvS87
Cc: jclinton@chromium.org
Components: -Infra>Client>ChromeOS Infra>Client>ChromeOS>CI
+jclinton fyi

I'm adding this alert in https://critique.corp.google.com/#review/191011875
Status: Started (was: Assigned)
Status: Fixed (was: Started)
