Issue 902806

Issue metadata

Status: Fixed
Owner:
Closed: Nov 7
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 0
Type: Bug

Blocked on:
issue 902834



chromeos-amd64-generic-rel overloaded

Project Member Reported by martiniss@chromium.org, Nov 7

Issue description

This bot is significantly overloaded. Swarming tasks and LUCI builds are pending.

Unclear why, but this is blocking the commit queue.
 
Owner: martiniss@chromium.org
Status: Assigned (was: Unconfirmed)
Cc: mar...@chromium.org
Uhoh, think I found the problem:

https://logs.chromium.org/logs/chromium/buildbucket/cr-buildbucket.appspot.com/8930504758384886112/+/steps/wait_for_tasks__with_patch___2_/0/stdout

Looks like swarming somehow returned a 500. I think the builds exit before all of their swarming tasks are done, which probably ends up overloading the swarming pool.

MA, do you think this is related to the swarming outage this morning?
Have you been seeing that infrequently?

The bot looks like it got particularly bad today. Maybe just because of the swarming outage.
telemetry_perf_unittests is also timing out frequently on this bot, at least in a few runs I've seen. https://chromium-swarm.appspot.com/task?id=4107d414ab423e10&refresh=10&show_raw=1 is an example.
Cc: nednguyen@chromium.org
Ned, do you know why TPU is timing out in that task? Or do you know someone who might?
Project Member

Comment 7 by bugdroid1@chromium.org, Nov 7

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/3b04594ac61c1731192e56b6a97575150c17ae2e

commit 3b04594ac61c1731192e56b6a97575150c17ae2e
Author: Stephen Martinis <martiniss@chromium.org>
Date: Wed Nov 07 18:59:54 2018

Remove chromeos-amd64-generic-rel from CQ

The bot is broken at the moment.

NOTRY=true

Bug:  902806 
Change-Id: I1ee551c95617444e5ec4edbdb3d8d4fde6d7dc5f
Reviewed-on: https://chromium-review.googlesource.com/c/1324090
Commit-Queue: Stephen Martinis <martiniss@chromium.org>
Reviewed-by: John Budorick <jbudorick@chromium.org>
Cr-Commit-Position: refs/heads/master@{#606093}
[modify] https://crrev.com/3b04594ac61c1731192e56b6a97575150c17ae2e/infra/config/branch/cq.cfg

Blockedon: 902834
Re #6, it's likely the same reason other tests are failing (cros_vm_sanity_test, chrome_all_tast_tests): the browser is crashing. telemetry_unittests just fails very poorly when that happens (it hangs forever until it gets timed out).

See bug 902834 for the crashes.
I'm thinking of cancelling all the pending LUCI builds and swarming tasks for this builder, since we know that they're all going to fail. 
I've cancelled all the pending jobs on the trybot itself. 
Cc: achuith@chromium.org
+achuith for poor handling of browser crashes on ChromeOS :-/ (see #8)
Cc: -nednguyen@chromium.org
Bot is recovering. Once the swarming pool is no longer overloaded, I'll re-add it to the CQ.
I'm going to cancel the currently pending swarming tasks, and wait for the existing swarming tasks to finish executing. 
Cc: bpastene@chromium.org
Ben, you're probably interested in this.
I started https://ci.chromium.org/p/chromium/builders/luci.chromium.try/chromeos-amd64-generic-rel/126222 with a DEPS whitespace change, which should run all the tests. If the revert in bug 902834 (https://crrev.com/c/1323623) fixes it, the tests should run OK, and I'll re-add the bot to the CQ.
Project Member

Comment 17 by bugdroid1@chromium.org, Nov 7

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/5c8381c98e5bff06716d7a7c131dc9a08ef5646e

commit 5c8381c98e5bff06716d7a7c131dc9a08ef5646e
Author: Stephen Martinis <martiniss@chromium.org>
Date: Wed Nov 07 21:36:59 2018

Revert "Remove chromeos-amd64-generic-rel from CQ"

This reverts commit 3b04594ac61c1731192e56b6a97575150c17ae2e.

Reason for revert: builder has recovered.

Original change's description:
> Remove chromeos-amd64-generic-rel from CQ
> 
> The bot is broken at the moment.
> 
> NOTRY=true
> 
> Bug:  902806 
> Change-Id: I1ee551c95617444e5ec4edbdb3d8d4fde6d7dc5f
> Reviewed-on: https://chromium-review.googlesource.com/c/1324090
> Commit-Queue: Stephen Martinis <martiniss@chromium.org>
> Reviewed-by: John Budorick <jbudorick@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#606093}

TBR=martiniss@chromium.org,bpastene@chromium.org,jbudorick@chromium.org

Change-Id: I6c4f751514d9fc67ccf4004643f2368cc5effc68
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Bug:  902806 
Reviewed-on: https://chromium-review.googlesource.com/c/1324415
Reviewed-by: Stephen Martinis <martiniss@chromium.org>
Commit-Queue: Stephen Martinis <martiniss@chromium.org>
Cr-Commit-Position: refs/heads/master@{#606171}
[modify] https://crrev.com/5c8381c98e5bff06716d7a7c131dc9a08ef5646e/infra/config/branch/cq.cfg

Issue 902877 has been merged into this issue.
Status: Fixed (was: Assigned)
A CI build is passing tests, so we should be good to go. I'm re-adding the bot to the CQ.
Thanks for handling this, Stephen!

I'll file a couple of follow-ups to help avoid this kind of cascading failure in the future.
bug 902919 for more capacity
bug 902924 for making telemetry tests fail faster
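
For illustration only, here is a rough sketch of what "fail faster" could look like for the hang described in #8. It is not Telemetry code: `wait_for`, `BrowserCrashed`, and `demo` are hypothetical names, and a plain `subprocess.Popen` handle stands in for the browser under test. The idea is simply to check whether the browser process is still alive on every poll, so a crash surfaces right away instead of after the full timeout.

```python
import subprocess
import sys
import time


class BrowserCrashed(RuntimeError):
    """Raised when the browser process exits while a test is still waiting on it."""


def wait_for(condition, browser, timeout_s, poll_s=0.05):
    """Poll `condition` until it is true, the browser dies, or the timeout expires.

    `browser` is a subprocess.Popen handle standing in for the browser under
    test (an assumption; the real harness is not shown in this thread).
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        # Check liveness first: a dead browser can never satisfy the condition.
        exit_code = browser.poll()
        if exit_code is not None:
            raise BrowserCrashed(f"browser exited with code {exit_code}")
        if condition():
            return
        time.sleep(poll_s)
    raise TimeoutError(f"condition not met within {timeout_s}s")


def demo():
    """Start a stand-in 'browser' that exits immediately, then wait on it."""
    browser = subprocess.Popen([sys.executable, "-c", "import sys; sys.exit(3)"])
    started = time.monotonic()
    try:
        # The condition never becomes true, mimicking a page that never loads.
        wait_for(lambda: False, browser, timeout_s=60)
        outcome = "condition met"
    except BrowserCrashed as err:
        outcome = str(err)
    finally:
        browser.wait()
    return outcome, time.monotonic() - started
```

Calling `demo()` should report the crash well before the 60 second timeout, because the loop notices the exit on one of its first polls. Whether the real tests can observe the browser process this directly is an open question for bug 902924.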
Cc: jbudorick@chromium.org
It's worth noting that John is migrating Swarming instances to LIFO.
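
To make the LIFO point concrete, here is a toy model of an overloaded pool. It is only a sketch under stated assumptions, not Swarming's scheduler: `simulate` and all of its parameters are hypothetical, time is counted in abstract ticks, and pending tasks are assumed to expire after a fixed window.

```python
from collections import deque


def simulate(policy, ticks=20, arrivals_per_tick=3, capacity_per_tick=1,
             expiry_ticks=5):
    """Return the pending time (in ticks) of every task that got to run.

    policy is "fifo" (oldest pending task first) or "lifo" (newest first).
    All numbers are made up for illustration.
    """
    if policy not in ("fifo", "lifo"):
        raise ValueError(f"unknown policy: {policy}")
    pending = deque()  # arrival tick of each pending task, oldest on the left
    waits = []
    for now in range(ticks):
        # More tasks arrive each tick than the pool can run: overload.
        pending.extend([now] * arrivals_per_tick)
        # Assumption: a task that has been pending for expiry_ticks is dropped.
        while pending and now - pending[0] >= expiry_ticks:
            pending.popleft()
        # Run as many tasks as the pool has capacity for this tick.
        for _ in range(capacity_per_tick):
            if not pending:
                break
            arrived = pending.popleft() if policy == "fifo" else pending.pop()
            waits.append(now - arrived)
    return waits
```

In this model both policies run the same number of tasks, since capacity is the bottleneck either way. The difference is which tasks run: with `"fifo"` the tasks that run have waited almost the whole expiry window, while with `"lifo"` the newest tasks start right away and the older ones expire. That trade-off is only useful if the oldest pending tasks are the ones least likely to still matter, for example because the build that triggered them has already exited.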
