chromeos-amd64-generic-rel overloaded
Issue description
This bot is significantly overloaded. Swarming tasks and LUCI builds are pending. Unclear why, but this is blocking the commit queue.
,
Nov 7
Uh-oh, I think I found the problem: https://logs.chromium.org/logs/chromium/buildbucket/cr-buildbucket.appspot.com/8930504758384886112/+/steps/wait_for_tasks__with_patch___2_/0/stdout Looks like swarming somehow returned a 500. I think the builds exit before all the swarming tasks are done, which probably ends up overloading the swarming pool. MA, do you think this is related to the swarming outage this morning?
,
Nov 7
No, we've started seeing this:
---
Failed to access https://chromium-swarm.appspot.com/_ah/api/swarming/v1/tasks/get_states?task_id=4107b16a414d7810&task_id=4107b17297f0bf10&task_id=4107b17af3d05710&task_id=4107b17c4eb3af10&task_id=4107b17df0a31710&task_id=4107b17f9d91fb10&task_id=4107b1813f686110&task_id=4107b18350c11f10&task_id=4107b1892ea5d110&task_id=4107b18c5e252110&task_id=4107b18d958fc110&task_id=4107b18f24e40110&task_id=4107b19050d73f10&task_id=4107b191dfc88610&task_id=4107b195757b0e10&task_id=4107b1996e490110&task_id=4107b19e75257e10&task_id=4107b1a07cbaa910&task_id=4107b1a34cdf6110&task_id=4107b1a56fbec810&limit=200
---
I suspect this is due to https://cs.chromium.org/chromium/infra/luci/appengine/swarming/handlers_endpoints.py?l=538
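For illustration only, here is a minimal sketch of how a polling client could tolerate an occasional 500 from a state-polling request like the one above, instead of giving up while tasks are still running. It is not the real Swarming client: `fetch`, `TransientServerError`, and `poll_with_backoff` are hypothetical names, and the retry limits are arbitrary.

```python
import random
import time


class TransientServerError(Exception):
    """Raised by the caller-supplied fetch function on an HTTP 5xx (hypothetical)."""


def poll_with_backoff(fetch, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call fetch() until it succeeds, backing off exponentially on 5xx errors.

    fetch is a hypothetical zero-argument callable standing in for the
    get_states request; it is not the real Swarming client.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except TransientServerError:
            if attempt == max_attempts - 1:
                raise  # Out of retries: surface the error to the caller.
            # Exponential backoff, capped, with jitter so that many builds
            # do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))
```

The `sleep` parameter exists only so the helper can be exercised without real delays. Retrying does not fix a server-side bug, but it keeps a single failed poll from ending the build early.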
,
Nov 7
Have you been seeing that infrequently? The bot looks like it got particularly bad today. Maybe just because of the swarming outage.
,
Nov 7
telemetry_perf_unittests is also timing out frequently on this bot, at least in a few runs I've seen. https://chromium-swarm.appspot.com/task?id=4107d414ab423e10&refresh=10&show_raw=1 is an example.
,
Nov 7
Ned, do you know why TPU is timing out in that task? Or do you know someone who might?
,
Nov 7
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/3b04594ac61c1731192e56b6a97575150c17ae2e

commit 3b04594ac61c1731192e56b6a97575150c17ae2e
Author: Stephen Martinis <martiniss@chromium.org>
Date: Wed Nov 07 18:59:54 2018

Remove chromeos-amd64-generic-rel from CQ

The bot is broken at the moment.

NOTRY=true
Bug: 902806
Change-Id: I1ee551c95617444e5ec4edbdb3d8d4fde6d7dc5f
Reviewed-on: https://chromium-review.googlesource.com/c/1324090
Commit-Queue: Stephen Martinis <martiniss@chromium.org>
Reviewed-by: John Budorick <jbudorick@chromium.org>
Cr-Commit-Position: refs/heads/master@{#606093}

[modify] https://crrev.com/3b04594ac61c1731192e56b6a97575150c17ae2e/infra/config/branch/cq.cfg
,
Nov 7
Re #6, it's likely the same reason other tests are failing (cros_vm_sanity_test, chrome_all_tast_tests): the browser is crashing. telemetry_unittests just handles that very poorly (it hangs until it gets timed out). See bug 902834 for the crashes.
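As a rough sketch of the fail-fast behaviour that would avoid this kind of hang: a wait loop can check whether the process under test is still alive on every iteration and abort immediately if it has exited, rather than waiting out the full timeout. This is not Telemetry code; `wait_for` and `ProcessDied` are hypothetical names, and `proc` is a plain `subprocess.Popen` standing in for the browser.

```python
import subprocess
import sys
import time


class ProcessDied(Exception):
    """The watched process exited before the awaited condition became true."""


def wait_for(condition, proc, timeout=60.0, interval=0.1):
    """Poll condition() until true, but fail at once if proc has exited.

    proc is a subprocess.Popen standing in for the browser under test.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return
        exit_code = proc.poll()  # None while the process is still running.
        if exit_code is not None:
            raise ProcessDied(f"process exited with code {exit_code}")
        time.sleep(interval)
    raise TimeoutError(f"condition not met within {timeout} seconds")
```

With a check like this, a crashed browser turns into a quick, clearly labelled failure, which frees the swarming bot much sooner than a task-level timeout would.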
,
Nov 7
I'm thinking of cancelling all the pending LUCI builds and swarming tasks for this builder, since we know that they're all going to fail.
,
Nov 7
I've cancelled all the pending jobs on the trybot itself.
,
Nov 7
+achuith for poor handling of browser crashes on ChromeOS :-/ (see #8)
,
Nov 7
Bot is recovering. Once the swarming pool is no longer overloaded, I'll re-add it to the CQ.
,
Nov 7
I'm going to cancel the currently pending swarming tasks, and wait for the existing swarming tasks to finish executing.
,
Nov 7
Ben, you're probably interested in this.
,
Nov 7
I started https://ci.chromium.org/p/chromium/builders/luci.chromium.try/chromeos-amd64-generic-rel/126222 with a DEPS whitespace change, which should run all the tests. If the revert in bug 902834 (https://crrev.com/c/1323623) fixes it, tests should run OK, and I'll re-add the bot to the CQ.
,
Nov 7
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/5c8381c98e5bff06716d7a7c131dc9a08ef5646e

commit 5c8381c98e5bff06716d7a7c131dc9a08ef5646e
Author: Stephen Martinis <martiniss@chromium.org>
Date: Wed Nov 07 21:36:59 2018

Revert "Remove chromeos-amd64-generic-rel from CQ"

This reverts commit 3b04594ac61c1731192e56b6a97575150c17ae2e.

Reason for revert: builder has recovered.

Original change's description:
> Remove chromeos-amd64-generic-rel from CQ
>
> The bot is broken at the moment.
>
> NOTRY=true
>
> Bug: 902806
> Change-Id: I1ee551c95617444e5ec4edbdb3d8d4fde6d7dc5f
> Reviewed-on: https://chromium-review.googlesource.com/c/1324090
> Commit-Queue: Stephen Martinis <martiniss@chromium.org>
> Reviewed-by: John Budorick <jbudorick@chromium.org>
> Cr-Commit-Position: refs/heads/master@{#606093}

TBR=martiniss@chromium.org,bpastene@chromium.org,jbudorick@chromium.org

Change-Id: I6c4f751514d9fc67ccf4004643f2368cc5effc68
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Bug: 902806
Reviewed-on: https://chromium-review.googlesource.com/c/1324415
Reviewed-by: Stephen Martinis <martiniss@chromium.org>
Commit-Queue: Stephen Martinis <martiniss@chromium.org>
Cr-Commit-Position: refs/heads/master@{#606171}

[modify] https://crrev.com/5c8381c98e5bff06716d7a7c131dc9a08ef5646e/infra/config/branch/cq.cfg
,
Nov 7
Issue 902877 has been merged into this issue.
,
Nov 7
A CI build is passing tests, so we should be good to go. I'm re-adding the bot to the CQ.
,
Nov 7
Thanks for handling this, Stephen! I'll file a couple of follow-ups to help avoid this kind of cascading failure in the future.
,
Nov 7
I started a postmortem at https://docs.google.com/document/d/1e5pSPsn9rb4KikQYpj5Qgml4GvgqhrfYj1MJNA7pluU/edit# and I'll fill it out.
,
Nov 7
Follow-ups: bug 902919 for more capacity, and bug 902924 for making telemetry tests fail faster.
,
Nov 19
It's worth noting that John is migrating Swarming instances to LIFO.
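The thread does not say why LIFO is being adopted. One common argument for it is that in an overloaded queue, FIFO spends capacity on the oldest tasks, whose requesters may already have given up, while LIFO serves the freshest ones. The toy simulation below illustrates that argument only; the function name and all numbers are made up and do not model the real Swarming scheduler.

```python
from collections import deque


def simulate(policy, ticks=100, arrivals=3, capacity=2, deadline=5):
    """Return how many tasks start within `deadline` ticks of being queued.

    Each tick, `arrivals` tasks are queued and at most `capacity` are started.
    All parameters are illustrative, not measurements from the real pool.
    """
    queue = deque()
    on_time = 0
    for now in range(ticks):
        for _ in range(arrivals):
            queue.append(now)  # Store the enqueue time.
        for _ in range(min(capacity, len(queue))):
            # FIFO takes the oldest queued task, LIFO the newest.
            queued_at = queue.popleft() if policy == "fifo" else queue.pop()
            if now - queued_at <= deadline:
                on_time += 1
    return on_time
```

Comparing `simulate("fifo")` with `simulate("lifo")` shows FIFO falling behind once the backlog grows past the deadline, while LIFO keeps starting fresh tasks on time. The trade-off is that under sustained overload the oldest tasks in a LIFO queue may never run.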
Comment 1 by martiniss@chromium.org, Nov 7
Status: Assigned (was: Unconfirmed)