
Issue 869187

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




swarming.py collect --print-status-updates times out on swarming

Project Member Reported by martiniss@chromium.org, Jul 30

Issue description

https://ci.chromium.org/p/infra/builders/luci.infra.try/Build%20Recipes%20Tester/b8939547300825591824 is a sample build

The specific step that fails is 'swarming.py collect'. I pass in --print-status-updates, but it seems to print only one update. If I understand the code correctly, it should print one every 15 minutes.

I also don't think the task has a timeout shorter than 15 minutes. There's an execution timeout of 30 minutes...

Any ideas what's happening here?

There's a separate issue where it's not actually collecting the task itself, although maybe there's a bug in the swarming.py collect code that causes it to hang and never collect the task.
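For reference, the behaviour described above (poll for a result, print a status line at a fixed interval, give up at a deadline) can be sketched as below. This is only an illustration, not the actual swarming.py code: `collect_with_status_updates`, `poll`, `status_interval` and `deadline` are hypothetical names, and the 15 minute interval is just the value mentioned in this description.

```python
import time
from typing import Callable, Optional


def collect_with_status_updates(
    poll: Callable[[], Optional[str]],
    status_interval: float,
    deadline: float,
    poll_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
    report: Callable[[str], None] = print,
) -> Optional[str]:
    """Poll until a result arrives, reporting progress periodically.

    Hypothetical helper, not part of swarming.py. Returns the result,
    or None if `deadline` seconds pass without one.
    """
    start = clock()
    last_report = start
    while True:
        result = poll()
        if result is not None:
            return result
        now = clock()
        if now - start >= deadline:
            report("timed out after %.0fs" % (now - start))
            return None
        # One status line per elapsed interval, not one line in total.
        if now - last_report >= status_interval:
            report("still waiting after %.0fs" % (now - start))
            last_report = now
        sleep(poll_interval)
```

If a loop like this prints a single update and then goes quiet, the two likely suspects are that `last_report` is never compared again or that `poll` itself blocks and never returns.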
 
Cc: vadimsh@chromium.org mar...@chromium.org
Components: Infra>Platform>Swarming>Admin
Labels: -OS-Linux -Pri-3 Pri-1
Status: Available (was: Unconfirmed)
Can someone familiar with swarming take a look? 
Looks like a problem with the recipe. Tasks:
https://chromium-swarm.appspot.com/task?id=3f05918474816f10
https://chromium-swarm.appspot.com/task?id=3f05930303f1b010

In particular the first task didn't print anything as the recipe ran. I think the right fix is to disable the I/O timeout.
https://chromium-swarm.appspot.com/task?id=3f05918474816f10&show_raw=1
What do you mean, it didn't print anything? https://chromium-swarm.appspot.com/task?id=3f05918474816f10&refresh=10&show_raw=1 doesn't print anything because the recipe is running. That's due to how kitchen works, afaik.

I'm more confused about why `swarming.py collect` runs for 23 hours in https://chromium-swarm.appspot.com/task?id=3f05918474816f10&refresh=10 (same task). It doesn't print anything, it just sits there and hangs. And IIRC I've run this locally and it collected the result just fine. I'll double check right now.
Don't be confused by the Milo output: the task https://chromium-swarm.appspot.com/task?id=3f05918474816f10 ran for exactly 30 minutes.

That's why I recommend disabling the io_timeout.
Ah.

But the task has no I/O timeout. It just has "--" there, which I assume means none is set. Do I want to raise or disable the execution timeout instead?

Also I'm not sure how to change that, since this task is triggered by buildbucket...
Errr, you're right. It has an execution timeout of 30m though, which seems to be too low.
And yes, the default is low, so that people don't run long tasks without realizing it.
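To make the distinction between the two timeouts concrete, here is a small sketch of how I understand them from this thread: the execution timeout caps total runtime, while the I/O timeout caps how long the task may stay silent. `first_timeout` and its parameters are hypothetical names for illustration, not Swarming's API, and all times are seconds since task start.

```python
from typing import List, Optional, Tuple


def first_timeout(
    output_times: List[float],
    finish_time: float,
    execution_timeout: float,
    io_timeout: Optional[float] = None,
) -> Optional[Tuple[str, float]]:
    """Return (kind, time) of the first timeout to fire, or None.

    Hypothetical model: `output_times` are the moments the task printed
    something (all assumed <= finish_time), `finish_time` is when it
    would finish if left alone. io_timeout=None means no I/O timeout.
    """
    last_output = 0.0
    # Treat the natural finish as the final event after all output.
    for t in sorted(output_times) + [finish_time]:
        if io_timeout is not None and t - last_output > io_timeout:
            io_fire = last_output + io_timeout
            if io_fire <= execution_timeout:
                return ("io", io_fire)
        if t > execution_timeout:
            return ("execution", execution_timeout)
        last_output = t
    return None


# A silent task that needs 2h, with a 30m execution timeout and no
# I/O timeout, is killed at exactly 30 minutes.
print(first_timeout([], 7200, execution_timeout=1800))
```

With no I/O timeout set, a task that prints nothing is cut off only by the execution timeout, which matches the task here ending at exactly 30 minutes.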
Project Member

Comment 9 by bugdroid1@chromium.org, Jul 31

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/687b3af44494680e7f4d0bacb45d09d6ec6dc084

commit 687b3af44494680e7f4d0bacb45d09d6ec6dc084
Author: Stephen Martinis <martiniss@chromium.org>
Date: Tue Jul 31 20:04:25 2018

tcmalloc builder: Remove cores dimension

Don't really need to specify cores, as only one bot has that builder
dimension.

Bug: 868024, 869187
Change-Id: Ie59cf98fc59d4bbc869618bc2add13aeedf10759
Reviewed-on: https://chromium-review.googlesource.com/1157179
Commit-Queue: Stephen Martinis <martiniss@chromium.org>
Reviewed-by: Robbie Iannucci <iannucci@chromium.org>
Cr-Commit-Position: refs/heads/master@{#579539}
[modify] https://crrev.com/687b3af44494680e7f4d0bacb45d09d6ec6dc084/infra/config/global/cr-buildbucket.cfg
