Issue 653728 link

Issue metadata

Status: WontFix
Owner:
Closed: Oct 2016
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug

Blocking:
issue 649391



Change recipe to stop expiring on swarming.py collect

Project Member Reported by mar...@chromium.org, Oct 6 2016

Issue description

There are multiple timeouts when a task is triggered on swarming:
- expiration is the deadline for handing the task to a worker. By default it is 1h. If the task has not been handed to an available worker within 1h of creation, it is immediately marked as EXPIRED and never run.
- hard_timeout is the longest duration the task is allowed to run. By default it is 1h.

In the worst case, results become available only after (expiration minus 1 second) + (hard_timeout minus 1 second), measured from task creation. With the defaults, this is nearly 2 hours.

When the recipe module collects results, it uses a 1 hour timeout. So a task may be marked as timed out even though, when the user looks at the results, the task succeeded. This is super confusing.
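
The arithmetic above can be sketched as follows. This is only an illustration of the numbers quoted in this description, not code from swarming.py or the recipe module; the constant and function names are hypothetical, and the 1 hour values are the defaults as stated here.

```python
# Values in seconds, taken from the issue description (assumed defaults).
EXPIRATION_SECS = 60 * 60       # task must be handed to a worker within 1h
HARD_TIMEOUT_SECS = 60 * 60     # task may run for at most 1h once started
COLLECT_TIMEOUT_SECS = 60 * 60  # timeout the recipe is described as using


def worst_case_result_delay(expiration_secs, hard_timeout_secs):
    """Latest time after creation at which a successful result can appear.

    The task is picked up one second before it would expire, then finishes
    one second before the hard timeout would kill it.
    """
    return (expiration_secs - 1) + (hard_timeout_secs - 1)


def collect_may_give_up_early(collect_timeout_secs, expiration_secs,
                              hard_timeout_secs):
    """True if collection can time out while the task can still succeed."""
    worst = worst_case_result_delay(expiration_secs, hard_timeout_secs)
    return collect_timeout_secs < worst


worst = worst_case_result_delay(EXPIRATION_SECS, HARD_TIMEOUT_SECS)
print(worst)  # 7198 seconds, i.e. nearly 2 hours
print(collect_may_give_up_early(
    COLLECT_TIMEOUT_SECS, EXPIRATION_SECS, HARD_TIMEOUT_SECS))  # True
```

With a collect timeout of at least the worst-case delay, the mismatch described above would not occur under these assumptions.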

AI:
- Remove the --timeout flag from swarming.py collect.


Refs:
https://github.com/luci/luci-py/blob/master/client/swarming.py#L1052
https://chromium.googlesource.com/chromium/tools/build/+/master/scripts/slave/recipe_modules/swarming/api.py
 
Owner: katthomas@chromium.org
This might not actually be the case. The entry point to collect does not appear to pass a timeout (https://chromium.googlesource.com/chromium/tools/build/+/master/scripts/slave/recipe_modules/swarming/api.py#866), so the default value is used, which appears to be sane (https://github.com/luci/luci-py/blob/master/client/swarming.py#L1220).

I'm going to look into confirming this. If anyone knows how to do that, let me know.
Blocking: 649391
Project Member

Comment 6 by bugdroid1@chromium.org, Oct 12 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build.git/+/f33e8e1237971cb0ca2725751eae9c133a1ff93c

commit f33e8e1237971cb0ca2725751eae9c133a1ff93c
Author: katthomas <katthomas@google.com>
Date: Wed Oct 12 15:55:36 2016

Log more when loading shard output.json fails

When we use swarming to run steps for build bots, and
swarming fails to load the output.json for any shard, the
whole step is marked as an infra failure.

Sometimes, the shard completes successfully, but swarming
can't load the output.json for whatever reason. We're not
sure why this is happening, and hopefully this will help.

BUG= 653728 

Review-Url: https://codereview.chromium.org/2413553002

[modify] https://crrev.com/f33e8e1237971cb0ca2725751eae9c133a1ff93c/scripts/slave/recipe_modules/swarming/resources/collect_gtest_task.py

Project Member

Comment 8 by bugdroid1@chromium.org, Oct 12 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/ffeed3944daf472db8c9da8c1ab524dc5ef5398f

commit ffeed3944daf472db8c9da8c1ab524dc5ef5398f
Author: recipe-roller <recipe-roller@chromium.org>
Date: Wed Oct 12 16:10:27 2016

Roll recipe dependencies (trivial).

This is an automated CL created by the recipe roller. This CL rolls recipe
changes from upstream projects (e.g. depot_tools) into downstream projects
(e.g. tools/build).

More info is at https://goo.gl/zkKdpD. Use https://goo.gl/noib3a to file a bug
(or complain)

build:
  https://crrev.com/f33e8e1237971cb0ca2725751eae9c133a1ff93c Log more when loading shard output.json fails (katthomas@google.com)

TBR=martiniss@chromium.org,phajdan.jr@chromium.org
BUG= 653728 

Recipe-Tryjob-Bypass-Reason: Autoroller
Bugdroid-Send-Email: False
Review-Url: https://codereview.chromium.org/2416593002
Cr-Commit-Position: refs/heads/master@{#424756}

[modify] https://crrev.com/ffeed3944daf472db8c9da8c1ab524dc5ef5398f/infra/config/recipes.cfg

Project Member

Comment 9 by bugdroid1@chromium.org, Oct 12 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/infra.git/+/8f704e1155ff8c95234e403e6f0507f4bea21894

commit 8f704e1155ff8c95234e403e6f0507f4bea21894
Author: recipe-roller <recipe-roller@chromium.org>
Date: Wed Oct 12 16:16:41 2016

Roll recipe dependencies (trivial).

This is an automated CL created by the recipe roller. This CL rolls recipe
changes from upstream projects (e.g. depot_tools) into downstream projects
(e.g. tools/build).

More info is at https://goo.gl/zkKdpD. Use https://goo.gl/noib3a to file a bug
(or complain)

build:
  https://crrev.com/f33e8e1237971cb0ca2725751eae9c133a1ff93c Log more when loading shard output.json fails (katthomas@google.com)

TBR=martiniss@chromium.org,phajdan.jr@chromium.org
BUG= 653728 

Recipe-Tryjob-Bypass-Reason: Autoroller
Bugdroid-Send-Email: False
Review-Url: https://codereview.chromium.org/2411683004

[modify] https://crrev.com/8f704e1155ff8c95234e403e6f0507f4bea21894/infra/config/recipes.cfg

Labels: Hotlist-Infra-Flakiness
Project Member

Comment 11 by bugdroid1@chromium.org, Oct 14 2016

Status: WontFix (was: Unconfirmed)
I'm going to close this for now. If we see another step that exceeds its timeout but is marked as a success, we'll have more information now. 
