Change recipe to stop expiring on swarming.py collect
Issue description

There are multiple timeouts when a task is triggered on swarming:
- expiration is the timeout that triggers if the task wasn't handed to a worker in time. By default it is 1h. So if a task isn't handed to an available worker within 1h of creation, it is immediately marked as EXPIRED and never run.
- hard_timeout is the longest duration a task is allowed to run. By default it is 1h.

The worst case is that results become available after (expiration minus 1 second) + (hard_timeout minus 1 second). By default, this is nearly 2 hours.

When the recipe module collects results, it uses a 1 hour timeout. So a task may be marked as timed out even though, when the user looks at the results, the task succeeded. This is super confusing.

AI:
- Remove the --timeout flag from swarming.py collect.

Refs:
https://github.com/luci/luci-py/blob/master/client/swarming.py#L1052
https://chromium.googlesource.com/chromium/tools/build/+/master/scripts/slave/recipe_modules/swarming/api.py
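To make the arithmetic in the description concrete, here is a minimal sketch of the worst-case calculation. The function and constant names are made up for illustration and do not come from swarming.py or the recipe module; the 1 hour values are the defaults quoted in the description.

```python
from datetime import timedelta

# Defaults as stated in the issue description (names are hypothetical).
DEFAULT_EXPIRATION = timedelta(hours=1)
DEFAULT_HARD_TIMEOUT = timedelta(hours=1)
COLLECT_TIMEOUT = timedelta(hours=1)


def worst_case_result_delay(expiration, hard_timeout):
    """Upper bound on the time from task creation to results being available.

    The task can be picked up one second before it would expire, and then
    run for one second less than its hard timeout.
    """
    one_second = timedelta(seconds=1)
    return (expiration - one_second) + (hard_timeout - one_second)


def collect_may_give_up_early(collect_timeout, expiration, hard_timeout):
    """True if collect can time out while the task could still succeed."""
    return collect_timeout < worst_case_result_delay(expiration, hard_timeout)


worst = worst_case_result_delay(DEFAULT_EXPIRATION, DEFAULT_HARD_TIMEOUT)
print(worst)  # 1:59:58
print(collect_may_give_up_early(
    COLLECT_TIMEOUT, DEFAULT_EXPIRATION, DEFAULT_HARD_TIMEOUT))  # True
```

With the stated defaults, a 1 hour collect timeout is shorter than the worst-case delay of just under 2 hours, which is the mismatch the description is worried about.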
,
Oct 11 2016
This might not actually be the case. The entry point to collect does not appear to pass a timeout (https://chromium.googlesource.com/chromium/tools/build/+/master/scripts/slave/recipe_modules/swarming/api.py#866), so we use the default value, which appears to be sane (https://github.com/luci/luci-py/blob/master/client/swarming.py#L1220). I'm going to look into confirming this. If anyone knows how to do that, let me know.
,
Oct 11 2016
We're reporting a missing shard because for some reason we're not able to load the shard output json. I'm going to start with making the error message here more descriptive: https://cs.chromium.org/chromium/build/scripts/slave/recipe_modules/swarming/resources/collect_gtest_task.py?q=%22Missing+or+invalid+gtest+JSON+file%22+exact:yes&sq=package:chromium&dr=C&l=115 Relevant log: https://luci-logdog.appspot.com/v/?s=chromium%2Fbb%2Ftryserver.chromium.win%2Fwin_chromium_x64_rel_ng%2F287235%2F%2B%2Frecipes%2Fsteps%2Fbrowser_tests__with_patch__on_Windows-7-SP1%2F0%2Fstdout (Search "End of shard 1")
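As a rough illustration of what "more descriptive" could mean here, the sketch below separates the ways loading a shard's output JSON can fail. It is not the code in collect_gtest_task.py; the function name load_shard_json and the message wording are assumptions made for this example.

```python
import json
import os


def load_shard_json(path):
    """Returns (data, error). Exactly one of the two is None.

    Hypothetical helper: reports a missing file, an unreadable file, an
    empty file, and invalid JSON as separate errors instead of one
    generic "missing or invalid" message.
    """
    if not os.path.isfile(path):
        return None, 'shard output file not found: %s' % path
    try:
        with open(path) as f:
            raw = f.read()
    except (IOError, OSError) as e:
        return None, 'could not read %s: %s' % (path, e)
    if not raw.strip():
        return None, 'shard output file is empty: %s' % path
    try:
        return json.loads(raw), None
    except ValueError as e:  # json.JSONDecodeError subclasses ValueError
        return None, 'invalid JSON in %s: %s' % (path, e)
```

A caller would log the returned error string next to the shard index, so the step output shows which failure mode was hit.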
,
Oct 12 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/tools/build.git/+/f33e8e1237971cb0ca2725751eae9c133a1ff93c

commit f33e8e1237971cb0ca2725751eae9c133a1ff93c
Author: katthomas <katthomas@google.com>
Date: Wed Oct 12 15:55:36 2016

Log more when loading shard output.json fails

When we use swarming to run steps for build bots, and swarming fails to load the output.json for and shard, the whole step is marked as an infra failure. Sometimes, the shard completes successfully, but swarming can't load the output.json for whatever reason. We're not sure why this is happening, and hopefully this will help.

BUG= 653728
Review-Url: https://codereview.chromium.org/2413553002

[modify] https://crrev.com/f33e8e1237971cb0ca2725751eae9c133a1ff93c/scripts/slave/recipe_modules/swarming/resources/collect_gtest_task.py
,
Oct 12 2016
The following revision refers to this bug:
https://chromium.googlesource.com/chromium/src.git/+/ffeed3944daf472db8c9da8c1ab524dc5ef5398f

commit ffeed3944daf472db8c9da8c1ab524dc5ef5398f
Author: recipe-roller <recipe-roller@chromium.org>
Date: Wed Oct 12 16:10:27 2016

Roll recipe dependencies (trivial).

This is an automated CL created by the recipe roller. This CL rolls recipe changes from upstream projects (e.g. depot_tools) into downstream projects (e.g. tools/build).

More info is at https://goo.gl/zkKdpD. Use https://goo.gl/noib3a to file a bug (or complain)

build:
https://crrev.com/f33e8e1237971cb0ca2725751eae9c133a1ff93c Log more when loading shard output.json fails (katthomas@google.com)

TBR=martiniss@chromium.org,phajdan.jr@chromium.org
BUG= 653728
Recipe-Tryjob-Bypass-Reason: Autoroller
Bugdroid-Send-Email: False
Review-Url: https://codereview.chromium.org/2416593002
Cr-Commit-Position: refs/heads/master@{#424756}

[modify] https://crrev.com/ffeed3944daf472db8c9da8c1ab524dc5ef5398f/infra/config/recipes.cfg
,
Oct 12 2016
The following revision refers to this bug:
https://chromium.googlesource.com/infra/infra.git/+/8f704e1155ff8c95234e403e6f0507f4bea21894

commit 8f704e1155ff8c95234e403e6f0507f4bea21894
Author: recipe-roller <recipe-roller@chromium.org>
Date: Wed Oct 12 16:16:41 2016

Roll recipe dependencies (trivial).

This is an automated CL created by the recipe roller. This CL rolls recipe changes from upstream projects (e.g. depot_tools) into downstream projects (e.g. tools/build).

More info is at https://goo.gl/zkKdpD. Use https://goo.gl/noib3a to file a bug (or complain)

build:
https://crrev.com/f33e8e1237971cb0ca2725751eae9c133a1ff93c Log more when loading shard output.json fails (katthomas@google.com)

TBR=martiniss@chromium.org,phajdan.jr@chromium.org
BUG= 653728
Recipe-Tryjob-Bypass-Reason: Autoroller
Bugdroid-Send-Email: False
Review-Url: https://codereview.chromium.org/2411683004

[modify] https://crrev.com/8f704e1155ff8c95234e403e6f0507f4bea21894/infra/config/recipes.cfg
,
Oct 14 2016
The following revision refers to this bug: https://chrome-internal.googlesource.com/chrome/tools/build_limited/scripts/slave/+/e22f92fea14cffb851511b3bf1417c92f22336ad commit e22f92fea14cffb851511b3bf1417c92f22336ad Author: recipe-roller <recipe-roller@chromium.org> Date: Fri Oct 14 21:41:49 2016
,
Oct 26 2016
I'm going to close this for now. If we see another step that exceeds its timeout but is marked as a success, we'll have more information now.
Comment 1 by katthomas@chromium.org
, Oct 10 2016