Improve resiliency of swarming tasks recipe collection |
|||||||||||||||||||||
Issue descriptionMore often than not, the swarming recipe doesn't have enough information to know what the hell happened on a swarming task and has to report an infrastructure failure where it could report something more informative. This affects the CQ negatively, increases retries at build level, the negative effect of false interpretation as infrastructure failure is huge. The swarming recipe module could be much smarter in its interpretation of result code. It needs to differentiate between three classes of problems: Infrastructure failure: - Task expiration; the task didn't execute due to lack of infrastructure throughput. This is currently not well reported to the user. - A subset of the shards missing due to task expiration, same as above. This is reported as "shard missing", which is confusing the hell of users. - Swarming client reported an "internal_failure"; this is a true internal failure and should be reported as such. Execution failure: - Execution timeout; this needs to be reported explicitly as it is, so the user can understand what happened. This is not the case at the moment and it is easy to miss from the logs and swarming tasks. - test_launcher crashing mid task. A quick look of 3 builds from Katie's query seems to reveal anecdotally (need to confirm) that a fair number of these may be improved with issue 649831 . This needs to be reported accordingly. It could be reported as "no data" or as a real failure (preferred). This part is googletest specific. Test failure: - Normal failure flow. AI: - Artificially produce and investigate each of these failure case and improve the analysis and UI accordingly. Source: https://chromium.googlesource.com/chromium/tools/build/+/master/scripts/slave/recipe_modules/swarming/api.py
,
Sep 23 2016
wrong vadim
,
Sep 29 2016
I filed issue 645259 for maybe exactly the same problem...
,
Oct 3 2016
Issue 645259 is different; the issue here is about recipe_modules/swarming/api.py, issue 645259 is about chromite. Commented on the other about possible improvements.
,
Oct 10 2016
,
Oct 11 2016
,
Oct 11 2016
,
Oct 26 2016
,
Oct 27 2016
It reminds me that swarming.py collect must be changed to always return 0, unless there's a Swarming failure that needs to be considered an infra failure. The actual task exit code is in the json. This will permit better classification of error. https://github.com/luci/luci-py/blob/master/client/swarming.py#L1192 needs to accept a new flag --no-exit-code to tell the script to stop using the old behavior. this flag will be passed down to collect() https://github.com/luci/luci-py/blob/master/client/swarming.py#L844needs to if no_exit_code: return 0 if exit_code is not None else 1 return exit_code if exit_code is not None else 1 That should help a lot with the classification of failures.
,
Oct 31 2016
,
Nov 10 2016
,
Nov 16 2016
M-A, are you working on this? If so, any updates? If not, should we be assigning to someone else?
,
Nov 16 2016
Yes, someone else should take on this.
,
Nov 16 2016
@maruel, who would be a good person to take this?
,
Nov 16 2016
Dave/Micheal/Tim - is this anything you could help out with? It is one of our top issues with driving down infrastructure failure rates.
,
Nov 17 2016
jparent@ Could you add some data in this bug showing this affecting infrastructure failure rate? Links to cases where it has failed would also be useful. That info would let me understand the correct course of action to take here.
,
Nov 17 2016
katthomas@, can you provide that info to tansell?
,
Nov 18 2016
Yup! I'll get back here early next week with some examples.
,
Nov 19 2016
Here is an example we see frequently: crbug.com/666983 A step is misbehaving or hanging in some way such that the build hits its overall timeout and the step is killed by the master. In this case, the result is "Internal failure" and marked purple which isn't very helpful. One other thing we're doing to help in this specific case is encouraging people to add timeouts to steps in recipes. That way, if the recipe kills the step, it will be recorded as a purple/red failure as appropriate.
,
Nov 19 2016
Here is another example: https://luci-milo.appspot.com/buildbot/tryserver.chromium.android/linux_android_rel_ng/183384 Missing shard reported due to output.json missing. It looked like the task actually timed out tough.
,
Nov 22 2016
https://luci-milo.appspot.com/buildbot/tryserver.chromium.android/linux_android_rel_ng/183384 looks like a real infra issue. The Android device seems to have failed in some way. Looking at the logs linked in http://crbug.com/666983 -- The step which timed out don't appear to be running on isolate / swarming? Did you link the wrong things?
,
Nov 22 2016
While I agree with everything maruel@ is proposing, at the moment I'm not sure this is an actively occuring issue or how much of maruel@'s proposal is even unimplemented. I'm bumping this back to "unconfirmed" until I we can get enough information to decided on an action to take.
,
Nov 22 2016
Assigning to myself until I've gathered enough info for tansell.
,
Nov 22 2016
@maruel, correct me if I’m wrong, but I think the primary issue is that swarming could be providing more useful information than it is currently. Task expiration, e.g. https://luci-milo.appspot.com/buildbot/tryserver.chromium.android/linux_android_rel_ng/183251The task expires due to lack of capacity. We know this information (you can see it here: https://chromium-swarm.appspot.com/task?id=328d8c3f09c99b10&refresh=10&show_raw=1), but the result is that is reported is “Internal Failure.” Shard missing/execution timeout, e.g. https://luci-milo.appspot.com/buildbot/tryserver.chromium.android/linux_android_rel_ng/183384 Here we have a task that times out due to an infra failure. This is displayed as “missing shard,” which is not helpful. Like the example above, the result that is reported is “Internal Failure.” Reporting a timeout would be more helpful.
,
Nov 22 2016
$ ./swarming.py collect -S chromium-swarm.appspot.com 328d8c3f09c99b10 --task-summary-json foo.json N/A: 328d8c3f09c99b10 None $ cat foo.json { "shards": [ { "abandoned_ts": "2016-11-18T01:01:12.531950", "bot_dimensions": null, "bot_id": null, "bot_version": null, "children_task_ids": [], "completed_ts": null, "cost_saved_usd": null, "costs_usd": null, "created_ts": "2016-11-17T23:59:03.328790", "deduped_from": null, "durations": [], "exit_codes": [], "failure": false, "id": "328d8c3f09c99b10", "internal_failure": false, "isolated_out": null, "modified_ts": "2016-11-18T01:01:12.531950", "name": "cc_unittests (with patch)/Android/6fa2b7a9fd/linux_android_rel_ng/183251", "outputs": [], "outputs_ref": null, "properties_hash": null, "server_versions": null, "started_ts": null, "state": 48, "tags": [ (...) ], "try_number": null, "user": "benhenry@google.com" } ] } The internal failure is NOT true. "state" is 48, which means EXPIRED (0x30). https://github.com/luci/luci-py/blob/master/appengine/swarming/server/task_result.py#L94 So there's a problem at the recipe level or higher.
,
Nov 23 2016
@maruel, can you say more about the intention behind this issue? My understanding was that wanted to be more precise in how we're reporting failures from swarming. For 328d8c3f09c99b10 and others like it, expiration is clear from the swarming display, but not from the buildbot display. I agree this is "not well reported to the user" as it's hidden behind a "missing shard" link on the step (in the buildbot UI).
,
Nov 23 2016
,
Nov 23 2016
> For 328d8c3f09c99b10 and others like it, expiration is clear from the swarming display, but not from the buildbot display. What this issue is about is what the recipe generates as metadata, that eventually bubbles up to the buildbot display. Also the fact that Milo shows this as an infra failure (which not to be confused with swarming internal failure) makes sense. It is an infra failure. I think we need to add more text to explain the "why", e.g. there's not enough VMs/bots.
,
Nov 29 2016
@tansell, does this help?
,
Dec 3 2016
tansell@, ping?
,
Dec 5 2016
From Marc's last comment it seems that; a) These failures are being correctly detected as infra failures. b) The error message output reported to developers could be improved. If so, this sounds like a pretty low priority task? As a Chrome developer any infra failure is identical to me. A developer knowing that there is a capacity issue doesn't actually improve the infra reliability? I feel like I'm missing something here? jparent@ said "It is one of our top issues with driving down infrastructure failure rates."?
,
Dec 5 2016
As a Chrome developer, yes, it doesn't matter what type of infra failure it is, but for the purposes of eliminating infra failures, we need the information.
,
Dec 6 2016
Yes, this will be helpful for more efficiently diagnosing/triaging infra failures.
,
Dec 8 2016
,
Dec 8 2016
FWIW, this is confusing for perfbot sheriffs; we recently switched to swarming from buildbot, and almost all of our steps are red, when some should be purple.
,
Dec 9 2016
Tim, given that you are about to go on leave until the end of February, is there someone else who could own this? I don't want a P1 languishing.
,
Dec 11 2016
tansell is gone now. @maruel, who else might be a good choice?
,
Dec 12 2016
Sorry, I hadn't realized that Tim was going on extended leave. There's two parts: 1) Immediate improvement (what this bug is about). In particular _default_collect_step() and _gtest_collect_step() to print out a string specifically when a task result is one of State.EXPIRED, State.TIMED_OUT or State.BOT_DIED. 2) General code health issue. I discussed with Dirk about healing recipes_module/swarming. The code needs refactoring, along swarming.py itself and the way sharding is currently done. Michael is the only person currently working in this are but I don't want to put too much burden on him, at least not until layout tests are on swarming. --- Focusing on 1), I'll take the bug and will try to make something quick to improve the status quo: I'll just look for the explicit state and add an entry to step_result.presentation.links that is more explicit about the state instead of the generic "missing shard #".
,
Dec 13 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/tools/build.git/+/be59a952e1b9c80ec37b8b7d8f49b471859247ac commit be59a952e1b9c80ec37b8b7d8f49b471859247ac Author: Marc-Antoine Ruel <maruel@chromium.org> Date: Tue Dec 13 22:38:39 2016 Minor recipe_modules/swarming cleanup. Preparatory test-only change as I'm trying to figure out how to add more test cases for a follow up. R=vadimsh@chromium.org BUG= 649841 Change-Id: I7fb8cc19f5ae67132abf7d6a6cfe0da2b16be657 Reviewed-on: https://chromium-review.googlesource.com/419660 Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org> Reviewed-by: Vadim Shtayura <vadimsh@chromium.org> [rename] https://crrev.com/be59a952e1b9c80ec37b8b7d8f49b471859247ac/scripts/slave/recipe_modules/swarming/example.expected/basic.json [delete] https://crrev.com/935dd609d71e0e1de1effc9a51f301ff3936feef/scripts/slave/recipe_modules/swarming/example.expected/basic_0.8.6_cipd_packages.json [delete] https://crrev.com/935dd609d71e0e1de1effc9a51f301ff3936feef/scripts/slave/recipe_modules/swarming/example.expected/basic_0.8_show_isolated_out_in_collect_step.json [delete] https://crrev.com/935dd609d71e0e1de1effc9a51f301ff3936feef/scripts/slave/recipe_modules/swarming/example.expected/basic_0.8_show_shards_in_collect_step.json [delete] https://crrev.com/935dd609d71e0e1de1effc9a51f301ff3936feef/scripts/slave/recipe_modules/swarming/example.expected/basic_0.8_trybot.json [add] https://crrev.com/be59a952e1b9c80ec37b8b7d8f49b471859247ac/scripts/slave/recipe_modules/swarming/example.expected/show_isolated_out_in_collect_step.json [add] https://crrev.com/be59a952e1b9c80ec37b8b7d8f49b471859247ac/scripts/slave/recipe_modules/swarming/example.expected/show_shards_in_collect_step.json [add] https://crrev.com/be59a952e1b9c80ec37b8b7d8f49b471859247ac/scripts/slave/recipe_modules/swarming/example.expected/trybot.json [modify] https://crrev.com/be59a952e1b9c80ec37b8b7d8f49b471859247ac/scripts/slave/recipe_modules/swarming/example.py
,
Dec 14 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/tools/build.git/+/61c59449c14705e705ec399b42306f79d7e73efd commit 61c59449c14705e705ec399b42306f79d7e73efd Author: Marc-Antoine Ruel <maruel@chromium.org> Date: Wed Dec 14 03:31:55 2016 Improve swarming task handling of failure modes. Handles the following modes: - TIMED_OUT: a task ran but it ran for too long so it was forcibly terminated. This is a user failure. - EXPIRED: there was not enough capacity to run the task. This is an infrastructure failure; capacity does not handle the demand. - BOT_DIED: there was a Swarming internal failure; this is an infrastructure failure. Remove iOS specific recipe code to generalize failure handling. R=vadimsh@chromium.org BUG= 649841 Change-Id: Iac5dd293ec112a4b71a7b3bcd9871a7252b37d1a Reviewed-on: https://chromium-review.googlesource.com/419135 Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org> Reviewed-by: Vadim Shtayura <vadimsh@chromium.org> [modify] https://crrev.com/61c59449c14705e705ec399b42306f79d7e73efd/scripts/slave/recipe_modules/ios/api.py [delete] https://crrev.com/2bac5a784f0cf3c529456ed9e6e8a4292612a366/scripts/slave/recipe_modules/ios/example.expected/expired.json [modify] https://crrev.com/61c59449c14705e705ec399b42306f79d7e73efd/scripts/slave/recipe_modules/ios/example.expected/timed_out.json [modify] https://crrev.com/61c59449c14705e705ec399b42306f79d7e73efd/scripts/slave/recipe_modules/ios/example.py [modify] https://crrev.com/61c59449c14705e705ec399b42306f79d7e73efd/scripts/slave/recipe_modules/swarming/api.py [add] https://crrev.com/61c59449c14705e705ec399b42306f79d7e73efd/scripts/slave/recipe_modules/swarming/example.expected/isolated_script_expired.json [add] https://crrev.com/61c59449c14705e705ec399b42306f79d7e73efd/scripts/slave/recipe_modules/swarming/example.expected/isolated_script_timeout.json [add] https://crrev.com/61c59449c14705e705ec399b42306f79d7e73efd/scripts/slave/recipe_modules/swarming/example.expected/swarming_expired.json [add] https://crrev.com/61c59449c14705e705ec399b42306f79d7e73efd/scripts/slave/recipe_modules/swarming/example.expected/swarming_timeout.json [modify] https://crrev.com/61c59449c14705e705ec399b42306f79d7e73efd/scripts/slave/recipe_modules/swarming/example.py [modify] https://crrev.com/61c59449c14705e705ec399b42306f79d7e73efd/scripts/slave/recipe_modules/swarming/resources/collect_gtest_task.py [modify] https://crrev.com/61c59449c14705e705ec399b42306f79d7e73efd/scripts/slave/recipe_modules/swarming/tests/collect_gtest_task_test.py
,
Dec 14 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/tools/build.git/+/8758e4ed83393b9b874d8c64d0f37edd6d14444e commit 8758e4ed83393b9b874d8c64d0f37edd6d14444e Author: Marc-Antoine Ruel <maruel@chromium.org> Date: Wed Dec 14 15:35:01 2016 Revert "Improve swarming task handling of failure modes." This reverts commit 61c59449c14705e705ec399b42306f79d7e73efd. Reason for revert: Blew up on the try server. BUG= 674147 Original change's description: > Improve swarming task handling of failure modes. > > Handles the following modes: > - TIMED_OUT: a task ran but it ran for too long so it was forcibly terminated. > This is a user failure. > - EXPIRED: there was not enough capacity to run the task. This is an > infrastructure failure; capacity does not handle the demand. > - BOT_DIED: there was a Swarming internal failure; this is an infrastructure > failure. > > Remove iOS specific recipe code to generalize failure handling. > > R=vadimsh@chromium.org > BUG= 649841 > > Change-Id: Iac5dd293ec112a4b71a7b3bcd9871a7252b37d1a > Reviewed-on: https://chromium-review.googlesource.com/419135 > Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org> > Reviewed-by: Vadim Shtayura <vadimsh@chromium.org> > TBR=maruel@chromium.org,vadimsh@chromium.org,smut@chromium.org,martiniss@chromium.org,nednguyen@google.com,chromium-reviews@chromium.org,eyaich@chromium.org,katthomas@chromium.org,mcgreevy@chromium.org BUG= 649841 NOPRESUBMIT=true NOTREECHECKS=true NOTRY=true Change-Id: I9de3f60758e1f01254b2c6a535fef650e4b8781f Reviewed-on: https://chromium-review.googlesource.com/419701 Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org> Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org> [modify] https://crrev.com/8758e4ed83393b9b874d8c64d0f37edd6d14444e/scripts/slave/recipe_modules/ios/api.py [add] https://crrev.com/8758e4ed83393b9b874d8c64d0f37edd6d14444e/scripts/slave/recipe_modules/ios/example.expected/expired.json [modify] https://crrev.com/8758e4ed83393b9b874d8c64d0f37edd6d14444e/scripts/slave/recipe_modules/ios/example.expected/timed_out.json [modify] https://crrev.com/8758e4ed83393b9b874d8c64d0f37edd6d14444e/scripts/slave/recipe_modules/ios/example.py [modify] https://crrev.com/8758e4ed83393b9b874d8c64d0f37edd6d14444e/scripts/slave/recipe_modules/swarming/api.py [delete] https://crrev.com/2c2055ef8646cd7ca972d18a659f2e8ad1fc40d3/scripts/slave/recipe_modules/swarming/example.expected/isolated_script_expired.json [delete] https://crrev.com/2c2055ef8646cd7ca972d18a659f2e8ad1fc40d3/scripts/slave/recipe_modules/swarming/example.expected/isolated_script_timeout.json [delete] https://crrev.com/2c2055ef8646cd7ca972d18a659f2e8ad1fc40d3/scripts/slave/recipe_modules/swarming/example.expected/swarming_expired.json [delete] https://crrev.com/2c2055ef8646cd7ca972d18a659f2e8ad1fc40d3/scripts/slave/recipe_modules/swarming/example.expected/swarming_timeout.json [modify] https://crrev.com/8758e4ed83393b9b874d8c64d0f37edd6d14444e/scripts/slave/recipe_modules/swarming/example.py [modify] https://crrev.com/8758e4ed83393b9b874d8c64d0f37edd6d14444e/scripts/slave/recipe_modules/swarming/resources/collect_gtest_task.py [modify] https://crrev.com/8758e4ed83393b9b874d8c64d0f37edd6d14444e/scripts/slave/recipe_modules/swarming/tests/collect_gtest_task_test.py
,
Dec 14 2016
I'm definitely hoping to help out with the "general code health" parts of M-A's comment #38. There's a bunch of surgery that needs to be done both on the swarming module itself and on its integration w/ the chromium module and the way steps work in the chromium recipe ( bug 661682 ).
,
Dec 15 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/tools/build.git/+/ce619064a09f4e6d9b80f347c378952b57ab95d9 commit ce619064a09f4e6d9b80f347c378952b57ab95d9 Author: Marc-Antoine Ruel <maruel@chromium.org> Date: Thu Dec 15 14:51:17 2016 Improve swarming task handling of failure modes. (bis) Handles the following modes: - TIMED_OUT: a task ran but it ran for too long so it was forcibly terminated. This is a user failure. - EXPIRED: there was not enough capacity to run the task. This is an infrastructure failure; capacity does not handle the demand. - BOT_DIED: there was a Swarming internal failure; this is an infrastructure failure. Remove iOS specific recipe code to generalize failure handling. BUG= 649841 Change-Id: Ibabb96ffbc2bbcba0f062b0f125e737673e3b6d4 Reviewed-on: https://chromium-review.googlesource.com/419822 Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org> Reviewed-by: smut <smut@chromium.org> Reviewed-by: Vadim Shtayura <vadimsh@chromium.org> Reviewed-by: Katie Thomas <katthomas@google.com> [modify] https://crrev.com/ce619064a09f4e6d9b80f347c378952b57ab95d9/scripts/slave/recipe_modules/ios/example.expected/timed_out.json [modify] https://crrev.com/ce619064a09f4e6d9b80f347c378952b57ab95d9/scripts/slave/recipe_modules/swarming/api.py [add] https://crrev.com/ce619064a09f4e6d9b80f347c378952b57ab95d9/scripts/slave/recipe_modules/swarming/example.expected/isolated_script_expired.json [add] https://crrev.com/ce619064a09f4e6d9b80f347c378952b57ab95d9/scripts/slave/recipe_modules/swarming/example.expected/isolated_script_timeout.json [add] https://crrev.com/ce619064a09f4e6d9b80f347c378952b57ab95d9/scripts/slave/recipe_modules/swarming/example.expected/swarming_expired.json [add] https://crrev.com/ce619064a09f4e6d9b80f347c378952b57ab95d9/scripts/slave/recipe_modules/swarming/example.expected/swarming_timeout.json [modify] https://crrev.com/ce619064a09f4e6d9b80f347c378952b57ab95d9/scripts/slave/recipe_modules/swarming/example.py [modify] https://crrev.com/ce619064a09f4e6d9b80f347c378952b57ab95d9/scripts/slave/recipe_modules/swarming/resources/collect_gtest_task.py [modify] https://crrev.com/ce619064a09f4e6d9b80f347c378952b57ab95d9/scripts/slave/recipe_modules/swarming/tests/collect_gtest_task_test.py [modify] https://crrev.com/ce619064a09f4e6d9b80f347c378952b57ab95d9/scripts/slave/recipes/chromium.expected/dynamic_swarmed_passed_isolated_script_test_with_swarming_failure.json
,
Dec 21 2016
This looks like the CL landed? I don't think it successfully worked though. Looking at https://luci-milo.appspot.com/buildbot/chromium.perf/Win%208%20Perf/125, there are expired tasks in the build which are marked as red. https://chromium-swarm.appspot.com/task?id=333408df5e964b10&refresh=10&show_raw=1 is an example. I'll dig in a bit to see if there's an obvious bug.
,
Dec 21 2016
I didn't close the bug because I didn't trust my change and wanted to do follow up investigation, but I hadn't taken the time yet. https://luci-milo.appspot.com/buildbot/chromium.perf/Win%208%20Perf/125 doesn't load for me. :/
,
Jan 19 2017
@maruel - any update on this?
,
Jan 19 2017
I did some work on this in https://chromium.googlesource.com/chromium/tools/build/+/26b0ef9d726c07e8b8c81d16e0f5bd70cf6ebace. I believe expired tasks are still not being marked as purple. I'll investigate this today.
,
Jan 25 2017
,
Feb 14 2017
This bug might be able to be closed. More of the perf waterfall has shown up as purple lately.
,
Mar 8 2017
|
|||||||||||||||||||||
►
Sign in to add a comment |
|||||||||||||||||||||
Comment 1 by iannucci@chromium.org
, Sep 23 2016