New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 649841 link

Starred by 5 users

Issue metadata

Status: Fixed
Owner:
Closed: Mar 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: All
Pri: 1
Type: Bug

Blocked on:
issue 649831

Blocking:
issue 649391



Sign in to add a comment

Improve resiliency of swarming tasks recipe collection

Project Member Reported by mar...@chromium.org, Sep 23 2016

Issue description

More often than not, the swarming recipe doesn't have enough information to know what the hell happened on a swarming task and has to report an infrastructure failure where it could report something more informative. This affects the CQ negatively, increases retries at build level, the negative effect of false interpretation as infrastructure failure is huge.

The swarming recipe module could be much smarter in its interpretation of result code. It needs to differentiate between three classes of problems:

Infrastructure failure:
- Task expiration; the task didn't execute due to lack of infrastructure throughput. This is currently not well reported to the user.
- A subset of the shards missing due to task expiration, same as above. This is reported as "shard missing", which is confusing the hell of users.
- Swarming client reported an "internal_failure"; this is a true internal failure and should be reported as such.

Execution failure:
- Execution timeout; this needs to be reported explicitly as it is, so the user can understand what happened. This is not the case at the moment and it is easy to miss from the logs and swarming tasks.
- test_launcher crashing mid task. A quick look of 3 builds from Katie's query seems to reveal anecdotally (need to confirm) that a fair number of these may be improved with  issue 649831 . This needs to be reported accordingly. It could be reported as "no data" or as a real failure (preferred). This part is googletest specific.

Test failure:
- Normal failure flow.


AI:
- Artificially produce and investigate each of these failure case and improve the analysis and UI accordingly.


Source: https://chromium.googlesource.com/chromium/tools/build/+/master/scripts/slave/recipe_modules/swarming/api.py

 
Cc: vadimb@chromium.org
+vadim
Cc: -vadimb@chromium.org vadimsh@chromium.org
wrong vadim
I filed issue 645259 for maybe exactly the same problem...
Issue 645259 is different; the issue here is about recipe_modules/swarming/api.py, issue 645259 is about chromite. Commented on the other about possible improvements.
Labels: Restrict-View-Google

Comment 6 by mar...@chromium.org, Oct 11 2016

Labels: -Restrict-View-Google
Blocking: 649391
Owner: katthomas@chromium.org

Comment 9 by mar...@chromium.org, Oct 27 2016

It reminds me that swarming.py collect must be changed to always return 0, unless there's a Swarming failure that needs to be considered an infra failure. The actual task exit code is in the json. This will permit better classification of error.

https://github.com/luci/luci-py/blob/master/client/swarming.py#L1192
needs to accept a new flag --no-exit-code to tell the script to stop using the old behavior.

this flag will be passed down to collect()
https://github.com/luci/luci-py/blob/master/client/swarming.py#L844needs
to
if no_exit_code:
  return 0 if exit_code is not None else 1
return exit_code if exit_code is not None else 1

That should help a lot with the classification of failures.
Owner: ----
Owner: mar...@chromium.org
Status: Assigned (was: Available)
M-A, are you working on this? If so, any updates?  If not, should we be assigning to someone else?

Comment 13 by maruel@google.com, Nov 16 2016

Cc: mar...@chromium.org
Owner: ----
Status: Available (was: Assigned)
Yes, someone else should take on this.
@maruel, who would be a good person to take this?

Comment 15 by jpar...@google.com, Nov 16 2016

Cc: mcgreevy@chromium.org djd@chromium.org tansell@chromium.org
Dave/Micheal/Tim - is this anything you could help out with?  It is one of our top issues with driving down infrastructure failure rates.
jparent@ Could you add some data in this bug showing this affecting infrastructure failure rate? Links to cases where it has failed would also be useful.

That info would let me understand the correct course of action to take here.

Comment 17 by jpar...@google.com, Nov 17 2016

katthomas@, can you provide that info to tansell?
Yup!

I'll get back here early next week with some examples. 
Owner: tansell@chromium.org
Here is an example we see frequently:

crbug.com/666983

A step is misbehaving or hanging in some way such that the build hits its overall timeout and the step is killed by the master. 

In this case, the result is "Internal failure" and marked purple which isn't very helpful. One other thing we're doing to help in this specific case is encouraging people to add timeouts to steps in recipes. That way, if the recipe kills the step, it will be recorded as a purple/red failure as appropriate. 
Here is another example:

https://luci-milo.appspot.com/buildbot/tryserver.chromium.android/linux_android_rel_ng/183384

Missing shard reported due to output.json missing. It looked like the task actually timed out tough. 
https://luci-milo.appspot.com/buildbot/tryserver.chromium.android/linux_android_rel_ng/183384
 looks like a real infra issue. The Android device seems to have failed in some way.

Looking at the logs linked in http://crbug.com/666983 -- The step which timed out don't appear to be running on isolate / swarming?

Did you link the wrong things?
Status: Unconfirmed (was: Available)
While I agree with everything maruel@ is proposing, at the moment I'm not sure this is an actively occuring issue or how much of maruel@'s proposal is even unimplemented.

I'm bumping this back to "unconfirmed" until I we can get enough information to decided on an action to take.
Owner: katthomas@chromium.org
Assigning to myself until I've gathered enough info for tansell.
@maruel, correct me if I’m wrong, but I think the primary issue is that swarming could be providing more useful information than it is currently. 

Task expiration, e.g. https://luci-milo.appspot.com/buildbot/tryserver.chromium.android/linux_android_rel_ng/183251The task expires due to lack of capacity. We know this information (you can see it here: https://chromium-swarm.appspot.com/task?id=328d8c3f09c99b10&refresh=10&show_raw=1), but the result is that is reported is “Internal Failure.”

Shard missing/execution timeout, e.g.
https://luci-milo.appspot.com/buildbot/tryserver.chromium.android/linux_android_rel_ng/183384
Here we have a task that times out due to an infra failure. This is displayed as “missing shard,” which is not helpful. Like the example above, the result that is reported is “Internal Failure.” Reporting a timeout would be more helpful. 
$ ./swarming.py collect -S chromium-swarm.appspot.com 328d8c3f09c99b10 --task-summary-json foo.json
N/A: 328d8c3f09c99b10 None
$ cat foo.json
{
  "shards": [
    {
      "abandoned_ts": "2016-11-18T01:01:12.531950", 
      "bot_dimensions": null, 
      "bot_id": null, 
      "bot_version": null, 
      "children_task_ids": [], 
      "completed_ts": null, 
      "cost_saved_usd": null, 
      "costs_usd": null, 
      "created_ts": "2016-11-17T23:59:03.328790", 
      "deduped_from": null, 
      "durations": [], 
      "exit_codes": [], 
      "failure": false, 
      "id": "328d8c3f09c99b10", 
      "internal_failure": false, 
      "isolated_out": null, 
      "modified_ts": "2016-11-18T01:01:12.531950", 
      "name": "cc_unittests (with patch)/Android/6fa2b7a9fd/linux_android_rel_ng/183251", 
      "outputs": [], 
      "outputs_ref": null, 
      "properties_hash": null, 
      "server_versions": null, 
      "started_ts": null, 
      "state": 48, 
      "tags": [
(...)
      ], 
      "try_number": null, 
      "user": "benhenry@google.com"
    }
  ]
}


The internal failure is NOT true. "state" is 48, which means EXPIRED (0x30).
https://github.com/luci/luci-py/blob/master/appengine/swarming/server/task_result.py#L94

So there's a problem at the recipe level or higher.
@maruel, can you say more about the intention behind this issue? My understanding was that wanted to be more precise in how we're reporting failures from swarming.

For 328d8c3f09c99b10 and others like it, expiration is clear from the swarming display, but not from the buildbot display. I agree this is "not well reported to the user" as it's hidden behind a "missing shard" link on the step (in the buildbot UI).

 
Status: Assigned (was: Unconfirmed)
> For 328d8c3f09c99b10 and others like it, expiration is clear from the swarming display, but not from the buildbot display.

What this issue is about is what the recipe generates as metadata, that eventually bubbles up to the buildbot display.

Also the fact that Milo shows this as an infra failure (which not to be confused with swarming internal failure) makes sense. It is an infra failure. I think we need to add more text to explain the "why", e.g. there's not enough VMs/bots.
Owner: tansell@chromium.org
@tansell, does this help?
tansell@, ping?
From Marc's last comment it seems that;

 a) These failures are being correctly detected as infra failures.
 b) The error message output reported to developers could be improved.

If so, this sounds like a pretty low priority task? As a Chrome developer any infra failure is identical to me. A developer knowing that there is a capacity issue doesn't actually improve the infra reliability?

I feel like I'm missing something here? jparent@ said "It is one of our top issues with driving down infrastructure failure rates."?
As a Chrome developer, yes, it doesn't matter what type of infra failure it is, but for the purposes of eliminating infra failures, we need the information.
Yes, this will be helpful for more efficiently diagnosing/triaging infra failures.
Labels: Infra-Failures
Cc: sullivan@chromium.org
FWIW, this is confusing for perfbot sheriffs; we recently switched to swarming from buildbot, and almost all of our steps are red, when some should be purple.
Tim, given that you are about to go on leave until the end of February, is there someone else who could own this?

I don't want a P1 languishing.
tansell is gone now. @maruel, who else might be a good choice?
Cc: dpranke@chromium.org
Owner: mar...@chromium.org
Sorry, I hadn't realized that Tim was going on extended leave.

There's two parts:
1) Immediate improvement (what this bug is about). In particular _default_collect_step() and _gtest_collect_step() to print out a string specifically when a task result is one of State.EXPIRED, State.TIMED_OUT or State.BOT_DIED.
 
2) General code health issue. I discussed with Dirk about healing recipes_module/swarming. The code needs refactoring, along swarming.py itself and the way sharding is currently done. Michael is the only person currently working in this are but I don't want to put too much burden on him, at least not until layout tests are on swarming.

---

Focusing on 1), I'll take the bug and will try to make something quick to improve the status quo: I'll just look for the explicit state and add an entry to step_result.presentation.links that is more explicit about the state instead of the generic "missing shard #".
Project Member

Comment 39 by bugdroid1@chromium.org, Dec 13 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build.git/+/be59a952e1b9c80ec37b8b7d8f49b471859247ac

commit be59a952e1b9c80ec37b8b7d8f49b471859247ac
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Tue Dec 13 22:38:39 2016

Minor recipe_modules/swarming cleanup.

Preparatory test-only change as I'm trying to figure out how to add more test
cases for a follow up.

R=vadimsh@chromium.org
BUG= 649841 

Change-Id: I7fb8cc19f5ae67132abf7d6a6cfe0da2b16be657
Reviewed-on: https://chromium-review.googlesource.com/419660
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Vadim Shtayura <vadimsh@chromium.org>

[rename] https://crrev.com/be59a952e1b9c80ec37b8b7d8f49b471859247ac/scripts/slave/recipe_modules/swarming/example.expected/basic.json
[delete] https://crrev.com/935dd609d71e0e1de1effc9a51f301ff3936feef/scripts/slave/recipe_modules/swarming/example.expected/basic_0.8.6_cipd_packages.json
[delete] https://crrev.com/935dd609d71e0e1de1effc9a51f301ff3936feef/scripts/slave/recipe_modules/swarming/example.expected/basic_0.8_show_isolated_out_in_collect_step.json
[delete] https://crrev.com/935dd609d71e0e1de1effc9a51f301ff3936feef/scripts/slave/recipe_modules/swarming/example.expected/basic_0.8_show_shards_in_collect_step.json
[delete] https://crrev.com/935dd609d71e0e1de1effc9a51f301ff3936feef/scripts/slave/recipe_modules/swarming/example.expected/basic_0.8_trybot.json
[add] https://crrev.com/be59a952e1b9c80ec37b8b7d8f49b471859247ac/scripts/slave/recipe_modules/swarming/example.expected/show_isolated_out_in_collect_step.json
[add] https://crrev.com/be59a952e1b9c80ec37b8b7d8f49b471859247ac/scripts/slave/recipe_modules/swarming/example.expected/show_shards_in_collect_step.json
[add] https://crrev.com/be59a952e1b9c80ec37b8b7d8f49b471859247ac/scripts/slave/recipe_modules/swarming/example.expected/trybot.json
[modify] https://crrev.com/be59a952e1b9c80ec37b8b7d8f49b471859247ac/scripts/slave/recipe_modules/swarming/example.py

Project Member

Comment 40 by bugdroid1@chromium.org, Dec 14 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build.git/+/61c59449c14705e705ec399b42306f79d7e73efd

commit 61c59449c14705e705ec399b42306f79d7e73efd
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Wed Dec 14 03:31:55 2016

Improve swarming task handling of failure modes.

Handles the following modes:
- TIMED_OUT: a task ran but it ran for too long so it was forcibly terminated.
  This is a user failure.
- EXPIRED: there was not enough capacity to run the task. This is an
  infrastructure failure; capacity does not handle the demand.
- BOT_DIED: there was a Swarming internal failure; this is an infrastructure
  failure.

Remove iOS specific recipe code to generalize failure handling.

R=vadimsh@chromium.org
BUG= 649841 

Change-Id: Iac5dd293ec112a4b71a7b3bcd9871a7252b37d1a
Reviewed-on: https://chromium-review.googlesource.com/419135
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Vadim Shtayura <vadimsh@chromium.org>

[modify] https://crrev.com/61c59449c14705e705ec399b42306f79d7e73efd/scripts/slave/recipe_modules/ios/api.py
[delete] https://crrev.com/2bac5a784f0cf3c529456ed9e6e8a4292612a366/scripts/slave/recipe_modules/ios/example.expected/expired.json
[modify] https://crrev.com/61c59449c14705e705ec399b42306f79d7e73efd/scripts/slave/recipe_modules/ios/example.expected/timed_out.json
[modify] https://crrev.com/61c59449c14705e705ec399b42306f79d7e73efd/scripts/slave/recipe_modules/ios/example.py
[modify] https://crrev.com/61c59449c14705e705ec399b42306f79d7e73efd/scripts/slave/recipe_modules/swarming/api.py
[add] https://crrev.com/61c59449c14705e705ec399b42306f79d7e73efd/scripts/slave/recipe_modules/swarming/example.expected/isolated_script_expired.json
[add] https://crrev.com/61c59449c14705e705ec399b42306f79d7e73efd/scripts/slave/recipe_modules/swarming/example.expected/isolated_script_timeout.json
[add] https://crrev.com/61c59449c14705e705ec399b42306f79d7e73efd/scripts/slave/recipe_modules/swarming/example.expected/swarming_expired.json
[add] https://crrev.com/61c59449c14705e705ec399b42306f79d7e73efd/scripts/slave/recipe_modules/swarming/example.expected/swarming_timeout.json
[modify] https://crrev.com/61c59449c14705e705ec399b42306f79d7e73efd/scripts/slave/recipe_modules/swarming/example.py
[modify] https://crrev.com/61c59449c14705e705ec399b42306f79d7e73efd/scripts/slave/recipe_modules/swarming/resources/collect_gtest_task.py
[modify] https://crrev.com/61c59449c14705e705ec399b42306f79d7e73efd/scripts/slave/recipe_modules/swarming/tests/collect_gtest_task_test.py

Project Member

Comment 41 by bugdroid1@chromium.org, Dec 14 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build.git/+/8758e4ed83393b9b874d8c64d0f37edd6d14444e

commit 8758e4ed83393b9b874d8c64d0f37edd6d14444e
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Wed Dec 14 15:35:01 2016

Revert "Improve swarming task handling of failure modes."

This reverts commit 61c59449c14705e705ec399b42306f79d7e73efd.

Reason for revert: Blew up on the try server.

BUG= 674147 

Original change's description:
> Improve swarming task handling of failure modes.
> 
> Handles the following modes:
> - TIMED_OUT: a task ran but it ran for too long so it was forcibly terminated.
>   This is a user failure.
> - EXPIRED: there was not enough capacity to run the task. This is an
>   infrastructure failure; capacity does not handle the demand.
> - BOT_DIED: there was a Swarming internal failure; this is an infrastructure
>   failure.
> 
> Remove iOS specific recipe code to generalize failure handling.
> 
> R=vadimsh@chromium.org
> BUG= 649841 
> 
> Change-Id: Iac5dd293ec112a4b71a7b3bcd9871a7252b37d1a
> Reviewed-on: https://chromium-review.googlesource.com/419135
> Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
> Reviewed-by: Vadim Shtayura <vadimsh@chromium.org>
> 

TBR=maruel@chromium.org,vadimsh@chromium.org,smut@chromium.org,martiniss@chromium.org,nednguyen@google.com,chromium-reviews@chromium.org,eyaich@chromium.org,katthomas@chromium.org,mcgreevy@chromium.org
BUG= 649841 
NOPRESUBMIT=true
NOTREECHECKS=true
NOTRY=true

Change-Id: I9de3f60758e1f01254b2c6a535fef650e4b8781f
Reviewed-on: https://chromium-review.googlesource.com/419701
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/8758e4ed83393b9b874d8c64d0f37edd6d14444e/scripts/slave/recipe_modules/ios/api.py
[add] https://crrev.com/8758e4ed83393b9b874d8c64d0f37edd6d14444e/scripts/slave/recipe_modules/ios/example.expected/expired.json
[modify] https://crrev.com/8758e4ed83393b9b874d8c64d0f37edd6d14444e/scripts/slave/recipe_modules/ios/example.expected/timed_out.json
[modify] https://crrev.com/8758e4ed83393b9b874d8c64d0f37edd6d14444e/scripts/slave/recipe_modules/ios/example.py
[modify] https://crrev.com/8758e4ed83393b9b874d8c64d0f37edd6d14444e/scripts/slave/recipe_modules/swarming/api.py
[delete] https://crrev.com/2c2055ef8646cd7ca972d18a659f2e8ad1fc40d3/scripts/slave/recipe_modules/swarming/example.expected/isolated_script_expired.json
[delete] https://crrev.com/2c2055ef8646cd7ca972d18a659f2e8ad1fc40d3/scripts/slave/recipe_modules/swarming/example.expected/isolated_script_timeout.json
[delete] https://crrev.com/2c2055ef8646cd7ca972d18a659f2e8ad1fc40d3/scripts/slave/recipe_modules/swarming/example.expected/swarming_expired.json
[delete] https://crrev.com/2c2055ef8646cd7ca972d18a659f2e8ad1fc40d3/scripts/slave/recipe_modules/swarming/example.expected/swarming_timeout.json
[modify] https://crrev.com/8758e4ed83393b9b874d8c64d0f37edd6d14444e/scripts/slave/recipe_modules/swarming/example.py
[modify] https://crrev.com/8758e4ed83393b9b874d8c64d0f37edd6d14444e/scripts/slave/recipe_modules/swarming/resources/collect_gtest_task.py
[modify] https://crrev.com/8758e4ed83393b9b874d8c64d0f37edd6d14444e/scripts/slave/recipe_modules/swarming/tests/collect_gtest_task_test.py

I'm definitely hoping to help out with the "general code health" parts of M-A's comment  #38. There's a bunch of surgery that needs to be done both on the swarming module itself and on its integration w/ the chromium module and the way steps work in the chromium recipe ( bug 661682 ).
Project Member

Comment 43 by bugdroid1@chromium.org, Dec 15 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/tools/build.git/+/ce619064a09f4e6d9b80f347c378952b57ab95d9

commit ce619064a09f4e6d9b80f347c378952b57ab95d9
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Thu Dec 15 14:51:17 2016

Improve swarming task handling of failure modes. (bis)

Handles the following modes:
- TIMED_OUT: a task ran but it ran for too long so it was forcibly terminated.
  This is a user failure.
- EXPIRED: there was not enough capacity to run the task. This is an
  infrastructure failure; capacity does not handle the demand.
- BOT_DIED: there was a Swarming internal failure; this is an infrastructure
  failure.

Remove iOS specific recipe code to generalize failure handling.

BUG= 649841 

Change-Id: Ibabb96ffbc2bbcba0f062b0f125e737673e3b6d4
Reviewed-on: https://chromium-review.googlesource.com/419822
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: smut <smut@chromium.org>
Reviewed-by: Vadim Shtayura <vadimsh@chromium.org>
Reviewed-by: Katie Thomas <katthomas@google.com>

[modify] https://crrev.com/ce619064a09f4e6d9b80f347c378952b57ab95d9/scripts/slave/recipe_modules/ios/example.expected/timed_out.json
[modify] https://crrev.com/ce619064a09f4e6d9b80f347c378952b57ab95d9/scripts/slave/recipe_modules/swarming/api.py
[add] https://crrev.com/ce619064a09f4e6d9b80f347c378952b57ab95d9/scripts/slave/recipe_modules/swarming/example.expected/isolated_script_expired.json
[add] https://crrev.com/ce619064a09f4e6d9b80f347c378952b57ab95d9/scripts/slave/recipe_modules/swarming/example.expected/isolated_script_timeout.json
[add] https://crrev.com/ce619064a09f4e6d9b80f347c378952b57ab95d9/scripts/slave/recipe_modules/swarming/example.expected/swarming_expired.json
[add] https://crrev.com/ce619064a09f4e6d9b80f347c378952b57ab95d9/scripts/slave/recipe_modules/swarming/example.expected/swarming_timeout.json
[modify] https://crrev.com/ce619064a09f4e6d9b80f347c378952b57ab95d9/scripts/slave/recipe_modules/swarming/example.py
[modify] https://crrev.com/ce619064a09f4e6d9b80f347c378952b57ab95d9/scripts/slave/recipe_modules/swarming/resources/collect_gtest_task.py
[modify] https://crrev.com/ce619064a09f4e6d9b80f347c378952b57ab95d9/scripts/slave/recipe_modules/swarming/tests/collect_gtest_task_test.py
[modify] https://crrev.com/ce619064a09f4e6d9b80f347c378952b57ab95d9/scripts/slave/recipes/chromium.expected/dynamic_swarmed_passed_isolated_script_test_with_swarming_failure.json

This looks like the CL landed? 

I don't think it successfully worked though. Looking at https://luci-milo.appspot.com/buildbot/chromium.perf/Win%208%20Perf/125, there are expired tasks in the build which are marked as red. https://chromium-swarm.appspot.com/task?id=333408df5e964b10&refresh=10&show_raw=1 is an example.

I'll dig in a bit to see if there's an obvious bug.
I didn't close the bug because I didn't trust my change and wanted to do follow up investigation, but I hadn't taken the time yet.

https://luci-milo.appspot.com/buildbot/chromium.perf/Win%208%20Perf/125 doesn't load for me. :/
@maruel - any update on this?
Owner: martiniss@chromium.org
I did some work on this in https://chromium.googlesource.com/chromium/tools/build/+/26b0ef9d726c07e8b8c81d16e0f5bd70cf6ebace. 

I believe expired tasks are still not being marked as purple. I'll investigate this today.
Labels: Hotlist-Infra-Failures
This bug might be able to be closed. More of the perf waterfall has shown up as purple lately.
Status: Fixed (was: Assigned)

Sign in to add a comment