Issue 916553

Starred by 2 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug

Blocking:
issue 916548




Swarming: disambiguate the various task failures in BOT_DIED

Project Member Reported by mar...@chromium.org, Dec 19

Issue description

BOT_DIED conflates too many failure modes. Split it into separate states, since the failures fall into different categories:

- SKIPPED_INTERNAL_FAILURE: failed to hand the task to a bot, so the task never ran. That's a NEVER_RAN_DONE_MASK.

- MISSING_INPUTS: setup failed because the inputs couldn't be found (404). That's an EXECUTION_DONE_MASK.

- RUN_INTERNAL_FAILURE: ran, but the bot failed while running. That's a TRANSIENT_DONE_MASK.

- BOT_DISAPPEARED: ran, but the bot went MISSING while running (for example, a BSOD). That's a TRANSIENT_DONE_MASK.
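
The grouping above could be sketched as follows. This is only an illustration of the idea of encoding a category mask into each state value: the numeric values, the `Category` and `TaskState` class names, and the `category_of` and `ever_ran` helpers are all hypothetical and are not taken from swarming.proto.

```python
import enum

# Hypothetical layout: the high nibble of a state value holds its category.
CATEGORY_BITS = 0xF0


class Category(enum.IntEnum):
    # Illustrative values only; the real proto may number these differently.
    TRANSIENT_DONE_MASK = 0x30
    EXECUTION_DONE_MASK = 0x40
    NEVER_RAN_DONE_MASK = 0x50


class TaskState(enum.IntEnum):
    # Each state is its category mask plus a small, made-up discriminator.
    RUN_INTERNAL_FAILURE = Category.TRANSIENT_DONE_MASK | 0x1
    BOT_DISAPPEARED = Category.TRANSIENT_DONE_MASK | 0x2
    MISSING_INPUTS = Category.EXECUTION_DONE_MASK | 0x1
    SKIPPED_INTERNAL_FAILURE = Category.NEVER_RAN_DONE_MASK | 0x1


def category_of(state):
    """Returns the Category encoded in the high nibble of a TaskState."""
    return Category(state & CATEGORY_BITS)


def ever_ran(state):
    """Returns False only for states where no execution was ever started."""
    return category_of(state) is not Category.NEVER_RAN_DONE_MASK
```

With a layout like this, a caller that only cares whether a task ever started can test the category instead of enumerating every individual failure state.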
 
Project Member

Comment 1 by bugdroid1@chromium.org, Dec 19

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/132c74e3ce003b29221bf3e5fc38af8c4f0bc049

commit 132c74e3ce003b29221bf3e5fc38af8c4f0bc049
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Wed Dec 19 22:33:36 2018

[swarming] Further proto fine tuning

Lots of small things I realized after the fact. Doing these before piling more
data (TaskResult).

- Assign a bug to every TaskState not implemented yet and comment this.
- Assign a bug to every BotEventType not implemented yet and comment this.
- Add TaskState MISSING_INPUTS and SKIPPED_INTERNAL_FAILURE for further
  disambiguation of failure modes.
- Rename INTERNAL_FAILURE to RAN_INTERNAL_FAILURE to disambiguate with the 2 new
  related failures.
- Fix CASTree.digest back to an hex encoded string. It's 2x larger, but
  much simpler to work with.
- Rename BotInfo.raw to supplemental. The end goal is to have as much
  structured data as possible, but still allow customer-specified
  additional data. Renaming 'raw' to 'supplemental' makes this clearer.
- Create PhysicalEntity, which will enable the clear separation between
  host characteristics and device characteristics.
- Declare TaskProperties.containment, futureproofing the need for task process
  containment.

This change is a breaking change (proto message renumbering); TaskRequest is not
used anywhere yet.

Bug: 757931
Bug: 808836
Bug: 870723
Bug: 905087
Bug: 913978
Bug: 916553
Bug: 916556
Bug: 916557
Bug: 916559
Bug: 916560
Bug: 916562
Bug: 916570
Bug: 916578
Change-Id: Ic1a57d15d028802ad5cf8c6a2f13da15fac662c4
Reviewed-on: https://chromium-review.googlesource.com/c/1384425
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Quinten Yearsley <qyearsley@chromium.org>

[modify] https://crrev.com/132c74e3ce003b29221bf3e5fc38af8c4f0bc049/appengine/swarming/handlers_prpc_test.py
[modify] https://crrev.com/132c74e3ce003b29221bf3e5fc38af8c4f0bc049/appengine/swarming/proto/api/plugin_prpc_pb2.py
[modify] https://crrev.com/132c74e3ce003b29221bf3e5fc38af8c4f0bc049/appengine/swarming/proto/api/swarming.proto
[modify] https://crrev.com/132c74e3ce003b29221bf3e5fc38af8c4f0bc049/appengine/swarming/proto/api/swarming_pb2.py
[modify] https://crrev.com/132c74e3ce003b29221bf3e5fc38af8c4f0bc049/appengine/swarming/proto/api/swarming_prpc_pb2.py
[modify] https://crrev.com/132c74e3ce003b29221bf3e5fc38af8c4f0bc049/appengine/swarming/server/bot_management.py
[modify] https://crrev.com/132c74e3ce003b29221bf3e5fc38af8c4f0bc049/appengine/swarming/server/bot_management_test.py
[modify] https://crrev.com/132c74e3ce003b29221bf3e5fc38af8c4f0bc049/appengine/swarming/server/task_request.py
[modify] https://crrev.com/132c74e3ce003b29221bf3e5fc38af8c4f0bc049/appengine/swarming/server/task_request_test.py
[modify] https://crrev.com/132c74e3ce003b29221bf3e5fc38af8c4f0bc049/appengine/swarming/server/task_result.py
[modify] https://crrev.com/132c74e3ce003b29221bf3e5fc38af8c4f0bc049/appengine/swarming/server/task_result_test.py
