Issue 870723

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug

Blocking:
issue 916548
issue 645323

Swarming: monitor more bot states

Project Member Reported by mar...@chromium.org, Aug 3

Issue description

The current bot states are: idle, busy, dead, maintenance, quarantined.


Problems:
- The states do not account for 100% of a bot's time.
  - When a host is rebooting, it is currently listed as idle.
  - While a bot cleans up its cache, it is currently listed as idle. This cleanup used to happen mostly within the task's scope, but that changed with issue 868083.
- Bot hooks are reported independently from the bot itself. This means they cannot be represented on the server as a bot state. This is tracked as issue 835274.


Action items:
- Add new states:
  - "overhead" for internal cleanup; this could be subclassified as:
    - hooks (with the hook name)
    - "cache_cleanup" for isolated and named cache cleanup.
  - "rebooting" when we know the host is rebooting due to the bot's action.

http://go/swarming-monitoring-v2#heading=h.tn6n7ysq9dtk
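
A minimal sketch of how the proposed states could be modelled, for illustration only. The five existing state names and the "overhead", "cache_cleanup" and "rebooting" names come from the description above; everything else (the BotState and BotStatus types, the "hook:" prefix, the helper functions) is hypothetical and is not taken from the Swarming code or its protos.

```python
import enum
from dataclasses import dataclass
from typing import Optional


class BotState(enum.Enum):
    # States the issue lists as existing today.
    IDLE = "idle"
    BUSY = "busy"
    DEAD = "dead"
    MAINTENANCE = "maintenance"
    QUARANTINED = "quarantined"
    # States proposed in this issue.
    OVERHEAD = "overhead"
    REBOOTING = "rebooting"


@dataclass(frozen=True)
class BotStatus:
    """Hypothetical pairing of a state with its overhead sub-classification."""
    state: BotState
    # Only meaningful for OVERHEAD: "cache_cleanup" or "hook:<hook name>".
    detail: Optional[str] = None

    def __post_init__(self):
        # Reject a sub-classification on any state other than overhead.
        if self.detail is not None and self.state is not BotState.OVERHEAD:
            raise ValueError("detail is only valid for the overhead state")


def overhead_for_hook(hook_name: str) -> BotStatus:
    # The "hook:" prefix is an assumption made for this sketch.
    return BotStatus(BotState.OVERHEAD, "hook:" + hook_name)


def overhead_for_cache_cleanup() -> BotStatus:
    return BotStatus(BotState.OVERHEAD, "cache_cleanup")
```

With a shape like this, a bot that is cleaning its cache or running a hook would report overhead plus a detail string instead of idle, and a reboot triggered by the bot would report rebooting.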
 
Summary: Swarming: monitor more bot states (was: Swarming: monitor bot overhead)
Sounds great to me, but it isn't a priority from our point of view.

Comment 3 by bugdroid1@chromium.org, Dec 14

Comment 4 by bugdroid1@chromium.org, Dec 17

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/16c31d11f6241740151587ec1a8c495112146278

commit 16c31d11f6241740151587ec1a8c495112146278
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Mon Dec 17 18:45:12 2018

[swarming] stream BotEvent to BigQuery

This should enable mass analysis. It uses the same proto than the BotEvent
returned by the pRPC api.

Include setup_bigquery.sh inspired from ../isolate/'s version with tweaks. It
will further be updated in follow ups as more BQ tables are created.

Do not enable the cron yet.

Bug: 870723
Change-Id: If7675b65b9027964d223859d898c17c7d8b332b4
Reviewed-on: https://chromium-review.googlesource.com/c/1373972
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Quinten Yearsley <qyearsley@chromium.org>

[modify] https://crrev.com/16c31d11f6241740151587ec1a8c495112146278/PRESUBMIT.py
[modify] https://crrev.com/16c31d11f6241740151587ec1a8c495112146278/appengine/swarming/README.md
[add] https://crrev.com/16c31d11f6241740151587ec1a8c495112146278/appengine/swarming/bqh.py
[modify] https://crrev.com/16c31d11f6241740151587ec1a8c495112146278/appengine/swarming/handlers_backend.py
[modify] https://crrev.com/16c31d11f6241740151587ec1a8c495112146278/appengine/swarming/server/bot_management.py
[modify] https://crrev.com/16c31d11f6241740151587ec1a8c495112146278/appengine/swarming/server/bot_management_test.py
[add] https://crrev.com/16c31d11f6241740151587ec1a8c495112146278/appengine/swarming/setup_bigquery.sh

Comment 5 by bugdroid1@chromium.org, Dec 17

Comment 6 by bugdroid1@chromium.org, Dec 19

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/3738f2150fe13557e7ceb53c5c3f808fbcc111e8

commit 3738f2150fe13557e7ceb53c5c3f808fbcc111e8
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Wed Dec 19 03:14:56 2018

[swarming] Add TaskRequest message and all necessary submessages

They are not used yet. The TaskResult proto is added in a separate CL
because this CL is already large enough.

This will be needed when streaming the task results to BigQuery. We expect the
use of the same format than the API.

Bug: 870723
Change-Id: Ide64463beb9fe66199ec6d5aeab59bc184a1b9cd
Reviewed-on: https://chromium-review.googlesource.com/c/1380774
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Quinten Yearsley <qyearsley@chromium.org>

[modify] https://crrev.com/3738f2150fe13557e7ceb53c5c3f808fbcc111e8/appengine/swarming/proto/api/plugin.proto
[modify] https://crrev.com/3738f2150fe13557e7ceb53c5c3f808fbcc111e8/appengine/swarming/proto/api/plugin_prpc_pb2.py
[modify] https://crrev.com/3738f2150fe13557e7ceb53c5c3f808fbcc111e8/appengine/swarming/proto/api/swarming.proto
[modify] https://crrev.com/3738f2150fe13557e7ceb53c5c3f808fbcc111e8/appengine/swarming/proto/api/swarming_pb2.py
[modify] https://crrev.com/3738f2150fe13557e7ceb53c5c3f808fbcc111e8/appengine/swarming/proto/api/swarming_prpc_pb2.py
[modify] https://crrev.com/3738f2150fe13557e7ceb53c5c3f808fbcc111e8/appengine/swarming/server/task_request.py
[modify] https://crrev.com/3738f2150fe13557e7ceb53c5c3f808fbcc111e8/appengine/swarming/server/task_request_test.py

Blocking: 916548

Comment 8 by bugdroid1@chromium.org, Dec 19

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/132c74e3ce003b29221bf3e5fc38af8c4f0bc049

commit 132c74e3ce003b29221bf3e5fc38af8c4f0bc049
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Wed Dec 19 22:33:36 2018

[swarming] Further proto fine tuning

Lots of small things I realized after the fact. Doing these before piling more
data (TaskResult).

- Assign a bug to every TaskState not implemented yet and comment this.
- Assign a bug to every BotEventType not implemented yet and comment this.
- Add TaskState MISSING_INPUTS and SKIPPED_INTERNAL_FAILURE for further
  disambiguation of failure modes.
- Rename INTERNAL_FAILURE to RAN_INTERNAL_FAILURE to disambiguate with the 2 new
  related failures.
- Fix CASTree.digest back to an hex encoded string. It's 2x larger, but
  much simpler to work with.
- Rename BotInfo.raw to supplemental. The end goal is to have as much
  structured data as possible, but still allow customer-specified
  additional data. Renaming 'raw' to 'supplemental' makes this clearer.
- Create PhysicalEntity, which will enable the clear separation between
  host characteristics and device characteristics.
- Declare TaskProperties.containment, futureproofing the need for task process
  containment.

This change is a breaking change (proto message renumbering); TaskRequest is not
used anywhere yet.

Bug: 757931
Bug: 808836
Bug: 870723
Bug: 905087
Bug: 913978
Bug: 916553
Bug: 916556
Bug: 916557
Bug: 916559
Bug: 916560
Bug: 916562
Bug: 916570
Bug: 916578
Change-Id: Ic1a57d15d028802ad5cf8c6a2f13da15fac662c4
Reviewed-on: https://chromium-review.googlesource.com/c/1384425
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Quinten Yearsley <qyearsley@chromium.org>

[modify] https://crrev.com/132c74e3ce003b29221bf3e5fc38af8c4f0bc049/appengine/swarming/handlers_prpc_test.py
[modify] https://crrev.com/132c74e3ce003b29221bf3e5fc38af8c4f0bc049/appengine/swarming/proto/api/plugin_prpc_pb2.py
[modify] https://crrev.com/132c74e3ce003b29221bf3e5fc38af8c4f0bc049/appengine/swarming/proto/api/swarming.proto
[modify] https://crrev.com/132c74e3ce003b29221bf3e5fc38af8c4f0bc049/appengine/swarming/proto/api/swarming_pb2.py
[modify] https://crrev.com/132c74e3ce003b29221bf3e5fc38af8c4f0bc049/appengine/swarming/proto/api/swarming_prpc_pb2.py
[modify] https://crrev.com/132c74e3ce003b29221bf3e5fc38af8c4f0bc049/appengine/swarming/server/bot_management.py
[modify] https://crrev.com/132c74e3ce003b29221bf3e5fc38af8c4f0bc049/appengine/swarming/server/bot_management_test.py
[modify] https://crrev.com/132c74e3ce003b29221bf3e5fc38af8c4f0bc049/appengine/swarming/server/task_request.py
[modify] https://crrev.com/132c74e3ce003b29221bf3e5fc38af8c4f0bc049/appengine/swarming/server/task_request_test.py
[modify] https://crrev.com/132c74e3ce003b29221bf3e5fc38af8c4f0bc049/appengine/swarming/server/task_result.py
[modify] https://crrev.com/132c74e3ce003b29221bf3e5fc38af8c4f0bc049/appengine/swarming/server/task_result_test.py

Comment 9 by bugdroid1@chromium.org, Dec 21

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/d8e3e988330ca1b5bb6e332853d0319f43188a72

commit d8e3e988330ca1b5bb6e332853d0319f43188a72
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Fri Dec 21 21:05:57 2018

[swarming] Add TaskResult message and all necessary submessages

They are not used yet. The proto change is done in a separate CL because this is
already large enough.

This will be needed when streaming the task results to BigQuery. We expect the
use of the same format than the API.

Include a small crash fix in TaskRequest.to_proto().

Bug: 917106
Bug: 870723
Change-Id: Ifa3a88dbf66bcbc5b862cdb14124bec55a1ba4b2
Reviewed-on: https://chromium-review.googlesource.com/c/1379351
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Quinten Yearsley <qyearsley@chromium.org>

[modify] https://crrev.com/d8e3e988330ca1b5bb6e332853d0319f43188a72/appengine/swarming/proto/api/plugin_prpc_pb2.py
[modify] https://crrev.com/d8e3e988330ca1b5bb6e332853d0319f43188a72/appengine/swarming/proto/api/swarming.proto
[modify] https://crrev.com/d8e3e988330ca1b5bb6e332853d0319f43188a72/appengine/swarming/proto/api/swarming_pb2.py
[modify] https://crrev.com/d8e3e988330ca1b5bb6e332853d0319f43188a72/appengine/swarming/proto/api/swarming_prpc_pb2.py
[modify] https://crrev.com/d8e3e988330ca1b5bb6e332853d0319f43188a72/appengine/swarming/server/task_request.py
[modify] https://crrev.com/d8e3e988330ca1b5bb6e332853d0319f43188a72/appengine/swarming/server/task_result.py
[modify] https://crrev.com/d8e3e988330ca1b5bb6e332853d0319f43188a72/appengine/swarming/server/task_result_test.py
