
Issue 826421

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Apr 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Feature

Blocking:
issue 829525
issue 825843




Remove inequality on last_seen_ts from BotInfos queries

Project Member Reported by mar...@chromium.org, Mar 27 2018

Issue description

Part of issue 825843 was caused by combining an inequality filter with a repeated field. We can at least remove the inequality by adding a "dead" value to BotInfo.composite.

This can easily be done by a cron job.

See BotInfo._calc_composite()
https://cs.chromium.org/chromium/infra/luci/appengine/swarming/server/bot_management.py?l=184
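
A minimal sketch of the idea, in plain Python rather than ndb so it runs anywhere. The constants, the 10 minute timeout, the dict layout and the function names are all assumptions made for illustration, not the contents of bot_management.py. It shows a cron-style pass storing a precomputed state, so that the "dead bots" query becomes an equality test on the stored value instead of an inequality on last_seen_ts.

```python
import datetime

# Hypothetical constants and timeout; the real ones live in bot_management.py.
ALIVE = 1
DEAD = 2
BOT_DEATH_TIMEOUT = datetime.timedelta(minutes=10)


def calc_state(last_seen_ts, now):
    """Returns DEAD if the bot has not been seen within the timeout."""
    if now - last_seen_ts >= BOT_DEATH_TIMEOUT:
        return DEAD
    return ALIVE


def cron_refresh(bots, now):
    """Stores the precomputed state on each bot; returns how many changed."""
    changed = 0
    for bot in bots:
        state = calc_state(bot["last_seen_ts"], now)
        if bot.get("state") != state:
            bot["state"] = state
            changed += 1
    return changed


def query_dead(bots):
    """Equality test on the stored value; no timestamp comparison needed."""
    return [b["id"] for b in bots if b.get("state") == DEAD]


now = datetime.datetime(2018, 4, 5, 12, 0)
bots = [
    {"id": "bot1", "last_seen_ts": now - datetime.timedelta(minutes=1)},
    {"id": "bot2", "last_seen_ts": now - datetime.timedelta(hours=1)},
]
cron_refresh(bots, now)
print(query_dead(bots))  # ['bot2']
```

The trade-off is that the stored state is only as fresh as the last cron run, so a bot can be reported alive for up to one cron period after it actually timed out.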
 
Labels: -Pri-2 Pri-1
Bumping priority, looks like I'll have to implement now. :/
Blocking: 825843
Owner: mar...@chromium.org
Status: Started (was: Available)
Should be doable in two commits.
Project Member

Comment 4 by bugdroid1@chromium.org, Apr 4 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/c14663096bbb106dd7f0a6ae65cb5439afb45b68

commit c14663096bbb106dd7f0a6ae65cb5439afb45b68
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Wed Apr 04 20:48:51 2018

[swarming] Redo bot_management_test.py in preparation of refactor

Rename BotInfo.composite constants in preparation to add more.

Cleanup the tests a bit in preparation as more states are being added.

Bug:  826421 , 817976 
Change-Id: I84d4397713f52a5fec4dd392ce5f1f0f7747c70e
Reviewed-on: https://chromium-review.googlesource.com/993034
Reviewed-by: Robbie Iannucci <iannucci@chromium.org>
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/c14663096bbb106dd7f0a6ae65cb5439afb45b68/appengine/swarming/server/bot_management.py
[modify] https://crrev.com/c14663096bbb106dd7f0a6ae65cb5439afb45b68/appengine/swarming/server/bot_management_test.py

Blocking: 829525
Project Member

Comment 6 by bugdroid1@chromium.org, Apr 5 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/21ce635d37263d0b95c7659371624214134271ba

commit 21ce635d37263d0b95c7659371624214134271ba
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Thu Apr 05 20:31:13 2018

[swarming] precompute dead and alive bots

The end goal is to remove the inequality filter on last_seen_ts from the BotInfo
queries, which is believed to make it harder to count entities.

Switch BotInfo.composite from ComputedProperty to IntegerProperty, because
otherwise the property is recomputed on *load*, which makes it impossible to
know which entities to refresh in the cron job. Duh. :(

Rename tidy_stale to cron_tidy_stale to make the naming coherent with the other
cron functions.

Follow up CLs:
1. Enable the cron job and switch the query filter to stop using last_seen_ts.
2. Remove old indexes and manually vacuum them.

This will remove a fair number of indexes, which will help with overall health.

Bug:  826421 
Change-Id: I6328256f0a4e64c3901abc70db9fec89696f1a86
Reviewed-on: https://chromium-review.googlesource.com/994092
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Robbie Iannucci <iannucci@chromium.org>

[modify] https://crrev.com/21ce635d37263d0b95c7659371624214134271ba/appengine/swarming/cron.yaml
[modify] https://crrev.com/21ce635d37263d0b95c7659371624214134271ba/appengine/swarming/handlers_backend.py
[modify] https://crrev.com/21ce635d37263d0b95c7659371624214134271ba/appengine/swarming/index.yaml
[modify] https://crrev.com/21ce635d37263d0b95c7659371624214134271ba/appengine/swarming/server/bot_management.py
[modify] https://crrev.com/21ce635d37263d0b95c7659371624214134271ba/appengine/swarming/server/bot_management_test.py
[modify] https://crrev.com/21ce635d37263d0b95c7659371624214134271ba/appengine/swarming/server/task_queues.py
[modify] https://crrev.com/21ce635d37263d0b95c7659371624214134271ba/appengine/swarming/server/task_queues_test.py
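
The ComputedProperty problem described in the commit above can be illustrated with a toy model. This is plain Python, not ndb, and every name in it is hypothetical. It assumes, as the commit message states, that a computed value is re-derived when the entity is loaded, while a plain stored value comes back exactly as it was last written.

```python
TIMEOUT_SECS = 600  # hypothetical death timeout


def derive(last_seen, now):
    """The state a bot should have at time `now`."""
    return "dead" if now - last_seen >= TIMEOUT_SECS else "alive"


def load_computed(row, now):
    """Computed-style load: the derived value replaces the stored one."""
    return {"last_seen": row["last_seen"],
            "state": derive(row["last_seen"], now)}


def load_stored(row):
    """Plain-property load: the stored value comes back untouched."""
    return dict(row)


def needs_refresh(entity, now):
    """The cron job only wants to rewrite entities whose state is stale."""
    return entity["state"] != derive(entity["last_seen"], now)


# Written at t=0, so the stored (and indexed) state still says "alive".
row = {"last_seen": 0, "state": "alive"}
now = 1000

print(needs_refresh(load_computed(row, now), now))  # False
print(needs_refresh(load_stored(row), now))         # True
```

With the computed-style load the entity always looks up to date in memory, even though the value that was written and indexed is stale, so the cron job has nothing to compare against. Storing the value explicitly makes the stale rows visible.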

Project Member

Comment 7 by bugdroid1@chromium.org, Apr 5 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/83bad5ae88ec21eb4076b266b3a393f542f27278

commit 83bad5ae88ec21eb4076b266b3a393f542f27278
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Thu Apr 05 22:40:14 2018

[swarming] enable dead bot cron job

I'll commit only once 21ce635d37263 has been deployed everywhere.

Bug:  826421 
Change-Id: I545d9e8d388bf9ed25f644f7b871cc2385efbb80
Reviewed-on: https://chromium-review.googlesource.com/998757
Reviewed-by: Robbie Iannucci <iannucci@chromium.org>
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/83bad5ae88ec21eb4076b266b3a393f542f27278/appengine/swarming/cron.yaml

Project Member

Comment 8 by bugdroid1@chromium.org, Apr 10 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/f31363e439280afd6b0f297d27471d5e3277a0e3

commit f31363e439280afd6b0f297d27471d5e3277a0e3
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Tue Apr 10 10:48:25 2018

[swarming] switch BotInfo to use new simpler index

Stop doing inequality query for dead bots. This should fix the datastore index
problem.

Bug:  826421 
Change-Id: I27760ea81bf69e0880fc22f866379a9334765fcc
Reviewed-on: https://chromium-review.googlesource.com/998934
Reviewed-by: Robbie Iannucci <iannucci@chromium.org>
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/f31363e439280afd6b0f297d27471d5e3277a0e3/appengine/swarming/handlers_endpoints.py
[modify] https://crrev.com/f31363e439280afd6b0f297d27471d5e3277a0e3/appengine/swarming/handlers_endpoints_test.py
[modify] https://crrev.com/f31363e439280afd6b0f297d27471d5e3277a0e3/appengine/swarming/index.yaml
[modify] https://crrev.com/f31363e439280afd6b0f297d27471d5e3277a0e3/appengine/swarming/message_conversion.py
[modify] https://crrev.com/f31363e439280afd6b0f297d27471d5e3277a0e3/appengine/swarming/server/bot_management.py
[modify] https://crrev.com/f31363e439280afd6b0f297d27471d5e3277a0e3/appengine/swarming/server/bot_management_test.py
[modify] https://crrev.com/f31363e439280afd6b0f297d27471d5e3277a0e3/appengine/swarming/server/lease_management.py
[modify] https://crrev.com/f31363e439280afd6b0f297d27471d5e3277a0e3/appengine/swarming/ts_mon_metrics.py

Project Member

Comment 9 by bugdroid1@chromium.org, Apr 10 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/96694430f61b58977ad6e262efb143581b1c9a80

commit 96694430f61b58977ad6e262efb143581b1c9a80
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Tue Apr 10 21:52:24 2018

swarming: fix stale index causing exception in tx

Between the time the BotInfo query yields entities and the time the transaction
runs, the BotInfo entity could be deleted. This happens for Machine Provider
managed machines, as the BotInfo is deleted once the lease is released and the
bot is deleted.

This causes exceptions in cron_update_bot_info(), which is annoying but not a
big deal.

R=iannucci@chromium.org

Bug:  826421 
Change-Id: I04791b34f9a8f4080cb91fa27927b19ae1d4d7b6
Reviewed-on: https://chromium-review.googlesource.com/1005281
Reviewed-by: Robbie Iannucci <iannucci@chromium.org>
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/96694430f61b58977ad6e262efb143581b1c9a80/appengine/swarming/server/bot_management.py
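
A sketch of the race described in the commit above, using a dict as a stand-in for the datastore. The function names and the dict layout are hypothetical and this is not the code from the CL. The point is only that the update re-reads the entity and tolerates it having disappeared since the query ran.

```python
def update_in_tx(store, key, new_state):
    """Re-reads the entity at update time; skips it if it is already gone."""
    entity = store.get(key)  # fresh read, not the stale query result
    if entity is None:
        # Deleted between the query and the update: nothing to do.
        return False
    entity["state"] = new_state
    return True


def cron_update(store, stale_keys):
    """Marks every still-existing bot in stale_keys as dead."""
    updated = 0
    for key in stale_keys:
        if update_in_tx(store, key, "dead"):
            updated += 1
    return updated


store = {"bot1": {"state": "alive"}, "bot2": {"state": "alive"}}
stale_keys = list(store)  # the "query" result
del store["bot2"]         # the lease is released before the update runs
print(cron_update(store, stale_keys))  # 1
```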

Mostly done! Need to deploy, and vacuum indexes.
Project Member

Comment 11 by bugdroid1@chromium.org, Apr 10 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/90da805a13a5115b68ccb17f89d9eeb4a67aa099

commit 90da805a13a5115b68ccb17f89d9eeb4a67aa099
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Tue Apr 10 22:28:44 2018

[swarming] reduce parallelism in cron_update_bot_info

Many CommitError exceptions are being raised due to contention on the
production server.

Lower from 25 parallel transactions down to 5 to help reduce the failure
rate of this cron job.

Put the code inside a try/finally so it can log how many items it processed even
when failing.

Bug:  826421 
Change-Id: I647dbe3bcaffb9c704ab6a26f464315717b24329
Reviewed-on: https://chromium-review.googlesource.com/1006095
Reviewed-by: Robbie Iannucci <iannucci@chromium.org>
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/90da805a13a5115b68ccb17f89d9eeb4a67aa099/appengine/swarming/server/bot_management.py
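
The pattern from the commit above, bounded parallelism plus a try/finally that always reports progress, can be sketched as follows. This uses a thread pool from the standard library as a substitute for the datastore's asynchronous transactions, and run_bounded and its parameters are hypothetical names, so treat it as an illustration of the shape rather than the actual change.

```python
import logging
from concurrent.futures import ThreadPoolExecutor


def run_bounded(items, update, max_parallel=5):
    """Runs update(item) with at most max_parallel in flight.

    Returns the number of items processed. If update() raises, the count
    reached so far is still logged before the exception propagates.
    """
    done = 0
    try:
        with ThreadPoolExecutor(max_workers=max_parallel) as pool:
            # map() yields results in order and re-raises the first failure.
            for _ in pool.map(update, items):
                done += 1
    finally:
        logging.info("processed %d of %d items", done, len(items))
    return done


print(run_bounded(["bot1", "bot2", "bot3"], lambda bot_id: None))  # 3
```

Lowering max_parallel trades throughput for fewer concurrent writers, which is the same lever the commit pulls when going from 25 to 5.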

Status: Fixed (was: Started)
and CL https://chromium-review.googlesource.com/1007334

Deployed everywhere.
Indexes vacuumed.
\o/

There are still some ongoing exceptions, so I may do a follow-up CL to try to tame them a bit more, but it's working well.
Project Member

Comment 13 by bugdroid1@chromium.org, Apr 13 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/d83708119e4b9805fb65f7735ac5b1bbd611b954

commit d83708119e4b9805fb65f7735ac5b1bbd611b954
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Fri Apr 13 19:40:15 2018

[swarming] variable aliasing strikes again

'lambda: foo(bar.biz)' will not work correctly when created inside a loop,
because the lambda reads bar when it is called, not when it is defined. This
resulted in the same bot being run in every transaction. :( It took me an
unreasonable amount of time to figure this out.

Log the bots that are updated as dead.
Reduce log level otherwise to debug.

TBR=iannucci@chromium.org
Bug:  826421 
Change-Id: Iada8e800e7006742bf074bc140a513202a12d831
Reviewed-on: https://chromium-review.googlesource.com/1012742
Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org>
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/d83708119e4b9805fb65f7735ac5b1bbd611b954/appengine/swarming/server/bot_management.py
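
The aliasing issue from the commit above is ordinary Python late binding and can be reproduced in a few lines. The Bot class and function names here are hypothetical, and binding the loop variable as a default argument is one common fix, not necessarily the one used in the CL.

```python
class Bot:
    def __init__(self, bot_id):
        self.id = bot_id


def make_callbacks_buggy(bots):
    callbacks = []
    for bot in bots:
        # `bot` is looked up when the lambda is called, after the loop ended.
        callbacks.append(lambda: bot.id)
    return callbacks


def make_callbacks_fixed(bots):
    callbacks = []
    for bot in bots:
        # The default argument captures the current value of `bot`.
        callbacks.append(lambda bot=bot: bot.id)
    return callbacks


bots = [Bot("bot1"), Bot("bot2"), Bot("bot3")]
print([cb() for cb in make_callbacks_buggy(bots)])  # ['bot3', 'bot3', 'bot3']
print([cb() for cb in make_callbacks_fixed(bots)])  # ['bot1', 'bot2', 'bot3']
```

functools.partial or a small helper function that takes the bot as a parameter would bind the value at creation time in the same way.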
