
Issue 884579


Issue metadata

Status: Started
Pri: 3
Type: Bug




Swarming: delete old entities.

Project Member Reported by mar...@chromium.org, Sep 17

Issue description

Let's use:

- BotEvent: 2 years
- TaskOutputChunk: 2 years
- TaskRequest: 3 years
 
Components: Infra>Platform>Swarming

I'll start with:
- TaskRequest: 4 years

and will probably decide to trim more aggressively afterwards. We'll also want to trim the breadcrumbs for deleted bots, which is not done right now; this includes BotRoot, BotDimensions, BotSettings, etc.

Also found:
- Bot
- File
- FileChunk
- MachineWhitelist

Comment 4 by bugdroid1@chromium.org, Sep 18

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/028b98df7254d0bd5d4ca58e83d23613e8aae617

commit 028b98df7254d0bd5d4ca58e83d23613e8aae617
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Tue Sep 18 17:27:10 2018

[swarming] Trim BotEvent older than 2 years

Used to be 3 years. Now that the backlog is done, it's fine to trim
more. We don't expect history older than 2 years to be of much value. This will
permit deleting hundreds of millions of entities.

Improve unit test to confirm that the right instance was deleted,
instead of just counting the number of entities.

Reorganize the cron code a little bit to keep less code inside the try/finally.

Stop using BotEvents as plural in the comments.

Bug: 884579
Change-Id: I9107696127e9623c222d0b3fe58b436e70ea9d3c
Reviewed-on: https://chromium-review.googlesource.com/1227034
Reviewed-by: Jao-ke Chin-Lee <jchinlee@chromium.org>
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/028b98df7254d0bd5d4ca58e83d23613e8aae617/appengine/swarming/cron.yaml
[modify] https://crrev.com/028b98df7254d0bd5d4ca58e83d23613e8aae617/appengine/swarming/server/bot_management.py
[modify] https://crrev.com/028b98df7254d0bd5d4ca58e83d23613e8aae617/appengine/swarming/server/bot_management_test.py
[modify] https://crrev.com/028b98df7254d0bd5d4ca58e83d23613e8aae617/appengine/swarming/server/task_queues.py
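
For readers following along, here is a minimal sketch of the age-based selection this commit describes, together with a check on which entity is selected rather than how many. It is plain Python, not the actual bot_management.py code: the (event_id, timestamp) tuple shape and the names BOT_EVENT_RETENTION and select_expired are assumptions made for illustration.

```python
import datetime

# Assumed retention window, matching the 2 years named in the commit message.
BOT_EVENT_RETENTION = datetime.timedelta(days=2 * 365)


def select_expired(events, now, retention=BOT_EVENT_RETENTION):
    """Returns the ids of events strictly older than the cutoff.

    events: iterable of (event_id, timestamp) pairs (hypothetical shape).
    """
    cutoff = now - retention
    return [event_id for event_id, ts in events if ts < cutoff]


if __name__ == '__main__':
    now = datetime.datetime(2018, 9, 18)
    events = [
        ('old', datetime.datetime(2016, 1, 1)),
        ('recent', datetime.datetime(2018, 1, 1)),
    ]
    # Inspect which event was selected, not just how many.
    print(select_expired(events, now))
```

Asserting on the returned ids, instead of on a count, is what would catch a selection that picks the wrong end of the time range.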

These are done:
- chromium-swarm-dev
- omnibot-legion-swarming-server
- omnibot-swarming-server


Comment 6 by bugdroid1@chromium.org, Sep 19

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/c5c026e79addda6ef5576480fc8e7772f109a9e6

commit c5c026e79addda6ef5576480fc8e7772f109a9e6
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Wed Sep 19 00:18:28 2018

[swarming] reduce error->warning for rebuild-task-cache failure

This task queue can fail when Cloud DB decides that there's too much
contention. This is not a hard failure; the task will be retried.

Make the task queue return a 503 instead of a 500. Since the service never
returns 503 by itself, this makes it possible to tell that it was not a real
failure: the 503 is returned only so that the Task engine retries the task.

Bug: 884579
Change-Id: If2fa4af760d7427695ef78700891411ddb4acbc9
Reviewed-on: https://chromium-review.googlesource.com/1227035
Reviewed-by: Jao-ke Chin-Lee <jchinlee@chromium.org>
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/c5c026e79addda6ef5576480fc8e7772f109a9e6/appengine/swarming/handlers_backend.py
[modify] https://crrev.com/c5c026e79addda6ef5576480fc8e7772f109a9e6/appengine/swarming/server/task_queues.py


Comment 7 by bugdroid1@chromium.org, Sep 19

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/332fcc9ee863e24410e4ff25e699c319496c792d

commit 332fcc9ee863e24410e4ff25e699c319496c792d
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Wed Sep 19 19:15:26 2018

[swarming] Add code to trim old tasks and old TaskOutputChunk

The data is now in the tens of terabytes, so it's worth deleting old junk.

This code is not yet enabled, as cron.yaml is not touched. This will be
done in a follow-up.

Includes unit test that ensures the right entity is deleted.

Bug: 884579
Change-Id: I2fbc906231cd97df0277fce256247c8a8e06fe90
Reviewed-on: https://chromium-review.googlesource.com/1227036
Reviewed-by: Jao-ke Chin-Lee <jchinlee@chromium.org>
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/332fcc9ee863e24410e4ff25e699c319496c792d/appengine/swarming/handlers_backend.py
[modify] https://crrev.com/332fcc9ee863e24410e4ff25e699c319496c792d/appengine/swarming/server/task_request.py
[modify] https://crrev.com/332fcc9ee863e24410e4ff25e699c319496c792d/appengine/swarming/server/task_request_test.py
[modify] https://crrev.com/332fcc9ee863e24410e4ff25e699c319496c792d/appengine/swarming/server/task_result.py
[modify] https://crrev.com/332fcc9ee863e24410e4ff25e699c319496c792d/appengine/swarming/server/task_result_test.py


Comment 8 by bugdroid1@chromium.org, Nov 24

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/ca6c1941d80eabbd9aa12f671ced3bf7d13370f6

commit ca6c1941d80eabbd9aa12f671ced3bf7d13370f6
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Sat Nov 24 19:01:32 2018

[swarming] Properly delete old tasks.

The previous code deleted *recent* tasks, and it was testing for this
incorrect behavior. Oops.

Use TaskRequest.created_ts instead of TaskRequest.key, as the key
ordering is not what we want.

R=qyearsley@chromium.org

Bug: 884579
Change-Id: I4212e414e6a681acd6ccd16be73630ce1a6e5a4b
Reviewed-on: https://chromium-review.googlesource.com/c/1349375
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Quinten Yearsley <qyearsley@chromium.org>

[modify] https://crrev.com/ca6c1941d80eabbd9aa12f671ced3bf7d13370f6/appengine/swarming/server/task_request.py
[modify] https://crrev.com/ca6c1941d80eabbd9aa12f671ced3bf7d13370f6/appengine/swarming/server/task_request_test.py
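
To illustrate the kind of bug fixed here, the sketch below selects tasks by creation time rather than by key. It assumes, purely for illustration, a dataset whose keys sort in the opposite order from creation time; the dict shape and the name select_old_tasks are hypothetical and this is not the task_request.py implementation.

```python
import datetime


def select_old_tasks(tasks, cutoff):
    """Returns keys of tasks created strictly before cutoff, oldest first.

    tasks: list of dicts with 'key' and 'created_ts' (hypothetical shape).
    """
    old = [t for t in tasks if t['created_ts'] < cutoff]
    old.sort(key=lambda t: t['created_ts'])
    return [t['key'] for t in old]


# Illustrative data: the smallest key belongs to the newest task.
TASKS = [
    {'key': 1, 'created_ts': datetime.datetime(2018, 11, 1)},
    {'key': 2, 'created_ts': datetime.datetime(2017, 6, 1)},
    {'key': 3, 'created_ts': datetime.datetime(2014, 3, 1)},
]

if __name__ == '__main__':
    cutoff = datetime.datetime(2015, 1, 1)
    print(select_old_tasks(TASKS, cutoff))
    # Taking the first task in key order would pick the newest one instead.
    print(min(TASKS, key=lambda t: t['key'])['created_ts'])
```

With data shaped like this, a scan in key order reaches recent tasks first, which is why filtering on the timestamp is the safer selection criterion.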


Comment 9 by bugdroid1@chromium.org, Nov 28

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/88bedbacf1d85879deb91584323efab33aadc881

commit 88bedbacf1d85879deb91584323efab33aadc881
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Wed Nov 28 18:38:23 2018

[swarming] enable cron jobs to trim TaskRequest and TaskOutputChunk

Respectively:
/internal/cron/delete_old_tasks
/internal/cron/delete_old_task_output_chunks

This is now safe, as all versions prior to 3880-ca6c194 have been deleted
on all instances.

Bug: 884579
Change-Id: I083dab9d84fb3ff550fe748a51dd54be38effbe0
Reviewed-on: https://chromium-review.googlesource.com/c/1351431
Reviewed-by: Quinten Yearsley <qyearsley@chromium.org>
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/88bedbacf1d85879deb91584323efab33aadc881/appengine/swarming/cron.yaml

Lowering to:
- BotEvent: 1 year
- TaskRequest entity group: 18 months

This removes the need for a TaskOutput-specific cron job.

Comment 11 by bugdroid1@chromium.org, Dec 3

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/43f034525fad4492b85245b05acd9d68671abe4e

commit 43f034525fad4492b85245b05acd9d68671abe4e
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Mon Dec 03 14:21:43 2018

[swarming] Remove TaskOutput removal cron job

It is not needed, as TaskRequest entity group will be kept for only 18 months.

It is safe to remove the handler in the same CL as the cron.yaml change, as
uploading a version with the trimmed cron.yaml takes effect immediately,
even if the default version is not bumped.

Bug: 884579
Change-Id: Iebb2094d299ccce202e32b3778de782486c639aa
Reviewed-on: https://chromium-review.googlesource.com/c/1354365
Reviewed-by: Quinten Yearsley <qyearsley@chromium.org>
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/43f034525fad4492b85245b05acd9d68671abe4e/appengine/swarming/cron.yaml
[modify] https://crrev.com/43f034525fad4492b85245b05acd9d68671abe4e/appengine/swarming/handlers_backend.py
[modify] https://crrev.com/43f034525fad4492b85245b05acd9d68671abe4e/appengine/swarming/server/task_result.py
[modify] https://crrev.com/43f034525fad4492b85245b05acd9d68671abe4e/appengine/swarming/server/task_result_test.py


Comment 12 by bugdroid1@chromium.org, Dec 3

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/d18f6439bb30f910083396acc220412f40ae13d8

commit d18f6439bb30f910083396acc220412f40ae13d8
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Mon Dec 03 22:23:52 2018

[swarming] Lower task and BotEvent retention

Lower BotEvent retention to 12 months. In practice we don't even need
this much.

Lower TaskRequest entity groups retention to 18 months. This subsumes
the TaskOutput retention policy; no need to trim stdout on a shorter period, at
least for now.

Bug: 884579
Change-Id: I67e1fea2f4356c38e79511ae7998308b5aeb4d78
Reviewed-on: https://chromium-review.googlesource.com/c/1354364
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Reviewed-by: Quinten Yearsley <qyearsley@chromium.org>

[modify] https://crrev.com/d18f6439bb30f910083396acc220412f40ae13d8/appengine/swarming/server/bot_management.py
[modify] https://crrev.com/d18f6439bb30f910083396acc220412f40ae13d8/appengine/swarming/server/task_request.py


Comment 13 by bugdroid1@chromium.org, Dec 5

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/d2a77d1006ed33d803e2292e15738b66e8944b24

commit d2a77d1006ed33d803e2292e15738b66e8944b24
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Wed Dec 05 18:06:52 2018

[swarming] Delete old bot; Log timestamp for entities being deleted

The logging will help to know whether the cron jobs are keeping up with the
creation rate, so that all stale entities are deleted within the expected
time frame. This is done for both TaskRequest and BotEvent.

Deleting old bots is important with Machine Provider-created VMs;
otherwise cruft just accumulates. We want to keep a clean and tidy DB. :)

The cron job will be added once this code is deployed everywhere.

Bug: 884579
Change-Id: I6fd0d7526446e7439273c9f706d14ead86817d7a
Reviewed-on: https://chromium-review.googlesource.com/c/1363252
Reviewed-by: Quinten Yearsley <qyearsley@chromium.org>
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/d2a77d1006ed33d803e2292e15738b66e8944b24/appengine/swarming/handlers_backend.py
[modify] https://crrev.com/d2a77d1006ed33d803e2292e15738b66e8944b24/appengine/swarming/server/bot_management.py
[modify] https://crrev.com/d2a77d1006ed33d803e2292e15738b66e8944b24/appengine/swarming/server/bot_management_test.py
[modify] https://crrev.com/d2a77d1006ed33d803e2292e15738b66e8944b24/appengine/swarming/server/task_request.py
[modify] https://crrev.com/d2a77d1006ed33d803e2292e15738b66e8944b24/appengine/swarming/server/task_request_test.py


Comment 14 by bugdroid1@chromium.org, Dec 5

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/52977ce45e492a3681b368d202466aa840dbc016

commit 52977ce45e492a3681b368d202466aa840dbc016
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Wed Dec 05 22:52:02 2018

[swarming] Be tolerant to inconsistent index

This is to handle new logging code in d2a77d1006ed33d803e2292e15738b66e8944b24.

Bug: 884579
Change-Id: I7b688e3d489f0fe3c96eba6b7a90c438d7e5473a
Reviewed-on: https://chromium-review.googlesource.com/c/1363534
Reviewed-by: Quinten Yearsley <qyearsley@chromium.org>
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/52977ce45e492a3681b368d202466aa840dbc016/appengine/swarming/server/bot_management.py


Comment 15 by bugdroid1@chromium.org, Dec 6

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/d00338cf557c4c79f5665d0fca3763b4cfe5d9dc

commit d00338cf557c4c79f5665d0fca3763b4cfe5d9dc
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Thu Dec 06 17:12:10 2018

[swarming] Fine tune the logging added in d2a77d1006ed33d803e22

Log the delta, so we can see how far behind the cron job is on
production without having to calculate it manually.

Bug: 884579
Change-Id: Id0f82fee778f0260cbce17aeeb76d299a777ff52
Reviewed-on: https://chromium-review.googlesource.com/c/1365354
Reviewed-by: Quinten Yearsley <qyearsley@chromium.org>
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/d00338cf557c4c79f5665d0fca3763b4cfe5d9dc/appengine/swarming/server/bot_management.py
[modify] https://crrev.com/d00338cf557c4c79f5665d0fca3763b4cfe5d9dc/appengine/swarming/server/task_request.py
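
One way to read "the delta" is the gap between the retention cutoff and the timestamp of the entity being deleted. The sketch below computes that value under this assumption; the commit message does not spell out the exact definition, and deletion_lag is a hypothetical name, not a function from bot_management.py or task_request.py.

```python
import datetime
import logging


def deletion_lag(now, retention, entity_ts):
    """Returns how far past the retention cutoff an entity is.

    A lag near zero suggests the cron job is keeping up; a large lag
    suggests a backlog of stale entities.
    """
    cutoff = now - retention
    return cutoff - entity_ts


if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    lag = deletion_lag(
        now=datetime.datetime(2018, 12, 6),
        retention=datetime.timedelta(days=365),
        entity_ts=datetime.datetime(2017, 11, 6))
    logging.info('Deleting entity; cron job is %s behind', lag)
```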


Comment 16 by bugdroid1@chromium.org, Dec 11

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/943fd59b5f2cab15bb7153a7a52c86f70a017675

commit 943fd59b5f2cab15bb7153a7a52c86f70a017675
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Tue Dec 11 15:05:51 2018

[swarming] enable cron job to delete stale bot

Bug: 884579
Change-Id: I009f1f755de80996e00b5f7605d75eee3420867a
Reviewed-on: https://chromium-review.googlesource.com/c/1365353
Reviewed-by: Quinten Yearsley <qyearsley@chromium.org>
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/943fd59b5f2cab15bb7153a7a52c86f70a017675/appengine/swarming/cron.yaml
