restart everything regularly (host_scheduler, shard_client, gs_offloader, scheduler), and tell us when we do restart
Issue description
Currently, we don't have metrics to tell us when these services start. Add a /start metric as well.
,
Jan 25 2018
autotest CL stack for this: https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/885115/
,
Jan 25 2018
And once those are pushed to prod, puppet change to provide the new option: https://chrome-internal-review.googlesource.com/#/c/chromeos/chromeos-admin/+/554678
,
Jan 25 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/3c09255f82133a040b9c31371723686e654256b8
commit 3c09255f82133a040b9c31371723686e654256b8
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Jan 25 19:12:00 2018
autotest: Stop reporting metadata to elasticsearch.
We don't look at / maintain this anymore. All metrics should go to monarch. Anything more detailed should go to cloud trace / logs.
BUG=chromium:804509
TEST=None
Change-Id: Ie6b16b1d3178c51b134c6b3be006a748d1128333
Reviewed-on: https://chromium-review.googlesource.com/885108
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Paul Hobbs <phobbs@google.com>
[modify] https://crrev.com/3c09255f82133a040b9c31371723686e654256b8/scheduler/host_scheduler.py
,
Jan 25 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/4c94b2172a1a74f00f2d2a5f751338e90bf1d201
commit 4c94b2172a1a74f00f2d2a5f751338e90bf1d201
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Jan 25 19:12:01 2018
autotest: Use ts_mon as context, use indirect metrics process.
BUG=chromium:804509
TEST=None
Change-Id: I19e97cc4df784e5ebe34e7b1afabb2be7839de5a
Reviewed-on: https://chromium-review.googlesource.com/885109
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Paul Hobbs <phobbs@google.com>
[modify] https://crrev.com/4c94b2172a1a74f00f2d2a5f751338e90bf1d201/scheduler/host_scheduler.py
,
Jan 25 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/7b6683167553f8103d662dda00e43ea9eaf62076
commit 7b6683167553f8103d662dda00e43ea9eaf62076
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Jan 25 19:12:01 2018
Remove a lie about how upstart treats host_scheduler exits.
host_scheduler is run as a daemon, not a task. This means that the 'respawn' stanza ignores the exit status, and always respawns the process. This is what we want. The comment was wrong.
BUG=chromium:804509
TEST=None
Change-Id: I48c29f002f297233d3ee0a8e187f5da62741159b
Reviewed-on: https://chromium-review.googlesource.com/885110
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Paul Hobbs <phobbs@google.com>
[modify] https://crrev.com/7b6683167553f8103d662dda00e43ea9eaf62076/scheduler/host_scheduler.py
,
Jan 25 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/8b14506b57702159699d0b7626dbe0953708de23
commit 8b14506b57702159699d0b7626dbe0953708de23
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Jan 25 20:57:44 2018
autotest: Add host_scheduler argument to quit after requested time.
This allows us to cleanly restart host_scheduler every so often in the lab.
BUG=chromium:804509
TEST=Local host_scheduler run with --metrics-file --lifetime-hours
Change-Id: I8ac2b6cca4f350b60c1502050a7932eb3efb2990
Reviewed-on: https://chromium-review.googlesource.com/885111
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Paul Hobbs <phobbs@google.com>
[modify] https://crrev.com/8b14506b57702159699d0b7626dbe0953708de23/scheduler/host_scheduler.py
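For readers unfamiliar with the idea, here is a minimal sketch of a "quit after a requested time" loop. It is not the code from host_scheduler.py: `run_with_lifetime`, `tick`, and the injectable `clock` and `sleep` parameters are hypothetical names chosen for illustration, and the sketch assumes a simple polling main loop whose supervisor restarts the process after it exits.

```python
import time


def run_with_lifetime(tick, lifetime_hours=None, clock=time.monotonic,
                      sleep=time.sleep, poll_interval=1.0):
    """Call tick() repeatedly until lifetime_hours have elapsed.

    Hypothetical helper, not the actual host_scheduler implementation.
    With lifetime_hours=None the loop never exits on its own.
    Returns the number of ticks that were executed.
    """
    deadline = None
    if lifetime_hours is not None:
        # Convert the requested lifetime to an absolute deadline in seconds.
        deadline = clock() + lifetime_hours * 3600
    ticks = 0
    while deadline is None or clock() < deadline:
        tick()
        ticks += 1
        sleep(poll_interval)
    # Falling out of the loop lets the process exit cleanly so that a
    # supervisor can start a fresh one.
    return ticks
```

Passing the clock and sleep functions as parameters keeps the loop testable without waiting for real time to pass.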
,
Jan 25 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/01ef91b19a1c092e6cfd2bd869c5bb24e545f0a1
commit 01ef91b19a1c092e6cfd2bd869c5bb24e545f0a1
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Jan 25 20:57:44 2018
autotest: Add lifetime_hours option to shard_client.
When provided, this option instructs shard_client to run for a stipulated amount of time and then exit.
BUG=chromium:804509
TEST=Local shard_client run, hacked up to not really do anything.
Change-Id: Id066022d77ffedee05725e24285b363e2a732d6a
Reviewed-on: https://chromium-review.googlesource.com/885112
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Paul Hobbs <phobbs@google.com>
[modify] https://crrev.com/01ef91b19a1c092e6cfd2bd869c5bb24e545f0a1/scheduler/shard/shard_client.py
[modify] https://crrev.com/01ef91b19a1c092e6cfd2bd869c5bb24e545f0a1/scheduler/shard/shard_client_unittest.py
,
Jan 25 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/68baeb3d9fad0176f4f27cd908cb80fdff6e1c3a
commit 68baeb3d9fad0176f4f27cd908cb80fdff6e1c3a
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Jan 25 22:22:57 2018
autotest: Add option to fake metrics collection in shard_client.
Brings the options on-par with other services.
BUG=chromium:804509
TEST=Local shard_client run, hacked out to not really do anything.
Change-Id: I6564682b2c779636c13ede2f00d07ba5a677bc14
Reviewed-on: https://chromium-review.googlesource.com/885113
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Paul Hobbs <phobbs@google.com>
[modify] https://crrev.com/68baeb3d9fad0176f4f27cd908cb80fdff6e1c3a/scheduler/shard/shard_client.py
,
Jan 25 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/52bb2cc822ee5152fc348c6664be2f1d014dca72
commit 52bb2cc822ee5152fc348c6664be2f1d014dca72
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Jan 25 22:22:58 2018
autotest: Report process start counter from scheduler, host_scheduler.
BUG=chromium:804509
TEST=None
Change-Id: I0478415effbca8e5c9bb8c44b4cba5db1587aa32
Reviewed-on: https://chromium-review.googlesource.com/885114
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Paul Hobbs <phobbs@google.com>
[modify] https://crrev.com/52bb2cc822ee5152fc348c6664be2f1d014dca72/scheduler/monitor_db.py
[modify] https://crrev.com/52bb2cc822ee5152fc348c6664be2f1d014dca72/scheduler/host_scheduler.py
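The commit message does not show the metric call itself. As a rough illustration only, a process start counter can be incremented once at startup, before the main loop begins. `FakeCounter`, `report_start`, `main`, and the metric name below are hypothetical; `FakeCounter` is an in-memory stand-in, not the real metrics client used by autotest.

```python
import collections


class FakeCounter(object):
    """Hypothetical in-memory stand-in for a monitoring counter."""

    def __init__(self, name):
        self.name = name
        # Maps a sorted tuple of (field, value) pairs to a count.
        self.values = collections.Counter()

    def increment(self, fields=None):
        key = tuple(sorted((fields or {}).items()))
        self.values[key] += 1


def report_start(counter, service):
    """Record exactly one process start for the given service name."""
    counter.increment(fields={'service': service})


def main(counter):
    # Report the start before doing any real work, so that every
    # (re)start of the process shows up as one increment.
    report_start(counter, 'host_scheduler')
    # ... the service's main loop would run here ...


# Illustrative metric name only; the real metric path is not given above.
start_counter = FakeCounter('example/host_scheduler/start')
main(start_counter)
```

With a counter like this, a dashboard that plots the rate of increments shows each restart as a step, which is what makes routine restarts verifiable.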
,
Jan 25 2018
The following revision refers to this bug:
https://chromium.googlesource.com/chromiumos/third_party/autotest/+/766605e835f56c4e7ca81f69166499ee6440672d
commit 766605e835f56c4e7ca81f69166499ee6440672d
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Jan 25 23:58:35 2018
autotest: Stop reporting stats to elasticsearch.
We don't look at it / maintain it. Use ts_mon for metrics and logs / cloud trace for more detailed stuff.
BUG=chromium:804509
TEST=Local run
Change-Id: Ib002b3d0a10d561eb5f886ccd5230ebf29f8b051
Reviewed-on: https://chromium-review.googlesource.com/885115
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Paul Hobbs <phobbs@google.com>
[modify] https://crrev.com/766605e835f56c4e7ca81f69166499ee6440672d/scheduler/monitor_db.py
,
Jan 26 2018
Next steps:
- push-to-prod
- push puppet CL to start using the new flags to set deadline for scheduler, shard_client
- Add vi/ dashboards for the **/start metrics that I've added. Verify that all services are getting restarted routinely.
,
Jan 31 2018
The following revision refers to this bug:
https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/5b8d5235ccf642b703150f6f0b805ad7740675c0
commit 5b8d5235ccf642b703150f6f0b805ad7740675c0
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Wed Jan 31 18:03:00 2018
,
Feb 5 2018
Looks like this might have caused an outage this morning? Host_scheduler died on all the shards. We're still investigating.
,
Feb 5 2018
The following revision refers to this bug:
https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/e04312f63990f8640702001fa6ae6a3124210d3f
commit e04312f63990f8640702001fa6ae6a3124210d3f
Author: Prathmesh Prabhu <pprabhu@google.com>
Date: Mon Feb 05 19:29:56 2018
,
Feb 6 2018
Next up: deploy the process lifetimes to staging alone, then repro and debug the failure in issue 809098.
,
Feb 6 2018
The following revision refers to this bug:
https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/d92c61150e96242085a5b70494dd00f776691e6c
commit d92c61150e96242085a5b70494dd00f776691e6c
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Tue Feb 06 21:29:23 2018
,
Feb 7 2018
The following revision refers to this bug:
https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/18d34c35157241277e646b92de366820e5bd8d9a
commit 18d34c35157241277e646b92de366820e5bd8d9a
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Wed Feb 07 00:12:50 2018
,
Mar 4 2018
Does --lifetime-hours log anything specific in the service log when the process exits because its lifetime has expired? If not, it should. Otherwise, I'm not sure how to distinguish that exit from the shutdown in https://bugs.chromium.org/p/chromium/issues/detail?id=817904&desc=2#c8
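One way the request above could be met is to log a distinctive line just before a lifetime-triggered exit. The sketch below is only a suggestion: `exit_reason`, `maybe_log_lifetime_exit`, the logger name, and the message text are hypothetical, and nothing here reflects what the services actually log.

```python
import logging

logger = logging.getLogger('lifetime_demo')


def exit_reason(started_at, now, lifetime_hours):
    """Return a log message if the lifetime has expired, else None.

    started_at and now are timestamps in seconds. Hypothetical helper.
    """
    if lifetime_hours is None:
        return None
    elapsed_hours = (now - started_at) / 3600.0
    if elapsed_hours >= lifetime_hours:
        return ('Exiting: reached --lifetime-hours=%s after %.2f hours'
                % (lifetime_hours, elapsed_hours))
    return None


def maybe_log_lifetime_exit(started_at, now, lifetime_hours):
    """Log the reason and return True if the process should exit."""
    reason = exit_reason(started_at, now, lifetime_hours)
    if reason is None:
        return False
    # A fixed prefix makes this exit easy to grep for in the service log
    # and tells it apart from crashes or external shutdowns.
    logger.info(reason)
    return True
```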
,
Sep 13
The scheduled restarts were rolled back because they were implicated in a host-scheduler outage. I'm not actively trying to reland them, as we're not currently facing load issues that these restarts would mitigate. Back to the "available" plate.
Comment 1 by pprabhu@chromium.org
, Jan 22 2018