
Issue 804509

Starred by 1 user

Issue metadata

Status: Available
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocked on:
issue 809098




Restart everything regularly (host_scheduler, shard_client, gs_offloader, scheduler), and tell us when we restart

Project Member Reported by pprabhu@chromium.org, Jan 22 2018

Issue description

Currently, we don't have metrics to tell us when these services start. 

Add a /start metric as well.
 
Status: Assigned (was: Untriaged)
Status: Started (was: Assigned)
autotest CL stack for this: https://chromium-review.googlesource.com/c/chromiumos/third_party/autotest/+/885115/
Once those are pushed to prod, this puppet change provides the new option: https://chrome-internal-review.googlesource.com/#/c/chromeos/chromeos-admin/+/554678
Project Member

Comment 4 by bugdroid1@chromium.org, Jan 25 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/3c09255f82133a040b9c31371723686e654256b8

commit 3c09255f82133a040b9c31371723686e654256b8
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Jan 25 19:12:00 2018

autotest: Stop reporting metadata to elasticsearch.

We don't look at / maintain this anymore. All metrics should go to
monarch. Anything more detailed should go to cloud trace / logs.

BUG=chromium:804509
TEST=None

Change-Id: Ie6b16b1d3178c51b134c6b3be006a748d1128333
Reviewed-on: https://chromium-review.googlesource.com/885108
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Paul Hobbs <phobbs@google.com>

[modify] https://crrev.com/3c09255f82133a040b9c31371723686e654256b8/scheduler/host_scheduler.py

Project Member

Comment 5 by bugdroid1@chromium.org, Jan 25 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/4c94b2172a1a74f00f2d2a5f751338e90bf1d201

commit 4c94b2172a1a74f00f2d2a5f751338e90bf1d201
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Jan 25 19:12:01 2018

autotest: Use ts_mon as context, use indirect metrics process.

BUG=chromium:804509
TEST=None

Change-Id: I19e97cc4df784e5ebe34e7b1afabb2be7839de5a
Reviewed-on: https://chromium-review.googlesource.com/885109
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Paul Hobbs <phobbs@google.com>

[modify] https://crrev.com/4c94b2172a1a74f00f2d2a5f751338e90bf1d201/scheduler/host_scheduler.py

Project Member

Comment 6 by bugdroid1@chromium.org, Jan 25 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/7b6683167553f8103d662dda00e43ea9eaf62076

commit 7b6683167553f8103d662dda00e43ea9eaf62076
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Jan 25 19:12:01 2018

Remove a lie about how upstart treats host_scheduler exits.

host_scheduler is run as a daemon, not a task. This means that the
'respawn' stanza ignores the exit status, and always respawns the
process. This is what we want. The comment was wrong.

BUG=chromium:804509
TEST=None

Change-Id: I48c29f002f297233d3ee0a8e187f5da62741159b
Reviewed-on: https://chromium-review.googlesource.com/885110
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Paul Hobbs <phobbs@google.com>

[modify] https://crrev.com/7b6683167553f8103d662dda00e43ea9eaf62076/scheduler/host_scheduler.py

Project Member

Comment 7 by bugdroid1@chromium.org, Jan 25 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/8b14506b57702159699d0b7626dbe0953708de23

commit 8b14506b57702159699d0b7626dbe0953708de23
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Jan 25 20:57:44 2018

autotest: Add host_scheduler argument to quit after requested time.

This allows us to cleanly restart host_scheduler every so often in the
lab.

BUG=chromium:804509
TEST=Local host_scheduler run with --metrics-file --lifetime-hours

Change-Id: I8ac2b6cca4f350b60c1502050a7932eb3efb2990
Reviewed-on: https://chromium-review.googlesource.com/885111
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Paul Hobbs <phobbs@google.com>

[modify] https://crrev.com/8b14506b57702159699d0b7626dbe0953708de23/scheduler/host_scheduler.py
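For readers unfamiliar with the pattern, here is a minimal sketch of a service loop that exits cleanly after a requested lifetime. It is not the code from the CL above. Only the `--lifetime-hours` flag name comes from the TEST line; `parse_args`, `run`, `tick`, and the injectable `now`/`sleep` parameters are hypothetical names chosen for illustration.

```python
import argparse
import time


def parse_args(argv=None):
    """Parse the command line. Only --lifetime-hours appears in the issue."""
    parser = argparse.ArgumentParser(description="Illustrative long-running service.")
    parser.add_argument(
        "--lifetime-hours",
        type=float,
        default=None,
        help="Exit after this many hours. If omitted, run until killed.",
    )
    return parser.parse_args(argv)


def run(lifetime_hours, tick, now=time.monotonic, sleep=time.sleep, interval=1.0):
    """Call tick() repeatedly until the lifetime deadline passes.

    now and sleep are injectable so the loop can be exercised with a fake
    clock. Returns the number of ticks that were executed.
    """
    deadline = None if lifetime_hours is None else now() + lifetime_hours * 3600
    ticks = 0
    while deadline is None or now() < deadline:
        tick()
        ticks += 1
        sleep(interval)
    return ticks
```

In this sketch the deadline is only checked between ticks, so a tick is never interrupted partway through. Returning normally leaves it to the init system to start a fresh process, which matches the respawn behavior described in comment 6.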

Project Member

Comment 8 by bugdroid1@chromium.org, Jan 25 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/01ef91b19a1c092e6cfd2bd869c5bb24e545f0a1

commit 01ef91b19a1c092e6cfd2bd869c5bb24e545f0a1
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Jan 25 20:57:44 2018

autotest: Add lifetime_hours option to shard_client.

When provided, this option instructs shard_client to run for a
stipulated amount of time and then exit.

BUG=chromium:804509
TEST=Local shard_client run, hacked up to not really do anything.

Change-Id: Id066022d77ffedee05725e24285b363e2a732d6a
Reviewed-on: https://chromium-review.googlesource.com/885112
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Paul Hobbs <phobbs@google.com>

[modify] https://crrev.com/01ef91b19a1c092e6cfd2bd869c5bb24e545f0a1/scheduler/shard/shard_client.py
[modify] https://crrev.com/01ef91b19a1c092e6cfd2bd869c5bb24e545f0a1/scheduler/shard/shard_client_unittest.py

Project Member

Comment 9 by bugdroid1@chromium.org, Jan 25 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/68baeb3d9fad0176f4f27cd908cb80fdff6e1c3a

commit 68baeb3d9fad0176f4f27cd908cb80fdff6e1c3a
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Jan 25 22:22:57 2018

autotest: Add option to fake metrics collection in shard_client.

Brings the options on-par with other services.

BUG=chromium:804509
TEST=Local shard_client run, hacked out to not really do anything.

Change-Id: I6564682b2c779636c13ede2f00d07ba5a677bc14
Reviewed-on: https://chromium-review.googlesource.com/885113
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Paul Hobbs <phobbs@google.com>

[modify] https://crrev.com/68baeb3d9fad0176f4f27cd908cb80fdff6e1c3a/scheduler/shard/shard_client.py

Project Member

Comment 10 by bugdroid1@chromium.org, Jan 25 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/52bb2cc822ee5152fc348c6664be2f1d014dca72

commit 52bb2cc822ee5152fc348c6664be2f1d014dca72
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Jan 25 22:22:58 2018

autotest: Report process start counter from scheduler, host_scheduler.

BUG=chromium:804509
TEST=None

Change-Id: I0478415effbca8e5c9bb8c44b4cba5db1587aa32
Reviewed-on: https://chromium-review.googlesource.com/885114
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Paul Hobbs <phobbs@google.com>

[modify] https://crrev.com/52bb2cc822ee5152fc348c6664be2f1d014dca72/scheduler/monitor_db.py
[modify] https://crrev.com/52bb2cc822ee5152fc348c6664be2f1d014dca72/scheduler/host_scheduler.py
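As a rough illustration of what a process start counter does, the sketch below increments a `<service>/start` counter once at startup. It uses a hypothetical in-memory stand-in rather than the real metrics library, whose API is not reproduced here. The `report_start` and `start_count` helpers and the exact metric prefix are assumptions; only the `/start` suffix comes from the issue description.

```python
import collections

# Hypothetical in-memory stand-in for a monitoring backend.
_COUNTS = collections.Counter()


def report_start(service_name):
    """Increment the '<service_name>/start' counter. Call once per process start."""
    metric_name = service_name + "/start"
    _COUNTS[metric_name] += 1
    return metric_name


def start_count(service_name):
    """Return how many starts have been recorded for service_name."""
    return _COUNTS[service_name + "/start"]
```

A service would call `report_start("host_scheduler")` early in its main function, so that each restart shows up as one increment of the counter.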

Project Member

Comment 11 by bugdroid1@chromium.org, Jan 25 2018

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/third_party/autotest/+/766605e835f56c4e7ca81f69166499ee6440672d

commit 766605e835f56c4e7ca81f69166499ee6440672d
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Thu Jan 25 23:58:35 2018

autotest: Stop reporting stats to elasticsearch.

We don't look at it / maintain it. Use ts_mon for metrics and logs /
cloud trace for more detailed stuff.

BUG=chromium:804509
TEST=Local run

Change-Id: Ib002b3d0a10d561eb5f886ccd5230ebf29f8b051
Reviewed-on: https://chromium-review.googlesource.com/885115
Commit-Ready: Prathmesh Prabhu <pprabhu@chromium.org>
Tested-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Prathmesh Prabhu <pprabhu@chromium.org>
Reviewed-by: Paul Hobbs <phobbs@google.com>

[modify] https://crrev.com/766605e835f56c4e7ca81f69166499ee6440672d/scheduler/monitor_db.py

Next steps:

- push-to-prod
- push puppet CL to start using the new flags to set deadline for scheduler, shard_client
- Add vi/ dashboards for the **/start metrics that I've added.

Verify that all services are getting restarted routinely.
Project Member

Comment 13 by bugdroid1@chromium.org, Jan 31 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/5b8d5235ccf642b703150f6f0b805ad7740675c0

commit 5b8d5235ccf642b703150f6f0b805ad7740675c0
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Wed Jan 31 18:03:00 2018

Looks like this might have caused an outage this morning? Host_scheduler died on all the shards. We're still investigating.
Cc: xixuan@chromium.org
Blockedon: 809098
Project Member

Comment 17 by bugdroid1@chromium.org, Feb 5 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/e04312f63990f8640702001fa6ae6a3124210d3f

commit e04312f63990f8640702001fa6ae6a3124210d3f
Author: Prathmesh Prabhu <pprabhu@google.com>
Date: Mon Feb 05 19:29:56 2018

Next up: deploy the process lifetimes to staging alone, and repro, debug the failure in issue 809098.
Project Member

Comment 19 by bugdroid1@chromium.org, Feb 6 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/d92c61150e96242085a5b70494dd00f776691e6c

commit d92c61150e96242085a5b70494dd00f776691e6c
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Tue Feb 06 21:29:23 2018

Project Member

Comment 20 by bugdroid1@chromium.org, Feb 7 2018

The following revision refers to this bug:
  https://chrome-internal.googlesource.com/chromeos/chromeos-admin/+/18d34c35157241277e646b92de366820e5bd8d9a

commit 18d34c35157241277e646b92de366820e5bd8d9a
Author: Prathmesh Prabhu <pprabhu@chromium.org>
Date: Wed Feb 07 00:12:50 2018

Does the --lifetime-hours option log anything specific in the service log when the process exits because its lifetime has expired? If not, it should. Otherwise I'm not sure how to distinguish it from the shutdown in https://bugs.chromium.org/p/chromium/issues/detail?id=817904&desc=2#c8
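One way to make the two kinds of shutdown distinguishable is sketched below, assuming the service polls for both a stop request and the lifetime deadline. The function name, the reason constants, and the log messages are all hypothetical. Nothing here reflects what host_scheduler or shard_client actually log.

```python
import logging
import time

LIFETIME_EXPIRED = "lifetime_expired"
STOP_REQUESTED = "stop_requested"


def wait_for_exit(deadline, stop_requested, now=time.monotonic,
                  sleep=time.sleep, poll=1.0, log=None):
    """Block until a stop is requested or the deadline passes.

    Logs a distinct message for each cause and returns the reason, so the
    service log shows why the process went away.
    """
    log = log or logging.getLogger(__name__)
    while True:
        if stop_requested():
            log.info("Exiting: stop requested.")
            return STOP_REQUESTED
        if deadline is not None and now() >= deadline:
            log.info("Exiting: --lifetime-hours deadline reached.")
            return LIFETIME_EXPIRED
        sleep(poll)
```

With a message like the second one in the log, a planned lifetime exit can be told apart from an unexpected shutdown by searching the service log for that line.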
Owner: ----
Status: Available (was: Started)
The scheduled restarts were rolled back because they were implicated in a host-scheduler outage.

I'm not actively trying to reland them as we're not currently facing load issues that would be mitigated by these.
Back to "available" plate.
